q-learning-gym

Tabular Q-Learning from scratch on two classic Gymnasium environments: FrozenLake-v1 (8x8, slippery) and MountainCar-v0. No deep learning library or RL framework, just numpy, gymnasium, and pickle.

What it does

Two end-to-end Q-Learning agents trained without any RL framework:

FrozenLake 8x8 (slippery): 64 discrete states, 4 actions. Optimistic Q-table initialization (Q=2.0) drives exploration without needing a long pure-random phase. Trained for 100,000 episodes.
MountainCar-v0: 2D continuous state (position, velocity), 3 actions. The state space is discretized into a 60x60 grid (3,600 cells). Trained for 50,000 episodes with epsilon and alpha decay schedules.

Both agents are tested with rendered animations after training, and trained Q-tables are saved as .pkl for re-use.
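Saving and re-loading a Q-table with pickle can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the helper names save_q_table and load_q_table and any file path passed to them are hypothetical.

```python
import pickle
from pathlib import Path

import numpy as np


def save_q_table(q_table: np.ndarray, path: str) -> None:
    """Serialize a Q-table to disk with pickle, creating the folder if needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(q_table, f)


def load_q_table(path: str) -> np.ndarray:
    """Load a Q-table previously written by save_q_table."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Since pickle can execute arbitrary code on load, only unpickle files from a source you trust.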

Why it matters

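Tabular Q-learning introduces the ideas that modern RL methods (DQN, PPO, SAC, MuZero) build on: value estimates, bootstrapped targets, discounting, and exploration schedules. DQN in particular is Q-learning with the table replaced by a neural network. Implementing it by hand exposes the design choices that survive into deep RL: epsilon schedules, optimistic initialization, learning-rate decay, discounting, and terminal-state handling.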

How it works

FrozenLake 8x8

Optimistic init: Q(s, a) = 2.0 for non-terminal states (the true maximum return is 1.0). Every unvisited state-action pair looks overvalued, so the agent is naturally drawn to explore. Holes and the goal are initialized to 0, so bootstrapping from a terminal state adds no future value.
Update rule: standard Q-learning, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s,a)).
Hyperparameters: alpha=0.03, gamma=0.99, epsilon linearly decaying 0.5 -> 0.01 over 60k episodes.
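The three FrozenLake ingredients above can be sketched with plain numpy. This is an illustrative sketch under stated assumptions, not the notebook's code: the function names are hypothetical, and the map is passed in as a list of strings using FrozenLake's tile letters (S, F, H, G), with states numbered row by row.

```python
import numpy as np


def optimistic_q_table(desc, n_actions=4, init_value=2.0):
    """Build a Q-table that is optimistic everywhere except on terminal tiles.

    `desc` is the map as a list of strings: S (start), F (frozen),
    H (hole), G (goal). State index = row * n_cols + col.
    """
    tiles = "".join(desc)
    q = np.full((len(tiles), n_actions), init_value, dtype=float)
    for state, tile in enumerate(tiles):
        if tile in "HG":  # terminal tiles: no future reward to expect
            q[state, :] = 0.0
    return q


def linear_epsilon(episode, start=0.5, end=0.01, decay_episodes=60_000):
    """Linearly interpolate epsilon from `start` to `end`, then hold at `end`."""
    frac = min(episode / decay_episodes, 1.0)
    return start + frac * (end - start)


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(q.shape[1]))
    return int(np.argmax(q[state]))


def q_update(q, state, action, reward, next_state, alpha=0.03, gamma=0.99):
    """Apply one standard Q-learning update in place."""
    td_target = reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (td_target - q[state, action])
```

Because the terminal rows are zero, the bootstrapped term vanishes whenever next_state is a hole or the goal, so the plain update needs no special case for episode ends.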

MountainCar-v0

Discretization: np.digitize maps continuous (position, velocity) into a 60x60 grid (3,600 states). Q-table shape: (60, 60, 3).
Decay schedules: epsilon decays 1.0 -> 0.01 over 5,000 episodes; alpha decays 0.5 -> 0.01 over 30,000 episodes.
Terminal handling: when terminated=True, the target is just the reward (there is no future Q to bootstrap from).
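A possible shape for the MountainCar pieces is sketched below. Several details are assumptions rather than facts from the notebook: the function names are hypothetical, the decay is assumed to be linear (the shape of the schedule is not stated), and gamma is left as a required argument because its value for this environment is not given. The observation bounds are those of MountainCar-v0.

```python
import numpy as np

# MountainCar-v0 bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07].
POS_LOW, POS_HIGH = -1.2, 0.6
VEL_LOW, VEL_HIGH = -0.07, 0.07
N_BINS = 60

# 59 interior edges split each axis into 60 equal-width bins (indices 0..59).
POS_EDGES = np.linspace(POS_LOW, POS_HIGH, N_BINS + 1)[1:-1]
VEL_EDGES = np.linspace(VEL_LOW, VEL_HIGH, N_BINS + 1)[1:-1]


def discretize(obs):
    """Map a continuous (position, velocity) pair to a pair of bin indices."""
    position, velocity = obs
    return (int(np.digitize(position, POS_EDGES)),
            int(np.digitize(velocity, VEL_EDGES)))


def linear_decay(episode, start, end, decay_episodes):
    """Assumed linear schedule: go from `start` to `end`, then hold at `end`."""
    frac = min(episode / decay_episodes, 1.0)
    return start + frac * (end - start)


def td_target(q, reward, next_state, terminated, gamma):
    """Return the Q-learning target, without bootstrapping at a terminal state."""
    if terminated:
        return reward
    return reward + gamma * np.max(q[next_state])


def q_update(q, state, action, reward, next_state, terminated, alpha, gamma):
    """Apply one in-place update on a (60, 60, 3) Q-table."""
    target = td_target(q, reward, next_state, terminated, gamma)
    index = state + (action,)  # (pos_bin, vel_bin, action)
    q[index] += alpha * (target - q[index])
```

With these helpers, epsilon for a given episode would be linear_decay(episode, 1.0, 0.01, 5_000) and alpha would be linear_decay(episode, 0.5, 0.01, 30_000). Note that the target only drops the bootstrap term on terminated, not on truncated: hitting the 200-step limit does not mean the state has no future value.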

Quickstart

git clone https://github.com/Mathos34/q-learning-gym
cd q-learning-gym
python -m venv .venv && source .venv/bin/activate   # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
jupyter notebook lab_rl.ipynb

Pre-trained Q-tables are bundled in models/ so the test/animation cells run immediately. Re-training takes ~2 minutes for FrozenLake and ~3 minutes for MountainCar on a laptop CPU.

Results

Environment	Final performance	Notes
FrozenLake 8x8 (slippery)	~58% success rate	Near-optimal for a highly stochastic environment (the agent moves in the intended direction only 1/3 of the time)
MountainCar-v0	~111 steps to goal (average over the last 2k episodes)	Well under the 200-step truncation limit; a random policy essentially never reaches the goal
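A success rate like the FrozenLake figure above could be measured with a greedy rollout loop along these lines. This is a sketch, not the notebook's evaluation code: evaluate_greedy is a hypothetical helper, and it only assumes an environment that follows the Gymnasium API (reset returns (observation, info), step returns a 5-tuple) with integer states and a reward of 1.0 on reaching the goal.

```python
import numpy as np


def evaluate_greedy(env, q_table, n_episodes=1000, seed=0):
    """Run the greedy policy and return the fraction of episodes ending in success."""
    successes = 0
    for episode in range(n_episodes):
        state, _ = env.reset(seed=seed + episode)
        done = False
        reward = 0.0
        while not done:
            action = int(np.argmax(q_table[state]))  # no exploration at test time
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        # FrozenLake pays 1.0 only on the step that reaches the goal.
        successes += int(reward == 1.0)
    return successes / n_episodes
```

For FrozenLake the environment would typically be created with gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True). Because the environment is stochastic, the estimate varies from run to run, so a large n_episodes gives a more stable number.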

About

Lab from the Advanced Machine Learning course at ECE Paris (4th-year engineering, Major Data & AI).

Built by Mathis Lacombe, AI Maker at the Intelligence Lab, ECE Paris. LinkedIn · Hugging Face
