Tabular Q-Learning from scratch on two classic Gymnasium environments: FrozenLake-v1 (8x8, slippery) and MountainCar-v0. No deep learning library, just numpy, gymnasium, pickle.
Two end-to-end Q-Learning agents trained without any RL framework:
- FrozenLake 8x8 (slippery): 64 discrete states, 4 actions. Optimistic Q-table initialization (Q=2.0) drives exploration without needing a long pure-random phase. Trained for 100,000 episodes.
- MountainCar-v0: 2D continuous state (position, velocity), 3 actions. The state space is discretized into a 60x60 grid (3,600 cells). Trained for 50,000 episodes with epsilon and alpha decay schedules.
Both agents are tested with rendered animations after training, and trained Q-tables are saved as .pkl for re-use.
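Saving and reloading a Q-table with pickle can be sketched as below. This is a minimal illustration, not the notebook's actual code: the helper names `save_q_table` and `load_q_table` and the demo file name are hypothetical, and the table is a placeholder with the FrozenLake 8x8 shape.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def save_q_table(q_table: np.ndarray, path: Path) -> None:
    """Serialize a Q-table to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(q_table, f)


def load_q_table(path: Path) -> np.ndarray:
    """Load a Q-table previously written by save_q_table."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip demo with a placeholder table (64 states x 4 actions).
q_table = np.full((64, 4), 2.0)
demo_path = Path(tempfile.gettempdir()) / "q_frozenlake_demo.pkl"  # hypothetical file name
save_q_table(q_table, demo_path)
restored = load_q_table(demo_path)
```

As with any pickle file, only load Q-tables from a source you trust, since unpickling can execute arbitrary code.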
Tabular Q-learning is the starting point for value-based deep RL (DQN is Q-learning with a neural network in place of the table), and many of its ideas carry over to other modern methods such as PPO, SAC, and MuZero. Implementing it by hand exposes the design choices that survive into deep RL: epsilon schedules, optimistic initialization, learning-rate decay, discounting, terminal-state handling.
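A minimal sketch of how these design choices fit together in a tabular agent is shown below. The function names (`make_q_table`, `epsilon_greedy`, `q_update`) are illustrative and not taken from the notebook. The default values (Q=2.0, alpha=0.03, gamma=0.99) mirror the FrozenLake settings given in this README, and the demo marks only the goal state as terminal for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_q_table(n_states, n_actions, terminal_states, optimistic_value=2.0):
    """Optimistic initialization: every non-terminal entry starts above the true maximum."""
    q = np.full((n_states, n_actions), optimistic_value, dtype=float)
    q[np.asarray(list(terminal_states), dtype=int)] = 0.0  # terminal rows carry no future value
    return q


def epsilon_greedy(q, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(q.shape[1]))
    return int(np.argmax(q[state]))


def q_update(q, state, action, reward, next_state, terminated, alpha=0.03, gamma=0.99):
    """One Q-learning step; on a terminal transition the target is just the reward."""
    target = reward if terminated else reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (target - q[state, action])


# Demo on a 64-state, 4-action table (goal only; hole indices would be added the same way).
q = make_q_table(n_states=64, n_actions=4, terminal_states=[63])
q_update(q, state=62, action=2, reward=1.0, next_state=63, terminated=True)
q_update(q, state=0, action=1, reward=0.0, next_state=1, terminated=False)
```

In the first update the entry moves from 2.0 toward the reward of 1.0; in the second it moves toward the discounted optimistic value of the next state, which is how optimism slowly drains out of visited cells.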
- Optimistic init: Q(s, a) = 2.0 for non-terminal states, above the true maximum return of 1.0. Every unvisited state-action pair therefore looks overvalued, so the agent is naturally drawn to explore it. Holes and the goal are initialized to 0.
- Update rule: standard Q-learning, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s,a)).
- Hyperparameters: alpha=0.03, gamma=0.99, epsilon linearly decaying 0.5 -> 0.01 over 60k episodes.
- Discretization: np.digitize maps continuous (position, velocity) into a 60x60 grid (3,600 states). Q-table shape: (60, 60, 3).
- Adaptive schedules: epsilon decays 1.0 -> 0.01 over 5,000 episodes, alpha decays 0.5 -> 0.01 over 30,000 episodes.
- Terminal handling: when terminated=True, the target is just the reward (there is no future Q to bootstrap from).
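The discretization and decay schedules described above can be sketched as follows. The observation bounds are those documented for MountainCar-v0 in Gymnasium. The helper names, the zero-initialized table, and the linear shape of the MountainCar schedules are assumptions: the README states a linear decay only for FrozenLake's epsilon.

```python
import numpy as np

N_BINS = 60
POS_LOW, POS_HIGH = -1.2, 0.6    # MountainCar-v0 position bounds
VEL_LOW, VEL_HIGH = -0.07, 0.07  # MountainCar-v0 velocity bounds

# Keep only the 59 interior edges so np.digitize returns indices in 0..59.
pos_edges = np.linspace(POS_LOW, POS_HIGH, N_BINS + 1)[1:-1]
vel_edges = np.linspace(VEL_LOW, VEL_HIGH, N_BINS + 1)[1:-1]


def discretize(observation):
    """Map a continuous (position, velocity) pair to a (row, col) grid cell."""
    position, velocity = observation
    return int(np.digitize(position, pos_edges)), int(np.digitize(velocity, vel_edges))


def linear_decay(step, start, end, decay_steps):
    """Interpolate linearly from start to end, then hold at end (assumed schedule shape)."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


q_table = np.zeros((N_BINS, N_BINS, 3))  # assumed initial value; shape (60, 60, 3)
cell = discretize((-0.5, 0.01))
epsilon_mid = linear_decay(2_500, start=1.0, end=0.01, decay_steps=5_000)
alpha_late = linear_decay(40_000, start=0.5, end=0.01, decay_steps=30_000)
```

The cell tuple indexes the table directly, for example `q_table[cell]` yields the three action values for that grid cell.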
git clone https://github.com/Mathos34/q-learning-gym
cd q-learning-gym
python -m venv .venv && source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
jupyter notebook lab_rl.ipynb
Pre-trained Q-tables are bundled in models/ so the test/animation cells run immediately. Re-training takes ~2 minutes for FrozenLake and ~3 minutes for MountainCar on a laptop CPU.
| Environment | Final performance | Notes |
|---|---|---|
| FrozenLake 8x8 (slippery) | ~58% success rate | Near-optimal on a highly stochastic environment (the agent moves in the intended direction only 1/3 of the time) |
| MountainCar-v0 | ~111 steps to goal (avg of last 2k episodes) | Well under the 200-step truncation limit; a random policy essentially never reaches the goal |
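A success rate like the FrozenLake figure above could be measured with a greedy rollout loop along these lines. This is a sketch under assumptions, not the notebook's evaluation code: `evaluate_greedy` is a hypothetical helper that expects integer states and the Gymnasium `reset`/`step` return signature, and counts an episode as a success when it terminates with a positive reward. `CorridorEnv` is a hypothetical toy environment used only to keep the example runnable without Gymnasium.

```python
import numpy as np


class CorridorEnv:
    """Hypothetical 5-cell corridor mimicking the Gymnasium reset/step signature."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self, seed=None):
        self.state = 0
        return self.state, {}

    def step(self, action):
        move = 1 if action == 1 else -1  # action 1 moves right, anything else moves left
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        terminated = self.state == self.n_states - 1
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, False, {}


def evaluate_greedy(env, q_table, episodes=100, max_steps=200):
    """Follow argmax actions only; return (success_rate, mean_steps_per_episode)."""
    successes, steps_taken = 0, []
    for _ in range(episodes):
        state, _info = env.reset()
        for step in range(1, max_steps + 1):
            action = int(np.argmax(q_table[state]))
            state, reward, terminated, truncated, _info = env.step(action)
            if terminated or truncated:
                break
        successes += int(terminated and reward > 0)  # success = reached the rewarding terminal state
        steps_taken.append(step)
    return successes / episodes, float(np.mean(steps_taken))


# A table that always prefers "right" reaches the corridor's goal every time.
q_right = np.zeros((5, 2))
q_right[:, 1] = 1.0
success_rate, mean_steps = evaluate_greedy(CorridorEnv(), q_right, episodes=10)
```

With Gymnasium installed, an environment created by `gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True)` should be usable in place of the toy corridor. MountainCar would need a different success test, since its reward is -1 on every step.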
Lab from the Advanced Machine Learning course at ECE Paris (4th-year engineering, Major Data & AI).
Built by Mathis Lacombe, AI Maker at the Intelligence Lab, ECE Paris. LinkedIn · Hugging Face