Skip to content

Mathos34/q-learning-gym

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

q-learning-gym

Tabular Q-Learning from scratch on two classic Gymnasium environments: FrozenLake-v1 (8x8, slippery) and MountainCar-v0. No deep learning library, just numpy, gymnasium, pickle.

Python NumPy Gymnasium License

What it does

Two end-to-end Q-Learning agents trained without any RL framework:

  1. FrozenLake 8x8 (slippery) : 64 discrete states, 4 actions. Optimistic Q-table initialization (Q=2.0) drives exploration without needing a long pure-random phase. Trained for 100,000 episodes.
  2. MountainCar-v0 : 2D continuous state (position, velocity), 3 actions. State space is discretized into a 60x60 grid (3,600 cells). Trained for 50,000 episodes with epsilon and alpha decay schedules.

Both agents are tested with rendered animations after training, and trained Q-tables are saved as .pkl for re-use.

Why it matters

Tabular Q-learning is the foundation that every modern RL method (DQN, PPO, SAC, MuZero) generalizes from. Implementing it by hand exposes the design choices that survive into deep RL: epsilon schedules, optimistic initialization, learning-rate decay, discounting, terminal-state handling.

How it works

FrozenLake 8x8

  • Optimistic init: Q(s, a) = 2.0 for non-terminal states (true max is 1.0). Every unvisited cell looks overvalued, so the agent is naturally drawn to explore. Holes and goal initialized to 0.
  • Update rule: standard Q-learning, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a Q(s', a) - Q(s,a)).
  • Hyperparameters: alpha=0.03, gamma=0.99, epsilon linearly decaying 0.5 -> 0.01 over 60k episodes.

MountainCar-v0

  • Discretization: np.digitize maps continuous (position, velocity) into a 60x60 grid (3,600 states). Q-table shape: (60, 60, 3).
  • Adaptive schedules: epsilon decays 1.0 -> 0.01 over 5,000 episodes, alpha decays 0.5 -> 0.01 over 30,000 episodes.
  • Terminal handling: at terminated=True, target is just the reward (no future Q to bootstrap from).

Quickstart

git clone https://github.com/Mathos34/q-learning-gym
cd q-learning-gym
python -m venv .venv && source .venv/bin/activate   # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
jupyter notebook lab_rl.ipynb

Pre-trained Q-tables are bundled in models/ so the test/animation cells run immediately. Re-training takes ~2 minutes for FrozenLake and ~3 minutes for MountainCar on a laptop CPU.

Results

Environment Final greedy reward Notes
FrozenLake 8x8 (slippery) ~58% success rate Near-optimal on a highly stochastic environment (1/3 intended direction)
MountainCar-v0 ~111 steps to goal (avg of last 2k episodes) Well under the 200-step truncation; vs. random baseline that essentially never reaches the goal

About

Lab from the Advanced Machine Learning course at ECE Paris (4th-year engineering, Major Data & AI).

Built by Mathis Lacombe, AI Maker at the Intelligence Lab, ECE Paris. LinkedIn · Hugging Face

About

Tabular Q-Learning from scratch on FrozenLake 8x8 and MountainCar (Gymnasium).

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors