A from-scratch study of tabular Q-Learning and DQN on a custom Gymnasium environment, exploring how ε-greedy exploration, reward shaping, and wind dynamics affect learned policies.
| Agent | Final Avg Reward (20-episode mean) | Steps to Goal | Key Behavior |
|---|---|---|---|
| Q-Learning | ~ -15 | ~15 | Learns a wind-compensating policy |
| DQN | ~ -17 | ~17 | Comparable policy via neural function approximation |
| Random | ~ -200 | times out | No-learning baseline |
```
rl-environment-study/
├── src/
│   ├── environments.py       # WindyGridWorld (custom Gymnasium env)
│   ├── q_learning.py         # Tabular Q-Learning with ε-greedy
│   ├── dqn.py                # DQN with replay buffer + target network
│   └── visualization.py      # Reward curves, Q-table heatmaps, policy arrows
├── notebooks/
│   └── rl_study.ipynb        # Full study notebook with 9 sections
├── tests/
│   ├── test_environments.py  # 10 environment tests
│   └── test_agents.py        # 14 agent tests (Q-Learning + DQN)
├── evidence/                 # Exported PNG evidence from notebook runs
└── pyproject.toml
```
- Windy GridWorld: an N×M grid with column-dependent upward wind. The agent must learn to compensate for the wind to reach the goal efficiently
- Q-Learning: Off-policy TD control — Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') − Q(s,a)] (see the minimal update sketch after this list)
- ε-greedy exploration: Balances exploration and exploitation with a decaying ε
- DQN: Neural function approximation stabilized by experience replay and a target network (see the second sketch below)
- Bellman optimality: Tabular Q-Learning converges to Q* when every state-action pair is visited infinitely often with appropriately decaying step sizes
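To make the TD update above concrete, here is a minimal, self-contained sketch of ε-greedy tabular Q-Learning. All names and sizes (`n_states`, `n_actions`, the hyperparameters) are illustrative assumptions, not the actual API of `src/q_learning.py`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; the repo's WindyGridWorld
# dimensions and action set may differ.
n_states, n_actions = 70, 4          # e.g. a 7x10 grid, 4 moves
alpha, gamma, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((n_states, n_actions))

def select_action(s: int) -> int:
    """ε-greedy: random action with probability ε, else greedy w.r.t. Q."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def update(s: int, a: int, r: float, s_next: int, done: bool) -> None:
    """Off-policy TD(0): move Q(s,a) toward r + γ max_a' Q(s',a')."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])

# One illustrative transition: from state 0, take an action, observe reward -1.
a = select_action(0)
update(s=0, a=a, r=-1.0, s_next=1, done=False)
```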
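Likewise, a sketch of the two DQN stabilizers named above: a frozen target network supplies the bootstrap target, and the batch is assumed to come from a replay buffer. The network shape, batch layout, and loss choice here are assumptions, not the implementation in `src/dqn.py`:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the actual network in src/dqn.py may differ.
obs_dim, n_actions, gamma = 2, 4, 0.99

def make_net() -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

online, target = make_net(), make_net()
target.load_state_dict(online.state_dict())  # periodic hard sync

def td_loss(batch):
    """Huber loss between Q(s,a) and the frozen target r + γ max_a' Q_target(s',a')."""
    s, a, r, s_next, done = batch  # tensors assumed sampled from the replay buffer
    q_sa = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * q_next
    return nn.functional.smooth_l1_loss(q_sa, y)

# Example with a dummy batch of 32 transitions:
B = 32
batch = (torch.randn(B, obs_dim), torch.randint(n_actions, (B,)),
         torch.randn(B), torch.randn(B, obs_dim), torch.zeros(B))
td_loss(batch).backward()
```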
A fully executed notebook with all outputs is available as a PDF: `notebooks/rl_study.pdf`
```bash
# Install
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install gymnasium
pip install -e ".[dev]"

# Run tests
python -m pytest tests/ -v

# Open the study notebook
jupyter notebook notebooks/rl_study.ipynb
```

- Python ≥ 3.10
- gymnasium ≥ 0.29
- PyTorch ≥ 2.0
- numpy, matplotlib, pandas
Chris Schmidt — MS Applied Mathematics | AI Engineering MSE (JHU)
MIT