Teaching a neural network to master physical balance through trial, error, and zero labeled data.
Imagine trying to balance a broomstick on your palm. You'd wobble, overcorrect, drop it — and then try again. Over time, through pure trial and error, you'd get better. AgentForge works the same way, except the "palm" is a moving cart, the "broomstick" is a pole, and the "you" is a neural network that has never seen this problem before.
Traditional approaches solve this with hand-written physics equations — formulas that a human engineer explicitly programs. AgentForge takes a fundamentally different approach: the AI starts with absolutely zero knowledge and teaches itself to balance the pole purely by trying thousands of times and learning from its own mistakes. This technique is called Deep Q-Learning (DQN).
Under the hood, the agent reads 4 numbers every frame (where the cart is, how fast it's moving, how tilted the pole is, and how fast it's tilting), picks one of two actions (push left or push right), and gradually discovers which sequences of actions keep the pole upright the longest.
Success Criteria: The agent must balance the pole for an average of ≥ 195 time steps across 100 consecutive games — the official OpenAI benchmark for "solved."
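For a concrete sense of the loop the agent lives in, here is a minimal sketch of one random episode using the Gymnasium API. The project wraps this in src/environment.py; the snippet below is illustrative rather than the project's actual wrapper.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=42)   # state = [cart position, cart velocity, pole angle, pole angular velocity]
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()                       # 0 = push left, 1 = push right
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward                                    # +1 for every step the pole stays up
    done = terminated or truncated

print(f"Survived {int(total_reward)} steps")                  # a random policy manages roughly 20
env.close()
```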
AgentForge ships with a full Streamlit-powered web dashboard for exploring every aspect of the project — no terminal required.
🔗 Live at agentforge-abhi.streamlit.app
| Tab | What You Get |
|---|---|
| 📊 Training | Interactive Plotly convergence chart (hover, zoom, pan), epsilon decay, loss curve, expandable raw data table |
| 🚀 Live Train | One-click training with real-time progress bar, live-updating chart, streaming console output, and metric cards |
| 🏆 Baselines | Animated bar chart comparing Random (20) vs Heuristic (35) vs DQN (500⭐) |
| 🔬 Ablations | Dropdown to switch between 4 ablation studies with side-by-side plots |
| 🎯 Double DQN | Code comparison + convergence curves for Standard vs Double DQN |
| 🎬 Videos | Three playable gameplay videos showing the agent's learning progression |
| 📐 Architecture | Graphviz flow diagram of the DQN pipeline + hyperparameter table |
# Run locally
PYTHONPATH=. streamlit run src/dashboard.py

┌─────────────────────────────────┐
│ ENVIRONMENT (CartPole-v1) │
│ state = [x, ẋ, θ, θ̇] │
└──────────┬──────────────────────┘
│ state
▼
┌──────────────────────────────────────────────────────────────┐
│ DQN AGENT │
│ │
│ ┌──────────────┐ ε-greedy ┌──────────────────────┐ │
│ │ Policy Net │ ◄──────────────►│ Action Selection │ │
│ │ (4→128→128→2)│ explore/ │ argmax Q(s,a) │ │
│ └──────┬───────┘ exploit └──────────────────────┘ │
│ │ │
│ │ MSE Loss │
│ │ │
│ ┌──────▼───────┐ ┌──────────────────────┐ │
│ │ Target Net │ ◄── hard copy ──│ Every 500 steps │ │
│ │ (frozen) │ (sync) │ (target_update_freq) │ │
│ └──────────────┘ └──────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Experience Replay Buffer (capacity: 10,000) │ │
│ │ → stores (s, a, r, s', done) transitions │ │
│ │ → samples random mini-batches of 64 for training │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
│ action
▼
┌──────────────────────────────────┐
│ reward, next_state, done │
└──────────────────────────────────┘
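The replay buffer in the diagram is essentially a fixed-size FIFO of transitions that is sampled uniformly at random. A minimal sketch of that idea (the project's version lives in src/replay_buffer.py; names here are illustrative):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are evicted automatically

    def push(self, *args):
        self.buffer.append(Transition(*args))   # store (s, a, r, s', done)

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)   # random mini-batch breaks temporal correlation

    def __len__(self):
        return len(self.buffer)
```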
📖 Full deep-dive with algorithm pseudocode, math, and references →
docs/architecture.md
| Component | Implementation | Purpose |
|---|---|---|
| Q-Network | 4 → 128 → 128 → 2 (ReLU) | Approximates Q(s,a) for action selection |
| Target Network | Frozen copy, synced every 500 steps | Provides stable TD targets during training |
| Experience Replay | Circular buffer (10K capacity, batch 64) | Breaks temporal correlation in training data |
| ε-Greedy Policy | ε: 1.0 → 0.01 (decay 0.995/episode) | Balances exploration vs. exploitation |
| Optimizer | Adam (lr=0.001), MSE loss | Gradient descent with adaptive learning rate |
| Gradient Clipping | max_norm=1.0 | Prevents exploding gradients during training |
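Putting the table together, here is a compact PyTorch sketch of what the policy network, ε-greedy action selection, and target-network sync might look like. The project's versions live in src/model.py and src/agent.py; this is a simplified illustration, not the actual implementation.

```python
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """4 -> 128 -> 128 -> 2 MLP approximating Q(s, a)."""
    def __init__(self, state_dim=4, hidden=128, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def select_action(policy_net, state, epsilon, n_actions=2):
    """ε-greedy: explore with probability ε, otherwise act greedily on Q-values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = policy_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q_values.argmax(dim=1).item())

policy_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(policy_net.state_dict())   # hard copy, repeated every target_update_freq steps
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
# during each optimisation step, gradients are clipped before the update:
# torch.nn.utils.clip_grad_norm_(policy_net.parameters(), max_norm=1.0)
```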
The DQN agent successfully solved CartPole-v1 by achieving a rolling average reward of ≥ 195 over 100 episodes.
The trained DQN agent massively outperforms both baseline strategies:
| Agent | Avg Reward (100 ep) | Strategy |
|---|---|---|
| Random | ~20 | Uniform random actions |
| Heuristic | ~35 | If angle > 0 → push right, else push left |
| DQN (Ours) | 500 ⭐ | Learned optimal policy via Deep Q-Learning |
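The heuristic baseline in the table is just a one-line rule on the pole angle. A sketch of that rule (the project's version lives in baselines/heuristic_agent.py):

```python
def heuristic_action(state):
    """Push the cart toward the side the pole is leaning: angle > 0 -> push right."""
    _, _, pole_angle, _ = state           # state = [x, x_dot, theta, theta_dot]
    return 1 if pole_angle > 0 else 0     # 1 = push right, 0 = push left
```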
We conducted 4 systematic ablation studies to understand how each hyperparameter affects convergence behavior. In each study, one parameter is varied while all others are held fixed at their tuned defaults.
How much past experience does the agent need to learn effectively?
Tested: 1K · 5K · 10K · 50K transitions
How fast should the agent transition from exploration to exploitation?
Tested: 0.990 · 0.995 · 0.999 · 0.9995
Does a deeper Q-network learn a better policy?
Tested: 1 · 2 · 3 hidden layers (128 neurons each)
How often should the target network synchronize with the policy network?
Tested: 250 · 500 · 1,000 · 2,000 steps
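Each study follows the same pattern: sweep one parameter, hold the rest at their tuned defaults, and record the convergence curve. A schematic sketch of that loop (function names such as train_agent are illustrative, not the project's actual API):

```python
import copy

defaults = {
    "replay_buffer_size": 10_000,
    "epsilon_decay": 0.995,
    "num_hidden_layers": 2,
    "target_update_freq": 500,
}

def run_ablation(param, values, train_agent):
    """Vary one hyperparameter while keeping the others fixed at their defaults."""
    results = {}
    for value in values:
        config = copy.deepcopy(defaults)
        config[param] = value
        results[value] = train_agent(config)   # e.g. returns per-episode rewards
    return results

# Example: the replay buffer study listed above
# curves = run_ablation("replay_buffer_size", [1_000, 5_000, 10_000, 50_000], train_agent)
```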
Standard DQN uses the same network to both select and evaluate the best next action, which causes systematic overestimation of Q-values. Double DQN fixes this with a simple but powerful change — decouple selection from evaluation:
| | Action Selection | Action Evaluation |
|---|---|---|
| DQN | Target Network | Target Network |
| Double DQN | Policy Network | Target Network |
# DQN: y = r + γ · max_a' Q_target(s', a')
# Double DQN: y = r + γ · Q_target(s', argmax_a' Q_policy(s', a'))

| Agent | Convergence Episode |
|---|---|
| Standard DQN | ~563 |
| Double DQN | ~595 |
Both agents solve the environment. On CartPole-v1 the difference is marginal since Q-value overestimation is less harmful in simple environments — but Double DQN becomes critical in complex environments with large action spaces.
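In code, the change is one line in how the bootstrap target is built. A hedged PyTorch sketch of both variants (tensor and function names are illustrative; the project's implementation is in src/double_dqn_agent.py):

```python
import torch

def td_targets(rewards, next_states, dones, policy_net, target_net, gamma=0.99, double=True):
    """Compute y = r + γ · Q(s', a'), with or without decoupled action selection."""
    with torch.no_grad():
        if double:
            # Double DQN: the policy net picks the action, the target net evaluates it
            best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
            next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        else:
            # Standard DQN: the target net both picks and evaluates the action
            next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * next_q * (1.0 - dones)
```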
Reference: van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning", AAAI 2016.
Three gameplay videos demonstrate the agent's learning journey from random flailing to perfect control:
| Stage | Video | Steps Survived | Description |
|---|---|---|---|
| 🔴 Untrained | 01_untrained.mp4 | ~11 | Random actions, pole falls immediately |
| 🟡 Mid-Training | 02_mid_training.mp4 | ~500 | Loaded from episode 500 checkpoint |
| 🟢 Fully Trained | 03_fully_trained.mp4 | 500 (max) | Perfect balance for the entire episode |
Videos are saved in results/videos/ and can be regenerated with PYTHONPATH=. python src/record.py.
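If you want to roll your own recording instead of using src/record.py, Gymnasium's RecordVideo wrapper is the usual route. A minimal sketch, with checkpoint loading elided and a random policy standing in for the trained agent:

```python
import gymnasium as gym
from gymnasium.wrappers import RecordVideo

env = gym.make("CartPole-v1", render_mode="rgb_array")
env = RecordVideo(env, video_folder="results/videos", name_prefix="demo",
                  episode_trigger=lambda ep: True)     # record every episode

state, _ = env.reset(seed=42)
done = False
while not done:
    action = env.action_space.sample()                 # swap in the trained agent's greedy action here
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()                                            # flushes the .mp4 to disk
```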
# Clone
git clone https://github.com/Abhics8/AgentForge.git
cd AgentForge
# Install dependencies
pip install -r requirements.txt
# Train the DQN agent (1000 episodes)
PYTHONPATH=. python src/train.py
# Launch the interactive dashboard
PYTHONPATH=. streamlit run src/dashboard.py
# Evaluate against baselines
PYTHONPATH=. python src/evaluate.py
# Run all 4 ablation studies
PYTHONPATH=. python src/ablation.py
# DQN vs Double DQN comparison
PYTHONPATH=. python src/compare_dqn.py
# Record gameplay videos
PYTHONPATH=. python src/record.py
# Live demo with visual pygame window
PYTHONPATH=. python src/demo.py
# Run test suite (30 tests)
PYTHONPATH=. python -m pytest tests/ -v

30/30 tests passing across 4 test classes:
tests/test_components.py::TestReplayBuffer (11 tests) ✅
tests/test_components.py::TestDQN (8 tests) ✅
tests/test_components.py::TestDQNAgent (10 tests) ✅
tests/test_components.py::TestUtils (1 test) ✅
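For flavour, a representative (hypothetical) test in the style of TestReplayBuffer; class and method names follow the buffer sketch above and may differ from the project's actual API in tests/test_components.py:

```python
import numpy as np

from src.replay_buffer import ReplayBuffer   # assumed class name

def test_buffer_evicts_oldest_when_full():
    buffer = ReplayBuffer(capacity=5)
    for _ in range(10):
        state = np.zeros(4, dtype=np.float32)
        buffer.push(state, 0, 1.0, state, False)
    assert len(buffer) == 5                   # capacity respected, oldest entries dropped
```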
CI runs automatically on every push via GitHub Actions across Python 3.10, 3.11, and 3.12.
AgentForge/
├── .github/workflows/
│ └── ci.yml # GitHub Actions CI (multi-Python + smoke test)
├── configs/
│ └── default.yaml # All hyperparameters (single source of truth)
├── docs/
│ └── architecture.md # Deep-dive: algorithms, math, references
├── src/
│ ├── model.py # DQN architecture (configurable depth)
│ ├── replay_buffer.py # Experience replay (10K circular buffer)
│ ├── agent.py # DQN agent (ε-greedy, target net, optimize)
│ ├── double_dqn_agent.py # Double DQN agent (decoupled evaluation)
│ ├── environment.py # Gymnasium CartPole-v1 wrapper
│ ├── train.py # Training loop with convergence detection
│ ├── evaluate.py # Baseline comparison evaluation
│ ├── compare_dqn.py # DQN vs Double DQN head-to-head
│ ├── ablation.py # 4 ablation studies framework
│ ├── record.py # Gameplay video recording
│ ├── tune.py # Hyperparameter tuning script
│ ├── demo.py # Live pygame demo for presentations
│ ├── dashboard.py # Streamlit interactive dashboard (7 tabs)
│ └── utils.py # Plotting, config loading
├── baselines/
│ ├── random_agent.py # Uniform random baseline (~20 reward)
│ └── heuristic_agent.py # Rule-based baseline (~35 reward)
├── results/
│ ├── plots/ # 14 generated visualizations (dark-themed)
│ ├── checkpoints/ # Saved model weights (.pt)
│ ├── logs/ # Training CSV logs
│ └── videos/ # 3 agent gameplay recordings
├── tests/
│ └── test_components.py # 30 unit tests
└── requirements.txt
# configs/default.yaml
environment:
  name: CartPole-v1
  solved_reward: 195.0        # OpenAI benchmark threshold
  solved_window: 100          # Rolling window for convergence check
training:
  episodes: 1000
  seed: 42
network:
  hidden_size: 128            # Neurons per hidden layer
  num_hidden_layers: 2        # Network depth
agent:
  replay_buffer_size: 10000   # Experience replay capacity
  batch_size: 64              # Mini-batch size for SGD
  gamma: 0.99                 # Discount factor
  epsilon_start: 1.0          # Initial exploration rate
  epsilon_end: 0.01           # Minimum exploration rate
  epsilon_decay: 0.995        # Multiplicative decay per episode
  learning_rate: 0.001        # Adam optimizer LR
  target_update_freq: 500     # Steps between target net syncs

- Mnih et al., Playing Atari with Deep Reinforcement Learning, DeepMind, 2013
- Mnih et al., Human-level control through deep reinforcement learning, Nature, 2015
- van Hasselt et al., Deep Reinforcement Learning with Double Q-learning, AAAI, 2016
- Sutton & Barto, Reinforcement Learning: An Introduction, 2nd ed.
Built by Abhi Bhardwaj













