🎮 AgentForge

Deep Reinforcement Learning for Autonomous Control

Python PyTorch Gymnasium Streamlit Tests License

Teaching a neural network to master physical balance through trial, error, and zero labeled data.


📌 Problem Statement

Imagine trying to balance a broomstick on your palm. You'd wobble, overcorrect, drop it — and then try again. Over time, through pure trial and error, you'd get better. AgentForge works the same way, except the "palm" is a moving cart, the "broomstick" is a pole, and the "you" is a neural network that has never seen this problem before.

Traditional approaches solve this with hand-written physics equations — formulas that a human engineer explicitly programs. AgentForge takes a fundamentally different approach: the AI starts with absolutely zero knowledge and teaches itself to balance the pole purely by trying thousands of times and learning from its own mistakes. This technique is called Deep Q-Learning (DQN).

Under the hood, the agent reads 4 numbers every frame (where the cart is, how fast it's moving, how tilted the pole is, and how fast it's tilting), picks one of two actions (push left or push right), and gradually discovers which sequences of actions keep the pole upright the longest.
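
In code, that read-state/pick-action loop is only a few lines. The sketch below uses the standard Gymnasium API with a random policy standing in for the network, purely for illustration:

import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=42)   # [cart position, cart velocity, pole angle, pole angular velocity]

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # 0 = push left, 1 = push right (random here; the DQN replaces this)
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Survived {int(total_reward)} steps")  # a random policy typically manages ~20
env.close()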

Success Criteria: The agent must balance the pole for an average of ≥ 195 time steps across 100 consecutive games (the classic OpenAI Gym "solved" threshold, originally defined for CartPole-v0).


🌐 Interactive Dashboard

AgentForge ships with a full Streamlit-powered web dashboard for exploring every aspect of the project — no terminal required.

| Tab | What You Get |
|-----|--------------|
| 📊 Training | Interactive Plotly convergence chart (hover, zoom, pan), epsilon decay, loss curve, expandable raw data table |
| 🚀 Live Train | One-click training with real-time progress bar, live-updating chart, streaming console output, and metric cards |
| 🏆 Baselines | Animated bar chart comparing Random (20) vs Heuristic (35) vs DQN (500⭐) |
| 🔬 Ablations | Dropdown to switch between 4 ablation studies with side-by-side plots |
| 🎯 Double DQN | Code comparison + convergence curves for Standard vs Double DQN |
| 🎬 Videos | Three playable gameplay videos showing the agent's learning progression |
| 📐 Architecture | Graphviz flow diagram of the DQN pipeline + hyperparameter table |

# Run locally
PYTHONPATH=. streamlit run src/dashboard.py

🧠 Architecture

                     ┌─────────────────────────────────┐
                     │       ENVIRONMENT (CartPole-v1)  │
                     │   state = [x, ẋ, θ, θ̇]          │
                     └──────────┬──────────────────────┘
                                │ state
                                ▼
┌──────────────────────────────────────────────────────────────┐
│                        DQN AGENT                             │
│                                                              │
│   ┌──────────────┐    ε-greedy     ┌──────────────────────┐  │
│   │ Policy Net   │ ◄──────────────►│  Action Selection    │  │
│   │ (4→128→128→2)│    explore/     │  argmax Q(s,a)       │  │
│   └──────┬───────┘    exploit      └──────────────────────┘  │
│          │                                                   │
│          │ MSE Loss                                          │
│          │                                                   │
│   ┌──────▼───────┐                 ┌──────────────────────┐  │
│   │ Target Net   │ ◄── hard copy ──│  Every 500 steps     │  │
│   │ (frozen)     │    (sync)       │ (target_update_freq) │  │
│   └──────────────┘                 └──────────────────────┘  │
│                                                              │
│   ┌──────────────────────────────────────────────────────┐   │
│   │  Experience Replay Buffer (capacity: 10,000)         │   │
│   │  → stores (s, a, r, s', done) transitions            │   │
│   │  → samples random mini-batches of 64 for training    │   │
│   └──────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
                                │ action
                                ▼
                     ┌──────────────────────────────────┐
                     │   reward, next_state, done       │
                     └──────────────────────────────────┘

📖 Full deep-dive with algorithm pseudocode, math, and references → docs/architecture.md
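
The Experience Replay box above stores (s, a, r, s', done) tuples and samples random mini-batches of 64. Below is a minimal sketch of that mechanism, using the capacity and batch size from the diagram; it is illustrative only, and the actual src/replay_buffer.py may expose a different interface.

import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)            # oldest transitions are evicted automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        batch = random.sample(self.buffer, batch_size)  # random draw breaks temporal correlation
        return Transition(*zip(*batch))                 # one tuple per field, each of length batch_size

    def __len__(self):
        return len(self.buffer)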

Key Components

| Component | Implementation | Purpose |
|-----------|----------------|---------|
| Q-Network | 4 → 128 → 128 → 2 (ReLU) | Approximates Q(s,a) for action selection |
| Target Network | Frozen copy, synced every 500 steps | Provides stable TD targets during training |
| Experience Replay | Circular buffer (10K capacity, batch 64) | Breaks temporal correlation in training data |
| ε-Greedy Policy | ε: 1.0 → 0.01 (decay 0.995/episode) | Balances exploration vs. exploitation |
| Optimizer | Adam (lr=0.001), MSE loss | Gradient descent with adaptive learning rate |
| Gradient Clipping | max_norm=1.0 | Prevents exploding gradients during training |
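
These components map almost directly onto code. The following sketch uses the sizes and hyperparameters from the table, but the function and variable names are illustrative and need not match src/model.py or src/agent.py:

import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """4 → 128 → 128 → 2 fully connected Q-network with ReLU activations."""
    def __init__(self, state_dim: int = 4, hidden: int = 128, n_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),              # one Q-value per action
        )

    def forward(self, x):
        return self.net(x)

def select_action(policy_net, state, epsilon):
    """ε-greedy: explore with probability ε, otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.randrange(2)                     # random push left / push right
    with torch.no_grad():
        q = policy_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q.argmax(dim=1).item())

def optimize(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the TD error for a sampled mini-batch."""
    states, actions, rewards, next_states, dones = batch   # float tensors; actions int64, shape [batch]
    q_pred = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                              # frozen target network provides stable TD targets
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1 - dones)
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(policy_net.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    return loss.item()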

📊 Results

Training Convergence

The DQN agent successfully solved CartPole-v1 by achieving a rolling average reward of ≥ 195 over 100 episodes.

Training convergence curve showing reward over episodes

Baseline Comparison

The trained DQN agent massively outperforms both baseline strategies:

Bar chart comparing DQN, Heuristic, and Random agents

| Agent | Avg Reward (100 ep) | Strategy |
|-------|---------------------|----------|
| Random | ~20 | Uniform random actions |
| Heuristic | ~35 | If angle > 0 → push right, else push left |
| DQN (Ours) | 500 | Learned optimal policy via Deep Q-Learning |
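
For reference, the heuristic baseline in the table is a single rule over the pole angle; written out below (the actual baselines/heuristic_agent.py may wrap this differently):

def heuristic_action(state) -> int:
    _, _, pole_angle, _ = state          # state = [x, ẋ, θ, θ̇]
    return 1 if pole_angle > 0 else 0    # pole leaning right → push right, otherwise push left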

Epsilon Decay & Loss Curves

Epsilon decay over episodes
Training loss curve


🔬 Ablation Studies

We conducted 4 systematic ablation studies to understand how each hyperparameter affects convergence behavior. In each study, one parameter is varied while all others are held fixed at their tuned defaults.

Ablation 1 — Replay Buffer Size

How much past experience does the agent need to learn effectively?

Tested: 1K · 5K · 10K · 50K transitions


Ablation 2 — Epsilon Decay Rate

How fast should the agent transition from exploration to exploitation?

Tested: 0.990 · 0.995 · 0.999 · 0.9995


Ablation 3 — Network Depth

Does a deeper Q-network learn a better policy?

Tested: 1 · 2 · 3 hidden layers (128 neurons each)


Ablation 4 — Target Network Update Frequency

How often should the target network synchronize with the policy network?

Tested: 250 · 500 · 1,000 · 2,000 steps
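
All four studies share the same sweep pattern. The sketch below shows one way to drive such a sweep; it assumes PyYAML for reading configs/default.yaml, and the train_fn callable (a function taking a config dict and returning a score) is a placeholder rather than the actual interface of src/ablation.py:

import copy
import yaml

def sweep(train_fn, param_path: str, values, base_config="configs/default.yaml"):
    with open(base_config) as f:
        base = yaml.safe_load(f)                 # tuned defaults
    results = {}
    for v in values:
        cfg = copy.deepcopy(base)
        section, key = param_path.split(".")     # e.g. "agent.replay_buffer_size"
        cfg[section][key] = v                    # vary one parameter, hold all others fixed
        results[v] = train_fn(cfg)               # e.g. episodes-to-solve or final average reward
    return results

# Ablation 1 would then be: sweep(train_fn, "agent.replay_buffer_size", [1_000, 5_000, 10_000, 50_000])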


🎯 Double DQN Extension

Standard DQN uses the same network to both select and evaluate the best next action, which causes systematic overestimation of Q-values. Double DQN fixes this with a simple but powerful change — decouple selection from evaluation:

|            | Action Selection | Action Evaluation |
|------------|------------------|-------------------|
| DQN        | Target Network   | Target Network    |
| Double DQN | Policy Network   | Target Network    |

# DQN:        y = r + γ · max_a' Q_target(s', a')
# Double DQN: y = r + γ · Q_target(s', argmax_a' Q_policy(s', a'))
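
In PyTorch terms, the two targets differ only in where the argmax comes from. The sketch below is illustrative; tensor names and shapes are assumptions, not code from src/double_dqn_agent.py:

import torch

def dqn_target(rewards, next_states, dones, target_net, gamma=0.99):
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values                   # select AND evaluate with target net
    return rewards + gamma * next_q * (1 - dones)

def double_dqn_target(rewards, next_states, dones, policy_net, target_net, gamma=0.99):
    with torch.no_grad():
        best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)   # select with policy net
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluate with target net
    return rewards + gamma * next_q * (1 - dones)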

Head-to-Head Results

| Agent | Convergence Episode |
|-------|---------------------|
| Standard DQN | ~563 |
| Double DQN | ~595 |

Both agents solve the environment. On CartPole-v1 the difference is marginal, since Q-value overestimation does little damage in such a simple task, but Double DQN becomes important in more complex domains with large action spaces.

Reference: van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning", AAAI 2016.


🎬 Learning Progression

Three gameplay videos demonstrate the agent's learning journey from random flailing to perfect control:

| Stage | Video | Steps Survived | Description |
|-------|-------|----------------|-------------|
| 🔴 Untrained | 01_untrained.mp4 | ~11 | Random actions, pole falls immediately |
| 🟡 Mid-Training | 02_mid_training.mp4 | ~500 | Loaded from episode 500 checkpoint |
| 🟢 Fully Trained | 03_fully_trained.mp4 | 500 (max) | Perfect balance for the entire episode |

Videos are saved in results/videos/ and can be regenerated with PYTHONPATH=. python src/record.py.


🚀 Quick Start

# Clone
git clone https://github.com/Abhics8/AgentForge.git
cd AgentForge

# Install dependencies
pip install -r requirements.txt

# Train the DQN agent (1000 episodes)
PYTHONPATH=. python src/train.py

# Launch the interactive dashboard
PYTHONPATH=. streamlit run src/dashboard.py

# Evaluate against baselines
PYTHONPATH=. python src/evaluate.py

# Run all 4 ablation studies
PYTHONPATH=. python src/ablation.py

# DQN vs Double DQN comparison
PYTHONPATH=. python src/compare_dqn.py

# Record gameplay videos
PYTHONPATH=. python src/record.py

# Live demo with visual pygame window
PYTHONPATH=. python src/demo.py

# Run test suite (30 tests)
PYTHONPATH=. python -m pytest tests/ -v

✅ Testing

30/30 tests passing across 4 test classes:

tests/test_components.py::TestReplayBuffer     (11 tests)  ✅
tests/test_components.py::TestDQN              (8 tests)   ✅
tests/test_components.py::TestDQNAgent         (10 tests)  ✅
tests/test_components.py::TestUtils            (1 test)    ✅

CI runs automatically on every push via GitHub Actions across Python 3.10, 3.11, and 3.12.


📁 Project Structure

AgentForge/
├── .github/workflows/
│   └── ci.yml                    # GitHub Actions CI (multi-Python + smoke test)
├── configs/
│   └── default.yaml              # All hyperparameters (single source of truth)
├── docs/
│   └── architecture.md           # Deep-dive: algorithms, math, references
├── src/
│   ├── model.py                  # DQN architecture (configurable depth)
│   ├── replay_buffer.py          # Experience replay (10K circular buffer)
│   ├── agent.py                  # DQN agent (ε-greedy, target net, optimize)
│   ├── double_dqn_agent.py       # Double DQN agent (decoupled evaluation)
│   ├── environment.py            # Gymnasium CartPole-v1 wrapper
│   ├── train.py                  # Training loop with convergence detection
│   ├── evaluate.py               # Baseline comparison evaluation
│   ├── compare_dqn.py            # DQN vs Double DQN head-to-head
│   ├── ablation.py               # 4 ablation studies framework
│   ├── record.py                 # Gameplay video recording
│   ├── tune.py                   # Hyperparameter tuning script
│   ├── demo.py                   # Live pygame demo for presentations
│   ├── dashboard.py              # Streamlit interactive dashboard (7 tabs)
│   └── utils.py                  # Plotting, config loading
├── baselines/
│   ├── random_agent.py           # Uniform random baseline (~20 reward)
│   └── heuristic_agent.py        # Rule-based baseline (~35 reward)
├── results/
│   ├── plots/                    # 14 generated visualizations (dark-themed)
│   ├── checkpoints/              # Saved model weights (.pt)
│   ├── logs/                     # Training CSV logs
│   └── videos/                   # 3 agent gameplay recordings
├── tests/
│   └── test_components.py        # 30 unit tests
└── requirements.txt

⚙️ Hyperparameters

# configs/default.yaml
environment:
  name: CartPole-v1
  solved_reward: 195.0          # OpenAI benchmark threshold
  solved_window: 100            # Rolling window for convergence check

training:
  episodes: 1000
  seed: 42

network:
  hidden_size: 128              # Neurons per hidden layer
  num_hidden_layers: 2          # Network depth

agent:
  replay_buffer_size: 10000     # Experience replay capacity
  batch_size: 64                # Mini-batch size for SGD
  gamma: 0.99                   # Discount factor
  epsilon_start: 1.0            # Initial exploration rate
  epsilon_end: 0.01             # Minimum exploration rate
  epsilon_decay: 0.995          # Multiplicative decay per episode
  learning_rate: 0.001          # Adam optimizer LR
  target_update_freq: 500       # Steps between target net syncs
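
A quick check of what the exploration settings above imply, assuming the decay is applied multiplicatively once per episode as listed:

import math

eps_start, eps_end, eps_decay = 1.0, 0.01, 0.995
episodes_to_floor = math.log(eps_end / eps_start) / math.log(eps_decay)
print(f"ε reaches its 0.01 floor after ~{episodes_to_floor:.0f} episodes")   # ≈ 919 of the 1000 budgeted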

🛠️ Tech Stack

Python · PyTorch · Gymnasium · Streamlit · Plotly · Graphviz · pygame · pytest · GitHub Actions

📚 References

van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI 2016.

Built by Abhi Bhardwaj
