# Deep Q-Network (DQN) on CartPole-v1

This notebook demonstrates training and evaluating a Deep Q-Network (DQN) agent on the classic CartPole-v1 environment using PyTorch.

---

## Overview

We will:
- Define the Q-network and replay buffer
- Train the DQN agent
- Evaluate the learned policy
- Visualize the trained agent

---

Let's get started!


In [None]:
# Imports and environment setup
# Note: the 5-tuple returned by env.step() later on assumes gym >= 0.26 (or gymnasium)
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from collections import deque
import numpy as np

# Make sure your dqn_agent.py and utils.py are in the same directory or installed as a package
from dqn_agent import QNetwork, ReplayBuffer, train_dqn
from utils import evaluate_policy
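
`QNetwork`, `ReplayBuffer`, and `train_dqn` come from `dqn_agent.py`, which is not shown here. To give a rough idea of what a replay buffer does, here is a minimal sketch using only the standard library. The class name `SketchReplayBuffer`, its `capacity` default, and its `push`/`sample` methods are assumptions for illustration and may not match the real `ReplayBuffer`.

```python
import random
from collections import deque


class SketchReplayBuffer:
    """Hypothetical fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10_000):
        # A deque with maxlen silently discards the oldest transition once full
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement reduces the correlation
        # between consecutive transitions in a training batch
        batch = random.sample(self.memory, batch_size)
        # Transpose the list of transitions into one tuple per field
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```

A training loop would usually wait until `len(buffer) >= batch_size` before calling `sample`.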


## Hyperparameters and Environment Setup

We create the environment, set the hyperparameters, and initialize the networks, optimizer, and replay buffer.


In [None]:
env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

num_episodes = 500
batch_size = 64
gamma = 0.99
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995
lr = 1e-2
target_update_freq = 10
test_iters = 20

q_network = QNetwork(state_size, action_size)
target_network = QNetwork(state_size, action_size)
target_network.load_state_dict(q_network.state_dict())
target_network.eval()

optimizer = optim.Adam(q_network.parameters(), lr=lr)
loss_fn = nn.SmoothL1Loss()
buffer = ReplayBuffer()
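
How `train_dqn` applies `epsilon_decay` is not shown. Assuming it multiplies epsilon by `epsilon_decay` once per episode and clips the result at `epsilon_min` (a common convention, but an assumption here), the sketch below shows how far exploration decays with the values above. Both helper names are hypothetical.

```python
import math


def epsilon_after(episodes, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Epsilon after `episodes` multiplicative decays, clipped at epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay ** episodes)


def episodes_to_floor(epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Smallest number of decays after which epsilon sits at epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))


print(f"epsilon after 500 episodes: {epsilon_after(500):.3f}")
print(f"episodes needed to reach epsilon_min: {episodes_to_floor()}")
```

Under that per-episode assumption, epsilon is still around 0.08 after 500 episodes and would need a little over 900 episodes to reach `epsilon_min`. If the decay is applied per environment step instead, the floor is reached far sooner.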


## Evaluate Untrained Policy

Let's see how the untrained, randomly initialized Q-network performs before training.


In [None]:
avg_reward_before = evaluate_policy(q_network, env, episodes=test_iters)
print(f"Average reward before training over {test_iters} episodes: {avg_reward_before:.2f}")
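
`evaluate_policy` is defined in `utils.py`, which is not shown. A minimal sketch of such a helper follows, assuming the Gym 0.26+ API (`reset()` returns `(obs, info)` and `step()` returns a 5-tuple) and a hypothetical `select_action` callable standing in for the network. The function name `evaluate_greedy` is also an assumption.

```python
def evaluate_greedy(select_action, env, episodes=20):
    """Average undiscounted return of `select_action` over `episodes` episodes."""
    total_reward = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()  # Gym >= 0.26: reset() returns (observation, info)
        done = False
        while not done:
            action = select_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            # An episode ends on failure (terminated) or on the time limit (truncated)
            done = terminated or truncated
    return total_reward / episodes
```

With the notebook's objects, `select_action` could be a small function that converts the observation to a tensor and returns `q_network(state_tensor).argmax().item()` under `torch.no_grad()`.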


## Train the DQN Agent

Start training the agent and observe episodic rewards and epsilon decay.


In [None]:
train_dqn(
    online_network=q_network,
    target_network=target_network,
    env=env,
    buffer=buffer,
    loss_fn=loss_fn,
    optimizer=optimizer,
    num_episodes=num_episodes,
    batch_size=batch_size,
    gamma=gamma,
    epsilon=epsilon,
    epsilon_min=epsilon_min,
    epsilon_decay=epsilon_decay,
    target_update_freq=target_update_freq,
)
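
The internals of `train_dqn` are not shown. In standard DQN, each sampled transition is regressed toward the target `r + gamma * max_a' Q_target(s', a')`, with the bootstrap term dropped when the episode has terminated. Below is a framework-free sketch of that computation; the helper `td_targets` is hypothetical and works on plain Python lists rather than tensors.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step TD targets for a batch of transitions.

    rewards:       list of floats, one per transition
    next_q_values: list of per-action Q-value lists from the target network at s'
    dones:         list of bools, True where s' is terminal
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        # No future value is bootstrapped from a terminal state
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


# Two transitions: the first continues, the second ends the episode
print(td_targets([1.0, 1.0], [[0.5, 2.0], [3.0, 1.0]], [False, True], gamma=0.5))
```

In a PyTorch implementation, the same quantity would typically be computed on batched tensors from `target_network` under `torch.no_grad()` and passed to `loss_fn` together with the online network's Q-values for the actions actually taken.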


## Evaluate Trained Policy

After training, evaluate the learned policy over multiple episodes.


In [None]:
avg_reward_after = evaluate_policy(q_network, env, episodes=test_iters)
print(f"Average reward after training over {test_iters} episodes: {avg_reward_after:.2f}")


## Visualize the Agent

Let's watch the trained agent play CartPole.


In [None]:
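import time

# In gym >= 0.26 the render mode is chosen when the environment is created,
# so build a separate environment for on-screen playback.
render_env = gym.make("CartPole-v1", render_mode="human")
obs, _ = render_env.reset()  # reset() returns (observation, info)
done = False
while not done:
    with torch.no_grad():
        state_tensor = torch.FloatTensor(obs).unsqueeze(0)
        action = q_network(state_tensor).argmax().item()  # greedy action
    # In "human" mode each step() call also draws the frame
    obs, reward, terminated, truncated, _ = render_env.step(action)
    done = terminated or truncated
    time.sleep(0.02)
render_env.close()
env.close()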


## Summary

- We trained a DQN agent on CartPole-v1 and evaluated it before and after training.
- Comparing `avg_reward_before` with `avg_reward_after` shows how much the learned policy improves on the untrained network.
- The modular code structure makes it easy to experiment and extend.

Feel free to experiment with hyperparameters or the network architecture!

---

Thank you for checking out this demo!
