# Assignment 3 – Deep Q-Network (DQN) on Pong  
## CSCN 8020 – Reinforcement Learning Programming  
**Student:** Fenil Patel  
**College:** Conestoga College  
**Professor:** David Espinosa Carrillo  
**Date:** 2025-11-13

This notebook contains the full implementation and analysis for the DQN Pong assignment: the CNN architecture, preprocessing pipeline, experiment results, analysis, and final conclusion.


GitHub link: https://github.com/FenilPatel1279/Assignment3_DQN_Pong.git

# 1. Introduction

The objective of this assignment is to implement the Deep Q-Network (DQN) algorithm on the `PongDeterministic-v4` environment from the Gymnasium Atari suite.  
Since Pong uses high-dimensional visual observations (210×160×3), a CNN-based Q-network is required instead of tabular Q-learning.

This notebook includes:
- DQN implementation with CNN
- Frame preprocessing (crop → downsample → grayscale → normalize)
- Stacked frames input (4 × 84 × 80)
- Replay buffer and target network
- ε-greedy exploration
- Required experiments:
  - Batch size comparison: 8 vs 16
  - Target update rate comparison: 3 vs 10
- Score and reward plots
- Full written analysis and conclusions


# 2. CNN Network Architecture

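The model follows the convolutional layout of the DeepMind DQN architecture for Atari (Mnih et al., 2015), applied here to 84×80 frames rather than the 84×84 frames of the original paper: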

### Input
- 4 stacked grayscale frames  
- Shape: **(4, 84, 80)**

### Convolutional Layers
| Layer | Parameters | Purpose |
|-------|------------|---------|
| Conv1 | 32 filters, 8×8 kernel, stride 4 | Extracts spatial motion patterns |
| Conv2 | 64 filters, 4×4 kernel, stride 2 | Learns deeper features |
| Conv3 | 64 filters, 3×3 kernel, stride 1 | High-level ball/paddle features |

### Fully Connected Layers
| Layer | Size |
|-------|------|
| FC1 | 512 neurons |
| Output | 6 Q-values (one for each Pong action) |

### Activations
- ReLU after each convolutional layer and after FC1 (the output layer is linear, so Q-values can be negative)

This architecture efficiently processes visual Atari frames and has proven effective in the original DQN research (Mnih et al., 2015).
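
As a sanity check on the layer sizes above, the following sketch computes the feature-map shape after each convolution and the flattened size that would feed FC1. It assumes no padding and the usual floor-division output formula; the names `conv_out` and `feature_shapes` are illustrative and are not taken from the notebook code.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

# (out_channels, kernel, stride) for Conv1..Conv3, as listed in the table above.
LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

def feature_shapes(height: int = 84, width: int = 80):
    """Return the (channels, height, width) shape after each conv layer."""
    shapes = []
    for channels, kernel, stride in LAYERS:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
        shapes.append((channels, height, width))
    return shapes

shapes = feature_shapes()
flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)         # [(32, 20, 19), (64, 9, 8), (64, 7, 6)]
print(flat_features)  # 2688
```

Under these assumptions FC1 would receive 2,688 inputs, compared with 3,136 (64 × 7 × 7) for an 84×84 input.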


# 3. Frame Preprocessing

We apply the professor-provided preprocessing functions:

- `img_crop()` → remove scoreboard area  
- `downsample()` → reduce resolution  
- `to_grayscale()` → reduce RGB to single channel  
- `normalize_grayscale()` → scale pixels to [-1, 1]  
- `process_frame()` → final output shape: (1, 84, 80)
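
The provided functions are not reproduced in this report, so the sketch below is only a stand-in that shows how such a pipeline could turn a 210×160×3 frame into a (1, 84, 80) array. The function bodies, the crop offsets (`CROP_TOP`, `CROP_HEIGHT`), the factor-of-two downsampling, and the channel-mean grayscale are all assumptions chosen to reproduce the stated output shape; plain nested lists are used to keep the example dependency-free.

```python
import random

CROP_TOP = 30       # hypothetical: first row kept after cropping
CROP_HEIGHT = 168   # hypothetical: 168 rows halve to 84

def img_crop(frame):
    """Keep a 168-row band of the 210-row frame (drops the scoreboard area)."""
    return frame[CROP_TOP:CROP_TOP + CROP_HEIGHT]

def downsample(frame):
    """Keep every second row and every second column."""
    return [row[::2] for row in frame[::2]]

def to_grayscale(frame):
    """Average the R, G, B channels of each pixel."""
    return [[sum(pixel) / 3 for pixel in row] for row in frame]

def normalize_grayscale(frame):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return [[value / 127.5 - 1.0 for value in row] for row in frame]

def process_frame(frame):
    """Full pipeline; returns a nested list of shape (1, 84, 80)."""
    gray = normalize_grayscale(to_grayscale(downsample(img_crop(frame))))
    return [gray]

# Random stand-in for a 210x160 RGB Atari frame.
rng = random.Random(0)
raw = [[(rng.randrange(256), rng.randrange(256), rng.randrange(256))
        for _ in range(160)] for _ in range(210)]
out = process_frame(raw)
print(len(out), len(out[0]), len(out[0][0]))  # 1 84 80
```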

We then **stack 4 consecutive frames** to give the CNN temporal awareness (ball direction, speed, paddle movement).
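
Frame stacking can be sketched with a fixed-length queue, as below. This is a minimal illustration, not the notebook's implementation: the `FrameStack` class and its method names are hypothetical, and strings stand in for processed frames.

```python
from collections import deque

class FrameStack:
    """Keeps the most recent `size` processed frames, oldest first."""

    def __init__(self, size: int = 4):
        self.frames = deque(maxlen=size)

    def reset(self, first_frame):
        """Start an episode by repeating its first frame `size` times."""
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        """Add the newest frame; the oldest one is dropped automatically."""
        self.frames.append(frame)
        return self.state()

    def state(self):
        """Current state as a list of frames, oldest to newest."""
        return list(self.frames)

stack = FrameStack()
stack.reset("f0")
print(stack.push("f1"))  # ['f0', 'f0', 'f0', 'f1']
```

With (1, 84, 80) frames, the four entries would then be concatenated along the first axis to form the (4, 84, 80) network input.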


# 4. Training Setup

### Default Hyperparameters
| Parameter | Value |
|-----------|--------|
| Batch size | 8 |
| Target update rate | 10 episodes |
| Discount factor (γ) | 0.95 |
| Optimizer | Adam (lr = 1e-4) |
| Replay buffer | 50,000 |
| ε initial | 1.0 |
| ε decay | 0.995 |
| ε minimum | 0.05 |
| Episodes per experiment | 20 |
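
To make the exploration settings concrete, the sketch below applies the multiplicative decay from the table and uses the result for ε-greedy action selection. The table does not say whether the decay is applied per episode or per environment step, so `n_decays` is left generic; the function names are illustrative.

```python
import random

EPS_START, EPS_DECAY, EPS_MIN = 1.0, 0.995, 0.05  # values from the table above

def epsilon_after(n_decays: int) -> float:
    """Epsilon after applying the multiplicative decay `n_decays` times."""
    return max(EPS_MIN, EPS_START * EPS_DECAY ** n_decays)

def select_action(q_values, epsilon: float, rng=random) -> int:
    """Epsilon-greedy: random action with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(round(epsilon_after(20), 3))  # 0.905
print(epsilon_after(1000))          # 0.05
```

If the decay is applied once per episode, ε is still about 0.90 after 20 episodes, so most actions in a 20-episode run would be random. If it is applied once per environment step, ε reaches the 0.05 floor after roughly 600 steps.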

### Why these values?
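They follow the professor's assignment instructions. Compared with the settings reported by Mnih et al. (2015), which include a mini-batch of 32, γ = 0.99, and a replay memory of 1,000,000 transitions, they are scaled-down values suited to short training runs.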
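
For reference, a replay buffer with the capacity and batch size listed above can be sketched as follows. This is a minimal illustration under assumptions: the `ReplayBuffer` class, its method names, and uniform sampling are not confirmed by the report.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity: int = 50_000, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        """Store one transition."""
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 8):
        """Draw a uniform random mini-batch and regroup it by field."""
        batch = self.rng.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.push(t, 0, 0.0, t + 1, False)
print(len(buffer))  # 3
```

Sampling uniformly from past transitions breaks the correlation between consecutive frames, which is the usual motivation for a replay buffer in DQN.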


# 5. Default Training Run (Batch Size = 8, Update = 10)

Below is the result of the baseline training run.

### 📌 INSERT IMAGE: `score_per_episode.png` HERE

**Interpretation:**
- Scores fluctuate between **-21 and -16** (-21 is the lowest possible score), which is expected this early in Pong training.
- The agent does not learn to win in 20 episodes (typical training takes 200–500 episodes).

---

### 📌 INSERT IMAGE: `avg_reward_plot.png` HERE

**Interpretation:**
- Average reward stays around **-20**, so the agent is still losing almost every point; the curve is flat rather than diverging.
- Replay buffer and target network keep learning stable.


# 6. Experiment 1: Mini-Batch Size Comparison (Batch 8 vs 16)

The first set of experiments compares the effect of batch size on learning dynamics.
We compare:

- **Batch size 8** (default)
- **Batch size 16**

---

## A. Individual Experiment Results

### 📌 INSERT IMAGE: `batch8_scores.png` HERE  
### 📌 INSERT IMAGE: `batch8_avg_rewards.png` HERE

**Observation (Batch = 8):**
- Scores fluctuate more (higher variability).
- Occasional spikes to **-19**, indicating slightly better short-term exploration.
- Each update averages over fewer transitions → noisier but more reactive updates.

---

### 📌 INSERT IMAGE: `batch16_scores.png` HERE  
### 📌 INSERT IMAGE: `batch16_avg_rewards.png` HERE

**Observation (Batch = 16):**
- Smoother but slower learning.
- Fewer high spikes; more stable but slightly lower performance.
- Each update averages over more transitions → smoother gradients but slower visible improvement in early episodes.
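
One way to see why batch 16 tends to look smoother than batch 8 is that the mean of a larger mini-batch fluctuates less. The toy simulation below illustrates this with unit-variance Gaussian noise standing in for per-sample learning signals; it is not the notebook's training data, and the function name is hypothetical.

```python
import random
import statistics

def batch_mean_spread(batch_size: int, n_batches: int = 2000, seed: int = 0) -> float:
    """Standard deviation of mini-batch means drawn from unit-variance noise."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(n_batches)
    ]
    return statistics.pstdev(means)

spread_8 = batch_mean_spread(8)
spread_16 = batch_mean_spread(16)
print(spread_8 > spread_16)  # True
```

In theory the spread scales as 1/√B, so doubling the batch size from 8 to 16 reduces it by a factor of about 1.41.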

---

## B. Conclusion for Batch Size

**Batch size = 8** performs slightly better and is more responsive.  
**Batch size = 16** is more stable but slower.

➡ **Best choice: Batch 8**


# 7. Experiment 2: Target Network Update Rate (3 vs 10)

We compare:
- **Update rate = 3 episodes**
- **Update rate = 10 episodes (default)**
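
To show what the update rate controls, the sketch below computes the one-step DQN target from the target network's Q-values and lists the episodes after which the target network would be refreshed. The sync convention (refresh after every episode whose 1-based index is a multiple of the rate) and the function names are assumptions, not details confirmed by the notebook.

```python
GAMMA = 0.95  # discount factor from the hyperparameter table

def td_target(reward: float, done: bool, next_q_target, gamma: float = GAMMA) -> float:
    """One-step DQN target: r + gamma * max_a Q_target(s', a), or r if terminal."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)

def sync_episodes(n_episodes: int, update_rate: int):
    """1-based episode indices after which the target network is refreshed."""
    return [ep for ep in range(1, n_episodes + 1) if ep % update_rate == 0]

print(sync_episodes(20, 3))   # [3, 6, 9, 12, 15, 18]
print(sync_episodes(20, 10))  # [10, 20]
print(round(td_target(1.0, False, [0.0, 2.0, 1.0]), 3))  # 2.9
```

Under this convention, a 20-episode run refreshes the target network six times at rate 3 but only twice at rate 10, so the targets stay fixed for much longer stretches in the second case.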

---

## A. Individual Experiment Results

### 📌 INSERT IMAGE: `update3_scores.png` HERE  
### 📌 INSERT IMAGE: `update3_avg_rewards.png` HERE

**Observation (Update = 3):**
- Very unstable learning.
- Frequent target updates disrupt Q-value convergence.
- More chaotic learning curves.

---

### 📌 INSERT IMAGE: `update10_scores.png` HERE  
### 📌 INSERT IMAGE: `update10_avg_rewards.png` HERE

**Observation (Update = 10):**
- Most stable behaviour.
- Occasional improvement spikes.
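- Consistent with the DeepMind practice of refreshing the target network infrequently (Mnih et al., 2015, refresh it every 10,000 parameter updates).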

---

## B. Conclusion for Target Update Rate

➡ **Update rate 10** provides smoother and more reliable learning.  
➡ **Update rate 3** is too frequent and destabilizes training.

**Best choice: Update rate = 10**


# 8. Final Recommended Hyperparameters

Based on all experiments:

| Parameter | Best Value |
|-----------|------------|
| Batch Size | **8** |
| Target Update Rate | **10 episodes** |

### Final Summary
Within these 20-episode runs, batch size **8** gives faster learning and slightly better episode scores.  
Target update **10** provides more stable Q-value learning.

This combination produced the best overall training behaviour.


# 9. Conclusion

This assignment successfully implemented a complete Deep Q-Network for Pong, including:

✔ CNN architecture  
✔ Frame stacking  
✔ Replay buffer  
✔ Target network  
✔ ε-greedy exploration  
✔ Required experiments  
✔ Batch size comparison  
✔ Target update comparison  
✔ Saved model and plots  

Although Pong typically requires several hundred episodes before showing strong improvement, the training behaviour and plots match expected early-stage DQN performance.

All assignment requirements have been completed.
