# 1. Project Overview

This project applies Deep Q-Network (DQN) reinforcement learning to solve the LunarLander-v3 environment from Gymnasium (the maintained successor to OpenAI Gym).

The goal is to train an agent to land a spacecraft safely between two flags using only thrust-based actions. The agent learns a policy through trial and error by interacting with the environment.

### Objectives:
- Implement a custom DQN agent
- Train the agent to solve the environment
- Monitor performance using reward trends and average scores
- Generate a demo `.gif` of successful landings

# 2. Environment Setup

The LunarLander environment simulates a spacecraft descending toward a landing pad.

### Game Rules:
- The goal is to land the spacecraft between the flags without crashing.
- The lander has 4 actions (indexed 0 to 3 in the environment, in the order listed):
  1. Do nothing
  2. Fire left orientation engine
  3. Fire main engine
  4. Fire right orientation engine
- The agent receives:
  - Positive reward for landing between the flags
  - Negative reward for crashing or flying out of bounds
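  - Small shaping rewards for moving toward the pad, slowing down, staying upright, and leg contact, with a small penalty for each engine firing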

### Observation Space:
- An 8-dimensional continuous vector:
  - position (x, y)
  - linear velocities (vx, vy)
  - angle
  - angular velocity
  - left leg contact (bool)
  - right leg contact (bool)

### Action Space:
- A discrete set of 4 actions
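
To make the layout above concrete, here is a small sketch that names the four action indices and unpacks the 8-value observation. The names `Action`, `LanderState`, and `decode_observation` are hypothetical helpers for illustration, not part of Gymnasium, and the zero-based indices assume the environment numbers the actions in the order listed under Game Rules.

```python
from enum import IntEnum
from typing import NamedTuple, Sequence


class Action(IntEnum):
    """Discrete action indices (assumed zero-based, in the order listed above)."""
    NOOP = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


class LanderState(NamedTuple):
    """Named view of the 8-dimensional observation vector."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def decode_observation(obs: Sequence[float]) -> LanderState:
    """Unpack a raw observation into named fields."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, ang_vel, left, right = obs
    return LanderState(
        float(x), float(y), float(vx), float(vy),
        float(angle), float(ang_vel), bool(left), bool(right),
    )
```

For example, `decode_observation(obs).right_leg_contact` reads more clearly in logging or debugging code than `obs[7]`.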

# 3. Model & Approach

We use a Deep Q-Network (DQN) for this problem. The DQN approximates the Q-values for each state-action pair using a feedforward neural network.

### Why DQN?
LunarLander has a continuous observation space and a discrete action space, which makes it a good fit for DQN. The agent learns from past transitions sampled from an experience replay buffer, and a separate target network stabilizes training.

The model is trained using the Bellman equation with temporal difference updates.
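
Written out, a standard one-step formulation of this update looks as follows. The report does not state its exact loss, so the squared error here is an assumption (Huber loss is another common choice), and the symbols are introduced only for this illustration.

$$
y = r + \gamma\,(1 - d)\,\max_{a'} Q_{\theta^-}(s', a'), \qquad
\mathcal{L}(\theta) = \big(Q_\theta(s, a) - y\big)^2
$$

Here $(s, a, r, s', d)$ is a transition sampled from the replay buffer, $d = 1$ if the episode ended at $s'$ and $0$ otherwise, $\gamma$ is the discount factor, $\theta$ are the online network weights, and $\theta^-$ are the target network weights.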

Below is the full implementation of:
- Q-Network (`DQN Network`)
- Experience Replay (`ReplayBuffer`)
- Agent (`DQN Agent`)
- Training function with comments

### DQN
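
As an illustration of the kind of feedforward Q-network described above, here is a minimal PyTorch sketch. The class name `QNetwork`, the two hidden layers, and the hidden width of 64 are assumptions, not the report's actual architecture; only the input size (8) and output size (4) come from the environment description.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Feedforward network mapping a state to one Q-value per action."""

    def __init__(self, state_dim: int = 8, action_dim: int = 4, hidden_dim: int = 64):
        super().__init__()
        # Depth and hidden_dim are illustrative assumptions.
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Input shape (batch, state_dim); output shape (batch, action_dim).
        return self.layers(state)
```

Calling `QNetwork()(torch.zeros(32, 8))` returns a tensor of shape `(32, 4)`; taking `argmax(dim=1)` over it gives the greedy action for each state in the batch.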

### Replay Buffer
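
A replay buffer can be sketched with the standard library alone. The version below is a minimal illustration, assuming uniform sampling and first-in-first-out eviction; the `Transition` tuple and the method names `add` and `sample` are hypothetical, while the default capacity follows the 100,000-transition figure mentioned in this report.

```python
import random
from collections import deque
from typing import List, NamedTuple, Optional, Sequence


class Transition(NamedTuple):
    state: Sequence[float]
    action: int
    reward: float
    next_state: Sequence[float]
    done: bool


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions with uniform random sampling."""

    def __init__(self, capacity: int = 100_000, seed: Optional[int] = None):
        # deque(maxlen=...) silently drops the oldest item once full.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done) -> None:
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size: int) -> List[Transition]:
        # Sampling without replacement breaks the correlation between
        # consecutive environment steps.
        return self.rng.sample(self.memory, batch_size)

    def __len__(self) -> int:
        return len(self.memory)
```

Training code would typically wait until `len(buffer) >= batch_size` before calling `sample`, since sampling more items than are stored raises a `ValueError`.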

### DQN Agent Class
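
One core piece of a DQN agent is its exploration rule. The sketch below shows epsilon-greedy action selection with a multiplicative decay, assuming that is the scheme behind the epsilon decay schedule mentioned later in this report. The function names and the decay constants (`0.995`, floor `0.01`) are illustrative assumptions.

```python
import random
from typing import Sequence


def select_action(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy: random action with probability epsilon, else argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon: float, decay: float = 0.995, minimum: float = 0.01) -> float:
    """Multiplicative decay with a floor; both constants are assumed values."""
    return max(minimum, epsilon * decay)
```

With `epsilon=0.0` the agent always picks the highest-valued action, which is the usual setting for test episodes.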

### DQN Training Loop

### Training Model

# 4. Troubleshooting & Improvements

During early experiments, the agent struggled to consistently achieve high rewards. To improve learning performance, the following steps were taken:

### Key Adjustments:
- **Training Duration:** Increased the number of episodes to allow more learning opportunities.
- **Replay Buffer Size:** Ensured a large memory buffer (100,000+ transitions) for stable learning.
- **Soft Updates:** Used a `tau` value of `1e-3` for gradual target network updates.
- **Reward Monitoring:** Logged average reward across episodes and implemented early stopping.
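
The soft update mentioned above can be sketched as a Polyak average of the two networks' parameters. This is a minimal PyTorch illustration, assuming the online and target networks share the same architecture; the function name `soft_update` is hypothetical, and the default `tau` is the `1e-3` value from the list.

```python
import torch
import torch.nn as nn


def soft_update(online: nn.Module, target: nn.Module, tau: float = 1e-3) -> None:
    """Blend online weights into the target: target <- tau*online + (1 - tau)*target."""
    with torch.no_grad():
        for target_param, online_param in zip(target.parameters(), online.parameters()):
            # In-place update so the target network keeps its own parameter tensors.
            target_param.mul_(1.0 - tau).add_(online_param, alpha=tau)
```

With `tau = 1e-3`, each call moves the target weights only 0.1% of the way toward the online weights, so the TD targets change slowly between updates.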

### Evaluation Data

The test episode results were saved to a `.csv` and are shown below:

```python
import pandas as pd

# Load and display test reward results
df_results = pd.read_csv("results/dqn_test_results.csv")
df_results
```

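| Unnamed: 0 | Episode | Reward     |
|------------|---------|------------|
| 0          | 1       | 60.249863  |
| 1          | 2       | 222.431689 |
| 2          | 3       | 74.346023  |
| 3          | —       | 119.009192 |

The last row, which has no episode number, equals the mean of the three test episode rewards.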


# 5. Results

### Reward Trends

The following plot shows the episode rewards across training. The orange line represents the moving average over 100 episodes.

![Training Reward Plot](results/dqn_training_performance.png)
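
A trailing average like the one in the plot can be computed as in the sketch below. The helper names are hypothetical, and the 200-point threshold in `first_solved_episode` is an assumption based on the score commonly treated as solving LunarLander; the report does not state its own early-stopping criterion.

```python
from collections import deque
from typing import Iterable, List, Optional


def moving_average(rewards: Iterable[float], window: int = 100) -> List[float]:
    """Trailing mean over the last `window` rewards (shorter at the start)."""
    recent = deque(maxlen=window)
    averages = []
    for reward in rewards:
        recent.append(reward)
        averages.append(sum(recent) / len(recent))
    return averages


def first_solved_episode(
    rewards: Iterable[float], window: int = 100, threshold: float = 200.0
) -> Optional[int]:
    """1-based episode at which the full-window average first reaches threshold."""
    for episode, avg in enumerate(moving_average(rewards, window), start=1):
        # Ignore the warm-up period where fewer than `window` episodes exist.
        if episode >= window and avg >= threshold:
            return episode
    return None
```

A training loop could call `first_solved_episode` on its reward history after each episode and stop as soon as it returns a value.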

# 6. Conclusion & Discussion

### What Worked
- The DQN agent successfully learned to land the LunarLander with increasing consistency.
- Soft target updates and a replay buffer helped stabilize learning.
- The reward curve showed clear upward trends after sufficient training time.

### What Didn’t Work
- Early runs were unstable or slow when:
  - The replay buffer was too small
  - The agent was trained for too few episodes
  - Rendering was accidentally enabled during training (slowed down runs)

### Future Improvements
- **Reward Shaping:** Add bonuses for slowing descent or staying upright to encourage smoother landings.
- **Longer Training:** Additional episodes could push the average reward higher.
- **Model Comparison:** Test alternative algorithms like PPO, A2C, or Dueling DQNs to see how they perform on the same task.
- **Hyperparameter Search:** Explore different learning rates, batch sizes, or epsilon decay schedules.

# 7. Deliverables

### Code Repository
- GitHub URL: [Insert your GitHub repo link here]
- The repository contains:
  - Full DQN implementation
  - Training and testing scripts
  - Saved model weights
  - Plots and evaluation results
  - Notebook report

### Demo Video / GIF
- Demo of trained agent landing:  
  ![Landing Demo](results/dqn_demo.gif)

### References
- [Gym Documentation: LunarLander Environment](https://www.gymlibrary.dev/environments/box2d/lunar_lander/)
- [PyTorch Documentation](https://pytorch.org/docs/stable/index.html)
- [TQDM Progress Bars](https://github.com/tqdm/tqdm)
- [Matplotlib Plotting](https://matplotlib.org/stable/index.html)
- [Pillow for Text Overlay](https://pillow.readthedocs.io/en/stable/)

### Files in `results/` Folder:
- `dqn_training_performance.png` – reward plot
- `dqn_test_results.csv` – test episode scores
- `dqn_demo.gif` – recorded landing