# Deep Q-Network (DQN) Demo on LunarLander-v2

This notebook demonstrates training and evaluation of a vanilla Deep Q-Network (DQN) agent on the OpenAI Gym **LunarLander-v2** environment.

We will cover:
- Environment setup and agent architecture
- Training the DQN agent
- Evaluating agent performance
- Plotting results
- Key takeaways and conclusion


In [None]:
# Imports and setup
import gym
import torch
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import random

# Project-local helpers: network definition, training loop, and evaluation routine
from utils import QNetwork, training_loop, evaluate_agent

# Fix seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)


## Environment and Hyperparameters

We initialize the LunarLander environment and define the key training hyperparameters.
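
To get a feel for the exploration settings, the sketch below computes how epsilon evolves under the decay values used in this notebook (`epsilon_decay = 0.995`, `epsilon_min = 0.01`, `n_episodes = 500`). It assumes epsilon starts at 1.0 and is multiplied by the decay factor once per episode; the actual schedule lives in `utils.training_loop` and may differ. `epsilon_at` and `episodes_to_floor` are hypothetical helper names.

```python
import math

def epsilon_at(episode, eps_start=1.0, eps_min=0.01, eps_decay=0.995):
    """Epsilon after `episode` multiplicative decay steps, clipped at eps_min."""
    return max(eps_min, eps_start * eps_decay ** episode)

def episodes_to_floor(eps_start=1.0, eps_min=0.01, eps_decay=0.995):
    """Smallest number of decay steps after which epsilon has reached eps_min."""
    return math.ceil(math.log(eps_min / eps_start) / math.log(eps_decay))

print(f"epsilon after 500 episodes: {epsilon_at(500):.3f}")
print(f"episodes needed to reach the floor: {episodes_to_floor()}")
```

Under these assumptions epsilon is still about 0.08 after 500 episodes and would need 919 episodes to reach 0.01, so the floor is never hit within this training budget.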


In [None]:
env = gym.make("LunarLander-v2")
# Gym 0.26+ returns (obs, info) from reset(); older releases return obs alone.
result = env.reset()
if isinstance(result, tuple) and len(result) == 2:
    obs, _ = result
else:
    obs = result

input_dim = env.observation_space.shape[0]
output_dim = env.action_space.n

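# Hyperparameters
replay_max = 10000         # maximum number of transitions kept in the replay buffer
learning_rate = 1e-3       # Adam step size
n_episodes = 500           # number of training episodes
epsilon_min, epsilon_decay = 0.01, 0.995  # exploration floor and decay factor
batch_size = 128           # minibatch size
discounted_factor = 0.99   # discount factor (gamma) for future rewards

replay_buffer = deque(maxlen=replay_max)  # oldest transitions are dropped once full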

# Initialize networks, loss, and optimizer
Q_network = QNetwork(input_dim, output_dim)
target_network = QNetwork(input_dim, output_dim)
target_network.load_state_dict(Q_network.state_dict())
target_network.eval()
loss_fn = torch.nn.SmoothL1Loss()
optimizer = torch.optim.Adam(Q_network.parameters(), lr=learning_rate)
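
The `replay_buffer` deque above only defines storage. As a rough sketch of how such a buffer is typically filled and sampled (the logic inside `utils.training_loop` is not shown in this notebook and may differ), transitions can be appended as tuples and drawn uniformly at random. The helper names `store` and `sample_batch` and the toy transitions are hypothetical.

```python
import random
from collections import deque

buffer = deque(maxlen=5)  # tiny capacity to make eviction visible

def store(buffer, state, action, reward, next_state, done):
    """Append one transition; the deque evicts the oldest entry when full."""
    buffer.append((state, action, reward, next_state, done))

def sample_batch(buffer, batch_size):
    """Uniformly sample a minibatch and regroup it field by field."""
    batch = random.sample(list(buffer), batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    return states, actions, rewards, next_states, dones

# Toy transitions: the state is just the step index.
for t in range(8):
    store(buffer, t, t % 4, 1.0, t + 1, t == 7)

states, actions, rewards, next_states, dones = sample_batch(buffer, 3)
print(len(buffer), sorted(states))
```

Only the five most recent transitions survive here, and every sampled state comes from that window. Sampling uniformly from past experience breaks up the correlation between consecutive steps.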


## Training the Agent

We run the training loop for `n_episodes` episodes. The target network's weights are periodically synchronized with the online network, which keeps the bootstrapped targets fixed between updates and stabilizes learning.
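
As a rough illustration of what an update with a target network can look like, the sketch below computes the one-step TD target from a frozen target network and copies the online weights across every few steps. It is not the code in `utils.training_loop`: the stand-in `torch.nn.Linear` networks, the random minibatch, and `sync_every` are assumptions made for the example.

```python
import torch

torch.manual_seed(0)
state_dim, n_actions, gamma = 8, 4, 0.99
sync_every = 10  # assumed sync period, in gradient steps

# Stand-ins for the online and target Q-networks.
q_net = torch.nn.Linear(state_dim, n_actions)
target_net = torch.nn.Linear(state_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())
target_net.eval()

loss_fn = torch.nn.SmoothL1Loss()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

for step in range(1, 21):
    # Random minibatch in place of samples from a replay buffer.
    states = torch.randn(32, state_dim)
    actions = torch.randint(0, n_actions, (32, 1))
    rewards = torch.randn(32)
    next_states = torch.randn(32, state_dim)
    dones = torch.randint(0, 2, (32,)).float()

    # Q(s, a) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions).squeeze(1)

    # TD target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * next_q * (1.0 - dones)

    loss = loss_fn(q_sa, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Hard update: copy the online weights into the target network.
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```

Between syncs the target network lags the online network, so the regression target does not shift with every gradient step.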


In [None]:
rewards = training_loop(env, Q_network, target_network, loss_fn, optimizer, discounted_factor,
                        n_episodes, epsilon_decay=epsilon_decay, epsilon_min=epsilon_min)


## Training Results

The plot below shows the total reward per episode over the training process.
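
Per-episode rewards in DQN training are usually noisy, so a moving average is often plotted on top of the raw curve. A minimal sketch using `np.convolve` follows; the 50-episode window and the synthetic `rewards` array are arbitrary choices for illustration, and `moving_average` is a hypothetical helper.

```python
import numpy as np

def moving_average(values, window):
    """Mean over each full window of `window` consecutive values."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Synthetic stand-in for the `rewards` list returned by training.
rng = np.random.default_rng(0)
rewards = np.linspace(-200, 200, 500) + rng.normal(0, 50, 500)

smoothed = moving_average(rewards, window=50)
print(len(rewards), len(smoothed))
```

`smoothed[i]` averages episodes `i` through `i + window - 1`, so plotting it against `range(window - 1, len(rewards))` aligns it with the raw curve.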


In [None]:
plt.plot(rewards)
plt.title("Episode Rewards during Training")
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.grid(True)
plt.show()


## Evaluating the Trained Agent

Now that training is complete, we evaluate the trained model over multiple episodes without exploration (greedy policy).
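
A greedy rollout can be sketched as follows. This is not the `evaluate_agent` from `utils`, whose implementation is not shown; `greedy_episode`, the stub environment `CountdownEnv`, and the untrained stand-in network are assumptions used to keep the example self-contained. The step handling accepts both the 4-tuple and 5-tuple return conventions.

```python
import torch

class CountdownEnv:
    """Stub environment: 8-dim observation, reward 1 per step, ends after 5 steps."""
    def reset(self):
        self.t = 0
        return [0.0] * 8, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= 5
        return [float(self.t)] * 8, 1.0, terminated, False, {}

def greedy_episode(env, q_net):
    """Run one episode, always taking the argmax-Q action; return total reward."""
    result = env.reset()
    obs = result[0] if isinstance(result, tuple) else result
    total, done = 0.0, False
    q_net.eval()
    while not done:
        with torch.no_grad():  # no gradients needed during evaluation
            q_values = q_net(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
        action = int(q_values.argmax(dim=1).item())
        step = env.step(action)
        if len(step) == 5:   # obs, reward, terminated, truncated, info
            obs, reward, terminated, truncated, _ = step
            done = terminated or truncated
        else:                # obs, reward, done, info
            obs, reward, done, _ = step
        total += reward
    return total

q_net = torch.nn.Linear(8, 4)  # untrained stand-in for QNetwork
returns = [greedy_episode(CountdownEnv(), q_net) for _ in range(3)]
print(sum(returns) / len(returns))
```

Because no random actions are taken, the result depends only on the network and the environment's own randomness, which makes it a cleaner measure of the learned policy than the training rewards.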


In [None]:
n_eval_episodes = 20
avg_reward = evaluate_agent(env, Q_network, episodes=n_eval_episodes, render=False)
print(f"Average Reward over {n_eval_episodes} evaluation episodes: {avg_reward:.2f}")


## Conclusion

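- The vanilla DQN agent learned to solve the LunarLander-v2 task, which is conventionally considered solved at an average reward of 200 or more per episode; the evaluation average printed above can be checked against that threshold.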
- The use of a replay buffer and target network helps stabilize training.
- Further improvements could involve experimenting with Double DQN, prioritized replay, or dueling networks for better performance.
- This notebook serves as a foundation for understanding and extending deep reinforcement learning techniques.
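
To make the Double DQN suggestion concrete, the sketch below contrasts the two targets. Vanilla DQN lets the target network both pick and score the next action, while Double DQN picks the action with the online network and scores it with the target network. The stand-in linear networks and the random batch are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
gamma = 0.99
q_net = torch.nn.Linear(8, 4)       # stand-in online network
target_net = torch.nn.Linear(8, 4)  # stand-in target network (different weights)

rewards = torch.randn(16)
next_states = torch.randn(16, 8)
dones = torch.zeros(16)

with torch.no_grad():
    target_q = target_net(next_states)

    # Vanilla DQN: the target network selects and evaluates the next action.
    vanilla = rewards + gamma * target_q.max(dim=1).values * (1 - dones)

    # Double DQN: the online network selects, the target network evaluates.
    best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
    double = rewards + gamma * target_q.gather(1, best_actions).squeeze(1) * (1 - dones)

print(bool((double <= vanilla).all()))
```

The evaluated value can never exceed the maximum over actions, so the Double DQN target is never larger than the vanilla one. This is how the variant counteracts overestimated Q-values.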

Feel free to explore the code, modify hyperparameters, or try out the Double DQN variant for enhanced stability and performance!
