# Lunar Lander 

In this notebook, we will train an agent using a Deep Q-Network (DQN) to land a spacecraft safely in the LunarLander-v3 environment.


## Table of Contents
- [1. Project Overview](#1)
- [2. About the Environment](#2)
- [3. Imports and Setup](#3)
- [4. Build the Q-Network](#4)
- [5. Build the Replay Buffer](#5)
- [6. Build the DQN Agent](#6)
- [7. Train the Agent](#7)
- [8. Main Training Execution](#8)
- [9. Visualize Training Progress](#9)
- [10. Results, Interpretation, and Conclusion](#10)
- [11. References](#11)

---

<a id='1'></a>
## 1. Project Overview

Here, we’ll build an agent that learns how to land a spacecraft safely on the moon.
The goal is to test how well a Deep Q-Network (DQN) can figure out how to land using only rewards and trial-and-error.

---

<a id='2'></a>

## 2. About the Environment

The LunarLander-v3 environment simulates a spacecraft trying to land safely at a specific location.

- **Action Space**: 4 discrete actions
  - 0: Do nothing
  - 1: Fire left engine
  - 2: Fire main engine
  - 3: Fire right engine
- **State Space**: 8 continuous values describing position, velocity, angle, and leg contact.
- **Rewards**:
  - Positive rewards for moving closer to the pad, slowing down, and touching down gently.
  - Penalties for fuel use, tilting, crashing.
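- **Success**: An episode counts as solved at 200+ points. In this notebook we track the average score over the last 100 episodes against that threshold.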
- We use the default settings: discrete actions, no wind, gravity at -10.0.
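
To make the 8 state values concrete, here is a small sketch that labels one observation vector. The field order follows the Gymnasium documentation for Lunar Lander. The names `LanderState`, `ACTION_NAMES`, and `describe` are made up for this illustration, and the sample numbers are invented rather than taken from a real episode.

```python
from collections import namedtuple

# Field order as documented for Lunar Lander: position, velocity, angle,
# angular velocity, then one contact flag per leg.
LanderState = namedtuple(
    "LanderState",
    ["x", "y", "vx", "vy", "angle", "angular_velocity", "left_leg", "right_leg"],
)

# Action ids as listed above
ACTION_NAMES = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}

def describe(observation):
    """Attach names to a raw 8-value observation (hypothetical helper)."""
    if len(observation) != 8:
        raise ValueError("expected 8 values")
    return LanderState(*observation)

sample = [0.1, 1.2, -0.05, -0.4, 0.02, 0.0, 0.0, 0.0]  # invented values
state = describe(sample)
print(state.y, state.vy, bool(state.left_leg))
```

Named fields like these are only a reading aid. The agent itself works on the raw 8-value vector.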
  
---

<a id='3'></a>

## 3. Imports and Setup
Time to gather our supplies and tools for this project.
We load the required packages and check whether a GPU is available for faster training.

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque, namedtuple

# Make sure we run on GPU if available (otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

---
<a id='4'></a>

## 4. Build the Q-Network

Think of this as building the “brain” of our lander. The network:
- Looks at the current situation (position, speed, angle, leg contact)
- Outputs one estimated value (Q-value) per action, so the highest-valued action is its best guess for landing safely
- Gets better over time by learning from rewards

We’ll build a simple neural network, just two hidden layers of 64 units each, that can learn to make smart decisions.

---

In [None]:
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
    
    def forward(self, x):
        return self.layers(x)
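
To get a feel for how small this network is, the sketch below counts its trainable parameters by hand. It assumes the sizes used in this notebook (8 state inputs, two hidden layers of 64, 4 actions). `mlp_param_count` is a hypothetical helper, not part of PyTorch.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

# 8 state inputs -> 64 -> 64 -> 4 action values
n_params = mlp_param_count([8, 64, 64, 4])
print(n_params)  # 4996
```

At about 5,000 parameters, the model is tiny by deep learning standards.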

---
<a id='5'></a>
## 5. Build the Replay Buffer
This is our agent’s memory.
It saves what happened (good and bad) so the agent can go back and learn from it later.

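Without memory, the agent would forget everything immediately after each move. Sampling random batches from memory also breaks up the correlation between consecutive steps, which helps keep training stable.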

In [None]:
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
        self.experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(self.experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states      = torch.from_numpy(np.vstack([e.state for e in batch])).float().to(device)
        actions     = torch.from_numpy(np.vstack([e.action for e in batch])).long().to(device)
        rewards     = torch.from_numpy(np.vstack([e.reward for e in batch])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in batch])).float().to(device)
        dones       = torch.from_numpy(np.vstack([e.done for e in batch]).astype(np.uint8)).float().to(device)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
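
The two standard-library pieces the buffer relies on can be tried in isolation. This is a minimal sketch with a deliberately tiny capacity. The names `memory` and `batch` are just for illustration.

```python
import random
from collections import deque

memory = deque(maxlen=3)      # tiny capacity for demonstration
for step in range(5):
    memory.append(step)       # once full, the oldest entry is dropped

print(list(memory))           # [2, 3, 4]

random.seed(0)
batch = random.sample(memory, 2)   # uniform sampling without replacement
print(len(batch), len(set(batch)))
```

With `maxlen` set, the deque silently discards the oldest experience, so the buffer always holds the most recent transitions up to its capacity.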

---
<a id='6'></a>
## 6. Build the DQN Agent
Now we pull everything together!
This is the body and mind of our agent.

The agent will:
- Use the Q-network (brain) to pick actions
- Use the replay buffer (memory) to save experiences
- Learn from past mistakes to improve over time
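
"Learning" here means nudging the predicted value of the action taken toward a one-step target: the reward plus the discounted best value of the next state, with the future term dropped when the episode has ended. The sketch below works through that arithmetic for a single transition in plain Python. `td_target` and all numbers are invented for illustration.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target for a single transition."""
    if done:
        return reward                      # no future value after a terminal state
    return reward + gamma * max(next_q_values)

# Invented numbers for illustration
next_q = [0.2, 3.0, -1.0, 0.7]             # target-network values for the 4 actions
y_mid = td_target(reward=1.5, next_q_values=next_q, done=False)
y_end = td_target(reward=-100.0, next_q_values=next_q, done=True)
print(y_mid, y_end)
```

The first target comes out near 4.47 (1.5 + 0.99 × 3.0). The second is just the crash penalty of -100.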

In [None]:
class DQNAgent:
    def __init__(self, state_dim, action_dim, hidden_dim=64, lr=1e-4, gamma=0.99, tau=1e-3, batch_size=64, update_every=5):
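        self.q_eval = QNetwork(state_dim, action_dim, hidden_dim).to(device)
        self.q_target = QNetwork(state_dim, action_dim, hidden_dim).to(device)
        # Start the target network as an exact copy of the online network
        self.q_target.load_state_dict(self.q_eval.state_dict())
        self.optimizer = optim.Adam(self.q_eval.parameters(), lr=lr)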
        
        self.gamma = gamma
        self.tau = tau
        self.batch_size = batch_size
        self.update_every = update_every
        self.step_counter = 0

        self.buffer = ReplayBuffer()
        self.loss_fn = nn.MSELoss()

    def act(self, state, epsilon=0.1):
        # Epsilon-greedy: sometimes explore, sometimes exploit
        if random.random() < epsilon:
            return random.randrange(self.q_eval.layers[-1].out_features)
        else:
            state_t = torch.tensor(state).float().unsqueeze(0).to(device)
            with torch.no_grad():
                return int(torch.argmax(self.q_eval(state_t)).item())

    def step(self, state, action, reward, next_state, done):
        # Save experience into memory
        self.buffer.push(state, action, reward, next_state, done)

        # Learn every few steps
        self.step_counter = (self.step_counter + 1) % self.update_every
        if self.step_counter == 0 and len(self.buffer) >= self.batch_size:
            self.learn()

    def learn(self):
        # Sample a batch from memory
        states, actions, rewards, next_states, dones = self.buffer.sample(self.batch_size)

        # Calculate targets
        q_targets_next = self.q_target(next_states).detach().max(dim=1, keepdim=True)[0]
        q_targets = rewards + (self.gamma * q_targets_next * (1 - dones))

        # Calculate predictions
        q_expected = self.q_eval(states).gather(1, actions)

        # Minimize the loss
        loss = self.loss_fn(q_expected, q_targets)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Slowly update the target network
        self.soft_update()

    def soft_update(self):
        for target_param, eval_param in zip(self.q_target.parameters(), self.q_eval.parameters()):
            target_param.data.copy_(self.tau * eval_param.data + (1.0 - self.tau) * target_param.data)
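
To see how slowly the target network follows the online network with `tau=1e-3`, here is the same soft-update rule applied to a single scalar weight. This is a simplified sketch: the online weight is held fixed, which does not happen in real training, and the starting values are invented.

```python
import math

def soft_update_scalar(target, online, tau=1e-3):
    """Polyak averaging for one scalar weight."""
    return tau * online + (1.0 - tau) * target

target, online = 0.0, 1.0     # invented starting values
for _ in range(1000):
    target = soft_update_scalar(target, online)

print(round(target, 3))       # close to 1 - (1 - tau)**1000

# Number of updates needed to close half of the remaining gap
half_life = math.log(0.5) / math.log(1.0 - 1e-3)
print(round(half_life))
```

The half-life is on the order of 700 updates. Since `learn()` runs once every `update_every=5` environment steps, that corresponds to a few thousand steps in the environment.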

---
<a id='7'></a>
## 7. Train the Agent
This is like sending the agent to flight school.
It practices landing over and over, learns from crashes, and slowly gets better.

In [None]:
def train_dqn(env, agent, n_episodes=1500, max_steps=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    scores = []
    epsilon = eps_start

    for ep in range(1, n_episodes + 1):
        state, _ = env.reset(seed=ep)
        total_reward = 0

        for t in range(max_steps):
            action = agent.act(state, epsilon)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Store only true termination, so a time-limit cutoff still bootstraps from next_state
            agent.step(state, action, reward, next_state, terminated)

            state = next_state
            total_reward += reward

            if done:
                break

        scores.append(total_reward)
        epsilon = max(eps_end, eps_decay * epsilon)

        if ep % 100 == 0:
            avg_score = np.mean(scores[-100:])
            print(f"Episode {ep} | Average (last 100 episodes): {avg_score:.2f}")

    return scores
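
The exploration schedule is easy to reason about on its own. The sketch below counts how many per-episode decays it takes for epsilon to reach its floor, using the same defaults as `train_dqn`. `decays_until_floor` is a hypothetical helper written for this illustration.

```python
def decays_until_floor(eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Count multiplicative decay steps before epsilon first hits eps_end."""
    epsilon, n = eps_start, 0
    while epsilon > eps_end:
        epsilon = max(eps_end, eps_decay * epsilon)   # same update as the training loop
        n += 1
    return n

n_decays = decays_until_floor()
print(n_decays)
```

With the defaults, epsilon reaches its floor after a little over 900 episodes, so the remaining episodes of a 1,500-episode run are almost entirely greedy.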

---
<a id='8'></a>
## 8. Main Training Execution
Here’s where we actually start the flight practice.
We’ll let the agent fly 1,500 practice episodes (the default `n_episodes`) and keep track of how it’s doing.

In [None]:
# Create environment
env = gym.make("LunarLander-v3")
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

# Create agent
agent = DQNAgent(state_dim, action_dim)

# Train agent
scores = train_dqn(env, agent)

# Save the trained model
torch.save(agent.q_eval.state_dict(), "dqn_lander.pth")

# Close environment after training
env.close()

---
<a id='9'></a>

## 9. Visualize Training Progress

We’ll plot a simple graph to see how the agent’s landing skills improve over time.

In [None]:
plt.figure(figsize=(10,6))
plt.plot(np.arange(len(scores)), scores)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Training Performance Over Episodes')
plt.grid(True)
plt.savefig("training_curve.png")
plt.show()

print("Training complete!")
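
Raw episode scores are noisy, so a trailing average is often easier to read against the 200-point threshold. Below is a minimal sketch of such a smoother in plain Python. `moving_average` is a hypothetical helper and the demo scores are invented.

```python
def moving_average(values, window=100):
    """Mean of the trailing `window` values at each position."""
    averages = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            running_sum -= values[i - window]   # drop the value that left the window
        averages.append(running_sum / min(i + 1, window))
    return averages

demo_scores = [-200.0, -100.0, 0.0, 100.0, 200.0]   # invented values
print(moving_average(demo_scores, window=2))
# [-200.0, -150.0, -50.0, 50.0, 150.0]
```

Calling `plt.plot(moving_average(scores))` on the real training scores, drawn over the raw curve, gives the smoothed view.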

---
<a id='10'></a>
## 10. Results, Interpretation, and Conclusion


After training the DQN agent for 1,500 episodes, the agent's total rewards improved over time. Although the agent did not consistently reach the success threshold of 200 points, the upward trend in rewards shows that it learned to land more safely and efficiently through practice.

The agent started off crashing most of the time, but with experience replay and Q-learning, it gradually figured out how to slow down, stay balanced, and land closer to the pad. While there’s still room for better performance, the learning behavior and improvement curve clearly show that the model is working.

With more training episodes, fine-tuned hyperparameters, or additional exploration strategies, the agent could likely achieve stable landings and even better scores.

Overall, this project shows that Deep Q-Networks are a powerful approach to teaching an agent how to solve complex control tasks like landing a spacecraft.


---
<a id='11'></a>
## 11. References
- [Gymnasium LunarLander-v3 Documentation](https://gymnasium.farama.org/environments/box2d/lunar_lander/)
- [Deep Q-Learning Paper (Mnih et al., 2013)](https://arxiv.org/abs/1312.5602)
- [PyTorch Documentation](https://pytorch.org/docs/stable/index.html)
- [Matplotlib Documentation](https://matplotlib.org/stable/contents.html)
- [Gymnasium GitHub Repository](https://github.com/Farama-Foundation/Gymnasium)
- [Original Lunar Lander Reinforcement Learning Guide by sokistar24](https://github.com/sokistar24/Deep_Reinforcement_learning)
