# Baseline DQN Training on Standard Lunar Lander

This notebook trains a Deep Q-Network (DQN) agent on the standard Lunar Lander environment.

**Goal**: Verify that our DQN implementation works before moving to multi-task learning.

**Success Criteria**:
- Agent learns to land successfully (episode reward > 200 under the greedy policy)
- Smoothed training reward trends upward over the run
- The trained model can be saved and reloaded

---

## 1. Setup: Imports and Environment

In [None]:
# Add parent directory to path to import our modules
import sys
sys.path.append('..')

import numpy as np
import torch
from tqdm import tqdm
import matplotlib.pyplot as plt

# Our custom modules
from environments import make_env
from agents import DQNAgent
from utils import ReplayBuffer
from utils.plotting import LivePlotter

# Enable inline plotting for Jupyter
%matplotlib inline

print("✓ All imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

In [None]:
# Create environment
env = make_env('standard', render_mode=None)

# Get environment info
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

print(f"Environment: {env.task_name} Lunar Lander")
print(f"State dimension: {state_dim}")
print(f"Action dimension: {action_dim}")
print("\nActions: 0=nothing, 1=left engine, 2=main engine, 3=right engine")

## 2. Hyperparameters

These hyperparameters control the training process. You can experiment with different values!

In [None]:
# Training hyperparameters
HYPERPARAMS = {
    # Training
    'num_episodes': 1000,           # Total episodes to train
    'batch_size': 64,               # Batch size for training
    'replay_buffer_size': 100000,   # Size of experience replay buffer
    'min_replay_size': 1000,        # Start training after this many experiences
    
    # DQN Agent
    'learning_rate': 5e-4,          # Learning rate (0.0005)
    'gamma': 0.99,                  # Discount factor
    'epsilon_start': 1.0,           # Initial exploration rate
    'epsilon_end': 0.01,            # Final exploration rate
    'epsilon_decay': 0.995,         # Epsilon decay per episode
    'target_update_freq': 10,       # Update target network every N episodes
    
    # Evaluation
    'eval_freq': 50,                # Evaluate every N episodes
    'eval_episodes': 5,             # Number of episodes for evaluation
    
    # Checkpointing
    'save_freq': 100,               # Save model every N episodes
    'model_path': '../results/models/baseline_dqn.pth'
}

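# Device (use GPU if available)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
HYPERPARAMS['device'] = device

# Make sure the output directories exist before anything is saved to them
import os
os.makedirs(os.path.dirname(HYPERPARAMS['model_path']), exist_ok=True)
os.makedirs('../results/plots', exist_ok=True)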

# Print hyperparameters
print("Hyperparameters:")
print("=" * 60)
for key, value in HYPERPARAMS.items():
    print(f"  {key:<25} {value}")
print("=" * 60)
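
It can help to check how the exploration settings interact with `num_episodes`. The sketch below assumes that `agent.decay_epsilon()` multiplies epsilon by `epsilon_decay` once per episode and clips it at `epsilon_end`; the agent's actual implementation is not shown in this notebook, so treat that rule as an assumption. `episodes_until_floor` is a hypothetical helper, not part of our modules.

```python
import math

def episodes_until_floor(eps_start, eps_end, decay):
    """Episodes of multiplicative decay before epsilon first reaches eps_end."""
    # Assumed rule: eps_n = eps_start * decay**n, clipped at eps_end.
    return math.ceil(math.log(eps_end / eps_start) / math.log(decay))

# Values copied from HYPERPARAMS above
n_floor = episodes_until_floor(eps_start=1.0, eps_end=0.01, decay=0.995)
print(f"Epsilon reaches its floor after about {n_floor} episodes")
```

Under that assumption epsilon reaches 0.01 at around episode 919, so with 1000 training episodes only roughly the last 80 run at the final exploration rate. If you shorten training, consider lowering `epsilon_decay` accordingly.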

## 3. Initialize Agent and Replay Buffer

In [None]:
# Create DQN agent
agent = DQNAgent(
    state_dim=state_dim,
    action_dim=action_dim,
    learning_rate=HYPERPARAMS['learning_rate'],
    gamma=HYPERPARAMS['gamma'],
    epsilon_start=HYPERPARAMS['epsilon_start'],
    epsilon_end=HYPERPARAMS['epsilon_end'],
    epsilon_decay=HYPERPARAMS['epsilon_decay'],
    target_update_freq=HYPERPARAMS['target_update_freq'],
    device=device
)

# Create replay buffer
replay_buffer = ReplayBuffer(capacity=HYPERPARAMS['replay_buffer_size'])

print(f"✓ Created: {agent}")
print(f"✓ Created: {replay_buffer}")
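
The `ReplayBuffer` from `utils` is not shown in this notebook. For orientation, here is a minimal sketch of a buffer with the same `push` / `sample` / `len` interface that the training loop relies on. `SimpleReplayBuffer` is a hypothetical stand-in, and the real class may differ (for example, it may return tensors rather than tuples).

```python
import random
from collections import deque

class SimpleReplayBuffer:
    """Fixed-capacity FIFO store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement decorrelates consecutive steps
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Tiny demo: capacity 3, so only the last three of five pushes survive
buf = SimpleReplayBuffer(capacity=3)
for t in range(5):
    buf.push([float(t)], t % 4, 1.0, [float(t + 1)], False)
states, actions, rewards, next_states, dones = buf.sample(2)
print(len(buf), states)
```

The bounded capacity is what `replay_buffer_size` controls: once the buffer is full, each new transition evicts the oldest one.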

## 4. Training Loop with Live Visualization

This cell will train the agent and display live plots of:
- Episode rewards (how well the agent is doing)
- Training loss (how well the network is fitting)
- Epsilon (exploration rate over time)

**Note**: Training may take 10-30 minutes depending on your hardware.

In [None]:
# Create live plotter
plotter = LivePlotter(figsize=(15, 5), num_plots=3)

# Training statistics
episode_rewards = []
episode_losses = []
eval_rewards = []
eval_episodes = []

# Best model tracking
best_eval_reward = -np.inf

print("Starting training...")
print("=" * 60)

# Training loop
for episode in range(HYPERPARAMS['num_episodes']):
    state, info = env.reset()
    episode_reward = 0
    episode_loss = []
    done = False
    truncated = False
    
    # Play one episode
    while not (done or truncated):
        # Select action using epsilon-greedy
        action = agent.select_action(state)
        
        # Take action in environment
        next_state, reward, done, truncated, info = env.step(action)
        
        # Store transition in replay buffer
        replay_buffer.push(state, action, reward, next_state, done)
        
        # Train if we have enough experiences
        if len(replay_buffer) >= HYPERPARAMS['min_replay_size']:
            batch = replay_buffer.sample(HYPERPARAMS['batch_size'])
            loss = agent.update(batch)
            episode_loss.append(loss)
        
        episode_reward += reward
        state = next_state
    
    # Decay epsilon after each episode
    agent.decay_epsilon()
    agent.episodes += 1
    
    # Update target network
    if episode % agent.target_update_freq == 0:
        agent.update_target_network()
    
    # Store statistics
    episode_rewards.append(episode_reward)
    avg_loss = np.mean(episode_loss) if episode_loss else 0
    episode_losses.append(avg_loss)
    
    # Update live plots
    plotter.update_multiple(
        episode=episode,
        reward=episode_reward,
        loss=avg_loss,
        epsilon=agent.epsilon,
        smooth=True
    )
    
    # Evaluation
    if episode % HYPERPARAMS['eval_freq'] == 0 and episode > 0:
        eval_reward_mean = 0
        for _ in range(HYPERPARAMS['eval_episodes']):
            eval_state, _ = env.reset()
            eval_reward = 0
            eval_done = False
            eval_truncated = False
            
            while not (eval_done or eval_truncated):
                eval_action = agent.select_action(eval_state, epsilon=0.0)  # Greedy
                eval_state, r, eval_done, eval_truncated, _ = env.step(eval_action)
                eval_reward += r
            
            eval_reward_mean += eval_reward
        
        eval_reward_mean /= HYPERPARAMS['eval_episodes']
        eval_rewards.append(eval_reward_mean)
        eval_episodes.append(episode)
        
        # Save best model
        if eval_reward_mean > best_eval_reward:
            best_eval_reward = eval_reward_mean
            agent.save(HYPERPARAMS['model_path'])
            print(f"\n🎉 New best! Episode {episode}, Eval Reward: {eval_reward_mean:.2f}")
    
    # Checkpoint
    if episode % HYPERPARAMS['save_freq'] == 0 and episode > 0:
        checkpoint_path = HYPERPARAMS['model_path'].replace('.pth', f'_ep{episode}.pth')
        agent.save(checkpoint_path)

print("\n" + "=" * 60)
print("Training complete!")
print(f"Best evaluation reward: {best_eval_reward:.2f}")

# Save final plots
plotter.save('../results/plots/baseline_training.png')
plotter.close()
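
The loop above delegates the learning step to `agent.update(batch)`, whose body lives in `agents` and is not shown here. A standard DQN update regresses Q(s, a) toward a one-step target computed with the target network; the NumPy sketch below illustrates that target under this assumption. `td_targets` is a hypothetical helper, and the Q-values are made-up numbers for illustration.

```python
import numpy as np

def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """One-step DQN targets: r + gamma * max_a' Q_target(s', a'), or just r if terminal."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    # Greedy value of the next state according to the (frozen) target network
    best_next = np.max(next_q_target, axis=1)
    # (1 - done) removes the bootstrap term for terminal transitions
    return rewards + gamma * (1.0 - dones) * best_next

# Two transitions, four actions each; the second transition is terminal
next_q = np.array([[1.0, 2.0, 0.5, 0.0],
                   [3.0, 1.0, 0.0, 2.0]])
targets = td_targets(rewards=[1.0, -100.0], next_q_target=next_q, dones=[False, True])
print(targets)
```

Note that the training loop stores `done` but not `truncated` in the replay buffer, so an episode cut off by the time limit is still treated as non-terminal and its last transition bootstraps from the next state.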

## 5. Evaluation: Test the Trained Agent

Let's evaluate the trained agent over multiple episodes and see how well it performs.

In [None]:
# Load best model
agent.load(HYPERPARAMS['model_path'])

# Evaluate
num_eval_episodes = 100
eval_rewards_final = []

for i in tqdm(range(num_eval_episodes), desc="Evaluating"):
    state, _ = env.reset()
    episode_reward = 0
    done = False
    truncated = False
    
    while not (done or truncated):
        action = agent.select_action(state, epsilon=0.0)  # Greedy policy
        state, reward, done, truncated, _ = env.step(action)
        episode_reward += reward
    
    eval_rewards_final.append(episode_reward)

# Print statistics
print("\n" + "=" * 60)
print("Final Evaluation Results")
print("=" * 60)
print(f"Mean reward: {np.mean(eval_rewards_final):.2f} ± {np.std(eval_rewards_final):.2f}")
print(f"Min reward: {np.min(eval_rewards_final):.2f}")
print(f"Max reward: {np.max(eval_rewards_final):.2f}")
print(f"\nSuccessful landings (reward > 200): {sum(r > 200 for r in eval_rewards_final)}/{num_eval_episodes}")

# Plot distribution
plt.figure(figsize=(10, 5))
plt.hist(eval_rewards_final, bins=20, edgecolor='black', alpha=0.7)
plt.axvline(200, color='red', linestyle='--', linewidth=2, label='Success threshold')
plt.xlabel('Episode Reward', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Evaluation Rewards', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../results/plots/baseline_eval_distribution.png', dpi=150)
plt.show()
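
The `±` value printed above is the standard deviation of individual episode rewards, which describes how much landings vary, not how precisely the mean is known. When comparing the mean against the 200 threshold, the standard error of the mean is often more informative. Here is a small sketch; `summarize_returns` is a hypothetical helper and the demo rewards are made up.

```python
import numpy as np

def summarize_returns(returns, threshold=200.0):
    """Mean, standard error of the mean, and fraction of episodes above threshold."""
    returns = np.asarray(returns, dtype=float)
    mean = float(returns.mean())
    # Sample standard deviation (ddof=1) divided by sqrt(n)
    sem = float(returns.std(ddof=1) / np.sqrt(len(returns)))
    success_rate = float(np.mean(returns > threshold))
    return {"mean": mean, "sem": sem, "success_rate": success_rate}

# Illustrative numbers only, not results from this notebook
demo = summarize_returns([250.0, 180.0, 220.0, 260.0])
print(demo)
```

In the notebook you could call `summarize_returns(eval_rewards_final)` after the evaluation loop.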

## 6. Visualization: Watch the Trained Agent

Let's visualize the agent playing! (This will open a pygame window)

In [None]:
# Create environment with rendering
env_render = make_env('standard', render_mode='human')

# Play a few episodes
num_render_episodes = 3

for episode in range(num_render_episodes):
    state, _ = env_render.reset()
    episode_reward = 0
    done = False
    truncated = False
    
    print(f"\nPlaying episode {episode + 1}/{num_render_episodes}...")
    
    while not (done or truncated):
        action = agent.select_action(state, epsilon=0.0)  # Greedy
        state, reward, done, truncated, _ = env_render.step(action)
        episode_reward += reward
    
    print(f"  Episode reward: {episode_reward:.2f}")

env_render.close()
print("\nVisualization complete!")

## 7. Analysis: Training Curves

Let's create a comprehensive plot of the training progress.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

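# Smoothing helper
def smooth(data, window=20):
    """Trailing moving average, NaN-padded so it lines up with the raw episode axis."""
    data = np.asarray(data, dtype=float)
    if len(data) < window:
        return data
    weights = np.ones(window) / window
    averaged = np.convolve(data, weights, mode='valid')
    # Without padding, the 'valid' output is shorter and would plot shifted left
    return np.concatenate([np.full(window - 1, np.nan), averaged])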

# Plot 1: Episode Rewards
axes[0, 0].plot(episode_rewards, alpha=0.3, color='green', linewidth=1, label='Raw')
axes[0, 0].plot(smooth(episode_rewards, 20), color='green', linewidth=2, label='Smoothed (window=20)')
axes[0, 0].axhline(200, color='red', linestyle='--', label='Success threshold')
axes[0, 0].set_title('Episode Rewards During Training', fontweight='bold')
axes[0, 0].set_xlabel('Episode')
axes[0, 0].set_ylabel('Reward')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Training Loss
axes[0, 1].plot(episode_losses, alpha=0.3, color='red', linewidth=1, label='Raw')
axes[0, 1].plot(smooth(episode_losses, 20), color='red', linewidth=2, label='Smoothed')
axes[0, 1].set_title('Training Loss', fontweight='bold')
axes[0, 1].set_xlabel('Episode')
axes[0, 1].set_ylabel('Loss')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Evaluation Rewards
if eval_rewards:
    axes[1, 0].plot(eval_episodes, eval_rewards, marker='o', color='blue', linewidth=2, markersize=6)
    axes[1, 0].axhline(200, color='red', linestyle='--', label='Success threshold')
    axes[1, 0].set_title('Evaluation Rewards', fontweight='bold')
    axes[1, 0].set_xlabel('Episode')
    axes[1, 0].set_ylabel(f"Mean Reward ({HYPERPARAMS['eval_episodes']} episodes)")
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Final Evaluation Distribution
axes[1, 1].hist(eval_rewards_final, bins=20, edgecolor='black', alpha=0.7, color='purple')
axes[1, 1].axvline(200, color='red', linestyle='--', linewidth=2, label='Success threshold')
axes[1, 1].axvline(np.mean(eval_rewards_final), color='blue', linestyle='-', linewidth=2, label=f'Mean: {np.mean(eval_rewards_final):.1f}')
axes[1, 1].set_title('Final Evaluation Distribution', fontweight='bold')
axes[1, 1].set_xlabel('Reward')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('../results/plots/baseline_full_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Analysis complete! Plots saved to results/plots/")

## Summary

### Key Takeaways:
1. **DQN can learn** to land in Lunar Lander from scratch; check the evaluation results above against the reward > 200 threshold
2. **Exploration-exploitation tradeoff**: Epsilon decays from 1.0 (fully random) to 0.01 (almost always greedy)
3. **Experience replay** breaks temporal correlations between consecutive transitions, stabilizing learning
4. **Target network** keeps Q-targets fixed between updates, which damps oscillations

### Next Steps:
- **Notebook 2**: Multi-task baselines (Independent vs Shared DQN)
- **Notebook 3**: PCGrad (gradient conflict resolution)
- **Notebook 4**: VarShare (variational adapters)
- **Notebook 5**: Comprehensive analysis and comparison

---

**Congratulations!** You've successfully trained a DQN agent and are ready for multi-task learning!