# ARI 510 Lab 4, Lunar Lander

Ryan Smith, 12/3/24

Reinforcement learning project for CartPole and Lunar Lander. This notebook covers the Lunar Lander agent.

# Imports

PyTorch provides the DQN and its training utilities. Gymnasium (the maintained successor to OpenAI Gym) provides the CartPole and Lunar Lander environments, NumPy handles the numerical work, and imageio with ffmpeg captures the videos. From the Python standard library, deque from the collections module serves as the replay buffer, and the random module handles sampling and exploration.

In [1]:
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
from collections import deque
import imageio

# Define the DQN Neural Network

This code defines the architecture of the Deep Q-Network (DQN) used to approximate the Q-value function. The DQN class inherits from nn.Module, the base class for all neural networks in PyTorch. The `__init__` method builds a fully connected network (nn.Linear layers) with two hidden layers of 128 neurons each, and a ReLU activation (nn.ReLU) after each hidden layer. The input dimension (input_dim) is the size of the state vector, and the output dimension (output_dim) is the number of possible actions. The forward method takes a state (x), passes it through the layer sequence (self.fc), and returns one estimated Q-value per action. The agent uses these values to select actions while interacting with the environment.

In [2]:
# Define the DQN neural network
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )

    def forward(self, x):
        return self.fc(x)
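
As a rough check on the size of this architecture, the parameter count can be worked out by hand. The sketch below assumes input_dim = 8 and output_dim = 4, the observation and action sizes that Gymnasium documents for the discrete Lunar Lander; the helper name count_linear_params is hypothetical and not part of PyTorch.

```python
def count_linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Assumed sizes for Lunar Lander (not read from a live environment here).
input_dim, hidden, output_dim = 8, 128, 4

# The three nn.Linear layers of the DQN class above.
layer_sizes = [(input_dim, hidden), (hidden, hidden), (hidden, output_dim)]
total_params = sum(count_linear_params(i, o) for i, o in layer_sizes)
print(total_params)  # 18180
```

Under these assumptions the network is small, with most of its parameters in the 128-by-128 hidden layer.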

# Initialize Environment

This section sets up the Lunar Lander environment for the reinforcement learning task. env = gym.make("LunarLander-v3", render_mode='rgb_array') creates the environment through the Gymnasium library; the -v3 suffix is the version of the environment. The render_mode='rgb_array' argument makes env.render() return RGB image arrays, which are used later to build videos of the agent's behavior. state_dim = env.observation_space.shape[0] is the number of values that describe the environment's state (lander position, velocity, angle, angular velocity, and leg contact flags). action_dim = env.action_space.n is the number of discrete actions available to the agent (do nothing, fire left orientation engine, fire main engine, fire right orientation engine). These two sizes set the input and output widths of the DQN model and the range of actions used during action selection.

In [3]:
# Initialize environment
env = gym.make("LunarLander-v3", render_mode='rgb_array')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
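
For reference, the Gymnasium documentation describes the discrete Lunar Lander as having four actions and an eight-value observation. The lookup tables below are an illustrative sketch of that layout; the names ACTION_NAMES, OBSERVATION_FIELDS, and describe_action are hypothetical and are not part of Gymnasium.

```python
# Action indices as documented for the discrete Lunar Lander.
ACTION_NAMES = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}

# Order of the values in the observation vector.
OBSERVATION_FIELDS = [
    "x position", "y position", "x velocity", "y velocity",
    "angle", "angular velocity", "left leg contact", "right leg contact",
]

def describe_action(action):
    """Return a readable name for an action index (hypothetical helper)."""
    return ACTION_NAMES.get(int(action), "unknown action")

print(describe_action(2))  # fire main engine
```

A table like this can make logged actions and the state[:2] position slice used later in the notebook easier to read.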

# Initialize Model

This code snippet initializes the core components of the DQN algorithm. model is the main Deep Q-Network, which is responsible for estimating the Q-values of different actions in given states. target_model is a copy of the main model, used to provide stable target Q-values during training.  target_model.load_state_dict(model.state_dict()) synchronizes the weights of the target model with the main model initially.  optimizer = optim.Adam(model.parameters(), lr=1e-3) creates an Adam optimizer to update the weights of the main model during training, using a learning rate of 0.001. replay_buffer = deque(maxlen=100000) creates a replay buffer, which is a data structure that stores the agent's experiences (state, action, reward, next state, done) for later sampling during training. This helps break temporal correlations in the data and improves learning stability.

In [4]:
# the model and target model
model = DQN(state_dim, action_dim)
target_model = DQN(state_dim, action_dim)
target_model.load_state_dict(model.state_dict())
optimizer = optim.Adam(model.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100000)
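
The behavior of a bounded deque as a replay buffer can be seen on a toy example. This is a minimal sketch with a capacity of 3 instead of 100000 and placeholder integers instead of real states; the variable names are chosen only for illustration.

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # tiny capacity so eviction is easy to see
for step in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.append((step, 0, 1.0, step + 1, False))

# Once the buffer is full, each append drops the oldest transition.
stored_states = [transition[0] for transition in buffer]
print(stored_states)  # [2, 3, 4]

random.seed(0)  # seeded only to make the sketch repeatable
batch = random.sample(buffer, 2)  # uniform sample without replacement
states, actions, rewards, next_states, dones = zip(*batch)
```

The zip(*batch) call regroups the sampled transitions into one tuple per field, which is the same unpacking the training function relies on.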

# Hyperparameters

This section defines the parameters that govern the DQN agent's learning. gamma is the discount factor: it sets how much the agent values future rewards relative to immediate ones. batch_size is the number of experiences sampled from the replay buffer for each training step. epsilon is the probability of taking a random action, so higher values mean more exploration. epsilon_min is the floor for epsilon, so the agent keeps exploring even late in training. epsilon_decay is the factor epsilon is multiplied by after each episode. update_target_every is the number of episodes between copies of the main network's weights into the target network. Finally, rolling_window is the number of recent episodes used for the rolling average metrics.

In [None]:
# Hyperparameters
gamma = 0.99
batch_size = 32
epsilon = 1.0
epsilon_min = 0.1
epsilon_decay = 0.995
update_target_every = 100
rolling_window = 250
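
To get a feel for the exploration schedule, the short sketch below counts how many per-episode decays it takes for epsilon to reach epsilon_min with the values above. It uses the same update rule as the training loop; the counter name episodes_to_floor is introduced only for this illustration.

```python
epsilon = 1.0
epsilon_min = 0.1
epsilon_decay = 0.995

episodes_to_floor = 0
while epsilon > epsilon_min:
    # Same update the training loop applies once per episode.
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    episodes_to_floor += 1

print(episodes_to_floor)  # 460
```

With these settings exploration reaches its floor after roughly 460 episodes, so most of a 10000-episode run would be spent at epsilon = 0.1.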

# Training Function

This function defines the core training logic for the DQN agent. It first checks whether the replay buffer holds at least batch_size experiences and returns without training if it does not. Otherwise, it randomly samples a batch of experiences from the replay buffer. Each experience contains the state, the action taken, the reward received, the next state, and a flag for whether the episode ended. These fields are converted into PyTorch tensors for efficient computation. The function computes the predicted Q-values of the taken actions with the main DQN model, then uses the target network to compute the maximum Q-value of each next state. The target Q-value is the reward plus the discounted maximum next-state value, with the future term dropped when the episode has ended. The mean squared error between the predicted and target Q-values is the loss, which updates the main DQN model's weights through backpropagation.

In [6]:
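# Training function: one gradient step on a random minibatch
def train():
    # Wait until the buffer holds at least one full batch
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)

    # Stack states into one NumPy array first; building a tensor
    # directly from a tuple of arrays is slow and triggers a warning
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.LongTensor(actions)
    rewards = torch.FloatTensor(rewards)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    dones = torch.FloatTensor(dones)

    # Q(s, a) for the actions that were actually taken
    q_values = model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap from the target network; (1 - dones) drops the
        # future term for transitions that ended the episode
        next_q_values = target_model(next_states).max(1)[0]
        targets = rewards + gamma * next_q_values * (1 - dones)

    loss = nn.MSELoss()(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()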
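
The target computation in train() can be checked on single transitions without any tensors. The sketch below is a plain-Python version of the same formula; td_target is a hypothetical helper, and the reward and Q-value numbers are made up for illustration.

```python
gamma = 0.99

def td_target(reward, next_q_values, done):
    """One-step target: r + gamma * max_a Q_target(s', a), with the
    future term zeroed when the transition ended the episode."""
    return reward + gamma * max(next_q_values) * (1.0 - float(done))

# Hypothetical target-network outputs for one next state (four actions).
next_q = [0.5, 2.0, -1.0, 0.0]

non_terminal = td_target(reward=1.0, next_q_values=next_q, done=False)
terminal = td_target(reward=-100.0, next_q_values=next_q, done=True)

print(round(non_terminal, 4))  # 2.98
print(terminal)                # -100.0
```

The terminal case shows why the done flag matters: after a crash or landing there is no next state to bootstrap from, so the target is just the final reward.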

# Helper Function - Select Action

In [7]:
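# Select action function (epsilon-greedy)
def select_action(state):
    # Explore: take a random action with probability epsilon
    if random.random() < epsilon:
        return env.action_space.sample()
    # Exploit: take the action with the highest predicted Q-value
    # (no gradients are needed for action selection)
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        q_values = model(state_tensor)
    return torch.argmax(q_values).item()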
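
The epsilon-greedy rule itself does not depend on PyTorch or Gymnasium. The following sketch applies it to a plain list of Q-values; the function epsilon_greedy and the example values are hypothetical and only mirror the logic of select_action.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Index of the largest Q-value (first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])

q = [0.1, 0.7, -0.3, 0.2]  # made-up Q-values for four actions

greedy_action = epsilon_greedy(q, epsilon=0.0)   # never explores
random_actions = {epsilon_greedy(q, epsilon=1.0) for _ in range(200)}  # always explores

print(greedy_action)  # 1
```

With epsilon = 0.0 the choice is always the argmax, and with epsilon = 1.0 every choice is a uniformly random action index.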

# Metrics

This section initializes the variables that track the agent's performance during training. steps_per_episode records the number of steps taken in each episode. success_count counts successful landings since the last progress report, and total_success_count counts them over the whole run. rewards_list, which stores the total reward obtained in each episode, is created at the top of the training loop cell.

In [8]:
# Metrics
steps_per_episode = []
success_count = 0
total_success_count = 0
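
The notebook tracks the rolling success rate with a counter that is reset at each report. An alternative, shown here only as a sketch and not used in this notebook, is a fixed-length deque that always holds the most recent outcomes; recent_successes, record_episode, and rolling_success_rate are hypothetical names.

```python
from collections import deque

rolling_window = 250
recent_successes = deque(maxlen=rolling_window)  # 1 = landed, 0 = did not

def record_episode(landed_successfully):
    """Store the outcome of one finished episode."""
    recent_successes.append(1 if landed_successfully else 0)

def rolling_success_rate():
    """Success percentage over the episodes currently in the window."""
    if not recent_successes:
        return 0.0
    return 100.0 * sum(recent_successes) / len(recent_successes)

for outcome in [False, True, False, True]:
    record_episode(outcome)

print(rolling_success_rate())  # 50.0
```

Dividing by the number of stored outcomes rather than by rolling_window keeps the rate meaningful before a full window of episodes has been played.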

# Training Loop

This section contains the main training loop, which runs for num_episodes episodes. Each episode starts by resetting the environment. At every step the agent selects an action, observes the next state and reward, stores the experience in the replay buffer, and runs one training step. On steps that do not end the episode, a small penalty proportional to the lander's distance from the landing pad is added to the reward. After each episode, epsilon is decayed, the episode's total reward and step count are recorded, and the success counters are updated. The target network is synchronized every update_target_every episodes, and every 250 episodes the loop records a video and prints the rolling and overall metrics.
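
The training loop adds a distance penalty to the environment reward on steps that do not terminate the episode. The sketch below isolates that calculation in plain Python; shaped_reward and its penalty_scale parameter are hypothetical names, and the example states are made up.

```python
import math

def shaped_reward(reward, state, done, penalty_scale=0.1):
    """Subtract a penalty proportional to the distance from the pad at (0, 0).

    Terminal rewards are left unchanged, matching the `if not done` check.
    """
    if done:
        return reward
    x, y = state[0], state[1]  # first two observation values are the position
    return reward - penalty_scale * math.hypot(x, y)

# Lander 0.5 units from the pad: the penalty is 0.1 * 0.5 = 0.05.
mid_flight = shaped_reward(1.0, [0.3, 0.4, 0, 0, 0, 0, 0, 0], done=False)
touchdown = shaped_reward(100.0, [0.0, 0.0, 0, 0, 0, 0, 1, 1], done=True)

print(round(mid_flight, 4))  # 0.95
print(touchdown)             # 100.0
```

Because the shaped value is what gets summed into the episode total, the rewards printed by this notebook include the penalty and are not directly comparable to unshaped Lunar Lander returns.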

In [None]:
# Training loop
rewards_list = []  # total (shaped) reward per episode
num_episodes = 10000

for episode in range(num_episodes):
    state, info = env.reset()
    total_reward = 0
    steps = 0
    landed_successfully = False

    while True:
        action = select_action(state)
        next_state, reward, done, truncated, info = env.step(action)
        terminal = done or truncated

        # Check if the landing was successful
        if done and reward == 100:
            landed_successfully = True

        # Reward shaping: on non-terminating steps, penalize distance from the
        # landing pad at (0, 0), using the (x, y) position of the pre-step state
        if not done:
            reward -= 0.1 * np.linalg.norm(state[:2])

        replay_buffer.append((state, action, reward, next_state, terminal))
        train()
        state = next_state
        total_reward += reward
        steps += 1
        if terminal:
            break

    rewards_list.append(total_reward)
    steps_per_episode.append(steps)
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Update success rate
    if landed_successfully:
        success_count += 1
        total_success_count += 1

    # Update target model
    if episode % update_target_every == 0:
        target_model.load_state_dict(model.state_dict())

    # Generate video and metrics every 250 episodes
    if episode % 250 == 0:
        frames = []
        state, info = env.reset()
        for _ in range(1000):
            frames.append(env.render())
            action = select_action(state)
            next_state, _, done, truncated, _ = env.step(action)
            state = next_state
            if done or truncated:
                break
        imageio.mimsave(f"lunarlander_episode_{episode}.mp4", frames, fps=30)

        # Metrics and Logging
        avg_reward = np.mean(rewards_list[-rolling_window:])
        avg_steps = np.mean(steps_per_episode[-rolling_window:])
        success_rate = (success_count / rolling_window) * 100  # Rolling success rate as percentage
        overall_avg_reward = np.mean(rewards_list)
        overall_success_rate = (total_success_count / (episode + 1)) * 100  # Overall success rate as percentage

        print(f"Episode {episode}, Reward: {total_reward}, Avg Reward (last {rolling_window}): {avg_reward}, "
              f"Overall Avg Reward: {overall_avg_reward}, Steps: {steps}, Avg Steps: {avg_steps}, "
              f"Success Rate (last {rolling_window}): {success_rate:.2f}%, Overall Success Rate: {overall_success_rate:.2f}%")

        # Reset rolling success count
        success_count = 0

print("Training complete.")

Episode 0, Reward: -132.1918690988905, Avg Reward (last 250): -132.1918690988905, Overall Avg Reward: -132.1918690988905, Steps: 98, Avg Steps: 98.0, Success Rate (last 250): 0.00%, Overall Success Rate: 0.00%
Episode 250, Reward: -103.23468146314636, Avg Reward (last 250): -148.72564335879218, Overall Avg Reward: -148.659771748195, Steps: 238, Avg Steps: 204.988, Success Rate (last 250): 0.00%, Overall Success Rate: 0.00%
Episode 500, Reward: -287.9941073432341, Avg Reward (last 250): -110.99472198326845, Overall Avg Reward: -129.86483673575657, Steps: 146, Avg Steps: 508.356, Success Rate (last 250): 6.80%, Overall Success Rate: 3.39%
Episode 750, Reward: -260.9884660730068, Avg Reward (last 250): -152.87165689620497, Overall Avg Reward: -137.52356515135193, Steps: 1000, Avg Steps: 767.736, Success Rate (last 250): 14.40%, Overall Success Rate: 7.06%
Episode 1000, Reward: -200.67236100671474, Avg Reward (last 250): -205.85316258856452, Overall Avg Reward: -154.58889917662978, Steps: 316, Avg Steps: 841.644, Success Rate (last 250): 1.20%, Overall Success Rate: 5.59%
Episode 1250, Reward: -134.41790324440944, Avg Reward (last 250): -174.82446405815247, Overall Avg Reward: -158.63277705063513, Steps: 104, Avg Steps: 554.18, Success Rate (last 250): 0.80%, Overall Success Rate: 4.64%
Episode 1500, Reward: -84.65438876708137, Avg Reward (last 250): -154.3361645287428, Overall Avg Reward: -157.91715204698883, Steps: 1000, Avg Steps: 543.144, Success Rate (last 250): 0.40%, Overall Success Rate: 3.93%
Episode 1750, Reward: -164.10632596628432, Avg Reward (last 250): -138.5261120272875, Overall Avg Reward: -155.14858551076648, Steps: 269, Avg Steps: 447.16, Success Rate (last 250): 2.80%, Overall Success Rate: 3.77%
Episode 2000, Reward: -138.91246729754113, Avg Reward (last 250): -135.21944993532827, Overall Avg Reward: -152.6586885123359, Steps: 167, Avg Steps: 449.904, Success Rate (last 250): 4.40%, Overall Success Rate: 3.85%
Episode 2250, Reward: 31.63713290652568, Avg Reward (last 250): -126.11838684429011, Overall Avg Reward: -149.711076154712, Steps: 1000, Avg Steps: 544.236, Success Rate (last 250): 7.60%, Overall Success Rate: 4.26%
