In this notebook I'll train a DQN agent to play the game of Pong. The environment is provided by the Gymnasium library.

Necessary imports

In [1]:
import warnings
warnings.filterwarnings("ignore")
!pip install gymnasium
!pip install supersuit
!pip install torch
!pip install autorom[accept-rom-license]



In [2]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import supersuit as ss
import collections
import time


Let's create the environment and see what it looks like before we start applying the wrappers from the supersuit library.

In [3]:
env = gym.make("ALE/Pong-v5")
print("environment's observation space:", env.observation_space)
print("environment's action space:", env.action_space)

A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]


environment's observation space: Box(0, 255, (210, 160, 3), uint8)
environment's action space: Discrete(6)


## Wrappers

Now let's apply the wrappers from the supersuit library, one at a time.


The first wrapper is a color reduction wrapper that converts the image to grayscale. This saves a lot of computation without losing relevant information: the game is visually simple, and the colors of the ball and the paddles do not change during the game.

In [4]:
env = ss.color_reduction_v0(env, mode="full")

print("environment's observation space:", env.observation_space)

environment's observation space: Box(0, 255, (210, 160), uint8)


The second wrapper resizes the image to 84x84. This also saves a lot of computation, and it should not hurt the agent's performance because both the paddles and the ball remain visible after the resize.

In [5]:
env = ss.resize_v1(env, x_size=84, y_size=84) # Resize the observation space to 84x84 

print("environment's observation space:", env.observation_space)

environment's observation space: Box(0, 255, (84, 84), uint8)


The third wrapper stacks 4 frames together. A single frame shows where the ball and the paddles are, but not how they move. With a stack of frames the agent can infer the direction and the speed of the ball, which it needs in order to learn how to play the game.

In [6]:
env = ss.frame_stack_v1(env, 4) # Stack 4 frames together

print("environment's observation space:", env.observation_space)

environment's observation space: Box(0, 255, (84, 84, 4), uint8)
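
To make the effect of frame stacking concrete, here is a minimal pure-Python sketch of the idea. It is not supersuit's implementation: the `FrameStacker` class and its padding with empty frames at reset are assumptions made for illustration, and frames are represented by plain integers rather than 84x84 images.

```python
from collections import deque


class FrameStacker:
    """Toy frame stacker (hypothetical helper): keeps the last `k` frames, oldest first."""

    def __init__(self, k, empty_frame=0):
        self.k = k
        self.empty_frame = empty_frame
        self.frames = deque([empty_frame] * k, maxlen=k)

    def reset(self, first_frame):
        # Assumption: start from empty frames and place the first real frame last.
        self.frames = deque([self.empty_frame] * self.k, maxlen=self.k)
        self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return list(self.frames)


stacker = FrameStacker(k=4)
print(stacker.reset(1))   # [0, 0, 0, 1]
for frame in (2, 3, 4, 5):
    stacked = stacker.step(frame)
print(stacked)            # [2, 3, 4, 5]
```

Each observation handed to the network therefore contains the current frame plus the three frames before it.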


The fourth wrapper converts the data type of the image from uint8 to float32. This is needed because the network's layers operate on floating-point tensors, and because the normalization in the next step produces fractional values that an integer type cannot represent.

In [7]:
env = ss.dtype_v0(env, dtype=np.float32) # Convert observations to float32

print("environment's observation space:", env.observation_space)

environment's observation space: Box(0.0, 255.0, (84, 84, 4), float32)


The fifth and last wrapper normalizes the pixel values to the range [0, 1]. Inputs on a small, consistent scale generally make neural network training more stable.

In [8]:
env = ss.normalize_obs_v0(env, env_min=0, env_max=1) # Normalize observations to [0, 1]

print("environment's observation space:", env.observation_space)

environment's observation space: Box(0.0, 1.0, (84, 84, 4), float32)
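
The normalization amounts to a linear map from the original pixel range to the target range. The helper below is a hypothetical `scale_pixel` function written for illustration, not part of supersuit, and it assumes the raw pixel bounds are 0 and 255 as reported by the observation space above.

```python
def scale_pixel(value, low=0.0, high=255.0, new_min=0.0, new_max=1.0):
    """Linearly map `value` from [low, high] to [new_min, new_max]."""
    fraction = (float(value) - low) / (high - low)
    return new_min + fraction * (new_max - new_min)


for pixel in (0, 51, 255):
    # 0 maps to the lower bound, 255 to the upper bound, 51 to one fifth of the way
    print(pixel, "->", scale_pixel(pixel))
```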


Now our environment is ready for training. Let's take a final look at it before starting the modelling and training phases.

In [9]:
print("environment's observation space:", env.observation_space)
print("environment's observation space shape:", env.observation_space.shape)
print("environment's action space:", env.action_space)
print("Meaning of the actions: ",env.unwrapped.get_action_meanings())

environment's observation space: Box(0.0, 1.0, (84, 84, 4), float32)
environment's observation space shape: (84, 84, 4)
environment's action space: Discrete(6)
Meaning of the actions:  ['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']


Finally we'll wrap them all together in a function that will create the environment and apply the wrappers.

In [10]:
def make_env():
    env = gym.make("ALE/Pong-v5")
    env = ss.color_reduction_v0(env, mode="full")
    env = ss.resize_v1(env, x_size=84, y_size=84)
    env = ss.frame_stack_v1(env, 4)
    env = ss.dtype_v0(env, dtype=np.float32)
    env = ss.normalize_obs_v0(env, env_min=0, env_max=1)
    return env

Let's check that our GPUs are ready for training.

In [11]:
!nvidia-smi

Wed Dec 27 12:09:04 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:1A:00.0 Off |                  N/A |
| 30%   44C    P2              60W / 250W |    327MiB / 11264MiB |     18%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:1B:00.0 Off |  

## Neural Network architecture

Let's start by defining the neural network architecture that will serve as a function approximator for the Q function. The network is a convolutional neural network. Its input is an observation from the wrapped environment (4 stacked 84x84 frames) and its output is a vector of size 6, where each element is the Q value of one action.

In [12]:
#Instantiate the cuda device object
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [13]:
def DQN(obs_shape, num_actions):
    return nn.Sequential(
        nn.Conv2d(obs_shape[2], 32, kernel_size=8, stride=4),  # Use obs_shape[2] for the number of channels
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1),  # Note: unlike the first two conv layers, this one is not followed by a ReLU
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512),  # Ensure that the input features to this linear layer match the output from the last conv layer
        nn.ReLU(),
        nn.Linear(512, num_actions)
    )


check_env = make_env()
check_net = DQN(check_env.observation_space.shape, check_env.action_space.n).to(device)
print("Let's see how the architecture looks like:\n \n", check_net)

Let's see how the architecture looks like:
 
 Sequential(
  (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
  (1): ReLU()
  (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
  (3): ReLU()
  (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
  (5): Flatten(start_dim=1, end_dim=-1)
  (6): Linear(in_features=3136, out_features=512, bias=True)
  (7): ReLU()
  (8): Linear(in_features=512, out_features=6, bias=True)
)
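
The `in_features=3136` of the first linear layer follows from the convolution arithmetic. The sketch below recomputes it with the standard output-size formula for an unpadded convolution; `conv_output_size` is a helper name introduced here for illustration.

```python
def conv_output_size(size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 84
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel_size, stride)
    print(size)          # 20, then 9, then 7

flat_features = 64 * size * size  # 64 output channels of the last conv layer
print(flat_features)     # 3136
```

If the input resolution or any kernel or stride changes, the `64 * 7 * 7` in the model definition has to be recomputed in the same way.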


## Experience Replay

The next step is to define the experience replay buffer. The buffer stores the agent's experiences (transitions) so that training batches can later be sampled from them at random.

In [14]:
#Define the experience tuple to store the experience which is composed of the state, action, reward, terminated, truncated and new_state
Experience = collections.namedtuple('Experience', field_names=['state', 'action', 'reward', 'terminated', 'truncated', 'new_state'])

class ExperienceReplay:
    def __init__(self, capacity):
        #Initialize the buffer with the capacity
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        #Return the length of the buffer
        return len(self.buffer)
    
    def append(self, experience):
        #Append the experience to the buffer
        self.buffer.append(experience)
    
    def sample(self, batch_size):
        #Sample the batch from the buffer
        
        #Choose the random indices from the buffer to be sampled
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        #Get the states, actions, rewards, terminated, truncated, new_states from the buffer by using the indices,
        # the zip(*[]) is used to unzip the list of tuples into the list of lists 
        states, actions, rewards, terminated, truncated, new_states = zip(*[self.buffer[idx] for idx in indices])
        
        #Return the states, actions, rewards, terminated, truncated, new_states as numpy arrays
        return (
            np.array(states),
            np.array(actions),
            np.array(rewards, dtype=np.float32),
            np.array(terminated, dtype=np.float32),
            np.array(truncated, dtype=np.float32),
            np.array(new_states),
        )
    


check_rep = ExperienceReplay(capacity = 10000)
print("The length of the buffer is:", len(check_rep))
print("Let's add some experience to the buffer")
check_rep.append(Experience(1,2,3,4,5,6))
print("The length of the buffer is:", len(check_rep))
print("Let's add some more experience to the buffer")
check_rep.append(Experience(1,2,3,4,5,6))
print("The length of the buffer is:", len(check_rep))
print("Let's sample the batch from the buffer")
states, actions, rewards, terminated, truncated, new_states = check_rep.sample(batch_size=2)
print("The states are:", states, "of type:", type(states))
print("The actions are:", actions, "of type:", type(actions))
print("The rewards are:", rewards, "of type:", type(rewards))
print("The terminated are:", terminated, "of type:", type(terminated))
print("The truncated are:", truncated, "of type:", type(truncated))
print("The new_states are:", new_states, "of type:", type(new_states))

The length of the buffer is: 0
Let's add some experience to the buffer
The length of the buffer is: 1
Let's add some more experience to the buffer
The length of the buffer is: 2
Let's sample the batch from the buffer
The states are: [1 1] of type: <class 'numpy.ndarray'>
The actions are: [2 2] of type: <class 'numpy.ndarray'>
The rewards are: [3. 3.] of type: <class 'numpy.ndarray'>
The terminated are: [4. 4.] of type: <class 'numpy.ndarray'>
The truncated are: [5. 5.] of type: <class 'numpy.ndarray'>
The new_states are: [6 6] of type: <class 'numpy.ndarray'>
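
The buffer capacity has a direct memory cost, which matters for the larger buffer sizes. The estimate below is a rough sketch under stated assumptions: observations are float32 arrays of shape (84, 84, 4), each observation is stored once (the agent reuses `new_state` as the next `state`, so consecutive experiences can share the same array), and Python object overhead is ignored. `observation_bytes` is a helper name introduced for illustration.

```python
def observation_bytes(shape=(84, 84, 4), bytes_per_value=4):
    """Bytes needed for one float32 observation of the given shape."""
    total = bytes_per_value
    for dim in shape:
        total *= dim
    return total


per_obs = observation_bytes()
print(per_obs)  # 112896 bytes per stacked observation

for capacity in (10_000, 200_000):
    gib = capacity * per_obs / 1024 ** 3
    print(f"capacity={capacity}: about {gib:.1f} GiB of observations")
```

Keeping the observations as uint8 inside the buffer and converting them to float32 only when a batch is sampled would cut this by a factor of four, at the cost of changing the wrapper order used here.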


##  Epsilon Greedy Policy

Let's now define the epsilon greedy policy that will be used during the training phase to select the agent's actions. It selects a random action with probability epsilon and the action with the highest Q value with probability 1-epsilon. Epsilon is decayed over time, so the agent explores less and less as it learns more.

In [15]:

class EpsilonGreedyStrategy:
    def __init__(self, start, end, decay):
        #Initialize the start, end, decay values
        self.start = start
        self.end = end
        self.decay = decay
    
    def get_exploration_rate(self, current_step):
        #Return the exploration rate
        return self.end + (self.start - self.end) * np.exp(-1. * current_step / self.decay)
    

check_strat = EpsilonGreedyStrategy(1.0, 0.02, 100000)
print("The exploration rate at step 0 is:", check_strat.get_exploration_rate(0))
print("The exploration rate at step 1 is:", check_strat.get_exploration_rate(1))
print("The exploration rate at step 10 is:", check_strat.get_exploration_rate(10))
print("The exploration rate at step 100 is:", check_strat.get_exploration_rate(100))
print("The exploration rate at step 10000 is:", check_strat.get_exploration_rate(10000))
print("The exploration rate at step 100000 is:", check_strat.get_exploration_rate(100000))
print("The exploration rate at step 1000000 is:", check_strat.get_exploration_rate(1000000))


The exploration rate at step 0 is: 1.0
The exploration rate at step 1 is: 0.999990200049
The exploration rate at step 10 is: 0.9999020048998368
The exploration rate at step 100 is: 0.9990204898367077
The exploration rate at step 10000 is: 0.9067406696752404
The exploration rate at step 100000 is: 0.38052185234801356
The exploration rate at step 1000000 is: 0.020044491931167235
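
The decay formula can also be inverted to ask how many steps it takes to reach a given exploration rate. The sketch below uses only the standard library; `steps_until` is a helper introduced here for illustration and is not part of the notebook's classes.

```python
import math


def exploration_rate(step, start, end, decay):
    """Exponential decay from `start` towards `end`, as in get_exploration_rate."""
    return end + (start - end) * math.exp(-step / decay)


def steps_until(epsilon, start, end, decay):
    """Number of steps after which the rate has dropped to `epsilon`."""
    if not end < epsilon <= start:
        raise ValueError("epsilon must lie in (end, start]")
    return -decay * math.log((epsilon - end) / (start - end))


steps = steps_until(0.1, start=1.0, end=0.02, decay=100_000)
print(round(steps))
print(exploration_rate(steps, 1.0, 0.02, 100_000))  # back to (approximately) 0.1
```

With `decay=100000`, reaching an exploration rate of 0.1 takes roughly 250 thousand steps, which is a useful sanity check when choosing the decay values for the search.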


## Agent

Now let's define the agent. The agent makes use of the neural network, the experience replay buffer and the epsilon greedy policy that we defined earlier.

In [16]:
class Agent:
    def __init__(self, env, exp_replay_buffer):
        self.env = env
        self.exp_replay_buffer = exp_replay_buffer
        #Count the steps over the whole training run; unlike current_episode_steps this is never reset
        self.total_steps = 0
        self._reset()
    
    def _reset(self):
        #Reset the environment
        self.current_state, info = self.env.reset() # Get the current state and info from the environment
        self.total_reward = 0.0
        self.current_episode_steps = 0
    
    def step(self, dqn, strategy, device=device):
        #set the done_reward to None initially, this is the value which will be returned when the episode is done
        done_reward = None
        
        #Get the exploration rate from the global step count, so that epsilon keeps decaying across episodes
        exploration_rate = strategy.get_exploration_rate(self.total_steps)
        
        if np.random.rand() < exploration_rate:
            #Explore with probability epsilon: take a random action
            action = self.env.action_space.sample()
        
        else:
            #Exploit with probability 1 - epsilon: take the action with the highest Q value
            state = torch.tensor(np.array(self.current_state).transpose(2, 0, 1), dtype=torch.float32).to(device)# Convert the current state to a tensor of shape [ 4, 84, 84]
            with torch.no_grad(): # No gradients are needed to select an action
                q_values = dqn(state.unsqueeze(0))
            action = torch.argmax(q_values, dim=1).item()
        
        #Take the action in the environment and get the next state, reward, terminated, truncated and info
        new_state, reward, terminated, truncated, info = self.env.step(action)
        
        #Update the total reward and the step counters
        self.total_reward += reward
        self.current_episode_steps += 1
        self.total_steps += 1
        
        #Create the experience tuple and append it to the experience replay buffer
        exp = Experience(self.current_state, action, reward, terminated, truncated, new_state)
        self.exp_replay_buffer.append(exp)
        
        #Update the current state to the new state
        self.current_state = new_state
        
        #Check if the episode is done and if so, update the done_reward and reset the environment
        if terminated or truncated:
            done_reward = self.total_reward
            self._reset()
            
        #return the reward when the episode is done    
        return done_reward
    
    
#Test the agent
print(Agent(make_env(), ExperienceReplay(10000)).step(check_net, check_strat))

None
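
The action selection rule on its own can be sketched without the environment or the network. The `select_action` function below is a hypothetical stand-alone helper that works on a plain list of Q values; it only illustrates the epsilon greedy rule and is not the agent's code.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon greedy choice over a list of Q values."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest Q value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.5, -0.2, 0.0, 0.3, 0.2]
print(select_action(q_values, epsilon=0.0))  # 1 (always greedy)

rng = random.Random(0)
picks = [select_action(q_values, epsilon=1.0, rng=rng) for _ in range(1000)]
print(sorted(set(picks)))  # with epsilon=1.0 all picks are random
```

The comparison direction matters: the random branch must be taken when the random draw is below epsilon, otherwise the agent explores with probability 1-epsilon instead.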


## Training

First we'll set up the hyperparameters and some utility functions to help us during the training phase.

In [17]:
from torch.utils.tensorboard import SummaryWriter
import datetime
import os
import multiprocessing

%load_ext tensorboard
print(">>>Training started at:", datetime.datetime.now())


def save_model(model, optimizer, filename="model.pth"):
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        }, filename)


>>>Training started at: 2023-12-27 12:09:05.971075


In [18]:
import wandb
import random
'''
We'll define the hyperparameters that will be used during the training phase. 
Our goal is to do a grid search over the hyperparameters to find the best combination 
of hyperparameters that will give us the best performance, so we'll define lists that
will contain all the hyperparameters that we want to search over.
'''
#Hyperparameters
lrs = [0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005]
gammas = [0.99, 0.95, 0.9, 0.85, 0.8]
batch_sizes = [32, 64, 128, 256, 512]
buffer_sizes = [10000, 20000, 50000, 100000, 200000]
target_updates = [100, 200, 500, 1000, 2000]
epsilon_decays = [100000, 200000, 500000, 1000000, 2000000]
epsilon_starts = [1.0, 0.9, 0.8, 0.7, 0.6]
epsilon_ends = [0.02, 0.04, 0.06, 0.08, 0.1]
env_name = "ALE/Pong-v5"
num_episodes = 10000000
max_steps_per_episode = 10000000
solved_reward = 19
min_episodes = 100
max_no_improvement = 100
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_searches = len(lrs) * len(gammas) * len(batch_sizes) * len(buffer_sizes) * len(target_updates) * len(epsilon_decays) * len(epsilon_starts) * len(epsilon_ends)
print("The number of hyperparameter combinations to search over would be:", num_searches)
print("this is absolutely insane, so we'll just search over a few combinations by randomly sampling from the hyperparameter space")
num_searches = 100 # We'll just search over 100 combinations

The number of hyperparameter combinations to search over would be: 468750
this is absolutely insane, so we'll just search over a few combinations by randomly sampling from the hyperparameter space
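
The size of the grid and the random sampling can be reproduced with the standard library alone. The sketch below repeats the value lists so that it is self-contained. Note that the `sample_configs` helper and its duplicate check are additions for illustration; the training loop in this notebook samples each value independently with `random.choice` and does not filter repeated combinations.

```python
import math
import random

# Same values as the lists above, repeated so the sketch is self-contained.
search_space = {
    "lr": [0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005],
    "gamma": [0.99, 0.95, 0.9, 0.85, 0.8],
    "batch_size": [32, 64, 128, 256, 512],
    "buffer_size": [10000, 20000, 50000, 100000, 200000],
    "target_update": [100, 200, 500, 1000, 2000],
    "epsilon_decay": [100000, 200000, 500000, 1000000, 2000000],
    "epsilon_start": [1.0, 0.9, 0.8, 0.7, 0.6],
    "epsilon_end": [0.02, 0.04, 0.06, 0.08, 0.1],
}

grid_size = math.prod(len(values) for values in search_space.values())
print(grid_size)  # 468750


def sample_configs(space, n, seed=0):
    """Draw `n` distinct configurations uniformly at random."""
    rng = random.Random(seed)
    seen = set()
    configs = []
    while len(configs) < n:
        config = {name: rng.choice(values) for name, values in space.items()}
        key = tuple(config.values())
        if key not in seen:  # skip duplicates so no trial is repeated
            seen.add(key)
            configs.append(config)
    return configs


configs = sample_configs(search_space, n=100)
print(len(configs), configs[0])
```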


Now let's start the training phase. It consists of a random search over the hyperparameters that we defined earlier, run for 100 trials.


In [19]:


#Remember we want to do a random search over the hyperparameters, so instead of looping over the full grid
# we'll just randomly sample 100 different sets of hyperparams from the hyperparameter space.

#Also remember the device is cuda if cuda is available, else it is cpu, so we'll use the device variable to send the tensors to the device

results = [] 
for _ in range(num_searches):
    # Randomly sample hyperparameters
    lr = random.choice(lrs)
    gamma = random.choice(gammas)
    batch_size = random.choice(batch_sizes)
    buffer_size = random.choice(buffer_sizes)
    target_update = random.choice(target_updates)
    epsilon_decay = random.choice(epsilon_decays)
    epsilon_start = random.choice(epsilon_starts)
    epsilon_end = random.choice(epsilon_ends)
                                
    #Initialize wandb
    wandb.init(project="DQN_RIGHT_PONG", entity="neildlf", config={
        'lr': lr,
        'gamma': gamma,
        'batch_size': batch_size,
        'buffer_size': buffer_size,
        'target_update': target_update,
        'epsilon_decay': epsilon_decay,
        'epsilon_start': epsilon_start,
        'epsilon_end': epsilon_end,
    })
    
    print(">>>Training for parameters:", lr, gamma, batch_size, buffer_size, target_update, epsilon_decay, epsilon_start, epsilon_end, "started at:", datetime.datetime.now(), "on device:", device)
    #Create the environment
    env = make_env()
    #Create the experience replay buffer
    exp_replay_buffer = ExperienceReplay(capacity=buffer_size)
    #Create the agent
    agent = Agent(env, exp_replay_buffer)
    #Create the DQN
    dqn = DQN(env.observation_space.shape, env.action_space.n).to(device)
    #Create the target DQN
    target_dqn = DQN(env.observation_space.shape, env.action_space.n).to(device)
    #Set the target DQN's weights to be the same as the DQN
    target_dqn.load_state_dict(dqn.state_dict())
    #Set the target DQN to evaluation mode
    target_dqn.eval()
    #Create the optimizer
    optimizer = optim.Adam(dqn.parameters(), lr=lr)
    #Create the strategy
    strategy = EpsilonGreedyStrategy(epsilon_start, epsilon_end, epsilon_decay)
    #Set the frame number to 0
    frame_number = 0
    #set the episode number to 0
    episode_number = 0
    #Set the total reward list to empty
    total_reward_list = []
    
    #Set the best mean reward to -infinity initially
    best_mean_reward = -float('inf')
    #Set the no improvement count to 0
    no_improvement_count = 0
    #Loop over the environment steps (frames); despite its name, num_episodes caps the number of steps, not episodes
    for _ in range(num_episodes):
        frame_number += 1
        epsilon = strategy.get_exploration_rate(frame_number)
        
        #Take a step in the environment
        reward = agent.step(dqn, strategy)
        
        #If the reward is not None, then the episode is done
        if reward is not None:
            episode_number += 1
            #Append the total reward to the total reward list
            total_reward_list.append(reward)
            #Get the mean of the total reward list
            mean_reward = np.mean(total_reward_list[-100:])
            #Print the episode number, frame number, reward and mean reward and epsilon
            print(f"Episode:{episode_number} | Frame:{frame_number} | Total games:{len(total_reward_list)}  | Episode reward: {reward:.3f} | Mean reward: {mean_reward:.3f} | epsilon used: {epsilon:.3f}")
            #Print all the hyperparameters used in this episode
            print(f"lr={lr} | gamma={gamma} | batch_size={batch_size} | target_update={target_update} | epsilon_decay={epsilon_decay} | epsilon_start={epsilon_start} | epsilon_end={epsilon_end} | buffer_size={buffer_size}")
            
            #Add the mean reward, episode number, frame number and epsilon to the wandb logs
            wandb.log({'Episode reward': reward, 'Mean Reward': mean_reward, 'Episode': episode_number, 'Frame': frame_number, 'Epsilon': epsilon})

            #If the mean reward is greater than "solved_reward", then we have solved the environment
            if mean_reward > solved_reward:
                print("Solved in", frame_number, "frames and", len(total_reward_list), "games!")
                #break
            
            if mean_reward > best_mean_reward:
                best_mean_reward = mean_reward
                save_model(dqn, optimizer, filename=f"model_{lr}_{gamma}_{batch_size}_{buffer_size}_{target_update}_{epsilon_decay}_{epsilon_start}_{epsilon_end}.pth")
                wandb.log({'Best Mean Reward': best_mean_reward})
                #Reset the counter so that early stopping only triggers after consecutive episodes without improvement
                no_improvement_count = 0
            else:
                no_improvement_count += 1
            
            #enforce early stopping if the model has converged
            if episode_number >= min_episodes and no_improvement_count >= max_no_improvement:
                print("Stopping training as the model has converged!")
                break
            
        #Check if the replay buffer has enough experience to sample a batch    
        if len(exp_replay_buffer) < batch_size:
            continue
        
        #If the replay buffer has enough experience to sample a batch, then sample a batch
        batch = exp_replay_buffer.sample(batch_size)
        
        #Get the states, actions, rewards, terminated, truncated, new_states from the batch
        states_, actions_, rewards_, terminated_, truncated_, new_states_ = batch
        
        #Turn the states, actions, rewards, terminated, truncated, new_states into tensors and send them to the device
        states = torch.tensor(np.array(states_).transpose(0, 3, 1, 2), dtype=torch.float32).to(device)
        actions = torch.tensor(actions_).to(device)
        rewards = torch.tensor(rewards_).to(device)
        terminateds = torch.tensor(terminated_).to(device)
        truncateds = torch.tensor(truncated_).to(device)
        #the flags are stored as floats, so we use torch.logical_or (non-zero counts as True) to get a boolean mask that is True when the episode was terminated or truncated
        dones = torch.logical_or(terminateds, truncateds).to(device)
        new_states = torch.tensor(np.array(new_states_).transpose(0, 3, 1, 2), dtype=torch.float32).to(device)
        
        
        #Get the q_values from the DQN by passing the states through the DQN
        Q_values = dqn(states).gather(dim=1, index=actions.unsqueeze(-1)).squeeze(-1) # Use the gather method to get the q_values for the actions taken, and then squeeze the last dimension of the q_values for the loss calculation
        
        #Get the q_values for next state from the target DQN by passing the new_states through the target DQN
        with torch.no_grad(): # No gradients should flow through the target network
            new_state_Q_values = target_dqn(new_states).max(dim=1)[0] # Get the q_values for the new_states from the target DQN and take the max of the q_values
            new_state_Q_values[dones] = 0 # If the episode is terminated or truncated, then set the new_state_q_values to 0
        
        #Compute the expected q_values using the bellman equation
        expected_Q_values = rewards + gamma * new_state_Q_values
        
        #Compute the loss between the Q_values and the expected_Q_values
        loss = F.smooth_l1_loss(Q_values, expected_Q_values) # Use the smooth_l1_loss function to compute the loss, I use this loss function because it is less sensitive to outliers than the mse loss function
        
        #Zero the gradients
        optimizer.zero_grad()
        #Compute the gradients
        loss.backward()
        #Clip the gradients
        for param in dqn.parameters():
            param.grad.clamp_(-1, 1) # Use the clamp_ method to clip the gradients between -1 and 1 to avoid exploding gradients
        #Update the weights of the DQN
        optimizer.step()
        
        #Check if the frame number is a multiple of the target_update and if so, update the target DQN's weights to be the same as the DQN
        if frame_number % target_update == 0:
            target_dqn.load_state_dict(dqn.state_dict())
    
    results.append({
        'lr': lr,
        'gamma': gamma,
        'batch_size': batch_size,
        'buffer_size': buffer_size,
        'target_update': target_update,
        'epsilon_decay': epsilon_decay,
        'epsilon_start': epsilon_start,
        'epsilon_end': epsilon_end,
        'best_mean_reward': best_mean_reward,
        'episodes_to_solve': episode_number,
    })
    # Save results to a file
    with open("training_results.txt", "w") as file:
        for result in results:
            file.write(str(result) + "\n")   
    
    #Finish the wandb run
    wandb.finish()
  

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


wandb: Currently logged in as: neildlf. Use `wandb login --relogin` to force relogin


>>>Training for parameters: 0.00025 0.95 256 10000 100 500000 0.9 0.04 started at: 2023-12-27 12:09:12.783818 on device: cuda
Episode:1 | Frame:1005 | Total games:1  | Episode reward: -21.000 | Mean reward: -21.000 | epsilon used: 0.898
lr=0.00025 | gamma=0.95 | batch_size=256 | target_update=100 | epsilon_decay=500000 | epsilon_start=0.9 | epsilon_end=0.04 | buffer_size=10000
Episode:2 | Frame:1908 | Total games:2  | Episode reward: -21.000 | Mean reward: -21.000 | epsilon used: 0.897
lr=0.00025 | gamma=0.95 | batch_size=256 | target_update=100 | epsilon_decay=500000 | epsilon_start=0.9 | epsilon_end=0.04 | buffer_size=10000
Episode:3 | Frame:2879 | Total games:3  | Episode reward: -21.000 | Mean reward: -21.000 | epsilon used: 0.895
lr=0.00025 | gamma=0.95 | batch_size=256 | target_update=100 | epsilon_decay=500000 | epsilon_start=0.9 | epsilon_end=0.04 | buffer_size=10000
Episode:4 | Frame:3968 | Total games:4  | Episode reward: -20.000 | Mean reward: -20.750 | epsilon used: 0.893
l
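
The update performed on every step of the loop above can be followed by hand on a single transition. The sketch below is a plain-Python illustration, not the batched PyTorch code: `td_target` and `smooth_l1` are helper names introduced here, and `smooth_l1` uses the piecewise definition with `beta=1.0`, which is assumed to match the default of `F.smooth_l1_loss`.

```python
def td_target(reward, next_q_values, done, gamma):
    """One-step target: r + gamma * max_a' Q_target(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def smooth_l1(prediction, target, beta=1.0):
    """Smooth L1 loss for a single prediction: quadratic near zero, linear further out."""
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


target = td_target(reward=0.0, next_q_values=[0.2, 1.0, -0.3], done=False, gamma=0.99)
print(target)                      # 0.99
print(smooth_l1(0.5, target))      # small error: quadratic branch
print(smooth_l1(3.0, target))      # large error: linear branch
print(td_target(-1.0, [0.2, 1.0, -0.3], done=True, gamma=0.99))  # -1.0
```

The linear branch is what makes this loss less sensitive to outliers than the squared error: a large error contributes a gradient of constant magnitude instead of one that grows with the error.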

## Multi-GPU training

In [None]:
import multiprocessing
import os
import wandb
import torch
import torch.optim as optim
import torch.nn.functional as F
from datetime import datetime

def train_hyperparameters(hyperparams):
    lr, gamma, batch_size, buffer_size, target_update, epsilon_decay, epsilon_start, epsilon_end = hyperparams
    #Assign this worker to a GPU in round-robin fashion, falling back to the CPU when no GPU is available
    if torch.cuda.is_available():
        device_id = multiprocessing.current_process()._identity[0] % torch.cuda.device_count()
        device = torch.device(f'cuda:{device_id}')
    else:
        device_id = None
        device = torch.device('cpu')

    wandb.init(project="DQN_RIGHT_PONG", entity="neildlf", config={
        'lr': lr,
        'gamma': gamma,
        'batch_size': batch_size,
        'buffer_size': buffer_size,
        'target_update': target_update,
        'epsilon_decay': epsilon_decay,
        'epsilon_start': epsilon_start,
        'epsilon_end': epsilon_end,
    })

    print(f"Training on GPU {device_id} for parameters: {lr}, {gamma}, {batch_size}, {buffer_size}, {target_update}, {epsilon_decay}, {epsilon_start}, {epsilon_end}")

    env = make_env()
    exp_replay_buffer = ExperienceReplay(capacity=buffer_size)
    agent = Agent(env, exp_replay_buffer)
    dqn = DQN(env.observation_space.shape, env.action_space.n).to(device)
    target_dqn = DQN(env.observation_space.shape, env.action_space.n).to(device)
    target_dqn.load_state_dict(dqn.state_dict())
    target_dqn.eval()
    optimizer = optim.Adam(dqn.parameters(), lr=lr)
    strategy = EpsilonGreedyStrategy(epsilon_start, epsilon_end, epsilon_decay)

    frame_number = 0
    episode_number = 0
    #Each worker keeps its own list of episode rewards
    total_reward_list = []
    best_mean_reward = -float('inf')
    no_improvement_count = 0

    for _ in range(num_episodes):
        frame_number += 1
        epsilon = strategy.get_exploration_rate(frame_number)
        
        reward = agent.step(dqn, strategy, device)
        if reward is not None:
            episode_number += 1
            total_reward_list.append(reward)
            mean_reward = np.mean(total_reward_list[-100:])
            wandb.log({'Mean Reward': mean_reward, 'Episode': episode_number, 'Frame': frame_number, 'Epsilon': epsilon})

            if mean_reward > best_mean_reward:
                best_mean_reward = mean_reward
                save_model(dqn, optimizer, filename=f"model_{lr}_{gamma}_{batch_size}_{buffer_size}_{target_update}_{epsilon_decay}_{epsilon_start}_{epsilon_end}.pth")
                wandb.log({'Best Mean Reward': best_mean_reward})
                #Reset the counter so that early stopping only triggers after consecutive episodes without improvement
                no_improvement_count = 0
            else:
                no_improvement_count += 1

            if episode_number >= min_episodes and no_improvement_count >= max_no_improvement:
                print("Stopping training due to no improvement.")
                break

        if len(exp_replay_buffer) < batch_size:
            continue

        states_, actions_, rewards_, terminated_, truncated_, new_states_ = exp_replay_buffer.sample(batch_size)
        #Move the channel axis first: (batch, 84, 84, 4) -> (batch, 4, 84, 84), as expected by nn.Conv2d
        states = torch.tensor(np.array(states_).transpose(0, 3, 1, 2), dtype=torch.float32).to(device)
        actions = torch.tensor(actions_).to(device)
        rewards = torch.tensor(rewards_).to(device)
        terminateds = torch.tensor(terminated_).to(device)
        truncateds = torch.tensor(truncated_).to(device)
        dones = torch.logical_or(terminateds, truncateds).to(device)
        new_states = torch.tensor(np.array(new_states_).transpose(0, 3, 1, 2), dtype=torch.float32).to(device)

        Q_values = dqn(states).gather(dim=1, index=actions.unsqueeze(-1)).squeeze(-1)
        new_state_Q_values = target_dqn(new_states).max(dim=1)[0]
        new_state_Q_values[dones] = 0
        new_state_Q_values = new_state_Q_values.detach()
        expected_Q_values = rewards + gamma * new_state_Q_values
        loss = F.smooth_l1_loss(Q_values, expected_Q_values)

        optimizer.zero_grad()
        loss.backward()
        for param in dqn.parameters():
            param.grad.clamp_(-1, 1)
        optimizer.step()

        if frame_number % target_update == 0:
            target_dqn.load_state_dict(dqn.state_dict())

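    wandb.finish()

    #Return a summary so that the parent process can collect the result of each run
    return {
        'lr': lr,
        'gamma': gamma,
        'batch_size': batch_size,
        'buffer_size': buffer_size,
        'target_update': target_update,
        'epsilon_decay': epsilon_decay,
        'epsilon_start': epsilon_start,
        'epsilon_end': epsilon_end,
        'best_mean_reward': best_mean_reward,
        'episodes_to_solve': episode_number,
    }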

# Hyperparameters and constants
num_episodes = 10000000
max_steps_per_episode = 10000000
solved_reward = 19
min_episodes = 100
max_no_improvement = 100
lrs = [0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005]
gammas = [0.99, 0.95, 0.9, 0.85, 0.8]
batch_sizes = [32, 64, 128, 256, 512]
buffer_sizes = [10000, 20000, 50000, 100000, 200000]
target_updates = [100, 200, 500, 1000, 2000]
epsilon_decays = [100000, 200000, 500000, 1000000, 2000000]
epsilon_starts = [1.0, 0.9, 0.8, 0.7, 0.6]
epsilon_ends = [0.02, 0.04, 0.06, 0.08, 0.1]
env_name = "ALE/Pong-v5"



all_hyperparams = [(lr, gamma, batch_size, buffer_size, target_update, epsilon_decay, epsilon_start, epsilon_end)
                   for lr in lrs for gamma in gammas for batch_size in batch_sizes
                   for buffer_size in buffer_sizes for target_update in target_updates
                   for epsilon_decay in epsilon_decays for epsilon_start in epsilon_starts
                   for epsilon_end in epsilon_ends]

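#Note: CUDA cannot be re-initialized in a forked subprocess, so if this process has already used the GPU
# the workers will typically need the 'spawn' start method, which in turn usually means running this as a script
with multiprocessing.Pool(processes=6) as pool:  # Adjust based on your GPU count
    #pool.map collects whatever train_hyperparameters returns for each configuration
    results = pool.map(train_hyperparameters, all_hyperparams)

# Save results to a file, skipping runs that returned nothing
with open("training_results.txt", "w") as file:
    for result in results:
        if result is not None:
            file.write(str(result) + "\n")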
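
The round-robin assignment of workers to GPUs can be checked in isolation. The function below is a hypothetical `assign_device` helper that takes the worker index and the GPU count as plain integers, so it runs without PyTorch; in the cell above these values come from the worker's process identity and `torch.cuda.device_count()`.

```python
def assign_device(worker_index, gpu_count):
    """Map a worker index to a device string, falling back to the CPU."""
    if gpu_count <= 0:
        return "cpu"
    return f"cuda:{worker_index % gpu_count}"


for worker_index in range(6):
    # With 2 GPUs the workers alternate between cuda:0 and cuda:1
    print(worker_index, assign_device(worker_index, gpu_count=2))

print(assign_device(0, gpu_count=0))  # cpu
```

With 6 workers and 2 GPUs, three training runs share each GPU at a time, so the batch size and buffer size of concurrent runs have to fit together in memory.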
