## Under Construction!

The **DQNAgent** class comprises methods for updating the target model, constructing neural networks for action and target networks, executing forward passes, recording experiences, decision-making based on the Boltzmann policy, computing TD errors for prioritization during replay, and implementing experience replay mechanisms for training.


This code is a PyTorch implementation of a *Double Deep Q-Network (DDQN) agent with prioritized experience replay* for the CartPole-v1 environment provided by [Frama Foundation Gymnasium](https://gymnasium.farama.org/environments/classic_control/cart_pole/).

### Recent Updates:

- **Algorithm Updates:**
  - Replaced Q-learning with expected-sarsa, specifically utilizing averageQ over maxQ. Expected Sarsa is on-policy learning, which is known to work better with function approximation (the RL deadly triad)
  - Transitioned from epsilon-greedy to a Boltzmann policy for action selection. Why: I noticed that random guessing works better than a poorly trained network. This is because a slightest bias in the poorly trained network will make the agent to push the cart in one direction, causing it to fail. Random guessing however is on average unbiased. Thus, I choose the Boltzmann policy that is always probabilistic and lower the temperature slowly.

- **State Utilization:**
  - Introduced an additional output utilizing states in an attempt to enhance training stability. Initial observations suggest limited effectiveness, prompting consideration for incorporating state learning into the summary writer for further analysis (to be done). CUrrently states are an output of the system, the idea was that since Q values constantly change and we know bootstrapping is unstable, I tried to give the network a stable output. To this point not effective on its own.

- **Optimization Technique:**
  - Discontinued the active use of Optuna for hyperparameter optimization. *Note: Optuna was previously used for hyperparameter optimization.*

- **Visualization and Hyperparameter Tuning:**
  - Utilizes TensorBoard for visualization, providing insights into training metrics and Q-value graphs.
  - Previously, Optuna was used for hyperparameter optimization; however, the code no longer actively employs it.

Remember to monitor the TensorBoard Q graphs for a visual representation of the training progress.

Author: Iman Mossavat  
Date: 19-December-2023  
Institution: Fontys ICT



In [None]:
#!pip install gymnasium[classic-control]


Collecting gymnasium[classic-control]
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m953.9/953.9 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
Collecting farama-notifications>=0.0.1 (from gymnasium[classic-control])
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-0.29.1


In [1]:
#!pip install numpy
!pip install optuna
#!pip install matplotlib
#!pip install torch torchvision
#!pip install tensorboard


Collecting optuna
  Downloading optuna-3.5.0-py3-none-any.whl (413 kB)
     ------------------------------------ 413.4/413.4 kB 737.1 kB/s eta 0:00:00
Collecting alembic>=1.5.0
  Downloading alembic-1.13.0-py3-none-any.whl (230 kB)
     ------------------------------------ 230.6/230.6 kB 941.6 kB/s eta 0:00:00
Collecting colorlog
  Downloading colorlog-6.8.0-py3-none-any.whl (11 kB)
Collecting Mako
  Downloading Mako-1.3.0-py3-none-any.whl (78 kB)
     ---------------------------------------- 78.6/78.6 kB 2.2 MB/s eta 0:00:00
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.0 alembic-1.13.0 colorlog-6.8.0 optuna-3.5.0



[notice] A new release of pip available: 22.3.1 -> 23.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
#from google.colab import drive

# Mount your Google Drive
#drive.mount('/content/drive')


import random
from collections import deque
import optuna
import copy

import numpy as np
import cv2
import PIL
from PIL import Image
from matplotlib import pyplot as plt

log_dir = "runs"
empty_initial_frames = True
FRAMES = True # create an environment with RGB frames

In [3]:
import gymnasium as gym
print(gym.__version__)


0.29.1


In [4]:


import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
import torch.nn.functional as F


%load_ext tensorboard



In [5]:
class FrameHandler:
    def __init__(self):
        self.frame_stack = deque(maxlen=4) # stack of 4 pytorch tensors containing containing preprocessed frames

    def preprocess_frame(self, frame, D=84):
        """Resizing to DxD (84x84) pixels numpy array, converting to grayscale, and normalizing pixel values. """

        frame = cv2.resize(frame, (D, D))
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frame = frame / 255.0
        return frame

    def update_framestack(self, frame):
        # Pre-process, converts to tensor and append the new frame at the beginning
        preprocessed_frame = self.preprocess_frame(frame)
        preprocessed_frame_tensor = self.convert_frame_to_tensor(preprocessed_frame)

        self.frame_stack.appendleft(preprocessed_frame_tensor)

    def initialize_frame_stack(self, frame, empty_initial_frames=True):
        self.frame_stack.clear()
        empty_frame = np.zeros_like(frame)


        for _ in range(4):
          if empty_initial_frames:
            self.update_framestack(empty_frame)
          else:
            self.frame_stack.extend(frame)

        self.update_framestack(frame)

    def convert_frame_to_tensor(self, frame):
        frame_tensor = torch.from_numpy(frame).float()
        return frame_tensor

    def clear(self):
       self.frame_stack.clear()

    def convert_frame_stack_to_tensor(self):
      # useful to build an input to CNN
      frames_tensor = torch.stack(list(self.frame_stack))  # Stack the frames along a new dimension
      frames_tensor = frames_tensor.unsqueeze(0)  # Add a batch dimension
      return frames_tensor

In [None]:
# Define the CNN layers for processing the stack of frames
self.conv_layers = nn.Sequential(
    nn.Conv2d(in_channels=4, out_channels=8, kernel_size=5, stride=1),
    nn.MaxPool2d(kernel_size=2, stride=2),  # Adding MaxPooling
    nn.ReLU(),
    nn.Conv2d(in_channels=8, out_channels=8, kernel_size=3, stride=1, groups = 2),
    nn.MaxPool2d(kernel_size=2, stride=2),  # Adding MaxPooling
    nn.ReLU(),
    nn.Conv2d(in_channels=8, out_channels=8, kernel_size=3, stride=1, groups = 2),
    nn.MaxPool2d(kernel_size=2, stride=2),  # Adding MaxPooling
    nn.ReLU(),
    nn.Conv2d(in_channels=8, out_channels=8, kernel_size=3, stride=1),
    nn.MaxPool2d(kernel_size=2, stride=2),  # Adding MaxPooling
    nn.ReLU(),
    nn.Conv2d(in_channels=8, out_channels=8, kernel_size=3, stride=1),
    nn.Flatten()
)

In [6]:
class CustomCNN(nn.Module):
    def __init__(self, state_size= 4, image_size=(4, 84, 84), action_size=2, dropout_prob= 0.3):
        super(CustomCNN, self).__init__()
        self.image_size = image_size
        self.state_size = state_size
        self.action_size = action_size
        self.dropout_prob = dropout_prob

        # Define the CNN layers for processing the stack of frames
        self.conv_layers = nn.Sequential(
            nn.Conv2d(in_channels=4, out_channels=32, kernel_size=3, stride=1),
            nn.MaxPool2d(kernel_size=2, stride=2),  # Adding MaxPooling
            nn.ReLU(),
            nn.Dropout(p=self.dropout_prob),  # Adding dropout after ReLU
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=1, groups = 8),
            nn.MaxPool2d(kernel_size=2, stride=2),  # Adding MaxPooling
            nn.ReLU(),
            nn.Dropout(p=self.dropout_prob),  # Adding dropout after ReLU
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=1, groups = 4),
            nn.MaxPool2d(kernel_size=2, stride=2),  # Adding MaxPooling
            nn.ReLU(),
            nn.Dropout(p=self.dropout_prob),  # Adding dropout after ReLU
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=1, groups = 2),
            nn.MaxPool2d(kernel_size=2, stride=2),  # Adding MaxPooling
            nn.ReLU(),
            nn.Dropout(p=self.dropout_prob),  # Adding dropout after ReLU
            nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Dropout(p=self.dropout_prob),  # Adding dropout after ReLU
            nn.Flatten()
        )

        # Calculate the input size for the fully connected layers after CNN
        conv_output_size = self._get_conv_output_size(self.image_size)


        # Define the fully connected layers for Q-values and state replication
        self.fc_q_values = nn.Sequential(
            nn.Linear(128, self.action_size)
        )

        self.fc_replicate_state = nn.Sequential(
            nn.Linear(128, state_size)  # Adjust output size for state replication
        )

        self.fc_common = nn.Sequential(
            nn.Linear(conv_output_size, 128),  # Adjust output size for state replication
            nn.ReLU()
        )

    def _get_conv_output_size(self, shape):
        dummy_input = torch.rand(1, *shape)
        dummy_output = self.conv_layers(dummy_input)
        return dummy_output.size(1)

    def forward(self, x):
        x = self.conv_layers(x)
        x = self.fc_common(x)
        replicated_state = self.fc_replicate_state(x)
        q_values = self.fc_q_values(x)
        return q_values, replicated_state

In [7]:
class Memory:
    def __init__(self, maxlen):
        """
        Initializes a Memory object with a deque.
        This object allows for management of temporal difference error (TDE) and prioritiezed experience replay.

        Args:
            maxlen (int): Maximum length of the memory.

        Returns:
            None
        """
        self.memory = deque(maxlen=maxlen)

    def __len__(self):
        """
        Returns the length of the deque.

        Returns:
            int: Length of the memory deque.
        """
        return len(self.memory)

    def remember(self, observation, action, reward, next_observation, done, tde, auxiliary_data=None):
        """
      remembers an experience in the agent's memory.

      Args:
          observation (object): The current observation/state.
          action (int): The action taken in the current state.
          reward (float): The reward received after taking the action.
          next_observation (object): The next observation/state after taking the action.
          done (bool): A flag indicating if the episode terminates after this step.
          auxiliary_data (object, optional): Additional data associated with the experience. Defaults to None.

      Returns:
          float: The Temporal Difference Error (tde) computed from the given experience.
        """
        if auxiliary_data is not None:
            if not isinstance(auxiliary_data, torch.Tensor):
                auxiliary_data_torch = torch.from_numpy(auxiliary_data).float().unsqueeze(0)
            else:
                auxiliary_data_torch = auxiliary_data
            self.memory.append((observation, action, reward, next_observation, done, auxiliary_data_torch, tde))
        else:
            self.memory.append((observation, action, reward, next_observation, done, None, tde))

    def sample_batch_with_priority(self, batch_size, beta):
        """
        Samples a batch with priority based on the calculated probabilities.

        Args:
            batch_size (int): Size of the batch to be sampled.
            beta (float): Parameter used to calculate probabilities.

        Returns:
            list: A batch of experiences sampled with priority.
        """
        tde = self.get_all_tde()
        prob = self.calculate_probabilities(tde, beta)
        indices = np.random.choice(len(self.memory), size=batch_size, p=prob, replace=False)
        minibatch = [self.memory[i] for i in indices]
        return minibatch, indices


    def get_all_tde(self):
        tde = abs(np.array([item[-1] for item in self.memory]))
        return tde

    def sort_memory_by_tde(self):

      tde_values = np.array([item[-1] for item in self.memory])
      indices = np.argsort(tde_values)
      self.memory = deque([self.memory[i] for i in indices], maxlen=self.memory.maxlen)

    def calculate_probabilities(self, tde, beta):
        exp_tde = np.exp(beta * (tde - np.median(tde)))
        prob = exp_tde / np.sum(exp_tde)
        return prob

    def update_td_errors_by_index(self, indices, targets, originals):
        for i,index in enumerate(indices):
            _, action, _, _, _, _, _ = self.memory[index]  # Retrieve action from memory

            td_error = targets[i][action].detach().numpy() - originals[i][action].detach().numpy()

            updated_entry = list(self.memory[index])
            updated_entry[-1] = td_error
            self.memory[index] = tuple(updated_entry)

We have deque that serves as the replay memory, where each element in the replay memory is a tuple consisting of state, action, next state, reward, and TDE. Additionally, each state in these tuples is represented as a deque object that holds a stack of four images. This structure is used to store and manage experiences for training your agent.

In [8]:
class DQNAgent(nn.Module):
    def __init__(self, image_size, action_size, state_size = 4, maxlen=2048, minlen = 1024,epsilon = 1.0,epsilon_min = 0.01,epsilon_decay = 0.99,gamma = 0.90, update_frequency= 100):
        super(DQNAgent, self).__init__()
        self.image_size = image_size # (4, 84, 84) for a image
        self.state_size = state_size #  (1,4) for gym states (cart-pole)
        self.action_size = action_size
        self.maxlen = maxlen
        self.memory = Memory(maxlen=self.maxlen)
        self.minlen= minlen

        self.frame_handler = FrameHandler()

        self.gamma = gamma  # Discount factor

        self.epsilon = epsilon  # Exploration rate
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay

        self.model = self._build_model()
        if self.model.dropout_prob == 0:
          self.DropOut = False
        else:
          self.DropOut = True
        self.target_model = self._build_model()
        self.update_target_model()
        self.target_model.eval()

        self.total_steps = 0
        self.update_frequency = update_frequency




    def update_target_model_(self):
      with torch.no_grad():
        self.target_model.load_state_dict(self.model.state_dict())
        input_data = torch.rand(1, *self.image_size)
        assert torch.allclose(self.model(input_data), self.target_model(input_data), atol=1e-6)
        self.target_model.eval()

    def update_target_model(self):
        with torch.no_grad():
            self.target_model.load_state_dict(self.model.state_dict())


            if not self.DropOut:
              input_data = torch.rand(1, *self.image_size)
              model_outputs = self.model(input_data)
              target_model_outputs = self.target_model(input_data)

              # Separate the outputs if they're tuples
              model_output_1, model_output_2 = model_outputs
              target_model_output_1, target_model_output_2 = target_model_outputs


              # Assert for the first output
              assert torch.allclose(model_output_1, target_model_output_1, atol=1e-6)



              # Assert for the second output
              assert torch.allclose(model_output_2, target_model_output_2, atol=1e-6)

            self.target_model.eval()





    def _build_model(self):
      return CustomCNN(image_size= self.image_size, action_size= self.action_size)

    def forward(self, x, mode='primary'):
        # accepts numpy array or PyTorch, can choose with DQN policy to use

        if mode == 'primary':
            model = self.model
        elif mode == 'target':
            model = self.target_model
        else:
            raise ValueError('unidentified forward calc mode')

        if isinstance(x, np.ndarray):
            x = torch.from_numpy(x).float().unsqueeze(0)
        elif not isinstance(x, torch.Tensor):
            raise TypeError("Input must be a Numpy array or a PyTorch tensor")

        output = model(x)

        # Handle tuple output, if any, by unpacking and squeezing
        if isinstance(output, tuple):
            return tuple(o.squeeze() for o in output)
        else:
            return output.squeeze()


    def act(self, observation):
        output = self.forward(observation, mode='primary')
        if isinstance(output, tuple):
            q_values = output[0]
        else:
            q_values = output

        # Calculate action probabilities using softmax with temperature (epsilon)
        action_probs = F.softmax(q_values / self.epsilon, dim=-1)

        # Sample an action according to the probability distribution
        action = np.random.choice(self.action_size, p=action_probs.detach().numpy())

        return action


    def td_error(self, observation, action, reward, next_observation, done):
      # Get Q-values for the current and next observations from the policy network
      q_values_observation,_ = self.forward(observation)
      q_values_next_observation,_ = self.forward(next_observation)

      # Calculate the target Q-value using the Q-learning update rule
      target = q_values_observation.clone().detach()  # Detach from computational graph

      if not done:
          max_next_q_value = torch.max(q_values_next_observation) # small bug: this TDE is for Q-learning not expected sarsa, but I guess for initialization that it is used, it does not matter too much.
          tde = q_values_observation[action] - (reward + self.gamma * max_next_q_value)
      else:
          tde = q_values_observation[action] - reward
      tde = tde.detach().item()

      assert isinstance(tde, (np.generic, float)) and np.isscalar(tde), f"tde should be a NumPy scalar tde: {tde}, act {action} reward {reward} ntype: {type(tde)}"

      return tde


    def refresh_memory(self):
        temp_storage = self.memory.memory.copy()
        self.memory.memory.clear()
        for observation, action, reward, next_observation, done, auxiliary_data, _ in temp_storage:
            tde = self.td_error(observation, action, reward, next_observation, done)
            self.memory.remember(
                          observation=observation,
                          action=action,
                          reward=reward,
                          next_observation=next_observation,
                          done=done,
                          tde=tde,
                          auxiliary_data=auxiliary_data
                      )
        self.memory.sort_memory_by_tde()
        print('Memory refreshed / sorted')

    def replay(self, batch_size, optimizer, beta=1):
        assert self.minlen > batch_size
        if len(self.memory) < self.minlen:
            return None, None

        minibatch, indices = self.memory.sample_batch_with_priority(batch_size=batch_size, beta=beta)

        targets = torch.zeros(batch_size, self.action_size)
        originals = torch.zeros(batch_size, self.action_size)
        aux_outputs = torch.zeros(batch_size, self.state_size)  #
        auxiliary_data_ = torch.zeros(batch_size,  self.state_size)  #

        for i, (observation, action, reward, next_observation, done, auxiliary_data,tde) in enumerate(minibatch):
            # Forward pass to get Q-values and auxiliary output if available
            if auxiliary_data is not None:
                q_values_observation, aux_output = self.forward(observation, mode='primary')
                q_values_next_observation, _ = self.forward(next_observation, mode='target')
            else:
                q_values_observation = self.forward(observation, mode='primary')
                q_values_next_observation = self.forward(next_observation, mode='target')

            assert q_values_observation is not None
            assert q_values_next_observation is not None
            originals[i] = q_values_observation.clone()  # Store original Q-values
            targets[i] = q_values_observation.clone().detach()

            if auxiliary_data is not None:
                # Use the auxiliary output for the second output in the network
                # Modify this section to suit how the auxiliary output affects target calculations
                aux_outputs[i] = aux_output
                auxiliary_data_[i] = auxiliary_data

            if not done:

                if True: #boltzmann on-policy sarsa
                  # Calculate action probabilities using Boltzmann distribution
                  action_probs = F.softmax(q_values_next_observation / self.epsilon, dim=-1)

                  # Calculate the average Q-value based on the Boltzmann policy
                  avg_next_q_value = torch.sum(action_probs * q_values_next_observation)

                  # Update the target based on the average Q-value
                  targets[i][action] = reward + self.gamma * avg_next_q_value

            else:
                targets[i][action] = reward

        loss_q = nn.MSELoss()(originals, targets)

        if auxiliary_data is not None:
            # Calculate auxiliary loss if auxiliary_data is available
            # Modify this section according to how the auxiliary output influences the loss calculation
            auxiliary_loss = nn.MSELoss()(aux_output, auxiliary_data)
            loss = loss_q + auxiliary_loss
        else:
            # Use only the Q-value loss if there's no auxiliary data
            loss = loss_q

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if self.total_steps % self.update_frequency == 0:
          self.update_target_model()  # Update the target model
          print('update target')
          self.refresh_memory()
          print('memory refresh')




        self.total_steps += 1

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

        if False:  # keep cycling from hot to cold (these if True statements need to be coded properly with a flag to turn on or off later)
            if self.epsilon < self.epsilon_min or self.epsilon == self.epsilon_min:
                self.epsilon = 1


        # Update TD errors for elements used in the minibatch
        self.memory.update_td_errors_by_index(indices, targets, originals)

        return loss.item(), originals




Custom reward (not used/tested)

Hyper pararmeter optimization via Optuna. Currently inactive, if you want to use this say for  `epsilon_min`, use

    epsilon_min = trial.suggest_float('epsilon_min', 1e-3, 1e-1, log=True)

and change `n_trials` in to set the number of search iterations

    study.optimize(lambda trial: objective(trial, writer), n_trials=1)  # Adjust the number of trials



In [9]:
def objective(trial, writer):

    NUM_EPISODES = 600

    # env = gym.make('CartPole-v1', new_step_api=True)
    if True:
      env = gym.make("CartPole-v1",  render_mode="rgb_array")
    else:
      env = gym.make('CartPole-v1')

    image_size = (4,84,84)
    state_size = env.observation_space.shape[0]

    action_size = env.action_space.n

    # Define hyperparameters to search
    epsilon = 20
    epsilon_min = 1 # trial.suggest_float('epsilon_min', 1e-3, 1e-1, log=True)
    epsilon_decay = 0.9955 # trial.suggest_float('epsilon_decay',0.98, 0.99, log=True)
    gamma = 0.95 # trial.suggest_float('gamma',0.9, 1, log=True)



    lr = 0.0025
    batch_size = 64
    beta = 0.8

    # Initialize your DQNAgent with the required parameters
    agent = DQNAgent(
        image_size=image_size,
        action_size=action_size,
        minlen=2048,
        maxlen=2048,
        epsilon=epsilon,
        epsilon_min=epsilon_min,
        epsilon_decay=epsilon_decay,
        gamma=gamma
    )

    if False:
      # Assuming you have the path to your saved model
      saved_model_path = '/content/drive/My Drive/best_model.pth'

      # Load the saved model state dict
      best_model_state_dict = torch.load(saved_model_path)

      # Set the loaded state dict to your agent's model
      agent.model.load_state_dict(best_model_state_dict)
      agent.target_model.load_state_dict(best_model_state_dict)  # If needed for target model

      # Ensure evaluation mode for inference
      agent.target_model.eval()  # If needed for target model




    # Define optimizer with suggested learning rate
    optimizer = optim.AdamW(agent.model.parameters(), lr=lr)


    best_episode = 0
    best_time = 0
    T = np.zeros(NUM_EPISODES)
    global_step = 0
    # Training loop
    for episode in range(NUM_EPISODES):
        total_loss = 0  # Initialize loss for each episode
        N= 0

        # Perform training steps (agent.replay(), etc.) here
        state = env.reset()
        frame = env.render()

        agent.frame_handler.initialize_frame_stack(frame)
        observation = agent.frame_handler.convert_frame_stack_to_tensor()

        for time in range(500):

            action = agent.act(observation)

            new_state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated

            if terminated:
              reward = reward - 10

            next_frame = env.render()
            agent.frame_handler.update_framestack(next_frame)
            next_observation = agent.frame_handler.convert_frame_stack_to_tensor()

            # reward = custom_reward(next_state)

            if time > 3:
              tde = agent.td_error(observation, action, reward, next_observation, done)
              agent.memory.remember(
                  observation=observation,
                  action=action,
                  reward=reward,
                  next_observation=next_observation,
                  done=done,
                  tde=tde,
                  auxiliary_data = state)
              writer.add_scalar('TDError', tde, global_step)  # Log the loss for each episode

            loss, originals = agent.replay(batch_size=batch_size, optimizer=optimizer, beta=beta)



            if loss is not None:
              total_loss += loss
              N+=1
              writer.add_scalar('Loss', loss, global_step)  # Log the loss for each episode

              flat_originals = originals.view(-1)  # Flatten the tensor
              median_originals = torch.median(flat_originals)
              quantiles = torch.quantile(flat_originals, torch.tensor([0.25, 0.5, 0.75]))

              # Log median and quantiles using SummaryWriter
              writer.add_scalar('Q_values/Quantile_25', quantiles[0], global_step)
              writer.add_scalar('Q_values/Quantile_50', quantiles[1], global_step)
              writer.add_scalar('Q_values/Quantile_75', quantiles[2], global_step)



            observation = next_observation.clone()
            state = new_state
            global_step+= 1

            if done:
              break


        # Log values to TensorBoard
        writer.add_scalar('Epsilon', agent.epsilon, global_step)
        writer.add_scalar('Time', time, global_step)
        writer.add_scalar('espisode', episode, global_step)

        T[episode] = time

        if N!= 0:
            average_loss = total_loss / N  # Calculate average loss per episode
        else:
            average_loss = np.inf

        print(f"Episode: {episode + 1}/{NUM_EPISODES}, survival time: {time:.0f}, Average Loss: {average_loss:.4f}, epsilon: {agent.epsilon:.5f}")


    env.close()


    trial_data[trial.number] = {
        'model_state_dict': agent.model.state_dict().copy(),
    }
    # Return the value to minimize (average loss)
    return -np.percentile(T, 90)-np.max(T)




# Initialize TensorBoard writer
writer = SummaryWriter(log_dir=log_dir)
trial_data = {}


# Create a study object and optimize hyperparameters
study = optuna.create_study(direction='minimize')
# study.optimize(objective, n_trials=10)  # You can adjust the number of trials
study.optimize(lambda trial: objective(trial, writer), n_trials=1)  # Adjust the number of trials

# Close TensorBoard writer
# writer.close()

# Get the best hyperparameters found during optimization
best_params = study.best_params
print('Best hyperparameters:', best_params)

# Find the best trial based on your objective function
best_trial_number = study.best_trial.number

# Retrieve data for the best trial
best_trial_data = trial_data[best_trial_number]

# Save the model state dictionary of the best trial
best_model_state_dict = best_trial_data['model_state_dict']
torch.save(best_model_state_dict, '/content/drive/My Drive/best_model.pth')

writer.close()
%tensorboard --logdir=runs

[I 2023-12-19 14:33:41,611] A new study created in memory with name: no-name-1609780c-835e-4fe0-95e5-39e4cfd34eff


Episode: 1/600, survival time: 31, Average Loss: inf, epsilon: 20.00000
Episode: 2/600, survival time: 10, Average Loss: inf, epsilon: 20.00000
Episode: 3/600, survival time: 26, Average Loss: inf, epsilon: 20.00000
Episode: 4/600, survival time: 12, Average Loss: inf, epsilon: 20.00000
Episode: 5/600, survival time: 16, Average Loss: inf, epsilon: 20.00000
Episode: 6/600, survival time: 11, Average Loss: inf, epsilon: 20.00000
Episode: 7/600, survival time: 15, Average Loss: inf, epsilon: 20.00000
Episode: 8/600, survival time: 29, Average Loss: inf, epsilon: 20.00000
Episode: 9/600, survival time: 12, Average Loss: inf, epsilon: 20.00000
Episode: 10/600, survival time: 32, Average Loss: inf, epsilon: 20.00000
Episode: 11/600, survival time: 10, Average Loss: inf, epsilon: 20.00000
Episode: 12/600, survival time: 27, Average Loss: inf, epsilon: 20.00000
Episode: 13/600, survival time: 16, Average Loss: inf, epsilon: 20.00000
Episode: 14/600, survival time: 33, Average Loss: inf, epsil

  return F.mse_loss(input, target, reduction=self.reduction)


update target
Memory refreshed / sorted
memory refresh
Episode: 105/600, survival time: 25, Average Loss: 36.4675, epsilon: 19.73121
Episode: 106/600, survival time: 18, Average Loss: 21.5561, epsilon: 18.11081
Episode: 107/600, survival time: 20, Average Loss: 28.3127, epsilon: 16.47420
Episode: 108/600, survival time: 18, Average Loss: 15.3696, epsilon: 15.12127
Episode: 109/600, survival time: 15, Average Loss: 21.2289, epsilon: 14.06853
Episode: 110/600, survival time: 21, Average Loss: 13.6062, epsilon: 12.73962
update target
Memory refreshed / sorted
memory refresh
Episode: 111/600, survival time: 36, Average Loss: 13.9540, epsilon: 10.78160
Episode: 112/600, survival time: 11, Average Loss: 14.6749, epsilon: 10.21359
Episode: 113/600, survival time: 8, Average Loss: 10.7686, epsilon: 9.80731
Episode: 114/600, survival time: 14, Average Loss: 9.3312, epsilon: 9.16577
Episode: 115/600, survival time: 24, Average Loss: 11.1247, epsilon: 8.18842
update target
Memory refreshed / sort

[W 2023-12-19 14:40:56,689] Trial 0 failed with parameters: {} because of the following error: KeyboardInterrupt().
Traceback (most recent call last):
  File "C:\School\Semester 6 AI\venv\Lib\site-packages\optuna\study\_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "C:\Users\timoo\AppData\Local\Temp\ipykernel_17120\1873511191.py", line 168, in <lambda>
    study.optimize(lambda trial: objective(trial, writer), n_trials=1)  # Adjust the number of trials
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\timoo\AppData\Local\Temp\ipykernel_17120\1873511191.py", line 105, in objective
    loss, originals = agent.replay(batch_size=batch_size, optimizer=optimizer, beta=beta)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\timoo\AppData\Local\Temp\ipykernel_17120\946891037.py", line 209, in replay
    loss.backward()
  File "C:\School\Semester 6

KeyboardInterrupt: 

In [None]:
%tensorboard --logdir=runs