This notebook performs hyperparameter tuning of a DDQN agent on a fairly basic environment, CartPole-v0. Besides showing how different hyperparameters affect the agent's learning curve, the project serves as a sanity check of my agent implementation, which will later be used to solve more complex environments.

In [1]:
import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from utils import create_settings, create_key, save_data
import sys
sys.path.insert(1, '/home/axelbm23/Code/ML_AI/Algos/ReinforcementLearning/')
from agents import DDQN
import time
import tensorflow as tf
import tf_keras
from typing import Any, Optional

2024-07-19 09:54:53.788747: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-19 09:54:53.817672: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


We will analyze the following parameters, as they seem to be the ones that affect the results the most. For each parameter setting we will produce 3 runs, since RL algorithms are more heavily influenced by the initial conditions (i.e. network weights) than other techniques. The parameters we will play with are:
1) Complexity of the network, i.e. number of layers and number of nodes per layer
2) Learning rate of the optimizer
3) Batch size
4) Type of update on the target network, either soft update or hard copy
5) Greedy step, i.e. how long the exploration phase is
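
To get a feel for what the greedy step means in practice, here is a minimal sketch. It assumes that the exploration parameter starts at 1 and is multiplied by the greedy step once per environment step. That assumption is consistent with the training logs in this notebook (0.999^27 ≈ 0.97335 after a first episode of 27 steps), but it has not been checked against the agent's code. The helper name `steps_until` and the 0.05 threshold are purely illustrative.

```python
import math


def steps_until(epsilon_target: float, greedy_step: float,
                epsilon_start: float = 1.0) -> int:
    """Number of multiplicative decay steps needed for epsilon to reach epsilon_target.

    Assumes epsilon_t = epsilon_start * greedy_step ** t (hypothetical decay rule).
    """
    return math.ceil(math.log(epsilon_target / epsilon_start) / math.log(greedy_step))


# The default greedy step (0.999) and the two alternatives explored in this notebook
for greedy_step in (0.99, 0.9985, 0.999):
    n_steps = steps_until(0.05, greedy_step)
    print(f"greedy_step={greedy_step}: epsilon <= 0.05 after {n_steps} steps")

# Cross-check against the first logged episode (27 steps with greedy_step=0.999)
eps_after_27 = round(0.999 ** 27, 5)
print(f"epsilon after 27 steps: {eps_after_27}")
```

Under this assumption, a greedy step closer to 1 stretches the exploration phase considerably, so with only a handful of short episodes the agent acts almost entirely at random.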

In [2]:
# Set up the default values
N_ITERATIONS = 3
GAMMA = 0.99
GREEDY_STEP = 999e-3
EPISODES = 5
BUFF_SIZE = 1_000
BATCH_SIZE = 64
NN_COPY_CADENCY = 10
SOFT_UPDATE = 0.005
NEURONS = [128]*2
ACT_AS_IN = False
ADD_LOGS = False
LOSS_FUNC = 'mean_squared_error'
ADAM_LR = 0.001
OUTPUT_PATH = f'{os.getcwd()}/results/rewards_losses'

# According to openai/gym/wiki
# Cartpole-v0 is solved when it reaches an average reward
# of 195 over 100 consecutive episodes
env = gym.make("CartPole-v0")
# NOTE: a single optimizer instance is created here. Keras optimizers are
# tied to the variables of the first model they are used with.
def_nn_arch = {'neurons': NEURONS,
               'action_as_input': ACT_AS_IN,
               'loss_function': LOSS_FUNC,
               'optimizer': tf.keras.optimizers.Adam(learning_rate=ADAM_LR),
               }

def_agent = {'gamma': GAMMA,
             'greedy_step': GREEDY_STEP,
             'environment': env,
             'episodes': EPISODES,
             'buff_size': BUFF_SIZE,
             'replay_mini_batch': BATCH_SIZE,
             'nn_copy_cadency': NN_COPY_CADENCY,
             'nn_architecture': def_nn_arch,
             'soft_update': SOFT_UPDATE,
             'add_logs': ADD_LOGS}

  logger.deprecation(
2024-07-19 09:54:56.609041: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.

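The solving criterion quoted in the cell above (an average reward of 195 over 100 consecutive episodes) can be checked on a list of episode rewards with a rolling window. The sketch below is illustrative only: `first_solved_episode` is a hypothetical helper, not part of `utils` or the agent, and the reward list is synthetic. Unlike the `avg_rew(100)` value printed in the training logs, it only reports a result once a full window of 100 episodes is available.

```python
from collections import deque
from typing import Iterable, Optional


def first_solved_episode(rewards: Iterable[float],
                         threshold: float = 195.0,
                         window: int = 100) -> Optional[int]:
    """Index of the first episode at which the mean reward over the last
    `window` episodes reaches `threshold`, or None if that never happens."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` rewards
    for episode, reward in enumerate(rewards):
        recent.append(reward)
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None


# Synthetic example: 50 poor episodes followed by 150 perfect ones
synthetic_rewards = [20.0] * 50 + [200.0] * 150
print(first_solved_episode(synthetic_rewards))
```
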
In [3]:
# Create different settings for each one of the experiments
network_arch_sett = ([64]*2, [256]*4)
lr_sett = [0.1, 0.05]
batch_siz_sett = [32, 128]
target_update_sett = [(25, 0.01), (None, 0.005), (None, 0.01)]
greedy_step_sett = [99e-2, 9985e-4]
# NOTE: batch_siz_sett is defined above but is not part of `experiment`,
# so the batch size is not varied by the training loop below.
experiment = {'nn_arch': network_arch_sett,
              'lr': lr_sett,
              'target_update': target_update_sett,
              'greedy_step': greedy_step_sett,
              'default': [None]}
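
For reference, the two target-network update styles being compared can be sketched as follows. This is a generic illustration, not the `DDQN` class's actual code: plain lists of floats stand in for network weights, and the function names are hypothetical. Reading the tuples in `target_update_sett` as (copy cadence, soft-update factor) is also an assumption.

```python
from typing import List

Weights = List[float]


def hard_copy(policy: Weights) -> Weights:
    """Hard update: the target network becomes an exact copy of the policy network."""
    return list(policy)


def soft_update(policy: Weights, target: Weights, tau: float) -> Weights:
    """Soft (Polyak) update: the target moves a fraction tau toward the policy."""
    return [tau * p + (1.0 - tau) * t for p, t in zip(policy, target)]


policy_weights = [1.0, -2.0, 0.5]
target_weights = [0.0, 0.0, 0.0]

soft = soft_update(policy_weights, target_weights, tau=0.01)
hard = hard_copy(policy_weights)
print("after one soft update:", soft)
print("after one hard copy:  ", hard)
```

A hard copy applied every N steps changes the target abruptly, while a soft update applied at every step lets it drift slowly; a small factor such as 0.005 or 0.01 keeps the target close to its previous value.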

Train our agent for each set of parameters

In [4]:
for key,val in experiment.items():
   for sett_i in val:
      sett_key = create_key(sett_i)
      for it in range(N_ITERATIONS):
         ag_sett = create_settings(key, sett_i,def_nn_arch, def_agent)
         model_label = f'{key}_{sett_key}_iter_{it}' if key != 'default' else f'{key}_iter_{it}'
         # Initialize the class again as the network weights need to be random
         ddqn = DDQN(sett=ag_sett)
         t1 = time.time()
         rewards, losses, logs = ddqn.learn()
         exec_time = round(time.time()-t1, 3)
         
         # Save all the information we need for the post analysis,
         # basically execution time, rewards and losses
         save_data(rewards, losses, exec_time, model_label, OUTPUT_PATH)
         
print('All data has been created')

Hard copy policy_weights to target_weights
episode 0/4, greedy_param=0.97335 reward=27.0, avg_rew=27.0, avg_rew(100)=27.0
episode 1/4, greedy_param=0.96077 reward=13.0, avg_rew=20.0, avg_rew(100)=20.0


I0000 00:00:1721375697.568092  707139 service.cc:145] XLA service 0x7f975001e140 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1721375697.568127  707139 service.cc:153]   StreamExecutor device (0): NVIDIA GeForce RTX 4050 Laptop GPU, Compute Capability 8.9
2024-07-19 09:54:57.581627: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-07-19 09:54:57.628881: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8907
I0000 00:00:1721375698.444056  707139 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


episode 2/4, greedy_param=0.92585 reward=37.0, avg_rew=25.667, avg_rew(100)=25.667
episode 3/4, greedy_param=0.91389 reward=13.0, avg_rew=22.5, avg_rew(100)=22.5
episode 4/4, greedy_param=0.89578 reward=20.0, avg_rew=22.0, avg_rew(100)=22.0
Hard copy policy_weights to target_weights
episode 0/4, greedy_param=0.96462 reward=36.0, avg_rew=36.0, avg_rew(100)=36.0
episode 1/4, greedy_param=0.95216 reward=13.0, avg_rew=24.5, avg_rew(100)=24.5


ValueError: Unknown variable: <KerasVariable shape=(4, 64), dtype=float32, path=sequential_2/dense_6/kernel>. This optimizer can only be called for the variables it was originally built with. When working with a new set of variables, you should recreate a new optimizer instance.
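
The `ValueError` above is raised by Keras when an optimizer that has already been built for one model's variables is asked to update the variables of another model. A likely cause here, though not verified against `create_settings` or the `DDQN` class, is that the single `Adam` instance stored in `def_nn_arch` is reused by every new agent. The sketch below illustrates the failure mode and one way around it with a pure-Python stand-in: `StubOptimizer`, `train_once` and `make_optimizer` are hypothetical and only mimic the binding behaviour; they are not Keras APIs.

```python
from typing import Iterable, Optional, Set


class StubOptimizer:
    """Hypothetical stand-in: binds to the first set of variables it updates."""

    def __init__(self, learning_rate: float) -> None:
        self.learning_rate = learning_rate
        self._known: Optional[Set[str]] = None

    def apply(self, variable_names: Iterable[str]) -> None:
        names = set(variable_names)
        if self._known is None:
            self._known = names  # first use: remember these variables
        elif not names <= self._known:
            raise ValueError(f"Unknown variable(s): {sorted(names - self._known)}")


def train_once(optimizer: StubOptimizer, model_id: int) -> None:
    """Pretend to run one gradient step on a freshly created model."""
    optimizer.apply([f"sequential_{model_id}/dense/kernel"])


# Shared instance: works for the first model, fails for the second one
shared = StubOptimizer(learning_rate=0.001)
train_once(shared, model_id=0)
try:
    train_once(shared, model_id=1)
    shared_failed = False
except ValueError as err:
    shared_failed = True
    print("shared optimizer:", err)


def make_optimizer() -> StubOptimizer:
    """Factory: every run gets its own, not yet bound optimizer."""
    return StubOptimizer(learning_rate=0.001)


fresh_runs = 0
for model_id in range(3):
    train_once(make_optimizer(), model_id)
    fresh_runs += 1
print("runs completed with fresh optimizers:", fresh_runs)
```

In the notebook, the equivalent change would be to build a new `tf.keras.optimizers.Adam(learning_rate=...)` for each agent instead of sharing the one in `def_nn_arch`; where exactly that belongs depends on how `create_settings` assembles the settings.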