This notebook performs hyperparameter tuning of a DDQN agent on a fairly basic environment, CartPole-v0. Besides showing how different hyperparameters affect the agent's learning curve, the project serves as a sanity check of my agent implementation, which will later be used to solve more complex environments.

In [1]:
import glob
import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from utils import create_settings, create_key, save_data, solve_metric
import sys
sys.path.insert(1,'/home/axelbm23/Code/ML_AI/Algos/ReinforcementLearning/')
from agents import DDQN,Agent_Performance
import time
import tensorflow as tf
import tf_keras
from typing import Any,Optional


We will analyze five parameters, as they seem to be the ones that affect the results the most. For each parameter setting, we will produce 3 runs, as RL algorithms are more heavily influenced by the initial conditions (i.e. network weights) than other techniques. The parameters we will play with are:
1) Complexity of the network, i.e. number of layers and number of nodes per layer
2) Learning rate of our optimizer
3) Batch size
4) Type of update on the target network, either soft update or hard copy
5) Greedy step, i.e. how long the exploration phase is
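
The difference between the two target-network update types in item 4 can be sketched as follows. This is a minimal NumPy illustration, not the `DDQN` class used in this notebook: the function names `hard_copy` and `soft_update` and the toy weight lists are assumptions made for the example, and `tau` plays the role that `SOFT_UPDATE` appears to play in the settings.

```python
import numpy as np


def hard_copy(policy_weights):
    """Hard copy: the target network becomes an exact copy of the policy network."""
    return [w.copy() for w in policy_weights]


def soft_update(policy_weights, target_weights, tau):
    """Soft (Polyak) update: move each target tensor a fraction tau toward the policy tensor."""
    return [tau * p + (1.0 - tau) * t
            for p, t in zip(policy_weights, target_weights)]


# Toy "networks": one weight matrix and one bias vector each (illustrative only)
policy = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]

target_soft = soft_update(policy, target, tau=0.005)  # small step at every update
target_hard = hard_copy(policy)                       # full overwrite every N updates
```

A hard copy keeps the target fixed between copies and then changes it abruptly, while a soft update changes it slightly at every step; the experiments below compare the two.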

In [2]:
# Set up the default values
N_ITERATIONS = 3
GAMMA = 0.99
GREEDY_STEP = 999e-3
EPISODES = 3
BUFF_SIZE = 1_000
BATCH_SIZE = 64
NN_COPY_CADENCY = 10
SOFT_UPDATE = 0.005
NEURONS = [128]*2
ACT_AS_IN = False
ADD_LOGS = False
LOSS_FUNC = 'mean_squared_error'
ADAM_LR = 0.001
OUTPUT_PATH = f'{os.getcwd()}/results/rewards_losses'
SOLVED = 195


# According to openai/gym/wiki
# Cartpole-v0 is solved when it reaches an average reward
# of 195 over 100 consecutive episodes
env = gym.make("CartPole-v0")
def_nn_arch = {'neurons': NEURONS,
               'action_as_input': ACT_AS_IN,
               'loss_function': LOSS_FUNC,
               'optimizer': tf.keras.optimizers.Adam(learning_rate=ADAM_LR),
               }

def_agent = {'gamma': GAMMA,
             'greedy_step': GREEDY_STEP,
             'environment': env,
             'episodes': EPISODES,
             'buff_size': BUFF_SIZE,
             'replay_mini_batch': BATCH_SIZE,
             'nn_copy_cadency': NN_COPY_CADENCY,
             'nn_architecture': def_nn_arch,
             'soft_update': SOFT_UPDATE,
             'add_logs': ADD_LOGS}


In [3]:
# Create different settings for each one of the experiments
network_arch_sett = ([64]*2, [256]*4)
lr_sett = [0.1, 0.05]
batch_siz_sett = [32, 128]
target_update_sett = [(25, 0.01), (None, 0.005), (None, 0.01)]
greedy_step_sett = [99e-2, 9985e-4]
# NOTE: batch_siz_sett is defined above but not registered in the dict below,
# so batch size is not swept in this run
experiment = {'nn_arch': network_arch_sett,
              'lr': lr_sett,
              'target_update': target_update_sett,
              'greedy_step': greedy_step_sett,
              'default': [None]}
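
Each setting is turned into a file label when results are saved, and those labels are later grouped by splitting on `_iter`. The sketch below shows one way such labels could be built and parsed. `make_key` is a hypothetical stand-in for `create_key` from `utils` (whose source is not shown here); joining the parts with `-` is an assumption inferred from the column names in the results table.

```python
def make_key(setting):
    """Hypothetical stand-in for create_key: join the parts of a setting with '-'."""
    if isinstance(setting, (list, tuple)):
        return '-'.join(str(s) for s in setting)
    return str(setting)


def make_label(name, setting, iteration):
    """Build a run label; the default run carries no setting key."""
    if name == 'default':
        return f'{name}_iter_{iteration}'
    return f'{name}_{make_key(setting)}_iter_{iteration}'


def group_id(label):
    """Recover the parameter-setting id by dropping the iteration suffix."""
    return label.split('_iter')[0]


labels = [make_label('nn_arch', [64, 64], 0),
          make_label('target_update', (None, 0.005), 2),
          make_label('default', None, 1)]
groups = [group_id(label) for label in labels]
```

Keeping the iteration index as a suffix lets all runs of one setting share the same prefix, which makes grouping them a single string split.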

Train our agent for each set of parameters

In [4]:
for key, val in experiment.items():
    for sett_i in val:
        sett_key = create_key(sett_i)
        for it in range(N_ITERATIONS):
            ag_sett = create_settings(key, sett_i, def_nn_arch, def_agent)
            model_label = f'{key}_{sett_key}_iter_{it}' if key != 'default' else f'{key}_iter_{it}'
            # Initialize the class again as the network weights need to be random
            ddqn = DDQN(sett=ag_sett)
            t1 = time.time()
            rewards, losses, logs = ddqn.learn()
            exec_time = round(time.time() - t1, 3)

            # Save all the information we need for the post analysis,
            # basically execution time, rewards and losses
            save_data(rewards, losses, exec_time, model_label, OUTPUT_PATH)

print('All data has been created')

Hard copy policy_weights to target_weights
episode 0/2, greedy_param=0.97823 reward=22.0, avg_rew=22.0, avg_rew(100)=22.0
episode 1/2, greedy_param=0.95216 reward=27.0, avg_rew=24.5, avg_rew(100)=24.5
episode 2/2, greedy_param=0.93985 reward=13.0, avg_rew=20.667, avg_rew(100)=20.667
Hard copy policy_weights to target_weights
episode 0/2, greedy_param=0.9753 reward=25.0, avg_rew=25.0, avg_rew(100)=25.0


episode 1/2, greedy_param=0.93329 reward=44.0, avg_rew=34.5, avg_rew(100)=34.5
episode 2/2, greedy_param=0.91389 reward=21.0, avg_rew=30.0, avg_rew(100)=30.0
Hard copy policy_weights to target_weights
episode 0/2, greedy_param=0.9851 reward=15.0, avg_rew=15.0, avg_rew(100)=15.0
episode 1/2, greedy_param=0.96559 reward=20.0, avg_rew=17.5, avg_rew(100)=17.5
episode 2/2, greedy_param=0.94551 reward=21.0, avg_rew=18.667, avg_rew(100)=18.667
Hard copy policy_weights to target_weights
episode 0/2, greedy_param=0.96946 reward=31.0, avg_rew=31.0, avg_rew(100)=31.0
episode 1/2, greedy_param=0.95502 reward=15.0, avg_rew=23.0, avg_rew(100)=23.0
episode 2/2, greedy_param=0.94079 reward=15.0, avg_rew=20.333, avg_rew(100)=20.333
Hard copy policy_weights to target_weights
episode 0/2, greedy_param=0.97335 reward=27.0, avg_rew=27.0, avg_rew(100)=27.0
episode 1/2, greedy_param=0.95885 reward=15.0, avg_rew=21.0, avg_rew(100)=21.0
episode 2/2, greedy_param=0.94741 reward=12.0, avg_rew=18.0, avg_rew(100)=
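
The `greedy_param` values in the log are consistent with an exploration rate that starts at 1 and is multiplied by `GREEDY_STEP` once per environment step. That is an inference from the printed numbers, not something confirmed by the agent's code, which is not shown here. A quick check under that assumption, using the rewards of the first run (in CartPole each step yields a reward of 1, so the episode reward equals the number of steps):

```python
import math

GREEDY_STEP = 0.999             # default decay factor from the settings above
episode_rewards = [22, 27, 13]  # rewards of the first run in the log

epsilon = 1.0                   # assumed initial exploration rate
eps_trace = []
for reward in episode_rewards:
    # One multiplicative decay per environment step (assumption)
    epsilon *= GREEDY_STEP ** int(reward)
    eps_trace.append(round(epsilon, 5))
print(eps_trace)


def half_life(step):
    """Number of steps after which epsilon has halved, for a given decay factor."""
    return math.log(0.5) / math.log(step)


print(half_life(0.999), half_life(0.99))
```

Under these assumptions the trace matches the first three logged values (0.97823, 0.95216, 0.93985). The half-life computation shows why the greedy step matters: lowering it from 0.999 to 0.99 shortens the exploration phase roughly tenfold.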

We will perform two tasks here. The first is to analyze the reward and loss curves of each individual run of the algorithm and form hypotheses about its behavior. The second is to compute performance metrics that quantify the algorithm's performance across different dimensions.
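
As an illustration of one such metric, the sketch below finds the first episode at which the trailing 100-episode mean reward reaches the solved threshold of 195. The function `episodes_to_solve` is hypothetical and is not the `solve_metric` imported from `utils`, whose definition is not shown here; it only encodes the CartPole-v0 criterion quoted earlier.

```python
import numpy as np


def episodes_to_solve(rewards, threshold=195.0, window=100):
    """Index of the first episode whose trailing `window`-episode mean reward
    is at least `threshold`; None if that never happens."""
    rewards = np.asarray(rewards, dtype=float)
    if len(rewards) < window:
        return None  # not enough episodes to fill a single window
    # Rolling mean via cumulative sums: rolling[i] = mean(rewards[i:i + window])
    csum = np.cumsum(np.insert(rewards, 0, 0.0))
    rolling = (csum[window:] - csum[:-window]) / window
    hits = np.nonzero(rolling >= threshold)[0]
    return int(hits[0]) + window - 1 if hits.size else None


short_run = episodes_to_solve([22.0, 27.0, 13.0])        # too short to be solved
late_solve = episodes_to_solve([0.0] * 50 + [200.0] * 150)
```

With a strict 100-episode window, a run of only 3 episodes can never count as solved, so runs this short are useful only for checking that the pipeline works end to end.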

In [5]:
# Import the saved files and compute statistics for each set of params
files = glob.glob(f'{OUTPUT_PATH}/*.csv')
param_sett_ids = set(x.split('/')[-1].split('_iter')[0] for x in files)
stats = {}
for param in param_sett_ids:
    # Select only the trials of this parameter settings
    files_for_this_param = [x for x in files if x.split('/')[-1].split('_iter')[0]==param]
    rew = []
    loss = []
    for trial_param in files_for_this_param:
        df = pd.read_csv(trial_param)
        rew.append(df['rewards'].to_numpy())
        loss.append(df['losses'].to_numpy())
    ag_perf = Agent_Performance(rewards=rew, losses=loss, solved_func=solve_metric, solved_rew=SOLVED)
    stats[param] = ag_perf.compute_statistics()
# Create a dataframe where each row is the param settings and each column is the statistic 
stats_df = pd.DataFrame.from_dict(stats).T

statistic,target_update_25-0.01,target_update_None-0.01,greedy_step_0.99,nn_arch_256-256-256-256,lr_0.1,greedy_step_0.9985,default,target_update_None-0.005,nn_arch_64-64,lr_0.05
avg_end_rew,13.0,29.0,14.333333,16.0,30.0,13.0,25.0,14.333333,18.333333,29.333333
best_worst_ratio_end_rew,0.272727,2.1875,1.3,0.75,1.5,0.25,1.176471,1.0,0.615385,1.611111
avg_epi_rew,19.333333,26.555556,18.333333,18.777778,24.333333,27.555556,17.777778,15.0,23.111111,20.333333
avg_epi_rew_increase,-5.0,3.5,-2.333333,-4.5,5.166667,-11.666667,6.666667,-1.333333,-1.166667,8.333333
std_epi_rew_increase,16.708281,23.001812,8.185353,5.575243,16.260894,34.183329,10.606602,2.27303,13.35727,6.867799
n_solved,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
avg_ep_to_solve,,,,,,,,,,
max_drawdown,20.333333,11.333333,10.333333,10.333333,3.333333,41.666667,2.666667,3.333333,12.333333,0.0
