<a href="https://colab.research.google.com/github/Brownwang0426/RGRL/blob/main/train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cloning git

In [None]:
# !git clone https://Brownwang0426:token@github.com/Brownwang0426/RGRL.git
!git clone https://github.com/Brownwang0426/RGRL.git

# Installing requirements

In [None]:
!sudo apt-get install python3.10

In [None]:
!pip install pandas==2.0.3 numpy==1.25.2 scipy==1.11.4 swig==4.2.1 ufal.pybox2d==2.3.10.3 gym==0.25.2 pygame==2.5.2 tqdm torch==2.0.1

# Changing directory

In [None]:
import os
os.chdir('/content/RGRL')

# Importing modules

In [1]:
import gym

import numpy as np
import math
from scipy.special import softmax

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.rnn as rnn_utils
from torch.utils.data import DataLoader, TensorDataset, Subset

import csv

import multiprocessing as mp
import os
import sys
import copy
import random
import gc
import time
from tqdm import tqdm


# Checking cuda

In [None]:
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"Device {i}: {torch.cuda.get_device_name(i)}")
    device_index = 0
    device = torch.device(f"cuda:{device_index}")
    print('using cuda...')
else:
    device = torch.device("cpu")
    print('using cpu...')
assert device != torch.device("cpu") # Sorry, but we really recommend you to run it on GPU :-) Nvidia needs your money :-)

In [3]:
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True

# Control board

Crucial configurations regarding how your agent will learn in the environment. The meanings are as follows:
(the configs marked with ⚠️ are the ones we suggest you tune for your specific task)
(the configs marked with ◀️ are the ones we suggest you play with to see their effect)

| Configs   | Type   | Description                                                                 |
|------------|--------|-----------------------------------------------------------------------------|
| ⚠️game_name  | STR| The name of the environment.                                |
| ⚠️max_steps_for_each_episode | +INT | The maximum number of steps the agent will take in an episode while not done. In some environments, it is crucial to increase "max_steps_for_each_episode" so that your agent can "live long enough" to obtain better rewards and gradually, heuristically learn a better strategy.                    |
| ◀️ensemble_size  | +INT | The size of the neural ensemble that makes up the agent. The bigger, the better, but training takes longer without parallel training. :-D                  |
| ⚠️state_size  | +INT | The size of the state as input data.                    |
| ⚠️hidden_size   | +INT |The size of the hidden layers. We suggest hidden_size >= state_size.           |
| ⚠️action_size   | +INT | The size of action per step as input data.   |
| ⚠️time_size  | +INT |The length of the sequence of actions. Namely, how many steps in the future the agent will predict or use to discern the present best action.                |
| ⚠️reward_size  | +INT |The size of the reward as output data.                          |
| ⚠️neural_type  | STR |  [**`rnn`**, **`gru`**, **`lstm`**, **`rnn_att`**] The type of neural network you prefer. For now, we support rnn, gru, lstm, and rnn_att (recurrent attention). More to come in the future (or you can build one yourself :-D in the models repository).           |
| ⚠️num_layers  | +INT |The number of layers in rnn, gru, lstm, and rnn_att (recurrent attention). We suggest no less than 2 (>= 2) to provide more flexibility and memory capacity for neural networks.                         |
| ⚠️num_heads  | +INT/None |The number of heads in multi-head attention (hidden_size should be divisible by it) (should be None for non-attention neural_type).                         |
| hidden_activation  | STR | [**`relu`**, **`leaky_relu`**, **`sigmoid`**, **`tanh`**] The type of activation function in the hidden layers.              |
| output_activation  | STR | [**`relu`**, **`leaky_relu`**, **`sigmoid`**, **`tanh`**] The type of activation function in the output layer.                      |
| shift  | 0/±FLOAT |The value in f(x+shift), where f(x) is the activation function in the output layer. This value is interesting. If it is large and negative, the agent will act more conservatively and tend to exploit known strategies. If it is large and positive, the agent will act more radically and tend to explore all possible strategies before settling down. We humorously refer to this variable as the "playboy variable," drawing an analogy to individuals who change partners frequently in search of the ideal match because they always think there might be a better choice :-P But we can't really write this into the paper... you know...             |
| init   | STR | [**`random_normal`**, **`random_uniform`**, **`xavier_normal`**, **`xavier_uniform`**, **`glorot_normal`**, **`glorot_uniform`**] The initialization method you prefer.                          |
| opti   | STR | [**`adam`**, **`sgd`**, **`rmsprop`**]  The optimization method you prefer.             |
| loss  | STR | [**`mean_squared_error`**, **`binary_crossentropy`**] The loss or error function you prefer.                           |
| bias  | BOOLEAN |Whether you want to add bias.                          |
| drop_rate   | 0/+FLOAT |The drop rate for dropout.              |
| ⚠️alpha   | 0/+FLOAT |The learning rate for the neural networks' weight matrices.                           |
| ⚠️iteration_for_learning   | +INT |The number of iterations for learning.              |
| load_pre_model  | BOOLEAN |Whether you want to load a previously trained model.                          |
| noise_t  |  +INT |The number of times Gaussian noise is applied to the initialized actions of the agent, similar to how a diffusion model adds Gaussian noise.          |
| noise_r  |  0/+FLOAT |The noise range applied to the initialized actions of the agent.                     |
| ⚠️beta  |  0/+FLOAT |The update rate for the actions of the agent.              |
| ⚠️iteration_for_deducing  |  +INT |The number of iterations for updating the actions of the agent.                           |
| episode_for_training  | +INT |How many episodes your agent will run in training mode, where it learns offline.              |
| chunk_size  | +INT |The maximum chunk size for sequentializing state, action, reward. We suggest chunk_size = time_size.      |
| batch_size_for_offline_learning  |+INT | The number of episodes between rounds of learning from the experience buffer (learning runs every batch_size_for_offline_learning episodes).                           |
| PER_epsilon  | 0/+FLOAT |The epsilon for prioritized experience replay.              |
| PER_exponent  | 0/+FLOAT |The exponent for prioritized experience replay.                           |
| episode_for_testing  | +INT |How many episodes your agent will run in testing mode, where it does not learn offline.                        |
| render_for_human  | BOOLEAN | Whether you want to render the visual result for each step in testing mode.              |

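To make the `shift` row more concrete, here is a minimal sketch, assuming the output activation is a sigmoid (as in the example configs). The helper name `shifted_output` is hypothetical and not part of the repository; it only illustrates what f(x+shift) does to the output values.

```python
import torch

def shifted_output(x: torch.Tensor, shift: float) -> torch.Tensor:
    """Apply the output activation to x + shift (sigmoid is assumed here)."""
    return torch.sigmoid(x + shift)

x = torch.zeros(3)                   # example pre-activation values of the output layer
neutral = shifted_output(x, 0.0)     # sigmoid(0), exactly 0.5
lowered = shifted_output(x, -2.0)    # negative shift pushes the outputs down
raised = shifted_output(x, 2.0)      # positive shift pushes the outputs up
print(neutral, lowered, raised)
```

How this change in the output translates into more conservative or more exploratory behaviour depends on how the repository's update rule uses the predicted reward, which this sketch does not model.
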

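The `PER_epsilon` and `PER_exponent` rows can be read against the common prioritized experience replay recipe. The sketch below assumes the usual form, priority = (|error| + epsilon) ** exponent, normalized into sampling probabilities. The function name `per_probabilities` is hypothetical, and the repository's `update_model` may compute priorities differently.

```python
import torch

def per_probabilities(errors: torch.Tensor, epsilon: float = 1e-6, exponent: float = 1.0) -> torch.Tensor:
    """Turn per-sample errors into sampling probabilities (assumed PER form)."""
    # epsilon keeps samples with zero error from never being drawn
    priorities = (errors.abs() + epsilon) ** exponent
    return priorities / priorities.sum()

errors = torch.tensor([0.0, 0.5, 1.5])   # made-up prediction errors for three samples
probs = per_probabilities(errors)
# drawing two sample indices, favouring samples with larger error
batch_indices = torch.multinomial(probs, num_samples=2, replacement=True)
print(probs, batch_indices)
```

With `exponent = 0` every sample would be drawn with equal probability; larger exponents concentrate sampling on high-error samples.
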
## blackjack

In [4]:
game_name = 'Blackjack-v1'          #⚠️
max_steps_for_each_episode = 10     #⚠️
  

ensemble_size = 5                   #◀️
state_size =  201                   #⚠️
hidden_size = 250                   #⚠️
action_size = 2                     #⚠️
time_size = 5                       #⚠️
reward_size = 100                   #⚠️
neural_type = 'gru'                 #⚠️
num_layers = 2                      #⚠️
num_heads = None                    #⚠️
hidden_activation = 'tanh'          
output_activation = 'sigmoid'       
shift = 0.0                         
init = "random_normal"              
opti = 'sgd'                        
loss = 'mean_squared_error'         
bias = False
drop_rate = 0.001                   
alpha = 0.1                         #⚠️                      
iteration_for_learning = 2000       #⚠️                 
load_pre_model = False           
  
  
noise_t = 1                        
noise_r = 0.1                       #⚠️
beta = 0.1                          #⚠️                        
iteration_for_deducing = 200        #⚠️


episode_for_training = 100000
chunk_size = 1
batch_size_for_offline_learning = 100 
PER_epsilon = 0.000001              
PER_exponent = 1                                      


episode_for_testing = 100
render_for_human = True



## cartpole

In [5]:
game_name = 'CartPole-v1'           #⚠️
max_steps_for_each_episode = 2000   #⚠️
  
  
ensemble_size = 5                   #◀️
state_size =  400                   #⚠️
hidden_size = 400                   #⚠️
action_size = 2                     #⚠️
time_size = 15                      #⚠️
reward_size = 100                   #⚠️
neural_type = 'gru'                 #⚠️
num_layers = 2                      #⚠️
num_heads = None                    #⚠️
hidden_activation = 'tanh'          
output_activation = 'sigmoid'       
shift = 0.0                         
init = "random_normal"              
opti = 'sgd'                        
loss = 'mean_squared_error'   
bias = False      
drop_rate = 0.0
alpha = 0.1                         #⚠️                      
iteration_for_learning = 1000       #⚠️               
load_pre_model = False           
  
  
noise_t = 1                        
noise_r = 0.1                 
beta = 0.1                          #⚠️                        
iteration_for_deducing = 50         #⚠️


episode_for_training = 100000
chunk_size = time_size                     
batch_size_for_offline_learning = 1 
PER_epsilon = 0.000001              
PER_exponent = 1                                        


episode_for_testing = 100
render_for_human = True



## mountain car

In [6]:
game_name =  'MountainCar-v0'       #⚠️
max_steps_for_each_episode = 200    #⚠️


ensemble_size = 5                   #◀️
state_size =  200                   #⚠️
hidden_size = 200                   #⚠️
action_size = 3                     #⚠️
time_size = 25                      #⚠️
reward_size = 100                   #⚠️
neural_type = 'gru'                 #⚠️
num_layers = 2                      #⚠️
num_heads = None                    #⚠️
hidden_activation = 'tanh'          
output_activation = 'sigmoid'       
shift = 0.0                         
init = "random_normal"              
opti = 'sgd'                        
loss = 'mean_squared_error'      
bias = False   
drop_rate = 0.001                   
alpha = 0.1                         #⚠️                         
iteration_for_learning = 1000       #⚠️                 
load_pre_model = False              
  
  
noise_t = 1                       
noise_r = 0.1                       #⚠️
beta = 0.1                          #⚠️                     
iteration_for_deducing = 100        #⚠️


episode_for_training = 100000
chunk_size = 2
batch_size_for_offline_learning = 10 
PER_epsilon = 0.000001              
PER_exponent = 1                                        


episode_for_testing = 100
render_for_human = True



## mountain car continuous

In [7]:
game_name = 'MountainCarContinuous-v0'       #⚠️
max_steps_for_each_episode = 200             #⚠️


ensemble_size = 5                   #◀️
state_size =  2                     #⚠️
hidden_size = 100                   #⚠️
action_size = 1                     #⚠️
time_size = 25                      #⚠️
reward_size = 100                   #⚠️
neural_type = 'gru'                 #⚠️
num_layers = 2                      #⚠️
num_heads = None                    #⚠️
hidden_activation = 'tanh'          
output_activation = 'sigmoid'       
shift = 0.0                         
init = "random_normal"              
opti = 'sgd'                        
loss = 'mean_squared_error'      
bias = False   
drop_rate = 0.001                   
alpha = 0.01                        #⚠️                         
iteration_for_learning = 1000       #⚠️                 
load_pre_model = False              
  
  
noise_t = 1                       
noise_r = 0.001                     #⚠️
beta = 0.01                         #⚠️                     
iteration_for_deducing = 100        #⚠️


episode_for_training = 100000
chunk_size = 2
batch_size_for_offline_learning = 10 
PER_epsilon = 0.000001              
PER_exponent = 1                               


episode_for_testing = 100
render_for_human = True



## acrobot

In [8]:
game_name = 'Acrobot-v1'            #⚠️
max_steps_for_each_episode = 200    #⚠️
  
  
ensemble_size = 5                   #◀️
state_size =  600                   #⚠️
hidden_size = 600                   #⚠️
action_size = 3                     #⚠️
time_size = 25                      #⚠️
reward_size = 100                   #⚠️
neural_type = 'gru'                 #⚠️
num_layers = 2                      #⚠️
num_heads = None                    #⚠️
hidden_activation = 'tanh'          
output_activation = 'sigmoid'       
shift = 0.0                         
init = "random_normal"              
opti = 'sgd'                        
loss = 'mean_squared_error'   
bias = False      
drop_rate = 0.001                   
alpha = 0.1                         #⚠️                      
iteration_for_learning = 1000       #⚠️              
load_pre_model = False           
  
  
noise_t = 1                        
noise_r = 0.01                      #⚠️
beta = 0.1                          #⚠️                       
iteration_for_deducing = 100        #⚠️


episode_for_training = 100000
chunk_size = 2
batch_size_for_offline_learning = 10 
PER_epsilon = 0.000001              
PER_exponent = 1                    


episode_for_testing = 100
render_for_human = True



## pendulum

In [9]:
game_name = "Pendulum-v1"           #⚠️
max_steps_for_each_episode = 100    #⚠️
  
  
ensemble_size = 5                   #◀️
state_size =  300                   #⚠️
hidden_size = 300                   #⚠️
action_size = 1                     #⚠️
time_size = 25                      #⚠️
reward_size = 100                   #⚠️
neural_type = 'gru'                 #⚠️
num_layers = 2                      #⚠️
num_heads = None                    #⚠️
hidden_activation = 'tanh'          
output_activation = 'sigmoid'       
shift = 0.0                         
init = "random_normal"              
opti = 'sgd'                        
loss = 'mean_squared_error'   
bias = False      
drop_rate = 0.001                   
alpha = 0.01                        #⚠️                      
iteration_for_learning = 1000       #⚠️                 
load_pre_model = False           
  
  
noise_t = 1                        
noise_r = 0.001                     #⚠️
beta = 0.01                         #⚠️                        
iteration_for_deducing = 100        #⚠️


episode_for_training = 100000
chunk_size = 2
batch_size_for_offline_learning = 10 
PER_epsilon = 0.000001              
PER_exponent = 1                    


episode_for_testing = 100
render_for_human = True



## lunar lander

In [10]:
game_name = "LunarLander-v2"        #⚠️
max_steps_for_each_episode = 200    #⚠️


ensemble_size = 5                   #◀️
state_size =  800                   #⚠️
hidden_size = 800                   #⚠️
action_size = 4                     #⚠️
time_size = 25                      #⚠️
reward_size = 250                   #⚠️
neural_type = 'gru'                 #⚠️
num_layers = 2                      #⚠️
num_heads = None                    #⚠️
hidden_activation = 'tanh'          
output_activation = 'sigmoid'       
shift = 0.0                         
init = "random_normal"              
opti = 'sgd'                        
loss = 'mean_squared_error'    
bias = False     
drop_rate = 0.001                   
alpha = 0.1                         #⚠️                        
iteration_for_learning = 1000       #⚠️    
load_pre_model = False        
  
  
noise_t = 1                         
noise_r = 0.01                      #⚠️
beta = 0.1                          #⚠️      
iteration_for_deducing = 100        #⚠️


episode_for_training = 100000
chunk_size = 2
batch_size_for_offline_learning = 10
PER_epsilon = 0.000001             
PER_exponent = 1                       


episode_for_testing = 100
render_for_human = True



## bipedal walker

In [11]:
game_name = 'BipedalWalker-v3'      #⚠️
max_steps_for_each_episode = 1000   #⚠️
  
  
ensemble_size = 5                   #◀️
state_size =  24                    #⚠️
hidden_size = 100                   #⚠️
action_size = 4                     #⚠️
time_size = 25                      #⚠️
reward_size = 100                   #⚠️
neural_type = 'gru'                 #⚠️
num_layers = 2                      #⚠️
num_heads = None                    #⚠️
hidden_activation = 'tanh'          
output_activation = 'sigmoid'       
shift = 0.0                         
init = "random_normal"              
opti = 'sgd'                        
loss = 'mean_squared_error'     
bias = False    
drop_rate = 0.001                   
alpha = 0.01                        #⚠️                      
iteration_for_learning = 1000       #⚠️        
load_pre_model = False           
  
  
noise_t = 1                        
noise_r = 0.001                     #⚠️
beta = 0.01                         #⚠️                        
iteration_for_deducing = 100        #⚠️


episode_for_training = 100000
chunk_size = 2
batch_size_for_offline_learning = 10 
PER_epsilon = 0.000001              
PER_exponent = 1                                     


episode_for_testing = 100
render_for_human = True



## your present config

In [12]:
game_name = 'CartPole-v1'           #⚠️
max_steps_for_each_episode = 2000   #⚠️
  
  
ensemble_size = 10                  #◀️
state_size =  400                   #⚠️
hidden_size = 400                   #⚠️
action_size = 2                     #⚠️
time_size = 15                      #⚠️
reward_size = 100                   #⚠️
neural_type = 'gru'                 #⚠️
num_layers = 2                      #⚠️
num_heads = None                    #⚠️
hidden_activation = 'tanh'          
output_activation = 'sigmoid'       
shift = 0.0                         
init = "random_normal"              
opti = 'sgd'                        
loss = 'mean_squared_error'   
bias = False      
drop_rate = 0.001
alpha = 0.1                         #⚠️                      
iteration_for_learning = 10000      #⚠️               
load_pre_model = False           
  
  
noise_t = 1                        
noise_r = 0.1                 
beta = 0.1                          #⚠️                        
iteration_for_deducing = 500        #⚠️


episode_for_training = 100000
chunk_size = time_size                     
batch_size_for_offline_learning = 10
PER_epsilon = 0.000001              
PER_exponent = 1                                        


episode_for_testing = 100
render_for_human = True


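Before moving on, it can help to check the config against the constraints stated in the table. The sketch below covers only the `neural_type` and `num_heads` rules; `check_config` is a hypothetical helper, not something the repository provides.

```python
def check_config(neural_type, hidden_size, num_heads):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if neural_type not in ('rnn', 'gru', 'lstm', 'rnn_att'):
        problems.append(f"unsupported neural_type: {neural_type!r}")
    if neural_type == 'rnn_att':
        # attention needs a positive head count that divides hidden_size
        if num_heads is None or num_heads <= 0 or hidden_size % num_heads != 0:
            problems.append("hidden_size must be divisible by num_heads for rnn_att")
    elif num_heads is not None:
        problems.append("num_heads should be None for non-attention neural_type")
    return problems

print(check_config('gru', 400, None))
print(check_config('rnn_att', 400, 3))
```
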

In [13]:
suffix                      = f"game={game_name}_type={neural_type}_ensemble={ensemble_size:05d}_drop={drop_rate:.5f}_learn={iteration_for_learning:05d}_interval={batch_size_for_offline_learning:05d}_deduce={iteration_for_deducing:05d}"
directory                   = f'/content/result/{game_name}/'
model_directory             = f'/content/result/{game_name}/model_{suffix}'+'_%s.h5'
performance_log_directory   = f'/content/result/{game_name}/performace_log_{suffix}.csv'

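The `'_%s.h5'` tail on `model_directory` is a placeholder that is filled in later with the index of each ensemble member (`model_directory % i`). A small sketch with a shortened, made-up suffix shows what the resulting file paths look like:

```python
game_name = 'CartPole-v1'
ensemble_size = 10
drop_rate = 0.001

# shortened suffix for illustration; the notebook's suffix has more fields
suffix = f"game={game_name}_ensemble={ensemble_size:05d}_drop={drop_rate:.5f}"
model_directory = f'/content/result/{game_name}/model_{suffix}' + '_%s.h5'

# one file per ensemble member, distinguished by its index
paths = [model_directory % i for i in range(3)]
for path in paths:
    print(path)
```
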
# Importing local modules

In [14]:
if   game_name == 'Blackjack-v1':
    from envs.env_blackjack   import vectorizing_state, vectorizing_action, vectorizing_reward
elif   game_name == 'CartPole-v1':
    from envs.env_cartpole    import vectorizing_state, vectorizing_action, vectorizing_reward
elif game_name == 'MountainCar-v0':
    from envs.env_mountaincar import vectorizing_state, vectorizing_action, vectorizing_reward
elif game_name == 'MountainCarContinuous-v0':
    from envs.env_mountaincar_continuous import vectorizing_state, vectorizing_action, vectorizing_reward
elif game_name == 'Acrobot-v1':
    from envs.env_acrobot import vectorizing_state, vectorizing_action, vectorizing_reward
elif game_name == "Pendulum-v1":
    from envs.env_pendulum import vectorizing_state, vectorizing_action, vectorizing_reward
elif game_name == "LunarLander-v2":
    from envs.env_lunarlander import vectorizing_state, vectorizing_action, vectorizing_reward
elif game_name == 'BipedalWalker-v3':
    from envs.env_bipedalwalker import vectorizing_state, vectorizing_action, vectorizing_reward
else:
   raise RuntimeError('missing env functions')

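The if/elif chain above maps each environment name to a module under `envs`. As an alternative sketch, the same mapping can be written as a dictionary lookup combined with `importlib`. The names `ENV_MODULES`, `env_module_name` and `load_env_functions` are hypothetical; the module paths are the ones used in the cell above.

```python
import importlib

ENV_MODULES = {
    'Blackjack-v1': 'envs.env_blackjack',
    'CartPole-v1': 'envs.env_cartpole',
    'MountainCar-v0': 'envs.env_mountaincar',
    'MountainCarContinuous-v0': 'envs.env_mountaincar_continuous',
    'Acrobot-v1': 'envs.env_acrobot',
    'Pendulum-v1': 'envs.env_pendulum',
    'LunarLander-v2': 'envs.env_lunarlander',
    'BipedalWalker-v3': 'envs.env_bipedalwalker',
}

def env_module_name(game_name):
    """Look up the module path for an environment, failing like the original chain."""
    try:
        return ENV_MODULES[game_name]
    except KeyError:
        raise RuntimeError('missing env functions') from None

def load_env_functions(game_name):
    """Import the module and return its three vectorizing functions."""
    module = importlib.import_module(env_module_name(game_name))
    return module.vectorizing_state, module.vectorizing_action, module.vectorizing_reward

print(env_module_name('CartPole-v1'))
```

Inside the repository, `vectorizing_state, vectorizing_action, vectorizing_reward = load_env_functions(game_name)` would then replace the chain, and supporting a new environment would only need one more dictionary entry.
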
In [15]:
if neural_type == 'rnn_att':
    from models.model_rnn_att import build_model
    from utils.util_rnn_att   import initialize_pre_activated_action, \
                                 update_pre_activated_action, \
                                 sequentialize, \
                                 update_model, \
                                 save_performance_to_csv
else:
    from models.model_rnn import build_model
    from utils.util_rnn   import initialize_pre_activated_action, \
                                 update_pre_activated_action, \
                                 sequentialize, \
                                 update_model, \
                                 save_performance_to_csv

# Deducing -> Learning
Training mode where your agent will learn offline. You can see here how your agent learns over time and improves its performance.


Creating or loading models

In [16]:

# creating the result directory if it does not exist yet
os.makedirs(directory, exist_ok=True)

# building the ensemble (the same whether or not weights are loaded afterwards)
model_list = []
for _ in range(ensemble_size):
    model = build_model(state_size,
                        hidden_size,
                        action_size,
                        time_size,
                        reward_size,
                        neural_type,
                        num_layers,
                        num_heads,
                        hidden_activation,
                        output_activation,
                        shift,
                        init,
                        opti,
                        loss,
                        bias,
                        drop_rate,
                        alpha)
    model.to(device)
    model_list.append(model)

# loading previously saved weights on request
if load_pre_model:
    for i in range(len(model_list)):
        model_list[i].load_state_dict(torch.load(model_directory % i))



Creating Streams

In [17]:
stream_list = []
for _ in range(ensemble_size):
    stream  = torch.cuda.Stream()
    stream_list.append(stream)


Creating desired reward

In [18]:
desired_reward = torch.ones((1, reward_size))

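With the models and the desired reward in place, the "deducing" step can be pictured as gradient descent on the action sequence while the model weights stay fixed. The following is a minimal sketch of that general idea, not the repository's `update_pre_activated_action`. `ToyModel`, `ensemble_loss` and `deduce` are hypothetical, the sigmoid applied to the pre-activated actions is an assumption, and the sizes are made up to keep the example small.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
state_size, action_size, time_size, reward_size = 4, 2, 3, 5

class ToyModel(nn.Module):
    """Hypothetical stand-in for one ensemble member."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(state_size + time_size * action_size, reward_size)

    def forward(self, state, action):
        # concatenating the state with the flattened action sequence
        x = torch.cat([state, action.flatten(start_dim=1)], dim=1)
        return torch.sigmoid(self.linear(x))

def ensemble_loss(models, state, pre_activated_action, desired_reward):
    # error between predicted and desired reward, averaged over the ensemble
    action = torch.sigmoid(pre_activated_action)
    losses = [F.mse_loss(m(state, action), desired_reward) for m in models]
    return torch.stack(losses).mean()

def deduce(models, state, pre_activated_action, desired_reward, beta, iterations):
    # gradient descent on the actions only; the model weights are never updated
    action = pre_activated_action.clone().requires_grad_(True)
    for _ in range(iterations):
        loss = ensemble_loss(models, state, action, desired_reward)
        (grad,) = torch.autograd.grad(loss, action)
        with torch.no_grad():
            action -= beta * grad
    return action.detach()

models = [ToyModel().eval() for _ in range(3)]
state = torch.rand(1, state_size)
desired_reward = torch.ones((1, reward_size))
start = 0.1 * torch.randn(1, time_size, action_size)

before = ensemble_loss(models, state, start, desired_reward).item()
result = deduce(models, state, start, desired_reward, beta=0.1, iterations=50)
after = ensemble_loss(models, state, result, desired_reward).item()
print(f"loss before: {before:.4f}, after: {after:.4f}")
```

In this picture `beta` plays the role of the step size on the actions and `iteration_for_deducing` the number of such steps, which matches how the control board describes them.
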
Putting all the previous works into play

In [None]:

performance_log = []
performance_log.append([0, -sys.maxsize])

for training_episode in tqdm(range(episode_for_training)):

    total_steps = 0

    # initializing short term experience replay buffer
    short_term_state_list  = []
    short_term_action_list = []
    short_term_reward_list = []

    # initializing environment
    env                    = gym.make(game_name)
    env._max_episode_steps = max_steps_for_each_episode
    state                  = env.reset()
    summed_reward          = 0

    # observing state
    state    = vectorizing_state(state)
    short_term_state_list.append(state)

    done = False
    while not done:

        # initializing and updating action
        state                 = torch.tensor(np.atleast_2d(state), dtype=torch.float)
        pre_activated_action  = initialize_pre_activated_action(init, noise_t, noise_r, (time_size, action_size))
        pre_activated_action  = torch.tensor(pre_activated_action[np.newaxis, :, :], dtype=torch.float)
        pre_activated_action  = update_pre_activated_action(iteration_for_deducing,
                                                            model_list,
                                                            state,
                                                            pre_activated_action,
                                                            desired_reward,
                                                            beta,
                                                            device)
        action, action_       = vectorizing_action(pre_activated_action)
        short_term_action_list.append(action)

        # executing action
        state, reward, done, info = env.step(action_)
        total_steps              += 1

        # observing actual reward
        summed_reward += reward
        reward = vectorizing_reward(state, reward, summed_reward, done, reward_size)
        short_term_reward_list.append(reward)

        # observing state
        state    = vectorizing_state(state)
        short_term_state_list.append(state)

        if done: 
            if  (total_steps >= chunk_size):
                print(f'Episode {training_episode+1}: Summed_Reward = {summed_reward}')
                performance_log.append([training_episode+1, summed_reward])
                save_performance_to_csv(performance_log, performance_log_directory)
                break
            else:
                done = False
        else:
            pass




    env.close()




    # sequentializing short term experience replay buffer 
    short_term_sequentialized_state_list   ,\
    short_term_sequentialized_action_list  ,\
    short_term_sequentialized_reward_list  ,\
    short_term_sequentialized_n_state_list = sequentialize(short_term_state_list, short_term_action_list, short_term_reward_list, chunk_size, device)
    
    
    

    if training_episode==0:
        long_term_sequentialized_state_list      = copy.deepcopy(short_term_sequentialized_state_list    )
        long_term_sequentialized_action_list     = copy.deepcopy(short_term_sequentialized_action_list   )
        long_term_sequentialized_reward_list     = copy.deepcopy(short_term_sequentialized_reward_list   )
        long_term_sequentialized_n_state_list    = copy.deepcopy(short_term_sequentialized_n_state_list  )
    else:
        long_term_sequentialized_state_list      = long_term_sequentialized_state_list   + short_term_sequentialized_state_list  
        long_term_sequentialized_action_list     = long_term_sequentialized_action_list  + short_term_sequentialized_action_list 
        long_term_sequentialized_reward_list     = long_term_sequentialized_reward_list  + short_term_sequentialized_reward_list 
        long_term_sequentialized_n_state_list    = long_term_sequentialized_n_state_list + short_term_sequentialized_n_state_list
        



    # batch offline learning
    if (training_episode+1) % batch_size_for_offline_learning == 0:




        # training with Prioritized Experience Replay (PER)
        for i, model in enumerate(model_list):
            with torch.cuda.stream(stream_list[i]):
                model                     = update_model(iteration_for_learning,
                                                         long_term_sequentialized_state_list   ,
                                                         long_term_sequentialized_action_list  ,
                                                         long_term_sequentialized_reward_list  ,
                                                         long_term_sequentialized_n_state_list ,
                                                         model,
                                                         PER_epsilon,
                                                         PER_exponent)
                model_list[i]             = model
        torch.cuda.synchronize()




        # saving:
        for i in range(len(model_list)):
            torch.save(model_list[i].state_dict(), model_directory % i)


        gc.collect()
        torch.cuda.empty_cache()

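The loop above hands each finished episode to `sequentialize`, which returns four lists (states, actions, rewards, next states). One plausible reading of that step, sketched below, is a sliding window of `chunk_size` steps over the episode. `sliding_chunks` is a hypothetical helper and the repository's `sequentialize` may chunk differently (for example with variable-length windows).

```python
def sliding_chunks(states, actions, rewards, chunk_size):
    """Split one episode into overlapping windows of `chunk_size` steps.

    `states` holds one more entry than `actions` (the initial observation).
    Each chunk is (start_state, action_window, reward_at_window_end, next_state).
    """
    chunks = []
    for start in range(len(actions) - chunk_size + 1):
        end = start + chunk_size
        chunks.append((states[start], actions[start:end], rewards[end - 1], states[end]))
    return chunks

# a made-up episode with three steps
states = ['s0', 's1', 's2', 's3']
actions = ['a0', 'a1', 'a2']
rewards = [0.1, 0.2, 0.3]

chunks = sliding_chunks(states, actions, rewards, chunk_size=2)
for chunk in chunks:
    print(chunk)
```

Under this reading, an episode shorter than `chunk_size` would yield no chunks at all, which may be why the training loop keeps stepping until `total_steps >= chunk_size`.
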
# Deducing only
Testing mode, where the agent trained in training mode will not learn offline. It just keeps running each episode without learning anything new.

Loading models

In [None]:
model_list = []
for _ in range(ensemble_size):
    model = build_model(state_size,
                        hidden_size,
                        action_size,
                        time_size,
                        reward_size,
                        neural_type,
                        num_layers,
                        num_heads,
                        hidden_activation,
                        output_activation,
                        shift,
                        init,
                        opti,
                        loss,
                        bias,
                        drop_rate,
                        alpha)
    model.to(device)
    model_list.append(model)

for i in range(len(model_list)):
    model_list[i].load_state_dict(torch.load(model_directory % i))

Creating desired reward ... again

In [None]:
desired_reward = torch.ones((1, reward_size))

Putting all the previous works into play ... again

But this time the agent does not learn

In [None]:
total_summed_reward = 0

for testing_episode in range(episode_for_testing):

    if render_for_human == True:
        env = gym.make( game_name, render_mode="human")
    else:
        env = gym.make( game_name)
    env._max_episode_steps = max_steps_for_each_episode
    state                  = env.reset()
    if render_for_human == True:
        env.render()
    summed_reward = 0

    state = vectorizing_state(state)

    done = False
    while not done:

        state                 = torch.tensor(np.atleast_2d(state), dtype=torch.float)
        pre_activated_action  = initialize_pre_activated_action(init, noise_t, noise_r, (time_size, action_size))
        pre_activated_action  = torch.tensor(pre_activated_action[np.newaxis, :, :], dtype=torch.float)
        pre_activated_action  = update_pre_activated_action(iteration_for_deducing,
                                                            model_list,
                                                            state,
                                                            pre_activated_action,
                                                            desired_reward,
                                                            beta,
                                                            device)
        action, action_       = vectorizing_action(pre_activated_action)

        state, reward, done,  info = env.step(action_)
        if render_for_human == True:
            env.render()

        summed_reward += reward

        state = vectorizing_state(state)

        if done:
            break

    env.close()

    print("Summed reward:", summed_reward)
    print(f'Episode: {testing_episode + 1}')
    print('Averaged summed reward:')
    total_summed_reward += summed_reward
    print(total_summed_reward/(testing_episode + 1))

