<a href="https://colab.research.google.com/github/Brownwang0426/Reversal-Generative-Reinforcement-Learning/blob/main/train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up (for colab)

In [None]:
!sudo apt-get install python3.10
!pip install torch==2.0.1 
!pip install numpy==1.25.2 scipy==1.11.4 swig==4.2.1 ufal.pybox2d==2.3.10.3 gymnasium==1.0.0 minigrid==3.0.0 tqdm==4.67.1 dill==0.3.8

In [None]:
!git clone https://github.com/Brownwang0426/Reversal-Generative-Reinforcement-Learning.git

In [None]:
import os
os.chdir('/content/Reversal-Generative-Reinforcement-Learning')

# Setting up (for local)
CUDA Toolkit 11.8 \
cuDNN 8.9.x \
pip install torch==2.0.1 --extra-index-url https://download.pytorch.org/whl/cu118  \
pip install numpy==1.25.2 scipy==1.11.4 swig==4.2.1 ufal.pybox2d==2.3.10.3 gymnasium==1.0.0 minigrid==3.0.0 tqdm==4.67.1 dill==0.3.8

# Importing modules

In [None]:
import gymnasium as gym
from gymnasium.wrappers import TimeLimit
import minigrid

import numpy as np
import math
from scipy.special import softmax

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.rnn as rnn_utils
from torch.utils.data import DataLoader, TensorDataset, Subset

import csv

import multiprocessing as mp
import os
import sys
import copy
import random
import gc
import time
from tqdm import tqdm
from collections import defaultdict

import itertools

import dill

import warnings
warnings.filterwarnings('ignore')

import concurrent.futures
import hashlib

# Checking cuda

In [None]:
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"Device {i}: {torch.cuda.get_device_name(i)}")
    device_index = 0
    device = torch.device(f"cuda:{device_index}")
    print('using cuda...')
else:
    device = torch.device("cpu")
    print('using cpu...')

torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True

# Control board

Crucial configurations that control how your agent learns in the environment. Their meanings are as follows:
(configs marked with ⚠️ are the ones we suggest you tune to the specific needs of your task)
(configs marked with ◀️ are the ones we suggest you play with to see their effect)



## Configs meaning
| Configs   | Type   | Description                                                                 |
|------------|--------|-----------------------------------------------------------------------------|
| ⚠️game_name  | STR| The name of the environment.                                |
| ⚠️max_steps_for_each_episode | +INT | The maximum number of steps the agent takes in an episode while it is not done. In some environments, it is crucial to increase your "max_steps_for_each_episode" so that your agent can "live long enough" to obtain better rewards and gradually, heuristically learn a better strategy.                    |
| ⚠️seed | +INT/None | The seed for the environment. Use None for a randomly seeded environment each episode.                    |
| load_pretrained_model  | BOOLEAN |Whether to load a previously trained model.                          |
| ◀️ensemble_size  | +INT | The number of neural networks in the ensemble that makes up the agent. The bigger, the better, but training takes longer without parallel training. :-D                  |
| ⚠️state_size  | +INT | The size of the state as input data.                    |
| ⚠️action_size   | +INT | The size of action per step as input data.   |
| ⚠️reward_size  | +INT |The size of the reward as output data.                          |
| ⚠️feature_size   | +INT |The size of the hidden layers.       |
| ⚠️history_size  | 0/+INT |How many past steps of states and actions the agent takes into consideration.                           |
| ⚠️future_size  | +INT |The length of the sequence of actions. Namely, how many steps into the future the agent will predict or use to discern the present best action.                |
| ⚠️window_size  | +INT | Window size for sequentializing the short term experience replay buffer into the long term one. **`Shall be less than the possible minimum steps in an episode and shall be less than or equal to future_size.`**             |
| ⚠️neural_type  | STR |  [**`rnn`**, **`gru`**, **`lstm`**, **`td`**, **`rnn_td`**, **`gru_td`**, **`lstm_td`**] The type of neural network you prefer. For now, we support rnn, gru, lstm, and td (Transformer decoder only). More to come in the future (or you can build one yourself :-D in the models repository).           |
| ⚠️num_layers  | +INT |The number of layers in rnn, gru, lstm, and td (Transformer decoder only).     |
| ⚠️num_heads  | +INT/None |The number of heads in multi-head attention. **`Shall divide feature_size evenly`**. **`Shall be None for non-attention neural_type`**.                         |
| init   | STR | [**`random_normal`**, **`random_uniform`**, **`xavier_normal`**, **`xavier_uniform`**, **`glorot_normal`**, **`glorot_uniform`**] The initialization method you prefer for initializing the neural net ensemble of your agent.                          |
| opti   | STR | [**`adam`**, **`sgd`**, **`rmsprop`**]  The optimization method you prefer.             |
| loss  | STR | [**`mean_squared_error`**, **`binary_crossentropy`**] The loss or error function you prefer.                           |
| bias  | BOOLEAN |Whether to add bias terms.                          |
| drop_rate   | 0/+FLOAT |The dropout rate.              |
| alpha   | 0/+FLOAT |The learning rate for the neural networks' weight matrices.                           |
| epoch_for_learning   | +INT |The number of learning iterations per experience.              |
| init_   | STR | [**`random_normal`**, **`random_uniform`**, **`xavier_normal`**, **`xavier_uniform`**, **`glorot_normal`**, **`glorot_uniform`**] The initialization method you prefer for initializing the actions of your agent.                         |
| greed_epsilon_t  |  +INT |The number of times Gaussian noise is applied to the initialized actions of the agent, similar to how a diffusion model adds Gaussian noise.          |
| greed_epsilon_init  |  +FLOAT |The initial greed_epsilon or noise range used to initialize the actions of the agent. The higher the value, the more exploration-oriented the agent will be in the beginning.                    |
| greed_epsilon_decay  |  +FLOAT | The decay rate of greed_epsilon for each step and episode.  |
| greed_epsilon_min  |  +FLOAT |A very small number representing the lower bound of greed_epsilon.        |
| beta  |  0/+FLOAT |The update rate for the actions of the agent.              |
| ◀️epoch_for_planning  |  +INT |The number of iterations for updating the actions of the agent.                           |
| episode_for_training  | +INT |How many episodes your agent runs in training mode, where it learns offline.              |
| batch_size_for_planing  | +INT | How many neural nets will be merged into a batch for each iteration in planning. **`Not used at present`**.              |
| ⚠️batch_size_for_executing| +INT | How many steps the agent skips planning and simply takes previously planned actions. **`Shall be less than or equal to future_size`**. |
| ⚠️batch_size_for_learning  | +INT | Batch size for learning or training neural nets.              |
| buffer_limit  | +INT |The maximum size for your buffer.              |
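
Several rows in the table state hard constraints between configs. A minimal sketch of how they could be checked before training is shown here; `check_config` is a hypothetical helper, not part of this repository, and it assumes that the `td` family (`td`, `rnn_td`, `gru_td`, `lstm_td`) is the attention-based one. The constraint that `window_size` stays below the minimum possible episode length depends on the environment and is not checked.

```python
def check_config(future_size, window_size, batch_size_for_executing,
                 feature_size, num_heads, neural_type):
    """Return a list of violated constraints taken from the config table."""
    problems = []
    if window_size > future_size:
        problems.append("window_size must be <= future_size")
    if batch_size_for_executing > future_size:
        problems.append("batch_size_for_executing must be <= future_size")
    # Assumption: neural types ending in "td" use multi-head attention.
    uses_attention = neural_type.endswith("td")
    if uses_attention:
        if num_heads is None or feature_size % num_heads != 0:
            problems.append("num_heads must divide feature_size")
    elif num_heads is not None:
        problems.append("num_heads must be None for non-attention neural_type")
    return problems


# An empty list means no stated constraint is violated.
print(check_config(future_size=75, window_size=50, batch_size_for_executing=10,
                   feature_size=300, num_heads=10, neural_type="td"))
# A config that executes more steps than it plans is flagged.
print(check_config(future_size=5, window_size=2, batch_size_for_executing=10,
                   feature_size=250, num_heads=10, neural_type="td"))
```

Returning a list instead of raising on the first violation makes it possible to report every problem in one pass.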



## dualism principles

| neural weights | neural actions (and experiences) |
|----------|----------|
| Re-updated   | Not re-updated. New ones are created and stored each time   |
| Stable initialization   | Unstable initialization at first and then stable initialization gradually   |
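
One reading of "unstable initialization at first and then stable initialization gradually" is the decay of `greed_epsilon_init` applied once per episode in the training loop, with `greed_epsilon_min` as a floor. The sketch below reproduces that schedule in closed form; `greed_epsilon_at` is a hypothetical helper name, and its defaults are the values used in the example configs of this notebook.

```python
def greed_epsilon_at(episode, init=0.1, decay=0.95, minimum=1e-20):
    """Noise scale for initializing actions after `episode` completed episodes."""
    return max(init * decay ** episode, minimum)


# The noise scale shrinks geometrically until it reaches the floor.
for episode in (0, 10, 100, 1000):
    print(episode, greed_epsilon_at(episode))
```

With these defaults the floor is reached after roughly 850 episodes, so most of a long training run initializes actions with almost no noise.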

## frozen lake

In [None]:
game_name =  'FrozenLake-v1'         #⚠️   gym.make(game_name, max_episode_steps=max_steps_for_each_episode, is_slippery=False, map_name="4x4")
max_steps_for_each_episode = None    #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  16                     #⚠️
action_size = 4                      #⚠️
reward_size = 100                    #⚠️
feature_size = 200                   #⚠️
history_size  = 10                   #⚠️
future_size = 10                     #⚠️
window_size = 2                      #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                   
epoch_for_learning  = 5              

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_init  = 0.1       
greed_epsilon_decay = 0.95                 
greed_epsilon_min   = 1e-20        
beta = 1                         
epoch_for_planning = 5               #◀️          

episode_for_training = 100000

batch_size_for_executing = 10        #⚠️

batch_size_for_learning = 10         #⚠️  

buffer_limit = 10000                       
 



## blackjack

In [None]:
game_name = 'Blackjack-v1'           #⚠️
max_steps_for_each_episode = None    #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  201                    #⚠️
action_size = 2                      #⚠️
reward_size = 100                    #⚠️
feature_size = 250                   #⚠️
history_size  = 5                    #⚠️
future_size = 5                      #⚠️
window_size = 2                      #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                   
epoch_for_learning  = 5              

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_init  = 0.1       
greed_epsilon_decay = 0.95                 
greed_epsilon_min   = 1e-20        
beta = 1                         
epoch_for_planning = 5               #◀️          

episode_for_training = 100000

batch_size_for_executing = 5         #⚠️  kept <= future_size

batch_size_for_learning = 10         #⚠️     

buffer_limit = 10000                       
 



## cartpole

In [None]:
game_name = 'CartPole-v1'            #⚠️
max_steps_for_each_episode = 1000    #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  400                    #⚠️
action_size = 2                      #⚠️
reward_size = 100                    #⚠️
feature_size = 500                   #⚠️
history_size  = 10                   #⚠️
future_size = 50                     #⚠️
window_size = 8                      #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                   
epoch_for_learning  = 5              

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_init  = 0.1       
greed_epsilon_decay = 0.95                 
greed_epsilon_min   = 1e-20        
beta = 1                         
epoch_for_planning = 5               #◀️          

episode_for_training = 100000

batch_size_for_executing = 10        #⚠️

batch_size_for_learning = 10         #⚠️       

buffer_limit = 10000                       
 



## mountain car

In [None]:
game_name =  'MountainCar-v0'        #⚠️
max_steps_for_each_episode = None    #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  200                    #⚠️
action_size = 3                      #⚠️
reward_size = 100                    #⚠️
feature_size = 300                   #⚠️
history_size  = 25                   #⚠️
future_size = 75                     #⚠️
window_size = 50                     #⚠️   
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                   
epoch_for_learning  = 5              

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_init  = 0.1       
greed_epsilon_decay = 0.95                 
greed_epsilon_min   = 1e-20        
beta = 1                         
epoch_for_planning = 5               #◀️          

episode_for_training = 100000

batch_size_for_executing = 10        #⚠️

batch_size_for_learning = 10         #⚠️        

buffer_limit = 10000                       
 



## acrobot

In [None]:
game_name = 'Acrobot-v1'             #⚠️
max_steps_for_each_episode = None    #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  600                    #⚠️
action_size = 3                      #⚠️
reward_size = 100                    #⚠️
feature_size = 700                   #⚠️
history_size  = 25                   #⚠️
future_size = 75                     #⚠️
window_size = 50                     #⚠️   
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                   
epoch_for_learning  = 5              

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_init  = 0.1       
greed_epsilon_decay = 0.95                 
greed_epsilon_min   = 1e-20        
beta = 1                         
epoch_for_planning = 5               #◀️          

episode_for_training = 100000

batch_size_for_executing = 10        #⚠️        

batch_size_for_learning = 10         #⚠️  

buffer_limit = 10000                       
 



## lunar lander

In [None]:
game_name = "LunarLander-v3"         #⚠️
max_steps_for_each_episode = None    #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  800                    #⚠️
action_size = 4                      #⚠️
reward_size = 250                    #⚠️
feature_size = 950                   #⚠️
history_size  = 25                   #⚠️
future_size = 75                     #⚠️
window_size = 50                     #⚠️   
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                   
epoch_for_learning  = 5              

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_init  = 0.1       
greed_epsilon_decay = 0.95                 
greed_epsilon_min   = 1e-20        
beta = 1                         
epoch_for_planning = 5               #◀️          

episode_for_training = 100000

batch_size_for_executing = 10        #⚠️          

batch_size_for_learning = 10         #⚠️ 

buffer_limit = 10000                       
 



## door key

In [None]:
game_name = "MiniGrid-DoorKey-5x5-v0"#⚠️
max_steps_for_each_episode = None    #⚠️
seed = 1                             #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  157                    #⚠️
action_size = 7                      #⚠️
reward_size = 100                    #⚠️
feature_size = 250                   #⚠️
history_size  = 15                   #⚠️
future_size = 10                     #⚠️
window_size = 10                     #⚠️   
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                   
epoch_for_learning  = 5              

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_init  = 0.1       
greed_epsilon_decay = 0.95                 
greed_epsilon_min   = 1e-20        
beta = 1                         
epoch_for_planning = 5               #◀️          

episode_for_training = 100000

batch_size_for_executing = 1         #⚠️         

batch_size_for_learning = 10         #⚠️  

buffer_limit = 10000                       
 



## your present config

In [None]:
game_name =  'MountainCar-v0'        #⚠️
max_steps_for_each_episode = None    #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  200                    #⚠️
action_size = 3                      #⚠️
reward_size = 100                    #⚠️
feature_size = 300                   #⚠️
history_size  = 25                   #⚠️
future_size = 75                     #⚠️
window_size = 50                     #⚠️   
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                   
epoch_for_learning  = 5              

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_init  = 0.1       
greed_epsilon_decay = 0.95                 
greed_epsilon_min   = 1e-20        
beta = 1                         
epoch_for_planning = 5               #◀️          

episode_for_training = 100000

batch_size_for_executing = 10        #⚠️

batch_size_for_learning = 10         #⚠️        

buffer_limit = 10000                       
 



In [None]:
episode_for_testing = 100
render_for_human = True

suffix                 = f"game_{game_name}-type_{neural_type}-ensemble_{ensemble_size:05d}-learn_{epoch_for_learning:05d}-plan_{epoch_for_planning:05d}"
directory              = f'./result/{game_name}/'
performance_directory  = f'./result/{game_name}/performace-{suffix}.csv'
model_directory        = f'./result/{game_name}/model-{suffix}.pth'
buffer_directory       = f'./result/{game_name}/buffer-{suffix}.dill'

if not os.path.exists(directory):
    os.makedirs(directory)

# Importing local modules

In [None]:
game_modules = {
    'FrozenLake-v1': 'envs.env_frozenlake',
    'Blackjack-v1': 'envs.env_blackjack',
    'CartPole-v1': 'envs.env_cartpole',
    'MountainCar-v0': 'envs.env_mountaincar',
    'Acrobot-v1': 'envs.env_acrobot',
    'LunarLander-v3': 'envs.env_lunarlander',
    'MiniGrid-DoorKey-5x5-v0': 'envs.env_doorkey'
}
if game_name in game_modules:
    game_module = __import__(game_modules[game_name], fromlist=['vectorizing_state', 'vectorizing_action', 'vectorizing_reward'])
    vectorizing_state  = game_module.vectorizing_state
    vectorizing_action = game_module.vectorizing_action
    vectorizing_reward = game_module.vectorizing_reward
else:
    raise RuntimeError('Missing env functions')

In [None]:
model_modules = {
    'td': 'models.model_td',
    'rnn_td': 'models.model_rnn_td',
    'gru_td': 'models.model_gru_td',
    'lstm_td': 'models.model_lstm_td',  # module name assumed by analogy with model_rnn_td and model_gru_td
    'rnn': 'models.model_rnn',
    'gru': 'models.model_rnn',
    'lstm': 'models.model_rnn'
}
if neural_type in model_modules:
    model_module = __import__(model_modules[neural_type], fromlist=['build_model'])
    build_model  = model_module.build_model
else:
    raise RuntimeError('Missing model functions')

from utils.util_func  import load_performance_from_csv,\
                             load_buffer_from_pickle,\
                             retrieve_history,\
                             retrieve_present,\
                             initialize_future_action, \
                             initialize_desired_reward,\
                             update_future_action, \
                             sequentialize, \
                             update_long_term_experience_replay_buffer,\
                             update_model_list,\
                             limit_buffer,\
                             save_performance_to_csv,\
                             save_buffer_to_pickle


# planning -> Learning
Training mode, where your agent learns offline. Here you can see how your agent learns over time and improves its performance.
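
Per-episode rewards are noisy, so progress is easier to judge from a trailing average. The sketch below assumes the in-memory layout used by `performance_log` in this notebook, a list of `[episode, summed_reward]` pairs; `moving_average` and the toy numbers are illustrative only and not part of the repository.

```python
def moving_average(performance_log, window=100):
    """Average summed reward over the trailing `window` episodes, one value per episode."""
    rewards = [reward for _, reward in performance_log]
    averages = []
    running_sum = 0.0
    for i, reward in enumerate(rewards):
        running_sum += reward
        if i >= window:
            # Drop the reward that just left the trailing window.
            running_sum -= rewards[i - window]
        averages.append(running_sum / min(i + 1, window))
    return averages


# Toy data, not real training results.
toy_log = [[0, -200.0], [1, -200.0], [2, -150.0], [3, -100.0]]
print(moving_average(toy_log, window=2))  # [-200.0, -200.0, -175.0, -125.0]
```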

## Creating or loading models

In [None]:

# creating empty log for recording performance
performance_log  = []

# setting the last episode number for performance log
last_episode = 0

# creating model list
sequence_size = history_size + future_size 
model_list = []
for _ in range(ensemble_size):
    model = build_model(state_size,
                        action_size,
                        reward_size,
                        feature_size,
                        sequence_size,
                        neural_type,
                        num_layers,
                        num_heads,
                        init,
                        opti,
                        loss,
                        bias,
                        drop_rate,
                        alpha)
    model.to(device)
    model_list.append(model)

# creating space for storing tensors as experience replay buffer
history_state_stack        = torch.empty(0).to(device)
history_action_stack       = torch.empty(0).to(device)
present_state_stack        = torch.empty(0).to(device)
future_action_stack        = torch.empty(0).to(device)
future_reward_stack        = torch.empty(0).to(device)
future_state_stack         = torch.empty(0).to(device)
history_state_hash_list    = list()
history_action_hash_list   = list()
present_state_hash_list    = list()
future_action_hash_list    = list()
future_reward_hash_list    = list()
future_state_hash_list     = list()

# load from pre-trained models if needed
if load_pretrained_model == False:
    pass
elif load_pretrained_model == True:
    try:
        model_dict = torch.load(model_directory, map_location=device)
        for i, model in enumerate(model_list):
            model.load_state_dict(model_dict[f'model_{i}'])
        history_state_stack, \
        history_action_stack, \
        present_state_stack, \
        future_action_stack, \
        future_reward_stack, \
        future_state_stack , \
        history_state_hash_list, \
        history_action_hash_list, \
        present_state_hash_list, \
        future_action_hash_list, \
        future_reward_hash_list, \
        future_state_hash_list = load_buffer_from_pickle(buffer_directory)
        history_state_stack    = history_state_stack.to (device) 
        history_action_stack   = history_action_stack.to(device) 
        present_state_stack    = present_state_stack.to (device) 
        future_action_stack    = future_action_stack.to (device) 
        future_reward_stack    = future_reward_stack.to (device) 
        future_state_stack     = future_state_stack .to (device) 
        performance_log        = load_performance_from_csv(performance_directory)
        last_episode           = performance_log[-1][0] + 1 if len(performance_log) > 0 else 0
        greed_epsilon_init     = max(greed_epsilon_init * (greed_epsilon_decay ** last_episode), greed_epsilon_min)
        print('Loaded pre-trained models.')
    except Exception as error:
        print(f'Failed loading pre-trained models ({error}). Now using new models.')

## Putting all the previous works into play
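
The training loop in this section plans once and then executes up to `batch_size_for_executing` of the planned actions before planning again, stopping early when the episode ends. The sketch below isolates that control flow without any environment or model, to show how the setting trades planning cost against how often the plan is refreshed. `count_planning_calls` is a hypothetical helper, and the fixed `episode_length` stands in for the `done`/`truncated` signal.

```python
def count_planning_calls(episode_length, batch_size_for_executing):
    """Count how often the planner runs in an episode lasting `episode_length` steps."""
    steps_taken = 0
    planning_calls = 0
    while steps_taken < episode_length:
        planning_calls += 1                      # one planning pass
        for _ in range(batch_size_for_executing):
            steps_taken += 1                     # one environment step
            if steps_taken >= episode_length:    # episode over: stop executing
                break
    return planning_calls


print(count_planning_calls(200, 10))  # 20
print(count_planning_calls(200, 1))   # 200
```

Executing 10 planned actions per plan needs a tenth of the planning passes of replanning every step, at the price of acting on older plans.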

In [None]:
"""
We don't randomize desired reward anymore because:
1 - It is not typical in RL.
2 - There are many more effective methods like epsilon-greedy, intrinsic motivation, and reward shaping that can drive an agent to explore effectively.
3 - Those methods are designed to balance exploration and exploitation in a way that promotes learning while keeping the agent on a meaningful path toward mastering the environment.
"""

"""
We no longer use Prioritized Experience Replay (PER) but rather use Random Experience Replay because:
1 - Although the agent might not learn as quickly as with Prioritized Experience Replay (PER), random experience replay allows the agent to gradually improve and become more stable. Over time, this leads to better long-term performance.
2 - Since random experience replay avoids focusing on just the most "critical" experiences, it reduces the risk of instability or overfitting. The agent ends up with a more robust and reliable policy that performs well in a variety of situations.
3 - PER is a GPU VRAM killer.
"""

"""
planning phase:
    (history_state) (history_action) present_state         future_action                 desired_reward
                                     -observed by agent    -planned by agent             -planned by agent
learning phase:
                                     present_state         future_action                 future_reward             future_state
                                     -observed by agent    -observed/executed by agent   -observed by agent        -observed by agent
                                                                                         -criterion set by human
"""

# starting each episode
for training_episode in tqdm(range(episode_for_training)):

    # initializing summed reward
    summed_reward  = 0

    """
    We fill the short term experience replay buffer with some dummy data to ensure that a history exists.
    """
    # initializing short term experience replay buffer
    state_list  = []
    action_list = []
    reward_list = []
    for _ in range(history_size):
        state_list .append(torch.zeros(state_size  ).to(device) - 1)
        action_list.append(torch.zeros(action_size ).to(device)    )
        reward_list.append(torch.zeros(reward_size ).to(device)    ) 

    # initializing environment
    env            = gym.make(game_name, max_episode_steps=max_steps_for_each_episode)
    state, info    = env.reset(seed = seed)
    
    # observing state
    state          = vectorizing_state(state, device)
    state_list.append(state)

    # starting each step
    done = False
    truncated = False 
    while not done and not truncated:

        """
        We let the agent take some history states and actions into consideration.
        """
        # initializing and updating action by desired reward                                  
        history_state, \
        history_action  = retrieve_history(state_list, action_list, history_size, device)
        present_state   = retrieve_present(state_list, device)
        future_action   = initialize_future_action(init_, greed_epsilon_t, greed_epsilon_init, (1, future_size, action_size), device)
        desired_reward  = initialize_desired_reward((1, future_size, reward_size), device)
        future_action   = update_future_action(epoch_for_planning,
                                               model_list,
                                               history_state ,
                                               history_action,
                                               present_state,
                                               future_action,
                                               desired_reward,
                                               beta)

        """
        We let the agent execute several planned actions rather than one at a time to make data gathering more efficient. 
        batch_size_for_executing shall be less than or equal to future_size.
        """
        # taking actions and skip planning 
        for i in range(batch_size_for_executing):

            # printing steps
            print(f'\rStep: {len(action_list)+1-history_size}\r', end='', flush=True)

            # observing action
            action, action_  = vectorizing_action(future_action[:, i:, :], device)
            action_list.append(action)
            
            # executing action
            state, reward, done, truncated, info = env.step(action_)

            # summing reward
            summed_reward += reward

            # observing actual reward
            reward = vectorizing_reward(state, reward, summed_reward, done, reward_size, device)
            reward_list.append(reward)

            # observing state
            state  = vectorizing_state(state, device)
            state_list.append(state)

            # stopping the execution of planned actions once the episode ends
            if done or truncated:
                break
            
    # closing env
    env.close()




    # recording performance
    print(f'Episode {training_episode + last_episode}: Summed_Reward = {summed_reward}')
    performance_log.append([training_episode + last_episode, summed_reward])




    """
    window_size shall be less than or equal to future_size and less than or equal to the minimum possible number of steps in an env.
    """
    # sequentializing short term experience replay buffer
    history_state_list   ,\
    history_action_list   ,\
    present_state_list   ,\
    future_action_list   ,\
    future_reward_list   ,\
    future_state_list    = sequentialize(state_list  ,
                                         action_list ,
                                         reward_list ,
                                         history_size,
                                         window_size)


    

    # storing sequentialized short term experience to long term experience replay buffer when it is a new experience
    history_state_stack, \
    history_action_stack, \
    present_state_stack, \
    future_action_stack, \
    future_reward_stack, \
    future_state_stack , \
    history_state_hash_list  , \
    history_action_hash_list  , \
    present_state_hash_list  , \
    future_action_hash_list  , \
    future_reward_hash_list  , \
    future_state_hash_list   = update_long_term_experience_replay_buffer(history_state_stack,
                                                                         history_action_stack,
                                                                         present_state_stack,
                                                                         future_action_stack,
                                                                         future_reward_stack,
                                                                         future_state_stack ,
                                                                         history_state_hash_list  ,
                                                                         history_action_hash_list  ,
                                                                         present_state_hash_list  ,
                                                                         future_action_hash_list  ,
                                                                         future_reward_hash_list  ,
                                                                         future_state_hash_list   ,
                                                                         history_state_list   ,
                                                                         history_action_list   ,
                                                                         present_state_list,
                                                                         future_action_list,
                                                                         future_reward_list,
                                                                         future_state_list )


    

    """
    We use batch_size to make training more efficient.
    """
    # training
    model_list = update_model_list(epoch_for_learning ,
                                   history_state_stack,
                                   history_action_stack,
                                   present_state_stack,
                                   future_action_stack,
                                   future_reward_stack,
                                   future_state_stack ,
                                   model_list,
                                   batch_size_for_learning
                                   )




    # limit_buffer
    history_state_stack, \
    history_action_stack, \
    present_state_stack, \
    future_action_stack, \
    future_reward_stack, \
    future_state_stack , \
    history_state_hash_list  , \
    history_action_hash_list  , \
    present_state_hash_list  , \
    future_action_hash_list  , \
    future_reward_hash_list  , \
    future_state_hash_list   = limit_buffer(history_state_stack,
                                            history_action_stack,
                                            present_state_stack,
                                            future_action_stack,
                                            future_reward_stack,
                                            future_state_stack ,
                                            history_state_hash_list  ,
                                            history_action_hash_list  ,
                                            present_state_hash_list  ,
                                            future_action_hash_list  ,
                                            future_reward_hash_list  ,
                                            future_state_hash_list ,
                                            buffer_limit  )




    """
    We set a lower bound (greed_epsilon_min) on greed_epsilon_init so that it does not decay toward zero,
    which would be similar to initializing the weights of a neural network to nearly zero.
    """
    # decaying greed_epsilon_init, clamped at greed_epsilon_min
    greed_epsilon_init = greed_epsilon_init * greed_epsilon_decay
    greed_epsilon_init = max(greed_epsilon_init , greed_epsilon_min)




    # saving final reward to log
    save_performance_to_csv(performance_log, performance_directory)

    # saving nn models
    model_dict = {}
    for i, model in enumerate(model_list):
        model_dict[f'model_{i}'] = model.state_dict()
    torch.save(model_dict, model_directory)

    # saving long term experience replay buffer
    save_buffer_to_pickle(buffer_directory,
                          history_state_stack,
                          history_action_stack,
                          present_state_stack,
                          future_action_stack,
                          future_reward_stack,
                          future_state_stack,
                          history_state_hash_list,
                          history_action_hash_list,
                          present_state_hash_list,
                          future_action_hash_list,
                          future_reward_hash_list,
                          future_state_hash_list)




    # clear up
    gc.collect()
    torch.cuda.empty_cache()
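
The exploration schedule at the end of each training iteration multiplies `greed_epsilon_init` by `greed_epsilon_decay` and clamps the result at `greed_epsilon_min`. The sketch below reproduces only that update rule in isolation. The function name `epsilon_schedule` and the numeric values are assumptions chosen for illustration; they are not the notebook's settings.

```python
def epsilon_schedule(epsilon_init, decay, epsilon_min, steps):
    """Return the epsilon value used at each of `steps` training iterations."""
    values = []
    epsilon = epsilon_init
    for _ in range(steps):
        values.append(epsilon)
        # same rule as the training loop: decay first, then clamp at the floor
        epsilon = max(epsilon * decay, epsilon_min)
    return values


# hypothetical values, for illustration only
schedule = epsilon_schedule(epsilon_init=1.0, decay=0.5, epsilon_min=0.2, steps=5)
print(schedule)
```

Once the decayed value would fall below the floor, the schedule stays at `epsilon_min` for every later iteration.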

# Planning only
Testing mode: the agent trained in training mode no longer learns offline. It simply runs each episode using planning alone, without updating the models or the replay buffer.

## Loading models

In [None]:
sequence_size = history_size + future_size 
model_list = []
for _ in range(ensemble_size):
    model = build_model(state_size,
                        action_size,
                        reward_size,
                        feature_size,
                        sequence_size ,
                        neural_type,
                        num_layers,
                        num_heads,
                        init,
                        opti,
                        loss,
                        bias,
                        drop_rate,
                        alpha)
    model.to(device)
    model_list.append(model)

# map_location lets a checkpoint saved on GPU be loaded on a CPU-only machine
model_dict = torch.load(model_directory, map_location=device)
for i, model in enumerate(model_list):
    model.load_state_dict(model_dict[f'model_{i}'])

## Putting all the previous work into play ... again

But this time the agent does not learn.
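
The testing loop follows a plan-then-execute pattern: it plans a sequence of future actions, executes the first few of them, and then replans from the newly observed state. The sketch below shows only that control flow with toy stand-ins. The names `run_episode`, `toy_plan`, and `toy_step` are hypothetical; `plan` stands in for `update_future_action`, `horizon` for `future_size`, and `execute_per_plan` for `batch_size_for_executing`.

```python
def run_episode(plan, step, horizon, execute_per_plan, max_steps):
    """Plan `horizon` actions, execute the first `execute_per_plan`, then replan."""
    executed = []
    replans = 0
    done = False
    while not done:
        planned = plan(horizon)  # stand-in for update_future_action
        replans += 1
        for i in range(execute_per_plan):
            action = planned[i]
            executed.append(action)
            # the episode ends when the environment says so or on truncation
            done = step(action) or len(executed) >= max_steps
            if done:
                break
    return executed, replans


def toy_plan(horizon):
    """Hypothetical planner: always proposes the actions 0, 1, 2, ..."""
    return list(range(horizon))


def toy_step(action):
    """Hypothetical environment step: never terminates on its own."""
    return False


executed, replans = run_episode(toy_plan, toy_step,
                                horizon=4, execute_per_plan=2, max_steps=5)
print(executed, replans)
```

In this sketch `execute_per_plan` must not exceed `horizon`, otherwise the inner loop would index past the planned actions. The same constraint presumably applies to `batch_size_for_executing` and `future_size` in the cell below.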

In [None]:
# score recorder
total_summed_reward = 0

# starting each episode
for testing_episode in range(episode_for_testing):

    # initializing summed reward
    summed_reward  = 0

    # initializing short term experience replay buffer
    state_list  = []
    action_list = []
    for _ in range(history_size):
        state_list .append(torch.zeros(state_size  ).to(device) - 1)
        action_list.append(torch.zeros(action_size ).to(device)    )

    # initializing environment
    env = gym.make(game_name, max_episode_steps = max_steps_for_each_episode,
                   render_mode = "human" if render_for_human else None)
    state, info = env.reset(seed = seed)
    if render_for_human:
        env.render()

    # observing state
    state = vectorizing_state(state, device)
    state_list.append(state)

    # starting each step
    done = False
    truncated = False
    while not done and not truncated:
        
        # initializing and updating action   
        history_state, \
        history_action = retrieve_history(state_list, action_list, history_size, device)
        present_state  = retrieve_present(state_list, device)
        future_action  = initialize_future_action(init_, greed_epsilon_t, greed_epsilon_init, (1, future_size, action_size), device)
        desired_reward = initialize_desired_reward((1, future_size, reward_size), device)
        future_action  = update_future_action(epoch_for_planning,
                                              model_list,
                                              history_state ,
                                              history_action,
                                              present_state,
                                              future_action,
                                              desired_reward,
                                              beta)
    
        # executing the first batch_size_for_executing planned actions without replanning
        for i in range(batch_size_for_executing):

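            # action_list starts with history_size padding entries, so subtract them from the step count
            print(f'\rStep: {len(action_list) - history_size + 1}\r', end='', flush=True)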

            # observing action
            action, action_  = vectorizing_action(future_action[:, i:, :], device)
            action_list.append(action)

            # executing action
            state, reward, done, truncated, info = env.step(action_)
            if render_for_human:
                env.render()
                
            # summing reward
            summed_reward += reward
            
            # observing state
            state = vectorizing_state(state, device)
            state_list.append(state)
            
            # terminating episode if done or truncated
            if done or truncated:
                break
        
    # closing env
    env.close()

    # recording
    print("Summed reward:", summed_reward)
    print(f'Episode: {testing_episode + 1}')
    print('Averaged summed reward:')
    total_summed_reward += summed_reward
    print(total_summed_reward/(testing_episode + 1))

