<a href="https://colab.research.google.com/github/Brownwang0426/Reversal-Generative-Reinforcement-Learning/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up (for colab)

In [None]:
!sudo apt-get install python3.10
!pip install torch==2.0.1 
!pip install numpy==1.25.2 scipy==1.11.4 swig==4.2.1 ufal.pybox2d==2.3.10.3 gymnasium==1.0.0 minigrid==3.0.0 tqdm==4.67.1 dill==0.3.8

In [None]:
!git clone --branch main https://github.com/Brownwang0426/Reversal-Generative-Reinforcement-Learning.git

In [None]:
import os
os.chdir('/content/Reversal-Generative-Reinforcement-Learning')

# Setting up (for local)
CUDA Toolkit 11.8 \
cuDNN 8.9.x \
pip install torch==2.0.1 --extra-index-url https://download.pytorch.org/whl/cu118  \
pip install numpy==1.25.2 scipy==1.11.4 swig==4.2.1 ufal.pybox2d==2.3.10.3 gymnasium==1.0.0 minigrid==3.0.0 tqdm==4.67.1 dill==0.3.8

# Importing modules

In [None]:
import gymnasium as gym
from gymnasium.wrappers import TimeLimit
import minigrid

import numpy as np
import math
from scipy.special import softmax

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.rnn as rnn_utils
from torch.utils.data import DataLoader, TensorDataset, Subset

import csv

import multiprocessing as mp
import os
import sys
import copy
import random
import gc
import time
from tqdm import tqdm
from collections import defaultdict

import itertools

import dill

import warnings
warnings.filterwarnings('ignore')

import concurrent.futures
import hashlib

# Checking cuda

In [None]:
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"Device {i}: {torch.cuda.get_device_name(i)}")
    device_index = 0
    device = torch.device(f"cuda:{device_index}")
    print('using cuda...')
else:
    device = torch.device("cpu")
    print('using cpu...')

torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True

# Control board

Crucial configurations regarding how your agent will learn in the environment. The meanings are as follows:
(the configs starting with ⚠️ are the ones we suggest you tune according to the specific needs of your task)
(the configs starting with ◀️ are the ones we suggest you play with to see the effect)



## Configs meaning
| Configs   | Type   | Description                                                                 |
|------------|--------|-----------------------------------------------------------------------------|
| ⚠️game_name  | STR| The name of the environment.                                |
| ⚠️max_steps_for_each_episode | +INT/None | The maximum number of steps the agent takes in an episode while not done. None keeps the environment's default limit. In some environments, it is crucial to increase "max_steps_for_each_episode" so that your agent can "live long enough" to obtain better rewards and gradually, heuristically learn a better strategy.                    |
| ⚠️seed | +INT/None | The seed for the environment. None for a random environment each episode.                    |
| load_pretrained_model  | BOOLEAN |Whether you want to load a previously trained model.                          |
| ◀️ensemble_size  | +INT | The size of the neural ensemble which the agent is comprised of. The bigger, the better, but the longer training time without parallel training. :-D                  |
| ⚠️state_size  | +INT | The size of the state as input data.                    |
| ⚠️action_size   | +INT | The size of action per step as input data.   |
| ⚠️reward_size  | +INT |The size of the reward as output data.                          |
| ⚠️feature_size   | +INT |The size of the hidden layers. **`Shall be bigger than the sum of state_size, action_size and reward_size`**.      |
| ⚠️history_size  | 0/+INT |How many steps in the history for state and action will the agent take into consideration.                           |
| ⚠️future_size  | +INT |The length of the sequence of actions. Namely, how many steps in the future the agent will predict or use to discern the present best action.                |
| ⚠️neural_type  | STR |  [**`rnn`**, **`gru`**, **`lstm`**, **`td`**, **`rnn_td`**, **`gru_td`**, **`lstm_td`**] The type of neural network you prefer. For now, we support rnn, gru, lstm, and td (Transformer decoder only). More to come in the future (or you can build one yourself :-D in the models repository).           |
| ⚠️num_layers  | +INT |The number of layers in rnn, gru, lstm, and td (Transformer decoder only).     |
| ⚠️num_heads  | +INT/None |The number of heads in multi-head attention. **`Shall divide feature_size evenly`**. **`Shall be None for non-attention neural_type`**.                         |
| init   | STR | [**`random_normal`**, **`random_uniform`**, **`xavier_normal`**, **`xavier_uniform`**, **`glorot_normal`**, **`glorot_uniform`**] The initialization method you prefer for initializing the neural net ensemble of your agent.                          |
| opti   | STR | [**`adam`**, **`sgd`**, **`rmsprop`**]  The optimization method you prefer.             |
| loss  | STR | [**`mean_squared_error`**, **`binary_crossentropy`**] The loss or error function you prefer.                           |
| bias  | BOOLEAN |Whether you want to add bias.                          |
| drop_rate   | 0/+FLOAT |The drop-rate for drop-out.              |
| alpha   | 0/+FLOAT |The learning rate for neural networks weight matrices.                           |
| itrtn_for_learning   | +INT |The iteration for learning per experience.              |
| init_   | STR | [**`random_normal`**, **`random_uniform`**] The initialization method you prefer for initializing the actions of your agent.                         |
| greed_epsilon_t  |  +INT |The number of times Gaussian noise is applied to the initialized actions of the agent, similar to the noise-adding step of a diffusion model.          |
| greed_epsilon_r  |  +FLOAT |The initial greed_epsilon, i.e. the noise range used to initialize the actions of the agent. The higher the value, the more exploration-oriented the agent will be in the beginning.                    |
| ⚠️greed_epsilon_decay| +FLOAT |The decay rate of greed_epsilon for each step and episode.|
| greed_epsilon_min  | 	+FLOAT	|A very small number representing the lower bound of the greed_epsilon.|
| beta  |  0/+FLOAT |The updating rate for updating actions of the agent.              |
| itrtn_for_planning  |  +INT |The iteration for updating actions of the agent.                           |
| episode_for_training  | +INT |How many episodes your agent will run in training mode, where it learns offline.              |
| episode_for_validation  | +INT |The validation interval in episodes: every episode_for_validation-th episode starts from the regular starting point (without the randomizer) to measure the actual performance of the agent.              |
| ⚠️batch_size_for_executing| +INT | How many planned actions the agent executes in a row before planning again. **`Shall be less than or equal to future_size`**. |
| ⚠️batch_size_for_learning  | +INT | Batch size for learning or training neural nets.              |
| buffer_limit  | +INT |The maximum size for your buffer.              |
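
The table states three constraints between configs in bold. A minimal sketch of how they could be checked before training is shown below. The helper `check_config` is hypothetical (it is not part of this repository), and treating every `neural_type` that ends in `td` as attention-based is an assumption.

```python
def check_config(state_size, action_size, reward_size, feature_size,
                 neural_type, num_heads, future_size, batch_size_for_executing):
    """Return a list of violated constraints; an empty list means all checks passed."""
    problems = []

    # feature_size shall be bigger than the sum of the three vector sizes.
    if feature_size <= state_size + action_size + reward_size:
        problems.append("feature_size <= state_size + action_size + reward_size")

    # Assumption: every neural_type ending in "td" uses multi-head attention.
    uses_attention = neural_type.endswith("td")
    if uses_attention:
        if num_heads is None or feature_size % num_heads != 0:
            problems.append("num_heads does not divide feature_size")
    elif num_heads is not None:
        problems.append("num_heads should be None for non-attention neural_type")

    # The agent cannot execute more planned actions than it planned.
    if batch_size_for_executing > future_size:
        problems.append("batch_size_for_executing > future_size")

    return problems


# The lunar lander preset from this notebook passes all three checks.
lunar_lander_problems = check_config(900, 4, 250, 1200, "td", 10, 75, 5)
print(lunar_lander_problems)  # prints []
```

Running such a check right after the config cell would surface a bad combination before any model is built.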


## frozen lake

In [None]:
game_name =  'FrozenLake-v1'         #⚠️   gym.make(game_name, max_episode_steps=max_steps_for_each_episode, is_slippery=False, map_name="4x4")
max_steps_for_each_episode = 20      #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  26                     #⚠️
action_size = 4                      #⚠️
reward_size = 100                    #⚠️
feature_size = 150                   #⚠️
history_size  = 0                    #⚠️
future_size = 10                     #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                  
itrtn_for_learning  = 150

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_r     = 1e-5  
greed_epsilon_decay = 0.000          #⚠️               
greed_epsilon_min   = 1e-10    
beta = 1                     
itrtn_for_planning  = 100          

episode_for_training = 100000

episode_for_validation = 5

batch_size_for_executing = 1         #⚠️

batch_size_for_learning = 1          #⚠️       

buffer_limit = 10000   



## blackjack

In [None]:
game_name = 'Blackjack-v1'           #⚠️
max_steps_for_each_episode = None    #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  400                    #⚠️
action_size = 2                      #⚠️
reward_size = 100                    #⚠️
feature_size = 750                   #⚠️
history_size  = 0                    #⚠️
future_size = 5                      #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                  
itrtn_for_learning  = 150

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_r     = 1e-5  
greed_epsilon_decay = 0.000          #⚠️               
greed_epsilon_min   = 1e-10    
beta = 1                     
itrtn_for_planning  = 100          

episode_for_training = 100000

episode_for_validation = 5

batch_size_for_executing = 1         #⚠️

batch_size_for_learning = 1          #⚠️       

buffer_limit = 10000   

## cartpole

In [None]:
game_name = 'CartPole-v1'            #⚠️
max_steps_for_each_episode = 1000    #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  500                    #⚠️
action_size = 2                      #⚠️
reward_size = 100                    #⚠️
feature_size = 750                   #⚠️
history_size  = 0                    #⚠️
future_size = 25                     #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                  
itrtn_for_learning  = 150

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_r     = 1e-5  
greed_epsilon_decay = 0.000          #⚠️               
greed_epsilon_min   = 1e-10    
beta = 1                     
itrtn_for_planning  = 100          

episode_for_training = 100000

episode_for_validation = 5

batch_size_for_executing = 5         #⚠️

batch_size_for_learning = 1          #⚠️       

buffer_limit = 10000   



## mountain car

In [None]:
game_name =  'MountainCar-v0'        #⚠️
max_steps_for_each_episode = None    #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  300                    #⚠️
action_size = 3                      #⚠️
reward_size = 100                    #⚠️
feature_size = 500                   #⚠️
history_size  = 0                    #⚠️
future_size = 75                     #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                  
itrtn_for_learning  = 150

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_r     = 1e-5  
greed_epsilon_decay = 0.000          #⚠️               
greed_epsilon_min   = 1e-10    
beta = 1                     
itrtn_for_planning  = 100          

episode_for_training = 100000

episode_for_validation = 5

batch_size_for_executing = 5         #⚠️

batch_size_for_learning = 1          #⚠️       

buffer_limit = 10000   




## acrobot

In [None]:
game_name = 'Acrobot-v1'             #⚠️
max_steps_for_each_episode = 250     #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  700                    #⚠️
action_size = 3                      #⚠️
reward_size = 100                    #⚠️
feature_size = 1000                  #⚠️
history_size  = 0                    #⚠️
future_size = 75                     #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                  
itrtn_for_learning  = 150

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_r     = 1e-5  
greed_epsilon_decay = 0.000          #⚠️               
greed_epsilon_min   = 1e-10    
beta = 1                     
itrtn_for_planning  = 100          

episode_for_training = 100000

episode_for_validation = 5

batch_size_for_executing = 5         #⚠️

batch_size_for_learning = 1          #⚠️       

buffer_limit = 10000   


## lunar lander

In [None]:
game_name = "LunarLander-v3"         #⚠️
max_steps_for_each_episode = 250     #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  900                    #⚠️
action_size = 4                      #⚠️
reward_size = 250                    #⚠️
feature_size = 1200                  #⚠️
history_size  = 75                   #⚠️
future_size = 75                     #⚠️ 
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                  
itrtn_for_learning  = 150

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_r     = 1e-5  
greed_epsilon_decay = 0.000          #⚠️               
greed_epsilon_min   = 1e-10    
beta = 1                     
itrtn_for_planning  = 100          

episode_for_training = 100000

episode_for_validation = 5

batch_size_for_executing = 5         #⚠️

batch_size_for_learning = 1          #⚠️       

buffer_limit = 10000   




## door key

In [None]:
game_name = "MiniGrid-DoorKey-5x5-v0"#⚠️
max_steps_for_each_episode = None    #⚠️
seed = 1                             #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  257                    #⚠️
action_size = 7                      #⚠️
reward_size = 100                    #⚠️
feature_size = 500                   #⚠️
history_size  = 0                    #⚠️
future_size = 10                     #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                  
itrtn_for_learning  = 150

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_r     = 1e-5  
greed_epsilon_decay = 0.000          #⚠️               
greed_epsilon_min   = 1e-10    
beta = 1                     
itrtn_for_planning  = 100          

episode_for_training = 100000

episode_for_validation = 5

batch_size_for_executing = 1         #⚠️

batch_size_for_learning = 1          #⚠️       

buffer_limit = 10000   


## your present config

In [None]:
game_name = "LunarLander-v3"         #⚠️
max_steps_for_each_episode = 250     #⚠️
seed = None                          #⚠️

load_pretrained_model = True

ensemble_size = 5                    #◀️

state_size =  900                    #⚠️
action_size = 4                      #⚠️
reward_size = 250                    #⚠️
feature_size = 1200                  #⚠️
history_size  = 75                   #⚠️
future_size = 75                     #⚠️ 
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                  
itrtn_for_learning  = 150

init_ = "random_uniform"
greed_epsilon_t     = 1
greed_epsilon_r     = 1e-5  
greed_epsilon_decay = 0.000          #⚠️               
greed_epsilon_min   = 1e-10    
beta = 1                     
itrtn_for_planning  = 100          

episode_for_training = 100000

episode_for_validation = 5

batch_size_for_executing = 5         #⚠️

batch_size_for_learning = 1          #⚠️       

buffer_limit = 10000   




In [None]:
episode_for_testing = 100
render_for_human = True

suffix                 = f"game_{game_name}-type_{neural_type}-ensemble_{ensemble_size:05d}-learn_{itrtn_for_learning:05d}-plan_{itrtn_for_planning:05d}"
directory              = f'./result/{game_name}/'
performance_directory  = f'./result/{game_name}/performace-{suffix}.csv'
model_directory        = f'./result/{game_name}/model-{suffix}.pth'
buffer_directory       = f'./result/{game_name}/buffer-{suffix}.dill'

os.makedirs(directory, exist_ok=True)

# Importing local modules

In [None]:
game_modules = {
    'FrozenLake-v1': 'envs.env_frozenlake',
    'Blackjack-v1': 'envs.env_blackjack',
    'CartPole-v1': 'envs.env_cartpole',
    'MountainCar-v0': 'envs.env_mountaincar',
    'Acrobot-v1': 'envs.env_acrobot',
    'LunarLander-v3': 'envs.env_lunarlander',
    'MiniGrid-DoorKey-5x5-v0': 'envs.env_doorkey'
}
if game_name in game_modules:
    game_module = __import__(game_modules[game_name], fromlist=['vectorizing_state', 'vectorizing_action', 'vectorizing_reward'])
    vectorizing_state  = game_module.vectorizing_state
    vectorizing_action = game_module.vectorizing_action
    vectorizing_reward = game_module.vectorizing_reward
    randomizer         = game_module.randomizer
else:
    raise RuntimeError('Missing env functions')

In [None]:
model_modules = {
    'td': 'models.model_td',
    'rnn_td': 'models.model_rnn_td',
    'gru_td': 'models.model_gru_td',
    'lstm_td': 'models.model_lstm_td',  # assumed module path, following the rnn_td / gru_td naming
    'rnn': 'models.model_rnn',
    'gru': 'models.model_rnn',
    'lstm': 'models.model_rnn'
}
if neural_type in model_modules:
    model_module = __import__(model_modules[neural_type], fromlist=['build_model'])
    build_model  = model_module.build_model
else:
    raise RuntimeError('Missing model functions')

from utils.util_func  import load_performance_from_csv,\
                             load_buffer_from_pickle,\
                             retrieve_history,\
                             retrieve_present,\
                             initialize_future_action, \
                             initialize_desired_reward,\
                             update_future_action, \
                             sequentialize, \
                             update_long_term_experience_replay_buffer,\
                             update_model_list,\
                             limit_buffer,\
                             save_performance_to_csv,\
                             save_buffer_to_pickle


# Planning -> Learning
Training mode where your agent will learn offline. You can see here how your agent learns over time and improves its performance.
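
The planning half of this section relies on `update_future_action`, which is imported from `utils.util_func` and not shown in this notebook. The sketch below only illustrates the general idea suggested by the `beta` and `itrtn_for_planning` descriptions: keep the model fixed and repeatedly nudge the actions so that the predicted reward moves toward the desired reward. The linear toy model and the functions `predict_reward` and `plan_actions` are assumptions made for illustration, not the repository's implementation, which works on an ensemble of neural networks.

```python
def predict_reward(weights, actions):
    """Toy stand-in for a trained model: a fixed linear map from actions to reward."""
    return sum(w * a for w, a in zip(weights, actions))


def plan_actions(weights, actions, desired_reward, beta, itrtn_for_planning):
    """Gradient descent on the actions (not the weights) for 0.5 * (prediction - desired)^2."""
    actions = list(actions)
    for _ in range(itrtn_for_planning):
        error = predict_reward(weights, actions) - desired_reward
        # d(loss)/d(action_i) = error * weight_i; the weights stay frozen.
        actions = [a - beta * error * w for a, w in zip(actions, weights)]
    return actions


weights = [0.5, -0.25, 0.75]          # frozen toy "model"
initial_actions = [0.0, 0.0, 0.0]     # stands in for the initialized future actions
planned = plan_actions(weights, initial_actions, desired_reward=1.0,
                       beta=0.5, itrtn_for_planning=100)
print(round(predict_reward(weights, planned), 6))
```

In this toy setting `beta` plays the role of a step size on the actions, while `alpha` in the control board is the step size on the weights during learning.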

## Creating or loading models

In [None]:

# creating empty log for recording performance
performance_log  = []

# setting the last episode number for performance log
last_episode = 0

# creating model list
sequence_size = history_size + future_size 
model_list = []
for _ in range(ensemble_size):
    model = build_model(state_size,
                        action_size,
                        reward_size,
                        feature_size,
                        sequence_size,
                        neural_type,
                        num_layers,
                        num_heads,
                        init,
                        opti,
                        loss,
                        bias,
                        drop_rate,
                        alpha)
    model.to(device)
    model_list.append(model)

# creating space for storing tensors as experience replay buffer
history_state_stack        = torch.empty(0).to(device)
history_action_stack       = torch.empty(0).to(device)
present_state_stack        = torch.empty(0).to(device)
future_action_stack        = torch.empty(0).to(device)
future_reward_stack        = torch.empty(0).to(device)
future_state_stack         = torch.empty(0).to(device)
history_state_hash_list    = list()
history_action_hash_list   = list()
present_state_hash_list    = list()
future_action_hash_list    = list()
future_reward_hash_list    = list()
future_state_hash_list     = list()

# load from pre-trained models if needed
if load_pretrained_model:
    try:
        model_dict = torch.load(model_directory, map_location=device)
        for i, model in enumerate(model_list):
            model.load_state_dict(model_dict[f'model_{i}'])
        history_state_stack, \
        history_action_stack,\
        present_state_stack, \
        future_action_stack, \
        future_reward_stack, \
        future_state_stack,  \
        history_state_hash_list, \
        history_action_hash_list, \
        present_state_hash_list, \
        future_action_hash_list, \
        future_reward_hash_list, \
        future_state_hash_list = load_buffer_from_pickle(buffer_directory)
        history_state_stack    = history_state_stack.to (device) 
        history_action_stack   = history_action_stack.to(device) 
        present_state_stack    = present_state_stack.to (device) 
        future_action_stack    = future_action_stack.to (device) 
        future_reward_stack    = future_reward_stack.to (device) 
        future_state_stack     = future_state_stack .to (device) 
        performance_log        = load_performance_from_csv(performance_directory)
        last_episode           = performance_log[-1][0] if len(performance_log) > 0 else 0
        greed_epsilon_r        = max(greed_epsilon_r - (greed_epsilon_decay * last_episode), greed_epsilon_min)
        print('Loaded pre-trained models.')
    except Exception as e:
        print(f'Failed loading pre-trained models ({e}). Now using new models.')
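
When a run is resumed, the cell above recomputes `greed_epsilon_r` in one step from `last_episode`, while the training loop lowers it once per episode. A small sketch, using illustrative values rather than the presets of this notebook, suggests that the two agree. The function names are hypothetical.

```python
import math


def decay_one_episode(greed_epsilon_r, greed_epsilon_decay, greed_epsilon_min):
    """Per-episode update: subtract the decay, then clamp at the lower bound."""
    return max(greed_epsilon_r - greed_epsilon_decay, greed_epsilon_min)


def resume_epsilon(initial_r, greed_epsilon_decay, greed_epsilon_min, last_episode):
    """One-step recomputation used when loading a pre-trained run."""
    return max(initial_r - greed_epsilon_decay * last_episode, greed_epsilon_min)


def stepwise_epsilon(initial_r, greed_epsilon_decay, greed_epsilon_min, episodes):
    """Apply the per-episode update `episodes` times in a row."""
    r = initial_r
    for _ in range(episodes):
        r = decay_one_episode(r, greed_epsilon_decay, greed_epsilon_min)
    return r


# Illustrative values only: start at 1.0, lose 0.05 per episode, never go below 0.1.
for episodes in (0, 10, 18, 30):
    a = stepwise_epsilon(1.0, 0.05, 0.1, episodes)
    b = resume_epsilon(1.0, 0.05, 0.1, episodes)
    print(episodes, math.isclose(a, b, abs_tol=1e-9))
```

With `greed_epsilon_decay = 0.000`, as in the presets, both forms simply leave `greed_epsilon_r` unchanged.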

## Putting all the previous works into play

In [None]:
"""
We don't randomize desired reward anymore because:
1 - It is not typical in RL.
2 - There are many more effective methods like epsilon-greedy, intrinsic motivation, and reward shaping that can drive an agent to explore effectively.
3 - Those methods are designed to balance exploration and exploitation in a way that promotes learning while keeping the agent on a meaningful path toward mastering the environment.
"""

# starting each episode
for training_episode in tqdm(range(episode_for_training)):
    latest_episode = training_episode + last_episode + 1

    # initializing summed reward
    summed_reward  = 0

    # initializing short term experience replay buffer
    state_list  = []
    action_list = []
    reward_list = []
    for _ in range(history_size):
        state_list .append(torch.zeros(state_size  ).to(device) - 1)
        action_list.append(torch.zeros(action_size ).to(device) - 1)
        reward_list.append(vectorizing_reward(None, 0, 0, False, reward_size, device)) 

    # initializing environment
    env            = gym.make(game_name, max_episode_steps=max_steps_for_each_episode)
    if latest_episode % episode_for_validation != 0:
        env = randomizer(env)
    state, info    = env.reset(seed = seed)
    
    # observing state
    state          = vectorizing_state(state, False, device)
    state_list.append(state)

    # starting each step
    post_done_counter = 0
    post_done_steps = future_size
    done_flag = False
    done = False
    truncated = False
    while not done_flag and not truncated:

        """
        We let the agent take some history states and actions into consideration.
        """
        # initializing and updating action by desired reward                                  
        history_state, \
        history_action  = retrieve_history(state_list, action_list, history_size, device)
        present_state   = retrieve_present(state_list, device)
        future_action   = initialize_future_action(init_, greed_epsilon_t, greed_epsilon_r, (1, future_size, action_size), device)
        desired_reward  = initialize_desired_reward((1, future_size, reward_size), device)
        future_action   = update_future_action(itrtn_for_planning,
                                               model_list,
                                               history_state ,
                                               history_action,
                                               present_state,
                                               future_action,
                                               desired_reward,
                                               beta)

        """
        We let the agent execute several planned actions rather than one at a time to make data gathering more efficient.
        batch_size_for_executing must be less than or equal to future_size.
        """
        # taking actions and skipping planning
        for i in range(batch_size_for_executing):

            # observing action
            action, action_  = vectorizing_action(future_action[:, i:, :], device)
            action_list.append(action)

            # executing action
            state, reward, done, truncated, info = env.step(action_)

            # summing reward
            if done:
                reward = 0
            summed_reward += reward

            # observing actual reward
            reward = vectorizing_reward(state, reward, summed_reward, done, reward_size, device)
            reward_list.append(reward)

            # observing state
            state = vectorizing_state(state, done, device)
            state_list.append(state)

            """
            We extended the episode beyond termination: after done, the agent keeps stepping until post_done_counter reaches post_done_steps (set to future_size above).
            Though this is contrary to common practice in RL, it makes the fixed-window sequentialization of the short-term experience replay buffer easier to handle.
            It also lets the agent plan ahead even after the episode is done.
            We pass the done flag to the state vectorizer to mark that the environment is done, so that the agent is not confused.
            The done flag should change the state considerably to remind the agent that the environment is done.
            """
            # if done, continue for a short period, then store the experience in the short-term experience replay buffer
            if done:
                post_done_counter += 1
                if post_done_counter >= post_done_steps:
                    done_flag = True
                    break            
            elif truncated:
                break
            else:
                print(f'\rStep: {len(action_list)+1}\r', end='', flush=True)
                
    # closing env
    env.close()




    # recording performance
    if latest_episode % episode_for_validation == 0:
        print(f'Episode {latest_episode}: Summed_Reward = {summed_reward}')
        performance_log.append([latest_episode, summed_reward])




    # sequentializing short term experience replay buffer
    history_state_list   ,\
    history_action_list   ,\
    present_state_list   ,\
    future_action_list   ,\
    future_reward_list   ,\
    future_state_list    = sequentialize(state_list  ,
                                         action_list ,
                                         reward_list ,
                                         history_size,
                                         future_size)


    

    # storing sequentialized short term experience to long term experience replay buffer 
    history_state_stack, \
    history_action_stack, \
    present_state_stack, \
    future_action_stack, \
    future_reward_stack, \
    future_state_stack,\
    history_state_hash_list  , \
    history_action_hash_list  , \
    present_state_hash_list  , \
    future_action_hash_list  , \
    future_reward_hash_list  , \
    future_state_hash_list      = update_long_term_experience_replay_buffer(history_state_stack,
                                                                            history_action_stack,
                                                                            present_state_stack,
                                                                            future_action_stack,
                                                                            future_reward_stack,
                                                                            future_state_stack ,
                                                                            history_state_hash_list  ,
                                                                            history_action_hash_list  ,
                                                                            present_state_hash_list  ,
                                                                            future_action_hash_list  ,
                                                                            future_reward_hash_list  ,
                                                                            future_state_hash_list   ,
                                                                            history_state_list   ,
                                                                            history_action_list   ,
                                                                            present_state_list,
                                                                            future_action_list,
                                                                            future_reward_list,
                                                                            future_state_list )


    

    """
    We use batch_size to make training more efficient.
    """
    # training
    dataset     = TensorDataset     (history_state_stack,
                                     history_action_stack,
                                     present_state_stack,
                                     future_action_stack,
                                     future_reward_stack,
                                     future_state_stack  )
    model_list  = update_model_list (itrtn_for_learning ,
                                     dataset,
                                     model_list,
                                     batch_size_for_learning
                                     )




    # limit_buffer
    history_state_stack, \
    history_action_stack, \
    present_state_stack, \
    future_action_stack, \
    future_reward_stack, \
    future_state_stack , \
    history_state_hash_list  , \
    history_action_hash_list  , \
    present_state_hash_list  , \
    future_action_hash_list  , \
    future_reward_hash_list  , \
    future_state_hash_list   = limit_buffer(history_state_stack,
                                            history_action_stack,
                                            present_state_stack,
                                            future_action_stack,
                                            future_reward_stack,
                                            future_state_stack ,
                                            history_state_hash_list  ,
                                            history_action_hash_list  ,
                                            present_state_hash_list  ,
                                            future_action_hash_list  ,
                                            future_reward_hash_list  ,
                                            future_state_hash_list ,
                                            buffer_limit  )




    """
    We set a decay rate for greed_epsilon_r to make the agent more greedy as time goes by.
    We set a lower bound for greed_epsilon_r to prevent it from becoming too small, which would be similar to initializing the weights of a neural network to nearly zero.
    """
    # decaying greed_epsilon_r
    greed_epsilon_r = greed_epsilon_r - greed_epsilon_decay
    greed_epsilon_r = max(greed_epsilon_r , greed_epsilon_min)




    # saving when reaching episode_for_validation
    if latest_episode % episode_for_validation == 0:
        
        # saving final reward to log
        save_performance_to_csv(performance_log, performance_directory)

        # saving nn models
        model_dict = {}
        for i, model in enumerate(model_list):
            model_dict[f'model_{i}'] = model.state_dict()
        torch.save(model_dict, model_directory)

        # saving long term experience replay buffer
        save_buffer_to_pickle(buffer_directory,
                              history_state_stack,
                              history_action_stack,
                              present_state_stack,
                              future_action_stack,
                              future_reward_stack,
                              future_state_stack,
                              history_state_hash_list,
                              history_action_hash_list,
                              present_state_hash_list,
                              future_action_hash_list,
                              future_reward_hash_list,
                              future_state_hash_list)



 
    # clear up
    gc.collect()
    torch.cuda.empty_cache()

Episode 5: Summed_Reward = -428.25974835377446
Episode 10: Summed_Reward = -31.933185639124517
Episode 15: Summed_Reward = -478.43427025676
Episode 20: Summed_Reward = -726.9235884101283
Episode 25: Summed_Reward = -127.89711955207126
Episode 30: Summed_Reward = -257.78283708309266
Episode 35: Summed_Reward = -561.5574215314017
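
The training loop above stores every ensemble member's `state_dict` in a single dictionary keyed `model_{i}` before calling `torch.save`. A minimal round-trip sketch of that pattern follows. It assumes hypothetical stand-in `nn.Linear` models and a temporary file path in place of the notebook's `build_model` and `model_directory`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-ins for the ensemble built by build_model.
ensemble_size = 3
model_list = [nn.Linear(4, 2) for _ in range(ensemble_size)]

# Save: one dictionary holding every member's parameters.
model_dict = {f'model_{i}': model.state_dict() for i, model in enumerate(model_list)}
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'ensemble.pt')
torch.save(model_dict, checkpoint_path)

# Load: rebuild the same architectures, then restore parameters by key.
# map_location lets a checkpoint written on a GPU load on a CPU-only machine.
restored_list = [nn.Linear(4, 2) for _ in range(ensemble_size)]
loaded_dict = torch.load(checkpoint_path, map_location='cpu')
for i, model in enumerate(restored_list):
    model.load_state_dict(loaded_dict[f'model_{i}'])

# Every restored member should now match its original.
round_trip_ok = all(
    torch.equal(a.weight, b.weight) and torch.equal(a.bias, b.bias)
    for a, b in zip(model_list, restored_list)
)
print(round_trip_ok)
```

Keeping all members in one file means the loader only needs `ensemble_size` and the key pattern to restore the whole ensemble.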

# Planning only
Testing mode: the agent trained in the training mode above no longer learns offline. It simply runs each episode, planning its actions with the trained models and never updating them.
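
To make "planning without learning" concrete, the sketch below keeps a model's weights frozen and adjusts only a candidate action by gradient descent until the predicted reward moves toward a desired reward. Everything in it is an assumption made for illustration: `toy_model`, the MSE loss, the step size, and the plain gradient step are hypothetical stand-ins, not the notebook's `update_future_action`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical frozen "world model": maps an action vector to a predicted reward.
toy_model = nn.Sequential(nn.Linear(3, 8), nn.Tanh(), nn.Linear(8, 1))
for param in toy_model.parameters():
    param.requires_grad_(False)   # no learning: weights are never updated
weights_before = [p.clone() for p in toy_model.parameters()]

# The action is the only tensor that receives gradients.
future_action = torch.zeros(1, 3, requires_grad=True)
desired_reward = torch.ones(1, 1)
loss_fn = nn.MSELoss()

initial_loss = loss_fn(toy_model(future_action), desired_reward).item()
for _ in range(50):
    loss = loss_fn(toy_model(future_action), desired_reward)
    (grad,) = torch.autograd.grad(loss, future_action)
    with torch.no_grad():
        future_action -= 0.1 * grad   # move the action toward the desired reward
final_loss = loss_fn(toy_model(future_action), desired_reward).item()

# The model parameters are bit-for-bit identical after planning.
weights_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(weights_before, toy_model.parameters())
)
print(f'loss {initial_loss:.4f} -> {final_loss:.4f}, weights unchanged: {weights_unchanged}')
```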

## Loading models

In [None]:
sequence_size = history_size + future_size 
model_list = []
for _ in range(ensemble_size):
    model = build_model(state_size,
                        action_size,
                        reward_size,
                        feature_size,
                        sequence_size ,
                        neural_type,
                        num_layers,
                        num_heads,
                        init,
                        opti,
                        loss,
                        bias,
                        drop_rate,
                        alpha)
    model.to(device)
    model_list.append(model)

model_dict = torch.load(model_directory, map_location = device)
for i, model in enumerate(model_list):
    model.load_state_dict(model_dict[f'model_{i}'])

performance_log        = load_performance_from_csv(performance_directory)
last_episode           = performance_log[-1][0] + 1 if len(performance_log) > 0 else 0
greed_epsilon_r        = max(greed_epsilon_r - (greed_epsilon_decay * last_episode), greed_epsilon_min)
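
The last line above rebuilds `greed_epsilon_r` in one step instead of replaying the decay that the training loop applies one subtraction at a time. A small check that the two forms agree, assuming `last_episode` counts the decay steps already applied and the starting value is not below the floor. The numbers are hypothetical, chosen only for illustration, and are not the notebook's configuration.

```python
import math

def epsilon_iterative(start, decay, floor, episodes):
    """Apply the clamped decrement once per episode, as in the training loop."""
    eps = start
    for _ in range(episodes):
        eps = max(eps - decay, floor)
    return eps

def epsilon_closed_form(start, decay, floor, episodes):
    """Recover the same value in one step, as done when reloading."""
    return max(start - decay * episodes, floor)

# Hypothetical values chosen only for illustration (start >= floor is assumed).
start, decay, floor = 1.0, 0.01, 0.1
checks = [
    math.isclose(epsilon_iterative(start, decay, floor, n),
                 epsilon_closed_form(start, decay, floor, n),
                 abs_tol=1e-9)
    for n in (0, 1, 50, 90, 200)
]
print(all(checks))
```

The comparison uses a tolerance because repeated floating-point subtraction and a single multiplication can differ in the last few bits.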

## Putting all the previous work into play ... again

This time, however, the agent does not learn.

In [None]:
# score recorder
total_summed_reward = 0

# starting each episode
for testing_episode in range(episode_for_testing):

    # initializing summed reward
    summed_reward  = 0

    # initializing short term experience replay buffer
    state_list  = []
    action_list = []
    for _ in range(history_size):
        state_list .append(torch.zeros(state_size  ).to(device) - 1)
        action_list.append(torch.zeros(action_size ).to(device) - 1)

    # initializing environment
    env = gym.make(game_name, max_episode_steps = max_steps_for_each_episode,
                   render_mode = "human" if render_for_human else None)
    env = randomizer(env)
    state, info = env.reset(seed = seed)
    if render_for_human == True:
        env.render()

    # observing state
    state = vectorizing_state(state, False, device)
    state_list.append(state)

    # starting each step
    done = False
    truncated = False
    while not done and not truncated:
        
        # initializing and updating action   
        history_state, \
        history_action = retrieve_history(state_list, action_list, history_size, device)
        present_state  = retrieve_present(state_list, device)
        future_action  = initialize_future_action(init_, greed_epsilon_t, greed_epsilon_r, (1, future_size, action_size), device)
        desired_reward = initialize_desired_reward((1, future_size, reward_size), device)
        future_action  = update_future_action(itrtn_for_planning,
                                              model_list,
                                              history_state ,
                                              history_action,
                                              present_state,
                                              future_action,
                                              desired_reward,
                                              beta)
    
        # taking actions and skipping planning
        for i in range(batch_size_for_executing):

            print(f'\rStep: {len(action_list)+1}\r', end='', flush=True)

            # observing action
            action, action_  = vectorizing_action(future_action[:, i:, :], device)
            action_list.append(action)

            # executing action
            state, reward, done, truncated, info = env.step(action_)
            if render_for_human == True:
                env.render()
                
            # summing reward
            summed_reward += reward
            
            # observing state
            state = vectorizing_state(state, done, device)
            state_list.append(state)
            
            # terminating episode if done or truncated
            if done or truncated:
                break
        
    # closing env
    env.close()

    # recording
    print("Summed reward:", summed_reward)
    print(f'Episode: {testing_episode + 1}')
    print('Averaged summed reward:')
    total_summed_reward += summed_reward
    print(total_summed_reward/(testing_episode + 1))

