<a href="https://colab.research.google.com/github/Brownwang0426/Reversal-Generative-Reinforcement-Learning/blob/main-fast/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up (for local)
CUDA Toolkit 11.8 \
cuDNN 8.9.x\
python3.10\
git clone --branch main https://github.com/Brownwang0426/Reversal-Generative-Reinforcement-Learning.git\
pip install -r requirements.txt

# Setting up (for colab)

In [None]:
# before restart
!sudo apt-get install -y python3.10
!git clone --branch main-fast https://github.com/Brownwang0426/Reversal-Generative-Reinforcement-Learning.git
import os
os.chdir('/content/Reversal-Generative-Reinforcement-Learning')
!pip install -r requirements.txt

In [None]:
# restart
from IPython.display import display, Javascript
display(Javascript('google.colab.kernel.restart()'))

In [None]:
# after restart
import os
os.chdir('/content/Reversal-Generative-Reinforcement-Learning')

# Importing modules

In [None]:
import gymnasium as gym
from gymnasium.wrappers import TimeLimit
import minigrid

import numpy as np
import math
from scipy.special import softmax

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.rnn as rnn_utils
from torch.utils.data import DataLoader, TensorDataset, Subset

import csv

import multiprocessing as mp
import os
import sys
import copy
import random
import gc
import time
from tqdm.auto import tqdm
from collections import defaultdict

import itertools

import dill

import warnings
warnings.filterwarnings('ignore')

import concurrent.futures
import hashlib

# Checking cuda

In [None]:
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"Device {i}: {torch.cuda.get_device_name(i)}")
    device_index = 0
    device = torch.device(f"cuda:{device_index}")
    device_ = torch.device("cpu")
    print('using cuda...')
else:
    device = torch.device("cpu")
    device_ = torch.device("cpu")
    print('using cpu...')
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True

# Control board

Crucial configurations regarding how your agent will learn in the environment. The meanings are as follows:
(configs marked ⚠️ are the ones we suggest you must tune to the specific needs of your task)
(configs marked ◀️ are the ones we suggest you play with to see the effect)



## Configs meaning
| Configs   | Type   | Description                                                                 |
|------------|--------|-----------------------------------------------------------------------------|
| ⚠️game_name  | STR| The name of the environment.                                |
| ⚠️max_steps_for_each_episode | +INT | The maximum number of steps the agent will take while the episode is not done. In some environments, it is crucial to increase "max_steps_for_each_episode" so that your agent can "live long enough" to obtain better rewards and gradually, heuristically learn a better strategy.                    |
| ⚠️seed | +INT/None | The seed for environment. None for random environment each episode.                    |
| load_pretrained_model  | BOOLEAN |Whether you want to load a previously trained model.                          |
| ◀️ensemble_size  | +INT | The size of the neural ensemble which the agent is comprised of. The bigger, the better, but the longer training time without parallel training. :-D                  |
| ⚠️validation_size    | +INT | The delayed-learning interval: the agent trains once every validation_size episodes.           |
| ⚠️state_size  | +INT | The size of the state as input data.                    |
| ⚠️action_size   | +INT | The size of action per step as input data.   |
| ⚠️reward_size  | +INT |The size of the reward as output data.                          |
| ⚠️feature_size   | +INT |The size of the hidden layers. **`Shall be bigger than the sum of state_size, action_size and reward_size`**.      |
| ⚠️history_size  | 0/+INT |How many steps in the history for state and action will the agent take into consideration.                           |
| ⚠️future_size  | +INT |The length of the sequence of actions in learning phase. Namely, how many steps in the future the agent will predict or use to discern the present best action.                |
| ⚠️neural_type  | STR |  [**`rnn`**, **`gru`**, **`lstm`**, **`td`**] The type of neural network you prefer. For now, we support rnn, gru, lstm, and td (Transformer decoder only). More to come in the future (or you can build one yourself :-D in the models repository).           |
| ⚠️num_layers  | +INT |The number of layers in rnn, gru, lstm, and td (Transformer decoder only).     |
| ⚠️num_heads  | +INT/None |The number of heads in multi-head attention. **`Shall be a divisor of feature_size`**. **`Shall be None for non-attention neural_type`**.                         |
| init   | STR | [**`random_normal`**, **`random_uniform`**, **`xavier_normal`**, **`xavier_uniform`**, **`glorot_normal`**, **`glorot_uniform`**] The initialization method you prefer for initializing the neural net ensemble of your agent.                          |
| opti   | STR | [**`adam`**, **`sgd`**, **`rmsprop`**]  The optimization method you prefer.             |
| loss  | STR | [**`mean_squared_error`**, **`binary_crossentropy`**] The loss or error function you prefer.                           |
| bias  | BOOLEAN |Whether you want to add bias.                          |
| drop_rate   | 0/+FLOAT |The drop-rate for drop-out.              |
| alpha   | 0/+FLOAT |The learning rate for neural networks weight matrices.                           |
| itrtn_for_learning   | +INT |The iteration for learning per experience.              |
| beta  |  0/+FLOAT |The updating rate for updating actions of the agent.              |
| max_itrtn_for_planning   | +INT |The maximum iteration for planning before scaled down by average reward.              |
| window_size   | +INT | The window size to determine the latest averaged reward in the replay buffer.          |
| itrtn_for_planning (adaptive)  |  +INT |The iteration for updating actions of the agent.                           |
| episode_for_training  | +INT |How many episodes your agent will run in the training mode where your agent will learn offline.              |
| buffer_limit  | +INT |The maximum size for your buffer.              |
| per  | BOOLEAN | Whether to use Prioritized Experience Replay or not.              |
| render_for_human  | BOOLEAN | Whether you want to visualize the process.              |
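
The two constraints stated in the table (feature_size larger than the sum of state_size, action_size and reward_size, and num_heads dividing feature_size for attention models) can be checked before the ensemble is built. The helper below is a minimal sketch and not part of the repository; `check_config` is a hypothetical name, and it assumes that `'td'` is the only attention-based `neural_type`.

```python
def check_config(state_size, action_size, reward_size, feature_size,
                 neural_type, num_heads):
    """Return a list of human-readable problems (empty list means OK).

    Hypothetical helper: it only encodes the constraints stated in the
    configuration table and is not part of the repository.
    """
    problems = []
    needed = state_size + action_size + reward_size
    if feature_size <= needed:
        problems.append(
            f"feature_size={feature_size} should exceed "
            f"state_size + action_size + reward_size = {needed}"
        )
    if neural_type == 'td':
        # Attention model: num_heads must divide feature_size evenly.
        if num_heads is None or feature_size % num_heads != 0:
            problems.append(
                f"num_heads={num_heads} must divide feature_size={feature_size}"
            )
    elif num_heads is not None:
        problems.append("num_heads should be None for non-attention neural_type")
    return problems


# Values taken from the CartPole preset in this notebook.
print(check_config(260, 2, 100, 400, 'td', 10))
```

An empty list means that the preset satisfies both constraints; each string in a non-empty list names one violated constraint.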

## frozen lake

In [None]:
game_name =  'FrozenLake-v1'         #⚠️   gym.make(game_name, max_episode_steps=max_steps_for_each_episode, is_slippery=False, map_name="4x4")
max_steps_for_each_episode = 25      #⚠️
seed = None                          #⚠️
load_pretrained_model = True
ensemble_size = 5                    #◀️
validation_size = 10                 #⚠️
state_size = 36                      #⚠️
action_size = 4                      #⚠️
reward_size = 100                    #⚠️
feature_size = 100                   #⚠️
history_size = 25                    #⚠️
future_size = 25                     #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️

## cartpole

In [None]:
game_name = 'CartPole-v1'            #⚠️
max_steps_for_each_episode = 1000    #⚠️
seed = None                          #⚠️
load_pretrained_model = True
ensemble_size = 5                    #◀️
validation_size = 10                 #⚠️
state_size =  260                    #⚠️
action_size = 2                      #⚠️
reward_size = 100                    #⚠️
feature_size = 400                   #⚠️
history_size = 1000                  #⚠️
future_size = 50                     #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️

## mountain car

In [None]:
game_name =  'MountainCar-v0'        #⚠️
max_steps_for_each_episode = 250     #⚠️
seed = None                          #⚠️
load_pretrained_model = True
ensemble_size = 5                    #◀️
validation_size = 10                 #⚠️
state_size =  160                    #⚠️
action_size = 3                      #⚠️
reward_size = 100                    #⚠️
feature_size = 300                   #⚠️
history_size  = 250                  #⚠️
future_size = 150                    #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️










## acrobot

In [None]:
game_name = 'Acrobot-v1'             #⚠️
max_steps_for_each_episode = 250     #⚠️
seed = None                          #⚠️
load_pretrained_model = True
ensemble_size = 5                    #◀️
validation_size = 10                 #⚠️
state_size =  360                    #⚠️
action_size = 3                      #⚠️
reward_size = 100                    #⚠️
feature_size = 500                   #⚠️
history_size  = 250                  #⚠️
future_size = 150                    #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️








## lunar lander

In [None]:
game_name = "LunarLander-v3"         #⚠️
max_steps_for_each_episode = 200     #⚠️
seed = None                          #⚠️
load_pretrained_model = True
ensemble_size = 10                   #◀️
validation_size = 10                 #⚠️
state_size =  460                    #⚠️
action_size = 4                      #⚠️
reward_size = 100                    #⚠️
feature_size = 500                   #⚠️
history_size = 200                   #⚠️
future_size = 150                    #⚠️ 
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️

## door key

In [None]:
game_name = "MiniGrid-DoorKey-5x5-v0"#⚠️
max_steps_for_each_episode = 25      #⚠️
seed = 1                             #⚠️
load_pretrained_model = True
ensemble_size = 5                    #◀️
validation_size = 10                 #⚠️
state_size =  267                    #⚠️
action_size = 6                      #⚠️
reward_size = 100                    #⚠️
feature_size = 400                   #⚠️
history_size  = 25                   #⚠️
future_size = 25                     #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️




## your present config

In [None]:
game_name = 'CartPole-v1'            #⚠️
max_steps_for_each_episode = 1000    #⚠️
seed = None                          #⚠️
load_pretrained_model = True
ensemble_size = 5                    #◀️
validation_size = 10                 #⚠️
state_size =  260                    #⚠️
action_size = 2                      #⚠️
reward_size = 100                    #⚠️
feature_size = 400                   #⚠️
history_size = 1000                  #⚠️
future_size = 50                     #⚠️
neural_type = 'td'                   #⚠️
num_layers = 3                       #⚠️
num_heads = 10                       #⚠️

In [None]:
init = "xavier_normal"
opti = 'sgd'
loss = 'mean_squared_error'
bias = False
drop_rate = 0.0
alpha = 0.1                  
itrtn_for_learning = 1500
beta = 0.1     
max_itrtn_for_planning = 50         
window_size = 50
episode_for_training = 100000   
buffer_limit = 50000   
per = False
render_for_human = False


In [None]:

suffix                 = f"game_{game_name}-type_{neural_type}-ensemble_{ensemble_size:05d}-learn_{itrtn_for_learning:05d}"
directory              = f'./result/{game_name}/'
performance_directory  = f'./result/{game_name}/performace-{suffix}.csv'
model_directory        = f'./result/{game_name}/model-{suffix}.pth'
buffer_directory       = f'./result/{game_name}/buffer-{suffix}.dill'
os.makedirs(directory, exist_ok=True)

# Importing local modules

In [None]:
game_modules = {
    'FrozenLake-v1': 'envs.env_frozenlake',
    'CartPole-v1': 'envs.env_cartpole',
    'MountainCar-v0': 'envs.env_mountaincar',
    'Acrobot-v1': 'envs.env_acrobot',
    'LunarLander-v3': 'envs.env_lunarlander',
    'MiniGrid-DoorKey-5x5-v0': 'envs.env_doorkey'
}
if game_name in game_modules:
    game_module = __import__(game_modules[game_name], fromlist=['vectorizing_state', 'vectorizing_action', 'vectorizing_reward', 'averaging_reward', 'randomizer'])
    vectorizing_state   = game_module.vectorizing_state
    vectorizing_action  = game_module.vectorizing_action
    vectorizing_reward  = game_module.vectorizing_reward
    averaging_reward    = game_module.averaging_reward
    randomizer          = game_module.randomizer
else:
    raise RuntimeError('Missing env functions')


In [None]:
model_modules = {
    'td': 'models.model_td',
    'rnn': 'models.model_rnn',
    'gru': 'models.model_rnn',
    'lstm': 'models.model_rnn'
}
if neural_type in model_modules:
    model_module = __import__(model_modules[neural_type], fromlist=['build_model'])
    build_model  = model_module.build_model
else:
    raise RuntimeError('Missing model functions')

from utils.util_func  import load_performance_from_csv,\
                             load_buffer_from_pickle,\
                             retrieve_history,\
                             retrieve_present,\
                             initialize_future_action, \
                             initialize_desired_reward,\
                             update_future_action, \
                             sequentialize, \
                             update_long_term_experience_replay_buffer,\
                             update_model_list,\
                             limit_buffer,\
                             save_performance_to_csv,\
                             save_buffer_to_pickle
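
The cells above select the environment and model modules by name with `__import__`. The same registry lookup can also be written with `importlib.import_module`, which returns the leaf module directly. The sketch below is only an illustration: `load_functions` and `demo_registry` are hypothetical names, and standard-library modules stand in for `envs.env_cartpole` and friends so that it runs anywhere.

```python
import importlib


def load_functions(registry, key, names):
    """Import registry[key] and return the requested attributes as a dict."""
    if key not in registry:
        raise RuntimeError(f"No module registered for {key!r}")
    module = importlib.import_module(registry[key])
    return {name: getattr(module, name) for name in names}


# Stand-in registry: standard-library modules replace the repository's
# 'envs.*' and 'models.*' modules for the purpose of this demo.
demo_registry = {'demo-json': 'json', 'demo-path': 'os.path'}

funcs = load_functions(demo_registry, 'demo-path', ['join', 'basename'])
print(funcs['basename'](funcs['join']('result', 'model.pth')))  # prints model.pth
```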


# planning -> Learning
Training mode where your agent will learn offline. Here you can see how your agent learns over time and improves its performance.
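
In the planning phase, as the `beta` entry in the control board describes, it is the agent's actions that get updated rather than the network weights. One plausible reading is gradient descent on the action input so that the predicted reward approaches the desired reward while the weights stay frozen. The toy below illustrates that idea with a fixed linear "model" and a hand-written gradient. `predicted_reward`, `plan_actions` and all numbers are illustrative assumptions; the notebook's `update_future_action` works on tensors and an ensemble and may differ in detail.

```python
def predicted_reward(weights, actions):
    """Toy frozen model: reward is a weighted sum of the action vector."""
    return sum(w * a for w, a in zip(weights, actions))


def plan_actions(weights, actions, desired_reward, beta, iterations):
    """Gradient descent on the ACTIONS (not the weights) to reduce
    the squared error (predicted_reward - desired_reward) ** 2."""
    actions = list(actions)
    for _ in range(iterations):
        error = predicted_reward(weights, actions) - desired_reward
        # d/da_i of error**2 is 2 * error * w_i; the weights stay untouched.
        actions = [a - beta * 2.0 * error * w for a, w in zip(actions, weights)]
    return actions


weights = [0.5, -0.25, 1.0]   # frozen "model" parameters (made up)
start = [0.0, 0.0, 0.0]       # initial action guess
planned = plan_actions(weights, start, desired_reward=1.0, beta=0.1, iterations=100)
print(planned, predicted_reward(weights, planned))
```

Here `beta` plays the role of a step size on the actions, in the same way that `alpha` is the step size on the weights during learning.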

## Creating or loading models

In [None]:

# creating empty log for recording performance
performance_log  = []

# setting the last episode number for performance log
last_episode = 0

# creating model list
model_list = []
for _ in range(ensemble_size):
    model = build_model(state_size,
                        action_size,
                        reward_size,
                        feature_size,
                        history_size,
                        future_size, 
                        neural_type,
                        num_layers,
                        num_heads,
                        init,
                        opti,
                        loss,
                        bias,
                        drop_rate,
                        alpha)
    model.to(device)
    model_list.append(model)

# creating space for storing tensors as experience replay buffer
history_state_stack    = torch.empty(0).to(device_, non_blocking=True)
present_state_stack    = torch.empty(0).to(device_, non_blocking=True)
future_action_stack    = torch.empty(0).to(device_, non_blocking=True)
future_reward_stack    = torch.empty(0).to(device_, non_blocking=True)
history_state_hash_set = set()
present_state_hash_set = set()
future_action_hash_set = set()
future_reward_hash_set = set()

# load from pre-trained models if needed
if load_pretrained_model == True:
    try:
        model_dict = torch.load(model_directory, map_location=device)
        for i, model in enumerate(model_list):
            model.load_state_dict(model_dict[f'model_{i}'])
        history_state_stack, \
        present_state_stack, \
        future_action_stack, \
        future_reward_stack, \
        history_state_hash_set, \
        present_state_hash_set, \
        future_action_hash_set, \
        future_reward_hash_set = load_buffer_from_pickle(buffer_directory)
        history_state_stack    = history_state_stack.to (device_) 
        present_state_stack    = present_state_stack.to (device_) 
        future_action_stack    = future_action_stack.to (device_) 
        future_reward_stack    = future_reward_stack.to (device_) 
        performance_log        = load_performance_from_csv(performance_directory)
        last_episode           = performance_log[-1][0] if len(performance_log) > 0 else 0
        print('Loaded pre-trained models.')
    except Exception as e:
        print(f'Failed loading pre-trained models ({e}). Now using new models.')

# deriving the adaptive planning iterations from the latest averaged reward
if len(performance_log) > 0:
    itrtn_for_planning = averaging_reward([entry[1] for entry in performance_log], max_itrtn_for_planning, window_size)
else:
    itrtn_for_planning = 0
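
The cell above derives `itrtn_for_planning` from the performance log through `averaging_reward`, whose implementation lives in the environment modules and is not shown in this notebook. As a rough picture of what "scaled by the latest averaged reward" could mean, the sketch below averages the last `window_size` rewards, normalises the result into [0, 1], and multiplies it by the maximum. The function name, the reward bounds and the linear scaling are all assumptions, not the repository's code.

```python
def planning_iterations(rewards, max_iterations, window_size,
                        reward_low, reward_high):
    """Scale max_iterations by the mean of the last `window_size` rewards.

    reward_low / reward_high are assumed bounds used to normalise the
    mean into [0, 1]; they are not parameters of the notebook.
    """
    if not rewards:
        return 0
    recent = rewards[-window_size:]
    mean = sum(recent) / len(recent)
    ratio = (mean - reward_low) / (reward_high - reward_low)
    ratio = min(1.0, max(0.0, ratio))  # clamp to [0, 1]
    return int(round(max_iterations * ratio))


log = [20.0, 40.0, 60.0, 80.0]
print(planning_iterations(log, max_iterations=50, window_size=2,
                          reward_low=0.0, reward_high=100.0))  # prints 35
```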



## Putting all the previous works into play
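
After each episode, the loop in this section cuts the short-term buffer (`state_list`, `action_list`, `reward_list`) into fixed-length training samples through `sequentialize`. A sliding window is one way to picture that step. The sketch below uses plain Python lists and a hypothetical `sliding_windows` function; the repository's `sequentialize` works on tensors and may pad or stack differently.

```python
def sliding_windows(states, actions, rewards, history_size, future_size):
    """Cut one episode into (history, present, future_actions, future_rewards).

    Assumes the layout built in this notebook's loop: `actions` and `rewards`
    start with `history_size` padding entries, and `states` holds the same
    padding plus one extra entry for the initial observation.
    """
    samples = []
    last_start = len(actions) - future_size
    for i in range(history_size, last_start + 1):
        history = list(zip(states[i - history_size:i],
                           actions[i - history_size:i]))
        present = states[i]
        future_actions = actions[i:i + future_size]
        future_rewards = rewards[i:i + future_size]
        samples.append((history, present, future_actions, future_rewards))
    return samples


# Toy episode: 2 padding entries (-1) followed by 4 real steps.
states = [-1, -1, 's0', 's1', 's2', 's3', 's4']
actions = [-1, -1, 'a0', 'a1', 'a2', 'a3']
rewards = [-1, -1, 'r0', 'r1', 'r2', 'r3']
samples = sliding_windows(states, actions, rewards, history_size=2, future_size=3)
for sample in samples:
    print(sample)
```

Every sample has the same shape regardless of where it sits in the episode, which is why the short-term buffer starts with `history_size` padding entries and why the loop keeps stepping for a while after the environment is done.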

In [None]:





# starting each episode
for training_episode in tqdm(range(episode_for_training)):
    current_episode  = training_episode + last_episode + 1

    # initializing summed reward
    summed_reward  = 0

    # initializing short term experience replay buffer
    state_list  = []
    action_list = []
    reward_list = []
    for _ in range(history_size):
        state_list .append(torch.zeros(state_size  ).to(device_, non_blocking=True) - 1 )
        action_list.append(torch.zeros(action_size ).to(device_, non_blocking=True) - 1 )
        reward_list.append(torch.zeros(reward_size ).to(device_, non_blocking=True) - 1 )

    # initializing environment
    if game_name == 'FrozenLake-v1'  :
        env        = gym.make(game_name, max_episode_steps=max_steps_for_each_episode, is_slippery=False, map_name="4x4", render_mode = "human" if render_for_human else None)
    else:
        env        = gym.make(game_name, max_episode_steps=max_steps_for_each_episode, render_mode = "human" if render_for_human else None)
    state, info    = env.reset(seed = seed)
    if render_for_human == True:
        env.render()

    # observing state
    state          = vectorizing_state(state, False, False, device_)
    state_list.append(state)

    # starting each step
    post_done_truncated_counter = 0
    post_done_truncated_steps = future_size
    done_truncated_flag = False
    total_step = 0
    while not done_truncated_flag:

        """
        We let the agent take some history states into consideration.
        """
        """
        The final desired reward is in fact the last time step of the desired reward.
        """
        # initializing and updating action by desired reward
        history_state   = retrieve_history(state_list, action_list, history_size, device_)
        present_state   = retrieve_present(state_list, device_)
        future_action   = initialize_future_action ((1, future_size, action_size), device_)
        desired_reward  = initialize_desired_reward((1, future_size, reward_size), device_)
        future_action   = update_future_action(1 + itrtn_for_planning ,
                                               model_list,
                                               history_state,
                                               present_state,
                                               future_action,
                                               desired_reward,
                                               beta)

        # observing action
        action, action_  = vectorizing_action(future_action, device_)
        action_list.append(action)

        # executing action
        state, reward, done, truncated, info = env.step(action_)
        if (render_for_human == True) and (post_done_truncated_counter == 0):
            env.render()

        # summing reward
        if post_done_truncated_counter > 0:
            reward = 0
        summed_reward += reward

        # observing actual reward
        reward = vectorizing_reward(state, done, truncated, reward, summed_reward, reward_size, device_)
        reward_list.append(reward)

        # observing state
        state = vectorizing_state(state, done, truncated, device_)
        state_list.append(state)

        """
        We extend the episode past termination: once the environment is done or truncated, the loop keeps stepping until post_done_truncated_counter reaches post_done_truncated_steps (set to future_size above).
        Though contrary to common practice in RL, this makes it easier to sequentialize the short-term experience replay buffer with a fixed window length.
        It also lets the agent plan ahead even after the episode is done.
        We pass a done flag to the state to indicate that the environment is done, so that the agent is not confused.
        The done flag should affect the state considerably, to remind the agent that the environment is done.
        """
        # if done then continue for a short period. Then store experience to short term experience replay buffer
        if done or truncated:
            post_done_truncated_counter += 1
            if post_done_truncated_counter >= post_done_truncated_steps:
                done_truncated_flag = True
                break
        else:
            total_step += 1
            print(f'\rStep: {total_step}\r', end='', flush=True)

    # closing env
    env.close()




    # recording performance
    print(f'Episode {current_episode}: Summed_Reward = {summed_reward}')
    performance_log.append([current_episode, summed_reward])




    # sequentializing short term experience replay buffer
    history_state_list   ,\
    present_state_list   ,\
    future_action_list   ,\
    future_reward_list    = sequentialize(state_list  ,
                                          action_list ,
                                          reward_list ,
                                          history_size,
                                          future_size)




    """
    We dropped duplicated experiences in the buffer.
    """
    # storing sequentialized short term experience to long term experience replay buffer
    history_state_stack, \
    present_state_stack, \
    future_action_stack, \
    future_reward_stack, \
    history_state_hash_set  , \
    present_state_hash_set  , \
    future_action_hash_set  , \
    future_reward_hash_set     = update_long_term_experience_replay_buffer(history_state_stack,
                                                                           present_state_stack,
                                                                           future_action_stack,
                                                                           future_reward_stack,
                                                                           history_state_hash_set  ,
                                                                           present_state_hash_set  ,
                                                                           future_action_hash_set  ,
                                                                           future_reward_hash_set  ,
                                                                           history_state_list   ,
                                                                           present_state_list,
                                                                           future_action_list,
                                                                           future_reward_list)




    # training
    if current_episode % validation_size == 0:
        dataset     = TensorDataset    (history_state_stack,
                                        present_state_stack,
                                        future_action_stack,
                                        future_reward_stack)
        model_list  = update_model_list(itrtn_for_learning ,
                                        dataset,
                                        model_list,
                                        per
                                        )




        """
        We limit the buffer size to save memory.
        """
        # limit_buffer
        history_state_stack, \
        present_state_stack, \
        future_action_stack, \
        future_reward_stack, \
        history_state_hash_set  , \
        present_state_hash_set  , \
        future_action_hash_set  , \
        future_reward_hash_set  = limit_buffer(history_state_stack,
                                               present_state_stack,
                                               future_action_stack,
                                               future_reward_stack,
                                               history_state_hash_set  ,
                                               present_state_hash_set  ,
                                               future_action_hash_set  ,
                                               future_reward_hash_set  ,
                                               buffer_limit  )




        # saving nn models
        model_dict = {}
        for i, model in enumerate(model_list):
            model_dict[f'model_{i}'] = model.state_dict()
        torch.save(model_dict, model_directory)

        # saving long term experience replay buffer
        save_buffer_to_pickle(buffer_directory,
                              history_state_stack,
                              present_state_stack,
                              future_action_stack,
                              future_reward_stack,
                              history_state_hash_set,
                              present_state_hash_set,
                              future_action_hash_set,
                              future_reward_hash_set)

        # saving final reward to log
        save_performance_to_csv(performance_log, performance_directory)

        # deriving the adaptive planning iterations from the latest averaged reward
        itrtn_for_planning = averaging_reward([entry[1] for entry in performance_log], max_itrtn_for_planning, window_size)

        # clear up
        gc.collect()
        torch.cuda.empty_cache()
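
The loop above drops duplicated experiences before they reach the long-term buffer, using the hash sets created earlier. The sketch below shows the general idea with a single set and tuples of floats. `add_unique` is a hypothetical name; the notebook's `update_long_term_experience_replay_buffer` keeps four hash sets and works on tensors, so treat this as an illustration only.

```python
import hashlib


def add_unique(buffer, seen_hashes, sample):
    """Append `sample` (a tuple of floats) unless an identical one is stored."""
    digest = hashlib.sha256(repr(sample).encode('utf-8')).hexdigest()
    if digest in seen_hashes:
        return False  # duplicate: leave the buffer untouched
    seen_hashes.add(digest)
    buffer.append(sample)
    return True


buffer, seen = [], set()
for sample in [(0.0, 1.0), (0.5, 0.5), (0.0, 1.0)]:
    add_unique(buffer, seen, sample)
print(len(buffer))  # prints 2
```

Storing fixed-length digests instead of the samples themselves keeps the membership test cheap even when each experience is a long sequence.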


