# Reinforcement Learning Tutorial

**This tutorial was tested with version `0.0.1-beta4` of NeuroTorch.**

<table class="nt-notebook-buttons" align="center">
  <td>
    <a target="_blank" href="https://NeuroTorch.github.io/NeuroTorch/"><img src="https://github.com/NeuroTorch/NeuroTorch/blob/main/images/neurotorch_logo_32px.png?raw=true" width=32px height=32px  />Documentation</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/NeuroTorch/NeuroTorch/blob/main/tutorials/reinforcement_learning/tutorial.ipynb"><img src="https://github.com/NeuroTorch/NeuroTorch/blob/main/images/colab_logo_32px.png?raw=true" width=32px height=32px  />Run in Google Colab</a>
</td>
  <td>
    <a target="_blank" href="https://github.com/NeuroTorch/NeuroTorch/blob/main/tutorials/reinforcement_learning/tutorial.ipynb"><img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width=32px height=32px />View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/NeuroTorch/NeuroTorch/blob/main/tutorials/reinforcement_learning/tutorial.ipynb"><img src="https://github.com/NeuroTorch/NeuroTorch/blob/main/images/download_logo_32px.png?raw=true" width=32px height=32px />Download notebook</a>
  </td>
</table>

In this tutorial, we will learn how to use NeuroTorch to train an agent in a [gym](https://www.gymlibrary.dev/content/basic_usage/) environment.

## Setup

Install the dependencies by running the following commands:

In [1]:
%%capture
#@title Install dependencies {display-mode: "form"}

RunningInCOLAB = 'google.colab' in str(get_ipython()) if hasattr(__builtins__,'__IPYTHON__') else False

!pip install git+https://github.com/NeuroTorch/NeuroTorch.git@119-rl-ppo
!pip install pythonbasictools
!pip install gym[box2d]==0.26.2

Collecting git+https://github.com/NeuroTorch/NeuroTorch.git@119-rl-ppo
  Cloning https://github.com/NeuroTorch/NeuroTorch.git (to revision 119-rl-ppo) to c:\users\gince\appdata\local\temp\pip-req-build-nl05nrjs
  Resolved https://github.com/NeuroTorch/NeuroTorch.git to commit fc67d4af63e921165a135232a2db507cff88f95b
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'


  Running command git clone --filter=blob:none -q https://github.com/NeuroTorch/NeuroTorch.git 'C:\Users\gince\AppData\Local\Temp\pip-req-build-nl05nrjs'
  Running command git checkout -b 119-rl-ppo --track origin/119-rl-ppo
  branch '119-rl-ppo' set up to track 'origin/119-rl-ppo'.
  Switched to a new branch '119-rl-ppo'
You should consider upgrading via the 'C:\github\NeuroTorch\tutorials\reinforcement_learning\rl_venv\Scripts\python.exe -m pip install --upgrade pip' command.






If you have a CUDA device and want to use it for this tutorial (recommended), uninstall PyTorch with `pip uninstall torch` and reinstall it with the right CUDA version, using a command generated on the [PyTorch Get Started](https://pytorch.org/get-started/locally/) page.

After setting up the virtual environment, we will need to import the necessary packages.

In [2]:
import gym
import numpy as np
import torch.nn

from pythonbasictools.device import log_device_setup, DeepLib
from pythonbasictools.logging import logs_file_setup

import neurotorch as nt
from neurotorch.callbacks.early_stopping import EarlyStoppingThreshold
from neurotorch.rl.agent import Agent
from neurotorch.rl.rl_academy import RLAcademy
from neurotorch.rl.utils import TrajectoryRenderer, space_to_continuous_shape

import matplotlib.pyplot as plt
%matplotlib inline
%matplotlib notebook

In [3]:
logs_file_setup("rl_tutorial", add_stdout=False)
log_device_setup(deepLib=DeepLib.Pytorch)
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.8)

INFO:root:Logs file at: .//logs/logs-06-03-2023/rl_tutorial-16781266451481016.log

INFO:root:__Python VERSION: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
INFO:root:Number of available cores: 8.
INFO:root:Number of available logical processors: 16.
INFO:root:__pyTorch VERSION:1.13.0+cu117
INFO:root:__CUDA VERSION:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:59:34_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

INFO:root:__nvidia-smi: Not Found
INFO:root:__CUDNN VERSION:8500
INFO:root:__Number CUDA Devices:1
INFO:root:
-------------------------
DEVICE: cuda
-------------------------

INFO:root:NVIDIA GeForce RTX 3070 Laptop GPU
INFO:root:Memory Usage:
INFO:root:Allocated: 0.0 GB
INFO:root:Cached:   0.0 GB
INFO:root:Memory summary: 
|                  PyTorch CUDA memory summary, device ID 0                 |
|--------------

## Initialization

In [4]:
# env parameters
env_id = "LunarLander-v2"
continuous_action = True

# Network parameters
use_spiking_policy = True  # Type of the policy
n_hidden_units = 128
n_critic_hidden_units = 128
n_encoder_steps = 8

# Trainer parameters
n_iterations = 100
n_epochs = 30
n_new_trajectories = 1
last_k_rewards = 10

In [5]:
env = gym.make(env_id, render_mode="rgb_array", continuous=continuous_action)
continuous_obs_shape = space_to_continuous_shape(getattr(env, "single_observation_space", env.observation_space))
continuous_action_shape = space_to_continuous_shape(getattr(env, "single_action_space", env.action_space))

Here, we initialize a trainer callback that saves the network during training.

In [6]:
if use_spiking_policy:
    checkpoint_folder = f"data/tr_data/checkpoints_{env_id}_snn-policy"
else:
    checkpoint_folder = f"data/tr_data/checkpoints_{env_id}_classical-policy"
checkpoint_manager = nt.CheckpointManager(
    checkpoint_folder=checkpoint_folder,
    save_freq=int(0.1*n_iterations),
    metric=RLAcademy.CUM_REWARDS_METRIC_KEY,
    minimise_metric=False,
    save_best_only=True,
)
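
To make the `save_freq`, `minimise_metric` and `save_best_only` settings more concrete, here is a minimal pure-Python sketch of this kind of bookkeeping. It is not NeuroTorch's implementation: the `should_save` helper and the reward values are hypothetical, and we simply assume that a checkpoint is considered every `save_freq` iterations and kept only when the metric improves.

```python
def should_save(iteration, metric_value, best_so_far, save_freq, minimise_metric=False):
    """Hypothetical helper: decide whether to write a checkpoint at this iteration."""
    # Only consider saving every `save_freq` iterations.
    if iteration % save_freq != 0:
        return False
    # The first candidate is always kept.
    if best_so_far is None:
        return True
    # Otherwise keep the checkpoint only if the metric improved.
    if minimise_metric:
        return metric_value < best_so_far
    return metric_value > best_so_far


n_iterations = 100
save_freq = int(0.1 * n_iterations)  # same expression as in the cell above

# Made-up cumulative rewards, keyed by iteration number.
rewards = {10: -150.0, 20: -80.0, 30: -120.0, 40: 35.0}

best = None
saved_at = []
for itr, reward in rewards.items():
    if should_save(itr, reward, best, save_freq):
        best = reward
        saved_at.append(itr)

print(saved_at)  # [10, 20, 40]
```

Iteration 30 is skipped because its reward is lower than the best one seen so far.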

Here, we initialize the learning algorithm used to train the agent. For now, it is the popular [Proximal Policy Optimisation](https://arxiv.org/pdf/1707.06347.pdf) (PPO) algorithm from OpenAI.

In [7]:
ppo_la = nt.rl.PPO(
    critic_criterion=torch.nn.SmoothL1Loss(),
)
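
As a rough illustration of the idea behind PPO, the sketch below computes the clipped surrogate objective from the paper for a single sample, in plain Python. It is not the code used by `nt.rl.PPO`; the function name and the `clip_eps=0.2` default are assumptions chosen for the example.

```python
import math


def ppo_clip_objective(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one (state, action) sample."""
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = math.exp(new_log_prob - old_log_prob)
    # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the two estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


# The ratio is about 1.5, so with a positive advantage the clipped term (1.2 * 2.0) is used.
objective = ppo_clip_objective(math.log(1.5), 0.0, advantage=2.0)
print(round(objective, 6))
```

With a positive advantage, the objective stops growing once the ratio exceeds `1 + clip_eps`, which limits how far a single update can move the policy.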

It is now time to define our policy. In short, the policy is the model that chooses the actions taken in the environment. The critic is the model that estimates the rewards-to-go of the states the agent encounters.

In [8]:
if use_spiking_policy:
    policy = nt.SequentialRNN(
        input_transform=[
            nt.transforms.ConstantValuesTransform(n_steps=n_encoder_steps)
        ],
        layers=[
            nt.SpyLIFLayerLPF(
                continuous_obs_shape[0], n_hidden_units, use_recurrent_connection=False
            ),
            nt.SpyLILayer(n_hidden_units, continuous_action_shape[0]),
        ],
        output_transform=[
            (
                nt.transforms.ReduceFuncTanh(nt.transforms.ReduceMean(dim=1))
                if continuous_action else
                nt.transforms.ReduceMax(dim=1)
            )
        ],
    ).build()
else:
    policy = nt.Sequential(
        layers=[
            torch.nn.Linear(continuous_obs_shape[0], n_hidden_units),
            torch.nn.Dropout(0.1),
            torch.nn.PReLU(),
            torch.nn.Linear(n_hidden_units, n_hidden_units),
            torch.nn.Dropout(0.1),
            torch.nn.PReLU(),
            torch.nn.Linear(n_hidden_units, continuous_action_shape[0]),
            (torch.nn.Tanh() if continuous_action else torch.nn.Identity())
        ]
    ).build()

AttributeError: module 'neurotorch.transforms' has no attribute 'ReduceFuncTanh'
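
The spiking policy above wraps its layers between an input transform and an output transform. The plain-Python sketch below illustrates what we assume these two steps are meant to do: present the same observation at each of `n_encoder_steps` time steps, then average the outputs over time and squash them with `tanh` for continuous actions. The helper names are hypothetical and this is not NeuroTorch code.

```python
import math


def constant_encode(observation, n_steps):
    """Repeat the same observation vector at every time step."""
    return [list(observation) for _ in range(n_steps)]


def mean_over_time_then_tanh(outputs_over_time):
    """Average each output unit over the time axis, then squash it into (-1, 1)."""
    n_steps = len(outputs_over_time)
    n_units = len(outputs_over_time[0])
    means = [
        sum(step[i] for step in outputs_over_time) / n_steps
        for i in range(n_units)
    ]
    return [math.tanh(m) for m in means]


obs = [0.1, -0.4]  # made-up observation
encoded = constant_encode(obs, n_steps=8)
print(len(encoded), len(encoded[0]))  # 8 2

# Pretend the network outputs are the inputs themselves, just to show the reduction.
action = mean_over_time_then_tanh(encoded)
print(all(-1.0 < a < 1.0 for a in action))  # True
```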

Next, we define the agent from the policy and the critic.

In [None]:
agent = Agent(
    env=env,
    behavior_name=env_id,
    policy=policy,
    critic=nt.Sequential(
        layers=[
            torch.nn.Linear(continuous_obs_shape[0], n_critic_hidden_units),
            torch.nn.Dropout(0.1),
            torch.nn.PReLU(),
            torch.nn.Linear(n_critic_hidden_units, n_critic_hidden_units),
            torch.nn.Dropout(0.1),
            torch.nn.PReLU(),
            torch.nn.Linear(n_critic_hidden_units, 1),
        ]
    ).build(),
    checkpoint_folder=checkpoint_manager.checkpoint_folder,
)
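
The critic is trained to predict rewards-to-go. As a minimal sketch of that quantity, the function below computes discounted rewards-to-go for one trajectory in plain Python. The discount factor `gamma` and the function name are assumptions made for the example; the tutorial does not state the value NeuroTorch uses.

```python
def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards from each time step to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the value computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```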

We create an early stopping callback that stops the training when the mean of the last k cumulative rewards is greater than or equal to 230 (the environment is considered solved at 200).

In [None]:
early_stopping = EarlyStoppingThreshold(
    metric=f"mean_last_{last_k_rewards}_rewards",
    threshold=230.0,
    minimize_metric=False,
)
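
The stopping rule can be sketched in a few lines of plain Python. This is only an illustration of a "mean of the last k rewards" threshold, not the `EarlyStoppingThreshold` implementation. The function name and the reward values are hypothetical, and we assume the check only starts once k rewards are available.

```python
from collections import deque


def first_stop_index(rewards, k, threshold):
    """Index of the first reward at which the mean of the last k rewards reaches the threshold."""
    window = deque(maxlen=k)  # keeps only the k most recent rewards
    for i, reward in enumerate(rewards):
        window.append(reward)
        if len(window) == k and sum(window) / k >= threshold:
            return i
    return None  # the threshold was never reached


# Made-up cumulative rewards from successive iterations.
rewards = [100.0, 200.0, 240.0, 250.0, 260.0]
print(first_stop_index(rewards, k=3, threshold=230.0))  # 3
```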

Here is the `RLAcademy`, a special type of trainer used to train the agent in a reinforcement learning pipeline.

In [None]:
academy = RLAcademy(
    agent=agent,
    callbacks=[checkpoint_manager, ppo_la, early_stopping],
)
print(f"Academy:\n{academy}")

## Training time!

In the next cell, we start the actual training with the following parameters:

- `n_iterations`: The number of times the trainer generates trajectories and performs an optimisation pass.
- `n_epochs`: The number of times the trainer passes through the buffer of episodes during an optimisation pass.
- `n_batches`: The number of batches to process at each epoch.
- `n_new_trajectories`: The number of new trajectories to generate at each iteration.
- `batch_size`: The number of episodes in a single batch.
- `buffer_size`: The size of the buffer.
- `clear_buffer`: Whether to clear the buffer before each iteration.
- `last_k_rewards`: The number k of previous rewards used in the metrics.

In [None]:
history = academy.train(
    env,
    n_iterations=n_iterations,
    n_epochs=n_epochs,
    n_batches=-1,
    n_new_trajectories=n_new_trajectories,
    batch_size=4096,
    buffer_size=np.inf,
    clear_buffer=True,
    randomize_buffer=True,
    load_checkpoint_mode=nt.LoadCheckpointMode.LAST_ITR,
    force_overwrite=False,
    verbose=True,
    render=False,
    last_k_rewards=last_k_rewards,
)
if not getattr(env, "closed", False):
    env.close()

In [None]:
history.plot(show=True)

## Test Phase

In the next cell, we generate new trajectories with the trained agent to see how it performs.

In [None]:
agent.load_checkpoint(
    checkpoints_meta_path=checkpoint_manager.checkpoints_meta_path,
    load_checkpoint_mode=nt.LoadCheckpointMode.BEST_ITR
)
env = gym.make(env_id, render_mode="rgb_array", continuous=continuous_action)
agent.eval()
gen_trajectories_out = academy.generate_trajectories(
    n_trajectories=10, epsilon=0.0, verbose=True, env=env, render=True, re_trajectories=True,
)
best_trajectory_idx = np.argmax([t.cumulative_reward for t in gen_trajectories_out.trajectories])
trajectory_renderer = TrajectoryRenderer(trajectory=gen_trajectories_out.trajectories[best_trajectory_idx], env=env)

cumulative_rewards = gen_trajectories_out.cumulative_rewards
print(f"Buffer: {gen_trajectories_out.buffer}")
print(f"Cumulative rewards: {np.nanmean(cumulative_rewards):.3f} +/- {np.nanstd(cumulative_rewards):.3f}")
best_cum_reward_fmt = f"{cumulative_rewards[best_trajectory_idx]:.3f}"
print(f"Best trajectory: {best_trajectory_idx}, cumulative reward: {best_cum_reward_fmt}")
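
For readers who want to see what the summary above computes, here is a plain-Python sketch of the same statistics on made-up rewards. The `summarise` helper is hypothetical. It uses the population standard deviation, which matches NumPy's default `ddof=0`, and, unlike `np.argmax`, it skips NaN entries when picking the best trajectory.

```python
import math
import statistics


def summarise(cumulative_rewards):
    """Return (mean, std, best_index), ignoring NaN rewards."""
    valid = [r for r in cumulative_rewards if not math.isnan(r)]
    mean = statistics.fmean(valid)
    std = statistics.pstdev(valid)  # population standard deviation (ddof=0)
    # Treat NaN as minus infinity so it can never be selected as the best.
    best_index = max(
        range(len(cumulative_rewards)),
        key=lambda i: -math.inf if math.isnan(cumulative_rewards[i]) else cumulative_rewards[i],
    )
    return mean, std, best_index


mean, std, best_index = summarise([200.0, float("nan"), 260.0, 230.0])
print(f"Cumulative rewards: {mean:.3f} +/- {std:.3f}, best trajectory: {best_index}")
```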

## Visualize the best trajectory and save it

In [None]:
trajectory_renderer.render(
    filename=(
        f"{agent.checkpoint_folder}/figures/trajectory_{best_trajectory_idx}-"
        f"CR{best_cum_reward_fmt.replace('.', '_')}"
    ),
    file_extension="gif",
)