# Reinforcement Learning Tutorial

**This tutorial was tested with version `0.0.1-beta0` of NeuroTorch.**

### **Warning**

The reinforcement learning pipeline is still under development and has several known issues. If you switch to an environment with continuous actions, you may see the actions turn into NaN values. With discrete actions, the PPO algorithm does not converge to good cumulative rewards on every run, and the test cumulative rewards do not seem to match the training ones. If you think you know what causes either problem, please contact us. We apologize for the inconvenience and thank you for your patience.

In this tutorial, we will learn how to use NeuroTorch to train an agent in a [gym](https://www.gymlibrary.dev/content/basic_usage/) environment.

## Setup

You can now install the dependencies by running the following command:

In [None]:
!pip install -r rl_requirements.txt

If you have a CUDA device and want to use it for this tutorial (which is recommended), you can uninstall PyTorch with `pip uninstall torch` and reinstall it with the right CUDA version, using a command generated on the [PyTorch GetStarted](https://pytorch.org/get-started/locally/) web page.

After setting up the virtual environment, we will need to import the necessary packages.

In [None]:
import gym
import numpy as np
import torch.nn

from pythonbasictools.device import log_device_setup, DeepLib
from pythonbasictools.logging import logs_file_setup

import neurotorch as nt
from neurotorch.rl.agent import Agent
from neurotorch.rl.rl_academy import RLAcademy
from neurotorch.rl.utils import TrajectoryRenderer, space_to_continuous_shape
from neurotorch.transforms.spikes_encoders import SpikesEncoder

In [None]:
logs_file_setup("rl_tutorial", add_stdout=False)
log_device_setup(deepLib=DeepLib.Pytorch)
if torch.cuda.is_available():
    # Limit this process to 80% of the GPU memory.
    torch.cuda.set_per_process_memory_fraction(0.8)

## Initialization

In [None]:
env_id = "LunarLander-v2"
env = gym.vector.make(env_id, num_envs=10, render_mode="rgb_array")
use_spiking_policy = True  # True: spiking (SNN) policy; False: standard feed-forward policy

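Here, we initialize the trainer callback that saves the network during training. It tracks the cumulative rewards (`RLAcademy.CUM_REWARDS_METRIC_KEY`), which should be maximised, hence `minimise_metric=False`.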

In [None]:
if use_spiking_policy:
    checkpoint_folder = f"data/tr_data/checkpoints_{env_id}_snn-policy"
else:
    checkpoint_folder = f"data/tr_data/checkpoints_{env_id}_default-policy"
checkpoint_manager = nt.CheckpointManager(
    checkpoint_folder=checkpoint_folder,
    save_freq=10,
    metric=RLAcademy.CUM_REWARDS_METRIC_KEY,
    minimise_metric=False,
    save_best_only=True,
)
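To make the role of `save_freq`, `minimise_metric` and `save_best_only` more concrete, here is a minimal pure-Python sketch of a "save only when the metric improves" rule. The `BestOnlySaver` class is hypothetical and is not part of NeuroTorch. It also assumes that `save_freq` counts iterations and that the improvement check only happens on those iterations, which may differ from what `nt.CheckpointManager` actually does.

```python
import math


class BestOnlySaver:
    """Hypothetical helper illustrating a best-only checkpoint rule (not NeuroTorch code)."""

    def __init__(self, save_freq, minimise_metric):
        self.save_freq = save_freq
        self.minimise_metric = minimise_metric
        # Start from the worst possible value so the first check always saves.
        self.best = math.inf if minimise_metric else -math.inf
        self.saved_iterations = []

    def is_better(self, value):
        # Lower is better when minimising, higher is better otherwise.
        return value < self.best if self.minimise_metric else value > self.best

    def on_iteration_end(self, iteration, metric_value):
        # Assumption: the metric is only checked every `save_freq` iterations.
        if iteration % self.save_freq != 0:
            return False
        if not self.is_better(metric_value):
            return False
        self.best = metric_value
        self.saved_iterations.append(iteration)
        return True


# Made-up cumulative rewards, keyed by iteration number.
saver = BestOnlySaver(save_freq=10, minimise_metric=False)
made_up_rewards = {0: -200.0, 5: 50.0, 10: -120.0, 20: -150.0, 30: 35.0}
for iteration, reward in made_up_rewards.items():
    saver.on_iteration_end(iteration, reward)
print(saver.saved_iterations)
```

In this toy run, iteration 5 is skipped because it is not a multiple of `save_freq`, and iteration 20 is skipped because its reward is lower than the best one seen so far.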

Here, we initialize the learning algorithm that will be used to train the agent. For now, it is the popular [Proximal Policy Optimisation](https://arxiv.org/pdf/1707.06347.pdf) (PPO) algorithm from OpenAI.

In [None]:
ppo_la = nt.rl.PPO(
    tau=0.0,
    critic_weight=0.5,
    entropy_weight=0.01,
    gae_lambda=1.0,
    default_critic_lr=1e-3,
    default_policy_lr=5e-4,
    critic_criterion=torch.nn.SmoothL1Loss(),
    clip_ratio=0.2,
    critic_clip=0.2,
)
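For readers new to PPO, the following sketch shows the clipped surrogate objective from the paper linked above, for a single sample and in plain Python. It assumes that `clip_ratio` plays the role of the paper's clipping parameter. The function name `clipped_surrogate` is made up for illustration, and the code is not taken from `nt.rl.PPO`.

```python
import math


def clipped_surrogate(new_log_prob, old_log_prob, advantage, clip_ratio=0.2):
    """Per-sample PPO clipped surrogate objective (a quantity to maximise)."""
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = math.exp(new_log_prob - old_log_prob)
    # Keep the ratio inside [1 - clip_ratio, 1 + clip_ratio].
    clipped_ratio = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    # Taking the minimum makes the objective a pessimistic estimate.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio of 1.5 with a positive advantage: the gain is capped at (1 + 0.2) * 2.0.
capped_gain = clipped_surrogate(math.log(1.5), 0.0, advantage=2.0)
# Ratio of 0.5 with a negative advantage: the clipped term 0.8 * (-1.0) is the minimum.
capped_loss = clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0)
print(capped_gain, capped_loss)
```

The clipping removes the incentive to move the new policy far away from the one that generated the trajectories in a single optimisation pass.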

It is now time to define our policy. In short, the policy is the model that chooses the actions taken in the environment. The critic, which is defined together with the agent further down, is the model that estimates the rewards-to-go of the states the agent encounters.

In [None]:
if use_spiking_policy:
    policy = nt.SequentialRNN(
        input_transform=[
            SpikesEncoder(
                n_steps=8,
                n_units=space_to_continuous_shape(env.single_observation_space)[0],
                spikes_layer_type=nt.SpyLIFLayer,
            )
        ],
        layers=[
            nt.SpyLIFLayer(
                space_to_continuous_shape(env.single_observation_space)[0], 128, use_recurrent_connection=False
            ),
            nt.SpyLILayer(128, space_to_continuous_shape(env.single_action_space)[0]),
        ],
        output_transform=[nt.transforms.ReduceMax(dim=1)],
    ).build()
else:
    policy = nt.Sequential(
        layers=[
            torch.nn.Linear(space_to_continuous_shape(env.single_observation_space)[0], 128),
            torch.nn.Dropout(0.1),
            torch.nn.PReLU(),
            torch.nn.Linear(128, 128),
            torch.nn.Dropout(0.1),
            torch.nn.PReLU(),
            torch.nn.Linear(128, space_to_continuous_shape(env.single_action_space)[0]),
        ]
    ).build()
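The spiking policy ends with `nt.transforms.ReduceMax(dim=1)`. Assuming the output of the recurrent network has the shape `(batch, time, actions)`, this reduction keeps, for each action, the maximum value reached over the time steps. The sketch below reproduces that idea with nested lists. The helper `reduce_max_over_time` is hypothetical and only illustrates the assumed behaviour.

```python
def reduce_max_over_time(batch_outputs):
    """Map batch_outputs[b][t][a] to result[b][a] by taking the max over t."""
    return [
        [max(step[a] for step in sequence) for a in range(len(sequence[0]))]
        for sequence in batch_outputs
    ]


# One sample, three time steps, two actions (made-up values).
outputs = [
    [[0.1, 0.4], [0.3, 0.2], [0.0, 0.9]],
]
reduced = reduce_max_over_time(outputs)
print(reduced)
```

After the reduction, each sample has one value per action, which is the same shape the non-spiking policy produces directly.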

Next, we define the agent from the policy and the critic.

In [None]:
agent = Agent(
    env=env,
    behavior_name=env_id,
    policy=policy,
    critic=nt.Sequential(
        layers=[
            torch.nn.Linear(space_to_continuous_shape(env.single_observation_space)[0], 128),
            torch.nn.Dropout(0.1),
            torch.nn.PReLU(),
            torch.nn.Linear(128, 128),
            torch.nn.Dropout(0.1),
            torch.nn.PReLU(),
            torch.nn.Linear(128, 1),
        ]
    ).build(),
    checkpoint_folder=checkpoint_manager.checkpoint_folder,
)
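The critic is meant to estimate the rewards-to-go of a state. As a reminder of what that quantity is, here is a minimal sketch that computes discounted rewards-to-go for one trajectory. The discount factor `gamma` is a hypothetical parameter introduced for this example; it does not appear in the configuration above, and NeuroTorch may compute its targets differently.

```python
def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards for every step of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps with a reward of 1.0 each and gamma = 0.5 (made-up values).
example_returns = rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)
print(example_returns)
```

With `gamma=1.0`, the same function returns the plain, undiscounted sum of the remaining rewards.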

Here is the `RLAcademy`, a special type of trainer used to train the agent in a reinforcement learning pipeline.

In [None]:
academy = RLAcademy(
    agent=agent,
    callbacks=[checkpoint_manager, ppo_la],
    normalize_rewards=False,
    init_epsilon=0.00,
    use_priority_buffer=True,
)

## Training time!

In the next cell, we will start the actual training with the following parameters:

- `n_iterations`: The number of times the trainer will generate trajectories and do an optimisation pass.
- `n_epochs`: The number of times the trainer will pass through the buffer of episodes during an optimisation pass.
- `n_batches`: The number of batches to process at each epoch.
- `n_new_trajectories`: The number of new trajectories to generate at each iteration.
- `batch_size`: The number of episodes in a single batch.
- `buffer_size`: The size of the buffer.
- `clear_buffer`: Whether to clear the buffer before each iteration.
- `last_k_rewards`: The number of previous rewards (k) to show in the metrics.

In [None]:
history = academy.train(
    env,
    n_iterations=500,
    n_epochs=30,
    n_batches=-1,
    n_new_trajectories=10,
    batch_size=4096,
    buffer_size=np.inf,
    clear_buffer=True,
    randomize_buffer=True,
    load_checkpoint_mode=nt.LoadCheckpointMode.LAST_ITR,
    force_overwrite=False,
    verbose=True,
    render=False,
    last_k_rewards=10,
)
if not env.closed:
    env.close()
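The way `n_iterations`, `n_epochs` and `n_batches` nest can be sketched as three loops. The function below is hypothetical and only mirrors the parameter descriptions given earlier. In particular, it assumes that `n_batches=-1` means "as many batches as needed to cover the buffer once", which is a guess and not something confirmed by the NeuroTorch documentation.

```python
import math


def training_schedule(n_iterations, n_epochs, n_batches, batch_size, buffer_len):
    """List the (iteration, epoch, batch) triplets of an assumed training loop."""
    if n_batches < 0:
        # Assumption: a negative value means "cover the whole buffer once per epoch".
        n_batches = math.ceil(buffer_len / batch_size)
    updates = []
    for iteration in range(n_iterations):
        # New trajectories would be generated and stored in the buffer here.
        for epoch in range(n_epochs):
            for batch in range(n_batches):
                # One optimisation step would happen here.
                updates.append((iteration, epoch, batch))
    return updates


# Small made-up numbers to keep the example readable.
schedule = training_schedule(
    n_iterations=2, n_epochs=3, n_batches=-1, batch_size=4, buffer_len=10
)
print(len(schedule))
```

Under these assumptions, the number of optimisation steps grows as the product of the three loop sizes, which is worth keeping in mind when raising `n_iterations` or `n_epochs`.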

In [None]:
history.plot(show=True)

## Test Phase

In the next cell, we load the best checkpoint and generate new trajectories with the agent to see how it performs.

In [None]:
agent.load_checkpoint(
    checkpoints_meta_path=checkpoint_manager.checkpoints_meta_path,
    load_checkpoint_mode=nt.LoadCheckpointMode.BEST_ITR
)
env = gym.make(env_id, render_mode="rgb_array")
agent.eval()
gen_trajectories_out = academy.generate_trajectories(
    n_trajectories=10, epsilon=0.0, verbose=True, env=env, render=True, re_trajectories=True,
)
best_trajectory_idx = np.argmax([t.cumulative_reward for t in gen_trajectories_out.trajectories])
trajectory_renderer = TrajectoryRenderer(trajectory=gen_trajectories_out.trajectories[best_trajectory_idx], env=env)
trajectory_renderer.render()
trajectory_renderer.to_mp4(f"figures/trajectory_{best_trajectory_idx}.mp4")
cumulative_rewards = gen_trajectories_out.cumulative_rewards
print(f"Buffer: {gen_trajectories_out.buffer}")
print(f"Cumulative rewards: {np.nanmean(cumulative_rewards):.3f} +/- {np.nanstd(cumulative_rewards):.3f}")
n_terminated = sum([int(e.terminal) for e in gen_trajectories_out.buffer])
print(f"{n_terminated = }")
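The summary printed above uses `np.nanmean` and `np.nanstd`, which ignore NaN entries. The sketch below reproduces that behaviour in plain Python so the reported numbers are easier to interpret. The helper `nan_mean_std` is hypothetical, and it uses the population standard deviation, which matches the default `ddof=0` of `np.nanstd`.

```python
import math


def nan_mean_std(values):
    """Mean and population standard deviation of the non-NaN entries."""
    kept = [v for v in values if not math.isnan(v)]
    mean = sum(kept) / len(kept)
    variance = sum((v - mean) ** 2 for v in kept) / len(kept)
    return mean, math.sqrt(variance)


# Made-up cumulative rewards, one of which is missing (NaN).
example_rewards = [100.0, float("nan"), 200.0, 300.0]
mean, std = nan_mean_std(example_rewards)
print(f"Cumulative rewards: {mean:.3f} +/- {std:.3f}")
```

Because NaN entries are dropped silently, it can be worth checking how many trajectories actually contributed to the reported mean.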