![DSME-logo](./static/DSME_logo.png)

#  Reinforcement Learning and Learning-based Control

<p style="font-size:12pt;">
<b> Prof. Dr. Sebastian Trimpe, Dr. Friedrich Solowjow </b><br>
<b> Institute for Data Science in Mechanical Engineering (DSME) </b><br>
<a href = "mailto:rllbc@dsme.rwth-aachen.de">rllbc@dsme.rwth-aachen.de</a><br>
</p>

---
REINFORCE Implementation

Notebook Authors: Ramil Sabirov

## Library Imports

In [None]:
import os
import time
import random
import warnings
from datetime import datetime

import numpy as np
import matplotlib.pyplot as plt
from tqdm import notebook
from easydict import EasyDict as edict
from IPython.display import Video
import math

import utils.helper_fns as hf

import gym
import wandb
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions.categorical import Categorical

warnings.filterwarnings("ignore", category=DeprecationWarning)

plt.rcParams['figure.dpi'] = 100
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

%load_ext autoreload
%autoreload 2

## Initializations

### Experiment Init

We primarily use dictionaries for initializing experiment parameters and training hyperparameters. We use the `EasyDict` library (imported as `edict`), which allows us to access dict values as attributes while retaining the operations and properties of the original Python `dict`. [[GitHub Link](https://github.com/makinacorpus/easydict)]

In this notebook we use a few `edicts` with `exp` being one of them. It is initialized in the following cell and has keys and values containing information about the experiment. Although the dict is initialized in this section, we keep adding new keys and values to the dict in the later sections as well.  
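
To illustrate the attribute-style access described above without depending on the library, here is a minimal sketch. `AttrDict` is a hypothetical stand-in written for this example; it is not the `EasyDict` implementation, which covers more cases (such as nested dicts).

```python
class AttrDict(dict):
    """Hypothetical stand-in for illustration; not the EasyDict implementation."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods still work
        try:
            return self[name]
        except KeyError as err:
            raise AttributeError(name) from err

    def __setattr__(self, name, value):
        # Attribute writes are stored as regular dict items
        self[name] = value


cfg = AttrDict()
cfg.env_id = "CartPole-v1"   # attribute-style write
cfg["seed"] = 2              # regular dict write still works

print(cfg.seed, cfg["env_id"])   # both access styles read the same storage
print(sorted(cfg.keys()))        # regular dict methods are retained
```

Because the object is still a `dict`, new keys can be added at any point, which is how `exp` is extended in the later sections.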

This notebook supports gym environments with an observation space of type `gym.spaces.Box` and an action space of type `gym.spaces.Discrete`, e.g. Acrobot-v1, CartPole-v1, MountainCar-v0.

In [None]:
exp = edict()

exp.exp_name = 'REINFORCE'  # algorithm name, in this case it should be 'REINFORCE'
exp.env_id = 'CartPole-v1'  # name of the gym environment to be used in this experiment. Eg: Acrobot-v1, CartPole-v1, MountainCar-v0
exp.device = device.type  # save the device type used to load tensors and perform tensor operations

set_random_seed = True  # set random seed for reproducibility of python, numpy and torch
exp.seed = 2

# name of the project in Weights & Biases (wandb) to which logs are patched. (only if wandb logging is enabled)
# if the project does not exist in wandb, it will be created automatically
wandb_prj_name = f"RLLBC_{exp.env_id}"

# name prefix of output files generated by the notebook
exp.run_name = f"{exp.env_id}__{exp.exp_name}__{exp.seed}__{datetime.now().strftime('%y%m%d_%H%M%S')}"

if set_random_seed:
    random.seed(exp.seed)
    np.random.seed(exp.seed)
    torch.manual_seed(exp.seed)
    torch.backends.cudnn.deterministic = set_random_seed

### Agent Model Class

The `Agent` class wraps a multilayer perceptron (MLP) policy whose weights are updated during training. The network takes the state representation as input, passes it through two hidden layers, and outputs a probability distribution over all actions via the `softmax` function.

The `Agent` class has two methods:
1. `get_action_and_logprob` evaluates the network and samples an action from the resulting probability distribution. It also returns the logarithm of the action probability $\log \pi(a_t| s_t)$ which is used for obtaining gradient estimates for training.

2. `get_action` evaluates the network to obtain the probability distribution and either samples an action from that distribution (`greedy=False`) or returns the action with the maximal probability (`greedy=True`).
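
The difference between sampling and greedy selection can be sketched in plain Python, independent of the network. The helper names `softmax` and `select_action`, the logits, and the seed are assumptions made up for this illustration; the notebook itself uses `torch.distributions.Categorical` for the same purpose.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(probs, greedy, rng):
    """Return (action index, log-probability of that action)."""
    if greedy:
        # pick the action with the highest probability
        action = max(range(len(probs)), key=lambda a: probs[a])
    else:
        # draw an action according to the categorical distribution
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


rng = random.Random(0)           # local generator, seed chosen arbitrarily
logits = [2.0, 0.5]              # made-up network outputs for two actions
probs = softmax(logits)

greedy_action, greedy_logprob = select_action(probs, greedy=True, rng=rng)
sampled_action, sampled_logprob = select_action(probs, greedy=False, rng=rng)
print(probs, greedy_action, sampled_action)
```

Sampling keeps exploration alive during training, while greedy selection is typically reserved for evaluating the learned policy.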

In [None]:
class Agent(nn.Module):
    def __init__(self, env):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(np.array(env.observation_space.shape).prod(), 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, env.action_space.n),
            nn.Softmax(dim=-1)
        )

    def get_action_and_logprob(self, x):

        action_probs = self.network(x)
        probs = Categorical(probs=action_probs)
        action = probs.sample()

        return action, probs.log_prob(action)

    def get_action(self, x, greedy=False):
        action_probs = self.network(x)

        if greedy:
            action = action_probs.argmax(dim=-1)  # last dim, so single and batched observations both work
        else:
            probs = Categorical(probs=action_probs)
            action = probs.sample()

        return action


### Agent Hyperparams & Training Params Init
The second dictionary, `hypp`, is initialized in the following cell. It has keys and values containing the hyperparameters necessary to the algorithm.

The parameters and hyperparameters in this section are broadly categorized as below:
1. Flags for logging: 
    - Stored in the `exp` dict. 
    - By default, all logged parameters are saved as tensorboard logs with the name `exp.run_name`
    - To enable logging of gym videos of the agent's interaction with the env set `exp.capture_video = True`
    - Patch tensorboard logs and gym videos to Weights & Biases (wandb) by setting `exp.enable_wandb_logging = True`
2. Flags and parameters to generate average performance throughout training:
    - Stored in the `exp` dict
    - If you'd like to later see and compare the performance of multiple agents during training (in the section "Compare Performance of Agents During Training"), set `exp.eval_agent = True`
    - Every `exp.eval_frequency` episodes the trained agent is evaluated using the `envs_eval` by playing out `exp.eval_count` episodes
    - To speed up training set `exp.eval_agent = False` 
3. Parameters and hyperparameters related to the algorithm:
    - Stored in the `hypp` dict

Note: 
1. If Weights & Biases (wandb) logging is enabled, enter your wandb API key when prompted while running the "Training the Agent" cell.
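
As a rough sketch of the evaluation schedule described above, the snippet below lists the episodes after which an evaluation would be triggered by a check of the form `episode_step % exp.eval_frequency == 0`. The number of training episodes is a made-up value for illustration only.

```python
eval_frequency = 50   # same value as exp.eval_frequency in the next cell
eval_count = 10       # same value as exp.eval_count in the next cell
num_episodes = 220    # made-up number of finished training episodes

# Episodes after which an evaluation would be triggered
eval_episodes = [ep for ep in range(1, num_episodes + 1) if ep % eval_frequency == 0]
print(eval_episodes)  # [50, 100, 150, 200]

# Each evaluation plays eval_count extra episodes, which is why disabling it speeds up training
extra_episodes = len(eval_episodes) * eval_count
print(extra_episodes)  # 40
```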

In [None]:
hypp = edict()

# flags for logging purposes
exp.enable_wandb_logging = True
exp.capture_video = True

# flags to generate agent's average performance during training
exp.eval_agent = True  # disable to speed up training
exp.eval_count = 10
exp.eval_frequency = 50

# agent training specific parameters and hyperparameters
hypp.total_timesteps = 100000  # the training duration in number of time steps
hypp.learning_rate = 3e-4  # the learning rate for the optimizer
hypp.gamma = 0.99  # discount factor applied to future rewards

## Training the Agent

Before we begin training the agent, we first initialize the logging (based on the respective flags in the `exp` dict), the object of the `Agent` class, and the optimizer, followed by an initial set of observations. 


After that comes the main training loop, which consists of the following steps:  
1. Collect a trajectory, i.e. step the environment until the episode is finished, while keeping track of the received rewards
2. Compute the returns $G_t$ for every step $t$ of the trajectory
3. Compute the loss $L = -\sum_{t=1}^{n} G_t \log \pi_\theta(a_t|s_t)$
4. Perform a gradient descent step on $L$, which is equivalent to a gradient ascent step along $\nabla_\theta(-L)$
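
Steps 2 and 3 can be sketched in plain Python on a tiny hand-made episode. The function names, rewards, and action probabilities below are assumptions for illustration, and $\gamma = 0.5$ is chosen only to keep the arithmetic easy to follow (the notebook uses `hypp.gamma`).

```python
import math


def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards through the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_loss(returns, logprobs):
    """L = -sum_t G_t * log pi(a_t | s_t)."""
    return -sum(g * lp for g, lp in zip(returns, logprobs))


rewards = [1.0, 1.0, 1.0]                 # made-up three-step episode
logprobs = [math.log(0.5)] * 3            # made-up action probabilities of 0.5
returns = discounted_returns(rewards, gamma=0.5)
loss = reinforce_loss(returns, logprobs)
print(returns)   # [1.75, 1.5, 1.0]
print(loss)
```

Early actions receive larger returns because they are credited with all rewards that follow them, so their log-probabilities are weighted more heavily in the loss.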

Post completion of the main training loop, we save a copy of the following:
1. `exp` and `hypp` dicts into a `.config` file (can be found in `trained_agent` folder)
2. `agent` (instance of `Agent` class) for later evaluation (can be found in `trained_agent` folder)
3. agent performance progress throughout training if `exp.eval_agent=True` (can be found in `tracked_data` folder)

Note: Episode statistics (return and length) and training statistics (policy loss and steps per second) are logged to tensorboard (tb) once per episode via `writer.add_scalar`.

Note: Two gym environments are created in the initializations: `env`, a single environment used to collect the training trajectories, and `envs_eval`, a vectorised environment used to evaluate the agent's performance at different stages of training.

In [None]:
# Init tensorboard logging and wandb logging
writer = hf.setup_logging(wandb_prj_name, exp, hypp)

env = hf.make_single_env(exp.env_id, exp.seed, False, exp.run_name)
envs_eval = gym.vector.SyncVectorEnv([hf.make_env(exp.env_id, exp.seed + i, i, False, None) for i in range(1)])

# init list to track agent's performance throughout training
tracked_returns_over_training = []
tracked_episode_len_over_training = []
tracked_episode_count = []
greedy_evaluation = True
eval_max_return = -math.inf

agent = Agent(env).to(device)
optimizer = optim.Adam(agent.parameters(), lr=hypp.learning_rate)

start_time = time.time()

pbar = notebook.tqdm(total=hypp.total_timesteps)

# the maximum number of steps an environment is rolled out
max_steps = 1000

global_step = 0
episode_step = 0
gradient_step = 0

while global_step < hypp.total_timesteps:

    next_obs = env.reset()
    rewards = torch.zeros(max_steps).to(device)
    actions = torch.zeros((max_steps,) + env.action_space.shape).to(device)
    obs = torch.zeros((max_steps, ) + env.observation_space.shape).to(device)
    logprobs = torch.zeros(max_steps).to(device)

    episode_length = 0

    # collect trajectory
    for step in range(max_steps):

        episode_length = episode_length + 1
        global_step = global_step + 1

        next_obs = torch.tensor(next_obs).to(device)
        obs[step] = next_obs

        # choose action according to agent network
        action, log_prob = agent.get_action_and_logprob(next_obs)

        # apply action to envs
        next_obs, reward, done, info = env.step(action.cpu().item())

        rewards[step] = torch.tensor(reward).to(device)
        actions[step] = action
        logprobs[step] = log_prob

        if done:
            # episode has been finished
            episode_step = episode_step + 1
            break

    # evaluate model
    if (episode_step % exp.eval_frequency == 0) and exp.eval_agent:
        tracked_return, tracked_episode_len = hf.evaluate_agent(envs_eval, agent, exp.eval_count, exp.seed, greedy_evaluation)
        tracked_returns_over_training.append(tracked_return)
        tracked_episode_len_over_training.append(tracked_episode_len)
        tracked_episode_count.append([episode_step, global_step])

        # if the model has improved - save model, create video, log video to wandb
        if np.mean(tracked_return) > eval_max_return:
            eval_max_return = np.mean(tracked_return)
            hf.save_model(agent, exp.run_name, print_path=False)
            if exp.capture_video:
                filepath, _ = hf.create_folder_relative(f"videos/{exp.run_name}")
                video_file = f"{filepath}/{episode_step}.mp4"
                hf.record_video(exp.env_id, agent, device, video_file, greedy=greedy_evaluation)
                if wandb.run is not None:
                    wandb.log({"video": wandb.Video(video_file, fps=4, format="gif")})

    # calculate the discounted returns G_t backwards through the trajectory
    # (created on the same device as rewards and logprobs to avoid device mismatches)
    returns = torch.zeros(episode_length, device=device)
    running_return = 0.0
    for t in reversed(range(episode_length)):
        running_return = rewards[t] + hypp.gamma * running_return
        returns[t] = running_return

    # calculate the loss L = -sum_t G_t * log pi(a_t | s_t) over the collected steps
    loss = -(returns * logprobs[:episode_length]).sum()

    # logging information regarding agent performance
    writer.add_scalar("rollout/episodic_return", rewards.sum().item(), global_step)
    writer.add_scalar("rollout/episodic_length", episode_length, global_step)

    # logging information about the loss
    writer.add_scalar("train/policy_loss", loss, global_step)
    writer.add_scalar("others/SPS", int(global_step / (time.time() - start_time)), global_step)
    writer.add_scalar("Charts/gradient_step", gradient_step, global_step)
    writer.add_scalar("Charts/episode_step", episode_step, global_step)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    gradient_step = gradient_step + 1
    pbar.update(min(episode_length, hypp.total_timesteps - pbar.n))
    pbar.set_description(f"episode={episode_step}, episodic_return={rewards.sum().item()}")

# one last evaluation stage
if exp.eval_agent:
    tracked_return, tracked_episode_len = hf.evaluate_agent(envs_eval, agent, exp.eval_count, exp.seed, greedy_evaluation)
    tracked_returns_over_training.append(tracked_return)
    tracked_episode_len_over_training.append(tracked_episode_len)
    tracked_episode_count.append([episode_step, global_step])

    # if the model has improved - save model, create video, log video to wandb
    if np.mean(tracked_return) > eval_max_return:
        eval_max_return = np.mean(tracked_return)
        hf.save_model(agent, exp.run_name, print_path=False)
        if exp.capture_video:
            filepath = f"{os.getcwd()}/logs/videos/{exp.run_name}/{episode_step}.mp4"
            hf.record_video(exp.env_id, agent, device, filepath, greedy=greedy_evaluation)
            if wandb.run is not None:
                wandb.log({"video": wandb.Video(filepath, fps=4, format="gif")})

    hf.save_tracked_values(tracked_returns_over_training, tracked_episode_len_over_training, tracked_episode_count, exp.eval_count, exp.run_name)    

env.close()
writer.close()
pbar.close()
if wandb.run is not None:
    wandb.finish(quiet=True)
    wandb.init(mode='disabled')

hf.save_train_config_to_yaml(exp, hypp)

## Compare Trained Agents and Display Behaviour

### Compare Agents

Here you can build a plot to compare average rewards over multiple episodes. In the dict `eval_params`, you can specify the episode count in the key `num_episodes`.

After successfully running the previous cells, you will now have a trained agent model with the name `<exp.run_name>.st` saved to the trained_agent folder. To load the agent, you can either set `eval_params.agent_name00 = exp.run_name` or manually enter its name (without the extension .st).

An example of a trained agent model would be `MountainCar-v0__PPO__1__230223_101914.st`.

Add additional agent models from the trained_agent folder to build an average performance comparison. To do this, create new keys of the format `agent_namexx` in the `eval_params` dict. 

An example would be: `eval_params.agent_namexx = "MountainCar-v0__PPO__1__230223_001255"`

The function `load_multiple_agent_models` will load the agents into the dict `agents`.
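
The key naming scheme and the statistics that feed the comparison plot can be sketched as follows. The run names and episodic returns are placeholders invented for this example, not real files or results; `statistics.pstdev` computes the population standard deviation, which matches the default behaviour of `np.std`.

```python
import statistics

run_names = ["run_a", "run_b"]  # placeholder names, not real files

# Build keys of the form agent_name00, agent_name01, ...
eval_params = {}
for idx, name in enumerate(run_names):
    eval_params[f"agent_name{idx:02d}"] = name

# Made-up episodic returns per agent, only to show the aggregation
returns_per_agent = {"run_a": [10.0, 20.0, 30.0], "run_b": [15.0, 15.0, 15.0]}

summary = {}
for key in sorted(eval_params):
    name = eval_params[key]
    rets = returns_per_agent[name]
    summary[name] = (statistics.mean(rets), statistics.pstdev(rets))
    print(key, name, summary[name])
```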

In [None]:
eval_params = edict()  # eval_params - dict containing parameters necessary for the setup to evaluate trained agents

eval_params.env_id = 'CartPole-v1'
eval_params.num_envs = 4
eval_params.num_episodes = 50
eval_params.capture_video = False

def record_trigger(x): return x == 0

eval_params.folder_path = f"{os.getcwd()}/logs/trained_agent"
eval_params.agent_name00 = exp.run_name


agents = edict()  # agents - dict containing trained agent models
agents = hf.load_multiple_agent_models(eval_params)

agent_stats = edict()  # agent_stats - dict containing statistics captured during evaluation of each agent listed in eval_params

epsiodic_return_over_runs = []
avg_eps_return_over_runs = []
std_eps_return_over_runs = []

# self-define plot labels using a list of strings (length should be equal to number of agents being evaluated)
# when set to None: plotter function generates default labels
agent_labels = None

for idx, agent_model in enumerate(agents.values()):
    agent_idx = f"{idx:02d}"
    agent_name_key = f"agent_name{agent_idx}"
    save_vid_dir = f"captured_vid_dir{agent_idx}"
    stats_key = f"episodic_return_over_runs{agent_idx}"  # dict key for this agent's evaluation returns

    agent_name = eval_params[agent_name_key]
    eval_params[save_vid_dir] = f"{agent_name}__TrainedAgent"

    envs = gym.vector.SyncVectorEnv([hf.make_env(eval_params.env_id, exp.seed+i, i, False, eval_params[save_vid_dir], record_trigger) for i in range(eval_params.num_envs)])

    # evaluate_agent returns (episodic returns, episode lengths), as unpacked in the training cell; keep only the returns
    episodic_returns, _ = hf.evaluate_agent(envs, agents[f"agent{agent_idx}"], eval_params.num_episodes, exp.seed, greedy_actor=False)
    agent_stats[stats_key] = episodic_returns

    # mean and standard deviation of the episodic returns for the comparison plot
    avg_eps_return_over_runs.append(np.mean(episodic_returns))
    std_eps_return_over_runs.append(np.std(episodic_returns))

hf.plotter_trained_agent_comparison(eval_params, avg_eps_return_over_runs, std_eps_return_over_runs, eval_params.num_episodes, agent_labels)

### Compare Performance of Agents During Training

When `exp.eval_agent = True`, the agent's performance throughout training is saved as a csv file in the `tracked_data` folder. To compare the training progress of different agents, create a `dict` containing the names of the csv files (without the `.csv` extension) and pass it to the function `plotter_agents_training_stats`.

In [None]:
eval_params = edict()  # eval_params - dict containing parameters necessary for the setup to evaluate trained agents

eval_params.agent_name00 = exp.run_name
# eval_params.agent_nameXX = "CartPole-v1__PPO__1__230128_211640"

agent_labels = []

episode_axis_limit = None

hf.plotter_agents_training_stats(eval_params, agent_labels, episode_axis_limit, plot_returns=True, plot_episode_len=True)

### Display Trained Agent Behaviour 

Set `agent_name` to the run name of a previous run, or to `exp.run_name` for the current run.

In [None]:
agent_name = exp.run_name

filepath, _ = hf.create_folder_relative(f"videos/{agent_name}")
hf.record_video(exp.env_id, agent_name, device, f"{filepath}/best.mp4", greedy=True)
Video(filename=f"{filepath}/best.mp4", html_attributes='loop autoplay')

## TensorBoard Inline

In [None]:
# %load_ext tensorboard
# %tensorboard --logdir runs --host localhost