<a href="https://colab.research.google.com/github/RL-Starterpack/rl-starterpack/blob/main/exercises/TQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RL Tutorial - **TQL Exercise**

## Setup

In [None]:
#@title Run this cell to clone the RL tutorial repository and install it
try:
  import rl_starterpack
  print('RL-Starterpack repo successfully installed!')
except ImportError:
  print('Cloning RL-Starterpack package...')

  !git clone https://github.com/RL-Starterpack/rl-starterpack.git
  print('Installing RL-Starterpack package...')
  !pip install -e rl-starterpack[full] &> /dev/null
  print('\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
  print('Please restart the runtime to use the newly installed package!')
  print('Runtime > Restart Runtime')
  print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')

In [None]:
#@title Run this cell to install additional dependencies (will take ~30s)
!apt-get remove ffmpeg > /dev/null # Removing due to restrictive license
!apt-get install -y xvfb x11-utils > /dev/null

In [None]:
#@title Run this cell to import the required libraries
try:
    from rl_starterpack import OpenAIGym, TQL, experiment, vis_utils
except ImportError:
    print('Please run the first cell! If you already ran it, make sure to restart the runtime after the package is installed.')
    raise
from itertools import chain
from tqdm.auto import tqdm
import numpy as np
import scipy.stats as st
import pandas as pd
import altair as alt
import torch
import gym
import torchviz
%matplotlib inline
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

# Setup display to show video renderings
if 'display' not in globals():
    display = Display(visible=0, size=(1400, 900))
    display.start()

## Exercise

### FrozenLake: Tabular Q-learning method
First we are going to see how RL works from the outside-in. Later we will get to grips with the details of the TQL method.

The RL starterpack repository contains agent implementations as well as helper code to run experiments and train agents.
We will use the repository's implementation of tabular Q-learning to demonstrate how this code fits together and how we visualise the results.

#### Environment and TQL agent
We set up our environment and a constructor function that creates a tabular Q-learning agent.

In [None]:
env = OpenAIGym(level='FrozenLake', max_timesteps=100)

def agent_fn():
    return TQL(
        state_space=env.state_space, action_space=env.action_space,
        learning_rate=0.3, discount=0.9, exploration=0.1
    )

The environment limits episodes to 100 time-steps.
We need this limit, as otherwise agents' policies can sometimes get stuck in infinite loops.
The agent's parameters are:

  - `learning_rate`: a "step size" for the temporal difference update
  - `discount`: a factor that determines how rewards are temporally discounted
  - `exploration`: a rate that controls the agent's balance between exploration and exploitation
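
To make these three parameters concrete, here is a minimal sketch of one epsilon-greedy action choice and one temporal difference update on a toy table. It follows the standard textbook formulation of Q-learning and is not the repository's `TQL` code; the names `choose_action` and `td_update`, and the table size, are hypothetical.

```python
import random

def choose_action(q_row, exploration, rng):
    """Epsilon-greedy: random action with probability `exploration`, else greedy."""
    if rng.random() < exploration:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def td_update(q_table, state, action, reward, next_state, terminal,
              learning_rate, discount):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Terminal transitions have no future value to bootstrap from
    bootstrap = 0.0 if terminal else max(q_table[next_state])
    td_target = reward + discount * bootstrap
    td_error = td_target - q_table[state][action]
    q_table[state][action] += learning_rate * td_error
    return td_error

# Toy table: 3 states x 2 actions, initialised to zero (hypothetical sizes)
q_table = [[0.0, 0.0] for _ in range(3)]
rng = random.Random(0)

# Reaching the goal from state 1 with action 1 (reward 1, episode ends)
td_update(q_table, 1, 1, reward=1.0, next_state=2, terminal=True,
          learning_rate=0.3, discount=0.9)
# A non-terminal step from state 0 into state 1 bootstraps from that new value
td_update(q_table, 0, 1, reward=0.0, next_state=1, terminal=False,
          learning_rate=0.3, discount=0.9)

print(f"Q[1][1] = {q_table[1][1]:.3f}")
print(f"Q[0][1] = {q_table[0][1]:.3f}")

# With exploration=0.0 the choice is always greedy
greedy_action = choose_action(q_table[0], exploration=0.0, rng=rng)
```

The second update shows how `discount` propagates the goal reward backwards: the value learnt in state 1 is scaled by 0.9 before it reaches state 0, and `learning_rate` controls how far each estimate moves towards its target.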

In [None]:
num_runs = 5                  # number of training + evaluation loops we run
num_episodes_train = 1000     # number of training episodes per run
num_episodes_eval = 37        # number of evaluation episodes per run
pbar = tqdm(range(num_runs))  # This wraps the run iterator with a progress bar
pbar.set_postfix({'mean return': 'n/a'})
run_returns = list()
for run in pbar:
    # Create and train an agent
    agent = agent_fn()
    _ = experiment.train(agent, env, num_episodes_train, use_pbar=True)

    # Evaluation loop
    eval_returns = experiment.evaluate(agent, env, num_episodes_eval, use_pbar=True)
    pbar.set_postfix({'mean return': '{:.2f}'.format(eval_returns.mean())})

    # Close agent
    agent.close()

    # Record evaluation return
    run_returns.append(pd.DataFrame(data=dict(evaluation=np.arange(num_episodes_eval),
                                              run=run,
                                              eval_return=eval_returns)))
    
# Combine data frames
run_returns = pd.concat(run_returns).reset_index(drop=True)

What returns do you expect to see from each episode? Run the next block to see if you are right.

In [None]:
run_returns.sample(6)

Now we can examine the variation in returns across training runs and evaluation episodes.

In [None]:
alt.Chart(run_returns).mark_rect().encode(
    x='evaluation:O',
    y='run:O',
    color='eval_return:O'
)

We see that there is variation in the success rate between the training runs.
That is, some training runs appear to have resulted in more or less successful agents.
Also, we note that due to the stochastic nature of the environment, each agent has variation in the returns across evaluation episodes.

We can calculate the means of the evaluation returns and their standard errors for each training run.
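
As a reminder of what these two statistics mean, the sketch below computes the mean and the standard error of the mean by hand for a made-up list of 0/1 returns. The data are hypothetical and are not results from this notebook. The standard error is the sample standard deviation (with `n - 1` in the denominator, matching the default of `scipy.stats.sem`) divided by the square root of the number of episodes.

```python
import math
import statistics

# Hypothetical evaluation returns from one run: 1 = reached the goal, 0 = did not
example_returns = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]

example_mean = statistics.mean(example_returns)
# Sample standard deviation (n - 1 denominator) divided by sqrt(n)
example_sem = statistics.stdev(example_returns) / math.sqrt(len(example_returns))

print(f"mean return: {example_mean:.2f}, standard error: {example_sem:.3f}")
```

With only a handful of 0/1 outcomes the standard error is large relative to the mean, which is why differences between runs should be read with some caution.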

In [None]:
# Passing 'mean' as a string avoids the pandas FutureWarning raised for np.mean
run_returns.groupby('run')['eval_return'].agg(['mean', st.sem])

We can also examine how the agent from the last training run solves the environment.

In [None]:
experiment.evaluate_render(agent, env, ipythondisplay, sleep=0.5)

Your results may vary, but the solution is most likely unimpressive: the agent takes many wrong steps.

### Tune the hyperparameters

Tuning the hyperparameters is one thing we can try to improve our agent's performance.
Fill in some values for the hyperparameters below to investigate how this affects the mean return.

Remember that even for fixed values of the hyperparameters, the results will vary from run to run.

In [None]:
# TODO: Fill in these hyperparameters
learning_rate = None  # Speed at which the agent learns. Value in (0, 1)
discount_rate = None  # How much future rewards are discounted at each step. Value in (0, 1)
exploration = None    # During training the agent will take a random action and "explore" with this probability.
                      # Value in (0, 1)

# Create the agent with the given parameters
agent = TQL(state_space=env.state_space, action_space=env.action_space,
            learning_rate=learning_rate, discount=discount_rate, exploration=exploration)

# Train the agent
train_returns = experiment.train(agent, env, num_episodes=1000)
# experiment.train(agent, env, num_episodes=1000, reward_shaping_fn=reward_shaping_fn)

# Evaluate the agent
returns = experiment.evaluate(agent, env, num_episodes=100)
print(f'Mean return: {returns.mean():.3f} +/- {st.sem(returns):.3f}')

Do you have a good understanding of what each parameter does?

We can visualise the returns achieved during training.
The blue line shows the raw returns and the orange line shows a smoothed version that makes any trend easier to see.
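
Smoothing of this kind is often done with a moving average. The sketch below is one simple way to do it, a trailing mean over the most recent `window` episodes; it is an assumption for illustration and not necessarily how `vis_utils.draw_returns_chart` computes its smoothed line. The function name `moving_average` and the example data are hypothetical.

```python
def moving_average(values, window):
    """Trailing mean over up to `window` of the most recent values."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

# Hypothetical raw training returns (1 = reached the goal, 0 = did not)
raw_returns = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
smoothed_returns = moving_average(raw_returns, window=4)
print([round(value, 2) for value in smoothed_returns])
```

A larger window gives a smoother curve but reacts more slowly to genuine changes in performance.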

In [None]:
vis_utils.draw_returns_chart(train_returns, smoothing_window=40)

High values for the exploration parameter will decrease the mean training return. Why?

Let's examine how our new agent solves the task.

In [None]:
experiment.evaluate_render(agent, env, ipythondisplay, sleep=0.5)

Hopefully this new agent has learnt a better policy. Your mileage may vary, but it is unlikely to have reached a perfect solution yet.

### Reward shaping

Another way to help the agent learn a better policy is a method called reward shaping.
This is useful when the reward signal that the environment provides is not optimal for learning.
In this Frozen Lake environment, landing on a hole terminates the episode and provides a reward of 0.
This is the same reward as for any other non-goal state, so it does not signal to the agent that this outcome should be avoided.
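
To see why a reward of 0 carries no signal, consider the standard Q-learning update for a terminal transition, where there is no future value to bootstrap from. The sketch below assumes Q-values start at zero, which is an assumption for illustration and not a confirmed detail of `TQL`; `terminal_q_update` and `alpha` are hypothetical names.

```python
def terminal_q_update(q_value, reward, alpha):
    """Q-learning update for a transition that ends the episode."""
    # The target is just the reward, since no further rewards can follow
    return q_value + alpha * (reward - q_value)

alpha = 0.3  # step size, playing the role of `learning_rate`

# Stepping into a hole from a zero-initialised estimate
q_hole_unshaped = terminal_q_update(0.0, reward=0.0, alpha=alpha)
q_hole_shaped = terminal_q_update(0.0, reward=-1.0, alpha=alpha)

print(q_hole_unshaped, q_hole_shaped)
```

With a reward of 0 the estimate stays at 0.0, indistinguishable from an action that has never been tried, while a reward of -1 pushes it to -0.3 so that a greedy policy starts to prefer other actions.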

A *reward shaping function* takes the reward provided by the environment and amends it to improve learning.
In the Frozen Lake environment, a reward of -1 for landing on a hole might be a better signal for the agent.
Fill in the function below to see if training improves.

In [None]:
def reward_shaping_fn(reward, terminal, next_state):
    """
    Shapes the reward before passing it on to the agent.
    
    Args:
        reward (float): Reward returned by the environment for the action which was just performed.
        terminal (int): Flag indicating whether the current episode has ended (1 if it has ended, otherwise 0).
        next_state (object): Next state. In the case of FrozenLake this is a np.ndarray holding a scalar, e.g. np.array(0).
        
    Returns:
        reward (float): The modified reward.
        terminal (int): The `terminal` input needs to be passed through.
    """
    # TODO: Fill in if your agent is having a hard time solving the environment!
    
    return reward, terminal

# Create a new agent with the existing parameters
agent = TQL(state_space=env.state_space, action_space=env.action_space,
            learning_rate=learning_rate, discount=discount_rate, exploration=exploration)

# Train the agent using reward shaping
train_returns = experiment.train(agent, env, num_episodes=1000, reward_shaping_fn=reward_shaping_fn)

# Evaluate the agent
returns = experiment.evaluate(agent, env, num_episodes=100)
print(f'Mean return: {returns.mean():.3f} +/- {st.sem(returns):.3f}')

Hopefully your mean return is now higher! Returns above 0.7 are possible.

In [None]:
#@title _<sub><sup>SOLUTION: Expand this cell to see a working reward shaping function </sup></sub>_

# Provide some helpful reward shaping
def reward_shaping_fn(reward, terminal, next_state):
    del next_state # unused
    if terminal == 1 and reward == 0.0:
        # Penalize the agent for failing to reach the goal
        return -1.0, terminal
    else:
        return reward, terminal

# Create a new agent with the existing parameters
agent = TQL(state_space=env.state_space, action_space=env.action_space,
            learning_rate=learning_rate, discount=discount_rate, exploration=exploration)

# Train the agent using reward shaping
train_returns = experiment.train(agent, env, num_episodes=1000, reward_shaping_fn=reward_shaping_fn)

# Evaluate the agent
returns = experiment.evaluate(agent, env, num_episodes=100)
print(f'Mean return: {returns.mean():.3f} +/- {st.sem(returns):.3f}')

If you wish to investigate TQL further, please have a look at the implementation of our [TQL agent](https://github.com/RL-Starterpack/rl-starterpack/blob/main/rl_starterpack/agents/tql.py).
In particular, look at the `TQL` class that implements `exploration_policy` and `q_learning_policy`.
Feel free to implement your own agent that redefines these methods in any way you see fit.