# Day 2, Part B: TD3 Algorithm

## Learning goals
- Find out why TD3 is more performant for this environment than PPO
- Walk through the TD3 code and learn how this author constructed it
- See examples of terminology usage that are different from the CartPole example
- Learn what a replay buffer is

## Definitions
- **Simulation environment**: Notice that this is not the same as the python/conda environment.  The simulation environment is the simulated world where the reinforcement learning takes place.  It provides opportunities for an agent to learn and explore, and ideally provides challenges that aid in efficient learning.
- **Agent (aka actor or policy)**: An entity in the simulation environment that performs actions.  The agent could be a person, a robot, a car, a thermostat, etc.
- **State variable**: An observed variable in the simulation environment.  They can be coordinates of objects or entities, an amount of fuel in a tank, air temperature, wind speed, etc.
- **Action variable**: An action that the agent can perform.  Examples: step forward, increase velocity to 552.5 knots, push object left with force of 212.3 N, etc.
- **Reward**: A value given to the agent for doing something considered to be 'good'.  Reward is commonly assigned at each time step and accumulated during a learning episode.
- **Episode**: A learning event consisting of multiple steps in which the agent can explore.  It starts with the unmodified environment and continues until the goal is achieved or something prevents further progress, such as a robot getting stuck in a hole.  Multiple episodes are typically run in loops until the model is fully trained.
- **Model (aka policy or agent)**: An RL model is composed of the modeling architecture (e.g., neural network) and parameters or weights that define the unique behavior of the model.
- **Policy (aka model or agent)**: The parameters of a model that encode the best choices to make in an environment.  The choices are not necessarily good ones until the model undergoes training.  The policy (or model) is the "brain" of the agent.
- **Replay Buffer**: A place in memory to store state, action, reward and other variables describing environmental state transitions. It is effectively the agent's memory of past experiences.

## TD3 vs PPO

One of the big differences between these two is that PPO is an on-policy method, while TD3 is an off-policy method.  On-policy means that the agent learns only from experience collected with its current actor policy, and off-policy means that it can also learn from experience collected by a different policy, such as an older version of itself or a purely random one.

In this specific case, TD3 builds two Q-functions (twin quality value functions) that estimate the expected future reward given the current state and action (current time step).  On the other hand, PPO makes all reward estimates by applying the current actor policy along multi-step trajectories.  Because it uses the same policy both to act and to estimate rewards, PPO needs to learn over more time steps to gain the same range of exploration as TD3.

TD3 also builds a replay buffer as it learns off-policy.  This makes it more sample-efficient and therefore a great choice when simulations (or real-world robots) are slower than the algorithm.

Off-policy methods tend to be less stable than on-policy methods, but TD3 has some tricks for reducing instability, which will be discussed below.

Check out [TD3notebook.ipynb](https://github.com/Quansight/Practical-RL/blob/main/TD3notebook.ipynb) - this is a direct translation from the author-provided `main.py`: all we've done is stashed the configuration variables into a dictionary, named `args`, and shoved all the code that would be executed normally into a function called `main()` so it can be called simply in the notebook.

In this notebook, let's walk through the code a bit.  To keep the notebook functional, we've removed the `main()` definition.  In general, the function is mostly concerned with setting values for variables to be used in the `for` loop, which is where the meat of the learning happens.  Let's look more closely...

In [None]:
import numpy as np
import torch
import gym
import pybullet_envs
import os
import sys
from pathlib import Path

We have the original TD3 algorithm as a python file in this repo, so we can import it as a submodule and use it in the algorithm below.

In [None]:
sys.path.append(str(Path().resolve().parent))
import utils
import TD3

## Evaluation
This [first function](https://github.com/sfujim/TD3/blob/master/main.py#L15) is used to evaluate the policy, either while the agent is learning or afterward when the model is fully trained.
- It first makes a new environment with a fixed random seed
- Then it loops through several evaluation episodes (no learning happens here) and adds up the reward earned in each one
- The average reward is calculated, printed to the screen, and returned to the calling function

In [None]:
# Runs policy for X episodes and returns average reward
# A fixed seed is used for the eval environment
def eval_policy(policy, env_name, seed, eval_episodes=10):
    eval_env = gym.make(env_name)
    eval_env.seed(seed + 100)

    avg_reward = 0.
    for _ in range(eval_episodes):
        state, done = eval_env.reset(), False
        while not done:
            action = policy.select_action(np.array(state))
            state, reward, done, _ = eval_env.step(action)
            avg_reward += reward

    avg_reward /= eval_episodes

    print("---------------------------------------")
    print(f"Evaluation over {eval_episodes} episodes: {avg_reward:.3f}")
    print("---------------------------------------")
    
    return avg_reward

## Variables and Initialization

This first part of the code is simply a dictionary of parameters to be specified for the modeling.

In [None]:
args = {
        "policy" : "TD3",                  # Policy name
        "env" : "AntBulletEnv-v0",         # OpenAI gym environment name
        "seed" : 0,                        # Sets Gym, PyTorch and Numpy seeds
        "start_timesteps" : 25e3,          # Time steps initial random policy is used
        "eval_freq" : 5e3,                 # How often (time steps) we evaluate
        "max_timesteps" : 2e6,             # Max time steps to run environment
        "expl_noise" : 0.1,                # Std of Gaussian exploration noise
        "batch_size" : 256,                # Batch size for both actor and critic
        "discount" : 0.99,                 # Discount factor
        "tau" : 0.005,                     # Target network update rate
        "policy_noise" : 0.2,              # Noise added to target policy during critic update
        "noise_clip" : 0.5,                # Range to clip target policy noise
        "policy_freq" : 2,                 # Frequency of delayed policy updates
        "save_model" : True,               # Save model and optimizer parameters (was the argparse flag "store_true")
        "load_model" : "",                 # Model load file name, "" doesn't load, "default" uses file_name
       }

Make a file name to keep track of the models we've made.

In [None]:
file_name = f"{args['policy']}_{args['env']}_{args['seed']}"

Make sure some subfolders are present to save the results and the model.

In [None]:
if not os.path.exists("./results"):
    os.makedirs("./results")

if args['save_model'] and not os.path.exists("./models"):
    os.makedirs("./models")

>**In the next cell, make the gym environment just like we did in the CartPole example.**  Use `args['env']` as the environment name and return the usual `env` object.

<details>
<summary>Click to reveal answer</summary>
env = gym.make(args['env'])
</details>
<br>

Set the random seeds for the environment, PyTorch, and NumPy.

In [None]:
env.seed(args['seed'])
env.action_space.seed(args['seed'])
torch.manual_seed(args['seed'])
np.random.seed(args['seed'])

We need the algorithm (TD3) to know some things about the environment, including the dimensions of the state and action spaces.  TD3 also needs to know the largest action value to expect.

>Try printing some of the following values to get a better understanding of what values are being passed to TD3 (**just print the kwargs dict**).  The state dimensions might be larger than you expected.  If you go to the walker base class for pybullet there is a `calc_state` function ([here](https://github.com/bulletphysics/bullet3/blob/a62fb187a5c83a2e1e3e0376565ab3ae47870465/examples/pybullet/gym/pybullet_envs/robot_locomotors.py#L35)).  See if you can find a few of the state variables.

In [None]:
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0] 
max_action = float(env.action_space.high[0])

In [None]:
kwargs = {
    "state_dim": state_dim,
    "action_dim": action_dim,
    "max_action": max_action,
    "discount": args['discount'],
    "tau": args['tau'],
}
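The `tau` value above sets how quickly TD3's target networks follow the networks being trained.  As a rough illustration, the sketch below applies that kind of soft (Polyak) update to plain NumPy arrays.  `soft_update` is a hypothetical helper written for this walkthrough, not a function from `TD3.py`, which does the same arithmetic on PyTorch parameters.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [tau * p + (1.0 - tau) * t
            for p, t in zip(online_params, target_params)]

# Toy parameters (made up): the target starts at 0, the online network sits at 1
target = [np.zeros(3)]
online = [np.ones(3)]

for _ in range(100):
    target = soft_update(target, online, tau=0.005)

# After n updates the remaining gap has shrunk by a factor of (1 - tau) ** n
remaining_gap = 1.0 - target[0]
```

With `tau = 0.005`, each update closes only half a percent of the gap, which keeps the targets used by the critics slow-moving.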

## TD3 Tricks

TD3 is an improvement upon DDPG.  Some folks refer to those improvements as "tricks" because they are fairly simple.

One way to improve exploration is to simply add noise to the actions during learning.  This ensures that the decisions made by the agent are not the same every time.  Even as the agent learns better actions, it will continue to try actions that are at least a little bit different from the known high-reward actions.

As you read on OpenAI Spinning Up, they list the three "tricks":
>**Trick One: Clipped Double-Q Learning**. TD3 learns two Q-functions instead of one (hence “twin”), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
>
>**Trick Two: “Delayed” Policy Updates**. TD3 updates the policy (and target networks) less frequently than the Q-function. The paper recommends one policy update for every two Q-function updates.
>
>**Trick Three: Target Policy Smoothing**. TD3 adds noise to the target action, to make it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.

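The three variables below control Tricks Two and Three, and the noise variables are scaled to the action space.  Trick One needs no setting of its own because it is built into the twin critics.

In [None]:
# Trick Three: range used to clip the target policy noise
kwargs["noise_clip"] = args['noise_clip'] * max_action
# Trick Two: number of Q-function updates per policy update
kwargs["policy_freq"] = args['policy_freq']
# Trick Three: std of the noise added to the target action
kwargs["policy_noise"] = args['policy_noise'] * max_action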
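To make the role of these settings more concrete, here is a minimal NumPy sketch of how a TD3-style critic target could be formed.  It is an illustration under assumptions, not the code in `TD3.py`: `td3_target`, `q1`, and `q2` are made-up names, and the two "critics" are toy functions rather than neural networks.  Note that both `policy_noise` and `noise_clip` feed target policy smoothing.

```python
import numpy as np

def td3_target(reward, not_done, next_state, next_action, q1_target, q2_target,
               discount=0.99, policy_noise=0.2, noise_clip=0.5,
               max_action=1.0, rng=None):
    """Sketch of a TD3-style critic target for a batch of transitions."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Trick Three: add clipped Gaussian noise to the target policy's action
    noise = rng.normal(0.0, policy_noise, size=next_action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    smoothed_action = np.clip(next_action + noise, -max_action, max_action)
    # Trick One: keep the smaller of the two target critics' estimates
    target_q = np.minimum(q1_target(next_state, smoothed_action),
                          q2_target(next_state, smoothed_action))
    # Bootstrapped target; not_done is 0 for true terminal transitions
    return reward + not_done * discount * target_q

# Toy stand-ins for the two target critics (hypothetical, for illustration only)
q1 = lambda s, a: 10.0 + a.sum(axis=1)
q2 = lambda s, a: 8.0 + a.sum(axis=1)

target = td3_target(
    reward=np.array([1.0]), not_done=np.array([1.0]),
    next_state=np.zeros((1, 3)), next_action=np.array([[0.5, -0.5]]),
    q1_target=q1, q2_target=q2,
)

# Trick Two: with policy_freq = 2, the actor is updated on every second critic update
policy_freq = 2
actor_update_steps = [it for it in range(1, 7) if it % policy_freq == 0]
```

Because the more pessimistic critic always wins the `np.minimum`, an overestimate in one critic cannot inflate the target on its own.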

>**In your own words, write a description of each of the tricks, stating clearly why they help learning.**  Feel free to review the Spinning Up descriptions and the TD3 paper.  Some explanation is given in this notebook too.  We will ask three of you to describe one of the tricks.  As we discuss them, feel free to update your description.

- Trick One: (type answer here)
- Trick Two: (type answer here)
- Trick Three: (type answer here)

Initialize the TD3 policy.

>**But first, go back to the CartPole example (Day1, Part A) and find the cell where we created an instance of the PPO algorithm.  What name did we give PPO in that case?  What name does the author of TD3 give below?**

<details>
<summary>Click to reveal answer</summary>
For CartPole, we followed OpenAI's convention of naming the algorithm "model", but here, TD3 is given the name "policy".  This kind of inconsistency in terminology is common in RL, so keep in mind that "model" and "policy" are equivalent between these two examples.  You might see "agent" or "actor" used in other code as well.
</details>
<br>

In [None]:
policy = TD3.TD3(**kwargs)

This cell just loads a previous model or starts a new one.

In [None]:
if args['load_model'] != "":
    policy_file = file_name if args['load_model'] == "default" else args['load_model']
    policy.load(f"./models/{policy_file}")

## Experience Replay Buffer

This buffer is what keeps track of past experiences.  The algorithm will sample from this buffer to estimate the value of the agent's next action.  The buffer does not keep all experiences, but ideally it keeps a representative range of them.

The experiences are state transitions tied to actions and rewards.  

>**Look at the file `utils.py` to see what else is stored in the buffer.  Describe the values you find by listing them here.**

- (type answer here)
- (type answer here)
- (type answer here)
- (type answer here)
- (type answer here)

In [None]:
replay_buffer = utils.ReplayBuffer(state_dim, action_dim)
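If it helps to picture what a buffer like this does, the sketch below is a stripped-down, hypothetical ring buffer.  `TinyReplayBuffer` is invented for this walkthrough and is not the `utils.ReplayBuffer` class; it only assumes the same `add(state, action, next_state, reward, done)` call order used later in this notebook and shows how old experiences get overwritten once the buffer is full.

```python
import numpy as np

class TinyReplayBuffer:
    """Hypothetical fixed-size ring buffer, for illustration only."""

    def __init__(self, state_dim, action_dim, max_size=1000, seed=0):
        self.max_size = max_size
        self.ptr = 0    # slot the next transition will be written to
        self.size = 0   # number of valid transitions currently stored
        self.state = np.zeros((max_size, state_dim))
        self.action = np.zeros((max_size, action_dim))
        self.next_state = np.zeros((max_size, state_dim))
        self.reward = np.zeros((max_size, 1))
        self.done = np.zeros((max_size, 1))
        self.rng = np.random.default_rng(seed)

    def add(self, state, action, next_state, reward, done):
        i = self.ptr
        self.state[i] = state
        self.action[i] = action
        self.next_state[i] = next_state
        self.reward[i] = reward
        self.done[i] = done
        # Wrap around so the oldest transition is overwritten when full
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size):
        # Uniform random minibatch drawn from the valid part of the buffer
        idx = self.rng.integers(0, self.size, size=batch_size)
        return (self.state[idx], self.action[idx], self.next_state[idx],
                self.reward[idx], self.done[idx])

# Add five toy transitions to a buffer that only holds three
buf = TinyReplayBuffer(state_dim=2, action_dim=1, max_size=3)
for i in range(5):
    buf.add(np.full(2, i), [0.0], np.full(2, i + 1), float(i), 0.0)

batch = buf.sample(batch_size=4)
```

Sampling uniformly from a mix of old and new transitions is what lets an off-policy method reuse each environment step many times.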

## Learning over many episodes

Scan through the code in the next cell, then keep reading to learn about parts of the code.

In [None]:
# Evaluate untrained policy and save as the first one in a sequence of trained policies
evaluations = [eval_policy(policy, args['env'], args['seed'])]

state, done = env.reset(), False
episode_reward = 0
episode_timesteps = 0
episode_num = 0

for t in range(int(args['max_timesteps'])):

    episode_timesteps += 1

    # Select action randomly or according to policy
    if t < args['start_timesteps']:
        action = env.action_space.sample()
    else:
        action = (
            policy.select_action(np.array(state))
            + np.random.normal(0, max_action * args['expl_noise'], size=action_dim)
        ).clip(-max_action, max_action)

    # Perform action
    next_state, reward, done, _ = env.step(action) 
    done_bool = float(done) if episode_timesteps < env._max_episode_steps else 0

    # Store data in replay buffer
    replay_buffer.add(state, action, next_state, reward, done_bool)

    state = next_state
    episode_reward += reward

    # Train agent after collecting sufficient data
    if t >= args['start_timesteps']:
        policy.train(replay_buffer, args['batch_size'])

    if done: 
        # +1 to account for 0 indexing. +0 on ep_timesteps since it will increment +1 even if done=True
        print(f"Total T: {t+1} Episode Num: {episode_num+1} Episode T: {episode_timesteps} Reward: {episode_reward:.3f}")
        
        # Reset environment
        state, done = env.reset(), False
        episode_reward = 0
        episode_timesteps = 0
        episode_num += 1 

    # Evaluate episode
    if (t + 1) % args['eval_freq'] == 0:
        evaluations.append(eval_policy(policy, args['env'], args['seed']))
        np.save(f"./results/{file_name}", evaluations)
        if args['save_model']: 
            policy.save(f"./models/{file_name}")

In the following section, note that for the first `start_timesteps` number of time steps, the action is simply filled from random sampling of possible choices; this helps fill the replay buffer and give a baseline before actual policy choices are made.

```python     
    if t < args['start_timesteps']:
        action = env.action_space.sample()
    else:
        action = (
            policy.select_action(np.array(state))
            + np.random.normal(0, max_action * args['expl_noise'], size=action_dim)
        ).clip(-max_action, max_action)
```

The bulk of the actual training happens in only a few lines.  The section below takes the selected action from above, applies it to the environment, and returns the new environment state, including the reward and a done flag.  It then checks whether the episode ended only because it reached the environment's maximum number of time steps; if so, `done_bool` is set to 0 so that the time-limit cutoff is not stored as a true terminal state.

```python
    next_state, reward, done, _ = env.step(action) 
    done_bool = float(done) if episode_timesteps < env._max_episode_steps else 0
```
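A small worked example may show why `done_bool` matters.  The sketch below assumes the usual one-step bootstrapped target, `reward + (1 - done) * discount * next_q`.  The helper names and all the numbers (episode lengths, the next-state value of 50) are made up for illustration.

```python
def done_flag(done, episode_timesteps, max_episode_steps):
    """Same rule as the done_bool line: time-limit endings are not terminal."""
    return float(done) if episode_timesteps < max_episode_steps else 0.0

def bootstrapped_target(reward, next_q, done_bool, discount=0.99):
    """One-step target: future value is dropped only for true terminal states."""
    return reward + (1.0 - done_bool) * discount * next_q

# Case 1: the episode ends early (for example, the robot falls over)
fell_over = done_flag(done=True, episode_timesteps=200, max_episode_steps=1000)
# Case 2: the episode ends only because the time limit was reached
timed_out = done_flag(done=True, episode_timesteps=1000, max_episode_steps=1000)

target_fell = bootstrapped_target(reward=1.0, next_q=50.0, done_bool=fell_over)
target_timeout = bootstrapped_target(reward=1.0, next_q=50.0, done_bool=timed_out)
```

In the first case the target is just the reward, because nothing follows a real failure.  In the second case the estimated future value is kept, since the agent could have carried on if the clock had not run out.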

The outcome of the time step is saved to the experience replay buffer.

```python
    replay_buffer.add(state, action, next_state, reward, done_bool)
```
Then the code updates the state, saves the reward, and, if the replay buffer has received enough baseline values, trains the policy.  At this point, the ant will explore the environment by trying to move its legs such that it receives high rewards.

```python
    state = next_state
    episode_reward += reward

    # Train agent after collecting sufficient data
    if t >= args['start_timesteps']:
        policy.train(replay_buffer, args['batch_size'])
```

Once the environment reaches the described `done` state, the environment and some variables are reset.

```python
    if done: 
        # +1 to account for 0 indexing. +0 on ep_timesteps since it will increment +1 even if done=True
        print(f"Total T: {t+1} Episode Num: {episode_num+1} Episode T: {episode_timesteps} Reward: {episode_reward:.3f}")
        
        # Reset environment
        state, done = env.reset(), False
        episode_reward = 0
        episode_timesteps = 0
        episode_num += 1 
```

Every `eval_freq` time steps, whether or not an episode has just ended, the policy is evaluated over a number of episodes outside the training process, and the current policy is saved for good measure.

```python
    # Evaluate episode
    if (t + 1) % args['eval_freq'] == 0:
        evaluations.append(eval_policy(policy, args['env'], args['seed']))
        np.save(f"./results/{file_name}", evaluations)
        if args['save_model']: 
            policy.save(f"./models/{file_name}")
```
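Since the evaluation scores are saved with `np.save`, they can later be turned into a learning curve.  The sketch below pairs each score with the time step it was recorded at, assuming (as in the loop above) that the first entry is the untrained policy and the rest follow every `eval_freq` steps.  `learning_curve` is a hypothetical helper, and the example scores are invented.

```python
import numpy as np

def learning_curve(evaluations, eval_freq):
    """Pair each saved evaluation score with the training step it was taken at."""
    scores = np.asarray(evaluations, dtype=float)
    # Entry 0 is the untrained policy; later entries follow every eval_freq steps
    steps = np.arange(len(scores)) * int(eval_freq)
    return steps, scores

# Invented scores standing in for the contents of a saved results file
example_scores = [350.0, 420.0, 610.0]
steps, scores = learning_curve(example_scores, eval_freq=5e3)
```

After a real run, the same function could be fed `np.load(f"./results/{file_name}.npy")`, since `np.save` appends the `.npy` extension to the file name.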

That's it.  It's nice having all the complicated heavy lifting already coded for us.  

If you run the notebook as is, it will train for two million time steps with all the standard hyperparameters the TD3 authors set up, and out will pop a policy that allows the robot ant to sprint like the animation below.

In [None]:
import IPython.display as ipd  # ipd is not imported earlier in this notebook
ipd.Image("../animations/base_ant.png")