# Policy Gradients

Prerequisites:
- [Value based methods](https://theaicore.com/app/training/intro-to-rl)
- [Neural Networks](https://theaicore.com/app/training/neural-networks)

Previously we looked at value-based methods: those that estimate how good certain states are (the value function) and how good certain actions are from certain states (the action-value, or Q, function).

In this notebook we'll look at policy gradient methods, which adjust the parameters of the policy directly.

## What's the goal of reinforcement learning?

The goal of a reinforcement learning agent is to maximise expected reward over its lifetime.
What the agent experiences over its lifetime, including rewards, states and actions, defines its *trajectory*.
The trajectories that an agent might experience depend on what actions it takes from any given state, that is, on what policy the agent follows.

We can formulate this as below.

![](./images/policy-gradient-objective.jpg)

Where the policy is a function with parameters $\theta$.
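
In symbols, a common way to write this objective is shown below. The notation is an assumption and may differ from the figure above: $\tau$ is a trajectory, $p_\theta(\tau)$ is the probability of that trajectory when following the policy with parameters $\theta$, and $R(\tau)$ is the total reward collected along it.

$$
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau) \right] = \int p_\theta(\tau)\, R(\tau)\, d\tau,
\qquad R(\tau) = \sum_{t=0}^{T} r_t
$$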

What we'd like to do is find parameters that maximise this objective, $J$, and hence find an optimal parameterisation for our policy.

Because the objective is fully differentiable, we can use gradient **ascent** to improve our objective with respect to our parameters.

Below we analytically derive the gradient of the objective with respect to the parameters.

![](./images/policy-gradient-derivation.jpg)
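
For reference, here is a sketch of the usual derivation in LaTeX. It assumes the standard notation (trajectory $\tau$, trajectory probability $p_\theta(\tau)$, total reward $R(\tau)$, policy $\pi_\theta$), which may not match the figure exactly. The key step is the identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$.

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
 = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau \\
&= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
 = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) R(\tau) \right]
\end{aligned}
$$

The last line follows because $\log p_\theta(\tau) = \log p(s_0) + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log p(s_{t+1} \mid s_t, a_t)$, and only the policy terms depend on $\theta$.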

Now we can use this derivative in our gradient ascent update rule to adjust the weights in a direction that should increase the objective.
Note the update is in the direction of the gradient because this is gradient ascent, not descent.
That's because the objective represents our expected reward, which we want to maximise, rather than a loss, which we would want to minimise.

![](./images/policy-gradient-update.jpg)
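
Written out, the update typically looks like the following. The symbols are assumptions that may differ from the figure: $\alpha$ is the learning rate, $N$ is the number of sampled episodes per update, and the superscript $i$ indexes those episodes. The expectation in the gradient is replaced by an average over sampled trajectories.

$$
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta),
\qquad
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \right) R(\tau^i)
$$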

This algorithm is called REINFORCE (REward Increment = Nonnegative Factor $\times$ Offset Reinforcement $\times$ Characteristic Eligibility). This name describes the structure of the parameter updates. But don't worry about the acronym.
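
To make the idea concrete, here is a minimal sketch on a made-up two-armed bandit (not one of the environments in this notebook). It checks that the "log probability times reward" form of the gradient agrees with differentiating the expected reward directly, and that one ascent step increases the objective. The bandit, the names `rewards`, `theta`, `objective` and the step size are all assumptions chosen for illustration.

```python
import torch

# Hypothetical two-armed bandit: arm 0 pays 0, arm 1 pays 1.
rewards = torch.tensor([0.0, 1.0])
theta = torch.zeros(2, requires_grad=True)  # policy logits, one per arm


def objective(params):
    """Exact expected reward J = sum_a pi(a) * r(a) for a softmax policy."""
    return (torch.softmax(params, dim=0) * rewards).sum()


# Gradient of J obtained by differentiating the expectation directly.
direct_grad, = torch.autograd.grad(objective(theta), theta)

# The same gradient in score-function form: E_a[ grad log pi(a) * r(a) ].
probs = torch.softmax(theta, dim=0).detach()
score_grad = torch.zeros(2)
for a in range(2):
    log_prob = torch.log_softmax(theta, dim=0)[a]
    grad_log_prob, = torch.autograd.grad(log_prob, theta)
    score_grad += probs[a] * grad_log_prob * rewards[a]

# One gradient ASCENT step: move the parameters along the gradient.
step_size = 0.5
with torch.no_grad():
    new_theta = theta + step_size * score_grad

print(direct_grad, score_grad)
print(objective(theta).item(), objective(new_theta).item())
```

In the real algorithm we can't sum over every possible trajectory, so the expectation is estimated from sampled episodes instead, which is what the training loop later in this notebook does.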

Let's build a neural network which will act as our agent's policy.

In [15]:
import torch

class NN(torch.nn.Module):
    def __init__(self, layers, embedding=False, distribution=False):
        # layers: list of layer sizes, e.g. [n_inputs, 32, n_actions]
        # embedding: not used in this notebook
        super().__init__()
        l = []
        for idx in range(len(layers) - 1):
            l.append(torch.nn.Linear(layers[idx], layers[idx+1]))   # add a linear layer
            if idx + 1 != len(layers) - 1: # no activation after the final linear layer
                l.append(torch.nn.ReLU())   # activate
        if distribution:    # if a probability distribution output is required
            l.append(torch.nn.Softmax(dim=-1))    # apply softmax over the last (action) dimension

        self.layers = torch.nn.Sequential(*l) # unpack layers & turn into a function which applies them sequentially

    def forward(self, x):
        return self.layers(x)
    


Let's use this neural network to model the policy which will control our agent in Griddy!
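
Before wiring the policy into a training loop, it can help to see what a single forward pass and sample look like. The sketch below uses a small stand-in network built with `torch.nn.Sequential` so that it runs on its own; the sizes (4 inputs, 3 actions) and the name `policy_sketch` are made up for illustration and are not Griddy's real dimensions.

```python
import torch

torch.manual_seed(0)

# Stand-in policy: 4 observation features -> 32 hidden units -> 3 actions.
policy_sketch = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 3),
    torch.nn.Softmax(dim=-1),  # turn the outputs into action probabilities
)

state = torch.rand(4)                    # a fake flattened observation
action_probs = policy_sketch(state)      # probabilities over the 3 actions

dist = torch.distributions.Categorical(probs=action_probs)
action = dist.sample()                   # sampled action index (a tensor)
log_prob = dist.log_prob(action)         # log pi(action | state), differentiable

print(action_probs, int(action), log_prob)
```

The log probability keeps its connection to the network's parameters, which is what lets us backpropagate through it during training.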

In [16]:
import gym
from time import sleep
import matplotlib.pyplot as plt
from torch.utils.tensorboard import SummaryWriter
import numpy as np
from GriddyEnv import GriddyEnv # get griddy

In [17]:
def train(env, optimiser, agent_tag, epochs=100, episodes=30, use_baseline=False, use_causality=False):
    # NOTE: this function uses the global `policy` and `writer` defined further down in this cell
    assert not (use_baseline and use_causality)   # can't implement both simply
    baseline = 0
    try:
        for epoch in range(epochs):
            avg_reward = 0
            objective = 0
            for episode in range(episodes):
                done = False
                state = env.reset()
                log_policy = []

                rewards = []

                step = 0

                # RUN AN EPISODE
                while not done:     # while the episode is not terminated
                    state = torch.Tensor(state)     # correct data type for passing to model
                    # print('STATE:', state)
                    state = state.view(np.prod(state.shape))

                    action_distribution = policy(state)     # get a distribution over actions from the policy given the state
                    # print('ACTION DISTRIBUTION:', action_distribution)

                    action = torch.distributions.Categorical(action_distribution).sample()      # sample from that distribution
                    action = int(action)
                    # print('ACTION:', action)

                    new_state, reward, done, info = env.step(action)    # take timestep

                    rewards.append(reward)

                    state = new_state
                    log_policy.append(torch.log(action_distribution[action]))

                    step += 1
                    if done:
                        break
                    if step > 10000000:
                        # break
                        pass

                avg_reward += ( sum(rewards) - avg_reward ) / ( episode + 1 )   # accumulate avg reward
                writer.add_scalar(f'{agent_tag}/Reward/Train', avg_reward, epoch*episodes + episode)     # plot the latest reward

                # update baseline
                if use_baseline:
                    baseline += ( sum(rewards) - baseline ) / (epoch*episodes + episode + 1)    # accumulate average return  

                for idx in range(len(rewards)):     # for each timestep experienced in the episode
                    # add causality
                    if use_causality:
                        weight = sum(rewards[idx:])     # only weight the log likelihood of this action by the future rewards, not the total
                    else:
                        weight = sum(rewards) - baseline           # weight by the total reward from this episode (baseline stays 0 unless use_baseline is set)
                    objective += log_policy[idx] * weight   # add the weighted log likelihood of taking this action to the objective


            objective /= episodes   # average over episodes
            objective *= -1     # negate: the optimiser minimises, so minimising -objective performs gradient ascent on the objective


            # UPDATE POLICY
            # print('updating policy')
            print('EPOCH:', epoch, f'AVG REWARD: {avg_reward:.2f}')
            objective.backward()    # backprop
            optimiser.step()    # update params
            optimiser.zero_grad()   # reset gradients to zero

            # VISUALISE AT END OF EPOCH AFTER UPDATING POLICY
            state = env.reset()
            done = False
            while not done:
                env.render()
                state = torch.Tensor(state)
                state = state.view(np.prod(state.shape))
                action_distribution = policy(state)
                action = torch.distributions.Categorical(action_distribution).sample()
                action = int(action)
                state, reward, done, info = env.step(action)
                sleep(0.01)
    except KeyboardInterrupt:
        print('interrupted')
        env.close()

    env.close()
    checkpoint = {
        'model': policy,
        'state_dict': policy.state_dict() 
    }
    torch.save(checkpoint, f'agents/trained-agent-{agent_tag}.pt')


writer = SummaryWriter()

env = GriddyEnv(time_penalty=True)

policy = NN([np.prod(env.observation_space.shape), 32, env.action_space.n], distribution=True)

lr = 0.001
weight_decay = 1
optimiser = torch.optim.SGD(policy.parameters(), lr=lr, weight_decay=weight_decay)
agent_tag = 'griddy'

train(
    env,
    optimiser,
    agent_tag,
    use_baseline=True,
    use_causality=False,
    epochs=30,
    episodes=30
)

EPOCH: 0 AVG REWARD: -59.43
EPOCH: 1 AVG REWARD: -24.87
EPOCH: 2 AVG REWARD: -16.47
EPOCH: 3 AVG REWARD: -15.67
EPOCH: 4 AVG REWARD: -18.03
EPOCH: 5 AVG REWARD: -9.80
EPOCH: 6 AVG REWARD: -14.90
EPOCH: 7 AVG REWARD: -13.97
EPOCH: 8 AVG REWARD: -16.97
EPOCH: 9 AVG REWARD: -11.63
EPOCH: 10 AVG REWARD: -12.10
EPOCH: 11 AVG REWARD: -12.50
EPOCH: 12 AVG REWARD: -12.57
EPOCH: 13 AVG REWARD: -13.97
EPOCH: 14 AVG REWARD: -8.73
EPOCH: 15 AVG REWARD: -8.33
EPOCH: 16 AVG REWARD: -6.90
EPOCH: 17 AVG REWARD: -8.50
EPOCH: 18 AVG REWARD: -13.33
EPOCH: 19 AVG REWARD: -9.03
EPOCH: 20 AVG REWARD: -8.83
EPOCH: 21 AVG REWARD: -8.60
EPOCH: 22 AVG REWARD: -10.07
EPOCH: 23 AVG REWARD: -8.13
EPOCH: 24 AVG REWARD: -9.80
EPOCH: 25 AVG REWARD: -7.53
EPOCH: 26 AVG REWARD: -7.13
EPOCH: 27 AVG REWARD: -7.27
EPOCH: 28 AVG REWARD: -7.20
EPOCH: 29 AVG REWARD: -5.70


# Can we solve a harder challenge?

Now let's move onto a more challenging environment called CartPole. The aim is to have a cart move along one dimension and balance an inverted pole vertically. Note that because we set up the size of the policy network programmatically, we can use exactly the same training loop! This is the first time we've used the same learning algorithm to solve different environments. 

In [18]:
writer = SummaryWriter() # create new tensorboard writer

env = gym.make('CartPole-v0') # make cartpole environment

policy = NN([np.prod(env.observation_space.shape), 32, env.action_space.n], distribution=True) 

lr = 0.001
weight_decay = 1
optimiser = torch.optim.SGD(policy.parameters(), lr=lr, weight_decay=weight_decay)
agent_tag = 'cartpole'

train(
    env,
    optimiser,
    agent_tag,
    use_baseline=True,
    use_causality=False,
    epochs=10,
    episodes=30
)


EPOCH: 0 AVG REWARD: 22.77
EPOCH: 1 AVG REWARD: 19.73
EPOCH: 2 AVG REWARD: 21.67
EPOCH: 3 AVG REWARD: 27.17
EPOCH: 4 AVG REWARD: 25.23
EPOCH: 5 AVG REWARD: 24.13
EPOCH: 6 AVG REWARD: 25.97
EPOCH: 7 AVG REWARD: 26.87
EPOCH: 8 AVG REWARD: 26.53
EPOCH: 9 AVG REWARD: 27.90


## How can we improve this learning algorithm?

What we just implemented is the vanilla policy gradient algorithm.

Notice anything that you think could be improved?

What are we weighting the likelihood of taking each trajectory by?

### Baselines

What if all of the trajectories receive a similar total reward (e.g. "bad" trajectories give 99 total reward and "good" trajectories give 100 total reward)?
In this case the likelihood of every sampled trajectory will be increased by almost the same amount, so the update barely distinguishes good trajectories from bad ones.

What if the total reward over every trajectory is negative (e.g. negative reward every timestep and zero upon reaching a terminal state)?
In this case the log probability of ANY trajectory taken will be reduced by the updates.

We don't want any of these things to happen, so we can introduce baselines. Baselines help you to see the reward from the current trajectory in the context of the reward from each of the others, giving you a relative measure of how good they were, not just an absolute measure.

![](./images/policy-gradient-baseline.jpg)

A pretty standard baseline to use is the average baseline. But there are others.

![](./images/policy-gradient-average-baseline.jpg)

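Even though we adjust the objective, the gradient estimate remains unbiased as long as the baseline does not depend on the actions taken: the extra term, the baseline multiplied by the gradient of the log probability of the trajectory, has an expectation of zero.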
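
We can check this claim numerically on a toy problem. The sketch below uses a made-up three-action bandit with the 99 vs 100 rewards from the example above, and computes the expected gradient exactly (by summing over actions) with and without a baseline. The bandit, the logits in `theta` and the choice of the expected reward as the baseline are all assumptions for illustration; the training loop in this notebook uses a running average of episode returns instead.

```python
import torch

# Hypothetical returns for three actions: one "bad" (99) and two "good" (100).
rewards = torch.tensor([99.0, 100.0, 100.0], dtype=torch.float64)
theta = torch.tensor([0.2, -0.1, 0.4], dtype=torch.float64, requires_grad=True)

probs = torch.softmax(theta, dim=0).detach()
baseline = (probs * rewards).sum()  # expected reward under the current policy


def expected_gradient(weights):
    """Exact E_a[ grad log pi(a) * weight(a) ] for the softmax policy."""
    total = torch.zeros_like(probs)
    for a in range(len(weights)):
        log_prob = torch.log_softmax(theta, dim=0)[a]
        grad_log_prob, = torch.autograd.grad(log_prob, theta)
        total += probs[a] * grad_log_prob * weights[a]
    return total


grad_plain = expected_gradient(rewards)                 # weights are all ~100
grad_baselined = expected_gradient(rewards - baseline)  # weights have mixed signs

print(rewards - baseline)
print(grad_plain, grad_baselined)
```

The two expected gradients agree, but with the baseline the per-sample weights are small and have different signs for good and bad actions, so individual samples give a much less noisy signal.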


### Causality
What rewards can an action take responsibility for?
Surely a reward received before an action taken later in the trajectory shouldn't indicate that the later action was good. Whatever the action taken later in time was, this reward was received before then.

To account for this, we should only weight how good the action was by the rewards which it led the agent to receive from that point in time onwards. The rewards attained before the action was taken are not eligible for making that action more likely; that action was taken after the reward was received, so it can't be accountable for it.

![](./images/policy-gradient-causality.jpg)
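
The "rewards from that point onwards" are often called the reward-to-go. A minimal sketch of computing them for one episode is below; the helper name `rewards_to_go` and the example rewards are made up for illustration. It gives the same values as `sum(rewards[idx:])` for every `idx`, but in a single backwards pass.

```python
def rewards_to_go(rewards):
    """Return a list whose entry t is the sum of rewards from timestep t onwards."""
    running_total = 0.0
    result = []
    for reward in reversed(rewards):    # walk backwards from the final timestep
        running_total += reward
        result.append(running_total)
    result.reverse()                    # put the entries back in time order
    return result


episode_rewards = [-1.0, -1.0, -1.0, 10.0]  # hypothetical episode
weights = rewards_to_go(episode_rewards)
print(weights)  # [7.0, 8.0, 9.0, 10.0]
```

The first action is weighted by everything that followed it, while the last action is only weighted by the final reward.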

### Can we combine both?
To combine both, we would need a baseline for each point in time: how much reward can I expect to get from timestep t onwards? For games without a fixed episode length, or even worse games with an infinite horizon, the number of timesteps that need a baseline is not known in advance. This makes combining causality and baselines difficult, but we'll see how to achieve this in a later notebook, combining policy gradients with some things we've learnt in previous notebooks. (hint: what function represents the same thing as a lookup table for baselines?)
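
To see why this gets awkward, here is one naive way to build such a lookup table from a batch of episodes: average the reward-to-go at each timestep over the episodes that lasted that long. This is only an illustrative sketch with hypothetical helper names and data, not the approach taken in the later notebook.

```python
def rewards_to_go(rewards):
    """Entry t is the sum of rewards from timestep t onwards."""
    return [sum(rewards[t:]) for t in range(len(rewards))]


def timestep_baselines(batch_of_rewards):
    """Average reward-to-go at each timestep, over episodes that reached it."""
    longest = max(len(rewards) for rewards in batch_of_rewards)
    totals = [0.0] * longest
    counts = [0] * longest
    for rewards in batch_of_rewards:
        for t, value in enumerate(rewards_to_go(rewards)):
            totals[t] += value
            counts[t] += 1
    return [total / count for total, count in zip(totals, counts)]


# Two hypothetical episodes of different lengths, reward 1 per timestep.
batch = [[1.0, 1.0, 1.0], [1.0, 1.0]]
baselines = timestep_baselines(batch)
print(baselines)  # [2.5, 1.5, 1.0]
```

Notice that later timesteps are averaged over fewer episodes, so their baselines are noisier, and a timestep never seen in the batch has no baseline at all.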

Let's now add options to our training function for our algorithm to use baselines and causality. 

In [19]:
def train(env, optimiser, agent_tag, epochs=100, episodes=30, use_baseline=False, use_causality=False):
    # NOTE: this function uses the global `policy` and `writer` defined further down in this cell
    assert not (use_baseline and use_causality)   # can't implement both simply
    baseline = 0
    try:
        for epoch in range(epochs):
            avg_reward = 0
            objective = 0
            for episode in range(episodes):
                done = False
                state = env.reset()
                log_policy = []

                rewards = []

                step = 0

                # RUN AN EPISODE
                while not done:     # while the episode is not terminated
                    state = torch.Tensor(state)     # correct data type for passing to model
                    # print('STATE:', state)
                    state = state.view(np.prod(state.shape))

                    action_distribution = policy(state)     # get a distribution over actions from the policy given the state
                    # print('ACTION DISTRIBUTION:', action_distribution)

                    action = torch.distributions.Categorical(action_distribution).sample()      # sample from that distribution
                    action = int(action)
                    # print('ACTION:', action)

                    new_state, reward, done, info = env.step(action)    # take timestep

                    rewards.append(reward)

                    state = new_state
                    log_policy.append(torch.log(action_distribution[action]))

                    step += 1
                    if done:
                        break
                    if step > 10000000:
                        # break
                        pass

                avg_reward += ( sum(rewards) - avg_reward ) / ( episode + 1 )   # accumulate avg reward
                writer.add_scalar(f'{agent_tag}/Reward/Train', avg_reward, epoch*episodes + episode)     # plot the latest reward

                # update baseline
                if use_baseline:
                    baseline += ( sum(rewards) - baseline ) / (epoch*episodes + episode + 1)    # accumulate average return  

                for idx in range(len(rewards)):     # for each timestep experienced in the episode
                    # add causality
                    if use_causality:
                        weight = sum(rewards[idx:])     # only weight the log likelihood of this action by the future rewards, not the total
                    else:
                        weight = sum(rewards) - baseline           # weight by the total reward from this episode (baseline stays 0 unless use_baseline is set)
                    objective += log_policy[idx] * weight   # add the weighted log likelihood of taking this action to the objective


            objective /= episodes   # average over episodes
            objective *= -1     # negate: the optimiser minimises, so minimising -objective performs gradient ascent on the objective


            # UPDATE POLICY
            # print('updating policy')
            print('EPOCH:', epoch, f'AVG REWARD: {avg_reward:.2f}')
            objective.backward()    # backprop
            optimiser.step()    # update params
            optimiser.zero_grad()   # reset gradients to zero

            # VISUALISE AT END OF EPOCH AFTER UPDATING POLICY
            state = env.reset()
            done = False
            while not done:
                env.render()
                state = torch.Tensor(state)
                state = state.view(np.prod(state.shape))
                action_distribution = policy(state)
                action = torch.distributions.Categorical(action_distribution).sample()
                action = int(action)
                state, reward, done, info = env.step(action)
                sleep(0.01)
    except KeyboardInterrupt:
        print('interrupted')
        env.close()

    env.close()
    checkpoint = {
        'model': policy,
        'state_dict': policy.state_dict() 
    }
    torch.save(checkpoint, f'agents/trained-agent-{agent_tag}.pt')


writer = SummaryWriter()

env = gym.make('CartPole-v0')

policy = NN([np.prod(env.observation_space.shape), 32, env.action_space.n], distribution=True)

lr = 0.001
weight_decay = 1
optimiser = torch.optim.SGD(policy.parameters(), lr=lr, weight_decay=weight_decay)
agent_tag = 'cartpole-improved'

train(
    env,
    optimiser,
    agent_tag,
    use_baseline=True,
    use_causality=False,
    epochs=30,
    episodes=30
)

EPOCH: 0 AVG REWARD: 19.57
EPOCH: 1 AVG REWARD: 21.50
EPOCH: 2 AVG REWARD: 21.37
EPOCH: 3 AVG REWARD: 19.23
EPOCH: 4 AVG REWARD: 24.43
EPOCH: 5 AVG REWARD: 22.63
EPOCH: 6 AVG REWARD: 21.40
EPOCH: 7 AVG REWARD: 23.53
EPOCH: 8 AVG REWARD: 20.53
EPOCH: 9 AVG REWARD: 21.83
EPOCH: 10 AVG REWARD: 25.43
EPOCH: 11 AVG REWARD: 21.90
EPOCH: 12 AVG REWARD: 24.47
EPOCH: 13 AVG REWARD: 20.33
EPOCH: 14 AVG REWARD: 21.93
EPOCH: 15 AVG REWARD: 23.20
EPOCH: 16 AVG REWARD: 27.63
EPOCH: 17 AVG REWARD: 25.70
EPOCH: 18 AVG REWARD: 24.17
EPOCH: 19 AVG REWARD: 29.67
EPOCH: 20 AVG REWARD: 27.80
EPOCH: 21 AVG REWARD: 30.17
EPOCH: 22 AVG REWARD: 30.27
EPOCH: 23 AVG REWARD: 36.73
EPOCH: 24 AVG REWARD: 33.40
EPOCH: 25 AVG REWARD: 30.27
EPOCH: 26 AVG REWARD: 33.27
EPOCH: 27 AVG REWARD: 37.60
EPOCH: 28 AVG REWARD: 41.77
EPOCH: 29 AVG REWARD: 41.60


# How far can we push this algorithm?

Let's try yet another environment, where our agent has to control a spacecraft.

In [20]:
writer = SummaryWriter() # create new tensorboard writer

env = gym.make('LunarLander-v2') # make lunar lander environment

policy = NN([np.prod(env.observation_space.shape), 32, env.action_space.n], distribution=True) 

lr = 0.001
weight_decay = 1
optimiser = torch.optim.SGD(policy.parameters(), lr=lr, weight_decay=weight_decay)
agent_tag = 'lunar-lander'

train(
    env,
    optimiser,
    agent_tag,
    use_baseline=True,
    use_causality=False,
    epochs=100,
    episodes=30
)

EPOCH: 0 AVG REWARD: -219.82
EPOCH: 1 AVG REWARD: -211.72
EPOCH: 2 AVG REWARD: -273.54
EPOCH: 3 AVG REWARD: -154.07
EPOCH: 4 AVG REWARD: -114.12
EPOCH: 5 AVG REWARD: -147.29
EPOCH: 6 AVG REWARD: -136.77
EPOCH: 7 AVG REWARD: -143.92
EPOCH: 8 AVG REWARD: -130.39
EPOCH: 9 AVG REWARD: -134.38
EPOCH: 10 AVG REWARD: -120.44
EPOCH: 11 AVG REWARD: -114.12
EPOCH: 12 AVG REWARD: -127.83
EPOCH: 13 AVG REWARD: -131.89
EPOCH: 14 AVG REWARD: -124.84
EPOCH: 15 AVG REWARD: -129.16
EPOCH: 16 AVG REWARD: -123.58
EPOCH: 17 AVG REWARD: -132.09
EPOCH: 18 AVG REWARD: -126.34
EPOCH: 19 AVG REWARD: -136.73
EPOCH: 20 AVG REWARD: -134.17
EPOCH: 21 AVG REWARD: -125.20
EPOCH: 22 AVG REWARD: -140.15
EPOCH: 23 AVG REWARD: -130.41
EPOCH: 24 AVG REWARD: -134.21
EPOCH: 25 AVG REWARD: -130.60
EPOCH: 26 AVG REWARD: -120.91
EPOCH: 27 AVG REWARD: -122.27
EPOCH: 28 AVG REWARD: -125.91
EPOCH: 29 AVG REWARD: -132.34
EPOCH: 30 AVG REWARD: -128.91
EPOCH: 31 AVG REWARD: -120.35
EPOCH: 32 AVG REWARD: -120.66
EPOCH: 33 AVG REWARD

## REINFORCE Algorithm summary

### Online or offline?
REINFORCE first collects whole episodes of experience and only then updates the policy, rather than updating after every timestep. So REINFORCE is offline.

### Model based or model free?
There's no mention of a transition function in this algorithm, so it's model free!

### On-policy or off-policy?
The gradient signal comes from rewards obtained on a trajectory that was produced by following the current policy. So REINFORCE is on-policy.

# Deploying our trained agents

The goal of all of this has been to produce agents that are ready to go and do things autonomously. So let's write a function that does that.

In [21]:
def deploy(env, saved_model):

#     policy = NN([np.prod(env.observation_space.shape), 32, env.action_space.n], distribution=True) # we must remember the architecture
    policy = saved_model['model'] # the checkpoint stores the whole model object
    policy.load_state_dict(saved_model['state_dict']) # load in our pre-trained parameters
    policy.eval() # put our model in evaluation mode
    try:
        for episode in range(100): # keep demonstrating your skills
            done = False # not done yet
            observation = env.reset() # initialise the environment
            while not done: # until the episode is over
                observation = torch.Tensor(observation) # turn observation to tensor
                observation = observation.view(np.prod(observation.shape)) # view observation as vector
                action_distribution = policy(observation) # infer what actions to take with what probability
                action = torch.distributions.Categorical(action_distribution).sample() # sample an action from that distribution
                action = int(action) # convert the sampled tensor to a Python int
                observation, reward, done, info = env.step(action) # take an action and transition the environment
                env.render() # show us the environment
                sleep(0.01)
    except KeyboardInterrupt:
        env.close()
       
griddy_env = GriddyEnv()
griddy_agent_params = torch.load('agents/trained-agent-griddy.pt')
deploy(griddy_env, griddy_agent_params)

cartpole_env = gym.make('CartPole-v0')
cartpole_agent_params = torch.load('agents/trained-agent-cartpole.pt')
deploy(cartpole_env, cartpole_agent_params)


lunar_lander_env = gym.make('LunarLander-v2')
lunar_lander_agent_params = torch.load('agents/trained-agent-lunar-lander.pt')
deploy(lunar_lander_env, lunar_lander_agent_params)
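
The checkpoints above store the whole model object alongside its `state_dict`. Loading a pickled model object requires the `NN` class to be importable, and depending on your PyTorch version `torch.load` may default to `weights_only=True` and refuse to unpickle it. A more portable pattern is to save only the `state_dict` and rebuild the architecture before loading. The sketch below shows that round trip with a stand-in network and an in-memory buffer instead of a file; `make_policy` and its layer sizes are hypothetical.

```python
import io

import torch


def make_policy():
    """Rebuild the (hypothetical) architecture: 4 inputs -> 32 hidden -> 2 actions."""
    return torch.nn.Sequential(
        torch.nn.Linear(4, 32),
        torch.nn.ReLU(),
        torch.nn.Linear(32, 2),
        torch.nn.Softmax(dim=-1),
    )


trained = make_policy()              # pretend this has been trained

buffer = io.BytesIO()                # in-memory stand-in for a checkpoint file
torch.save(trained.state_dict(), buffer)   # save the parameters only
buffer.seek(0)

restored = make_policy()             # same architecture, fresh random weights
restored.load_state_dict(torch.load(buffer))
restored.eval()

observation = torch.rand(4)
with torch.no_grad():                # no gradients needed at deployment time
    same_output = torch.allclose(trained(observation), restored(observation))
print(same_output)
```

The trade-off is the one hinted at in the commented-out line of `deploy`: with this pattern we must remember the architecture ourselves.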


# Any final words?

Great, so policy gradient methods can at least do as well as the previous algorithms that we've seen.

Policy gradient based methods depend on the objective being differentiable with respect to the policy parameters.

For any RL agent, the objective (total expected reward attained) is produced as a result of following some policy. If the policy is better, then this objective will be larger.

In policy and value iteration (value based techniques), the policy was produced by taking the action with the maximum estimated value. This max (argmax) operation is not differentiable, and so neither was the policy. For policy gradients to be followed, we must have a differentiable policy.
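
A small sketch of the difference, using made-up numbers: picking the greedy action returns an integer index with no gradient attached, whereas the log probability of an action under a softmax policy can be backpropagated to the parameters. The tensor `scores` stands in for whatever parameters produce the action preferences.

```python
import torch

# Hypothetical per-action scores that depend on some parameters.
scores = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)

# Greedy (argmax) policy: the output is just an integer index.
greedy_action = torch.argmax(scores)
print(greedy_action.dtype, greedy_action.requires_grad)  # no gradient path

# Stochastic softmax policy: log pi(a) is a smooth function of the scores.
probs = torch.softmax(scores, dim=0)
log_prob = torch.log(probs[1])   # log probability of choosing action 1
log_prob.backward()
print(scores.grad)               # a gradient exists for every score
```

Raising the score of action 1 raises its log probability and raising the other scores lowers it, which is exactly the kind of signal a policy gradient update needs.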

Policy gradient methods will also work with a partially observable environment.

# Next steps

- [Trust Region Policy Optimisation (TRPO)]()
- [Upside Down RL (UDRL)]()