# Policy Gradients

Prerequisites:
- [Value based methods](https://theaicore.com/app/training/intro-to-rl)
- [Neural Networks](https://theaicore.com/app/training/neural-networks)

Previously we looked at value-based methods: those which estimate how good certain states are (the value function) and how good certain actions are from certain states (the action-value or Q function).

In this notebook we'll look at policy gradient based methods.

## What's the goal of reinforcement learning?

The goal of a reinforcement learning agent is to maximise expected reward over its lifetime.
What the agent experiences over its lifetime, including rewards, states and actions, defines its *trajectory*.
The trajectories that an agent might experience depend on what actions it takes from any given state, that is, what policy the agent follows.

We can formulate this as below.

# J
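One common way to write this objective, assuming an episodic setting in which a trajectory $\tau = (s_0, a_0, r_0, \dots, s_T, a_T, r_T)$ is generated by following the policy $\pi_\theta$ (the symbols $\tau$, $T$, $R$ and $P(\tau \mid \theta)$ are notation assumed here), is:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right] = \sum_{\tau} P(\tau \mid \theta)\, R(\tau),
\qquad R(\tau) = \sum_{t=0}^{T} r_t
$$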

Where the policy is a function with parameters $\theta$.

What we'd like to do is find parameters that maximise this objective, $J$, and hence find an optimal parameterisation for our policy.

Because the objective is fully differentiable, we can use gradient **ascent** to improve our objective with respect to our parameters.

Below we analytically derive the gradient of the objective with respect to the parameters.

# del J
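A sketch of the standard log-derivative (likelihood ratio) derivation follows. It assumes that $P(\tau \mid \theta)$ denotes the probability of a trajectory $\tau$ under the policy $\pi_\theta$, that $R(\tau)$ is its total reward, and that the environment dynamics do not depend on $\theta$ (all notation assumed for illustration):

$$
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \sum_{\tau} P(\tau \mid \theta)\, R(\tau) \\
  &= \sum_{\tau} P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
\end{aligned}
$$

The second line uses $\nabla_\theta P = P\, \nabla_\theta \log P$. The third uses $P(\tau \mid \theta) = p(s_0) \prod_{t} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$: after taking the log, only the policy terms depend on $\theta$.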

Now we can use this derivative in our gradient ascent update rule.
Note that the update is in the direction of the gradient because this is gradient ascent, not descent.
That's because the objective represents our expected reward, which we want to maximise, rather than a loss, which we would want to minimise.

# update rule
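Written out, assuming a learning rate $\alpha$ and a batch of $N$ sampled trajectories $\tau_i$ with total rewards $R(\tau_i)$ (symbols assumed here for illustration), the gradient ascent update and its sample estimate would look like:

$$
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta),
\qquad
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \sum_{t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})
$$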

What does this say?

# verbose update
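In words: the update raises the log-probability of every action taken in a trajectory in proportion to that trajectory's total reward. Actions from high-reward trajectories become more likely, and actions from negative-reward trajectories become less likely. Below is a minimal sketch of this effect, assuming a toy policy over two actions whose parameters are simply the softmax logits. The helpers `softmax`, `grad_log_prob` and `ascent_step` are hypothetical names, not part of this notebook's code.

```python
import math

def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)                                  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: onehot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def ascent_step(logits, action, episode_return, lr=0.1):
    """One gradient ascent step: theta <- theta + lr * return * grad log pi(action)."""
    grad = grad_log_prob(logits, action)
    return [z + lr * episode_return * g for z, g in zip(logits, grad)]

theta = [0.0, 0.0]                                   # two actions, initially equally likely
before = softmax(theta)[0]
after_good = softmax(ascent_step(theta, action=0, episode_return=1.0))[0]
after_bad = softmax(ascent_step(theta, action=0, episode_return=-1.0))[0]
print(before, after_good, after_bad)
```

Taking action 0 and then seeing a positive return makes action 0 more probable; seeing a negative return makes it less probable.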

So let's build a neural network which will act as our agent's policy.

In [1]:
import torch

class NN(torch.nn.Module):
    def __init__(self, layers, embedding=False, distribution=False):
        super().__init__()
        l = []
        for idx in range(len(layers) - 1):
            if idx == 0 and embedding:
                l.append(torch.nn.Embedding(layers[idx], layers[idx+1]))
                continue
            l.append(torch.nn.Linear(layers[idx], layers[idx+1]))   # add a linear layer
            if idx != len(layers) - 2: # if this is not the last layer
                l.append(torch.nn.ReLU())   # activate
        if distribution:    # if a probability dist output is required
            l.append(torch.nn.Softmax(dim=-1))    # apply softmax over the last (action) dimension
            
        self.layers = torch.nn.Sequential(*l)

    def forward(self, x):
        return self.layers(x)
    


Let's use this neural network to model the policy which will control our agent in Griddy!

In [15]:
import gym
from time import sleep
import matplotlib.pyplot as plt
from torch.utils.tensorboard import SummaryWriter
import numpy as np
from GriddyEnv import GriddyEnv # get griddy



In [16]:

writer = SummaryWriter()
# def train(env, optimiser, epochs=100, episodes=30, use_baseline=False, use_causality=False):
# #     try:
#     for epoch in range(epochs):
#         avg_reward = 0
#         objective = 0
#         for episode in range(episodes):
#             done = False
#             state = env.reset()
#             log_policy = []

#             rewards = []

#             step = 0

#             # RUN AN EPISODE
#             while not done:     # while the episode is not terminated
#                 state = torch.Tensor(state)     # correct data type for passing to model
#                 print('STATE:', state)
#                 state = state.view(np.prod(state.shape))
#                 print('FLATTENED STATE:', state)

#                 action_distribution = policy(state)     # get a distribution over actions from the policy given the state
#                 print('ACTION DISTRIBUTION:', action_distribution)

#                 action = torch.distributions.Categorical(action_distribution).sample()      # sample from that distribution
#                 print(action)
#                 action = int(action)
#                 # print('ACTION:', action)

#                 new_state, reward, done, info = env.step(action)    # take timestep

#                 rewards.append(reward)

#                 state = new_state
#                 log_policy.append(torch.log(action_distribution[action]))

#                 step += 1
#                 if done:
#                     break
#                 if step > 10000000:
#                     # break
#                     pass

#             avg_reward += ( sum(rewards) - avg_reward ) / ( episode + 1 )   # accumulate avg reward
#             writer.add_scalar('Reward/Train', avg_reward, epoch*episodes + episode)     # plot the latest reward

#             for idx in range(len(rewards)):     # for each timestep experienced in the episode
#                 weight = sum(rewards)           # weight by the total reward from this episode
#                 objective += log_policy[idx] * weight   # add the weighted log likelihood of taking this action to the objective


#         objective /= episodes   # average over episodes
#         objective *= -1     # negate: the optimiser minimises, so minimising the negative objective maximises reward


#         # UPDATE POLICY
#         # print('updating policy')
#         print('EPOCH:', epoch, f'AVG REWARD: {avg_reward:.2f}')
#         objective.backward()    # backprop
#         optimiser.step()    # update params
#         optimiser.zero_grad()   # reset gradients to zero

#         # VISUALISE AT END OF EPOCH AFTER UPDATING POLICY
#         state = env.reset()
#         done = False
#         while not done:
#             env.render()
#             state = torch.Tensor(state)
#             state = state.view(np.prod(state.shape))   # flatten, as in the training loop
#             action_distribution = policy(state)
#             action = torch.distributions.Categorical(action_distribution).sample()
#             action = int(action)
#             state, reward, done, info = env.step(action)
#             sleep(0.01)

#     env.close()
# #     except KeyboardInterrupt:
# #         env.close()
# #     except RuntimeError:
# #         env.close()


# env = GriddyEnv()
# # env = gym.make('CartPole-v0')

# policy = NN([np.prod(env.observation_space.shape), 32, env.action_space.n], distribution=True)

# lr = 0.001
# weight_decay = 1
# optimiser = torch.optim.SGD(policy.parameters(), lr=lr, weight_decay=weight_decay)

# train(
#     env,
#     optimiser,
#     epochs=400,
#     episodes=30
# )

Now let's move on to a more challenging environment called CartPole. The aim is to keep a pole balanced upright on a cart that moves along one dimension.

## How can we improve this learning algorithm?

What we just implemented is the vanilla policy gradient algorithm.

Notice anything that you think could be improved?

### Baselines

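What if all of the trajectories receive a similar reward? For example, what if the total reward over every trajectory is negative (e.g. a negative reward at every timestep and zero upon reaching a terminal state)?
In that case every action taken has its log-likelihood pushed down, just by different amounts. Subtracting a baseline, such as the average total reward, from each trajectory's reward means that better-than-average trajectories are reinforced and worse-than-average ones are discouraged.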
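A small sketch of the effect of a baseline follows. It assumes the baseline is the mean return of the current batch; a running average of returns, as in the `use_baseline` option of `train`, plays the same role. The function `episode_weights` and the returns are made up for illustration.

```python
def episode_weights(returns, use_baseline=False):
    """Weight applied to each episode's log-likelihood terms."""
    baseline = sum(returns) / len(returns) if use_baseline else 0.0
    return [r - baseline for r in returns]

returns = [-12.0, -10.0, -8.0]                        # hypothetical total rewards, all negative

raw = episode_weights(returns)                        # every episode is discouraged
centred = episode_weights(returns, use_baseline=True) # only worse-than-average episodes are discouraged
print(raw)
print(centred)
```

Without the baseline, all three weights are negative. With it, the best episode gets a positive weight and the worst a negative one, even though every return was negative.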

### Causality
What rewards can an action take responsibility for?
Surely a reward received before an action taken later in the trajectory shouldn't indicate that the later action was good. Whatever the action taken later in time was, this reward was received before then.

To account for this, we should only weight how good the action was by the rewards which it led the agent to receive from that point in time onwards.
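A minimal sketch of this "reward-to-go" weighting, assuming the rewards of one episode are held in a plain list. The helper name `rewards_to_go` is hypothetical. It gives the same values as `sum(rewards[t:])` for each `t`, but accumulates them in a single backward pass.

```python
def rewards_to_go(rewards):
    """Return a list whose t-th entry is the sum of rewards from timestep t onwards."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # walk backwards so each sum reuses the previous one
        running += rewards[t]
        out[t] = running
    return out

rewards = [1.0, 0.0, 2.0, -1.0]               # made-up rewards for one short episode
weights = rewards_to_go(rewards)
print(weights)                                # [2.0, 1.0, 1.0, -1.0]
```

The action at the final timestep is weighted only by the final reward, while the first action is still weighted by the whole episode's reward.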

In [20]:

def train(env, optimiser, epochs=100, episodes=30, use_baseline=False, use_causality=False):
    assert not (use_baseline and use_causality)   # this simple implementation supports only one of the two at a time
    baseline = 0
    try:
        for epoch in range(epochs):
            avg_reward = 0
            objective = 0
            for episode in range(episodes):
                done = False
                state = env.reset()
                log_policy = []

                rewards = []

                step = 0

                # RUN AN EPISODE
                while not done:     # while the episode is not terminated
                    state = torch.Tensor(state)     # correct data type for passing to model
                    # print('STATE:', state)
                    action_distribution = policy(state)     # get a distribution over actions from the policy given the state
                    # print('ACTION DISTRIBUTION:', action_distribution)

                    action = torch.distributions.Categorical(action_distribution).sample()      # sample from that distribution
                    action = int(action)
                    # print('ACTION:', action)

                    new_state, reward, done, info = env.step(action)    # take timestep

                    rewards.append(reward)

                    state = new_state
                    log_policy.append(torch.log(action_distribution[action]))

                    step += 1   # count timesteps; the while condition ends the episode once done is True

                avg_reward += ( sum(rewards) - avg_reward ) / ( episode + 1 )   # accumulate avg reward
                writer.add_scalar('Reward/Train', avg_reward, epoch*episodes + episode)     # plot the latest reward

                # update baseline
                if use_baseline:
                    baseline += ( sum(rewards) - baseline ) / (epoch*episodes + episode + 1)    # accumulate average return  

                for idx in range(len(rewards)):     # for each timestep experienced in the episode
                    # add causality
                    if use_causality:   
                        weight = sum(rewards[idx:])     # only weight the log likelihood of this action by the future rewards, not the total
                    else:
                        weight = sum(rewards) - baseline           # weight by the total reward from this episode, minus the baseline (0 if unused)
                    objective += log_policy[idx] * weight   # add the weighted log likelihood of taking this action to the objective


            objective /= episodes   # average over episodes
            objective *= -1     # negate: the optimiser minimises, so minimising the negative objective maximises reward


            # UPDATE POLICY
            # print('updating policy')
            print('EPOCH:', epoch, f'AVG REWARD: {avg_reward:.2f}')
            objective.backward()    # backprop
            optimiser.step()    # update params
            optimiser.zero_grad()   # reset gradients to zero

            # VISUALISE AT END OF EPOCH AFTER UPDATING POLICY
            state = env.reset()
            done = False
            while not done:
                env.render()
                state = torch.Tensor(state)
                state = state.view(np.prod(state.shape))
                action_distribution = policy(state)
                action = torch.distributions.Categorical(action_distribution).sample()
                action = int(action)
                state, reward, done, info = env.step(action)
                sleep(0.01)
    except KeyboardInterrupt:
        print('interrupted')
    finally:
        env.close()   # close the environment exactly once, whether training finished or was interrupted

writer = SummaryWriter()

env = gym.make('CartPole-v0')

policy = NN([np.prod(env.observation_space.shape), 32, env.action_space.n], distribution=True)

lr = 0.001
weight_decay = 1
optimiser = torch.optim.SGD(policy.parameters(), lr=lr, weight_decay=weight_decay)

train(
    env,
    optimiser,
    use_baseline=True,
    use_causality=False,
    epochs=30,
    episodes=30
)

EPOCH: 0 AVG REWARD: 23.97
EPOCH: 1 AVG REWARD: 21.90
EPOCH: 2 AVG REWARD: 31.47
EPOCH: 3 AVG REWARD: 26.23
EPOCH: 4 AVG REWARD: 24.00
EPOCH: 5 AVG REWARD: 31.57
interrupted


The improved learning algorithm that we just implemented is called REINFORCE (REward Increment = Nonnegative Factor $\times$ Offset Reinforcement $\times$ Characteristic Eligibility). This name describes the structure of the parameter updates.

## REINFORCE Algorithm summary

### Online or offline?
Experience is collected over complete episodes before the policy is updated, rather than the policy being updated at every timestep. So REINFORCE is offline.

### Model based or model free?
There's no mention of a transition function in this algorithm, so it's model free!

### On-policy or off-policy?
The gradient signal comes from rewards obtained on a trajectory that was produced by following the current policy. So REINFORCE is on-policy.

Great, so policy gradient methods can at least do as well as the previous algorithms that we've seen.

Policy gradient based methods depend on the objective being differentiable with respect to the policy parameters.

For any RL agent, the objective (total expected reward attained) is produced as a result of following some policy. If the policy is better, then this objective will be larger.

In policy and value iteration (value based techniques), the policy was produced by taking the action with the maximum value. This max operation is not differentiable, so policy gradients could not be used.
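To illustrate the difference, here is a small sketch comparing a greedy (max-based) action choice with a softmax policy when one value is nudged slightly. The helper names and the numbers are assumptions made for this example.

```python
import math

def greedy_action(values):
    """Index of the largest value: the choice a max-based policy makes."""
    return max(range(len(values)), key=lambda a: values[a])

def softmax_prob(logits, action):
    """Probability that a softmax policy assigns to `action`."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[action] / sum(exps)

values = [1.0, 0.5]                     # made-up action values / logits
eps = 1e-3
nudged = [values[0] + eps, values[1]]   # slightly increase the first entry

greedy_change = greedy_action(nudged) - greedy_action(values)
soft_change = softmax_prob(nudged, 0) - softmax_prob(values, 0)
print(greedy_change, soft_change)
```

The greedy choice does not respond to the small nudge at all, so it provides no gradient signal, whereas the softmax probability shifts smoothly with the parameters.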

Like fitted value iteration and q-learning, we can use a 