# REINFORCE - Monte Carlo Policy Gradient 

REINFORCE is arguably the simplest policy gradient method. It uses Monte Carlo returns as a direct estimate of the policy's Q-value.

The policy gradient is:
$\begin{aligned}
\nabla_\theta J(\theta)
&= \mathbb{E}_\pi [Q^\pi(s, a) \nabla_\theta \ln \pi_\theta(a \vert s)] & \\
&= \mathbb{E}_\pi [G_t \nabla_\theta \ln \pi_\theta(A_t \vert S_t)] & \scriptstyle{\text{; Because } Q^\pi(S_t, A_t) = \mathbb{E}_\pi[G_t \vert S_t, A_t]}
\end{aligned}$

We can therefore use the Monte Carlo return $G_t$ (the discounted sum of future rewards from timestep $t$ onward) in place of the policy's Q-value in the gradient update.
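
As a concrete illustration, the return can be computed by sweeping backwards over an episode's rewards, because each return is the immediate reward plus the discounted return that follows it. The snippet is a minimal sketch in plain Python; the function name `discounted_returns` and the sample rewards are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted return from every timestep of one episode."""
    returns = []
    g = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for reward in reversed(rewards):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# Hypothetical three-step episode with a reward of 1 at every step.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The first entry is the largest because it accumulates every later reward, each discounted once more per step.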

The algorithm is very simple compared to DQN, with far fewer moving parts.

Below is the pseudocode for the simplest case, episodic REINFORCE:

![Pseudocode](https://miro.medium.com/max/4800/1*NQkoA-eQOUXHqIln-WzMpw.png)
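
To make the update rule tangible without any deep learning library, here is a rough sketch of REINFORCE on a made-up two-armed bandit, where every episode lasts one step and $G_t$ is just the reward. The environment, the names `softmax` and `reinforce_bandit`, and the hyperparameters are all assumptions for illustration. It relies on the fact that for a softmax policy over logits $\theta$, $\nabla_{\theta_k} \ln \pi(a) = \mathbb{1}[k = a] - \pi(k)$.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(num_episodes=500, learning_rate=0.1, seed=0):
    """Run REINFORCE on a toy bandit where arm 1 pays 1 and arm 0 pays 0."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # policy parameters theta
    for _ in range(num_episodes):
        probs = softmax(logits)
        # Sample an action from the current policy.
        action = rng.choices([0, 1], weights=probs)[0]
        # One-step episode: the return is just the immediate reward.
        G = 1.0 if action == 1 else 0.0
        # Gradient ascent step: theta += lr * G * grad log pi(action).
        for k in range(len(logits)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += learning_rate * G * grad_log_pi
    return softmax(logits)


probs = reinforce_bandit()
print(probs)  # the probability of arm 1 should end up close to 1
```

Only rewarded actions move the parameters here, so the policy drifts steadily towards the paying arm.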

The following is a PyTorch implementation of the REINFORCE algorithm for both discrete and continuous action spaces.

## Discrete Policy
The discrete action space policy is simply a neural network with the following layer sizes:
- Input : State Observation size
- Hidden : Whatever you please.
- Output: Number of possible actions.

The policy then uses a softmax activation function at the end to give probabilities for selecting an action.
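
As a small illustration of that last step, the sketch below turns some made-up logits into probabilities and samples an action from them. The logit values are hypothetical; `torch.softmax` and `torch.distributions.Categorical` are the standard PyTorch calls.

```python
import torch
from torch.distributions import Categorical

# Hypothetical raw network outputs (logits) for a three-action problem.
logits = torch.tensor([2.0, 1.0, 0.1])

# Softmax turns the logits into probabilities that sum to 1.
probs = torch.softmax(logits, dim=-1)

# A categorical distribution over those probabilities can be sampled
# and queried for the log-probability of the sampled action.
dist = Categorical(probs=probs)
action = dist.sample()
log_prob = dist.log_prob(action)

print(probs, action.item(), log_prob.item())
```

The log-probability of the sampled action is the quantity that REINFORCE weights by the return.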

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discrete_Policy(nn.Module):

    def __init__(self, input_size, hidden_size, nb_actions) -> None:
        super(Discrete_Policy, self).__init__()
        self.policy_net = nn.Sequential(nn.Linear(input_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, nb_actions),
                                        nn.Softmax(dim=-1))
        
    def forward(self,obs):
        return self.policy_net(obs)

## Continuous Policy
The continuous action space policy is a little more complicated. There are several ways to build it depending on your needs; the version below uses two separate neural networks to approximate the mean and the standard deviation of a Gaussian over actions.

The sizes of both neural networks are as follows:

- Input : State Observation size
- Hidden : Whatever you please.
- Output: Number of action dimensions (one value per dimension).
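
For reference, once a mean $\mu$ and standard deviation $\sigma$ are available, the log-probability of an action $a$ under the resulting Gaussian is $\ln \pi(a \vert s) = -\frac{(a-\mu)^2}{2\sigma^2} - \ln \sigma - \frac{1}{2}\ln 2\pi$. The plain-Python sketch below evaluates that formula for a single action dimension; the helper name `gaussian_log_prob` is made up here.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a one-dimensional Gaussian evaluated at `action`."""
    if std <= 0:
        raise ValueError("std must be strictly positive")
    return (
        -((action - mean) ** 2) / (2 * std ** 2)
        - math.log(std)
        - 0.5 * math.log(2 * math.pi)
    )


# Actions near the mean are more likely than actions far from it.
print(gaussian_log_prob(0.0, mean=0.0, std=1.0))
print(gaussian_log_prob(2.0, mean=0.0, std=1.0))
```

With several independent action dimensions, the per-dimension log-probabilities simply add up.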

A separate neural network for the standard deviation is not required: a single learned parameter or even a fixed constant can be used instead. Each choice behaves differently, so expect some experimentation.
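
As one possible shape for the single-learned-parameter option, the sketch below keeps a mean network and stores one state-independent log standard deviation per action dimension. The class name `SharedStdPolicy`, the smaller network, and the zero initialisation are assumptions for illustration, not part of the implementation in this notebook.

```python
import torch
import torch.nn as nn


class SharedStdPolicy(nn.Module):
    """Hypothetical variant: a mean network plus one learned log-std per action dimension."""

    def __init__(self, input_size, hidden_size, nb_actions):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.LeakyReLU(),
            nn.Linear(hidden_size, nb_actions),
        )
        # State-independent parameter; exp() keeps the std strictly positive.
        self.log_std = nn.Parameter(torch.zeros(nb_actions))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = torch.exp(self.log_std).expand_as(mean)
        return mean, std


policy = SharedStdPolicy(input_size=3, hidden_size=16, nb_actions=1)
mean, std = policy(torch.zeros(3))
print(mean.shape, std)  # std starts at exp(0) = 1
```

Learning the logarithm of the standard deviation avoids having to clamp or take an absolute value to keep it positive.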

In [2]:
class Continuous_Policy(nn.Module):

    def __init__(self, input_size, hidden_size, nb_actions) -> None:
        super(Continuous_Policy, self).__init__()
        self.mean_net = nn.Sequential(nn.Linear(input_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, nb_actions))
        self.standard_deviation_net = nn.Sequential(nn.Linear(input_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, hidden_size),
                                        nn.LeakyReLU(),
                                        nn.Linear(hidden_size, nb_actions))
        
    def forward(self,obs):
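        # abs() keeps the std non-negative; the small constant keeps it strictly positive, as Normal requires
        return self.mean_net(obs),torch.abs(self.standard_deviation_net(obs)) + 1e-6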

## Agent 
The following is the agent class:

In [3]:
from torch.distributions import Categorical, Normal
from torch.optim import AdamW
import torch

class REINFORCE_Agent:

    def __init__(self,input_size,hidden_size,nb_actions,discrete=True,discount_factor=0.9,learning_rate=0.0001):
        self.discrete = discrete
        if discrete:
            self.policy = Discrete_Policy(input_size,hidden_size,nb_actions)
        else:
            self.policy = Continuous_Policy(input_size,hidden_size,nb_actions)
        
        self.memory = []
        self.gamma = discount_factor
        
        self.optimizer = AdamW(self.policy.parameters(),learning_rate)

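        # Store a copy: state_dict() returns references to the live parameters,
        # so keeping it directly would silently follow every later update
        self.best_weights = {k: v.detach().clone() for k, v in self.policy.state_dict().items()}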
        
    def act(self,obs):
        if self.discrete:
            # Create categorical distribution using the action probabilities given by the agent's policy
            action_dist = Categorical(self.policy(obs))

        else:
            mu,dev = self.policy(obs)
            # Create a Normal distribution using the policy given mean and std deviation
            action_dist = Normal(mu,dev)
        
        # Return the action distribution created according to policy
        return action_dist

    def cache(self,reward,log_action_prob):
        """
        Add reward and the log of the performed action's probability
        """
        self.memory.append((reward,log_action_prob))

    def clear_memory(self):
        self.memory.clear()
    
    def train(self):
        self.policy.train()
    
    def eval(self):
        self.policy.load_state_dict(self.best_weights)
        self.policy.eval()
        
    def update_model(self):
        
        returns_to_go = []
        # Calculate the Monte Carlo estimates, i.e. the sum of discounted future rewards for each timestep
        Gt=0
        for (reward,_) in self.memory[::-1]:
            Gt = reward+self.gamma*Gt
            returns_to_go.append(Gt)

        returns_to_go = returns_to_go[::-1]

        # Sum every transition's loss. The policy gradient as stated above is:
        # ∇J = E[ Gt * ∇log(π(a|s)) ]
        # PyTorch computes the gradient of the loss for the policy's parameter updates,
        # so we only need loss = Gt * log(action probability).
        # REINFORCE assumes gradient ascent, so we negate the loss to work with PyTorch's gradient descent.
        losses = []
        for Gt,(_,log_action_prob) in zip(returns_to_go,self.memory):
            losses.append(-log_action_prob*Gt)
        
        # clear the optimizer's current gradients
        self.optimizer.zero_grad()

        # sum the losses
        loss = torch.stack(losses).sum()

        # backpropagate the loss to calculate the gradients
        loss.backward()

        # make one optimization step
        self.optimizer.step()

        # clear the finished episode from the agent's memory
        self.clear_memory()
    
    def update_best_weights(self):
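        # Store a copy rather than a reference to the live parameters
        self.best_weights = {k: v.detach().clone() for k, v in self.policy.state_dict().items()}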
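
To see why minimizing `-log_action_prob * Gt` produces the policy gradient, the sketch below compares autograd with the closed-form gradient for a softmax policy, where the derivative of $\ln \pi(a)$ with respect to the logits is the one-hot vector of $a$ minus the probabilities. The logits, the chosen action, and the return are made-up values.

```python
import torch
from torch.distributions import Categorical

# Hypothetical logits standing in for a policy network's output.
logits = torch.tensor([0.5, -0.2, 0.1], requires_grad=True)
probs = torch.softmax(logits, dim=-1)

action = torch.tensor(2)   # pretend this action was sampled
Gt = 3.0                   # pretend this is its Monte Carlo return

# Same per-step loss as in update_model: negative return-weighted log-probability.
loss = -Categorical(probs=probs).log_prob(action) * Gt
loss.backward()

# Closed form for a softmax policy: d/dlogits log pi(a) = onehot(a) - probs.
one_hot = torch.nn.functional.one_hot(action, num_classes=3).float()
expected_grad = -Gt * (one_hot - probs.detach())

print(logits.grad, expected_grad)
```

The two printed tensors should agree, which is the sense in which gradient descent on this loss is gradient ascent on the return-weighted log-probability.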

    
    

## Training Loop
The last thing needed is the training loop.

In [4]:
from collections import deque
import gym
import torch

def train(agent, env, num_episodes, num_steps, log_every=20, render_from=1000):
    """
    :param agent: the agent to be trained.
    :param env: the gym environment.
    :param num_episodes: the number of episodes to train.
    :param num_steps: the max number of steps possible per episode.
    :param log_every: The frequency of logging. Default logs every 20 episodes.
    :param render_from: The episode number to start rendering to screen to allow one to view agent. Rendering significantly slows down training.
    """
    
    # Running Average Reward Memory
    running_avg_reward = deque(maxlen=100)
    best_running_avg = float("-inf")  # start below any possible average so negative-reward environments also work
    agent.train()
    for episode in range(1,num_episodes+1):
        # Starting state observation
        obs = env.reset()
        reward_total = 0
        for step in range(num_steps):
            if episode>render_from:
                env.render()
            # Get the agent's action probability distribution
            action_dist = agent.act(torch.tensor(obs,dtype=torch.float32))
            # Sample an action from this distribution
            action = action_dist.sample()
            # Take a step in the environment with the action drawn
            obs , reward, done, info = env.step(action.numpy())
            reward_total+=reward
            # Save the transition information needed for update
            agent.cache(reward,action_dist.log_prob(action))
            # if done then log and break
            if done or step == num_steps-1:
                if episode % log_every ==0 and len(running_avg_reward)>0:
                    running_avg = sum(running_avg_reward)/len(running_avg_reward)
                    if running_avg > best_running_avg:
                        best_running_avg = running_avg
                        agent.update_best_weights()
                    print("Episode {0:4d} finished after {1:4d} timesteps with a total reward of {2:3.1f} | Running Average: {3:3.2f}".format(episode,step+1,reward_total,running_avg))
                break
        # Update model every episode
        agent.update_model()
        running_avg_reward.append(reward_total)
        

    env.close()

In [5]:

def eval(agent, env, num_episodes, num_steps, log_every=20):
    """
    :param agent: the agent to be evaluated.
    :param env: the gym environment.
    :param num_episodes: the number of episodes to eval.
    :param num_steps: the max number of steps possible per episode.
    :param log_every: The frequency of logging. Default logs every 20 episodes.
    """
    agent.eval()
    # Running Average Reward Memory
    running_avg_reward = deque(maxlen=100)
    for episode in range(1,num_episodes+1):
        # Starting state observation
        obs = env.reset()
        reward_total = 0
        for step in range(num_steps):
           
            env.render()
            # Get the agent's action probability distribution
            action_dist = agent.act(torch.tensor(obs,dtype=torch.float32))
            # Sample an action from this distribution
            action = action_dist.sample()
            # Take a step in the environment with the action drawn
            obs , reward, done, info = env.step(action.numpy())
            reward_total+=reward
           
            # if done then log and break
            if done or step == num_steps-1:
                if episode % log_every ==0 and len(running_avg_reward)>0:
                    running_avg = sum(running_avg_reward)/len(running_avg_reward)
                    print("Episode {0:4d} finished after {1:4d} timesteps with a total reward of {2:3.1f} | Running Average: {3:3.2f}".format(episode,step+1,reward_total,running_avg))
                break
        running_avg_reward.append(reward_total)
    env.close()
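
The two loops above unpack `env.reset()` into a single observation and `env.step()` into four values, which matches the classic Gym API. Gym 0.26 and later, and its successor Gymnasium, instead return `(obs, info)` from `reset()` and `(obs, reward, terminated, truncated, info)` from `step()`. The helpers below are a hypothetical compatibility shim (the names `unpack_reset` and `unpack_step` are not part of Gym) that accepts either shape; treat it as a sketch, since an environment whose observation is itself a two-element tuple ending in a dict would confuse the reset check.

```python
def unpack_reset(result):
    """Return just the observation from either reset() return style."""
    # Newer API: (obs, info) where info is a dict.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def unpack_step(result):
    """Return (obs, reward, done, info) from either step() return style."""
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result
    return obs, reward, done, info


# Stand-in tuples instead of a real environment, to keep the sketch self-contained.
print(unpack_reset(([0.1, 0.2], {})))
print(unpack_step(([0.1, 0.2], 1.0, False, True, {})))
```

In the loops, `obs = unpack_reset(env.reset())` and `obs, reward, done, info = unpack_step(env.step(action.numpy()))` would be the corresponding replacements.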

## Time to Train

In [6]:
env = gym.make("CartPole-v0")

if (isinstance(env.action_space,gym.spaces.Discrete)):
    discrete = True
    nb_actions = env.action_space.n
else:
    discrete = False
    nb_actions = env.action_space.shape[0]

# Hyper parameters
HIDDEN_SIZE = 256
GAMMA = 0.99
LEARNING_RATE = 0.0005
EPISODES_TO_TRAIN = 1000
EPISODES_TO_EVAL = 20
MAX_STEPS = 201
LOG_EVERY = 20
RENDER_FROM_EP = 2000

agent = REINFORCE_Agent(env.observation_space.shape[0],HIDDEN_SIZE,nb_actions,discrete,GAMMA,LEARNING_RATE)

print("Training...")

train(agent,env,EPISODES_TO_TRAIN,MAX_STEPS,LOG_EVERY,RENDER_FROM_EP)

print("Evaluating...")

eval(agent,env,EPISODES_TO_EVAL,MAX_STEPS,log_every=1)

Training...
Episode   20 finished after    6 timesteps with a total reward of 6.0 | Running Average: 6.32
Episode   40 finished after    5 timesteps with a total reward of 5.0 | Running Average: 6.23
Episode   60 finished after    4 timesteps with a total reward of 4.0 | Running Average: 6.02
Episode   80 finished after    6 timesteps with a total reward of 6.0 | Running Average: 5.78
Episode  100 finished after    5 timesteps with a total reward of 5.0 | Running Average: 5.75
Episode  120 finished after    5 timesteps with a total reward of 5.0 | Running Average: 5.55
Episode  140 finished after    5 timesteps with a total reward of 5.0 | Running Average: 5.28
Episode  160 finished after    5 timesteps with a total reward of 5.0 | Running Average: 5.19
Episode  180 finished after    8 timesteps with a total reward of 8.0 | Running Average: 5.35
Episode  200 finished after    5 timesteps with a total reward of 5.0 | Running Average: 5.39
Episode  220 finished after    5 timesteps with 