In the last lesson, we talked about value-based methods (Monte Carlo, Temporal Difference, Q-Learning). In this lesson, we will talk about policy-based methods, and more specifically, POLICY GRADIENT!

---

Before we talk about policy gradient, let's review the difference between value-based and policy-based methods.

Value-Based | Policy-Based
:--: | :--:
Optimizing a **value function** (which will lead to an optimal policy) | Optimizing a **policy** directly
**Deterministic** (the policy outputs are fixed) | **Deterministic** or **Stochastic**
Relatively **low** training cost | Relatively **high** training cost
Generally for **discrete** action spaces | Suitable for both **discrete** and **continuous** action spaces

In this lesson, we will specifically discuss one of the policy gradient algorithms, called [REINFORCE](https://www.analyticsvidhya.com/blog/2020/11/reinforce-algorithm-taking-baby-steps-in-reinforcement-learning/).

We will begin by importing necessary packages and defining the policy network that the agent will use to decide its actions:

In [1]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Define Policy Network
class SoftmaxPolicy(nn.Module):
    def __init__(self, n_inputs, n_outputs):
        super(SoftmaxPolicy, self).__init__()
        self.fc1 = nn.Linear(n_inputs, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, n_outputs)
        self.ReLU = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.ReLU(x)
        x = self.fc2(x)
        x = self.ReLU(x)
        x = self.fc3(x)
        return x

# Define Policy Gradient Agent
class SoftmaxAgent:
    def __init__(self, n_inputs, n_outputs):
        self.policy_network = SoftmaxPolicy(n_inputs, n_outputs)
        self.optimizer = optim.Adam(self.policy_network.parameters(), lr=0.01)
        self.gamma = 0.99

    def get_action(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0)
        action_scores = self.policy_network(state)
        action_probs = torch.softmax(action_scores, dim=1)
        action = np.random.choice(len(action_probs[0]), p=action_probs.detach().numpy()[0])
        log_prob = torch.log(action_probs[0, action])
        return action, log_prob

    def update_policy(self, rewards, log_probs):
        discounted_rewards = []
        for t in range(len(rewards)):
            Gt = 0 
            pw = 0
            for r in rewards[t:]:
                Gt = Gt + self.gamma**pw * r
                pw = pw + 1
            discounted_rewards.append(Gt)

        discounted_rewards = torch.tensor(discounted_rewards)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9) # normalize discounted rewards

        policy_gradient = []
        for log_prob, Gt in zip(log_probs, discounted_rewards):
            policy_gradient.append(-log_prob * Gt)

        self.optimizer.zero_grad()
        policy_gradient = torch.stack(policy_gradient).sum()
        policy_gradient.backward()
        self.optimizer.step()

You probably understand most of the code above (assuming you know how a basic neural network works), except for **get_action** and **update_policy**.

___

**get_action**
1. We pass the current state to the policy network.
2. We apply softmax (oh softmax!) to the network output to get a probability distribution over actions (how likely we are to choose each action).
3. We sample an action according to these probabilities and return both the chosen *action* and the log probability of choosing that *action*. We need the log probability because of the [Policy Gradient Theorem](https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d), the mathematical theory behind this algorithm.
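
The three steps above can be traced by hand in a small sketch. It uses plain Python instead of PyTorch so the arithmetic stays visible. The helper names `softmax` and `sample_action` and the example scores are made up for illustration and are not part of the agent above.

```python
import math
import random

def softmax(scores):
    """Turn raw network scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(scores, rng=random):
    """Sample an action index and return it with its log probability."""
    probs = softmax(scores)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

# Hypothetical scores for CartPole's two actions (push left, push right)
scores = [0.2, 1.3]
action, log_prob = sample_action(scores)
print(softmax(scores), action, log_prob)
```

Because the action is sampled rather than picked greedily, the less likely action is still tried occasionally, which is what lets the agent explore.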

**update_policy**
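1. Initialize an empty list and fill it with the discounted return of every time step.
2. Use these returns, computed from the *rewards* collected while playing the game, together with the stored log probabilities to update the policy according to the pseudocode from HuggingFace shown below.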


**One more thing**
- If you are wondering why we normalize the discounted rewards, check [this](https://datascience.stackexchange.com/questions/20098/why-do-we-normalize-the-discounted-rewards-when-doing-policy-gradient-reinforcem).
- It is okay if you don't follow every detail of this algorithm. Put simply, it uses gradient ascent on the policy network's parameters to maximize rewards: actions that were followed by high returns become more likely.
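
To see gradient ascent on a policy in the smallest possible setting, here is a hedged sketch of a REINFORCE-style update on a two-armed bandit in plain Python. This is a toy stand-in, not the agent above: the function `train_bandit`, the reward values, and the hyperparameters are all assumptions chosen for illustration.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit(arm_rewards, n_steps=200, lr=0.1, seed=0):
    """One score (theta) per arm, softmax policy, REINFORCE-style updates."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)
    for _ in range(n_steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        reward = arm_rewards[action]
        # d/d theta_i of log pi(action) = 1[i == action] - pi_i
        for i in range(len(theta)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += lr * reward * grad_log_prob  # ascent: step along the gradient
    return softmax(theta)

# Arm 1 pays a reward of 1, arm 0 pays nothing
final_probs = train_bandit(arm_rewards=[0.0, 1.0])
print(final_probs)  # the rewarding arm should end up with probability above 0.5
```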
---

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/pg_pseudocode.png" alt="Policy gradient pseudocode"/>
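
To make the discounted return $G_t = \sum_{k=0}^{T-1-t} \gamma^k r_{t+k}$ used in **update_policy** concrete, here is a minimal sketch in plain Python. It computes the same values as the nested loop in the agent, but in a single backward pass using $G_t = r_t + \gamma G_{t+1}$, and then normalizes them. The function names and the three-step reward list are illustrative assumptions, not part of the original code.

```python
import math

def discounted_returns(rewards, gamma):
    """Return G_t for every time step t, computed from the last step backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

def normalize(values, eps=1e-9):
    """Shift to zero mean and scale by the sample standard deviation (needs >= 2 values)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return [(v - mean) / (math.sqrt(var) + eps) for v in values]

rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
print(normalize(returns))
```

Earlier steps get larger returns because more future reward follows them. After normalization, steps with above-average returns have their actions reinforced and steps with below-average returns have theirs discouraged.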

**Question 4.1**: Is the action space here **discrete** or **continuous**?

**Question 4.2**: Is the observation space **discrete** or **continuous**?

Next, we will train the agent to play [CartPole](https://gymnasium.farama.org/environments/classic_control/cart_pole/)!

In [None]:
# Training the Agent

env = gym.make("CartPole-v1", render_mode="rgb_array")
agent = SoftmaxAgent(env.observation_space.shape[0], env.action_space.n)
n_episodes = 80

for episode in range(n_episodes):
    state = env.reset()[0]
    log_probs = []
    rewards = []
    done = False

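    while not done:
        action, log_prob = agent.get_action(state)
        # get_action returns a NumPy integer, so convert it to a plain int for the environment
        new_state, reward, terminated, truncated, _ = env.step(int(action))
        done = terminated or truncated  # the episode failed, or the time limit was reached

        log_probs.append(log_prob)
        rewards.append(reward)
        state = new_state
        if done:
            agent.update_policy(rewards, log_probs)
            episode_reward = sum(rewards)
            print("Episode " + str(episode) + ": " + str(episode_reward))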


Let's see how well our agent does on the game!!

In [None]:
from IPython import display
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# reset the environment and keep the initial observation
state, _ = env.reset()

img = plt.imshow(env.render()) # only call this once

done = False
while not done:

  img.set_data(env.render()) # just update the data
  display.display(plt.gcf())
  display.clear_output(wait=True)

  action, _ = agent.get_action(state) # sample an action from the trained policy
  state, reward, terminated, truncated, _ = env.step(int(action)) # advance and keep the new observation
  done = terminated or truncated

env.close()

As you saw from the table above, Policy Gradient also works for scenarios with a continuous action space!

In [2]:
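class GaussianPolicy(nn.Module):
    def __init__(self, n_inputs, n_outputs):
        super(GaussianPolicy, self).__init__()
        self.fc1 = nn.Linear(n_inputs, 16)
        self.fc2 = nn.Linear(16, 16)
        self.mean_head = nn.Linear(16, n_outputs)     # mean of each action dimension
        self.log_std_head = nn.Linear(16, n_outputs)  # log of the standard deviation
        self.ReLU = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.ReLU(x)
        x = self.fc2(x)
        x = self.ReLU(x)

        # no ReLU on the outputs: the mean must be able to take negative values
        mean = self.mean_head(x)
        # clamping is a stability choice made here; the bounds are not tuned
        log_std = torch.clamp(self.log_std_head(x), min=-5.0, max=2.0)
        std_dev = torch.exp(log_std) # std deviation must be positive, hence we take exp.
        return mean, std_dev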

class GaussianAgent:
    def __init__(self, n_inputs, n_outputs):
        self.policy_network = GaussianPolicy(n_inputs, n_outputs)
        self.optimizer = optim.Adam(self.policy_network.parameters(), lr=0.01)
        self.gamma = 0.99

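    def get_action(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0)
        mean, std_dev = self.policy_network(state)
        normal_distribution = torch.distributions.Normal(mean, std_dev)
        action = normal_distribution.sample()
        # sum over action dimensions so each step contributes a single scalar log probability
        log_prob = normal_distribution.log_prob(action).sum()
        # flatten to a 1-D tensor, the shape the environment expects for an action
        return action.reshape(-1), log_prob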

    def update_policy(self, rewards, log_probs):
        discounted_rewards = []
        for t in range(len(rewards)):
            Gt = 0
            pw = 0
            for r in rewards[t:]:
                Gt = Gt + self.gamma**pw * r
                pw = pw + 1
            discounted_rewards.append(Gt)

        discounted_rewards = torch.tensor(discounted_rewards)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9) # normalize discounted rewards

        policy_gradient = []
        for log_prob, Gt in zip(log_probs, discounted_rewards):
            policy_gradient.append(-log_prob * Gt)

        self.optimizer.zero_grad()
        policy_gradient = torch.stack(policy_gradient).sum()
        policy_gradient.backward()
        self.optimizer.step()

Unlike the softmax policy above, which outputs a probability for each of a fixed number of discrete actions, the Gaussian policy outputs the mean and standard deviation of a [Normal Distribution](https://en.wikipedia.org/wiki/Normal_distribution), which is suitable for a continuous action space. Think of it as a function that accepts a state as input and returns a distribution over real-valued actions, from which we sample the action to take.
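
As a rough illustration of what `torch.distributions.Normal` does inside `get_action`, the sketch below samples a continuous action and evaluates its log probability in plain Python. The mean and standard deviation are hard-coded example values standing in for the network outputs, and the helper name `normal_log_prob` is hypothetical.

```python
import math
import random

def normal_log_prob(x, mean, std_dev):
    """Log density of a Normal(mean, std_dev) distribution evaluated at x."""
    var = std_dev ** 2
    return -((x - mean) ** 2) / (2 * var) - math.log(std_dev) - 0.5 * math.log(2 * math.pi)

# Example values standing in for the policy network's outputs
mean, std_dev = 0.5, 0.8

action = random.gauss(mean, std_dev)  # sample a real-valued action
log_prob = normal_log_prob(action, mean, std_dev)
print(action, log_prob)
```

Actions close to the mean get a higher log probability than actions far from it, so scaling the log probability by the return pulls the mean toward actions that worked well.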

In [None]:
env = gym.make("Pendulum-v1", render_mode="rgb_array")
agent = GaussianAgent(env.observation_space.shape[0], env.action_space.shape[0])
n_episodes = 10

for episode in range(n_episodes):
    state = env.reset()[0]
    log_probs = []
    rewards = []
    done = False

    while not done:
        action, log_prob = agent.get_action(state)
        # Pendulum expects a 1-D action array, so flatten the sampled action
        new_state, reward, terminated, truncated, _ = env.step(action.numpy().reshape(-1))
        # Pendulum never terminates on its own; the episode ends when the time limit truncates it
        done = terminated or truncated
        log_probs.append(log_prob)
        rewards.append(reward)
        state = new_state.reshape(-1)

        if done:
            agent.update_policy(rewards, log_probs)
            episode_reward = sum(rewards)
            print("Episode " + str(episode) + ": " + str(episode_reward))

Let's see how well you have been following along with the lesson!

**Question 4.3**: Is the action space here **discrete** or **continuous**?

How about this game? How well does our agent perform?

In [None]:
from IPython import display
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

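# reset the environment and keep the initial observation
state, _ = env.reset()

img = plt.imshow(env.render()) # only call this once

done = False
while not done:

  img.set_data(env.render()) # just update the data
  display.display(plt.gcf())
  display.clear_output(wait=True)

  action, _ = agent.get_action(state) # sample an action from the trained policy
  # flatten the action to 1-D and keep the new observation for the next step
  new_state, reward, terminated, truncated, _ = env.step(action.numpy().reshape(-1))
  state = new_state.reshape(-1)
  done = terminated or truncated # Pendulum only stops when the time limit is reached

env.close()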