# Session 05 - Policy Gradients - Assignment

In this assignment you will implement REINFORCE, a policy gradient method based on Monte Carlo sampling.
The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.
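
In symbols, this idea is commonly written as the gradient estimate below. This is the standard textbook form of REINFORCE, not something taken from the assignment text. The notation is introduced here for illustration: $\pi_\theta$ is the policy with parameters $\theta$, $G_t$ is the discounted return from step $t$, $\gamma$ is the discount factor (`gamma` in the example code) and $\alpha$ is a step size.

$$
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k,
\qquad
\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} G_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad
\theta \leftarrow \theta + \alpha \,\nabla_\theta J(\theta)
$$

An action followed by a large return $G_t$ gets its log-probability pushed up strongly; an action followed by a small (or, after normalization, negative) return gets pushed up less or pushed down.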

The REINFORCE algorithm comprises the following steps:
0. Initialize and reset the environment.
1. Get the state from the environment.
2. Feed the state forward through the policy network to predict a probability for each action. We’ll sample from this distribution to choose which action to take (i.e. toss a biased coin). This implies that the output layer of the neural network has a softmax activation function.
3. Receive the reward and the next state from the environment for the action we took.
4. Store this transition (state, action, reward) for later training.
5. Repeat steps 1–4 until the environment returns the done flag, which means the episode is over.
6. Once the episode is over, we train our neural network to learn from our stored transitions using our gradient update rule. After training you can clear the stored states, actions and rewards from the memory.
7. Play the next episode and repeat the steps above until convergence.
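
The steps above can be sketched without any deep learning library. The following is a minimal illustration, not the intended solution: `ToyEnv` is a hypothetical five-step environment invented for this example (action 1 pays a reward of 1, action 0 pays nothing), and the "policy network" is reduced to one logit per action. All names (`softmax`, `ToyEnv`, `run_reinforce`) are assumptions made for the sketch.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyEnv:
    """Hypothetical 5-step environment: action 1 pays +1, action 0 pays 0."""

    def reset(self):
        self.t = 0
        return 0  # a single dummy state

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= 5
        return 0, reward, done


def run_reinforce(episodes=200, alpha=0.1, gamma=0.99, seed=0):
    rng = random.Random(seed)
    env = ToyEnv()                                   # step 0: initialize
    logits = [0.0, 0.0]                              # policy parameters
    history = []                                     # total reward per episode
    for _ in range(episodes):                        # step 7: next episode
        state = env.reset()                          # step 1: get the state
        memory = []                                  # cleared for every episode
        done = False
        while not done:                              # step 5: repeat until done
            probs = softmax(logits)                  # step 2: action probabilities
            action = rng.choices([0, 1], weights=probs)[0]  # biased coin toss
            next_state, reward, done = env.step(action)     # step 3
            memory.append((state, action, probs, reward))   # step 4
            state = next_state
        # step 6: compute discounted returns backwards, then update the policy
        returns, g = [], 0.0
        for _, _, _, r in reversed(memory):
            g = r + gamma * g
            returns.insert(0, g)
        for (_, a, p, _), g_t in zip(memory, returns):
            for i in range(len(logits)):
                # gradient of log softmax w.r.t. logit i: one_hot(a)[i] - p[i]
                grad_log = (1.0 if i == a else 0.0) - p[i]
                logits[i] += alpha * g_t * grad_log
        history.append(sum(t[3] for t in memory))
    return logits, history
```

Calling `run_reinforce()` returns the final logits and the per-episode total rewards; the latter is the raw material for a learning curve. In the assignment itself the logits are replaced by a Keras model and `ToyEnv` by a Gym environment.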

In [1]:
import gym
import numpy as np

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.losses import CategoricalCrossentropy

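# Enable GPU memory growth only when a GPU is actually present
# (indexing physical_devices[0] fails on CPU-only machines)
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if physical_devices:
  tf.config.experimental.set_memory_growth(physical_devices[0], True)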

tf.get_logger().setLevel('ERROR')

## 1. The cartpole environment

- Solve the OpenAI Gym cartpole environment with REINFORCE. 
- Plot the reward history (learning curve). For each episode plot the cumulative reward, or even better: plot average cumulative reward over a certain number of episodes (for example 50 episodes).
- What is the effect of alpha, the learning rate for the gradient?
- Do hyperparameter tuning to increase the speed of learning.
- Compare the results to the ones of Deep Q-learning. Check the speed of learning, consistency of the results, etc.
- Explain why the agent explores a lot in the beginning and gradually exploits more and more.
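
For the learning curve, a running average smooths out the episode-to-episode noise. The helper below is one possible way to compute it, assuming the per-episode cumulative rewards are collected in a plain list (for example the `total_rewards` attribute of the example class). The function name `moving_average` and its `window` parameter are made up for this sketch.

```python
def moving_average(rewards, window=50):
    """Average of the last `window` rewards at every episode.

    For the first episodes, where fewer than `window` values exist,
    the average is taken over the values seen so far.
    """
    averages = []
    running_sum = 0.0
    for i, r in enumerate(rewards):
        running_sum += r
        if i >= window:
            # drop the reward that just fell out of the window
            running_sum -= rewards[i - window]
        count = min(i + 1, window)
        averages.append(running_sum / count)
    return averages


print(moving_average([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```

The returned list has the same length as the input, so it can be plotted against the episode index alongside the raw rewards.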

In [4]:
# Implementation of REINFORCE for the Cartpole environment



## 2. Lunarlander environment

- Solve the OpenAI Gym Lunarlander environment with REINFORCE. 
- Plot the reward history (learning curve). For each episode plot the cumulative reward, or even better: plot average cumulative reward over a certain number of episodes (for example 50 episodes).
- Compare the results to the ones of Deep Q-learning. Check the speed of learning, consistency of the results, etc.

In [3]:
# Implementation of REINFORCE for the Lunarlander environment 



## 3. CarRacing environment ---- OPTIONAL ----

- Solve the OpenAI Gym CarRacing environment with REINFORCE. 
- Plot the reward history (learning curve). For each episode plot the cumulative reward, or even better: plot average cumulative reward over a certain number of episodes (for example 50 episodes).
- Compare the results to the ones of Deep Q-learning. Check the speed of learning, consistency of the results, etc.
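
One practical point: in the classic Gym release, CarRacing exposes a continuous action space (steering, gas, brake), whereas a softmax policy outputs probabilities over a finite set of actions. A common workaround is to pick a small set of fixed action vectors and let the policy choose among them. The sketch below shows this idea only; the action set `DISCRETE_ACTIONS` and the helper `to_continuous` are assumptions chosen for illustration, and the exact action format should be checked against the installed Gym version.

```python
# Hypothetical discrete action set; each entry is [steering, gas, brake].
DISCRETE_ACTIONS = [
    [0.0, 0.0, 0.0],   # do nothing
    [-1.0, 0.0, 0.0],  # steer left
    [1.0, 0.0, 0.0],   # steer right
    [0.0, 1.0, 0.0],   # accelerate
    [0.0, 0.0, 0.8],   # brake
]


def to_continuous(action_index):
    """Map the index sampled from the softmax output to an action vector."""
    return list(DISCRETE_ACTIONS[action_index])


n_actions = len(DISCRETE_ACTIONS)  # size of the softmax output layer
```

With such a mapping, the policy network has `n_actions` outputs and the vector returned by `to_continuous` is what gets passed to the environment's `step` method.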

## EXAMPLE CODE

In [None]:
import gym
import numpy as np

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.losses import CategoricalCrossentropy

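# Enable GPU memory growth only when a GPU is actually present
# (indexing physical_devices[0] fails on CPU-only machines)
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if physical_devices:
  tf.config.experimental.set_memory_growth(physical_devices[0], True)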

tf.get_logger().setLevel('ERROR')

class REINFORCE:
  def __init__(self, env, path=None):
    self.env=env 
    self.state_shape=env.observation_space.shape # shape of the observation (state) vector
    self.action_shape=env.action_space.n # number of discrete actions
    self.gamma=0.99 # discount factor for future rewards
    self.alpha=1e-4 # learning rate of gradient
    self.learning_rate=0.01 # learning rate of the deep learning model's optimizer
    
    if not path:
      self.model=self.build_policy_network() #build model
    else:
      self.model=self.load_model(path) #import model

    # record observations
    self.states=[]
    self.gradients=[] 
    self.rewards=[]
    self.probs=[]
    self.discounted_rewards=[]
    self.total_rewards=[]
    
    

  def build_policy_network(self):
    

    # BUILD MODEL ####################################################################
        
    return model

  def hot_encode_action(self, action):

    action_encoded=np.zeros(self.action_shape)
    action_encoded[action]=1

    return action_encoded
  
  def remember(self, state, action, action_prob, reward):
    
    # STORE EACH STATE, ACTION AND REWARD into the episodic memory #############################


  def compute_action(self, state):


    # COMPUTE THE ACTION FROM THE SOFTMAX PROBABILITIES

    return action, action_probability_distribution


  def get_discounted_rewards(self, rewards): 
   
    discounted_rewards=[]
    cumulative_total_return=0
    # iterate over the rewards backwards and calculate the total return
    for reward in rewards[::-1]:      
      cumulative_total_return=(cumulative_total_return*self.gamma)+reward
      discounted_rewards.insert(0, cumulative_total_return)

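    # normalize discounted rewards
    # (convert the list to an array first so the arithmetic is element-wise)
    discounted_rewards=np.array(discounted_rewards)
    mean_rewards=np.mean(discounted_rewards)
    std_rewards=np.std(discounted_rewards)
    norm_discounted_rewards=(discounted_rewards-
                          mean_rewards)/(std_rewards+1e-7) # avoiding zero div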
    
    return norm_discounted_rewards

  def train_policy_network(self):
       
    # get X_train
    states=np.vstack(self.states)

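    # get y_train: shift the predicted probabilities along the return-weighted gradients
    gradients=np.vstack(self.gradients)
    rewards=np.vstack(self.rewards)
    discounted_rewards=self.get_discounted_rewards(rewards)
    gradients*=discounted_rewards # weight each step's gradient by its normalized return
    # stack the stored probabilities so the shapes match (one row per time step)
    y_train=np.vstack(self.probs)+self.alpha*gradients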
    history=self.model.train_on_batch(states, y_train)
    
    self.states, self.probs, self.gradients, self.rewards=[], [], [], []

    return history


  def train(self, episodes):
     


 

ENV="CartPole-v1"

N_EPISODES=500



# set the env
env=gym.make(ENV) # create the environment
env.reset() # reset the env to its initial state

Agent = REINFORCE(env)

Agent.train(N_EPISODES)
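
To make the return computation in `get_discounted_rewards` concrete, here is a small worked example in plain Python. It follows the same backward recursion and the same normalization (population standard deviation, as `np.std` uses by default), but the function names `discounted_returns` and `normalize` are made up for this sketch and a small `gamma` is used so the numbers are easy to check by hand.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every time step t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return returns


def normalize(values, eps=1e-7):
    """Shift to zero mean and scale to (roughly) unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]

normalized = normalize(returns)
# early steps end up above zero, late steps below zero
```

Working backwards: the last step has return 1.0, the step before it 1 + 0.5 * 1.0 = 1.5, and the first step 1 + 0.5 * 1.5 = 1.75. After normalization, early actions in an episode receive a positive weight and late actions a negative one. This acts like a simple baseline and tends to reduce the variance of the update.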