# Model-free Reinforcement Learning with policy gradients

In this exercise you will implement policy gradient, a model-free reinforcement learning algorithm. The tasks are:

    Explore the environment
    Design the policy
    Compute action selection
    Estimate the expected return
    Backpropagate through the policy
    

In [5]:
# import all necessary modules
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import gym
import numpy as np
from itertools import count
%matplotlib inline  
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

eps = np.finfo(np.float32).eps.item()

In [6]:
# take a look at the environment
gamma = 0.99  # discount factor
render = True  # render the environment
log_interval = 10

env = gym.make('CartPole-v1')

for i_episode in range(10):
    state, ep_reward = env.reset(), 0
    for t in range(1, 10000):  # Don't infinite loop while learning
        action = env.action_space.sample()
        state, reward, done, _ = env.step(action)
        if render:
            env.render()
        if done:
            break
env.close()            

# Get familiar with the environment

Describe the action and state space. (discrete or continuous, how many actions (states) are available, what do the actions (states) describe?)

State space:

The state space is the set of all states the agent can be in. When the agent takes an action, it moves from one state to another within this space. A discrete space contains a finite (or countable) number of states, while a continuous space contains infinitely many. In CartPole the state space is continuous: each state is a vector of four real numbers, namely the cart position, the cart velocity, the pole angle, and the pole angular velocity.

Action space:

The action space is the set of all actions the agent can take in its environment. Which action the agent takes in a given state is determined by the policy, a map from states to a probability distribution over actions. A discrete action space contains a finite number of actions, while a continuous action space contains infinitely many. In CartPole the action space is discrete with two actions: push the cart to the left (0) or to the right (1).

# Define the policy

Go to 'nn.py' to implement the policy (Task 1.1 & Task 1.2). Unlike in Assignment 1, you define the neural network policy as a class. In PyTorch, a minimal neural network class requires at least two functions: the initialization and the forward function. During initialization you create all layer instances (e.g. fully connected or convolutional layers) and set any additional variables such as lists or constants. In the forward function you connect these instances to form the network.

    Policy architecture:
    fully connected layer (#states -> 128)
    dropout layer (p=0.6)
    ReLU activation
    fully connected layer (128 -> #actions)
    softmax activation
    
Additionally, the policy needs two lists: one for the log-probabilities of the chosen actions and one for the rewards of a single trajectory.
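
To illustrate what the final softmax activation of the architecture does, here is a minimal framework-free sketch. It is not part of the exercise code; the function name `softmax` and the example logits are hypothetical, and in the notebook itself the corresponding PyTorch function would be used instead.

```python
import math

def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of the last fully connected layer for two actions.
example_logits = [2.0, 0.0]
example_probs = softmax(example_logits)
print(example_probs)  # two positive numbers that sum to 1
```

The larger logit receives the larger probability, which is what allows the network output to be interpreted as a distribution over actions.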

In [None]:
class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()
        # Task 1.1:
        # Initialize all necessary layers 
        # Use the layers from torch.nn
        # Don't forget to initialize two lists, one for the log probs and one for the rewards

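        ######################################## START OF YOUR CODE ########################################
        # Read the layer sizes from the global CartPole environment `env`
        num_of_states = env.observation_space.shape[0]
        num_of_actions = env.action_space.n
        self.fc1 = nn.Linear(num_of_states, 128)
        self.dropout1 = nn.Dropout(p=0.6)
        self.fc2 = nn.Linear(128, num_of_actions)
        # ReLU and softmax have no parameters; they are applied in forward() via torch.nn.functional
        self.log_probs = []  # log-probabilities of the sampled actions of one trajectory
        self.rewards = []    # rewards of one trajectory

        ######################################## END OF YOUR CODE ##########################################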

    def forward(self, x):
        # Task 1.2:
        # Now, connect all layers to a neural network and return the probabilities for all actions
        # Use activation functions from torch.nn.functional

        ######################################## START OF YOUR CODE ########################################
        # Connect the layers: fc1 -> dropout -> ReLU -> fc2 -> softmax
        output = self.fc1(x)  # x is the batch of input states, shape (batch, #states)
        output = self.dropout1(output)
        output = F.relu(output)
        output = self.fc2(output)
        return F.softmax(output, dim=1)  # probabilities over the actions for each state
        ######################################## END OF YOUR CODE ##########################################


Furthermore, initialize an instance of the neural network policy. To update the weights, use an Adam optimizer (torch.optim) with a learning rate of lr=1e-2. 

In [None]:
# Task 1.3:
# Initialize the policy and the optimizer
######################################## START OF YOUR CODE ########################################
policy = Policy()
# Adam updates all trainable parameters of the policy with a learning rate of 1e-2
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

######################################## END OF YOUR CODE ##########################################

# Sample an action from the probability distribution over actions of the policy

Now, we need to sample an action from the probability distribution computed by the policy. First, convert the state to a torch tensor. Then, compute the action probabilities and define a categorical distribution over them. Finally, sample an action from this distribution and save the log probability of this action in the corresponding list of the policy.
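
The idea of sampling from a categorical distribution and recording the log-probability can be sketched without PyTorch. The helper `sample_categorical` and the probabilities below are hypothetical and only meant to show the mechanics; the exercise itself is meant to be solved with the distribution class imported at the top of the notebook.

```python
import math
import random

def sample_categorical(probs, rng):
    """Draw an action index according to probs; return it with its log-probability."""
    actions = list(range(len(probs)))
    action = rng.choices(actions, weights=probs, k=1)[0]
    return action, math.log(probs[action])

rng = random.Random(0)  # fixed seed so the sketch is reproducible
probs = [0.7, 0.3]      # hypothetical action probabilities produced by a policy
draws = [sample_categorical(probs, rng) for _ in range(1000)]
share_of_action_0 = sum(1 for a, _ in draws if a == 0) / len(draws)
print(share_of_action_0)  # should be roughly 0.7
```

Because actions are drawn in proportion to their probability, the policy keeps exploring: the less likely action is still chosen from time to time.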

In [None]:
# Task 2.1:
# Define the function for sampling an action
def sample_action(state):
    ######################################## START OF YOUR CODE ########################################
    # Convert the state to a 32-bit float tensor and add a batch dimension
    state = torch.from_numpy(state).float().unsqueeze(0)
    # Compute the action probabilities with the policy
    probs = policy(state)
    # Define a categorical distribution over the action probabilities
    distribution = Categorical(probs)
    # Sample an action from the distribution
    action = distribution.sample()
    # Store the log-probability of the sampled action in the policy
    policy.log_probs.append(distribution.log_prob(action))

    ######################################## END OF YOUR CODE ##########################################
    return action.item()


# Estimate the expected returns

In the following, you need to compute the estimated return for each state-action pair with respect to the corresponding trajectory. In order to compute the estimate of the expectation over the remaining discounted rewards it is quite reasonable to iterate backwards through the reward list stored in the policy. 
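
The backward iteration can be illustrated on a toy trajectory. This is a plain-Python sketch with hypothetical names (`discounted_returns`, `example_returns`) and a deliberately small discount factor so the numbers are easy to check by hand. It appends and reverses once at the end; inserting each value at the front of the list is an equivalent alternative.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted sum of future rewards for every step of a trajectory."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the return of its successor.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    # Restore chronological order.
    returns.reverse()
    return returns

example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example_returns)  # [1.75, 1.5, 1.0]
```

The last step only sees its own reward (1.0), the middle step sees 1 + 0.5 * 1 = 1.5, and the first step sees 1 + 0.5 * 1.5 = 1.75.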

In [None]:
# Task 4.1: 
# Define a running variable for the current rewards and a list to store the expected return for each state.
# Iterate backwards through all rewards stored in the corresponding list of the policy and compute the discounted 
# expected return for all remaining states of the current trajectory. (gamma is already defined)
# Note, when iterating backwards, appending the expected return is not a smart move!!!
def estimate_return():
    ######################################## START OF YOUR CODE ########################################
    R = 0  # running discounted return, updated incrementally
    returns = []  # expected return for every state of the trajectory

    for r in policy.rewards[::-1]:  # iterate backwards through the trajectory
        # r is the reward of the current step, R the discounted return of the steps after it
        R = r + gamma * R
        # insert at the front so that the returns end up in chronological order
        returns.insert(0, R)

    # Normalize the returns to stabilize training; eps avoids a division by zero
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + eps)

    ######################################## END OF YOUR CODE ##########################################
    return returns
    

# Compute the loss and update the weights

Using the estimates of the expected discounted return, you need to update the weights of the policy. Remember that the only difference between the gradient in supervised learning and the gradient of the RL objective in policy gradient is the weighting factor based on the expected return (see the slides of the lecture). To compute the loss of a trajectory, sum the negative log-likelihood of the chosen action multiplied by the expected return of the current state over all states of the trajectory. Then, use the backward function to compute the gradient and update the weights using the optimizer.
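
As a worked example of the loss described above, the following sketch evaluates it on a hypothetical two-step trajectory in plain Python. The function name and all numbers are made up for illustration; it only computes the scalar loss and does not perform any gradient computation.

```python
import math

def policy_gradient_loss(log_probs, returns):
    """Sum of negative log-likelihoods of the chosen actions, weighted by the returns."""
    return sum(-lp * g for lp, g in zip(log_probs, returns))

# Hypothetical probabilities the policy assigned to the actions it actually took.
chosen_action_probs = [0.5, 0.25]
toy_log_probs = [math.log(p) for p in chosen_action_probs]
toy_returns = [2.0, 1.0]

toy_loss = policy_gradient_loss(toy_log_probs, toy_returns)
print(toy_loss)  # 2 * ln(2) + 1 * ln(4) = 4 * ln(2), about 2.77
```

Minimizing this quantity raises the log-probability of actions with a positive return and lowers it for actions with a negative return, which is the weighting mentioned in the text.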

In [None]:
# Task 5.1:
# Define a list for the losses of each state in a trajectory
# Compute the individual loss for each state based on the negative log likelihood of the action and the expected
# return of the current state (you have to iterate over the complete trajectory)
# Prepare the optimizer for an update step
# Compute the overall loss of the policy (sum)
# Compute the gradient
# Update the weights of the network
# Don't forget to delete the stored rewards and the action probs in the corresponding lists of the policy
def update_weights(returns):
    ######################################## START OF YOUR CODE ########################################
    losses = []

    # zip iterates over both lists at the same time
    for log_prob, R in zip(policy.log_probs, returns):
        # negative log-likelihood of the chosen action, weighted by the expected return
        losses.append(-log_prob * R)
    # Reset the gradients accumulated by the optimizer
    optimizer.zero_grad()
    # Sum the per-step losses; each entry has shape (1,), so they can be concatenated
    loss = torch.cat(losses).sum()
    # Backpropagate and take one optimization step
    loss.backward()
    optimizer.step()
    # Clear the stored log-probabilities and rewards for the next trajectory
    del policy.log_probs[:]
    del policy.rewards[:]

    ######################################## END OF YOUR CODE ##########################################

# Putting it all together

The last step is to combine everything. Update the policy with 1000 trajectory samples. The OpenAI Gym environments return a terminal flag that signals a terminal state. Remember that policy gradient is on-policy, meaning that new samples have to be generated after every single update step. Visualize the evolution of the rewards over time in a plot. Use an exponentially weighted moving average over the rewards to get a smooth graph (delta = 0.05). Additionally, show the cumulative reward of every trajectory in the same plot. After every ten trajectories, print the trajectory counter, the averaged reward and the last reward.
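
The exponentially weighted moving average can be sketched as follows. The update rule `delta * new + (1 - delta) * old` is an assumption about how delta is meant to be used, and the function name `ewma` and the reward values are hypothetical.

```python
def ewma(values, delta, start):
    """Exponentially weighted moving average; delta is the weight of each new value."""
    smoothed = []
    running = start
    for v in values:
        running = delta * v + (1 - delta) * running
        smoothed.append(running)
    return smoothed

episode_rewards = [10.0, 20.0, 30.0]  # hypothetical cumulative rewards of three episodes
smoothed_rewards = ewma(episode_rewards, delta=0.05, start=10.0)
print(smoothed_rewards)  # approximately 10.0, 10.5, 11.475
```

With a small delta the average reacts slowly to individual episodes, which is why the resulting curve is much smoother than the raw rewards.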

In [None]:
# Task 6.1:
# Generate 1000 trajectories with a maximum length of 10000. 
running_reward = 10
render = False

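######################################## START OF YOUR CODE ########################################
delta = 0.05  # weight of the newest reward in the exponentially weighted moving average
true_reward = []      # cumulative reward of every trajectory
averaged_reward = []  # exponentially weighted moving average of the rewards

for i_episode in range(1, 1001):
    state, ep_reward = env.reset(), 0
    for t in range(1, 10000):  # maximum trajectory length
        action = sample_action(state)
        state, reward, done, _ = env.step(action)
        if render:
            env.render()
        policy.rewards.append(reward)
        ep_reward += reward
        if done:
            break

    running_reward = delta * ep_reward + (1 - delta) * running_reward
    true_reward.append(ep_reward)
    averaged_reward.append(running_reward)

    # On-policy: update the weights with the trajectory that was just sampled
    update_weights(estimate_return())

    # From Kai Lagemann's email: stop the learning process once the exponentially
    # weighted moving average is higher than the threshold
    if i_episode % log_interval == 0:
        print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
                  i_episode, ep_reward, running_reward))
    if running_reward > env.spec.reward_threshold:
        print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(running_reward, t))
        break
######################################## END OF YOUR CODE ##########################################
plt.plot(true_reward)
plt.plot(averaged_reward)
plt.show()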

# Render your learned policy

In [None]:
# You don't have to do anything here.
env = gym.make('CartPole-v1')
render = True
for i_episode in range(10):
    state, ep_reward = env.reset(), 0
    for t in range(1, 10000):  # Don't infinite loop while learning
        action = sample_action(state)
        state, reward, done, _ = env.step(action)
        if render:
            env.render()
        if done:
            break
env.close()       