<center>
    COMP4600/5500 - Reinforcement Learning

# Homework 8 - Policy Gradient

### Due: Monday, November 29th 11:59 pm
    
</center>

Student Name: ______________________ 

The purpose of this project is to study different properties of Policy Gradient algorithms with Function Approximation.

In [1]:
# You are allowed to use the following modules
import numpy as np
import matplotlib.pyplot as plt
import math
import random

## Task description
Consider the task of driving an underpowered car up a steep mountain road, as suggested by the diagram in the upper left of the following figure. The difficulty is that gravity is stronger than the car's engine, and even at full throttle the car cannot accelerate up the steep slope. The only solution is to first move away from the goal and up the opposite slope on the left. Then by applying full throttle the car can build up enough inertia to carry it up the steep slope even though it is slowing down the whole way.


![mc.png](attachment:mc.png)


This is a continuous control task where things have to get worse in a sense (farther from the goal) before they can get better. The reward in this problem is -1 on all time steps until the car moves past its goal position at the top of the mountain, which ends the episode. There are three possible actions: full throttle forward (+1), full throttle reverse (-1), and zero throttle (0). The car moves according to simplified physics. Its position $x_t$ and velocity $\dot{x}_t$ are updated by

$x_{t+1} \doteq \text{bound}[x_t + \dot{x}_{t+1}]$

$\dot{x}_{t+1} \doteq \text{bound}[\dot{x}_t + 0.001 A_t - 0.0025 \cos(3x_t)]$


where the *bound* operation enforces $-1.2 \le x_{t+1} \le 0.5$ and $-0.07 \le \dot{x}_{t+1} \le 0.07$. In addition, when $x_{t+1}$ reaches the left bound, $\dot{x}_{t+1}$ is reset to zero. When it reaches the right bound, the goal is reached and the episode terminates. Each episode starts from a random position $x_0 \in [-0.6, -0.4)$ with zero velocity.
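
To make the update rule concrete, the following minimal sketch applies one step of the dynamics described above. The names `bound` and `mountain_car_step` are illustrative choices for this example and are not part of the provided implementation.

```python
import math

X_MIN, X_MAX = -1.2, 0.5      # position bounds from the task description
V_MIN, V_MAX = -0.07, 0.07    # velocity bounds from the task description

def bound(value, low, high):
    """Clip value to the interval [low, high]."""
    return min(max(value, low), high)

def mountain_car_step(x, x_dot, action):
    """Apply one time step; action must be -1, 0 or +1.

    Returns (next position, next velocity, episode finished).
    """
    # The velocity is updated first, and the new velocity moves the car
    x_dot_next = bound(x_dot + 0.001 * action - 0.0025 * math.cos(3 * x), V_MIN, V_MAX)
    x_next = bound(x + x_dot_next, X_MIN, X_MAX)
    if x_next == X_MIN:
        x_dot_next = 0.0          # hitting the left wall stops the car
    done = x_next == X_MAX        # reaching the right bound ends the episode
    return x_next, x_dot_next, done

# One step at full throttle forward from the middle of the start region
x1, v1, done = mountain_car_step(-0.5, 0.0, +1)
print(x1, v1, done)
```

Note that the position update uses the already updated velocity $\dot{x}_{t+1}$, which matches the order of the two equations above.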




**Note:** You have been given a simple implementation of the Mountain Car task. You can use your implementation of the function approximation from Homework 6, or implement a new one. 


## Part I (COMP4600)

Implement REINFORCE with Baseline (p. 330).
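
As a starting point, here is a minimal sketch of the per-episode update, assuming a linear state-value baseline $\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and a softmax policy with linear action preferences. The function names and the layout of the `episode` list are assumptions made for this example, not a required interface.

```python
import numpy as np

def softmax_probs(theta, x_sa):
    """Action probabilities for preferences h(s, a) = theta . x(s, a).

    x_sa has one row of state-action features per action.
    """
    prefs = x_sa @ theta
    e = np.exp(prefs - prefs.max())   # subtract the max for numerical stability
    return e / e.sum()

def reinforce_with_baseline_update(episode, theta, w, alpha_theta, alpha_w, gamma=1.0):
    """Apply the REINFORCE-with-baseline updates for one finished episode.

    episode is a list of (x_s, x_sa, a_idx, reward) tuples: the state features,
    the (n_actions, d) state-action features, the index of the action taken,
    and the reward received after taking it. theta and w are float arrays
    that are updated in place.
    """
    # Compute the return G_t of every time step by a backward pass
    returns = []
    G = 0.0
    for _, _, _, reward in reversed(episode):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()

    for t, (x_s, x_sa, a_idx, _) in enumerate(episode):
        delta = returns[t] - w @ x_s                 # return minus baseline
        probs = softmax_probs(theta, x_sa)
        grad_ln_pi = x_sa[a_idx] - probs @ x_sa      # gradient of ln pi(a | s)
        w += alpha_w * delta * x_s                   # gradient of v(s, w) is x(s)
        theta += alpha_theta * (gamma ** t) * delta * grad_ln_pi
    return theta, w

# Toy usage: one state, two actions with one-hot features, a single step
theta = np.zeros(2)
w = np.zeros(1)
episode = [(np.array([1.0]), np.eye(2), 0, 1.0)]
reinforce_with_baseline_update(episode, theta, w, alpha_theta=0.1, alpha_w=0.1)
print(theta, w)
```

Because the updates need the full return, the episode has to finish before any weight changes, which is the main practical difference from the actor-critic method.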



In [2]:
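from itertools import combinations, product
import numpy as np

def _build_coefficients(order, state_dim, max_non_zero):
    """Return the Fourier-basis coefficient vectors, one per row.

    The first row is all zeros (the constant feature). Every other row has at
    most max_non_zero non-zero entries, each an integer in 1..order.
    """
    rows = [np.zeros(state_dim)]
    for i in range(1, max_non_zero + 1):
        # Choose which state dimensions are non-zero, then their values
        for indices in combinations(range(state_dim), i):
            for c in product(range(1, order + 1), repeat=i):
                coef = np.zeros(state_dim)
                coef[list(indices)] = list(c)
                rows.append(coef)
    return np.vstack(rows)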

In [3]:
coeff = _build_coefficients(3,2,2)

In [4]:
coeff.shape

(16, 2)

In [5]:
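def Xi(state, c):
    """Fourier features cos(pi * c . s) of a state, as a column vector."""
    # Rescale position and velocity to [0, 1] before applying the basis
    pos = convert_position(state[0])
    vel = convert_velocity(state[1])
    t = np.dot(np.pi * c, np.array([pos, vel]).reshape(2, 1))
    return np.cos(t)  # shape (number of coefficient rows, 1)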

In [6]:
new_features={}
actions = [-1, 0 , 1]
for a in actions:
    new_features[a]=np.zeros(48).reshape(48,1)
pad=np.zeros((16,1))

In [7]:
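def new_Xi(position, velocity):
    """State-action features: the state features copied into each action's block.

    Returns a dict mapping each action to a (48, 1) column vector that is zero
    outside that action's 16-entry block.
    """
    feature = Xi((position, velocity), coeff)
    new_features[-1] = np.concatenate((feature, pad, pad))
    new_features[0] = np.concatenate((pad, feature, pad))
    new_features[1] = np.concatenate((pad, pad, feature))
    return new_features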

In [8]:
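def v(pos, vel, w, action):
    """State-value estimate v(s, w) = w . x(s).

    The critic estimates a state value, so `action` is unused; it is kept only
    so that existing calls keep working.
    """
    f = Xi((pos, vel), coeff)
    return float(np.dot(np.ravel(w), np.ravel(f)))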

In [9]:
POSITION_MIN = -1.2
POSITION_MAX = 0.5
VELOCITY_MIN = -0.07
VELOCITY_MAX = 0.07

In [10]:
def convert_position(input_):
    """Linearly rescale a position from [POSITION_MIN, POSITION_MAX] to [0, 1]."""
    return (input_ - POSITION_MIN) / (POSITION_MAX - POSITION_MIN)

In [11]:
def convert_velocity(input_):
    """Linearly rescale a velocity from [VELOCITY_MIN, VELOCITY_MAX] to [0, 1]."""
    return (input_ - VELOCITY_MIN) / (VELOCITY_MAX - VELOCITY_MIN)

In [12]:
w = np.zeros(16)

In [13]:
theta = np.zeros(48)

In [14]:
def policy(pos, vel):
    """Softmax policy with linear action preferences h(s, a) = theta . x(s, a).

    Returns the action probabilities (ordered as in `actions`) and an action
    sampled from them.
    """
    f = new_Xi(pos, vel)
    prefs = np.array([np.dot(np.ravel(theta), np.ravel(f[a])) for a in actions])
    prefs -= prefs.max()          # subtract the max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    # Sample from the distribution: policy-gradient updates assume the action
    # was drawn from pi, so picking the most probable action would bias them.
    action = random.choices(actions, weights=probs.tolist())[0]
    return probs, action

In [15]:
l_theta=0.9
l_w=0.9
alpha_theta= 0.00001
alpha_w=0.0056

In [16]:
class MountainCar:

    def __init__(self):
        self.actions = [-1, 0 , 1]      # [backward, not_moving, forward]
        self.reward = -1.0              #
        self.state_lb = [-1.2, -0.07]   # state lower bound
        self.state_ub = [0.5, 0.07]     # state upper bound
        self.goal_reached = False
   
    def move(self, position, velocity, action):
        
        # Update velocity
        vp = velocity + 0.001*action - 0.0025*np.cos(3*position)
        vp = min(max(vp,self.state_lb[1]),self.state_ub[1])

        # update position
        xp = position + vp
        xp = min(max(xp,self.state_lb[0]),self.state_ub[0])

        # if in left bound
        if xp == self.state_lb[0]:
            vp = 0.0
            
        # if in right bound (recomputed every step so the flag resets between episodes)
        self.goal_reached = xp == self.state_ub[0]
            
        return xp, vp, self.reward, self.goal_reached

In [17]:
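def delta_lnP(Position, Velocity, action):
    """Gradient of ln pi(action | state, theta) for the softmax policy.

    For linear preferences this is x(s, a) - sum_b pi(b | s) x(s, b).
    """
    f = new_Xi(Position, Velocity)
    prob, _ = policy(Position, Velocity)
    expected = sum(p * f[b] for p, b in zip(prob, actions))
    return f[action] - expected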

## Part I (COMP5500)

Implement Actor-Critic with Eligibility Traces (p. 332).
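
Both Part I algorithms rely on the eligibility vector $\nabla \ln \pi(A|S,\boldsymbol{\theta})$. For a softmax policy with linear preferences it has the closed form $\mathbf{x}(s,a) - \sum_b \pi(b|s,\boldsymbol{\theta})\,\mathbf{x}(s,b)$. The sketch below checks that expression against a finite-difference estimate, which is a cheap way to catch indexing mistakes before training. All function names and the random test data are made up for this example.

```python
import numpy as np

def softmax_probs(theta, x_sa):
    """Action probabilities; x_sa has one feature row per action."""
    prefs = x_sa @ theta
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def grad_ln_pi(theta, x_sa, a_idx):
    """Closed-form gradient of ln pi(a | s, theta) for linear preferences."""
    probs = softmax_probs(theta, x_sa)
    return x_sa[a_idx] - probs @ x_sa

def numerical_grad_ln_pi(theta, x_sa, a_idx, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        upper = np.log(softmax_probs(theta + step, x_sa)[a_idx])
        lower = np.log(softmax_probs(theta - step, x_sa)[a_idx])
        grad[i] = (upper - lower) / (2 * eps)
    return grad

# Arbitrary weights and features, used only to exercise the check
rng = np.random.default_rng(0)
theta = rng.normal(size=6)
x_sa = rng.normal(size=(3, 6))
analytic = grad_ln_pi(theta, x_sa, a_idx=1)
numeric = numerical_grad_ln_pi(theta, x_sa, a_idx=1)
print(np.max(np.abs(analytic - numeric)))
```

If the printed difference is not close to zero, the feature layout or the probability ordering is the first place to look.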



In [18]:
# Your code here (you are allowed to import from an external python file (of your own implementation) 
# instead of copying all the code here)
m = MountainCar()

def ActorCritic(episodes, gamma):
    """Episodic actor-critic with eligibility traces.

    Updates the global critic weights `w` and actor weights `theta` and
    returns the total reward and the number of steps of every episode.
    """
    global w, theta
    s = []
    r = []
    for k in range(episodes):
        steps = 0
        r_total = 0
        print(k)
        currentPosition = np.random.uniform(-0.6, -0.4)
        currentVelocity = 0.0
        z_theta = np.zeros(theta.size)   # actor trace, one entry per policy weight
        z_w = np.zeros(w.size)           # critic trace, one entry per value weight
        i = 1                            # accumulated discount, gamma ** t

        while currentPosition != POSITION_MAX:
            prob, currentAction = policy(currentPosition, currentVelocity)
            newPosition, newVelocity, reward, _ = m.move(currentPosition, currentVelocity, currentAction)
            # The value of the terminal state is zero by definition
            if newPosition == POSITION_MAX:
                v_newps = 0
            else:
                v_newps = v(newPosition, newVelocity, w, currentAction)
            v_currentpos = v(currentPosition, currentVelocity, w, currentAction)
            delta = reward + (gamma * v_newps) - v_currentpos

            # For a linear critic, the gradient of v(s, w) is the state feature vector
            x_s = np.ravel(Xi((currentPosition, currentVelocity), coeff))
            grad_ln_pi = np.ravel(delta_lnP(currentPosition, currentVelocity, currentAction))
            z_w = (gamma * l_w * z_w) + x_s
            z_theta = (gamma * l_theta * z_theta) + (i * grad_ln_pi)
            # Update the whole weight vectors, not a single indexed entry
            w = np.ravel(w) + alpha_w * delta * z_w
            theta = np.ravel(theta) + alpha_theta * delta * z_theta

            i = i * gamma
            currentPosition = newPosition
            currentVelocity = newVelocity
            steps += 1
            r_total = r_total + reward
        s.append(steps)
        r.append(r_total)
    return r, s

In [19]:
r,s=ActorCritic(50,1)


In [None]:
plt.plot(np.arange(len(r)), r)
plt.xlabel('Episode')
plt.ylabel('Reward per episode')
#plt.yscale('log')
plt.show()

In [None]:
plt.plot(np.arange(len(s)), s)
plt.xlabel('Episode')
plt.ylabel('Steps per episode')
#plt.yscale('log')
plt.show()

## Part II

Use the algorithm to learn the Mountain Car task. Tune the step-size parameter ($\alpha$), the Function Approximation order, the discount factor ($\gamma$), and the $\lambda$-value. **Note:** you can consider the problem to be undiscounted.
 
1. Plot steps per episode (on a log scale) vs. episode number. This plot should be averaged over 50-100 runs. 
2. Plot total reward per episode vs. episode number. This plot should be averaged over 50-100 runs.
3. Show an animation of the task for the final episode.
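
One possible way to organize the averaging for the first two plots is sketched below. It assumes a hypothetical callable `run_fn(n_episodes, seed)` that trains from freshly initialized weights and returns the per-episode rewards and step counts; the helper names are illustrative as well.

```python
import numpy as np

def average_over_runs(run_fn, n_runs, n_episodes):
    """Average per-episode rewards and steps over independent runs.

    run_fn(n_episodes, seed) must return (rewards, steps), each a sequence
    of length n_episodes.
    """
    rewards = np.zeros((n_runs, n_episodes))
    steps = np.zeros((n_runs, n_episodes))
    for run in range(n_runs):
        run_rewards, run_steps = run_fn(n_episodes, seed=run)
        rewards[run] = run_rewards
        steps[run] = run_steps
    return rewards.mean(axis=0), steps.mean(axis=0)

def plot_learning_curves(mean_rewards, mean_steps):
    """Plot averaged steps (log scale) and averaged reward per episode."""
    import matplotlib.pyplot as plt   # imported here so averaging works without it
    episodes = np.arange(1, len(mean_steps) + 1)
    fig, (ax_steps, ax_rewards) = plt.subplots(1, 2, figsize=(10, 4))
    ax_steps.plot(episodes, mean_steps)
    ax_steps.set_yscale('log')
    ax_steps.set_xlabel('Episode')
    ax_steps.set_ylabel('Steps per episode (averaged)')
    ax_rewards.plot(episodes, mean_rewards)
    ax_rewards.set_xlabel('Episode')
    ax_rewards.set_ylabel('Total reward per episode (averaged)')
    fig.tight_layout()
    plt.show()
```

Each run should start from re-initialized weights; if the training function keeps its weights in global variables, a small wrapper that resets them and seeds the random number generators before every run would be needed.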


## Part III*

Implement the other Part I algorithm (the one not assigned to your course number) and include the Part II plots for both algorithms.