<center>
    COMP4600/5500 - Reinforcement Learning

# Homework 6 - On-policy Control with Approximation

### Due: Monday, November 15th 11:59 pm
    
</center>

Student Name: ______________________ 

The purpose of this project is to study the properties of function approximation in on-policy control methods.

In [16]:
# You are allowed to use the following modules
import numpy as np
import matplotlib.pyplot as plt
from mountain_car import MountainCar
import pygame as pg
from itertools import product

## Task description
Consider the task of driving an underpowered car up a steep mountain road, as suggested by the diagram in the upper left of the following figure. The difficulty is that gravity is stronger than the car's engine, and even at full throttle the car cannot accelerate up the steep slope. The only solution is to first move away from the goal and up the opposite slope on the left. Then by applying full throttle the car can build up enough inertia to carry it up the steep slope even though it is slowing down the whole way.


![mc.png](attachment:mc.png)


This is a continuous control task where things have to get worse in a sense (farther from the goal) before they can get better. The reward in this problem is -1 on all time steps until the car moves past its goal position at the top of the mountain, which ends the episode. There are three possible actions: full throttle forward (+1), full throttle reverse (-1), and zero throttle (0). The car moves according to simplified physics. Its position $x_t$ and velocity $\dot{x}_t$ are updated by

$x_{t+1} \doteq \text{bound}[x_t + \dot{x}_{t+1}]$

$\dot{x}_{t+1} \doteq \text{bound}[\dot{x}_t + 0.001 A_t - 0.0025 \cos(3x_t)]$


where the *bound* operation enforces $-1.2 \le x_{t+1} \le 0.5$ and $-0.07 \le \dot{x}_{t+1} \le 0.07$. In addition, when $x_{t+1}$ reaches the left bound, $\dot{x}_{t+1}$ is reset to zero. When it reaches the right bound, the goal is reached and the episode terminates. Each episode starts from a random position $x_t \in [-0.6, -0.4)$ and zero velocity.
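
When checking the given code against these formulae, it can help to have a reference transition to compare with. The following is a minimal sketch of one time step under the rules stated above. The function name `step` and the module-level constants are hypothetical and are not part of the provided `MountainCar` class; the return order is only assumed to mirror `car.move`.

```python
import numpy as np

# Bounds from the task description (constant names are illustrative).
X_MIN, X_MAX = -1.2, 0.5
V_MIN, V_MAX = -0.07, 0.07


def step(x, v, a):
    """Apply one transition for action a in {-1, 0, +1}.

    Returns (next_position, next_velocity, reward, goal_reached).
    """
    # The velocity is updated first, using the current position.
    v_next = np.clip(v + 0.001 * a - 0.0025 * np.cos(3 * x), V_MIN, V_MAX)
    # The position update uses the new velocity.
    x_next = np.clip(x + v_next, X_MIN, X_MAX)
    # Hitting the left wall resets the velocity to zero.
    if x_next <= X_MIN:
        v_next = 0.0
    # Reaching the right bound ends the episode.
    goal_reached = bool(x_next >= X_MAX)
    # The reward is -1 on every time step.
    return float(x_next), float(v_next), -1, goal_reached
```

Comparing the output of `step` with the provided `car.move` on a handful of states (interior, left wall, near the goal) is one way to confirm the two agree.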


## Part I

You have been given a simple implementation of the Mountain Car task. 

1. Your first task is to check and confirm that the given code simulates the above formulae and task description. Then write a function that generates random episodes for this task. You should use the given code for the Mountain Car task. 


In [2]:
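def episode():
    """Generate one episode under a uniformly random policy; return the visited positions."""
    car = MountainCar()
    x = np.random.uniform(-0.6, -0.4)  # random start position in [-0.6, -0.4)
    v = 0.0  # every episode starts at rest
    X = [x]
    while True:
        a = np.random.choice(car.actions)  # pick an action uniformly at random
        x, v, r, goal_reached = car.move(x, v, a)
        X.append(x)
        if goal_reached:
            return X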
        

2. Use the Pygame library to develop a simple function that animates a given episode/trajectory. The equation for the mountain is $y = 0.45\sin(3x) + 0.55$. Use a randomly generated episode from the function you developed above and pass it to your animation function, then plot the results.

In [None]:
FPS = 40
BG_COLOR = pg.Color(200, 200, 200)
CURVE_COLOR = pg.Color(70, 70, 70)
CAR_COLOR = pg.Color(200, 70, 70)

SCALE = 1000
PAD = 200

X_MIN, X_MAX = -1.2, 0.5
Y_MAX, Y_MIN = 1, 0

WIDTH = int((X_MAX - X_MIN) * SCALE  + PAD)
HEIGHT = int((Y_MAX - Y_MIN) * SCALE  + PAD)

def transform(x, y):
    """transform an xy coordinate to pygame screen coordinates"""
    return (x + 1.2) * SCALE + PAD / 2, (1 - y) * SCALE + PAD / 2

def curve(x, offset=0):
    return 0.45 * np.sin(3*x) + 0.55 + offset


class PgCar:

    def __init__(self):
        self.screen, self.bg = self.init()
        
    def init(self):
        pg.init()  # initialize pygame
        screen = pg.display.set_mode((WIDTH, HEIGHT))  # set up the screen
        pg.display.set_caption("Mohamed Martini")  # add a caption
        bg = pg.Surface(screen.get_size())  # get a background surface
        bg = bg.convert()
        bg.fill(BG_COLOR)
        screen.blit(bg, (0, 0))
        return screen, bg

    def draw_curve(self):
        """draw the mountain as one connected polyline"""
        points = [transform(x, curve(x)) for x in np.arange(X_MIN, X_MAX, 0.001)]
        pg.draw.lines(self.screen, CURVE_COLOR, False, points, 7)

    def render(self):
        """push the drawn frame to the display"""
        pg.display.flip()
    
    def reset_screen(self):
        self.screen.fill(BG_COLOR)
        self.draw_curve()
    
    def animate(self, X):
        """receive a list of positions on the x axis, and animate the movement of the car"""
        clock = pg.time.Clock()
        radius = 20
        i = 0
        num_steps = len(X)
        run = True
        # stop after the last position has been drawn, or when the window is closed
        while run and i < num_steps:
            clock.tick(FPS)
            for event in pg.event.get():
                if event.type == pg.QUIT:
                    run = False
            center = transform(X[i], curve(X[i], offset=0.05))
            self.reset_screen()
            pg.draw.circle(self.screen, CAR_COLOR, center, radius)  # default width=0 draws a filled circle
            self.render()
            i += 1
        pg.quit()
        

X = episode()
PgCar().animate(X)  # animate() returns None, so there is nothing to assign

## Part II

Develop a function approximation procedure based on either **Polynomials** or **Fourier basis** (recommended). Given the current $\bar{w}$, the developed function approximation method should return the value for each specific state.

In [17]:
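def fourier_basis(s_: np.ndarray, n: int) -> np.ndarray:
    """Order-n Fourier cosine features; s_ is assumed to be scaled to [0, 1] in every dimension."""
    k = s_.shape[0]
    num_features = (n + 1) ** k
    x_ = np.zeros(num_features)
    # each coefficient vector c has entries in {0, ..., n}, giving (n + 1)^k features
    for i, c in enumerate(product(range(n + 1), repeat=k)):
        c_ = np.array(c)
        x_[i] = np.cos(np.pi * s_ @ c_)
    return x_

def v(s_: np.ndarray, w_: np.ndarray, n: int) -> float:
    """Linear value estimate: the dot product of the weights and the features."""
    x_ = fourier_basis(s_, n)
    return w_ @ x_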
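
The Fourier basis is defined on the unit hypercube, so the raw position and velocity need to be rescaled to $[0, 1]$ before the features are computed. The sketch below shows one possible way to do this and to extend the state-value approximation to action values by keeping one weight vector per action. The names `normalize`, `fourier_features`, `q_values`, `LOW`, and `HIGH` are illustrative assumptions, not part of the given code, and a separate weight row per action is only one of several reasonable designs.

```python
import numpy as np
from itertools import product

# State bounds taken from the task description: (position, velocity).
LOW = np.array([-1.2, -0.07])
HIGH = np.array([0.5, 0.07])


def normalize(s):
    """Rescale a raw state (x, x_dot) to the unit square [0, 1]^2."""
    return (np.asarray(s, dtype=float) - LOW) / (HIGH - LOW)


def fourier_features(s, n):
    """Order-n Fourier cosine features of a raw state, (n + 1)^k in total."""
    s01 = normalize(s)
    # Every row of C is one coefficient vector with entries in {0, ..., n}.
    C = np.array(list(product(range(n + 1), repeat=s01.size)))
    return np.cos(np.pi * C @ s01)


def q_values(s, W, n):
    """Linear action values, one row of W per action (assumed layout)."""
    return W @ fourier_features(s, n)
```

With three actions and order `n = 3`, `W` would have shape `(3, 16)`, and `np.argmax(q_values(s, W, 3))` gives the index of the greedy action.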

## Part III (COMP4600)

1. Implement the **Episodic Semi-gradient SARSA** (p. 244).
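
For reference, the one-step semi-gradient SARSA update has the following form. It is written here with the weight notation $\bar{w}$ used above; $\hat{q}$ denotes the approximate action-value function, and the feature vector $\bar{x}(S, A)$ is notation introduced only for this note.

$\bar{w} \leftarrow \bar{w} + \alpha \left[ R + \gamma \hat{q}(S', A', \bar{w}) - \hat{q}(S, A, \bar{w}) \right] \nabla \hat{q}(S, A, \bar{w})$

When $S'$ is terminal, the bootstrap term is dropped:

$\bar{w} \leftarrow \bar{w} + \alpha \left[ R - \hat{q}(S, A, \bar{w}) \right] \nabla \hat{q}(S, A, \bar{w})$

For a linear approximator $\hat{q}(S, A, \bar{w}) = \bar{w}^\top \bar{x}(S, A)$, the gradient is simply $\nabla \hat{q}(S, A, \bar{w}) = \bar{x}(S, A)$.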

In [None]:
# Your code here

2. Use the algorithm to learn the Mountain Car task. Tune the step-size parameter ($\alpha$) and choose a suitable function approximation order, discount factor ($\gamma$), and exploration probability ($\varepsilon$). Plot the number of steps per episode (on a log scale) against the episode number. This plot should be averaged over 50-100 runs.
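
One way to organize the averaging is to store the episode lengths in an array of shape `(num_runs, num_episodes)` and take the mean over the run axis before plotting. The sketch below assumes that layout; the function names `mean_steps_per_episode` and `plot_learning_curve` are hypothetical, and Matplotlib is imported inside the plotting helper only so that the averaging function can be used on its own.

```python
import numpy as np


def mean_steps_per_episode(steps):
    """Average episode lengths over runs.

    steps: array-like of shape (num_runs, num_episodes).
    Returns a 1-D array with one mean per episode.
    """
    steps = np.asarray(steps, dtype=float)
    if steps.ndim != 2:
        raise ValueError("steps must have shape (num_runs, num_episodes)")
    return steps.mean(axis=0)


def plot_learning_curve(steps, label=None):
    """Plot mean steps per episode with a logarithmic y axis."""
    import matplotlib.pyplot as plt

    mean = mean_steps_per_episode(steps)
    episodes = np.arange(1, mean.size + 1)
    plt.plot(episodes, mean, label=label)
    plt.yscale("log")
    plt.xlabel("Episode")
    plt.ylabel("Steps per episode (averaged over runs)")
    if label is not None:
        plt.legend()
```

Calling `plot_learning_curve` once per hyperparameter setting, each with its own `label`, and then `plt.show()` puts the curves on a single set of axes for comparison.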

In [None]:
# Your code here

3. Show an animation of the task.

In [None]:
# Your code here

## Part III (COMP5500)

1. Implement the **Episodic Semi-gradient $n$-step SARSA** (p. 247).
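
For reference, the $n$-step method replaces the one-step target with the $n$-step return. Here it is written with the weight notation $\bar{w}$ used above, where $T$ is the final time step of the episode and $\hat{q}$ is the approximate action-value function:

$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} \hat{q}(S_{t+n}, A_{t+n}, \bar{w}_{t+n-1}), \quad t + n < T$

with $G_{t:t+n} \doteq G_t$ when $t + n \ge T$. The weights are then updated by

$\bar{w}_{t+n} \doteq \bar{w}_{t+n-1} + \alpha \left[ G_{t:t+n} - \hat{q}(S_t, A_t, \bar{w}_{t+n-1}) \right] \nabla \hat{q}(S_t, A_t, \bar{w}_{t+n-1})$

Setting $n = 1$ recovers the one-step semi-gradient SARSA update.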

In [None]:
# Your code here

2. Use the algorithm to learn the Mountain Car task with $n \in \{1, 8, 16\}$. Tune the step-size parameter ($\alpha$) and choose a suitable function approximation order, discount factor ($\gamma$), and exploration probability ($\varepsilon$). Plot the number of steps per episode (on a log scale) against the episode number. This plot should be averaged over 50-100 runs.

In [None]:
# Your code here

3. Show an animation of the task for each $n$.

In [None]:
# Your code here

4. Which value of $n$ results in faster learning? Why?

>Answer 