[Official documentation: Mountain Car](https://gym.openai.com/envs/MountainCar-v0/)

[Github source code](https://github.com/openai/gym/blob/master/gym/envs/classic_control/mountain_car.py)

[OpenAI Gym DOC](https://gym.openai.com/docs/)

In [67]:
import numpy as np
import matplotlib.pyplot as plt
import tiles3 as tc
from tqdm import tqdm

import gym
from gym.wrappers import Monitor

import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay

%matplotlib inline

The Markov decision process (MDP) loop and the definition of each value returned by `env.step`


- **observation** (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
- **reward** (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
- **done** (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
- **info** (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.


<img src="./assets/MDP.png" width="380" />

[Section 3.1 of the textbook](http://www.incompleteideas.net/book/RLbook2018.pdf#page=70)
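
The interaction loop described above can be sketched without Gym. `ToyCorridorEnv` and `run_episode` below are hypothetical names made up for illustration. The class only mimics the `reset()` / `step(action)` convention and the `(observation, reward, done, info)` tuple described in the list, and is not part of any library.

```python
class ToyCorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, goal at cell `length`."""

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length)
        self.steps += 1
        reached_goal = self.position == self.length
        done = reached_goal or self.steps >= self.max_steps
        reward = -1.0  # -1 per step, so shorter episodes give a higher return
        info = {"reached_goal": reached_goal}
        return self.position, reward, done, info


def run_episode(env, policy):
    """Run one episode and return the total reward and the last info dict."""
    observation = env.reset()
    total_reward, done, info = 0.0, False, {}
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward, info


total, info = run_episode(ToyCorridorEnv(), policy=lambda obs: 1)
print(total, info)
```

The policy only sees `observation`, while `info` is kept for diagnostics, mirroring the rule that `info` should not be used for learning.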

# Environment Specifications

**Observation**: 

     Type:  Box(2)
     Num    Observation               Min            Max
     0      Car Position              -1.2           0.6
     1      Car Velocity              -0.07          0.07
         
**Actions**:

     Type: Discrete(3)
     Num    Action
     0      Accelerate to the Left
     1      Don't accelerate
     2      Accelerate to the Right

     Note: The chosen action does not change the acceleration caused by the gravitational pull acting on the car
        
**Reward**:

     Reward of 0 is awarded if the agent reached the flag(position = 0.5) on top of the mountain
     Reward of -1 is awarded if the position of the agent is less than 0.5
        
**Starting State**:

     The position of the car is assigned a uniform random value in [-0.6 , -0.4]
     The velocity of the car is always assigned to 0
        
**Episode Termination**:

     The car position is more than 0.5
     Episode length is greater than 200
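
To make the specification concrete, here is a minimal sketch of a single Mountain Car transition. The constants `FORCE` and `GRAVITY` and the update rule are assumptions recalled from the linked classic-control source and should be checked against it. The sketch returns a reward of -1 on every step and simplifies the termination test to the position check only.

```python
import math

# Assumed constants (taken from memory of the linked source; verify before use).
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.5
FORCE = 0.001
GRAVITY = 0.0025


def mountain_car_step(position, velocity, action):
    """One transition; action is 0 (left), 1 (no acceleration) or 2 (right)."""
    # The engine adds (action - 1) * FORCE; gravity depends only on the position.
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = min(max(velocity, -MAX_SPEED), MAX_SPEED)
    position += velocity
    position = min(max(position, MIN_POSITION), MAX_POSITION)
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    done = position >= GOAL_POSITION
    reward = -1.0
    return position, velocity, reward, done


# Coast (action 1) for 200 steps starting near the bottom of the valley.
position, velocity = -0.5, 0.0
reached_goal = False
for _ in range(200):
    position, velocity, reward, done = mountain_car_step(position, velocity, 1)
    reached_goal = reached_goal or done
print(reached_goal)
```

Coasting only makes the car rock gently around the valley floor, which is why the agent has to learn to build momentum by accelerating back and forth.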

In [51]:
env = gym.make("MountainCar-v0")
observation = env.reset() 

# Object's type in the action Space
print("The Action Space is an object of type: {0}\n".format(env.action_space))
# Shape of the action Space
print("The shape of the action space is: {0}\n".format(env.action_space.n))
# Object's type in the Observation Space
print("The Environment Space is an object of type: {0}\n".format(env.observation_space))
# Shape of the observation space
print("The Shape of the dimension Space are: {0}\n".format(env.observation_space.shape))
# The high and low values in the observation space
print("The High values in the observation space are {0}, the low values are {1}\n".format(
    env.observation_space.high, env.observation_space.low))
# Minimum and Maximum car position
print("The minimum and maximum car's position are: {0}, {1}\n".format(
    env.observation_space.low[0], env.observation_space.high[0]))
# Minimum and Maximum car velocity
print("The minimum and maximum car's velocity are: {0}, {1}\n".format(
    env.observation_space.low[1], env.observation_space.high[1]))
# Example of observation
print("The Observations at a given timestep are {0}\n".format(env.observation_space.sample()))



The Action Space is an object of type: Discrete(3)

The shape of the action space is: 3

The Environment Space is an object of type: Box(2,)

The Shape of the dimension Space are: (2,)

The High values in the observation space are [0.6  0.07], the low values are [-1.2  -0.07]

The minimum and maximum car's position are: -1.2000000476837158, 0.6000000238418579

The minimum and maximum car's velocity are: -0.07000000029802322, 0.07000000029802322

The Observations at a given timestep are [-0.84685916  0.04859914]



# Tile Coding Function

Tile coding is introduced in [Section 9.5.4 of the textbook](http://www.incompleteideas.net/book/RLbook2018.pdf#page=239) as a way to create features that provide both good generalization and good discrimination. It consists of multiple overlapping tilings, where each tiling is a partitioning of the space into tiles.

<img src="./assets/tilecoding.png" width="640" />

 [Tiles3 documentation](http://incompleteideas.net/tiles/tiles3.html)
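
The idea of overlapping tilings can be illustrated with a simplified grid-based coder. This is only a sketch: `simple_tiles` is a hypothetical helper, it assumes both inputs are already scaled to `[0, 1)`, it shifts each tiling diagonally by a fraction of a tile, and it does not use hashing, so it is not how `tiles3` computes its indices.

```python
import math


def simple_tiles(x, y, num_tilings=8, num_tiles=8):
    """Return one active tile index per tiling for a point (x, y) in [0, 1)^2."""
    side = num_tiles + 1  # one extra tile per dimension to cover the offset
    active = []
    for t in range(num_tilings):
        offset = t / num_tilings  # shift of tiling t, in units of one tile width
        col = int(math.floor(x * num_tiles + offset))
        row = int(math.floor(y * num_tiles + offset))
        # Give every tiling its own block of indices so tilings never collide.
        active.append(t * side * side + row * side + col)
    return active


a = simple_tiles(0.50, 0.50)
b = simple_tiles(0.55, 0.50)  # a nearby point
c = simple_tiles(0.90, 0.10)  # a distant point
print(len(set(a) & set(b)), len(set(a) & set(c)))
```

Nearby points share most of their active tiles (generalization), while distant points share none (discrimination).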

In [37]:
# Tile Coding Class
class MountainCarTileCoder:
    def __init__(self, iht_size=4096, num_tilings=8, num_tiles=8):
        """
        Initializes the MountainCar Tile Coder
        Initializers:
        iht_size -- int, the size of the index hash table, typically a power of 2
        num_tilings -- int, the number of tilings
        num_tiles -- int, the number of tiles. Here both the width and height of the
                     tile coder are the same
        Class Variables:
        self.iht -- tc.IHT, the index hash table that the tile coder will use
        self.num_tilings -- int, the number of tilings the tile coder will use
        self.num_tiles -- int, the number of tiles the tile coder will use
        """
        self.iht = tc.IHT(iht_size)
        self.num_tilings = num_tilings
        self.num_tiles = num_tiles
    
    def get_tiles(self, position, velocity):
        """
        Takes in a position and velocity from the mountaincar environment
        and returns a numpy array of active tiles.
        
        Arguments:
        position -- float, the position of the agent between -1.2 and 0.5
        velocity -- float, the velocity of the agent between -0.07 and 0.07
        returns:
        tiles - np.array, active tiles
        """
        # Set the max and min of position and velocity to scale the input
        # The max position is set to 0.5 as this is the position to end the experiment
        POSITION_MIN = -1.2
        POSITION_MAX = 0.5
        VELOCITY_MIN = -0.07
        VELOCITY_MAX = 0.07
        
        # Scale position and velocity by multiplying the inputs of each by their scale
        position_scale = self.num_tiles / (POSITION_MAX - POSITION_MIN)
        velocity_scale = self.num_tiles / (VELOCITY_MAX - VELOCITY_MIN)
        
        # Obtain active tiles for current position and velocity
        tiles = tc.tiles(self.iht, self.num_tilings, [position * position_scale, 
                                                      velocity * velocity_scale])
        
        return np.array(tiles)

In [39]:
# Test the TileCoder class
mctc = MountainCarTileCoder(iht_size = 1024, num_tilings = 8, num_tiles = 8)
tiles = mctc.get_tiles(position = -1.0, velocity = 0.01)
# Tiles obtained at a random pos and vel
print("The Tiles obtained are: {0}\n".format(tiles))

The Tiles obtained are: [0 1 2 3 4 5 6 7]



# Argmax function

In [54]:
def argmax(q_values):
    top = float("-inf")
    ties = []

    for i in range(len(q_values)):
        if q_values[i] > top:
            top = q_values[i]
            ties = []

        if q_values[i] == top:
            ties.append(i)

    return np.random.choice(ties)
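
Random tie-breaking matters here because the experiment below starts from all-zero weights with `epsilon = 0.0`, so every action value is tied at first, and a plain `np.argmax` would always return the first index. The sketch below shows the same idea in pure Python; `argmax_random_ties` is a hypothetical name used only for this illustration.

```python
import random
from collections import Counter


def argmax_random_ties(values):
    """Index of the largest value, choosing uniformly among ties."""
    best = max(values)
    ties = [i for i, v in enumerate(values) if v == best]
    return random.choice(ties)


random.seed(0)
# With all values tied, each index should be picked roughly equally often.
counts = Counter(argmax_random_ties([0.0, 0.0, 0.0]) for _ in range(3000))
print(sorted(counts.items()))
```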

# Implementing Sarsa Agent

Equation: 

\begin{equation} 
w \leftarrow w + \alpha[R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w)- \hat{q}(S_t, A_t, w)]\nabla \hat{q}(S_t, A_t, w)
\end{equation}

Target (the first two terms inside the brackets of the update above):

\begin{equation} 
\delta \leftarrow R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w)
\end{equation}

Action-value with function approximation using Sarsa Algorithm

<img src="./assets/pseudocode.png" width="480" />
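
A single update can be worked through by hand. The sketch below assumes a tiny hypothetical weight table (3 actions, 16 features, all zeros) and made-up tile indices. With binary tile features, $\hat{q}$ is the sum of the weights of the active tiles and its gradient is 1 for each active tile, so the update adds the same increment to every active weight.

```python
num_tilings = 8
alpha = 0.5 / num_tilings  # step size divided by the number of tilings
gamma = 1.0

# Hypothetical weight table: 3 actions x 16 features, initialised to zero.
w = [[0.0] * 16 for _ in range(3)]


def q_hat(weights, action, tiles):
    """Linear action value: sum of the weights of the active tiles."""
    return sum(weights[action][i] for i in tiles)


# Made-up transition: (S_t, A_t) -> R_{t+1}, (S_{t+1}, A_{t+1}).
last_tiles, last_action = [0, 1, 2, 3, 4, 5, 6, 7], 2
next_tiles, next_action = [0, 1, 2, 3, 4, 5, 6, 15], 0
reward = -1.0

target = reward + gamma * q_hat(w, next_action, next_tiles)
td_error = target - q_hat(w, last_action, last_tiles)
for i in last_tiles:
    w[last_action][i] += alpha * td_error  # gradient is 1 for active tiles

q_after = q_hat(w, last_action, last_tiles)
print(target, td_error, q_after)
```

Because eight weights each move by `alpha * td_error`, the estimate moves from 0 to -0.5, exactly half of the way to the target of -1. Dividing the step size by the number of tilings is what makes the nominal value 0.5 behave as the fraction of the error corrected per update.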

In [48]:
# SARSA
class SarsaAgent():
    """
    Initialization of Sarsa Agent. All values are set to None so they can
    be initialized in the agent_init method.
    """
    def __init__(self, agent_info={}):
        """Setup for the agent called when the experiment first starts."""
        self.last_action = None
        self.last_state = None
        self.epsilon = None
        self.gamma = None
        self.iht_size = None
        self.w = None
        self.alpha = None
        self.num_tilings = None
        self.num_tiles = None
        self.mctc = None
        self.initial_weights = None
        self.num_actions = None
        self.previous_tiles = None

    def agent_init(self, agent_info={}):
        """Setup for the agent called when the experiment first starts."""
        self.num_tilings = agent_info.get("num_tilings", 8)
        self.num_tiles = agent_info.get("num_tiles", 8)
        self.iht_size = agent_info.get("iht_size", 4096)
        self.epsilon = agent_info.get("epsilon", 0.0)
        self.gamma = agent_info.get("gamma", 1.0)
        self.alpha = agent_info.get("alpha", 0.5) / self.num_tilings
        self.initial_weights = agent_info.get("initial_weights", 0.0)
        self.num_actions = agent_info.get("num_actions", 3)
        
        # Initialize self.w to three times the iht_size. Recall this is because
        # we need to have one set of weights for each action (Stacked values).
        self.w = np.ones((self.num_actions, self.iht_size)) * self.initial_weights
        
        # Initialize self.mctc to the Mountain Car version of the tile coder defined above
        self.mctc = MountainCarTileCoder(iht_size = self.iht_size, 
                                         num_tilings = self.num_tilings, 
                                         num_tiles = self.num_tiles)

    def select_action(self, tiles):
        """
        Selects an action using epsilon greedy
        Args:
        tiles - np.array, an array of active tiles
        Returns:
        (chosen_action, action_value) - (int, float), tuple of the chosen action
                                        and its value
        """
        action_values = []
        chosen_action = None
        
        # Obtain action values for all actions (sum through rows)
        action_values = np.sum(self.w[:, tiles], axis = 1)
        
        # Epsilon-greedy action selection
        if np.random.random() < self.epsilon:
            # Select a random action among the three possible actions
            chosen_action = np.random.randint(self.num_actions)
        else:
            # Select the greedy action
            chosen_action = argmax(action_values)
        
        return chosen_action, action_values[chosen_action]
    
    def agent_start(self, state):
        """The first method called when the experiment starts, called after
        the environment starts.
        Args:
            state (Numpy array): the state observation from the
                environment's env.reset() function.
        Returns:
            The first action the agent takes.
        """
        # Current state
        position, velocity = state
        
        # Obtain the tiles activated at the initial state
        active_tiles = self.mctc.get_tiles(position = position, velocity = velocity)
        # Select an action and obtain action values of the state
        current_action, action_value = self.select_action(active_tiles)
        
        # Save action as last action
        self.last_action = current_action
        # Save tiles as previous tiles
        self.previous_tiles = np.copy(active_tiles)
        
        return self.last_action

    def agent_step(self, reward, state):
        """A step taken by the agent.
        Args:
            reward (float): the reward received for taking the last action taken
            state (Numpy array): the state observation from the
                environment's step based, where the agent ended up after the
                last step
        Returns:
            The action the agent is taking.
        """
        # Current state
        position, velocity = state

        # Compute current tiles
        active_tiles = self.mctc.get_tiles(position = position, velocity = velocity)
        # Select the next action and its value before updating the weights
        current_action, action_value = self.select_action(active_tiles)
        
        # Compute the Sarsa target (delta in the equation above)
        target = reward + (self.gamma * action_value)
        
        # Compute last action values to update weights
        last_action_val = np.sum(self.w[self.last_action][self.previous_tiles]) 
        
        # Tile coding is a form of linear function approximation with binary features,
        # so the gradient is one for the active tiles and zero otherwise.
        grad = 1
        self.w[self.last_action][self.previous_tiles] = self.w[self.last_action][self.previous_tiles] + \
            self.alpha * (target - last_action_val) * grad
                
        self.last_action = current_action
        self.previous_tiles = np.copy(active_tiles)
        return self.last_action

    def agent_end(self, reward):
        """Run when the agent terminates.
        Args:
            reward (float): the reward the agent received for entering the
                terminal state.
        """

        # There is no action_value used here because this is the end
        # of the episode.
        
        # Compute the target; there is no bootstrapped term at a terminal state
        target = reward 
        # Compute last action value
        last_action_val = np.sum(self.w[self.last_action][self.previous_tiles])
        grad = 1
        # Update weights
        self.w[self.last_action][self.previous_tiles] = self.w[self.last_action][self.previous_tiles] + \
            self.alpha * (target - last_action_val) * grad
        

In [69]:
"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  mp4list.sort()
  for mp4 in mp4list:
    print(mp4)
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env


# Test experiment

In [71]:
# Test Sarsa Agent 
env_to_wrap = gym.make('MountainCar-v0')
env = Monitor(env_to_wrap, "./vid", video_callable=lambda episode_id: True,force=True)

num_runs = 30
num_episodes = 200
agent_info_options = {"num_tilings": 8, "num_tiles": 8, "iht_size": 4096,
                      "epsilon": 0.0, "gamma": 1.0, "alpha": 0.5,
                      "initial_weights": 0.0, "num_actions": 3}
all_steps = []

agent = SarsaAgent(agent_info_options)
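# Note: this rebinds env to a fresh environment, so the Monitor wrapper created
# above is not used in the loop below and no video is recorded.
env = gym.make("MountainCar-v0")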

agent.agent_init(agent_info_options)

# Each iteration of the outer loop is one episode (num_runs episodes in total).
# The inner loop runs at most 200 timesteps per episode.
for i_episode in tqdm(range(num_runs)):
    
    # Resets environment
    observation = env.reset()
    done = False
    # Reset agent
    #agent.agent_init(agent_info_options)
    # Generate last state and action in the agent
    last_action = agent.agent_start(observation)
    for t in range(200):
        # View environment
        env.render()
        
        # Take a step with the environment
        observation, reward, done, info = env.step(last_action)
        
        # If the episode is over (goal reached or the 200-step limit hit), stop
        if done:
            # Last step with the agent
            agent.agent_end(reward)
            print("Episode finished after {} timesteps".format(t+1))
            break
        else:
            # Take a step with the agent
            last_action = agent.agent_step(reward, observation)

env.close()
env_to_wrap.close()

100%|██████████| 30/30 [00:04<00:00,  7.39it/s]

Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 181 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
Episode finished after 200 timesteps
