# Continuous Pendulum with function approximation and control

This notebook solves the Pendulum swing-up problem using a one-step Actor-Critic agent with neural-network function approximation.

The description of the problem is given below:

"The inverted pendulum swingup problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright."

<img src="./assets/car.png" width="380" />

An extensive description and solution of the problem can be found in [Section 10.1 of Reinforcement Learning: An Introduction](http://www.incompleteideas.net/book/RLbook2018.pdf#page=267)

Image and text taken from the [official Pendulum-v0 documentation](https://gym.openai.com/envs/Pendulum-v0/).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

import gym
from gym.wrappers import Monitor
from utils import *

import torch
from torch import nn
import torch.nn.functional as F
from torch import optim

%matplotlib inline

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Understanding the Workflow of OpenAI Gym

The following variables are returned by the Pendulum environment at each timestep.


- **observation** (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
- **reward** (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
- **done** (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
- **info** (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.


As a quick recap, the diagram below explains the workflow of a Markov Decision Process (MDP)

<img src="./assets/MDP.png" width="380" />

Image taken from [Section 3.1 of Reinforcement Learning: An Introduction](http://www.incompleteideas.net/book/RLbook2018.pdf#page=70)
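The interaction loop described above can be sketched without Gym. The `ToyEnv` class and `run_episode` helper below are hypothetical stand-ins written for illustration; they only mimic the `reset()` / `step()` interface that returns `(observation, reward, done, info)`.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics the classic Gym reset/step interface."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        # Start a new episode and return the first observation
        self.t = 0
        return 0.0

    def step(self, action):
        # Advance one timestep and return (observation, reward, done, info)
        self.t += 1
        observation = float(self.t)
        reward = -abs(action)
        done = self.t >= self.max_steps
        info = {"t": self.t}
        return observation, reward, done, info


def run_episode(env, policy):
    """Run one episode and return the total reward and the number of steps."""
    observation = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


random.seed(0)
total, steps = run_episode(ToyEnv(), lambda obs: random.uniform(-2.0, 2.0))
print(steps, total)
```

The same loop structure is used later in the notebook, with the agent taking the place of the random policy.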

## Environment and Agent specifications

The main features of the environment and agent are presented below. The action space is continuous: a single joint effort (torque) between $-2$ and $2$. The observation space is also continuous, so a function approximation technique is needed to solve this challenge. At each timestep the agent receives a negative reward that penalizes the angle from upright, the angular velocity, and the applied effort. The task has no terminal state, so an episode ends only when a fixed number of iterations is reached. The pendulum always starts at a random angle between $-\pi$ and $\pi$ with a random velocity between $-1$ and $1$.

**Observation**: 

     Type: Box(3)
     Num    Observation    Min     Max
     0      cos(theta)     -1.0    1.0
     1      sin(theta)     -1.0    1.0
     2      theta dot      -8.0    8.0
         
**Actions**:

     Type: Box(1)
     Num    Action          Min     Max
     0      Joint effort    -2.0    2.0

        
**Reward**:

     -(theta^2 + 0.1*theta_dt^2 + 0.001*action^2)

        
**Starting State**:

     Random angle from -pi to pi, and random velocity between -1 and 1
        
**Episode Termination**:

     Continuing task: there is no terminal state, so an episode only ends when the step limit is reached
     
For further information see [Github source code](https://github.com/openai/gym/blob/master/gym/envs/classic_control/pendulum.py)
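As a minimal sketch of the specification above, the functions below compute the observation and the reward from the pendulum angle, angular velocity and action. The function names are hypothetical, and wrapping the angle to $[-\pi, \pi)$ before squaring it is an assumption about how the linked source treats `theta`.

```python
import math


def angle_normalize(theta):
    """Wrap an angle to the interval [-pi, pi) (assumed convention)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def pendulum_observation(theta, theta_dot):
    """Build the three-component observation [cos(theta), sin(theta), theta_dot]."""
    return [math.cos(theta), math.sin(theta), theta_dot]


def pendulum_reward(theta, theta_dot, action):
    """Reward from the specification: -(theta^2 + 0.1*theta_dt^2 + 0.001*action^2)."""
    cost = angle_normalize(theta) ** 2 + 0.1 * theta_dot ** 2 + 0.001 * action ** 2
    return -cost


# Upright and at rest with no effort gives the best possible reward
print(pendulum_reward(0.0, 0.0, 0.0))
# Hanging down at maximum speed with maximum effort gives the worst one
print(pendulum_reward(math.pi, 8.0, 2.0))
```

Under these assumptions the reward lies roughly between $-16.3$ and $0$ per step, which helps to interpret the episode totals printed during training.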

The next cell shows how to interact with the action and observation spaces of the environment and extract relevant information from them.

In [2]:
env = gym.make("Pendulum-v0")
observation = env.reset() 

# Object's type in the action Space
print("The Action Space is an object of type: {0}\n".format(env.action_space))
# Shape of the action Space
print("The shape of the action space is: {0}\n".format(env.action_space.shape))
# The high and low values in the action space
print("The High values in the action space are {0}, the low values are {1}\n".format(
    env.action_space.high, env.action_space.low))
# Object's type in the Observation Space
print("The Environment Space is an object of type: {0}\n".format(env.observation_space))
# Shape of the observation space
print("The Shape of the dimension Space are: {0}\n".format(env.observation_space.shape))
# The high and low values in the observation space
print("The High values in the observation space are {0}, the low values are {1}\n".format(
    env.observation_space.high, env.observation_space.low))
# Example of observation
print("The Observations at a given timestep are {0}\n".format(env.observation_space.sample()))

# https://medium.com/deeplearningmadeeasy/advantage-actor-critic-continuous-case-implementation-f55ce5da6b4c
# https://www.coursera.org/learn/prediction-control-function-approximation/home/welcome
# [Section 3.1 Reinforcement Learning: An Introduction](http://www.incompleteideas.net/book/RLbook2018.pdf#page=357)

The Action Space is an object of type: Box(1,)

The shape of the action space is: (1,)

The High values in the action space are [2.], the low values are [-2.]

The Environment Space is an object of type: Box(3,)

The Shape of the dimension Space are: (3,)

The High values in the observation space are [1. 1. 8.], the low values are [-1. -1. -8.]

The Observations at a given timestep are [ 0.02874173 -0.3922771  -6.9678636 ]



# Tile Coding Class

For a complete explanation of what tile coding is and how it works, see [Section 9.5.4 of Reinforcement Learning: An Introduction](http://www.incompleteideas.net/book/RLbook2018.pdf#page=239). Overall, it is a way to create features that provide both good generalization and discrimination for value function approximation. Tile coding consists of multiple overlapping tilings, where each tiling is a partition of the space into tiles.

<img src="./assets/tilecoding.png" width="640" />

**Note**: Tile coding is described here for reference. The cells below do not use it; they approximate the value function and the policy with neural networks.

Tile coding is implemented in Tiles3, a Python library written by Richard S. Sutton. For the full documentation see [Tiles3 documentation](http://incompleteideas.net/tiles/tiles3.html)

Image taken from [Section 9.5.4 of Reinforcement Learning: An Introduction](http://www.incompleteideas.net/book/RLbook2018.pdf#page=239)
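To make the idea of overlapping tilings concrete, here is a small hand-rolled sketch. It is not the Tiles3 API: the `active_tiles` function, its parameters and the uniform offset scheme are assumptions chosen for illustration only.

```python
def active_tiles(point, lows, highs, num_tilings=8, tiles_per_dim=8):
    """Return one active tile index per tiling for a point inside a box."""
    dims = len(point)
    # One extra tile per dimension absorbs the shift of each tiling
    tiles_per_tiling = (tiles_per_dim + 1) ** dims
    indices = []
    for tiling in range(num_tilings):
        # Each tiling is shifted by a different fraction of one tile width
        offset = tiling / num_tilings
        index = 0
        for x, lo, hi in zip(point, lows, highs):
            # Scale the coordinate to [0, tiles_per_dim] and apply the shift
            scaled = (x - lo) / (hi - lo) * tiles_per_dim + offset
            coord = min(int(scaled), tiles_per_dim)
            index = index * (tiles_per_dim + 1) + coord
        # Give every tiling its own block of indices
        indices.append(tiling * tiles_per_tiling + index)
    return indices


lows, highs = (-1.0, -1.0), (1.0, 1.0)
near_a = active_tiles((0.10, 0.20), lows, highs)
near_b = active_tiles((0.11, 0.20), lows, highs)
far = active_tiles((-0.90, -0.90), lows, highs)
print(len(set(near_a) & set(near_b)), len(set(near_a) & set(far)))
```

Nearby points share active tiles (generalization), while distant points share none (discrimination).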

In [3]:
# Critic Neural Network
class Critic(nn.Module):
    # Build a network with three hidden layers that maps a state to a single value
    def __init__(self, critic_config):
        super().__init__()
        
        
        # Number of states
        self.state_dim = critic_config.get("state_dim")
        # Hidden units
        self.num_hidden_units = critic_config.get("num_hidden_units")
        
        a1 = int(self.num_hidden_units / 4)
        a2 = int(self.num_hidden_units / 2)
        
        # Initialize first hidden layer 
        self.hidden_1 = nn.Linear(self.state_dim, a1)
        # Initialize second hidden layer 
        self.hidden_2 = nn.Linear(a1, a2)
        # Initialize third hidden layer 
        self.hidden_3 = nn.Linear(a2, self.num_hidden_units)
        # Initialize output layer
        self.output = nn.Linear(self.num_hidden_units, 1)
                        
    
    def forward(self, s):
        """
        This is a feed-forward pass in the network
        Args:
            s (Numpy array): The state, an array of shape (state_dim,) or (batch_size, state_dim)
        Returns:
            The state-values (Torch tensor) calculated using the network's weights,
            with shape (1,) or (batch_size, 1)
        """
        # Transform observations into a pytorch tensor
        s = torch.Tensor(s)
        
        v = F.relu(self.hidden_1(s))
        v = F.relu(self.hidden_2(v))
        v = F.relu(self.hidden_3(v))
        v = self.output(v)

        return v

In [4]:
# The Actor neural network
class Actor(nn.Module):
    def __init__(self,  actor_config):
        super().__init__()
                
        # Number of states
        self.state_dim = actor_config.get("state_dim")
        # Hidden units
        self.num_hidden_units = actor_config.get("num_hidden_units")
        # Actions or output units
        self.num_actions = actor_config.get("num_actions")
        
        a1 = int(self.num_hidden_units / 4)
        a2 = int(self.num_hidden_units / 2)
        
        # Initialize first hidden layer 
        self.hidden_1 = nn.Linear(self.state_dim, a1)
        # Initialize second hidden layer 
        self.hidden_2 = nn.Linear(a1, a2)
        # Initialize third hidden layer 
        self.hidden_3 = nn.Linear(a2, self.num_hidden_units)
        # Initialize output layer
        self.output = nn.Linear(self.num_hidden_units, self.num_actions * 2)
        
        # Log of standard deviation
        # logstdv_param = nn.Parameter(torch.full((self.num_actions,), 0.1))
        # Register parameter in the network
        # self.register_parameter("logstdv", logstdv_param)
                
    def compute_mean(self, s):
        """
        This is a feed-forward pass that produces the parameters of the policy
        Args:
            s (Numpy array): The state, an array of shape (state_dim,)
        Returns:
            A Torch tensor with 2 * num_actions outputs: the mean(s) of the
            normal distribution followed by the log standard deviation(s)
        """
        # Transform observations into a pytorch tensor
        s = torch.Tensor(s)
        
        pi = F.relu(self.hidden_1(s))
        pi = F.relu(self.hidden_2(pi))
        pi = F.relu(self.hidden_3(pi))
        pi = self.output(pi)
        
        return pi
                
    
    def forward(self, s):
        
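        # Compute the mean and the log standard deviation with the model.
        # Unpacking works here because a single unbatched state with
        # num_actions = 1 yields an output tensor with exactly two elements.
        mean, logstdv = self.compute_mean(s)
        # Clamp the stdv between 1e-3 and 50
        #stdv = torch.clamp(logstdv.exp(), 1e-3, 50)
        stdv = logstdv.exp()
        
        # Return the normal distribution from which actions are sampled
        return torch.distributions.Normal(mean, stdv)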
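
The actor above parameterizes a normal distribution by a mean and a log standard deviation. As a plain-Python sketch of what sampling and the log-probability amount to, the hypothetical helpers below use the standard Gaussian density formula; they are an illustration, not the code path used by the notebook.

```python
import math
import random


def gaussian_log_prob(action, mean, log_std):
    """Log density of a normal distribution with the given mean and log std."""
    std = math.exp(log_std)
    return (-((action - mean) ** 2) / (2 * std ** 2)
            - log_std
            - 0.5 * math.log(2 * math.pi))


def sample_action(mean, log_std, low=-2.0, high=2.0, rng=random):
    """Sample from the normal policy and clip to the action bounds."""
    action = rng.gauss(mean, math.exp(log_std))
    return min(max(action, low), high)


random.seed(0)
action = sample_action(mean=0.0, log_std=0.0)
print(action, gaussian_log_prob(action, 0.0, 0.0))
```

Predicting the log standard deviation instead of the standard deviation keeps the scale positive after exponentiation without any explicit constraint.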


## Computing the TD target and estimate

In [5]:
# Method to compute the TD Target and TD estimate
def get_td(states, next_states, rewards, terminals, actor, critic, avg_reward):
    """
    Args:
        states (Numpy array): The current state(s), with shape (state_dim,) or (batch_size, state_dim).
        next_states (Numpy array): The next state(s), with the same shape as states.
        rewards (float or Numpy array): The reward(s) received after leaving states.
        terminals (int or Numpy array): 1 if the next state is terminal, 0 otherwise.
        actor (Actor): The policy network (not used in this computation).
        critic (Critic): The state-value network.
        avg_reward (float): The average reward estimate (only used by the commented-out
                            average-reward target).
    Returns:
        target_vec (Tensor): The TD target, r + 0.99 * v(s') * (1 - terminal)
        estimate_vec (Tensor): The TD estimate, v(s)
    """
    
    # Compute the state values of the next and current states with the critic.
    # The target is detached by the caller before it enters the critic loss,
    # so no gradient flows through v_next_vals.
    v_next_vals = critic.forward(next_states) #.detach()
    v_curr_vals = critic.forward(states)
    
    # target_vec = torch.tensor(rewards) - torch.tensor(avg_reward) + (v_next_vals * (1 - torch.tensor(terminals)))
    target_vec = torch.tensor(rewards) + (0.99 * v_next_vals * (1 - torch.tensor(terminals)))
    estimate_vec = v_curr_vals
 

    return target_vec, estimate_vec
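
The arithmetic inside `get_td` can be checked by hand with scalar values. The helper below is a hypothetical scalar version that assumes the same discount factor of 0.99 and the same terminal mask as the cell above.

```python
def td_target_and_error(reward, v_curr, v_next, terminal, gamma=0.99):
    """Return the one-step TD target and the TD error for scalar inputs."""
    # The bootstrap term is switched off when the next state is terminal
    target = reward + gamma * v_next * (1 - terminal)
    return target, target - v_curr


# Non-terminal transition: the target bootstraps from the next state value
target, delta = td_target_and_error(reward=-1.0, v_curr=-12.0, v_next=-10.0, terminal=0)
print(target, delta)

# Terminal transition: the target is just the reward
end_target, end_delta = td_target_and_error(reward=-1.0, v_curr=-12.0, v_next=-10.0, terminal=1)
print(end_target, end_delta)
```

In the first case the target is about $-10.9$ and the TD error about $1.1$, meaning the critic underestimated the value of the current state.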

# Implementing the Actor-Critic Agent


\begin{equation} 
q_\pi(s, a) \approx \hat{q}(s, a, \mathbf{w}) \doteq \mathbf{w}^T \mathbf{x}(s,a)
\end{equation}

## Section 2-2: Implement Agent Methods

Let's first define methods that initialize the agent. `agent_init()` initializes all the variables that the agent will need.

Now that we have implemented helper functions, let's create an agent. The agent implements `agent_start()`, `agent_step()` and `agent_end()`. Although the pendulum task has no terminal state, `agent_end()` is still called when an episode is cut off by the step limit.

`select_action()` samples an action from the normal distribution produced by the actor and clips it to the valid action range. It is used in `agent_start()` and `agent_step()`.

When performing updates to the Actor and Critic, recall their respective update rules in the Actor-Critic algorithm.

We approximate $q_\pi$ in the Actor update using the one-step bootstrapped return ($R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, \mathbf{w})$) minus the current state-value ($\hat{v}(S_{t}, \mathbf{w})$), which is equivalent to the TD error $\delta$. The code in this notebook uses the discounted form of this error: the average-reward term $\bar{R}$ is dropped and $\hat{v}(S_{t+1}, \mathbf{w})$ is multiplied by a discount factor of $0.99$.

$\delta_t = R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_{t}, \mathbf{w}) \hspace{6em} (1)$

**Average Reward update rule**: $\bar{R} \leftarrow \bar{R} + \alpha^{\bar{R}}\delta \hspace{4.3em} (2)$

**Critic weight update rule**: $\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\delta\nabla \hat{v}(s,\mathbf{w}) \hspace{2.5em} (3)$

**Actor weight update rule**: $\mathbf{\theta} \leftarrow \mathbf{\theta} + \alpha^{\mathbf{\theta}}\delta\nabla \ln \pi(A|S,\mathbf{\theta}) \hspace{1.4em} (4)$


With linear function approximation and a softmax policy, the above update rules can be further simplified using the closed-form gradients below. This notebook uses neural networks and a normal (Gaussian) policy instead, so the gradients are computed by PyTorch's automatic differentiation.

$\nabla \hat{v}(s,\mathbf{w}) = \mathbf{x}(s) \hspace{14.2em} (5)$

$\nabla \ln \pi(A|S,\mathbf{\theta}) = \mathbf{x}_h(s,a) - \sum_b \pi(b|s, \mathbf{\theta})\mathbf{x}_h(s,b) \hspace{3.3em} (6)$
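Equation (6) can be verified numerically for a softmax policy with linear action preferences. The sketch below assumes preferences of the form $h(s, a) = \mathbf{\theta}^T \mathbf{x}_h(s, a)$; the function names and the example feature vectors are hypothetical.

```python
import math


def softmax_probs(theta, features):
    """Softmax policy where features[a] plays the role of x_h(s, a)."""
    prefs = [sum(t * x for t, x in zip(theta, feat)) for feat in features]
    # Subtract the maximum preference for numerical stability
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(theta, features, action):
    """Closed form of equation (6): x_h(s, a) - sum_b pi(b|s) x_h(s, b)."""
    probs = softmax_probs(theta, features)
    expected = [sum(p * feat[i] for p, feat in zip(probs, features))
                for i in range(len(theta))]
    return [features[action][i] - expected[i] for i in range(len(theta))]


def numeric_grad_log_pi(theta, features, action, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        diff = (math.log(softmax_probs(up, features)[action])
                - math.log(softmax_probs(down, features)[action]))
        grads.append(diff / (2 * eps))
    return grads


theta = [0.2, -0.1, 0.4]
features = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.3, 0.3, 1.0]]
analytic = grad_log_pi(theta, features, action=1)
numeric = numeric_grad_log_pi(theta, features, action=1)
print(analytic)
print(numeric)
```

The two gradients agree up to finite-difference error, which is a quick sanity check on the closed form.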

For further details, see [Section 9.5.4 of Reinforcement Learning: An Introduction](http://www.incompleteideas.net/book/RLbook2018.pdf#page=266). Image taken from the last reference.
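As a worked example of equations (1), (2), (3) and (5), the sketch below performs a single average-reward critic update with a linear value function. The function name, the step sizes and the feature vectors are hypothetical values picked for illustration.

```python
def average_reward_critic_step(w, avg_reward, x, x_next, reward,
                               alpha_w=0.1, alpha_r=0.01):
    """One update of the average reward estimate and the linear critic weights."""
    # Linear state values: v(s, w) = w . x(s)
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    # Equation (1): differential TD error
    delta = reward - avg_reward + v_next - v
    # Equation (2): average reward update
    new_avg_reward = avg_reward + alpha_r * delta
    # Equations (3) and (5): the gradient of a linear critic is the feature vector
    new_w = [wi + alpha_w * delta * xi for wi, xi in zip(w, x)]
    return delta, new_avg_reward, new_w


delta, avg_r, w = average_reward_critic_step(
    w=[0.0, 0.0], avg_reward=0.0, x=[1.0, 0.0], x_next=[0.0, 1.0], reward=-1.0)
print(delta, avg_r, w)
```

Only the weight attached to the active feature of the current state changes, and it moves in the direction of the TD error.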

In [6]:
def clip_grad_norm_(module, max_grad_norm):
    nn.utils.clip_grad_norm_([p for g in module.param_groups for p in g["params"]], max_grad_norm)
# Actor-Critic agent (the class keeps the name SarsaAgent used in the rest of the notebook)
class SarsaAgent():
    """
    Initialization of the agent. All values are set to None so they can
    be initialized in the agent_init method.
    """
    def __init__(self):
        """Setup for the agent called when the experiment first starts."""
        self.actor_step_size = None
        self.critic_step_size = None
        self.avg_reward_step_size = None

        self.avg_reward = None
        self.critic = None
        self.actor = None

        self.actions = None

        self.last_action = None
        self.last_state = None

    def agent_init(self, agent_config = {}):
        """Setup for the agent called when the experiment first starts."""
        # Set the step sizes of the actor, the critic and the average reward
        self.actor_step_size = agent_config['optimizers_config']['actor_step_size']
        self.critic_step_size = agent_config['optimizers_config']['critic_step_size']
        self.avg_reward_step_size = agent_config['optimizers_config']['reward_step_size']

        self.actions = agent_config['network_config']['num_actions']

        # Set the initial value of the average reward estimate and create the
        # actor and critic networks with their loss and optimizers
        self.avg_reward = 1.0
        self.log_probs = 0
        
        self.actor = Actor(agent_config['network_config']).to(device)
        self.critic = Critic(agent_config['network_config']).to(device)
        
        self.critic_loss = nn.MSELoss()
        
        self.actor_opti = optim.Adam(self.actor.parameters(), lr = self.actor_step_size ) 
        self.critic_opti = optim.Adam(self.critic.parameters(), lr = self.critic_step_size )
        
        self.last_action = None
        self.last_state = None

    def select_action(self, state):
        """
        Selects an action by sampling from the actor's normal distribution
        Args:
            state (Numpy array): the state observation
        Returns:
            A one-element list with the sampled action, clipped to the
            bounds of the action space
        """
        # Pass the states to create the normal distribution
        dists = self.actor.forward(state)
        # Sample an action from the current normal Distribution
        action = dists.sample().detach().data.numpy()
        #self.log_probs = dists.log_prob(torch.tensor(action).detach())
        # Clip action to a given range
        #m = nn.Tanh()
        #chosen_action = m(action) * env.action_space.high.max()
        chosen_action = np.clip(action, env.action_space.low.min(), env.action_space.high.max())

        return [chosen_action]
    
    def agent_start(self, state):
        """The first method called when the experiment starts, called after
        the environment starts.
        Args:
            state (Numpy array): the state observation from the
                environment's env.reset() function.
        Returns:
            The first action the agent takes.
        """
        
        current_action = self.select_action(state)

        # Save action as last action
        self.last_action = current_action
        # Save state as last state
        self.last_state = state
        
        return self.last_action

    def agent_step(self, reward, state):
        """A step taken by the agent.
        Args:
            reward (float): the reward received for taking the last action taken
            state (Numpy array): the state observation from the
                environment's step based, where the agent ended up after the
                last step
        Returns:
            The action the agent is taking.
        """
                
        td_target, td_estimate = get_td(self.last_state, state, reward, 0, self.actor, self.critic, self.avg_reward)

        delta = td_target - td_estimate

        #self.avg_reward += self.avg_reward_step_size * delta
        
        # zero the gradients buffer
        self.critic_opti.zero_grad()
        # Compute the  Mean squared value error loss
        critic_loss = self.critic_loss(td_estimate.double().to(device), td_target.double().detach().to(device))
        #critic_loss = delta**2
        # Backprop the error
        critic_loss.backward()
        clip_grad_norm_(self.critic_opti, 0.5)
        # Optimize the network
        self.critic_opti.step()
        
        # Actor 
        self.actor_opti.zero_grad()
        # Compute mu and sigma
        norm_dists = self.actor.forward(self.last_state)
        # Construct equivalent loss function
        logs_probs = norm_dists.log_prob(torch.tensor(self.last_action))
        # Multiply by minus one as this is gradient ascent
        #entropy = norm_dists.entropy()
        actor_loss = -logs_probs * delta.detach()
        #actor_loss = -logs_probs * delta
        # Backprop the error
        actor_loss.backward()
        #(actor_loss + critic_loss).backward()
        clip_grad_norm_(self.actor_opti, 0.5)
        # Optimize the network
        self.actor_opti.step()
        #self.critic_opti.step()
        
        current_action = self.select_action(state)

        self.last_state = state
        self.last_action = current_action

        return self.last_action

    def agent_end(self, reward):
        """Run when the agent terminates.
        Args:
            reward (float): the reward the agent received for entering the
                terminal state.
        """

        # There is no next action to select here because this is the end
        # of the episode.
        
        state = np.zeros_like(self.last_state)
        
        
        td_target, td_estimate = get_td(self.last_state, state, reward, 1, self.actor, self.critic, self.avg_reward)

        delta = td_target - td_estimate

        #self.avg_reward += self.avg_reward_step_size * delta
        
        # zero the gradients buffer
        self.critic_opti.zero_grad()
        # Compute the  Mean squared value error loss
        critic_loss = self.critic_loss(td_estimate.double().to(device), td_target.double().detach().to(device))
        # Backprop the error
        critic_loss.backward()
        clip_grad_norm_(self.critic_opti, 0.5)
        # Optimize the network
        self.critic_opti.step()
        
        # Actor 
        self.actor_opti.zero_grad()
        # Compute mu and sigma
        norm_dists = self.actor.forward(self.last_state)
        # Construct equivalent loss function
        logs_probs = norm_dists.log_prob(torch.tensor(self.last_action))
        # Multiply by minus one as this is gradient ascent
        actor_loss = -logs_probs * delta.detach()
        # Backprop the error
        actor_loss.backward()
        clip_grad_norm_(self.actor_opti, 0.5)
        # Optimize the network
        self.actor_opti.step()

## Running the experiment

The following lines train the agent on the Pendulum problem and plot the sum of rewards obtained in each episode and the number of steps taken per episode.

In [None]:
# Train the Actor-Critic agent
#model = ActionValueNetwork(network_config).to(device)
num_runs = 1
num_episodes = 100

# Experiment parameters 256
agent_info = {
             'network_config': {
                 'state_dim': env.observation_space.shape[0],
                 'num_hidden_units': 256,
                 'num_actions': env.action_space.shape[0]
             },
             'optimizers_config': {
                 'actor_step_size': 5e-6,  #5e-6, 1e-5 
                 'critic_step_size': 1e-5, 
                 'reward_step_size': 1e-6, 
             }}

# Variable to store the amount of steps taken to solve the challeng
all_steps = []
# Variable to save the rewards in an episode
all_rewards = []
all_loss = []

# Agent
agent = SarsaAgent()

# Environment
env = gym.make('Pendulum-v0')
env.reset()
# Maximum number of possible iterations (default was 200)
env._max_episode_steps = 1500

# Number of runs are the times the experiment will start again (a.k.a episode)
for n_runs in range(num_runs):
    
    # Resets environment
    observation = env.reset()
    # Reset agent
    agent.agent_init(agent_info)
    # Generate last state and action in the agent
    last_action = agent.agent_start(observation)
    # Steps, rewards and loss at each episode to solve the challenge
    steps_per_episode = []
    rewards_per_episode = []
    loss_per_episode = []
        
    # Times the environment will start again without resetting the agent
    for t in tqdm(range(num_episodes)):
        
        # Reset done flag
        done = False
        # Set rewards, steps and loss to zero
        rewards = 0
        n_steps = 0
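        # Reset environment and restart the agent from the new initial observation
        observation = env.reset()
        last_action = agent.agent_start(observation)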
        # Run until the experiment is over
        while not done:
            
            # Render the environment only after t > # episodes
            #if t > 95:
            env.render()

            # Take a step with the environment
            observation, reward, done, info = env.step(last_action)
            
            rewards += reward
            n_steps += 1

            # If the episode is over, stop
            if done:
                # Last step with the agent
                agent.agent_end(reward)
            else:
                # Take a step with the agent
                last_action = agent.agent_step(reward, observation)
                
        # Append steps taken during the episode
        steps_per_episode.append(n_steps)
        # Reward obtained during the episode
        print("The reward obtained during episode {0} was: {1}".format(t, rewards))
        rewards_per_episode.append(rewards)
        # Loss obtained solving the experiment
        #loss_per_episode.append(agent.loss)

    # Steps taken during all episodes of this run
    all_steps.append(np.array(steps_per_episode))
    # Rewards obtained during all episodes of this run
    all_rewards.append(np.array(rewards_per_episode))
    # Loss obtained during all episodes
    #all_loss.append(loss_per_episode)

env.close()
print(np.mean(np.array(all_steps), axis=0))

  1%|          | 1/100 [00:12<21:08, 12.81s/it]

The reward obtained during episode 0 was: -9857.653080914199


  2%|▏         | 2/100 [00:25<20:57, 12.83s/it]

The reward obtained during episode 1 was: -8585.373776408758


  3%|▎         | 3/100 [00:38<20:48, 12.87s/it]

The reward obtained during episode 2 was: -9846.868831930376


  4%|▍         | 4/100 [00:48<19:10, 11.99s/it]

The reward obtained during episode 3 was: -9179.942439463983


  5%|▌         | 5/100 [00:58<18:06, 11.44s/it]

The reward obtained during episode 4 was: -8236.174689795154


  6%|▌         | 6/100 [01:11<18:22, 11.73s/it]

The reward obtained during episode 5 was: -8863.470058155028


  7%|▋         | 7/100 [01:24<18:53, 12.19s/it]

The reward obtained during episode 6 was: -8712.455066763052


  8%|▊         | 8/100 [01:37<19:08, 12.48s/it]

The reward obtained during episode 7 was: -9039.856749795586


  9%|▉         | 9/100 [01:52<20:04, 13.24s/it]

The reward obtained during episode 8 was: -9074.783085007612


 10%|█         | 10/100 [02:06<20:09, 13.44s/it]

The reward obtained during episode 9 was: -7972.1799892443905


 11%|█         | 11/100 [02:20<20:24, 13.75s/it]

The reward obtained during episode 10 was: -7716.00584942943


 12%|█▏        | 12/100 [02:33<19:28, 13.27s/it]

The reward obtained during episode 11 was: -8786.563282071605


 13%|█▎        | 13/100 [02:47<19:53, 13.71s/it]

The reward obtained during episode 12 was: -10056.308266432558


 14%|█▍        | 14/100 [02:58<18:12, 12.70s/it]

The reward obtained during episode 13 was: -8050.182406615001


 15%|█▌        | 15/100 [03:09<17:14, 12.17s/it]

The reward obtained during episode 14 was: -7762.480285431772


 16%|█▌        | 16/100 [03:20<16:33, 11.83s/it]

The reward obtained during episode 15 was: -7908.106783715948


 17%|█▋        | 17/100 [03:33<17:02, 12.32s/it]

The reward obtained during episode 16 was: -10254.705464119612


 18%|█▊        | 18/100 [03:46<17:06, 12.51s/it]

The reward obtained during episode 17 was: -10110.967971138298


 19%|█▉        | 19/100 [03:59<17:12, 12.74s/it]

The reward obtained during episode 18 was: -7343.046399817438


 20%|██        | 20/100 [04:14<17:33, 13.17s/it]

The reward obtained during episode 19 was: -9295.382095025341


 21%|██        | 21/100 [04:28<17:44, 13.48s/it]

The reward obtained during episode 20 was: -8640.852137956119


 22%|██▏       | 22/100 [04:42<17:47, 13.69s/it]

The reward obtained during episode 21 was: -7565.56505212768


 23%|██▎       | 23/100 [04:56<17:43, 13.81s/it]

The reward obtained during episode 22 was: -7890.076577993322


 24%|██▍       | 24/100 [05:07<16:23, 12.94s/it]

The reward obtained during episode 23 was: -10767.977960578268


 25%|██▌       | 25/100 [05:18<15:24, 12.33s/it]

The reward obtained during episode 24 was: -8678.15595495255


 26%|██▌       | 26/100 [05:28<14:22, 11.66s/it]

The reward obtained during episode 25 was: -7162.88158774846


 27%|██▋       | 27/100 [05:42<15:03, 12.38s/it]

The reward obtained during episode 26 was: -8629.602969839909


 28%|██▊       | 28/100 [05:57<15:53, 13.24s/it]

The reward obtained during episode 27 was: -7779.735329609214


 29%|██▉       | 29/100 [06:12<16:16, 13.76s/it]

The reward obtained during episode 28 was: -8739.676726250953


 30%|███       | 30/100 [06:28<16:40, 14.29s/it]

The reward obtained during episode 29 was: -8066.883284199606


 31%|███       | 31/100 [06:39<15:29, 13.47s/it]

The reward obtained during episode 30 was: -8674.92174786871


 32%|███▏      | 32/100 [06:53<15:12, 13.42s/it]

The reward obtained during episode 31 was: -8261.738720948233


 33%|███▎      | 33/100 [07:07<15:11, 13.61s/it]

The reward obtained during episode 32 was: -10016.993033398632


 34%|███▍      | 34/100 [07:17<14:03, 12.78s/it]

The reward obtained during episode 33 was: -7641.426509086404


In [None]:
steps_average = np.mean(np.array(all_steps), axis=0)
plt.plot(steps_average, label = 'Steps')
plt.xlabel("Episodes")
plt.ylabel("Iterations",rotation=0, labelpad=40)
plt.xlim(-0.2, num_episodes)
plt.ylim(steps_average.min(), steps_average.max())
plt.title("Average iterations to solve the experiment over runs")
plt.legend()
plt.show()
print("The minimum number of iterations used to solve the experiment was: {0}\n".format(np.array(all_steps).min()))
print("The maximum number of iterations used to solve the experiment was: {0}\n".format(np.array(all_steps).max()))

In [None]:
rewards_average = np.mean(all_rewards, axis=0)
plt.plot(rewards_average, label = 'Average Reward')
plt.xlabel("Episodes")
plt.ylabel("Sum of\n rewards\n during\n episode" ,rotation=0, labelpad=40)
plt.xlim(-0.2, num_episodes)
plt.ylim(rewards_average.min(), rewards_average.max())
plt.title("Average sum of rewards per episode over runs")
plt.legend()
plt.show()
print("The best reward obtained solving the experiment was: {0}\n".format(np.array(all_rewards).max()))
print("The worst reward obtained solving the experiment was: {0}\n".format(np.array(all_rewards).min()))

## Using the last trained Agent 

These lines run the last trained agent and save a video with the results.

In [None]:
# Test the trained agent
num_runs = 1
num_episodes = 1000

# Environment (the same one the agent was trained on)
env_to_wrap = gym.make('Pendulum-v0')
# Maximum number of possible iterations (default was 200)
env_to_wrap._max_episode_steps = 1500
env = Monitor(env_to_wrap, "./videos/pendulum", video_callable=lambda episode_id: True, force=True)


# Number of runs are the times the experiment will start again (a.k.a episode)
for n_runs in tqdm(range(num_runs)):
    
    # Resets environment
    observation = env.reset()
    # Generate last state and action in the agent
    last_action = agent.agent_start(observation)
        
    # Times the environment will start again without resetting the agent
    for t in tqdm(range(num_episodes)):

        # View environment
        env.render()

        # Take a step with the environment
        observation, reward, done, info = env.step(last_action)

        # If the goal has been reached stop
        if done:
            # Last step with the agent
            agent.agent_end(reward)
            break

        else:
            # Take a step with the agent
            last_action = agent.agent_step(reward, observation)


env.close()
env_to_wrap.close()

print("Episode finished after {} timesteps".format(t+1))

## Plotting the State-Values of the agent

This final plot shows the state values learned by the critic. The cost-to-go for a given state is calculated as $-\hat{v}(s, \mathbf{w})$.

In [None]:
# Resolution
values = 500
# Vector of angles (kept under the name pos_vals)
pos_vals = np.linspace(-np.pi, np.pi, num = values)
# Vector of angular velocities
vel_vals = np.linspace(-8.0, 8.0, num = values)

# Z grid values
av_grid = np.zeros((values, values))

# Compute the negated state value for each angle - velocity pair
with torch.no_grad():
    for ix in range(len(pos_vals)):
        # Batch of observations [cos(theta), sin(theta), theta_dot] for one angle
        obs = np.stack([np.full(values, np.cos(pos_vals[ix])),
                        np.full(values, np.sin(pos_vals[ix])),
                        vel_vals], axis=1).astype(np.float32)
        av_grid[ix] = -1 * agent.critic(obs).cpu().numpy().ravel()

In [None]:
# Plot the 3D surface
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# indexing='ij' makes the grids match av_grid[angle index][velocity index]
Px, Vy = np.meshgrid(pos_vals, vel_vals, indexing='ij')
ax.plot_surface(Vy, Px, av_grid, color = 'gray')
ax.set_title("Cost-to-go function learned", y = 1.1)
ax.set_xlabel('Angular velocity')
ax.set_ylabel('Angle')
ax.set_zlabel('Cost-to-go')
ax.view_init(45, azim=30)
plt.tight_layout()
plt.show()