_JORGE MENEU MORENO_

_Reinforcement Learning - Individual Assignment II_

_@ IE MBD 2022-2023_

# Landing a Spaceship with Reinforcement Learning








<img width="600" style="float:left" 
     src="https://images.pexels.com/photos/39896/space-station-moon-landing-apollo-15-james-irwin-39896.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=2" />

# 1. Context

This document presents a simulation of a landing spaceship trained with Reinforcement Learning, based on the `OpenAI gym` and `stable_baselines3` packages.


## 1.1. Introduction to Reinforcement Learning

<img width="600" style="float:center" 
     src="https://i.imgur.com/S8lEvQQ.png" />

<a name="Footnote">1</a>: _Schema of the basic elements in a markovian `RL` system_

**The Concept**

Reinforcement Learning (`RL`) is a subset of Machine Learning in which an agent learns to make decisions in an environment, by performing actions and receiving rewards or penalties. 

The goal is for the agent to learn a policy that maximizes the cumulative reward over time. The agent learns through trial and error, with the use of a reward function that provides feedback on the desirability of the agent's actions.

**The Players**

From the schema above we can identify the main players:

* Agent: the character interacting with the environment. Usually paired with an RL model and a policy.
* Model: a representation of the environment's dynamics, used to predict the next state and the reward derived from an action. Algorithms can be model-based or model-free.
* Policy: the guideline for finding an optimal action in a given state and environment.
* Environment: the physical surroundings of the agent, with which it interacts.
* State: a complete description of the world around the agent, and of the agent's own situation.
* Observation: a partial description of the state of the world.
* Action: the agent's response to the environmental situation. It may be taken in a `Continuous` or `Discrete` Action Space.
* Reward: the feedback value received for the actions taken by the agent, whose objective is to maximize the cumulative reward. The cumulative function is not a simple sum, as it includes a `discount factor`.
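
To make the role of the `discount factor` concrete, here is a minimal sketch of a discounted cumulative reward. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of `gym` or `stable_baselines3`.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_t gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards of a three-step episode
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With `gamma = 1.0` the function reduces to a plain sum, while smaller values shrink the contribution of later rewards.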

**Markov Property**

This property is a premise on which most RL algorithms rely:

> _The evolution of the Markov process in the future depends only on the present state and does not depend on past history._ 

In other words, the property states that the agent acts as a memoryless element in the stochastic process in which it is involved.



**The Tradeoff: Exploration and Exploitation**

We know that our agent's goal is to maximize a reward, based on a given policy.

But how?

We may not recall it, but when we learned to walk the situation was similar to this dichotomy: during the learning phase, there was always a tradeoff between exploring new ways to walk and exploiting the ways that had already given good results.

The same goes for Reinforcement Learning: depending on the algorithm we choose, we will have to figure out how to balance the tradeoff between both.
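
One common way to balance the two is the epsilon-greedy rule, sketched below under stated assumptions: the value estimates in `q_values` are made up, and `epsilon_greedy` is a hypothetical helper. Value-based methods often use a rule of this kind, whereas policy-gradient methods such as `PPO` typically explore by sampling from a stochastic policy instead.

```python
import random
from typing import Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


rng = random.Random(0)
q_values = [0.1, 0.5, 0.2, 0.0]  # hypothetical value estimates for 4 actions
choices = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
# Fraction of picks that went to the greedy action (index 1).
print(choices.count(1) / len(choices))
```

A larger `epsilon` explores more; decaying it over time shifts the agent from exploration towards exploitation.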

**The Optimal Policy**

In order to find an optimal policy, our agent must train for a series of episodes or time. There are two different approaches to find it during the training phase:

* Policy-Based Methods: teaching the agent which action to take in a given state, thus learning a policy function. The present document implements this family of methods.

* Value-Based Methods: teaching the agent which states are most valuable, and then taking the actions that lead to those states.


**The Models**

We tiptoed around the concept of Model, and threw around some terms such as `Model-free` and `Model-based`. Let's analyze these concepts in more detail:


<img width="600" style="float:center" 
     src="https://spinningup.openai.com/en/latest/_images/rl_algorithms_9_15.svg" />


<a name="Footnote" >2</a>: _Schema of some of the popular `RL` algorithms_.


The schema shows how the different families of `RL` algorithms are distributed according to their nature.

First of all:

* Model-Free RL: the agent is given no model (i.e. it has no transition model associated), and it learns through an explicit trial-and-error process. This results in models that are easier to implement and tune.

* Model-Based RL: the agent is given a model, which allows it to plan strategies and think ahead. This planning enables it to choose a more rewarding action from the possible range. These models usually underperform in real environments that differ from the simulated ones.

Having said that, our focus will be on `Model-Free RL`, as our project is based on one of those models. Within Model-Free RL, we distinguish two ways to implement it:

* Policy Optimization: an `on-policy` series of methods (the policy is updated with data collected by its current version) that optimizes the weights of the model either by performing gradient ascent, or by indirectly maximizing a surrogate objective function.

* Q-Learning: a usually `off-policy` series of methods (the policy is updated with data collected at any point of the training phase) that uses the `Bellman Equation` to learn an approximator of the optimal action-value function.
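
As an illustration of how the `Bellman Equation` drives Q-Learning, the sketch below applies a single tabular update. It is a simplified assumption for teaching purposes: `DQN` replaces the table with a neural network, and the helper `q_learning_update` and its default values are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple

QTable = DefaultDict[Tuple[int, int], float]


def q_learning_update(q: QTable, state: int, action: int, reward: float,
                      next_state: int, n_actions: int,
                      alpha: float = 0.1, gamma: float = 0.99,
                      done: bool = False) -> float:
    """Apply one Bellman backup: Q(s, a) += alpha * (target - Q(s, a))."""
    # The target bootstraps from the best action in the next state.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q: QTable = defaultdict(float)  # every unseen (state, action) pair starts at 0
print(q_learning_update(q, state=0, action=1, reward=1.0, next_state=1,
                        n_actions=4, alpha=0.5, gamma=0.9))  # 0.5
```

Because the target uses the maximum over next actions rather than the action actually taken, the update can learn from data gathered by any behaviour policy, which is what makes the method off-policy.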

We will analyze some of the specific models (`A2C` | `PPO` | `DQN`) in greater detail in the following sections.

Okay, that's `RL` in a nutshell!

Now that we know the basics, let's analyze the tools we will use during the project!



## 1.2. Introduction to `OpenAI gym`

In order to implement this Machine Learning subset, this project leverages `OpenAI Gym`, a toolkit for developing and comparing reinforcement learning algorithms with fully developed environments. The package not only offers direct compatibility with the most popular algorithms in the field (`PPO` | `A2C` | `DDPG` | `DQN` | `HER` | `SAC` | `TD3`) via the `stable_baselines3` package, but also provides an interface for the agent to interact with a variety of environments, which makes it a great tool for new developers.

Among them, we may highlight:

* `Atari`: including the famous `Breakout`.
* `MuJoCo`: including physics for walking agents.
* `Classic Control`: including the classic `CartPole`.
* `Box2D`: including varied scenarios, such as a `Car Racing` or the one we will implement, `Lunar Lander`.

The project we will implement is based on the `LunarLander` environment.



The system consists of the following:

* Agent: `Spaceship`

* Environment: `Moon`

* Action Space: `[Discrete]` [`Nothing`, `Left Engine`, `Main Engine`, `Right Engine`], listed in the order of their action indices `0` to `3`.

_NOTE: Although there is a Continuous Action Space version of the Lunar Lander, it seems counterintuitive to overcomplicate the project with an infinitely sized Action Space, so the project will be based on the original version of the environment._


* State:
    + Position: `x`|`y` coordinates
    + Velocity: `x`|`y` linear velocity
    + Orientation: angle and angular velocity
    + Leg contact: `True`|`False` for each leg on the ground



* Rewards:
    + Moving from the top of the screen to the landing pad and coming to rest: `[+100-140]`
    + Moving away from the landing pad: loses reward.
    + Crashing: `[-100]`
    + Each leg in contact with the ground: `[+10]`
    + Firing the main engine: `[-0.3]` per frame
    + Firing the side engines: `[-0.03]` per frame
    + Solved if `[R>=200]`
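
To see how these terms can add up to the `200` threshold, the sketch below tallies a hypothetical episode. It is a rough simplification, not the environment's reward function: the helper `episode_reward_sketch` is made up, the approach reward is passed in as a single number, and the `+100` bonus for coming to rest is an assumption taken from the Gym documentation as I recall it rather than from the list above.

```python
def episode_reward_sketch(approach_reward: float, main_engine_frames: int,
                          side_engine_frames: int, legs_in_contact: int,
                          crashed: bool, rest_bonus: float = 100.0) -> float:
    """Tally the reward terms listed above for one simplified episode."""
    total = approach_reward                 # roughly 100 to 140 when the pad is reached
    total -= 0.3 * main_engine_frames       # main engine cost per frame
    total -= 0.03 * side_engine_frames      # side engine cost per frame
    total += 10.0 * legs_in_contact         # each leg touching the ground
    total += -100.0 if crashed else rest_bonus  # assumed terminal bonus or crash penalty
    return total


# Hypothetical landing: 100 main-engine frames, 50 side-engine frames, both legs down.
reward = episode_reward_sketch(120.0, 100, 50, legs_in_contact=2, crashed=False)
print(reward, reward >= 200)
```

The example suggests why fuel efficiency matters: every extra second of main-engine thrust eats into the margin above the solving threshold.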
    

## 1.3. Introduction to `stable_baselines3`

We may say `OpenAI gym` will be the base to set up our agent and its environment. However, `stable_baselines3` will be in charge of training, evaluating and testing the agent.

`stable_baselines3` is a user-friendly library that packs a set of implementations of RL algorithms in `PyTorch`. The library is compatible with `OpenAI gym`, and offers a similar structure to `scikit-learn`.

In `scikit-learn`, after importing the data, exploring it and preprocessing it, we would define a model. That very same model would be trained, validated, tuned and later on, tested and saved.

In a similar fashion, with `OpenAI gym` we will first set up the environment to tackle, and then with `stable_baselines3` we will be able to choose our desired Reinforcement Learning algorithm. Again, the model object we instantiate will include some basic methods:

* `.learn()`
* `.predict()`
* `.save()`
* `.load()`

All of which are self-explanatory.

Among the reinforcement algorithms we can highlight the following:

* A2C
* PPO
* DQN
* DDPG
* HER
* SAC
* TD3

However, as our environment is limited to a `Discrete` set of actions and we may process it in parallel, we can narrow down the algorithms worth exploring to only a few:


<img width="400" style="float:center" 
     src="https://www.garcia-ferreira.es/wp-content/uploads/2022/05/aprendizaje_stable_baselines.png" />



<a name="Footnote" >3</a>: _A summary on the RL algorithms implemented according to the support of discrete/continuous actions and multiprocessing._


According to it, we should explore the following algorithms:

* PPO
* A2C
* DQN

Due to time limitations, we will stick to the best-performing of the three for the following experiments. However, we will analyze how each of them works under the hood.

Let's jump straight to the project's pipeline!

# 2. The Pipeline

<img width="2000" style="float:left" 
     src="https://i.imgur.com/0fZFoHU.png" />
     
<a name="Footnote" >3</a>: _Project's pipeline diagram_

The diagram above shows how the project's pipeline will be organized. Even though `RL` is a subset of `ML`, it doesn't implement the usual stages in which data is analyzed and preprocessed. 

In a nutshell, the pipeline will consist of the following steps:

* Environment Setup (setting up physics and tuning the way the environment works)
* Training (with different versions of the compatible `stable_baselines3` `RL` algorithms)
* Validation (exploring different hyperparameters)
* Test (benchmarking its accuracy)

## 2.2. Environment Setup

Setting up the `LunarLander-v2` environment is easy: it is created with the `.make()` function from the `gym` package.






In [1]:
import gym 

env = gym.make("LunarLander-v2")

`OpenAI gym` provides an interface to manually tune the environment's parameters. With them we can not only change the action space from the original `Discrete` to `Continuous`, but also tune some of the physics of the environment and add some noise to it to boost the model's robustness.

These are some of the parameters we may want to explore:

* continuous `[bool]`: whether the Action Space should be `Continuous` or `Discrete`.
* gravity `[float]`: the gravitational acceleration, given as a negative value.
* enable_wind `[bool]`: whether the environment should add wind as noise.
* wind_power `[float]`: related to the previous one, the magnitude of the linear wind applied to the spaceship.
* turbulence_power `[float]`: the magnitude of the rotational wind applied to the spaceship.
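
As a sketch of how these parameters could be grouped before creating the environment, the snippet below builds a keyword dictionary. The default values are quoted from the Gym documentation as I recall it and should be checked against the installed version; `make_env_kwargs` is a hypothetical helper, not a `gym` function.

```python
# Assumed defaults for LunarLander-v2; verify against the installed gym version.
DEFAULT_ENV_KWARGS = {
    "continuous": False,
    "gravity": -10.0,
    "enable_wind": False,
    "wind_power": 15.0,
    "turbulence_power": 1.5,
}


def make_env_kwargs(**overrides):
    """Merge overrides into the defaults, rejecting unknown parameter names."""
    unknown = set(overrides) - set(DEFAULT_ENV_KWARGS)
    if unknown:
        raise ValueError(f"Unknown LunarLander parameters: {sorted(unknown)}")
    return {**DEFAULT_ENV_KWARGS, **overrides}


windy_kwargs = make_env_kwargs(enable_wind=True, wind_power=10.0)
print(windy_kwargs)
# With gym installed, the environment could then be created as:
# env = gym.make("LunarLander-v2", **windy_kwargs)
```

Keeping the overrides in one dictionary makes it easy to log exactly which physics settings a given experiment used.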

In this project, we will take the default settings for the environment, to later on have an objective comparison with other participants.

In [2]:
import gym 

from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

# Wrap the environment in the vectorized interface used by stable_baselines3.
# A single environment gives no speed-up by itself; listing several factory
# functions would step multiple copies of the environment together.
env = DummyVecEnv([lambda: gym.make("LunarLander-v2")])

## 2.3. Training

### 2.3.1. Introduction

The training phase is crucial in any `Machine Learning` system. It defines the term itself, as it is in this phase that the algorithm _learns_ the weights that best adapt to the given input data.

In `Reinforcement Learning` its relevance is even higher: the preprocessing stage may be skipped altogether in this field, hence all the progress made relies exclusively on this phase.

Invoking the training phase in `stable_baselines3` is straightforward:

1. We define the model's algorithm family, the Policy, the environment and the hyperparameters we consider.
2. We train the agent up to a certain number of steps.

Both stages are relevant: 

* The first one directly defines the path taken, and the capacity to converge to or diverge from the minimum.

* The second one matters because the time the agent spends learning directly affects the stability and quality of the model.

Let's see an example on its implementation:

In [None]:
from stable_baselines3 import PPO

model = PPO('MlpPolicy', env, verbose = 1)
model.learn(total_timesteps = 100)

As we can see, we define a `PPO (Proximal Policy Optimization)` model object by calling the `PPO()` constructor. In it we declare the Policy to implement, `MlpPolicy (MultiLayer Perceptron Policy)`, the environment, and the verbosity (feedback) given in each iteration of the model.

Afterwards, we proceed to train the model using the `.learn()` method for 100 timesteps.

This example shows the simplicity of the process.

Recalling what we mentioned in the previous section, let's analyze the three families of algorithms we will try:

* `A2C` (Advantage Actor-Critic): an On-Policy model that combines both policy-based and value-based methods. It uses an actor to determine the best action to take in a given state and a critic to evaluate the value of the selected action.

* `PPO` (Proximal Policy Optimization): an On-Policy model that improves upon the traditional policy gradient method. It uses a clipped surrogate objective, inspired by trust region optimization, to ensure that the updates to the policy are not too drastic, leading to improved stability and convergence.

* `DQN` (Deep Q-Network): an Off-Policy model that leverages deep neural networks to approximate the Q-value function.
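
To illustrate how `PPO` keeps its updates from being too drastic, here is a minimal per-sample sketch of the clipped surrogate objective. It is a scalar simplification, not the `stable_baselines3` implementation: `clipped_surrogate` is a hypothetical helper, `ratio` stands for the probability ratio between the new and old policy, and the `0.2` clip range is assumed as a typical value.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage) whose probability already grew by 50%:
# the objective is capped, so there is no incentive to push it further.
print(clipped_surrogate(ratio=1.5, advantage=2.0))

# A bad action (negative advantage) whose probability was already halved:
# the pessimistic (lower) value is kept.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

Taking the minimum of the clipped and unclipped terms means the objective only ever ignores changes that would make it look better, which is what keeps the policy close to the one that collected the data.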




<img width="800" style="float:center" 
     src="https://i.imgur.com/nGIXHgd.png" />



<a name="Footnote" >4</a>: _Yearly evolution of RL Models usage_

### 2.3.2. Benchmarking `A2C`| `PPO`| `DQN`

Even if research usually recommends `PPO` for this specific environment, let's compare how each of them performs with the default parameters:

In [None]:
import os

from stable_baselines3 import PPO, A2C, DQN

log_path = os.path.join('Training', 'Logs')

A2C_model = A2C('MlpPolicy', env, tensorboard_log = log_path, verbose = 1)
A2C_model.learn(total_timesteps = 100000)

PPO_model = PPO('MlpPolicy', env, tensorboard_log = log_path, verbose = 1)
PPO_model.learn(total_timesteps = 100000)

DQN_model = DQN('MlpPolicy', env, tensorboard_log = log_path, verbose = 1)
DQN_model.learn(total_timesteps = 100000)

One parameter (it's not even a hyperparameter) that comes in quite handy is `tensorboard_log`. This logging extension enables us to easily visualize the evolution of the model's training, to compare both performance and behaviour.

In [None]:
# log_path is the directory passed as tensorboard_log during training
A2C_training_log_path = os.path.join(log_path, 'A2C_1')
PPO_training_log_path = os.path.join(log_path, 'PPO_15')
DQN_training_log_path = os.path.join(log_path, 'DQN_1')

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir={DQN_training_log_path}

Let's evaluate the results obtained in each experiment:

**_Value Loss_**


<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>A2C</figcaption>
      <img style = "display: inline-block" src = "./Models/A2C/train_value_loss-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>PPO</figcaption>
      <img style = "display: inline-block" src = "./Models/PPO/train_value_loss-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>DQN</figcaption>
      <img style = "display: inline-block" src = "./Models/DQN/rollout_exploration_rate.svg">
  </div>
</div>

<a name="Footnote" >5</a>: _Performance and behaviour comparison among A2C, PPO and DQN_


_NOTE: Even though 1e5 steps may offer incomplete insight into the behaviour, due to time constraints that was the limit set to benchmark the different models and hyperparameters. Ideally we would try the models with more timesteps._

**Outcome**


> From the logged information we find `PPO` shows the most stable performance of the three. Looking at the `value_loss`, we can easily spot that `A2C` results in spikes both upwards and downwards. For `DQN` we do not have the chance to measure the same concept, but we can indeed check the rollout exploration rate (almost 0).

For these reasons, we will continue our project with this model: `PPO`.

_NOTE: As a side note, the stable behaviour provided by `PPO` can be a double-edged sword: it may have trouble finding the global minimum, and get stuck in a local minimum._

### 2.3.3. Hyperparameter Exploration

Now that we have established our preference for `PPO` over the other models, let's analyze the kind of `hyperparameters` the `PPO()` constructor offers to improve our model's performance.

Checking the documentation via the `model??` command in Jupyter Notebook, we see the following hyperparameters:

* `policy`: `[MlpPolicy, CnnPolicy]`
* `learning_rate`: the classic learning rate, which can be progressively toned down according to a linear function (we will see an implementation of this later). This hyperparameter represents the magnitude `[0-1]` of the gradient descent update step.
    + If too small, too slow to converge.
    + If too big, may diverge, skipping the minima.
        + Experiments: `[0.00001, 0.0005, 0.001]`

* `n_steps`: the number of steps to run in the environment before each update iteration.
    + If too big, the updates will not occur as regularly, leading to a more stabilized update.
    + If smaller, the updates will occur more frequently, leading to a less stable update.
         + Experiments: `[4, 1024, 2048, 4096]`

* `batch_size`: number of experiences packed into each gradient step (size of the minibatch).
    + If too big, it will train slower and may worsen the model's performance.
    + If smaller, training will be faster.
         + Experiments: `[32, 128, 512]`

*  `n_epochs`: number of passes over the collected experiences during gradient descent in each update.
    + If too big, the agent learns faster from each batch of experiences, at the cost of unstable updates.
    + If smaller, learning will be slower, but with more stable updates.
         + Experiments: `[3, 8, 10, 100]`

*  `gamma`: represents the discount factor in the reward function, and should be treated as the weight given to future rewards. Depending on the complexity of the environment, we should choose a bigger or smaller value.
    + If too big, the agent will weigh future rewards heavily.
    + If smaller, the agent will give preference to immediate rewards.
         + Experiments: `[0.8, 0.98, 0.999]`

*  `gae_lambda`: a parameter that controls the trade-off between bias and variance during the `GAE` (Generalized Advantage Estimation) calculation.
    + If too big, higher variance.
    + If smaller, higher bias.
         + Experiments: `[0.9, 0.95, 0.99]`

*  `policy_kwargs`: enables us to customize the policy of our model, e.g. with neural network architectures defined in dictionaries.
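
To show where `gamma` and `gae_lambda` enter the calculation, the following is a simplified sketch of Generalized Advantage Estimation over a single trajectory segment. It assumes no episode ends inside the segment and is not the `stable_baselines3` implementation; `compute_gae` and the sample numbers are illustrative.

```python
from typing import List, Sequence


def compute_gae(rewards: Sequence[float], values: Sequence[float],
                last_value: float, gamma: float = 0.99,
                gae_lambda: float = 0.95) -> List[float]:
    """Advantage estimates for one segment without terminations inside it."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better the step went than the critic expected.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the current and later TD errors.
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]  # hypothetical critic predictions
print(compute_gae(rewards, values, last_value=0.0, gamma=1.0, gae_lambda=0.0))  # [1.0, 1.0, 1.0]
print(compute_gae(rewards, values, last_value=0.0, gamma=1.0, gae_lambda=1.0))  # [3.0, 2.0, 1.0]
```

With `gae_lambda = 0` each advantage relies only on the critic's one-step estimate (more bias), while `gae_lambda = 1` sums all the observed rewards that follow (more variance).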

        
### 2.3.4. Hyperparameter Tuning

Now that we've gone through some of these `hyperparameters` definitions, let's jump into the results obtained from the experimentation!

_NOTE: All `hyperparameters` were tested using `PPO`, `MlpPolicy` and `total_timesteps = 100_000`._


#### 2.3.4.5. `learning_rate`

**_Explained Variance_**

<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.00001</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/learning_rate/0.00001/train_explained_variance-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.0005</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/learning_rate/0.0005/train_explained_variance-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.001</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/learning_rate/0.001/train_explained_variance-2.svg">
  </div>
</div>

**_Value Loss_**

<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.00001</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/learning_rate/0.00001/train_value_loss-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.0005</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/learning_rate/0.0005/train_value_loss-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.001</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/learning_rate/0.001/train_value_loss-2.svg">
  </div>
</div>


**Outcome**

> * The `learning_rate=0.00001` shows an unstable and slow convergence towards the minima, with a high value_loss.
> * The `learning_rate=0.0005` shows fast and steady convergence towards 1 in explained_variance and 0 in value loss.
> * The `learning_rate=0.001` results in an unstable behaviour, with higher value loss than smaller values.
 



#### 2.3.4.6. `n_steps`

**_Explained Variance_**

<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>4</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_steps/4/train_explained_variance-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>2048</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_steps/2048/train_explained_variance-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>4096</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_steps/4096/train_explained_variance-2.svg">
  </div>
</div>

**_Value Loss_**


<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>4</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_steps/4/train_value_loss-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>2048</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_steps/2048/train_value_loss-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>4096</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_steps/4096/train_value_loss-2.svg">
  </div>
</div>


**Outcome**

> * The `n_steps=4` shows severe instability and unreliable performance.
> * The `n_steps=2048` shows far better stability, yet the results aren't as steady as with higher values.
> * The `n_steps=4096` results in the best behaviour, with optimal results in both metrics.




#### 2.3.4.7. `batch_size`

**_Explained Variance_**

<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>32</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/batch_size/32/train_explained_variance-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>128</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/batch_size/128/train_explained_variance-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>512</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/batch_size/512/train_explained_variance-2.svg">
  </div>
</div>

**_Value Loss_**

<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>32</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/batch_size/32/train_value_loss-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>128</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/batch_size/128/train_value_loss-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>512</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/batch_size/512/train_value_loss-2.svg">
  </div>
</div>


**Outcome**

> * The `batch_size=32` presents steady and fast convergence in both metrics, being the best choice among the three.
> * The `batch_size=128` shows an unstable and slow convergence towards the minima, with a high value_loss and a low explained_variance. Not the worst, though.
> * The `batch_size=512` shows an unstable and slow convergence towards the minima, with a high value_loss and the smallest explained_variance.





#### 2.3.4.8. `n_epochs`

**_Explained Variance_**

<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>3</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_epochs/3/train_explained_variance-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>8</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_epochs/8/train_explained_variance-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>10</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_epochs/10/train_explained_variance-2.svg">
  </div>
</div>

**_Value Loss_**

<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>3</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_epochs/3/train_value_loss-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>8</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_epochs/8/train_value_loss-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>10</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/n_epochs/10/train_value_loss-2.svg">
  </div>
</div>


**Outcome**

> * The `n_epochs=3` shows the slowest convergence towards both 1 in Explained Variance and 0 in Value Loss.
> * The `n_epochs=8` shows the best compromise for both metrics, with a fast and smooth behaviour.
> * The `n_epochs=10` shows good results, similar to `n_epochs=8`.




#### 2.3.4.9. `gamma`

**_Explained Variance_**


<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.8</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gamma/0.8/train_explained_variance-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.98</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gamma/0.98/train_explained_variance-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.999</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gamma/0.999/train_explained_variance-2.svg">
  </div>
</div>

**_Value Loss_**


<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.8</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gamma/0.8/train_value_loss-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.98</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gamma/0.98/train_value_loss-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.999</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gamma/0.999/train_value_loss-2.svg">
  </div>
</div>



**Outcome**

> * The `gamma=0.8` shows an unstable and slow convergence towards the minima, with unreliable behaviour in the first steps.
> * The `gamma=0.98` presents unstable behaviour, plus an erratic functioning when analyzing both metrics towards the last 30k steps.
> * The `gamma=0.999` shows fast and steady convergence towards 1 in explained_variance and 0 in value loss.
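
One way to read these three values is through the rule-of-thumb "effective horizon" of a discount factor, `1 / (1 - gamma)`, which approximates how many future steps still carry noticeable weight. The sketch below is only that heuristic, with `effective_horizon` as a hypothetical helper; it is not something computed by `stable_baselines3`.

```python
def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb horizon 1 / (1 - gamma), measured in timesteps."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


for gamma in (0.8, 0.98, 0.999):
    print(gamma, round(effective_horizon(gamma)))
# 0.8 5
# 0.98 50
# 0.999 1000
```

Under this heuristic, `gamma=0.8` only looks about five steps ahead, which is plausibly too short-sighted for a landing that is rewarded many steps after the decisive manoeuvres.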





#### 2.3.4.10. `gae_lambda`

**_Explained Variance_**


<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.9</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gae_lambda/0.9/train_explained_variance-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.95</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gae_lambda/0.95/train_explained_variance-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.99</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gae_lambda/0.99/train_explained_variance-2.svg">
  </div>
</div>

**_Value Loss_**


<div style = "display: flex; justify-content: center;">
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.9</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gae_lambda/0.9/train_value_loss-2.svg">
  </div>
  <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.95</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gae_lambda/0.95/train_value_loss-2.svg">
  </div>
   <div style = "width: 250px; 
  margin: 0 1rem;
    margin-bottom : 1.5rem;">
      <figcaption>0.99</figcaption>
      <img style = "display: inline-block" src = "./Hyperparameters/gae_lambda/0.99/train_value_loss-2.svg">
  </div>
</div>


**Outcome**

> * The `gae_lambda=0.9` shows a consistent and fast convergence towards the minima, maximizing the explained variance and minimizing the value loss.
> * The `gae_lambda=0.95` shows a slower and less steady convergence.
> * The `gae_lambda=0.99` results in the slowest convergence among the three.



___

#### _Summary_

| Model | Policy    | learning_rate | n_steps | batch_size | n_epochs | gamma | gae_lambda |
|-------|-----------|---------------|---------|------------|----------|-------|------------|
| PPO   | MlpPolicy | 0.0005        | 4096    | 32         | 8        | 0.999 |     0.9    |
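
To get a feel for what the summary values imply, the sketch below counts rollouts and gradient steps. It assumes a single environment and that each rollout of `n_steps` experiences is split into minibatches of `batch_size` and reused for `n_epochs` passes; `ppo_update_counts` is a hypothetical helper, and the exact bookkeeping inside `stable_baselines3` may differ.

```python
import math


def ppo_update_counts(total_timesteps: int, n_steps: int, batch_size: int,
                      n_epochs: int, n_envs: int = 1) -> dict:
    """Estimate how many rollouts and gradient steps a PPO run performs."""
    rollout_size = n_steps * n_envs
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    minibatches_per_epoch = math.ceil(rollout_size / batch_size)
    steps_per_rollout = n_epochs * minibatches_per_epoch
    return {
        "rollout_size": rollout_size,
        "n_rollouts": n_rollouts,
        "minibatches_per_epoch": minibatches_per_epoch,
        "gradient_steps_per_rollout": steps_per_rollout,
        "total_gradient_steps": n_rollouts * steps_per_rollout,
    }


# Summary configuration with the 100_000 timesteps used during tuning.
print(ppo_update_counts(100_000, n_steps=4096, batch_size=32, n_epochs=8))
```

Under these assumptions the tuning runs perform 25 rollouts of 1024 gradient steps each, which helps explain why small `batch_size` and moderate `n_epochs` values can still make fast progress.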


### 2.3.5. Final Training

Concluding the `hyperparameter tuning` phase, we will proceed to train our final model, using two new premises:

1. Longer Training: `total_timesteps = 2_000_000`
2. Learning Rate Schedule (slowly reducing the learning_rate as the timesteps progress).

Both elements will help the model converge to the solution (`R = 200`). We will also leverage both `eval_callback` and `stop_callback` to stop the training process as soon as the model reaches the desired reward.

Later on we can experiment with these premises to optimize our model even further.

Without further ado:

In [None]:
import os
save_path = os.path.join('Training', 'Saved Models')
log_path = os.path.join('Training', 'Logs')

In [None]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=200, verbose=1)
eval_callback = EvalCallback(env, 
                             callback_on_new_best=stop_callback, 
                             eval_freq=10000, 
                             best_model_save_path=save_path, 
                             verbose=1)
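
The interplay between the two callbacks can be pictured with a simplified, library-free emulation. This is a sketch of the stopping rule as I understand it, not the actual `stable_baselines_3` code: the helper `first_stopping_eval` and the list of evaluation rewards are hypothetical. The nested callback is assumed to run only when an evaluation sets a new best mean reward, and training stops once that best value reaches the threshold.

```python
from typing import Iterable, Optional


def first_stopping_eval(mean_rewards: Iterable[float],
                        reward_threshold: float = 200.0) -> Optional[int]:
    """Return the index of the evaluation at which training would stop, or None."""
    best_mean_reward = float("-inf")
    for i, mean_reward in enumerate(mean_rewards):
        if mean_reward > best_mean_reward:
            # New best model: this is when the nested stop callback is consulted
            best_mean_reward = mean_reward
            if best_mean_reward >= reward_threshold:
                return i
    return None


# Hypothetical mean rewards from successive evaluations (illustrative numbers only)
evals = [-150.0, -20.0, 95.0, 80.0, 210.0, 250.0]
stop_index = first_stopping_eval(evals)
print("training would stop at evaluation number", stop_index)
```

With `eval_freq=10000` and a single environment, each entry of such a list would correspond to one evaluation every 10,000 training steps.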

In [None]:
from typing import Callable

from stable_baselines3 import PPO


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """
    Linear learning rate schedule.

    :param initial_value: Initial learning rate.
    :return: schedule that computes
      current learning rate depending on remaining progress
    """
    def func(progress_remaining: float) -> float:
        """
        Progress will decrease from 1 (beginning) to 0.

        :param progress_remaining:
        :return: current learning rate
        """
        return progress_remaining * initial_value

    return func

In [None]:
model = PPO("MlpPolicy", env, learning_rate=linear_schedule(0.0005),
            n_steps=4096, batch_size=32, n_epochs=8,
            gamma=0.999, gae_lambda=0.9, verbose=1)
model.learn(total_timesteps=2_000_000, reset_num_timesteps=True, callback=eval_callback)
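
As a quick sanity check on the schedule, the sketch below evaluates the learning rate at a few points of the training run. It repeats `linear_schedule` so that it is self-contained. The helper `lr_at` is hypothetical and assumes that the remaining progress is computed as `1 - timestep / total_timesteps`.

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Same linear schedule as in the training cell, repeated for self-containment."""
    def func(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return func


TOTAL_TIMESTEPS = 2_000_000
schedule = linear_schedule(0.0005)


def lr_at(timestep: int) -> float:
    """Hypothetical helper: learning rate after `timestep` training steps."""
    progress_remaining = 1.0 - timestep / TOTAL_TIMESTEPS
    return schedule(progress_remaining)


for step in (0, 500_000, 1_000_000, 2_000_000):
    print(f"timestep {step:>9,}: learning_rate = {lr_at(step):.6f}")
```

The rate starts at the tuned value of `0.0005` and would only reach zero at the very last timestep. If the reward threshold stops training earlier, the final learning rate is simply whatever the schedule gives at that point.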

## 2.4. Evaluation

After this training phase, let's see what rewards our model obtains over 10 `episodes`:

In [None]:
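from stable_baselines3.common.evaluation import evaluate_policy

# evaluate_policy returns the mean and standard deviation of the episode rewards
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, render=True)
env.close()
print(f"mean_reward = {mean_reward:.2f} +/- {std_reward:.2f}")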

Hooray!

We have successfully solved the problem, as we have surpassed the `R=200` requirement.

## 2.5. Test

We may also want to check how well our model predicts the action for a given observation. We can even check the reward derived from taking that action:

In [None]:
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, done, info = env.step(action)
    env.render()
    if done: 
        print('info', info)
        break
env.close()

# 3. Results

To cover all bases, we will proceed to upload the model to the `Deep Reinforcement Learning Leaderboard` hosted on `HuggingFace`.

[LeaderBoard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard)


[Notebook](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb)

From the models uploaded, we attach one of the rendered evaluations showing its behaviour:

![LunarLander](https://i.imgur.com/uUluCMn.gif "LunarLander")


Results: `267.05 +/- 19.58`

_NOTE: The final uploaded model may differ in the `learning_rate` from what has been analyzed (as a linear decay has been implemented)._
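
The result above is read as the mean and standard deviation of the per-episode rewards. The sketch below shows how such a summary can be computed and compared against the `R = 200` requirement. The episode rewards are hypothetical, and the `mean - std` score is my understanding of how the leaderboard ranks models, so treat it as an assumption rather than a specification.

```python
from statistics import fmean, pstdev
from typing import Sequence, Tuple


def summarize(episode_rewards: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation of per-episode rewards."""
    return fmean(episode_rewards), pstdev(episode_rewards)


def lower_bound_score(mean_reward: float, std_reward: float) -> float:
    """Conservative score (assumed leaderboard ranking): mean minus std."""
    return mean_reward - std_reward


# Hypothetical per-episode rewards, for illustration only
episode_rewards = [250.0, 270.0, 290.0]
mean_reward, std_reward = summarize(episode_rewards)
print(f"{mean_reward:.2f} +/- {std_reward:.2f}")

# Applying the same score to the result reported in this section
reported_score = lower_bound_score(267.05, 19.58)
print(f"reported mean - std = {reported_score:.2f}, above 200: {reported_score > 200}")
```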

## 3.1. Other Models Results

Out of the different versions uploaded to the `HuggingFace` repo, I highlight the following two, both of which solve the problem:


![LunarLander](https://i.imgur.com/Fz1H4Af.gif "LunarLander")

![LunarLander](https://i.imgur.com/jzPvClN.gif "LunarLander")

# 4. Conclusions

Reinforcement Learning is a promising field in the Machine Learning ecosystem. The mathematics behind it is still difficult to digest, which complicates its usage in projects where explainability is a centerpiece.

However, it is precisely in those abstract problems, where other Machine Learning approaches fall short, that Reinforcement Learning shines.

Throughout the project, we have given an overview of the main concepts and ideas behind the field. This review is fundamental to understanding the practice:

> _Experience without theory is blind, but theory without experience is mere intellectual play._ - Immanuel Kant

The practice behind the theory, the implementation of the `Proximal Policy Optimization` algorithm, shows not only the strengths of the model, but also the relevance of tuning `hyperparameters` as a way to improve performance substantially and to reduce the costs derived from the training stages.

The project concludes with a successful implementation of the model, solving the problem, and deploying the results to the `HuggingFace` `Deep Reinforcement Learning Leaderboard`.

# 5. References

<sup>Schema of the basic elements in a markovian RL system</sup>

<sup>Schema of some of the popular RL algorithms.</sup>

<sup>A summary on the RL algorithms implemented according to the support of discrete/continuous actions and multiprocessing.</sup>

<sup>Yearly evolution of RL Models usage</sup>

<sup>Sutton, R. S., & Barto, A. (1998). Reinforcement Learning: An Introduction. The MIT Press.</sup>

<sup>Stable-Baselines3 Docs - Reliable Reinforcement Learning Implementations — Stable Baselines3 1.2.0a2 documentation. (n.d.). Stable-Baselines3.Readthedocs.io. https://stable-baselines3.readthedocs.io/en/master/</sup>

<sup>Gym Documentation. (n.d.). Www.gymlibrary.dev. Retrieved February 1, 2023, from https://www.gymlibrary.dev</sup>

<sup>Performance Analysis of DQN Algorithm on the Lunar Lander task — Neuromatch Academy: Deep Learning. (n.d.). Deeplearning.neuromatch.io. Retrieved February 1, 2023, from https://deeplearning.neuromatch.io/projects/ReinforcementLearning/lunar_lander.html</sup>

<sup>Mnih, V., Badia, Adrià Puigdomènech, Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. ArXiv.org. https://arxiv.org/abs/1602.01783</sup>

<sup>Kiran, M., & Ozyildirim, M. (n.d.). HYPERPARAMETER TUNING FOR DEEP REINFORCEMENT LEARNING APPLICATIONS *. Retrieved February 1, 2023, from https://arxiv.org/pdf/2201.11182.pdf</sup>

<sup>Markov Property - an overview | ScienceDirect Topics. (n.d.). Www.sciencedirect.com. Retrieved February 1, 2023, from https://www.sciencedirect.com/topics/engineering/markov-property#:~:text=The%20Markov%20property%20means%20that</sup>

<sup>Part 1: Key Concepts in RL — Spinning Up documentation. (2018). Openai.com. https://spinningup.openai.com/en/latest/spinningup/rl_intro.html</sup>

<sup>Part 2: Kinds of RL Algorithms — Spinning Up documentation. (2018). Openai.com. https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html</sup>

<sup>Part 3: Intro to Policy Optimization — Spinning Up documentation. (2018). Openai.com. https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html</sup>

<sup>OpenAI Baselines: ACKTR & A2C. (2017, August 18). OpenAI. https://openai.com/blog/baselines-acktr-a2c/</sup>

<sup>OpenAI. (2017, July 20). Proximal Policy Optimization. OpenAI. https://openai.com/blog/openai-baselines-ppo/</sup>

<sup>Logger — Stable Baselines3 1.8.0a3 documentation. (n.d.). Stable-Baselines3.Readthedocs.io. Retrieved February 1, 2023, from https://stable-baselines3.readthedocs.io/en/master/common/logger.html</sup>

<sup>Tensorboard Integration — Stable Baselines3 1.8.0a3 documentation. (n.d.). Stable-Baselines3.Readthedocs.io. Retrieved February 1, 2023, from https://stable-baselines3.readthedocs.io/en/master/guide/tensorboard.html</sup>