## Introduction
In this practical we introduce the idea of reinforcement learning, discuss how it differs from supervised and unsupervised learning and then build an agent that learns to drive a car up the mountain.

## Learning Objectives
* Understand the relationship between the **environment** and the **agent**.
* Understand how a **policy** is used by an agent to select an action.
* Describe how to implement a **run-loop** that controls the interaction between environment and agent.
* Understand how the **state**, **action** and **reward** are communicated between the agent and the environment.
* Be able to implement a simple **Q-learning** RL algorithm called **DQN**.
* Discover at least one potential issue with DQN.

## A) Learning Paradigms
**Supervised learning:** a learning paradigm in which we are given inputs together with a target value or class for each input. The goal is to predict the target value or class for new inputs.

**Unsupervised learning:** a learning paradigm in which we are given only inputs. The goal is to find patterns in those inputs.

**Reinforcement learning:** a learning paradigm concerned with training an **agent** to maximise the **reward** it obtains through interaction with an **environment**.


## B) Reinforcement Learning ( RL )

#### How does RL work?

The **environment** defines a set of **actions** that an agent can take. The agent observes the current **state** of the environment, tries actions and *learns* a **policy**, which is a distribution over the possible actions given a state of the environment.

The following diagram illustrates the interaction between the agent and environment. We will explore each of the terms in more detail throughout this practical. 


![RL Overview](data/rl-image.png)
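
The interaction in the diagram can be sketched as a small run-loop. The following is a minimal illustration only: `CorridorEnv`, `RandomAgent` and `run_episode` are hypothetical names invented for this sketch (they are not part of Gym or of this practical's code), and the policy is simply a uniform distribution over two actions.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state

    def step(self, action):
        # Action 0 moves left, action 1 moves right (clamped at the left wall).
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length - 1
        reward = -1  # every step costs one unit of reward
        return self.position, reward, done


class RandomAgent:
    """Agent whose policy is a uniform distribution over the actions."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)

    def policy(self, state):
        # Probability of each action given the state (state-independent here).
        return [1.0 / self.num_actions] * self.num_actions

    def act(self, state):
        probs = self.policy(state)
        return self.rng.choices(range(self.num_actions), weights=probs)[0]


def run_episode(env, agent, max_steps=1000):
    """Run-loop: the agent sends actions, the environment answers with state and reward."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = agent.act(state)               # agent -> environment
        state, reward, done = env.step(action)  # environment -> agent
        total_reward += reward
        if done:
            break
    return total_reward


total = run_episode(CorridorEnv(), RandomAgent(num_actions=2))
print("Total reward of the random agent:", total)
```

A learning agent would keep the same loop and only change how `policy` assigns probabilities to actions.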


## C) Outline

We will use an OpenAI Gym environment called **Mountain Car**. The goal is to train an agent to drive a car up the mountain. The practical is organised as follows:

1. Introduce the **Mountain Car** environment and explore the states and actions available.
2. Create a simple agent that takes random actions.
3. Going from random agent to skilled agent:
    - Introduce the Q-learning algorithm.
    - The intuition behind Q-learning.
    - Ways to implement Q-learning:
        - Tabular method.
        - Function approximator: neural network.
4. Merge Q-learning with Deep Learning:
    - Introduce the DQN algorithm.
    - Use DQN.
    - Evaluate the RL algorithm:
        - Compare DQN with the random agent.
5. Going deeper into DQN:
    - DQN main components.
    - Implement the components.
    - Stack the components together.
6. The big picture of reinforcement learning:
    - General RL taxonomy:
        - Value-function-based methods.
        - Gradient-based methods.
        - Hybrid methods.
    - Drawbacks of the value-function methods.
    - Next steps.
7. References.

## 1- Explore the Environment

![SegmentLocal](data/environment.gif "segment")

 The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
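
To see why momentum is needed, here is a rough pure-Python sketch of the car's dynamics. The constants (engine force `0.001`, gravity `0.0025`, goal at position `0.5`, start at `-0.5`) are assumptions based on the commonly documented Mountain Car dynamics, not values read from this practical's environment. The helper names `step`, `steps_to_goal`, `always_right` and `with_momentum` are hypothetical.

```python
import math


def step(position, velocity, action):
    """One step of the assumed dynamics; action is 0 (left), 1 (no push) or 2 (right)."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    position = max(-1.2, min(0.6, position))
    if position == -1.2 and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity


def steps_to_goal(choose_action, max_steps=1000, goal=0.5):
    """Return the number of steps needed to reach the goal, or None if it is never reached."""
    position, velocity = -0.5, 0.0
    for t in range(1, max_steps + 1):
        action = choose_action(position, velocity)
        position, velocity = step(position, velocity, action)
        if position >= goal:
            return t
    return None


def always_right(position, velocity):
    return 2  # full throttle towards the goal


def with_momentum(position, velocity):
    return 2 if velocity >= 0 else 0  # push in the direction the car is already moving


print("always push right:", steps_to_goal(always_right))
print("push with momentum:", steps_to_goal(with_momentum))
```

Under these assumed constants, pushing right all the time never reaches the goal, while pushing in the direction of travel does. This is the behaviour the agent has to discover by itself.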

Now let us look at how the states and actions are represented:

**States:**


|Representation |  State |   Min value|  Max value  |
|---|---------------|------|-------|
| 0 |  position |  -1.2 |     0.6 | 
| 1 |   velocity|   -0.07|     0.07 | 


**Actions:**


|Representation |  Action| 
|---|---------------|
| 0 |  push left |
| 1 |   no push|  
| 2 |    push right|  



In [None]:
import gym

In [None]:
mountain_car = gym.make('MountainCar-v0')
# mountain_car = gym.make('CartPole-v0')

num_states = mountain_car.observation_space.shape[0]
num_actions = mountain_car.action_space.n

print("States:  {} Type: Continuous  Represented as: vector".format(num_states))
# action_space.n is already the number of actions; they are numbered from 0 to n - 1.
print("Actions: {} Type: Discrete    Represented as: scalar (integer from 0 to {})".format(num_actions, num_actions - 1))
print("Example of a state: ", mountain_car.observation_space.sample())
print("Example of an action: ", mountain_car.action_space.sample())

## 2- Build a random agent

**reset:** get an initial state of the environment.

**render:** show the mountain car environment (simulator) on the screen.

**step:** apply an action in the environment.

**sample:** sample a random action from the action space.

**close:** close the environment (simulator).


In [None]:
mountain_car.reset()
for _ in range(1000):
    mountain_car.render()
    action = mountain_car.action_space.sample()  # sample a random action
    mountain_car.step(action)  # apply it in the environment
mountain_car.close()

## 3- Going from random agent to skilled agent

### 3.1 Q-learning

As we mentioned before, the goal of Reinforcement learning is to train an agent to maximise a reward it obtains through interaction with an environment.

So, what is the reward that the agent wants to maximise?

It is the agent's objective: the discounted sum of the rewards it collects from time step $t$ onward, usually called the **return**. With a discount factor $\gamma \in [0, 1]$ and final time step $T$, it is formalised mathematically as:

$ G_t = r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \gamma^{3} r_{t+3} + \dots = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k} $
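
As a small worked example of the return, the sketch below sums a list of rewards with a discount factor. The function name `discounted_return` and the example numbers are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a list of rewards."""
    g = 0.0
    # Work backwards so each step is G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three steps with reward -1 each: -1 - 0.5 - 0.25
print(discounted_return([-1, -1, -1], gamma=0.5))  # -1.75
```

With `gamma=1.0` the same function returns the plain sum of rewards, and with `gamma=0.0` only the first reward counts.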

Q-learning assigns a value $Q(s_t,a_t)$ to every action in a given state. This value is learned with the following update rule, where $\alpha$ is the learning rate:


$ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [r_{t} + \max_{a} \gamma Q(s_{t+1},a) - Q(s_t,a_t) ] $

The learned value is an estimate of the expected return: $ E[G_t|\space s_t ,a_t]$

Looking for a proof of the previous formula? Go here.

### 3.2 Intuition behind Q-learning

Now, let us get the intuition behind this formula:


$ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [r_{t} + \max_{a} \gamma Q(s_{t+1},a) - Q(s_t,a_t) ] $

- $Q(s_t,a_t)$ is the value of action $a_t$ in state $s_t$.
- $\max_{a} \gamma Q(s_{t+1},a)$ is the discounted maximum action value over all the actions in the next state $s_{t+1}$.
- $r_{t} + \max_{a} \gamma Q(s_{t+1},a)$ is called the TD target.
- $[r_{t} + \max_{a} \gamma Q(s_{t+1},a) - Q(s_t,a_t)]$ is called the TD error.

The formula says: the new value of an action is its previous value, moved by a step of size $\alpha$ towards the TD target. The TD target is the reward for taking the action plus the discounted value of the best action from the next state onward.
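
A single update can be worked through numerically. In this sketch `td_target` denotes $r_t + \gamma \max_a Q(s_{t+1},a)$ and `td_error` denotes the bracketed term of the formula. The function name `q_update` and all the numbers are made up for illustration.

```python
def q_update(q_sa, reward, next_q_values, alpha, gamma):
    """Apply one Q-learning update and return the new value with its intermediate terms."""
    td_target = reward + gamma * max(next_q_values)
    td_error = td_target - q_sa
    new_q = q_sa + alpha * td_error
    return new_q, td_target, td_error


new_q, td_target, td_error = q_update(
    q_sa=-2.0,                        # current estimate Q(s_t, a_t)
    reward=-1.0,                      # r_t
    next_q_values=[-1.0, -3.0, -2.0], # Q(s_{t+1}, a) for each action a
    alpha=0.5,
    gamma=0.9,
)
# Roughly: TD target -1.9, TD error 0.1, new value -1.95
print(td_target, td_error, new_q)
```

The current estimate of -2.0 was slightly too pessimistic compared with the target of about -1.9, so the update moves it halfway (because `alpha=0.5`) towards the target.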

### 3.3 Ways to implement Q-learning

You can implement Q-learning using:

- Q-Table (dictionary, matrix, ...): each cell in the table holds the value of one action in one state.
- Function approximation: the Q-table grows with the number of states, so for large or continuous state spaces we approximate the action value with a function instead.
             
**Q-Table**

| State | Action 0 | Action 1 | Action 2 | Action 3 | ... | Action n |
|---|---|---|---|---|---|---|
| $s_0$ | $a_{00}$ | $a_{01}$ | $a_{02}$ | $a_{03}$ | ... | $a_{0n}$ |
| $s_1$ | $a_{10}$ | $a_{11}$ | $a_{12}$ | $a_{13}$ | ... | $a_{1n}$ |
| $s_2$ | $a_{20}$ | $a_{21}$ | $a_{22}$ | $a_{23}$ | ... | $a_{2n}$ |
| $s_3$ | $a_{30}$ | $a_{31}$ | $a_{32}$ | $a_{33}$ | ... | $a_{3n}$ |
| ... | ... | ... | ... | ... | ... | ... |
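
A dictionary-based Q-table for Mountain Car could look like the sketch below. Because the state is continuous, it first has to be mapped to a discrete cell. The bin count `BINS`, the helper names `discretize` and `update`, and the values of `alpha` and `gamma` are assumptions chosen for illustration. The state bounds are the ones listed in the state table of section 1.

```python
from collections import defaultdict

NUM_ACTIONS = 3
BINS = 20                # hypothetical resolution per state dimension
LOW = (-1.2, -0.07)      # minimum position and velocity
HIGH = (0.6, 0.07)       # maximum position and velocity


def discretize(state):
    """Map a continuous (position, velocity) pair to a tuple of bin indices."""
    indices = []
    for value, low, high in zip(state, LOW, HIGH):
        fraction = (value - low) / (high - low)
        indices.append(min(BINS - 1, max(0, int(fraction * BINS))))
    return tuple(indices)


# One row of action values per visited discrete state, initialised to zero.
q_table = defaultdict(lambda: [0.0] * NUM_ACTIONS)


def update(state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply the Q-learning update to the table cell for (state, action)."""
    s, s_next = discretize(state), discretize(next_state)
    td_target = reward + gamma * max(q_table[s_next])
    q_table[s][action] += alpha * (td_target - q_table[s][action])


update(state=(-0.5, 0.0), action=2, reward=-1.0, next_state=(-0.499, 0.001))
cell = discretize((-0.5, 0.0))
print(cell, q_table[cell])
```

With 20 bins per dimension the table has at most 20 x 20 rows. Finer bins give a more precise table but one that grows quickly, which is the motivation for function approximation.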
        
   

## 4- Merge Q-learning with Deep Learning
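
Before using the DQN agent, it may help to see what replacing the table with a function looks like. The sketch below is not DQN: it swaps the neural network for a much simpler linear model with one weight vector per action, and updates the weights with the same TD error as before. The feature choice and the names `features`, `q_values` and `update` are assumptions made for illustration.

```python
NUM_ACTIONS = 3


def features(state):
    """Hypothetical hand-picked features of a (position, velocity) state."""
    position, velocity = state
    return [1.0, position, velocity * 10.0]


# One weight vector per action; Q(s, a) is the dot product with the features.
weights = [[0.0, 0.0, 0.0] for _ in range(NUM_ACTIONS)]


def q_values(state):
    phi = features(state)
    return [sum(w * x for w, x in zip(weights[a], phi)) for a in range(NUM_ACTIONS)]


def update(state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Move Q(state, action) towards the TD target by adjusting the weights."""
    target = reward if done else reward + gamma * max(q_values(next_state))
    td_error = target - q_values(state)[action]
    for i, x in enumerate(features(state)):
        weights[action][i] += alpha * td_error * x
    return td_error


# Repeating the same transition shrinks the TD error for that state and action.
first_error = update((-0.5, 0.0), 2, -1.0, (-0.499, 0.001), done=False)
second_error = update((-0.5, 0.0), 2, -1.0, (-0.499, 0.001), done=False)
print(abs(second_error) < abs(first_error))
```

A neural network plays the same role as `q_values` here: it takes a state and outputs one value per action, but it learns its own features instead of relying on hand-picked ones.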

In [None]:
import numpy as np
import matplotlib.pyplot as plt
# Agent, run and Metric come from the local DQN module shipped with this practical
from DQN import Agent, Metric, run

In [None]:
# Create the Agent

agent = Agent(env=mountain_car)

In [None]:
# Evaluation settings
episode_number = 0
number_of_episodes = 400
metric = Metric(number_of_episodes)

In [None]:
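while episode_number < number_of_episodes:
    metric.reset()

    # Run the agent in the environment and record the result
    R, episode_length = run(agent, mountain_car)
    metric.add(R, episode_number, episode_length)
    episode_number += 1

    # Report progress every 100 episodes
    if episode_number % 100 == 0:
        metric.show()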

In [None]:
# Save the agent
agent.brain.model.save('Models/dqn.mod')

In [None]:
# Plot the result

plt.plot(metric.G)
plt.ylabel('Returns')
plt.xlabel('Number of episodes')

In [None]:
plt.plot(metric.mean_G_all)
plt.ylabel('Average of Returns ')
plt.xlabel('Number of episodes')

In [None]:
plt.plot(metric.episodes_length)
plt.ylabel('Episode Length')
plt.xlabel('Number of Episodes')
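
The internals of `Metric` are not shown in this practical. Judging only by the attribute name `mean_G_all`, the averaged curve may be a running mean of the returns. The helper below is a hypothetical stand-in for that idea, not the actual implementation.

```python
def running_mean(values):
    """Mean of all values seen so far, one entry per episode."""
    means = []
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means


# Made-up returns for three episodes
print(running_mean([-200, -100, -150]))  # [-200.0, -150.0, -150.0]
```

Averaging smooths out the noise between individual episodes, which makes it easier to compare the learning agent with the random agent.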

## 5- Going deeper into DQN: