<a href="https://colab.research.google.com/github/wikistat/AI-Frameworks/blob/master/IntroductionDeepReinforcementLearning/Deep_Q_Learning_CartPole.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [IA Frameworks](https://github.com/wikistat/AI-Frameworks) - Introduction to Deep Reinforcement Learning 

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 
<a href="http://wikistat.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/wikistat.jpg" width=400, style="max-width: 150px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo_imt.jpg" width=400,  style="float:right;  display: inline" alt="IMT"/> </a>
    
</center>

# Part 1b : Deep Q-Network on CartPole
The objectives of this notebook are the following:

* Discover the AI Gym *CartPole* environment.
* Implement DQN to solve CartPole (a classic control task: keep a pole balanced on a moving cart).
* Implement an Experience Replay Buffer to improve performance.

# Files & Data (Google Colab)

If you're running this notebook on Google Colab, you do not have access to the `solutions` folder you get by cloning the repository locally. 

The following lines will allow you to build the folders and the files you need for this TP.

**WARNING 1** Do not run these lines locally.
**WARNING 2** The magic command `%load` does not work on Google Colab; you will have to copy-paste the solution into the notebook.

In [None]:
! mkdir solutions
! wget -P solutions https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/solutions/push_cart_pole.py
! wget -P solutions https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/solutions/DNN_class.py
! wget -P solutions https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/solutions/DQN_cartpole_class.py
! wget -P solutions https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/solutions/play_cartpole_with_dnn.py
! wget -P solutions https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/solutions/DQN_cartpole_memory_replay_class.py

! wget -P . https://github.com/wikistat/AI-Frameworks/raw/master/IntroductionDeepReinforcementLearning/experience_replay.py   

# Import libraries

In [None]:
import numpy as np
from datetime import datetime
import collections

# Tensorflow
import tensorflow.keras.models as km
import tensorflow.keras.layers as kl
import tensorflow.keras.optimizers as ko
import tensorflow.keras.backend as K

# To plot figures and animations
import matplotlib.animation as animation
import matplotlib.pyplot as plt
from IPython.display import HTML
import seaborn as sb
sb.set_style("whitegrid")


# Gym Library
import gym

The following functions build a video from a list of images. <br>
They will be used to build videos of the games you play.

In [None]:
def update_scene(num, frames, patch):
    patch.set_data(frames[num])
    return patch,

def plot_animation(frames, repeat=False, interval=400):
    plt.close()  # or else nbagg sometimes plots in the previous cell
    fig = plt.figure()
    patch = plt.imshow(frames[0])
    plt.axis('off')
    return animation.FuncAnimation(fig, update_scene, fargs=(frames, patch), frames=len(frames), repeat=repeat, interval=interval)

# AI Gym Library
<a href="https://gym.openai.com/" ><img src="https://gym.openai.com/assets/dist/home/header/home-icon-54c30e2345.svg" style="float:left; max-width: 120px; display: inline" alt="OpenAI Gym"/></a> 
<br>
In this notebook we will be using [OpenAI gym](https://gym.openai.com/), a great toolkit for developing and comparing Reinforcement Learning algorithms. <br> It provides many environments for your learning *agents* to interact with.

# A simple environment: the Cart-Pole

## Description
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.

### Observation

Num | Observation | Min | Max
---|---|---|---
0 | Cart Position | -2.4 | 2.4
1 | Cart Velocity | -Inf | Inf
2 | Pole Angle | ~ -41.8&deg; | ~ 41.8&deg;
3 | Pole Velocity At Tip | -Inf | Inf

### Actions

Num | Action
--- | ---
0 | Push cart to the left
1 | Push cart to the right

Note: The amount the velocity is reduced or increased is not fixed, as it depends on the angle the pole is pointing. This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it.

### Reward
Reward is 1 for every step taken, including the termination step

### Starting State
All observations are assigned a uniform random value between ±0.05

### Episode Termination
1. Pole Angle is more than ±12°
2. Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)
3. Episode length is greater than 200

### Solved Requirements
Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.

The description above is part of the official description of this environment. Read the full description [here](https://github.com/openai/gym/wiki/CartPole-v0).
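The "solved" criterion can be checked with a few lines of NumPy. This is only a sketch: the helper name `is_solved` and its parameters are hypothetical and are not part of Gym or of this notebook.

```python
import numpy as np

def is_solved(episode_scores, window=100, threshold=195.0):
    """Return True if the mean of the last `window` scores reaches `threshold`.

    Hypothetical helper: it mirrors the "average reward >= 195.0 over
    100 consecutive trials" rule quoted above.
    """
    if len(episode_scores) < window:
        # Not enough episodes played yet to evaluate the criterion
        return False
    return float(np.mean(episode_scores[-window:])) >= threshold

print(is_solved([200.0] * 100))  # enough episodes, average above threshold
print(is_solved([200.0] * 99))   # one episode short of the window
print(is_solved([190.0] * 100))  # average below threshold
```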

The following command will load the `CartPole` environment.

In [None]:
env = gym.make("CartPole-v0")

The `reset` command initializes the environment and returns the first observation, which is a 1D array of size 4.


In [None]:
obs = env.reset()
env.observation_space, obs

**Q:** What are the four values above?

The `render` command generates an image of the environment, which is here 400x600 pixels with RGB channels. 

For the `CartPole` environment, the `render` command also opens another window, which we close directly with the `env.close` command. Leaving it open can produce disturbing behaviour.

In [None]:
img = env.render(mode = "rgb_array")
env.close()
print("Environment is a %dx%dx%d image" %img.shape)

The environment can then easily be displayed with a matplotlib function. 

In [None]:
plt.imshow(img)
_ = plt.axis("off")

The action space is composed of two actions: push to the left (0) and push to the right (1).

In [None]:
env.action_space

The `step` function applies one of these actions and returns several pieces of information: 

* The new observation after applying this action
* The reward this action has produced
* A boolean that indicates whether the experience is over or not.
* Extra information that depends on the environment (the CartPole environment does not provide anything).

Let's push the cart pole to the left!

In [None]:
obs, reward, done, info = env.step(0)
print("New observation : %s" %str(obs))
print("Reward : %s" %str(reward))
print("Is the experience over? : %s" %str(done))
print("Extra information : %s" %str(info))

In [None]:
img = env.render(mode = "rgb_array")
env.close()
plt.imshow(img)
axs =  plt.axis("off")

**Q**: What can you see? Do the output values seem normal to you?

**Exercise**: Reset the environment, and push the cart to the left until the experience is over, then display the final environment. 
**Q**: Why does the episode end? 

In [None]:
# %load solutions/push_cart_pole.py

# Q network

In **Q-learning** all the *Q-values* are stored in a *Q-table*. 
The optimal values can be learned by playing games and updating the Q-table with the following formula, where $\alpha$ is the learning rate and $\gamma$ the discount factor.

$$target = R(s,a,s')+\gamma \max\limits_{a'}Q_k(s',a')$$
$$Q_{k+1}(s,a)\leftarrow(1-\alpha)Q_k(s,a)+\alpha[target]$$
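The tabular update can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a small discrete problem with integer state and action indices; the function name `q_table_update` and the toy table are hypothetical. CartPole itself has continuous observations, so it would first need to be discretised before such a table could be used.

```python
import numpy as np

def q_table_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update for the transition (s, a, reward, s_next)."""
    # target = R(s, a, s') + gamma * max_a' Q_k(s', a')
    target = reward + gamma * np.max(q_table[s_next])
    # Q_{k+1}(s, a) <- (1 - alpha) * Q_k(s, a) + alpha * target
    q_table[s, a] = (1 - alpha) * q_table[s, a] + alpha * target
    return q_table

# Toy table with 3 states and 2 actions (values chosen for illustration only)
q = np.zeros((3, 2))
q[1] = [0.5, 2.0]
q = q_table_update(q, s=0, a=1, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(q[0, 1])  # 0.5 * 0 + 0.5 * (1.0 + 0.9 * 2.0)
```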

If the number of state-action combinations is too large, the memory and computation requirements for the *Q-table* will be too high.

Hence, in **Deep Q-learning** we use a function to approximate the *Q-values* rather than storing them all. <br>
As the inputs of the function, i.e. the *observations*, are vectors of four values, a simple **DNN** will be enough to approximate the Q-table.

Later, we will generate targets from experiences and train this **DNN**.

$$target = R(s,a,s')+\gamma \max\limits_{a'}Q_k(s',a')$$
$$\theta_{k+1} \leftarrow \theta_k - \alpha\nabla_{\theta}\mathbb{E}_{s'\sim P(s'|s,a)} [(Q_{\theta}(s,a)-target(s'))^2]_{\theta=\theta_k} $$

The `DNN` class below defines the architecture of this *neural network*.

**Exercise** 

The architecture of the *DNN* has been set for you.<br>

However, the shape of the input, as well as the number of neurons and the activation function of the last layer, are not filled in.<br>
Fill in the gaps so that this network can be used to approximate *Q-values*.

In [None]:
class DNN:
    def __init__(self):

        self.lr = 0.001

        self.model = km.Sequential()
        self.model.add(kl.Dense(150, input_dim=??, activation="relu"))
        self.model.add(kl.Dense(120, activation="relu"))
        self.model.add(kl.Dense(??, activation=??))
        self.model.compile(loss='mse', optimizer=ko.Adam(learning_rate=self.lr))

In [None]:
# %load solutions/DNN_class.py

# DEEP Q Learning on *Cartpole*

The objective of this section is to implement a **Deep Q-learning** algorithm that will be able to solve the CartPole environment.

For that, 2 Python classes will be required:

* `DNN`: A class that provides the function approximating the Q-values
* `DQN`: A class that trains the Q-network


All the instructions of this section are in this notebook below. 

However, you can either 
* work with the scripts DQN_cartpole.py and DQN_cartpole_test.py that can be found in the `IntroductionDeepReinforcementLearning` folder
* OR work with the code in the cells of this notebook. 

### DQN Class


The `DQN` class contains the implementation of the **Deep Q-Learning** algorithm. The code is incomplete and you will have to fill it in!

**GENERAL INSTRUCTIONS**:

* Read the init of the `DQN` class. 
    * Various variables are set with their definitions; make sure you understand all of them.
    * The *game environment*, the *memory of the experiences* and the *DNN Q-network* are initialised.
* Read the `train` method. It contains the main code corresponding to the **pseudo code** below. YOU DO NOT HAVE TO MODIFY IT! But make sure you understand it.
* The `train` method uses methods that are not implemented. 
    * You will have to complete the code of 4 functions (read the instructions of each exercise below).
    * After the cell of the `DQN` class code below there are **test cells** for each of these exercises. <br>
    The matching test cell should be executed after each exercise. It checks that the function you implemented takes inputs and returns outputs in the desired format. <br> DO NOT MODIFY these cells. They will work if your code is good. <br> **Warning** The test cells do not guarantee that your code is correct. They just test that inputs and outputs are in the right format.


#### Pseudo code 
*We will consider that the expected goal is reached if we achieve the max score (200 steps without falling) over ten games.*

While the expected *goal* has not been reached and fewer than *max_episode* episodes have been played:
* Start a new episode and, while the episode is not done:
    * At each step:
        * Run one step of the episode (**Exercises 2 & 3**)
        * Save the experience in memory (**Exercise 1**)
        * If we have stored enough experiences in memory to train on a batch:
            * Train the model over a batch of targets (**Exercise 4**)
            * Decrease the probability of playing randomly
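To get a feeling for the last step of the loop, the sketch below counts how many training steps it takes for the random-play probability to fall below its floor. It is a standalone illustration: the function name `steps_until_floor` is hypothetical, and its defaults copy the values of `prob_random`, `prob_random_end` and `prob_random_decay` set in the `DQN` class.

```python
def steps_until_floor(prob_random=1.0, prob_random_end=0.01, prob_random_decay=0.996):
    """Count the multiplicative decay steps applied before prob_random stops decreasing."""
    steps = 0
    # Same rule as in the training loop: decay only while above the floor
    while prob_random > prob_random_end:
        prob_random *= prob_random_decay
        steps += 1
    return steps

n_steps = steps_until_floor()
print(n_steps)
```

With these defaults the probability stops decreasing after 1149 training steps. In the training loop the decay only starts once the memory holds `batch_size` experiences, so exploration lasts slightly longer in terms of environment steps.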


    
**Exercise 1**:  Implement `save_experience`<br>
&nbsp;&nbsp;&nbsp;&nbsp; This function saves each experience produced by a step in the `memory` of the class.<br> 
&nbsp;&nbsp;&nbsp;&nbsp; We do not use the experience replay buffer in this part, so you just have to keep the last `batch_size` experiences in order to use them at the next train step
(https://keras.io/api/layers/)
    
**Exercise 2**:  Implement `choose_action`<br>
&nbsp;&nbsp;&nbsp;&nbsp; This method chooses an action in *exploration* or *exploitation* mode at random: with probability `prob_random` it plays a random action, otherwise it plays the action with the highest predicted Q-value.<br>

**Exercise 3**:  Implement `run_one_step` <br>
&nbsp;&nbsp;&nbsp;&nbsp; This method:<br>
&nbsp;&nbsp;&nbsp;&nbsp; -> Chooses an action<br>
&nbsp;&nbsp;&nbsp;&nbsp; -> Applies the action to the environment.<br>
&nbsp;&nbsp;&nbsp;&nbsp; -> Returns all the elements of the experience

**Exercise 4**:  Implement `generate_target_q`<br>
This method is used within the `train_one_step` method (which is already implemented). `train_one_step`:<br>
&nbsp;&nbsp;&nbsp;&nbsp; -> Generates a batch of data for training from the `memory` (from the `experience_replay` in the second part) <br>
&nbsp;&nbsp;&nbsp;&nbsp; -> Generates the targets from this batch using `generate_target_q` <br>
&nbsp;&nbsp;&nbsp;&nbsp; -> Trains the model using these targets. <br>
<br> 
`generate_target_q` is not implemented, so you have to do it!<br>
You have to generate targets according to the formula below <br>

$$target = R(s,a,s')+\gamma \max\limits_{a'}Q_k(s',a';\theta) $$
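The formula can be illustrated on plain arrays, without any neural network. The sketch below is not the notebook's solution: the function name `build_targets` is hypothetical and the Q-values are passed in as arrays instead of being predicted by the `DNN`. It also makes two assumptions that the formula does not state: the bootstrap term is dropped when the episode is over (`done`), and the targets of the actions that were not played are copied from the current predictions so that they contribute no error to the `mse` loss.

```python
import numpy as np

def build_targets(q_current, q_next, actions, rewards, dones, gamma=0.99):
    """Build a (batch, n_actions) target array from predicted Q-values.

    q_current, q_next: predicted Q-values for the states s and s'.
    """
    targets = np.array(q_current, dtype=float, copy=True)
    rewards = np.asarray(rewards, dtype=float)
    # Assumption: no future reward is expected after a terminal step
    not_done = 1.0 - np.asarray(dones, dtype=float)
    bootstrapped = rewards + gamma * np.max(q_next, axis=1) * not_done
    # Only the entry of the action actually played receives the new target
    targets[np.arange(len(targets)), np.asarray(actions)] = bootstrapped
    return targets

targets = build_targets(
    q_current=[[1.0, 2.0], [3.0, 4.0]],
    q_next=[[0.5, 1.5], [2.0, 1.0]],
    actions=[0, 1],
    rewards=[1.0, 1.0],
    dones=[0, 1],
    gamma=0.9,
)
print(targets)  # first row: 1 + 0.9 * 1.5 replaces Q(s, 0); second row: done, so target = reward
```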


In [None]:
class DQN:
    """ Implementation of deep q learning algorithm """

    def __init__(self):

        self.prob_random = 1.0  # Probability to play random action
        self.y = .99  # Discount factor
        self.batch_size = 64  # How many experiences to use for each training step
        self.prob_random_end = .01  # Ending chance of random action
        self.prob_random_decay = .996  # Decay factor applied to prob_random after each training step
        self.max_episode = 300  # Max number of episodes you are allowed to play to train the agent
        self.expected_goal = 200  # Expected goal

        self.dnn = DNN()
        self.env = gym.make('CartPole-v0')

        self.memory = []

        self.metadata = [] # we will store here info score, at the end of each episode


    def save_experience(self, experience):
        #TODO
        return None

    def choose_action(self, state, prob_random):
        #TODO
        return action

    def run_one_step(self, state):
        #TODO
        return state, action, reward, next_state, done

    def generate_target_q(self, train_state, train_action, train_reward, train_next_state, train_done):
        #TODO
        return target_q

    def train_one_step(self):

        batch_data = self.memory
        train_state = np.array([i[0] for i in batch_data])
        train_action = np.array([i[1] for i in batch_data])
        train_reward = np.array([i[2] for i in batch_data])
        train_next_state = np.array([i[3] for i in batch_data])
        train_done = np.array([i[4] for i in batch_data])

        # These lines remove useless dimension of the matrix
        train_state = np.squeeze(train_state)
        train_next_state = np.squeeze(train_next_state)

        # Generate target Q
        target_q = self.generate_target_q(
            train_state=train_state,
            train_action=train_action,
            train_reward=train_reward,
            train_next_state=train_next_state,
            train_done=train_done
        )

        loss = self.dnn.model.train_on_batch(train_state, target_q)
        return loss

    def train(self):
        scores = []
        for e in range(self.max_episode):
            # Init New episode
            state = self.env.reset()
            state = np.expand_dims(state, axis=0)
            episode_score = 0
            while True:
                state, action, reward, next_state, done = self.run_one_step(state)
                self.save_experience(experience=[state, action, reward, next_state, done])
                episode_score += reward
                state = next_state
                if len(self.memory) >= self.batch_size:
                    self.train_one_step()
                    if self.prob_random > self.prob_random_end:
                        self.prob_random *= self.prob_random_decay
                if done:
                    now = datetime.now()
                    dt_string = now.strftime("%d/%m/%Y %H:%M:%S")
                    self.metadata.append([now, e, episode_score, self.prob_random])
                    print(
                        "{} - episode: {}/{}, score: {:.1f} - prob_random {:.3f}".format(dt_string, e, self.max_episode,
                                                                                         episode_score,
                                                                                         self.prob_random))
                    break
            scores.append(episode_score)

            # Average score of the last 10 episodes
            means_last_10_scores = np.mean(scores[-10:])
            if means_last_10_scores == self.expected_goal:
                print('\n Task Completed! \n')
                break
            print("Average over last 10 episode: {0:.2f} \n".format(means_last_10_scores))
        print("Maximum number of episode played: %d" % self.max_episode)

**Test `save_experience`**
* Append elements to the `memory`.
* Never keep more than `batch_size` elements; keep the last `batch_size`.

In [None]:
dqn = DQN()
dqn.batch_size=2
dqn.save_experience(1)
assert dqn.memory == [1]
dqn.save_experience(2)
assert dqn.memory == [1,2]
dqn.save_experience(3)
assert dqn.memory == [2,3]

**Test `choose_action`**

This test can't be considered a real test. <br>
Indeed, if the actions are played randomly we can't expect a fixed result. 

However, if your function is implemented correctly these tests should work most of the time:

* If `prob_random` = 1 -> play randomly
    * Over 100 plays, each action should appear many times.
* If `prob_random` = 0 -> play in exploit mode
    * The same action is chosen all the time.
* If `prob_random` = 0.5 -> play both exploration and exploit mode randomly. 
    * All actions should be seen, but the action chosen in exploit mode is always the same and should be chosen more often.
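The behaviour described above is usually called an epsilon-greedy policy. The sketch below shows the idea on a plain vector of Q-values; it is not the `choose_action` method itself (which has to query the `DNN`), and the function name `epsilon_greedy` and the explicit random generator argument are hypothetical choices made for this illustration.

```python
import numpy as np

def epsilon_greedy(q_values, prob_random, rng):
    """Pick a random action with probability prob_random, else the best-valued one."""
    if rng.random() < prob_random:
        # Exploration: uniform choice among all actions
        return int(rng.integers(len(q_values)))
    # Exploitation: action with the highest Q-value
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.2, 0.8])
greedy = [epsilon_greedy(q_values, 0.0, rng) for _ in range(100)]
explore = [epsilon_greedy(q_values, 1.0, rng) for _ in range(100)]
print(set(greedy), set(explore))
```

Casting to `int` keeps the returned action a plain Python integer, which is what the test on `run_one_step` below expects.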

In [None]:
dqn = DQN()
state = np.expand_dims(dqn.env.reset(), axis=0)
# Random action if prob random is equal to one
actions = [dqn.choose_action(state=state, prob_random=1) for _ in range(100)]
count_action = collections.Counter(actions)
print(count_action)
assert count_action[0]>35
assert count_action[1]>35
# Best action according to model if prob_random is 0
actions = [dqn.choose_action(state=state, prob_random=0) for _ in range(100)]
count_action = collections.Counter(actions)
print(count_action)
assert(len(set(actions)))==1
main_action = list(set(actions))[0]
# Mix of exploration and exploitation if prob_random is 0.5
actions = [dqn.choose_action(state=state, prob_random=0.5) for _ in range(100)]
count_action = collections.Counter(actions)
assert(len(set(actions)))==2
print(count_action)
assert sorted(count_action.items(), key=lambda x : x[1])[-1][0]==main_action

**Test `run_one_step`**

This method plays one step of an episode.

The method returns all the elements of an experience, i.e.:
 * A *state*: a vector of size (1,4)
 * An *action*: an integer
 * A *reward*: a float
 * The *next_state*: a vector of size (1,4)
 * *done*: a boolean indicating whether the episode is over


In [None]:
dqn = DQN()
state = np.expand_dims(dqn.env.reset(), axis=0)
state, action, reward, next_state, done  = dqn.run_one_step(state)
assert state.shape == (1, 4)
assert type(action) is int
assert type(reward) is float
assert next_state.shape == (1, 4)
assert type(done) is bool

**Test `generate_target_q`**

This method generates the targets for the Q-values.

In this test we set the `batch_size` value to 2. Hence the function takes as input: 
* train_state : An array of size (2,4)
* train_action : An array of size (2,)
* train_reward  : An array of size (2,)
* train_next_state : An array of size (2,4)
* train_done : An array of size (2,)

And returns as output an array of size (2,2), which holds a target for each input of the batch.


In [None]:
dqn = DQN()
dqn.batch_size=2
state = np.expand_dims(dqn.env.reset(), axis=0)
target_q = dqn.generate_target_q(
    train_state = np.vstack([state,state]),
    train_action = [0,0],
    train_reward = [1.0,2.0],
    train_next_state = np.vstack([state,state]),
    train_done = [1, 1]
)

assert target_q.shape == (2,2)

Here is the solution of the **DQN class**

In [None]:
# %load solutions/DQN_cartpole_class.py

Let's now train the model! (The training can be unstable)

In [None]:
dqn = DQN()
dqn.train()

Whether or not your DQN reached the target goal, we would like to see it playing a game!
**Exercise** Play a game exploiting the DNN trained with deep Q-learning and display a video of this game to check how it performs!

In [None]:
# %load solutions/play_cartpole_with_dnn.py

The code below displays the evolution of the score of each episode played during training.

In [None]:
fig = plt.figure(figsize=(20,6))
ax = fig.add_subplot(1,1,1)
ax.plot(list(range(len(dqn.metadata))),[x[2] for x in dqn.metadata])
ax.set_yticks(np.arange(0,210,10))    
ax.set_xticks(np.arange(0,175,25))    
ax.set_title("Score/Length of episode over Iteration without Memory Replay", fontsize=20)
ax.set_xlabel("Number of iteration", fontsize=14)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
ax.set_ylabel("Score/Length of episode", fontsize=16)

You might be lucky, but it is highly likely that the training is quite unstable. 

As seen in the course, this might be due to the fact that the experiences on which the DNN is trained are not i.i.d.
Let's try again with an **Experience Replay Buffer**.

# DQN with Experience Replay Buffer


The **Experience Replay Buffer** is where all the agent's experiences will be stored and from which *batches* will be generated in order to train the *Q-network*.

**Exercise** Let us implement an `ExperienceReplay` class with the following characteristics:

The `buffer_size` argument represents the number of elements that are kept in memory (in the `buffer`). <br>
Even if 10 million games have been played, the `ExperienceReplay` will keep only the last `buffer_size` elements in memory. <br>
Hence, at the beginning, the first batches of targets will be composed of randomly played experiences. During training, the probability that a batch of targets is composed of experiences played in exploitation mode will increase.

The `add` method adds elements to the `buffer`.

The `sample` method generates a random sample of `size` elements.
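The "keep only the last `buffer_size` elements" behaviour can be seen with a `collections.deque` from the standard library. The sketch below is an alternative illustration, not the `ExperienceReplay` class you are asked to write: the class name `DequeReplay` is hypothetical, and it swaps the Python list for a bounded deque so that old elements are dropped automatically.

```python
import random
from collections import deque

class DequeReplay:
    """Bounded buffer sketch: a deque with maxlen drops the oldest items itself."""

    def __init__(self, buffer_size=50000):
        self.buffer = deque(maxlen=buffer_size)

    def add(self, experiences):
        # extend() appends each element; the oldest ones fall out when full
        self.buffer.extend(experiences)

    def sample(self, size):
        # Uniform sampling without replacement among the stored elements
        return random.sample(list(self.buffer), size)

replay = DequeReplay(buffer_size=10)
replay.add(range(100))
print(list(replay.buffer))  # only the 10 most recent integers remain
batch = replay.sample(5)
print(batch)
```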

In [None]:
class ExperienceReplay:
    def __init__(self, buffer_size=50000):
        """ Data structure used to hold game experiences """
        # Buffer will contain [state,action,reward,next_state,done]
        self.buffer = []
        self.buffer_size = buffer_size

    def add(self, experiences):
        """ Adds list of experiences to the buffer """
        # TODO

    def sample(self, size):
        """ Returns a sample of experiences from the buffer """
        # TODO

In [None]:
# %load experience_replay.py

Let's see a simple example on how it works.

In [None]:
# Instantiate an experience replay buffer with buffer_size 10
experience_replay = ExperienceReplay(buffer_size=10)
# Add a list of 100 integers to the buffer
experience_replay.add(list(range(100)))
# Check that it keeps only the last 10 elements
print(experience_replay.buffer)
# Randomly sample 5 element from the buffer
sample = experience_replay.sample(5)
print(sample)

**Exercise** Now that you have implemented the `ExperienceReplay` class, modify the `DQN` class you implemented above to use it as the memory instead of a simple Python list, and run the model again.

In [None]:
# %load solutions/DQN_cartpole_memory_replay_class.py

Let's now train the model! (That should be much more stable)

In [None]:
dqn = DQN()
dqn.train()

And once again let's play a game

In [None]:
state = env.reset()
frames = []
num_step=0
done=False
while not done:
    action=np.argmax(dqn.dnn.model.predict(np.expand_dims(state, axis=0)),axis=1)[0]
    next_state, reward, done, _ = env.step(action)
    frames.append(env.render(mode = "rgb_array"))
    state=next_state
    num_step+=1
HTML(plot_animation(frames).to_html5_video())

And observe the evolution of the score over the iterations

In [None]:
fig = plt.figure(figsize=(20,6))
ax = fig.add_subplot(1,1,1)
ax.plot(list(range(len(dqn.metadata))),[x[2] for x in dqn.metadata])
ax.set_yticks(np.arange(0,210,10))    
ax.set_xticks(np.arange(0,175,25))    
ax.set_title("Score/Length of episode over Iteration with Memory Replay", fontsize=20)
ax.set_xlabel("Number of iteration", fontsize=14)
plt.yticks(fontsize=12)
plt.xticks(fontsize=12)
ax.set_ylabel("Score/Length of episode", fontsize=16)

**Q**: What can you say about the influence of the experience replay buffer over this training?