**Course material** : 
* Sutton & Barto, *Chapter 3 : Finite Markov Decision Processes (MDP)*
* Sutton & Barto, *Chapter 4 : Dynamic Programming* 

# Discrete environment and MDP


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import trange
import gym
import random

# 2.1 Introduction

Some definitions :
* **policy ($\pi$)** : defines the learning agent's way of behaving in a given state at a given time
* **state ($s_t \in \mathcal{S}$)** : numeric representation of what the agent observes at a particular point in time in a given environment
* **action ($a_t \in \mathcal{A}(s)$)** : the input the agent provides to the environment; it is chosen by applying a policy to the current state
* **reward ($r_t \in \mathcal{R}(s,a) \subset \mathbb{R}$)** : signal returned by the environment reflecting how well the agent is performing

MDPs are a formalization of sequential decision making where actions influence both immediate rewards and future rewards through subsequent states.

![title](img/schema.jpg)

In a finite MDP, $\mathcal{S}$, $\mathcal{A}$ and $\mathcal{R}$ all have a finite number of elements. The random variables $S_t$ and $R_t$ then have discrete probability distributions that depend only on the previous state and action. The probability of ending in state $s'$ and getting a reward $r$ at step $t$ is given by :

\begin{cases} 
p(s',r|s,a) = p(S_t=s', R_t=r|S_{t-1}=s, A_{t-1}=a) \\  
\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s',r|s,a) = 1 \\
p: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0,1] 
\end{cases}

The transition probability of ending in state $s'$ from state $s$ is given by : 

$$p(s'|s,a) = \sum_{r \in \mathcal{R}} p(s',r|s,a)$$

You can also compute the expected state-action reward $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, which is the reward you can expect after taking action $a$ in state $s$ :

$$r(s,a) = \mathbb{E}[R_t|S_{t-1}=s,A_{t-1}=a]= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s',r|s,a)$$
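
The two marginalisations above can be checked numerically. The sketch below assumes a made-up joint table `p_joint` for one fixed pair $(s,a)$; the numbers and variable names are purely illustrative and do not come from the schema of this notebook.

```python
import numpy as np

# Hypothetical dynamics for ONE fixed (s, a) pair.
# Rows index the next state s', columns index the reward value r.
rewards = np.array([0.0, 1.0, 5.0])      # possible reward values (assumed)
p_joint = np.array([[0.1, 0.2, 0.0],     # p(s'=0, r | s, a)
                    [0.3, 0.0, 0.1],     # p(s'=1, r | s, a)
                    [0.0, 0.2, 0.1]])    # p(s'=2, r | s, a)

# The joint distribution must sum to 1 over all (s', r).
assert np.isclose(p_joint.sum(), 1.0)

# p(s' | s, a): sum the joint distribution over rewards.
p_next_state = p_joint.sum(axis=1)

# r(s, a): sum over s' first, then weight each reward value.
expected_reward = (p_joint.sum(axis=0) * rewards).sum()

print(p_next_state, expected_reward)
```

The same two lines apply to every $(s,a)$ pair if the table is stored with two extra leading axes.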

Here is a representation of a simple MDP with 3 states and 2 actions :
![title](img/mdp_1.png)

### Exercise 2.1 :
**1.a)** Using the previous schema, compute :

$$p(S_0|S_1,a_0)=$$
$$p(S_0|S_2,a_1)=$$
$$\mathbb{E}[R_t|S_{t-1}=S_1,A_{t-1}=a_0]=$$
$$\mathbb{E}[R_t|S_{t-1}=S_2,A_{t-1}=a_1]=$$

**1.b)** Actions are now taken randomly : $p(a_0|S) = 1-p(a_1|S)=0.3$, compute :

$$\mathbb{E}[R_t|S_{t-1}=S_1]=$$
$$\mathbb{E}[R_t|S_{t-1}=S_2]=$$

**2.)** You want an agent to learn to play this game :
<img src="img/mario.png" alt="Drawing" style="width: 300px;"/>
* How would you define a state in this game ?
* How many actions can you pick ?
* How many states are possible ?
* What kind of reward function would you set ?
* Can we represent the game as a finite MDP ?

# 2.2 Goals and rewards

In reinforcement learning, we want to maximize the cumulative reward over the long run, so we need to take future rewards into account. Let $R_{t+1}, R_{t+2}, R_{t+3},\ldots$ denote the rewards received after time step $t$. We want to maximize a function $G_t$ of this sequence, called the **return**. In the simplest case :

$$G_t = \sum_{k = 0}^\infty R_{t+k+1}$$

In this form, we make no trade-off between immediate and long-term rewards, as they are all weighted equally. In practice (depending on the task) this function is often modified by adding a **discount factor** $\gamma \in [0,1]$, usually with a value close to $1$ : 

$$G_t = \sum_{k = 0}^\infty \gamma^k R_{t+k+1}$$

In a continuing task with a constant reward per time step (e.g. a survival game), the undiscounted sum diverges, whereas the discounted sum is finite for $\gamma < 1$ and is therefore easier to handle. We can also rewrite $G_t$ recursively as :

$$G_t = R_{t+1} + \gamma G_{t+1}$$
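
As a quick illustration of the recursion, here is a small sketch that computes all returns of a finite episode by walking backwards through the rewards and compares $G_0$ with the direct sum. The reward list and the helper names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """G_0 computed directly from the definition sum_k gamma^k * R_{k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def returns_backward(rewards, gamma):
    """All returns [G_0, ..., G_{T-1}] using G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0                      # the return after the last step is 0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


rewards = [1.0, 1.0, 1.0, 1.0]   # hypothetical episode with constant reward
gamma = 0.9
all_returns = returns_backward(rewards, gamma)
print(all_returns[0], discounted_return(rewards, gamma))
```

With a constant reward of $1$ and $\gamma = 0.9$, the return can never exceed $1/(1-\gamma) = 10$, however long the episode lasts.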


# 2.3 State and state-action value

Whereas in the bandit problem we estimated $q_*(a)$, in MDPs we want to estimate $q_*(s,a)$, called the **optimal state-action value** (or **q-value**), or $v_*(s)$, called the **optimal state value** :
* $v_*(s)$ : the sum of discounted future rewards the agent can expect on average after it reaches state $s$, assuming it acts optimally
* $q_*(s,a)$ : the sum of discounted future rewards the agent can expect on average after it reaches state $s$ and takes action $a$, assuming it acts optimally afterwards

### 2.3.1 State Value and optimal State Value function

Solving a reinforcement learning problem means finding a policy $\pi$ that collects a lot of reward in the long run. To do so, we estimate a **value function** (or state value function) that measures how good it is to be in a given state. To discriminate between states, we compare them in terms of expected future rewards. This expectation depends on the way the agent acts, i.e. on its policy $\pi$. Mathematically, the value function is :

$$
\begin{align}
v_\pi (s) & = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1}|S_t=s] \quad \forall s \in \mathcal{S} \\
& = \sum_a \pi(a|s) \sum_{s',r}p(s',r|s,a)[r + \gamma v_\pi(s')] 
\end{align}
$$

This fundamental equation is called the **Bellman equation** : it relates the value of the current state to the values of its successor states. If $V_t(s)$ denotes the current estimate of $v_\pi (s)$, it can be refined iteratively (**iterative policy evaluation**) : 

$$V_{t+1}(s) \gets \sum_a \pi(a|s) \sum_{s',r}p(s',r|s, a)[r + \gamma V_{t}(s')]$$
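
A minimal sketch of this iterative update, assuming a tiny hypothetical MDP stored as arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (reward attached to each transition). The array layout, the numbers and the function name are illustrative choices, not part of the exercises.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])      # P[s, a, s']
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, -1.0]]])     # R[s, a, s']
pi = np.full((2, 2), 0.5)                     # uniform random policy pi[s, a]
gamma = 0.9


def policy_evaluation(P, R, pi, gamma, tol=1e-10, max_iter=10_000):
    """Repeat the Bellman backup for v_pi until the values stop changing."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        V_new = (pi * q).sum(axis=1)          # average over actions with pi
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


V_pi = policy_evaluation(P, R, pi, gamma)
print(V_pi)
```

Because the Bellman equation is linear in $v_\pi$, the result can also be cross-checked against a direct linear solve on such a small problem.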

In RL we want to compute the **optimal state value function** $v_* (s)$, which assumes the agent acts optimally. The value function is slightly modified to take this into account :

$$
\begin{align}
v_* (s) & = \underset{a}{\operatorname{max}}\, \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1})|S_t=s, A_t=a] \quad \forall s \in \mathcal{S} \\
& = \underset{a}{\operatorname{max}} \sum_{s',r}p(s',r|s, a)[r + \gamma v_*(s')] 
\end{align}
$$

It can be estimated iteratively using :

$$V_{t+1}(s) \gets \underset{a}{\operatorname{max}} \sum_{s',r}p(s',r|s, a)[r + \gamma V_{t}(s')]$$

This algorithm is called the **value iteration algorithm**.
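
The only change with respect to policy evaluation is that the average over actions becomes a maximum. The sketch below assumes the same kind of hypothetical array layout `P[s, a, s']` and `R[s, a, s']`; the numbers are made up.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])      # P[s, a, s']
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, -1.0]]])     # R[s, a, s']
gamma = 0.9


def value_iteration_arrays(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        q = (P * (R + gamma * V[None, None, :])).sum(axis=2)   # q[s, a]
        V_new = q.max(axis=1)                                  # best action
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


V_star = value_iteration_arrays(P, R, gamma)
print(V_star)
```

At convergence, applying one more backup leaves `V_star` unchanged, which is exactly what the optimality equation requires.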

### Exercise 2.2 : 
An agent moves in a grid world :
* 4 possible actions : left, up, right, down
* moving outside the grid gives $-1$ and cancels the move
* if the agent is in A, it gets $+10$ and is sent to A'
* if the agent is in B, it gets $+5$ and is sent to B'
* otherwise, moving gives $0$

<img src="img/gridworld.jpg" alt="Drawing" style="width: 300px;"/>

* Compute the value function (under the uniform random policy) and the optimal value function
* Can you infer a policy ?

In [None]:
WORLD_SIZE = 5
A_POS = [0, 1]
A_PRIME_POS = [4, 1]
B_POS = [0, 3]
B_PRIME_POS = [2, 3]
DISCOUNT = 0.9

# left, up, right, down
ACTIONS = [np.array([0, -1]), #left
           np.array([-1, 0]), #up
           np.array([0, 1]), #right
           np.array([1, 0])] #down
ACTION_PROB = 0.25


def step(state, action):
    #If we are at point A or B, we are sent to A' or B', the step ends
    if state == A_POS:
        return A_PRIME_POS, 10
    if state == B_POS:
        return B_PRIME_POS, 5
    
    #Move the agent
    next_state = (np.array(state) + action).tolist()
    x, y = next_state
    
    #if we go outside the grid : -1 else 0
    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
        reward = -1.0
        next_state = state
    else:
        reward = 0
    return next_state, reward


######################################
###########    POLICIES    ###########
######################################

def value_function():
    
    
    value = np.zeros((WORLD_SIZE, WORLD_SIZE))
    #<Add your code here>
    while True:
        break

def optimal_value_function():
    
    value = np.zeros((WORLD_SIZE, WORLD_SIZE))
    #<Add your code here>
    
    while True:
        break

value_function()
optimal_value_function()
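
One possible way to approach this exercise is sketched below, as a starting point rather than a reference solution. It re-declares the grid with tuples so that it is self-contained; `grid_step`, `sweep` and `solve` are hypothetical helper names, and the random policy is assumed to be uniform over the four actions.

```python
import numpy as np

WORLD_SIZE = 5
A_POS, A_PRIME_POS = (0, 1), (4, 1)
B_POS, B_PRIME_POS = (0, 3), (2, 3)
DISCOUNT = 0.9
ACTIONS = [(0, -1), (-1, 0), (0, 1), (1, 0)]   # left, up, right, down


def grid_step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    if state == A_POS:
        return A_PRIME_POS, 10.0
    if state == B_POS:
        return B_PRIME_POS, 5.0
    x, y = state[0] + action[0], state[1] + action[1]
    if not (0 <= x < WORLD_SIZE and 0 <= y < WORLD_SIZE):
        return state, -1.0                     # bumped into the border
    return (x, y), 0.0


def sweep(value, optimal):
    """One synchronous Bellman backup over every cell of the grid."""
    new_value = np.zeros_like(value)
    for i in range(WORLD_SIZE):
        for j in range(WORLD_SIZE):
            backups = []
            for action in ACTIONS:
                (ni, nj), reward = grid_step((i, j), action)
                backups.append(reward + DISCOUNT * value[ni, nj])
            # max over actions for v_*, uniform average for the random policy
            new_value[i, j] = max(backups) if optimal else np.mean(backups)
    return new_value


def solve(optimal, tol=1e-6):
    """Iterate sweeps until the largest change falls below tol."""
    value = np.zeros((WORLD_SIZE, WORLD_SIZE))
    while True:
        new_value = sweep(value, optimal)
        if np.abs(new_value - value).max() < tol:
            return new_value
        value = new_value


v_random = solve(optimal=False)
v_star = solve(optimal=True)
print(np.round(v_random, 1))
print(np.round(v_star, 1))
```

A useful sanity check: under the optimal policy the agent loops from A' back to A in five steps, so $v_*(A) = 10 / (1 - \gamma^5)$.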

### 2.3.2  State-action Value and optimal State-action Value function

The state value function $v_\pi$ can be seen as an expectation over possible actions for a given state $s$ and a policy $\pi$. Formally, this can be written as :

$$
\begin{align}
v_\pi (s) & = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1}|S_t=s] \quad \forall s \in \mathcal{S} \\
& = \sum_a \pi(a|s) \sum_{s',r}p(s',r|s,a)[r + \gamma v_\pi(s')] \\
& = \sum_a \pi(a|s) q_\pi(s,a)
\end{align}
$$

The state-action value function $q_\pi$ is then  :

$$
\begin{align}
q_\pi (s,a) & = \sum_{s',r}p(s',r|s,a)[r + \gamma v_\pi(s')] \\
& = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1}|S_t=s, A_t=a] \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}
\end{align}
$$

When we compute the optimal state value function $v_*$, we assume the agent acts optimally. This is expressed by the **Bellman optimality equation**, which states that the value of a state under an optimal policy must equal the expected return for the best action from that state. In other words, the agent always chooses the action with the highest state-action value :

$$
\begin{align}
v_* (s) & = \underset{a}{\operatorname{max}} \sum_{s',r}p(s',r|s, a)[r + \gamma v_*(s')] \\
& = \underset{a}{\operatorname{max}} q_*(s,a)
\end{align}
$$

Replacing equations we can write the **optimal state-action value function** :

$$
\begin{align}
q_* (s,a) & = \sum_{s',r}p(s',r|s,a)[r + \gamma v_*(s')] \\
& = \sum_{s',r}p(s',r|s,a)[r + \gamma \underset{a'}{\operatorname{max}} q_*(s',a')]
\end{align}
$$

Its estimate $Q_t$ can be computed iteratively using :

$$Q_{t+1}(s,a) \gets  \sum_{s',r}p(s',r|s, a)[r + \gamma \underset{a'}{\operatorname{max}}Q_t(s',a')]$$

This is the **Q-value iteration** algorithm. If there is no uncertainty (deterministic transitions, as in the grid world), each pair $(s,a)$ leads to a single next state $s'$ and reward $r$, and the rule reduces to :

$$Q_{t+1}(s,a) \gets  r + \gamma \underset{a'}{\operatorname{max}}Q_t(s',a')$$

This is the main component of the **Q-learning** algorithm we'll see later. 

The greedy policy with respect to $q_*$, which is an optimal policy, is then given by :

$$\pi'(s) = \underset{a}{\operatorname{argmax}}\, q_* (s,a)$$
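
To make the deterministic rule and the greedy policy concrete, here is a sketch on a hypothetical four-state corridor: action 0 moves left, action 1 moves right, and reaching the last state (absorbing) pays $+1$. The tables `next_state` and `reward` are invented for the example.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9

# Deterministic model: next_state[s, a] and reward[s, a].
next_state = np.array([[0, 1],
                       [0, 2],
                       [1, 3],
                       [3, 3]])           # state 3 is absorbing
reward = np.array([[0.0, 0.0],
                   [0.0, 0.0],
                   [0.0, 1.0],            # stepping right from state 2 pays +1
                   [0.0, 0.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(100):
    # Q(s, a) <- r + gamma * max_a' Q(s', a'), applied to all pairs at once
    Q = reward + gamma * Q.max(axis=1)[next_state]

# Greedy policy: best action in each state.
greedy_policy = Q.argmax(axis=1)
print(Q)
print(greedy_policy)
```

In the three non-terminal states the greedy action is "right", and the values decay by a factor $\gamma$ for each extra step away from the reward.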

### Exercise 2.3 :

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

<img src="img/Frozen-Lake.png" alt="Drawing" style="width: 300px;"/>
<img src="img/description.png" alt="Drawing" style="width: 100px;"/>

Using the gym environment :
* Compute the optimal state value function (value_iteration function)
* Compute the optimal state-action value function (q_value function)
* Compute the optimal policy
* Why is the optimal action in grid[3,2] to go down ?

In [None]:
"""
env.observation_space.n 
return: 
    int: nb of possible states

env.action_space.n 
return: 
    int: nb of possible actions

env.P[state][action] 
return: 
    list of tuples, one per possible outcome:
    (float: transition_prob, 
     int: next_state, 
     float: reward, 
     bool: done)
    
#ACTIONS
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3
"""

def value_iteration(env, nb_of_iterations=1000, gamma = 1.0, threshold=1e-20):
    
    # initialize value table with zeros
    value_table = np.zeros(env.observation_space.n)
    
    #<Add your code here>
    
    return value_table

def q_value(env, nb_of_iterations=1000, gamma = 1.0, threshold=1e-20):
     
    # initialize q table with zeros
    q_table = np.zeros((env.observation_space.n, env.action_space.n))
    
    #<Add your code here>
    
    return q_table


#INITIALIZE the ENVIRONMENT
env = gym.make('FrozenLake-v0')
env.reset()

#<Add your code here>
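
As a hint for the structure of the solution, the sketch below runs value iteration on a hand-written model that uses the same layout as `env.P`, i.e. `P[state][action]` is a list of `(transition_prob, next_state, reward, done)` tuples. The three-state model, `q_from_v` and `value_iteration_from_model` are hypothetical; with gym installed, `env.P` could be passed in place of `P`, together with the matching numbers of states and actions.

```python
import numpy as np

# Hypothetical 3-state model in the same layout as env.P.
# State 2 is terminal and absorbing with zero reward.
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]},
}
n_states, n_actions, gamma = 3, 2, 0.9


def q_from_v(P, V, s, gamma):
    """Expected one-step backup of every action in state s."""
    return np.array([
        sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])
        for a in range(n_actions)
    ])


def value_iteration_from_model(P, gamma, threshold=1e-10, max_iter=10_000):
    """Value iteration that stops when the largest update is below threshold."""
    V = np.zeros(n_states)
    for _ in range(max_iter):
        V_new = np.array([q_from_v(P, V, s, gamma).max()
                          for s in range(n_states)])
        if np.abs(V_new - V).max() < threshold:
            return V_new
        V = V_new
    return V


V = value_iteration_from_model(P, gamma)
Q = np.array([q_from_v(P, V, s, gamma) for s in range(n_states)])
policy = Q.argmax(axis=1)
print(V, policy)
```

Note that the default `gamma = 1.0` in the exercise skeleton only behaves well when every episode eventually terminates; the sketch uses $0.9$ to stay on the safe side.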

# 2.4 Dealing with partial information : solving by playing

So far we assumed that the environment was perfectly known and that we could query the reward of any state at any time. In reality, we don't know the transition probabilities : the agent has to explore in order to learn how states are linked to each other and which states give rewards.

## 2.4.1 TD learning framework

**Temporal difference learning** aims to solve an MDP with only partial knowledge. We assume that the agent only knows the possible states and actions and runs an **exploration policy** (e.g. epsilon-greedy) to discover transitions and rewards by sampling them. Recall the **value iteration** algorithm :

$$V_{t+1}(s) \gets \underset{a}{\operatorname{max}} \sum_{s',r}p(s',r|s, a)[r + \gamma V_{t}(s')]$$

We call the term $[r + \gamma V(s')]$ the **TD target** : it is a sampled estimate of $V(s)$ built from a single observed transition $(s, a, r, s')$ (it would be unbiased if $V(s')$ were the true value of $s'$). The **TD learning** algorithm applies exponential smoothing between the current value and the target value, given a learning rate $\alpha$ (e.g. 0.01) :

$$
\begin{align}
V_{t+1}(s) & \gets (1-\alpha)V_{current} + \alpha V_{target}\\
V_{t+1}(s) & \gets (1-\alpha)V_{t}(s) + \alpha[r + \gamma V_{t}(s')]
\end{align}
$$

This algorithm has some similarities with stochastic gradient descent (SGD) : it is volatile and somewhat unstable because it handles one sample at a time. Reducing the learning rate over time helps to dampen this bouncing effect.
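
The smoothing update can be sketched in a few lines. The example assumes a hypothetical deterministic chain `0 -> 1 -> 2 -> terminal` with a reward of $1$ on the last transition, so that the result is easy to verify by hand. The `done` flag, which sets the value of a terminal state to zero, is an implementation detail added here and is not part of the formula above.

```python
import numpy as np


def td0_update(V, s, r, s_next, done, alpha, gamma):
    """In-place TD(0) update: move V[s] a fraction alpha towards the target."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] = (1 - alpha) * V[s] + alpha * target


# Hypothetical chain 0 -> 1 -> 2 -> terminal, reward 1 on the last step.
n_states, gamma, alpha = 3, 0.9, 0.1
V = np.zeros(n_states)

for episode in range(500):
    for s in range(n_states):
        done = (s == n_states - 1)
        r = 1.0 if done else 0.0
        s_next = s if done else s + 1      # ignored when done is True
        td0_update(V, s, r, s_next, done, alpha, gamma)

print(V)
```

Here the transitions are deterministic, so the estimates settle on $[\gamma^2, \gamma, 1]$ even with a constant learning rate; with stochastic transitions they would keep fluctuating around the true values unless $\alpha$ is decayed.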

## 2.4.2 Using the Q-function in TD learning

Using the **Q-function** gives extra information : it tells us how good an action is in a given state. Replacing $V(s)$ by $Q(s,a)$ leads to a new algorithm called **State–Action–Reward–State–Action (SARSA)** :

$$
\begin{align}
Q_{t+1}(s,a) & \gets (1-\alpha)Q_{current} + \alpha Q_{target}\\
Q_{t+1}(s,a) & \gets (1-\alpha)Q_{t}(s,a) + \alpha[r + \gamma Q_{t}(s',a')]
\end{align}
$$

By changing the way $Q_{target}$ is calculated, we can define another variant. Setting $Q_{target} = [r + \gamma \underset{a'}{\operatorname{max}}Q_t(s',a')]$, we get the famous **Q-learning** algorithm :

$$
\begin{align}
Q_{t+1}(s,a) & \gets (1-\alpha)Q_{current} + \alpha Q_{target}\\
Q_{t+1}(s,a) & \gets (1-\alpha)Q_{t}(s,a) + \alpha[r + \gamma \underset{a'}{\operatorname{max}}Q_t(s',a')]
\end{align}
$$

The main difference is the use of the $\underset{a'}{\operatorname{max}}$ operator. In **Q-learning**, picking the action $a'$ is straightforward... but how do we pick $a'$ in the **SARSA** setup ? 

We simply sample $a'$ from the policy the agent is following (e.g. epsilon-greedy), and then actually take that action at the next step.
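
The difference between the two targets is easiest to see side by side. The sketch below applies one SARSA update and one Q-learning update to the same hypothetical transition; the function names, the Q-table and the transition are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def epsilon_greedy(Q, state, eps):
    """Random action with probability eps, greedy action otherwise."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Target uses the action a_next actually chosen by the behavior policy."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Target uses the best action in s_next, whatever is played next."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target


Q_sarsa = np.array([[0.0, 0.0],
                    [1.0, 3.0]])
Q_qlearn = Q_sarsa.copy()

# Same transition (s=0, a=1, r=1, s'=1); the behavior policy explored a'=0.
sarsa_update(Q_sarsa, s=0, a=1, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
q_learning_update(Q_qlearn, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q_sarsa[0, 1], Q_qlearn[0, 1])
```

SARSA bootstraps from the exploratory action (value $1$), Q-learning from the greedy one (value $3$), so the two tables diverge as soon as the behavior policy explores.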

## 2.4.3 On-policy vs off-policy learning

* On-policy : SARSA
* Off-policy : Q-learning

Q-learning is off-policy because it updates its Q-values using the Q-value of the next state $s'$ and the greedy action $a'$. The update policy (greedy) is different from the behavior policy (epsilon-greedy).

SARSA is on-policy because it updates its Q-values using the Q-value of the next state $s'$ and the action $a'$ chosen by the current policy. The update policy and the behavior policy are the same (epsilon-greedy).

There is no distinction if the behavior policy is itself greedy.

Q-learning or SARSA ?
* Q-learning, and more generally off-policy learning, tends to have higher sample variance and can have trouble converging (like SGD)
* Q-learning tends to be greedier and to take more risks, while SARSA is more conservative (which can lead to a sub-optimal solution). In some cases it is better to limit the risk (e.g. in finance)
* The choice can be task dependent

### Exercise 2.4

* Compare SARSA and Q-learning on the FrozenLake-v0 and Taxi-v2 environments.
* Why is training easier on the taxi environment ? 

For this, use the env.step(action) method :

e.g : new_state, reward, done, _ = env.step(action)

In [None]:
def train_agent(env, 
                num_episodes, 
                max_steps_per_episode, 
                algorithm, # {"Q-learning","SARSA"}
                lr = 0.1, 
                eps = 1, 
                discount_rate = 0.99, 
                max_eps = 1, 
                min_eps = 0.01, 
                eps_decay = 0.002):
    
    rewards_all_episodes = []
    q_table = np.zeros((env.observation_space.n, env.action_space.n))

    for episode in trange(num_episodes):
        
        #Reset environnement
        state = env.reset()
        done = False
        rewards_current_episode = 0

        for step in range(max_steps_per_episode): 
            
            #<Add your code here>
            
            if done == True: 
                break

        rewards_all_episodes.append(rewards_current_episode)

        # Exploration rate decay
        eps = min_eps + (max_eps - min_eps) * np.exp(-eps_decay*episode)
    
    return rewards_all_episodes
    

eps = 1
max_eps = 1
min_eps = 0.01
eps_decay = 0.002

lr = 0.1
discount_rate = 0.99

max_steps_per_episode = 50
num_episodes = 5000

env_names = ['FrozenLake-v0', "Taxi-v2"]
algorithms = ["SARSA", "Q-learning"]

for env_name in env_names:
    
    env = gym.make(env_name)
    all_rewards = []
    
    for algorithm in algorithms:
        
        #<Add your code here>
        pass  # placeholder so the loop body is valid until the exercise is filled in

## 2.4.4 Extensions

### 2.4.4.1 TD-lambda and eligibility trace

There are two main methods for solving an MDP : 
* TD learning : biased and sensitive to initial conditions, but can be updated online (after each action)
* Monte-Carlo methods : unbiased, but require the episode to end before updating and usually have very high variance (sample inefficient)

TD-lambda uses a mechanism called the **eligibility trace**, which acts as a bridge between TD and MC methods. Q-learning or SARSA can be combined with it to obtain a more general method that may learn more efficiently. An **eligibility trace** is a temporary record of the occurrence of an event (visiting a state or taking an action); it can be seen as a memory mechanism that fades over time (like a vanilla RNN).

Recall the SARSA update rule (Q-learning works similarly) :

$$
\begin{align}
Q_{t+1}(s,a) & \gets (1-\alpha)Q_{t}(s,a) + \alpha[r + \gamma Q_{t}(s',a')]\\
& \gets Q_{t}(s,a) + \alpha[r + \gamma Q_{t}(s',a') - Q_{t}(s,a)]\\
& \gets Q_{t}(s,a) + \alpha \delta_t
\end{align}
$$

The term $\delta_t = r + \gamma Q_{t}(s',a') - Q_{t}(s,a)$ is called the **TD error**. In SARSA($\lambda$), the eligibility trace $e(s,a)$ is added to the update rule :

$$
\begin{align}
Q_{t+1}(s,a) & \gets Q_{t}(s,a) + \alpha \delta_t e_t(s,a) \quad \forall (s,a) \in \mathcal{S} \times \mathcal{A}\\
e_t(s,a) & = \begin{cases} \gamma \lambda e_{t-1}(s,a) + 1 && \text{if } s = s_t, a = a_t\\ 
\gamma \lambda e_{t-1}(s,a) && \text{otherwise}\end{cases}
\end{align}
$$

There are different ways of calculating $e(s,a)$. Another common idea is to replace the trace instead of accumulating it : 

$$e_t(s,a) = \begin{cases} 1 && \text{if } s = s_t, a = a_t\\ 
\gamma \lambda e_{t-1}(s,a) && \text{otherwise}\end{cases}$$

In both implementations : 
* If $\lambda = 0$, we get back to the standard SARSA algorithm. 
* If $\lambda = 1$, we are in an online MC setup. 
* If $\lambda \in ]0,1[$, the parameter acts as a decay rate. 
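
A single SARSA($\lambda$) step can be sketched as follows. The function name and the `replacing` switch are illustrative; the order of operations (bump the trace of the visited pair, update the whole table, then decay all traces by $\gamma\lambda$) is one common way to implement the equations above.

```python
import numpy as np


def sarsa_lambda_update(Q, e, s, a, r, s_next, a_next,
                        alpha, gamma, lam, replacing=False):
    """One in-place SARSA(lambda) step on the Q-table and the trace table."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]   # TD error
    if replacing:
        e[s, a] = 1.0          # replacing trace
    else:
        e[s, a] += 1.0         # accumulating trace
    Q += alpha * delta * e     # every pair is updated in proportion to its trace
    e *= gamma * lam           # all traces fade before the next step
    return delta


# Two hypothetical transitions: (0,0) with no reward, then (1,0) with reward 1.
Q = np.zeros((2, 2))
e = np.zeros_like(Q)
sarsa_lambda_update(Q, e, s=0, a=0, r=0.0, s_next=1, a_next=0,
                    alpha=0.1, gamma=0.9, lam=0.5)
sarsa_lambda_update(Q, e, s=1, a=0, r=1.0, s_next=0, a_next=0,
                    alpha=0.1, gamma=0.9, lam=0.5)
print(Q)
```

The reward received at the second step also updates the pair visited one step earlier, scaled by its faded trace $\gamma\lambda = 0.45$. With $\lambda = 0$ that earlier pair would receive nothing, which is plain SARSA.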

### 2.4.4.2 Double Q-learning

One of the main variants is the **Double Q-learning** algorithm, which aims to reduce the bias due to the ${\operatorname{max}}$ operator. Standard **Q-learning** tends to overestimate action values; since these overestimated values are used as targets, training can be volatile and slow. To address this, we introduce a second **Q-table** : at each step, we randomly decide to update either table A or table B. The greedy action is selected with the table being updated, and its value is read from the other table :
* If table A is chosen :

$$
\begin{align}
a^* & = \underset{a'}{\operatorname{argmax}}\, Q^A_{t}(s',a') \\
Q_{t+1}^A(s,a) & \gets (1-\alpha)Q^A_{t}(s,a) + \alpha[r + \gamma Q^B_t(s',a^*)]
\end{align}
$$

* If table B is chosen :

$$
\begin{align}
b^* & = \underset{a'}{\operatorname{argmax}}\, Q^B_{t}(s',a') \\
Q_{t+1}^B(s,a) & \gets (1-\alpha)Q^B_{t}(s,a) + \alpha[r + \gamma Q^A_t(s',b^*)]
\end{align}
$$
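
Both cases can be handled by one function that swaps the roles of the two tables. The sketch below is a minimal illustration: `double_q_update` and its `update_a` argument are hypothetical names, and in practice the caller would draw `update_a` at random with probability 0.5 at each step.

```python
import numpy as np


def double_q_update(Q_a, Q_b, s, a, r, s_next, alpha, gamma, update_a):
    """Update one table in place, evaluating its greedy action with the other."""
    learner, evaluator = (Q_a, Q_b) if update_a else (Q_b, Q_a)
    best = int(np.argmax(learner[s_next]))            # action picked by the learner
    target = r + gamma * evaluator[s_next, best]      # value read from the other table
    learner[s, a] = (1 - alpha) * learner[s, a] + alpha * target


# Hypothetical tables that disagree on the best action in state 1.
Q_a = np.array([[0.0, 0.0], [1.0, 2.0]])
Q_b = np.array([[0.0, 0.0], [5.0, 3.0]])

rng = np.random.default_rng(0)
coin = rng.random() < 0.5          # how update_a would be drawn during training

# Both branches are shown explicitly here for illustration.
double_q_update(Q_a, Q_b, s=0, a=0, r=1.0, s_next=1,
                alpha=0.5, gamma=0.9, update_a=True)
double_q_update(Q_a, Q_b, s=0, a=0, r=1.0, s_next=1,
                alpha=0.5, gamma=0.9, update_a=False)
print(Q_a[0, 0], Q_b[0, 0])
```

For the behavior policy, a common choice (an assumption here, not stated above) is to act epsilon-greedily with respect to the sum or the average of the two tables.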



### Exercise 2.5

* Implement SARSA($\lambda$) with different values of $\lambda$ on Taxi-v2 and compare the results
* Implement double Q-learning on FrozenLake-v0
* Is double Q-learning on-policy or off-policy ?

In [None]:
#SARSA

def train_agent(env, 
                lb,
                num_episodes, 
                max_steps_per_episode,  
                lr = 0.1, 
                eps = 1, 
                discount_rate = 0.99, 
                max_eps = 1, 
                min_eps = 0.01, 
                eps_decay = 0.002):
    
    rewards_all_episodes = []
    q_table = np.zeros((env.observation_space.n, env.action_space.n))
    
    for episode in trange(num_episodes):

        state = env.reset()
        done = False
        rewards_current_episode = 0
        e = np.zeros_like(q_table)
        
        for step in range(max_steps_per_episode): 
            
            #<Add your code here>

            if done == True: 
                break

        rewards_all_episodes.append(rewards_current_episode)

        # Exploration rate decay
        eps = min_eps + (max_eps - min_eps) * np.exp(-eps_decay*episode)
    
    return rewards_all_episodes

eps = 1
max_eps = 1
min_eps = 0.01
eps_decay = 0.002

lr = 0.1
discount_rate = 0.99

max_steps_per_episode = 50
num_episodes = 3000

lambdas = [0., 0.3, 0.7, 1.]
env = gym.make('Taxi-v2')
all_rewards = []

for lb in lambdas:
    #<Add your code here>
    pass  # placeholder so the loop body is valid until the exercise is filled in
    
env.close()

In [None]:
#DOUBLE Q LEARNING

def train_agent(env, 
                num_episodes, 
                max_steps_per_episode,  
                lr = 0.1, 
                eps = 1, 
                discount_rate = 0.99, 
                max_eps = 1, 
                min_eps = 0.01, 
                eps_decay = 0.002):
    
    rewards_all_episodes = []
    q_table_a = np.zeros((env.observation_space.n, env.action_space.n))
    q_table_b = np.zeros_like(q_table_a)
    
    for episode in trange(num_episodes):

        state = env.reset()
        done = False
        rewards_current_episode = 0

        for step in range(max_steps_per_episode): 
            
            #<Add your code here>

            if done == True: 
                break

        rewards_all_episodes.append(rewards_current_episode)

        # Exploration rate decay
        eps = min_eps + (max_eps - min_eps) * np.exp(-eps_decay*episode)
    
    return rewards_all_episodes

eps = 1
max_eps = 1
min_eps = 0.01
eps_decay = 0.002

lr = 0.1
discount_rate = 0.99

max_steps_per_episode = 100
num_episodes = 10000


env = gym.make('FrozenLake-v0')
all_rewards = []

#<Add your code here>

env.close()