## Markov Decision Process 
The state transition and reward models $R$ and $T$ are known.

### Value Iteration
The value function $U(s)$ represents the long-term reward that the agent is going to get if he starts in $s$ and follows the optimal policy.  

The value iteration approach keeps improving the value function at each iteration until it converges.
At each iteration and for each state s we update its estimated utility:  
  $$U_{t+1}(s) = \max_{a}\sum_{s'}T(s, a, s')(R(s') + \gamma U_t(s')) $$  


In [1]:
def value_iteration(env, gamma, max_iter, epsilon):
    U = np.zeros(env.nS)
    for i in range(max_iter):
        prev_U = np.copy(U)
        for s in range(env.nS):
            list_sum = np.zeros(env.nA)
            for a in range(env.nA):
                for p, s_prime, r, _ in env.P[s][a]:
                    list_sum[a] += p*(r + gamma*prev_U[s_prime])
            U[s] = max(list_sum)
        if (np.sum(np.fabs(prev_U - U)) <= epsilon):
            break
    return U     

Once we have computed the true utility of each state $U(s)$ we can figure out the optimal policy $\pi(s) = \underset{a}{\operatorname{argmax}}\sum_{s'}T(s, a, s')U(s')$
### Policy Iteration
If we compute the true utility of each state $U(s)$ we can figure out the optimal policy but we have much more information than what we need to figure out the optimal policy.  
The policy iteration approach re-defines the policy at each step and computes the value function associated to the current policy until the policy converges to the optimal policy.
It needs less iterations than VI to converge however each iteration is more computationally expensive.  
Given a policy $\pi_t$ we compute the utility of each state:  
  $$U_t(s) = \sum_{s'}T(s, \pi_t(s), s')(R(s') + \gamma U_t(s')) $$

In [2]:
def evaluate_policy(env, policy, gamma, epsilon):
    U = np.zeros(env.nS)
    while True:
        prev_U = np.copy(U)
        for s in range(env.nS):
            a = policy[s]
            U[s] = sum([p * (r + gamma * prev_U[s_]) for p, s_, r, _ in env.P[s][a]])
            #for p, s_prime, r, _ in env.P[s][a]:
                #U[s] += p*(r + gamma*prev_U[s_prime])
        if (np.sum(np.fabs(prev_U - U)) <= epsilon):
            break
    return U

We then improve the policy:  
$$ \pi_{t+1}(s) = \underset{a}{\operatorname{argmax}}\sum_{s'}T(s, a, s')U_t(s'))$$

In [3]:
def improve_policy(U, gamma):
    policy = np.zeros(env.nS)
    for s in range(env.nS):
        list_sum = np.zeros(env.nA)
        for a in range(env.nA):
            for p, s_prime, r, _ in env.P[s][a]:
                list_sum[a] += p*U[s_prime]
        policy[s] = np.argmax(list_sum)
    return policy


To get the final Policy Iteration algorithm we combine the two previous steps:

In [4]:
def policy_iteration(env, gamma, max_iter, epsilon):
    policy = np.random.choice(env.nA, env.nS)
    for i in range(max_iter):
        U = evaluate_policy(env, policy, gamma, epsilon)
        new_policy = improve_policy(U, gamma)
        if (np.all(policy == new_policy)):
            break
        policy = new_policy
    return policy


## Reinforcement Learning
The state transition and reward models $T$ and $R$ are not known. The agent has access to the set of possible states and actions and has to learn through interactions with the environment.

### Q-Learning

The Q-Learning algorithm does no longer have access to the models of the MDP that is to say the transition and reward functions.
The idea is now to evaluate the Bellman equation from data by using transitions (data : $ <s, a, r, s'>$) to produce the solutions to the Q equations.
At each episode we are going to update the estimates of the Q function coming from the previous episode through a learning rate $\alpha$. 
$$ Q(s, a) = \alpha(r + \gamma \max_{a'}Q(s, a')) + (1 - \alpha)Q(s, a)$$


In [5]:
def q_learning(env, alpha, gamma, nb_episodes, nb_steps, epsilon):
    
    # Initialize the Q-table with zeros
    Q = np.zeros([env.observation_space.n, env.action_space.n])
    
    for i in range(nb_episodes):
        s = env.reset() #Initial observation
        for j in range(nb_steps):
            # The action associated to s is the one that provides the best Q-value with a proba 1-epsilon and is random with a proba epsilon
            if random.random() < 1 - epsilon:
                a = np.argmax(Q[s,:]) 
            else : 
                a = np.random.randint(env.action_space.n)
            # We get our transition <s, a, r, s'>
            s_prime, r, d, _ = env.step(a)
            # We update the Q-tqble with using new knowledge
            Q[s, a] = alpha*(r + gamma*np.max(Q[s_prime,:])) + (1 - alpha)*Q[s, a]
            s = s_prime
            if d == True:
                break
    
    return Q


### Deep Q-Learning

The problem of the previous Q-Learning algorithm is that it will not be able to work in big state space environments.
So rather than using a Q-table which returns a Q-value for a given state and a given action we can implement a neural network $N$ which takes a state and returns the Q-values of all the possible actions that could be taken in that state: $N(s) = \{Q(s, a_1), Q(s, a_2), ..., Q(s, a_n)\}$.   
  Just as Q-Learning we start with an initial state $s$ and action $a$. We look at the next state $s'$ and the associated reward $r$ that the agent receives when he takes this action $a$ in the state $s$.
  The transition $<s, a, r, s'>$ is stored in the memory of the agent. When we start to have enough transitions in the memory we sample a batch of them and for each transition $<s, a_j, r, s'>$ we do the following:  
   - Compute a target $t$, which represents the "best" action that can be done when the agent is in $s$, that is to say the action that maximizes the expected long-term reward.  
   $t = r +\gamma \max_a Q(s',a)$ with $\{Q(s', a_i)\}_{i = 1, ..., n} = N(s')$
   - Compute the output predicted by the network for $s$: $N(s) = \{Q(s, a_i)\}_{i = 1, ..., n}$ 
   - Replace $Q(s, a_j)$ with $t$ to get $N'(s)$
   - Train the network using $s$ as the input and $N'(s)$ as the output.  

We then use $s'$ as the current state $s$ and reiterate.

In [21]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        # Neural Net for Deep-Q learning Model
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])  # returns action

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = (reward + self.gamma * np.amax(self.model.predict(next_state)[0]))
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

## Gym environments

### Frozen Lake - 16 states


In [6]:
import gym
import numpy as np
import random

env = gym.make('FrozenLake-v0')


#### Description
States: There are 16 different states that represent the different parts of the lake. An agent wants to go from a starting point (S) to a goal (G) situated on the other side of the lake.
Some states are traversable (F) but others are holes (H) that lead the agent to fall into the water.
The surface can be represented using the following grid:  
  SFFF  
  FHFH  
  FFFH  
  HFFG

Actions: There are 4 different actions that the agent can do when he is in a state: [LEFT, DOWN, RIGHT, UP].

Transition model: If the agent wishes to execute an action, this action is executed correctly with a probability of 0.8 and causes the agent to move at a right angle with a probability of 0.2.
The 0.2 is distributed uniformly over the two possible right angles.

Rewards: +1 if the agent reaches the goal, -1 if the agent falls down a hole and -0.04 if the agent is on a frozen surface.

#### Value Iteration

In [17]:
env = env.unwrapped
gamma = 0.95
max_iter = 100000
epsilon = 1e-20
print(value_iteration(env, gamma, max_iter, epsilon))

[ 0.10923213 -0.0224833   0.15217449 -0.0224833   0.18551446  0.
  0.258482    0.          0.39985603  0.68188442  0.64537105  0.
  0.          0.81919453  0.94288425  0.        ]


To visualize the policy produced by this utility:

In [15]:
def visualize_policy(policy):
    visu = ''
    for k in range(len(policy)):
        if k > 0 and k%4 == 0:
            visu += '\n'
        if k == 5 or k == 7 or k == 11 or k == 12 or k == 15:
            visu+='H'
        elif int(policy[k]) == 0:
            visu += 'L'
        elif int(policy[k]) == 1:
            visu += 'D'
        elif int(policy[k]) == 2:
            visu += 'R'
        elif int(policy[k]) == 3:
            visu += 'U'
    print(visu)
    
U = value_iteration(env, gamma, max_iter, epsilon)
policy = improve_policy(U, gamma)
visualize_policy(policy)

DRDL
DHDH
RDDH
HRLH


#### Policy Iteration

In [16]:
policy = policy_iteration(env, gamma, max_iter, epsilon)
visualize_policy(policy)

DDDD
DHDH
RDDH
HRDH


#### Q-Learning

In [20]:
alpha, gamma = 0.05, 0.95
nb_episodes, nb_steps = 300, 100
epsilon = 0.1
print(q_learning(env, alpha, gamma, nb_episodes, nb_steps, epsilon))


[[-0.2703593   0.01366858 -0.26953727 -0.25210457]
 [-0.33379303 -0.49290313 -0.34383339 -0.30714565]
 [-0.14452463 -0.1090627  -0.14078897 -0.14456021]
 [-0.15791265 -0.22621906 -0.20820038 -0.15800191]
 [-0.20335352  0.13353655 -0.51240193 -0.31623327]
 [ 0.          0.          0.          0.        ]
 [-0.18549375  0.04395889 -0.16658194 -0.15130954]
 [ 0.          0.          0.          0.        ]
 [-0.15296038 -0.26618784  0.25534461 -0.13390038]
 [-0.07284454  0.60071384 -0.05127881 -0.142625  ]
 [-0.00417099  0.56622879 -0.1444212  -0.00827035]
 [ 0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.        ]
 [-0.22621906  0.14415937  0.7886408  -0.06465765]
 [ 0.09066991  0.12842186  0.95207074  0.07823609]
 [ 0.          0.          0.          0.        ]]


If we look at the action which maximizes the Q-value for each state we can visualize the policy produced by the Q-table:  

In [19]:
q_table = q_learning(env, alpha, gamma, nb_episodes, nb_steps, epsilon)
def q_to_policy(Q):
    policy = []
    for l in Q:
        if l[0] == l[1] == l[2] == l[3] == 0.0:
            policy.append(0)
        else:
            for k in range(0, len(l)):
                if l[k] == max(l):
                    policy.append(k)
                    break
    return policy
                    
q_table = q_learning(env, alpha, gamma, nb_episodes, nb_steps, epsilon)
policy = q_to_policy(q_table)
visualize_policy(policy)

DUDL
DHDH
RRDH
HRRH



#### Deep Q-Learning


In [30]:
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

state_size = 1
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)
done = False
batch_size = 32

for e in range(nb_episodes):
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    for time in range(nb_steps):
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            break
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
                
trained_network = agent.model

If we use the trained network to predict the Q-values of all the possible actions for each state we get the following Q-table:

In [28]:
def compute_q_table(env, network):
    q_table = np.zeros([env.observation_space.n, env.action_space.n])
    for s in range(env.observation_space.n):
        state = np.zeros(1, int)
        state[0] = s
        state = np.reshape(state, [1, state_size])
        q_table[s] = list(network.predict(state)[0])
    return q_table

print(compute_q_table(env, trained_network))

[[-0.76135409 -0.8108964  -0.78342205 -0.78358257]
 [-0.77555388 -0.88860106 -0.78634799 -0.78616321]
 [-0.77420568 -0.94245815 -0.78825724 -0.78376877]
 [-0.78086573 -0.92055047 -0.79246944 -0.7923671 ]
 [-0.79521191 -0.83643091 -0.79979205 -0.81153494]
 [-0.80802393 -0.77505672 -0.80358922 -0.80420381]
 [-0.80460203 -0.75839829 -0.79603648 -0.79628098]
 [-0.81798756 -0.77963543 -0.8234151  -0.81082869]
 [-0.8292352  -0.79884332 -0.84636456 -0.82366037]
 [-0.78410727 -0.76391685 -0.78513217 -0.79044443]
 [-0.72800511 -0.71922374 -0.71389955 -0.7483899 ]
 [-0.56857026 -0.60667545 -0.51452321 -0.64284033]
 [-0.19239044 -0.37354037 -0.06866091 -0.41754788]
 [ 0.18375027 -0.14052671  0.37733489 -0.19189364]
 [ 0.55989069  0.09248677  0.82333034  0.03376043]
 [ 0.9203189   0.32429221  1.25621033  0.24979198]]


The policy produced by this Q-table is the following one:

In [29]:
q_table = compute_q_table(env, trained_network)
policy = q_to_policy(q_table)
visualize_policy(policy)

LLLL
LHDH
DDRH
HRRH
