# Learning and Decision Making

## Laboratory 6: Reinforcement learning

At the end of the lab, you should submit all code/answers written in the tasks marked as "Activity n. XXX", together with the corresponding outputs and any replies to specific questions posed, to the e-mail <adi.tecnico@gmail.com>. Make sure that the subject is of the form [&lt;group n.&gt;] LAB &lt;lab n.&gt;.

### 1. The windy gridworld domain

Consider the larger version of the windy gridworld domain depicted in the figure below.

<img src="windy.png" width="400px">

In it, a boat must navigate a 7 &times; 10 gridworld to reach the goal cell, marked with _G_. There is a crosswind blowing upward through the middle of the grid, in the direction indicated by the gray arrows. The boat has available the standard four actions -- _Up_, _Down_, _Left_ and _Right_. In the region affected by the wind, however, the resulting next state is shifted upward as a consequence of the crosswind, the strength of which varies from column to column. The strength of the wind is given below each column, and corresponds to the number of cells by which the movement is shifted upward. For example, if the boat is one cell to the right of the goal, then the action _Left_ takes it to the cell just above the goal.
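The wind rule can be checked with a small sketch. The helper `windy_step` below is hypothetical (it is not part of the lab code). It assumes that cells are written as `[row, column]` with row 0 at the top, that the wind of the column the boat departs from is the one applied, and that moves are clipped at the grid border; the wind strengths and the goal cell are the ones used in the problem specification of this lab.

```python
# Wind strength per column and grid size, as in the lab's problem specification.
WIND = (0, 0, 0, 1, 1, 1, 2, 2, 1, 0)
NROWS, NCOLS = 7, 10

# Row/column displacement of each action (row 0 is the top row).
MOVES = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}


def windy_step(cell, action):
    """Return the next [row, column] cell (hypothetical helper, for illustration only)."""
    row, col = cell
    d_row, d_col = MOVES[action]
    # The wind of the departure column pushes the boat upward (toward row 0).
    new_row = row + d_row - WIND[col]
    new_col = col + d_col
    # Clip to the grid so the boat never leaves the 7 x 10 world.
    new_row = max(min(new_row, NROWS - 1), 0)
    new_col = max(min(new_col, NCOLS - 1), 0)
    return [new_row, new_col]


goal = [3, 7]
right_of_goal = [goal[0], goal[1] + 1]
# Moving Left from the cell to the right of the goal lands one row above the goal.
print(windy_step(right_of_goal, 'L'))
```

Column 8 has wind strength 1, so the move from `[3, 8]` ends in row 2 of column 7, which is the cell just above the goal at `[3, 7]`.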

The agent pays a cost of 1 in every step before reaching the goal. The problem can be described as an MDP $(\mathcal{X},\mathcal{A},\mathbf{P},c,\gamma)$ as follows.

In [2]:
%matplotlib notebook
import numpy as np
import numpy.linalg as la
import matplotlib.pyplot as plt

np.set_printoptions(threshold=1000)

# Problem specific parameters
WIND = (0, 0, 0, 1, 1, 1, 2, 2, 1, 0)
nrows = 7
ncols = 10
init = [3, 0]
goal = [3, 7]

# States
X = [[x, y] for x in range(nrows) for y in range(ncols)]
nX = len(X)

# Actions
A = ['U', 'D', 'L', 'R']
nA = len(A)

# Transition probabilities
P = dict()
P['U'] = np.zeros((nX, nX))
P['D'] = np.zeros((nX, nX))
P['L'] = np.zeros((nX, nX))
P['R'] = np.zeros((nX, nX))

for i in range(len(X)):
    x = X[i]
    y = dict()
    
    y['U'] = [x[0] - WIND[x[1]] - 1, x[1]]
    y['D'] = [x[0] - WIND[x[1]] + 1, x[1]]
    y['L'] = [x[0] - WIND[x[1]], x[1] - 1]
    y['R'] = [x[0] - WIND[x[1]], x[1] + 1]
    
    for k in y:
        y[k][0] = max(min(y[k][0], nrows - 1), 0)
        y[k][1] = max(min(y[k][1], ncols - 1), 0)
        j = X.index(y[k])
        P[k][i, j] = 1

c = np.ones((nX, nA))
c[X.index(goal), :] = 0

gamma = 0.99

#MDP dictionary structure
mdp = {"X": X, "A": A, "Pa's": [P['U'], P['D'], P['L'], P['R']], "C":c} 

# -- Pretty print

print('\n- MDP problem specification: -\n')

print('States:')
print(np.array(X))

print('\nActions:')
print(A)

print('\nTransition probabilities:')
for a in A:
    print('Action', a)
    print(P[a])
    
print('\ncost:')
print(c)

print('\nStart state:', init)
print('\nGoal state:', goal)


- MDP problem specification: -

States:
[[0 0]
 [0 1]
 [0 2]
 [0 3]
 [0 4]
 [0 5]
 [0 6]
 [0 7]
 [0 8]
 [0 9]
 [1 0]
 [1 1]
 [1 2]
 [1 3]
 [1 4]
 [1 5]
 [1 6]
 [1 7]
 [1 8]
 [1 9]
 [2 0]
 [2 1]
 [2 2]
 [2 3]
 [2 4]
 [2 5]
 [2 6]
 [2 7]
 [2 8]
 [2 9]
 [3 0]
 [3 1]
 [3 2]
 [3 3]
 [3 4]
 [3 5]
 [3 6]
 [3 7]
 [3 8]
 [3 9]
 [4 0]
 [4 1]
 [4 2]
 [4 3]
 [4 4]
 [4 5]
 [4 6]
 [4 7]
 [4 8]
 [4 9]
 [5 0]
 [5 1]
 [5 2]
 [5 3]
 [5 4]
 [5 5]
 [5 6]
 [5 7]
 [5 8]
 [5 9]
 [6 0]
 [6 1]
 [6 2]
 [6 3]
 [6 4]
 [6 5]
 [6 6]
 [6 7]
 [6 8]
 [6 9]]

Actions:
['U', 'D', 'L', 'R']

Transition probabilities:
Action U
[[ 1.  0.  0. ...,  0.  0.  0.]
 [ 0.  1.  0. ...,  0.  0.  0.]
 [ 0.  0.  1. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
Action D
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  1.  0.]
 [ 0.  0.  0. ...,  0.  0

---

#### Activity 1.        

Compute the optimal _Q_-function for the MDP defined above using value iteration. As your stopping condition, use an error between iterations smaller than `1e-8`.

---
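As a reference point for this activity, here is a minimal sketch of value iteration on _Q_-functions for a three-state toy problem. The toy MDP, the function name `value_iteration_q` and the discount of 0.9 are assumptions made for illustration; only the update rule and the stopping test follow the statement above. Each sweep applies $Q(x,a)\leftarrow c(x,a)+\gamma\sum_y \mathbf{P}(y\mid x,a)\min_{a'}Q(y,a')$ to all state-action pairs at once.

```python
import numpy as np


def value_iteration_q(P, c, gamma, tol=1e-8):
    """Iterate the optimality update on Q until successive iterates differ by less than tol.

    P: array of shape (nA, nX, nX), P[a, x, y] = probability of moving x -> y under action a.
    c: array of shape (nX, nA) with the immediate costs.
    """
    Q = np.zeros(c.shape)
    err = np.inf
    iterations = 0
    while err >= tol:
        J = Q.min(axis=1)                # cost-to-go of the cheapest action in each state
        Q_new = c + gamma * (P @ J).T    # (P @ J)[a, x] = sum_y P[a, x, y] * J[y]
        err = np.linalg.norm(Q_new - Q)
        Q = Q_new
        iterations += 1
    return Q, iterations


# Toy MDP (an assumption for illustration): states 0, A, B; A and B are absorbing.
P = np.array([
    [[0, 1, 0], [0, 1, 0], [0, 0, 1]],   # action a: state 0 moves to A
    [[0, 0, 1], [0, 1, 0], [0, 0, 1]],   # action b: state 0 moves to B
], dtype=float)
c = np.array([[1.0, 0.5], [0.0, 0.0], [1.0, 1.0]])

Q_star, n_iter = value_iteration_q(P, c, gamma=0.9)
print(np.round(Q_star, 4))
print(n_iter)
```

For this toy problem the fixed point can be checked by hand: staying in A costs nothing, staying in B forever costs $1/(1-0.9)=10$, so from state 0 action a costs $1$ and action b costs $0.5+0.9\times 10=9.5$. Replacing the inner Python loops by matrix products, as done here, is also what keeps the computation fast on the 70-state gridworld.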

In [3]:
"""
mdp_aux = {"X": ['0', 'A', 'B'], 
           "A": ['a', 'b'], 
           "Pa's": [[[0, 1, 0], [0, 1, 0], [0, 0, 1]], [[0, 0, 1], [0, 1, 0], [0, 0, 1]]], 
           "C": [[1, 0.5], [0, 0], [1, 1]]} 
"""

########################### Auxiliary ###############################
#@brief: 
#      This function computes the H operator explained in lecture 8:
#      (HQ)(x, a) = c(x, a) + gamma * sum_y P(y | x, a) * min_a' Q(y, a').
def operatorH(MDP, gamma, Q, x, a):
    # cheapest action value at every destination state
    Jmin = np.min(Q, axis=1)
    # expected cost-to-go over the destinations reachable from x with action a
    # (a dot product replaces the nested Python loops, which made value iteration very slow)
    expected = np.dot(MDP["Pa's"][a][x], Jmin)
    return MDP['C'][x][a] + gamma * expected
    
#@brief: 
#      This function runs the value iteration algorithm to find the Q function.
#
#@param: - MDP (dictionary with respective fields)
#        - tolerance, which is used to stop the algorithm once the norm of Qnew - Q is no larger than it
#        - gamma, which represents the discount factor.
#
#@return: 
#        returns the matrix that represents the Q function and the number of iterations required to compute it.
def valueIterationQ(MDP, tolerance, gamma):
    Q = np.zeros((len(MDP["X"]), len(MDP["A"]))) # initialize Q
    err = 1
    i=0  
    while err > tolerance:
        
        Qnew = np.zeros((len(MDP["X"]), len(MDP["A"])))
        for x in range(0, len(MDP["X"])):
            for a in range(0, len(MDP["A"])):
                Qnew[x][a] = operatorH(MDP, gamma, Q, x, a)
                
        err = la.norm(Qnew - Q)
        i += 1
        Q = Qnew
    return (Q, i)

# This takes a while to finish.
VI, iterations = valueIterationQ(MDP=mdp, tolerance=1e-8, gamma=gamma)

#print (VI)
#print (iterations)


---

#### Activity 2.        

Write down a Python function that, given a Q-function $Q$ and a state $x$, selects a random action using the $\epsilon$-greedy policy obtained from $Q$ for state $x$. Your function should receive an optional parameter, corresponding to $\epsilon$, with default value of 0.1. 

**Note:** In the case of two actions with the same value, your $\epsilon$-greedy policy should randomize between the two.

---
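One way to make the requirement precise, assuming that exploration draws uniformly from all actions (greedy ones included) and that "greedy" means cost-minimizing: writing $\mathcal{A}^*(x)=\arg\min_{a\in\mathcal{A}}Q(x,a)$ for the set of greedy actions at $x$, the policy selects action $a$ with probability

$$\pi(a\mid x)=\begin{cases}\dfrac{1-\epsilon}{|\mathcal{A}^*(x)|}+\dfrac{\epsilon}{|\mathcal{A}|}, & a\in\mathcal{A}^*(x),\\[2mm] \dfrac{\epsilon}{|\mathcal{A}|}, & \text{otherwise.}\end{cases}$$

For example, with four actions, $\epsilon=0.1$ and two tied greedy actions, each greedy action is chosen with probability $0.45+0.025=0.475$ and each of the other two with probability $0.025$.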

In [4]:
import random

def eGreedyHeuristic(MDP, Q, x, e=0.1):
    # exploration case: with probability e, pick an action uniformly at random
    # (random.random() supports any e in [0, 1], not only multiples of 1%)
    if random.random() < e:
        return random.randrange(0, len(MDP['A']))
    
    # greedy case: costs are minimized, so the greedy actions are those with the smallest Q-value
    best = min(Q[x])

    # aux is a list that stores the indexes of the actions with the value equal to the minimum
    aux = [a for a in range(0, len(MDP['A'])) if Q[x][a] == best]
   
    # if several actions are tied, select randomly between them
    return random.choice(aux)


### 2. Model-based learning

You will now run the model-based learning algorithm discussed in class, and evaluate its learning performance.

---

#### Activity 3.        

Run the model-based reinforcement learning algorithm discussed in class to compute $Q^*$ for 100,000 iterations. Initialize each transition probability matrix as the identity and the cost function as all-zeros. Use an $\epsilon$-greedy policy with $\epsilon=0.1$ (use the function from Activity 2). Note that, at each step,

* You will need to select an action according to the $\epsilon$-greedy policy;
* Given the state and action, you will then compute the cost and generate the next state;
* With this transition information (state, action, cost, next-state), you can now perform an update;
* When updating the components $(x,a)$ of the model, use the step-size

$$\alpha_t=\frac{1}{N_t(x,a)+1},$$

where $N_t(x,a)$ is the number of visits to the pair $(x,a)$ up to time step $t$.

In order to ensure that your algorithm visits every state and action a sufficient number of times, after the boat reaches the goal cell, make one further step, the corresponding update, and then reset the position of the boat to a random state in the environment.

Plot the norm $\|Q^*-Q^{(k)}\|$ every 500 iterations of your method, where $Q^*$ is the optimal _Q_-function computed in Activity 1.

**Note:** The simulation may take a bit. Don't despair.

---
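A sketch of what the update step can look like, assuming the usual running-average form of model-based learning (this formulation is an assumption and should be checked against the one given in class). Here $\hat{c}$ and $\hat{\mathbf{P}}$ denote the current estimates of the model, $c_t$ is the observed cost and $x'$ is the observed next state:

$$\hat{c}(x,a)\leftarrow\hat{c}(x,a)+\alpha_t\big(c_t-\hat{c}(x,a)\big),$$

$$\hat{\mathbf{P}}(y\mid x,a)\leftarrow\hat{\mathbf{P}}(y\mid x,a)+\alpha_t\big(\mathbb{I}[y=x']-\hat{\mathbf{P}}(y\mid x,a)\big)\quad\text{for all }y\in\mathcal{X},$$

$$Q(x,a)\leftarrow\hat{c}(x,a)+\gamma\sum_{y\in\mathcal{X}}\hat{\mathbf{P}}(y\mid x,a)\min_{a'\in\mathcal{A}}Q(y,a').$$

Since $\sum_y\mathbb{I}[y=x']=1$, each row of $\hat{\mathbf{P}}$ still sums to 1 after the update. With the step-size $\alpha_t=1/(N_t(x,a)+1)$, the estimates are plain averages in which the initial values (identity and zeros) count as one sample.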

In [6]:
mb_C = np.zeros((nX, nA))

mb_P = dict()
mb_P['U'] = np.identity(nX)
mb_P['D'] = np.identity(nX)
mb_P['L'] = np.identity(nX)
mb_P['R'] = np.identity(nX)

# the MDP in its initial state is defined as the following dictionary:
mb_mdp = {"X": X, "A": A, "Pa's": [mb_P['U'], mb_P['D'], mb_P['L'], mb_P['R']], "C":mb_C}

x = X.index(init)
Q = np.zeros((len(mdp["X"]), len(mdp["A"]))) # initialize Q
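transition_states = np.zeros((len(mdp['X']), len(mdp['A']), len(mdp['X'])))


# Model-based learning loop: act, observe the transition, update the model and then Q.
# The running-average update below is assumed to match the algorithm discussed in class.
errors = []      # norm of Q* - Q, recorded every 500 iterations
reset = False    # True once the goal has been reached, so that one further step is taken before resetting

for i in range(0, 100000):
    action = eGreedyHeuristic(mb_mdp, Q, x)
    
    # observe the cost and the destination state based on the underlying mdp.
    cost = mdp['C'][x][action]
    dest_state = int(np.argmax(mdp["Pa's"][action][x]))
            
    # update our matrix that stores the info about the number of actions and destination for each state
    transition_states[x][action][dest_state] += 1
    
    # step-size 1 / (N(x, a) + 1), where N(x, a) counts the visits to (x, a), the current one included
    N_x_a = transition_states[x][action].sum()
    alpha = 1.0 / (N_x_a + 1)
    
    # move the estimated cost and transition probabilities of (x, a) toward the observed sample
    mb_C[x][action] += alpha * (cost - mb_C[x][action])
    observed = np.zeros(nX)
    observed[dest_state] = 1
    mb_mdp["Pa's"][action][x] += alpha * (observed - mb_mdp["Pa's"][action][x])
    
    # recompute Q(x, a) from the current model
    Q[x][action] = operatorH(mb_mdp, gamma, Q, x, action)
    
    # VI is the optimal Q-function computed in Activity 1
    if i % 500 == 0:
        errors.append(la.norm(VI - Q))
    
    # after reaching the goal, take one further step and only then restart from a random state
    if reset:
        x = random.randrange(0, nX)
        reset = False
    else:
        x = dest_state
        reset = (dest_state == X.index(goal))

plt.figure()
plt.plot(range(0, 100000, 500), errors)
plt.xlabel('Iteration')
plt.ylabel('||Q* - Q||')
plt.show()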

### 3. Temporal-difference learning

You will now run both Q-learning and SARSA, and compare their learning performance with that of the model-based method just studied.

---

#### Activity 4.        

Repeat Activity 3 but using the _Q_-learning algorithm with a learning rate $\alpha=0.3$.

---
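For reference, with costs instead of rewards the _Q_-learning update for an observed transition $(x_t,a_t,c_t,x_{t+1})$ takes the standard form

$$Q(x_t,a_t)\leftarrow Q(x_t,a_t)+\alpha\Big(c_t+\gamma\min_{a'\in\mathcal{A}}Q(x_{t+1},a')-Q(x_t,a_t)\Big),$$

with $\alpha=0.3$ in this activity. The minimum over $a'$ makes the update off-policy: the target assumes the cheapest action will be taken in $x_{t+1}$, regardless of which action the $\epsilon$-greedy policy actually selects there.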

In [14]:

ALPHA = 0.3

# the discount factor is the gamma defined together with the MDP (0.99), as in Activity 3

# start from an all-zeros Q-function, so that the result does not depend on the Q left over from Activity 3
Q = np.zeros((nX, nA))


def q_update(state, action, cost, nextState):
    # Q-learning target: immediate cost plus the discounted value of the cheapest next action
    mini = min(Q[nextState, :])
    Q[state][action] = Q[state][action] + ALPHA * (cost + gamma * mini - Q[state][action])


def q_learning(InitialState, iterations):
    state_t = InitialState
    
    for i in range(0, iterations):
        # select an action with the epsilon-greedy policy and observe the cost and the next state
        action = eGreedyHeuristic(mdp, Q, state_t, e=0.1)
        cost = mdp['C'][state_t][action]
        state_t_1 = int(np.where(mdp["Pa's"][action][state_t] == 1)[0][0])
        
        q_update(state_t, action, cost, state_t_1)
        state_t = state_t_1


x = X.index(init)
q_learning(x, 100000)
print (Q)


[[ 87.92471251  87.94130667  87.93643233  87.91584576]
 [ 87.93148956  87.94963117  87.94346075  87.98588233]
 [ 88.04695091  88.02215521  88.02435066  88.06897395]
 [ 88.12276367  88.11647588  88.12015096  88.1304842 ]
 [ 88.22086235  88.22664344  88.21228097  88.21401785]
 [ 88.29738919  88.27745439  88.2763003   88.26776923]
 [ 88.25274062  88.22145121  88.26284855  88.17534607]
 [ 88.17471797  88.17531838  88.29286516  88.05590519]
 [ 88.05590519  88.05590513  88.17534613  87.93525777]
 [ 87.93525777  87.81339168  88.05590519  87.93525777]
 [ 87.92472923  87.90670599  87.90656132  87.90573414]
 [ 87.90881659  87.91715049  87.90729498  87.9237606 ]
 [ 87.95765685  87.92714314  87.94494367  87.96919124]
 [ 88.06108402  87.98346004  87.98177164  88.0736497 ]
 [ 88.10128874  88.03623673  88.03333513  88.10177195]
 [ 88.09154817  87.98574703  88.01205999  88.04956468]
 [ 88.26311275  88.26256005  88.3067137   88.17534614]
 [ 88.06489697  88.07969667  88.14783234  88.05585118]
 [ 88.0559

---

#### Activity 5.

Repeat Activity 4 but using the SARSA algorithm.

---
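The only change with respect to _Q_-learning is the target of the update: SARSA bootstraps on the action that the $\epsilon$-greedy policy actually selects in the next state, rather than on the cheapest one. The sketch below contrasts the two targets on a single made-up transition. The function names and the numbers in the toy _Q_-table are assumptions chosen for illustration; the step-size 0.3 and discount 0.99 are the values used in this lab.

```python
import numpy as np

ALPHA = 0.3
GAMMA = 0.99


def q_learning_step(Q, x, a, cost, x_next):
    """Off-policy target: bootstrap on the cheapest action in the next state."""
    target = cost + GAMMA * Q[x_next].min()
    Q[x, a] += ALPHA * (target - Q[x, a])


def sarsa_step(Q, x, a, cost, x_next, a_next):
    """On-policy target: bootstrap on the action actually selected in the next state."""
    target = cost + GAMMA * Q[x_next, a_next]
    Q[x, a] += ALPHA * (target - Q[x, a])


# Two states, two actions; the next state (index 1) has action values 2 and 5.
Q_ql = np.array([[0.0, 0.0], [2.0, 5.0]])
Q_sarsa = Q_ql.copy()

# Same transition for both methods; suppose exploration picked the costlier next action (index 1).
q_learning_step(Q_ql, x=0, a=0, cost=1.0, x_next=1)
sarsa_step(Q_sarsa, x=0, a=0, cost=1.0, x_next=1, a_next=1)

print(round(float(Q_ql[0, 0]), 3))     # 0.3 * (1 + 0.99 * 2)
print(round(float(Q_sarsa[0, 0]), 3))  # 0.3 * (1 + 0.99 * 5)
```

Because SARSA uses `a_next` in its target, the same `a_next` must also be the action executed in the following step; sampling a fresh action after the update would break the on-policy property.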

In [15]:
sarsaMatrix = np.zeros((len(mdp["X"]), len(mdp["A"])))

def sarsa(InitialState, iterations):
    
    state_t = InitialState
    action_t = eGreedyHeuristic(mdp, sarsaMatrix, state_t, e=0.1)
    
    for i in range(0, iterations):
        # observe the cost and the next state for the current state-action pair
        cost = mdp['C'][state_t][action_t]
        state_t_1 = int(np.where(mdp["Pa's"][action_t][state_t] == 1)[0][0])
        
        # SARSA is on-policy: the next action is selected before the update
        # and is the action actually executed in the next step
        action_t_1 = eGreedyHeuristic(mdp, sarsaMatrix, state_t_1, e=0.1)
        
        sarsa_update(state_t, action_t, cost, state_t_1, action_t_1)
        
        state_t = state_t_1
        action_t = action_t_1

def sarsa_update(state, action, cost, nextState, nextAction):
    
    ALPHA = 0.3
    # SARSA target: immediate cost plus the discounted value of the next state-action pair
    sarsaMatrix[state][action] = sarsaMatrix[state][action] + ALPHA * (cost + gamma * sarsaMatrix[nextState][nextAction] - sarsaMatrix[state][action])


x = X.index(init)
sarsa(x,100000)
print (sarsaMatrix)

[[ 68.7709176   68.77722903  68.8169999   68.72944781]
 [ 68.81669826  68.87652204  68.85417691  68.91234691]
 [ 69.16146309  69.01738453  69.04328659  69.04985079]
 [ 69.26193739  69.34058413  69.28751867  69.26716758]
 [ 69.4420984   69.46702658  69.4520104   69.47634869]
 [ 69.71646868  69.73672724  69.69331511  69.7488142 ]
 [ 69.87505863  69.84819271  69.85644372  69.79527998]
 [ 69.68733806  69.73071817  69.73899695  69.66597432]
 [ 69.4606283   69.59660869  69.54909086  69.43770912]
 [ 69.20161781  69.18491214  69.30118576  69.24016725]
 [ 68.66023819  68.69008501  68.67253578  68.73599403]
 [ 68.74231892  68.74715033  68.73905818  68.72355187]
 [ 68.8201481   68.78878249  68.83055591  68.8610076 ]
 [ 69.13006212  68.97511463  69.01263459  69.05616407]
 [ 68.9007905   68.92959117  69.05543872  69.37475318]
 [ 69.19900125  68.961663    68.96930496  69.03179132]
 [ 69.38867696  69.25576373  69.18684242  69.32777043]
 [ 69.38983707  69.07663366  69.3342259   69.26909401]
 [ 69.2143

---

#### Activity 6.

Discuss the differences observed between the performance of the three methods.

---

_Add your discussion here._