






# SARSA and Q-learning
 
Sungchul Lee  




# How to run these slides yourself

**Set up the Python environment**

- Install RISE for an interactive presentation viewer

# Model vs Model-free



$$
\begin{array}{llllll}
\mbox{Model}&\quad\Rightarrow\quad&\mbox{Model-free}\\
\mbox{Based on $P_{ss'}^a$}&\quad\Rightarrow\quad&\mbox{Based on Samples}\\
V&\quad\Rightarrow\quad&Q\\
\mbox{Greedy}&\quad\Rightarrow\quad&\mbox{$\varepsilon$-Greedy}\\
\end{array}
$$

# Model

If we know $R_s^a$, $P_{ss'}^a$, and $V$, then at state $s$ the greedy next action is
$$
\mbox{argmax}_a Q(s,a) \quad =\quad \mbox{argmax}_a\left(R_s^a + \gamma \sum_{s'} P_{ss'}^a V(s')\right) 
$$
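
The one-step lookahead above can be sketched in NumPy as follows. This is a minimal illustration, not the notebook's own code: the function name `greedy_action_from_model` and the tiny 2-state, 2-action MDP are assumptions made up for the example, with `R`, `P`, and `V` laid out as arrays of shape `(n_states, n_actions)`, `(n_states, n_actions, n_states)`, and `(n_states,)`.

```python
import numpy as np

def greedy_action_from_model(R, P, V, s, gamma=0.99):
    """Return argmax_a ( R[s,a] + gamma * sum_s' P[s,a,s'] * V[s'] ) and the Q-values."""
    # P[s] has shape (n_actions, n_states); the product with V sums over s'.
    q_values = R[s] + gamma * (P[s] @ V)
    return int(np.argmax(q_values)), q_values

# Hypothetical 2-state, 2-action MDP, used only for illustration.
R = np.zeros((2, 2))
P = np.zeros((2, 2, 2))
P[0, 0] = [1.0, 0.0]  # action 0 in state 0: stay in state 0
P[0, 1] = [0.0, 1.0]  # action 1 in state 0: move to state 1
P[1, 0] = [1.0, 0.0]
P[1, 1] = [0.0, 1.0]
V = np.array([0.0, 1.0])  # state 1 is the valuable state

best_action, q_values = greedy_action_from_model(R, P, V, s=0)
print(best_action, q_values)
```

Note that the function needs `P` as an input, which is exactly what is unavailable in the model-free setting.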

# Model-free

- In practice, we typically do not know $P_{ss'}^a$.
So we cannot choose the next action from $V$, because the one-step lookahead requires the model.
That is why we learn $Q$, not $V$.

- If we update the policy greedily, we may never visit good regions of the state space.
We update the policy $\varepsilon$-greedily instead.
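
As a rough sketch of what an $\varepsilon$-greedy policy looks like in code, the snippet below builds action probabilities from one row of $Q$ and samples an action. It uses one common convention (spread $\varepsilon$ uniformly over all actions and give the remaining $1-\varepsilon$ to the greedy action), which may differ in normalization from other implementations. The helper name `epsilon_greedy_probs` and the Q-values are hypothetical.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities: epsilon spread uniformly, the rest on the greedy action."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.7, 0.3, 0.2])  # hypothetical Q(s, .) for one state
probs = epsilon_greedy_probs(q_row, epsilon=0.1)
action = rng.choice(len(q_row), p=probs)  # usually the greedy action, sometimes another
print(probs, action)
```

Every action keeps a nonzero probability, so no state-action pair is permanently excluded from exploration.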

# On and Off-Policy Learning



### On-policy learning

- “Learn on the job”
- Learn about policy $\pi$ from experience sampled from $\pi$


### Off-policy learning

- “Look over someone’s shoulder”
- Learn about policy $\pi$ from experience sampled from $\mu$

|Sample $V$|Sample $Q$|Sample $Q$ (off-policy)|
|---|---|---|
|MC|MC| |
|TD|SARSA|Q-learning|
|TD($\lambda$)|SARSA($\lambda$)| |

# SARSA 

With $a_{t+1}$ from the data, that is, the action the agent actually takes at $s_{t+1}$:
$$
Q(s_t,a_t)\quad\leftarrow\quad
Q(s_t,a_t)+\alpha(\color{red}{r_{t+1}+\gamma Q(s_{t+1},a_{t+1})}-Q(s_t,a_t))
$$
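
A single SARSA step can be written as a small function. This is a minimal sketch under assumed names (`sarsa_update` and the toy `Q` table are illustrative only); it applies the update rule above to one transition $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA step: the target uses the action a_next actually taken at s_next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Hypothetical 3-state, 2-action table, for illustration only.
Q = np.zeros((3, 2))
Q[1] = [0.5, 1.0]

# The agent took action 0 at s_next = 1, so the target uses Q[1, 0] = 0.5.
Q = sarsa_update(Q, s=0, a=1, r=-0.02, s_next=1, a_next=0, alpha=0.1, gamma=0.99)
print(Q[0, 1])
```

Here the target is $-0.02 + 0.99 \times 0.5 = 0.475$, so $Q(0,1)$ moves from $0$ to $0.1 \times 0.475 = 0.0475$.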



In [2]:
# SARSA

import numpy as np

epoch = 30000
gamma = 0.99
alpha = 0.01
epsilon = 0.01

states = [0,1,2,3,4,5,6,7,8,9,10]
actions = [0,1,2,3] # left, right, up, down
N_STATES = len(states)
N_ACTIONS = len(actions)

policy = 0.25*np.ones((N_STATES, N_ACTIONS))
Q = np.zeros((N_STATES, N_ACTIONS))
# Q = 0.01*np.random.random((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  # transition probability

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 
#print(P)

if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  # rewards
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  # rewards

def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_policy_now_minus_random_coin = cum_policy_now - random_coin 
    return [ n for n,i in enumerate(cum_policy_now_minus_random_coin) if i>0 ][0]

def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_prob_minus_random_coin = cum_prob - random_coin 
    return [ n for n,i in enumerate(cum_prob_minus_random_coin) if i>0 ][0]
    
for t in range(epoch):
    
    done = False
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    a = sample_action(policy_given_state=policy[s,:])
    while not done:
        s1 = sample_transition(transition_prob_given_state_and_action=P[s,a,:])
        
        # epsilon-greedy policy at s1: add epsilon to every action, then renormalize
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1,:])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1+N_ACTIONS*epsilon)
        
        # choose the next action a1 using the epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now) 
        # SARSA update: the target uses the action a1 that will actually be taken
        Q[s,a] = Q[s,a] + alpha * (R[s,a]+gamma*Q[s1,a1] - Q[s,a])

        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1
    
print(Q)

[[ 0.64950299  0.69017939  0.64939369  0.63193186]
 [ 0.65124177  0.72248816  0.66564699  0.66779558]
 [ 0.66797505  0.76349194  0.68449087  0.58899867]
 [ 1.          1.          1.          1.        ]
 [ 0.63078184  0.6282633   0.66567924  0.60356169]
 [ 0.69930416 -0.64837452  0.74738361  0.55217622]
 [-1.         -1.         -1.         -1.        ]
 [ 0.60415412  0.58117413  0.64038186  0.60503118]
 [ 0.6186548   0.56715898  0.5799376   0.58076371]
 [ 0.58860248  0.40910139  0.53398223  0.55080234]
 [ 0.55843397  0.5218148  -0.86600922  0.53349094]]


# Q-learning 

With $a'$ sampled from the policy of interest, not taken from the data:
$$
Q(s_t,a_t)\quad\leftarrow\quad
Q(s_t,a_t)+\alpha(\color{red}{r_{t+1}+\gamma Q(s_{t+1},a')}-Q(s_t,a_t))
$$

If the policy of interest is greedy,
$$
Q(s_t,a_t)\quad\leftarrow\quad
Q(s_t,a_t)+\alpha(\color{red}{r_{t+1}+\gamma \max_{a'}Q(s_{t+1},a')}-Q(s_t,a_t))
$$
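
The greedy-target update can be sketched as a single function. The names here (`q_learning_update`, the toy `Q` table) are assumptions for illustration; the point is that the action the behavior policy takes at $s_{t+1}$ does not appear in the update at all.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning step: the target uses max over actions at s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Hypothetical 3-state, 2-action table, for illustration only.
Q = np.zeros((3, 2))
Q[1] = [0.5, 1.0]

# Whatever action is taken at s_next = 1, the target uses max(Q[1]) = 1.0.
Q = q_learning_update(Q, s=0, a=1, r=-0.02, s_next=1, alpha=0.1, gamma=0.99)
print(Q[0, 1])
```

Here the target is $-0.02 + 0.99 \times 1.0 = 0.97$, so $Q(0,1)$ moves from $0$ to $0.1 \times 0.97 = 0.097$. Because the update needs only $(s, a, r, s')$, stored transitions can be reused later, which is what makes experience replay possible.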

In [4]:
# Q-learning

import numpy as np
from collections import deque
import random

replay_meomory = deque(maxlen=100)
epoch_sarsa = 1000
epoch_q_learning = 20000
gamma = 0.99
alpha = 0.01
epsilon = 0.01

states = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
actions = [0, 1, 2, 3]  # left, right, up, down
N_STATES = len(states)
N_ACTIONS = len(actions)

policy = 0.25 * np.ones((N_STATES, N_ACTIONS))
Q = np.zeros((N_STATES, N_ACTIONS))
# Q = 0.01*np.random.random((N_STATES, N_ACTIONS))
Q[3, :] = 1
Q[6, :] = -1

P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  # transition probability

P[0, 0, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 1, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 2, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 3, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

P[1, 0, :] = [0.9, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0]
P[1, 1, :] = [0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0, 0]
P[1, 2, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[1, 3, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

P[2, 0, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 1, :] = [0, 0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0]
P[2, 2, :] = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 3, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]

P[3, 0, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 1, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 2, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 3, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

P[4, 0, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 1, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 2, :] = [0.9, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[4, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0, 0]

P[5, 0, :] = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
P[5, 1, :] = [0, 0, 0, 0.1, 0, 0, 0.8, 0, 0, 0, 0.1]
P[5, 2, :] = [0, 0.1, 0.8, 0.1, 0, 0, 0, 0, 0, 0, 0]
P[5, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.8, 0.1]

P[6, 0, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 1, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 2, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 3, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

P[7, 0, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
P[7, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[7, 2, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[7, 3, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

P[8, 0, :] = [0, 0, 0, 0, 0.1, 0, 0, 0.9, 0, 0, 0]
P[8, 1, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[8, 2, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[8, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

P[9, 0, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[9, 1, :] = [0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9]
P[9, 2, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]
P[9, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

P[10, 0, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[10, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
P[10, 2, :] = [0, 0, 0, 0, 0, 0.1, 0.9, 0, 0, 0, 0]
P[10, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# print(P)

if True:  # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  # rewards
else:  # fuel-inefficient robot
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  # rewards


def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_policy_now_minus_random_coin = cum_policy_now - random_coin
    return [n for n, i in enumerate(cum_policy_now_minus_random_coin) if i > 0][0]


def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_prob_minus_random_coin = cum_prob - random_coin
    return [n for n, i in enumerate(cum_prob_minus_random_coin) if i > 0][0]


for t in range(epoch_sarsa):

    done = False
    s = np.random.choice([0, 1, 2, 4, 5, 7, 8, 9, 10])  # 3 and 6 removed
    a = sample_action(policy_given_state=policy[s, :])
    while not done:
        s1 = sample_transition(transition_prob_given_state_and_action=P[s, a, :])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        Q[s, a] = Q[s, a] + alpha * (R[s, a] + gamma * Q[s1, a1] - Q[s, a])

        replay_meomory.append([s,a,R[s,a],s1])

        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

for t in range(epoch_q_learning):

    done = False
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    a = sample_action(policy_given_state=policy[s,:]) # first action from the uniform initial policy
    while not done:
        s1 = sample_transition(transition_prob_given_state_and_action=P[s,a,:])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose the next action using the epsilon-greedy (behavior) policy
        a1 = sample_action(policy_given_state=policy_now)
        replay_meomory.append([s, a, R[s, a], s1])

        # experience replay: Q-learning updates on 7 transitions sampled from memory
        sample = random.sample(replay_meomory, 7)
        for s_b, a_b, r_b, s1_b in sample:
            # greedy (off-policy) target: max over actions at the sampled next state
            Q[s_b, a_b] = Q[s_b, a_b] + \
                alpha * (r_b + gamma * max(Q[s1_b, :]) - Q[s_b, a_b])

        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

[[ 0.05498236  0.54533103  0.07056754  0.03224705]
 [ 0.07911819  0.70386938  0.14433013  0.13376335]
 [ 0.1112421   0.79432344  0.2396675   0.08519836]
 [ 1.          1.          1.          1.        ]
 [ 0.02083935  0.01809598  0.3324215  -0.00364398]
 [ 0.15757644 -0.20682696  0.73833763  0.02644631]
 [-1.         -1.         -1.         -1.        ]
 [-0.00855757  0.01510166  0.08693695 -0.00556295]
 [-0.00731839  0.21470185 -0.00184952  0.00212867]
 [ 0.0079362  -0.01800231  0.42161678  0.05253249]
 [ 0.19009332  0.01269543 -0.23225642  0.00846619]]
[[ 0.4585549   0.76560936  0.45227328  0.42416659]
 [ 0.44748585  0.78950609  0.49530373  0.51888518]
 [ 0.49591137  0.75993222  0.62025129  0.34950453]
 [ 1.          1.          1.          1.        ]
 [ 0.39306242  0.39920305  0.732524    0.38897126]
 [ 0.54592265 -0.70593815  0.81264371  0.33907882]
 [-1.         -1.         -1.         -1.        ]
 [ 0.35676667  0.3080031   0.67916175  0.34497249]
 [ 0.55154856  0.37293773  0.2