






# SARSA and Q-learning
 
Sungchul Lee  




# References

- Reinforcement Learning: 4 Model-Free Prediction [David Silver](https://www.youtube.com/watch?v=PnHCvfgC_ZA&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT&index=4) [slide](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MC-TD.pdf)

- Reinforcement Learning: 5 Model-Free Control [David Silver](https://www.youtube.com/watch?v=0g4j2k_Ggc4&index=5&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT) [slide](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/control.pdf)

- Tutorial: Deep Reinforcement Learning, ICML 2016 [David Silver](http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf)

- Machine Learning, part III: The Q-learning algorithm [JAKE BENNETT](https://articles.wearepop.com/secret-formula-for-self-learning-computers)

- DQN [Lee Young Moo](http://www.phrgcm.com/blog/2016/08/17/deep-q-network/)

- [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/pdf/1602.01783v2.pdf)



# How to run these slides yourself

**Set up the Python environment**

- Install RISE for an interactive presentation viewer

# Model vs Model-free



$$
\begin{array}{llllll}
\mbox{Model}&\quad\Rightarrow\quad&\mbox{Model-free}\\
\mbox{Based on $P_{ss'}^a$}&\quad\Rightarrow\quad&\mbox{Based on Samples}\\
V&\quad\Rightarrow\quad&Q\\
\mbox{Greedy}&\quad\Rightarrow\quad&\mbox{$\varepsilon$-Greedy}\\
\end{array}
$$

# Model

If we know $R_s^a$, $P_{ss'}^a$, and $V$, and if we are at state $s$, our next action is
$$
\mbox{argmax}_a Q(s,a) \quad =\quad \mbox{argmax}_a\left(R_s^a + \gamma \sum_{s'} P_{ss'}^a V(s')\right) 
$$
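The one-step lookahead above can be sketched in NumPy. This is a minimal illustration, not part of the original slides: the 3-state, 2-action MDP, its numbers, and the helper name `greedy_action_from_model` are all made up for the example.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (all numbers are illustrative assumptions).
gamma = 0.9
R = np.array([[0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])          # R[s, a]
P = np.zeros((3, 2, 3))             # P[s, a, s']
P[0, 0] = [0.0, 1.0, 0.0]           # action 0 in state 0 leads to state 1
P[0, 1] = [0.0, 0.0, 1.0]           # action 1 in state 0 leads to state 2
P[1, :] = [0.0, 1.0, 0.0]           # state 1 is absorbing
P[2, :] = [0.0, 0.0, 1.0]           # state 2 is absorbing
V = np.array([0.0, 10.0, 0.0])      # state values, assumed known

def greedy_action_from_model(s, R, P, V, gamma):
    # Q(s, a) = R_s^a + gamma * sum_{s'} P_{ss'}^a V(s')
    q = R[s] + gamma * P[s] @ V
    return int(np.argmax(q)), q

a, q = greedy_action_from_model(0, R, P, V, gamma)
print(a, q)
```

Action 1 pays an immediate reward of 1, but action 0 leads to the high-value state, so the lookahead prefers action 0. Note that the computation needs both `R` and `P`.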

# Model-free

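- In practice, we typically don't know $P_{ss'}^a$.
So we cannot compute the one-step lookahead over $V$ that is needed to choose the next action.
That is why we learn $Q$, not $V$: acting greedily with respect to $Q(s,a)$ requires no model.

- If we always update the policy greedily, we may never visit good regions of the state space.
We update the policy $\varepsilon$-greedily instead, so that every action keeps a nonzero probability of being tried.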
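As a rough sketch of $\varepsilon$-greedy action selection, the snippet below puts probability 1 on the greedy action, adds $\varepsilon$ to every action, and renormalizes. This is one of several common parametrizations; the helper name `epsilon_greedy_probs` and the example Q-values are assumptions made for illustration.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    # One-hot on the greedy action, plus epsilon on every action, then renormalize.
    n_actions = len(q_row)
    probs = np.zeros(n_actions)
    probs[np.argmax(q_row)] = 1.0
    return (probs + epsilon) / (1.0 + n_actions * epsilon)

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.5, 0.2, 0.0])     # illustrative Q(s, .) for one state
probs = epsilon_greedy_probs(q_row, epsilon=0.01)
action = rng.choice(len(q_row), p=probs)   # mostly the greedy action, sometimes another
print(probs, action)
```

With four actions and $\varepsilon = 0.01$, the greedy action gets probability $1.01/1.04$ and each other action gets $0.01/1.04$.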

# On and Off-Policy Learning



### On-policy learning

- “Learn on the job”
- Learn about policy $\pi$ from experience sampled from $\pi$


### Off-policy learning

- “Look over someone’s shoulder”
- Learn about policy $\pi$ from experience sampled from $\mu$

|Sample $V$|Sample $Q$|Sample $Q$ (off-policy)|
|---|---|---|
|MC|MC||
|TD|SARSA|Q-learning|
|TD($\lambda$)|SARSA($\lambda$)||

# SARSA 





With $a_{t+1}$ taken from the data, i.e. the action the agent actually chose at $s_{t+1}$:
$$
Q(s_t,a_t)\quad\leftarrow\quad
Q(s_t,a_t)+\alpha(\color{red}{r_{t+1}+\gamma Q(s_{t+1},a_{t+1})}-Q(s_t,a_t))
$$
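A single SARSA update can be worked through numerically. The sketch below uses a made-up 2-state, 2-action table; the function name `sarsa_update` and all numbers are illustrative assumptions.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # TD target uses the action a_next that was actually taken at s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]   # illustrative values for the next state
sarsa_update(Q, s=0, a=0, r=-0.02, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
print(Q[0, 0])      # 0 + 0.5 * (-0.02 + 0.9 * 1.0 - 0) = 0.44
```

The target bootstraps from $Q(s_{t+1}, a_{t+1}) = 1.0$, the value of the sampled next action, even though another action at that state has a higher value.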




<img src="img/RZBt6.png"/>

https://i.stack.imgur.com/RZBt6.png

In [None]:
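# import libraries
import numpy as np

# set parameters
epoch = 30000
gamma = 0.99
alpha = 0.01
epsilon = 0.01

# state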
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

In [None]:
# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  

In [None]:
# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

In [None]:
# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_policy_now_minus_random_coin = cum_policy_now - random_coin 
    return [ n for n,i in enumerate(cum_policy_now_minus_random_coin) if i>0 ][0]

In [None]:
# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_prob_minus_random_coin = cum_prob - random_coin 
    return [ n for n,i in enumerate(cum_prob_minus_random_coin) if i>0 ][0]
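Both sampling functions draw an index by inverse-CDF sampling: they return the first index whose cumulative probability exceeds a uniform random number. A compact way to express and sanity-check the same idea is sketched below. The name `sample_index`, the use of `np.searchsorted`, and the final clamp are choices made for this illustration, not part of the original code.

```python
import numpy as np

def sample_index(probs, rng):
    # First index i with cumsum(probs)[i] > u, for u ~ Uniform[0, 1).
    cum = np.cumsum(probs)
    u = rng.random()
    idx = int(np.searchsorted(cum, u, side="right"))
    # Clamp in case floating-point rounding leaves cum[-1] slightly below 1.
    return min(idx, len(probs) - 1)

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.8, 0.1])
draws = np.array([sample_index(probs, rng) for _ in range(20000)])
freq = np.bincount(draws, minlength=len(probs)) / len(draws)
print(freq)   # empirical frequencies should be close to probs
```

Entries with zero probability are never returned, because the cumulative sum does not increase at those indices.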

In [None]:
# SARSA
for t in range(epoch):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])
    
    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(
            transition_prob_given_state_and_action=P[s,a,:])
        
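        # epsilon-greedy policy at the next state s1:
        # put probability 1 on the greedy action, add epsilon to every action,
        # then renormalize so that the probabilities sum to 1
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1,:])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + N_ACTIONS*epsilon)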
        
        # choose action using epsilon-greedy policy 
        a1 = sample_action(policy_given_state=policy_now) 
        
        # SARSA
        Q[s,a] = Q[s,a] + alpha * (R[s,a]+gamma*Q[s1,a1] - Q[s,a])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1
    
print(Q)

<div align="center"><img src="img/SARSA result.png" width="60%" height="20%"></div>

In [2]:
# SARSA

# import libraries
import numpy as np

# set parameters ###############################################################
epoch = 30000
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  

# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_policy_now_minus_random_coin = cum_policy_now - random_coin 
    return [ n for n,i in enumerate(cum_policy_now_minus_random_coin) if i>0 ][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_prob_minus_random_coin = cum_prob - random_coin 
    return [ n for n,i in enumerate(cum_prob_minus_random_coin) if i>0 ][0]

# SARSA
for t in range(epoch):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])
    
    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(
            transition_prob_given_state_and_action=P[s,a,:])
        
        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1,:])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1+4*epsilon)
        
        # choose action using epsilon-greedy policy 
        a1 = sample_action(policy_given_state=policy_now) 
        
        # SARSA
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1
    
print(Q)

[[ 0.64950299  0.69017939  0.64939369  0.63193186]
 [ 0.65124177  0.72248816  0.66564699  0.66779558]
 [ 0.66797505  0.76349194  0.68449087  0.58899867]
 [ 1.          1.          1.          1.        ]
 [ 0.63078184  0.6282633   0.66567924  0.60356169]
 [ 0.69930416 -0.64837452  0.74738361  0.55217622]
 [-1.         -1.         -1.         -1.        ]
 [ 0.60415412  0.58117413  0.64038186  0.60503118]
 [ 0.6186548   0.56715898  0.5799376   0.58076371]
 [ 0.58860248  0.40910139  0.53398223  0.55080234]
 [ 0.55843397  0.5218148  -0.86600922  0.53349094]]


# Q-learning 



With $a'$ sampled from the policy of interest (the target policy), not taken from the data or from the data-generating (behavior) policy:
$$
Q(s_t,a_t)\quad\leftarrow\quad
Q(s_t,a_t)+\alpha(\color{red}{r_{t+1}+\gamma Q(s_{t+1},a')}-Q(s_t,a_t))
$$



If the policy of interest is greedy,
$$
Q(s_t,a_t)\quad\leftarrow\quad
Q(s_t,a_t)+\alpha(\color{red}{r_{t+1}+\gamma \max_{a'}Q(s_{t+1},a')}-Q(s_t,a_t))
$$
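A single Q-learning update with a greedy target policy can be worked through numerically. The sketch below uses a made-up 2-state, 2-action table; the function name `q_learning_update` and all numbers are illustrative assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # TD target uses the best action at s_next, whatever action is taken next.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]   # illustrative values for the next state
q_learning_update(Q, s=0, a=0, r=-0.02, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0, 0])      # 0 + 0.5 * (-0.02 + 0.9 * 3.0 - 0) = 1.34
```

The update does not take the next action as an argument at all: the target bootstraps from $\max_{a'} Q(s_{t+1}, a') = 3.0$ regardless of what the behavior policy does at $s_{t+1}$.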


<img src="img/Images_Algorithm_pt2_3.gif"/>

https://articles.wearepop.com/secret-formula-for-self-learning-computers



<img src="img/JvJqR.png"/>

https://i.stack.imgur.com/JvJqR.png

In [None]:
# import libraries
import numpy as np

# set parameters ###############################################################
epoch = 40000
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  

# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_policy_now_minus_random_coin = cum_policy_now - random_coin 
    return [ n for n,i in enumerate(cum_policy_now_minus_random_coin) if i>0 ][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_prob_minus_random_coin = cum_prob - random_coin 
    return [ n for n,i in enumerate(cum_prob_minus_random_coin) if i>0 ][0]

In [None]:
# Q-learning
for t in range(epoch):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])
    
    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(
            transition_prob_given_state_and_action=P[s,a,:])
        
        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1,:])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1+4*epsilon)
        
        # choose action using epsilon-greedy policy 
        a1 = sample_action(policy_given_state=policy_now) 
        
        # SARSA
        # Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # Q-learning: bootstrap from the best action at s1, regardless of a1
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * np.max(Q[s1,:]) - Q[s,a])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1
    
print(Q)

<div align="center"><img src="img/Q-learning result.png" width="60%" height="20%"></div>

In [2]:
# Q-learning

# import libraries
import numpy as np

# set parameters ###############################################################
epoch = 40000
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  

# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0,0,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,1,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[0,2,:] = [1,0,0,0,0,0,0,0,0,0,0]
P[0,3,:] = [0,0,0,0,1,0,0,0,0,0,0]  

P[1,0,:] = [0.9,0,0,0,0.1,0,0,0,0,0,0]
P[1,1,:] = [0,0,0.9,0,0,0.1,0,0,0,0,0]
P[1,2,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[1,3,:] = [0,1,0,0,0,0,0,0,0,0,0] 

P[2,0,:] = [0,1,0,0,0,0,0,0,0,0,0]
P[2,1,:] = [0,0,0,0.9,0,0,0.1,0,0,0,0]
P[2,2,:] = [0,0,1,0,0,0,0,0,0,0,0]
P[2,3,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0] 

P[3,0,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,1,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,2,:] = [0,0,0,1,0,0,0,0,0,0,0]
P[3,3,:] = [0,0,0,1,0,0,0,0,0,0,0] 

P[4,0,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,1,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[4,2,:] = [0.9,0.1,0,0,0,0,0,0,0,0,0]
P[4,3,:] = [0,0,0,0,0,0,0,0.9,0.1,0,0] 

P[5,0,:] = [0,0,0,0,0,1,0,0,0,0,0]
P[5,1,:] = [0,0,0,0.1,0,0,0.8,0,0,0,0.1]
P[5,2,:] = [0,0.1,0.8,0.1,0,0,0,0,0,0,0]
P[5,3,:] = [0,0,0,0,0,0,0,0,0.1,0.8,0.1] 

P[6,0,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,1,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,2,:] = [0,0,0,0,0,0,1,0,0,0,0]
P[6,3,:] = [0,0,0,0,0,0,1,0,0,0,0]

P[7,0,:] = [0,0,0,0,0,0,0,1,0,0,0]
P[7,1,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[7,2,:] = [0,0,0,0,1,0,0,0,0,0,0]
P[7,3,:] = [0,0,0,0,0,0,0,1,0,0,0] 

P[8,0,:] = [0,0,0,0,0.1,0,0,0.9,0,0,0]
P[8,1,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[8,2,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[8,3,:] = [0,0,0,0,0,0,0,0,1,0,0] 

P[9,0,:] = [0,0,0,0,0,0,0,0,1,0,0]
P[9,1,:] = [0,0,0,0,0,0,0.1,0,0,0,0.9]
P[9,2,:] = [0,0,0,0,0,0.9,0.1,0,0,0,0]
P[9,3,:] = [0,0,0,0,0,0,0,0,0,1,0] 

P[10,0,:] = [0,0,0,0,0,0.1,0,0,0,0.9,0]
P[10,1,:] = [0,0,0,0,0,0,0,0,0,0,1]
P[10,2,:] = [0,0,0,0,0,0.1,0.9,0,0,0,0]
P[10,3,:] = [0,0,0,0,0,0,0,0,0,0,1] 

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_policy_now_minus_random_coin = cum_policy_now - random_coin 
    return [ n for n,i in enumerate(cum_policy_now_minus_random_coin) if i>0 ][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_prob_minus_random_coin = cum_prob - random_coin 
    return [ n for n,i in enumerate(cum_prob_minus_random_coin) if i>0 ][0]

# Q-learning
for t in range(epoch):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])
    
    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(
            transition_prob_given_state_and_action=P[s,a,:])
        
        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1,:])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1+4*epsilon)
        
        # choose action using epsilon-greedy policy 
        a1 = sample_action(policy_given_state=policy_now) 
        
        # SARSA
        # Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # Q-learning: bootstrap from the best action at s1, regardless of a1
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * np.max(Q[s1,:]) - Q[s,a])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1
    
print(Q)

[[ 0.63795748  0.66894433  0.63796903  0.63448998]
 [ 0.64241004  0.68625675  0.64607077  0.64667734]
 [ 0.64670274  0.70859544  0.64558886  0.47432525]
 [ 1.          1.          1.          1.        ]
 [ 0.62380167  0.62312734  0.63855185  0.61957505]
 [ 0.71590873 -0.67242921  0.72853767  0.58694581]
 [-1.         -1.         -1.         -1.        ]
 [ 0.60824595  0.60788458  0.60981955  0.60867764]
 [ 0.60009355  0.59591962  0.60004376  0.59996823]
 [ 0.59988497  0.36246705  0.5871322   0.58257855]
 [ 0.60573852  0.5762576  -0.85525937  0.57595163]]
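One way to read the two printed Q tables is to extract the greedy action in every non-terminal state and compare. The sketch below copies the SARSA and Q-learning outputs shown in these slides; the helper `greedy_policy` and the variable names are assumptions made for illustration, and a fresh run would print different numbers.

```python
import numpy as np

ACTION_NAMES = ["left", "right", "up", "down"]   # same order as `actions`
TERMINAL_STATES = {3, 6}

# Q table printed by the SARSA run
Q_sarsa = np.array([
    [0.64950299, 0.69017939, 0.64939369, 0.63193186],
    [0.65124177, 0.72248816, 0.66564699, 0.66779558],
    [0.66797505, 0.76349194, 0.68449087, 0.58899867],
    [1.0, 1.0, 1.0, 1.0],
    [0.63078184, 0.6282633, 0.66567924, 0.60356169],
    [0.69930416, -0.64837452, 0.74738361, 0.55217622],
    [-1.0, -1.0, -1.0, -1.0],
    [0.60415412, 0.58117413, 0.64038186, 0.60503118],
    [0.6186548, 0.56715898, 0.5799376, 0.58076371],
    [0.58860248, 0.40910139, 0.53398223, 0.55080234],
    [0.55843397, 0.5218148, -0.86600922, 0.53349094],
])

# Q table printed by the Q-learning run
Q_qlearning = np.array([
    [0.63795748, 0.66894433, 0.63796903, 0.63448998],
    [0.64241004, 0.68625675, 0.64607077, 0.64667734],
    [0.64670274, 0.70859544, 0.64558886, 0.47432525],
    [1.0, 1.0, 1.0, 1.0],
    [0.62380167, 0.62312734, 0.63855185, 0.61957505],
    [0.71590873, -0.67242921, 0.72853767, 0.58694581],
    [-1.0, -1.0, -1.0, -1.0],
    [0.60824595, 0.60788458, 0.60981955, 0.60867764],
    [0.60009355, 0.59591962, 0.60004376, 0.59996823],
    [0.59988497, 0.36246705, 0.5871322, 0.58257855],
    [0.60573852, 0.5762576, -0.85525937, 0.57595163],
])

def greedy_policy(Q):
    # Map each non-terminal state to the name of its highest-valued action.
    return {s: ACTION_NAMES[int(np.argmax(Q[s]))]
            for s in range(Q.shape[0]) if s not in TERMINAL_STATES}

pi_sarsa = greedy_policy(Q_sarsa)
pi_q = greedy_policy(Q_qlearning)
for s in pi_sarsa:
    print(s, pi_sarsa[s], pi_q[s])
```

For these two particular runs the greedy policies coincide in every non-terminal state, although the Q-values themselves differ.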


# Q-learning using experience replay

<img src="img/output_ahug9u_by_elphin_zephyr-daxvvvu.gif"/>

https://orig00.deviantart.net/1b54/f/2017/035/9/b/output_ahug9u_by_elphin_zephyr-daxvvvu.gif

In [None]:
# import libraries
import numpy as np
from collections import deque
import random

In [None]:
# set parameters ###############################################################
epoch_sarsa = 1000
epoch_q_learning = 20000
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

In [None]:
# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  
        
# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES)) 

P[0, 0, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 1, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 2, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 3, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

P[1, 0, :] = [0.9, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0]
P[1, 1, :] = [0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0, 0]
P[1, 2, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[1, 3, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

P[2, 0, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 1, :] = [0, 0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0]
P[2, 2, :] = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 3, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]

P[3, 0, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 1, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 2, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 3, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

P[4, 0, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 1, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 2, :] = [0.9, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[4, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0, 0]

P[5, 0, :] = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
P[5, 1, :] = [0, 0, 0, 0.1, 0, 0, 0.8, 0, 0, 0, 0.1]
P[5, 2, :] = [0, 0.1, 0.8, 0.1, 0, 0, 0, 0, 0, 0, 0]
P[5, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.8, 0.1]

P[6, 0, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 1, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 2, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 3, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

P[7, 0, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
P[7, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[7, 2, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[7, 3, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

P[8, 0, :] = [0, 0, 0, 0, 0.1, 0, 0, 0.9, 0, 0, 0]
P[8, 1, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[8, 2, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[8, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

P[9, 0, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[9, 1, :] = [0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9]
P[9, 2, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]
P[9, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

P[10, 0, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[10, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
P[10, 2, :] = [0, 0, 0, 0, 0, 0.1, 0.9, 0, 0, 0, 0]
P[10, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_policy_now_minus_random_coin = cum_policy_now - random_coin
    return [n for n, i in enumerate(cum_policy_now_minus_random_coin) if i > 0][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_prob_minus_random_coin = cum_prob - random_coin
    return [n for n, i in enumerate(cum_prob_minus_random_coin) if i > 0][0]



<div align="center"><img src="img/WW1-Great-War-Cartoons-Punch-Magazine-Raven-Hill-1917-12-19-421.jpg" width="60%" height="20%"></div>


https://ssl.c.photoshelter.com/img-get/I0000x4Qkv5Ut3mo/s/900/720/WW1-Great-War-Cartoons-Punch-Magazine-Raven-Hill-1917-12-19-421.jpg

In [None]:
# create the replay memory: a deque that keeps only the 100 most recent experiences
replay_meomory = deque(maxlen=100)
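The behavior of a bounded `deque` as a replay memory can be seen in a small toy example. The size of 3 and the dummy experience tuples below are assumptions chosen to keep the output short.

```python
from collections import deque
import random

memory = deque(maxlen=3)            # toy capacity; the slides use 100
for step in range(5):
    memory.append((step, step + 1)) # dummy "experience" tuples

print(list(memory))                 # [(2, 3), (3, 4), (4, 5)]: the two oldest were dropped

random.seed(0)
batch = random.sample(list(memory), 2)   # draw 2 distinct stored experiences
print(batch)
```

Once the deque is full, each `append` silently discards the oldest entry, so the memory always holds the most recent experiences. Converting to a list before sampling is optional here and is done only to make the sampled population explicit.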

In [None]:
# warm-up: run SARSA for epoch_sarsa episodes to fill the replay memory
for t in range(epoch_sarsa):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])

    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s, a, :])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # SARSA
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # append current experience at the end of the deque and update the deque
        replay_meomory.append([s,a,R[s,a],s1])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)



In [None]:
# Q-learning using experience replay
for t in range(epoch_q_learning):

    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:]) 
    
    while not done:
        # each step does two things:
        #   explore - move according to the current epsilon-greedy policy
        #   exploit - update Q with Q-learning on experiences sampled from the replay memory
        
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s,a,:])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # append current experience at the end of the deque and update the deque
        replay_meomory.append([s, a, R[s, a], s1])

        # Q-learning using experience replay:
        # draw 7 stored experiences (s, a, r, s') at random and apply the Q-learning update to each
        for s_r, a_r, r_r, s1_r in random.sample(replay_meomory, 7):
            Q[s_r, a_r] += alpha * (r_r + gamma * np.max(Q[s1_r, :]) - Q[s_r, a_r])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)
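The replay step can also be factored into a small helper. The sketch below is an illustration under assumptions: the name `replay_update` and the toy Q table are made up, and the size guard is an addition, needed only if the memory might hold fewer experiences than the batch size (`random.sample` raises `ValueError` in that case).

```python
import random
from collections import deque
import numpy as np

def replay_update(Q, memory, batch_size, alpha, gamma):
    # Sample at most batch_size stored experiences and apply Q-learning to each.
    k = min(batch_size, len(memory))
    for s, a, r, s_next in random.sample(list(memory), k):
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
    return k

# Toy setting: 3 states, 2 actions, state 2 has value 1 for every action.
Q = np.zeros((3, 2))
Q[2] = 1.0
memory = deque([(1, 0, -0.02, 2)], maxlen=100)   # a single stored experience
n_used = replay_update(Q, memory, batch_size=7, alpha=0.5, gamma=0.9)
print(n_used, Q[1, 0])   # 1 experience used; Q[1, 0] = 0.5 * (-0.02 + 0.9) = 0.44
```

In the slides the memory is already full when the Q-learning loop starts, because the SARSA warm-up runs first, so sampling 7 experiences there is safe without the guard.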

<div align="center"><img src="img/Q-learning using experience replay.png" width="60%" height="20%"></div>

In [4]:
# Q-learning using experience replay

# import libraries
import numpy as np
from collections import deque
import random

# set parameters ###############################################################
epoch_sarsa = 1000
epoch_q_learning = 20000
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  
        
# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0, 0, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 1, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 2, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 3, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

P[1, 0, :] = [0.9, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0]
P[1, 1, :] = [0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0, 0]
P[1, 2, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[1, 3, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

P[2, 0, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 1, :] = [0, 0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0]
P[2, 2, :] = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 3, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]

P[3, 0, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 1, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 2, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 3, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

P[4, 0, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 1, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 2, :] = [0.9, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[4, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0, 0]

P[5, 0, :] = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
P[5, 1, :] = [0, 0, 0, 0.1, 0, 0, 0.8, 0, 0, 0, 0.1]
P[5, 2, :] = [0, 0.1, 0.8, 0.1, 0, 0, 0, 0, 0, 0, 0]
P[5, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.8, 0.1]

P[6, 0, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 1, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 2, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 3, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

P[7, 0, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
P[7, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[7, 2, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[7, 3, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

P[8, 0, :] = [0, 0, 0, 0, 0.1, 0, 0, 0.9, 0, 0, 0]
P[8, 1, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[8, 2, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[8, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

P[9, 0, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[9, 1, :] = [0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9]
P[9, 2, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]
P[9, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

P[10, 0, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[10, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
P[10, 2, :] = [0, 0, 0, 0, 0, 0.1, 0.9, 0, 0, 0, 0]
P[10, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_policy_now_minus_random_coin = cum_policy_now - random_coin
    return [n for n, i in enumerate(cum_policy_now_minus_random_coin) if i > 0][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_prob_minus_random_coin = cum_prob - random_coin
    return [n for n, i in enumerate(cum_prob_minus_random_coin) if i > 0][0]

# replay memory: a deque that keeps only the 100 most recent experiences
replay_meomory = deque(maxlen=100)

# SARSA warm-up: learn an initial Q and fill the replay memory
for t in range(epoch_sarsa):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])

    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s, a, :])

        # epsilon-greedy policy at s1: add epsilon to every action, then normalize
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + N_ACTIONS * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # SARSA update (on-policy: bootstraps from the action a1 actually chosen)
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # store the transition (s, a, r, s1); a full deque drops its oldest entry
        replay_meomory.append([s,a,R[s,a],s1])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

# Q-learning using experience replay
for t in range(epoch_q_learning):

    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:]) 
    
    while not done:
        # exploit - update Q-function using Q-learning with experience replay
        # and
        # explore - move according to updated epsilon-greedy policy
        
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s,a,:])

        # epsilon-greedy policy at s1: add epsilon to every action, then normalize
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + N_ACTIONS * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # store the transition (s, a, r, s1); a full deque drops its oldest entry
        replay_meomory.append([s, a, R[s, a], s1])

        # Q-learning using experience replay
        # draw 7 stored experiences uniformly at random (without replacement)
        sample = random.sample(replay_meomory, 7)
        for s_r, a_r, r_r, s1_r in sample:
            # Q-learning update (off-policy: bootstraps from the best action at s1_r)
            Q[s_r, a_r] += alpha * (r_r + gamma * np.max(Q[s1_r, :]) - Q[s_r, a_r])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

[[ 0.05498236  0.54533103  0.07056754  0.03224705]
 [ 0.07911819  0.70386938  0.14433013  0.13376335]
 [ 0.1112421   0.79432344  0.2396675   0.08519836]
 [ 1.          1.          1.          1.        ]
 [ 0.02083935  0.01809598  0.3324215  -0.00364398]
 [ 0.15757644 -0.20682696  0.73833763  0.02644631]
 [-1.         -1.         -1.         -1.        ]
 [-0.00855757  0.01510166  0.08693695 -0.00556295]
 [-0.00731839  0.21470185 -0.00184952  0.00212867]
 [ 0.0079362  -0.01800231  0.42161678  0.05253249]
 [ 0.19009332  0.01269543 -0.23225642  0.00846619]]
[[ 0.4585549   0.76560936  0.45227328  0.42416659]
 [ 0.44748585  0.78950609  0.49530373  0.51888518]
 [ 0.49591137  0.75993222  0.62025129  0.34950453]
 [ 1.          1.          1.          1.        ]
 [ 0.39306242  0.39920305  0.732524    0.38897126]
 [ 0.54592265 -0.70593815  0.81264371  0.33907882]
 [-1.         -1.         -1.         -1.        ]
 [ 0.35676667  0.3080031   0.67916175  0.34497249]
 [ 0.55154856  0.37293773  0.2
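
The epsilon-greedy step in the loops above adds `epsilon` to every entry of a one-hot vector and then normalizes, so the greedy action gets probability $(1+\varepsilon)/(1+n\varepsilon)$ and each other action gets $\varepsilon/(1+n\varepsilon)$, where $n$ is the number of actions. The sketch below isolates that step. The helper names `epsilon_greedy_probs` and `sample_index` are illustrative and not part of the original notebook, and `rng.choice` stands in for the cumulative-sum sampler used there.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Return action probabilities: one-hot on the greedy action, plus epsilon everywhere."""
    n = len(q_row)
    probs = np.full(n, epsilon, dtype=float)
    probs[np.argmax(q_row)] += 1.0
    return probs / (1.0 + n * epsilon)

def sample_index(probs, rng):
    """Draw one index according to probs."""
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
probs = epsilon_greedy_probs(np.array([0.1, 0.7, 0.2, 0.0]), epsilon=0.01)
# the greedy action (index 1) gets 1.01/1.04; each other action gets 0.01/1.04
print(probs, probs.sum())
print(sample_index(probs, rng))
```

With `epsilon = 0.01` and four actions, the agent takes the greedy action about 97% of the time.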

# Q-learning using experience replay and target Q

# Q-learning using experience replay and target Q

# import libraries
import numpy as np
from collections import deque
import random

# set parameters ###############################################################
epoch_sarsa = 1000
epoch_q_learning = 40000
size_experience_replay = 1000
number_of_sample_from_experience_replay = 20
time_period_to_update_target_Q = 100
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q 
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  
        
# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0, 0, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 1, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 2, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 3, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

P[1, 0, :] = [0.9, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0]
P[1, 1, :] = [0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0, 0]
P[1, 2, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[1, 3, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

P[2, 0, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 1, :] = [0, 0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0]
P[2, 2, :] = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 3, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]

P[3, 0, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 1, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 2, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 3, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

P[4, 0, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 1, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 2, :] = [0.9, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[4, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0, 0]

P[5, 0, :] = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
P[5, 1, :] = [0, 0, 0, 0.1, 0, 0, 0.8, 0, 0, 0, 0.1]
P[5, 2, :] = [0, 0.1, 0.8, 0.1, 0, 0, 0, 0, 0, 0, 0]
P[5, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.8, 0.1]

P[6, 0, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 1, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 2, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 3, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

P[7, 0, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
P[7, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[7, 2, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[7, 3, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

P[8, 0, :] = [0, 0, 0, 0, 0.1, 0, 0, 0.9, 0, 0, 0]
P[8, 1, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[8, 2, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[8, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

P[9, 0, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[9, 1, :] = [0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9]
P[9, 2, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]
P[9, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

P[10, 0, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[10, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
P[10, 2, :] = [0, 0, 0, 0, 0, 0.1, 0.9, 0, 0, 0, 0]
P[10, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_policy_now_minus_random_coin = cum_policy_now - random_coin
    return [n for n, i in enumerate(cum_policy_now_minus_random_coin) if i > 0][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_prob_minus_random_coin = cum_prob - random_coin
    return [n for n, i in enumerate(cum_prob_minus_random_coin) if i > 0][0]

# replay memory: a deque that keeps only the size_experience_replay most recent experiences
replay_meomory = deque(maxlen=size_experience_replay)

# SARSA warm-up: learn an initial Q and fill the replay memory
for t in range(epoch_sarsa):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])

    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s, a, :])

        # epsilon-greedy policy at s1: add epsilon to every action, then normalize
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + N_ACTIONS * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # SARSA update (on-policy: bootstraps from the action a1 actually chosen)
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # store the transition (s, a, r, s1); a full deque drops its oldest entry
        replay_meomory.append([s,a,R[s,a],s1])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

# initialize target Q as a separate copy (plain assignment would only alias Q)
Q_target = Q.copy()

# count environment steps across all episodes to schedule target Q updates
# (a counter reset at every episode would rarely reach the update period)
time_log_to_update_target_Q = 0

# Q-learning using experience replay and target Q
for t in range(epoch_q_learning):

    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])

    while not done:
        # exploit - update Q-function using Q-learning with experience replay
        # and
        # explore - move according to updated epsilon-greedy policy

        # advance the step counter used to schedule target Q updates
        time_log_to_update_target_Q += 1

        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s,a,:])

        # epsilon-greedy policy at s1: add epsilon to every action, then normalize
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + N_ACTIONS * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)

        # store the transition (s, a, r, s1); a full deque drops its oldest entry
        replay_meomory.append([s, a, R[s, a], s1])

        # Q-learning using experience replay and target Q
        # choose number_of_sample_from_experience_replay experiences from the deque
        sample = random.sample(replay_meomory, number_of_sample_from_experience_replay)
        for s_r, a_r, r_r, s1_r in sample:
            # Q-learning with target Q: same update as plain Q-learning,
            # except that the bootstrap term reads the lagged copy Q_target
            Q[s_r, a_r] += alpha * (r_r + gamma * np.max(Q_target[s1_r, :]) - Q[s_r, a_r])

        # target Q update: take a fresh snapshot of Q every
        # time_period_to_update_target_Q steps
        if time_log_to_update_target_Q % time_period_to_update_target_Q == 0:
            Q_target = Q.copy()

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

[[  7.20477115e-02   6.23569214e-01   7.51497043e-02   3.49270849e-02]
 [  6.69469273e-02   7.34768345e-01   1.62949182e-01   1.33353783e-01]
 [  1.66233815e-01   7.53564952e-01   1.51510595e-01   8.99148776e-02]
 [  1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00]
 [  4.44500336e-02   2.25966833e-02   4.15697612e-01   5.49873502e-03]
 [  1.57705936e-01  -1.74401997e-01   7.50917635e-01   4.97867570e-02]
 [ -1.00000000e+00  -1.00000000e+00  -1.00000000e+00  -1.00000000e+00]
 [  6.36393639e-04  -3.39228410e-03   1.55186056e-01  -2.93230692e-03]
 [ -3.87006439e-03   2.01921509e-01  -5.42407939e-03  -1.78645296e-03]
 [ -1.30538920e-02  -2.62087774e-02   4.35722999e-01   9.18807343e-03]
 [  1.46041357e-01  -2.72626419e-03  -1.50432675e-01  -2.07415314e-03]]
[[ 0.7269991   0.75107371  0.73514853  0.70416665]
 [ 0.72884115  0.78054021  0.76281138  0.76344136]
 [ 0.77573499  0.81651291  0.77941966  0.42552569]
 [ 1.          1.          1.          1.        ]
 [ 0.69511574 
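
A target table only helps if it lags behind `Q`. One detail worth checking in any NumPy implementation is that plain assignment binds a second name to the same array, whereas `.copy()` takes an independent snapshot. The toy example below (array names are illustrative) shows the difference:

```python
import numpy as np

Q = np.zeros((2, 2))

alias = Q            # second name for the same array
snapshot = Q.copy()  # independent copy taken now

Q[0, 0] = 1.0        # in-place update, as in a Q-learning step

print(alias[0, 0])     # 1.0: the alias sees the update immediately
print(snapshot[0, 0])  # 0.0: the snapshot keeps the old value
print(np.shares_memory(Q, alias), np.shares_memory(Q, snapshot))  # True False
```

With an alias, the bootstrap term `max(Q_target[s1, :])` always equals `max(Q[s1, :])`, so the update reduces to ordinary Q-learning.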

# DQN

DQN paper: https://www.nature.com/articles/nature14236

DQN source code: https://sites.google.com/a/deepmind.com/dqn/

<div align="center"><img src="img/DQN Nature.png" width="60%" height="20%"></div>





<div align="center"><img src="img/Deep Reinforcement Learning in Atari.png" width="60%" height="20%"></div>

http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

<div align="center"><img src="img/DQN in Atari.png" width="60%" height="20%"></div>

http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

<div align="center"><img src="img/DQN Results in Atari.png" width="60%" height="20%"></div>

http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

<div align="center"><img src="img/Improvements since Nature DQN.png" width="60%" height="20%"></div>

http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

# Asynchronous Method

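More recently, the Asynchronous Method is reported to replace Experience Replay by removing the correlation in another way. In brief, several agents run in separate threads and collect [state, action, reward, state'] tuples at the same time. Because the data collected simultaneously by different agents should be largely uncorrelated with each other, this approach can stand in for Experience Replay, and it is said to be faster and to use less memory.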

http://www.phrgcm.com/blog/2016/08/17/deep-q-network/

https://arxiv.org/pdf/1602.01783v2.pdf
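
The idea of several agents collecting experience in parallel can be sketched with Python threads and a shared tabular Q. This is only an illustration under loud assumptions: the five-state chain environment, the function names `step` and `run_async`, and the lock around each update are all made up for this sketch and are not taken from the paper, which applies asynchronous updates to neural network parameters.

```python
import threading
import numpy as np

N_STATES, N_ACTIONS = 5, 2          # hypothetical chain; action 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1

def step(s, a):
    """Move along the chain; reaching the right end gives reward 1 and ends the episode."""
    s1 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s1 == N_STATES - 1)
    return s1, (1.0 if done else 0.0), done

def run_async(n_workers=4, n_steps=2000):
    Q = np.zeros((N_STATES, N_ACTIONS))   # shared by all workers
    lock = threading.Lock()               # simplification: serialize reads and updates of Q
    collected = [0] * n_workers           # transitions gathered per worker

    def worker(k):
        rng = np.random.default_rng(k)    # each worker explores with its own random stream
        s = 0
        for _ in range(n_steps):
            with lock:
                q_row = Q[s].copy()       # consistent snapshot of this state's values
            if rng.random() < EPSILON:
                a = int(rng.integers(N_ACTIONS))
            else:
                # greedy action, breaking ties at random
                a = int(rng.choice(np.flatnonzero(q_row == q_row.max())))
            s1, r, done = step(s, a)
            with lock:
                # one-step Q-learning update applied directly, with no replay memory
                target = r if done else r + GAMMA * Q[s1].max()
                Q[s, a] += ALPHA * (target - Q[s, a])
            collected[k] += 1
            s = 0 if done else s1

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return Q, sum(collected)

Q_async, n_transitions = run_async()
print(n_transitions)
print(Q_async)
```

Each worker is at a different point of its own trajectory, so consecutive updates to the shared table come from different parts of the state space, and no transition has to be stored.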