






# Large-Scale Reinforcement Learning
 
Sungchul Lee  




# References

- Reinforcement Learning: 6 Value Function Approximation [David Silver](https://www.youtube.com/watch?v=UoPei5o4fps&index=6&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT) [slide](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf)

- Tutorial: Deep Reinforcement Learning, ICML 2016 [David Silver](http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf)

- Gradient Temporal Difference Networks [David Silver](http://proceedings.mlr.press/v24/silver12a/silver12a.pdf)

- Gradient Temporal-Difference Learning Algorithms [Hamid Maei](http://ai2-s2-pdfs.s3.amazonaws.com/9494/639c7d4cc4aad0842c18a26562903ca0c6f8.pdf)

- Simple Reinforcement Learning with Tensorflow Part 4: Deep Q-Networks and Beyond [Arthur Juliani](https://medium.com/@awjuliani/simple-reinforcement-learning-with-tensorflow-part-4-deep-q-networks-and-beyond-8438a3e2b8df)

- DQN [Lee Young Moo](http://www.phrgcm.com/blog/2016/08/17/deep-q-network/)

- [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/pdf/1602.01783v2.pdf)

- [PR-005: Playing Atari with Deep Reinforcement Learning (NIPS 2013 Deep Learning Workshop)](https://www.youtube.com/watch?v=V7_cNTfm2i8&t=4s&list=PLlMkM4tgfjnJhhd4wn5aj8fVTYJwIpWkS&index=6)

- DQN [nalsil](https://github.com/nalsil/TensorFlow-Tutorials/tree/master/07%20-%20DQN)



# How to run these slides yourself

**Set up the Python environment**

- Install RISE for an interactive presentation viewer

# Large-Scale Reinforcement Learning

<div align="center"><img src="img/Large-Scale Reinforcement Learning.png" width="60%" height="20%"></div>

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf

# Value Function Approximation

<div align="center"><img src="img/Value Function Approximation.png" width="60%" height="20%"></div>

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf

# Types of Value Function Approximation

<div align="center"><img src="img/Types of Value Function Approximation.png" width="60%" height="20%"></div>

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf

# Value Function Approximation By Stochastic Gradient Descent - Cheating

# Goal

Find ${\bf w}$ minimizing
$$
J({\bf w})=\mathbb{E}_\pi\left[\left(v_\pi(S)-v_{\bf w}(S)\right)^2\right]
$$


$$\begin{array}{ll}
\mbox{Gradient descent}&
\Delta{\bf w}
=-\frac{1}{2}\alpha\nabla_{\bf w}J({\bf w})
=\alpha\mathbb{E}_\pi\left[\left(v_\pi(S)-v_{\bf w}(S)\right)\nabla_{\bf w}v_{\bf w}(S)\right]\\
\mbox{Stochastic gradient descent}&
\Delta{\bf w}
=\alpha\left(v_\pi(S)-v_{\bf w}(S)\right)\nabla_{\bf w}v_{\bf w}(S)
\end{array}$$
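
The stochastic update above can be tried on a toy problem. The following is a minimal sketch, assuming a linear approximator $v_{\bf w}(s)={\bf x}(s)^\top{\bf w}$ with one-hot features, so that $\nabla_{\bf w}v_{\bf w}(s)={\bf x}(s)$. The five states and the oracle values `v_pi` are made up for illustration; having access to `v_pi` at all is the "cheating" part.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 5 states, one-hot features, made-up true values.
n_states = 5
features = np.eye(n_states)                  # x(s): one-hot feature vector
v_pi = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # oracle values (assumed known)

w = np.zeros(n_states)                       # weights of v_w(s) = x(s) . w
alpha = 0.1

for _ in range(5000):
    s = rng.integers(n_states)               # sample a state uniformly
    x = features[s]                          # gradient of v_w(s) for a linear model
    w += alpha * (v_pi[s] - x @ w) * x       # stochastic gradient step

max_error = np.max(np.abs(features @ w - v_pi))
print(max_error)
```

In practice $v_\pi$ is unknown, so the target has to be replaced by a sampled estimate such as a return or a TD target.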

# Value Function Approximation By Incremental Prediction

# Goal

Find ${\bf w}$ minimizing
$$
J({\bf w})=\mathbb{E}_\pi\left[\left(v_\pi(S)-v_{\bf w}(S)\right)^2\right]
$$

$$\begin{array}{llllll}
\mbox{MC}&\Delta{\bf w}&=&\alpha&\left(G_t-v_{\bf w}(S_t)\right)&\nabla_{\bf w}v_{\bf w}(S_t)\\
\mbox{TD}&\Delta{\bf w}&=&\alpha&\left(R_{t+1}+\gamma v_{\bf w}(S_{t+1}) -v_{\bf w}(S_t)\right)&\nabla_{\bf w}v_{\bf w}(S_t)\\
\mbox{TD}(\lambda)&\Delta{\bf w}&=&\alpha&\left(G_t^{\lambda}-v_{\bf w}(S_t)\right)&\nabla_{\bf w}v_{\bf w}(S_t)\\
\end{array}$$
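
The MC and TD rows of the table can be compared on a small example. This is a sketch under stated assumptions, not part of the original slides: the environment is a hypothetical 5-state random walk (start in the middle, step left or right with probability 0.5, reward 1 only when leaving on the right), the features are one-hot, and `run_episode`, `w_mc`, and `w_td` are illustrative names. For this walk the true values are $(s+1)/6$.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5
features = np.eye(N)                      # one-hot features: v_w(s) = features[s] @ w
gamma, alpha = 1.0, 0.005
true_v = np.arange(1, N + 1) / (N + 1)    # known values of this random walk

def run_episode():
    """Return a list of (state, reward, next_state); next_state is None at termination."""
    s, steps = N // 2, []
    while True:
        s1 = s + (1 if rng.random() < 0.5 else -1)
        if s1 < 0:                        # left exit, reward 0
            steps.append((s, 0.0, None))
            return steps
        if s1 >= N:                       # right exit, reward 1
            steps.append((s, 1.0, None))
            return steps
        steps.append((s, 0.0, s1))
        s = s1

w_mc, w_td = np.zeros(N), np.zeros(N)
for _ in range(20000):
    episode = run_episode()

    # TD(0): target is R + gamma * v_w(S'), with v_w(terminal) = 0
    for s, r, s1 in episode:
        x = features[s]
        v_next = 0.0 if s1 is None else features[s1] @ w_td
        w_td += alpha * (r + gamma * v_next - x @ w_td) * x

    # Monte Carlo: target is the return G_t, accumulated backwards
    G = 0.0
    for s, r, _ in reversed(episode):
        G = r + gamma * G
        x = features[s]
        w_mc += alpha * (G - x @ w_mc) * x

print(np.round(w_mc, 2), np.round(w_td, 2))
```

Both estimates should end up close to `true_v`; with a constant step size they keep fluctuating around it rather than converging exactly.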

# DQN

DQN paper: https://www.nature.com/articles/nature14236

DQN source code: https://sites.google.com/a/deepmind.com/dqn/

<div align="center"><img src="img/DQN Nature.png" width="60%" height="20%"></div>

<div align="center"><img src="img/Deep Reinforcement Learning in Atari.png" width="60%" height="20%"></div>

http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

<div align="center"><img src="img/DQN in Atari.png" width="60%" height="20%"></div>

http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

<div align="center"><img src="img/DQN Results in Atari.png" width="60%" height="20%"></div>

http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

# Why DQN works

- Going from a single-layer network to a multi-layer convolutional network to approximate the Q function.

- Implementing Experience Replay, which will allow our network to train itself using stored memories from its experience.


- Utilizing a second “target” network, which we will use to compute target Q-values during our updates.

https://medium.com/@awjuliani/simple-reinforcement-learning-with-tensorflow-part-4-deep-q-networks-and-beyond-8438a3e2b8df
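
The experience replay idea in the list above can be sketched as a small fixed-size buffer. The class name `ReplayBuffer` and its methods are illustrative assumptions, not an API from the DQN paper or from any library; the only building blocks are `collections.deque` and `random.sample` from the standard library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # a deque with maxlen drops the oldest transition when full
        self.memory = deque(maxlen=capacity)

    def push(self, s, a, r, s1):
        self.memory.append((s, a, r, s1))

    def sample(self, batch_size):
        # never ask for more samples than are stored
        k = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(step, 0, -0.02, step + 1)   # made-up transitions

batch = buffer.sample(2)
print(len(buffer), batch)
```

Sampling uniformly from stored transitions breaks the correlation between consecutive updates, which is the motivation given for replay in DQN.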

# Q-learning using experience replay

<img src="img/output_ahug9u_by_elphin_zephyr-daxvvvu.gif"/>

https://orig00.deviantart.net/1b54/f/2017/035/9/b/output_ahug9u_by_elphin_zephyr-daxvvvu.gif

In [None]:
# import libraries
import numpy as np
from collections import deque
import random

In [None]:
# set parameters ###############################################################
epoch_sarsa = 1000
epoch_q_learning = 20000
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

In [None]:
# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  
        
# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES)) 

P[0, 0, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 1, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 2, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 3, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

P[1, 0, :] = [0.9, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0]
P[1, 1, :] = [0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0, 0]
P[1, 2, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[1, 3, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

P[2, 0, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 1, :] = [0, 0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0]
P[2, 2, :] = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 3, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]

P[3, 0, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 1, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 2, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 3, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

P[4, 0, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 1, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 2, :] = [0.9, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[4, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0, 0]

P[5, 0, :] = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
P[5, 1, :] = [0, 0, 0, 0.1, 0, 0, 0.8, 0, 0, 0, 0.1]
P[5, 2, :] = [0, 0.1, 0.8, 0.1, 0, 0, 0, 0, 0, 0, 0]
P[5, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.8, 0.1]

P[6, 0, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 1, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 2, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 3, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

P[7, 0, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
P[7, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[7, 2, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[7, 3, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

P[8, 0, :] = [0, 0, 0, 0, 0.1, 0, 0, 0.9, 0, 0, 0]
P[8, 1, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[8, 2, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[8, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

P[9, 0, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[9, 1, :] = [0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9]
P[9, 2, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]
P[9, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

P[10, 0, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[10, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
P[10, 2, :] = [0, 0, 0, 0, 0, 0.1, 0.9, 0, 0, 0, 0]
P[10, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_minus_coin = cum_policy_now - random_coin
    return [n for n, i in enumerate(cum_minus_coin) if i > 0][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_minus_coin = cum_prob - random_coin
    return [n for n, i in enumerate(cum_minus_coin) if i > 0][0]
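
The two helpers above draw an index from a discrete distribution by comparing a uniform random number with the cumulative sum. As a cross-check, the same draw can be sketched with NumPy's `Generator.choice` and its `p` argument. The helper name `sample_index` is hypothetical, and the row is the same as `P[5, 1, :]`.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_index(prob):
    """Draw an index i with probability prob[i]."""
    return int(rng.choice(len(prob), p=prob))

row = [0, 0, 0, 0.1, 0, 0, 0.8, 0, 0, 0, 0.1]   # same values as P[5, 1, :]
draws = [sample_index(row) for _ in range(10000)]

# empirical frequencies should be close to the probabilities in row
freq = np.bincount(draws, minlength=len(row)) / len(draws)
print(freq)
```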



<div align="center"><img src="img/WW1-Great-War-Cartoons-Punch-Magazine-Raven-Hill-1917-12-19-421.jpg" width="60%" height="20%"></div>


https://ssl.c.photoshelter.com/img-get/I0000x4Qkv5Ut3mo/s/900/720/WW1-Great-War-Cartoons-Punch-Magazine-Raven-Hill-1917-12-19-421.jpg

In [None]:
# create the replay memory: a deque holding at most 100 experiences
replay_meomory = deque(maxlen=100)

In [None]:
# warm-up with SARSA: initialize Q and fill the replay memory
for t in range(epoch_sarsa):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])

    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s, a, :])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # SARSA
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # append current experience at the end of the deque and update the deque
        replay_meomory.append([s,a,R[s,a],s1])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)
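
The epsilon-greedy step in the loop above puts probability 1 on the greedy action, adds `epsilon` to every action, and renormalizes. A minimal sketch of that rule as a standalone function follows; the name `epsilon_greedy_probs` and the sample Q row are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """One-hot on the greedy action, plus epsilon on every action, renormalized."""
    n_actions = len(q_row)
    probs = np.zeros(n_actions)
    probs[np.argmax(q_row)] = 1.0
    return (probs + epsilon) / (1 + n_actions * epsilon)

probs = epsilon_greedy_probs(np.array([0.1, 0.7, 0.2, 0.0]), epsilon=0.01)
print(probs, probs.sum())
```

With four actions and `epsilon = 0.01`, the greedy action gets probability 1.01/1.04 and each other action gets 0.01/1.04.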



In [None]:
# Q-learning using experience replay
for t in range(epoch_q_learning):
    
    # ... code skipped: episode initialization (done, s, a), same as in the SARSA loop
    
    while not done: 
        
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s,a,:])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # append current experience at the end of the deque and update the deque
        replay_meomory.append([s, a, R[s, a], s1])

        # Q-learning using experience replay 
        # choose 7 experiences from the deque
        sample = random.sample(replay_meomory, 7)
        for i in range(7):
            # experience replay
            replay = sample[i]
            # Q-learning
            Q[replay[0],replay[1]] = Q[replay[0],replay[1]] + \
                                 alpha * (replay[2] + gamma * max(Q[replay[3],:]) - Q[replay[0],replay[1]])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

<div align="center"><img src="img/Q-learning using experience replay.png" width="60%" height="20%"></div>

In [4]:
# Q-learning using experience replay

# import libraries
import numpy as np
from collections import deque
import random

# set parameters ###############################################################
epoch_sarsa = 1000
epoch_q_learning = 20000
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  
        
# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0, 0, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 1, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 2, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 3, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

P[1, 0, :] = [0.9, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0]
P[1, 1, :] = [0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0, 0]
P[1, 2, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[1, 3, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

P[2, 0, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 1, :] = [0, 0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0]
P[2, 2, :] = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 3, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]

P[3, 0, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 1, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 2, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 3, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

P[4, 0, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 1, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 2, :] = [0.9, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[4, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0, 0]

P[5, 0, :] = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
P[5, 1, :] = [0, 0, 0, 0.1, 0, 0, 0.8, 0, 0, 0, 0.1]
P[5, 2, :] = [0, 0.1, 0.8, 0.1, 0, 0, 0, 0, 0, 0, 0]
P[5, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.8, 0.1]

P[6, 0, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 1, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 2, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 3, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

P[7, 0, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
P[7, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[7, 2, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[7, 3, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

P[8, 0, :] = [0, 0, 0, 0, 0.1, 0, 0, 0.9, 0, 0, 0]
P[8, 1, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[8, 2, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[8, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

P[9, 0, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[9, 1, :] = [0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9]
P[9, 2, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]
P[9, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

P[10, 0, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[10, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
P[10, 2, :] = [0, 0, 0, 0, 0, 0.1, 0.9, 0, 0, 0, 0]
P[10, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_minus_coin = cum_policy_now - random_coin
    return [n for n, i in enumerate(cum_minus_coin) if i > 0][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_minus_coin = cum_prob - random_coin
    return [n for n, i in enumerate(cum_minus_coin) if i > 0][0]

# create the replay memory: a deque holding at most 100 experiences
replay_meomory = deque(maxlen=100)

# warm-up with SARSA: initialize Q and fill the replay memory
for t in range(epoch_sarsa):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])

    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s, a, :])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # SARSA
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # append current experience at the end of the deque and update the deque
        replay_meomory.append([s,a,R[s,a],s1])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

# Q-learning using experience replay
for t in range(epoch_q_learning):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:]) 
    
    while not done: 
        
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s,a,:])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # append current experience at the end of the deque and update the deque
        replay_meomory.append([s, a, R[s, a], s1])

        # Q-learning using experience replay 
        # choose 7 experiences from the deque
        sample = random.sample(replay_meomory, 7)
        for i in range(7):
            # experience replay
            replay = sample[i]
            # Q-learning
            Q[replay[0],replay[1]] = Q[replay[0],replay[1]] + \
                                 alpha * (replay[2] + gamma * max(Q[replay[3],:]) - Q[replay[0],replay[1]])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

[[ 0.05498236  0.54533103  0.07056754  0.03224705]
 [ 0.07911819  0.70386938  0.14433013  0.13376335]
 [ 0.1112421   0.79432344  0.2396675   0.08519836]
 [ 1.          1.          1.          1.        ]
 [ 0.02083935  0.01809598  0.3324215  -0.00364398]
 [ 0.15757644 -0.20682696  0.73833763  0.02644631]
 [-1.         -1.         -1.         -1.        ]
 [-0.00855757  0.01510166  0.08693695 -0.00556295]
 [-0.00731839  0.21470185 -0.00184952  0.00212867]
 [ 0.0079362  -0.01800231  0.42161678  0.05253249]
 [ 0.19009332  0.01269543 -0.23225642  0.00846619]]
[[ 0.4585549   0.76560936  0.45227328  0.42416659]
 [ 0.44748585  0.78950609  0.49530373  0.51888518]
 [ 0.49591137  0.75993222  0.62025129  0.34950453]
 [ 1.          1.          1.          1.        ]
 [ 0.39306242  0.39920305  0.732524    0.38897126]
 [ 0.54592265 -0.70593815  0.81264371  0.33907882]
 [-1.         -1.         -1.         -1.        ]
 [ 0.35676667  0.3080031   0.67916175  0.34497249]
 [ 0.55154856  0.37293773  0.2

# Q-learning using experience replay and target Q


<div align="center"><img src="img/i2.cdn_.turner.commoneydamassets150930100632-target-bullseye-stunning-stats-1024x576-b25e28f4bec7a1d3ee6d3d9482367677e022c71e.jpg" width="60%" height="20%"></div>

https://www.markettamer.com/blog/wp-content/uploads/2017/10/i2.cdn_.turner.commoneydamassets150930100632-target-bullseye-stunning-stats-1024x576-b25e28f4bec7a1d3ee6d3d9482367677e022c71e.jpg

In [None]:
# import libraries
import numpy as np
from collections import deque
import random

# set parameters ###############################################################
epoch_sarsa = 1000
epoch_q_learning = 40000
size_experience_replay = 1000
number_of_sample_from_experience_replay = 20
time_period_to_update_target_Q = 100
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

In [None]:
# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q 
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  
        
# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES)) 

P[0, 0, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 1, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 2, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 3, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

P[1, 0, :] = [0.9, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0]
P[1, 1, :] = [0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0, 0]
P[1, 2, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[1, 3, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

P[2, 0, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 1, :] = [0, 0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0]
P[2, 2, :] = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 3, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]

P[3, 0, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 1, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 2, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 3, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

P[4, 0, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 1, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 2, :] = [0.9, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[4, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0, 0]

P[5, 0, :] = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
P[5, 1, :] = [0, 0, 0, 0.1, 0, 0, 0.8, 0, 0, 0, 0.1]
P[5, 2, :] = [0, 0.1, 0.8, 0.1, 0, 0, 0, 0, 0, 0, 0]
P[5, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.8, 0.1]

P[6, 0, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 1, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 2, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 3, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

P[7, 0, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
P[7, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[7, 2, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[7, 3, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

P[8, 0, :] = [0, 0, 0, 0, 0.1, 0, 0, 0.9, 0, 0, 0]
P[8, 1, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[8, 2, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[8, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

P[9, 0, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[9, 1, :] = [0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9]
P[9, 2, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]
P[9, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

P[10, 0, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[10, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
P[10, 2, :] = [0, 0, 0, 0, 0, 0.1, 0.9, 0, 0, 0, 0]
P[10, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_minus_coin = cum_policy_now - random_coin
    return [n for n, i in enumerate(cum_minus_coin) if i > 0][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_minus_coin = cum_prob - random_coin
    return [n for n, i in enumerate(cum_minus_coin) if i > 0][0]

# create the replay memory: a deque holding at most size_experience_replay experiences
replay_meomory = deque(maxlen=size_experience_replay)

# warm-up with SARSA: initialize Q and fill the replay memory
for t in range(epoch_sarsa):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])

    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s, a, :])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # SARSA
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # append current experience at the end of the deque and update the deque
        replay_meomory.append([s,a,R[s,a],s1])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

In [None]:
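# initialize target Q as an independent snapshot of Q
# (Q_target = Q would only bind a second name to the same array)
Q_target = Q.copy()

In [None]:
# Q-learning using experience replay and target Q

# step counter for target Q updates; initialized once so it runs across episodes
time_log_to_update_target_Q = 0

for t in range(epoch_q_learning):

    # ... code skipped: episode initialization (done, s, a), same as in the SARSA loop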
    
    while not done:
        
        # time log to update target Q
        time_log_to_update_target_Q += 1
        
        # ... code skipped: sample s1, update the epsilon-greedy policy, choose a1, append the experience

        # Q-learning using experience replay and target Q
        # choose number_of_sample_from_experience_replay experiences from the deque
        sample = random.sample(replay_meomory, number_of_sample_from_experience_replay)
        for i in range(number_of_sample_from_experience_replay):
            # experience replay
            replay = sample[i]
            
            # Q-learning
            # Q[replay[0],replay[1]] = Q[replay[0],replay[1]] + \
            #                      alpha * (replay[2] + gamma * max(Q[replay[3],:]) - Q[replay[0],replay[1]])
                
            # Q-learning with target Q
            Q[replay[0],replay[1]] = Q[replay[0],replay[1]] + \
                                 alpha * (replay[2] + gamma * max(Q_target[replay[3],:]) - Q[replay[0],replay[1]])
                
        # target Q update: take a fresh snapshot of Q
        if time_log_to_update_target_Q % time_period_to_update_target_Q == 0:
            Q_target = Q.copy()

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)
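
A target table only helps if it is a frozen snapshot of `Q`. In NumPy, plain assignment binds a second name to the same array, so in-place updates to `Q` also change it, while `.copy()` gives an independent array. A minimal illustration with a toy array (the names `alias` and `snapshot` are illustrative):

```python
import numpy as np

Q = np.zeros((2, 2))

alias = Q              # a second name for the same array
snapshot = Q.copy()    # an independent copy

Q[0, 0] = 1.0          # in-place update, as in the Q-learning step

print(alias[0, 0])     # follows Q
print(snapshot[0, 0])  # keeps the old value
print(np.shares_memory(alias, Q), np.shares_memory(snapshot, Q))
```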



<div align="center"><img src="img/Q-learning using experience replay and target Q result.png" width="60%" height="20%"></div>

In [7]:
# Q-learning using experience replay and target Q

# import libraries
import numpy as np
from collections import deque
import random

# set parameters ###############################################################
epoch_sarsa = 1000
epoch_q_learning = 40000
size_experience_replay = 1000
number_of_sample_from_experience_replay = 20
time_period_to_update_target_Q = 100
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q 
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  
        
# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0, 0, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 1, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 2, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 3, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

P[1, 0, :] = [0.9, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0]
P[1, 1, :] = [0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0, 0]
P[1, 2, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[1, 3, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

P[2, 0, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 1, :] = [0, 0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0]
P[2, 2, :] = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 3, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]

P[3, 0, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 1, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 2, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 3, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

P[4, 0, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 1, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 2, :] = [0.9, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[4, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0, 0]

P[5, 0, :] = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
P[5, 1, :] = [0, 0, 0, 0.1, 0, 0, 0.8, 0, 0, 0, 0.1]
P[5, 2, :] = [0, 0.1, 0.8, 0.1, 0, 0, 0, 0, 0, 0, 0]
P[5, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.8, 0.1]

P[6, 0, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 1, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 2, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 3, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

P[7, 0, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
P[7, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[7, 2, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[7, 3, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

P[8, 0, :] = [0, 0, 0, 0, 0.1, 0, 0, 0.9, 0, 0, 0]
P[8, 1, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[8, 2, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[8, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

P[9, 0, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[9, 1, :] = [0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9]
P[9, 2, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]
P[9, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

P[10, 0, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[10, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
P[10, 2, :] = [0, 0, 0, 0, 0, 0.1, 0.9, 0, 0, 0, 0]
P[10, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_minus_coin = cum_policy_now - random_coin
    return [n for n, i in enumerate(cum_minus_coin) if i > 0][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_minus_coin = cum_prob - random_coin
    return [n for n, i in enumerate(cum_minus_coin) if i > 0][0]

# create the replay memory: a deque holding at most size_experience_replay experiences
replay_meomory = deque(maxlen=size_experience_replay)

# warm-up with SARSA: initialize Q and fill the replay memory
for t in range(epoch_sarsa):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])

    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s, a, :])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # SARSA
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # append current experience at the end of the deque and update the deque
        replay_meomory.append([s,a,R[s,a],s1])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

# initialize target Q as an independent snapshot of Q
# (Q_target = Q would only bind a second name to the same array)
Q_target = Q.copy()

# step counter for target Q updates; initialized once so it runs across episodes
time_log_to_update_target_Q = 0

# Q-learning using experience replay and target Q
for t in range(epoch_q_learning):

    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:]) 
    
    while not done:
        # exploit - update Q-function using Q-learning with experience replay
        # and
        # explore - move according to updated epsilon-greedy policy
        
        # time log to update target Q
        time_log_to_update_target_Q += 1
        
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s,a,:])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # append current experience at the end of the deque and update the deque
        replay_meomory.append([s, a, R[s, a], s1])

        # Q-learning using experience replay and target Q
        # choose number_of_sample_from_experience_replay experiences from the deque
        sample = random.sample(replay_meomory, number_of_sample_from_experience_replay)
        for i in range(number_of_sample_from_experience_replay):
            # experience replay
            replay = sample[i]
            
            # Q-learning
            # Q[replay[0],replay[1]] = Q[replay[0],replay[1]] + \
            #                      alpha * (replay[2] + gamma * max(Q[replay[3],:]) - Q[replay[0],replay[1]])
                
            # Q-learning with target Q
            Q[replay[0],replay[1]] = Q[replay[0],replay[1]] + \
                                 alpha * (replay[2] + gamma * max(Q_target[replay[3],:]) - Q[replay[0],replay[1]])
                
        # target Q update
        if time_log_to_update_target_Q % time_period_to_update_target_Q == 0:
            Q_target = Q

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

[[  7.20477115e-02   6.23569214e-01   7.51497043e-02   3.49270849e-02]
 [  6.69469273e-02   7.34768345e-01   1.62949182e-01   1.33353783e-01]
 [  1.66233815e-01   7.53564952e-01   1.51510595e-01   8.99148776e-02]
 [  1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00]
 [  4.44500336e-02   2.25966833e-02   4.15697612e-01   5.49873502e-03]
 [  1.57705936e-01  -1.74401997e-01   7.50917635e-01   4.97867570e-02]
 [ -1.00000000e+00  -1.00000000e+00  -1.00000000e+00  -1.00000000e+00]
 [  6.36393639e-04  -3.39228410e-03   1.55186056e-01  -2.93230692e-03]
 [ -3.87006439e-03   2.01921509e-01  -5.42407939e-03  -1.78645296e-03]
 [ -1.30538920e-02  -2.62087774e-02   4.35722999e-01   9.18807343e-03]
 [  1.46041357e-01  -2.72626419e-03  -1.50432675e-01  -2.07415314e-03]]
[[ 0.7269991   0.75107371  0.73514853  0.70416665]
 [ 0.72884115  0.78054021  0.76281138  0.76344136]
 [ 0.77573499  0.81651291  0.77941966  0.42552569]
 [ 1.          1.          1.          1.        ]
 [ 0.69511574 

<div align="center"><img src="img/Improvements since Nature DQN.png" width="60%" height="20%"></div>

http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

# Asynchronous Method

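Recently, the asynchronous method has been reported to replace experience replay by removing the correlation between samples. In brief, several agents run in separate threads and collect [state, action, reward, state'] tuples at the same time. Because the data gathered simultaneously by different agents are expected to be largely uncorrelated with each other, the method can stand in for experience replay while being faster and using less memory.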

http://www.phrgcm.com/blog/2016/08/17/deep-q-network/

https://arxiv.org/pdf/1602.01783v2.pdf
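
The structure described above can be sketched with Python threads. This is only a minimal illustration under assumptions: the 5-state chain environment (`step`), the shared tabular `Q`, the uniformly random behavior policy, and all hyperparameters are hypothetical and are not taken from the cited paper. Each worker explores with its own random generator and applies one-step Q-learning updates directly to the shared table, so no replay memory is stored. Because of Python's global interpreter lock the threads do not run truly in parallel; the sketch shows the layout, not the speed-up.

```python
import threading
import numpy as np

N_STATES, N_ACTIONS = 5, 2           # hypothetical chain: states 0..4, actions 0=left, 1=right
GAMMA, ALPHA = 0.9, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))  # Q table shared by all workers
lock = threading.Lock()              # makes each update atomic
n_updates = 0

def step(s, a):
    """Move left or right on the chain; reaching state 4 gives reward 1 and ends the episode."""
    s1 = max(s - 1, 0) if a == 0 else s + 1
    done = (s1 == N_STATES - 1)
    return s1, (1.0 if done else 0.0), done

def worker(seed, n_steps):
    """One agent: collects (s, a, r, s1) on its own and updates the shared Q immediately."""
    global n_updates
    rng = np.random.default_rng(seed)        # each agent explores independently
    s = 0
    for _ in range(n_steps):
        a = int(rng.integers(N_ACTIONS))     # uniform random behavior (Q-learning is off-policy)
        s1, r, done = step(s, a)
        target = r if done else r + GAMMA * np.max(Q[s1, :])
        with lock:
            Q[s, a] += ALPHA * (target - Q[s, a])
            n_updates += 1
        s = 0 if done else s1                # restart the episode at state 0

threads = [threading.Thread(target=worker, args=(seed, 2000)) for seed in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()

print(n_updates)
print(np.round(Q, 2))
```

Each update uses a transition that was just generated and is then discarded, which is where the memory saving relative to a replay buffer comes from.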

# Double DQN

<div align="center"><img src="img/Double DQN code.png" width="100%" height="50%"></div>
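
As a small numeric illustration of how the Double DQN target differs from the max-based target, consider the sketch below. The arrays `Q_online` and `Q_target_demo`, the reward, and the next-state index are made-up values, not results from the gridworld in this section. The max-based target both selects and evaluates the next action with the target table, whereas Double DQN selects the action with the online table and evaluates it with the target table.

```python
import numpy as np

gamma = 0.99

# hypothetical Q-values for a single next state with 4 actions
Q_online = np.array([[0.2, 0.5, 0.1, 0.4]])
Q_target_demo = np.array([[0.3, 0.1, 0.9, 0.2]])

r, s1 = -0.02, 0   # hypothetical reward and next-state index

# Q-learning with target Q: select and evaluate with the target table
y_dqn = r + gamma * np.max(Q_target_demo[s1, :])

# Double DQN: select with the online table, evaluate with the target table
a_star = int(np.argmax(Q_online[s1, :]))
y_double = r + gamma * Q_target_demo[s1, a_star]

print(a_star, y_dqn, y_double)
```

Since `Q_target_demo[s1, a_star]` can never exceed the maximum over that row, the Double DQN target is never larger than the max-based one, which is how it curbs overestimation.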


          

In [4]:
# Double DQN

# import libraries
import numpy as np
from collections import deque
import random

# set parameters ###############################################################
epoch_sarsa = 1000
epoch_q_learning = 50000
size_experience_replay = 1000
number_of_sample_from_experience_replay = 20
time_period_to_update_target_Q = 100
gamma = 0.99
alpha = 0.01
epsilon = 0.01
# set parameters ###############################################################

# state
states = [0,1,2,3,4,5,6,7,8,9,10]
N_STATES = len(states)

# action
actions = [0,1,2,3] # left, right, up, down
N_ACTIONS = len(actions)

# policy
policy = 0.25*np.ones((N_STATES, N_ACTIONS))

# Q 
Q = np.zeros((N_STATES, N_ACTIONS))
Q[3,:] = 1
Q[6,:] = -1

# rewards
if True: # fuel-efficient robot
    R = -0.02 * np.ones((N_STATES, N_ACTIONS))  
else: # fuel-inefficient robot 
    R = -0.5 * np.ones((N_STATES, N_ACTIONS))  
        
# transition probabilities
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))  

P[0, 0, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 1, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 2, :] = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[0, 3, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

P[1, 0, :] = [0.9, 0, 0, 0, 0.1, 0, 0, 0, 0, 0, 0]
P[1, 1, :] = [0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0, 0]
P[1, 2, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[1, 3, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

P[2, 0, :] = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 1, :] = [0, 0, 0, 0.9, 0, 0, 0.1, 0, 0, 0, 0]
P[2, 2, :] = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
P[2, 3, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]

P[3, 0, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 1, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 2, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
P[3, 3, :] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

P[4, 0, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 1, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[4, 2, :] = [0.9, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P[4, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0.9, 0.1, 0, 0]

P[5, 0, :] = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
P[5, 1, :] = [0, 0, 0, 0.1, 0, 0, 0.8, 0, 0, 0, 0.1]
P[5, 2, :] = [0, 0.1, 0.8, 0.1, 0, 0, 0, 0, 0, 0, 0]
P[5, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.8, 0.1]

P[6, 0, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 1, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 2, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
P[6, 3, :] = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

P[7, 0, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
P[7, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[7, 2, :] = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P[7, 3, :] = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

P[8, 0, :] = [0, 0, 0, 0, 0.1, 0, 0, 0.9, 0, 0, 0]
P[8, 1, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[8, 2, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[8, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

P[9, 0, :] = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
P[9, 1, :] = [0, 0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9]
P[9, 2, :] = [0, 0, 0, 0, 0, 0.9, 0.1, 0, 0, 0, 0]
P[9, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

P[10, 0, :] = [0, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.9, 0]
P[10, 1, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
P[10, 2, :] = [0, 0, 0, 0, 0, 0.1, 0.9, 0, 0, 0, 0]
P[10, 3, :] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# define a function - sample_action 
def sample_action(policy_given_state):
    policy_now = policy_given_state
    cum_policy_now = np.cumsum(policy_now)
    random_coin = np.random.random(1)
    cum_minus_coin = cum_policy_now - random_coin
    return [n for n, i in enumerate(cum_minus_coin) if i > 0][0]

# define a function - sample_transition
def sample_transition(transition_prob_given_state_and_action):
    prob = transition_prob_given_state_and_action
    cum_prob = np.cumsum(prob)
    random_coin = np.random.random(1)
    cum_minus_coin = cum_prob - random_coin
    return [n for n, i in enumerate(cum_minus_coin) if i > 0][0]

# replay memory: a deque that keeps only the most recent size_experience_replay experiences
replay_meomory = deque(maxlen=size_experience_replay)

# run SARSA to initialize Q and to fill the replay memory with experiences
for t in range(epoch_sarsa):
    
    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])

    while not done:
        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s, a, :])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + 4 * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)
        
        # SARSA
        Q[s,a] = Q[s,a] + alpha * (R[s,a] + gamma * Q[s1,a1] - Q[s,a])

        # append current experience at the end of the deque and update the deque
        replay_meomory.append([s,a,R[s,a],s1])

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

# initialize target Q as a separate copy of Q
# (Q_target = Q would only create an alias, so the target would always track Q)
Q_target = Q.copy()

# step counter for target Q updates, kept across episodes
# (a per-episode counter would rarely reach time_period_to_update_target_Q in short episodes)
time_log_to_update_target_Q = 0

# Double DQN
for t in range(epoch_q_learning):

    # indicate game is not over yet
    done = False
    # choose initial state randomly
    s = np.random.choice([0,1,2,4,5,7,8,9,10]) # 3 and 6 removed
    # choose action using current policy
    a = sample_action(policy_given_state=policy[s,:])

    while not done:

        # count steps to schedule target Q updates
        time_log_to_update_target_Q += 1

        # choose next state using transition probabilities
        s1 = sample_transition(transition_prob_given_state_and_action=P[s,a,:])

        # epsilon-greedy policy update
        policy_now = np.zeros(N_ACTIONS)
        m = np.argmax(Q[s1, :])
        policy_now[m] = 1
        policy_now = (policy_now + epsilon) / (1 + N_ACTIONS * epsilon)

        # choose action using epsilon-greedy policy
        a1 = sample_action(policy_given_state=policy_now)

        # append current experience at the end of the deque; the oldest one drops out when full
        replay_meomory.append([s, a, R[s, a], s1])

        # Double DQN using experience replay and target Q
        # choose number_of_sample_from_experience_replay experiences from the deque
        sample = random.sample(replay_meomory, number_of_sample_from_experience_replay)
        for s_rep, a_rep, r_rep, s1_rep in sample:
            # experience replay: (state, action, reward, next state)

            # Q-learning
            # Q[s_rep,a_rep] = Q[s_rep,a_rep] + \
            #                  alpha * (r_rep + gamma * np.max(Q[s1_rep,:]) - Q[s_rep,a_rep])

            # Q-learning with target Q
            # Q[s_rep,a_rep] = Q[s_rep,a_rep] + \
            #                  alpha * (r_rep + gamma * np.max(Q_target[s1_rep,:]) - Q[s_rep,a_rep])

            # Double DQN: select the next action with Q, evaluate it with Q_target
            Double_DQN_action = np.argmax(Q[s1_rep,:])
            Q[s_rep,a_rep] = Q[s_rep,a_rep] + \
                             alpha * (r_rep + gamma * Q_target[s1_rep,Double_DQN_action] - Q[s_rep,a_rep])

        # target Q update: take a fresh copy of Q every time_period_to_update_target_Q steps
        if time_log_to_update_target_Q % time_period_to_update_target_Q == 0:
            Q_target = Q.copy()

        # if game is not over, continue playing game
        if (s1 == 3) or (s1 == 6):
            done = True
        else:
            s = s1
            a = a1

print(Q)

[[  3.88294107e-02   5.80533475e-01   6.47216574e-02   2.03804496e-02]
 [  7.88383282e-02   7.38173943e-01   1.15983541e-01   1.49778303e-01]
 [  9.85908289e-02   8.06146385e-01   2.28119229e-01   5.76145462e-02]
 [  1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00]
 [  2.50531768e-02   1.16886068e-02   3.60279813e-01  -9.73112744e-03]
 [  1.18801188e-01  -2.32804519e-01   7.56112831e-01   2.60616980e-02]
 [ -1.00000000e+00  -1.00000000e+00  -1.00000000e+00  -1.00000000e+00]
 [ -1.34181063e-02   3.97168593e-04   8.47549702e-02  -1.29663057e-02]
 [ -6.25033897e-04   1.61342294e-01  -7.03326851e-03  -2.17996660e-03]
 [ -1.70211389e-03  -4.24904390e-02   3.65520181e-01   2.95034345e-02]
 [  1.55335945e-01   6.53493260e-04  -2.65430759e-01   5.09829324e-03]]
[[ 0.64057426  0.66841556  0.64109863  0.61962404]
 [ 0.63909739  0.70765271  0.66796135  0.67067979]
 [ 0.66830872  0.76361993  0.71434224  0.29980248]
 [ 1.          1.          1.          1.        ]
 [ 0.61853367 

# Dueling DQN



$$\begin{array}{ccccccccc}
Q(s,a)&=&V(s)&+&A(s,a)\\
\uparrow&&\uparrow&&\uparrow\\
\mbox{$Q$ function}&&\mbox{value function}&&\mbox{advantage function}\\
\end{array}$$


<div align="center"><img src="img/1-N_t9I7MeejAoWlDuH1i7cw.png" width="30%" height="10%"></div>

https://medium.com/@awjuliani/simple-reinforcement-learning-with-tensorflow-part-4-deep-q-networks-and-beyond-8438a3e2b8df
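
The decomposition above can be sketched in NumPy. This is a minimal illustration under assumptions: the values in `V` and `A` are made up, and the mean-subtraction step follows the aggregation proposed in the Dueling DQN paper (Wang et al., 2016) rather than anything shown in this section. Subtracting the mean advantage makes the split identifiable, because otherwise a constant could be moved between V and A without changing Q.

```python
import numpy as np

# hypothetical outputs of the two streams for 3 states and 4 actions
V = np.array([[0.5], [0.8], [-0.2]])              # value stream, shape (3, 1)
A = np.array([[0.1,  0.3, -0.2,  0.0],
              [0.0,  0.0,  0.4, -0.4],
              [0.2, -0.1, -0.1,  0.0]])           # advantage stream, shape (3, 4)

# plain sum, as in Q(s,a) = V(s) + A(s,a); V broadcasts across the action axis
Q_plain = V + A

# mean-subtracted aggregation: Q(s,a) = V(s) + (A(s,a) - mean over a of A(s,a))
Q_dueling = V + (A - A.mean(axis=1, keepdims=True))

print(Q_dueling)
print(Q_dueling.argmax(axis=1))
```

With this aggregation the mean of each row of `Q_dueling` equals the corresponding entry of `V`, and the greedy action in every state is the same as under the plain sum.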