# Q learning
## Brief
Suppose we have 5 rooms in a building connected by doors as shown in the figure below.  We'll number each room 0 through 4.  The outside of the building can be thought of as one big room (5).  Notice that doors 1 and 4 lead into the building from room 5 (outside). For this example, we'd like to put an agent in any room, and from that room, go outside the building (this will be our target room). In other words, the goal room is number 5. 
### Map
![map](map.gif)
### Graph
![Graph](graph.gif)
### Reference
[Reference](http://mnemstudio.org/path-finding-q-learning-tutorial.htm)

## Import

In [1]:
import numpy as np
import random

## Initialize

In [2]:
Q=np.zeros([6,6])
R=np.array([[-1,-1,-1,-1,0,-1],
           [-1,-1,-1,0,-1,100],
           [-1,-1,-1,0,-1,-1],
           [-1,0,0,-1,0,-1],
           [0,-1,-1,0,-1,100],
            [-1,0,-1,-1,0,100]])
print("Q:\n{}\nR:\n{}".format(Q,R))

Q:
[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]
R:
[[ -1  -1  -1  -1   0  -1]
 [ -1  -1  -1   0  -1 100]
 [ -1  -1  -1   0  -1  -1]
 [ -1   0   0  -1   0  -1]
 [  0  -1  -1   0  -1 100]
 [ -1   0  -1  -1   0 100]]


## Training #1

In [3]:
n_episode=20
gamma=0.8
for episode in range(n_episode):
    state=random.randint(0,5)
    while state!=5:
        #Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
        actionSet=np.argwhere(R[state]>=0)
        action=actionSet[random.randint(0,np.shape(actionSet)[0]-1),0]
        Q[state,action]=R[state,action]+gamma*np.argmax(Q[action])
        state=action
    if (episode+1)%2==0:
        print("At episode {}\nQ:\n{}".format(episode+1,Q))

At episode 2
Q:
[[   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.  100.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.  100.]
 [   0.    0.    0.    0.    0.    0.]]
At episode 4
Q:
[[   0.     0.     0.     0.     0.     0. ]
 [   0.     0.     0.     0.     0.   100. ]
 [   0.     0.     0.     0.8    0.     0. ]
 [   0.     4.     0.     0.     4.     0. ]
 [   0.     0.     0.     0.     0.   100. ]
 [   0.     0.     0.     0.     0.     0. ]]
At episode 6
Q:
[[   0.     0.     0.     0.     4.     0. ]
 [   0.     0.     0.     0.     0.   100. ]
 [   0.     0.     0.     0.8    0.     0. ]
 [   0.     4.     2.4    0.     4.     0. ]
 [   3.2    0.     0.     0.8    0.   100. ]
 [   0.     0.     0.     0.     0.     0. ]]
At episode 8
Q:
[[   0.     0.     0.     0.     4.     0. ]
 [   0.     0.     0.     0.     0.   100. ]
 [   0.     0.     0.     0.8    0.     0. ]
 [   0.     4.     2.4  

## Online Inference

In [4]:
for state in range(5):
    print("Initial state:{}".format(state))
    while state!=5:
        #Didn't bother to validate the action
        actionSet=np.argwhere(Q[state]==np.max(Q[state]))
        action=actionSet[random.randint(0,np.shape(actionSet)[0]-1),0]
        print("Moving from room {} to room {}.".format(state,action))
        state=action

Initial state:0
Moving from room 0 to room 4.
Moving from room 4 to room 5.
Initial state:1
Moving from room 1 to room 5.
Initial state:2
Moving from room 2 to room 3.
Moving from room 3 to room 1.
Moving from room 1 to room 5.
Initial state:3
Moving from room 3 to room 1.
Moving from room 1 to room 5.
Initial state:4
Moving from room 4 to room 5.


## Training #2

In [5]:
n_episode=20
alpha=0.2
gamma=0.8
Q=np.zeros([6,6])
for episode in range(n_episode):
    state=random.randint(0,5)
    while state!=5:
        #Q(S,A) ← (1-α)*Q(S,A) + α*[R + γ*maxQ(S',a)] 
        actionSet=np.argwhere(R[state]>=0)
        action=actionSet[random.randint(0,np.shape(actionSet)[0]-1),0]
        Q[state,action]=(1-alpha)*Q[state,action]+alpha*(R[state,action]+gamma*np.argmax(Q[action]))
        state=action
    if (episode+1)%2==0:
        print("At episode {}\nQ:\n{}".format(episode+1,Q))

At episode 2
Q:
[[  0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.  20.]
 [  0.   0.   0.   0.   0.   0.]]
At episode 4
Q:
[[  0.     0.     0.     0.     1.44   0.  ]
 [  0.     0.     0.     0.64   0.     0.  ]
 [  0.     0.     0.     0.64   0.     0.  ]
 [  0.     0.     0.     0.     1.44   0.  ]
 [  0.     0.     0.     0.64   0.    48.8 ]
 [  0.     0.     0.     0.     0.     0.  ]]
At episode 6
Q:
[[  0.      0.      0.      0.      1.44    0.   ]
 [  0.      0.      0.      0.64    0.     20.   ]
 [  0.      0.      0.      1.152   0.      0.   ]
 [  0.      0.48    0.48    0.      1.952   0.   ]
 [  0.      0.      0.      1.152   0.     59.04 ]
 [  0.      0.      0.      0.      0.      0.   ]]
At episode 8
Q:
[[  0.        0.        0.        0.        1.952     0.     ]
 [  0.        0.        0.        0.64      0.       48.8    ]
 [  0.        0.        0.       

## Online inference

In [6]:
for state in range(5):
    print("Initial state:{}".format(state))
    while state!=5:
        #Didn't bother to validate the action
        actionSet=np.argwhere(Q[state]==np.max(Q[state]))
        action=actionSet[random.randint(0,np.shape(actionSet)[0]-1),0]
        print("Moving from room {} to room {}.".format(state,action))
        state=action

Initial state:0
Moving from room 0 to room 4.
Moving from room 4 to room 5.
Initial state:1
Moving from room 1 to room 5.
Initial state:2
Moving from room 2 to room 3.
Moving from room 3 to room 1.
Moving from room 1 to room 5.
Initial state:3
Moving from room 3 to room 1.
Moving from room 1 to room 5.
Initial state:4
Moving from room 4 to room 5.
