# Q-learning
## Brief
Suppose a building has five rooms connected by doors, as shown in the figure below. The rooms are numbered 0 through 4, and the outside of the building is treated as one large room, numbered 5. Rooms 1 and 4 have doors that lead directly outside (room 5). The task is to place an agent in any room and have it find its way out of the building. In other words, room 5 is the goal state.
### Map
![map](map.jpg)
### Graph
![Graph](graph.jpg)
### Reference
[Reference](http://mnemstudio.org/path-finding-q-learning-tutorial.htm)

## Import

In [1]:
import numpy as np
import random

## Initialize

In [2]:
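#Q[s,a] holds the learned value of moving from room s to room a; it starts at zero
Q=np.zeros([6,6])
#R[s,a]: -1 = no door between s and a, 0 = door, 100 = door that leads to the goal (room 5)
R=np.array([[-1,-1,-1,-1,0,-1],
           [-1,-1,-1,0,-1,100],
           [-1,-1,-1,0,-1,-1],
           [-1,0,0,-1,0,-1],
           [0,-1,-1,0,-1,100],
           [-1,0,-1,-1,0,100]])
print("Q:\n{}\nR:\n{}".format(Q,R))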

Q:
[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]
R:
[[ -1  -1  -1  -1   0  -1]
 [ -1  -1  -1   0  -1 100]
 [ -1  -1  -1   0  -1  -1]
 [ -1   0   0  -1   0  -1]
 [  0  -1  -1   0  -1 100]
 [ -1   0  -1  -1   0 100]]
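
The reward matrix encodes the room graph directly, so it can also be generated from a list of doors instead of being typed by hand. The sketch below is one way to do that. The door list is read off the non-negative entries of `R` above rather than from the figure, and the names `DOORS`, `GOAL`, and `build_reward_matrix` are hypothetical helpers, not part of the original notebook.

```python
import numpy as np

# Doors as undirected room pairs, inferred from the non-negative entries of R
# (an assumption: the figure itself is not available here).
DOORS = [(0, 4), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)]
GOAL = 5

def build_reward_matrix(doors, n_rooms, goal, goal_reward=100):
    """Build R where -1 = no door, 0 = door, goal_reward = move into the goal."""
    R = np.full((n_rooms, n_rooms), -1, dtype=int)
    for a, b in doors:
        # Every door can be used in both directions.
        for src, dst in ((a, b), (b, a)):
            R[src, dst] = goal_reward if dst == goal else 0
    # Staying at the goal is also rewarded, matching R[5, 5] above.
    R[goal, goal] = goal_reward
    return R

R_built = build_reward_matrix(DOORS, n_rooms=6, goal=GOAL)
print(R_built)
```

Under these assumptions `R_built` should reproduce the hand-written `R`, which makes it easier to try a different floor plan by editing only the door list.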


In [3]:
def train(R,targetState,Q=None,n_episode=20,learningRate=0.2,gamma=0.8,printInterval=None,verbose=False):
    rShape=np.shape(R)
    if rShape[0]!=rShape[1]:
        raise ValueError("The number of rows and columns in R must match.")
    #Use "is None": "Q==None" is evaluated element-wise when Q is a NumPy array
    if Q is None:
        Q=np.zeros([rShape[0],rShape[1]])
    else:
        qShape=np.shape(Q)
        if qShape[0]!=rShape[0] or qShape[1]!=rShape[1]:
            raise ValueError("The shapes of Q and R must match.")
    for episode in range(n_episode):
        #Each episode starts in a random room and ends when the target is reached
        state=random.randint(0,rShape[0]-1)
        while state!=targetState:
            #Q(S,A) ← (1-α)*Q(S,A) + α*[R(S,A) + γ*max_a Q(S',a)]
            #Valid actions are the rooms reachable from the current one (R>=0); pick one uniformly at random
            actionSet=np.argwhere(R[state]>=0)
            action=actionSet[random.randint(0,np.shape(actionSet)[0]-1),0]
            if verbose and printInterval is not None and (episode+1)%printInterval==0:
                print("At episode {}\nState:{}\nAction:{}\nQ:\n{}:".format(episode+1,state,action,Q))
            #The action is the next room, so S'=action; take the largest Q-value there (np.max), not its index (np.argmax)
            Q[state,action]=(1-learningRate)*Q[state,action]+learningRate*(R[state,action]+gamma*np.max(Q[action]))
            state=action
        if printInterval is not None and (episode+1)%printInterval==0:
            print("At episode {}\nQ:\n{}".format(episode+1,Q))
    return Q
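
To make the update rule in the comment concrete, here is a minimal sketch of a single update taken in isolation. The helper `q_update` is a hypothetical name introduced only for this illustration. It applies the same formula to one (state, action) pair, assuming, as in this problem, that the action is the room moved to and therefore also the next state.

```python
import numpy as np

def q_update(Q, R, state, action, alpha, gamma):
    """Return the new value for Q[state, action] without modifying Q."""
    # The action is the next room, so the best follow-up value is the row maximum.
    best_next = np.max(Q[action])
    return (1 - alpha) * Q[state, action] + alpha * (R[state, action] + gamma * best_next)

# Reward matrix from the Initialize cell.
R = np.array([[-1, -1, -1, -1, 0, -1],
              [-1, -1, -1, 0, -1, 100],
              [-1, -1, -1, 0, -1, -1],
              [-1, 0, 0, -1, 0, -1],
              [0, -1, -1, 0, -1, 100],
              [-1, 0, -1, -1, 0, 100]])
Q = np.zeros((6, 6))

# First move from room 4 to the outside (room 5) while Q is still all zeros.
full_step = q_update(Q, R, state=4, action=5, alpha=1.0, gamma=0.8)   # 100.0
small_step = q_update(Q, R, state=4, action=5, alpha=0.2, gamma=0.8)  # about 20
print(full_step, small_step)
```

With a learning rate of 1 the old estimate is discarded and the new value is simply the reward plus the discounted best next value. With a smaller learning rate the estimate moves only part of the way towards that target on each visit.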

## Training #1

In [4]:
#With learningRate=1 the update reduces to:
#Q(state, action) = R(state, action) + gamma * max[Q(next state, all actions)]
Q=train(R,targetState=5,learningRate=1,gamma=0.8,printInterval=1,verbose=True)

At episode 1
State:4
Action:5
Q:
[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]:
At episode 1
Q:
[[   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.  100.]
 [   0.    0.    0.    0.    0.    0.]]
At episode 2
Q:
[[   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.  100.]
 [   0.    0.    0.    0.    0.    0.]]
At episode 3
State:3
Action:1
Q:
[[   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.    0.]
 [   0.    0.    0.    0.    0.  100.]
 [   0.    0.    0.    0.    0.    0.]]:
At episode 3
State:1
Action:3
Q:

## Online inference

In [5]:
for state in range(5):
    print("Initial state:{}".format(state))
    while state!=5:
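        #Greedy step: pick a highest-valued action, breaking ties at random.
        #Actions are not checked against R, so an all-zero row of Q could select a move through a wall.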
        actionSet=np.argwhere(Q[state]==np.max(Q[state]))
        action=actionSet[random.randint(0,np.shape(actionSet)[0]-1),0]
        print("Moving from room {} to room {}.".format(state,action))
        state=action

Initial state:0
Moving from room 0 to room 4.
Moving from room 4 to room 5.
Initial state:1
Moving from room 1 to room 5.
Initial state:2
Moving from room 2 to room 3.
Moving from room 3 to room 4.
Moving from room 4 to room 5.
Initial state:3
Moving from room 3 to room 4.
Moving from room 4 to room 5.
Initial state:4
Moving from room 4 to room 5.
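
The inference loop above does not validate actions and has no step limit, so it relies on `Q` being well trained. The sketch below shows one possible guarded version. The names `greedy_path` and `q_value_sweeps` are hypothetical. `q_value_sweeps` is a deterministic stand-in for `train` with a learning rate of 1 (it sweeps every valid move instead of sampling episodes) and is used here only so the example is reproducible.

```python
import numpy as np

# Reward matrix from the Initialize cell: -1 marks "no door".
R = np.array([[-1, -1, -1, -1, 0, -1],
              [-1, -1, -1, 0, -1, 100],
              [-1, -1, -1, 0, -1, -1],
              [-1, 0, 0, -1, 0, -1],
              [0, -1, -1, 0, -1, 100],
              [-1, 0, -1, -1, 0, 100]])

def q_value_sweeps(R, target, gamma=0.8, sweeps=50):
    """Deterministic stand-in for train(): update every valid (state, action) pair."""
    Q = np.zeros(R.shape, dtype=float)
    for _ in range(sweeps):
        for s in range(R.shape[0]):
            if s == target:
                continue  # like train(), never update from the target state
            for a in np.flatnonzero(R[s] >= 0):
                Q[s, a] = R[s, a] + gamma * Q[a].max()
    return Q

def greedy_path(Q, R, start, target, max_steps=20):
    """Follow the highest-valued valid action from start until target is reached."""
    path = [start]
    state = start
    while state != target:
        if len(path) > max_steps:
            raise RuntimeError("No path found within max_steps; Q may be undertrained.")
        valid = np.flatnonzero(R[state] >= 0)  # only moves through existing doors
        if valid.size == 0:
            raise RuntimeError("Room {} has no doors.".format(state))
        # Ties go to the lowest room number, unlike the random tie-break above.
        action = int(valid[np.argmax(Q[state, valid])])
        path.append(action)
        state = action
    return path

Q_demo = q_value_sweeps(R, target=5)
for start in range(5):
    print(start, greedy_path(Q_demo, R, start, target=5))
```

Restricting the choice to entries where `R` is non-negative keeps the agent from walking through walls when several Q-values are still zero, and the step limit turns a possible infinite loop into an explicit error.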


## Training #2

In [6]:
#Q(S,A) ← (1-α)*Q(S,A) + α*[R + γ*maxQ(S',a)]
Q=train(R,learningRate=0.2,gamma=0.8,targetState=5,printInterval=1,verbose=True)

At episode 1
State:1
Action:3
Q:
[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]:
At episode 1
State:3
Action:1
Q:
[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]:
At episode 1
State:1
Action:3
Q:
[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]:
At episode 1
State:3
Action:1
Q:
[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]:
At episode 1
State:1
Action:3
Q:
[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]:
At episode 1
St

## Online inference

In [7]:
for state in range(5):
    print("Initial state:{}".format(state))
    while state!=5:
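        #Greedy step: pick a highest-valued action, breaking ties at random.
        #Actions are not checked against R, so an all-zero row of Q could select a move through a wall.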
        actionSet=np.argwhere(Q[state]==np.max(Q[state]))
        action=actionSet[random.randint(0,np.shape(actionSet)[0]-1),0]
        print("Moving from room {} to room {}.".format(state,action))
        state=action

Initial state:0
Moving from room 0 to room 4.
Moving from room 4 to room 5.
Initial state:1
Moving from room 1 to room 5.
Initial state:2
Moving from room 2 to room 3.
Moving from room 3 to room 1.
Moving from room 1 to room 5.
Initial state:3
Moving from room 3 to room 1.
Moving from room 1 to room 5.
Initial state:4
Moving from room 4 to room 5.
