## Q-Learning Step By Step Example

A simple, step-by-step example of Q-learning on a small 2x2 gridworld problem.

State 0 | State 1
--------|--------
State 2 | State 3

State 0 = Start<br />
State 1 = Safe<br />
State 2 = Hole<br />
State 3 = Goal<br />

From each state the agent can move up, down, left, or right, or stay put; moves that would leave the grid at an edge are excluded.
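The valid moves per state can be derived from the grid geometry instead of being written by hand. The following is a minimal sketch, assuming states are numbered row by row as in the table above and that "stay put" keeps the agent in its current state. The names `ACTIONS`, `next_state`, and `valid_actions_for` are illustrative and not part of the notebook.

```python
# Grid layout assumed from the table above: state = row * N_COLS + col.
N_ROWS, N_COLS = 2, 2

# Action index -> (row delta, col delta), in the order U, D, L, R, N (stay put).
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}


def next_state(state, action):
    """Return the state reached by `action`, or -1 if the move leaves the grid."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N_ROWS and 0 <= new_col < N_COLS:
        return new_row * N_COLS + new_col
    return -1


def valid_actions_for(state):
    """List the actions that keep the agent on the grid."""
    return [a for a in ACTIONS if next_state(state, a) != -1]


for s in range(N_ROWS * N_COLS):
    print(s, valid_actions_for(s))
```

Each state ends up with three valid actions: the two moves that stay on the grid, plus "stay put".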

The hole gives a reward of -10, reaching the goal gives +10, and all other states give a reward of -1.

So the optimal path is 0-1-3.
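This can be checked by comparing the discounted return of the two shortest routes from start to goal. The sketch below assumes the reward is collected on entering a state and uses the discount factor 0.8 that the code sets as `gamma`. The helper `discounted_return` is introduced here for illustration only.

```python
def discounted_return(rewards, gamma=0.8):
    """Sum the rewards along a path, discounting step t by gamma**t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


# Path 0-1-3: enter the safe state (-1), then the goal (+10).
via_safe = discounted_return([-1, 10])

# Path 0-2-3: enter the hole (-10), then the goal (+10).
via_hole = discounted_return([-10, 10])

print("0-1-3:", via_safe)
print("0-2-3:", via_hole)
```

The route through the safe state comes out at about 7 and the route through the hole at about -2, so 0-1-3 is the better of the two.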

In [12]:
import numpy as np
import random
import matplotlib.pyplot as plt

gamma = 0.8

# each matrix below has states as rows, columns in order (U, D, L, R, N) unless otherwise stated

# rewards for each state / action. 0 represents no such transition possible
rewards = np.array([[0, -10, 0, -1, -1],
                    [0, 10, -1, 0, -1],
                    [-1, 0, 0, 10, -1],
                    [-1, 0, -10, 0, 0]])

q_matrix = np.zeros((4,5))

# valid actions for each state encoded as 0=up, 1=down, 2=left, 3=right, 4=no action
valid_actions = np.array([[1, 3, 4],
                          [1, 2, 4],
                          [0, 3, 4],
                          [0, 2, 4]])

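# what states we move to for each state / action. -1 represents invalid transition
# note: as written, the last column (N) moves on to another state (0->1, 1->2, 2->3)
# instead of keeping the current one, and is -1 for state 3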
transition_matrix = np.array([[-1, 2, -1, 1, 1 ],
                              [-1, 3, 0, -1, 2 ],
                              [0, -1, -1, 3, 3 ],
                              [1, -1, 2, -1, -1]])


for i in range(100): # 100 episodes
    current_state = 0
    while current_state != 3:
        # choose a random action - could use epsilon-greedy here
        action = random.choice(valid_actions[current_state])

        # record next state and reward (r, s')
        next_state = transition_matrix[current_state][action]
        reward = rewards[current_state][action]

        # collect the current Q-values of all valid actions in the next state
        future_rewards = []
        for action_next in valid_actions[next_state]:
            future_rewards.append(q_matrix[next_state][action_next])

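        # q update. Note: gamma is added to the best next-state value here;
        # the textbook Q-learning target is reward + gamma * max(future_rewards)
        q_state = reward + gamma + max(future_rewards)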
        q_matrix[current_state][action] = q_state
        print(q_matrix)

        current_state = next_state
        if current_state == 3:
            print('goal state reached')

[[ 0.  -9.2  0.   0.   0. ]
 [ 0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0. ]]
[[ 0.  -9.2  0.   0.   0. ]
 [ 0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.  -0.2]
 [ 0.   0.   0.   0.   0. ]]
goal state reached
[[ 0.  -9.2  0.  -0.2  0. ]
 [ 0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.  -0.2]
 [ 0.   0.   0.   0.   0. ]]
[[ 0.  -9.2  0.  -0.2  0. ]
 [ 0.   0.   0.   0.  -0.2]
 [ 0.   0.   0.   0.  -0.2]
 [ 0.   0.   0.   0.   0. ]]
[[ 0.  -9.2  0.  -0.2  0. ]
 [ 0.   0.   0.   0.  -0.2]
 [ 0.   0.   0.  10.8 -0.2]
 [ 0.   0.   0.   0.   0. ]]
goal state reached
[[ 0.  -9.2  0.  -0.2  0. ]
 [ 0.   0.   0.   0.  -0.2]
 [ 0.   0.   0.  10.8 -0.2]
 [ 0.   0.   0.   0.   0. ]]
[[ 0.  -9.2  0.  -0.2  0. ]
 [ 0.  10.8  0.   0.  -0.2]
 [ 0.   0.   0.  10.8 -0.2]
 [ 0.   0.   0.   0.   0. ]]
goal state reached
[[ 0.  -9.2  0.  10.6  0. ]
 [ 0.  10.8  0.   0.  -0.2]
 [ 0.   0.   0.  10.8 -0.2]
 [ 0.   0.   0.   0.   0. ]]
[[ 0.  -9.2  0.  10.6  0. ]
 [ 0.  10.8

If learning works, the greedy policy should:

1. go right from state 0 (the Q-value in row 1, column 4 should be the highest in that row)
2. go down from state 1 (the Q-value in row 2, column 2 should be the highest in that row)
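For comparison, these expectations can be checked without random exploration by sweeping the textbook target, reward + gamma * max Q(s', a'), over every valid state-action pair until the values stop changing. This is a sketch under assumptions that differ from the cell above: the discount multiplies the best next value, "stay put" leaves the state unchanged, and the goal (state 3) is treated as terminal. The `transitions` table and `q_star` are names introduced here.

```python
import numpy as np

gamma = 0.8
GOAL = 3

# Rewards as in the notebook; columns are (U, D, L, R, N).
rewards = np.array([[0, -10, 0, -1, -1],
                    [0, 10, -1, 0, -1],
                    [-1, 0, 0, 10, -1],
                    [-1, 0, -10, 0, 0]])

# Assumed transitions: -1 marks a move off the grid, N keeps the same state.
transitions = np.array([[-1, 2, -1, 1, 0],
                        [-1, 3, 0, -1, 1],
                        [0, -1, -1, 3, 2],
                        [1, -1, 2, -1, 3]])

q_star = np.zeros((4, 5))
for _ in range(100):  # far more sweeps than this tiny grid needs
    for state in range(4):
        if state == GOAL:
            continue  # terminal state: its Q-values stay at zero
        for action in range(5):
            nxt = transitions[state, action]
            if nxt == -1:
                continue  # skip moves that leave the grid
            valid_next = transitions[nxt] != -1
            best_next = q_star[nxt][valid_next].max()
            q_star[state, action] = rewards[state, action] + gamma * best_next

# Mask invalid moves so they can never be picked as the best action.
masked = np.where(transitions != -1, q_star, -np.inf)

print(np.round(q_star, 2))
print("best action in state 0:", int(np.argmax(masked[0])))
print("best action in state 1:", int(np.argmax(masked[1])))
```

Under these assumptions the best action is index 3 (right) in state 0 and index 1 (down) in state 1, which matches the two expectations listed above.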

In [13]:
print("Final q-matrix")
print(q_matrix)

Final q-matrix
[[ 0.   1.6  0.  10.6 10.6]
 [ 0.  10.8 10.4  0.  10.6]
 [10.4  0.   0.  10.8 -0.2]
 [ 0.   0.   0.   0.   0. ]]
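
To check the two expectations against the printed values, the greedy action for each state can be read off by taking the largest Q-value among that state's valid actions. The sketch below copies the matrix printed above. `greedy_action` and `ACTION_NAMES` are illustrative names, and ties are resolved in favour of the first valid action listed.

```python
import numpy as np

# Final Q-matrix as printed above; columns are (U, D, L, R, N).
q_matrix = np.array([[0.0, 1.6, 0.0, 10.6, 10.6],
                     [0.0, 10.8, 10.4, 0.0, 10.6],
                     [10.4, 0.0, 0.0, 10.8, -0.2],
                     [0.0, 0.0, 0.0, 0.0, 0.0]])

# Valid actions per state, as defined in the notebook.
valid_actions = np.array([[1, 3, 4],
                          [1, 2, 4],
                          [0, 3, 4],
                          [0, 2, 4]])

ACTION_NAMES = ["up", "down", "left", "right", "stay"]


def greedy_action(q, state):
    """Return the valid action with the largest Q-value (first one on ties)."""
    candidates = valid_actions[state]
    return int(candidates[np.argmax(q[state, candidates])])


for state in range(3):  # state 3 is the goal, so no action is needed there
    print(state, ACTION_NAMES[greedy_action(q_matrix, state)])
```

In state 0 the values for "right" and "stay" tie at 10.6, so the choice there depends on the tie-break. State 1 clearly prefers "down".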
