## Tower of Hanoi Using Q-Learning

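The Tower of Hanoi puzzle starts with a stack of disks of different sizes on one of three rods, with the smallest disk on top. This notebook uses three disks. The objective is to move the entire stack to another rod, obeying the following rules: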
1. Only one disk can be moved at a time.
2. Each move consists of taking the upper disk from one of the stacks and placing it on top of another stack.
3. No disk may be placed on top of a smaller disk.

<img src="files/towers-of-hanoi.png" width="60%" height="60%">
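
The three rules can be sketched as a small function. This is only an illustration, not part of the notebook's pipeline: it assumes the state notation that appears in the path printed in the Testing section (one string per rod, top disk first, `'*'` for an empty rod), and the names `SIZE` and `move` are hypothetical.

```python
# Relative disk sizes (assumed labels: S = small, M = medium, L = large)
SIZE = {"S": 0, "M": 1, "L": 2}

def move(state, src, dst):
    """Return the state after moving the top disk of rod src onto rod dst,
    or None if the move breaks one of the rules."""
    rods = ["" if rod == "*" else rod for rod in state]
    if src == dst or not rods[src]:
        return None  # there is no disk to move
    disk = rods[src][0]  # rule 2: only the top disk (first character) moves
    if rods[dst] and SIZE[disk] > SIZE[rods[dst][0]]:
        return None  # rule 3: a disk may not sit on a smaller one
    rods[src] = rods[src][1:]
    rods[dst] = disk + rods[dst]
    return tuple(rod or "*" for rod in rods)

print(move(("SML", "*", "*"), 0, 2))  # ('ML', '*', 'S')
print(move(("ML", "*", "S"), 0, 2))   # None, since M cannot go on top of S
```

Returning `None` for an illegal move keeps the sketch short; a caller could collect all legal successors of a state by trying every `(src, dst)` pair.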

In [1]:
# Import libraries
import numpy as np
import pickle
import time

In [2]:
# Load reward matrix "R" from file
R = np.matrix(np.genfromtxt('R_3d.csv',delimiter=','))

# Load states
with open('comb_3d', 'rb') as fp:
    comb = pickle.load(fp)

[__Click here to see how states and matrix R are generated__](solution-space-3-disks.ipynb)
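
For readers without the linked notebook, the following is a minimal sketch of how a state list and a reward matrix of this shape could be built. It is not the author's generator. It assumes that illegal transitions are marked `-1`, legal moves `0`, and moves into the goal `100`, and that the goal state also has a rewarded transition to itself. The function `build_reward_matrix` and its state encoding (one rod index per disk, smallest disk first) are hypothetical, and the resulting state order is not guaranteed to match `comb`.

```python
import itertools
import numpy as np

def build_reward_matrix(n_disks=3, n_rods=3, goal_reward=100):
    # A state assigns a rod (0, 1, 2) to every disk, smallest disk first.
    states = list(itertools.product(range(n_rods), repeat=n_disks))
    index = {s: i for i, s in enumerate(states)}
    goal = index[(n_rods - 1,) * n_disks]  # all disks on the last rod

    # Start with every transition marked as unavailable.
    R = -np.ones((len(states), len(states)))
    for s in states:
        for disk in range(n_disks):
            smaller_rods = set(s[:disk])  # rods holding a smaller disk
            if s[disk] in smaller_rods:
                continue  # a smaller disk lies on top, so this disk is stuck
            for target in range(n_rods):
                if target == s[disk] or target in smaller_rods:
                    continue  # no move, or target holds a smaller disk
                nxt = s[:disk] + (target,) + s[disk + 1:]
                R[index[s], index[nxt]] = goal_reward if index[nxt] == goal else 0
    # Assumption: staying in the goal state is allowed and rewarded.
    R[goal, goal] = goal_reward
    return states, R

states, R_sketch = build_reward_matrix()
print(len(states), R_sketch.shape)  # 27 (27, 27)
```

With three disks on three rods there are 3^3 = 27 states, which matches the 27 columns of the trained Q matrix printed in the Training section.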

In [3]:
# Initialize "Q" matrix to zero
Q = np.matrix(np.zeros([R.shape[0],R.shape[0]]))

# Gamma (discount factor): how strongly future rewards count
gamma = 0.8

# Initial state (usually chosen at random)
initial_state = 0

In [4]:
# This function returns all available actions in the state given as an argument
def available_actions(state):
    current_state_row = R[state,]
    av_act = np.where(current_state_row >= 0)[1]
    return av_act

In [5]:
# This function picks one action at random from the available actions.
def sample_next_action(available_actions_range):
    # Without a size argument, np.random.choice returns a single value
    next_action = int(np.random.choice(available_actions_range))
    return next_action

In [6]:
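# This function updates the Q matrix for one move, following the Q-learning rule.
# Each action is the index of the state it leads to, so row "action" of Q holds
# the values of the moves available from the next state.
def update(current_state, action, gamma):
    # Best value reachable from the next state
    max_value = np.max(Q[action,])

    # Q-learning formula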
    Q[current_state, action] = R[current_state, action] + gamma * max_value
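
Written as an equation, the assignment in `update` is the Q-learning rule with the learning rate fixed at 1, which is sufficient here because moves and rewards are deterministic:

$$
Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')
$$

Here $s$ is `current_state`, $a$ is `action`, and $\gamma$ is `gamma`. Because an action is identified by the index of the state it leads to, $s' = a$, which is why the code takes the maximum over row `action` of `Q`. For comparison, the general rule with a learning rate $\alpha$ (a symbol not used in this notebook) is

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

and setting $\alpha = 1$ recovers the first form.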

### Training

In [7]:
# Training
t = time.time()
# Train over 1000 episodes, each starting from a random state.
for i in range(1000):
    target = False
    current_state = np.random.randint(0, int(Q.shape[0]))
    
    while not target:    
        available_act = available_actions(current_state)
        action = sample_next_action(available_act)
        update(current_state,action,gamma)
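        current_state = action
        # End the episode once the new state's self-transition has Q >= 100,
        # which presumably holds only for the goal state.
        if Q[current_state,action] >= 100:
            target = True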

print("Elapsed time: {}, s".format(time.time()-t))

# Print the "trained" Q matrix, normalized so that its largest entry is 100
print("\nTrained Q matrix:")
print(Q/np.max(Q)*100)
# Save the unnormalized Q matrix
np.savetxt("Q_3d.csv", Q, delimiter=",")

Elapsed time: 5.49099993706, s

Trained Q matrix:
[[   0.          26.2144      20.97152      0.           0.           0.
     0.           0.           0.           0.           0.           0.
     0.           0.           0.           0.           0.           0.
     0.           0.           0.           0.           0.           0.
     0.           0.           0.       ]
 [  20.97152      0.          20.97152      0.           0.          32.768
     0.           0.           0.           0.           0.           0.
     0.           0.           0.           0.           0.           0.
     0.           0.           0.           0.           0.           0.
     0.           0.           0.       ]
 [  20.97152     26.2144       0.          20.97152      0.           0.
     0.           0.           0.           0.           0.           0.
     0.           0.           0.           0.           0.           0.
     0.           0.           0.           0.           0. 
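
The nonzero entries printed above follow a simple pattern: each one equals 100 times `gamma` raised to the number of moves still needed after that move to reach the goal. The short check below reproduces the three distinct values visible in the output. The helper `normalized_q` is hypothetical and assumes the Q matrix has converged.

```python
gamma = 0.8  # discount factor used during training

def normalized_q(moves_after):
    """Normalized Q value of a move after which `moves_after` further
    moves remain on a shortest route to the goal."""
    return 100 * gamma ** moves_after

for k in (5, 6, 7):
    print(k, round(normalized_q(k), 5))
# 5 32.768
# 6 26.2144
# 7 20.97152
```

For example, the entry 26.2144 in row 0, column 1 corresponds to a move after which six more moves are needed, and the entry 32.768 in row 1, column 5 to a move after which five remain.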

### Testing

In [8]:
# Goal state = 26
current_state = 0
steps = [current_state]

while current_state != 26:

    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    
    if next_step_index.shape[0] > 1:
        # Several actions tie for the best value: pick one at random
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])
    
    steps.append(next_step_index)
    current_state = next_step_index

# Print selected sequence of steps
print("Selected path:")
for ix, state in enumerate(steps):
    print("{:2d}: state {:2d} --> {}".format(ix, state, comb[state]))

Selected path:
 0: state  0 --> ('SML', '*', '*')
 1: state  1 --> ('ML', '*', 'S')
 2: state  5 --> ('L', 'M', 'S')
 3: state  9 --> ('L', 'SM', '*')
 4: state 12 --> ('*', 'SM', 'L')
 5: state 18 --> ('S', 'M', 'L')
 6: state 24 --> ('S', '*', 'ML')
 7: state 26 --> ('*', '*', 'SML')
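
The test loop above runs until state 26 is reached, so it would not terminate if the Q matrix were undertrained. A possible safeguard is sketched below as a standalone function with a step limit. The function name `greedy_path`, the `max_steps` parameter, and the toy matrix are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

def greedy_path(Q, start, goal, max_steps=100):
    """Follow the highest-valued action from start until goal is reached."""
    Q = np.asarray(Q)  # also accepts np.matrix input
    path = [start]
    state = start
    while state != goal:
        if len(path) > max_steps:
            raise RuntimeError("goal not reached; Q may be undertrained")
        # Indices of all actions that share the best value in this row
        best = np.flatnonzero(Q[state] == Q[state].max())
        state = int(np.random.choice(best))  # break ties at random
        path.append(state)
    return path

# Toy 3-state chain 0 -> 1 -> 2 (not the puzzle's Q matrix)
toy_Q = np.array([[0, 80, 0],
                  [64, 0, 100],
                  [0, 80, 100]])
print(greedy_path(toy_Q, start=0, goal=2))  # [0, 1, 2]
```

With the trained matrix from this notebook, the equivalent call would be `greedy_path(Q, 0, 26)`.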


### From string to number

In [9]:
print(comb.index(('SML', '*', '*')))

0
