# Reinforcement Learning Solution to the Towers of Hanoi Puzzle

For this assignment, you will use reinforcement learning to solve the [Towers of Hanoi](https://en.wikipedia.org/wiki/Tower_of_Hanoi) puzzle.  

To accomplish this, you must modify the code discussed in lecture for learning to play Tic-Tac-Toe so that it learns to solve the three-disk, three-peg
Towers of Hanoi puzzle.  In some ways, this will be simpler than the
Tic-Tac-Toe code.  

Steps required to do this include the following:

  - Represent the state, and convert it (together with the move) to a tuple so it can be used as a key to the Q dictionary.
  - Make sure only valid moves are tried from each state.
  - Assign reinforcement of $-1$ to each move unless it is a move to the goal state, for which the reinforcement is $0$.  This represents the goal of finding the shortest path to the goal.
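
The reinforcement scheme in the last step above can be sketched as a single temporal-difference update. This is only a minimal sketch, not the lecture code: `tdUpdate` and its arguments are hypothetical names, `Q` is a plain dictionary, unseen entries are assumed to default to 0, and no discounting is applied.

```python
def tdUpdate(Q, keyOld, keyNew, reachedGoal, rho):
    """Hypothetical helper: one SARSA-style update of Q[keyOld].

    keyOld is the (state, move) pair just taken and keyNew is the pair
    chosen next. The reinforcement r is 0 for a move that reaches the
    goal and -1 for every other move.
    """
    r = 0 if reachedGoal else -1
    # Nothing follows once the goal is reached, so the next value is 0
    qNext = 0 if reachedGoal else Q.get(keyNew, 0)
    qOld = Q.get(keyOld, 0)
    Q[keyOld] = qOld + rho * (r + qNext - qOld)
    return Q[keyOld]

Q = {}
tdUpdate(Q, 'a', 'b', reachedGoal=False, rho=0.5)  # Q['a'] becomes -0.5
tdUpdate(Q, 'b', None, reachedGoal=True, rho=0.5)  # Q['b'] stays at 0.0
```

With this convention each Q value estimates the negative of the number of non-goal moves still to be made, so picking the largest Q value favors the shortest path.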

Make a plot of the number of steps required to reach the goal for each
trial.  Each trial starts from the same initial state.  Decay epsilon
as in the Tic-Tac-Toe code.
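
As a rough illustration of the decay, the sketch below multiplies epsilon by a fixed factor at the start of each trial, which is what the `trainQ` code further down does. The factor 0.7 is just the example value used in the `trainQ` calls in this notebook, and the variable names are illustrative.

```python
epsilonDecayFactor = 0.7  # example value only
epsilon = 1.0
epsilons = []
for repetition in range(10):
    epsilon *= epsilonDecayFactor  # decay at the start of each repetition
    epsilons.append(epsilon)

# After n repetitions epsilon equals epsilonDecayFactor ** n,
# so with 0.7 it is already below 0.03 after ten repetitions.
print(epsilons[-1])
```

A factor this small means almost all exploration happens in the first handful of trials; a factor closer to 1 keeps random moves around for longer.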

## Requirements

First, how should we represent the state of this puzzle?  We need to keep track of which disks are on which pegs. Name the disks 1, 2, and 3, with 1 being the smallest disk and 3 being the largest. The set of disks on a peg can be represented as a list of integers.  Then the state can be a list of three lists.

For example, the starting state with all disks being on the left peg would be `[[1, 2, 3], [], []]`.  After moving disk 1 to peg 2, we have `[[2, 3], [1], []]`.

To represent that move we just made, we can use a list of two peg numbers, like `[1, 2]`, representing a move of the top disk on peg 1 to peg 2.

Now on to some functions. Define at least the following functions. Examples showing required output appear below.

   - `printState(state)`: prints the state in the form shown below
   - `validMoves(state)`: returns list of moves that are valid from `state`
   - `makeMove(state, move)`: returns new (copy of) state after move has been applied.
   - `trainQ(nRepetitions, learningRate, epsilonDecayFactor)`: train the Q function for number of repetitions, decaying epsilon at start of each repetition. Returns Q and list or array of steps to reach goal for each repetition.
   - `testQ(Q, maxSteps)`: without updating Q, use Q to find greedy action each step until goal is found. Return path of states.

A function that you might choose to implement is

   - `stateMoveTuple(state, move)`: returns tuple of state and move.  
    
This is useful for converting state and move to a key to be used for the Q dictionary.
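
The conversion is needed because Python lists are mutable and therefore cannot be hashed. The following small sketch shows the failure and one possible tuple-based key; the variable names are illustrative only.

```python
state = [[1, 2, 3], [], []]
move = [1, 2]

try:
    {}[(state, move)] = 0  # a tuple that contains lists is still unhashable
except TypeError as err:
    print('cannot use lists as a key: %s' % err)

# Converting every peg, the state, and the move to tuples gives a hashable key
key = (tuple(tuple(peg) for peg in state), tuple(move))
Q = {key: 0}
```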

# Examples (more coming soon)

To make it easier to visualize the state of the game, the function *printState* is used.

In [57]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import copy

In [37]:
def printState(state):
    # Pad the top of each peg with blanks so all three columns have height 3
    padded = [[' '] * (3 - len(peg)) + list(peg) for peg in state]
    # Print one row at a time, top row first
    for row in range(3):
        print(' '.join(str(padded[p][row]) for p in range(3)))
    print('------')

In [38]:
state = [[1, 2,3], [], []]
printState(state)


1    
2    
3    
------


The function *validMoves* returns the list of moves that are valid from a given state.
Each move in the returned list is a pair `[source peg, destination peg]`, meaning that the top disk on the source peg is moved to the destination peg.

The following is the list of rules:
- Only one disk can be moved at a time.
- Each move consists of taking the upper disk from one of the stacks and placing it on top of another stack.
- No disk may be placed on top of a smaller disk.
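
The rules above can be checked for a single move as in the sketch below. `isLegalMove` is a hypothetical helper, not one of the required functions; it assumes the state layout used in this notebook, with the top disk of each peg at index 0 and pegs numbered from 1.

```python
def isLegalMove(state, move):
    """Hypothetical helper: check one [source peg, destination peg] move."""
    source, dest = move[0] - 1, move[1] - 1
    if source == dest or not state[source]:
        return False  # no change, or nothing to move
    if not state[dest]:
        return True  # any disk may be placed on an empty peg
    return state[source][0] < state[dest][0]  # top disk must be the smaller one

state = [[1], [2], [3]]
legal = [[p, q] for p in (1, 2, 3) for q in (1, 2, 3) if isLegalMove(state, [p, q])]
print(legal)
```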



In [307]:
# Returns the list of moves that are valid from state.
# Each move is [source peg, destination peg], with pegs numbered from 1.
def validMoves(state):
    moves = []
    for p in range(3):
        if len(state[p]) == 0:
            continue  # nothing to move from an empty peg
        for q in range(3):
            if p == q:
                continue
            # The top disk (index 0) may go onto an empty peg or onto a larger disk
            if len(state[q]) == 0 or state[p][0] < state[q][0]:
                moves.append([p + 1, q + 1])
    return moves
    


For the state `[[1], [2], [3]]` used below, the valid moves are:
- move disk 1 to peg 2
- move disk 1 to peg 3
- move disk 2 to peg 3


In [308]:
state = [[1], [2], [3]]

validMoves(state)

[[1, 2], [1, 3], [2, 3]]

The function `makeMove(state, move)` returns a new copy of the state after the move has been applied.


In [179]:
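def makeMove(state, move):
    import copy
    newState = copy.deepcopy(state)  # leave the caller's state untouched
    source, dest = move[0], move[1]  # peg numbers, counted from 1
    disk = newState[source - 1].pop(0)  # take the top disk off the source peg
    newState[dest - 1].insert(0, disk)  # place it on top of the destination peg
    return newState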

In the following example, disk 1 is moved from peg 1 to peg 3.

In [181]:
state = [[1,2,3], [], []]
print validMoves(state)

move=validMoves(state)[1]
print move
newState=makeMove(state, move)
printState(newState)



[[1, 2], [1, 3]]
[1, 3]
     
2    
3   1
------


The function `stateMoveTuple(state, move)` returns a tuple of state and move.

In [28]:
def stateMoveTuple(state, move):
    # Lists are not hashable, so convert each peg, the state, and the move to tuples
    return (tuple(tuple(peg) for peg in state), tuple(move))

In [29]:
state = [[1,2,3], [], []]
move =[1, 2]
stateMoveTuple(state, move)

(((1, 2, 3), (), ()), (1, 2))

In [130]:
Q={}
rho = 0.2  # learning rate
stateOld = [[1,2,3], [], []]
moveOld =[1, 2]
Q[stateMoveTuple(stateOld, moveOld)]=-1
state=makeMove(stateOld, moveOld)
Q[stateMoveTuple(state, move)]=-1
# move the old value a fraction rho of the way toward the new value
Q[stateMoveTuple(stateOld, moveOld)] += rho * (Q[stateMoveTuple(state, move)] - Q[stateMoveTuple(stateOld, moveOld)])






state = [[1,2,3], [], []]
validMovesL=validMoves(state)
printState(state)
print('Valid moves are',validMovesL)
Qs= np.array([Q.get(tuple(stateMoveTuple(state, m)), 0) for m in validMovesL]) 
print('Q values for validMoves are',Qs)
bestMove = validMovesL[np.argmax(Qs)]
print('Best move is',bestMove)


1    
2    
3    
------
('Valid moves are', [[1, 2], [1, 3]])
('Q values for validMoves are', array([-1.,  0.]))
('Best move is', [1, 3])


In [136]:
Q = {}  # empty table
rho = 0.2
stateOld = [[1,2,3], [], []]
moveOld =[1, 2]
Q[tuple(stateMoveTuple(stateOld, moveOld))]=0
state=makeMove(stateOld, moveOld)
#print 'state',state
Q[tuple(stateMoveTuple(state, move))]=-1
Q[tuple(stateMoveTuple(stateOld, moveOld))] += rho * (Q[tuple(stateMoveTuple(state, move))] - Q[tuple(stateMoveTuple(stateOld, moveOld))])
Q
epsilonGreedy(0.8, Q, stateOld)
#print tuple(stateMoveTuple(state, move))


[(1, 2), (1, 3)]
Random Move


(1, 2)

In [31]:
state = [[1,2,3], [], []]
move =[1, 2]
Q[tuple(stateMoveTuple(state, move))]

0

If the goal has not been reached yet, let's calculate the temporal difference error and use it to adjust the Q value of the previous (state, move) pair. We do this only if we are not at the first move of a trial, because on the first move there is no previous pair to update.

In [None]:
step = 1
if step > 0:
    
    Q[tuple(stateMoveTuple(stateOld, moveOld))] += rho * (Q[tuple(stateMoveTuple(state, move))] - Q[tuple(stateMoveTuple(stateOld, moveOld))])

In [205]:
def winner(state):
    # The puzzle counts as solved when all three disks are stacked on peg 2 or peg 3
    goal = [1, 2, 3]
    return state[1] == goal or state[2] == goal
            
state = [ [], [1,2,3],[]]

winner(state)
     

True

In [285]:
def epsilonGreedy(epsilon, Q, state):
    bestMove=[]
    validMovesList = validMoves(state)
    #print validMovesList
    
    if np.random.uniform() < epsilon:
        #print 'Random Move'
        return validMovesList[np.random.choice(len(validMovesList))]
    else:
        #print 'Greedy Move'
        
        Qs = np.array([Q.get(tuple(stateMoveTuple(state, m)), 0) for m in validMovesList]) 
        bestMove = validMovesList[np.argmax(Qs)]

        return bestMove





In [140]:
outcomes = np.random.choice([-1,0],replace=True,size=(1000))
outcomes[:10]

array([-1,  0, -1, -1,  0, -1,  0,  0,  0,  0])

In [378]:
# Train the Q function for nRepetitions, decaying epsilon at the start of each repetition.
# Returns Q and a list of the number of steps needed to reach the goal in each repetition.

def trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF):
    # validMovesF and makeMoveF are accepted only to match the call signature;
    # the body uses the global validMoves and makeMove functions.
    steps = []
    rho = learningRate
    epsilon = 1.0
    Q = {}
    for nGames in range(nRepetitions):
        epsilon *= epsilonDecayFactor
        step = 0
        state = [[1, 2, 3], [], []]
        done = False

        while not done:
            step += 1
            move = epsilonGreedy(epsilon, Q, state)
            stateNew = makeMove(state, move)
            key = stateMoveTuple(state, move)

            # (state, move) pair being explored is not in the dictionary yet
            if key not in Q:
                Q[key] = -1  # initial Q value for a new state, move

            if winner(stateNew):
                # This move reaches the goal
                Q[key] = 0
                Q[stateMoveTuple(stateNew, [])] = 0
                done = True

            if step > 1:
                # Move the previous pair's value toward the current pair's value
                keyOld = stateMoveTuple(stateOld, moveOld)
                Q[keyOld] += rho * (Q[key] - Q[keyOld])

            # Remember state and move so Q(stateOld, moveOld) can be updated after the next step
            stateOld, moveOld = state, move
            state = stateNew

        steps.append(step)
    return Q, steps
    
    

In [362]:
Q, stepsToGoal = trainQ(50, 0.5, 0.7,1,3)

In [None]:
stepsToGoal

In [380]:
# Without updating Q, use Q to find the greedy action at each step until the goal is found
# or maxSteps moves have been made. Returns the path of states.

def testQ(Q, maxSteps, validMovesF, makeMoveF):
    # validMovesF and makeMoveF are accepted only to match the call signature;
    # the body uses the global validMoves and makeMove functions.
    state = [[1, 2, 3], [], []]
    path = [state]
    for step in range(maxSteps):
        validMovesList = validMoves(state)
        Qs = np.array([Q.get(stateMoveTuple(state, m), 0) for m in validMovesList])
        move = validMovesList[np.argmax(Qs)]  # greedy move
        state = makeMove(state, move)  # makeMove returns a fresh copy
        path.append(state)
        if winner(state):
            break  # stop as soon as the goal is reached
    return path
        
        
        
        
    

In [373]:
path = testQ(Q, 20,1,2)

In [374]:
path

[[[1, 2, 3], [], []],
 [[2, 3], [1], []],
 [[3], [1], [2]],
 [[1, 3], [], [2]],
 [[3], [], [1, 2]],
 [[], [3], [1, 2]],
 [[], [1, 3], [2]],
 [[1], [3], [2]],
 [[1], [2, 3], []]]

In [376]:
for s in path:
    printState(s)
    print

1    
2    
3    
------

     
2    
3 1  
------

     
     
3 1 2
------

     
1    
3   2
------

     
    1
3   2
------

     
    1
  3 2
------

     
  1  
  3 2
------

     
     
1 3 2
------

     
  2  
1 3  
------



In [381]:
%run -i A5grader.py


Testing validMoves([[1], [2], [3]])

--- 10/10 points. Correctly returned [[1, 2], [1, 3], [2, 3]]

Testing validMoves([[], [], [1, 2, 3]])

--- 10/10 points. Correctly returned [[3, 1], [3, 2]]

Testing makeMove([[], [], [1, 2, 3]], [3, 2])

--- 10/10 points. Correctly returned [[], [1], [2, 3]]

Testing makeMove([[2], [3], [1]], [1, 2])

--- 10/10 points. Correctly returned [[], [2, 3], [1]]

Testing   Q, steps = trainQ(1000, 0.5, 0.7, validMoves, makeMove).

---  0/10 points. Q dictionary should have close to 76 entries. Yours has 52

--- 10/10 points. The mean of the number of steps is 7.116 which is correct.

Testing   path = testQ(Q, 20, validMoves, makeMove).

--- 20/20 points. Correctly returns path of length 8, less than 10.

AI Execution Grade is 70/80

 Remaining 20 points will be based on your text describing the trainQ and testQ functions.

AI FINAL GRADE is __/100
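
One way to sanity-check the size of the Q dictionary is to count the valid (state, move) pairs of the puzzle directly. The sketch below does this with a breadth-first search. `countStateMovePairs` is a hypothetical helper, and treating only the right peg as the goal is an assumption; the grader output above does not say how it arrives at 76.

```python
from collections import deque

def countStateMovePairs(start, goal=None):
    """Hypothetical helper: count (state, move) pairs reachable from start.

    States are tuples of tuples with the top disk first. Moves out of the
    goal state, if one is given, are not counted.
    """
    seen = {start}
    queue = deque([start])
    pairs = 0
    while queue:
        state = queue.popleft()
        if state == goal:
            continue  # a trial ends at the goal, so no move is made from it
        for p in range(3):
            if not state[p]:
                continue
            for q in range(3):
                if p == q or (state[q] and state[p][0] > state[q][0]):
                    continue  # same peg, or larger disk onto smaller disk
                pairs += 1
                pegs = [list(peg) for peg in state]
                pegs[q].insert(0, pegs[p].pop(0))
                nextState = tuple(tuple(peg) for peg in pegs)
                if nextState not in seen:
                    seen.add(nextState)
                    queue.append(nextState)
    return pairs

start = ((1, 2, 3), (), ())
print(countStateMovePairs(start))                            # 78
print(countStateMovePairs(start, goal=((), (), (1, 2, 3))))  # 76
```

Under that assumption, a fully explored table holds 76 entries, so a much smaller dictionary suggests that some (state, move) pairs were never tried during training.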
