Notebook for trying out some example algorithms in Python from the reinforcement learning book. I originally implemented the algorithms in MATLAB and have translated them into Python here.
The basic problem is a grid of squares in which the terminal states are the first square and the final square (indices 0 and 8 in the Python code). The policy is stored as a grid of numbers: each square holds the index of the square the player should move to from there.
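
As a small illustration of this encoding, the sketch below lays a policy array out as a 3x3 board. It assumes the squares are numbered row by row from 0 to 8, which is the convention the later cells appear to use; the name `policy_grid` is only illustrative.

```python
import numpy as np

# Policy array: entry s holds the index of the square to move to from square s.
# Assumption: squares are numbered row by row, 0 (top left) to 8 (bottom right).
pi = np.array([0, 0, 1, 0, 1, 8, 7, 8, 8])

# Illustrative name: the same policy viewed as the 3x3 board.
policy_grid = pi.reshape(3, 3)
print(policy_grid)
```

Reading the middle row of the result, square 3 moves up to 0, square 4 moves up to 1, and square 5 moves down to the terminal square 8.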


# The Actual Value Function (for the policy pi)

In [33]:
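import numpy as np

gamma = 0.9                        #discount factor (same value as in the TD cell)
R = np.ones((9,1))                 #reward of 1 in every non-terminal state
R[0], R[8] = 0, 0                  #terminal states give no reward
pi = np.array([0,0,1,0,1,8,7,8,8]) #policy array - each number points to the next location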

P = np.array([[1,0,0,0,0,0,0,0,0],
             [1,0,0,0,0,0,0,0,0],
             [0,1,0,0,0,0,0,0,0],
             [1,0,0,0,0,0,0,0,0],
             [0,1,0,0,0,0,0,0,0],
             [0,0,0,0,0,0,0,0,1],
             [0,0,0,0,0,0,0,1,0],
             [0,0,0,0,0,0,0,0,1],
             [0,0,0,0,0,0,0,0,1]]) #Prob matrix of policy

VActual = np.linalg.solve(np.eye(9)-gamma*P, R) #Solve the Bellman equation (I - gamma*P) V = R without forming the inverse
VActual

array([[0. ],
       [1. ],
       [1.9],
       [1. ],
       [1.9],
       [1. ],
       [1.9],
       [1. ],
       [0. ]])
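
For reference, the matrix solve above follows from writing the Bellman equation for a fixed policy in vector form, with $P$ the transition matrix under $\pi$ and $R$ the reward vector used in the code:

$$
V = R + \gamma P V
\;\Longrightarrow\;
(I - \gamma P)\,V = R
\;\Longrightarrow\;
V = (I - \gamma P)^{-1} R
$$

Two entries can be checked by hand with $\gamma = 0.9$. Square 1 moves to the terminal square 0, so $V(1) = 1 + 0.9 \cdot 0 = 1$. Square 2 moves to square 1, so $V(2) = 1 + 0.9 \cdot 1 = 1.9$. Both match the output.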

# TD(0) (One-Step TD) Implementation

In [38]:
import numpy as np

alpha, gamma, n = 0.5, 0.9, 200    #step parameter, discount factor, number of episodes
V = np.zeros((9,1))                #arbitrary initial V estimate  
R = np.ones((9,1))                    
R[0], R[8] = 0, 0                  #reward array
S0_arr = np.random.randint(0,9,n)  #Generate random starting states (indices 0 to 8)

#Use plain for loops for now, though I'm aware this is not the fastest way.
for i in range(n):
    S = S0_arr[i]          #choose random starting state
    running = True         #true until the terminal state is reached
    while running:
        if S == 0 or S == 8:
            running = False
        else:
            Snew = pi[S]
            V[S] = V[S] + alpha*(R[S]+gamma*V[Snew]-V[S])
            S = Snew

print("The difference between the estimate and the actual value function is : \n", V-VActual)


The difference between the estimate and the actual value function is : 
 [[ 0.00000000e+00]
 [ 0.00000000e+00]
 [-1.47249879e-07]
 [-7.62939453e-06]
 [-3.70615691e-07]
 [-4.65661287e-10]
 [-1.48702093e-06]
 [-9.09494702e-13]
 [ 0.00000000e+00]]
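
The loop above can also be packaged as a reusable function. The following is a minimal sketch rather than the notebook's own code: the function name `td0_evaluate`, the `terminals` argument, the fixed seed and the larger episode count are all assumptions made so the example is self-contained and reproducible.

```python
import numpy as np

def td0_evaluate(pi, R, terminals, alpha=0.5, gamma=0.9, n_episodes=2000, seed=0):
    """Estimate V for a deterministic policy with one-step TD updates."""
    rng = np.random.default_rng(seed)
    V = np.zeros(len(pi))
    starts = rng.integers(0, len(pi), size=n_episodes)   # random start states
    for S0 in starts:
        S = int(S0)
        while S not in terminals:
            S_new = int(pi[S])
            # TD(0) update: move V[S] towards the one-step target R[S] + gamma*V[S_new]
            V[S] += alpha * (R[S] + gamma * V[S_new] - V[S])
            S = S_new
    return V

pi = np.array([0, 0, 1, 0, 1, 8, 7, 8, 8])               # same policy as above
R = np.array([0, 1, 1, 1, 1, 1, 1, 1, 0], dtype=float)   # same rewards, as a flat vector
V = td0_evaluate(pi, R, terminals={0, 8})
print(V)
```

Because both the policy and the transitions are deterministic here, each visit to a state halves its remaining error when `alpha` is 0.5, which is why the estimate settles so close to the exact solution.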


# Simple 3x3 Grid Q Learning Implementation

A slightly different take on the 3x3 grid idea. The Q table has one row per state and one column per action, with actions 0 to 3 meaning up, right, down, left in that order. As the episodes run, the updates to the Q values learn the policy, which is then read off by taking the highest-valued action in each state.
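
In symbols, the update applied in the loop of this section is the one-step Q-learning rule, where $S'$ is the square reached from $S$ by action $A$ and $R(S, A)$ is the entry of the reward table:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R(S, A) + \gamma \max_{a} Q(S', a) - Q(S, A) \right]
$$

In this implementation $A$ is itself chosen greedily, $A = \arg\max_{a} Q(S, a)$, so there is no separate exploration step.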

In [11]:
import numpy as np

#Takes the arr from Q with one state and multiple actions, finds highest value and returns the index of the action column
def get_max_A(arr):
    indices = np.where(arr == np.amax(arr)) #indices of maximum Q value of the array
    return indices[0][0]

#takes input of action and returns new state according to board physics
def change_state(S,A):
    #Don't move if at the board edge. Otherwise 0 up, 1 right, 2 down, 3 left
    if((S % 3 == 0 and A == 3) or (S < 3 and A == 0) or (S > 5 and A == 2) or (S % 3 == 2 and A == 1)):
        return S
    elif(A == 0):
        return S-3
    elif(A == 1):
        return S+1
    elif(A == 2):
        return S+3
    elif(A == 3):
        return S-1
    else:
        print('change_state() broken at state {} and action {}'.format(S,A))
        return S

alpha, gamma, n = 0.5, 0.9, 200    #step parameter, discount factor, number of episodes

Q = np.zeros((9,4))                #arbitrary initial Q Table estimate ((9 states, 4 actions) value function)

#set up reward so if an action would take it off the board, make it stay still and lose 10 reward, otherwise reward=-1
R = np.array([[-10, -1, -1, -10],
              [-10, -1, -1,  -1],
              [-10,-10, -1,  -1],
              [ -1, -1, -1, -10],
              [ -1, -1, -1,  -1],
              [ -1,-10, -1,  -1],
              [ -1, -1,-10, -10],
              [ -1, -1,-10,  -1],
              [ -1,-10,-10,  -1]])
S0_arr = np.random.randint(0,9,n)  #Generate random starting states (indices 0 to 8)

for i in range(n):
    S = S0_arr[i]          #choose random starting state
    Snew = S               #Set Snew to fix the algorithm
    running = True         #true until the terminal state is reached (S = 0 or 8)
    # iteration = 1
    while running:
        if S == 0 or S == 8:
            running = False
        else:
            Amax = get_max_A(Q[S,:])       #Get action that maximizes current move
            Snew = change_state(S,Amax)    #Find out the new state according to the action
            Anewmax = get_max_A(Q[Snew,:]) #Get action that maximizes the next move
            Q[S,Amax] = Q[S,Amax] + alpha*(R[S,Amax]+ gamma*Q[Snew,Anewmax] - Q[S,Amax]) #Update Q
            S = Snew                         #Update state
            # print('Iteration {} at state {}'.format(iteration,S)) #debugging iteration check
            # iteration += 1

print('Final Q: {}'.format(Q))
#Find the optimum policy from Q:
optimum_pi = np.zeros((3,3))
for i in range(3):
    for j in range(3):
        optimum_pi[j,i] = change_state(i+j*3,get_max_A(Q[i+j*3,:]))
optimum_pi[2,2] = 8                #terminal square 8 stays put (its all-zero Q row would otherwise select 'up')
print("optimum policy:\n {}".format(optimum_pi))

Final Q: [[ 0.          0.          0.          0.        ]
 [-5.         -1.18875    -1.65617188 -1.        ]
 [-5.         -5.         -1.89998767 -1.89998779]
 [-1.         -1.18875    -1.408125   -5.        ]
 [-1.89051866 -1.89435179 -1.89330679 -1.89407904]
 [-1.533125   -5.         -1.         -1.31375   ]
 [-1.89950877 -1.89909814 -5.         -5.        ]
 [-1.533125   -1.         -5.         -1.2905625 ]
 [ 0.          0.          0.          0.        ]]
optimum policy:
 [[0. 0. 5.]
 [0. 1. 8.]
 [7. 8. 8.]]
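
The policy can also be displayed as directions instead of target indices. The sketch below is one possible way to do that, assuming the action order used above (up, right, down, left). The helper `greedy_arrows` and the symbols are illustrative, and the Q values are copied from the printed output.

```python
import numpy as np

ARROWS = np.array(['^', '>', 'v', '<'])   # action order: up, right, down, left

def greedy_arrows(Q, n_cols, terminals):
    """Return the greedy action of each state as a symbol, laid out as the board."""
    best = np.argmax(Q, axis=1)           # first maximising action per state
    symbols = ARROWS[best]                # fancy indexing returns a copy
    symbols[list(terminals)] = 'T'        # mark terminal squares
    return symbols.reshape(-1, n_cols)

# Q table copied from the "Final Q" output above
Q_final = np.array([[ 0.,          0.,          0.,          0.        ],
                    [-5.,         -1.18875,    -1.65617188, -1.        ],
                    [-5.,         -5.,         -1.89998767, -1.89998779],
                    [-1.,         -1.18875,    -1.408125,   -5.        ],
                    [-1.89051866, -1.89435179, -1.89330679, -1.89407904],
                    [-1.533125,   -5.,         -1.,         -1.31375   ],
                    [-1.89950877, -1.89909814, -5.,         -5.        ],
                    [-1.533125,   -1.,         -5.,         -1.2905625 ],
                    [ 0.,          0.,          0.,          0.        ]])

arrows = greedy_arrows(Q_final, n_cols=3, terminals=[0, 8])
print(arrows)
```

`np.argmax` returns the first index of the maximum, so it breaks ties the same way as `get_max_A`.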


# More Complicated 4x4 Grid with walls, Q Learning

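This version uses a 4x4 grid with some walls, laid out as follows:
{F, , , },
{_,_,_, },
{ , ,_|, },
{ , , , }
Here F is the final state, `_` marks a wall along the bottom edge of a square and `|` marks a wall along its right edge.
In the optimum policy printed at the end, each square holds the index of the square the player should move to.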
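
One way to make the wall layout explicit is to list the blocked (state, action) pairs and build the reward table from them. The sketch below does this under my reading of the diagram: squares are numbered row by row from 0 to 15 and actions follow the order of the 3x3 example (0 up, 1 right, 2 down, 3 left). The names `walls`, `off_board` and `R_built` are hypothetical and do not appear in the cell that follows.

```python
import numpy as np

UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
N = 4  # grid side length

# (state, action) pairs blocked by a wall, read off the diagram (an assumption).
# Each wall is listed from both sides.
walls = {(4, DOWN), (8, UP), (5, DOWN), (9, UP), (6, DOWN), (10, UP),
         (10, DOWN), (14, UP), (10, RIGHT), (11, LEFT)}

def off_board(s, a, n=N):
    """True if taking action a in square s would leave the board."""
    row, col = divmod(s, n)
    return ((a == UP and row == 0) or (a == DOWN and row == n - 1) or
            (a == LEFT and col == 0) or (a == RIGHT and col == n - 1))

# -10 for moves off the board or into a wall, -1 for every other move
R_built = np.full((N * N, 4), -1)
for s in range(N * N):
    for a in range(4):
        if off_board(s, a) or (s, a) in walls:
            R_built[s, a] = -10

print(R_built)
```

Generating the table this way makes it easier to change the wall layout without retyping sixteen rows by hand.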

In [4]:
import numpy as np

#Takes the arr from Q with one state and multiple actions, finds highest value and returns the index of the action column
def get_max_A(arr):
    indices = np.where(arr == np.amax(arr)) #indices of maximum Q value of the array
    return indices[0][0]

#takes input of action and returns new state according to board physics
def change_state(S,A):
    #Don't move if at the board edge. Otherwise 0 up, 1 right, 2 down, 3 left
    if((S % 4 == 0 and A == 3) or (S < 4 and A == 0) or (S > 11 and A == 2) or (S % 4 == 3 and A == 1)):
        return S
    elif(A == 0):
        return S-4
    elif(A == 1):
        return S+1
    elif(A == 2):
        return S+4
    elif(A == 3):
        return S-1
    else:
        print('change_state() broken at state {} and action {}'.format(S,A))
        return S

alpha, gamma, n = 0.5, 0.9, 200    #step parameter, discount factor, number of episodes

Q = np.zeros((16,4))                #arbitrary initial Q Table estimate ((16 states, 4 actions) value function)

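#set up reward so if an action would take it off the board, make it stay still and lose 10 reward, otherwise reward=-1
#moves into a wall also cost 10; note that change_state() only blocks moves at the board edge, so walls act through the reward alone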
# {F, , , },
# {_,_,_, },
# { , ,_|, },
# { , , , }
R = np.array([[-10, -1, -1,-10], [-10, -1, -1, -1], [-10, -1, -1, -1], [-10,-10, -1, -1],
              [ -1, -1,-10,-10], [ -1, -1,-10, -1], [ -1, -1,-10, -1], [ -1,-10, -1, -1],
              [-10, -1, -1,-10], [-10, -1, -1, -1], [-10,-10,-10, -1], [ -1,-10, -1,-10],
              [ -1, -1,-10,-10], [ -1, -1,-10, -1], [-10, -1,-10, -1], [ -1,-10,-10, -1]])
S0_arr = np.random.randint(0,16,n)  #Generate random starting states (indices 0 to 15)

for i in range(n):
    S = S0_arr[i]          #choose random starting state
    Snew = S               #Set Snew to fix the algorithm
    running = True         #true until the terminal state is reached (S = 0)
    # iteration = 1
    while running:
        if S == 0:
            running = False
        else:
            Amax = get_max_A(Q[S,:])       #Get action that maximizes current move
            Snew = change_state(S,Amax)    #Find out the new state according to the action
            Anewmax = get_max_A(Q[Snew,:]) #Get action that maximizes the next move
            Q[S,Amax] = Q[S,Amax] + alpha*(R[S,Amax]+ gamma*Q[Snew,Anewmax] - Q[S,Amax]) #Update Q
            S = Snew                         #Update state
            # print('Iteration {} at state {}'.format(iteration,S)) #debugging iteration check
            # iteration += 1

print('Final Q: {}'.format(Q))
optimum_pi = np.zeros((4,4))
for i in range(4):
    for j in range(4):
        optimum_pi[j,i] = change_state(i+j*4,get_max_A(Q[i+j*4,:]))
print("optimum policy:\n {}".format(optimum_pi))

Final Q: [[ 0.          0.          0.          0.        ]
 [-5.         -1.18875    -1.701875   -1.        ]
 [-5.         -2.0045625  -2.43918969 -1.9       ]
 [-5.         -5.         -2.8267977  -2.71      ]
 [-1.         -1.245      -5.         -5.        ]
 [-1.9        -1.92909687 -5.         -1.9       ]
 [-2.71       -2.87998358 -5.         -2.71      ]
 [-3.439      -5.         -3.85334621 -3.439     ]
 [-7.94824219 -6.50598787 -6.50125559 -9.75      ]
 [-8.35158356 -6.19179893 -6.12579511 -6.24083094]
 [-8.6993882  -9.44557883 -9.75       -6.51313462]
 [-4.0951     -5.         -4.28579446 -5.        ]
 [-6.43220859 -6.12564453 -9.75       -9.75      ]
 [-5.89324849 -5.6953279  -9.75       -5.95273391]
 [-9.75       -5.217031   -9.75       -5.56159263]
 [-4.68559    -5.         -5.         -5.01094988]]
optimum policy:
 [[ 0.  0.  1.  2.]
 [ 0.  4.  2.  6.]
 [12. 13.  9.  7.]
 [13. 14. 15. 11.]]
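
A quick way to sanity-check a learned policy like this is to follow it from a starting square and record the path. The sketch below does that for the policy printed above; the helper `follow_policy` and its `max_steps` guard are illustrative additions, not part of the original cell.

```python
import numpy as np

# Policy grid copied from the "optimum policy" output above
policy = np.array([[ 0,  0,  1,  2],
                   [ 0,  4,  2,  6],
                   [12, 13,  9,  7],
                   [13, 14, 15, 11]])

def follow_policy(policy, start, goal=0, max_steps=50):
    """Follow the policy from start and return the list of squares visited."""
    flat = policy.ravel()                      # flat[s] is the next square from s
    path = [start]
    # max_steps guards against a policy that loops without reaching the goal
    while path[-1] != goal and len(path) <= max_steps:
        path.append(int(flat[path[-1]]))
    return path

print(follow_policy(policy, 15))
print(follow_policy(policy, 8))
```

From square 8 the path runs along the bottom row, up the right-hand column and back across the top, going around the walls rather than through them.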


# Trying to break the 4x4 Q Learning Grid by Feeding False Data

In progress