# Brief Summary (known transition probabilities)

## Bellman Expectation Equation
* State Value Function 

$$\large v_{\pi}(s) = \sum_{a} \: \pi  (a|s) \: q_{\pi}(s,a)$$
$ $
$$\large v_{\pi}(s) = \sum_{a} \: \pi(a|s) \: \left( R_{s}^{a}\:+\:\gamma \sum_{s'}\:P_{ss'}^{a}\:v_{\pi}(s') \right)$$
* Action Value Function
$$\large q_{\pi}(s,a) = R_{s}^{a}\: +\: \gamma \sum_{s'}\:P_{ss'}^{a}\:v_{\pi}(s')$$
$ $
$$\large q_{\pi}(s,a) = R_{s}^{a}\: +\: \gamma \sum_{s'}\:P_{ss'}^{a}\:\sum_{a'}\:\pi(a'|s')q_{\pi}(s',a')$$
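
As a quick numerical check of the expectation equations, the sketch below solves for $v_{\pi}$ in closed form and confirms that $v_{\pi}$ and $q_{\pi}$ agree with each other. The 2-state, 2-action MDP and the policy `pi` are made-up numbers chosen for illustration; they are not the grid world defined later in this notebook.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[0.0, 1.0],
              [2.0, -1.0]])                # R[s, a]
gamma = 0.9
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])                # pi[s, a], each row sums to 1

# Average the reward and the transition matrix over the policy.
R_pi = np.sum(pi * R, axis=1)              # R_pi[s]
P_pi = np.einsum("sa,sat->st", pi, P)      # P_pi[s, s']

# v_pi is the solution of the linear system (I - gamma * P_pi) v = R_pi.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q_pi(s, a) = R[s, a] + gamma * sum_s' P[s, a, s'] * v_pi(s')
q = R + gamma * P @ v

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a) should reproduce v.
v_from_q = np.sum(pi * q, axis=1)

# q_pi written purely in terms of q_pi (the second action-value equation).
q_from_q = R + gamma * P @ np.sum(pi * q, axis=1)
```

Here `v_from_q` should match `v` and `q_from_q` should match `q` up to floating-point error, which is exactly what the four equations above state.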

## Bellman Optimality Equation
#### Instead of following a given policy, choose actions greedily: $$\pi(a|s) \rightarrow 1 \: \text{or} \: 0$$ 
* Optimal State Value Function
$$\large v_{*}(s) = \max_{\pi} v_{\pi}(s)$$
$$\large v_{*}(s) = \max_{a} q_{*}(s,a)$$
$$\large v_{*}(s) = \max_{a} (R_{s}^{a} + \gamma \sum_{s'\in S}\,P_{ss'}^{a}v_{*}(s'))$$
* Optimal Action Value Function
$$\large q_{*}(s,a) = R_{s}^{a} + \gamma \sum_{s'\in S}\,P_{ss'}^{a}\max_{a'} q_{*}(s',a')$$
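
One way to make the greedy choice explicit, written in the notation above (a standard form, added here as a sketch rather than taken from the original notes):

$$\large \pi_{*}(a|s) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} q_{*}(s,a') \\ 0 & \text{otherwise} \end{cases}$$

Substituting this policy into $v_{\pi}(s) = \sum_{a} \pi(a|s)\, q_{\pi}(s,a)$ collapses the sum to a single term, which gives $v_{*}(s) = \max_{a} q_{*}(s,a)$.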

## Iterative Policy Evaluation

* Use the Bellman expectation equation above to update the current value function (state or action) with a one-step lookahead
* The resulting value function indicates how good the current policy is
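
A minimal sketch of this update, assuming arrays shaped like the ones used later in this notebook (`P[s, a, s']`, `R[s, a]`, `policy[s, a]`). The function name `policy_evaluation`, the stopping tolerance, and the small 2-state MDP are assumptions made for illustration.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma, tol=1e-10, max_iters=10_000):
    """Repeat the Bellman expectation backup until v stops changing."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iters):
        q = R + gamma * P @ v                 # one-step lookahead, q[s, a]
        v_new = np.sum(policy * q, axis=1)    # average over the policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [2.0, -1.0]])
gamma = 0.9
uniform_policy = 0.5 * np.ones((2, 2))

v_iter = policy_evaluation(P, R, uniform_policy, gamma)

# Closed-form reference: solve (I - gamma * P_pi) v = R_pi.
P_pi = np.einsum("sa,sat->st", uniform_policy, P)
R_pi = np.sum(uniform_policy * R, axis=1)
v_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
```

With `gamma < 1` the backup is a contraction, so `v_iter` should approach `v_exact` regardless of the starting values.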

## Value Iteration
* Use the Bellman optimality equation above to update the current value function (state or action) with a one-step lookahead
* The goal is to find the optimal policy and the optimal value function
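
The same loop with a max over actions in place of the policy average gives a sketch of value iteration. As before, the function name `value_iteration`, the tolerance, and the toy MDP are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Repeat the Bellman optimality backup, then read off a greedy policy."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iters):
        v_new = (R + gamma * P @ v).max(axis=1)   # max over actions
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    greedy_actions = (R + gamma * P @ v).argmax(axis=1)
    return v, greedy_actions

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [2.0, -1.0]])
gamma = 0.9

v_star, greedy_actions = value_iteration(P, R, gamma)
```

The returned `v_star` should satisfy the optimality equation to within the tolerance, and `greedy_actions[s]` is the action that attains the max in state `s`.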


## Policy Iteration

* Evaluation (policy evaluation using the Bellman expectation equation)
* Improvement (greedy): $a = \arg\max_{a} q_{\pi}(s,a)$
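
The two steps can be alternated as in the sketch below. To keep it short, the evaluation step solves the linear system exactly instead of iterating, which is a substitution made here for brevity; the name `policy_iteration` and the toy MDP are likewise assumptions.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=1000):
    """Alternate exact policy evaluation and greedy improvement."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    greedy = np.zeros(n_states, dtype=int)    # arbitrary starting policy
    for _ in range(max_iters):
        # Evaluation: v for the current deterministic policy.
        P_pi = P[idx, greedy]                 # P_pi[s, s']
        R_pi = R[idx, greedy]                 # R_pi[s]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to q.
        q = R + gamma * P @ v
        new_greedy = q.argmax(axis=1)
        if np.array_equal(new_greedy, greedy):
            break                             # policy is stable
        greedy = new_greedy
    return v, greedy

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [2.0, -1.0]])
gamma = 0.9

v_pi, greedy = policy_iteration(P, R, gamma)
```

When the loop stops because the policy no longer changes, `v_pi` is the value of a policy that is greedy with respect to itself, so it should satisfy the Bellman optimality equation.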

In [4]:
# set state
import numpy as np
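nCols = 3  # extent of the grid along the first axis of `map` (axis 0)
nRows = 4  # extent of the grid along the second axis of `map` (axis 1)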
nWalls = 1
states = list(range(nCols*nRows-nWalls))  # one integer id per free cell
N_STATES = len(states)
#print(N_STATES)
#print(states)

# set map
map = -np.ones((nCols+2,nRows+2))
for i in range(nCols):
    for j in range(nRows):
        map[i+1,j+1] = 0
map[2,2] = -1 # add wall
#print(map)

# set action
actions = [0, 1, 2, 3]
N_ACTIONS = len(actions)

# states -> location
locations = []
index = 0
for i in range(nCols):
    for j in range(nRows):
        if map[i+1,j+1]==0:
            locations.append((i+1,j+1))
            index = index + 1
#print(locations) # match index with states
# action -> move
move = [(0,-1),(-1,0),(0,1),(1,0)] # match index with actions
#print(move)

# set transition probability
P = np.zeros((N_STATES,N_ACTIONS,N_STATES)) # P[S,A,S']
for s in range(N_STATES):
    for a in range(N_ACTIONS):
        current_location = locations[s]
        # The intended move succeeds with probability 0.8. With probability 0.1 each,
        # the agent slips to the previous entry in `move` (left error, a-1) or to the
        # next entry (right error, a+1).
        for a_actual, prob in ((a, 0.8), ((a-1) % 4, 0.1), ((a+1) % 4, 0.1)):
            next_location = (current_location[0] + move[a_actual][0],current_location[1] + move[a_actual][1])
            if map[next_location[0],next_location[1]] == -1: # barrier or wall: stay in place
                next_location = current_location
            next_s = states[locations.index(next_location)]
            P[s,a,next_s] = P[s,a,next_s] + prob
        
# rewards s,a ---  R(s,a)  ---> s'
if True:
    R = -0.02*np.ones((N_STATES,N_ACTIONS))
else:
    R = -0.5*np.ones((N_STATES,N_ACTIONS))
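R[3,:] = 1   # reward for any action taken in state 3, location (1,4)
R[6,:] = -1  # reward for any action taken in state 6, location (2,4)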
#print(R)
# discount factor
gamma = 0.99

# policy: which action to choose in a given state
# assume that the policy is known
bad_policy = np.zeros((N_STATES,N_ACTIONS))
bad_policy[0,2] = 1
bad_policy[1,2] = 1
bad_policy[2,2] = 1
bad_policy[3,2] = 1
bad_policy[4,3] = 1
bad_policy[5,2] = 1
bad_policy[6,2] = 1
bad_policy[7,2] = 1
bad_policy[8,2] = 1
bad_policy[9,2] = 1
bad_policy[10,1] = 1

random_policy = 0.25*np.ones((N_STATES,N_ACTIONS))

optimal_policy = np.zeros((N_STATES,N_ACTIONS))
optimal_policy[0,2] = 1
optimal_policy[1,2] = 1
optimal_policy[2,2] = 1
optimal_policy[3,2] = 1
optimal_policy[4,1] = 1
optimal_policy[5,1] = 1
optimal_policy[6,1] = 1
optimal_policy[7,1] = 1
optimal_policy[8,0] = 1
optimal_policy[9,0] = 1
optimal_policy[10,0] = 1
#print(optimal_policy)

optimalWithNoise_policy = np.zeros((N_STATES,N_ACTIONS))
ep = 0.1
optimalWithNoise_policy[0,2] = 1
optimalWithNoise_policy[1,2] = 1
optimalWithNoise_policy[2,2] = 1
optimalWithNoise_policy[3,2] = 1
optimalWithNoise_policy[4,1] = 1
optimalWithNoise_policy[5,1] = 1
optimalWithNoise_policy[6,1] = 1
optimalWithNoise_policy[7,1] = 1
optimalWithNoise_policy[8,0] = 1
optimalWithNoise_policy[9,0] = 1
optimalWithNoise_policy[10,0] = 1
optimalWithNoise_policy = optimalWithNoise_policy + (ep/4)*np.ones((N_STATES,N_ACTIONS))
optimalWithNoise_policy = optimalWithNoise_policy / np.sum(optimalWithNoise_policy,axis = 1).reshape((N_STATES,1))
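
The hand-written policy tables above follow one pattern: a one-hot row per state, optionally mixed with uniform noise and renormalised. A compact way to build the same arrays is sketched below; the helper names `one_hot_policy` and `add_uniform_noise` are hypothetical, and the action list is copied from `optimal_policy`.

```python
import numpy as np

def one_hot_policy(greedy_actions, n_actions):
    """Deterministic policy: probability 1 on the listed action per state."""
    n_states = len(greedy_actions)
    policy = np.zeros((n_states, n_actions))
    policy[np.arange(n_states), greedy_actions] = 1.0
    return policy

def add_uniform_noise(policy, ep):
    """Add ep / n_actions to every entry, then renormalise each row."""
    noisy = policy + (ep / policy.shape[1]) * np.ones_like(policy)
    return noisy / noisy.sum(axis=1, keepdims=True)

# Action index per state, copied from optimal_policy above.
greedy_actions = [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0]
optimal = one_hot_policy(greedy_actions, 4)
noisy = add_uniform_noise(optimal, 0.1)

# Every row of a policy must be a probability distribution.
rows_ok = np.allclose(optimal.sum(axis=1), 1.0) and np.allclose(noisy.sum(axis=1), 1.0)
```

The row-sum check is worth running on `P` as well (`P.sum(axis=2)` should be all ones), since a transition table that does not sum to 1 silently distorts every value estimate.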