### Assignment: Week 2
## Finding best policies in simple MDPs

Great work making the MDPs in Week 1!

In this assignment, we'll use the simplest RL techniques, policy iteration and value iteration, to find the best policies (those that maximize the discounted total reward) in our MDPs from last week.

Feel free to use your own MDPs, or import them from the OpenAI Gym library.

You can start this assignment while or after reading Chapter 3 of Grokking.

Let us recall the equation for the value function of the agent's states under a policy $\pi$:
$$v_{\pi}(s) = \sum _{a} \pi(a|s) ~ \left( ~ \sum _{s', r} ~ p(s', r | s, a) ~ \left[r + \gamma v_{\pi}(s') \right] ~ \right)$$

We can observe that the value function $v_{\pi}$ has circular dependencies: the value of each state is defined in terms of the values of its successor states, which may in turn depend on the original state.

To solve such equations, one approach is to repeatedly evaluate the RHS with the current estimates and assign the result to the LHS, until the $v_{\pi}(s)$ values stop changing (converge).

At the point of convergence all the equations hold simultaneously, so it is the required solution.
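
As a minimal sketch of this fixed-point idea, consider a hypothetical two-state chain (not taken from the book) in which the policy has already been folded into a transition matrix `P` and a reward vector `r`. The names and numbers are assumptions chosen only for illustration. Repeatedly evaluating the RHS should land on the same values as solving the linear system directly.

```python
import numpy as np

# Hypothetical 2-state example: the policy is already folded into P and r.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])   # P[s, s'] = probability of moving from s to s'
r = np.array([1.0, 0.0])     # expected immediate reward in each state
gamma = 0.9

v = np.zeros(2)              # arbitrary starting guess
for sweep in range(10_000):
    v_new = r + gamma * P @ v              # evaluate the RHS with the current guess
    if np.max(np.abs(v_new - v)) < 1e-12:  # values stopped changing
        break
    v = v_new

# The fixed point solves (I - gamma * P) v = r exactly.
v_exact = np.linalg.solve(np.eye(2) - gamma * P, r)
print(v_new, v_exact)
```

The direct solve is only practical for small state spaces, which is one reason the iterative version is the one used in the rest of this assignment.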

Let us calculate the value functions for some policies in the MDPs we created last week.

## Environment 0 - Bandit Walk

Again, we consider the BW environment on Page 39.

Let's consider what seems to be the most natural policy - always go Right.

This environment is so simple that we can calculate the value function by hand.

Note that by convention for the terminal states, 
$$v_{\pi}(0) = v_{\pi}(2) = 0$$

Now, 
$$v_{\pi}(1) = 1 + \gamma \cdot v_{\pi}(2) = 1$$

Note that both summations have just one term each, because both the environment and the policy are deterministic (check which summation corresponds to which stochastic variable).
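
The hand calculation can be double-checked with a few lines of Python. The dictionary below is an assumed encoding of Bandit Walk in the Gym `P` format (state 0 is the hole, state 2 is the goal, reward +1 for landing on the goal). It is a sketch for checking the arithmetic, not the environment's actual source.

```python
# Assumed Bandit Walk MDP: mdp[state][action] = [(prob, next_state, reward, done)]
bw_mdp = {
    0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
    1: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

# pi(a|s) for "always go Right": all probability mass on action 1.
pi = {s: {0: 0.0, 1: 1.0} for s in bw_mdp}
gamma = 0.9                        # any value works here, since v(2) = 0
v = {0: 0.0, 1: 0.0, 2: 0.0}       # terminal states fixed at 0

# Outer sum runs over actions (randomness of the policy),
# inner sum runs over (s', r) outcomes (randomness of the environment).
v1 = sum(
    pi[1][a] * sum(p * (r + gamma * v[s_next]) for (p, s_next, r, done) in bw_mdp[1][a])
    for a in bw_mdp[1]
)
print(v1)  # 1.0
```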

## Environment 1 - Slippery Walk

Let's now try to solve the SWF environment from Page 67 for the naturally adversarial policy - always go Left.

Since we have 5 coupled equations for states 1-5 with 5 unknowns, we'll use Python to brute-force the solution.

To align with Grokking, let us use the unusual choice $\gamma = 1$.
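
Because the five equations are linear, they can also be solved directly, which gives a reference answer to compare the iterative result against. The sketch below assumes the SWF dynamics follow the same pattern for every non-terminal state: intended direction with probability 1/2, stay with 1/3, opposite direction with 1/6, and reward +1 only on entering state 6. The reward and the rows for states 4 and 5 are assumptions, since they are not visible in the truncated printout in this notebook.

```python
import numpy as np

gamma = 1.0
p_left, p_stay, p_right = 0.5, 1 / 3, 1 / 6   # outcomes of action Left (assumed)

A = np.zeros((5, 5))   # coefficients for the unknowns v(1)..v(5)
b = np.zeros(5)
for i, s in enumerate(range(1, 6)):
    A[i, i] = 1.0
    for prob, s_next in ((p_left, s - 1), (p_stay, s), (p_right, s + 1)):
        if s_next == 6:
            b[i] += prob * 1.0                 # +1 reward on reaching the goal, v(6) = 0
        elif s_next != 0:
            A[i, s_next - 1] -= gamma * prob   # move the v(s') term to the LHS
        # s_next == 0 is the hole: reward 0 and v(0) = 0, so it adds nothing

v = np.linalg.solve(A, b)
print(np.round(v, 4))
```

Under these assumptions the solution works out to $v_{\pi}(1), \dots, v_{\pi}(5) = \frac{1}{364}, \frac{4}{364}, \frac{13}{364}, \frac{40}{364}, \frac{121}{364}$, so the start state 3 has a value of about 0.0357.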

In [3]:
# Step 0 is to import stuff

import gym, gym_walk
import numpy as np
from gym.envs.toy_text.frozen_lake import generate_random_map
import pprint
import random

In [4]:
# Step 1 is to get the MDP

env = gym.make('SlipperyWalkFive-v0')
swf_mdp = env.P
pprint.pprint(swf_mdp)

# Note that in Gym, action "Left" is "0" and "Right" is "1"

{0: {0: [(0.5000000000000001, 0, 0.0, True),
         (0.3333333333333333, 0, 0.0, True),
         (0.16666666666666666, 0, 0.0, True)],
     1: [(0.5000000000000001, 0, 0.0, True),
         (0.3333333333333333, 0, 0.0, True),
         (0.16666666666666666, 0, 0.0, True)]},
 1: {0: [(0.5000000000000001, 0, 0.0, True),
         (0.3333333333333333, 1, 0.0, False),
         (0.16666666666666666, 2, 0.0, False)],
     1: [(0.5000000000000001, 2, 0.0, False),
         (0.3333333333333333, 1, 0.0, False),
         (0.16666666666666666, 0, 0.0, True)]},
 2: {0: [(0.5000000000000001, 1, 0.0, False),
         (0.3333333333333333, 2, 0.0, False),
         (0.16666666666666666, 3, 0.0, False)],
     1: [(0.5000000000000001, 3, 0.0, False),
         (0.3333333333333333, 2, 0.0, False),
         (0.16666666666666666, 1, 0.0, False)]},
 3: {0: [(0.5000000000000001, 2, 0.0, False),
         (0.3333333333333333, 3, 0.0, False),
         (0.16666666666666666, 4, 0.0, False)],
     1: [(0.5000000000000

In [10]:
# Step 2 is to write the policy

pi = {
    0 : 0,
    1 : 0,
    2 : 0,
    3 : 0,
    4 : 0,
    5 : 0,
    6 : 0
}

# Or you can do it randomly
# pi = dict()
# for state in swf_mdp:
#     pi[state] = np.random.choice(list(swf_mdp[state].keys()))

In [45]:
# Step 3 is computing the value function for this environment and policy

# Let us start with an all-zero value function

val = dict()
for state in swf_mdp:
    # val[state] = np.random.random()
    val[state] = 0

# Since 0 and 6 are terminal states, we know their values are 0

val[0] = 0
val[6] = 0

pprint.pprint(val)

# Or you could initialize it randomly; remember to set the terminal states to 0.
# You can also handle this while evaluating the value function, by using the
# terminal flag of each transition.
# Below, a state is treated as terminal if every outcome of every action leads
# back to the same state with the terminal flag set (as state 0 does above).
# val = dict()
# for state in swf_mdp:
#     val[state] = np.random.random()
#     if all(next_state == state and is_terminal
#            for outcomes in swf_mdp[state].values()
#            for (_, next_state, _, is_terminal) in outcomes):
#         val[state] = 0

# Instead of doing this, you can simply initialize the value function to 0 for all states
# for state in swf_mdp:
#     val[state] = 0

{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0}


In [27]:
def get_new_value_fn(val, mdp, pi, gamma = 1.0):
    
    new_val = dict()
    for state in mdp:
        action = pi[state]
        # print(action)
        probability = mdp[state][action]

        new_val_for_state = 0
        for (transition_probability, next_state, reward, isTerminal) in probability:
            new_val_for_state += transition_probability * (reward + gamma * val[next_state] * (not isTerminal))

        new_val[state] = new_val_for_state
    return new_val

In [64]:
# print(get_new_value_fn(val, swf_mdp, pi, gamma = 1.0))
# print(get_new_value_fn(get_new_value_fn(val, swf_mdp, pi, gamma = 1.0), swf_mdp, pi, gamma = 1.0))
# print(get_new_value_fn(get_new_value_fn(get_new_value_fn(val, swf_mdp, pi, gamma = 1.0), swf_mdp, pi, gamma = 1.0), swf_mdp, pi, gamma = 1.0))
# print(get_new_value_fn(get_new_value_fn(get_new_value_fn(get_new_value_fn(val, swf_mdp, pi, gamma = 1.0), swf_mdp, pi, gamma = 1.0), swf_mdp, pi, gamma = 1.0), swf_mdp, pi, gamma = 1.0))


In [35]:
# Use the above function to get the converged value function; also return how many iterations it took to converge
def policy_evaluation(val, mdp, pi, epsilon=1e-10, gamma=1.0):
    count = 1
    while True:
        val_new = get_new_value_fn(val, mdp, pi, gamma)
        if np.max(np.abs(np.array(list(val_new.values())) - np.array(list(val.values())))) < epsilon:
            return val, count
        else:
            count += 1
            val = val_new

In [59]:
# pprint.pprint(policy_evaluation(val, swf_mdp, pi, gamma = 1))

In [77]:
# Perform policy improvement using the policy and the value function, and return a new policy. The action-value function should be a nested dictionary
def policy_improvement(val, mdp, pi, gamma=1.0):
    new_pi = dict()
    q = dict()
    # q must be something like
    # {0: {0: val, 1: val}, 1: {0: val, 1: val},...}

    for state in mdp:
        q[state] = dict() # initialization, each value will be a dictionary
        for action in mdp[state]:
            q[state][action] = 0
            # prob_tuples = mdp[state][action]
            for (transition_probability, next_state, reward, isTerminal) in mdp[state][action]:
                q[state][action] += transition_probability * (reward + gamma * val[next_state] * (not isTerminal))
            
    # after q has been built, pick the greedy action in each state
    for state in q:
        keys, values = np.array(list(q[state].keys())), np.array(list(q[state].values()))
        new_pi[state] = keys[np.argmax(values)] # q[state] is a dict
        
    return new_pi, q


In [78]:
# Use the above functions to get the optimal policy and optimal value function and return the total number of iterations it took to converge
# Create a random policy and value function to start with or use the ones defined above
def policy_iteration(mdp, epsilon=1e-10, gamma=1.0):
    optimal_pi = {s: 0 for s in mdp}
    optimal_val = {s: 0 for s in mdp} # both pi and val are all zeroes initially
    # as always, it is possible to create a random policy and value function to start with
    count = 0
    
    while True:
        new_pi = policy_improvement(optimal_val, mdp, optimal_pi, gamma=gamma)[0]
        if new_pi == optimal_pi: # convergence when policy cannot be optimized further
            return optimal_pi, optimal_val, count + 1
        else:
            count += 1
            optimal_pi = new_pi
            optimal_val = policy_evaluation(optimal_val, mdp, optimal_pi, epsilon=epsilon, gamma=gamma)[0]

In [130]:
# Now perform value iteration. Note that the value function is a dictionary, not a list. Also return the number of iterations it took to converge
def value_iteration(mdp, gamma=1.0, epsilon=1e-10):
    val = {s: 0 for s in mdp}
    count = 1
    
    while True:

        q = dict()

        for state in mdp:
            q[state] = dict()
            for action in mdp[state]:
                q[state][action] = 0
                for (transition_probability, next_state, reward, isTerminal) in mdp[state][action]:
                    q[state][action] += transition_probability * (reward + gamma * val[next_state] * (not isTerminal))

        val_new = {state: max(q[state].values()) for state in mdp}
                    
        if np.max(np.abs(np.array(list(val_new.values())) - np.array(list(val.values())))) < epsilon:
            break

        val = val_new.copy()
        count += 1

    pi = {s: 0 for s in mdp}
    for state in mdp:
        keys, values = np.array(list(q[state].keys())), np.array(list(q[state].values()))
        pi[state] = keys[np.argmax(values)]
        
    return val, pi, count
    

## Environment 2 - Frozen Lake

Repeat the above steps for the Frozen Lake environment. Don't create new functions; reuse the ones defined above.

You can also write a function `test_policy()` to test your policy after training and measure how often it reaches the goal state.
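
One possible shape for such a function is sketched below. It samples transitions straight from a Gym-style `P` dictionary, so it does not depend on the grid layout. The name `estimate_success_rate` and its parameters are hypothetical, and the tiny Bandit Walk dictionary is an assumed encoding used only to make the example self-contained.

```python
import random

def estimate_success_rate(pi, mdp, start_state, goal_state,
                          episodes=1000, max_steps=200, seed=0):
    """Roll out `pi` by sampling transitions from a Gym-style `P` dictionary."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(episodes):
        state = start_state
        for _ in range(max_steps):            # cap episode length to avoid endless loops
            outcomes = mdp[state][pi[state]]  # list of (prob, next_state, reward, done)
            weights = [prob for (prob, _, _, _) in outcomes]
            _, state, _, done = rng.choices(outcomes, weights=weights, k=1)[0]
            if done:
                break
        if state == goal_state:
            successes += 1
    return successes / episodes

# Tiny deterministic check on an assumed Bandit Walk MDP (0 = hole, 2 = goal).
bw_mdp = {
    0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
    1: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
always_right = {0: 1, 1: 1, 2: 1}
always_left = {0: 0, 1: 0, 2: 0}
print(estimate_success_rate(always_right, bw_mdp, start_state=1, goal_state=2))  # 1.0
print(estimate_success_rate(always_left, bw_mdp, start_state=1, goal_state=2))   # 0.0
```

For Frozen Lake, the same function could presumably be called with `start_state=0` and `goal_state=15` once a policy dictionary is available. With $\gamma = 1$ and a reward of 1 only at the goal, the estimate should then be close to the value of the start state.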

In [93]:
mdp2 = gym.make('FrozenLake-v1').P
pprint.pprint(mdp2)

{0: {0: [(0.3333333333333333, 0, 0.0, False),
         (0.3333333333333333, 0, 0.0, False),
         (0.3333333333333333, 4, 0.0, False)],
     1: [(0.3333333333333333, 0, 0.0, False),
         (0.3333333333333333, 4, 0.0, False),
         (0.3333333333333333, 1, 0.0, False)],
     2: [(0.3333333333333333, 4, 0.0, False),
         (0.3333333333333333, 1, 0.0, False),
         (0.3333333333333333, 0, 0.0, False)],
     3: [(0.3333333333333333, 1, 0.0, False),
         (0.3333333333333333, 0, 0.0, False),
         (0.3333333333333333, 0, 0.0, False)]},
 1: {0: [(0.3333333333333333, 1, 0.0, False),
         (0.3333333333333333, 0, 0.0, False),
         (0.3333333333333333, 5, 0.0, True)],
     1: [(0.3333333333333333, 0, 0.0, False),
         (0.3333333333333333, 5, 0.0, True),
         (0.3333333333333333, 2, 0.0, False)],
     2: [(0.3333333333333333, 5, 0.0, True),
         (0.3333333333333333, 2, 0.0, False),
         (0.3333333333333333, 1, 0.0, False)],
     3: [(0.3333333333333333,

In [73]:
# env2 = gym.make('FrozenLake-v1',desc=generate_random_map(size=4))
# mdp2 = env2.P

In [99]:
pi1, val1, count1 = policy_iteration(mdp2)
pprint.pprint(pi1)
pprint.pprint(val1)
pprint.pprint(count1)
# value_iteration returns (val, pi, count), in that order
val2, pi2, count2 = value_iteration(mdp2)
pprint.pprint(val2)
pprint.pprint(pi2)
pprint.pprint(count2)

{0: 0,
 1: 3,
 2: 3,
 3: 3,
 4: 0,
 5: 0,
 6: 0,
 7: 0,
 8: 3,
 9: 1,
 10: 0,
 11: 0,
 12: 0,
 13: 2,
 14: 1,
 15: 0}
{0: 0.8235294093675635,
 1: 0.8235294085640291,
 2: 0.8235294079934693,
 3: 0.8235294076974147,
 4: 0.8235294095420496,
 5: 0.0,
 6: 0.5294117629918788,
 7: 0.0,
 8: 0.8235294098783209,
 9: 0.8235294103519006,
 10: 0.7647058811069277,
 11: 0.0,
 12: 0.0,
 13: 0.88235294017329,
 14: 0.9411764700677003,
 15: 0.0}
7
{0: 0.8235294093738184,
 1: 0.8235294085723809,
 2: 0.82352940800331,
 3: 0.8235294077080278,
 4: 0.823529409547849,
 5: 0.0,
 6: 0.5294117629963512,
 7: 0.0,
 8: 0.823529409883243,
 9: 0.823529410355587,
 10: 0.764705881110179,
 11: 0.0,
 12: 0.0,
 13: 0.8823529401759076,
 14: 0.9411764700690586,
 15: 0.0}
{0: 0,
 1: 3,
 2: 3,
 3: 3,
 4: 0,
 5: 0,
 6: 0,
 7: 0,
 8: 3,
 9: 1,
 10: 0,
 11: 0,
 12: 0,
 13: 2,
 14: 1,
 15: 0}
806


In [95]:
fl_mdp = {}

## (transition probability, next state, reward, isTerminal)

terminal_states = [5, 7, 11, 12, 15]
correspondance = {"Left" : 0, "Down" : 1, "Right" : 2, "Up" : 3}
direction = [-1, +4, +1, -4]

actions_dict = {
    0: [3, 0, 1],
    1: [0, 1, 2],
    2: [1, 2, 3],
    3: [2, 3, 0]
}


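def new_state(state, move):
    # Moves that would leave the 4x4 grid keep the agent in place
    if move == -4 and state + move < 0:             # Up from the top row
        return state
    if move == 4 and state + move > 15:             # Down from the bottom row
        return state
    if move == 1 and (state + move) % 4 == 0:       # Right from the last column
        return state
    if move == -1 and (state + move + 4) % 4 == 3:  # Left from the first column
        return state
    return state + move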


In [129]:
def test_policy(pi, env, goalstate = 15):
    success = 0
    failure = 0
    LIMIT = 10000

    for i in range(LIMIT):
        state = 0
        while True:
            chance = random.random()
            moves = actions_dict[pi[state]]
            if chance < 1/3:
                state = new_state(state, direction[moves[0]])
            elif chance < 2/3:
                state = new_state(state, direction[moves[1]])
            else:
                state = new_state(state, direction[moves[2]])

            if state == goalstate:
                success += 1
                break

            if state in terminal_states:
                failure += 1
                break
            
    return f"{(success/LIMIT)*100}% times it reached goal state"

In [128]:
for i in range(15):
    print(test_policy(pi1, mdp2))

82.45% times it reached goal state
82.64% times it reached goal state
83.25% times it reached goal state
82.11% times it reached goal state
82.22% times it reached goal state
82.59% times it reached goal state
82.28999999999999% times it reached goal state
82.38% times it reached goal state
81.82000000000001% times it reached goal state
82.39999999999999% times it reached goal state
81.93% times it reached goal state
82.58% times it reached goal state
82.17% times it reached goal state
82.38% times it reached goal state
82.25% times it reached goal state
