# COURSE:   PGP [AI&ML]

## Learner :  Chaitanya Kumar Battula
## Module  : RNN
## Topic   : Implement policy evaluation method on GridWorld.

## **Environment**

* The environment has two terminal states, located at:<br>
  * Top left corner
  * Bottom right corner
<br>
The 4x4 grid looks as follows:<br>
T  o  o  o<br>
o  x  o  o<br>
o  o  o  o<br>
o  o  o  T<br>

    Here **x** marks the position of the agent and **T** marks each of the two terminal states.<br>

* The allowed actions are as follows:
  * UP = 0 
  * RIGHT = 1 
  * DOWN = 2 
  * LEFT = 3 <br>


    Note: If an action would take the agent off the edge of the grid, the agent stays in its current state.

* Rewards:<br>The agent is granted a reward of -1 at each step until it reaches a terminal state.

    
    Environment courtesy: Sutton and Barto, Reinforcement Learning: An Introduction, Chapter 4.
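
The movement and reward rules above can be sketched as a small transition function. This is only an illustration, not the code of the `GridworldEnv` class: the helper name `step`, the row-by-row state numbering (0 to 15), and the choice to give terminal states a reward of 0 are assumptions made for this sketch.

```python
# Action encoding taken from the list above.
UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
GRID_SIZE = 4                 # 4x4 grid; states numbered 0..15 row by row (assumed)
TERMINAL_STATES = {0, 15}     # top-left and bottom-right corners


def step(state, action):
    """Hypothetical helper: return (next_state, reward, done) for one move."""
    if state in TERMINAL_STATES:
        # Assumption: terminal states absorb the agent with no further reward.
        return state, 0.0, True

    row, col = divmod(state, GRID_SIZE)
    if action == UP:
        row = max(row - 1, 0)               # clamp at the top edge
    elif action == RIGHT:
        col = min(col + 1, GRID_SIZE - 1)   # clamp at the right edge
    elif action == DOWN:
        row = min(row + 1, GRID_SIZE - 1)   # clamp at the bottom edge
    elif action == LEFT:
        col = max(col - 1, 0)               # clamp at the left edge
    else:
        raise ValueError(f"unknown action: {action}")

    next_state = row * GRID_SIZE + col
    # Every step taken from a non-terminal state costs -1.
    return next_state, -1.0, next_state in TERMINAL_STATES


print(step(5, UP))    # agent at x (state 5) moves up
print(step(3, UP))    # move off the top edge: the agent stays where it is
print(step(1, LEFT))  # move into the top-left terminal state
```

Clamping the row and column is one simple way to express "stay in place when moving off the edge"; the real environment may implement it differently.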


### **Dependencies**
* [Discrete](https://drive.google.com/file/d/1aLV-ln3qZgDQbbGZVDQaez4buV9EhdwJ/view?usp=sharing)
* [Gridworld](https://drive.google.com/file/d/1MdOVjmYzSR4Gg1AX3AnwnpXqCMfMPcax/view?usp=sharing)

## **Import libraries and environment**

    Note: Place the discrete and gridworld files in the working directory.

In [1]:
import numpy as nump
import sys
from gridworld import GridworldEnv

In [2]:
environment = GridworldEnv()

In [3]:
environment.nS

16

In [4]:
environment.nA

4

In [5]:
policy = nump.ones([environment.nS, environment.nA]) / environment.nA
policy

array([[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]])

In [7]:
policy[0]
# Actions are 0, 1, 2, 3
# Each action has probability 0.25 under the uniform random policy

array([0.25, 0.25, 0.25, 0.25])
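
Each row of the policy matrix is a probability distribution over the four actions, so it can be used directly to sample an action. The following is a minimal sketch under the assumption of 16 states and 4 actions (hard-coded here so the snippet runs without the `gridworld` module); the seed value is arbitrary.

```python
import numpy as nump

n_states, n_actions = 16, 4   # assumed to match environment.nS and environment.nA

# Uniform random policy: every action is equally likely in every state.
policy = nump.ones([n_states, n_actions]) / n_actions

# Every row must sum to 1 to be a valid probability distribution.
row_sums = policy.sum(axis=1)
print(nump.allclose(row_sums, 1.0))

# Sample one action for state 0 according to its row of probabilities.
rng = nump.random.default_rng(0)   # arbitrary seed, for repeatability only
action = rng.choice(n_actions, p=policy[0])
print(action)
```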

In [8]:
nump.zeros(environment.nS)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

## **Evaluate the policy**

Arguments:
    
* policy = [S, A]-shaped matrix, where policy[s][a] is the probability of taking action a in state s
* environment.P = Transition model of the environment
* environment.P[s][a] = List of transition tuples (prob, next_state, reward, done) for taking action a in state s
* environment.nS = Number of states 
* environment.nA = Number of actions
* theta = Convergence threshold; evaluation stops once the value function changes by less than theta in every state
* discount_factor = Gamma discount factor
* Returns = Value function as a vector of length environment.nS
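
In equation form, the quantities listed above combine into the iterative policy evaluation update. The notation is an assumption chosen for this note: $\pi(a \mid s)$ stands for `policy[s][a]`, $P(s' \mid s, a)$ and $R(s, a, s')$ for the `prob` and `reward` entries of `environment.P[s][a]`, $\gamma$ for `discount_factor`, and $\theta$ for `theta`.

$$
V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V_k(s') \bigr]
$$

The sweeps stop once the largest change over all states falls below the threshold:

$$
\max_{s} \,\bigl\lvert V_{k+1}(s) - V_k(s) \bigr\rvert < \theta
$$

An implementation that overwrites the value of each state as soon as it is computed (an in-place sweep) uses the newest available values on the right-hand side instead of strictly $V_k$; both variants converge to the same value function.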
        

In [9]:
def policy_eval(policy, environment, discount_factor=1.0, theta=0.00001):
    
    # Start with an all-zero value function (one entry per state).
    Val_function = nump.zeros(environment.nS)
    while True:
      
        # Largest change in the value of any state during this sweep
        delta = 0
        # Perform a "full backup" for each state
        for s in range(environment.nS):
            v = 0
            # Look at every action and its probability under the policy
            for a, action_prob in enumerate(policy[s]):
              
                # Look at every possible next state for this action
                for  prob, next_state, reward, done in environment.P[s][a]:
                  
                    # Accumulate the expected value of this transition
                    v += action_prob * prob * (reward + discount_factor * Val_function[next_state])
                    print("v" + str(v))  # debug trace of the running sum
                    
            # Track the largest change in the value function across all states
            delta = max(delta, nump.abs(v - Val_function[s]))
            # In-place update: later states in this sweep see the new value
            Val_function[s] = v
            print(Val_function)  # debug trace of the value function
              
        # Stop once the largest change in this sweep is below the threshold theta
        if delta < theta:
            break
    return nump.array(Val_function)
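
To see the `P[s][a]` transition format in isolation, the same sweep can be run on a tiny hand-built problem. Everything in this sketch is hypothetical and chosen only for illustration: a three-state chain with a single action that always moves one state to the left, and a compact `evaluate` function that mirrors the loop above without the debug prints.

```python
import numpy as nump

# Hypothetical 3-state chain: state 0 is terminal, the only action moves left.
# P[s][a] is a list of (prob, next_state, reward, done) tuples.
P = {
    0: {0: [(1.0, 0, 0.0, True)]},    # terminal state: stays put, no reward
    1: {0: [(1.0, 0, -1.0, True)]},   # one step away from the terminal state
    2: {0: [(1.0, 1, -1.0, False)]},  # two steps away
}
policy = nump.ones((3, 1))            # the single action is taken with probability 1


def evaluate(policy, P, discount_factor=1.0, theta=1e-5):
    """Iterative policy evaluation with in-place updates (illustrative sketch)."""
    values = nump.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in P[s][a]:
                    v += action_prob * prob * (reward + discount_factor * values[next_state])
            delta = max(delta, abs(v - values[s]))
            values[s] = v
        if delta < theta:
            return values


chain_values = evaluate(policy, P)
print(chain_values)   # each value is minus the number of steps to the terminal state
```

With a discount factor of 1 and a reward of -1 per step, the value of a state under a deterministic policy is simply the negative number of steps needed to terminate, which makes the result easy to check by hand.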

In [10]:
Random_policy = nump.ones([environment.nS, environment.nA]) / environment.nA
v = policy_eval(Random_policy, environment)

v0.0
v0.0
v0.0
v0.0
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
v-0.25
v-0.5
v-0.75
v-1.0
[ 0. -1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
v-0.25
v-0.5
v-0.75
v-1.25
[ 0.   -1.   -1.25  0.    0.    0.    0.    0.    0.    0.    0.    0.
  0.    0.    0.    0.  ]
v-0.25
v-0.5
v-0.75
v-1.3125
[ 0.     -1.     -1.25   -1.3125  0.      0.      0.      0.      0.
  0.      0.      0.      0.      0.      0.      0.    ]
v-0.25
v-0.5
v-0.75
v-1.0
[ 0.     -1.     -1.25   -1.3125 -1.      0.      0.      0.      0.
  0.      0.      0.      0.      0.      0.      0.    ]
v-0.5
v-0.75
v-1.0
v-1.5
[ 0.     -1.     -1.25   -1.3125 -1.     -1.5     0.      0.      0.
  0.      0.      0.      0.      0.      0.      0.    ]
v-0.5625
v-0.8125
v-1.0625
v-1.6875
[ 0.     -1.     -1.25   -1.3125 -1.     -1.5    -1.6875  0.      0.
  0.      0.      0.      0.      0.      0.      0.    ]
v-0.578125
v-0.828125
v-1.078125
v-1.75
[ 0.     -1.     -1.25   -1.3125 -1.     -1.5    -

In [25]:
v

array([  0.        , -13.99993529, -19.99990698, -21.99989761,
       -13.99993529, -17.9999206 , -19.99991379, -19.99991477,
       -19.99990698, -19.99991379, -17.99992725, -13.99994569,
       -21.99989761, -19.99991477, -13.99994569,   0.        ])

In [26]:
print("Value Function:")
print(v)
print("")

print("Reshaped Grid Value Function:")
print(v.reshape(environment.shape))
print("")

Value Function:
[  0.         -13.99993529 -19.99990698 -21.99989761 -13.99993529
 -17.9999206  -19.99991379 -19.99991477 -19.99990698 -19.99991379
 -17.99992725 -13.99994569 -21.99989761 -19.99991477 -13.99994569
   0.        ]

Reshaped Grid Value Function:
[[  0.         -13.99993529 -19.99990698 -21.99989761]
 [-13.99993529 -17.9999206  -19.99991379 -19.99991477]
 [-19.99990698 -19.99991379 -17.99992725 -13.99994569]
 [-21.99989761 -19.99991477 -13.99994569   0.        ]]
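
As a cross-check on the iterative result, the value function of a fixed policy can also be obtained by solving the Bellman equations as one linear system, `(I - P_pi) v = r_pi`. The sketch below rebuilds the 4x4 grid dynamics by hand instead of using `GridworldEnv`, so the row-by-row state numbering and the names `P_pi` and `r_pi` are assumptions of this example.

```python
import numpy as nump

N = 4                          # grid side length
n_states = N * N
terminal = {0, n_states - 1}   # top-left and bottom-right corners

# P_pi[s, s2]: probability of moving from s to s2 under the uniform random policy.
# r_pi[s]: expected immediate reward in state s under that policy.
P_pi = nump.zeros((n_states, n_states))
r_pi = nump.zeros(n_states)

for s in range(n_states):
    if s in terminal:
        continue               # terminal rows stay zero, which pins their value to 0
    row, col = divmod(s, N)
    moves = [
        (max(row - 1, 0), col),        # UP
        (row, min(col + 1, N - 1)),    # RIGHT
        (min(row + 1, N - 1), col),    # DOWN
        (row, max(col - 1, 0)),        # LEFT
    ]
    for next_row, next_col in moves:
        P_pi[s, next_row * N + next_col] += 0.25
    r_pi[s] = -1.0             # every step from a non-terminal state costs -1

# Solve (I - P_pi) v = r_pi with discount factor 1.
v_exact = nump.linalg.solve(nump.eye(n_states) - P_pi, r_pi)
print(v_exact.reshape(N, N))
```

The direct solve is practical here because there are only 16 states; the iterative sweep scales better when the state space is large.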



## **Test the Evaluated Policy Against the Expected** 

In [32]:
expected_v = nump.array([0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22, -20, -14, 0])
# assert_array_almost_equal returns None on success and raises AssertionError on a mismatch
try:
    nump.testing.assert_array_almost_equal(v, expected_v, decimal=2)
    print("Match")
except AssertionError:
    print("No Match")

Match
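
Because `assert_array_almost_equal` reports a mismatch by raising `AssertionError` rather than through its return value, a plain boolean comparison is sometimes more convenient. A minimal sketch using `nump.allclose`, with the evaluated values copied from the output above and an absolute tolerance of 0.01 chosen as an assumption for this example:

```python
import numpy as nump

# Values copied from the evaluation output shown earlier in this notebook.
v = nump.array([0.0, -13.99993529, -19.99990698, -21.99989761,
                -13.99993529, -17.9999206, -19.99991379, -19.99991477,
                -19.99990698, -19.99991379, -17.99992725, -13.99994569,
                -21.99989761, -19.99991477, -13.99994569, 0.0])
expected_v = nump.array([0, -14, -20, -22, -14, -18, -20, -20,
                         -20, -20, -18, -14, -22, -20, -14, 0])

# allclose returns True or False instead of raising an exception.
matches = nump.allclose(v, expected_v, atol=1e-2, rtol=0.0)
print("Match" if matches else "No Match")
```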
