
# Predicting the value of states in a frozen lake environment
We learned that in the prediction task, the policy is given as an input and we predict
the value function under that policy. So let's define a random policy and predict the
value function (state values) of the frozen lake environment under it.

First, let's import the necessary libraries:

In [None]:
import gym
import pandas as pd

Now, we create the frozen lake environment using gym:

In [None]:
env = gym.make('FrozenLake-v1')

  "Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future."
  "Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future."
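
The warnings above refer to gym's two step APIs: the old one returns a single `done` flag from `env.step`, while the new one returns two flags, `terminated` and `truncated`. As a minimal sketch (the helper name `unpack_step` is hypothetical and not part of gym), a step result of either shape can be normalized to the three values this notebook uses:

```python
def unpack_step(result):
    """Normalize an env.step() result to (next_state, reward, done).

    Hypothetical helper, not part of gym. Assumes the old step API returns
    (obs, reward, done, info) and the new step API returns
    (obs, reward, terminated, truncated, info).
    """
    if len(result) == 5:
        obs, reward, terminated, truncated, _info = result
        # The episode is over if it either terminated or was truncated
        return obs, reward, terminated or truncated
    obs, reward, done, _info = result
    return obs, reward, done


# Plain tuples stand in for real env.step() results here
print(unpack_step((4, 0.0, False, {})))
print(unpack_step((15, 1.0, True, False, {})))
```

Under the new API, `env.reset()` also returns an `(observation, info)` pair rather than the observation alone, so the reset call would need the same kind of adjustment.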


Define the random policy, which returns a random action sampled from the action
space:

In [None]:
def random_policy():
    return env.action_space.sample()

Let's define a dictionary for storing the state values and initialize the value of every
state to 0.0:

In [None]:
V = {}
for s in range(env.observation_space.n):
    V[s] = 0.0

Initialize the discount factor $\gamma$ and the learning rate $\alpha$: 

In [None]:
alpha = 0.85
gamma = 0.90

Define the number of episodes and the maximum number of time steps per episode:

In [None]:
num_episodes = 5000
num_timesteps = 1000

## Computing the value of states
Now, let's estimate the value function (state values) under the random policy using the TD(0) update rule:

$$V(s) = V(s) + \alpha (r + \gamma V(s') - V(s)) $$

In [None]:
# for each episode
for i in range(num_episodes):

    # initialize the state by resetting the environment
    s = env.reset()

    # for every step in the episode
    for t in range(num_timesteps):

        # select an action according to the random policy
        a = random_policy()

        # perform the selected action and observe the next state, reward, and done flag
        s_, r, done, _ = env.step(a)

        # TD(0) update: move V[s] toward the target r + gamma * V[s_]
        V[s] += alpha * (r + gamma * V[s_] - V[s])

        # the next state becomes the current state
        s = s_

        # if the episode has ended, move on to the next episode
        if done:
            break
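
To make the update concrete, here is a minimal, self-contained sketch of the TD(0) step applied to two hand-picked transitions. The helper `td0_update` and the transitions themselves are illustrative assumptions rather than output from the environment; the `alpha` and `gamma` values are the ones set earlier.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error


alpha, gamma = 0.85, 0.90

# Illustrative values only: three states, all initialized to 0.0
V = {13: 0.0, 14: 0.0, 15: 0.0}

# Hypothetical transition 14 -> 15 (the goal) with reward 1:
# V[14] = 0 + 0.85 * (1 + 0.9 * 0 - 0) = 0.85
td0_update(V, s=14, r=1.0, s_next=15, alpha=alpha, gamma=gamma)

# Hypothetical transition 13 -> 14 with reward 0:
# V[13] = 0 + 0.85 * (0 + 0.9 * 0.85 - 0) = 0.65025
td0_update(V, s=13, r=0.0, s_next=14, alpha=alpha, gamma=gamma)

print(V)
```

The second update shows bootstrapping: state 13 receives no reward itself but gains value from the current estimate of state 14.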

After all the episodes, `V` holds an estimate of the value of every state under the given
random policy.

## Evaluating the value of states 

Now, let's inspect the estimated state values. First, let's convert the value dictionary
to a pandas data frame so it is easier to read:

In [None]:
df = pd.DataFrame(list(V.items()), columns=['state', 'value'])

Before checking the values, recall that gym encodes each state of the frozen lake
environment as a number. Since there are 16 states, they are numbered from 0 to 15,
as shown below:

![title](Images/1.png)
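
As a sketch of this encoding, the snippet below maps each state number to its (row, column) position and lists the terminal states. It assumes the default 4x4 layout of `FrozenLake-v1`, written out here as the hypothetical constant `MAP_4X4` instead of being read from the environment, where `S` is the start, `F` is a frozen tile, `H` is a hole, and `G` is the goal.

```python
# Assumed default 4x4 layout (S = start, F = frozen, H = hole, G = goal)
MAP_4X4 = ["SFFF",
           "FHFH",
           "FFFH",
           "HFFG"]
N_COLS = len(MAP_4X4[0])


def state_to_cell(s, n_cols=N_COLS):
    """Return the (row, column) of state number s in row-major order."""
    return divmod(s, n_cols)


def states_with_tile(tile):
    """Return the state numbers whose tile matches the given letter."""
    return [row * N_COLS + col
            for row, line in enumerate(MAP_4X4)
            for col, ch in enumerate(line)
            if ch == tile]


holes = states_with_tile("H")
goal = states_with_tile("G")
print("holes:", holes, "goal:", goal)
```

Under this assumed layout, the holes are states 5, 7, 11, and 12, and the goal is state 15.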

Now, let's check the value of the states:

In [None]:
df

| state | value |
|-------|----------|
| 0 | 0.000262 |
| 1 | 0.015199 |
| 2 | 0.002004 |
| 3 | 0.0001 |
| 4 | 0.001991 |
| 5 | 0.0 |
| 6 | 0.007678 |
| 7 | 0.0 |
| 8 | 0.002771 |
| 9 | 0.085223 |


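As we can observe, we now have an estimated value for each state. Note that the value
of every terminal state (the hole states and the goal state) is zero: an episode ends as
soon as a terminal state is reached, so its value is never updated from the initial 0.0.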

Now that we have understood how TD learning can be used for the prediction task, in the
next section, we will learn how to use TD learning for the control task. 