In [1]:
import gymnasium as gym
import numpy as np

In [48]:
def value_iteration(env,gamma,theta):
    V = np.zeros((env.observation_space.n))

    env.reset()
    counter = 0
    while (True):
        counter += 1
        delta = 0
        for state in range(env.observation_space.n):
            values = []
            for action in range(env.action_space.n):
                for probability ,next_state,reward, terminated in env.P[state][action]:
                    values.append(probability * (reward + gamma * V[next_state]))
                
            best_value = np.max(values)
                
            if terminated : env.reset()
                            
            delta = max(delta,np.abs(V[state]-best_value))
            V[state] = best_value

        if delta < theta : break
    env.close()
    print(f'counter:{counter}')
    return V

In [3]:
def get_policy(env,gamma,V):
    policy = np.zeros(env.observation_space.n, dtype=int)
    for state in range(env.observation_space.n):
        values = []
        for action in range(env.action_space.n):
            for probability ,next_state,reward, terminated in env.P[state][action]:
                values.append(probability * (reward + gamma * V[next_state]))

        max_index = []
        best_value = np.max(values)
        for action in range(env.action_space.n):
            if values[action] == best_value :
                max_index.append(action)
        if max_index != []:
            policy[state] = np.random.choice(max_index)
        else:
            policy[state] = np.random.randint(0,4)
            
    return policy

In [4]:
def train(theta,gamma,slippery):
    env = gym.make('FrozenLake-v1',map_name='4x4',render_mode=None,is_slippery=slippery)
    V  = value_iteration(env,gamma,theta)
    policy = get_policy(env,gamma,V)
    return V,policy

In [5]:
def walk(policy,slippery):
    env = gym.make('FrozenLake-v1',map_name='4x4',render_mode="human",is_slippery=slippery)
    state = env.reset()
    terminated = False
    state = 0
    while not terminated:
        action = policy[state]
        next_state , reward, terminated, truncated , _ = env.step(action)
        state = next_state 
    env.close()

Value Iteration when is_slippery = False

low gamma
gamma = 0.1

In [57]:
V , policy = train(theta=0.0001,gamma=0.1,slippery=False)
print(f"Optimal_Value : {V} , Optimal_Policy : {policy}")
walk(policy=policy,slippery=False)

counter:6
Optimal_Value : [1.e-05 1.e-04 1.e-03 1.e-04 1.e-04 0.e+00 1.e-02 0.e+00 1.e-03 1.e-02
 1.e-01 0.e+00 0.e+00 1.e-01 1.e+00 0.e+00] , Optimal_Policy : [1 2 1 0 1 0 1 3 2 2 1 3 3 2 2 0]


we can see that using gamma = 0.1 or low gamma , we will have convergence to optimal value if we have convergence rate(theta) below 0.001 . for higher value of theta the convergence is not guranteed . but we can see that if we use proper theta value we will converge quickly to the optimal value.
also its obvious that the value function is much more focused on immediate rewards that could make big differenc  in environment with various type of rewards . 

mid gamma
gamma = 0.5

In [58]:
V , policy = train(theta=0.0001,gamma=0.5,slippery=False)
print(f"Optimal_Value : {V} , Optimal_Policy : {policy}")
walk(policy=policy,slippery=False)

counter:7
Optimal_Value : [0.03125 0.0625  0.125   0.0625  0.0625  0.      0.25    0.      0.125
 0.25    0.5     0.      0.      0.5     1.      0.     ] , Optimal_Policy : [2 2 1 0 1 1 1 0 2 1 1 0 2 2 2 0]


most noticable thing is that with mid level gamma we can set the theta to 0.1 which is much bigger that the theta rate with low gamma .but if we use the same theta as last part we can see that this gamma converge to optimal value slower. also the 0.5 gamma means we focus equally on discounted and immediate reward .

High gamma
gamma = 0.9

In [62]:
V , policy = train(theta=0.0001,gamma=0.9,slippery=False)
print(f"Optimal_Value : {V} , Optimal_Policy : {policy}")
walk(policy=policy,slippery=False)

counter:7
Optimal_Value : [0.59049 0.6561  0.729   0.6561  0.6561  0.      0.81    0.      0.729
 0.81    0.9     0.      0.      0.9     1.      0.     ] , Optimal_Policy : [1 2 1 0 1 3 1 0 2 2 1 0 2 2 2 1]


Reuslts for gamma = 0.9 is like results with gamma = 0.5 but theta rate could be even higher(7X higher) .We can see that the convegence speed is slower . The main difference is that with high gamma we are highly focused on rewards from future and immediate rewards are replaced in time . 

Value Iteration when is_slippery = True

With the stochastic environment we can see that agent learn things differently. If it's trained enough it wont even go near the holes beacuse there is chance it get stuck . So we can see the policy that it learn in slippery mode is very different from non_slippery mode.

low gamma
gamma = 0.1

In [81]:
V , policy = train(theta=0.001,gamma=0.1,slippery=True)
print(f"Optimal_Value : {V} , Optimal_Policy : {policy}")
walk(policy=policy,slippery=True)

counter:3
Optimal_Value : [0.         0.         0.         0.         0.         0.
 0.00037037 0.         0.         0.00037037 0.01111111 0.
 0.         0.01111111 0.33333333 0.        ] , Optimal_Policy : [0 1 2 3 0 3 2 3 1 2 2 2 0 3 2 3]


With low gamma ,  its hard for agent to get the actual value of each state that is connected to the Goal state . This happens beacuse due to low gamma , its very unlikly to get the real value of states that come from states in far future.

mid gamma 
gamma  = 0.5

In [109]:
V , policy = train(theta=0.0001,gamma=0.5,slippery=True)
print(f"Optimal_Value : {V} , Optimal_Policy : {policy}")
walk(policy=policy,slippery=True)

counter:6
Optimal_Value : [4.28669410e-05 2.57201646e-04 1.54320988e-03 2.57201646e-04
 2.57201646e-04 0.00000000e+00 9.25925926e-03 0.00000000e+00
 1.54320988e-03 9.25925926e-03 5.55555556e-02 0.00000000e+00
 0.00000000e+00 5.55555556e-02 3.33333333e-01 0.00000000e+00] , Optimal_Policy : [2 3 2 3 2 2 2 2 3 2 2 2 2 3 1 3]


We can see that with looking more into future the agent can learn to avoid holes better . but still it's difficult to model this stochastic env.

High gamma 
gamma =0.9

In [101]:
V , policy = train(theta=0.0001,gamma=0.9,slippery=True)
print(f"Optimal_Value : {V} , Optimal_Policy : {policy}")
walk(policy=policy,slippery=True)

counter:7
Optimal_Value : [0.00081    0.0027     0.009      0.0027     0.0027     0.
 0.03       0.         0.009      0.03       0.1        0.
 0.         0.1        0.33333333 0.        ] , Optimal_Policy : [2 0 2 3 2 2 2 1 3 2 2 1 1 1 1 2]


We can see that with high gamma , the value of the states near the goal are very high and the agent will try to reach them faster . unlike this we saw in last two parts that the value of the states around goal are less beacuse lower gammas do not look that deep into future state of this states.

Policy Iteration vs Value Iteration

Performance of both algorithms varies based on the discount factor and the evironment's stochatic property . 
Policy iteration is faster due to variating between policies . 
Value iteration seems to be more robust in stochastic environment . 