In [1]:
import numpy as np
import gymnasium as gym

In [7]:
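def policy_evaluation(env,policy,num_iteration,theta,gamma):
    # Evaluates `policy`, then returns the greedy policy for the resulting values.
    # Transition table: P[state][action] -> [(probability, next_state, reward, terminated), ...]
    P = env.unwrapped.P
    n_states = env.observation_space.n
    value = np.zeros(n_states)
    for i in range(num_iteration):
        previous_value = np.copy(value)
        for state in range(n_states):
            action = policy[state]
            # Bellman expectation backup: the probability weights both the reward and the discounted next value
            value[state] = sum(probability * (reward + gamma * previous_value[next_state]) for probability , next_state , reward, _ in P[state][action])
        
        # Stop once the total change over all states is at most theta
        diff = np.sum(np.abs(np.subtract(previous_value , value))) 
        if diff <= theta:
            break
    
    # Policy improvement: act greedily with respect to the evaluated values
    policy = np.zeros(n_states,dtype=int)
    for state in range(n_states):
        q_values = [sum(probability * (reward + gamma * value[next_state]) for probability , next_state, reward, _ in P[state][action]) for action in range(env.action_space.n)]
        policy[state] = np.argmax(q_values)
    return policy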

In [9]:
def policy_iteration(env,num_iteration,theta,gamma):
    policy = np.zeros(env.observation_space.n,dtype=int)
    for i in range(num_iteration):
        # policy_evaluation evaluates `policy` and returns the greedy (improved) policy
        new_policy = policy_evaluation(env,policy,num_iteration,theta,gamma)
        if (np.all(policy==new_policy)):
            break
        else:
            policy = new_policy
            
    return policy

In [4]:
def train(num_iteration,theta,gamma,slippery):
    env = gym.make("FrozenLake-v1",map_name='4x4',render_mode=None,is_slippery=slippery)
    optimal_policy = policy_iteration(env,num_iteration,theta,gamma)
    return optimal_policy

In [5]:
def walk(policy,slippery):
    env = gym.make("FrozenLake-v1",map_name='4x4',render_mode='human',is_slippery=slippery)
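    state , _ = env.reset()  # reset returns (observation, info)
    terminated = False
    truncated = False
    # Stop on a terminal state or when the time limit truncates the episode
    while not (terminated or truncated):
        action = policy[state]
        next_state , reward, terminated, truncated , _ = env.step(action)
        state = next_state 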
    env.close()

In [11]:
policy = train(num_iteration=6,theta=0.1,gamma=0.5,slippery=False)
walk(policy,slippery=False)

Review

Policy Iteration in non-slippery mode

With appropriate hyperparameters, every gamma value converges to the same policy. The difference lies in the speed of convergence, and the effect of gamma would be more visible in more complex environments.

Low gamma
gamma = 0.1

In [44]:
policy = train(num_iteration=6,theta=0.001,gamma=0.1,slippery=False)
print(policy)
walk(policy,slippery=False)

[1 2 1 0 1 0 1 0 2 1 1 0 0 2 2 0]


With a sufficiently small theta (0.001), which is smaller than the theta that suffices for mid and high gamma, we converge to the optimal policy. The resulting policy focuses heavily on immediate rewards (this would be more obvious in environments with a wider reward range).
Low gamma converges quickly but might result in suboptimal policies because it prioritizes immediate rewards.
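
To make the "immediate rewards" point concrete, here is a minimal sketch of how much the goal reward is worth at the start state after discounting. It assumes the non-slippery 4x4 map, where the shortest path from start to goal takes 6 moves and the reward of 1 arrives on the last move; `discounted_goal_value` is a hypothetical helper, not part of the notebook.

```python
def discounted_goal_value(gamma, steps_to_goal=6, reward=1.0):
    """Start-state value of a deterministic path that earns `reward` on its last step."""
    # The reward arrives on the final transition, so it is discounted (steps - 1) times.
    return reward * gamma ** (steps_to_goal - 1)

for gamma in (0.1, 0.5, 0.9):
    print(f"gamma={gamma}: start-state value ~ {discounted_goal_value(gamma):.6f}")
```

With gamma = 0.1 the start-state value is on the order of 1e-5, far smaller than with gamma = 0.5 or 0.9, which is one way to see why low gamma needs a smaller theta before the values become informative.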

Mid gamma
gamma = 0.5

In [42]:
policy = train(num_iteration=6,theta=0.1,gamma=0.5,slippery=False)
print(policy)
walk(policy,slippery=False)

[1 2 1 0 1 0 1 0 2 1 1 0 0 2 2 0]


We get the same optimal policy in this environment, but theta can be much larger.
Mid gamma gives stable convergence and strikes a balance between immediate and discounted future rewards.

High gamma
gamma = 0.9

In [36]:
policy = train(num_iteration=6,theta=0.9,gamma=0.9,slippery=False)
print(policy)
walk(policy,slippery=False)

[1 2 1 0 1 0 1 0 2 1 1 0 0 2 2 0]


For higher gamma we get the same optimal policy, but with a much higher theta. High gamma converges more slowly, but in complex environments the policy will be closer to optimal because it takes long-term rewards into account.
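
One rough way to see why high gamma converges more slowly: in a hypothetical one-state MDP with a self-loop that pays reward 1, each evaluation sweep changes the value by gamma times the previous change. The sketch below counts sweeps until that change drops to theta (`sweeps_until_converged` is an illustrative name, not from the notebook, and the one-state MDP is an assumption chosen for simplicity).

```python
def sweeps_until_converged(gamma, theta, max_sweeps=10_000):
    """Count sweeps of V <- 1 + gamma * V (one state, self-loop, reward 1)."""
    value = 0.0
    for sweep in range(1, max_sweeps + 1):
        new_value = 1.0 + gamma * value
        change = abs(new_value - value)
        value = new_value
        if change <= theta:
            return sweep
    return max_sweeps  # did not converge within the sweep budget

for gamma in (0.1, 0.5, 0.9):
    print(gamma, sweeps_until_converged(gamma, theta=0.001))
```

For a fixed theta, the sweep count grows as gamma approaches 1, which matches the observation that a tight theta is expensive for high gamma.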

Policy Iteration in slippery mode

Learning in a stochastic environment is more challenging. We can see that the algorithm tries to avoid non-goal terminal states, and sometimes it even uses the randomness of the environment as a tool.

Low gamma
gamma = 0.1

In [47]:
policy = train(num_iteration=6,theta=0.0001,gamma=0.1,slippery=True)
print(policy)
walk(policy,slippery=True)

[1 2 0 3 0 0 0 0 3 1 0 0 0 2 1 0]


Because the environment is slippery, we need more accurate values, which means using a smaller theta. Everything we said about low gamma in the non-slippery part also holds here.

Mid gamma
gamma = 0.5

In [48]:
policy = train(num_iteration=6,theta=0.001,gamma=0.5,slippery=True)
print(policy)
walk(policy,slippery=True)

[2 3 2 3 0 0 0 0 3 1 0 0 0 2 1 0]


Here we can see an interesting choice by the policy in the state just before the goal: it chooses the down action instead of going right, which still gives the agent a chance to slip right into the goal.
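
The sketch below enumerates where the agent can end up from the state just before the goal (state 14) under each action. It assumes the default slippery dynamics of FrozenLake, where the agent moves in the intended direction or one of the two perpendicular directions with equal probability, and the action encoding 0 = left, 1 = down, 2 = right, 3 = up; `slip_outcomes` is a hypothetical helper written for illustration.

```python
# Action encoding assumed here: 0 = left, 1 = down, 2 = right, 3 = up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}
SIZE = 4  # 4x4 map

def slip_outcomes(state, action):
    """States reachable from `state` when `action` may slip to a perpendicular direction."""
    row, col = divmod(state, SIZE)
    outcomes = []
    # The agent moves in the intended direction or one of the two perpendicular ones.
    for actual in ((action - 1) % 4, action, (action + 1) % 4):
        d_row, d_col = MOVES[actual]
        new_row = min(max(row + d_row, 0), SIZE - 1)  # walls keep the agent in place
        new_col = min(max(col + d_col, 0), SIZE - 1)
        outcomes.append(new_row * SIZE + new_col)
    return outcomes

print("down :", slip_outcomes(14, 1))
print("right:", slip_outcomes(14, 2))
```

Under these assumptions both actions reach the goal (state 15) in one outcome out of three, but down never moves the agent off the bottom row, whereas right can push it up to state 10.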

High gamma
gamma = 0.9

In [49]:
policy = train(num_iteration=6,theta=0.001,gamma=0.9,slippery=True)
print(policy)
walk(policy,slippery=True)

[2 3 2 3 0 0 0 0 3 1 0 0 0 2 1 0]


High gamma learns exactly the same policy as mid gamma, because the environment is simple and because mid and high gamma behave similarly (both take future rewards into account, only with different weights).


Policy Iteration vs Value Iteration

The performance of both algorithms varies with the discount factor and with how stochastic the environment is.
Policy iteration is faster because it iterates directly over policies.
Value iteration seems to be more robust in stochastic environments.
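
For reference, a minimal value iteration sketch is shown below so the two algorithms can be compared side by side. It runs on a hypothetical 3-state MDP written in the same layout as the FrozenLake transition table (`P[state][action]` is a list of `(probability, next_state, reward, terminated)` tuples); the MDP, `q_value`, and `value_iteration` are illustrative assumptions, not the implementation used for the comparison above.

```python
# Hypothetical 3-state MDP:
# P[state][action] = [(probability, next_state, reward, terminated), ...]
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},  # absorbing goal
}

def q_value(P, value, state, action, gamma):
    """Expected one-step return of taking `action` in `state`."""
    return sum(p * (r + gamma * value[s2]) for p, s2, r, _ in P[state][action])

def value_iteration(P, gamma, theta, max_sweeps=1000):
    value = [0.0] * len(P)
    for _ in range(max_sweeps):
        # Bellman optimality backup: take the best action value in every state.
        new_value = [max(q_value(P, value, s, a, gamma) for a in P[s]) for s in P]
        change = sum(abs(n - o) for n, o in zip(new_value, value))
        value = new_value
        if change <= theta:
            break
    # Extract the greedy policy once, after the values have converged.
    policy = [max(P[s], key=lambda a: q_value(P, value, s, a, gamma)) for s in P]
    return value, policy

value, policy = value_iteration(P, gamma=0.9, theta=1e-6)
print(policy, [round(v, 3) for v in value])
```

Unlike policy iteration, this loop never evaluates an intermediate policy; it only extracts one at the end, which is the main structural difference between the two methods.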