### Value Iteration for Stochastic MDPs

**Value Iteration** is a model-based dynamic programming algorithm that computes the optimal value function $V^*(s)$ directly, without maintaining an explicit policy during the iterations.

**Key Principle**: Apply the Bellman optimality operator repeatedly:
$$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V_k(s')]$$
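
To make the update concrete, the sketch below applies two synchronous Bellman optimality sweeps to a hypothetical two-state MDP. The transition table `P`, its rewards, and the helper name `bellman_backup` are illustrative assumptions and not part of the FrozenLake setup; the table only mimics the `(probability, next_state, reward, terminal)` layout used later in this notebook.

```python
# Hypothetical toy MDP (an assumption for illustration only).
# P[s][a] is a list of (probability, next_state, reward, terminal) tuples.
P = {
    0: {
        0: [(1.0, 0, 0.0, False)],                       # action 0: stay in state 0
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)],  # action 1: usually reach the goal
    },
    1: {
        0: [(1.0, 1, 0.0, True)],  # state 1 is absorbing with zero reward
        1: [(1.0, 1, 0.0, True)],
    },
}


def bellman_backup(V, s, gamma):
    """Return max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    return max(
        sum(p * (r + gamma * V[s_next]) for p, s_next, r, _ in P[s][a])
        for a in P[s]
    )


gamma = 0.9
V0 = [0.0, 0.0]                                 # V_0: all zeros
V1 = [bellman_backup(V0, s, gamma) for s in P]  # first synchronous sweep
V2 = [bellman_backup(V1, s, gamma) for s in P]  # second synchronous sweep
print(V1)
print(V2)
```

After the first sweep, state 0 is worth about 0.8 (the immediate expected reward of action 1). The second sweep raises it to about 0.944, because the 20% of outcomes that stay in state 0 now carry discounted future value.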

**Algorithm**: Initialize $V_0$ → apply the Bellman operator until convergence → extract the optimal policy $\pi^*(s) = \arg\max_a Q^*(s,a)$

#### Environment Setup

This example uses the **stochastic FrozenLake** environment (`is_slippery=True`), where transitions are probabilistic, to show that value iteration handles uncertain dynamics.
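
What "probabilistic transitions" means here can be sketched without the library. The block below assumes the default slip model, in which the agent moves in the intended direction or in either perpendicular direction with probability 1/3 each. The helper names `move` and `slippery_outcomes` are hypothetical, holes and the goal are ignored, and this is not Gymnasium's own code.

```python
# Action encoding used by FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def move(state, action, nrow=4, ncol=4):
    """Deterministic grid move; bumping into a wall leaves the agent in place."""
    row, col = divmod(state, ncol)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, nrow - 1)
    elif action == RIGHT:
        col = min(col + 1, ncol - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row * ncol + col


def slippery_outcomes(state, action):
    """Assumed slip model: intended or either perpendicular direction, 1/3 each."""
    actual_actions = [(action - 1) % 4, action, (action + 1) % 4]
    return [(1 / 3, move(state, a)) for a in actual_actions]


# Choosing Right in the top-left cell can end in state 4 (slipped Down),
# state 1 (moved Right), or state 0 (slipped Up into the wall).
print(slippery_outcomes(0, RIGHT))
```

Under this assumption only a third of the probability mass follows the chosen action, which is why the planner has to reason over expected values rather than single outcomes.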

In [1]:
import gymnasium as gym
import numpy as np

In [2]:
env = gym.make('FrozenLake-v1', map_name="4x4", render_mode="rgb_array", is_slippery=True)

#### Policy Visualization Helper

A utility function that displays the learned policy as a grid, one letter per cell, so the agent's action choices are easy to read.

In [3]:
def print_policy(policy, grid=(4, 4)):
    # FrozenLake action encoding
    action_dict = {0: "Left", 1: "Down", 2: "Right", 3: "Up"}
    policy_print = np.empty(grid, dtype="U1")  # one character per cell
    for idx_h in range(grid[0]):
        for idx_w in range(grid[1]):
            # Row-major index: row * number of columns + column
            index = idx_h * grid[1] + idx_w
            selected_action = action_dict[int(policy[index])]
            # Keep only the first letter (L, D, R, U)
            policy_print[idx_h, idx_w] = selected_action[0]

    print("Policy:")
    print(policy_print)

#### Value Iteration Algorithm

**Core Implementation**: For each state, compute the Q-value of every action and take the maximum. Iterate until $\|V_{k+1} - V_k\|_{\infty} < \epsilon$ or the iteration budget is exhausted.
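
The stopping rule can be tied to an error bound. As a sketch, assume $0 \le \gamma < 1$, so that the Bellman optimality operator is a $\gamma$-contraction in the sup norm. Successive differences then shrink geometrically, and the distance to $V^*$ telescopes:

$$\|V_{k+1} - V^*\|_{\infty} \le \sum_{j=1}^{\infty} \|V_{k+j+1} - V_{k+j}\|_{\infty} \le \sum_{j=1}^{\infty} \gamma^{j}\,\|V_{k+1} - V_k\|_{\infty} = \frac{\gamma}{1-\gamma}\,\|V_{k+1} - V_k\|_{\infty}$$

With the defaults used in the implementation ($\gamma = 0.99$, $\epsilon = 10^{-5}$), stopping on the tolerance therefore leaves the returned values within $\frac{0.99}{0.01} \times 10^{-5} \approx 10^{-3}$ of $V^*$.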

In [4]:
def value_iteration(env, gamma=0.99, num_iterations=1000, tol=1e-5):
    V = np.zeros(env.observation_space.n)

    for _ in range(num_iterations):
        V_k = np.copy(V)

        for s in range(env.observation_space.n):
            Q_s = []
            
            # Compute the Q-value of every action available in state s
            for wanted_action in range(env.action_space.n):
                # Each entry is (probability, next_state, reward, terminal)
                transitions = env.unwrapped.P[s][wanted_action]

                Q_sa = 0
                # Terminal states loop back to themselves with zero reward,
                # so their value stays 0 and the flag is not needed here.
                for probability, s_next, reward, _ in transitions:
                    Q_sa += probability * (reward + gamma * V_k[s_next])

                Q_s.append(Q_sa)

            V[s] = np.max(Q_s)

        if np.max(np.abs(V - V_k)) < tol:
            break

    return V

#### Compute Optimal Value Function

Run value iteration to find $V^*(s)$, the maximum expected discounted return achievable from each state under the optimal policy.

In [5]:
optimal_values = value_iteration(env)

print("Optimal Values")
print(np.array(optimal_values).reshape(4,4))

Optimal Values
[[0.54185998 0.49858161 0.47043461 0.45657012]
 [0.55829709 0.         0.35822941 0.        ]
 [0.59166815 0.64298202 0.6151213  0.        ]
 [0.         0.74165099 0.86280139 0.        ]]


#### Policy Extraction

**Extract optimal policy** from value function: $\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]$
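
The extraction step can be sketched on a small example. The three-state chain `P`, the value estimates `V`, and the helper names `q_value` and `greedy_policy` below are hypothetical and chosen only for illustration; the values are assumed rather than computed by value iteration.

```python
# Hypothetical 3-state chain (an assumption for illustration only).
# P[s][a] is a list of (probability, next_state, reward, terminal) tuples.
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.9, 1, 0.0, False), (0.1, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.9, 2, 1.0, True), (0.1, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)],   # absorbing goal state
        1: [(1.0, 2, 0.0, True)]},
}


def q_value(V, s, a, gamma):
    """One-step lookahead: sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r, _ in P[s][a])


def greedy_policy(V, gamma):
    """Pick, in every state, the action with the largest one-step lookahead value."""
    policy = []
    for s in P:
        q = [q_value(V, s, a, gamma) for a in P[s]]
        policy.append(q.index(max(q)))  # first maximizer, the same tie rule as np.argmax
    return policy


V = [0.8, 0.95, 0.0]  # assumed value estimates, not computed here
print(greedy_policy(V, gamma=0.9))
```

In the absorbing state every action has the same Q-value, so the tie resolves to action 0. The same tie rule applies to `np.argmax`, which is why terminal cells (holes and the goal) show up as `L` in the FrozenLake policy grid.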

In [6]:
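def policy_improvement(values, gamma=0.99):
    # Note: reads the transition model from the module-level `env`.
    # Integer dtype so the entries can be used directly as action indices.
    new_policy = np.zeros(env.observation_space.n, dtype=int)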

    for s in range(env.observation_space.n):
        Q_s = []

        for wanted_action in range(env.action_space.n):
            possible_actions = env.unwrapped.P[s][wanted_action]

            Q_sa = 0
            for probability, s_next, reward, terminal in possible_actions:
                Q_sa += probability * (reward + gamma * values[s_next])

            Q_s.append(Q_sa)

        best_action = np.argmax(Q_s) 

        new_policy[s] = best_action

    return new_policy

In [7]:
optimal_policy = policy_improvement(optimal_values)
print_policy(optimal_policy)

Policy:
[['L' 'U' 'U' 'U']
 ['L' 'L' 'L' 'L']
 ['U' 'D' 'L' 'L']
 ['L' 'R' 'D' 'L']]


#### Performance Evaluation

Test the learned optimal policy in the stochastic environment to measure success rate and validate the effectiveness of value iteration.

In [8]:
num_games = 1000
max_steps = 500

game_success = 0
for _ in range(num_games):
    observation, _ = env.reset()
    
    for _ in range(max_steps):
        action = int(optimal_policy[observation])
        
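        # Gymnasium's step returns (obs, reward, terminated, truncated, info).
        # Only `terminated` ends the episode here; the `truncated` flag is ignored.
        observation, reward, terminated, _, _ = env.step(action)

        if terminated: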
            if reward > 0:
                game_success += 1
            break

proportion_successful = game_success / num_games
print("Proportion of Successful Games:", proportion_successful)

Proportion of Successful Games: 0.82
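
Because the estimate comes from a finite number of stochastic episodes, it helps to attach a rough uncertainty to it. The sketch below assumes the episodes are independent and uses the normal approximation to the binomial; it takes the 820 successes out of 1000 games implied by the output above.

```python
import math

successes, num_games = 820, 1000  # 0.82 success proportion over 1000 episodes
p_hat = successes / num_games

# Standard error of a binomial proportion, assuming independent episodes.
std_err = math.sqrt(p_hat * (1 - p_hat) / num_games)

# Approximate 95% interval from the normal approximation (z = 1.96).
low, high = p_hat - 1.96 * std_err, p_hat + 1.96 * std_err
print(f"{p_hat:.3f} +/- {1.96 * std_err:.3f} -> [{low:.3f}, {high:.3f}]")
```

The standard error is about 0.012, so the measured success rate is consistent with a true rate of roughly 0.80 to 0.84; a different seed or more episodes can shift the reported figure within that range.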
