# 10.2.2 Value Iteration

## Explanation of Value Iteration

Value Iteration is a dynamic-programming algorithm used in reinforcement learning to find the optimal policy for a Markov Decision Process (MDP) whose transitions and rewards are known. Unlike Policy Iteration, which alternates between separate policy evaluation and policy improvement steps, Value Iteration folds both into a single update. For each state, the algorithm computes the expected return of every action (the immediate reward plus the discounted value of the next state) and keeps the maximum. Repeating this update over all states drives the estimates toward the optimal value function. Once the value function has converged, the optimal policy is obtained by choosing, in each state, the action with the highest expected return.
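The update described above is usually written as the Bellman optimality backup. The notation here is an assumption for illustration and does not come from the text: $P(s' \mid s, a)$ is the probability of reaching state $s'$ from state $s$ under action $a$, $R(s, a, s')$ is the reward for that transition, $\gamma$ is the discount factor, and $V_k$ is the value estimate after $k$ sweeps.

$$
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_k(s') \right]
$$

After convergence to $V^*$, a greedy policy can be read off with the same expression:

$$
\pi^*(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]
$$

In a deterministic environment the sum has a single term, so the backup reduces to the reward of the next state plus $\gamma$ times its value.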

## Benefits and Scenarios for Using Value Iteration

- **Efficiency:** Each sweep of Value Iteration is cheaper than an iteration of Policy Iteration because it does not run a full policy evaluation. Policy Iteration usually needs fewer iterations, so which method is faster overall depends on the problem.
- **Convergence:** For a finite MDP with a discount factor below 1, the value estimates are guaranteed to converge to the optimal value function, from which an optimal policy can be derived.
- **Applicability:** Value Iteration is well-suited to problems where the state space is small enough to sweep repeatedly and the transition and reward model is known, such as grid-based environments and other finite-state problems.
- **Flexibility:** The algorithm handles stochastic environments directly by taking an expectation over next states. It can also be extended to partially observable MDPs (POMDPs), although this requires working over belief states and is considerably more expensive.
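To make the point about stochastic environments concrete, here is a minimal sketch of Value Iteration with explicit transition probabilities. The two-state MDP, the table `P`, and the helper names `q_value` and `value_iteration` are hypothetical and chosen only for illustration. The sketch also counts sweeps, since the overall cost is the per-sweep cost times the number of sweeps.

```python
from typing import Dict, List, Tuple

# Hypothetical two-state MDP, used only for illustration.
# P[state][action] lists the possible outcomes as (probability, next_state, reward).
Outcome = Tuple[float, str, float]
P: Dict[str, Dict[str, List[Outcome]]] = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],  # "go" fails 20% of the time
    },
    "s1": {
        "stay": [(1.0, "s1", 2.0)],
        "go": [(1.0, "s0", 0.0)],
    },
}


def q_value(V: Dict[str, float], state: str, action: str, gamma: float) -> float:
    """Expected return of taking `action` in `state`, then following values V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[state][action])


def value_iteration(gamma: float = 0.9, theta: float = 1e-8):
    """Run synchronous sweeps until the largest change is below theta."""
    V = {s: 0.0 for s in P}
    sweeps = 0
    while True:
        sweeps += 1
        new_V = {s: max(q_value(V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < theta:
            return V, sweeps


V, sweeps = value_iteration()
# Greedy policy with respect to the converged values (same gamma as above)
policy = {s: max(P[s], key=lambda a: q_value(V, s, a, 0.9)) for s in P}
print(V, policy, sweeps)
```

Solving the two fixed-point equations by hand gives a value of 20 for `s1` (reward 2 forever, discounted by 0.9) and 15.2 / 0.82 for `s0`, which is what the sweeps approach. This version builds a new dictionary on every sweep (a synchronous update); updating the values in place is an equally valid variant.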

___
___
### Readings:
- [Reinforcement Learning Chapter 4: Dynamic Programming (Part 3 — Value Iteration)](https://medium.com/@numsmt2/reinforcement-learning-chapter-4-dynamic-programming-part-3-value-iteration-6f01f6347813)
- [Markov decision process: value iteration with code implementation](https://medium.com/@ngao7/markov-decision-process-value-iteration-2d161d50a6ff)
- [Reinforcement Learning: an Easy Introduction to Value Iteration](https://towardsdatascience.com/reinforcement-learning-an-easy-introduction-to-value-iteration-e4cfe0731fd5)
- [Value Iteration](https://gibberblot.github.io/rl-notes/single-agent/value-iteration.html)
- [Value Iteration vs. Policy Iteration in Reinforcement Learning](https://www.baeldung.com/cs/ml-value-iteration-vs-policy-iteration)

___
___


## Methods for Implementing Value Iteration

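Below is a basic implementation of Value Iteration in Python for a 3×3 grid with deterministic moves. The agent receives the reward of the cell it enters, and the goal cell (0, 2) carries a reward of 10. Note that (0, 2) is not modeled as a terminal state: a move that would leave the grid keeps the agent in place, so once it reaches the goal it can keep collecting the reward by moving into the wall.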


```python
# Define the environment: a 3x3 grid whose cells are (row, col) tuples
states = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
actions = ['up', 'down', 'left', 'right']
# Reward received on entering each cell; (0, 2) is the goal
rewards = {
    (0, 0): 0, (0, 1): -1, (0, 2): 10,
    (1, 0): -1, (1, 1): -1, (1, 2): -1,
    (2, 0): -1, (2, 1): -1, (2, 2): -1
}
# (row, col) offset applied by each action
transitions = {
    'up': (-1, 0), 'down': (1, 0),
    'left': (0, -1), 'right': (0, 1)
}
gamma = 0.9  # Discount factor
theta = 0.0001  # Stop when the largest update in a sweep falls below this

# Initialize value function
V = {state: 0 for state in states}

def get_next_state(state, action):
    """Return the cell reached by `action`, or `state` itself if the move leaves the grid."""
    next_state = (state[0] + transitions[action][0], state[1] + transitions[action][1])
    return next_state if next_state in states else state

def value_iteration():
    """Sweep over all states, updating V in place, until the values stop changing."""
    while True:
        delta = 0  # Largest change to any state's value in this sweep
        for state in states:
            v = V[state]
            action_values = []
            for action in actions:
                next_state = get_next_state(state, action)
                # Moves are deterministic, so the expectation has a single term
                action_value = rewards[next_state] + gamma * V[next_state]
                action_values.append(action_value)
            V[state] = max(action_values)
            delta = max(delta, abs(v - V[state]))
        if delta < theta:
            break

def get_policy():
    """Return the greedy policy with respect to the current V."""
    policy = {}
    for state in states:
        action_values = {}
        for action in actions:
            next_state = get_next_state(state, action)
            action_values[action] = rewards[next_state] + gamma * V[next_state]
        # On ties, max() returns the first action in insertion order
        policy[state] = max(action_values, key=action_values.get)
    return policy

# Perform Value Iteration
value_iteration()
optimal_policy = get_policy()

# The goal is not terminal, so V[(0, 2)] approaches 10 / (1 - gamma) = 100
print("Optimal Value Function:")
for state in states:
    print(f"State {state}: {V[state]}")

print("\nOptimal Policy:")
for state in states:
    print(f"State {state}: {optimal_policy[state]}")
```


```text
Optimal Value Function:
State (0, 0): 88.99916647515822
State (0, 1): 99.99916647515822
State (0, 2): 99.99916647515822
State (1, 0): 80.0992498276424
State (1, 1): 88.9992498276424
State (1, 2): 99.9992498276424
State (2, 0): 71.08932484487816
State (2, 1): 79.09932484487817
State (2, 2): 88.99932484487816

Optimal Policy:
State (0, 0): right
State (0, 1): right
State (0, 2): up
State (1, 0): up
State (1, 1): right
State (1, 2): up
State (2, 0): up
State (2, 1): right
State (2, 2): up
```
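The values close to 100 in the output above follow from the goal not being terminal: an agent at (0, 2) can collect the reward of 10 on every step, and 10 + 0.9 · 10 + 0.9² · 10 + … = 10 / (1 − 0.9) = 100. If the goal is meant to end the episode, one possible variant is sketched below. It is an assumption about the intended model, not part of the original example; the names `GOAL`, `moves`, `step`, and `backup` are introduced here for illustration. The only substantive change is that the goal's value is held at 0 and never updated.

```python
# Sketch: the same 3x3 grid and rewards, with the goal treated as terminal.
GOAL = (0, 2)  # Assumption: entering this cell ends the episode

states = [(r, c) for r in range(3) for c in range(3)]
moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

# Same rewards as the original example: 0 at (0, 0), 10 at the goal, -1 elsewhere
rewards = {s: -1 for s in states}
rewards[(0, 0)] = 0
rewards[GOAL] = 10

gamma = 0.9   # Discount factor
theta = 1e-4  # Stopping threshold


def step(state, action):
    """Deterministic move; leaving the grid keeps the agent in place."""
    dr, dc = moves[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt if nxt in rewards else state


def backup(V, state, action):
    """Reward for entering the next cell plus its discounted value."""
    nxt = step(state, action)
    return rewards[nxt] + gamma * V[nxt]


V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        if s == GOAL:
            continue  # Terminal state: no future reward, value stays 0
        best = max(backup(V, s, a) for a in moves)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

# Greedy policy for the non-terminal states
policy = {s: max(moves, key=lambda a: backup(V, s, a)) for s in states if s != GOAL}

for s in states:
    print(s, round(V[s], 2), policy.get(s, 'terminal'))
```

With a terminal goal, the cells next to (0, 2) are worth 10, the reward for the single step into the goal, and values fall off with distance instead of clustering near 100.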


## Conclusion

Value Iteration is a fundamental method in reinforcement learning that iteratively improves the value estimate of each state until it converges to the optimal value function. By evaluating the expected return of each possible action and keeping the maximum, it provides a way to determine the optimal policy for an agent in a given environment. The method is most useful in environments with a finite number of states and actions and a known model, where the goal is to maximize cumulative discounted reward. As the grid example shows, the result depends on how the environment is specified, including whether the goal state is terminal, so the model deserves as much care as the algorithm.
