## 10.2.1 Policy Iteration

### Explanation of Policy Iteration

Policy Iteration is a fundamental algorithm in Reinforcement Learning for solving **Markov Decision Processes (MDPs)** whose transition probabilities and rewards are known. It alternates between two steps, Policy Evaluation and Policy Improvement. The goal is to find an optimal policy, one that maximizes the expected cumulative (discounted) reward for an agent interacting with the environment.

1. **Policy Evaluation:** Given a policy (a mapping from states to actions), this step calculates the value function: the expected return (cumulative discounted reward) obtained by starting in each state and following the current policy from there.

2. **Policy Improvement:** After evaluating the policy, this step makes the policy greedy with respect to the value function. In each state it selects the action whose immediate reward plus discounted value of the next state is largest. The new policy is guaranteed to be at least as good as the old one in every state.

The algorithm alternates between these two steps until the policy converges to an optimal one, where no further improvements can be made.
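
In symbols, the two steps can be written as follows. The notation is an assumption made here for illustration, since the text does not fix one: $P(s' \mid s, a)$ is the transition probability, $R(s, a)$ the immediate reward, $\gamma \in [0, 1)$ the discount factor, and $\pi(s)$ the action a deterministic policy picks in state $s$.

$$
V^{\pi}(s) = \sum_{s'} P\big(s' \mid s, \pi(s)\big)\,\Big[R\big(s, \pi(s)\big) + \gamma\, V^{\pi}(s')\Big]
$$

$$
\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\Big[R(s, a) + \gamma\, V^{\pi}(s')\Big]
$$

The first equation is the Bellman expectation equation that Policy Evaluation solves for $V^{\pi}$. The second is the greedy update performed by Policy Improvement. The algorithm stops when $\pi' = \pi$.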

### Benefits and Use Cases of Policy Iteration

- **Guaranteed Convergence:** Policy Iteration is guaranteed to converge to an optimal policy for finite MDPs, making it a reliable algorithm for finding the best strategy in environments with known dynamics.
  
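- **Efficiency:** Compared to value iteration, Policy Iteration often requires fewer iterations to converge, because the greedy policy typically stops changing long before the value function has fully converged. Each iteration is more expensive, however, since it includes a full policy evaluation.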
  
- **Applications:** Policy Iteration is used in various fields, including robotics, game playing, and resource management, where the environment can be modeled as an MDP, and the goal is to find an optimal decision-making strategy.
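
To make the iteration-count comparison concrete, here is a minimal sketch that runs both algorithms on a hypothetical two-state MDP. The arrays `P` and `R`, the function names, and the use of an exact linear solve for policy evaluation are all assumptions made for this illustration, not part of the example in this section.

```python
import numpy as np

# Hypothetical two-state, two-action MDP (illustration only).
# P[s, a, s2] is the probability of moving from s to s2 under action a.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[1.0, 0.0], [0.0, 1.0]],  # state 1: action 0 moves to state 0, action 1 stays
])
R = np.array([[0.0, 1.0], [0.0, 2.0]])  # R[s, a]: immediate reward
gamma, theta = 0.9, 1e-6

def value_iteration(P, R, gamma, theta):
    """Return the converged values and the number of sweeps performed."""
    V = np.zeros(P.shape[0])
    sweeps = 0
    while True:
        sweeps += 1
        V_new = (R + gamma * P @ V).max(axis=1)  # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, sweeps
        V = V_new

def policy_iteration(P, R, gamma):
    """Return the values of the final policy and the number of improvement rounds."""
    n = P.shape[0]
    idx = np.arange(n)
    policy = np.zeros(n, dtype=int)
    rounds = 0
    while True:
        rounds += 1
        # Exact evaluation: solve (I - gamma * P_pi) V = R_pi.
        V = np.linalg.solve(np.eye(n) - gamma * P[idx, policy], R[idx, policy])
        new_policy = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, rounds
        policy = new_policy

v_vi, vi_sweeps = value_iteration(P, R, gamma, theta)
v_pi, pi_rounds = policy_iteration(P, R, gamma)
print("value iteration sweeps:", vi_sweeps)
print("policy iteration rounds:", pi_rounds)
```

On this toy problem value iteration needs more than a hundred sweeps to meet the tolerance, while policy iteration stops after two rounds. The counts are not directly comparable as costs, since each policy iteration round here includes a linear solve.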

### Methods for Implementing Policy Iteration

To implement Policy Iteration, follow these steps:

1. **Initialization:**
   - Start with an arbitrary policy and initialize the value function for all states to zero or random values.

2. **Policy Evaluation:**
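   - Sweep over all states and replace each state's value with the expected return of the action the current policy prescribes. Repeat the sweeps until the largest change in any state's value falls below a small threshold.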

3. **Policy Improvement:**
   - For each state, update the policy by selecting the action that maximizes the expected return based on the current value function.

4. **Convergence:**
   - Repeat the Policy Evaluation and Policy Improvement steps until the policy converges and no further improvements can be made.
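
The four steps above can be sketched as a single vectorized NumPy function. This is a minimal sketch under stated assumptions: the function name `policy_iteration`, the array layout `P[s, a, s2]` and `R[s, a]`, and the three-state "wait/advance" MDP are hypothetical and chosen only for illustration.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Policy iteration with iterative policy evaluation.

    P[s, a, s2]: transition probabilities; R[s, a]: immediate rewards.
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    # Step 1: arbitrary initial policy and zero value function.
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Step 2: repeat the expectation backup until V stops changing.
        P_pi, R_pi = P[idx, policy], R[idx, policy]
        while True:
            V_new = R_pi + gamma * P_pi @ V
            converged = np.max(np.abs(V_new - V)) < theta
            V = V_new
            if converged:
                break
        # Step 3: choose the greedy action from the one-step lookahead values.
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=1)
        # Step 4: stop once the greedy policy equals the current policy.
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Hypothetical MDP: action 0 = wait (stay put), action 1 = advance (0 -> 1 -> 2 -> 0).
# Waiting in state 2 pays 1; every other state-action pair pays 0.
P = np.array([
    [[1, 0, 0], [0, 1, 0]],
    [[0, 1, 0], [0, 0, 1]],
    [[0, 0, 1], [1, 0, 0]],
], dtype=float)
R = np.array([[0, 0], [0, 0], [1, 0]], dtype=float)

best_policy, best_values = policy_iteration(P, R, gamma=0.5)
print(best_policy, best_values)
```

With `gamma=0.5` the agent should advance from states 0 and 1 and then wait in state 2, giving values close to 0.5, 1 and 2. The value function is carried over between rounds as a warm start, which is a design choice rather than a requirement.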

The Python code at the end of this section implements these steps for a simple four-state MDP.


___
___
### Readings:
- [Finite Markov Decision Processes](https://medium.com/towards-data-science/introduction-to-reinforcement-learning-rl-part-3-finite-markov-decision-processes-51e1f8d3ddb7)
- [Dynamic Programming](https://medium.com/towards-data-science/introduction-to-reinforcement-learning-rl-part-4-dynamic-programming-6af57e575b3d)
- [Policy Iteration in RL: A step by step Illustration](https://towardsdatascience.com/policy-iteration-in-rl-an-illustration-6d58bdcb87a7)
- [Policy Iteration — Easy Example](https://medium.com/@pesupavish/policy-iteration-easy-example-d3fd1eb98c6c)
- [Reinforcement Learning Chapter 4: Dynamic Programming \(Part 1 — Policy Iteration\)](https://medium.com/@numsmt2/reinforcement-learning-chapter-4-dynamic-programming-part-1-policy-iteration-2a1f66a5ca42)
- [Markov decision process: policy iteration with code implementation](https://medium.com/@ngao7/markov-decision-process-policy-iteration-42d35ee87c82)
- [Policy iteration](https://gibberblot.github.io/rl-notes/single-agent/policy-iteration.html)
___
___

In [1]:
import numpy as np

# Define the MDP environment: a four-state chain in which state 3 is terminal
states = [0, 1, 2, 3]  # States
actions = [0, 1]       # Actions: 0 = left, 1 = right
# rewards[s, a]: -1 for moving left in state 0 (the agent stays in place),
# +1 for moving right from state 2 into the terminal state
rewards = np.array([[-1, 0], [0, 0], [0, 1], [0, 0]])  # Reward matrix
# transition_probs[s, a, s_prime]: probability of reaching s_prime after action a in state s
transition_probs = np.array([
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],  # State 0
    [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],  # State 1
    [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],  # State 2
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]   # State 3 (Terminal): no transitions, so its value stays 0
])

# Initialize policy and value function
policy = np.zeros(len(states), dtype=int)
value_function = np.zeros(len(states))
gamma = 0.9  # Discount factor
theta = 1e-6  # Convergence threshold

def policy_evaluation(policy, value_function, gamma, theta):
    """Update value_function in place until it matches the given policy."""
    while True:
        delta = 0
        for s in states:
            v = value_function[s]
            action = policy[s]
            # Expected return of following the policy's action from state s
            value_function[s] = sum(transition_probs[s, action, s_prime] *
                                    (rewards[s, action] + gamma * value_function[s_prime])
                                    for s_prime in states)
            delta = max(delta, abs(v - value_function[s]))
        # Stop when no state's value changed by more than theta in this sweep
        if delta < theta:
            break
    return value_function

def policy_improvement(policy, value_function, gamma):
    """Make the policy greedy with respect to value_function (in place)."""
    policy_stable = True
    for s in states:
        old_action = policy[s]
        action_values = np.zeros(len(actions))
        for a in actions:
            # One-step lookahead value of taking action a in state s
            action_values[a] = sum(transition_probs[s, a, s_prime] *
                                   (rewards[s, a] + gamma * value_function[s_prime])
                                   for s_prime in states)
        policy[s] = np.argmax(action_values)  # Ties go to the lowest action index
        if old_action != policy[s]:
            policy_stable = False
    return policy, policy_stable

# Policy Iteration Algorithm
while True:
    value_function = policy_evaluation(policy, value_function, gamma, theta)
    policy, policy_stable = policy_improvement(policy, value_function, gamma)
    if policy_stable:
        break

# State 3 is terminal, so its policy entry is arbitrary
print("Optimal Policy:", policy)
print("Optimal Value Function:", value_function)


Optimal Policy: [1 1 1 0]
Optimal Value Function: [0.81 0.9  1.   0.  ]
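
One way to sanity-check a reported optimal value function is to measure how far it is from satisfying the Bellman optimality equation. The sketch below assumes a model stored as `P[s, a, s2]` and `R[s, a]` in which every row of `P` sums to one. The helper name `bellman_optimality_residual` and the two-state MDP are hypothetical and used only for illustration.

```python
import numpy as np

def bellman_optimality_residual(P, R, V, gamma):
    """Largest violation of V(s) = max_a [R(s, a) + gamma * sum_s2 P(s2 | s, a) * V(s2)]."""
    Q = R + gamma * P @ V  # one-step lookahead values, shape (states, actions)
    return float(np.max(np.abs(Q.max(axis=1) - V)))

# Hypothetical two-state MDP: action 0 leads to state 0, action 1 leads to state 1.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
])
R = np.array([[0.0, 1.0], [0.0, 2.0]])
gamma = 0.9

residual_optimal = bellman_optimality_residual(P, R, np.array([19.0, 20.0]), gamma)
residual_zero = bellman_optimality_residual(P, R, np.zeros(2), gamma)
print(residual_optimal, residual_zero)
```

A residual near zero indicates that the candidate values are consistent with optimality for the given model. A large residual, as with the all-zero guess here, means either the values or the model specification deserve a second look.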


## Conclusion

In this section, we explored the concept of Policy Iteration in Reinforcement Learning, a dynamic programming method used to find the optimal policy in Markov Decision Processes (MDPs). By alternating between policy evaluation and policy improvement, Policy Iteration efficiently converges to an optimal policy that maximizes the expected cumulative reward. We also implemented a basic example to illustrate how this method can be applied in practice. This foundational understanding of Policy Iteration provides a strong basis for more advanced reinforcement learning techniques.
