# Title: **Write a Program to Implement Value and Policy Iteration in a Grid World**

## Objective: To develop a Python program that demonstrates value and policy iteration techniques in a simple grid world environment, illustrating key concepts of reinforcement learning.

### Theory
In reinforcement learning, value iteration and policy iteration are two fundamental dynamic programming methods for computing an optimal policy when a model of the environment is available as a Markov Decision Process (MDP), that is, when the transition probabilities and rewards are known.

**Value Iteration** computes an optimal policy by repeatedly improving an estimate of the value function. In each sweep, the value of every state is replaced by the highest expected return over the actions available in that state, and the sweeps stop once no value changes by more than a small threshold.
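As a sketch of that update in symbols, assume $P(s' \mid s, a)$ stands for `transition[s][a][s_prime]`, $R(s')$ for `reward[s_prime]`, $\gamma$ for `gamma`, and $\theta$ for `threshold` in the program; this notation is introduced here for illustration. The update and stopping rule can then be written as:

$$
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s') + \gamma\, V_k(s')\bigr],
\qquad \text{stop when } \max_{s} \bigl|V_{k+1}(s) - V_k(s)\bigr| < \theta .
$$

This is the synchronous form, in which every state is updated from the previous sweep's values. The program updates `V` in place, so later states in a sweep already see the new values of earlier ones; both variants approach the same fixed point.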

**Policy Iteration** alternates between two steps: policy evaluation, where the value of each state under the current policy is calculated, and policy improvement, where the policy is made greedy with respect to those values. The two steps repeat until the improvement step no longer changes the policy.
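The two steps can be sketched in symbols as follows, assuming $\pi(s)$ denotes the action the current policy picks in state $s$, and $P$, $R$, $\gamma$ stand for `transition`, `reward`, and `gamma` in the program (notation introduced here for illustration):

$$
\text{Evaluation:}\quad V^{\pi}(s) = \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\,\bigl[R(s') + \gamma\, V^{\pi}(s')\bigr]
$$

$$
\text{Improvement:}\quad \pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s') + \gamma\, V^{\pi}(s')\bigr]
$$

The evaluation equation is linear in $V^{\pi}$, so it can be solved either exactly or by repeated sweeps; the program takes the second route.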

### Materials/Tools Required
- Python 3.x installed on a computer
- Python libraries: `numpy`
- Text editor or Integrated Development Environment (IDE) such as PyCharm, Visual Studio Code, or Jupyter Notebook

### Procedure
1. Install the required Python library using pip:
   ```bash
   pip install numpy
   ```
2. Open your Python development environment.
3. Type the provided code into the editor.
4. Save the file with a `.py` extension, for example, `grid_world_iterations.py`.
5. Run the program and observe how value and policy iteration algorithms converge to the optimal policy.

### Python Program Code

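```python
import numpy as np


def value_iteration(states, actions, transition, reward, gamma=0.9, threshold=0.01):
    """Estimate the optimal state values by repeated Bellman optimality updates."""
    V = np.zeros(len(states))
    while True:
        delta = 0
        for s in states:
            v = V[s]
            # Best expected return over all actions; the reward is received on entering s_prime
            V[s] = max(sum(transition[s][a][s_prime] * (reward[s_prime] + gamma * V[s_prime])
                           for s_prime in states) for a in actions)
            delta = max(delta, abs(v - V[s]))
        # Stop once no state value changed by threshold or more during the sweep
        if delta < threshold:
            break
    return V


def policy_iteration(states, actions, transition, reward, gamma=0.9, threshold=0.01):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(len(states), dtype=int)  # start with action 0 in every state
    V = np.zeros(len(states))
    while True:
        # Policy evaluation: sweep until the values of the current policy settle
        while True:
            delta = 0
            for s in states:
                v = V[s]
                V[s] = sum(transition[s][policy[s]][s_prime] * (reward[s_prime] + gamma * V[s_prime])
                           for s_prime in states)
                delta = max(delta, abs(v - V[s]))
            if delta < threshold:
                break

        # Policy improvement: act greedily with respect to the evaluated values
        policy_stable = True
        for s in states:
            old_action = policy[s]
            policy[s] = np.argmax([sum(transition[s][a][s_prime] * (reward[s_prime] + gamma * V[s_prime])
                                       for s_prime in states) for a in actions])
            if old_action != policy[s]:
                policy_stable = False

        if policy_stable:
            break
    return policy, V


# Example setup for a simple 2x2 grid world, with states numbered row by row:
#   0 1
#   2 3   (state 3 is the goal)
states = np.arange(4)  # 0, 1, 2, 3
actions = [0, 1]  # 0: right, 1: down (simplified for example)
transition = np.zeros((4, 2, 4))  # transition[s, a, s_prime] = P(s_prime | s, a)
reward = np.array([-1, -1, -1, 10])  # reward received on entering each state

# Deterministic moves. Assumption: a move into a wall leaves the agent in place,
# so that every (state, action) pair has a defined outcome.
transition[0, 0, 1] = 1  # right from 0 leads to 1
transition[0, 1, 2] = 1  # down from 0 leads to 2
transition[1, 0, 1] = 1  # right from 1 hits the wall
transition[1, 1, 3] = 1  # down from 1 reaches the goal
transition[2, 0, 3] = 1  # right from 2 reaches the goal
transition[2, 1, 2] = 1  # down from 2 hits the wall
transition[3, :, 3] = 1  # the goal keeps the agent in place for both actions

# Each (state, action) pair must define a probability distribution over next states
assert np.allclose(transition.sum(axis=2), 1.0)

gamma = 0.9  # Discount factor

# Run value iteration
values = value_iteration(states, actions, transition, reward, gamma)
print("Value Function:")
print(values)

# Run policy iteration
policy, values = policy_iteration(states, actions, transition, reward, gamma)
print("Optimal Policy and Value Function:")
print("Policy:", policy)
print("Values:", values)
```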
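`value_iteration` returns only the state values, so reading off the corresponding policy takes one more greedy step. The following is a minimal sketch of that step, not part of the original program: the helper name `greedy_policy` is hypothetical, and the grid it runs on is an assumed, fully specified 2x2 layout (action 0 = right, action 1 = down, walls keep the agent in place).

```python
import numpy as np


def greedy_policy(V, transition, reward, gamma=0.9):
    """Pick, in every state, the action with the best one-step lookahead."""
    # q[s, a] = sum over s_prime of P(s_prime | s, a) * (reward[s_prime] + gamma * V[s_prime])
    q = transition @ (reward + gamma * V)
    return np.argmax(q, axis=1)  # ties go to the lowest-numbered action


# Assumed 2x2 grid, states numbered row by row, state 3 is the goal
transition = np.zeros((4, 2, 4))
transition[0, 0, 1] = 1  # right from 0 leads to 1
transition[0, 1, 2] = 1  # down from 0 leads to 2
transition[1, 0, 1] = 1  # right from 1 hits the wall
transition[1, 1, 3] = 1  # down from 1 reaches the goal
transition[2, 0, 3] = 1  # right from 2 reaches the goal
transition[2, 1, 2] = 1  # down from 2 hits the wall
transition[3, :, 3] = 1  # the goal keeps the agent in place
reward = np.array([-1, -1, -1, 10])

# Limiting values for this assumed grid with gamma = 0.9
V = np.array([89.0, 100.0, 100.0, 100.0])
policy = greedy_policy(V, transition, reward, gamma=0.9)
print("Greedy policy:", policy)
```

In states where both actions are equally good, such as the top-left cell of this grid, `np.argmax` simply reports the first one, so the printed policy is one of several optimal policies.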

### Observations
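- Compare the value function printed by value iteration with the one printed by policy iteration: both should approach the same optimal values, up to a small error controlled by the stopping threshold.
- Note the differences in convergence speed and computational cost: value iteration typically needs many inexpensive sweeps, whereas policy iteration typically needs only a few improvement steps, each of which requires a full policy evaluation.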
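
One way to make the comparison concrete is to count iterations. The sketch below is an illustration under stated assumptions rather than the program above: the helper names (`build_grid`, `value_iteration_sweeps`, `policy_iteration_rounds`) are hypothetical, the grid is an assumed, fully specified 2x2 layout, the value iteration sweeps are synchronous, and each policy is evaluated exactly with a linear solve instead of the iterative sweeps used in the program.

```python
import numpy as np


def build_grid():
    """Assumed 2x2 grid: action 0 = right, action 1 = down; state 3 is the goal."""
    P = np.zeros((4, 2, 4))      # P[s, a, s_prime]
    P[0, 0, 1] = P[0, 1, 2] = 1  # from the top-left cell: right to 1, down to 2
    P[1, 0, 1] = P[1, 1, 3] = 1  # right hits the wall, down reaches the goal
    P[2, 0, 3] = P[2, 1, 2] = 1  # right reaches the goal, down hits the wall
    P[3, :, 3] = 1               # the goal keeps the agent in place
    R = np.array([-1.0, -1.0, -1.0, 10.0])  # reward on entering each state
    return P, R


def value_iteration_sweeps(P, R, gamma, threshold):
    """Count synchronous sweeps until the largest value change drops below threshold."""
    V = np.zeros(P.shape[0])
    sweeps = 0
    while True:
        V_new = (P @ (R + gamma * V)).max(axis=1)
        sweeps += 1
        if np.max(np.abs(V_new - V)) < threshold:
            return sweeps
        V = V_new


def policy_iteration_rounds(P, R, gamma):
    """Count evaluate-then-improve rounds, evaluating each policy exactly."""
    n = P.shape[0]
    policy = np.zeros(n, dtype=int)
    rounds = 0
    while True:
        P_pi = P[np.arange(n), policy]  # (n, n) transitions under the policy
        # Solve (I - gamma * P_pi) V = P_pi R for the policy's values
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, P_pi @ R)
        q = P @ (R + gamma * V)         # q[s, a], one-step lookahead
        rounds += 1
        if np.allclose(q.max(axis=1), V):  # no action improves on the policy
            return rounds
        policy = q.argmax(axis=1)


P, R = build_grid()
vi_loose = value_iteration_sweeps(P, R, gamma=0.9, threshold=1e-2)
vi_tight = value_iteration_sweeps(P, R, gamma=0.9, threshold=1e-4)
pi_rounds = policy_iteration_rounds(P, R, gamma=0.9)
print("Value iteration sweeps:", vi_loose, "(threshold 1e-2),", vi_tight, "(threshold 1e-4)")
print("Policy iteration rounds:", pi_rounds)
```

The counts are not directly comparable in cost: a policy iteration round here solves a linear system over all states, while a value iteration sweep is a single lookahead per state.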

### Conclusion
Value and policy iteration are powerful algorithms for solving decision-making problems modeled as MDPs in reinforcement learning. When the model is known, they provide systematic ways to calculate optimal policies, that is, policies that maximize the expected discounted cumulative reward.

### Applications
- Robotics navigation and path planning
- Strategic game playing
- Resource management and allocation