<a href="https://colab.research.google.com/github/omaremad02/Markov-Decision-Process/blob/main/Value_iteration_and_Policy_Iteration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook includes implementations of the following algorithms:**


*   Value Iteration Algorithm
*   Policy Iteration Algorithm

The notebook also includes a test gridworld game where the two algorithms are implemented to extract the optimal policy using value iteration and policy iteration.

Below is the commented implementation where each section is in a seperate notebook.


In [1]:
import numpy as np

**Grid world class representing the dynamics of the grid (enviroment) including the following:**


*   Grid size
*   Immediate rewards
*   Possible actions
*   States of the game which are indeed the cells in the grid





In [53]:
class gridworld:

  def __init__(self, grid_size):
    self.grid_size = grid_size
    rewards = -(np.ones((grid_size, grid_size)))
    rewards[0,2] = 10
    self.rewards = rewards
    self.actions = ["UP", "DOWN", "LEFT", "RIGHT"]
    self.action_prob = {"UP": (0.8, 0.1, 0.1), "DOWN": (0.8, 0.1, 0.1),
               "LEFT": (0.8, 0.1, 0.1), "RIGHT": (0.8, 0.1, 0.1)}

  def next_state(self, state, action):
    x, y = state
    if action == 'UP':
        return [(x-1, y), (x, y-1), (x, y+1)]
    elif action == 'DOWN':
        return [(x+1, y), (x, y-1), (x, y+1)]
    elif action == 'LEFT':
        return [(x, y-1), (x-1, y), (x+1, y)]
    elif action == 'RIGHT':
        return [(x, y+1), (x-1, y), (x+1, y)]
    return [state, state, state]

  def is_valid(self, state):
      x,y = state
      return 0 <= x < self.grid_size and 0 <= y < self.grid_size

  def get_terminal_states(self):
    return [(0,0), (0,2)]

  def is_terminal_state(self, state):
    x,y = state
    return (x == 0 and y == 2) or (x == 0 and y == 0)

**In this section, the class agent algorithms is implemented including the following:**


*   Value Iteration: finds the optimal value function for each state
*   Policy Extraction: finds the optimal policy based on the output of the value iteration.
*   Policy Iteration



In [58]:
class agent_algorithms:
  def __init__(self, grid: gridworld):
    self.grid = grid
    self.discount_factor = 0.99 # takes the future highly in consideration.


  def value_iteration(self, reward):
    state_values = np.zeros((3,3)) #initializng the value function to zero.
    state_values[0,0] = reward
    state_values[0,2] = 10
    while True:
      delta = 0
      for row in range(0,3):
        for col in range(0,3):
          max_value = float("-inf")
          if (row,col) in self.grid.get_terminal_states():
            continue
          for action in self.grid.actions:
            value = 0
            for prob, new_state in zip(self.grid.action_prob[action], self.grid.next_state((row,col), action)):
              x1,y1 = new_state
              if self.grid.is_valid(new_state):
                value += prob * self.discount_factor * state_values[x1,y1]
              else:
                value += prob * self.discount_factor * state_values[row,col]
              value += self.grid.rewards[row,col]
            if value > max_value:
              state_values[row,col] = value
              delta = max(delta, abs(value - state_values[row, col]))
      if delta < 1e-4:
            break
    return state_values

  def extract_policy(self, state_values):
    policy = np.empty((3, 3), dtype=str)
    policy[0,0] = '-'
    policy[0,2] = '-'
    for row in range(3):
        for col in range(3):
            max_value = float("-inf")
            if (row,col) in self.grid.get_terminal_states():
              continue
            best_action = None
            for action in self.grid.actions:
                value = 0
                for prob, new_state in zip(self.grid.action_prob[action], self.grid.next_state((row,col), action)):
                    x1, y1 = new_state
                    if self.grid.is_valid(new_state):
                        value += prob * (self.discount_factor * state_values[x1, y1])
                    else:
                        value += prob * (self.discount_factor * state_values[row, col])
                    value+= self.grid.rewards[row, col]
                if value > max_value:
                    max_value = value
                    best_action = action
            policy[row, col] = best_action
    return policy

**In this section**, The test cases are implemented using different values for the variable reward r and a discount factor = 0.99 (The future is highly accounted in the calculation)

In [59]:
reward_list = [100,3,0,-3]
grid = gridworld(grid_size= 3)
for i in range(4):
  grid.rewards[0,0] = reward_list[i]
  agent = agent_algorithms(grid)
  print(f"State values when r = {reward_list[i]}")
  result = agent.value_iteration(reward_list[i])
  policy = agent.extract_policy(result)
  print(result)
  print()
  print(policy)
  print()
  print()

State values when r = 100
[[100.          12.54112989  10.        ]
 [  6.9         -1.75842814  -4.70491867]
 [ -2.61718107  -3.76157803 -10.06363525]]

[['-' 'L' '-']
 ['U' 'U' 'U']
 ['U' 'L' 'L']]


State values when r = 3
[[  3.           4.84143489  10.        ]
 [ -2.703       -2.52069795  -5.18306308]
 [ -4.25975138  -3.98084676 -10.34126376]]

[['-' 'R' '-']
 ['U' 'U' 'U']
 ['U' 'U' 'L']]


State values when r = 0
[[  0.           4.60329999  10.        ]
 [ -3.          -2.5442733   -5.19785105]
 [ -4.31055252  -3.98762826 -10.34985021]]

[['-' 'R' '-']
 ['U' 'U' 'U']
 ['U' 'U' 'L']]


State values when r = -3
[[ -3.           4.36516509  10.        ]
 [ -3.297       -2.56784866  -5.21263902]
 [ -4.36135366  -3.99440977 -10.35843665]]

[['-' 'R' '-']
 ['R' 'U' 'U']
 ['U' 'U' 'L']]


