
4) Implement a Python program to solve a simple 5x5 grid world environment using
dynamic programming techniques. The environment consists of an agent that
starts at the top-left corner and aims to reach the bottom-right cell, with possible
action set {up, down, left, right}. Define the transition model and reward
structure (reward of +1 for reaching the goal, 0 otherwise). Apply value iteration
and policy iteration separately to compute the optimal value function for the
agent to reach the goal.
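
Because every move in this grid is deterministic, the backup that value iteration repeats can be written compactly. The notation here is an assumption introduced for illustration: $T(s, a)$ is the next cell after clamping to the grid, $R(s')$ is 1 when $s'$ is the goal and 0 otherwise, and $\gamma$ is the discount factor. The goal is treated as terminal, so its value is held at 0.

```latex
V_{k+1}(s) = \max_{a \in \{\text{up},\,\text{down},\,\text{left},\,\text{right}\}}
  \Bigl[ R\bigl(T(s,a)\bigr) + \gamma \, V_k\bigl(T(s,a)\bigr) \Bigr],
\qquad V_k(s_{\text{goal}}) = 0 .
```

Policy iteration uses the same bracketed term, but first evaluates it for one fixed action per state and only then takes the maximum to improve the policy.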

In [10]:
import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

dimension = 5
actions = ['up', 'down', 'left', 'right']

# Hyperparameters: discount factor, goal / non-goal rewards, number of sweeps
gamma = 0.9
rewardValue_g = 1
rewardValue_n = 0
numIterations = 1000

initial_state = (0, 0)
goal_state = (dimension - 1, dimension - 1)



def get_reward(state):
  if state == goal_state:
    return rewardValue_g
  else:
    return rewardValue_n

def transition(state, action):
  row, col = state
  if action == 'up':
    next_row, next_col = row - 1, col
  elif action == 'down':
    next_row, next_col = row + 1, col
  elif action == 'left':
    next_row, next_col = row, col - 1
  elif action == 'right':
    next_row, next_col = row, col + 1
  else:
    next_row, next_col = row, col # Stay in the same state for invalid actions

  # Clamp to the grid: a move into a wall leaves the agent where it is
  next_row = max(0, min(next_row, dimension - 1))
  next_col = max(0, min(next_col, dimension - 1))

  return (next_row, next_col)

def value_iteration():
  value_function = np.zeros((dimension, dimension))
  # dtype=str gives one-character cells, so only the first letter of each action is kept
  policy = np.zeros((dimension, dimension), dtype=str)

  for _ in range(numIterations):
      updated_value_function = np.copy(value_function)
      for row in range(dimension):
          for col in range(dimension):
              state = (row, col)
              if state == goal_state:
                  continue  # the goal is terminal, so its value stays 0
              value_actions = []
              for action in actions:
                  next_state = transition(state, action)
                  reward = get_reward(next_state)
                  value_action = reward + gamma * value_function[next_state]
                  value_actions.append(value_action)
              updated_value_function[state] = max(value_actions)
      value_function = updated_value_function

  # Derive policy from value function
  for row in range(dimension):
      for col in range(dimension):
          state = (row, col)
          if state == goal_state:
              continue
          value_actions = []
          for action in actions:
              next_state = transition(state, action)
              reward = get_reward(next_state)
              value_action = reward + gamma * value_function[next_state]
              value_actions.append(value_action)
          best_action_index = value_actions.index(max(value_actions))
          policy[state] = actions[best_action_index]

  return value_function, policy


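def policy_iteration():
  value_function = np.zeros((dimension, dimension))
  # Current policy stored as indices into `actions`; start with 'up' everywhere
  policy_index = np.zeros((dimension, dimension), dtype=int)
  states = [(row, col) for row in range(dimension) for col in range(dimension)
            if (row, col) != goal_state]  # the goal is terminal and keeps value 0

  for _ in range(numIterations):
    # Policy evaluation: back up values under the fixed current policy until they settle
    for _ in range(numIterations):
      updated_value_function = np.copy(value_function)
      for state in states:
        next_state = transition(state, actions[policy_index[state]])
        updated_value_function[state] = get_reward(next_state) + gamma * value_function[next_state]
      delta = np.max(np.abs(updated_value_function - value_function))
      value_function = updated_value_function
      if delta < 1e-12:
        break

    # Policy improvement: act greedily with respect to the evaluated values
    policy_stable = True
    for state in states:
      value_actions = []
      for action in actions:
        next_state = transition(state, action)
        value_actions.append(get_reward(next_state) + gamma * value_function[next_state])
      best_action_index = value_actions.index(max(value_actions))
      if best_action_index != policy_index[state]:
        policy_index[state] = best_action_index
        policy_stable = False

    if policy_stable:  # no action changed, so the policy is optimal
      break

  # Same one-letter encoding as value_iteration (dtype=str keeps one character)
  policy = np.zeros((dimension, dimension), dtype=str)
  for state in states:
    policy[state] = actions[policy_index[state]]

  return value_function, policy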
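
As a cross-check on either solver, the optimal values for this particular setup have a simple closed form: moves are deterministic and only the final step into the goal earns +1, so a cell that is d moves from the goal (Manhattan distance) should have value gamma ** (d - 1), with the terminal goal fixed at 0. The sketch below assumes the same grid size, discount factor, and goal position as the program above; `closed_form_values` is a hypothetical helper name, not part of the original code.

```python
import numpy as np

def closed_form_values(dimension=5, gamma=0.9):
    """Expected optimal values for the deterministic grid (assumed setup)."""
    goal_row, goal_col = dimension - 1, dimension - 1
    values = np.zeros((dimension, dimension))
    for row in range(dimension):
        for col in range(dimension):
            # Manhattan distance: number of moves needed to reach the goal
            distance = (goal_row - row) + (goal_col - col)
            if distance > 0:  # the goal itself stays at 0 (terminal)
                values[row, col] = gamma ** (distance - 1)
    return values

expected = closed_form_values()
print(np.round(expected, 7))
```

Comparing the arrays returned by value_iteration() and policy_iteration() against `expected` with np.allclose is one way to confirm that both methods reached the same fixed point.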

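The policy array stores only the first letter of each action name. Here's what the letters in the output mean:

'd': Move down
'r': Move right
'': This is the goal state, so no action is needed.

'u' (up) and 'l' (left) use the same encoding but do not appear in the optimal policy, because neither moves the agent toward the bottom-right goal.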
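
The single letters are a side effect of how the policy array is created rather than a deliberate encoding. The small sketch below shows the effect in isolation; the array names and the width of 5 characters are assumptions chosen for illustration.

```python
import numpy as np

# With dtype=str, np.zeros creates one-character Unicode cells,
# so longer strings are silently truncated on assignment.
short_policy = np.zeros((2, 2), dtype=str)
short_policy[0, 0] = 'down'

# Assumption: 5 characters is enough for the longest action name, 'right'.
full_policy = np.zeros((2, 2), dtype='<U5')
full_policy[0, 0] = 'down'
full_policy[0, 1] = 'right'

print(repr(short_policy[0, 0]))  # only the first character survives
print(repr(full_policy[0, 0]), repr(full_policy[0, 1]))
```

Switching the policy arrays to a wider string dtype would print full action names, at the cost of changing the output format shown in this notebook.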

In [11]:
print("Value Iteration:")
value_function, policy = value_iteration()
print(value_function)
print("Policy Iteration:")
value_function, policy = policy_iteration()
print(value_function)
print("Optimal Policy : ")
print(policy)

Value Iteration:
[[0.4782969 0.531441  0.59049   0.6561    0.729    ]
 [0.531441  0.59049   0.6561    0.729     0.81     ]
 [0.59049   0.6561    0.729     0.81      0.9      ]
 [0.6561    0.729     0.81      0.9       1.       ]
 [0.729     0.81      0.9       1.        0.       ]]
Policy Iteration:
[[0.4782969 0.531441  0.59049   0.6561    0.729    ]
 [0.531441  0.59049   0.6561    0.729     0.81     ]
 [0.59049   0.6561    0.729     0.81      0.9      ]
 [0.6561    0.729     0.81      0.9       1.       ]
 [0.729     0.81      0.9       1.        0.       ]]
Optimal Policy : 
[['d' 'd' 'd' 'd' 'd']
 ['d' 'd' 'd' 'd' 'd']
 ['d' 'd' 'd' 'd' 'd']
 ['d' 'd' 'd' 'd' 'd']
 ['r' 'r' 'r' 'r' '']]
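
One way to read the printed policy is to follow it from the start cell and record the path. The sketch below hard-codes the policy grid from the output above and reuses the same clamping rule as the transition function; `rollout`, `moves`, and `max_steps` are hypothetical names introduced here.

```python
# Policy grid copied from the printed output; '' marks the goal.
policy_letters = [
    ['d', 'd', 'd', 'd', 'd'],
    ['d', 'd', 'd', 'd', 'd'],
    ['d', 'd', 'd', 'd', 'd'],
    ['d', 'd', 'd', 'd', 'd'],
    ['r', 'r', 'r', 'r', ''],
]

# Row/column offsets for each one-letter action code
moves = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1)}

def rollout(policy, start=(0, 0), max_steps=50):
    """Follow the policy from `start` until the goal or the step limit."""
    size = len(policy)
    row, col = start
    path = [start]
    for _ in range(max_steps):
        letter = policy[row][col]
        if letter == '':  # goal reached, no further action
            break
        d_row, d_col = moves[letter]
        # Clamp to the grid, as in the transition model
        row = max(0, min(row + d_row, size - 1))
        col = max(0, min(col + d_col, size - 1))
        path.append((row, col))
    return path

path = rollout(policy_letters)
print(path)
print("steps:", len(path) - 1)
```

The agent goes straight down the first column and then right along the bottom row, reaching the goal in eight moves. With the only reward on the last move, the discounted return is 0.9 ** 7 ≈ 0.478, which agrees with the top-left entry of both value tables.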
