You will need the following values for the implementation:
1. Reward for reaching the goal state = 1
2. Penalty for reaching the red state = -1
3. Step cost = -0.04
4. Probability of going in the direction of the action = p
5. Probability of going in a direction perpendicular to the action = (1−p)/2.
6. Discount factor = 0.95
Note that if a move is blocked by the grid boundary or the wall, the agent remains in its current state.
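
The transition rule above can be sketched as a small function. This is an illustrative sketch, not part of the assignment: the names `transition_distribution`, `N_ROWS`, `N_COLS` and `WALL` are made up here, the 4x3 grid size and the wall position (2, 1) are taken from the grid used in the code of this notebook, and actions are assumed to be (row offset, column offset) pairs. It does not treat the goal and red states as terminal.

```python
from collections import defaultdict

N_ROWS, N_COLS = 4, 3   # assumed grid size (matches the notebook's grid)
WALL = (2, 1)           # assumed wall position (the None entry in the grid)

def transition_distribution(state, action, p):
    """Return {next_state: probability} for taking `action` in `state`."""
    d_row, d_col = action
    # The two directions perpendicular to the chosen action.
    perpendicular = [(d_col, d_row), (-d_col, -d_row)]
    outcomes = [(action, p)] + [(d, (1 - p) / 2) for d in perpendicular]

    dist = defaultdict(float)
    for (dr, dc), prob in outcomes:
        nxt = (state[0] + dr, state[1] + dc)
        inside = 0 <= nxt[0] < N_ROWS and 0 <= nxt[1] < N_COLS
        if not inside or nxt == WALL:
            nxt = state  # blocked move: the agent stays where it is
        dist[nxt] += prob
    return dict(dist)

# Moving "up" from the cell at row 1, column 2 with p = 0.7.
print(transition_distribution((1, 2), (-1, 0), 0.7))
```

Blocked outcomes are added to the probability of staying in place, so the probabilities always sum to 1.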

Task A-
The task is to implement Value Iteration in Python for the given grid world (Fig 1). Take p = 0.7 for this task. The code should print the utility value of each cell in the grid after each iteration until the values converge, where convergence means that the largest change in any cell's utility between two successive iterations is ≤ 0.0001.
![fig.png](/home/navdha/gitrepo/Machine-data-and-Learning-concepts/fig.png)
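
For reference, the update that value iteration applies to every non-terminal cell can be written as below. The notation is an assumption made here for illustration: U_k(s) is the utility of state s after k iterations, c = -0.04 is the step cost, γ = 0.95 is the discount factor, and P(s' | s, a) is the transition probability defined by the values listed above.

```latex
U_{k+1}(s) = c + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U_k(s'),
\qquad
\text{stop when } \max_{s} \lvert U_{k+1}(s) - U_k(s) \rvert \le 10^{-4}.
```

The goal and red states keep their fixed utilities of 1 and -1 throughout.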


In [3]:
import numpy as np

class GridWorldMDP:
    def __init__(self):
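        # Grid encoding: 0 = ordinary cell, 1 = goal, -1 = red (penalty) state, None = wall.
        self.grid_world = np.array([[0, -1, 1], [0, 0, 0], [0, None, 0], [0, 0, 0]], dtype=object)
        # Actions as (row offset, column offset): right, left, down, up.
        self.possible_actions = [(0,1), (0, -1), (1, 0), (-1, 0)]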
        self.reward = 1
        self.penalty = -1
        self.step_cost = -0.04
        self.discount_factor = 0.95
        self.probability_selected_action = 0.7
        self.probability_other_actions = 0.15
        self.U = np.zeros((4,3))
        self.U[0][1] = -1
        self.U[0][2] = 1
        self.policy = np.zeros((4,3))

    def value(self, row, col, action):
        """Expected utility of the next state when taking `action` in (row, col)."""
        total_value = 0

        for i in [-1, 0, 1]:
            for j in [-1, 0, 1]:
                # The agent never moves opposite to the chosen action.
                if (i, j) == (-action[0], -action[1]):
                    continue
                # Keep only the four axis-aligned moves (skip diagonals and (0, 0)).
                if (i == 0 and abs(j) == 1) or (j == 0 and abs(i) == 1):
                    current_prob = self.probability_selected_action if (i, j) == action else self.probability_other_actions
                    new_row, new_col = row + i, col + j

                    if 0 <= new_row < 4 and 0 <= new_col < 3 and self.grid_world[new_row][new_col] is not None:
                        total_value += current_prob * self.U[new_row, new_col]
                    else:
                        # Blocked by the boundary or the wall: the agent stays put.
                        total_value += current_prob * self.U[row, col]

        return total_value


    def iterate_values(self):
        """Run value iteration, printing the utilities after every sweep."""
        iteration = 1
        while True:
            delta = 0  # largest utility change seen in this sweep
            U1 = np.zeros((4, 3))

            for row in range(4):
                for col in range(3):
                    old_value = self.U[row, col]

                    # Wall: no utility to update and no action to take.
                    if self.grid_world[row][col] is None:
                        self.policy[row][col] = -1
                        continue

                    if self.grid_world[row][col] == 1:
                        # Goal state: fixed utility.
                        U1[row][col] = self.reward
                        self.policy[row][col] = -1

                    elif self.grid_world[row][col] == -1:
                        # Red state: fixed utility.
                        U1[row][col] = self.penalty
                        self.policy[row][col] = -1

                    else:
                        # Pick the action with the highest expected utility.
                        best_action_index = 0
                        best_action_value = self.value(row, col, self.possible_actions[0])

                        for i in range(1, len(self.possible_actions)):
                            action_value = self.value(row, col, self.possible_actions[i])
                            if action_value > best_action_value:
                                best_action_value = action_value
                                best_action_index = i

                        U1[row, col] = self.step_cost + (self.discount_factor * best_action_value)
                        self.policy[row, col] = best_action_index

                    delta = max(delta, abs(old_value - U1[row, col]))

            self.U = U1
            print(f"Iteration {iteration}:")
            self.print_values()
            iteration += 1

            # Converged once no cell changed by more than 0.0001.
            if delta <= 0.0001:
                break

    def print_values(self):
        print('Utility Grid:')
        for row in self.U:
            print("\t".join("{:.6f}".format(value) for value in row))
        print()

    def print_policy(self):
        action_symbols = {0: 'right', 1: 'left', 2: 'down', 3: 'up', -1: 'none'}
        print('Optimal policy grid:')
        for row in self.policy:
            print(" ".join("{:^6}".format(action_symbols[int(action)]) for action in row))

# Usage
mdp = GridWorldMDP()
mdp.iterate_values()
#mdp.print_policy()


Iteration 1:
Utility Grid:
-0.040000	-1.000000	1.000000
-0.040000	-0.040000	0.625000
-0.040000	0.000000	-0.040000
-0.040000	-0.040000	-0.040000

Iteration 2:
Utility Grid:
-0.078000	-1.000000	1.000000
-0.078000	0.227425	0.708362
-0.078000	0.000000	0.364225
-0.078000	-0.078000	-0.078000

Iteration 3:
Utility Grid:
-0.114100	-1.000000	1.000000
0.089008	0.320969	0.758350
-0.114100	0.000000	0.534865
-0.114100	-0.114100	0.179980

Iteration 4:
Utility Grid:
-0.119452	-1.000000	1.000000
0.140926	0.367541	0.778803
-0.013328	0.000000	0.616739
-0.148395	0.047168	0.325073

Iteration 5:
Utility Grid:
-0.105806	-1.000000	1.000000
0.185493	0.387778	0.788354
0.049917	0.000000	0.653675
-0.031679	0.189617	0.423176

Iteration 6:
Utility Grid:
-0.074224	-1.000000	1.000000
0.209909	0.397014	0.792599
0.097579	0.000000	0.670553
0.088694	0.295453	0.482017

Iteration 7:
Utility Grid:
-0.053488	-1.000000	1.000000
0.227342	0.401153	0.794520
0.127399	0.000000	0.678186
0.183020	0.364745	0.516707

Iteration 8:
Uti

Task B-
Run the algorithm for p values ranging from 0.1 to 0.9 in steps of 0.1. Print only the final policies that the algorithm converges to, in the following format for each p:
’right’ ’none’ ’none’
’up’ ’up’ ’down’
’up’ ’none’ ’down’
’up’ ’left’ ’left’
Observe how the policy changes with different p values. Comment on why the changes occur as they do.
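
The sweep over p and the requested output format can be sketched as follows. This is only an illustration: `ACTION_NAMES`, `P_VALUES` and `format_policy` are hypothetical names, the action indices are assumed to follow the convention used in this notebook's code (0 right, 1 left, 2 down, 3 up, -1 for cells without an action), and straight quotes are used in place of the curly quotes shown in the task text.

```python
# Assumed index-to-name mapping; -1 marks terminal and wall cells.
ACTION_NAMES = {0: "right", 1: "left", 2: "down", 3: "up", -1: "none"}

# Dividing integers by 10 avoids labels such as 0.30000000000000004.
P_VALUES = [k / 10 for k in range(1, 10)]

def format_policy(policy):
    """Return the policy grid as lines of quoted action names."""
    return "\n".join(
        " ".join(f"'{ACTION_NAMES[int(a)]}'" for a in row) for row in policy
    )

# The sample grid from the task text, written as action indices.
example = [[0, -1, -1], [3, 3, 2], [3, -1, 2], [3, 1, 1]]
print(format_policy(example))
```

`format_policy` accepts any 2D sequence of numeric indices, so a NumPy policy array could be passed to it directly.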

In [4]:
import numpy as np

class GridWorldMDP:
    def __init__(self, probability_selected_action):
        self.grid_world = np.array([[0, -1, 1], [0, 0, 0], [0, None, 0], [0, 0, 0]])
        self.possible_actions = [(0,1), (0, -1), (1, 0), (-1, 0)]
        self.reward = 1
        self.penalty = -1
        self.step_cost = -0.04
        self.discount_factor = 0.95
        self.probability_selected_action = probability_selected_action
        self.probability_other_actions = (1 - probability_selected_action) / 2
        self.U = np.zeros((4,3))
        self.U[0][1] = -1
        self.U[0][2] = 1
        self.policy = np.zeros((4,3))

    def value(self, row, col, action):
        total_value = 0
    
        for i in [-1, 0, 1]:
            for j in [-1, 0, 1]:
                if (i, j) == (-action[0], -action[1]):
                    continue
                if (i == 0 and abs(j) == 1) or (j == 0 and abs(i) == 1):
                    current_prob = self.probability_selected_action if (i, j) == action else self.probability_other_actions
                    new_row, new_col = row + i, col + j
                    
                    if 0 <= new_row < 4 and 0 <= new_col < 3 and self.grid_world[new_row][new_col] is not None:
                        total_value += current_prob * self.U[new_row, new_col]
                    else:
                        total_value += current_prob * self.U[row, col]
        
        return total_value

    def iterate_values(self):
        iteration = 1
        while True:
            delta = 0
            U1 = np.zeros((4, 3))
            
            for row in range(4):
                for col in range(3):
                    old_value = self.U[row, col]
                    
                    if self.grid_world[row][col] is None:
                        self.policy[row][col] = -1  
                        continue
                    
                    if self.grid_world[row][col] == 1:
                        U1[row][col] = self.reward
                        self.policy[row][col] = -1  
                    
                    elif self.grid_world[row][col] == -1:
                        U1[row][col] = self.penalty
                        self.policy[row][col] = -1 
                    else:
                        best_action_index = 0
                        best_action_value = self.value(row, col, self.possible_actions[0])
    
                        for i in range(1, len(self.possible_actions)):
                            action_value = self.value(row, col, self.possible_actions[i])
                            if action_value > best_action_value:
                                best_action_value = action_value
                                best_action_index = i

                        U1[row, col] = self.step_cost + (self.discount_factor * best_action_value)
                        self.policy[row, col] = best_action_index

                    delta = max(delta, abs(old_value - U1[row, col]))
                    
            
            self.U = U1
            iteration += 1
            
            if delta <= 0.0001:
                break

    def print_policy(self):
        action_symbols = {0: 'right', 1: 'left', 2: 'down', 3: 'up', -1: 'none'}
        print('Optimal policy grid (p={}):'.format(self.probability_selected_action))
        for row in self.policy:
            print(" ".join("{:^6}".format(action_symbols[int(action)]) for action in row))

# Usage
for p in np.arange(0.1, 1.0, 0.1):
    mdp = GridWorldMDP(probability_selected_action=p)
    mdp.iterate_values()
    mdp.print_policy()


Optimal policy grid (p=0.1):
 left   none   none 
  up    down  right 
right   none  right 
  up    down  right 
Optimal policy grid (p=0.2):
 left   none   none 
  up    down  right 
right   none  right 
  up    down  right 
Optimal policy grid (p=0.30000000000000004):
 left   none   none 
 down   down  right 
right   none    up  
 down  right  right 
Optimal policy grid (p=0.4):
 left   none   none 
 down   down  right 
right   none    up  
 down  right  right 
Optimal policy grid (p=0.5):
 left   none   none 
right   down    up  
 down   none    up  
right  right    up  
Optimal policy grid (p=0.6):
 left   none   none 
right   down    up  
 down   none    up  
right  right    up  
Optimal policy grid (p=0.7000000000000001):
 down   none   none 
right  right    up  
 down   none    up  
right  right    up  
Optimal policy grid (p=0.8):
 down   none   none 
right  right    up  
 down   none    up  
right  right    up  
Optimal policy grid (p=0.9):
 down   none   none 
right  right   

Let's analyze how the policy changes with different values of p, the probability of moving in the direction of the chosen action.

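    1. p=0.1 and p=0.2:
        The two policies are identical. Most cells choose an action that points into a boundary or the wall ('right' in the whole right-hand column, 'down' in the middle column). With p this low, each perpendicular slip has probability (1−p)/2 = 0.45 or 0.4 and is more likely than the intended move, so the agent effectively steers with the slips. In the cell below the goal, for example, 'right' moves the agent up into the goal far more often than 'up' would.

    2. p=0.3 to p=0.6:
        This is a transitional range: p=0.3 and p=0.4 share one policy, and p=0.5 and p=0.6 share another. Once p exceeds 1/3 the intended move is the most likely outcome, and cells progressively switch to pointing toward the goal: the right-hand column changes from 'right' to 'up', and the bottom row heads right toward that column. The two cells next to the red state still avoid any action that could slip into it: the top-left cell keeps 'left' and the cell below the red state keeps 'down'.

    3. p=0.7 to p=0.9:
        p=0.7 and p=0.8 give the same policy, and the visible part of the p=0.9 output matches it. The agent now heads for the goal along the direct route. The cells next to the red state switch to 'down' (top-left) and 'right' (below the red state), accepting a slip probability of at most 0.15 into the red state in exchange for faster progress.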

In general, the changes in the optimal policy as p varies are driven by how reliable the agent's moves are. At low p the agent cannot count on its intended move, so it picks actions whose likely slips are useful and that cannot lead into the red state. At high p it takes the direct route and tolerates a small chance of slipping into the red state. This is a trade-off between risk and progress under a known transition model rather than between exploration and exploitation: value iteration plans with the model and does no exploring.
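
A rough way to see why low p favors "sideways" actions is to compare the one-step chance of entering the goal from the cell directly below it. The sketch below is a simplification introduced here for illustration (the function name `one_step_goal_probability` is hypothetical); it looks only at that single step and ignores where the other outcomes lead.

```python
def one_step_goal_probability(action, p):
    """Chance of moving up into the goal in one step from the cell below it."""
    if action == "up":
        return p                # the intended move itself reaches the goal
    if action in ("left", "right"):
        return (1 - p) / 2      # only the upward perpendicular slip reaches the goal
    return 0.0                  # "down": the agent never moves opposite to its action

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    up = one_step_goal_probability("up", p)
    right = one_step_goal_probability("right", p)
    better = "right" if right > up else "up"
    print(f"p={p:.1f}  up={up:.2f}  right={right:.2f}  better in one step: {better}")
```

This one-step comparison flips at p = 1/3, whereas in the printed policies that cell switches from 'right' to 'up' between p=0.4 and p=0.5. The difference arises because the full utility also depends on the states the other outcomes lead to.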