## Group No

## Group Member Names:
1.
2.
3.


1.**Problem statement**:

* Develop a reinforcement learning agent using dynamic programming to solve the Treasure Hunt problem in a FrozenLake environment. The agent must learn the optimal policy for navigating the lake while avoiding holes and maximizing its treasure collection.

2.**Scenario**:
* A treasure hunter is navigating a slippery 5x5 FrozenLake grid. The objective is to cross the lake, collecting treasures and avoiding holes, and ultimately reach the exit (goal).
* The grid is a 5x5 map whose tiles are labeled S, F, H, G, or T. The state includes the agent's current position and whether each treasure has been collected.


#### Objective
* The agent must learn the optimal policy π* using dynamic programming to maximize its cumulative reward while navigating the lake.
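
For reference, the optimal value function and policy that dynamic programming targets can be written with the Bellman optimality equation. The notation here is an assumption of this note rather than part of the assignment text: $p(s', r \mid s, a)$ is the transition model and $\gamma$ is the discount factor (0.9 in the value-iteration code later in this notebook).

$$
V^*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V^*(s')\bigr],
\qquad
\pi^*(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V^*(s')\bigr]
$$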

#### About the environment

The environment consists of several types of tiles:
* Start (S): The initial position of the agent, safe to step.
* Frozen Tiles (F): Frozen surface, safe to step.
* Hole (H): Falling into a hole ends the episode immediately with a failure.
* Goal (G): Exit point; reaching it ends the episode successfully.
* Treasure Tiles (T): Added to the environment. Stepping on these tiles awards +5 reward but does not end the game.

After stepping on a treasure tile, it becomes a frozen tile (F).
The agent earns rewards as follows:
* Reaching the goal (G): +10 reward.
* Falling into a hole (H): -10 reward.
* Collecting a treasure (T): +5 reward.
* Stepping on a frozen tile (F): 0 reward.

#### States
* Current position of the agent (row, column).
* A boolean flag (or equivalent) for whether each treasure has been collected.
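
One possible way to combine the position and the treasure flags into a single integer state is sketched below. The helper names `encode_state` and `decode_state` are hypothetical, and the treasure positions are read off the custom map defined later in this notebook.

```python
# Hypothetical helpers: pack (row, col, treasure flags) into one integer index.
N_ROWS, N_COLS = 5, 5
TREASURES = [(0, 3), (2, 0), (3, 4)]  # T positions in the custom 5x5 map of this notebook


def encode_state(row, col, collected):
    """collected: one boolean per entry of TREASURES (True = already picked up)."""
    mask = sum(1 << i for i, flag in enumerate(collected) if flag)
    return mask * (N_ROWS * N_COLS) + row * N_COLS + col


def decode_state(index):
    """Inverse of encode_state: returns (row, col, list of booleans)."""
    mask, pos = divmod(index, N_ROWS * N_COLS)
    row, col = divmod(pos, N_COLS)
    collected = [bool((mask >> i) & 1) for i in range(len(TREASURES))]
    return row, col, collected
```

With three treasures this gives 8 x 25 = 200 states, so a full tabular sweep remains cheap.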

#### Actions
* Four possible moves: up, down, left, right
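
Gym's FrozenLake numbers the actions 0 = left, 1 = down, 2 = right, 3 = up, which matches the transition printout later in this notebook (from state 0, action 1 leads to state 5 and action 2 to state 1). A minimal sketch of the deterministic move rule follows; the `move` helper is hypothetical and ignores slipperiness.

```python
# Action index -> (name, (row delta, col delta)), following Gym's FrozenLake encoding.
ACTIONS = {
    0: ("left", (0, -1)),
    1: ("down", (1, 0)),
    2: ("right", (0, 1)),
    3: ("up", (-1, 0)),
}


def move(row, col, action, n_rows=5, n_cols=5):
    """Apply one non-slippery move; moves off the grid leave the agent in place."""
    d_row, d_col = ACTIONS[action][1]
    new_row = min(max(row + d_row, 0), n_rows - 1)
    new_col = min(max(col + d_col, 0), n_cols - 1)
    return new_row, new_col
```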

#### Rewards
* Goal (G): +10.
* Treasure (T): +5 per treasure.
* Hole (H): -10.
* Frozen tiles (F): 0.
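
To illustrate how these rewards combine along a path, the sketch below computes the discounted return for a sequence of tiles the agent steps onto. `TILE_REWARD` and `discounted_return` are hypothetical names, and the default discount of 0.9 is taken from the value-iteration code later in this notebook. A treasure tile visited a second time should be passed as "F", since a collected treasure becomes a frozen tile.

```python
# Reward for stepping onto each tile type, as listed above.
TILE_REWARD = {"S": 0, "F": 0, "T": 5, "H": -10, "G": 10}


def discounted_return(tiles, gamma=0.9):
    """Sum of gamma**t * reward over the tiles entered at steps t = 0, 1, 2, ..."""
    return sum((gamma ** t) * TILE_REWARD[tile] for t, tile in enumerate(tiles))


# Example: frozen tile, treasure, frozen tile, goal
# 0 + 0.9 * 5 + 0 + 0.9**3 * 10 = 4.5 + 7.29 = 11.79
example = discounted_return(["F", "T", "F", "G"])
```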

#### Environment
Modify the FrozenLake environment in OpenAI Gym to include treasures (T) at certain positions. Inherit from the original FrozenLakeEnv and override the reset and step methods accordingly.
Example grid:

![image.png](attachment:image.png)


**Expected Outcomes:**
1.	Create the custom environment by modifying the existing "FrozenLakeNotSlippery-v0" in OpenAI Gym, and implement dynamic programming using value iteration and policy improvement to learn the optimal policy for the Treasure Hunt problem.
2.	Calculate the state-value function (V*) for each state on the map after learning the optimal policy.
3.	Compare the agent’s performance with and without treasures, discussing the trade-offs in reward maximization.
4.	Visualize the agent’s direction on the map using the learned policy.
5.	Calculate expected total reward over multiple episodes to evaluate performance.

### Import required libraries and Define the custom environment - 2 Marks

In [102]:
!pip install gymnasium



In [103]:
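# Import statements (this notebook uses the classic `gym` package; step() below expects its 4-tuple return, which differs from `gymnasium`)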
import gym
import numpy as np
from gym.envs.toy_text.frozen_lake import FrozenLakeEnv

In [104]:
# Custom environment to create the given grid and respective functions that are required for the problem
class FrozenSlipperyLakeWithTreasures(FrozenLakeEnv):

    def __init__(self, desc=None, map_name='5x5', render_mode="ansi", is_slippery=True):
        super().__init__(desc=desc, map_name=map_name, is_slippery=is_slippery, render_mode=render_mode)
        self.treasure_collected = set()

    def step(self, action):
        state, reward, done, info = super().step(action)  # Call the base class method to perform the step

        # Determine the current position
        row, col = divmod(self.s, self.ncol)  # Convert state index to (row, col)
        current_tile = self.desc[row][col].decode("utf-8")  # Get the tile character

        # Assign rewards based on tile type
        if current_tile == "G":  # Goal
            reward = 10
        elif current_tile == "H":  # Hole
            reward = -10
        elif current_tile == "T":  # Treasure
            reward = 5
            self.treasure_collected.add((row, col))  # Mark treasure as collected
            # Turn the tile into F; index desc directly because flatten() returns a copy
            self.desc[row, col] = b'F'
        elif current_tile == "F":  # Frozen tile
            reward = 0

        return state, reward, done, info

    def reset(self):
        super().reset()
        # Put back the treasures collected in the previous episode
        for row, col in self.treasure_collected:
            self.desc[row, col] = b'T'
        self.treasure_collected = set()
        return self.s

    def render(self):
        """Override render to display the state as a matrix."""
        # Decode byte strings for human-readable format
        decoded_matrix = np.array(self.desc, dtype=str)

        # Render the matrix
        print("\n".join([" ".join(row) for row in decoded_matrix]))

    def get_transitionProbability_states(self, state, action):
        updated_states = []
        states = self.P[state][action]
        for prob, s_, r, _ in states:
          row, col = divmod(s_, self.ncol)  # Convert state index to (row, col)
          current_tile = self.desc[row][col].decode("utf-8")  # Get the tile character
          # Assign rewards based on tile type
          if current_tile == "G":  # Goal
              r = 10
          elif current_tile == "H":  # Hole
              r = -10
          elif current_tile == "T":  # Treasure
              if (row, col) not in self.treasure_collected:
                  r = 5
              else:
                  r = 0
          elif current_tile == "F":  # Frozen tile
              r = 0
          elif current_tile == "S":  # Start tile
              r = 0
          updated_states.append((prob, s_, r, _))
        return updated_states


#Include functions to take an action, get reward, to check if episode is over

In [105]:
custom_map_5x5 = [
    "SHFTF",
    "FFFFH",
    "TFFFF",
    "FFHFT",
    "HFFFG"
]
env = FrozenSlipperyLakeWithTreasures( desc = custom_map_5x5, map_name = "5x5")
env.reset()
env.render()
for state in env.P:
    print(f"State {state}:")
    for action in env.P[state]:
      print(f"  Action {action}: {env.get_transitionProbability_states(state,action)}")
env.render()

S H F T F
F F F F H
T F F F F
F F H F T
H F F F G
State 0:
  Action 0: [(1.0, 0, 0, False)]
  Action 1: [(1.0, 5, 0, False)]
  Action 2: [(1.0, 1, -10, True)]
  Action 3: [(1.0, 0, 0, False)]
State 1:
  Action 0: [(1.0, 1, -10, True)]
  Action 1: [(1.0, 1, -10, True)]
  Action 2: [(1.0, 1, -10, True)]
  Action 3: [(1.0, 1, -10, True)]
State 2:
  Action 0: [(1.0, 1, -10, True)]
  Action 1: [(1.0, 7, 0, False)]
  Action 2: [(1.0, 3, 5, False)]
  Action 3: [(1.0, 2, 0, False)]
State 3:
  Action 0: [(1.0, 2, 0, False)]
  Action 1: [(1.0, 8, 0, False)]
  Action 2: [(1.0, 4, 0, False)]
  Action 3: [(1.0, 3, 5, False)]
State 4:
  Action 0: [(1.0, 3, 5, False)]
  Action 1: [(1.0, 9, -10, True)]
  Action 2: [(1.0, 4, 0, False)]
  Action 3: [(1.0, 4, 0, False)]
State 5:
  Action 0: [(1.0, 5, 0, False)]
  Action 1: [(1.0, 10, 5, False)]
  Action 2: [(1.0, 6, 0, False)]
  Action 3: [(1.0, 0, 0, False)]
State 6:
  Action 0: [(1.0, 5, 0, False)]
  Action 1: [(1.0, 11, 0, False)]
  Action 2: [(1.0, 7

### Value Iteration Algorithm - 1 Mark

In [109]:
def value_iteration(env):
  num_iterations = 10000
  threshold = 1e-20 #value used to terminate if no changes observed in the new values from the previous values
  gamma = 0.9 #discount factor
  value_table = np.zeros(env.observation_space.n)
  print('Initial Value Table: ', value_table)
  for i in range(num_iterations): #iterate through given no of times
    updated_value_table = np.copy(value_table)
    for s in range(env.observation_space.n): #update for each state
      row, col = divmod(s, env.ncol)
      if env.desc[row, col] == b'H': #Hole: keep the value fixed at the hole reward and skip the update
        value_table[s]=-10
        continue
      if env.desc[row, col] == b'G': #Goal: keep the value fixed at the goal reward and skip the update
        value_table[s]=10
        continue
      if env.desc[row, col] == b'T': #Treasure: keep the value fixed at the treasure reward and skip the update
        value_table[s]=5
        env.treasure_collected.add((row, col))
        continue
      Q_values = []
      for a in range(env.action_space.n): #get q values for each possible action
        Q_value = 0
        for prob, s_, r, _ in env.get_transitionProbability_states(s,a):
          Q_value += prob*(r + gamma * updated_value_table[s_])
        Q_values.append(Q_value)
      value_table[s] = max(Q_values) #pick the maximum q values
      #row, col = divmod(s, env.ncol)
      #if env.desc[row, col] == b'T':
          # Reset the tile to Frozen
        #env.desc[row, col] = b'F'
      #print(f'Value Table for State {s}', value_table[s])
    if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold): # check for convergence
      break
  for s in range(env.observation_space.n):
    print(f'Value Table for State {s}', value_table[s])
  return value_table
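
The `value_iteration` function above works on the 25 grid positions and pins each treasure tile at a value of 5. As an alternative sketch (not the method used in this notebook), the treasure flags from the States section can be made part of the state, so that each treasure pays out exactly once. The block below assumes deterministic (non-slippery) moves and the map defined earlier; `step_model` and `value_iteration_with_treasures` are hypothetical names.

```python
import numpy as np

GRID = ["SHFTF", "FFFFH", "TFFFF", "FFHFT", "HFFFG"]  # custom map from this notebook
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]            # left, down, right, up
GAMMA = 0.9                                           # same discount as value_iteration

n_rows, n_cols = len(GRID), len(GRID[0])
treasures = [(r, c) for r in range(n_rows) for c in range(n_cols) if GRID[r][c] == "T"]
n_masks = 1 << len(treasures)  # one bit per treasure: 1 = already collected


def step_model(r, c, mask, action):
    """Deterministic model: returns (next_r, next_c, next_mask, reward, done)."""
    dr, dc = MOVES[action]
    nr = min(max(r + dr, 0), n_rows - 1)
    nc = min(max(c + dc, 0), n_cols - 1)
    tile = GRID[nr][nc]
    if tile == "H":
        return nr, nc, mask, -10.0, True
    if tile == "G":
        return nr, nc, mask, 10.0, True
    if tile == "T":
        bit = 1 << treasures.index((nr, nc))
        if not mask & bit:  # first visit: collect the treasure
            return nr, nc, mask | bit, 5.0, False
    return nr, nc, mask, 0.0, False


def value_iteration_with_treasures(theta=1e-8, max_sweeps=10_000):
    """In-place value iteration over (mask, row, col); terminal tiles keep value 0."""
    V = np.zeros((n_masks, n_rows, n_cols))
    for _ in range(max_sweeps):
        delta = 0.0
        for mask in range(n_masks):
            for r in range(n_rows):
                for c in range(n_cols):
                    if GRID[r][c] in "HG":
                        continue
                    best = -np.inf
                    for a in range(len(MOVES)):
                        nr, nc, nmask, reward, done = step_model(r, c, mask, a)
                        target = reward + (0.0 if done else GAMMA * V[nmask, nr, nc])
                        best = max(best, target)
                    delta = max(delta, abs(best - V[mask, r, c]))
                    V[mask, r, c] = best
        if delta < theta:  # largest change in this sweep is negligible
            break
    return V
```

Here the goal and hole rewards are paid on entry and terminal tiles have value 0, so the numbers are not directly comparable to the table printed by `value_iteration`, which also assigns values of 10, -10, and 5 to the G, H, and T tiles themselves.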

### Policy Improvement Function - 1 Mark

In [107]:
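def extract_policy(value_table):
  gamma = 0.9 #same discount factor as in value_iteration
  policy = np.zeros(env.observation_space.n)
  for s in range(env.observation_space.n): #uses the global env defined above
    #one-step lookahead with the custom rewards rather than the base rewards stored in env.P
    Q_values = [sum([prob*(r + gamma * value_table[s_])
                for prob, s_, r, _ in env.get_transitionProbability_states(s, a)])
                for a in range(env.action_space.n)]
    policy[s] = np.argmax(np.array(Q_values)) #greedy action for state s
  return policy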

### Print the Optimal Value Function

### Visualization of the learned optimal policy - 1 Mark
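
A minimal sketch of one way to show the policy as arrows on the map, assuming a policy array indexed by grid position (as returned by `extract_policy`) and Gym's action encoding (0 = left, 1 = down, 2 = right, 3 = up). The `render_policy` helper is a hypothetical name, not part of Gym.

```python
ARROWS = {0: "←", 1: "↓", 2: "→", 3: "↑"}  # Gym FrozenLake action encoding


def render_policy(policy, grid):
    """Return the grid as text, with an arrow on every non-terminal tile."""
    n_cols = len(grid[0])
    lines = []
    for r, row in enumerate(grid):
        cells = []
        for c, tile in enumerate(row):
            if tile in "HG":
                cells.append(tile)  # terminal tiles: show the tile letter instead
            else:
                cells.append(ARROWS[int(policy[r * n_cols + c])])
        lines.append(" ".join(cells))
    return "\n".join(lines)
```

For this notebook the call would be along the lines of `print(render_policy(extract_policy(optimal_value_function), custom_map_5x5))`.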

### Evaluate the policy - 1 Mark
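
A minimal sketch of a rollout-based evaluation, assuming an environment whose `reset()` returns an integer state and whose `step()` returns `(state, reward, done, info)`, as the custom class in this notebook does. `evaluate_policy` is a hypothetical helper; the step cap is there because a greedy policy can loop forever without reaching a terminal tile.

```python
import numpy as np


def evaluate_policy(env, policy, n_episodes=100, max_steps=100):
    """Average undiscounted total reward of `policy` over n_episodes rollouts."""
    returns = []
    for _episode in range(n_episodes):
        state = env.reset()
        total = 0.0
        for _step in range(max_steps):
            state, reward, done, _info = env.step(int(policy[state]))
            total += reward
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))
```

Running the same function on maps with and without T tiles gives one way to make the comparison asked for in the expected outcomes.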

### Main Execution

In [110]:
optimal_value_function = value_iteration(env)
print('optimal value function', optimal_value_function)

Initial Value Table:  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0.]
Value Table for State 0 9.087641100000004
Value Table for State 1 -10.0
Value Table for State 2 11.219310000000004
Value Table for State 3 5.0
Value Table for State 4 4.5
Value Table for State 5 10.097379000000004
Value Table for State 6 11.219310000000004
Value Table for State 7 12.465900000000003
Value Table for State 8 13.851000000000003
Value Table for State 9 -10.0
Value Table for State 10 5.0
Value Table for State 11 12.465900000000003
Value Table for State 12 13.851000000000003
Value Table for State 13 15.390000000000002
Value Table for State 14 13.851000000000003
Value Table for State 15 12.465900000000003
Value Table for State 16 13.851000000000003
Value Table for State 17 -10.0
Value Table for State 18 17.1
Value Table for State 19 5.0
Value Table for State 20 -10.0
Value Table for State 21 15.390000000000002
Value Table for State 22 17.1
Value Table for State 23 19.0
Value Tab