
##  **Assignment: Gridworld MDP Simulation**

You are given a rectangular gridworld (like in Figure 3.2 from the main book).
Each cell in the grid represents a **state** in the environment.

From any state, the agent can take one of four possible **actions**:

```
NORTH, SOUTH, EAST, WEST
```

Each action deterministically moves the agent one cell in the chosen direction.

If the agent tries to move **off the grid**, it **stays in the same position** and receives a **reward of −1**.
All other moves give a **reward of 0**, except for two special states **A** and **B**:

* From **state A = (0, 1)**:
  Any action gives a **reward of +10** and moves the agent to **A′ = (4, 1)**.
* From **state B = (0, 3)**:
  Any action gives a **reward of +5** and moves the agent to **B′ = (2, 3)**.

---

### 🎯 **Your tasks**

1. **Define a policy function π(a|s)**
   A function that returns the action to be taken for a given state.

2. **Define the environment dynamics**
   Implement a function that returns the **next state (s′)** and **reward (r)** given the current state (s) and chosen action (a).
   That is, implement $p(s' \mid s, a)$ and $r(s, a, s')$.

3. **Simulate the Markov chain**

   * Choose an initial state (e.g., any grid cell).
   * Simulate the process for **1000 time steps** following your policy.
   * Repeat this simulation **multiple times** to generate many episodes.

4. **Compute the Value Function**
   For each state, compute its **value function** $V(s)$ under **discount rate γ = 0.9**,
   that is, the expected value of the discounted return that follows a visit to $s$ at time $t$:
   $$
   G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
   $$

5. **Evaluate under two reward schemes:**

   #### Reward Scheme 1:

   | Condition         | Reward |
   | ----------------- | ------ |
   | Move off the grid | −1     |
   | Any action from A | +10    |
   | Any action from B | +5     |
   | All other moves   | 0      |

   #### Reward Scheme 2:

   | Condition         | Reward |
   | ----------------- | ------ |
   | Move off the grid | +5     |
   | Any action from A | +16    |
   | Any action from B | +11    |
   | All other moves   | +6     |

6. **Compare and Discuss**

   * Compute the value function $V(s)$ for both reward schemes.
   * Compare how the value of each state changes under the two settings.
   * Discuss why and how the different reward structures affect the agent’s expected return.
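
The return formula in task 4 can be evaluated for any finite reward sequence with a single backward pass. The sketch below is only an illustration: `discounted_return` and the two reward lists are hypothetical and are not part of the assignment code.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequences, chosen only for illustration.
print(discounted_return([10, 0, 0, -1]))  # 10 + 0.9**3 * (-1), about 9.271
print(discounted_return([6] * 1000))      # approaches 6 / (1 - 0.9) = 60
```

The second call shows why a constant reward of 6 per step is worth about 60 in discounted terms when γ = 0.9.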


In [None]:
import random
import numpy as np


# MDP Definition Constants


In [None]:

GRID_SIZE = 5
GAMMA = 0.9
ACTIONS = ['NORTH', 'SOUTH', 'EAST', 'WEST']

STATE_A = (0, 1)
STATE_B = (0, 3)
STATE_A_PRIME = (4, 1)
STATE_B_PRIME = (2, 3)

# Define two reward schemes


In [None]:

REWARD_SCHEME_1 = {
    'name': 'Standard Rewards',
    'OFF_GRID': -1,
    'A_REWARD': 10,
    'B_REWARD': 5,
    'NORMAL_REWARD': 0
}

REWARD_SCHEME_2 = {
    'name': 'High Rewards',
    'OFF_GRID': 5,
    'A_REWARD': 16,
    'B_REWARD': 11,
    'NORMAL_REWARD': 6
}

In [None]:
ACTION_MAP = {
    'NORTH': (-1, 0),
    'SOUTH': (1, 0),
    'EAST': (0, 1),
    'WEST': (0, -1)
}

# Policy Function

From any state, choose one of the four actions uniformly at random (the equiprobable random policy).

In [None]:
def get_action(state):
    return random.choice(ACTIONS)
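
Task 1 is phrased in terms of π(a|s), while `get_action` only samples from that distribution. As a minimal sketch, the same equiprobable policy can be written as an explicit probability table and sampled from it. The names `policy_probs` and `sample_action` are hypothetical and do not appear elsewhere in this notebook.

```python
import random

ACTIONS = ['NORTH', 'SOUTH', 'EAST', 'WEST']


def policy_probs(state):
    """pi(a|s) for the equiprobable random policy (identical for every state)."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}


def sample_action(state, rng=random):
    """Draw one action according to policy_probs(state)."""
    probs = policy_probs(state)
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


print(policy_probs((2, 2)))
print(sample_action((2, 2)))
```

Keeping the probabilities explicit makes it easier to swap in a non-uniform policy later without touching the sampling code.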

# Transition and Reward Function


In [None]:
def get_next_state_reward(state, action, reward_scheme):
    """Return (next_state, reward) for taking `action` in `state`."""
    row, column = state

    # From A, every action moves the agent to A' and pays the A reward.
    if state == STATE_A:
        return STATE_A_PRIME, reward_scheme['A_REWARD']

    # From B, every action moves the agent to B' and pays the B reward.
    if state == STATE_B:
        return STATE_B_PRIME, reward_scheme['B_REWARD']

    dr, dc = ACTION_MAP[action]
    next_r, next_c = row + dr, column + dc

    # Check for moving off the grid
    if not (0 <= next_r < GRID_SIZE and 0 <= next_c < GRID_SIZE):
        return state, reward_scheme['OFF_GRID']
    else:
        return (next_r, next_c), reward_scheme['NORMAL_REWARD']

#  Simulation and Monte Carlo Functions


In [None]:
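def generate_episode(start_state, max_steps, reward_scheme, gamma):
    """Simulate `max_steps` transitions from `start_state` under the random policy.

    `gamma` is not used here; discounting is applied later in `monte_carlo`.
    """
    state = start_state
    episode = []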

    for _ in range(max_steps):

        action = get_action(state)

        next_state, reward = get_next_state_reward(state, action, reward_scheme)

        # Store the transition (s, R)
        episode.append({'state': state, 'reward': reward})

        # Move to the next state
        state = next_state

    return episode

In [None]:
def monte_carlo(num_episodes, max_steps, start_state, reward_scheme, gamma):

    # { (row, col): [list of G values for that state] }
    returns = {}

    for _ in range(num_episodes):
        episode = generate_episode(start_state, max_steps, reward_scheme, gamma)
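        G = 0
        visited_states = set()

        # Walk the episode backwards so that G is the discounted return
        # from step i onward: G_i = R_{i+1} + gamma * G_{i+1}.
        for i in reversed(range(len(episode))):
            step = episode[i]
            s = step['state']
            R = step['reward']

            G = R + gamma * G

            # Record one return per state per episode. Because this pass runs
            # backwards, the value kept is the return following the LAST visit
            # to s, which is truncated by the max_steps horizon.
            if s not in visited_states:
                returns.setdefault(s, []).append(G)
                visited_states.add(s)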

    # the average of all recorded returns for state s
    V_pi = np.zeros((GRID_SIZE, GRID_SIZE))

    for r in range(GRID_SIZE):
        for c in range(GRID_SIZE):
            state = (r, c)
            if state in returns and returns[state]:
                V_pi[r, c] = np.mean(returns[state])

    return V_pi, reward_scheme['name']
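
Note that `monte_carlo` marks a state as visited the first time it is met in the backward pass, which is that state's last occurrence in the episode. For comparison, here is a minimal sketch of how the return following the first visit could be collected instead. The helper `first_visit_returns` and the toy episode are hypothetical; the episode format is assumed to match the list of `{'state', 'reward'}` dicts built by `generate_episode`.

```python
def first_visit_returns(episode, gamma):
    """Map each state to the discounted return following its FIRST visit."""
    G = 0.0
    first_return = {}
    # Walk backwards so G accumulates the return from each step onward.
    for step in reversed(episode):
        G = step['reward'] + gamma * G
        # Overwriting on every visit means the value that survives the
        # backward pass belongs to the earliest time step for that state.
        first_return[step['state']] = G
    return first_return


# Hypothetical three-step episode, for illustration only.
toy = [
    {'state': (0, 0), 'reward': -1},
    {'state': (0, 0), 'reward': 0},
    {'state': (0, 1), 'reward': 10},
]
print(first_visit_returns(toy, 0.9))
```

In the toy episode the first visit to `(0, 0)` has a return of about 7.1, whereas the last visit to the same state has a return of 9.0, so the two conventions can give different averages.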

In [None]:
def print_value_function(V_pi, title):
    print(f"\n Value Function for {title}")
    V_str = ""
    for r in range(GRID_SIZE):
        row_str = ""
        for c in range(GRID_SIZE):
            if (r, c) == STATE_A:
                val = f"{V_pi[r, c]:>5.2f} (A)"
            elif (r, c) == STATE_B:
                val = f"{V_pi[r, c]:>5.2f} (B)"
            else:
                val = f"{V_pi[r, c]:>5.2f}"
            row_str += val + " | "
        V_str += row_str.strip(' | ') + "\n"
        if r < GRID_SIZE - 1:
            V_str += "------" * GRID_SIZE + "\n"
    print(V_str)


# Define simulation parameters


In [None]:
NUM_EPISODES = 100000
MAX_STEPS = 1000
START_STATE = (2, 2)

# Simulate with Reward Scheme 1

In [None]:
V_pi_1, name_1 = monte_carlo(NUM_EPISODES, MAX_STEPS, START_STATE, REWARD_SCHEME_1, GAMMA)
print_value_function(V_pi_1, name_1)



 Value Function for Standard Rewards
5.25 |  8.40 (A) |  4.90 |  4.62 (B) |  3.00
------------------------------
2.59 |  3.08 |  2.19 |  1.50 |  1.29
------------------------------
0.75 |  0.69 |  0.46 | -0.19 |  0.18
------------------------------
-0.20 | -0.39 | -0.32 | -0.31 | -0.24
------------------------------
-0.52 | -0.87 | -0.63 | -0.51 | -0.44



# Simulate with Reward Scheme 2


In [None]:

V_pi_2, name_2 = monte_carlo(NUM_EPISODES, MAX_STEPS, START_STATE, REWARD_SCHEME_2, GAMMA)
print_value_function(V_pi_2, name_2)



 Value Function for High Rewards
62.09 | 62.74 (A) | 61.69 | 58.78 (B) | 59.44
------------------------------
55.64 | 54.15 | 52.73 | 51.79 | 53.52
------------------------------
50.10 | 46.82 | 45.21 | 43.15 | 48.40
------------------------------
46.49 | 41.95 | 41.63 | 42.42 | 46.83
------------------------------
47.12 | 40.80 | 43.06 | 45.21 | 49.07



#  Comparison


In [None]:
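V_diff = V_pi_2 - V_pi_1

# Every reward in Scheme 2 equals the matching Scheme 1 reward plus 6.
constant_shift_r = 6
# Adding a constant c to every reward shifts the infinite-horizon
# discounted return by c * (1 + gamma + gamma^2 + ...) = c / (1 - gamma).
expected_uniform_increase = constant_shift_r / (1 - GAMMA)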

print(f"The constant reward difference is {constant_shift_r}. With γ = {GAMMA}")
print(f"The expected uniform increase in V(s) is {expected_uniform_increase:.2f}")


The constant reward difference is 6. With γ = 0.9
The expected uniform increase in V(s) is 60.00
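
The figure of 60.00 follows from the definition of the return. As a short derivation, assuming an infinite horizon and writing $c$ for the constant added to every reward (here $c = 6$) and $G'_t$ for the return under the shifted rewards:

$$
G'_t = \sum_{k=0}^{\infty} \gamma^k \left(R_{t+k+1} + c\right) = G_t + c \sum_{k=0}^{\infty} \gamma^k = G_t + \frac{c}{1 - \gamma}
$$

Taking expectations gives $V'(s) = V(s) + \frac{6}{1 - 0.9} = V(s) + 60$ for every state $s$. If a return is instead cut off with only $n$ steps remaining, the shift is $c\,\frac{1 - \gamma^{n}}{1 - \gamma}$, which stays below 60 when $n$ is small.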


In [None]:

V_diff_str = ""
for r in range(GRID_SIZE):
    row_str = ""
    for c in range(GRID_SIZE):
        row_str += f"{V_diff[r, c]:>+6.2f} | "
    V_diff_str += row_str.strip(' | ') + "\n"
    if r < GRID_SIZE - 1:
        V_diff_str += "------" * GRID_SIZE + "\n"
print(V_diff_str)



+56.84 | +54.33 | +56.79 | +54.15 | +56.44
------------------------------
+53.04 | +51.07 | +50.54 | +50.29 | +52.23
------------------------------
+49.35 | +46.14 | +44.75 | +43.34 | +48.23
------------------------------
+46.69 | +42.33 | +41.95 | +42.73 | +47.07
------------------------------
+47.64 | +41.67 | +43.69 | +45.72 | +49.51
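
The estimated differences above range from about +41.7 to +56.8 rather than the expected +60. One possible contributor is that the returns recorded by `monte_carlo` are cut off at the end of each 1000-step episode. As a cross-check, the value function of the equiprobable random policy can also be obtained exactly by solving the linear Bellman system $(I - \gamma P)\,v = r$. The sketch below is an illustration under that assumption; `exact_values` and its parameter names are hypothetical and not part of the notebook.

```python
import numpy as np

N, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # NORTH, SOUTH, EAST, WEST


def exact_values(off_grid, a_reward, b_reward, normal, gamma=GAMMA):
    """Solve (I - gamma * P) v = r for the equiprobable random policy."""
    P = np.zeros((N * N, N * N))  # state-to-state transition probabilities
    r = np.zeros(N * N)           # expected one-step reward per state
    for row in range(N):
        for col in range(N):
            s = row * N + col
            for dr, dc in MOVES:
                if (row, col) == A:
                    nxt, rew = A_PRIME, a_reward
                elif (row, col) == B:
                    nxt, rew = B_PRIME, b_reward
                elif 0 <= row + dr < N and 0 <= col + dc < N:
                    nxt, rew = (row + dr, col + dc), normal
                else:
                    nxt, rew = (row, col), off_grid
                # Each of the four actions is taken with probability 0.25.
                P[s, nxt[0] * N + nxt[1]] += 0.25
                r[s] += 0.25 * rew
    v = np.linalg.solve(np.eye(N * N) - gamma * P, r)
    return v.reshape(N, N)


V1 = exact_values(-1, 10, 5, 0)   # Reward Scheme 1
V2 = exact_values(5, 16, 11, 6)   # Reward Scheme 2
print(np.round(V1, 1))
print(np.round(V2 - V1, 2))       # each entry is 60 up to floating-point error
```

Because each row of `P` sums to 1, adding 6 to every reward adds exactly 6 / (1 − 0.9) = 60 to every entry of the solution, so any gap between this result and the Monte Carlo table reflects estimation effects rather than the reward schemes themselves.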

