<center>

# **22AIE401 - Reinforcement Learning**  
# **Lab 4**  

</center>

### Team Members:
- Guruprasath M R - AIE22015  
- Rudraksh Mohanty - AIE22046  
- Shree Prasad M - AIE22050  
- Tharun Kaarthik G K - AIE22062  

---

### Objective:
Design and implement a Monte Carlo-based agent that learns policies for minimizing the time to reach dynamic, weighted emergency locations in a probabilistic, time-varying urban environment.


---

### Problem Statement:
A taxi operates in a grid-based city (5x5). The driver needs to:
 - Pick up passengers from random locations.
 - Drop them at requested destinations.
 - Decide which direction to move in each state to maximize reward (successful trips).
 - Learn this policy without a known model (i.e., using Monte Carlo control).
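
Monte Carlo control estimates action values from the discounted returns of complete episodes. The following is a minimal sketch of the first-visit return computation, assuming an episode is stored as a list of `(state, action, reward)` tuples. The function name `first_visit_returns` and the toy episode are illustrative and are not part of the lab code.

```python
def first_visit_returns(episode, gamma):
    """Map each (state, action) pair to the discounted return that follows
    its first occurrence in `episode`, a list of (state, action, reward)."""
    # Remember the earliest time step at which each pair appears
    first_index = {}
    for t, (s, a, _) in enumerate(episode):
        first_index.setdefault((s, a), t)

    returns = {}
    G = 0.0
    # Walk backwards so that G is the return from step t onward
    for t in range(len(episode) - 1, -1, -1):
        s, a, r = episode[t]
        G = gamma * G + r
        if first_index[(s, a)] == t:
            returns[(s, a)] = G
    return returns


# Toy episode: the pair ("A", 0) is visited twice
episode = [("A", 0, -1), ("B", 1, -1), ("A", 0, -1), ("C", 0, 10)]
print(first_visit_returns(episode, gamma=0.5))
# {('C', 0): 10.0, ('B', 1): 1.0, ('A', 0): -0.5}
```

The pair `("A", 0)` receives the return from step 0 (-0.5), not the return from its later visit at step 2 (4.0).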


---


## Original Code

In [1]:
%pip install pymdptoolbox


In [2]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
import random

# ---------- Environment Setup ----------
GRID_SIZE = 6
ACTIONS = ['up', 'down', 'left', 'right']
ACTION_MAP = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
MAX_STEPS = 50
DISCOUNT = 0.95
EPSILON = 0.1
EPISODES = 10000

# Static obstacles (permanent roadblocks)
static_obstacles = [(1, 3), (3, 2)]
hospital = (0, 0)  # Ambulance dispatch center

# Rush hour control
def is_rush_hour(ep):
    return ep % 1000 < 300 or ep % 1000 > 800  # Congested traffic windows

# Emergency severity and urgency
emergency_types = {
    'minor': 20,
    'moderate': 35,
    'critical': 50
}

# Helper functions
def is_valid(state):
    x, y = state
    return 0 <= x < GRID_SIZE and 0 <= y < GRID_SIZE

def get_dynamic_obstacles():
    return [(2, 4), (4, 1), (3, 3)] if random.random() < 0.3 else []

def epsilon_greedy(state, Q):
    if np.random.rand() < EPSILON or state not in Q:
        return random.randint(0, len(ACTIONS) - 1)
    else:
        return np.argmax(Q[state])

def generate_emergency():
    location = random.choice([
        (i, j) for i in range(GRID_SIZE) for j in range(GRID_SIZE)
        if (i, j) != hospital and (i, j) not in static_obstacles
    ])
    severity = random.choice(list(emergency_types.keys()))
    reward = emergency_types[severity]
    return location, reward

# ---------- Monte Carlo Training ----------
Q = defaultdict(lambda: np.zeros(len(ACTIONS)))
Returns = defaultdict(list)

def run_episode(episode_num):
    rush = is_rush_hour(episode_num)
    prob_blocks = get_dynamic_obstacles()
    all_obstacles = static_obstacles + prob_blocks
    goal, goal_reward = generate_emergency()
    state = hospital
    episode = []
    steps = 0

    while steps < MAX_STEPS:
        action_idx = epsilon_greedy(state, Q)
        dx, dy = ACTION_MAP[ACTIONS[action_idx]]
        next_state = (state[0] + dx, state[1] + dy)

        if not is_valid(next_state) or next_state in all_obstacles:
            reward = -10 if rush else -5
            next_state = state
        elif next_state == goal:
            reward = goal_reward - steps
        else:
            reward = -2 if rush else -1

        episode.append((state, action_idx, reward))

        if next_state == goal:
            break

        state = next_state
        steps += 1

    return episode

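for ep in range(EPISODES):
    episode = run_episode(ep)
    # Index of the first occurrence of each (state, action) pair in this episode
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)
    G = 0
    # Walk backwards so G accumulates the discounted return from step t onward
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = DISCOUNT * G + r
        # First-visit MC: record the return only at the earliest visit
        if first_visit[(s, a)] == t:
            Returns[(s, a)].append(G)
            Q[s][a] = np.mean(Returns[(s, a)])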

print("🚑 Training Complete: Smart Ambulance Dispatch Policy Learned.")

# ---------- Policy Visualization ----------
policy = np.full((GRID_SIZE, GRID_SIZE), '.', dtype=str)
for i in range(GRID_SIZE):
    for j in range(GRID_SIZE):
        state = (i, j)
        if state in static_obstacles:
            policy[i][j] = 'S'
        elif state in Q:
            best_action = np.argmax(Q[state])
            policy[i][j] = ['↑', '↓', '←', '→'][best_action]
        else:
            policy[i][j] = ' '

print("\n📍 Learned Ambulance Dispatch Policy Grid:")
for row in policy:
    print(' '.join(row))


🚑 Training Complete: Smart Ambulance Dispatch Policy Learned.

📍 Learned Ambulance Dispatch Policy Grid:
→ → ← ← ← ↑
↓ ↓ ↓ S ↓ →
↑ ↑ ← ← ↓ →
→ ← S ↓ → ←
→ ↑ → ↑ ← ←
→ ↓ ↑ ↑ ← ↑
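
The emergency location is redrawn every episode while `Q` is keyed on the ambulance position alone, so each arrow in the grid above reflects returns averaged over all goals. One possible refinement, sketched here under the assumption that the emergency location is observable to the agent, is to key the table on the pair `(position, goal)`. The names `Q_goal` and `greedy_action` are hypothetical and do not appear in the lab code.

```python
from collections import defaultdict

import numpy as np

N_ACTIONS = 4  # up, down, left, right, in the notebook's order

# Hypothetical goal-conditioned table: one row of action values
# for every (ambulance position, emergency location) pair
Q_goal = defaultdict(lambda: np.zeros(N_ACTIONS))


def greedy_action(position, goal):
    """Return the index of the best known action for this position and goal."""
    return int(np.argmax(Q_goal[(position, goal)]))


# Hand-set values for illustration: the same cell can now prefer
# different directions for different emergencies
Q_goal[((0, 0), (0, 3))][3] = 1.0  # emergency to the right
Q_goal[((0, 0), (3, 0))][1] = 1.0  # emergency below

print(greedy_action((0, 0), (0, 3)))  # 3
print(greedy_action((0, 0), (3, 0)))  # 1
```

The cost of this design is a larger table, since every position is learned separately for each goal.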


## Task 1

Each hospital has a dynamic load (e.g., occupied beds). Ambulances should choose hospitals based not just on proximity but also on expected availability. Model hospital queues and incorporate delayed rewards based on treatment delay penalties. Train agents to learn which hospital is better, not just closer.
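
One way to make the hospital choice itself something the agent learns is to keep a running mean reward for each hospital in each situation. The sketch below assumes a simplified drop-off reward (negative Manhattan distance, minus a penalty when no bed is free) and keys the choice on the patient position plus which hospitals are full. The hospital coordinates, capacity, and penalty of 20 match the values in the Task 1 cell; the helper names are hypothetical.

```python
from collections import defaultdict

HOSPITALS = [(0, 0), (5, 5), (0, 5)]  # coordinates used in the notebook
MAX_BEDS = 5                           # capacity used in the notebook
FULL_PENALTY = 20                      # delay penalty used in the Task 1 reward


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dropoff_reward(patient_pos, hospital, occupied):
    """Assumed reward shape: travel cost plus a penalty when no bed is free."""
    reward = -manhattan(patient_pos, hospital)
    if occupied >= MAX_BEDS:
        reward -= FULL_PENALTY
    return reward


# value[key][i] is the running mean reward of choosing hospital i in situation key.
# Untried hospitals start at 0.0, which is optimistic because rewards are negative.
value = defaultdict(lambda: [0.0] * len(HOSPITALS))
count = defaultdict(lambda: [0] * len(HOSPITALS))


def situation(patient_pos, queues):
    """Key the choice on patient position and which hospitals are full."""
    return patient_pos, tuple(queues[h] >= MAX_BEDS for h in HOSPITALS)


def record(situation_key, i, reward):
    """Incremental sample-mean update for hospital i."""
    count[situation_key][i] += 1
    value[situation_key][i] += (reward - value[situation_key][i]) / count[situation_key][i]


def best_hospital(situation_key):
    return max(range(len(HOSPITALS)), key=lambda i: value[situation_key][i])


# The nearest hospital (0, 0) is full, so a farther one should win
queues = {(0, 0): 5, (5, 5): 1, (0, 5): 2}
key = situation((1, 1), queues)
for i, h in enumerate(HOSPITALS):
    record(key, i, dropoff_reward((1, 1), h, queues[h]))
print(HOSPITALS[best_hospital(key)])  # (0, 5)
```

In a training loop, `record` would be called once per episode with the observed reward, and `best_hospital` would be combined with epsilon-greedy exploration.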




In [3]:
# Task 1: Multiple Hospitals with Dynamic Queues
GRID_SIZE = 6
ACTIONS = ['up', 'down', 'left', 'right']
ACTION_MAP = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
MAX_STEPS = 50
DISCOUNT = 0.95
EPSILON = 0.1
EPISODES = 5000

# Multiple hospitals
hospitals = [(0, 0), (5, 5), (0, 5)]
hospital_queues = {h: 0 for h in hospitals}  # beds occupied
max_beds = {h: 5 for h in hospitals}

def update_hospital_queues():
    for h in hospitals:
        # Randomly discharge patients
        if hospital_queues[h] > 0 and random.random() < 0.3:
            hospital_queues[h] -= 1
        # Randomly admit new patients
        if hospital_queues[h] < max_beds[h] and random.random() < 0.2:
            hospital_queues[h] += 1

def choose_hospital():
    # Prefer hospitals with available beds
    available = [h for h in hospitals if hospital_queues[h] < max_beds[h]]
    if available:
        return random.choice(available)
    return random.choice(hospitals)

def run_episode_task1(ep):
    update_hospital_queues()
    goal, goal_reward = generate_emergency()
    chosen_hospital = choose_hospital()
    state = chosen_hospital
    episode = []
    steps = 0
    while steps < MAX_STEPS:
        action_idx = epsilon_greedy(state, Q)
        dx, dy = ACTION_MAP[ACTIONS[action_idx]]
        next_state = (state[0] + dx, state[1] + dy)
        if not is_valid(next_state):
            reward = -5
            next_state = state
        elif next_state == goal:
            # Penalty if hospital queue is full
            if hospital_queues[chosen_hospital] >= max_beds[chosen_hospital]:
                reward = goal_reward - steps - 20  # delayed treatment penalty
            else:
                reward = goal_reward - steps
        else:
            reward = -1
        episode.append((state, action_idx, reward))
        if next_state == goal:
            break
        state = next_state
        steps += 1
    return episode

# Train agent for Task 1
Q = defaultdict(lambda: np.zeros(len(ACTIONS)))
Returns = defaultdict(list)
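for ep in range(EPISODES):
    episode = run_episode_task1(ep)
    # Index of the first occurrence of each (state, action) pair in this episode
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)
    G = 0
    # Walk backwards so G accumulates the discounted return from step t onward
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = DISCOUNT * G + r
        # First-visit MC: record the return only at the earliest visit
        if first_visit[(s, a)] == t:
            Returns[(s, a)].append(G)
            Q[s][a] = np.mean(Returns[(s, a)])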
print("Task 1 Training Complete.")

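# Display a sample hospital selection and the learned policy grid.
# Note: choose_hospital() picks at random among hospitals with free beds;
# the selection itself is not learned from Q.
chosen_hospital = choose_hospital()
print(f"Learned hospital selection (sample): {chosen_hospital}")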
policy = np.full((GRID_SIZE, GRID_SIZE), '.', dtype=str)
for i in range(GRID_SIZE):
    for j in range(GRID_SIZE):
        state = (i, j)
        if state in hospitals:
            policy[i][j] = 'H'
        elif state in Q:
            best_action = np.argmax(Q[state])
            policy[i][j] = ['↑', '↓', '←', '→'][best_action]
        else:
            policy[i][j] = ' '
print("\nTask 1 Policy Grid:")
for row in policy:
    print(' '.join(row))

Task 1 Training Complete.
Learned hospital selection (sample): (0, 5)

Task 1 Policy Grid:
H → ↓ ← ← H
→ ← ↓ ↓ ↓ ←
→ ↓ → ← ← ↓
↑ ↑ ↓ ↓ → ←
← → ↑ ↑ ← ↑
→ ↑ ↑ ↑ ↓ H


## Task 2

Update your environment so that traffic jams or roadblocks may appear after the episode has started. Modify your simulation to invalidate paths mid-way and require real-time policy adaptation using Monte Carlo rollouts. The agent should reroute to avoid costly delays. 
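
Rerouting with Monte Carlo rollouts can be sketched as follows: at each step, every action is scored by simulating a number of random rollouts against the obstacle set known at that moment, and the action with the best average return is taken. This is a sketch under simplifying assumptions (a fixed goal reward with no step deduction, a uniformly random rollout policy, and hypothetical defaults for `n_rollouts` and `horizon`); none of these function names appear in the lab code.

```python
import random

GRID_SIZE = 6
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GAMMA = 0.95


def step(state, action, goal, obstacles, goal_reward):
    """One transition: returns (next_state, reward, done)."""
    nxt = (state[0] + MOVES[action][0], state[1] + MOVES[action][1])
    in_grid = 0 <= nxt[0] < GRID_SIZE and 0 <= nxt[1] < GRID_SIZE
    if not in_grid or nxt in obstacles:
        return state, -10, False  # blocked: stay in place and pay the penalty
    if nxt == goal:
        return nxt, goal_reward, True
    return nxt, -1, False


def rollout_return(state, goal, obstacles, goal_reward, horizon, rng):
    """Discounted return of one random rollout of at most `horizon` steps."""
    G, discount = 0.0, 1.0
    for _ in range(horizon):
        state, reward, done = step(state, rng.randrange(4), goal, obstacles, goal_reward)
        G += discount * reward
        discount *= GAMMA
        if done:
            break
    return G


def rollout_action(state, goal, obstacles, goal_reward=50,
                   n_rollouts=30, horizon=15, rng=None):
    """Pick the action whose simulated rollouts give the best mean return."""
    rng = rng or random.Random()
    best_action, best_value = 0, float("-inf")
    for action in range(4):
        total = 0.0
        for _ in range(n_rollouts):
            nxt, reward, done = step(state, action, goal, obstacles, goal_reward)
            tail = 0.0 if done else rollout_return(
                nxt, goal, obstacles, goal_reward, horizon, rng)
            total += reward + GAMMA * tail
        value = total / n_rollouts
        if value > best_value:
            best_action, best_value = action, value
    return best_action


# The goal is directly to the right, so 'right' (index 3) is selected
print(rollout_action((0, 0), (0, 1), obstacles=set(), rng=random.Random(0)))
```

Because `obstacles` is passed in on every call, a roadblock that appears mid-episode changes the rollout estimates on the very next step, without retraining `Q`.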



In [4]:
# Task 2: Dynamic Obstacles Mid-Episode
def get_dynamic_obstacles_midway(step):
    # Obstacles appear after step 10
    if step > 10 and random.random() < 0.2:
        return [(random.randint(0, GRID_SIZE-1), random.randint(0, GRID_SIZE-1))]
    return []

def run_episode_task2(ep):
    state = (0, 0)
    goal, goal_reward = generate_emergency()
    episode = []
    steps = 0
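    obstacles = static_obstacles.copy()
    while steps < MAX_STEPS:
        # New roadblocks may appear mid-episode; never place one on the
        # ambulance's current cell or on the emergency location itself
        obstacles += [o for o in get_dynamic_obstacles_midway(steps)
                      if o != state and o != goal]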
        action_idx = epsilon_greedy(state, Q)
        dx, dy = ACTION_MAP[ACTIONS[action_idx]]
        next_state = (state[0] + dx, state[1] + dy)
        if not is_valid(next_state) or next_state in obstacles:
            reward = -10
            next_state = state
        elif next_state == goal:
            reward = goal_reward - steps
        else:
            reward = -1
        episode.append((state, action_idx, reward))
        if next_state == goal:
            break
        state = next_state
        steps += 1
    return episode

# Train agent for Task 2
Q = defaultdict(lambda: np.zeros(len(ACTIONS)))
Returns = defaultdict(list)
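for ep in range(EPISODES):
    episode = run_episode_task2(ep)
    # Index of the first occurrence of each (state, action) pair in this episode
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)
    G = 0
    # Walk backwards so G accumulates the discounted return from step t onward
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = DISCOUNT * G + r
        # First-visit MC: record the return only at the earliest visit
        if first_visit[(s, a)] == t:
            Returns[(s, a)].append(G)
            Q[s][a] = np.mean(Returns[(s, a)])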
print("Task 2 Training Complete.")

# Display sample episode with dynamic obstacles
sample_ep = run_episode_task2(0)
print("\nTask 2 Sample Episode (state, action, reward):")
for step in sample_ep:
    print(step)

Task 2 Training Complete.

Task 2 Sample Episode (state, action, reward):
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 2, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 1, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 3, -1)
((1, 1), 2, -1)
((1, 0), 0, -1)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)
((0, 0), 0, -10)


## Task 3
Model hospital occupancy as a time-varying parameter. An ambulance should decide not only the quickest path to an emergency but also the least crowded hospital for drop-off. Implement delayed penalties when the selected hospital has no immediate bed availability. Let the policy adapt over multiple episodes. 


In [5]:
# Task 3: Time-Varying Hospital Occupancy
def update_hospital_occupancy():
    for h in hospitals:
        # Simulate time-varying occupancy
        hospital_queues[h] = min(max_beds[h], max(0, hospital_queues[h] + random.choice([-1, 0, 1])))

def run_episode_task3(ep):
    update_hospital_occupancy()
    goal, goal_reward = generate_emergency()
    # Choose least crowded hospital
    chosen_hospital = min(hospitals, key=lambda h: hospital_queues[h])
    state = chosen_hospital
    episode = []
    steps = 0
    while steps < MAX_STEPS:
        action_idx = epsilon_greedy(state, Q)
        dx, dy = ACTION_MAP[ACTIONS[action_idx]]
        next_state = (state[0] + dx, state[1] + dy)
        if not is_valid(next_state):
            reward = -5
            next_state = state
        elif next_state == goal:
            # Penalty if hospital is full
            if hospital_queues[chosen_hospital] >= max_beds[chosen_hospital]:
                reward = goal_reward - steps - 30
            else:
                reward = goal_reward - steps
        else:
            reward = -1
        episode.append((state, action_idx, reward))
        if next_state == goal:
            break
        state = next_state
        steps += 1
    return episode

# Train agent for Task 3
Q = defaultdict(lambda: np.zeros(len(ACTIONS)))
Returns = defaultdict(list)
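for ep in range(EPISODES):
    episode = run_episode_task3(ep)
    # Index of the first occurrence of each (state, action) pair in this episode
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)
    G = 0
    # Walk backwards so G accumulates the discounted return from step t onward
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = DISCOUNT * G + r
        # First-visit MC: record the return only at the earliest visit
        if first_visit[(s, a)] == t:
            Returns[(s, a)].append(G)
            Q[s][a] = np.mean(Returns[(s, a)])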
print("Task 3 Training Complete.")

# Display hospital occupancy and learned policy
print("\nTask 3 Hospital Occupancy:")
for h in hospitals:
    print(f"Hospital {h}: {hospital_queues[h]} beds occupied")
policy = np.full((GRID_SIZE, GRID_SIZE), '.', dtype=str)
for i in range(GRID_SIZE):
    for j in range(GRID_SIZE):
        state = (i, j)
        if state in hospitals:
            policy[i][j] = 'H'
        elif state in Q:
            best_action = np.argmax(Q[state])
            policy[i][j] = ['↑', '↓', '←', '→'][best_action]
        else:
            policy[i][j] = ' '
print("\nTask 3 Policy Grid:")
for row in policy:
    print(' '.join(row))

Task 3 Training Complete.

Task 3 Hospital Occupancy:
Hospital (0, 0): 0 beds occupied
Hospital (5, 5): 5 beds occupied
Hospital (0, 5): 5 beds occupied

Task 3 Policy Grid:
H → → ← → H
→ ↓ ↓ ↓ → ←
↑ ↑ ↑ → ← ←
→ → ← → ← ↓
→ → → ↑ ← ←
← ↓ ↑ ↑ ↑ H
