<center>

# **22AIE401 - Reinforcement Learning**  
# **Lab 4**  

</center>

### Team Members:
- Guruprasath M R - AIE22015  
- Rudraksh Mohanty - AIE22046  
- Shree Prasad M - AIE22050  
- Tharun Kaarthick G K - AIE22062  

---

### Objective:
Design and implement a Monte Carlo-based learning agent that learns an optimal policy for minimizing the time to reach dynamic, weighted emergency locations in a probabilistic, time-varying urban environment.


---

### Problem Statement:
A taxi operates in a grid-based city (5x5). The driver needs to:
 - Pick up passengers from random locations.
 - Drop them at requested destinations.
 - Decide which direction to move in each state to maximize reward (successful trips).
 - Learn this policy without a known model (i.e., using Monte Carlo control).


---

### Common Interpretation after completing tasks:
To be filled

## Original Code

In [9]:
%pip install pymdptoolbox

Note: you may need to restart the kernel to use updated packages.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
import random

# ---------- Environment Setup ----------
GRID_SIZE = 6
ACTIONS = ['up', 'down', 'left', 'right']
ACTION_MAP = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
MAX_STEPS = 50
DISCOUNT = 0.95
EPSILON = 0.1
EPISODES = 10000

# Static obstacles (permanent roadblocks)
static_obstacles = [(1, 3), (3, 2)]
hospital = (0, 0)  # Ambulance dispatch center

# Rush hour control
def is_rush_hour(ep):
    return ep % 1000 < 300 or ep % 1000 > 800  # Congested traffic windows

# Emergency severity and urgency
emergency_types = {
    'minor': 20,
    'moderate': 35,
    'critical': 50
}

# Helper functions
def is_valid(state):
    x, y = state
    return 0 <= x < GRID_SIZE and 0 <= y < GRID_SIZE

def get_dynamic_obstacles():
    return [(2, 4), (4, 1), (3, 3)] if random.random() < 0.3 else []

def epsilon_greedy(state, Q):
    if np.random.rand() < EPSILON or state not in Q:
        return random.randint(0, len(ACTIONS) - 1)
    else:
        return np.argmax(Q[state])

def generate_emergency():
    location = random.choice([
        (i, j) for i in range(GRID_SIZE) for j in range(GRID_SIZE)
        if (i, j) != hospital and (i, j) not in static_obstacles
    ])
    severity = random.choice(list(emergency_types.keys()))
    reward = emergency_types[severity]
    return location, reward

# ---------- Monte Carlo Training ----------
Q = defaultdict(lambda: np.zeros(len(ACTIONS)))
Returns = defaultdict(list)

def run_episode(episode_num):
    rush = is_rush_hour(episode_num)
    prob_blocks = get_dynamic_obstacles()
    all_obstacles = static_obstacles + prob_blocks
    goal, goal_reward = generate_emergency()
    state = hospital
    episode = []
    steps = 0

    while steps < MAX_STEPS:
        action_idx = epsilon_greedy(state, Q)
        dx, dy = ACTION_MAP[ACTIONS[action_idx]]
        next_state = (state[0] + dx, state[1] + dy)

        if not is_valid(next_state) or next_state in all_obstacles:
            reward = -10 if rush else -5
            next_state = state
        elif next_state == goal:
            reward = goal_reward - steps
        else:
            reward = -2 if rush else -1

        episode.append((state, action_idx, reward))

        if next_state == goal:
            break

        state = next_state
        steps += 1

    return episode

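for ep in range(EPISODES):
    episode = run_episode(ep)
    G = 0  # discounted return, accumulated from the end of the episode
    visited = set()
    # Walk the episode backward so G can be built incrementally.
    # Because of the backward order, a repeated (s, a) pair is recorded at its
    # last occurrence in the episode, not its first.
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = DISCOUNT * G + r
        if (s, a) not in visited:
            Returns[(s, a)].append(G)
            Q[s][a] = np.mean(Returns[(s, a)])  # sample-average estimate of Q(s, a)
            visited.add((s, a))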

print("🚑 Training Complete: Smart Ambulance Dispatch Policy Learned.")

# ---------- Policy Visualization ----------
policy = np.full((GRID_SIZE, GRID_SIZE), '.', dtype=str)
for i in range(GRID_SIZE):
    for j in range(GRID_SIZE):
        state = (i, j)
        if state in static_obstacles:
            policy[i][j] = 'S'
        elif state in Q:
            best_action = np.argmax(Q[state])
            policy[i][j] = ['↑', '↓', '←', '→'][best_action]
        else:
            policy[i][j] = ' '

print("\n📍 Learned Ambulance Dispatch Policy Grid:")
for row in policy:
    print(' '.join(row))


🚑 Training Complete: Smart Ambulance Dispatch Policy Learned.

📍 Learned Ambulance Dispatch Policy Grid:
→ ↓ ↓ ← ↓ ↑
→ → ← S ↓ ↓
→ ↑ ↑ → ← →
← ↑ S ↓ ↑ ↑
→ ← ↑ ↑ ↑ ←
← ↑ ← ↓ ← ↓
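
The update loop in the original code walks each episode backward and skips pairs that are already in `visited`, so for a repeated `(state, action)` pair it stores the return from that pair's last occurrence in the episode. If the textbook first-visit variant is wanted instead, one possible sketch is shown below. The helper name `first_visit_returns` is hypothetical and not part of the lab code; it only assumes the same `(state, action_idx, reward)` episode format.

```python
def first_visit_returns(episode, discount):
    """Map each (state, action) pair to the return that follows its FIRST occurrence.

    `episode` is a list of (state, action_idx, reward) tuples, as in run_episode.
    """
    # Record the time step at which each pair first appears.
    first_index = {}
    for t, (s, a, _) in enumerate(episode):
        first_index.setdefault((s, a), t)

    G = 0.0
    returns = {}
    # Accumulate the discounted return backward, keeping it only at first visits.
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = discount * G + r
        if first_index[(s, a)] == t:
            returns[(s, a)] = G
    return returns


# Toy episode in which ((0, 0), 3) occurs twice.
toy_episode = [((0, 0), 3, -1), ((0, 1), 3, -1), ((0, 0), 3, -1), ((0, 1), 1, 10)]
print(first_visit_returns(toy_episode, discount=1.0))
```

Each value returned here could be appended to `Returns[(s, a)]` in place of the `visited` check, leaving the rest of the training loop unchanged.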


## Task 1

Each hospital has a dynamic load (e.g., occupied beds). Ambulances should choose hospitals based not just on proximity but also on expected availability. Model hospital queues and incorporate delayed rewards based on treatment delay penalties. Train agents to learn which hospital is better, not just closer.
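
A minimal sketch of how a hospital queue and its delay penalty might be modeled is given below. Everything in it is an assumption made for illustration: the `Hospital` class, the `steps_per_patient` waiting model, and the reward constants are hypothetical and do not come from the lab code. Because the penalty is charged at drop-off, the Monte Carlo return propagates it back to the earlier routing decisions.

```python
from dataclasses import dataclass


@dataclass
class Hospital:
    location: tuple  # (row, col) on the grid
    capacity: int    # total beds (assumed value)
    occupied: int    # beds or queue slots currently in use

    def expected_wait(self, steps_per_patient=5):
        """Waiting time in steps for a patient arriving now (0 if a bed is free)."""
        overflow = self.occupied - self.capacity + 1
        return max(0, overflow) * steps_per_patient


def manhattan(a, b):
    """Grid distance, used here as a stand-in for travel time."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def drop_off_reward(hospital, travel_steps, base_reward=50, delay_penalty=2):
    """Terminal reward: base reward minus travel time and queueing delay."""
    return base_reward - travel_steps - delay_penalty * hospital.expected_wait()


# A nearby but overloaded hospital versus a distant one with free beds.
near = Hospital(location=(0, 0), capacity=3, occupied=5)
far = Hospital(location=(5, 5), capacity=4, occupied=1)
patient = (1, 1)
print(drop_off_reward(near, manhattan(patient, near.location)))  # 18
print(drop_off_reward(far, manhattan(patient, far.location)))    # 42
```

With these assumed numbers the farther hospital yields the higher reward, which is the kind of trade-off the agent is expected to learn.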




## Task 2

Update your environment so that traffic jams or roadblocks may appear after the episode has started. Modify your simulation to invalidate paths mid-way and require real-time policy adaptation using Monte Carlo rollouts. The agent should reroute to avoid costly delays. 
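
One way to approach the rerouting requirement is sketched below: at each step the agent evaluates every move by simulating short rollouts against the roadblocks that are currently known, then takes the move with the best average return. This is only a sketch under stated assumptions. The helpers `step`, `rollout_return`, and `rollout_action` are hypothetical, the rollout policy is uniformly random rather than the learned epsilon-greedy policy, and the reward values are illustrative.

```python
import random

GRID_SIZE = 6
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, move, goal, blocked):
    """Apply one move; hitting a wall or roadblock is penalized and the state is kept."""
    nxt = (state[0] + move[0], state[1] + move[1])
    inside = 0 <= nxt[0] < GRID_SIZE and 0 <= nxt[1] < GRID_SIZE
    if not inside or nxt in blocked:
        return state, -5, False
    if nxt == goal:
        return nxt, 50, True
    return nxt, -1, False


def rollout_return(state, first_move, goal, blocked, rng, horizon=30, discount=0.95):
    """Discounted return of taking first_move, then moving uniformly at random."""
    total, scale, move = 0.0, 1.0, first_move
    for _ in range(horizon):
        state, reward, done = step(state, move, goal, blocked)
        total += scale * reward
        if done:
            break
        scale *= discount
        move = rng.choice(MOVES)
    return total


def rollout_action(state, goal, blocked, rng, n_rollouts=200):
    """Index of the move with the best average rollout return under current roadblocks."""
    averages = []
    for move in MOVES:
        returns = [rollout_return(state, move, goal, blocked, rng) for _ in range(n_rollouts)]
        averages.append(sum(returns) / n_rollouts)
    return max(range(len(MOVES)), key=averages.__getitem__)


# A roadblock appears after the episode has started; the next decision uses the new set.
rng = random.Random(0)
blocked = {(1, 3), (3, 2)}
blocked.add((0, 1))  # new jam discovered mid-episode
print(rollout_action((0, 0), goal=(0, 2), blocked=blocked, rng=rng))
```

Calling `rollout_action` again whenever `blocked` changes lets the agent re-plan from its current position without retraining the whole Q-table.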



## Task 3
Model hospital occupancy as a time-varying parameter. An ambulance should decide not only the quickest path to an emergency but also the least crowded hospital for drop-off. Implement delayed penalties when the selected hospital has no immediate bed availability. Let the policy adapt over multiple episodes. 
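
The sketch below illustrates one possible way to make occupancy time-varying and visible to the policy. The admit/discharge probabilities, the three-level load bucket, and the helper names are all assumptions made for illustration and are not part of the lab code. The augmented state is a plain tuple, so it can be used directly as a key in a `defaultdict`-based Q-table.

```python
import random


def update_occupancy(occupied, capacity, rng, p_admit=0.3, p_discharge=0.2):
    """Advance occupancy by one time step (assumed random admit/discharge dynamics)."""
    if rng.random() < p_admit:
        occupied += 1
    if occupied > 0 and rng.random() < p_discharge:
        occupied -= 1
    return min(occupied, capacity)


def load_bucket(occupied, capacity):
    """Coarse load level to keep the state space small: 0 = low, 1 = busy, 2 = full."""
    if occupied >= capacity:
        return 2
    return 1 if occupied >= capacity / 2 else 0


def make_state(position, hospitals):
    """Augment the grid position with each hospital's current load bucket."""
    return (position,) + tuple(load_bucket(h["occupied"], h["capacity"]) for h in hospitals)


def drop_off_penalty(occupied, capacity, wait_cost=10):
    """Delayed penalty charged at drop-off when no bed is immediately available."""
    return -wait_cost if occupied >= capacity else 0


rng = random.Random(42)
hospitals = [{"occupied": 4, "capacity": 4}, {"occupied": 1, "capacity": 6}]
print(make_state((1, 2), hospitals))  # ((1, 2), 2, 0)

# Occupancy drifts between steps, so the same position can map to different states.
for h in hospitals:
    h["occupied"] = update_occupancy(h["occupied"], h["capacity"], rng)
print(make_state((1, 2), hospitals))
```

Since the penalty only appears at drop-off, averaging returns over many episodes is what allows the policy to associate a crowded hospital with a lower value earlier in the route.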
