# Lab 03 – Markov Decision Processes Starter Notebook

## Overview
Formalize sequential decision problems as Markov decision processes (MDPs). Students will define states, actions, rewards, and transitions, then validate their models with simple simulations.

## Objectives
- Translate real-world scenarios into MDP components.
- Compute expected returns analytically for small state spaces.
- Simulate trajectories to confirm MDP reasoning.
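As a worked example of computing a return by hand, assume a discount factor of γ = 0.9 (the lab does not fix one). An episode with rewards -1, -1, +100 then has return -1 + 0.9·(-1) + 0.9²·100 = 79.1. The helper below is a hypothetical sketch that reproduces this with the backward recursion G_t = r_{t+1} + γ·G_{t+1}.

```python
def discounted_return(rewards, gamma=0.9):
    """Return G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical 3-step episode: two ordinary moves (-1 each), then reaching the goal (+100),
# matching the reward scheme of the GridWorld skeleton in this notebook.
rewards = [-1, -1, 100]
print(round(discounted_return(rewards, gamma=0.9), 4))  # 79.1
```

Working backwards avoids computing powers of γ explicitly and is the same recursion the Bellman equation is built on.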

## Pre-Lab Review
- Study Section 1 of [`old content/RL_1.pdf`](../../old%20content/RL_1.pdf) to revisit MDP notation and Bellman equations.
- Review the worked problems in [`old content/RL Solved Example - Updated.pdf`](../../old%20content/RL%20Solved%20Example%20-%20Updated.pdf).
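For quick reference, here are the quantities behind the analytic exercises, written in one common notation. The symbols may differ from those used in `RL_1.pdf`. The return and state value under a policy $\pi$ are

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right],
$$

and the Bellman expectation equation for a fixed policy is

$$
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi}(s') \right].
$$

Stacking this over all states gives a linear system:

$$
\mathbf{v}^{\pi} = \mathbf{r}^{\pi} + \gamma P^{\pi} \mathbf{v}^{\pi}
\;\Longrightarrow\;
\mathbf{v}^{\pi} = \left( I - \gamma P^{\pi} \right)^{-1} \mathbf{r}^{\pi},
$$

where $P^{\pi}_{ss'} = \sum_{a} \pi(a \mid s) P(s' \mid s, a)$ and $r^{\pi}_{s} = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) R(s, a, s')$. For $0 \le \gamma < 1$ the matrix $I - \gamma P^{\pi}$ is invertible, so a small MDP can be evaluated exactly with a single linear solve.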

## In-Lab Exercises
1. Brainstorm a gridworld or robot navigation problem and enumerate its state/action spaces.
2. Encode transition probabilities and rewards in Python structures.
3. Calculate state-value functions for a fixed policy analytically.
4. Simulate sample trajectories and compare the empirical (sampled) returns with the analytic values from step 3.
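Exercises 2 and 3 can be sketched on a deliberately tiny, hypothetical MDP. It is a 3-state corridor where state 2 is an absorbing goal. Each move succeeds with probability 0.8 (otherwise the agent stays put), each step costs -1, and entering the goal pays +10. These numbers, the `P[s][a]` layout of `(probability, next_state, reward)` triples, and γ = 0.9 are all illustrative assumptions. `evaluate_policy` builds the policy-averaged transition matrix and reward vector, then solves the linear Bellman system directly.

```python
import numpy as np

# Hypothetical 3-state corridor: 0 -> 1 -> 2, where state 2 is an absorbing goal.
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
TERMINAL = {2}


def build_transitions(p_success=0.8):
    """Return P[s][a] = list of (probability, next_state, reward) triples."""
    P = {}
    for s in STATES:
        P[s] = {}
        for a in ACTIONS:
            if s in TERMINAL:
                # Absorbing goal: stay forever with zero reward.
                P[s][a] = [(1.0, s, 0.0)]
                continue
            target = max(0, s - 1) if a == "left" else min(2, s + 1)
            # The move succeeds with p_success; otherwise the agent stays in s.
            outcomes = [(p_success, target), (1.0 - p_success, s)]
            P[s][a] = [(p, s2, 10.0 if s2 in TERMINAL else -1.0) for p, s2 in outcomes]
    return P


def evaluate_policy(P, policy, gamma=0.9):
    """Solve v = r_pi + gamma * P_pi v exactly for a fixed (possibly stochastic) policy."""
    n = len(STATES)
    P_pi = np.zeros((n, n))
    r_pi = np.zeros(n)
    for s in STATES:
        for a, pi_sa in policy[s].items():
            for p, s2, r in P[s][a]:
                P_pi[s, s2] += pi_sa * p
                r_pi[s] += pi_sa * p * r
    # (I - gamma * P_pi) is invertible for gamma < 1, so a direct solve is exact.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)


P = build_transitions()
always_right = {s: {"right": 1.0} for s in STATES}  # deterministic policy as a distribution
v = evaluate_policy(P, always_right, gamma=0.9)
print(np.round(v, 3))
```

For the always-right policy this gives roughly v(0) ≈ 7.13, v(1) ≈ 9.51 and v(2) = 0. You can confirm v(1) by hand from v(1) = 0.8·10 + 0.2·(-1 + 0.9·v(1)), which gives v(1) = 7.8 / 0.82. A direct solve is convenient for a handful of states; for larger state spaces, iterative backups avoid building the full matrix.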

## Deliverables
- MDP specification sheet (template provided by instructors).
- Python script or notebook that simulates a handful of trajectories and prints empirical returns.
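The simulation deliverable can be sketched with Monte Carlo rollouts. This example uses a hypothetical 3-state corridor: state 2 is an absorbing goal, moves succeed with probability 0.8 (otherwise the agent stays put), each step costs -1, entering the goal pays +10, and γ = 0.9. All of these are assumptions for illustration. Under them, the exact value of always moving right from state 0 is (-1 + 0.72·v(1)) / 0.82 with v(1) = 7.8 / 0.82, about 7.13. The empirical mean return should therefore settle near that value as the number of episodes grows.

```python
import numpy as np

# Hypothetical 3-state corridor (states 0, 1, 2); state 2 is an absorbing goal.
GAMMA = 0.9      # assumed discount factor
P_SUCCESS = 0.8  # assumed probability that a move succeeds (otherwise stay put)
GOAL = 2


def sample_step(state, action, rng):
    """Sample (next_state, reward): -1 per ordinary step, +10 on entering the goal."""
    target = max(0, state - 1) if action == "left" else min(GOAL, state + 1)
    next_state = target if rng.random() < P_SUCCESS else state
    reward = 10.0 if next_state == GOAL else -1.0
    return next_state, reward


def run_episode(start, policy, rng, max_steps=200):
    """Roll out one trajectory and return its discounted return from `start`."""
    state, g, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        state, reward = sample_step(state, policy(state), rng)
        g += discount * reward
        discount *= GAMMA
        if state == GOAL:  # the episode ends once the goal is reached
            break
    return g


def always_right(state):
    return "right"


rng = np.random.default_rng(seed=0)  # seeded so the printed numbers are reproducible
returns = [run_episode(0, always_right, rng) for _ in range(5000)]
print("first five returns:", np.round(returns[:5], 2))
print(f"empirical mean return from state 0: {np.mean(returns):.3f}")
```

Individual returns vary a lot (8.0 for the shortest two-step episode, lower for longer ones), which is why the comparison with the analytic value uses the mean over many episodes rather than a single trajectory.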

## Resources
- [`old content/Mindmap.jpg`](../../old%20content/Mindmap.jpg) for a high-level concept map linking MDP components.
- Any supplementary notes shared in class for MDP modeling patterns.

### MDP Skeleton
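Adapted from the legacy GridWorld exercise in `old content/ALL_WEEKS_V5 - Student.ipynb`. Use it as a starting point for modelling your own problem. Note that its transitions are deterministic, so any transition probabilities from exercise 2 must be added to `step`.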

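```python
import numpy as np


class GridWorld:
    """Deterministic grid MDP: every move costs -1 and reaching the goal pays +100."""

    def __init__(self, grid_size=(4, 4), start=(0, 0), goal=(3, 3), obstacles=None):
        self.grid_size = grid_size
        self.start = start
        self.goal = goal
        self.obstacles = set(obstacles or [])
        if start in self.obstacles or goal in self.obstacles:
            raise ValueError("start and goal must not be obstacle cells")
        self.actions = ['up', 'down', 'left', 'right']
        self.state = self.start

    def reset(self):
        """Return the agent to the start cell and return that state."""
        self.state = self.start
        return self.state

    def step(self, action):
        """Apply one action and return (next_state, reward, done)."""
        if action not in self.actions:
            raise ValueError(f"unknown action: {action!r}")
        row, col = self.state
        # Moves that would leave the grid are clipped, so the agent stays on the border.
        if action == 'up':
            row = max(0, row - 1)
        elif action == 'down':
            row = min(self.grid_size[0] - 1, row + 1)
        elif action == 'left':
            col = max(0, col - 1)
        elif action == 'right':
            col = min(self.grid_size[1] - 1, col + 1)

        candidate = (row, col)
        # Bumping into an obstacle leaves the agent where it was.
        if candidate in self.obstacles:
            candidate = self.state

        self.state = candidate
        done = self.state == self.goal
        reward = 100 if done else -1
        return self.state, reward, done

    def render(self):
        """Print the grid: -1 = obstacle, 10 = goal, 5 = agent, 0 = free cell."""
        grid = np.zeros(self.grid_size)
        for obs in self.obstacles:
            grid[obs] = -1
        grid[self.goal] = 10
        grid[self.state] = 5
        print(grid)


# Example usage: a short random rollout, seeded so the output is reproducible.
rng = np.random.default_rng(seed=0)
env = GridWorld(obstacles={(1, 1), (2, 1)})
env.reset()
for _ in range(5):
    action = str(rng.choice(env.actions))
    next_state, reward, done = env.step(action)
    print(action, next_state, reward, done)
    if done:
        break
```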
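To apply exercise 3 to the same 4x4 grid, the sketch below evaluates the equiprobable random policy (each action with probability 0.25) using iterative policy evaluation. This technique applies in-place Bellman expectation backups until the values stop changing. To run on its own, it reimplements the skeleton's movement rules as a standalone function `next_state`. That function is a hypothetical helper, not part of the original class. The sketch also treats the goal as terminal with value 0 and assumes γ = 0.9.

```python
import numpy as np

ROWS, COLS = 4, 4
GOAL = (3, 3)
OBSTACLES = {(1, 1), (2, 1)}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def next_state(state, action):
    """Deterministic move mirroring GridWorld.step: clip at walls, block on obstacles."""
    dr, dc = MOVES[action]
    row = min(max(state[0] + dr, 0), ROWS - 1)
    col = min(max(state[1] + dc, 0), COLS - 1)
    return state if (row, col) in OBSTACLES else (row, col)


def evaluate_uniform_policy(gamma=0.9, tol=1e-8):
    """Iterative policy evaluation for the equiprobable random policy (in-place sweeps)."""
    V = np.zeros((ROWS, COLS))
    while True:
        delta = 0.0
        for r in range(ROWS):
            for c in range(COLS):
                s = (r, c)
                if s == GOAL or s in OBSTACLES:
                    continue  # the terminal goal and blocked cells keep value 0
                total = 0.0
                for a in MOVES:
                    s2 = next_state(s, a)
                    reward = 100.0 if s2 == GOAL else -1.0
                    total += 0.25 * (reward + gamma * V[s2])  # V[GOAL] stays 0
                delta = max(delta, abs(total - V[s]))
                V[s] = total
        if delta < tol:  # largest change in this sweep is below the tolerance
            return V


V = evaluate_uniform_policy()
print(np.round(V, 2))
```

Cells next to the goal should receive the largest values. The value at `(0, 0)` estimates the expected discounted return of wandering randomly from the start cell, which you can check against averaged returns from many `GridWorld` rollouts.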
