# Lab 05 – Monte Carlo Methods Starter Notebook

## Overview
Explore Monte Carlo prediction and control techniques for episodic tasks. Students will build or reuse a simulator (e.g., Blackjack) to estimate value functions from sampled returns.

## Objectives
- Implement first-visit and every-visit Monte Carlo estimators.
- Compare exploring starts with ε-greedy behavior policies.
- Track policy improvement over episodes.
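
The difference between the first-visit and every-visit estimators in the first objective can be seen on a single hand-written episode. The sketch below is illustrative only: the `mc_returns` helper, the `(state, reward)` episode format, and the toy states `"A"` and `"B"` are assumptions made for this example and are not part of the lab starter.

```python
from collections import defaultdict

def mc_returns(episode, gamma=1.0, first_visit=True):
    """Collect sampled returns per state from one episode.

    `episode` is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state` (hypothetical format).
    """
    # Scan backwards so G holds the discounted return from step t onward.
    G = 0.0
    returns_at = [0.0] * len(episode)
    for t in range(len(episode) - 1, -1, -1):
        _, reward = episode[t]
        G = reward + gamma * G
        returns_at[t] = G

    samples = defaultdict(list)
    seen = set()
    for t, (state, _) in enumerate(episode):
        # First-visit keeps only the return following the earliest occurrence.
        if first_visit and state in seen:
            continue
        seen.add(state)
        samples[state].append(returns_at[t])
    return dict(samples)

# State "A" is visited twice in this toy episode.
episode = [("A", 1.0), ("B", 0.0), ("A", 2.0)]
first = mc_returns(episode, first_visit=True)
every = mc_returns(episode, first_visit=False)
print(first)  # {'A': [3.0], 'B': [2.0]}
print(every)  # {'A': [3.0, 2.0], 'B': [2.0]}
```

Averaging the collected samples over many episodes gives the value estimate; the two estimators differ only in which returns are admitted to that average.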

## Pre-Lab Review
- Review the Monte Carlo lecture decks [`old content/Section_10_MonteCarlo_Dynamic_1.pdf`](../../old%20content/Section_10_MonteCarlo_Dynamic_1.pdf) and [`old content/Section_10_MonteCarlo_Example_1.pdf`](../../old%20content/Section_10_MonteCarlo_Example_1.pdf).
- Revisit relevant segments in the legacy notebook [`old content/ALL_WEEKS_V5 - Student.ipynb`](../../old%20content/ALL_WEEKS_V5%20-%20Student.ipynb).

## In-Lab Exercises
1. Implement Monte Carlo prediction for state values using first-visit returns.
2. Extend to action-value estimation with exploring starts.
3. Introduce ε-greedy action selection and observe policy evolution.
4. Discuss sample efficiency compared to dynamic programming.
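
For exercise 3, it can help to write down the action distribution that ε-greedy selection induces before wiring it into a simulator. The following is a minimal sketch under stated assumptions: the helper names `epsilon_greedy_probs` and `sample_action` are hypothetical, and ties are broken in favor of the lowest action index.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Return the probability of each action under an ε-greedy policy."""
    n = len(q_values)
    # max() returns the first maximal index, so ties favor the lowest action.
    greedy = max(range(n), key=lambda a: q_values[a])
    # Every action gets epsilon / n; the greedy action gets the remainder too.
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs

def sample_action(q_values, epsilon, rng):
    """Draw one action index according to the ε-greedy distribution."""
    weights = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=weights)[0]

probs = epsilon_greedy_probs([0.2, -0.1], epsilon=0.1)
print(probs)  # roughly [0.95, 0.05]

rng = random.Random(0)  # seeded generator for reproducible draws
action = sample_action([0.2, -0.1], epsilon=0.1, rng=rng)
```

Because every action keeps probability of at least ε/|A|, each state-action pair that is reachable continues to be sampled, which is the role exploring starts play in exercise 2.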

## Deliverables
- Notebook with Monte Carlo code, learning curves, and commentary.
- Brief reflection on trade-offs between Monte Carlo and DP methods.

## Resources
- [`old content/UpdateRuleExample.png`](../../old%20content/UpdateRuleExample.png) to visualize incremental averaging.
- Gymnasium environments (maintained by the Farama Foundation as the successor to OpenAI Gym) for quick experimentation (e.g., `Blackjack-v1`).
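
Incremental averaging, mentioned in the first resource, updates a running estimate with each new return instead of storing all returns. The sketch below assumes the standard sample-average update `estimate += (G - estimate) / n`; the function name `incremental_mean` and the sample returns are made up for illustration.

```python
def incremental_mean(samples):
    """Compute a running average without storing the samples."""
    estimate = 0.0
    for n, g in enumerate(samples, start=1):
        # Move the estimate toward the new sample by a step of size 1/n.
        estimate += (g - estimate) / n
    return estimate

returns = [1.0, -1.0, 1.0, 0.0]  # hypothetical sampled returns
running = incremental_mean(returns)
batch = sum(returns) / len(returns)
print(running, batch)
```

Up to floating-point rounding, the running estimate matches the batch mean, so it yields the same values as keeping separate sum and count tables.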

### Monte Carlo Control Starter
Adapted from the Blackjack and Monte Carlo examples in `old content/Section_10_MonteCarlo_Example_1.pdf`.

In [None]:
import random
from collections import defaultdict

try:
    import gymnasium as gym
except ImportError:
    gym = None
    print("Install gymnasium to run the full Blackjack example.")

def create_env():
    if gym is None:
        raise RuntimeError("gymnasium is required for this lab. Install it via pip install gymnasium[all].")
    return gym.make('Blackjack-v1', sab=True)

def generate_episode(policy, env):
    episode = []
    state, _ = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        done = terminated or truncated
    return episode

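def monte_carlo_control(num_episodes=50000, epsilon=0.1, gamma=0.9):
    """On-policy first-visit Monte Carlo control with an ε-greedy policy."""
    env = create_env()
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    # Q[state] = [value of action 0 (stick), value of action 1 (hit)]
    Q = defaultdict(lambda: [0.0, 0.0])

    def policy(state):
        # Explore with probability epsilon; otherwise act greedily w.r.t. Q.
        if random.random() < epsilon:
            return int(env.action_space.sample())
        values = Q[state]
        return int(values[1] > values[0])

    for episode_idx in range(1, num_episodes + 1):
        episode = generate_episode(policy, env)
        # Record the earliest time step at which each (state, action) occurs,
        # so the backward pass below credits the first visit, not the last.
        first_visit = {}
        for t, (state, action, _) in enumerate(episode):
            first_visit.setdefault((state, action), t)
        G = 0.0
        # Walk backwards so G accumulates the discounted return from step t.
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = reward + gamma * G
            if first_visit[(state, action)] == t:
                returns_sum[(state, action)] += G
                returns_count[(state, action)] += 1
                Q[state][action] = returns_sum[(state, action)] / returns_count[(state, action)]
        if episode_idx % 5000 == 0:
            print(f"Episode {episode_idx}: ε-greedy Monte Carlo control running...")
    env.close()
    return Q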

# q_values = monte_carlo_control(num_episodes=10000)
# list(q_values.items())[:5]
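
To track policy improvement over episodes, one option is to extract the greedy policy from `Q` at checkpoints and count how many states changed action. This is a sketch, not part of the original starter: `greedy_policy` and `count_policy_changes` are hypothetical helpers, and the two Q tables below are hand-written stand-ins for snapshots taken during training.

```python
def greedy_policy(Q):
    """Map each state to its greedy action (0 = stick, 1 = hit; ties pick 0)."""
    return {state: int(values[1] > values[0]) for state, values in Q.items()}

def count_policy_changes(old_policy, new_policy):
    """Count states whose action differs, including states seen in only one."""
    states = set(old_policy) | set(new_policy)
    return sum(old_policy.get(s) != new_policy.get(s) for s in states)

# Hand-written snapshots keyed by (player_sum, dealer_card, usable_ace).
Q_before = {(20, 10, 0): [0.40, -0.80], (12, 6, 0): [-0.20, -0.30]}
Q_after = {(20, 10, 0): [0.45, -0.85], (12, 6, 0): [-0.30, -0.10]}

changes = count_policy_changes(greedy_policy(Q_before), greedy_policy(Q_after))
print(changes)  # 1
```

Plotting this count against the episode index gives a simple learning curve: it should trend toward zero as the greedy policy stabilizes.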
