# Lab 06 – Temporal-Difference Learning Starter Notebook

## Overview
This lab moves from Monte Carlo methods to temporal-difference (TD) learning, which updates value estimates incrementally after every step instead of waiting for the end of an episode. Students implement TD(0) and eligibility traces, then compare their performance against Monte Carlo baselines.

## Objectives
- Implement TD(0) prediction for value estimation.
- Build intuition for eligibility traces and TD(λ).
- Compare convergence speed and sample efficiency versus Monte Carlo approaches.
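
For reference while working toward these objectives, the two tabular prediction updates being compared can be written as follows. This is the standard textbook form, not something taken from the archived notebook; the symbols are assumptions of notation, with $\alpha$ the step size, $\gamma$ the discount factor, $G_t$ the full return from time $t$, and $R_{t+1}$ the reward received on leaving state $S_t$.

$$
\begin{aligned}
\text{Monte Carlo:}\quad & V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[G_t - V(S_t)\bigr] \\
\text{TD(0):}\quad & V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr]
\end{aligned}
$$

The bracketed term in the TD(0) update is the TD error $\delta_t$. It depends only on the next reward and the next state's current estimate, which is why the update can be applied after a single step.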

## Pre-Lab Review
- Revisit the solved examples highlighting TD behavior in [`old content/RL Solved Example - Updated.pdf`](../../old%20content/RL%20Solved%20Example%20-%20Updated.pdf).
- Review any TD-focused notes embedded in [`old content/ALL_WEEKS_V5 - Student.ipynb`](../../old%20content/ALL_WEEKS_V5%20-%20Student.ipynb).

## In-Lab Exercises
1. Implement TD(0) for the random walk or Blackjack tasks from prior labs.
2. Add eligibility traces to experiment with TD(λ) variants.
3. Plot learning curves comparing TD and Monte Carlo approaches.
4. Discuss the bias-variance trade-off between TD's bootstrapped targets and Monte Carlo's full-return targets.
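
As a possible starting point for exercise 2, the sketch below shows one way to add accumulating eligibility traces on a five-state random walk. It is a minimal illustration under stated assumptions, not the archived notebook's implementation: the function name `td_lambda_episode`, the `lam` and `rng` parameters, and the inlined walk dynamics (states `0..4`, reward 1 only on the right exit) are all hypothetical choices made to keep the block self-contained.

```python
import random
from collections import defaultdict


def td_lambda_episode(V, n_states=5, alpha=0.1, gamma=1.0, lam=0.8, rng=random):
    """Run one random-walk episode of tabular TD(lambda) with accumulating traces.

    Hypothetical helper for illustration; V is updated in place.
    """
    traces = defaultdict(float)      # eligibility trace per state, reset each episode
    state = n_states // 2            # start in the middle state
    done = False
    while not done:
        next_state = state + rng.choice([-1, 1])
        done = next_state < 0 or next_state >= n_states
        reward = 1.0 if next_state >= n_states else 0.0
        # Terminal states have value 0, so the bootstrap term drops out.
        next_value = 0.0 if done else V[next_state]
        delta = reward + gamma * next_value - V[state]   # TD error
        traces[state] += 1.0         # accumulating trace for the state just left
        for s in list(traces):
            V[s] += alpha * delta * traces[s]   # credit every recently visited state
            traces[s] *= gamma * lam            # then decay its trace
        state = next_state
    return V


rng = random.Random(0)
V = defaultdict(float)
for _ in range(500):
    td_lambda_episode(V, rng=rng)
print({s: round(v, 2) for s, v in sorted(V.items())})
```

Setting `lam=0.0` zeroes every trace after each step, so only the current state is updated and the sketch reduces to TD(0). Larger values of `lam` spread each TD error further back along the trajectory.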

## Deliverables
- Notebook showcasing TD implementations, experiments, and plots.
- Short write-up summarizing insights on TD vs. Monte Carlo.

## Resources
- [`old content/DQN_vs_Q.png`](../../old%20content/DQN_vs_Q.png) for a preview of the upcoming discussion of value-function approximation.
- Open-source RL textbooks or Sutton & Barto Chapter 6 for supplemental reading.

### Temporal-Difference Starter
Use this scaffold, which mirrors the random walk TD examples from the archived notebook, as the TD(0) side of the comparison with Monte Carlo estimates.

```python
import numpy as np
from collections import defaultdict


class RandomWalk:
    """Symmetric random walk over states 0..n_states-1 with a terminal state on each side."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.start_state = n_states // 2
        self.terminal_left = -1
        self.terminal_right = n_states
        self.reset()

    def reset(self):
        self.state = self.start_state
        return self.state

    def step(self):
        """Move one step left or right; return (next_state, reward, done)."""
        # Cast to a plain int so states print cleanly as dictionary keys.
        move = int(np.random.choice([-1, 1]))
        self.state += move
        if self.state == self.terminal_right:
            return self.state, 1, True   # reward 1 only for exiting on the right
        if self.state == self.terminal_left:
            return self.state, 0, True
        return self.state, 0, False


def td_zero(env, V, alpha=0.1, gamma=1.0):
    """Run one episode of tabular TD(0), updating V in place."""
    state = env.reset()
    done = False
    while not done:
        next_state, reward, done = env.step()
        # Terminal states have value 0, so the target is just the reward.
        target = reward if done else reward + gamma * V[next_state]
        V[state] += alpha * (target - V[state])
        state = next_state


def run_td_experiment(episodes=100):
    """Train TD(0) for the given number of episodes, printing estimates every 20."""
    env = RandomWalk()
    V = defaultdict(float)   # unseen states start at 0.0
    for episode in range(episodes):
        td_zero(env, V, alpha=0.1)
        if (episode + 1) % 20 == 0:
            print(f"Episode {episode + 1}: value estimates {dict(sorted(V.items()))}")
    return V


# Uncomment to run the experiment:
# td_values = run_td_experiment(episodes=100)
```
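
The scaffold above covers only the TD(0) side, so the following sketch suggests one way to set up the Monte Carlo side and produce data for a learning curve. It is an illustration under stated assumptions, not the archived notebook's code: every name here (`generate_episode`, `mc_update`, `td0_update`, `rms_error`, `compare`) is hypothetical, the walk is re-implemented inline so the block runs on its own, and both methods are fed the same trajectories so the comparison is not confounded by different random episodes. The reference values `(s + 1) / 6` are the exact state values for a five-state symmetric walk with reward 1 only on the right exit and `gamma = 1`.

```python
import random
from collections import defaultdict

N_STATES = 5
# Exact values for the symmetric walk with reward 1 only on the right exit.
TRUE_VALUES = {s: (s + 1) / (N_STATES + 1) for s in range(N_STATES)}


def generate_episode(rng):
    """Return the visited non-terminal states and the reward on termination."""
    state, visited = N_STATES // 2, []
    while 0 <= state < N_STATES:
        visited.append(state)
        state += rng.choice([-1, 1])
    return visited, (1.0 if state >= N_STATES else 0.0)


def mc_update(V, visited, final_reward, alpha):
    """Constant-alpha, every-visit Monte Carlo update (gamma = 1)."""
    # Only the last reward can be non-zero, so every visit has return final_reward.
    for s in visited:
        V[s] += alpha * (final_reward - V[s])


def td0_update(V, visited, final_reward, alpha):
    """TD(0) updates replayed in order over the same trajectory (gamma = 1)."""
    for i, s in enumerate(visited):
        is_last = i == len(visited) - 1
        # Intermediate rewards are 0; the terminal state has value 0.
        target = final_reward if is_last else V[visited[i + 1]]
        V[s] += alpha * (target - V[s])


def rms_error(V):
    """Root-mean-square error of V against the exact values."""
    return (sum((V[s] - TRUE_VALUES[s]) ** 2 for s in TRUE_VALUES) / N_STATES) ** 0.5


def compare(episodes=100, alpha=0.1, seed=0):
    """Return per-episode RMS error curves for both methods."""
    rng = random.Random(seed)
    V_mc, V_td = defaultdict(float), defaultdict(float)
    curves = {"mc": [], "td": []}
    for _ in range(episodes):
        visited, final_reward = generate_episode(rng)  # same data for both methods
        mc_update(V_mc, visited, final_reward, alpha)
        td0_update(V_td, visited, final_reward, alpha)
        curves["mc"].append(rms_error(V_mc))
        curves["td"].append(rms_error(V_td))
    return curves


curves = compare()
print(f"final RMS error  MC: {curves['mc'][-1]:.3f}  TD(0): {curves['td'][-1]:.3f}")
```

The two lists in `curves` can be passed to any plotting library for exercise 3. A single seed is noisy, so averaging the curves over many seeds and several values of `alpha` will likely give a fairer picture of which method learns faster.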
