# Lab 07 – Value-Based Control (SARSA & Q-Learning) Starter Notebook

## Overview
This lab covers temporal-difference control: you will implement SARSA and Q-learning on discrete environments, then analyze how on-policy and off-policy updates and different exploration strategies affect learning.

## Objectives
- Implement SARSA and Q-learning algorithms with ε-greedy exploration.
- Compare learning curves and policy quality under different exploration rates.
- Discuss stability considerations for on-policy vs. off-policy methods.

## Pre-Lab Review
- Review SARSA vs. Q-learning visual comparisons in [`old content/DQN_vs_Q.png`](../../old%20content/DQN_vs_Q.png).
- Revisit control sections within [`old content/ALL_WEEKS_V5 - Student.ipynb`](../../old%20content/ALL_WEEKS_V5%20-%20Student.ipynb).

## In-Lab Exercises
1. Select a benchmark environment (e.g., CliffWalking-v0, Taxi-v3) and set fixed random seeds for reproducibility.
2. Implement SARSA with decaying ε-greedy exploration; log episodic returns.
3. Implement Q-learning with the same environment for comparison.
4. Analyze stability, convergence speed, and sensitivity to hyperparameters.
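
Steps 1, 2, and 4 above can be supported by a few small helpers. The following is a minimal sketch, not part of the starter code: the names `decayed_epsilon` and `moving_average` are hypothetical, and the decay constants are illustrative values to tune. Seeding Python's `random` module covers the exploration in the starter agent, which draws from `random`; Gymnasium environments are seeded separately through `env.reset(seed=...)`.

```python
import random


def decayed_epsilon(episode, eps_start=1.0, eps_end=0.05, decay=0.99):
    """Exponentially decay epsilon per episode, floored at eps_end.

    All default values are illustrative assumptions.
    """
    return max(eps_end, eps_start * decay ** episode)


def moving_average(returns, window=10):
    """Trailing moving average of episodic returns.

    Early entries average over however many episodes have been seen so far.
    """
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(returns):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= returns[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


random.seed(0)  # fix the RNG used for epsilon-greedy exploration

epsilons = [decayed_epsilon(ep) for ep in range(500)]
smoothed = moving_average([1, 2, 3, 4], window=2)
```

In a training loop, one option is to set `agent.epsilon = decayed_epsilon(episode)` before each episode and append each episode's total reward to a list that is later passed to `moving_average` for plotting.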

## Deliverables
- Consolidated notebook summarizing SARSA and Q-learning implementations.
- Short memo contrasting exploration/exploitation trade-offs observed.

## Resources
- [`old content/optimal.png`](../../old%20content/optimal.png) for discussing optimal policy structures.
- Links to Gymnasium documentation for environment-specific APIs.

### SARSA vs. Q-Learning
Starter adapted from the GridWorld Q-learning agent in `old content/ALL_WEEKS_V5 - Student.ipynb`. Compare on-policy and off-policy updates.
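
For reference, the two update rules implemented by `update_sarsa` and `update_q_learning` can be written in standard notation as follows, where $\alpha$ and $\gamma$ correspond to `alpha` and `gamma` in the code. The only difference is the bootstrap term: SARSA uses the action $a_{t+1}$ the agent actually selects next, while Q-learning uses the greedy action in $s_{t+1}$.

$$
\begin{aligned}
\text{SARSA:} \quad & Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \\
\text{Q-learning:} \quad & Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
\end{aligned}
$$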

import numpy as np
import random

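class EpsilonGreedyAgent:
    """Tabular agent with ε-greedy action selection and SARSA / Q-learning updates.

    Expects `env.grid_size` as a (rows, cols) tuple and `env.actions` as a list,
    with states given as (row, col) tuples.
    """

    def __init__(self, env, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.env = env
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        # One Q-value per (row, col, action), initialized to zero.
        self.q_table = np.zeros((*env.grid_size, len(env.actions)))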

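    def choose_action(self, state):
        # Explore: pick a uniformly random action with probability epsilon.
        if random.random() < self.epsilon:
            return random.choice(self.env.actions)
        # Exploit: pick the greedy action (np.argmax breaks ties toward the first action).
        row, col = state
        action_index = np.argmax(self.q_table[row, col])
        return self.env.actions[action_index]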

    def update_sarsa(self, state, action, reward, next_state, next_action):
        # On-policy TD update: bootstrap from the action the agent will actually take next.
        row, col = state
        action_idx = self.env.actions.index(action)
        if next_action is None:
            # Terminal transition: there is no next action, so the target is the reward alone.
            td_target = reward
        else:
            next_row, next_col = next_state
            next_idx = self.env.actions.index(next_action)
            td_target = reward + self.gamma * self.q_table[next_row, next_col, next_idx]
        td_error = td_target - self.q_table[row, col, action_idx]
        self.q_table[row, col, action_idx] += self.alpha * td_error

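    def update_q_learning(self, state, action, reward, next_state):
        # Off-policy TD update: bootstrap from the greedy action in next_state,
        # whichever action the agent takes next.
        # As long as terminal cells are never updated, their Q-values stay at zero
        # and contribute nothing to the target on the final step.
        row, col = state
        action_idx = self.env.actions.index(action)
        next_row, next_col = next_state
        td_target = reward + self.gamma * np.max(self.q_table[next_row, next_col])
        td_error = td_target - self.q_table[row, col, action_idx]
        self.q_table[row, col, action_idx] += self.alpha * td_error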

def run_episode(env, agent, algorithm='q_learning'):
    # Assumes the GridWorld API: reset() returns a (row, col) state and
    # step(action) returns (next_state, reward, done).
    state = env.reset()
    action = agent.choose_action(state)
    done = False
    total_reward = 0
    while not done:
        next_state, reward, done = env.step(action)
        total_reward += reward
        # Pick the next action up front: SARSA uses it in its update,
        # Q-learning only uses it to act. None marks a terminal step.
        next_action = agent.choose_action(next_state) if not done else None
        if algorithm == 'sarsa':
            agent.update_sarsa(state, action, reward, next_state, next_action)
        else:
            agent.update_q_learning(state, action, reward, next_state)
        # On the terminal step next_action is None and the loop exits.
        state, action = next_state, next_action
    return total_reward

# env = GridWorld(obstacles={(1, 1)})
# agent = EpsilonGreedyAgent(env)
# for episode in range(100):
#     reward = run_episode(env, agent, algorithm='q_learning')
#     if (episode + 1) % 20 == 0:
#         print(f"Episode {episode + 1}, total reward: {reward}")
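
The commented-out driver above needs a `GridWorld` class, which this notebook does not define. A minimal stand-in, inferred only from how the starter code calls the environment (`grid_size`, `actions`, `reset()`, `step()`), might look like the sketch below. The default grid size, start and goal cells, action names, and reward values are assumptions for illustration, not the original implementation.

```python
class GridWorld:
    """Hypothetical stand-in for the GridWorld used by the starter code.

    Assumed interface: grid_size, actions, reset() -> state,
    step(action) -> (next_state, reward, done).
    """

    # Assumed action names, mapped to (row, col) offsets.
    MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

    def __init__(self, grid_size=(4, 4), start=(0, 0), goal=(3, 3), obstacles=None):
        self.grid_size = grid_size
        self.start = start
        self.goal = goal
        self.obstacles = set(obstacles or ())
        # A list, because the agent calls actions.index() and random.choice().
        self.actions = list(self.MOVES)
        self.state = start

    def reset(self):
        """Return the agent to the start cell and report that state."""
        self.state = self.start
        return self.state

    def step(self, action):
        """Apply one move and return (next_state, reward, done)."""
        d_row, d_col = self.MOVES[action]
        row, col = self.state[0] + d_row, self.state[1] + d_col
        rows, cols = self.grid_size
        # Moves off the grid or into an obstacle leave the agent in place.
        if 0 <= row < rows and 0 <= col < cols and (row, col) not in self.obstacles:
            self.state = (row, col)
        done = self.state == self.goal
        # Assumed reward scheme: -1 per step, +10 on reaching the goal.
        reward = 10 if done else -1
        return self.state, reward, done
```

With a class like this defined before the driver, the commented lines can be uncommented as they are; passing `algorithm='sarsa'` instead runs the on-policy variant on the same grid.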
