# Temporal Difference (TD) Learning

**Learning Objectives:**
By the end of this notebook, you will be able to:
* Understand TD learning and how it combines Dynamic Programming and Monte Carlo methods
* Implement and train Q-Learning agents (off-policy TD control)
* Implement and train SARSA agents (on-policy TD control)
* Implement and train Expected SARSA agents
* Compare different TD learning algorithms and understand their trade-offs

## Table of Contents
- [1 - Packages](#1)
- [2 - Introduction to TD Learning](#2)
    - [2.1 - TD Learning Fundamentals](#2-1)
    - [2.2 - The TD Error](#2-2)
- [3 - Q-Learning](#3)
    - [Exercise 1 - implement_q_learning](#ex-1)
    - [3.1 - Training Q-Learning](#3-1)
- [4 - SARSA](#4)
    - [Exercise 2 - implement_sarsa](#ex-2)
    - [4.1 - Training SARSA](#4-1)
- [5 - Expected SARSA](#5)
    - [Exercise 3 - implement_expected_sarsa](#ex-3)
    - [5.1 - Training Expected SARSA](#5-1)
- [6 - Algorithm Comparison](#6)
    - [Exercise 4 - compare_td_methods](#ex-4)
    - [6.1 - Comprehensive Analysis](#6-1)
- [7 - Cliff Walking Problem](#7)
- [8 - Summary and Key Takeaways](#8)

<a name='1'></a>
## 1 - Packages

In [None]:
### v1.0

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
import gymnasium as gym
from IPython.display import display, HTML

# Import TD utilities
from td_utils import (
    QLearningAgent, SARSAAgent, ExpectedSARSAAgent,
    train_q_learning, train_sarsa, train_expected_sarsa,
    test_q_learning_agent, test_sarsa_agent, test_expected_sarsa_agent,
    test_td_error_calculation, test_epsilon_decay,
    plot_training_curves, plot_comparison_bars
)

# Configure visualization
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Set reproducibility
np.random.seed(42)

%load_ext autoreload
%autoreload 2

print('✓ All packages imported successfully')
print('  - NumPy, Matplotlib, Seaborn, Gymnasium')
print('  - TD Learning utilities loaded')

<a name='2'></a>
## 2 - Introduction to TD Learning

**Temporal Difference (TD) Learning** is a fundamental paradigm in reinforcement learning that combines ideas from:
- **Dynamic Programming**: Bootstrap with estimated future values
- **Monte Carlo Methods**: Learn from actual experience without a model

### Key Characteristics of TD Learning:
1. **Model-free**: No need for $p(s',r|s,a)$
2. **Online learning**: Learn after each step (not waiting for episode termination)
3. **Bootstrapping**: Use estimates of future values to update current estimates
4. **Efficient**: Each update uses a single transition, so it needs neither a sweep over a model (as in DP) nor a complete episode return (as in MC)

<a name='2-1'></a>
### 2.1 - TD Learning Fundamentals

#### Comparison: Monte Carlo vs TD Learning vs Dynamic Programming

| Aspect | Monte Carlo | TD Learning | Dynamic Programming |
|--------|-------------|-------------|---------------------|
| **Requires model** | No | No | Yes |
| **Update timing** | End of episode | After each step | Each sweep over the state space |
| **Uses bootstrapping** | No | Yes | Yes |
| **Variance** | High | Low-Medium | None (no sampling) |
| **Bias** | Low | Medium | Medium |
| **Convergence speed** | Slow | Fast | Fast (if model known) |
| **Sample efficiency** | Low | Medium | High (if model perfect) |
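
To make the "update timing" and "bootstrapping" rows concrete, the sketch below computes a Monte Carlo target and a TD target for the first state of one short episode. The rewards and value estimates are made-up numbers chosen only for illustration.

```python
# Hypothetical 3-step episode: rewards R_1..R_3 and current estimates V(S_0)..V(S_3)
rewards = [0.0, 0.0, 1.0]
V = [0.2, 0.4, 0.7, 0.0]  # V(S_3) = 0 because S_3 is terminal
gamma = 0.9

def mc_return(rewards, gamma, t=0):
    """Discounted return from time t; needs the whole episode."""
    g = 0.0
    for r in reversed(rewards[t:]):
        g = r + gamma * g
    return g

# Monte Carlo target for S_0: available only after the episode ends
mc_target = mc_return(rewards, gamma)

# TD target for S_0: available after one step, bootstraps from V(S_1)
td_target = rewards[0] + gamma * V[1]

print(f"MC target: {mc_target:.2f}")  # 0.81
print(f"TD target: {td_target:.2f}")  # 0.36
```

The MC target depends only on observed rewards, while the TD target depends on the current estimate of the next state, which is where its bias comes from.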

<a name='2-2'></a>
### 2.2 - The TD Error

The core of TD learning is the **TD Error**, which measures how much our current estimate differs from the "TD target":

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

Where:
- $R_{t+1}$: Immediate reward received
- $\gamma$: Discount factor (0 ≤ γ ≤ 1)
- $V(S_t)$: Current value estimate for state $S_t$
- $V(S_{t+1})$: Bootstrap estimate for next state (this is the key!)

**The TD Update Rule:**
$$V(S_t) \leftarrow V(S_t) + \alpha \cdot \delta_t$$

Where $\alpha$ is the learning rate.

**Key Insight**: We move the current estimate towards the TD target by an amount proportional to the TD error.
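
As a minimal sketch of the update rule above, the function below applies one TD(0) step to a state-value table. The name `td0_update`, the dictionary-based table, and the example numbers are assumptions made for illustration; they are not part of `td_utils`.

```python
def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to V in place and return the TD error."""
    # A terminal next state contributes no future value
    bootstrap = 0.0 if done else V[next_state]
    td_error = reward + gamma * bootstrap - V[state]
    V[state] += alpha * td_error
    return td_error

# Hypothetical two-state example
V = {"A": 0.5, "B": 1.0}
delta = td0_update(V, "A", reward=1.0, next_state="B", done=False,
                   alpha=0.1, gamma=0.9)

# delta = 1.0 + 0.9 * 1.0 - 0.5 = 1.4, so V["A"] moves from 0.5 to 0.64
print(f"TD error: {delta:.2f}, new V(A): {V['A']:.2f}")
```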

In [None]:
# Run unit tests
print("Testing TD Error calculation...")
test_td_error_calculation()
test_epsilon_decay()
print("\n✓ Unit tests passed!")

<a name='3'></a>
## 3 - Q-Learning

### 3.0 - Q-Learning Algorithm (Off-Policy)

Q-Learning learns the **optimal policy** while exploring with a different policy (epsilon-greedy).

**Algorithm:**
```
Initialize Q(s,a) = 0 for all s,a
For each episode:
    s = initial state
    For each step:
        a = select action using epsilon-greedy policy
        Execute a, observe r, s'
        Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]  ← MAX is key!
        s = s'
    until terminal state reached
```

**Off-Policy Definition**: The learning target uses the best possible next action (max), regardless of which action the policy actually explores.

**Update Rule**:
$$Q(s,a) \leftarrow Q(s,a) + \alpha [R_{t+1} + \gamma \max_{a'} Q(s',a') - Q(s,a)]$$
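
The algorithm above selects actions with an epsilon-greedy policy. A minimal sketch of that selection step follows; the function name `epsilon_greedy` and the example Q-values are illustrative assumptions, and the agents in `td_utils` may implement it differently.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.2, 0.3, 0.5, 0.1])  # hypothetical Q(s, .) for one state

# The greedy action (index 2) is chosen with probability (1 - eps) + eps / 4 = 0.925
actions = [epsilon_greedy(q_values, 0.1, rng) for _ in range(10_000)]
greedy_fraction = float(np.mean(np.array(actions) == 2))
print(f"Fraction of greedy picks: {greedy_fraction:.3f}")
```

Note that the random branch can also return the greedy action, which is why the greedy action's probability is slightly above `1 - epsilon`.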

<a name='ex-1'></a>
### Exercise 1 - implement_q_learning

Implement the Q-Learning update rule. Your function should:
1. Calculate the current Q-value for the state-action pair
2. Calculate the TD target (reward + gamma * max next Q-value)
3. Update the Q-value using the TD error

**Hints:**
- Use `Q[state][action]` to access/update Q-values (`Q` is passed in as an argument)
- Use `np.max(Q[next_state])` to get the maximum Q-value for the next state
- TD error = target - current estimate

In [None]:
# GRADED FUNCTION: implement_q_learning

def implement_q_learning(Q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """
    Implement the Q-Learning update rule.
    
    Arguments:
    Q -- Q-value table (dictionary)
    state -- current state
    action -- action taken
    reward -- reward received
    next_state -- resulting state
    done -- whether episode terminated
    alpha -- learning rate
    gamma -- discount factor
    
    Returns:
    Q -- updated Q-value table
    td_error -- temporal difference error
    """
    # (approx. 6 lines)
    # YOUR CODE STARTS HERE
    current_q = Q[state][action]
    
    if done:
        target_q = reward
    else:
        # Q-Learning: use MAX action value
        target_q = reward + gamma * np.max(Q[next_state])
    
    td_error = target_q - current_q
    Q[state][action] += alpha * td_error
    # YOUR CODE ENDS HERE
    
    return Q, td_error

print("✓ Q-Learning update rule implemented")

In [None]:
# Test the implementation
def test_implement_q_learning():
    # Initialize Q-values
    Q = defaultdict(lambda: np.zeros(4))
    
    # Test case 1: Update with non-terminal state
    Q[0] = np.array([0.1, 0.2, 0.3, 0.4])
    Q[1] = np.array([0.2, 0.3, 0.5, 0.1])
    
    Q, td_error = implement_q_learning(Q, state=0, action=0, reward=1.0, 
                                       next_state=1, done=False, alpha=0.1, gamma=0.99)
    
    # Expected: target = 1.0 + 0.99 * max([0.2, 0.3, 0.5, 0.1]) = 1.0 + 0.99*0.5 = 1.495
    # current = 0.1
    # td_error = 1.495 - 0.1 = 1.395
    # Q[0][0] = 0.1 + 0.1*1.395 = 0.2395
    
    expected_q = 0.1 + 0.1 * (1.0 + 0.99 * 0.5 - 0.1)
    assert np.isclose(Q[0][0], expected_q), f"Expected {expected_q}, got {Q[0][0]}"
    
    # Test case 2: Terminal state
    Q, td_error = implement_q_learning(Q, state=0, action=1, reward=2.0, 
                                       next_state=2, done=True, alpha=0.1, gamma=0.99)
    
    # For terminal: target = reward = 2.0
    # Q[0][1] = 0.2 + 0.1 * (2.0 - 0.2) = 0.2 + 0.18 = 0.38
    
    expected_q = 0.2 + 0.1 * (2.0 - 0.2)
    assert np.isclose(Q[0][1], expected_q), f"Expected {expected_q}, got {Q[0][1]}"
    
    print("✓ All Q-Learning update tests passed!")

test_implement_q_learning()

<a name='3-1'></a>
### 3.1 - Training Q-Learning

Now let's train a Q-Learning agent on the FrozenLake environment.
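
Before training, it helps to know how much exploration remains at the end of the run. The sketch below assumes epsilon is multiplied by `epsilon_decay` once per episode and clipped at `epsilon_min`; how `decay_epsilon` in `td_utils` actually works is not shown here, so treat this as an assumption. The helper `epsilon_after` is hypothetical.

```python
import math

def epsilon_after(n_episodes, epsilon=1.0, decay=0.995, epsilon_min=0.01):
    """Epsilon after n_episodes under multiplicative per-episode decay (assumed schedule)."""
    return max(epsilon_min, epsilon * decay ** n_episodes)

eps_400 = epsilon_after(400)  # start of the "last 100 episodes" window
eps_500 = epsilon_after(500)  # end of training

# Number of episodes needed before the floor of 0.01 is reached
episodes_to_floor = math.ceil(math.log(0.01 / 1.0) / math.log(0.995))

print(f"epsilon after 400 episodes: {eps_400:.3f}")
print(f"epsilon after 500 episodes: {eps_500:.3f}")
print(f"episodes until epsilon_min: {episodes_to_floor}")
```

Under this assumed schedule the agent is still taking roughly 8-13% exploratory actions during the last 100 of 500 episodes, so a success rate below 100% in that window does not by itself mean the greedy policy is wrong.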

In [None]:
# Create FrozenLake environment (deterministic)
env_ql = gym.make('FrozenLake-v1', is_slippery=False)
print('Environment: FrozenLake-v1 (deterministic)')
print(f'  States: {env_ql.observation_space.n}')
print(f'  Actions: {env_ql.action_space.n}')
print(f'  Goal: Reach the frisbee without falling into holes\n')

# Initialize Q-Learning agent
agent_ql = QLearningAgent(
    n_actions=env_ql.action_space.n,
    alpha=0.1,
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995,
    epsilon_min=0.01
)

print('Training Q-Learning agent...')
rewards_ql = train_q_learning(env_ql, agent_ql, n_episodes=500, verbose=True)
print(f'\nTraining complete!')

env_ql.close()

In [None]:
# Analyze Q-Learning performance
final_100_ql = rewards_ql[-100:]
success_rate_ql = sum([1 for r in final_100_ql if r > 0.5]) / len(final_100_ql)

print(f"Q-Learning Performance (last 100 episodes):")
print(f"  Average reward: {np.mean(final_100_ql):.3f}")
print(f"  Std deviation: {np.std(final_100_ql):.3f}")
print(f"  Success rate: {success_rate_ql*100:.1f}%")

<font color='blue'>

**What you should remember**:
- Q-Learning is **off-policy**: learns the optimal policy while exploring with a different policy
- The **max operator** in the update rule is crucial for learning the optimal value function
- Q-Learning can be aggressive because it assumes optimal behavior in the future
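- Convergence to the optimal Q-values is guaranteed in the tabular case if every state-action pair keeps being visited and the learning rate decays appropriately; GLIE (Greedy in the Limit with Infinite Exploration) exploration satisfies the visitation requirement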
</font>

<a name='4'></a>
## 4 - SARSA

### 4.0 - SARSA Algorithm (On-Policy)

SARSA learns about the **policy being followed** during exploration (on-policy).

**Algorithm:**
```
Initialize Q(s,a) = 0 for all s,a
For each episode:
    s = initial state
    a = select action using epsilon-greedy policy
    For each step:
        Execute a, observe r, s'
        a' = select action using epsilon-greedy policy from s'
        Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]  ← Uses actual next action!
        s = s', a = a'
    until terminal state reached
```

**On-Policy Definition**: The learning target uses the Q-value of the actual action that will be taken (a'), not necessarily the best action.

**Update Rule**:
$$Q(s,a) \leftarrow Q(s,a) + \alpha [R_{t+1} + \gamma Q(s',a') - Q(s,a)]$$

Where $a'$ is the actual next action selected by the epsilon-greedy policy.
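
The difference between the two targets is easiest to see on a single transition. The sketch below uses illustrative Q-values for the next state and compares the Q-Learning target with the SARSA target for each possible next action.

```python
import numpy as np

q_next = np.array([0.2, 0.3, 0.5, 0.1])  # illustrative Q(s', .) values
reward, gamma = 1.0, 0.99

# Q-Learning: always bootstraps from the best next action
q_learning_target = float(reward + gamma * np.max(q_next))

# SARSA: bootstraps from whichever next action the policy actually picked
sarsa_targets = {a: float(reward + gamma * q_next[a]) for a in range(len(q_next))}

print(f"Q-Learning target: {q_learning_target:.3f}")
for a, target in sarsa_targets.items():
    print(f"SARSA target if a' = {a}: {target:.3f}")
```

The two targets agree only when the sampled next action is the greedy one (here `a' = 2`); an exploratory pick such as `a' = 3` gives SARSA a lower target.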

<a name='ex-2'></a>
### Exercise 2 - implement_sarsa

Implement the SARSA update rule. Your function should:
1. Calculate the current Q-value for the state-action pair
2. Calculate the TD target (reward + gamma * Q-value of actual next action)
3. Update the Q-value using the TD error

**Key Difference from Q-Learning**: Use `next_action` parameter instead of max over all actions.

**Hints:**
- The only difference from Q-Learning is in step 2
- Use `Q[next_state][next_action]` instead of `np.max(...)`

In [None]:
# GRADED FUNCTION: implement_sarsa

def implement_sarsa(Q, state, action, reward, next_state, next_action, done, alpha=0.1, gamma=0.99):
    """
    Implement the SARSA update rule.
    
    Arguments:
    Q -- Q-value table (dictionary)
    state -- current state
    action -- action taken
    reward -- reward received
    next_state -- resulting state
    next_action -- next action that will be taken (actual policy action)
    done -- whether episode terminated
    alpha -- learning rate
    gamma -- discount factor
    
    Returns:
    Q -- updated Q-value table
    td_error -- temporal difference error
    """
    # (approx. 6 lines)
    # YOUR CODE STARTS HERE
    current_q = Q[state][action]
    
    if done:
        target_q = reward
    else:
        # SARSA: use actual next action's Q-value
        target_q = reward + gamma * Q[next_state][next_action]
    
    td_error = target_q - current_q
    Q[state][action] += alpha * td_error
    # YOUR CODE ENDS HERE
    
    return Q, td_error

print("✓ SARSA update rule implemented")

In [None]:
# Test the implementation
def test_implement_sarsa():
    # Initialize Q-values
    Q = defaultdict(lambda: np.zeros(4))
    
    # Test case 1: Update with non-terminal state
    Q[0] = np.array([0.1, 0.2, 0.3, 0.4])
    Q[1] = np.array([0.2, 0.3, 0.5, 0.1])
    
    next_action = 2  # Will take action 2 in next state
    Q, td_error = implement_sarsa(Q, state=0, action=0, reward=1.0, 
                                   next_state=1, next_action=next_action, done=False, alpha=0.1, gamma=0.99)
    
    # Expected: target = 1.0 + 0.99 * Q[1][2] = 1.0 + 0.99*0.5 = 1.495
    expected_q = 0.1 + 0.1 * (1.0 + 0.99 * Q[1][next_action] - 0.1)
    assert np.isclose(Q[0][0], expected_q), f"Expected {expected_q}, got {Q[0][0]}"
    
    # Test case 2: Terminal state
    Q, td_error = implement_sarsa(Q, state=0, action=1, reward=2.0, 
                                   next_state=2, next_action=1, done=True, alpha=0.1, gamma=0.99)
    
    # For terminal: target = reward = 2.0
    expected_q = 0.2 + 0.1 * (2.0 - 0.2)
    assert np.isclose(Q[0][1], expected_q), f"Expected {expected_q}, got {Q[0][1]}"
    
    print("✓ All SARSA update tests passed!")

test_implement_sarsa()

<a name='4-1'></a>
### 4.1 - Training SARSA

Now let's train a SARSA agent on the same FrozenLake environment.

In [None]:
# Create FrozenLake environment
env_sarsa = gym.make('FrozenLake-v1', is_slippery=False)

# Initialize SARSA agent
agent_sarsa = SARSAAgent(
    n_actions=env_sarsa.action_space.n,
    alpha=0.1,
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995,
    epsilon_min=0.01
)

print('Training SARSA agent...')
rewards_sarsa = train_sarsa(env_sarsa, agent_sarsa, n_episodes=500, verbose=True)
print(f'\nTraining complete!')

env_sarsa.close()

In [None]:
# Analyze SARSA performance
final_100_sarsa = rewards_sarsa[-100:]
success_rate_sarsa = sum([1 for r in final_100_sarsa if r > 0.5]) / len(final_100_sarsa)

print(f"SARSA Performance (last 100 episodes):")
print(f"  Average reward: {np.mean(final_100_sarsa):.3f}")
print(f"  Std deviation: {np.std(final_100_sarsa):.3f}")
print(f"  Success rate: {success_rate_sarsa*100:.1f}%")

<font color='blue'>

**What you should remember**:
- SARSA is **on-policy**: learns about the policy being followed during exploration
- Uses the Q-value of the **actual next action**, not the best possible action
- More conservative than Q-Learning because it considers exploration risk
- Useful when the cost of failure during exploration is high
- Convergence to the optimal policy is guaranteed under GLIE conditions, together with a suitably decaying learning rate
</font>

<a name='5'></a>
## 5 - Expected SARSA

### 5.0 - Expected SARSA Algorithm

Expected SARSA combines the best of both worlds:
- More stable than SARSA (not dependent on single next action)
- Less aggressive than Q-Learning (considers full policy distribution)

**Update Rule**:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[R_{t+1} + \gamma \mathbb{E}[Q(s',a')] - Q(s,a)\right]$$

Where the expectation is taken over the epsilon-greedy policy:
$$\mathbb{E}[Q(s',a')] = (1-\epsilon) \max_{a'} Q(s',a') + \frac{\epsilon}{|\mathcal{A}|} \sum_{a'} Q(s',a')$$
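
As a sanity check on the formula above, the sketch below builds the epsilon-greedy action probabilities explicitly and confirms that the probability-weighted sum matches the closed form. The helper name `epsilon_greedy_probs` and the Q-values are illustrative, and ties are assumed to be broken in favor of the first maximal action.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under epsilon-greedy (first argmax wins ties)."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

q_next = np.array([0.2, 0.3, 0.5, 0.1])  # illustrative Q(s', .) values
epsilon = 0.1

probs = epsilon_greedy_probs(q_next, epsilon)  # [0.025, 0.025, 0.925, 0.025]
expected_from_probs = float(probs @ q_next)
expected_closed_form = float((1 - epsilon) * q_next.max()
                             + (epsilon / len(q_next)) * q_next.sum())

print(f"Sum over probabilities: {expected_from_probs:.4f}")
print(f"Closed form:            {expected_closed_form:.4f}")
```

Both expressions evaluate to 0.4775 for these numbers, slightly below the greedy value of 0.5 because the exploratory actions pull the expectation down.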

<a name='ex-3'></a>
### Exercise 3 - implement_expected_sarsa

Implement the Expected SARSA update rule. Your function should:
1. Calculate the current Q-value
2. Compute the expected Q-value over the epsilon-greedy policy
3. Calculate the TD target and update

**Action probabilities under the epsilon-greedy policy** (the weights in the expected value):
- Best (greedy) action: $(1-\epsilon) + \frac{\epsilon}{|\mathcal{A}|}$
- Each other action: $\frac{\epsilon}{|\mathcal{A}|}$

**Hints:**
- Find the max Q-value for the next state
- Calculate average over all Q-values
- Combine using the formula above

In [None]:
# GRADED FUNCTION: implement_expected_sarsa

def implement_expected_sarsa(Q, state, action, reward, next_state, done, 
                             alpha=0.1, gamma=0.99, epsilon=0.1, n_actions=4):
    """
    Implement the Expected SARSA update rule.
    
    Arguments:
    Q -- Q-value table (dictionary)
    state -- current state
    action -- action taken
    reward -- reward received
    next_state -- resulting state
    done -- whether episode terminated
    alpha -- learning rate
    gamma -- discount factor
    epsilon -- exploration rate
    n_actions -- total number of actions
    
    Returns:
    Q -- updated Q-value table
    td_error -- temporal difference error
    """
    # (approx. 9 lines)
    # YOUR CODE STARTS HERE
    current_q = Q[state][action]
    
    if done:
        target_q = reward
    else:
        # Expected SARSA: compute expected value over epsilon-greedy policy
        q_values = Q[next_state]
        max_action = np.argmax(q_values)
        
        # Expected value under epsilon-greedy policy
        expected_q = ((1 - epsilon) * q_values[max_action] + 
                     (epsilon / n_actions) * np.sum(q_values))
        
        target_q = reward + gamma * expected_q
    
    td_error = target_q - current_q
    Q[state][action] += alpha * td_error
    # YOUR CODE ENDS HERE
    
    return Q, td_error

print("✓ Expected SARSA update rule implemented")

In [None]:
# Test the implementation
def test_implement_expected_sarsa():
    # Initialize Q-values
    Q = defaultdict(lambda: np.zeros(4))
    
    Q[0] = np.array([0.1, 0.2, 0.3, 0.4])
    Q[1] = np.array([0.2, 0.3, 0.5, 0.1])
    
    epsilon = 0.1
    gamma = 0.99
    alpha = 0.1
    n_actions = 4
    
    Q, td_error = implement_expected_sarsa(Q, state=0, action=0, reward=1.0, 
                                           next_state=1, done=False, 
                                           alpha=alpha, gamma=gamma, epsilon=epsilon, n_actions=n_actions)
    
    # Verify calculation
    q_next = np.array([0.2, 0.3, 0.5, 0.1])
    max_q = np.max(q_next)  # 0.5
    expected_q = (1 - epsilon) * max_q + (epsilon / n_actions) * np.sum(q_next)
    expected_target = 1.0 + gamma * expected_q
    expected_q_value = 0.1 + alpha * (expected_target - 0.1)
    
    assert np.isclose(Q[0][0], expected_q_value), f"Expected {expected_q_value}, got {Q[0][0]}"
    
    print("✓ All Expected SARSA update tests passed!")

test_implement_expected_sarsa()

<a name='5-1'></a>
### 5.1 - Training Expected SARSA

Now let's train an Expected SARSA agent.

In [None]:
# Create FrozenLake environment
env_exp_sarsa = gym.make('FrozenLake-v1', is_slippery=False)

# Initialize Expected SARSA agent
agent_exp_sarsa = ExpectedSARSAAgent(
    n_actions=env_exp_sarsa.action_space.n,
    alpha=0.1,
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995,
    epsilon_min=0.01
)

print('Training Expected SARSA agent...')
rewards_exp_sarsa = train_expected_sarsa(env_exp_sarsa, agent_exp_sarsa, n_episodes=500, verbose=True)
print(f'\nTraining complete!')

env_exp_sarsa.close()

In [None]:
# Analyze Expected SARSA performance
final_100_exp_sarsa = rewards_exp_sarsa[-100:]
success_rate_exp_sarsa = sum([1 for r in final_100_exp_sarsa if r > 0.5]) / len(final_100_exp_sarsa)

print(f"Expected SARSA Performance (last 100 episodes):")
print(f"  Average reward: {np.mean(final_100_exp_sarsa):.3f}")
print(f"  Std deviation: {np.std(final_100_exp_sarsa):.3f}")
print(f"  Success rate: {success_rate_exp_sarsa*100:.1f}%")

<font color='blue'>

**What you should remember**:
- Expected SARSA is a **hybrid approach** between Q-Learning and SARSA
- Uses the expected value over the epsilon-greedy policy distribution
- More stable than SARSA (not dependent on single stochastic action)
- Less aggressive than Q-Learning (considers actual exploration policy)
- Often provides the best balance between stability and convergence speed
</font>

<a name='6'></a>
## 6 - Algorithm Comparison

### 6.0 - Theoretical Comparison

| Aspect | Q-Learning | SARSA | Expected SARSA |
|--------|-----------|-------|----------------|
| **Policy Type** | Off-policy | On-policy | On-policy |
| **What it learns** | Optimal policy | Current policy | Current policy |
| **Target uses** | max(Q(s',a')) | Q(s',a') | E[Q(s',a')] |
| **Stability** | Medium | High | High |
| **Convergence** | Fast | Medium | Medium |
| **Risk/Caution** | Aggressive | Conservative | Balanced |
| **Best for** | Offline learning | Online w/risk | General purpose |
| **Variance** | Medium-High | Medium (samples one next action) | Low (averages over next actions) |
| **Bias** | Maximization bias (max over noisy estimates) | Medium | Medium |

<a name='ex-4'></a>
### Exercise 4 - compare_td_methods

Implement a comprehensive comparison function that:
1. Trains all three algorithms multiple times
2. Computes statistics (mean, std, success rate)
3. Returns a dictionary with results

**Hints**:
- Run multiple training runs to get reliable statistics
- Calculate final 100 episode performance
- Store results in a dictionary with algorithm names as keys
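
The statistics step in the hints can be sketched with NumPy alone. The helper below is a hypothetical illustration (the name `summarize_runs` is not part of the notebook) and assumes every run has the same number of episodes; the reward histories are synthetic.

```python
import numpy as np

def summarize_runs(reward_histories, last_n=100, success_threshold=0.5):
    """Summarize the final last_n episodes of each run, then aggregate across runs."""
    tails = np.asarray(reward_histories, dtype=float)[:, -last_n:]
    per_run_mean = tails.mean(axis=1)
    per_run_success = (tails > success_threshold).mean(axis=1)
    return {
        'mean_reward': float(per_run_mean.mean()),
        'std_reward': float(per_run_mean.std()),
        'mean_success_rate': float(per_run_success.mean()),
        'std_success_rate': float(per_run_success.std()),
    }

# Two synthetic runs of 500 episodes with 0/1 rewards
run_a = [0] * 400 + [1] * 100  # succeeds in all of the last 100 episodes
run_b = [0] * 450 + [1] * 50   # succeeds in half of the last 100 episodes
summary = summarize_runs([run_a, run_b])

print(summary)  # mean_reward 0.75, std_reward 0.25 for these two runs
```

The standard deviation here is taken across runs, not across episodes, so it measures how much the final performance varies from one training run to the next.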

In [None]:
# GRADED FUNCTION: compare_td_methods

def compare_td_methods(n_runs=5, n_episodes=500, verbose=True):
    """
    Compare Q-Learning, SARSA, and Expected SARSA across multiple runs.
    
    Arguments:
    n_runs -- number of independent training runs
    n_episodes -- episodes per run
    verbose -- whether to print progress
    
    Returns:
    results -- dictionary with comparison results
    all_rewards -- dictionary with all reward histories
    """
    # (approx. 40 lines)
    # YOUR CODE STARTS HERE
    
    all_rewards = {'Q-Learning': [], 'SARSA': [], 'Expected SARSA': []}
    results = {}
    
    for run in range(n_runs):
        # Q-Learning
        env = gym.make('FrozenLake-v1', is_slippery=False)
        agent_ql = QLearningAgent(env.action_space.n, alpha=0.1, gamma=0.99, epsilon=1.0,
                                  epsilon_decay=0.995, epsilon_min=0.01)  # same settings as Section 3.1
        ql_rewards = train_q_learning(env, agent_ql, n_episodes=n_episodes, verbose=False)
        all_rewards['Q-Learning'].append(ql_rewards)
        env.close()
        
        # SARSA
        env = gym.make('FrozenLake-v1', is_slippery=False)
        agent_sarsa = SARSAAgent(env.action_space.n, alpha=0.1, gamma=0.99, epsilon=1.0,
                                 epsilon_decay=0.995, epsilon_min=0.01)  # same settings as Section 4.1
        sarsa_rewards = train_sarsa(env, agent_sarsa, n_episodes=n_episodes, verbose=False)
        all_rewards['SARSA'].append(sarsa_rewards)
        env.close()
        
        # Expected SARSA
        env = gym.make('FrozenLake-v1', is_slippery=False)
        agent_exp = ExpectedSARSAAgent(env.action_space.n, alpha=0.1, gamma=0.99, epsilon=1.0,
                                       epsilon_decay=0.995, epsilon_min=0.01)  # same settings as Section 5.1
        exp_rewards = train_expected_sarsa(env, agent_exp, n_episodes=n_episodes, verbose=False)
        all_rewards['Expected SARSA'].append(exp_rewards)
        env.close()
        
        if verbose:
            print(f'Run {run+1}/{n_runs} completed')
    
    # Calculate statistics
    for algo_name, rewards_list in all_rewards.items():
        final_100_rewards = [rewards[-100:] for rewards in rewards_list]
        
        avg_rewards = [np.mean(r) for r in final_100_rewards]
        success_rates = [sum([1 for r in final_100 if r > 0.5]) / len(final_100) 
                        for final_100 in final_100_rewards]
        
        results[algo_name] = {
            'mean_reward': np.mean(avg_rewards),
            'std_reward': np.std(avg_rewards),
            'mean_success_rate': np.mean(success_rates),
            'std_success_rate': np.std(success_rates)
        }
    
    # YOUR CODE ENDS HERE
    return results, all_rewards

print("✓ compare_td_methods function implemented")

In [None]:
# Run comprehensive comparison
n_runs = 5
print(f'Running comprehensive comparison across {n_runs} runs...\n')
results, all_rewards = compare_td_methods(n_runs=n_runs, n_episodes=500, verbose=True)

print('\n' + '='*70)
print('COMPREHENSIVE TD LEARNING ALGORITHMS COMPARISON')
print('='*70)

for algo_name, metrics in results.items():
    print(f'\n{algo_name}:')
    print(f'  Mean reward (final 100 eps): {metrics["mean_reward"]:.4f} ± {metrics["std_reward"]:.4f}')
    print(f'  Success rate: {metrics["mean_success_rate"]*100:.1f}% ± {metrics["std_success_rate"]*100:.1f}%')

print('\n' + '='*70)

<a name='6-1'></a>
### 6.1 - Comprehensive Analysis

In [None]:
# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Individual algorithm convergence curves
window = 50
colors = {'Q-Learning': 'blue', 'SARSA': 'green', 'Expected SARSA': 'orange'}

ax = axes[0, 0]
for algo_name in ['Q-Learning', 'SARSA', 'Expected SARSA']:
    # Average across runs
    avg_rewards = np.mean(all_rewards[algo_name], axis=0)
    moving_avg = np.convolve(avg_rewards, np.ones(window)/window, mode='valid')
    ax.plot(range(window - 1, len(avg_rewards)), moving_avg, linewidth=2.5, 
            label=algo_name, color=colors[algo_name])

ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Reward (Moving Avg)', fontsize=11)
ax.set_title('Convergence Curves', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: Final performance comparison
ax = axes[0, 1]
algo_names = ['Q-Learning', 'SARSA', 'Expected SARSA']
final_rewards = [results[name]['mean_reward'] for name in algo_names]
final_stds = [results[name]['std_reward'] for name in algo_names]

colors_list = ['blue', 'green', 'orange']
ax.bar(algo_names, final_rewards, yerr=final_stds, capsize=10, 
       color=colors_list, alpha=0.7, edgecolor='black', linewidth=2)
ax.set_ylabel('Mean Reward', fontsize=11)
ax.set_title('Final Performance (Last 100 Episodes)', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

for i, (name, reward, std) in enumerate(zip(algo_names, final_rewards, final_stds)):
    ax.text(i, reward + std + 0.05, f'{reward:.3f}', ha='center', fontweight='bold')

# Plot 3: Success rate comparison
ax = axes[1, 0]
success_rates = [results[name]['mean_success_rate']*100 for name in algo_names]
success_stds = [results[name]['std_success_rate']*100 for name in algo_names]

ax.bar(algo_names, success_rates, yerr=success_stds, capsize=10,
       color=colors_list, alpha=0.7, edgecolor='black', linewidth=2)
ax.set_ylabel('Success Rate (%)', fontsize=11)
ax.set_title('Success Rate (Last 100 Episodes)', fontsize=12, fontweight='bold')
ax.set_ylim([0, 110])
ax.grid(True, alpha=0.3, axis='y')

for i, (rate, std) in enumerate(zip(success_rates, success_stds)):
    ax.text(i, rate + std + 2, f'{rate:.1f}%', ha='center', fontweight='bold')

# Plot 4: Statistical comparison table
ax = axes[1, 1]
ax.axis('tight')
ax.axis('off')

table_data = []
for algo_name in algo_names:
    metrics = results[algo_name]
    table_data.append([
        algo_name,
        f"{metrics['mean_reward']:.3f}",
        f"{metrics['std_reward']:.3f}",
        f"{metrics['mean_success_rate']*100:.1f}%"
    ])

table = ax.table(cellText=table_data,
                colLabels=['Algorithm', 'Mean Reward', 'Std Dev', 'Success Rate'],
                cellLoc='center',
                loc='center',
                colWidths=[0.25, 0.25, 0.25, 0.25])

table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2.5)

# Style header
for i in range(4):
    table[(0, i)].set_facecolor('#4CAF50')
    table[(0, i)].set_text_props(weight='bold', color='white')

# Alternate row colors
for i in range(1, 4):
    for j in range(4):
        if i % 2 == 0:
            table[(i, j)].set_facecolor('#f0f0f0')
        else:
            table[(i, j)].set_facecolor('#ffffff')

plt.tight_layout()
plt.show()

print('✓ Comprehensive comparison visualization complete')

<font color='blue'>

**What you should remember**:
- **Q-Learning**: Off-policy, learns the optimal (greedy) policy; its targets ignore the cost of exploratory actions
- **SARSA**: On-policy, learns the value of the epsilon-greedy policy it follows; more conservative
- **Expected SARSA**: Averages over the next action, which removes the sampling noise in SARSA's target
- In the deterministic FrozenLake used here, all three usually reach similar final performance
- Differences become more pronounced in stochastic environments or where exploration is costly
- Choice depends on the problem: learning from data generated by another policy (off-policy) vs learning online where mistakes are costly
</font>

<a name='7'></a>
## 7 - Cliff Walking Problem

### 7.0 - Problem Description

The **Cliff Walking** environment demonstrates the key differences between on-policy and off-policy algorithms:

- Grid: 4 rows × 12 columns
- Agent starts at (3, 0) - bottom left
- Goal at (3, 11) - bottom right
- Cliff along (3, 1-10) with reward -100
- Each step reward: -1

**The Dilemma**:
- **Optimal path** (along the cliff edge): Fast but risky (13 steps, return -13)
- **Safe path** (along the top row, away from the cliff): Slower but safe (17 steps, return -17)

**Expected behavior**:
- **Q-Learning**: Learns the risky optimal path
- **SARSA**: Learns the safe path (avoids cliff during learning)
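
The path returns can be checked by counting moves on the 4 × 12 grid. The sketch below assumes every move, including the final move into the goal, costs -1, as stated in the problem description; the helper `shortest_path_return` is hypothetical.

```python
def shortest_path_return(rows_above_start, n_cols=12, step_reward=-1):
    """Return of a path that climbs, crosses the grid, and descends to the goal."""
    n_steps = rows_above_start + (n_cols - 1) + rows_above_start
    return n_steps * step_reward

edge_return = shortest_path_return(1)     # row directly above the cliff
middle_return = shortest_path_return(2)   # one row further away
top_row_return = shortest_path_return(3)  # top row, farthest from the cliff

print(f"Cliff-edge path: {edge_return}")   # -13
print(f"Middle-row path: {middle_return}") # -15
print(f"Top-row path:    {top_row_return}")# -17
```

Each extra row of distance from the cliff costs two additional steps, one up and one down.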

In [None]:
# Create Cliff Walking environment
class CliffWalkingEnv:
    """Custom Cliff Walking environment"""
    
    def __init__(self):
        self.grid_shape = (4, 12)
        self.start = (3, 0)
        self.goal = (3, 11)
        self.current_pos = self.start
        self.action_names = ['up', 'right', 'down', 'left']
        self.state = self._pos_to_state(self.start)
    
    def _pos_to_state(self, pos):
        return pos[0] * self.grid_shape[1] + pos[1]
    
    def _state_to_pos(self, state):
        return (state // self.grid_shape[1], state % self.grid_shape[1])
    
    def _is_cliff(self, pos):
        return pos[0] == 3 and 1 <= pos[1] <= 10
    
    def reset(self):
        self.current_pos = self.start
        self.state = self._pos_to_state(self.start)
        return self.state, {}
    
    def step(self, action):
        row, col = self.current_pos
        
        if action == 0:  # up
            row = max(0, row - 1)
        elif action == 1:  # right
            col = min(self.grid_shape[1] - 1, col + 1)
        elif action == 2:  # down
            row = min(self.grid_shape[0] - 1, row + 1)
        elif action == 3:  # left
            col = max(0, col - 1)
        
        new_pos = (row, col)
        
        if self._is_cliff(new_pos):
            reward = -100
            new_pos = self.start
            done = True
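        elif new_pos == self.goal:
            # The move into the goal costs -1 like any other step,
            # so the shortest path (13 moves) has a return of -13
            reward = -1
            done = True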
        else:
            reward = -1
            done = False
        
        self.current_pos = new_pos
        self.state = self._pos_to_state(new_pos)
        
        return self.state, reward, done, False, {}

print('✓ Cliff Walking environment created')

In [None]:
# Train on Cliff Walking
print('Training algorithms on Cliff Walking...\n')

cliff_results = {}
env_cliff = CliffWalkingEnv()

# Q-Learning on Cliff Walking
agent_ql_cliff = QLearningAgent(n_actions=4, alpha=0.1, gamma=0.99, epsilon=0.1)
ql_cliff_rewards = []
for episode in range(500):
    state, _ = env_cliff.reset()
    episode_reward = 0
    
    for step in range(100):
        action = agent_ql_cliff.get_action(state)
        next_state, reward, done, _, _ = env_cliff.step(action)
        agent_ql_cliff.update(state, action, reward, next_state, done)
        episode_reward += reward
        state = next_state
        if done:
            break
    
    agent_ql_cliff.decay_epsilon()
    ql_cliff_rewards.append(episode_reward)

print(f'Q-Learning: {np.mean(ql_cliff_rewards[-100:]):.2f} avg reward (last 100 eps)')

# SARSA on Cliff Walking
agent_sarsa_cliff = SARSAAgent(n_actions=4, alpha=0.1, gamma=0.99, epsilon=0.1)
sarsa_cliff_rewards = []
for episode in range(500):
    state, _ = env_cliff.reset()
    action = agent_sarsa_cliff.get_action(state)
    episode_reward = 0
    
    for step in range(100):
        next_state, reward, done, _, _ = env_cliff.step(action)
        next_action = agent_sarsa_cliff.get_action(next_state)
        agent_sarsa_cliff.update(state, action, reward, next_state, next_action, done)
        episode_reward += reward
        state = next_state
        action = next_action
        if done:
            break
    
    agent_sarsa_cliff.decay_epsilon()
    sarsa_cliff_rewards.append(episode_reward)

print(f'SARSA: {np.mean(sarsa_cliff_rewards[-100:]):.2f} avg reward (last 100 eps)')
print('\n✓ Training on Cliff Walking complete')

# Store results
cliff_results['Q-Learning'] = np.mean(ql_cliff_rewards[-100:])
cliff_results['SARSA'] = np.mean(sarsa_cliff_rewards[-100:])

In [None]:
# Visualize Cliff Walking results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

window = 50

# Plot 1: Learning curves
ax = axes[0]
moving_avg_ql = np.convolve(ql_cliff_rewards, np.ones(window)/window, mode='valid')
moving_avg_sarsa = np.convolve(sarsa_cliff_rewards, np.ones(window)/window, mode='valid')

ax.plot(range(window-1, 500), moving_avg_ql, 'b-', linewidth=2.5, label='Q-Learning')
ax.plot(range(window-1, 500), moving_avg_sarsa, 'g-', linewidth=2.5, label='SARSA')
ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Reward', fontsize=11)
ax.set_title('Cliff Walking: Learning Curves', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.axhline(y=-13, color='r', linestyle='--', alpha=0.5, label='Optimal path')
ax.axhline(y=-25, color='orange', linestyle='--', alpha=0.5, label='Safe path')
ax.legend(fontsize=11)  # called after axhline so the reference lines appear in the legend

# Plot 2: Final performance comparison
ax = axes[1]
algorithms = ['Q-Learning', 'SARSA']
rewards = [cliff_results['Q-Learning'], cliff_results['SARSA']]
colors = ['blue', 'green']

bars = ax.bar(algorithms, rewards, color=colors, alpha=0.7, edgecolor='black', linewidth=2)
ax.set_ylabel('Average Reward', fontsize=11)
ax.set_title('Cliff Walking: Final Performance', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')
ax.axhline(y=-13, color='r', linestyle='--', alpha=0.5, linewidth=2, label='Optimal path')
ax.axhline(y=-25, color='orange', linestyle='--', alpha=0.5, linewidth=2, label='Safe path')
ax.legend()

# Rewards are negative, so the bars extend below zero; centre each label inside its bar
for bar, reward in zip(bars, rewards):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height/2,
            f'{reward:.1f}', ha='center', va='center', fontweight='bold', color='white', fontsize=11)

plt.tight_layout()
plt.show()

print('✓ Cliff Walking analysis complete')

### 7.1 - Key Observations

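**Q-Learning Results**:
- Learns the optimal policy, which runs along the cliff edge
- Final reward: ~-13, the optimal return
- Takes the risky path because its update target assumes greedy behavior from the next state onward, so occasional exploratory falls do not lower the values it learns

**SARSA Results**:
- Learns a safer path away from the cliff
- Final reward: ~-25 (worse than optimal but safer)
- Avoids the cliff edge because its update target uses the action actually taken, so the cost of exploratory falls is part of the values it learns

**Interpretation**:
- **Off-policy (Q-Learning)**: Learns the value of the greedy policy regardless of how it explores, so it finds the shortest path even though following that path with ε-greedy exploration occasionally ends in the cliff
- **On-policy (SARSA)**: Learns the value of the ε-greedy policy it actually follows, so it prefers a path on which a single random step is not catastrophic
- This is why **SARSA is often preferred in real-world applications where mistakes during exploration are costly**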
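
To make the difference concrete, the sketch below compares the bootstrap targets for a single transition into a cell directly above the cliff. The Q-values are made-up illustrative numbers, not output from the training runs above.

```python
gamma = 0.99
reward = -1.0

# Hypothetical Q-values in the next state s' (a cell right above the cliff).
# Moving "down" from s' steps into the cliff.
q_next = {"up": -14.0, "right": -12.0, "down": -100.0, "left": -14.0}

# Q-Learning always bootstraps from the best next action.
q_learning_target = reward + gamma * max(q_next.values())

# SARSA bootstraps from the action the epsilon-greedy policy actually picked.
sarsa_target_greedy = reward + gamma * q_next["right"]   # usual case
sarsa_target_explore = reward + gamma * q_next["down"]   # occasional random step

print(f"Q-Learning target:             {q_learning_target:.2f}")
print(f"SARSA target (greedy a'):      {sarsa_target_greedy:.2f}")
print(f"SARSA target (exploratory a'): {sarsa_target_explore:.2f}")
```

Whenever the exploratory case occurs, SARSA pulls the value of walking next to the cliff down, while the Q-Learning target is unaffected by it.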

<a name='8'></a>
## 8 - Summary and Key Takeaways

### 8.1 - Core Concepts

**TD Learning combines the best of both worlds:**
1. **Like DP**: Uses bootstrapping (value estimates of future states)
2. **Like MC**: Learns from real experience without a model
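3. **Better than both**: Learns online after every step, without a model (unlike DP) and with lower variance than MC returns, at the cost of some bias from bootstrapping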
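
As a small illustration of the bootstrapping point, the sketch below computes a Monte Carlo return and a one-step TD target for the same first state of an episode. The rewards and value estimates are hypothetical numbers chosen only for the example.

```python
gamma = 0.9
rewards = [-1.0, -1.0, 10.0]          # hypothetical rewards of one 3-step episode
v_estimates = [0.0, 5.0, 8.0, 0.0]    # hypothetical V(s_0), ..., V(s_3); s_3 is terminal

def mc_return(rewards, gamma):
    """Discounted return from the first state; needs the whole episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(reward, next_value, gamma):
    """One-step TD target; needs only the next reward and the next state's estimate."""
    return reward + gamma * next_value

mc = mc_return(rewards, gamma)
td = td_target(rewards[0], v_estimates[1], gamma)
td_error = td - v_estimates[0]

print(f"MC return: {mc:.2f}, TD target: {td:.2f}, TD error: {td_error:.2f}")
```

The MC return depends on every reward in the episode, while the TD target is available after a single step but inherits any error in the estimate of the next state.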

### 8.2 - Three Main Algorithms

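1. **Q-Learning (Off-Policy)**
   - Update: $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
   - Best for: Learning from data generated by another policy, or when exploration cost is low
   - Risk: Behavior during training can be dangerous, because the learned values ignore the cost of exploratory actions

2. **SARSA (On-Policy)**
   - Update: $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma Q(s',a') - Q(s,a)]$, where $a'$ is the action actually taken in $s'$
   - Best for: Online learning, when exploration is risky
   - Advantage: Learned values account for exploration, which gives more cautious behavior during training

3. **Expected SARSA (On-Policy)**
   - Update: $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \sum_{a'} \pi(a' \mid s') Q(s',a') - Q(s,a)]$, where $\pi$ is the current ε-greedy policy
   - Best for: General-purpose learning with good stability
   - Advantage: Averages over next actions instead of sampling one, which removes that source of variance from the update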
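
The three updates differ only in the bootstrap target. The sketch below writes the targets side by side as plain functions. The function names and signatures are illustrative assumptions and are not the `td_utils` API; ε-greedy is assumed as the policy for Expected SARSA.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (ties go to the first argmax)."""
    probs = np.full(len(q_values), epsilon / len(q_values))
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

def q_learning_target(reward, q_next, gamma, done):
    # Bootstraps from the greedy next action; terminal states contribute no future value.
    return reward if done else reward + gamma * float(np.max(q_next))

def sarsa_target(reward, q_next, next_action, gamma, done):
    # Bootstraps from the next action that was actually selected.
    return reward if done else reward + gamma * float(q_next[next_action])

def expected_sarsa_target(reward, q_next, epsilon, gamma, done):
    # Bootstraps from the policy-weighted average over all next actions.
    if done:
        return reward
    return reward + gamma * float(epsilon_greedy_probs(q_next, epsilon) @ q_next)

q_next = np.array([1.0, 3.0, 2.0, 0.0])  # hypothetical Q(s', .)
ql = q_learning_target(0.0, q_next, gamma=0.9, done=False)
sa = sarsa_target(0.0, q_next, next_action=2, gamma=0.9, done=False)
es = expected_sarsa_target(0.0, q_next, epsilon=0.2, gamma=0.9, done=False)
print(f"Q-Learning: {ql:.2f}, SARSA: {sa:.2f}, Expected SARSA: {es:.2f}")
```

In each case the update itself is the same: `Q[s][a] += alpha * (target - Q[s][a])`.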

### 8.3 - When to Use Each Algorithm

| Scenario | Best Choice | Why |
|----------|------------|-----|
| Offline learning from logs | Q-Learning | Off-policy: can learn from data collected by a different policy |
| Real robot, high cost of failure | SARSA | Conservative and safe |
| General purpose learning | Expected SARSA | Good balance |
| Need guaranteed convergence (tabular case) | Any | All converge with GLIE exploration and suitably decaying step sizes |
| Stochastic environment | Expected SARSA | More stable |
| Large action space | Consider function approximation | TD learning scales with approximators |

### 8.4 - Important Parameters

- **Learning rate (α)**: Controls the update step size. Common: 0.01 to 0.1
- **Discount factor (γ)**: Weights future rewards relative to immediate ones. Typically close to 1 (e.g. 0.99) for long-horizon tasks
- **Exploration rate (ε)**: Probability of taking a random action. Start high (~1.0), decay to low (~0.01)
- **Epsilon decay**: Multiplicative factor applied to ε, which sets how fast exploration decreases. Common: 0.995 per episode
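
It helps to check how long a given decay factor takes to move ε from its start value to its floor. The sketch below assumes a purely multiplicative schedule, ε ← ε · decay once per episode; the helper name is made up for this example.

```python
import math

def episodes_to_reach(eps_start, eps_min, decay):
    """Smallest n with eps_start * decay**n <= eps_min (multiplicative decay)."""
    return math.ceil(math.log(eps_min / eps_start) / math.log(decay))

n_episodes = episodes_to_reach(eps_start=1.0, eps_min=0.01, decay=0.995)
eps_after_500 = 1.0 * 0.995 ** 500

print(f"Episodes until epsilon <= 0.01: {n_episodes}")
print(f"Epsilon after 500 episodes:     {eps_after_500:.3f}")
```

With these values, ε needs roughly 900 episodes to reach 0.01 and is still about 0.08 after 500 episodes, so the decay factor should be chosen with the training length in mind.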

### 8.5 - Convergence Guarantees

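In the tabular setting, all three algorithms converge to the optimal $Q^*$ when exploration satisfies the **GLIE** conditions and the step sizes decay appropriately ($\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$):
1. **Greedy in the Limit**: The policy becomes greedy with respect to $Q$ as learning proceeds
2. **Infinite Exploration**: Every state-action pair is visited infinitely often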

Practical implementation: Use epsilon-greedy with decaying epsilon
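
One schedule that satisfies GLIE is $\epsilon_k = 1/k$ for episode $k$, and a per-pair step size of $1/N(s,a)$ satisfies the usual Robbins-Monro step-size conditions. The sketch below shows both; the helper names are hypothetical and not part of `td_utils`.

```python
from collections import defaultdict

def glie_epsilon(episode):
    """epsilon_k = 1/k for episode k = 1, 2, ...: tends to 0, yet never stops exploring."""
    return 1.0 / episode

visit_counts = defaultdict(int)

def visit_count_alpha(state, action):
    """Step size 1/N(s, a), where N counts updates of this state-action pair."""
    visit_counts[(state, action)] += 1
    return 1.0 / visit_counts[(state, action)]

eps_first, eps_later = glie_epsilon(1), glie_epsilon(100)
alpha_first = visit_count_alpha("s0", 0)   # first update of ("s0", 0)
alpha_second = visit_count_alpha("s0", 0)  # second update of the same pair
print(eps_first, eps_later, alpha_first, alpha_second)
```

A multiplicative decay clipped at a floor such as 0.01 keeps exploring forever but is not strictly greedy in the limit, so it trades the formal guarantee for faster learning in practice.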

### 8.6 - Common Challenges and Solutions

| Challenge | Solution | Algorithm |
|-----------|----------|----------|
| Q-values grow unbounded | Use learning rate decay or target networks | DQN |
| High variance in updates | Use experience replay or mini-batches | DQN, Dueling |
| Overestimation of Q-values | Use Double Q-Learning or Expected SARSA | Double Q-Learning |
| Sample inefficiency | Prioritized experience replay | DQN variants |
| Discrete state space limitation | Function approximation (neural networks) | Deep Q-Learning |
| Continuous action spaces | Policy gradient methods | A3C, PPO, TRPO |
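
As an example of one remedy from the table, the following is a minimal sketch of a tabular Double Q-Learning update. It assumes two Q-tables stored as dictionaries of NumPy arrays; the function name and signature are illustrative only.

```python
import numpy as np
from collections import defaultdict

def double_q_update(q_a, q_b, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99, rng=None):
    """Update one of two Q-tables, using the other one to evaluate the chosen action."""
    if rng is None:
        rng = np.random.default_rng()
    # Pick at random which table is updated on this step.
    q_sel, q_eval = (q_a, q_b) if rng.random() < 0.5 else (q_b, q_a)
    if done:
        target = reward
    else:
        # Select the action with one table, evaluate it with the other.
        best = int(np.argmax(q_sel[next_state]))
        target = reward + gamma * q_eval[next_state][best]
    q_sel[state][action] += alpha * (target - q_sel[state][action])

n_actions = 4
q_a = defaultdict(lambda: np.zeros(n_actions))
q_b = defaultdict(lambda: np.zeros(n_actions))
q_a[1][1], q_b[1][1] = 5.0, 1.0  # hypothetical estimates for state 1

double_q_update(q_a, q_b, state=0, action=0, reward=-1.0, next_state=1,
                done=False, rng=np.random.default_rng(0))
print(q_a[0][0], q_b[0][0])
```

Decoupling action selection from action evaluation is what reduces the upward bias introduced by taking a max over noisy estimates.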

### 8.7 - Next Steps in Reinforcement Learning

After mastering TD Learning:

1. **Function Approximation**: Use neural networks for large state spaces
   - Deep Q-Networks (DQN)
   - Double DQN
   - Dueling DQN

2. **Policy Gradient Methods**: Learn policy directly
   - REINFORCE
   - Actor-Critic methods
   - A3C, PPO, TRPO

3. **Model-Based Methods**: Learn environment model
   - Dyna-Q
   - World Models
   - Planning methods

4. **Multi-Agent Learning**: Competitive and cooperative settings

5. **Advanced Topics**:
   - Inverse reinforcement learning
   - Meta-reinforcement learning
   - Transfer learning in RL

<font color='blue'>

**What you should remember**:
- **TD Learning** combines bootstrapping with experience-based learning
- **Q-Learning** (off-policy) learns the value of the greedy policy, regardless of how the agent explores
- **SARSA** (on-policy) learns the value of the policy it actually follows, exploration included
- **Expected SARSA** averages over next actions instead of sampling one, which reduces update variance
- The choice depends on problem characteristics and constraints
- In the tabular case, all three converge under GLIE exploration with suitably decaying step sizes
- These fundamentals are the building blocks of modern deep RL
</font>

In [None]:
print('='*70)
print('TEMPORAL DIFFERENCE LEARNING - COMPLETE')
print('='*70)
print()
print('You have successfully learned:')
print('  ✓ TD Learning fundamentals and theory')
print('  ✓ Q-Learning algorithm (off-policy)')
print('  ✓ SARSA algorithm (on-policy)')
print('  ✓ Expected SARSA algorithm')
print('  ✓ Algorithm comparison and analysis')
print('  ✓ Practical applications and trade-offs')
print()
print('Next steps:')
print('  → Explore function approximation with neural networks')
print('  → Implement Deep Q-Networks (DQN)')
print('  → Study policy gradient methods')
print('  → Apply to real-world problems')
print()
print('='*70)