# Monte Carlo Control

Welcome to Monte Carlo Control! Monte Carlo Control is a model-free reinforcement learning (RL) method that learns from complete episodes. By the end of this notebook, you'll be able to:

* Understand how Monte Carlo methods differ from Dynamic Programming and TD Learning
* Generate episodes using an ε-greedy policy
* Calculate returns (cumulative rewards) from episodes
* Implement first-visit Monte Carlo for Q-value estimation
* Build a complete Monte Carlo Control algorithm

## Monte Carlo Methods: Key Concepts

**Monte Carlo = Learning from complete episodes**

| Method | What is updated | Requires model | Bootstraps |
|--------|-----------------|----------------|------------|
| **DP** | All states, on every sweep | ✅ Yes | ✅ Yes |
| **MC** | Visited states, after the episode ends | ❌ No | ❌ No |
| **TD** | Visited state, after each step | ❌ No | ✅ Yes |

**Key Ideas:**
- Learn from **complete episodes** (trajectory from start to end)
- Use **actual returns** $G_t$ instead of bootstrapping
- No model required (model-free)
- Can only be applied to **episodic tasks**, because $G_t$ is known only once the episode has terminated

**Monte Carlo Update:**
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[G_t - Q(s,a)\right]$$

Where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$ is the actual return.
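
To make the "Bootstrap" column concrete, here is a minimal sketch that computes the Monte Carlo target next to a one-step TD (SARSA-style) target for the same transition. The Q-table, the transition, and the rewards are made-up toy values, and the TD target formula is included only for contrast; it is not used elsewhere in this notebook.

```python
import numpy as np

gamma = 0.9

# Hypothetical Q-table for a toy problem with 3 states and 2 actions
Q = np.array([[0.2, 0.5],
              [0.1, 0.4],
              [0.0, 0.0]])

# One observed transition (toy values)
reward, next_state, next_action = 0.0, 1, 1

# All rewards observed from this step until the episode ended: R_{t+1}, R_{t+2}
rewards_to_end = [0.0, 1.0]

# MC target: the actual discounted return, available only after the episode ends
mc_target = sum(gamma**k * r for k, r in enumerate(rewards_to_end))

# TD target: one real reward plus the current estimate of what follows (bootstrapping)
td_target = reward + gamma * Q[next_state, next_action]

print(f"MC target: {mc_target:.2f}, TD target: {td_target:.2f}")
```

The MC target depends only on observed rewards, while the TD target changes whenever the estimate `Q[next_state, next_action]` changes.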

## Important Note on Submission

Please ensure:
1. No extra print statements
2. No extra code cells
3. Function parameters unchanged
4. No global variables in graded functions

## Table of Contents
- [1 - Packages](#1)
- [2 - Episode Generation](#2)
    - [Exercise 1 - generate_episode](#ex-1)
- [3 - Return Calculation](#3)
    - [Exercise 2 - calculate_returns](#ex-2)
- [4 - Q-Value Update](#4)
    - [Exercise 3 - update_q_values](#ex-3)
- [5 - Complete MC Control](#5)
    - [Exercise 4 - monte_carlo_control](#ex-4)
- [6 - Testing and Comparison](#6)

<a name='1'></a>
## 1 - Packages

In [None]:
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
from collections import defaultdict
from monte_carlo_tests import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (12.0, 6.0)

np.random.seed(42)

<a name='2'></a>
## 2 - Episode Generation

Monte Carlo learns from **episodes**: sequences of (state, action, reward) tuples from start to terminal state.

**Episode structure:**
```
Episode = [(s0, a0, r1), (s1, a1, r2), ..., (s_{T-1}, a_{T-1}, r_T)]
```

We use **ε-greedy** policy for exploration:
- With probability ε: random action
- With probability 1-ε: greedy action (argmax Q)
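
The two cases above can be sketched as a small helper. The name `epsilon_greedy_action` and the use of a NumPy `Generator` are assumptions made for illustration; the graded function may draw its random numbers differently.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick an action index from a 1-D array of action values (hypothetical helper)."""
    if rng.random() < epsilon:
        # Explore: uniform random action (this may also pick the greedy one)
        return int(rng.integers(len(q_values)))
    # Exploit: greedy action; np.argmax returns the lowest index on ties
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2, 0.0])   # toy action values, action 1 is greedy

actions = [epsilon_greedy_action(q_values, epsilon=0.2, rng=rng) for _ in range(10_000)]
greedy_fraction = np.mean(np.array(actions) == 1)
print(f"Fraction of greedy actions: {greedy_fraction:.3f}")
```

Because the random branch can also land on the greedy action, the greedy action is chosen with probability $1 - \varepsilon + \varepsilon/|A|$, which is 0.85 for these toy values. Note also that with an all-zero Q-table, `np.argmax` always returns action 0, so early exploration comes entirely from the random branch.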

<a name='ex-1'></a>
### Exercise 1 - generate_episode

Generate one complete episode using ε-greedy policy.

In [None]:
# GRADED FUNCTION: generate_episode

def generate_episode(env, Q, epsilon=0.1, max_steps=100):
    """
    Generate one episode using epsilon-greedy policy.
    
    Arguments:
    env -- Gymnasium environment
    Q -- Q-table (defaultdict), Q[state][action]
    epsilon -- exploration rate
    max_steps -- maximum steps per episode
    
    Returns:
    episode -- list of (state, action, reward) tuples
    """
    # (approx. 12-15 lines)
    # 1. Initialize episode list
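    # 2. Reset environment to get initial state
    #    (Gymnasium: env.reset() returns (state, info))
    # 3. Loop until done or max_steps:
    #    a. Choose action using epsilon-greedy:
    #       - Random with prob epsilon
    #       - Argmax Q[state] with prob 1-epsilon
    #    b. Take action, get next_state and reward
    #       (Gymnasium: env.step(action) returns
    #        (next_state, reward, terminated, truncated, info);
    #        done = terminated or truncated)
    #    c. Append (state, action, reward) to episode
    #    d. Update state
    #    e. If done, break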
    # 4. Return episode
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return episode

In [None]:
# Test your implementation
env = gym.make('FrozenLake-v1', is_slippery=False)
Q = defaultdict(lambda: np.zeros(env.action_space.n))

episode = generate_episode(env, Q, epsilon=0.5)
print(f"Episode length: {len(episode)}")
print(f"First 3 steps: {episode[:3]}")
print(f"Format: (state, action, reward)")

generate_episode_test(generate_episode)
env.close()

<a name='3'></a>
## 3 - Return Calculation

The **return** $G_t$ is the cumulative discounted reward from time $t$ until the episode ends at time $T$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$

**Efficient computation (backward):**
```python
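T = len(rewards)                 # rewards[t] holds R_{t+1}
returns = [0.0] * T
G = 0.0                          # the return after the final step is 0
for t in reversed(range(T)):
    G = rewards[t] + gamma * G   # G_t = R_{t+1} + gamma * G_{t+1}
    returns[t] = G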
```
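
The backward pass works because the return satisfies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, with the return after the last step equal to 0. As a sanity check, the sketch below compares the backward pass against a direct evaluation of the definition on a toy reward list. The function names `returns_backward` and `returns_direct` are hypothetical, and both take a plain list of rewards rather than an episode of tuples.

```python
def returns_backward(rewards, gamma):
    """One backward pass, O(T): G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

def returns_direct(rewards, gamma):
    """Direct evaluation of the definition for every t, O(T^2)."""
    return [sum(gamma**k * r for k, r in enumerate(rewards[t:]))
            for t in range(len(rewards))]

rewards = [1.0, 0.0, -2.0, 3.0]   # toy rewards
gamma = 0.5

backward = returns_backward(rewards, gamma)
direct = returns_direct(rewards, gamma)
print(backward)
print(direct)
```

Both lists agree; the backward version simply avoids recomputing the tail of the sum for every time step.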

<a name='ex-2'></a>
### Exercise 2 - calculate_returns

Calculate returns for each time step in an episode.

In [None]:
# GRADED FUNCTION: calculate_returns

def calculate_returns(episode, gamma=0.99):
    """
    Calculate returns for each step in episode.
    
    Arguments:
    episode -- list of (state, action, reward) tuples
    gamma -- discount factor
    
    Returns:
    returns -- list of returns, returns[t] = G_t
    """
    # (approx. 6-8 lines)
    # 1. Initialize returns list (same length as episode)
    # 2. Initialize G = 0
    # 3. Loop backwards through episode (from T-1 to 0):
    #    a. Get reward at time t: episode[t][2]
    #    b. Update G = reward + gamma * G
    #    c. Store returns[t] = G
    # 4. Return returns list
    
    # Hint: Use reversed(range(len(episode)))
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return returns

In [None]:
# Test your implementation
test_episode = [(0, 1, 0), (1, 2, 0), (2, 1, 1)]  # Last reward = 1
returns = calculate_returns(test_episode, gamma=0.9)

print("Episode rewards: [0, 0, 1]")
print(f"Returns: {returns}")
print(f"\nExpected:")
print(f"  G[2] = 1")
print(f"  G[1] = 0 + 0.9*1 = 0.9")
print(f"  G[0] = 0 + 0.9*0.9 = 0.81")

calculate_returns_test(calculate_returns)

<a name='4'></a>
## 4 - Q-Value Update

**First-visit Monte Carlo**: Update Q(s,a) only the first time (s,a) is visited in an episode, using the return that follows that first visit.

**Update rule:**
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[G - Q(s,a)\right]$$

**Algorithm:**
```
For each (state, action) in episode:
    If first visit to (state, action):
        G = return from that point
        Q(state, action) += alpha * (G - Q(state, action))
```
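
The first-visit check matters only when a (state, action) pair repeats within an episode. The sketch below finds the time steps that first-visit MC would update in a toy episode with one repeated pair; `first_visit_indices` is a hypothetical helper, not one of the graded functions.

```python
def first_visit_indices(episode):
    """Return the time steps at which each (state, action) pair first appears."""
    seen = set()
    indices = []
    for t, (state, action, _reward) in enumerate(episode):
        if (state, action) not in seen:
            seen.add((state, action))
            indices.append(t)
    return indices

# Toy episode: the pair (0, 2) occurs at t = 0 and again at t = 2
episode = [(0, 2, 0.0), (1, 0, 0.0), (0, 2, 0.0), (1, 1, 1.0)]

first = first_visit_indices(episode)
print(first)
```

Here the pair (0, 2) is updated once, with the return from t = 0. Every-visit MC would update it a second time with the return from t = 2.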

<a name='ex-3'></a>
### Exercise 3 - update_q_values

Update Q-values using first-visit Monte Carlo.

In [None]:
# GRADED FUNCTION: update_q_values

def update_q_values(Q, episode, returns, alpha=0.1):
    """
    Update Q-values using first-visit Monte Carlo.
    
    Arguments:
    Q -- Q-table (defaultdict)
    episode -- list of (state, action, reward) tuples
    returns -- list of returns for each time step
    alpha -- learning rate
    
    Returns:
    Q -- updated Q-table
    """
    # (approx. 8-10 lines)
    # 1. Initialize set to track visited (state, action) pairs
    # 2. For each time step t in episode:
    #    a. Get state, action from episode[t]
    #    b. Check if (state, action) already visited
    #    c. If first visit:
    #       - Get return G from returns[t]
    #       - Update: Q[state][action] += alpha * (G - Q[state][action])
    #       - Add (state, action) to visited set
    # 3. Return Q
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return Q

In [None]:
# Test your implementation
Q_test = defaultdict(lambda: np.zeros(4))
test_episode = [(0, 1, 0), (1, 2, 0), (2, 1, 1)]
test_returns = [0.81, 0.9, 1.0]

Q_test = update_q_values(Q_test, test_episode, test_returns, alpha=0.1)
print(f"Q[0][1] after update: {Q_test[0][1]:.4f}")
print(f"Expected: 0 + 0.1*(0.81 - 0) = 0.0810")

update_q_values_test(update_q_values)

<a name='5'></a>
## 5 - Complete Monte Carlo Control

**Monte Carlo Control Algorithm:**
```
Initialize Q(s,a) arbitrarily
For each episode:
    Generate episode using ε-greedy policy
    Calculate returns G_t for each t
    For each (s,a) in episode (first-visit):
        Q(s,a) ← Q(s,a) + α[G - Q(s,a)]
```
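
The pseudocode keeps ε fixed, whereas the exercise signature also takes `epsilon_decay` and `epsilon_min`. One common way to use them, assumed here rather than specified by the notebook, is multiplicative decay with a floor applied after each episode. The helper `epsilon_schedule` is hypothetical and only shows how ε would evolve under that rule.

```python
def epsilon_schedule(epsilon, epsilon_decay, epsilon_min, n_episodes):
    """Epsilon used in each episode under multiplicative decay with a floor (assumed rule)."""
    schedule = []
    for _ in range(n_episodes):
        schedule.append(epsilon)
        # Shrink epsilon after the episode, but never below epsilon_min
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return schedule

schedule = epsilon_schedule(epsilon=1.0, epsilon_decay=0.999,
                            epsilon_min=0.01, n_episodes=5000)

# First episode that runs with epsilon at its floor
first_floor = next(i for i, e in enumerate(schedule) if e <= 0.01)
print(f"epsilon reaches {schedule[-1]} at episode {first_floor}")
```

With a decay of 0.999 and a start value of 1.0, ε reaches the floor of 0.01 only after roughly 4,600 episodes, so under this rule most of a 5,000-episode run is still exploring noticeably.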

<a name='ex-4'></a>
### Exercise 4 - monte_carlo_control

Implement complete Monte Carlo Control.

In [None]:
# GRADED FUNCTION: monte_carlo_control

def monte_carlo_control(env, n_episodes=5000, alpha=0.1, gamma=0.99,
                       epsilon=0.1, epsilon_decay=0.999, epsilon_min=0.01):
    """
    Monte Carlo Control algorithm.
    
    Arguments:
    env -- Gymnasium environment
    n_episodes -- number of episodes
    alpha -- learning rate
    gamma -- discount factor
    epsilon -- initial exploration rate
    epsilon_decay -- decay rate for epsilon
    epsilon_min -- minimum epsilon
    
    Returns:
    Q -- learned Q-table
    policy -- greedy policy extracted from Q
    rewards_history -- list of total rewards per episode
    """
    # (approx. 15-18 lines)
    # 1. Initialize Q as defaultdict
    # 2. Initialize rewards_history
    # 3. For each episode:
    #    a. Generate episode using your generate_episode function
    #    b. Calculate returns using your calculate_returns function
    #    c. Update Q-values using your update_q_values function
    #    d. Calculate total reward (sum of rewards in episode)
    #    e. Store reward in history
    #    f. Decay epsilon
    # 4. Extract greedy policy from Q
    # 5. Return Q, policy, rewards_history
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return Q, policy, rewards_history

<a name='6'></a>
## 6 - Testing and Comparison
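
Once training has produced a Q-table, the greedy policy can be read off with an argmax per state. The sketch below assumes the policy is a plain dict mapping state to action; the notebook does not fix this format, so the grader may expect something different. `greedy_policy_from_q` and the demo Q-table are hypothetical.

```python
from collections import defaultdict
import numpy as np

def greedy_policy_from_q(Q):
    """Map each state present in Q to its highest-valued action (assumed dict format)."""
    return {state: int(np.argmax(action_values)) for state, action_values in Q.items()}

# Hypothetical Q-table with the same structure as in the exercises
Q_demo = defaultdict(lambda: np.zeros(4))
Q_demo[0][2] = 0.5   # in state 0, action 2 looks best
Q_demo[1][1] = 0.3   # in state 1, action 1 looks best

policy_demo = greedy_policy_from_q(Q_demo)

# FrozenLake action encoding: 0 = left, 1 = down, 2 = right, 3 = up
arrows = {0: "←", 1: "↓", 2: "→", 3: "↑"}
for state, action in sorted(policy_demo.items()):
    print(f"state {state}: {arrows[action]}")
```

States that were never visited during training do not appear in `Q`, so they are also missing from a policy built this way.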

In [None]:
# Train Monte Carlo Control
env = gym.make('FrozenLake-v1', is_slippery=False)

print("Training Monte Carlo Control...\n")
Q, policy, rewards = monte_carlo_control(
    env, n_episodes=5000, alpha=0.1, gamma=0.99,
    epsilon=1.0, epsilon_decay=0.999, epsilon_min=0.01
)

print(f"Training completed!")
print(f"Final success rate (last 100): {np.mean(rewards[-100:]):.3f}")

# Visualize
window = 50
moving_avg = np.convolve(rewards, np.ones(window)/window, mode='valid')

plt.figure(figsize=(10, 5))
plt.plot(rewards, alpha=0.3, label='Episode reward')
plt.plot(range(window-1, len(rewards)), moving_avg, 
         label=f'Moving average ({window})', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Monte Carlo Control Learning Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

env.close()

## Congratulations!

You've successfully implemented Monte Carlo Control! Here's what you've learned:

✅ How to generate episodes with ε-greedy policies

✅ How to calculate returns efficiently

✅ First-visit Monte Carlo updates

✅ Complete MC Control algorithm

### Key Takeaways:

1. **Model-free**: MC doesn't need environment dynamics
2. **Episodic**: Requires complete episodes, so it does not apply to continuing (non-terminating) tasks
3. **No bootstrapping**: Uses actual returns, not estimates
4. **High variance**: Returns can vary a lot between episodes
5. **Unbiased targets**: Each return is an unbiased sample of the Q-value of the policy that generated it (no bootstrapping bias)
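
Takeaways 4 and 5 can be illustrated with a toy task that is not part of this notebook: the agent receives a reward of 1 per step, the episode ends after each step with probability 0.5, and γ = 1. The return then equals the episode length, whose expected value is 2. The sketch samples such returns directly instead of running an environment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy returns (assumption): episode length is geometric with p = 0.5,
# reward 1 per step, gamma = 1, so the true expected return is 1 / 0.5 = 2
returns = rng.geometric(p=0.5, size=20_000).astype(float)

single_episode_spread = returns.std()    # how far a single return tends to be from the mean
estimate_after_10 = returns[:10].mean()  # noisy estimate from a handful of episodes
estimate_after_all = returns.mean()      # estimate from all sampled episodes

print(f"std of one return: {single_episode_spread:.2f}")
print(f"mean of 10 returns: {estimate_after_10:.2f}")
print(f"mean of 20000 returns: {estimate_after_all:.2f}")
```

Individual returns scatter widely around 2 (high variance), yet their average settles near 2 as more episodes are included (unbiased).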

### Monte Carlo vs Other Methods:

| Method | Model | Bootstraps | Variance of target | Bias of target | Complete episodes |
|--------|-------|------------|--------------------|----------------|-------------------|
| **MC** | Model-free | No | High | None | Required |
| **TD** | Model-free | Yes | Low | Small | Not required |
| **DP** | Model-based | Yes | None | None | Not required |

### Next Steps:

- Compare MC with Q-Learning on the same environment
- Learn about **every-visit MC** vs first-visit
- Explore **off-policy MC** with importance sampling
- Understand **n-step methods** (hybrid between MC and TD)