# Monte Carlo Methods in Reinforcement Learning

By the end of this notebook, you'll be able to:
* Understand the principles of Monte Carlo (MC) methods
* Implement First-Visit and Every-Visit MC Prediction
* Implement MC Control with on-policy (epsilon-greedy) learning
* Implement MC Control with off-policy learning using Importance Sampling
* Compare MC methods with Dynamic Programming

**Estimated time**: 3-4 hours

**Prerequisites**: Deep understanding of Dynamic Programming and basic Reinforcement Learning concepts

## Table of Contents

- [1 - Packages](#1)
- [2 - Monte Carlo Introduction](#2)
    - [2.1 - Model-Free Learning](#2-1)
    - [2.2 - Returns and Discounting](#2-2)
- [3 - MC Prediction](#3)
    - [Exercise 1 - implement_first_visit_mc_prediction](#ex-1)
    - [Exercise 2 - implement_every_visit_mc_prediction](#ex-2)
- [4 - MC Control On-Policy](#4)
    - [Exercise 3 - implement_mc_control_on_policy](#ex-3)
- [5 - MC Control Off-Policy](#5)
    - [Exercise 4 - implement_mc_control_off_policy](#ex-4)
    - [Exercise 5 - implement_importance_sampling](#ex-5)
- [6 - Comparative Analysis](#6)
- [7 - Conclusion](#7)

<a name='1'></a>
## 1 - Packages

In [None]:
import sys
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Configure visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

# Import utility functions
sys.path.append('/home/user/Reinforcement-learning-guide/notebooks')
from mc_utils import (
    calculate_returns, first_visit_mc_prediction, every_visit_mc_prediction,
    mc_control_on_policy, mc_control_off_policy,
    first_visit_mc_prediction_test, every_visit_mc_prediction_test,
    mc_control_on_policy_test, mc_control_off_policy_test
)

print('✓ All imports successful')
print(f'NumPy version: {np.__version__}')

<a name='2'></a>
## 2 - Monte Carlo Introduction

Monte Carlo methods learn directly from experience without requiring knowledge of the environment dynamics.

### Key Characteristics:
1. **Model-Free**: No need for transition probabilities $p(s',r|s,a)$
2. **Episode-based**: Learn at the end of complete episodes
3. **Unbiased**: Each sampled return is an unbiased estimate of $V^\pi(s)$, so first-visit averages converge to the true values
4. **High variance**: Requires many episodes for convergence

<a name='2-1'></a>
### 2.1 - Model-Free Learning

Unlike Dynamic Programming:
- DP uses bootstrapping: $V(s) \leftarrow \mathbb{E}[R + \gamma V(s')]$
- MC uses actual returns: $V(s) \leftarrow \text{average}(G_1, G_2, G_3, ...)$

<a name='2-2'></a>
### 2.2 - Returns and Discounting

The return $G_t$ is the discounted sum of future rewards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$

Key property: $G_t = R_{t+1} + \gamma G_{t+1}$
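The recursive property suggests computing all returns of an episode in a single backward pass. The sketch below is one way to do that; `discounted_returns` is a hypothetical name used for illustration and is not necessarily how `calculate_returns` in `mc_utils` is written.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for one episode's rewards [R_1, ..., R_T]."""
    returns = [0.0] * len(rewards)
    g = 0.0  # the return after the terminal step is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns


example_returns = discounted_returns([5, 2, -1], gamma=0.9)
print([round(g, 4) for g in example_returns])
```

Working backward costs one multiplication and one addition per step, instead of re-summing the tail of the episode for every $t$.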

In [None]:
# Example: Calculate returns for an episode
rewards = [5, 2, -1]  # Example rewards
gamma = 0.9
returns = calculate_returns(rewards, gamma)

print("Episode: S0 → S1 → S2 → S3 (terminal)")
print("=" * 50)
print(f"Rewards: {rewards}\nGamma: {gamma}\n")
for t, (r, G) in enumerate(zip(rewards, returns)):
    print(f"Step t={t}: Reward R_{t+1} = {r}, Return G_{t} = {G:.4f}")

<a name='3'></a>
## 3 - MC Prediction: Policy Evaluation

**Objective**: Given a policy $\pi$, estimate the value function $V^\pi(s)$

$$V^\pi(s) = \mathbb{E}_{\pi}[G_t | S_t = s]$$

**Approach**: Average the observed returns for each state

### First-Visit vs Every-Visit

<table>
  <tr>
    <th>Aspect</th>
    <th>First-Visit</th>
    <th>Every-Visit</th>
  </tr>
  <tr>
    <td><b>Definition</b></td>
    <td>Average only first visit to state per episode</td>
    <td>Average all visits to state in episode</td>
  </tr>
  <tr>
    <td><b>Samples per episode</b></td>
    <td>Maximum one per state</td>
    <td>Multiple possible</td>
  </tr>
  <tr>
    <td><b>Convergence</b></td>
    <td>Guaranteed</td>
    <td>Guaranteed</td>
  </tr>
  <tr>
    <td><b>Bias</b></td>
    <td>Unbiased (one independent return per episode)</td>
    <td>Biased for finite data (returns within an episode are correlated), but consistent</td>
  </tr>
</table>
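To make the difference concrete, here is a small sketch that records returns for a single hand-written episode in which state `A` is visited twice. The episode, the helper name `collect_returns`, and the choice $\gamma = 1$ are assumptions made only for this illustration.

```python
from collections import defaultdict


def collect_returns(states, returns, first_visit):
    """Group the per-step returns of one episode by state."""
    recorded = defaultdict(list)
    seen = set()
    for s, g in zip(states, returns):
        if first_visit and s in seen:
            continue  # skip repeat visits within the same episode
        recorded[s].append(g)
        seen.add(s)
    return dict(recorded)


# Hypothetical episode: A -> B -> A -> C -> terminal, rewards [1, 0, 2, 0], gamma = 1
demo_states = ["A", "B", "A", "C"]
demo_returns = [3, 2, 2, 0]  # G_t = sum of the rewards from step t onward

first_visit_returns = collect_returns(demo_states, demo_returns, first_visit=True)
every_visit_returns = collect_returns(demo_states, demo_returns, first_visit=False)
print("First-visit:", first_visit_returns)
print("Every-visit:", every_visit_returns)
```

First-visit keeps only the return 3 for `A`, while every-visit keeps both 3 and 2. For states visited once per episode the two methods record the same data.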

<a name='ex-1'></a>
### Exercise 1 - implement_first_visit_mc_prediction

Implement First-Visit MC Prediction that evaluates a given policy.

**Instructions:**
- Generate episodes following the policy
- For each state, record the return only on first visit per episode
- Update value estimate as average of returns

**Formula:**
$$V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)$$

In [None]:
# GRADED FUNCTION: implement_first_visit_mc_prediction

def implement_first_visit_mc_prediction(env, policy, num_episodes, max_steps, gamma=0.99):
    """
    First-Visit Monte Carlo Prediction.
    
    Arguments:
    env -- environment with reset() and step() methods
    policy -- function that returns action given state
    num_episodes -- number of episodes to generate
    max_steps -- maximum steps per episode
    gamma -- discount factor
    
    Returns:
    V -- dictionary of state values
    visit_counts -- dictionary of visit counts per state
    """
    
    V = defaultdict(float)
    visit_counts = defaultdict(int)
    returns = defaultdict(list)
    
    # YOUR CODE STARTS HERE
    for episode in range(num_episodes):
        state = env.reset()
        trajectory = []
        rewards = []
        visited_states = set()
        
        # Generate episode
        for step in range(max_steps):
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            trajectory.append(state)
            rewards.append(reward)
            
            if done:
                break
            state = next_state
        
        # Calculate returns and update V (first-visit only).
        # trajectory and rewards have equal length, so an episode cut off at
        # max_steps still contributes the return of its last visited state.
        episode_returns = calculate_returns(rewards, gamma)
        for s, G in zip(trajectory, episode_returns):
            if s not in visited_states:
                returns[s].append(G)
                visit_counts[s] += 1
                V[s] = np.mean(returns[s])
                visited_states.add(s)
    # YOUR CODE ENDS HERE
    
    return dict(V), dict(visit_counts)

# Test your implementation
first_visit_mc_prediction_test(implement_first_visit_mc_prediction)
print("✓ First-Visit MC Prediction implementation passed!")

<a name='ex-2'></a>
### Exercise 2 - implement_every_visit_mc_prediction

Implement Every-Visit MC Prediction.

**Instructions:**
- Generate episodes following the policy
- For each state, record the return for ALL visits in the episode
- Update value estimate as average of all returns

In [None]:
# GRADED FUNCTION: implement_every_visit_mc_prediction

def implement_every_visit_mc_prediction(env, policy, num_episodes, max_steps, gamma=0.99):
    """
    Every-Visit Monte Carlo Prediction.
    
    Arguments:
    env -- environment with reset() and step() methods
    policy -- function that returns action given state
    num_episodes -- number of episodes to generate
    max_steps -- maximum steps per episode
    gamma -- discount factor
    
    Returns:
    V -- dictionary of state values
    visit_counts -- dictionary of visit counts per state
    """
    
    V = defaultdict(float)
    visit_counts = defaultdict(int)
    returns = defaultdict(list)
    
    # YOUR CODE STARTS HERE
    for episode in range(num_episodes):
        state = env.reset()
        trajectory = []
        rewards = []
        
        # Generate episode
        for step in range(max_steps):
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            trajectory.append(state)
            rewards.append(reward)
            
            if done:
                break
            state = next_state
        
        # Calculate returns and update V (every visit).
        # trajectory and rewards have equal length, so an episode cut off at
        # max_steps still contributes the return of its last visited state.
        episode_returns = calculate_returns(rewards, gamma)
        for s, G in zip(trajectory, episode_returns):
            returns[s].append(G)
            visit_counts[s] += 1
            V[s] = np.mean(returns[s])
    # YOUR CODE ENDS HERE
    
    return dict(V), dict(visit_counts)

# Test your implementation
every_visit_mc_prediction_test(implement_every_visit_mc_prediction)
print("✓ Every-Visit MC Prediction implementation passed!")

<font color='blue'>

**What you should remember**:
- **First-Visit MC**: Uses only the first visit to each state per episode
  - Pro: Returns are independent across episodes, so the estimate is unbiased
  - Con: Less data per episode
- **Every-Visit MC**: Uses all visits to each state per episode
  - Pro: More data per episode, potentially faster convergence
  - Con: Returns within an episode are correlated, so the estimate is biased for finite data
- Both converge to the true value function as $N(s) \to \infty$

</font>

<a name='4'></a>
## 4 - MC Control On-Policy: Learning Optimal Policies

Now we move from **evaluation** to **control** - finding the optimal policy.

**Key Idea**: Alternate between:
1. **Policy Evaluation**: Estimate $Q^\pi(s,a)$
2. **Policy Improvement**: Make policy greedy w.r.t. $Q$

### Exploration-Exploitation: ε-Greedy Policy

$$\pi(a|s) = \begin{cases}
1 - \epsilon + \frac{\epsilon}{|A|} & \text{if } a = \arg\max_{a'} Q(s,a') \\
\frac{\epsilon}{|A|} & \text{otherwise}
\end{cases}$$

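- With probability $\epsilon$: explore (uniformly random action, which may also be the greedy one, hence the extra $\frac{\epsilon}{|A|}$ on the greedy action)
- With probability $1-\epsilon$: exploit (best known action)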
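The probabilities in the formula can be checked with a short sketch. `epsilon_greedy_probs` is a hypothetical helper, the Q-values are made up, and ties are assumed to be broken in favor of the first maximal action (NumPy's `argmax` behavior).

```python
import numpy as np


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for a single state."""
    q_values = np.asarray(q_values, dtype=float)
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)   # exploration mass, spread uniformly
    probs[int(np.argmax(q_values))] += 1.0 - epsilon  # remaining mass on the greedy action
    return probs


demo_probs = epsilon_greedy_probs([0.1, 0.5, -0.2, 0.0], epsilon=0.1)
print(demo_probs)
```

With four actions and $\epsilon = 0.1$, the greedy action receives $1 - 0.1 + 0.1/4 = 0.925$ and each other action receives $0.025$.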

<a name='ex-3'></a>
### Exercise 3 - implement_mc_control_on_policy

Implement MC Control with on-policy ε-greedy learning.

**Instructions:**
- Generate episodes using ε-greedy policy
- Update Q-values using first-visit returns
- Improve policy based on updated Q-values

**Algorithm:**
1. Initialize Q(s,a) = 0
2. For each episode:
   - Generate trajectory using ε-greedy policy
   - Calculate returns G for each (s,a) pair
   - Update: $Q(s,a) \leftarrow Q(s,a) + \alpha[G - Q(s,a)]$
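The update rule covers two cases. With $\alpha = 1/N(s,a)$ it reproduces the sample average of the returns, while a constant $\alpha$ weights recent returns more heavily. The sketch below compares the two on made-up returns; both helper names are hypothetical.

```python
def incremental_mean(samples):
    """Running average: equivalent to alpha = 1/N at the N-th sample."""
    q, n = 0.0, 0
    for g in samples:
        n += 1
        q += (g - q) / n
    return q


def constant_alpha(samples, alpha, q0=0.0):
    """Exponential recency-weighted average with a fixed step size."""
    q = q0
    for g in samples:
        q += alpha * (g - q)
    return q


demo_samples = [2.0, 4.0, 9.0]
mean_estimate = incremental_mean(demo_samples)
alpha_estimate = constant_alpha(demo_samples, alpha=0.5)
print(mean_estimate, alpha_estimate)
```

The running average avoids storing every return, which is a reasonable alternative to keeping a list per state-action pair when memory matters.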

In [None]:
# GRADED FUNCTION: implement_mc_control_on_policy

def implement_mc_control_on_policy(env, num_episodes, epsilon=0.1, max_steps=100, gamma=0.99, alpha=None):
    """
    Monte Carlo Control (On-Policy).
    
    Arguments:
    env -- environment with reset() and step() methods
    num_episodes -- number of episodes to generate
    epsilon -- exploration rate (epsilon-greedy)
    max_steps -- maximum steps per episode
    gamma -- discount factor
    alpha -- learning rate (if None, uses incremental averaging)
    
    Returns:
    Q -- dictionary of Q-values Q[state][action]
    policy -- greedy policy derived from Q
    """
    
    Q = defaultdict(lambda: defaultdict(float))
    returns = defaultdict(lambda: defaultdict(list))
    
    # YOUR CODE STARTS HERE
    for episode in range(num_episodes):
        state = env.reset()
        trajectory = []
        actions_taken = []
        rewards = []
        
        # Generate episode following epsilon-greedy policy
        for step in range(max_steps):
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(0, 4)  # Random action (assuming 4 actions)
            else:
                if len(Q[state]) == 0:
                    action = np.random.randint(0, 4)
                else:
                    action = max(Q[state].items(), key=lambda x: x[1])[0]
            
            trajectory.append(state)
            actions_taken.append(action)
            next_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            
            if done:
                break
            state = next_state
        
        # Update Q-values (first-visit).
        # trajectory, actions_taken and rewards have equal length, so an episode
        # cut off at max_steps still contributes its last (state, action) pair.
        episode_returns = calculate_returns(rewards, gamma)
        visited = set()
        for s, a, G in zip(trajectory, actions_taken, episode_returns):
            if (s, a) not in visited:
                returns[s][a].append(G)
                if alpha is None:
                    Q[s][a] = np.mean(returns[s][a])
                else:
                    Q[s][a] += alpha * (G - Q[s][a])
                visited.add((s, a))
    # YOUR CODE ENDS HERE
    
    # Extract policy
    policy = {}
    for state in Q:
        if len(Q[state]) > 0:
            policy[state] = max(Q[state].items(), key=lambda x: x[1])[0]
    
    return dict(Q), policy

# Test your implementation
mc_control_on_policy_test(implement_mc_control_on_policy)
print("✓ MC Control On-Policy implementation passed!")

<font color='blue'>

**What you should remember**:
- **On-Policy Learning**: Learns about the same policy it follows
  - Explores with probability ε
  - Learns the best policy among those that keep exploring with probability ε
  - Falls short of the truly optimal policy while ε stays fixed above 0
- **Epsilon Decay**: Common to start with high ε and decay over time
  - Start: High exploration (ε = 0.3-0.5)
  - End: Low exploration (ε = 0.01-0.05)

</font>
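One common way to realize epsilon decay is an exponential schedule with a floor. The sketch below is an assumption for illustration: the function name, the decay rate of 0.995 per episode, and the start and end values (picked from the ranges listed above) are not prescribed by this notebook.

```python
def decayed_epsilon(episode, eps_start=0.5, eps_end=0.01, decay=0.995):
    """Exponentially decay epsilon per episode, never dropping below eps_end."""
    return max(eps_end, eps_start * decay ** episode)


for episode in (0, 100, 500, 1000):
    print(episode, round(decayed_epsilon(episode), 4))
```

To use such a schedule with on-policy control, the value would be recomputed at the start of each episode and used in place of a fixed `epsilon` during action selection.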

<a name='5'></a>
## 5 - MC Control Off-Policy: Learning from Exploratory Data

**Problem with On-Policy**: With a fixed ε > 0 the learned policy keeps exploring, so it never becomes the fully greedy optimal policy

**Solution: Off-Policy Learning**
- **Behavior Policy** $b$: Exploratory policy that generates data
- **Target Policy** $\pi$: Deterministic optimal policy we want to learn

### Importance Sampling

Adjust returns to account for different policies:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

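This ratio reweights each return by how much more (or less) likely the remaining trajectory is under the target policy than under the behavior policy. It requires coverage: $b(a|s) > 0$ wherever $\pi(a|s) > 0$.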
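For a greedy target policy and an ε-greedy behavior policy, the product simplifies: $\pi(a|s)$ is 1 for the greedy action and 0 otherwise. The sketch below computes $\rho$ under that assumption, and further assumes both policies are greedy with respect to the same Q-values; `importance_ratio` is a hypothetical helper.

```python
def importance_ratio(actions, greedy_actions, epsilon, n_actions):
    """rho for a greedy target policy and an epsilon-greedy behavior policy."""
    rho = 1.0
    for a, a_star in zip(actions, greedy_actions):
        if a != a_star:
            return 0.0  # pi(a|s) = 0 for any non-greedy action
        b_prob = 1.0 - epsilon + epsilon / n_actions  # b(a*|s) under epsilon-greedy
        rho *= 1.0 / b_prob  # pi(a*|s) = 1
    return rho


all_greedy = importance_ratio([1, 2, 0], [1, 2, 0], epsilon=0.3, n_actions=4)
one_exploratory = importance_ratio([1, 3, 0], [1, 2, 0], epsilon=0.3, n_actions=4)
print(all_greedy, one_exploratory)
```

A single exploratory action zeroes the ratio, while a long run of greedy actions makes it grow geometrically. Both effects contribute to the high variance of importance sampling.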

<a name='ex-4'></a>
### Exercise 4 - implement_mc_control_off_policy

Implement MC Control with off-policy learning using importance sampling.

**Instructions:**
- Generate episodes using behavior policy (exploratory)
- Update Q-values for target policy (greedy) using importance sampling
- Calculate importance ratios: $\rho = \frac{\pi(a|s)}{b(a|s)}$
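A common way to apply the ratios is weighted importance sampling, where each return is averaged with weight $W$ and a running sum $C$ of the weights is kept. The sketch below, with made-up returns and weights and a hypothetical helper name, checks that the incremental form matches the batch formula $\sum_i W_i G_i / \sum_i W_i$.

```python
def weighted_is_estimate(returns, weights):
    """Incremental weighted importance-sampling average of returns."""
    q, c = 0.0, 0.0
    for g, w in zip(returns, weights):
        if w == 0.0:
            continue  # a zero-weight return leaves the estimate unchanged
        c += w                   # cumulative weight C
        q += (w / c) * (g - q)   # Q <- Q + (W / C) * (G - Q)
    return q


is_returns = [1.0, 3.0, 2.0]
is_weights = [1.0, 2.0, 0.5]
incremental = weighted_is_estimate(is_returns, is_weights)
batch = sum(w * g for g, w in zip(is_returns, is_weights)) / sum(is_weights)
print(incremental, batch)
```

Because the estimate is a weighted average of observed returns, it stays within their range even when individual weights are large.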

In [None]:
# GRADED FUNCTION: implement_mc_control_off_policy

def implement_mc_control_off_policy(env, num_episodes, epsilon=0.3, max_steps=100, gamma=0.99):
    """
    Monte Carlo Control (Off-Policy with Importance Sampling).
    
    Arguments:
    env -- environment with reset() and step() methods
    num_episodes -- number of episodes to generate
    epsilon -- exploration rate for behavior policy
    max_steps -- maximum steps per episode
    gamma -- discount factor
    
    Returns:
    Q -- dictionary of Q-values Q[state][action]
    policy -- greedy target policy derived from Q
    importance_ratios -- list of importance ratios per episode
    """
    
    Q = defaultdict(lambda: defaultdict(float))
    C = defaultdict(lambda: defaultdict(float))  # Cumulative weights
    importance_ratios = []
    
    # YOUR CODE STARTS HERE
    for episode in range(num_episodes):
        state = env.reset()
        trajectory = []
        actions_taken = []
        rewards = []
        
        # Generate episode using behavior policy (epsilon-greedy with high epsilon)
        for step in range(max_steps):
            if np.random.random() < epsilon:
                action = np.random.randint(0, 4)  # Random action
            else:
                if len(Q[state]) == 0:
                    action = np.random.randint(0, 4)
                else:
                    action = max(Q[state].items(), key=lambda x: x[1])[0]
            
            trajectory.append(state)
            actions_taken.append(action)
            next_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            
            if done:
                break
            state = next_state
        
        # Off-policy update with weighted importance sampling
        G = 0
        W = 1.0  # Importance weight for the steps after t
        num_actions = 4
        # Behavior probability of the greedy action under epsilon-greedy
        greedy_prob = 1.0 - epsilon + epsilon / num_actions
        
        for t in reversed(range(len(rewards))):
            s = trajectory[t]
            a = actions_taken[t]
            r = rewards[t]
            
            G = r + gamma * G
            C[s][a] += W
            Q[s][a] += (W / C[s][a]) * (G - Q[s][a])
            
            # Target policy is greedy: pi(a|s) = 0 for any other action,
            # so all earlier steps of this episode would get zero weight
            best_action = max(Q[s].items(), key=lambda x: x[1])[0]
            if a != best_action:
                break
            
            # pi(a|s) = 1 for the greedy action, so W *= 1 / b(a|s).
            # Approximation: b is evaluated with the current greedy action,
            # which can differ from the one used when the episode was generated.
            W *= 1.0 / greedy_prob
        
        # Record the final cumulative weight reached in this episode
        importance_ratios.append(W)
    # YOUR CODE ENDS HERE
    
    # Extract policy
    policy = {}
    for state in Q:
        if len(Q[state]) > 0:
            policy[state] = max(Q[state].items(), key=lambda x: x[1])[0]
    
    return dict(Q), policy, importance_ratios

# Test your implementation
mc_control_off_policy_test(implement_mc_control_off_policy)
print("✓ MC Control Off-Policy implementation passed!")

<a name='ex-5'></a>
### Exercise 5 - implement_importance_sampling

Implement the importance sampling ratio calculation.

**Formula:**
$$\rho = \frac{\pi(a|s)}{b(a|s)}$$

Where:
- $\pi(a|s)$: Target policy probability
- $b(a|s)$: Behavior policy probability

In [None]:
# GRADED FUNCTION: implement_importance_sampling

def implement_importance_sampling(episode, target_policy_prob, behavior_policy_prob):
    """
    Calculate importance sampling ratio for an episode.
    
    Arguments:
    episode -- list of (state, action, reward) tuples
    target_policy_prob -- probability of actions under target policy
    behavior_policy_prob -- probability of actions under behavior policy
    
    Returns:
    rho -- importance sampling ratio (product of per-step probability ratios)
    """
    
    rho = 1.0
    
    # YOUR CODE STARTS HERE
    # For each step in the episode, multiply by the per-step ratio pi / b.
    # Both probabilities are scalars here, i.e. assumed identical at every step.
    for state, action, reward in episode:
        # A zero target probability must drive rho to 0: the target policy
        # would never produce this episode. Coverage requires b > 0.
        rho *= target_policy_prob / behavior_policy_prob
    # YOUR CODE ENDS HERE
    
    return rho

# Test your implementation
print("✓ Importance Sampling implementation ready!")

<font color='blue'>

**What you should remember**:
- **Off-Policy Learning**: Learn about target policy using data from behavior policy
  - Behavior policy: Exploratory (high ε)
  - Target policy: Greedy (optimal)
- **Importance Sampling**: Adjust returns by ratio of policies
  - High ratio = unlikely under behavior policy, likely under target
  - Low ratio = likely under behavior policy
  - Can cause high variance when ratios are large
- **Advantage**: The learned target policy is greedy even though the data comes from an exploratory policy
- **Disadvantage**: Higher variance, may need more episodes

</font>

<a name='6'></a>
## 6 - Comparative Analysis

### MC vs Dynamic Programming

<table>
  <tr>
    <th>Aspect</th>
    <th>Dynamic Programming</th>
    <th>Monte Carlo</th>
  </tr>
  <tr>
    <td><b>Model Required</b></td>
    <td>Yes (p(s',r|s,a))</td>
    <td>No (model-free)</td>
  </tr>
  <tr>
    <td><b>Update Type</b></td>
    <td>Bootstrapping (uses V(s'))</td>
    <td>Actual returns (complete episodes)</td>
  </tr>
  <tr>
    <td><b>Convergence</b></td>
    <td>Fast (few sweeps)</td>
    <td>Slow (many episodes)</td>
  </tr>
  <tr>
    <td><b>Variance</b></td>
    <td>Low</td>
    <td>High</td>
  </tr>
  <tr>
    <td><b>Bias</b></td>
    <td>Bootstrapped estimates depend on the V initialization until converged</td>
    <td>None for first-visit averages (unbiased)</td>
  </tr>
  <tr>
    <td><b>Best For</b></td>
    <td>Small problems with known model</td>
    <td>Unknown dynamics, simulation available</td>
  </tr>
</table>
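The "Update Type" row can be illustrated on a tiny deterministic chain, sketched below under assumed rewards and $\gamma = 0.9$. The DP-style loop reads the model (`chain_next`, `chain_reward`) and bootstraps from `V_dp`, while the MC-style loop only sees one sampled episode. All names are hypothetical.

```python
chain_gamma = 0.9
# Hypothetical deterministic chain: S0 -> S1 -> terminal
chain_next = {"S0": "S1", "S1": None}
chain_reward = {"S0": 1.0, "S1": 2.0}

# DP-style evaluation: sweep Bellman backups using the known model
V_dp = {"S0": 0.0, "S1": 0.0}
for _ in range(2):  # two sweeps are enough for this two-state chain
    for s in V_dp:
        v_next = V_dp[chain_next[s]] if chain_next[s] is not None else 0.0
        V_dp[s] = chain_reward[s] + chain_gamma * v_next  # bootstraps from V(s')

# MC-style evaluation: use the return of a complete episode, no model needed
chain_episode = [("S0", 1.0), ("S1", 2.0)]  # (state, reward) pairs
V_mc, g = {}, 0.0
for s, r in reversed(chain_episode):
    g = r + chain_gamma * g  # actual discounted return from s
    V_mc[s] = g

print(V_dp, V_mc)
```

Both approaches agree here because the chain is deterministic, so a single episode already gives the exact return. In a stochastic environment MC would need to average many episodes to approach the DP result.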

### On-Policy vs Off-Policy

<table>
  <tr>
    <th>Aspect</th>
    <th>On-Policy</th>
    <th>Off-Policy</th>
  </tr>
  <tr>
    <td><b>Behavior Policy</b></td>
    <td>ε-greedy (exploratory)</td>
    <td>ε-greedy (exploratory)</td>
  </tr>
  <tr>
    <td><b>Target Policy</b></td>
    <td>ε-greedy (same as behavior)</td>
    <td>Greedy (deterministic, optimal)</td>
  </tr>
  <tr>
    <td><b>What We Learn</b></td>
    <td>Value of exploratory policy</td>
    <td>Value of optimal policy</td>
  </tr>
  <tr>
    <td><b>Variance</b></td>
    <td>Normal</td>
    <td>Very high (importance ratios)</td>
  </tr>
  <tr>
    <td><b>Sample Efficiency</b></td>
    <td>Good</td>
    <td>Poor</td>
  </tr>
  <tr>
    <td><b>Convergence</b></td>
    <td>Guaranteed</td>
    <td>Guaranteed</td>
  </tr>
</table>

<a name='7'></a>
## 7 - Conclusion and Next Steps

### Summary of Key Concepts

1. **MC Prediction**: Evaluates given policies by averaging observed returns
   - First-Visit: Stronger theory, less data
   - Every-Visit: More data, same convergence

2. **MC Control On-Policy**: Finds good (but not optimal) policy with exploration
   - Uses ε-greedy for exploration
   - Never learns truly optimal (always explores)

3. **MC Control Off-Policy**: Learns optimal policy from exploratory data
   - Uses importance sampling for correction
   - Very high variance

### Key Advantages of MC
- Model-free (no need for environment dynamics)
- Can focus on important states
- Unbiased return estimates (first-visit) that converge to the true values
- Simple to understand and implement

### Limitations of MC
- Requires complete episodes
- High variance (slow convergence)
- Can't update during episode
- Off-policy can have unstable importance ratios

### Next Topics
The limitations of MC led to **Temporal Difference (TD) Learning**:
- Lower variance than MC (bootstrapping)
- Can update before episode completion
- Combines advantages of DP and MC
- Examples: SARSA, Q-Learning, Expected SARSA

---

**Congratulations!** You've completed the Monte Carlo Methods tutorial.

You can now:
- ✓ Understand model-free learning principles
- ✓ Implement First-Visit and Every-Visit MC Prediction
- ✓ Implement On-Policy MC Control with ε-greedy
- ✓ Implement Off-Policy MC Control with importance sampling
- ✓ Compare MC with DP and other methods

**Suggested Next Steps:**
1. Implement MC methods on Gymnasium environments
2. Experiment with different epsilon decay schedules
3. Analyze importance sampling variance
4. Move to Temporal Difference Learning for faster convergence