![Logo](../assets/logo.png)

Made by **Zoltán Barta**

[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/Fortuz/rl_education/blob/main/5.%20Temporal%20Difference/temporal_difference.ipynb)

- This notebook is based on Chapters 6 and 7 of the book *Reinforcement Learning: An Introduction (2nd ed.)* by R. Sutton & A. Barto, available at http://incompleteideas.net/book/the-book-2nd.html

# Temporal Difference Learning: TD(N), SARSA, and Q-learning

## Introduction
Reinforcement Learning (RL) is a fundamental approach for training intelligent agents to interact with an environment and learn optimal behaviors. In this notebook, we will explore **Temporal Difference (TD) learning**, a key concept in RL that allows agents to learn **value functions** without needing a complete model of the environment.

In [None]:
# Importing Required Libraries
import numpy as np
import matplotlib.pyplot as plt  # For visualization
import random
import gym
from typing import Dict, List, Tuple, Iterable
from collections import defaultdict, deque


# Check Gym version
print(f"Using Gym version: {gym.__version__}")

## Some utility functions


In [None]:
def moving_average(
    data_dict: Dict[str, List[float]], 
    window_size: int, 
    show_original: bool = True, 
    clip_value: float = -500
) -> None:
    """Computes and plots the moving average for multiple algorithms, with optional clipping.

    Parameters:
    - data_dict (Dict[str, List[float]]): A dictionary where keys are algorithm names and values are lists of rewards.
    - window_size (int): Number of samples for the moving average.
    - show_original (bool, optional): Whether to display the original data (dotted line). Default is True.
    - clip_value (float, optional): Minimum value to clip the data to. Default is -500.
    """
    if window_size < 1:
        raise ValueError("Window size must be at least 1")

    plt.figure(figsize=(10, 5))

    for algo, data in data_dict.items():
        data = np.array(data)

        # Clip values at the specified threshold
        data = np.clip(data, clip_value, None)

        # Compute moving average
        moving_avg = np.convolve(data, np.ones(window_size) / window_size, mode='valid')

        # Plot original data if enabled
        if show_original:
            plt.plot(data, linestyle='dotted', alpha=0.4, label=f"{algo} (Original)")

        # Plot moving average
        plt.plot(range(window_size - 1, len(data)), moving_avg, label=f"{algo}")

    plt.xlabel("Episodes")
    plt.ylabel("Return")
    plt.title(f"Moving Average ({window_size}-Point) with Clipping at {clip_value}")
    plt.legend()
    plt.grid()
    plt.show()

In [None]:
def discretize_state(state: int | Iterable) -> Tuple:
    """Converts a given state into a tuple format.

    Parameters:
    - state (int | Iterable): The state to be discretized.

    Returns:
    - Tuple: The discretized state representation.
    """
    if isinstance(state, (int, np.integer)):  # Also accept NumPy integer states
        return (state,)
    else:
        return tuple(state)

In [None]:
def argmax(values: List[float]) -> int:
    """Returns the index of the maximum value in the list.
    If multiple values have the same maximum, randomly selects one.

    Parameters:
    - values (List[float]): A list of values.

    Returns:
    - int: The index of the maximum value.
    """
    max_val = max(values)
    return np.random.choice([a for a, v in enumerate(values) if v == max_val])

### **Cliff Walking: A Classic Reinforcement Learning Problem**  

#### **Introduction**  
Cliff Walking is a classic **grid-world** reinforcement learning environment, commonly used to compare RL algorithms such as **SARSA**, **Q-learning**, and **Monte Carlo methods**. Like **the windy gridworld**, it is a grid navigation task; here the agent must cross the grid while avoiding a deadly cliff.

---

#### **Environment Setup**  
The environment consists of a **4x12 grid**, where:
- **Start State**: The agent begins at the bottom-left corner `(3,0)`.
- **Goal State**: The agent's objective is to reach the bottom-right corner `(3,11)`.
- **Cliff Region**: The entire row between `(3,1)` and `(3,10)` is considered a cliff.
- **Actions**: The agent can move in **four directions**: **left, right, up, or down**.
- **Rewards**:  
  - Each step incurs a **reward of -1**.
  - If the agent falls off the cliff, it receives a **reward of -100** and is sent back to the start.
  - If the agent reaches the goal, the episode ends.
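
The grid coordinates above can be related to the integer states that the environment returns. The following is a minimal sketch, assuming the Gym implementation numbers cells in row-major order (state = row * 12 + col). The helper names `to_index` and `to_coords` are hypothetical and are not part of Gym.

```python
import numpy as np

GRID_SHAPE = (4, 12)  # (rows, columns) of the Cliff Walking grid

def to_index(row: int, col: int) -> int:
    """Maps (row, col) to a flat state index, assuming row-major numbering."""
    return int(np.ravel_multi_index((row, col), GRID_SHAPE))

def to_coords(state: int) -> tuple:
    """Maps a flat state index back to (row, col)."""
    row, col = np.unravel_index(state, GRID_SHAPE)
    return int(row), int(col)

start = to_index(3, 0)    # Bottom-left corner
goal = to_index(3, 11)    # Bottom-right corner
cliff = [to_index(3, c) for c in range(1, 11)]  # Cells between start and goal

print(f"start={start}, goal={goal}, cliff={cliff[0]}..{cliff[-1]}")
```

Under that assumption the start is state 36, the goal is state 47, and the cliff covers states 37 to 46.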

---

#### **Learning Objective**  
- The agent must learn the **optimal policy** that minimizes total negative reward while avoiding the cliff.  
- **Exploration vs. Exploitation**: Since the cliff gives a high penalty, the agent needs to balance trying new paths (**exploration**) and sticking to learned safe routes (**exploitation**).  
- **Algorithmic Comparison**:  
  - **SARSA (On-policy control)**: Learns a **conservative** policy that avoids the cliff but is suboptimal.  
  - **Q-learning (Off-policy control)**: Learns the **optimal policy** but can take risky steps near the cliff.  
  - **Monte Carlo methods**: Require full episodes before any update, but converge to a stable policy over time.  


In [None]:
env = gym.make("CliffWalking-v0")

# Monte Carlo Policy for Cliff Walking

Monte Carlo reinforcement learning methods estimate optimal policies by averaging returns over complete episodes without relying on a model of the environment. The `MonteCarloCliffWalking` class implements **Monte Carlo policy iteration** for the **CliffWalking** environment, using an **ε-soft policy** to ensure sufficient exploration. The algorithm collects full episodes, updates the **Q-value estimates** using first-visit Monte Carlo, and gradually improves the action-value function based on observed returns. This approach is particularly effective for **episodic tasks**, where learning is based on complete trajectories rather than step-by-step updates. MC methods rely on **sampling whole episodes** to approximate expected returns.  
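
To make "averaging returns over complete episodes" concrete, the sketch below computes the discounted return $G_t$ for every step of a finished episode by sweeping backwards through the rewards. It is an illustration only; the function name `discounted_returns` and the three-step toy episode are assumptions made for this example.

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Returns G_t for every step, using G_t = r_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    returns = []
    for reward in reversed(rewards):  # Sweep from the last step to the first
        G = reward + gamma * G
        returns.append(G)
    return returns[::-1]  # Restore chronological order

# Toy episode: three steps, each with the usual step reward of -1
returns = discounted_returns([-1.0, -1.0, -1.0], gamma=0.9)
print(returns)
```

For this toy episode the returns are approximately -2.71, -1.9 and -1.0. None of them can be computed until the episode has ended, which is the property the TD methods later in this notebook relax.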


![MC](assets/MC.png)


In [None]:
class MonteCarloCliffWalking:
    """Implements Monte Carlo policy iteration for CliffWalking."""

    def __init__(self, env, gamma: float = 0.9, epsilon: float = 0.1, episodes: int = 5000):
        """
        Initializes Monte Carlo policy iteration for CliffWalking.

        Parameters:
        - env: OpenAI Gym CliffWalking-v0 environment.
        - gamma (float): Discount factor.
        - epsilon (float): Exploration rate for ε-greedy policy.
        - episodes (int): Number of training episodes.
        """
        self.env = env
        self.gamma = gamma
        self.epsilon = epsilon
        self.episodes = episodes
        self.n_actions = env.action_space.n
        self.q_table = defaultdict(lambda: np.zeros(self.n_actions))  # Q(s, a)
        self.returns = defaultdict(list)  # Stores all returns per (state, action)

    def generate_episode(self, policy) -> Tuple[List[Tuple[int, int, float]], float]:
        """Generates an episode using the current policy and returns the sequence.

        Parameters:
        - policy: The policy function.

        Returns:
        - List[Tuple[int, int, float]]: The episode as (state, action, reward) tuples.
        - float: The total reward collected in the episode.
        """
        episode = []
        state = self.env.reset()[0]  # Get initial state
        total_reward = 0

        while True:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = self.env.step(action)
            episode.append((state, action, reward))
            total_reward += reward
            state = next_state
            if terminated or truncated:  # Stop at the goal or at a time limit
                break

        return episode, total_reward

    def update_q_values(self, episode: List[Tuple[int, int, float]]) -> None:
        """Performs first-visit Monte Carlo update for Q-table.

        Parameters:
        - episode (List[Tuple[int, int, float]]): List of (state, action, reward) transitions.
        """
        G = 0  # Return (discounted sum of rewards)
        visited = set()  # Track first visit

        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = reward + self.gamma * G  # Discounted return

            if (state, action) not in visited:
                visited.add((state, action))
                self.returns[(state, action)].append(G)
                self.q_table[state][action] = np.mean(self.returns[(state, action)])  # Update Q

    def policy(self, state: int) -> int:
        """ε-soft policy for action selection.
    
        Parameters:
        - state (int): The current state.
    
        Returns:
        - int: The action to take.
        """
        action_probabilities = np.ones(self.n_actions) * (self.epsilon / self.n_actions)  # Base probability
        best_action = np.argmax(self.q_table[state])
        action_probabilities[best_action] += (1 - self.epsilon)  # Assign extra probability to the best action
        return np.random.choice(np.arange(self.n_actions), p=action_probabilities)

    def train(self) -> List[float]:
        """Trains the Monte Carlo agent and returns training rewards.

        Returns:
        - List[float]: The rewards per episode.
        """
        rewards_per_episode = []

        for _ in range(self.episodes):
            episode, total_reward = self.generate_episode(self.policy)
            self.update_q_values(episode)
            rewards_per_episode.append(total_reward)  # Store episode reward
        print("Training done!")
        return rewards_per_episode

    def get_optimal_policy(self) -> Dict[int, int]:
        """Extracts the optimal policy from the Q-table.

        Returns:
        - Dict[int, int]: Mapping of state to best action.
        """
        policy = {}
        for state in self.q_table.keys():
            policy[state] = np.argmax(self.q_table[state])
        return policy

    def __repr__(self):
        return f"Monte-Carlo (ε={self.epsilon}, γ={self.gamma})"

In [None]:
# Initialize Monte Carlo agent
mc_agent = MonteCarloCliffWalking(env, gamma=0.9, epsilon=0.1, episodes=5000)

# Train the agent and collect rewards
training_rewards = mc_agent.train()


In [None]:
moving_average({"Monte Carlo": training_rewards}, window_size=20, show_original=False)

## Limitations of Monte Carlo Methods  
While Monte Carlo methods are effective in **episodic tasks**, they suffer from several limitations:  
- **Delayed Learning**: Monte Carlo methods **only update value estimates at the end of an episode**. This makes them **inefficient** in environments where episodes are long or rarely terminate.  
- **No Online Learning**: Since updates occur only after an episode concludes, MC methods **cannot adapt immediately** to new experiences, making them slow to respond to dynamic environments.  
- **High Variance**: Because each episode may vary significantly, Monte Carlo estimates often have **high variance**, leading to instability in learning.  

## How could we improve the efficiency?

To address these issues, **Temporal Difference (TD) Learning** introduces an alternative approach:  
- **TD(0) Learning** updates value estimates **after each step** rather than waiting for the full episode to end.  
- This allows **online learning**, where the agent improves its estimates continuously during interaction with the environment.  
- TD methods strike a balance between **Monte Carlo** (which waits until the episode ends) and **Dynamic Programming** (which requires a full model of the environment).  

TD(0) is the simplest form of temporal difference learning and serves as the foundation for more advanced algorithms like **SARSA** and **Q-learning**.  

# TD(0) Update Rule

The Temporal Difference (TD(0)) update rule is defined as:


$$
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]
$$

where:
- $V(s_t)$: The current estimate of the value function for state $s_t$.
- $\alpha$: The learning rate, controlling the step size of updates.
- $r_{t+1}$: The reward received after transitioning from $s_t$ to $s_{t+1}$.
- $\gamma$: The discount factor, weighting the importance of future rewards.
- $V(s_{t+1})$: The estimated value of the next state.
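
The rule can be checked by hand with a single transition. The sketch below is illustrative: the function `td0_update`, the state labels `"A"` and `"B"`, and the numbers are assumptions chosen for the example.

```python
from collections import defaultdict

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9, done=False):
    """Applies one TD(0) update to V[state] in place and returns the TD error."""
    bootstrap = 0.0 if done else gamma * V[next_state]  # No bootstrapping from terminal states
    td_error = reward + bootstrap - V[state]
    V[state] += alpha * td_error
    return td_error

V = defaultdict(float)  # All state values start at 0
V["B"] = -5.0           # Pretend we already have an estimate for the next state

td_error = td0_update(V, "A", reward=-1.0, next_state="B")
print(td_error, V["A"])
```

Here the TD error is $-1 + 0.9 \cdot (-5) - 0 = -5.5$, so $V(A)$ moves from $0$ to $0.1 \cdot (-5.5) = -0.55$ after a single step, without waiting for the episode to finish.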




<div style="width:50%; height:auto; overflow:hidden; position:relative;">
    <img src="assets/backup_diagram.png" style="position:relative; left:-50%; width:200%;">
</div>

[Source](https://www.researchgate.net/publication/366838727_An_intelligent_resource_management_method_in_SDN_based_fog_computing_using_reinforcement_learning)

In [None]:
class Policy:
    """
    An epsilon-greedy policy for reinforcement learning.

    This policy selects actions using an ε-greedy approach, balancing exploration and exploitation.
    It maintains a Q-table for action-value estimates and allows updating and selecting actions.

    Attributes:
    - epsilon (float): The probability of selecting a random action for exploration.
    - action_space_size (int): The total number of possible actions.
    - q_table (Dict[Tuple, List[float]]): A dictionary mapping states to action-value estimates.
    """

    def __init__(self, action_space_size: int, epsilon: float):
        """
        Initializes the epsilon-greedy policy.

        Parameters:
        - action_space_size (int): The number of available actions.
        - epsilon (float): The probability of selecting a random action (exploration rate).
        """
        self.epsilon: float = epsilon 
        self.action_space_size: int = action_space_size
        self.q_table: Dict[Tuple, List[float]] = defaultdict(lambda: np.zeros(action_space_size))

    def __call__(self, state) -> int:
        """
        Selects an action using the epsilon-greedy policy.

        Parameters:
        - state: The current state of the environment.

        Returns:
        - int: The action to take.
        """
        state = discretize_state(state)
        if random.uniform(0, 1) < self.epsilon:
            return random.randint(0, self.action_space_size - 1)  # Explore
        else:
            return argmax(self.q_table[state])  # Exploit

    def update_and_select_action(self, state: int | Iterable, action: int, reward: float, next_state: int | Iterable,done:bool) -> int:
        """
        Updates the policy (if necessary) and selects an action for the next state.

        Parameters:
        - state (int | Iterable): The current state.
        - action (int): The action taken.
        - reward (float): The reward received for the action.
        - next_state (int | Iterable): The next state after taking the action.
        - done (bool): Whether next_state is terminal (unused by this base class).

        Returns:
        - int: The next action to take.
        """
        return self(next_state)

    def __repr__(self) -> str:
        """
        Returns a string representation of the policy.

        Returns:
        - str: A description of the policy.
        """
        return "Random Policy"

# SARSA: On-Policy Temporal Difference Learning

## Overview  
SARSA (State-Action-Reward-State-Action) is an **on-policy** reinforcement learning algorithm that updates the Q-value of a state-action pair based on the action taken according to the current policy. Unlike **Q-learning**, which uses the maximum possible future reward (off-policy), SARSA follows the policy's actual behavior when updating values.

## Update rule
At each time step $t$, SARSA updates the Q-value using the following rule:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
$$

where:
- $Q(s_t, a_t)$ is the estimated value of taking action $a_t$ in state $s_t$.
- $\alpha$ is the learning rate, controlling the step size of updates.
- $r_{t+1}$ is the reward received after taking action $a_t$.
- $\gamma$ is the discount factor, weighting the importance of future rewards.
- $Q(s_{t+1}, a_{t+1})$ is the value of the next state-action pair, chosen by the current policy.

## Key Characteristics  
- **On-Policy Learning:** The update is based on the action actually taken by the agent, making SARSA more conservative compared to Q-learning.
- **Exploration Sensitivity:** Since it follows its own policy, SARSA naturally integrates exploration strategies like **ε-greedy**.
- **Smooth Learning Curve:** SARSA tends to learn safer policies in environments with high penalties, as it accounts for exploratory actions during training.


![SARSA](assets/SARSA.png)

In [None]:
class SARSA(Policy):
    """
    Implements the SARSA (State-Action-Reward-State-Action) reinforcement learning algorithm.

    This class inherits from the `Policy` class and updates the Q-values using the SARSA update rule.
    It follows an epsilon-greedy policy for action selection and updates Q-values based on the expected
    return of the next state-action pair.

    Attributes:
    - alpha (float): The learning rate, controlling the step size for Q-value updates.
    - gamma (float): The discount factor, representing the importance of future rewards.
    - epsilon (float): The probability of selecting a random action (exploration rate).
    """

    def __init__(self, action_space_size: int, epsilon: float = 0.1, alpha: float = 0.1, gamma: float = 0.9):
        """
        Initializes the SARSA algorithm.

        Parameters:
        - action_space_size (int): The number of available actions.
        - epsilon (float): The probability of selecting a random action (exploration rate).
        - alpha (float): The learning rate for Q-value updates.
        - gamma (float): The discount factor for future rewards.
        """
        super().__init__(action_space_size, epsilon=epsilon)
        self.alpha: float = alpha
        self.gamma: float = gamma
        self.epsilon: float = epsilon

    def update_and_select_action(self, state: int | Iterable, action: int, reward: float, next_state: int | Iterable,done:bool) -> int:
        """
        Updates the Q-table using the SARSA update rule and selects the next action.

        Parameters:
        - state (int | Iterable): The current state.
        - action (int): The action taken.
        - reward (float): The reward received for the action.
        - next_state (int | Iterable): The next state after taking the action.
        - done (bool): Whether next_state is terminal.

        Returns:
        - int: The next action to take based on the updated Q-values.
        """
        state = discretize_state(state)
        next_state = discretize_state(next_state)
        next_action = self(next_state)

        # Terminal states have zero value, so do not bootstrap from them
        td_target = reward + (0.0 if done else self.gamma * self.q_table[next_state][next_action])
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += self.alpha * td_error

        return next_action

    def __repr__(self) -> str:
        """
        Returns a string representation of the SARSA algorithm.

        Returns:
        - str: A formatted string showing the SARSA parameters.
        """
        return f"SARSA (ε={self.epsilon}, α={self.alpha}, γ={self.gamma})"

# Expected SARSA: An Improvement Over SARSA

## Overview  
Expected SARSA is a refinement of the standard **SARSA algorithm** that incorporates the expected value of the next Q-values instead of relying on a single sampled action. This approach **reduces variance** while maintaining the on-policy nature of SARSA, leading to more stable learning.

## How Expected SARSA Works  
At each time step $t$, Expected SARSA updates the Q-value using the following rule:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \mathbb{E}_{a_{t+1} \sim \pi} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
$$

where:
- $Q(s_t, a_t)$ is the estimated value of taking action $a_t$ in state $s_t$.
- $\alpha$ is the learning rate, controlling the step size of updates.
- $r_{t+1}$ is the reward received after taking action $a_t$.
- $\gamma$ is the discount factor, weighting the importance of future rewards.
- $\mathbb{E}_{a_{t+1} \sim \pi} Q(s_{t+1}, a_{t+1})$ is the **expected Q-value** over all possible actions in state $s_{t+1}$, weighted by the action selection probabilities.
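
For an ε-greedy policy the expectation is a short dot product: every action gets probability $\epsilon / |A|$, and the greedy action gets an extra $1 - \epsilon$. The sketch below works this out for one state; the helper name `epsilon_greedy_probs` and the Q-values are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy_probs(q_values: np.ndarray, epsilon: float) -> np.ndarray:
    """Returns the action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)   # Exploration mass, spread evenly
    probs[int(np.argmax(q_values))] += 1.0 - epsilon  # Remaining mass on the greedy action
    return probs

q_next = np.array([1.0, 2.0, 3.0, 4.0])  # Hypothetical Q(s_{t+1}, .)
probs = epsilon_greedy_probs(q_next, epsilon=0.1)
expected_q = float(np.dot(probs, q_next))

print(probs, expected_q)
```

With $\epsilon = 0.1$ and four actions, the probabilities are $0.025, 0.025, 0.025, 0.925$, giving an expected value of $0.025 \cdot (1 + 2 + 3) + 0.925 \cdot 4 = 3.85$.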

## Key Differences from SARSA  
- **Expectation Over Next Actions:** Instead of using a single action $a_{t+1}$, Expected SARSA **computes the weighted sum** of all possible Q-values based on the current policy.
- **Lower Variance:** By averaging over possible next actions, Expected SARSA removes the variance that SARSA incurs from **randomly sampling** $a_{t+1}$.
- **Smoother Convergence:** Learning is more stable because updates are based on a distribution rather than a single sampled action.

In [None]:
class ExpectedSARSA(Policy):
    """
    Implements the Expected SARSA reinforcement learning algorithm.

    Expected SARSA is a variation of the SARSA algorithm that updates Q-values based on the 
    expected value of the next state's Q-values under the current policy, rather than using 
    the actual next action taken. This leads to a smoother learning process and more stable convergence.

    Attributes:
    - alpha (float): The learning rate, controlling the step size for Q-value updates.
    - gamma (float): The discount factor, representing the importance of future rewards.
    - epsilon (float): The probability of selecting a random action (exploration rate).
    """

    def __init__(self, action_space_size: int, epsilon: float = 0.1, alpha: float = 0.1, gamma: float = 0.9):
        """
        Initializes the Expected SARSA algorithm.

        Parameters:
        - action_space_size (int): The number of available actions.
        - epsilon (float): The probability of selecting a random action (exploration rate).
        - alpha (float): The learning rate for Q-value updates.
        - gamma (float): The discount factor for future rewards.
        """
        super().__init__(action_space_size, epsilon=epsilon)
        self.alpha: float = alpha
        self.gamma: float = gamma
        self.epsilon: float = epsilon

    def get_expected_q(self, state: int | Iterable) -> float:
        """
        Computes the expected Q-value for a given state under the current policy.

        This is done by calculating the weighted sum of Q-values, where the weights 
        correspond to the probability of selecting each action based on the ε-greedy policy.

        Parameters:
        - state (int | Iterable): The state for which to compute the expected Q-value.

        Returns:
        - float: The expected Q-value for the given state.
        """
        state = discretize_state(state)
        policy_probs = np.ones(self.action_space_size) * (self.epsilon / self.action_space_size)
        best_action = np.argmax(self.q_table[state])
        policy_probs[best_action] += (1.0 - self.epsilon)
        return np.dot(self.q_table[state], policy_probs)

    def update_and_select_action(self, state: int | Iterable, action: int, reward: float, next_state: int | Iterable,done:bool) -> int:
        """
        Updates the Q-table using the Expected SARSA update rule and selects the next action.

        Unlike standard SARSA, this update rule uses the expected Q-value of the next state 
        rather than the Q-value of the actual next action.

        Parameters:
        - state (int | Iterable): The current state.
        - action (int): The action taken.
        - reward (float): The reward received for the action.
        - next_state (int | Iterable): The next state after taking the action.
        - done (bool): Whether next_state is terminal.

        Returns:
        - int: The next action to take based on the updated Q-values.
        """
        state = discretize_state(state)
        next_state = discretize_state(next_state)

        # Terminal states have zero value, so do not bootstrap from them
        expected_q = 0.0 if done else self.get_expected_q(next_state)

        td_target = reward + self.gamma * expected_q
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += self.alpha * td_error

        return self(next_state)

    def __repr__(self) -> str:
        """
        Returns a string representation of the Expected SARSA algorithm.

        Returns:
        - str: A formatted string showing the Expected SARSA parameters.
        """
        return f"ExpectedSARSA (ε={self.epsilon}, α={self.alpha}, γ={self.gamma})"

# Q-Learning: Off-Policy Temporal Difference Learning

## Overview  
Q-Learning is a **model-free, off-policy** reinforcement learning algorithm that estimates the optimal **action-value function** by learning from **maximized future rewards** rather than the agent’s actual behavior. Unlike **SARSA**, which follows the current policy, Q-learning **optimizes independently** of the agent’s exploration, making it more aggressive in finding optimal strategies.

## How Q-Learning Works  
At each time step $t$, Q-learning updates the Q-value using the following rule:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
$$

where:
- $Q(s_t, a_t)$ is the estimated value of taking action $a_t$ in state $s_t$.
- $\alpha$ is the learning rate, controlling the step size of updates.
- $r_{t+1}$ is the reward received after taking action $a_t$.
- $\gamma$ is the discount factor, weighting the importance of future rewards.
- $\max_{a} Q(s_{t+1}, a)$ represents the **maximum estimated Q-value** of the next state, assuming the best possible action.

## Key Characteristics  
- **Off-Policy Learning:** Q-learning updates its Q-values using the **greedy** action selection in the update step, regardless of the agent’s actual behavior.
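- **More Optimistic Learning:** Because the update assumes the greedy action will be taken next, Q-learning can learn quickly in deterministic environments. In stochastic settings, taking the max over noisy estimates tends to overestimate action values, which can make learning unstable.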
- **Exploration via ε-Greedy:** While updates are based on **greedy action selection**, exploration can still be encouraged using an **ε-greedy** action selection strategy.

## Q-Learning vs. SARSA vs. Expected SARSA  
| Feature             | Q-Learning (Off-Policy) | SARSA (On-Policy) | Expected SARSA (On-Policy) |
|---------------------|------------------|-----------------|------------------|
| Policy Type        | Off-policy (greedy) | On-policy (evaluates the policy it follows) | On-policy (soft policy) |
| Update Method      | Uses the max Q-value of the next state | Uses the Q-value of the next sampled action | Expected value over all actions |
| Variance           | Moderate (depends on environment) | Higher (single action sample) | Lower (averages over actions) |
| Learning Stability | Can be unstable but finds the optimal policy | More stable but may learn safer policies | More stable due to expectation calculation |
| Exploration Handling | Assumes greedy action selection | Explicitly follows $\epsilon$-greedy exploration | Naturally accounts for stochastic policies |
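
The "Update Method" row can be made concrete by computing all three TD targets for the same transition. This is a sketch under stated assumptions: the function `td_targets` and the Q-values are hypothetical, and the ε-greedy probabilities are built the same way as in `ExpectedSARSA.get_expected_q`.

```python
import numpy as np

def td_targets(reward, q_next, sampled_action, epsilon, gamma=0.9):
    """Returns the SARSA, Expected SARSA and Q-learning targets for one transition."""
    q_next = np.asarray(q_next, dtype=float)
    n_actions = len(q_next)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(q_next))] += 1.0 - epsilon  # Epsilon-greedy action probabilities
    return {
        "sarsa": reward + gamma * q_next[sampled_action],          # Sampled next action
        "expected_sarsa": reward + gamma * float(probs @ q_next),  # Average over the policy
        "q_learning": reward + gamma * float(q_next.max()),        # Greedy next action
    }

# Next state sits beside the cliff: action 3 is assumed to step off it.
targets = td_targets(
    reward=-1.0,
    q_next=[-10.0, -2.0, -3.0, -100.0],
    sampled_action=3,  # The behavior policy happened to explore into the cliff
    epsilon=0.1,
)
print(targets)
```

In this example the SARSA target is $-91$, the Expected SARSA target is about $-5.21$, and the Q-learning target is $-2.8$. The sampled exploratory step dominates the SARSA target, Expected SARSA weights it by its small probability, and Q-learning ignores it entirely.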

Q-learning is widely used in **reinforcement learning applications** due to its ability to learn the **optimal policy** independently of the agent’s actual behavior. However, it can be unstable in **stochastic environments**, making techniques like **Double Q-Learning** and **Deep Q Networks (DQN)** essential for real-world applications.
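
One source of this instability is that the maximum of several noisy estimates is biased upward, even when every true value is the same. The sketch below illustrates the effect numerically and shows the idea behind Double Q-Learning, which selects the action with one set of estimates and evaluates it with an independent one. The setup (four actions, all with true value 0, unit Gaussian noise) is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 4, 10_000

true_q = np.zeros(n_actions)  # Every action is truly worth 0

# Two independent sets of noisy estimates of the same true values
noisy_q_a = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))
noisy_q_b = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# Single estimator: select and evaluate with the same noisy estimates
single_estimate = noisy_q_a.max(axis=1).mean()

# Double estimator: select with A, evaluate with B
best_actions = noisy_q_a.argmax(axis=1)
double_estimate = noisy_q_b[np.arange(n_trials), best_actions].mean()

print(f"true max: {true_q.max():.2f}")
print(f"single estimator: {single_estimate:.2f}")
print(f"double estimator: {double_estimate:.2f}")
```

The single estimator lands well above the true maximum of 0, while the double estimator stays close to it because the noise used for selection is independent of the noise used for evaluation.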

![Q-Learning](assets/QLearn.png)

In [None]:
class QLearning(Policy):
    """
    Implements the Q-Learning reinforcement learning algorithm.

    Q-Learning is an off-policy learning algorithm that updates the Q-values based on the 
    maximum future reward possible from the next state. It follows an epsilon-greedy policy 
    for action selection.

    Attributes:
    - alpha (float): The learning rate, controlling the step size for Q-value updates.
    - gamma (float): The discount factor, representing the importance of future rewards.
    - epsilon (float): The probability of selecting a random action (exploration rate).
    """

    def __init__(self, action_space_size: int, epsilon: float, alpha: float = 0.1, gamma: float = 0.99):
        """
        Initializes the Q-Learning algorithm.

        Parameters:
        - action_space_size (int): The number of available actions.
        - epsilon (float): The probability of selecting a random action (exploration rate).
        - alpha (float): The learning rate for Q-value updates.
        - gamma (float): The discount factor for future rewards.
        """
        super().__init__(action_space_size, epsilon=epsilon)
        self.alpha: float = alpha
        self.gamma: float = gamma
        self.epsilon: float = epsilon

    def update_and_select_action(self, state: int | Iterable, action: int, reward: float, next_state: int | Iterable,done:bool) -> int:
        """
        Updates the Q-table using the Q-Learning update rule and selects the next action.

        Unlike SARSA, Q-Learning uses the **maximum Q-value** from the next state to update 
        the current Q-value, making it an off-policy learning algorithm.

        Parameters:
        - state: The current state.
        - action (int): The action taken.
        - reward (float): The reward received for the action.
        - next_state: The next state after taking the action.
        - done (bool): Whether next_state is terminal.

        Returns:
        - int: The next action to take based on the updated Q-values.
        """
        state = discretize_state(state)
        next_state = discretize_state(next_state)

        # Greedy bootstrap value; terminal states have zero value
        best_next_q = 0.0 if done else np.max(self.q_table[next_state])
        td_target = reward + self.gamma * best_next_q
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += self.alpha * td_error

        return self(next_state)

    def __repr__(self) -> str:
        """
        Returns a string representation of the Q-Learning algorithm.

        Returns:
        - str: A formatted string showing the Q-Learning parameters.
        """
        return f"Q-Learning (ε={self.epsilon}, α={self.alpha}, γ={self.gamma})"

In [None]:
def train(env, policy, num_episodes: int = 10000, max_steps: int = 1000, verbose: bool = False) -> List[float]:
    """
    Trains an RL agent using the given policy.

    This function runs multiple episodes where the agent interacts with the environment, 
    updating its policy based on the rewards received. The training process follows the 
    epsilon-greedy exploration strategy.

    Parameters:
    - env: The reinforcement learning environment (e.g., OpenAI Gym environment).
    - policy: The policy object that determines action selection and updates Q-values.
    - num_episodes (int, optional): The number of episodes to train the agent. Default is 10,000.
    - max_steps (int, optional): The maximum number of steps per episode before termination. Default is 1,000.
    - verbose (bool, optional): Whether to print progress every 100 episodes. Default is False.

    Returns:
    - List[float]: A list of total rewards obtained per episode.
    """
    rewards = []

    for episode in range(num_episodes):
        if episode % 100 == 0 and verbose:
            print(f"Episode {episode}/{num_episodes}")

        state, _ = env.reset()
        action = policy(state)
        total_reward = 0

        for step in range(max_steps):
            next_state, reward, done, truncated, _ = env.step(int(action))
            next_action = policy.update_and_select_action(state, action, reward, next_state, done)
            state, action = next_state, next_action
            total_reward += reward

            if done or truncated:  # Stop at the goal or at an environment time limit
                break

        rewards.append(total_reward)
    print("Training done!")
    return rewards

## Let's do some testing!

In [None]:
NUM_EPISODES = 500
MAX_STEPS = 1000

# Initialize the environment 
env = gym.make("CliffWalking-v0")


# Initialize agents
q_learning = QLearning(env.action_space.n, epsilon=0.1, alpha=0.1, gamma=0.9)
sarsa = SARSA(env.action_space.n, epsilon=0.1, alpha=0.1, gamma=0.9)
expected_sarsa = ExpectedSARSA(env.action_space.n, epsilon=0.1, alpha=0.1, gamma=0.9)

data = {
    str(q_learning): train(env, q_learning, num_episodes=NUM_EPISODES, max_steps=MAX_STEPS),
    str(sarsa): train(env, sarsa, num_episodes=NUM_EPISODES, max_steps=MAX_STEPS),
    str(expected_sarsa): train(env, expected_sarsa, num_episodes=NUM_EPISODES, max_steps=MAX_STEPS),
}
moving_average(data, window_size=50,show_original=False)



In [None]:
NUM_EPISODES = 5000
MAX_STEPS = 1000

# Initialize the environment 
env = gym.make("CliffWalking-v0")

# Initialize agents
q_learning = QLearning(env.action_space.n, epsilon=0.01, alpha=0.01, gamma=0.9)
sarsa = SARSA(env.action_space.n, epsilon=0.01, alpha=0.01, gamma=0.9)
expected_sarsa = ExpectedSARSA(env.action_space.n, epsilon=0.01, alpha=0.01, gamma=0.9)

data = {
    str(q_learning): train(env, q_learning, num_episodes=NUM_EPISODES, max_steps=MAX_STEPS),
    str(sarsa): train(env, sarsa, num_episodes=NUM_EPISODES, max_steps=MAX_STEPS),
    str(expected_sarsa): train(env, expected_sarsa, num_episodes=NUM_EPISODES, max_steps=MAX_STEPS),
}
moving_average(data, window_size=50,show_original=False)



# On-Policy vs. Off-Policy Learning

Reinforcement learning (RL) algorithms can be broadly categorized into **on-policy** and **off-policy** methods based on how they learn from their experiences. The choice between these approaches has important implications for **policy safety, stability, and efficiency** in real-world applications.

---

## On-Policy Learning
- **Definition:**  
  On-policy algorithms learn the value of the policy being carried out by the agent. They update their estimates using actions that are generated by the **current** policy.

- **Examples:**  
  - **SARSA:** Updates Q-values based on the action actually taken by the current ε-greedy policy.
  - **Policy Gradient Methods (e.g., REINFORCE, PPO):** Optimize policy parameters directly using trajectories generated from the current policy.

- **Policy Safety:**  
  On-policy methods tend to be more conservative because they explicitly consider the **exploration behavior** during learning. This makes them well-suited for applications where **taking risky actions can have serious consequences**.

---

## Off-Policy Learning
- **Definition:**  
  Off-policy algorithms learn the value of a **target policy** (typically the greedy, optimal one) **independently** of the policy the agent actually follows. They can use data collected by **a different behavior policy** to improve the learned policy.

- **Examples:**  
  - **Q-Learning:** Uses the maximum estimated Q-value for the next state, regardless of the action taken by the behavior policy.
  - **Deep Q-Networks (DQN):** Extends Q-learning to high-dimensional state spaces using deep neural networks.
  - **Experience Replay Buffers (used in DQN, SAC):** Not an algorithm in themselves, but a technique that off-policy learning makes possible: past transitions are stored and reused for training, making learning more sample-efficient.

- **Policy Safety:**  
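  Off-policy methods are typically more **aggressive** in seeking the optimal policy. They often **converge faster** but may be **unstable** in environments where exploration is risky or where the behavior policy differs significantly from the optimal policy. In CliffWalking, for example, Q-learning learns the shortest path along the cliff edge, so its ε-greedy behavior occasionally steps off the cliff during training, whereas SARSA learns a longer but safer route.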
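
To make the distinction concrete, here is a minimal sketch of the one-step targets that SARSA, Q-learning, and Expected SARSA would compute for the same transition. The helper name `td_targets` and the example numbers are illustrative assumptions; they are not part of the agent classes defined earlier in this notebook.

```python
import numpy as np

def td_targets(q_next, next_action, reward, gamma=0.9, epsilon=0.1):
    """One-step TD targets for a single transition (illustrative helper).

    q_next holds the current Q-value estimates for every action in the next state;
    next_action is the action the behavior policy actually picked there.
    """
    q_next = np.asarray(q_next, dtype=float)
    n_actions = len(q_next)

    # On-policy (SARSA): bootstrap from the action that was actually selected
    sarsa = reward + gamma * q_next[next_action]

    # Off-policy (Q-learning): bootstrap from the greedy action, whatever was selected
    q_learning = reward + gamma * q_next.max()

    # Expected SARSA: average over the epsilon-greedy action probabilities
    probs = np.full(n_actions, epsilon / n_actions)
    probs[q_next.argmax()] += 1.0 - epsilon
    expected_sarsa = reward + gamma * float(probs @ q_next)

    return {
        "sarsa": float(sarsa),
        "q_learning": float(q_learning),
        "expected_sarsa": expected_sarsa,
    }

# Made-up values: action 3 is an exploratory step with a very poor estimate
targets = td_targets(q_next=[-5.0, -1.0, -3.0, -100.0], next_action=3, reward=-1.0)
print(targets)
```

With these numbers the SARSA target is dragged down by the exploratory action, the Q-learning target ignores it, and the Expected SARSA target lies in between.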


In [None]:
def evaluate(env, policy) -> float:
    """
    Evaluates a trained RL policy in the given environment.

    This function runs a single episode using the provided policy and returns the total 
    reward accumulated. The environment is rendered during the evaluation process.

    Parameters:
    - env: The reinforcement learning environment (e.g., OpenAI Gym environment).
    - policy: The trained policy object that selects actions.

    Returns:
    - float: The total reward obtained during the episode.
    """
    state, _ = env.reset()
    total_reward = 0
    done = False
    steps = 0

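    while not done and steps < 1000:
        env.render()
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(int(action))
        # End the episode on termination or on truncation (e.g., a time limit)
        done = terminated or truncated
        state = next_state
        total_reward += reward
        steps += 1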

    return total_reward

## Visualize different agents' approaches 

In [None]:
env = gym.make("CliffWalking-v0")
q_learning = QLearning(env.action_space.n,epsilon=0.05,alpha=0.1, gamma=0.9)
train(env,q_learning,num_episodes=5000,verbose=True)
env = gym.make("CliffWalking-v0",render_mode="human")
evaluate(env,q_learning)
env.close()

In [None]:
env = gym.make("CliffWalking-v0")
sarsa = SARSA(env.action_space.n,epsilon=0.05,alpha=0.1, gamma=0.9)
train(env,sarsa,num_episodes=5000,verbose=True)
env = gym.make("CliffWalking-v0",render_mode="human")
evaluate(env,sarsa)
env.close()

# N-Step TD: Bridging Monte Carlo and TD Learning

N-Step Temporal Difference (TD) Learning is an **intermediate approach** between **one-step TD methods** (such as SARSA and Q-learning) and **Monte Carlo (MC) methods**. It generalizes TD learning by updating value estimates based on rewards accumulated over **multiple steps**, rather than a single step (as in TD(0)) or a full episode (as in Monte Carlo methods).

## How N-Step TD Works  
At each time step $ t $, the **n-step TD update rule** is:

$$
V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t - V(s_t) \right]
$$

where the **n-step return** $ G_t$ is calculated as:

$$
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})
$$

where:  
- $ V(s_t) $ is the estimated value of state $ s_t $.
- $ \alpha $ is the learning rate, controlling the step size for updates.
- $ r_{t+1}, r_{t+2}, \dots, r_{t+n} $ are the rewards collected over the next $ n $ steps.
- $ \gamma $ is the discount factor, weighting the importance of future rewards.
- $ V(s_{t+n}) $ is the estimated value of the state after $ n $ steps.
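
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration under assumed names (`n_step_return`, `n_step_td_update`, and the toy value table `V` are hypothetical), not the implementation used later in this notebook.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of the n collected rewards plus gamma^n * V(s_{t+n})."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap_value

def n_step_td_update(v, state, rewards, bootstrap_state, alpha, gamma):
    """Apply V(s_t) <- V(s_t) + alpha * (G_t - V(s_t)) in place and return G_t."""
    g = n_step_return(rewards, v[bootstrap_state], gamma)
    v[state] += alpha * (g - v[state])
    return g

# Toy value table: s3 is the state reached after n = 3 steps from s0
V = {"s0": 0.0, "s3": -10.0}
G = n_step_td_update(V, "s0", rewards=[-1.0, -1.0, -1.0],
                     bootstrap_state="s3", alpha=0.1, gamma=0.9)
print(G, V["s0"])
```

Here the three rewards contribute about -2.71 and the bootstrapped term about -7.29, so $G_t \approx -10$ and $V(s_0)$ moves from 0 to about -1.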

## Key Characteristics  
- **Balances Bias and Variance:**  
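  - **TD(0)** has **high bias but low variance**: it bootstraps from the current (possibly inaccurate) value estimate, but its target depends on only one random reward and transition.
  - **Monte Carlo** has **low bias but high variance**: it uses the actual return, which is unbiased but accumulates the randomness of every step until the end of the episode.
  - **N-Step TD** provides a **tradeoff**: a larger $n$ reduces bias by using more real rewards, at the cost of higher variance.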
  
- **Adaptive Credit Assignment:**  
  - TD(0) uses only the immediate reward plus the estimated value of the next state.
  - Monte Carlo methods consider the full return after the episode.
  - N-step TD assigns credit **partially**, considering rewards over multiple steps.

- **Handles Partial Episodes:**  
  - Unlike Monte Carlo, **N-step TD can update values before an episode ends**, making it suitable for **online learning**.

## N-Step TD vs. Monte Carlo vs. TD(0)  
| Feature             | N-Step TD | Monte Carlo | TD(0) |
|---------------------|----------|-------------|--------|
| Update Timing      | After $n$ steps | After full episode | After each step |
| Variance           | Moderate | High | Low |
| Bias               | Moderate | Low | High |
| Online Learning    | Yes | No | Yes |
| Sample Efficiency  | Higher than MC | Low | Very High |
| Stability          | More stable than TD(0) | Stable but high variance | Unstable in stochastic cases |

## N-Step TD in Practice  
N-step TD is widely used in **reinforcement learning applications** where waiting for the end of an episode (Monte Carlo) is impractical and **TD(0) propagates reward information only one step per update**. **Choosing the right value of $n$** depends on the problem: small values resemble **TD(0)**, while larger values approximate **Monte Carlo methods**.

Techniques like **TD(λ)** use **eligibility traces** to blend the returns for all values of $n$, weighted by powers of $\lambda$, allowing smooth interpolation between TD(0) ($\lambda = 0$) and Monte Carlo ($\lambda = 1$).
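
For reference, the target that TD(λ) effectively uses is the λ-return (Sutton & Barto, Chapter 12), a weighted average of all n-step returns. Writing $G_t^{(n)}$ for the n-step return defined earlier in this section (a notation introduced here only for clarity):

$$
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
$$

The weights $(1-\lambda)\lambda^{n-1}$ sum to 1. Setting $\lambda = 0$ leaves only the one-step return, while in episodic tasks $\lambda \to 1$ puts all the weight on the full Monte Carlo return.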

![Backup_diagrams](assets/td_mc_backup.png)

![NstepSarsa](assets/NstepSASRA.png)

In [None]:
class NStepSARSA(Policy):
    """
    Implements N-step SARSA without modifying the training loop.
    All N-step logic (buffering, flushing, partial returns) is handled
    inside update_and_select_action.
    """

    def __init__(self, 
                 action_space_size: int,
                 epsilon: float = 0.1, 
                 alpha: float = 0.1, 
                 gamma: float = 0.9,
                 n: int = 5):
        """
        Initializes the N-step SARSA algorithm.

        Parameters:
        - action_space_size (int): Number of available actions.
        - epsilon (float): Probability of choosing a random action (exploration).
        - alpha (float): Learning rate.
        - gamma (float): Discount factor.
        - n (int): Number of steps to look ahead for updates.
        """
        super().__init__(action_space_size, epsilon=epsilon)
        self.alpha = alpha
        self.gamma = gamma
        self.n = n
        
        self.buffer = deque()

    def update_and_select_action(
        self,
        state: int | Iterable,
        action: int,
        reward: float,
        next_state: int | Iterable,
        done: bool
    ) -> int:
        """
        Choose the next action (epsilon-greedy), store this step's transition,
        and perform an N-step SARSA update if possible. If the episode ends,
        flush remaining transitions.

        Parameters:
        - state (int | Iterable): Current state.
        - action (int): Action taken in `state`.
        - reward (float): Reward received.
        - next_state (int | Iterable): Next state after taking `action`.
        - done (bool): Whether the episode has ended.

        Returns:
        - int: The next action for `next_state`, or 0 if the episode ended.
        """
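        s = discretize_state(state)
        ns = discretize_state(next_state)

        # Select the next action; a terminal state needs none, so use a placeholder
        if not done:
            next_action = self(ns)
        else:
            next_action = 0

        # Store the transition together with the action chosen for the next state
        transition = (s, action, reward, done, ns, next_action)
        self.buffer.append(transition)

        # Once n transitions are buffered, the oldest one can be updated
        if len(self.buffer) >= self.n:
            self._update_earliest_transition()

        # At episode end, update the remaining transitions with shorter returns
        if done:
            self._flush_buffer()

        return next_action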

    def _update_earliest_transition(self):
        """
        Perform an N-step SARSA update for the oldest transition in the buffer.
        Uses the next (n-th) transition to add gamma^n * Q(...) if the episode
        has not ended by that step.
        """
        n_seq = list(self.buffer)[:self.n]  # up to n oldest transitions
        s_old, a_old, _, _, _, _ = n_seq[0]

        # Accumulate discounted rewards, stopping at a terminal transition
        G = 0.0
        steps = 0
        for i, (_, _, reward, done, _, _) in enumerate(n_seq):
            G += (self.gamma ** i) * reward
            steps += 1
            if done:
                break

        # Bootstrap from Q(s_{t+n}, a_{t+n}) only when a full n-step window
        # was collected and the episode did not end inside it
        _, _, _, last_done, nth_s, nth_a = n_seq[-1]
        if (not last_done) and (steps == self.n):
            G += (self.gamma ** self.n) * self.q_table[nth_s][nth_a]

        td_error = G - self.q_table[s_old][a_old]
        self.q_table[s_old][a_old] += self.alpha * td_error

        self.buffer.popleft()

    def _flush_buffer(self):
        """
        Flush any leftover transitions in the buffer at episode end,
        performing shorter updates as the episode has ended.
        """
        while self.buffer:
            self._update_earliest_transition()

    def __repr__(self):
        return (f"N-Step SARSA(n={self.n}, ε={self.epsilon}, "
                f"α={self.alpha}, γ={self.gamma})")

In [None]:
NUM_EPISODES = 3000
MAX_STEPS = 1000

env = gym.make("CliffWalking-v0")

In [None]:

sarsa = SARSA(env.action_space.n, epsilon=0.1, alpha=0.01, gamma=0.9)
n_step_sarsa_small = NStepSARSA(env.action_space.n, epsilon=0.1, alpha=0.01, gamma=0.9, n=3)
n_step_sarsa_large = NStepSARSA(env.action_space.n, epsilon=0.1, alpha=0.01, gamma=0.9, n=12)
mc_agent = MonteCarloCliffWalking(env, gamma=0.9, epsilon=0.1, episodes=NUM_EPISODES)

In [None]:

data = {
    str(sarsa): train(env, sarsa, num_episodes=NUM_EPISODES, max_steps=MAX_STEPS),
    str(n_step_sarsa_small): train(env, n_step_sarsa_small, num_episodes=NUM_EPISODES, max_steps=MAX_STEPS),
    str(n_step_sarsa_large): train(env, n_step_sarsa_large, num_episodes=NUM_EPISODES, max_steps=MAX_STEPS),
    str(mc_agent): mc_agent.train()

}


In [None]:
moving_average(data, window_size=50,show_original=False)

# Evaluation of Reinforcement Learning Algorithms: Reliable and Standardized Protocols

Evaluating reinforcement learning (RL) algorithms is crucial to understanding their generalization, robustness, and efficiency. A well-defined evaluation protocol ensures that RL models are not only optimized for a specific task but also perform reliably across different environments and conditions.

To ensure consistency and scientific reproducibility, researchers follow standardized evaluation protocols, including benchmarking on established environments, statistical significance testing, and multiple training runs.

---

## Key Principles of RL Evaluation

### 1. Performance Metrics
The most commonly used metrics for RL evaluation include:  
- **Total Reward (Return):** The cumulative reward obtained per episode.  
- **Sample Efficiency:** The number of interactions required to reach optimal performance.  
- **Learning Stability:** How consistently the agent improves over time.  
- **Final Convergence:** Whether the algorithm stabilizes at an optimal policy.  
- **Robustness:** The agent’s ability to generalize across different scenarios.  

---

### 2. Reliable Experimental Design
To ensure fair comparisons, RL algorithms should be evaluated under standardized conditions.  

#### Multiple Training Runs
Since RL training involves stochastic processes, single-run evaluations are not reliable. A standard practice is to conduct multiple independent training runs (e.g., 5 to 10 runs) and report:  
- The **mean** and **standard deviation** of performance metrics.  
- Confidence intervals to assess statistical significance.  
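
As a minimal sketch of this reporting step, the helper below summarizes one score per training run. The name `summarize_runs`, the normal-approximation quantile `z=1.96`, and the example scores are assumptions made for illustration.

```python
import numpy as np

def summarize_runs(scores, z=1.96):
    """Mean, sample standard deviation, and an approximate 95% confidence interval.

    scores: one number per independent training run
            (e.g., the mean return over the last 100 episodes of that run).
    z=1.96 is the normal-approximation quantile; with only 5-10 runs the
    resulting interval is optimistic, and a Student-t quantile would be wider.
    """
    x = np.asarray(scores, dtype=float)
    mean = float(x.mean())
    std = float(x.std(ddof=1))  # ddof=1: sample (not population) standard deviation
    half_width = float(z * std / np.sqrt(len(x)))  # z times the standard error
    return mean, std, (mean - half_width, mean + half_width)

# Made-up final scores from five hypothetical runs
mean, std, ci = summarize_runs([-20.0, -18.0, -22.0, -19.0, -21.0])
print(f"mean={mean:.2f}, std={std:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

In this notebook, each score could come from one call to `train(...)`, reduced to a single number per run.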

#### Benchmarking on Standard Environments
RL algorithms should be tested across diverse environments to assess their generalization ability. Commonly used benchmarks include:  
- **OpenAI Gym** (e.g., CartPole, LunarLander, Atari games)  
- **DeepMind Control Suite**  
- **MuJoCo (Robotics Simulations)**  
- **Procgen (Procedurally Generated Environments)**  

Different environments **require vastly different training durations** due to their complexity.  
- **Simple environments** like **CartPole** may require only **a few thousand episodes** to reach optimal performance.  
- **More complex environments** such as **Atari games** often require between **100,000 and 1,000,000 episodes** to learn effective policies.  
- **High-dimensional control tasks** (e.g., Mujoco, robotic manipulation) may require millions of steps due to **continuous action spaces and sparse rewards**.  

Training duration should be carefully chosen based on the **difficulty of the task** and the **expected sample efficiency** of the RL algorithm.  

#### Hyperparameter Sensitivity Analysis
RL models often require extensive tuning. A robust evaluation must analyze:  
- The sensitivity of the algorithm to **learning rate (α), discount factor (γ), and exploration rate (ε)**.  
- Whether the model **converges reliably** across different settings.  
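
A sensitivity analysis can be as simple as a grid sweep averaged over seeds. The sketch below assumes a hypothetical callable `run_fn(alpha, epsilon, seed)` that performs one training run and returns a scalar score; `toy_run` is a stand-in so the example runs on its own and does not model any real agent.

```python
from itertools import product
from statistics import mean

def sensitivity_sweep(run_fn, alphas, epsilons, seeds):
    """Return {(alpha, epsilon): mean score over seeds} for every grid point."""
    results = {}
    for alpha, epsilon in product(alphas, epsilons):
        scores = [run_fn(alpha=alpha, epsilon=epsilon, seed=seed) for seed in seeds]
        results[(alpha, epsilon)] = mean(scores)
    return results

def toy_run(alpha, epsilon, seed):
    """Hypothetical stand-in for a training run: best at alpha=0.1 and small epsilon."""
    return -abs(alpha - 0.1) * 100 - epsilon * 10 + 0.0 * seed

results = sensitivity_sweep(toy_run, alphas=[0.01, 0.1, 0.5],
                            epsilons=[0.01, 0.1], seeds=[0, 1, 2])
best = max(results, key=results.get)
for setting, score in sorted(results.items()):
    print(setting, round(score, 2))
print("best setting:", best)
```

In practice `run_fn` would wrap a real training run (for example, building an agent with the given α and ε, calling `train`, and returning the mean return of the final episodes). A flat results table suggests the algorithm is robust to these hyperparameters; a sharp peak suggests it is not.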

---

### 3. Statistical Robustness and Reporting
A rigorous RL evaluation protocol should:  
- Report variance: RL algorithms can be highly unstable. Showing only the best-performing run is misleading.  
- Plot learning curves with confidence intervals (e.g., shaded regions for standard deviation).  
- Use statistical tests (e.g., t-tests, bootstrapping) to compare different RL algorithms.  
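
As one possible way to compare two algorithms, the sketch below computes a percentile bootstrap confidence interval for the difference in mean score. The function name `bootstrap_diff_ci`, its defaults, and the example scores are assumptions for illustration; other tests (such as Welch's t-test) are equally valid choices.

```python
import numpy as np

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)

    # Resample each group with replacement; one row per bootstrap replicate
    idx_a = rng.integers(0, len(a), size=(n_resamples, len(a)))
    idx_b = rng.integers(0, len(b), size=(n_resamples, len(b)))
    diffs = a[idx_a].mean(axis=1) - b[idx_b].mean(axis=1)

    # Take the central `confidence` mass of the resampled differences
    tail = (1.0 - confidence) / 2.0
    low, high = np.quantile(diffs, [tail, 1.0 - tail])
    return float(low), float(high)

# Made-up per-run scores for two hypothetical algorithms
algo_a = [-15.0, -16.0, -14.0, -15.0, -17.0]
algo_b = [-30.0, -28.0, -33.0, -31.0, -29.0]
low, high = bootstrap_diff_ci(algo_a, algo_b)
print(f"95% CI for the difference in mean score: ({low:.2f}, {high:.2f})")
```

If the interval excludes 0, the data suggest a real difference between the two algorithms. With only a handful of runs the interval is coarse, so it should be read as a rough indication rather than a precise estimate.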

---

## Best Practices for RL Evaluation
- Train and evaluate on multiple environments to test generalization.  
- Run experiments multiple times to average out run-to-run randomness.  
- Use moving averages to smooth learning curves for better visualization.  
- Report variance and confidence intervals, not just raw scores.  
- Conduct hyperparameter tuning and analyze sensitivity.  
- Compare against meaningful baselines (e.g., random policies, human performance, existing RL benchmarks).  
- **Adjust training duration** based on the complexity of the environment:
  - **Small-scale environments**: 10,000 – 100,000 episodes.
  - **Atari-like environments**: 100,000 – 1,000,000 episodes.
  - **Continuous control and robotics**: 1,000,000+ episodes.  

By following these reliable and standardized evaluation protocols, we ensure that RL research remains reproducible, fair, and scientifically rigorous.

In [None]:

def plot_rl_training_runs(data_dict: Dict[str, List[List[float]]], window_size: int = 10) -> None:
    """
    Plots the mean and standard deviation of episode rewards across multiple training runs.

    This function visualizes the performance of different RL algorithms by computing the 
    mean and standard deviation of rewards over multiple training runs. A moving average 
    is applied to smooth the learning curves.

    Parameters:
    - data_dict (Dict[str, List[List[float]]]): A dictionary where keys are algorithm names and values 
      are 2D lists or arrays of shape (M, N), where:
        - M is the number of training runs.
        - N is the number of episodes per run.
    - window_size (int, optional): The number of episodes for moving average smoothing. Default is 10.

    Returns:
    - None: Displays the plotted learning curves.
    """
    plt.figure(figsize=(10, 5))

    for algo, rewards in data_dict.items():
        rewards = np.array(rewards)  # Convert to numpy array (M, N)
        rewards = np.clip(rewards, -500, None)
        # Compute mean and standard deviation across runs (axis=0)
        mean_rewards = np.mean(rewards, axis=0)
        std_rewards = np.std(rewards, axis=0)

        # Smooth with moving average
        if window_size > 1:
            mean_rewards = np.convolve(mean_rewards, np.ones(window_size) / window_size, mode='valid')
            std_rewards = np.convolve(std_rewards, np.ones(window_size) / window_size, mode='valid')

        # Plot mean curve
        x_values = np.arange(len(mean_rewards))
        plt.plot(x_values, mean_rewards, label=f"{algo} (Mean)")

        # Plot shaded region for standard deviation
        plt.fill_between(x_values, mean_rewards - std_rewards, mean_rewards + std_rewards, alpha=0.2)

    plt.xlabel("Episodes")
    plt.ylabel("Returns")
    plt.title(f"Algorithm Performance Comparison with {window_size}-Episode Smoothing")
    plt.legend()
    plt.grid()
    plt.show()

In [None]:
# Constants
NUM_RUNS = 5
NUM_EPISODES = 1000
MAX_STEPS = 1000

# Initialize the environment
env = gym.make("CliffWalking-v0")

# Reference line: -13 is the best possible return in CliffWalking
returns = {
    'Optimal Policy': [[-13 for _ in range(NUM_EPISODES)]]
}

# One factory per TD agent, so every run starts from a freshly initialized agent
agent_factories = [
    lambda: QLearning(env.action_space.n, epsilon=0.01, alpha=0.01, gamma=0.9),
    lambda: SARSA(env.action_space.n, epsilon=0.3, alpha=0.1, gamma=0.9),
    lambda: ExpectedSARSA(env.action_space.n, epsilon=0.1, alpha=0.1, gamma=0.9),
    lambda: NStepSARSA(env.action_space.n, epsilon=0.1, alpha=0.1, gamma=0.9, n=4),
]

for make_agent in agent_factories:
    for _ in range(NUM_RUNS):
        algo = make_agent()
        returns.setdefault(str(algo), []).append(train(env, algo, NUM_EPISODES, MAX_STEPS))

# The Monte Carlo agent uses its own training loop
for _ in range(NUM_RUNS):
    algo = MonteCarloCliffWalking(env, gamma=0.9, epsilon=0.1, episodes=NUM_EPISODES)
    returns.setdefault(str(algo), []).append(algo.train())

In [None]:
plot_rl_training_runs(returns, window_size=10)


You can try to train a tabular Q-Learning agent on the Pong environment using the RAM state representation.

The state space is very high-dimensional, so training may take a long time (4+ hours).

You can also try other algorithms like SARSA, Expected SARSA, or N-Step SARSA.

Uncomment the following code to train the agent on the Pong environment.

In [None]:
# env = gym.make("Pong-ram-v4")
# ql = QLearning(env.action_space.n,epsilon=0.1,alpha=0.1, gamma=0.9)
# rewards = train(env, ql,100_000,1000,verbose=True)
# moving_average({"Q-Learning": rewards}, window_size=1000, show_original=False)