# **Reinforcement Learning**
<img align="right" src="https://vitalflux.com/wp-content/uploads/2020/12/Reinforcement-learning-real-world-example.png">

- In reinforcement learning, your system learns how to interact intuitively with the environment by basically doing stuff and watching what happens.

if you need the last version of gym use block of code below:

```sh
!pip uninstall gym -y
!pip install gym
```
<br>

And here is gymnasium version:

```python
gymnasium.__version__
```
1.2.0

In [None]:
# !pip install -U gym==0.25.2
!pip install swig
!pip install gymnasium[atari]
!pip install gymnasium[box2d]
!pip install gymnasium[accept-rom-license]
# !pip install autorom[accept-rom-license]

In [19]:
import numpy as np
import matplotlib.pyplot as plt
import gymnasium as gym
from IPython.core.display import HTML
from base64 import b64encode
from gym.wrappers import record_video, record_episode_statistics
from gym.wrappers import RecordVideo, RecordEpisodeStatistics
import torch
import os
from tqdm.auto import tqdm
os.environ["SDL_VIDEODRIVER"] = "dummy"

import warnings
warnings.filterwarnings('ignore')

In [12]:
def display_video(episode=0, video_width=600):
    """
    Displays a video from a specified episode with customizable width.

    Args:
        episode (int): The episode number to load the video for. Defaults to 0.
        video_width (int): The width of the video player in pixels. Defaults to 600.

    Returns:
        IPython.display.HTML: An HTML video element that can be rendered in Jupyter notebooks.

    Note:
        - The function expects video files to be in './video/' directory with naming format 'rl-video-episode-{N}.mp4'
        - Videos are base64 encoded and embedded directly in the HTML for display
    """
    # Construct the path to the video file based on episode number
    video_path = f"./video/rl-video-episode-{episode}.mp4"

    # Read the video file as binary data
    video_file = open(video_path, "rb").read()

    # Encode the binary video data as base64 string
    decoded = b64encode(video_file).decode()

    # Create a data URL for the video
    video_url = f"data:video/mp4;base64,{decoded}"

    # Return an HTML video element with the embedded video
    return HTML(f"""<video width="{video_width}"" controls><source src="{video_url}"></video>""")

def create_env(name, render_mode="rgb_array", record=False, eps_record=50, video_folder='./video'):
    """
    Creates and configures a Gym environment with optional video recording and statistics tracking.

    Args:
        name (str): Name of the Gym environment to create (e.g., 'CartPole-v1')
        render_mode (str): Rendering mode - "human", "rgb_array", or "ansi". Defaults to "rgb_array"
        record (bool): Whether to record videos of the environment. Defaults to False
        eps_record (int): Record a video every N episodes (when record=True). Defaults to 50
        video_folder (str): Directory to save recorded videos. Defaults to './video'

    Returns:
        gym.Env: Configured Gym environment wrapped with recording and statistics tracking

    Note:
        - When record=True, videos will be saved in the specified folder with automatic naming
        - The environment is always wrapped with episode statistics tracking
    """
    # Create base Gym environment with specified render mode
    env = gym.make(name, render_mode=render_mode)

    # Optionally wrap environment with video recorder
    if record:
        # Record video every eps_record episodes (trigger function)
        env = RecordVideo(env, video_folder=video_folder,
                         episode_trigger=lambda x: x % eps_record == 0)

    # Always wrap environment with episode statistics tracker
    env = RecordEpisodeStatistics(env)

    return env

def show_reward(total_rewards):
    """
    Plots the progression of rewards across episodes using matplotlib.

    Args:
        total_rewards (list or array-like): A sequence of reward values obtained per episode.

    Displays:
        A line plot showing the reward trend over episodes with:
        - X-axis: Episode number
        - Y-axis: Reward value

    Note:
        - This function immediately displays the plot using plt.show()
        - The plot uses default matplotlib styling
        - Useful for visualizing training progress in reinforcement learning
    """
    # Create line plot of reward values
    plt.plot(total_rewards)

    # Label the x-axis as 'Episode'
    plt.xlabel('Episode')

    # Label the y-axis as 'Reward'
    plt.ylabel('Reward')

    # Display the plot
    plt.show()

## **Monte Carlo Learning**

In the previous Notebook, we evaluated and solved a **Markov Decision Process (MDP)** using **dynamic programming (DP)**. **Model-based** methods such as DP have some drawbacks. They require the environment to be fully known, including the transition matrix and reward matrix. They also have limited scalability, especially for environments with plenty of states

**model-free approach**, the Monte Carlo (MC) methods, which have no requirement of prior knowledge of the environment and are much more scalable than DP.

The term **Monte Carlo** is often used more broadly for any estimation method. Monte Carlo methods require only experience, sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.

It is a method for estimating Value-action(Value|State, Action) or Value function(Value|State) using some sample runs from the environment for which we are estimating the Value function

<br>

- **types of Monte Carlo learning:**
1. $\textit{First Visit Monte Carlo}$: First visit estimates (Value|State: S1) as the average of the returns following the first visit to the state S1
2. $\textit{Every Visit Monte Carlo}$: It estimates (Value|State: S1) as the average of returns for every visit to the State S1.
- **Example:**

$$
\textit{First iteration: }A+3 \rightarrow A+2 \rightarrow B-4 \rightarrow A+4 \rightarrow
B-3 \rightarrow terminated \\
\textit{Second iteration: }B-2 \rightarrow A+3 \rightarrow B-3 \rightarrow terminated
$$

<br>

$\textit{First Visit}$ | $\textit{V(a)}$ | $\textit{V(b)}$|
-----------------------|-----------------|----------------|
$\textit{First iteraion}$ | $3+2-4+4-3=2$ | $-4+4–3=-3$
$\textit{Second iteraion}$ | $3-3=0$ | $-2+3+-3=-2$
$\textit{Sum}$ | $\frac{2+0}{2}=1$ | $\frac{-3-2}{2}=-2.5$

<br>
<br>

$\textit{Every Visit}$ | $\textit{V(a)}$ | $\textit{V(b)}$
-----------------------|-----------------|----------------|
$\textit{First iteraion}$ | $(3+2-4+4-3)+(2-4+4-3)+(4-3)=2-1+1$ | $(-4+4-3)+(-3)=-3+-3$
$\textit{Second iteraion}$ | $3-3=0$ | $(-2+3–3)+(-3)=-2+-3$
$\textit{Sum}$ | $\frac{2+-1+1+0}{4}=0.5$ | $\frac{-3+-3+-2-3}{4}=-2.75$


<br>

**Note**:As we have been given 2 different iterations, we will be summing all the rewards coming after A (including that of A) after the first visit to ‘A’. It must be noted that if an episode doesn’t have an occurence of ‘A’, it won’t be considered in the average.


## **Performing Monte Carlo policy evaluation**

- **model-based algorithm**: A reinforcement learning algorithm that needs a known MDP is categorized as a model-based algorithm.
- **model-free algorithm**: On the other hand, one with no requirement of prior knowledge of transitions and rewards is called a model-free algorithm. Monte Carlo-based reinforcement learning is a model-free approach.

we will evaluate the value function using the Monte Carlo method. assuming we don't have access to both of environment transition and reward matrices. You will recall that the returns of a process, which are the total rewards over the long run, are as follows:

$$
\large G_t=\sum_{k}^{\infty}\gamma^k R_{t+k+1}
$$

<br>

> $G_t​$:The total discounted reward accumulated from time step t onward.<br>
$\sum_{k=0}^{\infty}$ (Summation): Monte Carlo methods sum rewards over complete episodes (from current state to termination).<br>
$\gamma$ (Discount factor): $\gamma$ (gamma) is the discount factor (0 ≤ $\gamma$ ≤ 1).<br>
​$R_{t+k+1}$(Reward sequence): The actual reward received at time $t+k+1$.<br>
$t$ is current time step and $k$ is steps into the future

<br>

MC policy evaluation uses **empirical mean return** instead of **expected return** (as in DP) to estimate the value function.

- **Note**: in the Monte Carlo setting, we need to keep track of the states and rewards for all steps, since we don't have access to the full environment, including the transition probabilities and reward matrix.


### **First-Visit Monte Carlo Prediction**
**Input**: policy ($\pi$), number of episodes *num_episodes*  
**Output**: value function: $V \approx v_{\pi}$ if num_episodes is large enough

1. Initialize for all states ($s \in \mathcal{S}$):  
   - $N(s) = 0$
   - $\text{Returns}(s) = 0$  

2. **For each episode** ($e \leftarrow 1$ to $e \leftarrow \textit{num_episodes}$):  
   - Generate an episode $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$ using $\pi$
   - $G \leftarrow 0$
   - **Loop backwards** from $t = T-1$ to $t=0$:  
     - $G \leftarrow G + R_{t+1}$  
     - **If** $S_t$ appears for the **first time** in the episode:  
       - $\text{Returns}(S_t) \leftarrow \text{Returns}(S_t) + G$  
       - $N(S_t) \leftarrow N(S_t) + 1$

3. **Update value function**:  
$$
V(s) \leftarrow \frac{\text{Returns}(s)}{N(s)} \quad \forall s \in \mathcal{S}
$$

---

### **Every-Visit Monte Carlo Prediction**
**Input**: policy $\pi$, number of episodes *num_episodes*  
**Output**: value function:$V \approx v_{\pi}$ if num_episodes is large enough

1. Initialize for all states $s \in \mathcal{S}$:  
   - $N(s) = 0$
   - $\text{Returns}(s) = 0$

2.  **For each episode** ($e \leftarrow 1 $ to $e \leftarrow \textit{num_episodes}$): :  
   - Generate an episode $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$ using $\pi$
   - $G \leftarrow 0$
   - **Loop backwards** from $t = T-1$ to $t=0$:  
     - $G \leftarrow G + R_{t+1}$
     - **Update for every visit** to $S_t$:
       - $\text{Returns}(S_t) \leftarrow \text{Returns}(S_t) + G$
       - $N(S_t) \leftarrow N(S_t) + 1$

3. **Update value function**:  
$$
V(s) = \frac{\text{Returns}(s)}{N(s)} \quad \forall s \in \mathcal{S}
$$

In [13]:
env = create_env("FrozenLake-v1")

In [14]:
def run_episode(env, policy):
    """
    Executes one episode in the environment following the given policy.

    Args:
        env (gym.Env): The environment to interact with
        policy (torch.Tensor): The policy mapping states to actions

    Returns:
        tuple: (states, rewards) where:
            - states (torch.Tensor): Sequence of visited states
            - rewards (torch.Tensor): Sequence of received rewards

    Note:
        - The episode runs until termination (is_done = True)
        - States and rewards are converted to PyTorch tensors
        - The policy must implement the [] operator for state indexing
    """
    # Initialize the environment and get starting state
    state, _ = env.reset()

    # Initialize lists to store episode data
    rewards = []
    states = [state]

    # Initialize termination flag
    is_done = False

    # Run the episode until termination
    while not is_done:
        # Select action according to policy for current state
        action = policy[state].item()

        # Execute action in environment
        state, reward, is_done, info = env.step(action)

        # Store the reward
        rewards.append(reward)

        # Store next state if episode continues
        if not is_done:
            states.append(state)
        else:
            break

    # Convert lists to PyTorch tensors
    states = torch.tensor(states)
    rewards = torch.tensor(rewards)

    return states, rewards

In [21]:
def mc_first_visit(env, policy, gamma, n_episode):
    """
    Monte Carlo (First-Visit) method for estimating the value function under a given policy.

    Parameters:
        env: The environment object (assumed to follow OpenAI Gym's structure).
        policy: A policy to follow, which maps states to actions.
        gamma: Discount factor (0 <= gamma < 1) to weigh future rewards.
        n_episode: Number of episodes to sample.

    Returns:
        V (torch.Tensor): Estimated value function for each state.
    """
    # Number of states in the environment
    n_state = env.observation_space.n

    # Initialize the value function and visit counts to zero
    V = torch.zeros(n_state)  # Value function
    N = torch.zeros(n_state)  # Number of first visits to each state

    # Iterate over the specified number of episodes
    pbar = tqdm(range(n_episode), desc="Episode")
    for episode in pbar:
        # Generate a single episode following the policy
        states, rewards = run_episode(env, policy)

        # Initialize return and other variables
        G = torch.zeros(n_state)  # To store return for each state
        first_visit = torch.zeros(n_state)  # Marks the first visit to a state in the episode
        return_t = 0

        # Process the episode in reverse (from terminal state to the initial state)
        for state_t, reward_t in zip(reversed(states), reversed(rewards)):
            return_t = reward_t + gamma * return_t  # Calculate return
            G[state_t] = return_t  # Store the return for the state
            first_visit[state_t] = 1  # Mark the state as visited for the first time

        # Update value function and visit counts for first-visit states
        for state in range(n_state):
            if first_visit[state] > 0:  # Only update for first-visit states
                V[state] += G[state]
                N[state] += 1

        pbar.set_description(f"G: {G.sum()}")
    # Finalize the value function by averaging returns for each state
    for state in range(n_state):
        if N[state] > 0:
            V[state] = V[state] / N[state]

    return V

In [27]:
def mc_every_visit(env, policy, gamma, n_episode):
    """
    Monte Carlo (Every-Visit) method for estimating the value function under a given policy.

    Parameters:
        env: The environment object (assumed to follow OpenAI Gym's structure).
        policy: A policy to follow, which maps states to actions.
        gamma: Discount factor (0 <= gamma < 1) to weigh future rewards.
        n_episode: Number of episodes to sample.

    Returns:
        V (torch.Tensor): Estimated value function for each state.
    """
    # Number of states in the environment
    n_state = env.observation_space.n

    # Initialize the value function, visit counts, and cumulative returns to zero
    V = torch.zeros(n_state)  # Value function
    N = torch.zeros(n_state)  # Visit counts for every visit
    G = torch.zeros(n_state)  # Cumulative return for each state

    # Iterate over the specified number of episodes
    pbar = tqdm(range(n_episode), desc="Episode")
    for episode in pbar:
        # Generate a single episode following the policy
        states, rewards = run_episode(env, policy)

        # Initialize return
        return_t = 0

        # Process the episode in reverse (from terminal state to the initial state)
        for state_t, reward_t in zip(reversed(states), reversed(rewards)):
            return_t = reward_t + gamma * return_t  # Calculate return
            G[state_t] += return_t  # Add the return to the cumulative return
            N[state_t] += 1  # Increment visit count for the state

        pbar.set_description(f"G: {G.sum()}")

    # Finalize the value function by averaging returns for each state
    for state in range(n_state):
        if N[state] > 0:
            V[state] = G[state] / N[state]

    return V

## **Another approach**

We store the cumulative total, the count of first visits and the average:

$$
    N(s) = N(s) + 1;\hspace{5mm} S(s) = S(s) + G;\hspace{5mm} V(s) = \frac{S(s)}{N(s)}
$$

> $N(s)$: visit count;<br>
$S(s)$: cumulative state values;<br>
$V(s)$: estimate state values

<br>

We can update $V(s)$ in another way
---

---

$$
\begin{split}
    \\
    V(s)_{n+1} & = \frac{S(s)_n + G}{N(s)_{n+1}}; \hspace{1cm} S(s) = V(s) * N(s) \\
    \\
    V(s)_{n+1} & = \frac{V(s)_n * N(s)_n + G}{N(s)_{n+1}}; \hspace{1cm} N(s)_{n+1} = N(s)_n + 1 \\
    \\
    V(s)_{n+1} & = \frac{V(s)_n * N(s)_{n+1} - V(s)_n + G}{N(s)_{n+1}} \\
    \\
    V(s)_{n+1} & = V(s)_n + \frac{1}{N(s)_{n+1}}\bigl[ G - V(s)_n\bigr]\\
\end{split}
$$

---
<br>

$\bigl[ G - V(s)_n\bigr]$ can be viewed as an error, and $\frac{1}{N}$ reduce to zero as $N$ becomes very large. we can use constant $\alpha$ as a factor instead of $\frac{1}{N}$, which is better for nonstationary problems:


$$
    V(s)_{n+1} = V(s)_n + \alpha\bigl( G - V(s)_n\bigr)\\
$$


In [26]:
def mc_first_visit2(env, policy, gamma, n_episode):
    """
    Monte Carlo (First-Visit) method for estimating the value function under a given policy.
    This version updates the value function incrementally using an online formula.

    Parameters:
        env: The environment object (assumed to follow OpenAI Gym's structure).
        policy: A policy to follow, which maps states to actions.
        gamma: Discount factor (0 <= gamma < 1) to weigh future rewards.
        n_episode: Number of episodes to sample.

    Returns:
        V (torch.Tensor): Estimated value function for each state.
    """
    # Number of states in the environment
    n_state = env.observation_space.n

    # Initialize the value function and visit counts to zero
    V = torch.zeros(n_state)  # Value function
    N = torch.zeros(n_state)  # Number of first visits to each state

    # Iterate over the specified number of episodes
    pbar = tqdm(range(n_episode), desc="Episode")
    for episode in pbar:
        # Generate a single episode following the policy
        states, rewards = run_episode(env, policy)

        # Initialize return and other variables
        G = torch.zeros(n_state)  # To store return for each state
        first_visit = torch.zeros(n_state)  # Marks the first visit to a state in the episode
        return_t = 0
        count = len(states)  # Total number of states visited in the episode

        # Process the episode in reverse (from terminal state to the initial state)
        for t in range(count - 1, -1, -1):  # Traverse the states in reverse order
            state_t, reward_t = states[t], rewards[t]
            return_t = reward_t + gamma * return_t  # Calculate return
            G[state_t] = return_t  # Store the return for the state
            first_visit[state_t] = 1  # Mark the state as visited for the first time

            # Check if this is the first visit to the state
            if state_t not in states[:t]:
                N[state_t] += 1  # Increment the visit count for the state
                # Incrementally update the value function using the online formula
                V[state_t] = V[state_t] + (1 / N[state_t]) * (G[state_t] - V[state_t])

        pbar.set_description(f"G: {G.sum()}")

    return V

|Aspect | mc_first_visit | mc_first_visit2|
|-------|----------------|----------------|
|Update Method |	Uses a batch update after all episodes.	| Updates incrementally after each visit.|
|Return Storage	|Stores and sums all returns before averaging.	|Directly computes averages during updates.|
|Visit Check |Uses a first_visit flag to track first visits.| Checks if the state is in states[:t].|
|Memory Usage| Requires additional memory for storing G and first_visit.| Lower memory usage due to incremental updates.|
|Convergence Speed| May converge slightly slower due to batch updates.| Potentially faster convergence due to online updates.|
|Code Complexity| Straightforward with clear separation of steps.| Slightly more complex due to conditional checks and incremental updates.|

In [24]:
gamma = 1
# use policy from previous notebook
policy = torch.tensor([0., 3., 3., 3., 0., 3., 2., 3., 3., 1., 0., 3., 3., 2., 1., 3.]).long()
n_episode = 1000

first_visit = mc_first_visit(env, policy, gamma, n_episode)
print(f"The value function calculated by first visit MC:\n{first_visit}\n ")

Episode:   0%|          | 0/1000 [00:00<?, ?it/s]

The value function calculated by first visit MC:
tensor([0.7460, 0.4887, 0.4815, 0.4286, 0.7460, 0.0000, 0.3833, 0.0000, 0.7460,
        0.7467, 0.6756, 0.0000, 0.0000, 0.8065, 0.8881, 0.0000])
 


In [28]:
every_visit = mc_every_visit(env, policy, gamma, n_episode)
print(f"The value function calculated by every visit MC:\n{every_visit}\n ")

Episode:   0%|          | 0/1000 [00:00<?, ?it/s]

The value function calculated by every visit MC:
tensor([0.5969, 0.4928, 0.4410, 0.4389, 0.6098, 0.0000, 0.3775, 0.0000, 0.6398,
        0.6725, 0.6420, 0.0000, 0.0000, 0.7616, 0.8794, 0.0000])
 


In [29]:
first_visit2 = mc_first_visit2(env, policy, gamma, n_episode)
print(f"The value function calculated by first visit MC:\n{first_visit2}\n ")

Episode:   0%|          | 0/1000 [00:00<?, ?it/s]

The value function calculated by first visit MC:
tensor([0.7570, 0.4841, 0.4766, 0.4023, 0.7570, 0.0000, 0.3902, 0.0000, 0.7570,
        0.7570, 0.6839, 0.0000, 0.0000, 0.8086, 0.9044, 0.0000])
 


## **Bias and Variance**


**Bias** refers to the property of the model to converge to the true  value. Some estimators are biased, meaning they are not able to converge to the true value due to lack of flexibility.

**Variance** refers to the model estimate being sensitive to the specific sample data being used. This means the estimate value may fluctuate a lot and hence may require a large data set or trials for the estimate average to converge to a stable value.

<br>

- **bias-variance trade-off**
Flexible models have low bias; however, they can overfit to the data, making the estimates vary a lot as the training data changes. On the contrary, simple models have high bias. So they may not be able to represent the true underlying model. But they will also have low variance as they do not overfit.

<br>

### **first visit** is unbiased but has high variance. **Every visit** has bias that goes down to zero, and it has low variance, which usually converges to the true value estimates faster than first visit.

## **BlackJack**
<img align='right' width='400' src="https://www.gymlibrary.dev/_images/blackjack.gif">


|              |            |
|--------------|------------|
| Action Space | Discrete(2)|
| Observation Space | Tuple(Discrete(32), Discrete(11), Discrete(2)) |
| Import | gym.make("Blackjack-v1") |

Card Values:

- Face cards (Jack, Queen, King) have a point value of 10.
- Aces can either count as 11 (called a ‘usable ace’) or 1.
- Numerical cards (2-9) have a value equal to their number.

Action Space:

- There are two actions: stick (0), and hit (1).

Observation Space:

- The observation consists of a 3-tuple containing: the player’s current sum, the value of the dealer’s one showing card (1-10 where 1 is ace), and whether the player holds a usable ace (0 or 1).

In [30]:
env = create_env("Blackjack-v1")

In [48]:
from collections import defaultdict

def run_episode(env, hold_score):
    state, _ = env.reset()
    rewards = []
    states = [state]
    is_done = False
    while not is_done:
        action = 1 if state[0] < hold_score else 0
        state, reward, is_done, info = env.step(action)
        rewards.append(reward)
        if is_done:
            break
        else:
            states.append(state)

    states = torch.tensor(states)
    rewards = torch.tensor(rewards)
    return states, rewards

def mc_first_visit_blackjack(env, hold_score, gamma, n_episode):
    V = defaultdict(float)
    N = defaultdict(int)
    pbar = tqdm(range(n_episode), desc="Epsiode")
    for episode in pbar:
        states, rewards = run_episode(env, hold_score)
        G = {}
        return_t = 0
        for state_t, reward_t in zip(reversed(states), reversed(rewards)):
            return_t = reward_t + gamma * return_t
            G[state_t] = return_t

        for state, return_t in G.items():
            if state[0] <= 21:
                V[state] += return_t
                N[state] += 1
        pbar.set_description(f"G: {sum(G.values())}")

    for state in V:
        V[state] = V[state] / N[state]
    return V

In [49]:
env = create_env("Blackjack-v1")
hold_score = 18
gamma = 1
n_episode = 500
value = mc_first_visit_blackjack(env, hold_score, gamma, n_episode)

Epsiode:   0%|          | 0/500 [00:00<?, ?it/s]

In [50]:
len(value)

868

## **Performing on-policy Monte Carlo control**

**Monte Carlo prediction** is used to evaluate the value for a given policy, while **Monte Carlo control (MC control)** is for finding the optimal policy when such a policy is not given. There are basically categories of MC control: **on-policy** and **off-policy**.
- On-policy methods learn about the optimal policy by executing the policy and evaluating and improving it
- off-policy methods learn about the optimal policy using data generated by another policy.

**Note:** The way on-policy MC control works is quite similar to policy iteration in dynamic programming, which has two phases, evaluation and improvement:

- **In the evaluation phase**, instead of evaluating the value function (also called the state value, or utility), it evaluates the action-value. The action-value is more frequently called the **Q-function**, which is the utility of a state-action pair $(s, a)$ by taking action a in state s under a given policy. Again, the evaluation can be conducted in a first-visit manner or an every-visit manner.

- **In the improvement phase**, the policy is updated by assigning the optimal action to each state:

<br>

$$
\large\pi(s) = \underset{a}{argmax}Q(s, a)
$$

In [54]:
from collections import defaultdict

# Function to simulate a single episode based on the given Q-table and environment
def run_episode(env, Q, n_action):
    """
    Simulates one episode of the environment using the given Q-values.

    Parameters:
        env: The environment object (assumed to follow OpenAI Gym's structure).
        Q (defaultdict): Q-table storing state-action value estimates.
        n_action (int): The number of possible actions in the environment.

    Returns:
        states (list): A list of states visited during the episode.
        actions (list): A list of actions taken during the episode.
        rewards (list): A list of rewards received during the episode.
    """
    state, _ = env.reset()  # Reset the environment to its initial state
    rewards = []         # List to store rewards for the episode
    actions = []         # List to store actions taken
    states = []          # List to store states visited
    is_done = False      # Variable to track if the episode is finished

    # Take a random initial action
    action = torch.randint(0, n_action, [1]).item()

    while not is_done:
        actions.append(action)    # Store the current action
        states.append(state)      # Store the current state

        # Perform the action and observe the next state, reward, and termination flag
        state, reward, is_done, info = env.step(action)
        rewards.append(reward)    # Store the reward for the current step

        if is_done:
            break

        # Select the next action using the current policy (greedy policy derived from Q)
        action = torch.argmax(Q[state]).item()

    return states, actions, rewards

In [55]:
# Monte Carlo On-Policy Control with First-Visit MC Prediction
def mc_on_policy(env, gamma, n_episode):
    """
    Implements on-policy first-visit Monte Carlo control to learn the optimal policy.

    Parameters:
        env: The environment object (assumed to follow OpenAI Gym's structure).
        gamma (float): Discount factor for future rewards (0 <= gamma < 1).
        n_episode (int): The number of episodes to run for training.

    Returns:
        Q (defaultdict): The learned Q-table with state-action values.
        policy (dict): The learned policy derived from the Q-table.
    """
    n_action = env.action_space.n  # Number of possible actions
    G_sum = defaultdict(float)    # Cumulative sum of returns for each state-action pair
    N = defaultdict(int)          # Count of visits to each state-action pair
    Q = defaultdict(lambda: torch.empty(env.action_space.n))  # Q-table initialized with empty tensors

    pbar = tqdm(range(n_episode), desc = "Episode")
    for episode in pbar:

        # Generate an episode using the current Q-table
        states, actions, rewards = run_episode(env, Q, n_action)

        G = {}  # Dictionary to store the return G for each state-action pair
        return_t = 0  # Initialize the return for the episode

        # Iterate over the episode in reverse order (to calculate returns)
        for state_t, action_t, reward_t in zip(states[::-1], actions[::-1], rewards[::-1]):
            return_t = reward_t + gamma * return_t  # Calculate the discounted return
            G[state_t, action_t] = return_t         # Store the return for the state-action pair

        # Update the Q-table for each state-action pair in the episode
        for state_action, return_t in G.items():
            state, action = state_action
            # Update only for states with valid indices
            if state[0] <= 21:
                G_sum[state_action] += return_t        # Update cumulative return
                N[state_action] += 1                  # Increment visit count
                Q[state][action] = G_sum[state_action] / N[state_action]  # Update Q-value

        pbar.set_description(f"G: {sum(G.values())}")

    # Derive the policy from the Q-table by taking the action with the highest value in each state
    policy = {}
    for state, actions in Q.items():
        policy[state] = torch.argmax(actions).item()

    return Q, policy

In [56]:
env = create_env("Blackjack-v1")
gamma = 1
n_episode = 500
optimal_Q, optimal_policy = mc_on_policy(env, gamma, n_episode)

Episode:   0%|          | 0/500 [00:00<?, ?it/s]

In [59]:
# Initialize a dictionary to store the optimal state values
optimal_value = defaultdict(float)

# Iterate through the optimal Q-table to extract the maximum value for each state
for state, action_values in optimal_Q.items():
    """
    For each state in the optimal Q-table:
        - action_values: Tensor containing the Q-values for all possible actions in the state.
        - torch.max(action_values).item(): Finds the maximum Q-value (optimal value) for the state.
        - Store this maximum value in the `optimal_value` dictionary.
    """
    optimal_value[state] = torch.max(action_values).item()

# Print the dictionary containing the optimal state values
# print(optimal_value)
for i, (k, v) in enumerate(optimal_value.items()):
    print(f"{k}: {v}")
    if i == 10:
        break

(13, 7, 0): 0.20000000298023224
(20, 6, 0): 0.4000000059604645
(11, 10, 0): 0.20000000298023224
(9, 10, 0): -0.5
(21, 8, 1): 1.0
(5, 7, 0): 0.0
(20, 5, 0): 1.0
(20, 10, 0): 0.25
(14, 10, 0): -0.27272728085517883
(17, 10, 0): -0.3333333432674408
(15, 10, 0): -0.5


In [60]:
def simulate_episode(env, policy):
    """
    Simulates an episode in the given environment following a specific policy.

    Parameters:
        env: The environment object (assumed to follow OpenAI Gym's structure).
        policy (dict): A dictionary mapping states to actions. If a state is not in the policy,
                       a random action is chosen.

    Returns:
        reward (float): The reward received at the end of the episode.
    """
    # Reset the environment to the initial state
    state, _ = env.reset()

    # Flag to track if the episode is finished
    done = False

    # Simulate the episode
    while not done:
        # Check if the state exists in the policy
        if state in policy:
            # Follow the policy to select the action
            action = policy[state]
        else:
            # Choose a random action if the state is not in the policy
            action = torch.randint(2, [1]).item()

        # Perform the action in the environment and observe the new state and reward
        state, reward, done, info = env.step(action)

        # If the episode ends, return the final reward
        if done:
            return reward

In [61]:
win = 0
pbar = tqdm(range(100))
for i in pbar:
    R = simulate_episode(env, optimal_policy)
    if R > 0:
        win += int(R)
    pbar.set_description(f"{win} times win which means {win:.2f} % winning chanse")

  0%|          | 0/100 [00:00<?, ?it/s]

### **Developing MC control with epsilon-greedy policy**

In MC control with **epsilon-greedy** policy, we no longer exploit the best action all the time, but choose an action randomly under certain probabilities. As the name implies, the algorithm has two folds:

Epsilon: given a parameter, $ε$, with a value from 0 to 1, each action is taken with a probability calculated as follows:

$$
\large \pi(s, a) = \frac{ε}{|A|}
$$
- Here, |A| is the number of possible actions.

Greedy: the action with the highest state-action value is favored, and its probability of being chosen is increased by $1-ε$:

$$
\large \pi(s, a) = 1 - ε + \frac{ε}{|A|}
$$

Epsilon-greedy policy exploits the best action most of the time and also keeps exploring different actions from time to time.

<br>

<center><img width="600" src="https://www.baeldung.com/wp-content/ql-cache/quicklatex.com-5b10393cf0c6395ae5fb22260220c574_l3.svg">

In [62]:
def take_action(state, Q, epsilon, n_action):
    """
    Selects an action for a given state using an epsilon-greedy policy.

    Parameters:
        state (int): The current state for which an action is to be chosen.
        Q (dict): A dictionary where Q[state][action] represents the estimated value of taking
                  'action' in 'state'.
        epsilon (float): The probability of choosing a random action for exploration (0 <= epsilon <= 1).
        n_action (int): The total number of possible actions.

    Returns:
        action (int): The chosen action.
    """

    # Generate a random number to decide between exploration or exploitation
    if np.random.random() < epsilon:
        # Exploration: Choose a random action with uniform probability
        return torch.randint(0, n_action, (1,)).item()
    else:
        # Exploitation: Choose the action with the highest Q-value for the current state
        return torch.argmax(Q[state]).item()


def take_action2(state, Q, epsilon, n_action):
    """
    Selects an action for a given state using an epsilon-greedy policy.

    Parameters:
        state (int): The current state for which an action is to be chosen.
        Q (dict): A dictionary where Q[state][action] represents the estimated value of taking
                  'action' in 'state'.
        epsilon (float): The probability of choosing a random action for exploration (0 <= epsilon <= 1).
        n_action (int): The total number of possible actions.

    Returns:
        action (int): The chosen action.
    """
    # Initialize a probability distribution for all actions, with each action having an equal probability
    # of being selected (epsilon / n_action).
    probs = torch.ones(n_action) * epsilon / n_action

    # Find the action with the highest Q-value for the current state (exploitation).
    best_action = np.argmax(Q[state])

    # Increase the probability of selecting the best action by the remaining probability (1 - epsilon).
    probs[best_action] += (1.0 - epsilon)

    # Sample an action based on the computed probability distribution.
    action = torch.multinomial(probs, 1).item()

    return action

In [63]:
def run_episode(env, Q, epsilon, n_action):
    """
    Simulates an episode in the environment using an epsilon-greedy policy for action selection.

    Parameters:
        env: The environment object (e.g., OpenAI Gym environment).
        Q (dict): A dictionary where Q[state][action] represents the estimated value of taking
                  'action' in 'state'.
        epsilon (float): The probability of choosing a random action for exploration (0 <= epsilon <= 1).
        n_action (int): The total number of possible actions.

    Returns:
        states (list): A list of states visited during the episode.
        actions (list): A list of actions taken during the episode.
        rewards (list): A list of rewards received during the episode.
    """

    # Initialize the environment and get the starting state
    state, _ = env.reset()

    # Initialize lists to track states, actions, and rewards
    rewards = []  # Rewards received during the episode
    actions = []  # Actions taken during the episode
    states = []   # States visited during the episode

    # Flag to indicate whether the episode is done
    is_done = False

    # Loop until the episode ends
    while not is_done:
        # Select an action using the epsilon-greedy policy
        action = take_action2(state, Q, epsilon, n_action)

        # Record the chosen action and the current state
        actions.append(action)
        states.append(state)

        # Perform the action in the environment and observe the outcome
        state, reward, is_done, info = env.step(action)

        # Record the reward received for the action
        rewards.append(reward)

        # Check if the episode has ended
        if is_done:
            break

        # If the episode continues, select the next action based on the current Q-values
        action = torch.argmax(Q[state]).item()

    # Return the recorded states, actions, and rewards
    return states, actions, rewards

In [64]:
def mc_epsilon_greedy(env, gamma, n_episode, epsilon):
    """
    Monte Carlo Control using the epsilon-greedy method to estimate the optimal policy.

    Parameters:
        env: The environment object (e.g., OpenAI Gym environment).
        gamma (float): Discount factor (0 <= gamma <= 1), determines the importance of future rewards.
        n_episode (int): Number of episodes to run for learning.
        epsilon (float): Exploration probability for the epsilon-greedy policy (0 <= epsilon <= 1).

    Returns:
        Q (defaultdict): A dictionary mapping state-action pairs to their estimated Q-values.
        policy (dict): The learned policy, mapping states to optimal actions.
    """

    # Number of possible actions in the environment
    n_action = env.action_space.n

    # Initialize accumulators for state-action returns
    G_sum = defaultdict(float)  # Sum of returns for each state-action pair
    N = defaultdict(int)        # Count of visits to each state-action pair

    # Initialize Q-values for all state-action pairs
    Q = defaultdict(lambda: torch.empty(env.action_space.n))

    # Loop through episodes
    pbar = tqdm(range(n_episode), desc = "Episode")
    for episode in pbar:

        # Generate an episode using epsilon-greedy policy
        states, actions, rewards = run_episode(env, Q, epsilon, n_action)

        # Dictionary to store returns for state-action pairs in this episode
        G = {}
        return_t = 0  # Initialize cumulative return

        # Loop through the episode in reverse to calculate returns
        for state_t, action_t, reward_t in zip(states[::-1], actions[::-1], rewards[::-1]):
            # Calculate the cumulative return for the state-action pair
            return_t = reward_t + gamma * return_t
            G[state_t, action_t] = return_t

        # Update Q-values based on the observed returns
        for state_action, return_t in G.items():
            state, action = state_action

            # Ensure only valid states are updated (e.g., some states may be terminal)
            if state[0] <= 21:  # Example condition specific to the problem
                # Update cumulative return and visit count
                G_sum[state_action] += return_t
                N[state_action] += 1

                # Update Q-value for the state-action pair using the average return
                Q[state][action] = G_sum[state_action] / N[state_action]

        pbar.set_description(f"G: {sum(G.values())}")

    # Derive the policy from the Q-values
    policy = {}
    for state, actions in Q.items():
        # Choose the action with the highest Q-value for each state
        policy[state] = torch.argmax(actions).item()

    # Return the learned Q-values and policy
    return Q, policy

In [67]:
env = create_env("Blackjack-v1")
gamma = 1
epsilon = 0.1
n_episode = 500
optimal_Q, optimal_policy = mc_epsilon_greedy(env, gamma, n_episode, epsilon)

Episode:   0%|          | 0/500 [00:00<?, ?it/s]

In [70]:
def simulate_episode(env, policy):
     state, _ = env.reset()
     done = False
     while not done:
        if state in policy:
            action = policy[state]
        else:
            action = torch.randint(2, [1]).item()
        state, reward, done, info = env.step(action)
        if done:
            return reward

win = 0
pbar = tqdm(range(100))
for i in pbar:
    R = simulate_episode(env, optimal_policy)
    if R > 0:
        win += int(R)
    pbar.set_description(f"{win} times win which means {win:.2f} % winning chanse")

  0%|          | 0/100 [00:00<?, ?it/s]

## **Performing off-policy Monte Carlo control**

The off-policy method optimizes the **target policy** ($\pi$), using data generated by another policy, called the **behavior policy** ($b$). The target policy performs **exploitation** all the time while the behavior policy is for **exploration** purposes. This means that the target policy is greedy with respect to its current Q-function, and the behavior policy generates behavior so that the target policy has data to learn from.

We start with the latest step whose action taken under the behavior policy is different from the action taken under the greedy policy. And to learn about the target policy with another policy, we use a technique called **importance sampling**, which is commonly used to estimate the expected value under a distribution, given samples generated from a different distribution. The weighted importance for a state-action pair is calculated as follows:

$$
\omega_t= \sum_{k=t}\frac{\pi(a_k|s_k)}{b{(a_k|s_k)}}$$

> Here, $π(a_k | s_k)$ is the probability of taking action $a_k$ in state $s_k$ under the target policy; <br> $b(a_k | s_k)$ is the probability under the behavior policy; <br> the weight, $w_t$, is the multiplication of ratios between those two probabilities from step $t$ to the end of the episode. The weight, $w_t$, is applied to the return at step $t$.

In [71]:
def creat_random_policy(n_action):
    """
    Creates a random policy where each action has an equal probability of being chosen.

    Parameters:
        n_action (int): The number of possible actions in the environment.

    Returns:
        policy_fn (function): A function that returns a random probability distribution over the actions.
    """
    # Generate equal probabilities for each action
    probs = torch.ones(n_action) / n_action

    def policy_fn(observation):
        """
        This is the policy function that, given an observation (state), returns the action probabilities.

        Parameters:
            observation: The current state of the environment (not used in this random policy, but included for consistency).

        Returns:
            probs (torch.Tensor): The probability distribution over the actions, which is uniform.
        """
        return probs

    return policy_fn

In [72]:
def run_episode(env, random_policy):
    """
    Runs a single episode of interaction with the environment using a random policy.

    Parameters:
        env: The environment object (e.g., OpenAI Gym environment).
        random_policy: The random policy function that generates action probabilities for each state.

    Returns:
        states (list): List of states encountered during the episode.
        actions (list): List of actions taken during the episode.
        rewards (list): List of rewards received during the episode.
    """
    # Reset the environment to get the initial state
    state, _ = env.reset()

    # Lists to store states, actions, and rewards
    rewards = []
    actions = []
    states = []

    # Initialize the done flag to control the termination of the episode
    is_done = False

    # Run the episode until termination (done)
    while not is_done:
        # Get the action probabilities from the random policy
        probs = random_policy(state)

        # Sample an action from the probability distribution
        action = torch.multinomial(probs, 1).item()

        # Append the action and state to the lists
        actions.append(action)
        states.append(state)

        # Take a step in the environment using the action
        state, reward, is_done, info = env.step(action)

        # Append the reward to the rewards list
        rewards.append(reward)

        # Break the loop if the episode is done
        if is_done:
            break

    # Return the states, actions, and rewards collected during the episode
    return states, actions, rewards

In [73]:
def mc_off_policy(env, gamma, n_episode, epsilon, behavior_policy):
    """
    Performs Monte Carlo off-policy control using the importance sampling technique.

    Parameters:
        env (gym.Env): The environment.
        gamma (float): The discount factor.
        n_episode (int): The number of episodes to run.
        epsilon (float): The epsilon value for epsilon-greedy behavior policy.
        behavior_policy (function): The behavior policy used to generate episodes.

    Returns:
        Q (dict): The action-value function learned through off-policy Monte Carlo control.
        policy (dict): The optimal policy derived from Q.
    """

    # Number of possible actions in the environment
    n_action = env.action_space.n

    # Initialize dictionaries to accumulate the returns and counts
    G_sum = defaultdict(float)  # Sum of returns for state-action pairs
    N = defaultdict(int)        # Count of visits to state-action pairs
    Q = defaultdict(lambda: torch.empty(env.action_space.n))  # Action-value function (Q)

    # Loop through episodes
    pbar = tqdm(range(n_episode), desc="Episode")
    for episode in pbar:
        # Initialize importance sampling weight and dictionary to store weights
        w = 1
        W = {}

        # Generate the episode using the behavior policy
        states, actions, rewards = run_episode(env, behavior_policy)
        G = {}  # Dictionary to store return for each state-action pair
        return_t = 0  # The return (sum of discounted rewards) at each step

        # Iterate over the episode in reverse order (bootstrapping)
        for state_t, action_t, reward_t in zip(states[::-1], actions[::-1], rewards[::-1]):
            return_t = reward_t + gamma * return_t  # Update the return
            G[state_t, action_t] = return_t        # Store the return for the state-action pair
            W[state_t, action_t] = w              # Store the weight for the state-action pair

            # If the action taken is not the greedy action, break the loop (off-policy condition)
            if action_t != torch.argmax(Q[state_t]).item():
                break

            # Update the importance sampling weight (only if the behavior policy action was taken)
            w *= 1. / behavior_policy(state_t)[action_t]

        # Update Q-values based on the weighted returns
        for state_action, return_t in G.items():
            state, action = state_action
            if state[0] <= 21:  # This condition restricts to certain states (perhaps for environment constraints)
                G_sum[state_action] += return_t * W[state_action]  # Weighted sum of returns
                N[state_action] += 1  # Count the visits
                Q[state][action] = G_sum[state_action] / N[state_action]  # Update Q-value with average return

        pbar.set_description(f"G: {sum(G.values())}")

    # Derive the policy from the action-value function Q
    policy = {}
    for state, actions in Q.items():
        policy[state] = torch.argmax(actions).item()  # Choose the action with the highest Q-value for each state

    return Q, policy  # Return the learned action-value function and the optimal policy

In [74]:
env = create_env("Blackjack-v1")
gamma = 1
epsilon = 0.1
n_episode = 500
random_policy = creat_random_policy(env.action_space.n)
optimal_Q, optimal_policy = mc_off_policy(env, gamma, n_episode, epsilon, random_policy)

Episode:   0%|          | 0/500 [00:00<?, ?it/s]

In [75]:
def simulate_episode(env, policy):
     state, _ = env.reset()
     done = False
     while not done:
        if state in policy:
            action = policy[state]
        else:
            action = torch.randint(2, [1]).item()
        state, reward, done, info = env.step(action)
        if done:
            return reward

win = 0
pbar = tqdm(range(100))
for i in pbar:
    R = simulate_episode(env, optimal_policy)
    if R > 0:
        win += int(R)
    pbar.set_description(f"{win} times win which means {win:.2f} % winning chanse")

  0%|          | 0/100 [00:00<?, ?it/s]