# Estimating Action Values Through Sampling

In this notebook, we'll explore the concept of estimating action values through sampling in the context of Reinforcement Learning. We'll delve into what it is, its importance, drawbacks, and real-world applications. We'll also provide exercises along with their solutions for a comprehensive understanding.

## Table of Contents
1. [Introduction](#Introduction)
2. [Importance](#Importance)
3. [Drawbacks](#Drawbacks)
4. [Real-world Applications](#Real-world-Applications)
5. [Exercises](#Exercises)
6. [Exercise Solutions](#Exercise-Solutions)

## Introduction

Estimating action values through sampling is a fundamental concept in Reinforcement Learning (RL). In RL, an agent interacts with an environment to achieve a goal. The agent takes actions, and the environment responds by providing rewards and new states. The agent's objective is to find a policy—a mapping from states to actions—that maximizes the expected sum of rewards.

Action values, also known as Q-values, represent the expected return (sum of rewards) of taking a particular action from a given state and then following a specific policy. Estimating these Q-values accurately is crucial for the agent to make informed decisions.

Sampling is one of the methods to estimate these action values. In this method, the agent takes an action multiple times and averages the observed rewards to estimate the action value.
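In symbols, and assuming the bandit-style setting used in this notebook (a single state, so the value depends only on the action), the sample-average estimate after $n$ pulls of action $a$ can be sketched as below. The notation $Q_n(a)$, $q_*(a)$, $R_i(a)$ and $\sigma$ (the reward standard deviation) is an assumption borrowed from common RL textbook usage; it is not defined elsewhere in this notebook.

```latex
Q_n(a) = \frac{1}{n} \sum_{i=1}^{n} R_i(a),
\qquad
\mathbb{E}\left[ Q_n(a) \right] = q_*(a),
\qquad
\operatorname{SE}\left[ Q_n(a) \right] = \frac{\sigma}{\sqrt{n}}
```

The second and third expressions hold when the rewards for an action are independent draws from a fixed distribution with mean $q_*(a)$ and standard deviation $\sigma$.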

Let's consider a simple example to understand this better.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Define the true action values
true_action_values = [1.2, 0.8, 1.5, 1.3, 0.9]

# Simulate pulling one arm: the reward is drawn from Normal(true value, std=0.1)
def pull_arm(action):
    return np.random.normal(true_action_values[action], 0.1)

# Estimate each arm's value as the mean reward over n_samples pulls
def estimate_action_values(n_samples=1000):
    n_actions = len(true_action_values)
    estimated_values = [0.0] * n_actions
    for action in range(n_actions):
        samples = [pull_arm(action) for _ in range(n_samples)]
        estimated_values[action] = np.mean(samples)
    return estimated_values

# Estimate action values
estimated_values = estimate_action_values()

# Plotting
plt.figure(figsize=(10, 6))
plt.bar(range(5), true_action_values, alpha=0.6, label='True Action Values')
plt.bar(range(5), estimated_values, alpha=0.6, label='Estimated Action Values')
plt.xlabel('Actions')
plt.ylabel('Action Value')
plt.legend()
plt.title('True vs Estimated Action Values')
plt.show()

estimated_values

## Code Explanation

In the code above, we simulated a 5-armed bandit problem where each arm has a different true action value. We then estimated these action values through sampling.

### Key Components:
1. **True Action Values**: These are the true means of the reward distributions for each arm. We set them as `[1.2, 0.8, 1.5, 1.3, 0.9]`.
2. **`pull_arm(action)` Function**: This function simulates pulling an arm and returns a reward sampled from a normal distribution centered on the true action value of the pulled arm, with a standard deviation of 0.1.
3. **`estimate_action_values(n_samples)` Function**: This function estimates the action values by pulling each arm `n_samples` times and averaging the rewards.
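The same estimate can also be computed without Python loops. The following is a minimal sketch, not the notebook's own method: the function name `estimate_action_values_vectorized`, the `seed` parameter, and the use of NumPy's `default_rng` generator are assumptions added for illustration, and the true values and noise level are copied from the cell above.

```python
import numpy as np

# Values copied from the notebook's first cell (assumed unchanged).
true_action_values = np.array([1.2, 0.8, 1.5, 1.3, 0.9])
reward_std = 0.1  # same noise level that pull_arm uses


def estimate_action_values_vectorized(n_samples=1000, seed=0):
    """Draw an (n_samples, n_arms) reward matrix and average each column."""
    rng = np.random.default_rng(seed)  # seeded only so the sketch is repeatable
    rewards = rng.normal(
        loc=true_action_values,  # one mean per arm, broadcast across rows
        scale=reward_std,
        size=(n_samples, len(true_action_values)),
    )
    return rewards.mean(axis=0)  # one sample average per arm


estimates = estimate_action_values_vectorized()
print(np.round(estimates, 3))
```

Each column of `rewards` plays the role of the `samples` list for one arm, so the column means correspond to the values returned by `estimate_action_values`.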

### Results:
The bar chart compares the true action values with the estimated action values. As we can see, the estimated values are close to the true values, demonstrating the effectiveness of sampling as a method for estimating action values.

### Evaluation:
With 1000 samples per arm and a reward standard deviation of 0.1, each estimate has a standard error of about 0.003, which is why the two sets of bars are nearly indistinguishable. The accuracy depends on the number of samples: the standard error shrinks in proportion to one over the square root of the number of samples, so quadrupling the number of samples only halves the error, while the number of pulls (and the computation time) grows linearly.
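The link between sample count and accuracy can be checked empirically. The sketch below repeats the sampling experiment many times for a single arm and compares the spread of the resulting estimates with the theoretical standard error. The choice of arm (true value 1.5), the sample sizes, and the number of repeats are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value, reward_std = 1.5, 0.1  # one arm of the notebook's bandit
n_repeats = 2000  # independent repetitions of the whole experiment

empirical_se = {}
for n_samples in (10, 100, 1000):
    # Each row is one experiment consisting of n_samples pulls.
    rewards = rng.normal(true_value, reward_std, size=(n_repeats, n_samples))
    estimates = rewards.mean(axis=1)
    # Spread of the estimates across repetitions = empirical standard error.
    empirical_se[n_samples] = estimates.std(ddof=1)
    theory = reward_std / np.sqrt(n_samples)
    print(f"n={n_samples:5d}  empirical SE={empirical_se[n_samples]:.4f}  theory={theory:.4f}")
```

Each tenfold increase in samples should shrink the error by a factor of roughly 3.2 (the square root of 10), not by a factor of 10.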

## Importance

Estimating action values accurately is crucial for the success of any RL agent. Here's why:

1. **Informed Decision-Making**: Accurate estimates allow the agent to make decisions that are more likely to result in higher rewards.
2. **Efficiency**: With accurate estimates, the agent can quickly identify the best actions, reducing the need for excessive exploration.
3. **Adaptability**: In non-stationary environments where reward distributions can change, having a reliable estimation method helps the agent adapt more quickly.

### Real-world Analogy

Imagine you're trying to find the best coffee shop in town. You could visit each shop multiple times, sampling different types of coffee. By averaging your experiences (rewards), you can estimate the quality (action value) of each shop. This way, you can make an informed decision about which shop to frequent, similar to how an RL agent estimates action values to choose the best actions.

## Drawbacks

While estimating action values through sampling is effective, it has its limitations:

1. **Computational Cost**: Taking multiple samples for each action can be computationally expensive, especially in large action spaces.
2. **Non-Stationarity**: In environments where the reward distributions change over time, the agent needs to continuously update its estimates, which can be challenging.
3. **Initial Bias**: Early estimates rest on only a few samples. If those first rewards happen to be unrepresentative, the estimates can sit far from the true values and lead to suboptimal decisions until more data arrives.
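One common way to soften the cost of keeping estimates up to date is to maintain a running average instead of storing every sample. The sketch below is an illustration under assumptions (the helper name `incremental_mean` and the single arm with mean 1.5 are not from the notebook). It does not reduce the number of pulls needed, but it uses constant memory and refreshes the estimate after every reward.

```python
import numpy as np


def incremental_mean(rewards):
    """Running sample average: Q <- Q + (R - Q) / n, one reward at a time."""
    estimate, count = 0.0, 0
    for reward in rewards:
        count += 1
        estimate += (reward - estimate) / count
    return estimate


rng = np.random.default_rng(0)
rewards = rng.normal(1.5, 0.1, size=1000)  # assumed arm: mean 1.5, std 0.1

running = incremental_mean(rewards)
batch = rewards.mean()
print(running, batch)  # the two agree up to floating-point rounding
```

Because the first update uses a step of 1, the starting value of `estimate` has no effect on the result.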

### Real-world Analogy

Continuing with the coffee shop example, if a shop recently changed its coffee beans, your previous samples might not be representative of the current quality. Also, continuously sampling from all shops to keep your estimates updated would be time-consuming and impractical.

## Real-world Applications

Estimating action values through sampling has various real-world applications:

1. **Finance**: In algorithmic trading, agents can use sampling to estimate the expected returns of different trading strategies.
2. **Healthcare**: In personalized medicine, algorithms can sample different treatment options to estimate their effectiveness for individual patients.
3. **Robotics**: Robots can use sampling methods to estimate the success rates of different actions, such as picking up objects or navigating through a space.

### Real-world Analogy

In a warehouse, a robotic arm sorts packages onto different conveyor belts. By sampling, it can estimate which actions (e.g., speed and angle of movement) result in the most efficient sorting, thereby optimizing its performance.

## Exercises

To deepen your understanding, here are some exercises:

1. **Exercise 1**: Modify the code to estimate action values using a different number of samples (e.g., 500, 2000). Compare the results with the original estimates.
2. **Exercise 2**: Implement a method to update the action value estimates in a non-stationary environment. Simulate a changing environment and observe how well your method adapts.
3. **Exercise 3**: Add a confidence interval to the estimated action values in the bar chart. Use the standard error of the mean as a measure of uncertainty.

## Exercise Solutions

Below are the solutions to the exercises:

### Solution to Exercise 1

To modify the code for different numbers of samples, you can change the `n_samples` parameter in the `estimate_action_values` function. Here's how you can do it for 500 and 2000 samples.

In [None]:
# Estimate action values with 500 samples
estimated_values_500 = estimate_action_values(n_samples=500)

# Estimate action values with 2000 samples
estimated_values_2000 = estimate_action_values(n_samples=2000)

# Plotting
plt.figure(figsize=(15, 6))

plt.subplot(1, 3, 1)
plt.bar(range(5), estimated_values, alpha=0.6, label='Estimated with 1000 samples')
plt.xlabel('Actions')
plt.ylabel('Action Value')
plt.title('Estimated with 1000 samples')

plt.subplot(1, 3, 2)
plt.bar(range(5), estimated_values_500, alpha=0.6, label='Estimated with 500 samples', color='g')
plt.xlabel('Actions')
plt.title('Estimated with 500 samples')

plt.subplot(1, 3, 3)
plt.bar(range(5), estimated_values_2000, alpha=0.6, label='Estimated with 2000 samples', color='r')
plt.xlabel('Actions')
plt.title('Estimated with 2000 samples')

plt.tight_layout()
plt.show()

estimated_values_500, estimated_values_2000

### Evaluation of Exercise 1 Solution

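All three sets of estimates are close to the true action values. With a reward standard deviation of 0.1, the standard error of each estimate is about 0.0045 for 500 samples, 0.0032 for 1000 samples, and 0.0022 for 2000 samples. Differences this small are hard to see in the bar charts, so compare the printed values with the true action values instead: on average, the 2000-sample estimates will be the closest.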

However, it's important to note that increasing the number of samples also increases the computational cost. Therefore, there's a trade-off between accuracy and computational efficiency.

### Solution to Exercise 2

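To handle a non-stationary environment, we can use an exponential moving average (a constant step size `alpha`) to update the action value estimates. Each new reward moves the estimate a fixed fraction of the way towards itself, so recent rewards count more than old ones. This allows the agent to adapt to changes in the reward distribution.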
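The update used in this solution can be written out as follows. The notation is assumed here rather than defined elsewhere in the notebook: $Q_n$ is the estimate before the $n$-th reward $R_n$ arrives, $Q_1$ is the initial estimate, and $\alpha$ is the step size.

```latex
Q_{n+1} = Q_n + \alpha \left( R_n - Q_n \right)
        = (1 - \alpha)^{n} Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i
```

The weight on a reward decays geometrically with its age. With $\alpha = 0.1$, a reward seen 22 steps ago carries roughly a tenth of the weight of the newest one, which is why old rewards fade quickly after the environment changes.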

In [None]:
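# Estimate action values with an exponential moving average (constant step size alpha)
def estimate_action_values_moving_avg(alpha=0.1, n_rounds=1000, initial_values=None):
    # Start from earlier estimates when given, so the update has something to adapt from
    estimated_values = list(initial_values) if initial_values is not None else [0.0] * 5
    for action in range(5):
        for _ in range(n_rounds):
            reward = pull_arm(action)
            # Move the estimate a fraction alpha of the way towards the new reward
            estimated_values[action] = (1 - alpha) * estimated_values[action] + alpha * reward
    return estimated_values

# Simulate a non-stationary environment by changing the true action values.
# pull_arm reads this global, so every later cell also samples from the new values.
true_action_values = [1.5, 0.9, 1.2, 1.6, 1.0]

# Start from the estimates learned before the change and let them adapt
estimated_values_moving_avg = estimate_action_values_moving_avg(initial_values=estimated_values)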

# Plotting
plt.figure(figsize=(10, 6))
plt.bar(range(5), true_action_values, alpha=0.6, label='True Action Values (Changed)')
plt.bar(range(5), estimated_values_moving_avg, alpha=0.6, label='Estimated Action Values (Moving Avg)', color='r')
plt.xlabel('Actions')
plt.ylabel('Action Value')
plt.legend()
plt.title('True vs Estimated Action Values in Non-Stationary Environment')
plt.show()

estimated_values_moving_avg

## Real-world Applications

Estimating action values through sampling has various applications:

1. **Healthcare**: In personalized medicine, it helps in choosing the most effective treatment for individual patients.
2. **Finance**: In portfolio optimization, it is used to estimate the expected returns of different assets.
3. **E-commerce**: In recommender systems, it helps in recommending the most relevant products to users.
4. **Robotics**: In robotic arms, it assists in selecting the most efficient movements.

### Real-world Analogy

Think of it like a talent scout for a sports team. The scout watches multiple games (samples) to estimate the skill level (action value) of each player. Based on these estimates, the team can make informed decisions on which players to recruit.

## Additional Exercises

### Exercise 1: Implement Sampling for a 3-Armed Bandit

Implement a function that estimates the action values of a 3-armed bandit through sampling. The true action values are `[0.9, 0.8, 0.7]`. Run the function and plot the estimated values.

### Exercise 2: Vary the Number of Samples

Modify the function from Exercise 1 to accept the number of samples as an argument. Run the function with different numbers of samples and observe how the estimates change.

### Exercise 3: Non-Stationary Bandit

Implement a function that estimates the action values for a non-stationary 3-armed bandit. The true action values change over time. Run the function and plot the estimated values at different time steps.

### Evaluation of Exercise 2 Solution

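The bar chart from the moving-average solution shows the estimated action values in a non-stationary environment where the true action values have changed. The estimates are close to the new true action values, indicating that the moving average method is effective in adapting to changes in the environment. The price of this adaptability is noise: because the estimate effectively depends on only the most recent rewards, it fluctuates more than a plain average over all 1000 samples would in a stationary environment.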

This adaptability is crucial in real-world applications where conditions can change over time.
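To see the adaptation more directly, the sketch below changes the mean reward of a single arm halfway through a run and compares a plain sample average with the constant step-size update. The shift from 1.2 to 1.5, the run lengths, and the variable names are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
reward_std, alpha = 0.1, 0.1
n_before, n_after = 500, 500
mean_before, mean_after = 1.2, 1.5  # hypothetical shift for one arm

# Rewards come from the old distribution first, then from the new one.
rewards = np.concatenate([
    rng.normal(mean_before, reward_std, n_before),
    rng.normal(mean_after, reward_std, n_after),
])

sample_avg, const_step = 0.0, 0.0
for n, reward in enumerate(rewards, start=1):
    sample_avg += (reward - sample_avg) / n      # weights all rewards equally
    const_step += alpha * (reward - const_step)  # weights recent rewards more

print(f"true value now: {mean_after}")
print(f"sample average: {sample_avg:.3f}")
print(f"constant step : {const_step:.3f}")
```

The sample average ends up near the midpoint of the old and new means because it still counts every pre-change reward, while the constant step-size estimate tracks the new mean.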

### Solution to Exercise 3

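To add confidence intervals to the estimated action values, we can use the standard error of the mean: the sample standard deviation divided by the square root of the number of samples. An approximate 95% confidence interval is the estimate plus or minus 1.96 standard errors. The implementation is the cell that defines `estimate_action_values_with_error`.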
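As a small worked example of that calculation, the sketch below computes the standard error and an approximate 95% interval for one arm. The arm parameters (mean 1.5, standard deviation 0.1) and the normal-approximation multiplier 1.96 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(1.5, 0.1, size=1000)  # assumed arm: mean 1.5, std 0.1

mean = samples.mean()
# ddof=1 gives the sample standard deviation (divides by n - 1).
standard_error = samples.std(ddof=1) / np.sqrt(len(samples))
# Normal-approximation 95% interval: mean +/- 1.96 standard errors.
ci_low = mean - 1.96 * standard_error
ci_high = mean + 1.96 * standard_error

print(f"mean={mean:.4f}  SE={standard_error:.4f}  95% CI=({ci_low:.4f}, {ci_high:.4f})")
```

With 1000 samples the interval is only about 0.012 wide, so error bars drawn at this scale will look very short next to bars of height 1 or more.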

In [None]:
# Exercise 1: Implement Sampling for a 3-Armed Bandit

true_action_values_3arm = [0.9, 0.8, 0.7]

def pull_arm_3arm(action):
    return np.random.normal(true_action_values_3arm[action], 0.1)

def estimate_action_values_3arm(n_samples=1000):
    estimated_values = [0] * 3
    for action in range(3):
        samples = [pull_arm_3arm(action) for _ in range(n_samples)]
        estimated_values[action] = np.mean(samples)
    return estimated_values

# Estimate action values
estimated_values_3arm = estimate_action_values_3arm()

# Plotting
plt.figure(figsize=(10, 6))
plt.bar(range(3), true_action_values_3arm, alpha=0.6, label='True Action Values')
plt.bar(range(3), estimated_values_3arm, alpha=0.6, label='Estimated Action Values')
plt.xlabel('Actions')
plt.ylabel('Action Value')
plt.legend()
plt.title('True vs Estimated Action Values for 3-Armed Bandit')
plt.show()

estimated_values_3arm

In [None]:
# Function to estimate action values and standard errors through sampling
def estimate_action_values_with_error(n_samples=1000):
    estimated_values = [0] * 5
    standard_errors = [0] * 5
    for action in range(5):
        samples = [pull_arm(action) for _ in range(n_samples)]
        estimated_values[action] = np.mean(samples)
        # ddof=1 gives the sample standard deviation (divides by n - 1)
        standard_errors[action] = np.std(samples, ddof=1) / np.sqrt(n_samples)
    return estimated_values, standard_errors

# Estimate action values and standard errors
estimated_values, standard_errors = estimate_action_values_with_error()

# Approximate 95% confidence interval: estimate +/- 1.96 standard errors
ci_half_widths = 1.96 * np.array(standard_errors)

# Plotting with confidence intervals (the error bars are small at n_samples=1000)
plt.figure(figsize=(10, 6))
plt.bar(range(5), estimated_values, yerr=ci_half_widths, capsize=5, alpha=0.6, label='Estimated Action Values (95% CI)', color='b')
plt.xlabel('Actions')
plt.ylabel('Action Value')
plt.legend()
plt.title('Estimated Action Values with 95% Confidence Intervals')
plt.show()

estimated_values, standard_errors

In [None]:
# Exercise 2: Vary the Number of Samples

def estimate_action_values_vary_samples(n_samples=1000):
    estimated_values = [0] * 3
    for action in range(3):
        samples = [pull_arm_3arm(action) for _ in range(n_samples)]
        estimated_values[action] = np.mean(samples)
    return estimated_values

# Estimate action values with different number of samples
estimated_values_100 = estimate_action_values_vary_samples(100)
estimated_values_500 = estimate_action_values_vary_samples(500)
estimated_values_2000 = estimate_action_values_vary_samples(2000)

# Plotting
plt.figure(figsize=(10, 6))
plt.bar(range(3), true_action_values_3arm, alpha=0.6, label='True Action Values')
plt.bar(np.array(range(3))-0.2, estimated_values_100, alpha=0.6, width=0.2, label='Estimated with 100 samples')
plt.bar(np.array(range(3)), estimated_values_500, alpha=0.6, width=0.2, label='Estimated with 500 samples')
plt.bar(np.array(range(3))+0.2, estimated_values_2000, alpha=0.6, width=0.2, label='Estimated with 2000 samples')
plt.xlabel('Actions')
plt.ylabel('Action Value')
plt.legend()
plt.title('True vs Estimated Action Values with Varying Samples')
plt.show()

estimated_values_100, estimated_values_500, estimated_values_2000

In [None]:
# Exercise 3: Non-Stationary Bandit

def pull_arm_non_stationary(action, time_step):
    return np.random.normal(true_action_values_3arm[action] + 0.01 * time_step, 0.1)

def estimate_action_values_non_stationary(n_samples=1000, time_step=0):
    estimated_values = [0] * 3
    for action in range(3):
        samples = [pull_arm_non_stationary(action, time_step) for _ in range(n_samples)]
        estimated_values[action] = np.mean(samples)
    return estimated_values

# Estimate action values at different time steps
estimated_values_t0 = estimate_action_values_non_stationary(time_step=0)
estimated_values_t10 = estimate_action_values_non_stationary(time_step=10)
estimated_values_t20 = estimate_action_values_non_stationary(time_step=20)

# Plotting
plt.figure(figsize=(10, 6))
# Draw the three time steps side by side, all with the same bar width
plt.bar(np.arange(3) - 0.2, estimated_values_t0, alpha=0.6, width=0.2, label='Estimated at t=0')
plt.bar(np.arange(3), estimated_values_t10, alpha=0.6, width=0.2, label='Estimated at t=10')
plt.bar(np.arange(3) + 0.2, estimated_values_t20, alpha=0.6, width=0.2, label='Estimated at t=20')
plt.xlabel('Actions')
plt.ylabel('Action Value')
plt.legend()
plt.title('Estimated Action Values at Different Time Steps for Non-Stationary Bandit')
plt.show()

estimated_values_t0, estimated_values_t10, estimated_values_t20

## Additional Exercise Solutions

### Solution for Exercise 1

In this exercise, we implemented a function to estimate the action values of a 3-armed bandit with true action values `[0.9, 0.8, 0.7]`. The function pulls each arm multiple times and averages the rewards to estimate the action values. The plot shows that the estimated action values are close to the true values.

### Solution for Exercise 2

We modified the function to accept the number of samples as an argument. We then ran the function with different numbers of samples (100, 500, 2000) and plotted the estimates. On average, more samples lead to more accurate estimates, although in any single run a smaller sample can land closer to the true value by chance.

### Solution for Exercise 3

In this exercise, we dealt with a non-stationary 3-armed bandit where the true action values change over time: each arm's mean reward rises by 0.01 per time step, so by t=20 every true value is 0.2 higher than at t=0. We implemented a function that estimates the action values from fresh samples at different time steps (t=0, t=10, t=20). The plot shows how the estimates rise over time.