# Being Optimistic Under Uncertainties

## Introduction

When decisions must be made under uncertainty, assuming the best about untried options can be a strategic advantage. In Reinforcement Learning, this idea is commonly applied through a strategy known as **Optimistic Initialization**.

### What is it?

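Being optimistic under uncertainties involves initializing the estimated value of each action to an optimistically high number, ideally above any reward the agent is likely to observe. Every observed reward then looks disappointing next to the untried actions, so even a purely greedy agent is pushed to try each action at least once before settling into more exploitative behavior.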
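
A minimal sketch of this effect, assuming noise-free rewards that all lie below the initial estimate. The reward values and the helper name `greedy_visit_order` are illustrative assumptions, not part of the exercises. With optimistic starting values the greedy agent sweeps through every arm; with starting values of zero it locks onto the first arm it tries.

```python
import numpy as np

def greedy_visit_order(rewards, initial_value, n_rounds):
    """Return the sequence of arms pulled by a purely greedy agent.

    `rewards` holds one fixed, noise-free reward per arm (an illustrative
    simplification that keeps the behavior deterministic).
    """
    estimates = np.full(len(rewards), float(initial_value))
    counts = np.zeros(len(rewards), dtype=int)
    order = []
    for _ in range(n_rounds):
        arm = int(np.argmax(estimates))  # ties resolve to the lowest index
        counts[arm] += 1
        # Sample-average update; the first pull replaces the initial value
        estimates[arm] += (rewards[arm] - estimates[arm]) / counts[arm]
        order.append(arm)
    return order

optimistic = greedy_visit_order([0.2, 0.5, 0.8], initial_value=1.0, n_rounds=5)
pessimistic = greedy_visit_order([0.2, 0.5, 0.8], initial_value=0.0, n_rounds=5)
print("optimistic start:", optimistic)
print("zero start:      ", pessimistic)
```

With noisy rewards, as in the exercises below, the sweep still happens, but a lucky or unlucky first draw can mislead the agent afterwards.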

### Importance

1. **Encourages Exploration**: The agent is more likely to try all available actions.
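2. **Fast Early Learning**: In stationary problems, the initial sweep gives the agent an estimate for every action early on, which can help it identify the highest-reward action quickly.
3. **Simplicity**: Exploration is driven entirely by the initial values, so the agent can use a purely greedy selection rule. This drive is temporary, however, so the method is not inherently robust to non-stationary environments where the reward distribution changes over time.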

### Drawbacks

1. **Over-exploration**: The agent might waste time exploring obviously suboptimal actions.
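2. **Sensitivity to the Initial Value**: The initial value must be chosen relative to the scale of the rewards. If it is too low it is not actually optimistic; if it is far too high the agent spends many rounds unlearning it.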

### Real-World Applications

1. **Stock Market**: Traders often use optimistic strategies to explore new investment opportunities.
2. **Healthcare**: In drug discovery, an optimistic approach can lead to the exploration of new molecular structures.
3. **Marketing**: Marketers use it to test various strategies before committing to one.

In this notebook, we will go through exercises to understand this concept better.

In [None]:
# Importing required libraries
import numpy as np
import matplotlib.pyplot as plt

## Exercise 1: Implementing Optimistic Initialization

In this exercise, you will implement the optimistic initialization strategy in a simple 3-armed bandit problem. The true mean rewards for the arms are [0.2, 0.5, 0.8].

### Task

1. Initialize the estimated mean rewards for each arm to 1 (an optimistic value).
2. In each round, choose the arm with the highest estimated mean reward.
3. Pull the chosen arm and update its estimated mean based on the observed reward.

Compare the total rewards obtained with and without optimistic initialization (for example, by rerunning the function with `initial_value=0`).

In [None]:
# Draw one noisy reward from an arm: Gaussian with the given mean and unit standard deviation
def pull_arm(mean):
    return np.random.normal(mean, 1)

# Optimistic Initialization Algorithm (purely greedy action selection)
def optimistic_initialization(true_means, initial_value=1, n_rounds=100):
    n_arms = len(true_means)
    estimated_means = [initial_value] * n_arms  # optimistic starting estimates
    n_pulls = [0] * n_arms
    rewards = []
    for _ in range(n_rounds):
        # Greedy choice; ties go to the lowest-indexed arm
        best_arm = int(np.argmax(estimated_means))
        reward = pull_arm(true_means[best_arm])
        rewards.append(reward)
        n_pulls[best_arm] += 1
        # Sample-average update; on an arm's first pull this replaces the initial value with the reward
        estimated_means[best_arm] = ((n_pulls[best_arm] - 1) * estimated_means[best_arm] + reward) / n_pulls[best_arm]
    return np.sum(rewards), estimated_means

# True mean rewards for the arms
true_means = [0.2, 0.5, 0.8]

# Running the algorithm
total_reward, final_estimated_means = optimistic_initialization(true_means)
total_reward, final_estimated_means

## Explanation of Exercise 1: Implementing Optimistic Initialization

### Algorithm Steps:
1. **Initialize estimated means**: We start with an optimistic initial value of 1 for the estimated mean rewards of each arm.
2. **Choose the Best Arm**: In each round, the arm with the highest estimated mean reward is selected.
3. **Pull the Arm**: The chosen arm is pulled, and the reward is observed.
4. **Update Estimates**: The estimated mean reward of the pulled arm is updated based on the observed reward.
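
The update in step 4 is the incremental form of a sample average: after the n-th pull of an arm, the new estimate is `((n - 1) * old_estimate + reward) / n`. The sketch below uses made-up rewards and a hypothetical helper name, `incremental_mean`. It checks that folding rewards in one at a time reproduces the plain mean, and that the initial value stops mattering after the arm's first pull.

```python
import numpy as np

def incremental_mean(initial_value, observed):
    """Apply the sample-average update from the exercise to a list of rewards."""
    estimate, n = float(initial_value), 0
    for r in observed:
        n += 1
        estimate = ((n - 1) * estimate + r) / n
    return estimate

observed = [0.4, 1.3, -0.2, 0.9]  # illustrative rewards, not real output
from_optimistic = incremental_mean(1.0, observed)
from_zero = incremental_mean(0.0, observed)
print(from_optimistic, from_zero, np.mean(observed))
```

Because the first update multiplies the old estimate by zero, the optimistic value only influences which arms are tried first, not the estimates that follow.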

### Expected Output:
The output should show the total rewards obtained after 100 rounds and the final estimated mean rewards for each arm.

### Real-world Analogy:
Imagine you are at a buffet with three types of dishes. Being optimistic, you initially assume all are delicious. As you try each dish, you update your 'mental rating' for them. Eventually, you find the dish that satisfies your taste buds the most, much like how the algorithm finds the arm with the highest reward.

## Exercise 2: Comparing Optimistic Initialization with Epsilon-Greedy

In this exercise, you will compare the performance of the optimistic initialization strategy with the epsilon-greedy strategy.

### Task

1. Implement the epsilon-greedy strategy with an epsilon value of 0.1.
2. Run both the optimistic initialization and epsilon-greedy strategies for 100 rounds.
3. Plot the total reward obtained by each strategy.

Analyze which strategy performs better and why.

In [None]:
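# Epsilon-Greedy Algorithm
def epsilon_greedy(true_means, epsilon=0.1, n_rounds=100):
    n_arms = len(true_means)
    estimated_means = [0] * n_arms  # neutral (non-optimistic) starting estimates
    n_pulls = [0] * n_arms
    rewards = []
    for _ in range(n_rounds):
        if np.random.rand() < epsilon:
            # Explore: pick an arm uniformly at random
            arm = np.random.randint(0, n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate
            arm = int(np.argmax(estimated_means))
        reward = pull_arm(true_means[arm])
        rewards.append(reward)
        n_pulls[arm] += 1
        # Sample-average update of the pulled arm's estimate
        estimated_means[arm] = ((n_pulls[arm] - 1) * estimated_means[arm] + reward) / n_pulls[arm]
    return np.sum(rewards), estimated_means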

# Running both algorithms
total_reward_optimistic, _ = optimistic_initialization(true_means)
total_reward_epsilon, _ = epsilon_greedy(true_means)

# Plotting the results
plt.bar(['Optimistic Initialization', 'Epsilon-Greedy'], [total_reward_optimistic, total_reward_epsilon])
plt.ylabel('Total Rewards')
plt.title('Comparison of Optimistic Initialization and Epsilon-Greedy')
plt.show()

## Explanation of Exercise 2: Comparing Optimistic Initialization with Epsilon-Greedy

### Algorithm Steps:
1. **Epsilon-Greedy Algorithm**: It uses a small probability (epsilon) to explore random arms and a high probability (1-epsilon) to exploit the best-known arm.
2. **Optimistic Initialization**: It starts with an optimistic initial value for each arm and exploits the best-known arm.

### Expected Output:
The output should be a bar chart comparing the total rewards obtained by both strategies after 100 rounds.

### Real-world Analogy:
Imagine you're trying to decide between two investment strategies. One is more conservative, diversifying your portfolio (Epsilon-Greedy), while the other is more optimistic, putting more money into what seems to be the best option (Optimistic Initialization). Over time, you'd compare the returns from both to decide which strategy is more effective.

## Exercise 3: Optimistic Initialization in Non-Stationary Environments

In this exercise, you will explore the performance of optimistic initialization in a non-stationary environment, where the true mean rewards of the arms change over time.

### Task

1. Modify the `pull_arm` function to make the environment non-stationary. For example, add a small random value to the mean reward of each arm in each round, and let these changes accumulate so that the means drift over time.
2. Run the optimistic initialization strategy in this non-stationary environment for 100 rounds.
3. Report the total reward obtained and the final estimated mean reward of each arm.

Analyze how well the optimistic initialization strategy adapts to the changing environment.

In [None]:
# Modified pull_arm function for non-stationary environment.
# Each call moves every arm's mean by a small random step that persists (a random walk),
# then draws the chosen arm's reward from its current mean.
def pull_arm_non_stationary(means, arm, drift=0.1):
    means += np.random.normal(0, drift, size=len(means))  # in-place update of the NumPy array
    return np.random.normal(means[arm], 1)

# Optimistic Initialization Algorithm for Non-Stationary Environment
def optimistic_initialization_non_stationary(true_means, initial_value=1, n_rounds=100):
    n_arms = len(true_means)
    current_means = np.array(true_means, dtype=float)  # drifting copy; true_means is not modified
    estimated_means = [initial_value] * n_arms
    n_pulls = [0] * n_arms
    rewards = []
    for _ in range(n_rounds):
        best_arm = int(np.argmax(estimated_means))
        reward = pull_arm_non_stationary(current_means, best_arm)
        rewards.append(reward)
        n_pulls[best_arm] += 1
        estimated_means[best_arm] = ((n_pulls[best_arm] - 1) * estimated_means[best_arm] + reward) / n_pulls[best_arm]
    return np.sum(rewards), estimated_means

# Running the algorithm in a non-stationary environment
total_reward_non_stationary, final_estimated_means_non_stationary = optimistic_initialization_non_stationary(true_means)
total_reward_non_stationary, final_estimated_means_non_stationary

## Explanation of Exercise 3: Optimistic Initialization in Non-Stationary Environments

### Algorithm Steps:
1. **Modified Pull Arm Function**: The `pull_arm_non_stationary` function adds a small random value to the mean reward of each arm, making the environment non-stationary.
2. **Optimistic Initialization**: The algorithm starts with an optimistic initial value for each arm and exploits the best-known arm.
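
Whether a perturbed mean makes an environment non-stationary depends on whether the perturbation persists. As a hedged illustration (the helper names, the base mean of 0.8, and the 0.1 scale are assumptions chosen for the example), the sketch below contrasts two cases. A perturbation redrawn independently on every round leaves the reward distribution fixed, only with slightly more variance. In a random walk the steps accumulate, so the mean itself wanders.

```python
import numpy as np

def jittered_means(base_mean, scale, n_rounds, rng):
    # Independent perturbation each round: the mean always snaps back to base_mean
    return base_mean + rng.normal(0, scale, size=n_rounds)

def random_walk_means(base_mean, scale, n_rounds, rng):
    # Perturbations accumulate: each round starts from where the last one ended
    return base_mean + np.cumsum(rng.normal(0, scale, size=n_rounds))

rng = np.random.default_rng(0)
n_paths, n_rounds = 2000, 100
jitter_final = np.array([jittered_means(0.8, 0.1, n_rounds, rng)[-1] for _ in range(n_paths)])
walk_final = np.array([random_walk_means(0.8, 0.1, n_rounds, rng)[-1] for _ in range(n_paths)])

# Spread of the mean at the final round across many simulated paths
print("jitter spread:     ", jitter_final.std())
print("random-walk spread:", walk_final.std())
```

The spread of the jittered mean stays near the perturbation scale, while the random-walk spread grows with the square root of the number of rounds. Only the second case forces an agent to keep revising its estimates.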

### Expected Output:
The output should show the total rewards obtained after 100 rounds and the final estimated mean rewards for each arm in a non-stationary environment.

### Real-world Analogy:
Imagine you're a stock trader in a volatile market. The value of stocks (arms) changes frequently (non-stationary environment). Being optimistic, you initially assume high returns for each stock. As you trade, you update your expectations based on the actual returns, allowing you to adapt to the market's volatility.

## Explanation of Exercise 2: Comparing Optimistic Initialization with Epsilon-Greedy

### Algorithm Steps:
1. **Epsilon-Greedy**: In this strategy, with probability \(\epsilon\), we choose a random arm, and with probability \(1-\epsilon\), we choose the arm with the highest estimated mean.
2. **Optimistic Initialization**: Here, we start with an optimistic initial value for each arm and always choose the arm with the highest estimated mean.

### Expected Output:
The output should be a bar chart comparing the total rewards obtained by both strategies after 100 rounds.
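
A single 100-round run with unit-variance noise can favor either strategy by chance, so one bar chart is weak evidence. The sketch below is one way to average over many independent runs. The helper names, the 500 runs, and the fixed seed are assumptions made for illustration, and the bandit loop is re-implemented so the block runs on its own.

```python
import numpy as np

def run_bandit(true_means, initial_value, epsilon, n_rounds, rng):
    """One run of a greedy (epsilon=0) or epsilon-greedy agent; returns total reward."""
    k = len(true_means)
    estimates = np.full(k, float(initial_value))
    counts = np.zeros(k, dtype=int)
    total = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = int(rng.integers(k))        # explore uniformly at random
        else:
            arm = int(np.argmax(estimates))   # exploit the current best estimate
        reward = rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # sample average
        total += reward
    return total

def average_total(true_means, initial_value, epsilon, n_rounds=100, n_runs=500, seed=0):
    """Mean and standard deviation of the total reward over independent runs."""
    rng = np.random.default_rng(seed)
    totals = [run_bandit(true_means, initial_value, epsilon, n_rounds, rng)
              for _ in range(n_runs)]
    return float(np.mean(totals)), float(np.std(totals))

true_means = [0.2, 0.5, 0.8]
opt_mean, opt_std = average_total(true_means, initial_value=1.0, epsilon=0.0)
eps_mean, eps_std = average_total(true_means, initial_value=0.0, epsilon=0.1)
print(f"optimistic greedy: {opt_mean:.1f} +/- {opt_std:.1f}")
print(f"epsilon-greedy:    {eps_mean:.1f} +/- {eps_std:.1f}")
```

Reporting the spread alongside the mean shows how much of any single-run difference is down to noise.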

### Analysis:
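Optimistic Initialization front-loads its exploration: it typically gives up some reward in the earliest rounds while it tries every arm, and afterwards it acts purely greedily. Epsilon-Greedy keeps exploring at a constant rate, which costs a little reward in every round but lets it recover from misleading early estimates. With a reward noise standard deviation of 1 and only 100 rounds, the result of a single run can go either way.

### Real-world Analogy:
Imagine you are choosing between two investment strategies. One is optimistic: it assumes every option pays well, samples each of them early, and then commits to the apparent winner (Optimistic Initialization). The other keeps setting aside a small, fixed share of its budget for trying alternatives (Epsilon-Greedy). The first pays its exploration cost up front; the second keeps paying a little every round but can notice when its early impressions were wrong.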

## Exercise 3: Optimistic Initialization in Non-Stationary Environments

In this exercise, you will explore how optimistic initialization performs in non-stationary environments, where the true mean rewards for the arms can change over time.

### Task

1. Modify the `pull_arm` function to add a small random value to the mean reward of each arm in each round, letting the changes accumulate so that the means drift over time.
2. Run the optimistic initialization strategy for 200 rounds.
3. Plot the estimated mean rewards for each arm over time.

Analyze how well the strategy adapts to the changing environment.

In [None]:
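# Modified pull_arm function for non-stationary environment.
# Each call moves every arm's mean by a small random step that persists (a random walk),
# then draws the chosen arm's reward from its current mean.
def pull_arm_non_stationary(means, arm, drift=0.1):
    means += np.random.normal(0, drift, size=len(means))  # in-place update of the NumPy array
    return np.random.normal(means[arm], 1)

# Optimistic Initialization Algorithm for non-stationary environment
def optimistic_initialization_non_stationary(true_means, initial_value=1, n_rounds=200):
    n_arms = len(true_means)
    current_means = np.array(true_means, dtype=float)  # drifting copy; true_means is not modified
    estimated_means = [initial_value] * n_arms
    n_pulls = [0] * n_arms
    rewards = []
    estimated_means_over_time = [[] for _ in range(n_arms)]
    for _ in range(n_rounds):
        best_arm = int(np.argmax(estimated_means))
        reward = pull_arm_non_stationary(current_means, best_arm)
        rewards.append(reward)
        n_pulls[best_arm] += 1
        estimated_means[best_arm] = ((n_pulls[best_arm] - 1) * estimated_means[best_arm] + reward) / n_pulls[best_arm]
        # Record every arm's estimate after this round, including arms that were not pulled
        for i in range(n_arms):
            estimated_means_over_time[i].append(estimated_means[i])
    return estimated_means_over_time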

# Running the algorithm
estimated_means_over_time = optimistic_initialization_non_stationary(true_means)

# Plotting the estimated means over time
plt.figure(figsize=(10, 6))
for i in range(3):
    plt.plot(estimated_means_over_time[i], label=f'Arm {i+1}')
plt.xlabel('Rounds')
plt.ylabel('Estimated Mean Reward')
plt.title('Estimated Mean Rewards Over Time in Non-Stationary Environment')
plt.legend()
plt.show()

## Explanation of Exercise 3: Optimistic Initialization in Non-Stationary Environments

### Algorithm Steps:
1. **Modified Pull Arm**: In this version of the `pull_arm` function, a small random value is added to the mean reward of each arm in each round to simulate a non-stationary environment.
2. **Optimistic Initialization**: The algorithm is similar to the one in Exercise 1 but adapted for a non-stationary environment.

### Expected Output:
The output should be a line chart showing how the estimated mean rewards for each arm change over 200 rounds.

### Analysis:
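Optimistic Initialization only drives exploration during the first few pulls, so on its own it adapts poorly once the rewards start to drift. The sample-average update adds to the problem: its effective step size of 1/n shrinks with every pull, so the estimates react more and more slowly to change than those of algorithms designed for non-stationary problems.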
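
One common alternative for drifting rewards is a constant step size, which weights recent rewards more heavily than old ones. The sketch below is an illustration rather than part of the exercise. The step size `alpha=0.1`, the noise-free reward sequence, and the helper names are all assumptions.

```python
def sample_average(rewards, initial_value=1.0):
    """Estimate with the 1/n step size used in the exercises."""
    estimate = float(initial_value)
    for n, r in enumerate(rewards, start=1):
        estimate += (r - estimate) / n      # step size shrinks as n grows
    return estimate

def constant_step(rewards, initial_value=1.0, alpha=0.1):
    """Estimate with a fixed step size (exponential recency weighting)."""
    estimate = float(initial_value)
    for r in rewards:
        estimate += alpha * (r - estimate)  # old rewards decay geometrically
    return estimate

# Noise-free toy sequence: the arm pays 0.8 for 100 pulls, then drops to 0.2
rewards = [0.8] * 100 + [0.2] * 100
avg_est = sample_average(rewards)
const_est = constant_step(rewards)
print("sample average:", avg_est)
print("constant step: ", const_est)
```

The sample average ends up halfway between the old and new reward levels, while the constant-step estimate has essentially caught up with the new level. The trade-off is that a constant step size never fully averages out noise, and the initial value fades gradually instead of vanishing after the first pull.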

### Real-world Analogy:
Imagine you're a farmer who is optimistic about the weather. You plant crops based on this optimism. However, the weather is non-stationary; it changes. Your optimism might help you take risks and plant various crops, but you'll need to adapt your strategies as the seasons change.