# Homework 3


### Question 1
For the $\epsilon$-greedy algorithm, you need to set the parameter $\epsilon$, a constant giving the probability of selecting a random action at each step. In the $\epsilon$-decreasing strategy, you set an initial value of $\epsilon$ together with a decay function that determines how $\epsilon$ shrinks over time; the decay can be linear, exponential, or inverse.
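
The decay options named above can be sketched as follows. This is only an illustration: the function name `epsilon_schedule`, the `horizon` parameter, and the default values are assumptions made for this example and are not part of the homework code.

```python
import numpy as np

def epsilon_schedule(eps0, t, kind="exponential", rate=0.99, horizon=250):
    """Return the exploration probability at round t (t = 0, 1, 2, ...)."""
    if kind == "constant":
        # Plain epsilon-greedy: no decay at all.
        return eps0
    if kind == "linear":
        # Falls in a straight line from eps0 to 0 over `horizon` rounds.
        return max(0.0, eps0 * (1 - t / horizon))
    if kind == "exponential":
        # Multiplied by `rate` once per round.
        return eps0 * rate ** t
    if kind == "inverse":
        # Shrinks in proportion to 1 / (1 + t).
        return eps0 / (1 + t)
    raise ValueError(f"unknown schedule: {kind}")

for kind in ("constant", "linear", "exponential", "inverse"):
    values = [epsilon_schedule(0.1, t, kind) for t in (0, 10, 100)]
    print(kind, np.round(values, 4))
```

The exponential case corresponds to multiplying $\epsilon$ by a fixed decay rate after every round, which is the form used in Question 2.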

For the Thompson sampling algorithm, you do not need to set $\epsilon$ or a decay function, but you do need to choose the parameters of the prior distribution. For binary rewards this is typically a beta distribution with two parameters, $\alpha$ and $\beta$, which encode the prior belief about each arm's success probability and are updated as evidence accumulates.
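
As a small illustration of how the prior parameters are updated, the sketch below starts from a uniform Beta(1, 1) prior and applies a handful of made-up observations. The three arms and the `observations` list are hypothetical and chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform Beta(1, 1) prior for each of three hypothetical arms.
alpha = np.ones(3)
beta = np.ones(3)

# Made-up (arm, reward) pairs with binary rewards.
observations = [(0, 1), (0, 0), (1, 1), (1, 1), (2, 0)]
for arm, reward in observations:
    alpha[arm] += reward       # a success raises alpha
    beta[arm] += 1 - reward    # a failure raises beta

# Posterior mean of each arm's success probability.
posterior_mean = alpha / (alpha + beta)

# Thompson sampling draws one value per arm and pulls the largest.
draws = rng.beta(alpha, beta)
chosen_arm = int(np.argmax(draws))
print(posterior_mean, chosen_arm)
```

Larger values of $\alpha + \beta$ correspond to a more concentrated belief, so a stronger prior takes more observations to move.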

### Question 2

In [75]:
import pandas as pd
import numpy as np
import random
from collections import Counter
random.seed(0)  # seeds only Python's random module; the np.random calls below stay unseeded

In [24]:
training = pd.read_csv('Training.csv', header=None)
training

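| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 1 |
| 4 | 1 | 0 | 0 |
| ... | ... | ... | ... |
| 245 | 1 | 1 | 1 |
| 246 | 0 | 1 | 1 |
| 247 | 1 | 1 | 0 |
| 248 | 0 | 0 | 0 |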


In [25]:
training = training.rename(columns={0:'Type_1', 1:'Type_2', 2:'Type_3'})
training

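| | Type_1 | Type_2 | Type_3 |
|---|---|---|---|
| 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 1 |
| 4 | 1 | 0 | 0 |
| ... | ... | ... | ... |
| 245 | 1 | 1 | 1 |
| 246 | 0 | 1 | 1 |
| 247 | 1 | 1 | 0 |
| 248 | 0 | 0 | 0 |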


In [26]:
n = training.shape[0]
arms = training.shape[1]

In [27]:
eps_list = np.arange(0.01, 0.61, 0.01).round(2).tolist()
training_array = training.values  # shape (n, arms): one row per round, one column per arm

In [36]:
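def eps_greedy(eps, revenue, n, arms):
    counts = np.zeros(arms)              # number of pulls per arm
    cumulative_rewards = np.zeros(arms)  # total reward per arm
    rewards = np.zeros(n)                # reward received in each round
    greedy_record_pulls_list = []        # arm pulled in each round

    for t in range(n):
        # Explore with probability eps; the first pull is always random.
        if np.random.rand() <= eps or t == 0:
            next_arm = random.randint(0, arms - 1)
        else:
            # Exploit the best empirical mean. Arms not yet pulled give 0/0 = nan,
            # which nanargmax skips; errstate silences the RuntimeWarning.
            with np.errstate(divide='ignore', invalid='ignore'):
                next_arm = np.nanargmax(cumulative_rewards / counts)
        counts[next_arm] += 1
        reward = revenue[t, next_arm]
        cumulative_rewards[next_arm] += reward
        rewards[t] = reward
        greedy_record_pulls_list.append(next_arm)

    return cumulative_rewards.sum(), rewards, greedy_record_pulls_list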

In [37]:
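def eps_decreasing(eps, decay_rate, revenue, n, arms):
    counts = np.zeros(arms)       # number of pulls per arm
    cum_rewards = np.zeros(arms)  # total reward per arm
    rewards = np.zeros(n)         # reward received in each round
    record_pulls_list = []        # arm pulled in each round

    for t in range(n):
        # Explore with the current probability eps; the first pull is always random.
        if np.random.rand() <= eps or t == 0:
            next_arm = random.randint(0, arms - 1)
        else:
            # Exploit the best empirical mean. Arms not yet pulled give 0/0 = nan,
            # which nanargmax skips; errstate silences the RuntimeWarning.
            with np.errstate(divide='ignore', invalid='ignore'):
                next_arm = np.nanargmax(cum_rewards / counts)
        counts[next_arm] += 1
        reward = revenue[t, next_arm]
        cum_rewards[next_arm] += reward
        rewards[t] = reward
        record_pulls_list.append(next_arm)
        eps *= decay_rate  # exponential decay, applied once per round

    return cum_rewards.sum(), rewards, record_pulls_list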
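
Because `eps` is multiplied by `decay_rate` after every round, the exploration probability at round $t$ is $\epsilon_0 \cdot \text{decay\_rate}^t$. The sketch below shows how quickly that shrinks. The helper names `epsilon_at` and `expected_random_pulls` and the horizon of 250 rounds are assumptions made for illustration, and the forced random first pull is ignored.

```python
def epsilon_at(eps0, decay_rate, t):
    """Exploration probability in round t under per-round multiplicative decay."""
    return eps0 * decay_rate ** t

def expected_random_pulls(eps0, decay_rate, n):
    """Expected number of exploratory pulls over n rounds (sum of a geometric series)."""
    return eps0 * (1 - decay_rate ** n) / (1 - decay_rate)

for decay_rate in (0.99, 0.9):
    print(
        decay_rate,
        epsilon_at(0.12, decay_rate, 50),
        expected_random_pulls(0.12, decay_rate, 250),
    )
```

With a decay rate of 0.9 the expected number of exploratory pulls stays close to one regardless of the horizon, so the run behaves almost like a purely greedy policy after its first few rounds. That may be why the 0.9 column in the results repeats the same few totals.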

In [46]:
def thompson(revenue, n):
    arms = revenue.shape[1]  # take the number of arms from the data, not from a global
    alpha = np.ones(arms)    # Beta(1, 1) prior: 1 + successes per arm
    beta = np.ones(arms)     # 1 + failures per arm
    cumulative_rewards = np.zeros(arms)
    rewards = np.zeros(n)
    thompson_record_pulls = np.zeros((n, arms))  # one-hot record of the arm pulled each round
    thompson_record_pulls_list = []

    for t in range(n):
        # Draw one sample per arm from its posterior and pull the arm with the largest draw.
        sampled_parameters = np.random.beta(alpha, beta)
        next_arm = np.argmax(sampled_parameters)

        reward = revenue[t, next_arm]
        rewards[t] = reward
        thompson_record_pulls[t, next_arm] = 1
        thompson_record_pulls_list.append(next_arm)

        # Posterior update for a binary reward.
        if reward == 1:
            alpha[next_arm] += 1
        else:
            beta[next_arm] += 1

        cumulative_rewards[next_arm] += reward

    return cumulative_rewards.sum(), rewards, thompson_record_pulls, thompson_record_pulls_list

In [40]:
training_array = training.values if isinstance(training, pd.DataFrame) else training

eps_greedy_rewards = []
eps_decreasing_rewards1 = []
eps_decreasing_rewards2 = []

for eps in eps_list:
    result_greedy = eps_greedy(eps, training_array, n, arms)
    result_decreasing1 = eps_decreasing(eps, 0.99, training_array, n, arms)
    result_decreasing2 = eps_decreasing(eps, 0.9, training_array, n, arms)
    eps_greedy_rewards.append(result_greedy[0])
    eps_decreasing_rewards1.append(result_decreasing1[0])
    eps_decreasing_rewards2.append(result_decreasing2[0])

# Store results in DataFrame
eps_greedy_training = pd.DataFrame({
    'epsilon': eps_list,
    'rewards': eps_greedy_rewards
})
eps_decreasing_training = pd.DataFrame({
    'epsilon': eps_list,
    'decay_rate1': 0.99,
    'eps_decreasing_rewards1': eps_decreasing_rewards1,
    'decay_rate2': 0.9,
    'eps_decreasing_rewards2': eps_decreasing_rewards2
})


In [56]:
eps_greedy_training

| | epsilon | rewards |
|---|---|---|
| 0 | 0.01 | 135.0 |
| 1 | 0.02 | 86.0 |
| 2 | 0.03 | 163.0 |
| 3 | 0.04 | 143.0 |
| 4 | 0.05 | 155.0 |
| 5 | 0.06 | 137.0 |
| 6 | 0.07 | 158.0 |
| 7 | 0.08 | 143.0 |
| 8 | 0.09 | 155.0 |
| 9 | 0.1 | 142.0 |


In [54]:
eps_decreasing_training

| | epsilon | decay_rate1 | eps_decreasing_rewards1 | decay_rate2 | eps_decreasing_rewards2 |
|---|---|---|---|---|---|
| 0 | 0.01 | 0.99 | 161.0 | 0.9 | 135.0 |
| 1 | 0.02 | 0.99 | 85.0 | 0.9 | 135.0 |
| 2 | 0.03 | 0.99 | 85.0 | 0.9 | 135.0 |
| 3 | 0.04 | 0.99 | 143.0 | 0.9 | 135.0 |
| 4 | 0.05 | 0.99 | 159.0 | 0.9 | 85.0 |
| 5 | 0.06 | 0.99 | 144.0 | 0.9 | 161.0 |
| 6 | 0.07 | 0.99 | 160.0 | 0.9 | 85.0 |
| 7 | 0.08 | 0.99 | 149.0 | 0.9 | 85.0 |
| 8 | 0.09 | 0.99 | 84.0 | 0.9 | 85.0 |
| 9 | 0.1 | 0.99 | 119.0 | 0.9 | 161.0 |
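
Single runs are noisy: in the $\epsilon$-greedy results, neighbouring values of $\epsilon$ give totals as different as 86.0 and 163.0. One possible way to get a steadier comparison is to average each setting over several runs. The sketch below does this on synthetic data. The function `eps_greedy_total`, the made-up success rates, and the choice of 30 repetitions are all assumptions for illustration, not the homework's method.

```python
import numpy as np

def eps_greedy_total(eps, revenue, rng):
    """Total reward of one epsilon-greedy run on an (n_rounds, n_arms) 0/1 reward matrix."""
    n, arms = revenue.shape
    counts = np.zeros(arms)
    sums = np.zeros(arms)
    total = 0.0
    for t in range(n):
        if t == 0 or rng.random() <= eps:
            arm = int(rng.integers(arms))       # explore: uniform random arm
        else:
            means = np.full(arms, -np.inf)      # unpulled arms are never chosen greedily
            pulled = counts > 0
            means[pulled] = sums[pulled] / counts[pulled]
            arm = int(np.argmax(means))         # exploit: best empirical mean
        counts[arm] += 1
        sums[arm] += revenue[t, arm]
        total += revenue[t, arm]
    return total

rng = np.random.default_rng(0)
# Synthetic stand-in for the training data: 250 rounds, 3 arms, made-up success rates.
revenue = (rng.random((250, 3)) < np.array([0.4, 0.6, 0.5])).astype(int)

# Average total reward over 30 independent runs per epsilon.
mean_reward = {
    eps: float(np.mean([eps_greedy_total(eps, revenue, rng) for _ in range(30)]))
    for eps in (0.01, 0.05, 0.1)
}
print(mean_reward)
```

Selecting $\epsilon$ on the averaged totals makes the choice less dependent on the luck of one random sequence.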


In [44]:
training1 = eps_decreasing_training[['epsilon', 'decay_rate1', 'eps_decreasing_rewards1']].rename(columns={'decay_rate1': 'decay_rate', 'eps_decreasing_rewards1': 'rewards'})
training2 = eps_decreasing_training[['epsilon', 'decay_rate2', 'eps_decreasing_rewards2']].rename(columns={'decay_rate2': 'decay_rate', 'eps_decreasing_rewards2': 'rewards'})

eps_decreasing_training_final = pd.concat([training1, training2]).reset_index(drop=True)
eps_decreasing_training_final

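| | epsilon | decay_rate | rewards |
|---|---|---|---|
| 0 | 0.01 | 0.99 | 161.0 |
| 1 | 0.02 | 0.99 | 85.0 |
| 2 | 0.03 | 0.99 | 85.0 |
| 3 | 0.04 | 0.99 | 143.0 |
| 4 | 0.05 | 0.99 | 159.0 |
| ... | ... | ... | ... |
| 115 | 0.56 | 0.90 | 156.0 |
| 116 | 0.57 | 0.90 | 161.0 |
| 117 | 0.58 | 0.90 | 85.0 |
| 118 | 0.59 | 0.90 | 158.0 |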


In [48]:
def print_max_rewards(df, method_name):
    max_cumulative_reward = df['rewards'].max()
    max_epsilon = df.loc[df['rewards'].idxmax(), 'epsilon']
    print(f'Highest total reward using {method_name}: {max_cumulative_reward}')
    print(f'Epsilon value: {max_epsilon}')

In [49]:
print_max_rewards(eps_greedy_training, "Epsilon-Greedy")

Highest total reward using Epsilon-Greedy: 163.0
Epsilon value: 0.03


In [50]:
print_max_rewards(eps_decreasing_training_final, "Epsilon-Decreasing")

Highest total reward using Epsilon-Decreasing: 162.0
Epsilon value: 0.12
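
`print_max_rewards` reports the best $\epsilon$ but not the decay rate of the winning row, which matters for the $\epsilon$-decreasing table because it mixes two decay rates. A minimal sketch of retrieving the whole row is shown below. The `results` table holds made-up numbers with the same column names as `eps_decreasing_training_final` and is not the homework data.

```python
import pandas as pd

# Made-up results table with the same columns as eps_decreasing_training_final.
results = pd.DataFrame({
    "epsilon": [0.1, 0.2, 0.2],
    "decay_rate": [0.99, 0.99, 0.90],
    "rewards": [10.0, 12.0, 15.0],
})

# idxmax gives the index label of the best row; loc then returns every column of it.
best_row = results.loc[results["rewards"].idxmax()]
print(best_row.to_dict())
```

Applied to the real table, the same lookup would show which decay rate accompanies the best $\epsilon$.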


In [47]:
max_cumulative_reward = thompson(training_array, n)[0]

print(f'Highest total reward using Thompson Sampling: {max_cumulative_reward}')

Highest total reward using Thompson Sampling: 154.0


### Question 3

In [51]:
test = pd.read_csv('Test1.csv', header=None)
test_arr = test.values
n = test.shape[0]
arms = test.shape[1]

In [71]:
reward_eps_greedy = eps_greedy(0.03, test_arr, n, arms)[0]
reward_eps_decreasing = eps_decreasing(0.12, 0.99, test_arr, n, arms)[0]
reward_thompson = thompson(test_arr, n)[0]
total_reward = reward_eps_greedy + reward_eps_decreasing + reward_thompson

print(f'Total reward using Epsilon-Greedy: {reward_eps_greedy}')
print(f'Total reward using Epsilon-Decreasing: {reward_eps_decreasing}')
print(f'Total reward using Thompson Sampling: {reward_thompson}')
print(f'Total reward: {total_reward}')

Total reward using Epsilon-Greedy: 54.0
Total reward using Epsilon-Decreasing: 41.0
Total reward using Thompson Sampling: 48.0
Total reward: 143.0



### Question 4

In [72]:
test2 = pd.read_csv('Test2.csv', header=None)
test2_arr = test2.values
n = test2.shape[0]
arms = test2.shape[1]

In [74]:
reward_eps_greedy = eps_greedy(0.03, test2_arr, n, arms)[0]
reward_eps_decreasing = eps_decreasing(0.12, 0.99, test2_arr, n, arms)[0]
reward_thompson = thompson(test2_arr, n)[0]
total_reward = reward_eps_greedy + reward_eps_decreasing + reward_thompson

print(f'Total reward using Epsilon-Greedy: {reward_eps_greedy}')
print(f'Total reward using Epsilon-Decreasing: {reward_eps_decreasing}')
print(f'Total reward using Thompson Sampling: {reward_thompson}')
print(f'Total reward: {total_reward}')

Total reward using Epsilon-Greedy: 92.0
Total reward using Epsilon-Decreasing: 101.0
Total reward using Thompson Sampling: 94.0
Total reward: 287.0



### Analysis of Results
- **Test1**: Total rewards are lower for all three strategies than on Test2. Part of this gap is mechanical: the pull counts in Question 5 sum to 100 rounds for Test1 and 200 for Test2, so the totals are not directly comparable. Per round, the rewards are 0.54, 0.41, and 0.48 ($\epsilon$-greedy, $\epsilon$-decreasing, Thompson sampling) on Test1 versus 0.46, 0.505, and 0.47 on Test2. Because this dataset is homogeneous and closely matches the training set, the tuned strategies meet no patterns beyond those in the training data, which may limit how much additional reward they can find.

- **Test2**: This dataset shows higher total rewards, although it also contains twice as many rounds as Test1. The new demographic (older adults) could introduce patterns that were not present in the training set. If those patterns happen to favor the arms reached through random exploration, or through a slower decay in $\epsilon$, the strategies can earn higher rewards. Each strategy might benefit as follows:
  - **$\epsilon$-Greedy**: Given a consistent exploration rate, this strategy might occasionally select arms that unexpectedly perform well with the older demographic.
  - **$\epsilon$-Decreasing**: This strategy reduces exploration over time. If early trials with the older demographic are successful, the strategy might quickly exploit these successes, leading to improved cumulative rewards.
  - **Thompson Sampling**: Known for its ability to balance exploration and exploitation based on observed successes and failures, this strategy might adapt more dynamically to the differing responses between the two age groups.

### Explanation of Differences
- The increased diversity in Test2 may let the strategies explore more effectively or exploit patterns that were not visible in the homogeneous Test1. Together with the larger number of rounds, this could explain why every strategy earned a higher total on the second test set.
- The older demographic might have more pronounced or distinct responses to certain treatments, which could be better captured by the adaptive exploration and exploitation dynamics of the multi-armed bandit algorithms.
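
Since the two test sets differ in length, a per-round comparison is a useful check on the totals. The sketch below uses the totals printed in Questions 3 and 4 and takes the number of rounds from the pull counts in Question 5, which sum to 100 and 200. The dictionary names are chosen for this example only.

```python
# Total rewards reported in Questions 3 and 4.
totals = {
    "Test1": {"eps_greedy": 54.0, "eps_decreasing": 41.0, "thompson": 48.0},
    "Test2": {"eps_greedy": 92.0, "eps_decreasing": 101.0, "thompson": 94.0},
}
# Rounds per dataset, taken from the pull counts in Question 5.
rounds = {"Test1": 100, "Test2": 200}

# Average reward per round for each strategy on each dataset.
per_round = {
    name: {alg: total / rounds[name] for alg, total in algs.items()}
    for name, algs in totals.items()
}
for name, algs in per_round.items():
    print(name, algs)
```

On this scale only $\epsilon$-decreasing improves from Test1 to Test2, so the higher totals on Test2 largely reflect its greater length.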

### Question 5

In [95]:
n = test.shape[0]
arms = test.shape[1]

pull_list_greedy = eps_greedy(0.13, test_arr, n, arms)[2]  # note: epsilon is 0.13 here; Questions 3 and 4 use 0.03
pull_list_decreasing = eps_decreasing(0.12, 0.99, test_arr, n, arms)[2]
pull_list_thompson = thompson(test_arr, n)[3]

print(f'Number of arms pulled using Epsilon-Greedy: {Counter(pull_list_greedy)}')
print(f'Number of arms pulled using Epsilon-Decreasing: {Counter(pull_list_decreasing)}')
print(f'Number of arms pulled using Thompson Sampling: {Counter(pull_list_thompson)}')

n = test2.shape[0]
arms = test2.shape[1]

pull_list_greedy = eps_greedy(0.13, test2_arr, n, arms)[2]  # note: epsilon is 0.13 here; Questions 3 and 4 use 0.03
pull_list_decreasing = eps_decreasing(0.12, 0.99, test2_arr, n, arms)[2]
pull_list_thompson = thompson(test2_arr, n)[3]

print(f'Number of arms pulled using Epsilon-Greedy: {Counter(pull_list_greedy)}')
print(f'Number of arms pulled using Epsilon-Decreasing: {Counter(pull_list_decreasing)}')
print(f'Number of arms pulled using Thompson Sampling: {Counter(pull_list_thompson)}')

Number of arms pulled using Epsilon-Greedy: Counter({0: 85, 1: 10, 2: 5})
Number of arms pulled using Epsilon-Decreasing: Counter({2: 93, 0: 7})
Number of arms pulled using Thompson Sampling: Counter({2: 54, 1: 24, 0: 22})
Number of arms pulled using Epsilon-Greedy: Counter({2: 120, 1: 74, 0: 6})
Number of arms pulled using Epsilon-Decreasing: Counter({2: 193, 1: 4, 0: 3})
Number of arms pulled using Thompson Sampling: Counter({2: 115, 1: 69, 0: 16})



1. **For Test Dataset 1:**
   - **Epsilon-Greedy**: Mostly chose arm 0.
   - **Epsilon-Decreasing**: Predominantly chose arm 2.
   - **Thompson Sampling**: Chose arm 2 most frequently.

2. **For Test Dataset 2:**
   - **Epsilon-Greedy**: Predominantly chose arm 2.
   - **Epsilon-Decreasing**: Almost exclusively chose arm 2.
   - **Thompson Sampling**: Chose arm 2 most frequently.
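
Raw pull counts are easier to compare across datasets of different length when expressed as shares. The sketch below converts one of the printed counters into proportions. The helper `pull_shares` is an assumption for this example, and the counts are copied from the Thompson Sampling output on the first test dataset.

```python
from collections import Counter

def pull_shares(counter):
    """Convert pull counts per arm into the fraction of all pulls, ordered by arm index."""
    total = sum(counter.values())
    return {arm: count / total for arm, count in sorted(counter.items())}

# Counts copied from the Thompson Sampling output on the first test dataset.
thompson_test1 = Counter({2: 54, 1: 24, 0: 22})
shares = pull_shares(thompson_test1)
print(shares)
```

The same helper can be applied to each of the six counters to compare how concentrated each strategy's pulls are.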

### Recommendation
Based on these results:
- **Arm 2** (zero-indexed, so the third column of the data) appears to be the most promising drug prototype for further testing and preparation for market. It is the most frequently pulled arm for two of the three strategies on Test1 and for all three on Test2. This suggests that, on average, this prototype earns the highest reward in these trials.