# Reinforcement Learning Project
## Article: *Bridging the gap between regret minimization and best arm identification, with application to A/B tests*
### Students: Hadrien & Emilie SALEM

In this notebook, we implement some of the algorithms presented in the article and reproduce some of its experiments.


## Framework and setup

In this part, we import the relevant functions and present the classes we developed for the bandit model.

In [1]:
import numpy as np
from framework import *

### Environment
#### Pulling arms

The `Arm` class is a convenience wrapper for drawing samples from a given distribution. Its usage is shown below.

In [2]:
mean = 0
std = 1

test_arm = Arm(mean, std, gaussian_sampling)
results = test_arm.pull(times=10000)
print(f"Empirical mean = {results.mean()} (theoretical = {mean})")    
print(f"Empirical std = {results.std()} (theoretical = {std})")    

Empirical mean = -0.004391464479489824 (theoretical = 0)
Empirical std = 1.0048306258279196 (theoretical = 1)
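
The `Arm` class itself lives in `framework` and is not reproduced in this notebook. As a rough illustration only, a class with the same call pattern could look like the sketch below. The names `SketchArm` and `sketch_gaussian_sampling`, the `seed` argument, and the sampler signature are all assumptions made for this example, not the actual `framework` code.

```python
import numpy as np


def sketch_gaussian_sampling(mean, std, size, rng):
    """Hypothetical sampler: draw `size` Gaussian rewards with the given mean and std."""
    return rng.normal(mean, std, size=size)


class SketchArm:
    """Hypothetical stand-in for `Arm`: a distribution we can draw rewards from."""

    def __init__(self, mean, std, sampler, seed=None):
        self.mean = mean
        self.std = std
        self.sampler = sampler
        # Each arm owns its random generator so results can be reproduced (assumption).
        self.rng = np.random.default_rng(seed)

    def pull(self, times=1):
        """Return a NumPy array holding `times` sampled rewards."""
        return self.sampler(self.mean, self.std, times, self.rng)
```

With such a class, `SketchArm(0, 1, sketch_gaussian_sampling).pull(times=10000)` would return an array whose empirical mean and standard deviation should be close to 0 and 1.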


#### Creating an environment

An `Environment` is an object defined by a list of arms. It lets a user pull an arm and exposes the history of observed rewards.

In [3]:
# Create an environment
test_arm_bis = Arm(mean, std, gaussian_sampling)
test_env = Environment([test_arm, test_arm_bis])

In [4]:
# Pull an arm
test_env.pull_arm(1)

-0.8314253716997045

In [5]:
# Display the reward history
test_env.reward_history

[[], [-0.8314253716997045]]

In [6]:
# Reset the reward history
test_env.reset_history()
test_env.reward_history

[[], []]
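
The behaviour shown in the cells above (one reward list per arm, appended to on each pull and emptied on reset) can be sketched as follows. This is a minimal illustration under assumptions: `SketchEnvironment` is a hypothetical name, and the arms are modelled here as zero-argument callables rather than `Arm` objects, so the real `Environment` in `framework` may differ.

```python
import numpy as np


class SketchEnvironment:
    """Hypothetical stand-in for `Environment`: a list of arms plus a reward history."""

    def __init__(self, samplers):
        # Assumption: each arm is a zero-argument callable returning one reward.
        self.samplers = list(samplers)
        self.reset_history()

    def pull_arm(self, index):
        """Draw one reward from arm `index`, record it, and return it."""
        reward = float(self.samplers[index]())
        self.reward_history[index].append(reward)
        return reward

    def reset_history(self):
        """Forget all observed rewards, keeping one empty list per arm."""
        self.reward_history = [[] for _ in self.samplers]


# Usage example with two standard Gaussian arms
rng = np.random.default_rng(0)
sketch_env = SketchEnvironment([lambda: rng.normal(0, 1), lambda: rng.normal(0, 1)])
first_reward = sketch_env.pull_arm(1)
```

After the single pull above, `sketch_env.reward_history` holds an empty list for arm 0 and a one-element list for arm 1, mirroring the output of the notebook cells.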

### Test with an Explore-Then-Commit (ETC) agent

We implement an ETC agent as described in the article (*cf*. page 2). We then compare the theoretical regret to the empirical regret.
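
For readers without the article at hand, the general shape of an Explore-Then-Commit strategy for two arms can be sketched as below. This is not the `ETC_Agent` from `framework`: the function name `etc_two_arms`, its signature, and in particular the stopping threshold $\sqrt{4\sigma^2\log(1/\delta)/n}$ are assumptions chosen for illustration, and the article's exact stopping rule may differ.

```python
import numpy as np


def etc_two_arms(means, std, n_steps, confidence, rng):
    """Explore-Then-Commit sketch for two Gaussian arms.

    Returns (committed_arm, decision_time, regret_at_decision).
    """
    if n_steps < 2:
        raise ValueError("n_steps must allow at least one pull of each arm")
    gap = abs(means[0] - means[1])
    sums = np.zeros(2)
    for n in range(1, n_steps // 2 + 1):
        # Exploration round: pull each arm once.
        for k in range(2):
            sums[k] += rng.normal(means[k], std)
        mu_hat = sums / n
        # Assumed (illustrative) threshold on the gap between empirical means.
        threshold = np.sqrt(4 * std**2 * np.log(1 / confidence) / n)
        if abs(mu_hat[0] - mu_hat[1]) > threshold:
            break
    committed_arm = int(np.argmax(mu_hat))
    decision_time = 2 * n
    # Expected regret of exploration: the worse arm was pulled n times.
    regret_at_decision = gap * n
    return committed_arm, decision_time, regret_at_decision
```

The agent alternates between the two arms until the empirical means are separated by more than the threshold (or the budget runs out), then commits to the arm with the larger empirical mean. Because the threshold here is an assumption, this sketch is not expected to reproduce the numbers reported below.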

In [7]:
# Parameters
mean1 = 0
mean2 = 5
std = 1

n_steps = 100
confidence = 0.02

# Create an environment with 2 arms
arm1 = Arm(mean1, std, gaussian_sampling)
arm2 = Arm(mean2, std, gaussian_sampling)
env = Environment([arm1, arm2])

# Initialize the agent
agent = ETC_Agent()

We average the regret over a large number of independent runs to reduce the variance of the empirical estimate.

In [8]:
# Initialize the experiment
n_experiments = 1000
regrets = []
decision_times = []

for _ in range(n_experiments):
    results = agent.play(n_steps, confidence, env)
    regrets.append(results.regret)
    decision_times.append(results.decision_time)
    
theoretical_regret = (8*env.std**2)/(mean2-mean1)*np.log(1/confidence)
print(f"Average regret at time of decision: {np.mean(regrets)}, Average decision time: {np.mean(decision_times)}.")
print(f"Theoretical regret bound: {theoretical_regret}")

Average regret at time of decision: 4.902281818652251, Average decision time: 2.088.
Theoretical regret bound: 6.259236808685034


According to the article, the regret at the time of decision is bounded by a quantity slightly larger than $\frac{8\sigma^2}{\Delta}\log(1/\delta)$. The average regret measured above (about 4.90) lies below this value (about 6.26), so the experiment is consistent with the bound.
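
As a sanity check on the printed value, the leading term can be evaluated by hand with the parameters of this experiment ($\sigma = 1$, $\Delta = 5 - 0 = 5$, $\delta = 0.02$), taking $\log$ to be the natural logarithm as `np.log` does:

$$
\frac{8\sigma^2}{\Delta}\log\frac{1}{\delta} = \frac{8 \times 1^2}{5}\,\log(50) \approx 1.6 \times 3.912 \approx 6.26
$$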

## The $UCB_{\alpha}$ algorithm

In this section, we implement and experiment with the $UCB_{\alpha}$ algorithm proposed in the article.