# Reinforcement Learning Project
## Article: *Bridging the gap between regret minimization and best arm identification, with application to A/B tests*
### Students: Hadrien & Emilie SALEM

In this notebook, we implement some of the algorithms presented in the article and reproduce some of its experiments.


## Part 1: Framework and setup

In this part, we import the relevant functions and present the classes we developed for the bandit model.

In [1]:
import numpy as np
from framework import *

### Unit testing
#### Pulling arms

In [2]:
mean = 0
std = 1

test_arm = Arm(mean, std, gaussian_sampling)
results = test_arm.pull(times=10000)
print(f"Empirical mean = {results.mean()} (theoretical = {mean})")
print(f"Empirical std = {results.std()} (theoretical = {std})")

Empirical mean = -0.005589927474713255 (theoretical = 0)
Empirical std = 1.0123986293150093 (theoretical = 1)
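
As a rough sanity check on these figures: with 10000 independent draws of standard deviation 1, the standard error of the empirical mean is 1/sqrt(10000) = 0.01, so the observed deviation of about 0.0056 lies well within one standard error. The sketch below reproduces that arithmetic with plain NumPy. It does not use `framework`, and the helper name `standard_error` is ours, introduced only for illustration.

```python
import numpy as np


def standard_error(std, n):
    """Standard error of the mean of n i.i.d. draws with standard deviation std."""
    return std / np.sqrt(n)


# Hypothetical helper applied to the values of the cell above.
se = standard_error(std=1.0, n=10_000)
observed_mean = -0.005589927474713255  # empirical mean printed above
z = abs(observed_mean - 0.0) / se      # deviation in units of standard error

print(f"standard error = {se:.4f}, deviation = {z:.2f} standard errors")
```

A deviation below one or two standard errors is what we would expect if `pull` samples from the intended distribution.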


#### Creating an environment

In [3]:
test_arm_bis = Arm(mean, std, gaussian_sampling)
test_env = Environment([test_arm, test_arm_bis])

In [4]:
test_env.pull_arm(1)

0.609236320235054

In [5]:
test_env.reward_history
test_env.best_arm_history

[1]

In [6]:
test_env.reset_history()
test_env.reward_history

[[], []]
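
The cells above exercise `Arm.pull`, `Environment.pull_arm`, `reward_history` and `reset_history`, but the source of `framework` is not shown in this notebook. The following is a minimal sketch of classes that would be consistent with those calls, not the actual implementation: every name prefixed with `Sketch` or `sketch_` is hypothetical, the sampler signature is an assumption, and `best_arm_history` is left out because its semantics cannot be inferred from the output alone.

```python
import numpy as np

_rng = np.random.default_rng(0)  # fixed seed, for reproducibility of the sketch


def sketch_gaussian_sampling(mean, std, size):
    """Draw `size` samples from a Gaussian N(mean, std^2) (assumed signature)."""
    return _rng.normal(loc=mean, scale=std, size=size)


class SketchArm:
    """Hypothetical stand-in for `Arm`: a reward distribution we can sample."""

    def __init__(self, mean, std, sampler):
        self.mean = mean
        self.std = std
        self.sampler = sampler

    def pull(self, times=1):
        # Returns a NumPy array of `times` rewards.
        return self.sampler(self.mean, self.std, times)


class SketchEnvironment:
    """Hypothetical stand-in for `Environment`: a list of arms plus a history."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.reset_history()

    def reset_history(self):
        # One empty list of rewards per arm.
        self.reward_history = [[] for _ in self.arms]

    def pull_arm(self, index):
        reward = float(self.arms[index].pull(times=1)[0])
        self.reward_history[index].append(reward)
        return reward


env_sketch = SketchEnvironment(
    [SketchArm(0, 1, sketch_gaussian_sampling), SketchArm(0, 1, sketch_gaussian_sampling)]
)
reward = env_sketch.pull_arm(1)
history_after_pull = [len(h) for h in env_sketch.reward_history]
env_sketch.reset_history()
print(history_after_pull, env_sketch.reward_history)
```

Keeping one reward list per arm makes per-arm empirical means cheap to compute, which is what most bandit agents need.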

#### Test with an Explore-Then-Commit (ETC) agent

In [12]:
mean1 = 0
mean2 = 1
std = 1

n_steps = 1000
confidence = 0.01

arm1 = Arm(mean1, std, gaussian_sampling)
arm2 = Arm(mean2, std, gaussian_sampling)
env = Environment([arm1, arm2])

agent = ETC_Agent()
chosen_arms, rewards_obtained = agent.play(n_steps, confidence, env)
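
`ETC_Agent` comes from `framework`, whose source is not shown here. For readers unfamiliar with the idea, here is a minimal, self-contained sketch of an explore-then-commit strategy on two Gaussian arms. It is not the `ETC_Agent` used above, nor necessarily the stopping rule analysed in the article: the function name `etc_two_arms_sketch`, the alternating exploration, and the threshold `std * sqrt(4 * log(1 / confidence) / n)` are illustrative assumptions, and the threshold omits the union bound over time that a rigorous stopping rule would require.

```python
import numpy as np


def etc_two_arms_sketch(means, std, n_steps, confidence, seed=0):
    """Alternate between two arms, then commit to the empirically best one."""
    rng = np.random.default_rng(seed)
    sums = np.zeros(2)
    counts = np.zeros(2, dtype=int)
    chosen_arms, rewards = [], []
    committed = None  # index of the arm we committed to, once decided

    for t in range(n_steps):
        # Exploration phase: round-robin; commit phase: always the same arm.
        arm = committed if committed is not None else t % 2
        reward = rng.normal(means[arm], std)
        sums[arm] += reward
        counts[arm] += 1
        chosen_arms.append(arm)
        rewards.append(reward)

        # Test the (simplified, assumed) stopping rule after each full round.
        if committed is None and counts[0] == counts[1]:
            n = counts[0]
            gap = abs(sums[0] / n - sums[1] / n)
            width = std * np.sqrt(4.0 * np.log(1.0 / confidence) / n)
            if gap > width:
                committed = int(np.argmax(sums / counts))

    return np.array(chosen_arms), np.array(rewards)


chosen, rewards = etc_two_arms_sketch(
    means=[0.0, 1.0], std=1.0, n_steps=1000, confidence=0.01
)
print(f"fraction of pulls on arm 1: {chosen.mean():.3f}")
print(f"average reward: {rewards.mean():.3f}")
```

A smaller `confidence` widens the threshold, so the sketch explores longer before committing and is less likely to commit to the wrong arm.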