In [None]:
from collections import namedtuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

## Further reading on A/B testing

[Udacity A/B testing](https://www.udacity.com/course/ab-testing--ud257)

[A/B Testing at Scale Tutorial](https://exp-platform.com/2017abtestingtutorial/)

# Multiple Testing

Chapter 3 of [Practical Statistics for Data Scientists](https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/), Chapter 2 of [Sutton & Barto - Reinforcement Learning: An Introduction](http://incompleteideas.net/book/RLbook2018.pdf).

## Why is classical hypothesis testing not enough?

The hypothesis testing we looked at previously was of the form
- run an experiment, collecting a dataset of A & B
- perform a hypothesis test on any observed effects seen in the dataset

There are a few problems with this:
- we only compare two options (one of which is usually a default)
- we might lose money by showing customers a suboptimal option
- the hypothesis test only tells us whether an observed effect is unlikely under the null hypothesis, not whether the effect is large
- in a real business, there is no 'experiment over' date
- the real world is non-stationary - the results we collect might come from a distribution that is changing
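
The point about significance versus effect size can be illustrated with a small calculation. The sketch below is a two-sided z-test for a difference in means with a known standard deviation; the function name `two_sided_p_value` and the numbers (a difference of 0.1 on a standard deviation of 20) are assumptions chosen for illustration, not results from an experiment.

```python
import math


def two_sided_p_value(diff, sigma, n):
    """Two-sided z-test p-value for a difference in means between two
    groups of n samples each, assuming a known common standard deviation."""
    standard_error = sigma * math.sqrt(2 / n)
    z = diff / standard_error
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


# hypothetical numbers: the same tiny effect, two different sample sizes
tiny_effect = 0.1
sigma = 20

p_small_sample = two_sided_p_value(tiny_effect, sigma, n=1_000)
p_large_sample = two_sided_p_value(tiny_effect, sigma, n=1_000_000)

print(f"n = 1,000:     p = {p_small_sample:.3f}")
print(f"n = 1,000,000: p = {p_large_sample:.5f}")
```

With enough data the same negligible effect becomes statistically significant, so the p-value alone says nothing about whether the change is worth making.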

We want an experiment where we can take advantage of the results as we learn.

In a business context statistical significance is not the end goal
- we are concerned with optimizing user experience as quickly as possible

## The multi-armed bandit

Bandits allow
- testing multiple options at once
- reaching conclusions faster

The term bandit comes from slot machines
- known as one-armed bandits for their ability to extract money from gamblers

The goal of a multi-armed bandit problem is to win as much money as possible
- this is the same as figuring out which arm is best as quickly as possible

## The reinforcement learning approach

In supervised learning we use data to predict
- in a bandit problem we could predict the expected value of each arm
- this is prediction / evaluation

In reinforcement learning we also use data (the results we get from pulling an arm) to select an action
- this is control

We can define the value of an action (pulling a specific arm) as an expectation
- the expected reward of an action

$q_{*}(a) = \mathbb{E}[r(a)]$

Reinforcement learning has a convenient goal - maximizing expected reward
- if we knew the true expectation of each action, maximization would be an argmax
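
As a minimal sketch of both ideas: if the true expectations were known, the best action would be a simple argmax; in practice they have to be estimated, here by averaging observed rewards. The arm names, means and standard deviation below are hypothetical, and the example uses only the standard library so it runs independently of the notebook cells.

```python
import random
import statistics

rng = random.Random(0)

# hypothetical true expected reward q*(a) of each arm (unknown in practice)
true_values = {'A': 10.0, 'B': 25.0, 'C': 40.0}
reward_std = 5.0

# with the true expectations known, control is just an argmax
best_arm = max(true_values, key=true_values.get)

# without them, estimate q(a) as the sample mean of observed rewards
samples = {
    arm: [rng.gauss(mean, reward_std) for _ in range(2000)]
    for arm, mean in true_values.items()
}
estimates = {arm: statistics.mean(data) for arm, data in samples.items()}
estimated_best_arm = max(estimates, key=estimates.get)

print(best_arm, estimated_best_arm)
```

The estimate is only as good as the data behind it, which is why the number of pulls per arm matters in the sections that follow.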

The bandit problem is one step short of the full reinforcement learning problem
- the bandit has a single state, so no state transitions happen

Bandits share with reinforcement learning
- the trade-off between exploration and exploitation
- potentially non-stationary rewards

## Exploration versus exploitation

todo

## Example

Let's imagine you have the following results from comparing a number of different landing pages:

In [None]:
np.random.seed(42)

Param = namedtuple('Parameter', ['loc', 'scale', 'initial_size'])

params = {
    'A': Param(24, 20, 1),
    'B': Param(25, 20, 1),
    'C': Param(26, 20, 1),
    'D': Param(24, 10, 1),
    'E': Param(25, 10, 1)
}

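# replace the hand-written options above with a larger, generated set
start = 10
end = 50
num_options = 20

# means are evenly spaced between start and end; standard deviations are drawn
# uniformly between 1 and 10 (both bounds are spelled out, because
# np.random.uniform(10, ...) would set low=10 with the default high=1.0)
params = {
    str(option): Param(loc, scale, 1)
    for option, (loc, scale)
    in enumerate(zip(np.linspace(start, end, num_options), np.random.uniform(1, 10, size=num_options)))
}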

results = {
    arm: list(np.random.normal(*stats))
    for arm, stats in params.items()
}

def expectation(results):
    return {arm: np.mean(data) for arm, data in results.items()}

expectation(results)

One approach to the results above would be to conclude that one option is optimal and send all our users there.  This would be a **greedy** solution to the exploration & exploitation dilemma.

Another solution would be to favour the option that appears optimal, while still sampling from the options that appear sub-optimal.
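
The risk of the greedy solution can be shown with a deliberately contrived sketch. Arm `B` is better, but its single initial observation was unlucky; rewards are made deterministic (an assumption made purely to keep the example reproducible) so the greedy rule never has a reason to revisit `B`. All names and numbers are hypothetical.

```python
import statistics

# hypothetical true mean reward of each arm
true_means = {'A': 1.0, 'B': 2.0}

# one initial observation per arm: A was lucky, B was unlucky
observed = {'A': [1.2], 'B': [0.5]}

for _ in range(100):
    estimates = {arm: statistics.mean(data) for arm, data in observed.items()}
    # greedy: always pull the arm with the highest estimate
    arm = max(estimates, key=estimates.get)
    # deterministic reward, equal to the arm's true mean
    observed[arm].append(true_means[arm])

# number of pulls per arm, excluding the initial observation
pulls = {arm: len(data) - 1 for arm, data in observed.items()}
print(pulls)  # → {'A': 100, 'B': 0}
```

Because the estimate for `B` never updates, the greedy policy keeps the worse arm indefinitely; occasionally sampling arms that appear sub-optimal is what corrects this.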

## epsilon-greedy

A simple algorithm to tackle the exploration-exploitation dilemma is known as **epsilon-greedy** - it is the method used for exploration in DeepMind's 2013 DQN.

The algorithm has a single parameter $\epsilon$, which controls how greedy we are.  The basic algorithm is as follows:

In [None]:
def get_performance(results):
    d = []
    for arm, data in results.items():
        d.extend(data)
    return np.mean(d)

results = {
    arm: list(np.random.normal(*stats))
    for arm, stats in params.items()
}

eps = 0.3
choices = list(params.keys())

steps = 1000
values = np.zeros((steps, len(choices)))
actions = np.empty((steps)).astype(str)
eps_performance = np.zeros(steps)

for step in range(steps):
    # explore with probability eps, otherwise exploit the current best estimate
    prob = np.random.rand()
    if prob < eps:
        strat = 'random'
        action = np.random.choice(choices)

    else:
        strat = 'greedy'
        expectations = expectation(results)
        values[step, :] = list(expectations.values())
        action = max(expectations, key=expectations.get)

    actions[step] = action

    # pull the chosen arm and record the reward (a scalar draw, so float() is safe)
    p = params[action]
    results[action].append(float(np.random.normal(p.loc, p.scale)))
    # running mean reward over every pull so far
    eps_performance[step] = get_performance(results)

    
plt.plot(eps_performance, label='eps {}'.format(eps))
_ = plt.legend()

The value of $\epsilon$ sets the balance between two extremes
- $\epsilon$ = 1 -> every action is random, as in a standard A/B test
- $\epsilon$ = 0 -> every action is greedy
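
The two settings listed above can be checked with a small stand-alone sketch of the selection rule. The helper `select_action` and the estimates passed to it are hypothetical; it uses the standard library's `random` module rather than the NumPy calls in the cell above.

```python
import random
from collections import Counter


def select_action(estimates, eps, rng):
    """Epsilon-greedy: random arm with probability eps, else the best estimate."""
    if rng.random() < eps:
        return rng.choice(list(estimates))
    return max(estimates, key=estimates.get)


rng = random.Random(0)
estimates = {'A': 24.0, 'B': 25.0, 'C': 26.0}  # hypothetical value estimates

for eps in (0.0, 0.3, 1.0):
    counts = Counter(select_action(estimates, eps, rng) for _ in range(3000))
    print(eps, dict(counts))
```

With $k$ arms the greedy arm is chosen with probability $(1 - \epsilon) + \epsilon / k$, since a random draw can also land on it.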

In reinforcement learning $\epsilon$ is often decayed from 1 to a small value such as 0.05 over an agent's lifetime.  Proper selection of $\epsilon$ depends on
- how accurate your greedy estimate is
- how non-stationary the process is
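
One simple way to decay $\epsilon$ is a linear schedule, sketched below. The function name `linear_epsilon` and its defaults are assumptions for illustration; exponential or stepwise schedules are equally valid choices.

```python
def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linearly interpolate epsilon from eps_start to eps_end over total_steps,
    then hold it at eps_end."""
    fraction = min(step / total_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


# epsilon at a few points of a hypothetical 1000-step decay
for step in (0, 250, 500, 1000, 2000):
    print(step, round(linear_epsilon(step, total_steps=1000), 4))
```

In a non-stationary setting the final value would typically be kept above zero, so that the agent keeps sampling arms whose value may have changed.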

## Upper Confidence Bound (UCB)

Section 2.7 of [Sutton & Barto - Reinforcement Learning: An Introduction](http://incompleteideas.net/book/RLbook2018.pdf)

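Select an action based on its historical mean plus an exploration bonus

$a_t = \underset{a}{\text{argmax}} \left[ q(a) + c \cdot \sqrt{\frac{\ln t}{N(a)}} \right] $

$q(a)$ = estimated value (historical mean reward) of action $a$

$c$ = constant controlling the size of the exploration bonus

$t$ = timestep

$N(a)$ = number of times action $a$ taken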
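
A worked example of the formula, using hypothetical estimates and counts: arm `A` has the higher historical mean but has been pulled far more often, so its exploration bonus is small. The helper `ucb_scores` is an illustrative name, not part of the notebook.

```python
import math


def ucb_scores(means, counts, t, c):
    """Upper confidence bound per arm: mean + c * sqrt(ln(t) / N(a))."""
    return {
        arm: means[arm] + c * math.sqrt(math.log(t) / counts[arm])
        for arm in means
    }


# hypothetical state after 104 pulls in total
means = {'A': 25.0, 'B': 24.0}
counts = {'A': 100, 'B': 4}
t = sum(counts.values())

scores = ucb_scores(means, counts, t, c=5)
selected = max(scores, key=scores.get)

print({arm: round(score, 2) for arm, score in scores.items()}, selected)
```

Here the bonus is roughly 1.1 for `A` and 5.4 for `B`, so `B` is selected despite its lower mean; as `B` is pulled more often its bonus shrinks.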

In [None]:
c = 5

def ucb(results, step):
    return {
        arm: np.mean(data)+ c * np.sqrt(np.log(step)/len(data))
        for arm, data in results.items()
    }

def get_performance(results):
    d = []
    for arm, data in results.items():
        d.extend(data)
    return np.mean(d)

results = {
    arm: list(np.random.normal(*stats))
    for arm, stats in params.items()
}

steps = 1000
values = np.zeros((steps, len(choices)))
actions = np.empty((steps)).astype(str)
ucb_performance = np.zeros(steps)

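for step in range(steps):
    # use the current timestep (1-indexed so ln(t) is defined) in the exploration bonus
    ucbs = ucb(results, step + 1)

    # pick the arm with the highest upper confidence bound
    action = max(ucbs, key=ucbs.get)
    actions[step] = action
    values[step, :] = list(ucbs.values())

    # pull the chosen arm and record the reward (a scalar draw, so float() is safe)
    p = params[action]
    results[action].append(float(np.random.normal(p.loc, p.scale)))
    ucb_performance[step] = get_performance(results)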

plt.plot(eps_performance, label='eps') 
plt.plot(ucb_performance, label='ucb')
_ = plt.legend()