In [None]:
from collections import namedtuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from answers import expectation, ucb, run_ucb_expt
from common import generate_bandit_dataset

%matplotlib inline

# Multi-Armed Bandits

## Resources

- Chapter 3 of [Practical Statistics for Data Scientists](https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/)
- Chapter 2 of [Sutton & Barto - Reinforcement Learning: An Introduction](http://incompleteideas.net/book/RLbook2018.pdf).

## When is classical hypothesis testing not enough?

In classical hypothesis testing we
- run an experiment, collecting data for two options A & B
- perform a hypothesis test on observed effects (differences between A and B)

There are a few problems with this
- we only compare two options (one of which is usually a default)
- we might lose money by showing customers a suboptimal option 
- the hypothesis test only tells us whether an observed effect is unlikely to be due to chance, not whether the effect is large
- in a real business, there is no 'experiment over' date
- the real world is non-stationary - the results we collect might be from a distribution that is changing
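
The gap between significance and effect size can be seen in a rough sketch. The data below is made up (two normally distributed groups whose means differ by 0.1 standard deviations), and `two_sample_z_test` is a hypothetical helper that uses a large-sample normal approximation rather than any particular library routine:

```python
import math
import random
from statistics import mean, variance


def two_sample_z_test(a, b):
    """Return (effect, p_value) for the difference in means, b minus a.

    Uses a normal approximation, which is reasonable for large samples.
    """
    effect = mean(b) - mean(a)
    std_err = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = effect / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return effect, p_value


rng = random.Random(0)
n = 20_000
# Hypothetical data: B beats A by only 0.1 standard deviations.
group_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
group_b = [rng.gauss(0.1, 1.0) for _ in range(n)]

effect, p_value = two_sample_z_test(group_a, group_b)
print(f"effect = {effect:.3f}, p-value = {p_value:.2e}")
```

With this many samples the p-value should be tiny even though the effect itself is small, which is why the p-value alone does not tell us whether a change is worth making.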

We want an experiment where we can **take advantage of the results as we learn**, rather than waiting until the experiment is over before the results can be used

In a business context we are not concerned with statistical significance - we are (often) concerned with making as much money as possible, as quickly as possible

## The multi-armed bandit

Bandits allow
- testing multiple options at once
- reaching conclusions faster

The term bandit comes from slot machines
- they are known as one-armed bandits for their ability to extract money from gamblers

The goal of a multi-armed bandit problem is to win as much money as possible
- this is the same as figuring out which arm is best as quickly as possible
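
To make the setup concrete, here is a minimal sketch of a bandit whose arms pay out normally distributed rewards. The class name `GaussianBandit`, the arm names and the parameter values are all hypothetical and are not taken from `common.generate_bandit_dataset`:

```python
import random


class GaussianBandit:
    """A toy bandit: each arm pays out a normally distributed reward."""

    def __init__(self, arm_params, seed=None):
        # arm_params maps an arm name to a (mean, standard deviation) pair.
        self.arm_params = dict(arm_params)
        self.rng = random.Random(seed)

    def pull(self, arm):
        """Sample one reward from the chosen arm."""
        loc, scale = self.arm_params[arm]
        return self.rng.gauss(loc, scale)

    def best_arm(self):
        """The arm with the highest true mean (unknown to the player)."""
        return max(self.arm_params, key=lambda arm: self.arm_params[arm][0])


bandit = GaussianBandit({"page-a": (1.0, 1.0), "page-b": (1.5, 1.0)}, seed=0)
rewards = [bandit.pull("page-b") for _ in range(5)]
print(rewards, bandit.best_arm())
```

The player only ever sees the output of `pull`; `best_arm` is there so a simulation can check how well a strategy did.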

Bandits are a simplification of the full reinforcement learning problem

## Exploration versus exploitation

Favourite restaurant or somewhere new?
- if you know all the restaurants well, trust your judgement
- if you don't, just pick randomly

## Reinforcement learning context

Bandits have in common with reinforcement learning
- the exploration versus exploitation problem (which arm to pull)
- a potentially non-stationary environment

In supervised learning we use data to learn a function that predicts on unseen data
- in a bandit problem we could predict the expected value of each arm
- this is **prediction** / evaluation

We can also use data (the results we get from pulling an arm) to select an action
- this is **control**

We can define the value of an action (pulling a specific arm) as an expectation
- the expected reward of an action

$q_{*}(a) = \mathbb{E}[r(a)]$

Reinforcement learning has a **convenient goal** - maximizing expected reward
- if we knew the true expectation of each action, maximization would be an argmax
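
In practice we do not know $q_{*}(a)$, so one common choice is to estimate it with the sample mean of the rewards seen for each arm and take the argmax of those estimates. A small sketch on made-up data (the arm names and rewards in `toy_results` are hypothetical):

```python
from statistics import mean

# Hypothetical observations: arm name -> rewards seen so far.
toy_results = {
    "page-a": [1.0, 0.5, 1.5],
    "page-b": [2.0, 1.0, 3.0],
    "page-c": [0.0, 0.5, 1.0],
}

# Estimate q(a) for each arm with the sample mean of its rewards.
estimates = {arm: mean(rewards) for arm, rewards in toy_results.items()}

# If the estimates were exact, the best action would be a plain argmax.
greedy_arm = max(estimates, key=estimates.get)

print(estimates)
print(greedy_arm)
```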

The bandit problem is one step short of the full reinforcement learning problem
- a bandit has a single state, so no state transitions happen

## Example

You have the following results from comparing different landing pages:

In [None]:
params, results = generate_bandit_dataset(arms=20, samples=3)

results

## Practical 

Write a function to take an expectation over the results
- one number for each arm

In [None]:
# answer
# expectation(results)

One approach would be to conclude that one of the arms is optimal and send all our users there
- this is a **greedy** solution to the exploration & exploitation dilemma

## Practical

Take a greedy action based on the results
- take the argmax across expected reward

The problem with a greedy strategy is that our samples are noisy
- our estimate of the expectation has variance, so the arm that looks best may not be the best
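
A quick simulation gives a feel for how often a greedy choice goes wrong with so few samples. Everything here is assumed for illustration: five arms with made-up true means, unit standard deviation, and three samples per arm:

```python
import random
from statistics import mean

rng = random.Random(0)

# Hypothetical true means; the last arm is the best one.
true_means = [0.0, 0.25, 0.5, 0.75, 1.0]
best_arm = true_means.index(max(true_means))

samples_per_arm = 3
trials = 2000
correct = 0

for trial in range(trials):
    # Estimate each arm from only a few noisy samples (unit standard deviation).
    estimates = [
        mean([rng.gauss(mu, 1.0) for _ in range(samples_per_arm)])
        for mu in true_means
    ]
    greedy_arm = estimates.index(max(estimates))
    correct += greedy_arm == best_arm

fraction_correct = correct / trials
print(f"greedy picked the best arm in {fraction_correct:.0%} of trials")
```

Under these assumptions the greedy choice should miss the best arm in a large share of trials, and a purely greedy strategy never collects the extra samples that would correct the mistake.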

## Question to class

Is the expectation above biased?

## epsilon-greedy

Another solution would be to favour the option that appears optimal, while still sampling from the options that appear sub-optimal.

A simple algorithm to tackle the exploration-exploitation dilemma is known as **epsilon-greedy** - it is the method used for exploration in DeepMind's 2013 DQN.

The algorithm has a single parameter $\epsilon$, which controls how greedy we are.  

- $\epsilon$ = 1 -> standard A/B test
- $\epsilon$ = 0 -> greedy

In reinforcement learning $\epsilon$ is often decayed from 1 to 0.05 over an agent's lifetime.  Proper selection of $\epsilon$ depends on
- how accurate your greedy estimate is
- how non-stationary the process is
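
As a sketch of how decay might look, the helper below anneals $\epsilon$ linearly. The linear shape, the function names and the toy estimates are assumptions for illustration; other schedules (for example exponential decay) are also used:

```python
import random


def linear_epsilon(step, total_steps, start=1.0, end=0.05):
    """Linearly anneal epsilon from `start` to `end` over `total_steps`."""
    fraction = min(step / total_steps, 1.0)
    return start + fraction * (end - start)


def epsilon_greedy_action(estimates, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the best estimate."""
    if rng.random() < epsilon:
        return rng.choice(list(estimates))
    return max(estimates, key=estimates.get)


# Hypothetical value estimates for two arms.
estimates = {"page-a": 1.0, "page-b": 2.0}
print(linear_epsilon(0, 1000))     # epsilon at the start of the run
print(linear_epsilon(1000, 1000))  # epsilon at the end of the run
print(epsilon_greedy_action(estimates, epsilon=0.0))  # epsilon = 0 is greedy
```

Early on the agent explores almost uniformly; by the end it exploits its estimates most of the time but still samples the other arms occasionally.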

The basic algorithm is as follows:

In [None]:
def get_performance(results):
    """Mean reward over every sample collected so far, across all arms."""
    d = []
    for arm, data in results.items():
        d.extend(data)
    return np.mean(d)

results = {
    arm: list(np.random.normal(*stats))
    for arm, stats in params.items()
}

eps = 0.3
choices = list(params.keys())

steps = 1000
values = np.zeros((steps, len(choices)))
actions = np.empty(steps, dtype=object)  # arm chosen at each step
eps_performance = np.zeros(steps)

for step in range(steps):
    prob = np.random.rand()
    if prob < eps:
        strat = 'random'
        action = np.random.choice(choices)

    else:
        strat = 'greedy'
        expectations = expectation(results)
        values[step, :] = list(expectations.values())
        action = max(expectations, key=expectations.get)
        
    actions[step] = action
    
    p = params[action]
    # draw a single scalar reward from the chosen arm
    results[action].append(float(np.random.normal(p.loc, p.scale)))
    eps_performance[step] = get_performance(results)
    
plt.plot(eps_performance, label='eps {}'.format(eps))
_ = plt.legend()

#TODO put min & max lines of arms


## Upper Confidence Bound (UCB)

Section 2.7 in [Sutton & Barto - Reinforcement Learning: An Introduction](http://incompleteideas.net/book/RLbook2018.pdf)

Select an action based on its historical mean plus an exploration bonus

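$a_t = \underset{a}{\text{argmax}} \left[ q(a) + c \cdot \sqrt{\frac{\ln t}{N(a)}} \right]$

$q(a)$ = historical mean reward of action $a$

$c$ = constant controlling the amount of exploration

$t$ = timestep

$N(a)$ = number of times action $a$ has been taken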
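
A worked example may help show how the bonus behaves. This is only a sketch: `ucb_scores` is a hypothetical helper, not the `ucb` function from `answers`, and the toy data and the choice `c=2.0` are assumptions. Arms that have never been tried are given an infinite score so that they are pulled first:

```python
import math


def ucb_scores(results, step, c=2.0):
    """Return {arm: mean reward + exploration bonus} for each arm.

    `results` maps an arm name to the list of rewards observed so far and
    `step` is the current timestep t (starting at 1).
    """
    scores = {}
    for arm, rewards in results.items():
        if not rewards:
            # An untried arm gets an infinite score so it is pulled first.
            scores[arm] = math.inf
            continue
        mean_reward = sum(rewards) / len(rewards)
        bonus = c * math.sqrt(math.log(step) / len(rewards))
        scores[arm] = mean_reward + bonus
    return scores


# Hypothetical history: page-a looks better but has been tried four times,
# page-b looks worse but has only been tried once.
toy_results = {"page-a": [1.0, 1.0, 1.0, 1.0], "page-b": [0.5]}
scores = ucb_scores(toy_results, step=5, c=2.0)
chosen = max(scores, key=scores.get)
print(scores, chosen)
```

Here `page-b` is chosen despite its lower mean, because its estimate rests on a single sample and so carries a larger bonus. The bonus shrinks as an arm is pulled more often and grows slowly with $t$ for arms that are being ignored.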

## Practical

Implement a function that performs a UCB update

In [None]:
# answer
# ucb(results, step, c)

## Practical

Implement a UCB experiment
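
One possible shape for such an experiment is sketched below. It is not the `run_ucb_expt` function from `answers`; the name `run_ucb_sketch`, the arm means, the unit standard deviation and the default `c` are all assumptions:

```python
import math
import random


def run_ucb_sketch(arm_means, steps=500, c=2.0, seed=0):
    """Run UCB on Gaussian arms with unit standard deviation.

    Returns the pull count per arm and the running mean reward per step.
    """
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    totals = [0.0] * len(arm_means)
    running_mean = []  # average reward over all pulls so far

    for t in range(1, steps + 1):
        untried = [a for a, n in enumerate(counts) if n == 0]
        if untried:
            # Pull every arm once before the bonus is defined.
            arm = untried[0]
        else:
            arm = max(
                range(len(arm_means)),
                key=lambda a: totals[a] / counts[a]
                + c * math.sqrt(math.log(t) / counts[a]),
            )
        reward = rng.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        totals[arm] += reward
        running_mean.append(sum(totals) / t)

    return counts, running_mean


counts, running_mean = run_ucb_sketch([0.0, 0.5, 2.0])
print(counts)
```

The running mean plays the same role as `eps_performance` in the epsilon-greedy cell, so the two curves could be plotted on the same axes for comparison.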

In [None]:
# answer
# ucb_performance = run_ucb_expt(5)

#plt.plot(eps_performance, label='eps') 
#plt.plot(ucb_performance, label='ucb')
#_ = plt.legend()