# Bayesian Methods 
In this section we start looking at Bayesian methods, and at some motivation for why Bayesian methods are beneficial. First we will discuss the disadvantages of traditional A/B testing. 

Consider the following extreme example: 
> A drug you are testing is working well. Can you stop the test and improve the quality of life of all participants? Frequentist statistics says "no".

This is because stopping early is bad practice in frequentist statistics: it increases the chance that you find a false positive. Remember: the p-value can cross the significance threshold in either direction as data accumulates, so stopping the first time it dips below the threshold overstates the evidence. 
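
To make the cost of early stopping concrete, here is a minimal simulation sketch. It assumes an A/A test (both arms share the same true click rate, so every "significant" result is a false positive) and a pooled two-proportion z-test; the function names, the peeking schedule, and all parameter values are illustrative choices, not something fixed by the discussion above.

```python
import math
import random


def two_sided_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z-test p-value using the pooled normal approximation."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed yet, so no evidence of a difference
    z = (clicks_a / n_a - clicks_b / n_b) / se
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_experiments=500, n_per_arm=1000, peek_every=50,
                         ctr=0.1, alpha=0.05, seed=0):
    """Return (rate if we stop at any significant peek, rate if we test once at the end)."""
    rng = random.Random(seed)
    hits_with_peeking = 0
    hits_at_end = 0
    for _ in range(n_experiments):
        clicks_a = clicks_b = 0
        crossed = False
        for n in range(1, n_per_arm + 1):
            # Both arms have the same true CTR, so there is no real effect.
            clicks_a += rng.random() < ctr
            clicks_b += rng.random() < ctr
            if n % peek_every == 0 and two_sided_p_value(clicks_a, n, clicks_b, n) < alpha:
                crossed = True  # an early stopper would have declared a winner here
        final = two_sided_p_value(clicks_a, n_per_arm, clicks_b, n_per_arm) < alpha
        hits_at_end += final
        hits_with_peeking += crossed or final
    return hits_with_peeking / n_experiments, hits_at_end / n_experiments
```

Calling `false_positive_rates()` returns the two rates. The first can never be smaller than the second, because the final look is itself one of the peeks; how much larger it is depends on how often you peek.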

# Multi-Armed Bandit
> Imagine you are at a casino playing slots (hence the term "arm"). The slot machines are bandits because they are taking your money. Not all the slot machines are equal:
* 1 pays out 30% of the time
* 1 pays out 20% of the time
* and 1 pays out 10% of the time

This is the same problem as estimating a click-through rate or conversion rate: a click plays the role of a prize. So you could set up an A/B test, run the experiment, and by the end you would know whether any one arm was significantly better than the others. E.g. run the A/B test for a fixed number of trials N, then calculate the p-value. 

## What would you do in real life?
* Say 1 arm gives you a prize 3 out of 3 times, and the other arm gives you a prize 0 out of 3 times
* You will intuitively want to play the arm that paid out 3 times! 
* Why? You **ADAPT** 
* 3 plays is probably not enough to reach statistical significance, but you still feel compelled to believe the 1st arm is better

## Reinforcement Learning
* This is the same problem we face in reinforcement learning, where we are trying to teach a machine to play a game
* It models rewards it gets based on actions it takes
* The problem is that this process is stochastic, and your model of the rewards is only an estimate (it is not feasible to try every possible action, in every possible state)
* Early on: few actions taken, unsure about most rewards 
* can't "choose the action that leads to the best reward", because current knowledge about rewards is minimal
* only after collecting a lot of data will the estimate be accurate


# Explore vs. Exploit 
This problem has a name and it is called the explore-exploit dilemma. The question is:
> If I have gotten 3/3 from bandit 1 and 0/3 from bandit 2, should I:
* **Exploit** bandit 1 more?
* **Explore** at random to gather more data? 

We will look at several solutions to this problem! All will be adaptive, but not all will be Bayesian. We will show the Bayesian method last. 

# Epsilon-Greedy Solution 
One method of solving the explore-exploit dilemma is called the epsilon-greedy algorithm (and it is intuitive). It just so happens that this algorithm is also used in reinforcement learning. Here is how it works:
> Say you have some software that is serving 2 advertisements (A & B). You will *not* run a full A/B test by blindly serving ad A and ad B an equal # of times. Instead you will adapt to the data you have collected. If ad A is performing better, it will be shown more often. 

Note: even with this adaptive system, you can still run a traditional A/B test after collecting data. Remember that the contingency table for the chi-square test doesn't require both samples to be the same size. The thing changing here is not the test, but the machine that serves the ads. It will adapt based on the performance of each ad. 

## How does it work?
* simpler than it sounds! 
* We start by choosing a small number epsilon ($\epsilon$) between 0 and 1
* That will be the probability of exploration 
* some pseudocode may look like:

In [7]:
# epsilon = 0.1 
# while True:
#     r = rand()
#     if r < epsilon:
#         # explore
#         # show random advertisement
#     else:
#         # exploit
#         # show best advertisement (as determined by #clicks/#impressions)
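
The pseudocode above can be turned into a small runnable simulation. This is only a sketch under stated assumptions: each ad is modeled as a Bernoulli "click" with a hidden true CTR, and the `Ad` class, `run_epsilon_greedy` function, and parameter values are hypothetical names introduced for illustration.

```python
import random


class Ad:
    """Hypothetical ad with a hidden true CTR and running click statistics."""

    def __init__(self, true_ctr):
        self.true_ctr = true_ctr
        self.impressions = 0
        self.clicks = 0

    def show(self, rng):
        """Serve the ad once; return 1 on a click, 0 otherwise."""
        clicked = int(rng.random() < self.true_ctr)
        self.impressions += 1
        self.clicks += clicked
        return clicked

    def estimated_ctr(self):
        # Ads that were never shown get 0.0; exploration gives them a chance.
        return self.clicks / self.impressions if self.impressions else 0.0


def run_epsilon_greedy(true_ctrs, n_impressions=10_000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    ads = [Ad(ctr) for ctr in true_ctrs]
    total_clicks = 0
    for _ in range(n_impressions):
        if rng.random() < epsilon:
            ad = rng.choice(ads)                 # explore: show a random ad
        else:
            ad = max(ads, key=Ad.estimated_ctr)  # exploit: best #clicks/#impressions
        total_clicks += ad.show(rng)
    return ads, total_clicks
```

For example, `ads, clicks = run_epsilon_greedy([0.1, 0.3])` simulates two ads; inspecting `ad.impressions` for each ad afterwards shows how the traffic was split between them.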

## Analysis of Epsilon-Greedy
That was pretty simple, so let's analyze it further...
* One problem is that this algorithm does the same thing *forever*
* even when ad A is statistically significantly better than B, it will still sometimes choose B
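* Suppose the test has converged and A is better than B. How much reward will we get?
* Suppose:
### $$reward(click) = 1 \;,\; reward(noclick) =0$$
* With two ads, exploration picks the worse ad B with probability $\frac{\epsilon}{2}$, so after N impressions the better ad A has been shown about $N(1-\frac{\epsilon}{2})$ times. In the idealized case where A is always clicked and B never is:
### $$reward = N(1-\frac{\epsilon}{2})$$
* In other words, for $\epsilon = 0.1$, we get 0.95N instead of N
* However, this is still better than traditional A/B testing with no adaptation, which shows the worse ad half of the time for as long as the test runs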
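
The reward estimate above can be written as a small function. As a hedged generalization, the sketch below lets the two ads have arbitrary click probabilities `ctr_best` and `ctr_other` (hypothetical parameter names); the idealized case in the text corresponds to `ctr_best=1.0` and `ctr_other=0.0`.

```python
def expected_reward_two_ads(n, epsilon, ctr_best, ctr_other):
    """Expected clicks after n impressions once epsilon-greedy has converged (2 ads)."""
    # The best ad is shown when exploiting (1 - epsilon) and in half of the explorations.
    share_best = 1 - epsilon / 2
    return n * (share_best * ctr_best + (1 - share_best) * ctr_other)


# Idealized case from the text: the best ad always gets a click, the other never does.
converged = expected_reward_two_ads(1000, 0.1, 1.0, 0.0)

# Setting epsilon = 1 means every impression is random, i.e. a non-adaptive 50/50 split.
uniform_split = expected_reward_two_ads(1000, 1.0, 1.0, 0.0)
```

Comparing `converged` with `uniform_split` shows the gap between the adaptive server and a plain 50/50 split under these assumed click probabilities.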


---
# UCB1 
Can we improve upon the Epsilon-Greedy Algorithm?
* Recall our previous discussion about confidence intervals: 
* define upper and lower limit to represent where we believe true CTR is 
* We'll now look at a similar idea, but we are not going to use the confidence bound we discussed before

## Confidence Intervals
* We are going to use a "tighter" bound: the **Chernoff-Hoeffding** bound
* It says that for any 
### $$\epsilon > 0$$
* the following holds, where $\hat{\mu}$ is the sample mean of $N$ observations and $\mu$ is the true mean:
### $$P(\hat{\mu} > \mu+\epsilon) \leq \exp(-2\epsilon^2N)$$
* By symmetry, the opposite side also holds:
### $$P(\hat{\mu} < \mu-\epsilon) \leq \exp(-2\epsilon^2N)$$
* This is different from our old confidence intervals, which gave us an equality (specific numbers)
* In other words, the old confidence interval told us that the probability that the true parameter lies within some interval is 95% 
* E.g. $P(\mu\;in\;interval) = 95\%$
* This new bound, however, is an inequality:
### $$P(\mu\;in\;interval) > f(N)$$
* To get the greater-than sign, we just need to combine and rearrange the two probabilities above:

![ucb1%20confidence.png](attachment:ucb1%20confidence.png)
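
As a sanity check on the Chernoff-Hoeffding bound, the sketch below compares the bound $\exp(-2\epsilon^2N)$ with a simulated tail probability for Bernoulli clicks. The function names, the true CTR of 0.5, and the number of repeats are assumptions made purely for illustration.

```python
import math
import random


def hoeffding_bound(epsilon, n):
    """Upper bound on P(sample mean > true mean + epsilon) for n draws in [0, 1]."""
    return math.exp(-2 * epsilon ** 2 * n)


def empirical_tail(true_ctr, epsilon, n, n_repeats=2000, seed=0):
    """Fraction of simulated experiments whose sample mean overshoots by more than epsilon."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_repeats):
        clicks = sum(rng.random() < true_ctr for _ in range(n))
        if clicks / n > true_ctr + epsilon:
            exceed += 1
    return exceed / n_repeats


bound = hoeffding_bound(0.1, 100)          # equals exp(-2)
observed = empirical_tail(0.5, 0.1, 100)   # simulated overshoot frequency
```

In runs like this the simulated frequency should sit below the bound, which is all the inequality promises: the bound holds for any distribution on [0, 1], so it is not expected to be close to the actual tail probability.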

## Upper Confidence Bound 
* In our case, we are only interested in the upper bound, because we want to get the highest reward possible
* How does this help us? 
* Intuitively, if we have a "tighter" upper bound, we can be more confident about the max. possible win rate of any bandit
* we are going to skip some math, but essentially, for any arm j, we "choose" the epsilon:

![ucb1%20epsilon.png](attachment:ucb1%20epsilon.png)

* Here: 
    * N = the total number of games we have played overall
    * $N_j$ = the total times we have played arm j
    
## UCB1 Algorithm 

![ucb1%20algorithm.png](attachment:ucb1%20algorithm.png)

* Here we can see that in the while loop we choose the bandit with the highest upper bound. We play that arm, and then we update the estimated mean for that arm. 
* You may wonder, how does this algorithm help us explore and exploit? 
* Let's look at each of the terms separately:

![ucb1%20argmax.png](attachment:ucb1%20argmax.png)

* The first term is the estimated mean (the click-through rate estimate)
    * if that is high, then we want to exploit more often
* The second term depends on $N$ and $N_j$
    * if $N$ is high and $N_j$ is low, we are not so confident about the CTR estimate of arm j, so we explore that arm more
* Consider the asymptotic behavior of $\ln(N)/N$
    * as $N$ approaches infinity, $\ln(N)/N$ approaches 0, so the second term goes to 0 and for large $N$ we rely only on the estimated means, which by then are close to the true CTRs
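
The loop described above can be sketched in a few lines. The figures are not reproduced here, so this sketch assumes the standard UCB1 form, in which arm j's upper bound is its estimated mean plus $\sqrt{2\ln N / N_j}$; the function name `run_ucb1`, the Bernoulli reward model, and the parameter values are illustrative assumptions. It also assumes `n_rounds` is at least the number of arms.

```python
import math
import random


def run_ucb1(true_ctrs, n_rounds=10_000, seed=0):
    """Simulate UCB1 on Bernoulli arms; returns (plays, means, total_reward)."""
    rng = random.Random(seed)
    n_arms = len(true_ctrs)
    plays = [0] * n_arms      # N_j: times arm j has been played
    means = [0.0] * n_arms    # estimated mean reward of arm j

    def pull(j):
        reward = 1.0 if rng.random() < true_ctrs[j] else 0.0
        plays[j] += 1
        means[j] += (reward - means[j]) / plays[j]  # incremental mean update
        return reward

    total_reward = 0.0
    # Play every arm once so that N_j > 0 before computing any bound.
    for j in range(n_arms):
        total_reward += pull(j)

    for n_played in range(n_arms, n_rounds):  # n_played = N, games played so far
        def upper_bound(j):
            return means[j] + math.sqrt(2 * math.log(n_played) / plays[j])

        best = max(range(n_arms), key=upper_bound)
        total_reward += pull(best)
    return plays, means, total_reward
```

For example, `plays, means, total = run_ucb1([0.1, 0.2, 0.3])` simulates the three slot machines from earlier; `plays` then shows how often each arm was chosen.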
    
## UCB1 Expected loss
* Coding-wise, this is no more complicated than epsilon-greedy, but what is our expected loss?
* It can be shown (using complicated math that we will skip) that the loss is proportional to ln(N)
* compare to epsilon-greedy: the loss is proportional to N
* in the long run, UCB1 will perform much better than epsilon-greedy
* code is just as easy to write, even if the theory is more complex

---
# Conjugate Priors
Let's now build on our mathematical tools, in preparation for Bayesian A/B testing! 

## Frequentist Paradigm
* In an earlier section, we talked about the Bayesian paradigm
* We said that when we use the frequentist paradigm, we measure things like the mean and CTR (click-through rate) with point estimates
* E.g. sum(X)/N
* The problem was that a point estimate doesn't take into account how accurate the estimate is
* Our solution was to use the central limit theorem to show that the estimate is approximately Gaussian-distributed, and from there we could get the upper and lower bounds of the 95% confidence interval 
* so the frequentist paradigm:
    * calculates the likelihood (probability of your data, given the parameter)
        * e.g. how likely it was that you observed the data you did, given a Gaussian distribution with mean $\mu=4$
    * maximizes the likelihood with respect to the parameter in question 
        * e.g. find the mean $\mu$ (and associated Gaussian distribution) that maximizes the probability of the data you observed
    * we end up with the maximum likelihood estimate for that parameter 
    * the equation below is general, but think of the case: likelihood of the data we observed (X) given a value of the mean ($\theta=\mu$)
    
![mle%20estimate.png](attachment:mle%20estimate.png)
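
To illustrate the maximum likelihood idea, the sketch below scans candidate values of $\mu$ and keeps the one under which the data is most likely. It assumes i.i.d. Gaussian data with a known standard deviation of 1; the data values and the grid of candidates are made up for the example.

```python
import math


def gaussian_log_likelihood(data, mu, sigma=1.0):
    """Log of p(data | mu) for i.i.d. Gaussian observations with known sigma."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


data = [3.1, 4.4, 3.8, 4.9, 4.0, 3.6]  # made-up observations
sample_mean = sum(data) / len(data)

# Scan candidate means 2.00, 2.01, ..., 6.00 and keep the most likely one.
candidates = [i / 100 for i in range(200, 601)]
mle_on_grid = max(candidates, key=lambda mu: gaussian_log_likelihood(data, mu))
```

The grid maximizer lands on the candidate closest to `sample_mean`, which matches the point estimate sum(X)/N used earlier: for a Gaussian with known variance, the maximum likelihood estimate of the mean is the sample mean.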

