<center><h2>Multi-Arm Bandits: Reinforcement Learning Simplified</h2></center>

<center><img src="http://www.offtopia.net/ecai-2012-vago-poster/bandit.png" width="50%"/></center>

<center><h2>Reinforcement Learning: A simple definition</h2></center>
<br>
<br>
<center>"Randomly try strategies.  </center>
<center>If they work, choose them more often."  </center>
<center>— Raymond Hettinger </center> 

<center><h2>Learning Outcomes</h2></center>

#### By the end of this session, you should be able to:

- Explain multi-arm bandits in your own words.
- Explain the exploration / exploitation trade-off.
- Define regret in your own words and mathematically.

<center><h2>Traditional A/B Testing</h2></center>

<center><img src="images/ab.png" height="50%"/></center>

Traditional A/B Testing
-----

First, assign equal numbers of users to Group A and Group B.

Then, stop the experiment and send all the users to the more successful version of your site.

What is the problem with traditional A/B testing?
-----

Even a small set of decisions can quickly lead to thousands of unique page layouts. Controlled A/B tests are well suited to simple binary decisions, but they do not scale well to hundreds or thousands of treatments.

[Source](https://blog.acolyer.org/2017/11/22/canopy-an-end-to-end-performance-tracing-and-analysis-system/)

<center><h2>Multi-Arm Bandit</h2></center>


<center><img src="images/b.jpg" width="70%"/></center>

<center>Systematically adjust the numbers of samples for each condition based on current results.</center>

Where does the bandit name come from?
-----



<center><img src="http://images.fineartamerica.com/images-medium-large-5/one-arm-bandit-slot-machine-20130308-wingsdomain-art-and-photography.jpg" width="400"/></center>

Multi-Arm Bandit
-----

<center><img src="https://conversionxl.com/wp-content/uploads/2015/09/multiarmedbandit.jpg" height="500"/></center>

bandits (slot machines) = versions of the website

Multi-Arm Bandit Approach 
-----

1) Most of the time, show each user the version of the site that you currently think is best



2) As the experiment runs, update the belief about the CTR (Click Through Rate) for each version


3) Run until we are satisfied that we know the best version
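
The three steps above can be sketched with an epsilon-greedy rule, which is one common bandit strategy chosen here purely for illustration (the slides do not commit to a specific algorithm). The function name `epsilon_greedy`, the `epsilon` parameter, and the click-through rates are all assumptions made up for this example.

```python
import random

def epsilon_greedy(true_ctrs, n_users=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over site versions.

    true_ctrs are hypothetical click-through rates that the
    algorithm never sees directly; it only observes clicks.
    """
    rng = random.Random(seed)
    n_arms = len(true_ctrs)
    shows = [0] * n_arms   # how often each version was shown
    clicks = [0] * n_arms  # how many clicks each version received

    for _ in range(n_users):
        if rng.random() < epsilon or 0 in shows:
            # Explore: with probability epsilon (or while some version
            # is still unseen), pick a version uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Step 1: show the version with the best estimated CTR so far.
            arm = max(range(n_arms), key=lambda a: clicks[a] / shows[a])

        # Step 2: observe the outcome and update the belief for that version.
        shows[arm] += 1
        clicks[arm] += rng.random() < true_ctrs[arm]

    estimates = [c / s if s else 0.0 for c, s in zip(clicks, shows)]
    return shows, estimates

# Step 3: run for as long as we like, then inspect the results.
shows, estimates = epsilon_greedy([0.04, 0.05, 0.10])
print("times shown:", shows)
print("estimated CTRs:", estimates)
```

A larger `epsilon` spends more traffic on exploring; a smaller one commits faster but can stay stuck on a version that only looked good early on.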

Example Scenario  
-----

The company you work for is testing out a new version of its mobile website.

After running an A/B test for an afternoon, the new version of the site appears to be performing slightly better than the old version.

After running the test overnight, the old version of the site is performing better than the new version with a statistically significant p-value of 0.04.

Do you stop the test, or do you keep running it?

<center><h2>Student Activity</h2></center>

Pretend you are in a job interview. Pair off. One person answers each question in 30 seconds to 2 minutes. Switch pairs after each question.

- List the three biggest limitations of traditional A/B testing.
- How does MAB address each one?
- How does data analysis change between A/B testing and MAB?
- Which one is more efficient? Why?
- Which one is more ethical? Why?
- Why would you choose multi-arm bandits (MAB) over A/B testing?

Challenge Question: What are specific situations where you would choose A/B testing over multi-arm bandits (MAB)?

<center><h2>Specific situations to choose A/B testing over multi-arm bandits (MAB)</h2></center>
 
1. Required to have statistical significance (e.g., publishing a paper or your boss is a statistician).
 
2. Unable to implement multi-arm bandit due to technical debt or system complexity.

<center><img src="images/yum.jpg" width="80%"/></center>

If you are going out to eat, should you try something new or go with an old favorite?

<center><h2>Exploration vs. Exploitation  </h2></center>

<center><h2>Language Note</h2></center>

__Explore & Exploit__ is the conventional term in the CS / ML community, a group not known for its inclusive language.

An alternative is __Explore & Execute__ (better, but not great).

Source: [Algorithms to Live By: The Computer Science of Human Decisions ](https://www.amazon.com/Algorithms-Live-Computer-Science-Decisions/dp/1627790365)

<center><h2>Exploration vs. Exploitation  </h2></center>

<center><img src="images/eee.jpg" width="70%"/></center>

Exploration
------

Trying out different options to find the reward associated with the given approach (aka, acquiring more knowledge).

Exploitation
-----

Choosing the approach that you believe to have the highest expected payoff (aka, optimizing outcomes).

<center><img src="images/ee.jpg" width="80%"/></center>

<center><h2>Traditional A/B Testing  </h2></center>

A short period of __pure exploration__, in which you assign equal numbers of users to Group A and Group B. 

A long period of __pure exploitation__, in which you send all of your users to the more successful version of your site and never come back to the option that seemed to be inferior.
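
The two phases can be simulated in a few lines. This is a minimal sketch, assuming made-up click-through rates and a hypothetical helper named `explore_then_commit`; it also assumes `n_explore` is at least the number of versions, so every version is shown at least once before a winner is picked.

```python
import random

def explore_then_commit(true_ctrs, n_explore=1_000, n_total=10_000, seed=0):
    """Traditional A/B test: equal split first, then send everyone to the winner."""
    rng = random.Random(seed)
    n_arms = len(true_ctrs)
    shows = [0] * n_arms
    clicks = [0] * n_arms

    # Pure exploration: rotate through the versions so each gets an equal share.
    for i in range(n_explore):
        arm = i % n_arms
        shows[arm] += 1
        clicks[arm] += rng.random() < true_ctrs[arm]

    # Stop the experiment and pick the version with the best observed CTR.
    winner = max(range(n_arms), key=lambda a: clicks[a] / shows[a])

    # Pure exploitation: every remaining user sees the winner,
    # and the other versions are never tried again.
    for _ in range(n_total - n_explore):
        shows[winner] += 1
        clicks[winner] += rng.random() < true_ctrs[winner]

    return winner, shows, clicks

winner, shows, clicks = explore_then_commit([0.05, 0.10])
print("committed to version", winner, "with traffic split", shows)
```

If the exploration phase happens to crown the wrong version, nothing in the second phase can correct the mistake.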

Check for understanding
-------

Why might this be a suboptimal strategy?  

Potential Inefficiencies  
------

- Equal numbers of observations are routed to A and B for a preset amount of time or iterations

- Need to wait for the experiment to conclude for certain statistical guarantees to be provided 

- Only after that preset amount of time or iterations do we stop and use the better performer 
  

- This wastes time and __money__ showing users the suboptimal site

- Changes over time (seasonality)
- Fraud
- Random fluctuations

Check for understanding
----

What are examples of where we can apply bandits to reduce inefficiencies?

* Dynamically A/B testing websites


* Adaptive routing in attempts to minimize network delays (either packets 🖥 or packages 📦)

* Clinical trials (I would argue __NOT__ doing bandits is immoral)


* Budget allocation amongst competing projects (not a way to make friends)


<center><img src="images/bezos.jpeg" width="75%"/></center>

[Source](https://medium.com/@savedali/jeff-bezos-guide-to-quitting-medicine-24e16325f159)

<center><h2>What is "Regret" in Explore / Exploit? </h2></center>

The difference between what we would expect to win with an optimal strategy and what we actually won.

Regret is the cost function we are trying to minimize at a strategic level.

<center><h2>Regret Formalism</h2></center>

$$ 
\begin{align*} 
\text{regret} &= \sum_{i = 1}^k (p_{opt} - p_i)   \\ 
&= k p_{opt} - \sum_{i = 1}^k p_i
\end{align*} $$

where there are $k$ trials and $p_i$ is the probability of winning with the bandit chosen on the $i$-th run.

$p_{opt}$ is the probability of winning with the best bandit.
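
The formula translates directly into code. The sketch below uses hypothetical win probabilities and a made-up sequence of choices; `expected_regret`, `true_probs`, and `choices` are names invented for this example.

```python
def expected_regret(true_probs, choices):
    """Sum of (p_opt - p_i) over the bandit chosen on each trial."""
    p_opt = max(true_probs)
    return sum(p_opt - true_probs[arm] for arm in choices)

true_probs = [0.05, 0.10]     # hypothetical win probabilities; bandit 1 is best
choices = [0, 1, 0, 1, 1, 1]  # bandit chosen on each of k = 6 trials

# Two of the six trials used the worse bandit, each costing p_opt - p_i = 0.05.
print(expected_regret(true_probs, choices))

# The second line of the formula gives the same value: k * p_opt - sum of p_i.
k = len(choices)
print(k * max(true_probs) - sum(true_probs[arm] for arm in choices))
```

Choosing the best bandit on every trial makes every term zero, so the total regret is zero.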

<center><h2>Check for understanding</h2></center>

What would a regret of zero mean?

A regret of 0 would mean you always made the best choice.

<center><h2>Check for understanding</h2></center>

In what situations can you guarantee zero regret?

It is __never__ possible since you need to collect data to determine which variation is the best.

<sub>Note that you need to know the true probabilities to calculate the regret. This is a theoretical idea to evaluate which algorithm is best.</sub>

<center><h2>Check for understanding</h2></center>

The traditional bandit problem is specified with a discrete and finite number of arms, often indicated by the variable $K$.

What optimization technique should you use if you do __not__ have discrete, independent arms?

<center><h2>Bayesian Optimization</h2></center>

<center><img src="images/bo.jpeg" width="75%"/></center>

A general method for “learning to optimize” an unknown function.

It samples the search space efficiently, especially when the space is continuous and the options are dependent on one another.

[Source](https://towardsdatascience.com/shallow-understanding-on-bayesian-optimization-324b6c1f7083)

Summary
-----

- In your life, you need to balance exploration (finding what you like) and exploitation (doing what you like).
- Multi-arm bandits are a way of optimizing A/B testing by minimizing regret.
- Regret quantifies the cost of not doing the optimal thing.
- Multi-arm bandits are a simple version of Reinforcement Learning.