In [2]:
import math
import numpy as np
from scipy import stats

from IPython.display import HTML, Image

### Uber's Experimentation Platform

The aim of this notebook is to understand the reasoning behind the statistical methods used in Uber's experimentation platform. In particular, I look at the methods used to analyse continuous metrics, as defined in figure 5.

In [7]:
Image(url= "http://1fykyq3mdn5r21tpna3wkdyi-wpengine.netdna-ssl.com/wp-content/uploads/2018/08/image16.png",
      width=500, height=500)

### Continuous Metrics

For continuous metrics, the methodology is as follows:

- Large sample size | unskewed data --> Welch's t-test
- Large sample size | skewed data --> Mann-Whitney U test (MWW U Test)
- Small sample size --> Bootstrap + t-test
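
The three rules above can be sketched as a small dispatch function. This is only an illustration of the decision logic, not Uber's implementation: the function name `choose_test` and the cut-offs `SMALL_SAMPLE` and `SKEW_THRESHOLD` are hypothetical, and the small-sample branch is left as a stub.

```python
import numpy as np
from scipy import stats

# Hypothetical cut-offs, chosen only for illustration.
SMALL_SAMPLE = 30
SKEW_THRESHOLD = 1.0


def choose_test(control, treatment):
    """Return (method name, p-value) following the three rules above."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)

    if min(len(control), len(treatment)) < SMALL_SAMPLE:
        # Small sample: the bootstrap + t-test branch is not sketched here.
        return "bootstrap + t-test", None

    largest_skew = max(abs(stats.skew(control)), abs(stats.skew(treatment)))
    if largest_skew > SKEW_THRESHOLD:
        result = stats.mannwhitneyu(control, treatment, alternative="two-sided")
        return "Mann-Whitney U test", result.pvalue

    # equal_var=False turns the two-sample t-test into Welch's t-test.
    result = stats.ttest_ind(control, treatment, equal_var=False)
    return "Welch's t-test", result.pvalue


rng = np.random.default_rng(0)
normal_method, normal_p = choose_test(rng.normal(0, 1, 1000), rng.normal(0.1, 1, 1000))
skewed_method, skewed_p = choose_test(rng.lognormal(0, 1, 1000), rng.lognormal(0.1, 1, 1000))
small_method, small_p = choose_test(rng.normal(0, 1, 10), rng.normal(0, 1, 10))
```

Measuring skew on the raw data is itself an assumption here; a real platform might look at the sampling distribution or use a different diagnostic.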

#### Table Summary

| Metric Type | Sample Size | Skew | Statistical Method | Commentary |
|-------------|-------------|------------|--------------------|------------|
| Continuous | Large | Not Skewed | Welch's t-test | Default method for analysing continuous metrics. |
| Continuous | Large | Skewed | MWW U Test | If the sampling distribution is non-normal, the MWW U Test is more powerful than the t-test. **Under what circumstances would a large sample size yield a non-normal sampling distribution? By the central limit theorem, we would expect the sampling distribution to be normal.** Perhaps they are testing a feature out on drivers in a certain geography. |
| Continuous | Small | - | Bootstrap + t-test | According to [Wikipedia](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)): > When the sample size is small, the bootstrap can offer a measure of the standard deviation that is more robust against the distortions of a particular sample. But for small sample size, the sample may not accurately represent the population, in which case the bootstrap does not work very well. |

#### Questions
- Why does the bootstrap offer a better estimate of the standard error than just computing the standard error from the standard deviation estimate of the data?

Further below I simulate a skewed sample and compare the analytical and bootstrapped standard errors of its mean.
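
One way to frame the question for the sample mean is to write both estimators down. The notation here is mine, not Uber's: $s$ is the sample standard deviation, $B$ the number of bootstrap resamples, $\bar{x}^*_b$ the mean of resample $b$, and $\bar{x}^*_{\cdot}$ the average of those resample means.

$$
\widehat{\mathrm{SE}}_{\text{analytical}} = \frac{s}{\sqrt{n}}, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
$$

$$
\widehat{\mathrm{SE}}_{\text{boot}} = \sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\bar{x}^*_b - \bar{x}^*_{\cdot}\right)^2} \;\xrightarrow{\;B \to \infty\;}\; \sqrt{\frac{n-1}{n}}\,\frac{s}{\sqrt{n}}
$$

The limit holds because resampling with replacement draws from the empirical distribution, whose variance is $\frac{n-1}{n}s^2$. So for the mean, the two estimates should nearly coincide once $n$ is large, and any advantage of the bootstrap would have to come from statistics that lack a simple closed-form standard error.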

### Ratio Metrics

Placeholder

> Ratio metrics contain two numeric value columns, the numerator values and the denominator values, e.g., the trip completion ratio, where the numerator values are the number of completed trips, and the denominator values are the number of total trip requests.

- Delta method ([Deng et al. 2011](https://alexdeng.github.io/public/files/jsm2011-deng.pdf))
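
As a rough sketch of what the delta method gives for a ratio of means, the function below applies the standard first-order approximation $\mathrm{Var}(\bar{y}/\bar{x}) \approx \left(\sigma_y^2 - 2r\sigma_{xy} + r^2\sigma_x^2\right) / \left(n\,\mu_x^2\right)$ with $r = \mu_y/\mu_x$. The function name `ratio_delta_method` and the simulated trip data are hypothetical, and the sketch assumes independent rows (for example, one row per user); it is not taken from Uber's platform or from the paper's code.

```python
import numpy as np


def ratio_delta_method(numerator, denominator):
    """Ratio of means and its delta-method standard error.

    Assumes rows (e.g. users) are independent and identically distributed.
    """
    y = np.asarray(numerator, dtype=float)
    x = np.asarray(denominator, dtype=float)
    n = len(y)

    ratio = y.mean() / x.mean()
    var_y = y.var(ddof=1)
    var_x = x.var(ddof=1)
    cov_xy = np.cov(y, x, ddof=1)[0, 1]

    # First-order Taylor expansion of mean(y) / mean(x).
    variance = (var_y - 2 * ratio * cov_xy + ratio**2 * var_x) / (n * x.mean() ** 2)
    return ratio, np.sqrt(variance)


# Simulated data: trip requests per user and how many of them completed.
rng = np.random.default_rng(0)
requests = rng.poisson(5, size=2000) + 1
completed = rng.binomial(requests, 0.8)
ratio, se = ratio_delta_method(completed, requests)
print(f"completion ratio = {ratio:.3f}, delta-method SE = {se:.4f}")
```

The covariance term matters: numerator and denominator come from the same users, so treating them as independent would misstate the variance.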

### Categorical Metrics
Placeholder

- One of our team’s main goals is to deliver one-size-fits-most methodologies of hypothesis testing 
- They perform automated outlier detection
    - Kohavi recommends first understanding the origin of your outliers then removing them.


In [50]:
n = 1000
loc = 1
scale = 1
s = 1
# Draw n observations from a right-skewed (lognormal) distribution.
data = stats.lognorm.rvs(s, loc=loc, scale=scale, size=n)

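In [54]:
def bootstrapped_se(data, n_resamples=50000):
    """Estimate the standard error of the mean by resampling with replacement."""
    sample_means = []
    for _ in range(n_resamples):
        sample = np.random.choice(data, size=len(data), replace=True)
        sample_means.append(np.mean(sample))
    # The spread of the resampled means is the bootstrap standard error.
    return np.std(sample_means)


In [55]:
# NOTE: scale / sqrt(n) understates the true SE here. A lognormal with s = 1
# has std = scale * sqrt((e - 1) * e) ≈ 2.16 * scale, so the factor of 2 in the
# first print is only a rough correction.
true_se = scale / math.sqrt(n)
analytical_se = np.std(data, ddof=1) / np.sqrt(n)
bootstrap_se = bootstrapped_se(data)

print(true_se * 2)
print(analytical_se)
print(bootstrap_se)

0.06324555320336758
0.06619491642071582
0.06630626386014395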


#### Questions
- For a large sample size, how would the sampling distribution be skewed? By the central limit theorem, we would expect the sampling distribution to be normal even if the parent distribution is heavily skewed.
- What are the assumptions of a t-test? Would they be broken above?
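
A quick way to probe the first question is to simulate many experiments and measure the skewness of the resulting sample means. The sketch below reuses the lognormal parameters from the cell above; the number of replications and the seed are arbitrary choices of mine. In theory the skewness of a sample mean is the parent skewness divided by $\sqrt{n}$, so it shrinks with sample size but never reaches exactly zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000             # observations per sample, as in the cell above
replications = 5000  # number of simulated experiments (arbitrary)

# Same distribution as stats.lognorm.rvs(s=1, loc=1, scale=1): 1 + exp(N(0, 1)).
samples = 1 + rng.lognormal(mean=0.0, sigma=1.0, size=(replications, n))
sample_means = samples.mean(axis=1)

# Theoretical skewness of the parent vs. empirical skewness of the means.
parent_skew = float(stats.lognorm.stats(1, loc=1, scale=1, moments="s"))
means_skew = float(stats.skew(sample_means))

print(f"skewness of the parent distribution: {parent_skew:.2f}")
print(f"skewness of the sample means:        {means_skew:.2f}")
```

If the skewness of the means comes out small but clearly non-zero, that would suggest the normal approximation behind the t-test is reasonable yet imperfect at this sample size, which may be part of why a rank-based test is offered for skewed data.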