# Statistical Considerations in Testing


* How much data will you need before you can judge the success of your experiment on solid grounds?
* How many data points will you need to see the effect you are interested in?
> * Factors like the size of the effect that you want to see can strongly influence how much data you need to collect and how long it will take before you get your results.

<img src="stat_cons_01.png" width="500">

<img src="stat_cons_02.png" width="500">

# Lesson 1: L2 Statistical Significance - Exercise

This lesson assumes that you already know about basic inferential statistics. In particular, you should know how to perform a statistical test for the difference in means between two groups, and for comparing the mean of a single group against a reference value.

Let's say that we've collected data for a web-based experiment. In the experiment, we're testing the change in layout of a product information page to see if this affects the proportion of people who click on a button to go to the download page. This experiment has been designed to have a cookie-based diversion, and we record two things from each user: which page version they received, and whether or not they accessed the download page during the data recording period. (We aren't keeping track of any other factors in this example, such as number of pageviews, or time between accessing the page and making the download, that might be of further interest.)

Your objective in this notebook is to perform a statistical test on both recorded metrics to see if there is a statistical difference between the two groups.

In [2]:
import numpy as np
import pandas as pd

import scipy.stats as stats
from statsmodels.stats import proportion as proptests

import matplotlib.pyplot as plt

%matplotlib inline
%config Completer.use_jedi = False



In [4]:
# import data
data = pd.read_csv('01_stat_considerations/data/statistical_significance_data.csv')
data.head()

Unnamed: 0,condition,click
0,1,0
1,0,0
2,0,0
3,1,1
4,1,0


In the dataset, the 'condition' column takes a 0 for the control group, and 1 for the experimental group. The 'click' column takes a value of 0 for no click, and 1 for a click.

## Checking the Invariant Metric

First of all, we should check that the number of visitors assigned to each group is similar. It's important to check the invariant metrics as a prerequisite so that our inferences on the evaluation metrics are founded on solid ground. If we find that the two groups are imbalanced on the invariant metric, then this will require us to look carefully at how the visitors were split so that any sources of bias are accounted for. It's possible that a statistically significant difference in an invariant metric will require us to revise random assignment procedures and re-do data collection.

In this case, we want to do a two-sided hypothesis test on the proportion of visitors assigned to one of our conditions. Choosing the control or the experimental condition doesn't matter: you'll get the same result either way. Feel free to use whatever method you'd like: we'll highlight two main avenues below.

If you want to take a simulation-based approach, you can simulate the number of visitors that would be assigned to each group for the total number of observations, assuming an expected 50/50 split. Do this many times (200,000 repetitions should provide a good speed-variability balance in this case) and then count how many simulated cases show a deviation from 50/50 that is as extreme as or more extreme than the one we actually observed. Don't forget that, since we have a two-sided test, an extreme case also includes values on the opposite side of 50/50. (For example, if the observed proportion is 0.48, then simulated outcomes of 0.48 and lower count as at least as extreme, and so do simulated outcomes of 0.52 and higher.) The proportion of flagged simulation outcomes gives us a p-value with which to assess our observed proportion. For an invariant metric we hope to see a large p-value, meaning there is insufficient evidence to reject the null hypothesis of an even split.

If you want to take an analytic approach, you could use the exact binomial distribution to compute a p-value for the test. The more usual approach, however, is to use the normal distribution approximation. Recall that this is possible thanks to our large sample size and the central limit theorem. To get a more precise p-value, you should also perform a continuity correction, adding 0.5 to or subtracting 0.5 from the observed count before computing the area underneath the curve. (For example, if we had $\frac{415}{850}$ assigned to the control group, then the normal approximation would take the area to the left of $\frac{415 + 0.5}{850} = 0.489$ and to the right of $\frac{435 - 0.5}{850} = 0.511$.)
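As a rough check of the worked example above, the following sketch compares the continuity-corrected normal approximation with the exact binomial tail for 415 of 850 visitors in the control group. The variable names are illustrative only, and the doubling of the lower tail assumes the observed count sits below the expected count under a 50/50 split.

```python
import numpy as np
from scipy import stats

n_total = 850   # total visitors in the worked example
n_ctrl = 415    # visitors assigned to the control group
p_null = 0.5    # expected share under a 50/50 split

# Normal approximation with continuity correction: move the count 0.5
# toward the mean before standardizing, then double the lower-tail area.
# (Valid as written only when n_ctrl is below the expected count.)
mean = n_total * p_null
sd = np.sqrt(n_total * p_null * (1 - p_null))
z = (n_ctrl + 0.5 - mean) / sd
p_normal = 2 * stats.norm.cdf(z)

# Exact two-sided binomial p-value; the two tails are equal because p_null = 0.5.
p_exact = 2 * stats.binom.cdf(n_ctrl, n_total, p_null)

print(f"z = {z:.3f}")
print(f"normal approximation p-value: {p_normal:.4f}")
print(f"exact binomial p-value:       {p_exact:.4f}")
```

With a sample this large the two p-values should agree closely, which is why the normal approximation is usually considered good enough here.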

You can check your results by completing the quiz and watching the video following the workspace. You could also try using multiple approaches and seeing if they come up with similar outcomes!

### Analytic Approach

In [5]:
# get number of trials and number of 'successes'
n_obs = data.shape[0]
n_control = data.groupby('condition').size()[0]

In [6]:
# Compute a z-score and p-value
p = 0.5
sd = np.sqrt(p * (1-p) * n_obs)

z = ((n_control + 0.5) - p * n_obs) / sd

print(z)
print(2 * stats.norm.cdf(z))

-0.5062175977346661
0.6127039025537114


### Simulation Approach

In [10]:
# get number of trials and number of 'successes'
n_obs = data.shape[0]
n_control = data.groupby('condition').size()[0]

In [22]:
# simulate outcomes under null, compare to observed outcome
p = 0.5
n_trials = 200_000

samples = np.random.binomial(n_obs, p, n_trials)

# two-sided p-value: share of simulated counts at least as far from 50/50 as observed
invar_p = np.logical_or(samples <= n_control, samples >= (n_obs - n_control)).mean()
print(invar_p)
print("p-value for the test on the invariant metric (number of visitors assigned to each group): {}".format(invar_p))

0.614335
p-value for the test on the invariant metric (number of visitors assigned to each group): 0.614335


## Checking the Evaluation Metric

After performing our checks on the invariant metric, we can move on to performing a hypothesis test on the evaluation metric: the click-through rate. In this case, we want to see that the experimental group has a significantly larger click-through rate than the control group, a one-tailed test.

The simulation approach for this metric isn't too different from the approach for the invariant metric. You'll need the overall click-through rate as the common proportion to draw simulated values from for each group. You may also want to perform more simulations since there's higher variance for this test.

There are a few analytic approaches possible here, but you'll probably make use of the normal approximation again in these cases. In addition to the pooled click-through rate, you'll need a pooled standard deviation in order to compute a z-score. A continuity correction is possible in this case as well, but it tends to give a much more conservative p-value than a simulation will usually imply. Computing the z-score and resulting p-value without a continuity correction should be closer to the simulation's outcomes, though slightly more optimistic about there being a statistical difference between groups.
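To see how a continuity correction shifts the result, here is a minimal sketch on hypothetical counts (not the dataset above). It assumes one common form of the correction for a difference in proportions, a Yates-style adjustment that shrinks the observed difference by half a count in each group; the source does not say which form it has in mind.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, chosen only for illustration.
n_ctrl, clicks_ctrl = 1000, 80
n_exp, clicks_exp = 1000, 105

p_ctrl = clicks_ctrl / n_ctrl
p_exp = clicks_exp / n_exp
p_pool = (clicks_ctrl + clicks_exp) / (n_ctrl + n_exp)

# Pooled standard error of the difference under the null hypothesis.
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_exp))
diff = p_exp - p_ctrl

# One-tailed p-value without a continuity correction.
z_plain = diff / se_pool
p_plain = stats.norm.sf(z_plain)

# Assumed Yates-style correction: reduce the difference by half a count per group.
correction = 0.5 * (1 / n_ctrl + 1 / n_exp)
z_corr = (diff - correction) / se_pool
p_corr = stats.norm.sf(z_corr)

print(f"uncorrected: z = {z_plain:.3f}, p = {p_plain:.4f}")
print(f"corrected:   z = {z_corr:.3f}, p = {p_corr:.4f}")
```

The corrected p-value is always the larger of the two for a positive difference, which is the sense in which the correction is more conservative.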

As with the previous question, you'll find a quiz and video following the workspace for you to check your results.

In [23]:
p_click = data.groupby('condition').mean()['click']
p_click

condition
0    0.079430
1    0.112205
Name: click, dtype: float64

In [31]:
# Difference in average click rate between groups
p_click[1] - p_click[0]

0.03277498917523293

### Analytic Approach

In [29]:
# get number of trials and overall 'success' rate under null
n_control = data.groupby('condition').size()[0]
n_exper = data.groupby('condition').size()[1]
p_null = data['click'].mean()

In [34]:
# compute standard error, z-score, and p-value
se_p = np.sqrt(p_null * (1-p_null) * (1/n_control + 1/n_exper))

z = (p_click[1] - p_click[0]) / se_p
# print(z)
eval_p = 1 - stats.norm.cdf(z)
print("p-value for the test on the evaluation metric (difference in click-through rates across groups): {:.3f}".format(eval_p))

p-value for the test on the evaluation metric (difference in click-through rates across groups): 0.039


### Simulation Approach

In [35]:
# get number of trials and overall 'success' rate under null
n_control = data.groupby('condition').size()[0]
n_exper = data.groupby('condition').size()[1]
p_null = data['click'].mean()

In [36]:
# simulate outcomes under null, compare to observed outcome
n_trials = 200_000

ctrl_clicks = np.random.binomial(n_control, p_null, n_trials)
exp_clicks = np.random.binomial(n_exper, p_null, n_trials)
samples = exp_clicks / n_exper - ctrl_clicks / n_control

print((samples >= (p_click[1] - p_click[0])).mean())

0.03986


# Lesson 2: Practical Significance

**Practical Significance:** *Level of observed change required to deploy a tested experimental manipulation.*

<center><img src="stat_cons_06.png" width="500"></center>


Even if an experiment result shows a statistically significant difference in an evaluation metric between control and experimental groups, that does not necessarily mean that the experiment was a success. If there are any costs associated with deploying a change, those costs might outweigh the benefits expected based on the experiment results. **Practical significance** refers to the level of effect that you need to observe in order for the experiment to be called a true success and for the change to actually be implemented. Not all experiments imply a practical significance boundary, but it's an important factor in the interpretation of outcomes where it is relevant.

If you consider the confidence interval for an evaluation metric statistic against the null baseline and practical significance bound, there are a few cases that can come about.

### Confidence interval is fully in practical significance region
(Below, $m_{0}$ indicates the null statistic value, $d_{min}$ the practical significance bound, and the blue line the confidence interval for the observed statistic. We assume that we're looking for a positive change, ignoring the negative equivalent for $d_{min}$.)

<center><img src="stat_cons_03.png" width="500"></center>

If the confidence interval for the statistic does not include the null or the practical significance level, then the experimental manipulation can be concluded to have a statistically and practically significant effect. This is the clearest case for implementing the manipulation.

### Confidence interval completely excludes any part of practical significance region

<center><img src="stat_cons_04.png" width="500"></center>

If the confidence interval does not include any values that would be considered practically significant, this is a clear case for us to not implement the experimental change. This includes the case where the metric is statistically significant but its confidence interval does not reach the practical significance bound. With such a low chance of practical significance being achieved on the metric, we should be wary of implementing the change.

### Confidence interval includes points both inside and outside practical significance bounds

<center><img src="stat_cons_05.png" width="500"></center>

This leaves the trickiest cases to consider, where the confidence interval straddles the practical significance bound. In each of these cases, there is an uncertain possibility of practical significance being achieved. In an ideal world, you would be able to collect more data to reduce your uncertainty, reducing the scenario to one of the previous cases. Outside of this, you'll need to consider the risks carefully in order to make a recommendation on whether or not to follow through with a tested change. Your analysis might also reveal subsets of the population or aspects of the manipulation that **do** work, which can be used to refine further studies or experiments.
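The three cases above can be sketched as a small decision helper. The function name `classify_interval` and its labels are hypothetical. The sketch assumes a positive change is desired and treats $d_{min}$ as a location on the same scale as the statistic (so $d_{min} > m_{0}$), which is one reading of the figures.

```python
def classify_interval(ci_low, ci_high, m0, d_min):
    """Label a confidence interval against the null value and practical bound.

    Assumes a positive change is desired and that d_min is a position on the
    same axis as the statistic, not an offset from m0.
    """
    if d_min <= m0:
        raise ValueError("expected d_min > m0 for a positive desired change")
    if ci_low > d_min:
        # Interval lies fully inside the practical significance region.
        return "statistically and practically significant: implement"
    if ci_high < d_min:
        # Interval never reaches the practical significance bound.
        if ci_low > m0:
            return "statistically but not practically significant: do not implement"
        return "not practically significant: do not implement"
    # Interval straddles the practical significance bound.
    return "straddles the practical bound: collect more data or weigh the risks"


# Hypothetical intervals for a difference in click-through rate.
m0, d_min = 0.0, 0.02
for ci in [(0.025, 0.040), (0.005, 0.015), (-0.005, 0.010), (0.010, 0.030)]:
    print(ci, "->", classify_interval(ci[0], ci[1], m0, d_min))
```

Keeping the decision in one function makes the rule explicit, but the straddling case still needs human judgment about cost and risk.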

# Lesson 3: Experiment Size

<center><img src="stat_cons_07.png" width="500"></center>

We can use the knowledge of our desired practical significance boundary to plan out our experiment. By knowing how many observations we need in order to detect our desired effect to our desired level of reliability, we can see how long we would need to run our experiment and whether or not it is feasible.

Let's use the example from the video, where we have a baseline click-through rate of 10% and want to see a manipulation increase this baseline to 12%. How many observations would we need in each group in order to detect this change with power $1-\beta = .80$ (i.e. detect the 2% absolute increase 80% of the time), at a Type I error rate of $\alpha = .05$?

The curves on these two plots represent the distribution of the difference in sample means given 1,000 observations in each of the control and experimental groups, with the top plot showing no effect from the treatment and the bottom plot showing the desired outcome from the treatment, a 2% absolute increase:
<center><img src="stat_cons_08.png" width="500"></center>

The vertical line at about 0.02 marks the critical value: under the null hypothesis, 95% of the distribution lies to its left, which gives a Type I error rate of 5%. Placing the same critical value on the distribution for the desired result shows a statistical power of 44% (the area to the right of the line) and a Type II error rate of 56% (the area to the left of the line):
<center><img src="stat_cons_09.png" width="500"></center>
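The numbers read off these plots can be reproduced with a short calculation. This is a sketch under the normal approximation with a one-sided test. The variable names are illustrative and are not taken from the course code.

```python
import numpy as np
from scipy import stats

p_null, p_alt = 0.10, 0.12   # baseline and desired click-through rates
n = 1000                     # observations per group
alpha = 0.05                 # one-sided Type I error rate

# Null: both groups share p_null, so the two sampling variances add.
se_null = np.sqrt(2 * p_null * (1 - p_null) / n)
p_crit = stats.norm(loc=0, scale=se_null).ppf(1 - alpha)

# Alternative: one group at p_null, the other at p_alt.
se_alt = np.sqrt((p_null * (1 - p_null) + p_alt * (1 - p_alt)) / n)
alt_dist = stats.norm(loc=p_alt - p_null, scale=se_alt)

beta = alt_dist.cdf(p_crit)   # Type II error: area left of the critical value
power = 1 - beta              # area right of the critical value

print(f"critical value: {p_crit:.4f}")
print(f"Type II error rate: {beta:.3f}")
print(f"power: {power:.3f}")
```

The critical value comes out slightly above 0.02, with roughly 44% of the alternative distribution to its right.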

Increasing the number of data points will narrow both curves, increasing the statistical power. If we want a statistical power of 0.8, that is, an 80% chance of rejecting the null when the true click-through rate under the treatment is 12%, then we need at least 2,863 observations in each of the control and test groups. If we get about 500 visitors a day, split evenly so that each group receives $\frac{500}{2} = 250$ per day, the experiment needs $x = \frac{2863}{250} \approx 11.45$ days, which rounds up to 12 days.
<center><img src="stat_cons_09.png" width="500"></center>

After computing the number of observations needed for an experiment to reliably detect a specified level of experimental effect (i.e. statistical power), we need to divide by the expected number of observations per day in order to get a minimum experiment length. We want to make sure that an experiment can be completed in a reasonable time frame so that if we do have a successful effect, it can be deployed as soon as possible and resources can be freed up to run new experiments. What a 'reasonable time frame' means will depend on how important a change will be, but if the length of time is beyond a month or two, that's probably a sign that it's too long.
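The division described above is simple enough to wrap in a helper. The function name `min_experiment_days` is hypothetical. The sketch assumes every daily visitor enters the experiment and that traffic is split evenly across groups.

```python
import math


def min_experiment_days(n_per_group, daily_visitors, n_groups=2):
    """Minimum whole days needed to collect n_per_group observations per group.

    Assumes all daily visitors are diverted into the experiment and are
    split evenly across the groups.
    """
    total_needed = n_per_group * n_groups
    return math.ceil(total_needed / daily_visitors)


days = min_experiment_days(2863, 500)
print(days)  # -> 12
```

Rounding up matters here: 11.45 days of traffic is not enough to reach the required count, so the experiment has to run through the twelfth day.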

There are a few ways that an experiment's duration can be reduced. We could, of course, change our statistical parameters. Accepting higher Type I or Type II error rates will reduce the number of observations needed. So too will increasing the effect size: it's much easier to detect larger changes.
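To make these trade-offs concrete, the sketch below uses a closed-form sample size for a one-sided test of two proportions under the normal approximation. The function name `sample_size_per_group` is hypothetical. The formula is a standard approximation and not necessarily the one the course uses.

```python
import math
from scipy import stats


def sample_size_per_group(p_null, p_alt, alpha=0.05, power=0.80):
    """Approximate observations per group for a one-sided two-proportion test."""
    z_alpha = stats.norm.ppf(1 - alpha)   # critical z under the null
    z_power = stats.norm.ppf(power)       # z needed under the alternative
    sd_null = math.sqrt(2 * p_null * (1 - p_null))
    sd_alt = math.sqrt(p_null * (1 - p_null) + p_alt * (1 - p_alt))
    n = ((z_alpha * sd_null + z_power * sd_alt) / (p_alt - p_null)) ** 2
    return math.ceil(n)


baseline = sample_size_per_group(0.10, 0.12)
looser_alpha = sample_size_per_group(0.10, 0.12, alpha=0.10)
lower_power = sample_size_per_group(0.10, 0.12, power=0.70)
bigger_effect = sample_size_per_group(0.10, 0.13)

print("baseline (alpha=.05, power=.80, 10% -> 12%):", baseline)
print("higher Type I error rate (alpha=.10):", looser_alpha)
print("higher Type II error rate (power=.70):", lower_power)
print("larger effect (10% -> 13%):", bigger_effect)
```

The baseline call should reproduce the 2,863 observations per group quoted earlier, and each of the three relaxations lowers that number.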

Another option is to change the unit of diversion. A 'wider' unit of diversion will result in more observations being generated. For example, you could consider moving from a cookie-based diversion in a web-based experiment to an event-based diversion like pageviews. The tradeoff is that event-based diversion could create inconsistent website experiences for users who visit the site multiple times.

# Experiment Size - Exercise

Let's use the example from the video, where we have a baseline click-through rate of 10% and want to see a manipulation increase this baseline to 12%. How many observations would we need in each group in order to detect this change with power $1-\beta = .80$ (i.e. detect the 2% absolute increase 80% of the time), at a Type I error rate of $\alpha = .05$?

## Method 1: Trial and Error

One way we could solve this is through trial and error. Every sample size will have a level of power associated with it; testing multiple sample sizes will gradually allow us to narrow down the minimum sample size required to obtain our desired power level. This isn't a particularly efficient method, but it can provide an intuition for how experiment sizing works.

Fill in the `power()` function below following these steps:

1. Under the null hypothesis, we should have a critical value for which the Type I error rate is at our desired alpha level.
  - `se_null`: Compute the standard deviation for the difference in proportions under the null hypothesis for our two groups. The base probability is given by `p_null`. Remember that the variance of the difference distribution is the sum of the variances for the individual distributions, and that _each_ group is assigned `n` observations.
  - `null_dist`: To assist in re-use, this should be a [scipy norm object](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html). Specify the center and standard deviation of the normal distribution using the "loc" and "scale" arguments, respectively.
  - `p_crit`: Compute the critical value of the distribution that would cause us to reject the null hypothesis. One of the methods of the `null_dist` object will help you obtain this value (passing in some function of our desired error rate `alpha`).
2. The power is the proportion of the distribution under the alternative hypothesis that is past that previously-obtained critical value.
  - `se_alt`: Now it's time to make computations in the other direction. This will be the standard deviation of differences under the desired detectable difference. Note that the individual distributions will have different variances now: one with `p_null` probability of success, and the other with `p_alt` probability of success.
  - `alt_dist`: This will be a scipy norm object like above. Be careful of the "loc" argument in this one. The way the `power` function is set up, it expects `p_alt` to be greater than `p_null`, for a positive difference.
  - `beta`: Beta is the probability of a Type-II error, or the probability of failing to reject the null for a particular non-null state. That means you should make use of `alt_dist` and `p_crit` here!

The second half of the function has already been completed for you, which creates a visualization of the distribution of differences for the null case and for the desired detectable difference. Use the cells that follow to run the function and observe the visualizations, and to test your code against a few assertion statements. Check the following page if you need help coming up with the solution.