# Practice: Statistical Significance

Let's say that we've collected data for a web-based experiment. In the experiment, we're testing the change in layout of a product information page to see if this affects the proportion of people who click on a button to go to the download page. This experiment has been designed to have a cookie-based diversion, and we record two things from each user: which page version they received, and whether or not they accessed the download page during the data recording period. (We aren't keeping track of any other factors in this example, such as number of pageviews, or time between accessing the page and making the download, that might be of further interest.)

Your objective in this notebook is to perform a statistical test on both recorded metrics to see if there is a statistically significant difference between the two groups.

In [3]:
# import packages

import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats import proportion as proptests

import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
# import data

data = pd.read_csv('../data/statistical_significance_data.csv')
data.head(10)

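|   | condition | click |
|---|-----------|-------|
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 0 |
| 5 | 1 | 0 |
| 6 | 0 | 0 |
| 7 | 1 | 1 |
| 8 | 0 | 0 |
| 9 | 1 | 0 |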


In the dataset, the 'condition' column takes a 0 for the control group and a 1 for the experimental group. The 'click' column takes a value of 0 for no click and 1 for a click.

## Checking the Invariant Metric

First of all, we should check that the number of visitors assigned to each group is similar. It's important to check the invariant metrics as a prerequisite so that our inferences on the evaluation metrics are founded on solid ground. If we find that the two groups are imbalanced on the invariant metric, then this will require us to look carefully at how the visitors were split so that any sources of bias are accounted for. It's possible that a statistically significant difference in an invariant metric will require us to revise random assignment procedures and re-do data collection.

In this case, we want to do a two-sided hypothesis test on the proportion of visitors assigned to one of our conditions. Choosing the control or the experimental condition doesn't matter: you'll get the same result either way. Feel free to use whatever method you'd like: we'll highlight two main avenues below.

If you want to take a simulation-based approach, you can simulate the number of visitors that would be assigned to each group for the total number of observations, assuming an expected 50/50 split. Do this many times (200,000 repetitions should provide a good speed-variability balance in this case) and then count how many simulated cases show a deviation from 50/50 as extreme as or more extreme than the one we actually observed. Don't forget that, since we have a two-sided test, an extreme case also includes values on the opposite side of 50/50. (For example, if simulated outcomes of 0.48 and lower count as at least as extreme as an actual observation of 0.48, then so do simulated outcomes of 0.52 and higher.) The proportion of flagged simulation outcomes gives us a p-value with which to assess our observed proportion. We hope to see a large p-value here, meaning insufficient evidence to reject the null hypothesis.

If you want to take an analytic approach, you could use the exact binomial distribution to compute a p-value for the test. The more usual approach, however, is to use the normal distribution approximation. Recall that this is possible thanks to our large sample size and the central limit theorem. To get a more precise p-value, you should also perform a continuity correction, adding 0.5 to or subtracting 0.5 from the observed group count before computing the area underneath the curve. (For example, if we had 415 / 850 assigned to the control group, then the normal approximation would take the area to the left of $(415 + 0.5) / 850 = 0.489$ and to the right of $(435 - 0.5) / 850 = 0.511$.)
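The 415 / 850 example can be worked through numerically. The snippet below is a minimal sketch that assumes a 50/50 split under the null hypothesis; the variable names are illustrative and are not part of the notebook's workspace.

```python
import numpy as np
from scipy import stats

n_total = 850      # total visitors in the worked example above
n_control = 415    # visitors assigned to the control group
p_null = 0.5       # expected share under a 50/50 split

# Standard deviation of the control-group count under the null hypothesis
sd = np.sqrt(n_total * p_null * (1 - p_null))

# Continuity correction: move the observed count 0.5 toward the expected count
z = ((n_control + 0.5) - n_total * p_null) / sd

# Two-sided p-value: the null is symmetric, so double the lower-tail area
p_value = 2 * stats.norm.cdf(z)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

Doubling the lower tail is only valid here because the null proportion is 0.5, which makes the two tails mirror images of each other.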

You can check your results by completing the following workspace and comparing against the solution on the following page. You could also try multiple approaches and see whether they produce similar outcomes!

In [None]:
# your work here: feel free to create additional code cells as needed!

In [45]:
# Check the number of visitors in each group:
groups = {str(i): data['condition'].value_counts()[i] for i in data['condition'].value_counts().index}
for group in groups.keys():
    print(f"Group {group} has {groups[group]} members")
print("Difference of {} members".format(abs(groups['0'] - groups['1'])))
print("Proportionally, {:.2%} more members in the larger group".format((abs(groups['0'] - groups['1']) / max(groups['0'], groups['1']))))
num_obs = data.shape[0]
n_control = groups['0']

Group 1 has 508 members
Group 0 has 491 members
Difference of 17 members
Proportionally, 3.35% more members in the larger group


In [56]:
## Simulation Approach:
#######################
## If you want to take a simulation-based approach, you can simulate the number of visitors that would be assigned 
## to each group for the number of total observations, assuming that we have an expected 50/50 split.
p = 0.5 # 50/50 split
## Do this many times (the instructions suggest 200,000 repetitions; 500,000 are used here)
n_trial = 500000 # Set number of repetitions
## simulate the number of visitors that would be assigned to one group for the number of total observations,
## assuming that we have an expected 50/50 split (p)
samples = np.random.binomial(n=num_obs, p=p, size=n_trial) # Sample group sizes from a binomial distribution with p = 0.5
# Flag simulated group sizes at least as extreme as the observed ones, on either side of 50/50
comp_vect = np.logical_or(samples <= min(groups['0'], groups['1']), samples >= max(groups['0'], groups['1']))
p_val = comp_vect.mean() # proportion of simulations at least as extreme as the observed split
print(f"Simulation Approach P-Value: {p_val}")

Simulation Approach P-Value: 0.612616
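How much a simulated p-value wobbles from run to run depends on the number of repetitions. As a rough sketch, the simulated p-value can be treated as a binomial proportion, so its standard error is approximately $\sqrt{\hat{p}(1 - \hat{p}) / n}$. The helper name `mc_standard_error` below is hypothetical and introduced only for illustration.

```python
import numpy as np

def mc_standard_error(p_hat, n_sims):
    """Approximate standard error of a simulated p-value (a binomial proportion)."""
    return np.sqrt(p_hat * (1 - p_hat) / n_sims)

p_hat = 0.612616  # simulated p-value reported above
for n_sims in (200_000, 500_000):
    se = mc_standard_error(p_hat, n_sims)
    print(f"{n_sims:>7} simulations: standard error ~ {se:.5f}")
```

Either repetition count puts the simulation noise around the third decimal place, which is far too small to change the conclusion for a p-value this large.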


In [67]:
## Analytical Approach:
#######################
p = 0.5 # 50/50 split under the null hypothesis
## Use the normal approximation to the binomial distribution, with a continuity correction.
sd = np.sqrt(p * (1-p) * num_obs) # standard deviation of the group count under the null

# z-score of the smaller group's count, shifted by 0.5 for the continuity correction
z = ((min(groups['0'], groups['1']) + 0.5) - p * num_obs) / sd
# Two-sided p-value: with p = 0.5 the upper tail mirrors the lower tail, so double the lower-tail area
print(2 * stats.norm.cdf(z))

0.6127039025537114
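The exact binomial test mentioned earlier can serve as a cross-check on the normal approximation. The sketch below hard-codes the group counts printed above (491 and 508, for 999 visitors in total) rather than reading them from `data`, and relies on the symmetry of the binomial distribution at p = 0.5.

```python
from scipy import stats

n_total = 999      # total visitors (491 + 508, from the group counts above)
n_smaller = 491    # size of the smaller group
p_null = 0.5       # expected share under a 50/50 split

# P(X <= 491) under Binomial(999, 0.5); by symmetry, P(X >= 508) is the same
lower_tail = stats.binom.cdf(n_smaller, n_total, p_null)
p_exact = 2 * lower_tail
print(f"Exact binomial p-value: {p_exact:.4f}")
```

The result should land very close to both the simulated and the normal-approximation p-values, since the continuity correction is designed to mimic exactly this discrete sum.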


## Checking the Evaluation Metric

After performing our checks on the invariant metric, we can move on to performing a hypothesis test on the evaluation metric: the click-through rate. In this case, we want to see that the experimental group has a significantly larger click-through rate than the control group, a one-tailed test.

The simulation approach for this metric isn't too different from the approach for the invariant metric. You'll need the overall click-through rate as the common proportion to draw simulated values from for each group. You may also want to perform more simulations since there's higher variance for this test.

There are a few analytic approaches possible here, but you'll probably make use of the normal approximation again in these cases. In addition to the pooled click-through rate, you'll need a pooled standard deviation in order to compute a z-score. While a continuity correction is possible in this case as well, the corrected p-value is much more conservative than the p-value that a simulation will usually imply. Computing the z-score and resulting p-value without a continuity correction should be closer to the simulation's outcomes, though slightly more optimistic about there being a statistical difference between groups.

As with the previous question, you'll find a quiz and solution following the workspace for you to check your results.

In [112]:
# your work here: feel free to create additional code cells as needed!
p_click = data.groupby('condition').mean()['click']
print(p_click)
diff = abs(p_click[0] - p_click[1])
print()
print(f"Difference between average click rate: {diff:.2} points")

condition
0    0.079430
1    0.112205
Name: click, dtype: float64

Difference between average click rate: 0.033 points


In [115]:
## Simulation Approach:
#######################
# get number of trials and overall 'success' rate under null
groups = {str(i): data['condition'].value_counts()[i] for i in data['condition'].value_counts().index}
# n_control = group['0']
# n_exper = group['1']
p_null = data['click'].mean() # pooled click-rate

# simulate outcomes under null, compare to observed outcome
n_trials = 200000
ctrl_clicks = np.random.binomial(groups['0'], p_null, n_trials) # Simulated control-group click counts at the pooled rate
avg_ctl_click = ctrl_clicks / groups['0'] # Simulated control-group click-through rates

exp_clicks = np.random.binomial(groups['1'], p_null, n_trials) # Simulated experimental-group click counts at the pooled rate
avg_exp_clicks = exp_clicks / groups['1'] # Simulated experimental-group click-through rates

samples = avg_exp_clicks - avg_ctl_click # Simulated differences in click-through rate (experimental minus control)

p_val = (samples >= (p_click[1] - p_click[0])).mean() # Proportion of simulated differences at least as large as the observed one
print(f"Simulation Approach P-Value: {p_val}")

Simulation Approach P-Value: 0.039785


In [118]:
## Analytical Approach:
#######################
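se_p = np.sqrt(p_null * (1-p_null) * (1/groups['0'] + 1/groups['1'])) # standard error of the difference, using the pooled click rate
z = (p_click[1] - p_click[0]) / se_p # z-score of the observed difference
p_val = 1-stats.norm.cdf(z) # one-tailed p-value: area to the right of z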
print(f"Analytical Approach P-Value: {p_val}")

Analytical Approach P-Value: 0.039442821974613684
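To see how much more conservative a continuity correction is for this metric, the sketch below compares the uncorrected z-test with one common, Yates-style correction that shrinks the observed difference by $\frac{1}{2}\left(\frac{1}{n_0} + \frac{1}{n_1}\right)$. This is one possible form of the correction, not necessarily the one the exercise has in mind. The click counts (39 and 57) are an assumption: they are back-calculated from the printed click rates and group sizes rather than read from `data`.

```python
import numpy as np
from scipy import stats

n_ctrl, n_exp = 491, 508          # group sizes printed earlier
clicks_ctrl, clicks_exp = 39, 57  # assumption: back-calculated from the printed rates

p_ctrl = clicks_ctrl / n_ctrl
p_exp = clicks_exp / n_exp
p_pool = (clicks_ctrl + clicks_exp) / (n_ctrl + n_exp)  # pooled click-through rate

# Standard error of the difference in rates under the null hypothesis
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_exp))

diff = p_exp - p_ctrl
correction = 0.5 * (1 / n_ctrl + 1 / n_exp)  # Yates-style continuity correction

z_plain = diff / se
z_corrected = (diff - correction) / se

# One-tailed p-values: area to the right of each z-score
p_plain = stats.norm.sf(z_plain)
p_corrected = stats.norm.sf(z_corrected)
print(f"Without correction: {p_plain:.4f}")
print(f"With correction:    {p_corrected:.4f}")
```

The corrected p-value comes out noticeably larger than both the uncorrected value and the simulated one, which illustrates why the uncorrected z-test is usually the closer match to the simulation here.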
