In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('data/homepage-experiment-data.csv')

In [3]:
df.head()

Unnamed: 0,Day,Control Cookies,Control Downloads,Control Licenses,Experiment Cookies,Experiment Downloads,Experiment Licenses
0,1,1764,246,1,1850,339,3
1,2,1541,234,2,1590,281,2
2,3,1457,240,1,1515,274,1
3,4,1587,224,1,1541,284,2
4,5,1606,253,2,1643,292,3


### Checking the Invariant Metric
What is the p-value for the test on the number of cookies assigned to each group?

In [8]:
control_cookies = df['Control Cookies'].values
experiment_cookies = df['Experiment Cookies'].values

control_size = np.sum(control_cookies)
experiment_size = np.sum(experiment_cookies)

number_observations = control_size + experiment_size

In [12]:
control_size, experiment_size

(46851, 47346)

In [10]:
# simulate outcomes under null, compare to observed outcome
p = 0.5
simulation_size = 200_000

theorical_control_obs = np.random.binomial(number_observations, p, simulation_size)

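# two-tailed hypothesis test: count simulated control sizes at least as far
# from an even split as the observed one, in either direction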
p_value = np.logical_or(
    theorical_control_obs <= control_size, 
    theorical_control_obs >= (number_observations - control_size)
).mean()

In [11]:
p_value

0.10707

Great job! Even though there are a few hundred more cookies in the experimental group than in the control group, the difference between groups isn't statistically significant. We should feel fine about moving on to test the evaluation metrics.
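As a cross-check on the simulation, the same two-sided test can be approximated analytically with a normal approximation to the binomial. The sketch below is not part of the original exercise: the helper name `invariant_z_test` and the 0.5 continuity correction are assumptions made here, and the only inputs are the two group sizes printed above.

```python
import math

def invariant_z_test(control_size, experiment_size, p=0.5):
    """Two-sided normal-approximation test that a cookie lands in control with probability p.

    Hypothetical helper for illustration; applies a 0.5 continuity correction.
    """
    n = control_size + experiment_size
    expected = n * p
    sd = math.sqrt(n * p * (1 - p))
    # shrink the gap between observed and expected counts by 0.5 (continuity correction)
    gap = max(abs(control_size - expected) - 0.5, 0.0)
    z = gap / sd
    # two-sided tail area of the standard normal: 2 * (1 - Phi(z)) = erfc(z / sqrt(2))
    p_value = math.erfc(z / math.sqrt(2))
    return z, p_value

z, p_value = invariant_z_test(46851, 47346)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The analytic p-value should land close to the simulated one, which is a quick sanity check that the simulation is set up as intended.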

### Checking the Evaluation Metric I
What is the p-value for the test on the download rate between groups?

In [63]:
# get number of trials and overall 'success' rate under null
control_downloads = df['Control Downloads'].values
experiment_downloads = df['Experiment Downloads'].values

# calculating proportions per group
control_download_prop = np.sum(control_downloads) / control_size
experiment_download_prop = np.sum(experiment_downloads) / experiment_size

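# combining the proportions under the null (null = no difference between groups)
# note: this is the unweighted mean of the two group rates; it approximates the pooled
# rate (total downloads / total cookies) because the groups are nearly the same size
group_download_prop_observed = (control_download_prop + experiment_download_prop)/2

# success rate under the null = combined proportion
p_null = group_download_prop_observed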

In [64]:
p_null

0.17088889349981923

p_null is calculated assuming the null hypothesis holds. The null says there is no difference between the groups, so we can combine the control and experiment proportions and use the result as the single **success rate** under the null.
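Averaging the two group rates gives the same answer as pooling the raw counts only when both groups are the same size. The sketch below contrasts the two calculations; the function names and the counts are made up for illustration and are not taken from the experiment data.

```python
def mean_of_rates(successes_a, n_a, successes_b, n_b):
    """Unweighted mean of the two group rates (what the cell above computes)."""
    return (successes_a / n_a + successes_b / n_b) / 2

def pooled_rate(successes_a, n_a, successes_b, n_b):
    """Pooled rate: all successes divided by all trials."""
    return (successes_a + successes_b) / (n_a + n_b)

# hypothetical counts: equal group sizes, so both calculations agree
print(mean_of_rates(10, 100, 30, 100), pooled_rate(10, 100, 30, 100))

# hypothetical counts: unequal group sizes, so the two calculations diverge
print(mean_of_rates(10, 100, 60, 300), pooled_rate(10, 100, 60, 300))
```

In this experiment the groups differ by only a few hundred cookies out of roughly 47,000 each, so the two values should be nearly identical. With unbalanced groups, the pooled rate is the safer choice.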

In [61]:
# simulate outcomes under null, compare to observed outcome
n_trials = 200_000

# create theoretical distributions of download counts for each group ...
theorical_control_downloads = np.random.binomial(control_size, p_null, n_trials)
theorical_exp_downloads = np.random.binomial(experiment_size, p_null, n_trials)

# ... and perform the same operation as in the previous cell
# here we don't `np.sum` since each element of the array (numerator) is already a success count
theorical_control_download_prop = theorical_control_downloads / control_size
theorical_experiment_download_prop = theorical_exp_downloads / experiment_size

# calculate our test statistic (difference in proportions) for both the theoretical and observed data
# array of simulated differences
theorical_diff_prop = theorical_experiment_download_prop - theorical_control_download_prop
# single observed difference
observed_diff_prop = experiment_download_prop - control_download_prop

# one-sided test: share of simulated differences at least as extreme as the observed one
p_value = (theorical_diff_prop >= observed_diff_prop).mean()

In [62]:
p_value

0.0

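The difference in download rate is highly statistically significant, well beyond all conventional significance levels. A simulated p-value of 0.0 means that none of the 200,000 simulated differences was as large as the observed one, so the true p-value is smaller than this simulation can resolve. If you used the whole data, you should have gotten a z-score of about 8.55. If you were clever (see the next question) you should have gotten a z-score of about 7.13.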
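One common way to obtain a z-score like the ones quoted above is a pooled two-proportion z-test. The sketch below assumes that approach; the function name `two_proportion_z` is hypothetical and the counts in the example are made up, not taken from the experiment.

```python
import math

def two_proportion_z(successes_control, n_control, successes_experiment, n_experiment):
    """Pooled two-proportion z-statistic (experiment minus control).

    Hypothetical helper for illustration only.
    """
    p_control = successes_control / n_control
    p_experiment = successes_experiment / n_experiment
    # pooled success rate under the null of no difference
    p_pooled = (successes_control + successes_experiment) / (n_control + n_experiment)
    # standard error of the difference in proportions under the null
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_experiment))
    return (p_experiment - p_control) / se

def upper_tail_p(z):
    """One-sided p-value: P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# made-up counts: 40/100 successes in control, 60/100 in experiment
z = two_proportion_z(40, 100, 60, 100)
print(f"z = {z:.3f}, one-sided p = {upper_tail_p(z):.4f}")
```

Unlike the simulation, the analytic tail probability does not bottom out at zero, so it can still report a magnitude when the effect is very large.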

### Checking the Evaluation Metric II
What is the p-value for the test on the license purchasing rate between groups?

In [67]:
# correcting the size of the groups for the license feature 
# (check out the course notes about why we do this)
df_corrected = df[df.Day <= 21]
control_size_corrected = np.sum(df_corrected['Control Cookies'].values)
experiment_size_corrected = np.sum(df_corrected['Experiment Cookies'].values)

# get number of trials and overall 'success' rate under null
control_licenses = df_corrected['Control Licenses'].values
experiment_licenses = df_corrected['Experiment Licenses'].values

# calculating proportions per group
control_licenses_prop = np.sum(control_licenses) / control_size_corrected
experiment_licenses_prop = np.sum(experiment_licenses) / experiment_size_corrected

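# combining the proportions under the null (null = no difference between groups)
# note: this is the unweighted mean of the two group rates; it approximates the pooled
# rate (total licenses / total cookies) because the groups are nearly the same size
group_licenses_prop_observed = (control_licenses_prop + experiment_licenses_prop)/2

# success rate under the null = combined proportion
p_null = group_licenses_prop_observed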

In [68]:
p_null

0.013201281858188361

In [77]:
# simulate outcomes under null, compare to observed outcome
n_trials = 500_000

# create theoretical distributions of license counts for each group
theorical_control_licenses = np.random.binomial(control_size_corrected, p_null, n_trials)
theorical_exp_licenses = np.random.binomial(experiment_size_corrected, p_null, n_trials)

# calculate theoretical proportions per group
theorical_control_licences_prop = theorical_control_licenses / control_size_corrected
theorical_experiment_licences_prop = theorical_exp_licenses / experiment_size_corrected

# calculate test statistic (difference in proportions) for both the theoretical and observed data
theorical_diff_prop = theorical_experiment_licences_prop - theorical_control_licences_prop
observed_diff_prop = experiment_licenses_prop - control_licenses_prop

# one-sided test: share of simulated differences at least as extreme as the observed one
p_value = (theorical_diff_prop >= observed_diff_prop).mean()

In [78]:
p_value

0.429264
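The download and license cells repeat the same simulation with different inputs, so the logic could be wrapped in one function. The sketch below is one possible refactoring, not the notebook's own code: the name `simulate_one_sided_p`, the `seed` parameter, and the example inputs are all assumptions made here.

```python
import numpy as np

def simulate_one_sided_p(n_control, n_experiment, p_null, observed_diff,
                         n_trials=200_000, seed=None):
    """Simulated one-sided p-value for a difference in proportions (experiment - control).

    Hypothetical helper: draws success counts for both groups under the null rate
    `p_null` and returns the share of simulated differences >= `observed_diff`.
    """
    rng = np.random.default_rng(seed)  # seeded generator makes the result reproducible
    sim_control = rng.binomial(n_control, p_null, n_trials) / n_control
    sim_experiment = rng.binomial(n_experiment, p_null, n_trials) / n_experiment
    sim_diff = sim_experiment - sim_control
    return float((sim_diff >= observed_diff).mean())

# made-up inputs for illustration only
p_value = simulate_one_sided_p(1000, 1000, p_null=0.2, observed_diff=0.03,
                               n_trials=50_000, seed=0)
print(p_value)
```

Passing a fixed `seed` keeps the reported p-value stable across reruns, which the unseeded `np.random.binomial` calls above do not guarantee.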