In [2]:
import pandas as pd
import numpy as np
import scipy.stats as stats
data = pd.read_csv('data/homepage-experiment-data.csv')
data.head()

Unnamed: 0,Day,Control Cookies,Control Downloads,Control Licenses,Experiment Cookies,Experiment Downloads,Experiment Licenses
0,1,1764,246,1,1850,339,3
1,2,1541,234,2,1590,281,2
2,3,1457,240,1,1515,274,1
3,4,1587,224,1,1541,284,2
4,5,1606,253,2,1643,292,3


## 1. P-value for invariant metrics 

In [6]:
# get number of control and experiment cookies
n_control = data["Control Cookies"].sum()
n_exp = data["Experiment Cookies"].sum()
n_total = n_control+n_exp

In [7]:
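# Compute the z-score under the null hypothesis of an even split (p = 0.5),
# using the normal approximation to the binomial
p = 0.5
sd = np.sqrt(p * (1-p) * n_total)

# The +0.5 is a continuity correction for the lower tail
z = ((n_control + 0.5) - p * n_total) / sd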

print(z)


-1.6095646049678511


In [8]:
# Compute the two-sided p-value (doubling the lower tail works here because z is negative)
print(2*stats.norm.cdf(z))

0.10749294050130412
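
The invariant check above can be wrapped in a small helper and compared against the exact binomial tail. This is a minimal sketch, not the notebook's own code: the function name `invariant_p_value` and the counts are hypothetical, chosen only for illustration. The helper applies the continuity correction toward the mean on whichever side the count falls, so it does not depend on the sign of z.

```python
import numpy as np
from scipy import stats

def invariant_p_value(n_control, n_exp, p=0.5):
    """Two-sided p-value for the group split (normal approximation,
    with a continuity correction). Hypothetical helper for illustration."""
    n_total = n_control + n_exp
    mean = p * n_total
    sd = np.sqrt(p * (1 - p) * n_total)
    # Move the observed count 0.5 toward the mean before standardizing
    diff = max(abs(n_control - mean) - 0.5, 0.0)
    z = diff / sd
    # sf(z) is the upper-tail probability 1 - cdf(z)
    return 2 * stats.norm.sf(z)

# Hypothetical counts, not taken from the experiment data
n_control, n_exp = 4900, 5100
approx = invariant_p_value(n_control, n_exp)

# Exact two-sided binomial p-value; doubling one tail is valid because
# the null distribution is symmetric when p = 0.5
exact = min(1.0, 2 * stats.binom.cdf(min(n_control, n_exp), n_control + n_exp, 0.5))

print(approx, exact)
```

With counts this large the two values should agree closely, which is the usual justification for using the normal approximation here.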


## 2. P-value for evaluation metrics: download rate
The invariant metric passed inspection, so we can move on to the evaluation metrics.
The download rate is the total number of downloads divided by the number of cookies.

In [14]:

# Pooled download rate under the null hypothesis of no difference between groups
p_null = float((data['Control Downloads'].sum()+data["Experiment Downloads"].sum())/n_total)
se_p = np.sqrt(p_null * (1-p_null) * (1/n_control + 1/n_exp))
# Observed download rate in each group
p_exp = data["Experiment Downloads"].sum()/n_exp
p_control = data['Control Downloads'].sum()/n_control
z = (p_exp-p_control) / se_p
print(z)

7.870833726066236


In [15]:
# One-sided p-value: probability of a z-score at least this large under the null
print(stats.norm.cdf(-z))

1.7614279636728079e-15
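
The same pooled two-proportion z-test is used for both evaluation metrics, so it can be factored into one function. The sketch below assumes a one-sided alternative (experiment rate greater than control rate), matching the calculation above. The name `pooled_z_test` and the example counts are hypothetical.

```python
import numpy as np
from scipy import stats

def pooled_z_test(x_control, n_control, x_exp, n_exp):
    """One-sided pooled z-test for H1: experiment rate > control rate.
    Hypothetical helper; returns (z, p_value)."""
    # Pooled success rate under the null hypothesis of equal rates
    p_pool = (x_control + x_exp) / (n_control + n_exp)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_exp))
    z = (x_exp / n_exp - x_control / n_control) / se
    # Upper-tail probability, equivalent to norm.cdf(-z)
    return z, stats.norm.sf(z)

# Hypothetical counts, not taken from the experiment data
z_demo, p_demo = pooled_z_test(x_control=200, n_control=1000, x_exp=250, n_exp=1000)
print(z_demo, p_demo)
```

For the download rate, the arguments would be the summed `Control Downloads`, `Control Cookies`, `Experiment Downloads`, and `Experiment Cookies` columns.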


## 3. P-value for evaluation metrics: purchase rate
One complication is the delay of seven or eight days between when most people download the software and when they purchase a license. Because results are aggregated by day, there is no direct way to attribute a given cookie to a later license purchase, so the best we can do is make a justified argument for how to handle the data. For the license purchase rate, we use only the cookies observed through day 21 as the denominator and treat them as responsible for all of the license purchases observed.

In [16]:
# Restrict the denominators to cookies from the first 21 rows (days 1 through 21)
n_control = data["Control Cookies"].iloc[:21].sum()
n_exp = data["Experiment Cookies"].iloc[:21].sum()
n_total = n_control+n_exp
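
The positional slice `.iloc[:21]` selects days 1 through 21 only if the rows are sorted by day with one row per day. A selection keyed on the `Day` column makes that assumption explicit. The sketch below uses a hypothetical toy frame with made-up counts and an assumed length of 29 days. Only the column names are taken from the data above.

```python
import pandas as pd

# Hypothetical toy frame; the counts and the 29-day length are assumptions
toy = pd.DataFrame({
    "Day": range(1, 30),
    "Control Cookies": [100] * 29,
    "Experiment Cookies": [110] * 29,
})

cutoff_day = 21  # last day whose cookies count toward the denominator

# Select rows by the value of Day rather than by position
early = toy[toy["Day"] <= cutoff_day]
n_control_early = early["Control Cookies"].sum()
n_exp_early = early["Experiment Cookies"].sum()

# On sorted, gap-free data the two selections agree
same_as_iloc = n_control_early == toy["Control Cookies"].iloc[:cutoff_day].sum()
print(n_control_early, n_exp_early, same_as_iloc)
```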

In [17]:
# Pooled purchase rate under the null hypothesis of no difference between groups
p_null = float((data['Control Licenses'].sum()+data["Experiment Licenses"].sum())/n_total)
se_p = np.sqrt(p_null * (1-p_null) * (1/n_control + 1/n_exp))
# Observed purchase rate in each group
p_exp = data["Experiment Licenses"].sum()/n_exp
p_control = data['Control Licenses'].sum()/n_control
z = (p_exp-p_control) / se_p
print(z)

0.2586750111658684


In [18]:
# One-sided p-value: probability of a z-score at least this large under the null
print(stats.norm.cdf(-z))

0.3979430008399871
