## 0. Import Data & Data Integrity Check

In [1]:
!pwd
!ls

/Users/VanessaG/Desktop/udacity_experiment
AB Testing Case.ipynb  Week 4 AB Testing Data


In [2]:
import numpy as np
import pandas as pd

df_basevals = pd.read_csv("Week 4 AB Testing Data/baseline.csv", index_col=False, header=None, names = ['metric','baseline_val'])
df_basevals.metric = df_basevals.metric.map(lambda x: x.lower())
df_basevals

| | metric | baseline_val |
|---|---|---|
| 0 | unique cookies to view page per day: | 40000.0 |
| 1 | unique cookies to click "start free trial" per... | 3200.0 |
| 2 | enrollments per day: | 660.0 |
| 3 | click-through-probability on "start free trial": | 0.08 |
| 4 | probability of enrolling, given click: | 0.20625 |
| 5 | probability of payment, given enroll: | 0.53 |
| 6 | probability of payment, given click | 0.109313 |


In [3]:
df_control = pd.read_csv("Week 4 AB Testing Data/control.csv")
df_experiment = pd.read_csv("Week 4 AB Testing Data/experiment.csv")

In [4]:
df_control.shape, df_experiment.shape

((37, 5), (37, 5))

In [5]:
np.all(df_control['Date'].values == df_experiment['Date'].values)

True

## 1. Variability

#### Assuming a binomial distribution with success probability p and sample size N, the analytical standard deviation of the estimated proportion is std = sqrt(p * (1 - p) / N)
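As a quick check of the formula above, the sketch below wraps it in a small helper and evaluates it for the three metrics. The helper name `analytic_sd` is an illustrative assumption (it is not part of the notebook), and the sample sizes assume 10,000 pageviews scaled down by the baseline rates from the baseline table.

```python
import math

def analytic_sd(p, n):
    """Standard deviation of a sample proportion under a binomial model."""
    return math.sqrt(p * (1 - p) / n)

pageviews = 10000
p_ctp = 0.08             # click-through probability
p_enroll = 0.20625       # P(enroll | click)
p_pay_enroll = 0.53      # P(pay | enroll)
p_pay_click = 0.109313   # P(pay | click)

# The denominator of each metric determines the effective sample size.
clicks = pageviews * p_ctp
enrollments = clicks * p_enroll

sd_gross = analytic_sd(p_enroll, clicks)
sd_net = analytic_sd(p_pay_click, clicks)
sd_retention = analytic_sd(p_pay_enroll, enrollments)
print(round(sd_gross, 4), round(sd_net, 4), round(sd_retention, 4))
# prints 0.0143 0.011 0.0389
```

The only thing that changes between metrics is which count plays the role of N: clicks for the two conversion metrics, enrollments for retention.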

In [6]:
# Random sample size
sample_size = 10000
p_ctp = 0.080000
p_enroll_given_clicked = 0.206250
p_pay_given_enrolled = 0.530000
p_pay_given_clicked = 0.109313

####  Regarding GrossConversion 

Unit of analysis : User ids enrolled
    
Unit of diversion : Cookies that click "start free trial".
    
Unit of analysis and diversion correlated but not same

So, better to collect data for empirical values if possible.

In [7]:
print('Sample Size: {} '.format(sample_size))
print('Baseline Click through probability on start free trial : {} '.format(p_ctp))      
print('Baseline probability for enrolling when start free trial clicked : {} '.format(p_enroll_given_clicked))
analytical_sd_gc = round(np.sqrt((p_enroll_given_clicked*(1-p_enroll_given_clicked))/(sample_size*p_ctp)),4)
print('Analytical SD for GrossConversion : {} '.format(analytical_sd_gc))

Sample Size: 10000 
Baseline Click through probability on start free trial : 0.08 
Baseline probability for enrolling when start free trial clicked : 0.20625 
Analytical SD for GrossConversion : 0.0143 


####  Regarding NetConversion 

Unit of analysis : User ids paid

Unit of diversion : Cookies that click "start free trial".

Unit of analysis and diversion correlated but not same

So, better to collect data for empirical values if possible.

In [8]:
print('Sample Size: {} '.format(sample_size))
print('Baseline Click through probability on start free trial : {} '.format(p_ctp))      
print('Baseline probability for payment when start free trial clicked : {} '.format(p_pay_given_clicked))
analytical_sd_nc = round(np.sqrt((p_pay_given_clicked*(1-p_pay_given_clicked))/(sample_size*p_ctp)),4)
print('Analytical SD for NetConversion : {} '.format(analytical_sd_nc))

Sample Size: 10000 
Baseline Click through probability on start free trial : 0.08 
Baseline probability for payment when start free trial clicked : 0.109313 
Analytical SD for NetConversion : 0.011 


####  Regarding Retention 

Unit of analysis : User ids paid
    
Unit of diversion : User ids enrolled

Unit of analysis and diversion same

So, empirical and analytical values should match

In [9]:
print('Sample Size: {} '.format(sample_size))
print('Baseline Click through probability on start free trial : {} '.format(p_ctp))    
print('Baseline probability for enrolling when start free trial clicked : {} '.format(p_enroll_given_clicked))
print('Baseline probability for payment when enrolled : {} '.format(p_pay_given_enrolled))
analytical_sd_retention = round(np.sqrt((p_pay_given_enrolled*(1-p_pay_given_enrolled))/(sample_size*p_enroll_given_clicked*p_ctp)),4)
print('Analytical SD for Retention : {} '.format(analytical_sd_retention))

Sample Size: 10000 
Baseline Click through probability on start free trial : 0.08 
Baseline probability for enrolling when start free trial clicked : 0.20625 
Baseline probability for payment when enrolled : 0.53 
Analytical SD for Retention : 0.0389 


## 2. Sizing

We used this [sample size calculator](http://www.evanmiller.org/ab-testing/sample-size.html) to compute the sample size per group, with a statistical power of 80% (1 - β, with β = 0.2) and a significance level of α = 0.05.
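For readers who want to see roughly where such numbers come from, here is a minimal sketch of a standard normal-approximation formula for comparing two proportions. Whether the linked calculator uses exactly this formula is an assumption, as is the step that mirrors baselines above 0.5; the function name `sample_size_per_group` is hypothetical.

```python
from statistics import NormalDist

def sample_size_per_group(p_base, d_min, alpha=0.05, power=0.8):
    """Approximate per-group n needed to detect an absolute change d_min."""
    # Assumption: baselines above 0.5 are mirrored (p -> 1 - p) before use.
    p = min(p_base, 1 - p_base)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    sd_null = (2 * p * (1 - p)) ** 0.5
    sd_alt = (p * (1 - p) + (p + d_min) * (1 - p - d_min)) ** 0.5
    return ((z_alpha * sd_null + z_beta * sd_alt) / d_min) ** 2

for name, p, d in [("gross conversion", 0.20625, 0.01),
                   ("net conversion", 0.109313, 0.0075),
                   ("retention", 0.53, 0.01)]:
    print(name, round(sample_size_per_group(p, d)))
```

Under these assumptions the sketch should land on, or within a unit of, the per-group sample sizes entered by hand in the cells of section 2.1.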

### 2.1 Choosing Number of Samples given Power

####  Regarding GrossConversion 

In [12]:
# 20.625% baseline conversion for gross conversion & minimum detectable effect of 1%  
gc_sample_size = 25835

# for both control & experiment
gc_total_sample_size = 2 * gc_sample_size

# divide by click through probability to get pageviews
gc_pageviews = gc_total_sample_size / p_ctp

print('{} pageviews required to power Gross Conversion'.format(gc_pageviews))

645875.0 pageviews required to power Gross Conversion


#### Regarding NetConversion

In [13]:
# 10.9313% baseline conversion for net conversion & minimum detectable effect of 0.75%
nc_sample_size = 27413

# for both control & experiment
nc_total_sample_size = 2 * nc_sample_size

# divide by click through probability to get pageviews
nc_pageviews = nc_total_sample_size / p_ctp

print('{} pageviews required to power Net Conversion'.format(nc_pageviews))

685325.0 pageviews required to power Net Conversion


#### Regarding Retention

In [18]:
# 53.0% baseline conversion for retention & minimum detectable effect of 1%  
ret_sample_size = 39115

# for both control & experiment
ret_total_sample_size = 2 * ret_sample_size

# divide by click through probability, then gross conversion to get pageviews
ret_pageviews = ret_total_sample_size / p_ctp / p_enroll_given_clicked

print('{0:.2f} pageviews required to power Retention'.format(ret_pageviews))

4741212.12 pageviews required to power Retention


Because the number of pageviews needed to power Retention is so high (about 4.7 million, or roughly 119 days at 40,000 pageviews per day with full traffic), the experiment would take too long to run, which further supports our decision not to include Retention as a metric.

### 2.2 Choosing Duration vs Exposure

#### Divert 100% of traffic to experiment

In [26]:
exp_100 = (nc_pageviews / df_basevals.baseline_val[0]) / 1

print('With 40,000 unique pageviews/day & 100% exposure, run test for {0:.2f} days'.format(exp_100))

With 40,000 unique pageviews/day & 100% exposure, run test for 17.13 days


#### Divert 75% of traffic to experiment

In [25]:
exp_75 = (nc_pageviews / df_basevals.baseline_val[0]) / 0.75

print('With 40,000 unique pageviews/day & 75% exposure, run test for {0:.2f} days'.format(exp_75))

With 40,000 unique pageviews/day & 75% exposure, run test for 22.84 days


#### Divert 50% of traffic to experiment

In [27]:
exp_50 = (nc_pageviews / df_basevals.baseline_val[0]) / 0.5

print('With 40,000 unique pageviews/day & 50% exposure, run test for {0:.2f} days'.format(exp_50))

With 40,000 unique pageviews/day & 50% exposure, run test for 34.27 days
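The three cells above repeat the same calculation with a different traffic fraction. A compact sketch of that calculation, rounded up to whole days, could look like this; the function name `days_to_run` is an illustrative assumption.

```python
import math

def days_to_run(pageviews_needed, daily_pageviews, traffic_fraction):
    """Whole days needed to collect the required pageviews at a given exposure."""
    return math.ceil(pageviews_needed / (daily_pageviews * traffic_fraction))

nc_pageviews = 685325      # pageviews required to power Net Conversion
daily_pageviews = 40000    # baseline unique cookies viewing the page per day

for fraction in (1.0, 0.75, 0.5):
    days = days_to_run(nc_pageviews, daily_pageviews, fraction)
    print('{:.0%} exposure: {} days'.format(fraction, days))
```

Rounding up matters in practice because a partial day of traffic cannot be scheduled, so 17.13 days becomes 18.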


## 3. Sanity Check

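Model assignment to the control or experiment group as a Bernoulli trial with probability 0.5. Under a 50/50 split, the experiment group's share of each count should fall inside a 95% confidence interval around 0.5. The check is meaningful for Pageviews and Clicks. Enrollments and Payments feed the evaluation metrics, so a failed check on those rows reflects a difference between the groups rather than a broken split.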

In [10]:
df_results = pd.DataFrame(data={'Control':df_control.sum().drop('Date'), 'Experiment':df_experiment.sum().drop('Date')},
                          dtype=int)
df_results

| | Control | Experiment |
|---|---|---|
| Pageviews | 345543 | 344660 |
| Clicks | 28378 | 28325 |
| Enrollments | 3785 | 3423 |
| Payments | 2033 | 1945 |


In [11]:
df_results['Total']=df_results.Control + df_results.Experiment
df_results['Prob'] = 0.5
df_results['StdErr'] = np.sqrt((df_results.Prob * (1 - df_results.Prob))/df_results.Total)
df_results["MargErr"] = 1.96 * df_results.StdErr
df_results["CI_lower"] = df_results.Prob - df_results.MargErr
df_results["CI_upper"] = df_results.Prob + df_results.MargErr
df_results["Obs_val"] = df_results.Experiment/df_results.Total
df_results["Pass_Sanity"] = df_results.apply(lambda x: (x.Obs_val > x.CI_lower) and (x.Obs_val < x.CI_upper),axis=1)
df_results['Diff'] = abs((df_results.Experiment - df_results.Control)/df_results.Total)

df_results

| | Control | Experiment | Total | Prob | StdErr | MargErr | CI_lower | CI_upper | Obs_val | Pass_Sanity | Diff |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pageviews | 345543 | 344660 | 690203 | 0.5 | 0.000602 | 0.00118 | 0.49882 | 0.50118 | 0.49936 | True | 0.001279 |
| Clicks | 28378 | 28325 | 56703 | 0.5 | 0.0021 | 0.004116 | 0.495884 | 0.504116 | 0.499533 | True | 0.000935 |
| Enrollments | 3785 | 3423 | 7208 | 0.5 | 0.005889 | 0.011543 | 0.488457 | 0.511543 | 0.474889 | False | 0.050222 |
| Payments | 2033 | 1945 | 3978 | 0.5 | 0.007928 | 0.015538 | 0.484462 | 0.515538 | 0.488939 | True | 0.022122 |
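The columns of this table follow from a normal-approximation interval around 0.5. The sketch below reproduces the check for a single row without pandas; the function name `split_sanity_check` is an illustrative assumption.

```python
import math

def split_sanity_check(n_control, n_experiment, p=0.5, z=1.96):
    """Return (ci_lower, ci_upper, observed_share, passed) for a 50/50 split."""
    total = n_control + n_experiment
    std_err = math.sqrt(p * (1 - p) / total)
    margin = z * std_err
    observed = n_experiment / total   # experiment group's share of the total
    lower, upper = p - margin, p + margin
    return lower, upper, observed, lower < observed < upper

for name, cont, exp in [("Pageviews", 345543, 344660), ("Clicks", 28378, 28325)]:
    lower, upper, observed, passed = split_sanity_check(cont, exp)
    print('{}: CI=({:.5f}, {:.5f}) observed={:.5f} pass={}'.format(
        name, lower, upper, observed, passed))
```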


## 4. Metrics Calculation

In [12]:
df_control_t =  df_control[df_control['Enrollments'].notnull()].sum().drop('Date')
df_experiment_t =  df_experiment[df_experiment['Enrollments'].notnull()].sum().drop('Date')

In [13]:
df_control_t.shape, df_experiment_t.shape

((4,), (4,))

In [14]:
df_experiment_t

Pageviews      211362
Clicks          17260
Enrollments      3423
Payments         1945
dtype: object

In [15]:
# experiment values
enrollments_exp = df_experiment_t["Enrollments"]
clicks_exp = df_experiment_t["Clicks"]
payments_exp = df_experiment_t["Payments"]

# control values
enrollments_cont = df_control_t["Enrollments"]
clicks_cont = df_control_t["Clicks"]
payments_cont = df_control_t["Payments"]

# metrics
GrossConversion_exp = enrollments_exp/clicks_exp
NetConversion_exp = payments_exp/clicks_exp
GrossConversion_cont = enrollments_cont/clicks_cont
NetConversion_cont = payments_cont/clicks_cont

GrossConversion = (enrollments_exp + enrollments_cont)/(clicks_cont + clicks_exp)
NetConversion = (payments_cont + payments_exp)/(clicks_cont + clicks_exp)

In [16]:
def ab_pool_cal(p_exp, p_cont, p_pool, n_exp, n_cont, z_alpha):
    """Two-proportion z-test with a pooled standard error; prints the z-score and CI."""
    # pooled standard error of the difference (1.0 avoids integer division on Python 2)
    std_err = np.sqrt(p_pool * (1 - p_pool) * (1.0/n_cont + 1.0/n_exp))
    diff = p_exp - p_cont
    marg_err = z_alpha * std_err
    ci_lower = diff - marg_err
    ci_upper = diff + marg_err
    # significant when the observed difference exceeds the margin of error
    if abs(diff) > marg_err:
        print ('Test Z-score: {}, REJECT null hypothesis statistically significant'.format(diff/std_err))
    else:
        print ('Test Z-score: {}, FAIL to reject null hypothesis'.format(diff/std_err))
    print ('Confidence Interval: ({},{})'.format(ci_lower, ci_upper))
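The function above prints its result, which makes it awkward to reuse. As a sketch of the same pooled test that starts from raw counts and returns values instead, one could write something like the following. The name `pooled_z_test` and the counts in the example call are hypothetical and chosen only for illustration.

```python
import math

def pooled_z_test(x_exp, n_exp, x_cont, n_cont, z_alpha=1.96):
    """Return (z_score, (ci_lower, ci_upper)) for the difference in proportions."""
    p_exp = x_exp / n_exp
    p_cont = x_cont / n_cont
    p_pool = (x_exp + x_cont) / (n_exp + n_cont)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_exp + 1 / n_cont))
    diff = p_exp - p_cont
    margin = z_alpha * std_err
    return diff / std_err, (diff - margin, diff + margin)

# Hypothetical counts, not taken from the experiment data.
z_score, ci = pooled_z_test(x_exp=180, n_exp=1000, x_cont=220, n_cont=1000)
significant = abs(z_score) > 1.96
print(z_score, ci, significant)
```

Returning the z-score and interval lets the caller apply whichever critical value and practical-significance boundary a given metric requires.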

## 5. Hypothesis Test

### 5.1 Hypothesis Test without Bonferroni Correction

####  Regarding GrossConversion 

In [17]:
print('GrossConversion: {} '.format(GrossConversion))
print('GrossConversion_exp: {} '.format(GrossConversion_exp))
print('GrossConversion_cont: {} '.format(GrossConversion_cont))
ab_pool_cal(GrossConversion_exp, GrossConversion_cont, GrossConversion, clicks_exp, clicks_cont, 1.96)

GrossConversion: 0.20860706740369866 
GrossConversion_exp: 0.19831981460023174 
GrossConversion_cont: 0.2188746891805933 
Test Z-score: -4.701830023753982, REJECT null hypothesis statistically significant
Confidence Interval: (-0.0291233583354044,-0.01198639082531873)


The **practical significance** boundary for Gross Conversion is **0.01**. The confidence interval (-0.0291, -0.0120) lies entirely below -0.01, outside the (-0.01, 0.01) band, so the change **is** also **practically significant**.

#### Regarding NetConversion 

In [18]:
print('NetConversion: {} '.format(NetConversion))
print('NetConversion_exp: {} '.format(NetConversion_exp))
print('NetConversion_cont: {} '.format(NetConversion_cont))
ab_pool_cal(NetConversion_exp, NetConversion_cont, NetConversion, clicks_exp, clicks_cont, 1.96)

NetConversion: 0.1151274853124186 
NetConversion_exp: 0.1126882966396292 
NetConversion_cont: 0.11756201931417337 
Test Z-score: -1.4192001144365733, FAIL to reject null hypothesis
Confidence Interval: (-0.011604624359891718,0.001857179010803383)


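The **practical significance** boundary for Net Conversion is **0.0075**. The confidence interval (-0.0116, 0.0019) contains zero and does not lie entirely outside the (-0.0075, 0.0075) band, so the change **is not practically significant** either. Note that the lower bound falls below -0.0075, so a practically relevant decrease cannot be ruled out.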
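The reasoning applied to both metrics compares a confidence interval against zero and against the practical significance boundary. A minimal sketch of that rule follows, using the rounded intervals printed above; the function name `classify_interval` is an illustrative assumption.

```python
def classify_interval(ci_lower, ci_upper, d_min):
    """Return (statistically_significant, practically_significant)."""
    # Statistically significant: the interval excludes zero.
    stat_sig = not (ci_lower <= 0 <= ci_upper)
    # Practically significant: the whole interval lies beyond +/- d_min.
    prac_sig = ci_upper < -d_min or ci_lower > d_min
    return stat_sig, prac_sig

print(classify_interval(-0.0291, -0.0120, 0.01))    # Gross Conversion -> (True, True)
print(classify_interval(-0.0116, 0.0019, 0.0075))   # Net Conversion -> (False, False)
```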

### 5.2 Hypothesis Test with Bonferroni Correction

##### There are two tests and the overall alpha level is 0.05, so the Bonferroni-corrected alpha level for each individual test is 0.05/2 = 0.025, which corresponds to a two-sided z-alpha of 2.24.
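The critical values can be derived rather than looked up. The sketch below uses the standard library's `statistics.NormalDist`; the helper name `two_sided_z` is an illustrative assumption.

```python
from statistics import NormalDist

def two_sided_z(alpha):
    """Critical z-value for a two-sided test at significance level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)

overall_alpha = 0.05
n_tests = 2
alpha_per_test = overall_alpha / n_tests   # Bonferroni correction

print(round(two_sided_z(overall_alpha), 2))    # 1.96
print(round(two_sided_z(alpha_per_test), 2))   # 2.24
```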

#### Regarding GrossConversion 

In [19]:
print('GrossConversion: {} '.format(GrossConversion))
print('GrossConversion_exp: {} '.format(GrossConversion_exp))
print('GrossConversion_cont: {} '.format(GrossConversion_cont))
ab_pool_cal(GrossConversion_exp, GrossConversion_cont, GrossConversion, clicks_exp, clicks_cont, 2.24)

GrossConversion: 0.20860706740369866 
GrossConversion_exp: 0.19831981460023174 
GrossConversion_cont: 0.2188746891805933 
Test Z-score: -4.701830023753982, REJECT null hypothesis statistically significant
Confidence Interval: (-0.030347427443267662,-0.010762321717455467)


The **practical significance** boundary for Gross Conversion is **0.01**. The confidence interval (-0.0303, -0.0108) lies entirely below -0.01, outside the (-0.01, 0.01) band, so the change **is** also **practically significant**.

#### Regarding NetConversion 

In [20]:
print('NetConversion: {} '.format(NetConversion))
print('NetConversion_exp: {} '.format(NetConversion_exp))
print('NetConversion_cont: {} '.format(NetConversion_cont))
ab_pool_cal(NetConversion_exp, NetConversion_cont, NetConversion, clicks_exp, clicks_cont, 2.24)

NetConversion: 0.1151274853124186 
NetConversion_exp: 0.1126882966396292 
NetConversion_cont: 0.11756201931417337 
Test Z-score: -1.4192001144365733, FAIL to reject null hypothesis
Confidence Interval: (-0.012566181743512797,0.0028187363944244623)


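The **practical significance** boundary for Net Conversion is **0.0075**. The confidence interval (-0.0126, 0.0028) contains zero and does not lie entirely outside the (-0.0075, 0.0075) band, so the change **is not practically significant** either. Note that the lower bound falls below -0.0075, so a practically relevant decrease cannot be ruled out.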

## 6. Sign Test

In [21]:
df_control_sign = df_control[df_control['Enrollments'].notnull()].copy()
df_control_sign['GrossConversion'] = df_control_sign.Enrollments / df_control_sign.Clicks
df_control_sign['NetConversion'] = df_control_sign.Payments / df_control_sign.Clicks
df_control_sign = df_control_sign[['Date','GrossConversion','NetConversion']]
df_control_sign.head()

| | Date | GrossConversion | NetConversion |
|---|---|---|---|
| 0 | Sat, Oct 11 | 0.195051 | 0.101892 |
| 1 | Sun, Oct 12 | 0.188703 | 0.089859 |
| 2 | Mon, Oct 13 | 0.183718 | 0.10451 |
| 3 | Tue, Oct 14 | 0.186603 | 0.125598 |
| 4 | Wed, Oct 15 | 0.194743 | 0.076464 |


In [22]:
df_experiment_sign = df_experiment[df_experiment['Enrollments'].notnull()].copy()
df_experiment_sign['GrossConversion'] = df_experiment_sign.Enrollments / df_experiment_sign.Clicks
df_experiment_sign['NetConversion'] = df_experiment_sign.Payments / df_experiment_sign.Clicks
df_experiment_sign = df_experiment_sign[['Date','GrossConversion','NetConversion']]

In [23]:
df_SignTest = pd.merge(df_control_sign,df_experiment_sign,on="Date",suffixes=('_cont', '_exp'))
df_SignTest['GC_Sign'] = df_SignTest.GrossConversion_cont > df_SignTest.GrossConversion_exp
df_SignTest['NC_Sign'] = df_SignTest.NetConversion_cont >= df_SignTest.NetConversion_exp

In [24]:
len(df_SignTest['GC_Sign']), df_SignTest['GC_Sign'].sum(), df_SignTest['NC_Sign'].sum()

(23, 19, 13)

In [25]:
# binom_test was removed in SciPy 1.12; on newer versions use binomtest(k, n, p).pvalue
from scipy.stats import binom_test
binom_test(19, 23, 0.5)

0.0025994777679443364

For **Gross Conversion**, the experiment group's rate was lower than the control group's on 19 of the 23 days. Under the sign test's null hypothesis (probability 0.5 for each direction), the two-sided p-value is 0.0026, which is below the alpha level of 0.05. The change is therefore **statistically significant**.

In [26]:
binom_test(13, 23, 0.5)

0.67763948440551747

For **Net Conversion**, the experiment group's rate was at or below the control group's on 13 of the 23 days. Under the sign test's null hypothesis (probability 0.5 for each direction), the two-sided p-value is 0.678, which is above the alpha level of 0.05. The change is therefore **not statistically significant**.
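The sign-test p-values can also be verified without SciPy, which is handy because recent SciPy releases no longer ship `binom_test` (`binomtest` is its replacement). The sketch below computes the exact two-sided binomial p-value for p = 0.5 using only the standard library; the function name `sign_test_p_value` is an illustrative assumption.

```python
from math import comb

def sign_test_p_value(successes, n):
    """Exact two-sided binomial p-value under the null p = 0.5."""
    observed = comb(n, successes)
    # With p = 0.5 every outcome k has probability comb(n, k) / 2**n, so sum
    # the outcomes that are no more likely than the observed one.
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n

print(sign_test_p_value(19, 23))   # Gross Conversion
print(sign_test_p_value(13, 23))   # Net Conversion
```

Working with integer binomial coefficients keeps the comparison exact, which avoids floating-point ties when deciding which outcomes count as "at least as extreme".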

## 7. Conclusion

Our objective is to determine whether filtering students by study time commitment would improve the overall student experience and the coaches' capacity to support students. This is expected to lead to a higher number of students who are likely to complete the course. Moreover, it should also not significantly reduce the number of students who continue past the free trial. 

A statistically and practically significant decrease in Gross Conversion was observed, with no significant difference in Net Conversion. In other words, enrollment decreased without a corresponding increase in students staying the requisite 14 days to trigger payment.

Considering this, our recommendation is not to launch, but rather to pursue other experiments.