In [1]:
from scipy import stats 
import math as mt
import numpy as np
import pandas as pd
from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import binom_test

# Context and Project Challenge

In collaboration with Google, Udacity provides an introductory course to A/B testing. The course covers the design and analysis of A/B tests using a frequentist approach. This notebook provides a walkthrough of the course's final project.

* Project Challenge
Udacity's mission is to power careers through tech education. Working towards this mission, the company aims to provide a stimulating learning experience that is tailored to the individual learner and supported by experienced coaches. To improve its services, Udacity tinkered with changing the user flow on its website and set up an A/B test titled "Free Trial Screener" to test its idea.


* Status quo
    - At the time of the experiment, Udacity courses have two options on the course overview page: "start free trial", and "access course materials".

    - If students click "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first.

    - If students click "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.


* Treatment
    - In the experiment, Udacity tests a change where if students click "start free trial", they are asked how much time they have available to devote to the course.

    - If students indicate 5 or more hours per week, they are taken through the checkout process as usual.

    - If they indicate fewer than 5 hours per week, a message appears indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, students have the option to continue enrolling in the free trial, or access the course materials for free instead. This screenshot shows what the experiment looks like.


* Reasoning
    - The hypothesis is that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course.

    - If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

# 1. Pre-test analysis and Experiment Setup

## 1.1 Choose the unit of diversion
* The unit of diversion is a cookie
* If the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

## 1.2 Initial hypotheses

* H0: The treatment has no effect on the share of people who enroll in the free trial
* H1: The treatment reduces the share of people who enroll in the free trial
##   
* H0: The treatment has no effect on the share of people who leave the free trial
* H1: The treatment improves the overall student experience thereby reducing the share of people who leave the free trial
##    
* H0: The treatment has no effect on the number of people who continue past the free trial
* H1: The treatment affects the number of people who continue past the free trial

## 1.3 Choose invariant/evaluation metrics
* It is helpful to visualize the experiment. A visualization with the provided metric options is shown below:
![user_flow.png](attachment:user_flow.png)

* Invariant/guardrail metrics:
![guardrail_metrics-2.png](attachment:guardrail_metrics-2.png)

* Variant/evaluation metrics:
![evaluation_metrics-2.png](attachment:evaluation_metrics-2.png)

## 1.4 Hypotheses revisited
Given the available and selected metrics, we can now specify our hypotheses. <b>While it could be argued that a one-sided test is appropriate in some cases, we stick with the more conservative two-sided test.</b>

* $ H_{0}: CG_{treatment} = CG_{control} $
* $ H_{1}: CG_{treatment} \neq CG_{control} $


* $ H_{0}: R_{treatment} = R_{control} $
* $ H_{1}: R_{treatment} \neq R_{control} $


* $ H_{0}: CN_{treatment} = CN_{control} $
* $ H_{1}: CN_{treatment} \neq CN_{control} $

## 1.5 Set the significance level and power (alpha and beta) of the test
- alpha: 5%
- beta: 20%
- power: 1 - beta = 80%

In [2]:
#storing alpha and beta in a dictionary
error_prob = {"alpha": 0.05, "beta": 0.20}
error_prob

{'alpha': 0.05, 'beta': 0.2}

## 1.6 Measure variability in metrics

### 1.6.1 Collect baseline data

In [3]:
#Storing baseline data
d = {"Metric Name": ["Cookies", "Clicks", "User-ids", "Click-through-probability", "Gross conversion", "Retention", "Net conversion"], 
     "Estimator": [40000, 3200, 660, 0.08, 0.20625, 0.53, 0.109313],
     "dmin": [3000, 240, -50, 0.01, -0.01, 0.01, 0.0075]}
md = pd.DataFrame(data=d, index=["C", "CL", "ID", "CTP", "CG", "R", "CN"])
md

Unnamed: 0,Metric Name,Estimator,dmin
C,Cookies,40000.0,3000.0
CL,Clicks,3200.0,240.0
ID,User-ids,660.0,-50.0
CTP,Click-through-probability,0.08,0.01
CG,Gross conversion,0.20625,-0.01
R,Retention,0.53,0.01
CN,Net conversion,0.109313,0.0075


### 1.6.2  Calculate standard errors

* Calculate the standard deviation of the sampling distribution of the sample mean (standard error, in short) for each of the evaluation metrics

* To be more precise, in this case we calculate the estimated standard errors of the sample proportions as our evaluation metrics are probabilities. 

* The standard error is an estimate of how far the sample proportion is likely to be from the population proportion.

#### 1.6.2.1 Scaling

* Since the sample size given by Udacity is n = 5000 cookies, we first need to scale the collected count data, i.e. the number of cookies, the number of clicks and the number of user-ids.

In [4]:
# create new column to store scaled estimators
# dmin is the practical significance boundary for each metric (minimum effect size)

md.insert (2, "Scaled_Est", np.nan)

#scale count estimates
scaling_factor = 5000/md.loc["C"]["Estimator"]

for i in ["C", "CL", "ID"]:
    md.at[i, "Scaled_Est"] = md.loc[i]["Estimator"] * scaling_factor
md

Unnamed: 0,Metric Name,Estimator,Scaled_Est,dmin
C,Cookies,40000.0,5000.0,3000.0
CL,Clicks,3200.0,400.0,240.0
ID,User-ids,660.0,82.5,-50.0
CTP,Click-through-probability,0.08,,0.01
CG,Gross conversion,0.20625,,-0.01
R,Retention,0.53,,0.01
CN,Net conversion,0.109313,,0.0075


#### 1.6.2.2 Assumptions

* The trials can be considered independent only when the unit of diversion is the same as the unit of analysis (an assumption of the binomial distribution).

* Since the unit of diversion is the same as the unit of analysis (denominator of the metric formula) for each evaluation metric (cookie in the case of Gross Conversion and Net Conversion and user-id in the case of Retention) and we can make assumptions about the distributions of the metrics (binomial), we can calculate the standard errors analytically (instead of empirically).

* Further, as n is relatively large in each case, we can assume that the sampling distribution of a sample proportion approaches a normal distribution (due to the Central Limit Theorem). We can also use a rule such as the 3-standard-deviation rule to check if n is large enough:

![3_std_rule.png](attachment:3_std_rule.png)
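
The rule can also be written out explicitly. This is a sketch that assumes the image states the same condition that `checkN` implements: the interval of three standard errors around $ \hat{p} $ has to stay inside $ [0, 1] $, which rearranges into two lower bounds on n.

$ \hat{p}-3*\sqrt{\frac{\hat{p}*(1-\hat{p})}{n}} > 0 \iff n > 9*\frac{1-\hat{p}}{\hat{p}} $

$ \hat{p}+3*\sqrt{\frac{\hat{p}*(1-\hat{p})}{n}} < 1 \iff n > 9*\frac{\hat{p}}{1-\hat{p}} $

For retention ($ \hat{p} = 0.53 $, n = 82.5), for example, the two bounds are roughly 7.98 and 10.15, so the condition holds.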

In [5]:
def checkN (n, p, metric):
    '''Given sample size n and probability p, print and return whether n is large enough to pass the
    3-standard deviation rule, i.e. whether we can assume that the distribution can be approximated
    by the normal distribution'''
    #p - 3*SE > 0 and p + 3*SE < 1 are equivalent to the two bounds on n below
    passed = n > 9*((1-p)/p) and n > 9*(p/(1-p))
    verdict = "is large enough" if passed else "is not large enough"
    print(metric,":  n =", n, verdict, "to assume normal distribution approximation")
    #return the boolean result (print() itself always returns None)
    return passed

#check whether n is large enough to assume normal distribution approximation
for i,j in zip(["CL", "ID", "CL"],["CG", "R", "CN"]):
    checkN (md.at[i, "Scaled_Est"], md.at[j,"Estimator"], md.at[j,"Metric Name"])

Gross conversion :  n = 400.0 is large enough to assume normal distribution approximation
Retention :  n = 82.5 is large enough to assume normal distribution approximation
Net conversion :  n = 400.0 is large enough to assume normal distribution approximation


#### 1.6.2.3  Compute standard errors

Given the above assumptions, we can approximate the standard error through:

$ SE = \sqrt{\frac{\hat{p}*(1-\hat{p})}{n}}$

with $ \sqrt{\hat{p}*(1-\hat{p})}$ estimating the population standard deviation.

In [6]:
#create new column to store standard errors
md["SE"] = np.nan

#formula to calculate the standard error
def standardError (n, p):
    '''Return the standard error of a sample proportion for a given probability p and sample size n'''
    return (p*(1-p)/n)**0.5

#calculating standard errors for evaluation metrics and store them in md
for i in ["CG", "CN"]:
    md.at[i, "SE"] = standardError(md.loc["CL"]["Scaled_Est"], md.loc[i]["Estimator"]) 
    
md.at["R", "SE"] = standardError(md.loc["ID"]["Scaled_Est"], md.loc["R"]["Estimator"])
md

Unnamed: 0,Metric Name,Estimator,Scaled_Est,dmin,SE
C,Cookies,40000.0,5000.0,3000.0,
CL,Clicks,3200.0,400.0,240.0,
ID,User-ids,660.0,82.5,-50.0,
CTP,Click-through-probability,0.08,,0.01,
CG,Gross conversion,0.20625,,-0.01,0.020231
R,Retention,0.53,,0.01,0.054949
CN,Net conversion,0.109313,,0.0075,0.015602


## 1.7 Calculating the required sample size

* Gross conversion, retention and net conversion are all probabilities, so the underlying counts can be modelled as binomially distributed. Based on the central limit theorem, the sample proportion is approximately normal, and its standard error is given by this formula:

![SE.png](attachment:SE.png)


* To calculate the required sample size, imagine that we have two samples: the baseline (the current version of the website) and the test version (which we haven't launched yet, and whose size we want to determine). The power of the experiment has the following relationship with sample size:
![Z.png](attachment:Z.png)

* The sample size is embedded in the standard error. If you substitute the proportions into the formula and assume a 50/50 split between the control and test groups, the sample size is:

![sample_size.png](attachment:sample_size.png)

* If we want a 50/50 split between test and control groups, then r=1. To calculate the required sample size for each metric, we need the baseline conversion and the minimum desired change (a.k.a. the practical significance level).
* Assuming that the power is set to 80% and the significance level is set to 5%, a more general formula would be:
![sample_size_f.png](attachment:sample_size_f.png)
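
Written out, the formula can be reconstructed as follows. This is read off the `get_sampleSize` function used in this notebook rather than taken from the image itself, so treat the notation as an assumption; n is the required number of units of analysis per group:

$ n = \frac{(Z_{1-\alpha/2}+Z_{1-\beta})^2*[\hat{p}*(1-\hat{p})+(\hat{p}+d_{min})*(1-(\hat{p}+d_{min}))]}{d_{min}^2} $

With $ \alpha = 0.05 $ and $ \beta = 0.20 $, $ Z_{1-\alpha/2} \approx 1.96 $ and $ Z_{1-\beta} \approx 0.84 $, so the first factor is roughly 7.85.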

In [7]:
#create new column n_c to store sample sizes
md["n_C"] = np.nan

#define function for calculating sample sizes
def get_sampleSize (alpha, beta, p, dmin):
    '''Return sample size given alpha, beta, p and dmin'''
    return (p*(1-p)+(p+dmin)*(1-(p+dmin)))*pow(stats.norm.ppf(1-alpha/2)+stats.norm.ppf(1-beta),2)/pow(dmin,2)

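#calculate sample sizes for evaluation metrics with defined adjustments and store results in md
#get_sampleSize returns the units of analysis (clicks) needed per group;
#dividing by CTP converts clicks to pageviews and multiplying by 2 covers both groups
for i in ["CG", "CN"]:
    md.at[i, "n_C"] = round((get_sampleSize(error_prob["alpha"], error_prob["beta"], md.loc[i]["Estimator"], md.loc[i]["dmin"])/md.loc["CTP"]["Estimator"])*2)

#retention is measured per enrollment, so additionally divide by gross conversion (enrollments per click)
md.at["R", "n_C"] = round(((get_sampleSize(error_prob["alpha"], error_prob["beta"], md.loc["R"]["Estimator"], md.loc["R"]["dmin"])/md.loc["CTP"]["Estimator"])/md.loc["CG"]["Estimator"])*2)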
md

Unnamed: 0,Metric Name,Estimator,Scaled_Est,dmin,SE,n_C
C,Cookies,40000.0,5000.0,3000.0,,
CL,Clicks,3200.0,400.0,240.0,,
ID,User-ids,660.0,82.5,-50.0,,
CTP,Click-through-probability,0.08,,0.01,,
CG,Gross conversion,0.20625,,-0.01,0.020231,630749.0
R,Retention,0.53,,0.01,0.054949,4733112.0
CN,Net conversion,0.109313,,0.0075,0.015602,699532.0
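
The values in n_C are total pageviews, not units of analysis. The following self-contained sketch spells out that conversion for gross conversion and retention. The helper names `required_units_per_group` and `total_pageviews` are hypothetical and only restate the calculation above; the baseline values are the ones from 1.6.1.

```python
from scipy.stats import norm


def required_units_per_group(p, dmin, alpha=0.05, beta=0.20):
    """Units of analysis needed per group (same formula as get_sampleSize)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
    variance_sum = p * (1 - p) + (p + dmin) * (1 - (p + dmin))
    return variance_sum * z ** 2 / dmin ** 2


def total_pageviews(units_per_group, units_per_pageview, groups=2):
    """Convert units of analysis per group into pageviews across all groups."""
    return units_per_group / units_per_pageview * groups


ctp = 0.08                  # baseline clicks per pageview
gross_conversion = 0.20625  # baseline enrollments per click

# Gross conversion is measured per click, so one pageview yields ctp clicks.
cg_pageviews = total_pageviews(required_units_per_group(0.20625, -0.01), ctp)

# Retention is measured per enrollment: one pageview yields
# ctp * gross_conversion enrollments, which inflates the pageviews needed.
r_pageviews = total_pageviews(required_units_per_group(0.53, 0.01),
                              ctp * gross_conversion)

print(round(cg_pageviews), round(r_pageviews))
```

The retention requirement is so much larger mainly because only about 1.65% of pageviews (0.08 * 0.20625) lead to an enrollment.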


## 1.8 Experiment exposure and duration

Now, for each case, we can calculate approximately how many days we would need to run the experiment in order to reach n_C. Following the challenge description, we assume that there are no other experiments we want to run simultaneously. So, theoretically, we could divert 100% of the traffic to our experiment (i.e. about 50% of all visitors would then be in the treatment condition). Given our estimate of about 40,000 unique pageviews per day, this would result in:

In [8]:
#traffic diverted to experiment [0:1]
traffic_diverted = 1

#Days it would take to run experiment for each case
for i, j in zip(["CG", "CN", "R"],["CG", "CG+CN", "CG+CN+R"]):
   print("Days required for",j,":", round(md.loc[i]["n_C"]/(md.loc["C"]["Estimator"]*traffic_diverted),2))

Days required for CG : 15.77
Days required for CG+CN : 17.49
Days required for CG+CN+R : 118.33


We see that we would need to run the experiment for about 119 days in order to test all three hypotheses (and this does not even take into account the 14 additional days (free trial period) we have to wait until we can evaluate the experiment). Such a duration (especially with 100% of the traffic diverted to it) appears to be very risky. First, we cannot perform any other experiment during this period (opportunity costs). Second, if the treatment harms the user experience (frustrated students, inefficient coaching resources) and decreases conversion rates, we won't notice it (or cannot really say so) for more than four months (business risk). <i>Consequently, it seems more reasonable to test only the first and third hypotheses and to discard retention as an evaluation metric,</i> especially since net conversion is the product of retention and gross conversion, so that we might be able to draw inferences about the retention rate from the two remaining evaluation metrics.

So, how much traffic should we divert to the experiment? Given the considerations above, we want the experiment to run relatively fast and for no more than a few weeks. Also, as the nature of the experiment itself does not seem to be very risky (e.g. the treatment doesn't involve a feature that is critical with regard to potential media coverage), we can be confident in diverting a high percentage of traffic to the experiment. Still, since there is always the potential that something goes wrong during implementation, we may not want to divert all of our traffic to it. Hence, 80% (22 days) would seem to be quite reasonable. <i>However, when we look at the data provided by Udacity (see 2.1) we see that it takes 37 days to collect 690,203 pageviews, meaning that they most likely diverted somewhere between 45% and 50% of their traffic to the experiment.</i>

In [9]:
#traffic diverted to experiment
traffic_diverted = 0.47

#Days it would take to run experiment if we use net conversion and gross conversion as evaluation metrics
print("Experiment duration in days, CN+CG: ",round(md.loc["CN"]["n_C"]/(md.loc["C"]["Estimator"]*traffic_diverted),2))


Experiment duration in days, CN+CG:  37.21
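
The trade-off between diverted traffic and duration discussed above can be tabulated in one go. This is a minimal sketch: `days_required` is a hypothetical helper, and the two constants are copied from the tables in 1.6.1 and 1.7.

```python
def days_required(pageviews_needed, daily_pageviews, traffic_fraction):
    """Days needed to collect pageviews_needed at a given share of daily traffic."""
    if not 0 < traffic_fraction <= 1:
        raise ValueError("traffic_fraction must be in (0, 1]")
    return pageviews_needed / (daily_pageviews * traffic_fraction)


pageviews_needed = 699532  # n_C for net conversion (covers gross conversion too)
daily_pageviews = 40000    # baseline unique cookies per day

for fraction in (1.0, 0.8, 0.5, 0.47):
    days = days_required(pageviews_needed, daily_pageviews, fraction)
    print(f"{fraction:.0%} of traffic: {days:.2f} days")
```

Halving the diverted traffic doubles the duration, so the choice mainly balances exposure to a possibly harmful change against time to a decision.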


## 1.9 Accounting for multiple hypotheses?
<a id="section3_9"></a>

As we now have more than one hypothesis, the chance of getting false positives increases. However, our metrics are not fully independent, which is why the true probability of at least one false positive will still be lower than 9.75% (1-pow(0.95,2), which is the value for independent metrics). We could use family-wise error rate methods such as the Bonferroni correction, or false discovery rate methods, to account for the multiple hypotheses problem. However, they have flaws as well (e.g. we could easily end up with more false negatives; see [here](https://multithreaded.stitchfix.com/blog/2015/10/15/multiple-hypothesis-testing/) and [here](https://www.statisticshowto.datasciencecentral.com/multiple-testing-problem/)). Hence, given that the chance of getting false positives is only slightly increased in this case, we won't control for multiple hypotheses here.
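
For reference, the 9.75% figure and the per-test level a Bonferroni correction would imply can be computed directly. This is a small sketch with hypothetical helper names; it assumes independent tests, which is the worst case described above.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


alpha, n_tests = 0.05, 2  # two evaluation metrics: gross and net conversion

print("FWER without correction:", familywise_error_rate(alpha, n_tests))
print("Bonferroni per-test alpha:", bonferroni_alpha(alpha, n_tests))
print("FWER with correction:",
      familywise_error_rate(bonferroni_alpha(alpha, n_tests), n_tests))
```

The corrected per-test level would widen each confidence interval, which is the loss of power (more false negatives) mentioned above.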

# 2. Data Analysis

## 2.1 Loading experiment and control data


In [10]:
#loading experiment data into new dataframes
control = pd.read_csv("/home/ruifan/Downloads/control_data.csv") 
experiment = pd.read_csv("/home/ruifan/Downloads/experiment_data.csv")

#check if loaded correctly
control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [11]:
#check if loaded correctly
experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


In [12]:
#check number of entries
control.count()

Date           37
Pageviews      37
Clicks         37
Enrollments    23
Payments       23
dtype: int64

In [13]:
#check number of entries
experiment.count()

Date           37
Pageviews      37
Clicks         37
Enrollments    23
Payments       23
dtype: int64

In [14]:
#check sample size and store it as sample_size
sample_size_control = control["Pageviews"].sum()
sample_size_experiment = experiment["Pageviews"].sum()
sample_size = sample_size_control+sample_size_experiment
sample_size

690203

## 2.2 Sanity Check
To ensure that the experiment has been run properly, we first conduct a sanity check using the three invariant metrics outlined above (1.3). We have two counts (number of cookies, number of clicks) and one probability. As stated earlier, we would expect that these metrics do not differ significantly between the control and treatment groups. Otherwise, this would imply that something is wrong with the experiment setup and that our results are biased.


### 2.2.1 Sanity check: number of cookies + number of clicks

#### 2.2.1.1 Binomial test

In the provided data, the column "pageviews" represents the number of cookies that browse the course overview page. Given our assumptions, we would expect that the total number of cookies in the treatment group and the total number of cookies in the control group each account for about 50% of the combined number of cookies of both groups (treatment + control) as they should have been assigned randomly. <b>If we now regard being assigned to the control group as a success, we can use the binomial distribution to model the number of successes in the given sample (treatment+control) and perform a binomial test/one-proportion z-test as a sanity check:</b>

a)  Compute a confidence interval around the expected proportion of successes, i.e. the share of successes we expect to get out of n (as n is large, we can further assume that the sampling distribution of the sample proportion approaches a normal distribution (Central Limit Theorem))


$ CI = [\hat{p}-Z_{1-\alpha/2}*SE; \hat{p}+Z_{1-\alpha/2}*SE] $

with $ SE = \sqrt{\frac{\hat{p}*(1-\hat{p})}{n}} $


&nbsp;

b)  Check if the observed fraction $ \frac{\text{Number of successes}}{n} $ is within the interval. If yes, the sanity check is passed.


We will conduct the same test also for our second invariant metric "number of clicks". Again, as n is large we can assume that the sampling distribution of the sample proportion approximates a normal distribution.

In [15]:
# binomial test

#create empty dataframe to store sanity check results
sanity_check = pd.DataFrame(columns=["CI_left", "CI_right", "obs","passed?"], index=["C", "CL", "CTP"])

#set alpha and p_hat
p = 0.5
alpha = 0.05

#fill dataframe with results from binomial test
#for cookies and clicks do the following
for i,j in zip(["C", "CL"], ["Pageviews", "Clicks"]):
    #calculate the number of successes (n_control) and number of observations (n)
    n = control[j].sum()+experiment[j].sum()
    n_control = control[j].sum()
    
    #compute confidence interval
    sanity_check.at[i, "CI_left"] = p-(stats.norm.ppf(1-alpha/2)*standardError(n,p))
    sanity_check.at[i, "CI_right"] = p+(stats.norm.ppf(1-alpha/2)*standardError(n,p))
    
    #compute observed fraction of successes
    sanity_check.at[i, "obs"] = round(n_control/(n),4)
    
    #check if the observed fraction of successes lies within the 95% confidence interval
    if sanity_check.at[i, "CI_left"] <= sanity_check.at[i, "obs"] <= sanity_check.at[i, "CI_right"]:
        sanity_check.at[i, "passed?"] = "yes"
    else:
        sanity_check.at[i, "passed?"] = "no"

#return results
sanity_check

Unnamed: 0,CI_left,CI_right,obs,passed?
C,0.49882,0.50118,0.5006,yes
CL,0.495885,0.504115,0.5005,yes
CTP,,,,
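
Steps a) and b) can also be tried on a small, fully visible slice of the data. The sketch below uses only the five daily pageview counts printed by `control.head()` and `experiment.head()` in 2.1, so its numbers differ from the full 37-day result; `split_check` is a hypothetical helper name.

```python
from scipy.stats import norm


def split_check(n_control, n_experiment, p=0.5, alpha=0.05):
    """Return (CI_left, CI_right, observed fraction, passed) for an expected split p."""
    n = n_control + n_experiment
    se = (p * (1 - p) / n) ** 0.5             # standard error under the expected split
    margin = norm.ppf(1 - alpha / 2) * se     # half-width of the confidence interval
    observed = n_control / n                  # fraction of cookies assigned to control
    passed = bool(p - margin <= observed <= p + margin)
    return p - margin, p + margin, observed, passed


# First five days only, copied from the head() outputs in 2.1
control_pageviews = [7723, 9102, 10511, 9871, 10014]
experiment_pageviews = [7716, 9288, 10480, 9867, 9793]

ci_left, ci_right, observed, passed = split_check(sum(control_pageviews),
                                                  sum(experiment_pageviews))
print(round(ci_left, 4), round(ci_right, 4), round(observed, 4), passed)
```

With fewer observations the interval is wider than the one in the table above, which illustrates how the check tightens as n grows.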


#### 2.2.1.2 One-proportion z-test and exact binomial test

Alternatively, we could have calculated the test-statistic Z and compared the corresponding p-value against our selected alpha level. Another option would have been an exact binomial test. This would have looked like this for the metric "number of cookies":

In [16]:
# one-proportion z-test

#calculate the number of observations
n = control["Pageviews"].sum()+experiment["Pageviews"].sum()
#calculate the number of successes
n_control = control["Pageviews"].sum()

#calculate the test-statistic Z and corresponding p_value
z_statistic, p_value = proportions_ztest(n_control, n, value=0.5, alternative="two-sided", prop_var=0.5)

print("z-test-statistic: ", z_statistic)
print("p-value:" , p_value)

#alternatively compute p-value using the exact binomial test
p_value_binom = binom_test(n_control, n, prop=0.5, alternative='two-sided')
print("p-value_binomial: ", p_value_binom)

#check whether p_value is larger than alpha
alpha = 0.05

if p_value_binom > alpha:
    print("The null hypothesis cannot be rejected and the sanity check is passed")
else:
    print("The null hypothesis is rejected and the sanity check is not passed")

z-test-statistic:  1.0628507473597084
p-value: 0.2878496417066284
p-value_binomial:  0.28839593105760386
The null hypothesis cannot be rejected and the sanity check is passed


### 2.2.2 Sanity check: click-through probability

#### 2.2.2.1 calculate a confidence interval around the expected difference of the two metrics which is 0

To check whether the click-through probabilities in the control and treatment groups are significantly different from each other, we conduct a two-proportion z-test with a click being interpreted as a success. We assume that the sampling distributions of the two sample proportions are approximately normal but that their variances are not necessarily equal (hence p is not pooled below; $ SE_{pooled} $ denotes the combined standard error of the difference, not a pooled proportion). <b>To perform the test, we can calculate a confidence interval around the expected difference of the two metrics, which is 0. Alternatively</b>, we can calculate the Z-test-statistic and then check the corresponding p-value. The steps for the first approach are the following:

a) Compute confidence interval around the expected difference of 0.

$ CI = [0-Z_{1-\alpha/2}*SE; 0+Z_{1-\alpha/2}*SE] $

with $ SE_{pooled} = \sqrt{\frac{S_{cont}^2}{n_{cont, pageviews}}+\frac{S_{exp}^2}{n_{exp, pageviews}}} $

whereby $ S = \sqrt{p*(1-p)} $

and $ p = CTP = \frac{n_{clicks}}{n_{pageviews}} $

&nbsp;

b) Compute the observed difference between the two metrics d and check whether d lies within CI

$ d = CTP_{experiment}-CTP_{control} $

In [17]:
#compute CTP for both groups
CTP_control = control["Clicks"].sum()/control["Pageviews"].sum()
CTP_experiment = experiment["Clicks"].sum()/experiment["Pageviews"].sum()

#compute sample standard deviations for both groups
S_control = (CTP_control*(1-CTP_control))**0.5
S_experiment = (CTP_experiment*(1-CTP_experiment))**0.5

#compute SE_pooled
SE_pooled = (S_control**2/control["Pageviews"].sum()+S_experiment**2/experiment["Pageviews"].sum())**0.5

#compute 95% confidence interval and store it in sanity check
alpha = 0.05

sanity_check.at["CTP", "CI_left"] = 0-(stats.norm.ppf(1-alpha/2)*SE_pooled)
sanity_check.at["CTP", "CI_right"] = 0+(stats.norm.ppf(1-alpha/2)*SE_pooled)

#compute observed difference d and store it in sanity check
sanity_check.at["CTP", "obs"] = round(CTP_experiment - CTP_control,4)

#check if sanity check is passed
if sanity_check.at["CTP", "CI_left"] <= sanity_check.at["CTP", "obs"] <= sanity_check.at["CTP", "CI_right"]:
    sanity_check.at["CTP", "passed?"] = "yes"
else:
    sanity_check.at["CTP", "passed?"] = "no"

#return results
sanity_check

Unnamed: 0,CI_left,CI_right,obs,passed?
C,0.49882,0.50118,0.5006,yes
CL,0.495885,0.504115,0.5005,yes
CTP,-0.001296,0.001296,0.0001,yes


#### 2.2.2.2 alternative approach

The alternative approach using statsmodels' `proportions_ztest` function would have looked like this:

In [18]:
#calculate the number of observations for each group and store results in an array
n = np.array([control["Pageviews"].sum(), experiment["Pageviews"].sum()])
#calculate the number of successes for each group and store results in an array
n_clicks = np.array([control["Clicks"].sum(), experiment["Clicks"].sum()])

#calculate the test-statistic Z and corresponding p_value
z_statistic, p_value = proportions_ztest(n_clicks, n, value=0, alternative="two-sided", prop_var=0)

print("z-test-statistic: ", z_statistic)
print("p-value:" , p_value)

#check whether p_value is larger than alpha
alpha = 0.05

if p_value > alpha:
    print("The null hypothesis cannot be rejected and the sanity check is passed")
else:
    print("The null hypothesis is rejected and the sanity check is not passed")

z-test-statistic:  -0.08566094109242048
p-value: 0.9317359524473912
The null hypothesis cannot be rejected and the sanity check is passed


## 2.3 Testing analysis and Examining the effect size
<a id="section4_3"></a>

Similar to the click-through probability, we can test our evaluation metric hypotheses using two-proportion z-tests (the same assumptions as outlined above apply). However, in contrast to the previous implementation, this time we will compute the respective confidence interval around the observed difference between the conversion metrics. Further, we will check if the observed changes also matter to the business (dmin).

Recall our hypotheses:

* $ H_{0}: CG_{treatment} = CG_{control} $
* $ H_{1}: CG_{treatment} \neq CG_{control} $


* $ H_{0}: CN_{treatment} = CN_{control} $
* $ H_{1}: CN_{treatment} \neq CN_{control} $


<i>Note: As could be seen in 2.1 when we loaded the data, "payments" (and strangely also "enrollments") were only tracked for 37 days (23+14 days) and not for 51 days (37+14 days), which would have been necessary in order to fully account for the 14-day trial period. Consequently, in our actual A/B test, the true sample size is lower (n_true = 423,525) than we initially aimed for (n = 699,532 (1.7)). However, at this point there is not much we can do about it other than taking it into consideration in our interpretations.</i>


In [19]:
#compute true sample size
true_sample_size = control.iloc[:23]["Pageviews"].sum()+experiment.iloc[:23]["Pageviews"].sum()
true_sample_size

423525

In [20]:
#create dataframe test_results
test_results = pd.DataFrame(columns=["CI_left", "CI_right", "d","stat sig?", "dmin", "pract rel?"], index=["CG", "CN"])

#set alpha
alpha = 0.05


#run two proportion z test for both metrics
for i,j in zip(["Enrollments", "Payments"],["CG", "CN"]):
    #compute sample conversion rates
    conv_control = control.iloc[:23][i].sum()/control.iloc[:23]["Clicks"].sum()
    conv_experiment = experiment.iloc[:23][i].sum()/experiment.iloc[:23]["Clicks"].sum()
    
    #compute observed difference between treatment and control conversion d
    test_results.at[j, "d"] = conv_experiment-conv_control
    
    #compute sample standard deviations
    S_control = (conv_control*(1-conv_control))**0.5
    S_experiment = (conv_experiment*(1-conv_experiment))**0.5
    
    #compute SE_pooled
    SE_pooled = (S_control**2/control.iloc[:23]["Clicks"].sum()+S_experiment**2/experiment.iloc[:23]["Clicks"].sum())**0.5
    
    #compute 95% confidence interval around observed difference d
    test_results.at[j, "CI_left"] = test_results.at[j, "d"]-(stats.norm.ppf(1-alpha/2)*SE_pooled)
    test_results.at[j, "CI_right"] = test_results.at[j, "d"]+(stats.norm.ppf(1-alpha/2)*SE_pooled)
    
    #check statistical significance
    if test_results.at[j, "CI_left"] <= 0 <= test_results.at[j, "CI_right"]:
        test_results.at[j, "stat sig?"] = "no"
    else:
        test_results.at[j, "stat sig?"] = "yes"
    
    #import dmin
    test_results.at[j, "dmin"] = md.loc[j]["dmin"]
    
    
    #check if practical relevant
    #check if dmin is positive or negative
    if test_results.at[j, "dmin"] >= 0:
        #check if d is larger than dmin and if dmin lies left of the confidence interval around d
        if test_results.at[j, "d"] > test_results.at[j, "dmin"] and test_results.at[j, "CI_left"] > test_results.at[j, "dmin"]:
                test_results.at[j, "pract rel?"] = "yes"
        else:
            test_results.at[j, "pract rel?"] = "no"
    else:
        #check if d is smaller than dmin and if dmin lies right of the confidence interval around d
        if test_results.at[j, "d"] < test_results.at[j, "dmin"] and test_results.at[j, "dmin"] > test_results.at[j, "CI_right"]:
                test_results.at[j, "pract rel?"] = "yes"
        else:
            test_results.at[j, "pract rel?"] = "no"

#return results
test_results

Unnamed: 0,CI_left,CI_right,d,stat sig?,dmin,pract rel?
CG,-0.02912,-0.01199,-0.020555,yes,-0.01,yes
CN,-0.011604,0.001857,-0.004874,no,0.0075,no
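
The interval calculation from the cell above can be isolated in a small function. The sketch below applies it to gross conversion using only the five days of clicks and enrollments printed in 2.1, so it illustrates the mechanics and is not a substitute for the 23-day result; `diff_confidence_interval` is a hypothetical helper name.

```python
from scipy.stats import norm


def diff_confidence_interval(x_control, n_control, x_experiment, n_experiment,
                             alpha=0.05):
    """Unpooled CI around the observed difference of two proportions.

    Returns (CI_left, CI_right, d) with d = p_experiment - p_control.
    """
    p_c = x_control / n_control
    p_e = x_experiment / n_experiment
    d = p_e - p_c
    # Standard error of the difference, each group keeping its own variance
    se = (p_c * (1 - p_c) / n_control + p_e * (1 - p_e) / n_experiment) ** 0.5
    margin = norm.ppf(1 - alpha / 2) * se
    return d - margin, d + margin, d


# First five days only, copied from the head() outputs in 2.1
clicks_control = sum([687, 779, 909, 836, 837])
clicks_experiment = sum([686, 785, 884, 827, 832])
enrollments_control = sum([134, 147, 167, 156, 163])
enrollments_experiment = sum([105, 116, 145, 138, 140])

ci_left, ci_right, d = diff_confidence_interval(
    enrollments_control, clicks_control,
    enrollments_experiment, clicks_experiment)

# Statistically significant if 0 lies outside the interval
stat_sig = not (ci_left <= 0 <= ci_right)
print(round(ci_left, 4), round(ci_right, 4), round(d, 4), stat_sig)
```

Comparing the returned bounds with dmin, as done in the cell above, then gives the practical relevance check.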


While Udacity suggests conducting an additional sign test to double-check the results, we will forgo this test as the traditional sign test assumes dependent samples. Instead, we will move straight to the interpretation of our results.

# 3. Interpretation of Results and Recommendations

Gross conversion: the observed gross conversion in the treatment group is around 2.06 percentage points lower than the gross conversion observed in the control group. Further, all values within the confidence interval are negative, so the data are most compatible with a negative effect. Lastly, this effect appears to be practically relevant, as the entire interval lies below dmin (-0.01), the minimum effect size to be considered relevant for the business.

Net conversion: While we cannot reject the null hypothesis for this test, we see that the observed net conversion in the treatment group is around 0.49 percentage points lower than the net conversion observed in the control group. Further, the values that are considered most reasonably compatible with the data range from -1.16 to 0.19 percentage points.

Given these results, we can assume that the introduction of the "Free Trial Screener" may indeed help to set clearer expectations for students upfront. However, the results are less compatible with the assumption that the decrease in gross conversion is entirely absorbed by an improvement in the overall student experience, and still less compatible with dmin (net conversion), the minimum effect size to be considered relevant for the business. Consequently, assuming that Udacity has a fair interest in increasing revenues, we would recommend not rolling out the "Free Trial Screener" feature.

That being said, as outlined in 1.3, the feature may increase the total number of people who opt for the freely available materials. If true, and assuming a steady conversion rate among users who first learn with the freely accessible materials and then upgrade, the feature may still help to increase net conversion. However, this effect, if it occurs at all, is more likely to materialize over a longer time period and, hence, would require a test with a longer timeframe.