## Introduction

This is the final project for Udacity's A/B Testing online course. In this project, we will consider an actual experiment that was run by Udacity. The specific numbers have been changed, but the patterns have not.

## I. Context

**Two existing options** <br/>
Udacity is an online tech education platform. It provides courses with videos, reading materials and coaching support for students.
<br/>
At the time of this experiment, Udacity courses had two options on the course overview page: "start free trial", and "access course materials".
<br/>
1. If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first.
<br/>
2. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.
<br/>

**New feature** <br/>
In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. 

**The goal of this change is to improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.**<br/>
Specifically, we want to test if this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course.

## II. Experiment Setup

### 1. Unit of diversion

Before the students enroll in the free trial, the unit of diversion is "cookie". Afterwards, they are tracked by user-ids. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-ids are not tracked in the experiment, even if they were signed in when they visited the course overview page.

### 2. Funnel for Enrollment

The plot below visualizes the whole enrollment process as a funnel. The black funnel on the left-hand side represents the process before the change, whereas the green funnel on the right shows the process with the new feature added. Potential metrics to measure are labeled in the plot as well.
<br/>
![title](UdacityFunel.jpg)

**Potential metrics to use and the minimum effect size to be considered relevant for business (dmin):**
1. Number of cookies: C - number of unique cookies to view the course overview page. (dmin=3000)
2. Number of user-ids: ID - number of users who enroll in the free trial. (dmin=50)
3. Number of clicks: CL - number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
4. Click-through-probability: CTP = CL/C - number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
5. Gross conversion: GC = Enrollments/CL - number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin=0.01)
6. Retention: R = Payments/Enrollments - number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
7. Net conversion: NC = Payments/CL - number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin=0.0075)
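
To make the ratio definitions concrete, here is a minimal sketch that computes them from the baseline daily counts used later in this notebook. The `payments` figure is not given directly; it is inferred here from the stated baseline retention of 0.53, so treat it as an assumption of this example.

```python
# Baseline daily counts taken from the baseline table in section III.1.
cookies = 40000      # C: unique cookies viewing the course overview page
clicks = 3200        # CL: unique cookies clicking "Start free trial"
enrollments = 660    # ID: user-ids enrolling in the free trial

# Assumption: payments is back-calculated from the baseline retention R = 0.53.
payments = 0.53 * enrollments

ctp = clicks / cookies                  # click-through-probability
gross_conversion = enrollments / clicks  # GC = Enrollments / CL
retention = payments / enrollments       # R  = Payments / Enrollments
net_conversion = payments / clicks       # NC = Payments / CL

print(f"CTP={ctp:.4f}  GC={gross_conversion:.4f}  "
      f"R={retention:.4f}  NC={net_conversion:.4f}")

# NC is the product of GC and R because the enrollment count cancels out.
print(abs(net_conversion - gross_conversion * retention) < 1e-12)
```

Because Enrollments cancels in the product, NC = GC × R holds by construction, which is why any two of the three ratios determine the third.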

### 3. Possible Hypotheses for the Experiment

Before we choose metrics to evaluate the experiment, we should first think about the hypotheses we want to test.
<br/>
I set up 3 groups of hypotheses that could be tested with the given data. I will further discuss the final hypotheses chosen for testing in III-3.
<br/>
- Group A: about the new feature's effect on the share of users who enroll in the free trial after clicking
    - H0: GC(experiment) = GC(control)
    - H1: GC(experiment) ≠ GC(control)
<br/>
- Group B: about the new feature's effect on the share of users who end up making a payment after clicking
    - H0: NC(experiment) = NC(control)
    - H1: NC(experiment) ≠ NC(control)
<br/>
- Group C: about the new feature's effect on the share of enrolled users who make a payment after the free trial
    - H0: R(experiment) = R(control)
    - H1: R(experiment) ≠ R(control)

### 4. Metrics

**Evaluation Metrics:**
<br/>
For evaluation metrics, we want them to be sensitive enough to capture the changes we want to test, and at the same time robust enough that they are not affected by irrelevant factors.
<br/>
Given the context of this experiment and the hypotheses, I choose the following 3 metrics as evaluation metrics:
- Gross Conversion rate (GC): we should expect a decrease in GC, since we want to use the new feature to filter out students who are not likely to invest enough time in the course. In this case, fewer students will enroll in the free trial, given that some of them are advised to access the materials without enrolling.
<br/>
- Retention rate (R): retention should increase because students who tend to churn (those who cannot study for enough time) are likely to be filtered out before enrollment.
<br/>
- Net Conversion rate (NC): this ratio is simply the product of the above two metrics. From the business perspective, we want this ratio to increase.

**Invariant Metrics:**
<br/>
Invariant metrics are the ones that should stay unchanged between the experiment and control groups. We can use them to verify that the experiment is set up properly by running a sanity check on the results.
<br/>
Theoretically, the following 3 metrics should be unaffected by the change, because they are all measured before the free trial screener is triggered.
<br/>
- Number of cookies: C
- Number of clicks: CL
- Click-through-probability: CL/C

## III. Experiment

In [1]:
# Import all the packages needed
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest
pd.set_option('display.float_format', lambda x: '%.4f' % x)

### 1. Measuring Variability for Baseline Values

The baseline values for metrics mentioned above are provided and stored in the dataframe below.

In [2]:
# Storing baseline data
baseline_data = {"Metric": ["C", "CL", "ID", "CTP", "GC", "R", "NC"], 
                 "Estimator": [40000, 3200, 660, 0.08, 0.20625, 0.53, 0.109313],
                 "dmin": [3000, 240, -50, 0.01, -0.01, 0.01, 0.0075]}
df = pd.DataFrame(data=baseline_data, index = ["C", "CL", "ID", "CTP", "GC", "R", "NC"])
df

Unnamed: 0,Metric,Estimator,dmin
C,C,40000.0,3000.0
CL,CL,3200.0,240.0
ID,ID,660.0,-50.0
CTP,CTP,0.08,0.01
GC,GC,0.2062,-0.01
R,R,0.53,0.01
NC,NC,0.1093,0.0075


The sample size given is 5000 cookies. In this case, we first need to scale the baseline data, including number of cookies, number of clicks and number of user-ids.

In [3]:
# Calculate the scaling factor according to the number of cookies
scale_factor = 5000 / df.loc['C']['Estimator']

# Create a new column for scaled estimators
df['ScaledEst'] = np.nan

# Scale count metrics
for m in ['C', 'CL', 'ID']:
    df.at[m, 'ScaledEst'] = df.loc[m]['Estimator'] * scale_factor

df

Unnamed: 0,Metric,Estimator,dmin,ScaledEst
C,C,40000.0,3000.0,5000.0
CL,CL,3200.0,240.0,400.0
ID,ID,660.0,-50.0,82.5
CTP,CTP,0.08,0.01,
GC,GC,0.2062,-0.01,
R,R,0.53,0.01,
NC,NC,0.1093,0.0075,


**Assumptions and Computation:**
<br/>
We can assume that each of our three chosen evaluation metrics follows a binomial distribution. Also, we know that the unit of analysis (denominator) for each metric is the same as the unit of diversion (CL for GC and NC, ID for R). (Details about why a difference between the unit of diversion and the unit of analysis affects the calculation of variability can be found [here](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36500.pdf).) Given these two facts, we can calculate the standard error for each metric analytically, using the formula: $\sqrt{\frac{\hat{p}\times (1 - \hat{p})}{n}}$
Here, the standard error is an estimate of how far the sample proportion is likely to be from the population proportion.
<br/>
Due to the Central Limit Theorem, as sample size is relatively large in each case, we can assume that the sampling distribution of a sample proportion approaches a normal distribution. 

In [4]:
# Create a new column to store standard errors
df["SE"] = np.nan

# Define function to calculate the standard error of a proportion
def standardError (n, p):
    '''Return the standard error of a sample proportion.

    n = sample size, p = probability
    '''
    return (p*(1-p)/n)**0.5

# Calculating standard errors for evaluation metrics
for m in ['GC', 'NC']:
    df.at[m, 'SE'] = standardError(df.loc['CL']['ScaledEst'], df.loc[m]['Estimator']) 
    
df.at['R', 'SE'] = standardError(df.loc['ID']['ScaledEst'], df.loc['R']['Estimator'])
df

Unnamed: 0,Metric,Estimator,dmin,ScaledEst,SE
C,C,40000.0,3000.0,5000.0,
CL,CL,3200.0,240.0,400.0,
ID,ID,660.0,-50.0,82.5,
CTP,CTP,0.08,0.01,,
GC,GC,0.2062,-0.01,,0.0202
R,R,0.53,0.01,,0.0549
NC,NC,0.1093,0.0075,,0.0156


### 2. Determine Sample Size for Experiment

Set alpha level to 0.05, and statistical power to 0.80 (beta = 0.20). <br/>
The required sample size is calculated using the online sample size calculator by [Evan Miller](https://www.evanmiller.org/ab-testing/sample-size.html). Keep in mind that the result is the sample size **per group**. Therefore, we multiply the result by 2, since we have one control group and one experiment group, to get the total sample size required.<br/>
Further, we want to express the experiment sample size in terms of cookies that visit the page. Thus, we also need to account for the fact that the units of analysis for our evaluation metrics are clicks (for GC and NC) and user-ids (for R), respectively.<br/>
Hence, the total sample size for GC and NC is $\frac{n}{CTP}\times{2}$, and for R it is $\frac{\frac{n}{CTP}}{GC}\times{2}$.
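
As a cross-check on the online calculator, the per-group sample size can also be approximated analytically. The sketch below uses a standard two-proportion power formula; the function names `sample_size_per_group` and `pageviews_needed` are hypothetical helpers introduced only for this illustration, and the results should land close to, but not necessarily exactly on, the calculator's values (the calculator may treat the alternative proportion slightly differently).

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_group(p, d_min, alpha=0.05, power=0.80):
    """Approximate per-group n to detect an absolute change d_min from baseline p."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = norm.ppf(power)            # quantile for the desired power
    p_alt = p + abs(d_min)              # assumed alternative proportion
    sd_null = sqrt(2 * p * (1 - p))
    sd_alt = sqrt(p * (1 - p) + p_alt * (1 - p_alt))
    return ceil((z_alpha * sd_null + z_beta * sd_alt) ** 2 / d_min ** 2)

def pageviews_needed(n_per_group, ctp, gc=None):
    """Convert a per-group n (clicks, or enrollments if gc is given) to total pageviews."""
    units_per_pageview = ctp if gc is None else ctp * gc
    return 2 * n_per_group / units_per_pageview

n_gc = sample_size_per_group(0.20625, 0.01)
n_nc = sample_size_per_group(0.109313, 0.0075)
n_r = sample_size_per_group(0.53, 0.01)

print("GC:", n_gc, round(pageviews_needed(n_gc, 0.08)))
print("NC:", n_nc, round(pageviews_needed(n_nc, 0.08)))
print("R: ", n_r, round(pageviews_needed(n_r, 0.08, gc=0.20625)))
```

The retention requirement is by far the largest because its unit of analysis (enrollments) is only a small fraction of pageviews.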

In [5]:
# Store the results
df['total_n'] = np.nan
df.at['GC', 'total_n'] = round(25830 / 0.08 * 2)
df.at['NC', 'total_n'] = round(27411 / 0.08 * 2)
df.at['R', 'total_n'] = round(39115 / 0.08 / 0.2063 * 2)

df

Unnamed: 0,Metric,Estimator,dmin,ScaledEst,SE,total_n
C,C,40000.0,3000.0,5000.0,,
CL,CL,3200.0,240.0,400.0,,
ID,ID,660.0,-50.0,82.5,,
CTP,CTP,0.08,0.01,,,
GC,GC,0.2062,-0.01,,0.0202,645750.0
R,R,0.53,0.01,,0.0549,4740063.0
NC,NC,0.1093,0.0075,,0.0156,685275.0


Given our calculations, we would need around 645,750 pageviews (cookies) to test the hypothesis in Group A and a total of 685,275 pageviews for Group B. If we want to test Group C, we will need 4,740,063 pageviews to conduct the experiment!

### 3. Experiment Duration

According to the description, we can assume that no other experiment is running at the same time and that up to 100% of the traffic can be diverted to our experiment. Let's first calculate the time we need to gather a large enough sample (the experiment duration) if we use a traffic diversion rate of 100%.

In [6]:
# Traffic diverted to the experiment: [0:1]
traffic_divertion = 1

# Days it will take to test Group A Hypothesis
days_GC = round(df.loc['GC']['total_n'] / 
                (df.loc['C']['Estimator']*traffic_divertion))
print(f'Days required for GC: {days_GC}')

# Days it will take to test Group A & Group B Hypotheses
days_GC_NC = round(df.loc['NC']['total_n'] / 
                   (df.loc['C']['Estimator']*traffic_divertion))
print(f'Days required for GC & NC: {days_GC_NC}')

# Days it will take to test Group A & Group B & Group C Hypothesis
days_GC_NC_R = round(df.loc['R']['total_n'] / 
                     (df.loc['C']['Estimator']*traffic_divertion))
print(f'Days required for GC & NC & R: {days_GC_NC_R}')

Days required for GC: 16
Days required for GC & NC: 17
Days required for GC & NC & R: 119


To test the first 2 groups of hypotheses, we only need about 17 days, which seems acceptable. However, if we include the last hypothesis in Group C, we will need to run the experiment for 119 days! Such a long duration is risky because we cannot be sure that the treatment will not hurt users over such a long time and result in a loss for the business. <br/>
Also, recall that the retention rate is simply the ratio of Net to Gross Conversion rate, so we can get a reasonable estimate of R from the results for the first 2 metrics.<br/>
In this case, it's reasonable to use only GC and NC as evaluation metrics.

Then, the next question is: how much traffic should we divert to the experiment? The change involved doesn't look very bold, so it is probably safe to use a high traffic diversion rate, which also enables us to finish the test within a short period of time. <br/>
Looking back at the dataset provided by Udacity, it in fact took 37 days to collect 690,203 pageviews. Note, however, that we also need to wait 14 days for users to finish their free trial. Therefore, only the first 23 days (23 + 14 = 37) of data reflect the whole enrollment process. So, the traffic diversion is approximately 74%.<br/>
Now, let's use this diversion rate to re-calculate the experiment duration.

In [7]:
# Traffic diverted to the experiment: [0:1]
traffic_divertion = 0.74

# Days it will take to test Group A Hypothesis
days_GC = round(df.loc['GC']['total_n'] / 
                (df.loc['C']['Estimator']*traffic_divertion))
print(f'Days required for GC: {days_GC}')

# Days it will take to test Group A & Group B Hypotheses
days_GC_NC = round(df.loc['NC']['total_n'] / 
                   (df.loc['C']['Estimator']*traffic_divertion))
print(f'Days required for GC & NC: {days_GC_NC}')

# Days it will take to test Group A & Group B & Group C Hypothesis
days_GC_NC_R = round(df.loc['R']['total_n'] / 
                     (df.loc['C']['Estimator']*traffic_divertion))
print(f'Days required for GC & NC & R: {days_GC_NC_R}')

Days required for GC: 22
Days required for GC & NC: 23
Days required for GC & NC & R: 160


### 4. Data Analysis

In [8]:
# Load experiment data
control = pd.read_csv("Final Project Results - Control.csv") 
experiment = pd.read_csv("Final Project Results - Experiment.csv")

In [9]:
print(control.describe())
control.head()

       Pageviews   Clicks  Enrollments  Payments
count    37.0000  37.0000      23.0000   23.0000
mean   9339.0000 766.9730     164.5652   88.3913
std     740.2396  68.2868      29.9770   20.6502
min    7434.0000 632.0000     110.0000   56.0000
25%    8896.0000 708.0000     146.5000   70.0000
50%    9420.0000 759.0000     162.0000   91.0000
75%    9871.0000 825.0000     175.0000  102.5000
max   10667.0000 909.0000     233.0000  128.0000


Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [10]:
print(experiment.describe())
experiment.head()

       Pageviews   Clicks  Enrollments  Payments
count    37.0000  37.0000      23.0000   23.0000
mean   9315.1351 765.5405     148.8261   84.5652
std     708.0708  64.5784      33.2342   23.0608
min    7664.0000 642.0000      94.0000   34.0000
25%    8881.0000 722.0000     127.0000   69.0000
50%    9359.0000 770.0000     142.0000   91.0000
75%    9737.0000 827.0000     172.0000   99.0000
max   10551.0000 884.0000     213.0000  123.0000


Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


#### 1) Sanity Check

To ensure the experiment was conducted properly, we use the chosen invariant metrics (C, CL, CTP) for the sanity check. Theoretically, we should not see a significant difference between the control and experiment groups in these 3 metrics.<br/>
For C and CL, observations should be randomly assigned to either the control or the experiment group, so the probability of being in the experiment group should be 50%. That is, $\frac{\text{observations in experiment group}}{\text{total observations}} = 0.5$

In [11]:
# Build a dataframe for sanity check
df_sanity = pd.DataFrame(columns = ["CI_left", "CI_right", "Ratio","Passed?"], 
                         index = ['C', 'CL', 'CTP'])

for i,j in zip(["C", "CL"], ["Pageviews", "Clicks"]):
    # Count of the metric in the experiment group and across both groups
    n_experiment = experiment[j].sum()
    n_total = control[j].sum() + experiment[j].sum()
    
    # Confidence interval
    p = 0.5
    alpha = 0.05
    z_stat = stats.norm.ppf(1 - alpha/2)
    se = standardError(n_total, p)
    df_sanity.at[i, 'CI_left'] = p - z_stat*se
    df_sanity.at[i, 'CI_right'] = p + z_stat*se
    
    # The observed ratio in experiment
    ratio_obs = n_experiment / n_total
    df_sanity.at[i, 'Ratio'] = ratio_obs
    
    # Check if the observed ratio lies within the confidence interval
    result = df_sanity.loc[i]['Ratio']
    if df_sanity.loc[i]['CI_left'] <= result <= df_sanity.loc[i]['CI_right']:
        print(i, 'Yes')
    else:
        print(i, 'No')

df_sanity

C Yes
CL Yes


Unnamed: 0,CI_left,CI_right,Ratio,Passed?
C,0.4988,0.5012,0.4994,
CL,0.4959,0.5041,0.4995,
CTP,,,,
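
As an alternative view of the same sanity check, the count metrics can also be tested with an exact binomial test instead of a normal-approximation interval. This is only a sketch: the group totals below are not read from the CSV files but reconstructed from the `describe()` output above (mean × 37 days, rounded), which is an assumption of this example, and it relies on `scipy.stats.binomtest` (available in recent SciPy versions).

```python
from scipy.stats import binomtest

# Assumption: totals reconstructed from describe() as mean * count, rounded.
totals = {
    "C": (345543, 344660),   # (control pageviews, experiment pageviews)
    "CL": (28378, 28325),    # (control clicks, experiment clicks)
}

alpha = 0.05
p_values = {}
for metric, (n_control, n_experiment) in totals.items():
    # H0: each unit lands in the experiment group with probability 0.5
    result = binomtest(n_experiment, n_control + n_experiment, p=0.5)
    p_values[metric] = result.pvalue
    verdict = "passes" if result.pvalue > alpha else "fails"
    print(f"{metric}: p-value = {result.pvalue:.4f} -> {verdict} the sanity check")
```

A p-value above alpha means the observed split is consistent with a fair 50/50 assignment, which matches the confidence-interval check.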


For CTP, we can conduct a two-proportion Z-test on the number of clicks per pageview. The observed ratios in the control and experiment groups should not be significantly different. Specifically, we rely on the normal approximation to the sampling distribution of each proportion; under the null hypothesis of equal proportions, the test statistic uses a pooled estimate of the variance.<br/>
This time, I will try another approach and use the `proportions_ztest` function from statsmodels.

In [12]:
# Calculate the number of observations and counted clicks for each group and store results in numpy arrays
n_pageviews = np.array([control["Pageviews"].sum(), experiment["Pageviews"].sum()])
n_clicks = np.array([control["Clicks"].sum(), experiment["Clicks"].sum()])

# Calculate the test-statistic Z and corresponding p_value
z_statistic, p_value = proportions_ztest(n_clicks, n_pageviews, 
                                         value=0, alternative="two-sided", 
                                         prop_var=False)

print("Z-test-statistic: ", z_statistic)
print("p-value:" , p_value)

alpha = 0.05
if p_value > alpha:
    df_sanity.at["CTP", "Passed?"] = "Yes"
else:
    df_sanity.at["CTP", "Passed?"] = "No"

df_sanity

Z-test-statistic:  -0.08566094109242048
p-value: 0.9317359524473912


Unnamed: 0,CI_left,CI_right,Ratio,Passed?
C,0.4988,0.5012,0.4994,
CL,0.4959,0.5041,0.4995,
CTP,,,,Yes


**p-value > alpha=0.05, we cannot reject the null hypothesis that ratios in two groups are the same.**

**Great! All invariant metrics passed the sanity check.**
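
For readers who want to see what the two-proportion Z-test for CTP computes, here is a minimal hand-rolled sketch using a pooled variance estimate. The totals are again reconstructed from the `describe()` output (mean × 37 days, rounded) rather than loaded from the CSV files, so they are an assumption of this example.

```python
from math import sqrt
from scipy.stats import norm

# Assumption: totals reconstructed from describe() as mean * count, rounded.
clicks_control, clicks_experiment = 28378, 28325
views_control, views_experiment = 345543, 344660

ctp_control = clicks_control / views_control
ctp_experiment = clicks_experiment / views_experiment

# Pooled proportion under H0: both groups share the same CTP
p_pooled = (clicks_control + clicks_experiment) / (views_control + views_experiment)
se = sqrt(p_pooled * (1 - p_pooled) * (1 / views_control + 1 / views_experiment))

# Same sign convention as the cell above: control minus experiment
z = (ctp_control - ctp_experiment) / se
p_value = 2 * norm.sf(abs(z))   # two-sided p-value

print(f"z = {z:.4f}, p-value = {p_value:.4f}")
```

The statistic should agree closely with the `proportions_ztest` output above, which suggests the library call is doing the same pooled computation.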

#### 2) Test Analysis

**Correction for multiple hypotheses?**
<br/>
We now have more than one hypothesis, so the chance of getting a false positive increases. However, our metrics are not fully independent, which is why the true probability of a false positive should still be lower than 9.75% (the value for two independent metrics: 1 - 0.95 * 0.95). We could use the Bonferroni correction, but then we could easily end up with more false negatives. Given that the increase in the probability of false positives is mild, I choose not to correct the alpha here.
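
The arithmetic behind this trade-off is short enough to sketch directly. The snippet assumes two independent tests as the worst case discussed above; the variable names are illustrative only.

```python
alpha = 0.05   # per-test significance level
n_tests = 2    # GC and NC

# Probability of at least one false positive if the tests were independent
fwer_independent = 1 - (1 - alpha) ** n_tests

# Per-test alpha that the Bonferroni correction would use instead
bonferroni_alpha = alpha / n_tests

print(f"Family-wise error rate (independent tests): {fwer_independent:.4f}")
print(f"Bonferroni-corrected alpha per test: {bonferroni_alpha:.4f}")
```

Halving alpha would widen each confidence interval, which is the source of the extra false negatives mentioned above.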

In [13]:
# Create dataframe for test results
df_test = pd.DataFrame(columns=["CI_left", "CI_right", 
                                "Diff","SatSig?", "dmin", "Pract?"], 
                       index=["GC", "NC"])

# Set alpha
alpha = 0.05

# Two proportion Z-test for both metrics
for i,j in zip(["Enrollments", "Payments"],["GC", "NC"]):
    # Compute sample conversion rates
    obs_control = control.iloc[:23][i].sum()/control.iloc[:23]["Clicks"].sum()
    obs_experiment = experiment.iloc[:23][i].sum()/experiment.iloc[:23]["Clicks"].sum()
    
    # Compute observed difference between experiment and control conversion d
    df_test.at[j, "Diff"] = obs_experiment - obs_control
    
    # Compute the per-observation standard deviation in each group
    sd_control = (obs_control*(1 - obs_control))**0.5
    sd_experiment = (obs_experiment*(1 - obs_experiment))**0.5
    
    # Compute the standard error of the difference between the two proportions
    # (this is the unpooled estimate, despite the variable name)
    se_pooled = (sd_control**2/control.iloc[:23]["Clicks"].sum()
                 +sd_experiment**2/experiment.iloc[:23]["Clicks"].sum())**0.5
    
    # Compute 95% confidence interval around observed difference d
    df_test.at[j, "CI_left"] = df_test.at[j, "Diff"]-(
        stats.norm.ppf(1-alpha/2)*se_pooled)
    df_test.at[j, "CI_right"] = df_test.at[j, "Diff"]+(
        stats.norm.ppf(1-alpha/2)*se_pooled)
    
    # Check statistical significance
    if df_test.at[j, "CI_left"] <= 0 <= df_test.at[j, "CI_right"]:
        df_test.at[j, "SatSig?"] = "No"
    else:
        df_test.at[j, "SatSig?"] = "Yes"
    
    # Import dmin
    df_test.at[j, "dmin"] = df.loc[j]["dmin"]

    # Check practical significance; the direction depends on the sign of dmin
    effect = df_test.at[j, "dmin"]
    if effect >= 0:
        # d must be larger than dmin, and
        # dmin must lie to the left of the confidence interval around d
        if df_test.at[j, "Diff"] > effect and df_test.at[j, "CI_left"] > effect:
            df_test.at[j, "Pract?"] = "Yes"
        else:
            df_test.at[j, "Pract?"] = "No"
    else:
        # d must be smaller than dmin, and
        # dmin must lie to the right of the confidence interval around d
        if df_test.at[j, "Diff"] < effect and effect > df_test.at[j, "CI_right"]:
            df_test.at[j, "Pract?"] = "Yes"
        else:
            df_test.at[j, "Pract?"] = "No"

#return results
df_test

Unnamed: 0,CI_left,CI_right,Diff,SatSig?,dmin,Pract?
GC,-0.0291,-0.012,-0.0206,Yes,-0.01,Yes
NC,-0.0116,0.0019,-0.0049,No,0.0075,No


### 5. Interpretation of Experiment Result

We observe a decrease in Gross Conversion rate, which is both statistically significant and larger than the minimum effect size considered meaningful for the business. It shows that the "Free Trial Screener" feature does filter out some users by setting clearer expectations for students before they enroll in the course.<br/>
However, we fail to reject the null hypothesis in Group B, since the change in Net Conversion rate is neither statistically significant nor larger than the minimum practical effect size. That is, the decrease in the percentage of users who enroll in the free trial is not more than offset by a higher Retention rate among students who are more likely to invest enough time in learning. In fact, we observe a slight decrease in Net Conversion, whereas we set a positive dmin for NC, expecting that the new feature would improve the user experience without significantly reducing the number of students who continue past the free trial and eventually complete the course. Based on the data we have, we would recommend that Udacity not launch this new feature if its aim is to increase revenue from course payments. <br/>
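
To put a rough number on the retention effect mentioned above, the observed retention in each group can be computed descriptively. This is a sketch under stated assumptions: the totals are reconstructed from the `describe()` output (mean × 23 days, rounded), and retention was not one of the final evaluation metrics because the experiment was not sized for it, so the interval should be read as indicative only.

```python
from math import sqrt
from scipy.stats import norm

# Assumption: totals reconstructed from describe() as mean * count, rounded.
enroll_control, pay_control = 3785, 2033
enroll_experiment, pay_experiment = 3423, 1945

r_control = pay_control / enroll_control
r_experiment = pay_experiment / enroll_experiment
diff = r_experiment - r_control

# Unpooled standard error of the difference, as in the GC/NC analysis
se = sqrt(r_control * (1 - r_control) / enroll_control
          + r_experiment * (1 - r_experiment) / enroll_experiment)
z_crit = norm.ppf(0.975)
ci = (diff - z_crit * se, diff + z_crit * se)

print(f"Retention control:    {r_control:.4f}")
print(f"Retention experiment: {r_experiment:.4f}")
print(f"Difference: {diff:.4f}, 95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```

A higher observed retention in the experiment group is consistent with the screener filtering out students who would have cancelled, even though it does not translate into a higher Net Conversion rate.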

**Further thoughts...** <br/>
The data were collected over a fairly short period of time. What will happen if Udacity chooses to add this new feature? The "Free Trial Screener" is a pretty simple mechanism, and it's reasonable to assume that many users will figure it out and make an unrealistic "promise" in the screener. If that happens, we may see a further decrease in the Retention rate, which makes it hard to tell whether the Net Conversion rate will decrease. <br/>
But over a longer time period, if we assume a steady conversion rate among users who start with the freely accessible course materials and later make a payment, an increase in the number of people who start with the free materials may also bring a higher Net Conversion rate. Unfortunately, we don't have data in this experiment to test this effect.