In [1]:
import pandas as pd
from scipy.stats import norm
import numpy as np

# Summary 

A/B testing is a methodology for testing online how a new feature or product performs. Udacity wants to improve the experience of enrolled students, and the capacity of coaches to support students who are likely to complete a course, by adding a prompt that asks about available study time (the Free Trial Screener). This project analyzes two versions of Udacity’s website (an experiment group and a control group) to determine whether the new feature should be launched in order to reduce the number of students who leave the free trial because they lack the time. The project covers metric choice, measuring variability, sizing, sanity checks, analysis, a recommendation, and a follow-up experiment. The data used in this project was provided by Udacity.

# Experiment Design 

#### Null Hypothesis: 
The screener does not significantly change the number of students who leave the free trial because they don’t have enough time.
#### Alternative Hypothesis: 
The screener significantly reduces the free trial cancellation rate.


## Metrics Choice 

### Invariant Metrics:
Number of Cookies: The number of unique cookies to view the course overview page. 

Number of Clicks: The number of unique cookies to click the “Start Free Trial” button (which happens before the free trial screener is triggered). 

Click-Through-Probability: The number of unique cookies to click the “Start Free Trial” button divided by the number of unique cookies to view the course overview page. 

These three metrics are not affected by the screener because the click on the “Start Free Trial” button happens before the screener is shown, so users in both groups have seen the same pages up to that point. Also, the cookie is the unit of diversion, which means cookies should be split randomly and evenly between the experiment group and the control group. Therefore, these are good invariant metrics but not evaluation metrics.


### Evaluation Metrics:

Gross Conversion: The number of user-ids to complete checkout and enroll in the free trial divided by the number of unique cookies to click the “Start” button.

Net Conversion: The number of user-ids to remain enrolled past the 14-day boundary divided by the number of unique cookies to click the “Start” button.

Retention: The number of user-ids to remain enrolled past the 14-day boundary divided by the number of user-ids to complete checkout. Retention is expected to be higher in the experiment group than in the control group. 

Our goal is to reduce the number of users who cannot continue the course because of a lack of time, while improving the learning experience. Gross conversion tracks the share of users who enroll in the free trial, and Net conversion tracks the share who remain in the course past the trial. We use the two together as evaluation metrics: Gross conversion is expected to decrease, and Net conversion is expected to increase or stay the same in the experiment group.
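
As a minimal sketch of the three ratio definitions above, the helper below computes them from raw counts. The function name `conversion_metrics` is hypothetical, and it assumes that the `Payments` count stands for user-ids who remained enrolled past the 14-day boundary. The sample counts are the first control-group row shown later in this notebook.

```python
def conversion_metrics(clicks, enrollments, payments):
    """Return (gross conversion, net conversion, retention) from raw counts.

    Assumption: `payments` counts user-ids still enrolled after 14 days.
    """
    gross = enrollments / clicks        # enrolled in the free trial / clicked "Start"
    net = payments / clicks             # stayed past 14 days / clicked "Start"
    retention = payments / enrollments  # stayed past 14 days / enrolled
    return gross, net, retention


# First control-group day from the data: 687 clicks, 134 enrollments, 70 payments
gross, net, retention = conversion_metrics(clicks=687, enrollments=134, payments=70)
print(f"gross={gross:.4f}, net={net:.4f}, retention={retention:.4f}")
```

Note that retention equals net conversion divided by gross conversion, which is why its denominator (enrollments) is so much smaller than the other two.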


### Unused Metrics:

Number of User-ids: The number of users who enroll in the free trial. The screener is expected to reduce enrollments, so user-ids will not be evenly distributed between the control and experiment groups, which rules this out as an invariant metric. It could serve as an evaluation metric, but it is a raw count that is not normalized by the number of clicks in each group. Gross conversion is normalized and therefore more robust, so we choose Gross conversion instead.


## Measuring Standard Deviation

The baseline information:

In [2]:
baseline = {"Cookies":40000,"Clicks":3200,"Enrollment": 660, "CTP":0.08, "GrossConversion":0.20625,"Retention": 0.53,"NetConversion":0.109313}

In [3]:
baseline

{'Cookies': 40000,
 'Clicks': 3200,
 'Enrollment': 660,
 'CTP': 0.08,
 'GrossConversion': 0.20625,
 'Retention': 0.53,
 'NetConversion': 0.109313}

Given a sample of 5000 cookies viewing the course overview page, we scale the baseline counts accordingly:

In [4]:
baseline['Cookies'] = 5000
baseline['Clicks'] = baseline['Cookies'] * baseline['CTP']
baseline['Enrollment'] = baseline['Clicks'] * baseline['GrossConversion']

In [5]:
baseline

{'Cookies': 5000,
 'Clicks': 400.0,
 'Enrollment': 82.5,
 'CTP': 0.08,
 'GrossConversion': 0.20625,
 'Retention': 0.53,
 'NetConversion': 0.109313}

### Gross Conversion

In [6]:
np.sqrt(baseline['GrossConversion']*(1-baseline['GrossConversion'])/baseline['Clicks'])

0.020230604137049392

###  Retention

In [7]:
np.sqrt(baseline['Retention']*(1-baseline['Retention'])/baseline['Enrollment'])

0.054949012178509081

### Net Conversion

In [8]:
np.sqrt(baseline['NetConversion']*(1-baseline['NetConversion'])/baseline['Clicks'])

0.015601575884425905
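
The three cells above all apply the same analytic estimate, the binomial standard error $\sqrt{p(1-p)/N}$, where $N$ is the number of units in the metric's denominator (clicks for the two conversions, enrollments for retention). A minimal sketch that collects them in one place follows; the helper name `analytic_sd` is hypothetical, and the baseline values are copied from the dictionary above.

```python
import math


def analytic_sd(p, n):
    """Binomial standard error of a proportion p estimated from n units."""
    return math.sqrt(p * (1 - p) / n)


cookies = 5000
clicks = cookies * 0.08          # CTP of 0.08
enrollments = clicks * 0.20625   # gross conversion of 0.20625

sd = {
    "GrossConversion": analytic_sd(0.20625, clicks),
    "Retention": analytic_sd(0.53, enrollments),
    "NetConversion": analytic_sd(0.109313, clicks),
}
for name, value in sd.items():
    print(f"{name}: {value:.4f}")
```

The analytic estimate is most trustworthy when the unit of analysis matches the unit of diversion (cookies), which holds for the two conversions but not for retention.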

## Sizing

### Number of Samples VS Power

I didn’t use the Bonferroni correction because these metrics are correlated and likely to move together, which would make the correction too conservative. 

Given alpha = 0.05 and beta = 0.2 (80% power), we can calculate the sample size each metric needs with this [calculator](http://www.evanmiller.org/ab-testing/sample-size.html).

| Metrics | p | d(min) | Sample Size (per group) | Pageviews |
| --- | --- | --- | --- | --- |
| Gross Conversion | 0.20625 | 0.01 | 25835 | 645875 |
| Retention | 0.53 | 0.01 | 39115 | 4741212 |
| Net Conversion | 0.1093125 | 0.0075 | 27413 | 685325 |


Pageviews required is the maximum of pageviews of the selected evaluation metrics. Therefore, the pageviews we need is 4741212.
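
The Pageviews column can be reproduced from the Sample Size column. The sketch below assumes the calculator's sample size is per group and is counted in the metric's own denominator unit: clicks for the two conversions, enrollments for retention. The helper name `pageviews_needed` is hypothetical.

```python
CTP = 0.08                  # clicks per pageview (baseline)
GROSS_CONVERSION = 0.20625  # enrollments per click (baseline)


def pageviews_needed(sample_size_per_group, units_per_pageview):
    """Pageviews for both groups, given how many denominator units one pageview yields."""
    # round() reproduces the table; rounding up would be the stricter choice
    return round(2 * sample_size_per_group / units_per_pageview)


pageviews = {
    "GrossConversion": pageviews_needed(25835, CTP),
    "Retention": pageviews_needed(39115, CTP * GROSS_CONVERSION),
    "NetConversion": pageviews_needed(27413, CTP),
}
print(pageviews)
print("required:", max(pageviews.values()))
```

Retention needs far more pageviews because only about 1.65% of pageviews (0.08 × 0.20625) end in an enrollment.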


### Duration vs. Exposure

If we divert 100% of traffic, we need 119 days to test the retention metric but only 17 days for gross conversion and 18 days for net conversion. 119 days is far too long to run an experiment. If we drop the retention metric, we need only 18 days at 100% diversion, or 35 days at 50% diversion. 

In general, this screener may reduce the number of users who enroll. Since we do not want to expose 100% of the traffic to the experiment, I would divert 75% of traffic for 23 days.
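
The duration arithmetic can be sketched as follows, assuming the baseline of 40000 pageviews per day and rounding up to whole days. The helper name `days_needed` is hypothetical.

```python
import math


def days_needed(pageviews, fraction=1.0, daily_pageviews=40000):
    """Whole days required to collect `pageviews` when diverting `fraction` of traffic."""
    return math.ceil(pageviews / (daily_pageviews * fraction))


# Pageviews required for net conversion, the larger of the two retained metrics
net_conversion_pageviews = 685325
for fraction in (1.0, 0.75, 0.5):
    days = days_needed(net_conversion_pageviews, fraction)
    print(f"{fraction:.0%} diversion: {days} days")
```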


# Experiment Analysis 

## Sanity Checks

For the three invariant metrics chosen at the beginning, we expect traffic to be diverted equally into the experiment and control groups. For each metric we compute a 95% confidence interval around the expected value and check that the observed value falls inside it. 

In [9]:
cont = pd.read_csv("Final Project Results - Control.csv")
exp = pd.read_csv("Final Project Results - Experiment.csv")

In [10]:
cont.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [11]:
exp.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


####  Number of Cookies in Course Overview Page

In [12]:
def zscore(alpha):
    return norm.ppf(alpha)

In [13]:
p = 0.5
alpha =0.05
sd = np.sqrt(p *(1-p)/(cont.Pageviews.sum() + exp.Pageviews.sum()))
print("The lower bound is: ", p-zscore(1-alpha/2)*sd)
print("The upper bound is: ", p+zscore(1-alpha/2)*sd)
print("The observed value is: ",cont.Pageviews.sum()/(exp.Pageviews.sum()+ cont.Pageviews.sum()) )

The lower bound is:  0.498820413825
The upper bound is:  0.501179586175
The observed value is:  0.500639666881


The observed p is inside this range, which means the test passes for this metric.

#### Number of Cookies that Clicked the Free Trial Button

In [14]:
p = 0.5
alpha =0.05
sd = np.sqrt(p *(1-p)/(cont.Clicks.sum() + exp.Clicks.sum()))
print("The lower bound is: ", p-zscore(1-alpha/2)*sd)
print("The upper bound is: ", p+zscore(1-alpha/2)*sd)
print("The observed value is: ",cont.Clicks.sum()/(exp.Clicks.sum()+ cont.Clicks.sum()) )

The lower bound is:  0.495884571347
The upper bound is:  0.504115428653
The observed value is:  0.500467347407


The observed p is inside this range, which means the test passes for this metric.

#### Click-Through Probability of the Free Trial Button

In [15]:
p = (exp.Clicks.sum()+ cont.Clicks.sum())/(exp.Pageviews.sum()+ cont.Pageviews.sum())
alpha =0.05
sd = np.sqrt(p *(1-p)*(1/cont.Pageviews.sum() + 1/exp.Pageviews.sum()))
print("The lower bound is: ", -zscore(1-alpha/2)*sd)
print("The upper bound is: ", +zscore(1-alpha/2)*sd)
print("The observed value is: ",exp.Clicks.sum()/exp.Pageviews.sum()- cont.Clicks.sum()/cont.Pageviews.sum())

The lower bound is:  -0.00129565539024
The upper bound is:  0.00129565539024
The observed value is:  5.66270915869e-05


All three invariant metrics pass the sanity checks ([find here](https://docs.google.com/spreadsheets/d/14BpnIZ7t4qRmX-G7W89yZOgNfyje3aNWzpA4OVWNitM/edit#gid=0)).


| Invariant Metrics | Lower Bound | Upper Bound | Observed | Pass? |
| --- | --- | --- | --- | --- |
| # of Cookies | 0.498820 | 0.501180 | 0.500640 | Yes |
| # of Clicks | 0.495885 | 0.504115 | 0.500467 | Yes |
| Click Through Probability | -0.001296 | 0.001296 | 0.000057 | Yes |
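
The two count-based checks repeat the same steps, so they could be wrapped in one helper. This is a sketch rather than the notebook's own code: the name `count_sanity_check` is hypothetical, the standard library's `statistics.NormalDist().inv_cdf` stands in for `norm.ppf`, and the counts in the usage lines are made up for illustration, not taken from the experiment data.

```python
import math
from statistics import NormalDist


def count_sanity_check(n_control, n_experiment, alpha=0.05, p=0.5):
    """Check that the control group's share of units is consistent with a p split.

    Returns (lower, upper, observed, passed).
    """
    total = n_control + n_experiment
    sd = math.sqrt(p * (1 - p) / total)            # binomial standard error
    z = NormalDist().inv_cdf(1 - alpha / 2)        # about 1.96 for alpha = 0.05
    lower, upper = p - z * sd, p + z * sd
    observed = n_control / total
    return lower, upper, observed, lower <= observed <= upper


# Illustrative counts only (hypothetical, not the experiment's totals)
print(count_sanity_check(10000, 10000))   # an even split
print(count_sanity_check(10000, 11000))   # a visibly uneven split
```

With the real data, the call would be `count_sanity_check(cont.Pageviews.sum(), exp.Pageviews.sum())`, and likewise for `Clicks`.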




## Result Analysis

### Effect Size Test

For each evaluation metric, we compute a 95% confidence interval around the difference between the experiment and control groups. 


####  Gross Conversion

In [16]:
alpha =0.05

exp_enroll = exp[~exp.Enrollments.isnull()].Enrollments.sum()
exp_clicked = exp[~exp.Enrollments.isnull()].Clicks.sum()
cont_enroll = cont[~cont.Enrollments.isnull()].Enrollments.sum()
cont_clicked = cont[~cont.Enrollments.isnull()].Clicks.sum()
d0 = exp_enroll/exp_clicked - cont_enroll/cont_clicked
p = (exp_enroll + cont_enroll)/(exp_clicked + cont_clicked)
sd = np.sqrt(p * (1-p) * (1/exp_clicked + 1/cont_clicked))

print("The lower bound is: ", d0-zscore(1-alpha/2)*sd)
print("The upper bound is: ", d0+zscore(1-alpha/2)*sd)
print("The observed value is: ",d0)

The lower bound is:  -0.0291232008875
The upper bound is:  -0.0119865482732
The observed value is:  -0.0205548745804


#### Net Conversion

In [17]:
alpha =0.05

exp_pmt = exp[~exp.Enrollments.isnull()].Payments.sum()
exp_clicked = exp[~exp.Enrollments.isnull()].Clicks.sum()
cont_pmt = cont[~cont.Enrollments.isnull()].Payments.sum()
cont_clicked = cont[~cont.Enrollments.isnull()].Clicks.sum()
d0 = exp_pmt/exp_clicked - cont_pmt/cont_clicked
p = (exp_pmt + cont_pmt)/(exp_clicked + cont_clicked)
sd = np.sqrt(p * (1-p) * (1/exp_clicked + 1/cont_clicked))

print("The lower bound is: ", d0-zscore(1-alpha/2)*sd)
print("The upper bound is: ", d0+zscore(1-alpha/2)*sd)
print("The observed value is: ",d0)

The lower bound is:  -0.011604500678
The upper bound is:  0.00185705532891
The observed value is:  -0.00487372267454


![alt](interval.png "Title")

According to our interval [result](https://docs.google.com/spreadsheets/d/14BpnIZ7t4qRmX-G7W89yZOgNfyje3aNWzpA4OVWNitM/edit?usp=sharing), the change in Gross conversion is both statistically and practically significant. However, the change in Net conversion is neither statistically nor practically significant. 



| Evaluation Metrics | Lower Bound | Upper Bound | d0 | Statistical Significance | Practical Significance |
| --- | --- | --- | --- | --- | --- |
| Gross Conversion | -0.029123 | -0.011987 | -0.020555 | Yes | Yes |
| Net Conversion | -0.011605 | 0.001857 | -0.004874 | No | No |
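
How an interval maps to the two significance columns can be sketched as below. The rule is an assumption stated for illustration: a change is statistically significant when the interval excludes 0, and practically significant when the whole interval also lies beyond the practical significance boundary `d_min` on one side. The function name `classify_interval` is hypothetical; the bounds and `d_min` values come from the tables in this notebook.

```python
def classify_interval(lower, upper, d_min):
    """Return (statistically_significant, practically_significant) for a CI.

    Assumed rule: statistical = CI excludes 0;
    practical = CI lies entirely above +d_min or entirely below -d_min.
    """
    statistical = not (lower <= 0 <= upper)
    practical = statistical and (lower > d_min or upper < -d_min)
    return statistical, practical


intervals = {
    "Gross Conversion": (-0.029123, -0.011987, 0.01),
    "Net Conversion": (-0.011605, 0.001857, 0.0075),
}
for name, (lower, upper, d_min) in intervals.items():
    print(name, classify_interval(lower, upper, d_min))
```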



### Sign Test

Total days that we use to count for sign test:

In [18]:
exp[~exp.Enrollments.isnull()].Clicks.count()

23

Total days with the positive change for Gross Conversion:

In [19]:
sum(exp[~exp.Enrollments.isnull()].Enrollments/exp[~exp.Enrollments.isnull()].Clicks > 
    cont[~cont.Enrollments.isnull()].Enrollments/cont[~cont.Enrollments.isnull()].Clicks)

4

Total days with the positive change for Net Conversion:

In [20]:
sum(exp[~exp.Enrollments.isnull()].Payments/exp[~exp.Enrollments.isnull()].Clicks > 
    cont[~cont.Enrollments.isnull()].Payments/cont[~cont.Enrollments.isnull()].Clicks)

10

We can use this [calculator](https://www.graphpad.com/quickcalcs/binomial1/) to find the two-tailed p-value for the number of days with a positive change ("successes") out of the total days, assuming a success probability of 0.5.

| Evaluation Metrics | # of Days w/ Positive Change | Total Days | P Value | Significance |
| --- | --- | --- | --- | --- |
|Gross Conversion | 4 | 23 | 0.0026 | Yes |
|Net Conversion | 10 | 23 | 0.67764 | No |

The sign test agrees with the effect size test above.
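
The p-values in the sign test table can also be computed directly. The sketch below is an exact two-tailed binomial test with success probability 0.5, written with the standard library only; the function name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(successes, n):
    """Two-tailed exact binomial p-value under H0: P(success) = 0.5."""
    k = min(successes, n - successes)                      # size of the smaller tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= k)
    return min(1.0, 2 * tail)                              # symmetric, so double one tail


for name, positive_days in (("Gross Conversion", 4), ("Net Conversion", 10)):
    print(f"{name}: p = {sign_test_p_value(positive_days, 23):.5f}")
```

Doubling one tail is valid here because the binomial distribution is symmetric when the success probability is 0.5.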


# Summary

In the experiment, Udacity tested a change by dividing potential students into two groups: an experiment group and a control group. Students in the experiment group were asked how much time they had available to devote to the course when they chose to start a free trial, while students in the control group enrolled as usual without any prompt. We used three invariant metrics for the sanity checks and two evaluation metrics for the effect size and sign tests. We expected the screener to reduce the number of students who do not have enough time to study on Udacity and thereby improve the experience of the other students, which means we wanted a decrease in gross conversion and an increase (or at least no decrease) in net conversion. We didn’t use the Bonferroni correction because it is too conservative for correlated metrics and would leave an unreasonably small alpha. The trade-off is that without a correction the risk of a type I error (false positive) grows as the number of metrics increases, whereas applying the correction would raise the risk of a type II error (false negative); either could change the final decision.

Net conversion fails both the effect size test and the sign test, whereas Gross conversion is significantly affected by the new screener.

We didn’t use retention as an evaluation metric because we could not collect enough pageviews in a reasonable time.


# Recommendation 

We expected the screener to decrease the number of users who enroll in the free trial without significantly reducing the number of users who stay enrolled past the 14-day boundary. According to our data, there is a statistically and practically significant decrease in Gross conversion, which matches our expectation. However, the change in Net conversion is not statistically significant, and its confidence interval includes decreases larger than the practical significance boundary, so we cannot rule out that fewer users remain enrolled than in the control group. Considering this, my advice is not to launch this change.


## Follow-Up Experiment

In my opinion, Net conversion is not a perfect metric for measuring the overall student experience, because:
1. Fewer students enrolling in the free trial means fewer potential students who may stay in the course after 14 days. 
2. The experiment aims to improve the experience of all students, not just newly enrolled ones. Therefore, we should add metrics that capture the experience of the whole group.

There are two evaluation metrics I may consider using:

##### Course Completion Conversion: 
That is, number of user-ids to complete the course divided by number of user-ids to enroll in the course.

##### Course Drop Conversion: 
That is, number of user-ids who left the course divided by number of user-ids to enroll in the course.


A variety of methods could improve the overall student learning experience by reducing the number of students who don’t have enough time. One approach is to test a change where, after students finish the 14-day free trial, we summarize how much time they spent on the course per day on average and tell them how many hours per day the course usually requires. If they spent less than the suggested minimum, they are asked whether they still want to stay; otherwise, they are encouraged to continue and complete the course.

The hypothesis is that every student still gets the opportunity to try the free trial and find out whether they are interested in the course and willing to spend enough time to finish it, while the prompt reduces the number of students who do not have, or are not willing to commit, enough time. For students who already spend a reasonable amount of time on the course, Udacity can boost their enthusiasm and provide an improved coaching experience.

Unit of Diversion: user-id

Invariant Metrics: number of user-ids

Evaluation Metrics: 

Retention Conversion: the number of user-ids to remain enrolled past the 14-day boundary divided by the number of user-ids to complete the checkout and enroll in the free trial.

Second Payment Conversion: the number of user-ids to complete the second checkout divided by the number of user-ids to remain enrolled past the 14-day boundary before the time-spent-summary screener is triggered.

