# A/B Testing Experiment Design

## Experiment Overview

At the time of this experiment, Udacity courses currently have two options on the home page: "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. 

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

## Metric Choice

### Invariant Metrics

* Number of cookies: That is, number of unique cookies to view the course overview page. (dmin=3000)
    * I chose it because the page view happens before the user sees the new feature, so the count should not differ between groups.
* Number of clicks: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
    * The click also happens before the new feature is shown.
* Click-through-probability: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
    * Since both page views and clicks on "Start free trial" happen before the new feature, their ratio, the click-through-probability, is also an invariant metric.

I didn't choose number of user-ids, gross conversion, retention, or net conversion as invariant metrics because they are measured after the new feature appears and might be affected by it.

### Evaluation Metrics

* Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
    * This measures how many of the users who click the button and see the new feature go on to complete checkout.
* Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
    * Retention measures how many users make a payment after completing checkout, which tells me whether the new feature affects the users who actually start the free trial.
* Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin=0.0075)
    * This is also an evaluation metric because it measures the impact of the new feature on the users who want to start the free trial.

I didn't choose number of user-ids as an evaluation metric because it is redundant with the other metrics. It is also a raw count, so it may fluctuate a lot as pageviews and clicks vary from day to day.
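As a minimal sketch of how the three evaluation metrics relate, the definitions above can be written as ratios of counts. The function name and the payments figure are illustrative assumptions; 3200 clicks and 660 enrollments are the daily baseline values used in the sizing section. The code is written for Python 3.

```python
def evaluation_metrics(clicks, enrollments, payments):
    """Return (gross conversion, retention, net conversion) from raw counts."""
    gross_conversion = enrollments / clicks  # enrolled user-ids per clicking cookie
    retention = payments / enrollments       # paying user-ids per enrolled user-id
    net_conversion = payments / clicks       # paying user-ids per clicking cookie
    return gross_conversion, retention, net_conversion


# 3200 clicks and 660 enrollments are the daily baseline values;
# 350 payments is a hypothetical figure chosen only for illustration.
gross, ret, net = evaluation_metrics(clicks=3200, enrollments=660, payments=350)
print(round(gross, 5))  # 0.20625

# Net conversion is the product of gross conversion and retention.
assert abs(net - gross * ret) < 1e-12
```

Because net conversion is the product of the other two, a drop in gross conversion can be offset by a rise in retention, which is the outcome the hypothesis hopes for.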

### Hypothesis

* Set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time
    * Gross conversion and retention test this part of the hypothesis: we expect gross conversion to decrease and retention to increase.
* Without significantly reducing the number of students to continue past the free trial and eventually complete the course
    * Net conversion is the best metric for this part of the hypothesis. Ideally, the number of paying users per click on the button will not decrease.

## Standard Deviation of evaluation metrics

In [1]:
import pandas as pd
import numpy as np

In [2]:
# SD of Gross Conversion
round(np.sqrt((.206250*(1-.206250))/(5000*3200/40000)),4)

0.0202

The unit of analysis for gross conversion is a cookie that clicks the button, and the unit of diversion is also a cookie, so I expect the analytical estimate to match the empirical one.

In [10]:
# SD of Retention (82.5 = 5000*660/40000 expected enrollments in a 5000-pageview sample)
round(np.sqrt((0.53*(1-0.53))/(82.5)),4)

0.0549

The unit of analysis for retention is a user-id, while the unit of diversion is a cookie, so I expect the analytical estimate not to match the empirical one. We might need to collect more data and compute an empirical estimate, since the analytical one is not that accurate.

In [4]:
# SD of Net Conversion
round(np.sqrt((.109313*(1-.109313))/(5000*3200/40000)),4)

0.0156

The unit of analysis for net conversion is a cookie that clicks the button, and the unit of diversion is also a cookie, so I expect the analytical estimate to match the empirical one.
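The three calculations above share one formula, the binomial standard deviation sqrt(p(1-p)/n), with n scaled from the daily baseline to a 5000-pageview sample. A small Python 3 sketch (the helper name `analytic_sd` is mine, not part of the original notebook):

```python
import math


def analytic_sd(p, n):
    """Binomial standard deviation of a proportion p estimated from n units."""
    return math.sqrt(p * (1 - p) / n)


pageviews = 5000                        # sample size used in the estimates above
clicks = pageviews * 3200 / 40000       # expected clicking cookies: 400
enrollments = pageviews * 660 / 40000   # expected enrollments: 82.5

print(round(analytic_sd(0.206250, clicks), 4))   # 0.0202 (gross conversion)
print(round(analytic_sd(0.53, enrollments), 4))  # 0.0549 (retention)
print(round(analytic_sd(0.109313, clicks), 4))   # 0.0156 (net conversion)
```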

## Sizing

### Gross Conversion
* total sample size = 51,670 clicks
* clicks/pageview: 3200/40000 = 0.08 clicks/pageview
* pageviews = 51,670/0.08 = 645,875

### Retention
* total sample size = 78,230 enrollments
* enrollments/pageview: 660/40000 = 0.0165 enrollments/pageview
* pageviews = 78,230/0.0165 = 4,741,212

### Net Conversion
* total sample size = 54,826 clicks
* clicks/pageview: 3200/40000 = 0.08 clicks/pageview
* pageviews = 54,826/0.08 = 685,325
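The sample sizes above are stated without the formula behind them. As a hedged sketch, a common normal-approximation formula for comparing two proportions gives figures close to them, assuming alpha = 0.05 and beta = 0.2 (neither value is stated in this section, so both are assumptions). The function names are hypothetical, and the results should land within about 1% of the totals listed above rather than match them exactly.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p, dmin, alpha=0.05, beta=0.2):
    """Approximate units per group to detect an absolute change dmin from baseline p."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for the desired power
    sd_null = math.sqrt(2 * p * (1 - p))
    sd_alt = math.sqrt(p * (1 - p) + (p + dmin) * (1 - p - dmin))
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / dmin ** 2


def pageviews_needed(p, dmin, units_per_pageview):
    """Total pageviews across both groups, given units (clicks or enrollments) per pageview."""
    total_units = 2 * sample_size_per_group(p, dmin)
    return total_units / units_per_pageview


print(round(pageviews_needed(0.206250, 0.01, 3200 / 40000)))    # gross conversion
print(round(pageviews_needed(0.53, 0.01, 660 / 40000)))         # retention
print(round(pageviews_needed(0.109313, 0.0075, 3200 / 40000)))  # net conversion
```

Retention needs far more pageviews because only 0.0165 enrollments occur per pageview, so each pageview contributes much less to the sample.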

## Duration and exposure

Retention would need 4,741,212 pageviews, which is around 118 days even at 100% traffic. That is too long, so I exclude the Retention metric and use the pageviews required for Net Conversion, 685,325, which also covers Gross Conversion. Retention's analytical standard deviation may also be inaccurate, which is another reason to leave it out.


In [12]:
685325.0/(40000*0.5)

34.26625

Using 100% of traffic for the test is too risky because the change would affect the whole population. Suppose instead the exposure is 50% of pageviews every day. Then the experiment will run for around 35 days, which is quite reasonable, and the experiment group will be 25% of all pageviews.

The experiment itself is low risk for participants, since it does not involve sensitive data such as political attitudes, personal disease history, or sexual preferences.
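The duration arithmetic can be wrapped in a small helper to compare exposure levels. This is only a sketch in Python 3; `days_needed` is a hypothetical name, and 40,000 is the daily baseline pageview count used above.

```python
import math


def days_needed(pageviews_required, daily_pageviews=40000, traffic_fraction=1.0):
    """Days required when only a fraction of daily traffic enters the experiment."""
    return pageviews_required / (daily_pageviews * traffic_fraction)


# Net conversion sizing at 50% exposure
print(math.ceil(days_needed(685325, traffic_fraction=0.5)))  # 35
# Retention sizing even at 100% exposure (about 118.5 days)
print(math.ceil(days_needed(4741212)))                        # 119
```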

# Experiment Analysis

## Sanity Check

### Number of cookies

In [16]:
# load data of control and experiment group
df_control = pd.read_csv("Final Project Results - Control.csv")
df_experiment = pd.read_csv("Final Project Results - Experiment.csv")

In [31]:
# Sum of pageviews of control group
c_pv=df_control.Pageviews.sum()
c_pv

345543L

In [101]:
# Sum of pageviews of experiment group
e_pv=df_experiment.Pageviews.sum()
e_pv

344660L

In [53]:
# Observed pageviews
float(c_pv)/(c_pv+e_pv)

0.5006396668806133

In [13]:
# SD of pageviews
np.sqrt((0.5*(1-0.5))/(345543+344660))

0.00060184074029432473

In [14]:
# Upper bound
0.5+1.96*0.00060184074029432473

0.5011796078509769

In [15]:
# Lower bound
0.5-1.96*0.00060184074029432473

0.49882039214902313

Since the observed fraction 0.5006 is within the confidence interval, the sanity check for the number of cookies passed.
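The same steps apply to any count-based invariant metric, so they can be collected into one function. A minimal Python 3 sketch, assuming an expected 50/50 split and a 95% interval (z = 1.96) as in the cells above; the function name is my own.

```python
import math


def count_split_check(control_count, experiment_count, p=0.5, z=1.96):
    """Check whether the control share of a count is consistent with an expected split p."""
    total = control_count + experiment_count
    se = math.sqrt(p * (1 - p) / total)
    lower, upper = p - z * se, p + z * se
    observed = control_count / total
    return lower, observed, upper, lower <= observed <= upper


lower, observed, upper, passed = count_split_check(345543, 344660)
print(round(lower, 4), round(observed, 4), round(upper, 4), passed)
# 0.4988 0.5006 0.5012 True
```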

### Number of clicks on "Start free trial"

In [32]:
# Sum of clicks of control group
c_c=df_control.Clicks.sum()
c_c

28378L

In [102]:
# Sum of clicks of experiment group
e_c=df_experiment.Clicks.sum()
e_c

28325L

In [55]:
# Observed clicks
float(c_c)/(c_c+e_c)

0.5004673474066628

In [35]:
# SD of clicks
np.sqrt((0.5*(1-0.5))/(28378+28325))

0.0020997470796992519

In [36]:
# Upper bound
0.5+1.96*0.0020997470796992519

0.5041155042762105

In [37]:
# Lower bound
0.5-1.96*0.0020997470796992519

0.49588449572378945

Since the observed fraction 0.5005 is within the confidence interval, the sanity check for the number of clicks passed.

### Click-through-probability on "Start free trial"

In [105]:
# Click-through-probability of control group
c_ctp=float(c_c)/c_pv
c_ctp

0.08212581357457682

In [106]:
# Click-through-probability of experiment group
e_ctp=float(e_c)/e_pv
e_ctp

0.08218244066616376

In [109]:
# SD of click-through-probability
sd_ctp=np.sqrt(c_ctp*(1-c_ctp)/344660)
sd_ctp

0.00046766619548322742

In [112]:
# Upper bound
c_ctp+1.96*sd_ctp

0.083042439317723954

In [113]:
# Lower bound
c_ctp-1.96*sd_ctp

0.081209187831429691

Since 0.0821824 is within the confidence interval of control group's click-through-probability, this sanity check passed.
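An alternative way to run this check, not the one used above, is to build a confidence interval around the difference in click-through-probability using a pooled standard error and confirm that it contains 0. The sketch below is Python 3, uses the click and pageview totals reported earlier, and the function name is hypothetical.

```python
import math


def proportion_diff_ci(x_control, n_control, x_experiment, n_experiment, z=1.96):
    """Confidence interval for (experiment - control) proportion using a pooled SE."""
    p_pooled = (x_control + x_experiment) / (n_control + n_experiment)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_experiment))
    diff = x_experiment / n_experiment - x_control / n_control
    return diff - z * se, diff, diff + z * se


lower, diff, upper = proportion_diff_ci(28378, 345543, 28325, 344660)
print(lower <= 0 <= upper)  # True
```

Both approaches lead to the same conclusion here: the click-through-probability does not differ meaningfully between the groups.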

## Result Analysis

### Effect Size Tests

#### Gross Conversion

In [67]:
c_en=df_control.Enrollments.sum()# Sum of enrollments of control group
e_en=df_experiment.Enrollments.sum()# Sum of enrollments of experiment group
df_control_notnull = df_control[df_control.Enrollments.notnull()]# Keep only rows with enrollment data
df_experiment_notnull = df_experiment[df_experiment.Enrollments.notnull()]# Filter the experiment group by its own null mask
c_c1=df_control_notnull.Clicks.sum()# Sum of clicks of control group without enrollments nulls
e_c1=df_experiment_notnull.Clicks.sum()# Sum of clicks of experiment group without enrollments nulls

In [68]:
# Pooled gross conversion ratio
gc_pooled = float(c_en+e_en)/(c_c1+e_c1)
gc_pooled

0.20860706740369866

In [89]:
# Pooled standard error
se_pooled = np.sqrt(gc_pooled*(1-gc_pooled)*(1.0/c_c1+1.0/e_c1))
se_pooled

0.0043716753852259364

In [90]:
# Difference of gross conversion ratio between control and experiment groups
d=float(e_en)/e_c1-float(c_en)/c_c1
d

-0.020554874580361565

In [91]:
# Lower bound
d-1.96*se_pooled

-0.029123358335404401

In [92]:
# Upper bound
d+1.96*se_pooled

-0.01198639082531873

Because the confidence interval does not contain 0, the change is statistically significant. The interval also lies entirely below -dmin = -0.01, so the change is practically significant as well.

#### Net Conversion

In [77]:
# Sum of payments of control group
c_p = df_control.Payments.sum()
# Sum of payments of experiment group
e_p = df_experiment.Payments.sum()

In [78]:
# Pooled net conversion ratio
nc_pooled = float(c_p+e_p)/(c_c1+e_c1)
nc_pooled

0.1151274853124186

In [86]:
# Pooled standard error
se_pooled2 = np.sqrt(nc_pooled*(1-nc_pooled)*(1.0/c_c1+1.0/e_c1))
se_pooled2

0.0034341335129324238

In [87]:
# Difference of net conversion ratio between control and experiment groups
d2=float(e_p)/e_c1-float(c_p)/c_c1
d2

-0.0048737226745441675

In [94]:
# Upper bound
d2+1.96*se_pooled2

0.001857179010803383

In [93]:
# Lower bound
d2-1.96*se_pooled2

-0.011604624359891718

Because the confidence interval contains 0, the change is not statistically significant. The interval also contains -dmin = -0.0075, so it is not practically significant either, although a decrease of that size cannot be ruled out.
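The decision rule applied to both metrics can be made explicit in code. This is a sketch under my reading of the rule: a change is statistically significant when the 95% interval excludes 0, and practically significant only when the whole interval lies beyond +/- dmin. The function name is hypothetical; the inputs are the differences and pooled standard errors computed above.

```python
def classify_effect(d, se, dmin, z=1.96):
    """Return the CI bounds plus statistical and practical significance flags."""
    lower, upper = d - z * se, d + z * se
    statistically_significant = not (lower <= 0 <= upper)
    # Practically significant only if the entire interval lies beyond +/- dmin
    practically_significant = upper < -dmin or lower > dmin
    return lower, upper, statistically_significant, practically_significant


# Gross conversion: d, pooled standard error, dmin
print(classify_effect(-0.020554874580361565, 0.0043716753852259364, 0.01)[2:])
# (True, True)

# Net conversion: d, pooled standard error, dmin
print(classify_effect(-0.0048737226745441675, 0.0034341335129324238, 0.0075)[2:])
# (False, False)
```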

### Sign Tests

#### Gross Conversion

* Number of successes: 4
* Number of trials: 23
* Probability: 0.5

In [99]:
from scipy.stats import binom_test
binom_test(4,23)

0.0025994777679443364

The p-value of 0.0026 is smaller than alpha = 0.05, so this is a statistically significant change.

#### Net Conversion

* Number of successes: 10
* Number of trials: 23
* Probability: 0.5

In [100]:
binom_test(10,23)

0.67763948440551747

The p-value of 0.6776 is larger than alpha = 0.05, so this is not a statistically significant change.
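For reference, the two-sided sign test can also be computed directly from the binomial distribution without SciPy. The sketch below is Python 3 (it needs `math.comb`, available from 3.8) and is valid only for the null probability of 0.5 used here, where the two tails are symmetric. The function name is my own.

```python
from math import comb


def sign_test_p_value(successes, trials):
    """Two-sided exact binomial p-value under the null p = 0.5."""
    k = min(successes, trials - successes)  # symmetric null: use the smaller tail
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    return min(1.0, 2 * tail)


print(round(sign_test_p_value(4, 23), 4))   # 0.0026 (gross conversion)
print(round(sign_test_p_value(10, 23), 4))  # 0.6776 (net conversion)
```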

### Summary

I didn't use the Bonferroni correction because we will launch this new feature only if all the evaluation metrics pass. The goal of the Bonferroni correction is to reduce the chance of false positives, which grows with the number of metrics when a launch is based on the significance of any one metric. Since we require every evaluation metric to meet its criterion, the Bonferroni correction is not necessary and would make the test too conservative.
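A rough numeric illustration of this argument, assuming the metrics are independent (which they are not exactly, since both share the same clicks) and alpha = 0.05 per metric. The function names are hypothetical.

```python
def prob_any_false_positive(alpha, n_metrics):
    """Chance that at least one of n independent metrics is a false positive."""
    return 1 - (1 - alpha) ** n_metrics


def prob_all_false_positive(alpha, n_metrics):
    """Chance that all n independent metrics are false positives at once."""
    return alpha ** n_metrics


alpha, n_metrics = 0.05, 2  # gross conversion and net conversion
print(round(prob_any_false_positive(alpha, n_metrics), 4))  # 0.0975
print(round(prob_all_false_positive(alpha, n_metrics), 4))  # 0.0025
```

Under an "any metric" launch rule the false positive risk inflates, which is what Bonferroni guards against. Under an "all metrics" rule it shrinks, and the concern shifts to false negatives instead.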

### Recommendation

My final recommendation to Udacity team is not to launch this feature. 

For Gross Conversion, we see a statistically and practically significant change during the experiment. It matches our expectation before the experiment, which was a decrease in gross conversion, and the sign test confirms it. We can conclude that the new feature really does affect enrollments: the experiment group had fewer enrollments, because people who cannot commit enough time do not enroll. In that case, we can focus on higher-quality customers.

From the Net Conversion perspective, there is no statistically or practically significant change, and the sign test agrees. That is to say, this new feature has little measurable impact on the number of paid users. This is exactly what we want: "without significantly reducing the number of students to continue past the free trial and eventually complete the course".

However, the confidence interval of net conversion does include the negative practical significance boundary (-0.0075), which means net conversion might have decreased to a level that impacts the business. So I do not recommend launching this new feature.

### Follow-Up Experiment

If you wanted to reduce the number of frustrated students who cancel early in the course, what experiment would you try? Give a brief description of the change you would make, what your hypothesis would be about the effect of the change, what metrics you would want to measure, and what unit of diversion you would use. Include an explanation of each of your choices.

* My suggested change: Give those students a free coaching session (as one of the required assignments), because I assume that students get frustrated when they don't know how to start learning new material or how to engage with the class.
* Hypothesis: If we provide this free coaching session, students will find the direction they need to continue the course.
* Invariant metric: Number of user-ids that enroll in the course, because we only focus on students who have already enrolled and have a user-id.
* Evaluation metric: Retention rate, since we want to know whether the free coaching session during the free trial really helps students and makes them willing to study further.
* Unit of diversion: User-id, because the free coaching session is given to each user, not to each click or pageview.