## Experiment Design
### Metric Choice

Below I list the metrics I use in this experiment as invariant metrics or evaluation metrics. **Invariant metrics** are metrics that should not differ between the experiment and control groups, so they should be _independent_ of the experiment. Conversely, **evaluation metrics** are metrics that should change as a direct result of the experiment; these metrics are _dependent_ on the experiment.

For each metric, I also explain why I did or did not use it as an invariant metric or as an evaluation metric.

The metrics I chose to use as invariant metrics were:

- number of cookies
- number of clicks
- click-through-probability

The metrics I chose to use as evaluation metrics were:

- gross conversion
- retention
- net conversion

The rationale I had for choosing or not choosing each metric is as follows:

**Number of cookies:** I chose this as an invariant metric because unique cookies are counted when visitors view the page, before they see the experiment, so this metric is independent of the experiment.

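**Number of user-ids:** I didn't choose this metric as either an invariant metric or an evaluation metric. It cannot be invariant because the number of users who enroll in the free trial is dependent on the experiment. As a raw count it is also a weaker evaluation metric than gross conversion, which normalizes enrollments by the number of clicks.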

**Number of clicks:** This is a good invariant metric because the number of unique cookies to click the "Start Free Trial" button is independent of the free trial screener (i.e. the click happens before the user sees the experiment).

**Click-through-probability:** This is a good invariant metric because the user clicks before the experiment happens, so the click is independent of the experiment.

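**Gross conversion:** I chose this as an evaluation metric because gross conversion is directly dependent on the results of the experiment. The number of user-ids that enroll in the free trial divided by the number of unique cookies that click the "Start Free Trial" button should theoretically decrease as a result of the experiment, since the screener is meant to deter users who cannot commit the time.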

**Retention:** I chose this as an evaluation metric because it is dependent on the experiment: users who are asked to honestly assess their own time commitment for the nanodegree (and are able to commit that time) are more likely to stay enrolled past the trial period.

**Net conversion:** I chose this as an evaluation metric because it is dependent on the effect of the experiment; the number of user-ids that stay enrolled long enough to make a payment, divided by the number of unique cookies that click the "Start Free Trial" button, should increase with the addition of the self-evaluation.
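
To make the ratio metrics concrete, here is a minimal sketch that computes each one from daily counts. The counts are the baseline values used later in this report, except `payments_per_day`, which is an assumed value that I back out from the 0.53 payment probability because it is not given directly.

```python
# Baseline daily counts (from the baseline values used later in this report)
pageviews_per_day = 40000    # unique cookies to view the page
clicks_per_day = 3200        # unique cookies to click "Start free trial"
enrollments_per_day = 660    # user-ids that enroll in the free trial

# Assumption: not listed as a count; derived from P(payment | enroll) = 0.53
payments_per_day = 0.53 * enrollments_per_day

click_through_probability = clicks_per_day / pageviews_per_day
gross_conversion = enrollments_per_day / clicks_per_day   # enrollments per click
retention = payments_per_day / enrollments_per_day        # payments per enrollment
net_conversion = payments_per_day / clicks_per_day        # payments per click

print('click-through-probability:', round(click_through_probability, 4))
print('gross conversion:', round(gross_conversion, 6))
print('retention:', round(retention, 4))
print('net conversion:', round(net_conversion, 6))
```

Note that net conversion is the product of gross conversion and retention, which is why the three metrics move together.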

### Measuring Standard Deviation

I will list below the standard deviation of each of the chosen evaluation metrics. I will also indicate whether I think the analytic estimate would be comparable to the empirical variability, or whether I expect them to be different.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display 

%matplotlib inline

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [7]:
# load the baseline values (the file has no header row)
baseline = pd.read_csv("baseline.csv", index_col=False, header=None, names=['metric', 'value'])
display(baseline)

| | metric | value |
|---|---|---|
| 0 | Unique cookies to view page per day: | 40000.0 |
| 1 | Unique cookies to click "Start free trial" per... | 3200.0 |
| 2 | Enrollments per day: | 660.0 |
| 3 | Click-through-probability on "Start free trial": | 0.08 |
| 4 | Probability of enrolling, given click: | 0.20625 |
| 5 | Probability of payment, given enroll: | 0.53 |
| 6 | Probability of payment, given click | 0.109313 |


In [13]:
# given a sample size of 5000 cookies visiting enrollment page
sample_size_cookies = 5000

prob_enrolling = 0.206250
unique_cookies = 40000
unique_cookies_click = 3200

std_gross_conv = round(np.sqrt((prob_enrolling*(1.-prob_enrolling))/    \
                               (sample_size_cookies*unique_cookies_click/unique_cookies)), 4)
print( 'standard deviation of gross conversion:', std_gross_conv )

standard deviation of gross conversion: 0.0202


In [19]:
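prob_pmt_enroll = 0.53
enroll_per_day = 660  # defined here so the cell runs when executed top to bottom

# retention is measured per enrollment, so scale the cookie sample to expected enrollments
std_retention = round(np.sqrt((prob_pmt_enroll*(1.-prob_pmt_enroll))/    \
                              (sample_size_cookies*enroll_per_day/unique_cookies)), 4)
print( 'standard deviation of retention:', std_retention )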

standard deviation of retention: 0.0549


In [18]:
prob_pmt_click = 0.109313
enroll_per_day = 660

std_net_conv = round(np.sqrt((prob_pmt_click*(1.-prob_pmt_click))/    \
                              (sample_size_cookies*unique_cookies_click/unique_cookies)), 4)
print( 'standard deviation of net conversion:', std_net_conv )

standard deviation of net conversion: 0.0156
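
The three cells above all apply the same analytic formula for the standard deviation of a binomial proportion, sqrt(p(1-p)/n), where n is the expected number of units of analysis in the sample. As a rough sketch, the calculation can be wrapped in one helper; the name `binomial_std` is my own and not part of any library.

```python
import math

def binomial_std(p, n):
    """Analytic standard deviation of a proportion p estimated from n units."""
    return math.sqrt(p * (1.0 - p) / n)

sample_size_cookies = 5000
unique_cookies = 40000
unique_cookies_click = 3200
enroll_per_day = 660

# expected number of units of analysis in a sample of 5000 pageviews
n_clicks = sample_size_cookies * unique_cookies_click / unique_cookies  # clicks
n_enroll = sample_size_cookies * enroll_per_day / unique_cookies        # enrollments

std_gross_conv = round(binomial_std(0.20625, n_clicks), 4)
std_retention = round(binomial_std(0.53, n_enroll), 4)
std_net_conv = round(binomial_std(0.109313, n_clicks), 4)
print(std_gross_conv, std_retention, std_net_conv)
```

Gross conversion and net conversion are measured per cookie. If cookies are also the unit of diversion, which I am assuming here, I would expect their analytic estimates to be close to the empirical variability. Retention is measured per user-id, so its empirical variability may differ from the analytic estimate.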


### Sizing
#### Number of Samples vs. Power

I will not use the Bonferroni correction during my analysis phase.  To calculate the number of samples needed, I used the calculator at http://www.evanmiller.org/ab-testing/sample-size.html. The pageviews needed for each evaluation metric are as follows:
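
As a cross-check on the calculator, the sketch below uses a standard normal-approximation formula for comparing two proportions. It rests on my own assumptions (two-sided test, absolute d_min, power of 0.8) and is not guaranteed to match the calculator's exact method, so small differences from the numbers listed for each metric are expected. The function name `samples_per_group` is hypothetical.

```python
import math
from statistics import NormalDist

def samples_per_group(p, d_min, alpha=0.05, power=0.8):
    """Approximate per-group sample size needed to detect an absolute
    change of d_min from baseline rate p with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    sd_null = math.sqrt(2 * p * (1 - p))
    sd_alt = math.sqrt(p * (1 - p) + (p + d_min) * (1 - p - d_min))
    return round((z_alpha * sd_null + z_power * sd_alt) ** 2 / d_min ** 2)

print('gross conversion:', samples_per_group(0.20625, 0.01))
print('retention:', samples_per_group(0.53, 0.01))
print('net conversion:', samples_per_group(0.1093125, 0.0075))
```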

##### Gross conversion

* Baseline conversion rate = 20.625%
* d_min = 0.01
* alpha = 0.05
* 1 - beta = 0.8
* calculated samples = 25,835
* required pageviews = 25,835 / 0.08 * 2 = 645,875

##### Retention

* Baseline conversion rate = 53%
* d_min = 0.01
* alpha = 0.05
* 1 - beta = 0.8
* calculated samples = 39,115
* required pageviews = 39,115 / 0.08 / 0.20625 * 2 = 4,741,212

##### Net conversion

* Baseline conversion rate = 10.93125%
* d_min = 0.0075
* alpha = 0.05
* 1 - beta = 0.8
* calculated samples = 27,413
* required pageviews = 27,413 / 0.08 * 2 = 685,325
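
The conversion from per-group samples to pageviews in the three lists above follows one pattern: divide by the fraction of pageviews that reach the metric's unit of analysis, then double for the two groups. A small sketch of that arithmetic, with a helper name (`required_pageviews`) of my own choosing:

```python
def required_pageviews(samples_per_group, units_per_pageview, groups=2):
    """Pageviews needed so that each group collects samples_per_group units."""
    return round(samples_per_group / units_per_pageview * groups)

click_through = 0.08          # clicks per pageview
enroll_given_click = 0.20625  # enrollments per click

pv_gross = required_pageviews(25835, click_through)
pv_retention = required_pageviews(39115, click_through * enroll_given_click)
pv_net = required_pageviews(27413, click_through)
print(pv_gross, pv_retention, pv_net)
```

Retention needs far more pageviews because only about 1.65% of pageviews (0.08 × 0.20625) end in an enrollment.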

#### Duration vs. Exposure
Indicate what fraction of traffic you would divert to this experiment and, given this, how many days you would need to run the experiment. (These should be the answers from the "Choosing Duration and Exposure" quiz.)

Give your reasoning for the fraction you chose to divert. How risky do you think this experiment would be for Udacity?
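
As a rough sketch of how duration follows from the diverted fraction, the helper below divides the required pageviews by the daily traffic sent to the experiment. The 40,000 pageviews per day comes from the baseline values; the fractions tried here are placeholder assumptions for illustration, not a recommendation.

```python
import math

def days_needed(required_pageviews, fraction_diverted, pageviews_per_day=40000):
    """Days to collect required_pageviews when diverting a fraction of daily traffic."""
    return math.ceil(required_pageviews / (pageviews_per_day * fraction_diverted))

# pageview requirements taken from the Sizing section
requirements = [('gross conversion', 645875),
                ('retention', 4741212),
                ('net conversion', 685325)]

for metric, pageviews in requirements:
    for fraction in (0.5, 1.0):  # assumed fractions, for illustration only
        print(metric, fraction, days_needed(pageviews, fraction), 'days')
```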

In [2]:
# list first 5 rows of control data as well as some summary statistics
df_control = pd.read_csv('control.csv')
display(df_control.head())
display(df_control.describe())

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 |


| | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|
| count | 37.0 | 37.0 | 23.0 | 23.0 |
| mean | 9339.0 | 766.972973 | 164.565217 | 88.391304 |
| std | 740.239563 | 68.286767 | 29.977 | 20.650202 |
| min | 7434.0 | 632.0 | 110.0 | 56.0 |
| 25% | 8896.0 | 708.0 | | |
| 50% | 9420.0 | 759.0 | | |
| 75% | 9871.0 | 825.0 | | |
| max | 10667.0 | 909.0 | 233.0 | 128.0 |


In [3]:
# list first 5 rows of experiment data as well as some summary statistics
df_experiment = pd.read_csv('experiment.csv')
display(df_experiment.head())
display(df_experiment.describe())

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7716 | 686 | 105.0 | 34.0 |
| 1 | Sun, Oct 12 | 9288 | 785 | 116.0 | 91.0 |
| 2 | Mon, Oct 13 | 10480 | 884 | 145.0 | 79.0 |
| 3 | Tue, Oct 14 | 9867 | 827 | 138.0 | 92.0 |
| 4 | Wed, Oct 15 | 9793 | 832 | 140.0 | 94.0 |


| | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|
| count | 37.0 | 37.0 | 23.0 | 23.0 |
| mean | 9315.135135 | 765.540541 | 148.826087 | 84.565217 |
| std | 708.070781 | 64.578374 | 33.234227 | 23.060841 |
| min | 7664.0 | 642.0 | 94.0 | 34.0 |
| 25% | 8881.0 | 722.0 | | |
| 50% | 9359.0 | 770.0 | | |
| 75% | 9737.0 | 827.0 | | |
| max | 10551.0 | 884.0 | 213.0 | 123.0 |


## Experiment Analysis
#### Sanity Checks
For each of your invariant metrics, give the 95% confidence interval for the value you expect to observe, the actual observed value, and whether the metric passes your sanity check. (These should be the answers from the "Sanity Checks" quiz.)


For any sanity check that did not pass, explain your best guess as to what went wrong based on the day-by-day data. Do not proceed to the rest of the analysis unless all sanity checks pass.
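
One way to run the sanity check for a count-based invariant metric is to test whether the control group's share of the total is consistent with an even 50/50 split. The sketch below assumes a normal approximation with z = 1.96; the function name is hypothetical, and the totals in the example are reconstructed from the summary statistics above (mean × 37 days), so treat them as illustrative.

```python
import math

def sanity_check_counts(control_count, experiment_count, p_expected=0.5, z=1.96):
    """Return (lower, upper, observed, passes) for the control share of a count."""
    total = control_count + experiment_count
    std_err = math.sqrt(p_expected * (1 - p_expected) / total)
    margin = z * std_err
    lower, upper = p_expected - margin, p_expected + margin
    observed = control_count / total
    return lower, upper, observed, lower <= observed <= upper

# total pageviews, reconstructed as mean * count from the describe() output
control_pageviews = round(9339.0 * 37)
experiment_pageviews = round(9315.135135 * 37)

lower, upper, observed, passes = sanity_check_counts(control_pageviews,
                                                     experiment_pageviews)
print(round(lower, 4), round(upper, 4), round(observed, 4), passes)
```

With the actual data, the same function could be called with `df_control['Pageviews'].sum()` and `df_experiment['Pageviews'].sum()`, and likewise for clicks.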


#### Result Analysis
##### Effect Size Tests
For each of your evaluation metrics, give a 95% confidence interval around the difference between the experiment and control groups. Indicate whether each metric is statistically and practically significant. (These should be the answers from the "Effect Size Tests" quiz.)
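
A minimal sketch of one common way to build this interval, assuming a pooled standard error for the difference of two proportions and z = 1.96. The function name is my own. The counts are only the first five days shown in the tables above, so the output illustrates the mechanics and is not the result for the full experiment.

```python
import math

def diff_confidence_interval(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """CI for (experiment rate - control rate) using a pooled standard error."""
    p_pooled = (x_cont + x_exp) / (n_cont + n_exp)
    std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_cont + 1 / n_exp))
    diff = x_exp / n_exp - x_cont / n_cont
    return diff - z * std_err, diff + z * std_err

# Illustration only: first five days of clicks and enrollments
control_clicks = 687 + 779 + 909 + 836 + 837
control_enrollments = 134 + 147 + 167 + 156 + 163
experiment_clicks = 686 + 785 + 884 + 827 + 832
experiment_enrollments = 105 + 116 + 145 + 138 + 140

lower, upper = diff_confidence_interval(control_enrollments, control_clicks,
                                        experiment_enrollments, experiment_clicks)

d_min = 0.01  # practical significance boundary for gross conversion
statistically_significant = not (lower <= 0 <= upper)
practically_significant = upper < -d_min or lower > d_min
print(round(lower, 4), round(upper, 4),
      statistically_significant, practically_significant)
```

A metric is statistically significant when the interval excludes 0 and practically significant when it also excludes the band from -d_min to d_min.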


##### Sign Tests
For each of your evaluation metrics, do a sign test using the day-by-day data, and report the p-value of the sign test and whether the result is statistically significant. (These should be the answers from the "Sign Tests" quiz.)
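
A sketch of a two-sided sign test, written with the standard library only. The helper name is hypothetical, ties are assumed not to occur, and the data are again just the first five days shown above, so the p-value is illustrative rather than the result for the full experiment.

```python
from math import comb

def sign_test_p_value(n_positive, n_days):
    """Two-sided sign test p-value under H0: P(positive difference) = 0.5."""
    k = min(n_positive, n_days - n_positive)
    tail = sum(comb(n_days, i) for i in range(k + 1)) / 2 ** n_days
    return min(1.0, 2 * tail)

# Illustration only: (clicks, enrollments) for the first five days
control = [(687, 134), (779, 147), (909, 167), (836, 156), (837, 163)]
experiment = [(686, 105), (785, 116), (884, 145), (827, 138), (832, 140)]

# count the days on which gross conversion is higher in the experiment group
n_positive = sum(e_enr / e_clk > c_enr / c_clk
                 for (c_clk, c_enr), (e_clk, e_enr) in zip(control, experiment))

p_value = sign_test_p_value(n_positive, len(control))
print(n_positive, p_value)
```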


##### Summary
State whether you used the Bonferroni correction, and explain why or why not. If there are any discrepancies between the effect size hypothesis tests and the sign tests, describe the discrepancy and why you think it arose.


#### Recommendation
Make a recommendation and briefly describe your reasoning.

## Follow-Up Experiment
Give a high-level description of the follow up experiment you would run, what your hypothesis would be, what metrics you would want to measure, what your unit of diversion would be, and your reasoning for these choices.