# Running an A/B Test on the Udacity Homepage

**Background**

At the time of this experiment, Udacity courses had two options on the home page: "start free trial" and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This screenshot shows what the experiment looks like.

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

### Part 1: Experiment Design

**Metric Choice**

Which of the following metrics would you choose to measure for this experiment and why? For each metric you choose, indicate whether you would use it as an invariant metric or an evaluation metric. The practical significance boundary for each metric, that is, the difference that would have to be observed before that was a meaningful change for the business, is given in parentheses. All practical significance boundaries are given as absolute changes.
Any place "unique cookies" are mentioned, the uniqueness is determined by day. (That is, the same cookie visiting on different days would be counted twice.) User-ids are automatically unique since the site does not allow the same user-id to enroll twice.

* Number of cookies: *That is, number of unique cookies to view the course overview page. (dmin=3000)* This is an appropriate invariant metric because cookies are counted before users are exposed to the experiment. It is also the unit of diversion, so an equal number of cookies is expected in each group.
* Number of user-ids: *That is, number of users who enroll in the free trial. (dmin=50)* This is not a good invariant metric because user-ids are counted after the click occurs and will therefore be affected by the test. It is also not useful as an evaluation metric: although it tells us how many users enrolled, it is a raw count with no denominator, so it cannot be normalized across groups.
* Number of clicks: *That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)* This is a good invariant metric because clicks happen before users are exposed to the change. The users who click are also the ones who see the change, so the experiment needs an equal number of them in each group.
* Click-through-probability: *The number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)* This would be a valid invariant metric because it is measured before the user sees the experimental change. However, because it is the ratio of Number of Clicks to Number of Cookies, it is redundant once both of those are checked, and it will therefore be left out.
* Gross conversion: *That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin=0.01)* This is a good evaluation metric because it will allow us to see if the experiment had an effect on the number of users enrolling after they click.
* Retention: *That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)* Retention is a relevant evaluation metric because it tells us what share of enrolled users remained enrolled after the trial period. However, as shown later in this report, using retention as an evaluation metric would make the experiment take too long, so it will not be included.
* Net conversion: *That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin=0.0075)* This is a good evaluation metric because the hypothesis is based on a change in this metric. Net conversion will allow us to see if the experiment had an effect on the number of users who click and become paying customers after the free trial ends.

You should also decide now what results you will be looking for in order to launch the experiment. Would a change in any one of your evaluation metrics be sufficient? Would you want to see multiple metrics all move or not move at the same time in order to launch? This decision will inform your choices while designing the experiment.

**Launch Criteria**

Since the goal of the experiment is to decrease the number of frustrated students dropping out during the free trial without reducing the number of paying students, recommending a launch requires a statistically and practically significant decrease in Gross Conversion while at least maintaining the level of Net Conversion.

Both of these criteria must be met to warrant a recommendation to launch the change after the test:
* A practically and statistically significant decrease in Gross Conversion.
* No drop in Net Conversion.

### Measuring Variability

In [1]:
import numpy as np
import pandas as pd
import math

data = pd.read_csv("abtest.csv", names = ("Statistic","Value"))
data

| | Statistic | Value |
|---|---|---|
| 0 | Unique cookies to view page per day: | 40000.0 |
| 1 | Unique cookies to click "Start free trial" per... | 3200.0 |
| 2 | Enrollments per day: | 660.0 |
| 3 | Click-through-probability on "Start free trial": | 0.08 |
| 4 | Probability of enrolling, given click: | 0.20625 |
| 5 | Probability of payment, given enroll: | 0.53 |
| 6 | Probability of payment, given click | 0.109313 |


In [2]:
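# Analytic SD of a proportion: sqrt(p * (1 - p) / n), with n = clicks expected in a 5000-pageview sample
print("The standard deviation for Gross Conversion is {:.4f}.".format(np.sqrt(0.20625*(1.0 - 0.20625)/(5000 * .08))))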

The standard deviation for Gross Conversion is 0.0202.


In [3]:
# Here n = enrollments expected in a 5000-pageview sample
print("The standard deviation for Retention is {:.4f}.".format(np.sqrt(0.53*(1.0 - 0.53)/(5000. * (660./40000.)))))

The standard deviation for Retention is 0.0549.


In [4]:
# Same denominator as Gross Conversion: clicks expected in a 5000-pageview sample
print("The standard deviation for Net Conversion is {:.4f}.".format(np.sqrt(0.1093125*(1.0 - 0.1093125)/(5000 * .08))))

The standard deviation for Net Conversion is 0.0156.


For each of your evaluation metrics, indicate whether you think the analytic estimate would be comparable to the empirical variability, or whether you expect them to be different (in which case it might be worth doing an empirical estimate if there is time). Briefly give your reasoning in each case.
* Both Gross Conversion and Net Conversion use the number of cookies as their denominator, which is also the unit of diversion. We can therefore expect the analytic estimate of variability to be comparable to the empirical variability, so it will not be necessary to calculate the empirical variability.
* If Retention ends up being used as an evaluation metric, it may be necessary to calculate the empirical variability, because its denominator is user-ids, which is not the unit of diversion.
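The three calculations above share one formula, the binomial standard deviation sqrt(p(1-p)/n), where n is the number of denominator units (clicks or enrollments) that a 5000-pageview sample is expected to yield. As a minimal sketch, the helper below makes that scaling explicit. The name `analytic_sd` and its parameters are illustrative and not part of the original notebook.

```python
import math

def analytic_sd(p, denominator_per_pageview, pageviews=5000):
    """Analytic standard deviation of a proportion metric.

    p: baseline probability of the metric
    denominator_per_pageview: expected denominator units (clicks or
        enrollments) generated by one pageview
    pageviews: sample size in pageviews (5000, as in the cells above)
    """
    n = pageviews * denominator_per_pageview
    return math.sqrt(p * (1.0 - p) / n)

# Baseline values taken from the table above
click_rate = 3200.0 / 40000.0    # clicks per pageview (0.08)
enroll_rate = 660.0 / 40000.0    # enrollments per pageview

gross_sd = analytic_sd(0.20625, click_rate)
retention_sd = analytic_sd(0.53, enroll_rate)
net_sd = analytic_sd(0.1093125, click_rate)

print("Gross: {:.4f}, Retention: {:.4f}, Net: {:.4f}".format(
    gross_sd, retention_sd, net_sd))
```

Retention has the largest standard deviation because only a small fraction of pageviews turn into enrollments, so its denominator is much smaller for the same number of pageviews.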


# Sizing

### Choosing Number of Samples given Power

Using the analytic estimates of variance, how many pageviews total (across both groups) would you need to collect to adequately power the experiment? Use an alpha of 0.05 and a beta of 0.2. Make sure you have enough power for each metric. Also indicate whether or not you will be using Bonferroni Correction.

* Bonferroni Correction is used to correct for false positives, and it comes with the side-effect of decreasing power/increasing false negatives. This correction is useful when the hypothesis is looking for a statistically significant change in any metric. In this case, since a statistically significant change must be seen in all evaluation metrics, Bonferroni Correction is not an appropriate choice and will not be used.

To calculate the required number of pageviews for the experiment, I will first use this online tool, http://www.evanmiller.org/ab-testing/sample-size.html, to calculate the necessary sample size, using the following values as inputs:

**Retention**
* Base conversion rate for Retention is 53%
* The minimum detectable effect is 1.0%
* The statistical power (1-β) is 0.8
* The significance level (α) is .05

Based on these inputs the minimum sample size is 39,115.

Since there will be two groups (control and experiment) this number must be doubled.

Finally, to get the number of pageviews, this number must be divided by the enrollment rate (The number of enrollments per day divided by the number of unique cookies per day).

**Net Conversion**
* Base conversion rate for Net Conversion is 10.9%
* The minimum detectable effect is 0.75%
* The statistical power (1-β) is 0.8
* The significance level (α) is .05

Based on those inputs, the minimum sample size is 27,345.

Since there will be two groups in the experiment (control and experiment) this number needs to be doubled.

Finally, to calculate the number of pageviews, the doubled sample size must be divided by the click-through-probability of 0.08.

In [5]:
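# Double the per-group sample size, then divide by denominator units per pageview to convert to pageviews
print("The total number of pageviews needed using Retention is {:.2f}.".format((39115*2)/(660./40000.)))
print("The total number of pageviews needed using Net Conversion is {}.".format((27345*2)/.08))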

The total number of pageviews needed using Retention is 4741212.12.
The total number of pageviews needed using Net Conversion is 683625.0.
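The per-group sample sizes quoted above come from the online calculator. As a rough cross-check, the sketch below applies a standard two-proportion sample-size formula. It is an assumption that the calculator uses a formula of this form, so the results should land close to the quoted figures without necessarily matching them exactly. The z-scores are hard-coded for a two-sided alpha of 0.05 and a power of 0.8, and the function names are hypothetical.

```python
import math

# z-scores hard-coded for the settings used above (assumed two-sided test)
Z_ALPHA = 1.959964   # alpha = 0.05, two-sided
Z_BETA = 0.841621    # power = 0.8 (beta = 0.2)

def sample_size_per_group(p, d_min):
    """Approximate units needed per group to detect an absolute change d_min."""
    sd_null = math.sqrt(2.0 * p * (1.0 - p))
    p_alt = p + d_min
    sd_alt = math.sqrt(p * (1.0 - p) + p_alt * (1.0 - p_alt))
    return ((Z_ALPHA * sd_null + Z_BETA * sd_alt) / d_min) ** 2

def pageviews_needed(p, d_min, units_per_pageview):
    """Total pageviews across both groups for a metric whose denominator
    occurs units_per_pageview times per pageview."""
    return 2.0 * sample_size_per_group(p, d_min) / units_per_pageview

retention_n = sample_size_per_group(0.53, 0.01)
net_n = sample_size_per_group(0.109, 0.0075)
retention_pageviews = pageviews_needed(0.53, 0.01, 660.0 / 40000.0)
net_pageviews = pageviews_needed(0.109, 0.0075, 0.08)

print("Retention: {:.0f} per group, {:.0f} pageviews".format(
    retention_n, retention_pageviews))
print("Net Conversion: {:.0f} per group, {:.0f} pageviews".format(
    net_n, net_pageviews))
```

The formula shows why Retention is so expensive: its required sample is counted in enrollments, and each pageview yields far fewer enrollments than clicks.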


### Choosing Duration vs. Exposure

What percentage of Udacity's traffic would you divert to this experiment (assuming there were no other experiments you wanted to run simultaneously)? Is the change risky enough that you wouldn't want to run on all traffic?
* A test would be considered high risk if it could potentially harm its users or if it collected sensitive information. Since this test raises neither issue and will not impact existing paying customers, I think it is safe to run the test on 100% of user traffic.

Given the percentage you chose, how long would the experiment take to run, using the analytic estimates of variance? If the answer is longer than a few weeks, then this is unreasonably long, and you should reconsider an earlier decision.

In [6]:
# Days needed = required pageviews / daily pageviews (40000), rounded up
print("Using Retention, the experiment should be run for {} days to collect the requisite number of pageviews.".format(math.ceil(4741212.12/40000.)))
print("Excluding Retention, the experiment should be run for {} days to collect the requisite number of pageviews.".format(math.ceil(683625./40000.)))

Using Retention, the experiment should be run for 119.0 days to collect the requisite number of pageviews.
Excluding Retention, the experiment should be run for 18.0 days to collect the requisite number of pageviews.


Based on this information, including Retention as an evaluation metric would stretch the experiment to 119 days, far longer than the few weeks considered reasonable. Retention will therefore be dropped as an evaluation metric, leaving only Gross Conversion and Net Conversion, for which 18 days of traffic are enough to complete the test.
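The trade-off between duration and exposure can be sketched with a small helper. The function name `days_needed` and the `traffic_fraction` parameter are illustrative assumptions. The notebook itself only evaluates the 100% case.

```python
import math

def days_needed(total_pageviews, daily_pageviews=40000.0, traffic_fraction=1.0):
    """Days required to collect total_pageviews when only traffic_fraction
    of the daily pageviews is diverted to the experiment."""
    if not 0.0 < traffic_fraction <= 1.0:
        raise ValueError("traffic_fraction must be in (0, 1]")
    return int(math.ceil(total_pageviews / (daily_pageviews * traffic_fraction)))

# Pageview requirements computed earlier in this report
with_retention = days_needed(4741212.12)
without_retention = days_needed(683625.0)
# Hypothetical scenario: divert only half of the traffic
half_traffic = days_needed(683625.0, traffic_fraction=0.5)

print("With Retention: {} days".format(with_retention))
print("Without Retention: {} days".format(without_retention))
print("Without Retention at 50% traffic: {} days".format(half_traffic))
```

Halving the diverted traffic roughly doubles the duration, which is why running on all traffic is attractive when the change is low risk.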

### Analysis

The data from the experiment is in the following table. This data contains the raw information needed to compute the above metrics, broken down day by day.

In [7]:
control = pd.read_csv("results - control.csv")
experiment = pd.read_csv("results - experiment.csv")
df = pd.merge(control, experiment, on="Date", suffixes = ("_control", "_exp"))
df.head()

| | Date | Pageviews_control | Clicks_control | Enrollments_control | Payments_control | Pageviews_exp | Clicks_exp | Enrollments_exp | Payments_exp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134 | 70 | 7716 | 686 | 105 | 34 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147 | 70 | 9288 | 785 | 116 | 91 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167 | 95 | 10480 | 884 | 145 | 79 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156 | 105 | 9867 | 827 | 138 | 92 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163 | 64 | 9793 | 832 | 140 | 94 |


The meaning of each column is:
* Pageviews: Number of unique cookies to view the course overview page that day.
* Clicks: Number of unique cookies to click the "Start free trial" button on the course overview page that day.
* Enrollments: Number of user-ids to enroll in the free trial that day.
* Payments: Number of user-ids who enrolled on that day and remained enrolled for 14 days, thus making a payment. (Note that the date for this column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.)

### Sanity Checks

At this point I will create a confidence interval for each of my invariant metrics and make sure that the observed value for each metric is within that confidence interval.

**Number of Cookies**

In [8]:
control_pageview = float(sum(df['Pageviews_control']))
exp_pageview = float(sum(df['Pageviews_exp']))
total = control_pageview + exp_pageview
# Under a fair 50/50 split, the control share of pageviews has SE sqrt(0.5 * 0.5 / total)
standard_error = np.sqrt((.5*.5)*(1.0/total))
margin = standard_error * 1.96
print("Confidence Interval = [{:.4f},{:.4f}].".format((.5-margin),(.5+margin)))
print("The actual proportion of pageviews for control group = {:.4f}, which is inside the confidence interval, therefore the number of cookies metric passes the sanity check.".format(control_pageview/total))

Confidence Interval = [0.4988,0.5012].
The actual proportion of pageviews for control group = 0.5006, which is inside the confidence interval, therefore the number of cookies metric passes the sanity check.


**Number of clicks**

In [9]:
control_clicks = float(sum(df['Clicks_control']))
exp_clicks = float(sum(df['Clicks_exp']))
total = control_clicks + exp_clicks
standard_error = np.sqrt((.5*.5)*(1.0/total))
margin = standard_error * 1.96
print("Confidence Interval = [{:.4f},{:.4f}].".format((.5-margin),(.5+margin)))
print("The actual proportion of clicks for control group = {:.4f}, which is inside the confidence interval, therefore the number of clicks metric passes the sanity check.".format(control_clicks/total))

Confidence Interval = [0.4959,0.5041].
The actual proportion of clicks for control group = 0.5005, which is inside the confidence interval, therefore the number of clicks metric passes the sanity check.
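Both sanity checks follow the same recipe: build a 95% confidence interval around the expected 50% share and check whether the control group's observed share falls inside it. A minimal sketch of that recipe as a reusable function is shown below. The name `sanity_check` is hypothetical, and the counts in the example are made up for illustration. They are not the experiment's totals.

```python
import math

def sanity_check(control_count, experiment_count, expected_share=0.5, z=1.96):
    """Return (lower, upper, observed, passed) for a count-based invariant metric."""
    total = float(control_count + experiment_count)
    standard_error = math.sqrt(expected_share * (1.0 - expected_share) / total)
    margin = z * standard_error
    lower, upper = expected_share - margin, expected_share + margin
    observed = control_count / total
    return lower, upper, observed, lower <= observed <= upper

# Made-up counts for illustration only
balanced = sanity_check(5000, 5000)
skewed = sanity_check(5200, 4800)

for label, result in (("balanced", balanced), ("skewed", skewed)):
    print("{}: CI = [{:.4f},{:.4f}], observed = {:.4f}, passed = {}".format(
        label, *result))
```

A failed check would suggest a problem with the diversion or the data capture, and the evaluation metrics should not be trusted until it is explained.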


### Check for practical and statistical significance

**Gross Conversion**

In [10]:
df = df.dropna(axis=0)
control_clicks = float(sum(df['Clicks_control']))
exp_clicks = float(sum(df['Clicks_exp']))
control_enroll = float(sum(df['Enrollments_control']))
exp_enroll = float(sum(df['Enrollments_exp']))
control_gross = control_enroll/control_clicks
exp_gross = exp_enroll/exp_clicks
total_gross = (exp_enroll + control_enroll)/(exp_clicks + control_clicks)
# Pooled standard error for the difference between two proportions
se = np.sqrt(total_gross*((1.0 - total_gross)*(1./control_clicks + 1./exp_clicks)))
margin = 1.96 * se
difference = exp_gross - control_gross
print("Gross Conversion confidence interval: [{:.4f},{:.4f}].".format((difference - margin),(difference + margin)))


Gross Conversion confidence interval: [-0.0291,-0.0120].


Since the confidence interval for Gross Conversion does not include 0, the result is statistically significant.

The result is also practically significant: both boundaries of the confidence interval lie below -0.01, so the decrease exceeds the practical significance threshold (dmin=0.01) outlined earlier in the metric choice section.

**Net Conversion**

In [11]:
control_clicks = float(sum(df['Clicks_control']))
exp_clicks = float(sum(df['Clicks_exp']))
control_pay = float(sum(df['Payments_control']))
exp_pay = float(sum(df['Payments_exp']))
control_net = control_pay/control_clicks
exp_net = exp_pay/exp_clicks
total_net = (exp_pay + control_pay)/(exp_clicks + control_clicks)
se = np.sqrt(total_net*((1.0 - total_net)*(1./control_clicks + 1./exp_clicks)))
margin = 1.96 * se
difference = exp_net - control_net
print("Net Conversion confidence interval: [{:.4f},{:.4f}].".format((difference - margin),(difference + margin)))

Net Conversion confidence interval: [-0.0116,0.0019].


Since the confidence interval for Net Conversion does include 0, this result can not be considered statistically significant.

From a practical significance viewpoint, the launch criteria state that the change should not decrease Net Conversion, and the practical significance threshold is 0.0075. The confidence interval includes 0, and its lower bound (-0.0116) lies below -0.0075, so a practically significant decrease in Net Conversion cannot be ruled out. Its upper bound (0.0019) is below 0.0075, so a practically significant increase can be ruled out. This result is therefore not aligned with the goals of the experiment.
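The reasoning applied to both confidence intervals can be summarized in a few comparisons against 0 and against the practical significance boundary. The sketch below is one way to encode it. The function `interpret_ci` and its key names are illustrative assumptions, and the intervals are the ones reported above.

```python
def interpret_ci(lower, upper, d_min):
    """Classify a confidence interval for a difference in proportions."""
    return {
        # Statistically significant when the interval excludes 0
        "statistically_significant": not (lower <= 0.0 <= upper),
        # Practically significant when the whole interval lies beyond +/- d_min
        "practically_significant": lower >= d_min or upper <= -d_min,
        # A decrease of at least d_min cannot be excluded when the
        # interval reaches below -d_min
        "decrease_of_d_min_possible": lower < -d_min,
    }

gross = interpret_ci(-0.0291, -0.0120, d_min=0.01)
net = interpret_ci(-0.0116, 0.0019, d_min=0.0075)

print("Gross Conversion: {}".format(gross))
print("Net Conversion: {}".format(net))
```

For Gross Conversion the decrease is the intended effect. For Net Conversion the same flag is the warning sign, since the launch criteria require that Net Conversion does not drop.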

### Run Sign Tests

For each evaluation metric, do a sign test using the day-by-day breakdown. If the sign test does not agree with the confidence interval for the difference, see if you can figure out why.

In [12]:
# True on days where the experiment group's conversion rate exceeds the control group's
df['gross_sign'] = df['Enrollments_exp'] / df['Clicks_exp'] > df['Enrollments_control'] / df['Clicks_control']
df['net_sign'] = df['Payments_exp'] / df['Clicks_exp'] > df['Payments_control'] / df['Clicks_control']
df.head()

| | Date | Pageviews_control | Clicks_control | Enrollments_control | Payments_control | Pageviews_exp | Clicks_exp | Enrollments_exp | Payments_exp | gross_sign | net_sign |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134 | 70 | 7716 | 686 | 105 | 34 | False | False |
| 1 | Sun, Oct 12 | 9102 | 779 | 147 | 70 | 9288 | 785 | 116 | 91 | False | True |
| 2 | Mon, Oct 13 | 10511 | 909 | 167 | 95 | 10480 | 884 | 145 | 79 | False | False |
| 3 | Tue, Oct 14 | 9871 | 836 | 156 | 105 | 9867 | 827 | 138 | 92 | False | False |
| 4 | Wed, Oct 15 | 10014 | 837 | 163 | 64 | 9793 | 832 | 140 | 94 | False | True |


In [13]:
# A True sign marks a day on which the experiment group's rate was higher than the control group's
print("The experiment group had a higher gross conversion rate than the control group in {} days out of {}.".format(
        sum(df['gross_sign']), len(df)))
print("The experiment group had a higher net conversion rate than the control group in {} days out of {}.".format(
        sum(df['net_sign']), len(df)))

The experiment group had a higher gross conversion rate than the control group in 4 days out of 23.
The experiment group had a higher net conversion rate than the control group in 10 days out of 23.


The following online tool, http://graphpad.com/quickcalcs/binomial1.cfm, gives these two-tailed p-values for the sign tests:
* The p-value for gross conversion is .0026, which is statistically significant.
* The p-value for net conversion is .6776, which is not statistically significant.

These results agree with the confidence intervals I calculated before.
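The p-values from the online tool can be cross-checked with an exact two-tailed binomial test under the null hypothesis that each day is equally likely to favor either group. The sketch below uses only the standard library. The function names are illustrative, and the counts (4 and 10 days out of 23) are taken from the cell above.

```python
import math

def binomial_coefficient(n, k):
    """Number of ways to choose k items out of n."""
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

def sign_test_p_value(successes, trials):
    """Two-tailed exact binomial p-value under H0: P(success) = 0.5."""
    # With p = 0.5 the distribution is symmetric, so double the smaller tail
    k = min(successes, trials - successes)
    tail_count = sum(binomial_coefficient(trials, i) for i in range(k + 1))
    one_tail = tail_count / float(2 ** trials)
    return min(1.0, 2.0 * one_tail)

gross_p = sign_test_p_value(4, 23)
net_p = sign_test_p_value(10, 23)

print("Gross conversion sign test p-value: {:.4f}".format(gross_p))
print("Net conversion sign test p-value: {:.4f}".format(net_p))
```

Because the null probability is 0.5, it does not matter whether the days counted are those favoring the control group or the experiment group. The p-value is the same either way.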

### Recommendation

As discussed earlier in the "Launch Criteria" section, recommending a launch requires a statistically and practically significant decrease in gross conversion without a decrease in net conversion. The test produced a statistically and practically significant decrease in gross conversion, but the change in net conversion was not statistically significant, and its confidence interval leaves open a practically significant drop. The change would therefore probably not increase profits for Udacity, and I would not recommend launching it on the website.

# Follow-Up Experiment: How to Reduce Early Cancellations

An early cancellation occurs when a user signs up for the free trial, using their credit card, and cancels their subscription during the first 14 days, before their credit card is charged. In an attempt to reduce the frequency of early cancellations, and get more users to follow through on their subscription, I would propose the following A/B test.

The test would be almost identical to the test outlined in this project, except that instead of asking people how much time they have to devote to the program, it would ask how confident they are in the essential skills needed to complete the projects. For example, if they were signing up for the data analyst nanodegree, the pop-up would say "This program requires an intermediate ability to write code in Python. How confident are you with Python programming?"

If someone reports a high level of confidence they are encouraged to continue with their enrollment, otherwise, they are encouraged to do some additional studying in the relevant skills. 

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

The hypothesis would be that users would have a clearer picture of the technical skills required to finish the course, so fewer users would cancel their subscription after being surprised to find that they cannot complete the projects at their current skill level.

The invariant metrics would be the number of cookies and the number of clicks, because we would want an equal number of users in each group seeing the home page, and an equal number of users in each group clicking the "start free trial" button.

The evaluation metrics would be gross conversion and net conversion because the goal of the test is to get more students to enroll in the program, and stay enrolled beyond the free trial. A statistically significant change in both metrics would warrant a recommendation to launch the change.