# Situation

At the time of this experiment, Udacity courses have two options on the course overview page: "start free trial" and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

# Experiment

In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. [This screenshot](https://www.google.com/url?q=https://drive.google.com/a/knowlabs.com/file/d/0ByAfiG8HpNUMakVrS0s4cGN2TjQ/view?usp%3Dsharing&sa=D&ust=1566807927543000) shows what the experiment looks like.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

In [2]:
# Import some sensible defaults
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Hypothesis

$H_0$: This change won't set clearer expectations for students upfront and won't reduce the number of frustrated students who leave the free trial because they don't have enough time. It won't significantly reduce the number of students who continue past the free trial and eventually complete the course.

$H_1$: This change will set clearer expectations for students upfront, thus reducing the number of frustrated students who leave the free trial because they don't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. If this hypothesis holds true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

## Metric Choice

There are several possible metrics that could be used for the experiment.

### Invariant Metrics

Invariant metrics are chosen to investigate possible issues in the experiment setup and execution, i.e. detect bad measurements due to errors or unintended consequences.

#### Selected:    

| Metric Name               | Formula                                 | $d_{min}$ | Notation         |
|---------------------------|-----------------------------------------|-----------|------------------|
| Number of cookies         | # cookies on course overview page       | 3000      | $cookies_{uniq}$ |
| Number of clicks          | # cookies clicked on button             | 240       | $clicks_{uniq}$  |
| Click-Through-Probability | $\frac{clicks_{uniq}}{cookies_{uniq}}$  | 1%        | $CTP$            |

- **Number of cookies**: That is, number of unique cookies to view the course overview page. (dmin=3000)
    - A good invariant because the cookie is the unit of diversion and is therefore randomized by definition. The share of cookies in each group should not be significantly different from p=0.5 (for an equal split).
- **Number of clicks**: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
    - Also a good invariant. Even though it is a subset of the cookies above, the experimental change only appears after the button is clicked, so the number of clicks should not be significantly different between groups.
- **Click-through-probability**: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
    - The click-through-probability from course overview page to "Start free trial" click should be unaffected by the experiment, hence not significantly different between groups.

#### Not Selected:    

| Metric Name               | Formula                                  | Practical Significance $d_{min}$ | Notation |
|---------------------------|------------------------------------------|----------------------------------|----------|
| Number of user-ids        | # user-ids that enroll in the free trial | 50                               |$enrolled$|

- **Number of user-ids**: That is, number of users who enroll in the free trial. (dmin=50)
    - This could possibly have been an invariant metric but since it is only recorded after the experimental change we're unable to use it as such. It might be affected by the variation.

### Evaluation Metrics

Evaluation metrics are chosen to investigate the impact of the changes. They are usually tied to business goals (at least indirectly).

| Metric Name      | Formula                           | Practical Significance $d_{min}$ | Notation             |
|------------------|-----------------------------------|----------------------------------|----------------------|
| Gross Conversion | $\frac{enrolled}{clicks_{uniq}}$  | 1%                               | $Conversion_{gross}$ |
| Retention        | $\frac{payment}{enrolled}$        | 1%                               | $Retention$          |
| Net Conversion   | $\frac{payment}{clicks_{uniq}}$   | 0.75%                            | $Conversion_{net}$   |


- **Gross conversion**: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
    - This metric is going to measure whether we're successfully deterring users from enrolling in the free trial. We're expecting the gross conversion to go down ($H_1$).
- **Retention**: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
    - This metric is going to measure the probability that we're successfully retaining users through the free trial. We're expecting retention to increase ($H_1$).
- **Net conversion**: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)
    - This metric is going to measure whether we remain able to funnel users through the free trial that are likely to succeed in completing a course. We're hoping that this metric will stay unchanged ($H_1$).

NOTE: _Any place "unique cookies" are mentioned, the uniqueness is determined by day. (That is, the same cookie visiting on different days would be counted twice.) User-ids are automatically unique since the site does not allow the same user-id to enroll twice._

NOTE: $d_{min}$ _is the difference necessary to be practically significant._

**Essential Measures for Launch Recommendation:** We need to see a significant drop of at least 1% in Gross Conversion, without a significant change in Retention or Net Conversion (practical significance levels of 1% and 0.75% respectively).

## Variability

Variability estimates are essential to sizing an experiment reliably. Certain types of metrics allow us to make variance estimates analytically (e.g. binomial, normal, the difference between two counts, and rates); others are more difficult as they rely on the underlying distribution.

A summary of common estimates for metrics:

| type of metric | distribution | requirement | est. Variance |
|---|---|---|---|
| probability | binomial (normal) | probability value | $\frac{\hat{p}(1-\hat{p})}{N}$ |
| mean | normal | std. estimate | $\frac{\hat{\sigma}^2}{N}$ |
| median / percentile | varies | - | depends on underlying distribution |
| difference (count) | normal (often) | variance est. of the compared counts | $var(x)+var(y)$ |
| rates | poisson (often) | mean | $\bar{x}$ (mean) |
| ratios | varies | - | depends on underlying distribution of numerator and denominator |
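
As a minimal sketch of the first row of the table, the analytical standard deviation of a probability metric is the square root of its variance estimate. The helper name `binomial_sd` is illustrative and not part of the original notebook.

```python
from math import sqrt

def binomial_sd(p_hat, n):
    """Analytical standard deviation of a probability metric: sqrt(p(1-p)/N)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= p_hat <= 1:
        raise ValueError("p_hat must be between 0 and 1")
    return sqrt(p_hat * (1 - p_hat) / n)

# Example: a probability of 0.08 measured on 5000 units
print(round(binomial_sd(0.08, 5000), 6))
```

The estimate shrinks with the square root of `n`, which is why metrics measured on fewer units are noisier.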

The following are baseline values. I assume they are averages measured over a longer timeframe.

#### Baseline Values

In [3]:
baseline = pd.read_csv('data/baseline.csv', header=None)
baseline.columns = ['description','N=40000']

# add notation for easier retrieval
baseline.index = ['cookies_uniq','clicks_uniq','enrolled','ctp','conversion_gross','retention','conversion_net']

# add practical significance levels (d_min)

baseline['dmin'] = [3000,240,50,0.01,0.01,0.01,0.0075]

Next, we should familiarize ourselves with the metrics we're working with. We'll categorize them by type and unit of analysis.

The type is important for our variance estimate, and the unit of analysis lets us decide whether this variance estimate is likely to be reliable. That is, if `unit of analysis != unit of diversion`, then our analytical estimate might not be reliable and we should consider an empirical estimate instead.

_Remember the unit of diversion is 'cookies'._

In [4]:
baseline['type'] = ['count', 'count', 'count', 'probability', 'probability', 'probability', 'probability']
baseline['unit'] = ['cookies', 'cookies', 'user-id', 'cookies','cookies','user-id','cookies']

In [5]:
# put columns in order
baseline = baseline.loc[:,['description','type','unit','dmin','N=40000']]
baseline

| | description | type | unit | dmin | N=40000 |
|---|---|---|---|---|---|
| cookies_uniq | Unique cookies to view course overview page pe... | count | cookies | 3000.0 | 40000.0 |
| clicks_uniq | Unique cookies to click "Start free trial" per... | count | cookies | 240.0 | 3200.0 |
| enrolled | Enrollments per day: | count | user-id | 50.0 | 660.0 |
| ctp | Click-through-probability on "Start free trial": | probability | cookies | 0.01 | 0.08 |
| conversion_gross | Probability of enrolling, given click: | probability | cookies | 0.01 | 0.20625 |
| retention | Probability of payment, given enroll: | probability | user-id | 0.01 | 0.53 |
| conversion_net | Probability of payment, given click | probability | cookies | 0.0075 | 0.109313 |


Because cookies_uniq, clicks_uniq and enrolled are counts, we can't make a reliable analytical variance estimate for them. That is okay since they are not part of our evaluation metrics and we don't necessarily need estimates for them.

For the other metrics we're working with probabilities, and luckily we have an analytical variance estimate available:

$$\frac{\hat{p}(1-\hat{p})}{N}$$ 

Lastly, we need to check whether `unit of diversion = unit of analysis`. This holds for all probability metrics except retention (which uses `user-id` as its unit of analysis). Ideally, we would want an empirical estimate for this metric, but we'll disregard this disconnect for now and move on.

For this estimate we'll use a sample of $N=5000$ cookies visiting the course overview page.

In [6]:
# Reduce counts to 5000 cookies (i.e. divide  each count by 8)
baseline.loc[['cookies_uniq','clicks_uniq','enrolled'],'N=5000'] = baseline['N=40000'] / 8

# Probabilities don't change by changing the counts proportionally
baseline.loc[['ctp','conversion_gross','retention','conversion_net'],'N=5000'] = baseline.loc[['ctp','conversion_gross','retention','conversion_net'],'N=40000']

baseline.loc[:,['N=5000']]

| | N=5000 |
|---|---|
| cookies_uniq | 5000.0 |
| clicks_uniq | 400.0 |
| enrolled | 82.5 |
| ctp | 0.08 |
| conversion_gross | 0.20625 |
| retention | 0.53 |
| conversion_net | 0.109313 |


These adjusted sample sizes are now going to be used to make our estimates for the Standard Deviations (i.e. the square root of the variance) of the evaluation metrics.

In [7]:
# Calculate standard deviation for CTP with N = 5000
p_ctp = baseline.loc['ctp','N=5000']
N_ctp = baseline.loc['cookies_uniq','N=5000']
baseline.loc['ctp','Standard Deviation Est.'] = np.sqrt(p_ctp * (1-p_ctp) / N_ctp)

# Calculate standard deviation for conversion_gross with N = 5000 (respectively 400)
p_conversion_gross = baseline.loc['conversion_gross','N=5000']
N_conversion_gross = baseline.loc['clicks_uniq','N=5000']
baseline.loc['conversion_gross','Standard Deviation Est.'] = np.sqrt(p_conversion_gross * (1-p_conversion_gross) / N_conversion_gross)

# Calculate standard deviation for retention with N = 5000 (respectively 82.5)
p_retention = baseline.loc['retention','N=5000']
N_retention = baseline.loc['enrolled','N=5000']
baseline.loc['retention','Standard Deviation Est.'] = np.sqrt(p_retention * (1-p_retention) / N_retention)

# Calculate standard deviation for conversion_net with N = 5000 (respectively 400)
p_conversion_net = baseline.loc['conversion_net','N=5000']
N_conversion_net = baseline.loc['clicks_uniq','N=5000']
baseline.loc['conversion_net','Standard Deviation Est.'] = np.sqrt(p_conversion_net * (1-p_conversion_net) / N_conversion_net)

baseline.loc[['ctp','conversion_gross','retention','conversion_net'],['Standard Deviation Est.']]

| | Standard Deviation Est. |
|---|---|
| ctp | 0.003837 |
| conversion_gross | 0.020231 |
| retention | 0.054949 |
| conversion_net | 0.015602 |


Understanding the variability of your metrics is essential for sizing your experiment and for deciding which metrics can and should be used (a metric with high variability might not be practical to experiment with).

### Sizing 

The variability estimates above are not only essential to understanding whether a metric is usable (i.e. reliable), they can also be used to size the experiment.

Let's consider our requirements. $\alpha$ is the probability of a Type I Error that we're willing to accept and $\beta$ the probability of a Type II Error that we're willing to accept. For a fixed sample size, these are in an inverse relationship. Another restriction is our sample size $N$. We're therefore facing an optimization problem between these three factors. Since we usually have best-practice values for $\alpha$ and $\beta$, we usually make $N$ the variable to optimize.

To better understand the sizing formula, let's first consider what $d$ generally looks like:

$$d = |p_{control} - p_{exp}|$$

Simple enough: the difference between two values is the absolute value of one subtracted from the other. Since we're estimating, we replace the values above with educated guesses and get:

$$d = |p_{baseline} - p_{dmin}|$$

We use the baseline value as our control probability, and $p_{dmin} = p_{baseline} + d_{min}$ (the baseline value plus the practical significance level) as the value at the smallest change that is valuable to detect.

_Note: The smallest $d$ we care about detecting is our practical significance level $d_{min}$._

Let's now consider the boundary values at which each of our requirements is just met:

$$for\ \alpha_{min},\ \ \ z_{1-\alpha}*\frac{\sqrt{p_{baseline}(1-p_{baseline})}}{\sqrt{N}}$$

$$for\ \beta_{min},\ \ \ -z_{1-\beta}*\frac{\sqrt{p_{dmin}(1-p_{dmin})}}{\sqrt{N}}$$

_Note: We're taking the negative of the result for $\beta_{min}$ as we're interested in the lower tail of the distribution (where we have the largest overlap between the two distributions) in an experiment setup where we assume the evaluation metrics increase_

$\sqrt{p(1-p)}$ is simply the standard deviation (based on the variance estimate of a probability), so we'll replace it with $SD$ for better readability. Now, using the formula with the minimized $\alpha$ and $\beta$ we get,

$$d_{min} = z_{1-\alpha}*\frac{SD_{baseline}}{\sqrt{N}} - (-z_{1-\beta}*\frac{SD_{dmin}}{\sqrt{N}})$$

Solving this for $N$ gives the final workable formula (for the two-tailed test used below, $z_{1-\alpha}$ is evaluated at $1-\alpha/2$),

$$N=\bigg(\frac{Z_{1-\alpha}SD_1+Z_{1-\beta}SD_2}{d_{min}}\bigg)^2$$
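
For readers who want the intermediate algebra, the rearrangement can be spelled out as follows. This is only a sketch of the step described above, writing $SD_1$ for $SD_{baseline}$ and $SD_2$ for $SD_{dmin}$ and collecting both terms over the common denominator $\sqrt{N}$:

$$d_{min} = \frac{Z_{1-\alpha}SD_1+Z_{1-\beta}SD_2}{\sqrt{N}}\ \ \Rightarrow\ \ \sqrt{N} = \frac{Z_{1-\alpha}SD_1+Z_{1-\beta}SD_2}{d_{min}}\ \ \Rightarrow\ \ N=\bigg(\frac{Z_{1-\alpha}SD_1+Z_{1-\beta}SD_2}{d_{min}}\bigg)^2$$

Because $d_{min}$ sits in the denominator and the whole term is squared, halving the practical significance level roughly quadruples the required sample size.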

We can derive $SD_1$ and $SD_2$ from the variance estimates table above. Note, however, that we're comparing two samples, so the variances of both groups have to be added:

$$SD_1=\sqrt{2*p_{baseline}(1-p_{baseline})}\ \ \ \ \ SD_2=\sqrt{p_{baseline}(1-p_{baseline}) + (p_{dmin})(1-(p_{dmin}))}$$

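$SD_1$ is the estimated standard deviation of the difference when both groups sit at the baseline value, and $SD_2$ is the estimated standard deviation of the difference when one group sits at the baseline value and the other at the baseline value + the practical significance level.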

The following is taken from this [blog post](http://www.alfredo.motta.name/ab-testing-from-scratch/), which translates a sizing script originally written in R. It requires some prior knowledge of power and significance to follow.

Basically, we'll need a few things:

| Input | Notation |
|---|---|
| Statistical Significance level (Probability of Type I Error / alpha) | $\alpha$ |
| Practical Significance level (closest True Parameter of interest) | $d_{min}$ |
| Standard Error Estimate | $s$ |
| Beta (Probability of Type II Error) | $\beta$ |

The significance level and beta can simply be defined using experience and best practices (often $\alpha=0.05$ and $\beta=0.2$). The practical significance level can be chosen using business goals and experience. Lastly, since we usually work with a baseline in A/B testing, we can often use its Standard Error Estimate as an approximation for the Standard Error Estimate of the True Parameter (the parameter we're trying to detect via the experiment).

Now that we have the inputs, we want the smallest possible sample size (N) that fulfills the $\leq\beta$ requirement. One way to find it is by trial and error: compute $\beta$ for a range of Ns and pick the smallest N at which $\beta$ crosses the desired value. The calculator below instead applies the closed-form formula derived above, which gives that N directly.

_Note: This sample size calculator assumes a two-tailed test and a normal distribution and is not usable for any other underlying distributions or test types. It also makes a simplification for z\* which is usually fine since we're largely just making an estimate but usually using one of the [online calculators](http://www.evanmiller.org/ab-testing/sample-size.html) is advisable._

In [75]:
from scipy.stats import norm
from math import sqrt,factorial

# Inputs:
#   p_base: The baseline probability of the metric
#   d_min: The practical significance level
# Returns: [SD_1, SD_2], the standard deviations of the difference between the
#          two groups (for N=1 in each group) at the baseline and at baseline + d_min
def get_SDs(p_base,d_min):
    sd_base = sqrt(2*p_base*(1-p_base))
    p_dmin = p_base+d_min
    sd_dmin = sqrt(p_base*(1-p_base)+p_dmin*(1-p_dmin))
    return [sd_base,sd_dmin]

# Inputs:
#   quantile: The cumulative probability to look up, e.g. 1 - alpha/2 for a
#             two-tailed test at level alpha
# Returns: The z-critical value at that quantile
# Note: -norm.ppf(alpha / 2) equals norm.ppf(1 - (alpha / 2)) due to the symmetric shape of the normal distribution
def get_z_star(quantile):
    return(norm.ppf(quantile))
    
# Inputs:
#   SDs: [SD_1, SD_2] as returned by get_SDs
#   d_min: The practical significance level
#   alpha: The desired alpha level of the test
#   beta: The desired beta level of the test
# Returns: The (unrounded) N that achieves the desired beta according to the
#          closed-form sizing formula. N is the number of samples in each
#          group, so there should be at least N samples in each group of the
#          experiment.
def required_size(SDs, d_min, alpha=0.05, beta=0.2):
    z_1_minus_a = get_z_star(1-alpha/2)
    z_1_minus_b = get_z_star(1-beta)
    
    return ((z_1_minus_a * SDs[0] + z_1_minus_b * SDs[1]) / d_min) ** 2

#### Calculate Size

Since there are three evaluation metrics, we'll need to size the experiment for each of them and choose the largest one that we choose to include. But first the inputs need to be defined.

_Define Inputs_

Statistical significance levels depend on how many metrics you include in your analysis. The chance of at least one false positive adds up quickly as metrics are added. To counteract this, we can lower alpha for each individual metric to maintain a desired $\alpha_{overall}$. A common way to do this is the Bonferroni Correction:

$$\alpha_{individual} = \frac{\alpha_{overall}}{m} = \frac{0.05}{3} = 0.0167$$

Our new alpha with three evaluation metrics is thus 

$\alpha = 0.0167$

$\beta = 0.2$

The practical significance and standard error estimate depend on the metric.
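
The Bonferroni Correction used for $\alpha$ above can be sketched as a one-line helper. The function name `bonferroni_alpha` is an assumption for illustration and does not appear in the original notebook.

```python
def bonferroni_alpha(alpha_overall, n_metrics):
    """Per-metric significance level under the Bonferroni correction."""
    if n_metrics < 1:
        raise ValueError("n_metrics must be at least 1")
    return alpha_overall / n_metrics

# Three evaluation metrics at an overall alpha of 0.05
print(round(bonferroni_alpha(0.05, 3), 4))  # 0.0167
# Two evaluation metrics at the same overall alpha
print(bonferroni_alpha(0.05, 2))  # 0.025
```

The correction is conservative: the fewer metrics we evaluate, the less strict each individual test has to be.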

In [9]:
alpha = 0.0167
beta = 0.2

_Gross Conversion_

In [10]:
p_gross = baseline.loc['conversion_gross','N=5000']
dmin_gross = baseline.loc['conversion_gross','dmin']
print('p_baseline: {}\nd_min: {}'.format(p_gross,dmin_gross))

N_gross = int(round(required_size(get_SDs(p_gross,dmin_gross), dmin_gross, alpha=alpha, beta=beta)))

print('N: {}'.format(N_gross))

p_baseline: 0.20625
d_min: 0.01
N: 34419


We need 34419 clicks (cookies that click "Start free trial") per group. Since our baseline click-through-probability from pageview (cookies_uniq) to click is 0.08, we divide the number of clicks by that probability to get the pageviews per group. Lastly, we double the result to cover both groups.

In [11]:
ctp = baseline.loc['ctp','N=5000']
cookies_gross = round(N_gross/ctp * 2)

print("Necessary number of pageviews (cookies_uniq) in total: {}".format(cookies_gross))

Necessary number of pageviews (cookies_uniq) in total: 860475.0


_Retention_

In [12]:
p_retention = baseline.loc['retention','N=5000']
dmin_retention = baseline.loc['retention','dmin']
print('p_baseline: {}\nd_min: {}'.format(p_retention,dmin_retention))

N_retention = int(round(required_size(get_SDs(p_retention,dmin_retention), dmin_retention, alpha=alpha, beta=beta)))

print('N: {}'.format(N_retention))

p_baseline: 0.53
d_min: 0.01
N: 52114


We'll need 52114 enrolled user-ids per group to get a relevant sample. In this case, we're two conversion steps away from pageviews. So to get to the necessary number of pageviews (cookies_uniq), we divide by the enrollment probability (0.20625) as well as by the click-through-probability (0.08).

In [13]:
cookies_retention = round(N_retention/p_gross/ctp * 2)

print("Necessary number of pageviews (cookies_uniq) in total: {}".format(cookies_retention))

Necessary number of pageviews (cookies_uniq) in total: 6316848.0


_Net Conversion_

In [14]:
p_net = baseline.loc['conversion_net','N=5000']
dmin_net = baseline.loc['conversion_net','dmin']
print('p_baseline: {}\nd_min: {}'.format(p_net,dmin_net))

N_net = int(round(required_size(get_SDs(p_net,dmin_net), dmin_net, alpha=alpha, beta=beta)))

print('N: {}'.format(N_net))

p_baseline: 0.1093125
d_min: 0.0075
N: 36505


In [15]:
cookies_net = round(N_net/ctp * 2)

print("Necessary number of pageviews (cookies_uniq) in total: {}".format(cookies_net))

Necessary number of pageviews (cookies_uniq) in total: 912625.0


### Duration and Exposure

We can see in the previous cells that Retention requires a much higher sample size compared to the other two metrics. In this step we'll evaluate the feasibility of acquiring a sample large enough by looking at the duration necessary to acquire such a sample. 

An experiment has to run at least long enough to capture the effect. How long that is depends on the experiment at hand, but we usually want a duration of at least approximately two weeks (two cycles). We also don't want the experiment to last too long, as we might capture a completely new population or other changes in the environment that make our groups no longer comparable.

One lever to extend or tighten duration is Exposure. We usually only want to show our experiment to a subset of users to minimize the risk and ensure a consistent experience for most of our users. For our experiment we'll set the exposure to 60%.
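
To get a feel for this lever, here is a small sketch of how duration scales with exposure. The helper `duration_days` is hypothetical; the example reuses the Net Conversion pageview requirement (912625) and the baseline of 40000 pageviews per day from this notebook.

```python
def duration_days(pageviews_needed, daily_pageviews, exposure):
    """Days needed to collect `pageviews_needed` when only a fraction
    `exposure` of the daily traffic is diverted into the experiment."""
    if not 0 < exposure <= 1:
        raise ValueError("exposure must be in (0, 1]")
    return pageviews_needed / (daily_pageviews * exposure)

# Compare a few exposure levels for the same pageview requirement
for exposure in (0.4, 0.6, 1.0):
    print(exposure, round(duration_days(912625, 40000, exposure), 1))
```

Lower exposure reduces risk for regular users but stretches the experiment, so the choice is a trade-off rather than a fixed rule.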

In [16]:
daily_cookies = baseline.loc['cookies_uniq','N=40000']
exposure = 0.6

duration_gross = cookies_gross/ (daily_cookies * exposure)
duration_retention = cookies_retention/ (daily_cookies * exposure)
duration_net = cookies_net/ (daily_cookies * exposure)

print("""duration (Gross Conversion): {} days
duration (Retention): {} days
duration (Net Conversion): {} days""".format(int(round(duration_gross)),int(round(duration_retention)),int(round(duration_net))))

duration (Gross Conversion): 36 days
duration (Retention): 263 days
duration (Net Conversion): 38 days


Our suspicion has been confirmed. Including Retention would make the experiment roughly 7 times longer than the other two metrics require. Thus, we'll drop _Retention_ from our roster of evaluation metrics. With only two evaluation metrics left, the Bonferroni-corrected $\alpha$ per metric becomes $0.05 / 2 = 0.025$, so the sample sizes need to be recalculated as well.

In [52]:
alpha = 0.05 / 2

N_gross = int(round(required_size(get_SDs(p_gross,dmin_gross), dmin_gross, alpha=alpha, beta=beta)))
N_net = int(round(required_size(get_SDs(p_net,dmin_net), dmin_net, alpha=alpha, beta=beta)))

cookies_gross = N_gross / ctp * 2
cookies_net = N_net / ctp * 2

duration_gross = cookies_gross/ (daily_cookies * exposure)
duration_net = cookies_net/ (daily_cookies * exposure)

print("""duration (Gross Conversion): {} days
duration (Net Conversion): {} days""".format(int(round(duration_gross)),int(round(duration_net))))

duration (Gross Conversion): 33 days
duration (Net Conversion): 35 days


## Analysis

The results are in. Due to a business decision, the experiment ran for 37 days, but that leaves us with only 23 days of valid results (payment tracking comes in with a 14-day delay).

_Note: We're going to ignore the validity issues of the length of the experiment for the purpose of this analysis and assume $\alpha = 0.025$ is applicable here._

In [53]:
# Load in experiment and control data
exp = pd.read_csv('data/experiment.csv',parse_dates=['Date'])
cont = pd.read_csv('data/control.csv',parse_dates=['Date'])

# Rename columns to above naming convention
header = ['date','cookies_uniq','clicks_uniq','enrolled','payments']
exp.columns = header
cont.columns = header

# Compute metrics
# Note that retention is not necessary to compute as we've dropped that evaluation metric earlier
exp['ctp'] = exp['clicks_uniq'] / exp['cookies_uniq']
exp['conversion_gross'] = exp['enrolled'] / exp['clicks_uniq']
exp['conversion_net'] = exp['payments'] / exp['clicks_uniq']

cont['ctp'] = cont['clicks_uniq'] / cont['cookies_uniq']
cont['conversion_gross'] = cont['enrolled'] / cont['clicks_uniq']
cont['conversion_net'] = cont['payments'] / cont['clicks_uniq']

cont.head()

| | date | cookies_uniq | clicks_uniq | enrolled | payments | ctp | conversion_gross | conversion_net |
|---|---|---|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 | 0.088955 | 0.195051 | 0.101892 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 | 0.085586 | 0.188703 | 0.089859 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 | 0.086481 | 0.183718 | 0.10451 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 | 0.084693 | 0.186603 | 0.125598 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 | 0.083583 | 0.194743 | 0.076464 |


### Sanity Checks

These initial tests ensure that our experiment was executed properly and reduce the chance that we've had any blunders in the test setup.

These tests are done on the predefined invariant metrics:

- Number of cookies         _| # cookies on course overview page_
- Number of clicks          _| # cookies clicked on button_ 
- Click-Through-Probability

#### Number of cookies

Cookies (pageviews) were the unit of diversion and have therefore been randomized. This means we can treat the assignment of each cookie as a coin flip with p=0.5. The Standard Error should therefore be

$$SE = \sqrt{\frac{p(1-p)}{N}},\ with\ p = 0.5$$


In [54]:
# Total number of cookies in experiment and control group
exp_cookies_total = float(exp.loc[:22,'cookies_uniq'].sum())
cont_cookies_total = float(cont.loc[:22,'cookies_uniq'].sum())

cookies_total = exp_cookies_total + cont_cookies_total

# exp fraction of total cookies (should not be significantly different from 0.5)
exp_split = exp_cookies_total / cookies_total

# Confidence interval of a binomial distribution of p = 0.5
p_equal = 0.5
se = sqrt(p_equal*(1-p_equal) / cookies_total)
z = -norm.ppf(alpha/2)
m = z*se
ci = (p_equal - m, p_equal + m)
print('The Confidence Interval (CI) is {} and {}.\nThe Experiment Group split: {}\nPass: {}'.format(ci[0],ci[1],exp_split,exp_split > ci[0] and exp_split < ci[1]))

The Confidence Interval (CI) is 0.49827793169 and 0.50172206831.
The Experiment Group split: 0.49905436515
Pass: True


We only need to check the split of one group here as one depends on the other (all assigned to one or the other by definition). In our case the split was successful and is not significantly different from what we'd expect.
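
Since the same check is repeated for every count-based invariant, it can be wrapped in a small helper. The sketch below is an illustration under assumptions: the function name `count_split_check` and the example counts are hypothetical, and it uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` so that it runs on its own.

```python
from math import sqrt
from statistics import NormalDist

def count_split_check(n_exp, n_cont, alpha=0.05, p_expected=0.5):
    """Check whether the experiment group's share of a count lies within the
    confidence interval around the expected split."""
    total = n_exp + n_cont
    se = sqrt(p_expected * (1 - p_expected) / total)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    margin = z * se
    ci = (p_expected - margin, p_expected + margin)
    observed = n_exp / total
    return ci, observed, ci[0] < observed < ci[1]

# Hypothetical counts, not taken from the experiment data
ci, observed, passed = count_split_check(5000, 5100)
print(ci, observed, passed)
```

A failed check does not say what went wrong, only that the split is unlikely under random assignment and the setup deserves a closer look.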

#### Number of Clicks

The next check will explore the invariant metric of clicks, using the same methodology.

In [55]:
# Total number of clicks in experiment and control group
exp_clicks_total = float(exp.loc[:22,'clicks_uniq'].sum())
cont_clicks_total = float(cont.loc[:22,'clicks_uniq'].sum())

clicks_total = exp_clicks_total + cont_clicks_total

# exp fraction of total clicks (should not be significantly different from 0.5)
exp_split = exp_clicks_total / clicks_total

# Confidence interval of a binomial distribution of p = 0.5
se = sqrt(p_equal*(1-p_equal) / clicks_total)
m = z*se
ci = (p_equal - m, p_equal + m)
print('The Confidence Interval (CI) is {} and {}.\nThe Experiment Group split: {}\nPass: {}'.format(ci[0],ci[1],exp_split,exp_split > ci[0] and exp_split < ci[1]))

The Confidence Interval (CI) is 0.493970975893 and 0.506029024107.
The Experiment Group split: 0.499522472723
Pass: True


#### Click-Through Probability

For probability comparisons we'll be going a similar route but use the difference between the probabilities and therefore use a pooled variance instead.

$$p_{pooled} = \frac{X_{cont} + X_{exp}}{N_{cont} + N_{exp}}$$

$$SE_{pooled} = \sqrt{p_{pooled}(1-p_{pooled})\bigg(\frac{1}{N_{cont}}+\frac{1}{N_{exp}}\bigg)}$$

Once we've computed the SE, we'll compute the Confidence Interval (CI) around zero and see whether the difference between $p_{exp}$ and $p_{cont}$ is significantly different from 0 (the expected difference).

$$d = p_{exp} - p_{cont}$$

In [56]:
p_ctp_exp = exp_clicks_total / exp_cookies_total
p_ctp_cont = cont_clicks_total / cont_cookies_total
d_ctp = p_ctp_exp - p_ctp_cont
p_pooled = clicks_total / cookies_total

p_0 = 0
se_pooled = sqrt(p_pooled*(1-p_pooled)*((1/exp_cookies_total)+(1/cont_cookies_total)))
m = z*se_pooled
ci = (p_0-m,p_0+m)
print('The Confidence Interval (CI) is {} and {}.\np Difference: {}\nPass: {}'.format(ci[0],ci[1],d_ctp,d_ctp > ci[0] and d_ctp < ci[1]))

The Confidence Interval (CI) is -0.00188553301796 and 0.00188553301796.
p Difference: 0.000152761502527
Pass: True


All our invariant metrics passed the sanity checks, so we can now go on and analyze our results.

### Multiple Metrics Analysis

At this stage we're now going to look into our evaluation metrics:

- Gross Conversion
- Net Conversion

Our analysis for these metrics follows logic similar to the Click-Through-Probability sanity check. However, the Confidence Interval is now constructed around the observed difference between the probabilities rather than around zero. Centering the interval on the measurement makes sense because it shows the uncertainty around our measurement, and it also gives us an easier way to assess practical significance.

#### Gross Conversion

As a reminder, for Gross Conversion we test statistical significance against a difference of 0 between the experiment and control groups, and our practical significance level is 0.01. Both are evaluated at $\alpha = 0.025$.

In [57]:
p_exp_gross = exp.loc[:22,'enrolled'].sum() / exp.loc[:22,'clicks_uniq'].sum()
p_cont_gross = cont.loc[:22,'enrolled'].sum() / cont.loc[:22,'clicks_uniq'].sum()
p_diff_gross = p_exp_gross - p_cont_gross

p_pooled_gross = (exp.loc[:22,'enrolled'].sum() + cont.loc[:22,'enrolled'].sum()) / clicks_total
se_pooled_gross = sqrt(p_pooled_gross * ((1-p_pooled_gross) * (1/exp_clicks_total + 1/cont_clicks_total)))
m = z * se_pooled_gross
ci = (p_diff_gross - m,p_diff_gross + m)
d_min = baseline.loc['conversion_gross','dmin']
print('CI around p_diff_gross: {}\nStatistical Significance: {}\nPractical Significance: {}'.format(ci,p_0 < ci[0] or p_0 > ci[1],-d_min > ci[1]))

CI around p_diff_gross: (-0.030353559713010379, -0.010756189447712752)
Statistical Significance: True
Practical Significance: True


We can see that the Confidence Interval around our measurement at $\alpha = 0.025$ does not include $p_0 = 0$ and lies entirely below $-d_{min} = -0.01$. We also expected Gross Conversion to decrease, which is what we observe. Thus, at $\alpha = 0.025$, the result is both statistically and practically significant.

#### Net Conversion

Before we go on and make a recommendation, we have to assess the second part of our hypothesis. We wanted Net Conversion to remain the same, so we expect a non-significant change in our results. This needs to be evaluated.

In [58]:
p_exp_net = exp.loc[:22,'payments'].sum() / exp.loc[:22,'clicks_uniq'].sum()
p_cont_net = cont.loc[:22,'payments'].sum() / cont.loc[:22,'clicks_uniq'].sum()
p_diff_net = p_exp_net - p_cont_net

p_pooled_net = (exp.loc[:22,'payments'].sum() + cont.loc[:22,'payments'].sum()) / clicks_total
se_pooled_net = sqrt(p_pooled_net * ((1-p_pooled_net) * (1/exp_clicks_total + 1/cont_clicks_total)))
m = z * se_pooled_net
ci = (p_diff_net - m,p_diff_net + m)
d_min = baseline.loc['conversion_net','dmin']
print('CI around p_diff_net: {}\nStatistical Significance: {}\nPractical Significance: {}'.format(ci,p_0 < ci[0] or p_0 > ci[1],-d_min > ci[1]))

CI around p_diff_net: (-0.012570998897390454, 0.0028235535483021193)
Statistical Significance: False
Practical Significance: False


Since the Confidence Interval includes 0, we're not seeing a statistically significant difference in Net Conversion at $\alpha=0.025$, and therefore no practical significance either.
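
The decision rule applied to both metrics can be sketched as a small helper. This is an illustrative sketch, not the notebook's own code: the name `assess_significance` is hypothetical, and unlike the cells above it checks practical significance in both directions rather than only for a decrease. The intervals are the ones printed above; the `d_min` passed for Net Conversion is a placeholder, since its value comes from the `baseline` table.

```python
def assess_significance(ci, d_min):
    """Classify a confidence interval around a difference.

    Returns (statistical, practical):
    - statistical: the interval excludes 0
    - practical: the interval lies entirely beyond -d_min or +d_min
    """
    lower, upper = ci
    statistical = not (lower <= 0 <= upper)
    practical = upper < -d_min or lower > d_min
    return statistical, practical


# Confidence intervals as printed by the cells above
gross = assess_significance((-0.030353559713010379, -0.010756189447712752), d_min=0.01)
# Placeholder d_min: the outcome is the same for any positive d_min,
# because this interval contains 0
net = assess_significance((-0.012570998897390454, 0.0028235535483021193), d_min=0.01)
print(gross, net)
```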

#### Sign Test Evaluation

To dot our i's and cross our t's, we'll use a non-parametric test (the sign test) to see whether we get the same results. Failing the sign test doesn't mean that the results of our parametric tests are wrong, but a passing sign test gives us additional confidence that our results are solid.

The main advantage of the sign test is that it makes no assumptions about the underlying distribution.

The critical value can be computed using

$$K = \frac{n-1}{2}-0.98\sqrt{n}$$

To calculate the p-value we use the binomial distribution, with $n$ = the number of days on which the difference is non-zero, $k$ = max{# of positive outcomes, # of negative outcomes}, and $p = 0.5$, the probability of a positive outcome under the null hypothesis.

Note: We're using the larger of the positive and negative counts as our "success" parameter.

$${}_nC_k = \frac{n!}{k!(n-k)!}$$

$$P(X = k) = {}_nC_k \, p^k (1-p)^{n-k}$$

This probability has to be computed for every count from the number of successes up to and including $n$, and then summed, which gives the one-sided tail probability. Since we're interested in a two-tailed test, we then double that sum.
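
As a sanity check on the procedure, here is a minimal, self-contained sketch of the two-sided sign test. The function name `sign_test_p_value` and the day counts are made up for illustration; it assumes Python 3.8+ for `math.comb` and caps the doubled probability at 1.

```python
from math import comb


def sign_test_p_value(n_positive, n_negative, p=0.5):
    """Two-sided sign test p-value from counts of positive and negative days.

    Days with a zero difference are assumed to be excluded already.
    """
    n = n_positive + n_negative
    k = max(n_positive, n_negative)
    # Upper tail: probability of k or more "successes" out of n, including n itself
    one_sided = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
    # Doubling can exceed 1 when the counts are balanced, so cap it
    return min(1.0, 2 * one_sided)


# Made-up example: 8 positive days and 2 negative days
p_value = sign_test_p_value(8, 2)
print(p_value)  # 0.109375
```

With 8 of 10 days pointing in the same direction, the upper tail is $(45 + 10 + 1)/1024$, which doubles to 0.109375, so this made-up example would not be significant at $\alpha = 0.025$.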

In [90]:
# Count the days with a positive / negative difference in gross conversion
diff_gross = exp.loc[:22,'conversion_gross'] - cont.loc[:22,'conversion_gross']
positive_gross = (diff_gross > 0).sum()
negative_gross = (diff_gross < 0).sum()
successes_gross = max(positive_gross,negative_gross)
total_gross = positive_gross + negative_gross

# Binomial probability of exactly k successes in n trials
def get_binom_p(k,n):
    combinations = float(factorial(n)/(factorial(k) * factorial(n-k)))
    p_binom = combinations * p_equal**k * (1-p_equal)**(n-k)
    return p_binom

# Sum the upper tail k = successes..n (the +1 includes k = n itself)
one_sided_prob = sum([get_binom_p(i+successes_gross,total_gross) for i in range(total_gross - successes_gross + 1)])
two_sided_prob = one_sided_prob * 2

print('Two-Sided probability: {}\nalpha: {}\nStatistical Significance: {}'.format(two_sided_prob,alpha,two_sided_prob < alpha))

Two-Sided probability: 0.00259947776794
alpha: 0.025
Statistical Significance: True


In [92]:
# Count the days with a positive / negative difference in net conversion
diff_net = exp.loc[:22,'conversion_net'] - cont.loc[:22,'conversion_net']
positive_net = (diff_net > 0).sum()
negative_net = (diff_net < 0).sum()
successes_net = max(positive_net,negative_net)
total_net = positive_net + negative_net

# Sum the upper tail k = successes..n (the +1 includes k = n itself)
one_sided_prob = sum([get_binom_p(i+successes_net,total_net) for i in range(total_net - successes_net + 1)])
two_sided_prob = one_sided_prob * 2

print('Two-Sided probability: {}\nalpha: {}\nStatistical Significance: {}'.format(two_sided_prob,alpha,two_sided_prob < alpha))

Two-Sided probability: 0.677639484406
alpha: 0.025
Statistical Significance: False


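The Sign Test confirms the results of our parametric tests: the change in Gross Conversion is statistically significant, while the change in Net Conversion is not.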

### Recommendation

The experiment set out to investigate whether an additional screener can improve the user experience by explaining that Udacity courses usually require a certain number of hours per week to be completed successfully. We therefore expected Gross Conversion to decrease and Net Conversion to remain the same (i.e. we only want to dissuade students who are unlikely to be successful from enrolling in the paid version).

Our experiment shows that the screener seems to achieve these results. Net Conversion did not change significantly, while Gross Conversion decreased by at least 1 percentage point. This implies that the feature should be launched.

To confirm that the treatment has the same effect on the entire population, we should roll out the change in stages and closely monitor any issues that arise.