# Audacity A/B Testing Project

## Experiment Overview: Free Trial Screener

Udacity is an online course platform specializing in the IT sector. They want to run an experiment on their website with the goal of improving their students' course completion rate.

### Context

Let's take a more in-depth look at how the Udacity environment is set up before the experiment:

* Udacity courses currently have two options on the course overview page: "Start Free Trial", and "Access Course Materials".
* If the student clicks the "Start Free Trial" button, they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course.
* If the student clicks "Access Course Materials", they will be able to view the videos and take the quizzes for free, but they will not receive any bonuses, like coaching services or earning the certificate of the course.

### Description of the Experiment

* In the experiment, Udacity tested a change where if the student clicked the "Start Free Trial" button, they were asked how much time they had available to devote to the course.
* If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. 
* If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. 
* At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead.

### Hypothesis Testing

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

### Unit of Diversion

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

## Experiment Design
### Metric Choice

The invariant metrics are those that should not differ between the experiment and control groups. They serve as a sanity check after running the experiment, to detect any problem in the process.

The evaluation metrics are the target metrics that are expected to change between groups and that are relevant to the business goals. Each metric has a $D_{min}$, the minimum change that is practically significant to the business. Udacity provides this figure for each metric.

#### Invariant Metrics (Sanity Check)

For this case, the following metrics will be chosen as invariant metrics:

* Number of cookies on the course overview page per day. Cookies are assigned to a group before they see anything the experiment changes, so this count should be roughly the same in both groups during the experiment. As the cookie is our unit of diversion, a clear imbalance would point to a problem with the diversion itself. It will be represented by $Cookies$ and it has a $D_{min} = 3,000$.

* Number of clicks on the 'Start Free Trial' button per day. As this click happens before the Free Trial Screener message appears, this metric shouldn't be affected by the experiment. It will be represented by $Clicks$ and it has a $D_{min} = 240$.

* Click-through probability (CTP) on the 'Start Free Trial' button per day. This metric is the ratio of the number of clicks on the 'Start Free Trial' button to the number of cookies on the page. If those two metrics don't change, the CTP shouldn't change either. It will be represented by $\frac{Clicks}{Cookies}$ and it has a $D_{min} = 0.01$.

#### Evaluation Metrics

The metrics that will be used as evaluation metrics are:

* Gross Conversion. The number of user IDs that complete the checkout and enroll in the free trial, divided by the number of unique cookies that click the button. It will be represented by $\frac{\text{User IDs enrolled}}{\text{Clicks}}$ and it has a $D_{min} = 0.01$.

* Retention. The number of user IDs that remain enrolled and make at least one payment, divided by the number of unique user IDs that complete the checkout. It will be represented by $\frac{\text{User IDs paid}}{\text{User IDs enrolled}}$ and it has a $D_{min} = 0.01$.

* Net Conversion. The number of user IDs that remain enrolled and make at least one payment, divided by the number of unique cookies that click the button. It will be represented by $\frac{\text{User IDs paid}}{\text{Clicks}}$ and it has a $D_{min} = 0.0075$.

These three metrics are expected to change because they are measured after the Free Trial Screener message appears. They are also relevant to the business because they measure lower-funnel performance and retention.

### Measuring Standard Deviation

Udacity provides the following rough estimates for these metrics, probably measured with a daily aggregation. This is the baseline for each of the metrics:

* Unique cookies to view course overview page per day: 40,000
* Unique cookies to click "Start free trial" per day: 3,200
* Enrollments per day: 660
* Click-through-probability on "Start free trial": 0.08
* Probability of enrolling, given click (Gross Conversion): 0.20625
* Probability of payment, given enroll (Retention): 0.53
* Probability of payment, given click (Net Conversion): 0.1093125
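
The baseline probabilities above are consistent with the baseline counts, which can be checked with a short sketch. The variable names are illustrative; the numbers are the ones listed above.

```python
# Baseline figures quoted in the list above
cookies_per_day = 40_000
clicks_per_day = 3_200
enrollments_per_day = 660
retention = 0.53  # probability of payment, given enroll

# Click-through probability: clicks divided by cookies
ctp = clicks_per_day / cookies_per_day

# Gross Conversion: enrollments divided by clicks
gross_conversion = enrollments_per_day / clicks_per_day

# Net Conversion: P(pay | click) = P(enroll | click) * P(pay | enroll)
net_conversion = gross_conversion * retention

print(ctp, gross_conversion, round(net_conversion, 7))
```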

Now we need to calculate the standard deviation of each evaluation metric. This step shows whether a metric is highly variable or more robust. On the one hand, the more variable a metric is, the harder it is to get significant results. On the other hand, if the metric is too robust, it may be too insensitive to capture the change we are looking for.

Udacity assumes a sample size of 5,000 cookies visiting the course overview page per day. As the baseline counts above are based on 40,000 cookies, we need to rescale them to a sample size of 5,000 cookies.

```python
# Create a dictionary with the baseline estimates
baseline = {
    "Total Cookies": 40000,
    "Total Clicks": 3200,
    "Total Enrollments": 660,
    "CTP": 0.08,
    "Gross Conversion": 0.20625,
    "Retention": 0.53,
    "Net Conversion": 0.1093125,
}

# Copy the baseline so that the adjustments do not overwrite it
# (a plain assignment would only create a second name for the same dict)
sample_adjusted = baseline.copy()

# Define the original and the adjusted sample sizes
n = 40000
n_adjusted = 5000

# Scale the count estimates from a sample size of 40,000 to 5,000
sample_adjusted["Total Cookies"] = n_adjusted
sample_adjusted["Total Clicks"] = n_adjusted * baseline["Total Clicks"] / n
sample_adjusted["Total Enrollments"] = n_adjusted * baseline["Total Enrollments"] / n
sample_adjusted
```

```
{'Total Cookies': 5000,
 'Total Clicks': 400.0,
 'Total Enrollments': 82.5,
 'CTP': 0.08,
 'Gross Conversion': 0.20625,
 'Retention': 0.53,
 'Net Conversion': 0.1093125}
```

As our evaluation metrics are probabilities, we can assume each of them follows a binomial distribution, which is well approximated by a normal distribution when there are enough samples. Based on this, the standard deviation can be calculated with the following formula:

$$SD = \sqrt{\frac{p'(1-p')}{N}}$$

This assumption is reasonable for Gross Conversion and Net Conversion because the unit of diversion is the same as the unit of analysis (the metric's denominator). In both cases the unit of diversion is the cookie, and so is the unit of analysis (the number of cookies that clicked). We can therefore expect the analytical estimates to be accurate.

However, for Retention the unit of analysis differs from the unit of diversion: the unit of diversion is the cookie, but the unit of analysis is the user ID. For this reason, the analytical estimate might not match the empirical one, and it would be worth calculating it empirically too.

```python
# Import libraries
import numpy as np
import pandas as pd
import scipy.stats as stats

# Analytical standard deviation of a proportion p measured on n units
def sd(p, n):
    return round(np.sqrt(p * (1 - p) / n), 4)

# Create one dictionary per evaluation metric
gross_conversion = {}
retention = {}
net_conversion = {}

# Add the Dmin of each metric
gross_conversion["d_min"] = 0.01
retention["d_min"] = 0.01
net_conversion["d_min"] = 0.0075

# Set p (baseline probability) and n (denominator count) for each metric
gross_conversion["p"] = sample_adjusted["Gross Conversion"]
gross_conversion["n"] = sample_adjusted["Total Clicks"]

retention["p"] = sample_adjusted["Retention"]
retention["n"] = sample_adjusted["Total Enrollments"]

net_conversion["p"] = sample_adjusted["Net Conversion"]
net_conversion["n"] = sample_adjusted["Total Clicks"]

# Use the function to get the standard deviation of each metric
print(sd(gross_conversion["p"], gross_conversion["n"]))
print(sd(retention["p"], retention["n"]))
print(sd(net_conversion["p"], net_conversion["n"]))
```

```
0.0202
0.0549
0.0156
```


### Sizing
#### Number of Samples vs. Power

First of all, we are going to use the formula Evan Miller used on his online calculator to calculate the sample size:

$$n = \frac{\left(Z_{1-\frac{\alpha}{2}} \cdot sd_{1}+Z_{1-\beta} \cdot sd_{2}\right)^2}{d^2}$$

$$sd_{1} = \sqrt{2p(1-p)}$$
$$sd_{2} = \sqrt{p(1-p)+(p+d)(1-p-d)}$$

Where:

* $p$ is the baseline conversion rate.
* $d$ is the minimum detectable change ($D_{min}$).
* $\alpha$ is the significance level.
* $\beta$ is the type II error rate, so $1-\beta$ is the statistical power.
* $Z_{1-\frac{\alpha}{2}}$ is the z-score at the $1-\frac{\alpha}{2}$ quantile of the standard normal distribution.
* $Z_{1-\beta}$ is the z-score at the $1-\beta$ quantile of the standard normal distribution.

During the whole experiment, $\alpha = 0.05$ and $\beta = 0.20$ (a power of $0.80$). The corresponding z-scores are $Z_{1-\frac{0.05}{2}} = 1.959963985$ and $Z_{1-0.20} = 0.841621234$.
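
The two z-score magnitudes quoted above can be reproduced from the standard normal quantile function. The following is a minimal sketch using only the Python standard library, assuming a two-sided significance level of 0.05 and a power of 0.80; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05  # significance level (two-sided)
beta = 0.20   # type II error rate, so power = 1 - beta = 0.80

# Quantiles of the standard normal distribution
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
z_beta = NormalDist().inv_cdf(1 - beta)

print(round(z_alpha, 6), round(z_beta, 6))
```

By symmetry of the normal distribution, the quantiles at $\frac{\alpha}{2}$ and $\beta$ are the same numbers with a negative sign.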

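```python
# Create a function to calculate the sample size per group
def sample_size(p, delta):
    # The variance p * (1 - p) is symmetric around 0.5, so work with the smaller side
    if p > 0.5:
        p = 1.0 - p

    z_a = 1.959963985  # z-score for alpha = 0.05 (two-sided)
    z_b = 0.841621234  # z-score for a power of 0.80

    sd1 = np.sqrt(2 * p * (1.0 - p))
    sd2 = np.sqrt(p * (1.0 - p) + (p + delta) * (1.0 - p - delta))

    return round((z_a * sd1 + z_b * sd2) * (z_a * sd1 + z_b * sd2) / (delta * delta), 0)
```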

##### Gross Conversion

For Gross Conversion, we will need at least 25,835 cookies that click the 'Start Free Trial' button per group.

```python
gross_conversion['sample_size'] = sample_size(gross_conversion['p'], gross_conversion['d_min'])
gross_conversion['sample_size']
```

```
25835.0
```

Now we need to estimate the number of pageviews needed to obtain those 25,835 clicking cookies per group. To do so, we use the ratio of clicks to pageviews: $400/5000 = 0.08$. We divide the sample size by this ratio and multiply by two, as there are two groups in the experiment:

```python
# Convert clicks per group into total pageviews for both groups
gross_conversion['sample_size'] = (gross_conversion['sample_size']/0.08)*2
gross_conversion['sample_size']
```

```
645875.0
```

In total, we would need 645,875 pageviews across both groups.

##### Retention

For Retention, we will need at least 39,115 enrolled users per group.

```python
retention['sample_size'] = sample_size(retention['p'], retention['d_min'])
retention['sample_size']
```

```
39115.0
```

Now we divide this result by the Gross Conversion (0.20625) to find how many cookies need to click the 'Start Free Trial' button, then by 0.08 to find how many cookies need to view the course overview page, and finally multiply by two for the two groups:

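```python
# Enrollments -> clicks (divide by Gross Conversion) -> pageviews (divide by CTP), for both groups
retention['sample_size'] = ((retention['sample_size']/0.08)/gross_conversion['p'])*2
retention['sample_size']
```

```
4741212.121212121
```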

This means we would need about 4.74 million pageviews. This number is very high given that Udacity attracts 40,000 cookies per day: the experiment would need to run for about 119 days even with all traffic diverted to it. For this reason, we drop this metric from the experiment.
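
The duration can be checked with a short calculation. This sketch assumes that 100% of the 40,000 daily cookies are diverted to the experiment, which is the best case; the variable names are illustrative.

```python
import math

pageviews_needed = 4_741_212  # total pageviews required for Retention (both groups)
cookies_per_day = 40_000      # baseline unique cookies per day

# Round up, since a partial day still has to be run in full
days_full_traffic = math.ceil(pageviews_needed / cookies_per_day)
print(days_full_traffic)  # 119
```

That is roughly four months of full traffic, and any smaller traffic fraction would stretch the experiment further.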

##### Net Conversion

For Net Conversion, we will need at least 27,413 cookies that click the button per group.

```python
net_conversion['sample_size'] = sample_size(net_conversion['p'], net_conversion['d_min'])
net_conversion['sample_size']
```

```
27413.0
```

Now we calculate how many pageviews we need by dividing by 0.08 and multiplying by two for the two groups. This gives 685,325 pageviews.

```python
# Convert clicks per group into total pageviews for both groups
net_conversion['sample_size'] = ((net_conversion['sample_size']/0.08))*2
net_conversion['sample_size']
```

```
685325.0
```

As this number of pageviews is larger than the one needed for Gross Conversion, it is going to be our sample size.

#### Duration vs. Exposure

Indicate what fraction of traffic you would divert to this experiment and, given this, how many days you would need to run the experiment. (These should be the answers from the "Choosing Duration and Exposure" quiz.)

Give your reasoning for the fraction you chose to divert. How risky do you think this experiment would be for Udacity?
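
As a rough aid for this decision, the sketch below converts the 685,325 pageviews from the Net Conversion sizing into a duration for a few traffic fractions. The fractions shown are illustrative assumptions, not a recommendation, and `days_required` is a hypothetical helper.

```python
import math

pageviews_needed = 685_325  # total pageviews from the Net Conversion sizing
cookies_per_day = 40_000    # baseline unique cookies per day

def days_required(fraction):
    """Days needed when `fraction` of the daily traffic enters the experiment."""
    return math.ceil(pageviews_needed / (cookies_per_day * fraction))

# Illustrative fractions only
for fraction in (1.0, 0.75, 0.5):
    print(fraction, days_required(fraction))
```

Diverting less traffic limits exposure to a possibly harmful change, at the cost of a longer experiment.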

## Experiment Analysis
### Sanity Checks

For each of your invariant metrics, give the 95% confidence interval for the value you expect to observe, the actual observed value, and whether the metric passes your sanity check. (These should be the answers from the "Sanity Checks" quiz.)

For any sanity check that did not pass, explain your best guess as to what went wrong based on the day-by-day data. Do not proceed to the rest of the analysis unless all sanity checks pass.
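
One common way to run this check for a count metric such as cookies or clicks is to test whether the share of units in the control group is compatible with an even split. The following is a minimal sketch under that assumption; `count_sanity_check` is a hypothetical helper, and the counts in the usage line are made up for illustration, not experiment results.

```python
import math

def count_sanity_check(n_control, n_experiment, z=1.959963985):
    """Return (lower, upper, observed, passed) for a 50/50 split check."""
    total = n_control + n_experiment
    # Standard error of a proportion with expected value 0.5
    se = math.sqrt(0.5 * 0.5 / total)
    lower, upper = 0.5 - z * se, 0.5 + z * se
    observed = n_control / total
    return lower, upper, observed, lower <= observed <= upper

# Made-up counts, for illustration only
print(count_sanity_check(1000, 1010))
```

The metric passes when the observed control share falls inside the 95% interval around 0.5.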

### Result Analysis
#### Effect Size Tests

For each of your evaluation metrics, give a 95% confidence interval around the difference between the experiment and control groups. Indicate whether each metric is statistically and practically significant. (These should be the answers from the "Effect Size Tests" quiz.)
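
A confidence interval for the difference of two proportions is often built with a pooled standard error. The sketch below assumes that approach and one common convention for practical significance (the whole interval lies beyond $D_{min}$); both function names are hypothetical, and no experiment data is used.

```python
import math

def diff_confidence_interval(x_cont, n_cont, x_exp, n_exp, z=1.959963985):
    """95% CI for (experiment rate - control rate) using a pooled standard error."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    diff = x_exp / n_exp - x_cont / n_cont
    return diff - z * se_pool, diff + z * se_pool

def significance(lower, upper, d_min):
    """Return (statistically significant, practically significant)."""
    statistically = not (lower <= 0 <= upper)
    # Assumed convention: the entire interval must lie beyond +/- d_min
    practically = statistically and (lower > d_min or upper < -d_min)
    return statistically, practically
```

Here `x_cont` and `x_exp` are the numerator counts (for example enrollments) and `n_cont` and `n_exp` the denominator counts (for example clicks) in each group.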

#### Sign Tests

For each of your evaluation metrics, do a sign test using the day-by-day data, and report the p-value of the sign test and whether the result is statistically significant. (These should be the answers from the "Sign Tests" quiz.)
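
A two-sided sign test can be computed directly from the binomial distribution with $p = 0.5$. The following is a minimal sketch using only the standard library; `sign_test_p_value` is a hypothetical helper, where `successes` is the number of days on which the experiment group's metric exceeded the control group's.

```python
from math import comb

def sign_test_p_value(successes, n_days):
    """Two-sided sign test p-value under the null hypothesis P(success) = 0.5."""
    # Use the smaller tail and double it, since the null distribution is symmetric
    k = min(successes, n_days - successes)
    tail = sum(comb(n_days, i) for i in range(k + 1)) / 2 ** n_days
    return min(1.0, 2 * tail)
```

The result is statistically significant when the returned p-value is below the chosen $\alpha$.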

#### Summary

State whether you used the Bonferroni correction, and explain why or why not. If there are any discrepancies between the effect size hypothesis tests and the sign tests, describe the discrepancy and why you think it arose.

### Recommendation

Make a recommendation and briefly describe your reasoning.

## Follow-Up Experiment

Give a high-level description of the follow up experiment you would run, what your hypothesis would be, what metrics you would want to measure, what your unit of diversion would be, and your reasoning for these choices.
