# A/B Testing on Udacity’s Plan to Reduce Early Course Cancellation

Table of contents

1. Description


2. Experiment Design
 
 2.1 Unit of Diversion
 
 2.2 Invariant Metrics
 
 2.3 Evaluation Metrics
 
 2.4 Measuring Variability
 
 2.5 Experiment sizing
 
 
3. Experiment analysis
  
  3.1 Sanity Checks
  
  3.2 Effect Check
  
  3.3 Sign test


4. Conclusion and follow-ups






## 1. Description

Udacity, an online learning platform, is considering adding a feature to its website.

Currently, the user funnel goes like this: 

course overview --> start free trial --> enroll with payment info --> pay
             
              --> access course materials

After the course overview, a user can choose either to start a free trial or to access the course materials. 

If choosing free trial, the user will be asked to enter their payment information and will have a 14-day trial period, free of charge. During this period, the user will have access to all resources and services as a paid user, including coaching support and feedback on projects. After the trial period, the user will have to pay for the same resources and services. 

If choosing to access course materials, the user can take the courses for free, but will not have the resources and services that a paid user has. 

Now Udacity intends to add a feature by asking the users how much time per week they can work on the course right after they click the "start free trial" button. If the answer is 5 or more hours, they can move forward to the next step of enrollment; otherwise, a message would appear, advising the users to consider "access course materials", though they can still choose to enroll.

The purpose of the change is to discourage users who do not have enough time from starting the free trial, and thus to save resources and reduce costs.

## 2. Experiment Design

### 2.1 Unit of Diversion

The unit of diversion is a cookie.

### 2.2 Invariant Metrics

Invariant metrics are expected to be the same between the experiment and control groups. They are designed to make sure that data points are distributed evenly between the experiment and control groups.

In this case, they are

1. Number of cookies: number of unique cookies to view the course overview page. 

2. Number of clicks: number of unique cookies to click the "Start free trial" button (which happens before the free trial screen is triggered).

3. Click-through-probability: number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. 

These three metrics are measured before the time commitment question is asked. If traffic is diverted to the experiment and control groups without bias, their values should not be significantly different between the two groups. 

### 2.3 Evaluation Metrics

Evaluation metrics are chosen to find out if the experiment will bring in the expected changes.

The expectation is to reduce the number of frustrated students who leave the free trial because they don't have enough time—without significantly reducing the number of students to continue past the free trial and eventually make the payment. 

The evaluation metrics are

1. Gross conversion - enrollments/clicks. 

This metric is expected to decrease significantly, as the experiment is designed to discourage students who are unlikely to commit to and complete the course from starting the free trial.

2. Retention - payments/enrollments 

This metric is expected to rise significantly, since enrollments are expected to decrease while payments should stay roughly the same.

3. Net conversion - payments/clicks

Since we hope the experiment achieves its goal without significantly reducing the number of users who continue past the free trial, the value of this metric is expected to stay the same or, even better, to go higher.
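To make the three definitions concrete, here is a minimal sketch that computes them from raw counts. The function name `funnel_metrics` is a hypothetical helper, and the payment count of 350 is an assumed round number used only for illustration; the click and enrollment counts are the daily baseline values.

```python
def funnel_metrics(clicks, enrollments, payments):
    """Return the three evaluation metrics as a dict of probabilities."""
    return {
        "gross_conversion": enrollments / clicks,   # enrollments / clicks
        "retention": payments / enrollments,        # payments / enrollments
        "net_conversion": payments / clicks,        # payments / clicks
    }

# clicks and enrollments are the daily baseline counts; payments = 350 is a
# hypothetical round number chosen only for illustration.
example = funnel_metrics(clicks=3200, enrollments=660, payments=350)
print(example)
```

Note that net conversion is the product of gross conversion and retention, so any two of the three metrics determine the third.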


### 2.4 Measuring Variability

In [7]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [8]:
#Baseline values provided by Udacity
baseline = pd.read_csv('Final Project Baseline Values - Sheet1.csv', header=None, names=['matrics', 'values'])
baseline.set_index(pd.Index(['cookies','clicks','enrollments','CTR','gross conversion','retention','net conversion']), inplace=True)

In [9]:
baseline

Unnamed: 0,matrics,values
cookies,Unique cookies to view course overview page pe...,40000.0
clicks,"Unique cookies to click ""Start free trial"" per...",3200.0
enrollments,Enrollments per day:,660.0
CTR,"Click-through-probability on ""Start free trial"":",0.08
gross conversion,"Probability of enrolling, given click:",0.20625
retention,"Probability of payment, given enroll:",0.53
net conversion,"Probability of payment, given click",0.109313


Assuming a sample size of 5000 cookies visiting the course overview page per day and based on the baseline values provided, we are going to calculate the standard deviations of the evaluation metrics.

To calculate the standard deviations, we scale the counts accordingly: a sample of 5000 cookies viewing the overview page corresponds to a scaling factor of 5000/40000 = 0.125.

In [10]:
baseline.iloc[0, 1] = 5000
baseline.iloc[1, 1] = 3200*0.125
baseline.iloc[2, 1] = 660 * 0.125

In [11]:
baseline

Unnamed: 0,matrics,values
cookies,Unique cookies to view course overview page pe...,5000.0
clicks,"Unique cookies to click ""Start free trial"" per...",400.0
enrollments,Enrollments per day:,82.5
CTR,"Click-through-probability on ""Start free trial"":",0.08
gross conversion,"Probability of enrolling, given click:",0.20625
retention,"Probability of payment, given enroll:",0.53
net conversion,"Probability of payment, given click",0.109313


#### Standard deviation

For the three evaluation metrics, we can estimate the variance analytically if the unit of diversion of the experiment is the same as the unit of analysis and we can further assume a binomial distribution.
For gross conversion and net conversion, the unit of diversion (cookie) equals the unit of analysis. For retention this is not the case, since its denominator is enrollments, and an empirical estimate of variability is recommended.

For a binomial distribution, the standard deviation of the sample proportion is $\sqrt{p(1-p)/n}$.
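The formula can be wrapped in a small helper. This is a minimal sketch using only the standard library; `binomial_sd` is a hypothetical name, and the inputs are the baseline probabilities together with the counts scaled to 5000 cookies (400 clicks, 82.5 enrollments).

```python
import math

def binomial_sd(p, n):
    """Analytic standard deviation of a sample proportion with n trials."""
    if not 0 <= p <= 1 or n <= 0:
        raise ValueError("p must be in [0, 1] and n must be positive")
    return math.sqrt(p * (1 - p) / n)

# Baseline probabilities and the counts scaled to 5000 cookies.
sd_gross = round(binomial_sd(0.20625, 400), 4)     # gross conversion
sd_retention = round(binomial_sd(0.53, 82.5), 4)   # retention
sd_net = round(binomial_sd(0.109313, 400), 4)      # net conversion
print(sd_gross, sd_retention, sd_net)
```

Because p(1-p) is symmetric in p and 1-p, passing either the probability or its complement gives the same standard deviation.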

#### Gross Conversion

In [6]:
p = baseline.iloc[4,1]  # baseline probability of enrolling, given click
sd = round(np.sqrt(p*(1-p)/400), 4)

In [7]:
sd

0.0202

#### Retention 

In [8]:

p = baseline.iloc[5,1]  # baseline probability of payment, given enroll
sd = round(np.sqrt(p*(1-p)/82.5), 4)

In [9]:
sd

0.0549

#### Net conversion

In [13]:
p = baseline.iloc[6,1]  # baseline probability of payment, given click
sd = round(np.sqrt(p*(1-p)/400), 4)
sd

0.0156

In [14]:
metrics = pd.DataFrame(baseline.iloc[4:,1])
metrics['standard deviation'] = [0.0202, 0.0549, 0.0156]
metrics

Unnamed: 0,values,standard deviation
gross conversion,0.20625,0.0202
retention,0.53,0.0549
net conversion,0.109313,0.0156


### 2.5 Experiment sizing 

We are going to estimate the size of the experiment needed for it to have both statistical significance and statistical power, based on the type I error rate (α, the significance level) and the type II error rate (β, where power = 1 - β).


<center> <font size="5"> $n = \frac{(Z_{1-\frac{\alpha}{2}}sd_1 + Z_{1-\beta}sd_2)^2}{d^2}$</font>, with: <br><br>
$sd_1 = \sqrt{p(1-p)+p(1-p)}$<br><br>
$sd_2 = \sqrt{p(1-p)+(p+d)(1-(p+d))}$ </center><br>

In [15]:
from scipy.stats import norm
def sample_size(alpha, beta, p, d):
    z_alpha = norm.ppf(1-alpha/2)
    z_beta = norm.ppf(1-beta)
    sd1 = np.sqrt(p*(1-p) + p*(1-p))
    sd2 = np.sqrt(p*(1-p) + (p+d)*(1-(p+d)))
    return round((z_alpha * sd1 + z_beta * sd2)**2/d**2)

#### Gross Conversion 

In [16]:
p = metrics.iloc[0,0]
d = 0.01
alpha = 0.05
beta = 0.2
size_gross_coversion = sample_size(alpha, beta, p, d)
print('Number of clicks per group:', size_gross_coversion)
print('Number of cookies in total:', 2*size_gross_coversion /(0.08))

Number of clicks per group: 25835
Number of cookies in total: 645875.0


With 40,000 pageviews per day, we will need 645875/40000 ≈ 16.1, rounded up to 17 days, to run the experiment.
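As a sanity check on the sample size for gross conversion, the sizing formula can be inverted to give the approximate power achieved by a given number of clicks per group. This is a minimal sketch using only the standard library; `achieved_power` is a hypothetical helper, and it ignores the negligible probability of rejecting in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist

def achieved_power(n, p, d, alpha=0.05):
    """Approximate power of a two-sided two-proportion test with n units per group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    sd1 = sqrt(2 * p * (1 - p))                        # SD term under H0
    sd2 = sqrt(p * (1 - p) + (p + d) * (1 - (p + d)))  # SD term under H1
    # Solve sqrt(n) * d = z_alpha * sd1 + z_beta * sd2 for z_beta, then map to power.
    return NormalDist().cdf((d * sqrt(n) - z_alpha * sd1) / sd2)

power = achieved_power(n=25835, p=0.20625, d=0.01)
print(round(power, 3))
```

With 25,835 clicks per group the result should sit very close to the 80% target; halving the sample size lowers it noticeably.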

#### Retention 

In [17]:
p = metrics.iloc[1,0]
d = 0.01
alpha = 0.05
beta = 0.2
size_retention = sample_size(alpha, beta, p, d)
print('Number of enrollments per group:', size_retention)
print('Number of cookies in total:', 2*size_retention/(0.20625*0.08))

Number of enrollments per group: 39087
Number of cookies in total: 4737818.181818182


With 40,000 pageviews per day, we will need 4737818/40000 ≈ 118.4, rounded up to 119 days, to run the experiment. That is too long to be feasible, so we drop this metric. 

#### Net conversion 

In [18]:
p = metrics.iloc[2,0]
d = 0.0075
alpha = 0.05
beta = 0.2
size_net_conversion = sample_size(alpha, beta, p, d)
print('Number of clicks per group:', size_net_conversion)
print('Number of cookies in total:', 2*size_net_conversion/0.08)

Number of clicks per group: 27413
Number of cookies in total: 685325.0


With 40,000 pageviews per day, we will need 685325/40000 ≈ 17.1, rounded up to 18 days, to run the experiment.

Net conversion requires more pageviews than gross conversion, so it sets the duration: 18 days if all 40,000 daily pageviews are diverted to the experiment. If we divert only 80% of the daily pageviews, we would need 685325/(40,000 * 0.8) ≈ 21.4, rounded up to 22 days.
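The conversion from a per-group sample size to pageviews and days can be sketched as follows. The helper names `cookies_needed` and `days_needed` are hypothetical; the inputs are the sample sizes and baseline rates used in this section.

```python
import math

def cookies_needed(n_per_group, units_per_cookie):
    """Total pageviews (cookies) across both groups for n units per group."""
    return 2 * n_per_group / units_per_cookie

def days_needed(total_cookies, daily_pageviews=40000, traffic_fraction=1.0):
    """Whole days required when a fraction of daily traffic is diverted."""
    return math.ceil(total_cookies / (daily_pageviews * traffic_fraction))

net_cookies = cookies_needed(27413, 0.08)            # clicks per cookie = 0.08
days_full = days_needed(net_cookies)                 # all traffic diverted
days_80 = days_needed(net_cookies, traffic_fraction=0.8)
print(net_cookies, days_full, days_80)
```

Rounding up with `math.ceil` reflects that a partial day still requires running the experiment for that whole day.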

#### Online calculator 

We can also find the sample size using an online calculator: https://www.evanmiller.org/ab-testing/sample-size.html

Gross conversion

Baseline value: 0.206250

Minimum detectable effect (absolute): 0.01

Significance level ${\alpha}$: 0.05

Statistical power $1-{\beta}$: 0.8 (${\beta}$ = 0.2)

Sample size: 25,835 clicks in each of the experiment and control groups

Number of cookies: 2 * 25835 * 40000 / 3200 = 645,875

Retention

Baseline value: 0.53

Minimum detectable effect (absolute): 0.01

Significance level ${\alpha}$: 0.05

Statistical power $1-{\beta}$: 0.8 (${\beta}$ = 0.2)

Sample size: 39,115 enrollments in each of the experiment and control groups

Number of cookies: 2 * 39115 * 40000 / 660 = 4,741,212

Net conversion

Baseline value: 0.109313

Minimum detectable effect (absolute): 0.0075

Significance level ${\alpha}$: 0.05

Statistical power $1-{\beta}$: 0.8 (${\beta}$ = 0.2)

Sample size: 27,413 clicks in each of the experiment and control groups

Number of cookies: 2 * 27413 * 40000 / 3200 = 685,325

## 3. Experiment Analysis 

In [19]:
control = pd.read_csv('control_data.xls')
experiment = pd.read_csv('experiment_data.xls')

In [20]:
control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [21]:
experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


### 3.1 Sanity Checks

Before finding out the effect of the experiment, we need to make sure that the experiment is not biased between the control group and the experiment group by checking if the invariant metrics are equal between the two groups. 

The three invariant metrics are:

1.Number of cookies

2.Number of clicks

3.Click through probability

We can use either hypothesis testing or a confidence interval for the sanity checks.

#### Number of cookies

It is a one-proportion problem.

In [19]:
cookies_control = control['Pageviews'].sum()
cookies_experiment = experiment['Pageviews'].sum()
cookies_total = cookies_control + cookies_experiment
proportion = cookies_experiment/cookies_total

Hypothesis testing

$H_{0}$: proportion = 0.5

$H_{1}$: proportion != 0.5

According to the central limit theorem, under the null hypothesis the sample proportion is approximately normally distributed, N(p, p(1-p)/n), where p = 0.5 and n = cookies_total.

In [20]:
z_score = (proportion - 0.5)/np.sqrt(0.5*(1-0.5)/cookies_total)
z_score

-1.0628507473598006

In [21]:
p_value = 2 * norm.sf(abs(z_score))

In [22]:
p_value

0.2878496417065868

Since p-value is larger than the significance level, we cannot reject the null hypothesis that the proportion is 0.5. 

Confidence interval

In [23]:
margin_of_error = norm.ppf(1-0.05/2) * np.sqrt(0.5*(1-0.5)/cookies_total)

In [24]:
ci = 0.5 - margin_of_error, 0.5 + margin_of_error
ci

(0.4988204138245942, 0.5011795861754058)

Since the actual proportion is within the confidence interval, we can conclude that it is not significantly different from 0.5.

From both the hypothesis test and the confidence interval, we can conclude that the number of cookies is not significantly different between the experiment group and the control group.
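The same check applies to any count-based invariant metric, so it can be packaged as one function. This is a minimal sketch using only the standard library; `split_sanity_check` is a hypothetical helper, and the counts in the example are made up for illustration rather than taken from the experiment data.

```python
from math import sqrt
from statistics import NormalDist

def split_sanity_check(n_control, n_experiment, alpha=0.05):
    """Test whether the experiment group's share of units is consistent with 0.5."""
    total = n_control + n_experiment
    observed = n_experiment / total
    se = sqrt(0.5 * 0.5 / total)                  # standard error under H0: share = 0.5
    z = (observed - 0.5) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    ci = (0.5 - margin, 0.5 + margin)
    return {"observed": observed, "z": z, "p_value": p_value,
            "ci": ci, "passes": ci[0] <= observed <= ci[1]}

# Hypothetical counts, not taken from the experiment data.
result = split_sanity_check(n_control=5000, n_experiment=5100)
print(result)
```

The hypothesis test and the confidence interval always agree here, because both use the same standard error under the null hypothesis.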

#### Number of clicks

In [25]:
clicks_control = control['Clicks'].sum()
clicks_experiment = experiment['Clicks'].sum()
clicks_total = clicks_control + clicks_experiment
proportion = clicks_experiment/clicks_total

Hypothesis testing

$H_{0}$: proportion = 0.5

$H_{1}$: proportion != 0.5

Similarly, under the null hypothesis the sample proportion is approximately normally distributed, N(p, p(1-p)/n), where p = 0.5 and n = clicks_total.

In [26]:
z_score = (proportion - 0.5)/np.sqrt(0.5*(1-0.5)/clicks_total)
z_score

-0.2225731904481107

In [27]:
p_value = 2 * norm.sf(abs(z_score))
p_value

0.8238677039815487

Since p-value is larger than the significance level, we cannot reject the null hypothesis that the proportion is 0.5.

Confidence Interval

In [28]:
margin_of_error = norm.ppf(1-0.05/2) * np.sqrt(0.5*(1-0.5)/clicks_total)

In [29]:
ci = (0.5 - margin_of_error, 0.5 + margin_of_error)
ci

(0.4958845713471463, 0.5041154286528536)

In [30]:
proportion

0.49953265259333723

Since the actual proportion is within the confidence interval, we can conclude that it is not significantly different from 0.5.

From both the hypothesis testing and the confidence interval, we can conclude that the number of clicks is not significantly different between the experiment group and the control group.

#### Click Through Probability

This is a two-proportion problem. 

For the experiment to be valid, we expect no difference in the click-through probability between the experiment and control groups.

Under that null hypothesis both groups share the same click-through probability, and hence the same variance, so we can use a pooled standard deviation. 

In [34]:
ctp_control = clicks_control/cookies_control
ctp_experiment = clicks_experiment/cookies_experiment

Hypothesis testing

$H_{0}$: ctp_control - ctp_experiment = 0

$H_{1}$: ctp_control - ctp_experiment != 0

In [36]:
ctp_pooled = clicks_total/cookies_total
sd_pooled = np.sqrt(ctp_pooled*(1-ctp_pooled)*(1/cookies_control + 1/cookies_experiment))
z_score = (ctp_control - ctp_experiment)/sd_pooled

In [38]:
p_value = 2*norm.sf(abs(z_score))

In [39]:
p_value

0.9317359524473912

Since the p_value is larger than the significance level of 0.05, we cannot reject the null hypothesis that the click-through probability is the same between the two groups.

Confidence Interval

In [40]:
margin_of_error = norm.ppf(1-0.05/2) * sd_pooled

In [41]:
ci = 0 - margin_of_error, 0 + margin_of_error

In [42]:
ci

(-0.001295655390242568, 0.001295655390242568)

In [43]:
ctp_control - ctp_experiment

-5.662709158693602e-05

The actual difference in click-through probability between the two groups is within the confidence interval at a significance level of 0.05.

Both the hypothesis testing and confidence interval show that the click through probability is not significantly different between the two groups.
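The pooled two-proportion test can be sketched as a reusable function. This is a minimal standard-library sketch; `pooled_two_proportion_test` is a hypothetical helper, and the counts in the example are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def pooled_two_proportion_test(x_control, n_control, x_experiment, n_experiment):
    """Two-sided z-test for a difference in proportions using a pooled SE."""
    p_control = x_control / n_control
    p_experiment = x_experiment / n_experiment
    p_pooled = (x_control + x_experiment) / (n_control + n_experiment)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_experiment))
    diff = p_experiment - p_control
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, se, z, p_value

# Hypothetical counts: 800 of 10000 cookies click in control, 790 of 10000 in experiment.
diff, se, z, p_value = pooled_two_proportion_test(800, 10000, 790, 10000)
print(diff, se, z, p_value)
```

The sign of `diff` depends on the order of subtraction (experiment minus control here); the two-sided p-value does not.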

So far, we've confirmed that the experiment was run validly: all three invariant metrics pass the sanity checks. 

### 3.2 Effect Check

We now check whether the change has the expected effect. Gross conversion is expected to be significantly lower in the experiment group, both statistically and practically, while net conversion is expected not to decrease.

#### Gross Conversion 

We expect the gross conversion of the experiment group to be significantly lower than that of the control group, by at least the minimum practical change of 0.01, as the intention is to discourage students who cannot commit from enrolling.

In [52]:
enroll_control = control['Enrollments'].sum()
enroll_experiment = experiment['Enrollments'].sum()
clicks_control = control[control['Enrollments'].notnull()]['Clicks'].sum()
clicks_experiment = experiment[experiment['Enrollments'].notnull()]['Clicks'].sum()

In [53]:
gc_control = enroll_control/ clicks_control

In [54]:
gc_experiment = enroll_experiment/clicks_experiment

In [55]:
gc_pooled = (enroll_control + enroll_experiment)/(clicks_control + clicks_experiment)

In [56]:
sd_pooled = np.sqrt(gc_pooled*(1-gc_pooled)*(1/clicks_control + 1/clicks_experiment))

In [57]:
margin_of_error = norm.ppf(1-0.05/2) * sd_pooled

In [58]:
ci = gc_experiment - gc_control - margin_of_error, gc_experiment - gc_control + margin_of_error
ci

(-0.02912320088750467, -0.011986548273218463)

The upper bound of the confidence interval is below both 0 and the practical significance boundary of -0.01, which means the change is statistically and practically significant in lowering the gross conversion rate. 

#### Net Conversion

We hope the net conversion rate of the experiment group is no lower than, and ideally higher than, that of the control group. The practical significance boundary is 0.0075.

In [67]:
pay_control = control['Payments'].sum()
pay_experiment = experiment['Payments'].sum()

In [68]:
nc_control = pay_control/clicks_control
nc_experiment = pay_experiment/clicks_experiment

In [69]:
nc_pooled = (pay_control + pay_experiment)/(clicks_control+clicks_experiment)

In [70]:
sd_pooled = np.sqrt(nc_pooled * (1-nc_pooled) * (1/clicks_control + 1/clicks_experiment))

In [71]:
margin_of_error = norm.ppf(1-0.05/2) * sd_pooled

In [72]:
ci = nc_experiment - nc_control - margin_of_error, nc_experiment - nc_control + margin_of_error 
ci

(-0.011604500677993734, 0.0018570553289053993)

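The confidence interval contains 0, so the change in net conversion is not statistically significant. Its upper bound (0.0019) is below the practical significance boundary of 0.0075, so a practically significant increase is ruled out, while its lower bound (-0.0116) is below -0.0075, so a practically significant decrease cannot be ruled out.

We therefore cannot conclude that the change improves the net conversion rate. 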
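The decision rule applied to both intervals can be written down explicitly. This is a minimal sketch; `assess_interval` is a hypothetical helper, and the inputs are the two confidence intervals computed in this section, rounded to four decimals.

```python
def assess_interval(ci_lower, ci_upper, d_min):
    """Classify a confidence interval for a difference against 0 and +/- d_min."""
    statistically_significant = ci_lower > 0 or ci_upper < 0
    # Practically significant only if the whole interval lies beyond the boundary.
    practically_significant = ci_lower > d_min or ci_upper < -d_min
    return statistically_significant, practically_significant

# Intervals from this section (rounded), with each metric's practical boundary.
gross = assess_interval(-0.0291, -0.0120, d_min=0.01)    # gross conversion
net = assess_interval(-0.0116, 0.0019, d_min=0.0075)     # net conversion
print(gross, net)
```

Requiring the whole interval to clear the boundary is the conservative reading: an interval that merely overlaps the boundary leaves practical significance undetermined.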

### 3.3 Sign test

In a sign test, we focus on the proportion of the days when the experiment group has a better performance than the control group in terms of the evaluation metrics, which we will call success rate.

We need to confirm if the success rate is higher than 50%.


In [23]:
data = control.merge(experiment, on='Date', how='inner')
data.dropna(inplace=True, axis=0)

In [24]:
data['gc_control'] = data['Enrollments_x']/data['Clicks_x']
data['gc_experiment'] = data['Enrollments_y']/data['Clicks_y']
data['nc_control'] = data['Payments_x']/data['Clicks_x']
data['nc_experiment'] = data['Payments_y']/data['Clicks_y']

In [25]:
data = data[['Date','gc_control', 'gc_experiment','nc_control','nc_experiment']]

In [26]:
n_sample = len(data)
n_sample

23

Since the number of samples is less than 30, we are not going to rely on the central limit theorem.

Instead, we will calculate the p_value from its definition: the probability, under the null hypothesis, of obtaining a result at least as extreme as the one observed.

In [27]:
import math
def get_p_value(n, k, p):
    '''
    n: number of samples
    k: number of successes
    p: success rate under the null hypothesis
    return: two-sided p-value (twice the probability of the tail at least as extreme as k)
    '''
    p_val = 0
    # Sum the tail on the side of the expected count n*p where k falls.
    if k < n*p:
        for i in range(0, k+1):
            p_val += probability(n, i, p)
    else:
        for i in range(k, n+1):
            p_val += probability(n, i, p)
    # Double the one-sided tail for a two-sided test, capped at 1.
    return min(1.0, 2*p_val)


def probability(n, k, p):
    '''
    n: number of samples
    k: number of successes
    p: success rate under the null hypothesis
    return: binomial probability of exactly k successes out of n trials
    '''
    prob = math.comb(n, k) * p**k * (1-p)**(n-k)
    return prob

#### Gross conversion

$H_{0}$: success rate = 0.5

$H_{1}$: success rate != 0.5

In [28]:
success = (data['gc_experiment'] < data['gc_control']).sum()
success

19

In [29]:
p_value = get_p_value(n_sample, success, 0.5)

In [30]:
p_value

0.002599477767944336

The p_value is smaller than the significance level of 0.05, which means the null hypothesis of a 0.5 success rate can be rejected. Therefore, the success rate is significantly different from (in this case, larger than) 0.5.

That is, the change brings a decline in the gross conversion rate.

#### Net Conversion

$H_{0}$: success rate = 0.5

$H_{1}$: success rate != 0.5

In [31]:
success = (data['nc_experiment'] > data['nc_control']).sum()
success

10

In [32]:
p_value = get_p_value(n_sample, success, 0.5)
p_value

0.6776394844055176

The p_value is much higher than the significance level, which means we cannot reject the null hypothesis that the success rate is 0.5. We cannot conclude that the change has had a positive effect on the net conversion rate.

The results of the sign test are consistent with those of the effect checks. The change brings a desirable change to gross conversion but not to net conversion.
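As a cross-check on the two sign-test p-values, the sketch below uses a different two-sided convention: instead of doubling one tail, it sums the probabilities of all outcomes that are no more likely than the observed one. `sign_test_p_value` is a hypothetical helper written with the standard library only. For a null success rate of 0.5 the binomial distribution is symmetric, so both conventions should give the same values.

```python
from math import comb

def sign_test_p_value(n, k, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)   # small tolerance for floating-point ties
    return sum(prob for prob in pmf if prob <= threshold)

p_gross = sign_test_p_value(23, 19)   # gross conversion: 19 of 23 days
p_net = sign_test_p_value(23, 10)     # net conversion: 10 of 23 days
print(p_gross, p_net)
```

For a null success rate other than 0.5 the two conventions can differ, so the choice should be stated alongside any reported p-value.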

## 4. Conclusion and follow-ups 

Adding a feature that discourages users who cannot meet the time commitment significantly decreases the gross conversion rate, in both the statistical and the practical sense, and thus helps save coaching resources and reduce costs. However, we do not know whether the resources saved are statistically and practically significant. On the other hand, the effect on net conversion is not clear and could be negative. This is a risk that we must consider.

Therefore, the change is not recommended. More research and experiments are needed to find a way to save coaching resources. 



