# Experiment Design

## Experiment Overview

The experiment was run on the Learning courses site.

At the time of this experiment, the course overview page offered two options: "start free trial" and "access course materials". If the student clicked "start free trial", they were asked to enter their credit card information and were then enrolled in a free trial for the paid version of the course. After 14 days, they were automatically charged unless they canceled first. If the student clicked "access course materials", they could view the videos and take the quizzes for free, but they did not receive coaching support or a verified certificate, and they could not submit their final project for feedback.


The experiment tested a change in which a student who clicked "start free trial" was asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they were taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message appeared indicating that courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student could either continue enrolling in the free trial or access the course materials for free instead.

### Hypothesis

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course.

If this hypothesis held true, the Learning courses site could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

### A/B testing experiment

<b> Goal </b>

This experiment is designed to determine whether the page change helps to filter out students who don’t have much time to study, without reducing the number of students who go on to pay after completing their free trial.

The experiment has a control group and an experiment group. For the control group, the page is unchanged. Students in the experiment group who click "start free trial" get an additional question about the number of hours per week they are willing to devote to study.

<b> What are the units in the population you are going to run the test on? (unit of diversion) </b>

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

## Choosing and Characterizing Metrics

A/B Testing requires two types of metrics: Invariant Metrics and Evaluation Metrics.

Invariant metrics (for sanity checking): 
- Metrics that shouldn’t change between your test and control groups

Evaluation Metrics:
- Metrics in which we expect to see change, and which are relevant to the business goals. 

### Invariant Metrics

 - <b>Number of cookies</b>: That is, number of unique cookies to view the course overview page. (dmin=3000)
 - <b>Number of clicks</b>: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
 - <b>Click-through-probability (CTP)</b>: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)

### Evaluation Metrics

- <b>Gross conversion</b>: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
- <b>Retention</b>: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
- <b>Net conversion</b>: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)
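The three evaluation metrics above are plain ratios of counts. The following is a minimal sketch of that arithmetic, not code from the original notebook: the function and variable names are hypothetical, and the payments figure is derived here from the 0.53 baseline retention rate purely for illustration.

```python
def evaluation_metrics(clicks, enrollments, payments):
    """Return (gross conversion, retention, net conversion) from raw counts."""
    gross_conversion = enrollments / clicks   # probability of enrolling, given click
    retention = payments / enrollments        # probability of payment, given enroll
    net_conversion = payments / clicks        # probability of payment, given click
    return gross_conversion, retention, net_conversion


# Baseline daily counts; payments is an illustrative value (0.53 * 660), not measured data
baseline_clicks = 3200
baseline_enrollments = 660
baseline_payments = 0.53 * baseline_enrollments

gross, retention, net = evaluation_metrics(
    baseline_clicks, baseline_enrollments, baseline_payments
)
print(round(gross, 5), round(retention, 2), round(net, 7))
```

Note that net conversion is the product of gross conversion and retention, which is why it can stay flat while gross conversion falls.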

## Measuring Variability

To measure variability, I need to estimate the standard deviation of each evaluation metric analytically. These estimates are also used later to size the experiment, build confidence intervals, and draw conclusions.

### The Baseline Values

Before starting the experiment, we need to know the baseline values of the evaluation metrics, that is, how the metrics behaved before the experiment.

The following baseline estimates (based on daily traffic) were provided:

In [1]:
import pandas as pd
import math as mt

# source link - https://docs.google.com/spreadsheets/d/1MYNUtC47Pg8hdoCjOXaHqF-thheGpUshrFA21BAJnNc/edit#gid=0
baseline_values = pd.DataFrame({'Metric':['Number of cookies', 'Number of clicks', 'Number of enrollments', 
                                   'Click-through-probability(CTP)', 'Gross conversion', 'Retention', 'Net conversion'],
                         'Description':['Unique cookies to view course page per day',
                                  'Unique cookies click "Start free trial" per day',
                                  'Enrollments per day',
                                  'Click-through-probability on "Start free trial"',
                                  'Probability of enrolling, given click',
                                  'Probability of payment, given enroll',
                                  'Probability of payment, given click'],
                         'Value':[40000, 3200, 660, 0.08, 0.20625, 0.53, 0.1093125]})
baseline_values

| | Metric | Description | Value |
|---|---|---|---|
| 0 | Number of cookies | Unique cookies to view course page per day | 40000.0 |
| 1 | Number of clicks | Unique cookies click "Start free trial" per day | 3200.0 |
| 2 | Number of enrollments | Enrollments per day | 660.0 |
| 3 | Click-through-probability(CTP) | Click-through-probability on "Start free trial" | 0.08 |
| 4 | Gross conversion | Probability of enrolling, given click | 0.20625 |
| 5 | Retention | Probability of payment, given enroll | 0.53 |
| 6 | Net conversion | Probability of payment, given click | 0.109313 |


### Sample size assumption

For each evaluation metric, I need to make an analytic estimate of its standard deviation, given a sample size of 5,000 cookies visiting the course overview page (a condition from the project). This sample size was chosen to be smaller than the "population"  (Unique cookies to view course page per day), and large enough to have two groups.

I rescaled the baseline values in view of a sample size of 5,000 cookies. This rescale applies only to quantitative metrics, i.e., the number of cookies, clicks, and enrollments.

In [2]:
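# Scale the daily counts down to a sample of 5,000 pageviews
ratio = 5000 / 40000
baseline_values['Sample_value'] = 0.0  # float column, so fractional counts fit
# Counts (rows 0-2) are rescaled; .loc slicing is label-based and inclusive
baseline_values.loc[0:2, 'Sample_value'] = baseline_values.loc[0:2, 'Value'] * ratio
# Probabilities (rows 3-6) do not depend on the sample size
baseline_values.loc[3:6, 'Sample_value'] = baseline_values.loc[3:6, 'Value']
baseline_values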

| | Metric | Description | Value | Sample_value |
|---|---|---|---|---|
| 0 | Number of cookies | Unique cookies to view course page per day | 40000.0 | 5000.0 |
| 1 | Number of clicks | Unique cookies click "Start free trial" per day | 3200.0 | 400.0 |
| 2 | Number of enrollments | Enrollments per day | 660.0 | 82.5 |
| 3 | Click-through-probability(CTP) | Click-through-probability on "Start free trial" | 0.08 | 0.08 |
| 4 | Gross conversion | Probability of enrolling, given click | 0.20625 | 0.20625 |
| 5 | Retention | Probability of payment, given enroll | 0.53 | 0.53 |
| 6 | Net conversion | Probability of payment, given click | 0.109313 | 0.109313 |


### Measuring  Standard Deviation of Evaluation Metrics

Each evaluation metric is a probability (a count of successes divided by a count of trials), so I assume it is binomially distributed and use this formula for the standard deviation:
$$ SD = \sqrt{ \frac{\hat{p}*(1-\hat{p})}{n}} $$

$ \hat{p} $ is a probability of the event happening,

$ n $ is a sample size

In [3]:
# function for calculating the standard deviation of a proportion
def st_deviation (p, n):
    return round(mt.sqrt(p * (1 - p)/n), 4)

In [4]:
# Create a table for the standard deviations of the three evaluation metrics
standard_deviation = pd.DataFrame()
standard_deviation['Metric'] = baseline_values['Metric'][4:7]
standard_deviation['SD'] = 0.0  # float column, so the SD values fit without upcasting
standard_deviation = standard_deviation.reset_index(drop=True)

#### Standard Deviation of Gross Conversion

$ \hat{p} $ is probability of enrolling, given click, or 0.206250, and $ n $ is number of clicks or 400

In [5]:
sd1 = st_deviation (baseline_values['Sample_value'][4], baseline_values['Sample_value'][1])

In [6]:
# assign with a single .loc call to avoid chained assignment on a copy
standard_deviation.loc[0, 'SD'] = sd1


#### Standard Deviation of Retention

$ \hat{p} $ is probability of payment, given enroll, or 0.530000, and $ n $ is number of enrollments or 82.5

In [7]:
sd2 = st_deviation (baseline_values['Sample_value'][5], baseline_values['Sample_value'][2])

In [8]:
standard_deviation.loc[1, 'SD'] = sd2

#### Standard Deviation of Net conversion

$ \hat{p} $ is probability of payment, given click, or 0.109313, and $ n $ is number of clicks or 400

In [9]:
sd3 = st_deviation (baseline_values['Sample_value'][6], baseline_values['Sample_value'][1])
standard_deviation.loc[2, 'SD'] = sd3

#### Analytical Standard Deviation for Evaluation Metrics

In [10]:
standard_deviation

| | Metric | SD |
|---|---|---|
| 0 | Gross conversion | 0.0202 |
| 1 | Retention | 0.0549 |
| 2 | Net conversion | 0.0156 |


## Sizing

With $ \alpha $ (significance level, the probability of a Type I error) = 0.05 and $ \beta $ (the probability of a Type II error) = 0.2, that is, a power of $ 1 - \beta $ = 0.8, and using the analytic estimates of variance, I need to calculate how many pageviews in total (across both groups) are required to adequately power the experiment. This means determining the minimum size of each group for each metric.

### Using sample size calculator

https://www.evanmiller.org/ab-testing/sample-size.html

<b> Gross conversion </b>

Baseline rate: 20.625%

Minimum Detectable Effect: 0.01

Sample size (from calculator): 25835 clicks/group

Total sample size (for 2 groups): 25835*2 = 51670 clicks

Pageviews= 51670 / (clicks / pageviews) = 51670 / 0.08 = 645875


<b> Retention </b>

Baseline rate: 53%

Minimum Detectable Effect: 0.01 

Sample size(from calculator): 39115 enrolls/group

Total sample size: 39115*2 = 78230 enrolls

Pageviews= 78230 / (enrolls / pageviews)= 78230 / (660/40000) = 4741212

<b> Net conversion </b>

Baseline rate: 10.9313%

Minimum Detectable Effect: 0.0075

Sample size: 27413 clicks/group

Total sample size: 27413*2=54826 clicks

Pageviews= 54826 / (clicks / pageviews) = 54826 / 0.08 = 685325
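The calculator figures above can be cross-checked analytically. The sketch below uses a common closed-form approximation for a two-sided test of two proportions and takes the more conservative of the two directions (baseline plus or minus the minimum detectable effect). That choice, and the function names, are assumptions of this sketch rather than a documented description of the calculator, so the results should land in the same range as the figures above without necessarily matching them exactly.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p, d_min, alpha=0.05, beta=0.2):
    """Approximate per-group sample size to detect an absolute change d_min."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    sd_null = sqrt(2 * p * (1 - p))
    sizes = []
    for shifted in (p + d_min, p - d_min):         # check both directions
        sd_alt = sqrt(p * (1 - p) + shifted * (1 - shifted))
        sizes.append((z_alpha * sd_null + z_beta * sd_alt) ** 2 / d_min ** 2)
    return ceil(max(sizes))


def pageviews_needed(n_per_group, units_per_pageview):
    """Convert a per-group size in clicks or enrollments to total pageviews."""
    return ceil(2 * n_per_group / units_per_pageview)


n_gross = sample_size_per_group(0.20625, 0.01)
n_retention = sample_size_per_group(0.53, 0.01)
n_net = sample_size_per_group(0.1093125, 0.0075)

print(n_gross, pageviews_needed(n_gross, 0.08))
print(n_retention, pageviews_needed(n_retention, 660 / 40000))
print(n_net, pageviews_needed(n_net, 0.08))
```

The conversion to pageviews divides by the number of analysis units per pageview: 0.08 clicks per pageview for the two conversion metrics and 660/40000 enrollments per pageview for Retention.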

In [11]:
sample_sizing = pd.DataFrame({'Metric':['Gross conversion', 'Retention', 'Net conversion'],
                         'Sample size':[51670, 78230, 54826],
                         'Pageviews':[645875, 4741212, 685325]})
sample_sizing

| | Metric | Sample size | Pageviews |
|---|---|---|---|
| 0 | Gross conversion | 51670 | 645875 |
| 1 | Retention | 78230 | 4741212 |
| 2 | Net conversion | 54826 | 685325 |


The number of pageviews sufficient for all metrics is 4741212.

## Choosing Duration

We have 40,000 page views per day. If we use 100% of traffic for our experiment, we need this duration:

In [12]:
sample_sizing['Duration'] = sample_sizing['Pageviews'] / 40000
sample_sizing

| | Metric | Sample size | Pageviews | Duration |
|---|---|---|---|---|
| 0 | Gross conversion | 51670 | 645875 | 16.146875 |
| 1 | Retention | 78230 | 4741212 | 118.5303 |
| 2 | Net conversion | 54826 | 685325 | 17.133125 |


That is, we need about 119 days for the Retention metric and at most 18 days for Gross conversion and Net conversion. A 119-day experiment is too long and carries risks for the business, so we have to remove Retention from our evaluation metrics.

The durations for Gross conversion and Net conversion round up to 17 and 18 days, respectively. That is, we need to run the experiment for at least 18 days, and the pageview requirement is reduced to 685325.

We could also divert only 50% of traffic, in which case the experiment would need 35 days. In this case, however, it is better to run the experiment for a shorter time.
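The duration arithmetic can be expressed as a small helper. This is a sketch with hypothetical names; it assumes traffic stays at the baseline 40,000 pageviews per day and rounds up to whole days.

```python
from math import ceil


def experiment_days(pageviews_required, daily_pageviews=40000, traffic_fraction=1.0):
    """Whole days needed to collect the required pageviews at a given traffic share."""
    return ceil(pageviews_required / (daily_pageviews * traffic_fraction))


days_full_traffic = experiment_days(685325)                        # Net conversion, 100% of traffic
days_half_traffic = experiment_days(685325, traffic_fraction=0.5)  # same requirement, 50% of traffic
days_retention = experiment_days(4741212)                          # Retention, 100% of traffic
print(days_full_traffic, days_half_traffic, days_retention)
```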

# Experimental Analysis

The data used for the analysis contains the raw counts needed to compute the above metrics, broken down day by day.

In [14]:
# load data of control group
# link source - https://docs.google.com/spreadsheets/d/1Mu5u9GrybDdska-ljPXyBjTpdZIUev_6i7t4LRDfXM8/edit#gid=0
control = pd.read_csv("Control.csv", sep=',')
control.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 |


In [15]:
control.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         37 non-null     object 
 1   Pageviews    37 non-null     int64  
 2   Clicks       37 non-null     int64  
 3   Enrollments  23 non-null     float64
 4   Payments     23 non-null     float64
dtypes: float64(2), int64(2), object(1)
memory usage: 1.6+ KB


In [16]:
# load data of experiment group
# link source - https://docs.google.com/spreadsheets/d/1Mu5u9GrybDdska-ljPXyBjTpdZIUev_6i7t4LRDfXM8/edit#gid=0
experiment = pd.read_csv("Experiment.csv", sep=',')
experiment.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7716 | 686 | 105.0 | 34.0 |
| 1 | Sun, Oct 12 | 9288 | 785 | 116.0 | 91.0 |
| 2 | Mon, Oct 13 | 10480 | 884 | 145.0 | 79.0 |
| 3 | Tue, Oct 14 | 9867 | 827 | 138.0 | 92.0 |
| 4 | Wed, Oct 15 | 9793 | 832 | 140.0 | 94.0 |


In [17]:
experiment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         37 non-null     object 
 1   Pageviews    37 non-null     int64  
 2   Clicks       37 non-null     int64  
 3   Enrollments  23 non-null     float64
 4   Payments     23 non-null     float64
dtypes: float64(2), int64(2), object(1)
memory usage: 1.6+ KB


## Sanity Checks for invariant metrics

I will run sanity checks to verify that the invariant metrics are equivalent between the two groups, that is, that the experiment was conducted as expected and that other factors did not influence the data we collected.

### Sanity Check for metric Number of cookies

That is, the number of unique cookies to view the course overview page.

I expect the control group and the experiment group to each account for 50% of the total number of cookies.

We must first count the number of cookies in the control and experimental groups:

In [18]:
control_views = control.Pageviews.sum()
control_views

345543

In [19]:
experiment_views = experiment.Pageviews.sum()
experiment_views

344660

In [20]:
# Calculate the 95% confidence interval (z-score = 1.96) 
# Compute SD of binomial distribution with 50% chance of placement in one of the groups
p=0.5
SD_views=mt.sqrt((p*(1-p)/(control_views + experiment_views)))
# Calculate margin of error
ME1=1.96*SD_views
# Check that observed fraction is within interval
observed =control_views/(control_views + experiment_views)
print ("The confidence interval is ",round(p-ME1,4),"and",round(p+ME1,4),"\nObserved is ", round(observed,4))

The confidence interval is  0.4988 and 0.5012 
Observed is  0.5006


### Sanity Check for metric Number of clicks

In [21]:
control_clicks = control.Clicks.sum()
control_clicks

28378

In [22]:
experiment_clicks = experiment.Clicks.sum()
experiment_clicks

28325

In [23]:
# Calculate the 95% confidence interval (z-score = 1.96)
# Compute SD of binomial distribution with 50% chance of placement in one of the groups
p=0.5
SD_clicks=mt.sqrt((p*(1-p)/(control_clicks + experiment_clicks)))
# Calculate margin of error
ME2=1.96*SD_clicks
# Check that observed fraction is within interval
observed = control_clicks /(control_clicks + experiment_clicks)
print ("The confidence interval is ",round(p-ME2,4),"and",round(p+ME2,4),"\nObserved is ", round(observed,4))

The confidence interval is  0.4959 and 0.5041 
Observed is  0.5005


### Sanity Check for metric Click-through-probability (CTP)

Click-through-probability of the Free Trial Button.

To perform a sanity check for the CTP, we would expect the difference between the two groups to be zero (CTP_exp - CTP_control = 0). The standard error is the pooled standard error across the experiment and control groups. We must first compute the CTP in the control and experiment groups:

In [24]:
ctp_control = control_clicks/control_views 
ctp_exp = experiment_clicks/experiment_views
d_hat = ctp_exp-ctp_control 
# pooled click-through probability across the experiment and control groups
ctp_pool = (control_clicks + experiment_clicks)/(control_views + experiment_views)
# standard error
SE_ctp=mt.sqrt(ctp_pool*(1-ctp_pool)*(1/control_views + 1/experiment_views))
# Calculate margin of error
ME3=1.96*SE_ctp
print("The confidence interval is ",round(-ME3,4),"and",round(ME3,4),"\nObserved is ", round(d_hat,4))

The confidence interval is  -0.0013 and 0.0013 
Observed is  0.0001


All three invariant metrics (Number of cookies, Number of clicks and CTP) passed the sanity check.
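The two count-based checks above repeat the same steps, so they can be folded into one helper. This is a sketch with hypothetical names, using the totals printed above; it assumes an expected 50/50 split and a z-score of 1.96.

```python
from math import sqrt


def split_sanity_check(control_count, experiment_count, p=0.5, z=1.96):
    """Check that the control share of a count lies inside the expected interval."""
    total = control_count + experiment_count
    margin = z * sqrt(p * (1 - p) / total)  # margin of error around the expected share
    observed = control_count / total
    lower, upper = p - margin, p + margin
    return lower, upper, observed, lower <= observed <= upper


# Totals taken from the cells above
views_check = split_sanity_check(345543, 344660)
clicks_check = split_sanity_check(28378, 28325)

for name, (lower, upper, observed, passed) in [("Pageviews", views_check),
                                               ("Clicks", clicks_check)]:
    print(name, round(lower, 4), round(upper, 4), round(observed, 4), passed)
```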

## Result Analysis

### Effect Size Test

In this test for each of the evaluation metrics, I give a 95% confidence interval around the difference between the experiment and control groups. Also, I indicate whether each metric is statistically and practically significant.

A metric is statistically significant if the confidence interval does not include 0 (that is, you can be confident there was a change), and it is practically significant if the confidence interval does not include the practical significance boundary (that is, you can be confident there is a change that matters to the business.)
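The two definitions can be captured in a small helper. This is a hedged sketch with hypothetical names, not the notebook's own code. It reads "practically significant" slightly more strictly than a simple boundary check: the whole interval must lie beyond the practical significance boundary, so a significant but tiny change does not qualify.

```python
def classify_interval(lower, upper, d_min):
    """Return (statistically significant, practically significant) for a CI."""
    # Statistically significant: the interval does not contain 0
    statistically_significant = not (lower < 0 < upper)
    # Practically significant: the whole interval lies beyond -d_min or +d_min
    practically_significant = upper < -d_min or lower > d_min
    return statistically_significant, practically_significant


# Intervals reported in this analysis
gross_result = classify_interval(-0.0292, -0.012, d_min=0.01)
net_result = classify_interval(-0.0116, 0.0018, d_min=0.0075)
print(gross_result, net_result)
```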

<b> Gross conversion </b>

In [25]:
# use clicks only from the days that also have enrollment data
clicks_control = control.Clicks.loc[control.Enrollments.notna()].sum()
clicks_exp = experiment.Clicks.loc[experiment.Enrollments.notna()].sum()

enrollments_control = control.Enrollments.sum()
enrollments_exp =experiment.Enrollments.sum()

In [26]:
# calculate gross conversion
GC_control =  enrollments_control/clicks_control
GC_exp = enrollments_exp/clicks_exp

In [27]:
# calculate pooled probability
p_pool = (enrollments_control + enrollments_exp)/(clicks_control + clicks_exp)

#calculate pooled standard deviation:
SD_pool = mt.sqrt(p_pool*(1-p_pool)*(1/clicks_control + 1/clicks_exp))

# calculate margin of errors for 95% confidence interval (Z-score = 1.96)
m = round ((1.96 * SD_pool), 4)

# calculate the observed difference between the groups
d_hat1 = round((enrollments_exp/clicks_exp - enrollments_control/clicks_control), 4)

#Results
print("Confidence Interval: [",d_hat1-m,",",d_hat1+m,"]")
print("Observed:",d_hat1)
# statistically significant if the interval [d_hat1-m, d_hat1+m] does not contain 0
print ("Statistically significant:", not(d_hat1-m<0<d_hat1+m))
d_min=0.01
print("Practically significant:", not(d_hat1-m <d_min<d_hat1+m or d_hat1-m <-d_min<d_hat1+m))

Confidence Interval: [ -0.0292 , -0.012 ]
Observed: -0.0206
Statistically significant: True
Practically significant: True


The experiment produced a statistically and practically significant change: the observed difference was -0.0206, a drop of about 2 percentage points. This means that in the experiment group (those who were asked how many hours they could devote to study) the gross conversion rate went down, meaning that fewer people signed up for the free trial after the change.

<b> Net Conversion </b>

In [28]:
#use records of payments
payments_control = control.Payments.sum()
payments_exp = experiment.Payments.sum()

In [29]:
# calculate Net Conversion
Net_C_control =  payments_control/clicks_control
Net_C_exp = payments_exp/clicks_exp

In [30]:
# calculate pooled probability
p_pool2 = (payments_control + payments_exp)/(clicks_control + clicks_exp)

#calculate pooled standard deviation:
SD_pool2 = mt.sqrt(p_pool2*(1-p_pool2)*(1/clicks_control + 1/clicks_exp))

# calculate margin of errors for 95% confidence interval (Z-score = 1.96)
m2 = round(1.96 * SD_pool2,4)

# calculate the observed difference between the groups
d_hat2 = round((payments_exp/clicks_exp - payments_control/clicks_control), 4)

#Results
print("Confidence Interval: [",d_hat2-m2,",",d_hat2+m2,"]")
print("Observed:",d_hat2)
print ("Statistically significant:", not(d_hat2-m2< 0 <d_hat2+m2))
d_min = 0.0075
print("Practically significant:", not(d_hat2-m2 <d_min<d_hat2+m2 or d_hat2-m2 <-d_min<d_hat2+m2))

Confidence Interval: [ -0.0116 , 0.0018000000000000004 ]
Observed: -0.0049
Statistically significant: False
Practically significant: False


For Net Conversion, the confidence interval includes both 0 and the negative practical significance boundary (-0.0075), so the difference between the control and experiment groups is neither statistically nor practically significant.

### Sign Test

The sign test is another way to validate the result: it checks whether the sign of the day-by-day difference between the experiment and control groups agrees with the confidence interval of the overall difference.

For each evaluation metric, I run a sign test using the day-by-day breakdown. I calculate the metric per day, count the number of days on which it was higher in the experiment group than in the control group, and test whether that count is consistent with a 50% chance of either group being higher on any given day.
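The sign test reduces to an exact two-sided binomial test on the number of days counted. The sketch below computes that p-value with the standard library only, summing the probabilities of all outcomes that are no more likely than the observed one. The function name is hypothetical, and the counts in the usage lines (4 and 10 days out of 23) are the ones reported in this analysis.

```python
from math import comb


def sign_test_p_value(successes, n_days, p=0.5):
    """Exact two-sided binomial p-value for `successes` out of `n_days`."""
    # Probability of every possible count of successes under the null hypothesis
    pmf = [comb(n_days, k) * p ** k * (1 - p) ** (n_days - k)
           for k in range(n_days + 1)]
    # Small relative tolerance so equally likely outcomes are not lost to rounding
    threshold = pmf[successes] * (1 + 1e-7)
    return min(1.0, sum(q for q in pmf if q <= threshold))


p_gross = sign_test_p_value(4, 23)
p_net = sign_test_p_value(10, 23)
print(round(p_gross, 4), round(p_net, 4))
```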

<b> Gross conversion </b>

In [31]:
days = experiment.Clicks.loc[experiment.Enrollments.notna()].count()
days

23

In [32]:
gc_exp = [i/j for i,j in zip(experiment.Enrollments,experiment.Clicks.loc[experiment.Enrollments.notna()])]
gc_cont=[i/j for i,j in zip(control.Enrollments,control.Clicks.loc[control.Enrollments.notna()])]
gc_diff=sum([i>j for i,j in zip(gc_exp,gc_cont)])

In [33]:
from scipy.stats import binomtest  # replaces binom_test, which was removed in SciPy 1.12
alpha=0.05 # condition from part Design of Experiment -> Sizing 

# Under the null hypothesis, gross conversion is higher in the experiment group on a given day with probability 0.5
p_value1 = binomtest(int(gc_diff), n=int(days), p=0.5).pvalue
print (gc_diff, 'times')
print("p-value:",round(p_value1,4),", Statistically Significant:",p_value1 < alpha)

4 times
p-value: 0.0026 , Statistically Significant: True


For Gross Conversion, the daily rate in the experiment group was higher than in the control group on only 4 of 23 days; the p-value of 0.0026 is below alpha, so the difference is statistically significant.

<b> Net Conversion </b>

In [34]:
n_days = experiment.Payments.count()
n_days

23

In [35]:
nc_exp = [i/j for i,j in zip(experiment.Payments,experiment.Clicks.loc[experiment.Payments.notna()])]
nc_cont=[i/j for i,j in zip(control.Payments,control.Clicks.loc[control.Payments.notna()])]
nc_diff=sum([i>j for i,j in zip(nc_exp,nc_cont)])

In [36]:
# Under the null hypothesis, net conversion is higher in the experiment group on a given day with probability 0.5
p_value2 = binomtest(int(nc_diff), n=int(n_days), p=0.5).pvalue
print (nc_diff, 'times')
print("p-value:",round(p_value2,4),", Statistically Significant:",p_value2 < alpha)

10 times
p-value: 0.6776 , Statistically Significant: False


For Net Conversion, the daily rate in the experiment group was higher than in the control group on 10 of 23 days; the p-value of 0.6776 is not statistically significant.

## Recommendation

The experiment shows that Gross Conversion decreased significantly, that is, fewer students completed checkout and enrolled in the free trial.

However, the experiment shows no significant change in the other metric, Net conversion. So the page change reduces the number of enrollments in the free trial, but there is no evidence that it increases the number of students who stay past the 14-day trial and make a payment. The confidence interval for Net conversion also includes the negative practical significance boundary, so a decrease that matters to the business cannot be ruled out.

I would not recommend launching this change; instead, I would run other experiments.