# Udacity A/B Testing Project

## Table of Contents

- [Introduction](#introduction)
- [Data Collection](#data-collection)
- [Analysis](#analysis)
- [Conclusion](#conclusion)

## Introduction

Udacity's mission is to power careers through tech education. Working towards this mission, the company aims to provide a stimulating learning experience that is tailored to the individual learner and supported by experienced coaches. To improve its services, Udacity tinkered with changing the user flow on its website and set up an A/B test titled "Free Trial Screener" to test its idea.

### Status quo
At the time of the experiment, Udacity courses had two options on the course overview page: "start free trial" and "access course materials".

If students click "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first.

If students click "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

### Treatment
In the experiment, Udacity tests a change where if students click "start free trial", they are asked how much time they have available to devote to the course.

If students indicate 5 or more hours per week, they are taken through the checkout process as usual.

If they indicate fewer than 5 hours per week, a message appears indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, students have the option to continue enrolling in the free trial, or access the course materials for free instead. This screenshot shows what the experiment looks like.


### Reasoning
The hypothesis is that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who leave the free trial because they don't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course.

If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

## Data Collection

In [None]:
# -*- coding: utf-8 -*-
# Import the packages used throughout the notebook

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.proportion import binom_test, proportions_ztest

# Display floats with four decimal places
pd.options.display.float_format = "{:.4f}".format


In [None]:
# Storing baseline data
d = {
    "Metric Name": [
        "Cookies",
        "Clicks",
        "User-ids",
        "Click-through-probability",
        "Gross conversion",
        "Retention",
        "Net conversion",
    ],
    "Estimator": [40000, 3200, 660, 0.08, 0.20625, 0.53, 0.109313],
    "dmin": [3000, 240, -50, 0.01, -0.01, 0.01, 0.0075],
}
md = pd.DataFrame(data=d, index=["C", "CL", "ID", "CTP", "CG", "R", "CN"])
md

Unnamed: 0,Metric Name,Estimator,dmin
C,Cookies,40000.0,3000.0
CL,Clicks,3200.0,240.0
ID,User-ids,660.0,-50.0
CTP,Click-through-probability,0.08,0.01
CG,Gross conversion,0.2062,-0.01
R,Retention,0.53,0.01
CN,Net conversion,0.1093,0.0075


### 3-standard deviation rule

The 3-Standard Deviation Rule, also known as the Empirical Rule (or 68-95-99.7 Rule), is a guideline in statistics that describes how data is distributed in a normal (bell-shaped) distribution. It states that:

- 68% of the data falls within 1 standard deviation (μ±σ) of the mean.
- 95% of the data falls within 2 standard deviations (μ±2σ) of the mean.
- 99.7% of the data falls within 3 standard deviations (μ±3σ) of the mean.
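
These percentages can be checked numerically. The sketch below is illustrative only: it uses the standard library's `statistics.NormalDist` (an assumption made so that it runs without extra packages; `scipy.stats.norm.cdf` would serve the same purpose), and the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist


def share_within(k: float) -> float:
    """Share of a normal distribution lying within k standard deviations of the mean."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```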

In [None]:
# First, compute the standard deviation of the Gross Conversion and Net Conversion metrics
# Gross Conversion
CG = md.loc["CG"].copy()
CG["Estimator"] = 3200
CG["dmin"] = -0.01
CG["p"] = 0.206250
# Use the baseline number of cookies as the sample size n
CG["n"] = md.loc["C"]["Estimator"]
CG["sd"] = np.sqrt(CG["p"] * (1 - CG["p"]) / CG["n"])
CG

Metric Name    Gross conversion
Estimator                  3200
dmin                    -0.0100
p                        0.2062
n                    40000.0000
sd                       0.0020
Name: CG, dtype: object

In [None]:
CN = md.loc["CN"].copy()
CN["Estimator"] = 660
CN["dmin"] = 0.0075
CN["p"] = 0.109313
# Use the baseline number of cookies as the sample size n
CN["n"] = md.loc["C"]["Estimator"]
CN["sd"] = np.sqrt(CN["p"] * (1 - CN["p"]) / CN["n"])
CN

Metric Name    Net conversion
Estimator                 660
dmin                   0.0075
p                      0.1093
n                  40000.0000
sd                     0.0016
Name: CN, dtype: object

In [None]:
# The sample size for the experiment is calculated using the formula:
# n = (Z^2 * p * (1 - p)) / d^2
# where:
# n = sample size
# Z = Z-score
# p = baseline conversion rate
# d = minimum detectable effect

# Gross Conversion
CG["n"] = (1.96**2 * CG["p"] * (1 - CG["p"])) / CG["dmin"] ** 2
CG["n"] = np.ceil(CG["n"])
CG

# Net Conversion
CN["n"] = (1.96**2 * CN["p"] * (1 - CN["p"])) / CN["dmin"] ** 2
CN["n"] = np.ceil(CN["n"])
CN

Metric Name    Net conversion
Estimator                 660
dmin                   0.0075
p                      0.1093
n                   6650.0000
sd                     0.0016
Name: CN, dtype: object

In [None]:
CG

Metric Name    Gross conversion
Estimator                  3200
dmin                    -0.0100
p                        0.2062
n                     6290.0000
sd                       0.0020
Name: CG, dtype: object
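
The two calculations above follow the same pattern, which can be wrapped in a small helper. This is a minimal sketch, assuming the one-sample formula quoted in the code comment with Z = 1.96; the function name `one_sample_size` is hypothetical and not part of the notebook.

```python
import math


def one_sample_size(p: float, d_min: float, z: float = 1.96) -> int:
    """n = Z^2 * p * (1 - p) / d^2, rounded up to the next integer."""
    return math.ceil(z**2 * p * (1 - p) / d_min**2)


print(one_sample_size(0.206250, -0.01))   # gross conversion baseline
print(one_sample_size(0.109313, 0.0075))  # net conversion baseline
```

Because `d_min` is squared, its sign does not affect the result.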

### Scaling

Since Udacity specifies a sample size of 5,000 cookies, the baseline counts, which refer to 40,000 cookies, must be scaled down accordingly.

In [None]:
md

scaling_factor = 5000 / md.loc["C"]["Estimator"]

for i in ["C", "CL", "ID"]:
    md.at[i, "Scaled_Estimator"] = md.loc[i]["Estimator"] * scaling_factor

md

Unnamed: 0,Metric Name,Estimator,dmin,Scaled_Estimator
C,Cookies,40000.0,3000.0,5000.0
CL,Clicks,3200.0,240.0,400.0
ID,User-ids,660.0,-50.0,82.5
CTP,Click-through-probability,0.08,0.01,
CG,Gross conversion,0.2062,-0.01,
R,Retention,0.53,0.01,
CN,Net conversion,0.1093,0.0075,


### Computing Standard Error

The standard error of a proportion is computed with the general formula:

$$SE = \sqrt{\frac{p(1-p)}{n}}$$

Since we are not yet comparing two samples, there is no need to calculate a pooled standard error.

The standard error is calculated only for Gross Conversion, Retention and Net Conversion, which serve as evaluation metrics. Click-through probability is the ratio between the number of clicks and the number of cookies, which are both invariant metrics, so their ratio is invariant as well.
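
The formula can also be evaluated directly. The sketch below is an illustration under an assumption that the text does not state: it takes `n` to be each metric's own denominator in the scaled sample (400 clicks for the two conversion metrics, 82.5 enrollments for retention) rather than the 5,000 cookies. The helper name `standard_error` is hypothetical.

```python
import math


def standard_error(p: float, n: float) -> float:
    """Analytical standard error of a proportion p estimated from n units."""
    return math.sqrt(p * (1 - p) / n)


# Scaled baseline counts for a sample of 5,000 cookies
clicks = 400.0
enrollments = 82.5

# Assumption: n is the denominator of each metric, not the number of cookies
se = {
    "Gross conversion": standard_error(0.20625, clicks),
    "Retention": standard_error(0.53, enrollments),
    "Net conversion": standard_error(0.109313, clicks),
}
for name, value in se.items():
    print(f"{name}: {value:.4f}")
```

With a smaller `n` the standard errors come out larger than when the number of cookies is used, so the choice of denominator matters for the width of any interval built on them.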



In [None]:
for i in ["CG", "CN"]:

    md.at[i, "Standard Error"] = np.sqrt(
        md.loc[i]["Estimator"]
        * (1 - md.loc[i]["Estimator"])
        / md.loc["C"]["Scaled_Estimator"]
    )

md.at["R", "Standard Error"] = np.sqrt(
    md.loc["R"]["Estimator"]
    * (1 - md.loc["R"]["Estimator"])
    / md.loc["C"]["Scaled_Estimator"]
)

md

Unnamed: 0,Metric Name,Estimator,dmin,Scaled_Estimator,Standard Error
C,Cookies,40000.0,3000.0,5000.0,
CL,Clicks,3200.0,240.0,400.0,
ID,User-ids,660.0,-50.0,82.5,
CTP,Click-through-probability,0.08,0.01,,
CG,Gross conversion,0.2062,-0.01,,0.0057
R,Retention,0.53,0.01,,0.0071
CN,Net conversion,0.1093,0.0075,,0.0044


### Defining Alpha and Beta

Alpha (α) and Beta (β) are related to statistical errors that can occur when making conclusions about your test results. 

Alpha represents the significance level of the test, that is the probability of rejecting the null hypothesis when it is actually true. This is usually set at 5%.

Beta represents the probability of failing to reject the null hypothesis when the alternative hypothesis is actually true. This is usually set at 20%.

In [None]:
alpha = 0.05

beta = 0.2

### Sample Size Calculation

Because the sample is large, the required sample size per group can be approximated with the following formula:
$$\text{Sample Size} = \frac{2 \cdot (Z_{\alpha/2} + Z_{\beta})^2 \cdot p_1 \cdot (1 - p_1)}{d_{\min}^2}$$
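
As a worked instance of the formula, the sketch below computes the size for the two conversion metrics and converts it into total pageviews. It assumes α = 0.05 and β = 0.2, divides by the click-through probability of 0.08 because conversions are measured on clicks, and doubles the result to cover both groups. `statistics.NormalDist` stands in for `scipy.stats.norm`, and the function names are hypothetical.

```python
from statistics import NormalDist

ALPHA, BETA = 0.05, 0.2
Z_ALPHA = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided critical value
Z_BETA = NormalDist().inv_cdf(1 - BETA)        # quantile matching 80% power


def clicks_per_group(p: float, d_min: float) -> float:
    """Sample size from the approximation above, in clicks for one group."""
    return 2 * (Z_ALPHA + Z_BETA) ** 2 * p * (1 - p) / d_min**2


def total_pageviews(p: float, d_min: float, ctp: float = 0.08) -> int:
    """Convert clicks for one group into pageviews for both groups."""
    return round(clicks_per_group(p, d_min) / ctp * 2)


print(total_pageviews(0.20625, -0.01))    # gross conversion
print(total_pageviews(0.109313, 0.0075))  # net conversion
```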


In [None]:
# Calculate the z-score for the alpha value
z_alpha = stats.norm.ppf(1 - alpha / 2)

# Calculate the z-score for the beta value
z_beta = stats.norm.ppf(1 - beta)

In [None]:
# Create a function for the sample size calculation
def sample_size(p, d_min):
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / d_min**2

In [None]:
# Add an empty column to the table to store the sample size

md.insert(3, "Sample Size", np.nan)

In [None]:
for i in ["CG", "CN"]:
    md.at[i, "Sample Size"] = round(
        (
            sample_size(md.loc[i]["Estimator"], md.loc[i]["dmin"])
            / md.loc["CTP"]["Estimator"]
        )
        * 2,
        0,
    )
md.at["R", "Sample Size"] = round(
    (
        (
            sample_size(md.loc["R"]["Estimator"], md.loc["R"]["dmin"])
            / md.loc["CTP"]["Estimator"]
        )
        / md.loc["CG"]["Estimator"]
    )
    * 2,
    0,
)
md

Unnamed: 0,Metric Name,Estimator,dmin,Sample Size,Scaled_Estimator,Standard Error
C,Cookies,40000.0,3000.0,,5000.0,
CL,Clicks,3200.0,240.0,,400.0,
ID,User-ids,660.0,-50.0,,82.5,
CTP,Click-through-probability,0.08,0.01,,,
CG,Gross conversion,0.2062,-0.01,642474.0,,0.0057
R,Retention,0.53,0.01,4739772.0,,0.0071
CN,Net conversion,0.1093,0.0075,679285.0,,0.0044


### Experiment duration

It is assumed that 100% of the traffic is diverted to the experiment, so 50% of the users are assigned to the control group and 50% to the experiment group.

Based on the baseline data, around 40,000 page views per day are expected.

The duration, in number of days, is calculated as follows:
$$\text{Duration (days)} = \frac{N_{\text{tot}}}{\text{Daily Visitors} \times \text{Experiment exposure}}$$
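
The formula translates into a one-line function. The sketch below is illustrative; `duration_days` is a hypothetical name, and 642,474 is the gross conversion sample size from the table above. It also shows how halving the exposure doubles the duration.

```python
def duration_days(total_sample: float, daily_visitors: float, exposure: float = 1.0) -> float:
    """Days needed to collect total_sample pageviews at the given traffic exposure."""
    if not 0 < exposure <= 1:
        raise ValueError("exposure must be in the interval (0, 1]")
    return total_sample / (daily_visitors * exposure)


print(round(duration_days(642_474, 40_000), 2))
print(round(duration_days(642_474, 40_000, exposure=0.5), 2))
```

Note that the exposure belongs in the denominator: dividing by the daily visitors and then multiplying by the exposure only gives the same result when the exposure is 1.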


In [None]:
md.insert(4, "Experiment Duration (Days)", np.nan)

In [None]:
# Fraction of the daily traffic exposed to the experiment (100%)
exposure = 1.0

for i in ["CG", "CN", "R"]:
    md.at[i, "Experiment Duration (Days)"] = md.loc[i, "Sample Size"] / (
        md.loc["C", "Estimator"] * exposure
    )
md

Unnamed: 0,Metric Name,Estimator,dmin,Sample Size,Experiment Duration (Days),Scaled_Estimator,Standard Error
C,Cookies,40000.0,3000.0,,,5000.0,
CL,Clicks,3200.0,240.0,,,400.0,
ID,User-ids,660.0,-50.0,,,82.5,
CTP,Click-through-probability,0.08,0.01,,,,
CG,Gross conversion,0.2062,-0.01,642474.0,16.0618,,0.0057
R,Retention,0.53,0.01,4739772.0,118.4943,,0.0071
CN,Net conversion,0.1093,0.0075,679285.0,16.9821,,0.0044


Collecting enough data for Gross and Net conversion requires about 16 and 17 days respectively. Since both metrics are measured on the same traffic, the experiment needs to run for about 17 days, plus the 14 days of free trial before payments can be observed.

This duration seems acceptable. Retention, however, would require around 118 days of testing, which is problematic from a business perspective: no other experiment is supposed to run at the same time, which makes this scenario risky.

## Analysis

In [None]:
# Load the dataset for the control and experiment groups
control = pd.read_csv(
    r"C:\Users\pietr\Desktop\DA_Touring\AB Testing\Final Project Results - Control.csv"
)
experiment = pd.read_csv(
    r"C:\Users\pietr\Desktop\DA_Touring\AB Testing\Final Project Results - Experiment.csv"
)

In [None]:
# check if control data were loaded correctly
control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [None]:
# Check if experiment data were loaded correctly
experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


#### Invariant metrics sanity check

An invariant metric in an A/B test is a metric that should remain unchanged between the treatment and control groups. Checking it verifies that the experiment was properly randomized and that no bias affects these metrics.

To do this, we first compute a confidence interval around the expected binomial proportion of 0.5, the share of traffic assigned to the control group:

$$CI = \hat{p} \pm Z \cdot \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
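
A minimal sketch of this check, assuming an expected share of 0.5 and α = 0.05. The counts are hypothetical and are not taken from the experiment data; `sanity_interval` is a made-up helper name, and `statistics.NormalDist` stands in for `scipy.stats.norm`.

```python
import math
from statistics import NormalDist


def sanity_interval(n_total: int, p: float = 0.5, alpha: float = 0.05) -> tuple[float, float]:
    """Confidence interval around the expected share p for n_total observations."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * math.sqrt(p * (1 - p) / n_total)
    return p - margin, p + margin


# Hypothetical counts, not taken from the experiment data
n_control, n_experiment = 5_050, 4_950
low, high = sanity_interval(n_control + n_experiment)
observed = n_control / (n_control + n_experiment)
print(f"[{low:.4f}, {high:.4f}] observed={observed:.4f} pass={low <= observed <= high}")
```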



In [None]:
# Create a table for visualizing the confidence interval data
sanity_check = pd.DataFrame(
    columns=[
        "Left Interval",
        "Right Interval",
        "Observed Value",
        "Sanity Check Result",
    ],
    index=["Cookies", "Clicks", "Click-through-probability"],
)

# Expected proportion of traffic assigned to the control group
p = 0.5
# Alpha value
alpha = 0.05


def standard_dev(p, n):
    return np.sqrt(p * (1 - p) / n)


for i, j in zip(["Cookies", "Clicks"], ["Pageviews", "Clicks"]):
    # Calculate the sample size
    n = control[j].sum() + experiment[j].sum()
    # Set the control population size
    n_control = control[j].sum()
    # Calculate the left boundary of the confidence interval
    sanity_check.at[i, "Left Interval"] = 0.5 - (standard_dev(p, n)) * stats.norm.ppf(
        1 - alpha / 2
    )
    # Calculate the right boundary of the confidence interval
    sanity_check.at[i, "Right Interval"] = 0.5 + (standard_dev(p, n)) * stats.norm.ppf(
        1 - alpha / 2
    )
    # Calculate the observed value
    sanity_check.at[i, "Observed Value"] = round((n_control / n), 4)
    # Check if the observed value is within the confidence interval
    if (
        sanity_check.at[i, "Left Interval"]
        <= sanity_check.at[i, "Observed Value"]
        <= sanity_check.at[i, "Right Interval"]
    ):
        sanity_check.at[i, "Sanity Check Result"] = "Pass"
    else:
        sanity_check.at[i, "Sanity Check Result"] = "Fail"

sanity_check

Unnamed: 0,Left Interval,Right Interval,Observed Value,Sanity Check Result
Cookies,0.4988,0.5012,0.5006,Pass
Clicks,0.4959,0.5041,0.5005,Pass
Click-through-probability,,,,


Alternatively, we can run a one-sample z-test of the null hypothesis that the proportion of pageviews in the control group equals 0.5.

In [None]:
# Perform z-test
# Calculate the number of observations
n = control["Pageviews"].sum() + experiment["Pageviews"].sum()
# Calculate the number of successes
n_control = control["Pageviews"].sum()
# Calculate the sample proportion
p = n_control / n
p

# Perform z-test
z_stat, p_value = proportions_ztest(count=n_control, nobs=n, value=0.5)

print(f"Z-statistic: {z_stat}")
print(f"P-value: {p_value}")
print(f"Sample proportion: {p}")

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Z-statistic: 1.0628516171419604
P-value: 0.2878492472042282
Sample proportion: 0.5006396668806133
Fail to reject the null hypothesis
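
For reference, the statistic behind such a test can be reproduced by hand. The sketch below is a minimal illustration with hypothetical counts; it uses the sample proportion in the variance, which is assumed here to match the default behaviour of `proportions_ztest`, and `one_sample_ztest` is a made-up name.

```python
import math
from statistics import NormalDist


def one_sample_ztest(count: int, nobs: int, value: float = 0.5) -> tuple[float, float]:
    """Two-sided z-test of H0: proportion == value, with variance from the sample."""
    p_hat = count / nobs
    se = math.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - value) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, not taken from the experiment data
z, p_value = one_sample_ztest(count=5_050, nobs=10_000)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```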


Checking the validity of the click-through probability means verifying whether it differs between the control and experiment groups.

Assuming that there is no difference between the two probabilities, we can either calculate the confidence interval around the theoretical difference (0) or perform a z-test and calculate the p-value.

$$CI = [0 - Z_{\alpha/2} \cdot SE, 0 + Z_{\alpha/2} \cdot SE]$$

$$SE = \sqrt{\frac{S_\text{exp}^2}{n_\text{exp,pageviews}} + \frac{S_\text{control}^2}{n_\text{control,pageviews}}}$$

with $$S = \sqrt{p(1-p)}$$ and $$p = CTP = \frac{n_\text{clicks}}{n_\text{pageviews}}$$ computed separately for each group. This is the unpooled form of the standard error; it is stored as `se_pooled` in the code.
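
The check described by these formulas can be sketched as a single function. The counts below are hypothetical and not taken from the experiment data; `difference_check` is a made-up name, and `statistics.NormalDist` stands in for `scipy.stats.norm`.

```python
import math
from statistics import NormalDist


def difference_check(clicks_c, views_c, clicks_e, views_e, alpha=0.05):
    """Return (ci_low, ci_high, d_hat, passed) for the difference in click-through probability."""
    p_c, p_e = clicks_c / views_c, clicks_e / views_e
    # Unpooled standard error: each group contributes its own variance
    se = math.sqrt(p_c * (1 - p_c) / views_c + p_e * (1 - p_e) / views_e)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    d_hat = p_e - p_c
    return -margin, margin, d_hat, -margin <= d_hat <= margin


# Hypothetical counts, not taken from the experiment data
low, high, d_hat, passed = difference_check(800, 10_000, 820, 10_000)
print(f"CI=[{low:.4f}, {high:.4f}] d_hat={d_hat:.4f} passed={passed}")
```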


In [None]:
# Calculate the probability for control and experiment groups
p_control = control["Clicks"].sum() / control["Pageviews"].sum()
p_experiment = experiment["Clicks"].sum() / experiment["Pageviews"].sum()

# Calculate the standard deviation for the control and experiment groups
sd_control = np.sqrt(p_control * (1 - p_control))
sd_experiment = np.sqrt(p_experiment * (1 - p_experiment))

# Calculate the pooled standard error
se_pooled = np.sqrt(
    sd_control**2 / control["Pageviews"].sum()
    + sd_experiment**2 / experiment["Pageviews"].sum()
)

# Calculate the margin of error
moe = se_pooled * stats.norm.ppf(1 - alpha / 2)

# Calculate the confidence interval
ci_low = 0 - moe
ci_high = 0 + moe

# Print the confidence interval
print(f"Confidence interval: [{ci_low:.4f}, {ci_high:.4f}]")

# Calculate the observed difference and compare it with the confidence interval
d_hat = p_experiment - p_control

if ci_low <= d_hat <= ci_high:
    print("Sanity check passed")
else:
    print("Sanity check failed")

Confidence interval: [-0.0013, 0.0013]
Sanity check passed


Now a z-test is performed as an alternative to the confidence interval check.

In [None]:
# Define the Clicks and Pageviews populations
n_views = [control["Pageviews"].sum(), experiment["Pageviews"].sum()]
n_clicks = [control["Clicks"].sum(), experiment["Clicks"].sum()]

# Perform the z-test with the null hypothesis difference  being 0
z_stat, p_value = proportions_ztest(
    n_clicks, n_views, value=0.0, alternative="two-sided", prop_var=False
)

print(f"Z-statistic:{z_stat}")
print(f"P-value:{p_value}")

if p_value > alpha:
    print("Sanity check is passed. Failed to reject the null hypothesis")
else:
    print("Sanity check is failed. Reject the null hypothesis")

Z-statistic:-0.08566094109242048
P-value:0.9317359524473912
Sanity check is passed. Failed to reject the null hypothesis


#### Evaluation Metrics Sanity Check

The check assumes that the probabilities in the experiment and control groups are the same, so the difference under the null hypothesis is 0.

$$H_0: CG_\text{treatment} = CG_\text{control}$$
$$H_1: CG_\text{treatment} \neq CG_\text{control}$$

$$H_0: CN_\text{treatment} = CN_\text{control}$$
$$H_1: CN_\text{treatment} \neq CN_\text{control}$$

The evaluation metric hypotheses are tested using two-proportion z-tests (so the same assumptions as outlined above apply). The respective confidence interval is now calculated around the observed difference between the conversion metrics.


In [None]:
# Create the results table
mi = pd.DataFrame(
    columns=[
        "Left Interval",
        "Right Interval",
        "d",
        "d min",
        "Practical relevance",
        "Statistical relevance",
    ],
    index=["CN", "CG"],
)
mi

Unnamed: 0,Left Interval,Right Interval,d,d min,Practical relevance,Statistical relevance
CN,,,,,,
CG,,,,,,


To draw further conclusions from the test outcomes, practical significance (or practical relevance) is considered in addition to statistical significance, to check whether an effect is large enough to matter to the business.

The difference between the control and experiment groups is the effect size, which is compared with the minimum detectable effect (`dmin`). A result is statistically significant when the confidence interval of the observed difference does not contain 0. It is practically significant when the entire confidence interval lies beyond `dmin` in the expected direction.
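
The decision rule can be sketched as a small function. This is a minimal illustration: the intervals are hypothetical, the name `classify` is made up, and the direction of the practical check is assumed to follow the sign of `d_min`.

```python
def classify(ci_low: float, ci_high: float, d_min: float) -> tuple[bool, bool]:
    """Return (statistically_significant, practically_significant)."""
    # Statistically significant: the interval excludes 0
    statistical = not (ci_low <= 0 <= ci_high)
    # Practically significant: the whole interval lies beyond d_min,
    # in the direction given by the sign of d_min
    if d_min >= 0:
        practical = ci_low > d_min
    else:
        practical = ci_high < d_min
    return statistical, practical


# Hypothetical intervals, chosen only to illustrate three cases
print(classify(-0.030, -0.012, d_min=-0.01))  # whole interval beyond d_min
print(classify(-0.012, 0.002, d_min=0.0075))  # interval contains 0
print(classify(0.001, 0.006, d_min=0.0075))   # excludes 0 but stays below d_min
```

When the whole interval lies beyond `d_min`, the observed difference at its centre does too, so no separate check on the point estimate is needed in this sketch.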

In [None]:
for i, j in zip(["Enrollments", "Payments"], ["CG", "CN"]):
    # Gross conversion (CG) is enrollments / clicks and net conversion (CN) is payments / clicks,
    # computed on the first 23 rows of the data collected by Udacity
    p_control = control.iloc[:23][i].sum() / control.iloc[:23]["Clicks"].sum()
    p_experiment = experiment.iloc[:23][i].sum() / experiment.iloc[:23]["Clicks"].sum()
    # Calculate the difference d
    mi.at[j, "d"] = p_experiment - p_control
    # Calculate the standard deviation for the control and experiment groups
    sd_control = np.sqrt(p_control * (1 - p_control))
    sd_experiment = np.sqrt(p_experiment * (1 - p_experiment))
    # Calculate the unpooled standard error
    se_pooled = np.sqrt(
        sd_control**2 / control.iloc[:23]["Clicks"].sum()
        + sd_experiment**2 / experiment.iloc[:23]["Clicks"].sum()
    )
    # Compute the 95% confidence interval around the difference
    mi.at[j, "Left Interval"] = mi.at[j, "d"] - (
        stats.norm.ppf(1 - alpha / 2) * se_pooled
    )
    mi.at[j, "Right Interval"] = mi.at[j, "d"] + (
        stats.norm.ppf(1 - alpha / 2) * se_pooled
    )
    # The result is statistically significant when the interval does not contain 0
    if mi.at[j, "Left Interval"] <= 0 <= mi.at[j, "Right Interval"]:
        mi.at[j, "Statistical relevance"] = "No"
    else:
        mi.at[j, "Statistical relevance"] = "Yes"

    mi.at[j, "d min"] = md.loc[j]["dmin"]

    # The result is practically significant when the whole interval lies beyond d min
    if mi.at[j, "d min"] >= 0:
        if (
            mi.at[j, "d"] > mi.at[j, "d min"]
            and mi.at[j, "Left Interval"] > mi.at[j, "d min"]
        ):
            mi.at[j, "Practical relevance"] = "Yes"
        else:
            mi.at[j, "Practical relevance"] = "No"
    else:
        if (
            mi.at[j, "d"] < mi.at[j, "d min"]
            and mi.at[j, "Right Interval"] < mi.at[j, "d min"]
        ):
            mi.at[j, "Practical relevance"] = "Yes"
        else:
            mi.at[j, "Practical relevance"] = "No"

# Print the table
mi

Unnamed: 0,Left Interval,Right Interval,d,d min,Practical relevance,Statistical relevance
CN,-0.0116,0.0019,-0.0049,0.0075,No,No
CG,-0.0291,-0.012,-0.0206,-0.01,Yes,Yes


## Conclusion

The latest check shows that Gross Conversion decreased by about 2.1 percentage points in the experiment group. The change is statistically significant, since the confidence interval does not contain 0, and practically significant, since the whole interval lies below the practical significance boundary of -0.01. The change in Net Conversion is neither statistically nor practically significant: its confidence interval contains 0 and extends into negative values, so a decrease in payments cannot be ruled out.

Given these results, and considering the resources required to run the experiment, launching the change is not recommended from a business perspective.