# AB Testing

1. Problem Statement

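Background: The goal is to reduce the number of frustrated students who leave the free trial because they don't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. The experiment applies to students who click "Start free trial". If a student indicates 5 or more hours, they go on to enter their card information; if they indicate fewer than 5 hours, they are advised to choose the free "access course materials" option instead. Hypothesis: this change may reduce the number of students who get frustrated and quit during the free trial because of a lack of time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. Setting clearer expectations up front means that students who would otherwise start the free trial and then drop out for lack of time are spared that frustration. The aim is for Udacity to improve the overall student experience.
Note: If a student chooses the free trial, their user-id cannot be enrolled twice. If they choose "access course materials", their user-id is not tracked.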

Step 1: Metrics Design

For success metrics, I will choose "Gross Conversion" and "Retention". For the guardrail metric, I will choose "Net Conversion".

Success Metrics:

Short-term engagement:
    Gross Conversion: the number of user-ids that click "Start free trial" and complete checkout, divided by the number of unique cookies that click the "Start free trial" button. Completing checkout means finishing registration, including entering personal and credit card information; at this stage the user has not paid anything and has only set up an account and started the free trial. I chose this metric because it measures what share of the unique cookies that click "Start free trial" complete the enrollment process. It is a short-term engagement metric because it reflects the behavioral change users make immediately after the change is introduced: whether a user starts a free trial is behavior over a relatively short time. The main purpose of the experiment is to reduce enrollments from students who lack the time and are unlikely to complete the course, so a significant drop in gross conversion would indicate that these students are being filtered out effectively.
    
Long-term engagement:
    Retention: the number of user-ids that remain enrolled past the 14-day boundary (and thus make at least one payment), divided by the number of user-ids that complete checkout. I chose this metric because it measures the percentage of users who stay in the course after the free trial ends. It is a long-term engagement metric, since it focuses on user behavior over a longer period (more than 14 days) and on whether users are willing to keep paying for the course. I want to raise retention among the students who are more likely to complete the course; if retention increases, the change has helped keep those students. For reference, "number of user-ids to complete checkout" counts users who clicked "Start free trial" and finished registration (personal and credit card information) without having paid yet, while "number of user-ids to remain enrolled past the 14-day boundary" counts users who also decided to continue after the 14-day free trial and made at least one payment, which indicates that they found the service worth paying for.

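Guardrail Metrics:
    Net Conversion: the number of user-ids that remain enrolled past the 14-day free trial and make at least one payment, divided by the number of unique cookies that click the "Start free trial" button. Although the experiment aims to screen out students who are unlikely to complete the course, we do not want it to cause a significant drop in the total number of paying users. A stable or higher net conversion indicates that the experiment does not hurt Udacity's overall revenue. Compared with gross conversion, it covers a longer time span, because it measures whether users stay in the course after the free trial ends. It can therefore also be viewed as a long-term engagement metric, since it reflects user behavior after the entire free trial.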
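
The three definitions above reduce to simple ratios of counts. A minimal sketch, assuming hypothetical totals (the function name `funnel_metrics` and the counts are illustrative and not taken from the experiment data):

```python
def funnel_metrics(clicks, enrollments, payments):
    """Return (gross conversion, retention, net conversion) from raw counts."""
    gross_conversion = enrollments / clicks   # enrolled / clicked "Start free trial"
    retention = payments / enrollments        # paid after the trial / enrolled
    net_conversion = payments / clicks        # paid after the trial / clicked
    return gross_conversion, retention, net_conversion


# Hypothetical counts, for illustration only
gross, retention, net = funnel_metrics(clicks=1000, enrollments=200, payments=100)
print(gross, retention, net)
```

Net conversion equals gross conversion multiplied by retention, which is why it works as a guardrail: it moves whenever either of the two success metrics moves.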

Step 2: Sample Size

baseline for Gross Conversion rate is: 0.20625; baseline for Retention rate is: 0.53; baseline for Net Conversion rate is: 0.1093125

In [None]:
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Method 1: direct formula (rule of thumb n = 16 * sigma^2 / delta^2 for alpha = 0.05, power = 0.8)
def calculate_sample_size_direct(p1, delta):
    sigma = np.sqrt(p1 * (1 - p1))
    n = 16 * sigma**2 / delta**2
    return n

# Method 2: statsmodels' samplesize_proportions_2indep_onetail
def calculate_sample_size_statsmodels(diff, prop2, power, alpha):
    return sm.stats.samplesize_proportions_2indep_onetail(diff=diff, prop2=prop2, power=power, alpha=alpha)

# Method 3: effect size combined with tt_ind_solve_power
def calculate_sample_size_effect_size(p1, p2, power, alpha):
    es = sm.stats.proportion_effectsize(p1, p2)
    return sm.stats.tt_ind_solve_power(effect_size=es, power=power, alpha=alpha)



# Baseline rates and minimum detectable effects
metrics = {
    "Gross Conversion": {"p1": 0.20625, "dmin": 0.01},
    "Retention": {"p1": 0.53, "dmin": 0.01},
    "Net Conversion": {"p1": 0.109313, "dmin": 0.0075}
}

# Compute the per-group sample size for each metric
for metric, params in metrics.items():
    p1 = params["p1"]
    p2 = p1 + params["dmin"]
    n_direct = calculate_sample_size_direct(p1, params["dmin"])
    # Note: statsmodels treats prop2 as the reference proportion (prop1 = prop2 + diff),
    # so passing p2 here compares p1 + dmin against p1 + 2 * dmin rather than against p1.
    n_statsmodels = calculate_sample_size_statsmodels(params["dmin"], p2, 0.8, 0.05)
    n_effect_size = calculate_sample_size_effect_size(p1, p2, 0.8, 0.05)
    print(f"{metric} - Sample size per group (Direct): {n_direct}")
    print(f"{metric} - Sample size per group (Statsmodels): {n_statsmodels}")
    print(f"{metric} - Sample size per group (Effect Size): {n_effect_size}")

Gross Conversion - Sample size per group (Direct): 26193.749999999993
Gross Conversion - Sample size per group (Statsmodels): 27045.815320351427
Gross Conversion - Sample size per group (Effect Size): 26153.77078818886
Retention - Sample size per group (Direct): 39856.0
Retention - Sample size per group (Statsmodels): 38925.34008491193
Retention - Sample size per group (Effect Size): 39051.638929532724
Net Conversion - Sample size per group (Direct): 27694.55446215111
Net Conversion - Sample size per group (Statsmodels): 29588.068707186994
Net Conversion - Sample size per group (Effect Size): 27978.923132804237


These are close to the results from Evan Miller's calculator, which gives: Sample size for Gross Conversion: 25835; Sample size for Retention: 39115; Sample size for Net Conversion: 27413
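
For comparison, a common closed-form sample-size formula for a two-sided, two-proportion test can be written with the standard library alone. This is a sketch under the assumption that the online calculator uses a formula of this kind; its exact handling of each case may differ, so small discrepancies are expected. The function name `sample_size_two_proportions` is illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, dmin, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided test of p1 versus p1 + dmin."""
    p2 = p1 + dmin
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value under H0
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    sd_null = sqrt(2 * p1 * (1 - p1))              # both groups at the baseline rate
    sd_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))   # groups at p1 and p1 + dmin
    return ceil((z_alpha * sd_null + z_beta * sd_alt) ** 2 / dmin ** 2)


# Baseline and minimum detectable effect for Gross Conversion, from above
n_gross = sample_size_two_proportions(0.20625, 0.01)
print(n_gross)
```

For Gross Conversion this should land close to the calculator's figure; a smaller `dmin` or a higher `power` increases the required sample.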

In [2]:
# Required sample sizes per group
sample_size_gross_conversion = 25835
sample_size_retention = 39115
sample_size_net_conversion = 27413

# Baseline click-through and enrollment rates
click_through_rate = 0.08  # share of unique cookies viewing the page that click "Start free trial"
enrollment_rate = 0.20625  # share of clicks that actually enroll (used for Gross Conversion and Retention)

# Total pageviews needed for Gross Conversion and Net Conversion (sample sizes are in clicks)
total_pageviews_gross_conversion = (sample_size_gross_conversion / click_through_rate) * 2
total_pageviews_net_conversion = (sample_size_net_conversion / click_through_rate) * 2

# Total pageviews needed for Retention (sample size is in enrollments)
# First, convert the required enrollments into the clicks needed to produce them
clicks_for_enrollment = sample_size_retention / enrollment_rate
# Then, convert those clicks into pageviews
total_pageviews_retention = clicks_for_enrollment / click_through_rate * 2  # times 2 because there are two groups

# Print the results
print(f"Total pageviews for Gross Conversion: {total_pageviews_gross_conversion}")
print(f"Total pageviews for Retention: {total_pageviews_retention}")
print(f"Total pageviews for Net Conversion: {total_pageviews_net_conversion}")

# Use the largest value as the total pageviews for the experiment
max_total_pageviews = max(total_pageviews_gross_conversion, total_pageviews_retention, total_pageviews_net_conversion)
print(f"Maximum total pageviews required for the experiment: {max_total_pageviews}")

Total pageviews for Gross Conversion: 645875.0
Total pageviews for Retention: 4741212.121212121
Total pageviews for Net Conversion: 685325.0
Maximum total pageviews required for the experiment: 4741212.121212121


Step 3: Length of the Experiment

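Experiment duration = total pageviews / pageviews per day

Gross Conversion = 645875 / 40000 = 16.15 days, rounded up to 17 days

Retention = 4741212.121212121 / 40000 = 118.5 days, rounded up to 119 days

Net Conversion = 685325 / 40000 = 17.13 days, rounded up to 18 days

The experiment should run for at least 17 days (18 days to cover Net Conversion as well), and ideally 119 days so that Retention is covered too. A 119-day experiment is not realistic, so Retention will most likely be dropped.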
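
The duration arithmetic above can be sketched as a small helper. The pageview totals and the 40000 daily pageviews come from the figures above; the `traffic_fraction` parameter is a hypothetical addition for the case where only part of the traffic is diverted to the experiment.

```python
from math import ceil


def days_needed(total_pageviews, daily_pageviews=40000, traffic_fraction=1.0):
    """Whole days needed to collect total_pageviews at the given daily traffic."""
    return ceil(total_pageviews / (daily_pageviews * traffic_fraction))


required_pageviews = {
    "Gross Conversion": 645875.0,
    "Retention": 4741212.121212121,
    "Net Conversion": 685325.0,
}
durations = {metric: days_needed(pv) for metric, pv in required_pageviews.items()}
print(durations)
```

Halving `traffic_fraction` doubles the required duration, which makes the Retention requirement even less practical.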

Step 4: Testing Result Analysis

First: run a sanity check, using "Pageviews" and "Clicks" as the invariant metrics.

In [3]:
import pandas as pd
control = pd.read_csv('Final Project Results - Control.csv')
experiment = pd.read_csv('Final Project Results - Experiment.csv')

# Compute totals
total_pageviews_control = control['Pageviews'].sum()
total_pageviews_experiment = experiment['Pageviews'].sum()
total_clicks_control = control['Clicks'].sum()
total_clicks_experiment = experiment['Clicks'].sum()

# Compute each group's share of the total
pageviews_proportion_control = total_pageviews_control / (total_pageviews_control + total_pageviews_experiment)
pageviews_proportion_experiment = total_pageviews_experiment / (total_pageviews_control + total_pageviews_experiment)
clicks_proportion_control = total_clicks_control / (total_clicks_control + total_clicks_experiment)
clicks_proportion_experiment = total_clicks_experiment / (total_clicks_control + total_clicks_experiment)

# The split should be roughly even (50% / 50%); stats.chisquare assumes equal expected frequencies by default
expected_proportion = 0.5

# Run chi-square goodness-of-fit tests
chi2_pageviews, p_pageviews = stats.chisquare([total_pageviews_control, total_pageviews_experiment])
chi2_clicks, p_clicks = stats.chisquare([total_clicks_control, total_clicks_experiment])

# Print the results
print(f"Pageviews - Control Proportion: {pageviews_proportion_control}, Experiment Proportion: {pageviews_proportion_experiment}, Chi2 Test P-value: {p_pageviews}")
print(f"Clicks - Control Proportion: {clicks_proportion_control}, Experiment Proportion: {clicks_proportion_experiment}, Chi2 Test P-value: {p_clicks}")

# Decide whether the sanity check passes
if p_pageviews > 0.05 and p_clicks > 0.05:
    print("Sanity Check Passed")
else:
    print("Sanity Check Failed")

Pageviews - Control Proportion: 0.5006396668806133, Experiment Proportion: 0.4993603331193866, Chi2 Test P-value: 0.2878496417065941
Clicks - Control Proportion: 0.5004673474066628, Experiment Proportion: 0.49953265259333723, Chi2 Test P-value: 0.8238677039815409
Sanity Check Passed
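
An equivalent way to run this sanity check is to build a confidence interval around the expected 50% share and see whether the observed control share falls inside it. The sketch below uses a normal approximation and hypothetical counts; `sanity_check_split` is an illustrative name, not part of the notebook.

```python
from math import sqrt
from statistics import NormalDist


def sanity_check_split(control_count, experiment_count, expected=0.5, alpha=0.05):
    """Check whether the control share is consistent with the expected split."""
    total = control_count + experiment_count
    se = sqrt(expected * (1 - expected) / total)   # standard error under the expected split
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = expected - z * se, expected + z * se
    observed = control_count / total
    return lower, upper, observed, lower <= observed <= upper


# Hypothetical counts, for illustration only
lower, upper, observed, passed = sanity_check_split(50_200, 49_800)
print(f"CI: [{lower:.4f}, {upper:.4f}], observed: {observed:.4f}, passed: {passed}")
```

With two groups this check and the chi-square test answer the same question, so they should agree on whether the split looks balanced.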


Second: statistical testing.

H0: There is no difference in Gross Conversion, Retention, or Net Conversion between the control group and the experiment group.

H1: There is a difference in Gross Conversion, Retention, or Net Conversion between the control group and the experiment group.

In [4]:
control['Gross Conversion'] = control['Enrollments'] / control['Clicks'] #Gross conversion
control['Retention'] = control['Payments'] / control['Enrollments']  #Retention
control['Net Conversion'] = control['Payments'] / control['Clicks']  #Net conversion

experiment['Gross Conversion'] = experiment['Enrollments'] / experiment['Clicks']
experiment['Retention'] = experiment['Payments'] / experiment['Enrollments']
experiment['Net Conversion'] = experiment['Payments'] / experiment['Clicks']

# Run Welch's t-test on the daily values and compute confidence intervals
metrics = ['Gross Conversion', 'Retention', 'Net Conversion']
for metric in metrics:
    control_data = control[metric].dropna()
    experiment_data = experiment[metric].dropna()
    t_stat, p_value = stats.ttest_ind(experiment_data, control_data, equal_var=False)

    # Degrees of freedom (Welch-Satterthwaite approximation)
    s1_sq = control_data.var()
    s2_sq = experiment_data.var()
    n1 = len(control_data)
    n2 = len(experiment_data)
    df = ((s1_sq/n1 + s2_sq/n2)**2) / ((s1_sq/n1)**2/(n1-1) + (s2_sq/n2)**2/(n2-1))

    # 95% confidence interval for the difference in means (unpooled standard error)
    se_diff = np.sqrt(s1_sq/n1 + s2_sq/n2)
    margin_of_error = stats.t.ppf(0.975, df) * se_diff
    diff_means = experiment_data.mean() - control_data.mean()
    ci_lower = diff_means - margin_of_error
    ci_upper = diff_means + margin_of_error

    print(f"{metric} - t-statistic: {t_stat}, df: {df}, p-value: {p_value}, 95% CI: [{ci_lower}, {ci_upper}]")

Gross Conversion - t-statistic: -1.5396752696188791, df: 43.757773534481174, p-value: 0.1308405255155708, 95% CI: [-0.047994944936353395, 0.0064257808778214985]
Retention - t-statistic: 1.0081408912731535, df: 40.50474292438236, p-value: 0.3193716316049348, 95% CI: [-0.03347511018416882, 0.10016012511693817]
Net Conversion - t-statistic: -0.5387777625331603, df: 43.64795890688602, p-value: 0.5927776808531843, 95% CI: [-0.02321835453731117, 0.013424640557692442]
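
The t-test above treats each day's ratio as one observation. An alternative that is often used for conversion metrics is a two-proportion z-test on the pooled totals. The sketch below shows that technique with hypothetical counts; `two_proportion_ztest` is an illustrative helper, not a library function, and it is not the method used to produce the results above.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x_control, n_control, x_experiment, n_experiment, alpha=0.05):
    """Two-sided z-test for a difference in proportions, using the pooled standard error."""
    p_control = x_control / n_control
    p_experiment = x_experiment / n_experiment
    p_pool = (x_control + x_experiment) / (n_control + n_experiment)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_experiment))
    diff = p_experiment - p_control
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_pool
    return diff, z, p_value, (diff - margin, diff + margin)


# Hypothetical totals: enrollments out of clicks in each group
diff, z, p_value, ci = two_proportion_ztest(2000, 10000, 1800, 10000)
print(f"diff: {diff:.4f}, z: {z:.3f}, p-value: {p_value:.4f}, 95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

Because it uses every click rather than one value per day, the pooled test usually gives a narrower interval than the daily t-test, at the cost of assuming that clicks are independent.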


Since all the p-values are greater than 0.05 and every confidence interval includes 0, we fail to reject H0.
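
Beyond statistical significance, each interval can also be compared with the minimum detectable effect chosen in the sample-size step. A rough sketch, using the rounded confidence intervals reported above and the `dmin` values from Step 2 (the `assess` helper and its decision rule are illustrative assumptions):

```python
def assess(ci_lower, ci_upper, dmin):
    """Return (statistically significant, practically significant) for one interval."""
    statistically_significant = ci_lower > 0 or ci_upper < 0
    # Practically significant only if the whole interval lies beyond +/- dmin
    practically_significant = ci_lower >= dmin or ci_upper <= -dmin
    return statistically_significant, practically_significant


# (CI lower, CI upper, dmin); intervals rounded from the output above
results = {
    "Gross Conversion": (-0.0480, 0.0064, 0.01),
    "Retention": (-0.0335, 0.1002, 0.01),
    "Net Conversion": (-0.0232, 0.0134, 0.0075),
}
verdicts = {metric: assess(*values) for metric, values in results.items()}
for metric, verdict in verdicts.items():
    print(metric, verdict)
```

None of the three metrics clears either bar. The Net Conversion interval also extends below -0.0075, so a practically meaningful drop in the guardrail metric cannot be ruled out from this data.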

Step 5: Make a Recommendation

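The statistical tests show no significant difference between the control group and the experiment group on any of these metrics. This means the new change did not have a significant effect, so my recommendation is not to launch it. That said, if the change is cheap to implement, it may still be worth trying even if the effect is small; a change often shows no significant effect in the short term but can still bring benefits in the long run. I would still discuss it with my manager, since it is ultimately their call: I would lay out the pros and cons and leave the decision to them.

If my manager insists on going ahead with the new change, I would suggest the following:

1. Increase the sample size: the current sample is far too small, and a larger sample would make the experiment more accurate.

2. Extend the experiment: running longer collects more data. Retention was one of the metrics we chose, and it needs more than a hundred days, whereas the current experiment ran for only a dozen or so days.

If my manager still wants to increase the number of users, wants to design a new experiment, and asks for my suggestions, I would propose switching to a more appealing UI or adding some more engaging features.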