**Table of contents**
1. [Checking SRM (Sample Ratio Mismatch)](#checking-srm-sample-ratio-mismatch)
2. [Checking the distribution of Data and quality of data](#checking-the-distribution-of-data-and-quality-of-data)
3. [Choosing Metrics](#choosing-metrics)
4. [Sample Size and Power Analysis](#sample-size-and-power-analysis)
5. [Calculating Lift and considering practical significance](#calculating-lift-and-considering-practical-significance)
6. [Chi Test](#chi-test)
7. [Report](#report)




In [1]:
import pandas as pd 
import numpy as np
import scipy
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns 

In [2]:
!pip install skimpy 


In [3]:
df = pd.read_csv('/kaggle/input/ab-test-click/ab_test_results_aggregated_views_clicks_5.csv')

In [4]:
df.head()

Unnamed: 0,user_id,group,views,clicks
0,1,control,12.0,1.0
1,2,control,3.0,0.0
2,3,control,12.0,1.0
3,4,control,2.0,1.0
4,5,control,4.0,0.0


# Checking SRM (Sample Ratio Mismatch)

Before any statistical analysis, it is good practice to validate the data and check its quality.

In an A/B test, traffic is diverted to the variants in a fixed ratio, say 50/50. The analysis assumes that this split actually happened. If a technical bug or error skews the real allocation, the analysis is biased and the conclusions can be wrong.

Hence we check for SRM (Sample Ratio Mismatch). If you set a 50/50 split and see 60% of users in one variant and 40% in the other, it is a sign that something is broken. SRM means the random assignment failed, and your results may be biased from the start.

In [5]:
group_counts = df.groupby('group')['user_id'].nunique()
group_counts

group
control    65000
test       65000
Name: user_id, dtype: int64

### Even though the control and test counts are identical and there is no mismatch, we will still run an SRM test (a chi-square goodness-of-fit test) to show how it is done

Case scenario: in the real world, the data is rarely split this exactly and precisely. There will be variation due to randomization, so we might see control with 54800 samples and treatment with 65100.

In these situations we use a statistical test (the chi-square test) to decide whether the difference is within an acceptable range. We compare the observed counts against the counts expected under the intended split.

A common threshold for the p-value is 0.01. The null hypothesis is that the observed split matches the intended ratio; reject it if the p-value is less than 0.01.


In [6]:
actual_control = group_counts.get('control',0)
actual_treatment = group_counts.get('test',0)

total_users = actual_control + actual_treatment

expected_control = total_users * 0.5
expected_treatment = total_users * 0.5

In [7]:
observed = [actual_control,actual_treatment]
expected = [expected_control, expected_treatment]

chi2, p_value = stats.chisquare(observed,expected)

if p_value < 0.01:  # Conservative threshold
    print(f"SRM detected! p-value: {p_value}")
    print("Do not proceed - investigate randomization issues")
else:
    print("It has correct split ratio between control and treatment")

It has correct split ratio between control and treatment
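To see how the check behaves on the imbalanced scenario described above, here is a minimal standard-library sketch of the same goodness-of-fit test. The function name `srm_check` and its parameters are illustrative, not part of the notebook; the p-value uses the closed form for a chi-square distribution with one degree of freedom, so it only applies to a two-group split.

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5, threshold=0.01):
    """Chi-square goodness-of-fit test for a two-group split (1 degree of freedom)."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    observed = [n_control, n_treatment]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold

print(srm_check(65000, 65000))   # the split in this dataset
print(srm_check(54800, 65100))   # the imbalanced scenario described above
```

The third element of the returned tuple is `True` when the split should be investigated before any further analysis.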


# Checking the distribution of Data and quality of data

Instead of using .describe() and .info(), we can use the skimpy library, which gives a detailed summary including null counts, the distribution of each column, and the column types.

In [8]:
from skimpy import skim

In [9]:
skim(df)

# Choosing Metrics

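Choosing the right metric is half the work in an A/B test. One important consideration is the unit of analysis versus the unit of diversion. When they differ (for example, users are randomized but the metric is computed per view), the observations are no longer independent, and the analytically computed variance tends to understate the true variability, which can overstate significance.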

Given our dataset, we have two possible metrics to evaluate:

1. Click-Through Rate (CTR)
2. Conversion Rate

If we choose CTR, we would need to apply a t-test. However, CTR is a broad metric and typically shows high variability. A click-through probability might provide a more precise measure, but given our context, it may not be ideal.

To gain deeper insight from the data, we can construct a contingency table by creating a binary column that indicates whether a user clicked or not. For this analysis, we will treat a click as a conversion event. Therefore, we will proceed with the second option and use the conversion rate metric, applying a Chi-square test to assess statistical significance.

We will still compute CTR below for reference.

In [10]:
group_ctr = (
    df.groupby('group', as_index=False)
    .agg({'clicks':'sum', 'views': 'sum'})
)
group_ctr

Unnamed: 0,group,clicks,views
0,control,17787.0,323710.0
1,test,19281.0,327379.0


In [11]:
group_ctr['ctr'] = group_ctr['clicks']/group_ctr['views']
group_ctr

Unnamed: 0,group,clicks,views,ctr
0,control,17787.0,323710.0,0.054947
1,test,19281.0,327379.0,0.058895
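The difference between the two candidate metrics can be sketched on a toy sample. The rows below are the `(views, clicks)` pairs of the five users shown by `df.head()` earlier; the variable names are illustrative. CTR pools clicks over views, while the conversion rate counts each user once.

```python
# (views, clicks) for the five users shown by df.head() above
rows = [(12, 1), (3, 0), (12, 1), (2, 1), (4, 0)]

total_views = sum(v for v, _ in rows)
total_clicks = sum(c for _, c in rows)

# Click-through rate: clicks per view, pooled over all views
ctr = total_clicks / total_views

# Conversion rate (click-through probability): share of users with at least one click
conversion_rate = sum(1 for _, c in rows if c > 0) / len(rows)

print(f"CTR: {ctr:.4f}")
print(f"Conversion rate: {conversion_rate:.4f}")
```

Because the conversion rate is computed per user, its unit of analysis matches the unit of diversion, which is one reason to prefer it here.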


In [12]:
df['ctr'] = df['clicks'] / df['views']

**Checking homoscedasticity**

This step can often be skipped: in most real-world data the group variances are not equal, so Welch's t-test is usually preferred over the standard (Student's) t-test by default.

In [13]:
from scipy.stats import levene

control_ctr = df[df['group'] == 'control']['ctr']
treatment_ctr = df[df['group'] == 'test']['ctr']

stat, p_value = levene(control_ctr, treatment_ctr)

print(f"Levene's test statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Variances are NOT equal (heteroscedastic)")
    print("Use Welch's t-test instead of standard t-test")
else:
    print("Variances are approximately equal (homoscedastic)")
    print("Standard t-test is appropriate")


Levene's test statistic: 24.1467
P-value: 0.0000
Variances are NOT equal (heteroscedastic)
Use Welch's t-test instead of standard t-test
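The notebook recommends Welch's t-test but does not run it. As a rough illustration of what the test computes, the sketch below derives the Welch statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. The function name `welch_t` and the toy samples are hypothetical; in practice `scipy.stats.ttest_ind(control_ctr, treatment_ctr, equal_var=False)` performs the same test and also returns a p-value.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Variance of each group mean (sample variance uses n - 1)
    v_a = variance(sample_a) / n_a
    v_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(v_a + v_b)
    dof = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t_stat, dof

# Hypothetical toy samples, not taken from the dataset
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t = {t_stat:.4f}, df = {dof:.3f}")
```

Unlike the standard t-test, the denominator keeps the two group variances separate instead of pooling them, which is what makes the test robust to heteroscedasticity.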


# Sample Size and Power Analysis

Before we run an experiment and collect data, we have to calculate how much data we need to draw a valid conclusion.

This step should come first, but since this project was not run on live data, we instead check whether we have enough data to detect an effect at the desired significance level and power.

Suppose the company considers a 10% lift (improvement) practical and desirable. We set the effect size accordingly and use the statsmodels library to calculate the required sample size. Note that zt_ind_solve_power interprets effect_size as a standardized effect size (a difference in standard-deviation units), so passing 0.10 is only a rough stand-in for a 10% relative lift.

About the parameters:
1. Effect size (the improvement you want to be able to detect)
2. Alpha (the probability of making a Type I error, i.e. a false positive. α = 0.05 means you accept a 5% risk of false positives)
3. Power (not to be confused with practical significance; it is the probability of correctly detecting a true effect when it exists. Power = 1 - β, where β is the probability of a Type II error (false negative))


In [14]:
from statsmodels.stats.power import zt_ind_solve_power

# Parameters
effect_size = 0.10  # Standardized effect size (Cohen's d/h scale), used as a stand-in for a 10% lift
alpha = 0.05  # Significance level
power = 0.90  # Desired statistical power

# Calculate required sample size
required_n = zt_ind_solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1.0,  # Equal group sizes
    alternative='two-sided'
)

print(f"Required sample size per group: {required_n:.0f}")
print(f"Actual sample size: {len(df[df['group']=='control'])}")


Required sample size per group: 2101
Actual sample size: 65000


We see that we have more than enough data to run our statistical test.
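One caveat: `zt_ind_solve_power` treats `effect_size` as a standardized effect size, not a relative lift. For proportions, a common standardized measure is Cohen's h. The sketch below converts a 10% relative lift into Cohen's h and applies the normal-approximation sample size formula using only the standard library. The baseline of 0.2153 is an assumption taken from the control rate observed later in this notebook, and the helper names are illustrative.

```python
import math
from statistics import NormalDist

def cohens_h(p1, p2):
    """Standardized effect size for two proportions (arcsine transform)."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))

def n_per_group(effect_size, alpha=0.05, power=0.90):
    """Normal-approximation sample size per group for a two-sided, two-sample z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

baseline = 0.2153            # assumed baseline, taken from the observed control rate
target = baseline * 1.10     # a 10% relative lift
h = cohens_h(baseline, target)

print(f"Cohen's h for a 10% relative lift: {h:.4f}")
print(f"Required sample size per group: {n_per_group(h):.0f}")
print(f"For comparison, effect_size=0.10: {n_per_group(0.10):.0f}")
```

Under these assumptions the requirement rises to several thousand users per group, which is still well below the 65,000 available, so the conclusion above is unchanged.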

# Calculating Lift and considering practical significance

Here we calculate the conversion rate and lift 

In [15]:
df['converted'] = (df['clicks'] > 0).astype(int)


We create a contingency table, which essentially answers how many users converted and how many did not in each of the two variants.

In [16]:
contingency_table = pd.crosstab(df['group'], df['converted'])
print(contingency_table)

converted      0      1
group                  
control    51006  13994
test       50020  14980


We calculate conversion rate for both control and treatment variant. 

**Conversion rate = people who converted/ total customers who visited**

In [17]:
control_converted = contingency_table.loc['control',1]
treatment_converted = contingency_table.loc['test', 1]

control_total = contingency_table.loc['control'].sum()
treatment_total = contingency_table.loc['test'].sum()

control_rate = control_converted / control_total
treatment_rate = treatment_converted / treatment_total

print(f'control rate {control_rate:.4%}')
print(f'treatment rate {treatment_rate:.4%}')

control rate 21.5292%
treatment rate 23.0462%


We calculate lift first to determine whether a change is meaningful enough to consider shipping. A statistical test only tells us whether the observed difference is likely due to chance or not. However, even if a result is statistically significant, we still need to evaluate its practical significance — whether the magnitude of the effect meets business expectations.

By calculating lift, we can assess the real-world impact of the change and decide if it’s worth implementing.

In [18]:
# Calculate lift
absolute_lift = treatment_rate - control_rate
relative_lift = absolute_lift / control_rate

print(f'absolute lift {absolute_lift:.4%}')
print(f'relative lift {relative_lift:.4%}')

absolute lift 1.5169%
relative lift 7.0459%


The lift we observe is specific to the current sample. What we really want to know is the underlying conversion rate each variant would produce across all users, which our sample only estimates.

To quantify this, we construct a confidence interval, which gives a range of plausible values for the true conversion rate. For example, we might say, “We are 95% confident that the true conversion rate of the treatment lies between 22.7% and 23.3%,” meaning that intervals built this way capture the true rate in about 95% of repeated experiments.

This interval captures the uncertainty in our estimate. A narrower confidence interval indicates greater precision, while a wider one suggests more uncertainty. It lets us make informed decisions with an understanding of how far the true rate may be from the observed one.
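For reference, the Wilson score interval used in the next cell can be written out directly. This is a minimal standard-library sketch, with `wilson_interval` as an illustrative name, applied to the counts from the contingency table above.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, total, alpha=0.05):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = successes / total
    denom = 1 + z ** 2 / total
    center = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2)
    ) / denom
    return center - half_width, center + half_width

# Counts from the contingency table above
for name, converted in [("control", 13994), ("test", 14980)]:
    low, high = wilson_interval(converted, 65000)
    print(f"{name}: [{low:.4%}, {high:.4%}]")
```

With samples this large the Wilson interval is nearly symmetric around the observed rate; its advantage over the simpler Wald interval shows up mainly for small samples or rates near 0 or 1.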

In [19]:
from statsmodels.stats.proportion import proportion_confint

# 95% confidence intervals for each group
control_ci_low, control_ci_high = proportion_confint(
    control_converted, 
    control_total, 
    alpha=0.05, 
    method='wilson'
)

treatment_ci_low, treatment_ci_high = proportion_confint(
    treatment_converted, 
    treatment_total, 
    alpha=0.05, 
    method='wilson'
)

print(f"\nControl 95% CI: [{control_ci_low:.4%}, {control_ci_high:.4%}]")
print(f"Treatment 95% CI: [{treatment_ci_low:.4%}, {treatment_ci_high:.4%}]")

# Conservative approximation: differencing the per-group bounds gives an interval wider than a true 95% CI for the lift
lift_ci_low = treatment_ci_low - control_ci_high
lift_ci_high = treatment_ci_high - control_ci_low

relative_lift_ci_low = lift_ci_low / control_rate
relative_lift_ci_high = lift_ci_high / control_rate

print(f"\nRelative lift 95% CI: [{relative_lift_ci_low:.2%}, {relative_lift_ci_high:.2%}]")



Control 95% CI: [21.2149%, 21.8469%]
Treatment 95% CI: [22.7240%, 23.3715%]

Relative lift 95% CI: [4.07%, 10.02%]
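Subtracting the per-group bounds, as done above, yields an interval that is wider than a 95% interval for the lift needs to be. A common alternative is a Wald interval built from the standard error of the difference in proportions. The sketch below is one way to do that with the standard library; `diff_confint` is an illustrative name, and the counts come from the contingency table.

```python
import math
from statistics import NormalDist

def diff_confint(x_control, n_control, x_treatment, n_treatment, alpha=0.05):
    """Wald confidence interval for (treatment rate - control rate)."""
    p_c = x_control / n_control
    p_t = x_treatment / n_treatment
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

low, high = diff_confint(13994, 65000, 14980, 65000)
control_rate = 13994 / 65000
print(f"Absolute lift 95% CI: [{low:.4%}, {high:.4%}]")
# Dividing by the point estimate ignores uncertainty in the control rate itself
print(f"Approx. relative lift 95% CI: [{low / control_rate:.2%}, {high / control_rate:.2%}]")
```

On these counts the resulting interval is narrower than the one printed above. The relative version is still an approximation, since it treats the control rate as fixed.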


# Chi Test

In [20]:
from scipy.stats import chi2_contingency

chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")

Chi-square statistic: 43.0898
P-value: 0.0000
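As a cross-check, a chi-square test on a 2x2 table is closely related to a pooled two-proportion z-test: without a continuity correction, the chi-square statistic equals z squared. The standard-library sketch below (with `two_proportion_z` as an illustrative name) runs that z-test on the same counts. Since `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, the squared z comes out slightly larger than the statistic printed above.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns z and the two-sided p-value."""
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (x2 / n2 - x1 / n1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

z, p = two_proportion_z(13994, 65000, 14980, 65000)
print(f"z = {z:.4f}, z^2 = {z ** 2:.4f}, p-value = {p:.2e}")
```

Printing the p-value in scientific notation also makes clear that the `0.0000` above is a rounding artifact: the p-value is very small, not exactly zero.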


# Report

After completing all analyses, we compile a concise report that summarizes our findings and supports the decision on whether to deploy the new feature.

For this A/B experiment:

1. The results show **statistical significance**, indicating that the observed difference between control and variant is unlikely due to random chance.

2. The business has set a **practical significance threshold** of a 10% uplift as the minimum acceptable improvement to justify implementation. We use this benchmark as our decision criterion.

3. The **observed lift** from the test is **7%**, which, although statistically significant, falls short of the expected business threshold. Therefore, we decide not to deploy the feature at this stage.

In [21]:
minimum_worthwhile_lift = 0.10 

# Print results
print("=" * 60)
print("A/B TEST EVALUATION REPORT")
print("=" * 60)

print("\n1. STATISTICAL SIGNIFICANCE")
print(f"   Chi-square statistic: {chi2:.4f}")
print(f"   P-value: {p_value:.6f}")
print(f"   Degrees of freedom: {dof}")

if p_value < 0.05:
    print("   ✅ Result: STATISTICALLY SIGNIFICANT (p < 0.05)")
else:
    print("   ❌ Result: NOT statistically significant (p ≥ 0.05)")

print("\n2. CONVERSION RATES")
print(f"   Control: {control_rate:.4%} (95% CI: [{control_ci_low:.4%}, {control_ci_high:.4%}])")
print(f"   Treatment: {treatment_rate:.4%} (95% CI: [{treatment_ci_low:.4%}, {treatment_ci_high:.4%}])")

print("\n3. LIFT ANALYSIS")
print(f"   Absolute lift: {absolute_lift:+.4%} ({absolute_lift*100:.2f} percentage points)")
print(f"   Relative lift: {relative_lift:+.2%}")

print("\n4. PRACTICAL SIGNIFICANCE")
print(f"   Business threshold: {minimum_worthwhile_lift:.0%}")

if relative_lift >= minimum_worthwhile_lift:
    print(f"   ✅ Result: PRACTICALLY SIGNIFICANT")
    print(f"      Lift ({relative_lift:.2%}) exceeds threshold ({minimum_worthwhile_lift:.0%})")
else:
    print(f"   ⚠️ Result: NOT PRACTICALLY SIGNIFICANT")
    print(f"      Lift ({relative_lift:.2%}) below threshold ({minimum_worthwhile_lift:.0%})")

print("\n5. RECOMMENDATION")
if p_value < 0.05 and relative_lift >= minimum_worthwhile_lift:
    print("   🚀 SHIP IT: Both statistically and practically significant")
elif p_value < 0.05 and relative_lift < minimum_worthwhile_lift:
    print("   🛑 DON'T SHIP: Statistically significant but lift too small")
else:
    print("   🛑 DON'T SHIP: Not statistically significant")

print("=" * 60)

result = {
    'p_value': p_value,
    'control_rate': control_rate,
    'treatment_rate': treatment_rate,
    'absolute_lift': absolute_lift,
    'relative_lift': relative_lift,
    'is_statistically_significant': p_value < 0.05,
    'is_practically_significant': relative_lift >= minimum_worthwhile_lift
}


A/B TEST EVALUATION REPORT

1. STATISTICAL SIGNIFICANCE
   Chi-square statistic: 43.0898
   P-value: 0.000000
   Degrees of freedom: 1
   ✅ Result: STATISTICALLY SIGNIFICANT (p < 0.05)

2. CONVERSION RATES
   Control: 21.5292% (95% CI: [21.2149%, 21.8469%])
   Treatment: 23.0462% (95% CI: [22.7240%, 23.3715%])

3. LIFT ANALYSIS
   Absolute lift: +1.5169% (1.52 percentage points)
   Relative lift: +7.05%

4. PRACTICAL SIGNIFICANCE
   Business threshold: 10%
   ⚠️ Result: NOT PRACTICALLY SIGNIFICANT
      Lift (7.05%) below threshold (10%)

5. RECOMMENDATION
   🛑 DON'T SHIP: Statistically significant but lift too small
