
This notebook continues from 04_causal_insights.ipynb

In [2]:
# Mounting Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
import numpy as np

In [4]:
# Path to your data
data_path = "/content/drive/MyDrive/data_science_projects/kkbox_project/data"
df = pd.read_csv(f"{data_path}/kkbox_merged_clean.csv")

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 992931 entries, 0 to 992930
Data columns (total 31 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   msno                      992931 non-null  object 
 1   is_churn                  992931 non-null  int64  
 2   city                      992931 non-null  float64
 3   gender                    992931 non-null  object 
 4   registered_via            992931 non-null  float64
 5   registration_init_time    992931 non-null  object 
 6   reg_year                  992931 non-null  int64  
 7   reg_month                 992931 non-null  int64  
 8   n_txns                    992931 non-null  int64  
 9   cancel_rate               992931 non-null  float64
 10  auto_renew_rate           992931 non-null  float64
 11  avg_plan_days             992931 non-null  float64
 12  std_plan_days             992931 non-null  float64
 13  share_30d                 992931 non-null  f

## A/B Testing for Retention — KKBox Case Study



This notebook demonstrates a **full A/B testing workflow** using the KKBox churn dataset.  
While the dataset itself does not contain a real retention intervention (e.g., coupons or offers),  
we simulate an experiment by randomly assigning users to **Treatment** and **Control** groups.  

The goal is to show how data scientists design and analyze experiments in practice, including:  

- **Randomization** (deterministic hashing by user_id)  
- **Sample Ratio Mismatch (SRM) check** to validate group sizes  
- **Covariate balance checks** on key pre-period features  
- **Outcome analysis** (renewal rate difference + confidence interval)  
- **Variance reduction with CUPED** (Controlled-experiment Using Pre-Experiment Data)  
- **Decision framework** (ship / don’t ship / inconclusive)  

### Why this matters
Experimentation is one of the most critical skills for product data scientists.  
It bridges statistical rigor with business impact — telling us **not just what will happen**,  
but whether a product change **causes** improvement in user behavior.  

### Note
Since no real coupon was applied in this dataset, the effect size here is near zero.  
The focus of this notebook is on demonstrating the **methodology and workflow**  
that can be directly applied to real experiments (e.g., retention offers, pricing tests, UX changes).


## Add Randomization

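We need to split users (msno) into Treatment (T) and Control (C) groups.
We use deterministic hashing so the split is consistent: for a given experiment ID, the same user always lands in the same group, and no assignment table has to be stored.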

In [6]:
import hashlib

def assign_variant(user_id, exp_id="kkbox_coupon_v1", p_treat=0.5):
    h = hashlib.md5(f"{exp_id}:{user_id}".encode()).hexdigest()
    bucket = int(h[:8], 16) / 16**8
    return "T" if bucket < p_treat else "C"

df["variant"] = df["msno"].astype(str).apply(assign_variant)
df["variant"].value_counts(normalize=True)

variant    proportion
T          0.500343
C          0.499657


Each user (msno) has been deterministically assigned to T or C, and the observed split is very close to the intended 50/50.

We now have the foundation of a valid A/B test setup.
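As a quick sanity check on the hashing scheme, the sketch below reuses the same bucketing rule on made-up user IDs. It checks that repeated runs give identical assignments and that a different experiment ID reshuffles users roughly independently. The IDs and the second experiment name (`hypothetical_exp_v2`) are assumptions for illustration only, not part of the KKBox data.

```python
import hashlib
from collections import Counter

def assign_variant(user_id, exp_id="kkbox_coupon_v1", p_treat=0.5):
    # Same rule as the notebook: hash "<exp_id>:<user_id>" into a bucket in [0, 1)
    h = hashlib.md5(f"{exp_id}:{user_id}".encode()).hexdigest()
    bucket = int(h[:8], 16) / 16**8
    return "T" if bucket < p_treat else "C"

# Hypothetical user IDs, for illustration only
user_ids = [f"user_{i}" for i in range(20_000)]

# Determinism: two passes over the same IDs must agree exactly
first = [assign_variant(u) for u in user_ids]
second = [assign_variant(u) for u in user_ids]
is_stable = first == second

# A different (hypothetical) experiment ID should reshuffle users, so
# agreement with the first experiment should sit near 50%
other = [assign_variant(u, exp_id="hypothetical_exp_v2") for u in user_ids]
agreement = sum(a == b for a, b in zip(first, other)) / len(user_ids)

share_t = Counter(first)["T"] / len(user_ids)
print(f"stable={is_stable}, share in T={share_t:.3f}, agreement across experiments={agreement:.3f}")
```

Salting the hash with the experiment ID is what keeps concurrent experiments from sharing the same treatment group.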

## SRM (Sample Ratio Mismatch) Test

Now let’s formally check that the group proportions match expectation (companies do this before analyzing any outcome).

In [7]:
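from scipy.stats import chisquare

n_t = (df.variant=="T").sum()
n_c = (df.variant=="C").sum()
total = n_t + n_c
expected = [0.5*total, 0.5*total]

# Goodness-of-fit test: observed group counts vs. the counts expected under a 50/50 split.
# (chi2_contingency would treat the expected counts as a second observed sample,
# which is a test of independence rather than an SRM check.)
chi2, p = chisquare([n_t, n_c], f_exp=expected)
print(f"SRM test p-value = {p:.3g}")

SRM test p-value = 0.494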


The p-value is well above 0.05, so there is no evidence of a sample ratio mismatch: the observed split is consistent with the intended 50/50 allocation.
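The same check can be wrapped in a small helper that also handles unequal allocations (for example 70/30). The sketch below is a standard-library version for the two-arm case only. The function name `srm_p_value` and the counts are hypothetical. It relies on the fact that, for one degree of freedom, the chi-square tail probability equals `erfc(sqrt(chi2 / 2))`.

```python
import math

def srm_p_value(n_t, n_c, p_treat=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = n_t + n_c
    exp_t, exp_c = p_treat * total, (1 - p_treat) * total
    chi2 = (n_t - exp_t) ** 2 / exp_t + (n_c - exp_c) ** 2 / exp_c
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts, for illustration only
p_balanced = srm_p_value(5_000, 5_000)               # exact 50/50 split
p_skewed = srm_p_value(5_300, 4_700)                 # 53/47 split when 50/50 was intended
p_unequal = srm_p_value(7_000, 3_000, p_treat=0.7)   # 70/30 split, as intended

print(f"balanced: {p_balanced:.3f}, skewed: {p_skewed:.2e}, intended 70/30: {p_unequal:.3f}")
```

Teams often use a stricter threshold for SRM than for outcome tests (p < 0.001 is one convention, stated here as an assumption rather than a rule), because an SRM alarm usually means the whole experiment needs debugging.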

## Check covariate balance

### Step 1: Pick covariates to test

These are good predictors of churn (pre-period features):

- n_txns – number of transactions

- auto_renew_rate – tendency to auto-renew

- avg_plan_days – average subscription length

- share_30d – music sharing behavior

- avg_discount_rate – history of discounts

- avg_unique_songs_per_day – engagement

### Step 2: Run statistical tests

Numeric → Welch’s t-test

Categorical (like city) → Chi-square test

In [10]:
from scipy.stats import ttest_ind, chi2_contingency

# Define covariates
numeric_covs = [
    "n_txns", "auto_renew_rate", "avg_plan_days",
    "share_30d", "avg_discount_rate", "avg_unique_songs_per_day",
    "last_gap_days", "cancel_rate", "total_listens", "active_days",
    "total_amount_paid", "avg_list_price", "avg_amount_paid", "completion_ratio"
]

categorical_covs = ["city", "gender", "reg_year", "reg_month", "registered_via"]

# Numeric covariate balance checks (Welch’s t-test)
for cov in numeric_covs:
    if cov in df.columns:
        t, p = ttest_ind(
            df.loc[df.variant=="T", cov],
            df.loc[df.variant=="C", cov],
            equal_var=False, nan_policy="omit"
        )
        print(f"{cov:25s} | p={p:.3f}")

# Categorical covariate balance checks (Chi-square test)
for cov in categorical_covs:
    if cov in df.columns:
        tab = pd.crosstab(df[cov], df["variant"])
        if tab.shape[0] > 1:  # skip degenerate
            chi2, p, *_ = chi2_contingency(tab.values)
            print(f"{cov + ' (categorical)':25s} | p={p:.3f}")


n_txns                    | p=0.231
auto_renew_rate           | p=0.386
avg_plan_days             | p=0.660
share_30d                 | p=0.273
avg_discount_rate         | p=0.778
avg_unique_songs_per_day  | p=0.839
last_gap_days             | p=0.648
cancel_rate               | p=0.085
total_listens             | p=0.762
active_days               | p=0.355
total_amount_paid         | p=0.315
avg_list_price            | p=0.823
avg_amount_paid           | p=0.694
completion_ratio          | p=0.987
city (categorical)        | p=0.663
gender (categorical)      | p=0.407
reg_year (categorical)    | p=0.361
reg_month (categorical)   | p=0.464
registered_via (categorical) | p=0.269


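All p-values exceed 0.05, so none of the 19 covariates shows a statistically significant imbalance between groups. The smallest (cancel_rate, p=0.085) is unremarkable given that 19 tests were run.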
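With close to a million users, p-values can flag differences too small to matter, so balance is often also summarized with the standardized mean difference (SMD), which does not grow with sample size. The sketch below is a minimal illustration on synthetic data. The helper name `standardized_mean_diff` and the |SMD| < 0.1 rule of thumb are assumptions, not part of the original analysis.

```python
import numpy as np

def standardized_mean_diff(x_t, x_c):
    """Difference in means divided by the pooled standard deviation (NaNs ignored)."""
    x_t = np.asarray(x_t, dtype=float)
    x_c = np.asarray(x_c, dtype=float)
    pooled_sd = np.sqrt((np.nanvar(x_t, ddof=1) + np.nanvar(x_c, ddof=1)) / 2)
    return (np.nanmean(x_t) - np.nanmean(x_c)) / pooled_sd

rng = np.random.default_rng(0)

# Synthetic covariate values, for illustration only
treat = rng.normal(10, 3, 50_000)
control_balanced = rng.normal(10, 3, 50_000)   # same distribution as treatment
control_shifted = rng.normal(11, 3, 50_000)    # mean shifted by one third of an SD

smd_balanced = standardized_mean_diff(treat, control_balanced)
smd_shifted = standardized_mean_diff(treat, control_shifted)

print(f"SMD (balanced): {smd_balanced:+.3f}")
print(f"SMD (shifted):  {smd_shifted:+.3f}")
```

In this notebook the helper could be applied per covariate, for example `standardized_mean_diff(df.loc[df.variant=="T", cov], df.loc[df.variant=="C", cov])`.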

## Outcome Analysis: Difference in Means

We now compare the primary outcome (renewal rate) between Treatment and Control.  
This gives us the estimated lift and its 95% confidence interval.

In [11]:
import numpy as np
from scipy.stats import norm

# Define outcome (renewed = 1 - churn)
df["renewed"] = 1 - df["is_churn"]

# Split treatment and control
y_t = df.loc[df.variant=="T", "renewed"]
y_c = df.loc[df.variant=="C", "renewed"]

# Renewal rates
p_t, p_c = y_t.mean(), y_c.mean()
diff = p_t - p_c

# Standard error of difference in proportions
se = np.sqrt(p_t*(1-p_t)/len(y_t) + p_c*(1-p_c)/len(y_c))

# 95% confidence interval
z = norm.ppf(0.975)   # z-score for 95%
ci = (diff - z*se, diff + z*se)

print("=== Renewal Rate A/B Test (Raw) ===")
print(f"Treatment rate (T): {p_t:.2%}   [n={len(y_t):,}]")
print(f"Control rate (C):   {p_c:.2%}   [n={len(y_c):,}]")
print(f"Difference (Lift):  {diff:.2%}")
print(f"95% CI: ({ci[0]:.2%}, {ci[1]:.2%})")

# Optional decision rule
if ci[0] > 0:
    print("Decision: 🚀 Ship (significant positive lift)")
elif ci[1] < 0:
    print("Decision: ❌ Do not ship (significant negative impact)")
else:
    print("Decision: 🤔 Inconclusive (CI crosses 0)")


=== Renewal Rate A/B Test (Raw) ===
Treatment rate (T): 93.61%   [n=496,806]
Control rate (C):   93.60%   [n=496,125]
Difference (Lift):  0.01%
95% CI: (-0.08%, 0.11%)
Decision: 🤔 Inconclusive (CI crosses 0)


## CUPED (Controlled-experiment Using Pre-Experiment Data) Variance Reduction

Although the estimated effect here is near zero, we still apply CUPED to demonstrate the technique.

In A/B tests, randomization makes groups comparable in expectation, but there’s always noise in outcomes (e.g., some users churn more just because they were lighter listeners before).

That noise means larger variance, which means wider confidence intervals, which in turn means more users or longer experiments are needed to detect the same effect.

CUPED reduces variance by using pre-period information to “de-noise” the outcome.
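To make the link between variance and sample size concrete, the sketch below uses the usual normal-approximation formula for a two-sided test, n per arm = 2·σ²·(z₁₋α/₂ + z_power)² / δ². The baseline of 93.6% is the control renewal rate reported above. The 0.5pp target lift and the 10% variance reduction are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

def users_per_arm(sigma2, delta, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute difference `delta`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * sigma2 * (z_alpha + z_power) ** 2 / delta ** 2

baseline = 0.936                    # control renewal rate reported in this notebook
sigma2 = baseline * (1 - baseline)  # variance of a binary outcome
delta = 0.005                       # hypothetical target lift: 0.5 percentage points
reduction = 0.10                    # hypothetical variance reduction from a covariate adjustment

n_raw = users_per_arm(sigma2, delta)
n_adjusted = users_per_arm(sigma2 * (1 - reduction), delta)

print(f"Users per arm, raw outcome:      {n_raw:,.0f}")
print(f"Users per arm, adjusted outcome: {n_adjusted:,.0f}")
```

Because the required sample size is proportional to the outcome variance, any fractional variance reduction cuts the required users per arm by the same fraction.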

In [12]:
import numpy as np
from scipy.stats import norm

# Outcome and covariate
y = df["renewed"].astype(float)
x = df["last_gap_days"].astype(float)

# --- 1) Compute theta
theta = np.cov(y, x, ddof=1)[0,1] / np.var(x, ddof=1)
print(f"Theta (correction factor) = {theta:.4f}")

# --- 2) Adjusted outcome
df["renewed_cuped"] = y - theta * (x - x.mean())

# --- 3) Variance reduction
var_before = y.var(ddof=1)
var_after = df["renewed_cuped"].var(ddof=1)
print(f"Variance before: {var_before:.6f}")
print(f"Variance after : {var_after:.6f}")
print(f"Variance reduction: {(1 - var_after/var_before)*100:.1f}%")

# --- 4) A/B test with CUPED-adjusted outcome
y_t = df.loc[df.variant=="T", "renewed_cuped"]
y_c = df.loc[df.variant=="C", "renewed_cuped"]

diff = y_t.mean() - y_c.mean()
se = np.sqrt(y_t.var(ddof=1)/len(y_t) + y_c.var(ddof=1)/len(y_c))
z = norm.ppf(0.975)
ci = (diff - z*se, diff + z*se)

print("\n=== Renewal Rate A/B Test (CUPED) ===")
print(f"Adjusted mean (T): {y_t.mean():.4f}")
print(f"Adjusted mean (C): {y_c.mean():.4f}")
print(f"Difference (Lift): {diff:.4f}")
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")


Theta (correction factor) = -0.0017
Variance before: 0.059837
Variance after : 0.055161
Variance reduction: 7.8%

=== Renewal Rate A/B Test (CUPED) ===
Adjusted mean (T): 0.9361
Adjusted mean (C): 0.9360
Difference (Lift): 0.0001
95% CI: (-0.0009, 0.0010)


CUPED reduced the outcome variance by 7.8%, but the 95% confidence interval still includes zero, so the result remains inconclusive.
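To show what the same adjustment looks like when a real effect exists, here is a minimal sketch on synthetic data. All data, the `true_effect` of 0.05, and the 0.6 coefficient linking covariate and outcome are assumptions for illustration. It also checks a known property of this estimator: the fractional variance reduction equals the squared correlation between the outcome and the covariate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Synthetic experiment, for illustration only
x = rng.normal(0, 1, n)                      # pre-period covariate
treat = rng.integers(0, 2, n).astype(bool)   # random 50/50 assignment
true_effect = 0.05                           # hypothetical treatment effect
y = 0.6 * x + rng.normal(0, 1, n) + true_effect * treat

# CUPED adjustment, same formula as in the notebook cell above
theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

def diff_and_se(outcome):
    """Difference in means (T minus C) and its standard error."""
    t, c = outcome[treat], outcome[~treat]
    diff = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / t.size + c.var(ddof=1) / c.size)
    return diff, se

diff_raw, se_raw = diff_and_se(y)
diff_cuped, se_cuped = diff_and_se(y_cuped)

rho = np.corrcoef(y, x)[0, 1]
var_reduction = 1 - y_cuped.var(ddof=1) / y.var(ddof=1)

print(f"Raw:   diff={diff_raw:.4f}, SE={se_raw:.4f}")
print(f"CUPED: diff={diff_cuped:.4f}, SE={se_cuped:.4f}")
print(f"Variance reduction={var_reduction:.3f}, rho^2={rho**2:.3f}")
```

This suggests why the reduction in the KKBox run is modest: a 7.8% reduction corresponds to a fairly weak correlation between `last_gap_days` and renewal, so a more predictive pre-period covariate would be expected to help more.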

This dataset does not contain an actual coupon intervention, so the treatment/control split is simulated. As expected, the estimated effect is ~0. However, the project demonstrates the full A/B testing workflow (randomization, SRM, balance checks, CUPED variance reduction) that I would apply to a real retention experiment.

In real-world scenarios where offers are given, the same pipeline would capture true differences in churn outcomes. For example, companies often observe 1–3pp lift in renewal when targeted discounts are offered.