# A/B Test Analysis: Search Ranking Algorithm Rollout


## Business Goal
Decide whether to launch the new search ranking algorithm to all traffic on an online booking platform.

## Launch Gates
1. Sample size is adequate for the chosen MDE and alpha.
2. Data quality and randomization checks pass.
3. Primary metric improves with statistical and practical significance.
4. Guardrail metric is non-inferior (no meaningful degradation).


#### `sessions_data.csv`

| column | data type | description | 
|--------|-----------|-------------|
| `session_id` | `string` | Unique session identifier (unique for each row) |
| `user_id` | `string` | Unique user identifier (missing for non-logged-in users; each user can have multiple sessions) |
| `session_start_timestamp` | `string` | When the session started |
| `booking_timestamp` | `string` | When a booking was made (missing if no booking was made during the session) |
| `time_to_booking` | `float` | Time from session start to booking, in minutes (missing if no booking was made during the session) |

<br>

#### `users_data.csv`

| column | data type | description | 
|--------|-----------|-------------|
| `user_id` | `string` | Unique user identifier (only logged-in users in this table) |
| `experiment_group` | `string` | Experiment arm, `control` or `variant` (designed as a 50/50 split) |

<br>

## Metric Definitions

### Primary metric: user conversion rate
- Unit of analysis: user
- Population: randomized users with at least one session
- Definition: a user is `converted = 1` if they have **at least one** session with a non-missing `booking_timestamp`.

### Guardrail metric: median time to first booking
- Unit of analysis: user
- Population: converted users only
- Definition: for each converted user, take the **earliest booking event** (`booking_timestamp`) across their sessions, then use that session's `time_to_booking`.
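
The two definitions above can be illustrated on a handful of made-up sessions. This is a dependency-free sketch, not the notebook's pandas pipeline: the tuples in `toy_sessions` and the helper `summarize_users` are hypothetical names introduced only for illustration.

```python
from datetime import datetime

# Hypothetical toy sessions: (user_id, booking_timestamp or None, time_to_booking or None)
toy_sessions = [
    ("u1", None, None),
    ("u1", datetime(2025, 1, 5, 10, 0), 12.0),
    ("u1", datetime(2025, 1, 3, 9, 0), 20.0),   # earliest booking for u1
    ("u2", None, None),                          # never booked
    ("u3", datetime(2025, 1, 7, 8, 30), 7.5),
]


def summarize_users(sessions):
    """Return {user_id: (converted, time_to_first_booking)} from session tuples."""
    first_booking = {}  # user_id -> (booking_timestamp, time_to_booking)
    users = set()
    for user_id, booked_at, minutes in sessions:
        users.add(user_id)
        if booked_at is None:
            continue  # session without a booking does not affect the guardrail
        best = first_booking.get(user_id)
        if best is None or booked_at < best[0]:
            first_booking[user_id] = (booked_at, minutes)
    return {
        u: (1, first_booking[u][1]) if u in first_booking else (0, None)
        for u in sorted(users)
    }


user_summary = summarize_users(toy_sessions)
print(user_summary)
```

Note that `u1` contributes the `time_to_booking` of the session containing the earliest booking (20.0), not the smallest value across sessions.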

## Design Choices

- Statistical threshold: alpha = 0.05 (95% confidence)
- Power target: 80%
- Primary Minimum Detectable Effect (MDE): +15% relative lift
- Guardrail non-inferiority tolerance: median time to first booking may worsen by at most +5% relative to the control median

# 1. Load Data and Build Session-Level Conversion

Session conversion is `1` when `booking_timestamp` exists.

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from scipy.stats import chisquare, norm
from pingouin import ttest, mwu
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest


In [2]:
sessions_path = Path("sessions_data.csv")
users_path = Path("users_data.csv")

sessions = pd.read_csv(
    sessions_path,
    parse_dates=["session_start_timestamp", "booking_timestamp"],
    keep_default_na=True,
)
users = pd.read_csv(users_path, keep_default_na=True)
sessions["conversion"] = sessions["booking_timestamp"].notna().astype(int)
sessions


| | session_id | user_id | session_start_timestamp | booking_timestamp | time_to_booking | conversion |
|---|------------|---------|-------------------------|-------------------|-----------------|------------|
| 0 | CP0lbAGnb5UNi3Ut | TcCIMrtQ75wHGXVj | 2025-01-26 20:02:39.177358627 | NaT | NaN | 0 |
| 1 | UQAjrPYair63L1p8 | TcCIMrtQ75wHGXVj | 2025-01-20 16:12:51.536912203 | NaT | NaN | 0 |
| 2 | 9zQrAPxV5oi2SzSa | TcCIMrtQ75wHGXVj | 2025-01-28 03:46:40.839362144 | NaT | NaN | 0 |
| 3 | kkrz1M5vxrQ8wXRZ | GUGVzto9KGqeX3dc | 2025-01-25 02:48:50.953303099 | NaT | NaN | 0 |
| 4 | AKDXZWWFYKViHC27 | NaN | 2025-01-28 00:30:49.979124308 | NaT | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 16976 | r97QgNbPezCuHDIe | ieZHD9fUWFNBGvDY | 2025-01-03 11:03:07.077358723 | NaT | NaN | 0 |
| 16977 | h6Bi84uYTCkRPbbK | ieZHD9fUWFNBGvDY | 2025-01-09 23:20:27.544024706 | NaT | NaN | 0 |
| 16978 | dgHigXEDcRv212yK | 0Dt3nGFV3sdFziMU | 2025-01-13 01:50:17.851139069 | NaT | NaN | 0 |
| 16979 | 4qH5gmksGUoSSfjw | 0Dt3nGFV3sdFziMU | 2025-01-10 05:39:53.701869011 | NaT | NaN | 0 |


## 1.1 Build Triggered Intention-To-Treat Population (User-Randomized)

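The analysis population includes randomized users who triggered at least one session. The inner join on `user_id` drops sessions whose `user_id` does not appear in `users_data.csv` (including non-logged-in sessions with a missing `user_id`) as well as randomized users with no sessions.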


In [3]:
sessions_enrolled = sessions.merge(
    users,
    on="user_id",
    how="inner"
)

randomized_users = users["user_id"].nunique()
triggered_users = sessions_enrolled["user_id"].nunique()
trigger_rate = triggered_users / randomized_users

window_start = sessions_enrolled["session_start_timestamp"].min()
window_end = sessions_enrolled["session_start_timestamp"].max()

print(f"Randomized users: {randomized_users:,}")
print(f"Total analysis sessions: {len(sessions_enrolled):,}")
print(f"Triggered users: {triggered_users:,}")
print(f"Trigger rate: {trigger_rate:.2%}")
print(f"Analysis window: {window_start} to {window_end}")

sessions_enrolled


Randomized users: 10,000
Total analysis sessions: 15,283
Triggered users: 9,454
Trigger rate: 94.54%
Analysis window: 2025-01-01 00:00:41.086753607 to 2025-01-30 23:56:58.465101004


| | session_id | user_id | session_start_timestamp | booking_timestamp | time_to_booking | conversion | experiment_group |
|---|------------|---------|-------------------------|-------------------|-----------------|------------|------------------|
| 0 | CP0lbAGnb5UNi3Ut | TcCIMrtQ75wHGXVj | 2025-01-26 20:02:39.177358627 | NaT | NaN | 0 | variant |
| 1 | UQAjrPYair63L1p8 | TcCIMrtQ75wHGXVj | 2025-01-20 16:12:51.536912203 | NaT | NaN | 0 | variant |
| 2 | 9zQrAPxV5oi2SzSa | TcCIMrtQ75wHGXVj | 2025-01-28 03:46:40.839362144 | NaT | NaN | 0 | variant |
| 3 | kkrz1M5vxrQ8wXRZ | GUGVzto9KGqeX3dc | 2025-01-25 02:48:50.953303099 | NaT | NaN | 0 | variant |
| 4 | ABZZFrwItZAPdYGP | v2EBIHmOdQfalI6k | 2025-01-11 11:41:36.912253618 | NaT | NaN | 0 | variant |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 15278 | r97QgNbPezCuHDIe | ieZHD9fUWFNBGvDY | 2025-01-03 11:03:07.077358723 | NaT | NaN | 0 | control |
| 15279 | h6Bi84uYTCkRPbbK | ieZHD9fUWFNBGvDY | 2025-01-09 23:20:27.544024706 | NaT | NaN | 0 | control |
| 15280 | dgHigXEDcRv212yK | 0Dt3nGFV3sdFziMU | 2025-01-13 01:50:17.851139069 | NaT | NaN | 0 | control |
| 15281 | 4qH5gmksGUoSSfjw | 0Dt3nGFV3sdFziMU | 2025-01-10 05:39:53.701869011 | NaT | NaN | 0 | control |


## 1.2 Build User-Level Table

This aligns the analysis unit with user-level randomization.


In [4]:
# Aggregate to user level (one row per user)
user_level = (
    sessions_enrolled.groupby(["user_id", "experiment_group"], as_index=False)
    .agg(converted=("conversion", "max"), sessions=("session_id", "count"))
)

user_level

| | user_id | experiment_group | converted | sessions |
|---|---------|------------------|-----------|----------|
| 0 | 003Xd4ieQOrdlkSI | control | 0 | 1 |
| 1 | 00PopsslpYfFwj5Q | control | 1 | 2 |
| 2 | 00WTrvAywA3W13ys | variant | 0 | 1 |
| 3 | 00tKaNLDPYrdRfhy | variant | 0 | 2 |
| 4 | 0185eIbzcihrvyAn | control | 1 | 1 |
| ... | ... | ... | ... | ... |
| 9449 | zySQtvWt9RToy2Np | control | 0 | 3 |
| 9450 | zz20c5wFcIvmg4Xl | variant | 0 | 1 |
| 9451 | zz4ssd1NzN77IJ27 | variant | 0 | 3 |
| 9452 | zzDM6bUkRI1lRKES | variant | 0 | 1 |


In [5]:
# For converted users, find their time_to_booking for the earliest booking event
booked_sessions = sessions_enrolled[sessions_enrolled["conversion"] == 1].copy()

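# Pick the earliest booking event per user.
# drop_duplicates keeps the whole earliest row; groupby(...).first() would instead take the
# first non-null value per column, which could pull time_to_booking from a later session.
booked_sessions = booked_sessions.sort_values(["user_id", "booking_timestamp"])
user_first_booking = booked_sessions.drop_duplicates(subset="user_id", keep="first")[
    ["user_id", "experiment_group", "time_to_booking"]
]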

# Merge booking time back to user level; users who didn't convert will have NaN for time_to_booking
user_level = user_level.merge(user_first_booking, on=["user_id", "experiment_group"], how="left")
user_level


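| | user_id | experiment_group | converted | sessions | time_to_booking |
|---|---------|------------------|-----------|----------|-----------------|
| 0 | 003Xd4ieQOrdlkSI | control | 0 | 1 | NaN |
| 1 | 00PopsslpYfFwj5Q | control | 1 | 2 | 17.201313 |
| 2 | 00WTrvAywA3W13ys | variant | 0 | 1 | NaN |
| 3 | 00tKaNLDPYrdRfhy | variant | 0 | 2 | NaN |
| 4 | 0185eIbzcihrvyAn | control | 1 | 1 | 11.456044 |
| ... | ... | ... | ... | ... | ... |
| 9449 | zySQtvWt9RToy2Np | control | 0 | 3 | NaN |
| 9450 | zz20c5wFcIvmg4Xl | variant | 0 | 1 | NaN |
| 9451 | zz4ssd1NzN77IJ27 | variant | 0 | 3 | NaN |
| 9452 | zzDM6bUkRI1lRKES | variant | 0 | 1 | NaN |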


# 2. Statistical Validation and Inference

This section evaluates launch readiness through power analysis, data quality sanity checks, and metric inference.


In [6]:
confidence_level = 0.95
alpha = 1 - confidence_level
power_target = 0.80

# +15% relative lift
mde_primary_rel = 0.15

# allow at most +5% worsening in median time_to_booking, relative to the control median
# (despite the `_abs` suffix this is a relative tolerance; it is converted to minutes in section 2.4)
noninf_guardrail_abs = 0.05

n_bootstraps = 20000

print(f"Alpha: {alpha:.2f} (confidence level: {confidence_level:.0%})")
print(f"Power target: {power_target:.0%}, so the chance of missing a true effect of at least the MDE is at most {1 - power_target:.0%}")
print(f"Primary MDE (relative lift): {mde_primary_rel:.1%}")
print(f"Guardrail non-inferiority tolerance (relative to control median): +{noninf_guardrail_abs:.1%}")

Alpha: 0.05 (confidence level: 95%)
Power target: 80%, so the chance of missing a true effect of at least the MDE is at most 20%
Primary MDE (relative lift): 15.0%
Guardrail non-inferiority tolerance (relative to control median): +5.0%


## 2.1 Power Analysis (Alpha 0.05, Power 0.8, MDE 15%)

In [7]:
# Calculate required sample size per group for primary MDE
p_control = user_level.loc[user_level.experiment_group.eq("control"), "converted"].mean()
p_variant = p_control * (1 + mde_primary_rel)

standardized_effect = abs(proportion_effectsize(p_control, p_variant))  # Absolute Cohen's h

n_per_group = NormalIndPower().solve_power(
    effect_size=standardized_effect,
    alpha=alpha,
    power=power_target,
    ratio=1.0,  # equal group sizes
    alternative="two-sided"
)

print(f"Baseline conversion rate (p_control) = {p_control:.2%}")
print(f"Variant conversion rate at MDE (p_control * (1 + MDE)) = {p_variant:.2%}")
print(f"Standardized effect size (Absolute Cohen's h) at MDE = {standardized_effect:.4f}")
print(f"Required sample size per group for 80% power = {int(np.ceil(n_per_group))}")
print(f"\nTotal required sample size = {2 * int(np.ceil(n_per_group))}")


Baseline conversion rate (p_control) = 23.99%
Variant conversion rate at MDE (p_control * (1 + MDE)) = 27.59%
Standardized effect size (Absolute Cohen's h) at MDE = 0.0823
Required sample size per group for 80% power = 2318

Total required sample size = 4636
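
As a cross-check on the required sample size, the same quantity can be approximated in closed form. This is a minimal sketch using only the standard library, assuming the usual normal-approximation formula for two equal-sized groups, n = 2((z_{1-alpha/2} + z_{power}) / h)^2, and ignoring the negligible opposite-tail term. The helper names `cohens_h` and `n_per_group_two_sided` are illustrative, not part of statsmodels.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))


def n_per_group_two_sided(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    h = abs(cohens_h(p_base, p_base * (1 + rel_lift)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_power) / h) ** 2


# Baseline taken from the control counts reported in section 2.3 (1129 of 4706)
n_check = n_per_group_two_sided(1129 / 4706, 0.15)
print(ceil(n_check))
```

The result should land very close to the per-group figure printed above; small differences can come from rounding and from the tail term that the closed form drops.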


In [8]:
# Calculate observed sample sizes
obs_n_control = len(user_level[user_level["experiment_group"] == "control"])
obs_n_variant = len(user_level[user_level["experiment_group"] == "variant"])
print(f"Observed users in control group: {obs_n_control:,}")
print(f"Observed users in variant group: {obs_n_variant:,}")

if obs_n_control >= n_per_group and obs_n_variant >= n_per_group:
    print("\nBoth groups meet the required sample size for the target MDE. The experiment is sufficiently powered to detect the target MDE.")
else:
    print("\nAt least one group is below the required sample size for the target MDE. The experiment may not be sufficiently powered to detect the target MDE.")

Observed users in control group: 4,706
Observed users in variant group: 4,748

Both groups meet the required sample size for the target MDE. The experiment is sufficiently powered to detect the target MDE.
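
Because both groups are roughly twice the required size, it can be useful to ask the reverse question: what relative lift is detectable with 80% power at the observed sample size? The sketch below inverts the same normal-approximation formula, under the assumption of equal group sizes (the smaller group, 4,706, is used). `detectable_relative_lift` is a hypothetical helper name.

```python
from math import asin, sin, sqrt
from statistics import NormalDist


def detectable_relative_lift(p_base, n_per_group, alpha=0.05, power=0.80):
    """Smallest relative lift detectable at the given power (normal approximation)."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    h_min = z_sum * sqrt(2 / n_per_group)            # minimum detectable Cohen's h
    p_detectable = sin(asin(sqrt(p_base)) + h_min / 2) ** 2  # invert the arcsine transform
    return p_detectable / p_base - 1


lift_at_observed_n = detectable_relative_lift(1129 / 4706, 4706)
print(f"{lift_at_observed_n:.1%}")
```

Under these assumptions the detectable lift comes out somewhat below the 15% design MDE, which is consistent with the experiment being able to flag a smaller effect as statistically significant.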


## 2.2 Experiment Sanity Checks


### 2.2.1 Sample Ratio Mismatch (SRM) Check

Purpose: verify assignment counts match the expected 50/50 split.

Test method: chi-square goodness-of-fit test (two-sided) on assignment counts.

Why this method: assignment is a categorical 50/50 split by design, and this test checks whether observed counts deviate more than expected from random variation.

Hypothesis:
- H0: assignment_control = assignment_variant
- H1: assignment_control != assignment_variant


In [9]:
# Check Sample Ratio Mismatch (SRM) - are the assignment counts roughly equal between groups?
assign_counts = users["experiment_group"].value_counts().reindex(["control", "variant"])
obs = assign_counts.values
srm_pval = chisquare(f_obs=obs)[1]

print(f"Assignment users by group: {assign_counts.to_dict()}")
print(f"\nSRM chi-square goodness-of-fit p-value: {srm_pval:.4f}")
if srm_pval > alpha:
    print(f"p-value > {alpha:.2f}: Failed to reject the null hypothesis that the assignment counts are equal.\n\nThere is no strong evidence of sample ratio mismatch. The allocated traffic matches the experiment design.")
elif srm_pval < alpha:
    print(f"p-value < {alpha:.2f}: Rejected the null hypothesis that the assignment counts are equal.\n\nThere is strong evidence of sample ratio mismatch. This could be a sign of a bug in traffic allocation.")
else:
    print(f"p-value == {alpha:.2f}: The p-value is exactly equal to the alpha threshold. This is a borderline case for sample ratio mismatch.")

Assignment users by group: {'control': 4991, 'variant': 5009}

SRM chi-square goodness-of-fit p-value: 0.8572
p-value > 0.05: Failed to reject the null hypothesis that the assignment counts are equal.

There is no strong evidence of sample ratio mismatch. The allocated traffic matches the experiment design.
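
The SRM p-value can be verified by hand. A minimal sketch, assuming a two-group 50/50 design: the chi-square statistic has one degree of freedom, so its tail probability reduces to `erfc(sqrt(chi2 / 2))`. The helper `srm_pvalue_two_groups` is an illustrative name.

```python
from math import erfc, sqrt


def srm_pvalue_two_groups(n_a, n_b):
    """Chi-square goodness-of-fit against an equal split (1 degree of freedom)."""
    expected = (n_a + n_b) / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    return chi2, erfc(sqrt(chi2 / 2))


# Assignment counts reported above
chi2_stat, srm_p = srm_pvalue_two_groups(4991, 5009)
print(f"chi2 = {chi2_stat:.4f}, p = {srm_p:.4f}")
```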


### 2.2.2 Trigger Rate Balance Check

Purpose: ensure both groups have comparable trigger rates into the analysis population.

Test method: two-proportion z-test (two-sided) on trigger rates.

Why this method: trigger is a binary outcome (triggered vs. not triggered) measured on independent users, so a proportion test directly evaluates between-group balance.

Hypothesis:
- H0: trigger_rate_control = trigger_rate_variant
- H1: trigger_rate_control != trigger_rate_variant


In [10]:
# Check trigger rate difference between groups - is the trigger rate significantly different between control and variant?

# triggered users
triggered_counts = (
    sessions_enrolled[["user_id", "experiment_group"]]
    .drop_duplicates()["experiment_group"]
    .value_counts()
    .reindex(["control", "variant"])
)

# two-proportion z-test: successes = triggered, trials = assigned
count = np.array([triggered_counts["control"], triggered_counts["variant"]])
nobs  = np.array([assign_counts["control"],  assign_counts["variant"]])
z, pval = proportions_ztest(count, nobs, alternative="two-sided")


print("Assigned users:", assign_counts.to_dict())
print("Triggered users:", triggered_counts.to_dict())
print(f"Trigger rates: control = {triggered_counts['control'] / assign_counts['control']:.2%}, variant = {triggered_counts['variant'] / assign_counts['variant']:.2%}")
print(f"\nTrigger Rates Proportions Z-test p-value: {pval:.4f}")
if pval > alpha:
    print(f"p-value > {alpha:.2f}: Failed to reject the null hypothesis that the trigger rates are equal.\n\nThere is no strong evidence of a trigger-rate imbalance between groups.")
elif pval < alpha:
    print(f"p-value < {alpha:.2f}: Rejected the null hypothesis that the trigger rates are equal.\n\nThere is a statistically significant difference in trigger rates between groups.")
else:
    print(f"p-value == {alpha:.2f}: The p-value is exactly equal to the alpha threshold. This is a borderline case for trigger rate difference between groups.")


Assigned users: {'control': 4991, 'variant': 5009}
Triggered users: {'control': 4706, 'variant': 4748}
Trigger rates: control = 94.29%, variant = 94.79%

Trigger Rates Proportions Z-test p-value: 0.2715
p-value > 0.05: Failed to reject the null hypothesis that the trigger rates are equal.

There is no strong evidence of a trigger-rate imbalance between groups.
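
For reference, the pooled two-proportion z-test used here can be written out in a few lines. This is a standard-library sketch of the textbook pooled-variance formula, not the statsmodels implementation; `two_proportion_ztest` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions, using the pooled variance."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Triggered / assigned counts reported above (control first, then variant)
z_trigger, p_trigger = two_proportion_ztest(4706, 4991, 4748, 5009)
print(f"z = {z_trigger:.3f}, p = {p_trigger:.4f}")
```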


### 2.2.3 Sessions per User Balance Check

Purpose: confirm user engagement is balanced across experiment groups.

Test method: Welch's two-sample t-test (two-sided) on mean sessions per user.

Why this method: it compares means for two independent groups while allowing unequal variances and slightly different sample sizes.

Hypothesis:
- H0: mean_sessions_per_user_control = mean_sessions_per_user_variant
- H1: mean_sessions_per_user_control != mean_sessions_per_user_variant


In [11]:
# Check sessions per user difference between groups - is there a significant difference in the number of sessions per user between control and variant?
sessions_per_user = (
    sessions_enrolled.groupby(["user_id", "experiment_group"])["session_id"]
    .count()
    .reset_index(name="n_sessions")
)

control_sessions_per_user = sessions_per_user.loc[sessions_per_user["experiment_group"] == "control", "n_sessions"]
variant_sessions_per_user = sessions_per_user.loc[sessions_per_user["experiment_group"] == "variant", "n_sessions"]
print(f"Mean sessions per user in control group: {control_sessions_per_user.mean():.2f}")
print(f"Mean sessions per user in variant group: {variant_sessions_per_user.mean():.2f}")

ttest_res = ttest(control_sessions_per_user, variant_sessions_per_user, correction=True) # Welch's t-test with unequal variance correction
spu_pval = ttest_res['p-val'].values[0]
print(f"\nSessions Per User Welch's t test p-value: {spu_pval:.4f}")
if spu_pval < alpha:
    print(f"p-value < {alpha:.2f}: Reject the null hypothesis that the mean sessions per user from control and variant groups are the same. \n\nThis indicates a statistically significant difference in user engagement between groups.")
elif spu_pval > alpha:
    print(f"p-value > {alpha:.2f}: Fail to reject the null hypothesis that the mean sessions per user from control and variant groups are the same. \n\nThis indicates there is no statistically significant difference in user engagement between groups.")
else:
    print(f"p-value == {alpha:.2f}: The p-value is exactly equal to the alpha threshold. This is a borderline case for sessions per user difference between groups.")

Mean sessions per user in control group: 1.62
Mean sessions per user in variant group: 1.61

Sessions Per User Welch's t test p-value: 0.5321
p-value > 0.05: Fail to reject the null hypothesis that the mean sessions per user from control and variant groups are the same. 

This indicates there is no statistically significant difference in user engagement between groups.


## 2.3 Primary Metric Inference

Test method:
- Two-proportion z-test (two-sided) on user conversion rates.
- 95% normal-approximation CI for absolute lift and 95% bootstrap CI for relative lift.

Why this method:
- Conversion is binary at the user level, so the proportion test is the standard inferential test for between-group difference.
- CI-based lift estimates quantify uncertainty and support both statistical significance and MDE-based practical significance checks.

Hypothesis:
- H0: conversion_variant = conversion_control
- H1: conversion_variant != conversion_control

Primary gate passes only if:
1. Two-sided p-value < alpha
2. Lower bound of the 95% CI for relative lift > MDE


In [12]:
# Calculate conversion counts and rates by group
conv_counts = (
    user_level.groupby("experiment_group")["converted"]
    .agg(converted="sum", total_users="count")
    .reindex(["control", "variant"])
)
conv_rates = conv_counts["converted"] / conv_counts["total_users"]
conv_rate_control = conv_rates["control"]
conv_rate_variant = conv_rates["variant"]
print(f"Conversion rates by group: control = {conv_rate_control:.2%}, variant = {conv_rate_variant:.2%}")
conv_counts

Conversion rates by group: control = 23.99%, variant = 26.77%


| experiment_group | converted | total_users |
|------------------|-----------|-------------|
| control | 1129 | 4706 |
| variant | 1271 | 4748 |

In [13]:
# Check conversion rate difference between groups - is the conversion rate significantly different between control and variant?
conv_count = conv_counts["converted"].values
conv_nobs = conv_counts["total_users"].values
conv_z, conv_pval = proportions_ztest(conv_count, conv_nobs, alternative="two-sided")
print(f"Conversion Rates Proportions Z-test p-value: {conv_pval:.4f}")
if conv_pval > alpha:
    print(f"p-value > {alpha:.2f}: Failed to reject the null hypothesis that the conversion rates are equal.\n\nThere is no strong evidence that one group converts better/worse.")
elif conv_pval < alpha:
    print(f"p-value < {alpha:.2f}: Rejected the null hypothesis that the conversion rates are equal.\n\nThere is a statistically significant difference in conversion rates between groups.")
else:
    print(f"p-value == {alpha:.2f}: The p-value is exactly equal to the alpha threshold. This is a borderline case for conversion rate difference between groups.")

Conversion Rates Proportions Z-test p-value: 0.0019
p-value < 0.05: Rejected the null hypothesis that the conversion rates are equal.

There is a statistically significant difference in conversion rates between groups.


In [14]:
# Check the (1 - alpha) confidence interval for the absolute difference in conversion rates (variant - control)
conv_diff = conv_rate_variant - conv_rate_control

# Standard error for the difference in proportions (Binomial distribution)
conv_se = np.sqrt(
    conv_rate_control * (1 - conv_rate_control) / conv_counts.loc["control", "total_users"]
    + conv_rate_variant * (1 - conv_rate_variant) / conv_counts.loc["variant", "total_users"]
)

z = norm.ppf(1 - alpha / 2)  # two-sided
ci_lower_abs = conv_diff - z * conv_se
ci_upper_abs = conv_diff + z * conv_se

print(f"Difference in conversion rates (point estimate): {conv_diff:.2%}")
print(f"{(1-alpha):.0%} CI for conversion rate difference: [{ci_lower_abs:.2%}, {ci_upper_abs:.2%}]")


Difference in conversion rates (point estimate): 2.78%
95% CI for conversion rate difference: [1.03%, 4.53%]


In [15]:
# Bootstrap the relative difference in conversion rates to get a confidence interval for the relative lift
conv_diff_rel = conv_diff / conv_rate_control
print(f"Relative lift in conversion rate (point estimate): {conv_diff_rel:.2%}")

control_converted = user_level.loc[user_level["experiment_group"]=="control", "converted"]
variant_converted = user_level.loc[user_level["experiment_group"]=="variant", "converted"]

rng = np.random.default_rng(42)

n_c, n_v = len(control_converted), len(variant_converted)
bootstraps = np.empty(n_bootstraps)

for b in range(n_bootstraps):
    ctrl_b = rng.choice(control_converted, size=n_c, replace=True)
    var_b  = rng.choice(variant_converted, size=n_v, replace=True)

    p_c = ctrl_b.mean()
    p_v = var_b.mean()

    # relative lift
    bootstraps[b] = (p_v - p_c) / p_c

ci_lower_rel, ci_upper_rel = np.percentile(bootstraps, [100*alpha/2, 100*(1-alpha/2)])
print(f"95% bootstrap confidence interval for the relative lift: [{ci_lower_rel:.2%}, {ci_upper_rel:.2%}]")

# Compare the MDE threshold to the bootstrap confidence interval of relative lift
print(f"Primary MDE (relative lift): {mde_primary_rel:.1%}")
if ci_lower_rel > mde_primary_rel:
    print(f"\nThe entire 95% confidence interval is above the primary MDE threshold. \nThis suggests that the observed conversion rate difference is not only statistically significant but also practically significant, exceeding the minimum effect size we aimed to detect.")
elif ci_upper_rel < mde_primary_rel:
    print(f"\nThe entire 95% confidence interval is below the primary MDE threshold. \nThis suggests that the observed conversion rate difference is statistically significant but may not be practically significant, as it does not exceed the minimum effect size we aimed to detect.")
else:
    print(f"\nThe 95% confidence interval overlaps with the primary MDE threshold. \nThis suggests that while there may be a statistically significant difference in conversion rates, we cannot be confident that the true effect size exceeds the minimum effect size we aimed to detect.")

Relative lift in conversion rate (point estimate): 11.58%
95% bootstrap confidence interval for the relative lift: [4.13%, 19.66%]
Primary MDE (relative lift): 15.0%

The 95% confidence interval overlaps with the primary MDE threshold. 
This suggests that while there may be a statistically significant difference in conversion rates, we cannot be confident that the true effect size exceeds the minimum effect size we aimed to detect.
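
An analytic interval gives a quick sanity check on the bootstrap. The sketch below is an alternative technique, not the method used in this notebook: it applies the delta method on the log of the rate ratio and then converts back to relative lift. `relative_lift_ci` is a hypothetical helper name.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_lift_ci(x_c, n_c, x_v, n_v, alpha=0.05):
    """Delta-method CI for relative lift p_v / p_c - 1, built on the log rate ratio."""
    p_c, p_v = x_c / n_c, x_v / n_v
    log_rr = log(p_v / p_c)
    # Var(log p_hat) is approximately (1 - p) / (n * p) = 1/x - 1/n for each group
    se_log_rr = sqrt(1 / x_c - 1 / n_c + 1 / x_v - 1 / n_v)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return exp(log_rr - z * se_log_rr) - 1, exp(log_rr + z * se_log_rr) - 1


# Converted / total users per group, as reported in the conversion table
lift_lo, lift_hi = relative_lift_ci(1129, 4706, 1271, 4748)
print(f"[{lift_lo:.2%}, {lift_hi:.2%}]")
```

The analytic bounds should sit close to the bootstrap interval above and lead to the same conclusion: the interval straddles the 15% MDE.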


In [16]:
# Print summary of test results
print("----Summary of Tests on the Primary Metric----")
print("\nThe new search algorithm's impact on conversion rate:")
print(f"Absolute lift: {conv_diff:.2%}, with 95% CI [{ci_lower_abs:.2%}, {ci_upper_abs:.2%}]")
print(f"Relative lift: {conv_diff_rel:.2%}, with 95% CI [{ci_lower_rel:.2%}, {ci_upper_rel:.2%}]")
print(f"\nConclusion: \nWe are 95% confident that the treatment lifts conversion rate by between {ci_lower_abs:.2%} and {ci_upper_abs:.2%} in absolute terms (percentage points), but the relative lift may not reach the minimum effect size that is practically meaningful for the business.")


----Summary of Tests on the Primary Metric----

The new search algorithm's impact on conversion rate:
Absolute lift: 2.78%, with 95% CI [1.03%, 4.53%]
Relative lift: 11.58%, with 95% CI [4.13%, 19.66%]

Conclusion: 
We are 95% confident that the treatment lifts conversion rate by between 1.03% and 4.53% in absolute terms (percentage points), but the relative lift may not reach the minimum effect size that is practically meaningful for the business.


## 2.4 Guardrail Non-Inferiority Inference

Test method:
- Mann-Whitney U test (two-sided) for distributional difference in time to first booking.
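- 95% bootstrap CI for median difference (variant - control), compared against the non-inferiority threshold. Non-inferiority is declared when the CI upper bound is below the threshold, which corresponds to a one-sided test at the 2.5% level.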

Why this method:
- Time to booking is typically skewed, so a non-parametric rank-based test is appropriate for distribution checks.
- The guardrail is defined on a median difference threshold; bootstrap CI gives a direct uncertainty interval for that decision rule.

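Hypothesis (one-sided non-inferiority, where `guardrail_threshold` is 5% of the control median time to first booking, in minutes):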
- H0: median_time_to_booking_variant - median_time_to_booking_control >= guardrail_threshold
- H1: median_time_to_booking_variant - median_time_to_booking_control < guardrail_threshold
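
A minimal sketch of how this hypothesis turns into a decision rule, using only the standard library and synthetic data. The samples, seeds, and helper names (`bootstrap_median_diff_ci`, `is_non_inferior`) are assumptions for illustration and do not reproduce the notebook's numbers.

```python
import random
from statistics import median


def bootstrap_median_diff_ci(control, variant, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for median(variant) - median(control)."""
    rng = random.Random(seed)
    diffs = sorted(
        median(rng.choices(variant, k=len(variant)))
        - median(rng.choices(control, k=len(control)))
        for _ in range(n_boot)
    )
    # Approximate percentile positions in the sorted bootstrap distribution
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


def is_non_inferior(ci_upper, control_median, rel_tolerance=0.05):
    """Reject H0 when the upper CI bound is below the tolerance expressed in minutes."""
    return ci_upper < control_median * rel_tolerance


# Synthetic, right-skewed "minutes to booking" samples (not the experiment data)
gen = random.Random(1)
control = [gen.lognormvariate(2.6, 0.5) for _ in range(300)]
variant = [gen.lognormvariate(2.6, 0.5) for _ in range(300)]

lo, hi = bootstrap_median_diff_ci(control, variant)
print(f"CI: [{lo:.2f}, {hi:.2f}] -> non-inferior: {is_non_inferior(hi, median(control))}")
```

Only the upper bound matters for the decision; the lower bound is reported for context.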


In [17]:
# Calculate median time to booking of converted users by group
control_time_to_booking = user_level.loc[(user_level["experiment_group"] == "control") & (user_level["converted"] == 1), "time_to_booking"].dropna()
variant_time_to_booking = user_level.loc[(user_level["experiment_group"] == "variant") & (user_level["converted"] == 1), "time_to_booking"].dropna()

print(f"Median time to first booking (converted users) - control: {control_time_to_booking.median():.2f} minutes")
print(f"Median time to first booking (converted users) - variant: {variant_time_to_booking.median():.2f} minutes")

Median time to first booking (converted users) - control: 14.27 minutes
Median time to first booking (converted users) - variant: 14.09 minutes


In [18]:
# Check distribution shift in time to booking
mwu_res = mwu(control_time_to_booking, variant_time_to_booking, alternative="two-sided")
mwu_pval = mwu_res["p-val"].values[0]

print(f"Median Time to Booking Mann-Whitney U test p-value: {mwu_pval:.4f}")
if mwu_pval < alpha:
    print(f"p-value < {alpha:.2f}: Reject the null hypothesis that time to booking in the control and variant groups comes from the same distribution. \n\nThis indicates a statistically significant difference in user experience between groups.")
elif mwu_pval > alpha:
    print(f"p-value > {alpha:.2f}: Fail to reject the null hypothesis that time to booking in the control and variant groups comes from the same distribution. \n\nThis indicates there is no statistically significant difference in user experience between groups.")
else:
    print(f"p-value == {alpha:.2f}: The p-value is exactly equal to the alpha threshold. This is a borderline case for median time to booking difference between groups.")


Median Time to Booking Mann-Whitney U test p-value: 0.4129
p-value > 0.05: Fail to reject the null hypothesis that time to booking in the control and variant groups comes from the same distribution. 

This indicates there is no statistically significant difference in user experience between groups.


In [19]:
# Bootstrap the median time to booking difference to get a confidence interval
time_diff = variant_time_to_booking.median() - control_time_to_booking.median()
bootstraps_time_diff = np.empty(n_bootstraps)
for b in range(n_bootstraps):
    ctrl_b = rng.choice(control_time_to_booking, size=len(control_time_to_booking), replace=True)
    var_b  = rng.choice(variant_time_to_booking, size=len(variant_time_to_booking), replace=True)
    bootstraps_time_diff[b] = np.median(var_b) - np.median(ctrl_b)

ci_lower_time_diff, ci_upper_time_diff = np.percentile(bootstraps_time_diff, [100*alpha/2, 100*(1-alpha/2)])
print(f"Difference in median time to booking (point estimate): {time_diff:.2f} minutes")
print(f"Two-sided 95% bootstrap confidence interval for difference in median time to booking: [{ci_lower_time_diff:.2f}, {ci_upper_time_diff:.2f}] minutes")

# Compare the non-inferiority guardrail threshold to the bootstrap confidence interval of time difference
# The relative tolerance is converted to minutes using the observed control median (treated as fixed)
guardrail_time_diff = control_time_to_booking.median() * noninf_guardrail_abs
print(f"\nNon-inferiority guardrail threshold ({noninf_guardrail_abs:.1%} of the control median time to booking): {guardrail_time_diff:.2f} minutes")
if ci_upper_time_diff < guardrail_time_diff:
    print(f"The upper bound of the 95% CI ({ci_upper_time_diff:.2f} minutes) is below the non-inferiority guardrail threshold. \nThis suggests that we can be confident that the new search algorithm does not worsen the median time to booking by more than the acceptable limit.")
elif ci_lower_time_diff > guardrail_time_diff:
    print(f"The lower bound of the 95% CI ({ci_lower_time_diff:.2f} minutes) is above the non-inferiority guardrail threshold. \nThis suggests that the new search algorithm worsens the median time to booking by more than the acceptable limit, raising concerns about user experience.")
else:
    print(f"The 95% CI [{ci_lower_time_diff:.2f}, {ci_upper_time_diff:.2f}] minutes overlaps with the non-inferiority guardrail threshold. \nThis suggests that we cannot be confident that the new search algorithm meets the non-inferiority criterion for median time to booking.")


Difference in median time to booking (point estimate): -0.18 minutes
Two-sided 95% bootstrap confidence interval for difference in median time to booking: [-0.64, 0.31] minutes

Non-inferiority guardrail threshold (5.0% of the control median time to booking): 0.71 minutes
The upper bound of the 95% CI (0.31 minutes) is below the non-inferiority guardrail threshold. 
This suggests that we can be confident that the new search algorithm does not worsen the median time to booking by more than the acceptable limit.


# 3. Final Launch Decision

Decision combines power adequacy, data quality, primary success, and guardrail safety.

## Gate Outcomes
1. Sample size adequacy: **pass** (observed users per group exceed required users per group).
2. Data quality and randomization: **pass** (no SRM; no material trigger-rate or sessions-per-user imbalance).
3. Primary metric (conversion): **mixed**. Statistical significance passes (p = 0.0019), but the practical-significance gate does not pass because the 95% CI for relative lift [4.13%, 19.66%] straddles the 15% MDE threshold (point estimate: 11.58%).
4. Guardrail metric (median time to first booking): **pass** non-inferiority (the 95% CI upper bound for the median difference, 0.31 minutes, is below the 0.71-minute guardrail threshold).

## Decision
Do not launch to 100% traffic yet under the current launch criteria. Continue the experiment (or run a staged ramp) to tighten uncertainty and confirm the effect clears the practical-significance threshold.
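
The gate logic can be captured in a small helper so the decision is reproducible when the experiment is re-analyzed. This is a sketch of one possible encoding: the `GateResults` fields, the ordering of checks, and the returned messages are assumptions, not an established launch policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResults:
    sample_size_ok: bool
    data_quality_ok: bool
    primary_significant: bool
    primary_clears_mde: bool
    guardrail_non_inferior: bool


def launch_decision(g: GateResults) -> str:
    """Combine the four launch gates into a single recommendation."""
    if not (g.sample_size_ok and g.data_quality_ok):
        return "invalid experiment: fix setup before interpreting results"
    if not g.guardrail_non_inferior:
        return "do not launch: guardrail degraded"
    if g.primary_significant and g.primary_clears_mde:
        return "launch"
    return "do not launch yet: continue experiment or staged ramp"


# Gate outcomes reported in this analysis
observed = GateResults(
    sample_size_ok=True,
    data_quality_ok=True,
    primary_significant=True,
    primary_clears_mde=False,
    guardrail_non_inferior=True,
)
decision = launch_decision(observed)
print(decision)
```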
