![ab_testing_image](ab_testing_image.jpg)

As a Data Scientist at a leading online travel agency, you’ve been tasked with evaluating the impact of a new search ranking algorithm designed to improve conversion rates. The Product team is considering a full rollout, but only if the experiment shows a clear positive effect on the conversion rate and does not lead to a longer time to book.

They have shared A/B test datasets with session-level booking data (`"sessions_data.csv"`) and user-level control/variant split (`"users_data.csv"`). Your job is to analyze and interpret the results to determine whether the new ranking system delivers a statistically significant improvement and provide a clear, data-driven recommendation.

## `sessions_data.csv`

| column | data type | description | 
|--------|-----------|-------------|
| `session_id` | `string` | Unique session identifier (unique for each row) |
| `user_id` | `string` | Unique user identifier (non logged-in users have missing user_id values; each user can have multiple sessions) |
| `session_start_timestamp` | `string` | When a session started |
| `booking_timestamp` | `string` | When a booking was made (missing if no booking was made during a session) |
| `time_to_booking` | `float` | Time from start of the session to booking, in minutes (missing if no booking was made during a session) |
| `conversion` | `integer` | _New column to create:_ did session end up with a booking (0 if booking_timestamp or time_to_booking is Null, otherwise 1) |

<br>

## `users_data.csv`

| column | data type | description | 
|--------|-----------|-------------|
| `user_id` | `string` | Unique user identifier (only logged-in users in this table) |
| `experiment_group` | `string` | control / variant split for the experiment (expected to be equal 50/50) |

<br>

The criteria for going full on (a full rollout) are the following:
- Primary metric (conversion) effect must be statistically significant and show a positive effect (increase).
- Guardrail (time_to_booking) effect must either be statistically insignificant or show a positive effect (decrease).
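
The two criteria can be written as a small predicate. This is only a sketch: the function name `full_on_decision`, its argument names, and the default `alpha=0.10` are assumptions made for illustration, and the inputs in the usage lines are made-up numbers rather than results from this experiment.

```python
def full_on_decision(pval_primary, effect_primary,
                     pval_guardrail, effect_guardrail, alpha=0.10):
    """Return True when both rollout criteria are met (hypothetical helper)."""
    # Primary metric: significant AND positive (conversion went up)
    primary_ok = (pval_primary < alpha) and (effect_primary > 0)
    # Guardrail: not significant, OR the change is a decrease in time to booking
    guardrail_ok = (pval_guardrail > alpha) or (effect_guardrail <= 0)
    return primary_ok and guardrail_ok


# Made-up inputs, for illustration only
launch = full_on_decision(0.01, 0.05, 0.50, 0.02)   # significant lift, guardrail flat
blocked = full_on_decision(0.01, 0.05, 0.01, 0.02)  # guardrail significantly worse
print(launch, blocked)
```

Keeping the rule in one function makes it easy to check the edge cases (for example, a significant guardrail change that is a decrease still passes).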

In [3]:
# Import libraries
import pandas as pd
from scipy.stats import chisquare
from pingouin import ttest
from statsmodels.stats.proportion import proportions_ztest

In [4]:
confidence_level = 0.90  # Set the pre-defined confidence level (90%)
alpha = 1 - confidence_level  # Significance level for hypothesis tests

In [5]:
# load the data
sessions = pd.read_csv('sessions_data.csv')
users = pd.read_csv('users_data.csv')
# merge the data
sessions_x_users = sessions.merge(users, how='inner', on='user_id')
sessions_x_users.head()

Unnamed: 0,session_id,user_id,session_start_timestamp,booking_timestamp,time_to_booking,experiment_group
0,CP0lbAGnb5UNi3Ut,TcCIMrtQ75wHGXVj,2025-01-26 20:02:39.177358627,,,variant
1,UQAjrPYair63L1p8,TcCIMrtQ75wHGXVj,2025-01-20 16:12:51.536912203,,,variant
2,9zQrAPxV5oi2SzSa,TcCIMrtQ75wHGXVj,2025-01-28 03:46:40.839362144,,,variant
3,kkrz1M5vxrQ8wXRZ,GUGVzto9KGqeX3dc,2025-01-25 02:48:50.953303099,,,variant
4,ABZZFrwItZAPdYGP,v2EBIHmOdQfalI6k,2025-01-11 11:41:36.912253618,,,variant


In [6]:
sessions_x_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15283 entries, 0 to 15282
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   session_id               15283 non-null  object 
 1   user_id                  15283 non-null  object 
 2   session_start_timestamp  15283 non-null  object 
 3   booking_timestamp        2607 non-null   object 
 4   time_to_booking          2607 non-null   float64
 5   experiment_group         15283 non-null  object 
dtypes: float64(1), object(5)
memory usage: 716.5+ KB


In [7]:
# Compute the primary metric: 1 if the session has both a booking timestamp
# and a time to booking, otherwise 0 (vectorized instead of a row-wise apply)
sessions_x_users['conversion'] = (
    sessions_x_users[['booking_timestamp', 'time_to_booking']].notna().all(axis=1).astype(int)
)
sessions_x_users.head(5)


Unnamed: 0,session_id,user_id,session_start_timestamp,booking_timestamp,time_to_booking,experiment_group,conversion
0,CP0lbAGnb5UNi3Ut,TcCIMrtQ75wHGXVj,2025-01-26 20:02:39.177358627,,,variant,0
1,UQAjrPYair63L1p8,TcCIMrtQ75wHGXVj,2025-01-20 16:12:51.536912203,,,variant,0
2,9zQrAPxV5oi2SzSa,TcCIMrtQ75wHGXVj,2025-01-28 03:46:40.839362144,,,variant,0
3,kkrz1M5vxrQ8wXRZ,GUGVzto9KGqeX3dc,2025-01-25 02:48:50.953303099,,,variant,0
4,ABZZFrwItZAPdYGP,v2EBIHmOdQfalI6k,2025-01-11 11:41:36.912253618,,,variant,0


In [8]:
# Check whether the experiment groups are balanced
# (note: each row is a session, so these are session counts, not unique users)
groups_count = sessions_x_users['experiment_group'].value_counts()
groups_count

experiment_group
variant    7653
control    7630
Name: count, dtype: int64

Before analyzing the test results, we run a sanity check for Sample Ratio Mismatch (SRM) to confirm that randomization worked correctly. SRM occurs when the observed split between test groups (e.g., A/B) deviates significantly from the expected ratio (e.g., 50/50).

Why it matters:
- 🔍 Detects randomization issues
- 🛠️ Catches setup bugs (e.g., traffic filters, rollout errors)
- ⚠️ Prevents biased results due to uneven group composition

We run a chi-square goodness-of-fit test on the group counts:

Null hypothesis: observed group sizes = expected group sizes

If the p-value is very small (e.g., < 0.01), we reject the null hypothesis and declare an SRM.
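
To make the test less of a black box, here is a sketch of the same chi-square statistic worked by hand, using the session counts printed above and an expected 50/50 split. It relies on the fact that, with two groups, the test has one degree of freedom, so the p-value can be obtained from the standard library's `erfc`. The variable names are illustrative.

```python
from math import erfc, sqrt

# Session counts per group, as printed by value_counts() above
observed = {"variant": 7653, "control": 7630}

n = sum(observed.values())
expected = n / 2  # expected count per group under a 50/50 split

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((obs - expected) ** 2 / expected for obs in observed.values())

# With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2))
p_value = erfc(sqrt(chi2 / 2))

print(f"chi2 = {chi2:.4f}, p-value = {p_value:.4f}")
```

The result should agree closely with what `scipy.stats.chisquare` reports in the next cell.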

In [9]:
# sanity check - Sample Ratio Mismatch (SRM)
n = sessions_x_users.shape[0] # Total sample size
srm_chi2_stat, srm_chi2_pval = chisquare(f_obs = groups_count, f_exp = (n/2, n/2))
srm_chi2_pval = round(srm_chi2_pval, 4)
print(f'\nSRM\np-value: {srm_chi2_pval}') 

# A very small p-value (below 0.01) suggests a sampling issue
if srm_chi2_pval < 0.01:
    print("Sample Ratio Mismatch (SRM) may be present")
else:
    print("Sample Ratio Mismatch (SRM) is likely not present")


SRM
p-value: 0.8524
Sample Ratio Mismatch (SRM) is likely not present


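We will compute a Z-test for proportions. The Z-test is appropriate here because conversion is a binary outcome and both groups are large, so the difference between the two sample proportions is approximately normally distributed. We also estimate the effect size because the p-value only indicates whether a difference is detectable; it says nothing about how large the difference is or in which direction it goes, and the rollout criteria require a positive effect.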
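
For intuition, the pooled two-proportion Z-test can be sketched with the standard library alone. The helper name `two_proportion_ztest` and the counts in the usage line are hypothetical (they are not taken from this dataset); the formula is the usual pooled-variance version with a two-sided p-value.

```python
from math import erfc, sqrt


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Pooled two-sided z-test for the difference p_b - p_a (illustrative helper)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis p_a == p_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts: 150/1000 conversions in control, 180/1000 in variant
z, p_value = two_proportion_ztest(conv_a=150, n_a=1000, conv_b=180, n_b=1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

Note that the sign of `z` depends on which group is subtracted from which; the two-sided p-value does not.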

In [10]:
# Count the number of conversions and total observations in each group
grouped_data = sessions_x_users.groupby('experiment_group')['conversion'].agg(['sum', 'count'])

# Perform the Z-test for proportions
z_stat, pval_primary = proportions_ztest(count=grouped_data['sum'], nobs=grouped_data['count'], alternative='two-sided')

# Save the p-value rounded to 4 decimals
pval_primary = round(pval_primary, 4)

pval_primary

np.float64(0.0002)

In [11]:
avg_metric_per_group = sessions_x_users.groupby('experiment_group')['conversion'].mean()
effect_size = avg_metric_per_group['variant'] / avg_metric_per_group['control'] - 1

effect_size_primary = round(effect_size, 4)
print(f'\nPrimary metric\np-value: {pval_primary: .4f} | effect size: {effect_size_primary: .4f}') 


Primary metric
p-value:  0.0002 | effect size:  0.1422
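
The effect size above is a relative lift (variant rate divided by control rate, minus 1). As a sketch of how that relates to the absolute difference, and of how a confidence interval could accompany it, the snippet below computes both plus an unpooled (Wald) interval at the 90% level. The function name `lift_and_ci` and the counts are assumptions for illustration, not values from this experiment.

```python
from math import sqrt
from statistics import NormalDist


def lift_and_ci(conv_c, n_c, conv_v, n_v, confidence=0.90):
    """Relative lift, absolute difference, and a Wald CI for the difference."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    abs_diff = p_v - p_c          # difference in conversion rates
    rel_lift = p_v / p_c - 1      # same definition as effect_size above
    # Unpooled standard error of the difference in proportions
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (abs_diff - z_crit * se, abs_diff + z_crit * se)
    return rel_lift, abs_diff, ci


# Hypothetical counts: 150/1000 conversions in control, 180/1000 in variant
rel_lift, abs_diff, ci = lift_and_ci(150, 1000, 180, 1000)
print(f"relative lift: {rel_lift:.2%}, absolute diff: {abs_diff:.4f}, "
      f"90% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```

Reporting the interval alongside the lift shows how precisely the effect is estimated, which a single p-value does not convey.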


In [12]:
# EFFECT ANALYSIS - GUARDRAIL METRIC
# T-test on time to booking for control vs variant
stats_guardrail = ttest(
    sessions_x_users.loc[(sessions_x_users['experiment_group'] == 'control'), 'time_to_booking'],
    sessions_x_users.loc[(sessions_x_users['experiment_group'] == 'variant'), 'time_to_booking'],
    alternative='two-sided',
)
pval_guardrail, tstat_guardrail = stats_guardrail['p-val'].values[0], stats_guardrail['T'].values[0]
pval_guardrail = round(pval_guardrail, 4)
pval_guardrail


np.float64(0.5365)

In [13]:
# Estimate effect size for the guardrail metric

avg_metric_per_group = sessions_x_users.groupby('experiment_group')['time_to_booking'].mean()
effect_size_guardrail = avg_metric_per_group['variant'] / avg_metric_per_group['control'] - 1
effect_size_guardrail = round(effect_size_guardrail, 4)
print(f'\nGuardrail\np-value: {pval_guardrail} | effect size: {effect_size_guardrail}')



Guardrail
p-value: 0.5365 | effect size: -0.0079
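
Because `time_to_booking` exists only for sessions that ended in a booking, the two samples compared here can differ in size and variance. Welch's t-test is the usual choice in that situation, and `pingouin.ttest` is understood to apply the Welch correction automatically when group sizes differ. The sketch below shows the Welch statistic and its degrees of freedom with the standard library; the helper name `welch_t` and the sample values are made up for illustration.

```python
from math import isnan, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Drop missing values (sessions without a booking)
    a = [x for x in a if not isnan(x)]
    b = [x for x in b if not isnan(x)]
    # Squared standard error of each sample mean
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up times to booking in minutes; NaN marks a session with no booking
control = [12.0, 15.5, 9.8, 20.1, 14.3, float("nan")]
variant = [11.2, 14.9, 10.5, 18.7, 13.0, float("nan")]

t_stat, dof = welch_t(control, variant)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The p-value then comes from the t-distribution with `dof` degrees of freedom, for example via `scipy.stats.ttest_ind(..., equal_var=False)`.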


In [14]:
# DECISION
# Primary metric must be statistically significant and show positive effect (increase)
criteria_full_on_primary = (pval_primary < alpha) & (effect_size_primary > 0)

# Guardrail must either be statistically insignificant or show positive effect (decrease)
criteria_full_on_guardrail = (pval_guardrail > alpha) | (effect_size_guardrail <= 0)

# Final launch decision based on both metrics
if criteria_full_on_primary and criteria_full_on_guardrail:
    decision_full_on = 'Yes'
    print('\nThe experiment results are significantly positive and the guardrail metric was not harmed, we are going full on!')
else:
    decision_full_on = 'No'
    print('\nThe experiment results are inconclusive or the guardrail metric was harmed, we are pulling back!')


The experiment results are significantly positive and the guardrail metric was not harmed, we are going full on!
