### 🎯 __Project Objectives: A/B Test — “recommender_system_test”__

The main objective of this second phase of the data analysis project is to evaluate the impact of implementing a new recommendation system on the behavior of users of an international online store, through the analysis of a controlled A/B test.

Specifically, the objectives are:

1. Measure and compare the performance of the new checkout funnel (group B) against the current funnel (group A), using key conversion metrics within the user journey:

    - Product page views (product_page)
    - Add to cart events (product_cart)
    - Completed purchases (purchase)

2. Validate whether the new recommendation system improves sales funnel conversion by at least 10% at each stage, within 14 days of user registration, as established in the business hypothesis.

3. Verify the consistency and quality of the collected data to ensure the analysis is statistically reliable and representative of the selected audience (15% of new users in the EU region).

4. Apply appropriate statistical methods (such as hypothesis testing and significance analysis) to determine whether the observed differences between groups A and B are attributable to the new recommendation system and not due to chance.

5. Communicate actionable findings that allow the product team to decide whether the new checkout funnel should be implemented widely or requires additional adjustments before full deployment.
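
Objectives 2 and 4 can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the analysis performed later in the notebook: it computes the relative lift of group B over group A and a pooled two-proportion z-test using only the standard library. The function names and the counts in the usage lines are hypothetical.

```python
from math import erfc, sqrt


def relative_lift(success_a, n_a, success_b, n_b):
    """Relative change of B's conversion rate with respect to A's."""
    return (success_b / n_b) / (success_a / n_a) - 1


def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts: converted users / users in each group for one funnel stage
lift = relative_lift(300, 3000, 360, 3000)
z_stat, p_val = two_proportion_ztest(300, 3000, 360, 3000)
print(f"lift={lift:.1%}, z={z_stat:.2f}, p={p_val:.4f}")
```

A stage would meet the business hypothesis only if the lift is at least 10% and the difference is also statistically significant at the chosen level.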

#### __Technical Description__

- Test Name: __recommender_system_test__
- Groups: A (control), B (new checkout funnel)
- Launch Date: 2020-12-07
- Date new user sign-ups stopped: 2020-12-21
- End Date: 2021-01-01
- Audience: __15% of new__ users from the _EU region_
- Test Purpose: To test changes related to the introduction of an improved recommendation system
- Expected Outcome: __Within 14 days of enrollment__, users will show improved conversion rates for product page views (the product_page event), adding items to the shopping cart (product_cart), and purchases (purchase). At each stage of the `product_page → product_cart → purchase` __funnel__, there will be _at least a 10% increase_.
- Expected number of test participants: 6,000
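
The 14-day condition in the expected outcome can be expressed as a simple date check. The sketch below is an assumption about how the window might be defined (the registration day counts as day 0, so the window covers days 0 to 13); the helper name `within_window` is hypothetical and the rule should be adjusted if the business definition differs.

```python
from datetime import date, datetime, timedelta


def within_window(first_date: date, event_dt: datetime, days: int = 14) -> bool:
    """True if the event falls within the first `days` calendar days after registration.

    Assumption: the registration day is day 0, so the window covers days 0..days-1.
    """
    elapsed = event_dt.date() - first_date
    return timedelta(0) <= elapsed < timedelta(days=days)


# Hypothetical user who registered on the launch date
registered = date(2020, 12, 7)
print(within_window(registered, datetime(2020, 12, 20, 23, 59)))  # 13 days after registration
print(within_window(registered, datetime(2020, 12, 21, 0, 0)))    # 14 days after registration
```

Note that users who registered on the last enrollment days have fewer than 14 days of events before the test end date, which is worth keeping in mind when comparing cohorts.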


####  __Data Description__

`ab_project_marketing_events_us.csv` (the marketing events calendar for 2020)

- `name`: the name of the marketing event
- `regions`: regions where the advertising campaign will be run
- `start_dt`: the campaign start date
- `finish_dt`: the campaign end date

`final_ab_new_users_upd_us.csv` (all users who registered in the online store from December 7 to 21, 2020)

- `user_id`: user identification
- `first_date`: Date of enrollment
- `region`: location
- `device`: Device used for enrollment

`final_ab_events_upd_us.csv` (all new user events from December 7, 2020, to January 1, 2021)

- `user_id`: user identification
- `event_dt`: Date and time of the event
- `event_name`: Name of the event type
- `details`: Additional information about the event (e.g., the total order in USD for purchase events)

`final_ab_participants_upd_us.csv` (a table with data on test participants)

- `user_id`: user identification
- `ab_test`: Name of the test
- `group`: The test group the user belonged to

### 💻 __1. Notebook Libraries and Customization__

In [103]:
import datetime as dt
from datetime import datetime
from IPython.display import display, HTML
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import re
from scipy.stats import norm
from scipy import stats as st
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
import unicodedata

### 💻 __2. Functions__

In [2]:
# Function to normalize string formatting in object-type columns
def normalize_string_format(df, include=None, exclude=None):
    """
    Standardizes text formatting for object-type (string) columns in a DataFrame.

    Operations performed:
    - Removes accents and other non-ASCII characters via Unicode normalization (NFKD)
    - Strips leading/trailing whitespace
    - Replaces punctuation (except periods and underscores) with spaces
    - Collapses runs of whitespace into single underscores
    - Removes redundant and trailing underscores

    Note: lowercasing is currently disabled, so the original casing is preserved.

    Parameters:
    df (DataFrame): The input DataFrame.
    include (list, optional): Specific columns to apply formatting to. If None, applies to all except those in 'exclude'.
    exclude (list, optional): Columns to skip.

    Returns:
    DataFrame: Updated DataFrame with normalized string formats.
    """

    if exclude is None:
        exclude = []

    if include is None:
        available_columns = [col for col in df.columns if col not in exclude]
    else:
        available_columns = [col for col in include if col not in exclude]

    def clean_text(text):
        if isinstance(text, str):
            text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
            text = text.strip()                     # Lowercasing is intentionally disabled to preserve casing
            text = re.sub(r'[^\w\s.]', ' ', text)   # Replace anything other than word characters, whitespace, or periods with a space
            text = re.sub(r'\s+', '_', text)        # Replace each run of whitespace with a single underscore
            text = re.sub(r'__+', '_', text)        # Collapse consecutive underscores into one
            text = re.sub(r'_(?=\s|$)', '', text)   # Remove an underscore left at the end of the text
        return text

    for column in available_columns:
        if df[column].dtype in ['object', 'string']:
            df[column] = df[column].apply(clean_text)

    return df

# Function to identify non-standard missing values in object-type columns
def check_existing_missing_values(df):
    """
    Checks object-type columns in a DataFrame for non-standard missing values.

    Parameters:
    df (DataFrame): The dataset to inspect.

    Output:
    Displays the number of non-standard missing entries per column and the matched values.
    """

    # Common non-standard representations of missing values
    missing_values = ['', ' ', 'N/A', 'none', 'None','null', 'NULL', 'NaN', 'nan', 'NAN', 'nat', 'NaT']

    display(HTML(f"<h4>Scanning for Non-Standard Missing Values</h4>"))

    for column in df.columns:

        matches = df[df[column].isin(missing_values)][column].unique()

        if df[column].isin(missing_values).any() and matches.size > 0:
            count = df[column].isin(missing_values).sum()
            display(
                HTML(f"> Missing values in column <i>'{column}'</i>: <b>{count}</b>"))
            display(
                HTML(f"&emsp;Matched non-standard values: {list(matches)}"))
        else:
            display(
                HTML(f"> Missing values in column <i>'{column}'</i>: None"))

    print()

    return None

# Function to standardize non-standard missing values to pd.NA
def replace_missing_values(df, include=None, exclude=None):
    """
    Replaces common non-standard missing value entries in object-type columns with pd.NA.

    Parameters:
    df (DataFrame): The input dataset.
    include (list, optional): List of columns to include. If None, all columns except those in 'exclude' are considered.
    exclude (list, optional): List of columns to exclude from replacement.

    Returns:
    DataFrame: Updated DataFrame with non-standard missing values replaced by pd.NA.
    """

    missing_values = ['', ' ', 'N/A', 'none', 'None', 'null', 'NULL', 'NaN', 'nan', 'NAN', 'nat', 'NaT']

    if exclude is None:
        exclude = []

    if include is None:
        available_columns = [col for col in df.columns if col not in exclude]
    else:
        available_columns = [col for col in include if col not in exclude]

    for column in available_columns:
        if df[column].dtype in ['object', 'string'] and df[column].isin(missing_values).any():
            df.loc[:, column] = df[column].replace(missing_values, pd.NA)

    return df

# Function to display the percentage of missing values in a DataFrame
def missing_values_rate(df, include=None, exclude=None):
    
    """
    Displays the percentage of missing values for specified columns in a DataFrame.

    Parameters:
    ----------
    df : pandas.DataFrame
        The DataFrame to analyze.

    include : list, optional
        List of column names to include in the analysis. If None, all columns not in `exclude` are considered.

    exclude : list, optional
        List of column names to exclude from the analysis. Default is an empty list.

    Returns:
    -------
    None
        Displays HTML output in a Jupyter Notebook environment.
    """
    
    if exclude is None:
        exclude = []

    if include is None:
        available_columns = [col for col in df.columns if col not in exclude]
    else:
        available_columns = [col for col in include if col not in exclude]

    for column in available_columns:
        total_values = len(df[column])
        if total_values == 0:
            percentage = 0
        else:
            missing_values = df[column].isna().sum()
            percentage = (missing_values / total_values) * 100

        display(HTML(f"> Percentage of missing values for column <i>'{column}'</i>: <b>{percentage:.2f}</b> %<br>"))
        display(HTML(f"&emsp;Total values: {df[column].shape[0]}<br>&emsp;Missing values: {df[column].isna().sum()}<br><br>"))

# Function to detect out-of-order funnel sequences per user and day
def detect_irregular_funnel_sequences(df_events, user_col='user_id', event_col='event_name', date_col='event_dt', time_col='event_tm'):
    """
    Detects users whose sequence of events by date does not follow the logical order of the funnel,
    allowing incomplete but ordered sequences.

    Returns:
    - df_users_irregular: DataFrame with one row per irregular (user, date) pair and columns:
        - user_id
        - event_date: Date of the event group
        - event_sequence: Ordered list of tuples (timestamp, event)
    """
    
    expected_order = ['login', 'product_page', 'product_cart', 'purchase']
    df_events = df_events.copy()

    # Combine date and time into a complete datetime 
    df_events['event_timestamp'] = df_events.apply(lambda row: datetime.combine(row[date_col], row[time_col]), axis=1) 

    # Sort by user, date and time 
    df_sorted = df_events.sort_values(by=[user_col, date_col, 'event_timestamp']) 

    irregular_records = [] 

    # Group by user and date 
    for (user_id, event_date), group in df_sorted.groupby([user_col, date_col]):
        funnel_events = group[group[event_col].isin(expected_order)] 

        # Generate list of tuples (timestamp, event) 
        event_sequence = list(zip(funnel_events['event_timestamp'].tolist(), funnel_events[event_col].tolist())) 

        # Check if the sequence is ordered according to the funnel 
        last_index = -1 
        for _, e in event_sequence:
            current_index = expected_order.index(e) 
            if current_index < last_index:
                irregular_records.append({'user_id': user_id, 'event_date': event_date, 'event_sequence': event_sequence}) 
                break 
            last_index = current_index 

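    # Explicit columns keep the expected schema even when no irregular records are found
    df_users_irregular = pd.DataFrame(irregular_records, columns=['user_id', 'event_date', 'event_sequence'])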
    return df_users_irregular

# Plot pie or donut chart
# Example: plot_pie_chart_plotlypx(df_users_testAB, 'region', 'user_id', color_map=color_map, unique_values=True)
def plot_pie_chart_plotlypx(df, category_col, value_col, color_map=None, labels=True, title=None, unique_values=False):
    """
    Generates a pie chart using Plotly Express.

    Parameters:

    - df: Input DataFrame
    - category_col: Column with categories (pie labels)
    - value_col: Column with values (proportions)
    - color_map: dictionary (color map by category) or list (sequence of colors)
    - labels: bool, if True shows labels as percentages
    - title: chart title text (None uses generic title)
    - unique_values: bool, if True counts distinct values of value_col per category instead of rows
    """
    
    if unique_values:
        df_plot = (
            df[[category_col, value_col]]
            .drop_duplicates()
            .groupby(category_col, observed=True)[value_col]
            .nunique()
            .reset_index(name='count')
        )
    else:
        df_plot = (
            df[[category_col, value_col]]
            .groupby(category_col, observed=True)[value_col]
            .count()
            .reset_index(name='count')
        )   
    
    # Create chart based on color_map type
    if isinstance(color_map, dict):
        fig = px.pie(
            df_plot,
            names=category_col,
            values='count',
            color=category_col,
            color_discrete_map=color_map,
            hole=0.0 # for full pie (adjustable if you want a donut-like pie)
        )
    elif isinstance(color_map, list):
        fig = px.pie(
            df_plot,
            names=category_col,
            values='count',
            color=category_col,
            color_discrete_sequence=color_map,
            hole=0.0 # for full pie (adjustable if you want a donut-like pie)
        )
    else:
        fig = px.pie(
            df_plot,
            names=category_col,
            values='count',
            color=category_col,
            hole=0.0 # for full pie (adjustable if you want a donut-like pie)
        )

    fig.update_traces(textinfo='percent+label' if labels else 'none')
    fig.update_layout(title_text=title if title else "Distribution by Category", title_x=0.5)
    fig.show()

# Function to plot a horizontal bar chart using Plotly Express
def plot_horizontal_bar_plotpx(df, x: str, y: str, hue: str | None = None,         # categorical column for grouping (like hue in seaborn)
                               title: str = '', xlabel: str = '', ylabel: str = '', sort: bool = True, height: int = 500, width: int = 1200, color: str = 'grey'):
    """
    Plots a horizontal bar chart with Plotly Express.

    Parameters:
    - df (DataFrame): Input DataFrame.
    - x (str): Column name for x-axis values (numeric).
    - y (str): Column name for y-axis categories.
    - hue (str, optional): Column for coloring/grouping (categorical).
    - title (str): Chart title.
    - xlabel (str): X-axis label.
    - ylabel (str): Y-axis label.
    - sort (bool): If True, sorts rows by x ascending so that the largest bar is drawn at the top.
    - height (int): Figure height.
    - width (int): Figure width.
    - color (str): Bar color used when hue is not provided.
    """

    data = df.copy()
    if sort:
        data = data.sort_values(by=x, ascending=True)
        
    if hue:
        fig = px.bar(
            data,
            x=x,
            y=y,
            orientation='h',
            color=hue,             # uses categoric columns
            title=title,
            height=height,
            width=width
        )
    else:
        fig = px.bar(
            data,
            x=x,
            y=y,
            orientation='h',       # explicit, to match the hue branch
            title=title,
            height=height,
            width=width,
            color_discrete_sequence=[color]
        )

    fig.update_layout(
        xaxis_title=xlabel,
        yaxis_title=ylabel,
        template='plotly_white'
    )

    fig.show()


### 🔁 __3. Data Loading__

In [3]:
df_mkt_events = pd.read_csv('../data/raw/ab_project_marketing_events_us.csv', sep=',', header='infer', keep_default_na=False)
df_user_events = pd.read_csv('../data/raw/final_ab_events_upd_us.csv', sep=',', header='infer', keep_default_na=False)
df_users_registration = pd.read_csv('../data/raw/final_ab_new_users_upd_us.csv', sep=',', header='infer', keep_default_na=False)
df_user_test = pd.read_csv('../data/raw/final_ab_participants_upd_us.csv', sep=',', header='infer', keep_default_na=False)
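
The `keep_default_na=False` argument explains why every column later reports zero nulls in `info()`: empty fields are loaded as empty strings instead of `NaN`. The following sketch shows the difference on a small in-memory CSV; the sample rows are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row extract shaped like the events file
csv_text = "user_id,event_name,details\nU1,purchase,4.99\nU2,login,\n"

raw = pd.read_csv(StringIO(csv_text), keep_default_na=False)
default = pd.read_csv(StringIO(csv_text))

print(repr(raw.loc[1, 'details']))        # empty field kept as a string
print(raw['details'].isna().sum())        # no NaN detected
print(default['details'].isna().sum())    # empty field parsed as NaN
```

Loading with `keep_default_na=False` therefore requires an explicit step to detect and standardize missing values, which is what the helper functions in section 2 are for.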

### 🧹 __4. Data Cleanup__

##### **4.1** Data Overview

In [4]:
df_mkt_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name       14 non-null     object
 1   regions    14 non-null     object
 2   start_dt   14 non-null     object
 3   finish_dt  14 non-null     object
dtypes: object(4)
memory usage: 576.0+ bytes


In [5]:
df_user_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423761 entries, 0 to 423760
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   user_id     423761 non-null  object
 1   event_dt    423761 non-null  object
 2   event_name  423761 non-null  object
 3   details     423761 non-null  object
dtypes: object(4)
memory usage: 12.9+ MB


In [6]:
df_users_registration.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58703 entries, 0 to 58702
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     58703 non-null  object
 1   first_date  58703 non-null  object
 2   region      58703 non-null  object
 3   device      58703 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


In [7]:
df_user_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14525 entries, 0 to 14524
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  14525 non-null  object
 1   group    14525 non-null  object
 2   ab_test  14525 non-null  object
dtypes: object(3)
memory usage: 340.6+ KB


##### **4.2** Standardization of string value formats (snake case)

In [8]:
df_mkt_events = normalize_string_format(df_mkt_events, exclude=['start_dt', 'finish_dt'])
df_mkt_events

Unnamed: 0,name,regions,start_dt,finish_dt
0,Christmas_New_Year_Promo,EU_N.America,2020-12-25,2021-01-03
1,St._Valentine_s_Day_Giveaway,EU_CIS_APAC_N.America,2020-02-14,2020-02-16
2,St._Patric_s_Day_Promo,EU_N.America,2020-03-17,2020-03-19
3,Easter_Promo,EU_CIS_APAC_N.America,2020-04-12,2020-04-19
4,4th_of_July_Promo,N.America,2020-07-04,2020-07-11
5,Black_Friday_Ads_Campaign,EU_CIS_APAC_N.America,2020-11-26,2020-12-01
6,Chinese_New_Year_Promo,APAC,2020-01-25,2020-02-07
7,Labor_day_May_1st_Ads_Campaign,EU_CIS_APAC,2020-05-01,2020-05-03
8,International_Women_s_Day_Promo,EU_CIS_APAC,2020-03-08,2020-03-10
9,Victory_Day_CIS_May_9th_Event,CIS,2020-05-09,2020-05-11


In [9]:
df_users_registration = normalize_string_format(df_users_registration, include=['region', 'device'])
df_users_registration

Unnamed: 0,user_id,first_date,region,device
0,D72A72121175D8BE,2020-12-07,EU,PC
1,F1C668619DFE6E65,2020-12-07,N.America,Android
2,2E1BF1D4C37EA01F,2020-12-07,EU,PC
3,50734A22C0C63768,2020-12-07,EU,iPhone
4,E1BDDCE0DAFA2679,2020-12-07,N.America,iPhone
...,...,...,...,...
58698,1DB53B933257165D,2020-12-20,EU,Android
58699,538643EB4527ED03,2020-12-20,EU,Mac
58700,7ADEE837D5D8CBBD,2020-12-20,EU,PC
58701,1C7D23927835213F,2020-12-20,EU,iPhone
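
To make the effect of `normalize_string_format` easier to follow, the sketch below applies the same sequence of steps to single strings. It is a standalone re-implementation for illustration only, and the raw input strings are assumptions, since the original CSV values are not shown in this notebook.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Standalone version of the per-value cleaning steps (illustrative)."""
    # Drop accents and any other non-ASCII characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    text = text.strip()
    text = re.sub(r'[^\w\s.]', ' ', text)   # punctuation except '.' and '_' becomes a space
    text = re.sub(r'\s+', '_', text)        # whitespace runs become one underscore
    text = re.sub(r'__+', '_', text)        # collapse repeated underscores
    text = re.sub(r'_$', '', text)          # drop a trailing underscore
    return text


# Assumed raw inputs
for sample in ["St. Valentine's Day Giveaway", "EU, N.America", "  Café Promo! "]:
    print(repr(sample), '->', repr(clean_text(sample)))
```

Because periods are preserved and lowercasing is disabled, values such as `N.America` keep their original form.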


##### **4.3** Explicit Duplicate Removal

In [10]:
display(HTML(f"> Explicit duplicates in <i>df_mkt_events</i>: <b>{df_mkt_events.duplicated().sum()}</b>"))
print()
display(HTML(f"> Explicit duplicates in <i>df_users_registration</i>: <b>{df_users_registration.duplicated().sum()}</b>"))
print()
display(HTML(f"> Explicit duplicates in <i>df_user_events</i>: <b>{df_user_events.duplicated().sum()}</b>"))
print()
display(HTML(f"> Explicit duplicates in <i>df_user_test</i>: <b>{df_user_test.duplicated().sum()}</b>"))

In [11]:
print(df_user_test.head())

            user_id group                  ab_test
0  D1ABA3E2887B6A73     A  recommender_system_test
1  A7A3664BD6242119     A  recommender_system_test
2  DABC14FDDFADD29E     A  recommender_system_test
3  04988C5DF189632E     A  recommender_system_test
4  4FF2998A348C484F     A  recommender_system_test


In [12]:
print(df_user_test.tail())

                user_id group            ab_test
14520  1D302F8688B91781     B  interface_eu_test
14521  3DE51B726983B657     A  interface_eu_test
14522  F501F79D332BE86C     A  interface_eu_test
14523  63FBE257B05F2245     A  interface_eu_test
14524  79F9ABFB029CF724     B  interface_eu_test


##### **4.4** Missing Value Analysis

In [13]:
check_existing_missing_values(df_mkt_events)




In [14]:
check_existing_missing_values(df_users_registration)




In [15]:
check_existing_missing_values(df_user_events)




In [16]:
check_existing_missing_values(df_user_test)




In [17]:
# Handling missing values
df_user_events = replace_missing_values(df_user_events, include=['details'])
df_user_events

Unnamed: 0,user_id,event_dt,event_name,details
0,E1BDDCE0DAFA2679,2020-12-07 20:22:03,purchase,99.99
1,7B6452F081F49504,2020-12-07 09:22:53,purchase,9.99
2,9CD9F34546DF254C,2020-12-07 12:59:29,purchase,4.99
3,96F27A054B191457,2020-12-07 04:02:40,purchase,4.99
4,1FD7660FDF94CA1F,2020-12-07 10:15:09,purchase,4.99
...,...,...,...,...
423756,245E85F65C358E08,2020-12-30 19:35:55,login,
423757,9385A108F5A0A7A7,2020-12-30 10:54:15,login,
423758,DB650B7559AC6EAC,2020-12-30 10:59:09,login,
423759,F80C9BDDEA02E53C,2020-12-30 09:53:39,login,


In [18]:
missing_values_rate(df_user_events, include=['details'])

__Note:__

Missing values in `details` will not be imputed. They belong to events other than `purchase`, where an order total does not apply, so filling them with arbitrary values would introduce noise or false data into the analysis.

- No imputation: a missing value indicates that the information is not applicable to that event.
- Safe quantitative operations: pandas aggregations such as `sum` and `mean` skip missing values by default.
- This keeps the analysis clean and consistent at each stage of the funnel.
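
As a small illustration of the second point, the sketch below converts a `details`-like column to numbers and aggregates it. The values are hypothetical, and `None` stands in for the `pd.NA` markers used in the notebook.

```python
import pandas as pd

# Hypothetical 'details' values: order totals for purchases, missing for other events
details = pd.Series(['99.99', None, '4.99', None], name='details')

# Convert to numeric; anything that cannot be parsed becomes NaN
amounts = pd.to_numeric(details, errors='coerce')

print(amounts.count())  # number of non-missing values
print(amounts.sum())    # missing values are skipped
print(amounts.mean())   # mean over non-missing values only
```

Since the `details` column remains of type `object` after cleaning, a conversion like this would be needed before computing revenue metrics.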

##### **4.5** Casting Datatypes

In [19]:
# Cast to datetime
df_mkt_events['start_dt'] = pd.to_datetime(df_mkt_events['start_dt'], errors='coerce').dt.date
df_mkt_events['finish_dt'] = pd.to_datetime(df_mkt_events['finish_dt'], errors='coerce').dt.date
df_mkt_events

Unnamed: 0,name,regions,start_dt,finish_dt
0,Christmas_New_Year_Promo,EU_N.America,2020-12-25,2021-01-03
1,St._Valentine_s_Day_Giveaway,EU_CIS_APAC_N.America,2020-02-14,2020-02-16
2,St._Patric_s_Day_Promo,EU_N.America,2020-03-17,2020-03-19
3,Easter_Promo,EU_CIS_APAC_N.America,2020-04-12,2020-04-19
4,4th_of_July_Promo,N.America,2020-07-04,2020-07-11
5,Black_Friday_Ads_Campaign,EU_CIS_APAC_N.America,2020-11-26,2020-12-01
6,Chinese_New_Year_Promo,APAC,2020-01-25,2020-02-07
7,Labor_day_May_1st_Ads_Campaign,EU_CIS_APAC,2020-05-01,2020-05-03
8,International_Women_s_Day_Promo,EU_CIS_APAC,2020-03-08,2020-03-10
9,Victory_Day_CIS_May_9th_Event,CIS,2020-05-09,2020-05-11


In [20]:
df_users_registration['first_date'] = pd.to_datetime(df_users_registration['first_date'], errors='coerce').dt.date
df_users_registration

Unnamed: 0,user_id,first_date,region,device
0,D72A72121175D8BE,2020-12-07,EU,PC
1,F1C668619DFE6E65,2020-12-07,N.America,Android
2,2E1BF1D4C37EA01F,2020-12-07,EU,PC
3,50734A22C0C63768,2020-12-07,EU,iPhone
4,E1BDDCE0DAFA2679,2020-12-07,N.America,iPhone
...,...,...,...,...
58698,1DB53B933257165D,2020-12-20,EU,Android
58699,538643EB4527ED03,2020-12-20,EU,Mac
58700,7ADEE837D5D8CBBD,2020-12-20,EU,PC
58701,1C7D23927835213F,2020-12-20,EU,iPhone


In [21]:
df_user_events['event_dt'] = pd.to_datetime(df_user_events['event_dt'], errors='coerce')
df_user_events

Unnamed: 0,user_id,event_dt,event_name,details
0,E1BDDCE0DAFA2679,2020-12-07 20:22:03,purchase,99.99
1,7B6452F081F49504,2020-12-07 09:22:53,purchase,9.99
2,9CD9F34546DF254C,2020-12-07 12:59:29,purchase,4.99
3,96F27A054B191457,2020-12-07 04:02:40,purchase,4.99
4,1FD7660FDF94CA1F,2020-12-07 10:15:09,purchase,4.99
...,...,...,...,...
423756,245E85F65C358E08,2020-12-30 19:35:55,login,
423757,9385A108F5A0A7A7,2020-12-30 10:54:15,login,
423758,DB650B7559AC6EAC,2020-12-30 10:59:09,login,
423759,F80C9BDDEA02E53C,2020-12-30 09:53:39,login,


In [22]:
# Cast to category
df_mkt_events['regions'] = df_mkt_events['regions'].astype('category')
df_mkt_events

Unnamed: 0,name,regions,start_dt,finish_dt
0,Christmas_New_Year_Promo,EU_N.America,2020-12-25,2021-01-03
1,St._Valentine_s_Day_Giveaway,EU_CIS_APAC_N.America,2020-02-14,2020-02-16
2,St._Patric_s_Day_Promo,EU_N.America,2020-03-17,2020-03-19
3,Easter_Promo,EU_CIS_APAC_N.America,2020-04-12,2020-04-19
4,4th_of_July_Promo,N.America,2020-07-04,2020-07-11
5,Black_Friday_Ads_Campaign,EU_CIS_APAC_N.America,2020-11-26,2020-12-01
6,Chinese_New_Year_Promo,APAC,2020-01-25,2020-02-07
7,Labor_day_May_1st_Ads_Campaign,EU_CIS_APAC,2020-05-01,2020-05-03
8,International_Women_s_Day_Promo,EU_CIS_APAC,2020-03-08,2020-03-10
9,Victory_Day_CIS_May_9th_Event,CIS,2020-05-09,2020-05-11


In [23]:
df_users_registration['region'] = df_users_registration['region'].astype('category')
df_users_registration['device'] = df_users_registration['device'].astype('category')
df_users_registration

Unnamed: 0,user_id,first_date,region,device
0,D72A72121175D8BE,2020-12-07,EU,PC
1,F1C668619DFE6E65,2020-12-07,N.America,Android
2,2E1BF1D4C37EA01F,2020-12-07,EU,PC
3,50734A22C0C63768,2020-12-07,EU,iPhone
4,E1BDDCE0DAFA2679,2020-12-07,N.America,iPhone
...,...,...,...,...
58698,1DB53B933257165D,2020-12-20,EU,Android
58699,538643EB4527ED03,2020-12-20,EU,Mac
58700,7ADEE837D5D8CBBD,2020-12-20,EU,PC
58701,1C7D23927835213F,2020-12-20,EU,iPhone


In [24]:
df_user_events['event_name'] = df_user_events['event_name'].astype('category')
df_user_events

Unnamed: 0,user_id,event_dt,event_name,details
0,E1BDDCE0DAFA2679,2020-12-07 20:22:03,purchase,99.99
1,7B6452F081F49504,2020-12-07 09:22:53,purchase,9.99
2,9CD9F34546DF254C,2020-12-07 12:59:29,purchase,4.99
3,96F27A054B191457,2020-12-07 04:02:40,purchase,4.99
4,1FD7660FDF94CA1F,2020-12-07 10:15:09,purchase,4.99
...,...,...,...,...
423756,245E85F65C358E08,2020-12-30 19:35:55,login,
423757,9385A108F5A0A7A7,2020-12-30 10:54:15,login,
423758,DB650B7559AC6EAC,2020-12-30 10:59:09,login,
423759,F80C9BDDEA02E53C,2020-12-30 09:53:39,login,


In [25]:
df_user_test['group'] = df_user_test['group'].astype('category')
df_user_test['ab_test'] = df_user_test['ab_test'].astype('category')
df_user_test

Unnamed: 0,user_id,group,ab_test
0,D1ABA3E2887B6A73,A,recommender_system_test
1,A7A3664BD6242119,A,recommender_system_test
2,DABC14FDDFADD29E,A,recommender_system_test
3,04988C5DF189632E,A,recommender_system_test
4,4FF2998A348C484F,A,recommender_system_test
...,...,...,...
14520,1D302F8688B91781,B,interface_eu_test
14521,3DE51B726983B657,A,interface_eu_test
14522,F501F79D332BE86C,A,interface_eu_test
14523,63FBE257B05F2245,A,interface_eu_test


In [26]:
# Check dtypes
df_mkt_events.dtypes

name           object
regions      category
start_dt       object
finish_dt      object
dtype: object

In [27]:
df_users_registration.dtypes

user_id         object
first_date      object
region        category
device        category
dtype: object

In [28]:
df_user_events.dtypes

user_id               object
event_dt      datetime64[ns]
event_name          category
details               object
dtype: object

In [29]:
df_user_test.dtypes

user_id      object
group      category
ab_test    category
dtype: object

### ⚙️ **5. Feature Engineering**

In [30]:
df_users = df_user_events.merge(df_users_registration, on='user_id', how='left')
df_users

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device
0,E1BDDCE0DAFA2679,2020-12-07 20:22:03,purchase,99.99,2020-12-07,N.America,iPhone
1,7B6452F081F49504,2020-12-07 09:22:53,purchase,9.99,2020-12-07,EU,iPhone
2,9CD9F34546DF254C,2020-12-07 12:59:29,purchase,4.99,2020-12-07,N.America,iPhone
3,96F27A054B191457,2020-12-07 04:02:40,purchase,4.99,2020-12-07,EU,iPhone
4,1FD7660FDF94CA1F,2020-12-07 10:15:09,purchase,4.99,2020-12-07,EU,Android
...,...,...,...,...,...,...,...
423756,245E85F65C358E08,2020-12-30 19:35:55,login,,2020-12-07,EU,Android
423757,9385A108F5A0A7A7,2020-12-30 10:54:15,login,,2020-12-07,EU,PC
423758,DB650B7559AC6EAC,2020-12-30 10:59:09,login,,2020-12-07,EU,Android
423759,F80C9BDDEA02E53C,2020-12-30 09:53:39,login,,2020-12-07,EU,iPhone


In [31]:
df_users['event_tm'] = df_users['event_dt'].dt.time
df_users['event_dt'] = df_users['event_dt'].dt.date
df_users

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm
0,E1BDDCE0DAFA2679,2020-12-07,purchase,99.99,2020-12-07,N.America,iPhone,20:22:03
1,7B6452F081F49504,2020-12-07,purchase,9.99,2020-12-07,EU,iPhone,09:22:53
2,9CD9F34546DF254C,2020-12-07,purchase,4.99,2020-12-07,N.America,iPhone,12:59:29
3,96F27A054B191457,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,04:02:40
4,1FD7660FDF94CA1F,2020-12-07,purchase,4.99,2020-12-07,EU,Android,10:15:09
...,...,...,...,...,...,...,...,...
423756,245E85F65C358E08,2020-12-30,login,,2020-12-07,EU,Android,19:35:55
423757,9385A108F5A0A7A7,2020-12-30,login,,2020-12-07,EU,PC,10:54:15
423758,DB650B7559AC6EAC,2020-12-30,login,,2020-12-07,EU,Android,10:59:09
423759,F80C9BDDEA02E53C,2020-12-30,login,,2020-12-07,EU,iPhone,09:53:39


In [32]:
def add_mkt_event(df_users, df_mkt_events): 
    """ 
    Adds the name of the marketing event from df_mkt_events to df_users if the event_dt is within start_dt and finish_dt and the region matches. 
    Parameters: 
    - df_users: DataFrame with columns ['user_id', 'event_dt', 'region', ...] 
    - df_mkt_events: DataFrame with columns ['name', 'regions', 'start_dt', 'finish_dt'] 
    
    Returns: 
    - df_users with new column 'mkt_event_name' 
    """ 
    
    # Function to apply per row of df_users 
    def find_event(row): 
        # Filter events where the user's date is within the range 
        mask_date = ((df_mkt_events['start_dt'] <= row['event_dt']) & (df_mkt_events['finish_dt'] >= row['event_dt'])) 
        
        # Filter by region using sets, row by row 
        def region_match(event_regions_str): 
            event_regions_set = set(event_regions_str.replace('_', ' ').split()) 
            user_regions_set = set(row['region'].split()) 
            return bool(event_regions_set & user_regions_set) # Non-empty intersection 
        
        mask_region = df_mkt_events['regions'].apply(region_match) 
        
        # Events that meet both conditions 
        matched_events = df_mkt_events[mask_date & mask_region] 
        
        # If several events match, only the first one is kept
        if not matched_events.empty: 
            return matched_events.iloc[0]['name'] 
        return pd.NA 
    
    df_users['mkt_event_name'] = df_users.apply(find_event, axis=1) 
    return df_users

In [33]:
df_users = add_mkt_event(df_users, df_mkt_events)
df_users

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name
0,E1BDDCE0DAFA2679,2020-12-07,purchase,99.99,2020-12-07,N.America,iPhone,20:22:03,
1,7B6452F081F49504,2020-12-07,purchase,9.99,2020-12-07,EU,iPhone,09:22:53,
2,9CD9F34546DF254C,2020-12-07,purchase,4.99,2020-12-07,N.America,iPhone,12:59:29,
3,96F27A054B191457,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,04:02:40,
4,1FD7660FDF94CA1F,2020-12-07,purchase,4.99,2020-12-07,EU,Android,10:15:09,
...,...,...,...,...,...,...,...,...,...
423756,245E85F65C358E08,2020-12-30,login,,2020-12-07,EU,Android,19:35:55,Christmas_New_Year_Promo
423757,9385A108F5A0A7A7,2020-12-30,login,,2020-12-07,EU,PC,10:54:15,Christmas_New_Year_Promo
423758,DB650B7559AC6EAC,2020-12-30,login,,2020-12-07,EU,Android,10:59:09,Christmas_New_Year_Promo
423759,F80C9BDDEA02E53C,2020-12-30,login,,2020-12-07,EU,iPhone,09:53:39,Christmas_New_Year_Promo
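
The matching rule used by `add_mkt_event` (the event date falls inside the campaign range and the user's region is one of the campaign's regions) can be sketched without pandas. The helper `match_event` is hypothetical; the campaign tuples are taken from the calendar shown earlier.

```python
from datetime import date


def match_event(event_day, region, campaigns):
    """Return the first campaign active on event_day in the given region, else None."""
    for name, regions, start, finish in campaigns:
        # Normalized region strings are underscore-separated, e.g. 'EU_N.America'
        if start <= event_day <= finish and region in regions.split('_'):
            return name
    return None


campaigns = [
    ('Christmas_New_Year_Promo', 'EU_N.America', date(2020, 12, 25), date(2021, 1, 3)),
    ('Black_Friday_Ads_Campaign', 'EU_CIS_APAC_N.America', date(2020, 11, 26), date(2020, 12, 1)),
]

print(match_event(date(2020, 12, 30), 'EU', campaigns))
print(match_event(date(2020, 12, 7), 'EU', campaigns))
```

Of the campaigns shown, only the Christmas and New Year promo overlaps the test period for EU users, starting on 2020-12-25.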


In [34]:
df_users = df_users.merge(df_user_test, on='user_id', how='left')
df_users

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test
0,E1BDDCE0DAFA2679,2020-12-07,purchase,99.99,2020-12-07,N.America,iPhone,20:22:03,,,
1,7B6452F081F49504,2020-12-07,purchase,9.99,2020-12-07,EU,iPhone,09:22:53,,,
2,9CD9F34546DF254C,2020-12-07,purchase,4.99,2020-12-07,N.America,iPhone,12:59:29,,,
3,96F27A054B191457,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,04:02:40,,B,interface_eu_test
4,1FD7660FDF94CA1F,2020-12-07,purchase,4.99,2020-12-07,EU,Android,10:15:09,,,
...,...,...,...,...,...,...,...,...,...,...,...
429471,245E85F65C358E08,2020-12-30,login,,2020-12-07,EU,Android,19:35:55,Christmas_New_Year_Promo,,
429472,9385A108F5A0A7A7,2020-12-30,login,,2020-12-07,EU,PC,10:54:15,Christmas_New_Year_Promo,,
429473,DB650B7559AC6EAC,2020-12-30,login,,2020-12-07,EU,Android,10:59:09,Christmas_New_Year_Promo,,
429474,F80C9BDDEA02E53C,2020-12-30,login,,2020-12-07,EU,iPhone,09:53:39,Christmas_New_Year_Promo,A,interface_eu_test


### 📝 **6: Pre-test Funnel Analysis**

#### **6.1** Funnel data consistency

In [35]:
# Funnel stages should be: login, product_page, product_cart, purchase
df_users_irregular = detect_irregular_funnel_sequences(df_users)
print(f"User records with irregular sequence: {len(df_users_irregular)}")
df_users_irregular

User records with irregular sequence: 112834


Unnamed: 0,user_id,event_date,event_sequence
0,000199F1887AE5E6,2020-12-14,"[(2020-12-14 09:56:09, purchase), (2020-12-14 ..."
1,000199F1887AE5E6,2020-12-15,"[(2020-12-15 07:22:56, purchase), (2020-12-15 ..."
2,000199F1887AE5E6,2020-12-20,"[(2020-12-20 06:36:35, purchase), (2020-12-20 ..."
3,000199F1887AE5E6,2020-12-21,"[(2020-12-21 02:11:23, purchase), (2020-12-21 ..."
4,0002499E372175C7,2020-12-22,"[(2020-12-22 03:49:52, purchase), (2020-12-22 ..."
...,...,...,...
112829,FFF91B6C5431F375,2020-12-14,"[(2020-12-14 22:12:04, product_cart), (2020-12..."
112830,FFF91B6C5431F375,2020-12-17,"[(2020-12-17 08:27:17, product_cart), (2020-12..."
112831,FFFFE36C0F6E92DF,2020-12-22,"[(2020-12-22 11:38:57, product_cart), (2020-12..."
112832,FFFFE36C0F6E92DF,2020-12-23,"[(2020-12-23 05:09:13, product_page), (2020-12..."
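
`detect_irregular_funnel_sequences` is defined elsewhere in the notebook, so its exact rule is not visible in this section. As a rough illustration only, the sketch below shows one possible notion of a "regular" daily sequence: each stage may appear only after the stage that precedes it in the funnel has been seen. The names `FUNNEL_ORDER` and `is_regular_sequence` are hypothetical and may not match the notebook's helper.

```python
# Hypothetical helper: NOT the notebook's detect_irregular_funnel_sequences.
FUNNEL_ORDER = ["login", "product_page", "product_cart", "purchase"]


def is_regular_sequence(events):
    """Return True if every event appears only after the previous funnel stage was seen.

    `events` is a list of event names already sorted by timestamp.
    Event names outside FUNNEL_ORDER raise ValueError.
    """
    seen = set()
    for name in events:
        stage = FUNNEL_ORDER.index(name)
        # Any stage after login requires the preceding stage earlier in the sequence.
        if stage > 0 and FUNNEL_ORDER[stage - 1] not in seen:
            return False
        seen.add(name)
    return True


print(is_regular_sequence(["login", "product_page", "product_cart", "purchase"]))
print(is_regular_sequence(["purchase", "login"]))
```

Under this assumed rule, a day that starts with `purchase` or `product_cart` would be flagged as irregular, like the sequences shown in the output above.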


In [36]:
# Filter out the user/day records flagged as irregular
df_users = df_users.merge(df_users_irregular[['user_id', 'event_date']], left_on=['user_id', 'event_dt'], right_on=['user_id', 'event_date'], how='left', indicator=True)
df_users = df_users.loc[(df_users['_merge'] == 'left_only'), :].drop(columns=['_merge', 'event_date'])
df_users

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test
2,9CD9F34546DF254C,2020-12-07,purchase,4.99,2020-12-07,N.America,iPhone,12:59:29,,,
19,9C31B4124B3AE217,2020-12-07,purchase,4.99,2020-12-07,EU,PC,21:38:36,,,
21,FCD216B91578B8DC,2020-12-07,purchase,99.99,2020-12-07,N.America,PC,18:24:41,,,
28,AA77BDA8996FE1B8,2020-12-07,purchase,9.99,2020-12-07,N.America,Android,08:24:41,,,
36,2765321AC15BA00A,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,08:57:39,,,
...,...,...,...,...,...,...,...,...,...,...,...
429464,2761935C2DD2794F,2020-12-30,login,,2020-12-07,EU,iPhone,16:51:29,Christmas_New_Year_Promo,,
429465,E0E5446B78A6520B,2020-12-30,login,,2020-12-07,CIS,Android,23:36:30,CIS_New_Year_Gift_Lottery,,
429468,33E7BCF696B41C7B,2020-12-30,login,,2020-12-07,EU,PC,12:21:24,Christmas_New_Year_Promo,,
429472,9385A108F5A0A7A7,2020-12-30,login,,2020-12-07,EU,PC,10:54:15,Christmas_New_Year_Promo,,


#### **6.2** Funnel

In [37]:
# Compute conversion rates per funnel stage
df_funnel = df_users.groupby('event_name', observed=True).agg(events=('event_name', 'count'), users=('user_id', 'nunique')).sort_values(by='users', ascending=False).reset_index()
df_funnel['conversion_rate'] = ((df_funnel['users'] / df_funnel['events']) * 100).round(3)
df_funnel['total_conversion_rate'] = ((df_funnel['users'] / df_funnel.loc[0, 'users']) * 100).round(3)
df_funnel['stage_conversion_rate'] = ((df_funnel['users'] / df_funnel['users'].shift(1)) * 100).round(3)
df_funnel['drop_rate'] = (100 - df_funnel['stage_conversion_rate']).round(3)
df_funnel

Unnamed: 0,event_name,events,users,conversion_rate,total_conversion_rate,stage_conversion_rate,drop_rate
0,login,70757,30763,43.477,100.0,,
1,product_page,33627,17650,52.488,57.374,57.374,42.626
2,product_cart,10907,6373,58.43,20.716,36.108,63.892
3,purchase,171,163,95.322,0.53,2.558,97.442


In [38]:
plot_horizontal_bar_plotpx(df_funnel, x='users', y='event_name', title='Pre-Test Conversion Funnel', xlabel='Users', ylabel='Events', sort=True)

In [39]:
plot_horizontal_bar_plotpx(df_funnel, x='stage_conversion_rate', y='event_name', title='Pre-Test Stage Conversion Rate Funnel', xlabel='Rate', ylabel='Events', sort=True)

In [40]:
plot_horizontal_bar_plotpx(df_funnel.sort_values(by='drop_rate', ascending=False), x='drop_rate', y='event_name', title='Pre-Test Drop Rate Funnel', xlabel='Rate', ylabel='Events', sort=False)

### 🧪 **7: A/B Testing**

__AB Test Technical Description__

- Test Name: __recommender_system_test__
- Groups: A (control), B (new checkout funnel)
- Launch Date: 2020-12-07
- Date new users stopped: 2020-12-21
- End Date: 2021-01-01
- Audience: __15% of new__ users from the _EU region_
- Test Purpose: To test changes related to the introduction of an improved recommendation system
- Expected Outcome: __Within 14 days of enrollment__, users will show improved conversion rates for product page views (the product_page event), adding items to the shopping cart (product_cart), and purchases (purchase). At each stage of the `product_page → product_cart → purchase` __funnel__, there will be _at least a 10% increase_.
- Expected number of test participants: 6,000

#### **7.1** Checking the dataset against the test requirements

In [41]:
# Get the range of first dates (registration dates) in the data
display(HTML(f"> Earliest first date in df_users: <b>{df_users['first_date'].min()}</b>"))
display(HTML(f"> Latest first date in df_users: <b>{df_users['first_date'].max()}</b>"))

In [42]:
# Keep only users whose first date falls within the A/B test enrollment window (launch date to the date new users stopped)
df_users_testAB = df_users.loc[(df_users['first_date'] >= dt.date(2020, 12, 7)) & (df_users['first_date'] <= dt.date(2020, 12, 21)), :]
df_users_testAB

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test
2,9CD9F34546DF254C,2020-12-07,purchase,4.99,2020-12-07,N.America,iPhone,12:59:29,,,
19,9C31B4124B3AE217,2020-12-07,purchase,4.99,2020-12-07,EU,PC,21:38:36,,,
21,FCD216B91578B8DC,2020-12-07,purchase,99.99,2020-12-07,N.America,PC,18:24:41,,,
28,AA77BDA8996FE1B8,2020-12-07,purchase,9.99,2020-12-07,N.America,Android,08:24:41,,,
36,2765321AC15BA00A,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,08:57:39,,,
...,...,...,...,...,...,...,...,...,...,...,...
429464,2761935C2DD2794F,2020-12-30,login,,2020-12-07,EU,iPhone,16:51:29,Christmas_New_Year_Promo,,
429465,E0E5446B78A6520B,2020-12-30,login,,2020-12-07,CIS,Android,23:36:30,CIS_New_Year_Gift_Lottery,,
429468,33E7BCF696B41C7B,2020-12-30,login,,2020-12-07,EU,PC,12:21:24,Christmas_New_Year_Promo,,
429472,9385A108F5A0A7A7,2020-12-30,login,,2020-12-07,EU,PC,10:54:15,Christmas_New_Year_Promo,,


In [43]:
# Get audience info within the A/B test

# Join the region names into a single comma-separated string
html_output = ", ".join(map(str, df_users_testAB['region'].unique()))

display(HTML(f"> Regions within Test AB: <b>{html_output}</b>"))

In [44]:
# Double-check the audience size per region after filtering the A/B test data

html_output =""
audience_AB_Total = 0
audience_AB_NAmerica = 0
audience_AB_EU = 0
audience_AB_APAC = 0
audience_AB_CIS = 0

for region in df_users_testAB['region'].unique():
    if region == 'N.America':
        audience_AB_NAmerica = df_users_testAB.loc[(df_users_testAB['region'] == 'N.America'), 'user_id'].unique().shape[0]
    elif region == 'EU':
        audience_AB_EU = df_users_testAB.loc[(df_users_testAB['region'] == 'EU'), 'user_id'].unique().shape[0]
    elif region == 'APAC':
        audience_AB_APAC = df_users_testAB.loc[(df_users_testAB['region'] == 'APAC'), 'user_id'].unique().shape[0]
    else:
        audience_AB_CIS = df_users_testAB.loc[(df_users_testAB['region'] == 'CIS'), 'user_id'].unique().shape[0]

audience_AB_Total = audience_AB_APAC + audience_AB_CIS + audience_AB_EU + audience_AB_NAmerica

display(HTML(f"> Total audience in test AB (All regions): <b>{audience_AB_Total}</b>"))

for region in df_users_testAB['region'].unique():
    if region == 'N.America':
        display(HTML(f"> N.America audience rate in test AB: <b>{round((audience_AB_NAmerica / audience_AB_Total) * 100, 3)} %</b>"))
    elif region == 'EU':
        display(HTML(f"> EU audience rate in test AB: <b>{round((audience_AB_EU / audience_AB_Total) * 100, 3)} %</b>"))
    elif region == 'APAC':
        display(HTML(f"> APAC audience rate in test AB: <b>{round((audience_AB_APAC / audience_AB_Total) * 100, 3)} %</b>"))
    else:
        display(HTML(f"> CIS audience rate in test AB: <b>{round((audience_AB_CIS / audience_AB_Total) * 100, 3)} %</b>"))

In [45]:
gray_scale = ['#2b2b2b', '#4d4d4d', '#707070', '#999999', '#bfbfbf']
plot_pie_chart_plotlypx(df_users_testAB, 'region', 'user_id', color_map=gray_scale, labels=True, title="Audience proportions", unique_values=True)

#### **7.2** A/B groups segmentation

In [46]:
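# Get data by groups
# Note: the mask filters on `group` only, so rows from every test listed in `ab_test` are kept
df_users_group_a = df_users_testAB.loc[(df_users_testAB['group'] == 'A'), :]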
df_users_group_a

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test
41,F2BE35774F63059B,2020-12-07,purchase,4.99,2020-12-07,EU,Android,05:23:18,,A,interface_eu_test
271,F80C9BDDEA02E53C,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,10:05:54,,A,interface_eu_test
523,F0715ADC532E14B4,2020-12-07,purchase,4.99,2020-12-07,EU,Android,01:57:58,,A,interface_eu_test
777,8FF91E21E27A330D,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,01:57:30,,A,interface_eu_test
971,0BA0B65F1B8C19F9,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,04:42:52,,A,interface_eu_test
...,...,...,...,...,...,...,...,...,...,...,...
429393,E5589EAE02ACD150,2020-12-29,login,,2020-12-20,EU,Mac,22:17:08,Christmas_New_Year_Promo,A,recommender_system_test
429396,D21F0D4FDCD82DB2,2020-12-29,login,,2020-12-20,EU,iPhone,02:17:00,Christmas_New_Year_Promo,A,recommender_system_test
429407,C15DC7DF26A3300D,2020-12-29,login,,2020-12-20,EU,Android,02:57:31,Christmas_New_Year_Promo,A,interface_eu_test
429411,BCEC881B3C573B2B,2020-12-29,login,,2020-12-20,EU,PC,08:17:16,Christmas_New_Year_Promo,A,interface_eu_test


In [47]:
df_users_group_b = df_users_testAB.loc[(df_users_testAB['group'] == 'B'), :]
df_users_group_b

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test
88,392C03684E704CB3,2020-12-07,purchase,4.99,2020-12-07,EU,PC,09:45:50,,B,interface_eu_test
128,406CD606E407DF53,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,05:12:23,,B,interface_eu_test
134,A4145A2EA1E17654,2020-12-07,purchase,499.99,2020-12-07,EU,PC,16:15:16,,B,interface_eu_test
158,C7CB2F1BA42F102B,2020-12-07,purchase,99.99,2020-12-07,EU,iPhone,03:05:05,,B,recommender_system_test
682,3242BDDFA690A22B,2020-12-07,purchase,4.99,2020-12-07,EU,Android,18:10:54,,B,recommender_system_test
...,...,...,...,...,...,...,...,...,...,...,...
429369,98DEF9BB002BA9E3,2020-12-29,login,,2020-12-20,EU,iPhone,13:16:10,Christmas_New_Year_Promo,B,interface_eu_test
429378,8B02FD26DBC4FDE3,2020-12-29,login,,2020-12-20,EU,PC,10:29:02,Christmas_New_Year_Promo,B,interface_eu_test
429432,8CE3B6FD918462B4,2020-12-29,login,,2020-12-20,EU,PC,17:55:44,Christmas_New_Year_Promo,B,interface_eu_test
429438,2C29721DDDA76B2A,2020-12-29,login,,2020-12-20,EU,iPhone,05:58:20,Christmas_New_Year_Promo,B,interface_eu_test
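
The group tables above contain rows whose `ab_test` value is `interface_eu_test` as well as `recommender_system_test`, and the merge in `In [34]` increased the row count, which suggests that some users take part in both tests. If the analysis is meant to cover only `recommender_system_test`, as the technical description states, the segmentation could also filter on `ab_test`. The following is a minimal sketch on a toy DataFrame; the data values and the names `events`, `TEST_NAME` and `in_test` are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for df_users_testAB; values are illustrative only.
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u4"],
    "group": ["A", "A", "B", "A", None],
    "ab_test": [
        "recommender_system_test",
        "recommender_system_test",
        "interface_eu_test",
        "recommender_system_test",
        None,
    ],
})

TEST_NAME = "recommender_system_test"

# Restrict to the test of interest first, then split by group.
in_test = events["ab_test"] == TEST_NAME
group_a = events.loc[in_test & (events["group"] == "A")]
group_b = events.loc[in_test & (events["group"] == "B")]

print(group_a["user_id"].nunique(), group_b["user_id"].nunique())
```

Filtering this way would change the group sizes and every downstream metric, so it is a design decision to make before the A/A and A/B comparisons rather than after them.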


#### **7.3** A/A Test
An AA test consists of randomly dividing users into two groups (AA1 and AA2) without applying any differentiation between them. Both groups receive the same experience, interface, recommendations, etc.   

AA Test Objective:   
- Randomly divide users into two groups: AA1 and AA2
- Do not apply any differences between them
- Verify that the funnel metrics (login → product_page → product_cart → purchase) are statistically equivalent
- Detect bias, noise, or assignment errors before introducing changes

**7.3.1** AA group segmentation

In [48]:
# Use group A (control) as the base population for the AA1/AA2 split
df_users_aa_base = df_users_group_a.copy()
df_users_aa_base

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test
41,F2BE35774F63059B,2020-12-07,purchase,4.99,2020-12-07,EU,Android,05:23:18,,A,interface_eu_test
271,F80C9BDDEA02E53C,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,10:05:54,,A,interface_eu_test
523,F0715ADC532E14B4,2020-12-07,purchase,4.99,2020-12-07,EU,Android,01:57:58,,A,interface_eu_test
777,8FF91E21E27A330D,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,01:57:30,,A,interface_eu_test
971,0BA0B65F1B8C19F9,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,04:42:52,,A,interface_eu_test
...,...,...,...,...,...,...,...,...,...,...,...
429393,E5589EAE02ACD150,2020-12-29,login,,2020-12-20,EU,Mac,22:17:08,Christmas_New_Year_Promo,A,recommender_system_test
429396,D21F0D4FDCD82DB2,2020-12-29,login,,2020-12-20,EU,iPhone,02:17:00,Christmas_New_Year_Promo,A,recommender_system_test
429407,C15DC7DF26A3300D,2020-12-29,login,,2020-12-20,EU,Android,02:57:31,Christmas_New_Year_Promo,A,interface_eu_test
429411,BCEC881B3C573B2B,2020-12-29,login,,2020-12-20,EU,PC,08:17:16,Christmas_New_Year_Promo,A,interface_eu_test


In [49]:
# Get unique users and shuffle them
unique_users = df_users_aa_base[['user_id']].drop_duplicates()
unique_users = unique_users.sample(frac=1.0, random_state=42).reset_index(drop=True)
unique_users

Unnamed: 0,user_id
0,0BC3EEE32A4E6B32
1,AEE08F288177755C
2,B263BC5957738D68
3,B043E52A918CDD6E
4,58202E3192987E3F
...,...
3885,B667BF6049381019
3886,B8E0351B91F1DDC2
3887,06D2B163CB560FAC
3888,EA59CBC69BD2351F


In [50]:
# Assign AA1 or AA2 alternately
unique_users['aa_group'] = np.where(unique_users.index % 2 == 0, 'AA1', 'AA2')
unique_users

Unnamed: 0,user_id,aa_group
0,0BC3EEE32A4E6B32,AA1
1,AEE08F288177755C,AA2
2,B263BC5957738D68,AA1
3,B043E52A918CDD6E,AA2
4,58202E3192987E3F,AA1
...,...,...
3885,B667BF6049381019,AA2
3886,B8E0351B91F1DDC2,AA1
3887,06D2B163CB560FAC,AA2
3888,EA59CBC69BD2351F,AA1


In [51]:
# Add the aa_group data for each user
df_users_aa_base = df_users_aa_base.merge(unique_users, on='user_id', how='left')
df_users_aa_base

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test,aa_group
0,F2BE35774F63059B,2020-12-07,purchase,4.99,2020-12-07,EU,Android,05:23:18,,A,interface_eu_test,AA1
1,F80C9BDDEA02E53C,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,10:05:54,,A,interface_eu_test,AA2
2,F0715ADC532E14B4,2020-12-07,purchase,4.99,2020-12-07,EU,Android,01:57:58,,A,interface_eu_test,AA1
3,8FF91E21E27A330D,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,01:57:30,,A,interface_eu_test,AA1
4,0BA0B65F1B8C19F9,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,04:42:52,,A,interface_eu_test,AA1
...,...,...,...,...,...,...,...,...,...,...,...,...
14820,E5589EAE02ACD150,2020-12-29,login,,2020-12-20,EU,Mac,22:17:08,Christmas_New_Year_Promo,A,recommender_system_test,AA1
14821,D21F0D4FDCD82DB2,2020-12-29,login,,2020-12-20,EU,iPhone,02:17:00,Christmas_New_Year_Promo,A,recommender_system_test,AA1
14822,C15DC7DF26A3300D,2020-12-29,login,,2020-12-20,EU,Android,02:57:31,Christmas_New_Year_Promo,A,interface_eu_test,AA2
14823,BCEC881B3C573B2B,2020-12-29,login,,2020-12-20,EU,PC,08:17:16,Christmas_New_Year_Promo,A,interface_eu_test,AA1


In [52]:
df_users_aa_a1 = df_users_aa_base.loc[(df_users_aa_base['aa_group'] =='AA1'), :]
df_users_aa_a1

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test,aa_group
0,F2BE35774F63059B,2020-12-07,purchase,4.99,2020-12-07,EU,Android,05:23:18,,A,interface_eu_test,AA1
2,F0715ADC532E14B4,2020-12-07,purchase,4.99,2020-12-07,EU,Android,01:57:58,,A,interface_eu_test,AA1
3,8FF91E21E27A330D,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,01:57:30,,A,interface_eu_test,AA1
4,0BA0B65F1B8C19F9,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,04:42:52,,A,interface_eu_test,AA1
5,FF1AF3B4FB596E23,2020-12-07,purchase,9.99,2020-12-07,EU,iPhone,05:19:43,,A,interface_eu_test,AA1
...,...,...,...,...,...,...,...,...,...,...,...,...
14819,317765BEF86DA9AE,2020-12-29,login,,2020-12-20,EU,Android,04:49:26,Christmas_New_Year_Promo,A,interface_eu_test,AA1
14820,E5589EAE02ACD150,2020-12-29,login,,2020-12-20,EU,Mac,22:17:08,Christmas_New_Year_Promo,A,recommender_system_test,AA1
14821,D21F0D4FDCD82DB2,2020-12-29,login,,2020-12-20,EU,iPhone,02:17:00,Christmas_New_Year_Promo,A,recommender_system_test,AA1
14823,BCEC881B3C573B2B,2020-12-29,login,,2020-12-20,EU,PC,08:17:16,Christmas_New_Year_Promo,A,interface_eu_test,AA1


In [53]:
df_users_aa_a2 = df_users_aa_base.loc[(df_users_aa_base['aa_group'] =='AA2'), :]
df_users_aa_a2

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test,aa_group
1,F80C9BDDEA02E53C,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,10:05:54,,A,interface_eu_test,AA2
9,79FE3925D374219B,2020-12-08,purchase,4.99,2020-12-07,EU,Android,04:23:40,,A,interface_eu_test,AA2
10,AEA1CE413D40523C,2020-12-08,purchase,9.99,2020-12-07,EU,Android,16:48:06,,A,interface_eu_test,AA2
11,A5054CA4405CE915,2020-12-08,purchase,4.99,2020-12-07,EU,PC,21:07:14,,A,interface_eu_test,AA2
12,EB55FA9F26451B26,2020-12-08,purchase,4.99,2020-12-07,EU,PC,02:55:57,,A,interface_eu_test,AA2
...,...,...,...,...,...,...,...,...,...,...,...,...
14806,31B39445D687BB84,2020-12-29,login,,2020-12-20,APAC,PC,16:06:45,,A,recommender_system_test,AA2
14811,1D509D446FEFD7D2,2020-12-29,login,,2020-12-20,EU,PC,14:16:45,Christmas_New_Year_Promo,A,interface_eu_test,AA2
14817,36EDA624DB7B7F90,2020-12-29,login,,2020-12-20,EU,PC,20:54:25,Christmas_New_Year_Promo,A,recommender_system_test,AA2
14818,36EDA624DB7B7F90,2020-12-29,login,,2020-12-20,EU,PC,20:54:25,Christmas_New_Year_Promo,A,interface_eu_test,AA2


In [54]:
# Check if there are the same users in both samples
display(HTML(f"> Users in group <i>AA1</i>: <b>{df_users_aa_a1['user_id'].nunique()}</b>"))
display(HTML(f"> Users in group <i>AA2</i>: <b>{df_users_aa_a2['user_id'].nunique()}</b>"))

duplicates = set(df_users_aa_a1['user_id'].unique()) & set(df_users_aa_a2['user_id'].unique())

display(HTML(f"> Duplicate users between AA1 and AA2: <b>{len(duplicates)}</b>"))
display(HTML(f"> Duplicate rate: <b>{round((len(duplicates) * 100 / (df_users_aa_a1['user_id'].nunique() + df_users_aa_a2['user_id'].nunique())), 3)} %</b>"))


In [55]:
# Check group balance
total = df_users_aa_a1['user_id'].nunique() + df_users_aa_a2['user_id'].nunique()
difference = df_users_aa_a1['user_id'].nunique() - df_users_aa_a2['user_id'].nunique()
rate = (difference / total) * 100

display(HTML(f"> Group balance verification: <i>AA1 group</i> (<b>{df_users_aa_a1['user_id'].nunique()}</b>), <i>AA2 group</i> (<b>{df_users_aa_a2['user_id'].nunique()}</b>)"))
display(HTML(f"> Group balance difference rate: <i>Difference</i> <b>{difference}</b>, Rate <b>{round(rate, 3)} %</b>"))

if abs(rate) < 5:
    display(HTML(f"> Groups are balanced."))
else:
    display(HTML(f"> Groups are unbalanced."))

**7.3.2** AA groups conversion comparison

In [56]:
df_funnel_AA1 = df_users_aa_a1
df_funnel_AA1 = df_funnel_AA1.groupby('event_name', observed=True).agg(events=('event_name', 'count'), users=('user_id', 'nunique')).sort_values(by='users', ascending=False).reset_index()
df_funnel_AA1['conversion_rate'] = ((df_funnel_AA1['users'] / df_funnel_AA1['events']) * 100).round(3)
df_funnel_AA1['total_conversion_rate'] = ((df_funnel_AA1['users'] / df_funnel_AA1.loc[0, 'users']) * 100).round(3)
df_funnel_AA1['stage_conversion_rate'] = ((df_funnel_AA1['users'] / df_funnel_AA1['users'].shift(1)) * 100).round(3)
df_funnel_AA1['drop_rate'] = (100 - df_funnel_AA1['stage_conversion_rate']).round(3)
df_funnel_AA1

Unnamed: 0,event_name,events,users,conversion_rate,total_conversion_rate,stage_conversion_rate,drop_rate
0,login,4639,1938,41.776,100.0,,
1,product_page,2152,1098,51.022,56.656,56.656,43.344
2,product_cart,697,374,53.659,19.298,34.062,65.938
3,purchase,8,8,100.0,0.413,2.139,97.861


In [57]:
plot_horizontal_bar_plotpx(df_funnel_AA1, x='users', y='event_name', title='AA-Test (AA1) Conversion Funnel', xlabel='Users', ylabel='Events', sort=True, color='lightgrey')

In [58]:
plot_horizontal_bar_plotpx(df_funnel_AA1, x='stage_conversion_rate', y='event_name', title='AA-Test (AA1) Stage Conversion Rate Funnel', xlabel='Rate', ylabel='Events', sort=True, color='lightgrey')

In [59]:
plot_horizontal_bar_plotpx(df_funnel_AA1.sort_values(by='drop_rate', ascending=False), x='drop_rate', y='event_name', title='AA-Test (AA1) Drop Rate Funnel', xlabel='Rate', ylabel='Events', sort=False, color='lightgrey')

In [60]:
df_funnel_AA2 = df_users_aa_a2
df_funnel_AA2 = df_funnel_AA2.groupby('event_name', observed=True).agg(events=('event_name', 'count'), users=('user_id', 'nunique')).sort_values(by='users', ascending=False).reset_index()
df_funnel_AA2['conversion_rate'] = ((df_funnel_AA2['users'] / df_funnel_AA2['events']) * 100).round(3)
df_funnel_AA2['total_conversion_rate'] = ((df_funnel_AA2['users'] / df_funnel_AA2.loc[0, 'users']) * 100).round(3)
df_funnel_AA2['stage_conversion_rate'] = ((df_funnel_AA2['users'] / df_funnel_AA2['users'].shift(1)) * 100).round(3)
df_funnel_AA2['drop_rate'] = (100 - df_funnel_AA2['stage_conversion_rate']).round(3)
df_funnel_AA2

Unnamed: 0,event_name,events,users,conversion_rate,total_conversion_rate,stage_conversion_rate,drop_rate
0,login,4561,1939,42.513,100.0,,
1,product_page,2080,1103,53.029,56.885,56.885,43.115
2,product_cart,680,398,58.529,20.526,36.083,63.917
3,purchase,8,7,87.5,0.361,1.759,98.241


In [61]:
plot_horizontal_bar_plotpx(df_funnel_AA2, x='users', y='event_name', title='AA-Test (AA2) Conversion Funnel', xlabel='Users', ylabel='Events', sort=True, color='darkgrey')

In [62]:
plot_horizontal_bar_plotpx(df_funnel_AA2, x='stage_conversion_rate', y='event_name', title='AA-Test (AA2) Stage Conversion Rate Funnel', xlabel='Rate', ylabel='Events', sort=True, color='darkgrey')

In [63]:
plot_horizontal_bar_plotpx(df_funnel_AA2.sort_values(by='drop_rate', ascending=False), x='drop_rate', y='event_name', title='AA-Test (AA2) Drop Rate Funnel', xlabel='Rate', ylabel='Events', sort=False, color='darkgrey')

**7.3.3** Z-Test for proportions comparison between AA1 and AA2 groups

In [64]:
# Z test

# 1. Hypotheses H₀, H₁
# H₀: The proportion of users reaching each event is the same in group AA1 and group AA2.
# H₁: The proportion of users reaching each event differs between group AA1 and group AA2.

# 2. Specify Significance or Confidence
# alpha = 5%
# confidence = 95%

alpha = 0.05

# 3. Compute the test statistic (z-score) and p-value for each event

results = []

# Total number of users in each group (make sure it is not empty)
try:
    total_users_1 = int(df_funnel_AA1.loc[df_funnel_AA1['event_name'] == 'login', 'users'].values[0])
    total_users_2 = int(df_funnel_AA2.loc[df_funnel_AA2['event_name'] == 'login', 'users'].values[0])
except IndexError:
    raise ValueError("The 'login' event was not found in one of the groups.")

for event in df_funnel_AA1['event_name']:
    try:
        count1 = int(df_funnel_AA1.loc[df_funnel_AA1['event_name'] == event, 'users'].values[0])
        count2 = int(df_funnel_AA2.loc[df_funnel_AA2['event_name'] == event, 'users'].values[0])

        # Validation: avoid division by zero
        if total_users_1 == 0 or total_users_2 == 0:
            stat, pval = np.nan, np.nan
        else:
            stat, pval = proportions_ztest(count=[count1, count2], nobs=[total_users_1, total_users_2])

        results.append({
            'event': event,
            'AA1_users': count1,
            'AA2_users': count2,
            'z_score': round(stat, 4) if not np.isnan(stat) else None,
            'p_value': round(pval, 4) if not np.isnan(pval) else None
        })

    except IndexError:
        print(f"Event '{event}' not found in both groups. Ignored.")
    
df_ztest = pd.DataFrame(results)
df_ztest


invalid value encountered in scalar divide



Unnamed: 0,event,AA1_users,AA2_users,z_score,p_value
0,login,1938,1939,,
1,product_page,1098,1103,-0.1437,0.8857
2,product_cart,374,398,-0.9572,0.3385
3,purchase,8,7,0.2597,0.7951
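
To make the z-scores in the table above easier to audit, here is a minimal sketch of the pooled two-proportion z-test written with the standard library only. It is an illustration of the textbook formula, not the statsmodels implementation; the function name `two_proportion_ztest` is hypothetical. The inputs are the `product_page` counts from the table.

```python
import math


def two_proportion_ztest(count1, nobs1, count2, nobs2):
    """Pooled two-sided z-test for the difference between two proportions."""
    p1 = count1 / nobs1
    p2 = count2 / nobs2
    # Pooled proportion under H0 (both groups share the same true rate)
    pooled = (count1 + count2) / (nobs1 + nobs2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    if se == 0:
        # Happens when every user (or no user) has the event in both groups
        return float("nan"), float("nan")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = two_proportion_ztest(1098, 1938, 1103, 1939)
print(round(z, 4), round(p, 4))
```

The zero standard error branch also explains the empty `login` row: every user in both groups has a `login` event, so the pooled proportion is 1 and the statistic is undefined, which is consistent with the "invalid value encountered in scalar divide" warning.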


In [65]:
for index, row in df_ztest.iterrows():
    if row['event'] == 'login':
        continue
    elif row['p_value'] < alpha:
        display(HTML(f"> H₀ is rejected in favor of H₁: there is enough statistical evidence that the proportion of users with <i>{row['event']}</i> events differs between group AA1 and group AA2."))
    else:
        display(HTML(f"> H₀ is not rejected: there is not enough statistical evidence that the proportion of users with <i>{row['event']}</i> events differs between group AA1 and group AA2."))


#### **7.4** A/B Test

- Expected Outcome: __Within 14 days of enrollment__, users will show improved conversion rates for product page views (the product_page event), adding items to the shopping cart (product_cart), and purchases (purchase). At each stage of the `product_page → product_cart → purchase` __funnel__, there will be _at least a 10% increase_.
- Expected number of test participants: 6,000
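
The expected outcome can be read in more than one way. The sketch below assumes that "at least a 10% increase" means a relative lift of group B over group A at a given stage, not a difference of 10 percentage points. The function names `relative_lift` and `meets_target` and the example rates are hypothetical.

```python
def relative_lift(rate_a, rate_b):
    """Relative change of group B's conversion rate over group A's."""
    if rate_a == 0:
        raise ValueError("Group A rate is zero; relative lift is undefined.")
    return (rate_b - rate_a) / rate_a


def meets_target(rate_a, rate_b, target=0.10):
    """True if B improves on A by at least `target` (relative, assumed reading)."""
    return relative_lift(rate_a, rate_b) >= target


# Illustrative rates only, not results from the dataset
print(meets_target(0.50, 0.56))  # lift of 12%
print(meets_target(0.50, 0.52))  # lift of 4%
```

A lift that meets the target still needs the significance test, since a 10% relative difference can arise by chance in small samples.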

**7.4.1** AB group segmentation

In [66]:
# Group A and B Overview
df_users_group_a

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test
41,F2BE35774F63059B,2020-12-07,purchase,4.99,2020-12-07,EU,Android,05:23:18,,A,interface_eu_test
271,F80C9BDDEA02E53C,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,10:05:54,,A,interface_eu_test
523,F0715ADC532E14B4,2020-12-07,purchase,4.99,2020-12-07,EU,Android,01:57:58,,A,interface_eu_test
777,8FF91E21E27A330D,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,01:57:30,,A,interface_eu_test
971,0BA0B65F1B8C19F9,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,04:42:52,,A,interface_eu_test
...,...,...,...,...,...,...,...,...,...,...,...
429393,E5589EAE02ACD150,2020-12-29,login,,2020-12-20,EU,Mac,22:17:08,Christmas_New_Year_Promo,A,recommender_system_test
429396,D21F0D4FDCD82DB2,2020-12-29,login,,2020-12-20,EU,iPhone,02:17:00,Christmas_New_Year_Promo,A,recommender_system_test
429407,C15DC7DF26A3300D,2020-12-29,login,,2020-12-20,EU,Android,02:57:31,Christmas_New_Year_Promo,A,interface_eu_test
429411,BCEC881B3C573B2B,2020-12-29,login,,2020-12-20,EU,PC,08:17:16,Christmas_New_Year_Promo,A,interface_eu_test


In [67]:
df_users_group_b

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test
88,392C03684E704CB3,2020-12-07,purchase,4.99,2020-12-07,EU,PC,09:45:50,,B,interface_eu_test
128,406CD606E407DF53,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,05:12:23,,B,interface_eu_test
134,A4145A2EA1E17654,2020-12-07,purchase,499.99,2020-12-07,EU,PC,16:15:16,,B,interface_eu_test
158,C7CB2F1BA42F102B,2020-12-07,purchase,99.99,2020-12-07,EU,iPhone,03:05:05,,B,recommender_system_test
682,3242BDDFA690A22B,2020-12-07,purchase,4.99,2020-12-07,EU,Android,18:10:54,,B,recommender_system_test
...,...,...,...,...,...,...,...,...,...,...,...
429369,98DEF9BB002BA9E3,2020-12-29,login,,2020-12-20,EU,iPhone,13:16:10,Christmas_New_Year_Promo,B,interface_eu_test
429378,8B02FD26DBC4FDE3,2020-12-29,login,,2020-12-20,EU,PC,10:29:02,Christmas_New_Year_Promo,B,interface_eu_test
429432,8CE3B6FD918462B4,2020-12-29,login,,2020-12-20,EU,PC,17:55:44,Christmas_New_Year_Promo,B,interface_eu_test
429438,2C29721DDDA76B2A,2020-12-29,login,,2020-12-20,EU,iPhone,05:58:20,Christmas_New_Year_Promo,B,interface_eu_test


In [68]:
# Check if there are the same users in both samples
display(HTML(f"> Users in group <i>A</i>: <b>{df_users_group_a['user_id'].nunique()}</b>"))
display(HTML(f"> Users in group <i>B</i>: <b>{df_users_group_b['user_id'].nunique()}</b>"))

duplicates = set(df_users_group_a['user_id'].unique()) & set(df_users_group_b['user_id'].unique())

display(HTML(f"> Duplicate users between A and B: <b>{len(duplicates)}</b>"))
display(HTML(f"> Duplicate rate: <b>{round((len(duplicates) * 100 / (df_users_group_a['user_id'].nunique() + df_users_group_b['user_id'].nunique())), 3)} %</b>"))

`LSPL`   
__Note__: Some users appear in both group A and group B.

Option 1: Eliminate duplicate users   
✔️ Advantages:   
- Preserves independence between groups
- Avoids bias in comparative metrics

❌ Disadvantages:   
- Reduces sample size
- Can eliminate valuable users if the number of duplicates is high

Option 2: Reassign Duplicate Users   
✔️ Advantages:   
- Retains the full sample size
- Useful if duplicates are few and you can assign reproducibly

❌ Disadvantages:   
- Risk of bias if the reassignment is not random
- Can contaminate metrics if the user has already interacted with both groups

✅ Recommendation   
- If duplicates are less than 1–2%, remove them.
- If they are more than 5%, consider remapping with reproducible logic (e.g., user_id hashes).
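
The "reproducible logic (e.g., user_id hashes)" mentioned above could look like the following minimal sketch. It is not what the notebook does (the notebook removes duplicates); it only illustrates how a hash makes the reassignment deterministic. The function name `assign_group` and the `salt` value are assumptions.

```python
import hashlib


def assign_group(user_id, salt="recommender_system_test"):
    """Deterministically map a user_id to 'A' or 'B' using a SHA-256 hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Even hash values go to A, odd values go to B
    return "A" if int(digest, 16) % 2 == 0 else "B"


# The same user always lands in the same group across runs
print(assign_group("F80C9BDDEA02E53C") == assign_group("F80C9BDDEA02E53C"))
```

Changing the salt yields a different but equally reproducible split, which is useful when the same users take part in more than one experiment.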

In [69]:
df_users_group_a = df_users_group_a.loc[~(df_users_group_a['user_id'].isin(duplicates)), :]
df_users_group_a

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test
41,F2BE35774F63059B,2020-12-07,purchase,4.99,2020-12-07,EU,Android,05:23:18,,A,interface_eu_test
271,F80C9BDDEA02E53C,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,10:05:54,,A,interface_eu_test
523,F0715ADC532E14B4,2020-12-07,purchase,4.99,2020-12-07,EU,Android,01:57:58,,A,interface_eu_test
971,0BA0B65F1B8C19F9,2020-12-07,purchase,4.99,2020-12-07,EU,Mac,04:42:52,,A,interface_eu_test
992,FF1AF3B4FB596E23,2020-12-07,purchase,9.99,2020-12-07,EU,iPhone,05:19:43,,A,interface_eu_test
...,...,...,...,...,...,...,...,...,...,...,...
429385,317765BEF86DA9AE,2020-12-29,login,,2020-12-20,EU,Android,04:49:26,Christmas_New_Year_Promo,A,interface_eu_test
429393,E5589EAE02ACD150,2020-12-29,login,,2020-12-20,EU,Mac,22:17:08,Christmas_New_Year_Promo,A,recommender_system_test
429396,D21F0D4FDCD82DB2,2020-12-29,login,,2020-12-20,EU,iPhone,02:17:00,Christmas_New_Year_Promo,A,recommender_system_test
429407,C15DC7DF26A3300D,2020-12-29,login,,2020-12-20,EU,Android,02:57:31,Christmas_New_Year_Promo,A,interface_eu_test


In [70]:
df_users_group_b = df_users_group_b.loc[~(df_users_group_b['user_id'].isin(duplicates)), :]
df_users_group_b

Unnamed: 0,user_id,event_dt,event_name,details,first_date,region,device,event_tm,mkt_event_name,group,ab_test
88,392C03684E704CB3,2020-12-07,purchase,4.99,2020-12-07,EU,PC,09:45:50,,B,interface_eu_test
128,406CD606E407DF53,2020-12-07,purchase,4.99,2020-12-07,EU,iPhone,05:12:23,,B,interface_eu_test
134,A4145A2EA1E17654,2020-12-07,purchase,499.99,2020-12-07,EU,PC,16:15:16,,B,interface_eu_test
158,C7CB2F1BA42F102B,2020-12-07,purchase,99.99,2020-12-07,EU,iPhone,03:05:05,,B,recommender_system_test
682,3242BDDFA690A22B,2020-12-07,purchase,4.99,2020-12-07,EU,Android,18:10:54,,B,recommender_system_test
...,...,...,...,...,...,...,...,...,...,...,...
429354,78CF36A980245CE5,2020-12-29,login,,2020-12-20,EU,iPhone,08:20:28,Christmas_New_Year_Promo,B,interface_eu_test
429369,98DEF9BB002BA9E3,2020-12-29,login,,2020-12-20,EU,iPhone,13:16:10,Christmas_New_Year_Promo,B,interface_eu_test
429378,8B02FD26DBC4FDE3,2020-12-29,login,,2020-12-20,EU,PC,10:29:02,Christmas_New_Year_Promo,B,interface_eu_test
429432,8CE3B6FD918462B4,2020-12-29,login,,2020-12-20,EU,PC,17:55:44,Christmas_New_Year_Promo,B,interface_eu_test


In [71]:
# Check group balance
total = df_users_group_a['user_id'].nunique() + df_users_group_b['user_id'].nunique()
difference = df_users_group_a['user_id'].nunique() - df_users_group_b['user_id'].nunique()
rate = (difference / total) * 100

display(HTML(f"> Group balance verification: <i>A group</i> (<b>{df_users_group_a['user_id'].nunique()}</b>), <i>B group</i> (<b>{df_users_group_b['user_id'].nunique()}</b>)"))
display(HTML(f"> Group balance difference rate: <i>Difference</i> <b>{difference}</b>, Rate <b>{round(rate, 3)} %</b>"))

if abs(rate) < 5:
    display(HTML(f"> Groups are balanced."))
else:
    display(HTML(f"> Groups are unbalanced."))

`LSPL`   
__Note__: The groups were unbalanced after duplicate users were removed from both.

Strategies for managing imbalance

1. Subsampling the largest group: Reduce group A to match the size of B   
✔️ Preserves independence ✔️ Balances sizes ❌ Loses data from group A    

2. Statistical Weighting: Instead of trimming, you can apply weights inversely proportional to the size of each group in the analysis   
✔️ Preserves all data ✔️ Useful for models or regressions ❌ More complex to implement in simple z-tests   

3. Controlled Reassignment: If the imbalance is due to deleted duplicates, you can rebuild the assignment with reproducible logic   
✔️ Perfect balance ✔️ Preserves events ❌ Requires justifying reassignment

✅ Recommendation

| Imbalance | Suggested Action |
|-----------|------------------|
| <5% | Acceptable, continue |
| 5–10% | Consider weighting or subsampling |
| >10% | Recommend reallocating or subsampling |
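
The thresholds in the table above can be expressed as a small helper. This is a minimal sketch, not part of the original notebook: the function name `imbalance_action` and the example user counts are hypothetical, and the rate follows the same formula as the balance check (difference over total users), taken in absolute value.

```python
def imbalance_action(n_a, n_b):
    """Return the absolute imbalance rate (%) and the action suggested by the table."""
    total = n_a + n_b
    if total == 0:
        raise ValueError("Both groups are empty.")
    rate = abs(n_a - n_b) / total * 100
    if rate < 5:
        action = "Acceptable, continue"
    elif rate <= 10:
        action = "Consider weighting or subsampling"
    else:
        action = "Recommend reallocating or subsampling"
    return rate, action

# Hypothetical user counts, for illustration only
rate, action = imbalance_action(3000, 2000)
print(f"{rate:.1f}% -> {action}")
```

Using the absolute difference means the helper also flags the case where group B is the larger group.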

In [72]:
# Groups A and B were already assigned, so reassigning users would distort the experiment.
# Group A is therefore subsampled to approximately match the size of group B.
# This helps with:
# - Group independence: no bias is introduced through reassignment.
# - Statistical validity: tests such as the z-test or chi-square test can be applied without worrying about cross-contamination.
# - Reproducibility: the sampling can be documented and repeated with random_state.

# Subsample users from A to match the size of B
# df_users_group_a = df_users_group_a.sample(n=len(df_users_group_b), random_state=42)
# df_users_group_a

# Step 1: Calculate the target size for A
n_b = len(df_users_group_b['user_id'].drop_duplicates())
unique_users_a = df_users_group_a['user_id'].drop_duplicates()
# Keep at least 3090 users, but never request more users than group A contains
target_n_a = min(max(n_b, 3090), len(unique_users_a))

# Step 2: Subsample unique users from A
users_a_sampled = unique_users_a.sample(n=target_n_a, random_state=42)

# Step 3: Filter events from those users
df_users_group_a = df_users_group_a[df_users_group_a['user_id'].isin(users_a_sampled)]


In [73]:
# Check group balance
total = df_users_group_a['user_id'].nunique() + df_users_group_b['user_id'].nunique()
difference = df_users_group_a['user_id'].nunique() - df_users_group_b['user_id'].nunique()
rate = (difference / total) * 100

display(HTML(f"> Group balance verification: <i>A group</i> (<b>{df_users_group_a['user_id'].nunique()}</b>), <i>B group</i> (<b>{df_users_group_b['user_id'].nunique()}</b>)"))
display(HTML(f"> Group balance difference rate: <i>Difference</i> <b>{difference}</b>, Rate <b>{round(rate, 3)} %</b>"))

if abs(rate) < 5:  # use the absolute imbalance so a larger group B is also flagged
    display(HTML(f"> Groups are balanced."))
else:
    display(HTML(f"> Groups are unbalanced."))

**7.4.2** AB groups conversion comparison

In [74]:
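# Build the funnel for group A: events and unique users per event, ordered by user count
df_funnel_A = df_users_group_a
df_funnel_A = df_funnel_A.groupby('event_name', observed=True).agg(events=('event_name', 'count'), users=('user_id', 'nunique')).sort_values(by='users', ascending=False).reset_index()
# Unique users per 100 events of this type (not a stage-to-stage conversion)
df_funnel_A['conversion_rate'] = ((df_funnel_A['users'] / df_funnel_A['events']) * 100).round(3)
# Share of users reaching each stage relative to the first stage (login)
df_funnel_A['total_conversion_rate'] = ((df_funnel_A['users'] / df_funnel_A.loc[0, 'users']) * 100).round(3)
# Share of users reaching each stage relative to the previous stage
df_funnel_A['stage_conversion_rate'] = ((df_funnel_A['users'] / df_funnel_A['users'].shift(1)) * 100).round(3)
# Share of users lost between consecutive stages
df_funnel_A['drop_rate'] = (100 - df_funnel_A['stage_conversion_rate']).round(3)
df_funnel_A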

Unnamed: 0,event_name,events,users,conversion_rate,total_conversion_rate,stage_conversion_rate,drop_rate
0,login,7256,3080,42.448,100.0,,
1,product_page,3421,1779,52.002,57.76,57.76,42.24
2,product_cart,1137,634,55.761,20.584,35.638,64.362
3,purchase,13,12,92.308,0.39,1.893,98.107


In [75]:
plot_horizontal_bar_plotpx(df_funnel_A, x='users', y='event_name', title='AB-Test (A) Conversion Funnel', xlabel='Users', ylabel='Events', sort=True)

In [76]:
plot_horizontal_bar_plotpx(df_funnel_A, x='stage_conversion_rate', y='event_name', title='AB-Test (A) Stage Conversion Rate Funnel', xlabel='Rate', ylabel='Events', sort=True)

In [77]:
plot_horizontal_bar_plotpx(df_funnel_A.sort_values(by='drop_rate', ascending=False), x='drop_rate', y='event_name', title='AB-Test (A) Drop Rate Funnel', xlabel='Drop rate (%)', ylabel='Events', sort=False)

In [78]:
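# Build the funnel for group B: events and unique users per event, ordered by user count
df_funnel_B = df_users_group_b
df_funnel_B = df_funnel_B.groupby('event_name', observed=True).agg(events=('event_name', 'count'), users=('user_id', 'nunique')).sort_values(by='users', ascending=False).reset_index()
# Unique users per 100 events of this type (not a stage-to-stage conversion)
df_funnel_B['conversion_rate'] = ((df_funnel_B['users'] / df_funnel_B['events']) * 100).round(3)
# Share of users reaching each stage relative to the first stage (login)
df_funnel_B['total_conversion_rate'] = ((df_funnel_B['users'] / df_funnel_B.loc[0, 'users']) * 100).round(3)
# Share of users reaching each stage relative to the previous stage
df_funnel_B['stage_conversion_rate'] = ((df_funnel_B['users'] / df_funnel_B['users'].shift(1)) * 100).round(3)
# Share of users lost between consecutive stages
df_funnel_B['drop_rate'] = (100 - df_funnel_B['stage_conversion_rate']).round(3)
df_funnel_B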

Unnamed: 0,event_name,events,users,conversion_rate,total_conversion_rate,stage_conversion_rate,drop_rate
0,login,6690,2826,42.242,100.0,,
1,product_page,2971,1551,52.205,54.883,54.883,45.117
2,product_cart,1055,608,57.63,21.515,39.201,60.799
3,purchase,21,19,90.476,0.672,3.125,96.875
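
As a cross-check of the funnel tables, the stage-to-stage rates can be recomputed from the unique-user counts alone. This is a minimal sketch in plain Python (the helper name `stage_rates` is hypothetical); the user counts are copied from the `users` column of the two funnel tables.

```python
def stage_rates(users):
    """Percentage of users who move from each funnel stage to the next one."""
    return [round(curr / prev * 100, 3) for prev, curr in zip(users, users[1:])]

# Unique users per stage: login, product_page, product_cart, purchase
users_a = [3080, 1779, 634, 12]
users_b = [2826, 1551, 608, 19]

for name, users in (("A", users_a), ("B", users_b)):
    print(name, stage_rates(users))
```

The printed values should agree with the `stage_conversion_rate` column of each table, which makes it easy to spot a mis-sorted funnel or a stage with missing events.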


In [79]:
plot_horizontal_bar_plotpx(df_funnel_B, x='users', y='event_name', title='AB-Test (B) Conversion Funnel', xlabel='Users', ylabel='Events', sort=True, color='black')

In [80]:
plot_horizontal_bar_plotpx(df_funnel_B, x='stage_conversion_rate', y='event_name', title='AB-Test (B) Stage Conversion Rate Funnel', xlabel='Rate', ylabel='Events', sort=True, color='black')

In [81]:
plot_horizontal_bar_plotpx(df_funnel_B.sort_values(by='drop_rate', ascending=False), x='drop_rate', y='event_name', title='AB-Test (B) Drop Rate Funnel', xlabel='Drop rate (%)', ylabel='Events', sort=False, color='black')

**7.4.3** Z-Test for proportions comparison between A and B groups

In [82]:
# Z test

# 1. Hypotheses H₀, H₁
# H₀: The proportion of users reaching the event in group B is less than or equal to that in group A.
# H₁: The proportion of users reaching the event in group B is greater than in group A, with a relative increase of at least 10%.

# 2. Specify Significance or Confidence
# alpha = 5%
# confidence = 95%

alpha = 0.05
z_critical = norm.ppf(1 - alpha) # Critical value for a one-sided (right-tailed) test

# 3. Calculate critical and test values, define acceptance and rejection zones

results = []

# Total number of users in each group (make sure it is not empty)
try:
    total_users_A = int(df_funnel_A.loc[df_funnel_A['event_name'] == 'login', 'users'].values[0])
    total_users_B = int(df_funnel_B.loc[df_funnel_B['event_name'] == 'login', 'users'].values[0])
except IndexError:
    raise ValueError("The 'login' event was not found in one of the groups.")

for event in df_funnel_A['event_name']:
    try:
        countA = int(df_funnel_A.loc[df_funnel_A['event_name'] == event, 'users'].values[0])
        countB = int(df_funnel_B.loc[df_funnel_B['event_name'] == event, 'users'].values[0])

        if total_users_A == 0 or total_users_B == 0:
            stat, pval, decision = np.nan, np.nan, 'Invalid'
        else:
            propA = countA / total_users_A
            propB = countB / total_users_B
            
            diff_observed = propB - propA
            expected_min_increase = 0.10 * propA

            # alternative='larger' tests whether the first proportion exceeds the second,
            # so group B goes first to test H₁: B > A
            stat, pval = proportions_ztest(count=[countB, countA], nobs=[total_users_B, total_users_A], alternative='larger')
            
            if stat >= z_critical and diff_observed >= expected_min_increase:
                decision = 'Reject H₀ (B increased ≥10%)'
            else:
                decision = 'Fail to reject H₀'

        results.append({
            'event': event,
            'A_users': countA,
            'B_users': countB,
            'prop_A': round(propA, 4),
            'prop_B': round(propB, 4),
            'diff_observed': round(diff_observed, 4),
            'expected_min_increase': round(expected_min_increase, 4),
            'z_score': round(stat, 4) if not np.isnan(stat) else None,
            'p_value': round(pval, 4) if not np.isnan(pval) else None,
            'decision': decision
        })

    except IndexError:
        print(f"Event '{event}' not found in both groups. Ignored.")
    
df_ztest = pd.DataFrame(results)
df_ztest


invalid value encountered in scalar divide



Unnamed: 0,event,A_users,B_users,prop_A,prop_B,diff_observed,expected_min_increase,z_score,p_value,decision
0,login,3080,2826,1.0,1.0,0.0,0.1,,,Fail to reject H₀
1,product_page,1779,1551,0.5776,0.5488,-0.0288,0.0578,-2.2268,0.987,Fail to reject H₀
2,product_cart,634,608,0.2058,0.2151,0.0093,0.0206,0.8762,0.1905,Fail to reject H₀
3,purchase,12,19,0.0039,0.0067,0.0028,0.0004,1.502,0.0665,Fail to reject H₀
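
The pooled two-proportion z-statistic behind this test can be reproduced by hand, which helps to check the direction of the one-sided alternative. This is a minimal sketch using only the standard library: the helper name `ztest_b_greater_a` is hypothetical, and the pooled-variance formula is assumed to match the default behavior of `proportions_ztest`. The counts are the *purchase* users from the funnel tables, and the statistic is oriented as B minus A, so positive values favor group B.

```python
from math import sqrt
from statistics import NormalDist

def ztest_b_greater_a(count_a, n_a, count_b, n_b):
    """One-sided pooled z-test for H1: p_B > p_A. Returns (z, p_value)."""
    p_a, p_b = count_a / n_a, count_b / n_b
    # Pooled proportion under H0 (both groups share the same rate)
    p_pool = (count_a + count_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Right-tailed p-value: probability of a statistic at least this large under H0
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

z, p_value = ztest_b_greater_a(count_a=12, n_a=3080, count_b=19, n_b=2826)
z_critical = NormalDist().inv_cdf(0.95)
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {z >= z_critical}")
```

The sketch does not handle the degenerate case where the pooled proportion is 0 or 1 (as for *login*), where the standard error is zero and the statistic is undefined.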


In [84]:
# Use the decision stored in each row rather than the variables left over from the previous loop
for index, row in df_ztest.iterrows():
    if row['event'] == 'login':
        continue
    elif row['decision'].startswith('Reject'):
        display(HTML(f"> H₀ is rejected in favor of H₁: there is enough statistical evidence that the proportion of <i>{row['event']}</i> in group B increased by at least 10% relative to group A."))
    else:
        display(HTML(f"> H₀ is not rejected: there is not enough statistical evidence that the proportion of <i>{row['event']}</i> in group B increased by at least 10% relative to group A."))


**7.4.4** Check whether the test had good statistical power

In [101]:
# Initialize the power calculation function
power_analysis = NormalIndPower()

# Significance Level
alpha = 0.05

# Define results
power_results = []

# Total users per group
total_users_A = int(df_funnel_A.loc[df_funnel_A['event_name'] == 'login', 'users'].values[0])
total_users_B = int(df_funnel_B.loc[df_funnel_B['event_name'] == 'login', 'users'].values[0])

# We go through the events (funnel phases)
for event in df_funnel_A['event_name']:
    try:
        countA = int(df_funnel_A.loc[df_funnel_A['event_name'] == event, 'users'].values[0])
        countB = int(df_funnel_B.loc[df_funnel_B['event_name'] == event, 'users'].values[0])

        if total_users_A == 0 or total_users_B == 0:
            power = np.nan
        else:
            # We calculate the observed proportions
            propA = countA / total_users_A
            propB = countB / total_users_B
            
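            # Calculating the effect size (Cohen's h); it is positive when propA > propB,
            # so with alternative='larger' the power below refers to detecting A > B
            effect_size = proportion_effectsize(propA, propB)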

            # Calculation of statistical power
            power = power_analysis.power(
                effect_size=effect_size,
                nobs1=total_users_A,
                ratio=total_users_B / total_users_A,
                alpha=alpha,
                alternative='larger'
            )

        power_results.append({
            'event': event,
            'A_users': countA,
            'B_users': countB,
            'prop_A': round(propA, 4),
            'prop_B': round(propB, 4),
            'power': round(power, 4) if not np.isnan(power) else None
        })

    except IndexError:
        print(f"Event '{event}' not found in both groups. Ignored.")

# Create a DataFrame with the power results
df_power = pd.DataFrame(power_results)
df_power

Unnamed: 0,event,A_users,B_users,prop_A,prop_B,power
0,login,3080,2826,1.0,1.0,0.05
1,product_page,1779,1551,0.5776,0.5488,0.7197
2,product_cart,634,608,0.2058,0.2151,0.0059
3,purchase,12,19,0.0039,0.0067,0.0008
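
The power figures can be reproduced with the normal approximation: Cohen's h for the two proportions, scaled by the effective sample size, is compared against the one-sided critical value. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the formula is the textbook normal approximation, assumed to correspond to `NormalIndPower` with `alternative='larger'`. The order of the proportions sets the direction: `(p_b, p_a)` gives the power to detect an increase in B, and `(p_a, p_b)` gives the power for the opposite direction.

```python
from math import asin, sqrt
from statistics import NormalDist

def cohen_h(p1, p2):
    """Cohen's h effect size; positive when p1 > p2."""
    return 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))

def power_one_sided(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a one-sided z-test to detect p1 > p2."""
    n_eff = 1 / (1 / n1 + 1 / n2)  # effective sample size for two independent groups
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - cohen_h(p1, p2) * sqrt(n_eff))

n_a, n_b = 3080, 2826
p_a, p_b = 634 / n_a, 608 / n_b  # product_cart users from the funnel tables

print(f"power to detect B > A: {power_one_sided(p_b, p_a, n_b, n_a):.4f}")
print(f"power to detect A > B: {power_one_sided(p_a, p_b, n_a, n_b):.4f}")
```

Power computed from the observed proportions is closely tied to the observed z-score, so it is better read as a diagnostic than as independent evidence about the test design.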


In [102]:
# Compare each row's own power instead of the value left over from the previous loop
for index, row in df_power.iterrows():
    if pd.notna(row['power']) and row['power'] >= 0.8:
        display(HTML(f"> The test for <i>{row['event']}</i> had <b>good</b> sensitivity"))
    else:
        display(HTML(f"> The test for <i>{row['event']}</i> had <b>risk</b> of false negatives (Type II Error)"))

**7.4.5** Sensitivity Analysis

In [108]:
# Common parameters
alpha = 0.05 # Significance level
power = 0.8 # Desired power (80%)
nobs_A = df_users_group_a['user_id'].nunique() # Sample size for group A
nobs_B = df_users_group_b['user_id'].nunique()# Sample size for group B
ratio = nobs_B / nobs_A # Relationship between sample sizes

# Suppose we have the funnel phases
funnel_phases = df_funnel_A['event_name'].unique() # Example of phases

# Function to get the proportion of users in a specific funnel phase
def get_proportion(group_df, event_name, total_event_name='login'):
    try:
        users_in_phase = group_df.loc[group_df['event_name'] == event_name, 'users'].values[0]
        total_users = group_df.loc[group_df['event_name'] == total_event_name, 'users'].values[0]
        return users_in_phase / total_users
    except IndexError:
        return np.nan # If the event is not found

# Sensitivity analysis by phase
results = []

for event in funnel_phases:
    # Get proportions by phase for B (and A if needed for comparison)
    prop_B_base = get_proportion(df_funnel_B, event)

    if np.isnan(prop_B_base):
        continue

    # Expected increase (10%)
    expected_increase_B = 0.10 * prop_B_base
    prop_B_expected = prop_B_base + expected_increase_B # New expected proportion

    # Calculate the minimum detectable effect size by phase
    detectable_effect = zt_ind_solve_power( 
        effect_size=None, # This is the effect size we are looking for 
        nobs1=nobs_B, 
        alpha=alpha, 
        power=power, 
        ratio=ratio, 
        alternative='larger' 
        ) 

    # Minimum increase in percentage 
    min_percentage_increase_B = (detectable_effect / prop_B_base) * 100 

    # Store results per phase 
    results.append({ 
        'phase': event, 
        'prop_B_base': round(prop_B_base, 4), 
        'prop_B_expected': round(prop_B_expected, 4), 
        'min_effect_size': round(detectable_effect, 4), 
        'min_percentage_increase': round(min_percentage_increase_B, 1) 
    })

# Show results
df_results_by_phase = pd.DataFrame(results)
display(df_results_by_phase)

Unnamed: 0,phase,prop_B_base,prop_B_expected,min_effect_size,min_percentage_increase
0,login,1.0,1.1,0.0673,6.7
1,product_page,0.5488,0.6037,0.0673,12.3
2,product_cart,0.2151,0.2367,0.0673,31.3
3,purchase,0.0067,0.0074,0.0673,1001.5
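
Cohen's h is measured on an arcsine-transformed scale, so it is not itself a difference in proportions. To express a minimum detectable effect as a lift over a baseline, h can be mapped back through the inverse transform. This is a minimal sketch under the assumption that `min_effect_size` (0.0673) is on the Cohen's h scale, as in the power cell; the helper names are hypothetical and the baselines are the `prop_B_base` values from the table.

```python
from math import asin, pi, sin, sqrt

def cohen_h(p1, p2):
    """Cohen's h effect size; positive when p1 > p2."""
    return 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))

def detectable_proportion(p_base, h):
    """Proportion that lies exactly Cohen's h = `h` above `p_base` (capped at 1)."""
    phi = asin(sqrt(p_base)) + h / 2
    if phi >= pi / 2:
        return 1.0
    return sin(phi) ** 2

h_min = 0.0673  # min_effect_size from the table, assumed to be Cohen's h
for phase, p_base in [("product_page", 0.5488), ("product_cart", 0.2151), ("purchase", 0.0067)]:
    p_detect = detectable_proportion(p_base, h_min)
    lift = (p_detect / p_base - 1) * 100
    print(f"{phase}: detectable proportion {p_detect:.4f}, relative lift {lift:.1f}%")
```

Because the transform is non-linear, the same h corresponds to a much larger relative lift at low baselines such as *purchase* than at high baselines such as *product_page*.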


### 🧠 __8. Conclusions__

### Conclusion

The A/B test results indicate that there are **no statistically significant differences** between groups A and B across all funnel stages. Although a slight variation was observed in the *product_page* phase, the statistical power across most stages remains below the recommended threshold (0.8), limiting the reliability of any conclusions drawn from the current data.

A key issue identified is the **imbalance in sample sizes between groups A and B**. Increasing the sample size **specifically on the B side** is essential, since expanding only group A would further skew the ratio while adding little power: the precision of the comparison is limited by the smaller group. Additionally, the **reassignment of users to A or B before removing duplicate users** introduces bias. Because the initial group allocation was tied to the recommendation system experiment (with group B being the treatment group), random reallocation is not acceptable; users must remain in their original assignment to preserve experimental integrity.

### Identified Problems

1. **Low statistical power** in later funnel stages (*product_cart* and *purchase*) due to very small sample sizes.
2. **Group imbalance**, which affects the accuracy and comparability of results.
3. **Pre-cleaning reassignment of users**, which may have altered the original experimental design and compromised the randomization logic.
4. **High minimum detectable effects (MDE)** — up to 1000% in the purchase phase — making it nearly impossible to detect realistic improvements.

### Strategies to Increase Statistical Power

* **Increase group B sample size** proportionally to group A until both have comparable user counts.
* **Extend the experiment duration** to capture more user interactions through the entire funnel.
* **Focus on early funnel stages** (e.g., *product_page*), where traffic is higher and the test already shows moderate power (~0.72).
* **Reduce variance** by ensuring clean user segmentation and consistent tracking of unique users.
* **Use alternative inference methods**, such as Bayesian A/B testing, to handle low-conversion scenarios more effectively.
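
As an illustration of the Bayesian alternative mentioned in the last bullet, the probability that group B converts better than group A can be estimated from Beta posteriors. This is a minimal sketch, not the notebook's method: it assumes independent uniform Beta(1, 1) priors, uses Monte Carlo sampling from the standard library, and the helper name `prob_b_beats_a` is hypothetical. The counts are the *purchase* users and *login* users from the funnel tables.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=42):
    """Monte Carlo estimate of P(p_B > p_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each group: Beta(1 + successes, 1 + failures)
        p_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        p_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += p_b > p_a
    return wins / draws

prob = prob_b_beats_a(conv_a=12, n_a=3080, conv_b=19, n_b=2826)
print(f"P(B > A) for purchase: {prob:.3f}")
```

This probability answers a different question than the z-test (it does not test the 10% lift requirement), so it complements the frequentist result and does not replace it.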

### Recommendations

1. Re-run the experiment **after correcting group assignment and user deduplication**, maintaining the original allocation from the recommendation system.
2. Set clear target power levels (**≥0.8**) and estimate required sample sizes per stage before restarting.
3. Consider using **sensitivity analysis** results to define realistic MDE targets (e.g., 5–10%) instead of large, unachievable thresholds.
4. In future experiments, **establish user assignment and filtering protocols before data collection** to prevent sampling bias.
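
The sample-size estimate suggested in recommendation 2 can be sketched with the same normal approximation used for the power analysis. This is a minimal sketch under stated assumptions: equal group sizes, a one-sided test at alpha = 0.05, a target power of 0.8, and a 10% relative lift over the baseline. The helper name `required_n_per_group` is hypothetical, and the baselines are the `prop_A` values from the z-test table.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

def required_n_per_group(p_base, relative_lift=0.10, alpha=0.05, power=0.80):
    """Users needed in EACH group to detect the lift with a one-sided z-test."""
    p_target = p_base * (1 + relative_lift)
    if not 0 < p_base < p_target <= 1:
        raise ValueError("Baseline and target proportions must lie in (0, 1].")
    # Cohen's h between the target and the baseline proportion
    h = 2 * (asin(sqrt(p_target)) - asin(sqrt(p_base)))
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    # With two equal groups of size n, the effective sample size is n / 2
    return ceil(2 * ((z_alpha + z_power) / h) ** 2)

for phase, p_base in [("product_page", 0.5776), ("product_cart", 0.2058), ("purchase", 0.0039)]:
    print(f"{phase}: about {required_n_per_group(p_base)} users per group")
```

The required size grows sharply as the baseline rate falls, which is why a 10% lift at the *purchase* stage is much harder to confirm than at *product_page*.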

In summary, the current test provides limited statistical confidence due to low power and sampling inconsistencies. Strengthening group balance, preserving original assignments, and expanding group B’s sample size are necessary steps to achieve valid and actionable insights in subsequent test iterations.
