### 🎯 __Project Objectives: A/B Test — “recommender_system_test”__

The main objective of this second phase of the data analysis project is to evaluate the impact of implementing a new recommendation system on the behavior of users of an international online store, through the analysis of a controlled A/B test.

Specifically, the objectives are:

1. Measure and compare the performance of the new checkout funnel (group B) against the current funnel (group A), using key conversion metrics within the user journey:

    - Product page views (product_page)
    - Add to cart events (product_cart)
    - Completed purchases (purchase)

2. Validate whether the new recommendation system improves sales funnel conversion by at least 10% at each stage, within 14 days of user registration, as established in the business hypothesis.

3. Verify the consistency and quality of the collected data to ensure the analysis is statistically reliable and representative of the selected audience (15% of new users in the EU region).

4. Apply appropriate statistical methods (such as hypothesis testing and significance analysis) to determine whether the observed differences between groups A and B are attributable to the new recommendation system and not due to chance.

5. Communicate actionable findings that allow the product team to decide whether the new checkout funnel should be implemented widely or requires additional adjustments before full deployment.
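
Objective 4 can be illustrated with a two-proportion z-test, a common choice for comparing conversion rates between two groups. The block below is a minimal sketch with a hand-written pooled z-test and made-up counts. The function name and all numbers are assumptions for illustration, not results from this test, and the notebook itself may rely on a library routine such as `proportions_ztest` instead.

```python
import math

def two_proportion_ztest(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    p_pooled = (successes_a + successes_b) / (total_a + total_b)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: converting users / users in each group
z_stat, p_value = two_proportion_ztest(300, 1000, 345, 1000)

# Relative lift of group B over group A, the quantity the 10% target refers to
relative_lift = (345 / 1000) / (300 / 1000) - 1

print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}, relative lift = {relative_lift:.1%}")
```

A significant p-value only says the difference is unlikely to be due to chance. Whether the lift reaches the 10% target is a separate check on `relative_lift`.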

#### __Technical Description__

- Test Name: __recommender_system_test__
- Groups: A (control), B (new checkout funnel)
- Launch Date: 2020-12-07
- Date the test stopped accepting new users: 2020-12-21
- End Date: 2021-01-01
- Audience: __15% of new__ users from the _EU region_
- Test Purpose: To test changes related to the introduction of an improved recommendation system
- Expected Outcome: __Within 14 days of enrollment__, users will show improved conversion rates for product page views (the product_page event), adding items to the shopping cart (product_cart), and purchases (purchase). At each stage of the `product_page → product_cart → purchase` __funnel__, there will be _at least a 10% increase_.
- Expected number of test participants: 6,000
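
The "within 14 days of enrollment" condition can be expressed as a filter on the time elapsed between registration and each event. The following is a sketch on toy data, using the column names from the data description. The rows are invented, and treating the window as days 0 through 13 is an assumption, since the specification does not say whether day 14 is included.

```python
import pandas as pd

# Hypothetical toy data (not taken from the project files)
users = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "first_date": pd.to_datetime(["2020-12-07", "2020-12-10"]),
})
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "event_dt": pd.to_datetime([
        "2020-12-07 10:00:00", "2020-12-08 09:30:00", "2020-12-25 12:00:00",
        "2020-12-10 08:00:00", "2020-12-30 18:00:00",
    ]),
    "event_name": ["product_page", "product_cart", "purchase",
                   "product_page", "purchase"],
})

# Attach each user's registration date to their events
merged = events.merge(users, on="user_id", how="inner")

# Whole days elapsed since registration; assumption: keep days 0..13
days_since_signup = (merged["event_dt"] - merged["first_date"]).dt.days
in_window = merged[days_since_signup < 14]

# Unique users reaching each funnel stage inside the window
funnel_stages = ["product_page", "product_cart", "purchase"]
users_per_stage = (
    in_window.groupby("event_name")["user_id"]
    .nunique()
    .reindex(funnel_stages, fill_value=0)
)
print(users_per_stage)
```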


####  __Data Description__

`ab_project_marketing_events_us.csv` (the marketing events calendar for 2020)

- `name`: the name of the marketing event
- `regions`: regions where the advertising campaign will be run
- `start_dt`: the campaign start date
- `finish_dt`: the campaign end date

`final_ab_new_users_upd_us.csv` (all users who registered in the online store from December 7 to 21, 2020)

- `user_id`: user identification
- `first_date`: Date of enrollment
- `region`: location
- `device`: Device used for enrollment

`final_ab_events_upd_us.csv` (all new user events from December 7, 2020, to January 1, 2021)

- `user_id`: user identification
- `event_dt`: Date and time of the event
- `event_name`: Name of the event type
- `details`: Additional information about the event (e.g., the total order in USD for purchase events)

`final_ab_participants_upd_us.csv` (a table with data on test participants)

- `user_id`: user identification
- `ab_test`: Name of the test
- `group`: The test group the user belonged to
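
The tables above share `user_id` as a key, so the audience requirement (15% of new EU users) can be checked by joining participants to registrations. This is a rough sketch on invented rows. The test name `other_test` and the data are hypothetical, and the real check may need further conditions such as registration dates.

```python
import pandas as pd

# Hypothetical toy data following the documented column names
users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "region": ["EU", "EU", "N.America", "EU"],
})
participants = pd.DataFrame({
    "user_id": ["u1", "u3", "u4"],
    "group": ["A", "B", "B"],
    "ab_test": ["recommender_system_test", "recommender_system_test", "other_test"],
})

# Keep only participants of the test under analysis
test_users = participants[participants["ab_test"] == "recommender_system_test"]

# Bring in region information from the registration table
test_users = test_users.merge(users, on="user_id", how="left")

# Participants that match the intended audience
eu_test_users = test_users[test_users["region"] == "EU"]

# Share of all new EU users enrolled in the test (to compare with the 15% target)
eu_share_of_audience = eu_test_users["user_id"].nunique() / (users["region"] == "EU").sum()
print(f"EU participants: {len(eu_test_users)}, share of new EU users: {eu_share_of_audience:.1%}")
```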

### 💻 __1. Notebook Libraries and Customization__

In [1]:
from IPython.display import display, HTML
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import re
from scipy import stats as st
from statsmodels.stats.proportion import proportions_ztest
import unicodedata

### 💻 __2. Functions__

In [2]:
# Function to normalize string formatting in object-type columns
def normalize_string_format(df, include=None, exclude=None):
    """
    Standardizes text formatting for object-type (string) columns in a DataFrame.

    Operations performed:
    - Applies Unicode normalization (NFKD) and drops non-ASCII characters such as accents
    - Strips leading/trailing whitespace (lowercasing is currently disabled)
    - Replaces punctuation other than periods and underscores with spaces
    - Converts each run of whitespace into a single underscore
    - Collapses repeated underscores and removes a trailing underscore

    Parameters:
    df (DataFrame): The input DataFrame.
    include (list, optional): Specific columns to apply formatting to. If None, applies to all except those in 'exclude'.
    exclude (list, optional): Columns to skip.

    Returns:
    DataFrame: Updated DataFrame with normalized string formats.
    """

    if exclude is None:
        exclude = []

    if include is None:
        available_columns = [col for col in df.columns if col not in exclude]
    else:
        available_columns = [col for col in include if col not in exclude]

    def clean_text(text):
        if isinstance(text, str):
            text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
            text = (text
                    # .lower()
                    .strip())
            text = re.sub(r'[^\w\s.]', ' ', text)    # Replace anything other than word characters, whitespace, or periods with a space
            text = re.sub(r'\s+', '_', text)        # Replace each run of whitespace with a single underscore "_"
            text = re.sub(r'__+', '_', text)        # Collapse consecutive underscores into a single "_"
            text = re.sub(r'_(?=\s|$)', '', text)   # Remove an underscore left at the end of the text
            text = re.sub(r'__+', '_', text)        # Defensive second pass to collapse any remaining repeated underscores
        return text

    for column in available_columns:
        if df[column].dtype in ['object', 'string']:
            df[column] = df[column].apply(clean_text)

    return df

# Function to identify non-standard missing values in object-type columns
def check_existing_missing_values(df):
    """
    Checks object-type columns in a DataFrame for non-standard missing values.

    Parameters:
    df (DataFrame): The dataset to inspect.

    Output:
    Displays the number of non-standard missing entries per column and the matched values.
    """

    # Common non-standard representations of missing values
    missing_values = ['', ' ', 'N/A', 'none', 'None','null', 'NULL', 'NaN', 'nan', 'NAN', 'nat', 'NaT']

    display(HTML(f"<h4>Scanning for Non-Standard Missing Values</h4>"))

    for column in df.columns:

        matches = df[df[column].isin(missing_values)][column].unique()

        if df[column].isin(missing_values).any() and matches.size > 0:
            count = df[column].isin(missing_values).sum()
            display(
                HTML(f"> Missing values in column <i>'{column}'</i>: <b>{count}</b>"))
            display(
                HTML(f"&emsp;Matched non-standard values: {list(matches)}"))
        else:
            display(
                HTML(f"> Missing values in column <i>'{column}'</i>: None"))

    print()

    return None

# Function to standardize non-standard missing values to pd.NA
def replace_missing_values(df, include=None, exclude=None):
    """
    Replaces common non-standard missing value entries in object-type columns with pd.NA.

    Parameters:
    df (DataFrame): The input dataset.
    include (list, optional): List of columns to include. If None, all columns except those in 'exclude' are considered.
    exclude (list, optional): List of columns to exclude from replacement.

    Returns:
    DataFrame: Updated DataFrame with non-standard missing values replaced by pd.NA.
    """

    missing_values = ['', ' ', 'N/A', 'none', 'None', 'null', 'NULL', 'NaN', 'nan', 'NAN', 'nat', 'NaT']

    if exclude is None:
        exclude = []

    if include is None:
        available_columns = [col for col in df.columns if col not in exclude]
    else:
        available_columns = [col for col in include if col not in exclude]

    for column in available_columns:
        if df[column].dtype in ['object', 'string'] and df[column].isin(missing_values).any():
            df.loc[:, column] = df[column].replace(missing_values, pd.NA)

    return df

# Function to display the percentage of missing values per column in a DataFrame
def missing_values_rate(df, include=None, exclude=None):
    
    """
    Displays the percentage of missing values for specified columns in a DataFrame.

    Parameters:
    ----------
    df : pandas.DataFrame
        The DataFrame to analyze.

    include : list, optional
        List of column names to include in the analysis. If None, all columns not in `exclude` are considered.

    exclude : list, optional
        List of column names to exclude from the analysis. Default is an empty list.

    Returns:
    -------
    None
        Displays HTML output in a Jupyter Notebook environment.
    """
    
    if exclude is None:
        exclude = []

    if include is None:
        available_columns = [col for col in df.columns if col not in exclude]
    else:
        available_columns = [col for col in include if col not in exclude]

    for column in available_columns:
        total_values = len(df[column])
        if total_values == 0:
            percentage = 0
        else:
            missing_values = df[column].isna().sum()
            percentage = (missing_values / total_values) * 100

        display(HTML(f"> Percentage of missing values for column <i>'{column}'</i>: <b>{percentage:.2f}</b> %<br>"))
        display(HTML(f">    Total values: {df[column].shape[0]}<br>   > Missing values: {df[column].isna().sum()}<br><br>"))

### 🔁 __3. Data Loading__

In [3]:
df_mkt_events = pd.read_csv('../data/raw/ab_project_marketing_events_us.csv', sep=',', header='infer', keep_default_na=False)
df_user_events = pd.read_csv('../data/raw/final_ab_events_upd_us.csv', sep=',', header='infer', keep_default_na=False)
df_users_registration = pd.read_csv('../data/raw/final_ab_new_users_upd_us.csv', sep=',', header='infer', keep_default_na=False)
df_user_test = pd.read_csv('../data/raw/final_ab_participants_upd_us.csv', sep=',', header='infer', keep_default_na=False)
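
Loading with `keep_default_na=False` stops pandas from converting empty fields and strings such as `NaN` into missing values, which would explain why every column later reports no nulls in `info()`. A small sketch of the difference, using an invented in-memory CSV rather than the project files:

```python
import io
import pandas as pd

# Hypothetical CSV content with one empty 'details' field
csv_text = "user_id,event_name,details\nu1,purchase,4.99\nu2,product_page,\n"

# Default behaviour: the empty field becomes NaN
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: the empty field stays as an empty string
raw_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

default_missing = int(default_read["details"].isna().sum())
raw_missing = int(raw_read["details"].isna().sum())
raw_empty = int((raw_read["details"] == "").sum())

print(default_missing, raw_missing, raw_empty)
```

With this loading mode, missing entries have to be found by scanning for placeholder strings, which is the role of `check_existing_missing_values`.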

### 🧹 __4. Data Cleanup__

##### **4.1** Data Overview

In [4]:
df_mkt_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name       14 non-null     object
 1   regions    14 non-null     object
 2   start_dt   14 non-null     object
 3   finish_dt  14 non-null     object
dtypes: object(4)
memory usage: 576.0+ bytes


In [5]:
df_user_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423761 entries, 0 to 423760
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   user_id     423761 non-null  object
 1   event_dt    423761 non-null  object
 2   event_name  423761 non-null  object
 3   details     423761 non-null  object
dtypes: object(4)
memory usage: 12.9+ MB


In [6]:
df_users_registration.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58703 entries, 0 to 58702
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     58703 non-null  object
 1   first_date  58703 non-null  object
 2   region      58703 non-null  object
 3   device      58703 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


In [7]:
df_user_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14525 entries, 0 to 14524
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  14525 non-null  object
 1   group    14525 non-null  object
 2   ab_test  14525 non-null  object
dtypes: object(3)
memory usage: 340.6+ KB


##### **4.2** Standardization of string value formats (snake case)

In [8]:
df_mkt_events = normalize_string_format(df_mkt_events, exclude=['start_dt', 'finish_dt'])
df_mkt_events

Unnamed: 0,name,regions,start_dt,finish_dt
0,Christmas_New_Year_Promo,EU_N.America,2020-12-25,2021-01-03
1,St._Valentine_s_Day_Giveaway,EU_CIS_APAC_N.America,2020-02-14,2020-02-16
2,St._Patric_s_Day_Promo,EU_N.America,2020-03-17,2020-03-19
3,Easter_Promo,EU_CIS_APAC_N.America,2020-04-12,2020-04-19
4,4th_of_July_Promo,N.America,2020-07-04,2020-07-11
5,Black_Friday_Ads_Campaign,EU_CIS_APAC_N.America,2020-11-26,2020-12-01
6,Chinese_New_Year_Promo,APAC,2020-01-25,2020-02-07
7,Labor_day_May_1st_Ads_Campaign,EU_CIS_APAC,2020-05-01,2020-05-03
8,International_Women_s_Day_Promo,EU_CIS_APAC,2020-03-08,2020-03-10
9,Victory_Day_CIS_May_9th_Event,CIS,2020-05-09,2020-05-11


In [9]:
df_users_registration = normalize_string_format(df_users_registration, include=['region', 'device'])
df_users_registration

Unnamed: 0,user_id,first_date,region,device
0,D72A72121175D8BE,2020-12-07,EU,PC
1,F1C668619DFE6E65,2020-12-07,N.America,Android
2,2E1BF1D4C37EA01F,2020-12-07,EU,PC
3,50734A22C0C63768,2020-12-07,EU,iPhone
4,E1BDDCE0DAFA2679,2020-12-07,N.America,iPhone
...,...,...,...,...
58698,1DB53B933257165D,2020-12-20,EU,Android
58699,538643EB4527ED03,2020-12-20,EU,Mac
58700,7ADEE837D5D8CBBD,2020-12-20,EU,PC
58701,1C7D23927835213F,2020-12-20,EU,iPhone


##### **4.3** Explicit Duplicate Removal

In [11]:
display(HTML(f"> Explicit duplicates in <i>df_mkt_events</i>: <b>{df_mkt_events.duplicated().sum()}</b>"))
print()
display(HTML(f"> Explicit duplicates in <i>df_users_registration</i>: <b>{df_users_registration.duplicated().sum()}</b>"))
print()
display(HTML(f"> Explicit duplicates in <i>df_user_events</i>: <b>{df_user_events.duplicated().sum()}</b>"))
print()
display(HTML(f"> Explicit duplicates in <i>df_user_test</i>: <b>{df_user_test.duplicated().sum()}</b>"))










##### **4.4** Missing Value Analysis

In [12]:
check_existing_missing_values(df_mkt_events)




In [13]:
check_existing_missing_values(df_users_registration)




In [14]:
check_existing_missing_values(df_user_events)




In [15]:
check_existing_missing_values(df_user_test)
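
Once placeholder strings are found, a typical next step is to convert them to real missing values and measure their share per column. The sketch below shows that idea directly in pandas on invented rows. The placeholder list is copied from the helper functions above, and the data is hypothetical.

```python
import pandas as pd

# Placeholder strings treated as missing (same list as in the helper functions)
sentinels = ["", " ", "N/A", "none", "None", "null", "NULL",
             "NaN", "nan", "NAN", "nat", "NaT"]

# Hypothetical toy events table
events = pd.DataFrame({
    "event_name": ["purchase", "product_page", "product_cart", "purchase"],
    "details": ["4.99", "", "", "99.99"],
})

# Count placeholder entries per column before replacing them
sentinel_counts = events.isin(sentinels).sum()

# Convert placeholders to pd.NA so that isna() can see them
cleaned = events.replace(sentinels, pd.NA)

# Percentage of missing values per column
missing_rate = cleaned.isna().mean() * 100
print(missing_rate)
```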




##### **4.5** Casting Datatypes

In [19]:
# Parse to datetime, then keep only the calendar date (the resulting column dtype is object)
df_mkt_events['start_dt'] = pd.to_datetime(df_mkt_events['start_dt'], errors='coerce').dt.date
df_mkt_events['finish_dt'] = pd.to_datetime(df_mkt_events['finish_dt'], errors='coerce').dt.date
df_mkt_events

Unnamed: 0,name,regions,start_dt,finish_dt
0,Christmas_New_Year_Promo,EU_N.America,2020-12-25,2021-01-03
1,St._Valentine_s_Day_Giveaway,EU_CIS_APAC_N.America,2020-02-14,2020-02-16
2,St._Patric_s_Day_Promo,EU_N.America,2020-03-17,2020-03-19
3,Easter_Promo,EU_CIS_APAC_N.America,2020-04-12,2020-04-19
4,4th_of_July_Promo,N.America,2020-07-04,2020-07-11
5,Black_Friday_Ads_Campaign,EU_CIS_APAC_N.America,2020-11-26,2020-12-01
6,Chinese_New_Year_Promo,APAC,2020-01-25,2020-02-07
7,Labor_day_May_1st_Ads_Campaign,EU_CIS_APAC,2020-05-01,2020-05-03
8,International_Women_s_Day_Promo,EU_CIS_APAC,2020-03-08,2020-03-10
9,Victory_Day_CIS_May_9th_Event,CIS,2020-05-09,2020-05-11


In [27]:
df_users_registration['first_date'] = pd.to_datetime(df_users_registration['first_date'], errors='coerce').dt.date
df_users_registration

Unnamed: 0,user_id,first_date,region,device
0,D72A72121175D8BE,2020-12-07,EU,PC
1,F1C668619DFE6E65,2020-12-07,N.America,Android
2,2E1BF1D4C37EA01F,2020-12-07,EU,PC
3,50734A22C0C63768,2020-12-07,EU,iPhone
4,E1BDDCE0DAFA2679,2020-12-07,N.America,iPhone
...,...,...,...,...
58698,1DB53B933257165D,2020-12-20,EU,Android
58699,538643EB4527ED03,2020-12-20,EU,Mac
58700,7ADEE837D5D8CBBD,2020-12-20,EU,PC
58701,1C7D23927835213F,2020-12-20,EU,iPhone


In [23]:
# Cast to category
df_mkt_events['regions'] = df_mkt_events['regions'].astype('category')
df_mkt_events

Unnamed: 0,name,regions,start_dt,finish_dt
0,Christmas_New_Year_Promo,EU_N.America,2020-12-25,2021-01-03
1,St._Valentine_s_Day_Giveaway,EU_CIS_APAC_N.America,2020-02-14,2020-02-16
2,St._Patric_s_Day_Promo,EU_N.America,2020-03-17,2020-03-19
3,Easter_Promo,EU_CIS_APAC_N.America,2020-04-12,2020-04-19
4,4th_of_July_Promo,N.America,2020-07-04,2020-07-11
5,Black_Friday_Ads_Campaign,EU_CIS_APAC_N.America,2020-11-26,2020-12-01
6,Chinese_New_Year_Promo,APAC,2020-01-25,2020-02-07
7,Labor_day_May_1st_Ads_Campaign,EU_CIS_APAC,2020-05-01,2020-05-03
8,International_Women_s_Day_Promo,EU_CIS_APAC,2020-03-08,2020-03-10
9,Victory_Day_CIS_May_9th_Event,CIS,2020-05-09,2020-05-11


In [28]:
df_users_registration['region'] = df_users_registration['region'].astype('category')
df_users_registration['device'] = df_users_registration['device'].astype('category')
df_users_registration

Unnamed: 0,user_id,first_date,region,device
0,D72A72121175D8BE,2020-12-07,EU,PC
1,F1C668619DFE6E65,2020-12-07,N.America,Android
2,2E1BF1D4C37EA01F,2020-12-07,EU,PC
3,50734A22C0C63768,2020-12-07,EU,iPhone
4,E1BDDCE0DAFA2679,2020-12-07,N.America,iPhone
...,...,...,...,...
58698,1DB53B933257165D,2020-12-20,EU,Android
58699,538643EB4527ED03,2020-12-20,EU,Mac
58700,7ADEE837D5D8CBBD,2020-12-20,EU,PC
58701,1C7D23927835213F,2020-12-20,EU,iPhone


In [None]:
# Check dtypes
df_mkt_events.dtypes

name           object
regions      category
start_dt       object
finish_dt      object
dtype: object

In [29]:
df_users_registration.dtypes

user_id         object
first_date      object
region        category
device        category
dtype: object
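
The date columns appear as `object` in the listings above because `.dt.date` returns plain Python `date` objects rather than a pandas datetime dtype. The sketch below shows this on an invented series, along with `.dt.normalize()` as one possible alternative that keeps `datetime64` while zeroing the time part. Whether that alternative suits the rest of the analysis is left open.

```python
import pandas as pd

# Hypothetical raw values, including one that cannot be parsed
raw = pd.Series(["2020-12-07", "2020-12-21", "not a date"])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

# .dt.date yields Python date objects, so the Series dtype becomes object
as_dates = parsed.dt.date

# .dt.normalize() keeps the datetime64 dtype and sets the time to midnight
as_midnight = parsed.dt.normalize()

print(parsed.dtype, as_dates.dtype, as_midnight.dtype)
```

Keeping a `datetime64` column allows vectorized date arithmetic and `.dt` accessors later on, while `object` columns of `date` values need conversion first.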