# Corporacion Favorita - New Superb Forecasting Model

## Data Preparation Pipeline

Made by 4B Consultancy (Janne Heuvelmans, Georgi Duev, Alexander Engelage, Sebastiaan de Bruin) - 2024

In this data pipeline, the data used to forecast item unit_sales is processed and finalized before it is imported into the machine learning model.   
The following steps are performed in this notebook:  

>-0. Import Packages 
     
>-1. Load and optimize raw data  
    -1.1. Functions - Creation of downcast and normalize functions for initial data load  
    -1.2. Functions - Import raw data from local path  
    -1.3. Importing raw data  
       
>-2. Cleaning data (functions)  
    -2.1. Return list containing stores with fewer than 1670 operational days with sales  
    -2.2. Return list containing stores with cluster=10 in stores df  
    -2.3. Function to exclude stores with fewer than 1670 sales days and stores related to cluster 10 
 
>-3. Excluding data based on exploratory data analyses (functions)  
    -3.1. Function (partly optional) - Excluding stores based on sales units and on cluster type 10    
    -3.2. Function - Exclude holiday events related to the "Terremoto" earthquake event  

>-4. Enriching datasets for further analysis (functions)  
    -4.1. Function - Determining holidays per store     
    -4.2. Function - Determining a count per type of holiday per store  
    -4.3. Function - Constructing a cartesian sales dataset for each store based on the maximum sales daterange  

>-5. Constructing final dataset

The structure of this notebook was inspired by:
https://hamilton.dagworks.io/en/latest/how-tos/use-in-jupyter-notebook/ 

## 0. Import packages

In [1]:
# Importing the libraries
import pandas as pd
import numpy as np
import polars as pl
import os
import sys
import altair as alt
import vegafusion as vf
import sklearn
import time
from datetime import date, datetime, timedelta
from sklearn.pipeline import Pipeline, make_pipeline

## 1. Load and optimize raw data

### 1.1. Functions - Creation of downcast and normalize functions for initial data load
Update formatting of features to optimize memory and standardize column names.  
Furthermore, get basic information on loaded data and print back to user.  

1.1.1. Optimize memory by:  
- a) Removing spaces from column names.  
- b) Downcasting objects (to categoricals), integers and floats.  
- c) Standardizing date columns to datetime format.

In [2]:
# Data memory optimization function 1 - Removing spaces from the column names
def standardize_column_names(s):
    """Removes spaces from the column names."""

    return s.replace(" ", "")


# Data memory optimization function 2 - Changing datatypes to smaller ones (downcasting)
def optimize_memory(df):
    """Optimize memory usage of a DataFrame by converting object columns to categorical
    and downcasting numeric columns to smaller types."""

    # Change: Objects to Categorical.
    object_cols = df.select_dtypes(include="object").columns
    if not object_cols.empty:
        print("Change: Objects to Categorical")
        df[object_cols] = df[object_cols].astype("category")

    # Change: Convert integers to smallest signed or unsigned integer and floats to smallest.
    for col in df.select_dtypes(include=["int"]).columns:
        if (df[col] >= 0).all():  # Check if all values are non-negative
            df[col] = pd.to_numeric(
                df[col], downcast="unsigned"
            )  # Downcast to unsigned
        else:
            df[col] = pd.to_numeric(df[col], downcast="integer")  # Downcast to signed

    # Downcast float columns
    for col in df.select_dtypes(include=["float"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")

    return df


# Data memory optimization function 3 - Transform date-related columns to datetime format.
def transform_date_to_datetime(df, i):
    """Transform date-related columns to datetime format."""
    if i != 0:
        if "date" in df.columns:
            print("Change: Transformed 'date' column to Datetime Dtype")
            df["date"] = pd.to_datetime(df["date"]).dt.tz_localize(None).dt.floor("D")

    return df
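The downcasting rules above can be tried on a small frame before running them on the full sales data. The sketch below is only an illustration: the frame `df_demo` and its values are made up, and it calls `pd.to_numeric` and `astype("category")` directly rather than the notebook's own functions.

```python
import pandas as pd

# Toy frame; the column names and values are illustrative only.
df_demo = pd.DataFrame(
    {
        "store_nbr": [1, 2, 54],           # non-negative integers
        "delta": [-3, 0, 120],             # contains a negative value
        "unit_sales": [1.0, 2.5, 10.25],   # floats
        "family": ["DAIRY", "DAIRY", "BREAD/BAKERY"],  # strings
    }
)

# Non-negative integers fit an unsigned type, mixed-sign integers need a signed one.
df_demo["store_nbr"] = pd.to_numeric(df_demo["store_nbr"], downcast="unsigned")
df_demo["delta"] = pd.to_numeric(df_demo["delta"], downcast="integer")

# Floats are reduced to the smallest float type that can hold the values.
df_demo["unit_sales"] = pd.to_numeric(df_demo["unit_sales"], downcast="float")

# Repeated strings are stored once as categories.
df_demo["family"] = df_demo["family"].astype("category")

print(df_demo.dtypes)
```

Checking `df_demo.dtypes` after each step is a quick way to confirm that a column ended up with the type that was intended.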

1.1.2. Return basic information on each dataframe:  
- a) Information on the number of observations and features.  
- b) Information on the optimized size of the dataframe. 

In [3]:
# Getting the basic information of the dataframe (number of observations and features, optimized size)
def df_basic_info(df, dataframe_name):
    print(
        f"The '{dataframe_name}' dataframe contains: {df.shape[0]:,}".replace(",", ".")
        + f" observations and {df.shape[1]} features."
    )
    print(
        f"After optimizing by downcasting and normalizing it has an optimized size of {round(sys.getsizeof(df)/1024/1024/1024, 2)} GB."
    )
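The function above reports size through `sys.getsizeof`. As an alternative way to check the effect of downcasting, pandas also offers `DataFrame.memory_usage(deep=True)`, which returns bytes per column. The following is a minimal sketch on generated data (the frame and its values are assumptions for illustration, not part of the pipeline):

```python
import numpy as np
import pandas as pd

# Illustrative data: one million small integers, explicitly stored as int64.
values = np.arange(1_000_000, dtype="int64") % 54 + 1
df_demo = pd.DataFrame({"store_nbr": values})

bytes_before = df_demo.memory_usage(deep=True).sum()

# Downcast to the smallest unsigned integer type that fits the values.
df_demo["store_nbr"] = pd.to_numeric(df_demo["store_nbr"], downcast="unsigned")

bytes_after = df_demo.memory_usage(deep=True).sum()

print(f"before: {bytes_before / 1024**2:.2f} MB")
print(f"after:  {bytes_after / 1024**2:.2f} MB")
```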

### 1.2. Functions - Import raw data from local PATH
Create the data import function, which applies the column-name standardization, downcasting and date conversion functions while loading each file.

In [4]:
def f_get_data(i=0):

    # Define path.
    c_path = "C:/Users/sebas/OneDrive/Documenten/GitHub/Supermarketcasegroupproject/Group4B/data/raw/"

    # c_path = "C:/Users/alexander/Documents/0. Data Science and AI for Experts/EAISI_4B_Supermarket/data/raw/"

    # c_path = 'https://www.dropbox.com/scl/fo/4f5xcrzfqlyv3qjzm0kgc/AAJkdVC_Wa8NjoTBMwG4gx4?rlkey=gyi9pc4rcmghkzk2wgqyb7y4o&dl=0' Checking if possible to use c_path of dropbox

    # Identify file.
    v_file = (
        "history-per-year",  # 0
        "holidays_events",  # 1
        "items",  # 2
        "stores",  # 3
    )

    print(f"\nReading file {i}\n")

    # Load data.
    df = (
        pd.read_parquet(c_path + v_file[i] + ".parquet")
        .rename(columns=standardize_column_names)
        .pipe(optimize_memory)
        .pipe(transform_date_to_datetime, i)
    )

    # Return data.
    return df
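The load step chains `rename` with a callable and `pipe` with an extra argument. The sketch below reproduces that pattern on an in-memory frame so it can be run without the parquet files; `strip_spaces` and `floor_dates` are hypothetical stand-ins, not the notebook's functions.

```python
import pandas as pd


def strip_spaces(name):
    """Hypothetical stand-in for standardize_column_names."""
    return name.replace(" ", "")


def floor_dates(df, enabled):
    """Hypothetical stand-in for a pipeline step that takes an extra argument."""
    if enabled and "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"]).dt.floor("D")
    return df


raw = pd.DataFrame(
    {"store nbr": [1, 2], "date": ["2017-08-14 10:30", "2017-08-15 23:59"]}
)

df_demo = (
    raw
    .rename(columns=strip_spaces)  # the callable is applied to every column label
    .pipe(floor_dates, True)       # extra positional arguments are forwarded by pipe
)
print(df_demo)
```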

### 1.3. Importing raw data
Import the parquet files with the import function (standardizing column names, downcasting and converting dates).

In [5]:
# Sales History per year
df_sales = f_get_data(0)

# Holidays
df_holidays = f_get_data(1)

# Items
df_items = f_get_data(2)

# Stores
df_stores = f_get_data(3)


Reading file 0


Reading file 1

Change: Objects to Categorical
Change: Transformed 'date' column to Datetime Dtype

Reading file 2

Change: Objects to Categorical

Reading file 3

Change: Objects to Categorical


## 2. Cleaning data (functions)

### 2.1. Prepare and clean df_sales
Create a "date" column from the columns "year", "month" and "day", then drop the columns "id", "year", "month" and "day".

In [6]:
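# Prepare df_sales for merging with holidays: build a date column and drop unneeded columns
def sales_cleaned(df_sales):
    # Assign the result: astype returns a new Series and does not modify the column in place
    df_sales["item_nbr"] = df_sales["item_nbr"].astype(int)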
    df_sales["date"] = pd.to_datetime(df_sales[["year", "month", "day"]])
    df_sales = df_sales.drop(columns=["id", "year", "month", "day"])

    return df_sales
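`pd.to_datetime` can assemble a date from a frame whose columns are named `year`, `month` and `day`, which is what the cleaning function relies on. A minimal sketch with made-up rows:

```python
import pandas as pd

# Illustrative rows; only the column names matter for the date assembly.
parts = pd.DataFrame(
    {"year": [2016, 2017], "month": [4, 8], "day": [16, 15], "unit_sales": [3.0, 1.0]}
)

# Passing the three columns together builds one datetime per row.
parts["date"] = pd.to_datetime(parts[["year", "month", "day"]])

# The component columns are no longer needed once the date exists.
parts = parts.drop(columns=["year", "month", "day"])
print(parts)
```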

### 2.2. Prepare, clean and rename df_items
Renaming columns: "family" to "item_family" and  "class" to "item_class"

In [7]:
# Prepare df_items by cleaning up df by renaming columns for clarity in final df
def items_cleaned_renamed(df_items):
    df_items["item_nbr"] = df_items["item_nbr"].astype(int)
    df_items = df_items.rename(columns={"family": "item_family", "class": "item_class"})

    return df_items

### 2.3. Prepare, clean and rename df_stores
Drop the columns "state" and "city".  
Rename the columns "cluster" to "store_cluster" and "type" to "store_type".

In [8]:
# Prepare df_stores by cleaning up df by dropping unneeded columns and rename columns for clarity in final df
def stores_cleaned_renamed(df_stores):

    df_stores = df_stores.drop(columns=["state", "city"])

    df_stores = df_stores.rename(
        columns={
            "cluster": "store_cluster",
            "type": "store_type",
        }
    )

    return df_stores

## 3. Excluding data based on exploratory data analyses (functions)
Excluding sales data based on store sales availability  
Excluding holiday events related to the "Terremoto" earthquake event

In [9]:
def sales_funnel_construction(df_sales):

    start_date_dataset = df_sales["date"].min()
    scoping_end_date = df_sales["date"].max()

    df_sales_funnel = df_sales.copy()

    # Add the item family to all the rows, make sure that items have been cleaned before running this function!
    df_sales_funnel = df_sales_funnel.merge(df_items, how= 'inner', on ='item_nbr')

    # Calculate the total sum of unit sales and the unit percentage per row
    sales_funnel_total_unit_sales = df_sales_funnel['unit_sales'].sum()
    df_sales_funnel['unit_sales_percentage'] = (df_sales_funnel['unit_sales'] / sales_funnel_total_unit_sales) * 100

    # Normalize percentages
    df_sales_funnel['unit_sales_percentage'] = (df_sales_funnel['unit_sales_percentage'] / 
                                                df_sales_funnel['unit_sales_percentage'].sum()) * 100

    return start_date_dataset, scoping_end_date, df_sales_funnel, sales_funnel_total_unit_sales
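The funnel stores, for every row, its share of total unit sales, so that the share removed by any exclusion can later be read off by summing the flagged rows. A small sketch of that idea with made-up numbers (the frame and values are assumptions for illustration):

```python
import pandas as pd

# Illustrative sales rows for three hypothetical stores.
sales = pd.DataFrame(
    {"store_nbr": [1, 1, 2, 3], "unit_sales": [10.0, 30.0, 40.0, 20.0]}
)

# Share of each row in the overall total, expressed as a percentage.
total = sales["unit_sales"].sum()
sales["unit_sales_percentage"] = sales["unit_sales"] / total * 100

# Summing the row shares per store gives that store's share of all sales.
share_per_store = sales.groupby("store_nbr")["unit_sales_percentage"].sum()
print(share_per_store)
```

Because every row is divided by the same total, the row shares already add up to 100, so normalizing them a second time leaves them unchanged apart from floating-point rounding.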

### 3.1. Function (partly optional) - Excluding stores based on sales units and on cluster type 10

3.1.1. Function (optional) - Return list containing stores with fewer than 1670 operational days with sales  
Default parameter: store_exclusion_cutoff_number = 1670 days. Based on exploratory data analysis, 17 stores do not have 1670 days of data present in the sales dataset; they are either new stores or were closed for a significant number of days during the timeframe of the sales dataset. It might be useful to build the model only for stores that had sales on all dates (and are not new), as that might influence model behavior. This function gives the user the flexibility to choose the cutoff point.

In [10]:
def stores_exclude_sales_days(df_sales, param=None):
    """
    Filters out stores that do not meet sales conditions required for forecasting based on historical and 
    in-scope sales data.

    This function processes a sales dataset to identify and exclude stores that either:
    1. Do not have any sales prior to the scoping start date (i.e., missing history).
    2. Have insufficient sales within the last two years leading up to the scoping end date.

    It applies a default or user-defined threshold to determine which stores are missing significant sales 
    within the in-scope period.

    Parameters:
    -----------
    df_sales : pandas.DataFrame
        The input DataFrame containing sales data. It must have at least the following columns:
        - 'store_nbr': The store identifier.
        - 'date': The date of the sales record.
        - 'unit_sales': The number of units sold on that date.
    
    param : int, optional
        A threshold value for determining whether a store is missing too many sales dates within the scoping period.
        If no value is provided, the function will use the maximum number of valid sales dates within the scoping 
        period as the threshold.

    Returns:
    --------
    list_scoping_filtered : list
        A list of `store_nbr` identifiers for stores that either have no sales history before the scoping start date 
        or are missing too many sales records within the last two years (based on the threshold).
    
    Example:
    --------
    filtered_stores = stores_exclude_sales_days(df_sales, param=50)
    """

    start_date_dataset = df_sales["date"].min()
    scoping_start_date = df_sales["date"].max() - pd.DateOffset(years=2)
    scoping_end_date = df_sales["date"].max()

    df_scoping = df_sales.copy()

    df_scoping = df_scoping.groupby(['store_nbr','date']).agg(
            {'unit_sales':'sum'}).reset_index()

    # STEP 1 - Build a filter based on sales prior to the scoping start date. We assume that stores without any sales
    # before the scoping start date are not candidates for forecasting.

    # Initialize the label columns with 0s
    df_scoping['label_history'] = 0
    df_scoping['label_withinsetcheck'] = 0

    # Mask the store/date combinations that have sales before the scoping_start_date. A store with no sales before the
    # scoping start date most likely has no sales on the first scoping date either, which makes it a poor candidate
    # for further analysis because we need the full 2 years.
    mask_historyscoping = (df_scoping['date'] >= start_date_dataset) & (df_scoping['date'] <= scoping_start_date) & (df_scoping['unit_sales'].notna())

    maskwithinset = (df_scoping['date'] >= scoping_start_date) & (df_scoping['date'] <= scoping_end_date) & (df_scoping['unit_sales'].notna())

    # Assign label 1 where conditions are met
    df_scoping.loc[mask_historyscoping, 'label_history'] = 1
    df_scoping.loc[maskwithinset, 'label_withinsetcheck'] = 1

    # Sum the labels for the period before scoping
    df_scoping = df_scoping.groupby(['store_nbr']).agg(
        unit_sales_count_history=('label_history', 'sum'),
        unit_sales_count_check= ('label_withinsetcheck', 'sum')
    ).reset_index()

    # Set default_setrangemax based on the data if the param is not provided
    if param is None:
        default_setrangemax = df_scoping['unit_sales_count_check'].max()
    else:
        default_setrangemax = param

    # Make a flag for stores that don't have any sales before the start date of scoping
    df_scoping['flag_nohistory'] = df_scoping['unit_sales_count_history'].apply(lambda x: 1 if x == 0 else 0)
    df_scoping['flag_withinsetcheck'] = df_scoping['unit_sales_count_check'].apply(lambda x: 1 if x < default_setrangemax else 0)

    # Collect the stores flagged by each check separately (used in the report printed below),
    # then combine both checks with an OR condition: flag_nohistory == 1 or flag_withinsetcheck == 1

    df_scoping_nohistory_dropout = df_scoping[(df_scoping['flag_nohistory'] == 1)]['store_nbr'].tolist()
    df_datasetcheck_dropout = df_scoping[(df_scoping['flag_withinsetcheck'] == 1)]['store_nbr'].tolist()

    list_scoping_filtered = df_scoping[
        (df_scoping['flag_nohistory'] == 1) | (df_scoping['flag_withinsetcheck'] == 1)
    ]['store_nbr'].tolist()

    # STEP 2 - Some stores do have sales prior to the scoping start date but miss a significant number of sales dates within our two-year window. These are excluded as well (flag_withinsetcheck).

    print(f'The first date in the sales dataset found is: {start_date_dataset}')
    print(f'The last date in the sales dataset found is:  {scoping_end_date}')
    print(f'Subtracting two years for train/test/split we now search for stores that did not have any sales before:  {scoping_start_date}')
    print('We assume these stores do not meet the condition of having sales for the full last two years: they miss data at least at the beginning of this period and are therefore not good candidates')
    print(f'Stores {df_scoping_nohistory_dropout} were excluded from our dataset based on no sales prior to the used range within the dataset for forecasting')
    print(f"The maximum number of datapoints found within the 2 years set used for forecasting is {df_scoping['unit_sales_count_check'].max()}, the parameter in the function is set to {param}, if filled, this will be used as the cut-off value for our 2 year dataset")
    print(f'Stores {df_datasetcheck_dropout} were excluded from our dataset based on missing too many dates with sales within the dataset needed for forecasting')

    return list_scoping_filtered
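The two checks in this function (no sales before the two-year window, and too few sales dates inside it) can be sketched on generated data. Everything below is an illustration under assumptions: the three stores, their date ranges and the variable names are made up, and the logic is condensed rather than a copy of the function above.

```python
import pandas as pd

# Toy daily sales for three hypothetical stores over 2015-01-01 .. 2017-12-31.
all_days = pd.date_range("2015-01-01", "2017-12-31", freq="D")
frames = [
    # Store 1: sales on every day.
    pd.DataFrame({"store_nbr": 1, "date": all_days, "unit_sales": 1.0}),
    # Store 2: opened late, so no history before the window.
    pd.DataFrame({"store_nbr": 2, "date": all_days[all_days >= "2016-06-01"], "unit_sales": 1.0}),
    # Store 3: sales only every other day.
    pd.DataFrame({"store_nbr": 3, "date": all_days[::2], "unit_sales": 1.0}),
]
sales = pd.concat(frames, ignore_index=True)

# Two-year window that ends on the last date in the data.
window_end = sales["date"].max()
window_start = window_end - pd.DateOffset(years=2)

# Count, per store, the sales dates before the window and inside it.
per_store = (
    sales.assign(
        before_window=(sales["date"] <= window_start).astype(int),
        in_window=(sales["date"] >= window_start).astype(int),
    )
    .groupby("store_nbr")[["before_window", "in_window"]]
    .sum()
)

# Default threshold: the number of dates of the best-covered store.
threshold = per_store["in_window"].max()

excluded = per_store[
    (per_store["before_window"] == 0) | (per_store["in_window"] < threshold)
].index.tolist()
print(excluded)
```

Passing a lower threshold than the maximum would keep stores that miss only a few dates inside the window, which is the flexibility the `param` argument is meant to give.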

3.1.2. Function - Return list containing stores with cluster=10 in stores df  
From our exploratory data analysis we found that cluster 10 had data issues, as it was the only cluster assigned to multiple store types. For that reason, and because these stores are not among the top 10 in unit sales, we excluded all stores assigned to cluster 10.

In [11]:
def stores_exclude_cluster(df_stores, cluster_number=10):

    # Get the list of store numbers that belong to the given cluster (default: cluster 10)
    list_stores_cluster_10 = df_stores[df_stores["cluster"] == cluster_number][
        "store_nbr"
    ].tolist()

    print(f'Stores {list_stores_cluster_10} were excluded from our dataset based on belonging to cluster {cluster_number}')
    return list_stores_cluster_10

3.1.3. Function - Exclude stores with fewer than X sales days and stores related to cluster 10  

In [12]:
def df_sales_cleaned_stores(df_sales, df_stores, df_sales_funnel, store_exclusion_cutoff_number=None):

    # Exclude stores with fewer sales days than the cutoff (1670 in our analysis)
    list_scoping_filtered = stores_exclude_sales_days(
        df_sales, store_exclusion_cutoff_number
    )

    # Funneling information step 2
    df_sales_funnel['Exclude_stores_days'] = df_sales_funnel['store_nbr'].isin(list_scoping_filtered).astype(int)

    df_sales = df_sales.drop(
        df_sales[df_sales["store_nbr"].isin(list_scoping_filtered)].index
    )

    # Cluster 10
    list_stores_cluster_10 = stores_exclude_cluster(df_stores, cluster_number=10)

    # Funneling information step 3
    df_sales_funnel['Exclude_stores_cluster'] = df_sales_funnel['store_nbr'].isin(list_stores_cluster_10).astype(int)

    df_sales = df_sales.drop(
        df_sales[df_sales["store_nbr"].isin(list_stores_cluster_10)].index
    )

    return df_sales,df_sales_funnel
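The function above does two things with each exclusion list: it flags the affected rows in the funnel frame (for reporting) and removes them from the working sales frame. A minimal sketch of both operations with `isin`, using made-up data and a hypothetical exclusion list:

```python
import pandas as pd

sales = pd.DataFrame(
    {"store_nbr": [1, 2, 2, 3], "unit_sales": [5.0, 7.0, 1.0, 4.0]}
)
stores_to_exclude = [2]  # hypothetical exclusion list

# Flag rows for reporting (1 = would be excluded) without removing anything.
funnel = sales.copy()
funnel["exclude_flag"] = funnel["store_nbr"].isin(stores_to_exclude).astype(int)

# Remove the rows from the working dataset with a negated boolean mask.
sales_kept = sales[~sales["store_nbr"].isin(stores_to_exclude)]

print(funnel)
print(sales_kept)
```

Filtering with a negated mask gives the same rows as dropping by index, and it does not depend on the index labels being unique.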

### 3.2. Function - Excluding items based on item family

In [13]:
def items_exclude_family(df_items, families=None):
    """
    Exclude items based on a list of family categories, using a default list of families 
    unless additional families are provided by the user.
    
    Parameters:
    ----------
    df_items : pandas.DataFrame
        A DataFrame containing items. It should have at least two columns:
        - 'item_nbr': The unique identifier for the item.
        - 'item_family': The family or category the item belongs to (the renamed 'family' column).
    
    families : list of str, optional
        A list of additional families to exclude. If provided, these families will be 
        added to the default list. If not provided, only the default list will be used.
    
    Returns:
    -------
    list
        A list of item numbers (item_nbr) from the DataFrame that belong to the families 
        specified either by the default list or by the combined default and user-provided families.
    
    Notes:
    -----
    - The default families are:
      ['BEVERAGES', 'PRODUCE', 'CELEBRATION', 'HOME AND KITCHEN I', 'HOME AND KITCHEN II',
       'HOME CARE', 'LADIESWARE', 'PETS SUPPLIES', 'PLAYERS AND ELECTRONICS', 'SCHOOL AND OFFICE SUPPLIES']
    
    - If the `families` parameter is provided, the user-specified families will be appended 
      to the default families list before filtering items.
    
    Examples:
    --------
    1. Using only the default families:
    
        excluded_items = items_exclude_family(df_items)
    
    2. Adding new families to the exclusion list:
    
        excluded_items = items_exclude_family(df_items, families=['SNACKS', 'CLOTHING'])
    """

    # Default families list
    default_families = ['BEVERAGES', 
                        'PRODUCE', 
                        'CELEBRATION', 
                        'HOME AND KITCHEN I', 
                        'HOME AND KITCHEN II',
                        'HOME CARE', 'LADIESWARE',
                        'PETS SUPPLIES', 
                        'PLAYERS AND ELECTRONICS',
                        'SCHOOL AND OFFICE SUPPLIES'
                        ]
    
    # If the user provides additional families, append them to the default list
    if families is not None:
        default_families.extend(families)
    
    # Get the list of item numbers that belong to the combined list of families
    list_items_in_families = df_items[df_items["item_family"].isin(default_families)]["item_nbr"].tolist()

    return list_items_in_families
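How the default list and user-supplied families combine can be checked on a tiny items frame. The sketch below uses a shortened default list and a hypothetical helper `items_in_families`; it assumes the items frame already has the renamed `item_family` column.

```python
import pandas as pd

DEFAULT_FAMILIES = ("BEVERAGES", "PRODUCE")  # shortened, illustrative default list


def items_in_families(df_items, extra_families=None):
    """Hypothetical helper: item numbers whose family is in the exclusion list."""
    # Build a fresh list on every call so the defaults are never modified.
    families = list(DEFAULT_FAMILIES) + list(extra_families or [])
    return df_items.loc[df_items["item_family"].isin(families), "item_nbr"].tolist()


items = pd.DataFrame(
    {
        "item_nbr": [101, 102, 103, 104],
        "item_family": ["BEVERAGES", "DAIRY", "PRODUCE", "SNACKS"],
    }
)

only_defaults = items_in_families(items)
with_extra = items_in_families(items, extra_families=["SNACKS"])
print(only_defaults, with_extra)
```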

In [14]:
def df_sales_cleaned_items(df_sales, df_items, df_sales_funnel, families = None):
 
    list_items_in_families = items_exclude_family(df_items, families)

    # Funneling information step 3
    df_sales_funnel['Exclude_items_family'] = df_sales_funnel['item_nbr'].isin(list_items_in_families).astype(int)   

    # Drop observations from df_sales where item_nbr is in list_items_in_families
    df_sales = df_sales[~df_sales["item_nbr"].isin(list_items_in_families)]

    return df_sales, df_sales_funnel
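The reporting function that follows repeats its calculations for the last full calendar month in the data. The date arithmetic behind that window can be sketched separately; the dates below are illustrative values, whereas the notebook takes the last date from the funnel frame.

```python
import pandas as pd

last_date = pd.Timestamp("2017-08-15")  # illustrative last date in the data

# Step back from the first day of the current month to the previous month's last day.
month_end = last_date.replace(day=1) - pd.Timedelta(days=1)
# The first day of that same (previous) month.
month_start = month_end.replace(day=1)

dates = pd.Series(pd.date_range("2017-06-28", "2017-08-15", freq="D"))
in_last_full_month = dates[(dates >= month_start) & (dates <= month_end)]

print(month_start.date(), month_end.date(), len(in_last_full_month))
```

Note that this always selects the month before the one containing the last date, even when the last date happens to be the final day of its month.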

In [15]:
def sales_funnel_examineexclusions(df_sales_funnel, start_date_dataset, scoping_end_date, sales_funnel_total_unit_sales):  

    # PART 1 CALCULATE THE STEPS FOR THE FULL TIMERANGE

    # Step 1: Create dataframes per exclusion step
    df_sales_funnel_exclusion_stores_days = df_sales_funnel[df_sales_funnel['Exclude_stores_days'] == 1]
    df_sales_funnel_exclusion_stores_cluster = df_sales_funnel[df_sales_funnel['Exclude_stores_cluster'] == 1]
    df_sales_funnel_exclusion_items_family = df_sales_funnel[df_sales_funnel['Exclude_items_family'] == 1]

    #  Step 2: Exclusion of stores_days and of stores_cluster
    df_sales_funnel_exclusion_step2 = df_sales_funnel[
        (df_sales_funnel['Exclude_stores_days'] == 1)|
        (df_sales_funnel['Exclude_stores_cluster'] == 1)
    ]

    # Step 3: Exclusion of all 3 steps. We calculate this with one combined mask (rather than adding up the steps),
    # as some stores might be in cluster 10 and also be excluded in step 1.
    # Each comparison is parenthesized because | binds more tightly than ==.
    df_sales_funnel_exclusion_step3 = df_sales_funnel[
        (df_sales_funnel['Exclude_stores_days'] == 1)|
        (df_sales_funnel['Exclude_stores_cluster'] == 1)|
        (df_sales_funnel['Exclude_items_family'] == 1)
    ]

    # Step 4: Make scalars for each individual exclusion step as well as cumulative steps and return the unit_sales
    df_sales_funnel_exclusion_stores_days_total_unit_sales = df_sales_funnel_exclusion_stores_days['unit_sales'].sum()
    df_sales_funnel_exclusion_stores_cluster_total_unit_sales = df_sales_funnel_exclusion_stores_cluster['unit_sales'].sum()
    df_sales_funnel_exclusion_items_family_total_unit_sales = df_sales_funnel_exclusion_items_family['unit_sales'].sum()
    df_sales_funnel_exclusion_step2_total_unit_sales = df_sales_funnel_exclusion_step2['unit_sales'].sum()
    df_sales_funnel_exclusion_step3_total_unit_sales = df_sales_funnel_exclusion_step3['unit_sales'].sum()

    # Step 5: Make scalars for each individual exclusion step as well as cumulative steps and return the unit_sales_percentage
    df_sales_funnel_total_exclusion_stores_days_percentage = df_sales_funnel_exclusion_stores_days['unit_sales_percentage'].sum()
    df_sales_funnel_total_exclusion_stores_cluster_percentage = df_sales_funnel_exclusion_stores_cluster['unit_sales_percentage'].sum()
    df_sales_funnel_total_exclusion_items_family_percentage = df_sales_funnel_exclusion_items_family['unit_sales_percentage'].sum()
    df_sales_funnel_total_exclusion_step2_percentage = df_sales_funnel_exclusion_step2['unit_sales_percentage'].sum()
    df_sales_funnel_total_exclusion_step3_percentage = df_sales_funnel_exclusion_step3['unit_sales_percentage'].sum()

    # PART 2 CALCULATE THE STEPS FOR THE LAST FULL MONTH IN THE DATASET

    # Step 1: Find the last date in the dataset
    last_date = df_sales_funnel['date'].max()

    # Step 2: Calculate the start and end of the last full month
    last_full_month_end = last_date.replace(day=1) - pd.Timedelta(days=1)  # Last day of the previous month
    last_full_month_start = last_full_month_end.replace(day=1)  # First day of the previous month

    # Step 3: Filter the dataset for the last full month and adjust percentages
    # .copy() avoids a SettingWithCopyWarning when the percentage column is overwritten below
    df_sales_funnel_last_full_month = df_sales_funnel[
        (df_sales_funnel['date'] >= last_full_month_start) &
        (df_sales_funnel['date'] <= last_full_month_end)
    ].copy()

    df_sales_funnel_last_full_month_total_unit_sales = df_sales_funnel_last_full_month['unit_sales'].sum()
    df_sales_funnel_last_full_month['unit_sales_percentage'] = (df_sales_funnel_last_full_month['unit_sales'] / df_sales_funnel_last_full_month_total_unit_sales) * 100

    # Step 4: Create dataframes per exclusion step
    df_sales_funnel_last_full_month_exclusion_stores_days = df_sales_funnel_last_full_month[df_sales_funnel_last_full_month['Exclude_stores_days'] == 1]
    df_sales_funnel_last_full_month_exclusion_stores_cluster = df_sales_funnel_last_full_month[df_sales_funnel_last_full_month['Exclude_stores_cluster'] == 1]
    df_sales_funnel_last_full_month_exclusion_items_family = df_sales_funnel_last_full_month[df_sales_funnel_last_full_month['Exclude_items_family'] == 1]

    # Step 5:  Exclusion of stores_days and of stores_cluster
    df_sales_funnel_last_full_month_exclusion_step2 = df_sales_funnel_last_full_month[
        (df_sales_funnel_last_full_month['Exclude_stores_days'] == 1)|
        (df_sales_funnel_last_full_month['Exclude_stores_cluster'] == 1)
    ]

    # Step 6: Exclusion of all 3 steps. We calculate this with one combined mask (rather than adding up the steps),
    # as some stores might be in cluster 10 and also be excluded in step 1.
    # Each comparison is parenthesized because | binds more tightly than ==.
    df_sales_funnel_last_full_month_exclusion_step3 = df_sales_funnel_last_full_month[
        (df_sales_funnel_last_full_month['Exclude_stores_days'] == 1)|
        (df_sales_funnel_last_full_month['Exclude_stores_cluster'] == 1)|
        (df_sales_funnel_last_full_month['Exclude_items_family'] == 1)
    ]

    # Step 7: Make scalars for each individual exclusion step as well as cumulative steps and return the unit_sales
    df_sales_funnel_last_full_month_exclusion_stores_days_total_unit_sales = df_sales_funnel_last_full_month_exclusion_stores_days['unit_sales'].sum()
    df_sales_funnel_last_full_month_exclusion_stores_cluster_total_unit_sales = df_sales_funnel_last_full_month_exclusion_stores_cluster['unit_sales'].sum()
    df_sales_funnel_last_full_month_exclusion_items_family_total_unit_sales = df_sales_funnel_last_full_month_exclusion_items_family['unit_sales'].sum()
    df_sales_funnel_last_full_month_exclusion_step2_total_unit_sales = df_sales_funnel_last_full_month_exclusion_step2['unit_sales'].sum()
    df_sales_funnel_last_full_month_exclusion_step3_total_unit_sales = df_sales_funnel_last_full_month_exclusion_step3['unit_sales'].sum()

    # Step 8: Make scalars for each individual exclusion step as well as cumulative steps and return the unit_sales_percentage
    df_sales_funnel_last_full_month_exclusion_stores_days_percentage = df_sales_funnel_last_full_month_exclusion_stores_days['unit_sales_percentage'].sum()
    df_sales_funnel_last_full_month_exclusion_stores_cluster_percentage = df_sales_funnel_last_full_month_exclusion_stores_cluster['unit_sales_percentage'].sum()
    df_sales_funnel_last_full_month_exclusion_items_family_percentage = df_sales_funnel_last_full_month_exclusion_items_family['unit_sales_percentage'].sum()
    df_sales_funnel_last_full_month_exclusion_step2_percentage = df_sales_funnel_last_full_month_exclusion_step2['unit_sales_percentage'].sum()
    df_sales_funnel_last_full_month_exclusion_step3_percentage = df_sales_funnel_last_full_month_exclusion_step3['unit_sales_percentage'].sum()


    df_sales_funnel_step1effect = df_sales_funnel_exclusion_stores_days.groupby('store_nbr')[['unit_sales','unit_sales_percentage']].sum()
    df_sales_funnel_step1effect[['unit_sales','unit_sales_percentage']] = df_sales_funnel_step1effect[['unit_sales','unit_sales_percentage']].round(2)
    df_sales_funnel_step1effect = df_sales_funnel_step1effect.reset_index()

    df_sales_funnel_step2effect = df_sales_funnel_exclusion_stores_cluster.groupby('store_nbr')[['unit_sales','unit_sales_percentage']].sum()
    df_sales_funnel_step2effect[['unit_sales','unit_sales_percentage']] = df_sales_funnel_step2effect[['unit_sales','unit_sales_percentage']].round(2)
    df_sales_funnel_step2effect  = df_sales_funnel_step2effect.reset_index()

    df_sales_funnel_step3effect = df_sales_funnel_exclusion_items_family.groupby('family', observed= True)[['unit_sales', 'unit_sales_percentage']].sum()
    df_sales_funnel_step3effect[['unit_sales','unit_sales_percentage']] = df_sales_funnel_step3effect[['unit_sales','unit_sales_percentage']].round(2)
    df_sales_funnel_step3effect = df_sales_funnel_step3effect.reset_index()

    df_sales_funnel_last_full_month_step1effect = df_sales_funnel_last_full_month_exclusion_stores_days.groupby('store_nbr')[['unit_sales','unit_sales_percentage']].sum()
    df_sales_funnel_last_full_month_step1effect[['unit_sales','unit_sales_percentage']] = df_sales_funnel_last_full_month_step1effect[['unit_sales','unit_sales_percentage']].round(2)
    df_sales_funnel_last_full_month_step1effect = df_sales_funnel_last_full_month_step1effect.reset_index()

    df_sales_funnel_last_full_month_step2effect = df_sales_funnel_last_full_month_exclusion_stores_cluster.groupby('store_nbr')[['unit_sales','unit_sales_percentage']].sum()
    df_sales_funnel_last_full_month_step2effect[['unit_sales','unit_sales_percentage']] = df_sales_funnel_last_full_month_step2effect[['unit_sales','unit_sales_percentage']].round(2)
    df_sales_funnel_last_full_month_step2effect  = df_sales_funnel_last_full_month_step2effect.reset_index()

    df_sales_funnel_last_full_month_step3effect = df_sales_funnel_last_full_month_exclusion_items_family.groupby('family', observed= True)[['unit_sales', 'unit_sales_percentage']].sum()
    df_sales_funnel_last_full_month_step3effect[['unit_sales','unit_sales_percentage']] = df_sales_funnel_last_full_month_step3effect[['unit_sales','unit_sales_percentage']].round(2)
    df_sales_funnel_last_full_month_step3effect = df_sales_funnel_last_full_month_step3effect.reset_index()



    # PRINT OUT FOR PART 1 FULL TIMERANGE

    print("-" * 50)
    print('We will now check the impact of the exclusion functions on our initial dataset, both for the whole timerange and for the last full month')
    print('Check 1-Effect on full timerange')
    print()
    print(f"Total timerange  - The unit sales from {start_date_dataset} to {scoping_end_date} is {int(sales_funnel_total_unit_sales):,}. This is 100% without excluding data".replace(',','.'))
    print(f"                   The total number of rows in the df_sales dataset without any manipulation is {df_sales_funnel.shape[0]:,}".replace(',','.'))
    print()
    print(f'Exclusion step 1 - Excluding stores that do not have sales at the start of the last two years selected for the train, test and split')
    print(f'                   The effect of exclusion 1 is a reduction in rows of {df_sales_funnel_exclusion_stores_days.shape[0]:,} leaving us with {df_sales_funnel.shape[0] - df_sales_funnel_exclusion_stores_days.shape[0]:,} rows'.replace(',','.'))
    print(f'                   The effect of exclusion 1 is a reduction in unit sales of {int(df_sales_funnel_exclusion_stores_days_total_unit_sales):,} leaving us with {int(sales_funnel_total_unit_sales - df_sales_funnel_exclusion_stores_days_total_unit_sales):,} '.replace(',','.'))
    print(f'                   The effect of exclusion 1 is a reduction of unit sales of {df_sales_funnel_total_exclusion_stores_days_percentage:.2f}% leaving us with {100-df_sales_funnel_total_exclusion_stores_days_percentage:.2f}%.')
    print()
    print(f'Exclusion step 2 - Excluding stores that belong to cluster 10')
    print(f'                   The effect of exclusion 2 is a reduction in rows of {df_sales_funnel_exclusion_step2.shape[0] - df_sales_funnel_exclusion_stores_days.shape[0]:,}. The individual reduction without overlapping effects from step 1 would be {df_sales_funnel_exclusion_stores_cluster.shape[0]:,} leaving us with {df_sales_funnel.shape[0] - df_sales_funnel_exclusion_step2.shape[0]:,} rows'.replace(',','.'))
    print(f'                   The effect of exclusion 2 is a reduction in unit sales of {int(df_sales_funnel_exclusion_step2_total_unit_sales - df_sales_funnel_exclusion_stores_days_total_unit_sales):,}. The individual reduction without overlapping effects from step 1 would be {int(df_sales_funnel_exclusion_stores_cluster_total_unit_sales):,} leaving us with {int(sales_funnel_total_unit_sales - df_sales_funnel_exclusion_step2_total_unit_sales):,} '.replace(',','.'))
    print(f'                   The effect of exclusion 2 is a reduction of unit sales of {df_sales_funnel_total_exclusion_step2_percentage - df_sales_funnel_total_exclusion_stores_days_percentage:.2f}%, the individual reduction without overlapping effects from step 1 would be {df_sales_funnel_total_exclusion_stores_cluster_percentage:.2f}%, leaving us with {100-df_sales_funnel_total_exclusion_step2_percentage:.2f}%.')
    print()
    print(f'Exclusion step 3 - Excluding items that are included in the families selected for exclusion')
    print(f'                   The effect of exclusion 3 is a reduction in rows of {df_sales_funnel_exclusion_step3.shape[0] - df_sales_funnel_exclusion_step2.shape[0]:,}. The individual reduction without overlapping effects from steps 1 and 2 would be {df_sales_funnel_exclusion_items_family.shape[0]:,} leaving us with {df_sales_funnel.shape[0] - df_sales_funnel_exclusion_step3.shape[0]:,} rows'.replace(',','.'))
    print(f'                   The effect of exclusion 3 is a reduction in unit sales of {int(df_sales_funnel_exclusion_step3_total_unit_sales - df_sales_funnel_exclusion_step2_total_unit_sales):,}. The individual reduction without overlapping effects from steps 1 and 2 would be {int(df_sales_funnel_exclusion_items_family_total_unit_sales):,} leaving us with {int(sales_funnel_total_unit_sales - df_sales_funnel_exclusion_step3_total_unit_sales):,} '.replace(',','.'))
    print(f'                   The effect of exclusion 3 is a reduction of unit sales of {df_sales_funnel_total_exclusion_step3_percentage - df_sales_funnel_total_exclusion_step2_percentage:.2f}%, the individual reduction without overlapping effects from steps 1 and 2 would be {df_sales_funnel_total_exclusion_items_family_percentage:.2f}%, leaving us with {100-df_sales_funnel_total_exclusion_step3_percentage:.2f}%.')
    print()
    print('Printing out the individual effects within each exclusion step')
    print(df_sales_funnel_step1effect)
    print(df_sales_funnel_step2effect)
    print(df_sales_funnel_step3effect)
    print("-" * 50)
    print('Check 2-Effect on last full month in the dataset')
    print(f"                   The last full month in the dataset is {last_full_month_start.strftime('%B %Y')}")
    print()
    print(f"Total timerange  - The unit sales from {last_full_month_start} to {last_full_month_end} is {int(df_sales_funnel_last_full_month_total_unit_sales):,}. This is 100% without excluding data".replace(',','.'))
    print(f"                   The total number of rows in the df_sales dataset without any manipulation is {df_sales_funnel_last_full_month.shape[0]:,}".replace(',','.'))
    print()
    print(f'Exclusion step 1 - Excluding stores that do not have sales at the start of the last two years selected for the train, test and split')
    print(f'                   The effect of exclusion 1 is a reduction in rows of {df_sales_funnel_last_full_month_exclusion_stores_days.shape[0]:,} leaving us with {df_sales_funnel_last_full_month.shape[0] - df_sales_funnel_last_full_month_exclusion_stores_days.shape[0]:,} rows'.replace(',','.'))
    print(f'                   The effect of exclusion 1 is a reduction in unit sales of {int(df_sales_funnel_last_full_month_exclusion_stores_days_total_unit_sales):,} leaving us with {int(df_sales_funnel_last_full_month_total_unit_sales - df_sales_funnel_last_full_month_exclusion_stores_days_total_unit_sales):,} '.replace(',','.'))
    print(f'                   The effect of exclusion 1 is a reduction of unit sales of {df_sales_funnel_last_full_month_exclusion_stores_days_percentage:.2f}% leaving us with {100-df_sales_funnel_last_full_month_exclusion_stores_days_percentage:.2f}%.')
    print()
    print(f'Exclusion step 2 - Excluding stores that belong to cluster 10')
    print(f'                   The effect of exclusion 2 is a reduction in rows of {df_sales_funnel_last_full_month_exclusion_step2.shape[0] - df_sales_funnel_last_full_month_exclusion_stores_days.shape[0]:,}. The individual reduction without overlapping effects from step 1 would be {df_sales_funnel_last_full_month_exclusion_stores_cluster.shape[0]:,}, leaving us with {df_sales_funnel_last_full_month.shape[0] - df_sales_funnel_last_full_month_exclusion_step2.shape[0]:,} rows'.replace(',','.'))
    print(f'                   The effect of exclusion 2 is a reduction in unit sales of {int(df_sales_funnel_last_full_month_exclusion_step2_total_unit_sales - df_sales_funnel_last_full_month_exclusion_stores_days_total_unit_sales):,}. The individual reduction without overlapping effects from step 1 would be {int(df_sales_funnel_last_full_month_exclusion_stores_cluster_total_unit_sales):,} leaving us with {int(df_sales_funnel_last_full_month_total_unit_sales - df_sales_funnel_last_full_month_exclusion_step2_total_unit_sales):,} '.replace(',','.'))
    print(f'                   The effect of exclusion 2 is a reduction of unit sales of {df_sales_funnel_last_full_month_exclusion_step2_percentage - df_sales_funnel_last_full_month_exclusion_stores_days_percentage:.2f}%, the individual reduction without overlapping effects from step 1 would be {df_sales_funnel_last_full_month_exclusion_stores_cluster_percentage:.2f}%, leaving us with {100-df_sales_funnel_last_full_month_exclusion_step2_percentage:.2f}%.')
    print()
    print(f'Exclusion step 3 - Excluding items that are included in the families selected for exclusion')
    print(f'                   The effect of exclusion 3 is a reduction in rows of {df_sales_funnel_last_full_month_exclusion_step3.shape[0] - df_sales_funnel_last_full_month_exclusion_step2.shape[0]:,}. The individual reduction without overlapping effects from steps 1 and 2 would be {df_sales_funnel_last_full_month_exclusion_items_family.shape[0]:,} leaving us with {df_sales_funnel_last_full_month.shape[0] - df_sales_funnel_last_full_month_exclusion_step3.shape[0]:,} rows'.replace(',','.'))
    print(f'                   The effect of exclusion 3 is a reduction in unit sales of {int(df_sales_funnel_last_full_month_exclusion_step3_total_unit_sales - df_sales_funnel_last_full_month_exclusion_step2_total_unit_sales):,}. The individual reduction without overlapping effects from steps 1 and 2 would be {int(df_sales_funnel_last_full_month_exclusion_items_family_total_unit_sales):,} leaving us with {int(df_sales_funnel_last_full_month_total_unit_sales - df_sales_funnel_last_full_month_exclusion_step3_total_unit_sales):,} '.replace(',','.'))
    print(f'                   The effect of exclusion 3 is a reduction of unit sales of {df_sales_funnel_last_full_month_exclusion_step3_percentage - df_sales_funnel_last_full_month_exclusion_step2_percentage:.2f}%, the individual reduction without overlapping effects from steps 1 and 2 would be {df_sales_funnel_last_full_month_exclusion_items_family_percentage:.2f}%, leaving us with {100-df_sales_funnel_last_full_month_exclusion_step3_percentage:.2f}%.')
    print()
    print('Printing out the individual effects within each exclusion step')
    print(df_sales_funnel_last_full_month_step1effect)
    print(df_sales_funnel_last_full_month_step2effect)
    print(df_sales_funnel_last_full_month_step3effect)
    print("-" * 50)

### 3.3. Function - Exclude holiday event related to the "Terremoto" earthquake event

3.3.1. Function - Create dataframe based on df_holidays with only events containing "Terremoto Manabi"

In [16]:
def holiday_filter_vulcano_event(df_holidays, event_substring="Terremoto Manabi"):

    # Filter the DataFrame where 'description' contains the event_substring
    # (plain substring match; missing descriptions count as no match)
    df_vulcano_event_filtered = df_holidays[
        df_holidays["description"].str.contains(event_substring, regex=False, na=False)
    ]

    return df_vulcano_event_filtered

3.3.2. Function - Exclude the "Terremoto Manabi" events from the df_holidays dataframe

In [17]:
def df_holidays_cleaned(df_holidays):

    # Call holiday_filter_vulcano_event to get the rows that must be excluded
    df_vulcano_event_filtered = holiday_filter_vulcano_event(df_holidays)

    # Filter the specific holiday events from the holiday DataFrame
    df_holidays = df_holidays.loc[
        ~df_holidays.index.isin(df_vulcano_event_filtered.index)
    ]

    return df_holidays
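
The two functions above can also be read as a single negated boolean mask. The sketch below shows that idea on a small toy dataframe; the rows and the names `toy_holidays`, `is_event` and `toy_holidays_cleaned` are made up for illustration and are not taken from the real holidays file.

```python
import pandas as pd

# Toy stand-in for df_holidays; the rows are illustrative only.
toy_holidays = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-04-16", "2016-04-17", "2016-05-01"]),
        "type": ["Event", "Event", "Holiday"],
        "description": ["Terremoto Manabi", "Terremoto Manabi+1", "Dia del Trabajo"],
    }
)

# Boolean mask of rows whose description mentions the event (plain substring match).
is_event = toy_holidays["description"].str.contains(
    "Terremoto Manabi", regex=False, na=False
)

# Negating the mask keeps everything except the event rows.
toy_holidays_cleaned = toy_holidays.loc[~is_event]
print(toy_holidays_cleaned["description"].tolist())
```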

## 4. Enriching datasets for further analysis (functions)

### 4.1. Function - Determining holidays per store
The holidays dataset contains information on local, regional and national holidays. Each of these types uses a different key/identifier to link to the stores data in df_stores (the raw data): local holidays match on city, regional holidays on state, and national holidays apply to every store. To handle this, a separate dataframe is made for each type of holiday, in which the holiday data is merged (joined) with the stores dataframe. These dataframes are then combined into one big dataframe containing all the holidays per store.
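
As a minimal sketch of this idea, the toy example below joins each holiday type on its own store column and stacks the results. The data and the names `holidays`, `stores` and `per_store` are illustrative assumptions. For the national holidays it uses a cross join (`how="cross"`, available in pandas 1.2 and later) instead of the constant helper column used in this notebook, and it uses inner joins where the notebook uses left joins.

```python
import pandas as pd

# Toy data, for illustration only.
holidays = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-01-01", "2016-03-02", "2016-04-01"]),
        "locale": ["National", "Local", "Regional"],
        "locale_name": ["Ecuador", "Manta", "Cotopaxi"],
    }
)
stores = pd.DataFrame(
    {
        "store_nbr": [1, 2, 3],
        "city": ["Quito", "Manta", "Latacunga"],
        "state": ["Pichincha", "Manabi", "Cotopaxi"],
    }
)

# Each locale level joins on a different store column.
local = holidays[holidays["locale"] == "Local"].merge(
    stores, left_on="locale_name", right_on="city"
)
regional = holidays[holidays["locale"] == "Regional"].merge(
    stores, left_on="locale_name", right_on="state"
)
# National holidays apply to every store, so pair them with all stores.
national = holidays[holidays["locale"] == "National"].merge(stores, how="cross")

# Stack the three results into one holidays-per-store dataframe.
per_store = pd.concat([local, regional, national], ignore_index=True)[
    ["date", "store_nbr", "locale"]
]
print(per_store.sort_values(["date", "store_nbr"]))
```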

4.1.1. Function - Make cleaned versions of the holidays and stores dataframe

In [18]:
# Prepare df_holidays and df_stores for merging by dropping columns that are not needed
def clean_holidays_stores_prep(df_holidays, df_stores):

    df_holidays_cleaned = df_holidays.drop(
        columns=[
            "description",
            "transferred",
        ]
    )

    df_stores_cleaned = df_stores.drop(columns=["cluster", "type"])

    return df_holidays_cleaned, df_stores_cleaned

4.1.2. Function - Create a dataframe with all the local holidays per store

In [19]:
def holidays_prep_local(df_holidays, df_stores):

    df_holidays_cleaned, df_stores_cleaned = clean_holidays_stores_prep(
        df_holidays, df_stores
    )

    # select locale 'Local' from holiday df and merge with city stores df
    df_holidays_local = df_holidays_cleaned[df_holidays_cleaned["locale"] == "Local"]

    df_holidays_prep_local = df_holidays_local.merge(
        df_stores_cleaned, left_on="locale_name", right_on="city", how="left"
    )

    return df_holidays_prep_local

4.1.3. Function - Create a dataframe with all the regional holidays per store

In [20]:
def holidays_prep_regional(df_holidays, df_stores):

    df_holidays_cleaned, df_stores_cleaned = clean_holidays_stores_prep(
        df_holidays, df_stores
    )

    # select locale 'Regional' from holiday df and merge with state stores df
    df_holidays_regional = df_holidays_cleaned[
        df_holidays_cleaned["locale"] == "Regional"
    ]

    df_holidays_prep_regional = df_holidays_regional.merge(
        df_stores_cleaned, left_on="locale_name", right_on="state", how="left"
    )

    return df_holidays_prep_regional

4.1.4. Function - Create a dataframe with all the national holidays per store

In [21]:
def holidays_prep_national(df_holidays, df_stores):

    df_holidays_cleaned, df_stores_cleaned = clean_holidays_stores_prep(
        df_holidays, df_stores
    )

    # Select locale 'National' from holiday df and merge with every store
    df_holidays_national = df_holidays_cleaned[
        df_holidays_cleaned["locale"] == "National"
    ]

    # Create extra column for merge on "Ecuador"
    df_stores_cleaned["national_merge"] = "Ecuador"

    df_holidays_prep_national = df_holidays_national.merge(
        df_stores_cleaned, left_on="locale_name", right_on="national_merge", how="left"
    )

    # Drop newly created column national_merge, not needed further
    df_holidays_prep_national = df_holidays_prep_national.drop(
        columns=["national_merge"]
    )

    return df_holidays_prep_national

4.1.5. Function - Create a dataframe that merges all the separate dataframe for each type of holiday and store combination

In [22]:
def holidays_prep_merged(df_holidays, df_stores):

    # Build the local, regional and national holiday dataframes with the prep functions
    df_holidays_prep_local = holidays_prep_local(df_holidays, df_stores)

    df_holidays_prep_regional = holidays_prep_regional(df_holidays, df_stores)

    df_holidays_prep_national = holidays_prep_national(df_holidays, df_stores)

    # Combine local, regional and national dataframes into 1 merged dataframe
    df_holidays_merged = pd.concat(
        [df_holidays_prep_local, df_holidays_prep_regional, df_holidays_prep_national]
    )

    # Clean df_holidays_merged by dropping "locale_name", "city", "state"
    df_holidays_merged = df_holidays_merged.drop(
        columns=["locale_name", "city", "state"]
    )

    # Rename 'type' to 'holiday_type' and 'locale' to 'holiday_locale'
    df_holidays_merged = df_holidays_merged.rename(
        columns={"type": "holiday_type", "locale": "holiday_locale"}
    )

    return df_holidays_merged

### 4.2. Function - Determining a count per type of holiday per store
The dataframe resulting from the function described in 4.1. contains duplicate date and store combinations, because there are sometimes multiple holidays on one date. Duplicates per date would multiply the sales rows for that date when joined, making the result unworkable. Therefore, we transform the holiday and store combinations into three count columns (one per type of holiday: local, regional and national) that hold the number of holidays found on a specific date. This gives a unique list of date and store combinations for all the holidays in the dataset.
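
The toy example below sketches how a pivot table turns duplicate date and store rows into one row with a count per holiday locale. The data and the names `merged` and `counts` are illustrative assumptions; `fill_value=0` is used here as a shortcut for the separate `fillna(0)` step.

```python
import pandas as pd

# Toy holidays-per-store data: store 1 has two holidays on 2016-05-01.
merged = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-05-01", "2016-05-01", "2016-05-02"]),
        "store_nbr": [1, 1, 1],
        "holiday_locale": ["National", "Local", "National"],
        "holiday_type": ["Holiday", "Event", "Holiday"],
    }
)

# One row per (date, store_nbr), one count column per locale.
counts = merged.pivot_table(
    index=["date", "store_nbr"],
    columns="holiday_locale",
    values="holiday_type",
    aggfunc="count",
    fill_value=0,
).reset_index()
counts.columns.name = None
print(counts)
```

Only the locales present in the data become columns, so this toy result has no `Regional` column.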

4.2.1. Function - Creating unique combination of store and date with three count columns for each type of holiday

In [23]:
def holidays_prep_merged_grouped(df_holidays, df_stores):

    # Merge the holiday dataframes and clean the merged dataframe
    df_holidays_merged = holidays_prep_merged(df_holidays, df_stores)

    # Group by date and store_nbr and count the number of holidays per date per store
    df_holidays_merged_grouped = df_holidays_merged.pivot_table(
        index=["date", "store_nbr"],
        columns="holiday_locale",
        values="holiday_type",
        aggfunc="count",
        observed=True,
    ).reset_index()

    # Remove the name of the columns
    df_holidays_merged_grouped.columns.name = None

    # Fill NaN values with 0
    df_holidays_merged_grouped = df_holidays_merged_grouped.fillna(0)

    # Convert the count columns to Int8-dtype (note the capital 'I'). This dtype can handle null values, needed to prevent float64 from the merge in Step 6
    # Rename the columns to holiday_local_count,  holiday_regional_count, holiday_national_count
    df_holidays_merged_grouped = df_holidays_merged_grouped.astype(
        {"Local": "Int8", "Regional": "Int8", "National": "Int8"}
    ).rename(
        columns={
            "Local": "holiday_local_count",
            "Regional": "holiday_regional_count",
            "National": "holiday_national_count",
        }
    )

    # Let's do an inner join with the original data to get the original date and store_nbr combinations back. Therefore we need to make another dataframe.
    df_holidays_merged_grouped_inner = holidays_prep_merged(df_holidays, df_stores)
    df_holidays_merged_grouped_inner = (
        df_holidays_merged_grouped_inner.groupby(["date", "store_nbr"])
        .size()
        .reset_index()
        .drop(columns=0)
    )

    df_holidays_merged_grouped = df_holidays_merged_grouped.merge(
        df_holidays_merged_grouped_inner, on=["date", "store_nbr"], how="inner"
    )

    print(
        f"In the original unioned holiday dataframe, df_holidays_merged, we found (including duplicates) {df_holidays_merged.shape[0]} rows"
    )
    print(
        f"In our new adjusted dataframe we have {df_holidays_merged_grouped.shape[0]} rows"
    )
    print(
        f"Thus, we have removed {df_holidays_merged.shape[0] - df_holidays_merged_grouped.shape[0]} rows"
    )

    # Might want to filter out the holiday dates that will never be in the sales date range. However, they will be left out anyway when joining with the sales data.
    return df_holidays_merged_grouped

4.2.2. Function - Filling in NA values for each count column whenever no holiday could be found for a specific holiday date and store combination

In [24]:
# Fill the NaN values created by the holiday join with 0 on dates where there are no holidays
def holidays_fill_zero_normal(df):
    """
    Fills the NaN values with 0 in the columns "holiday_local_count", "holiday_regional_count" and "holiday_national_count" of the combined dataframe and casts them to int8.
    These NaN values belong to date and store combinations from the original dataframe that have no match in the holiday dataframe.
    """

    columns_to_fill = [
        "holiday_local_count",
        "holiday_regional_count",
        "holiday_national_count",
    ]

    df[columns_to_fill] = df[columns_to_fill].fillna(0).astype("int8")

    return df
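
To illustrate why this fill step is needed, the sketch below left-joins toy sales rows with toy holiday counts and then fills and casts the count column. The data and the names `sales`, `holiday_counts` and `merged` are illustrative assumptions.

```python
import pandas as pd

# Toy sales rows: two dates for one store.
sales = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-05-01", "2016-05-02"]),
        "store_nbr": [1, 1],
        "unit_sales": [3.0, 5.0],
    }
)
# Toy holiday counts: only the first date has a holiday.
holiday_counts = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-05-01"]),
        "store_nbr": [1],
        "holiday_national_count": pd.array([1], dtype="Int8"),
    }
)

# The left join leaves a missing value on the date without a holiday.
merged = sales.merge(holiday_counts, on=["date", "store_nbr"], how="left")
print(merged["holiday_national_count"].isna().sum())

# Fill the gap with 0 and cast to the compact, non-nullable int8 dtype.
merged["holiday_national_count"] = (
    merged["holiday_national_count"].fillna(0).astype("int8")
)
print(merged["holiday_national_count"].tolist())
```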

### 4.3. Function - Constructing a cartesian sales dataset for each store based on the maximum sales daterange
The df_sales dataset contains unit sales data for each store, but not all stores have data for each date. To make sure each date is present for each store, we construct a new dataframe based on the minimum and maximum dates found in the sales dataframe. The result is a sales dataframe with every date, store and item combination for the whole time range; combinations without a recorded sale get NaN values.
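
A minimal sketch of this cartesian fill on toy data is shown below. The data and the names `sales`, `full_index` and `cartesian` are illustrative assumptions.

```python
import pandas as pd

# Toy sales data: two stores, one item, gaps in the dates.
sales = pd.DataFrame(
    {
        "store_nbr": [1, 1, 2],
        "item_nbr": [10, 10, 10],
        "date": pd.to_datetime(["2017-01-01", "2017-01-03", "2017-01-02"]),
        "unit_sales": [4.0, 2.0, 7.0],
    }
)

# Every calendar day between the first and last date in the data.
all_dates = pd.date_range(sales["date"].min(), sales["date"].max(), freq="D")

# Every store x item x date combination.
full_index = pd.MultiIndex.from_product(
    [sales["store_nbr"].unique(), sales["item_nbr"].unique(), all_dates],
    names=["store_nbr", "item_nbr", "date"],
)

# Reindexing adds the missing combinations with NaN unit_sales.
cartesian = (
    sales.set_index(["store_nbr", "item_nbr", "date"])
    .reindex(full_index)
    .reset_index()
)
print(len(cartesian), int(cartesian["unit_sales"].isna().sum()))
```

The row count grows with the product of stores, items and dates, which is why the function below reports the memory size before and after.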

In [25]:
def filling_dates_cartesian(df):

    # Print first and last date of df
    print(f'First date in df: {df["date"].min()}')
    print(f'Last date in df:  {df["date"].max()}')

    # Calculate memory size and shape size of start df
    df_mem_start = sys.getsizeof(df)
    df_shape_start = df.shape[0] / 1e6
    print(
        f"Start size of df_sales:     {round(df_mem_start/1024/1024/1024, 2)} GB and start observations:     {round(df_shape_start, 1)} million."
    )

    # Create a complete date range for the entire dataset, it's a datetimeindex object
    all_dates = pd.date_range(start=df["date"].min(), end=df["date"].max(), freq="D")

    # Create a multi-index from all possible combinations of 'store_nbr', 'item_nbr' and 'date'
    all_combinations = pd.MultiIndex.from_product(
        [df["store_nbr"].unique(), df["item_nbr"].unique(), all_dates],
        names=["store_nbr", "item_nbr", "date"],
    )

    print(
        f"The multi-index (all_combinations of store, date and item) for the minimum and maximum dates found result in {round(all_combinations.shape[0]/1e6,1)} million rows, this is the amount of rows we expect in the final dataframe."
    )

    # -----------------------------------------------------------------------------------------------------
    # Check for duplicates in the combination of 'store_nbr', 'item_nbr', and 'date'
    # This method is based on boolean indexing, when there's a true value for the duplicated method, it will return those rows to the duplicate_rows variable
    duplicate_rows = df[
        df.duplicated(subset=["store_nbr", "item_nbr", "date"], keep=False)
    ]
    if not duplicate_rows.empty:
        print(
            "Warning: Duplicate entries found in the combination of 'store_nbr', 'item_nbr', and 'date'."
        )
        print(f"Total duplicate rows {duplicate_rows.shape[0]}")
        print("-" * 71)

    # -----------------------------------------------------------------------------------------------------

    # Reindex the original DataFrame to include all combinations of 'store_nbr', 'item_nbr', and 'date'
    df_reindexed = df.set_index(["store_nbr", "item_nbr", "date"]).reindex(
        all_combinations
    )

    # Reset the index to turn the multi-index back into regular columns
    df_sales_cartesian = df_reindexed.reset_index()

    # Calculate memory size and shape size of final end df
    df_mem_end = sys.getsizeof(df_sales_cartesian)
    df_mem_change_perc = ((df_mem_end - df_mem_start) / df_mem_start) * 100
    df_mem_change = df_mem_end - df_mem_start

    df_shape_end = df_sales_cartesian.shape[0] / 1e6
    df_shape_change_perc = ((df_shape_end - df_shape_start) / df_shape_start) * 100
    df_shape_change = df_shape_end - df_shape_start

    print(
        f"Final size of the dataframe is:     {round(df_mem_end/1024/1024/1024, 2)} GB and end observations:       {round(df_shape_end, 1)} million."
    )
    print(
        f"Change in size of the dataframe is: {round(df_mem_change_perc, 2)} % and observations:           {round(df_shape_change_perc, 2)}     %."
    )
    print(
        f"Increased size of the dataframe is: {round(df_mem_change/1024/1024/1024, 2)} GB and increased observations: {round(df_shape_change, 1)} million."
    )

    return df_sales_cartesian

## 5. Constructing final dataset
In this step all the datasets will be merged together.

In [26]:
# Merge datasets
def merge_datasets(df_sales, df_items, df_stores, df_holidays):

    # Basic information of loaded data
    print(
        "Step 1 - Importing, downcasting and normalizing data and optimizing memory, the following data has been imported."
    )
    df_basic_info(df_sales, "df_sales")
    print()
    df_basic_info(df_items, "df_items")
    print()
    df_basic_info(df_stores, "df_stores")
    print()
    df_basic_info(df_holidays, "df_holidays")
    print("-" * 100)

    # Sales prep
    print(
        "Step 2 - Cleaning sales data and making a cartesian product of the sales data and the minimum and maximum dates found in the data."
    )
    df_sales = sales_cleaned(df_sales)
    df_items = items_cleaned_renamed(df_items)
    start_date_dataset, scoping_end_date, df_sales_funnel,sales_funnel_total_unit_sales = sales_funnel_construction(df_sales)
    df_sales, df_sales_funnel = df_sales_cleaned_stores(df_sales, df_stores, df_sales_funnel)
    df_sales, df_sales_funnel = df_sales_cleaned_items(df_sales, df_items, df_sales_funnel)
    sales_funnel_examineexclusions(df_sales_funnel, start_date_dataset, scoping_end_date,sales_funnel_total_unit_sales) 
    df_sales_cartesian = filling_dates_cartesian(df_sales)

    print("-" * 100)

    # Holidays prep
    print(
        "Step 3 - Cleaning holiday data and counting the number of holidays per date per store for each type of holiday (national, regional, local)."
    )
    df_holidays = df_holidays_cleaned(df_holidays)
    df_holidays_merged_grouped = holidays_prep_merged_grouped(df_holidays, df_stores)
    print("-" * 100)

    # Stores prep
    print(
        "Step 4 - Cleaning stores data (read: dropping unnecessary columns and renaming columns for clarity)."
    )
    df_stores = stores_cleaned_renamed(df_stores)
    print("-" * 100)

    # Items prep
    print(
        "Step 5 - Cleaning items data  (read: dropping unnecessary columns and renaming columns for clarity)."
    )
    # df_items = items_cleaned_renamed(df_items)
    print("-" * 100)

    # Holidays merge on sales
    print(
        "Step 6 - Adding holiday data to our cartesian product of sales data (with store, item and date combinations) and cleaning up null values for count of three holiday columns."
    )

    df_merged = df_sales_cartesian.merge(
        df_holidays_merged_grouped, on=["date", "store_nbr"], how="left"
    )

    df_merged = holidays_fill_zero_normal(df_merged)
    print("-" * 100)

    # Stores merged with sales+holidays
    print(
        "Step 7 - Adding stores data (store type and cluster) to the merged sales and holiday data."
    )
    df_merged = df_merged.merge(df_stores, on="store_nbr", how="left")

    print("-" * 100)

    # Change the dtype for item_nbr from uint32 to int32, during testing we found that the merge was not working properly with uint32
    # df_merged["item_nbr"] = df_merged["item_nbr"].astype(int)
    # df_items["item_nbr"] = df_items["item_nbr"].astype(int)

    # Items merged with sales+holidays+stores
    print(
        "Step 8 - Adding items data to the merged sales, holiday and stores data. Remember, in our last step we added a lot of store information as well"
    )
    df_final = df_merged.merge(df_items, on="item_nbr", how="left")
    print("-" * 100)

    # Print some referential integrity checks to make sure we have the same amount of rows
    print(
        f"The amount of rows in the sales dataframe was {df_sales.shape[0] /1_000_000:.2f} million."
    )
    print(
        f"After making a cartesian product with date, store and item we had a total of {df_sales_cartesian.shape[0]/1_000_000:.2f} million rows."
    )
    print(
        f"After merging with the holidays, stores, and items we have {df_final.shape[0]/1_000_000:.2f} million rows"
    )
    print(
        f"The difference between the incoming and outgoing data from this function is {df_sales.shape[0] - df_final.shape[0]} rows"
    )
    print(
        f'If we compare the outgoing dataframe called "df_final" with the cartesian product of sales data and dates we see that the difference is {df_sales_cartesian.shape[0] - df_final.shape[0]} rows'
    )
    print(
        f"If the difference is 0, we have a perfect match and we can continue with the next steps."
    )

    # f"Final size of the dataframe is:     {round(df_mem_end/1024/1024/1024, 2)} GB and end observations:       {round(df_shape_end, 1)} million."
    return df_final
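
The row-count checks printed at the end of `merge_datasets` could be complemented by letting pandas validate the join keys itself. The sketch below is a suggestion on toy data, not part of the original pipeline; the names `sales`, `stores`, `stores_dup` and `duplicate_detected` are illustrative assumptions.

```python
import pandas as pd

# Toy data: many sales rows per store, one row per store in the lookup table.
sales = pd.DataFrame({"store_nbr": [1, 1, 2], "unit_sales": [4.0, 2.0, 7.0]})
stores = pd.DataFrame({"store_nbr": [1, 2], "store_type": ["A", "B"]})

# validate="many_to_one" raises a MergeError when store_nbr is not unique
# in the right-hand dataframe, so the left join cannot add rows unnoticed.
merged = sales.merge(stores, on="store_nbr", how="left", validate="many_to_one")
print(len(merged) == len(sales))

# A lookup table with a duplicated key is rejected.
stores_dup = pd.concat([stores, stores.iloc[[0]]], ignore_index=True)
try:
    sales.merge(stores_dup, on="store_nbr", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```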

In [27]:
# df_sales = df_sales[(df_sales["store_nbr"] == 1)]

# df_sales.info()

In [28]:
df_final = merge_datasets(df_sales, df_items, df_stores, df_holidays)  # --> 2.44 GB

Step 1 - Importing, downcasting and normalizing data and optimizing memory, the following data has been imported.
The 'df_sales' dataframe contains: 125.497.040 observations and 8 features.
After optimizing by downcasting and normalizing it has optimized size of    2.1 GB.

The 'df_items' dataframe contains: 4.100 observations and 4 features.
After optimizing by downcasting and normalizing it has optimized size of    0.0 GB.

The 'df_stores' dataframe contains: 54 observations and 5 features.
After optimizing by downcasting and normalizing it has optimized size of    0.0 GB.

The 'df_holidays' dataframe contains: 350 observations and 6 features.
After optimizing by downcasting and normalizing it has optimized size of    0.0 GB.
----------------------------------------------------------------------------------------------------
Step 2 - Cleaning sales data and making a cartesian product of the sales data and the minimum and maximum dates found in the data.
The first date in the sales da

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sales_funnel_last_full_month['unit_sales_percentage'] = (df_sales_funnel_last_full_month['unit_sales'] / df_sales_funnel_last_full_month_total_unit_sales) * 100


--------------------------------------------------
We will now check for the impact of the exclusion functions on our initial dataset, we will do this both the whole timerange as well as the last full month
Check 1-Effect on full timerange

Total timerange  - The unit sales from 2013-01-01 00:00:00 to 2017-08-15 00:00:00 is 1.073.612.032. This is 100% without excluding data
                   The total number of rows in the df_sales dataset without any manipulation is 125.497.040

Exclusion step 1 - Excluding stores that do not have sales at the start of the last two years selected for the train, test and split
                   The effect of exclusion 1 is a reduction in rows of 11.033.163 leaving us with 114.463.877 rows
                   The effect of exclusion 1 is a reduction in unit sales of 74.822.848 leaving us with 998.789.184 
                   The effect of exclusion 1 is a reduction of unit sales of 6.97% leaving us with 93.03%.

Exclusion step 2 - Excluding stores that 

In [29]:
df_final.shape[0]

200729382

In [25]:
# Sebastiaan code -
# What about the "onpromotion" column? It seems to have a lot of NaN values. Are these quality issues, or is there simply no promotion?
# This issue didn't arise after merging; it was there from the beginning (in the df_sales dataframe).
# You would expect the value to be "False" if there's no promotion going on.

# df_sales1 = sales_cleaned(df_sales)

# df_sales1_unique = df_sales1["onpromotion"].unique()

# 4. Data Manipulation

X.X. Count nulls per column

In [30]:
null_counts = df_final.isnull().sum()

type(null_counts)

pandas.core.series.Series

In [31]:
df_final.info()
# Count nulls per column
null_counts = df_final.isnull().sum()

# Print results
for column, count in null_counts.items():
    print(f"Column '{column}' has {count} null values.")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200729382 entries, 0 to 200729381
Data columns (total 13 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   store_nbr               uint8         
 1   item_nbr                uint32        
 2   date                    datetime64[ns]
 3   unit_sales              float32       
 4   onpromotion             boolean       
 5   holiday_local_count     int8          
 6   holiday_national_count  int8          
 7   holiday_regional_count  int8          
 8   store_type              category      
 9   store_cluster           uint8         
 10  item_family             category      
 11  item_class              uint16        
 12  perishable              uint8         
dtypes: boolean(1), category(2), datetime64[ns](1), float32(1), int8(3), uint16(1), uint32(1), uint8(3)
memory usage: 5.2 GB
Column 'store_nbr' has 0 null values.
Column 'item_nbr' has 0 null values.
Column 'date' has 0 nul

4.2: Detect negative values

•	Action: Replace unit_sales values lower than zero with N/A (NaN)

To-do: decide whether negative values should be set to 0, or set to NaN and imputed later.

In [32]:
def negative_sales_cleaned(df):

    # Check the number of negative values before replacement
    before_replacement = (df["unit_sales"] < 0).sum()
    print(f"Number of negative values before replacement: {before_replacement}")

    # Boolean mask that flags every row with negative unit_sales
    negative_sales_mask = df["unit_sales"] < 0

    # Replace the flagged values with NaN so they can be imputed later
    df.loc[negative_sales_mask, "unit_sales"] = np.nan

    # Check the number of negative values after replacement
    after_replacement = (df["unit_sales"] < 0).sum()
    print(f"Number of negative values after replacement: {after_replacement}")

    return df
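
The two options from the to-do above can be compared on toy data. The sketch below is illustrative only; the names `toy`, `as_nan` and `as_zero` are assumptions and no choice between the options is implied.

```python
import pandas as pd

# Toy unit_sales values with two negatives (for example, returns).
toy = pd.DataFrame({"unit_sales": [3.0, -1.0, 0.0, -2.5]})

# Option 1: mark negatives as missing so they can be imputed later.
as_nan = toy["unit_sales"].mask(toy["unit_sales"] < 0)

# Option 2: floor negatives at zero.
as_zero = toy["unit_sales"].clip(lower=0)

print(as_nan.tolist())
print(as_zero.tolist())
```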

4.3 Define new, old and closed stores

•	Condition: sales for all items of a given store and date are NA

•	Action: Impute with 0

-------------------------------

Label variables for attributing numbers to store status:

-     OPEN = 0
-     NEW = 2
-     CLOSED = 4
-     OLD = 6
-     NEVER_OPENED = 8

----------

To-do: Rewrite in polars?

To-do: Can the ML model run with NaN values? Or do the new / old stores also need to be imputed with 0?
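
As a rough sketch of the labelling rules above, the toy example below assigns the status codes without looping over stores. It is a vectorized pandas alternative under stated assumptions, not the implementation used in this notebook: it starts from a toy dataframe `daily` that already holds the summed unit sales per store and date, and the helper names `has_sales`, `sale_dates`, `first_sale` and `last_sale` are made up for illustration.

```python
import numpy as np
import pandas as pd

OPEN, NEW, CLOSED, OLD, NEVER_OPENED = 0, 2, 4, 6, 8

# Toy daily totals: store 1 has sales on day 2 and day 4, store 2 never sells.
dates = pd.date_range("2013-01-01", periods=5, freq="D")
daily = pd.DataFrame(
    {
        "store_nbr": [1] * 5 + [2] * 5,
        "date": list(dates) * 2,
        "unit_sales": [0.0, 5.0, 0.0, 3.0, 0.0] + [0.0] * 5,
    }
)

# First and last date with sales per store, broadcast back to every row.
has_sales = daily["unit_sales"] > 0
sale_dates = daily["date"].where(has_sales)
first_sale = sale_dates.groupby(daily["store_nbr"]).transform("min")
last_sale = sale_dates.groupby(daily["store_nbr"]).transform("max")

# The first matching condition wins, so the order encodes the priority.
daily["store_status"] = np.select(
    [
        first_sale.isna(),             # store never had any sales
        daily["date"] < first_sale,    # before the first sale
        daily["date"] > last_sale,     # after the last sale
        ~has_sales,                    # zero sales between first and last sale
    ],
    [NEVER_OPENED, NEW, OLD, CLOSED],
    default=OPEN,
).astype("int8")
print(daily)
```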


In [33]:
def merge_store_status(df):

    # Label variables for attributing numbers to store status, to save memory in df
    OPEN = 0
    NEW = 2
    CLOSED = 4
    OLD = 6
    NEVER_OPENED = 8

    # Group by store and date, then sum sales (a single reset_index() is enough;
    # a second one would add a redundant "index" column)
    df_grouped = (
        df.groupby(["store_nbr", "date"]).agg({"unit_sales": "sum"}).reset_index()
    )

    # Sort by store and date
    df_grouped = df_grouped.sort_values(["store_nbr", "date"])

    # Create a new column for store status, label all stores as 'open' by default and make the dtype int8
    df_grouped["store_status"] = np.int8(OPEN)

    # Find the first and last day with sales for each store
    first_sale_date = (
        df_grouped[df_grouped["unit_sales"] > 0].groupby("store_nbr")["date"].min()
    )

    last_sale_date = (
        df_grouped[df_grouped["unit_sales"] > 0].groupby("store_nbr")["date"].max()
    )

    # Loop through the stores and label them as 'NEW', 'CLOSED', 'OLD' or 'NEVER_OPENED' based on first sale date and last sale date
    for store in df_grouped["store_nbr"].unique():
        store_data = df_grouped[df_grouped["store_nbr"] == store]

        if store in first_sale_date.index:
            first_date = first_sale_date[store]
            last_date = last_sale_date[store]

            # Mark as 'NEW' before first sale date
            df_grouped.loc[
                (df_grouped["store_nbr"] == store) & (df_grouped["date"] < first_date),
                "store_status",
            ] = NEW

            # Mark as 'closed' after first sale date if sales are 0
            df_grouped.loc[
                (df_grouped["store_nbr"] == store)
                & (df_grouped["date"] > first_date)
                & (df_grouped["unit_sales"] == 0),
                "store_status",
            ] = CLOSED

            # Mark as 'OLD' after last sale date
            df_grouped.loc[
                (df_grouped["store_nbr"] == store) & (df_grouped["date"] > last_date),
                "store_status",
            ] = OLD

        else:
            # If a store never had any sales, mark all dates as 'NEVER_OPENED' --> no records?
            df_grouped.loc[df_grouped["store_nbr"] == store, "store_status"] = (
                NEVER_OPENED
            )

    # Merging store_status on df_sales
    df = df.merge(
        df_grouped[["store_nbr", "date", "store_status"]],
        left_on=["store_nbr", "date"],
        right_on=["store_nbr", "date"],
        how="left",
    )

    # Get list of NEW stores at 01-01-2013 and OPEN stores at 02-01-2013
    mask_new = (df["store_status"] == NEW) & (df["date"] == "2013-01-01")
    mask_open = (df["store_status"] == OPEN) & (df["date"] == "2013-01-02")

    # Get the set of stores that meet both conditions: NEW on 01-01-2013 and OPEN on 02-01-2013
    stores_new = set(df[mask_new]["store_nbr"].unique())
    stores_open = set(df[mask_open]["store_nbr"].unique())
    stores_status_change = stores_new.intersection(stores_open)

    # Change status of stores that are NEW on 01-01-2013 but OPEN on 02-01-2013 to CLOSED on 01-01-2013
    df.loc[
        (df["store_nbr"].isin(stores_status_change)) & (df["date"] == "2013-01-01"),
        ["store_status"],
    ] = [CLOSED]

    # Use a mask to flag all 'CLOSED' or (|) 'NEW' stores and impute their unit_sales with 0
    mask = (df["store_status"] == CLOSED) | (df["store_status"] == NEW)
    df.loc[mask, "unit_sales"] = 0

    print("-" * 72)
    print(
        f"Size of df:     {round(sys.getsizeof(df)/1024/1024/1024, 2)} GB and end observations:       {round(df.shape[0] / 1e6, 1)} million."
    )
    print("- " * 36)
    print("df_grouped store_status value counts:")
    print(df_grouped["store_status"].value_counts())

    print("-" * 72)

    return df

4.4 New product --> !Polars function!

•	Before the very first sale of an item, all observations are kept as NA

•	After the very first sale of an item, we go to step 3: 

 -----------------------------------

Label variable for attributing numbers to item status, to save memory in df
-     EXISTING = 1
-     NEW = 3
-     OLD = 7
-     NEVER_SOLD = 9

TO-DO: Add polars to requirements.txt

In [30]:
# <PATH>.\venv_case_project\Scripts\activate
# pip install polars

In [34]:
import polars as pl  # TODO: move to the package imports in step 0


def merge_item_status_polars(df_pandas):

    # Record the start time of the function

    start_time = time.time()

    # Label variables

    EXISTING = 1
    NEW = 3
    OLD = 7
    NEVER_SOLD = 9

    # Convert the Pandas df to Polars df
    df = pl.from_pandas(df_pandas)

    # Sort by store, item, and date
    df = df.sort(["store_nbr", "item_nbr", "date"])

    print(f"Elapsed time: {time.time() - start_time:.2f} seconds | LINE | df sorted |")

    # Create a new column for item status, initialise to EXISTING
    df = df.with_columns(pl.lit(EXISTING).cast(pl.Int8).alias("item_status"))

    print(
        f"Elapsed time: {time.time() - start_time:.2f} seconds | LINE | item_status added |"
    )

    # Filter for rows with unit_sales > 0 and calculate first/last sale dates
    first_sale_date = (
        df.filter(pl.col("unit_sales") > 0)
        .group_by(["store_nbr", "item_nbr"])
        .agg([pl.col("date").min().alias("first_sale_date")])
    )

    last_sale_date = (
        df.filter(pl.col("unit_sales") > 0)
        .group_by(["store_nbr", "item_nbr"])
        .agg([pl.col("date").max().alias("last_sale_date")])
    )
    print(
        f"Elapsed time: {time.time() - start_time:.2f} seconds | LINE | first and last sale dates |"
    )

    # Join first and last sale dates to the original dataframe
    df = df.join(first_sale_date, on=["store_nbr", "item_nbr"], how="left")

    df = df.join(last_sale_date, on=["store_nbr", "item_nbr"], how="left")

    print(
        f"Elapsed time: {time.time() - start_time:.2f} seconds | LINE | joined sale dates |"
    )

    # Update the item_status column based on first and last sale dates
    df = df.with_columns(
        pl.when(pl.col("date") < pl.col("first_sale_date"))
        .then(pl.lit(NEW))
        .when(pl.col("date") > pl.col("last_sale_date"))
        .then(pl.lit(OLD))
        .otherwise(pl.col("item_status"))
        .alias("item_status")
    )

    # Handle NEVER_SOLD case where first_sale_date is null
    df = df.with_columns(
        pl.when(pl.col("first_sale_date").is_null())
        .then(pl.lit(NEVER_SOLD))
        .otherwise(pl.col("item_status"))
        .alias("item_status")
    )

    print(
        f"Elapsed time: {time.time() - start_time:.2f} seconds | LINE | updated item status |"
    )

    # Drop columns "first_sale_date" and "last_sale_date" as these are no longer needed
    df = df.drop(["first_sale_date", "last_sale_date"])

    # Convert Polars df back to Pandas df
    df = df.to_pandas()

    print("-" * 72)
    print(f"Total execution time: {(time.time() - start_time) / 60:.2f} minutes")
    print("- " * 36)
    print("df_grouped item_status value counts:")
    print(df["item_status"].value_counts())

    print("-" * 72)

    return df

To-do: Add print function to keep track of the type of imputations

To-do: .interpolate() --> ???

4.5 Promotional Data 

•   All missing values are interpreted as a day with no promotion

•   Action: Impute onpromotion N/A with False
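
In recent pandas versions, calling `fillna` on an object-dtype column can emit a FutureWarning about silent downcasting. One way to avoid relying on that behaviour is to convert to the nullable `"boolean"` dtype first. The sketch below assumes `onpromotion` holds only True, False and missing values; the helper name `fill_onpromotion` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_onpromotion(df):
    """Return a copy of df with missing onpromotion values set to False."""
    out = df.copy()
    # Nullable boolean keeps True/False and represents missing values as <NA>,
    # so fillna does not have to downcast an object column
    out["onpromotion"] = (
        out["onpromotion"].astype("boolean").fillna(False).astype(bool)
    )
    return out


# Toy data, made up for illustration only
toy = pd.DataFrame({"onpromotion": [True, np.nan, False, None]})
filled = fill_onpromotion(toy)
print(filled["onpromotion"].tolist())
```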

In [35]:
# Fill missing N/A values in onpromotion column with False
def sales_fill_onpromotion(df):

    df["onpromotion"] = df["onpromotion"].fillna(False).astype(bool)

    return df

# 5 Feature construction

5.X Extracting datetime features

In [36]:
def datetime_features(df):
    # Ensure the date column is sorted
    df = df.sort_values("date")

    # Add column with ISO year
    df["year"] = df["date"].dt.isocalendar().year.astype("int16")

    # Add column with weekday (1-7, where 1 is Monday)
    df["weekday"] = df["date"].dt.dayofweek.add(1).astype("int8")

    # Add column with ISO week number (1-53)
    df["week_nbr"] = df["date"].dt.isocalendar().week.astype("int8")

    # Calculate the date of the Monday of the first week
    first_date = df["date"].iloc[0]
    days_to_last_monday = first_date.weekday()  # Monday is 0, so this is the offset back to Monday
    monday_first_week = first_date - pd.Timedelta(days=days_to_last_monday)

    # Calculate cumulative week numbers starting from the first Monday
    df["week_number_cum"] = (
        ((df["date"] - monday_first_week).dt.days // 7) + 1
    ).astype("int16")

    return df
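
As a quick check of the week logic, the sketch below reproduces the cumulative week calculation with the standard library only, assuming the first date in the data is 2013-01-01. It also shows that the ISO year can differ from the calendar year around New Year, which matters because the `year` column holds the ISO year. The helper name `cumulative_week` is hypothetical.

```python
from datetime import date, timedelta


def cumulative_week(day, first_day):
    """Week counter starting at 1 in the week (Monday-based) of first_day."""
    monday_first_week = first_day - timedelta(days=first_day.weekday())
    return (day - monday_first_week).days // 7 + 1


first_day = date(2013, 1, 1)  # assumed first date in the data (a Tuesday)

# Sunday of the first week and Monday of the second week
print(cumulative_week(date(2013, 1, 6), first_day))
print(cumulative_week(date(2013, 1, 7), first_day))

# 2013-12-30 is a Monday that already belongs to ISO week 1 of ISO year 2014
print(tuple(date(2013, 12, 30).isocalendar()))
```

Rows from the last days of December can therefore carry the following ISO year together with week number 1; whether that is desirable depends on how the model uses `year` and `week_nbr`.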

# 6. Data Manipulation and Feature construction --> Final-Function

In [37]:
def manipulate_final_dataset(df):

    df = negative_sales_cleaned(df)

    df = merge_store_status(df)

    df = merge_item_status_polars(df)

    df = sales_fill_onpromotion(df)

    df = datetime_features(df)
    
    return df

In [38]:
df_final = manipulate_final_dataset(df_final)

Number of negative values before replacement: 4438
Number of negative values after replacement: 0
------------------------------------------------------------------------
Size of df:     5.42 GB and end observations:       200.7 million.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
df_grouped store_status value counts:
store_status
0    68088
2     2217
4      549
Name: count, dtype: int64
------------------------------------------------------------------------
Elapsed time: 8.73 seconds | LINE | df sorted |
Elapsed time: 8.80 seconds | LINE | item_status added |
Elapsed time: 20.34 seconds | LINE | first and last sale dates |
Elapsed time: 37.87 seconds | LINE | joined sale dates |
Elapsed time: 39.48 seconds | LINE | updated item status |
------------------------------------------------------------------------
Total execution time: 0.81 minutes
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
df_grouped item_status value counts:
ite

  df["onpromotion"] = df["onpromotion"].fillna(False).astype(bool)


In [39]:
df_final.info()
# Count nulls per column
null_counts = df_final.isnull().sum()

# Print results
for column, count in null_counts.items():
    print(f"Column '{column}' has {count} null values.")

<class 'pandas.core.frame.DataFrame'>
Index: 200729382 entries, 0 to 200729381
Data columns (total 19 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   store_nbr               uint8         
 1   item_nbr                uint32        
 2   date                    datetime64[ns]
 3   unit_sales              float32       
 4   onpromotion             bool          
 5   holiday_local_count     int8          
 6   holiday_national_count  int8          
 7   holiday_regional_count  int8          
 8   store_type              category      
 9   store_cluster           uint8         
 10  item_family             category      
 11  item_class              uint16        
 12  perishable              uint8         
 13  store_status            int8          
 14  item_status             int8          
 15  year                    int16         
 16  weekday                 int8          
 17  week_nbr                int8          
 18  wee

In [41]:
df_finalcountholidaysum = df_final['holiday_regional_count'].sum()


76491

# Write to Parquet file and save it in output_path

In [42]:
def save_dataframe_to_parquet(df, output_path, file_prefix="Prepped_data"):
    try:
        # Ensure the directory exists
        os.makedirs(output_path, exist_ok=True)

        # Generate today's date for the filename
        today = date.today().strftime("%Y%m%d")

        # Create the full filename with path
        filename = f"{file_prefix}_{today}.parquet"
        full_path = os.path.join(output_path, filename)

        # Save the DataFrame to a Parquet file
        df.to_parquet(full_path)

        print(f"DataFrame successfully saved to {full_path}")

        return full_path

    except Exception as e:
        print(f"Error saving DataFrame to Parquet file: {e}")

        return None

In [43]:
# output_path = "C:/Users/alexander/Documents/0. Data Science and AI for Experts/TEST"

output_path = "C:/Users/sebas/OneDrive/Documenten/GitHub/Supermarketcasegroupproject/Group4B/data/interim"

saved_path = save_dataframe_to_parquet(df_final, output_path)

DataFrame successfully saved to C:/Users/sebas/OneDrive/Documenten/GitHub/Supermarketcasegroupproject/Group4B/data/interim\Prepped_data_20241016.parquet


X # Function to print memory usage of DataFrames

In [41]:
# Function to print memory usage of DataFrames
def print_memory_usage(dataframes):
    for name, df in dataframes.items():
        mem_usage = df.memory_usage(deep=True)
        total_mem = mem_usage.sum()

        print(f"DataFrame: {name}")
        print(mem_usage)
        print(f"Total Memory Usage: {total_mem} bytes\n")


# Check for DataFrames
dataframes = {
    name: obj for name, obj in globals().items() if isinstance(obj, pd.DataFrame)
}
print_memory_usage(dataframes)

DataFrame: _
Index                       80
store_nbr                   10
item_nbr                    40
date                        80
unit_sales                  40
onpromotion                 10
holiday_local_count         10
holiday_national_count      10
holiday_regional_count      10
store_type                 472
store_cluster               10
item_family               3318
item_class                  20
perishable                  10
store_status                10
item_status                 10
year                        20
weekday                     10
week_nbr                    10
week_number_cum             20
dtype: int64
Total Memory Usage: 4200 bytes

DataFrame: df_sales
Index                 128
id              501988160
store_nbr       125497040
item_nbr        501988160
unit_sales      501988160
onpromotion     250994080
day             125497040
year            125497200
month           125497324
date           1003976320
dtype: int64
Total Memory Usage: 326292361