# Competition overview

**Competition Summary**

- This is the Rohlik Sales Forecasting Challenge, a time-series forecasting competition hosted on Kaggle. The primary objective is to predict the sales volume for various inventory items across 11 different Rohlik Group warehouses for a period of 14 days.

- Accurate forecasts are vital for the e-grocery company's operations, as they directly impact supply chain efficiency, inventory management, and overall sustainability by minimizing waste.

- The model's performance will be evaluated using the Weighted Mean Absolute Error (WMAE). The specific weights for each inventory item are provided in a separate file. The competition runs from November 15, 2024, to February 15, 2025, and offers cash prizes for the top three competitors.

- The dataset includes historical sales and order data, product metadata, and a calendar with holiday information. Some features available in the training set (e.g., sales and availability) are intentionally removed from the test set, as they would not be known at the time of a real-world prediction.

### [Link to competition](https://www.kaggle.com/competitions/rohlik-sales-forecasting-challenge-v2/overview)

# Data Dictionary

This data dictionary describes the files and columns provided for the competition.

**sales_train.csv and sales_test.csv**

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Column</th>
    <th class="tg-7zrl">Description</th>
    <th class="tg-7zrl">Data Type</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">unique_id</td>
    <td class="tg-7zrl">A unique identifier for a specific inventory item in a specific warehouse.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">date</td>
    <td class="tg-7zrl">The date of the sales record.</td>
    <td class="tg-7zrl">Date</td>
  </tr>
  <tr>
    <td class="tg-7zrl">warehouse</td>
    <td class="tg-7zrl">The name of the warehouse where the item is stored.</td>
    <td class="tg-7zrl">String</td>
  </tr>
  <tr>
    <td class="tg-7zrl">total_orders</td>
    <td class="tg-7zrl">The historical number of orders for the selected warehouse.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">sales</td>
    <td class="tg-7zrl">The target variable: sales volume (pcs or kg).</td>
    <td class="tg-7zrl">Float</td>
  </tr>
  <tr>
    <td class="tg-7zrl">sell_price_main</td>
    <td class="tg-7zrl">The selling price of the item.</td>
    <td class="tg-7zrl">Float</td>
  </tr>
  <tr>
    <td class="tg-7zrl">availability</td>
    <td class="tg-7zrl">The proportion of the day the item was available. A value of 1 means it was available all day.</td>
    <td class="tg-7zrl">Float</td>
  </tr>
  <tr>
    <td class="tg-7zrl">type_0_discount, type_1_discount, etc.</td>
    <td class="tg-7zrl">The percentage discount offered for various promotion types. Negative values indicate no discount.</td>
    <td class="tg-7zrl">Float</td>
  </tr>
</tbody></table>

**inventory.csv**

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Column</th>
    <th class="tg-7zrl">Description</th>
    <th class="tg-7zrl">Data Type</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">unique_id</td>
    <td class="tg-7zrl">A unique identifier for a specific inventory item.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">product_unique_id</td>
    <td class="tg-7zrl">A unique identifier for a product, shared across all warehouses.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">name</td>
    <td class="tg-7zrl">The name of the product.</td>
    <td class="tg-7zrl">String</td>
  </tr>
  <tr>
    <td class="tg-7zrl">L1_category_name, L2_category_name, etc.</td>
    <td class="tg-7zrl">Hierarchical category names for the product. L4 is the most granular.</td>
    <td class="tg-7zrl">String</td>
  </tr>
  <tr>
    <td class="tg-7zrl">warehouse</td>
    <td class="tg-7zrl">The name of the warehouse where the inventory item is located.</td>
    <td class="tg-7zrl">String</td>
  </tr>
</tbody></table>

**calendar.csv**

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Column</th>
    <th class="tg-7zrl">Description</th>
    <th class="tg-7zrl">Data Type</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">warehouse</td>
    <td class="tg-7zrl">The name of the warehouse.</td>
    <td class="tg-7zrl">String</td>
  </tr>
  <tr>
    <td class="tg-7zrl">date</td>
    <td class="tg-7zrl">The date of the calendar event.</td>
    <td class="tg-7zrl">Date</td>
  </tr>
  <tr>
    <td class="tg-7zrl">holiday_name</td>
    <td class="tg-7zrl">The name of the public holiday, if applicable.</td>
    <td class="tg-7zrl">String</td>
  </tr>
  <tr>
    <td class="tg-7zrl">holiday</td>
    <td class="tg-7zrl">A binary flag (0 or 1) indicating if the date is a holiday.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">shops_closed</td>
    <td class="tg-7zrl">A flag indicating a public holiday where most shops are closed.</td>
    <td class="tg-7zrl">Boolean</td>
  </tr>
  <tr>
    <td class="tg-7zrl">winter_school_holidays</td>
    <td class="tg-7zrl">A flag for winter school holidays.</td>
    <td class="tg-7zrl">Boolean</td>
  </tr>
  <tr>
    <td class="tg-7zrl">school_holidays</td>
    <td class="tg-7zrl">A flag for general school holidays.</td>
    <td class="tg-7zrl">Boolean</td>
  </tr>
</tbody></table>

**test_weights.csv**

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Column</th>
    <th class="tg-7zrl">Description</th>
    <th class="tg-7zrl">Data Type</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">unique_id</td>
    <td class="tg-7zrl">A unique identifier for the inventory item.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">weight</td>
    <td class="tg-7zrl">The weight used for calculating the Weighted Mean Absolute Error (WMAE) metric for this item.</td>
    <td class="tg-7zrl">Float</td>
  </tr>
</tbody>
</table>

### [Link to dataset](https://www.kaggle.com/competitions/rohlik-sales-forecasting-challenge-v2/data)

# Pipeline Overview

The pipeline for the Rohlik Sales Forecasting Model is a comprehensive, end-to-end machine learning workflow that predicts product sales using a hybrid deep learning architecture. It starts with data loading and merging from various sources, including historical sales, inventory, and calendar data, which is then enriched with custom holiday and temporal features. After a chronological train-test split, the data undergoes exploratory data analysis (EDA), where missing values are imputed, and features are prepared and encoded. Feature engineering further refines the dataset by dropping irrelevant columns and applying a preprocessing pipeline for scaling and encoding. The processed data is then prepared into time-series datasets for a WaveNet + Transformer hybrid model, which combines dilated causal convolutions with multi-head self-attention to capture both local and global temporal patterns. The model is trained using a custom weighted loss function to handle imbalanced data, and its performance is evaluated on unscaled metrics like RMSE and WMAE.

**Key Stages and Components**

**1. Data Preparation and Enrichment**

The pipeline begins by loading and merging several datasets (`sales_train.csv`, `inventory.csv`, `sales_test.csv`, and `test_weights.csv`). The `calendar.csv` is also loaded and enriched with specific custom holidays for different locations (e.g., Prague, Munich). This process creates new features like `days_to_next_holiday` and `post_closure_days`, which are crucial for capturing the impact of events on sales. The data is then filtered to include only relevant time periods and locations before all datasets are joined together into a single master dataframe.

**2. Preprocessing and Feature Engineering**

A chronological train-test split ensures the model is evaluated on future data, mimicking a real-world scenario. Initial EDA checks for missing values, which are imputed with the column mean. Key features, including product IDs, warehouse locations, and categories, are encoded using `OrdinalEncoder`. The `date` column is transformed into cyclical features (e.g., `sin_month`, `cos_month`) to represent seasonality effectively. Based on Mutual Information (MI) analysis and domain knowledge, low-relevance features (like most discount types and unique IDs) are dropped to reduce noise and model complexity. The final preprocessing pipeline scales numerical features with `StandardScaler` and saves these transformers for future use.

**3. Modeling and Training**

The core of the pipeline is a hybrid deep learning model that combines WaveNet and Transformer architectures.

- **WaveNet blocks** use dilated causal convolutions to capture local, short-term dependencies in the time series without a large number of layers.
- **Transformer encoders** use **multi-head and self-attention** to model long-range dependencies, allowing the model to weigh the importance of different time steps in the sequence.

The model is trained using a custom Weighted MAE loss function, which assigns higher penalties to errors on more critical predictions based on the `test_weights.csv`. An Adam optimizer with a cosine decay learning rate schedule is used, and training incorporates Early Stopping to prevent overfitting.

**4. Evaluation**

After training, the model's performance is evaluated on the test dataset. The pipeline computes both **scaled metrics** (like WMAE on the model's direct output) and **unscaled metrics** (like MAE and RMSE) for better interpretability. The unscaled metrics are calculated after inverse-scaling the predictions back to their original sales values, providing a clear measure of how well the model predicts actual sales numbers. This comprehensive evaluation ensures the model is not only accurate but also robust for weighted samples, a key requirement of the challenge.

# Model submission notebook

### [Model submission notebook](https://www.kaggle.com/code/misterfour/rohlik-sales-forecasting-challenge-submission)
### [Reference! (add holidays calendar of each country into dataset)](https://www.kaggle.com/competitions/rohlik-sales-forecasting-challenge-v2/overview)

# Import libraries

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/rohlik-sales-forecasting-challenge-v2/calendar.csv
/kaggle/input/rohlik-sales-forecasting-challenge-v2/test_weights.csv
/kaggle/input/rohlik-sales-forecasting-challenge-v2/inventory.csv
/kaggle/input/rohlik-sales-forecasting-challenge-v2/sales_train.csv
/kaggle/input/rohlik-sales-forecasting-challenge-v2/sales_test.csv
/kaggle/input/rohlik-sales-forecasting-challenge-v2/solution.csv


In [2]:
import os
import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow.keras.layers import ( # Consolidated Keras layers
    Input, Conv1D, Multiply, Add, Dense, Dropout, LayerNormalization,
    MultiHeadAttention, GlobalAveragePooling1D, Activation, BatchNormalization
)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import Metric
from tensorflow.keras import mixed_precision # For mixed precision policy, if used

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error # For evaluation metrics
from datetime import datetime
import joblib # For saving/loading models or other objects
from sklearn.preprocessing import OrdinalEncoder
from joblib import Parallel, delayed
from sklearn.feature_selection import mutual_info_regression

# Disable GPU usage and force CPU only (if desired)
# This block should typically come after TensorFlow import but before model definition/training
tf.config.set_visible_devices([], 'GPU')  # Prevent TensorFlow from using GPU
physical_devices = tf.config.list_physical_devices('CPU')
assert len(physical_devices) > 0, "No CPU devices found"
tf.config.set_logical_device_configuration(
    physical_devices[0],
    [tf.config.LogicalDeviceConfiguration()]
)

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

2025-08-24 09:00:35.657793: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1756026035.882814      36 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1756026035.952800      36 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Load Dataset

In [4]:
train = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/sales_train.csv', parse_dates=['date'])
inventory = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/inventory.csv')
submission = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/sales_test.csv', parse_dates=['date'])
weights = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/test_weights.csv')

# Merge Dataset and Calendar Enrichment

This step handles the crucial data enrichment and merging steps of the sales forecasting pipeline, focusing on creating and integrating new temporal features. It specifically addresses the need for detailed holiday and closure information, which can significantly influence sales patterns.

**Custom Holiday and Calendar Enrichment**

The initial section of the code defines lists of custom holiday dates, such as Easter Day and Mother Day, for specific warehouses in the Czech Republic, Brno, Munich, and Frankfurt. The `fill_loss_holidays` function then takes these lists and populates the holiday and `holiday_name` columns in a base calendar DataFrame, ensuring these important dates are captured for their respective locations.

The `enrich_calendar` function is a key part of this process. It generates several new, highly predictive temporal features by analyzing the dates within the calendar. These include:

- `date_days_to_next_holiday`: The number of days remaining until the next observed holiday.
- `date_days_to_shops_closed`: The number of days until the next shop closure.
- `date_day_after_closed_day`: A binary flag that is 1 if the current day is the day after a shop was closed.
- `date_second_closed_day`: A binary flag that indicates if a day is part of a multi-day shop closure (e.g., the second day of a long weekend).

After enriching the calendar, the code renames the new columns to a consistent format (e.g., `date_holiday_name, date_days_to_next_holiday`) for better clarity and organization.

**Dataset Stacking and Merging**

The stack_datasets function is a core utility that combines all the disparate data sources into a single, cohesive DataFrame. It performs a series of merges:

- It first merges the main sales data with the newly enriched calendar data on the date and warehouse columns.
- Next, it joins the result with the inventory data on unique_id and warehouse to incorporate product-specific details.
- Finally, it merges the weights data on unique_id, which is critical for the model's weighted loss function.

This function ensures that all relevant features—from product prices and inventory to temporal holidays and closure flags—are available in a single table for the subsequent modeling steps. 

In [5]:
# Additional holiday days

czech_holiday = [ 
    (['03/31/2024', '04/09/2023', '04/17/2022', '04/04/2021', '04/12/2020'], 'Easter Day'),
    (['05/12/2024', '05/10/2020', '05/09/2021', '05/08/2022', '05/14/2023'], "Mother Day"),
]
brno_holiday = [
    (['03/31/2024', '04/09/2023', '04/17/2022', '04/04/2021', '04/12/2020'], 'Easter Day'),
    (['05/12/2024', '05/10/2020', '05/09/2021', '05/08/2022', '05/14/2023'], "Mother Day"),
]
munich_holidays = [
    (['03/30/2024', '04/08/2023', '04/16/2022', '04/03/2021'], 'Holy Saturday'),
    (['05/12/2024', '05/14/2023', '05/08/2022', '05/09/2021'], 'Mother Day'),
]
frankfurt_holidays = [
    (['03/30/2024', '04/08/2023', '04/16/2022', '04/03/2021'], 'Holy Saturday'),
    (['05/12/2024', '05/14/2023', '05/08/2022', '05/09/2021'], 'Mother Day'),
]

# Functions

def fill_loss_holidays(df_fill, warehouses, holidays):
    df = df_fill.copy()
    for item in holidays:
        dates, holiday_name = item
        generated_dates = [datetime.strptime(date, '%m/%d/%Y').strftime('%Y-%m-%d') for date in dates]
        for generated_date in generated_dates:
            df.loc[(df['warehouse'].isin(warehouses)) & (df['date'] == generated_date), 'holiday'] = 1
            df.loc[(df['warehouse'].isin(warehouses)) & (df['date'] == generated_date), 'holiday_name'] = holiday_name
    return df

def enrich_calendar(df):
    df = df.sort_values('date').reset_index(drop=True)

    # Number of days until next holiday
    df['next_holiday_date'] = df.loc[df['holiday'] == 1, 'date'].shift(-1)
    # Fill NaT values by using the next valid observation to fill the gap
    df['next_holiday_date'] = df['next_holiday_date'].bfill() 
    df['date_days_to_next_holiday'] = (df['next_holiday_date'] - df['date']).dt.days
    df.drop(columns=['next_holiday_date'], inplace=True)

    # Number of days until shops are closed
    df['next_shops_closed_date'] = df.loc[df['shops_closed'] == 1, 'date'].shift(-1)
    df['next_shops_closed_date'] = df['next_shops_closed_date'].bfill()
    df['date_days_to_shops_closed'] = (df['next_shops_closed_date'] - df['date']).dt.days
    df.drop(columns=['next_shops_closed_date'], inplace=True)

    # Was the shop closed yesterday?
    df['date_day_after_closed_day'] = ((df['shops_closed'] == 0) & (df['shops_closed'].shift(1) == 1)).astype(int)

    # Are shops closed today and were they also closed yesterday (e.g., December 26 in Germany)?
    df['date_second_closed_day'] = ((df['shops_closed'] == 1) & (df['shops_closed'].shift(1) == 1)).astype(int)

    # Was the shop closed the last two days?
    df['date_day_after_two_closed_days'] = ((df['shops_closed'] == 0) & (df['date_second_closed_day'].shift(1) == 1)).astype(int)

    return df

#calendar = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/calendar.csv', parse_dates=['date'])
calendar = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/calendar.csv', parse_dates=['date'])
calendar = fill_loss_holidays(df_fill=calendar, warehouses=['Prague_1', 'Prague_2', 'Prague_3'], holidays=czech_holiday)
calendar = fill_loss_holidays(df_fill=calendar, warehouses=['Brno_1'], holidays=brno_holiday)
calendar = fill_loss_holidays(df_fill=calendar, warehouses=['Munich_1'], holidays=munich_holidays)
calendar = fill_loss_holidays(df_fill=calendar, warehouses=['Frankfurt_1'], holidays=frankfurt_holidays)

calendar_enriched = pd.DataFrame()

for location in ['Frankfurt_1', 'Prague_2', 'Brno_1', 'Munich_1', 'Prague_3', 'Prague_1', 'Budapest_1']:
    calendar_enriched = pd.concat([
        calendar_enriched,enrich_calendar(calendar.query('date >= "2020-08-01 00:00:00" and warehouse ==@location'))])
calendar_enriched.loc[:,'year'] = calendar_enriched['date'].dt.year
calendar_enriched.sort_values('date')[['date','holiday_name','shops_closed','warehouse','date_days_to_next_holiday']].head(5)

calendar_enriched = calendar_enriched.rename(columns={
    'holiday_name':'date_holiday_name',
    'year':'date_year',
    'holiday':'date_holiday_flag',
    'holiday':'date_holiday_flag',
    'shops_closed':'date_shops_closed_flag',
    'winter_school_holidays':'date_winter_school_holidays_flag',
    'school_holidays':'date_school_holidays_flag',
})

def stack_datasets(df, calendar_extended, inventory, weights):
    """
    Stacks the given DataFrame with additional data from calendar, inventory, and weights.

    Args:
        df: The main DataFrame to be stacked.
        calendar_extended: DataFrame containing calendar-related information.
        inventory: DataFrame containing inventory information.
        weights: DataFrame containing weight information for unique IDs.

    Returns:
        pandas.DataFrame: The stacked DataFrame.
    """
    # Merge with calendar_extended on date and warehouse
    df = df.merge(calendar_extended, on=['date', 'warehouse'], how='left')
    
    # Merge with inventory on unique_id and warehouse
    df = df.merge(inventory, on=['unique_id', 'warehouse'], how='left')
    
    # Perform a VLOOKUP-style merge with weights on unique_id
    df = df.merge(weights, on='unique_id', how='left')
    
    # Ensure 'date' is in datetime format
    df['date'] = pd.to_datetime(df['date'])
    
    return df

df_train = stack_datasets(train, calendar_enriched, inventory, weights)
df_train

# Fill 'date_holiday_name' with 'Working Day' where it's NaN
df_train['date_holiday_name'] = df_train['date_holiday_name'].fillna('Working Day')
df_train

Unnamed: 0,unique_id,date,warehouse,total_orders,sales,sell_price_main,availability,type_0_discount,type_1_discount,type_2_discount,...,date_second_closed_day,date_day_after_two_closed_days,date_year,product_unique_id,name,L1_category_name_en,L2_category_name_en,L3_category_name_en,L4_category_name_en,weight
0,4845,2024-03-10,Budapest_1,6436.0,16.34,646.26,1.00,0.00000,0.0,0.0,...,0,0,2024,2375,Croissant_35,Bakery,Bakery_L2_18,Bakery_L3_83,Bakery_L4_1,1.925596
1,4845,2021-05-25,Budapest_1,4663.0,12.63,455.96,1.00,0.00000,0.0,0.0,...,0,1,2021,2375,Croissant_35,Bakery,Bakery_L2_18,Bakery_L3_83,Bakery_L4_1,1.925596
2,4845,2021-12-20,Budapest_1,6507.0,34.55,455.96,1.00,0.00000,0.0,0.0,...,0,0,2021,2375,Croissant_35,Bakery,Bakery_L2_18,Bakery_L3_83,Bakery_L4_1,1.925596
3,4845,2023-04-29,Budapest_1,5463.0,34.52,646.26,0.96,0.20024,0.0,0.0,...,0,0,2023,2375,Croissant_35,Bakery,Bakery_L2_18,Bakery_L3_83,Bakery_L4_1,1.925596
4,4845,2022-04-01,Budapest_1,5997.0,35.92,486.41,1.00,0.00000,0.0,0.0,...,0,0,2022,2375,Croissant_35,Bakery,Bakery_L2_18,Bakery_L3_83,Bakery_L4_1,1.925596
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4007414,4941,2023-06-21,Prague_1,9988.0,26.56,34.06,1.00,0.00000,0.0,0.0,...,0,0,2023,2422,Kohlrabi_9,Fruit and vegetable,Fruit and vegetable_L2_3,Fruit and vegetable_L3_114,Fruit and vegetable_L4_1,2.262646
4007415,4941,2023-06-24,Prague_1,8518.0,27.42,34.06,1.00,0.00000,0.0,0.0,...,0,0,2023,2422,Kohlrabi_9,Fruit and vegetable,Fruit and vegetable_L2_3,Fruit and vegetable_L3_114,Fruit and vegetable_L4_1,2.262646
4007416,4941,2023-06-23,Prague_1,10424.0,33.39,34.06,1.00,0.00000,0.0,0.0,...,0,0,2023,2422,Kohlrabi_9,Fruit and vegetable,Fruit and vegetable_L2_3,Fruit and vegetable_L3_114,Fruit and vegetable_L4_1,2.262646
4007417,4941,2023-06-22,Prague_1,10342.0,22.88,34.06,1.00,0.00000,0.0,0.0,...,0,0,2023,2422,Kohlrabi_9,Fruit and vegetable,Fruit and vegetable_L2_3,Fruit and vegetable_L3_114,Fruit and vegetable_L4_1,2.262646


# Split Dataset

This code performs a chronological train-test split on a time-series dataset, a critical step in time-series forecasting. The core concept is to prevent data leakage by ensuring that the model is trained only on past data and evaluated exclusively on future data. This mimics a real-world scenario where a model trained on historical information is used to make predictions for an unseen future. The code splits a DataFrame (`df_split`) into two parts: a training set and a testing set.

Split Index Calculation: It first calculates a split index, which is 80% of the total number of rows.

     train_split_index = int(len(df_split) * 0.80)

Chronological Slicing: It then uses this index to slice the DataFrame. The training set (`df_train`) contains the first 80% of the data, representing the earliest time periods. The testing set (`df_test`) contains the remaining 20%, representing the most recent time periods. The use of .iloc ensures that the slicing is based on row position, maintaining the chronological order.

     df_train = df_split.iloc[:train_split_index].copy()
     df_test = df_split.iloc[train_split_index:].copy()

**Theories and Concepts**

- **Preventing Data Leakage:** A random split could include future data points in the training set. For example, a random split might use sales data from December 2024 to train a model that is then tested on sales from October 2024. This data leakage would give the model an unfair advantage, as it has "seen the future," leading to overly optimistic performance metrics that won't hold up in a real forecasting situation.
- **Mimicking Real-World Forecasting:** A chronological split ensures the model's performance on the test set is a reliable proxy for its performance on future, unseen data. This method provides an honest evaluation of the model's ability to generalize to new time periods, which is the ultimate goal of any forecasting model.

In [6]:
# Calculate the split index for 80% training data
# The remaining 20% will be for the test set
df_split = df_train.copy()
train_split_index = int(len(df_split) * 0.80)
# Split the DataFrame chronologically
# IMPORTANT: Perform slicing on the original df_train for both new dataframes
df_train = df_split.iloc[:train_split_index].copy()
df_test = df_split.iloc[train_split_index:].copy() 

print("New df_train shape (80%):", df_train.shape)
print("New df_test shape (20%):", df_test.shape)

New df_train shape (80%): (3205935, 32)
New df_test shape (20%): (801484, 32)


# EDA

This step performs Exploratory Data Analysis (EDA) to prepare the sales forecasting dataset for modeling. The process involves handling missing values, encoding categorical and temporal features, and using a statistical method to assess feature relevance.

**Missing Value Imputation**

The first step is a missing value check using `df.isnull().sum()`. The output shows that the `total_orders` and sales columns in both the training and test sets have a small number of NaN (Not a Number) values.

**Theory: Data Integrity and Imputation**

Missing data can cause errors or lead to biased results in machine learning models. The code addresses this using imputation, a technique to fill in missing values. The chosen method is to replace NaNs with the column mean. This is a simple and common strategy, particularly when the number of missing values is small, as it preserves the overall mean of the feature and prevents data loss. The output confirms that all NaNs in the `total_orders` and `sales` columns of the training set are successfully filled.

In [7]:
df_train.isnull().sum()

unique_id                            0
date                                 0
warehouse                            0
total_orders                        34
sales                               34
sell_price_main                      0
availability                         0
type_0_discount                      0
type_1_discount                      0
type_2_discount                      0
type_3_discount                      0
type_4_discount                      0
type_5_discount                      0
type_6_discount                      0
date_holiday_name                    0
date_holiday_flag                    0
date_shops_closed_flag               0
date_winter_school_holidays_flag     0
date_school_holidays_flag            0
date_days_to_next_holiday            0
date_days_to_shops_closed            0
date_day_after_closed_day            0
date_second_closed_day               0
date_day_after_two_closed_days       0
date_year                            0
product_unique_id        

In [8]:
df_test.isnull().sum()

unique_id                            0
date                                 0
warehouse                            0
total_orders                        18
sales                               18
sell_price_main                      0
availability                         0
type_0_discount                      0
type_1_discount                      0
type_2_discount                      0
type_3_discount                      0
type_4_discount                      0
type_5_discount                      0
type_6_discount                      0
date_holiday_name                    0
date_holiday_flag                    0
date_shops_closed_flag               0
date_winter_school_holidays_flag     0
date_school_holidays_flag            0
date_days_to_next_holiday            0
date_days_to_shops_closed            0
date_day_after_closed_day            0
date_second_closed_day               0
date_day_after_two_closed_days       0
date_year                            0
product_unique_id        

In [9]:
# Identify numerical columns (excluding non-numeric types like 'object' or 'datetime')
# Only numeric columns can have a mean calculated for filling.
numeric_cols = df_train.select_dtypes(include=np.number).columns

# Fill NaN values in numeric columns with the mean of their respective columns
for col in numeric_cols:
    if df_train[col].isnull().any(): # Check if there are any NaNs in the column
        col_mean = df_train[col].mean()
        df_train[col] = df_train[col].fillna(col_mean)
        print(f"\nFilled NaNs in '{col}' with mean: {col_mean:.2f}")

print("\ndf_train shape after filling NaNs:", df_train.shape)
print("\nNaN count per column after filling:")
print(df_train.isnull().sum())


Filled NaNs in 'total_orders' with mean: 6093.52

Filled NaNs in 'sales' with mean: 112.54

df_train shape after filling NaNs: (3205935, 32)

NaN count per column after filling:
unique_id                           0
date                                0
warehouse                           0
total_orders                        0
sales                               0
sell_price_main                     0
availability                        0
type_0_discount                     0
type_1_discount                     0
type_2_discount                     0
type_3_discount                     0
type_4_discount                     0
type_5_discount                     0
type_6_discount                     0
date_holiday_name                   0
date_holiday_flag                   0
date_shops_closed_flag              0
date_winter_school_holidays_flag    0
date_school_holidays_flag           0
date_days_to_next_holiday           0
date_days_to_shops_closed           0
date_day_after_closed_d

## Feature and Target Separation

In [10]:
df_tr_corr = df_train.copy()

In [11]:
featurex = df_tr_corr.drop(['sales'], axis=1)
featurey = df_tr_corr[['sales']]
print("featurex", featurex.shape)
print("featurey", featurey.shape)
print('-------------------------------------------------------------------------')

featurex (3205935, 31)
featurey (3205935, 1)
-------------------------------------------------------------------------


## Feature Encoding for Mutual Information (MI) Analysis

**Categorical Encoding:** 

The OrdinalEncoder is used to convert categorical string columns (`warehouse, date_holiday_name, and product categories`) into integers. Ordinal encoding assigns a unique integer to each unique category. While this method implies an order, which may not exist, it's a simple way to make the data model-compatible.

In [12]:
# Get a list of column names with string data type
string_columns = df_tr_corr.select_dtypes(include=['object']).columns.tolist() 

print(string_columns)  # Output: ['name', 'city', 'country']

['warehouse', 'date_holiday_name', 'name', 'L1_category_name_en', 'L2_category_name_en', 'L3_category_name_en', 'L4_category_name_en']


In [13]:
encoder = OrdinalEncoder()
featurex[['warehouse', 'date_holiday_name', 'name', 'L1_category_name_en', 'L2_category_name_en', 'L3_category_name_en', 'L4_category_name_en']] = encoder.fit_transform(featurex[['warehouse', 'date_holiday_name', 'name', 'L1_category_name_en', 'L2_category_name_en', 'L3_category_name_en', 'L4_category_name_en']])
featurex

Unnamed: 0,unique_id,date,warehouse,total_orders,sell_price_main,availability,type_0_discount,type_1_discount,type_2_discount,type_3_discount,...,date_second_closed_day,date_day_after_two_closed_days,date_year,product_unique_id,name,L1_category_name_en,L2_category_name_en,L3_category_name_en,L4_category_name_en,weight
0,4845,2024-03-10,1.0,6436.0,646.26,1.00,0.00000,0.0,0.0,0.0,...,0,0,2024,2375,665.0,0.0,2.0,52.0,0.0,1.925596
1,4845,2021-05-25,1.0,4663.0,455.96,1.00,0.00000,0.0,0.0,0.0,...,0,1,2021,2375,665.0,0.0,2.0,52.0,0.0,1.925596
2,4845,2021-12-20,1.0,6507.0,455.96,1.00,0.00000,0.0,0.0,0.0,...,0,0,2021,2375,665.0,0.0,2.0,52.0,0.0,1.925596
3,4845,2023-04-29,1.0,5463.0,646.26,0.96,0.20024,0.0,0.0,0.0,...,0,0,2023,2375,665.0,0.0,2.0,52.0,0.0,1.925596
4,4845,2022-04-01,1.0,5997.0,486.41,1.00,0.00000,0.0,0.0,0.0,...,0,0,2022,2375,665.0,0.0,2.0,52.0,0.0,1.925596
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3205930,833,2020-08-06,4.0,5380.0,23.81,1.00,0.00000,0.0,0.0,0.0,...,0,0,2020,435,1005.0,1.0,25.0,92.0,10.0,2.438045
3205931,833,2023-03-13,4.0,10032.0,36.10,0.21,0.00000,0.0,0.0,0.0,...,0,0,2023,435,1005.0,1.0,25.0,92.0,10.0,2.438045
3205932,833,2022-08-26,4.0,8686.0,36.10,0.34,0.00000,0.0,0.0,0.0,...,0,0,2022,435,1005.0,1.0,25.0,92.0,10.0,2.438045
3205933,833,2022-10-11,4.0,9043.0,25.26,1.00,0.00000,0.0,0.0,0.0,...,0,0,2022,435,1005.0,1.0,25.0,92.0,10.0,2.438045


**Cyclical Feature Encoding:** 

The `encode_datetime` function transforms the date column. It extracts basic features like year, month, and day. More importantly, it uses sine and cosine transformations to create `sin_month, cos_month, sin_day, and cos_day` features. This is based on the theory that time-based data (`months or days of the week`) is cyclical, with no beginning or end. Representing them as a single number (e.g., month 12 is followed by month 1) can mislead a model. Sine and cosine transformations map these values to a continuous circle, correctly capturing the cyclical nature and allowing the model to understand the relationship between months or days at the start and end of a cycle.

In [14]:
def encode_datetime(df, datetime_column):
  """
  Encodes datetime features in a pandas DataFrame.

  Args:
    df: The pandas DataFrame containing the datetime column.
    datetime_column: The name of the datetime column in the DataFrame.

  Returns:
    pandas.DataFrame: The DataFrame with encoded datetime features.  
  """

  df[datetime_column] = pd.to_datetime(df[datetime_column]) 

  # Extract features
  df['year'] = df[datetime_column].dt.year
  df['month'] = df[datetime_column].dt.month
  df['day'] = df[datetime_column].dt.day

  # Create cyclical features (optional)
  df['sin_month'] = np.sin(2 * np.pi * df['month'] / 12)
  df['cos_month'] = np.cos(2 * np.pi * df['month'] / 12)
  df['sin_day'] = np.sin(2 * np.pi * df['day'] / 31)
  df['cos_day'] = np.cos(2 * np.pi * df['day'] / 31)

  # Drop the original datetime column (optional)
  df.drop(datetime_column, axis=1, inplace=True)

  return df

# Example usage:
# Assuming 'df' is your DataFrame and 'date_time' is the name of your datetime column
featurex = encode_datetime(featurex, 'date')
featurex

Unnamed: 0,unique_id,warehouse,total_orders,sell_price_main,availability,type_0_discount,type_1_discount,type_2_discount,type_3_discount,type_4_discount,...,L3_category_name_en,L4_category_name_en,weight,year,month,day,sin_month,cos_month,sin_day,cos_day
0,4845,1.0,6436.0,646.26,1.00,0.00000,0.0,0.0,0.0,0.15312,...,52.0,0.0,1.925596,2024,3,10,1.000000e+00,6.123234e-17,0.897805,-0.440394
1,4845,1.0,4663.0,455.96,1.00,0.00000,0.0,0.0,0.0,0.15025,...,52.0,0.0,1.925596,2021,5,25,5.000000e-01,-8.660254e-01,-0.937752,0.347305
2,4845,1.0,6507.0,455.96,1.00,0.00000,0.0,0.0,0.0,0.15025,...,52.0,0.0,1.925596,2021,12,20,-2.449294e-16,1.000000e+00,-0.790776,-0.612106
3,4845,1.0,5463.0,646.26,0.96,0.20024,0.0,0.0,0.0,0.15312,...,52.0,0.0,1.925596,2023,4,29,8.660254e-01,-5.000000e-01,-0.394356,0.918958
4,4845,1.0,5997.0,486.41,1.00,0.00000,0.0,0.0,0.0,0.15649,...,52.0,0.0,1.925596,2022,4,1,8.660254e-01,-5.000000e-01,0.201299,0.979530
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3205930,833,4.0,5380.0,23.81,1.00,0.00000,0.0,0.0,0.0,0.00000,...,92.0,10.0,2.438045,2020,8,6,-8.660254e-01,-5.000000e-01,0.937752,0.347305
3205931,833,4.0,10032.0,36.10,0.21,0.00000,0.0,0.0,0.0,0.00000,...,92.0,10.0,2.438045,2023,3,13,1.000000e+00,6.123234e-17,0.485302,-0.874347
3205932,833,4.0,8686.0,36.10,0.34,0.00000,0.0,0.0,0.0,0.00000,...,92.0,10.0,2.438045,2022,8,26,-8.660254e-01,-5.000000e-01,-0.848644,0.528964
3205933,833,4.0,9043.0,25.26,1.00,0.00000,0.0,0.0,0.0,0.00000,...,92.0,10.0,2.438045,2022,10,11,-8.660254e-01,5.000000e-01,0.790776,-0.612106


## Mutual Information (MI) Analysis

The step is to perform a correlation analysis to understand the relationship between features and the target variable (`sales`). The code uses `mutual_info_regression`, which calculates Mutual Information (MI) scores.

**Theory: Mutual Information vs. Correlation**

Traditional correlation metrics like Pearson's correlation coefficient measure only linear relationships between two variables. Mutual Information (MI) is a more powerful and general concept from information theory. It measures the reduction in uncertainty about one variable given the value of another. In simpler terms, it quantifies the dependency between two variables, regardless of whether the relationship is linear or non-linear.

- **High MI Score:** A high score indicates that knowing the value of a feature significantly reduces the uncertainty about the value of the target.
- **Low MI Score:** A low score indicates that the feature has little to no predictive power on its own.

The output for Aggregated Mutual Information Scores provides a ranked list of features. It confirms `that total_orders, unique_id, and name` are highly relevant, as they have the highest MI scores. Conversely, features like different discount types and certain holiday flags have very low scores, indicating a weak relationship with sales. This analysis is crucial for feature selection, allowing the data scientist to focus on the most informative variables and potentially discard irrelevant ones to improve model performance and reduce training time.

In [15]:
# Flatten featurey to a one-dimensional format
#y_flattened = featurey.values.ravel()  # Converts to 1D array

# For demonstration, let's print the flattened featurey
#print("Flattened featurey:", y_flattened)

# Calculate mutual information scores for each feature with respect to each target
X, y = featurex, featurey
def calculate_mi_for_target(target_index):
    return mutual_info_regression(X, y.iloc[:, target_index], random_state=42)

# Step 3: Run parallelized MI calculations on the original y (multi-dimensional)
mi_scores = Parallel(n_jobs=-1)(delayed(calculate_mi_for_target)(i) for i in range(y.shape[1]))

# Step 4: Convert the list of scores to a DataFrame
mi_scores_df = pd.DataFrame(mi_scores, columns=featurex.columns, index=featurey.columns)
print("\nMutual Information Scores for each feature with respect to each target:")
mi_scores_df

# Step 5: Aggregate scores across all targets (e.g., by averaging)
aggregated_mi_scores = mi_scores_df.mean(axis=0).sort_values(ascending=False)
print("\nAggregated Mutual Information Scores:")
aggregated_mi_scores


Mutual Information Scores for each feature with respect to each target:

Aggregated Mutual Information Scores:


total_orders                        1.237632
unique_id                           0.870261
name                                0.738922
product_unique_id                   0.721629
weight                              0.568456
sell_price_main                     0.300177
date_days_to_shops_closed           0.280129
L3_category_name_en                 0.220470
date_days_to_next_holiday           0.212774
L2_category_name_en                 0.097007
L4_category_name_en                 0.077303
day                                 0.069964
sin_day                             0.069611
warehouse                           0.066437
cos_day                             0.035973
month                               0.033994
type_0_discount                     0.030659
date_holiday_name                   0.030613
L1_category_name_en                 0.025312
type_6_discount                     0.019307
type_4_discount                     0.018949
availability                        0.018790
sin_month 

In [16]:
mi_scores_df.T.describe()

Unnamed: 0,sales
count,37.0
mean,0.157895
std,0.291355
min,4.1e-05
25%,0.005094
50%,0.025312
75%,0.097007
max,1.237632


In [17]:
aggregated_mi_scores.describe()

count    37.000000
mean      0.157895
std       0.291355
min       0.000041
25%       0.005094
50%       0.025312
75%       0.097007
max       1.237632
dtype: float64

In [18]:
mi_scores_df

Unnamed: 0,unique_id,warehouse,total_orders,sell_price_main,availability,type_0_discount,type_1_discount,type_2_discount,type_3_discount,type_4_discount,...,L3_category_name_en,L4_category_name_en,weight,year,month,day,sin_month,cos_month,sin_day,cos_day
sales,0.870261,0.066437,1.237632,0.300177,0.01879,0.030659,4.1e-05,0.005094,0.000112,0.018949,...,0.22047,0.077303,0.568456,0.017409,0.033994,0.069964,0.018741,0.017237,0.069611,0.035973


# Feature engineering

The step focuses on preparing a time-series dataset for a machine learning model, specifically addressing missing values, feature selection, and data scaling.

## Missing Value Imputation

**Missing Value Imputation**

The code first checks for and then handles missing values (NaNs) in the total_orders and sales columns of both the `df_train` and `df_test` dataframes. It iterates through all numerical columns and fills any NaN values with the mean of that column. This is a simple and standard method of imputation.

**Theory: Data Integrity and Imputation**

Machine learning models require complete data. Missing values can cause models to fail or produce biased results. Imputation is the process of filling in these gaps. Using the mean is a common technique that maintains the central tendency of the data. The output confirms that this step successfully fills all missing values, making the data ready for further processing.

### Check missing value in dataset

In [19]:
A = df_train.isnull().sum()
B = df_test.isnull().sum()
print("NaN value in train dataset")
print(A)
print("-"*100)
print("NaN value in test dataset")
print(B)

NaN value in train dataset
unique_id                           0
date                                0
warehouse                           0
total_orders                        0
sales                               0
sell_price_main                     0
availability                        0
type_0_discount                     0
type_1_discount                     0
type_2_discount                     0
type_3_discount                     0
type_4_discount                     0
type_5_discount                     0
type_6_discount                     0
date_holiday_name                   0
date_holiday_flag                   0
date_shops_closed_flag              0
date_winter_school_holidays_flag    0
date_school_holidays_flag           0
date_days_to_next_holiday           0
date_days_to_shops_closed           0
date_day_after_closed_day           0
date_second_closed_day              0
date_day_after_two_closed_days      0
date_year                           0
product_unique_id      

### Fill missing value

In [20]:
# Fill NaN values in df_train with the mean of each numerical column in df_train
for col in df_train.select_dtypes(include=np.number).columns:
    if df_train[col].isnull().any(): # Check if there are any NaN values in the column
        mean_val_train = df_train[col].mean()
        df_train[col] = df_train[col].fillna(mean_val_train) # Removed inplace=True and assigned back
        print(f"Filled NaN in df_train['{col}'] with mean: {mean_val_train:.2f}")

# Fix: Assign the result of fillna back to the column
# Fill NaN values in df_test with the mean of each numerical column in df_test
for col in df_test.select_dtypes(include=np.number).columns:
    if df_test[col].isnull().any(): # Check if there are any NaN values in the column
        mean_val_test = df_test[col].mean()
        df_test[col] = df_test[col].fillna(mean_val_test) # Removed inplace=True and assigned back
        print(f"Filled NaN in df_test['{col}'] with mean: {mean_val_test:.2f}")

print("\n--- After NaN Imputation ---")
print("df_train info:")
df_train.info()
print("\ndf_test info:")
df_test.info()

Filled NaN in df_test['total_orders'] with mean: 5604.52
Filled NaN in df_test['sales'] with mean: 91.75

--- After NaN Imputation ---
df_train info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3205935 entries, 0 to 3205934
Data columns (total 32 columns):
 #   Column                            Dtype         
---  ------                            -----         
 0   unique_id                         int64         
 1   date                              datetime64[ns]
 2   warehouse                         object        
 3   total_orders                      float64       
 4   sales                             float64       
 5   sell_price_main                   float64       
 6   availability                      float64       
 7   type_0_discount                   float64       
 8   type_1_discount                   float64       
 9   type_2_discount                   float64       
 10  type_3_discount                   float64       
 11  type_4_discount               

## Sorting the Data by Date

The `sort_dataframe_by_date` function sorts both the training and testing dataframes chronologically based on their date column.

**Theory: Time-Series Data Order**

In time-series analysis, the order of data is paramount. Sorting ensures that all subsequent operations, such as creating sequences or time-based features, are performed on a correctly ordered dataset. This is a vital step to avoid data leakage and to accurately model temporal dependencies.

In [21]:
def sort_dataframe_by_date(df, date_column):
  """
  Sorts a pandas DataFrame by a specified date column in ascending order.

  Args:
    df: The pandas DataFrame to be sorted.
    date_column: The name of the date column in the DataFrame.

  Returns:
    pandas.DataFrame: The sorted DataFrame.
  """

  # Ensure the date column is in datetime format
  df[date_column] = pd.to_datetime(df[date_column])

  # Sort the DataFrame by the date column in ascending order
  df = df.sort_values(by=date_column) 

  return df

# Example usage:
# Assuming 'df' is your DataFrame and the date column is named 'date'
df_train = sort_dataframe_by_date(df_train, 'date')
df_test = sort_dataframe_by_date(df_test, 'date')

In [22]:
print("df_train shape before dropping columns:", df_train.shape)
print("df_test shape before dropping columns:", df_test.shape)

df_train shape before dropping columns: (3205935, 32)
df_test shape before dropping columns: (801484, 32)


## Feature Selection

The code explicitly drops a list of "noise columns" from both the training and testing dataframes. These columns are likely identified as having low predictive power from a previous correlation analysis (e.g., the mutual information step in the previous code block), or are considered to be redundant.

**Theory: The Curse of Dimensionality**

Dropping features is a form of dimensionality reduction. Including too many features, especially those that are irrelevant or redundant, can lead to the "curse of dimensionality." This can make models slower to train, more prone to overfitting, and harder to interpret. By removing low-impact features, the code streamlines the dataset, which can improve model performance and efficiency.

In [23]:
columns_to_drop = [
    'type_2_discount',
    'date_holiday_flag',
    'date_school_holidays_flag',
    'date_shops_closed_flag',
    'date_second_closed_day',
    'date_winter_school_holidays_flag',
    'date_day_after_closed_day',
    'date_day_after_two_closed_days',
    'type_5_discount',
    'type_3_discount',
    'type_1_discount',
    'unique_id',
    "availability" 
]

# Drop columns from df_train
df_train = df_train.drop(columns=columns_to_drop, axis=1, errors='ignore')

# Drop columns from df_test
df_test = df_test.drop(columns=columns_to_drop, axis=1, errors='ignore')

print("df_train shape after dropping columns:", df_train.shape)
print("df_test shape after dropping columns:", df_test.shape)

df_train shape after dropping columns: (3205935, 19)
df_test shape after dropping columns: (801484, 19)


## Data Preprocessing for Modeling

The `preprocess_data` function consists of several critical steps to transform the data into a format suitable for a machine learning model:

- **Separation of Features, Target, and Weights:** It separates the data into input features (`x_train, x_test`), the target variable (`y_train, y_test`), and a weight column for each set. The weight column is likely used in the model's loss function to give more importance to certain data points.
- **Datetime Feature Extraction:** It extracts simple numerical features (month and day) from the date column and then drops the original column. This is an alternative to the cyclical encoding from the previous block, but still makes temporal information usable by the model.
- **Categorical Encoding:** It uses `OrdinalEncoder` to convert categorical text columns (like warehouse or name) into numerical representations. This is necessary because models cannot directly process string data.
- **Numerical Scaling:** It uses `StandardScaler` to normalize the numerical features. Standardization transforms the data to have a mean of 0 and a standard deviation of 1. This is a crucial step for many algorithms (especially those based on gradient descent, like neural networks) because it ensures that no single feature dominates the learning process due to its magnitude. The scaler fitted on the training data is then applied to the test data to prevent data leakage from the test set.

In [24]:
def preprocess_data(df_train, df_test, target_col):
    """
    Preprocesses df_train (fit and transform) and df_test (transform) for training and testing.

    Args:
        df_train: pandas DataFrame for training.
        df_test: pandas DataFrame for testing.
        target_col: Name of the target column (e.g., 'sales').

    Returns:
        x_train: Training features (float32 NumPy array).
        y_train: Training targets (float32 NumPy array).
        weight_train: Training weights (float32 NumPy array).
        x_test: Test features (float32 NumPy array).
        y_test: Test targets (float32 NumPy array).
        weight_test: Test weights (float32 NumPy array).
    """
    # Extract target and weights
    y_train = df_train[target_col].values.astype(np.float32)
    y_test = df_test[target_col].values.astype(np.float32)
    weight_train = df_train['weight'].values.astype(np.float32)
    weight_test = df_test['weight'].values.astype(np.float32)

    # Extract features
    x_train = df_train.drop([target_col, 'weight'], axis=1).copy()
    x_test = df_test.drop([target_col, 'weight'], axis=1).copy()

    # Handle datetime columns
    datetime_cols = x_train.select_dtypes(include=['datetime']).columns
    if len(datetime_cols) > 0:
        for col in datetime_cols:
            x_train[col + '_month'] = x_train[col].dt.month
            x_train[col + '_day'] = x_train[col].dt.day
            x_test[col + '_month'] = x_test[col].dt.month
            x_test[col + '_day'] = x_test[col].dt.day
        x_train = x_train.drop(datetime_cols, axis=1)
        x_test = x_test.drop(datetime_cols, axis=1)

    # Define categorical and numerical columns
    categorical_cols = x_train.select_dtypes(include=['object', 'category']).columns
    numeric_cols = x_train.select_dtypes(include=['number']).columns

    # Encode categorical features
    encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    if len(categorical_cols) > 0:
        x_train[categorical_cols] = encoder.fit_transform(x_train[categorical_cols])
        x_test[categorical_cols] = encoder.transform(x_test[categorical_cols])

    # Scale numerical features
    scaler = StandardScaler()
    if len(numeric_cols) > 0:
        x_train[numeric_cols] = scaler.fit_transform(x_train[numeric_cols])
        x_test[numeric_cols] = scaler.transform(x_test[numeric_cols])

    # Save encoder and scaler
    joblib.dump(encoder, 'encoder.pkl')
    joblib.dump(scaler, 'scaler.pkl')

    # Convert to NumPy arrays
    x_train = x_train.values.astype(np.float32)
    x_test = x_test.values.astype(np.float32)

    return x_train, y_train, weight_train, x_test, y_test, weight_test

# Preprocess data
x_train, y_train, weight_train, x_test, y_test, weight_test = preprocess_data(df_train, df_test, 'sales')

print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

x_train shape: (3205935, 18)
x_test shape: (801484, 18)
y_train shape: (3205935,)
y_test shape: (801484,)


## Time-Series Dataset Creation

The final part of the code prepares the data for a time-series model (likely a recurrent neural network, such as an LSTM).

- **Validation Split:** It splits the training data (`x_train, y_train`) further into a training set and a validation set. This is a standard practice for hyperparameter tuning and model evaluation during training. This split is also done chronologically to maintain time-series integrity.
- **Target Scaling:** The sales target variable is also scaled using StandardScaler. This is common in regression problems to stabilize training and improve convergence.
- **Sequential Data Generation:** The `create_dataset` function uses `tf.keras.utils.timeseries_dataset_from_array` to convert the flat data arrays into sequences. The `sequence_length` parameter specifies how many consecutive time steps a model will see as input to predict the next time step. This is the core of time-series sequence modeling. The function also aligns the sample weights to the target values within each sequence, ensuring the model's loss is correctly weighted.

In [25]:
# Hyperparameters
SEQUENCE_LENGTH = 100
BATCH_SIZE = 256

# Preprocessing: Split data and scale targets
val_split = 0.2
val_size = int(len(x_train) * val_split)
x_val, y_val, weights_val = x_train[-val_size:], y_train[-val_size:], weight_train[-val_size:]
x_train, y_train, weights_train = x_train[:-val_size], y_train[:-val_size], weight_train[:-val_size]

# Cast to float32
x_train = x_train.astype(np.float32)
x_val = x_val.astype(np.float32)
x_test = x_test.astype(np.float32)
weights_train = weights_train.astype(np.float32)
weights_val = weights_val.astype(np.float32)
weight_test = weight_test.astype(np.float32)

# Scale targets
scaler_y = StandardScaler()
y_train = scaler_y.fit_transform(y_train.reshape(-1, 1)).flatten().astype(np.float32)
y_val = scaler_y.transform(y_val.reshape(-1, 1)).flatten().astype(np.float32)
y_test = scaler_y.transform(y_test.reshape(-1, 1)).flatten().astype(np.float32)
joblib.dump(scaler_y, 'scaler_y.pkl')

# Function to create time-series dataset with sample weights
def create_dataset(features, targets, weights, sequence_length, batch_size):
    """
    Creates a time-series dataset with features, targets, and sample weights.

    Args:
        features (np.ndarray): Input features (float32).
        targets (np.ndarray): Target values (float32).
        weights (np.ndarray): Sample weights (float32).
        sequence_length (int): Length of each sequence.
        batch_size (int): Batch size.

    Returns:
        tf.data.Dataset: Dataset yielding (inputs, targets, sample_weights).
        np.ndarray: Aligned weights.
    """
    dataset = tf.keras.utils.timeseries_dataset_from_array(
        data=features,
        targets=targets,
        sequence_length=sequence_length,
        batch_size=batch_size
    )
    aligned_weights = weights[sequence_length - 1:] if len(weights) > sequence_length else weights
    dataset = dataset.map(lambda x, y: (x, y, tf.gather(aligned_weights, tf.range(tf.shape(y)[0]), axis=0)))
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    return dataset, aligned_weights

# Create datasets
dataset_train, train_weights = create_dataset(x_train, y_train, weights_train, SEQUENCE_LENGTH, BATCH_SIZE)
dataset_val, val_weights = create_dataset(x_val, y_val, weights_val, SEQUENCE_LENGTH, BATCH_SIZE)
dataset_test, test_weights = create_dataset(x_test, y_test, weight_test, SEQUENCE_LENGTH, BATCH_SIZE)

# Modeling

## Model Architecture, model building, and model training - wavenet with transformer model

The model is a hybrid architecture that combines the strengths of WaveNet and Transformer networks. The code also defines custom loss functions and metrics tailored to the specific problem, and includes advanced training techniques.

**Model Architecture and Theory**

The model's design is a layered, multi-component structure.

- **WaveNet Blocks:** The initial part of the model uses WaveNet blocks, which are based on dilated causal convolutions.
  - **Causal Convolutions:** In time-series forecasting, a model should only use past data to predict the future. Causal convolutions ensure this by restricting the receptive field to only look backward in time. The output at a given time step is a function of the input at that time step and all previous time steps, but not future ones.
  - **Dilated Convolutions:** Dilated convolutions skip input values with a certain step or "dilation rate." This allows the network's receptive field to grow exponentially with each layer, enabling it to efficiently capture long-range dependencies in the sequence without a significant increase in computational cost.
  - **Gated Activation:** The tanh and sigmoid activations work together as a gating mechanism. The sigmoid acts as a gate, determining which information from the tanh activation should be passed through. This helps the network selectively propagate relevant information through the layers.

- **Transformer Encoder Layers:** Following the WaveNet blocks, the model uses two Transformer encoder layers.
  - **Multi-Head Self-Attention:** This is the core of the Transformer. It allows the model to weigh the importance of all other parts of the input sequence when processing a specific part. For time-series, this means the model can learn complex, non-linear dependencies across the entire history, not just local patterns. The "multi-head" aspect means this is done in parallel, allowing the model to focus on different aspects of the sequence simultaneously.

- **Final Layers:** The output of the hybrid network is passed through a Multi-Layer Perceptron (MLP) with dense layers and dropout for regularization.
  - **Dropout:** A regularization technique that randomly sets a fraction of neurons to zero during training. This prevents overfitting by forcing the network to learn more robust features that are not dependent on specific neurons.
  - **Custom ClipLayer:** The final output is passed through a ClipLayer that constrains the prediction to a specific range (-10.0 to 10.0). This is a practical step to ensure the model's predictions remain within a plausible range, which is especially useful for normalized target values.

**Custom Loss and Training**

The model is compiled with a custom loss function and a dynamic learning rate schedule, highlighting a tailored approach to the problem.

**Weighted Mean Absolute Error (WMAE) Loss:** The `custom_wmae_loss` function computes the mean absolute error, but with an added `sample_weight` parameter.
  - **Weighted Loss:** A weighted loss function is used when some data points are considered more important than others. By assigning higher weights to certain samples, the model is penalized more heavily for errors on those samples, forcing it to prioritize them during training. This is a crucial technique for imbalanced datasets or when certain predictions have a higher business impact. The custom `WeightedMAEMetric` tracks this same weighted error during evaluation.

**Learning Rate Schedule:** The Adam optimizer is used with a CosineDecay learning rate schedule and a warmup phase.
  - **Dynamic Learning Rate:** Instead of using a fixed learning rate, this schedule starts with a low learning rate (warmup), gradually increases it, and then slowly decreases it following a cosine function. This helps the model converge more efficiently and find better optima, as a high learning rate early in training can cause instability, while a lower rate at the end allows for fine-tuning.

**Training Process**

The model is trained using the `.fit()` method.

- `model.fit()`: This function takes the training and validation datasets as input.
- `callbacks:` An EarlyStopping callback is used. This is a form of regularization that monitors the validation loss and stops training if it stops improving for a specified number of epochs (patience). It also restores the best model weights, ensuring the final model is the one with the best performance on the validation set, preventing further overfitting.

In [None]:
# Hyperparameters
SEQUENCE_LENGTH = 100
BATCH_SIZE = 256
EPOCHS = 1
BASE_LR = 3e-4
DROPOUT_RATE = 0.3
FILTERS = 256
KERNEL_SIZE = 3
DILATION_RATES = [1, 2, 4, 8, 16, 32]

# Custom WMAE Loss Function
def custom_wmae_loss(y_true, y_pred, sample_weight=None):
    """
    Weighted Mean Absolute Error loss function following evaluation criteria.
    This version strictly interprets zero sample weights as zero contribution
    and handles cases where the sum of weights is zero to prevent NaN.
    """
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    if sample_weight is None:
        sample_weight = tf.ones_like(y_true, dtype=tf.float32)
    else:
        sample_weight = tf.cast(sample_weight, tf.float32)
    weighted_error = tf.abs(y_true - y_pred) * sample_weight
    sum_of_weights = tf.reduce_sum(sample_weight)
    return tf.cond(
        tf.greater(sum_of_weights, 0),
        lambda: tf.reduce_sum(weighted_error) / sum_of_weights,
        lambda: tf.constant(0.0, dtype=tf.float32)
    )

# Custom WMAE Metric
class WeightedMAEMetric(Metric):
    """
    Custom metric to compute Weighted Mean Absolute Error following evaluation criteria.
    This version strictly interprets zero sample weights as zero contribution
    and handles cases where the sum of weights is zero to prevent NaN.
    """
    def __init__(self, name='wmae', **kwargs):
        super(WeightedMAEMetric, self).__init__(name=name, **kwargs)
        self.total_weighted_error = self.add_weight(name='total_weighted_error', initializer='zeros')
        self.total_weights = self.add_weight(name='total_weights', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.cast(y_pred, tf.float32)
        if sample_weight is None:
            sample_weight = tf.ones_like(y_true, dtype=tf.float32)
        else:
            sample_weight = tf.cast(sample_weight, tf.float32)
        weighted_error = tf.abs(y_true - y_pred) * sample_weight
        self.total_weighted_error.assign_add(tf.reduce_sum(weighted_error))
        self.total_weights.assign_add(tf.reduce_sum(sample_weight))

    def result(self):
        return tf.cond(
            tf.greater(self.total_weights, 0),
            lambda: self.total_weighted_error / self.total_weights,
            lambda: tf.constant(0.0, dtype=tf.float32)
        )

    def reset_state(self):
        self.total_weighted_error.assign(0.0)
        self.total_weights.assign(0.0)

# Learning rate scheduler with warmup
# Note: This assumes y_train is defined elsewhere; adjust accordingly if needed
num_train_sequences = max(0, len(y_train) - SEQUENCE_LENGTH + 1) if 'y_train' in globals() else 10000  # Placeholder value
total_steps = EPOCHS * (num_train_sequences // BATCH_SIZE) if BATCH_SIZE > 0 else 0
warmup_steps = min(1000, total_steps // 10)
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=BASE_LR,
    decay_steps=max(1, total_steps - warmup_steps),
    alpha=0.1
)
optimizer = Adam(learning_rate=lr_schedule)

# WaveNet Block
def wavenet_block(x, dilation_rate, filters, kernel_size, dropout_rate):
    """
    WaveNet block with dilated convolutions, gating, and residual/skip connections.
    """
    conv_filter = Conv1D(filters, kernel_size, padding='causal', dilation_rate=dilation_rate, activation='tanh')(x)
    conv_gate = Conv1D(filters, kernel_size, padding='causal', dilation_rate=dilation_rate, activation='sigmoid')(x)
    gated_output = Multiply()([conv_filter, conv_gate])
    gated_output = BatchNormalization()(gated_output)
    residual = Conv1D(filters, 1, padding='same')(gated_output)
    skip = Conv1D(filters, 1, padding='same')(gated_output)
    if x.shape[-1] != filters:
        x = Conv1D(filters, 1, padding='same')(x)
    return Add()([x, residual]), skip

# Transformer Encoder Layer
def transformer_encoder(x, num_heads=4, ff_dim=128, dropout_rate=0.3):
    """
    Transformer encoder layer with multi-head attention and feed-forward network.
    """
    attn_output = MultiHeadAttention(num_heads=num_heads, key_dim=ff_dim)(x, x)
    attn_output = Dropout(dropout_rate)(attn_output)
    x = Add()([x, attn_output])
    x = LayerNormalization(epsilon=1e-6)(x)
    ff_output = Dense(ff_dim, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.005))(x)
    ff_output = Dropout(dropout_rate)(ff_output)
    ff_output = Dense(x.shape[-1])(ff_output)
    x = Add()([x, ff_output])
    x = LayerNormalization(epsilon=1e-6)(x)
    return x

# Custom Layer for Clipping
class ClipLayer(tf.keras.layers.Layer):
    def __init__(self, min_value=-10.0, max_value=10.0, **kwargs):
        super(ClipLayer, self).__init__(**kwargs)
        self.min_value = min_value
        self.max_value = max_value

    def call(self, inputs):
        return tf.clip_by_value(inputs, self.min_value, self.max_value)

# Build WaveNet + Transformer Model
num_features = x_train.shape[1] if 'x_train' in globals() else 19  # Placeholder value
inputs = Input(shape=(SEQUENCE_LENGTH, num_features), name="input")
x = inputs
skip_connections = []

# WaveNet blocks
for dilation_rate in DILATION_RATES:
    x, skip = wavenet_block(x, dilation_rate, FILTERS, KERNEL_SIZE, DROPOUT_RATE)
    skip_connections.append(skip)

x = Add()(skip_connections)
x = Activation('relu')(x)
x = BatchNormalization()(x)

# Transformer encoder layers
x = transformer_encoder(x, num_heads=4, ff_dim=128, dropout_rate=0.3)
x = transformer_encoder(x, num_heads=4, ff_dim=128, dropout_rate=0.3)

# Final layers
x = Conv1D(FILTERS, 1, activation='relu')(x)
x = GlobalAveragePooling1D()(x)
x = Dense(256, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.005))(x)
x = Dropout(DROPOUT_RATE)(x)
x = Dense(128, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.005))(x)
x = Dropout(DROPOUT_RATE)(x)
x = Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.005))(x)
x = Dropout(DROPOUT_RATE)(x)

# Apply clipping using the custom layer
outputs = ClipLayer(min_value=-10.0, max_value=10.0)(Dense(1, activation='linear', name="output", dtype='float32')(x))

# Build and compile model
model = Model(inputs=inputs, outputs=outputs)
model.compile(
    optimizer=optimizer,
    loss=custom_wmae_loss,
    metrics=['mae'],
    weighted_metrics=[WeightedMAEMetric()]
)

model.summary()

# Train the model with sample weights
print("\n--- Training Model ---")
history = model.fit(
    dataset_train,  # Ensure dataset_train is defined
    validation_data=dataset_val,  # Ensure dataset_val is defined
    epochs=EPOCHS,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=7, restore_best_weights=True)
    ],
    shuffle=False
)


--- Training Model ---


I0000 00:00:1756027840.897335     111 service.cc:148] XLA service 0x7aa6080123b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1756027840.899807     111 service.cc:156]   StreamExecutor device (0): Host, Default Version
I0000 00:00:1756027847.975538     111 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


[1m  609/10019[0m [32m━[0m[37m━━━━━━━━━━━━━━━━━━━[0m [1m17:08:57[0m 7s/step - loss: 3.3711 - mae: 0.2335 - wmae: 60.5383

**Remark that** : eaxmple of the model above is train only one epoch due to limited time of save version on Kaggle, but actually you have to set number of epoch to 200 epochs.

## Save the model

This focuses on the crucial step of **model persistence**: saving and loading a trained deep learning model. This is a fundamental practice in machine learning for deployment, version control, and reproducibility.

**Saving the Model:** The code first defines a path (`keras_model_path`) and then uses the `model.save()` method to save the trained model to this path.

**Theory : Model Persistence**

Saving a model is a critical step after training is complete. It serializes the entire model architecture, including its layers, weights, training configuration (optimizer, loss function), and the state of the optimizer. This allows you to save the result of potentially long and computationally expensive training processes. The .keras format is a modern, single-file container that bundles all these components, making it easy to share and load.

**Loading the Model:** The code then demonstrates how to load the saved model using `tf.keras.models.load_model()`.

**Theory : Custom Objects**

When loading a model that was compiled with custom components (like the `custom_wmae_loss` and `WeightedMAEMetric` defined in the previous steps), you must explicitly provide these custom objects to the `load_model` function. Keras needs to know how to rebuild these non-standard parts of the model. Without this step, the loading process would fail with an error because it wouldn't recognize the custom functions and classes.

**Verification:** The code prints a success message and then displays a summary of the loaded model using `loaded_model.summary()`. This step is a quick and effective way to verify that the model was loaded correctly, and its architecture, layer names, and output shapes match the original.

In [None]:
# --- Save the model in Keras format (.keras) ---
# Define the path for the .keras file
keras_model_path = "wavenet_transformer_model.keras"

# Save the model
model.save(keras_model_path, overwrite=True)
print(f"\nModel saved successfully to '{keras_model_path}' in Keras format.")

# --- Load the model from Keras format (.keras) ---
print(f"\n--- Loading Model from '{keras_model_path}' ---")
loaded_model = tf.keras.models.load_model(
    keras_model_path,
    custom_objects={
        'custom_wmae_loss': custom_wmae_loss,
        'WeightedMAEMetric': WeightedMAEMetric
    }
)
print("Model loaded successfully!")
loaded_model.summary()

# Model Evaluation

## Model evaluation on regression metrics

This final step performs the essential task of **evaluating a trained deep learning model** on an unseen test dataset. The evaluation is done in two stages: first on the scaled data and then, more importantly, on the unscaled (original) data to provide interpretable metrics.

**1. Evaluation on Scaled Data**

The code first evaluates the model using the `model.evaluate()` function. This leverages the loss and metrics that were defined when the model was compiled (`custom_wmae_loss` and `WeightedMAEMetric`). The evaluation is performed on the `dataset_test`, which contains sequences of preprocessed and scaled data.

**Theory: Model Metrics** 

During training, a model optimizes a loss function, but performance is typically measured using a separate set of metrics that are more interpretable. For this regression task, the key metric is **Weighted Mean Absolute Error (WMAE)**, which indicates the average magnitude of prediction errors, giving more importance to specific data points. Evaluating on scaled data provides a quick sanity check to ensure the model's performance on the test set is consistent with its validation performance during training.

**2. Unscaled Prediction and Metric Calculation**

This is the most critical part of the evaluation. While models are trained on scaled data for stability, the final performance must be measured on the original scale to be meaningful and comparable to real-world values.

**Step-by-Step Process:**

- **Iterate and Predict:** The code loops through the `dataset_test` to get batches of features, true values, and sample weights. For each batch, it uses `model.predict()` to generate predictions.
- **Handling NaNs:** It includes robust checks to handle potential NaN values in both the true values and the predictions. This is a practical step to prevent errors and ensure that the metric calculations are not corrupted.
- **Inverse Transformation:** The core concept here is to unscale the data. The `scaler_y.inverse_transform()` method is used to convert the scaled predictions and true values back to their original numerical range (e.g., from a value like 0.5 to a value like 500). This step is essential because a scaled error (e.g., 0.1) has no real-world meaning until it is converted back to the original unit (e.g., 100 sales units).
- **Metric Calculation:** Finally, the code calculates standard regression metrics—Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE)—on the unscaled data using scikit-learn functions. It also recalculates the WMAE on the unscaled data.

**Theory: The Importance of Unscaled Metrics**

- **Interpretability:** Metrics on unscaled data provide a direct, tangible understanding of the model's performance. For instance, an MAE of 10 means the model's predictions are, on average, off by 10 units of sales, which is a much more useful piece of information than a scaled MAE of 0.05.
- **Comparison:** Unscaled metrics allow for direct comparison between different models or with a simple baseline model, regardless of how each model's internal data was scaled.
- **Regression Metrics:**
  - **MAE:** The average absolute difference between predicted and actual values. It is easy to interpret and not sensitive to outliers.
  - **MSE:** The average of the squared differences. It penalizes large errors more heavily than MAE.
  - **RMSE:** The square root of MSE, which puts the metric back in the same units as the target variable, making it more interpretable than MSE.

In [None]:
# --- Evaluate the model on the test dataset ---
print("\n--- Evaluating Model on Test Set (Scaled) ---")
# model.evaluate directly uses the loss and metrics defined in model.compile
# The output will include the custom WMAE metric.
evaluation_results = model.evaluate(dataset_test)
print(f"Test Set Evaluation Results (Scaled):")
for name, value in zip(model.metrics_names, evaluation_results):
    print(f"{name}: {value:.4f}")

# --- Make predictions and calculate unscaled metrics ---
print("\n--- Making Predictions and Calculating Unscaled Metrics ---")

# Initialize lists for true values and predictions (scaled)
y_true_scaled_list, y_pred_scaled_list, sample_weights_list = [], [], []

# Iterate through the test dataset to get predictions and corresponding true values and weights
for x_batch, y_batch_scaled, weights_batch in dataset_test:
    # Handle NaNs in y_batch_scaled (true values) before extending
    y_batch_scaled_processed = y_batch_scaled.numpy()
    if np.any(np.isnan(y_batch_scaled_processed)):
        print(f"Warning: NaN values detected in true scaled targets. Replacing with 0 for metric calculation.")
        y_batch_scaled_processed = np.nan_to_num(y_batch_scaled_processed, nan=0.0)

    y_pred_batch_scaled = model.predict(x_batch, verbose=0) # Disable verbose logging

    # --- NaN Handling for predictions: Replace NaNs with 0 before further processing ---
    if np.any(np.isnan(y_pred_batch_scaled)):
        print(f"Warning: NaN values detected in model predictions. Replacing with 0 for metric calculation.")
        y_pred_batch_scaled = np.nan_to_num(y_pred_batch_scaled, nan=0.0) # Replace NaNs with 0

    y_true_scaled_list.extend(y_batch_scaled_processed) # Use processed true values
    y_pred_scaled_list.extend(y_pred_batch_scaled.flatten())
    sample_weights_list.extend(weights_batch.numpy())

# Convert lists to NumPy arrays
y_true_scaled = np.array(y_true_scaled_list)
y_pred_scaled = np.array(y_pred_scaled_list)
sample_weights_for_metrics = np.array(sample_weights_list)

# Inverse transform predictions and true values to original scale
# Apply nan_to_num after inverse_transform as well, in case scaling produces NaNs
y_true_unscaled = np.nan_to_num(scaler_y.inverse_transform(y_true_scaled.reshape(-1, 1)).flatten(), nan=0.0)
y_pred_unscaled = np.nan_to_num(scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1)).flatten(), nan=0.0)

# Ensure sample_weights_for_metrics matches the length of predictions after inverse transform
# This is crucial because timeseries_dataset_from_array might drop initial samples
# and the number of predictions will match the number of targets in the dataset.
# The `create_dataset` function already aligns `aligned_weights` to the number of sequences
# that will be generated, so `sample_weights_for_metrics` should already be correctly aligned.

# Compute evaluation metrics on unscaled data
mae_unscaled = mean_absolute_error(y_true_unscaled, y_pred_unscaled)
mse_unscaled = mean_squared_error(y_true_unscaled, y_pred_unscaled)
rmse_unscaled = np.sqrt(mse_unscaled)

# Calculate WMAE on unscaled data for verification (should be similar to scaled WMAE if scaler is linear)
# Only consider samples with non-zero weights for WMAE calculation to avoid division by zero
non_zero_weight_indices = sample_weights_for_metrics > 0
if np.sum(non_zero_weight_indices) > 0:
    wmae_unscaled = np.sum(np.abs(y_true_unscaled[non_zero_weight_indices] - y_pred_unscaled[non_zero_weight_indices]) * sample_weights_for_metrics[non_zero_weight_indices]) / np.sum(sample_weights_for_metrics[non_zero_weight_indices])
else:
    wmae_unscaled = 0.0 # Handle case where all weights are zero

# Print results
print(f"Evaluation Metrics on Test Set (Unscaled):")
print(f"MAE  : {mae_unscaled:.4f}")
print(f"MSE  : {mse_unscaled:.4f}")
print(f"RMSE : {rmse_unscaled:.4f}")
print(f"WMAE : {wmae_unscaled:.4f}")