# Competition overview

**Competition Summary**

- This is the Rohlik Sales Forecasting Challenge, a time-series forecasting competition hosted on Kaggle. The primary objective is to predict the sales volume for various inventory items across 11 different Rohlik Group warehouses for a period of 14 days.

- Accurate forecasts are vital for the e-grocery company's operations, as they directly impact supply chain efficiency, inventory management, and overall sustainability by minimizing waste.

- The model's performance will be evaluated using the Weighted Mean Absolute Error (WMAE). The specific weights for each inventory item are provided in a separate file. The competition runs from November 15, 2024, to February 15, 2025, and offers cash prizes for the top three competitors.

- The dataset includes historical sales and order data, product metadata, and a calendar with holiday information. Some features available in the training set (e.g., sales and availability) are intentionally removed from the test set, as they would not be known at the time of a real-world prediction.
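
To make the metric concrete, here is a minimal NumPy sketch of a weighted mean absolute error. The function name `wmae` and the toy numbers are illustrative only. Normalizing by the sum of the weights follows the `custom_wmae_loss` defined later in this notebook; the official evaluation may differ in detail.

```python
import numpy as np

def wmae(y_true, y_pred, weights):
    """Weighted mean absolute error: sum(w * |y - y_hat|) / sum(w)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total_weight = weights.sum()
    if total_weight <= 0:
        return 0.0  # guard against division by zero when all weights are zero
    return float((weights * np.abs(y_true - y_pred)).sum() / total_weight)

# Toy example with made-up numbers: the third item has twice the weight
score = wmae([10.0, 20.0, 30.0], [12.0, 20.0, 27.0], [1.0, 1.0, 2.0])
print(score)  # (1*2 + 1*0 + 2*3) / 4 = 2.0
```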

### [Link to competition](https://www.kaggle.com/competitions/rohlik-sales-forecasting-challenge-v2/overview)

# Data Dictionary

This data dictionary describes the files and columns provided for the competition.

**sales_train.csv and sales_test.csv**

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Column</th>
    <th class="tg-7zrl">Description</th>
    <th class="tg-7zrl">Data Type</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">unique_id</td>
    <td class="tg-7zrl">A unique identifier for a specific inventory item in a specific warehouse.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">date</td>
    <td class="tg-7zrl">The date of the sales record.</td>
    <td class="tg-7zrl">Date</td>
  </tr>
  <tr>
    <td class="tg-7zrl">warehouse</td>
    <td class="tg-7zrl">The name of the warehouse where the item is stored.</td>
    <td class="tg-7zrl">String</td>
  </tr>
  <tr>
    <td class="tg-7zrl">total_orders</td>
    <td class="tg-7zrl">The historical number of orders for the selected warehouse.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">sales</td>
    <td class="tg-7zrl">The target variable: sales volume (pcs or kg).</td>
    <td class="tg-7zrl">Float</td>
  </tr>
  <tr>
    <td class="tg-7zrl">sell_price_main</td>
    <td class="tg-7zrl">The selling price of the item.</td>
    <td class="tg-7zrl">Float</td>
  </tr>
  <tr>
    <td class="tg-7zrl">availability</td>
    <td class="tg-7zrl">The proportion of the day the item was available. A value of 1 means it was available all day.</td>
    <td class="tg-7zrl">Float</td>
  </tr>
  <tr>
    <td class="tg-7zrl">type_0_discount, type_1_discount, etc.</td>
    <td class="tg-7zrl">The percentage discount offered for various promotion types. Negative values indicate no discount.</td>
    <td class="tg-7zrl">Float</td>
  </tr>
</tbody></table>

**inventory.csv**

<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Column</th>
    <th class="tg-7zrl">Description</th>
    <th class="tg-7zrl">Data Type</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">unique_id</td>
    <td class="tg-7zrl">A unique identifier for a specific inventory item.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">product_unique_id</td>
    <td class="tg-7zrl">A unique identifier for a product, shared across all warehouses.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">name</td>
    <td class="tg-7zrl">The name of the product.</td>
    <td class="tg-7zrl">String</td>
  </tr>
  <tr>
    <td class="tg-7zrl">L1_category_name, L2_category_name, etc.</td>
    <td class="tg-7zrl">Hierarchical category names for the product. L4 is the most granular.</td>
    <td class="tg-7zrl">String</td>
  </tr>
  <tr>
    <td class="tg-7zrl">warehouse</td>
    <td class="tg-7zrl">The name of the warehouse where the inventory item is located.</td>
    <td class="tg-7zrl">String</td>
  </tr>
</tbody></table>

**calendar.csv**

<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Column</th>
    <th class="tg-7zrl">Description</th>
    <th class="tg-7zrl">Data Type</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">warehouse</td>
    <td class="tg-7zrl">The name of the warehouse.</td>
    <td class="tg-7zrl">String</td>
  </tr>
  <tr>
    <td class="tg-7zrl">date</td>
    <td class="tg-7zrl">The date of the calendar event.</td>
    <td class="tg-7zrl">Date</td>
  </tr>
  <tr>
    <td class="tg-7zrl">holiday_name</td>
    <td class="tg-7zrl">The name of the public holiday, if applicable.</td>
    <td class="tg-7zrl">String</td>
  </tr>
  <tr>
    <td class="tg-7zrl">holiday</td>
    <td class="tg-7zrl">A binary flag (0 or 1) indicating if the date is a holiday.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">shops_closed</td>
    <td class="tg-7zrl">A flag indicating a public holiday where most shops are closed.</td>
    <td class="tg-7zrl">Boolean</td>
  </tr>
  <tr>
    <td class="tg-7zrl">winter_school_holidays</td>
    <td class="tg-7zrl">A flag for winter school holidays.</td>
    <td class="tg-7zrl">Boolean</td>
  </tr>
  <tr>
    <td class="tg-7zrl">school_holidays</td>
    <td class="tg-7zrl">A flag for general school holidays.</td>
    <td class="tg-7zrl">Boolean</td>
  </tr>
</tbody></table>

**test_weights.csv**

<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Column</th>
    <th class="tg-7zrl">Description</th>
    <th class="tg-7zrl">Data Type</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">unique_id</td>
    <td class="tg-7zrl">A unique identifier for the inventory item.</td>
    <td class="tg-7zrl">Integer</td>
  </tr>
  <tr>
    <td class="tg-7zrl">weight</td>
    <td class="tg-7zrl">The weight used for calculating the Weighted Mean Absolute Error (WMAE) metric for this item.</td>
    <td class="tg-7zrl">Float</td>
  </tr>
</tbody>
</table>

### [Link to dataset](https://www.kaggle.com/competitions/rohlik-sales-forecasting-challenge-v2/data)

# Submission Pipeline Overview

**Data Preparation**

The process begins by loading the test data along with inventory, weights, and calendar information. The calendar data is enriched with custom holidays and derived features like days to next holiday. The datasets are then merged and sorted chronologically to maintain temporal integrity.

The data is then transformed to prepare it for the model. First, a noise reduction step removes low-relevance columns, followed by a preprocessing function that handles categorical encoding and numerical scaling. This function uses pre-trained `OrdinalEncoder` and `StandardScaler` objects, which ensures the test data is transformed consistently with how the training data was processed. Finally, the preprocessed data is formatted into time-series sequences of length 100, which is the required input format for the model.

**Model Loading**

After the data is prepared, the pipeline loads a pre-trained Keras model from a `.keras` file. This model uses a hybrid architecture combining WaveNet and Transformer components.

The WaveNet component, built from multiple `Conv1D` and `BatchNormalization` layers, is designed to capture temporal dependencies in the data. The Transformer component, identified by its `MultiHeadAttention` and `LayerNormalization` layers, helps the model understand long-range relationships between data points in the time series. The model also includes custom loss and metric functions, `custom_wmae_loss` and `WeightedMAEMetric`, which are essential for its specific forecasting task.

**Prediction Generation**

The process uses the previously loaded WaveNet and Transformer hybrid model to generate sales predictions for the test data. It iterates through the pre-formatted `dataset_submission` in batches, calling the `model.predict` method for each batch. The predictions from all batches are then concatenated into a single NumPy array.
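
The batch-and-concatenate pattern can be sketched without a trained model. In the snippet below, `predict_in_batches` and `dummy_predict` are hypothetical stand-ins, not functions from the notebook; `dummy_predict` just returns the last time step of the first feature so the shapes are easy to follow.

```python
import numpy as np

def predict_in_batches(predict_fn, batches):
    """Run predict_fn on each batch and stack the results along axis 0."""
    outputs = [np.asarray(predict_fn(batch)) for batch in batches]
    return np.concatenate(outputs, axis=0)

def dummy_predict(batch):
    # Stand-in for model.predict: one output value per sequence in the batch
    return batch[:, -1, :1]

# 10 sequences, 3 time steps, 2 features (made-up data)
data = np.arange(10 * 3 * 2, dtype=np.float32).reshape(10, 3, 2)
batches = [data[i:i + 4] for i in range(0, len(data), 4)]  # batch sizes 4, 4, 2

predictions = predict_in_batches(dummy_predict, batches)
print(predictions.shape)  # (10, 1): one row per input sequence, order preserved
```

With Keras, `model.predict` also accepts a whole `tf.data.Dataset`, which would give the same stacked result; looping explicitly mainly helps when memory or progress reporting is a concern.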

**Post-processing and Submission**

A crucial step is to convert the scaled predictions back into their original format. The pipeline loads the `scaler_y` object, which was used to scale the target sales data during the model's training, and applies its `inverse_transform` method to the predictions. This returns the forecast values to their unscaled, interpretable state.
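
A small sketch of the inverse scaling step follows. It assumes `scaler_y` is a scikit-learn `StandardScaler`, which the text above does not state explicitly. The scaler is fitted here on made-up sales values rather than loaded from `scaler_y.pkl`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training targets; in the notebook the fitted scaler is loaded from disk
sales_train = np.array([[10.0], [20.0], [30.0]])
scaler_y = StandardScaler().fit(sales_train)

# Model outputs live in scaled space (0.0 is the training mean)
scaled_predictions = np.array([[0.0], [1.0]])

# inverse_transform computes x * scale_ + mean_ for each column
sales_hat = scaler_y.inverse_transform(scaled_predictions)
print(sales_hat.ravel())
```

The scaler expects a 2-D array of shape `(n_samples, 1)`, so a flat prediction vector would need `reshape(-1, 1)` first.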

Finally, the unscaled predictions are prepared for submission. A new `sales_hat` column is added to the original submission dataframe, and a unique id column is created by combining the unique_id and date. The final DataFrame is then saved to a `submission.csv` file, containing only the required id and `sales_hat` columns.
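
The final formatting step might look like the sketch below. The exact `id` format (`<unique_id>_<YYYY-MM-DD>`) is an assumption and should be checked against the competition's sample submission; the rows are made up.

```python
import pandas as pd

# Made-up rows standing in for the merged submission dataframe
submission = pd.DataFrame({
    "unique_id": [1226, 1226],
    "date": pd.to_datetime(["2024-06-03", "2024-06-04"]),
    "sales_hat": [12.5, 9.0],
})

# Assumed id format: unique_id and ISO date joined by an underscore
submission["id"] = (
    submission["unique_id"].astype(str)
    + "_"
    + submission["date"].dt.strftime("%Y-%m-%d")
)

# Keep only the two columns the submission file needs
submission_out = submission[["id", "sales_hat"]]
csv_text = submission_out.to_csv(index=False)
print(csv_text)
```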

# Related Notebooks

### [Model training notebook](https://www.kaggle.com/code/misterfour/rohik-sales-forecasting-challenge)
### [Reference! (add holidays calendar of each country into dataset)](https://www.kaggle.com/competitions/rohlik-sales-forecasting-challenge-v2/overview)

# Import Libraries

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/rohik_scaler_v2/scikitlearn/default/1/scaler.pkl
/kaggle/input/rohik_encoder_v2/scikitlearn/default/1/encoder.pkl
/kaggle/input/rohik_scaler_y_v2/scikitlearn/default/1/scaler_y.pkl
/kaggle/input/rohlik-sales-forecasting-challenge-v2/calendar.csv
/kaggle/input/rohlik-sales-forecasting-challenge-v2/test_weights.csv
/kaggle/input/rohlik-sales-forecasting-challenge-v2/inventory.csv
/kaggle/input/rohlik-sales-forecasting-challenge-v2/sales_train.csv
/kaggle/input/rohlik-sales-forecasting-challenge-v2/sales_test.csv
/kaggle/input/rohlik-sales-forecasting-challenge-v2/solution.csv
/kaggle/input/wavenet_transformer_model.keras/keras/default/1/wavenet_transformer_model.keras


In [2]:
# Install TensorFlow and Keras
!pip install tensorflow==2.15.0
!pip install keras==2.15.0
!pip install scikit-learn==1.2.2
import os
from datetime import datetime

import joblib
import numpy as np
import pandas as pd
import tensorflow as tf
from joblib import Parallel, delayed
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from tensorflow.keras.layers import (
    Input, Conv1D, Multiply, Add, Dense, Dropout, LayerNormalization,
    MultiHeadAttention, GlobalAveragePooling1D, Activation, BatchNormalization
)
from tensorflow.keras.metrics import Metric
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

Collecting tensorflow==2.15.0
  Downloading tensorflow-2.15.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting ml-dtypes~=0.2.0 (from tensorflow==2.15.0)
  Downloading ml_dtypes-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting wrapt<1.15,>=1.11.0 (from tensorflow==2.15.0)
  Downloading wrapt-1.14.2-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.metadata (6.5 kB)
Collecting tensorboard<2.16,>=2.15 (from tensorflow==2.15.0)
  Downloading tensorboard-2.15.2-py3-none-any.whl.metadata (1.7 kB)
Collecting tensorflow-estimator<2.16,>=2.15.0 (from tensorflow==2.15.0)
  Downloading tensorflow_estimator-2.15.0-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting keras<2.16,>=2.15.0 (from tensorflow==2.15.0)
  Downloading keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Downloading tensorflow-2.15.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.3 MB)

2025-09-21 09:29:55.272770: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-09-21 09:29:55.272826: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-09-21 09:29:55.274566: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Load Dataset and Preprocess Dataset

The core purpose of this script is to prepare the submission data (`sales_test.csv`) to match the exact format of the training data, so it can be fed into the pre-trained WaveNet-Transformer model for prediction.

**Step-by-Step Breakdown of the Code**

**1. Data Merging and Feature Engineering**

The code starts by loading several datasets (`sales_test.csv`, `inventory.csv`, `calendar.csv`, and `test_weights.csv`) and merges them into a single `df_submission` dataframe. It also includes functions like `fill_loss_holidays` and `enrich_calendar` to add new, potentially predictive features.

Feature engineering is the process of using domain knowledge to create new features that are not explicitly present in the original dataset. The goal is to improve the performance of a machine learning model by giving it more relevant information. In this code, combining calendar data with sales data and creating features like `date_days_to_next_holiday` provides the model with crucial context about temporal patterns that would be difficult to learn from raw data alone.
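
As an illustration of a "days to next holiday" feature, here is a small sketch using `np.searchsorted` on a toy single-warehouse calendar. This is not the notebook's `enrich_calendar` implementation: it treats a holiday itself as zero days away, which may differ from the notebook's convention, and the column name `days_to_next_holiday` is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy calendar for a single warehouse
calendar = pd.DataFrame({
    "date": pd.date_range("2024-03-28", periods=6, freq="D"),
    "holiday": [0, 1, 0, 0, 1, 0],
})

holiday_dates = calendar.loc[calendar["holiday"] == 1, "date"].to_numpy()
dates = calendar["date"].to_numpy()

# Index of the first holiday on or after each date
idx = np.searchsorted(holiday_dates, dates, side="left")
has_next = idx < len(holiday_dates)

# Dates after the last known holiday stay NaN
days_to_next = np.full(len(dates), np.nan)
days_to_next[has_next] = (
    (holiday_dates[idx[has_next]] - dates[has_next]) / np.timedelta64(1, "D")
)
calendar["days_to_next_holiday"] = days_to_next
print(calendar)
```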

**2. Missing Value Imputation**

After merging the data, the code iterates through all numerical columns and fills any missing values (NaN) with the mean of that column.

**Theory: Data Imputation**

This step is a practical necessity for data integrity. Machine learning models, particularly deep neural networks, generally require complete data and cannot handle missing values. Imputation is the process of filling these gaps. Using the mean is a simple and common strategy that maintains the central tendency of the data.
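
A compact version of mean imputation on a toy frame is sketched below; the values are made up. Note that it uses the means of the frame being filled, as the notebook does. Reusing means computed on the training set would be a possible alternative, but that is not what the code in this notebook does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sell_price_main": [10.0, np.nan, 14.0],
    "total_orders": [100.0, 120.0, np.nan],
    "warehouse": ["Prague_1", "Brno_1", None],  # non-numeric, left untouched
})

numeric_cols = df.select_dtypes(include="number").columns

# DataFrame.mean() skips NaN by default, so each column is filled
# with the mean of its observed values
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
print(df)
```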

**3. Feature Selection (Dimensionality Reduction)**

The code drops a predefined list of "noise columns" (`columns_to_drop`). The list is likely the result of a previous analysis that identified these features as having low predictive power or as being highly correlated with other features.

**Theory: The Curse of Dimensionality**

Dropping features is a form of dimensionality reduction. Including too many features can lead to the "curse of dimensionality," making models slower to train and more prone to overfitting. Features are typically dropped if they have:

- **Low Variance:** They don't change much, so they provide little predictive information.
- **High Correlation:** They are redundant with other features, and a model only needs one of them.
- **Low Importance:** A feature importance analysis (such as from a random forest or a linear model) shows they have a negligible impact on the target variable.
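
The first two criteria can be screened with a few lines of pandas. The sketch below runs on synthetic data; the thresholds (`1e-8` for variance, `0.95` for absolute correlation) are arbitrary assumptions, and nothing here is claimed to be how the notebook's `columns_to_drop` list was actually produced.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.001 * rng.normal(size=200),  # nearly redundant with "a"
    "constant": np.ones(200),                           # zero variance
    "b": rng.normal(size=200),
})

# 1) Low variance: columns that barely change
low_variance = [c for c in features.columns if features[c].var() < 1e-8]

# 2) High correlation: inspect only the upper triangle so that, for each
#    correlated pair, the later column is the one flagged for removal
kept = features.drop(columns=low_variance)
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = [c for c in upper.columns if (upper[c] > 0.95).any()]

selected = kept.drop(columns=high_corr)
print(low_variance, high_corr, list(selected.columns))
```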

**4. Data Transformation and Scaling**

The `preprocess_data` function is used to transform the data to match the format expected by the model. This is a crucial step that loads the same `OrdinalEncoder` and `StandardScaler` objects used during the training phase.

**Theory: Preventing Data Leakage**

This is one of the most important concepts in the entire pipeline: the test data must be transformed with the same encoder and scaler that were fitted on the training data. If a new `StandardScaler` were fit on the test data, it would learn the mean and standard deviation of the test set, so the inputs would no longer be on the scale the model was trained on, and statistics from the forecast period would leak into preprocessing. Any evaluation done this way would look unrealistically optimistic.

- **Ordinal Encoding:** Converts categorical features (like warehouse names) into numerical values that the model can process.
- **Standard Scaling:** Normalizes numerical features by transforming them to have a mean of 0 and a standard deviation of 1. This is essential for neural networks to ensure no single feature's magnitude dominates the learning process.
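
The fit-on-train, transform-on-test pattern is sketched below on toy data. The `handle_unknown="use_encoded_value"` setting is an assumption added for illustration so that an unseen category does not raise an error; the configuration of the notebook's saved encoder is not shown in this section.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

train = pd.DataFrame({
    "warehouse": ["Prague_1", "Brno_1", "Prague_1"],
    "sell_price_main": [10.0, 20.0, 30.0],
})
test = pd.DataFrame({
    "warehouse": ["Brno_1", "Munich_1"],  # Munich_1 never appears in train
    "sell_price_main": [20.0, 40.0],
})

# Fit on the training data only (unknown-category handling is an assumption)
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
scaler = StandardScaler()
encoder.fit(train[["warehouse"]])
scaler.fit(train[["sell_price_main"]])

# Apply the already-fitted transformers to the test data; never call fit here
test_encoded = encoder.transform(test[["warehouse"]])
test_scaled = scaler.transform(test[["sell_price_main"]])
print(test_encoded.ravel(), test_scaled.ravel())
```

In the notebook the fitted objects are restored with `joblib.load` instead of being refitted, which achieves the same consistency.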

**5. Sequence Creation and Padding**

The final steps prepare the data into the sequence format required by the WaveNet-Transformer model. The `x_submission` data is first padded with zeros to ensure it's long enough to create full sequences. Then, `tf.keras.utils.timeseries_dataset_from_array` is used to convert the flat data into a batched, sequential dataset.

**Theory: Time-Series Sequence Modeling**

Unlike traditional models that look at one row at a time, a time-series model like a recurrent neural network or a Transformer requires sequences of data to make predictions. The `SEQUENCE_LENGTH` hyperparameter (100) dictates how many past time steps the model will use to predict the next value. Padding is necessary for the initial data points in the test set, where there isn't enough historical context to form a full sequence. The `timeseries_dataset_from_array` function efficiently handles this process of creating overlapping sequences and prepares the data for the model's input layer.
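
The effect of front-padding followed by windowing can be checked with plain NumPy. The sketch below uses `sliding_window_view` as a stand-in for `timeseries_dataset_from_array` (an assumption made to keep the example lightweight) and a short window of 4 instead of 100. With `SEQ_LEN - 1` padded rows, each original row becomes the last step of exactly one window.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

SEQ_LEN = 4  # the notebook uses 100
x = np.arange(1, 7, dtype=np.float32).reshape(6, 1)  # 6 rows, 1 feature

# Front-pad with SEQ_LEN - 1 zero rows, as the notebook does with np.pad
x_padded = np.pad(x, ((SEQ_LEN - 1, 0), (0, 0)), mode="constant", constant_values=0)

# sliding_window_view puts the window axis last: (n_windows, n_features, SEQ_LEN)
windows = sliding_window_view(x_padded, window_shape=SEQ_LEN, axis=0)
windows = np.transpose(windows, (0, 2, 1))  # -> (n_windows, SEQ_LEN, n_features)

print(windows.shape)     # (6, 4, 1): one window per original row
print(windows[0, :, 0])  # first window is mostly padding
print(windows[-1, :, 0]) # last window holds the final four rows
```

Because the features are standardized before padding, a zero row corresponds to "all features at their training mean" rather than to missing data.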

In [3]:
# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Hyperparameters (must match training pipeline)
SEQUENCE_LENGTH = 100
BATCH_SIZE = 256

# Additional holidays added to the calendar, per warehouse group
czech_holiday = [
    (['03/31/2024', '04/09/2023', '04/17/2022', '04/04/2021', '04/12/2020'], 'Easter Day'),
    (['05/12/2024', '05/10/2020', '05/09/2021', '05/08/2022', '05/14/2023'], "Mother Day"),
]
brno_holiday = [
    (['03/31/2024', '04/09/2023', '04/17/2022', '04/04/2021', '04/12/2020'], 'Easter Day'),
    (['05/12/2024', '05/10/2020', '05/09/2021', '05/08/2022', '05/14/2023'], "Mother Day"),
]
munich_holidays = [
    (['03/30/2024', '04/08/2023', '04/16/2022', '04/03/2021'], 'Holy Saturday'),
    (['05/12/2024', '05/14/2023', '05/08/2022', '05/09/2021'], 'Mother Day'),
]
frankfurt_holidays = [
    (['03/30/2024', '04/08/2023', '04/16/2022', '04/03/2021'], 'Holy Saturday'),
    (['05/12/2024', '05/14/2023', '05/08/2022', '05/09/2021'], 'Mother Day'),
]

# Helper functions
def fill_loss_holidays(df_fill, warehouses, holidays):
    df = df_fill.copy()
    for item in holidays:
        dates, holiday_name = item
        generated_dates = [pd.to_datetime(date, format='%m/%d/%Y').strftime('%Y-%m-%d') for date in dates]
        for generated_date in generated_dates:
            df.loc[(df['warehouse'].isin(warehouses)) & (df['date'] == generated_date), 'holiday'] = 1
            df.loc[(df['warehouse'].isin(warehouses)) & (df['date'] == generated_date), 'holiday_name'] = holiday_name
    return df

def enrich_calendar(df):
    df = df.sort_values('date').reset_index(drop=True)
    df['next_holiday_date'] = df.loc[df['holiday'] == 1, 'date'].shift(-1)
    df['next_holiday_date'] = df['next_holiday_date'].bfill()
    df['date_days_to_next_holiday'] = (df['next_holiday_date'] - df['date']).dt.days
    df.drop(columns=['next_holiday_date'], inplace=True)
    df['next_shops_closed_date'] = df.loc[df['shops_closed'] == 1, 'date'].shift(-1)
    df['next_shops_closed_date'] = df['next_shops_closed_date'].bfill()
    df['date_days_to_shops_closed'] = (df['next_shops_closed_date'] - df['date']).dt.days
    df.drop(columns=['next_shops_closed_date'], inplace=True)
    df['date_day_after_closed_day'] = ((df['shops_closed'] == 0) & (df['shops_closed'].shift(1) == 1)).astype(int)
    df['date_second_closed_day'] = ((df['shops_closed'] == 1) & (df['shops_closed'].shift(1) == 1)).astype(int)
    df['date_day_after_two_closed_days'] = ((df['shops_closed'] == 0) & (df['date_second_closed_day'].shift(1) == 1)).astype(int)
    return df

def stack_datasets(df, calendar_extended, inventory, weights):
    df = df.merge(calendar_extended, on=['date', 'warehouse'], how='left')
    df = df.merge(inventory, on=['unique_id', 'warehouse'], how='left')
    df = df.merge(weights, on='unique_id', how='left')
    df['date'] = pd.to_datetime(df['date'])
    return df

def sort_dataframe_by_date(df, date_column):
    df[date_column] = pd.to_datetime(df[date_column])
    df = df.sort_values(by=date_column)
    return df

def encode_datetime(df, datetime_column):
    df[datetime_column] = pd.to_datetime(df[datetime_column])
    df['year'] = df[datetime_column].dt.year
    df['month'] = df[datetime_column].dt.month
    df['day'] = df[datetime_column].dt.day
    df['sin_month'] = np.sin(2 * np.pi * df['month'] / 12)
    df['cos_month'] = np.cos(2 * np.pi * df['month'] / 12)
    df['sin_day'] = np.sin(2 * np.pi * df['day'] / 31)
    df['cos_day'] = np.cos(2 * np.pi * df['day'] / 31)
    df.drop(datetime_column, axis=1, inplace=True)
    return df

# Load datasets
train = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/sales_train.csv', parse_dates=['date'])
inventory = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/inventory.csv')
submission = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/sales_test.csv', parse_dates=['date'])
weights = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/test_weights.csv')

# Load and preprocess calendar
calendar = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/calendar.csv', parse_dates=['date'])
calendar = fill_loss_holidays(calendar, ['Prague_1', 'Prague_2', 'Prague_3'], czech_holiday)
calendar = fill_loss_holidays(calendar, ['Brno_1'], brno_holiday)
calendar = fill_loss_holidays(calendar, ['Munich_1'], munich_holidays)
calendar = fill_loss_holidays(calendar, ['Frankfurt_1'], frankfurt_holidays)

calendar_enriched = pd.DataFrame()
for location in ['Frankfurt_1', 'Prague_2', 'Brno_1', 'Munich_1', 'Prague_3', 'Prague_1', 'Budapest_1']:
    calendar_enriched = pd.concat([calendar_enriched, enrich_calendar(calendar.query('date >= "2020-08-01 00:00:00" and warehouse == @location'))])
calendar_enriched['year'] = calendar_enriched['date'].dt.year
calendar_enriched = calendar_enriched.rename(columns={
    'holiday_name': 'date_holiday_name',
    'year': 'date_year',
    'holiday': 'date_holiday_flag',
    'shops_closed': 'date_shops_closed_flag',
    'winter_school_holidays': 'date_winter_school_holidays_flag',
    'school_holidays': 'date_school_holidays_flag',
})

# Stack datasets
df_submission = stack_datasets(submission, calendar_enriched, inventory, weights)
df_submission['date_holiday_name'] = df_submission['date_holiday_name'].fillna('Working Day')

# Fill NaN values
for col in df_submission.select_dtypes(include=np.number).columns:
    if df_submission[col].isnull().any():
        mean_val = df_submission[col].mean()
        df_submission[col] = df_submission[col].fillna(mean_val)
        print(f"Filled NaN in df_submission['{col}'] with mean: {mean_val:.2f}")

# Sort by date
df_submission = sort_dataframe_by_date(df_submission, 'date')
submission = df_submission.copy()
# Drop noise columns
columns_to_drop = [
    'type_2_discount',
    'date_holiday_flag',
    'date_school_holidays_flag',
    'date_shops_closed_flag',
    'date_second_closed_day',
    'date_winter_school_holidays_flag',
    'date_day_after_closed_day',
    'date_day_after_two_closed_days',
    'type_5_discount',
    'type_3_discount',
    'type_1_discount',
    'unique_id',
    "availability"
]
df_submission = df_submission.drop(columns=columns_to_drop, errors='ignore')

# Preprocess data
def preprocess_data(df, encoder_path='/kaggle/input/rohik_encoder_v2/scikitlearn/default/1/encoder.pkl', scaler_path='/kaggle/input/rohik_scaler_v2/scikitlearn/default/1/scaler.pkl'):
    weight = df['weight'].values.astype(np.float32)
    x = df.drop(['weight'], axis=1).copy()

    datetime_cols = x.select_dtypes(include=['datetime']).columns
    if len(datetime_cols) > 0:
        for col in datetime_cols:
            x[col + '_month'] = x[col].dt.month
            x[col + '_day'] = x[col].dt.day
        x = x.drop(datetime_cols, axis=1)

    categorical_cols = x.select_dtypes(include=['object', 'category']).columns
    numeric_cols = x.select_dtypes(include=['number']).columns

    encoder = joblib.load(encoder_path)
    scaler = joblib.load(scaler_path)
    if len(categorical_cols) > 0:
        x[categorical_cols] = encoder.transform(x[categorical_cols])
    if len(numeric_cols) > 0:
        x[numeric_cols] = scaler.transform(x[numeric_cols])

    x = x.values.astype(np.float32)
    return x, weight

x_submission, weight_submission = preprocess_data(df_submission)

# Pad x_submission for sequence length
pad_length = SEQUENCE_LENGTH - 1
x_submission_padded = np.pad(x_submission, ((pad_length, 0), (0, 0)), mode='constant', constant_values=0)

# Function to create test dataset
def create_test_dataset(features, weights, sequence_length, batch_size):
    dataset = tf.keras.utils.timeseries_dataset_from_array(
        data=features,
        targets=None,
        sequence_length=sequence_length,
        batch_size=batch_size,
        shuffle=False
    )
    aligned_weights = weights[-len(features):]
    return dataset, aligned_weights

# Create test dataset
dataset_submission, aligned_weights = create_test_dataset(x_submission_padded, weight_submission, SEQUENCE_LENGTH, BATCH_SIZE)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


# Model Loading

This section of the code handles two critical tasks for the deep learning model: **defining a custom loss function and a custom metric, and then loading the model itself**.

**Step-by-Step Breakdown**

**1. Defining a Custom Weighted Loss Function (`custom_wmae_loss`)**

This function defines the Weighted Mean Absolute Error (WMAE), which the model minimizes during training. In a standard Mean Absolute Error (MAE) loss, every prediction error is treated equally. However, in this problem, some forecasts are more important than others (e.g., predicting sales for high-volume products). WMAE addresses this by multiplying the absolute error of each prediction by a corresponding `sample_weight`.

**Code Breakdown:**

- The function takes `y_true` (actual values), `y_pred` (predicted values), and `sample_weight` as inputs.
- It calculates `tf.abs(y_true - y_pred)`, which is the absolute error for each prediction.
- It multiplies this by `sample_weight` to get the `weighted_error`.
- The loss is then computed as the sum of the weighted errors divided by the sum of the weights, effectively providing a weighted average.
- The `tf.cond` statement is a safety check to prevent a division-by-zero error if all weights are zero.

**2. Defining a Custom Weighted Metric (`WeightedMAEMetric`)**

This is a custom TensorFlow Metric class that provides a human-readable, aggregate view of the WMAE during the training process, distinct from the loss function. A Metric object maintains a running state and updates its value incrementally for each batch. This is different from a loss function, which calculates the loss for a single batch. By defining a custom metric, you can track the WMAE on the training and validation data without it directly influencing the model's gradient descent.

**Code Breakdown:**

- The `__init__` method initializes two state variables: `total_weighted_error` and `total_weights`, which will accumulate values over an epoch.
- The `update_state` method is called after each batch of data. It calculates the weighted error for that batch and adds it to the running totals.
- The `result` method is called at the end of an epoch. It computes the final WMAE for the epoch by dividing the total accumulated error by the total accumulated weights.
- The `reset_state` method is called at the beginning of each epoch to clear the running totals.
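
The difference between a running metric and an average of per-batch losses can be seen in a framework-free sketch. `RunningWMAE` below is a hypothetical plain-Python analogue of the accumulate-then-divide logic, not the Keras class itself, and the batches are made up.

```python
class RunningWMAE:
    """Accumulates weighted error and weights across batches."""

    def __init__(self):
        self.reset_state()

    def reset_state(self):
        self.total_weighted_error = 0.0
        self.total_weights = 0.0

    def update_state(self, y_true, y_pred, sample_weight):
        for yt, yp, w in zip(y_true, y_pred, sample_weight):
            self.total_weighted_error += abs(yt - yp) * w
            self.total_weights += w

    def result(self):
        if self.total_weights > 0:
            return self.total_weighted_error / self.total_weights
        return 0.0  # same guard as the division-by-zero check in the loss

metric = RunningWMAE()
metric.update_state([10.0, 20.0], [12.0, 20.0], [1.0, 1.0])  # batch WMAE = 2 / 2 = 1.0
metric.update_state([30.0], [27.0], [6.0])                   # batch WMAE = 18 / 6 = 3.0

epoch_wmae = metric.result()       # (2 + 18) / (2 + 6) = 2.5
mean_of_batches = (1.0 + 3.0) / 2  # 2.0, which ignores how much weight each batch carries
print(epoch_wmae, mean_of_batches)
```

The two numbers differ because the second batch carries most of the total weight, which is why the metric keeps running sums rather than averaging batch-level results.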

**3. Loading the Pre-trained Model**

This final step loads the entire deep learning model from a file that was saved after a prior training run. When a model is saved in Keras, custom components like `custom_wmae_loss` and `WeightedMAEMetric` are not automatically recognized. The `custom_objects` argument in `tf.keras.models.load_model` is a dictionary that maps the names of the custom components to their actual Python implementations. This is a crucial step for successfully loading models that use custom loss functions, layers, or metrics.

**Code Breakdown:**

- `tf.keras.models.load_model` is called with the file path to the saved model.
- The `custom_objects` dictionary is passed, mapping the name `'custom_wmae_loss'` to the `custom_wmae_loss` function and `'WeightedMAEMetric'` to the `WeightedMAEMetric` class.
- Finally, `model.summary()` prints a high-level overview of the model's architecture, confirming that it was loaded correctly and is ready for use.

In [4]:
# Custom WMAE Loss Function
def custom_wmae_loss(y_true, y_pred, sample_weight=None):
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    if sample_weight is None:
        sample_weight = tf.ones_like(y_true, dtype=tf.float32)
    else:
        sample_weight = tf.cast(sample_weight, tf.float32)
    weighted_error = tf.abs(y_true - y_pred) * sample_weight
    sum_of_weights = tf.reduce_sum(sample_weight)
    return tf.cond(
        tf.greater(sum_of_weights, 0),
        lambda: tf.reduce_sum(weighted_error) / sum_of_weights,
        lambda: tf.constant(0.0, dtype=tf.float32)
    )

# Custom WMAE Metric
class WeightedMAEMetric(Metric):
    def __init__(self, name='wmae', **kwargs):
        super(WeightedMAEMetric, self).__init__(name=name, **kwargs)
        self.total_weighted_error = self.add_weight(name='total_weighted_error', initializer='zeros')
        self.total_weights = self.add_weight(name='total_weights', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.cast(y_pred, tf.float32)
        if sample_weight is None:
            sample_weight = tf.ones_like(y_true, dtype=tf.float32)
        else:
            sample_weight = tf.cast(sample_weight, tf.float32)
        weighted_error = tf.abs(y_true - y_pred) * sample_weight
        self.total_weighted_error.assign_add(tf.reduce_sum(weighted_error))
        self.total_weights.assign_add(tf.reduce_sum(sample_weight))

    def result(self):
        return tf.cond(
            tf.greater(self.total_weights, 0),
            lambda: self.total_weighted_error / self.total_weights,
            lambda: tf.constant(0.0, dtype=tf.float32)
        )

    def reset_state(self):
        self.total_weighted_error.assign(0.0)
        self.total_weights.assign(0.0)

# --- Locate the saved model in Keras format (.keras) ---
# Path to the .keras file produced by an earlier training run
keras_model_path = "/kaggle/input/wavenet_transformer_model.keras/keras/default/1/wavenet_transformer_model.keras"

# --- Load the model from Keras format (.keras) ---
print(f"\n--- Loading Model from '{keras_model_path}' ---")
model = tf.keras.models.load_model(
    keras_model_path,
    custom_objects={
        'custom_wmae_loss': custom_wmae_loss,
        'WeightedMAEMetric': WeightedMAEMetric
    }
)
print("Model loaded successfully!")
model.summary()


--- Loading Model from '/kaggle/input/wavenet_transformer_model.keras/keras/default/1/wavenet_transformer_model.keras' ---
Model loaded successfully!
Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input (InputLayer)          [(None, 100, 18)]            0         []                            
                                                                                                  
 conv1d_26 (Conv1D)          (None, 100, 256)             14080     ['input[0][0]']               
                                                                                                  
 conv1d_27 (Conv1D)          (None, 100, 256)             14080     ['input[0][0]']               
                                                                                                  
 multiply_6 (Multiply)       (None, 100,

# Prediction Generation, Post-processing, and Submission

This step focuses on the final stages of the machine learning workflow: **generating predictions from the trained model and formatting them into a submission file.**

**Step-by-step breakdown**

**1. Generating Predictions**

This part of the code uses the pre-trained model to make predictions on the test dataset. After the model is trained, its purpose is to generalize what it has learned to new, unseen data. The `model.predict()` method is the standard way to accomplish this. It takes a preprocessed dataset (in this case, `dataset_submission`) as input and uses the model's learned weights and biases to output a forecast.

**Code Breakdown:**

- `y_pred_list = []`: An empty list is initialized to store the predictions for each batch.
- `for x_batch in dataset_submission`: The code iterates through the test data one batch at a time. This is a memory-efficient practice, especially for large datasets, as it avoids loading the entire dataset into memory at once.
- `y_pred_batch = model.predict(x_batch, verbose=0)`: The model generates predictions for the current batch. The `verbose=0` argument suppresses the progress bar that would otherwise be printed for every batch.
- `y_pred_list.append(y_pred_batch)`: The predictions for the current batch are added to the list.
- `y_pred_padded = np.concatenate(y_pred_list, axis=0)`: After all batches are processed, the list of prediction arrays is combined into a single, padded NumPy array.
- `y_pred = y_pred_padded[-len(x_submission):]`: The padding that was added to the input data for sequence alignment is removed from the predictions to ensure the final prediction array has the correct length.
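
The concatenate-then-trim step can be illustrated with small NumPy arrays in place of model output. The sketch assumes the filler rows sit at the start of the padded data, which is the only arrangement for which a `[-n:]` slice recovers the real rows; the sizes and the `-1.0` filler value are hypothetical.

```python
import numpy as np

real = np.arange(10, dtype=np.float32).reshape(-1, 1)  # 10 real rows
pad = np.full((2, 1), -1.0, dtype=np.float32)          # 2 filler rows, prepended (assumption)
padded = np.concatenate([pad, real], axis=0)           # shape (12, 1)

# Stand-in for per-batch model output: three batches of four rows each.
batches = [padded[i:i + 4] for i in range(0, len(padded), 4)]

stacked = np.concatenate(batches, axis=0)  # back to shape (12, 1)
trimmed = stacked[-len(real):]             # keep only the last 10 rows
```

If the padding had been appended instead, the same slice would drop real rows and keep filler, so it is worth confirming how the input pipeline pads before relying on this pattern.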

**2. Inverse Transformation of Predictions**

The predictions are inverse-scaled to be in the same units as the original sales data. During the model's training, the target variable (sales) was likely scaled (e.g., using StandardScaler) to help the optimization process. Models often perform better when input and output data are in a standardized range. However, the raw model output is on this scaled range, which is not directly interpretable. To get the actual sales values, an inverse transformation is required.

**Code Breakdown:**

- `scaler_y = joblib.load(...)`: The pre-trained scaler object that was used to scale the target variable during the training phase is loaded from a file.
- `y_pred.reshape(-1, 1)`: The prediction array is reshaped into a single 2D column, which is the input format the scaler's `inverse_transform` method expects. The trailing `.flatten()` in the code then returns the result to 1D.
- `y_pred_unscaled = scaler_y.inverse_transform(...)`: The loaded scaler object applies the inverse transformation to convert the predictions back to the original scale.
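
For a standard scaler, the inverse transformation is the scaling formula run backwards: multiply by the fitted standard deviation and add the fitted mean. The NumPy sketch below reproduces that arithmetic with a hypothetical mean and scale; it does not load the notebook's `scaler_y`, whose type and fitted values are not shown here.

```python
import numpy as np

# Hypothetical fitted statistics (a fitted StandardScaler stores these as mean_ and scale_).
mean_, scale_ = 50.0, 20.0

sales = np.array([10.0, 50.0, 90.0])
scaled = (sales - mean_) / scale_  # forward transform: [-2, 0, 2]

# Model output usually arrives as a column; mimic that shape.
y_pred = scaled.reshape(-1, 1)     # shape (3, 1)

# Inverse transform, then flatten back to 1D for the submission column.
restored = (y_pred * scale_ + mean_).flatten()
```

A prediction of `0` in scaled units therefore corresponds to the training mean in sales units, which is a quick sanity check on unscaled output.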

**3. Creating and Saving the Submission File**

This is the final step where the predictions are formatted into the required structure for submission. In a data science competition, there are often strict requirements for the submission file format. The file usually needs to contain a unique identifier for each prediction and the predicted value itself.

**Code Breakdown:**

- `submission['date'] = ...`: The `date` column is formatted as `YYYY-MM-DD` to ensure consistency.
- `submission['id'] = ...`: A new `id` column is created by joining `unique_id` and the formatted date with an underscore. This provides a unique identifier for each forecast as required.
- `submission['sales_hat'] = y_pred_unscaled`: The unscaled predictions are added to the submission DataFrame under the column name `sales_hat`.
- `submission_final = submission[['id', 'sales_hat']]`: A new DataFrame is created containing only the `id` and `sales_hat` columns, as these are the only two required for submission.
- `submission_final.to_csv("submission.csv", index=False)`: The final DataFrame is saved as a CSV file. The `index=False` argument prevents pandas from writing the DataFrame's index to the CSV file.
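
The formatting steps can be tried on a two-row DataFrame before running them on the full test set. The rows and prediction values below are hypothetical; only the column names and the `unique_id_date` pattern come from the notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the real test set.
submission = pd.DataFrame({
    "unique_id": [1226, 3510],
    "date": pd.to_datetime(["2024-06-03", "2024-06-16"]),
})

# Format the date, build the id, and attach (made-up) predictions.
submission["date"] = pd.to_datetime(submission["date"]).dt.strftime("%Y-%m-%d")
submission["id"] = submission["unique_id"].astype(str) + "_" + submission["date"]
submission["sales_hat"] = [45.7, 20.8]

submission_final = submission[["id", "sales_hat"]]

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submission_final.to_csv(index=False)
```

Inspecting `csv_text` (or the first few lines of the saved file) confirms that the header is exactly `id,sales_hat` and that no index column was written.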

In [5]:
# Generate predictions
y_pred_list = []
for x_batch in dataset_submission:
    y_pred_batch = model.predict(x_batch, verbose=0)
    y_pred_list.append(y_pred_batch)

y_pred_padded = np.concatenate(y_pred_list, axis=0)
y_pred = y_pred_padded[-len(x_submission):]

# Load scaler for inverse transformation
scaler_y = joblib.load('/kaggle/input/rohik_scaler_y_v2/scikitlearn/default/1/scaler_y.pkl')
y_pred_unscaled = scaler_y.inverse_transform(y_pred.reshape(-1, 1)).flatten()

# Prepare submission
submission['date'] = pd.to_datetime(submission['date']).dt.strftime('%Y-%m-%d')
# Build the id column from unique_id and the formatted date
submission['id'] = submission['unique_id'].astype(str) + '_' + submission['date']
submission['sales_hat'] = y_pred_unscaled
submission_final = submission[['id', 'sales_hat']]

# Save submission
submission_final.to_csv("submission.csv", index=False)
print("Submission saved to submission.csv")
submission_final

Submission saved to submission.csv




Unnamed: 0,id,sales_hat
0,1226_2024-06-03,45.729156
18110,3510_2024-06-03,46.179909
18100,3517_2024-06-03,46.386429
18077,3148_2024-06-03,49.095165
18069,2385_2024-06-03,48.637413
...,...,...
44578,96_2024-06-16,20.823042
23275,5228_2024-06-16,20.617605
13130,4848_2024-06-16,520.061951
37054,5270_2024-06-16,191.749313
