<a href="https://colab.research.google.com/github/azhgh22/Walmart-Recruiting-Store-Sales-Forecasting/blob/main/notebooks/xgboost_base_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook Objective

This notebook evaluates the performance of tree-based models, specifically **XGBoost**, for the store sales forecasting task.

The process involves implementing several feature engineering techniques to enrich the model's input, followed by hyperparameter tuning to identify the optimal XGBoost configuration. The final, best-performing model is then logged for production use.

## 1. Environment Setup

The following cells prepare the environment for the experiment. This includes mounting Google Drive, installing required libraries, and downloading the competition dataset from Kaggle. This setup is specific to the Google Colab environment and may require adjustments for local execution.

In [1]:
# Mount Google Drive to access project files and credentials
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Install necessary Python libraries for data handling, modeling, and experiment tracking.
!pip install -q wandb kaggle onnx pandas numpy xgboost scikit-learn dagshub mlflow neuralforecast statsmodels


In [3]:
# Configure the Kaggle API client by setting up the credentials file.
! mkdir -p ~/.kaggle
! cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
! chmod 600 ~/.kaggle/kaggle.json

# Download and unzip the competition dataset.
! kaggle competitions download -c walmart-recruiting-store-sales-forecasting
! unzip -o walmart-recruiting-store-sales-forecasting.zip
! rm walmart-recruiting-store-sales-forecasting.zip
! unzip -oq '*.zip'

Downloading walmart-recruiting-store-sales-forecasting.zip to /content
  0% 0.00/2.70M [00:00<?, ?B/s]
100% 2.70M/2.70M [00:00<00:00, 853MB/s]
Archive:  walmart-recruiting-store-sales-forecasting.zip
  inflating: features.csv.zip        
  inflating: sampleSubmission.csv.zip  
  inflating: stores.csv              
  inflating: test.csv.zip            
  inflating: train.csv.zip           

4 archives were successfully processed.


## 2. Data Loading and Preprocessing

In this section, we load the raw data, merge the tables into a single DataFrame, and perform a time-based split to create training and validation sets.
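
As a rough illustration of the time-based split described here, the sketch below applies the same cutoff logic to a four-row toy frame. The sales values are made up, and the names `toy`, `train_part`, and `val_part` are hypothetical; only the cutoff date, the start date, and the week-index formula mirror the notebook.

```python
import pandas as pd

# Toy weekly data; the sales values are invented for illustration only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2011-11-18", "2011-11-25", "2011-12-02", "2011-12-09"]),
    "Weekly_Sales": [10.0, 30.0, 12.0, 11.0],
})

cutoff = pd.Timestamp("2011-11-30")   # same cutoff the notebook uses
start = pd.Timestamp("2010-02-05")    # first week of the training data

toy = toy.sort_values("Date")
# Whole weeks elapsed since the start date, as in the DateDummy column.
toy["DateDummy"] = (toy["Date"] - start).dt.days // 7

train_part = toy[toy["Date"] <= cutoff]
val_part = toy[toy["Date"] > cutoff]

# Every training row precedes every validation row, so no future data leaks back.
assert train_part["Date"].max() < val_part["Date"].min()
print(len(train_part), len(val_part))
```

Note that the comparison against the cutoff only behaves as intended once `Date` holds datetimes rather than strings, which is why the conversion happens before the split.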

In [4]:
# Import core libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Load the individual datasets into pandas DataFrames
stores = pd.read_csv('/content/stores.csv')
features = pd.read_csv('/content/features.csv')
train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')

In [6]:
# Merge the train, stores, and features data into a single DataFrame for model training
merged_train = pd.merge(train,stores,on='Store',how='left').merge(features,how='left',on=['Store','Date','IsHoliday'])

In [7]:
class TimeSeriesSplit:
    """Performs a time-based split on a DataFrame.

    Attributes:
        split_date (pd.Timestamp): The date used to separate training and validation data.
        start_date (pd.Timestamp): The earliest date in the dataset, used for creating a time index.
    """
    def __init__(self, split_date:pd.Timestamp,start_date:pd.Timestamp=pd.Timestamp('2010-02-05')) -> None:
        self.split_date = split_date
        self.start_date = start_date

    def split(self, data:pd.DataFrame) -> tuple[pd.DataFrame,pd.DataFrame]:
        """Splits the data into training and validation sets based on the split_date.

        Args:
            data (pd.DataFrame): The input DataFrame, must contain a 'Date' column.

        Returns:
            tuple[pd.DataFrame, pd.DataFrame]: A tuple containing the training and validation DataFrames.
        """
        data = data.copy().sort_values(by='Date')
        # Create a numeric week-based index for potential use in models
        data['DateDummy'] = ((data['Date'] - self.start_date).dt.days // 7).astype(int)
        train_data = data[data['Date']<=self.split_date]
        val_data = data[data['Date']>self.split_date]
        return train_data, val_data

In [8]:
# Ensure the 'Date' column is in datetime format
merged_train.Date = pd.to_datetime(merged_train.Date)

# Split the data into training and validation sets using a cutoff date
x_train, x_val = TimeSeriesSplit(pd.Timestamp('2011-11-30')).split(merged_train)

# Separate the target variable (Weekly_Sales) from the features
y_train = x_train.pop('Weekly_Sales')
y_val = x_val.pop('Weekly_Sales')

In [9]:
# Display the first few rows of the merged dataset to verify its structure
merged_train.head()

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday,Type,Size,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment
0,1,1,2010-02-05,24924.5,False,A,151315,42.31,2.572,,,,,,211.096358,8.106
1,1,1,2010-02-12,46039.49,True,A,151315,38.51,2.548,,,,,,211.24217,8.106
2,1,1,2010-02-19,41595.55,False,A,151315,39.93,2.514,,,,,,211.289143,8.106
3,1,1,2010-02-26,19403.54,False,A,151315,46.63,2.561,,,,,,211.319643,8.106
4,1,1,2010-03-05,21827.9,False,A,151315,46.5,2.625,,,,,,211.350143,8.106


## 3. Custom Feature Engineering Classes

To streamline feature engineering, we define a set of custom `scikit-learn` compatible transformers. These classes are designed to be modular and can be easily integrated into a `Pipeline`.

1.  **`FeatureAdder`**: Creates time-based features from the `Date` column. This includes cyclical features (Fourier terms for month and week), holiday-related flags, and proximity metrics (days until/since the next/last holiday).

2.  **`NaImputer`**: A wrapper for `SimpleImputer` that handles missing values. It automatically identifies columns with NaNs and applies a specified imputation strategy (e.g., 'mean'), preserving the original DataFrame structure.

3.  **`Cat2Num`**: A preprocessor that converts categorical features into a numerical format suitable for XGBoost. It handles boolean (`IsHoliday`) and categorical (`Type`) columns and removes the original `Date` column after features have been extracted from it.
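
To make the idea behind the cyclical (Fourier) features concrete, here is a minimal sketch using only the standard library. The helper names `cyclical_encode` and `circle_distance` are hypothetical and not part of the notebook; the period of 52 matches the one used for the week features.

```python
import math

def cyclical_encode(value: int, period: int) -> tuple[float, float]:
    """Map a periodic integer (e.g. week of year) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a: int, b: int, period: int) -> float:
    """Euclidean distance between two encoded values."""
    return math.dist(cyclical_encode(a, period), cyclical_encode(b, period))

# Weeks 52 and 1 are 51 apart as raw integers but neighbours on the circle.
d_wrap = circle_distance(52, 1, period=52)
d_adjacent = circle_distance(1, 2, period=52)
d_half_year = circle_distance(1, 27, period=52)
print(round(d_wrap, 4), round(d_adjacent, 4), round(d_half_year, 4))
```

The wrap-around distance equals the distance between any two adjacent weeks, which is the property a raw week number lacks. With a period of 52, an ISO week 53 would land on the same point as week 1.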

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

# Define major holiday dates relevant to retail sales
HOLIDAY_DATES = {
    # Super Bowl
    '2010-02-12': 'SuperBowl', '2011-02-11': 'SuperBowl', '2012-02-10': 'SuperBowl', '2013-02-08': 'SuperBowl',
    # Labor Day
    '2010-09-10': 'LaborDay', '2011-09-09': 'LaborDay', '2012-09-07': 'LaborDay', '2013-09-06': 'LaborDay',
    # Thanksgiving
    '2010-11-26': 'Thanksgiving', '2011-11-25': 'Thanksgiving', '2012-11-23': 'Thanksgiving', '2013-11-29': 'Thanksgiving',
    # Christmas
    '2010-12-31': 'Christmas', '2011-12-30': 'Christmas', '2012-12-28': 'Christmas', '2013-12-27': 'Christmas',
}

class FeatureAdder(BaseEstimator, TransformerMixin):
    """A transformer to add time-based and holiday-related features to the dataset."""
    def __init__(self,
                 add_week_num=True,
                 add_holiday_flags=True,
                 add_holiday_proximity=True,
                 add_holiday_windows=False,
                 add_fourier_features=True,
                 add_month_and_year=True,
                 list_of_holiday_proximity=list(set(HOLIDAY_DATES.values())),
                 holiday_dates=HOLIDAY_DATES,
                 replace_time_index = True,
                 add_dummy_date = False,
                 start_date = None
                 ):

        self.holiday_dates = holiday_dates
        self.add_week_num = add_week_num
        self.add_month_and_year = add_month_and_year
        self.add_holiday_flags = add_holiday_flags
        self.add_holiday_proximity = add_holiday_proximity
        self.add_holiday_windows = add_holiday_windows
        self.add_fourier_features = add_fourier_features
        self.list_of_holiday_proximity = list_of_holiday_proximity
        self.replace_time_index = replace_time_index
        self.add_dummy_date = add_dummy_date
        self.start_date = start_date

    def fit(self, X, y=None):
        """Learns the start date from the data if not provided."""
        if self.start_date is not None:
          self.start_date_ = self.start_date
        else:
          self.start_date_ = pd.to_datetime(X['Date']).min()
        return self

    def transform(self, X):
        """Applies feature engineering transformations to the DataFrame."""
        X_ = X.copy()
        X_['Date'] = pd.to_datetime(X_['Date'])

        if self.add_dummy_date:
          X_['DateDummy'] = ((X_['Date'] - self.start_date_).dt.days // 7).astype(int)

        if self.add_month_and_year or self.add_fourier_features:
          self._add_month_and_year(X_)

        if self.add_week_num:
            self._add_week_number(X_)

        if self.add_holiday_flags:
            self._add_specific_holiday_flags(X_)

        if self.add_holiday_proximity:
            self._add_proximity_to_holidays(X_)

        if self.add_holiday_windows:
            self._add_pre_post_holiday_windows(X_)

        if self.add_fourier_features and 'WeekOfYear' in X_.columns:
            self._add_fourier_features(X_)

        if self.list_of_holiday_proximity:
            self._add_proximity_to_specific_holidays(X_)

        if self.replace_time_index:
          self._replace_date_with_time_index(X_)

        return X_

    def _add_week_number(self, df):
        """Adds the ISO week number of the year."""
        df['WeekOfYear'] = df['Date'].dt.isocalendar().week.astype(int)

    def _add_month_and_year(self, df):
        """Adds month and year as separate columns."""
        df['Month'] = df['Date'].dt.month
        df['Year'] = df['Date'].dt.year

    def _add_specific_holiday_flags(self, df):
        """Adds a binary flag for each major holiday."""
        date_str = df['Date'].dt.strftime('%Y-%m-%d')
        for holiday_name in set(self.holiday_dates.values()):
            holiday_dates = {date for date, name in self.holiday_dates.items() if name == holiday_name}
            df[f'Is_{holiday_name}'] = date_str.isin(holiday_dates).astype(int)

    def _replace_date_with_time_index(self, df):
        """Replaces the 'Date' column with a numerical week-based index."""
        if not hasattr(self, 'start_date_'):
            raise RuntimeError("The transformer has not been fitted yet. Call .fit() before .transform().")
        time_delta_days = (df['Date'] - self.start_date_).dt.days
        df.drop(columns=['Date'], inplace=True)
        df['Date'] = (time_delta_days / 7).astype(int)

    def _add_proximity_to_holidays(self, df):
        """Adds features for days until the next holiday and days since the last holiday."""
        holiday_dates = sorted([pd.to_datetime(d) for d in self.holiday_dates.keys()])
        safe_dates = pd.to_datetime(df['Date'], errors='coerce')
        indices = np.searchsorted(holiday_dates, safe_dates)

        next_holiday_dates = [holiday_dates[i] if i < len(holiday_dates) else pd.NaT for i in indices]
        df['Days_until_next_holiday'] = (pd.to_datetime(next_holiday_dates) - df['Date']).dt.days

        last_holiday_dates = [holiday_dates[i-1] if i > 0 else pd.NaT for i in indices]
        df['Days_since_last_holiday'] = (df['Date'] - pd.to_datetime(last_holiday_dates)).dt.days

        df.fillna({'Days_until_next_holiday': 999, 'Days_since_last_holiday': 999}, inplace=True)

    def _add_proximity_to_specific_holidays(self, df):
        """Adds proximity features for each specific holiday type."""
        safe_dates = pd.to_datetime(df['Date'], errors='coerce')
        for holiday in self.list_of_holiday_proximity:
            holiday_dates = sorted([pd.to_datetime(d) for d, name in self.holiday_dates.items() if name == holiday])
            if not holiday_dates: continue
            indices = np.searchsorted(holiday_dates, safe_dates)

            next_holiday_dates = [holiday_dates[i] if i < len(holiday_dates) else pd.NaT for i in indices]
            df[f'Days_until_next_{holiday}'] = (pd.to_datetime(next_holiday_dates) - df['Date']).dt.days

            last_holiday_dates = [holiday_dates[i-1] if i > 0 else pd.NaT for i in indices]
            df[f'Days_since_last_{holiday}'] = (df['Date'] - pd.to_datetime(last_holiday_dates)).dt.days

            df.fillna({f'Days_until_next_{holiday}': 999, f'Days_since_last_{holiday}': 999}, inplace=True)

    def _add_pre_post_holiday_windows(self, df):
        """Adds binary flags for periods immediately before and after holidays."""
        for holiday_name in set(self.holiday_dates.values()):
            # Start both flags at 0 so rows outside every window are not left as NaN
            df[f'Is_7_Days_Before_{holiday_name}'] = 0
            df[f'Is_7_Days_After_{holiday_name}'] = 0
            holiday_specific_dates = pd.to_datetime([d for d, name in self.holiday_dates.items() if name == holiday_name])
            for holiday_date in holiday_specific_dates:
                before_mask = (df['Date'] >= holiday_date - pd.Timedelta(days=7)) & (df['Date'] < holiday_date)
                after_mask = (df['Date'] > holiday_date) & (df['Date'] <= holiday_date + pd.Timedelta(days=7))
                df.loc[before_mask, f'Is_7_Days_Before_{holiday_name}'] = 1
                df.loc[after_mask, f'Is_7_Days_After_{holiday_name}'] = 1

    def _add_fourier_features(self, df):
        """Adds sine and cosine transformations of week and month to capture cyclical patterns."""
        df['week_sin'] = np.sin(2 * np.pi * df['WeekOfYear'] / 52)
        df['week_cos'] = np.cos(2 * np.pi * df['WeekOfYear'] / 52)
        df['month_sin'] = np.sin(2 * np.pi * df['Month'] / 12)
        df['month_cos'] = np.cos(2 * np.pi * df['Month'] / 12)
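
The proximity logic in `_add_proximity_to_holidays` can be easier to follow on a handful of dates. The sketch below is a standard-library approximation of it: `bisect_left` plays the role of `np.searchsorted` with its default left-side insertion, and the helper `holiday_proximity` is a hypothetical name. The three holiday dates are taken from `HOLIDAY_DATES`.

```python
from bisect import bisect_left
from datetime import date

# Two Super Bowl weeks and one Labor Day week from HOLIDAY_DATES, sorted.
holidays = [date(2010, 2, 12), date(2010, 9, 10), date(2011, 2, 11)]

def holiday_proximity(day: date, missing: int = 999) -> tuple[int, int]:
    """Return (days_until_next, days_since_last), using a sentinel when undefined."""
    i = bisect_left(holidays, day)  # same convention as np.searchsorted(side='left')
    until_next = (holidays[i] - day).days if i < len(holidays) else missing
    since_last = (day - holidays[i - 1]).days if i > 0 else missing
    return until_next, since_last

print(holiday_proximity(date(2010, 2, 5)))    # before the first holiday
print(holiday_proximity(date(2010, 9, 10)))   # exactly on a holiday
print(holiday_proximity(date(2011, 3, 4)))    # after the last holiday in the list
```

One consequence of the left-side convention: on a holiday week itself, "days until next" is 0 while "days since last" refers to the previous holiday, not the current one.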

In [11]:
class NaImputer(BaseEstimator, TransformerMixin):
  """Imputes missing values using a specified strategy.

  This transformer automatically detects columns with missing values
  and applies a SimpleImputer to them.
  """
  def __init__(self, strategy='mean'):
    self.strategy = strategy
    self.imputer = SimpleImputer(strategy=strategy)
    self.na_cols = []

  def fit(self, X, y=None):
    """Identifies columns with NaN values and fits the imputer."""
    self.na_cols = [col for col in X.columns if X[col].isna().sum() > 0]
    if self.na_cols:
        self.imputer.fit(X[self.na_cols])
    return self

  def transform(self, X, y=None):
    """Transforms the data by imputing missing values."""
    x_copy = X.copy()
    if self.na_cols:
        x_copy[self.na_cols] = self.imputer.transform(x_copy[self.na_cols])
    return x_copy

In [15]:
class Cat2Num(BaseEstimator, TransformerMixin):
  """Converts categorical columns to numerical format and prepares data for modeling.

  This transformer performs the final data cleaning steps:
  1. Removes the original 'Date' column.
  2. Converts the boolean 'IsHoliday' column to integers (0 or 1).
  3. Encodes the 'Type' column using category codes.
  """
  def fit(self, X, y=None):
    """This transformer does not need to be fitted."""
    return self

  def transform(self, X, y=None):
    """Applies the transformations to the DataFrame."""
    x_copy = X.copy()
    if 'Date' in x_copy.columns:
        x_copy.drop(columns=['Date'], inplace=True)
    x_copy['IsHoliday'] = x_copy['IsHoliday'].astype(int)
    x_copy['Type'] = x_copy['Type'].astype('category').cat.codes
    return x_copy

## 4. Model Training and Hyperparameter Tuning

This section focuses on training the XGBoost model. We first establish a strong baseline by performing a manual grid search on key hyperparameters. We then evaluate an alternative autoregressive approach before finalizing our best model.

### 4.1 Evaluation Metric and Helper Functions

We define the competition's evaluation metric, Weighted Mean Absolute Error (WMAE), and a helper function for manual hyperparameter searching.
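
For reference, WMAE is the weighted average of absolute errors: the sum of w_i * |y_i - y_hat_i| divided by the sum of w_i, with w_i = 5 for holiday weeks and 1 otherwise. The following is a small hand-checkable sketch in plain Python; `wmae_manual` is a hypothetical name and the numbers are invented.

```python
def wmae_manual(y_true, y_pred, is_holiday, holiday_weight=5):
    """Weighted MAE: holiday weeks count holiday_weight times, other weeks once."""
    weights = [holiday_weight if h else 1 for h in is_holiday]
    weighted_errors = [w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred)]
    return sum(weighted_errors) / sum(weights)

y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 180.0, 300.0]
is_holiday = [False, True, False]

# Absolute errors are 10, 20, 0; weights are 1, 5, 1.
# Weighted sum = 10 + 100 + 0 = 110, total weight = 7.
score = wmae_manual(y_true, y_pred, is_holiday)
plain_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, plain_mae)
```

The single holiday-week miss pulls the weighted score well above the unweighted MAE, which is why holiday behaviour dominates this metric.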

In [12]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
import itertools
import copy
from typing import Any, Dict, Callable, Tuple

def wmae(y_true, y_pred, is_holiday):
    """Computes the Weighted Mean Absolute Error (WMAE).

    Args:
        y_true: Array of true target values.
        y_pred: Array of predicted values.
        is_holiday: Array or list indicating if a sample is a holiday week.

    Returns:
        The WMAE score.
    """
    weights = np.where(np.array(is_holiday), 5, 1)
    return mean_absolute_error(y_true, y_pred, sample_weight=weights)

In [13]:
def manual_model_search(
    model: Any,
    param_grid: Dict[str, list],
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_valid: pd.DataFrame,
    y_valid: pd.Series,
    metric_func: Callable[..., float],
    metric_kwargs: Dict[str, Any] = None,
    verbose: bool = True
) -> Tuple[Any, Dict[str, Any], float]:
    """Performs a manual grid search over a parameter grid.

    Args:
        model: A scikit-learn compatible model instance.
        param_grid: Dictionary with parameters names as keys and lists of parameter settings to try as values.
        X_train, y_train: Training data and labels.
        X_valid, y_valid: Validation data and labels.
        metric_func: The evaluation metric function.
        metric_kwargs: Additional keyword arguments for the metric function.
        verbose: If True, prints the score for each parameter combination.

    Returns:
        A tuple containing the best model, its parameters, and the best score.
    """
    if metric_kwargs is None:
        metric_kwargs = {}

    best_score = float('inf')
    best_model = None
    best_params = None

    keys, values = zip(*param_grid.items())
    for combo in itertools.product(*values):
        trial_params = dict(zip(keys, combo))
        model_copy = copy.deepcopy(model)
        model_copy.set_params(**trial_params)

        try:
            model_copy.fit(X_train, y_train)
            preds = model_copy.predict(X_valid)
            score = metric_func(y_valid, preds, **metric_kwargs)

            if verbose:
                print(f"Params: {trial_params} -> Score: {score:.4f}")

            if score < best_score:
                best_score = score
                best_model = model_copy
                best_params = trial_params
        except Exception as e:
            if verbose:
                print(f"Params {trial_params} failed: {e}")
            continue

    return best_model, best_params, best_score
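
The way `manual_model_search` expands a grid into individual trials can be seen in isolation. This is a minimal sketch of that expansion step only; the grid values here are illustrative and are not the ones searched in this notebook.

```python
import itertools

param_grid = {
    "model__n_estimators": [200, 500],
    "model__max_depth": [5, 7],   # illustrative values, not the notebook's grid
}

# Same expansion as in manual_model_search: one dict per combination.
keys, values = zip(*param_grid.items())
trials = [dict(zip(keys, combo)) for combo in itertools.product(*values)]

for trial in trials:
    print(trial)
```

The `model__` prefix is scikit-learn's convention for routing a parameter to the pipeline step named `model` when `set_params` is called on a `Pipeline`.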

### 4.2 Baseline XGBoost Model

We establish a baseline by tuning the `n_estimators` parameter of the XGBoost model while keeping other hyperparameters fixed. This helps us find a reasonable number of boosting rounds before significant overfitting occurs.

In [16]:
from sklearn.pipeline import Pipeline

# Define the feature engineering and preprocessing pipeline
feature_pipeline = Pipeline([
    ('adder', FeatureAdder()),
    ('imputer', NaImputer()),
    ('cat2num', Cat2Num()),
])

# Create the full model pipeline with a placeholder XGBoost model
model_pipeline = Pipeline([
    ('model', XGBRegressor(
        learning_rate=0.1,
        max_depth=7,
        reg_lambda=3,
        min_split_loss=100,
        objective='reg:squarederror',
        random_state=42,
    ))
])

# Apply feature engineering to training and validation sets
transformed_train = feature_pipeline.fit_transform(x_train, y_train)
transformed_val = feature_pipeline.transform(x_val)

# Define the parameter grid for our search
param_grid = {
    'model__n_estimators': [200, 500, 800, 1000],
    'model__learning_rate': [0.1],
    'model__max_depth': [7],
    'model__reg_lambda': [3],
    'model__min_split_loss': [100],
}

# Prepare keyword arguments for our custom WMAE metric
metric_kwargs = {
    'is_holiday': transformed_val['IsHoliday']
}

# Run the manual search to find the best model
best_model, best_params, best_score = manual_model_search(
    model=model_pipeline,
    param_grid=param_grid,
    X_train=transformed_train,
    y_train=y_train,
    X_valid=transformed_val,
    y_valid=y_val,
    metric_func=wmae,
    metric_kwargs=metric_kwargs
)

print("\nBest Params:", best_params)
print("Best Validation Score (WMAE):", best_score)



Params: {'model__n_estimators': 200, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__reg_lambda': 3, 'model__min_split_loss': 100} -> Score: 3383.7155
Params: {'model__n_estimators': 500, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__reg_lambda': 3, 'model__min_split_loss': 100} -> Score: 3020.1967
Params: {'model__n_estimators': 800, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__reg_lambda': 3, 'model__min_split_loss': 100} -> Score: 2894.9451
Params: {'model__n_estimators': 1000, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__reg_lambda': 3, 'model__min_split_loss': 100} -> Score: 2848.6948

Best Params: {'model__n_estimators': 1000, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__reg_lambda': 3, 'model__min_split_loss': 100}
Best Validation Score (WMAE): 2848.694796243226
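
One way to read the scores printed above is to check whether the best value sits on the edge of the grid and how quickly the gains shrink. The sketch below does this with the four logged scores; the helper `best_on_upper_edge` is a hypothetical name.

```python
# Validation WMAE per n_estimators, copied from the search output above.
scores = {200: 3383.7155, 500: 3020.1967, 800: 2894.9451, 1000: 2848.6948}

def best_on_upper_edge(scores: dict[int, float]) -> bool:
    """True when the lowest score belongs to the largest value tried."""
    best = min(scores, key=scores.get)
    return best == max(scores)

# Relative improvement between consecutive grid points.
grid = sorted(scores)
gains = [(scores[a] - scores[b]) / scores[a] for a, b in zip(grid, grid[1:])]

print(best_on_upper_edge(scores))
print([round(g, 3) for g in gains])
```

Because the best score lies at the largest value tried, this search does not yet show where overfitting begins. The shrinking relative gains do suggest diminishing returns from further boosting rounds, though the grid steps are not evenly spaced.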


### 4.3 Experiment: Autoregressive Modeling with Lag Features

Next, we experiment with an autoregressive approach. The `LagAdder` transformer is introduced to create lagged features of the target variable. This allows the XGBoost model to consider past sales values when making predictions, effectively turning it into an autoregressive model. This is achieved through a walk-forward prediction process where predictions for one time step are used to generate features for the next.
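
The walk-forward idea can be sketched without any model at all. In the toy below, `toy_model` is a hypothetical stand-in for the fitted regressor and the sales numbers are made up; the point is only to show that, after the first forecast step, the lag fed to the model is itself a prediction.

```python
def toy_model(lag_1: float) -> float:
    """Hypothetical stand-in for the fitted regressor: decays the last value by 10%."""
    return 0.9 * lag_1

history = {0: 100.0, 1: 100.0}   # observed sales for weeks 0 and 1 (made-up numbers)
forecast_weeks = [2, 3, 4]

used_lags = {}
for week in forecast_weeks:
    lag_1 = history[week - 1]          # observed for week 2, predicted afterwards
    used_lags[week] = lag_1
    history[week] = toy_model(lag_1)   # the prediction becomes next week's lag

print(used_lags)
print({w: round(history[w], 2) for w in forecast_weeks})
```

Any systematic error in the model is therefore carried into every later step, which is the error-propagation risk inherent in this approach.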

In [17]:
class GroupMeanImputer(BaseEstimator, TransformerMixin):
  """Imputes missing values using the mean of a specified group."""
  def __init__(self, group_cols=['Store', 'Dept'], target_cols=None, fallback=0):
      self.group_cols = group_cols
      self.target_cols = target_cols
      self.fallback = fallback
      self.group_means_ = None

  def fit(self, X, y=None):
      """Calculates the mean for each group."""
      df = X.copy()
      if self.target_cols is None:
          self.target_cols = [col for col in df.columns if df[col].isna().any() and pd.api.types.is_numeric_dtype(df[col])]

      if self.target_cols:
        self.group_means_ = df.groupby(self.group_cols)[self.target_cols].mean().reset_index()
      return self

  def transform(self, X):
      """Fills missing values with the pre-computed group means."""
      df = X.copy()
      if self.group_means_ is None or self.target_cols is None:
          return df
      for col in self.target_cols:
          means = self.group_means_[[*self.group_cols, col]]
          df = df.merge(means, on=self.group_cols, how='left', suffixes=('', '_group_mean'))
          # Assign the result back instead of chained inplace fillna, which pandas warns about
          df[col] = df[col].fillna(df[f'{col}_group_mean']).fillna(self.fallback)
          df.drop(columns=[f'{col}_group_mean'], inplace=True)
      return df

class LagAdder(BaseEstimator, TransformerMixin):
  """A transformer that enables autoregressive forecasting with non-sequential models.

  It creates lagged features during both training and prediction. During prediction,
  it performs a walk-forward validation, using the model's own previous predictions
  to generate lags for the current time step.
  """
  def __init__(self, model, lag_num:int = 2, date_col='DateDummy') -> None:
    super().__init__()
    self.lag_num = lag_num
    self.model = model
    self.na_imputer = GroupMeanImputer()
    self.history = {}
    self.date_col = date_col

  def fit(self, x:pd.DataFrame, y:pd.DataFrame):
    """Fits the internal model on data with historical lagged features."""
    x_ = x.copy()
    y_ = pd.DataFrame(y.copy())
    y_['Store'] = x_['Store']
    y_[self.date_col] = x_[self.date_col]
    y_['Dept'] = x_['Dept']

    # Create lag features from historical data
    for i in range(1, self.lag_num + 1):
        x_[f'lag_{i}'] = y_.groupby(['Store', 'Dept'])['Weekly_Sales'].shift(i)

    # Impute missing values created by the shift operation
    self.na_imputer.fit(x_)
    x_ = self.na_imputer.transform(x_)

    # Train the provided model
    self.model.fit(x_, y)

    # Store historical data for the walk-forward prediction process
    self.history = {}
    for date_val in sorted(x_[self.date_col].unique()):
        self.history[date_val] = y_[y_[self.date_col] == date_val][['Store', 'Dept', 'Weekly_Sales']]
    return self

  def predict(self, x:pd.DataFrame):
    """Generates predictions using a walk-forward, autoregressive approach."""
    predictions = pd.Series(index=x.index, dtype=float)
    x_ = x.copy()
    history = self.history.copy()

    # Iterate through each unique time step in the prediction set
    for date_val in sorted(x_[self.date_col].unique()):
        current_batch = x_[x_[self.date_col] == date_val].copy()
        original_indices = current_batch.index

        # Create lag features using historical data (from training or previous predictions)
        for i in range(1, self.lag_num + 1):
            if (date_val - i) in history:
                past_data = history[date_val - i]
                current_batch = pd.merge(current_batch, past_data.rename(columns={'Weekly_Sales': f'lag_{i}'}), on=['Store', 'Dept'], how='left')
            else:
                current_batch[f'lag_{i}'] = np.nan

        # Impute any missing lags (e.g., for new store/dept combinations)
        current_batch = self.na_imputer.transform(current_batch)
        assert current_batch.isna().sum().sum() == 0, "NaNs found after imputation"

        # Make predictions for the current time step
        preds = self.model.predict(current_batch)
        predictions.loc[original_indices] = preds

        # Update history with the new predictions for the next iteration
        new_history = current_batch[['Store', 'Dept']].copy()
        new_history['Weekly_Sales'] = preds
        history[date_val] = new_history

    return predictions

In [18]:
# Define the base XGBoost model with different hyperparameters for this experiment
xgb_for_lags = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.01,
    max_depth=7,
    reg_lambda=1000, # Increased regularization
    objective='reg:squarederror',
    random_state=42,
)

# Set up the full pipeline including the LagAdder
lag_pipeline = Pipeline([
    ('adder', FeatureAdder(
        add_dummy_date=True,
        start_date=pd.Timestamp('2010-02-05'),
        replace_time_index=False # Keep original date for merging
    )),
    ('imputer', NaImputer()),
    ('cat2num', Cat2Num()),
    ('model', LagAdder(model=xgb_for_lags, lag_num=2))
])

# Train the autoregressive model
lag_model = lag_pipeline.fit(x_train, y_train)

# Evaluate on training and validation data
y_train_predict_lag = lag_model.predict(x_train)
y_val_predict_lag = lag_model.predict(x_val)

train_score_lag = wmae(y_train, y_train_predict_lag, x_train['IsHoliday'].to_list())
val_score_lag = wmae(y_val, y_val_predict_lag, x_val['IsHoliday'].to_list())

print(f"Autoregressive Train WMAE: {train_score_lag:.4f}")
print(f"Autoregressive Val WMAE: {val_score_lag:.4f}")

Autoregressive Train WMAE: 6052.5642
Autoregressive Val WMAE: 5867.1540




### 4.4 Experiment Conclusion

The experiments show that the standard XGBoost model with rich time-based features (**Validation WMAE: ~2849**) significantly outperforms the autoregressive approach (**Validation WMAE: ~5867**).

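The gap suggests that, for this dataset, the explicit time-based and holiday features carry a stronger signal than the two most recent sales values. Error propagation in the walk-forward predictions is a likely contributor, since each forecast feeds the lag features of the next step. The two runs also use different hyperparameters (the autoregressive model uses `learning_rate=0.01` and `reg_lambda=1000`), so the comparison is not strictly like-for-like. We therefore proceed with the non-autoregressive model as our final choice.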

## 5. Final Model Training and Logging

In this final section, we take our best-performing model configuration, train it on the entire dataset, and log the model artifact and its associated metadata to Weights & Biases for versioning and deployment.

In [19]:
# Re-create the pipeline with the best hyperparameters found during our search
best_xgb_model = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=7,
    reg_lambda=3,
    min_split_loss=100,
    objective='reg:squarederror',
    random_state=42
)

final_pipeline = Pipeline([
    ('adder', FeatureAdder()),
    ('imputer', NaImputer()),
    ('cat2num', Cat2Num()),
    ('model', best_xgb_model)
])

# Fit the pipeline on the original train/val split to get final performance metrics for logging
final_pipeline.fit(x_train, y_train)

final_train_preds = final_pipeline.predict(x_train)
final_val_preds = final_pipeline.predict(x_val)

final_train_wmae = wmae(y_train, final_train_preds, x_train['IsHoliday'])
final_val_wmae = wmae(y_val, final_val_preds, x_val['IsHoliday'])

print(f"Final Model Train WMAE: {final_train_wmae:.4f}")
print(f"Final Model Validation WMAE: {final_val_wmae:.4f}")

Final Model Train WMAE: 1578.8451
Final Model Validation WMAE: 2848.6948


In [20]:
# Log in to Weights & Biases
! wandb login

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mlchik22[0m ([33mlchik22-free-uni[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [22]:
import wandb
import joblib

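# Prepare the full dataset for final training.
# Note: unlike x_train/x_val, this frame has no 'DateDummy' column (it is added
# inside TimeSeriesSplit.split), so the retrained model sees one feature fewer
# than the model whose metrics are logged below.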
y = merged_train['Weekly_Sales'].copy()
x = merged_train.drop(columns=['Weekly_Sales'])

# Train the final model on all available data
final_model_for_prod = final_pipeline.fit(x, y)

# Save the trained pipeline to a file
model_filename = "xgb_final_pipeline.pkl"
joblib.dump(final_model_for_prod, model_filename)

# Initialize a new W&B run
wandb.init(project="Walmart-Store-Sales-Forecasting", name="xgboost-final-model-run")

# Log model configuration and hyperparameters
wandb.config.update({
    'model_type': 'XGBoost',
    'feature_engineering': 'Time-based features, holiday flags, Fourier terms',
    'data_merges': {
        'merge1': 'train, stores, how=left, on=Store',
        'merge2': 'merged, features, how=left, on=[Store, Date, IsHoliday]'
    },
    'score_metric': 'WMAE',
    'score_policy': {
        'holiday_weight': 5,
        'non_holiday_weight': 1
    },
    'hyperparameters': {
        'n_estimators': best_xgb_model.n_estimators,
        'learning_rate': best_xgb_model.learning_rate,
        'max_depth': best_xgb_model.max_depth,
        'reg_lambda': best_xgb_model.reg_lambda,
        'objective': best_xgb_model.objective
    }
})

# Log the performance metrics obtained before the final retrain
wandb.log({
    'train_wmae': final_train_wmae,
    'val_wmae': final_val_wmae
})

# Create and log the model artifact
artifact = wandb.Artifact(
    name="xgb-final-pipeline",
    type="model",
    description="Final XGBoost pipeline with feature engineering, trained on the full dataset."
)
artifact.add_file(model_filename)
wandb.log_artifact(artifact)

wandb.finish()

metric,value
train_wmae,1578.84506
val_wmae,2848.6948
