In [1]:
!ls data/raw

holidays_events.csv stores.csv          train.csv
oil.csv             test.csv


Below is an abbreviated run-through of an ML pipeline designed for the Kaggle store-sales competition.

NOTES:
- Only a small sample of the dataset is included so the pipeline can run. To run the pipeline with the full dataset, download and unzip the data from the Kaggle competition (https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data), then move it to 'data/raw'.
- For the full runs, different arguments were passed to the tuning and training scripts than those shown below (e.g., '--valset_size 15' for the tune_model.py script instead of the smaller value used here).
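
The download step in the notes above could be scripted roughly as follows. This is a minimal sketch, not part of the pipeline: the function name `extract_raw_data` is hypothetical, and the expected file names are taken from the `ls data/raw` listing at the top of this notebook.

```python
import zipfile
from pathlib import Path

# File names taken from the `ls data/raw` listing in this notebook.
EXPECTED_FILES = {
    "holidays_events.csv", "oil.csv", "stores.csv", "test.csv", "train.csv",
}

def extract_raw_data(archive_path, raw_dir="data/raw"):
    """Extract the expected CSVs from a downloaded archive into raw_dir.

    Returns the sorted list of expected files that the archive did not contain.
    (Hypothetical helper; the pipeline itself does not ship this function.)
    """
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        available = set(archive.namelist())
        # Only pull out the files the pipeline is known to read.
        for name in sorted(EXPECTED_FILES & available):
            archive.extract(name, path=raw_dir)
    return sorted(EXPECTED_FILES - available)
```

Calling `extract_raw_data("path/to/downloaded.zip")` (the archive path is whatever your download produced) returns an empty list when every expected file was found.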

# Clean/Process Raw Data

In [2]:
!python scripts/process_data.py

Running pipeline...
Processing 'train'/'test' -> 'main'...
Saving './data/clean/main.parquet'...
Saving './data/clean/main_cat_meta.json'...
Processing 'stores'...
Saving './data/clean/stores.parquet'...
Saving './data/clean/stores_cat_meta.json'...
Processing 'oil'...
Saving './data/clean/oil.parquet'...
Saving './data/clean/oil_cat_meta.json'...
Processing 'holidays_events'...
Saving './data/clean/holidays_events.parquet'...
Saving './data/clean/holidays_events_cat_meta.json'...
Computing rolling stats using 'main' and 'stores'...
Rolling stats for group '['store_nbr']', window '1'
Saving './data/clean/rolling_wrt_store_nbr_lag16_window1.parquet'...
Saving './data/clean/rolling_wrt_store_nbr_lag16_window1_cat_meta.json'...
Rolling stats for group '['store_nbr']', window '7'
Saving './data/clean/rolling_wrt_store_nbr_lag16_window7.parquet'...
Saving './data/clean/rolling_wrt_store_nbr_lag16_window7_cat_meta.json'...
Rolling stats for group '['store_nbr']', window '28'
Saving './data/c

In [3]:
!ls data

clean raw


In [4]:
!ls data/raw

holidays_events.csv stores.csv          train.csv
oil.csv             test.csv


In [5]:
!ls data/clean

holidays_events.parquet
holidays_events_cat_meta.json
main.parquet
main_cat_meta.json
manifest.json
oil.parquet
oil_cat_meta.json
oil_cat_meta.json
rolling_wrt_city_lag16_window1.parquet
rolling_wrt_city_lag16_window1_cat_meta.json
rolling_wrt_city_lag16_window28.parquet
rolling_wrt_city_lag16_window28_cat_meta.json
rolling_wrt_city_lag16_window365.parquet
rolling_wrt_city_lag16_window365_cat_meta.json
rolling_wrt_city_lag16_window7.parquet
rolling_wrt_city_lag16_window7_cat_meta.json
rolling_wrt_city_lag16_window91.parquet
rolling_wrt_city_lag16_window91_cat_meta.json
rolling_wrt_cluster_lag16_window1.parquet
rolling_wrt_cluster_lag16_window1_cat_meta.json
rolling_wrt_cluster_lag16_window28.parquet
rolling_wrt_cluster_lag16_window28_cat_meta.json
rolling_wrt_cluster_lag16_window365.parquet
rolling_wrt_cluster_lag16_window365_cat_meta.json
rolling_wrt_cluster_lag16_window7.parquet
rolling_wrt_cluster_lag16_window7_cat_meta.json
rolling_wrt_cluster_lag16_window91.parquet
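
The file names above suggest that each rolling-stats table is keyed by a grouping column (`store_nbr`, `city`, `cluster`), a lag of 16 days, and a window length. The sketch below shows one plausible way such a feature could be computed with pandas. It is an assumption about what `process_data.py` does, not its actual code: the function name `lagged_rolling_mean`, the `date` and `sales` column names, and the choice of a mean as the statistic are all illustrative.

```python
import pandas as pd

def lagged_rolling_mean(df, group_cols, lag=16, window=7):
    """Per-group rolling mean of daily sales, shifted `lag` rows into the past.

    Assumes `df` has `date` and `sales` columns and, after aggregation,
    one row per (group, date) with no missing dates inside a group.
    """
    # Aggregate to one row per group and day, in chronological order.
    daily = (
        df.groupby(group_cols + ["date"], as_index=False)["sales"].sum()
          .sort_values(group_cols + ["date"])
    )
    # Shift first so the window only sees values at least `lag` days old,
    # which keeps the feature available for every day of a `lag`-day forecast.
    daily["rolling_mean"] = daily.groupby(group_cols)["sales"].transform(
        lambda s: s.shift(lag).rolling(window, min_periods=1).mean()
    )
    return daily
```

For example, `lagged_rolling_mean(df, ["store_nbr"], lag=16, window=7)` would correspond to the naming pattern `rolling_wrt_store_nbr_lag16_window7`, under the assumptions stated above.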


# Tune Model

In [6]:
!pip install optuna -q
!pip install mlflow -q

In [8]:
# Model tuning
!python scripts/tune_model.py --n_trials 2 --n_backtests 2 --valset_size 1 --n_jobs 1

Loading training data...
Locating 'main data' chunk...
Locating 'secondary_data' chunks...: 100%|████████| 3/3 [00:00<00:00, 10.07it/s]
Locating 'rolling_stats' chunks...: 100%|███████| 30/30 [00:02<00:00, 10.79it/s]
Loading training data into memory...
Loading experiment config from 'experiment_configs.xgb'...
[I 2025-11-14 19:17:23,205] Using an existing study with name 'xgb' instead of creating a new one.
#### Backtesting (2 folds) ####
 * Fold 1 of 2 complete (loss: 1.003)
 * Fold 2 of 2 complete (loss: 0.982)
 * MEAN LOSS ACROSS FOLDS: 0.993
[I 2025-11-14 19:17:49,050] Trial 8 finished with value: 0.9925656914710999 and parameters: {'n_estimators': 238, 'max_depth': 8, 'learning_rate': 0.0060198894390003045, 'subsample': 0.9832227457906005, 'colsample_bytree': 0.9269019475602449, 'reg_lambda': 1.7397907254419698, 'gamma': 3.7021684672366844, 'min_child_weight': 2}. Best is trial 5 with value: 0.48451660573482513.
#### Backtesting (2 folds) ####
 * Fold 1 
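
The log above reports one loss per backtest fold and then their mean, which becomes the trial value. One common way to build such folds for time series is with rolling-origin splits, sketched below. How `tune_model.py` actually derives its folds is not shown in this notebook, so the function `backtest_folds` and its interpretation of `--n_backtests` and `--valset_size` (as a number of days) are assumptions.

```python
from datetime import date, timedelta

def backtest_folds(last_day, n_backtests, valset_size):
    """Return (train_end, val_start, val_end) date triples, oldest fold first.

    Each validation window covers `valset_size` consecutive days, the windows
    do not overlap, and the most recent window ends on `last_day`. Training
    data for a fold is everything up to and including `train_end`.
    """
    folds = []
    for k in range(n_backtests - 1, -1, -1):
        val_end = last_day - timedelta(days=k * valset_size)
        val_start = val_end - timedelta(days=valset_size - 1)
        train_end = val_start - timedelta(days=1)
        folds.append((train_end, val_start, val_end))
    return folds
```

Because every fold trains only on dates before its validation window, no future information leaks into the model being scored.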

# Fit Best Model

In [9]:
!python scripts/train_best.py --n_iter 2

Loading experiment config from 'experiment_configs.xgb'...

--- Training using following trial.... ---
Best trial number: 5
Best value (objective/loss): 0.48451660573482513
Best hyperparameters:
 * seed: 42
 * objective: reg:squarederror
 * eval_metric: rmse
 * tree_method: hist
 * enable_categorical: True
 * device: cpu
 * max_bin: 256
 * early_stopping_rounds: 100
 * n_estimators: 1718
 * max_depth: 6
 * learning_rate: 0.12415925401304694
 * subsample: 0.6416575267232185
 * colsample_bytree: 0.772218999816706
 * reg_lambda: 4.187195555255359
 * gamma: 0.5874315395322982
 * min_child_weight: 5

Training 1 models

 -- SEED 0 MODEL --

 -- Training Iteration 1/2 (sampling 10.00% of data) --
Locating 'main data' chunk...
Locating 'secondary_data' chunks...: 100%|████████| 3/3 [00:00<00:00, 12.08it/s]
Locating 'rolling_stats' chunks...: 100%|███████| 30/30 [00:02<00:00, 11.67it/s]
Loading chunk into memory...
Splitting train/test...
Training model on chunk...
Loss on chunk: 0.423418879508

# Make Submission

In [10]:
!python scripts/make_submission.py

Loading experiment config from 'experiment_configs.xgb'...
Locating 'main data' chunk...
Locating 'secondary_data' chunks...: 100%|████████| 3/3 [00:00<00:00,  5.23it/s]
Locating 'rolling_stats' chunks...: 100%|███████| 30/30 [00:03<00:00,  9.11it/s]
Loading in xgb_model_0.joblib...
Making predictions...
Making submission...
Saving submission to './submissions/xgb_submission_0.csv'...
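
For reference, the final step of writing a submission file can be sketched as below. This is not the code in `make_submission.py`: the helper `write_submission` is hypothetical, it assumes the competition's `id,sales` column layout, and clipping negative predictions to zero is an added safeguard rather than something the log confirms.

```python
import csv

def write_submission(ids, predictions, path):
    """Write an `id,sales` CSV, clipping negative predictions to zero."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "sales"])
        for row_id, pred in zip(ids, predictions):
            # Sales cannot be negative, so floor each prediction at zero.
            writer.writerow([row_id, max(0.0, float(pred))])
```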
