Link: https://colab.research.google.com/github/Philst4/Store-Sales/blob/temp/pipeline_colab.ipynb

# Setup

In [1]:
# Clone repo
!git clone -b temp https://github.com/Philst4/Store-Sales.git

Cloning into 'Store-Sales'...
remote: Enumerating objects: 612, done.
remote: Counting objects: 100% (86/86), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 612 (delta 27), reused 67 (delta 21), pack-reused 526 (from 1)
Receiving objects: 100% (612/612), 10.26 MiB | 4.14 MiB/s, done.
Resolving deltas: 100% (352/352), done.


In [2]:
# Go to root of project
%cd Store-Sales

/Users/idk/Desktop/DesktopFolder/Programming Projects/Store-Sales/Store-Sales


In [3]:
!ls data/raw

holidays_events.csv stores.csv          train.csv
oil.csv             test.csv


The cells below show an abbreviated run-through of an ML pipeline designed for the Kaggle Store Sales competition.

NOTES:
- Only a small sample of the dataset is included so that the pipeline can run. To run the pipeline on the full dataset, download and unzip the data from the Kaggle competition (https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data), then move it to 'data/raw'.
- The tuning and training scripts below are passed different arguments than in the full run (e.g., the full run used '--valset_size 15' for the tune_model.py script, not the smaller value shown here).
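
Before running the pipeline on the full dataset, it can help to confirm that the raw files landed in the right place. The sketch below is not part of the repository; the function name `missing_raw_files` is hypothetical, and the expected file names are taken from the `ls data/raw` listing above.

```python
from pathlib import Path

# File names as they appear in the `ls data/raw` output above.
EXPECTED_RAW_FILES = [
    "holidays_events.csv",
    "oil.csv",
    "stores.csv",
    "test.csv",
    "train.csv",
]


def missing_raw_files(raw_dir="data/raw"):
    """Return the expected raw CSVs that are not present in `raw_dir`."""
    raw_path = Path(raw_dir)
    return [name for name in EXPECTED_RAW_FILES if not (raw_path / name).is_file()]
```

Calling `missing_raw_files()` from the project root returns an empty list when all five files are in place.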

# Clean/Process Raw Data

In [4]:
!python scripts/process_data.py

Running pipeline...
Processing 'train'/'test' -> 'main'...
Saving './data/clean/main.parquet'...
Saving './data/clean/main_cat_meta.json'...
Processing 'stores'...
Saving './data/clean/stores.parquet'...
Saving './data/clean/stores_cat_meta.json'...
Processing 'oil'...
Saving './data/clean/oil.parquet'...
Saving './data/clean/oil_cat_meta.json'...
Processing 'holidays_events'...
Saving './data/clean/holidays_events.parquet'...
Saving './data/clean/holidays_events_cat_meta.json'...
Computing rolling stats using 'main' and 'stores'...
Rolling stats for group '['store_nbr']', window '1'
Saving './data/clean/rolling_wrt_store_nbr_lag16_window1.parquet'...
Saving './data/clean/rolling_wrt_store_nbr_lag16_window1_cat_meta.json'...
Rolling stats for group '['store_nbr']', window '7'
Saving './data/clean/rolling_wrt_store_nbr_lag16_window7.parquet'...
Saving './data/clean/rolling_wrt_store_nbr_lag16_window7_cat_meta.json'...
Rolling stats for group '['store_nbr']', window '28'
Saving './data/c
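
The file names in the log above (e.g., 'rolling_wrt_store_nbr_lag16_window7') suggest rolling statistics computed per group, over a fixed window, and shifted back by a lag of 16. How process_data.py actually computes them is not shown here. A minimal pandas sketch of the idea follows, assuming a 'date' and a 'sales' column, one row per series per day with no missing days, and the mean as the statistic; the function name and output column name are hypothetical.

```python
import pandas as pd


def lagged_rolling_mean(df, group_cols, lag=16, window=7, value_col="sales"):
    """Per-group rolling mean of `value_col`, shifted back by `lag` rows.

    Rows are first summed per group and date, so each group has one row per
    date. The shift is row-based, so it equals a shift in days only if every
    group has a row for every date (an assumption of this sketch).
    """
    daily = (
        df.groupby(group_cols + ["date"], as_index=False)[value_col]
        .sum()
        .sort_values(group_cols + ["date"])
    )
    col = f"{value_col}_lag{lag}_window{window}_mean"
    # Shift first so the feature for a given date only uses values that are
    # at least `lag` rows old, then average over the trailing window.
    daily[col] = daily.groupby(group_cols)[value_col].transform(
        lambda s: s.shift(lag).rolling(window, min_periods=1).mean()
    )
    return daily
```

Shifting before rolling keeps the feature free of information from the most recent `lag` days, which matters when the forecast horizon is that long.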

In [5]:
!ls data

clean raw


In [6]:
!ls data/raw

holidays_events.csv stores.csv          train.csv
oil.csv             test.csv


In [7]:
!ls data/clean

holidays_events.parquet
holidays_events_cat_meta.json
main.parquet
main_cat_meta.json
manifest.json
oil.parquet
oil_cat_meta.json
rolling_wrt_city_lag16_window1.parquet
rolling_wrt_city_lag16_window1_cat_meta.json
rolling_wrt_city_lag16_window28.parquet
rolling_wrt_city_lag16_window28_cat_meta.json
rolling_wrt_city_lag16_window365.parquet
rolling_wrt_city_lag16_window365_cat_meta.json
rolling_wrt_city_lag16_window7.parquet
rolling_wrt_city_lag16_window7_cat_meta.json
rolling_wrt_city_lag16_window91.parquet
rolling_wrt_city_lag16_window91_cat_meta.json
rolling_wrt_cluster_lag16_window1.parquet
rolling_wrt_cluster_lag16_window1_cat_meta.json
rolling_wrt_cluster_lag16_window28.parquet
rolling_wrt_cluster_lag16_window28_cat_meta.json
rolling_wrt_cluster_lag16_window365.parquet
rolling_wrt_cluster_lag16_window365_cat_meta.json
rolling_wrt_cluster_lag16_window7.parquet
rolling_wrt_cluster_lag16_window7_cat_meta.json
rolling_wrt_cluster_lag16_window91.parquet


# Tune Model

In [8]:
!pip install optuna -q
!pip install mlflow -q

In [9]:
# Model tuning
!python scripts/tune_model.py --n_trials 2 --n_backtests 2 --valset_size 1 --n_jobs 1

Loading training data...
Locating 'main data' chunk...
Locating 'secondary_data' chunks...: 100%|████████| 3/3 [00:00<00:00, 12.69it/s]
Locating 'rolling_stats' chunks...: 100%|███████| 30/30 [00:02<00:00, 11.65it/s]
Loading training data into memory...
Loading experiment config from 'experiment_configs.xgb'...
2025/11/14 19:48:41 INFO mlflow.tracking.fluent: Experiment with name 'xgb' does not exist. Creating a new experiment.
[I 2025-11-14 19:48:45,570] A new study created in RDB with name: xgb
#### Backtesting (2 folds) ####
 * Fold 1 of 2 complete (loss: 0.574)
 * Fold 2 of 2 complete (loss: 0.567)
 * MEAN LOSS ACROSS FOLDS: 0.571
[I 2025-11-14 19:52:32,806] Trial 0 finished with value: 0.5706499218940735 and parameters: {'n_estimators': 2773, 'max_depth': 2, 'learning_rate': 0.020209366911814818, 'subsample': 0.7005737815095165, 'colsample_bytree': 0.5634895607980901, 'reg_lambda': 6.822317654411166, 'gamma': 0.1415984735053405, 'min_child_weight': 3}. Best i
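
The tuning run above reports a loss per backtest fold and the mean across folds. How tune_model.py builds the folds is not shown. One common scheme uses an expanding training window with a fixed-size validation window that slides toward the end of the data. A minimal sketch, assuming '--valset_size' counts distinct dates; the function name and that interpretation of the arguments are hypothetical.

```python
def backtest_folds(dates, n_backtests=2, valset_size=1):
    """Yield (train_dates, val_dates) pairs, oldest validation window first.

    Each validation window holds `valset_size` distinct dates; the training
    set is every date strictly before that window (expanding window).
    """
    unique_dates = sorted(set(dates))
    for fold in range(n_backtests, 0, -1):
        # The last fold (fold == 1) validates on the most recent dates.
        val_end = len(unique_dates) - (fold - 1) * valset_size
        val_start = val_end - valset_size
        if val_start <= 0:
            raise ValueError("Not enough dates for the requested folds")
        yield unique_dates[:val_start], unique_dates[val_start:val_end]
```

The trial's objective value would then be the mean of the per-fold losses, as in the 'MEAN LOSS ACROSS FOLDS' line of the log.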

# Fit Best Model

In [10]:
!python scripts/train_best.py --n_iter 2

Loading experiment config from 'experiment_configs.xgb'...

--- Training using following trial.... ---
Best trial number: 1
Best value (objective/loss): 0.5342710614204407
Best hyperparameters:
 * seed: 42
 * objective: reg:squarederror
 * eval_metric: rmse
 * tree_method: hist
 * enable_categorical: True
 * device: cpu
 * max_bin: 256
 * early_stopping_rounds: 100
 * n_estimators: 2578
 * max_depth: 8
 * learning_rate: 0.8456062593612776
 * subsample: 0.8667079378836515
 * colsample_bytree: 0.9842399805082203
 * reg_lambda: 6.910087659170744
 * gamma: 1.7579566077445103
 * min_child_weight: 10

Training 1 models

 -- SEED 0 MODEL --

 -- Training Iteration 1/2 (sampling 10.00% of data) --
Locating 'main data' chunk...
Locating 'secondary_data' chunks...: 100%|████████| 3/3 [00:01<00:00,  2.85it/s]
Locating 'rolling_stats' chunks...: 100%|███████| 30/30 [00:09<00:00,  3.23it/s]
Loading chunk into memory...
Splitting train/test...
Training model on chunk...
Loss on chunk: 0.584407389163

# Make Submission

In [11]:
!python scripts/make_submission.py

Loading experiment config from 'experiment_configs.xgb'...
Locating 'main data' chunk...
Locating 'secondary_data' chunks...: 100%|████████| 3/3 [00:00<00:00,  6.93it/s]
Locating 'rolling_stats' chunks...: 100%|███████| 30/30 [00:03<00:00,  8.72it/s]
Loading in xgb_model_0.joblib...
Making predictions...
Making submission...
Saving submission to './submissions/xgb_submission_0.csv'...
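
For reference, a submission file for this competition pairs each test row 'id' with a predicted 'sales' value. The internals of make_submission.py are not shown. The sketch below assumes the two-column 'id', 'sales' layout and clips negative predictions to zero, since sales cannot be negative; the function name is hypothetical.

```python
import numpy as np
import pandas as pd


def build_submission(ids, predictions):
    """Assemble an 'id'/'sales' frame, clipping negative predictions to zero."""
    sales = np.clip(np.asarray(predictions, dtype=float), 0.0, None)
    return pd.DataFrame({"id": list(ids), "sales": sales})
```

The resulting frame can be written with `to_csv(path, index=False)` so that no extra index column ends up in the file.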
