# 03 - Model Training

## Introduction

In this phase, we begin training forecasting models aimed at helping Corporación Favorita improve inventory management. Having already identified stockout patterns, promotion effectiveness, and sales dynamics at both the product and store levels, we now focus on predicting future demand. These predictions will serve as input for smarter stock replenishment strategies, reducing understocking and overstocking risks across locations.

We use Prophet, the open-source forecasting library originally released by Facebook. It suits daily retail data that may contain missing values, multiple seasonalities, and external regressors such as promotions or oil prices. Its forecasts decompose into trend, seasonality, and holiday components, which keeps them explainable for our business use case.
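As a reminder of what "trend and holiday components" means, Prophet's additive formulation can be sketched as below. The regressor term uses our own notation ($x_k$ and $\beta_k$ stand for the extra regressors and their coefficients), and it assumes the default additive mode.

```latex
y(t) = g(t) + s(t) + h(t) + \sum_{k} \beta_k \, x_k(t) + \varepsilon_t
```

Here $g(t)$ is the trend, $s(t)$ the seasonal component, $h(t)$ the holiday effect, and $\varepsilon_t$ the residual noise.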

--- 

## Objectives

- Train a forecasting model (Prophet) to predict daily sales for each store and product-family combination.
- Incorporate business-relevant features such as promotions, holidays, and oil prices to improve forecast accuracy.
- Use recent time windows to validate model performance against real-world behavior.
- Identify where improved stocking strategies could yield the most business value.


## 1. Import & Load Data

In [37]:
# --- Import necessary modules ---
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# --- Load prepared inventory dataset ---
df = pd.read_csv("../data/processed/inventory_prepared.csv", parse_dates=['date'])

# --- Preview ---
df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,transactions,dcoilwtico,city,state,type,cluster,is_holiday,rolling_avg_7,rolling_avg_14,sales_lag_1,sales_lag_7,sales_lag_14
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,0.0,93.14,Quito,Pichincha,D,13,1,0.0,0.0,,,
1,1782,2013-01-02,1,AUTOMOTIVE,2.0,0,2111.0,93.14,Quito,Pichincha,D,13,0,1.0,1.0,0.0,,
2,3564,2013-01-03,1,AUTOMOTIVE,3.0,0,1833.0,92.97,Quito,Pichincha,D,13,0,1.666667,1.666667,2.0,,
3,5346,2013-01-04,1,AUTOMOTIVE,3.0,0,1863.0,93.12,Quito,Pichincha,D,13,0,2.0,2.0,3.0,,
4,7128,2013-01-05,1,AUTOMOTIVE,5.0,0,1509.0,93.12,Quito,Pichincha,D,13,1,2.6,2.6,3.0,,


## 2. Prepare Data for Prophet

### 2.1. Load and Prepare Dataset

Load the cleaned and feature-engineered dataset (`inventory_prepared.csv`).

In [38]:
# Load prepared dataset
df = pd.read_csv("../data/processed/inventory_prepared.csv", parse_dates=["date"])

We format the dataset to match Prophet's required structure: `ds` for dates and `y` for target values. Additional regressors (such as promotions or holidays) let the model capture external influences on sales. We also drop any rows with missing values in the target or in the lag and rolling-average features.

In [39]:
# Add the column names Prophet expects (the original columns are kept)
df['y'] = df['sales']
df['ds'] = df['date']

# Drop rows with a missing target or missing lag / rolling-average features
df = df.dropna(subset=['y', 'rolling_avg_7', 'sales_lag_1', 'sales_lag_7'])
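Prophet rejects NaN values in regressor columns, so it can help to check every column the model will use, not only the lag features. The sketch below is a minimal illustration on toy data; `check_prophet_frame` is a hypothetical helper of ours, not part of Prophet or of this notebook.

```python
import pandas as pd

# Hypothetical helper (an assumption for illustration, not a Prophet API).
def check_prophet_frame(frame, regressors):
    """Return a dict of problems found in a Prophet-style frame (empty if none)."""
    problems = {}
    required = ['ds', 'y'] + list(regressors)

    # Columns the model needs but the frame does not have
    missing_cols = [c for c in required if c not in frame.columns]
    if missing_cols:
        problems['missing_columns'] = missing_cols

    # NaN counts per required column that is present
    present = [c for c in required if c in frame.columns]
    nan_counts = frame[present].isna().sum()
    nan_counts = nan_counts[nan_counts > 0]
    if not nan_counts.empty:
        problems['nan_counts'] = nan_counts.to_dict()
    return problems

# Toy data standing in for one store-family series
toy = pd.DataFrame({
    'ds': pd.date_range('2013-01-01', periods=4, freq='D'),
    'y': [0.0, 2.0, 3.0, 3.0],
    'sales_lag_1': [None, 0.0, 2.0, 3.0],  # first lag is undefined
    'onpromotion': [0, 0, 1, 0],
})

report = check_prophet_frame(toy, ['sales_lag_1', 'onpromotion', 'dcoilwtico'])
print(report)
```

An empty dict means the frame has every required column and no missing values in them.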

### 2.2. Train Prophet Per Store-Family Combination
We iterate through each unique `(store_nbr, family)` pair, train a Prophet model with the key regressors (rolling averages, lags, promotions, etc.), and skip combinations with fewer than 100 rows.


In [40]:
import os

# Get unique store-family combinations
combinations = df[['store_nbr', 'family']].drop_duplicates()

# Prepare directory to save forecasts
os.makedirs("../forecasts", exist_ok=True)
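As an alternative to checking the row count inside the loop, the eligible pairs can be computed up front with a `groupby`. The sketch below uses toy data; `eligible_combinations` and `MIN_ROWS` are illustrative names of ours, with the threshold set to the same 100 rows used in the training loop.

```python
import pandas as pd

MIN_ROWS = 100  # assumed to mirror the loop's minimum-length check

def eligible_combinations(frame, min_rows=MIN_ROWS):
    """Return (store_nbr, family) pairs that have at least `min_rows` rows."""
    counts = frame.groupby(['store_nbr', 'family']).size()
    return counts[counts >= min_rows].index.tolist()

# Toy data: one long series and one that is too short to train on
toy = pd.DataFrame({
    'store_nbr': [1] * 120 + [2] * 30,
    'family': ['AUTOMOTIVE'] * 120 + ['BEAUTY'] * 30,
    'y': range(150),
})

pairs = eligible_combinations(toy)
print(pairs)
```

Knowing the number of eligible pairs in advance also gives the progress bar an accurate total.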

### 2.3. Forecast the Last 14 Observed Days

To avoid missing values in future regressors (like rolling averages and sales lags), we predict on the last 14 observed days for each (store, family) combination. These days already contain all the necessary features, so no imputation is needed.

This does not extend into future unseen dates. Because each model is fit on the full history, including these 14 days, the predictions are in-sample. Features such as `sales_lag_1` and `rolling_avg_7` are also derived from recent actual sales, which would not be available for truly future dates. The results therefore serve as a sanity check of how the model responds to known patterns, not as a measure of out-of-sample accuracy.
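A stricter variant of this check holds out the final 14 rows of each series before fitting, so the model never sees the days it is scored on. The sketch below shows only the split and the error metric on toy data. It assumes one row per day, and it uses a seasonal-naive stand-in instead of Prophet so that it runs without the library. `split_last_n_days` and `mean_absolute_error` are hypothetical helper names.

```python
import pandas as pd

def split_last_n_days(series_df, n_days=14):
    """Split one series into train (all but the last n_days rows) and holdout."""
    ordered = series_df.sort_values('ds')
    return ordered.iloc[:-n_days], ordered.iloc[-n_days:]

def mean_absolute_error(actual, predicted):
    """Mean absolute difference between two equally long sequences."""
    a = pd.Series(list(actual), dtype='float64')
    p = pd.Series(list(predicted), dtype='float64')
    return float((a - p).abs().mean())

# Toy series with a perfect weekly pattern
toy = pd.DataFrame({
    'ds': pd.date_range('2017-01-01', periods=60, freq='D'),
    'y': [float(i % 7) for i in range(60)],
})

train, holdout = split_last_n_days(toy, n_days=14)

# A real run would fit the model on `train` only and predict the holdout dates.
# Stand-in forecast: repeat the last observed week of the training data.
last_week = train['y'].tail(7).tolist()
naive_pred = [last_week[i % 7] for i in range(len(holdout))]

mae = mean_absolute_error(holdout['y'], naive_pred)
print(len(train), len(holdout), mae)
```

With lag and rolling features as regressors, the holdout rows would still carry values computed from actual sales, so a fully honest test would also need those features rebuilt from predictions.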


In [41]:
from tqdm import tqdm

# Number of most recent observed days to predict for each combination
forecast_days = 14
forecast_results = []

# Regressors passed to Prophet alongside ds and y
regressors = ['rolling_avg_7', 'sales_lag_1', 'sales_lag_7',
              'onpromotion', 'transactions', 'dcoilwtico', 'is_holiday']

# Loop through each store-family combo and train + forecast
for _, row in tqdm(combinations.iterrows(), total=len(combinations)):
    store = row['store_nbr']
    family = row['family']
    subset = df[(df['store_nbr'] == store) & (df['family'] == family)].sort_values('ds')

    if len(subset) < 100:  # Not enough data to train
        continue

    # Training data: the full history for this combination
    train_df = subset[['ds', 'y'] + regressors]

    # Fit model with one extra regressor per feature
    model = Prophet()
    for name in regressors:
        model.add_regressor(name)
    model.fit(train_df)

    # Prediction frame: the last observed rows, which already carry regressor values
    future = subset[['ds'] + regressors].tail(forecast_days)

    # Forecast
    forecast = model.predict(future)
    forecast['store_nbr'] = store
    forecast['family'] = family
    forecast['yhat'] = forecast['yhat'].clip(lower=0)  # No negative predictions
    forecast_results.append(forecast[['ds', 'store_nbr', 'family', 'yhat']])

  0%|          | 0/1782 [00:00<?, ?it/s]
04:44:55 - cmdstanpy - INFO - Chain [1] start processing
04:44:55 - cmdstanpy - INFO - Chain [1] done processing
  0%|          | 2/1782 [00:00<07:14,  4.10it/s]
04:44:56 - cmdstanpy - INFO - Chain [1] start processing
04:44:56 - cmdstanpy - INFO - Chain [1] done processing
...

### 2.4. Save Forecasts
All predictions are combined and saved into `forecast_results.csv`, making them ready for evaluation or dashboard integration.


In [42]:
# Combine all forecasts and save
all_forecasts = pd.concat(forecast_results)
all_forecasts.to_csv("../forecasts/forecast_results.csv", index=False)
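One possible evaluation step joins the saved forecasts back to actual sales and summarizes the error per family. The sketch below uses toy frames with the same key columns as `forecast_results.csv`. `wape_by_family` is a hypothetical helper, and weighted absolute percentage error (WAPE) is our choice of metric, not one the notebook prescribes.

```python
import pandas as pd

def wape_by_family(forecasts, actuals):
    """WAPE per family: sum(|y - yhat|) / sum(y) over matched rows."""
    merged = forecasts.merge(actuals, on=['ds', 'store_nbr', 'family'], how='inner')
    merged['abs_err'] = (merged['y'] - merged['yhat']).abs()
    totals = merged.groupby('family')[['abs_err', 'y']].sum()
    # Families whose actual sales sum to zero yield inf or NaN here
    return (totals['abs_err'] / totals['y']).rename('wape')

dates = list(pd.date_range('2017-08-14', periods=2, freq='D'))

# Toy forecasts and actuals for one store and two families
forecasts = pd.DataFrame({
    'ds': dates * 2,
    'store_nbr': [1] * 4,
    'family': ['AUTOMOTIVE'] * 2 + ['BEAUTY'] * 2,
    'yhat': [4.0, 6.0, 10.0, 10.0],
})
actuals = pd.DataFrame({
    'ds': dates * 2,
    'store_nbr': [1] * 4,
    'family': ['AUTOMOTIVE'] * 2 + ['BEAUTY'] * 2,
    'y': [5.0, 5.0, 8.0, 12.0],
})

result = wape_by_family(forecasts, actuals)
print(result)
```

WAPE weights errors by sales volume, so high-volume families dominate less than they would in a plain average of per-row percentage errors on sparse series.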