# FMCG Forecasting: Modeling Pipeline (Weekly vs Daily)

This notebook compares two modeling strategies for FMCG sales forecasting, a weekly baseline and a daily model, using SKU `MI-006` as the example.

## Objectives

### 1. **Load preprocessed features**  
- Daily and weekly features previously generated in `02_feature_engineering.py`.

### 2. **Train two models**
- **Baseline weekly model** trained on aggregated weekly features.
- **Daily model** trained on high-resolution daily features (with lags, momentum, etc.).

### 3. **Evaluate**
- Aggregate daily predictions to weekly level.
- Compare model accuracy using RMSE and MAE.

### 4. **Log and register the best model**
- Use MLflow to log model performance.
- Register the best-performing daily model in Databricks Model Registry.

### 5. **Export predictions**
- Save daily predictions in both Parquet and CSV formats for downstream use (dashboards, scoring jobs, etc.).

---

> This notebook simulates a typical modeling pipeline in FMCG and retail forecasting scenarios. It allows teams to benchmark different modeling granularities before going into production.


## Imports & paths

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

from utils.feature_engineering_utils import create_weekly_features
from utils.forecasting_utils import (
    time_split,
    engineer_features_daily,
    aggregate_to_week,
    log_model_with_metrics,
    train_model
)

path_weekly = "dbfs:/FileStore/fmcg/delta/weekly_features"
path_daily = "dbfs:/FileStore/fmcg/parquet/FMCG_2022_2024.parquet"


In [0]:
import mlflow
user = spark.sql("SELECT current_user()").collect()[0][0]
experiment_path = f"/Users/{user}/mlruns_fmcg_forecasting"

mlflow.set_experiment(experiment_path)

## Load data

In [0]:
# Weekly
df_weekly = spark.read.format("delta").load(path_weekly).filter(col("sku") == "MI-006")

# Daily
df_daily = spark.read.parquet(path_daily).filter(col("sku") == "MI-006")


## Feature engineering for daily data

In [0]:
df_daily_fe = engineer_features_daily(df_daily)
df_daily_fe = df_daily_fe.dropna()
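
The body of `engineer_features_daily` lives in `utils.forecasting_utils` and is not shown here. As a rough, dependency-free sketch of the kind of features named later in the notebook (`lag_1`, `lag_2`, `rolling_mean_4`, `momentum`), the following may help. The function `add_lag_features` is hypothetical, and the definition of `momentum` (difference between the two most recent lags) is an assumption, not necessarily what the helper computes.

```python
from statistics import mean

def add_lag_features(units):
    """Return one feature dict per day; None marks features without enough history."""
    rows = []
    for i, y in enumerate(units):
        lag_1 = units[i - 1] if i >= 1 else None
        lag_2 = units[i - 2] if i >= 2 else None
        # Rolling mean over the 4 previous days only, so the current target never leaks in
        rolling_mean_4 = mean(units[i - 4:i]) if i >= 4 else None
        # Assumed definition: change between the two most recent lags
        momentum = lag_1 - lag_2 if i >= 2 else None
        rows.append({
            "units_sold": y,
            "lag_1": lag_1,
            "lag_2": lag_2,
            "rolling_mean_4": rolling_mean_4,
            "momentum": momentum,
        })
    return rows

daily_units = [10, 12, 11, 15, 14, 18]
feature_rows = add_lag_features(daily_units)

# Mirror dropna(): keep only rows where every feature is available
complete_rows = [r for r in feature_rows if None not in r.values()]
print(len(feature_rows), "rows before,", len(complete_rows), "rows after dropping incomplete ones")
```

The first rows of any series have no history for lags and rolling windows, which is why the cell above calls `dropna()` right after feature engineering.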

## Baseline model


In [0]:
# Baseline weekly model
from utils.forecasting_utils import train_model, time_split

# Features from the weekly dataset
features_weekly = ["lag_1", "lag_2", "rolling_mean_4", "momentum", "avg_by_channel_region"]
# Drop rows with missing lag or rolling values
df_weekly = df_weekly.dropna()
# Chronological 80/20 train/test split
train_w, test_w = time_split(df_weekly, date_col="week", split_ratio=0.8)

In [0]:
# Training
baseline_model, preds_baseline, rmse_base, mae_base = train_model(train_w, test_w, features_weekly, label="target_next_week")

In [0]:
log_model_with_metrics(baseline_model, rmse_base, mae_base, features_weekly, data_type="weekly")


## Train model on daily data


In [0]:
train_df, test_df = time_split(df_daily_fe, date_col="date", split_ratio=0.8)
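
`time_split` is imported from `utils.forecasting_utils` and its implementation is not shown. A minimal sketch of a chronological split, assuming it cuts at a date so that all training rows precede all test rows, could look like this. `chronological_split` is a hypothetical stand-in written in plain Python, not the helper itself.

```python
from datetime import date, timedelta

def chronological_split(rows, date_key="date", split_ratio=0.8):
    """Split rows so that every training date is strictly before every test date."""
    dates = sorted({r[date_key] for r in rows})
    # First date that belongs to the test set (assumes 0 < split_ratio < 1)
    cutoff = dates[int(len(dates) * split_ratio)]
    train = [r for r in rows if r[date_key] < cutoff]
    test = [r for r in rows if r[date_key] >= cutoff]
    return train, test

sample_rows = [
    {"date": date(2024, 1, 1) + timedelta(days=i), "units_sold": 10 + i}
    for i in range(10)
]
train_rows, test_rows = chronological_split(sample_rows)
print(len(train_rows), "train rows,", len(test_rows), "test rows")
```

Splitting by date rather than at random keeps future observations out of the training set, which matters because the lag features are built from past values.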

In [0]:
features = ["lag_1", "lag_2", "rolling_mean_4", "momentum", "avg_by_channel_region"]
model, predictions, rmse, mae = train_model(train_df, test_df, features, label="units_sold")


## Aggregate daily predictions to weekly


In [0]:
df_weekly_eval = aggregate_to_week(predictions, date_col="date", pred_col="prediction", true_col="units_sold")
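
The `aggregate_to_week` helper is not shown in this notebook. The sketch below illustrates one plausible behaviour: summing daily actuals and predictions per ISO year and week. The output keys `year`, `week`, `actual` and `predicted` match the columns used in the cells that follow, but the use of ISO weeks and of sums is an assumption, and `aggregate_daily_to_week` is a hypothetical name.

```python
from collections import defaultdict
from datetime import date

def aggregate_daily_to_week(rows):
    """Sum actuals and predictions per ISO (year, week)."""
    totals = defaultdict(lambda: {"actual": 0.0, "predicted": 0.0})
    for r in rows:
        iso = r["date"].isocalendar()
        key = (iso[0], iso[1])  # ISO year and ISO week number
        totals[key]["actual"] += r["units_sold"]
        totals[key]["predicted"] += r["prediction"]
    return [{"year": y, "week": w, **v} for (y, w), v in sorted(totals.items())]

daily_rows = [
    {"date": date(2024, 12, 28), "units_sold": 10, "prediction": 9.0},
    {"date": date(2024, 12, 29), "units_sold": 12, "prediction": 13.0},
    # 2024-12-30 is a Monday and already belongs to ISO week 1 of 2025
    {"date": date(2024, 12, 30), "units_sold": 8, "prediction": 7.5},
]
weekly_rows = aggregate_daily_to_week(daily_rows)
for row in weekly_rows:
    print(row)
```

Keeping the year next to the week number avoids merging week 1 of one year with week 1 of the next when the test period crosses a year boundary.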


## Evaluate weekly RMSE/MAE


In [0]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Collect the small weekly evaluation frame to pandas for the scikit-learn metrics
weekly_pd = df_weekly_eval.toPandas()

# Take the square root explicitly: the `squared` argument is deprecated in recent scikit-learn releases
rmse_weekly = np.sqrt(mean_squared_error(weekly_pd["actual"], weekly_pd["predicted"]))
mae_weekly = mean_absolute_error(weekly_pd["actual"], weekly_pd["predicted"])
print(f"📊 Weekly RMSE: {rmse_weekly:.2f}")
print(f"📊 Weekly MAE:  {mae_weekly:.2f}")
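
As a sanity check on the two metrics, here is a small worked example in plain Python with made-up weekly totals. The numbers are illustrative only and are not results from this notebook, and `rmse_score` and `mae_score` are hypothetical helper names.

```python
import math

def rmse_score(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae_score(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative weekly totals (not real data)
example_actual = [100, 120, 90, 110]
example_predicted = [98, 125, 90, 100]

example_rmse = rmse_score(example_actual, example_predicted)
example_mae = mae_score(example_actual, example_predicted)
print(f"RMSE: {example_rmse:.2f}, MAE: {example_mae:.2f}")
```

The errors here are 2, 5, 0 and 10 units, so MAE is 17 / 4 = 4.25, while RMSE is the square root of 129 / 4 = 32.25, roughly 5.68. RMSE comes out higher because squaring gives the single 10-unit miss more weight.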


## Visualization

In [0]:
import matplotlib.pyplot as plt

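weekly_pd = weekly_pd.sort_values(["year", "week"])
# Combine year and week so that weeks from different years do not overlap on the x-axis
week_labels = weekly_pd["year"].astype(str) + "-W" + weekly_pd["week"].astype(str).str.zfill(2)
plt.figure(figsize=(12,5))
plt.plot(week_labels, weekly_pd["actual"], label="Actual", marker="o")
plt.plot(week_labels, weekly_pd["predicted"], label="Predicted", marker="o", linestyle="--")
plt.xticks(rotation=45)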
plt.title("Weekly Forecast – Aggregated from Daily RF")
plt.xlabel("Week")
plt.ylabel("Units Sold")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()


## Log model to MLflow

In [0]:
import mlflow
import mlflow.sklearn

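# Register the daily model trained above (`model`, `rmse`, `mae`, `features`).
# mlflow.sklearn assumes train_model returns a scikit-learn estimator;
# a Spark ML model would need mlflow.spark.log_model instead.
with mlflow.start_run(run_name="Register final daily model") as run:
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="fmcg_rf_daily_model"
    )
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("mae", mae)
    mlflow.log_param("features", features)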

    print(f"✅ Model registered in Model Registry: {run.info.run_id}")


In [0]:
log_model_with_metrics(model, rmse, mae, features)


## Save daily predictions from final model

In [0]:

df_final_pred = predictions.select("sku", "date", "channel", "region", "units_sold", "prediction") \
                           .withColumnRenamed("prediction", "units_sold_pred")

# set saving path
latest_date = df_final_pred.agg({"date": "max"}).collect()[0][0]
formatted_date = latest_date.strftime("%Y_%m_%d")

output_path = f"dbfs:/FileStore/fmcg/predictions/final_daily_preds_{formatted_date}.parquet"

# save
df_final_pred.write.mode("overwrite").parquet(output_path)

print(f"✅ Saved final daily predictions to: {output_path}")


In [0]:
# Export to CSV via pandas (collects all rows to the driver, so keep this to small result sets)
df_final_pred.toPandas().to_csv(f"/dbfs/FileStore/fmcg/predictions/final_daily_preds_{formatted_date}.csv", index=False)
