# Optional Retraining Job
This notebook retrains the Random Forest model for SKU **MI-006** using historical daily features from the Feature Store.
It simulates a production use case where the model is periodically refreshed due to potential data drift or evolving patterns.

## Imports

In [0]:
import pandas as pd
from databricks.feature_store import FeatureStoreClient
from pyspark.sql.functions import col
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import mlflow
import mlflow.sklearn
from datetime import datetime, timedelta

## Load data from Feature Store (last 365 days)


In [0]:
fs = FeatureStoreClient()

# Fixed "current" date for this simulation; a scheduled job would use the actual run date
today = datetime(2025, 1, 7)
start_date = today - timedelta(days=365)

df_spark = fs.read_table("fmcg_features_daily") \
    .filter((col("sku") == "MI-006") & (col("date") >= start_date.strftime("%Y-%m-%d")))
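
The cell above pins `today` to a fixed date. For a scheduled job, the window could instead be derived from the run date. The following is a minimal sketch, assuming a hypothetical helper named `training_window` (it is not part of the Feature Store API) that returns the window bounds as `YYYY-MM-DD` strings:

```python
from datetime import date, timedelta


def training_window(run_date: date, lookback_days: int = 365) -> tuple[str, str]:
    """Return (start, end) ISO date strings for a lookback window ending at run_date.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    if lookback_days <= 0:
        raise ValueError("lookback_days must be positive")
    start = run_date - timedelta(days=lookback_days)
    return start.isoformat(), run_date.isoformat()


# Same reference date as the notebook cell above.
start_str, end_str = training_window(date(2025, 1, 7))
print(start_str, end_str)
```

The returned `start_str` could be passed to the same `col("date") >= ...` filter. This assumes the `date` column is stored as a date type or as an ISO-formatted string, so that the comparison orders rows correctly.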

## Prepare training data

In [0]:
# Convert to pandas (toPandas collects every row to the driver, so keep the filtered set small)
df = df_spark.toPandas()

#  Feature selection
features = [
    "lag_1", "lag_2", "rolling_mean_4", "rolling_std_4",
    "momentum", "avg_by_channel_region"
]

df = df.dropna(subset=features + ["units_sold"])
X = df[features]
y = df["units_sold"]

#  Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
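
The split above is random, so the test rows are interleaved in time with the training rows. With lag and rolling features, a time-ordered holdout is a common alternative because it evaluates the model only on dates later than anything it was trained on. The following is a minimal sketch, assuming a hypothetical helper `chronological_split` and toy data that is not taken from the Feature Store:

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, date_col: str = "date", test_size: float = 0.2):
    """Return (train, test) where test holds the most recent `test_size` fraction of rows.

    Hypothetical helper for illustration; not part of scikit-learn or the notebook.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_size))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Toy frame for illustration only; shuffled to show that the helper re-sorts by date.
toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "lag_1": range(10),
    "units_sold": range(1, 11),
})
train_df, test_df = chronological_split(toy.sample(frac=1.0, random_state=0))
print(len(train_df), len(test_df))
```

Applied to this notebook, `df` would be split this way before selecting `features` and `units_sold`, assuming the `date` column survives the conversion to pandas.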

## Retrain model and log to MLflow

In [0]:
with mlflow.start_run(run_name="retraining_rf_mi006") as run:
    
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    preds = rf.predict(X_test)
    # RMSE as the square root of MSE; the `squared` argument was removed in recent scikit-learn releases
    rmse = mean_squared_error(y_test, preds) ** 0.5
    
    mlflow.log_metric("rmse", rmse)
    mlflow.sklearn.log_model(rf, artifact_path="model")
    
    print(f"✅ Model retrained. Run ID: {run.info.run_id}")
    print(f"📦 Saved at: runs:/{run.info.run_id}/model")
    print(f"📉 RMSE: {rmse:.3f}")
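
An RMSE value is easier to interpret next to a simple reference forecast. The following is a minimal sketch that compares a model's RMSE with a naive forecast on toy numbers. The `rmse` helper and all values below are illustrative assumptions, not results from this notebook:

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values for illustration only.
actual = [10, 12, 11, 15]
naive_forecast = [9, 10, 12, 11]   # stand-in for "repeat the previous observation"
model_forecast = [10, 11, 12, 14]  # stand-in for model predictions

naive_rmse = rmse(actual, naive_forecast)
model_rmse = rmse(actual, model_forecast)
print(f"naive: {naive_rmse:.3f}  model: {model_rmse:.3f}")
```

In this notebook, a comparable baseline might be `rmse(y_test, X_test["lag_1"])`, assuming `lag_1` holds the previous period's `units_sold`. That value could also be logged with `mlflow.log_metric` alongside the model's RMSE.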

## ✅ Summary
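- Data: up to 365 days of historical daily features for SKU MI-006 (window ending 2025-01-07), split 80/20 into train and test sets
- Features used: 6 lag-based and engineered features (`lag_1`, `lag_2`, `rolling_mean_4`, `rolling_std_4`, `momentum`, `avg_by_channel_region`)
- Model: RandomForestRegressor (sklearn), 100 trees, evaluated by RMSE on the test set
- Logged to MLflow: each execution creates a new run that stores the `rmse` metric and the model artifact at `runs:/<run_id>/model`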
- This notebook simulates real-world retraining scenarios in retail/FMCG pipelines.