# Notebook 03 – Machine Learning Forecasting for SunnyBest

In this notebook, I build machine-learning forecasting models to predict **daily revenue** per store and per product category.

This extends the baseline models from Notebook 02 by introducing:

- Feature engineering  
- Categorical encoding  
- Time-based features  
- Gradient boosting models (XGBoost, LightGBM)  
- Proper time-series validation  

The goal is an accurate forecasting model that will later be deployed through an API and used by the GenAI assistant.


In [5]:
import pandas as pd
import numpy as np


### 1. Load and prepare merged dataset

In [2]:
df = pd.read_csv("../data/processed/sunnybest_merged_df.csv", parse_dates=["date"])

df.head()




| | date | store_id | product_id | units_sold | price | regular_price | discount_pct | promo_flag | promo_type | revenue | ... | store_type | store_size_store | year | month | day | day_of_week | is_weekend | is_holiday | is_payday | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 | 1 | 1001 | 0 | 445838.0 | 445838 | 0 | 0 | | 0.0 | ... | Mall | Large | 2021 | 1 | 1 | Friday | False | True | False | Dry |
| 1 | 2021-01-01 | 1 | 1002 | 2 | 500410.0 | 500410 | 0 | 0 | | 1000820.0 | ... | Mall | Large | 2021 | 1 | 1 | Friday | False | True | False | Dry |
| 2 | 2021-01-01 | 1 | 1003 | 2 | 399365.0 | 399365 | 0 | 0 | | 798730.0 | ... | Mall | Large | 2021 | 1 | 1 | Friday | False | True | False | Dry |
| 3 | 2021-01-01 | 1 | 1004 | 4 | 305796.0 | 305796 | 0 | 0 | | 1223184.0 | ... | Mall | Large | 2021 | 1 | 1 | Friday | False | True | False | Dry |
| 4 | 2021-01-01 | 1 | 1005 | 5 | 462752.0 | 462752 | 0 | 0 | | 2313760.0 | ... | Mall | Large | 2021 | 1 | 1 | Friday | False | True | False | Dry |


In [4]:
df.shape

(1227240, 37)

### 2. Selecting Forecasting Level

In [6]:
store_name = "SunnyBest Benin Main"
category = "Mobile Phones"

df_fc = df[(df["store_name"] == store_name) & (df["category"] == category)]
ts = df_fc.groupby("date")["revenue"].sum().reset_index().sort_values("date")
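
Since the stated aim is a forecast per store and per category, the filter-and-aggregate step above could be wrapped in a small reusable function. The sketch below is illustrative only: `build_series` is a hypothetical helper (not part of the notebook), and the toy frame merely mimics the `date`, `store_name`, `category`, and `revenue` columns used above.

```python
import pandas as pd

def build_series(df: pd.DataFrame, store_name: str, category: str) -> pd.DataFrame:
    """Daily total revenue for one store/category pair, sorted by date."""
    mask = (df["store_name"] == store_name) & (df["category"] == category)
    return (
        df.loc[mask]
        .groupby("date", as_index=False)["revenue"]
        .sum()
        .sort_values("date")
        .reset_index(drop=True)
    )

# Toy data (made up for illustration): two products in store "A" on the same
# day, one later day for store "A", and one row for an unrelated store "B".
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-02", "2021-01-01", "2021-01-01", "2021-01-01"]),
    "store_name": ["A", "A", "A", "B"],
    "category": ["Phones", "Phones", "Phones", "Phones"],
    "revenue": [5.0, 1.0, 2.0, 100.0],
})

series = build_series(toy, "A", "Phones")
print(series)
```

Calling the helper in a loop over store/category pairs would produce one series per pair, each with the same `date` and `revenue` columns as `ts`.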


In [7]:
ts

| | date | revenue |
|---|---|---|
| 0 | 2021-01-01 | 22549753.0 |
| 1 | 2021-01-02 | 19388064.0 |
| 2 | 2021-01-03 | 21155785.0 |
| 3 | 2021-01-04 | 17087610.0 |
| 4 | 2021-01-05 | 16450365.0 |
| ... | ... | ... |
| 1456 | 2024-12-27 | 20154354.0 |
| 1457 | 2024-12-28 | 26209820.0 |
| 1458 | 2024-12-29 | 25352579.0 |
| 1459 | 2024-12-30 | 21191442.0 |
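
Row-based lags such as `shift(7)` only mean "7 days earlier" when the series has exactly one row per calendar day, so it can be worth checking for gaps before engineering features. A minimal sketch of such a check, run on toy data; `missing_dates` is a hypothetical helper, not something defined in this notebook:

```python
import pandas as pd

def missing_dates(ts: pd.DataFrame, date_col: str = "date") -> pd.DatetimeIndex:
    """Return the calendar days between the first and last date that have no row."""
    full_range = pd.date_range(ts[date_col].min(), ts[date_col].max(), freq="D")
    return full_range.difference(pd.DatetimeIndex(ts[date_col]))

# Toy stand-in for `ts`; 2021-01-03 is deliberately absent.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04"]),
    "revenue": [10.0, 12.0, 9.0],
})

gaps = missing_dates(toy)
print(gaps)
```

If the check reports gaps, one option is to reindex onto the full daily range before computing lags. How to fill the new rows (for example with 0) depends on whether a missing day really means no sales, which is an assumption about the data rather than something the notebook states.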


### 3. Feature Engineering

In [8]:
# Calendar features derived from the date
ts["day"] = ts["date"].dt.day
ts["month"] = ts["date"].dt.month
ts["year"] = ts["date"].dt.year
ts["dayofweek"] = ts["date"].dt.dayofweek  # Monday=0, Sunday=6
ts["is_weekend"] = ts["dayofweek"].isin([5, 6]).astype(int)  # Saturday or Sunday


In [9]:
# Add lag features: revenue from 1, 7, 30, and 90 rows (days) earlier
ts["lag_1"] = ts["revenue"].shift(1)
ts["lag_7"] = ts["revenue"].shift(7)
ts["lag_30"] = ts["revenue"].shift(30)
ts["lag_90"] = ts["revenue"].shift(90)


In [11]:
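# Add rolling means over the previous 7 and 30 days.
# shift(1) excludes the current day's revenue, so the feature does not leak the target.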
ts["roll_7"] = ts["revenue"].shift(1).rolling(7).mean()
ts["roll_30"] = ts["revenue"].shift(1).rolling(30).mean()
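
To see why the `shift(1)` matters, the sketch below compares a rolling mean with and without the shift on a toy series (values 1 to 10, window of 3, both chosen only for illustration). Without the shift, the window ending at a given row includes that row's own value, which is the quantity being predicted.

```python
import pandas as pd

s = pd.Series(range(1, 11), dtype=float)  # toy "revenue": 1.0, 2.0, ..., 10.0

leaky = s.rolling(3).mean()          # window includes the current row
safe = s.shift(1).rolling(3).mean()  # window covers only the 3 previous rows

comparison = pd.DataFrame({"revenue": s, "leaky": leaky, "safe": safe})
print(comparison.head(6))

# At index 3 (revenue 4.0):
#   leaky = mean(2, 3, 4) = 3.0   (uses the value being predicted)
#   safe  = mean(1, 2, 3) = 2.0   (uses strictly earlier values)
```

The shifted version also needs one extra warm-up row: with a window of 3, its first three values are NaN instead of the first two.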


In [12]:
# Drop the warm-up rows whose lag/rolling features are NaN (the first 90 days, set by lag_90)
ts = ts.dropna().reset_index(drop=True)
ts.head()


| | date | revenue | day | month | year | dayofweek | is_weekend | lag_1 | lag_7 | lag_30 | lag_90 | roll_7 | roll_30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-04-01 | 16119181.0 | 1 | 4 | 2021 | 3 | 0 | 15262573.0 | 17885621.0 | 18873078.0 | 22549753.0 | 18204760.0 | 18340480.0 |
| 1 | 2021-04-02 | 16696229.0 | 2 | 4 | 2021 | 4 | 0 | 16119181.0 | 16547448.0 | 17506700.0 | 19388064.0 | 17952410.0 | 18248680.0 |
| 2 | 2021-04-03 | 21325720.0 | 3 | 4 | 2021 | 5 | 1 | 16696229.0 | 19613306.0 | 18139723.0 | 21155785.0 | 17973670.0 | 18221670.0 |
| 3 | 2021-04-04 | 20025075.0 | 4 | 4 | 2021 | 6 | 1 | 21325720.0 | 21587795.0 | 16906672.0 | 17087610.0 | 18218300.0 | 18327870.0 |
| 4 | 2021-04-05 | 17705026.0 | 5 | 4 | 2021 | 0 | 0 | 20025075.0 | 18953373.0 | 23381744.0 | 16450365.0 | 17995050.0 | 18431810.0 |
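
The first surviving row is 2021-04-01 because the longest lookback (`lag_90`) leaves the first 90 days without a complete feature set. The sketch below reproduces that effect on synthetic data. `add_lag_features` is a hypothetical helper that mirrors the lag and rolling cells above, and the 120-day toy series is made up for illustration.

```python
import numpy as np
import pandas as pd

def add_lag_features(ts, lags=(1, 7, 30, 90), windows=(7, 30), target="revenue"):
    """Add lag_<n> and roll_<w> columns using only values from earlier rows."""
    out = ts.copy()
    for lag in lags:
        out[f"lag_{lag}"] = out[target].shift(lag)
    for w in windows:
        out[f"roll_{w}"] = out[target].shift(1).rolling(w).mean()
    return out

# Synthetic daily series starting on the same date as the notebook's data.
dates = pd.date_range("2021-01-01", periods=120, freq="D")
toy = pd.DataFrame({"date": dates, "revenue": np.arange(120, dtype=float)})

feat = add_lag_features(toy)
clean = feat.dropna().reset_index(drop=True)

rows_dropped = len(toy) - len(clean)
print(rows_dropped, clean["date"].iloc[0])
```

The number of dropped rows equals the largest lag, so adding a longer lag (for example a yearly one) would shorten the usable training history by the same amount.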


### Train/Te