# forecasting_demand



# Project 1: Demand Forecasting with LightGBM

## 1. Introduction
This notebook demonstrates a complete pipeline for demand forecasting using LightGBM.
We will:
- Load data from S3
- Explore and preprocess the dataset
- Create temporal features
- Split the data into train and validation sets
- Train a LightGBM model
- Track parameters, metrics, and the model using MLflow

## 2. Data Loading
Confirm that the core libraries import, then load train.csv from the S3 bucket.

In [None]:
import pandas as pd
import numpy as np
import sklearn
import mlflow

print("OK")

OK


In [None]:
import boto3
import pandas as pd
from io import BytesIO

print("STEP 1: creating S3 client")
s3 = boto3.client("s3")

BUCKET = "ml-portfolio-av"
KEY = "train.csv"

obj = s3.get_object(Bucket=BUCKET, Key=KEY)
df = pd.read_csv(BytesIO(obj["Body"].read()))

print("DF LOADED")
print(df.shape)

STEP 1: creating S3 client


DF LOADED
(3000888, 6)
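As an optional variation on the load above, the date column can be parsed while reading, which makes the later `pd.to_datetime` call unnecessary. The sketch below is a minimal illustration: `read_sales_csv` is a hypothetical helper name, and the in-memory bytes are made-up rows standing in for `obj["Body"].read()`.

```python
from io import BytesIO

import pandas as pd


def read_sales_csv(raw: bytes) -> pd.DataFrame:
    """Parse CSV bytes, converting the `date` column to datetime during the read."""
    return pd.read_csv(BytesIO(raw), parse_dates=["date"])


# Made-up rows with the same columns as train.csv (assumption for illustration)
sample = (
    b"id,date,store_nbr,family,sales,onpromotion\n"
    b"0,2013-01-01,1,AUTOMOTIVE,0.0,0\n"
    b"1,2013-01-01,1,BEVERAGES,810.0,0\n"
)
sample_df = read_sales_csv(sample)
print(sample_df.shape)
print(sample_df.dtypes)
```

With the real object, the call would be `read_sales_csv(obj["Body"].read())`.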


## 3. Exploratory Data Analysis

Check columns, data types, and date range.

In [None]:
print(df.columns)
print(df.dtypes)

Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')
id               int64
date            object
store_nbr        int64
family          object
sales          float64
onpromotion      int64
dtype: object


In [None]:
df["date"] = pd.to_datetime(df["date"])
print(df["date"].min(), df["date"].max())

2013-01-01 00:00:00 2017-08-15 00:00:00
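The minimum and maximum dates do not show whether every calendar day in between is present. A small check like the one below can list any gaps. It is a sketch on toy data; `missing_dates` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def missing_dates(dates: pd.Series) -> pd.DatetimeIndex:
    """Return calendar days between min and max that never appear in `dates`."""
    observed = pd.DatetimeIndex(dates.dropna().unique())
    full_range = pd.date_range(observed.min(), observed.max(), freq="D")
    return full_range.difference(observed)


# Toy series with one gap (2013-01-03 is absent)
toy_dates = pd.Series(pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-04"]))
print(missing_dates(toy_dates))
```

On the real data this would be called as `missing_dates(df["date"])`.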


## 4. Sorting and Aggregation

Sort the data by date, then sum sales across all stores so that each row represents one product family on one date.

In [None]:
df = df.sort_values("date").reset_index(drop=True)
print("OK - sorted")

OK - sorted


In [None]:
df_agg = (
    df
    .groupby(["date", "family"], as_index=False)
    .agg({"sales": "sum"})
)

print(df_agg.shape)
df_agg.head()


(55572, 3)


        date      family  sales
0 2013-01-01  AUTOMOTIVE    0.0
1 2013-01-01   BABY CARE    0.0
2 2013-01-01      BEAUTY    2.0
3 2013-01-01   BEVERAGES  810.0
4 2013-01-01       BOOKS    0.0
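To make the effect of the aggregation concrete, the toy example below applies the same `groupby` to a few made-up rows and checks that total sales are preserved once the store dimension is collapsed. The values are illustrative only.

```python
import pandas as pd

# Made-up rows: two stores sell BEAUTY on the same day
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-01", "2013-01-02"]),
    "store_nbr": [1, 2, 1, 1],
    "family": ["BEAUTY", "BEAUTY", "BOOKS", "BEAUTY"],
    "sales": [2.0, 3.0, 0.0, 4.0],
})

toy_agg = toy.groupby(["date", "family"], as_index=False).agg({"sales": "sum"})

# Aggregation drops store_nbr but must not change the overall total
assert toy_agg["sales"].sum() == toy["sales"].sum()
print(toy_agg)
```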


## 5. Train / Validation Split

Split the dataset by date: rows before 2017-01-01 go to training, and rows on or after that date go to validation.

In [None]:
split_date = "2017-01-01"

train_df = df_agg[df_agg["date"] < split_date]
val_df   = df_agg[df_agg["date"] >= split_date]

print("TRAIN:", train_df.shape)
print("VAL:", val_df.shape)


TRAIN: (48081, 3)
VAL: (7491, 3)
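The split above can be wrapped in a small function that also asserts there is no date overlap between the two sets. This is a sketch on toy data; `temporal_split` is a hypothetical helper name.

```python
import pandas as pd


def temporal_split(frame, split_date, date_col="date"):
    """Split into rows strictly before `split_date` and rows on or after it."""
    cutoff = pd.Timestamp(split_date)
    train = frame[frame[date_col] < cutoff]
    val = frame[frame[date_col] >= cutoff]
    # Guard against leakage: every training date must precede every validation date
    if len(train) and len(val):
        assert train[date_col].max() < val[date_col].min()
    return train, val


toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]),
    "sales": [1.0, 2.0, 3.0, 4.0],
})
toy_train, toy_val = temporal_split(toy, "2017-01-01")
print(len(toy_train), len(toy_val))  # 2 2
```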


## 6. Feature Engineering

Add temporal features derived from the date: day of week, month, and day of month.

In [None]:
def add_time_features(df):
    df = df.copy()
    df["dayofweek"] = df["date"].dt.dayofweek
    df["month"] = df["date"].dt.month
    df["day"] = df["date"].dt.day
    return df

train_df = add_time_features(train_df)
val_df   = add_time_features(val_df)

print(train_df.head())


        date      family  sales  dayofweek  month  day
0 2013-01-01  AUTOMOTIVE    0.0          1      1    1
1 2013-01-01   BABY CARE    0.0          1      1    1
2 2013-01-01      BEAUTY    2.0          1      1    1
3 2013-01-01   BEVERAGES  810.0          1      1    1
4 2013-01-01       BOOKS    0.0          1      1    1
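The `dayofweek` value of 1 in the output follows the pandas convention in which Monday is 0 and Sunday is 6, so 2013-01-01 (a Tuesday) maps to 1. The short check below confirms this and sketches how a weekend flag could be derived from it. The `is_weekend` column is a hypothetical extra feature, not one used in this notebook.

```python
import pandas as pd

ts = pd.Timestamp("2013-01-01")
print(ts.day_name(), ts.dayofweek)  # Tuesday 1

# Hypothetical extra feature: flag Saturday (5) and Sunday (6)
toy = pd.DataFrame({"date": pd.to_datetime(["2013-01-04", "2013-01-05", "2013-01-06"])})
toy["dayofweek"] = toy["date"].dt.dayofweek
toy["is_weekend"] = (toy["dayofweek"] >= 5).astype(int)
print(toy)
```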


In [None]:
import lightgbm as lgb
print("LightGBM OK", lgb.__version__)

LightGBM OK 4.6.0


Encode the categorical column family as integer labels, fitting the encoder on the training set only and reusing it for validation.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train_df["family_enc"] = le.fit_transform(train_df["family"])
val_df["family_enc"] = le.transform(val_df["family"])

print(train_df[["family", "family_enc"]].head())

       family  family_enc
0  AUTOMOTIVE           0
1   BABY CARE           1
2      BEAUTY           2
3   BEVERAGES           3
4       BOOKS           4
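Because the encoder is fitted on training data only, `LabelEncoder.transform` raises a `ValueError` if the validation set contains a family that never appeared in training. A quick pre-check along these lines can surface such categories first. It is a sketch with made-up values; `unseen_categories` is a hypothetical helper.

```python
import pandas as pd


def unseen_categories(train_values, val_values):
    """Return sorted validation categories that never occur in training."""
    return sorted(set(val_values) - set(train_values))


# Made-up example: BEVERAGES appears only in the validation series
toy_train_family = pd.Series(["AUTOMOTIVE", "BEAUTY", "BOOKS"])
toy_val_family = pd.Series(["BEAUTY", "BOOKS", "BEVERAGES"])
print(unseen_categories(toy_train_family, toy_val_family))  # ['BEVERAGES']
```

In the notebook this would be called as `unseen_categories(train_df["family"], val_df["family"])` before `le.transform`.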


## 7. Prepare Features and Target

In [None]:
FEATURES = ["family_enc", "dayofweek", "month", "day"]
TARGET = "sales"

X_train = train_df[FEATURES]
y_train = train_df[TARGET]

X_val = val_df[FEATURES]
y_val = val_df[TARGET]

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

X_train shape: (48081, 4)
y_train shape: (48081,)


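## 8. LightGBM Training

Train a LightGBM regression model on CPU, tracking RMSE on the training and validation sets for up to 50 boosting rounds, with early stopping after 10 rounds without validation improvement.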

In [None]:
import lightgbm as lgb

train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val)

params = {
    "objective": "regression",
    "metric": "rmse",
    "verbose": -1,
    "boosting_type": "gbdt",
    "num_threads": 2  # light CPU usage
}

print("Training started...")

callbacks = [lgb.early_stopping(stopping_rounds=10)]

model = lgb.train(
    params,
    train_data,
    valid_sets=[train_data, val_data],
    num_boost_round=50,
    callbacks=callbacks
)

print("Training finished")


Training started...
Training until validation scores don't improve for 10 rounds
Did not meet early stopping. Best iteration is:
[50]	training's rmse: 14853.5	valid_1's rmse: 26160.3
Training finished
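The log reports a validation RMSE of 26160.3 against a training RMSE of 14853.5. One way to put such a number in context is to compare it with a naive baseline, for example predicting each family's mean training sales. The sketch below shows the idea on made-up data; `rmse` and `family_mean_baseline` are hypothetical helpers, and no baseline result for the real dataset is implied.

```python
import numpy as np
import pandas as pd


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def family_mean_baseline(train, val, key="family", target="sales"):
    """Predict each validation row with its family's mean training sales."""
    means = train.groupby(key)[target].mean()
    # Families absent from training fall back to the global training mean
    return val[key].map(means).fillna(train[target].mean())


# Made-up data: family A averages 2.0 in training, family B averages 10.0
toy_train = pd.DataFrame({"family": ["A", "A", "B"], "sales": [1.0, 3.0, 10.0]})
toy_val = pd.DataFrame({"family": ["A", "B"], "sales": [2.0, 14.0]})

baseline_pred = family_mean_baseline(toy_train, toy_val)
print(rmse(toy_val["sales"], baseline_pred))
```

On the real data, the equivalent call would be `rmse(val_df["sales"], family_mean_baseline(train_df, val_df))`, which can then be compared with the model's validation RMSE.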


## 9. MLflow Tracking

Log parameters, metrics, and model in MLflow.

In [None]:
import mlflow
import mlflow.lightgbm
from sklearn.metrics import mean_squared_error
import numpy as np

mlflow.set_experiment("Forecasting_Demand")

with mlflow.start_run():
    mlflow.log_params(params)
    y_pred = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    mlflow.log_metric("rmse", rmse)
    mlflow.lightgbm.log_model(model, name="model")

print("MLflow run logged, RMSE:", rmse)





MLflow run logged, RMSE: 26160.26321213675


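## 10. Summary

- Dataset loaded from S3
- Sales aggregated by date and family
- Train / validation split applied at 2017-01-01
- Temporal features added
- LightGBM trained on CPU (validation RMSE of about 26,160)
- Parameters, metrics, and model tracked in MLflow
- Logged model available for deployment or integration into production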