# 04 — Stockout Classification Model (SunnyBest Telecommunications)

In this notebook, I build a **classification model** to predict the risk of stockouts for SunnyBest stores across Edo State.

A stockout occurs when demand exceeds available inventory, leading to lost sales and customer dissatisfaction.

**Objective:**  
Predict whether a `(store, product, date)` combination is likely to experience a **stockout** so that SunnyBest can:

- Improve inventory planning  
- Reduce lost sales  
- Increase customer satisfaction  
- Avoid holding excess stock  

We will:

1. Load the merged dataset  
2. Select + engineer features  
3. Split into train/test  
4. Train multiple classification models  
5. Evaluate using F1 score and ROC-AUC  
6. Select the best classifier  
7. Save model for deployment  


### 1. Load dependencies

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

import joblib
import os


### 2. Load dataset

In [2]:
df = pd.read_csv("../data/processed/sunnybest_merged_df.csv", parse_dates=["date"], low_memory=False)
df.head()


| | date | store_id | product_id | units_sold | price | regular_price | discount_pct | promo_flag | promo_type | revenue | ... | is_weekend | is_holiday | is_payday | season | temperature_c | rainfall_mm | weather_condition | promo_type_promo | discount_pct_promo | promo_flag_promo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 | 1 | 1001 | 0 | 445838.0 | 445838 | 0 | 0 | | 0.0 | ... | False | True | False | Dry | 30.6 | 3.7 | Rainy | | | |
| 1 | 2021-01-01 | 1 | 1002 | 2 | 500410.0 | 500410 | 0 | 0 | | 1000820.0 | ... | False | True | False | Dry | 30.6 | 3.7 | Rainy | | | |
| 2 | 2021-01-01 | 1 | 1003 | 2 | 399365.0 | 399365 | 0 | 0 | | 798730.0 | ... | False | True | False | Dry | 30.6 | 3.7 | Rainy | | | |
| 3 | 2021-01-01 | 1 | 1004 | 4 | 305796.0 | 305796 | 0 | 0 | | 1223184.0 | ... | False | True | False | Dry | 30.6 | 3.7 | Rainy | | | |
| 4 | 2021-01-01 | 1 | 1005 | 5 | 462752.0 | 462752 | 0 | 0 | | 2313760.0 | ... | False | True | False | Dry | 30.6 | 3.7 | Rainy | | | |


### 3. Select features + target

In [3]:
df.columns

Index(['date', 'store_id', 'product_id', 'units_sold', 'price',
       'regular_price', 'discount_pct', 'promo_flag', 'promo_type', 'revenue',
       'starting_inventory', 'ending_inventory', 'stockout_occurred', 'city',
       'store_size', 'category', 'product_name', 'category_product', 'brand',
       'regular_price_product', 'cost_price', 'is_seasonal', 'warranty_months',
       'store_name', 'city_store', 'area', 'region', 'store_type',
       'store_size_store', 'year', 'month', 'day', 'day_of_week', 'is_weekend',
       'is_holiday', 'is_payday', 'season', 'temperature_c', 'rainfall_mm',
       'weather_condition', 'promo_type_promo', 'discount_pct_promo',
       'promo_flag_promo'],
      dtype='object')

In [13]:
target = "stockout_occurred"

## Feature Selection Strategy

Before training the stockout classification model, I carefully selected features that have a meaningful and logical influence on stockout risk.

A stockout occurs when **demand exceeds available inventory**, so the selected features capture factors that influence:

- consumer demand,
- promotion-driven demand spikes,
- seasonality,
- store-specific behaviour,
- and weather-driven changes in purchasing patterns.

### Selected Features (and why)

- **units_sold** – high sales volume increases the probability of running out of stock.
- **regular_price, discount_pct, promo_flag** – pricing and promotions strongly affect demand pressure.
- **store_size** – larger stores have higher foot traffic and different demand dynamics.
- **category** – stockout behaviour varies by product type (phones vs appliances).
- **month, is_weekend, is_holiday** – captures seasonality and high-traffic periods.
- **temperature_c, rainfall_mm** – weather influences consumer behaviour (e.g., rainy days → more telecom activity).

### Why some features were excluded

- **starting_inventory, ending_inventory** – these directly reveal the stockout outcome (data leakage).
- **product_id, product_name** – high-cardinality identifiers that add noise and little predictive value.
- **store_id, city, area** – treated as largely redundant with store_size, which already captures the main store-level difference used here.
- **warranty_months, brand** – do not meaningfully influence short-term stockout occurrence.

This feature selection ensures the model remains **predictive**, **generalizable**, and free from **leakage**, resulting in more realistic real-world performance.
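The exclusion rule above can be checked mechanically before training. The following is a minimal sketch, not part of the original notebook: `LEAKY_COLUMNS` and `check_no_leakage` are hypothetical names, and the set of leaky columns is an assumption based on the exclusions listed above (plus the target itself).

```python
# Hypothetical guard: fail fast if a leaky column slips into the feature list.
# The column names come from the dataset; the helper itself is illustrative.
LEAKY_COLUMNS = {"starting_inventory", "ending_inventory", "stockout_occurred"}


def check_no_leakage(feature_names, leaky=LEAKY_COLUMNS):
    """Raise ValueError if any leaky column is in feature_names, else return []."""
    found = sorted(set(feature_names) & set(leaky))
    if found:
        raise ValueError(f"Leaky columns in feature list: {found}")
    return found


candidate_features = [
    "units_sold", "regular_price", "discount_pct", "promo_flag",
    "store_size", "category", "month", "is_weekend", "is_holiday",
    "rainfall_mm", "temperature_c",
]
check_no_leakage(candidate_features)  # passes silently for this list
```

Calling the helper right before the train/test split would catch an accidental edit to the feature list.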


In [14]:
features = [
    "units_sold",
    "regular_price",
    "discount_pct",
    "promo_flag",
    "store_size",
    "category",
    "month",
    "is_weekend",
    "is_holiday",
    "rainfall_mm",
    "temperature_c",
]


In [15]:
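# Derive month and weekend flags from the date.
# Note: this overwrites the existing month / is_weekend columns.
df["month"] = df["date"].dt.month
df["is_weekend"] = df["date"].dt.day_name().isin(["Saturday", "Sunday"]).astype(int)
# Cast the boolean holiday flag to 0/1 so it can be passed through as numeric
df["is_holiday"] = df["is_holiday"].astype(int)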
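To make the derived columns concrete, here is a small sketch on a toy frame with made-up dates. It uses `dt.dayofweek` instead of matching day names, which is an alternative I am assuming is equivalent here, not what the cell above does.

```python
import pandas as pd

# Toy frame with hypothetical dates (2021-01-01 was a Friday).
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-02-01"])}
)

toy["month"] = toy["date"].dt.month
# dayofweek runs Monday=0 ... Sunday=6, so 5 and 6 are the weekend
toy["is_weekend"] = (toy["date"].dt.dayofweek >= 5).astype(int)

print(toy)
```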


### 4. Create modelling dataset

In [17]:
X = df[features]
y = df[target]


### 5. Train/Test Split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
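With a rare positive class, `stratify=y` keeps the stockout rate the same in both splits. The sketch below illustrates this on synthetic labels; the 4% positive rate and the array shapes are assumptions chosen only to resemble an imbalanced target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic labels with roughly 4% positives (illustrative, not the real data)
y_toy = (rng.random(10_000) < 0.04).astype(int)
X_toy = rng.normal(size=(10_000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The positive rate should be nearly identical across the full set and both splits
print(f"overall: {y_toy.mean():.4f}  train: {y_tr.mean():.4f}  test: {y_te.mean():.4f}")
```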


### 6. Define categorical + numeric columns

In [19]:
categorical_cols = ["store_size", "category"]
numeric_cols = [col for col in features if col not in categorical_cols]


In [20]:
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numeric_cols)
    ]
)

### 7. Train Classification Models

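I train three models: Logistic Regression, Random Forest, and XGBoost (a gradient-boosting implementation).

The goal is to identify the model with the best ROC-AUC together with the best precision, recall, and F1 for the stockout class. Stockouts are the minority class (about 4% of rows in the test split), so overall accuracy on its own says little about how well stockouts are detected.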
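A short piece of arithmetic shows why accuracy is a weak yardstick here. It uses the test-set class counts that appear in the classification reports later in this notebook, and evaluates a trivial rule that always predicts "no stockout".

```python
# Test-set class counts taken from the classification reports in this notebook
n_negative, n_positive = 235_397, 10_051
n_total = n_negative + n_positive

# Trivial baseline: always predict class 0 (no stockout)
accuracy = n_negative / n_total      # every negative is right, every positive is wrong
recall_stockout = 0 / n_positive     # no stockout is ever flagged

print(f"accuracy: {accuracy:.3f}, stockout recall: {recall_stockout:.3f}")
```

The baseline reaches roughly the same accuracy as the trained models while catching no stockouts at all, which is why the minority-class metrics and ROC-AUC carry the comparison.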

In [21]:
logreg_clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("clf", LogisticRegression(max_iter=200))
])

logreg_clf.fit(X_train, y_train)
logreg_pred = logreg_clf.predict(X_test)


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
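The warning above suggests scaling the inputs. The sketch below shows one way that could look, on synthetic data: numeric columns go through `StandardScaler`, and `class_weight="balanced"` is added as an assumed response to class imbalance. This is not the configuration that produced the results in this notebook, and the toy columns, category labels, and target rule are all made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 2_000
toy = pd.DataFrame({
    "regular_price": rng.uniform(50_000, 500_000, n),  # large scale, like the price columns
    "units_sold": rng.poisson(3, n),
    "store_size": rng.choice(["Small", "Medium", "Large"], n),  # hypothetical labels
})
# Hypothetical target: a rare positive class that becomes likelier as units_sold grows
toy_y = (toy["units_sold"] + rng.normal(0, 1, n) > 6).astype(int)

scaled_preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["store_size"]),
    ("num", StandardScaler(), ["regular_price", "units_sold"]),  # scale instead of passthrough
])

scaled_logreg = Pipeline(steps=[
    ("preprocess", scaled_preprocessor),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
scaled_logreg.fit(toy, toy_y)

# n_iter_ reports how many solver iterations were actually needed
print("iterations used:", scaled_logreg.named_steps["clf"].n_iter_[0])
```

Whether this improves stockout recall on the real data would need to be checked by rerunning the evaluation.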


In [22]:
rf_clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=300))
])

rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)


In [23]:
xgb_clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("clf", XGBClassifier(
        n_estimators=300,
        learning_rate=0.1,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        eval_metric="logloss"
    ))
])

xgb_clf.fit(X_train, y_train)
xgb_pred = xgb_clf.predict(X_test)


In [24]:
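def evaluate(name, y_true, y_pred, model=None, X=None):
    """Print the classification report and, if a fitted model and features are given, ROC-AUC."""
    print(f"==== {name} ====")
    print(classification_report(y_true, y_pred))
    if model is not None and X is not None:
        # ROC-AUC is computed from the predicted probability of the stockout class.
        # X is passed explicitly instead of relying on the global X_test.
        prob = model.predict_proba(X)[:, 1]
        print("ROC-AUC:", roc_auc_score(y_true, prob))
    print("\n")

evaluate("Logistic Regression", y_test, logreg_pred, logreg_clf, X_test)
evaluate("Random Forest", y_test, rf_pred, rf_clf, X_test)
evaluate("XGBoost", y_test, xgb_pred, xgb_clf, X_test)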


==== Logistic Regression ====
              precision    recall  f1-score   support

           0       0.96      1.00      0.98    235397
           1       0.00      0.00      0.00     10051

    accuracy                           0.96    245448
   macro avg       0.48      0.50      0.49    245448
weighted avg       0.92      0.96      0.94    245448

ROC-AUC: 0.7989159729784782


==== Random Forest ====
              precision    recall  f1-score   support

           0       0.96      0.99      0.98    235397
           1       0.43      0.13      0.20     10051

    accuracy                           0.96    245448
   macro avg       0.70      0.56      0.59    245448
weighted avg       0.94      0.96      0.95    245448



ROC-AUC: 0.8549717699138718


==== XGBoost ====
              precision    recall  f1-score   support

           0       0.96      1.00      0.98    235397
           1       0.68      0.09      0.16     10051

    accuracy                           0.96    245448
   macro avg       0.82      0.55      0.57    245448
weighted avg       0.95      0.96      0.95    245448

ROC-AUC: 0.9000142542488738
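The reports above show low stockout recall at the default 0.5 threshold even where ROC-AUC is high. One common response is to pick a lower decision threshold from the precision-recall curve. The sketch below does this on synthetic data with a stand-in logistic regression; the data, the model, and the 0.5 recall target are all illustrative assumptions, not results from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (about 5% positives) standing in for the stockout data
X_syn, y_syn = make_classification(
    n_samples=20_000, n_features=10, weights=[0.95], flip_y=0.01, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_a, y_a)
prob = model.predict_proba(X_b)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_b, prob)

# precision and recall have one more entry than thresholds; drop the final point
target_recall = 0.5
reaches_target = recall[:-1] >= target_recall
chosen = thresholds[reaches_target].max()  # highest threshold that still meets the target

tuned_pred = (prob >= chosen).astype(int)
print(f"chosen threshold: {chosen:.3f}")
print(f"recall at 0.5: {recall_score(y_b, (prob >= 0.5).astype(int)):.3f}")
print(f"recall at chosen threshold: {recall_score(y_b, tuned_pred):.3f}")
```

In practice the threshold would be chosen on a separate validation split rather than on the test set, and the recall target would come from the cost of a missed stockout relative to a false alarm.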




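XGBoost has the highest ROC-AUC (0.900, against 0.855 for Random Forest and 0.799 for Logistic Regression), so it is the model saved for deployment. Its stockout recall at the default 0.5 threshold is only 0.09, so the saved pipeline is better suited to ranking stockout risk with `predict_proba` than to hard labels at that threshold.

In [25]: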
os.makedirs("../models", exist_ok=True)
joblib.dump(xgb_clf, "../models/stockout_classifier.pkl")

print("Model saved successfully.")


Model saved successfully.
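For completeness, here is a sketch of the load-and-score round trip a downstream service might perform. It uses a tiny stand-in pipeline and a temporary directory so it runs on its own; the toy data and the two feature columns are assumptions, and the real file would be `../models/stockout_classifier.pkl` containing `xgb_clf`.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline; the notebook itself saves xgb_clf instead
toy_X = pd.DataFrame({"units_sold": [0, 1, 2, 8, 9, 10], "discount_pct": [0, 0, 5, 20, 25, 30]})
toy_y = [0, 0, 0, 1, 1, 1]
toy_clf = Pipeline(steps=[("scale", StandardScaler()), ("clf", LogisticRegression())])
toy_clf.fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stockout_classifier.pkl")
    joblib.dump(toy_clf, path)
    loaded = joblib.load(path)  # preprocessing and model come back as one object

# New rows must carry the same columns the pipeline was trained on
new_rows = pd.DataFrame({"units_sold": [1, 9], "discount_pct": [0, 25]})
risk = loaded.predict_proba(new_rows)[:, 1]  # probability of the positive class
print(risk)
```

Because the whole `Pipeline` is saved, callers pass raw feature columns and do not repeat the encoding step. Loading is generally safest with the same library versions that were used for saving.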
