## Spend Categorization - CatBoost

This notebook trains a CatBoost model for Borealis Wind Systems spend categorization. The goal is to predict the 3-level category hierarchy (Level 1 → Level 2 → Level 3) based on transaction features.

**Data Schema:**
- `order_id`, `date` - Transaction identifiers
- `supplier`, `supplier_country`, `description` - Text features (LLM-generated)
- `category_level_1/2/3` - Target labels (Direct/Indirect/Non-Procureable hierarchy)
- `cost_centre`, `plant`, `plant_id`, `region` - Categorical features
- `amount` (quantity), `unit_price`, `total` (`amount * unit_price`) - Numeric features

**Requirements:** This notebook uses CatBoost for gradient boosting classification. It works on serverless or classic compute.

In [0]:
%pip install catboost
%restart_python

In [None]:
from utils import get_spark
spark = get_spark()


In [0]:
import mlflow
from catboost import CatBoostClassifier, Pool, cv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from mlflow.models import infer_signature
import pandas as pd

In [0]:
# Load and prepare the enhanced transactions table
# This table includes LLM-generated descriptions from 0_data.ipynb
transactions = spark.table('shm.spend.transactions_enh').toPandas()
transactions['date'] = pd.to_datetime(transactions['date'])
transactions['year'] = transactions['date'].dt.year
transactions['month'] = transactions['date'].dt.month

# Create train/test split (80/20); train_test_split is imported in the cell above
train_pd, test_pd = train_test_split(transactions, test_size=0.2, random_state=42)
print(f"Train: {len(train_pd)}, Test: {len(test_pd)}")

## Feature Engineering

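We train hierarchical models: Level 1 → Level 2 → Level 3. Each lower level uses the prediction from the level above as a feature, so a mistake at one level can propagate to the levels below it. We monitor this error propagation during evaluation. The cells in this notebook train and evaluate Level 1 and Level 2.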

In [0]:
# Define feature columns for Borealis Wind data
cat_feature_cols = [
    'order_id',           # Transaction identifier
    'supplier',           # Vendor name (LLM-generated)
    'supplier_country',   # Vendor country (LLM-generated)
    'description',        # Invoice description (LLM-generated)
    'cost_centre',        # Cost center code
    'plant',              # Plant name
    'plant_id',           # Plant identifier
    'region',             # Geographic region
]

num_feature_cols = [
    'amount',             # Quantity
    'unit_price',         # Price per unit
    'total',              # Total spend (amount * unit_price)
]

# Target columns (hierarchical)
targets = [
    'category_level_1',   # Direct / Indirect / Non-Procureable
    'category_level_2',   # e.g., "Bearings & Seals", "MRO"
    'category_level_3',   # e.g., "Spherical roller bearing"
]

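# Convert categorical features to string.
# Copy first so the assignments below do not trigger pandas'
# SettingWithCopyWarning on the frame returned by train_test_split.
train_pd = train_pd.copy()
for col_name in cat_feature_cols:
    train_pd[col_name] = train_pd[col_name].astype(str)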

for targ_name in targets:
    train_pd[targ_name] = train_pd[targ_name].astype(str)

X = train_pd[cat_feature_cols + num_feature_cols]
y = train_pd[targets]

# Split into train/validation (we already have a holdout test set)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Extract targets for each level
y_train_lvl_1 = y_train['category_level_1']
y_val_lvl_1 = y_val['category_level_1']
y_train_lvl_2 = y_train['category_level_2']
y_val_lvl_2 = y_val['category_level_2']
y_train_lvl_3 = y_train['category_level_3']
y_val_lvl_3 = y_val['category_level_3']

print(f"Training samples: {len(X_train)}")
print(f"Validation samples: {len(X_val)}")
print(f"Level 1 classes: {y_train_lvl_1.nunique()}")
print(f"Level 2 classes: {y_train_lvl_2.nunique()}")
print(f"Level 3 classes: {y_train_lvl_3.nunique()}")

## Level 1 Model: Direct / Indirect / Non-Procureable

The first model predicts the top-level category, which is a 3-class classification problem. The `Pool` object is CatBoost's data container: it bundles the features, the labels, and the list of categorical columns.

In [0]:
mlflow.set_registry_uri('databricks-uc')
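# Training flags: set to True to retrain and register a new model version.
# When False, the training cell is skipped and the registered model is loaded instead.
train_lvl_1 = False
train_lvl_2 = False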

In [0]:
if train_lvl_1:
    with mlflow.start_run():
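        # sklearn autologging targets scikit-learn estimators; the CatBoost model
        # itself is logged explicitly with mlflow.catboost.log_model below.
        mlflow.sklearn.autolog()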

        params = {
            'iterations': 500,
            'learning_rate': 0.1,
            'depth': 6,
            'l2_leaf_reg': 3,
            'colsample_bylevel': 0.8,
            'early_stopping_rounds': 50,
            'verbose': 100
        }

        model_lvl1 = CatBoostClassifier(**params)

        # Create CatBoost Pool for efficient memory handling
        train_pool = Pool(X_train, y_train_lvl_1, cat_features=cat_feature_cols)
        val_pool = Pool(X_val, y_val_lvl_1, cat_features=cat_feature_cols)

        model_lvl1.fit(
            train_pool,
            eval_set=val_pool,
            use_best_model=True
        )

        mlflow.catboost.log_model(
            cb_model=model_lvl1,
            artifact_path="model",
            registered_model_name="shm.spend.category_level1_model",
            signature=mlflow.models.infer_signature(
                X_train, 
                model_lvl1.predict(X_train)
            )
        )

Models are logged to MLflow and can be served via Databricks Model Serving. In production, wrap all 3 level models in a single `pyfunc` class for a unified prediction endpoint.
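As a rough illustration of the single-endpoint idea, the sketch below chains per-level models so that one call returns every level. It is a minimal sketch under stated assumptions, not the production wrapper: `HierarchicalCategorizer` and `StubModel` are hypothetical names, the `predict(context, model_input)` signature only mirrors the shape of `mlflow.pyfunc.PythonModel` without importing MLflow, and the stub models stand in for the trained CatBoost models. A real implementation may need to build a `Pool` with the categorical columns, as the Level 2 cells in this notebook do.

```python
import numpy as np
import pandas as pd


class HierarchicalCategorizer:
    """Chain per-level classifiers so one call returns every level.

    Hypothetical sketch: the predict(context, model_input) signature mirrors
    mlflow.pyfunc.PythonModel, but MLflow is not imported here.
    """

    def __init__(self, level_models, level_names):
        self.level_models = level_models    # e.g. [model_lvl1, model_lvl2]
        self.level_names = level_names      # e.g. ['category_level_1', 'category_level_2']

    def predict(self, context, model_input):
        features = model_input.copy()       # never mutate the caller's frame
        output = pd.DataFrame(index=model_input.index)
        for name, model in zip(self.level_names, self.level_models):
            # ravel() flattens (n, 1)-shaped predictions into a 1-D array
            preds = np.asarray(model.predict(features)).ravel().astype(str)
            output[name.replace("category_", "pred_")] = preds
            features[name] = preds          # feed this level into the next one
        return output


class StubModel:
    """Hypothetical stand-in for a trained model: returns one fixed label."""

    def __init__(self, label):
        self.label = label
        self.seen_columns = None

    def predict(self, X):
        self.seen_columns = list(X.columns)  # record what the model was given
        return np.full((len(X), 1), self.label, dtype=object)


stub_lvl1, stub_lvl2 = StubModel("Direct"), StubModel("Bearings & Seals")
wrapper = HierarchicalCategorizer(
    [stub_lvl1, stub_lvl2], ["category_level_1", "category_level_2"]
)
demo_input = pd.DataFrame({"supplier": ["A", "B"], "total": [10.0, 20.0]})
demo_output = wrapper.predict(None, demo_input)
```

The second stub receives the original columns plus `category_level_1`, which is the same augmentation the Level 2 cells perform by hand.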

In [0]:
import mlflow
# Load the Level 1 model from registry
model_lvl1 = mlflow.catboost.load_model("models:/shm.spend.category_level1_model/1")
model_lvl1.predict(X_val)

## Level 2 Model: Category (e.g., Bearings & Seals, MRO, IT & Software)

The Level 2 model predicts the ~32 Level 2 categories. It takes the Level 1 prediction as an additional feature, so Level 1 errors can propagate into Level 2; we monitor this during evaluation.
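One optional variation, not used in this notebook: the cells below feed Level 2 with Level 1 predictions made on the same rows Level 1 was trained on, which are usually more accurate than the predictions Level 2 will see at inference time. Out-of-fold predictions are a common way to narrow that gap. The sketch below uses scikit-learn's `cross_val_predict` with a `DecisionTreeClassifier` and toy numeric data as stand-ins; `out_of_fold_predictions` is a hypothetical helper, and whether it drops in unchanged for `CatBoostClassifier` with categorical columns is an assumption that has not been verified here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def out_of_fold_predictions(model, X, y, n_splits=5, seed=42):
    """Predict every row with a model that was fit on the other folds only."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_predict(model, X, y, cv=folds)


# Toy numeric data standing in for the real transaction features
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "amount": rng.integers(1, 50, 200),
    "unit_price": rng.uniform(1, 500, 200),
})
y_toy = pd.Series(np.where(X_toy["unit_price"] > 250, "Direct", "Indirect"))

oof_lvl1 = out_of_fold_predictions(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_toy, y_toy
)

# Augment the Level 2 training frame with out-of-fold Level 1 predictions
X_toy_lvl2 = X_toy.copy()
X_toy_lvl2["category_level_1"] = oof_lvl1
```

The trade-off is cost: Level 1 is fit once per fold in addition to the final fit.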

In [0]:
# Get predictions and probabilities for Level 1
pred_train_lvl1 = model_lvl1.predict(X_train)
proba_train_lvl1 = model_lvl1.predict_proba(X_train)
pred_val_lvl1 = model_lvl1.predict(X_val)
proba_val_lvl1 = model_lvl1.predict_proba(X_val)

In [0]:
# Augment training and validation data with Level 1 predictions
X_train_lvl_2 = X_train.copy()
X_train_lvl_2['category_level_1'] = pred_train_lvl1.ravel()

X_val_lvl_2 = X_val.copy()
X_val_lvl_2['category_level_1'] = pred_val_lvl1.ravel()

In [0]:
train_lvl_2 = False

In [0]:
if train_lvl_2:
    with mlflow.start_run():
        mlflow.sklearn.autolog()
        
        # Filter validation set to only include classes present in training set
        train_classes = set(y_train_lvl_2.unique())
        val_mask = y_val_lvl_2.isin(train_classes)
        
        X_val_lvl_2_filtered = X_val_lvl_2[val_mask]
        y_val_lvl_2_filtered = y_val_lvl_2[val_mask]
        
        print(f"Filtered out {(~val_mask).sum()} validation samples with unseen classes")
        print(f"Validation set size: {len(y_val_lvl_2_filtered)} samples")

        params = {
            'loss_function': 'MultiClass',
            'iterations': 500,
            'learning_rate': 0.1,
            'depth': 6,
            'l2_leaf_reg': 3,
            'colsample_bylevel': 0.5,
            'early_stopping_rounds': 50,
            'verbose': 100
        }

        model_lvl2 = CatBoostClassifier(**params)

        # Create CatBoost Pool with Level 1 as additional categorical feature
        train_pool = Pool(
            X_train_lvl_2, 
            y_train_lvl_2, 
            cat_features=cat_feature_cols + ['category_level_1']
        )

        val_pool = Pool(
            X_val_lvl_2_filtered, 
            y_val_lvl_2_filtered, 
            cat_features=cat_feature_cols + ['category_level_1']
        )

        model_lvl2.fit(
            train_pool,
            eval_set=val_pool,
            use_best_model=True
        )

        mlflow.catboost.log_model(
            cb_model=model_lvl2,
            artifact_path="model",
            registered_model_name="shm.spend.category_level2_model",
            signature=mlflow.models.infer_signature(
                X_val_lvl_2_filtered, 
                model_lvl2.predict(X_val_lvl_2_filtered)
            )
        )

## Evaluation

Evaluate hierarchical predictions: Level 1 → Level 2. We reload models from MLflow and measure accuracy at each level.
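One way to quantify error propagation is to split Level 2 accuracy by whether Level 1 was right. The following is a minimal sketch on toy data: `propagation_report` is a hypothetical helper, and it assumes a frame with the `actual_level_*` and `pred_level_*` columns used for the holdout prediction table later in this notebook.

```python
import pandas as pd


def propagation_report(df):
    """Level 2 accuracy overall and conditioned on the Level 1 outcome."""
    lvl1_ok = df["actual_level_1"] == df["pred_level_1"]
    lvl2_ok = df["actual_level_2"] == df["pred_level_2"]
    return {
        "level_1_accuracy": float(lvl1_ok.mean()),
        "level_2_accuracy": float(lvl2_ok.mean()),
        # NaN if the corresponding subset is empty
        "level_2_given_level_1_correct": float(lvl2_ok[lvl1_ok].mean()),
        "level_2_given_level_1_wrong": float(lvl2_ok[~lvl1_ok].mean()),
    }


# Toy predictions; the labels are illustrative only
toy = pd.DataFrame({
    "actual_level_1": ["Direct", "Direct", "Indirect", "Indirect", "Indirect"],
    "pred_level_1":   ["Direct", "Direct", "Indirect", "Direct",   "Indirect"],
    "actual_level_2": ["Bearings & Seals", "Bearings & Seals", "MRO", "IT & Software", "MRO"],
    "pred_level_2":   ["Bearings & Seals", "MRO", "MRO", "Bearings & Seals", "MRO"],
})
report = propagation_report(toy)
```

A large gap between the two conditional accuracies suggests that Level 2 mistakes are mostly inherited from Level 1 rather than made independently.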

In [0]:
mlflow.set_registry_uri('databricks-uc')
model_lvl1 = mlflow.catboost.load_model("models:/shm.spend.category_level1_model/1")
model_lvl2 = mlflow.catboost.load_model("models:/shm.spend.category_level2_model/1")

In [0]:
X_val_eval = X_val.copy()
X_val_eval['pred_level_1'] = model_lvl1.predict(X_val).ravel()

# Add the level 1 prediction as a feature for level 2 model
X_val_eval_lvl2 = X_val.copy()
X_val_eval_lvl2['category_level_1'] = X_val_eval['pred_level_1']

# Create Pool with categorical features for level 2 prediction
val_pool_eval = Pool(
    X_val_eval_lvl2,
    cat_features=cat_feature_cols + ['category_level_1']
)

X_val_eval['pred_level_2'] = model_lvl2.predict(val_pool_eval).ravel()

In [0]:
X_val_eval.head(5)

## Holdout Test Predictions

Test on the holdout set, which the models have never seen. If the models are not overfit, holdout performance should be close to validation performance.

In [0]:
# test_pd was created by the train/test split above and already carries the
# parsed `date`, `year`, and `month` columns, so no further preparation is needed
print(f"Test set size: {len(test_pd)}")

In [0]:
X_test = test_pd[cat_feature_cols + num_feature_cols].copy()

# Convert categorical features to string (same as training)
for col_name in cat_feature_cols:
    X_test[col_name] = X_test[col_name].astype(str)

# Predict Level 1
X_test['pred_level_1'] = model_lvl1.predict(X_test).ravel()

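# Add Level 1 prediction as feature for Level 2.
# Start from the original feature columns only: X_test now also holds
# 'pred_level_1', which the Level 2 model was not trained on.
X_test_lvl2 = X_test[cat_feature_cols + num_feature_cols].copy()
X_test_lvl2['category_level_1'] = X_test['pred_level_1']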

# Create Pool for level 2 prediction
test_pool_lvl2 = Pool(
    X_test_lvl2,
    cat_features=cat_feature_cols + ['category_level_1']
)

X_test['pred_level_2'] = model_lvl2.predict(test_pool_lvl2).ravel()

# Add actuals for comparison
X_test['actual_level_1'] = test_pd['category_level_1'].values
X_test['actual_level_2'] = test_pd['category_level_2'].values
X_test['total'] = test_pd['total'].values

In [0]:
# Save predictions to Delta table
(
    spark.createDataFrame(X_test)
    .write
    .mode("overwrite")
    .saveAsTable("shm.spend.pred_catboost")
)
print("✅ Predictions saved to shm.spend.pred_catboost")

In [0]:
# display() renders the table and returns None, so there is nothing to assign
spark.sql("SELECT * FROM shm.spend.pred_catboost").display()

In [0]:
# Predictions table already includes actuals and spend, no join needed
pred_catboost_comp = spark.table('shm.spend.pred_catboost').toPandas()
pred_catboost_comp = pred_catboost_comp.dropna(
    subset=['actual_level_1', 'pred_level_1', 'actual_level_2', 'pred_level_2']
)
print(f"Comparison dataset: {len(pred_catboost_comp)} rows")

In [0]:
# Already loaded above - display sample
pred_catboost_comp[['order_id', 'description', 'actual_level_1', 'pred_level_1', 'actual_level_2', 'pred_level_2', 'total']].head()

## Accuracy Metrics

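Measure classification accuracy for each level of the hierarchy. The overall accuracy scores are unweighted (every transaction counts once); the per-category classification report is weighted by absolute spend.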
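Spend weighting can also be applied to the overall accuracy, so that a misclassified large invoice counts for more than a misclassified small one. The sketch below shows the arithmetic on toy numbers; `spend_weighted_accuracy` is a hypothetical helper, and taking the absolute value of spend follows the weighting used for the classification report. scikit-learn's `accuracy_score` also accepts a `sample_weight` argument for the same purpose.

```python
import numpy as np


def spend_weighted_accuracy(actual, predicted, spend):
    """Share of absolute spend that landed in the correct category."""
    weights = np.abs(np.asarray(spend, dtype=float))
    correct = (np.asarray(actual) == np.asarray(predicted)).astype(float)
    return float((weights * correct).sum() / weights.sum())


# Toy example: the single wrong row carries 900 of 2000 in absolute spend
toy_actual = ["MRO", "MRO", "IT & Software"]
toy_pred = ["MRO", "IT & Software", "IT & Software"]
toy_spend = [100.0, -900.0, 1000.0]   # a negative value, e.g. a credit note

weighted_acc = spend_weighted_accuracy(toy_actual, toy_pred, toy_spend)
unweighted_acc = float(np.mean(np.asarray(toy_actual) == np.asarray(toy_pred)))
```

Here two of three rows are correct, but only 1100 of 2000 in absolute spend is categorized correctly, so the weighted score is lower than the unweighted one.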

In [0]:
from sklearn.metrics import accuracy_score, classification_report

print(f"Level 1 Accuracy (Direct/Indirect/Non-Procureable): {accuracy_score(pred_catboost_comp['actual_level_1'], pred_catboost_comp['pred_level_1']):0.3f}")
print(f"Level 2 Accuracy (Category): {accuracy_score(pred_catboost_comp['actual_level_2'], pred_catboost_comp['pred_level_2']):0.3f}")

In [0]:
import pandas as pd

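# Classification report weighted by absolute spend. With sample_weight set,
# each category's 'support' is its summed absolute spend, not a row count.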
class_dict = classification_report(
    y_true=pred_catboost_comp['actual_level_2'], 
    y_pred=pred_catboost_comp['pred_level_2'],
    sample_weight=pred_catboost_comp['total'].abs(),
    output_dict=True,
    zero_division=0
)

In [0]:
# Per-category metrics for the 15 largest categories by spend (spend shown in thousands)
metrics_df = (
    pd.DataFrame(class_dict)
    .transpose()
    .reset_index()
    .rename(columns={'index': 'category', 'support': 'spend'})
    .query('category not in ["accuracy", "weighted avg", "macro avg"]')
    .sort_values('spend', ascending=False)
    .assign(spend=lambda df: (df['spend'] / 1000).round().astype(int))
    .round(2)
    .head(15)
)
metrics_df

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt

# Prepare visualization data
df = pd.DataFrame(class_dict).transpose().reset_index()
df = df.query('index not in ["accuracy", "weighted avg", "macro avg"]')
df = df.sort_values('support', ascending=False)

# Add total spend by category
spend_by_cat = pred_catboost_comp.groupby('actual_level_2')['total'].sum()
df['total_spend'] = df['index'].map(spend_by_cat).fillna(0)

# Truncate labels to max 20 characters
df['short_label'] = df['index'].str.slice(0, 20)

# Sort by total spend descending
df = df.sort_values('total_spend', ascending=False).reset_index(drop=True).head(10)

# Create visualization
sns.set_theme(style="whitegrid")
plt.figure(figsize=(12, 8))
barplot = sns.barplot(x='short_label', y='precision', data=df, color='#1B3139')
plt.xlabel('', fontsize=14)
plt.ylabel('Precision', fontsize=14)
plt.title('CatBoost Precision by Category (Borealis Wind)', fontsize=16)
plt.ylim(0.5, 1.0)

for spine in barplot.spines.values():
    spine.set_visible(False)

plt.xticks(rotation=-45, ha='left')
barplot.grid(False)

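# Add spend labels inside bars. df already has a clean 0..n-1 index; calling
# reset_index() again would fail because an 'index' column already exists.
for i, row in df.iterrows():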
    if row.total_spend > 0:
        barplot.text(i, row.precision * 0.85, f"${row.total_spend/1e6:.1f}M", 
                     color='white', ha="center", va='baseline', rotation=-90, 
                     fontsize=10, fontweight='bold')

plt.tight_layout()
plt.savefig('catboost.png', dpi=300)
plt.show()