## Spend Categorization - CatBoost
This notebook trains a CatBoost model on the existing spend data. The goal is to replicate the historical categorization and to keep rebuilding the model from the corrected 'gold' categorization table, so that when managers correct spend categories, the corrections are automatically reflected in the model.

This notebook requires a decently sized classic cluster to run and will not work on Serverless due to the size of the dataset (800k rows).

In [0]:
%pip install catboost
%restart_python

In [0]:
import mlflow
from catboost import CatBoostClassifier, Pool, cv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from mlflow.models import infer_signature
import pandas as pd

In [0]:
train_pd = spark.table('shm.spend.train').toPandas()
train_pd['year'] = train_pd.action_date.dt.year
train_pd['month'] = train_pd.action_date.dt.month

This is traditional machine learning: we use all the text columns from orders as features and the category levels as targets. It adds some complexity because the targets are sequential, so we need one model per target. Each model's mistakes also feed into the next level (error propagation), so we must watch for that.
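
To make the error-propagation concern concrete, the sketch below breaks Level 2 accuracy down by whether the Level 1 prediction was right. It is a minimal, pure-Python illustration; the function name `chained_accuracy` and the toy rows are assumptions made up for this example and are not part of the notebook's data.

```python
def chained_accuracy(rows):
    """Summarize accuracy for a two-level chain of classifiers.

    rows: iterable of (true_lvl1, pred_lvl1, true_lvl2, pred_lvl2) tuples.
    Returns a dict of accuracy figures (None when a group is empty).
    """
    rows = list(rows)
    n = len(rows)
    lvl1_hits = [r for r in rows if r[0] == r[1]]
    lvl1_misses = [r for r in rows if r[0] != r[1]]

    def lvl2_acc(group):
        if not group:
            return None
        return sum(r[2] == r[3] for r in group) / len(group)

    return {
        "lvl1_accuracy": len(lvl1_hits) / n if n else None,
        "lvl2_accuracy": lvl2_acc(rows),
        "lvl2_accuracy_given_lvl1_correct": lvl2_acc(lvl1_hits),
        "lvl2_accuracy_given_lvl1_wrong": lvl2_acc(lvl1_misses),
    }


# Hypothetical toy rows:
# (true agency, predicted agency, true sub-agency, predicted sub-agency)
toy_rows = [
    ("DOD", "DOD", "Army", "Army"),
    ("DOD", "DOD", "Navy", "Navy"),
    ("DOD", "DOD", "Navy", "Army"),
    ("HHS", "DOD", "NIH", "Army"),  # a Level 1 miss that also costs Level 2
    ("HHS", "HHS", "NIH", "NIH"),
]
summary = chained_accuracy(toy_rows)
```

A large gap between the two conditional figures would suggest that most Level 2 errors are inherited from Level 1 rather than made by the Level 2 model itself.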

In [0]:
from sklearn.preprocessing import LabelEncoder

cat_feature_cols = [
    'contract_award_unique_key', 'action_date', 'recipient_name_raw',
    'recipient_state_name', 'recipient_city_name',
    'transaction_description', 'product_or_service_code_description',
    'naics_description', 'usaspending_permalink', 
  ]

num_feature_cols = [
  'federal_action_obligation',
  'base_and_all_options_value', 
]

for col_name in cat_feature_cols:
  train_pd[col_name] = train_pd[col_name].astype(str)

targets = [
  'funding_agency_name', 'funding_sub_agency_name'
  ]

encoders = {}

for targ_name in targets:
  train_pd[targ_name] = train_pd[targ_name].astype(str)

X = train_pd[cat_feature_cols + num_feature_cols]
y = train_pd[targets]

# We use the term 'validation' because we have a hold out test set already
X_train, X_val, y_train, y_val = train_test_split(
  X, y, test_size=0.2, random_state=42
  )

# Keep train / test splits consistent
y_train_lvl_1 = y_train['funding_agency_name']
y_val_lvl_1 = y_val['funding_agency_name']
y_train_lvl_2 = y_train['funding_sub_agency_name']
y_val_lvl_2 = y_val['funding_sub_agency_name']

## Level 1 Model
This is our first model. Because the dataset is large, we keep training short (the parameters below cap it at 1,000 iterations with early stopping), but we could easily train longer and tune hyperparameters to improve performance. The `Pool` is CatBoost's dataset container: it bundles the features, labels, and categorical column names so CatBoost can process them efficiently, loosely analogous to tensors in PyTorch.

In [0]:
mlflow.set_registry_uri('databricks-uc')
train_lvl_1 = False
train_lvl_2 = False

In [0]:
if train_lvl_1:
    with mlflow.start_run():
        mlflow.sklearn.autolog()

        params = {
            'iterations': 1000,
            'learning_rate': 0.1,
            'depth': 6,
            'l2_leaf_reg': 3,
            'colsample_bylevel': 0.8,
            'early_stopping_rounds': 50,
            'verbose': 100
        }

        model_lvl1 = CatBoostClassifier(**params)

        # Create CatBoost Pool for efficient memory handling
        train_pool = Pool(X_train, y_train_lvl_1, cat_features=cat_feature_cols)
        val_pool = Pool(X_val, y_val_lvl_1, cat_features=cat_feature_cols)

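        # task_type is a CatBoostClassifier constructor parameter, not a fit()
        # argument; to train on GPU, add 'task_type': 'GPU' to params above.
        model_lvl1.fit(
            train_pool,
            eval_set=val_pool,
            use_best_model=True
        )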

        mlflow.catboost.log_model(
            cb_model=model_lvl1,
            artifact_path="model",
            registered_model_name="shm.spend.agency_model",
            signature=mlflow.models.infer_signature(
                X_train, 
                model_lvl1.predict(X_train)
                )
        )
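
The `early_stopping_rounds` and `use_best_model` settings in the cell above decide how many of the requested iterations are actually kept. The sketch below is a simplified, pure-Python illustration of patience-based early stopping, not CatBoost's implementation; `best_iteration_with_patience` and the loss values are hypothetical.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_iteration, stopped_at) for a sequence of validation losses.

    Training stops once `patience` iterations have passed without improving
    on the best loss seen so far. Keeping only the trees up to
    `best_iteration` is the idea behind use_best_model=True.
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, len(val_losses) - 1


# Hypothetical validation losses: improvement stalls after iteration 3.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59]
best_iter, stopped_at = best_iteration_with_patience(losses, patience=3)
```

With a patience of 3, the loop stops at iteration 6 and reports iteration 3 as the best one, so the last three iterations would be discarded.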

Here we log the model so we can reload it for inference and downstream tasks. We can also serve the model to simplify inference. In practice, we would log all four models (Level 1 through Level 4) and wrap them in a single `pyfunc` model class behind a serving endpoint, so callers always get the latest model without changing their code.
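
As a rough illustration of that wrapper idea, the sketch below chains per-level models so that each level receives the previous level's prediction as an extra feature. It is plain Python rather than an actual `mlflow.pyfunc.PythonModel` subclass, and `ChainedClassifier`, `LookupModel`, and the toy feature values are hypothetical stand-ins; a real wrapper would load the registered CatBoost models instead.

```python
class LookupModel:
    """Stand-in for a trained model: maps one feature value to a label."""

    def __init__(self, feature, mapping, default="UNKNOWN"):
        self.feature = feature
        self.mapping = mapping
        self.default = default

    def predict(self, rows):
        return [self.mapping.get(row[self.feature], self.default) for row in rows]


class ChainedClassifier:
    """Runs models in hierarchy order, feeding each prediction to the next level."""

    def __init__(self, levels):
        # levels: list of (output_column_name, model) pairs, top level first
        self.levels = levels

    def predict(self, rows):
        rows = [dict(row) for row in rows]  # copy so the caller's rows are untouched
        for column, model in self.levels:
            for row, label in zip(rows, model.predict(rows)):
                row[column] = label
        return rows


chain = ChainedClassifier([
    ("funding_agency_name",
     LookupModel("naics_description", {"Shipbuilding": "DOD"})),
    ("funding_sub_agency_name",
     LookupModel("funding_agency_name", {"DOD": "Navy"})),
])
result = chain.predict([
    {"naics_description": "Shipbuilding"},
    {"naics_description": "Other"},
])
```

Callers pass raw features once and get every level back, which keeps the level ordering and feature augmentation in one place.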

In [0]:
import mlflow

# Test our model reload
model_lvl1 = mlflow.catboost.load_model("models:/shm.spend.agency_model/1")
model_lvl1.predict(X_val)

## Level 2 Model
Same story as above, except that we first take the Level 1 prediction and feed it into the Level 2 model as a feature. This is what creates the error propagation.

In [0]:
# Get predictions and probabilities for Level 1
pred_train_lvl1 = model_lvl1.predict(X_train)
proba_train_lvl1 = model_lvl1.predict_proba(X_train)
pred_val_lvl1 = model_lvl1.predict(X_val)
proba_val_lvl1 = model_lvl1.predict_proba(X_val)

In [0]:
# Augment the training and validation data
X_train_lvl_2 = X_train.copy()
X_train_lvl_2['funding_agency_name'] = pred_train_lvl1.ravel()

X_val_lvl_2 = X_val.copy()
X_val_lvl_2['funding_agency_name'] = pred_val_lvl1.ravel()

In [0]:
train_lvl_2 = False

In [0]:
if train_lvl_2:
    with mlflow.start_run():
        mlflow.sklearn.autolog()
        
        # Filter validation set to only include classes present in training set
        train_classes = set(y_train_lvl_2.unique())
        val_mask = y_val_lvl_2.isin(train_classes)
        
        X_val_lvl_2_filtered = X_val_lvl_2[val_mask]
        y_val_lvl_2_filtered = y_val_lvl_2[val_mask]
        
        print(f"Filtered out {(~val_mask).sum()} validation samples with unseen classes")
        print(f"Validation set size: {len(y_val_lvl_2_filtered)} samples")

        params = {
            'loss_function': 'MultiClass',
            'iterations': 200,
            'learning_rate': 0.3,
            'depth': 4,
            'l2_leaf_reg': 3,
            'colsample_bylevel': 0.5,
            'early_stopping_rounds': 20,
            'bootstrap_type': 'Bernoulli',
            'subsample': 0.7,
            'one_hot_max_size': 4,
            'border_count': 64,
            'verbose': 100
        }

        model_lvl2 = CatBoostClassifier(**params)

        # Create CatBoost Pool for efficient memory handling
        train_pool = Pool(
            X_train_lvl_2, 
            y_train_lvl_2, 
            cat_features=cat_feature_cols + ['funding_agency_name']
        )

        val_pool = Pool(
            X_val_lvl_2_filtered, 
            y_val_lvl_2_filtered, 
            cat_features=cat_feature_cols + ['funding_agency_name']
        )

        model_lvl2.fit(
            train_pool,
            eval_set=val_pool,
            use_best_model=True
        )

        mlflow.catboost.log_model(
            cb_model=model_lvl2,
            artifact_path="model",
            registered_model_name="shm.spend.subagency_model",
            signature=mlflow.models.infer_signature(
                X_val_lvl_2_filtered, 
                model_lvl2.predict(X_val_lvl_2_filtered)
            )
        )
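
The unseen-class filter in the cell above exists because a random split can leave a rare sub-agency only in the validation fold. A minimal pure-Python sketch of the same idea follows; `filter_to_seen_classes` and the toy labels are hypothetical and only illustrate the masking step.

```python
def filter_to_seen_classes(train_labels, val_rows, val_labels):
    """Keep only validation rows whose label appears in the training labels."""
    seen = set(train_labels)
    kept = [(row, label) for row, label in zip(val_rows, val_labels) if label in seen]
    dropped = len(val_labels) - len(kept)
    kept_rows = [row for row, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_rows, kept_labels, dropped


# Hypothetical labels: "Coast Guard" never appears in the training fold.
train_labels = ["Army", "Navy", "Army", "NIH"]
val_rows = [{"id": 1}, {"id": 2}, {"id": 3}]
val_labels = ["Navy", "Coast Guard", "NIH"]

kept_rows, kept_labels, dropped = filter_to_seen_classes(
    train_labels, val_rows, val_labels
)
```

The dropped rows are ones the model cannot classify correctly, so the dropped count is worth tracking alongside the reported validation metrics.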

## Evaluation
Here we look at the validation predictions for accuracy, first predicting Level 1 and then Level 2. In production we would have four levels of prediction and wrap some of the boilerplate in a library. For now, we reload the models (registry aliases would also work) and make predictions.

In [0]:
mlflow.set_registry_uri('databricks-uc')
model_lvl1 = mlflow.catboost.load_model("models:/shm.spend.agency_model/1")
model_lvl2 = mlflow.catboost.load_model("models:/shm.spend.subagency_model/1")

In [0]:
X_val_eval = X_val.copy()
X_val_eval['agency'] = model_lvl1.predict(X_val).ravel()

# Add the level 1 prediction as a feature for level 2 model
X_val_eval_lvl2 = X_val.copy()
X_val_eval_lvl2['funding_agency_name'] = X_val_eval['agency']

# Create Pool with categorical features for level 2 prediction
val_pool_eval = Pool(
    X_val_eval_lvl2,
    cat_features=cat_feature_cols + ['funding_agency_name']
)

X_val_eval['subagency'] = model_lvl2.predict(val_pool_eval).ravel()

In [0]:
X_val_eval.head(5)

The real test is the holdout test set. We load it and run the trained models on data they have never seen to get a more reliable accuracy estimate. If the models are not overfit, we expect out-of-sample performance similar to the validation results above.

In [0]:
test_pd = spark.table('shm.spend.test').toPandas()
test_pd['year'] = test_pd.action_date.dt.year
test_pd['month'] = test_pd.action_date.dt.month

In [0]:
X_test = test_pd[cat_feature_cols + num_feature_cols].copy()

# Convert categorical features to string (same as training)
for col_name in cat_feature_cols:
    X_test[col_name] = X_test[col_name].astype(str)

# Now make predictions
X_test['funding_agency_name'] = model_lvl1.predict(X_test).ravel()

# Create Pool for level 2 prediction with categorical features specified
test_pool_lvl2 = Pool(
    X_test,
    cat_features=cat_feature_cols + ['funding_agency_name']
)

X_test['funding_sub_agency_name'] = model_lvl2.predict(test_pool_lvl2).ravel()

In [0]:
(
    spark.createDataFrame(X_test)
    .write
    .mode("overwrite")
    .saveAsTable("shm.spend.pred_catboost")
)

In [0]:
# display() returns None, so keep the DataFrame reference before displaying it
pred_vs_df = spark.sql("SELECT * FROM shm.spend.pred_catboost")
pred_vs_df.display()

In [0]:
%sql
CREATE OR REPLACE TABLE shm.spend.pred_catboost_comp AS
SELECT
  t.*,
  p.funding_agency_name as agency,
  p.funding_sub_agency_name as subagency,
  t.federal_action_obligation as spend
FROM 
  shm.spend.pred_catboost p
JOIN
  shm.spend.test t
ON 
  t.contract_award_unique_key = p.contract_award_unique_key

In [0]:
import pandas as pd
pred_catboost_comp = spark.table('shm.spend.pred_catboost_comp').dropna(
    subset=['funding_agency_name', 'agency', 'funding_sub_agency_name', 'subagency']
).toPandas()

The CatBoost model trained on only 7,000 rows but was still able to improve accuracy to 94% on Level 1 and 90% on Level 2. There is a lot more optimization that could be done here.

In [0]:
from sklearn.metrics import accuracy_score, classification_report

print(f"""Agency Accuracy: {accuracy_score(
  pred_catboost_comp['funding_agency_name'], 
  pred_catboost_comp['agency']
  ):0.3f}""")

print(f"""Subagency Accuracy: {accuracy_score(
  pred_catboost_comp['funding_sub_agency_name'], 
  pred_catboost_comp['subagency']
  ):0.3f}""")

In [0]:
import pandas as pd

class_dict = classification_report(
  y_true = pred_catboost_comp['funding_agency_name'], 
  y_pred = pred_catboost_comp['agency'],
  sample_weight=pred_catboost_comp['spend'].abs(),
  output_dict=True
  )
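
Passing `sample_weight` to `classification_report` makes each row count in proportion to its absolute spend instead of counting once. The sketch below recomputes per-class precision by hand to show the effect; `weighted_precision` and the toy values are hypothetical and are not taken from the notebook's data.

```python
from collections import defaultdict


def weighted_precision(y_true, y_pred, weights):
    """Per-class precision where each row counts in proportion to its weight."""
    predicted_weight = defaultdict(float)  # total weight predicted as each class
    correct_weight = defaultdict(float)    # weight predicted as the class and correct
    for truth, pred, w in zip(y_true, y_pred, weights):
        predicted_weight[pred] += w
        if truth == pred:
            correct_weight[pred] += w
    return {
        cls: correct_weight[cls] / total
        for cls, total in predicted_weight.items()
        if total > 0
    }


y_true = ["DOD", "DOD", "HHS", "HHS"]
y_pred = ["DOD", "DOD", "DOD", "HHS"]
spend = [100.0, 300.0, 600.0, 50.0]  # hypothetical absolute spend per row

unweighted = weighted_precision(y_true, y_pred, [1.0] * len(y_true))
by_spend = weighted_precision(y_true, y_pred, spend)
```

In this toy case one misclassified high-dollar row pulls the spend-weighted precision for `DOD` well below the row-count figure, which is the behavior the spend-weighted report is meant to surface.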

In [0]:
display(
    pd.DataFrame(class_dict)
      .transpose()
      .reset_index()
      .drop(columns=['f1-score'], errors='ignore')
      .rename(columns={'support': 'spend'})
      .sort_values('spend', ascending=False)
      .query('index != ["accuracy", "weighted avg", "macro avg"]')
      .assign(spend=lambda df: (df['spend'] / 1000).round().astype(int))
      .round(2)
      .head(12)
)

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt

# Prepare the data
df = pd.DataFrame(class_dict).transpose().reset_index()
df = df.query('index != ["accuracy", "weighted avg", "macro avg"]')
df = df.sort_values('support', ascending=False)

# Add total spend as a label on each bar
spend_by_agency = pred_catboost_comp.groupby('funding_agency_name')['spend'].sum().round(0)
df['total_spend'] = df['index'].map(spend_by_agency)

# Truncate labels to max 20 characters
df['short_label'] = df['index'].str.slice(0, 20)

# Sort by total spend descending
df = df.sort_values('total_spend', ascending=False).reset_index(drop=True).head(9)

# Set the theme to minimal
sns.set_theme(style="whitegrid")

# Create the bar plot for precision with total spend labels inside bars
plt.figure(figsize=(12, 8))  # Increased height for taller bars
barplot = sns.barplot(x='short_label', y='precision', data=df, color='#1B3139')
plt.xlabel('', fontsize=14)
plt.ylabel('Precision', fontsize=14)
plt.ylim(0.5,1.0)

for spine in barplot.spines.values():
    spine.set_visible(False)

# Rotate x labels for better readability
plt.xticks(rotation=-60, ha='left')

barplot.grid(False)

# Add rotated spend labels inside each bar
for index, row in df.iterrows():
    barplot.text(index, row.precision*0.8, f"${row.total_spend:,.0f}", color='white', ha="center", va='baseline', rotation=-90, fontsize=12, fontweight='bold')

plt.tight_layout()

plt.savefig('catboost.png', dpi=600)