## Spend Categorization - EDA

This notebook does basic Exploratory Data Analysis (EDA) for spend categorization, but mainly gets the data ready for the data science to follow.

Tested on Serverless v4.

In [0]:
%pip install uv
%uv pip install .
%restart_python

In [0]:
import pandas as pd
from src.utils import get_spark
from src.config import load_config

spark = get_spark()
config = load_config()

In [0]:
config.full_categories_table_path

In [0]:
# Widgets for SQL cells
dbutils.widgets.removeAll()
dbutils.widgets.text("catalog", config.catalog)
dbutils.widgets.text("schema", config.schema_name)
dbutils.widgets.text("invoices", config.invoices)
dbutils.widgets.text("categories", config.categories_table)

In [0]:
df = spark.sql(f"SELECT * FROM {config.full_invoices_table_path}").toPandas()
df['date'] = pd.to_datetime(df['date'])
df

We can see how categorization can quickly spiral out of control. In this small example, we have three level-one categories, each with between 4 and 18 level-two categories, and 102 unique level-three categories. This is a trivial example; real category trees are usually far larger and can be difficult to navigate.

In [0]:
df.groupby('category_level_1')['category_level_2'].nunique()

In [0]:
df.groupby('category_level_2')['category_level_3'].nunique()
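
As a rough cross-check of counts like the ones quoted above, the following sketch computes the number of distinct categories at each level, and the number of children per parent, on a tiny hand-made frame. The rows and category names are invented for illustration; only the three `category_level_*` column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy rows; the real data comes from the invoices table.
toy = pd.DataFrame(
    [
        ("Direct", "Raw Materials", "Steel"),
        ("Direct", "Raw Materials", "Aluminium"),
        ("Direct", "Components", "Bearings"),
        ("Indirect", "IT", "Laptops"),
        ("Indirect", "IT", "Software"),
        ("Indirect", "Facilities", "Cleaning"),
    ],
    columns=["category_level_1", "category_level_2", "category_level_3"],
)

# Distinct categories at each level of the hierarchy.
level_counts = toy.nunique()

# Number of distinct level-two categories under each level-one category.
children_per_parent = toy.groupby("category_level_1")["category_level_2"].nunique()

print(level_counts)
print(children_per_parent)
```

Running the same two expressions on `df` would give the overall totals alongside the per-parent counts computed in the cells above.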

In [0]:
# Spend over time
import matplotlib.pyplot as plt

# Sum spend per calendar month ('M' = month end; pandas >= 2.2 prefers the alias 'ME')
monthly_spend = df.resample('M', on='date')['total'].sum()

plt.figure(figsize=(10, 6))
monthly_spend.plot(kind='bar')
plt.title('Monthly Spend')
plt.xlabel('Month')
plt.ylabel('Spend ($)')
plt.show()

We are going to take a 500-row sample of our dataframe for consistent experimentation and profiling. But first we are going to build a combined order text field (for inference and vector search) along with an index.

In [0]:
df.iloc[0]

In [0]:
%sql
SELECT * FROM IDENTIFIER(:catalog || '.' || :schema || '.' || :invoices) LIMIT 10

In [0]:
%sql
SELECT 
  CONCAT_WS('\n', 
    supplier, supplier_country, order_id, description
  ) AS combined
FROM IDENTIFIER(:catalog || '.' || :schema || '.' || :invoices)
LIMIT 10

In [0]:
%sql
CREATE OR REPLACE TABLE IDENTIFIER(:catalog || '.' || :schema || '.combined') AS
SELECT 
  -- ids are unique but not guaranteed to be consecutive
  monotonically_increasing_id() AS id,
  CONCAT(
    'date: ', date, '\n',
    'order_id: ', order_id, '\n',
    'category_level_1: ', category_level_1, '\n',
    'category_level_2: ', category_level_2, '\n',
    'category_level_3: ', category_level_3, '\n',
    'cost_centre: ', cost_centre, '\n',
    'plant: ', plant, '\n',
    'plant_id: ', plant_id, '\n',
    'region: ', region, '\n',
    'supplier: ', supplier, '\n',
    'supplier_country: ', supplier_country, '\n',
    'description: ', description, '\n'
  ) AS combined
FROM IDENTIFIER(:catalog || '.' || :schema || '.' || :invoices)
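
The combined field can also be sketched in plain Python, which makes its layout easy to inspect outside the cluster. One thing to watch for in the SQL version: Spark's `CONCAT` returns `NULL` when any argument is `NULL`, so a single missing column would blank the whole `combined` value. The sketch below renders missing values as empty strings instead. That choice, the helper name `build_combined`, and the sample row are assumptions made for illustration, not part of the notebook.

```python
# Column order mirrors the CONCAT call in the SQL cell above.
FIELDS = [
    "date", "order_id",
    "category_level_1", "category_level_2", "category_level_3",
    "cost_centre", "plant", "plant_id", "region",
    "supplier", "supplier_country", "description",
]


def build_combined(row, fields=FIELDS):
    """Render a row as 'name: value' lines, one per field, each ending in a newline.

    Missing or None values become empty strings (an assumption of this sketch),
    so one absent column does not wipe out the whole text.
    """
    lines = []
    for name in fields:
        value = row.get(name)
        text = "" if value is None else str(value)
        lines.append(f"{name}: {text}")
    return "\n".join(lines) + "\n"


# Hypothetical row with a missing region.
example = {"supplier": "Acme GmbH", "supplier_country": "DE", "region": None}
print(build_combined(example, ["supplier", "supplier_country", "region"]))
```

In SQL, a comparable effect could be reached by wrapping each column in `COALESCE(col, '')`, if `NULL` values turn out to be present in the invoices table.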

Let's take 500 random rows for our test set and persist them.

In [0]:
%sql
CREATE OR REPLACE TABLE IDENTIFIER(:catalog || '.' || :schema || '.test') AS
SELECT * 
FROM IDENTIFIER(:catalog || '.' || :schema || '.combined')
ORDER BY RAND()
LIMIT 500
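
Because `ORDER BY RAND()` draws a different sample every time the cell runs, the consistency of the test set comes from persisting the table rather than from the query itself. If a repeatable draw is wanted, one option is to fix a seed. The sketch below shows the idea in plain Python; the helper name `sample_ids` and the seed value are assumptions for illustration. Spark SQL's `RAND` also accepts a seed argument, though whether a seeded sort stays stable across cluster configurations is not something this notebook verifies.

```python
import random


def sample_ids(ids, k, seed=42):
    """Draw k distinct ids without replacement using a fixed seed (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return sorted(rng.sample(list(ids), k))


# The same seed gives the same sample on every call.
first = sample_ids(range(10_000), 500)
second = sample_ids(range(10_000), 500)
print(first == second)
```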

Now we build the train dataset from the rows that are not in the test set; it will be used for both vector search and traditional ML. The query below keeps the rows in whatever order Spark returns them and does not add any further shuffling.

In [0]:
%sql
CREATE OR REPLACE TABLE IDENTIFIER(:catalog || '.' || :schema || '.train') AS
SELECT * FROM IDENTIFIER(:catalog || '.' || :schema || '.combined')
WHERE id NOT IN (SELECT id FROM IDENTIFIER(:catalog || '.' || :schema || '.test'))
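
The `NOT IN` filter above acts as an anti-join on `id`. A minimal sketch of the same logic in plain Python, with a check that train and test together partition the full set, could look like this. The helper names and the sample ids are hypothetical; the ids only imitate the gaps that `monotonically_increasing_id()` can leave between partitions.

```python
def anti_join(all_ids, test_ids):
    """Return the ids from all_ids that are absent from test_ids, preserving order."""
    excluded = set(test_ids)
    return [i for i in all_ids if i not in excluded]


def is_partition(all_ids, train_ids, test_ids):
    """True if train and test are disjoint and together cover all_ids exactly."""
    train, test = set(train_ids), set(test_ids)
    return train.isdisjoint(test) and (train | test) == set(all_ids)


# Hypothetical ids with a gap between partitions.
all_ids = [0, 1, 2, 8589934592, 8589934593]
test_ids = [1, 8589934592]
train_ids = anti_join(all_ids, test_ids)

print(train_ids)
print(is_partition(all_ids, train_ids, test_ids))
```

One caveat with `NOT IN` in SQL: if the subquery ever returned a `NULL` id, the filter would match no rows at all. The ids here are generated, so that should not occur, but a `LEFT ANTI JOIN` would avoid the issue.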