## Spend Categorization - Data Exploration
This notebook performs some basic exploratory data analysis (EDA) for spend categorization, but its main purpose is to prepare the data for the data science work that follows. The spend data (category_tree and order) was uploaded manually here; in practice, this would be handled by a synced ingestion job using Lakeflow Connect to SAP.

In [0]:
import pandas as pd
df = spark.sql("SELECT * FROM shm.spend.transactions_enh").toPandas()
df['date'] = pd.to_datetime(df['date'])

In [0]:
df.head(5)

We can see how categorization can quickly spiral out of control. Even in this small example, we have three level-one categories, between 4 and 18 level-two categories per level-one category, and 102 unique level-three categories. This is a trivial example; real category trees usually have far more categories and can be difficult to navigate.

In [0]:
df.groupby('category_level_1')['category_level_2'].nunique()

In [0]:
df.groupby('category_level_2')['category_level_3'].nunique()
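
The counts quoted above can be collected in one place. The following is a minimal sketch on a hypothetical toy DataFrame (the category names and the `toy` and `summary` variables are invented for illustration); the same expressions should work on `df` as long as it has the three category columns.

```python
import pandas as pd

# Hypothetical toy data; the real table has many more rows and categories.
toy = pd.DataFrame({
    "category_level_1": ["Direct", "Direct", "Direct", "Indirect", "Indirect"],
    "category_level_2": ["Raw Materials", "Raw Materials", "Packaging", "IT", "IT"],
    "category_level_3": ["Steel", "Aluminium", "Cartons", "Laptops", "Software"],
})

# Number of distinct level-two categories under each level-one category.
level_2_per_level_1 = toy.groupby("category_level_1")["category_level_2"].nunique()

summary = {
    "level_1": int(toy["category_level_1"].nunique()),
    "level_2_min": int(level_2_per_level_1.min()),
    "level_2_max": int(level_2_per_level_1.max()),
    "level_3": int(toy["category_level_3"].nunique()),
}
print(summary)
```

Reporting the minimum and maximum per level-one category gives the "between X and Y" range directly, rather than reading it off the groupby output by eye.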

In [0]:
# Spend over time
import matplotlib.pyplot as plt

# Sum the `total` column per calendar month.
# 'M' is the month-end frequency alias; pandas 2.2+ prefers 'ME'.
monthly_spend = df.resample('M', on='date')['total'].sum()

plt.figure(figsize=(10, 6))
monthly_spend.plot(kind='bar')
plt.title('Monthly Spend')
plt.xlabel('Month')
plt.ylabel('Spend ($)')
plt.show()

We are going to take a 500-row sample of our data for consistent experimentation and profiling. First, though, we will build a combined order text field (for inference and vector search) together with a row index (id).

In [0]:
df.iloc[0]

In [0]:
%sql
SELECT * FROM shm.spend.transactions_gen

In [0]:
%sql
CREATE OR REPLACE TABLE shm.spend.combined AS
SELECT 
  monotonically_increasing_id() as id,
  *,
  CONCAT(
    :prompt, '\n',
    'date: ', date, '\n',
    'order_id: ', order_id, '\n',
    'category_level_1: ', category_level_1, '\n',
    'category_level_2: ', category_level_2, '\n',
    'category_level_3: ', category_level_3, '\n',
    'cost_centre: ', cost_centre, '\n',
    'plant: ', plant, '\n',
    'plant_id: ', plant_id, '\n',
    'region: ', region, '\n',
    'supplier: ', supplier, '\n',
    'supplier_country: ', supplier_country, '\n',
    'description: ', description, '\n'
  ) AS combined
FROM shm.spend.transactions_enh
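
To make the shape of the combined field concrete, here is a minimal plain-Python sketch that builds the same kind of text: the prompt followed by one `field: value` line per column. The `build_combined` helper, the example prompt, and the example row are all hypothetical. One difference is deliberate: Spark's `CONCAT` returns NULL if any argument is NULL, whereas this sketch assumes a missing value should be rendered as an empty string.

```python
# Column order mirrors the CONCAT call in the SQL cell above.
FIELDS = [
    "date", "order_id", "category_level_1", "category_level_2",
    "category_level_3", "cost_centre", "plant", "plant_id", "region",
    "supplier", "supplier_country", "description",
]


def build_combined(row: dict, prompt: str) -> str:
    """Return the prompt followed by one 'field: value' line per column."""
    lines = [prompt]
    for field in FIELDS:
        value = row.get(field)
        # Assumption: a missing value is rendered as an empty string.
        lines.append(f"{field}: {'' if value is None else value}")
    return "\n".join(lines) + "\n"


# Hypothetical prompt and row values, for illustration only.
example_prompt = "Classify the following order:"
example_row = {
    "date": "2024-01-15",
    "order_id": "ORD-0001",
    "category_level_1": "Indirect",
    "category_level_2": "IT",
    "category_level_3": "Laptops",
    "cost_centre": "CC-100",
    "plant": "Example Plant",
    "plant_id": "P-01",
    "region": None,  # missing on purpose
    "supplier": "Example Supplier",
    "supplier_country": "DE",
    "description": "10 x laptop",
}

combined_text = build_combined(example_row, example_prompt)
print(combined_text)
```

On a pandas DataFrame, the same helper could be applied row by row, for example with `df.apply(lambda r: build_combined(r.to_dict(), example_prompt), axis=1)`.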

Let's take 500 random rows as our test set and persist them as a table.

In [0]:
%sql
CREATE OR REPLACE TABLE shm.spend.test AS
SELECT * 
FROM shm.spend.combined
ORDER BY RAND()
LIMIT 500
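
Note that `ORDER BY RAND()` without a seed draws a different sample each time the cell is re-run. If a repeatable split is wanted, one option is to fix a seed. The sketch below shows the idea in pandas on a toy stand-in for the combined table; the `combined_pdf` frame and the seed value are assumptions made for illustration, not part of the pipeline above. Spark SQL's `RAND` also accepts a seed argument, although repeatability there can still depend on how the data is partitioned.

```python
import pandas as pd

# Toy stand-in for shm.spend.combined; the real table has many more columns.
combined_pdf = pd.DataFrame({"id": range(2000), "total": [10.0] * 2000})

SEED = 42  # arbitrary choice; any fixed integer gives a repeatable sample

# Draw the 500-row test set, then keep every other row for training.
test_pdf = combined_pdf.sample(n=500, random_state=SEED)
train_pdf = combined_pdf[~combined_pdf["id"].isin(test_pdf["id"])]

# Sampling again with the same seed selects the same rows.
test_again = combined_pdf.sample(n=500, random_state=SEED)
same_rows = test_pdf["id"].tolist() == test_again["id"].tolist()

print(len(test_pdf), len(train_pdf), same_rows)
```

Filtering on `id` keeps the two sets disjoint, so no test row can leak into the data used for vector search or model training.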

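Now we will make our train dataset from the rows that are not in the test set. It will be used for both vector search and traditional ML. Because the test rows were drawn at random, the remaining train rows are a random subset as well, which makes the setup a bit more realistic.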

In [0]:
%sql
CREATE OR REPLACE TABLE shm.spend.train AS
SELECT * FROM shm.spend.combined
WHERE id NOT IN (SELECT id FROM shm.spend.test)