## Spend Categorization - Data Exploration
This notebook does some basic EDA for spend categorization, but its main job is to get the data ready for the data science that follows. The spend data (category_tree and order) was uploaded manually; in production this would be handled by a synced ingestion job using Lakeflow Connect to SAP.

In [0]:
import pandas as pd

columns_to_keep = [
    "contract_award_unique_key",
    "federal_action_obligation",
    "base_and_all_options_value",
    "action_date",
    "recipient_name_raw",
    "recipient_state_name",
    "recipient_city_name",
    "transaction_description",
    "product_or_service_code_description",
    "naics_description",
    "usaspending_permalink",
    "funding_agency_name",
    "funding_sub_agency_name",
]

df = pd.read_csv('us_tm_mfg_transactions.csv', low_memory=False)[columns_to_keep]
df['action_date'] = pd.to_datetime(df['action_date'])

There are 297 columns in the original data. Let's simplify the problem by keeping only a handful of critical ones:

- contract_award_unique_key
- federal_action_obligation
- base_and_all_options_value
- action_date
- recipient_name_raw
- recipient_state_name
- recipient_city_name
- transaction_description
- product_or_service_code_description
- naics_description
- usaspending_permalink
- funding_agency_name
- funding_sub_agency_name
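Since only a small subset of the columns is needed, the selection can also happen at parse time rather than after loading everything. The following is a minimal sketch, not the notebook's actual load step: the in-memory CSV and its `unused_column` are hypothetical stand-ins for the real export.

```python
import io
import pandas as pd

# Hypothetical three-column CSV standing in for the real 297-column export.
sample_csv = io.StringIO(
    "action_date,funding_agency_name,unused_column\n"
    "2023-01-15,Agency A,x\n"
    "2023-04-02,Agency B,y\n"
)

wanted = ["action_date", "funding_agency_name"]

# usecols drops unwanted columns while parsing;
# parse_dates converts the date column during the read.
small_df = pd.read_csv(sample_csv, usecols=wanted, parse_dates=["action_date"])
print(small_df.dtypes)
```

On a wide file this avoids materializing hundreds of unused columns in memory.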

Our target is going to be the top-level funding agency, of which there are 47 in this dataset, followed by the sub-agencies (most agencies have only one or two, but some have up to 15).

In [0]:
df.groupby('funding_agency_name')['funding_sub_agency_name'].nunique()
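To make the shape of that result concrete, here is a small sketch on made-up data. The agency and sub-agency names are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy data: two agencies, one of which has two sub-agencies.
toy_awards = pd.DataFrame({
    "funding_agency_name": ["Agency A", "Agency A", "Agency A", "Agency B"],
    "funding_sub_agency_name": ["A-1", "A-2", "A-1", "B-1"],
})

# Number of distinct top-level agencies (the primary target).
n_agencies = toy_awards["funding_agency_name"].nunique()

# Distinct sub-agencies per agency, largest first.
sub_counts = (
    toy_awards.groupby("funding_agency_name")["funding_sub_agency_name"]
    .nunique()
    .sort_values(ascending=False)
)
print(n_agencies)
print(sub_counts)
```

Sorting the counts makes it easy to spot the few agencies with many sub-agencies.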

In [0]:
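# spend over time, aggregated by calendar quarter
import matplotlib.pyplot as plt
import pandas as pd

# 'Q' buckets rows by quarter end (newer pandas versions prefer the alias 'QE')
quarterly_spend = df.resample('Q', on='action_date').federal_action_obligation.sum()

plt.figure(figsize=(10, 6))
quarterly_spend.plot(kind='bar')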
plt.title('Quarterly Spend for Awards')
plt.xlabel('Quarter')
plt.ylabel('Spend ($)')
plt.show()
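An alternative way to get the same quarterly totals is to group on a quarterly period instead of resampling. This is a sketch on hypothetical toy values, and it uses calendar quarters, which may not match the fiscal quarters used for reporting.

```python
import pandas as pd

# Hypothetical toy data: two awards in Q1 2023 and one in Q2 2023.
toy_spend = pd.DataFrame({
    "action_date": pd.to_datetime(["2023-01-10", "2023-03-31", "2023-04-01"]),
    "federal_action_obligation": [100.0, 50.0, 25.0],
})

# Convert each date to its calendar quarter, then sum obligations per quarter.
quarter = toy_spend["action_date"].dt.to_period("Q")
quarterly_totals = toy_spend.groupby(quarter)["federal_action_obligation"].sum()
print(quarterly_totals)
```

The resulting index holds period labels such as `2023Q1`, which also read more cleanly on a bar chart axis than full timestamps.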

In [0]:
# persist as a Type 0 spark table
( 
  spark.createDataFrame(df)
  .write.option("overwriteSchema", "true")
  .mode("overwrite")
  .saveAsTable("shm.spend.awards")
)

We are going to take a 500-row sample of our dataframe for consistent experimentation and profiling. But first we are going to build a combined text field for each award (for inference and vector search), along with an index.

In [0]:
df.iloc[0]

In [0]:
%sql
CREATE OR REPLACE TABLE shm.spend.awards_text AS
SELECT 
  monotonically_increasing_id() as id, 
  *, 
  round(coalesce(CAST(REGEXP_REPLACE(federal_action_obligation, ',', '') AS FLOAT),0.0),2) AS award,
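  -- CONCAT returns NULL if any argument is NULL, so default missing fields to ''
  CONCAT(
    '# Contract ', COALESCE(contract_award_unique_key, ''), '\n',
    '*date: ', COALESCE(CAST(action_date AS STRING), ''), '\n',
    '*obligation: ', COALESCE(CAST(federal_action_obligation AS STRING), ''), '\n',
    '*total value: ', COALESCE(CAST(base_and_all_options_value AS STRING), ''), '\n',
    '*recipient: ', COALESCE(recipient_name_raw, ''), '\n',
    '*location: ', COALESCE(recipient_city_name, ''), ', ', COALESCE(recipient_state_name, ''), '\n',
    '*transaction description: ', COALESCE(transaction_description, ''), '\n',
    '*product description: ', COALESCE(product_or_service_code_description, ''), '\n',
    '*naics description: ', COALESCE(naics_description, '')
  ) AS combined_award_text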
FROM shm.spend.awards
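The text layout can also be prototyped in pandas before running it at scale. The sketch below is an abbreviated, hypothetical analogue of the SQL (three fields instead of nine, made-up values). One detail worth checking: Spark SQL's `CONCAT` returns NULL when any argument is NULL, so this version substitutes empty strings for missing values.

```python
import pandas as pd

def build_award_text(row: pd.Series) -> str:
    """Render one award row as labeled lines; missing values become empty strings."""
    def field(col: str) -> str:
        value = row.get(col)
        return "" if pd.isna(value) else str(value)

    return "\n".join([
        "# Contract " + field("contract_award_unique_key"),
        "*recipient: " + field("recipient_name_raw"),
        "*transaction description: " + field("transaction_description"),
    ])

# Hypothetical row with a missing transaction description.
example_row = pd.Series({
    "contract_award_unique_key": "KEY_001",
    "recipient_name_raw": "EXAMPLE CORP",
    "transaction_description": None,
})
award_text = build_award_text(example_row)
print(award_text)
```

Applied with `df.apply(build_award_text, axis=1)`, this yields one text block per row, with the label kept even when the value is missing.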

Let's take 500 random rows for our test set and persist them.

In [0]:
%sql
CREATE OR REPLACE TABLE shm.spend.test AS
SELECT * 
FROM shm.spend.awards_text
ORDER BY RAND()
LIMIT 500

Now we will build our train dataset, which will be used for both vector search and traditional ML. We also randomize to make it a bit more realistic.

In [0]:
%sql
CREATE OR REPLACE TABLE shm.spend.train AS
SELECT * FROM shm.spend.awards_text
WHERE id NOT IN (SELECT id FROM shm.spend.test)
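Because `ORDER BY RAND()` is called without a seed, re-running the test cell draws a different sample. For local experiments where a repeatable split matters, a seeded pandas analogue might look like the following sketch. The toy frame, the sample size of 3, and the seed value are all hypothetical.

```python
import pandas as pd

# Hypothetical toy table with a unique id per row, mirroring awards_text.
toy_text = pd.DataFrame({"id": range(10), "award": [float(i) for i in range(10)]})

# A fixed random_state makes the test sample reproducible across runs.
toy_test = toy_text.sample(n=3, random_state=42)

# Anti-join on id: the train set is everything not drawn into the test set.
toy_train = toy_text[~toy_text["id"].isin(toy_test["id"])]

print(len(toy_test), len(toy_train))
```

The `isin` filter plays the same role as the `NOT IN` subquery above, so the two sets never share an `id`.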