## Spend Categorization - Naive LLM Calls
This notebook takes a naive approach to predicting the first two categories (agency and subagency) using only prompt engineering. It provides a baseline that uses naive batch inference to solve the problem.

In [0]:
awards = spark.sql("SELECT * FROM shm.spend.awards_text").toPandas()

In [0]:
awards.iloc[0].combined_award_text

In [0]:
awards.columns

In [0]:
df = spark.table("shm.spend.awards").select("funding_agency_name", "funding_sub_agency_name").distinct().toPandas()

hierarchy = "# Agency Hierarchy\n\n"
for agency in sorted(df['funding_agency_name'].dropna().unique()):
    hierarchy += f"## {agency}\n"
    subagencies = df[df['funding_agency_name'] == agency]['funding_sub_agency_name'].dropna().unique()
    for subagency in sorted(subagencies):
        hierarchy += f"- {subagency}\n"
    hierarchy += "\n"

with open("agency_hierarchy.md", "w") as f:
    f.write(hierarchy)

hierarchy

This cell builds a markdown string listing every agency and its subagencies. Because it always reads from the awards table, it is always up to date. It could also be pointed at a category tree for hierarchical spend classification.

This could also be a function used as an agent tool.
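
As a rough illustration of that idea, here is a minimal sketch that wraps the same logic in a reusable function. The name `build_agency_hierarchy` and the choice to accept plain `(agency, subagency)` pairs instead of a DataFrame are assumptions made for this example, not part of the original notebook.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple


def _is_missing(value) -> bool:
    # Treat None and NaN (NaN is the only value not equal to itself) as missing.
    return value is None or value != value


def build_agency_hierarchy(
    pairs: Iterable[Tuple[Optional[str], Optional[str]]]
) -> str:
    """Render (agency, subagency) pairs as a markdown hierarchy string."""
    tree = defaultdict(set)
    for agency, subagency in pairs:
        if _is_missing(agency):
            continue  # mirrors dropna() on the agency column
        tree[agency]  # register the agency even if it has no subagencies
        if not _is_missing(subagency):
            tree[agency].add(subagency)

    lines = ["# Agency Hierarchy", ""]
    for agency in sorted(tree):
        lines.append(f"## {agency}")
        lines.extend(f"- {sub}" for sub in sorted(tree[agency]))
        lines.append("")  # blank line after each agency block
    return "\n".join(lines) + "\n"
```

With the `df` from the cell above, something like `build_agency_hierarchy(df.itertuples(index=False, name=None))` should produce the same string as the loop, since `df` holds the agency and subagency columns in that order.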

Next, we write a short prompt for our model. It could definitely be improved, but not by nearly enough to reach acceptable accuracy!

In [0]:
prompt = """Use the following agency hierarchy and return the agency and sub agency for the award below. Return a json output with only the agencies. You must use the agencies and subagencies from the hierarchy, pick the best ones.

For example:
'# Contract CONT_IDV_05GA0A17A0017_0559\n*date: 2017-03-30 00:00:00\n*obligation: 0.0\n*total value: 2500000.0\n*recipient: SIGNET TECHNOLOGIES INCORPORATED\n*location: BELTSVILLE, MARYLAND\n*transaction description: IGF::CT::IGF  PROVIDE PREVENTIVE, NORMAL, AND EMERGENCY MAINTENANCE ON ALL COMPONENTS OF THE INTEGRATED ELECTRONIC SECURITY SYSTEM (IESS) AT THE GAO HEADQUARTERS AND 11 FIELD OFFICES.  IN ADDITION TO THE MAINTENANCE, THE CONTRACTOR WILL PROVIDE SUPPORT ON AN AS NEEDED BASIS FOR THE INSTALLATION AND/OR UPGRADE OF IESS COMPONENTS.\n*product description: MAINT/REPAIR/REBUILD OF EQUIPMENT- ALARM, SIGNAL, AND SECURITY DETECTION SYSTEMS\n*naics description: AUTOMATIC ENVIRONMENTAL CONTROL MANUFACTURING FOR RESIDENTIAL, COMMERCIAL, AND APPLIANCE USE'

{"agency": "Government Accountability Office", "subagency": "GAO, Except Comptroller General"}
"""

We set up widgets so that the category tree and the prompt can be passed into our batch inference as parameters.

We use AI_QUERY to run batch inference. The call itself is a SQL expression, wrapped here in `F.expr`. I build the concatenated prompt as its own `full_prompt` column and also select it, just so I can inspect the combined prompt that went into the model. We also use `responseFormat` to enforce structured outputs. This is critical for the consistency and maintainability of generative AI solutions; I wouldn't leave POC without it. It's worth pointing out that, because of the optimizations in AI_QUERY, 500 calls to the LLM take only 20 seconds.

In [0]:
from pyspark.sql import functions as F

pred_naive_df = (
    spark.table("shm.spend.test")
    .withColumn(
        "full_prompt",
        F.concat(
            F.lit(prompt), 
            F.lit('\n'),
            F.lit(hierarchy), 
            F.lit('\n'),
            F.col("combined_award_text")
        )
    )
    .select(
        F.col("id"),
        F.col("full_prompt").alias("prompt"),
        F.expr("""
            AI_QUERY(
                'databricks-meta-llama-3-3-70b-instruct',
                full_prompt,
                responseFormat => '{
                    "type": "json_schema",
                    "json_schema": {
                        "name": "categorization",
                        "schema": {
                            "type": "object",
                            "properties": {
                                "agency": {"type": "string"},
                                "subagency": {"type": "string"}
                            }
                        }
                    }
                }'
            )
        """).alias("llm_output")
    )
)

pred_naive_df.write.mode("overwrite").saveAsTable("shm.spend.pred_naive")
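
The `responseFormat` string above is hand-written JSON inside a SQL string inside a Python string, which is easy to get subtly wrong. One option, sketched here under the assumption that the schema only ever needs flat string fields, is to build it from a Python dict. The helper name `build_response_format` is hypothetical; the structure it emits matches the literal used in the cell above.

```python
import json


def build_response_format(fields, name="categorization"):
    """Return a JSON string describing an object with one string property per field."""
    schema = {
        "type": "object",
        "properties": {field: {"type": "string"} for field in fields},
    }
    # json.dumps emits double-quoted keys and values, so the result can sit
    # inside a single-quoted SQL literal as long as no name contains a single quote.
    return json.dumps({
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema},
    })


response_format = build_response_format(["agency", "subagency"])
```

The resulting string could then be interpolated into the `F.expr` call in place of the literal, which keeps the field list in one place if more category levels are added later.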

The second step deconstructs that JSON output and pulls in the actual labels for evaluation. This could be done in the first SQL call, but it was getting long.

In [0]:
%sql
SELECT * FROM shm.spend.pred_naive LIMIT 5

In [0]:
%sql
CREATE OR REPLACE TABLE shm.spend.pred_naive_comp AS
SELECT
  p.*,
  agency,
  subagency,
  t.funding_agency_name,
  t.funding_sub_agency_name
FROM 
  shm.spend.pred_naive p
JOIN
  shm.spend.test t
ON 
  t.id = p.id
LATERAL VIEW 
  JSON_TUPLE(p.llm_output, 'agency', 'subagency') AS agency, subagency
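
For spot-checking individual rows outside SQL, the same extraction can be approximated in plain Python. This is only a sketch: `parse_llm_output` is a hypothetical helper, and returning `None` for anything unparseable is a design choice made here so that such rows drop out in the later `dropna` step.

```python
import json
from typing import Optional, Tuple


def parse_llm_output(raw: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Extract (agency, subagency) from one model response string."""
    if not raw:
        return None, None
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None, None  # malformed JSON: treat both fields as missing
    if not isinstance(parsed, dict):
        return None, None  # valid JSON, but not an object
    # dict.get returns None for any key the model left out
    return parsed.get("agency"), parsed.get("subagency")
```

For example, `parse_llm_output(pred_row["llm_output"])` on a collected row (where `pred_row` is whatever row object you are inspecting) gives a quick view of what the model actually returned.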

Now let's move into sklearn to get a classification report from our LLM-based analysis for comparison's sake.

In [0]:
pred_naive = spark.table('shm.spend.pred_naive_comp').dropna(
    subset=['funding_agency_name', 'agency', 'funding_sub_agency_name', 'subagency']
).toPandas()

In [0]:
from sklearn.metrics import accuracy_score, classification_report

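# Compute each score first so the print statements stay readable
agency_accuracy = accuracy_score(
    pred_naive['funding_agency_name'],
    pred_naive['agency']
)
subagency_accuracy = accuracy_score(
    pred_naive['funding_sub_agency_name'],
    pred_naive['subagency']
)

print(f"Agency Accuracy: {agency_accuracy:0.3f}")
print(f"Subagency Accuracy: {subagency_accuracy:0.3f}")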

With naive inference we have relatively poor accuracy, but more than zero.
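
To see where the errors come from, it can help to look past the single accuracy number. The sketch below recomputes accuracy as the fraction of exact matches (which is what `accuracy_score` reports for single-label data) and counts the most frequent wrong (true, predicted) pairs. Both function names are hypothetical helpers introduced for this example.

```python
from collections import Counter


def exact_match_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label exactly."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("no predictions to score")
    return sum(t == p for t, p in pairs) / len(pairs)


def top_confusions(y_true, y_pred, n=5):
    """Most common (true, predicted) pairs among the misclassified rows."""
    misses = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return misses.most_common(n)
```

Calling `top_confusions(pred_naive['funding_sub_agency_name'], pred_naive['subagency'])` would list the subagency pairs the model mixes up most often, which is a reasonable starting point for deciding what to change in the prompt.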

In [0]:
%sql
SELECT * FROM shm.spend.test