# Vibe Coding with Databricks Assistant Lab 
## Notebook 2 - Optimizing Code with the Edit Assistant

Welcome to Notebook 2 of the Vibe Coding with Databricks Assistant Lab. In this session, you will explore the edit functionality of the Databricks Assistant. In this scenario, a fellow employee has run an ad hoc analysis of the demand sensing datasets. However, this employee had not taken their Databricks Academy training courses and was not up to speed on the Databricks Assistant.

Help our teammate by optimizing their code so that the analysis runs fresh, fast, and error-free!

This notebook uses serverless compute. Please ensure you are connected to serverless by selecting the Connect drop-down and then the Serverless compute option.

![[serverless compute]](./includes/2.0_serverless_compute.png)


## Analysis Context
This notebook produces insights for three main questions regarding the demand sensing datasets.

### August 2025 Grocery Sales by Date
Intent: Provide a chart of sales by date for August 2025. Assume August has just wrapped up and we want to check trends and see whether any data is arriving late.

Issues: The analysis is done in Pandas, a single-node engine, so it does not take advantage of our serverless Spark cluster to distribute the work. Pandas was chosen for its built-in plotting capabilities, without knowing that Spark 4.0 added native plotting capabilities as well.
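
As a small illustration of the aggregation this question asks for, the sketch below totals sales per day for August 2025 in plain Python and lists the days that have no rows at all, which is one simple signal of late-arriving data. The sample rows and the helper name `daily_totals` are hypothetical and exist only for this example; the notebook itself runs the equivalent filter and group-by on Spark.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample rows: (event_date, total_amount)
sales = [
    (date(2025, 7, 31), 40.0),
    (date(2025, 8, 1), 10.0),
    (date(2025, 8, 1), 5.5),
    (date(2025, 8, 3), 20.0),
    (date(2025, 9, 1), 99.0),
]

def daily_totals(rows, year, month):
    """Sum total_amount per event_date for one calendar month."""
    totals = defaultdict(float)
    for event_date, amount in rows:
        if event_date.year == year and event_date.month == month:
            totals[event_date] += amount
    return dict(sorted(totals.items()))

aug_totals = daily_totals(sales, 2025, 8)

# Days in August 2025 with no sales rows yet (possible late-arriving data)
all_days = [date(2025, 8, 1) + timedelta(days=i) for i in range(31)]
missing_days = [d for d in all_days if d not in aug_totals]

print(aug_totals)
print(len(missing_days), "days without data")
```

A day that is missing from the totals is only a hint, not proof, that data is late; a store closure would look the same.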

### Lowest Competitor Price
Intent: We seek to offer competitive prices across our products compared to a few select competitor grocers. We want to identify who offers the lowest price for each product on each day.

Issues: This implementation uses applyInPandas to sort each group, converts the rows to Python objects, and then reconstructs them, which can be slow for large groups.
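
The per-group rule itself is small: for each (product_id, event_date) pair, keep the row with the lowest price, breaking ties by competitor name. The plain-Python sketch below makes that rule concrete on hypothetical sample rows; the function name `lowest_price_per_group` is an assumption for illustration. In Spark the same rule can be expressed with a window function instead of applyInPandas.

```python
# Hypothetical sample rows: (product_id, event_date, competitor, price)
rows = [
    ("p1", "2025-08-01", "GrocerB", 2.50),
    ("p1", "2025-08-01", "GrocerA", 2.50),
    ("p1", "2025-08-01", "GrocerC", 3.10),
    ("p2", "2025-08-01", "GrocerC", 1.20),
    ("p2", "2025-08-01", "GrocerA", 1.75),
]

def lowest_price_per_group(rows):
    """Keep one (price, competitor) per (product_id, event_date).

    Rows are compared by price first, then by competitor name,
    so ties on price resolve deterministically.
    """
    best = {}
    for product_id, event_date, competitor, price in rows:
        key = (product_id, event_date)
        candidate = (price, competitor)
        if key not in best or candidate < best[key]:
            best[key] = candidate
    return best

best = lowest_price_per_group(rows)
for (product_id, event_date), (price, competitor) in sorted(best.items()):
    print(product_id, event_date, competitor, price)
```

This sketch makes a single pass and never sorts a whole group, which is the property that makes the window-based version cheaper than sorting each group in Pandas.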

### Total Sales by Region
Intent: We want to highlight and promote best practices among regional leadership, so we need to identify which region is leading in sales.

Issues: This query uses a highly inefficient Cartesian join in Pandas that will not complete. Spark would improve performance, and its optimizer defaults to better join strategies such as sort-merge and broadcast joins.
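
To see why the Cartesian approach does not scale, the sketch below computes the same regional totals two ways in plain Python and counts the work each one does. The sample data and both function names are hypothetical. The keyed version is only an analogy for a broadcast-style join, where the small table becomes an in-memory lookup; it is not how Spark is implemented internally.

```python
from itertools import product

# Hypothetical sample data
sales = [
    {"store_id": 1, "total_amount": 100.0},
    {"store_id": 2, "total_amount": 50.0},
    {"store_id": 1, "total_amount": 25.0},
    {"store_id": 3, "total_amount": 75.0},
]
stores = [
    {"store_id": 1, "region": "North"},
    {"store_id": 2, "region": "South"},
    {"store_id": 3, "region": "North"},
]

def totals_via_cartesian(sales, stores):
    """Pair every sale with every store, then keep only matching pairs."""
    totals, pairs_examined = {}, 0
    for sale, store in product(sales, stores):
        pairs_examined += 1
        if sale["store_id"] == store["store_id"]:
            region = store["region"]
            totals[region] = totals.get(region, 0.0) + sale["total_amount"]
    return totals, pairs_examined

def totals_via_keyed_join(sales, stores):
    """Build a lookup on the small table once, then probe it once per sale."""
    region_by_store = {s["store_id"]: s["region"] for s in stores}
    totals, lookups = {}, 0
    for sale in sales:
        lookups += 1
        region = region_by_store.get(sale["store_id"])
        if region is not None:
            totals[region] = totals.get(region, 0.0) + sale["total_amount"]
    return totals, lookups

cartesian_totals, pairs_examined = totals_via_cartesian(sales, stores)
keyed_totals, lookups = totals_via_keyed_join(sales, stores)
print(cartesian_totals, pairs_examined)
print(keyed_totals, lookups)
```

Both approaches give the same totals, but the Cartesian version examines len(sales) × len(stores) pairs while the keyed version does one lookup per sale. With millions of sales rows that difference decides whether the query finishes.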


In [0]:
df_sales = spark.read.table("lp_dev.vibe_code_assistant_lab.sales_silver")
df_comp = spark.read.table("lp_dev.vibe_code_assistant_lab.competitor_pricing_silver")
df_prod = spark.read.table("lp_dev.vibe_code_assistant_lab.products")
df_stores = spark.read.table("lp_dev.vibe_code_assistant_lab.stores")

In [0]:
import pyspark.sql.functions as F

# Filter for August 2025 data only using Spark
sales_aug_2025 = (
    df_sales
    .filter((F.year('event_date') == 2025) & (F.month('event_date') == 8))
    .groupBy('event_date')
    .agg(F.sum('total_amount').alias('total_amount'))
    .orderBy('event_date')
)

# Render the result with display(); choose a line chart in the visualization options
display(sales_aug_2025)


In [0]:
# The window-based approach below needs only Spark functions and Window;
# the Pandas import and explicit output schema from applyInPandas are no longer required
from pyspark.sql import functions as F, Window

# Find the lowest competitor price per product and date
min_comp_df = (
    df_comp
    .groupBy('product_id', 'event_date')
    .agg(
        F.min('price').alias('min_comp_price')
    )
)

# Pick one lowest-priced competitor per product and date (ties broken by competitor name)
window = (
    Window.partitionBy('product_id', 'event_date').orderBy('price', 'competitor')
)
min_competitor_df = (
    df_comp
    .withColumn('rn', F.row_number().over(window))
    .filter(F.col('rn') == 1)
    .select('product_id', 'event_date', F.col('competitor').alias('best_competitor'), F.col('price').alias('min_comp_price'))
)

# Join with products to compare with base_price
# We are the best vendor when we have a base price and it is at or below
# the lowest competitor price (or no competitor price exists)
us_is_best = df_prod.base_price.isNotNull() & (
    min_competitor_df.min_comp_price.isNull()
    | (df_prod.base_price <= min_competitor_df.min_comp_price)
)
final = (
    min_competitor_df.join(df_prod, on='product_id', how='left')
    .select(
        min_competitor_df.product_id,
        min_competitor_df.event_date,
        F.when(us_is_best, F.lit('us'))
         .otherwise(min_competitor_df.best_competitor).alias('best_vendor'),
        F.when(us_is_best, df_prod.base_price)
         .otherwise(min_competitor_df.min_comp_price).alias('best_price')
    )
)
display(final)

In [0]:
from pyspark.sql import functions as F

# Join sales and stores on store_id
joined = df_sales.join(df_stores, on='store_id', how='inner')

# Compute total sales by region
total_sales_by_region = (
    joined.groupBy('region')
    .agg(F.sum('total_amount').alias('total_amount'))
    .orderBy(F.desc('total_amount'))
)

# Display as a bar chart
display(total_sales_by_region)

In [0]:
spark.sql("DROP TABLE IF EXISTS lp_dev.vibe_code_assistant_lab.tmp_min_comp_bad")