# Vibe Coding with Databricks Assistant Lab 
## Notebook 2 - Optimizing Code with the Edit Assistant

Welcome to Notebook 2 of the Vibe Coding with Databricks Assistant Lab. In this session, you will explore the edit functionality of the Databricks Assistant. In this scenario, a fellow employee has conducted ad-hoc analysis on the demand sensing datasets. However, this employee had not taken their Databricks Academy training courses and was not up to speed on the Databricks Assistant.

Help our teammate by optimizing their code so that the analysis runs fresh, fast, and error-free!

This notebook uses serverless compute. Please ensure you are connected to serverless by selecting the Connect drop-down and then the Serverless compute option.

![serverless compute](./includes/2.0_serverless_compute.png)


## Analysis Context
This notebook produces insights for three main questions regarding the demand sensing datasets.

### August 2025 Grocery Sales by Date
**Intent:** Provide a chart of sales by date for August 2025. Assume August has just wrapped up and we want to check trends and see whether any data is arriving late.

**Issues:** The analysis is done in Pandas, a single-node engine, and does not take advantage of our serverless Spark compute to distribute the work. Pandas was chosen for its built-in plotting capabilities, without knowing that Spark 4.0 added native plotting capabilities as well.

### Lowest Competitor Price
**Intent:** We seek to provide a competitive price across our products compared to a few select competitor grocers. Understand who presents the lowest price per product by day.

**Issues:** This implementation uses `applyInPandas` to sort each group, converts values to Python objects, and reconstructs rows, all of which can be slow for large groups. It also writes an intermediate table before the join, which is unnecessary.
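
One way to check the Assistant's rewrite of this cell is to compare it against a small reference on a handful of rows. The plain-Python sketch below is not the lab's solution. It only restates the selection logic of the original cell (lowest price per product and day, ties broken by competitor name, then the comparison against our own base price) so that the expected result is unambiguous. The `Quote` tuple, the function names, and the sample rows are hypothetical, and prices are assumed to be non-null.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Quote(NamedTuple):
    """One competitor price observation (hypothetical shape for illustration)."""
    product_id: str
    event_date: str
    competitor: str
    price: float


def lowest_quotes(quotes: Iterable[Quote]) -> Dict[Tuple[str, str], Tuple[float, str]]:
    """Keep the smallest (price, competitor) pair per (product_id, event_date).

    Tuple comparison reproduces the original sort order: price first,
    then competitor name as the tie-breaker. Assumes prices are non-null.
    """
    best: Dict[Tuple[str, str], Tuple[float, str]] = {}
    for q in quotes:
        key = (q.product_id, q.event_date)
        candidate = (q.price, q.competitor)
        if key not in best or candidate < best[key]:
            best[key] = candidate
    return best


def best_vendor(
    base_price: Optional[float],
    min_comp_price: Optional[float],
    best_competitor: Optional[str],
) -> Tuple[Optional[str], Optional[float]]:
    """Mirror the F.when(...) rule from the cell: we win ties, and we win
    by default when no competitor price exists for the product and day."""
    if base_price is not None and (min_comp_price is None or base_price <= min_comp_price):
        return "us", base_price
    return best_competitor, min_comp_price


# Made-up sample rows, not taken from the lab tables.
sample = [
    Quote("p1", "2025-08-01", "GrocerB", 2.50),
    Quote("p1", "2025-08-01", "GrocerA", 2.50),  # tie on price: "GrocerA" sorts first
    Quote("p1", "2025-08-01", "GrocerC", 3.10),
    Quote("p2", "2025-08-01", "GrocerB", 1.20),
]
lowest = lowest_quotes(sample)
```

A Spark rewrite would normally express the same per-group minimum with built-in aggregates or a window function instead of `applyInPandas`, but whatever form the Assistant proposes should agree with this logic on a small sample.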

### Total Sales by Region
**Intent:** We want to highlight and promote best practices among regional leadership, so we need to identify which region is leading in sales.

**Issues:** This query uses a highly inefficient cartesian join in Pandas that will not complete. Spark would improve performance, and its optimizer defaults to better join strategies such as sort-merge and broadcast joins.
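
To see why the cartesian approach scales so badly, the following plain-Python sketch counts the work done by "cross join, then filter" against a keyed lookup, which is roughly the idea behind a broadcast hash join. It is an illustration only, not how Spark is implemented. The function names and sample rows are hypothetical, and `store_id` is assumed to be unique in the stores data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sale = Tuple[str, float]   # (store_id, total_amount)
Store = Tuple[str, str]    # (store_id, region)


def cross_join_then_filter(sales: List[Sale], stores: List[Store]) -> Tuple[Dict[str, float], int]:
    """Pair every sale with every store, then keep only matching pairs."""
    pairs_examined = 0
    totals: Dict[str, float] = defaultdict(float)
    for sale_store_id, amount in sales:
        for store_id, region in stores:
            pairs_examined += 1  # grows as len(sales) * len(stores)
            if sale_store_id == store_id:
                totals[region] += amount
    return dict(totals), pairs_examined


def keyed_join(sales: List[Sale], stores: List[Store]) -> Tuple[Dict[str, float], int]:
    """Build a lookup on the small table once, then probe it once per sale."""
    region_by_store = dict(stores)  # assumes store_id is unique in stores
    lookups = 0
    totals: Dict[str, float] = defaultdict(float)
    for store_id, amount in sales:
        lookups += 1  # grows as len(sales)
        region = region_by_store.get(store_id)
        if region is not None:
            totals[region] += amount
    return dict(totals), lookups


# Made-up sample rows, not taken from the lab tables.
sales = [("s1", 10.0), ("s2", 5.0), ("s1", 2.5), ("s3", 4.0)]
stores = [("s1", "North"), ("s2", "South"), ("s3", "North")]

slow_totals, pairs = cross_join_then_filter(sales, stores)
fast_totals, probes = keyed_join(sales, stores)
```

Both functions return the same totals per region, but the first examines every sale and store combination while the second touches each sale once. With millions of sales rows, that difference is what keeps the original cell from completing.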


## Optimization Instructions
Our goal is to help this analyst optimize their notebook and insights. Open the Databricks Assistant and open a new, clean thread with the plus icon. In the bottom right, select the "Edit" mode.

<p align="center">
  <img src="./includes/2.1_assistant_edit.png" alt="Databricks Assistant Edit" width="400"/>
</p>


Enter the following text into the assistant:

```
Optimize each analysis cell of the notebook. 
Take into account databricks and spark best practices. 
Clean up python and pandas code into spark using efficient built in plotting, functions, and joins. 
```

In edit mode, the Assistant takes more care to plan its procedure. It can review multiple cells at once, which is helpful when optimizing ETL pipelines, ML training runs, or analyses that build on preceding cells.

Review the poorly optimized cells below to see the suggested code changes, which are shown with green and red highlighting.

![Code Suggestions in Edit Mode](./includes/2.3_code_corrections.png)

Once you have reviewed the suggestions, select the blue Accept button and run each cell below to see the performance improvements. Notice that cell 8 (Lowest Competitor Price Analysis) goes from a 2-minute execution to seconds, and that cell 9 would not even complete before because of its taxing, memory-inefficient join.

In [0]:
df_sales = spark.read.table("lp_dev.vibe_code_assistant_lab.sales_silver")
df_comp = spark.read.table("lp_dev.vibe_code_assistant_lab.competitor_pricing_silver")
df_prod = spark.read.table("lp_dev.vibe_code_assistant_lab.products")
df_stores = spark.read.table("lp_dev.vibe_code_assistant_lab.stores")

In [0]:
import pandas as pd

# Convert Spark DataFrame to pandas
pdf_sales = df_sales.toPandas()

# Convert 'event_date' column to datetime in pandas
pdf_sales['event_date'] = pd.to_datetime(pdf_sales['event_date'])

# Filter for August 2025 data only
pdf_sales_aug_2025 = pdf_sales[
    (pdf_sales['event_date'].dt.year == 2025) & (pdf_sales['event_date'].dt.month == 8)
]

# Group by event_date and sum total_amount in pandas
pdf_sales_by_day = pdf_sales_aug_2025.groupby('event_date')['total_amount'].sum().reset_index()

# Display as a line chart using pandas
pdf_sales_by_day.plot(kind="line", x="event_date", y="total_amount", title="Total Sales by Date")

In [0]:
import pandas as pd
from pyspark.sql import functions as F, types as T

schema = T.StructType([
    T.StructField("product_id", T.StringType(), True),
    T.StructField("event_date", T.DateType(), True),
    T.StructField("best_competitor", T.StringType(), True),
    T.StructField("min_comp_price", T.DoubleType(), True),
])

def pick_min_comp(pdf: pd.DataFrame) -> pd.DataFrame:
    # Convert types and sort every group in Python
    pdf["price"] = pdf["price"].astype(float)
    pdf = pdf.sort_values(["price", "competitor"], ascending=[True, True])
    top = pdf.iloc[0] if len(pdf) > 0 else None
    if top is None:
        return pd.DataFrame({"product_id": [pdf["product_id"].iloc[0] if len(pdf) > 0 else None],
                             "event_date": [pdf["event_date"].iloc[0] if len(pdf) > 0 else None],
                             "best_competitor": [None],
                             "min_comp_price": [None]})
    return pd.DataFrame({"product_id": [top["product_id"]],
                         "event_date": [top["event_date"]],
                         "best_competitor": [top["competitor"]],
                         "min_comp_price": [top["price"]]})

min_comp_df = (df_comp
  .groupBy("product_id", "event_date")
  .applyInPandas(pick_min_comp, schema=schema)
)

# Join without hints; materialize intermediate tables unnecessarily
min_comp_df.write.mode("overwrite").saveAsTable("lp_dev.vibe_code_assistant_lab.tmp_min_comp_bad")
tmp = spark.table("lp_dev.vibe_code_assistant_lab.tmp_min_comp_bad")

final = (tmp.join(df_prod, on="product_id", how="left")
  .select(
    tmp.product_id,
    tmp.event_date,
    F.when((df_prod.base_price.isNotNull()) & ((tmp.min_comp_price.isNull()) | (df_prod.base_price <= tmp.min_comp_price)), F.lit("us"))
     .otherwise(tmp.best_competitor).alias("best_vendor"),
    F.when((df_prod.base_price.isNotNull()) & ((tmp.min_comp_price.isNull()) | (df_prod.base_price <= tmp.min_comp_price)), df_prod.base_price)
     .otherwise(tmp.min_comp_price).alias("best_price")
  )
)
display(final)

In [0]:
import pandas as pd

# Convert Spark DataFrames to pandas
pdf_sales = df_sales.toPandas()
pdf_stores = df_stores.toPandas()

# --- Inefficient cartesian join: create a constant key in both frames, merge, then filter ---
pdf_cross = (
    pdf_sales.assign(_tmp_key=1)
    .merge(pdf_stores.assign(_tmp_key=1), on="_tmp_key", suffixes=("_sales", "_stores"))
    .drop(columns=["_tmp_key"])
)

# Filter down to the intended join condition AFTER the cross join
pdf_joined = (
    pdf_cross.loc[pdf_cross["store_id_sales"] == pdf_cross["store_id_stores"], 
                  ["store_id_sales", "region", "total_amount"]]
    .rename(columns={"store_id_sales": "store_id"})
)

# Compute total sales by store region (still pandas, single-threaded)
pdf_total_sales = (
    pdf_joined
    .groupby("region", as_index=False)["total_amount"]
    .sum()
)

# Display as a chart
pdf_total_sales.plot(kind="bar", x="region", y="total_amount", legend=False, title="Total Sales by Store Region (Bad Cartesian)")
display(pdf_total_sales)

In [0]:
spark.sql("DROP TABLE IF EXISTS lp_dev.vibe_code_assistant_lab.tmp_min_comp_bad")