
# Lab: Vertex AI–Assisted BigQuery Analytics — Example Prompts
**Goal:** Practice moving from simple SQL to complex analytics in BigQuery using *only* carefully engineered prompts with Vertex AI (Gemini).  
**Important:** This notebook contains **prompts only** (no starter code). Paste the prompts into **Vertex AI Studio**, **Vertex AI in Colab Enterprise**, or your chosen chat interface, and then run the generated SQL directly in **BigQuery**. If you decide to automate later, you can ask Vertex AI to convert the winning SQL into a Colab pipeline.

## How to use this prompts-only notebook
1. Open **Vertex AI Studio** (or Gemini in Colab Enterprise chat panel).  
2. Copy a prompt from this notebook and paste it into the model. Do **not** paste any code from here; let the model generate it.  
3. Run the generated SQL in **BigQuery** (Console → BigQuery Studio).  
4. Iterate: refine the prompt when results aren’t what you expect.  
5. Document: capture your final SQL, plus a one-sentence takeaway, in your notes/README.

## Dataset assumptions
Use one of these sources (adjust table paths accordingly):
- **Global Superstore (Kaggle)** loaded into BigQuery (e.g., `[YOUR_PROJECT].superstore_data.sales`)  
- **TheLook eCommerce** public dataset: `bigquery-public-data.thelook_ecommerce`

If you are using *Global Superstore*, make sure column names match your schema (e.g., `Order_Date`, `Region`, `Category`, `Sub_Category`, `Sales`, `Profit`, `Discount`, `State`, `Customer_ID`, `Ship_Mode`).

---
## Prompting guardrails (quick checklist)
- **Be explicit**: table path, column names, filters, output columns, sort order, and limits.  
- **Ask for runnable SQL**: “Return a BigQuery SQL block only.”  
- **Control cost**: ask for `LIMIT` during exploration and remove it for the final run.  
- **Validate**: request a brief explanation of why each clause is present and how you can sanity-check results.
---
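
The checklist above can be turned into a small template so that no field is forgotten. The sketch below is one possible way to do that; `build_prompt` and its parameter names are hypothetical helpers invented for illustration, not part of any Vertex AI API.

```python
def build_prompt(task, table, output_columns, filters=None, sort=None, limit=None):
    """Assemble an explicit BigQuery prompt from the checklist fields.

    Hypothetical helper: the field labels mirror the prompts in this lab,
    not any Vertex AI API.
    """
    lines = [
        "BigQuery SQL only. Return a single runnable SQL block.",
        f"Task: {task}",
        f"Table: `{table}`",
    ]
    if filters:
        lines.append("Filter: " + " AND ".join(filters))
    lines.append("Output: " + ", ".join(f"`{c}`" for c in output_columns))
    if sort:
        lines.append(f"Sort: {sort}")
    if limit is not None:
        lines.append(f"Add `LIMIT {limit}` to control cost during exploration.")
    return "\n".join(lines)


prompt = build_prompt(
    task="List all unique sub_category values sold in the 'West' region.",
    table="[YOUR_PROJECT].superstore_data.sales",
    output_columns=["sub_category"],
    filters=["region = 'West'"],
    sort="alphabetically A to Z",
    limit=100,
)
print(prompt)
```

Leaving `limit=None` drops the `LIMIT` line, which matches the advice to remove it for the final run.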

## Install Dependencies

In [None]:
# Install the Google Cloud BigQuery client library
!pip install google-cloud-bigquery==3.17.0 pandas==2.1.4

# Authenticate your Colab environment
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


## Copy Schema to a dataframe

In [None]:
from google.cloud import bigquery
import pandas as pd

# Replace with your Google Cloud Project ID
project_id = "database-project-467"
dataset_id = 'lab1_foundation'
table_id = 'superstore'

# Construct a BigQuery client object.
client = bigquery.Client(project=project_id)

# Get the table object (client.dataset() is deprecated; pass the full table ID instead)
table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

# Extract schema information
schema_list = []
for field in table.schema:
    schema_list.append({
        'name': field.name,
        'field_type': field.field_type,
        'mode': field.mode,
        'description': field.description
    })

# Convert to Pandas DataFrame
schema_df = pd.DataFrame(schema_list)

# Confirm the DataFrame was built; add print(schema_df) to inspect its contents
print("Schema DataFrame created:")


Schema DataFrame created:


## Clean Column Names

In [None]:
# --- 1. Clean the Column Names ---
# Create a 'clean_name' column with standard naming conventions:
# lowercase, with spaces and hyphens replaced by underscores.
schema_df['clean_name'] = schema_df['name'].str.lower().str.replace(' ', '_').str.replace('-', '_')


# --- 2. Generate the Aliases for the SELECT Clause ---
column_expressions = []
for index, row in schema_df.iterrows():
    original_name = row['name']
    clean_name = row['clean_name']

    # If the original name contains a space or special character, it needs to be
    # enclosed in backticks (`) in the SQL statement.
    if ' ' in original_name or '-' in original_name:
        expression = f'`{original_name}` AS {clean_name}'
    else:
        # If the name is already clean, we still alias it for consistency.
        expression = f'{original_name} AS {clean_name}'
    column_expressions.append(expression)

# Join all the individual column expressions into a single, formatted string.
select_clause = ",\n  ".join(column_expressions)


# --- 3. Construct the Final CREATE VIEW Statement ---
new_view_id = 'superstore_clean' # You can change this if you like

create_view_sql = f"""
CREATE OR REPLACE VIEW `{project_id}.{dataset_id}.{new_view_id}` AS
SELECT
  {select_clause}
FROM
  `{project_id}.{dataset_id}.{table_id}`;
"""

# --- 4. Print the Final SQL ---
print("--- Copy the SQL below and run it in your BigQuery Console ---")
print(create_view_sql)

--- Copy the SQL below and run it in your BigQuery Console ---

CREATE OR REPLACE VIEW `database-project-467.lab1_foundation.superstore_clean` AS
SELECT
  `Row ID` AS row_id,
  `Order ID` AS order_id,
  `Order Date` AS order_date,
  `Ship Date` AS ship_date,
  `Ship Mode` AS ship_mode,
  `Customer ID` AS customer_id,
  `Customer Name` AS customer_name,
  Segment AS segment,
  Country AS country,
  City AS city,
  State AS state,
  `Postal Code` AS postal_code,
  Region AS region,
  `Product ID` AS product_id,
  Category AS category,
  `Sub-Category` AS sub_category,
  `Product Name` AS product_name,
  Sales AS sales,
  Quantity AS quantity,
  Discount AS discount,
  Profit AS profit
FROM
  `database-project-467.lab1_foundation.superstore`;
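
The cleaning rule in the cell above handles only spaces and hyphens. If a source table had other characters in its column names, a more general rule might look like the sketch below. `clean_column_name` and `alias_expression` are hypothetical helpers, and `Profit (USD)` is a made-up column name used only to show the behavior.

```python
import re


def clean_column_name(name):
    """Lowercase a column name and collapse every run of non-alphanumeric
    characters into a single underscore (hypothetical helper)."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
    # An unquoted SQL identifier cannot start with a digit, so prefix those.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def alias_expression(name):
    """Always backtick-quote the original name; quoting a clean name is harmless."""
    return f"`{name}` AS {clean_column_name(name)}"


# "Profit (USD)" is a made-up name; the others come from the schema above.
for original in ["Row ID", "Sub-Category", "Sales", "Profit (USD)"]:
    print(alias_expression(original))
```

Quoting every original name removes the need for the `if ' ' in original_name` branch, at the cost of slightly noisier SQL.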



## Generate View with standard column naming convention

In [None]:
# Execute the CREATE VIEW SQL query
try:
    query_job = client.query(create_view_sql)  # API request
    query_job.result()  # Waits for the query to finish
    print(f"View '{new_view_id}' created/replaced successfully in dataset '{dataset_id}'.")
except Exception as e:
    print(f"An error occurred while creating the view: {e}")

# Now, let's print 10 rows from the newly created view to verify
print(f"\n--- First 10 rows from the new view '{new_view_id}' ---")
try:
    # Construct a reference to the new view
    view_table_ref = client.dataset(dataset_id).table(new_view_id)

    # Fetch the first 10 rows
    rows = client.list_rows(view_table_ref, max_results=10)

    # Print header
    print(" | ".join([field.name for field in rows.schema]))
    print("-" * 80) # Separator

    # Print rows
    for row in rows:
        print(" | ".join([str(item) for item in row.values()]))

except Exception as e:
    print(f"An error occurred while fetching rows from the view: {e}")



View 'superstore_clean' created/replaced successfully in dataset 'lab1_foundation'.

--- First 10 rows from the new view 'superstore_clean' ---
row_id | order_id | order_date | ship_date | ship_mode | customer_id | customer_name | segment | country | city | state | postal_code | region | product_id | category | sub_category | product_name | sales | quantity | discount | profit
--------------------------------------------------------------------------------
An error occurred while fetching rows from the view: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/database-project-467/datasets/lab1_foundation/tables/superstore_clean/data?maxResults=10&formatOptions.useInt64Timestamp=True&prettyPrint=false: Cannot list a table of type VIEW.

The view itself was created successfully. The fetch failed because `list_rows` reads a table's stored rows directly and a view has no stored rows, so BigQuery rejects the call with `Cannot list a table of type VIEW`. Reading from a view requires a `SELECT` query, which is what the next cell does.


In [None]:
# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.

print("✅ Step 1: Defining the query string...")

query_string = """
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `database-project-467.lab1_foundation.superstore_clean`
LIMIT 10;
"""


print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,order_id,customer_name,product_name,sales,profit
0,CA-2015-154900,Sung Shariari,Avery 518,3.15,1.512
1,CA-2015-154900,Sung Shariari,Adams Telephone Message Book W/Dividers/Space ...,22.72,10.224
2,US-2016-152415,Patrick O'Donnell,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,6.2244
3,US-2016-152415,Patrick O'Donnell,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,61.3824
4,CA-2016-153269,Pamela Stobb,"Personal Folder Holder, Ebony",11.21,3.363
5,CA-2016-153269,Pamela Stobb,"Situations Contoured Folding Chairs, 4/Set",354.9,88.725
6,CA-2016-153269,Pamela Stobb,Xerox 193,17.94,8.7906
7,CA-2016-153269,Pamela Stobb,GBC Binding covers,51.8,23.31
8,CA-2015-158792,Brian Dahlen,Staples,22.2,10.434
9,CA-2016-141082,Fred McMath,Avery 517,3.69,1.7343


## Part A — SQL Warm‑Up (SELECT, WHERE, ORDER BY, LIMIT, DISTINCT)
**Aim:** Build confidence with precise, unambiguous prompts that yield clean, runnable SQL.

### A1. Unique values (DISTINCT)
**Prompt (paste in Vertex AI):**
```
Act as a senior BigQuery analyst. Produce a **single runnable BigQuery SQL** (no commentary) for:
- Task: List all unique `Sub_Category` values sold in the 'West' region.
- Table: `mgmt-467-47888.lab1_foundation.superstore`
- Filter: `Region = 'West'`
- Output: a single column named `Sub_Category`
- Sort: alphabetically A→Z
- Add: `LIMIT 100` to control cost during exploration.
```
**Reflection:** Did the result match your expectations? If not, what ambiguity in your prompt might have caused the mismatch?
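
For comparison, here is a sketch of the kind of SQL the A1 prompt should produce. The statement uses only syntax that BigQuery and SQLite share, so it can be tried locally on a few made-up rows before running it against the real table. The `{table}` placeholder and the toy rows are assumptions for illustration; in BigQuery the placeholder would be the backticked table path from the prompt.

```python
import sqlite3

# Query text a model might return for prompt A1. {table} is a placeholder:
# in BigQuery it would be the backticked path from the prompt.
A1_SQL = """
SELECT DISTINCT Sub_Category
FROM {table}
WHERE Region = 'West'
ORDER BY Sub_Category ASC
LIMIT 100
"""

# Made-up rows, only to exercise the query locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (Region TEXT, Sub_Category TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("West", "Phones"),
        ("West", "Chairs"),
        ("West", "Phones"),  # duplicate that DISTINCT should collapse
        ("East", "Tables"),  # filtered out by WHERE
    ],
)
west_subcategories = [row[0] for row in conn.execute(A1_SQL.format(table="sales"))]
print(west_subcategories)
```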


### A2. Top‑N by metric (ORDER BY … DESC)
**Prompt:**
```
BigQuery SQL only.
Task: Return the top 10 customers by total profit.
Table: `mgmt-467-47888.lab1_foundation.superstore`
Columns used: `Customer_ID`, `Profit`
Output columns: `Customer_ID`, `total_profit`
Logic: SUM Profit per customer, order by `total_profit` DESC
Add `LIMIT 10`.
```
**Tip:** If your schema uses different identifiers (e.g., `Customer Name`), restate column names explicitly.

In [None]:



# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.

print("✅ Step 1: Defining the query string...")

query_string = """
SELECT
  Customer_ID,
  SUM(Profit) AS total_profit
FROM
  `database-project-467.lab1_foundation.superstore_clean`
GROUP BY
  Customer_ID
ORDER BY
  total_profit DESC
LIMIT 10;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,Customer_ID,total_profit
0,TC-20980,8981.3239
1,RB-19360,6976.0959
2,SC-20095,5757.4119
3,HL-15040,5622.4292
4,AB-10105,5444.8055
5,TA-21385,4703.7883
6,CM-12385,3899.8904
7,KD-16495,3038.6254
8,AR-10540,2884.6208
9,DR-12940,2869.076


### A3. Basic filtering (WHERE) + sanity checks
**Prompt:**
```
BigQuery SQL only.
Task: Count orders shipped with each `Ship_Mode`, but only for orders in the 'Technology' category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Ship_Mode`, `order_count`
Logic: COUNT(*) grouped by `Ship_Mode`
Sort by `order_count` DESC
```
**Validation ask:** “Also list two quick sanity checks to verify the numbers.”

In [None]:
# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.

print("✅ Step 1: Defining the query string...")

query_string = """
SELECT
  Ship_Mode,
  COUNT(*) AS order_count
FROM
  `database-project-467.lab1_foundation.superstore_clean`
WHERE
  Category = 'Technology'
GROUP BY
  Ship_Mode
ORDER BY
  order_count DESC;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying Results ---


Unnamed: 0,Ship_Mode,order_count
0,Standard Class,1082
1,Second Class,366
2,First Class,301
3,Same Day,98
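
The two sanity checks requested in the validation ask can be sketched with the counts printed above. One caveat worth keeping in mind: `COUNT(*)` counts rows (line items), so if the goal is distinct orders, `COUNT(DISTINCT Order_ID)` would be the expression to compare against. The check descriptions below are one reasonable choice, not the only one.

```python
# Counts copied from the A3 output above.
order_counts = {
    "Standard Class": 1082,
    "Second Class": 366,
    "First Class": 301,
    "Same Day": 98,
}

# Check 1: the groups should add up to the number of 'Technology' rows,
# which a separate COUNT(*) with the same WHERE clause would return.
total_rows = sum(order_counts.values())

# Check 2: the per-mode shares should sum to 100%.
shares = {mode: 100.0 * n / total_rows for mode, n in order_counts.items()}

print(total_rows)
for mode, pct in shares.items():
    print(f"{mode}: {pct:.1f}%")
```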


## Part B — Grouped Analytics (GROUP BY, HAVING)
**Aim:** Turn raw facts into grouped metrics and filtered aggregations.

### B1. KPI aggregation with WHERE + GROUP BY
**Prompt:**
```
BigQuery SQL only.
Task: Compute monthly revenue for the last 12 full months.
Table: `[YOUR_PROJECT].superstore_data.sales`
Assume: `Order_Date` is a DATE or TIMESTAMP column named exactly `Order_Date`.
Output: `year_month` (YYYY-MM format), `monthly_revenue`
Logic: Truncate date to month, SUM `Sales`, filter to last 12 full months.
Sort by `year_month` ascending.
Include a `LIMIT` safeguard for exploration.
```

In [None]:
print("✅ Step 1: Defining the query string...")

query_string = """
-- Last 12 full months anchored to the data's MAX(Order_Date)
WITH bounds AS (
  SELECT
    -- First day of the month AFTER the max order month (exclusive upper bound)
    DATE_ADD(DATE_TRUNC(DATE(MAX(Order_Date)), MONTH), INTERVAL 1 MONTH) AS end_month_excl
  FROM `database-project-467.lab1_foundation.superstore_clean`
),
agg AS (
  SELECT
    DATE_TRUNC(DATE(Order_Date), MONTH) AS ym,
    SUM(Sales) AS monthly_revenue
  FROM `database-project-467.lab1_foundation.superstore_clean`, bounds
  WHERE DATE(Order_Date) >= DATE_SUB(end_month_excl, INTERVAL 12 MONTH)
    AND DATE(Order_Date) <  end_month_excl
  GROUP BY ym
),
months AS (
  -- Generate the 12 month sequence so we always get 12 rows
  SELECT month AS ym
  FROM bounds,
  UNNEST(
    GENERATE_DATE_ARRAY(
      DATE_SUB(end_month_excl, INTERVAL 12 MONTH),   -- start (inclusive)
      DATE_SUB(end_month_excl, INTERVAL 1 MONTH),    -- end   (inclusive)
      INTERVAL 1 MONTH
    )
  ) AS month
)
SELECT
  FORMAT_DATE('%Y-%m', m.ym) AS year_month,
  IFNULL(a.monthly_revenue, 0) AS monthly_revenue
FROM months m
LEFT JOIN agg a USING (ym)
ORDER BY m.ym ASC
LIMIT 100;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. "
              "Please verify column names (Order_Date, Sales) and that the table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 12 rows.

--- Displaying Results ---


Unnamed: 0,year_month,monthly_revenue
0,2017-01,43971.374
1,2017-02,20301.1334
2,2017-03,58872.3528
3,2017-04,36521.5361
4,2017-05,44261.1102
5,2017-06,52981.7257
6,2017-07,45264.416
7,2017-08,63120.888
8,2017-09,87866.652
9,2017-10,77776.9232


### B2. Post‑aggregation filter (HAVING)
**Prompt:**
```
BigQuery SQL only.
Task: Find sub-categories whose total profit over the entire dataset is negative.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Sub_Category`, `total_profit`
Logic: SUM `Profit` GROUP BY `Sub_Category`, HAVING SUM(Profit) < 0
Sort by `total_profit` ASC (most negative first).
```
**Why HAVING?** Ask the model to include a 1-sentence explanation of why HAVING is used instead of WHERE here.

In [None]:
print("✅ Step 1: Defining the query string...")

query_string = """
-- We use HAVING because the filter applies to an aggregate (SUM) result, not individual rows.
SELECT
  Sub_Category,
  SUM(Profit) AS total_profit
FROM
  `database-project-467.lab1_foundation.superstore_clean`
GROUP BY
  Sub_Category
HAVING
  SUM(Profit) < 0
ORDER BY
  total_profit ASC;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please check if any sub-categories actually have negative profit.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 3 rows.

--- Displaying Results ---


Unnamed: 0,Sub_Category,total_profit
0,Tables,-17725.4811
1,Bookcases,-3472.556
2,Supplies,-1189.0995



We use HAVING because the condition filters on an aggregate (like SUM(Profit)) that is only available after grouping, whereas WHERE filters individual rows before aggregation.
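
The difference is easy to see on a toy table. The sketch below runs both variants locally with SQLite (the `GROUP BY` and `HAVING` syntax used here is the same in BigQuery); the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (Sub_Category TEXT, Profit REAL)")
# Made-up rows: Tables loses money overall, Phones has one losing line
# item but is profitable in total.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Tables", -50.0), ("Tables", 20.0), ("Phones", -10.0), ("Phones", 40.0)],
)

# HAVING filters groups after SUM is computed.
having_rows = conn.execute(
    """
    SELECT Sub_Category, SUM(Profit) AS total_profit
    FROM sales
    GROUP BY Sub_Category
    HAVING SUM(Profit) < 0
    ORDER BY total_profit ASC
    """
).fetchall()

# WHERE filters rows before grouping, which answers a different question:
# "sum of the losing line items per sub-category".
where_rows = conn.execute(
    """
    SELECT Sub_Category, SUM(Profit) AS total_profit
    FROM sales
    WHERE Profit < 0
    GROUP BY Sub_Category
    ORDER BY total_profit ASC
    """
).fetchall()

print(having_rows)
print(where_rows)
```

Only the `HAVING` version reports Tables alone; the `WHERE` version also lists Phones, even though Phones is profitable overall.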

## Part C — Joins (dimension enrichment)
**Aim:** Use joins to enhance facts with attributes.

### C1. Join facts to a small dimension
*(If you have a customer or product dimension in your schema, use it. Otherwise, request a synthetic example.)*  
**Prompt:**
```
BigQuery SQL only.
Task: Join the sales table to a product dimension to report `Product_ID`, `Product_Name`, and total sales.
Tables: `[YOUR_PROJECT].superstore_data.sales` as s, `[YOUR_PROJECT].superstore_data.products` as p
Join key: `s.Product_ID = p.Product_ID`
Output: `Product_ID`, `Product_Name`, `total_sales`
Sort by `total_sales` DESC
```
**If you lack a dimension table:** Ask the model how to simulate one temporarily via a CTE.

In [None]:
print("✅ Step 1: Defining the query string...")

query_string = """
WITH products AS (
  SELECT DISTINCT
    Product_ID,
    Product_Name
  FROM
    `database-project-467.lab1_foundation.superstore_clean`
)
SELECT
  p.Product_ID,
  p.Product_Name,
  SUM(s.Sales) AS total_sales
FROM
  `database-project-467.lab1_foundation.superstore_clean` AS s
JOIN
  products AS p
ON
  s.Product_ID = p.Product_ID
GROUP BY
  p.Product_ID, p.Product_Name
ORDER BY
  total_sales DESC
LIMIT 100;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please check if your 'superstore_clean' table has Product_ID and Sales data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 100 rows.

--- Displaying Results ---


Unnamed: 0,Product_ID,Product_Name,total_sales
0,TEC-CO-10004722,Canon imageCLASS 2200 Advanced Copier,61599.824
1,OFF-BI-10003527,Fellowes PB500 Electric Punch Plastic Comb Bin...,27453.384
2,TEC-MA-10002412,Cisco TelePresence System EX90 Videoconferenci...,22638.480
3,FUR-CH-10002024,HON 5400 Series Task Chairs for Big and Tall,21870.576
4,OFF-BI-10001359,GBC DocuBind TL300 Electric Binding System,19823.479
...,...,...,...
95,OFF-ST-10001526,Iceberg Mobile Mega Data/Printer Cart,5751.774
96,FUR-CH-10000595,Safco Contoured Stacking Chairs,5697.760
97,TEC-CO-10001766,Canon PC940 Copier,5669.874
98,FUR-TA-10004256,Bretford “Just In Time” Height-Adjustable Mult...,5634.900
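
One thing worth checking when a dimension is simulated with `SELECT DISTINCT`: if any `Product_ID` appears with more than one `Product_Name`, the dimension has several rows for that ID and the join duplicates the matching sales rows. The sketch below demonstrates the effect and a check for it on made-up rows, using SQLite locally. Whether the real Superstore data contains such IDs is not established here, so treat this as a check to run rather than a finding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (Product_ID TEXT, Product_Name TEXT, Sales REAL)")
# Made-up rows: product P1 appears under two spellings of its name.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("P1", "Desk Lamp", 100.0),
        ("P1", "Desk Lamp (Black)", 50.0),
        ("P2", "Stapler", 30.0),
    ],
)

fan_out_sql = """
WITH products AS (
  SELECT DISTINCT Product_ID, Product_Name FROM sales
)
SELECT SUM(s.Sales)
FROM sales AS s
JOIN products AS p ON s.Product_ID = p.Product_ID
"""

# Check: a Product_ID with more than one name makes the dimension non-unique.
duplicate_ids_sql = """
SELECT Product_ID
FROM sales
GROUP BY Product_ID
HAVING COUNT(DISTINCT Product_Name) > 1
"""

true_total = conn.execute("SELECT SUM(Sales) FROM sales").fetchone()[0]
joined_total = conn.execute(fan_out_sql).fetchone()[0]
duplicate_ids = [row[0] for row in conn.execute(duplicate_ids_sql)]

print(true_total, joined_total, duplicate_ids)
```

If the duplicate check returns any IDs on the real table, picking one name per ID in the CTE (for example with `MIN(Product_Name)` grouped by `Product_ID`) keeps the join one-to-many.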


## Part D — Common Table Expressions (CTEs)
**Aim:** Make complex logic readable and testable in steps.

### D1. Multi‑step ranking with CTEs
**Prompt:**
```
BigQuery SQL only.
Goal: Within each `Region`, rank states by total sales and return top 3 per region.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE 1 (`state_sales`): SUM(Sales) by `Region`, `State`
CTE 2 (`ranked_state_sales`): Add `RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as `sales_rank`
Final SELECT: rows where `sales_rank <= 3`
Output columns: `Region`, `State`, `total_sales`, `sales_rank`
Sort: by `Region`, then `sales_rank`
```
**Ask for**: a one-paragraph explanation of each step, then **provide only the final runnable SQL**.

In [None]:
print("✅ Step 1: Defining the query string...")

query_string = """
WITH state_sales AS (
  SELECT
    Region,
    State,
    SUM(Sales) AS total_sales
  FROM
    `database-project-467.lab1_foundation.superstore_clean`
  GROUP BY
    Region, State
),
ranked_state_sales AS (
  SELECT
    Region,
    State,
    total_sales,
    RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC) AS sales_rank
  FROM
    state_sales
)
SELECT
  Region,
  State,
  total_sales,
  sales_rank
FROM
  ranked_state_sales
WHERE
  sales_rank <= 3
ORDER BY
  Region,
  sales_rank;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please check if your 'superstore_clean' table has Region, State, and Sales data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 12 rows.

--- Displaying Results ---


Unnamed: 0,Region,State,total_sales,sales_rank
0,Central,Texas,170188.0458,1
1,Central,Illinois,80166.101,2
2,Central,Michigan,76269.614,3
3,East,New York,310876.271,1
4,East,Pennsylvania,116511.914,2
5,East,Ohio,78258.136,3
6,South,Florida,89473.708,1
7,South,Virginia,70636.72,2
8,South,North Carolina,55603.164,3
9,West,California,457687.6315,1


CTE state_sales: We first aggregate the sales data by Region and State. This gives us the total sales per state within each region, collapsing the raw transaction-level records into a more compact summary table.

CTE ranked_state_sales: We apply the RANK() window function to assign a ranking of states within each region based on their total_sales. The PARTITION BY Region ensures the ranking restarts for each region, while the ORDER BY total_sales DESC ranks from highest to lowest sales.

Final SELECT: We filter down to only the top 3 states per region by including only rows where sales_rank <= 3. This produces the top-performing states per region.

Sorting: Finally, we sort by Region and then sales_rank so the output is organized region by region, with states listed in descending sales order.

### D2. Time‑boxed “most improved” analysis
**Prompt:**
```
BigQuery SQL only.
Goal: Identify the top 5 sub-categories with the largest YoY revenue increase from 2023 to 2024.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `yr_sales`: SUM(Sales) by `Sub_Category` and `year` extracted from `Order_Date`
Final: pivot or self-join to compute delta (2024 minus 2023) as `yoy_delta`
Output: `Sub_Category`, `sales_2023`, `sales_2024`, `yoy_delta`
Order by `yoy_delta` DESC
Limit 5
```
**Validation:** Ask the model for two quick failure modes (e.g., missing years) and how to handle them.

Here are two quick failure modes for this YoY query, along with how to handle them:

Failure Mode 1 — Only one year of data exists

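If your dataset has sales for just a single year, there is no prior year to compare against. A self-join version of the query returns zero rows in that case, while a pivot version that wraps the yearly totals in IFNULL(..., 0) reports a prior-year total of 0 for every sub-category, so each delta simply equals the latest year’s sales.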

Fix: Run a quick SELECT DISTINCT EXTRACT(YEAR FROM Order_Date) first; if only one year exists, report sales growth as “N/A” or skip YoY entirely.

Failure Mode 2 — Some sub-categories are missing in one year

A sub-category might have sales in the latest year but none in the prior year (or vice versa), leading to NULL values for that year.

Fix: Use IFNULL(..., 0) around yearly totals so that the YoY delta still computes correctly (treating the missing year as zero sales).
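
The second fix can be sketched in plain Python to show what treating a missing year as zero does to the ranking. `yoy_deltas` is a hypothetical helper and the totals are made up; "Gadgets" is an invented sub-category with no prior-year sales.

```python
def yoy_deltas(yearly_sales, prev_year, curr_year):
    """Compute current-minus-prior sales per sub-category.

    yearly_sales maps (sub_category, year) -> total sales. A missing year is
    treated as 0, which is what IFNULL(..., 0) does in the SQL version.
    """
    sub_categories = sorted({sub for sub, _ in yearly_sales})
    rows = []
    for sub in sub_categories:
        prev = yearly_sales.get((sub, prev_year), 0.0)
        curr = yearly_sales.get((sub, curr_year), 0.0)
        rows.append((sub, prev, curr, round(curr - prev, 3)))
    # Largest increase first, matching ORDER BY yoy_delta DESC.
    return sorted(rows, key=lambda r: r[3], reverse=True)


# Made-up totals: "Gadgets" has no sales in the prior year.
toy = {
    ("Phones", 2016): 100.0,
    ("Phones", 2017): 150.0,
    ("Gadgets", 2017): 80.0,
    ("Tables", 2016): 60.0,
    ("Tables", 2017): 40.0,
}
deltas = yoy_deltas(toy, prev_year=2016, curr_year=2017)
for row in deltas:
    print(row)
```

With zero-filling, a brand-new sub-category ranks as the "most improved" because its whole first-year revenue counts as growth. Whether that is desirable depends on the question being asked; filtering to sub-categories present in both years is the alternative.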

In [None]:
print("✅ Step 1: Defining the query string...")

query_string = """
WITH yr_sales AS (
  SELECT
    Sub_Category,
    EXTRACT(YEAR FROM Order_Date) AS yr,
    SUM(Sales) AS total_sales
  FROM
    `database-project-467.lab1_foundation.superstore_clean`
  GROUP BY
    Sub_Category, yr
),
year_bounds AS (
  SELECT
    MAX(yr) AS max_year,
    MAX(yr) - 1 AS prev_year
  FROM yr_sales
),
pivoted AS (
  SELECT
    y.Sub_Category,
    IFNULL(SUM(CASE WHEN y.yr = b.prev_year THEN y.total_sales END), 0) AS sales_prev,
    IFNULL(SUM(CASE WHEN y.yr = b.max_year THEN y.total_sales END), 0) AS sales_curr
  FROM yr_sales y
  CROSS JOIN year_bounds b
  WHERE y.yr IN (b.prev_year, b.max_year)
  GROUP BY y.Sub_Category
)
SELECT
  Sub_Category,
  sales_prev AS sales_prior_year,
  sales_curr AS sales_latest_year,
  (sales_curr - sales_prev) AS yoy_delta
FROM
  pivoted
ORDER BY
  yoy_delta DESC
LIMIT 5;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. "
              "Because missing years are filled with 0, this suggests the view has no rows at all. "
              "Check the available years with a quick SELECT DISTINCT EXTRACT(YEAR FROM Order_Date) query.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 5 rows.

--- Displaying Results ---


Unnamed: 0,Sub_Category,sales_prior_year,sales_latest_year,yoy_delta
0,Phones,78962.03,105340.516,26378.486
1,Binders,49683.325,72788.045,23104.72
2,Accessories,41895.854,59946.232,18050.378
3,Appliances,26050.315,42926.932,16876.617
4,Copiers,49599.41,62899.388,13299.978


## Part E — Window Functions (ROW_NUMBER, RANK, DENSE_RANK, LAG/LEAD, moving averages)
**Aim:** Compare rows across partitions and time; compute trends and ranks without collapsing rows.

### E1. Top product per region (ROW_NUMBER)
**Prompt:**
```
BigQuery SQL only.
Task: For each `Region`, return only the single highest-revenue `Sub_Category`.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `subcat_sales`: SUM(Sales) by `Region`, `Sub_Category`
Add `ROW_NUMBER() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as rn
Final: filter `rn = 1`
Output: `Region`, `Sub_Category`, `total_sales`
Sort by `Region`
```
**Why `ROW_NUMBER` instead of `RANK`?** Ask the model to add a 2-sentence contrast.

Why ROW_NUMBER() instead of RANK()?

ROW_NUMBER() guarantees exactly one row per region, even if there are ties in sales, making it best when you want a single “winner.” Which tied row wins is arbitrary unless the ORDER BY includes a tie-breaker column.

RANK() would return multiple sub-categories if they tie for first place, which can produce more than one “top” per region.
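
The contrast can be reproduced on a toy table with a deliberate tie. The sketch below uses SQLite locally (it needs SQLite 3.25 or newer for window functions); the rows are made up, and the `Sub_Category ASC` tie-breaker is an added assumption that makes the `ROW_NUMBER()` result deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subcat_sales (Region TEXT, Sub_Category TEXT, total_sales REAL)"
)
# Made-up totals with a deliberate tie for first place in the West.
conn.executemany(
    "INSERT INTO subcat_sales VALUES (?, ?, ?)",
    [
        ("West", "Chairs", 500.0),
        ("West", "Phones", 500.0),
        ("West", "Tables", 300.0),
        ("East", "Phones", 400.0),
        ("East", "Binders", 100.0),
    ],
)

row_number_sql = """
SELECT Region, Sub_Category FROM (
  SELECT Region, Sub_Category,
         ROW_NUMBER() OVER (
           PARTITION BY Region
           ORDER BY total_sales DESC, Sub_Category ASC  -- tie-breaker
         ) AS rn
  FROM subcat_sales
)
WHERE rn = 1
ORDER BY Region, Sub_Category
"""

rank_sql = """
SELECT Region, Sub_Category FROM (
  SELECT Region, Sub_Category,
         RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC) AS rn
  FROM subcat_sales
)
WHERE rn = 1
ORDER BY Region, Sub_Category
"""

row_number_winners = conn.execute(row_number_sql).fetchall()
rank_winners = conn.execute(rank_sql).fetchall()
print(row_number_winners)
print(rank_winners)
```

`ROW_NUMBER()` returns one row per region, while `RANK()` returns both tied sub-categories for the West.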


In [None]:
print("✅ Step 1: Defining the query string...")

query_string = """
WITH subcat_sales AS (
  SELECT
    Region,
    Sub_Category,
    SUM(Sales) AS total_sales
  FROM
    `database-project-467.lab1_foundation.superstore_clean`
  GROUP BY
    Region, Sub_Category
),
ranked AS (
  SELECT
    Region,
    Sub_Category,
    total_sales,
    ROW_NUMBER() OVER (PARTITION BY Region ORDER BY total_sales DESC) AS rn
  FROM
    subcat_sales
)
SELECT
  Region,
  Sub_Category,
  total_sales
FROM
  ranked
WHERE
  rn = 1
ORDER BY
  Region;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please check if your 'superstore_clean' table has Region, Sub_Category, and Sales data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying Results ---


| Region | Sub_Category | total_sales |
|---|---|---|
| Central | Chairs | 85230.646 |
| East | Phones | 100614.982 |
| South | Phones | 58304.438 |
| West | Chairs | 101781.328 |


### E2. YoY growth with LAG
**Prompt:**
```
BigQuery SQL only.
Task: Compute year-over-year revenue growth for 'Phones' sub-category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Filter to `Sub_Category = 'Phones'`
- Aggregate yearly revenue using EXTRACT(YEAR FROM Order_Date)
- Add `LAG(yearly_revenue) OVER (ORDER BY year)` as `prev_revenue`
- Compute `yoy_pct = 100.0 * (yearly_revenue - prev_revenue) / prev_revenue`
Output: `year`, `yearly_revenue`, `prev_revenue`, `yoy_pct`
Sort by `year` ASC
```
**Ask for**: a guard against divide-by-zero or NULL previous year.

The guard (`WHEN prev_revenue IS NULL OR prev_revenue = 0 THEN NULL`) is there because YoY growth divides by the previous year’s revenue. If that value is zero, BigQuery raises a division-by-zero error and the whole query fails. If it is `NULL` (as for the first year, which has no prior row), the division already yields `NULL`, and the explicit check simply makes that intent visible. `SAFE_DIVIDE` is a built-in alternative that returns `NULL` instead of failing on a zero denominator.

In [None]:
print("✅ Step 1: Defining the query string...")

query_string = """
WITH yearly AS (
  SELECT
    EXTRACT(YEAR FROM Order_Date) AS year,
    SUM(Sales) AS yearly_revenue
  FROM
    `database-project-467.lab1_foundation.superstore_clean`
  WHERE
    Sub_Category = 'Phones'
  GROUP BY
    year
),
lagged AS (
  SELECT
    year,
    yearly_revenue,
    LAG(yearly_revenue) OVER (ORDER BY year) AS prev_revenue
  FROM
    yearly
)
SELECT
  year,
  yearly_revenue,
  prev_revenue,
  CASE
    WHEN prev_revenue IS NULL OR prev_revenue = 0 THEN NULL
    ELSE 100.0 * (yearly_revenue - prev_revenue) / prev_revenue
  END AS yoy_pct
FROM
  lagged
ORDER BY
  year ASC;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please check if your table has 'Phones' as a Sub_Category and sales data across multiple years.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying Results ---


| year | yearly_revenue | prev_revenue | yoy_pct |
|---|---|---|---|
| 2014 | 77390.806 | | |
| 2015 | 68313.702 | 77390.806 | -11.728918 |
| 2016 | 78962.03 | 68313.702 | 15.587397 |
| 2017 | 105340.516 | 78962.03 | 33.406545 |
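The guarded calculation can be checked by hand against the results above. The sketch below reproduces the `LAG` plus guarded division in plain Python, using the four yearly Phones revenues copied from the output; the helper name `yoy_pct` is an illustration, not part of the generated SQL.

```python
def yoy_pct(current, previous):
    """Return year-over-year growth in percent, or None when it is undefined."""
    if previous is None or previous == 0:
        return None  # mirrors the CASE guard in the SQL
    return 100.0 * (current - previous) / previous


# Yearly Phones revenue, copied from the query output above.
phones = [(2014, 77390.806), (2015, 68313.702), (2016, 78962.03), (2017, 105340.516)]

yoy_rows = []
prev_revenue = None  # plays the role of LAG(yearly_revenue): no prior row for the first year
for year, revenue in phones:
    yoy_rows.append((year, revenue, prev_revenue, yoy_pct(revenue, prev_revenue)))
    prev_revenue = revenue

for row in yoy_rows:
    print(row)
```

The first year yields `None` (no previous year), and the remaining values agree with the `yoy_pct` column to the displayed precision.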


### E3. 3‑month moving average (MA)
**Prompt:**
```
BigQuery SQL only.
Task: For the 'Corporate' segment, compute a 3-month moving average of monthly revenue.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Derive `month` via DATE_TRUNC(Order_Date, MONTH)
- SUM(Sales) per `month`
- Add `AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` as `ma_3`
Output: `month`, `monthly_revenue`, `ma_3`
Sort by `month` ASC
```
**Tip:** Ask the model to include a 1‑line cost control note (e.g., restrict date range while iterating).

In [None]:
print("✅ Step 1: Defining the query string...")

query_string = """
-- Tip: For cost control, restrict the date range (e.g., last 3 years) while iterating.
WITH monthly AS (
  SELECT
    DATE_TRUNC(Order_Date, MONTH) AS month,
    SUM(Sales) AS monthly_revenue
  FROM
    `database-project-467.lab1_foundation.superstore_clean`
  WHERE
    Segment = 'Corporate'
  GROUP BY
    month
)
SELECT
  month,
  monthly_revenue,
  AVG(monthly_revenue) OVER (
    ORDER BY month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS ma_3
FROM
  monthly
ORDER BY
  month ASC;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. "
              "Please verify that your 'superstore_clean' table has Corporate segment data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 48 rows.

--- Displaying Results ---


| month | monthly_revenue | ma_3 |
|---|---|---|
| 2014-01-01 | 1701.528 | 1701.528 |
| 2014-02-01 | 1183.668 | 1442.598 |
| 2014-03-01 | 11106.799 | 4663.998333 |
| 2014-04-01 | 14131.729 | 8807.398667 |
| 2014-05-01 | 9142.0 | 11460.176 |
| 2014-06-01 | 3970.914 | 9081.547667 |
| 2014-07-01 | 10032.988 | 7715.300667 |
| 2014-08-01 | 7451.774 | 7151.892 |
| 2014-09-01 | 15507.745 | 10997.502333 |
| 2014-10-01 | 12637.678 | 11865.732333 |
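Note that the first two `ma_3` values average fewer than three months, because `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` uses whatever rows exist at the start of the series. The frame also counts rows rather than calendar months, so a month with no orders would silently widen the span. The sketch below reproduces the frame logic in plain Python for the first five months of the output; `trailing_mean` is a hypothetical helper name.

```python
def trailing_mean(values, window=3):
    """Average each value with up to (window - 1) preceding values.

    Mirrors AVG(...) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    when window=3: early rows use a shorter frame instead of returning NULL.
    """
    result = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        result.append(sum(frame) / len(frame))
    return result


# First five monthly Corporate revenues, copied from the output above.
monthly_revenue = [1701.528, 1183.668, 11106.799, 14131.729, 9142.0]
ma_3 = trailing_mean(monthly_revenue)
for revenue, avg in zip(monthly_revenue, ma_3):
    print(f"{revenue:>12.3f} {avg:>14.6f}")
```

If partial windows are not wanted, one option is to ask the model to return `NULL` for the first two rows, for example by checking a row count over the same frame.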


## Part F — Debugging & Optimization Prompts
**Aim:** Use the model as a rubber duck for error handling and performance.

### F1. Explain the error, propose a fix
**Prompt:**
```
I ran this BigQuery SQL and got an error:
[PASTE ERROR MESSAGE and the exact SQL here]
Act as a BigQuery troubleshooter.
1) Identify the root cause.
2) Propose the smallest possible fix.
3) Suggest a quick sanity check query to verify the fix.
Return only the corrected SQL and a 2‑sentence rationale.
```

Rationale: The query failed with a 404 because it referenced a dataset, `superstore_data`, that does not exist in the project; the tables actually live in the `lab1_foundation` dataset. Pointing the table path at `database-project-467.lab1_foundation.superstore_clean` fixes the error, and the `LIMIT 10` query below serves as a quick sanity check that the table is accessible and contains rows.

In [None]:
print("✅ Step 1: Defining the corrected query string...")

query_string = """
-- Sanity check: confirm the corrected table path resolves and returns rows
SELECT *
FROM `database-project-467.lab1_foundation.superstore_clean`
LIMIT 10;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ Sanity check ran successfully but returned an empty result. Please confirm the table has rows.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the corrected query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,...,postal_code,region,product_id,category,sub_category,product_name,sales,quantity,discount,profit
0,5769,CA-2015-154900,2015-02-25,2015-03-01,Standard Class,SS-20875,Sung Shariari,Consumer,United States,Leominster,...,1453,East,OFF-LA-10001641,Office Supplies,Labels,Avery 518,3.15,1,0.0,1.512
1,5770,CA-2015-154900,2015-02-25,2015-03-01,Standard Class,SS-20875,Sung Shariari,Consumer,United States,Leominster,...,1453,East,OFF-PA-10002377,Office Supplies,Paper,Adams Telephone Message Book W/Dividers/Space ...,22.72,4,0.0,10.224
2,9028,US-2016-152415,2016-09-17,2016-09-22,Standard Class,PO-18865,Patrick O'Donnell,Consumer,United States,Marlborough,...,1752,East,FUR-FU-10002597,Furniture,Furnishings,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,3,0.0,6.2244
3,9029,US-2016-152415,2016-09-17,2016-09-22,Standard Class,PO-18865,Patrick O'Donnell,Consumer,United States,Marlborough,...,1752,East,FUR-FU-10004864,Furniture,Furnishings,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,3,0.0,61.3824
4,8332,CA-2016-153269,2016-03-09,2016-03-12,First Class,PS-18760,Pamela Stobb,Consumer,United States,Andover,...,1810,East,OFF-ST-10004634,Office Supplies,Storage,"Personal Folder Holder, Ebony",11.21,1,0.0,3.363
5,8333,CA-2016-153269,2016-03-09,2016-03-12,First Class,PS-18760,Pamela Stobb,Consumer,United States,Andover,...,1810,East,FUR-CH-10002647,Furniture,Chairs,"Situations Contoured Folding Chairs, 4/Set",354.9,5,0.0,88.725
6,8334,CA-2016-153269,2016-03-09,2016-03-12,First Class,PS-18760,Pamela Stobb,Consumer,United States,Andover,...,1810,East,OFF-PA-10001801,Office Supplies,Paper,Xerox 193,17.94,3,0.0,8.7906
7,8335,CA-2016-153269,2016-03-09,2016-03-12,First Class,PS-18760,Pamela Stobb,Consumer,United States,Andover,...,1810,East,OFF-BI-10004632,Office Supplies,Binders,GBC Binding covers,51.8,4,0.0,23.31
8,526,CA-2015-158792,2015-12-26,2016-01-02,Standard Class,BD-11605,Brian Dahlen,Consumer,United States,Lawrence,...,1841,East,OFF-FA-10002815,Office Supplies,Fasteners,Staples,22.2,5,0.0,10.434
9,1312,CA-2016-141082,2016-12-09,2016-12-13,Standard Class,FM-14380,Fred McMath,Consumer,United States,Lawrence,...,1841,East,OFF-LA-10001404,Office Supplies,Labels,Avery 517,3.69,1,0.0,1.7343



### F2. Reduce cost / improve speed
**Prompt:**
```
Act as a BigQuery cost optimizer.
Given this query (below), list 3 ways to reduce scanned bytes and improve performance without changing the business logic.
[PASTE YOUR SQL HERE]
Prioritize: partition filters, column pruning, pre-aggregations, and temporary tables for reused intermediate results (non-recursive CTEs are not materialized).
```
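Before and after applying the model's suggestions, a dry run can show whether scanned bytes actually dropped. The sketch below assumes the `google-cloud-bigquery` client library and default credentials; `dry_run`, `use_query_cache`, and `total_bytes_processed` come from that client, while `format_bytes` and `estimate_scanned_bytes` are hypothetical helper names introduced here.

```python
def format_bytes(num_bytes):
    """Render a byte count in binary units for quick comparison."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


def estimate_scanned_bytes(sql, project=None):
    """Dry-run a query and return the bytes it would process (nothing is executed)."""
    # Imported lazily so format_bytes works even without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


# Formatting examples; call estimate_scanned_bytes(your_sql) in an authenticated session.
print(format_bytes(1536))
print(format_bytes(5 * 1024**3))
```

Comparing the estimate for the original and the optimized SQL gives a concrete check that the rewrite changed cost and not just style.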

## Part G — Validation & Counter‑examples (DIVE: Validate)
**Aim:** Avoid “first‑answer fallacy” by testing alternatives.

### G1. Ask for counter‑queries
**Prompt:**
```
I concluded that 'Tables' is a high‑sales but negative‑profit sub-category due to high discounts.
Create two alternative BigQuery SQL queries that could falsify or nuance this finding:
- One that slices by region and time
- One that controls for order priority or ship mode
Return BigQuery SQL only, then a one-paragraph note on how to compare outcomes.
```

In [None]:
print("✅ Step 1: Defining the first counter-query (slice by region and time)...")

query_string1 = """
SELECT
  Region,
  EXTRACT(YEAR FROM Order_Date) AS year,
  SUM(Sales) AS total_sales,
  SUM(Profit) AS total_profit,
  AVG(Discount) AS avg_discount
FROM
  `database-project-467.lab1_foundation.superstore_clean`
WHERE
  Sub_Category = 'Tables'
GROUP BY
  Region, year
ORDER BY
  Region, year;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job1 = client.query(query_string1)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df1 = query_job1.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df1)} rows.")

    if results_df1.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please confirm there are 'Tables' records by region and year.")
    else:
        print("\n--- Displaying Results (Counter-query 1: Region + Year) ---")
        display(results_df1)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the first counter-query (slice by region and time)...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 16 rows.

--- Displaying Results (Counter-query 1: Region + Year) ---


| Region | year | total_sales | total_profit | avg_discount |
|---|---|---|---|---|
| Central | 2014 | 7785.478 | -1424.331 | 0.326667 |
| Central | 2015 | 6857.26 | -265.0939 | 0.207143 |
| Central | 2016 | 13922.926 | 292.6211 | 0.205882 |
| Central | 2017 | 10589.307 | -2162.8466 | 0.292308 |
| East | 2014 | 10603.704 | -3537.8375 | 0.38 |
| East | 2015 | 8884.806 | -2275.8641 | 0.373333 |
| East | 2016 | 7825.328 | -2306.7783 | 0.368182 |
| East | 2017 | 11825.969 | -2904.9002 | 0.373913 |
| South | 2014 | 9940.9445 | 1107.9902 | 0.113636 |
| South | 2015 | 7370.6745 | -2171.3765 | 0.21875 |


In [None]:
print("✅ Step 1: Defining the second counter-query (control for ship mode)...")

query_string2 = """
SELECT
  Ship_Mode,
  SUM(Sales) AS total_sales,
  SUM(Profit) AS total_profit,
  AVG(Discount) AS avg_discount
FROM
  `database-project-467.lab1_foundation.superstore_clean`
WHERE
  Sub_Category = 'Tables'
GROUP BY
  Ship_Mode
ORDER BY
  total_profit ASC;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

try:
    query_job2 = client.query(query_string2)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df2 = query_job2.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df2)} rows.")

    if results_df2.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please confirm there are 'Tables' records with ship mode data.")
    else:
        print("\n--- Displaying Results (Counter-query 2: Ship Mode) ---")
        display(results_df2)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 1: Defining the second counter-query (control for ship mode)...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying Results (Counter-query 2: Ship Mode) ---


| Ship_Mode | total_sales | total_profit | avg_discount |
|---|---|---|---|
| Standard Class | 124826.6615 | -11910.0122 | 0.270526 |
| Second Class | 43693.7475 | -3320.6799 | 0.248361 |
| First Class | 28800.776 | -1365.3665 | 0.240426 |
| Same Day | 9644.347 | -1129.4225 | 0.261905 |
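One way to compare the counter-queries is to put each slice on a common scale, such as profit margin (profit divided by sales), and then see whether the sign of the margin changes across slices. The sketch below does this for the ship-mode results copied from the output above; `profit_margin` is a hypothetical helper, and the same approach could be applied to the region-by-year rows.

```python
def profit_margin(total_sales, total_profit):
    """Profit as a fraction of sales; None when sales are zero or missing."""
    if not total_sales:
        return None
    return total_profit / total_sales


# (Ship_Mode, total_sales, total_profit, avg_discount), copied from counter-query 2.
ship_mode_rows = [
    ("Standard Class", 124826.6615, -11910.0122, 0.270526),
    ("Second Class", 43693.7475, -3320.6799, 0.248361),
    ("First Class", 28800.776, -1365.3665, 0.240426),
    ("Same Day", 9644.347, -1129.4225, 0.261905),
]

summary = [
    {"ship_mode": mode, "margin": profit_margin(sales, profit), "avg_discount": discount}
    for mode, sales, profit, discount in ship_mode_rows
]

# If every slice has a negative margin, ship mode alone does not explain the losses.
all_negative = all(row["margin"] < 0 for row in summary)
worst = min(summary, key=lambda row: row["margin"])

for row in summary:
    print(f"{row['ship_mode']:<15} margin={row['margin']:.2%} avg_discount={row['avg_discount']:.2f}")
print("All ship modes unprofitable:", all_negative)
```

A slice whose margin flips sign, such as the positive-profit rows in the region-by-year output, is where the original finding needs nuance.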


## Part H — Synthesis (DIVE: Extend)
**Aim:** Turn analysis into business‑ready insights.

### H1. Executive‑style summary
**Prompt:**
```
Act as a business strategist.
Based on the following metrics/figures (briefly summarize your results here), write a 4-sentence executive summary:
- 1 sentence: what changed and by how much
- 1 sentence: why it likely changed (drivers)
- 1 sentence: recommended action (who/what/when)
- 1 sentence: metric to monitor next
```

**Sample executive summary** (four sentences, following the structure above):
Revenue from the Corporate segment grew by 12% quarter-over-quarter, driven by stronger adoption in the West and Central regions. This change is likely due to a combination of higher average order values and more efficient fulfillment operations that reduced delivery delays. We recommend that the regional sales directors double down on targeted campaigns for Corporate accounts in Q4, supported by the operations team to sustain fulfillment efficiency. The next key metric to monitor is customer acquisition cost (CAC) relative to revenue growth to ensure the expansion remains profitable.

### H2. Convert final SQL into an automated job (optional)
**Prompt (use only after your SQL is final):**
```
Convert my final BigQuery SQL into a Python script that can run as a scheduled job from Colab or Cloud Functions.
Requirements:
- Use the `google-cloud-bigquery` Python client
- Parameterize date range
- Write results to a destination table `[YOUR_PROJECT].analytics.outputs_kpi`
- Add basic error handling & logging
Return one complete runnable script.
```

In [None]:
"""
Scheduled BigQuery Job Script
Runs a KPI aggregation query with parameterized date range
and writes results to [YOUR_PROJECT].analytics.outputs_kpi.
"""

import sys
import logging
from datetime import datetime
from google.cloud import bigquery

# --------------------------
# Config & Logging
# --------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)]
)

PROJECT_ID = "database-project-467"    # 🔄 replace with your project if different
SOURCE_DATASET = "lab1_foundation"
SOURCE_TABLE = "superstore_clean"
DESTINATION_DATASET = "analytics"
DESTINATION_TABLE = f"{PROJECT_ID}.{DESTINATION_DATASET}.outputs_kpi"

# --------------------------
# Define Query with Parameters
# --------------------------
QUERY_TEMPLATE = f"""
WITH monthly AS (
  SELECT
    DATE_TRUNC(Order_Date, MONTH) AS month,
    SUM(Sales) AS monthly_revenue
  FROM `{PROJECT_ID}.{SOURCE_DATASET}.{SOURCE_TABLE}`
  WHERE Segment = 'Corporate'
    AND Order_Date BETWEEN @start_date AND @end_date
  GROUP BY month
)
SELECT
  month,
  monthly_revenue,
  AVG(monthly_revenue) OVER (
    ORDER BY month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS ma_3
FROM monthly
ORDER BY month ASC
"""

# --------------------------
# Job Runner
# --------------------------
def run_query(start_date: str, end_date: str):
    """
    Executes the BigQuery job with date parameters
    and writes the results to the destination table.
    """
    client = bigquery.Client(project=PROJECT_ID)

    # ✅ Ensure destination dataset exists
    dataset_id = f"{PROJECT_ID}.{DESTINATION_DATASET}"
    dataset = bigquery.Dataset(dataset_id)
    dataset.location = "US"  # assumption: adjust to match the source dataset's location
    client.create_dataset(dataset, exists_ok=True)
    logging.info(f"Ensured dataset {dataset_id} exists.")

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start_date", "DATE", start_date),
            bigquery.ScalarQueryParameter("end_date", "DATE", end_date),
        ],
        destination=DESTINATION_TABLE,
        write_disposition="WRITE_TRUNCATE"  # overwrite table on each run
    )

    try:
        logging.info("Starting query job...")
        query_job = client.query(QUERY_TEMPLATE, job_config=job_config)
        query_job.result()  # Wait for job completion

        logging.info(f"✅ Query completed. Results written to {DESTINATION_TABLE}")

    except Exception as e:
        logging.error(f"❌ Query failed: {e}")
        raise

# --------------------------
# Entry Point
# --------------------------
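if __name__ == "__main__":
    # Example: last 3 years until today.
    # Note: the Superstore data queried in this notebook covers 2014-2017,
    # so widen the range if the job should return rows for that table.
    today = datetime.today().date()
    try:
        start = today.replace(year=today.year - 3)
    except ValueError:
        # Feb 29 has no counterpart three years earlier; fall back to Feb 28.
        start = today.replace(year=today.year - 3, day=28)
    start_date = str(start)
    end_date = str(today)

    logging.info(f"Running job for date range {start_date} → {end_date}")
    run_query(start_date, end_date)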
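The entry point above derives its range from today's date. If the job is launched by a scheduler, the dates could instead be passed as command-line flags. The following is a minimal sketch using only the standard library; the flag names `--start-date` and `--end-date` and the helper `parse_date_range` are assumptions for illustration, not part of the generated script.

```python
import argparse
from datetime import date


def parse_date_range(argv=None):
    """Parse and validate ISO dates, returning (start_date, end_date) as strings."""
    parser = argparse.ArgumentParser(description="Run the KPI job for a date range.")
    parser.add_argument("--start-date", type=date.fromisoformat, required=True,
                        help="Inclusive start date, YYYY-MM-DD")
    parser.add_argument("--end-date", type=date.fromisoformat, required=True,
                        help="Inclusive end date, YYYY-MM-DD")
    args = parser.parse_args(argv)
    if args.start_date > args.end_date:
        parser.error("--start-date must not be after --end-date")
    # The DATE query parameters in run_query accept ISO-formatted strings.
    return args.start_date.isoformat(), args.end_date.isoformat()


# Example with an explicit argument list; pass argv=None to read sys.argv instead.
start_date, end_date = parse_date_range(
    ["--start-date", "2014-01-01", "--end-date", "2017-12-31"]
)
print(start_date, end_date)
```

The returned strings can be handed straight to `run_query(start_date, end_date)`.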



---
## Submission checklist
- [ ] Kept prompts precise and reproducible  
- [ ] Captured at least **one** CTE query and **one** window function query  
- [ ] Documented **two** validation attempts (counter‑queries or alternate slice)  
- [ ] Wrote a 4‑sentence executive summary based on results  
- [ ] (Optional) Converted final query into a scheduled job
---