# Lab: Vertex AI–Assisted BigQuery Analytics — Example Prompts
**Goal:** Practice moving from simple SQL to complex analytics in BigQuery using *only* carefully engineered prompts with Vertex AI (Gemini).  
**Important:** This notebook contains **prompts only** (no starter code). Paste the prompts into **Vertex AI Studio**, **Vertex AI in Colab Enterprise**, or your chosen chat interface, and then run the generated SQL directly in **BigQuery**. If you decide to automate later, you can ask Vertex AI to convert the winning SQL into a Colab pipeline.

## How to use this prompts-only notebook
1. Open **Vertex AI Studio** (or the Gemini chat panel in Colab Enterprise).  
2. Copy a prompt from this notebook and paste it into the model. Do **not** paste any code from here; let the model generate it.  
3. Run the generated SQL in **BigQuery** (Console → BigQuery Studio).  
4. Iterate: refine the prompt when results aren’t what you expect.  
5. Document: capture your final SQL, plus a one-sentence takeaway, in your notes/README.

## Dataset assumptions
Use one of these sources (adjust table paths accordingly):
- **Global Superstore (Kaggle)** loaded into BigQuery (e.g., `[YOUR_PROJECT].superstore_data.sales`)  
- **TheLook eCommerce** public dataset: `bigquery-public-data.thelook_ecommerce`

If you are using *Global Superstore*, make sure column names match your schema (e.g., `Order_Date`, `Region`, `Category`, `Sub_Category`, `Sales`, `Profit`, `Discount`, `State`, `Customer_ID`, `Ship_Mode`).

---
## Prompting guardrails (quick checklist)
- **Be explicit**: table path, column names, filters, output columns, sort order, and limits.  
- **Ask for runnable SQL**: “Return a BigQuery SQL block only.”  
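- **Control cost**: ask for `LIMIT` during exploration and remove it for the final run. `LIMIT` caps the rows returned but generally not the bytes scanned, so also ask for only the columns you need.  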
- **Validate**: request a brief explanation of why each clause is present and how you can sanity-check results.
---
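
The checklist above can be turned into a small template so that every prompt states the same fields. The sketch below is only an illustration: `build_sql_prompt` and its parameters are hypothetical names, not part of the lab, and the field labels simply mirror the prompts used later in this notebook.

```python
def build_sql_prompt(task, table, output_columns, logic=None, sort=None, limit=None):
    """Assemble an explicit, single-purpose prompt (hypothetical helper)."""
    lines = [
        "BigQuery SQL only.",
        f"Task: {task}",
        f"Table: `{table}`",
        "Output columns: " + ", ".join(f"`{col}`" for col in output_columns),
    ]
    # Optional fields are only added when the caller supplies them.
    if logic:
        lines.append(f"Logic: {logic}")
    if sort:
        lines.append(f"Sort: {sort}")
    if limit is not None:
        lines.append(f"Add `LIMIT {limit}`.")
    return "\n".join(lines)


example_prompt = build_sql_prompt(
    task="Return the top 10 customers by total profit.",
    table="[YOUR_PROJECT].superstore_data.sales",
    output_columns=["Customer_ID", "total_profit"],
    logic="SUM Profit per customer",
    sort="`total_profit` DESC",
    limit=10,
)
print(example_prompt)
```

Keeping the fields in a fixed order makes it easier to spot which one was missing when the generated SQL is not what you expected.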

## Install Dependencies

In [5]:
# Install the Google Cloud BigQuery client library
!pip install google-cloud-bigquery==3.17.0 pandas==2.1.4

# Authenticate your Colab environment
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Collecting google-cloud-bigquery==3.17.0
  Downloading google_cloud_bigquery-3.17.0-py2.py3-none-any.whl.metadata (8.8 kB)
Collecting pandas==2.1.4
  Downloading pandas-2.1.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting numpy<2,>=1.26.0 (from pandas==2.1.4)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading google_cloud_bigquery-3.17.0-py2.py3-none-any.whl (230 kB)
Downloading pandas-2.1.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.

Authenticated


## Copy Schema to a dataframe

In [3]:
from google.cloud import bigquery
import pandas as pd

# Replace with your Google Cloud Project ID
project_id = 'key-chalice-471013-a5' # This is derived from your provided table name
dataset_id = 'lab1_foundation_kn'
table_id = 'superstore'

# Construct a BigQuery client object.
client = bigquery.Client(project=project_id)

# Get the table object
table_ref = client.dataset(dataset_id).table(table_id)
table = client.get_table(table_ref)

# Extract schema information
schema_list = []
for field in table.schema:
    schema_list.append({
        'name': field.name,
        'field_type': field.field_type,
        'mode': field.mode,
        'description': field.description
    })

# Convert to Pandas DataFrame
schema_df = pd.DataFrame(schema_list)

# Confirm the DataFrame was built. The schema itself is not printed here;
# add display(schema_df) if you want to inspect it.
print("Schema DataFrame created:")


Schema DataFrame created:


## Clean Column Names

In [4]:
# --- 1. Clean the Column Names ---
# Create a 'clean_name' column with standard naming conventions:
# lowercase, with spaces and hyphens replaced by underscores.
schema_df['clean_name'] = schema_df['name'].str.lower().str.replace(' ', '_').str.replace('-', '_')


# --- 2. Generate the Aliases for the SELECT Clause ---
column_expressions = []
for index, row in schema_df.iterrows():
    original_name = row['name']
    clean_name = row['clean_name']

    # If the original name contains a space or special character, it needs to be
    # enclosed in backticks (`) in the SQL statement.
    if ' ' in original_name or '-' in original_name:
        expression = f'`{original_name}` AS {clean_name}'
    else:
        # If the name is already clean, we still alias it for consistency.
        expression = f'{original_name} AS {clean_name}'
    column_expressions.append(expression)

# Join all the individual column expressions into a single, formatted string.
select_clause = ",\n  ".join(column_expressions)


# --- 3. Construct the Final CREATE VIEW Statement ---
new_view_id = 'superstore_clean' # You can change this if you like

create_view_sql = f"""
CREATE OR REPLACE VIEW `{project_id}.{dataset_id}.{new_view_id}` AS
SELECT
  {select_clause}
FROM
  `{project_id}.{dataset_id}.{table_id}`;
"""

# --- 4. Print the Final SQL ---
print("--- Copy the SQL below and run it in your BigQuery Console ---")
print(create_view_sql)

--- Copy the SQL below and run it in your BigQuery Console ---

CREATE OR REPLACE VIEW `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean` AS
SELECT
  `Row ID` AS row_id,
  `Order ID` AS order_id,
  `Order Date` AS order_date,
  `Ship Date` AS ship_date,
  `Ship Mode` AS ship_mode,
  `Customer ID` AS customer_id,
  `Customer Name` AS customer_name,
  Segment AS segment,
  Country AS country,
  City AS city,
  State AS state,
  `Postal Code` AS postal_code,
  Region AS region,
  `Product ID` AS product_id,
  Category AS category,
  `Sub-Category` AS sub_category,
  `Product Name` AS product_name,
  Sales AS sales,
  Quantity AS quantity,
  Discount AS discount,
  Profit AS profit
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore`;
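
The cleaning step above handles spaces and hyphens only. If a schema contained other characters (parentheses, slashes, dots), a more general rule might be useful. The following is a minimal sketch under that assumption; `to_snake_case` and `alias_expression` are hypothetical helpers, not functions from the notebook or from any library.

```python
import re


def to_snake_case(name: str) -> str:
    """Lowercase a column name and collapse each run of
    non-alphanumeric characters into a single underscore."""
    collapsed = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return collapsed.strip("_").lower()


def alias_expression(name: str) -> str:
    # Backtick-quoting every source name is valid whether or not the name
    # needs it, so no special case for "already clean" names is required.
    return f"`{name}` AS {to_snake_case(name)}"


for original in ["Row ID", "Sub-Category", "Profit (USD)", "Sales"]:
    print(alias_expression(original))
```

Quoting every source name removes the `if`/`else` branch, at the cost of slightly noisier SQL. Names that themselves contain a backtick are not handled here.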



## Generate View with standard column naming convention

In [5]:
from google.cloud import bigquery
import pandas as pd

# Replace with your Google Cloud Project ID
project_id = 'key-chalice-471013-a5' # This is derived from your provided table name
dataset_id = 'lab1_foundation_kn'
table_id = 'superstore'

# Construct a BigQuery client object.
client = bigquery.Client(project=project_id)

# Get the table object
table_ref = client.dataset(dataset_id).table(table_id)
table = client.get_table(table_ref)

# Extract schema information
schema_list = []
for field in table.schema:
    schema_list.append({
        'name': field.name,
        'field_type': field.field_type,
        'mode': field.mode,
        'description': field.description
    })

# Convert to Pandas DataFrame
schema_df = pd.DataFrame(schema_list)


# --- 1. Clean the Column Names ---
# Create a 'clean_name' column with standard naming conventions:
# lowercase, with spaces and hyphens replaced by underscores.
schema_df['clean_name'] = schema_df['name'].str.lower().str.replace(' ', '_').str.replace('-', '_')


# --- 2. Generate the Aliases for the SELECT Clause ---
column_expressions = []
for index, row in schema_df.iterrows():
    original_name = row['name']
    clean_name = row['clean_name']

    # If the original name contains a space or special character, it needs to be
    # enclosed in backticks (`) in the SQL statement.
    if ' ' in original_name or '-' in original_name:
        expression = f'`{original_name}` AS {clean_name}'
    else:
        # If the name is already clean, we still alias it for consistency.
        expression = f'{original_name} AS {clean_name}'
    column_expressions.append(expression)

# Join all the individual column expressions into a single, formatted string.
select_clause = ",\n  ".join(column_expressions)


# --- 3. Construct the Final CREATE VIEW Statement ---
new_view_id = 'superstore_clean' # You can change this if you like

create_view_sql = f"""
CREATE OR REPLACE VIEW `{project_id}.{dataset_id}.{new_view_id}` AS
SELECT
  {select_clause}
FROM
  `{project_id}.{dataset_id}.{table_id}`;
"""


# Execute the CREATE VIEW SQL query
try:
    query_job = client.query(create_view_sql)  # API request
    query_job.result()  # Waits for the query to finish
    print(f"View '{new_view_id}' created/replaced successfully in dataset '{dataset_id}'.")
except Exception as e:
    print(f"An error occurred while creating the view: {e}")

# Now, let's print 10 rows from the newly created view to verify
print(f"\n--- First 10 rows from the new view '{new_view_id}' ---")
try:
    # Construct a reference to the new view
    view_table_ref = client.dataset(dataset_id).table(new_view_id)

    # Fetch the first 10 rows
    rows = client.list_rows(view_table_ref, max_results=10)

    # Print header
    print(" | ".join([field.name for field in rows.schema]))
    print("-" * 80) # Separator

    # Print rows
    for row in rows:
        print(" | ".join([str(item) for item in row.values()]))

except Exception as e:
    print(f"An error occurred while fetching rows from the view: {e}")

View 'superstore_clean' created/replaced successfully in dataset 'lab1_foundation_kn'.

--- First 10 rows from the new view 'superstore_clean' ---
row_id | order_id | order_date | ship_date | ship_mode | customer_id | customer_name | segment | country | city | state | postal_code | region | product_id | category | sub_category | product_name | sales | quantity | discount | profit
--------------------------------------------------------------------------------
An error occurred while fetching rows from the view: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/key-chalice-471013-a5/datasets/lab1_foundation_kn/tables/superstore_clean/data?maxResults=10&formatOptions.useInt64Timestamp=True&prettyPrint=false: Cannot list a table of type VIEW.

**Note:** `client.list_rows` reads stored table data directly, and a view stores no rows, which is why BigQuery answers `Cannot list a table of type VIEW`. A view has to be read by running a query against it, as the next cell does.


In [7]:
# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.

print("✅ Step 1: Defining the query string...")

query_string = """
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
LIMIT 10;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,order_id,customer_name,product_name,sales,profit
0,CA-2015-154900,Sung Shariari,Avery 518,3.15,1.512
1,CA-2015-154900,Sung Shariari,Adams Telephone Message Book W/Dividers/Space ...,22.72,10.224
2,US-2016-152415,Patrick O'Donnell,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,6.2244
3,US-2016-152415,Patrick O'Donnell,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,61.3824
4,CA-2016-153269,Pamela Stobb,"Personal Folder Holder, Ebony",11.21,3.363
5,CA-2016-153269,Pamela Stobb,"Situations Contoured Folding Chairs, 4/Set",354.9,88.725
6,CA-2016-153269,Pamela Stobb,Xerox 193,17.94,8.7906
7,CA-2016-153269,Pamela Stobb,GBC Binding covers,51.8,23.31
8,CA-2015-158792,Brian Dahlen,Staples,22.2,10.434
9,CA-2016-141082,Fred McMath,Avery 517,3.69,1.7343


In [8]:
%%bigquery
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
LIMIT 10;


ERROR:
 404 POST https://bigquery.googleapis.com/bigquery/v2/projects//jobs?prettyPrint=false: Request couldn't be served.

Location: None
Job ID: fb0a61b5-e0bd-4352-959f-21db84884b06

**Note:** The request URL contains `projects//jobs`, so the `%%bigquery` magic ran with an empty project ID; it does not reuse the `client` object created earlier. Passing the project on the magic line, for example `%%bigquery --project key-chalice-471013-a5`, is the usual fix.



## Part A — SQL Warm‑Up (SELECT, WHERE, ORDER BY, LIMIT, DISTINCT)
**Aim:** Build confidence with precise, unambiguous prompts that yield clean, runnable SQL.

### A1. Unique values (DISTINCT)
**Prompt (paste in Vertex AI):**
```
Act as a senior BigQuery analyst. Produce a **single runnable BigQuery SQL** (no commentary) for:
- Task: List all unique `Sub_Category` values sold in the 'West' region.
- Table: `mgmt-467-47888.lab1_foundation.superstore`
- Filter: `Region = 'West'`
- Output: a single column named `Sub_Category`
- Sort: alphabetically A→Z
- Add: `LIMIT 100` to control cost during exploration.
```
**Reflection:** Did the result match your expectations? If not, what ambiguity in your prompt might have caused the mismatch?

The result matched my expectations: the expected column was displayed. There may be some slight ambiguity because some columns contain missing values, and the query returned only a small amount of data.

In [9]:
query_string = """
SELECT
    DISTINCT sub_category AS Sub_Category
FROM
    `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
WHERE
    region = 'West'
ORDER BY
    Sub_Category ASC
LIMIT 100
"""
# The view exposes cleaned names, so use sub_category rather than `Sub-Category`.
# Submit this query; reusing the previous cell's query_job would return stale rows.
query_job = client.query(query_string)
results_df = query_job.to_dataframe()
display(results_df)


### A2. Top‑N by metric (ORDER BY … DESC)
**Prompt:**
```
BigQuery SQL only.
Task: Return the top 10 customers by total profit.
Table: `mgmt-467-47888.lab1_foundation.superstore`
Columns used: `Customer_ID`, `Profit`
Output columns: `Customer_ID`, `total_profit`
Logic: SUM Profit per customer, order by `total_profit` DESC
Add `LIMIT 10`.
```
**Tip:** If your schema uses different identifiers (e.g., `Customer Name`), restate column names explicitly.

In [10]:
query_string = """
SELECT
  customer_id,
  SUM(profit) AS total_profit
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
GROUP BY
  customer_id
ORDER BY
  total_profit DESC
LIMIT 10;
"""
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")


✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,customer_id,total_profit
0,TC-20980,8981.3239
1,RB-19360,6976.0959
2,SC-20095,5757.4119
3,HL-15040,5622.4292
4,AB-10105,5444.8055
5,TA-21385,4703.7883
6,CM-12385,3899.8904
7,KD-16495,3038.6254
8,AR-10540,2884.6208
9,DR-12940,2869.076


### A3. Basic filtering (WHERE) + sanity checks
**Prompt:**
```
BigQuery SQL only.
Task: Count orders shipped with each `Ship_Mode`, but only for orders in the 'Technology' category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Ship_Mode`, `order_count`
Logic: COUNT(*) grouped by `Ship_Mode`
Sort by `order_count` DESC
```
**Validation ask:** “Also list two quick sanity checks to verify the numbers.”

In [11]:
query_string = """
SELECT
  ship_mode,
  COUNT(*) AS order_count
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
WHERE
  category = 'Technology'
GROUP BY
  ship_mode
ORDER BY
  order_count DESC;
"""

print("✅ Step 1: Defining the query string...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data and contains 'Technology' in the category column.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 4 rows.

--- Displaying Results ---


Unnamed: 0,ship_mode,order_count
0,Standard Class,1082
1,Second Class,366
2,First Class,301
3,Same Day,98
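
One sanity check that fits this query: `COUNT(*)` counts rows, not orders. The sample rows shown earlier repeat the same `order_id` on several lines, which suggests one row per line item, so an order with several items would be counted several times. The sketch below illustrates the difference on a tiny made-up table in an in-memory SQLite database (toy data and SQLite are assumptions for illustration; the same two aggregates exist in BigQuery).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id TEXT, ship_mode TEXT)")
# Toy rows: order O1 has two line items, O2 and O3 have one each.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("O1", "Standard Class"),
        ("O1", "Standard Class"),
        ("O2", "Standard Class"),
        ("O3", "Same Day"),
    ],
)

counts = conn.execute(
    """
    SELECT
      ship_mode,
      COUNT(*) AS row_count,                   -- line items
      COUNT(DISTINCT order_id) AS order_count  -- distinct orders
    FROM sales
    GROUP BY ship_mode
    ORDER BY row_count DESC
    """
).fetchall()

for ship_mode, row_count, order_count in counts:
    print(ship_mode, row_count, order_count)
```

If the two counts differ in your data, decide which one the question actually asks for and say so explicitly in the prompt.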


## Part B — Grouped Analytics (GROUP BY, HAVING)
**Aim:** Turn raw facts into grouped metrics and filtered aggregations.

### B1. KPI aggregation with WHERE + GROUP BY
**Prompt:**
```
BigQuery SQL only.
Task: Compute monthly revenue for the last 12 full months.
Table: `[YOUR_PROJECT].superstore_data.sales`
Assume: `Order_Date` is a DATE or TIMESTAMP column named exactly `Order_Date`.
Output: `year_month` (YYYY-MM format), `monthly_revenue`
Logic: Truncate date to month, SUM `Sales`, filter to last 12 full months.
Sort by `year_month` ascending.
Include a `LIMIT` safeguard for exploration.
```

In [12]:
query_string = """
SELECT
  FORMAT_DATE('%Y-%m', DATE_TRUNC(order_date, MONTH)) AS year_month,
  SUM(sales) AS monthly_revenue
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
WHERE
  order_date >= DATE_SUB(DATE_TRUNC(CURRENT_DATE(), MONTH), INTERVAL 12 MONTH)
  AND order_date < DATE_TRUNC(CURRENT_DATE(), MONTH)
GROUP BY
  year_month
ORDER BY
  year_month ASC
LIMIT 100; -- Added LIMIT for exploration as requested
"""

print("✅ Step 1: Defining the query string...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your date filtering and that your 'superstore_clean' view exists and has data within the last 12 months.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 0 rows.

⚠️ The query ran successfully but returned an empty result. Please double-check your date filtering and that your 'superstore_clean' view exists and has data within the last 12 months.
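
The empty result is consistent with the filter being anchored to `CURRENT_DATE()`: if the newest `order_date` in the view is more than a year old, no row falls inside the window. One possible workaround, sketched below, anchors the window to the latest date in the data instead. The helper `last_full_months` and the `anchored_query` string are illustrative assumptions, not part of the lab, and the SQL has not been run against the view.

```python
from datetime import date


def last_full_months(anchor: date, months: int = 12) -> tuple[date, date]:
    """Return [start, end) covering the `months` full months that
    precede the month containing `anchor`."""
    end = anchor.replace(day=1)
    # Work in "months since year 0" so year boundaries need no special case.
    month_index = end.year * 12 + (end.month - 1) - months
    start = date(month_index // 12, month_index % 12 + 1, 1)
    return start, end


# The same window in BigQuery SQL, anchored to MAX(order_date)
# (assumes order_date is a DATE column, as in the cell above).
anchored_query = """
WITH bounds AS (
  SELECT DATE_TRUNC(MAX(order_date), MONTH) AS window_end
  FROM `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
)
SELECT
  FORMAT_DATE('%Y-%m', DATE_TRUNC(order_date, MONTH)) AS year_month,
  SUM(sales) AS monthly_revenue
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
CROSS JOIN bounds
WHERE
  order_date >= DATE_SUB(window_end, INTERVAL 12 MONTH)
  AND order_date < window_end
GROUP BY
  year_month
ORDER BY
  year_month ASC;
"""

# Hypothetical anchor date, used only to show the window arithmetic.
start, end = last_full_months(date(2017, 12, 30))
print(start, end)
```

Note that this excludes the month containing the latest order, because that month may be incomplete. Stating the anchor explicitly in the prompt ("relative to the latest `Order_Date` in the table") avoids the ambiguity in the first place.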


### B2. Post‑aggregation filter (HAVING)
**Prompt:**
```
BigQuery SQL only.
Task: Find sub-categories whose total profit over the entire dataset is negative.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Sub_Category`, `total_profit`
Logic: SUM `Profit` GROUP BY `Sub_Category`, HAVING SUM(Profit) < 0
Sort by `total_profit` ASC (most negative first).
```
**Why HAVING?** Ask the model to include a 1-sentence explanation of why HAVING is used instead of WHERE here.

In [14]:
query_string = """
SELECT
  sub_category,
  SUM(profit) AS total_profit
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
GROUP BY
  sub_category
HAVING
  SUM(profit) < 0
ORDER BY
  total_profit ASC;
"""

print("✅ Step 1: Defining the query string...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. This means there are no sub-categories with negative total profit in your data, or there was an issue with the query or data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

#Explanation of using HAVING vs WHERE

print("\n--- Explanation of HAVING vs WHERE ---")
print("HAVING is used here instead of WHERE because we are filtering based on the result of an aggregate function (SUM(profit)). WHERE filters individual rows *before* grouping, while HAVING filters groups *after* aggregation.")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 3 rows.

--- Displaying Results ---


Unnamed: 0,sub_category,total_profit
0,Tables,-17725.4811
1,Bookcases,-3472.556
2,Supplies,-1189.0995



--- Explanation of HAVING vs WHERE ---
HAVING is used here instead of WHERE because we are filtering based on the result of an aggregate function (SUM(profit)). WHERE filters individual rows *before* grouping, while HAVING filters groups *after* aggregation.
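
To see the difference concretely, the sketch below runs both variants on a tiny made-up table in an in-memory SQLite database. The toy rows and the use of SQLite are assumptions for illustration only; the `WHERE`/`HAVING` semantics shown are the same in BigQuery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sub_category TEXT, profit REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("Tables", -50.0),
        ("Tables", 20.0),
        ("Chairs", 30.0),
        ("Chairs", -10.0),
        ("Paper", 5.0),
    ],
)

# HAVING: aggregate all rows first, then keep groups whose total is negative.
having_rows = conn.execute(
    """
    SELECT sub_category, SUM(profit) AS total_profit
    FROM sales
    GROUP BY sub_category
    HAVING SUM(profit) < 0
    ORDER BY total_profit ASC
    """
).fetchall()

# WHERE: drop profitable rows first, then aggregate what is left.
where_rows = conn.execute(
    """
    SELECT sub_category, SUM(profit) AS total_profit
    FROM sales
    WHERE profit < 0
    GROUP BY sub_category
    ORDER BY total_profit ASC
    """
).fetchall()

print("HAVING:", having_rows)
print("WHERE: ", where_rows)
```

With `HAVING`, only Tables is returned, because its net total is negative. With `WHERE`, Chairs also appears even though it is profitable overall, since only its loss-making rows were summed.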


## Part C — Joins (dimension enrichment)
**Aim:** Use joins to enhance facts with attributes.

### C1. Join facts to a small dimension
*(If you have a customer or product dimension in your schema, use it. Otherwise, request a synthetic example.)*  
**Prompt:**
```
BigQuery SQL only.
Task: Join the sales table to a product dimension to report `Product_ID`, `Product_Name`, and total sales.
Tables: `[YOUR_PROJECT].superstore_data.sales` as s, `[YOUR_PROJECT].superstore_data.products` as p
Join key: `s.Product_ID = p.Product_ID`
Output: `Product_ID`, `Product_Name`, `total_sales`
Sort by `total_sales` DESC
```
**If you lack a dimension table:** Ask the model how to simulate one temporarily via a CTE.

In [15]:
query_string = """
WITH
  product_dimension AS (
    SELECT DISTINCT
      product_id,
      product_name
    FROM
      `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
  )
SELECT
  pd.product_id,
  pd.product_name,
  SUM(s.sales) AS total_sales
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean` AS s
JOIN
  product_dimension AS pd
ON
  s.product_id = pd.product_id
GROUP BY
  pd.product_id,
  pd.product_name
ORDER BY
  total_sales DESC;
"""

print("✅ Step 1: Defining the query string...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 1894 rows.

--- Displaying Results ---


Unnamed: 0,product_id,product_name,total_sales
0,TEC-CO-10004722,Canon imageCLASS 2200 Advanced Copier,61599.824
1,OFF-BI-10003527,Fellowes PB500 Electric Punch Plastic Comb Bin...,27453.384
2,TEC-MA-10002412,Cisco TelePresence System EX90 Videoconferenci...,22638.480
3,FUR-CH-10002024,HON 5400 Series Task Chairs for Big and Tall,21870.576
4,OFF-BI-10001359,GBC DocuBind TL300 Electric Binding System,19823.479
...,...,...,...
1889,OFF-AR-10003986,Avery Hi-Liter Pen Style Six-Color Fluorescent...,7.700
1890,OFF-EN-10001535,Grip Seal Envelopes,7.072
1891,OFF-PA-10000048,Xerox 20,6.480
1892,OFF-LA-10003388,Avery 5,5.760
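
One caveat with a dimension built from `SELECT DISTINCT product_id, product_name`: if any `product_id` appears with more than one `product_name`, the join on `product_id` alone matches each sales row to every name, and the totals are repeated under each name. Whether that happens in your table depends on the data. The sketch below reproduces the effect on made-up rows in SQLite and shows one possible remedy (a single name per ID); the toy data and the helper `total_by_product` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id TEXT, product_name TEXT, sales REAL)")
# Toy rows: product P1 appears under two different names.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("P1", "Desk", 100.0), ("P1", "Desk v2", 50.0), ("P2", "Lamp", 20.0)],
)


def total_by_product(dimension_sql):
    """Join sales to the given dimension definition and total sales per product."""
    query = f"""
        WITH product_dimension AS ({dimension_sql})
        SELECT pd.product_id, pd.product_name, SUM(s.sales) AS total_sales
        FROM sales AS s
        JOIN product_dimension AS pd ON s.product_id = pd.product_id
        GROUP BY pd.product_id, pd.product_name
        ORDER BY pd.product_id, pd.product_name
    """
    return conn.execute(query).fetchall()


# Dimension with one row per (id, name) pair: P1 fans out to two rows.
distinct_pairs = total_by_product(
    "SELECT DISTINCT product_id, product_name FROM sales"
)
# Dimension with exactly one row per id: no fan-out.
one_name_per_id = total_by_product(
    "SELECT product_id, MIN(product_name) AS product_name "
    "FROM sales GROUP BY product_id"
)

print(distinct_pairs)
print(one_name_per_id)
```

A quick check on real data is to count the distinct names per `product_id` and look for any count above 1 before trusting the joined totals.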


## Part D — Common Table Expressions (CTEs)
**Aim:** Make complex logic readable and testable in steps.

### D1. Multi‑step ranking with CTEs
**Prompt:**
```
BigQuery SQL only.
Goal: Within each `Region`, rank states by total sales and return top 3 per region.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE 1 (`state_sales`): SUM(Sales) by `Region`, `State`
CTE 2 (`ranked_state_sales`): Add `RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as `sales_rank`
Final SELECT: rows where `sales_rank <= 3`
Output columns: `Region`, `State`, `total_sales`, `sales_rank`
Sort: by `Region`, then `sales_rank`
```
**Ask for**: a one-paragraph explanation of each step, then **provide only the final runnable SQL**.

**Explanation of Steps for Ranking States by Sales within Region:**

1.  **`state_sales` CTE:** This Common Table Expression calculates the total sales for each combination of `Region` and `State`. It groups the data by these two columns and uses the `SUM()` aggregate function on the `sales` column to get the `total_sales`. This step aggregates the sales data to the state level within each region.

2.  **`ranked_state_sales` CTE:** This CTE takes the results from the `state_sales` CTE and adds a rank to each state within its respective region. The `RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC)` window function is used here. `PARTITION BY Region` divides the data into partitions based on the region, so the ranking is done independently for each region. `ORDER BY total_sales DESC` ensures that states with higher total sales receive a lower rank (rank 1 being the highest sales). The result of this ranking is aliased as `sales_rank`.

3.  **Final SELECT:** The final `SELECT` statement retrieves the desired columns (`Region`, `State`, `total_sales`, `sales_rank`) from the `ranked_state_sales` CTE. The `WHERE sales_rank <= 3` clause filters the results to include only the top 3 ranked states within each region. The results are then ordered first by `Region` and then by `sales_rank` to present the top states clearly within each region.

In [16]:
query_string = """
WITH
  state_sales AS (
    SELECT
      region,
      state,
      SUM(sales) AS total_sales
    FROM
      `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
    GROUP BY
      region,
      state
  ),
  ranked_state_sales AS (
    SELECT
      region,
      state,
      total_sales,
      RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) AS sales_rank
    FROM
      state_sales
  )
SELECT
  region,
  state,
  total_sales,
  sales_rank
FROM
  ranked_state_sales
WHERE
  sales_rank <= 3
ORDER BY
  region,
  sales_rank;
"""

print("✅ Step 1: Defining the query string...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 12 rows.

--- Displaying Results ---


Unnamed: 0,region,state,total_sales,sales_rank
0,Central,Texas,170188.0458,1
1,Central,Illinois,80166.101,2
2,Central,Michigan,76269.614,3
3,East,New York,310876.271,1
4,East,Pennsylvania,116511.914,2
5,East,Ohio,78258.136,3
6,South,Florida,89473.708,1
7,South,Virginia,70636.72,2
8,South,North Carolina,55603.164,3
9,West,California,457687.6315,1


### D2. Time‑boxed “most improved” analysis
**Prompt:**
```
BigQuery SQL only.
Goal: Identify the top 5 sub-categories with the largest YoY revenue increase from 2023 to 2024.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `yr_sales`: SUM(Sales) by `Sub_Category` and `year` extracted from `Order_Date`
Final: pivot or self-join to compute delta (2024 minus 2023) as `yoy_delta`
Output: `Sub_Category`, `sales_2023`, `sales_2024`, `yoy_delta`
Order by `yoy_delta` DESC
Limit 5
```
**Validation:** Ask the model for two quick failure modes (e.g., missing years) and how to handle them.

In [23]:
query_string = """
WITH
  yr_sales AS (
    SELECT
      sub_category,
      EXTRACT(YEAR FROM order_date) AS sales_year,
      SUM(sales) AS yearly_sales
    FROM
      `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
    WHERE EXTRACT(YEAR FROM order_date) IN (2023, 2024) -- Filter for relevant years
    GROUP BY
      sub_category,
      sales_year
  )
SELECT
  ys_2024.sub_category,
  ys_2023.yearly_sales AS sales_2023,
  ys_2024.yearly_sales AS sales_2024,
  (ys_2024.yearly_sales - ys_2023.yearly_sales) AS yoy_delta
FROM
  yr_sales AS ys_2024
JOIN
  yr_sales AS ys_2023
ON
  ys_2024.sub_category = ys_2023.sub_category
WHERE
  ys_2024.sales_year = 2024
  AND ys_2023.sales_year = 2023
ORDER BY
  yoy_delta DESC
LIMIT 5;
"""

print("✅ Step 1: Defining the query string...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. This might mean there is no data for 2023 or 2024 in your table, or no sub-categories had sales in both years.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 0 rows.

⚠️ The query ran successfully but returned an empty result. This might mean there is no data for 2023 or 2024 in your table, or no sub-categories had sales in both years.


**Missing years:** If your table contains no rows for 2023 or 2024, the query returns an empty result. That appears to be what happened in the run above: the outputs later in this notebook (E2 and E3) show order dates from 2014 through 2017.

**No sub-categories with sales in both years:** The inner self-join keeps a sub-category only if it has sales in both 2023 and 2024. If no sub-category meets this criterion, the result is empty.

**To handle these**, verify the range of years present in your data with a query such as `` SELECT DISTINCT EXTRACT(YEAR FROM order_date) FROM `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`; `` and check that the `superstore_clean` view exists and contains data.
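The second failure mode can also be avoided by pivoting with conditional aggregation instead of an inner self-join. The sketch below illustrates the idea on a few made-up rows, using Python's built-in `sqlite3` as a local stand-in for BigQuery. The table `toy_sales`, its rows, and the helper `pivot_yoy` are hypothetical; the `SUM(CASE WHEN ...)` pattern itself should carry over to BigQuery, but check the result against your own table.

```python
import sqlite3

# Hypothetical toy rows: (sub_category, sales_year, sales). Not from the lab table.
ROWS = [
    ("Phones", 2023, 100.0),
    ("Phones", 2024, 150.0),
    ("Chairs", 2023, 80.0),   # no 2024 sales
    ("Tables", 2024, 60.0),   # no 2023 sales
]

# Conditional aggregation keeps every sub-category, even if one year is missing.
PIVOT_SQL = """
SELECT
  sub_category,
  SUM(CASE WHEN sales_year = 2023 THEN sales END) AS sales_2023,
  SUM(CASE WHEN sales_year = 2024 THEN sales END) AS sales_2024,
  COALESCE(SUM(CASE WHEN sales_year = 2024 THEN sales END), 0)
    - COALESCE(SUM(CASE WHEN sales_year = 2023 THEN sales END), 0) AS yoy_delta
FROM toy_sales
WHERE sales_year IN (2023, 2024)
GROUP BY sub_category
ORDER BY yoy_delta DESC
"""


def pivot_yoy(rows):
    """Load the toy rows into an in-memory SQLite table and run the pivot."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE toy_sales (sub_category TEXT, sales_year INTEGER, sales REAL)"
    )
    conn.executemany("INSERT INTO toy_sales VALUES (?, ?, ?)", rows)
    result = conn.execute(PIVOT_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in pivot_yoy(ROWS):
        print(row)
```

A missing year shows up as `None` (NULL) in its own column rather than silently dropping the sub-category, which makes the gap visible when you review the output.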

## Part E — Window Functions (ROW_NUMBER, RANK, DENSE_RANK, LAG/LEAD, moving averages)
**Aim:** Compare rows across partitions and time; compute trends and ranks without collapsing rows.

### E1. Top product per region (ROW_NUMBER)
**Prompt:**
```
BigQuery SQL only.
Task: For each `Region`, return only the single highest-revenue `Sub_Category`.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `subcat_sales`: SUM(Sales) by `Region`, `Sub_Category`
Add `ROW_NUMBER() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as rn
Final: filter `rn = 1`
Output: `Region`, `Sub_Category`, `total_sales`
Sort by `Region`
```
**Why `ROW_NUMBER` instead of `RANK`?** Ask the model to add a 2-sentence contrast.

In [22]:
query_string = """
WITH
  subcat_sales AS (
    SELECT
      region,
      sub_category,
      SUM(sales) AS total_sales
    FROM
      `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
    GROUP BY
      region,
      sub_category
  ),
  ranked_subcat_sales AS (
    SELECT
      region,
      sub_category,
      total_sales,
      ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC) AS rn
    FROM
      subcat_sales
  )
SELECT
  region,
  sub_category,
  total_sales
FROM
  ranked_subcat_sales
WHERE
  rn = 1
ORDER BY
  region;
"""

print("✅ Step 1: Defining the query string...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 4 rows.

--- Displaying Results ---


| | region | sub_category | total_sales |
|---|---|---|---|
| 0 | Central | Chairs | 85230.646 |
| 1 | East | Phones | 100614.982 |
| 2 | South | Phones | 58304.438 |
| 3 | West | Chairs | 101781.328 |


Here is the requested contrast between ROW_NUMBER and RANK:

ROW_NUMBER() assigns a unique sequential integer to each row within its partition, even if there are ties in the ordering column. RANK() assigns a rank to each row within its partition, and if there are ties, they receive the same rank, with the next rank skipped.
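To see the contrast on actual rows, here is a small sketch with a deliberate tie. It uses Python's built-in `sqlite3` as a local stand-in for BigQuery (window functions need SQLite 3.25 or newer, which is an assumption about your Python build). The toy rows and the helper `rank_demo` are hypothetical.

```python
import sqlite3

# Hypothetical toy data with a deliberate tie between Chairs and Phones.
ROWS = [
    ("West", "Chairs", 500.0),
    ("West", "Phones", 500.0),
    ("West", "Tables", 300.0),
]

RANK_SQL = """
SELECT
  sub_category,
  -- sub_category is added as a tie-breaker so rn is deterministic
  ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC, sub_category) AS rn,
  RANK()       OVER (PARTITION BY region ORDER BY total_sales DESC) AS rnk,
  DENSE_RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) AS dense_rnk
FROM subcat_sales
ORDER BY rn
"""


def rank_demo(rows):
    """Return (sub_category, rn, rnk, dense_rnk) for the toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subcat_sales (region TEXT, sub_category TEXT, total_sales REAL)"
    )
    conn.executemany("INSERT INTO subcat_sales VALUES (?, ?, ?)", rows)
    result = conn.execute(RANK_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in rank_demo(ROWS):
        print(row)
```

With the tie, filtering on `rn = 1` returns exactly one row per region, while filtering on `rnk = 1` would return both tied rows. Without a tie-breaker in the `ORDER BY`, which tied row receives `rn = 1` is arbitrary.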

### E2. YoY growth with LAG
**Prompt:**
```
BigQuery SQL only.
Task: Compute year-over-year revenue growth for 'Phones' sub-category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Filter to `Sub_Category = 'Phones'`
- Aggregate yearly revenue using EXTRACT(YEAR FROM Order_Date)
- Add `LAG(yearly_revenue) OVER (ORDER BY year)` as `prev_revenue`
- Compute `yoy_pct = 100.0 * (yearly_revenue - prev_revenue) / prev_revenue`
Output: `year`, `yearly_revenue`, `prev_revenue`, `yoy_pct`
Sort by `year` ASC
```
**Ask for**: a guard against divide-by-zero or NULL previous year.

In [32]:
query_string = """
WITH
  yearly_sales AS (
    SELECT
      EXTRACT(YEAR FROM order_date) AS sales_year,
      SUM(sales) AS yearly_revenue
    FROM
      `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
    WHERE
      sub_category = 'Phones'
    GROUP BY
      sales_year
  ),
  lagged_sales AS (
    SELECT
      sales_year,
      yearly_revenue,
      LAG(yearly_revenue) OVER (ORDER BY sales_year) AS prev_revenue
    FROM
      yearly_sales
  )
SELECT
  sales_year,
  yearly_revenue,
  prev_revenue,
  -- Guard against divide-by-zero or NULL previous year
  SAFE_DIVIDE((yearly_revenue - prev_revenue), prev_revenue) * 100.0 AS yoy_pct
FROM
  lagged_sales
ORDER BY
  sales_year ASC;
"""

print("✅ Step 1: Defining the query string...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. This might mean there is no data for the 'Phones' sub-category in your table.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 4 rows.

--- Displaying Results ---


| | sales_year | yearly_revenue | prev_revenue | yoy_pct |
|---|---|---|---|---|
| 0 | 2014 | 77390.806 | | |
| 1 | 2015 | 68313.702 | 77390.806 | -11.728918 |
| 2 | 2016 | 78962.03 | 68313.702 | 15.587397 |
| 3 | 2017 | 105340.516 | 78962.03 | 33.406545 |
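As a sanity check, the `yoy_pct` column can be recomputed outside BigQuery. The sketch below mirrors the `LAG` plus `SAFE_DIVIDE` logic in plain Python, using the yearly revenue figures copied from the output above. The helpers `safe_divide` and `yoy_pct` are hypothetical names written for this illustration, not BigQuery or library functions.

```python
def safe_divide(numerator, denominator):
    """Mimic SAFE_DIVIDE: None if an input is None or the denominator is 0."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def yoy_pct(revenues):
    """Return the year-over-year % change for each entry (None for the first)."""
    out = []
    prev = None  # plays the role of LAG(yearly_revenue)
    for rev in revenues:
        delta = None if prev is None else rev - prev
        ratio = safe_divide(delta, prev)
        out.append(None if ratio is None else ratio * 100.0)
        prev = rev
    return out


# Yearly 'Phones' revenue for 2014-2017, copied from the E2 output.
PHONES = [77390.806, 68313.702, 78962.03, 105340.516]

if __name__ == "__main__":
    for year, pct in zip(range(2014, 2018), yoy_pct(PHONES)):
        print(year, pct)
```

The first year has no previous value, so its `yoy_pct` is empty by design rather than an error; the same guard also covers a previous year with zero revenue.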


### E3. 3‑month moving average (MA)
**Prompt:**
```
BigQuery SQL only.
Task: For the 'Corporate' segment, compute a 3-month moving average of monthly revenue.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Derive `month` via DATE_TRUNC(Order_Date, MONTH)
- SUM(Sales) per `month`
- Add `AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` as `ma_3`
Output: `month`, `monthly_revenue`, `ma_3`
Sort by `month` ASC
```
**Tip:** Ask the model to include a 1‑line cost control note (e.g., restrict date range while iterating).

In [39]:
query_string = """
-- Tip: Add LIMIT or restrict date range (e.g., WHERE order_date >= '2022-01-01') to control query costs
SELECT
  FORMAT_DATE('%Y-%m', month) AS month,
  monthly_revenue,
  AVG(monthly_revenue) OVER (
    ORDER BY month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS ma_3
FROM (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
  FROM
    `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
  WHERE
    segment = 'Corporate'
  GROUP BY
    month
)
ORDER BY
  month ASC
LIMIT 100;
"""

print("✅ Step 1: Defining the query string...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists, has data for the 'Corporate' segment, and covers a date range sufficient for a 3-month moving average.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 48 rows.

--- Displaying Results ---


| | month | monthly_revenue | ma_3 |
|---|---|---|---|
| 0 | 2014-01 | 1701.528 | 1701.528 |
| 1 | 2014-02 | 1183.668 | 1442.598 |
| 2 | 2014-03 | 11106.799 | 4663.998333 |
| 3 | 2014-04 | 14131.729 | 8807.398667 |
| 4 | 2014-05 | 9142.0 | 11460.176 |
| 5 | 2014-06 | 3970.914 | 9081.547667 |
| 6 | 2014-07 | 10032.988 | 7715.300667 |
| 7 | 2014-08 | 7451.774 | 7151.892 |
| 8 | 2014-09 | 15507.745 | 10997.502333 |
| 9 | 2014-10 | 12637.678 | 11865.732333 |
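The window frame `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` averages whatever rows are available, so the first two values of `ma_3` are averages over one and two months rather than three. A minimal Python sketch of the same trailing average, checked against the first four rows of the output above (the helper `trailing_average` is a hypothetical name for this illustration):

```python
def trailing_average(values, window=3):
    """Average each value with up to `window - 1` preceding values."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)   # frame is shorter at the start
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# First four monthly_revenue values copied from the E3 output.
MONTHLY = [1701.528, 1183.668, 11106.799, 14131.729]

if __name__ == "__main__":
    for value in trailing_average(MONTHLY):
        print(round(value, 6))
```

Because the frame counts rows, not calendar months, a month with no sales (and therefore no row) would make `ma_3` span more than three calendar months. If that matters for your analysis, ask the model to fill missing months before applying the window.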


## Part F — Debugging & Optimization Prompts
**Aim:** Use the model as a rubber duck for error handling and performance.

### F1. Explain the error, propose a fix
**Prompt:**
```
I ran this BigQuery SQL and got an error:
[PASTE ERROR MESSAGE and the exact SQL here]
Act as a BigQuery trouble‑shooter.
1) Identify the root cause.
2) Propose the smallest possible fix.
3) Suggest a quick sanity check query to verify the fix.
Return only the corrected SQL and a 2‑sentence rationale.
```
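If you run queries from a notebook cell, you can assemble this prompt from the caught exception instead of copying the error by hand. The sketch below is one possible way to do it; `F1_TEMPLATE` and `build_debug_prompt` are hypothetical names, and the template is an adaptation of the prompt above, not a required format.

```python
# Template adapted from the F1 prompt; placeholders are filled by str.format.
F1_TEMPLATE = """I ran this BigQuery SQL and got an error:
Error: {error}
SQL:
{sql}
Act as a BigQuery trouble-shooter.
1) Identify the root cause.
2) Propose the smallest possible fix.
3) Suggest a quick sanity check query to verify the fix.
Return only the corrected SQL and a 2-sentence rationale."""


def build_debug_prompt(sql, error):
    """Fill the F1 template with the failing SQL and the error text.

    `error` may be an exception object or a plain string.
    """
    return F1_TEMPLATE.format(error=str(error).strip(), sql=sql.strip())


if __name__ == "__main__":
    # Simulated failure; in the notebook you would pass the exception `e`
    # caught by the `except Exception as e:` block.
    broken_sql = "SELECT region, SUM(sales) FROM sales GROUP region"
    print(build_debug_prompt(broken_sql, ValueError("Syntax error near GROUP")))
```

Pasting the exact SQL together with the exact error text gives the model the same information BigQuery gave you, which tends to produce smaller, more targeted fixes.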

In [44]:
# Example of a corrected query for the 3-month moving average error (E3)
query_string = """
-- Tip: Add LIMIT or restrict date range (e.g., WHERE order_date >= '2022-01-01') to control query costs
SELECT
  FORMAT_DATE('%Y-%m', month) AS month,
  monthly_revenue,
  AVG(monthly_revenue) OVER (
    ORDER BY month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS ma_3
FROM (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
  FROM
    `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
  WHERE
    segment = 'Corporate'
  GROUP BY
    month
)
ORDER BY
  month ASC
LIMIT 100;
"""

print("✅ Step 1: Defining the query string...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists, has data for the 'Corporate' segment, and covers a date range sufficient for a 3-month moving average.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 48 rows.

--- Displaying Results ---


| | month | monthly_revenue | ma_3 |
|---|---|---|---|
| 0 | 2014-01 | 1701.528 | 1701.528 |
| 1 | 2014-02 | 1183.668 | 1442.598 |
| 2 | 2014-03 | 11106.799 | 4663.998333 |
| 3 | 2014-04 | 14131.729 | 8807.398667 |
| 4 | 2014-05 | 9142.0 | 11460.176 |
| 5 | 2014-06 | 3970.914 | 9081.547667 |
| 6 | 2014-07 | 10032.988 | 7715.300667 |
| 7 | 2014-08 | 7451.774 | 7151.892 |
| 8 | 2014-09 | 15507.745 | 10997.502333 |
| 9 | 2014-10 | 12637.678 | 11865.732333 |


**Corrected SQL Rationale:** The root cause was that the window function's `ORDER BY` referenced an expression (`DATE_TRUNC(order_date, MONTH)`) that was not a grouped column. The fix computes `DATE_TRUNC(order_date, MONTH) AS month` in a subquery and groups by that alias, so the outer window function's `ORDER BY month` references a valid, already-aggregated column.

### F2. Reduce cost / improve speed
**Prompt:**
```
Act as a BigQuery cost optimizer.
Given this query (below), list 3 ways to reduce scanned bytes and improve performance without changing the business logic.
[PASTE YOUR SQL HERE]
Prioritize: partition filters, column pruning, pre-aggregations, and temporary results via CTEs.
```

Here are 3 ways to reduce scanned bytes and improve the query's performance without changing the business logic:

1. **Partition filtering:** If `superstore_clean` is backed by a table partitioned by date (e.g., on `order_date`), BigQuery can skip partitions only when the `WHERE` clause filters on the partitioning column. The current filter (`segment = 'Corporate'`) does not do that, so every partition is read. Adding a date range (e.g., `WHERE order_date BETWEEN 'start_date' AND 'end_date'`) would let BigQuery prune the partitions outside that range.
2. **Column pruning:** BigQuery reads only the columns a query references. This query needs just three columns from the source (`order_date`, `sales`, and `segment`), so column pruning is already effective here.
3. **Pre-aggregation:** If this moving average is run frequently, you could create a pre-aggregated table or materialized view that stores the monthly revenue for the 'Corporate' segment. The moving average query would then run against this smaller, pre-computed table, reducing both the data scanned and the computation at query time. The query for the pre-aggregated table would look similar to the inner part of the current query (grouping by month and segment).
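To check whether a change like the ones above actually reduces scanned bytes, you can compare dry-run estimates before and after. The sketch below assumes the `google-cloud-bigquery` client library is installed and that `client` is the `bigquery.Client` already used in this notebook. The helper names `estimate_scanned_bytes` and `format_bytes` are hypothetical.

```python
def format_bytes(num_bytes):
    """Render a byte count in binary units, e.g. 1536 -> '1.50 KiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


def estimate_scanned_bytes(client, sql):
    """Dry-run `sql` and return the bytes BigQuery estimates it would process.

    A dry run validates the query and reports an estimate without executing it.
    """
    # Imported lazily so the formatting helper works without the library.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


if __name__ == "__main__":
    # Formatting only; calling estimate_scanned_bytes needs real credentials.
    print(format_bytes(1536))
    print(format_bytes(5 * 1024**3))
```

In the notebook you might call `format_bytes(estimate_scanned_bytes(client, query_string))` once for the original query and once for the optimized variant, then compare the two numbers.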


## Part G — Validation & Counter‑examples (DIVE: Validate)
**Aim:** Avoid “first‑answer fallacy” by testing alternatives.

### G1. Ask for counter‑queries
**Prompt:**
```
I concluded that 'Tables' is a high‑sales but negative‑profit sub-category due to high discounts.
Create two alternative BigQuery SQL queries that could falsify or nuance this finding:
- One that slices by region and time
- One that controls for order priority or ship mode
Return BigQuery SQL only, then a one-paragraph note on how to compare outcomes.
```

In [47]:
# Query 1: Slice by Region and Time
query_string = """
SELECT
  region,
  FORMAT_DATE('%Y-%m', DATE_TRUNC(order_date, MONTH)) AS year_month,
  SUM(sales) AS monthly_sales,
  SUM(profit) AS monthly_profit,
  AVG(discount) AS monthly_avg_discount
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
WHERE
  sub_category = 'Tables'
GROUP BY
  region,
  year_month
ORDER BY
  region,
  year_month;
"""
# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 140 rows.

--- Displaying Results ---


| | region | year_month | monthly_sales | monthly_profit | monthly_avg_discount |
|---|---|---|---|---|---|
| 0 | Central | 2014-03 | 2452.070 | -89.1204 | 0.200000 |
| 1 | Central | 2014-04 | 1145.690 | -227.6210 | 0.300000 |
| 2 | Central | 2014-05 | 355.455 | -184.8366 | 0.500000 |
| 3 | Central | 2014-06 | 368.853 | -228.3255 | 0.400000 |
| 4 | Central | 2014-08 | 489.230 | 41.9340 | 0.300000 |
| ... | ... | ... | ... | ... | ... |
| 135 | West | 2017-08 | 2334.782 | 286.5952 | 0.150000 |
| 136 | West | 2017-09 | 3621.616 | 46.9125 | 0.160000 |
| 137 | West | 2017-10 | 1279.049 | -109.2713 | 0.250000 |
| 138 | West | 2017-11 | 7368.180 | 691.8420 | 0.114286 |


In [48]:
# Query 2: Control for Ship Mode
query_string = """
SELECT
  ship_mode,
  SUM(sales) AS total_sales,
  SUM(profit) AS total_profit,
  AVG(discount) AS avg_discount
FROM
  `key-chalice-471013-a5.lab1_foundation_kn.superstore_clean`
WHERE
  sub_category = 'Tables'
GROUP BY
  ship_mode
ORDER BY
  total_sales DESC;
"""
# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 3: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Query finished. Found 4 rows.

--- Displaying Results ---


| | ship_mode | total_sales | total_profit | avg_discount |
|---|---|---|---|---|
| 0 | Standard Class | 124826.6615 | -11910.0122 | 0.270526 |
| 1 | Second Class | 43693.7475 | -3320.6799 | 0.248361 |
| 2 | First Class | 28800.776 | -1365.3665 | 0.240426 |
| 3 | Same Day | 9644.347 | -1129.4225 | 0.261905 |


**Comparing Outcomes:**

To compare the outcomes of these queries and evaluate your initial conclusion, examine the results from Query 1 to see if the high sales, negative profit, and high discount pattern for 'Tables' holds consistently across different regions and time periods. If you find variations (e.g., 'Tables' are profitable in certain regions or months), it would nuance your initial finding. For Query 2, analyze if the profitability of 'Tables' varies significantly depending on the `ship_mode`. If, for instance, 'Tables' shipped with 'Standard Class' are unprofitable but those with 'Same Day' are profitable, it suggests that shipping costs or related factors tied to ship mode might be influencing profitability, adding another layer to the analysis beyond just discounts.
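Raw totals are hard to compare across ship modes because the volumes differ so much, so a profit margin (profit divided by sales) is a more direct yardstick. The sketch below computes it from the Query 2 figures shown above; the helper `profit_margins` is a hypothetical name for this illustration.

```python
# (total_sales, total_profit) per ship mode, copied from the Query 2 output
# for the 'Tables' sub-category.
TABLES_BY_SHIP_MODE = {
    "Standard Class": (124826.6615, -11910.0122),
    "Second Class": (43693.7475, -3320.6799),
    "First Class": (28800.776, -1365.3665),
    "Same Day": (9644.347, -1129.4225),
}


def profit_margins(totals):
    """Return profit / sales for each ship mode."""
    return {mode: profit / sales for mode, (sales, profit) in totals.items()}


if __name__ == "__main__":
    margins = profit_margins(TABLES_BY_SHIP_MODE)
    # Print from worst to best margin.
    for mode, margin in sorted(margins.items(), key=lambda kv: kv[1]):
        print(f"{mode}: {margin:.1%}")
```

On these figures every ship mode has a negative margin (roughly -5% to -12%), so controlling for ship mode does not falsify the original finding for 'Tables'; it only shows that the size of the loss varies by mode.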

## Part H — Synthesis (DIVE: Extend)
**Aim:** Turn analysis into business‑ready insights.

### H1. Executive‑style summary
**Prompt:**
```
Act as a business strategist.
Based on the following metrics/figures (briefly summarize your results here), write a 4-sentence executive summary:
- 1 sentence: what changed and by how much
- 1 sentence: why it likely changed (drivers)
- 1 sentence: recommended action (who/what/when)
- 1 sentence: metric to monitor next
```

**Executive Summary**

Analysis of the sales data reveals significant variation in performance across shipping modes, years, and sub-categories. For 'Tables', the 'Standard Class' shipping mode generated the highest total sales (about 124,827) but also the largest loss (about -11,910), while 'Phones' revenue grew 33.4% from 2016 to 2017. Notably, 'Chairs' emerged as the top sub-category by total sales in both the Central and West regions. To capitalize on these insights, the sales team should focus on optimizing logistics for Standard Class shipments and explore strategies to replicate the success of 'Chairs' in the Central region across other product categories and regions.

### H2. Convert final SQL into an automated job (optional)
**Prompt (use only after your SQL is final):**
```
Convert my final BigQuery SQL into a Python script that can run as a scheduled job from Colab or Cloud Functions.
Requirements:
- Use python‑bigquery client
- Parameterize date range
- Write results to a destination table `[YOUR_PROJECT].analytics.outputs_kpi`
- Add basic error handling & logging
Return one complete runnable script.
```
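For orientation, here is a rough sketch of the shape such a script might take, so you can judge what the model returns. It assumes the `google-cloud-bigquery` library and an authenticated `bigquery.Client`. The query in `KPI_SQL`, the logger name, and the helpers `parse_date_range` and `run_kpi_job` are hypothetical placeholders; `[YOUR_PROJECT]` must be replaced before anything will run.

```python
import logging
from datetime import date

logger = logging.getLogger("kpi_job")

# Hypothetical KPI query; substitute your final SQL. @start_date and
# @end_date are BigQuery named query parameters.
KPI_SQL = """
SELECT Sub_Category, SUM(Sales) AS total_sales
FROM `[YOUR_PROJECT].superstore_data.sales`
WHERE Order_Date BETWEEN @start_date AND @end_date
GROUP BY Sub_Category
"""


def parse_date_range(start, end):
    """Parse ISO dates (YYYY-MM-DD) and reject an inverted range."""
    start_date, end_date = date.fromisoformat(start), date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError(f"start {start} is after end {end}")
    return start_date, end_date


def run_kpi_job(client, start, end,
                destination="[YOUR_PROJECT].analytics.outputs_kpi"):
    """Run KPI_SQL for the date range and overwrite the destination table."""
    # Imported lazily so date validation can be tested without the library.
    from google.cloud import bigquery

    start_date, end_date = parse_date_range(start, end)
    config = bigquery.QueryJobConfig(
        destination=destination,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        query_parameters=[
            bigquery.ScalarQueryParameter("start_date", "DATE", start_date),
            bigquery.ScalarQueryParameter("end_date", "DATE", end_date),
        ],
    )
    try:
        job = client.query(KPI_SQL, job_config=config)
        job.result()  # wait for completion; raises on failure
    except Exception:
        logger.exception("KPI job failed for %s to %s", start, end)
        raise
    logger.info("KPI job wrote results to %s", destination)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(parse_date_range("2017-01-01", "2017-12-31"))
```

Passing dates as query parameters rather than formatting them into the SQL string keeps the query text fixed and avoids quoting mistakes. `WRITE_TRUNCATE` overwrites the destination on every run, which is a design choice; an append-style job would use a different write disposition.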

---
## Submission checklist
- [ ] Kept prompts precise and reproducible  
- [ ] Captured at least **one** CTE query and **one** window function query  
- [ ] Documented **two** validation attempts (counter‑queries or alternate slice)  
- [ ] Wrote a 4‑sentence executive summary based on results  
- [ ] (Optional) Converted final query into a scheduled job
---