<a href="https://colab.research.google.com/github/SanjanaMohan34/mgmt467-analytics-portfolio/blob/Labs/Lab_VertexAI_BigQuery_PromptsOnly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab: Vertex AI–Assisted BigQuery Analytics — Example Prompts
**Goal:** Practice moving from simple SQL to complex analytics in BigQuery using *only* carefully engineered prompts with Vertex AI (Gemini).  
**Important:** This notebook contains **prompts only** (no starter code). Paste the prompts into **Vertex AI Studio**, **Vertex AI in Colab Enterprise**, or your chosen chat interface, and then run the generated SQL directly in **BigQuery**. If you decide to automate later, you can ask Vertex AI to convert the winning SQL into a Colab pipeline.

## How to use this prompts-only notebook
1. Open **Vertex AI Studio** (or Gemini in Colab Enterprise chat panel).  
2. Copy a prompt from this notebook and paste it into the model. Do **not** paste any code from here; let the model generate it.  
3. Run the generated SQL in **BigQuery** (Console → BigQuery Studio).  
4. Iterate: refine the prompt when results aren’t what you expect.  
5. Document: capture your final SQL, plus a one-sentence takeaway, in your notes/README.

## Dataset assumptions
Use one of these sources (adjust table paths accordingly):
- **Global Superstore (Kaggle)** loaded into BigQuery (e.g., `[YOUR_PROJECT].superstore_data.sales`)  
- **TheLook eCommerce** public dataset: `bigquery-public-data.thelook_ecommerce`  

If you are using *Global Superstore*, make sure column names match your schema (e.g., `Order_Date`, `Region`, `Category`, `Sub_Category`, `Sales`, `Profit`, `Discount`, `State`, `Customer_ID`, `Ship_Mode`).

---
## Prompting guardrails (quick checklist)
- **Be explicit**: table path, column names, filters, output columns, sort order, and limits.  
- **Ask for runnable SQL**: “Return a BigQuery SQL block only.”  
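- **Control cost**: ask for `LIMIT` during exploration and remove it for the final run. In BigQuery, `LIMIT` shrinks the result set but generally does not reduce the bytes scanned, so also ask the model to select only the columns you need.  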
- **Validate**: request a brief explanation of why each clause is present and how you can sanity-check results.
---
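
The checklist above can be turned into a small, reusable template. The following is a minimal sketch, not part of the original lab: the function name `build_sql_prompt` and its parameters are hypothetical, and the wording of each line is only one reasonable way to phrase the request.

```python
def build_sql_prompt(task, table, output_columns, filters=None, sort=None,
                     limit=None, explain=False):
    """Assemble an explicit, checklist-style prompt for a BigQuery SQL request."""
    lines = [
        "BigQuery SQL only. Return a single runnable SQL block.",
        f"Task: {task}",
        f"Table: `{table}`",
    ]
    if filters:
        lines.append("Filter: " + " AND ".join(filters))
    lines.append("Output columns: " + ", ".join(f"`{c}`" for c in output_columns))
    if sort:
        lines.append(f"Sort: {sort}")
    if limit is not None:
        lines.append(f"Add `LIMIT {int(limit)}` during exploration.")
    if explain:
        # Validation request from the checklist; leave off when you want SQL only.
        lines.append("After the SQL, explain each clause briefly and list two sanity checks.")
    return "\n".join(lines)


prompt = build_sql_prompt(
    task="List all unique sub_category values sold in the 'West' region.",
    table="mgmt467-project1.lab1_foundation.superstore_clean",
    output_columns=["sub_category"],
    filters=["region = 'West'"],
    sort="alphabetically A to Z",
    limit=100,
)
print(prompt)
```

Filling in every argument forces you to state the table path, filters, output columns, sort order, and limit before the model sees the request.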

## Install Dependencies

In [49]:
# Install the Google Cloud BigQuery client library
!pip install google-cloud-bigquery==3.17.0 pandas==2.1.4

# Authenticate your Colab environment
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


## Copy Schema to a dataframe

In [50]:
from google.cloud import bigquery
import pandas as pd

# Replace with your Google Cloud Project ID
project_id = 'mgmt467-project1' # This is derived from your provided table name
dataset_id = 'lab1_foundation'
table_id = 'SuperStore'

# Construct a BigQuery client object.
client = bigquery.Client(project=project_id)

# Get the table object by its fully qualified ID
# (client.dataset() is deprecated in recent google-cloud-bigquery releases)
table = client.get_table(f"{project_id}.{dataset_id}.{table_id}")

# Extract schema information
schema_list = []
for field in table.schema:
    schema_list.append({
        'name': field.name,
        'field_type': field.field_type,
        'mode': field.mode,
        'description': field.description
    })

# Convert to Pandas DataFrame
schema_df = pd.DataFrame(schema_list)

# Confirm that the schema DataFrame was created (optional, for verification)
print("Schema DataFrame created:")
# To inspect the schema itself, display `schema_df` in a separate cell.


Schema DataFrame created:


## Clean Column Names

In [51]:
# --- 1. Clean the Column Names ---
# Create a 'clean_name' column with standard naming conventions:
# lowercase, with spaces and hyphens replaced by underscores.
schema_df['clean_name'] = schema_df['name'].str.lower().str.replace(' ', '_').str.replace('-', '_')


# --- 2. Generate the Aliases for the SELECT Clause ---
column_expressions = []
for index, row in schema_df.iterrows():
    original_name = row['name']
    clean_name = row['clean_name']

    # If the original name contains a space or special character, it needs to be
    # enclosed in backticks (`) in the SQL statement.
    if ' ' in original_name or '-' in original_name:
        expression = f'`{original_name}` AS {clean_name}'
    else:
        # If the name is already clean, we still alias it for consistency.
        expression = f'{original_name} AS {clean_name}'
    column_expressions.append(expression)

# Join all the individual column expressions into a single, formatted string.
select_clause = ",\n  ".join(column_expressions)


# --- 3. Construct the Final CREATE VIEW Statement ---
new_view_id = 'superstore_clean' # You can change this if you like

create_view_sql = f"""
CREATE OR REPLACE VIEW `{project_id}.{dataset_id}.{new_view_id}` AS
SELECT
  {select_clause}
FROM
  `{project_id}.{dataset_id}.{table_id}`;
"""

# --- 4. Print the Final SQL ---
print("--- Copy the SQL below and run it in your BigQuery Console ---")
print(create_view_sql)

--- Copy the SQL below and run it in your BigQuery Console ---

CREATE OR REPLACE VIEW `mgmt467-project1.lab1_foundation.superstore_clean` AS
SELECT
  `Row ID` AS row_id,
  `Order ID` AS order_id,
  `Order Date` AS order_date,
  `Ship Date` AS ship_date,
  `Ship Mode` AS ship_mode,
  `Customer ID` AS customer_id,
  `Customer Name` AS customer_name,
  Segment AS segment,
  Country AS country,
  City AS city,
  State AS state,
  `Postal Code` AS postal_code,
  Region AS region,
  `Product ID` AS product_id,
  Category AS category,
  `Sub-Category` AS sub_category,
  `Product Name` AS product_name,
  Sales AS sales,
  Quantity AS quantity,
  Discount AS discount,
  Profit AS profit
FROM
  `mgmt467-project1.lab1_foundation.SuperStore`;
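
The cleaning step above only handles spaces and hyphens. As a hedged sketch of a more general rule, the helper below collapses any run of non-alphanumeric characters into a single underscore and always backticks the original name. The function names and the sample column `Sales ($)` are hypothetical and are not taken from the SuperStore schema.

```python
import re


def clean_column_name(name):
    """Lowercase a column name and collapse non-alphanumeric runs into '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()
    # Unquoted BigQuery identifiers must start with a letter or underscore.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def select_expression(name):
    """Backtick the original name unconditionally so special characters are safe."""
    return f"`{name}` AS {clean_column_name(name)}"


for original in ["Order ID", "Sub-Category", "Profit", "Sales ($)"]:
    print(select_expression(original))
```

Quoting every original name removes the need for the `if ' ' in original_name` branch, at the cost of slightly noisier SQL.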



## Generate View with standard column naming convention

In [52]:
# Execute the CREATE VIEW SQL query
try:
    query_job = client.query(create_view_sql)  # API request
    query_job.result()  # Waits for the query to finish
    print(f"View '{new_view_id}' created/replaced successfully in dataset '{dataset_id}'.")
except Exception as e:
    print(f"An error occurred while creating the view: {e}")

# Now, let's print 10 rows from the newly created view to verify
print(f"\n--- First 10 rows from the new view '{new_view_id}' ---")
try:
    # Construct a reference to the new view and query it
    query_view_sql = f"""
    SELECT *
    FROM `{project_id}.{dataset_id}.{new_view_id}`
    LIMIT 10;
    """
    query_job_view = client.query(query_view_sql)
    rows = query_job_view.result()

    # Print header
    print(" | ".join([field.name for field in rows.schema]))
    print("-" * 80) # Separator

    # Print rows
    for row in rows:
        print(" | ".join([str(item) for item in row.values()]))

except Exception as e:
    print(f"An error occurred while fetching rows from the view: {e}")

View 'superstore_clean' created/replaced successfully in dataset 'lab1_foundation'.

--- First 10 rows from the new view 'superstore_clean' ---
row_id | order_id | order_date | ship_date | ship_mode | customer_id | customer_name | segment | country | city | state | postal_code | region | product_id | category | sub_category | product_name | sales | quantity | discount | profit
--------------------------------------------------------------------------------
5769 | CA-2015-154900 | 2015-02-25 | 2015-03-01 | Standard Class | SS-20875 | Sung Shariari | Consumer | United States | Leominster | Massachusetts | 1453 | East | OFF-LA-10001641 | Office Supplies | Labels | Avery 518 | 3.15 | 1 | 0.0 | 1.512
5770 | CA-2015-154900 | 2015-02-25 | 2015-03-01 | Standard Class | SS-20875 | Sung Shariari | Consumer | United States | Leominster | Massachusetts | 1453 | East | OFF-PA-10002377 | Office Supplies | Paper | Adams Telephone Message Book W/Dividers/Space For Phone Numbers, 5 1/4"X8 1/2", 200/Mes

In [53]:
# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.

print("✅ Step 1: Defining the query string...")

query_string = """
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `mgmt467-project1.lab1_foundation.superstore_clean`
LIMIT 10;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,order_id,customer_name,product_name,sales,profit
0,CA-2015-154900,Sung Shariari,Avery 518,3.15,1.512
1,CA-2015-154900,Sung Shariari,Adams Telephone Message Book W/Dividers/Space ...,22.72,10.224
2,US-2016-152415,Patrick O'Donnell,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,6.2244
3,US-2016-152415,Patrick O'Donnell,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,61.3824
4,CA-2016-153269,Pamela Stobb,"Personal Folder Holder, Ebony",11.21,3.363
5,CA-2016-153269,Pamela Stobb,"Situations Contoured Folding Chairs, 4/Set",354.9,88.725
6,CA-2016-153269,Pamela Stobb,Xerox 193,17.94,8.7906
7,CA-2016-153269,Pamela Stobb,GBC Binding covers,51.8,23.31
8,CA-2015-158792,Brian Dahlen,Staples,22.2,10.434
9,CA-2016-141082,Fred McMath,Avery 517,3.69,1.7343
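
The same try/except pattern recurs in every query cell of this notebook. A possible refactoring, offered only as a sketch, is a single helper. The name `run_query` and its `label` parameter are hypothetical; the helper assumes `client` behaves like `google.cloud.bigquery.Client`, that is, `client.query(sql)` returns a job with a `to_dataframe()` method.

```python
def run_query(client, sql, label="query"):
    """Run `sql` with `client`; return the results, or None if the query fails."""
    try:
        results_df = client.query(sql).to_dataframe()
    except Exception as e:  # broad on purpose, mirroring the notebook cells
        print(f"❌ {label} failed: {e}")
        return None
    print(f"✅ {label} finished. Found {len(results_df)} rows.")
    if len(results_df) == 0:
        print("⚠️ The query ran but returned no rows; check the view, data, and filters.")
    return results_df
```

A cell would then reduce to `results_df = run_query(client, query_string, label="A1")` followed by `display(results_df)`.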


## Part A — SQL Warm‑Up (SELECT, WHERE, ORDER BY, LIMIT, DISTINCT)
**Aim:** Build confidence with precise, unambiguous prompts that yield clean, runnable SQL.

### A1. Unique values (DISTINCT)
**Prompt (paste in Vertex AI):**
```
Act as a senior BigQuery analyst. Produce a **single runnable BigQuery SQL** (no commentary) for:
- Task: List all unique `Sub_Category` values sold in the 'West' region.
- Table: `mgmt467-project1.lab1_foundation.superstore_clean`
- Filter: `Region = 'West'`
- Output: a single column named `Sub_Category`
- Sort: alphabetically A→Z
- Add: `LIMIT 100` to control cost during exploration.
```
**Reflection:** Did the result match your expectations? If not, what ambiguity in your prompt might have caused the mismatch?

The result matched my expectations. The prompt defined everything from the context to the exact output required. The only mismatch was in the column names: the prompt used `Sub_Category` and `Region`, but the AI figured out that the view's column names are all lowercase. Beyond that, stating the goal explicitly might make the output more structured, but it is already very good as is.

In [54]:
query_string = """
SELECT
    DISTINCT sub_category AS Sub_Category
FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
WHERE
    region = 'West'
ORDER BY
    Sub_Category ASC
LIMIT 100
"""

# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.
print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and has data in the 'West' region.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 17 rows.

--- Displaying Results ---


Unnamed: 0,Sub_Category
0,Accessories
1,Appliances
2,Art
3,Binders
4,Bookcases
5,Chairs
6,Copiers
7,Envelopes
8,Fasteners
9,Furnishings


### A2. Top‑N by metric (ORDER BY … DESC)
**Prompt:**
```
BigQuery SQL only.
Task: Return the top 10 customers by total profit.
Table: `mgmt-467-47888.lab_foundation.superstore`
Columns used: `Customer_ID`, `Profit`
Output columns: `Customer_ID`, `total_profit`
Logic: SUM Profit per customer, order by `total_profit` DESC
Add `LIMIT 10`.
```
**Tip:** If your schema uses different identifiers (e.g., `Customer Name`), restate column names explicitly.

**Reflection:** Did the result match your expectations? If not, what ambiguity in your prompt might have caused the mismatch?

The result matched my expectations. The output was clear and precise, and the AI built on the previous example so that this code matches the earlier code, adding the print statements and the alternative print statement in case of an error. It used the customer ID column as asked, created a new column called `total_profit`, and produced the output defined in the prompt.
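
Note that the cell recorded for this exercise runs the sub-category `HAVING` query from B2 rather than a customer ranking. A minimal sketch of SQL matching the A2 prompt, assuming the `superstore_clean` view and its `customer_id` and `profit` columns, might look like the following. The function name `top_customers_by_profit_sql` is hypothetical, and no results are shown because this query was not run in the notebook.

```python
TABLE = "mgmt467-project1.lab1_foundation.superstore_clean"


def top_customers_by_profit_sql(table=TABLE, n=10):
    """Build the A2 query: total profit per customer, highest first."""
    n = int(n)  # reject non-numeric limits before interpolating them
    return f"""
SELECT
    customer_id,
    SUM(profit) AS total_profit
FROM
    `{table}`
GROUP BY
    customer_id
ORDER BY
    total_profit DESC
LIMIT {n}
"""


query_string = top_customers_by_profit_sql()
print(query_string)
```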

In [55]:
query_string = """
SELECT
    sub_category AS Sub_Category,
    SUM(profit) AS total_profit
FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
GROUP BY
    sub_category
HAVING
    SUM(profit) < 0
ORDER BY
    total_profit ASC;
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view and data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 3 rows.

--- Displaying Results ---


Unnamed: 0,Sub_Category,total_profit
0,Tables,-17725.4811
1,Bookcases,-3472.556
2,Supplies,-1189.0995


### A3. Basic filtering (WHERE) + sanity checks
**Prompt:**
```
BigQuery SQL only.
Task: Count orders shipped with each `Ship_Mode`, but only for orders in the 'Technology' category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Ship_Mode`, `order_count`
Logic: COUNT(*) grouped by `Ship_Mode`
Sort by `order_count` DESC
```
**Validation ask:** “Also list two quick sanity checks to verify the numbers.”

In [56]:
query_string = """
SELECT
    ship_mode,
    COUNT(*) AS order_count
FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
WHERE
    category = 'Technology'
GROUP BY
    ship_mode
ORDER BY
    order_count DESC;
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view, data, and filter condition.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying Results ---


Unnamed: 0,ship_mode,order_count
0,Standard Class,1082
1,Second Class,366
2,First Class,301
3,Same Day,98


**Sanity check #1:** Adding up the order counts for all four ship modes gives 1,847. The query below counts every row in the Technology category, and its result is also 1,847, which matches the grouped output above.

In [57]:
query_string = """
SELECT
    COUNT(*) AS total_technology_orders
FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
WHERE
    category = 'Technology';
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view, data, and filter condition.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 1 rows.

--- Displaying Results ---


Unnamed: 0,total_technology_orders
0,1847
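
The addition behind sanity check #1 can be reproduced in a few lines. This sketch simply copies the four counts from the grouped A3 result; the variable names are illustrative.

```python
# Counts per ship mode, copied from the grouped A3 result.
ship_mode_counts = {
    "Standard Class": 1082,
    "Second Class": 366,
    "First Class": 301,
    "Same Day": 98,
}

grouped_total = sum(ship_mode_counts.values())
print(grouped_total)  # 1847

# The grouped rows should add up to the ungrouped COUNT(*) for Technology.
assert grouped_total == 1847
```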


**Sanity Check #2:** The second sanity check lists every distinct ship mode in the view to confirm that none is missing from the grouped Technology counts. As shown below, there are four ship modes (Standard Class, Second Class, First Class, and Same Day), which matches the original output.

In [58]:
query_string = """
SELECT DISTINCT ship_mode
FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`;
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view and data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying Results ---


Unnamed: 0,ship_mode
0,Standard Class
1,First Class
2,Second Class
3,Same Day


## Part B — Grouped Analytics (GROUP BY, HAVING)
**Aim:** Turn raw facts into grouped metrics and filtered aggregations.

### B1. KPI aggregation with WHERE + GROUP BY
**Prompt:**
```
BigQuery SQL only.
Task: Compute monthly revenue for the last 12 full months.
Table: `[YOUR_PROJECT].superstore_data.sales`
Assume: `Order_Date` is a DATE or TIMESTAMP column named exactly `Order_Date`.
Output: `year_month` (YYYY-MM format), `monthly_revenue`
Logic: Truncate date to month, SUM `Sales`, filter to last 12 full months.
Sort by `year_month` ascending.
Include a `LIMIT` safeguard for exploration.
```

**Prompt Explanation:** After I entered the prompt, the generated query initially returned no rows because it filtered relative to the current date (today's date), which falls well after the dataset's last order. I then added a stipulation to look up the latest date in the dataset and compute monthly revenue for the 12 full months preceding it.
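
To make the "last 12 full months" window concrete, here is a small sketch of the same date arithmetic in plain Python. The function name `last_12_full_months` is hypothetical; the rule it encodes (exclude the month containing the latest date, then go back one year) is the one the generated SQL applies with `DATE_TRUNC` and `DATE_SUB`.

```python
from datetime import date


def last_12_full_months(latest):
    """Return (start, end): start is inclusive, end is exclusive.

    The month containing `latest` is treated as incomplete and excluded.
    """
    end = latest.replace(day=1)             # first day of the latest month
    start = end.replace(year=end.year - 1)  # same month, one year earlier
    return start, end


start, end = last_12_full_months(date(2017, 12, 30))
print(start, end)  # 2016-12-01 2017-12-01
```

With a latest order date of 2017-12-30, the window covers December 2016 through November 2017, which is why the result has 12 rows starting at `2016-12`.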

In [59]:
# First, find the maximum order_date in the dataset
query_max_date = """
SELECT MAX(order_date) AS max_date
FROM `mgmt467-project1.lab1_foundation.superstore_clean`;
"""

print("✅ Step 1: Finding the maximum order date...")
try:
    query_job_max_date = client.query(query_max_date)
    max_date_result = query_job_max_date.result()
    # Extract the max date (assuming only one row and one column)
    for row in max_date_result:
        latest_date_in_data = row.max_date
    print(f"✅ Step 2: Found latest date in data: {latest_date_in_data}")

except Exception as e:
    print(f"\n❌ An error occurred while finding the maximum date: {e}")
    latest_date_in_data = None # Set to None to prevent errors in the next step

if latest_date_in_data:
    # Now, use the latest_date_in_data to filter for the last 12 full months
    # We need to calculate the start date for the last 12 full months based on the latest date.
    # This means finding the first day of the month 12 months prior to the month of the latest date.
    query_string = f"""
    SELECT
        FORMAT_DATE('%Y-%m', DATE_TRUNC(order_date, MONTH)) AS year_month,
        SUM(sales) AS monthly_revenue
    FROM
        `mgmt467-project1.lab1_foundation.superstore_clean`
    WHERE
        order_date >= DATE_SUB(DATE_TRUNC('{latest_date_in_data}', MONTH), INTERVAL 12 MONTH)
        AND order_date < DATE_TRUNC('{latest_date_in_data}', MONTH)
    GROUP BY
        year_month
    ORDER BY
        year_month ASC
    LIMIT 100;
    """

    print("✅ Step 3: Defining the query string for the last 12 months...")
    print("✅ Step 4: Sending the query to BigQuery. This may take a moment...")

    # Use a try-except block to catch potential errors
    try:
        query_job = client.query(query_string)

        print("✅ Step 5: Waiting for query to complete and fetching results...")
        results_df = query_job.to_dataframe()

        print(f"✅ Step 6: Query finished. Found {len(results_df)} rows.")

        if results_df.empty:
            print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view, data, and date filters based on the latest data date.")
        else:
            print("\n--- Displaying Results ---")
            display(results_df)

    except Exception as e:
        print(f"\n❌ An error occurred: {e}")
else:
    print("\nSkipping monthly revenue calculation due to error in finding the latest date.")

✅ Step 1: Finding the maximum order date...
✅ Step 2: Found latest date in data: 2017-12-30
✅ Step 3: Defining the query string for the last 12 months...
✅ Step 4: Sending the query to BigQuery. This may take a moment...
✅ Step 5: Waiting for query to complete and fetching results...
✅ Step 6: Query finished. Found 12 rows.

--- Displaying Results ---


Unnamed: 0,year_month,monthly_revenue
0,2016-12,96999.043
1,2017-01,43971.374
2,2017-02,20301.1334
3,2017-03,58872.3528
4,2017-04,36521.5361
5,2017-05,44261.1102
6,2017-06,52981.7257
7,2017-07,45264.416
8,2017-08,63120.888
9,2017-09,87866.652


### B2. Post‑aggregation filter (HAVING)
**Prompt:**
```
BigQuery SQL only.
Task: Find sub-categories whose total profit over the entire dataset is negative.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Sub_Category`, `total_profit`
Logic: SUM `Profit` GROUP BY `Sub_Category`, HAVING SUM(Profit) < 0
Sort by `total_profit` ASC (most negative first).
```
**Why HAVING?** Ask the model to include a 1-sentence explanation of why HAVING is used instead of WHERE here.

HAVING is used here because you are filtering based on an aggregated value (SUM(Profit)), whereas WHERE is used to filter individual rows before aggregation.
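
The difference is easy to see on a toy table. The sketch below uses Python's built-in `sqlite3` module rather than BigQuery, and the four rows are made-up illustration data, not Superstore figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sub_category TEXT, profit REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Tables", -50.0), ("Tables", 20.0), ("Paper", 10.0), ("Paper", -2.0)],
)

# WHERE filters individual rows before grouping, so any sub-category
# with at least one loss-making row still appears.
where_rows = conn.execute(
    "SELECT sub_category, SUM(profit) FROM sales "
    "WHERE profit < 0 GROUP BY sub_category ORDER BY sub_category"
).fetchall()

# HAVING filters groups after aggregation, so only sub-categories
# whose total is negative remain.
having_rows = conn.execute(
    "SELECT sub_category, SUM(profit) FROM sales "
    "GROUP BY sub_category HAVING SUM(profit) < 0 ORDER BY sub_category"
).fetchall()

print(where_rows)   # [('Paper', -2.0), ('Tables', -50.0)]
print(having_rows)  # [('Tables', -30.0)]
conn.close()
```

Paper is profitable overall, so only the `HAVING` version correctly excludes it.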


In [60]:
query_string = """
SELECT
    sub_category AS Sub_Category,
    SUM(profit) AS total_profit
FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
GROUP BY
    sub_category
HAVING
    SUM(profit) < 0
ORDER BY
    total_profit ASC;
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view and data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 3 rows.

--- Displaying Results ---


Unnamed: 0,Sub_Category,total_profit
0,Tables,-17725.4811
1,Bookcases,-3472.556
2,Supplies,-1189.0995


## Part C — Joins (dimension enrichment)
**Aim:** Use joins to enhance facts with attributes.

### C1. Join facts to a small dimension
*(If you have a customer or product dimension in your schema, use it. Otherwise, request a synthetic example.)*  
**Prompt:**
```
BigQuery SQL only.
Task: Join the sales table to a product dimension to report `Product_ID`, `Product_Name`, and total sales.
Tables: `[YOUR_PROJECT].superstore_data.sales` as s, `[YOUR_PROJECT].superstore_data.products` as p
Join key: `s.Product_ID = p.Product_ID`
Output: `Product_ID`, `Product_Name`, `total_sales`
Sort by `total_sales` DESC
```
**If you lack a dimension table:** Ask the model how to simulate one temporarily via a CTE.

In [61]:
query_string = """
WITH ProductDimension AS (
  SELECT DISTINCT
    product_id,
    product_name
  FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
),

ProductSales AS (
  SELECT
    product_id,
    SUM(sales) AS total_sales
  FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
  GROUP BY
    product_id
)

SELECT
  pd.product_id,
  pd.product_name,
  ps.total_sales
FROM
  ProductDimension pd
JOIN
  ProductSales ps ON pd.product_id = ps.product_id
ORDER BY
  ps.total_sales DESC
LIMIT 100; -- Added LIMIT for exploration
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view and data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 100 rows.

--- Displaying Results ---


Unnamed: 0,product_id,product_name,total_sales
0,TEC-CO-10004722,Canon imageCLASS 2200 Advanced Copier,61599.824
1,OFF-BI-10003527,Fellowes PB500 Electric Punch Plastic Comb Bin...,27453.384
2,TEC-MA-10002412,Cisco TelePresence System EX90 Videoconferenci...,22638.480
3,FUR-CH-10002024,HON 5400 Series Task Chairs for Big and Tall,21870.576
4,OFF-BI-10001359,GBC DocuBind TL300 Electric Binding System,19823.479
...,...,...,...
95,OFF-ST-10001526,Iceberg Mobile Mega Data/Printer Cart,5751.774
96,FUR-CH-10000595,Safco Contoured Stacking Chairs,5697.760
97,TEC-CO-10001766,Canon PC940 Copier,5669.874
98,FUR-TA-10004256,Bretford “Just In Time” Height-Adjustable Mult...,5634.900
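
One thing worth checking with a `SELECT DISTINCT product_id, product_name` dimension: if any `product_id` maps to more than one `product_name`, the join returns that product once per name. Whether this happens in your data is an assumption to verify, not something shown above. The sketch below demonstrates the effect and one possible fix on made-up rows, using `sqlite3` instead of BigQuery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id TEXT, product_name TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("P1", "Stapler", 100.0),
        ("P1", "Stapler, Black", 50.0),  # same ID, different name
        ("P2", "Pen", 30.0),
    ],
)

JOIN_SQL = """
WITH product_dimension AS ({dimension}),
product_sales AS (
  SELECT product_id, SUM(sales) AS total_sales FROM sales GROUP BY product_id
)
SELECT pd.product_id, pd.product_name, ps.total_sales
FROM product_dimension pd
JOIN product_sales ps ON pd.product_id = ps.product_id
ORDER BY ps.total_sales DESC, pd.product_name
"""

# DISTINCT keeps one row per (id, name) pair, so P1 appears twice.
distinct_dim = "SELECT DISTINCT product_id, product_name FROM sales"
# Grouping by id and picking one name guarantees one row per product.
one_row_dim = (
    "SELECT product_id, MIN(product_name) AS product_name "
    "FROM sales GROUP BY product_id"
)

fan_out = conn.execute(JOIN_SQL.format(dimension=distinct_dim)).fetchall()
deduped = conn.execute(JOIN_SQL.format(dimension=one_row_dim)).fetchall()

print(len(fan_out), len(deduped))  # 3 2
conn.close()
```

In BigQuery the same idea can be written with `ANY_VALUE(product_name)` or `MIN(product_name)` grouped by `product_id`.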


## Part D — Common Table Expressions (CTEs)
**Aim:** Make complex logic readable and testable in steps.

### D1. Multi‑step ranking with CTEs
**Prompt:**
```
BigQuery SQL only.
Goal: Within each `Region`, rank states by total sales and return top 3 per region.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE 1 (`state_sales`): SUM(Sales) by `Region`, `State`
CTE 2 (`ranked_state_sales`): Add `RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as `sales_rank`
Final SELECT: rows where `sales_rank <= 3`
Output columns: `Region`, `State`, `total_sales`, `sales_rank`
Sort: by `Region`, then `sales_rank`
```
**Ask for**: a one-paragraph explanation of each step, then **provide only the final runnable SQL**.
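
Because the prompt specifies `RANK()`, tied states share a rank, so "top 3" can return more than three rows per region. The sketch below shows this on made-up numbers using `sqlite3` (window functions require SQLite 3.25 or later); the single-letter state names are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state_sales (region TEXT, state TEXT, total_sales REAL)")
conn.executemany(
    "INSERT INTO state_sales VALUES (?, ?, ?)",
    [
        ("West", "A", 500.0),
        ("West", "B", 300.0),
        ("West", "C", 200.0),
        ("West", "D", 200.0),  # ties with C
        ("West", "E", 100.0),
    ],
)

top3 = conn.execute(
    """
    WITH ranked_state_sales AS (
      SELECT
        state,
        total_sales,
        RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) AS sales_rank
      FROM state_sales
    )
    SELECT state, total_sales, sales_rank
    FROM ranked_state_sales
    WHERE sales_rank <= 3
    ORDER BY sales_rank, state
    """
).fetchall()

print(top3)
# [('A', 500.0, 1), ('B', 300.0, 2), ('C', 200.0, 3), ('D', 200.0, 3)]
conn.close()
```

If exactly three rows per region are required, `ROW_NUMBER()` with a tie-breaking column in the `ORDER BY` is the usual alternative.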

In [62]:
query_string = """
WITH state_sales AS (
  SELECT
    region,
    state,
    SUM(sales) AS total_sales
  FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
  GROUP BY
    region,
    state
),

ranked_state_sales AS (
  SELECT
    region,
    state,
    total_sales,
    RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) AS sales_rank
  FROM
    state_sales
)

SELECT
  region,
  state,
  total_sales,
  sales_rank
FROM
  ranked_state_sales
WHERE
  sales_rank <= 3
ORDER BY
  region,
  sales_rank;
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view and data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 12 rows.

--- Displaying Results ---


Unnamed: 0,region,state,total_sales,sales_rank
0,Central,Texas,170188.0458,1
1,Central,Illinois,80166.101,2
2,Central,Michigan,76269.614,3
3,East,New York,310876.271,1
4,East,Pennsylvania,116511.914,2
5,East,Ohio,78258.136,3
6,South,Florida,89473.708,1
7,South,Virginia,70636.72,2
8,South,North Carolina,55603.164,3
9,West,California,457687.6315,1


### D2. Time‑boxed “most improved” analysis
**Prompt:**
```
BigQuery SQL only.
Goal: Identify the top 5 sub-categories with the largest YoY revenue increase from 2023 to 2024.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `yr_sales`: SUM(Sales) by `Sub_Category` and `year` extracted from `Order_Date`
Final: pivot or self-join to compute delta (2024 minus 2023) as `yoy_delta`
Output: `Sub_Category`, `sales_2023`, `sales_2024`, `yoy_delta`
Order by `yoy_delta` DESC
Limit 5
```
**Validation:** Ask the model for two quick failure modes (e.g., missing years) and how to handle them.
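
One way to compute `yoy_delta` from the `yr_sales` CTE is conditional aggregation, sketched below. This is an illustrative builder, not the SQL produced in the lab: the function name `yoy_delta_sql` is hypothetical, and the years 2016 and 2017 are an assumption based on the latest order date (2017-12-30) found in B1, since the dataset does not contain 2023 or 2024.

```python
TABLE = "mgmt467-project1.lab1_foundation.superstore_clean"


def yoy_delta_sql(year_1, year_2, table=TABLE, top_n=5):
    """Build a query ranking sub-categories by revenue change from year_1 to year_2."""
    # Cast to int so only numbers are ever interpolated into the SQL.
    year_1, year_2, top_n = int(year_1), int(year_2), int(top_n)
    return f"""
WITH yr_sales AS (
  SELECT
    sub_category,
    EXTRACT(YEAR FROM order_date) AS year,
    SUM(sales) AS yearly_revenue
  FROM `{table}`
  WHERE EXTRACT(YEAR FROM order_date) IN ({year_1}, {year_2})
  GROUP BY sub_category, year
)
SELECT
  sub_category,
  SUM(IF(year = {year_1}, yearly_revenue, 0)) AS sales_{year_1},
  SUM(IF(year = {year_2}, yearly_revenue, 0)) AS sales_{year_2},
  SUM(IF(year = {year_2}, yearly_revenue, 0))
    - SUM(IF(year = {year_1}, yearly_revenue, 0)) AS yoy_delta
FROM yr_sales
GROUP BY sub_category
ORDER BY yoy_delta DESC
LIMIT {top_n}
"""


query_string = yoy_delta_sql(2016, 2017)
print(query_string)
```

With this shape, a sub-category missing from one of the two years shows 0 for that year instead of dropping out, which addresses the "missing years" failure mode but can make a new sub-category look like a large increase.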

In [63]:
# First, find the maximum year in the dataset
query_max_year = """
SELECT MAX(EXTRACT(YEAR FROM order_date)) AS max_year
FROM `mgmt467-project1.lab1_foundation.superstore_clean`;
"""

print("✅ Step 1: Finding the maximum year in the dataset...")
try:
    query_job_max_year = client.query(query_max_year)
    max_year_result = query_job_max_year.result()
    # Extract the max year
    for row in max_year_result:
        latest_year_in_data = row.max_year
    print(f"✅ Step 2: Found latest year in data: {latest_year_in_data}")

except Exception as e:
    print(f"\n❌ An error occurred while finding the maximum year: {e}")
    latest_year_in_data = None # Set to None to prevent errors in the next step

if latest_year_in_data:
    year_1 = latest_year_in_data - 1
    year_2 = latest_year_in_data

    # Now, use the last two years found to aggregate sales
    query_string = f"""
    WITH yr_sales AS (
      SELECT
        sub_category,
        EXTRACT(YEAR FROM order_date) AS year,
        SUM(sales) AS yearly_revenue
      FROM
        `mgmt467-project1.lab1_foundation.superstore_clean`
      WHERE EXTRACT(YEAR FROM order_date) IN ({year_1}, {year_2})
      GROUP BY
        sub_category,
        year
    )
    -- This CTE is the first step. You would typically build upon this
    -- with further CTEs or joins to calculate the YoY delta.
    -- For now, we'll select from this CTE to show the intermediate result.
    SELECT * FROM yr_sales
    ORDER BY sub_category, year;
    """

    print(f"✅ Step 3: Defining the query string (aggregating yearly sales for {year_1} and {year_2})...")
    print("✅ Step 4: Sending the query to BigQuery. This may take a moment...")

    # Use a try-except block to catch potential errors
    try:
        query_job = client.query(query_string)

        print("✅ Step 5: Waiting for query to complete and fetching results...")
        results_df = query_job.to_dataframe()

        print(f"✅ Step 6: Query finished. Found {len(results_df)} rows.")

        if results_df.empty:
            print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view and data for the specified years.")
        else:
            print("\n--- Displaying Results ---")
            display(results_df)

    except Exception as e:
        print(f"\n❌ An error occurred: {e}")
else:
    print("\nSkipping yearly revenue calculation due to error in finding the latest year.")

✅ Step 1: Finding the maximum year in the dataset...
✅ Step 2: Found latest year in data: 2017
✅ Step 3: Defining the query string (aggregating yearly sales for 2016 and 2017)...
✅ Step 4: Sending the query to BigQuery. This may take a moment...
✅ Step 5: Waiting for query to complete and fetching results...
✅ Step 6: Query finished. Found 34 rows.

--- Displaying Results ---


| | sub_category | year | yearly_revenue |
|---|---|---|---|
| 0 | Accessories | 2016 | 41895.854 |
| 1 | Accessories | 2017 | 59946.232 |
| 2 | Appliances | 2016 | 26050.315 |
| 3 | Appliances | 2017 | 42926.932 |
| 4 | Art | 2016 | 5960.908 |
| 5 | Art | 2017 | 8863.068 |
| 6 | Binders | 2016 | 49683.325 |
| 7 | Binders | 2017 | 72788.045 |
| 8 | Bookcases | 2016 | 26275.4665 |
| 9 | Bookcases | 2017 | 30024.2797 |


In [64]:
query_string = """
WITH yr_sales AS (
  SELECT
    sub_category,
    EXTRACT(YEAR FROM order_date) AS year,
    SUM(sales) AS yearly_revenue
  FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
  GROUP BY
    sub_category,
    year
),

-- Step 2: Pivot or Self-Join for YoY Comparison
-- Using conditional aggregation (a form of pivoting)
yoy_comparison AS (
  SELECT
    sub_category,
    SUM(CASE WHEN year = (SELECT MAX(year) FROM yr_sales) - 1 THEN yearly_revenue ELSE 0 END) AS sales_prev_year,
    SUM(CASE WHEN year = (SELECT MAX(year) FROM yr_sales) THEN yearly_revenue ELSE 0 END) AS sales_current_year
  FROM
    yr_sales
  GROUP BY
    sub_category
)

-- Step 3 & 4: Calculate YoY Delta, Filter, and Order Results
SELECT
  sub_category,
  sales_prev_year AS sales_2016, -- Assuming 2016 is the previous year based on data
  sales_current_year AS sales_2017, -- Assuming 2017 is the current year based on data
  (sales_current_year - sales_prev_year) AS yoy_delta
FROM
  yoy_comparison
WHERE sales_prev_year IS NOT NULL AND sales_prev_year != 0 -- Keep only sub-categories with prior-year sales; no division happens here, but this also makes a later percentage calculation safe
ORDER BY
  yoy_delta DESC
LIMIT 5;
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string (calculating YoY delta)...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view and data for the specified years.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string (calculating YoY delta)...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 5 rows.

--- Displaying Results ---


| | sub_category | sales_2016 | sales_2017 | yoy_delta |
|---|---|---|---|---|
| 0 | Phones | 78962.03 | 105340.516 | 26378.486 |
| 1 | Binders | 49683.325 | 72788.045 | 23104.72 |
| 2 | Accessories | 41895.854 | 59946.232 | 18050.378 |
| 3 | Appliances | 26050.315 | 42926.932 | 16876.617 |
| 4 | Copiers | 49599.41 | 62899.388 | 13299.978 |


**Missing Data for a Year:** If a sub-category had sales in only one of the two years being compared, the conditional aggregation (`SUM(CASE WHEN year = ... THEN yearly_revenue ELSE 0 END)`) in the `yoy_comparison` CTE assigns 0 to the missing year, so the YoY delta can still be computed. Note, however, that the final `WHERE ... sales_prev_year != 0` filter then drops any sub-category with no sales in the previous year, so a brand-new sub-category can never appear in the top 5. A delta from zero would need separate interpretation in any case.

**Sub-categories with Zero Sales in the Previous Year:** If a sub-category had zero sales in the previous year (`sales_prev_year = 0`), computing `yoy_pct = 100.0 * (sales_current_year - sales_prev_year) / sales_prev_year` would raise a division-by-zero error. The generated query sidesteps this by reporting only `yoy_delta`. To add the percentage, either keep a guard such as `WHERE sales_prev_year IS NOT NULL AND sales_prev_year != 0` (already present in the final SELECT of the generated query) or use BigQuery's `SAFE_DIVIDE` function, which returns NULL instead of raising an error.
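
The two failure modes can also be reproduced offline. The following is a minimal Python sketch, not the notebook's method: it mimics the `SUM(CASE WHEN ...)` pivot and a `SAFE_DIVIDE`-style guard on a handful of rows. The `Phones` and `Binders` figures are taken from the output above; `NewLine` is a hypothetical sub-category with no 2016 sales, and `pivot_two_years` and `safe_divide` are illustrative helper names.

```python
from collections import defaultdict


def safe_divide(numerator, denominator):
    """Return None instead of raising when the denominator is 0 or None
    (the same idea as BigQuery's SAFE_DIVIDE returning NULL)."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def pivot_two_years(rows, prev_year, curr_year):
    """Mimic SUM(CASE WHEN year = ... THEN revenue ELSE 0 END) per sub-category."""
    totals = defaultdict(lambda: {"prev": 0.0, "curr": 0.0})
    for sub_category, year, revenue in rows:
        entry = totals[sub_category]  # every sub-category gets a row, even if all zeros
        if year == prev_year:
            entry["prev"] += revenue
        elif year == curr_year:
            entry["curr"] += revenue
    return dict(totals)


rows = [
    ("Phones", 2016, 78962.03),
    ("Phones", 2017, 105340.516),
    ("Binders", 2016, 49683.325),
    ("Binders", 2017, 72788.045),
    ("NewLine", 2017, 1000.0),  # hypothetical: no sales in the previous year
]

pivot = pivot_two_years(rows, prev_year=2016, curr_year=2017)

report = []
for name, t in pivot.items():
    delta = t["curr"] - t["prev"]
    pct = safe_divide(100.0 * delta, t["prev"])  # None when prev-year sales are 0
    report.append((name, t["prev"], t["curr"], delta, pct))

# Largest absolute increase first, like ORDER BY yoy_delta DESC
report.sort(key=lambda r: r[3], reverse=True)
for line in report:
    print(line)
```

In this sketch the hypothetical `NewLine` row keeps its delta but gets a percentage of `None`, which is the behaviour `SAFE_DIVIDE` would give; the notebook's `WHERE` filter would instead remove the row entirely.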



## Part E — Window Functions (ROW_NUMBER, RANK, DENSE_RANK, LAG/LEAD, moving averages)
**Aim:** Compare rows across partitions and time; compute trends and ranks without collapsing rows.

### E1. Top product per region (ROW_NUMBER)
**Prompt:**
```
BigQuery SQL only.
Task: For each `Region`, return only the single highest-revenue `Sub_Category`.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `subcat_sales`: SUM(Sales) by `Region`, `Sub_Category`
Add `ROW_NUMBER() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as rn
Final: filter `rn = 1`
Output: `Region`, `Sub_Category`, `total_sales`
Sort by `Region`
```
**Why `ROW_NUMBER` instead of `RANK`?** Ask the model to add a 2-sentence contrast.

ROW_NUMBER() assigns a unique number to each row within a partition, ensuring only one row gets rank 1 even in the case of ties. RANK() assigns the same rank to tied rows and skips the next rank number, meaning multiple rows could share rank 1 if their values are identical.
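
To make the contrast concrete, here is a small Python sketch that numbers a descending list the way `ROW_NUMBER`, `RANK`, and `DENSE_RANK` would. The sales values are hypothetical (chosen to include a tie) and `number_rows` is an illustrative helper, not a BigQuery API.

```python
def number_rows(values):
    """Return (value, row_number, rank, dense_rank) for values sorted descending."""
    ordered = sorted(values, reverse=True)
    out = []
    rank = 0
    dense_rank = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number   # RANK jumps to the current position after a tie
            dense_rank += 1     # DENSE_RANK advances by one with no gaps
            previous = value
        out.append((value, row_number, rank, dense_rank))
    return out


result = number_rows([500, 500, 300, 100])  # hypothetical totals with a tie at the top
for row in result:
    print(row)
```

Both tied rows get `rank` 1, but only one gets `row_number` 1. In SQL, which tied row receives `rn = 1` is not guaranteed unless the `ORDER BY` inside the window includes a tiebreaker, so adding a secondary key such as `sub_category` makes the result reproducible.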

In [65]:
query_string = """
WITH subcat_sales AS (
  SELECT
    region,
    sub_category,
    SUM(sales) AS total_sales
  FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
  GROUP BY
    region,
    sub_category
),

ranked_subcat_sales AS (
  SELECT
    region,
    sub_category,
    total_sales,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC) AS rn
  FROM
    subcat_sales
)

SELECT
  region,
  sub_category,
  total_sales
FROM
  ranked_subcat_sales
WHERE
  rn = 1
ORDER BY
  region;
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view and data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying Results ---


| | region | sub_category | total_sales |
|---|---|---|---|
| 0 | Central | Chairs | 85230.646 |
| 1 | East | Phones | 100614.982 |
| 2 | South | Phones | 58304.438 |
| 3 | West | Chairs | 101781.328 |


### E2. YoY growth with LAG
**Prompt:**
```
BigQuery SQL only.
Task: Compute year-over-year revenue growth for 'Phones' sub-category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Filter to `Sub_Category = 'Phones'`
- Aggregate yearly revenue using EXTRACT(YEAR FROM Order_Date)
- Add `LAG(yearly_revenue) OVER (ORDER BY year)` as `prev_revenue`
- Compute `yoy_pct = 100.0 * (yearly_revenue - prev_revenue) / prev_revenue`
Output: `year`, `yearly_revenue`, `prev_revenue`, `yoy_pct`
Sort by `year` ASC
```
**Ask for**: a guard against divide-by-zero or NULL previous year.

The two guards in the query below are:

- `SAFE_DIVIDE((yearly_revenue - prev_revenue), prev_revenue)`: BigQuery's `SAFE_DIVIDE` function returns NULL if the denominator (`prev_revenue`) is zero or NULL, instead of raising a division-by-zero error.
- `WHERE prev_revenue IS NOT NULL`: this clause in the final SELECT statement filters out the first year in the dataset. `LAG` has no previous row for the very first entry, so `prev_revenue` is NULL for that year, and `yoy_pct` is computed only for years where a previous year's revenue exists.

Together, these two mechanisms prevent errors and handle cases where a previous year's revenue is missing or zero.
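
The same logic can be checked by hand with a few lines of Python. This is a sketch under stated assumptions: the yearly Phones revenues are the figures reported in this notebook's output (the 2014 value is the `prev_revenue` shown for 2015), and `lag` and `yoy_pct` are illustrative helpers that imitate `LAG(...) OVER (ORDER BY year)` and the guarded percentage.

```python
def lag(values):
    """Shift a list down by one position, like LAG(x) OVER (ORDER BY ...)."""
    return [None] + values[:-1]


def yoy_pct(current, previous):
    """Percentage change, or None when there is no usable previous value."""
    if previous is None or previous == 0:
        return None
    return 100.0 * (current - previous) / previous


years = [2014, 2015, 2016, 2017]
revenue = [77390.806, 68313.702, 78962.03, 105340.516]  # Phones, from the notebook output

growth = [
    (year, curr, prev, yoy_pct(curr, prev))
    for year, curr, prev in zip(years, revenue, lag(revenue))
    if prev is not None  # same effect as WHERE prev_revenue IS NOT NULL
]

for year, curr, prev, pct in growth:
    print(year, curr, prev, round(pct, 6))
```

The first year is dropped because it has no predecessor, which is why the query returns three rows for four years of data.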

In [66]:
query_string = """
WITH yearly_phone_sales AS (
  SELECT
    EXTRACT(YEAR FROM order_date) AS year,
    SUM(sales) AS yearly_revenue
  FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
  WHERE
    sub_category = 'Phones'
  GROUP BY
    year
),

yoy_growth AS (
  SELECT
    year,
    yearly_revenue,
    LAG(yearly_revenue) OVER (ORDER BY year) AS prev_revenue
  FROM
    yearly_phone_sales
)

SELECT
  year,
  yearly_revenue,
  prev_revenue,
  -- Guard against divide-by-zero or NULL previous year
  SAFE_DIVIDE((yearly_revenue - prev_revenue), prev_revenue) * 100 AS yoy_pct
FROM
  yoy_growth
WHERE prev_revenue IS NOT NULL -- Exclude the first year which has no previous revenue
ORDER BY
  year ASC;
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view, data, and filter conditions.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 3 rows.

--- Displaying Results ---


| | year | yearly_revenue | prev_revenue | yoy_pct |
|---|---|---|---|---|
| 0 | 2015 | 68313.702 | 77390.806 | -11.728918 |
| 1 | 2016 | 78962.03 | 68313.702 | 15.587397 |
| 2 | 2017 | 105340.516 | 78962.03 | 33.406545 |


### E3. 3‑month moving average (MA)
**Prompt:**
```
BigQuery SQL only.
Task: For the 'Corporate' segment, compute a 3-month moving average of monthly revenue.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Derive `month` via DATE_TRUNC(Order_Date, MONTH)
- SUM(Sales) per `month`
- Add `AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` as `ma_3`
Output: `month`, `monthly_revenue`, `ma_3`
Sort by `month` ASC
```
**Tip:** Ask the model to include a 1‑line cost control note (e.g., restrict date range while iterating).
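
For intuition about the window frame, the sketch below reproduces `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` in plain Python. It is an illustration only: `trailing_mean` is a hypothetical helper, and the four monthly values are the first four Corporate months reported when this notebook ran the query.

```python
def trailing_mean(values, window=3):
    """Average each value with up to (window - 1) preceding rows."""
    averages = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]  # shorter frame for the first rows
        averages.append(sum(frame) / len(frame))
    return averages


monthly_revenue = [1701.528, 1183.668, 11106.799, 14131.729]  # Jan to Apr 2014
ma_3 = trailing_mean(monthly_revenue)

for revenue, average in zip(monthly_revenue, ma_3):
    print(revenue, round(average, 6))
```

The first two averages cover only one and two rows, so early values of `ma_3` are not true 3-month averages. The frame also counts rows rather than calendar months, so if a month had no Corporate sales, the window would silently span more than three months.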

In [67]:
query_string = """
WITH monthly_corporate_sales AS (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
  FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
  WHERE
    segment = 'Corporate'
  GROUP BY
    month
)

SELECT
  month,
  monthly_revenue,
  AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3
FROM
  monthly_corporate_sales
ORDER BY
  month ASC;

-- To control cost during exploration, you could add a WHERE clause
-- to limit the date range, e.g., WHERE month >= '2017-01-01'
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view, data, and filter conditions.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 48 rows.

--- Displaying Results ---


| | month | monthly_revenue | ma_3 |
|---|---|---|---|
| 0 | 2014-01-01 | 1701.528 | 1701.528 |
| 1 | 2014-02-01 | 1183.668 | 1442.598 |
| 2 | 2014-03-01 | 11106.799 | 4663.998333 |
| 3 | 2014-04-01 | 14131.729 | 8807.398667 |
| 4 | 2014-05-01 | 9142.0 | 11460.176 |
| 5 | 2014-06-01 | 3970.914 | 9081.547667 |
| 6 | 2014-07-01 | 10032.988 | 7715.300667 |
| 7 | 2014-08-01 | 7451.774 | 7151.892 |
| 8 | 2014-09-01 | 15507.745 | 10997.502333 |
| 9 | 2014-10-01 | 12637.678 | 11865.732333 |


## Part F — Debugging & Optimization Prompts
**Aim:** Use the model as a rubber duck for error handling and performance.

### F1. Explain the error, propose a fix
**Prompt:**
```
I ran this BigQuery SQL and got an error:
403 GET https://bigquery.googleapis.com/bigquery/v2/projects/mgmt-467-47888/datasets/lab1_foundation/tables/superstore_clean?fields=schema: Access Denied: Table mgmt-467-47888.lab1_foundation.superstore_clean: Permission bigquery.tables.get denied on table projects/mgmt-467-47888/datasets/lab1_foundation/tables/superstore_clean (or it may not exist).
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `mgmt-467-47888.lab1_foundation.superstore_clean`
LIMIT 10;
Act as a BigQuery trouble‑shooter.
1) Identify the root cause.
2) Propose the smallest possible fix.
3) Suggest a quick sanity check query to verify the fix.
Return only the corrected SQL and a 2‑sentence rationale.
```

```sql
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `mgmt467-project1.lab1_foundation.superstore_clean`
LIMIT 10;
```

**Rationale:** The root cause of the error was an incorrect project ID in the table path (`mgmt-467-47888` instead of `mgmt467-project1`). The fix is to correct the project ID in the `FROM` clause to match the actual project where the `superstore_clean` view was created.

### F2. Reduce cost / improve speed
**Prompt:**
```
Act as a BigQuery cost optimizer.
Given this query (below), list 3 ways to reduce scanned bytes and improve performance without changing the business logic.
SELECT
  month,
  monthly_revenue,
  AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3
FROM
  (
    SELECT
      DATE_TRUNC(order_date, MONTH) AS month,
      SUM(sales) AS monthly_revenue
    FROM
      `mgmt467-project1.lab1_foundation.superstore_clean`
    WHERE
      segment = 'Corporate'
    GROUP BY
      month
  )
ORDER BY
  month ASC;
Prioritize: partition filters, column pruning, pre-aggregations, and temporary results via CTEs.
```

Here are 3 ways to reduce scanned bytes and improve performance for the 3-month moving average query:

1.  **Partitioning the Source Table (if applicable):** If the table underlying the `superstore_clean` view is partitioned by `order_date` (e.g., by `DAY`, `MONTH`, or `YEAR`), BigQuery can use partition pruning. Adding a `WHERE` clause that filters `order_date` in the inner query (the subquery that computes `monthly_revenue`) makes BigQuery scan only the relevant partitions, which significantly reduces scanned bytes and improves performance, especially on large datasets. Because a date filter also narrows the months returned, choose a range that matches the intended reporting window.

2.  **Column Pruning:** The current query only uses the `order_date`, `sales`, and `segment` columns from the source table. Ensure that only these necessary columns are being read from the underlying table. If the `superstore_clean` view is defined with `SELECT *`, modifying the view definition to select only required columns (`order_date`, `sales`, `segment`) would be beneficial for performance and cost.

3.  **Clustering the Source Table (if applicable):** If the underlying table is clustered by `segment` and/or `order_date`, BigQuery can use the clustering information to skip blocks that do not contain the relevant `segment` values or `order_date` ranges. Clustering helps most for columns used in `WHERE` clauses (here `segment`); the window function's `ORDER BY month` runs on the already-aggregated monthly rows, so it benefits little from clustering.

## Part G — Validation & Counter‑examples (DIVE: Validate)
**Aim:** Avoid “first‑answer fallacy” by testing alternatives.

### G1. Ask for counter‑queries
**Prompt:**
```
I concluded that 'Tables' is a high‑sales but negative‑profit sub-category due to high discounts.
Create two alternative BigQuery SQL queries that could falsify or nuance this finding:
- One that slices by region and time
- One that controls for order priority or ship mode
Return BigQuery SQL only, then a one-paragraph note on how to compare outcomes.
```

In [68]:
query_string = """
SELECT
    region,
    EXTRACT(YEAR FROM order_date) AS year,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS average_discount -- Including discount for context
FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
WHERE
    sub_category = 'Tables'
GROUP BY
    region,
    year
ORDER BY
    region,
    year;
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string (Counter-query 1)...") # Corrected print statement
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view, data, and filter conditions.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string (Counter-query 1)...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 16 rows.

--- Displaying Results ---


| | region | year | total_sales | total_profit | average_discount |
|---|---|---|---|---|---|
| 0 | Central | 2014 | 7785.478 | -1424.331 | 0.326667 |
| 1 | Central | 2015 | 6857.26 | -265.0939 | 0.207143 |
| 2 | Central | 2016 | 13922.926 | 292.6211 | 0.205882 |
| 3 | Central | 2017 | 10589.307 | -2162.8466 | 0.292308 |
| 4 | East | 2014 | 10603.704 | -3537.8375 | 0.38 |
| 5 | East | 2015 | 8884.806 | -2275.8641 | 0.373333 |
| 6 | East | 2016 | 7825.328 | -2306.7783 | 0.368182 |
| 7 | East | 2017 | 11825.969 | -2904.9002 | 0.373913 |
| 8 | South | 2014 | 9940.9445 | 1107.9902 | 0.113636 |
| 9 | South | 2015 | 7370.6745 | -2171.3765 | 0.21875 |


Here are two alternative BigQuery SQL queries to help falsify or nuance your finding about the 'Tables' sub-category:

**Query 1: Slice by Region and Time**
This query (run in the cell above) breaks down the sales and profit for 'Tables' by `region` and `year` to see whether the high-sales/negative-profit pattern holds consistently across regions and over time.

**Query 2: Control for Order Priority or Ship Mode**
This query (run in the cell below) slices the sales, profit, and discount for 'Tables' by `ship_mode` to see whether certain shipping methods are associated with higher discounts or lower profits.

**How to Compare Outcomes:**

To compare the outcomes of these counter-queries with your initial finding (that 'Tables' is high-sales, negative-profit due to high discounts), consider the following:

*   **Query 1 (Region and Time):** Does the negative profit for 'Tables' hold across all regions and years, or is it concentrated in specific areas or time periods? If it is not consistent, your initial conclusion might need nuance (e.g., "Tables are unprofitable *in the Central region*"). Look at the `total_sales` and `total_profit` columns for the different region-year combinations.
*   **Query 2 (Ship Mode):** Are certain `ship_mode` values associated with significantly higher `average_discount` values or lower `total_profit`? A strong association between a particular ship mode and high discounts or low profit would support your hypothesis that discounts are a key driver of unprofitability for 'Tables'.

By examining these different slices of the data, you can see whether your initial conclusion is consistently supported or whether there are exceptions or contributing factors you had not considered.
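
One way to turn the Query 1 comparison into numbers is sketched below in Python. It is a rough illustration, not part of the generated solution: it uses only the ten region-year rows displayed in the Query 1 output (the query returned 16 rows, so the remaining rows are not covered), and the variable names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (region, year, total_profit, average_discount) for 'Tables', from the displayed rows
rows = [
    ("Central", 2014, -1424.331, 0.326667),
    ("Central", 2015, -265.0939, 0.207143),
    ("Central", 2016, 292.6211, 0.205882),
    ("Central", 2017, -2162.8466, 0.292308),
    ("East", 2014, -3537.8375, 0.38),
    ("East", 2015, -2275.8641, 0.373333),
    ("East", 2016, -2306.7783, 0.368182),
    ("East", 2017, -2904.9002, 0.373913),
    ("South", 2014, 1107.9902, 0.113636),
    ("South", 2015, -2171.3765, 0.21875),
]

loss_rows = [r for r in rows if r[2] < 0]
profit_rows = [r for r in rows if r[2] >= 0]
loss_share = len(loss_rows) / len(rows)

# Per region: (loss-making region-years, region-years seen)
counts = defaultdict(lambda: [0, 0])
for region, _year, profit, _discount in rows:
    counts[region][1] += 1
    if profit < 0:
        counts[region][0] += 1
by_region = {region: tuple(c) for region, c in counts.items()}

mean_discount_loss = mean(r[3] for r in loss_rows)
mean_discount_profit = mean(r[3] for r in profit_rows)

print(f"Loss-making region-years: {len(loss_rows)} of {len(rows)} ({loss_share:.0%})")
print("By region:", by_region)
print(f"Mean discount when unprofitable: {mean_discount_loss:.3f}")
print(f"Mean discount when profitable:   {mean_discount_profit:.3f}")
```

If the loss-making slices also carry the higher average discounts, the discount hypothesis gains support; if losses appear at low discounts too, other drivers deserve a look. With so few rows, treat the comparison as a prompt for further queries rather than a conclusion.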

In [69]:
query_string = """
SELECT
    ship_mode,
    COUNT(*) AS order_count,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS average_discount
FROM
    `mgmt467-project1.lab1_foundation.superstore_clean`
WHERE
    sub_category = 'Tables'
GROUP BY
    ship_mode
ORDER BY
    total_profit ASC; -- Ordering by profit to see which ship modes are most unprofitable
"""

# This assumes your 'client' object is still active and authenticated

print("✅ Step 1: Defining the query string (Counter-query 2)...")
print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check your view, data, and filter conditions.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string (Counter-query 2)...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 4 rows.

--- Displaying Results ---


| | ship_mode | order_count | total_sales | total_profit | average_discount |
|---|---|---|---|---|---|
| 0 | Standard Class | 190 | 124826.6615 | -11910.0122 | 0.270526 |
| 1 | Second Class | 61 | 43693.7475 | -3320.6799 | 0.248361 |
| 2 | First Class | 47 | 28800.776 | -1365.3665 | 0.240426 |
| 3 | Same Day | 21 | 9644.347 | -1129.4225 | 0.261905 |


## Part H — Synthesis (DIVE: Extend)
**Aim:** Turn analysis into business‑ready insights.

### H1. Executive‑style summary
**Prompt:**
```
Act as a business strategist.
Based on the following metrics/figures (briefly summarize your results here), write a 4-sentence executive summary:
- 1 sentence: what changed and by how much
- 1 sentence: why it likely changed (drivers)
- 1 sentence: recommended action (who/what/when)
- 1 sentence: metric to monitor next
```

Based on the data and outputs above, the Phones sub-category shows strong performance, with YoY growth of 33.4% from 2016 to 2017. This is probably the result of effective marketing in the Technology category. The business development team should investigate the exact cause of this growth so that it can apply the same or similar techniques to other categories. The team should also monitor this growth to make sure it holds steady or keeps increasing and that sales do not decline sharply in the future.

### H2. Convert final SQL into an automated job (optional)
**Prompt (use only after your SQL is final):**
```
Convert my final BigQuery SQL into a Python script that can run as a scheduled job from Colab or Cloud Functions.
Requirements:
- Use python‑bigquery client
- Parameterize date range
- Write results to a destination table `[YOUR_PROJECT].analytics.outputs_kpi`
- Add basic error handling & logging
Return one complete runnable script.
```

---
## Submission checklist
- [ ] Kept prompts precise and reproducible  
- [ ] Captured at least **one** CTE query and **one** window function query  
- [ ] Documented **two** validation attempts (counter‑queries or alternate slice)  
- [ ] Wrote a 4‑sentence executive summary based on results  
- [ ] (Optional) Converted final query into a scheduled job
---