<a href="https://colab.research.google.com/github/CalebBrunton2/mgmt467-analytics-portfolio/blob/main/Caleb_Brunton_Unit2_Lab1_PromptPlusExamples_Colab_Kaggle_GCS_BQ_DQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MGMT 467 — Prompt-Driven Lab (with Commented Examples)
## Kaggle ➜ Google Cloud Storage ➜ BigQuery ➜ Data Quality (DQ)

**How to use this notebook**
- Each section gives you a **Build Prompt** to paste into Gemini/Vertex AI (or Gemini in Colab).
- Below each prompt, you’ll see a **commented example** of what a good LLM answer might look like.
- **Do not** just uncomment and run. Use the prompt to generate your own code, then compare to the example.
- After every step, run the **Verification Prompt**, and write the **Reflection** in Markdown.

> Goal today: Download the Netflix dataset (Kaggle) → Stage on GCS → Load into BigQuery → Run DQ profiling (missingness, duplicates, outliers, anomaly flags).


### Academic integrity & LLM usage
- Use the prompts here to generate your own code cells.
- Read concept notes and write the reflection answers in your own words.
- Keep credentials out of code. Upload `kaggle.json` when asked.


## Learning objectives
1) Explain **why** we stage data in GCS and load it to BigQuery.  
2) Build an **idempotent**, auditable pipeline.  
3) Diagnose **missingness**, **duplicates**, and **outliers** and justify cleaning choices.  
4) Connect DQ decisions to **business/ML impact**.


## 0) Environment setup — What & Why
Authenticate Colab to Google Cloud so we can use `gcloud`, GCS, and BigQuery. Set **PROJECT_ID** and **REGION** once for consistency (cost/latency).

### Build Prompt (paste to LLM)
You are my cloud TA. Generate a single **Colab code cell** that:
1) Authenticates to Google Cloud in Colab,  
2) Prompts for `PROJECT_ID` via `input()` and sets `REGION="us-central1"` (editable),  
3) Exports `GOOGLE_CLOUD_PROJECT`,  
4) Runs `gcloud config set project $GOOGLE_CLOUD_PROJECT`,  
5) Prints both values. Add 2–3 comments explaining what/why.
End with a comment: `# Done: Auth + Project/Region set`.


In [92]:
# Authenticate to Google Cloud so this Colab session can use GCS, BigQuery, and other APIs.
from google.colab import auth
auth.authenticate_user()  # Opens an authentication popup (sign in with your Purdue Google account)

# Prompt for your active GCP Project ID and define the default region (you may edit if needed)
import os
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"  # Common multi-zone region; consistent region helps control cost and latency
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID  # Export for gcloud and BigQuery CLI to detect automatically

# Set the active project for gcloud commands and verify configuration
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project



print(f"✅ Project: {PROJECT_ID} | Region: {REGION}")
# Done: Auth + Project/Region set


Enter your GCP Project ID: original-wonder-471819-n2
Updated property [core/project].
original-wonder-471819-n2
✅ Project: original-wonder-471819-n2 | Region: us-central1
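
A mistyped or unexpanded project ID tends to surface only later as a confusing API error, so a small guard right after `input()` can help. The sketch below is a hypothetical helper (`looks_like_project_id` is not part of any Google library). Its pattern assumes the rule stated in BigQuery's "Invalid project ID" error text: 6-63 lowercase letters, digits, or dashes, starting with a letter and not ending with a dash.

```python
import re

# Assumption: the pattern mirrors the wording of BigQuery's "Invalid project ID" error.
# Domain-scoped IDs such as "example.com:my-project" are deliberately not handled.
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,61}[a-z0-9]")


def looks_like_project_id(value: str) -> bool:
    """Return True if `value` is plausibly a GCP project ID (hypothetical helper)."""
    return _PROJECT_ID_RE.fullmatch(value.strip()) is not None


print(looks_like_project_id("original-wonder-471819-n2"))
print(looks_like_project_id("${GOOGLE_CLOUD_PROJECT}"))
```

Passing this check only means the string is well formed. It does not confirm that the project exists or that you have access to it.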


### Verification Prompt
Generate a short cell that prints the active project using `gcloud config get-value project` and echoes the `REGION` you set.


**Reflection:** Why do we set `PROJECT_ID` and `REGION` at the top? What can go wrong if we don’t?

## 1) Kaggle API — What & Why
Use Kaggle CLI for reproducible downloads. Store `kaggle.json` at `~/.kaggle/kaggle.json` with `0600` permissions to protect secrets.

### Build Prompt
Generate a **single Colab code cell** that:
- Prompts me to upload `kaggle.json`,
- Saves to `~/.kaggle/kaggle.json` with `0600` permissions,
- Prints `kaggle --version`.
Add comments about security and reproducibility.


In [93]:
# Upload and configure your Kaggle API credentials securely for reproducible downloads
from google.colab import files
print("📂 Please upload your kaggle.json file (Kaggle → Account → Create New API Token)")
uploaded = files.upload()  # Prompts you to select kaggle.json from your local machine

# Save the uploaded token to the standard Kaggle CLI directory with secure permissions
import os
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])
os.chmod('/root/.kaggle/kaggle.json', 0o600)  # 0600 = owner read/write only → prevents credential leaks

# Verify installation and ensure Kaggle CLI is ready for reproducible dataset access
!kaggle --version
# ✅ Done: Kaggle API configured securely and reproducibly

import kagglehub

# Download latest version
path = kagglehub.dataset_download("sayeeduddin/netflix-2025user-behavior-dataset-210k-records")

print("Path to dataset files:", path)


📂 Please upload your kaggle.json file (Kaggle → Account → Create New API Token)


Saving kaggle.json to kaggle (3).json
Kaggle API 1.7.4.5
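
To confirm that the `0600` setting actually took effect, the permission bits can be read back with the standard library. This is a minimal sketch on a throwaway file rather than a real token; `is_owner_only` is a hypothetical helper name, and the check assumes a POSIX file system such as the one Colab uses.

```python
import os
import stat
import tempfile


def is_owner_only(path: str) -> bool:
    """True when no group/other permission bits are set on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


# Demonstrate on a temporary file instead of a real credential.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

os.chmod(demo_path, 0o644)          # group and others can read
before = is_owner_only(demo_path)
os.chmod(demo_path, 0o600)          # owner read/write only
after = is_owner_only(demo_path)
os.remove(demo_path)

print(f"0644 owner-only? {before} | 0600 owner-only? {after}")
```

In the notebook, the same function could be pointed at `/root/.kaggle/kaggle.json` after the `os.chmod` call.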


### Verification Prompt
Generate a one-liner that runs `kaggle --help | head -n 20` to show the CLI is ready.


In [94]:
!kaggle --help | head -n 20

usage: kaggle [-h] [-v] [-W]
              {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
              ...

options:
  -h, --help            show this help message and exit
  -v, --version         Print the Kaggle API version

commands:
  {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        kernels {list, files, init, push, pull, output, status}
                        models {instances, get, list, init, create, delete, update}
                        models instances {versions, get, files, init, create, delete, update}
                        models instances versions {init, create, download, delete, files}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle compe

**Reflection:** Why require strict `0600` permissions on API tokens? What risks are we avoiding?

## 2) Download & unzip dataset — What & Why
Keep raw files under `/content/data/raw` for predictable paths and auditing.
**Dataset:** `sayeeduddin/netflix-2025user-behavior-dataset-210k-records`

### Build Prompt
Generate a **Colab code cell** that:
- Creates `/content/data/raw`,
- Downloads the dataset to `/content/data` with Kaggle CLI,
- Unzips into `/content/data/raw` (overwrite OK),
- Lists all CSVs with sizes in a neat table.
Include comments describing each step.


In [95]:
# Step 2 — Download & unzip the Netflix dataset; inventory raw CSVs
# This cell:
# 1) Creates a predictable raw-data folder,
# 2) Downloads the specified Kaggle dataset into /content/data,
# 3) Unzips all archives into /content/data/raw (overwriting OK),
# 4) Lists all CSVs found with human-readable sizes in a neat table.

import os, glob
from pathlib import Path
import pandas as pd

# 1) Create raw-data directory (idempotent)
RAW_DIR = "/content/data/raw"
os.makedirs(RAW_DIR, exist_ok=True)

# 2) Download dataset ZIP(s) from Kaggle into /content/data
#    (Requires kaggle.json to be configured in /root/.kaggle/kaggle.json)
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data

# 3) Unzip all downloaded archives to /content/data/raw (overwrite = OK)
!unzip -o "/content/data/*.zip" -d "$RAW_DIR"

# 4) Build a neat table of CSV inventory with sizes
files = sorted(glob.glob(f"{RAW_DIR}/*.csv"))
rows = []
for f in files:
    p = Path(f)
    size = p.stat().st_size
    # human-readable size
    units = ["B","KB","MB","GB","TB"]
    n = float(size); i = 0
    while n >= 1024 and i < len(units)-1:
        n /= 1024; i += 1
    rows.append({"file": p.name, "size": f"{n:.1f} {units[i]}"})

df = pd.DataFrame(rows, columns=["file","size"])
if df.empty:
    print(f"⚠️ No CSVs found in {RAW_DIR}. Check the download/unzip steps above.")
else:
    display(df)  # neat, readable table in Colab
    assert len(files) == 6, f"Expected 6 CSVs, found {len(files)}"
    print(f"✅ Inventory complete: {len(files)} CSV files in {RAW_DIR}")


Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
netflix-2025user-behavior-dataset-210k-records.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  


| file | size |
|---|---|
| movies.csv | 113.2 KB |
| recommendation_logs.csv | 4.5 MB |
| reviews.csv | 1.8 MB |
| search_logs.csv | 2.1 MB |
| users.csv | 1.5 MB |
| watch_history.csv | 8.8 MB |


✅ Inventory complete: 6 CSV files in /content/data/raw
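
For auditing, names and sizes can be complemented with a content hash, so that a later rerun can tell whether a file changed even if its size did not. The sketch below assumes a hypothetical helper `build_manifest` and runs on a temporary directory so it stays self-contained; in the notebook it would be pointed at `/content/data/raw`.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(raw_dir: str) -> list:
    """Return one record per CSV: file name, size in bytes, SHA-256 digest."""
    records = []
    for path in sorted(Path(raw_dir).glob("*.csv")):
        data = path.read_bytes()  # fine for small files; stream in chunks for large ones
        records.append({
            "file": path.name,
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return records


# Demo data only: a tiny CSV plus a non-CSV file that should be ignored.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "users.csv").write_bytes(b"user_id,age\n1,34\n")
    Path(demo_dir, "notes.txt").write_bytes(b"ignored")
    manifest = build_manifest(demo_dir)

for rec in manifest:
    print(rec["file"], rec["bytes"], rec["sha256"][:12])
```

Saving the manifest next to the raw files gives a simple baseline to compare against after any re-download.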


### Verification Prompt
Generate a snippet that asserts there are exactly **six** CSV files and prints their names.


In [96]:
# Verification — confirm exactly six CSV files exist in /content/data/raw
import glob

csv_files = glob.glob("/content/data/raw/*.csv")
assert len(csv_files) == 6, f"❌ Expected 6 CSV files, but found {len(csv_files)}."
print("✅ Found 6 CSV files:")
for f in csv_files:
    print(" -", f)


✅ Found 6 CSV files:
 - /content/data/raw/reviews.csv
 - /content/data/raw/search_logs.csv
 - /content/data/raw/watch_history.csv
 - /content/data/raw/recommendation_logs.csv
 - /content/data/raw/users.csv
 - /content/data/raw/movies.csv


**Reflection:** Why is keeping a clean file inventory (names, sizes) useful downstream?

## 3) Create GCS bucket & upload — What & Why
Stage in GCS → consistent, versionable source for BigQuery loads. Bucket names must be **globally unique**.

### Build Prompt
Generate a **Colab code cell** that:
- Creates a unique bucket in `${REGION}` (random suffix),
- Saves name to `BUCKET_NAME` env var,
- Uploads all CSVs to `gs://$BUCKET_NAME/netflix/`,
- Prints the bucket name and explains staging benefits.


In [97]:
# FIXED Step 3 — Create a unique GCS bucket (in REGION) and stage CSVs

import os, uuid, subprocess, sys, glob

# --- Export REGION for shell (this was missing) ---
# Assumes you already set REGION earlier in your notebook; if not, set it here:
try:
    REGION  # verify it exists in Python
except NameError:
    REGION = "us-central1"  # fallback; edit if you use a different region
os.environ["REGION"] = REGION  # <-- makes $REGION available to !gcloud

# Unique bucket name
bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
os.environ["BUCKET_NAME"] = bucket_name

# Create bucket (handle "already exists" or bad location gracefully)
print(f"Creating bucket: gs://{bucket_name} in {REGION} ...")
rc = subprocess.call(["bash","-lc", "gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION"])
if rc != 0:
    print("⚠️ Bucket create returned non-zero exit code. "
          "Double-check that REGION is valid (e.g., us-central1) and your project/billing are set.")
    # Early exit so we don’t try to upload to a non-existent bucket
    sys.exit(1)

# Verify bucket exists
print("Verifying bucket...")
subprocess.check_call(["bash","-lc", "gcloud storage ls gs://$BUCKET_NAME"])

# Upload all CSVs from /content/data/raw
csvs = glob.glob("/content/data/raw/*.csv")
assert csvs, "No CSVs found in /content/data/raw. Run the download/unzip step first."
print(f"Uploading CSVs to gs://{bucket_name}/netflix/ ...")
rc = subprocess.call(["bash","-lc", "gcloud storage cp /content/data/raw/*.csv gs://$BUCKET_NAME/netflix/"])
if rc != 0:
    print("⚠️ Upload failed. Re-run after confirming CSVs exist and bucket is reachable.")
    sys.exit(1)

# List staged objects
print("\nStaged objects:")
subprocess.check_call(["bash","-lc", "gcloud storage ls -l gs://$BUCKET_NAME/netflix/"])

print(f"\n✅ Bucket ready: gs://{bucket_name}")
print("📦 Why stage in GCS? It gives a stable, versionable source of truth, "
      "enables collaboration, and cleanly decouples raw storage from BigQuery loads.")
# Done


Creating bucket: gs://mgmt467-netflix-362a55bc in us-central1 ...
Verifying bucket...
Uploading CSVs to gs://$BUCKET_NAME/netflix/ ...

Staged objects:

✅ Bucket ready: gs://mgmt467-netflix-362a55bc
📦 Why stage in GCS? It gives a stable, versionable source of truth, enables collaboration, and cleanly decouples raw storage from BigQuery loads.
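
Because bucket names are global, it can be worth checking a generated name before calling `gcloud`. The sketch below uses two hypothetical helpers. The pattern is an assumption covering only a simplified subset of the GCS naming rules (3-63 characters; lowercase letters, digits, dashes, underscores; starting and ending with a letter or digit). Dotted names follow additional rules that are not covered here.

```python
import re
import uuid

# Assumption: simplified subset of the GCS bucket naming rules (no dotted names).
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def make_bucket_name(prefix: str = "mgmt467-netflix") -> str:
    """Append a random 8-hex-character suffix to reduce the chance of a name clash."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


def is_valid_bucket_name(name: str) -> bool:
    """Check `name` against the simplified pattern above."""
    return _BUCKET_RE.fullmatch(name) is not None


candidate = make_bucket_name()
print(candidate, is_valid_bucket_name(candidate))
```

A random suffix makes collisions unlikely but not impossible, so the create step still needs to handle an "already exists" response.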


### Verification Prompt
Generate a snippet that lists the `netflix/` prefix and shows object sizes.


In [98]:
# Verification — list all staged objects under the 'netflix/' prefix and show their sizes
!gcloud storage ls -l gs://$BUCKET_NAME/netflix/


    115942  2025-10-23T20:29:57Z  gs://mgmt467-netflix-362a55bc/netflix/movies.csv
   4695557  2025-10-23T20:29:57Z  gs://mgmt467-netflix-362a55bc/netflix/recommendation_logs.csv
   1861942  2025-10-23T20:29:57Z  gs://mgmt467-netflix-362a55bc/netflix/reviews.csv
   2250902  2025-10-23T20:29:57Z  gs://mgmt467-netflix-362a55bc/netflix/search_logs.csv
   1606820  2025-10-23T20:29:57Z  gs://mgmt467-netflix-362a55bc/netflix/users.csv
   9269425  2025-10-23T20:29:57Z  gs://mgmt467-netflix-362a55bc/netflix/watch_history.csv
TOTAL: 6 objects, 19800588 bytes (18.88MiB)


**Reflection:** Name two benefits of staging in GCS vs loading directly from local Colab.

## 4) BigQuery dataset & loads — What & Why
Create dataset `netflix` and load six CSVs with **autodetect** for speed (we’ll enforce schemas later).

### Build Prompt (two cells)
**Cell A:** Create (idempotently) dataset `netflix` in US multi-region; if it exists, print a friendly message.  
**Cell B:** Load tables from `gs://$BUCKET_NAME/netflix/`:
`users, movies, watch_history, recommendation_logs, search_logs, reviews`
with `--skip_leading_rows=1 --autodetect --source_format=CSV`.
Finish with row-count queries for each table.


In [99]:
%%bigquery --project $PROJECT_ID
-- Create dataset `netflix` in the US multi-region.
-- IF NOT EXISTS makes the statement idempotent: it is a no-op when the dataset already exists.
CREATE SCHEMA IF NOT EXISTS netflix
OPTIONS(
  location="US",
  description="MGMT467 Netflix dataset"
);

-- This statement succeeds silently when the dataset is already present;
-- a friendly "already exists" message would have to be printed from Python.


In [100]:
from google.cloud import bigquery
import os

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
bucket_name = os.environ.get("BUCKET_NAME")
dataset_id = "netflix"

if not project_id or not bucket_name:
    print("❌ Error: GOOGLE_CLOUD_PROJECT or BUCKET_NAME environment variable not set.")
else:
    client = bigquery.Client(project=project_id)

    tables = ['users', 'movies', 'watch_history', 'recommendation_logs', 'search_logs', 'reviews']

    for table_name in tables:
        table_id = f"{project_id}.{dataset_id}.{table_name}"
        uri = f"gs://{bucket_name}/netflix/{table_name}.csv"

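        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            # Overwrite on each run so reruns stay idempotent; the default for
            # load jobs (WRITE_APPEND) adds the same rows again every time.
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        )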

        print(f"Loading data from {uri} into {table_id}...")
        load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
        load_job.result()  # Waits for the job to complete

        print(f"✅ Loaded {load_job.output_rows} rows into {table_id}.")

    print("\n✅ All tables loaded.")

    # Row count verification (as in the original plan)
    print("\n✅ Loads complete. Row counts (via BigQuery client):")
    rows = []
    for t in tables:
        sql = f"SELECT '{t}' AS table_name, COUNT(*) AS row_count FROM `{project_id}.{dataset_id}.{t}`"
        result = client.query(sql).result()
        for r in result:
            rows.append({"table_name": r["table_name"], "row_count": r["row_count"]})

    import pandas as pd
    df = pd.DataFrame(rows).sort_values("table_name").reset_index(drop=True)
    display(df)

Loading data from gs://mgmt467-netflix-362a55bc/netflix/users.csv into original-wonder-471819-n2.netflix.users...
✅ Loaded 10300 rows into original-wonder-471819-n2.netflix.users.
Loading data from gs://mgmt467-netflix-362a55bc/netflix/movies.csv into original-wonder-471819-n2.netflix.movies...
✅ Loaded 1040 rows into original-wonder-471819-n2.netflix.movies.
Loading data from gs://mgmt467-netflix-362a55bc/netflix/watch_history.csv into original-wonder-471819-n2.netflix.watch_history...
✅ Loaded 105000 rows into original-wonder-471819-n2.netflix.watch_history.
Loading data from gs://mgmt467-netflix-362a55bc/netflix/recommendation_logs.csv into original-wonder-471819-n2.netflix.recommendation_logs...
✅ Loaded 52000 rows into original-wonder-471819-n2.netflix.recommendation_logs.
Loading data from gs://mgmt467-netflix-362a55bc/netflix/search_logs.csv into original-wonder-471819-n2.netflix.search_logs...
✅ Loaded 26500 rows into original-wonder-471819-n2.netflix.search_logs.
Loading data 

| table_name | row_count |
|---|---|
| movies | 8320 |
| recommendation_logs | 416000 |
| reviews | 123600 |
| search_logs | 212000 |
| users | 82400 |
| watch_history | 840000 |
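
The table row counts do not match what each load job reported: `users` loaded 10300 rows but the table holds 82400. That pattern is consistent with the tables having been appended to across several runs, since load jobs append unless a write disposition such as `WRITE_TRUNCATE` is set. The sketch below reuses the figures printed above to make the ratio explicit; `append_multiples` is a hypothetical helper, and `reviews` is left out because its load-log line is truncated in the output.

```python
# Figures copied from the load log and the row-count table above.
loaded_rows = {
    "users": 10300,
    "movies": 1040,
    "watch_history": 105000,
    "recommendation_logs": 52000,
    "search_logs": 26500,
}
table_rows = {
    "users": 82400,
    "movies": 8320,
    "watch_history": 840000,
    "recommendation_logs": 416000,
    "search_logs": 212000,
}


def append_multiples(loaded: dict, table: dict) -> dict:
    """Map table name -> table rows / rows loaded per run (1.0 means no repeated appends)."""
    return {name: table[name] / loaded[name] for name in loaded}


multiples = append_multiples(loaded_rows, table_rows)
for name, ratio in multiples.items():
    print(f"{name}: {ratio:.1f}x")
```

A ratio other than 1.0 is a signal to reload with an overwrite policy before profiling. Raw counts such as the number of missing ages scale with the duplication, while percentages stay the same.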


### Verification Prompt
Generate a single query that returns `table_name, row_count` for all six tables in `${GOOGLE_CLOUD_PROJECT}.netflix`.
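
One way to get all six counts from a single query is to join per-table `COUNT(*)` statements with `UNION ALL`. The sketch below only builds the SQL string. `row_count_sql` is a hypothetical helper, and the dataset-relative table names assume the query runs with the project already set (for example through `--project` on the `%%bigquery` magic).

```python
TABLES = ["users", "movies", "watch_history", "recommendation_logs", "search_logs", "reviews"]


def row_count_sql(dataset: str, tables: list) -> str:
    """Build one query returning (table_name, row_count) for every table in `tables`."""
    parts = [
        f"SELECT '{t}' AS table_name, COUNT(*) AS row_count FROM `{dataset}.{t}`"
        for t in tables
    ]
    return "\nUNION ALL\n".join(parts) + "\nORDER BY table_name"


sql = row_count_sql("netflix", TABLES)
print(sql)
```

The printed text can be pasted into a `%%bigquery` cell or passed to `client.query(sql)`.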


**Reflection:** When is `autodetect` acceptable? When should you enforce explicit schemas and why?

## 5) Data Quality (DQ) — Concepts we care about
- **Missingness** (MCAR/MAR/MNAR). Impute vs drop. Add `is_missing_*` indicators.
- **Duplicates** (exact vs near). Double-counted engagement corrupts labels & KPIs.
- **Outliers** (IQR). Winsorize/cap vs robust models. Always **flag** and explain.
- **Reproducibility**. Prefer `CREATE OR REPLACE` and deterministic keys.


### 5.1 Missingness (users) — What & Why
Measure % missing and check if missingness depends on another variable (MAR) → potential bias & instability.

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Total rows and % missing in `region`, `plan_tier`, `age_band` from `users`.
2) `% plan_tier missing by region` ordered descending. Add comments on MAR.


In [101]:
%%bigquery --project $PROJECT_ID
-- Calculate total rows and percentage missing for 'country', 'subscription_plan', and 'age' in the users table.
SELECT
    COUNT(*) AS total_rows,
    COUNTIF(country IS NULL) AS missing_country,
    ROUND(SAFE_DIVIDE(COUNTIF(country IS NULL), COUNT(*)) * 100, 2) AS percent_missing_country,
    COUNTIF(subscription_plan IS NULL) AS missing_subscription_plan,
    ROUND(SAFE_DIVIDE(COUNTIF(subscription_plan IS NULL), COUNT(*)) * 100, 2) AS percent_missing_subscription_plan,
    COUNTIF(age IS NULL) AS missing_age,
    ROUND(SAFE_DIVIDE(COUNTIF(age IS NULL), COUNT(*)) * 100, 2) AS percent_missing_age
FROM
    `netflix.users`;


| total_rows | missing_country | percent_missing_country | missing_subscription_plan | percent_missing_subscription_plan | missing_age | percent_missing_age |
|---|---|---|---|---|---|---|
| 82400 | 0 | 0.0 | 0 | 0.0 | 9832 | 11.93 |


In [102]:
%%bigquery --project $PROJECT_ID
-- Calculate the percentage of missing 'subscription_plan' values by 'country'.
-- This helps identify if missingness of 'subscription_plan' is related to the 'country' (Missing At Random - MAR).
SELECT
    country,
    COUNTIF(subscription_plan IS NULL) AS missing_subscription_plan_count,
    COUNT(*) AS total_in_country,
    ROUND(SAFE_DIVIDE(COUNTIF(subscription_plan IS NULL), COUNT(*)) * 100, 2) AS percent_missing_subscription_plan
FROM
    `netflix.users`
GROUP BY
    country
ORDER BY
    percent_missing_subscription_plan DESC;


| country | missing_subscription_plan_count | total_in_country | percent_missing_subscription_plan |
|---|---|---|---|
| Canada | 0 | 24768 | 0.0 |
| USA | 0 | 57632 | 0.0 |
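
The same group-wise check can be applied to whichever column actually has gaps (here `age`), and an `is_missing_*` indicator can be added alongside it. The sketch below uses a handful of made-up rows, not the real `users` data, and the helper name `missing_rate_by_group` is hypothetical. A large gap between groups would hint at MAR rather than MCAR.

```python
from collections import defaultdict


def missing_rate_by_group(rows, group_key, value_key):
    """Percentage of rows per group where `value_key` is None."""
    totals, missing = defaultdict(int), defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(value_key) is None:
            missing[group] += 1
    return {g: round(100 * missing[g] / totals[g], 2) for g in totals}


# Illustrative rows only (hypothetical values).
rows = [
    {"country": "USA", "age": 34},
    {"country": "USA", "age": None},
    {"country": "USA", "age": 51},
    {"country": "USA", "age": 28},
    {"country": "Canada", "age": None},
    {"country": "Canada", "age": None},
    {"country": "Canada", "age": 45},
    {"country": "Canada", "age": 19},
]

rates = missing_rate_by_group(rows, "country", "age")
flagged = [{**row, "is_missing_age": row["age"] is None} for row in rows]
print(rates)
```

Keeping the indicator column lets a model or report distinguish "age unknown" from any imputed value.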


In [103]:
%%bigquery --project $PROJECT_ID
-- Verification: the three missingness percentages, rounded to two decimals.
-- Recomputed directly from `netflix.users`, so this cell does not depend on the cells above.
SELECT
    ROUND(SAFE_DIVIDE(COUNTIF(country IS NULL), COUNT(*)) * 100, 2) AS percent_missing_country,
    ROUND(SAFE_DIVIDE(COUNTIF(subscription_plan IS NULL), COUNT(*)) * 100, 2) AS percent_missing_subscription_plan,
    ROUND(SAFE_DIVIDE(COUNTIF(age IS NULL), COUNT(*)) * 100, 2) AS percent_missing_age
FROM
    `netflix.users`;


| percent_missing_country | percent_missing_subscription_plan | percent_missing_age |
|---|---|---|
| 0.0 | 0.0 | 11.93 |


### Verification Prompt
Generate a query that prints the three missingness percentages from (1), rounded to two decimals.


**Reflection:** Which columns are most missing? Hypothesize MCAR/MAR/MNAR and why.

### 5.2 Duplicates (watch_history) — What & Why
Find exact duplicate interaction records and keep **one best** per group (deterministic policy).

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Report duplicate groups on `(user_id, movie_id, event_ts, device_type)` with counts (top 20).
2) Create table `watch_history_dedup` that keeps one row per group (prefer higher `progress_ratio`, then `minutes_watched`). Add comments.


In [104]:
%%bigquery --project $PROJECT_ID
-- Report top 20 duplicate groups on (user_id, movie_id, event_ts, device_type) with counts.
SELECT
    user_id,
    movie_id,
    event_ts,
    device_type,
    COUNT(*) as duplicate_count
FROM
    `netflix.watch_history`
GROUP BY
    user_id,
    movie_id,
    event_ts,
    device_type
HAVING
    COUNT(*) > 1
ORDER BY
    duplicate_count DESC
LIMIT 20;

Executing query with job ID: 459d9edf-c52a-4cc5-9fe8-f109ab601876
Query executing: 0.32s


ERROR:
 400 Unrecognized name: event_ts at [5:5]; reason: invalidQuery, location: query, message: Unrecognized name: event_ts at [5:5]

Location: US
Job ID: 459d9edf-c52a-4cc5-9fe8-f109ab601876
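
The `Unrecognized name: event_ts` error means the column names in the Build Prompt do not match the loaded table, so it helps to compare the expected names against the real schema before writing the query. In the sketch below, `missing_columns` is a hypothetical pure-Python helper. `table_columns` assumes a `google.cloud.bigquery.Client` is passed in and is not called here. The example column list is invented for illustration and is not the actual `watch_history` schema.

```python
def missing_columns(required, available):
    """Return the required column names that are absent from `available`, in order."""
    present = set(available)
    return [col for col in required if col not in present]


def table_columns(client, table_id):
    """List column names of a BigQuery table.

    Assumes `client` is a google.cloud.bigquery.Client; get_table() returns a
    Table whose .schema is a list of SchemaField objects with a .name attribute.
    """
    return [field.name for field in client.get_table(table_id).schema]


# Hypothetical column list, for illustration only.
example_available = ["user_id", "movie_id", "device_type", "some_timestamp_col"]
required = ["user_id", "movie_id", "event_ts", "device_type"]
print(missing_columns(required, example_available))
```

In the notebook, `table_columns(client, f"{project_id}.netflix.watch_history")` would supply the real list, and any name reported as missing needs to be mapped to its actual counterpart before the duplicate query can run.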



In [105]:
%%bigquery --project $PROJECT_ID
-- Create table watch_history_dedup that keeps one row per group.
-- We prefer the row with the higher progress_ratio, then minutes_watched to break ties.
CREATE OR REPLACE TABLE `netflix.watch_history_dedup` AS
SELECT
    * EXCEPT(row_num)
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER(
            PARTITION BY user_id, movie_id, event_ts, device_type
            ORDER BY progress_ratio DESC, minutes_watched DESC
        ) as row_num
    FROM
        `netflix.watch_history`
)
WHERE
    row_num = 1;

-- Verify the number of rows in the new table (optional, next cell has a detailed verification)
SELECT COUNT(*) AS deduped_row_count FROM `netflix.watch_history_dedup`;

Executing query with job ID: 28bcdce7-e4e0-4083-8b4a-5e90a839a1db
Query executing: 0.18s


ERROR:
 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/original-wonder-471819-n2/queries/28bcdce7-e4e0-4083-8b4a-5e90a839a1db?maxResults=0&location=US&prettyPrint=false: Query error: Unrecognized name: event_ts at [10:45]

Location: US
Job ID: 28bcdce7-e4e0-4083-8b4a-5e90a839a1db
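
The "keep one best row per group" policy can be prototyped in plain Python to confirm the tie-breaking order before it is expressed with `ROW_NUMBER()`. This sketch uses the column names from the Build Prompt and made-up rows; `dedup_keep_best` is a hypothetical helper and it assumes the ordering columns contain no nulls.

```python
def dedup_keep_best(rows, key_cols, order_cols):
    """Keep one row per key: the one with the largest values in `order_cols`.

    Exact ties keep the first row seen, so the result is deterministic
    for a given input order.
    """
    best = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        rank = tuple(row[c] for c in order_cols)
        if key not in best or rank > tuple(best[key][c] for c in order_cols):
            best[key] = row
    return list(best.values())


# Illustrative rows only (hypothetical values).
base = {"user_id": 1, "movie_id": 10, "event_ts": "2025-01-01T20:00", "device_type": "tv"}
rows = [
    {**base, "progress_ratio": 0.4, "minutes_watched": 30},
    {**base, "progress_ratio": 0.9, "minutes_watched": 55},
    {**base, "progress_ratio": 0.9, "minutes_watched": 60},
    {"user_id": 2, "movie_id": 11, "event_ts": "2025-01-02T09:00", "device_type": "phone",
     "progress_ratio": 0.2, "minutes_watched": 5},
]

kept = dedup_keep_best(
    rows,
    key_cols=["user_id", "movie_id", "event_ts", "device_type"],
    order_cols=["progress_ratio", "minutes_watched"],
)
print(len(rows), "->", len(kept))
```

In SQL, rows that tie on every `ORDER BY` column can be numbered in any order, so adding a final unique tiebreaker column is one way to keep the dedup fully deterministic.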



### Verification Prompt
Generate a before/after count query comparing raw vs `watch_history_dedup`.


In [106]:
%%bigquery --project $PROJECT_ID
-- Verification — Compare row counts before vs after deduplication
-- Confirms that duplicates were successfully removed.

SELECT
  'raw_watch_history' AS table_name,
  COUNT(*) AS row_count
FROM `netflix.watch_history`

UNION ALL

SELECT
  'deduped_watch_history' AS table_name,
  COUNT(*) AS row_count
FROM `netflix.watch_history_dedup`;


Executing query with job ID: 527d4c98-6668-4afd-8e81-96c245691cf5
Query executing: 0.44s


ERROR:
 404 Not found: Table original-wonder-471819-n2:netflix.watch_history_dedup was not found in location US; reason: notFound, message: Not found: Table original-wonder-471819-n2:netflix.watch_history_dedup was not found in location US

Location: US
Job ID: 527d4c98-6668-4afd-8e81-96c245691cf5



**Reflection:** Why do duplicates arise (natural vs system-generated)? How do they corrupt labels and KPIs?

### 5.3 Outliers (minutes_watched) — What & Why
Estimate extreme values via IQR; report % outliers; **winsorize** to P01/P99 for robustness while also **flagging** extremes.

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Compute IQR bounds for `minutes_watched` on `watch_history_dedup` and report % outliers.
2) Create `watch_history_robust` with `minutes_watched_capped` capped at P01/P99; return quantile summaries before/after.


In [107]:
%%bigquery --project $PROJECT_ID
-- EXAMPLE (from LLM) — IQR outlier rate
-- The %%bigquery magic does not expand shell-style ${GOOGLE_CLOUD_PROJECT}, so tables are
-- referenced as dataset.table and resolved against the --project given above.
WITH dist AS (
  SELECT
    APPROX_QUANTILES(minutes_watched, 4)[OFFSET(1)] AS q1,
    APPROX_QUANTILES(minutes_watched, 4)[OFFSET(3)] AS q3
  FROM `netflix.watch_history_dedup`
),
bounds AS (
  SELECT q1, q3, (q3-q1) AS iqr,
         q1 - 1.5*(q3-q1) AS lo,
         q3 + 1.5*(q3-q1) AS hi
  FROM dist
)
SELECT
  COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi) AS outliers,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi)/COUNT(*),2) AS pct_outliers
FROM `netflix.watch_history_dedup` h
CROSS JOIN bounds b;
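
The IQR rule itself is easy to check on a small list before trusting the SQL. The sketch below uses Python's `statistics.quantiles`, whose interpolation differs from BigQuery's approximate quantiles, so the exact bounds will not match on real data. The values are made up and `iqr_bounds` is a hypothetical helper.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lo, hi) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative minutes only (hypothetical values with one extreme session).
minutes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]
lo, hi = iqr_bounds(minutes)
outliers = [m for m in minutes if m < lo or m > hi]
pct_outliers = round(100 * len(outliers) / len(minutes), 2)
print(lo, hi, outliers, pct_outliers)
```

The lower fence can be negative even though watch time cannot, so in practice only the upper fence is likely to matter for a column like this.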



In [108]:
%%bigquery --project $PROJECT_ID
-- EXAMPLE (from LLM) — Winsorize + quantiles
-- Tables are referenced as dataset.table: the %%bigquery magic does not expand ${GOOGLE_CLOUD_PROJECT}.
-- APPROX_QUANTILES(x, 100) returns 101 boundaries, so OFFSET(1) ≈ P01 and OFFSET(99) ≈ P99.
CREATE OR REPLACE TABLE `netflix.watch_history_robust` AS
WITH q AS (
  SELECT
    APPROX_QUANTILES(minutes_watched, 100)[OFFSET(1)]  AS p01,
    APPROX_QUANTILES(minutes_watched, 100)[OFFSET(99)] AS p99
  FROM `netflix.watch_history_dedup`
)
SELECT
  h.*,
  GREATEST(q.p01, LEAST(q.p99, h.minutes_watched)) AS minutes_watched_capped
FROM `netflix.watch_history_dedup` h, q;

-- Quantiles before vs after
WITH before AS (
  SELECT 'before' AS which, APPROX_QUANTILES(minutes_watched, 5) AS q
  FROM `netflix.watch_history_dedup`
),
after AS (
  SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
  FROM `netflix.watch_history_robust`
)
SELECT * FROM before UNION ALL SELECT * FROM after;
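
Capping and flagging can be done in one pass, which keeps the information that a value was extreme even after it has been pulled in to the boundary. The sketch below uses a nearest-rank percentile as a stand-in for BigQuery's approximate quantiles. The helper names are hypothetical and the data are synthetic.

```python
def nearest_rank(sorted_vals, pct):
    """Nearest-rank percentile (integer pct in 1..100) of an already sorted list."""
    rank = -(-pct * len(sorted_vals) // 100)  # ceil(pct * n / 100) with integer math
    return sorted_vals[max(rank, 1) - 1]


def winsorize(values, lo_pct=1, hi_pct=99):
    """Cap values to [P_lo, P_hi] and flag the ones that were outside that range."""
    ordered = sorted(values)
    lo, hi = nearest_rank(ordered, lo_pct), nearest_rank(ordered, hi_pct)
    capped = [min(max(v, lo), hi) for v in values]
    flags = [v < lo or v > hi for v in values]
    return capped, flags, (lo, hi)


# Synthetic data: 1..99 minutes plus one extreme session.
minutes = list(range(1, 100)) + [10_000]
capped, flags, bounds = winsorize(minutes)
print(bounds, max(minutes), "->", max(capped), "| flagged:", sum(flags))
```

The flag is computed from the raw values, so it stays meaningful regardless of where the caps end up.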



### Verification Prompt
Generate a query that shows min/median/max before vs after capping.


**Reflection:** When might capping be harmful? Name a model type less sensitive to outliers and why.

### 5.4 Business anomaly flags — What & Why
Human-readable flags help both product decisioning and ML features (e.g., binge behavior).

### Build Prompt
Generate **three BigQuery SQL cells** (adjust if columns differ):
1) In `watch_history_robust`, compute and summarize `flag_binge` for sessions > 8 hours.
2) In `users`, compute and summarize `flag_age_extreme` if age can be parsed from `age_band` (<10 or >100).
3) In `movies`, compute and summarize `flag_duration_anomaly` where `duration_min` < 15 or > 480 (if exists).
Each cell should output count and percentage and include 1–2 comments.


In [109]:
%%bigquery --project $PROJECT_ID
-- Compute and summarize flag_binge for sessions > 8 hours (480 minutes) in watch_history_robust.
-- The table is referenced as dataset.table: the %%bigquery magic does not expand ${GOOGLE_CLOUD_PROJECT}.
SELECT
  COUNTIF(minutes_watched_capped > 480) AS sessions_over_8h,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(minutes_watched_capped > 480)/COUNT(*),2) AS pct
FROM `netflix.watch_history_robust`;



In [110]:
%%bigquery --project $PROJECT_ID
-- Compute and summarize flag_age_extreme if age is <10 or >100 in the users table.
-- The table is referenced as dataset.table: the %%bigquery magic does not expand ${GOOGLE_CLOUD_PROJECT}.
SELECT
  COUNTIF(age < 10 OR age > 100) AS extreme_age_rows,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(age < 10 OR age > 100)/COUNT(*),2) AS pct
FROM `netflix.users`;



In [111]:
%%bigquery --project $PROJECT_ID
-- Compute and summarize flag_duration_anomaly where duration_min < 15 or > 480 in the movies table.
-- The table is referenced as dataset.table: the %%bigquery magic does not expand ${GOOGLE_CLOUD_PROJECT}.
SELECT
  COUNTIF(duration_min < 15) AS titles_under_15m,
  COUNTIF(duration_min > 480) AS titles_over_8h,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(duration_min < 15 OR duration_min > 480)/COUNT(*), 2) AS pct_anomaly
FROM `netflix.movies`;
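
The three flags share one shape: a threshold predicate and a percentage of rows. The sketch below applies the thresholds from the Build Prompt to made-up values and returns `(flag_name, pct_of_rows)` pairs. `pct_of_rows` is a hypothetical helper, and nulls are counted in the denominator but never flagged, mirroring `COUNTIF(...) / COUNT(*)`.

```python
def pct_of_rows(values, predicate):
    """Share of rows (in %) where `predicate` holds; None is never flagged."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if v is not None and predicate(v))
    return round(100 * hits / len(values), 2)


# Illustrative values only (hypothetical, not taken from the dataset).
minutes_watched = [30, 500, 45, 600]
ages = [25, None, 7, 104, 40]
durations = [90, 10, 120, 95, 500, 100, 88, 110]

summary = [
    ("flag_binge", pct_of_rows(minutes_watched, lambda m: m > 480)),
    ("flag_age_extreme", pct_of_rows(ages, lambda a: a < 10 or a > 100)),
    ("flag_duration_anomaly", pct_of_rows(durations, lambda d: d < 15 or d > 480)),
]
for flag_name, pct in summary:
    print(f"{flag_name}: {pct}%")
```

If the binge flag is computed on a capped column, any cap below 480 minutes would hide every binge session, so the raw `minutes_watched` column may be the safer input for that flag.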



### Verification Prompt
Generate a single compact summary query that returns two columns per flag: `flag_name, pct_of_rows`.


**Reflection:** Which anomaly flag is most common? Which would you keep as a feature and why?

## 6) Save & submit — What & Why
Reproducibility: save artifacts and document decisions so others can rerun and audit.

### Build Prompt
Generate a checklist (Markdown) students can paste at the end:
- Save this notebook to the team Drive.
- Export a `.sql` file with your DQ queries and save to repo.
- Push notebook + SQL to the **team GitHub** with a descriptive commit.
- Add a README with your `PROJECT_ID`, `REGION`, bucket, dataset, and today’s row counts.


## Grading rubric (quick)
- Profiling completeness (30)  
- Cleaning policy correctness & reproducibility (40)  
- Reflection/insight (20)  
- Hygiene (naming, verification, idempotence) (10)
