# MGMT 467 — Prompt-Driven Lab (with Commented Examples)
## Kaggle ➜ Google Cloud Storage ➜ BigQuery ➜ Data Quality (DQ)

**How to use this notebook**
- Each section gives you a **Build Prompt** to paste into Gemini/Vertex AI (or Gemini in Colab).
- Below each prompt, you’ll see a **commented example** of what a good LLM answer might look like.
- **Do not** just uncomment and run. Use the prompt to generate your own code, then compare to the example.
- After every step, run the **Verification Prompt**, and write the **Reflection** in Markdown.

> Goal today: Download the Netflix dataset (Kaggle) → Stage on GCS → Load into BigQuery → Run DQ profiling (missingness, duplicates, outliers, anomaly flags).


### Academic integrity & LLM usage
- Use the prompts here to generate your own code cells.
- Read concept notes and write the reflection answers in your own words.
- Keep credentials out of code. Upload `kaggle.json` when asked.


## Learning objectives
1) Explain **why** we stage data in GCS and load it to BigQuery.  
2) Build an **idempotent**, auditable pipeline.  
3) Diagnose **missingness**, **duplicates**, and **outliers** and justify cleaning choices.  
4) Connect DQ decisions to **business/ML impact**.


## 0) Environment setup — What & Why
Authenticate Colab to Google Cloud so we can use `gcloud`, GCS, and BigQuery. Set **PROJECT_ID** and **REGION** once for consistency (cost/latency).

### Build Prompt (paste to LLM)
You are my cloud TA. Generate a single **Colab code cell** that:
1) Authenticates to Google Cloud in Colab,  
2) Prompts for `PROJECT_ID` via `input()` and sets `REGION="us-central1"` (editable),  
3) Exports `GOOGLE_CLOUD_PROJECT`,  
4) Runs `gcloud config set project $GOOGLE_CLOUD_PROJECT`,  
5) Prints both values. Add 2–3 comments explaining what/why.
End with a comment: `# Done: Auth + Project/Region set`.


In [None]:
# Authenticate to Google Cloud in Colab
from google.colab import auth
auth.authenticate_user()

# Prompt for PROJECT_ID and set REGION
import os
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"  # keep consistent; change if instructed

# Export GOOGLE_CLOUD_PROJECT and REGION environment variables
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
os.environ["REGION"] = REGION # Export REGION as an environment variable

# Set active project for gcloud/BigQuery CLI
!gcloud config set project $GOOGLE_CLOUD_PROJECT

# Print the values
print("Project:", PROJECT_ID, "| Region:", REGION)

# Done: Auth + Project/Region set

Enter your GCP Project ID: heroic-trilogy-471119-k8
Updated property [core/project].
Project: heroic-trilogy-471119-k8 | Region: us-central1


In [None]:
# # EXAMPLE (from LLM) — Auth + Project/Region (commented; write your own cell using the prompt)
# # from google.colab import auth
# # auth.authenticate_user()
# #
# # import os
# # PROJECT_ID = input("Enter your GCP Project ID: ").strip()
# # REGION = "us-central1"  # keep consistent; change if instructed
# # os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
# # print("Project:", PROJECT_ID, "| Region:", REGION)
# #
# # # Set active project for gcloud/BigQuery CLI
# # !gcloud config set project $GOOGLE_CLOUD_PROJECT
# # !gcloud config get-value project
# # # Done: Auth + Project/Region set

### Verification Prompt
Generate a short cell that prints the active project using `gcloud config get-value project` and echoes the `REGION` you set.


In [None]:
# Verify the active project and region
# This confirms that the environment variables and gcloud configuration are set correctly.
import os
gcloud_project = !gcloud config get-value project
print("Active gcloud project:", gcloud_project[0])
print("REGION:", os.environ.get("REGION", "REGION environment variable not set"))

Active gcloud project: heroic-trilogy-471119-k8
REGION: us-central1


**Reflection:** Why do we set `PROJECT_ID` and `REGION` at the top? What can go wrong if we don’t?

We set `PROJECT_ID` and `REGION` at the top for consistency, reproducibility, and cost and resource management. Every later command then targets the same project and region, and anyone running the notebook can see at a glance which ones are in use. It also helps keep operational cost within Google Cloud under control.

If we don't set `PROJECT_ID` and `REGION` at the top, several things can go wrong. Commands may fail because they do not know which project to operate in, or datasets and buckets may be created in an unexpected or default project, which makes them hard to manage. Costs could then be incurred in the wrong project or region, and reproducibility is compromised, which leads to further errors.
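One way to surface these failures early is to check the settings before any cloud command runs. The sketch below is illustrative only: `require_env` and `load_settings` are hypothetical helpers (not part of Colab or the Google Cloud SDK), and it assumes the setup cell exported `GOOGLE_CLOUD_PROJECT` and `REGION` as environment variables.

```python
import os


def require_env(name: str) -> str:
    """Return the named environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; run the environment setup cell first.")
    return value


def load_settings() -> dict:
    """Collect the project and region once so later cells share the same values."""
    return {
        "project_id": require_env("GOOGLE_CLOUD_PROJECT"),
        "region": require_env("REGION"),
    }
```

Calling `load_settings()` at the top of a later cell raises a clear error instead of letting a command run against an unexpected project or region.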

## 1) Kaggle API — What & Why
Use Kaggle CLI for reproducible downloads. Store `kaggle.json` at `~/.kaggle/kaggle.json` with `0600` permissions to protect secrets.

### Build Prompt
Generate a **single Colab code cell** that:
- Prompts me to upload `kaggle.json`,
- Saves to `~/.kaggle/kaggle.json` with `0600` permissions,
- Prints `kaggle --version`.
Add comments about security and reproducibility.


In [None]:
# Prompt to upload kaggle.json for Kaggle API authentication
# This file contains your API credentials and should be kept secure.
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

# Save kaggle.json to the correct directory with secure permissions
# This ensures only the owner can read and write the file, protecting your credentials.
import os
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])
os.chmod('/root/.kaggle/kaggle.json', 0o600)  # Set owner-only permissions

# Verify the Kaggle CLI is installed and ready to use
# This confirms the setup was successful and you can proceed with Kaggle commands.
!kaggle --version

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle.json to kaggle.json
Kaggle API 1.7.4.5


In [None]:
# # EXAMPLE (from LLM) — Kaggle setup (commented)
# # from google.colab import files
# # print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
# # uploaded = files.upload()
# #
# # import os
# # os.makedirs('/root/.kaggle', exist_ok=True)
# # with open('/root/.kaggle/kaggle.json', 'wb') as f:
# #     f.write(uploaded[list(uploaded.keys())[0]])
# # os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only
# #
# # !kaggle --version

### Verification Prompt
Generate a one-liner that runs `kaggle --help | head -n 20` to show the CLI is ready.


In [None]:
# Verify the Kaggle CLI is ready by showing the first 20 lines of the help output
!kaggle --help | head -n 20

usage: kaggle [-h] [-v] [-W]
              {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
              ...

options:
  -h, --help            show this help message and exit
  -v, --version         Print the Kaggle API version

commands:
  {competitions,c,datasets,d,kernels,k,models,m,files,f,config}
                        Use one of:
                        competitions {list, files, download, submit, submissions, leaderboard}
                        datasets {list, files, download, create, version, init, metadata, status}
                        kernels {list, files, init, push, pull, output, status}
                        models {instances, get, list, init, create, delete, update}
                        models instances {versions, get, files, init, create, delete, update}
                        models instances versions {init, create, download, delete, files}
                        config {view, set, unset}
    competitions (c)    Commands related to Kaggle compe

**Reflection:** Why require strict `0600` permissions on API tokens? What risks are we avoiding?

Requiring strict `0600` permissions on API tokens such as `kaggle.json` ensures that only the owner of the file can read or write it, so other users or processes on the same machine cannot read the token. This is a crucial security measure against unauthorized use of your credentials. The risks we are avoiding are unauthorized access to the user's Kaggle account, data breaches, and malicious activity carried out under that account.
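The permission requirement can also be checked in code. The following is a minimal sketch for POSIX systems such as the Colab runtime; `is_owner_only` is a hypothetical helper name, and the path you pass in is up to you.

```python
import os
import stat


def is_owner_only(path: str) -> bool:
    """True if group and others have no permissions on the file (e.g. mode 0600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    group_or_other_bits = stat.S_IRWXG | stat.S_IRWXO
    return (mode & group_or_other_bits) == 0
```

For example, `is_owner_only('/root/.kaggle/kaggle.json')` should be true after the `os.chmod(..., 0o600)` call in the setup cell.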

## 2) Download & unzip dataset — What & Why
Keep raw files under `/content/data/raw` for predictable paths and auditing.
**Dataset:** `sayeeduddin/netflix-2025user-behavior-dataset-210k-records`

### Build Prompt
Generate a **Colab code cell** that:
- Creates `/content/data/raw`,
- Downloads the dataset to `/content/data` with Kaggle CLI,
- Unzips into `/content/data/raw` (overwrite OK),
- Lists all CSVs with sizes in a neat table.
Include comments describing each step.


In [None]:
# Create directory for raw data
# This ensures a clean and predictable location for the downloaded files.
!mkdir -p /content/data/raw

# Download the dataset from Kaggle to /content/data
# Using the Kaggle CLI allows for reproducible downloads of datasets.
!kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data

# Unzip the downloaded dataset into the raw data directory
# The -o flag allows overwriting if the file already exists.
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw data directory with their sizes
# This provides a quick inventory of the downloaded files.
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/sayeeduddin/netflix-2025user-behavior-dataset-210k-records
License(s): CC0-1.0
Downloading netflix-2025user-behavior-dataset-210k-records.zip to /content/data
  0% 0.00/4.02M [00:00<?, ?B/s]
100% 4.02M/4.02M [00:00<00:00, 526MB/s]
Archive:  /content/data/netflix-2025user-behavior-dataset-210k-records.zip
  inflating: /content/data/raw/README.md  
  inflating: /content/data/raw/movies.csv  
  inflating: /content/data/raw/recommendation_logs.csv  
  inflating: /content/data/raw/reviews.csv  
  inflating: /content/data/raw/search_logs.csv  
  inflating: /content/data/raw/users.csv  
  inflating: /content/data/raw/watch_history.csv  
-rw-r--r-- 1 root root 114K Aug  2 19:36 /content/data/raw/movies.csv
-rw-r--r-- 1 root root 4.5M Aug  2 19:36 /content/data/raw/recommendation_logs.csv
-rw-r--r-- 1 root root 1.8M Aug  2 19:36 /content/data/raw/reviews.csv
-rw-r--r-- 1 root root 2.2M Aug  2 19:36 /content/data/raw/search_logs.csv
-rw-r--r-- 1 root 

In [None]:
# # EXAMPLE (from LLM) — Download & unzip (commented)
# # !mkdir -p /content/data/raw
# # !kaggle datasets download -d sayeeduddin/netflix-2025user-behavior-dataset-210k-records -p /content/data
# # !unzip -o /content/data/*.zip -d /content/data/raw
# # # List CSV inventory
# # !ls -lh /content/data/raw/*.csv

### Verification Prompt
Generate a snippet that asserts there are exactly **six** CSV files and prints their names.


In [None]:
# Verify there are exactly six CSV files and print their names
import glob
csv_files = glob.glob('/content/data/raw/*.csv')
num_csv_files = len(csv_files)

print(f"Found {num_csv_files} CSV files.")

if num_csv_files == 6:
    print("Verification successful: Exactly 6 CSV files found.")
    print("CSV files:")
    for csv_file in csv_files:
        print(csv_file)
else:
    print("Verification failed: Expected 6 CSV files, but found", num_csv_files)

Found 6 CSV files.
Verification successful: Exactly 6 CSV files found.
CSV files:
/content/data/raw/recommendation_logs.csv
/content/data/raw/watch_history.csv
/content/data/raw/search_logs.csv
/content/data/raw/users.csv
/content/data/raw/movies.csv
/content/data/raw/reviews.csv


**Reflection:** Why is keeping a clean file inventory (names, sizes) useful downstream?

A clean file inventory gives transparency and control over the data inputs, which reliable and maintainable data pipelines depend on. It supports auditing and reproducibility, troubleshooting, data integrity checks (for example, spotting a truncated download by its size), documentation, and automation that takes the file list as input.
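An inventory can be captured as data rather than as terminal output. The sketch below is one possible approach, not the lab's required solution: `build_inventory` is a hypothetical helper, and the SHA-256 checksum is an addition beyond the names and sizes the lab asks for. It reads each file fully into memory, which is fine for files of this size.

```python
import hashlib
from pathlib import Path


def build_inventory(directory: str, pattern: str = "*.csv") -> list[dict]:
    """Return one record per matching file: name, size in bytes, and SHA-256."""
    records = []
    for path in sorted(Path(directory).glob(pattern)):
        data = path.read_bytes()
        records.append(
            {
                "name": path.name,
                "bytes": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
        )
    return records
```

Saving the result of `build_inventory('/content/data/raw')` next to the data makes it possible to confirm later that a rerun downloaded the same files.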



## 3) Create GCS bucket & upload — What & Why
Stage in GCS → consistent, versionable source for BigQuery loads. Bucket names must be **globally unique**.

### Build Prompt
Generate a **Colab code cell** that:
- Creates a unique bucket in `${REGION}` (random suffix),
- Saves name to `BUCKET_NAME` env var,
- Uploads all CSVs to `gs://$BUCKET_NAME/netflix/`,
- Prints the bucket name and explains staging benefits.


In [None]:
# Create a unique bucket name with a random suffix
import uuid
import os

# Ensure REGION is set in the Python environment
REGION = os.environ.get("REGION", "us-central1") # Get from env or default if not set
os.environ["REGION"] = REGION # Ensure it's set for subsequent Python calls

bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"

# Save the bucket name to an environment variable
os.environ["BUCKET_NAME"] = bucket_name

# Create the GCS bucket in the specified region
# The --location flag ensures the bucket is created in the desired region.
# Bucket names are globally unique, so creation fails if the name is already taken (unlikely with a uuid suffix).
print(f"Attempting to create bucket {bucket_name} in region {os.environ['REGION']}")
!gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION

# Upload all CSV files from the raw data directory to the bucket under a 'netflix/' prefix
# Staging data in GCS provides a durable, versionable, and accessible source for cloud services like BigQuery.
print(f"Uploading files to gs://{bucket_name}/netflix/")
!gcloud storage cp /content/data/raw/*.csv gs://$BUCKET_NAME/netflix/

# Print the created bucket name
print("\nCreated GCS bucket:", bucket_name)
print("\nBenefits of staging data in GCS:")
print("- **Durability:** Data is stored redundantly across multiple devices and locations.")
print("- **Accessibility:** Data can be easily accessed by various Google Cloud services (BigQuery, Dataflow, AI Platform, etc.).")
print("- **Versionability:** GCS supports object versioning, allowing you to retrieve previous versions of your data.")
print("- **Scalability:** GCS can handle virtually unlimited amounts of data.")

Attempting to create bucket mgmt467-netflix-10928894 in region us-central1
Creating gs://mgmt467-netflix-10928894/...
Uploading files to gs://mgmt467-netflix-10928894/netflix/
Copying file:///content/data/raw/movies.csv to gs://mgmt467-netflix-10928894/netflix/movies.csv
Copying file:///content/data/raw/recommendation_logs.csv to gs://mgmt467-netflix-10928894/netflix/recommendation_logs.csv
Copying file:///content/data/raw/reviews.csv to gs://mgmt467-netflix-10928894/netflix/reviews.csv
Copying file:///content/data/raw/search_logs.csv to gs://mgmt467-netflix-10928894/netflix/search_logs.csv
Copying file:///content/data/raw/users.csv to gs://mgmt467-netflix-10928894/netflix/users.csv
Copying file:///content/data/raw/watch_history.csv to gs://mgmt467-netflix-10928894/netflix/watch_history.csv

Average throughput: 63.3MiB/s

Created GCS bucket: mgmt467-netflix-10928894

Benefits of staging data in GCS:
- **Durability:** Data is stored redundantly across multiple devices and locations.
- *

In [None]:
# # EXAMPLE (from LLM) — GCS staging (commented)
# # import uuid, os
# # bucket_name = f"mgmt467-netflix-{uuid.uuid4().hex[:8]}"
# # os.environ["BUCKET_NAME"] = bucket_name
# # !gcloud storage buckets create gs://$BUCKET_NAME --location=$REGION
# # !gcloud storage cp /content/data/raw/* gs://$BUCKET_NAME/netflix/
# # print("Bucket:", bucket_name)
# # # Verify contents
# # !gcloud storage ls gs://$BUCKET_NAME/netflix/

### Verification Prompt
Generate a snippet that lists the `netflix/` prefix and shows object sizes.


In [None]:
# List objects in the bucket under the 'netflix/' prefix and show details (including size)
import os
bucket_name = os.environ.get("BUCKET_NAME")
if bucket_name:
  !gcloud storage ls -l gs://$BUCKET_NAME/netflix/
else:
  print("BUCKET_NAME environment variable is not set.")

    115942  2025-10-21T19:50:28Z  gs://mgmt467-netflix-10928894/netflix/movies.csv
   4695557  2025-10-21T19:50:28Z  gs://mgmt467-netflix-10928894/netflix/recommendation_logs.csv
   1861942  2025-10-21T19:50:28Z  gs://mgmt467-netflix-10928894/netflix/reviews.csv
   2250902  2025-10-21T19:50:28Z  gs://mgmt467-netflix-10928894/netflix/search_logs.csv
   1606820  2025-10-21T19:50:28Z  gs://mgmt467-netflix-10928894/netflix/users.csv
   9269425  2025-10-21T19:50:28Z  gs://mgmt467-netflix-10928894/netflix/watch_history.csv
TOTAL: 6 objects, 19800588 bytes (18.88MiB)


**Reflection:** Name two benefits of staging in GCS vs loading directly from local Colab.

Two benefits of staging in GCS rather than loading directly from local Colab are (1) scalability and accessibility, since other Google Cloud services such as BigQuery can read the files directly, and (2) durability and reliability, since the files outlive the temporary Colab runtime. In short, GCS provides a more robust, scalable, and integrated platform for managing and processing data within the Google Cloud ecosystem.
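Because bucket names must be globally unique, it can help to generate and sanity-check a name before calling `gcloud`. This is a simplified sketch: the function names are hypothetical, and the pattern covers only the basic naming rules (3 to 63 characters; lowercase letters, digits, hyphens, and underscores; starting and ending with a letter or digit). It ignores dotted names and reserved words, and it cannot tell whether a name is already taken.

```python
import re
import uuid

# Simplified pattern: total length 3-63, restricted character set,
# alphanumeric first and last character.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix: str) -> str:
    """Append a short random suffix so reruns are unlikely to collide."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


def looks_like_valid_bucket_name(name: str) -> bool:
    """Check the basic naming rules only; availability is not checked."""
    return bool(_BUCKET_NAME_RE.fullmatch(name))
```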

## 4) BigQuery dataset & loads — What & Why
Create dataset `netflix` and load six CSVs with **autodetect** for speed (we’ll enforce schemas later).

### Build Prompt (two cells)
**Cell A:** Create (idempotently) dataset `netflix` in US multi-region; if it exists, print a friendly message.  
**Cell B:** Load tables from `gs://$BUCKET_NAME/netflix/`:
`users, movies, watch_history, recommendation_logs, search_logs, reviews`
with `--skip_leading_rows=1 --autodetect --source_format=CSV`.
Finish with row-count queries for each table.


In [None]:
# Cell A: Create BigQuery dataset (idempotent)
DATASET="netflix"
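# Attempt to create; if the dataset already exists, print a friendly message instead.
# {DATASET} is used for interpolation because IPython does not expand $DATASET inside single quotes.
!bq --location=US mk -d --description "MGMT467 Netflix dataset" {DATASET} 2> /dev/null || echo "BigQuery dataset '{DATASET}' may already exist."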

BigQuery error in mk operation: Dataset 'heroic-trilogy-471119-k8:netflix'
already exists.
BigQuery dataset '' may already exist.


In [None]:
# Cell B: Load tables from GCS and get row counts
tables = {
  "users": "users.csv",
  "movies": "movies.csv",
  "watch_history": "watch_history.csv",
  "recommendation_logs": "recommendation_logs.csv",
  "search_logs": "search_logs.csv",
  "reviews": "reviews.csv",
}

import os
from google.cloud import bigquery
import pandas as pd

# Retrieve environment variables set in previous cells
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
bucket_name = os.environ.get("BUCKET_NAME")
dataset = "netflix"

if not all([project_id, bucket_name]):
    print("Error: GOOGLE_CLOUD_PROJECT or BUCKET_NAME environment variable is not set. Please run the setup cells first.")
else:
    print("--- Loading Tables from GCS to BigQuery ---")
    for table_name, file_name in tables.items():
        gcs_uri = f"gs://{bucket_name}/netflix/{file_name}"
        print(f"Loading {dataset}.{table_name} from {gcs_uri}...")
        # Load data using bq command-line tool.
        # --replace makes the job idempotent. If the table exists, it's replaced.
        !bq load --location=US --skip_leading_rows=1 --autodetect --source_format=CSV --replace {dataset}.{table_name} {gcs_uri}

    # Verify the row counts for each table using the Python client
    print("\n--- Verifying Row Counts ---")
    client = bigquery.Client(project=project_id)

    # A single query to get all counts is more efficient
    union_all_query = " UNION ALL ".join([
        f"SELECT '{table_name}' AS table_name, COUNT(*) AS row_count FROM `{project_id}.{dataset}.{table_name}`"
        for table_name in tables.keys()
    ])

    try:
        # client.query returns a job, .to_dataframe() waits for completion and returns a DataFrame
        df = client.query(union_all_query).to_dataframe()
        print("Row counts verification successful:")
        display(df)
    except Exception as e:
        print(f"An error occurred during row count verification: {e}")

--- Loading Tables from GCS to BigQuery ---
Loading netflix.users from gs://mgmt467-netflix-10928894/netflix/users.csv...
Waiting on bqjob_r401fa373ae91955d_0000019a085c7a20_1 ... (1s) Current status: DONE   
Loading netflix.movies from gs://mgmt467-netflix-10928894/netflix/movies.csv...
Waiting on bqjob_r7ffbcde8d008984_0000019a085c8cbf_1 ... (2s) Current status: DONE   
Loading netflix.watch_history from gs://mgmt467-netflix-10928894/netflix/watch_history.csv...
Waiting on bqjob_r41c182489c408372_0000019a085ca382_1 ... (3s) Current status: DONE   
Loading netflix.recommendation_logs from gs://mgmt467-netflix-10928894/netflix/recommendation_logs.csv...
Waiting on bqjob_r293b38ef558e4f2a_0000019a085cbf00_1 ... (3s) Current status: DONE   
Loading netflix.search_logs from gs://mgmt467-netflix-10928894/netflix/search_logs.csv...
Waiting on bqjob_r7822d1db33284005_0000019a085cda91_1 ... (2s) Current status: DONE   
Loading netflix.reviews from gs://mgmt467-netflix-10928894/netflix/reviews

|   | table_name | row_count |
| --- | --- | --- |
| 0 | users | 10300 |
| 1 | search_logs | 26500 |
| 2 | reviews | 15450 |
| 3 | movies | 1040 |
| 4 | watch_history | 105000 |
| 5 | recommendation_logs | 52000 |


In [None]:
# # EXAMPLE (from LLM) — BigQuery dataset (commented)
# # DATASET="netflix"
# # # Attempt to create; ignore if exists
# # !bq --location=US mk -d --description "MGMT467 Netflix dataset" $DATASET || echo "Dataset may already exist."

In [None]:
# # EXAMPLE (from LLM) — Load tables (commented)
# # tables = {
# #   "users": "users.csv",
# #   "movies": "movies.csv",
# #   "watch_history": "watch_history.csv",
# #   "recommendation_logs": "recommendation_logs.csv",
# #   "search_logs": "search_logs.csv",
# #   "reviews": "reviews.csv",
# # }
# # import os
# # for tbl, fname in tables.items():
# #   src = f"gs://{os.environ['BUCKET_NAME']}/netflix/{fname}"
# #   print("Loading", tbl, "from", src)
# #   !bq load --replace --skip_leading_rows=1 --autodetect --source_format=CSV {DATASET}.{tbl} {src}
# #
# # # Row counts
# # for tbl in tables.keys():
# #   !bq query --nouse_legacy_sql 'SELECT "{tbl}" AS table_name, COUNT(*) AS n FROM netflix.{tbl}'

### Verification Prompt
Generate a single query that returns `table_name, row_count` for all six tables in `${GOOGLE_CLOUD_PROJECT}.netflix`.


In [None]:
import os
from google.cloud import bigquery
import pandas as pd

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
dataset = "netflix"

if not project_id:
    print("Error: GOOGLE_CLOUD_PROJECT environment variable is not set.")
else:
    # A single query to get all counts is more efficient
    union_all_query = f"""
    SELECT 'users' AS table_name, COUNT(*) AS row_count FROM `{project_id}.{dataset}.users`
    UNION ALL
    SELECT 'movies' AS table_name, COUNT(*) AS row_count FROM `{project_id}.{dataset}.movies`
    UNION ALL
    SELECT 'watch_history' AS table_name, COUNT(*) AS row_count FROM `{project_id}.{dataset}.watch_history`
    UNION ALL
    SELECT 'recommendation_logs' AS table_name, COUNT(*) AS row_count FROM `{project_id}.{dataset}.recommendation_logs`
    UNION ALL
    SELECT 'search_logs' AS table_name, COUNT(*) AS row_count FROM `{project_id}.{dataset}.search_logs`
    UNION ALL
    SELECT 'reviews' AS table_name, COUNT(*) AS row_count FROM `{project_id}.{dataset}.reviews`
    """

    try:
        client = bigquery.Client(project=project_id)
        df = client.query(union_all_query).to_dataframe()
        print("Row counts for all tables:")
        display(df)
    except Exception as e:
        print(f"An error occurred: {e}")

Row counts for all tables:


|   | table_name | row_count |
| --- | --- | --- |
| 0 | watch_history | 105000 |
| 1 | recommendation_logs | 52000 |
| 2 | search_logs | 26500 |
| 3 | users | 10300 |
| 4 | movies | 1040 |
| 5 | reviews | 15450 |


**Reflection:** When is `autodetect` acceptable? When should you enforce explicit schemas and why?

`autodetect` is acceptable for initial exploration and quick loads, for simple and consistent data, and for prototyping and development. Explicit schemas should be enforced in production pipelines, for complex or inconsistent data, and wherever data validation, governance, performance, cost, or data type integrity matter. Enforcing an explicit schema provides control, predictability, and data integrity, which makes it a critical part of building robust and reliable data pipelines.
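To make the contrast concrete, an explicit schema can be written as a JSON file of field definitions. The sketch below is hypothetical and deliberately partial: the column names `country`, `subscription_plan`, and `age` appear in this notebook's queries, but `user_id` being present in `users`, every type, and every mode shown here are assumptions that would need to be checked against the real CSV. The helper names are also made up for illustration.

```python
import json

# Assumed, incomplete schema for illustration only.
USERS_SCHEMA = [
    {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "country", "type": "STRING", "mode": "NULLABLE"},
    {"name": "subscription_plan", "type": "STRING", "mode": "NULLABLE"},
    {"name": "age", "type": "FLOAT", "mode": "NULLABLE"},
]

# A subset of BigQuery column types, enough for this sketch.
ALLOWED_TYPES = {"STRING", "INTEGER", "FLOAT", "BOOLEAN", "DATE", "TIMESTAMP"}


def validate_schema(schema: list[dict]) -> list[dict]:
    """Reject duplicate column names and types outside the allowed subset."""
    names = [field["name"] for field in schema]
    if len(names) != len(set(names)):
        raise ValueError("duplicate column names in schema")
    for field in schema:
        if field["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type for {field['name']}: {field['type']}")
    return schema


def write_schema_file(schema: list[dict], path: str) -> str:
    """Write the validated schema as JSON and return the path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(validate_schema(schema), handle, indent=2)
    return path
```

A complete version of such a file can be supplied to `bq load` as the table schema in place of `--autodetect`, so a type change in the source data fails the load instead of silently changing the table.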

## 5) Data Quality (DQ) — Concepts we care about
- **Missingness** (MCAR/MAR/MNAR). Impute vs drop. Add `is_missing_*` indicators.
- **Duplicates** (exact vs near). Double-counted engagement corrupts labels & KPIs.
- **Outliers** (IQR). Winsorize/cap vs robust models. Always **flag** and explain.
- **Reproducibility**. Prefer `CREATE OR REPLACE` and deterministic keys.
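The IQR rule and the "flag, then cap" idea above can be sketched in a few lines of plain Python. This is an illustration under assumptions: the helper names are hypothetical, the multiplier `k=1.5` is the conventional default rather than something the lab prescribes, and `statistics.quantiles` uses its default quartile method, which may give slightly different cut points than a SQL engine.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_and_cap(values, k=1.5):
    """Keep the raw value, add an outlier flag, and add a capped (winsorized) copy."""
    low, high = iqr_bounds(values, k)
    return [
        {
            "value": v,
            "is_outlier": not (low <= v <= high),
            "capped": min(max(v, low), high),
        }
        for v in values
    ]
```

Keeping the raw value, the flag, and the capped value side by side means the cleaning choice stays visible and reversible downstream.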


### 5.1 Missingness (users) — What & Why
Measure % missing and check if missingness depends on another variable (MAR) → potential bias & instability.

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Total rows and % missing in `region`, `plan_tier`, `age_band` from `users`.
2) `% plan_tier missing by region` ordered descending. Add comments on MAR.


In [None]:
import os
from google.cloud import bigquery

project_id = os.environ['GOOGLE_CLOUD_PROJECT']

client = bigquery.Client(project=project_id)

query = f"""
-- Users: % missing per column
WITH base AS (
  SELECT COUNT(*) n,
         COUNTIF(country IS NULL) miss_country,
         COUNTIF(subscription_plan IS NULL) miss_plan,
         COUNTIF(age IS NULL) miss_age
  FROM `{project_id}.netflix.users`
)
SELECT n,
       ROUND(100*miss_country/n,2) AS pct_missing_country,
       ROUND(100*miss_plan/n,2)   AS pct_missing_subscription_plan,
       ROUND(100*miss_age/n,2)    AS pct_missing_age
FROM base;
"""

query_job = client.query(query)
results = query_job.result()

# Print the results
for row in results:
    print(row)

Row((10300, 0.0, 0.0, 11.93), {'n': 0, 'pct_missing_country': 1, 'pct_missing_subscription_plan': 2, 'pct_missing_age': 3})


In [None]:
import os
import pandas as pd
from google.cloud import bigquery

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if not project_id:
    print("Error: GOOGLE_CLOUD_PROJECT environment variable is not set.")
else:
    # This query calculates the percentage of missing 'subscription_plan' values for each country.
    # It helps identify if missingness is related to the country (Missing At Random - MAR),
    # which could introduce bias if not handled appropriately.
    query = f"""
    SELECT
        country,
        COUNT(*) AS n,
        ROUND(100 * COUNTIF(subscription_plan IS NULL) / COUNT(*), 2) AS pct_missing_subscription_plan
    FROM
        `{project_id}.netflix.users`
    GROUP BY
        country
    ORDER BY
        pct_missing_subscription_plan DESC;
    """

    print("Running query for % subscription_plan missing by country:")
    try:
        client = bigquery.Client(project=project_id)
        missing_plan_by_country_df = client.query(query).to_dataframe()
        display(missing_plan_by_country_df)
    except Exception as e:
        print(f"An error occurred: {e}")

Running query for % subscription_plan missing by country:


|   | country | n | pct_missing_subscription_plan |
| --- | --- | --- | --- |
| 0 | Canada | 3096 | 0.0 |
| 1 | USA | 7204 | 0.0 |


In [None]:
# # EXAMPLE (from LLM) — Missingness profile (commented)
# # -- Users: % missing per column
# # WITH base AS (
# #   SELECT COUNT(*) n,
# #          COUNTIF(region IS NULL) miss_region,
# #          COUNTIF(plan_tier IS NULL) miss_plan,
# #          COUNTIF(age_band IS NULL) miss_age
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`
# # )
# # SELECT n,
# #        ROUND(100*miss_region/n,2) AS pct_missing_region,
# #        ROUND(100*miss_plan/n,2)   AS pct_missing_plan_tier,
# #        ROUND(100*miss_age/n,2)    AS pct_missing_age_band
# # FROM base;

In [None]:
# # EXAMPLE (from LLM) — MAR by region (commented)
# # SELECT region,
# #        COUNT(*) AS n,
# #        ROUND(100*COUNTIF(plan_tier IS NULL)/COUNT(*),2) AS pct_missing_plan_tier
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`
# # GROUP BY region
# # ORDER BY pct_missing_plan_tier DESC;

### Verification Prompt
Generate a query that prints the three missingness percentages from (1), rounded to two decimals.


In [28]:
# Verification: Print the three missingness percentages from the users table.
import os
from google.cloud import bigquery
import pandas as pd

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")

if not project_id:
    print("Error: GOOGLE_CLOUD_PROJECT environment variable is not set.")
else:
    client = bigquery.Client(project=project_id)

    # This query calculates the percentage of missing values for 'country', 'subscription_plan', and 'age'.
    query = f"""
    WITH base AS (
      SELECT
        COUNT(*) AS n,
        COUNTIF(country IS NULL) AS miss_country,
        COUNTIF(subscription_plan IS NULL) AS miss_plan,
        COUNTIF(age IS NULL) AS miss_age
      FROM
        `{project_id}.netflix.users`
    )
    SELECT
      ROUND(100 * miss_country / n, 2) AS pct_missing_country,
      ROUND(100 * miss_plan / n, 2) AS pct_missing_subscription_plan,
      ROUND(100 * miss_age / n, 2) AS pct_missing_age
    FROM
      base;
    """

    print("Running query to verify missingness percentages:")
    try:
        missingness_df = client.query(query).to_dataframe()
        display(missingness_df)
    except Exception as e:
        print(f"An error occurred: {e}")

Running query to verify missingness percentages:


|   | pct_missing_country | pct_missing_subscription_plan | pct_missing_age |
| --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 11.93 |


**Reflection:** Which columns are most missing? Hypothesize MCAR/MAR/MNAR and why.

The 'age' column is the only one with missing data, at a rate of 11.93%. The 'country' and 'subscription_plan' columns have no missing values.

Here are the hypotheses regarding the missingness in the age column:

1. Missing Completely At Random (MCAR): This is unlikely but possible, suggesting the data is missing due to a completely random event, like a system glitch, unrelated to any other data.

2. Missing At Random (MAR): This is a strong possibility. The likelihood of age being missing could depend on another observed variable. For example, users from a specific country or with a particular subscription plan might be less inclined to provide their age. This could be tested by analyzing the missingness rate across different segments.

3. Missing Not At Random (MNAR): This is also plausible. The missingness could be directly related to the user's age. For instance, very young or older users might be more reluctant to disclose their age due to privacy concerns.
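The MAR hypothesis can be probed by comparing the missingness rate of `age` across segments. The sketch below uses plain Python on a list of dictionaries; the helper names and the toy rows in the usage note are hypothetical, and a large gap between segments would only suggest MAR, not prove it.

```python
from collections import defaultdict


def missing_rate_by_group(rows, group_key, target_key):
    """Percent of rows per group where target_key is None, rounded to 2 decimals."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row.get(group_key)
        totals[group] += 1
        if row.get(target_key) is None:
            missing[group] += 1
    return {g: round(100 * missing[g] / totals[g], 2) for g in totals}


def add_missing_indicator(rows, target_key):
    """Return copies of the rows with an is_missing_<column> flag added."""
    flag = f"is_missing_{target_key}"
    return [{**row, flag: row.get(target_key) is None} for row in rows]
```

For example, with made-up rows such as `{"subscription_plan": "Basic", "age": None}`, calling `missing_rate_by_group(rows, "subscription_plan", "age")` returns one percentage per plan, which can be compared against the overall 11.93% rate.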

### 5.2 Duplicates (watch_history) — What & Why
Find exact duplicate interaction records and keep **one best** per group (deterministic policy).
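A deterministic "keep one best" policy can be illustrated outside SQL. The sketch below is a plain-Python analogue, not the lab's BigQuery solution: `dedupe_keep_best` is a hypothetical helper, and the column names passed in are whatever your table actually uses. Missing values rank lowest, and exact ties keep the first row seen, so the result depends only on input order.

```python
def _rank(row, order_cols):
    """Build a sortable rank; None ranks below any real value."""
    return tuple(
        (row[c] is not None, row[c] if row[c] is not None else 0)
        for c in order_cols
    )


def dedupe_keep_best(rows, key_cols, order_cols):
    """Keep one row per key: the one with the highest values in order_cols."""
    best = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in best or _rank(row, order_cols) > _rank(best[key], order_cols):
            best[key] = row
    return list(best.values())
```

In SQL, the same policy needs an explicit tiebreaker in the `ORDER BY` if two rows in a group can tie on every ranking column, otherwise the row that is kept is not guaranteed to be the same on every run.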

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Report duplicate groups on `(user_id, movie_id, event_ts, device_type)` with counts (top 20).
2) Create table `watch_history_dedup` that keeps one row per group (prefer higher `progress_ratio`, then `minutes_watched`). Add comments.


In [None]:
# # EXAMPLE (from LLM) — Detect duplicate groups (commented)
# # SELECT user_id, movie_id, event_ts, device_type, COUNT(*) AS dup_count
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history`
# # GROUP BY user_id, movie_id, event_ts, device_type
# # HAVING dup_count > 1
# # ORDER BY dup_count DESC
# # LIMIT 20;

In [29]:
# Report duplicate groups with counts (top 20)
import os
import pandas as pd
from google.cloud import bigquery

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
if not project_id:
    print("Error: GOOGLE_CLOUD_PROJECT environment variable is not set.")
else:
    client = bigquery.Client(project=project_id)
    # The original table uses 'watch_date' as the timestamp.
    # This query groups by the interaction key to find duplicates.
    query = f"""
    SELECT
        user_id,
        movie_id,
        watch_date,
        device_type,
        COUNT(*) AS dup_count
    FROM `{project_id}.netflix.watch_history`
    GROUP BY user_id, movie_id, watch_date, device_type
    HAVING dup_count > 1
    ORDER BY dup_count DESC
    LIMIT 20;
    """

    print("Running query to report duplicate groups...")
    try:
        duplicate_groups_df = client.query(query).to_dataframe()
        if duplicate_groups_df.empty:
            print("No duplicate groups found.")
        else:
            print("Top 20 duplicate groups found:")
            display(duplicate_groups_df)
    except Exception as e:
        print(f"An error occurred: {e}")

Running query to report duplicate groups...
Top 20 duplicate groups found:


| | user_id | movie_id | watch_date | device_type | dup_count |
|---|---|---|---|---|---|
| 0 | user_03310 | movie_0640 | 2024-09-08 | Smart TV | 4 |
| 1 | user_00391 | movie_0893 | 2024-08-26 | Laptop | 4 |
| 2 | user_07617 | movie_0785 | 2024-07-14 | Desktop | 3 |
| 3 | user_05629 | movie_0697 | 2025-01-23 | Desktop | 3 |
| 4 | user_06799 | movie_0458 | 2024-08-15 | Desktop | 3 |
| 5 | user_04899 | movie_0142 | 2025-01-20 | Desktop | 3 |
| 6 | user_02652 | movie_0352 | 2024-10-22 | Desktop | 3 |
| 7 | user_02126 | movie_0642 | 2025-02-09 | Desktop | 3 |
| 8 | user_01581 | movie_0933 | 2024-03-30 | Desktop | 3 |
| 9 | user_05952 | movie_0893 | 2024-04-29 | Desktop | 3 |


In [31]:
# Create table `watch_history_dedup` that keeps one best row per duplicate group.
import os
from google.cloud import bigquery

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
client = bigquery.Client(project=project_id)

# This query creates a new, deduplicated table.
# It uses the ROW_NUMBER() window function to identify duplicates based on the key
# (user_id, movie_id, watch_date, device_type).
# Within each duplicate group, it ranks rows to determine which one to keep,
# preferring the one with the highest progress_percentage, and then the highest watch_duration_minutes.
# Finally, it keeps only the top-ranked row (rk = 1) for each group.
query = f"""
CREATE OR REPLACE TABLE `{project_id}.netflix.watch_history_dedup` AS
SELECT * EXCEPT(rk)
FROM (
  SELECT
    h.*,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, movie_id, watch_date, device_type
      ORDER BY progress_percentage DESC, watch_duration_minutes DESC
    ) AS rk
  FROM `{project_id}.netflix.watch_history` h
)
WHERE rk = 1;
"""

print("Running query to create or replace `watch_history_dedup`...")
try:
    # Execute the query and wait for the job to complete.
    job = client.query(query)
    job.result()
    print("Successfully created or replaced `watch_history_dedup` table.")

    # Verification step
    before_count_query = f"SELECT COUNT(*) FROM `{project_id}.netflix.watch_history`"
    after_count_query = f"SELECT COUNT(*) FROM `{project_id}.netflix.watch_history_dedup`"

    before_count = client.query(before_count_query).to_dataframe().iloc[0,0]
    after_count = client.query(after_count_query).to_dataframe().iloc[0,0]

    print(f"Row count before deduplication: {before_count}")
    print(f"Row count after deduplication:  {after_count}")
    print(f"Rows removed: {before_count - after_count}")

except Exception as e:
    print(f"An error occurred: {e}")

Running query to create or replace `watch_history_dedup`...
Successfully created or replaced `watch_history_dedup` table.
Row count before deduplication: 105000
Row count after deduplication:  100000
Rows removed: 5000


In [32]:
# # EXAMPLE (from LLM) — Keep-one policy (commented)
# # CREATE OR REPLACE TABLE `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` AS
# # SELECT * EXCEPT(rk) FROM (
# #   SELECT h.*,
# #          ROW_NUMBER() OVER (
# #            PARTITION BY user_id, movie_id, event_ts, device_type
# #            ORDER BY progress_ratio DESC, minutes_watched DESC
# #          ) AS rk
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history` h
# # )
# # WHERE rk = 1;

### Verification Prompt
Generate a before/after count query comparing raw vs `watch_history_dedup`.


In [33]:
# Generate a before/after count query comparing raw vs watch_history_dedup
import os
from google.cloud import bigquery

project_id = os.environ['GOOGLE_CLOUD_PROJECT']
client = bigquery.Client(project=project_id)

query = f"""
SELECT 'raw' AS table_name, COUNT(*) AS row_count FROM `{project_id}.netflix.watch_history`
UNION ALL
SELECT 'deduplicated' AS table_name, COUNT(*) AS row_count FROM `{project_id}.netflix.watch_history_dedup`;
"""

query_job = client.query(query)
results = query_job.result()

# Print the results in a clear format
print("Comparing row counts before and after deduplication:")
for row in results:
    print(f"- {row['table_name']}: {row['row_count']} rows")

Comparing row counts before and after deduplication:
- deduplicated: 100000 rows
- raw: 105000 rows


**Reflection:** Why do duplicates arise (natural vs system-generated)? How do they corrupt labels and KPIs?

Duplicates arise from two kinds of sources: natural ones, such as a user genuinely restarting the same title on the same device and day, and system-generated ones, such as multiple data entry points or retry mechanisms in data pipelines. Either way, they inflate counts and skew aggregations, which leads to inaccurate Key Performance Indicators (KPIs). For machine learning, duplicated rows overweight the repeated examples, so models can overfit and learn biased relationships. Labels derived from aggregated activity are inflated as well, which produces unreliable predictions and misleading business insights.
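
The size of the distortion can be worked out from the row counts reported earlier in this section (105,000 raw rows, 100,000 after deduplication). A short sketch of that arithmetic, with variable names chosen here for illustration:

```python
raw_rows = 105_000      # row count of watch_history reported above
dedup_rows = 100_000    # row count of watch_history_dedup reported above

removed = raw_rows - dedup_rows

# How much a simple event-count KPI is overstated when computed on the raw table.
inflation_pct = 100 * removed / dedup_rows

# Share of raw rows that were redundant copies.
redundant_share_pct = round(100 * removed / raw_rows, 2)

print(removed, inflation_pct, redundant_share_pct)
```

Any KPI that counts events on the raw table would therefore be overstated by 5% relative to the deduplicated table.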

### 5.3 Outliers (minutes_watched) — What & Why
Estimate extreme values via IQR; report % outliers; **winsorize** to P01/P99 for robustness while also **flagging** extremes.
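
A minimal sketch of both ideas in plain Python, using a hypothetical ten-value sample and exact quantiles from the standard library in place of BigQuery's approximate ones. The function names are illustrative only.

```python
from statistics import quantiles

minutes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 800]  # hypothetical sample

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def winsorize(values, lower, upper):
    """Clamp every value into [lower, upper] instead of dropping it."""
    return [max(lower, min(upper, v)) for v in values]

lo, hi = iqr_bounds(minutes)
outliers = [v for v in minutes if v < lo or v > hi]
pct_outliers = 100 * len(outliers) / len(minutes)

# quantiles(..., n=100) returns the 99 interior cut points: index 0 is P01, index 98 is P99.
cuts = quantiles(minutes, n=100, method="inclusive")
p01, p99 = cuts[0], cuts[98]
capped = winsorize(minutes, p01, p99)

# Keep a flag next to the capped value so the extreme rows stay identifiable.
was_capped = [v != c for v, c in zip(minutes, capped)]

print(lo, hi, pct_outliers)
print(capped)
```

Keeping the flag alongside the capped value is what lets later steps treat the extremes as a signal rather than silently losing them.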

### Build Prompt
Generate **two BigQuery SQL cells**:
1) Compute IQR bounds for `minutes_watched` on `watch_history_dedup` and report % outliers.
2) Create `watch_history_robust` with `minutes_watched_capped` capped at P01/P99; return quantile summaries before/after.


In [37]:
import os
from google.cloud import bigquery

project_id = os.environ['GOOGLE_CLOUD_PROJECT']
client = bigquery.Client(project=project_id)

query = f"""
WITH dist AS (
  SELECT
    APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(1)] AS q1,
    APPROX_QUANTILES(watch_duration_minutes, 4)[OFFSET(3)] AS q3
  FROM `{project_id}.netflix.watch_history_dedup`
),
bounds AS (
  SELECT q1, q3, (q3-q1) AS iqr,
         q1 - 1.5*(q3-q1) AS lo,
         q3 + 1.5*(q3-q1) AS hi
  FROM dist
)
SELECT
  COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi) AS outliers,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(h.watch_duration_minutes < b.lo OR h.watch_duration_minutes > b.hi)/COUNT(*),2) AS pct_outliers
FROM `{project_id}.netflix.watch_history_dedup` h
CROSS JOIN bounds b;
"""

print("Running query to calculate IQR and outlier percentage...")
query_job = client.query(query)
results = query_job.result()

# Print the results
print("Outlier Analysis Results:")
for row in results:
    print(f"- Outliers: {row['outliers']}")
    print(f"- Total Rows: {row['total']}")
    print(f"- Percentage of Outliers: {row['pct_outliers']}%")

Running query to calculate IQR and outlier percentage...
Outlier Analysis Results:
- Outliers: 3462
- Total Rows: 100000
- Percentage of Outliers: 3.46%


In [35]:
import os
from google.cloud import bigquery

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
client = bigquery.Client(project=project_id)

# This query creates a new table with 'watch_duration_minutes' capped at the 1st and 99th percentiles.
# 1. A CTE 'q' calculates the P01 and P99 values.
# 2. The main query selects all original columns and adds 'watch_duration_minutes_capped'.
# 3. GREATEST/LEAST functions perform the capping (Winsorization).
query = f"""
CREATE OR REPLACE TABLE `{project_id}.netflix.watch_history_robust` AS
WITH q AS (
  SELECT
    APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(1)]  AS p01,
    APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(99)] AS p99
  FROM `{project_id}.netflix.watch_history_dedup`
)
SELECT
  h.*,
  GREATEST(q.p01, LEAST(q.p99, h.watch_duration_minutes)) AS watch_duration_minutes_capped
FROM `{project_id}.netflix.watch_history_dedup` h, q;
"""

print("Running query to create `watch_history_robust` with capped values...")
try:
    # Execute the query and wait for completion.
    job = client.query(query)
    job.result()
    print("Successfully created `watch_history_robust` table.")
except Exception as e:
    print(f"An error occurred: {e}")

Running query to create `watch_history_robust` with capped values...
Successfully created `watch_history_robust` table.


In [42]:
import os
from google.cloud import bigquery

project_id = os.environ['GOOGLE_CLOUD_PROJECT']
client = bigquery.Client(project=project_id)

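# Recreate `watch_history_robust`, then compare quantiles before vs after capping.
# Note: APPROX_QUANTILES(x, 100) returns 101 values (offsets 0-100), so OFFSET(98)
# below is the 98th percentile even though it is aliased as p99; OFFSET(99) gives P99.
query = f"""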
CREATE OR REPLACE TABLE `{project_id}.netflix.watch_history_robust` AS
WITH q AS (
  SELECT
    APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(1)]  AS p01,
    APPROX_QUANTILES(watch_duration_minutes, 100)[OFFSET(98)] AS p99
  FROM `{project_id}.netflix.watch_history_dedup`
)
SELECT
  h.*,
  GREATEST(q.p01, LEAST(q.p99, h.watch_duration_minutes)) AS watch_duration_minutes_capped
FROM `{project_id}.netflix.watch_history_dedup` h, q;

-- Quantiles before vs after
WITH before AS (
  SELECT 'before' AS which, APPROX_QUANTILES(watch_duration_minutes, 5) AS q
  FROM `{project_id}.netflix.watch_history_dedup`
),
after AS (
  SELECT 'after' AS which, APPROX_QUANTILES(watch_duration_minutes_capped, 5) AS q
  FROM `{project_id}.netflix.watch_history_robust`
)
SELECT * FROM before UNION ALL SELECT * FROM after;
"""

query_job = client.query(query)
results = query_job.result()

# Print the results
for row in results:
    print(row)

Row(('before', [0.2, 25.0, 41.8, 61.3, 91.4, 799.3]), {'which': 0, 'q': 1})
Row(('after', [4.4, 24.6, 41.5, 61.5, 92.0, 203.6]), {'which': 0, 'q': 1})
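
The two capping cells above index the quantile array differently (`OFFSET(99)` in one, `OFFSET(98)` in the other), so it helps to see how offsets map to percentiles. `APPROX_QUANTILES(x, 100)` returns 101 elements, starting with the approximate minimum and ending with the approximate maximum. The sketch below reproduces that shape with exact quantiles from the Python standard library, as a stand-in for the BigQuery aggregate.

```python
from statistics import quantiles

values = list(range(101))  # 0, 1, ..., 100, so the k-th percentile is simply k

# statistics.quantiles returns only the 99 interior cut points, so add min and max
# to get the same 101-element shape that APPROX_QUANTILES(x, 100) returns.
quantile_array = [min(values)] + quantiles(values, n=100, method="inclusive") + [max(values)]

print(len(quantile_array))  # 101
print(quantile_array[1])    # P01
print(quantile_array[98])   # P98
print(quantile_array[99])   # P99
```

With 100 buckets, offset 99 is the 99th percentile and offset 98 is the 98th, so a cap aliased as `p99` but read from offset 98 is slightly tighter than intended.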


In [43]:
# # EXAMPLE (from LLM) — IQR outlier rate (commented)
# # WITH dist AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(1)] AS q1,
# #     APPROX_QUANTILES(minutes_watched, 4)[OFFSET(3)] AS q3
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # ),
# # bounds AS (
# #   SELECT q1, q3, (q3-q1) AS iqr,
# #          q1 - 1.5*(q3-q1) AS lo,
# #          q3 + 1.5*(q3-q1) AS hi
# #   FROM dist
# # )
# # SELECT
# #   COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi) AS outliers,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(h.minutes_watched < b.lo OR h.minutes_watched > b.hi)/COUNT(*),2) AS pct_outliers
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` h
# # CROSS JOIN bounds b;

In [44]:
# # EXAMPLE (from LLM) — Winsorize + quantiles (commented)
# # CREATE OR REPLACE TABLE `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust` AS
# # WITH q AS (
# #   SELECT
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(1)]  AS p01,
# #     APPROX_QUANTILES(minutes_watched, 100)[OFFSET(99)] AS p99
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # )
# # SELECT
# #   h.*,
# #   GREATEST(q.p01, LEAST(q.p99, h.minutes_watched)) AS minutes_watched_capped
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup` h, q;
# #
# # -- Quantiles before vs after
# # WITH before AS (
# #   SELECT 'before' AS which, APPROX_QUANTILES(minutes_watched, 5) AS q
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_dedup`
# # ),
# # after AS (
# #   SELECT 'after' AS which, APPROX_QUANTILES(minutes_watched_capped, 5) AS q
# #   FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust`
# # )
# # SELECT * FROM before UNION ALL SELECT * FROM after;

### Verification Prompt
Generate a query that shows min/median/max before vs after capping.


In [45]:
# Verification: Show min/median/max before vs after capping
import os
from google.cloud import bigquery

project_id = os.environ['GOOGLE_CLOUD_PROJECT']
client = bigquery.Client(project=project_id)

query = f"""
WITH before AS (
  SELECT
    'before' AS which,
    MIN(watch_duration_minutes) AS min_val,
    APPROX_QUANTILES(watch_duration_minutes, 2)[OFFSET(1)] AS median_val,
    MAX(watch_duration_minutes) AS max_val
  FROM `{project_id}.netflix.watch_history_dedup`
),
after AS (
  SELECT
    'after' AS which,
    MIN(watch_duration_minutes_capped) AS min_val,
    APPROX_QUANTILES(watch_duration_minutes_capped, 2)[OFFSET(1)] AS median_val,
    MAX(watch_duration_minutes_capped) AS max_val
  FROM `{project_id}.netflix.watch_history_robust`
)
SELECT * FROM before UNION ALL SELECT * FROM after;
"""

query_job = client.query(query)
results = query_job.result()

# Print the results
for row in results:
    print(row)

Row(('after', 4.4, 51.4, 203.6), {'which': 0, 'min_val': 1, 'median_val': 2, 'max_val': 3})
Row(('before', 0.2, 51.3, 799.3), {'which': 0, 'min_val': 1, 'median_val': 2, 'max_val': 3})


**Reflection:** When might capping be harmful? Name a model type less sensitive to outliers and why.

Capping can be harmful when the extreme values are genuine and informative rather than errors (for example, real binge sessions): it distorts the true distribution and hides exactly the behavior an analyst or model may care about. Tree-based models such as decision trees, random forests, and gradient boosting machines are generally less sensitive to outliers in the features. They split the data at thresholds, so only the ordering of values matters, not their exact magnitude, and they do not assume a linear relationship.
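
The threshold argument can be illustrated with a one-split regression stump written from scratch. This is a toy sketch on hypothetical data, not a library implementation: the chosen split is the same whether or not the extreme feature value is capped, while the feature mean moves a great deal.

```python
from statistics import mean

def best_split(xs, ys):
    """Return the threshold of a one-split stump that minimises squared error."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        sse = (sum((y - mean(left)) ** 2 for y in left)
               + sum((y - mean(right)) ** 2 for y in right))
        if sse < best_sse:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_sse = sse
    return best_threshold

labels = [0, 0, 0, 1, 1, 1]
raw_x = [1, 2, 3, 4, 5, 1000]   # last value is an extreme outlier
capped_x = [1, 2, 3, 4, 5, 6]   # same ordering, outlier pulled in

print(best_split(raw_x, labels), best_split(capped_x, labels))
print(mean(raw_x), mean(capped_x))
```

This covers outliers in a feature only. Extreme values in the target can still pull leaf predictions, so tree models are not immune to every kind of outlier.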

### 5.4 Business anomaly flags — What & Why
Human-readable flags help both product decisioning and ML features (e.g., binge behavior).

### Build Prompt
Generate **three BigQuery SQL cells** (adjust if columns differ):
1) In `watch_history_robust`, compute and summarize `flag_binge` for sessions > 8 hours.
2) In `users`, compute and summarize `flag_age_extreme` if age can be parsed from `age_band` (<10 or >100).
3) In `movies`, compute and summarize `flag_duration_anomaly` where `duration_min` < 15 or > 480 (if exists).
Each cell should output count and percentage and include 1–2 comments.


In [46]:
import os
from google.cloud import bigquery

project_id = os.environ['GOOGLE_CLOUD_PROJECT']
client = bigquery.Client(project=project_id)

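# This query computes the number and percentage of binge-watching sessions (over 8 hours).
# Caveat: it tests the capped column, whose maximum is the winsorization cap (203.6 minutes
# in the run above), so the count is necessarily 0. The raw `watch_duration_minutes` column
# is the one that can exceed 480.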
query = f"""
SELECT
  COUNTIF(watch_duration_minutes_capped > 8*60) AS sessions_over_8h,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(watch_duration_minutes_capped > 8*60)/COUNT(*),2) AS pct
FROM `{project_id}.netflix.watch_history_robust`;
"""

query_job = client.query(query)
results = query_job.result()

# Print the results
print("Summary for flag_binge:")
for row in results:
    print(f"- Binge Sessions (>8 hours): {row['sessions_over_8h']}")
    print(f"- Total Sessions: {row['total']}")
    print(f"- Percentage: {row['pct']}%")

Summary for flag_binge:
- Binge Sessions (>8 hours): 0
- Total Sessions: 100000
- Percentage: 0.0%
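
The 0% result follows from the column choice rather than from user behavior: the capped column tops out at 203.6 minutes (the maximum after capping reported earlier), which is below the 480-minute threshold. A small sketch with hypothetical durations, except for 799.3, which is the raw maximum reported above:

```python
BINGE_MINUTES = 8 * 60   # threshold from the build prompt
UPPER_CAP = 203.6        # maximum of the capped column reported in this notebook

# Hypothetical raw durations in minutes; 799.3 is the raw maximum reported above.
raw_minutes = [35.0, 120.5, 510.0, 799.3]
capped_minutes = [min(v, UPPER_CAP) for v in raw_minutes]

binge_on_raw = sum(v > BINGE_MINUTES for v in raw_minutes)
binge_on_capped = sum(v > BINGE_MINUTES for v in capped_minutes)

print(binge_on_raw, binge_on_capped)
```

Flags meant to capture extremes are therefore better computed on the raw column, with the capped column reserved for statistics that need robustness.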


In [47]:
import os
from google.cloud import bigquery

project_id = os.environ['GOOGLE_CLOUD_PROJECT']
client = bigquery.Client(project=project_id)

# This query flags users with potentially anomalous ages (<10 or >100).
# It helps identify data entry errors or outliers in user demographics.
query = f"""
SELECT
  COUNTIF(age < 10 OR age > 100) AS extreme_age_rows,
  COUNT(*) AS total,
  ROUND(100*COUNTIF(age < 10 OR age > 100)/COUNT(*),2) AS pct
FROM `{project_id}.netflix.users`;
"""

query_job = client.query(query)
results = query_job.result()

# Print the results
print("Summary for flag_age_extreme:")
for row in results:
    print(f"- Extreme Age Rows (<10 or >100): {row['extreme_age_rows']}")
    print(f"- Total Rows: {row['total']}")
    print(f"- Percentage: {row['pct']}%")

Summary for flag_age_extreme:
- Extreme Age Rows (<10 or >100): 179
- Total Rows: 10300
- Percentage: 1.74%


In [53]:
import os
from google.cloud import bigquery

project_id = os.environ['GOOGLE_CLOUD_PROJECT']
client = bigquery.Client(project=project_id)

# This query flags movies with anomalous durations (<15 or >480 minutes).
# It helps identify potential data errors or unusual content formats.
query = f"""
SELECT
  COUNTIF(duration_minutes < 15 OR duration_minutes > 480) AS anomaly_count,
  COUNT(*) AS total,
  ROUND(100 * COUNTIF(duration_minutes < 15 OR duration_minutes > 480) / COUNT(*), 2) AS pct
FROM `{project_id}.netflix.movies`;
"""

query_job = client.query(query)
results = query_job.result()

# Print the results
print("Summary for flag_duration_anomaly:")
for row in results:
    print(f"- Anomaly Count (<15 or >480 min): {row['anomaly_count']}")
    print(f"- Total Movies: {row['total']}")
    print(f"- Percentage: {row['pct']}%")

Summary for flag_duration_anomaly:
- Anomaly Count (<15 or >480 min): 23
- Total Movies: 1040
- Percentage: 2.21%


In [54]:
# # EXAMPLE (from LLM) — flag_binge (commented)
# # SELECT
# #   COUNTIF(minutes_watched > 8*60) AS sessions_over_8h,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(minutes_watched > 8*60)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.watch_history_robust`;

In [55]:
# # EXAMPLE (from LLM) — flag_age_extreme (commented)
# # SELECT
# #   COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #           CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100) AS extreme_age_rows,
# #   COUNT(*) AS total,
# #   ROUND(100*COUNTIF(CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) < 10 OR
# #                     CAST(REGEXP_EXTRACT(age_band, r'\d+') AS INT64) > 100)/COUNT(*),2) AS pct
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.users`;
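
The example above assumes an `age_band` string column and takes the first run of digits from it. A rough Python equivalent of that parsing rule, with hypothetical band labels, shows what the rule does and does not catch:

```python
import re

def first_number(age_band):
    """Return the first run of digits in the band as an int, or None if absent."""
    if age_band is None:
        return None
    match = re.search(r"\d+", age_band)
    return int(match.group()) if match else None

def flag_age_extreme(age_band):
    """True when the parsed age is below 10 or above 100; unparsable bands are not flagged."""
    age = first_number(age_band)
    return age is not None and (age < 10 or age > 100)

# Hypothetical band labels for illustration.
bands = ["18-24", "65+", "5-9", "Unknown", None, "101-110"]
flags = [flag_age_extreme(band) for band in bands]
print(list(zip(bands, flags)))
```

Because only the first number is read, the flag reflects the lower bound of each band. Unparsable or missing bands are left unflagged, which matches how `COUNTIF` skips a NULL comparison in SQL.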

In [56]:
# # EXAMPLE (from LLM) — flag_duration_anomaly (commented)
# # SELECT
# #   COUNTIF(duration_min < 15) AS titles_under_15m,
# #   COUNTIF(duration_min > 8*60) AS titles_over_8h,
# #   COUNT(*) AS total
# # FROM `${GOOGLE_CLOUD_PROJECT}.netflix.movies`;

### Verification Prompt
Generate a single compact summary query that returns two columns per flag: `flag_name, pct_of_rows`.


In [58]:
# Generate a single compact summary query that returns two columns per flag: flag_name, pct_of_rows
import os
from google.cloud import bigquery

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
if not project_id:
    print("Error: GOOGLE_CLOUD_PROJECT environment variable is not set.")
else:
    client = bigquery.Client(project=project_id)

    query = f"""
    -- Calculate percentage for flag_binge from watch_history_robust
    SELECT 'flag_binge' AS flag_name,
           ROUND(100 * COUNTIF(watch_duration_minutes_capped > 8*60) / COUNT(*), 2) AS pct_of_rows
    FROM `{project_id}.netflix.watch_history_robust`

    UNION ALL

    -- Calculate percentage for flag_age_extreme from users
    SELECT 'flag_age_extreme' AS flag_name,
           ROUND(100 * COUNTIF(age < 10 OR age > 100) / COUNT(*), 2) AS pct_of_rows
    FROM `{project_id}.netflix.users`

    UNION ALL

    -- Calculate percentage for flag_duration_anomaly from movies
    SELECT 'flag_duration_anomaly' AS flag_name,
           ROUND(100 * COUNTIF(duration_minutes < 15 OR duration_minutes > 480) / COUNT(*), 2) AS pct_of_rows
    FROM `{project_id}.netflix.movies`;
    """

    print("Running summary query for anomaly flags:")
    query_job = client.query(query)
    results = query_job.result()

    # Print the results
    for row in results:
        print(row)

Running summary query for anomaly flags:
Row(('flag_binge', 0.0), {'flag_name': 0, 'pct_of_rows': 1})
Row(('flag_age_extreme', 1.74), {'flag_name': 0, 'pct_of_rows': 1})
Row(('flag_duration_anomaly', 2.21), {'flag_name': 0, 'pct_of_rows': 1})


**Reflection:** Which anomaly flag is most common? Which would you keep as a feature and why?

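The most common flag is `flag_duration_anomaly` at 2.21%, followed by `flag_age_extreme` at 1.74%. `flag_binge` is 0%, but only because it was computed on the capped column, whose maximum (203.6 minutes) is below the 480-minute threshold. I would keep `flag_duration_anomaly` as a feature for a machine learning model because it describes the type of content (e.g., short-form vs. long-form), which can be a useful predictor of user preference.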

## 6) Save & submit — What & Why
Reproducibility: save artifacts and document decisions so others can rerun and audit.

### Build Prompt
Generate a checklist (Markdown) students can paste at the end:
- Save this notebook to the team Drive.
- Export a `.sql` file with your DQ queries and save to repo.
- Push notebook + SQL to the **team GitHub** with a descriptive commit.
- Add a README with your `PROJECT_ID`, `REGION`, bucket, dataset, and today’s row counts.
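
One possible way to produce the `.sql` export from the checklist is sketched below. The file name `dq_queries.sql`, the query names, and the `your-project` placeholder are all assumptions for illustration. Swap in the queries you actually ran and your own repo folder.

```python
import tempfile
from pathlib import Path

# Placeholder queries; replace with the DQ queries you actually ran.
dq_queries = {
    "row_count_raw": "SELECT COUNT(*) FROM `your-project.netflix.watch_history`",
    "row_count_dedup": "SELECT COUNT(*) FROM `your-project.netflix.watch_history_dedup`;",
}

def export_sql(named_queries, path):
    """Write all queries to one file, each preceded by a comment with its name."""
    blocks = []
    for name, sql in named_queries.items():
        body = sql.strip().rstrip(";")  # normalise so each query ends with one semicolon
        blocks.append(f"-- {name}\n{body};\n")
    path = Path(path)
    path.write_text("\n".join(blocks), encoding="utf-8")
    return path

out_dir = Path(tempfile.mkdtemp())  # stand-in for your repo folder
sql_file = export_sql(dq_queries, out_dir / "dq_queries.sql")
print(sql_file.read_text(encoding="utf-8"))
```

Naming each query in a comment keeps the exported file auditable when someone reruns it later.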


## Grading rubric (quick)
- Profiling completeness (30)  
- Cleaning policy correctness & reproducibility (40)  
- Reflection/insight (20)  
- Hygiene (naming, verification, idempotence) (10)
