

# **Lab 5: Predicting Diversions (Classification with BQML)**
**Unit 2 • Week 8 (Thu) — Classification & Evaluation**

**Objective:** Train and evaluate a **logistic regression** model to classify whether a flight will be **diverted**. Interpret **precision/recall** and the **confusion matrix**, and practice threshold tuning.


## Setup & Authentication

In [37]:
from google.colab import files
import os

# Prompt the user to upload their kaggle.json file.
# This file contains your Kaggle API credentials.
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

# Create the .kaggle directory if it doesn't exist.
# This is where Kaggle expects to find the credentials file.
os.makedirs('/root/.kaggle', exist_ok=True)

# Save the uploaded file to the correct location.
# Using the first uploaded file as we expect only one (kaggle.json).
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])

# Set file permissions to 0600 (owner read/write only).
# This is crucial for security to prevent other users from accessing your API key.
os.chmod('/root/.kaggle/kaggle.json', 0o600)

# Verify the Kaggle installation by printing the version.
# This confirms the CLI is installed and can access the credentials.
!kaggle --version

# Done: Kaggle setup

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle (3).json to kaggle (3) (2).json
Kaggle API 1.7.4.5


In [38]:
#EXAMPLE (from LLM) — Auth + Project/Region (commented; write your own cell using the prompt)
from google.colab import auth
auth.authenticate_user()

import os
PROJECT_ID = input("Enter your GCP Project ID: ").strip()
REGION = "us-central1"  # keep consistent; change if instructed
os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
print("Project:", PROJECT_ID, "| Region:", REGION)

# Set active project for gcloud/BigQuery CLI
!gcloud config set project $GOOGLE_CLOUD_PROJECT
!gcloud config get-value project
# Done: Auth + Project/Region set

Enter your GCP Project ID: mgmt-46700
Project: mgmt-46700 | Region: us-central1
Updated property [core/project].
mgmt-46700


In [39]:
# EXAMPLE (from LLM) — Kaggle setup (commented)
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
uploaded = files.upload()

import os
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])
os.chmod('/root/.kaggle/kaggle.json', 0o600)  # owner-only

!kaggle --version

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle (3).json to kaggle (3) (3).json
Kaggle API 1.7.4.5


In [40]:
# Create the directory to store raw data
!mkdir -p /content/data/raw

# Download the dataset using Kaggle CLI to /content/data
# The -d flag specifies the dataset, and -p specifies the download path
!kaggle datasets download -d mahoora00135/flights -p /content/data

# Unzip the downloaded dataset into the raw data directory
# -o flag overwrites files if they exist
!unzip -o /content/data/*.zip -d /content/data/raw

# List all CSV files in the raw data directory with their sizes in a neat table
!ls -lh /content/data/raw/*.csv

Dataset URL: https://www.kaggle.com/datasets/mahoora00135/flights
License(s): CC0-1.0
flights.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  /content/data/flights.zip
  inflating: /content/data/raw/flights.csv  
-rw-r--r-- 1 root root 41M Sep 26  2023 /content/data/raw/flights.csv


In [41]:
# Create a GCS bucket (only once per project)
BUCKET_NAME = f"{PROJECT_ID}-flights-bucket"
!gsutil mb -l {REGION} gs://{BUCKET_NAME}/ || echo "Bucket may already exist"

# Upload dataset to the bucket
!gsutil cp /content/data/raw/flights.csv gs://{BUCKET_NAME}/flights.csv

print(f"✅ Uploaded flights.csv to: gs://{BUCKET_NAME}/flights.csv")


Creating gs://mgmt-46700-flights-bucket/...
ServiceException: 409 A Cloud Storage bucket named 'mgmt-46700-flights-bucket' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
Bucket may already exist
Copying file:///content/data/raw/flights.csv [Content-Type=text/csv]...
- [1 files][ 40.9 MiB/ 40.9 MiB]                                                
Operation completed over 1 objects/40.9 MiB.                                     
✅ Uploaded flights.csv to: gs://mgmt-46700-flights-bucket/flights.csv


In [42]:
from google.cloud import bigquery
from google.api_core.exceptions import Conflict

client = bigquery.Client(project=PROJECT_ID)

dataset_id = f"{PROJECT_ID}.flights_dataset"
table_id = f"{dataset_id}.flights"

# Create dataset if it doesn't exist
dataset = bigquery.Dataset(dataset_id)
dataset.location = REGION

try:
    client.create_dataset(dataset, timeout=30)  # Make an API request.
    print(f"✅ Dataset {dataset_id} created.")
except Conflict:
    print(f"Dataset {dataset_id} already exists.")
except Exception as e:
    print(f"Error creating dataset {dataset_id}: {e}")
    # Exit the cell execution if dataset creation fails
    raise

# Define load job from GCS
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

uri = f"gs://{BUCKET_NAME}/flights.csv"
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()

table = client.get_table(table_id)
print(f"✅ Loaded {table.num_rows} rows into {table_id}")

Dataset mgmt-46700.flights_dataset already exists.
✅ Loaded 1347104 rows into mgmt-46700.flights_dataset.flights



---
## Business Context

> An airline wants to proactively identify flights with a high probability of being **diverted** to better manage logistics and passenger communication.

**Question:** Which is more costly for the airline: a **false positive** (predict diversion, but no diversion) or a **false negative** (fail to predict a diversion that occurs)?  
Write your reasoning below in 4–6 sentences.


**Reasoning:**

A **false negative** (failing to predict a diversion that occurs) is likely more costly for an airline than a false positive. A false negative could lead to significant disruptions, including:

*   **Passenger dissatisfaction and potential loss of future business:** Passengers are not informed of the diversion and may be stranded, miss connections, or face significant delays without proper support.
*   **Logistical nightmares:** The airline is unprepared for the diversion, leading to difficulties in arranging alternative transportation, accommodations, and crew changes.
*   **Increased operational costs:** Unexpected diversions can incur significant costs for fuel, landing fees, ground handling, and re-routing.

While false positives (predicting a diversion that doesn't happen) also have costs (e.g., unnecessary preparations, passenger anxiety), they are generally less severe than the consequences of an unpredicted diversion.
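One way to make this comparison concrete is to attach a cost to each error type and compare totals. The sketch below only does that arithmetic: the per-event dollar figures and the error counts are hypothetical placeholders, not airline data, and `total_error_cost` is an illustrative helper name.

```python
def total_error_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Total cost of a classifier's mistakes given a cost per error type."""
    return false_positives * cost_fp + false_negatives * cost_fn

# Hypothetical per-event costs (assumptions for illustration only).
COST_FP = 2_000    # preparing for a diversion that never happens
COST_FN = 25_000   # handling a diversion nobody prepared for

# Two hypothetical models that each make 100 errors, split differently.
rarely_flags = total_error_cost(false_positives=20, false_negatives=80,
                                cost_fp=COST_FP, cost_fn=COST_FN)
often_flags = total_error_cost(false_positives=80, false_negatives=20,
                               cost_fp=COST_FP, cost_fn=COST_FN)

print(f"Mostly false negatives: ${rarely_flags:,}")
print(f"Mostly false positives: ${often_flags:,}")
```

Under these assumed costs, the model that misses more diversions is the more expensive one even though both make the same number of mistakes.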


---
## Train a Classification Model (LOGISTIC_REG)

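Use BQML to train a **logistic regression** model using a few features. The lab template targets `diverted`; this notebook instead trains on a derived binary label, `delayed_flag` (arrival delay over 60 minutes), as the prompt and query below show.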


In [43]:
schema_sql = f"""
SELECT column_name, data_type
FROM `{PROJECT_ID}.flights_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'flights'
"""
schema_df = client.query(schema_sql).result().to_dataframe()
display(schema_df)

Unnamed: 0,column_name,data_type
0,id,INT64
1,year,INT64
2,month,INT64
3,day,INT64
4,dep_time,FLOAT64
5,sched_dep_time,INT64
6,dep_delay,FLOAT64
7,arr_time,FLOAT64
8,sched_arr_time,INT64
9,arr_delay,FLOAT64


```python
prompt = """
# TASK: Generate a BQML query to create a classification model.
# CONTEXT: I’m using the Kaggle flights dataset in BigQuery.
# GOAL: Predict whether a flight is delayed more than 60 minutes (binary outcome).
# FEATURES: Use dep_delay, distance, air_time, hour, and carrier.
# MODEL TYPE: LOGISTIC_REG
# REQUIREMENTS:
# - Create a new label column called delayed_flag using CASE WHEN arr_delay > 60 THEN 1 ELSE 0 END.
# - Use model_type='LOGISTIC_REG' and enable_global_explain=TRUE.
# - Limit to 200,000 rows.
"""
```


In [44]:
query = f"""
CREATE OR REPLACE MODEL `{dataset_id}.flight_delay_classifier_v2`
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['delayed_flag'],
  enable_global_explain=TRUE
) AS
SELECT
  CASE WHEN arr_delay > 60 THEN 1 ELSE 0 END AS delayed_flag,
  dep_delay,
  distance,
  air_time,
  hour,
  carrier
FROM `{table_id}`
WHERE arr_delay IS NOT NULL
  AND dep_delay IS NOT NULL
LIMIT 200000;
"""

client.query(query).result()
print("✅ Model created successfully: flight_delay_classifier_v2")


✅ Model created successfully: flight_delay_classifier_v2
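Because the model was created with `enable_global_explain=TRUE`, its feature attributions can be inspected with `ML.GLOBAL_EXPLAIN`. The following is a minimal sketch that only builds the query string; `build_global_explain_sql` is a hypothetical helper, and the model path is the one this notebook's outputs print.

```python
def build_global_explain_sql(model_path: str) -> str:
    """Return a query that lists global feature attributions for a BQML model."""
    return f"SELECT *\nFROM ML.GLOBAL_EXPLAIN(MODEL `{model_path}`)"

# Model path assembled from the project, dataset, and model names used above.
explain_sql = build_global_explain_sql(
    "mgmt-46700.flights_dataset.flight_delay_classifier_v2"
)
print(explain_sql)
```

In the notebook, the string would be run like the other queries, for example `client.query(explain_sql).result().to_dataframe()`.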



---
## Evaluate with `ML.EVALUATE` — Validate

Get **precision**, **recall**, **log_loss**, and other metrics. Also compute a **confusion matrix**.


In [45]:
# Evaluate model performance for the classifier
eval_sql = f"""
SELECT *
FROM ML.EVALUATE(MODEL `{dataset_id}.flight_delay_classifier_v2`);
"""

eval_df = client.query(eval_sql).result().to_dataframe()
display(eval_df)


Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.935696,0.793103,0.976382,0.858519,0.07586,0.982321


In [46]:
# Compute confusion matrix at the default threshold
cm_sql = f"""
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `{dataset_id}.flight_delay_classifier_v2`,
  (
    SELECT
      CASE WHEN arr_delay > 60 THEN 1 ELSE 0 END AS delayed_flag,
      dep_delay,
      distance,
      air_time,
      hour,
      carrier
    FROM `{table_id}`
    WHERE arr_delay IS NOT NULL
      AND dep_delay IS NOT NULL
    LIMIT 200000
  )
);
"""

cm_df = client.query(cm_sql).result().to_dataframe()
display(cm_df)


Unnamed: 0,expected_label,_0,_1
0,0,180158,1170
1,1,3762,14910


In [47]:
print("✅ Model Evaluation Summary (Classification Metrics)")
display(eval_df[['precision', 'recall', 'accuracy', 'log_loss', 'roc_auc']].round(3))


✅ Model Evaluation Summary (Classification Metrics)


Unnamed: 0,precision,recall,accuracy,log_loss,roc_auc
0,0.936,0.793,0.976,0.076,0.982



**Gemini Explainer Prompt:**

```python
prompt = """
# TASK: Explain classification model evaluation metrics in plain English.
# CONTEXT: I just evaluated my BigQuery ML logistic regression model predicting flight delays.
# The output includes precision, recall, accuracy, log_loss, and roc_auc, plus a confusion matrix.
# GOAL: Write a short 3–4 sentence explanation of what these results mean for airline operations —
# e.g., how well the model identifies delayed flights vs. non-delayed flights.
# TONE: Business-friendly, clear, and focused on decision-making impact.
"""
```
Paste your explanation below.


Based on the model evaluation, here's a summary of what the metrics mean for predicting flight delays (arrival delay over 60 minutes, the `delayed_flag` label this model was trained on):

*   **Accuracy (0.976):** Overall, the model correctly predicts whether a flight will be delayed 97.6% of the time. Because delayed flights are a small minority (about 9% of the rows in the confusion matrix), accuracy alone overstates how well the model finds them.
*   **Precision (0.936):** When the model predicts a flight will be delayed, it is correct about 93.6% of the time. This is important for minimizing false alarms (false positives).
*   **Recall (0.793):** The model identifies about 79.3% of all actual delayed flights. This metric is crucial for the airline to capture as many true delays as possible and avoid being unprepared (false negatives).
*   **Log Loss (0.076):** A low log loss means the predicted probabilities are, on average, close to the actual outcomes. It does not by itself prove that the probabilities are well calibrated.
*   **ROC AUC (0.982):** The model has excellent ability to distinguish between delayed and non-delayed flights.

In terms of the confusion matrix (rows are the actual label, columns `_0` and `_1` are the predicted label):

*   **True Negatives (actual 0, predicted `_0`):** 180,158 flights were correctly predicted as not delayed.
*   **False Positives (actual 0, predicted `_1`):** 1,170 flights were incorrectly predicted as delayed.
*   **False Negatives (actual 1, predicted `_0`):** 3,762 flights were incorrectly predicted as not delayed (these are the missed delays).
*   **True Positives (actual 1, predicted `_1`):** 14,910 flights were correctly predicted as delayed.

For airline operations, recall that is lower than precision means the model is conservative about predicting delays: it avoids false alarms at the cost of missing some real delays. False negatives (3,762) outnumber false positives (1,170), which is a concern given the earlier business-context conclusion that false negatives are likely the more costly error. Tuning the prediction threshold, for example lowering it below 0.5 to raise recall, could rebalance these outcomes according to the airline's priorities.
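The counts listed above can be turned back into rates to check the interpretation. This sketch hard-codes the four cells from the `ML.CONFUSION_MATRIX` output. The results will not exactly match `ML.EVALUATE`, likely because that call (made with no input table) scores the model's held-out evaluation data, while the confusion matrix was computed over a separate 200,000-row query.

```python
# Cells copied from the ML.CONFUSION_MATRIX output (rows = actual, columns = predicted).
tn, fp = 180_158, 1_170    # actual 0: predicted 0, predicted 1
fn, tp = 3_762, 14_910     # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
precision = tp / (tp + fp)             # share of predicted delays that were real
recall = tp / (tp + fn)                # share of real delays that were caught
accuracy = (tp + tn) / total
positive_rate = (tp + fn) / total      # how common the positive class is
baseline_accuracy = 1 - positive_rate  # accuracy of always predicting "not delayed"

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
print(f"positive rate={positive_rate:.3f} baseline accuracy={baseline_accuracy:.3f}")
```

With roughly 9% positives, a model that never predicts a delay would already reach about 91% accuracy, which is why precision and recall are the more informative numbers here.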


---
## Threshold Tuning

By default, `ML.PREDICT` uses a threshold of **0.5**. You can change it to 0.75 and observe impacts on FP/FN.

> **Task:** Author your own Gemini prompt asking for an `ML.PREDICT` example that uses **`STRUCT(0.75 AS threshold)`** and explains when/why an airline might pick a higher threshold.


**Gemini Explainer Prompt:**

```python
prompt = """
# TASK: Write an ML.PREDICT query using a custom threshold of 0.75.
# CONTEXT: I trained a BigQuery ML logistic regression model to predict flight diversions (1 = diverted, 0 = not diverted).
# GOAL: Show how to write a full ML.PREDICT example that applies STRUCT(0.75 AS threshold)
# and uses sample inputs like distance, dep_delay, and carrier.
# EXPLANATION REQUEST: Also explain why an airline might prefer a higher threshold,
# focusing on business reasoning (e.g., balancing false positives and false negatives).
# STYLE: Clear, concise, and formatted for Colab — include both SQL and short explanation.
"""
```



In [48]:
# Fully qualified name of the model trained above.
# `client` and `dataset_id` are defined in the BigQuery load cell.
model_id = f"{dataset_id}.flight_delay_classifier_v2"

# ML.PREDICT query with a custom threshold of 0.75
pred_sql = f"""
SELECT *
FROM ML.PREDICT(
  MODEL `{model_id}`,
  (
    SELECT
      CAST(1500 AS INT64)    AS distance,   -- Example distance
      CAST(45.0 AS FLOAT64)  AS dep_delay,  -- Example departure delay
      CAST(250.0 AS FLOAT64) AS air_time,   -- Example air time
      CAST(10 AS INT64)      AS hour,       -- Example scheduled departure hour
      CAST('UA' AS STRING)   AS carrier     -- Example carrier
  ),
  STRUCT(0.75 AS threshold)                 -- Apply a higher decision threshold
)
"""

# Execute the query and display the result
pred_df = client.query(pred_sql).result().to_dataframe()
display(pred_df)

Unnamed: 0,predicted_delayed_flag,predicted_delayed_flag_probs,distance,dep_delay,air_time,hour,carrier
0,0,"[{'label': 1, 'prob': 0.23395840374967045}, {'...",1500,45.0,250.0,10,UA
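The `predicted_delayed_flag_probs` column holds one entry per class, so the positive-class probability has to be picked out before it can be compared with a threshold. A minimal sketch, assuming each entry is a dict with `label` and `prob` keys as displayed above. Only the label-1 entry is reproduced because the rest of the cell is truncated in the output, and treating a probability exactly equal to the threshold as positive is an assumption; `positive_prob` and `label_at` are hypothetical helpers.

```python
def positive_prob(probs, positive_label=1):
    """Return the probability assigned to the positive label."""
    for entry in probs:
        if entry["label"] == positive_label:
            return entry["prob"]
    raise ValueError(f"label {positive_label!r} not found")

def label_at(prob, threshold):
    """Apply a decision threshold to a positive-class probability."""
    return 1 if prob >= threshold else 0

# Label-1 entry copied from the ML.PREDICT output above.
probs = [{"label": 1, "prob": 0.23395840374967045}]

p = positive_prob(probs)
for threshold in (0.5, 0.75):
    print(f"threshold={threshold}: predicted label {label_at(p, threshold)}")
```

For this row the prediction is 0 at either threshold; only a row with a probability between 0.5 and 0.75 would flip from 1 to 0.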


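Threshold tuning sets how confident the model must be before it flags a flight as positive. The model in this notebook flags delays (`delayed_flag = 1`), but the same logic applies to the diversion label the lab describes.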

*   By raising the threshold from 0.5 to 0.75, you're requiring the model to be more certain before predicting a diversion.
*   This will generally **decrease false positives** (predicting a diversion that doesn't happen), saving resources and reducing unnecessary passenger anxiety.
*   However, it will also likely **increase false negatives** (failing to predict a diversion that *does* happen), meaning the airline might be caught unprepared for some actual diversions.

An airline might choose a higher threshold (like 0.75) if a false positive (for example, unnecessarily holding a connecting flight or re-planning logistics for a flight that proceeds as scheduled) costs significantly more than occasionally being unprepared for a diversion. The choice is a trade-off that depends on the airline's operational priorities and risk tolerance.
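The trade-off described above can be seen on a toy example. The sketch below counts false positives and false negatives at two thresholds. The probabilities and labels are made up for illustration and do not come from the flights model, and `count_errors` is a hypothetical helper.

```python
def count_errors(scored, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    fp = sum(1 for prob, actual in scored if prob >= threshold and actual == 0)
    fn = sum(1 for prob, actual in scored if prob < threshold and actual == 1)
    return fp, fn

# Hypothetical (positive-class probability, actual label) pairs.
scored = [
    (0.95, 1), (0.80, 1), (0.70, 1), (0.60, 0),
    (0.55, 1), (0.40, 0), (0.30, 1), (0.10, 0),
]

for threshold in (0.5, 0.75):
    fp, fn = count_errors(scored, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this toy data, moving from 0.5 to 0.75 removes the single false positive but raises false negatives from one to three.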


---
## ✅ Deliverable for Lab 5

- Completed `Lab5_Classification_BQML.ipynb` with:
  - Business context write-up (FP vs FN)
  - `CREATE MODEL` SQL
  - `ML.EVALUATE` + `ML.CONFUSION_MATRIX` outputs and explanations
  - Threshold tuning example (`STRUCT(0.75 AS threshold)`)
- Push to **GitHub** and submit the link on **Brightspace**.
