<a href="https://colab.research.google.com/github/MaxMatteucci/mgmt467-analytics-portfolio/blob/main/Lab6Brightspace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 🧱 Lab 6: Improving Models with Feature Engineering  
**Author:** Max Matteucci  
**Course:** MGMT 467 – Big Data and Cloud Analytics  
**Project ID:** `database-project-467`  
**Dataset:** `flights.flights_classification`  

**Objective:**  
The goal of this lab is to improve the performance of the flight diversion classification model by applying **feature engineering** in BigQuery ML.  
Following the `TRANSFORM` concept (implemented here as engineered columns in the training query's `SELECT` rather than as a literal `TRANSFORM` clause), we create new variables that capture richer relationships in the data, such as combined airline–airport behavior.  
The lab concludes by comparing baseline and improved model performance and by designing a challenge prompt for bucketizing delay severity.


In [1]:
from google.colab import auth
from google.cloud import bigquery

auth.authenticate_user()
client = bigquery.Client(project="database-project-467")
print("✅ Connected to BigQuery project:", client.project)


✅ Connected to BigQuery project: database-project-467


In [5]:
baseline_eval_query = """
SELECT *
FROM ML.EVALUATE(MODEL `database-project-467.flights.flight_diversion_classifier`);
"""
baseline_df = client.query(baseline_eval_query).to_dataframe()
baseline_df


Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.917367,0.752009,0.972734,0.826498,0.086067,0.971097


In [3]:
prompt = """
# TASK: Brainstorm new features for an ML model, a process called feature engineering.
# CONTEXT: I want to improve my flight diversion prediction model. The raw data has a column called 'origin' (e.g., 'JFK', 'ORD') and another called 'carrier' (e.g., 'AA', 'UA').
# GOAL: Suggest one new feature I could create by combining 'origin' and 'carrier' that might be more predictive than either column alone. Explain why this new feature could be more powerful.
"""
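
One feature that fits this prompt is an origin and carrier interaction key. The sketch below uses made-up rows (the toy data and the `diversion_rate` helper are hypothetical, not taken from the lab dataset) to illustrate why a combined key might carry signal that neither column shows alone.

```python
from collections import defaultdict

# Hypothetical toy rows: (origin, carrier, diverted). Not from the lab dataset.
rows = [
    ("JFK", "AA", 1), ("JFK", "AA", 1),
    ("JFK", "UA", 0), ("JFK", "UA", 0),
    ("ORD", "AA", 0), ("ORD", "AA", 0),
    ("ORD", "UA", 1), ("ORD", "UA", 1),
]


def diversion_rate(rows, key_fn):
    """Return {key: share of diverted flights} for the grouping defined by key_fn."""
    totals = defaultdict(int)
    diverted = defaultdict(int)
    for origin, carrier, label in rows:
        key = key_fn(origin, carrier)
        totals[key] += 1
        diverted[key] += label
    return {key: diverted[key] / totals[key] for key in totals}


by_origin = diversion_rate(rows, lambda o, c: o)
by_carrier = diversion_rate(rows, lambda o, c: c)
by_combo = diversion_rate(rows, lambda o, c: f"{o}_{c}")

print(by_origin)   # identical rate for every origin in this toy data
print(by_carrier)  # identical rate for every carrier as well
print(by_combo)    # the combined key separates the groups
```

In this contrived data each column alone is uninformative, while the combined key splits the rows cleanly. Real flight data will be far noisier, so the effect is usually much smaller.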


In [None]:
prompt = """
# TASK: Generate the TRANSFORM clause for a BQML CREATE MODEL statement.
# GOAL: I need to pass through my original features ('dep_delay', 'distance') and create two new features:
# 1. 'route': a new feature created by combining the 'origin' and 'destination' columns using the CONCAT() function.
# 2. 'day_of_week': a new feature created by extracting the day of the week from the 'fl_date' column using the EXTRACT() function.
"""
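
A response to this prompt could look roughly like the statement built below. This is a sketch, not something run in this lab: it assumes the table has `destination` and `fl_date` columns, the model name is hypothetical, and the `CAST(... AS STRING)` is an added choice so the day of week is treated as a category rather than a number.

```python
def build_transform_model_sql(model_name: str, source_table: str) -> str:
    """Return a CREATE MODEL statement with a TRANSFORM clause for route and day_of_week."""
    return f"""
CREATE OR REPLACE MODEL `{model_name}`
TRANSFORM(
  diverted,                                   -- label must pass through TRANSFORM
  dep_delay,                                  -- original feature, passed through
  distance,                                   -- original feature, passed through
  CONCAT(origin, '_', destination) AS route,  -- assumes a destination column
  CAST(EXTRACT(DAYOFWEEK FROM fl_date) AS STRING) AS day_of_week  -- assumes fl_date
)
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['diverted']
) AS
SELECT diverted, dep_delay, distance, origin, destination, fl_date
FROM `{source_table}`
""".strip()


sql = build_transform_model_sql(
    "database-project-467.flights.flight_diversion_classifier_transform",  # hypothetical
    "database-project-467.flights.flights_classification",
)
print(sql)
```

Because the transformations are stored with the model, the same preprocessing would be applied automatically at evaluation and prediction time, which is the main reason to prefer `TRANSFORM` over computing features in the `SELECT`.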


In [13]:
from google.cloud import bigquery
client = bigquery.Client(project="database-project-467")

create_model_query = """
CREATE OR REPLACE MODEL `database-project-467.flights.flight_diversion_classifier_fe`
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['diverted'],
  enable_global_explain=TRUE
) AS
SELECT
  -- label
  diverted,
  -- original features
  dep_delay,
  distance,
  carrier,
  origin,
  month,
  -- engineered feature that mimics TRANSFORM
  CONCAT(origin, "_", carrier) AS origin_carrier_combo
FROM
  `database-project-467.flights.flights_classification`;
"""

job = client.query(create_model_query)
job.result()
print("✅ FE model created")


✅ FE model created


In [14]:
fe_eval = client.query("""
SELECT *
FROM ML.EVALUATE(MODEL `database-project-467.flights.flight_diversion_classifier_fe`);
""").to_dataframe()
fe_eval


Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.936709,0.760274,0.974133,0.839319,0.077442,0.977705


In [15]:
# Evaluate feature-engineered model
fe_eval_query = """
SELECT *
FROM ML.EVALUATE(MODEL `database-project-467.flights.flight_diversion_classifier_fe`);
"""
fe_eval_df = client.query(fe_eval_query).to_dataframe()
fe_eval_df


Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.936709,0.760274,0.974133,0.839319,0.077442,0.977705


### ⚖️ Performance Comparison

| Model | Precision | Recall | F1 Score | ROC AUC |
|--------|------------|---------|-----------|---------|
| Baseline (Lab 5) | 0.917 | 0.752 | 0.826 | 0.971 |
| Feature Engineered (Lab 6) | 0.937 | 0.760 | 0.839 | 0.978 |

Adding `origin_carrier_combo`, which captures airline behavior at a specific airport, slightly improved the model: precision rose from 0.917 to 0.937 and recall from 0.752 to 0.760. The gain is small and rests on a single evaluation of each model, but it is consistent with the idea that engineered categorical interactions can add predictive power.
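
The size of the change for every metric can be checked with a short calculation. The sketch below copies the values from the two `ML.EVALUATE` outputs in this notebook; the `metric_deltas` helper is a hypothetical name introduced for illustration.

```python
# Metrics copied from the baseline and feature-engineered ML.EVALUATE outputs.
baseline = {
    "precision": 0.917367, "recall": 0.752009, "accuracy": 0.972734,
    "f1_score": 0.826498, "log_loss": 0.086067, "roc_auc": 0.971097,
}
engineered = {
    "precision": 0.936709, "recall": 0.760274, "accuracy": 0.974133,
    "f1_score": 0.839319, "log_loss": 0.077442, "roc_auc": 0.977705,
}


def metric_deltas(before, after):
    """Return the change (after minus before) for each metric, rounded to 6 places."""
    return {name: round(after[name] - before[name], 6) for name in before}


deltas = metric_deltas(baseline, engineered)
for name, delta in deltas.items():
    print(f"{name:<10} {baseline[name]:.6f} -> {engineered[name]:.6f} ({delta:+.6f})")
```

Note that `log_loss` is the one metric where a negative change is the improvement.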


In [None]:
prompt = """
# TASK: Write a TRANSFORM clause using ML.BUCKETIZE on dep_delay to create four delay categories.
# CONTEXT: I want to categorize departure delays into buckets representing delay severity.
# GOAL: Create a feature called delay_category with bucket boundaries [-999, 0, 15, 60, 9999],
# representing 'early_or_on_time', 'minor_delay', 'moderate_delay', and 'major_delay'.
"""
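
To make the intended categories concrete, here is a local Python illustration of the bucketing logic. It is not `ML.BUCKETIZE` itself: it keeps only the inner split points from the prompt (0, 15, 60), and it assumes upper-inclusive buckets so that a delay of exactly 0 counts as on time. How BigQuery ML handles values that fall exactly on a boundary should be checked against its documentation.

```python
from bisect import bisect_left

# Inner split points from the prompt; the -999 and 9999 sentinels are dropped here.
SPLIT_POINTS = [0, 15, 60]
LABELS = ["early_or_on_time", "minor_delay", "moderate_delay", "major_delay"]


def delay_category(dep_delay: float) -> str:
    """Map a departure delay in minutes to a severity label.

    Assumption: buckets are (-inf, 0], (0, 15], (15, 60], (60, +inf).
    """
    return LABELS[bisect_left(SPLIT_POINTS, dep_delay)]


for minutes in (-5, 0, 10, 45, 120):
    print(minutes, delay_category(minutes))
```

Three split points are enough to define four categories, which is why the outer sentinel values are not needed in this local version.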


### 🧾 Lab 6 Summary: Improving Models with Feature Engineering

**Objective:**  
Enhance the logistic-regression flight-diversion model by creating new engineered features in BigQuery ML.

**Key Steps:**  
1. Evaluated baseline model (precision = 0.917, recall = 0.752).  
2. Created engineered feature `origin_carrier_combo` = `CONCAT(origin, "_", carrier)`.  
3. Retrained model and re-evaluated performance.  
4. Compared baseline vs. feature-engineered results to confirm improvement.  
5. Authored Gemini prompt to explore `ML.BUCKETIZE` for future feature creation.

**Outcome:**  
Feature engineering produced a modest but measurable lift in model performance (F1 from 0.826 to 0.839, ROC AUC from 0.971 to 0.978), suggesting that combining categorical fields can reveal more predictive patterns for flight diversions.
