# **Team 6 Classification Model**

In [1]:
# --- Minimal setup (edit 3 vars) ---
from google.colab import auth
auth.authenticate_user()

import os
from google.cloud import bigquery

PROJECT_ID = "mgmt-467-project-1"      # e.g., mgmt-467-47888
REGION     = "us-central1"
TABLE_PATH = "mgmt-467-project-1.flights.kaggle_flight_data"   # or your `bigquery-public-data.flights` table/view

os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"]     = REGION
bq = bigquery.Client(project=PROJECT_ID)

print("BQ Project:", PROJECT_ID)
print("Source table:", TABLE_PATH)


BQ Project: mgmt-467-project-1
Source table: mgmt-467-project-1.flights.kaggle_flight_data


In [2]:
preview_sql = f"SELECT * FROM `{TABLE_PATH}` LIMIT 5"
bq.query(preview_sql).result().to_dataframe()

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,...,Div4WheelsOff,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum
0,1998,2,4,21,2,1998-04-21,DL,19790,DL,N334DL,...,,,,,,,,,,
1,1992,4,10,19,1,1992-10-19,DL,19790,DL,,...,,,,,,,,,,
2,2000,4,10,13,5,2000-10-13,DL,19790,DL,N225DL,...,,,,,,,,,,
3,1996,3,9,29,7,1996-09-29,DL,19790,DL,N331DL,...,,,,,,,,,,
4,1996,4,10,30,3,1996-10-30,DL,19790,DL,N236WA,...,,,,,,,,,,


# **Default Model**

# Task
Create a BigQuery ML `LOGISTIC_REG` model with a `CREATE MODEL` statement to predict `Diverted` (a boolean indicating whether a flight was diverted) from `DepDelay`, `Distance`, `Reporting_Airline`, `Origin`, `Dest`, and `DayOfWeek` in the `mgmt-467-project-1.flights.kaggle_flight_data` table.

In [3]:
query_diverted = f"""SELECT CAST(Diverted AS INT64) AS Diverted, DepDelay, Distance, Reporting_Airline, Origin, Dest, DayOfWeek FROM `{TABLE_PATH}` WHERE Diverted IS NOT NULL AND DepDelay IS NOT NULL"""
df_diverted = bq.query(query_diverted).result().to_dataframe()
print(f"Shape of the prepared data for Diverted: {df_diverted.shape}")
df_diverted.head()

Shape of the prepared data for Diverted: (1963932, 7)


Unnamed: 0,Diverted,DepDelay,Distance,Reporting_Airline,Origin,Dest,DayOfWeek
0,0,-8.0,692.0,DL,ABE,ATL,7
1,0,0.0,692.0,EA,ABE,ATL,3
2,0,-6.0,692.0,EV,ABE,ATL,7
3,0,0.0,692.0,EV,ABE,ATL,7
4,0,-4.0,339.0,XE,ABE,CLE,7


## Create LOGISTIC_REG Model


In [4]:
model_name_diverted = "bqml_diverted_model"
create_model_query_diverted = f"""
CREATE OR REPLACE MODEL `{PROJECT_ID}.flights.{model_name_diverted}`
OPTIONS(
    model_type='LOGISTIC_REG',
    input_label_cols=['Diverted']
)
AS
SELECT
    CAST(Diverted AS INT64) AS Diverted,
    DepDelay,
    Distance,
    Reporting_Airline,
    Origin,
    Dest,
    DayOfWeek
FROM
    `{TABLE_PATH}`
WHERE
    Diverted IS NOT NULL AND DepDelay IS NOT NULL
"""
bq.query(create_model_query_diverted).result()
print(f"BigQuery LOGISTIC_REG model '{model_name_diverted}' created successfully!")

BigQuery LOGISTIC_REG model 'bqml_diverted_model' created successfully!


### Evaluate the LOGISTIC_REG Model with ML.EVALUATE

In [5]:
evaluate_diverted_model_query = f"""SELECT * FROM ML.EVALUATE(MODEL `{PROJECT_ID}.flights.{model_name_diverted}`) """
evaluation_results_diverted = bq.query(evaluate_diverted_model_query).result().to_dataframe()
print("LOGISTIC_REG Model Evaluation Results:")
display(evaluation_results_diverted)

LOGISTIC_REG Model Evaluation Results:


Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.0,0.0,0.998115,0.0,0.013002,0.820762


**Our team immediately noticed that precision and recall were both zero, which points to an extremely rare diversion class: the model reaches 99.8% accuracy while catching none of the diverted flights. The cost of missing a diverted flight and having to scramble to accommodate passengers is far higher than the comparatively small cost of overpreparing. As a result, our team decided to focus on class weighting at 0.5 because it gave us the best results.**
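The combination of zero recall and very high accuracy can be reproduced with a small sketch. The counts below are hypothetical (100,000 flights, 188 of them diverted), chosen only to give a diversion rate near 0.19%; they are not the actual evaluation split. The sketch assumes a classifier that never flags a diversion, and the helper name `classification_metrics` is illustrative:

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumed convention: report 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall


# Hypothetical counts, not taken from the real evaluation data.
n_flights = 100_000
n_diverted = 188

# A model that predicts "not diverted" for every flight:
# no true positives, no false positives, every diversion is missed.
accuracy, precision, recall = classification_metrics(
    tp=0, fp=0, fn=n_diverted, tn=n_flights - n_diverted
)
print(f"accuracy={accuracy:.5f} precision={precision} recall={recall}")
```

Under these assumed counts the always-negative classifier is right 99.8% of the time while catching no diversions, which is why accuracy alone is a misleading metric for this problem.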

# **Class Weights Model**

# Task
Re-create the `LOGISTIC_REG` model with `AUTO_CLASS_WEIGHTS=TRUE` in a BigQuery ML `CREATE MODEL` statement to account for class imbalance, then evaluate the new model and display its classification metrics.
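To give a rough sense of what class weighting does, the sketch below computes weights in inverse proportion to class frequency using the common "balanced" heuristic, weight = N / (K * n_class). This formula is an assumption for illustration; the exact normalization BigQuery ML applies for `AUTO_CLASS_WEIGHTS` is not confirmed here, and the class counts are hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Hypothetical label counts: 0 = not diverted, 1 = diverted.
class_counts = {0: 99_812, 1: 188}

weights = balanced_class_weights(class_counts)

# After weighting, each class contributes the same total weight to the loss.
weighted_totals = {
    label: weights[label] * count for label, count in class_counts.items()
}

for label in class_counts:
    print(label, round(weights[label], 4), round(weighted_totals[label], 1))
```

With weights like these, each diverted flight counts several hundred times more than a non-diverted one during training, so the model can no longer minimize its loss by ignoring the rare class.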

## Re-create LOGISTIC_REG Model with Class Weights



In [None]:
model_name_diverted_weighted = "bqml_diverted_model_weighted"
create_model_query_diverted_weighted = f"""
CREATE OR REPLACE MODEL `{PROJECT_ID}.flights.{model_name_diverted_weighted}`
OPTIONS(
    model_type='LOGISTIC_REG',
    input_label_cols=['Diverted'],
    AUTO_CLASS_WEIGHTS=TRUE
)
AS
SELECT
    CAST(Diverted AS INT64) AS Diverted,
    DepDelay,
    Distance,
    Reporting_Airline,
    Origin,
    Dest,
    DayOfWeek
FROM
    `{TABLE_PATH}`
WHERE
    Diverted IS NOT NULL AND DepDelay IS NOT NULL
"""
bq.query(create_model_query_diverted_weighted).result()
print(f"BigQuery LOGISTIC_REG model '{model_name_diverted_weighted}' created successfully with class weights!")

BigQuery LOGISTIC_REG model 'bqml_diverted_model_weighted' created successfully with class weights!


In [None]:
evaluate_diverted_model_weighted_query = f"""SELECT * FROM ML.EVALUATE(MODEL `{PROJECT_ID}.flights.{model_name_diverted_weighted}`) """
evaluation_results_diverted_weighted = bq.query(evaluate_diverted_model_weighted_query).result().to_dataframe()
print("LOGISTIC_REG Model (with class weights) Evaluation Results:")
display(evaluation_results_diverted_weighted)

LOGISTIC_REG Model (with class weights) Evaluation Results:


Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.004432,0.947368,0.59869,0.008824,0.693004,0.837579


## Interpret Tuned LOGISTIC_REG Model

### Subtask:
Analyze the evaluation results of the weighted model, comparing its precision, recall, and other metrics to the original model's performance, and note any improvement on the `Diverted` class due to `AUTO_CLASS_WEIGHTS`.


### Comparison of LOGISTIC_REG Model Performance (with and without `AUTO_CLASS_WEIGHTS`)

**Original Model (`bqml_diverted_model`):**
```
precision:  0.0
recall:     0.0
accuracy:   0.998115
f1_score:   0.0
log_loss:   0.013002
roc_auc:    0.820762
```

**Weighted Model (`bqml_diverted_model_weighted` with `AUTO_CLASS_WEIGHTS=TRUE`):**
```
precision:  0.004432
recall:     0.947368
accuracy:   0.59869
f1_score:   0.008824
log_loss:   0.693004
roc_auc:    0.837579
```
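As a quick consistency check on the two result sets, F1 can be recomputed from the reported precision and recall with the standard formula F1 = 2PR / (P + R). The function name `f1_from` is illustrative; the inputs are the rounded values shown above, so a small rounding difference from the reported `f1_score` is expected.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded metrics copied from the ML.EVALUATE output above.
f1_original = f1_from(precision=0.0, recall=0.0)
f1_weighted = f1_from(precision=0.004432, recall=0.947368)

print(f"original F1: {f1_original:.6f}")
print(f"weighted F1: {f1_weighted:.6f}")
```

Because F1 is a harmonic mean, the very low precision keeps the weighted model's F1 small even though its recall is close to 1.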

# **Interpretation:**
The original model had very high accuracy, but not where it mattered most: flight diversions. It caught none of the diverted flights in the evaluation set (recall 0.0). The weighted model produces far more false positives (precision 0.004432, accuracy 0.59869), but it catches almost all of the diverted flights (recall 0.947368). In this context, where a missed diversion can cause massive damage to revenue and brand image while a false positive has minimal effect, we believe the weighted model is the superior option. An unexpected diversion can cost anywhere from tens of thousands to hundreds of thousands of dollars, compared with the relatively minimal cost of moving equipment and staging a few extra staff at the gates for a false alarm. Ongoing communication with pilots and air traffic control can further reduce the impact of false positives. As a result, we strongly prefer the risk mitigation and long-term security that the weighted model provides.
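One way to make the cost argument concrete is to convert the weighted model's precision into false alarms per caught diversion, then ask how cheap a false alarm must be for the trade to pay off. The dollar figure below is a hypothetical placeholder, not a number from our data, and the sketch makes the simplifying assumption that catching a diversion avoids its full cost.

```python
# Reported precision of the weighted model (from ML.EVALUATE above).
precision_weighted = 0.004432

# precision = TP / (TP + FP), so FP / TP = (1 - precision) / precision.
false_alarms_per_catch = (1 - precision_weighted) / precision_weighted

# Hypothetical cost of an unanticipated diversion, in dollars.
cost_missed_diversion = 50_000

# The weighted model breaks even when the false alarms raised per caught
# diversion cost as much in total as the diversion they helped anticipate.
break_even_false_alarm_cost = cost_missed_diversion / false_alarms_per_catch

print(f"false alarms per caught diversion: {false_alarms_per_catch:.1f}")
print(f"break-even cost per false alarm: ${break_even_false_alarm_cost:.2f}")
```

At the default threshold this works out to roughly 225 false alarms per caught diversion, so the case for the weighted model depends on each false alarm costing well under the break-even amount for whatever diversion cost is assumed.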