<a href="https://colab.research.google.com/github/TylerWichman/Tyler_Wichman_Portfolio/blob/main/Flight_Delay_Prediction_NN/Flight_Delay_Prediction_Neural_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Flight Arrival Delay Prediction (15+ minutes)

**Summary:** This project develops a neural network to predict whether a flight will arrive 15+ minutes late using only pre-departure information. On a sample of ~2M U.S. flights, the class-weighted network raises late-arrival recall at a 0.4 decision threshold from ~2.5% (unweighted logistic baseline) to ~83%, at ~24% precision. This supports proactive delay-risk screening and earlier operational decision-making.

In [1]:
from google.colab import drive
import pandas as pd
import numpy as np

drive.mount("/content/drive")

file_path = "/content/drive/MyDrive/airline_2m.csv"

df = pd.read_csv(file_path, encoding="latin1")  # Many airline datasets use latin1 encoding

print(df.shape)
df.head()

Mounted at /content/drive




(2000000, 109)


Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,...,Div4WheelsOff,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum
0,1998,1,1,2,5,1998-01-02,NW,19386,NW,N297US,...,,,,,,,,,,
1,2009,2,5,28,4,2009-05-28,FL,20437,FL,N946AT,...,,,,,,,,,,
2,2013,2,6,29,6,2013-06-29,MQ,20398,MQ,N665MQ,...,,,,,,,,,,
3,2010,3,8,31,2,2010-08-31,DL,19790,DL,N6705Y,...,,,,,,,,,,
4,2006,1,1,15,7,2006-01-15,US,20355,US,N504AU,...,,,,,,,,,,


## Data Quality Checks + Target Definition Validation

Before modeling, we:
1. Inspect missingness and duplicates
2. Validate that `ArrDel15` behaves like an **arrival delay ≥ 15 minutes** indicator by checking consistency with `ArrDelayMinutes`

In [2]:
print("Shape:", df.shape)
print("Duplicate rows:", df.duplicated().sum())

df = df.drop_duplicates()



# Fraction of missing values per column, sorted from most to least missing
missing_pct = df.isna().mean().sort_values(ascending=False)
display(missing_pct.head(15))

Shape: (2000000, 109)
Duplicate rows: 0


Unnamed: 0,0
Div5WheelsOn,1.0
Div5TotalGTime,1.0
Div5LongestGTime,1.0
Div5WheelsOff,1.0
Div5TailNum,1.0
Div4TailNum,1.0
Div4WheelsOff,1.0
Div4LongestGTime,1.0
Div4TotalGTime,1.0
Div4WheelsOn,1.0


In [3]:
# Make sure ArrDel15 behaves how it should
valid_rows = df["ArrDelayMinutes"].notna() & df["ArrDel15"].notna()
consistency = ((df.loc[valid_rows, "ArrDelayMinutes"] >= 15).astype(int) == df.loc[valid_rows, "ArrDel15"])

print("ArrDel15 vs ArrDelayMinutes consistency (on rows with both present):")
display(consistency.value_counts(normalize=True))

ArrDel15 vs ArrDelayMinutes consistency (on rows with both present):


Unnamed: 0,proportion
True,1.0


## Handle Missing Data

Neural networks cannot train on inputs that contain missing values, so missing data is handled in four steps:

1. Remove cancelled and diverted flights, which are missing key arrival data
2. Drop columns with >40% missingness
3. Drop any columns that are 100% missing after filtering
4. Impute remaining values:
    - Numeric -> median
    - Categorical -> mode


In [4]:
# Remove cancelled/diverted flights
df = df[(df["Cancelled"] == 0) & (df["Diverted"] == 0)].copy()
print("Shape after removing cancelled/diverted flights:", df.shape)

# Drop columns with >40% missingness
missing_pct = df.isna().mean()
cols_drop_40 = missing_pct[missing_pct > 0.40].index
df = df.drop(columns=cols_drop_40)
print(f"Dropped {len(cols_drop_40)} columns with >40% missingness")

# Drop columns that are entirely missing after filtering
cols_drop_all = df.columns[df.isna().all()].tolist()
df = df.drop(columns=cols_drop_all)

# Impute remaining missing values
numeric_cols = df.select_dtypes(include=["number"]).columns
cat_cols = df.select_dtypes(include=["object", "category", "string"]).columns

# Numeric -> median
for col in numeric_cols:
    if df[col].isna().any():
        df[col] = df[col].fillna(df[col].median())

# Categorical -> mode (most frequent)
for col in cat_cols:
    if df[col].isna().any():
        mode_series = df[col].mode(dropna=True)
        fill_val = mode_series.iloc[0] if len(mode_series) > 0 else "Unknown"
        df[col] = df[col].fillna(fill_val)

print("Total missing values remaining:", int(df.isna().sum().sum()))
print("Final shape:", df.shape)

Shape after removing cancelled/diverted flights: (1958948, 109)
Dropped 54 columns with >40% missingness
Total missing values remaining: 0
Final shape: (1958948, 55)
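
The imputation above computes medians and modes on the full dataset, before any train/validation/test split. A stricter variant learns the fill values from the training rows only and reuses them on the other splits. The following is a minimal sketch of that idea on toy data; the helper name `fit_fill_values` and the toy frames are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Learn one fill value per column from training rows only (hypothetical helper)."""
    fills = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fills[col] = train[col].median()          # numeric -> median
        else:
            modes = train[col].mode(dropna=True)      # categorical -> most frequent value
            fills[col] = modes.iloc[0] if len(modes) > 0 else "Unknown"
    return fills

# Toy stand-ins for a training split and a validation split
train_toy = pd.DataFrame({"Distance": [100.0, None, 300.0], "Origin": ["ATL", "ATL", None]})
val_toy = pd.DataFrame({"Distance": [None, 500.0], "Origin": [None, "ORD"]})

fills = fit_fill_values(train_toy)
val_filled = val_toy.fillna(value=fills)   # validation rows reuse training statistics
print(val_filled)
```

With medians and modes this close to the full-data values, the practical effect on a dataset of this size is probably small, but the pattern keeps validation and test rows from influencing preprocessing.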


## Target, Realism, and Leakage Control

We aim to predict `ArrDel15` (1 = arrival delay ≥ 15 minutes).

For realism, we remove variables that would not be known before departure.
To prevent data leakage, we remove variables that directly encode the outcome.

In [5]:
Target = "ArrDel15"

# Columns that are post-event or directly tied to outcomes
leakage_cols = ["DepTime", "DepDelay", "DepDelayMinutes", "DepDel15", "DepartureDelayGroups",
    "TaxiOut", "WheelsOff", "WheelsOn", "TaxiIn",
    "ArrTime", "ArrDelay", "ArrDelayMinutes", "ArrivalDelayGroups",
    "ActualElapsedTime", "AirTime",
    "Cancelled", "Diverted"]

y = df[Target].astype(int)
X = df.drop(columns=[Target] + leakage_cols)

print("Target distribution:")
print(y.value_counts(normalize=True))


Target distribution:
ArrDel15
0    0.801968
1    0.198032
Name: proportion, dtype: float64


### Feature Selection Rationale

- High-cardinality identifiers (e.g., tail number, flight number, internal IDs) were excluded to reduce dimensionality and limit overfitting, since they add little predictive signal.
- Schedule and calendar features were retained, as they are known prior to departure and are operationally actionable.
- Post-arrival and outcome-derived variables were removed to prevent data leakage and ensure the model reflects a realistic deployment scenario.
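
The cells shown in this notebook do not include the step that drops high-cardinality identifiers, so the following is only a minimal sketch of how such a filter might look. The helper name `drop_high_cardinality`, the 500-level cutoff, and the toy values are assumptions made for illustration, not settings taken from the analysis.

```python
import pandas as pd

def drop_high_cardinality(frame: pd.DataFrame, max_levels: int = 500) -> pd.DataFrame:
    """Drop non-numeric columns with more than `max_levels` distinct values.

    Hypothetical helper; the cutoff is an assumed value, not one from the notebook.
    """
    non_numeric = frame.select_dtypes(exclude=["number"]).columns
    too_wide = [c for c in non_numeric if frame[c].nunique(dropna=True) > max_levels]
    return frame.drop(columns=too_wide)

# Toy frame: one identifier-like column, one low-cardinality column, one numeric column
toy = pd.DataFrame({
    "Tail_Number": [f"N{i:04d}" for i in range(1000)],
    "Reporting_Airline": ["DL", "NW", "US", "MQ"] * 250,
    "Distance": range(1000),
})

reduced = drop_high_cardinality(toy, max_levels=500)
print(list(reduced.columns))
```

A cardinality cutoff like this also bounds the width of the one-hot matrix, which matters when the encoded features are fed to a dense first layer.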


## Baseline: Logistic Regression

Before using a neural network, we train a simple baseline model.
This provides a benchmark for judging whether the neural network adds value.

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Train/Val/Test split
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.30,
    random_state=474,
    stratify=y
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.50,
    random_state=474,
    stratify=y_temp
)

print("Train:", X_train.shape, y_train.mean())
print("Val:  ", X_val.shape, y_val.mean())
print("Test: ", X_test.shape, y_test.mean())

# Identify feature types
numeric_features = X_train.select_dtypes(include=["number"]).columns.tolist()
categorical_features = X_train.select_dtypes(include=["object", "category", "string"]).columns.tolist()

print("Numeric features:", len(numeric_features))
print("Categorical features:", len(categorical_features))

# Preprocessing
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(with_mean=False), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), categorical_features),
    ],
    remainder="drop",
    sparse_threshold=1.0
)

# Fit on train/Transform
X_train_proc = preprocess.fit_transform(X_train)
X_val_proc   = preprocess.transform(X_val)
X_test_proc  = preprocess.transform(X_test)

Train: (1371263, 37) 0.19803203324234667
Val:   (293842, 37) 0.19803159521103178
Test:  (293843, 37) 0.19803432445217345
Numeric features: 23
Categorical features: 14


In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report

baseline_model = LogisticRegression(max_iter=1000)

baseline_model.fit(X_train_proc, y_train)

val_proba = baseline_model.predict_proba(X_val_proc)[:, 1]

val_auc = roc_auc_score(y_val, val_proba)
print("Baseline Validation ROC-AUC:", round(val_auc, 4))

val_pred = (val_proba >= 0.5).astype(int)
print("\nBaseline Classification Report:")
print(classification_report(y_val, val_pred))

Baseline Validation ROC-AUC: 0.6394

Baseline Classification Report:
              precision    recall  f1-score   support

           0       0.80      1.00      0.89    235652
           1       0.47      0.00      0.00     58190

    accuracy                           0.80    293842
   macro avg       0.63      0.50      0.45    293842
weighted avg       0.74      0.80      0.71    293842



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
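
The baseline above is fitted without class weights and evaluated at a 0.5 cutoff, which likely contributes to its near-zero recall on late flights. As a minimal sketch on synthetic data (not the flight data), the snippet below shows how passing `class_weight="balanced"` to `LogisticRegression` shifts the decision boundary toward the minority class. The synthetic feature, sample size, and seed are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.RandomState(474)

# Synthetic stand-in: roughly 20% positives and one weakly informative feature
n = 20_000
y_toy = (rng.rand(n) < 0.20).astype(int)
X_toy = (rng.randn(n) + 0.5 * y_toy).reshape(-1, 1)

# Same model family, with and without balanced class weights
plain = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_toy, y_toy)

# Recall on the positive class at the default 0.5 cutoff
recall_plain = recall_score(y_toy, plain.predict(X_toy))
recall_weighted = recall_score(y_toy, weighted.predict(X_toy))
print("Recall, unweighted:", round(recall_plain, 3))
print("Recall, balanced:  ", round(recall_weighted, 3))
```

Fitting the flight-data baseline this way would put it on the same footing as the class-weighted neural network when recall is compared at a fixed threshold.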


## Handle Class Imbalance

Late flights (~20%) are the minority class.
Without correction, a neural network can learn the shortcut:
“predict on-time for everything.”

We compute class weights so misclassifying late flights is penalized more.


In [9]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

classes = np.array([0, 1])
class_weights = compute_class_weight(
    class_weight="balanced",
    classes=classes,
    y=y_train
)

class_weight_dict = {0: class_weights[0], 1: class_weights[1]}
print("Class weights:", class_weight_dict)

Class weights: {0: np.float64(0.6234662988117766), 1: np.float64(2.5248440457514896)}
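
The weights printed above follow scikit-learn's documented "balanced" heuristic, `n_samples / (n_classes * count_per_class)`. As a quick check, the sketch below reproduces them from the training-set size and positive rate printed by the split cell. The per-class counts are reconstructed from that printed rate, so treat them as approximate.

```python
import numpy as np

n_samples = 1_371_263            # training rows, from the split output
pos_rate = 0.19803203324234667   # y_train.mean(), from the split output

# Reconstruct per-class counts from the printed positive rate
n_pos = round(n_samples * pos_rate)
n_neg = n_samples - n_pos
counts = np.array([n_neg, n_pos])

# Balanced heuristic: n_samples / (n_classes * count_per_class), with 2 classes
weights = n_samples / (2 * counts)
print(weights)
```

Because the weight is inversely proportional to class frequency, each class contributes roughly the same total weight to the loss.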


## Neural Network Model

We use a simple feed-forward network:
- Dense layers with ReLU
- Dropout for regularization
- Sigmoid output for binary classification

Early stopping is used to prevent overfitting.

In [10]:
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
import numpy as np

tf.random.set_seed(474)
np.random.seed(474)

input_dim = X_train_proc.shape[1]

nn_model = keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(1, activation="sigmoid")
])

nn_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=[keras.metrics.AUC(name="auc")]
)

nn_model.summary()


### Model Architecture Design Choice

An embedding-based neural network was considered for handling high-cardinality categorical features.  
Given the scale of the dataset (~2 million rows), Colab memory constraints, and the goal of rapid prototyping, a sparse one-hot representation was used instead.


## Train Neural Network

We train using:
- Validation monitoring
- Early stopping (restore best weights)
- Class weights to address imbalance

In [11]:
# Train on a random 300,000-row subset of the training data for speed
sample_n = 300_000
rng = np.random.RandomState(474)
idx = rng.choice(X_train_proc.shape[0], size=sample_n, replace=False)

X_train_small = X_train_proc[idx]
y_train_small = y_train.iloc[idx].values

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_auc",
    mode="max",
    patience=2,
    restore_best_weights=True
)

history = nn_model.fit(
    X_train_small, y_train_small,
    validation_data=(X_val_proc, y_val),
    epochs=10,
    batch_size=1024,
    class_weight=class_weight_dict,
    callbacks=[early_stop],
    verbose=1)

Epoch 1/10
293/293 ━━━━━━━━━━━━━━━━━━━━ 55s 185ms/step - auc: 0.5443 - loss: 0.6997 - val_auc: 0.6231 - val_loss: 0.6924
Epoch 2/10
293/293 ━━━━━━━━━━━━━━━━━━━━ 81s 181ms/step - auc: 0.6116 - loss: 0.6756 - val_auc: 0.6309 - val_loss: 0.6814
Epoch 3/10
293/293 ━━━━━━━━━━━━━━━━━━━━ 53s 180ms/step - auc: 0.6336 - loss: 0.6670 - val_auc: 0.6393 - val_loss: 0.7052
Epoch 4/10
293/293 ━━━━━━━━━━━━━━━━━━━━ 82s 182ms/step - auc: 0.6609 - loss: 0.6531 - val_auc: 0.6453 - val_loss: 0.6755
Epoch 5/10
293/293 ━━━━━━━━━━━━━━━━━━━━ 83s 184ms/step - auc: 0.6878 - loss: 0.6362 - val_auc: 0.6430 - val_loss: 0.6743
Epoch 6/10
293/293 ━━━━━━━━━━━━━━━━━━━━ 53s 182ms/step - auc: 0.6964 - loss: 0.6255 - val_auc: 0.6408 - val_loss: 0.6349


In [12]:
from sklearn.metrics import roc_auc_score, classification_report

nn_val_proba = nn_model.predict(X_val_proc, batch_size=4096).ravel()
nn_val_auc = roc_auc_score(y_val, nn_val_proba)

print("Neural Net Validation ROC-AUC:", round(nn_val_auc, 4))
print("Baseline Validation ROC-AUC:", round(val_auc, 4))
print("Delta (NN - Baseline):", round(nn_val_auc - val_auc, 4))

for t in [0.20, 0.30, 0.40, 0.50]:
    preds = (nn_val_proba >= t).astype(int)
    print(f"\nThreshold = {t}")
    print(classification_report(y_val, preds, digits=3))

72/72 ━━━━━━━━━━━━━━━━━━━━ 22s 300ms/step
Neural Net Validation ROC-AUC: 0.6453
Baseline Validation ROC-AUC: 0.6394
Delta (NN - Baseline): 0.0059

Threshold = 0.2
              precision    recall  f1-score   support

           0      0.938     0.017     0.034    235652
           1      0.200     0.995     0.333     58190

    accuracy                          0.211    293842
   macro avg      0.569     0.506     0.184    293842
weighted avg      0.792     0.211     0.093    293842


Threshold = 0.3
              precision    recall  f1-score   support

           0      0.914     0.132     0.231    235652
           1      0.213     0.950     0.348     58190

    accuracy                          0.294    293842
   macro avg      0.563     0.541     0.289    293842
weighted avg      0.775     0.294     0.254    293842


Threshold = 0.4
              precision    recall  f1-score   support

           0      0.888     0.333     0.484    235652
      

In [17]:
from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

# Logistic Regression
log_proba = val_proba
log_auc = roc_auc_score(y_val, log_proba)
log_pred = (log_proba >= 0.40).astype(int)
log_prec, log_rec, log_f1, _ = precision_recall_fscore_support(
    y_val, log_pred, average="binary", zero_division=0
)

# Neural Network
nn_proba = nn_val_proba
nn_auc = roc_auc_score(y_val, nn_proba)
nn_pred = (nn_proba >= 0.40).astype(int)
nn_prec, nn_rec, nn_f1, _ = precision_recall_fscore_support(
    y_val, nn_pred, average="binary", zero_division=0
)

# Comparison Table
comparison = pd.DataFrame({
    "Model": ["Logistic Regression", "Neural Network"],
    "ROC_AUC": [log_auc, nn_auc],
    "Precision (@0.4)": [log_prec, nn_prec],
    "Recall (@0.4)": [log_rec, nn_rec],
    "F1 (@0.4)": [log_f1, nn_f1]
})

display(comparison)

Unnamed: 0,Model,ROC_AUC,Precision (@0.4),Recall (@0.4),F1 (@0.4)
0,Logistic Regression,0.639378,0.41657,0.024626,0.046503
1,Neural Network,0.645282,0.23518,0.830727,0.36658


## Final Insights & Takeaways

### Key Findings

- **Ranking Performance:**  
  The neural network achieved a higher validation ROC-AUC (0.645 vs 0.639), indicating slightly better overall ranking of late vs on-time flights. The gain is small (about 0.006).

- **Class Imbalance Matters:**  
  At a 0.4 decision threshold, the logistic regression model identified very few late flights (recall ≈ 2.5%), effectively predicting “on-time” for most observations. Because the baseline was trained without class weights while the neural network used them, this gap reflects how the imbalance was handled at least as much as any limitation of linear models.

- **Operational Tradeoffs:**  
  The neural network captured about 83% of late arrivals at the same threshold, at the cost of lower precision (≈ 0.24 vs ≈ 0.42 for the baseline). This operating point suits settings where missing a delay costs more than issuing additional warnings.

- **Threshold Selection Is Situation-Dependent:**  
  Accuracy alone is not an appropriate metric for this problem, since predicting “on-time” for every flight already scores about 80%. Instead, decision thresholds should be selected based on the objective:
  - Lower thresholds -> prioritize recall, capturing most potential delays at the cost of more false positives (useful for early risk screening and proactive alerts).
  - Higher thresholds -> prioritize precision, flagging fewer flights but with higher confidence (useful when intervention resources are limited).
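
One way to make the threshold choice concrete is to fix a target recall and read off the highest cutoff that still achieves it. The sketch below does this with NumPy on synthetic scores. The helper name `threshold_for_recall`, the 0.80 target, and the simulated scores are illustrative assumptions, not results from the model above.

```python
import numpy as np

def threshold_for_recall(y_true, scores, target_recall):
    """Return the highest score cutoff whose recall is at least `target_recall`.

    Predictions are formed as `scores >= cutoff` (hypothetical helper).
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = np.argsort(-scores)                # highest scores first
    tp_cum = np.cumsum(y_true[order])          # true positives when flagging the top-k rows
    recall = tp_cum / y_true.sum()             # non-decreasing in k
    k = int(np.searchsorted(recall, target_recall, side="left"))  # first k reaching the target
    return float(scores[order][k])

# Synthetic scores: positives score slightly higher on average than negatives
rng = np.random.RandomState(474)
y_toy = (rng.rand(10_000) < 0.20).astype(int)
scores_toy = np.clip(0.4 + 0.1 * y_toy + 0.15 * rng.randn(10_000), 0, 1)

cutoff = threshold_for_recall(y_toy, scores_toy, target_recall=0.80)
flagged = scores_toy >= cutoff
true_pos = (flagged & (y_toy == 1)).sum()
recall_at_cutoff = true_pos / y_toy.sum()
precision_at_cutoff = true_pos / flagged.sum()
print(f"cutoff={cutoff:.3f} recall={recall_at_cutoff:.3f} precision={precision_at_cutoff:.3f}")
```

Applied to validation scores such as `nn_val_proba`, the same routine would return a cutoff tied to an explicit recall target instead of a hand-picked value, and the chosen cutoff could then be checked once on the held-out test set.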

### Interpretation

> The neural network provides an operationally useful model for early delay risk detection, especially when the objective is to proactively identify flights likely to arrive late rather than maximize overall accuracy.



### Limitations & Next Steps

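- The model uses only schedule-based, pre-departure features and does not yet incorporate real-time factors such as weather or air-traffic conditions.
- The network was trained on a 300,000-row subset of the training data, and the logistic baseline reached its iteration limit before converging, so both results may understate what each model can achieve.
- A natural next step is to integrate weather data at the origin and destination airports.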
