# Flight Delay Prediction with Weather-Aware ML Models

## Project Details

#### Phase Leader Plan
| Week 1/Phase 1          | Week 2        | Week 3/Phase 2   | Week 4        | Week 5/Phase 3            |
|-----------------|---------------|----------|---------------|-------------------|
| Siddharth Manu  | Paul Lin | Emily Lieske | Connor Watson | Indri Adisoemarta |

### Team
<table>
  <tr>
    <td align="center" style="padding:12px;">
      <img src="https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/emily_lieske.jpeg?raw=true" width="110" style="border-radius:12px; object-fit:cover;">
      <div style="font-weight:600; margin-top:8px;">Emily Lieske</div>
      <a href="mailto:emily-lieske@berkeley.edu">emily-lieske@berkeley.edu</a>
    </td>
    <td align="center" style="padding:12px;">
      <img src="https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/siddharth_manu.jpg?raw=true" width="110" style="border-radius:12px; object-fit:cover;">
      <div style="font-weight:600; margin-top:8px;">Siddharth Manu</div>
      <a href="mailto:siddharthmanu@berkeley.edu">siddharthmanu@berkeley.edu</a>
    </td>
    <td align="center" style="padding:12px;">
      <img src="https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/indri_Adisoemarta.jpg?raw=true" width="110" style="border-radius:12px; object-fit:cover;">
      <div style="font-weight:600; margin-top:8px;">Indri Adisoemarta</div>
      <a href="mailto:indri.a@berkeley.edu">indri.a@berkeley.edu</a>
    </td>
    <td align="center" style="padding:12px;">
      <img src="/Workspace/Users/paul.lin@berkeley.edu/images/paul.jpg" width="110" style="border-radius:12px; object-fit:cover;">
      <div style="font-weight:600; margin-top:8px;">Paul Lin</div>
      <a href="mailto:paul.lin@berkeley.edu">paul.lin@berkeley.edu</a>
    </td>
    <td align="center" style="padding:12px;">
      <img src="https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/conner-watson.jpg?raw=true" width="110" style="border-radius:12px; object-fit:cover;">
      <div style="font-weight:600; margin-top:8px;">Connor Watson</div>
      <a href="mailto:connorwatson@berkeley.edu">connorwatson@berkeley.edu</a>
    </td>
  </tr>
</table>

### Credit Assignment for Phase 1
| **Team Member** | **Tasks & Deliverables** |
|-----------------|--------------------------|
| **Emily Lieske** | Abstract, Methods/metrics section initial draft, Linear Regression methodology |
| **Indri Adisoemarta** | Data dictionary, Train/test/validation split and missing data initial draft, EDA visualizations (correlation heatmap, histograms), XGBoost methodology |
| **Siddharth Manu** | Data sources documentation, Data pipeline design, Domain-specific metrics, Custom join prototype on 3-month sample, Checkpointing strategy, Conclusion section |
| **Connor Watson** | Project plan and Gantt diagram, Pipeline steps and GitHub version control, Time series methodology (ARIMA/SARIMA), Graph network features (centrality, connecting flights) |
| **Paul Lin** | Data section integration, Missing data and duplicate values, EDA visualizations (time series, map), Ensemble model methodology |

### Project Plan

| **Team Member** | **Responsibilities** |
|-----------------|----------------------|
| **Emily Lieske** | Abstract and methods section, linear regression methodology including feature selection, baseline model establishment, graph network features (with Connor), Phase 2 report submission, final report writing and polishing (with Paul and Indri) |
| **Indri Adisoemarta** | Data dictionary and feature descriptions, missing data and null analysis, EDA visualizations (correlations, histograms) (with Paul), XGBoost methodology and implementation, model performance evaluation framework, final report submission |
| **Siddharth Manu** | Data sources documentation, data pipeline architecture design, custom join prototype and scaling to full 2015-2021 dataset, checkpointing strategy, deep neural network (MLP) implementation, evaluation & final report |
| **Connor Watson** | Project plan and Gantt diagram, pipeline steps and GitHub version control, time-series feature engineering (ARIMA/SARIMA), graph network features (centrality, connecting flights) with Emily, ML pipeline scalability setup with Spark, model integration into production pipeline |
| **Paul Lin** | Data section integration and compilation, ensemble model methodology, data cleaning and imputation with leakage controls (with Indri), ensemble model implementation and hyperparameter tuning, final report |

[Gantt Sheet Link](https://docs.google.com/spreadsheets/d/17sUCCqbSnw993flvm4h1CdE2uG1IRUXsBQCVW_c5yd4/edit?gid=1698796238#gid=1698796238)

<img src='https://raw.githubusercontent.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/master/report-images/project_gantt.png' style="width:100%;">


## Abstract
Flight delays cost airlines billions annually through crew scheduling disruptions, gate misallocations, and passenger compensation, while also causing traveler frustration and missed connections. This project aims to predict departure delay, in minutes, two hours before scheduled departure, enabling proactive operational decisions for airlines and better trip planning for travelers.

We use the U.S. Department of Transportation (DOT) TranStats On-Time Performance dataset combined with hourly weather observations from NOAA. Initial exploratory data analysis on a three-month sample revealed a highly right-skewed delay distribution: the median delay is -1 minute, while the mean is 10 minutes with a standard deviation of 38 minutes. In other words, at least half of flights depart on time or early, but delays are substantial when they occur. Additionally, 80% of flights have delays of 15 minutes or less, which creates significant class imbalance once delays are bucketed. Temporal patterns show that Monday and Sunday flights experience higher average delays than mid-week flights, suggesting that temporal features, alongside the planned network features, will be important predictors.

We frame this as a regression problem predicting continuous delay times, evaluated using MSE, RMSE, and MAE. Predictions will be post-processed into delay buckets and assessed using precision, recall, and F1-score, plus domain-specific metrics: On-Time Performance Prediction Accuracy and Severe Delay Detection Rate. Our incremental approach will start with simple linear regression on raw features, then progressively add feature engineering (temporal imputation, ARIMA/SARIMA time-series features, graph network centrality) and sophisticated models (Lasso regularization, XGBoost, MLP neural networks). Given the four-week timeline, we will implement data leakage controls incrementally, starting with simple imputation to unblock development, then modularly swapping in leak-free implementations. 

We will use time-based cross-validation and track performance regressions as we add stricter controls, ensuring offline metrics reflect production performance.



### Data Size and Source


##### 1. Flights Data

We plan to use the U.S. Department of Transportation (DOT) TranStats On-Time Performance (OTP) dataset, which contains detailed records of U.S. passenger flights from 2015 to 2021.
The dataset includes operational details such as departure and arrival times, carrier information, delays, and airport identifiers.

Source: U.S. DOT TranStats – On-Time Performance

Dimensions (2015–2019): 31,746,841 rows × 109 columns


##### 2. Weather Data

Weather conditions are a critical factor influencing flight delays.
We plan to use hourly weather observations from the National Oceanic and Atmospheric Administration (NOAA), covering January 2015 – December 2021.
This dataset contains temperature, wind, visibility, and precipitation data from global weather stations.

Source: NOAA Global Hourly Weather Repository

Dimensions (2015–2019): 630,904,436 rows × 177 columns

##### 3. Airport Metadata

We plan to incorporate airport-level metadata to enhance the flight dataset with geographic and administrative context.

Source: U.S. Department of Transportation

Dimensions: 18,097 rows × 10 columns


##### 4. Airport Codes

To link airports between datasets, we plan to use an external airport-code reference table containing both IATA (3-letter) and ICAO (4-letter) codes.
This mapping will allow us to join flight records (which use IATA codes) with weather and airport datasets (which may use ICAO codes).

Source: DataHub – Global Airport Codes

Purpose: IATA/ICAO code harmonization for join operations
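
The code harmonization described above can be sketched in plain Python. This is a minimal illustration, not our pipeline code: the field names `iata_code` and `icao_code` and the sample rows (including the made-up `KXYZ` entry) are assumptions, while `ORIGIN`, `DEST`, `origin_icao`, and `dest_icao` follow the data dictionary. At full scale the same logic would be a join in Spark rather than a dictionary lookup.

```python
# Hypothetical rows from the airport-code reference table (field names are assumptions).
airport_codes = [
    {"iata_code": "SFO", "icao_code": "KSFO"},
    {"iata_code": "JFK", "icao_code": "KJFK"},
    {"iata_code": None, "icao_code": "KXYZ"},  # made-up row: no IATA code available
]


def build_iata_to_icao(rows):
    """Map each IATA code to its ICAO code, skipping rows that lack either code."""
    return {
        r["iata_code"]: r["icao_code"]
        for r in rows
        if r["iata_code"] and r["icao_code"]
    }


def attach_icao(flights, lookup):
    """Add origin_icao / dest_icao to each flight; unmatched codes become None."""
    return [
        {
            **f,
            "origin_icao": lookup.get(f["ORIGIN"]),
            "dest_icao": lookup.get(f["DEST"]),
        }
        for f in flights
    ]


iata_to_icao = build_iata_to_icao(airport_codes)
flights = [{"ORIGIN": "SFO", "DEST": "JFK"}, {"ORIGIN": "SFO", "DEST": "ZZZ"}]
joined = attach_icao(flights, iata_to_icao)
```

Keeping unmatched codes as `None` instead of dropping the flight makes it easy to count how many records fail the mapping before deciding how to handle them.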

### Data Dictionary

Below is a subset of the data we determined relevant for modeling. See **Missing and Duplicate Values Analysis** for details on how these columns were chosen.

| Column Name                 | Data Type   | Null Percentage | Description                                                                          | Data Source   | Notes            |
| --------------------------- | ----------- | --------------- | ------------------------------------------------------------------------------------ | ------------- | ---------------- |
| WEATHER_DELAY               | Double      | 79.58594597     | Weather delay in minutes                                                             | BTS Flight    | outcome variable |
| LATE_AIRCRAFT_DELAY         | Double      | 79.58594597     | Late aircraft delay in minutes                                                       | BTS Flight    | outcome variable |
| NAS_DELAY                   | Double      | 79.58594597     | National Air System delay in minutes                                                 | BTS Flight    | outcome variable |
| SECURITY_DELAY              | Double      | 79.58594597     | Security delay in minutes                                                            | BTS Flight    | outcome variable |
| CARRIER_DELAY               | Double      | 79.58594597     | Carrier delay in minutes (within airline's control)                                  | BTS Flight    | outcome variable |
| HourlyPressureChange        | Double      | 66.36082157     | Pressure change in hectopascals over 3 hours                                         | NOAA Weather  |                  |
| HourlyPrecipitation         | Double      | 11.34809468     | Liquid precipitation depth in mm (scaled by 10)                                      | NOAA Weather  |                  |
| HourlySeaLevelPressure      | Double      | 10.95048178     | Air pressure at sea level in hectopascals (scaled by 10)                             | NOAA Weather  |                  |
| HourlyAltimeterSetting      | Double      | 4.510965396     | Altimeter setting for aviation                                                       | NOAA Weather  |                  |
| ARR_DELAY                   | Double      | 3.325405338     | Arrival delay in minutes - negative = early arrival                                  | BTS Flight    | outcome variable |
| AIR_TIME                    | Double      | 3.325405338     | Flight time in minutes (wheels off to wheels on)                                     | BTS Flight    | outcome variable |
| ACTUAL_ELAPSED_TIME         | Double      | 3.325405338     | Actual elapsed time of flight in minutes                                             | BTS Flight    | outcome variable |
| ARR_DELAY_NEW               | Double      | 3.325405338     | Arrival delay in minutes - early arrivals set to 0                                   | BTS Flight    | outcome variable |
| ARR_DEL15                   | Integer     | 3.325405338     | Arrival delay indicator: 15+ minutes (1=Yes, 0=No)                                   | BTS Flight    | outcome variable |
| ARR_DELAY_GROUP             | Integer     | 3.325405338     | Arrival delay intervals in 15-minute increments                                      | BTS Flight    | outcome variable |
| WHEELS_ON                   | Integer     | 3.162349798     | Wheels on time (local time: HHMM)                                                    | BTS Flight    |                  |
| ARR_TIME                    | Integer     | 3.162349798     | Actual arrival time (local time: HHMM)                                               | BTS Flight    | outcome variable |
| TAXI_IN                     | Double      | 3.162349798     | Taxi in time in minutes (wheels on to gate)                                          | BTS Flight    |                  |
| TAXI_OUT                    | Double      | 3.076932957     | Taxi out time in minutes (gate to wheels off)                                        | BTS Flight    |                  |
| WHEELS_OFF                  | Integer     | 3.076932957     | Wheels off time (local time: HHMM)                                                   | BTS Flight    |                  |
| DEP_DELAY_NEW               | Double      | 3.018918011     | Departure delay in minutes - early departures set to 0                               | BTS Flight    | outcome variable |
| DEP_DELAY_GROUP             | Integer     | 3.018918011     | Departure delay intervals in 15-minute increments                                    | BTS Flight    | outcome variable |
| DEP_DELAY                   | Double      | 3.018918011     | Departure delay in minutes - negative = early departure                              | BTS Flight    | outcome variable |
| DEP_DEL15                   | Integer     | 3.018918011     | Departure delay indicator: 15+ minutes (1=Yes, 0=No)                                 | BTS Flight    | outcome variable |
| DEP_TIME                    | Integer     | 3.018918011     | Actual departure time (local time: HHMM)                                             | BTS Flight    | outcome variable |
| HourlySkyConditions         | String      | 2.346429869     | Cloud coverage and sky condition codes                                               | NOAA Weather  |                  |
| HourlyWetBulbTemperature    | Double      | 0.73200163      | Wet bulb temperature in °C (scaled by 10)                                           | NOAA Weather  |                  |
| HourlyStationPressure       | Double      | 0.6867599616    | Station pressure in hectopascals                                                     | NOAA Weather  |                  |
| TAIL_NUM                    | String      | 0.5842169374    | Aircraft tail number                                                                 | BTS Flight    |                  |
| HourlyWindDirection         | Double      | 0.4332210855    | Wind direction angle in degrees (1-360, 999=missing)                                 | NOAA Weather  |                  |
| HourlyRelativeHumidity      | Double      | 0.3368863028    | Relative humidity percentage                                                         | NOAA Weather  |                  |
| HourlyWindSpeed             | Double      | 0.3315343705    | Wind speed rate in m/s (scaled by 10)                                                | NOAA Weather  |                  |
| HourlyDewPointTemperature   | Double      | 0.3263251563    | Dew point temperature in °C (scaled by 10)                                           | NOAA Weather  |                  |
| HourlyDryBulbTemperature    | Double      | 0.3180475009    | Air temperature in °C (scaled by 10)                                                 | NOAA Weather  |                  |
| HourlyVisibility            | Double      | 0.2916446345    | Horizontal visibility distance in meters                                             | NOAA Weather  |                  |
| REM                         | String      | 0.040246531     | Remarks - additional plain language or coded weather information                     | NOAA Weather  |                  |
| CRS_ELAPSED_TIME            | Double      | 0.000142718     | CRS (scheduled) elapsed time of flight in minutes                                    | BTS Flight    |                  |
| origin_station_lon          | Double      | 0               | Longitude of weather station for origin                                              | Airport Codes |                  |
| FLIGHTS                     | Integer     | 0               | Number of flights                                                                    | BTS Flight    |                  |
| SOURCE                      | String      | 0               | Data source flag indicating origin                                                   | NOAA Weather  |                  |
| CRS_DEP_TIME                | Integer     | 0               | CRS (scheduled) departure time (local time: HHMM)                                    | BTS Flight    |                  |
| ORIGIN_WAC                  | Integer     | 0               | Origin airport World Area Code                                                       | BTS Flight    |                  |
| CRS_ARR_TIME                | Integer     | 0               | CRS (scheduled) arrival time (local time: HHMM)                                      | BTS Flight    |                  |
| dest_station_lon            | Double      | 0               | Longitude of weather station for destination                                         | Airport Codes |                  |
| CANCELLED                   | Integer     | 0               | Cancelled flight indicator (1=Yes, 0=No)                                             | BTS Flight    | outcome variable |
| origin_region               | String      | 0               | Origin airport region/state code                                                     | Airport Codes |                  |
| DEST_WAC                    | Integer     | 0               | Destination airport World Area Code                                                  | BTS Flight    |                  |
| LATITUDE                    | Double      | 0               | Latitude coordinate in angular degrees (scaled by 1000)                              | NOAA Weather  |                  |
| DEP_TIME_BLK                | String      | 0               | CRS departure time block in hourly intervals                                         | BTS Flight    |                  |
| DAY_OF_MONTH                | Integer     | 0               | Day of month                                                                         | BTS Flight    |                  |
| DIVERTED                    | Integer     | 0               | Diverted flight indicator (1=Yes, 0=No)                                              | BTS Flight    |                  |
| NAME                        | String      | 0               | Weather station call letter identifier                                               | NOAA Weather  |                  |
| ARR_TIME_BLK                | String      | 0               | CRS arrival time block in hourly intervals                                           | BTS Flight    |                  |
| dest_airport_name           | String      | 0               | Full name of destination airport                                                     | Airport Codes |                  |
| DEST_STATE_NM               | String      | 0               | Destination airport state name                                                       | BTS Flight    |                  |
| ELEVATION                   | Double      | 0               | Elevation relative to Mean Sea Level in meters                                       | NOAA Weather  |                  |
| DEST                        | String      | 0               | Destination airport code (IATA)                                                      | BTS Flight    |                  |
| LONGITUDE                   | Double      | 0               | Longitude coordinate in angular degrees (scaled by 1000)                             | NOAA Weather  |                  |
| origin_station_lat          | Double      | 0               | Latitude of weather station for origin                                               | Airport Codes |                  |
| dest_station_lat            | Double      | 0               | Latitude of weather station for destination                                          | Airport Codes |                  |
| FL_DATE                     | String/Date | 0               | Flight date (YYYYMMDD)                                                               | BTS Flight    |                  |
| sched_depart_date_time      | Timestamp   | 0               | Scheduled departure date-time in local timezone                                      | Derived       |                  |
| dest_station_name           | String      | 0               | Weather station name nearest to destination                                          | Airport Codes |                  |
| DATE                        | String      | 0               | Geophysical-point-observation date-time (YYYYMMDD HHMM)                              | NOAA Weather  |                  |
| DEST_AIRPORT_ID             | Integer     | 0               | Destination airport ID - unique key assigned by DOT                                  | BTS Flight    |                  |
| dest_airport_lat            | Double      | 0               | Latitude of destination airport                                                      | Airport Codes |                  |
| dest_icao                   | String      | 0               | Destination airport 4-letter ICAO code                                               | Airport Codes |                  |
| dest_type                   | String      | 0               | Destination airport type                                                             | Airport Codes |                  |
| OP_CARRIER_FL_NUM           | Integer     | 0               | Flight number                                                                        | BTS Flight    |                  |
| dest_airport_lon            | Double      | 0               | Longitude of destination airport                                                     | Airport Codes |                  |
| origin_icao                 | String      | 0               | Origin airport 4-letter ICAO code                                                    | Airport Codes |                  |
| dest_station_dis            | Double      | 0               | Distance between destination airport and weather station                             | Airport Codes |                  |
| DEST_CITY_MARKET_ID         | Integer     | 0               | Destination city market ID                                                           | BTS Flight    |                  |
| REPORT_TYPE                 | String      | 0               | Type of observation (METAR, SYNOP, AUTO, etc.)                                       | NOAA Weather  |                  |
| dest_station_id             | String      | 0               | NOAA weather station identifier for destination                                      | Airport Codes |                  |
| STATION                     | String      | 0               | Fixed-weather-station USAF master station catalog identifier                         | NOAA Weather  |                  |
| QUARTER                     | Integer     | 0               | Quarter (1-4) of flight                                                              | BTS Flight    |                  |
| dest_region                 | String      | 0               | Destination airport region/state code                                                | Airport Codes |                  |
| origin_type                 | String      | 0               | Origin airport type                                                                  | Airport Codes |                  |
| two_hours_prior_depart_UTC  | Timestamp   | 0               | Two hours prior to scheduled departure in UTC                                        | Derived       |                  |
| OP_CARRIER                  | String      | 0               | IATA code for carrier - not always unique over time                                  | BTS Flight    |                  |
| sched_depart_date_time_UTC  | Timestamp   | 0               | Scheduled departure date-time in UTC                                                 | Derived       |                  |
| origin_airport_lat          | Double      | 0               | Latitude of origin airport                                                           | Airport Codes |                  |
| four_hours_prior_depart_UTC | Timestamp   | 0               | Four hours prior to scheduled departure in UTC                                       | Derived       |                  |
| OP_UNIQUE_CARRIER           | String      | 0               | Unique carrier code - when same code used by multiple carriers, numeric suffix added | BTS Flight    |                  |
| origin_station_name         | String      | 0               | Weather station name nearest to origin airport                                       | Airport Codes |                  |
| ORIGIN_CITY_NAME            | String      | 0               | Origin airport city name                                                             | BTS Flight    |                  |
| DISTANCE_GROUP              | Integer     | 0               | Distance intervals in 250-mile increments                                            | BTS Flight    |                  |
| MONTH                       | Integer     | 0               | Month of flight                                                                      | BTS Flight    |                  |
| origin_station_dis          | Double      | 0               | Distance between origin airport and weather station                                  | Airport Codes |                  |
| DEST_STATE_FIPS             | String      | 0               | Destination airport state FIPS code                                                  | BTS Flight    |                  |
| dest_iata_code              | String      | 0               | Destination airport 3-letter IATA code                                               | Airport Codes |                  |
| ORIGIN_AIRPORT_SEQ_ID       | Integer     | 0               | Origin airport sequence ID - unique key for time-specific airport info               | BTS Flight    |                  |
| origin_airport_lon          | Double      | 0               | Longitude of origin airport                                                          | Airport Codes |                  |
| ORIGIN_AIRPORT_ID           | Integer     | 0               | Origin airport ID - unique key assigned by DOT                                       | BTS Flight    |                  |
| origin_airport_name         | String      | 0               | Full name of origin airport                                                          | Airport Codes |                  |
| ORIGIN                      | String      | 0               | Origin airport code (IATA)                                                           | BTS Flight    |               |
| origin_station_id           | String      | 0               | NOAA weather station identifier for origin airport                                   | Airport Codes |                  |
| DEST_AIRPORT_SEQ_ID         | Integer     | 0               | Destination airport sequence ID                                                      | BTS Flight    |                  |
| DISTANCE                    | Double      | 0               | Distance between airports in miles                                                   | BTS Flight    |                  |
| ORIGIN_STATE_NM             | String      | 0               | Origin airport state name                                                            | BTS Flight    |                  |
| origin_iata_code            | String      | 0               | Origin airport 3-letter IATA code                                                    | Airport Codes |               |
| YEAR                        | Integer     | 0               | Year of flight                                                                       | BTS Flight    |                  |
| DEST_CITY_NAME              | String      | 0               | Destination airport city name                                                        | BTS Flight    |                  |
| OP_CARRIER_AIRLINE_ID       | Integer     | 0               | DOT identification number for unique airline/carrier                                 | BTS Flight    |                  |
| ORIGIN_CITY_MARKET_ID       | Integer     | 0               | Origin city market ID - consolidates airports serving same city market               | BTS Flight    |                  |
| ORIGIN_STATE_FIPS           | String      | 0               | Origin airport state FIPS code                                                       | BTS Flight    |                  |
| DAY_OF_WEEK                 | Integer     | 0               | Day of week                                                                          | BTS Flight    |                  |
| ORIGIN_STATE_ABR            | String      | 0               | Origin airport state code                                                            | BTS Flight    |                  |
| DEST_STATE_ABR              | String      | 0               | Destination airport state code                                                       | BTS Flight    |                  |
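
The derived timestamp columns in the table (`sched_depart_date_time`, `sched_depart_date_time_UTC`, `two_hours_prior_depart_UTC`, `four_hours_prior_depart_UTC`) could be computed roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, the origin airport's time zone is passed in as a `tzinfo` object (how it is looked up is not shown), and a fixed UTC-5 offset stands in for a real time zone in the example. A production version would likely pass a `zoneinfo.ZoneInfo` so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone


def derive_departure_timestamps(fl_date, crs_dep_time, tz):
    """Return (local, utc, two_hours_prior_utc, four_hours_prior_utc).

    fl_date:      flight date as a 'YYYYMMDD' string (FL_DATE)
    crs_dep_time: scheduled departure as an integer HHMM in local time (CRS_DEP_TIME)
    tz:           tzinfo for the origin airport (assumed to be supplied by the caller)
    """
    hours, minutes = divmod(crs_dep_time, 100)
    day = datetime.strptime(fl_date, "%Y%m%d")
    # Adding a timedelta also tolerates an HHMM of 2400 by rolling to the next day.
    local = (day + timedelta(hours=hours, minutes=minutes)).replace(tzinfo=tz)
    utc = local.astimezone(timezone.utc)
    return local, utc, utc - timedelta(hours=2), utc - timedelta(hours=4)


# Fixed-offset stand-in for an origin airport on UTC-5 (illustrative only).
utc_minus_5 = timezone(timedelta(hours=-5))
local, utc, two_prior, four_prior = derive_departure_timestamps("20150301", 730, utc_minus_5)
```

Weather observations joined to a flight should carry a timestamp no later than `two_prior`, which is what keeps the weather features consistent with the two-hour prediction horizon.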

### Train, Test, Validation Split

To ensure robust model evaluation and prevent overfitting, we will employ **time-based cross-validation** that respects the temporal nature of flight delay data. Traditional random cross-validation would violate temporal structure by potentially training on future data to predict past events, introducing data leakage that leads to overly optimistic performance estimates.

Instead, we use **time-series cross-validation** where the training set always precedes the validation set chronologically, ensuring models are evaluated on their ability to predict future delays based on historical patterns.

**Expanding Window**

We will use an **expanding window** approach for cross-validation, in which the training set grows with each fold while the validation window stays a fixed size (one year in the diagram below). Each fold adds one more year of historical data to the training set.

```
Dataset Timeline: 2015 ────────── 2016 ────────── 2017 ────────── 2018 ────────── 2019
                  |═══════════════|═══════════════|═══════════════|═══════════════|

Fold 1:           [Train: 2015]                   [Val: 2016]
                  |══════════════════════════════>|────────|

Fold 2:           [Train: 2015─────2016]                       [Val: 2017]
                  |═══════════════════════════════════════════>|────────|

Fold 3:           [Train: 2015─────2016─────2017]                       [Val: 2018]
                  |════════════════════════════════════════════════════>|────────|

Fold 4:           [Train: 2015─────2016─────2017─────2018]                       [Val: 2019]
                  |═══════════════════════════════════════════════════════════════>|────────|

Legend:  [═══] = Training Data    [───] = Validation Data
```

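For our train/test split, we split the data temporally as shown below, omitting 2020 and 2021 because of the exceptional and irregular disruptions caused by the COVID-19 pandemic. Because 2019 is held out as the test set, any fold whose evaluation window is 2019 is run only at final evaluation and is not used for model selection.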

```
Training:     2015 ──────────> 2017  (60% of data)
Validation:   2018                   (20% of data)
Test:         2019                   (20% of data, held out until final evaluation)
```

For cross-validation, we would create multiple folds:
- Fold 1: Train on 2015, validate on 2016
- Fold 2: Train on 2015-2016, validate on 2017
- Fold 3: Train on 2015-2017, validate on 2018
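
The fold list above can be generated programmatically. The following is a minimal sketch in plain Python, assuming folds are defined at whole-year granularity; the function name `expanding_window_folds` is hypothetical, and in the actual pipeline the year lists would be used to filter a Spark DataFrame.

```python
from typing import Iterator, List, Tuple


def expanding_window_folds(years: List[int]) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_years, validation_year) pairs in chronological order.

    Each fold trains on every year before its validation year, so the
    training set grows by one year per fold.
    """
    ordered = sorted(years)
    for i in range(1, len(ordered)):
        yield ordered[:i], ordered[i]


# 2019 is left out here because it is reserved as the held-out test year.
folds = list(expanding_window_folds([2015, 2016, 2017, 2018]))
for train_years, val_year in folds:
    print(f"train={train_years} validate={val_year}")
```

Because every training year is strictly earlier than the validation year, no fold can train on data from the period it is evaluated on.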

**Rolling Window** (Stretch goal)

If time permits, we may also deploy a **rolling window** approach, which is computationally cheaper and allows our model to focus on more recent patterns, discarding old data. This approach may be more appropriate if we expect pattern breaks (the underlying random process changing) in our data.

```
Dataset Timeline: 2015 ────────── 2016 ────────── 2017 ────────── 2018 ────────── 2019
                  |═══════════════|═══════════════|═══════════════|═══════════════|

Fold 1:           [Train: 2015─2016]              [Val: 2017]
                  |═══════════════════════════════>|────────|

Fold 2:                            [Train: 2016─2017]              [Val: 2018]
                                   |═══════════════════════════════>|────────|

Fold 3:                                             [Train: 2017─2018]              [Val: 2019]
                                                    |═══════════════════════════════>|────────|

Legend:  [═══] = Training Data    [───] = Validation Data
```
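
The rolling-window folds in the diagram can be sketched the same way. This is an illustrative plain-Python version, assuming a fixed two-year training window as drawn above; `rolling_window_folds` and its `train_size` parameter are hypothetical names.

```python
from typing import Iterator, List, Tuple


def rolling_window_folds(years: List[int], train_size: int = 2) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_years, validation_year) pairs with a fixed-length training window.

    Unlike the expanding window, the oldest year is dropped each time the
    window slides forward by one year.
    """
    ordered = sorted(years)
    for i in range(train_size, len(ordered)):
        yield ordered[i - train_size:i], ordered[i]


rolling_folds = list(rolling_window_folds([2015, 2016, 2017, 2018, 2019], train_size=2))
for train_years, val_year in rolling_folds:
    print(f"train={train_years} validate={val_year}")
```

The training set never grows, which is where the lower compute cost comes from.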

### Missing and Duplicate Values Analysis
Since we are predicting departure delay by the number of minutes (`DEP_DELAY`), we drop all rows for which this is null. Given the large number of columns, we choose to focus on columns with fewer than 10% nulls. There are no duplicate rows among the non-null rows.

We then dropped columns that had greater than 99% null values, which was around 91 columns of the 216 columns in the OTPW table. We also dropped columns irrelevant to our model, such as BackupElevation and the related backup weather stations. After this, there are still variables of interest that contain a significant number of null values, such as HourlyPrecipitation which has around 11% null values. Using the mean of time series data as our imputation method is problematic as it could cause data leakage, with values in the future influencing imputed values in the past. To incorporate these features into our models without omitting rows with missing values, we have proposed a phased imputation strategy: <a href="#command/8450206840205530">Imputation and Feature Engineering</a>.

Some values are null because the variable does not apply to that row; for example, CARRIER_DELAY is null for rows where no delay was caused by the carrier. For these variables we could replace the nulls with 0, indicating that the delay cause is not relevant, or we could create indicator variables for each cause (CARRIER, WEATHER, etc.) alongside a single "Delay time" column. We also still need to determine whether cancelled flights count as delayed flights.
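
As a rough illustration of the first option (zero-filling while keeping a record of which values were null), here is a plain-Python sketch operating on one row represented as a dictionary. `CARRIER_DELAY` comes from the text above; `WEATHER_DELAY`, the `_IS_NULL` suffix, and the function name are assumptions made for the example.

```python
from typing import Dict, Optional

# CARRIER_DELAY is named in the text; WEATHER_DELAY is an assumed sibling column.
DELAY_CAUSE_COLUMNS = ["CARRIER_DELAY", "WEATHER_DELAY"]


def fill_delay_causes(row: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Replace null delay-cause values with 0 and record which ones were null."""
    filled: Dict[str, float] = {}
    for column in DELAY_CAUSE_COLUMNS:
        value = row.get(column)
        filled[column + "_IS_NULL"] = 1.0 if value is None else 0.0
        filled[column] = 0.0 if value is None else float(value)
    return filled


example = fill_delay_causes({"CARRIER_DELAY": None, "WEATHER_DELAY": 12})
print(example)
```

Keeping the flag alongside the zero distinguishes "no carrier delay recorded" from a genuine zero-minute value.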

### Data Checkpointing strategy

To ensure reproducibility and efficient data management, we will implement a checkpointing process throughout the data pipeline. After the initial data ingestion, a cleaned version of the dataset (after removing missing and duplicate entries) will be saved as a Parquet file in DBFS. This serves as our first checkpoint, allowing us to skip re-ingestion in subsequent sessions. Following feature engineering, another checkpoint will store the processed dataset for modeling. We will also maintain a stable 3-month sample dataset as a baseline checkpoint for consistent EDA and metric comparisons.


### Custom Dataset with enhanced joins 

After reviewing the initial 3-month OTPW joined dataset, we realized that the existing joins introduced inconsistencies and missing information, making the data less suitable for machine learning.
To address this, we plan to recreate a cleaner, custom joined dataset by combining three primary data sources — flights, airport codes, and weather data — using improved join logic and preprocessing steps.

##### 1. Enriching Flight Data with Airport Information

We plan to begin by enriching the flight data with metadata from the airport codes dataset.
The airport codes table currently contains coordinates in a single string column, which we will split into separate latitude and longitude fields using PySpark string functions.

Next, we plan to perform two left joins:

* The first join will match the origin airport code in the flight data to the iata_code in the airport table.
* The second join will match the dest field in the same way.

These joins will allow us to add important contextual details such as airport names, countries, and geographic coordinates for both the origin and destination airports.
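
The coordinate split and the two left joins can be sketched in plain Python as follows. This is only an illustration of the logic: the planned implementation uses PySpark, the order of the two values in the coordinates string is an assumption that should be checked against the actual table, and the helper names and sample rows are hypothetical.

```python
from typing import Dict, Tuple


def split_coordinates(coordinates: str, lat_first: bool = True) -> Tuple[float, float]:
    """Split a 'value, value' string and return (latitude, longitude)."""
    first, second = (float(part.strip()) for part in coordinates.split(","))
    return (first, second) if lat_first else (second, first)


def enrich_flight(flight: Dict[str, str], airports: Dict[str, Dict[str, str]]) -> Dict[str, object]:
    """Left-join one flight to the airport table on both origin and dest."""
    enriched: Dict[str, object] = dict(flight)
    for side in ("origin", "dest"):
        airport = airports.get(flight[side])  # None when the code has no match
        lat, lon = split_coordinates(airport["coordinates"]) if airport else (None, None)
        enriched[f"{side}_lat"] = lat
        enriched[f"{side}_lon"] = lon
    return enriched


# Illustrative rows keyed by iata_code; coordinate values are rounded placeholders.
airports = {
    "ATL": {"coordinates": "33.64, -84.43"},
    "ORD": {"coordinates": "41.98, -87.90"},
}
enriched = enrich_flight({"origin": "ATL", "dest": "XXX"}, airports)
print(enriched)
```

As with a left join, the flight row is kept even when an airport code has no match; the unmatched side simply gets null coordinates, which the later null checks can surface.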

##### 2. Aligning Date and Time Formats

We plan to standardize the temporal fields so that the flight and weather data are compatible for joining.
While the flight data currently records only the date (e.g., 2015-01-01), the weather data contains timestamps (e.g., 2015-01-01T02:00:00).

Using Spark’s date and timestamp functions, we will normalize both to ensure consistent formats that support accurate temporal matching.

##### 3. Joining Flights with Weather Data

Once the flight dataset is enriched with airport information, we plan to combine it with NOAA weather data based on both spatial and temporal proximity.

* Spatial match: within ±0.1° of latitude/longitude (about 11 km north-south; the east-west distance is somewhat shorter and shrinks with latitude)
* Temporal match: within ±24 hours of the flight’s departure timestamp

This approach will ensure that each flight is paired with the most relevant weather observations available at or near its origin airport.
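
The two matching rules can be written as a single predicate. The sketch below is plain Python under the thresholds stated above; the function and constant names are hypothetical, and in Spark the same conditions would appear as a join condition rather than a per-row function.

```python
from datetime import datetime, timedelta

MAX_DEGREES = 0.1                   # spatial tolerance in degrees
MAX_TIME_GAP = timedelta(hours=24)  # temporal tolerance


def is_weather_match(
    airport_lat: float,
    airport_lon: float,
    station_lat: float,
    station_lon: float,
    departure: datetime,
    observed_at: datetime,
) -> bool:
    """Return True when a weather observation is close enough in space and time."""
    close_in_space = (
        abs(airport_lat - station_lat) <= MAX_DEGREES
        and abs(airport_lon - station_lon) <= MAX_DEGREES
    )
    close_in_time = abs(departure - observed_at) <= MAX_TIME_GAP
    return close_in_space and close_in_time


departure = datetime(2015, 1, 1, 10, 0)
near = is_weather_match(33.64, -84.43, 33.70, -84.40, departure, datetime(2015, 1, 1, 2, 0))
far = is_weather_match(33.64, -84.43, 34.20, -84.40, departure, datetime(2015, 1, 1, 2, 0))
stale = is_weather_match(33.64, -84.43, 33.70, -84.40, departure, datetime(2014, 12, 30, 2, 0))
print(near, far, stale)
```

Note that a symmetric ±24 hour tolerance also admits observations taken after departure; the leakage controls described later would additionally require `observed_at` to fall at least 2 hours before scheduled departure.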

##### 4. Optimizing for Large-Scale Data Processing

Because the full dataset is large, we plan to apply several Spark optimization strategies to improve performance and stability during joins:

* Broadcast joins will be tested on smaller subsets but avoided for the full weather dataset (~29 GB), as it exceeds executor memory capacity.

* We plan to repartition the data by date and filter the weather dataset by relevant years and geographic regions to minimize shuffle size.

* For large-scale processing, we plan to use chunked joins (processing one year or month at a time) to prevent out-of-memory errors and enable incremental data generation.

* Finally, we plan to run null checks and visual validations in Databricks to confirm join accuracy and overall data completeness.


## Data Pipeline 

Our data pipeline is designed to efficiently integrate large-scale flight, weather, and airport datasets from multiple sources (DOT TranStats, NOAA, and DataHub) into a unified, analysis-ready dataset for predictive modeling. Built on **Databricks and PySpark**, it follows a structured **ETL (Extract, Transform, Load)** process emphasizing scalability, fault tolerance, and reproducibility.

##### 1. Incremental Data Ingestion
Data is ingested in **incremental batches** (e.g., monthly or quarterly) from multiple sources using Spark connectors and batch processing. This approach reduces memory usage, enables efficient reprocessing in case of failures, and minimizes compute costs. Each dataset is partitioned by date to optimize distributed joins and balance cluster load.

##### 2. Data Quality and Validation
During ingestion, automated data validation checks ensure data integrity and consistency. These include:
- Detecting missing or null values in critical columns (flight date, origin, destination, coordinates).  
- Identifying schema drifts or new columns across updates.  
- Removing duplicate records introduced during incremental loads.  

These checks maintain high-quality, standardized input for downstream feature engineering.

##### 3. Data Cleaning and Standardization
Schema alignment, type normalization, and temporal/spatial synchronization are performed to harmonize datasets. Flight dates and NOAA timestamps are standardized to a unified datetime field, and geographic coordinates are normalized to enable spatial joins.

##### 4. Data Integration (Custom Joins)
A custom join process merges flight, airport, and weather data using:
- **Spatial matching:** within ±0.1° latitude/longitude (roughly 10-11 km).  
- **Temporal matching:** within ±24 hours of scheduled departure.  

Optimizations such as partitioning by date and year-wise joins ensure efficient scaling without memory bottlenecks.

##### 5. Checkpointing and Versioning
Each major stage—raw ingestion, cleaned data, and merged datasets—is checkpointed as Delta/Parquet files in DBFS. Checkpointing allows the pipeline to resume from the last successful stage, avoiding full re-runs after failures.  
Versioned, timestamped checkpoint paths (e.g., `/mnt/data/checkpoints/year=2018/month=01/`) ensure full traceability and reproducibility across iterations.

##### 6. Feature Engineering Integration
Processed datasets feed into downstream modules for **feature generation**, including time-series and graph features. 

##### 7. Automation and Scalability
The pipeline runs on Databricks clusters with Spark and the Photon runtime, supporting auto-scaling, versioned outputs, and proactive data drift monitoring. Incremental processing ensures that new data can be integrated without reprocessing historical data.



# Machine Learning Pipeline

<img src='https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/ml_pipeline.png?raw=true' style="width:40%;">

### EDA
#### Target Variable

Here are the summary statistics of our target variable, departure delay.

| mean | stddev | min  | 25% | median | 75% | max   |
|------|--------|------|-----|--------|-----|-------|
| 10.39| 37.99  | -1.0 | -5  | -1     | 9   | 996.0 |

The variable is highly skewed. On average, flights are delayed by 10 minutes, with a large standard deviation of 38 minutes. Interestingly, the median delay is -1 minutes, meaning that at least half of the flights depart on time or early. Even at the 75th percentile, the delay is only 9 minutes, which still counts as on time. Indeed, 80.31% of flights have a delay within 15 minutes, which is considered on-time. So if this were framed as a classification problem (which it is not), there would be a serious class imbalance.

There is also a more normal distribution of the delayed flights centered around 40 minutes, but with a very long right tail and a large range of extreme values (the more extreme values >= 300 are not shown here in the histogram). This is expected as departure time is censored on the left -- a flight simply cannot depart hours before scheduled time. 

![Histogram: Departure Delay](https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/EDA_histogram-dep-delay.png?raw=true)

#### Seasonality
Is there any seasonality in the data? It is hard to see given only three months, but the average delay does not appear homogeneous across time. Whereas the first week of the year averaged 20 minutes of delay, only two weeks later this dropped significantly to under 6 minutes. It's likely that the beginning-of-the-year delay is a spillover from the holiday season. In mid-February, there is another spike in delay minutes, likely due to the winter storm that February, after which delays go down again.

![Line Plot: Weekly Departure Delay](https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/EDA_weekly-dep-delay.png?raw=true)

Is there any difference by day of the week? It seems so. Delays on Monday and Sunday are higher than delays on other days of the week. This possibly reflects the weekend travel back to work, though there is not a comparable spike on Friday and Saturday. Mid-week Wednesday has the lowest average delay of all days.

![Line Plot: Daily Departure Delay](https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/EDA_daily-dep-delay.png?raw=true)

Given these observations, it is likely that the time of a flight within the year and within the week would be a useful predictor. However, it also needs to be treated carefully, because there may be highly irregular or unusual events in a particular week or month that do not provide generalizable insight into future flights.

#### Geography

Departure delays differ by airport. Here we plot the average departure delay of each origin airport on a map. Flight volume is encoded by bubble size, and delay is encoded by color, with warmer colors indicating longer delays. The 10 airports with the longest delays are annotated on the map in yellow bubbles.

It is interesting to observe that airports with large volumes (large bubbles) do not necessarily have worse delays than airports with less traffic. For example, hubs like Los Angeles CA, San Francisco CA, Austin TX have shorter delays than many smaller airports on the map.

Regionally, however, the cluster of airports around New York does seem to be a hotspot of delays, as does the Chicago area. Therefore, in model fitting, the geographical region should likely play an important role.

![Map: Departure Delay by Airport and Volume](https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/EDA_dep-delay-map.png?raw=true)

#### Correlations between features

The most correlated features in our dataset are the departure and arrival features: as we would expect, a flight's departure delay is correlated with its arrival delay. Dew point, dry bulb, and wet bulb hourly temperatures are also highly correlated. HourlyStationPressure and Elevation are inversely correlated: the higher a station's elevation, the lower the pressure.

Based on this information, we plan to remove features that are highly correlated with each other such as HourlyDewPointTemperature and HourlyWetBulbTemperature. We will also remove features that occur after departure, such as the arrival features, to avoid data leakage. You can read more about this in the Data Leakage Prevention Strategy below.

![Heat Map: Feature Correlations](https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/EDA_heat_map.png?raw=true)

#### Histograms

The time variables (departures, arrivals) seem to have large outliers which need to be removed. The temperature variables show some normality in their histogram shapes. The wind variables are less normally distributed.

![Histograms: Feature Correlations](https://github.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/blob/main/report-images/EDA_histograms.png?raw=true)

## Methods
We have chosen to formulate the flight delay prediction problem as a regression task where the target variable is the continuous delay time in minutes. This decision is based on several considerations:

**Granularity of Information:** A regression approach provides the specific expected delay duration (e.g., 23 minutes), which is more informative than categorical buckets (e.g., "15-30 minutes"). This granular information is valuable for operational planning and resource allocation.

**Flexible Evaluation:** While regression is our primary approach, we can post-process the continuous predictions into time buckets for classification-based evaluation metrics. This allows us to assess both the precision of delay magnitude and the accuracy of delay categorization.

**Real-World Application:** Airlines and passengers benefit from knowing the exact expected delay duration rather than broad categories, enabling more precise scheduling adjustments and passenger communication.

The regression predictions will be binned into the following discrete intervals:
* Less than 15 minutes, including early departures (on-time)
* 15 to less than 60 minutes (delayed)
* 60 minutes or more (severely delayed)
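
A minimal sketch of this post-processing step is shown below. The handling of the exact boundaries (a predicted delay of exactly 15 or 60 minutes falling into the higher bucket) is an assumption, as is the function name `delay_bucket`.

```python
def delay_bucket(predicted_minutes: float) -> str:
    """Map a continuous delay prediction (in minutes) to a delay category."""
    if predicted_minutes < 15:
        return "on-time"  # includes early departures (negative delays)
    if predicted_minutes < 60:
        return "delayed"
    return "severely delayed"


predictions = [-3.0, 9.5, 15.0, 42.0, 60.0, 180.0]
buckets = [delay_bucket(p) for p in predictions]
print(buckets)
```

Applying the same function to both the predictions and the observed delays yields label pairs that classification metrics can be computed on.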

### Imputation and Feature Engineering

Feature engineering transforms raw flight data into predictive signals that capture temporal patterns, network dependencies, and data completeness. Our approach incorporates three complementary feature types: imputed missing values, time-series patterns, and graph network structures. Given this is a four-week project, we will implement these features incrementally, starting with simple approaches that may contain data leakage to unblock model development, then systematically adding leakage controls and tracking performance changes.

#### Data Leakage Prevention Strategy

**Why this matters**: Data leakage occurs when information from the future inadvertently influences predictions about the past, creating artificially inflated performance metrics that fail in production. A model may appear highly accurate in testing but perform significantly worse or become completely unusable in production. For example, if we use weather conditions at actual departure time rather than prediction time, the model learns patterns it cannot access in reality. When deployed, the model encounters NULL values for weather (since the flight hasn't departed yet) and cannot make predictions, rendering it useless for operational planning.

Our moment of prediction is 2 hours before scheduled departure time. All features must use only information available at or before this threshold.

<img src='https://raw.githubusercontent.com/Connor-Watson-Berkeley/flight-departure-delay-predictive-modeling/master/report-images/data_leakage_visual.png' style="width:100%;">

Our prevention strategy enforces strict temporal boundaries:

- **Point-in-time correctness**: Features use only information available 2+ hours before scheduled departure
- **Rolling window computations**: Historical aggregations look backward from the prediction moment
- **Temporal train-test splits**: Data is partitioned chronologically, ensuring models never train on dates after test periods
- **Incremental rollout with transparency**: We will implement leakage controls progressively and clearly document which features contain leakage at each development phase

**Phased Implementation Approach**:

Our initial baseline models will use simple feature-average imputation and raw features, which will contain data leakage. This unblocks rapid model development and establishes performance benchmarks. We will then incrementally add leakage controls, modularly swapping imputation methods and adding temporal safeguards. We expect performance to decrease as we add stricter controls, which is desirable because it means our training metrics more accurately reflect production performance. We will be radically transparent about which controls are active in each model iteration and track performance regressions as we tighten data integrity.

The business impact of uncontrolled data leakage in deployment would be severe: a model achieving 95% accuracy in testing but only 70% in production would undermine operational planning and erode stakeholder trust. By measuring the gap between leaky and leak-free performance during development, we can set realistic production expectations.

#### Missing Value Imputation

**Why this matters**: Many potentially predictive features contain NULL (empty) values for various reasons. Weather observations may be delayed during severe conditions, taxi times are unavailable for cancelled flights, and equipment data may be missing due to sensor failures. Many ML models, including linear regression, cannot handle missing values natively and will fail or throw errors when encountering NULLs (tree-based libraries such as XGBoost are an exception, since they can learn a default split direction for missing values). To work around this limitation, we "guess" what that value might have been were it populated, allowing us to retain the feature rather than throwing it away entirely. Without imputation, we would either lose valuable predictive features or be forced to discard large portions of our dataset.

Flight data suffers from systematic missingness patterns. Simply dropping incomplete records would exclude critical edge cases; filling with global means introduces data leakage and obscures important missingness signals.

**Phase 1 - Simple Imputation (with data leakage)**:
- Global mean/median imputation across entire dataset to unblock model development
- Allows rapid baseline model training while we implement proper temporal controls

**Phase 2 - Temporal Imputation (leak-free)**:
- **Carry-forward imputation for weather**: Use the most recent valid observation for each weather variable timestamped at least 2 hours before scheduled departure. Weather conditions change gradually, so the last known observation provides a reasonable estimate. This mirrors real-world forecasting where dispatchers reference the latest available conditions.
- **Historical median imputation**: For operational metrics (taxi times, turnaround duration), impute using airline-airport pair medians computed only from training data preceding the prediction moment. Different airlines have different operational efficiencies at different airports (e.g., Delta's average taxi time at ATL differs from Southwest's), so airline-airport specific medians provide more accurate estimates than global averages.
- **Missingness indicators**: Create binary flag features (e.g., `weather_imputed`, `taxi_time_missing`) to preserve the information content of missingness itself. Missing data patterns may correlate with operational disruptions, so the fact that a value is missing can be predictive even after imputation.

**Data Leakage Controls**:
All imputation statistics are computed exclusively from training data preceding the prediction point. For validation and test sets, we freeze imputation parameters at their training-time values. Weather carry-forward uses only observations timestamped at least 2 hours before scheduled departure.
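
To make the carry-forward rule concrete, the sketch below picks the most recent non-null weather observation taken at least 2 hours before scheduled departure and returns a missingness flag alongside it. It is plain Python over a sorted list of `(timestamp, value)` pairs; the function name and data layout are assumptions, and a Spark version would typically use a window function instead.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

PREDICTION_LEAD = timedelta(hours=2)  # prediction moment is 2 hours before departure


def carry_forward(
    observations: List[Tuple[datetime, Optional[float]]],
    scheduled_departure: datetime,
) -> Tuple[Optional[float], int]:
    """Return (value, imputed_flag) for one weather variable.

    `observations` must be sorted by timestamp. Only observations at or before
    the prediction cutoff are eligible. The flag is 1 when the latest eligible
    observation was missing, so the value had to come from an earlier one.
    """
    cutoff = scheduled_departure - PREDICTION_LEAD
    timestamps = [ts for ts, _ in observations]
    last_eligible = bisect_right(timestamps, cutoff) - 1
    for i in range(last_eligible, -1, -1):
        value = observations[i][1]
        if value is not None:
            return value, int(i != last_eligible)
    return None, 1  # nothing usable before the cutoff


day = datetime(2015, 1, 1)
obs = [
    (day.replace(hour=6), 0.0),
    (day.replace(hour=7), 0.2),
    (day.replace(hour=8), None),  # missing reading
    (day.replace(hour=9), 0.9),   # too late for a 10:00 departure
]
result_10am = carry_forward(obs, day.replace(hour=10))
result_0930 = carry_forward(obs, day.replace(hour=9, minute=30))
result_7am = carry_forward(obs, day.replace(hour=7))
print(result_10am, result_0930, result_7am)
```

The second element of the result plays the role of a flag such as `weather_imputed`, so the model can still see that the latest reading was missing.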

**Business Value**: Robust imputation enables predictions even when data feeds are incomplete, maintaining system uptime during operational disruptions that most require accurate forecasting.

#### Time-Series Features

**Why this matters**: Flight delays exhibit strong temporal dependencies and seasonal patterns that simple snapshot features cannot capture. Morning delays cascade through daily schedules as the same aircraft operates multiple flights. Weekly patterns emerge around weekend travel versus weekday business travel. Seasonal patterns appear both globally (summer thunderstorms, winter snow) and at specific hubs (monsoon season in Phoenix, lake-effect snow in Chicago). By extracting features from time-series models, we can capture these multi-scale temporal dynamics and provide our downstream ML models with information about underlying trends and seasonality that raw timestamps cannot convey.

**Feature Categories**:

- **Temporal encodings**: Hour-of-day, day-of-week, and month features to capture periodic patterns like morning departure peaks, weekend travel surges, and seasonal weather effects

- **ARIMA/SARIMA forecasts**: Train Seasonal ARIMA models on historical delay patterns at the airport level. Extract the fitted seasonal component and trend as features, capturing systematic patterns like summer travel peaks or holiday congestion. The SARIMA forecast for 2 hours ahead provides a baseline expectation of delay given historical patterns.

- **Exponential smoothing (ETS) features**: Apply ETS models to recent airport delay history, extracting level, trend, and seasonal components as features. The smoothed level captures current operational baseline, while trend indicates whether conditions are improving or deteriorating.

- **Rolling window aggregations**: Compute statistics over past flights at the airport level to quantify current congestion

**Rolling Window Approach (Leak-Free)**:

For a flight departing at 10:00 AM (prediction moment at 8:00 AM), we compute aggregations using only flights that departed before 8:00 AM:
```
Timeline: 2 AM    4 AM    6 AM    8 AM (prediction) | 10 AM (departure)
          |-------|-------|-------|-----------------|
                          [--2hr--]
          [-------6hr window------]

Features computed from this window:
- Mean delay of flights in window
- % of flights delayed >15 min
- Standard deviation of delays
- Exponentially smoothed delay (recent flights weighted higher)
```

We will implement windows of varying lengths (2-hour, 6-hour, 24-hour) to capture both immediate conditions and broader daily trends. Exponential smoothing within windows gives recent flights higher weight, balancing recency with statistical stability.

**Data Leakage Controls**:
Rolling windows are strictly backward-looking from the prediction moment (2 hours before departure). For a 10 AM flight, the prediction happens at 8 AM, so the 2-hour window includes only flights departing between 6-8 AM. ARIMA/SARIMA and ETS models are trained only on historical data preceding the prediction cutoff and generate forecasts forward from that point. All window computations use scheduled times for filtering to prevent future information from defining temporal boundaries.
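
The backward-looking window can be sketched as below, covering three of the four statistics listed above (exponential smoothing is omitted for brevity). This is plain Python for a single airport; the function name, the tuple layout, and the use of the population standard deviation are assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import Dict, List, Optional, Tuple

PREDICTION_LEAD = timedelta(hours=2)


def window_features(
    prior_flights: List[Tuple[datetime, float]],
    scheduled_departure: datetime,
    window_hours: int,
) -> Dict[str, Optional[float]]:
    """Aggregate delays of flights inside [prediction - window, prediction).

    `prior_flights` holds (filter_time, delay_minutes) for one airport, where
    filter_time is the timestamp used for filtering (the text specifies
    scheduled times).
    """
    prediction_time = scheduled_departure - PREDICTION_LEAD
    window_start = prediction_time - timedelta(hours=window_hours)
    delays = [d for t, d in prior_flights if window_start <= t < prediction_time]
    if not delays:
        return {"n_flights": 0, "mean_delay": None, "share_delayed_15": None, "std_delay": None}
    return {
        "n_flights": len(delays),
        "mean_delay": mean(delays),
        "share_delayed_15": sum(d > 15 for d in delays) / len(delays),
        "std_delay": pstdev(delays),  # population standard deviation
    }


day = datetime(2015, 1, 1)
flights = [
    (day.replace(hour=5, minute=30), 50.0),
    (day.replace(hour=6, minute=30), 10.0),
    (day.replace(hour=7, minute=30), 30.0),
    (day.replace(hour=8, minute=30), 100.0),  # after the prediction moment, excluded
]
features_2h = window_features(flights, day.replace(hour=10), window_hours=2)
features_6h = window_features(flights, day.replace(hour=10), window_hours=6)
print(features_2h)
print(features_6h)
```

The half-open interval excludes anything at or after the prediction moment, so the 8:30 flight never contributes to a 10:00 departure's features.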

**Business Value**: Time-series features enable proactive delay anticipation based on emerging operational patterns and seasonal trends, allowing airlines to preemptively adjust gate assignments and crew positioning before delays compound.

#### Graph Network Features

**Why this matters**: Airport operations exist within an interconnected network where delays at hub airports propagate through connecting flights to downstream destinations. A delay at ATL doesn't just affect flights departing ATL; it affects passengers connecting through ATL to reach other airports, and those delays cascade further through the network. Graph features model this systemic connectivity, identifying when delays at critical network nodes will trigger cascading disruptions. By quantifying each airport's structural importance and tracking delays across connected airports, we can predict when localized issues will spread network-wide.

**Feature Categories**:

- **Airport centrality metrics**: 
  - **Degree centrality**: Count of direct flight connections for each airport. High-degree airports (ATL, ORD, DFW) serve as critical bottlenecks where delays impact many downstream flights.
  - **PageRank**: Measures airport importance based on connections to other important airports. Captures that a delay at a hub connecting to other hubs (ATL to ORD to LAX) has greater cascading potential than a delay at a spoke airport.
  - These metrics quantify an airport's structural importance in the network, providing the model with context about systemic risk.

- **Connecting flight features**: For flights with probable connections (same airline, tight turnaround windows), create features capturing:
  - Minimum connection time available
  - Inbound flight delay status (if the inbound leg feeding passengers is delayed, the outbound flight may be held)
  - Number of potential connections at risk
  - These features directly model delay propagation through multi-leg itineraries

- **Network congestion indicators**: Aggregate current delay statistics across an airport's immediate network neighbors (airports with direct flight connections). If connected airports are experiencing delays, propagation to the target airport becomes more likely. This captures regional congestion patterns that may not be visible from local conditions alone.

**Data Leakage Controls**:
Graph topology is constructed from scheduled flights using only data available before the prediction moment. Centrality metrics are computed from the published schedule, not from actual flight outcomes. Delay aggregations across network neighbors use only flights that departed at least 2 hours before the target flight's scheduled departure time.
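
For intuition, both centrality metrics can be computed from a list of scheduled routes in a few lines. The sketch below is a plain-Python illustration with a basic power-iteration PageRank; the route list is made up, the damping factor of 0.85 is the conventional default rather than a project decision, and a production version would more likely rely on a graph library.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Route = Tuple[str, str]  # (origin, destination) taken from the published schedule


def degree_centrality(routes: List[Route]) -> Dict[str, int]:
    """Count the distinct airports each airport is directly connected to."""
    neighbors = defaultdict(set)
    for origin, dest in routes:
        neighbors[origin].add(dest)
        neighbors[dest].add(origin)
    return {airport: len(adjacent) for airport, adjacent in neighbors.items()}


def pagerank(routes: List[Route], damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Compute PageRank by power iteration over the directed route graph."""
    out_links = defaultdict(set)
    nodes = set()
    for origin, dest in routes:
        out_links[origin].add(dest)
        nodes.update((origin, dest))
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by airports without outbound routes is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for dest in out_links[node]:
                new_rank[dest] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


# Made-up schedule for illustration only.
routes = [("ATL", "ORD"), ("ORD", "ATL"), ("ATL", "LAX"),
          ("LAX", "ATL"), ("ORD", "LAX"), ("SAV", "ATL")]
degrees = degree_centrality(routes)
ranks = pagerank(routes)
print(degrees)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Because the inputs are scheduled routes only, both metrics can be computed ahead of the prediction moment without touching actual flight outcomes.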

**Business Value**: Network-aware predictions identify high-risk flights even when local conditions appear normal. If a hub experiences delays, graph features enable preemptive passenger notifications and rebooking before cascading cancellations occur.

**Integration and Pipeline**:

These feature engineering strategies will be implemented in phases, with each phase clearly documented:

**Phase 1 (Week 1-2)**: Simple imputation with data leakage, raw temporal features, baseline models  
**Phase 2 (Week 2-3)**: Temporal imputation controls, rolling window features, initial graph features  
**Phase 3 (Week 3-4)**: Full leakage controls, ARIMA/ETS features, refined graph features, performance comparison

Each feature type is computed in sequence: imputation first (establishing complete dataset), then time-series features (from imputed data), then graph features (layering network context). We will modularly swap implementations as we add leakage controls, tracking performance regressions to quantify the cost of temporal correctness.

The combined feature set provides models with a multi-dimensional view of delay risk: immediate conditions (imputed state), temporal momentum (time-series trends), and network position (graph connectivity). This enables accurate predictions across diverse scenarios from isolated weather events to systemic network congestion.

### Predictive Modeling

We will build model and feature complexity incrementally to evaluate how performance improves or regresses as new techniques are introduced. We begin with a simple linear regression model trained on 10 or fewer "obvious" features with minimal feature engineering. This baseline establishes initial performance benchmarks using leaky imputation to unblock development.

From there, we progressively add complexity in two dimensions: more sophisticated models (Lasso regularization, XGBoost, deep neural networks) discussed in this section and better feature engineering (temporal imputation controls, time-series features, graph network features) discussed in the prior section. At each step, we evaluate performance to understand whether added complexity is justified.

| Algorithm | Implementation | Purpose | Key Strengths | Key Limitations |
|-----------|---------------|---------|---------------|-----------------|
| **Linear Regression (Baseline)** | scikit-learn | Simple Baseline with Top Features | Fast, interpretable, no hyperparameters to tune | Linear only, sensitive to outliers, no built-in feature selection |
| **Linear Regression (Lasso)** | scikit-learn | Feature Selection | Fast, interpretable, automatic feature selection | Linear only, sensitive to outliers |
| **XGBoost** | XGBoost library (SparkXGBRegressor) | Primary prediction model | Non-linear patterns, feature interactions, robust | Requires tuning, less interpretable |
| **Deep Neural Network** | MLP regressor (implementation to be confirmed; Spark MLlib ships only `MultilayerPerceptronClassifier`) | Complex pattern capture | Deep non-linear patterns | Computationally intensive, black box |
| **ARIMA/Prophet** | statsmodels/prophet | Temporal pattern modeling | Explicit time dependencies, seasonality | Univariate, assumes stationarity |
| **Ensemble** | Custom PySpark implementation | Combine model strengths | Improved robustness, reduced variance | Increased complexity, all models needed |


**Loss Function Selection**: Our project goal is to predict the expected delay (the conditional average delay given input features). Mean Squared Error (MSE) is an appropriate loss function for this objective because the prediction that minimizes expected squared error is the conditional expectation. Mathematically, MSE is defined as:

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
$$

where \\(y_i\\) is the actual delay and \\(\hat{y}_i\\) is the predicted delay per flight. By minimizing MSE, our models learn to predict \\(E[Y|X]\\), the expected value of delay given features \\(X\\), which directly aligns with our goal of estimating typical delay durations for operational planning. Alternative loss functions like Mean Absolute Error would regress to the conditional median rather than the mean, which would not provide the expected delay airlines need for resource allocation decisions.
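
The mean-versus-median distinction can be checked numerically. The sketch below uses a small hypothetical sample of delays (not project data) and searches a grid of constant predictions for the one that minimizes each loss.

```python
from statistics import mean, median

# Hypothetical delays in minutes, with one severe delay skewing the sample.
delays = [0, 2, 5, 8, 120]


def mse(c):
    """Mean squared error of predicting the constant c for every flight."""
    return mean((y - c) ** 2 for y in delays)


def mae(c):
    """Mean absolute error of predicting the constant c for every flight."""
    return mean(abs(y - c) for y in delays)


# Candidate constant predictions from 0.0 to 120.0 in steps of 0.1.
candidates = [c / 10 for c in range(0, 1201)]
best_mse = min(candidates, key=mse)  # lands on the sample mean
best_mae = min(candidates, key=mae)  # lands on the sample median
```

With this sample the MSE-optimal constant is the mean (27 minutes) and the MAE-optimal constant is the median (5 minutes), which shows how a single severe delay pulls the MSE solution upward.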

All models below use MSE as the base objective function. Individual models may add regularization penalties (L1, L2) to the loss function to prevent overfitting, which we describe in each model's subsection.

#### Baseline Model: Simple Linear Regression with Top Features
We will implement a simple linear regression baseline using scikit-learn's LinearRegression class, trained only on up to ten raw features that appear to have the most significance. Building upon this baseline, we plan to perform the feature engineering outlined in the previous section. Additionally, the improved baseline model will be evaluated using k-fold cross-validation, with folds ordered in time as described in the Validation Strategy section, to obtain robust performance estimates and detect overfitting. To improve generalization, we will apply Lasso and Ridge regularization via LassoCV and RidgeCV, which automatically tune the regularization parameter \\(\lambda\\) through cross-validation, selecting the value that minimizes cross-validated error.

This model will be used to establish a baseline to set minimum performance expectations and provide an interpretable benchmark against which we can evaluate more complex models and performance gains. The model will be trained using ordinary least squares, minimizing the residual sum of squares, and will help us assess whether additional complexity (non-linear models, more features) yields meaningful improvements in prediction accuracy.

#### Feature Selection Model: Linear Regression with Lasso Regularization
We will implement Linear Regression with Lasso regularization using scikit-learn's Lasso or LassoCV class. The Lasso method minimizes the objective function

$$\mathcal{L}_{\text{Lasso}} = \frac{1}{2n}||y - X\beta||_2^2 + \lambda||\beta||_1$$

where \\(y\\) represents the target vector of delay times in minutes, \\(X\\) is the feature matrix, \\(\beta\\) is the coefficient vector to be estimated, \\(\lambda\\) is the regularization parameter controlling the L1 penalty strength, and \\(n\\) is the number of samples. The first term represents the standard least squares loss function, while the second term is the L1 penalty that encourages sparsity in the coefficient vector. Lasso will run on a single node by converting Spark DataFrames to pandas DataFrames; for distributed processing, we can use Spark MLlib's LinearRegression with elasticNetParam=1.0. [Databricks scikit-learn Documentation](https://docs.databricks.com/aws/en/machine-learning/train-model/scikit-learn)

Linear regression with Lasso performs automatic feature selection through its L1 regularization mechanism. By driving the coefficients of less important features to exactly zero, Lasso effectively identifies the most predictive variables in our dataset. This feature selection capability is particularly valuable given the high-dimensional nature of flight delay prediction, which may include numerous weather variables, airport characteristics, temporal features, and airline-specific factors. 

The features retained by Lasso with non-zero coefficients will serve as input to our more complex models (XGBoost and Neural Networks), thereby reducing dimensionality and computational cost while potentially improving generalization performance. Additionally, the linear nature of this model provides direct interpretability through coefficient magnitudes, allowing us to identify and understand the most influential factors contributing to flight delays, such as weather conditions, airport congestion, or airline operational patterns.

The key hyperparameter for this model is the regularization strength, denoted as \\(\lambda\\) (exposed as `alpha` in scikit-learn), which controls the trade-off between model complexity and fit quality. We will determine the optimal value through cross-validation using LassoCV, which automatically tests multiple regularization parameters and selects the value that minimizes cross-validated error.
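
To illustrate how the L1 penalty drives weak coefficients to exactly zero, here is a dependency-free sketch of cyclic coordinate descent on the Lasso objective \\(\frac{1}{2n}||y - X\beta||_2^2 + \lambda||\beta||_1\\). It is illustrative only: the toy data and the choice `lam=0.5` are hypothetical, and the project itself would rely on `LassoCV` rather than this hand-written loop.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values inside [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 one coordinate at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution left out.
            partial = [
                y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[i][j] * partial[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta


# Hypothetical standardized features: column 0 drives the delay,
# column 1 contributes almost nothing.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [3.1, -2.9, 2.9, -3.1]  # equals 3.0 * col0 + 0.1 * col1

beta = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

Here the weak feature's coefficient is set to exactly zero while the strong feature is kept (shrunk from 3.0 to 2.5), so `selected` contains only column 0. This is the selection behavior we intend to use when passing features to the downstream models.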

#### XGBoost

We will implement XGBoost using the xgboost.spark library, specifically the [SparkXGBRegressor](https://xgboost.readthedocs.io/en/stable/tutorials/spark_estimator.html) PySpark ML Estimator.

XGBoost is of particular interest to us as a primary prediction model for several reasons. We are mainly interested in the accuracy gains that gradient-boosted trees often deliver over single decision trees and random forests. It also captures non-linear and complex interactions that linear models cannot represent, such as the interactions between weather, airport congestion, airline operations, and other factors that result in flight delays. XGBoost additionally includes built-in regularization through L1 and L2 penalties, handles missing values internally, and is less prone to overfitting than unregularized tree-based methods. It is also computationally efficient on large datasets, as it supports parallel processing and GPU acceleration.

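The XGBoost algorithm optimizes a regularized objective function

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where the first term is the training loss, here squared error (the MSE defined above, up to a constant factor), and \\(K\\) is the number of trees \\(f_k\\) in the ensemble.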

The second term \\(\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2\\) is the regularization component that penalizes model complexity, where \\(T\\) represents the number of leaves in each tree, \\(w\\) denotes the leaf weights, and \\(\gamma\\) and \\(\lambda\\) are regularization parameters that control the minimum loss reduction required for splits and L2 regularization on leaf weights, respectively.
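
To make the roles of \\(\gamma\\) and \\(\lambda\\) concrete, the sketch below evaluates the standard closed-form leaf weight \\(w^* = -G/(H+\lambda)\\) and split gain for a squared-error loss. It assumes the loss is scaled as \\(\frac{1}{2}(y - \hat{y})^2\\), so each gradient is \\(\hat{y}_i - y_i\\) and each Hessian is 1. The residual values and parameter settings are hypothetical.

```python
def leaf_weight(residuals, lam):
    """Optimal leaf weight -G / (H + lam) for loss 0.5 * (y - yhat)^2.

    residuals are y - yhat for the rows in the leaf, so G = -sum(residuals)
    and H = number of rows (each Hessian equals 1).
    """
    G = -sum(residuals)
    H = len(residuals)
    return -G / (H + lam)


def split_gain(left, right, lam, gamma):
    """Loss reduction from splitting a node into left/right, minus gamma."""

    def score(res):
        G = -sum(res)
        H = len(res)
        return G * G / (H + lam)

    return 0.5 * (score(left) + score(right) - score(left + right)) - gamma


# Hypothetical residuals (minutes) on each side of a candidate split.
left, right = [10.0, 12.0], [-8.0, -6.0]

w_left = leaf_weight(left, lam=1.0)                  # shrunk below the raw mean of 11
gain = split_gain(left, right, lam=1.0, gamma=5.0)   # positive: split is kept
pruned = split_gain(left, right, lam=1.0, gamma=200.0)  # negative: split is rejected
```

Raising \\(\lambda\\) shrinks leaf weights toward zero, while raising \\(\gamma\\) increases the gain a split must achieve before it is accepted.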

As a reminder, our outcome variable is departure delay in minutes.

#### Deep Neural Network with MLP


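We plan to implement a Deep Neural Network (DNN) based on a Multilayer Perceptron (MLP) architecture for distributed training on large-scale flight datasets. Note that [Spark MLlib's multilayer perceptron](https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier) is available only as a classifier (`MultilayerPerceptronClassifier`); MLlib has no `MultilayerPerceptronRegressor`, so the regression model will need a framework that supports a linear output layer.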

The model will serve as a non-linear regression framework for predicting flight delays, leveraging engineered features from the flight, airport, and weather data.


The network will start with an input layer representing the features selected through Lasso, followed by two to three hidden layers with 64–128 neurons each, using the ReLU activation function to introduce non-linearity and improve gradient efficiency. The output layer will have a single neuron with a linear activation, appropriate for predicting continuous flight delay times in minutes. The model will be trained to minimize Mean Squared Error (MSE), with Mean Absolute Error (MAE) considered if strong outliers are detected. 

We chose MLP because it can capture complex, non-linear relationships between flight, weather, and airport variables using static features available two hours before departure—without requiring sequential data processing. To prevent overfitting, we plan to apply L2 regularization, dropout between 0.2 and 0.5, and early stopping based on validation loss. The model’s complexity will be increased iteratively, starting from a simple 2-layer architecture and tuning depth and width based on cross-validation results. This approach balances scalability, interpretability, and predictive performance while aligning with Databricks’ distributed architecture for efficient training and deployment.

#### Ensemble

We will build ensemble models that combine predictions from time-series and non-time-series models:

1. We will first try **simple averaging**, simply taking the mean of predictions made by the models already trained above. This is easy to implement as a starting point as it does not require re-training the previous models.

2. We will also try **weighted averaging**. We will try fitting a linear regression model on out-of-fold predictions to learn the optimal weights of combining the individual predictions.

3. We will also experiment with training **meta-models** by using scikit-learn's `StackingRegressor`. Depending on the actual performance of the models (Lasso, XGBoost, MLP, and ARIMA/Prophet), we would use all or a subset of them, since adding more models to the stack significantly increases the training time.
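
The first two options can be sketched in a few lines. This is an illustration under stated assumptions: the predictions are hypothetical out-of-fold values for two models, and the weight fit is an intercept-free least squares solved in closed form rather than the full linear regression we would run in practice.

```python
def simple_average(preds_by_model):
    """Element-wise mean of the predictions from each model."""
    return [sum(p) / len(p) for p in zip(*preds_by_model)]


def fit_two_model_weights(p1, p2, y):
    """Least-squares weights (w1, w2) minimizing sum (y - w1*p1 - w2*p2)^2."""
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * t for a, t in zip(p1, y))
    b2 = sum(b * t for b, t in zip(p2, y))
    det = a11 * a22 - a12 * a12  # zero only if p1 and p2 are collinear
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


# Hypothetical out-of-fold predictions (minutes) and actual delays.
p_time_series = [12.0, 28.0, 48.0, 8.0]
p_xgboost = [8.0, 32.0, 64.0, 4.0]
actual = [9.0, 31.0, 60.0, 5.0]

averaged = simple_average([p_time_series, p_xgboost])
w1, w2 = fit_two_model_weights(p_time_series, p_xgboost, actual)
weighted = [w1 * a + w2 * b for a, b in zip(p_time_series, p_xgboost)]
```

Fitting the weights on out-of-fold predictions, rather than in-sample ones, keeps the blend from rewarding whichever base model overfits the training data most.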

We consider using an ensemble model in the end because this data has a temporal element. An ensemble of time-series and non-time-series models may help us capture different aspects of the prediction problem. For example, combining XGBoost, which can be a powerful predictor on its own, with time-series models can introduce temporal dependencies to the former. An ensemble can also reduce variance and limit overfitting, since the data likely has many anomalies that may be emphasized more in some models. For example, an unusual event in a specific year may disproportionately skew the time-series model.

### Evaluation & Metrics

Our model is a regression model designed to predict flight delays (in minutes). To evaluate its performance, we use a combination of standard regression metrics and domain-specific operational metrics.

#### Standard Metrics

#####  Mean Squared Error (MSE)

We will evaluate our models using MSE, a standard regression metric, to assess how well they predict continuous delay times in minutes. MSE, calculated as

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

measures the average squared difference between predicted and actual delay times. MSE penalizes large errors more than small errors due to the squaring operation, making it sensitive to outliers and ensuring the model focuses on reducing substantial prediction errors.

##### Root Mean Squared Error (RMSE)

The Root Mean Squared Error (RMSE), computed as


$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

is the square root of MSE and is expressed in the same units as the target variable (minutes), providing a more interpretable measure of average prediction error magnitude that can be directly understood by stakeholders. 

##### Mean Absolute Error (MAE)

Mean Absolute Error (MAE), defined as

$$
MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|
$$

calculates the average absolute difference between predicted and actual delays without squaring, making it less sensitive to outliers than MSE or RMSE and providing a robust measure of typical prediction error. 
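
The three metrics above can be computed directly from their definitions. The sketch below uses hypothetical delays and predictions to show how a single large miss affects RMSE more than MAE.

```python
from math import sqrt


def mse(y_true, y_pred):
    """Mean squared error in minutes squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the delays (minutes)."""
    return sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error in minutes."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted delays; the last flight is badly missed.
actual = [0.0, 10.0, 20.0, 90.0]
predicted = [5.0, 10.0, 25.0, 50.0]

scores = {
    "mse": mse(actual, predicted),
    "rmse": rmse(actual, predicted),
    "mae": mae(actual, predicted),
}
```

In this sample the 40-minute miss pushes RMSE to roughly 20 minutes while MAE stays at 12.5 minutes, which is why we report both.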



#### Classification Metrics (for Bucketed Predictions)


To evaluate the practical utility of our predictions for operational decision-making, we will convert continuous regression predictions into discrete time buckets and score them with the classification metrics below.

##### Precision
Precision, computed as $$Precision = \frac{TP}{TP + FP}$$, measures the proportion of predicted delayed flights (within a specific bucket) that are actually delayed, indicating the accuracy of positive predictions and helping stakeholders understand how often they can trust the model's delay warnings.

##### Recall
Recall (also known as Sensitivity), calculated as $$Recall = \frac{TP}{TP + FN}$$, measures the proportion of actually delayed flights that are correctly predicted by the model, indicating the model's ability to identify delayed flights and helping ensure that significant delays are not missed.

##### F1-Score
The F1-Score, defined as $$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$, is the harmonic mean of precision and recall, providing a balanced metric when both precision and recall are important. The F1-score is particularly valuable for imbalanced datasets where delay frequencies may vary significantly across different routes, airlines, or time periods, as it penalizes models that achieve high precision at the expense of recall or vice versa.

##### Accuracy
Accuracy, expressed as $$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$, measures the overall proportion of correct predictions across all buckets, though this metric may be misleading for imbalanced classes where most flights are on-time, as a model that always predicts "on-time" would achieve high accuracy despite poor performance in identifying delays.
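
A minimal sketch of bucketing and per-bucket scoring follows. The bucket edges are an assumption, borrowed from the 15-minute and 60-minute thresholds used for the domain-specific metrics in this report. The bucket names and sample values are hypothetical.

```python
def bucket(delay):
    """Map a delay in minutes to a bucket label (edges are assumed)."""
    if delay < 15:
        return "on_time"
    if delay < 60:
        return "moderate"
    return "severe"


def bucket_scores(true_labels, pred_labels, label):
    """One-vs-rest precision, recall, and F1 for a single bucket."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical actual and predicted delays in minutes.
actual = [5, 20, 70, 10, 45, 90]
predicted = [3, 10, 65, 20, 50, 40]

true_buckets = [bucket(d) for d in actual]
pred_buckets = [bucket(d) for d in predicted]
precision, recall, f1 = bucket_scores(true_buckets, pred_buckets, "moderate")
```

Scoring each bucket one-vs-rest keeps the TP/FP/FN definitions in the formulas above unambiguous when there are more than two buckets.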


#### Domain-Specific Metrics
Beyond standard statistical metrics, we plan to employ domain-specific metrics that directly measure the practical value of our predictions for airline operations and passenger experience, tying model performance to business outcomes.

##### On-Time Performance Prediction Accuracy (OTPA)
**On-Time Performance Prediction Accuracy** measures how accurately the model classifies a flight as on-time or delayed, which is critical for operational planning because resource allocation decisions, such as gate assignments, crew scheduling, and passenger communication, depend on knowing which flights will depart close to schedule.

OTPA converts delay predictions into a binary outcome, *on-time* (delay < 15 minutes) or *delayed* (≥ 15 minutes), and measures accuracy or F1-score. This bridges regression and classification perspectives, aligning with how the aviation industry defines on-time performance. On-time flights improve the airline's reputation and keep airport traffic moving smoothly.

We operationalize this metric by converting continuous delay predictions into binary outcomes consistent with U.S. Department of Transportation (DOT) definitions:

- On-Time: delay < 15 minutes
- Delayed: delay ≥ 15 minutes


**On-Time Definition:**

- Predicted On-Time = 1 if `ŷ < 15`, else 0  
- Actual On-Time = 1 if `y < 15`, else 0  

**Formula:**

`On-Time Prediction Accuracy (OTPA) = (TP + TN) / (TP + TN + FP + FN)`

Where:  
- **TP** = correctly predicted on-time flights  
- **TN** = correctly predicted delayed flights  
- **FP** = predicted on-time but actually delayed  
- **FN** = predicted delayed but actually on-time  

We may also compute the **F1-score** for the “on-time” class:

`F1 = 2 * (Precision * Recall) / (Precision + Recall)`

This metric translates regression outputs into actionable insights for **crew scheduling**, **resource allocation**, and **passenger notifications**, emphasizing precision in predicting near-schedule flights.
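
A minimal sketch of the OTPA calculation, using the 15-minute threshold defined above. The function name and sample values are hypothetical.

```python
ON_TIME_THRESHOLD = 15  # minutes, per the on-time definition above


def otpa(y_true, y_pred, threshold=ON_TIME_THRESHOLD):
    """Share of flights whose predicted on-time status matches the actual one."""
    correct = sum(
        (t < threshold) == (p < threshold) for t, p in zip(y_true, y_pred)
    )
    return correct / len(y_true)


# Hypothetical actual and predicted delays in minutes.
actual = [5, 20, 70, 10]
predicted = [3, 10, 65, 20]

score = otpa(actual, predicted)
```

Counting matches of the two boolean flags is equivalent to `(TP + TN) / (TP + TN + FP + FN)`, since every flight falls into exactly one of the four cells.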
  


##### Severe Delay Detection Rate (SDDR)
Severe Delay Detection Rate measures the recall specifically for delays of 60 minutes or more, which is important for identifying flights requiring major operational adjustments such as aircraft substitutions, crew reassignments, or passenger compensation, as these severe delays have the most significant impact on operations and customer satisfaction.


**Formula:**

`Severe Delay Detection Rate (SDDR) = TP_severe / (TP_severe + FN_severe)`

Where:  
- **TP_severe** = number of flights correctly predicted to have severe delays (≥ 60 min)  
- **FN_severe** = number of severely delayed flights the model failed to identify  

A high SDDR ensures the model can proactively flag flights requiring **contingency actions** such as aircraft swaps, gate rescheduling, and passenger rebooking.  
This metric operationalizes the model’s **real-world value**—not just statistical accuracy, but its usefulness for **delay mitigation and resource optimization**.
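
A minimal sketch of SDDR, using the 60-minute threshold from the definition above. The function name and sample values are hypothetical.

```python
SEVERE_THRESHOLD = 60  # minutes, per the severe-delay definition above


def sddr(y_true, y_pred, threshold=SEVERE_THRESHOLD):
    """Recall on severely delayed flights; None if there are none to detect."""
    tp = sum(t >= threshold and p >= threshold for t, p in zip(y_true, y_pred))
    fn = sum(t >= threshold and p < threshold for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else None


# Hypothetical actual and predicted delays in minutes.
actual = [70, 90, 120, 10]
predicted = [65, 40, 130, 75]

rate = sddr(actual, predicted)
```

The fourth flight is a false alarm (predicted severe, actually on-time). It does not lower SDDR, so this metric should be read alongside precision for the severe bucket.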
 

#### Validation Strategy
To ensure robust model evaluation and prevent overfitting, we will employ a time-based cross-validation strategy that respects the temporal nature of flight delay data. Traditional random cross-validation would violate the temporal structure of our dataset by potentially training on future data to predict past events, introducing data leakage that would lead to overly optimistic performance estimates. Instead, we will use time-series cross-validation, where the training set always precedes the validation set chronologically, ensuring that models are evaluated on their ability to predict future delays based on historical patterns. Throughout the model development process, we will monitor all metrics on the validation set to prevent overfitting, using the validation performance as a guide for hyperparameter tuning and model selection decisions, while reserving the test set for final unbiased performance assessment only after all model development and selection decisions have been finalized.
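
The chronological constraint can be sketched as an expanding-window splitter. This is one possible scheme and not a finalized design: the yearly granularity, the three folds, and holding out 2021 as the test set are all assumptions for illustration.

```python
def expanding_window_splits(periods, n_folds):
    """Yield (train_periods, validation_period) pairs in chronological order.

    Each fold trains on every period before the validation period, so the
    training window grows by one period per fold.
    """
    if n_folds >= len(periods):
        raise ValueError("need at least one training period before the first fold")
    first_val = len(periods) - n_folds
    for i in range(first_val, len(periods)):
        yield periods[:i], periods[i]


# Assumed layout: 2015-2020 for cross-validation, 2021 reserved as the test set.
years = [2015, 2016, 2017, 2018, 2019, 2020]
splits = list(expanding_window_splits(years, n_folds=3))

for train_years, val_year in splits:
    print(f"train {train_years[0]}-{train_years[-1]} -> validate {val_year}")
```

Every fold validates on a year strictly later than anything in its training window, which is the property random k-fold splitting would violate.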

## Conclusion and Next Steps

In this phase, our focus was on **exploratory data analysis (EDA)** and developing a deep understanding of the structure, quality, and interrelationships among the flight, airport, and weather datasets. We successfully conducted sample-level experiments to design and validate **custom spatial and temporal join strategies** that integrate flight records with corresponding airport and weather information.

Through this process, we:

- Explored data distributions, missing values, and schema inconsistencies.  
- Designed and validated spatial (±0.1°) and temporal (±24 hours) joins on data subsets.  
- Identified issues such as inconsistent date formats and memory constraints when scaling to full datasets.  
- Observed strong correlations between **departure delays** and **adverse weather conditions** (e.g., low visibility, high wind speeds), confirming the importance of weather-driven features for modeling.

At this stage, our efforts established a solid understanding of the data landscape and validated integration logic using manageable samples.  

In the next phase, we plan to:

- **Scale** the validated join strategy to the complete 2015–2021 dataset.  
- Extend this foundation into **feature engineering** and predictive modeling, creating features such as rolling-window weather trends, airport congestion metrics, and network connectivity measures.  
- **Implement** a structured, end-to-end data pipeline with **checkpointing** and **incremental processing** for scalability and fault tolerance.  
- **Generate** enriched, machine learning–ready datasets that combine flight, airport, and weather features for robust model training and evaluation.

This foundational EDA and prototype work lays the groundwork for a **scalable, reproducible data architecture** in the next project phase, supporting advanced analytics and real-time delay forecasting.


### Acknowledgements 

We would like to express our sincere gratitude to Professor Vinicio De Sola for his valuable guidance and insights during live sessions and office hours. His thoughtful feedback and clarifications greatly helped us refine our methodology and deepen our understanding of the project’s technical aspects.

### References

#### Code Notebooks




Siddharth - Custom Joins: https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/669506783946166?o=4021782157704243 

Paul - EDA: https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/4473559687182426?o=4021782157704243 

Indri - EDA: https://dbc-fae72cab-cf59.cloud.databricks.com/editor/notebooks/2168935714217059?o=4021782157704243
