# Flight Predict - Predicting Departure Delays using ML


## Phase leader plan in Table format
| Phase | Leader           | Duration | Project Plan |
|------:|----------------  |:--------:|--------------------------------------------------------------------------------------|
| 1     | Kowsalya Ganesan | Oct 27 - Nov 2 | Abstract, Data description, Machine learning algorithms and metrics, Machine Learning Pipelines |
| 2     | Eric Wu, Steven Au | Nov 3 - Nov 19 | TBD |
| 3     | Yu-Sheng Lee, Naresh Kumar | Nov 20 - Dec 10 | TBD |

## Credit assignment plan / who does/did what
| Team Member (A–Z by first name) | Responsibilities     |
|--------------------|----------------------|
| Eric Wu          |Machine learning algorithms|
| Kowsalya Ganesan |Exploratory Data Analysis (EDA)|
| Naresh Kumar     |Abstract|
| Steven Au        |Machine learning algorithms|
| Yu-Sheng Lee     |Machine Learning Pipelines|

## Team Members
| Name (A–Z by first name) | Email     | photo      |
|--------------------|----------------------|-------------|
| Eric Wu          |enqiwu@berkeley.edu| <img src="https://raw.githubusercontent.com/ericwu2024/w261/00189ab0be487ac67b350983d8b23f0e1c98f7e2/eric_wu.jpg" width="120" height="120">|
| Kowsalya Ganesan |kowsalya@berkeley.edu|<img src="https://raw.githubusercontent.com/ericwu2024/w261/00189ab0be487ac67b350983d8b23f0e1c98f7e2/kowsalya_ganesan.jpeg" width="120" height="120">|
| Naresh Kumar     |naresh_kumar@berkeley.edu|<img src="https://raw.githubusercontent.com/ericwu2024/w261/00189ab0be487ac67b350983d8b23f0e1c98f7e2/naresh_kumar.png" width="120" height="120">|
| Steven Au        |steau@berkeley.edu|<img src="https://raw.githubusercontent.com/ericwu2024/w261/00189ab0be487ac67b350983d8b23f0e1c98f7e2/steven_au.jpeg" width="120" height="120">|
| Yu-Sheng Lee     |yushenglee@berkeley.edu|<img src="https://raw.githubusercontent.com/ericwu2024/w261/00189ab0be487ac67b350983d8b23f0e1c98f7e2/yusheng_lee.jpg" width="120" height="120">|

## Abstract



The objective of this project is to build a flight delay prediction system that classifies upcoming flight departures into two categories, delay and no delay, to support airport management, leading to more efficient logistics operations and a better passenger experience. Using the Databricks PaaS environment, the project leverages the starter notebook and cluster setup for data exploration, checkpointing, and modeling. The primary data sources are the U.S. Department of Transportation for flight delay information (2015–2021) and the National Oceanic and Atmospheric Administration (NOAA) for weather data covering the same period. Preliminary exploratory analysis shows strong correlations between delay duration and factors such as poor weather visibility, precipitation, congestion at major hubs, and peak-hour departures. We also plan to use other factors, such as holidays and government shutdowns, to make the predictions more accurate.

The project frames delay prediction as a classification task, addressing the question “Is there a flight delay?” Predictions will be made up to two hours before scheduled departure, enabling proactive planning and passenger communication. We will train and evaluate models including Logistic Regression, Random Forest Classifier, and XGBoost Classifier using the integrated Spark MLlib and SparkXGBClassifier libraries on Databricks. Evaluation metrics will include Accuracy, Precision, Recall, Log loss and F1-Score. This approach supports operational decision-making, allowing airports to anticipate disruptions, allocate ground staff efficiently, and improve on-time performance tracking. Future extensions may include temporal features, route-based clustering, and real-time streaming predictions for airport operations dashboards.

## Data Description


The dataset consists of a 3-month sample of on-time performance and weather records for U.S. commercial flights. It includes detailed flight-level information such as carrier codes, departure and arrival times, delays, and airport-level weather observations.

**Data Size:**  ~1.4 million rows

**Number of Columns:**  256

### Summary Statistics

A descriptive analysis was performed on key numerical variables to assess data quality and to detect potential outliers.

<img src="https://raw.githubusercontent.com/ericwu2024/w261/dac9afe96b18c37ee6937282bf73028d88ccf585/summary.png">

**Key Insights:**

- The average departure delay is around 10 minutes, but a few extreme outliers (up to 1,971 minutes) produce a long right-tailed distribution.

- Binary indicators such as `ARR_DEL15` and `DEP_DEL15` confirm that about 20% of flights experience delays of 15 minutes or more.

**Missing Value Analysis**

An initial assessment of missing values was conducted to evaluate data completeness.

<img src="https://raw.githubusercontent.com/ericwu2024/w261/dac9afe96b18c37ee6937282bf73028d88ccf585/missing_values_assessment.png">

**Key Insights:**

- 91 columns have 100% missing values.

- 85 columns have less than 1% missing values, which can be easily imputed.

**Action Plan:**

- Greater than 70% missing: Drop the columns

- 25–70% missing: Evaluate and impute based on data type and context

- Less than 1% missing: Impute using mean or mode as applicable

- 0% missing: No action required

- We will include 3 checkpoints to save intermediate data for model training (See ML Pipeline section below)
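
The missing-value thresholds in the action plan can be sketched as a small lookup function. This is only an illustration: the function name, the action labels, and the example column names and fractions are hypothetical, and the `review` branch is an assumption for the 1% to 25% band, which the plan does not specify.

```python
def missing_value_action(missing_fraction: float) -> str:
    """Map a column's missing fraction (0.0 to 1.0) to the planned action."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be between 0 and 1")
    if missing_fraction == 0.0:
        return "keep"                    # 0% missing: no action required
    if missing_fraction < 0.01:
        return "impute_mean_or_mode"     # less than 1% missing
    if missing_fraction > 0.70:
        return "drop"                    # greater than 70% missing
    if missing_fraction >= 0.25:
        return "evaluate_and_impute"     # 25% to 70% missing
    # Assumption: the plan does not cover 1% to 25%, so flag for manual review.
    return "review"


# Hypothetical per-column missing fractions, for illustration only.
example_fractions = {"COL_A": 0.0, "COL_B": 0.004, "COL_C": 0.40, "COL_D": 1.0, "COL_E": 0.10}
actions = {col: missing_value_action(frac) for col, frac in example_fractions.items()}
print(actions)
```

In the actual pipeline the fractions would come from the per-column null counts of the dataset rather than from a hand-written dictionary.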

**Other dataset statistics:**

- Number of unique origin airports: 313
- Number of unique airlines: 14
- Number of unique destination airports: 315


### Data Dictionary:

| **Column Name**                                           | **Description**                                                           
| --------------------------------------------------------- | ------------------------------------------------------------------------- | 
| `FL_DATE`                                               | Flight date (local time at origin airport).                               
| `OP_UNIQUE_CARRIER`                                       | Unique carrier code (e.g., AA, DL).              
| `OP_CARRIER`                                              | Carrier code assigned by IATA (not always unique over time).              
| `OP_CARRIER_AIRLINE_ID`                                   | Numeric airline identifier.                                               
| `OP_CARRIER_FL_NUM`                                       | Flight number.                                                           
| `ORIGIN`, `DEST`                                          | Airport codes for origin and destination.                            
| `ORIGIN_AIRPORT_ID`, `DEST_AIRPORT_ID`                    | Numeric airport identifier assigned by DOT.                               
| `ORIGIN_CITY_NAME`, `DEST_CITY_NAME`                      | City names associated with airports.                                      
| `ORIGIN_STATE_ABR`, `DEST_STATE_ABR`                      | Two-letter state abbreviation.                                            
| `ORIGIN_STATE_NM`, `DEST_STATE_NM`                        | Full state name.                                                          
| `ORIGIN_STATE_FIPS`, `DEST_STATE_FIPS`                    | FIPS state code.                                                          
| `DISTANCE`                                                | Non-stop distance between origin and destination (miles).                 
| `DISTANCE_GROUP`                                          | Distance grouped into 250-mile bins (1-11).                               
| `CRS_DEP_TIME`, `CRS_ARR_TIME`                            | Scheduled departure and arrival times (local, HHMM).                      
| `DEP_TIME`, `ARR_TIME`                                    | Actual departure and arrival times (local, HHMM).                        
| `DEP_DELAY`, `ARR_DELAY`                                  | Difference between actual and scheduled departure/arrival time (minutes). 
| `DEP_DELAY_NEW`, `ARR_DELAY_NEW`                          | Delay variables with negative delays set to zero.                         
| `DEP_DELAY_GROUP`, `ARR_DELAY_GROUP`                      | Delay groupings in 15-minute intervals.                                   
| `DEP_DEL15`, `ARR_DEL15`                                  | Indicator: 1 if delay >= 15 minutes, else 0.                              
| `TAXI_OUT`, `TAXI_IN`                                     | Taxi-out and taxi-in time (minutes).                                      
| `WHEELS_OFF`, `WHEELS_ON`                                 | Actual takeoff and landing times.                                        
| `AIR_TIME`                                                | Minutes in flight.                                       
| `ACTUAL_ELAPSED_TIME`, `CRS_ELAPSED_TIME`                 | Actual and scheduled total flight time (minutes).                        
| `CANCELLED`, `DIVERTED`                                   | Indicators for cancellations and diversions (1 = Yes, 0 = No).           
| `FLIGHTS`                                                 | Number of flights represented by the record (typically 1).                
| `DAY_OF_MONTH`, `DAY_OF_WEEK`, `MONTH`, `QUARTER`, `YEAR` | Temporal attributes of the flight.                                        
| `DEP_TIME_BLK`, `ARR_TIME_BLK`                            | Scheduled time block of departure/arrival.  



### Correlation Matrix:
<img src="https://raw.githubusercontent.com/ericwu2024/w261/dac9afe96b18c37ee6937282bf73028d88ccf585/Correlation.png">

- `DEP_DELAY` is highly correlated with `ARR_DELAY`, because a late departure usually pushes back the downstream arrival time.
- The delay-group columns bin delay time into 15-minute intervals, hence their high correlation with the delay values.
- `LATE_AIRCRAFT_DELAY` and `CARRIER_DELAY` show higher correlations than `WEATHER_DELAY`, `SECURITY_DELAY`, and `NAS_DELAY`.


## Machine Learning Algorithms and Metrics
### Problem Choice and Rationale
***Classification*** answers *“Is the flight actually delayed?”* The answer is useful for setting expectations for passengers and crews when a delay alert is issued, and it can inform airport management decisions.

We will therefore predict whether a flight will depart on time or be delayed by 30 minutes or more, using only information available **two hours before the scheduled departure**. A flight is labeled as delayed when its departure delay meets the 30-minute threshold and as not delayed otherwise, so each model outputs a binary classification for a given flight. Airport management can then use these predictions to plan the resources needed to handle the delay. This framing directly serves operational needs (early alerts for airlines/airports) and supports capacity-constrained decision making (limited alert budget).
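
The two-hour constraint means that any feature observed after the cutoff must be excluded. A minimal sketch of that filter is shown below, assuming weather observations are `(timestamp, value)` pairs and that both timestamps are already in the same time zone; the function name and the sample readings are hypothetical.

```python
from datetime import datetime, timedelta

PREDICTION_LEAD = timedelta(hours=2)  # predictions are made two hours before departure


def latest_usable_observation(scheduled_departure, observations):
    """Return the most recent (timestamp, value) pair at or before the cutoff, or None."""
    cutoff = scheduled_departure - PREDICTION_LEAD
    usable = [obs for obs in observations if obs[0] <= cutoff]
    return max(usable, key=lambda obs: obs[0]) if usable else None


# Hypothetical visibility readings (miles) for a flight scheduled at 12:00.
departure = datetime(2015, 1, 1, 12, 0)
observations = [
    (datetime(2015, 1, 1, 8, 51), 10.0),
    (datetime(2015, 1, 1, 9, 51), 6.0),
    (datetime(2015, 1, 1, 10, 51), 2.0),  # after the 10:00 cutoff: must not be used
]
chosen = latest_usable_observation(departure, observations)
print(chosen)
```

Because the flight times in the data dictionary are local and the weather `DATE` column is in UTC, the real pipeline would need to convert both to a common time zone before applying the cutoff.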

The classification models below predict whether a departure will be delayed, using a label derived from the departure delay (**`DEP_DELAY`**), so that delay updates can be issued where needed. We will use a `majority class` baseline for comparison, in addition to comparing the models with one another.
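
As an illustration of the label and the baseline, the sketch below derives a binary label from departure delay minutes with the 30-minute threshold and computes the accuracy of always predicting the majority class. The function names and the sample delays are hypothetical, and records with a missing `DEP_DELAY` (for example, cancelled flights) are assumed to have been handled beforehand.

```python
from collections import Counter

DELAY_THRESHOLD_MINUTES = 30  # threshold stated in the problem framing


def delay_label(dep_delay_minutes):
    """Return 1 if the departure delay is 30 minutes or more, else 0."""
    return 1 if dep_delay_minutes >= DELAY_THRESHOLD_MINUTES else 0


def majority_class_baseline(labels):
    """Return (majority label, accuracy obtained by always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical DEP_DELAY values in minutes (negative means early departure).
dep_delays = [-5, 0, 12, 29, 30, 45, 3, -2, 120, 7]
labels = [delay_label(d) for d in dep_delays]
baseline_label, baseline_accuracy = majority_class_baseline(labels)
print(labels, baseline_label, baseline_accuracy)
```

Any trained model should beat this baseline on the chosen metrics; on imbalanced data the baseline's plain accuracy can look high even though it never predicts a delay.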

### Algorithms (Names, Implementations, Loss/Criteria)

#### 1) Random Forest 
- **Description:** Random Forest is an ensemble of decision trees trained on bootstrap samples; at each split a random subset of features is considered. Trees vote to produce class probabilities.
- **Implementation:** Apache Spark MLlib `RandomForestClassifier` (Databricks).
- **Loss/Criteria:**
  - *Entropy (per node):*
    $$
    H(S) = -\sum_{k=1}^{K} p_k \log p_k
    $$
  where $p_k$ is the proportion of class $k$ among the samples $S$ at a node. The tree grows by maximizing impurity reduction (information gain).

- **Justification:** Strong baseline on tabular, mixed-type, nonlinear data; robust to noise; handles interactions between schedule, route, and weather.
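
The entropy criterion and the resulting information gain of a split can be sketched in plain Python as follows. The helper names are hypothetical, and the base-2 logarithm is an assumption (the formula above leaves the base unspecified; changing it only rescales the values).

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy H(S) of a list of class labels, using log base 2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Impurity reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child_entropy


# A perfectly separating split of a balanced node removes all impurity.
parent = [1, 1, 0, 0]
gain = information_gain(parent, left=[1, 1], right=[0, 0])
print(entropy(parent), gain)
```

A split that leaves both children with the same class mix as the parent yields zero gain, which is why the tree prefers features that separate delayed from on-time flights.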

#### 2) XGBoost
- **Description:** XGBoost (Extreme Gradient Boosting) is a supervised machine learning algorithm that builds an ensemble of gradient-boosted decision trees.
- **Implementation:** `SparkXGBClassifier` from XGBoost's Spark library (Databricks).
- **Loss/Criteria:**
  - *Log Loss:*
    $$
      \mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i\log\hat p_i+(1-y_i)\log(1-\hat p_i)\Big]
    $$ 
- **Justification:** Works well on mixed-type, nonlinear data; supports L1 (lasso) and L2 (ridge) regularization to prevent overfitting; handles interactions between schedule, route, and weather.
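
The log loss formula above can be computed directly, as in the sketch below. The function name is hypothetical, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy between labels (0/1) and predicted delay probabilities."""
    if len(y_true) != len(p_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


confident_and_right = log_loss([1, 0], [0.9, 0.1])
confident_and_wrong = log_loss([1, 0], [0.1, 0.9])
print(confident_and_right, confident_and_wrong)
```

Confident wrong predictions are penalized far more heavily than confident correct ones, which is what makes log loss sensitive to the quality of the predicted probabilities and not only to the final class.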

#### 3) Logistic Regression
- **Description:** A supervised machine learning model that classifies a binary outcome from the input variables.
- **Implementation:** Apache Spark MLlib's binomial `LogisticRegression`.
- **Loss/Criteria:**  
  - *Log Loss:*
    $$
      \mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i\log\hat p_i+(1-y_i)\log(1-\hat p_i)\Big]
    $$ 
- **Justification:** Fast and interpretable for the delay/no-delay decision; coefficients indicate directional effects of weather/schedule features.
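
To illustrate how coefficients translate into directional effects, the sketch below scores one flight with a logistic model. The feature names, coefficient values, and intercept are hypothetical and chosen only for illustration; they are not fitted results.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_delay_probability(features, coefficients, intercept):
    """Probability of delay from a linear combination of feature values."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical coefficients: positive values push the prediction toward "delay".
coefficients = {"low_visibility": 0.8, "peak_hour": 0.5}
intercept = -1.5

p_clear = predict_delay_probability({"low_visibility": 0, "peak_hour": 0}, coefficients, intercept)
p_foggy = predict_delay_probability({"low_visibility": 1, "peak_hour": 0}, coefficients, intercept)
print(p_clear, p_foggy)
```

A positive coefficient raises the predicted probability when its feature increases, and a negative one lowers it. Coefficient magnitudes are only comparable across features when the numerical inputs are standardized.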

### Classification Metrics (With Equations) and Analysis

*Classification threshold: departure delay of 30 minutes or more.* A record with a departure delay of 30 minutes or more is labeled as delayed (1) in a new column; all other records are labeled 0.

We will use the following metrics to evaluate the machine learning classification models.
Let $TP$, $FP$, $TN$, and $FN$ be the confusion-matrix counts of True Positives, False Positives, True Negatives, and False Negatives, respectively; let $\hat p_i$ be the predicted probability of delay for record $i$; and let $y_i\in\{0,1\}$ be its true label.
- **(Balanced) Accuracy**
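  - **Description:** The average of the true positive rate and the true negative rate. Unlike plain accuracy (the proportion of all classifications that are correct), it is not dominated by the majority class.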
  $$
  \mathrm{BA} = \tfrac{1}{2}\left(\frac{TP}{TP+FN}+\frac{TN}{TN+FP}\right)
  $$
- **Precision**
  - **Description:** The proportion of positive classifications that are actually positive.
  $$
  \mathrm{Precision} = \frac{TP}{TP+FP}
  $$
- **Recall (Sensitivity/TPR)**
  - **Description:** The proportion of all actual positives that were classified correctly as positive.
  $$
  \mathrm{Recall} = \frac{TP}{TP+FN}
  $$
- **Log Loss (Cross-Entropy)**
  - **Description:** Log loss (logarithmic loss) shows how close the predicted probability is to the actual true value of the outcome.
  $$
  \mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i\log\hat p_i+(1-y_i)\log(1-\hat p_i)\Big]
  $$ 
- **F1-Score**
  - **Description:** F1 Score calculates the harmonic mean between the Precision and Recall values.
  $$
  \mathrm{F1} = 2\cdot\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}
  $$ 
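
The count-based metrics above can be computed from a confusion matrix as in the following sketch. The function name and the example counts are hypothetical, and returning 0.0 for an undefined ratio is an assumption made here for convenience.

```python
def classification_metrics(tp, fp, tn, fn):
    """Balanced accuracy, precision, recall, and F1 from confusion-matrix counts."""

    def safe_div(numerator, denominator):
        # Assumption: report 0.0 when a ratio is undefined (zero denominator).
        return numerator / denominator if denominator else 0.0

    recall = safe_div(tp, tp + fn)        # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 100 delayed flights and 200 on-time flights.
metrics = classification_metrics(tp=60, fp=20, tn=180, fn=40)
print(metrics)
```

With these counts, recall is 0.6 and the true negative rate is 0.9, so balanced accuracy is 0.75, while plain accuracy would be 240/300 = 0.8 because the on-time class is larger.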


## Machine Learning Pipelines



<img src="https://raw.githubusercontent.com/ericwu2024/w261/dac9afe96b18c37ee6937282bf73028d88ccf585/ML_Pipeline.png">

EDA (Exploratory Data Analysis) is performed on the OTPW (On-Time Performance and Weather) data and on secondary data, including US holiday information, to generate summary statistics and initial data visualizations. Next, we will develop data pipelines to address outliers, missing and duplicated values, and encode categorical features. We will allocate 80% of the data for training and 20% for testing. Because different models may require different feature engineering strategies, we will create a dedicated pipeline for each model; for example, we will standardize numerical features for logistic regression.

To ensure reproducibility and efficiency, three checkpoints will be established so intermediate data outputs can be accessed without rerunning the entire process during model training. We will train multiple ML models and perform model selection using cross-validation, reserving 20% of the training data for validation. We will evaluate model performance and fine-tune hyperparameters using the predefined performance metrics. The entire process will iterate until satisfactory model performance is achieved. Finally, we will evaluate the generalizability of the final ML models on the testing set.
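
The split sizes described above can be sketched on row indices as follows: 20% of the rows are held out for testing, and 5-fold cross-validation on the remainder reserves 20% of the training data for validation in each fold. The function names are hypothetical, and both the random shuffle and the choice of five folds are assumptions; the text does not say whether the split is random or chronological.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and hold out `test_fraction` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(rows, k=5):
    """Yield (train, validation) pairs; with k=5 each fold validates on 20% of rows."""
    rows = list(rows)
    fold_sizes = [len(rows) // k + (1 if i < len(rows) % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = rows[start:start + size]
        train = rows[:start] + rows[start + size:]
        yield train, validation
        start += size


train_rows, test_rows = train_test_split(range(100))
folds = list(k_fold_splits(train_rows, k=5))
print(len(train_rows), len(test_rows), [(len(t), len(v)) for t, v in folds])
```

For 100 rows this gives 80 training and 20 testing rows, and each fold trains on 64 rows and validates on 16. The test rows are touched only once, for the final generalization check.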




## Overall proposal appearance / Block Diagram
<br>
<img src="https://raw.githubusercontent.com/ericwu2024/w261/ac37abea4cb0dcf3f6e29b7472fbb0af36ee37ea/block_diagram.png">
<br>
<img src="https://raw.githubusercontent.com/ericwu2024/w261/ac37abea4cb0dcf3f6e29b7472fbb0af36ee37ea/duration_vs_task.png">

# Appendix

### Extended Summary Statistics

Additional summary statistics for the remaining numeric columns are shown below.
<br>
<img src="https://raw.githubusercontent.com/ericwu2024/w261/3fb910fe34be878207fa02d4e0ba9435939b6858/additional.png">



### Data Dictionary

**Weather Data Dictionary**


| **Column Name**                                                 | **Description**                                             | 
| --------------------------------------------------------------- | ----------------------------------------------------------- | 
| `STATION`, `NAME`                                               |  Weather station ID and name.                           |
| `DATE`                                                          | Date and time of weather observation (UTC).                 | 
| `LATITUDE`, `LONGITUDE`, `ELEVATION`                            | Geographic location and altitude of the weather station.    | 
| `SOURCE`, `REPORT_TYPE`                                         | Type of observation report (e.g., METAR, ASOS).             | 
| `HourlyDryBulbTemperature`                                      | Air temperature at observation time (°F).                   | 
| `HourlyDewPointTemperature`                                     | Dew point temperature (°F).                                 | 
| `HourlyRelativeHumidity`                                        | Relative humidity (%).                                      | 
| `HourlyVisibility`                                              | Horizontal visibility (miles).                              | 
| `HourlyAltimeterSetting`                                        | Barometric pressure at sea level (inHg).                    | 
| `HourlySeaLevelPressure`                                        | Atmospheric pressure at sea level (hPa).                    | 
| `HourlyStationPressure`                                         | Pressure at station elevation (hPa).                        | 
| `HourlyPrecipitation`                                           | Hourly precipitation total (inches).                        | 
| `HourlyWindSpeed`, `HourlyWindDirection`, `HourlyWindGustSpeed` | Wind speed (knots), direction (degrees), and gusts (knots). | 
| `HourlySkyConditions`                                           | Coded sky cover and cloud height.                           | 
| `HourlyWetBulbTemperature`                                      | Wet-bulb temperature (°F).                                  | 
| `WindEquipmentChangeDate`                                       | Date when wind measurement equipment was last updated.      | 
| `REM`                                                           | Remarks and supplemental weather notes.                     | 



**Airport Data Dictionary**

| **Column Name**                                                                    | **Description**                                                       |
| ---------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| `origin_airport_name`, `dest_airport_name`                                         | Full airport names.                                                   |
| `origin_airport_lat`, `origin_airport_lon`                                         | Latitude and longitude of origin airport.                             |
| `dest_airport_lat`, `dest_airport_lon`                                             | Latitude and longitude of destination airport.                        |
| `origin_station_lat`, `origin_station_lon`, `dest_station_lat`, `dest_station_lon` | Coordinates of associated weather stations.                           |
| `origin_region`, `dest_region`                                                     | Geographic region (e.g., East, West, Central); derived.               |
| `origin_type`, `dest_type`                                                         | Station type (Airport, ASOS, AWOS).                                   |
| `origin_iata_code`, `dest_iata_code`, `origin_icao`, `dest_icao`                   | Standard codes.                                                       |
| `origin_station_dis`, `dest_station_dis`                                           | Distance between airport and linked weather station (miles); derived. |
