# UMBC DATA606 Capstone – Flight Delay Prediction Proposal

**Project Title:** Predicting U.S. Flight Delays and Delay Duration  
**Prepared for:** UMBC Data Science Master Degree Capstone by Dr. Chaojie (Jay) Wang  
**Author:** Drashi Dave  
**GitHub Repository:** [https://github.com/DrashiDave/UMBC-DATA606-Capstone](https://github.com/DrashiDave/UMBC-DATA606-Capstone)  
**LinkedIn Profile:** [linkedin.com/in/drashi-d](https://www.linkedin.com/in/drashi-d)  
<!--**PowerPoint Presentation:** *TBD* -->
<!-- **YouTube Video:** *TBD* -->

## Background

Flight delays are a common issue affecting passengers, airlines, and the broader air transportation system. They increase operational costs, reduce passenger satisfaction, and cause cascading disruptions across connecting flights. The U.S. Bureau of Transportation Statistics (BTS) collects detailed information on flight operations, including arrival delays and their causes.

**Project Objective:**  
This project aims to predict whether a flight will be delayed (classification) and, if delayed, estimate the length of the delay in minutes (regression).

**Why it Matters:**  
Accurate predictions of flight delays help:  
- Passengers plan their travel more efficiently  
- Airlines optimize schedules, staffing, and resource allocation  
- Airports and air traffic controllers manage congestion  

**Research Questions:**  
1. Can we predict whether a flight will be delayed by more than 15 minutes?  
2. If a flight is delayed, can we predict the expected delay duration in minutes?  
3. Which factors (airline, airport, time of year, weather, etc.) contribute most to delays?  


In [52]:
import pandas as pd

**Data Source:**  
- U.S. Bureau of Transportation Statistics - On-Time Performance Dataset (Open-source, publicly available)  
- Official government dataset: [https://www.transtats.bts.gov/](https://www.transtats.bts.gov/)
- Local Copy (for convenience): Available in this GitHub repository  
  [https://github.com/DrashiDave/UMBC-DATA606-Capstone/tree/main/data/Airline_Delay_Cause.csv](https://github.com/DrashiDave/UMBC-DATA606-Capstone/tree/main/data/Airline_Delay_Cause.csv)


In [10]:
# Load dataset 
file_path = "Airline_Delay_Cause.csv"   
df = pd.read_csv(file_path)

In [12]:
# Display first few rows
print("Preview of dataset:")
display(df.head())

Preview of dataset:


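|   | year | month | carrier | carrier_name | airport | airport_name | arr_flights | arr_del15 | carrier_ct | weather_ct | ... | security_ct | late_aircraft_ct | arr_cancelled | arr_diverted | arr_delay | carrier_delay | weather_delay | nas_delay | security_delay | late_aircraft_delay |
|---|------|-------|---------|--------------|---------|--------------|-------------|-----------|------------|------------|-----|-------------|------------------|---------------|--------------|-----------|---------------|---------------|-----------|----------------|---------------------|
| 0 | 2025 | 5 | 9E | Endeavor Air Inc. | ABE | Allentown/Bethlehem/Easton, PA: Lehigh Valley ... | 92.0 | 17.0 | 4.8 | 1.24 | ... | 0.0 | 6.67 | 4.0 | 2.0 | 1834.0 | 517.0 | 555.0 | 283.0 | 0.0 | 479.0 |
| 1 | 2025 | 5 | 9E | Endeavor Air Inc. | AEX | Alexandria, LA: Alexandria International | 92.0 | 24.0 | 10.74 | 3.65 | ... | 0.0 | 5.03 | 2.0 | 1.0 | 2080.0 | 615.0 | 917.0 | 186.0 | 0.0 | 362.0 |
| 2 | 2025 | 5 | 9E | Endeavor Air Inc. | AGS | Augusta, GA: Augusta Regional at Bush Field | 188.0 | 52.0 | 17.87 | 1.49 | ... | 0.0 | 19.67 | 4.0 | 1.0 | 4132.0 | 956.0 | 538.0 | 898.0 | 0.0 | 1740.0 |
| 3 | 2025 | 5 | 9E | Endeavor Air Inc. | ALB | Albany, NY: Albany International | 83.0 | 26.0 | 6.01 | 0.0 | ... | 0.0 | 17.11 | 4.0 | 0.0 | 1975.0 | 857.0 | 0.0 | 83.0 | 0.0 | 1035.0 |
| 4 | 2025 | 5 | 9E | Endeavor Air Inc. | ATL | Atlanta, GA: Hartsfield-Jackson Atlanta Intern... | 3118.0 | 785.0 | 146.76 | 26.61 | ... | 0.0 | 411.36 | 43.0 | 8.0 | 67705.0 | 19313.0 | 3384.0 | 10047.0 | 0.0 | 34961.0 |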


**Each row represents aggregated statistics for one airline at one airport during one month, including flight counts, delays, and causes of delays.**

In [14]:
# Shape of the dataset
print("\nDataset shape (rows, columns):", df.shape)


Dataset shape (rows, columns): (100447, 21)


**Data Size:**  
- 100,447 rows and 21 columns  
- Coverage: January 2021 through May 2025, as monthly carrier-airport aggregates (not a single year)  
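
The size figures can be measured directly rather than estimated. The following is a minimal sketch: `describe_size` is a hypothetical helper, and the small temporary CSV only stands in for `Airline_Delay_Cause.csv`.

```python
import os
import tempfile

import pandas as pd


def describe_size(path):
    """Return row/column counts plus on-disk and in-memory size in MB."""
    df = pd.read_csv(path)
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "disk_mb": os.path.getsize(path) / 1e6,
        # deep=True counts the bytes actually held by string (object) columns
        "memory_mb": df.memory_usage(deep=True).sum() / 1e6,
    }


# Stand-in data (assumption): a few columns of two rows from the preview.
sample = pd.DataFrame(
    {
        "year": [2025, 2025],
        "month": [5, 5],
        "carrier": ["9E", "9E"],
        "airport": ["ABE", "AEX"],
        "arr_flights": [92.0, 92.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    sample.to_csv(sample_path, index=False)
    size_info = describe_size(sample_path)

print(size_info)
```

In the notebook, calling `describe_size("Airline_Delay_Cause.csv")` would report the figures for the real file.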

In [16]:
# Column information
print("\nColumn Data Types:")
print(df.dtypes)


Column Data Types:
year                     int64
month                    int64
carrier                 object
carrier_name            object
airport                 object
airport_name            object
arr_flights            float64
arr_del15              float64
carrier_ct             float64
weather_ct             float64
nas_ct                 float64
security_ct            float64
late_aircraft_ct       float64
arr_cancelled          float64
arr_diverted           float64
arr_delay              float64
carrier_delay          float64
weather_delay          float64
nas_delay              float64
security_delay         float64
late_aircraft_delay    float64
dtype: object


### Data Information

| Column Name           | Data Type | Definition                                                                                     | Example Values |
|-----------------------|-----------|-------------------------------------------------------------------------------------------------|----------------|
| year                  | int       | Year in YYYY format                                                                             | 2025           |
| month                 | int       | Month as an integer (1-12)                                                                      | 5              |
| carrier               | object    | Code assigned by US DOT to identify a unique airline carrier                                     | 9E             |
| carrier_name          | object    | Unique airline holding and reporting under the same DOT certificate                              | Endeavor Air Inc. |
| airport               | object    | Three-character airport code issued by US DOT                                                   | ABE            |
| airport_name          | object    | Full name of airport including location                                                         | Allentown/Bethlehem/Easton, PA: Lehigh Valley Intl |
| arr_flights           | float     | Total number of arriving flights                                                               | 92.0           |
| arr_del15             | float     | Number of arriving flights delayed by 15 minutes or more (actual vs. scheduled arrival time)   | 17.0           |
| carrier_ct            | float     | Number of delays caused by the airline                                                         | 4.8            |
| weather_ct            | float     | Number of delays caused by weather                                                             | 1.24           |
| nas_ct                | float     | Number of delays attributed to the National Aviation System (NAS)                              | 4.29           |
| security_ct           | float     | Number of delays caused by security                                                            | 0.0            |
| late_aircraft_ct      | float     | Number of delays caused by late aircraft                                                       | 6.67           |
| arr_cancelled         | float     | Number of cancelled flights                                                                    | 4.0            |
| arr_diverted          | float     | Number of diverted flights                                                                     | 2.0            |
| arr_delay             | float     | Total arrival delay in minutes for the row (sum of the five cause-specific delay columns)      | 1834.0         |
| carrier_delay         | float     | Total delay minutes attributed to the carrier                                                  | 517.0          |
| weather_delay         | float     | Total delay minutes attributed to weather                                                      | 555.0          |
| nas_delay             | float     | Total delay minutes attributed to NAS                                                          | 283.0          |
| security_delay        | float     | Total delay minutes attributed to security                                                     | 0.0            |
| late_aircraft_delay   | float     | Total delay minutes attributed to late-arriving aircraft                                       | 479.0          |
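
A quick arithmetic check on the first preview row (9E at ABE, May 2025) suggests how the cause columns relate to the totals. This is a sketch on a single row using the example values listed above; whether the relationship holds for every row of the file is an assumption that would need to be verified on the full dataset.

```python
import numpy as np
import pandas as pd

# Values copied from the first preview row; nas_ct is the table's example value.
row = pd.Series(
    {
        "arr_del15": 17.0,
        "carrier_ct": 4.8,
        "weather_ct": 1.24,
        "nas_ct": 4.29,
        "security_ct": 0.0,
        "late_aircraft_ct": 6.67,
        "arr_delay": 1834.0,
        "carrier_delay": 517.0,
        "weather_delay": 555.0,
        "nas_delay": 283.0,
        "security_delay": 0.0,
        "late_aircraft_delay": 479.0,
    }
)

count_cols = ["carrier_ct", "weather_ct", "nas_ct", "security_ct", "late_aircraft_ct"]
minute_cols = [
    "carrier_delay",
    "weather_delay",
    "nas_delay",
    "security_delay",
    "late_aircraft_delay",
]

# Do the cause counts add up to the number of delayed flights?
counts_match = bool(np.isclose(row[count_cols].sum(), row["arr_del15"]))
# Do the cause minutes add up to the total delay minutes?
minutes_match = bool(np.isclose(row[minute_cols].sum(), row["arr_delay"]))

print("cause counts sum to arr_del15:", counts_match)
print("cause minutes sum to arr_delay:", minutes_match)
```

Both checks pass for this row, which is consistent with the `*_ct` columns splitting `arr_del15` by cause and the `*_delay` columns splitting `arr_delay` by cause.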


In [69]:
# Find the first and last year-month in the dataset
start_year = df['year'].min()
start_month = df[df['year'] == start_year]['month'].min()

end_year = df['year'].max()
end_month = df[df['year'] == end_year]['month'].max()

print(f"Start Date: {start_year}-{start_month:02d}")
print(f"End Date: {end_year}-{end_month:02d}")

Start Date: 2021-01
End Date: 2025-05


**The dataset covers January 2021 through May 2025.**  
- Start Date: January 2021  
- End Date: May 2025 (latest month in the file)  
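
The minimum and maximum dates do not show whether every month in between is present. The sketch below is one way to check for gaps; the three-row `df` is a stand-in (an assumption) for the loaded dataset, so the gaps it reports are artificial.

```python
import pandas as pd

# Stand-in rows (assumption): in the notebook, `df` is the loaded dataset.
df = pd.DataFrame({"year": [2021, 2021, 2025], "month": [1, 3, 5]})

# pd.to_datetime can assemble dates from year/month/day columns.
periods = pd.to_datetime(df[["year", "month"]].assign(day=1)).dt.to_period("M")

# Every month between the first and last observed month.
expected = pd.period_range(periods.min(), periods.max(), freq="M")

# Months in the expected range that never appear in the data.
observed = set(periods)
missing = [p for p in expected if p not in observed]

print(f"Expected months: {len(expected)}, missing: {len(missing)}")
```

An empty `missing` list on the real data would confirm continuous monthly coverage across the full range.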

In [28]:
categorical_cols = ['carrier_name', 'airport']
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")

carrier_name: 23 unique values
airport: 385 unique values


## Target Variables and Feature Candidates

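**Classification:** `Delayed` – 1 if `arr_delay >= 15`, else 0.  
> **Note:** The U.S. Bureau of Transportation Statistics defines a flight as “delayed” if it arrives 15 minutes or more after the scheduled time. This threshold is the aviation standard and keeps the model aligned with real-world definitions. Because each row in this dataset is a monthly carrier-airport aggregate, `arr_delay` holds the row's total delay minutes, so the threshold describes individual flights rather than rows.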

**Regression:** `arr_delay` – continuous delay time in minutes.
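
Because each row is a monthly aggregate for one carrier at one airport, row-level targets may need to be derived from the count and minute columns. The sketch below shows one possible approach, not the proposal's stated method: `delay_rate`, `avg_delay_min`, and the `HIGH_DELAY_RATE` cutoff are hypothetical names and values introduced only for illustration.

```python
import numpy as np
import pandas as pd

# First three preview rows, restricted to the columns needed here.
df = pd.DataFrame(
    {
        "airport": ["ABE", "AEX", "AGS"],
        "arr_flights": [92.0, 92.0, 188.0],
        "arr_del15": [17.0, 24.0, 52.0],
        "arr_delay": [1834.0, 2080.0, 4132.0],
    }
)

# Share of arrivals delayed 15+ minutes; zero denominators become NaN.
df["delay_rate"] = df["arr_del15"] / df["arr_flights"].replace(0, np.nan)

# Average delay minutes per delayed flight (candidate regression target).
df["avg_delay_min"] = df["arr_delay"] / df["arr_del15"].replace(0, np.nan)

# Hypothetical cutoff (assumption): flag rows where more than 20% of arrivals were delayed.
HIGH_DELAY_RATE = 0.20
df["Delayed"] = (df["delay_rate"] > HIGH_DELAY_RATE).astype(int)

print(df[["airport", "delay_rate", "avg_delay_min", "Delayed"]].round(3))
```

Any cutoff on `delay_rate` is a modeling choice; the 15-minute rule itself is already encoded in `arr_del15`.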

**Feature Candidates:** The following columns describe the airline, airport, time period, flight volume, and delay causes for each row:

- Airline (`carrier_name`)
- Airport (`airport`)
- Time (`year`, `month`)
- Flight counts (`arr_flights`)
- Cause counts (`carrier_ct`, `weather_ct`, `nas_ct`, `security_ct`, `late_aircraft_ct`)
- Cancellation/diversion counts (`arr_cancelled`, `arr_diverted`)

> **Explanation:** These features will allow the model to predict both whether a flight is delayed and the expected delay duration.
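
As an illustration of how these candidates could become model inputs, the sketch below one-hot encodes the two categorical columns with `pd.get_dummies` and keeps the numeric columns unchanged. It is a minimal example under stated assumptions: it uses three preview rows, and it leaves out `nas_ct` because the preview elides that column for these rows.

```python
import pandas as pd

# Three preview rows; nas_ct is omitted because the preview does not show it.
df = pd.DataFrame(
    {
        "carrier_name": ["Endeavor Air Inc."] * 3,
        "airport": ["ABE", "AEX", "AGS"],
        "year": [2025, 2025, 2025],
        "month": [5, 5, 5],
        "arr_flights": [92.0, 92.0, 188.0],
        "carrier_ct": [4.8, 10.74, 17.87],
        "weather_ct": [1.24, 3.65, 1.49],
        "security_ct": [0.0, 0.0, 0.0],
        "late_aircraft_ct": [6.67, 5.03, 19.67],
        "arr_cancelled": [4.0, 2.0, 4.0],
        "arr_diverted": [2.0, 1.0, 1.0],
    }
)

categorical = ["carrier_name", "airport"]
numeric = [c for c in df.columns if c not in categorical]

# One indicator column per distinct carrier and airport; numeric columns pass through.
X = pd.get_dummies(df[categorical + numeric], columns=categorical, dtype=float)

print(X.shape)
print(list(X.columns))
```

On the full dataset, the 23 carriers and 385 airports would expand to 408 indicator columns, so a more compact encoding may be worth considering for `airport`.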

