Flight delay prediction using Random Forest on synthetic and real data, exploring accuracy, key features, and class imbalance.
When my family was stranded in London during a transit system meltdown on a trip in 2024, it wasn’t just the chaos that struck me—it was how disconnected and brittle the decision-making seemed. That experience pushed me to explore predictive modeling through a Kaggle project on flight delays.
Flight delays affect millions of travelers every year and pose significant challenges to airline operations. Predicting delays can help optimize scheduling, improve customer satisfaction, and reduce operational costs. This project explores two models — one built on synthetic data and the other on real-world data — to predict whether a flight will be delayed.
- Synthetic Dataset: 50,000 generated samples based on airline attributes (e.g., origin, destination, departure time, gate wait time, and previous flight delays)
- Real-World Dataset: 1.6M+ flight records sourced from Kaggle.com
Install Jupyter Notebook 2.16, Python 3.13, scikit-learn 1.6.1, and the other dependencies (such as pandas, numpy, imbalanced-learn, and joblib).
Jupyter Notebook
- Run: jupyter notebook
- Once the Jupyter server is running, open http://localhost:8888 in your browser
- Upload the notebook (.ipynb) and its data file (.csv):
  - Synthetic Data Model: notebook airline-synthetic.ipynb with Dataset 1. The notebook creates or overwrites expanded_flight_delays.csv, so you don't need to upload this data file.
  - Real-World Data Model: notebook airline-real.ipynb with Dataset 2 (flights_sample_3m.csv), which needs to be uploaded.
- Run each cell one by one or Run all
- To re-run, first clear the outputs (Clear Cell Output or Clear Outputs of All Cells)
The first part of the project involved creating and modeling synthetic data to simulate realistic flight scenarios:
Steps:
- Generated 50,000 samples using Python (`pandas`, `numpy`, etc.), simulating features like departure time, previous late flights, and gate wait time.
- Converted object-type columns to integers; applied one-hot encoding to categorical features.
- Split the data into 80% training and 20% testing sets; ensured aligned feature dimensions.
- Handled class imbalance using `SMOTETomek` to balance delayed vs. non-delayed flights.
- Trained a `RandomForestClassifier` achieving 99.98% accuracy with only 2 misclassified samples.
- Saved the model using `joblib`.
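The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names, value ranges, labelling rule, sample size, and output path are all assumptions made for the example. `SMOTETomek` comes from the separate imbalanced-learn package, so the sketch skips resampling if that package is not installed.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

try:  # imbalanced-learn is an extra dependency; resampling is skipped without it
    from imblearn.combine import SMOTETomek
except ImportError:
    SMOTETomek = None

rng = np.random.default_rng(42)
n = 5_000  # the project generated 50,000 rows; kept small so the sketch runs quickly

# Hypothetical columns and value ranges, chosen only for illustration.
df = pd.DataFrame({
    "origin": rng.choice(["JFK", "LAX", "ORD"], size=n),
    "destination": rng.choice(["ATL", "DFW", "SFO"], size=n),
    "departure_hour": rng.integers(0, 24, size=n),
    "prev_flights_late": rng.integers(0, 6, size=n),
    "gate_wait_min": rng.normal(20, 8, size=n).clip(0),
})
# Assumed labelling rule: delays follow earlier late flights or long gate waits.
df["delayed"] = (
    (df["prev_flights_late"] >= 4) | (df["gate_wait_min"] > 35)
).astype(int)

# One-hot encode the categorical columns as integer indicator columns.
X = pd.get_dummies(
    df.drop(columns="delayed"), columns=["origin", "destination"], dtype=int
)
y = df["delayed"]

# 80/20 split; stratify keeps the delayed share similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Keep test columns aligned with the training columns.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# Rebalance the training set only, so the test set stays untouched.
if SMOTETomek is not None:
    X_train, y_train = SMOTETomek(random_state=42).fit_resample(X_train, y_train)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.4f}")

# Persist the fitted model and load it back (path is a placeholder).
model_path = os.path.join(tempfile.gettempdir(), "synthetic_rf_sketch.joblib")
joblib.dump(model, model_path)
reloaded = joblib.load(model_path)
```

Because the label in this sketch is a deterministic function of two features, the forest can recover it almost perfectly, which is the same effect the very high accuracy on the synthetic data points to.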
Top Features:
- Number of previous flights late: 73%
- Average gate wait time: 20%
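Percentages like these are typically read from a fitted forest's `feature_importances_` attribute. The sketch below shows one way to produce such a ranking; the data, column names, and labelling rule are hypothetical and are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2_000

# Hypothetical features; only prev_flights_late drives the label here.
X = pd.DataFrame({
    "prev_flights_late": rng.integers(0, 6, size=n),
    "gate_wait_min": rng.normal(20, 8, size=n),
    "departure_hour": rng.integers(0, 24, size=n),
})
y = (X["prev_flights_late"] >= 3).astype(int)  # assumed rule for illustration

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them and show as percentages.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print((importances * 100).round(1))
```

Impurity-based importances describe how the trees split on the training data, so they are best read as a ranking rather than as causal weights.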
⚠️ Note: The high accuracy is likely due to the simplified nature of the synthetic data, which may not reflect real-world complexity.
The second part of the project used a real dataset with over 1.6 million flight records.
Steps:
- Imported and cleaned the dataset. Filled missing values (mode for categorical, median for numerical).
- Encoded categorical variables; created a binary target label (`Delayed`: 1 or 0).
- Split the data into 70% training and 30% testing sets.
- Addressed class imbalance by resampling minority (delayed) cases.
- Trained a `RandomForestClassifier` achieving 92% accuracy.
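A small sketch of the cleaning, labelling, splitting, and resampling steps is shown below on a toy frame. The column names, the 15-minute delay threshold, and the choice of upsampling with replacement are assumptions for illustration; the notebook may define the label and the resampling differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(1)
n = 1_000

# Toy stand-in for the real dataset; all column names are hypothetical.
df = pd.DataFrame({
    "airline": rng.choice(["AA", "DL", "UA"], size=n).astype(object),
    "sched_dep_time": rng.integers(0, 2400, size=n).astype(float),
    "arr_delay_min": rng.exponential(8, size=n) - 5,
})
# Blank out some values so there is something to fill.
df.loc[rng.choice(n, size=50, replace=False), "airline"] = np.nan
df.loc[rng.choice(n, size=50, replace=False), "sched_dep_time"] = np.nan

# Fill missing values: median for numeric columns, mode for categorical ones.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

# Binary target; the 15-minute threshold is an assumption.
df["Delayed"] = (df["arr_delay_min"] > 15).astype(int)

# Drop the column the label was derived from, then encode categoricals.
X = pd.get_dummies(
    df.drop(columns=["Delayed", "arr_delay_min"]), columns=["airline"], dtype=int
)
y = df["Delayed"]

# 70/30 split, stratified so both parts keep the same delayed share.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Upsample the minority (delayed) class in the training set only.
train = X_train.assign(Delayed=y_train)
majority = train[train["Delayed"] == 0]
minority = train[train["Delayed"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=1
)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=1)
print(balanced["Delayed"].value_counts())
```

Resampling only the training portion keeps the test set at its natural class ratio, so the reported metrics reflect the imbalance the model would face in practice.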
Performance:
- F1 score (Not Delayed): 0.9602
- F1 score (Delayed): 0.0295
Top Features:
- Scheduled Arrival Time: 36%
- Scheduled Departure Time: 34%
- Actual Departure Time: 7%
⚠️ The low F1 score for delayed flights shows the challenge of handling real-world imbalance, even with good overall accuracy.
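To see how high accuracy and a very low minority-class F1 can coexist, the sketch below computes both from a confusion matrix. The counts are hypothetical, chosen only to illustrate the pattern; they are not the project's actual results.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts for 1,000 test flights, with "delayed" as the positive class:
# 80 flights are actually delayed, but the model flags only 4 flights in total.
tn, fp, fn, tp = 918, 2, 78, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1_delayed = f1_from_counts(tp=tp, fp=fp, fn=fn)
# Treating "not delayed" as the positive class swaps the roles of the counts.
f1_not_delayed = f1_from_counts(tp=tn, fp=fn, fn=fp)

print(f"accuracy={accuracy:.4f}")
print(f"F1 (Not Delayed)={f1_not_delayed:.4f}")
print(f"F1 (Delayed)={f1_delayed:.4f}")
```

Almost all of the accuracy here comes from correctly labelling the majority class, while recall on delayed flights stays close to zero, which is why per-class F1 is the more informative number on imbalanced data.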