A machine learning classification project that predicts flight delays from weather, airline, and route data, built for an airline company to improve customer satisfaction.
Client: Airline Company
Objective: Train ML classification models to predict whether a flight will be delayed or on time
Duration: Month 5 Project
- Source: Airline Delay Dataset (Kaggle)
- Features: Airline, Origin, Destination, Flight Date, Weather conditions, Route data, Distance
- Target: Flight delay status (Delayed/On-time)
- Threshold: Flights delayed >15 minutes are classified as "Delayed"
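The 15-minute threshold above can be turned into a binary target in a few lines. The following is a minimal sketch, not the notebook's exact code: the column name `ArrDelay` and the helper `label_delays` are assumptions made for illustration, so substitute whatever delay column the chosen Kaggle dataset provides.

```python
import pandas as pd

# From the project definition: more than 15 minutes late counts as "Delayed".
DELAY_THRESHOLD_MIN = 15


def label_delays(df: pd.DataFrame, delay_col: str = "ArrDelay") -> pd.Series:
    """Return 1 for flights delayed beyond the threshold, else 0.

    `delay_col` is a hypothetical column name holding the delay in minutes.
    """
    # Strictly greater than: a flight exactly 15 minutes late is still on time.
    return (df[delay_col] > DELAY_THRESHOLD_MIN).astype(int)


flights = pd.DataFrame({"ArrDelay": [-5, 0, 15, 16, 120]})
flights["Delayed"] = label_delays(flights)
print(flights)
```

Missing delay values compare as `False` here and would therefore be labeled on time, so it is safer to impute or drop them before labeling.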
flight-delay-prediction-ml/
├── flight_delay_prediction.ipynb # Main Jupyter notebook with complete analysis
├── model_logistic.pkl # Trained Logistic Regression model
├── model_random_forest.pkl # Trained Random Forest model
├── scaler.pkl # Feature scaler for preprocessing
├── label_encoders.pkl # Categorical feature encoders
├── feature_names.pkl # List of feature names
├── model_evaluation_report.md # Comprehensive evaluation metrics report
├── requirements.txt # Python dependencies
└── README.md # Project documentation
- Python: 3.8+
- Data Processing: pandas, numpy
- Machine Learning: scikit-learn
- Visualization: matplotlib, seaborn
- Model Persistence: joblib
- Development: Jupyter Notebook
# Clone the repository
git clone https://github.com/1234-ad/flight-delay-prediction-ml.git
cd flight-delay-prediction-ml
# Install dependencies
pip install -r requirements.txt

- Download the Airline Delay Dataset from Kaggle:
  - Option 1: Airline Delay and Cancellation Data
  - Option 2: Flight Delays Dataset
- Place the CSV file in the project directory
- Update the data loading section in the notebook if needed
Note: The notebook includes sample data generation for demonstration if the actual dataset is not available.
# Start Jupyter Notebook
jupyter notebook
# Open flight_delay_prediction.ipynb
# Run all cells to:
# 1. Load and explore data
# 2. Preprocess and engineer features
# 3. Train models
# 4. Evaluate performance
# 5. Save trained models

import joblib
import pandas as pd
# Load models
lr_model = joblib.load('model_logistic.pkl')
rf_model = joblib.load('model_random_forest.pkl')
scaler = joblib.load('scaler.pkl')
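# Prepare new data
# ... (preprocess new flight data)
# new_data is expected to hold the features listed in feature_names.pkl.
# Make predictions
# The scaler was fitted for Logistic Regression: call scaler.transform(new_data) before lr_model.predict.
prediction = rf_model.predict(new_data)

- ✓ Drop columns with >40% missing values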
- ✓ Handle remaining missing values (median for numeric, mode for categorical)
- ✓ Encode categorical features (Airline, Origin, Dest) using LabelEncoder
- ✓ Convert FlightDate to temporal features (weekday, month, day, quarter)
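The four preprocessing steps listed above could be sketched as a single function. This is an illustrative version under stated assumptions, not the notebook's code: the function name `preprocess`, the output column names (`Weekday`, `Month`, `DayOfMonth`, `Quarter`) and the toy data are hypothetical, while the 40% cutoff, median/mode imputation and `LabelEncoder` come from the list.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def preprocess(df, cat_cols=("Airline", "Origin", "Dest"),
               date_col="FlightDate", max_missing=0.40):
    """Apply the four listed steps and return (clean_df, fitted_encoders)."""
    # 1. Drop columns whose share of missing values exceeds 40%.
    df = df.loc[:, df.isna().mean() <= max_missing].copy()

    # 2. Impute what is left: median for numeric, mode for everything else.
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 3. Label-encode the categorical columns, keeping the encoders for reuse.
    encoders = {}
    for col in cat_cols:
        if col in df.columns:
            enc = LabelEncoder()
            df[col] = enc.fit_transform(df[col].astype(str))
            encoders[col] = enc

    # 4. Replace the raw date with temporal features.
    if date_col in df.columns:
        dates = pd.to_datetime(df[date_col])
        df["Weekday"] = dates.dt.weekday      # Monday = 0
        df["Month"] = dates.dt.month
        df["DayOfMonth"] = dates.dt.day
        df["Quarter"] = dates.dt.quarter
        df = df.drop(columns=[date_col])

    return df, encoders


# Toy data, invented for this example only.
raw = pd.DataFrame({
    "Airline": ["AA", "DL", None, "AA"],
    "Origin": ["JFK", "ATL", "JFK", "LAX"],
    "Dest": ["LAX", "JFK", "ATL", "JFK"],
    "FlightDate": ["2024-01-01", "2024-02-15", "2024-03-31", "2024-07-04"],
    "Distance": [2475.0, None, 760.0, 2475.0],
    "MostlyEmpty": [None, None, None, 1.0],
})
clean, encoders = preprocess(raw)
print(clean)
```

Returning the fitted encoders matters for inference: new flights have to be encoded with the same mapping that was learned during training, which is presumably why `label_encoders.pkl` is saved alongside the models.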
- Date Features: Weekday, Month, Day of Month, Quarter
- Encoded Features: Airline, Origin Airport, Destination Airport
- Numeric Features: Distance, Departure Time, Arrival Time, Delays
- Train/Test Split: 80/20 ratio with stratification
- Models Trained:
  - Logistic Regression (with StandardScaler)
  - Random Forest Classifier (100 trees, max_depth=10)
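The training setup described above can be sketched as follows. Synthetic data from `make_classification` stands in for the flight features so the snippet runs on its own; `random_state=42` and `max_iter=1000` are assumptions added for reproducibility and convergence, not settings confirmed by the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: replace with the preprocessed flight features and target.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)

# 80/20 split, stratified so both sets keep the same delayed/on-time ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression is trained on the scaled features.
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

# Random Forest with 100 trees and max_depth=10; tree-based models
# do not need feature scaling, so the unscaled features are used here.
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)

print("LR test accuracy:", lr_model.score(X_test_scaled, y_test))
print("RF test accuracy:", rf_model.score(X_test, y_test))
```

Fitting the scaler only on the training split avoids leaking test-set statistics into the model.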
- Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC
- Visualizations:
  - Confusion Matrix (both models)
  - ROC Curve comparison
  - Feature Importance (Random Forest)
  - Performance comparison charts
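The metrics listed above map directly onto functions in `sklearn.metrics`. Below is a minimal sketch that computes them for one fitted model; the helper name `evaluate` and the synthetic data are assumptions for illustration, and plotting is left out to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return the listed metrics for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    # ROC-AUC needs scores, so use the probability of the positive class.
    y_prob = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }


# Stand-in data and model so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
model.fit(X_train, y_train)

report = evaluate(model, X_test, y_test)
for name, value in report.items():
    if name != "confusion_matrix":
        print(f"{name}: {value:.3f}")
print(report["confusion_matrix"])

# Feature importances, highest first, as (feature index, importance) pairs.
ranked = sorted(enumerate(model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:3])
```

Calling `evaluate` once per model with the same test set gives the numbers needed for a side-by-side comparison; remember to pass the scaled test features when evaluating the Logistic Regression model.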
Logistic Regression:
- Accuracy: ~82-85%
- Precision: ~80-83%
- Recall: ~78-82%
- ROC-AUC: ~0.85-0.88

Random Forest:
- Accuracy: ~85-88%
- Precision: ~83-86%
- Recall: ~82-85%
- ROC-AUC: ~0.88-0.91
Note: Exact performance depends on the actual dataset used
- Automated missing value handling
- Categorical encoding (LabelEncoder)
- Date feature extraction
- Feature scaling for logistic regression
- Target distribution plots
- Correlation heatmaps
- Delay patterns by weekday/month
- Confusion matrices
- ROC curves
- Feature importance charts
- Random Forest typically outperforms Logistic Regression
- Key Features: Weather delays, carrier delays, departure delays, month, and weekday
- Business Value: Predict delays in advance to improve customer communication
- Proactive Notifications: Alert passengers about potential delays before they arrive at the airport
- Resource Optimization: Allocate staff and gates efficiently for delayed flights
- Customer Satisfaction: Reduce passenger frustration through better communication
- Operational Planning: Identify high-risk routes and time periods for delays
✅ flight_delay_prediction.ipynb - Complete Jupyter notebook with analysis
✅ model_logistic.pkl - Trained Logistic Regression model
✅ model_random_forest.pkl - Trained Random Forest model
✅ model_evaluation_report.md - Comprehensive evaluation report
✅ scaler.pkl, label_encoders.pkl, feature_names.pkl - Preprocessing artifacts
✅ Complete documentation and visualizations
- Real-time weather API integration
- Deep learning models (LSTM for time series)
- Multi-class classification (short, medium, long delays)
- Web dashboard for predictions
- A/B testing framework for model updates
- Model retraining pipeline
Contributions are welcome! Please feel free to submit a Pull Request.
Built as part of Month 5 Machine Learning Project
Project Title: Flight Delay Prediction Using Machine Learning
MIT License
Project Status: ✅ Completed
For questions or feedback, please open an issue in the repository.