Machine Learning project to predict flight delays using weather, airline, and route data. Built with Logistic Regression and Random Forest classifiers.

1234-ad/flight-delay-prediction-ml

Flight Delay Prediction Using Machine Learning

Project Overview

This project trains machine learning classifiers that predict flight delays from weather, airline, and route data. It was built for an airline company that wants to improve customer satisfaction.

Client: Airline Company
Objective: Train ML classification models to predict whether a flight will be delayed or on time
Duration: Month 5 Project

Dataset

  • Source: Airline Delay Dataset (Kaggle)
  • Features: Airline, Origin, Destination, Flight Date, Weather conditions, Route data, Distance
  • Target: Flight delay status (Delayed/On-time)
  • Threshold: Flights delayed >15 minutes are classified as "Delayed"
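
The 15-minute threshold above can be turned into a binary target with a single comparison. The following is a minimal sketch, assuming the delay is available as a numeric column in minutes; the function name and the sample values are hypothetical and not taken from the notebook.

```python
import pandas as pd

DELAY_THRESHOLD_MINUTES = 15  # threshold stated in this README


def label_delays(delay_minutes: pd.Series) -> pd.Series:
    """Return 1 for flights delayed by more than the threshold, else 0."""
    return (delay_minutes > DELAY_THRESHOLD_MINUTES).astype(int)


# Hypothetical delay values in minutes (negative means the flight was early)
delays = pd.Series([-5, 0, 15, 16, 120])
labels = label_delays(delays)
print(labels.tolist())  # [0, 0, 0, 1, 1]
```

A delay of exactly 15 minutes counts as on time here, because the README states the threshold as strictly greater than 15 minutes.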

Project Structure

flight-delay-prediction-ml/
├── flight_delay_prediction.ipynb    # Main Jupyter notebook with complete analysis
├── model_logistic.pkl                # Trained Logistic Regression model
├── model_random_forest.pkl           # Trained Random Forest model
├── scaler.pkl                        # Feature scaler for preprocessing
├── label_encoders.pkl                # Categorical feature encoders
├── feature_names.pkl                 # List of feature names
├── model_evaluation_report.md        # Comprehensive evaluation metrics report
├── requirements.txt                  # Python dependencies
└── README.md                         # Project documentation

Technologies Used

  • Python: 3.8+
  • Data Processing: pandas, numpy
  • Machine Learning: scikit-learn
  • Visualization: matplotlib, seaborn
  • Model Persistence: joblib
  • Development: Jupyter Notebook

Installation

# Clone the repository
git clone https://github.com/1234-ad/flight-delay-prediction-ml.git
cd flight-delay-prediction-ml

# Install dependencies
pip install -r requirements.txt

Dataset Setup

  1. Download the Airline Delay Dataset from Kaggle
  2. Place the CSV file in the project directory
  3. Update the data loading section in the notebook if needed

Note: The notebook includes sample data generation for demonstration if the actual dataset is not available.
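
For readers without the Kaggle file, a synthetic stand-in might look like the sketch below. This is not the notebook's generator: the column names, category values, and delay distribution are all assumptions chosen only to mirror the features listed in this README.

```python
import numpy as np
import pandas as pd


def make_sample_flights(n_rows: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Generate a synthetic flights table (hypothetical schema)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
    df = pd.DataFrame({
        "Airline": rng.choice(["AA", "DL", "UA", "WN"], size=n_rows),
        "Origin": rng.choice(["ATL", "ORD", "DFW", "DEN"], size=n_rows),
        "Dest": rng.choice(["LAX", "JFK", "SFO", "SEA"], size=n_rows),
        # Pick a random calendar day for each flight
        "FlightDate": dates[rng.integers(0, len(dates), size=n_rows)],
        "Distance": rng.integers(100, 3000, size=n_rows),
        # Delay in minutes, drawn from an arbitrary normal distribution
        "ArrDelay": rng.normal(loc=5, scale=30, size=n_rows).round(),
    })
    # Apply the README's rule: more than 15 minutes late means "Delayed"
    df["Delayed"] = (df["ArrDelay"] > 15).astype(int)
    return df


sample = make_sample_flights()
print(sample.head())
```

Because the features here are drawn independently of the delay, a model trained on this table will not learn meaningful patterns; it is only useful for checking that the pipeline runs end to end.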

Usage

Running the Complete Analysis

# Start Jupyter Notebook
jupyter notebook

# Open flight_delay_prediction.ipynb
# Run all cells to:
# 1. Load and explore data
# 2. Preprocess and engineer features
# 3. Train models
# 4. Evaluate performance
# 5. Save trained models

Using Saved Models

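import joblib
import pandas as pd

# Load the trained models and the preprocessing artifacts
lr_model = joblib.load('model_logistic.pkl')
rf_model = joblib.load('model_random_forest.pkl')
scaler = joblib.load('scaler.pkl')
feature_names = joblib.load('feature_names.pkl')

# Prepare new data: it must already be preprocessed the same way as in
# the notebook (encoded categoricals, date features).
# 'new_flights_preprocessed.csv' is a placeholder file name.
new_data = pd.read_csv('new_flights_preprocessed.csv')[feature_names]

# Make predictions (the Random Forest is assumed to take unscaled features)
rf_prediction = rf_model.predict(new_data)

# Logistic Regression was trained with StandardScaler, so scale first
lr_prediction = lr_model.predict(scaler.transform(new_data))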
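
Before prediction, raw categorical columns need to be mapped with the same encoders that were fitted during training. The sketch below assumes `label_encoders.pkl` holds a dict mapping column names to fitted `LabelEncoder` objects; that structure is an assumption, so the example builds an equivalent dict in memory instead of loading the file. The helper name `encode_categoricals` is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for joblib.load('label_encoders.pkl'); assumed to be a dict
# of {column name: fitted LabelEncoder}
train = pd.DataFrame({
    "Airline": ["AA", "DL", "UA"],
    "Origin": ["ATL", "ORD", "DFW"],
})
label_encoders = {col: LabelEncoder().fit(train[col]) for col in train.columns}


def encode_categoricals(df: pd.DataFrame, encoders: dict) -> pd.DataFrame:
    """Apply fitted encoders and fail clearly on categories unseen in training."""
    out = df.copy()
    for col, enc in encoders.items():
        unseen = set(out[col]) - set(enc.classes_)
        if unseen:
            raise ValueError(f"Unseen values in {col!r}: {sorted(unseen)}")
        out[col] = enc.transform(out[col])
    return out


new_flights = pd.DataFrame({"Airline": ["UA", "AA"], "Origin": ["ORD", "ATL"]})
encoded = encode_categoricals(new_flights, label_encoders)
print(encoded)
```

`LabelEncoder.transform` raises on unknown labels anyway; the explicit check only makes the error message name the offending column.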

Project Workflow

1. Data Preparation

  • ✓ Drop columns with >40% missing values
  • ✓ Handle remaining missing values (median for numeric, mode for categorical)
  • ✓ Encode categorical features (Airline, Origin, Dest) using LabelEncoder
  • ✓ Convert FlightDate to temporal features (weekday, month, day, quarter)
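
The missing-value rules above can be sketched as follows. This is an illustrative version rather than the notebook's exact code; the function name `clean_missing` and the example columns are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_missing(df: pd.DataFrame, max_missing: float = 0.40) -> pd.DataFrame:
    """Drop sparse columns, then impute the remaining gaps."""
    # Keep only columns whose share of missing values is at most the threshold
    keep = df.columns[df.isna().mean() <= max_missing]
    out = df[keep].copy()
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                # Numeric columns: fill with the median
                out[col] = out[col].fillna(out[col].median())
            else:
                # Categorical columns: fill with the most frequent value
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


raw = pd.DataFrame({
    "Distance": [500.0, np.nan, 1500.0, 1000.0, 700.0],
    "Airline": ["AA", "AA", None, "DL", "AA"],
    "MostlyEmpty": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing
})
clean = clean_missing(raw)
print(clean)
```

In a real train/test workflow the medians and modes would normally be computed on the training split only and then reused on the test split, to avoid leaking test information into preprocessing.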

2. Feature Engineering

  • Date Features: Weekday, Month, Day of Month, Quarter
  • Encoded Features: Airline, Origin Airport, Destination Airport
  • Numeric Features: Distance, Departure Time, Arrival Time, Delays
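
The date features listed above can be derived with the pandas `.dt` accessor. A minimal sketch, assuming the date column is named `FlightDate` as in the Data Preparation step; the output column names are illustrative.

```python
import pandas as pd


def add_date_features(df: pd.DataFrame, date_col: str = "FlightDate") -> pd.DataFrame:
    """Replace a date column with weekday, month, day, and quarter features."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["Weekday"] = dates.dt.weekday   # Monday=0 ... Sunday=6
    out["Month"] = dates.dt.month
    out["DayOfMonth"] = dates.dt.day
    out["Quarter"] = dates.dt.quarter
    return out.drop(columns=[date_col])


flights = pd.DataFrame({"FlightDate": ["2023-01-01", "2023-07-04"]})
features = add_date_features(flights)
print(features)
```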

3. Model Building

  • Train/Test Split: 80/20 ratio with stratification
  • Models Trained:
    • Logistic Regression (with StandardScaler)
    • Random Forest Classifier (100 trees, max_depth=10)
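
The split and model settings above translate to scikit-learn roughly as shown below. The data is synthetic (`make_classification`), and the Logistic Regression is wrapped in a `Pipeline` for brevity, whereas this repository stores the scaler as a separate `scaler.pkl`; the `random_state` and `max_iter` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the preprocessed flight features
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)

# 80/20 split; stratify=y keeps the delayed/on-time ratio equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Logistic Regression on standardized features
lr_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_model.fit(X_train, y_train)

# Random Forest with the hyperparameters listed in this README
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)

print("LR accuracy:", lr_model.score(X_test, y_test))
print("RF accuracy:", rf_model.score(X_test, y_test))
```

Fitting the scaler inside the pipeline means it only ever sees training data, which avoids leaking test-set statistics into the model.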

4. Model Evaluation

  • Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC
  • Visualizations:
    • Confusion Matrix (both models)
    • ROC Curve comparison
    • Feature Importance (Random Forest)
    • Performance comparison charts
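
The metrics listed above are all available in `sklearn.metrics`. The sketch below computes them on a small hand-made example so the numbers can be checked by hand; the helper name `evaluate` and the 0.5 decision threshold are assumptions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the README's metrics from true labels and predicted probabilities."""
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # ROC-AUC uses the probabilities, not the thresholded labels
        "roc_auc": roc_auc_score(y_true, y_prob),
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }


# Hand-made example: 4 on-time flights (0) and 4 delayed flights (1)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.2, 0.8, 0.7, 0.3, 0.9]
report = evaluate(y_true, y_prob)
print(report)
```

With a fitted classifier, `y_prob` would typically come from `model.predict_proba(X_test)[:, 1]`.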

Model Performance

Logistic Regression

  • Accuracy: ~82-85%
  • Precision: ~80-83%
  • Recall: ~78-82%
  • ROC-AUC: ~0.85-0.88

Random Forest

  • Accuracy: ~85-88%
  • Precision: ~83-86%
  • Recall: ~82-85%
  • ROC-AUC: ~0.88-0.91

Note: These figures are approximate ranges. Exact performance depends on the dataset used.

Key Features

Data Preprocessing

  • Automated missing value handling
  • Categorical encoding (LabelEncoder)
  • Date feature extraction
  • Feature scaling for logistic regression

Visualizations

  • Target distribution plots
  • Correlation heatmaps
  • Delay patterns by weekday/month
  • Confusion matrices
  • ROC curves
  • Feature importance charts

Model Insights

  • Random Forest typically outperforms Logistic Regression
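  • Key Features: Weather delays, carrier delays, departure delays, month, and weekday. The delay columns are known only once a flight has departed or arrived, so they are not available when a delay is predicted in advance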
  • Business Value: Predict delays in advance to improve customer communication
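
A feature ranking like the one above can be read from a fitted Random Forest through its `feature_importances_` attribute. The sketch below uses synthetic data with one informative column and two noise columns; the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=500),    # the only column that drives the target
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y = (X["signal"] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
rf.fit(X, y)

# Impurity-based importances sum to 1; sort so the strongest feature comes first
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.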

Business Applications

  1. Proactive Notifications: Alert passengers about potential delays before they arrive at the airport
  2. Resource Optimization: Allocate staff and gates efficiently for delayed flights
  3. Customer Satisfaction: Reduce passenger frustration through better communication
  4. Operational Planning: Identify high-risk routes and time periods for delays

Deliverables

  • flight_delay_prediction.ipynb: Complete Jupyter notebook with analysis
  • model_logistic.pkl: Trained Logistic Regression model
  • model_random_forest.pkl: Trained Random Forest model
  • model_evaluation_report.md: Comprehensive evaluation report
  • scaler.pkl, label_encoders.pkl, feature_names.pkl: Preprocessing artifacts
  • Complete documentation and visualizations

Future Enhancements

  • Real-time weather API integration
  • Deep learning models (LSTM for time series)
  • Multi-class classification (short, medium, long delays)
  • Web dashboard for predictions
  • A/B testing framework for model updates
  • Model retraining pipeline

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Author

Built as part of Month 5 Machine Learning Project
Project Title: Flight Delay Prediction Using Machine Learning
Client: Airline Company

License

MIT License


Project Status: ✅ Completed

For questions or feedback, please open an issue in the repository.
