# **Airline-On-Time-Performance-and-Delay-Prediction**

## Problem Statement
---

### Project Objective:
Develop a machine learning model to predict airline delays and analyze factors affecting on-time performance using historical data. By exploring patterns and correlations in flight records, this model will forecast both the likelihood and duration of delays. The ultimate goal is to provide airlines with actionable insights to optimize schedules, enhance operational efficiency, and improve customer satisfaction.

### Keywords:
Airline delays, flight performance prediction, machine learning, delay analysis, predictive modeling, classification, regression, data preprocessing, feature engineering, time series analysis

### Research Questions to Address:
1. What factors influence flight delays (e.g., weather, time of day, distance)?
2. How can we predict if a flight will be delayed or on time?
3. What is the expected delay duration for delayed flights?
4. How accurate are machine learning models in predicting flight delays?
5. What trends can we identify in historical delay data?
6. How can flight schedules be optimized to minimize delays?

### Dataset Information:
- **Source**: U.S. Bureau of Transportation Statistics ([BTS](https://www.transtats.bts.gov)) Airline On-Time Performance Data. The dataset can be downloaded from the [BTS delay cause page](https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp).
- **Features**: The dataset includes flight details such as year, month, carrier, airport, and airport_name. It also contains arrival-related data and causes of delay. Additional columns include arr_cancelled, arr_diverted, arr_delay, and various delay times (carrier_delay, weather_delay, etc.).
- **Size**: Hundreds of thousands of records spanning from `June 2003` to `December 2024`.

### Evaluation Metrics:
- **For Classification (Delay Prediction)**: 
  - Accuracy, Precision, Recall, F1-Score, AUC-ROC.
- **For Regression (Delay Duration)**: 
  - MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), R² (R-Squared).
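
The metrics listed above can be computed with `sklearn.metrics`. The following is a minimal sketch on small made-up arrays, not on the project's data or model output: the labels, predicted probabilities, durations, and the 0.5 decision threshold are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    mean_absolute_error, mean_squared_error, r2_score,
)

# Hypothetical classification data: 1 = delayed, 0 = on time
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])  # predicted P(delayed)
y_pred = (y_prob >= 0.5).astype(int)  # assumed 0.5 decision threshold

classification_scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # AUC-ROC is computed from scores/probabilities, not from hard labels
    "auc_roc": roc_auc_score(y_true, y_prob),
}

# Hypothetical regression data: delay durations in minutes
dur_true = np.array([30.0, 45.0, 60.0, 20.0])
dur_pred = np.array([25.0, 50.0, 55.0, 30.0])

regression_scores = {
    "mae": mean_absolute_error(dur_true, dur_pred),
    # RMSE is the square root of the mean squared error
    "rmse": float(np.sqrt(mean_squared_error(dur_true, dur_pred))),
    "r2": r2_score(dur_true, dur_pred),
}

print(classification_scores)
print(regression_scores)
```

Note that AUC-ROC takes the predicted probabilities while the other classification metrics take thresholded labels, so a classifier used for this evaluation would need to expose scores (for example through `predict_proba`).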

### Success Criteria:
- High prediction accuracy (e.g., 85%+ for classification).
- Interpretability of the model (clear insights on delay factors).
- Real-world applicability (model generalizes well on new data).
- Actionable insights for airlines (e.g., delay-prone routes, airports).


## Import Libraries

In [None]:
# Importing basic libraries for data manipulation
import pandas as pd
import numpy as np

# Importing libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Importing libraries for machine learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, roc_auc_score,
    mean_absolute_error, mean_squared_error, r2_score,
)

# For handling warnings (if needed)
import warnings
warnings.filterwarnings('ignore')

# Optional: Importing libraries for handling missing data (if needed)
from sklearn.impute import SimpleImputer

# For time-based features (if necessary)
import datetime as dt


## About Dataset

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Monthly summaries of flight performance (the number of on-time, delayed, canceled, and diverted flights) are provided in DOT's **Air Travel Consumer Report**, published approximately 30 days after each month ends. BTS also releases summary tables on its website. Since June 2003, BTS has additionally collected data on the causes of flight delays. Both summary statistics and raw data are made publicly available when the Air Travel Consumer Report is released.

### Dataset Columns:

- **year**: Year in YYYY format.
- **month**: Month as a number (1-12).
- **carrier**: Code assigned by the U.S. DOT to uniquely identify an airline carrier.
- **carrier_name**: The full name of the airline carrier, defined as one holding and reporting under the same DOT certificate, regardless of its code, name, or parent company.
- **airport**: A three-character alphanumeric code issued by the U.S. DOT, designating the airport.
- **airport_name**: The official name of an airport, where aircraft operate, typically with paved runways, maintenance facilities, and terminals.
- **arr_flights**: Total number of arriving flights.
- **arr_del15**: Number of arriving flights delayed by 15 minutes or more. A flight is counted as delayed when its actual arrival time is at least 15 minutes later than its scheduled time.
- **carrier_ct**: Number of delays caused by the airline (carrier).
- **weather_ct**: Number of delays caused by weather-related factors.
- **nas_ct**: Number of delays caused by the National Aviation System (NAS).
- **security_ct**: Number of delays caused by security-related issues.
- **late_aircraft_ct**: Number of delays caused by late-arriving aircraft.
- **arr_cancelled**: Number of canceled flights.
- **arr_diverted**: Number of diverted flights.
- **arr_delay**: Total arrival delay, in minutes, summed over the delayed flights.
- **carrier_delay**: The delay, in minutes, caused by the airline carrier.
- **weather_delay**: The delay, in minutes, caused by weather conditions.
- **nas_delay**: The delay, in minutes, caused by the National Aviation System.
- **security_delay**: The delay, in minutes, caused by security-related issues.
- **late_aircraft_delay**: The delay, in minutes, caused by late-arriving aircraft.
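
As an illustration of how these columns might feed the classification and regression tasks, the sketch below derives a delay rate, an average delay, and a binary label from a tiny made-up table. It assumes each row aggregates one carrier at one airport for one month, so that `arr_del15` and `arr_delay` are totals. The values, the derived column names (`delay_rate`, `avg_delay_min`, `high_delay`), and the 20% threshold are hypothetical and not part of the dataset.

```python
import pandas as pd

# Made-up rows using the column names described above (not real BTS values)
sample = pd.DataFrame({
    "year": [2024, 2024, 2024],
    "month": [1, 1, 2],
    "carrier": ["AA", "DL", "AA"],
    "airport": ["JFK", "ATL", "JFK"],
    "arr_flights": [100, 200, 80],   # arriving flights in the row
    "arr_del15": [25, 20, 8],        # flights delayed 15+ minutes
    "arr_delay": [1500, 900, 320],   # total delay minutes
})

# Share of arrivals that were delayed 15+ minutes
sample["delay_rate"] = sample["arr_del15"] / sample["arr_flights"]

# Average delay in minutes per delayed flight (a possible regression target)
sample["avg_delay_min"] = sample["arr_delay"] / sample["arr_del15"]

# Hypothetical binary label: 1 if more than 20% of arrivals were delayed
sample["high_delay"] = (sample["delay_rate"] > 0.20).astype(int)

print(sample[["carrier", "airport", "delay_rate", "avg_delay_min", "high_delay"]])
```

In real data, rows with no flights or no delayed flights would make these ratios divide by zero, so they would need to be filtered or filled before modeling.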

In [None]:
url = 'https://raw.githubusercontent.com/GopinathAchuthan/Airline-On-Time-Performance-and-Delay-Prediction/refs/heads/main/Dataset/Airline_Delay_Cause.csv'