# ✈️ Flight Delay Prediction – Problem Statement
Flight delays cause inconvenience to passengers and financial losses to airlines and airports. Delays can occur due to multiple factors such as weather conditions, flight distance, previous delays, passenger volume, and operational constraints.

The objective of this project is to build a machine learning model that predicts whether a flight will be delayed or arrive on time based on historical flight, weather, and operational data.

The model takes inputs such as:

- Flight distance and timing
- Weather conditions (temperature, wind speed, humidity)
- Airline and airport information
- Day type (weekday/weekend)
- Previous delay history and passenger count
  
Using these features, the system classifies flights into:

- Delayed (1)
- On Time (0)
  
The trained model is deployed using Streamlit, allowing users to input flight details and instantly receive delay predictions through an interactive web interface. This solution can help airlines improve operational planning and assist passengers in making informed travel decisions.

In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [70]:
df = pd.read_csv("flight_delay_dataset.csv")

df.head()

,Flight_Distance_km,Departure_Hour,Arrival_Hour,Temperature_C,Wind_Speed_kmph,Humidity_Percent,Previous_Delays,Passenger_Count,Airline,Weather_Condition,Day_Type,Airport,Flight_Type,Delayed
0,1060,12,13,30,22,19,0,276,Vistara,Rain,Weekend,BOM,Domestic,0
1,1494,19,3,32,11,83,4,141,AirIndia,Clear,Weekday,BLR,International,1
2,1330,16,0,10,39,27,4,87,AirIndia,Clear,Weekday,DEL,Domestic,1
3,1295,8,23,15,2,33,1,176,Vistara,Rain,Weekend,DEL,Domestic,0
4,1838,3,3,22,1,10,3,107,Vistara,Rain,Weekday,BOM,Domestic,1


In [121]:
df.shape

(5000, 14)

In [72]:
df.dtypes

Flight_Distance_km     int64
Departure_Hour         int64
Arrival_Hour           int64
Temperature_C          int64
Wind_Speed_kmph        int64
Humidity_Percent       int64
Previous_Delays        int64
Passenger_Count        int64
Airline               object
Weather_Condition     object
Day_Type              object
Airport               object
Flight_Type           object
Delayed                int64
dtype: object

In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Flight_Distance_km  5000 non-null   int64 
 1   Departure_Hour      5000 non-null   int64 
 2   Arrival_Hour        5000 non-null   int64 
 3   Temperature_C       5000 non-null   int64 
 4   Wind_Speed_kmph     5000 non-null   int64 
 5   Humidity_Percent    5000 non-null   int64 
 6   Previous_Delays     5000 non-null   int64 
 7   Passenger_Count     5000 non-null   int64 
 8   Airline             5000 non-null   object
 9   Weather_Condition   5000 non-null   object
 10  Day_Type            5000 non-null   object
 11  Airport             5000 non-null   object
 12  Flight_Type         5000 non-null   object
 13  Delayed             5000 non-null   int64 
dtypes: int64(9), object(5)
memory usage: 547.0+ KB


In [76]:
df.describe()

,Flight_Distance_km,Departure_Hour,Arrival_Hour,Temperature_C,Wind_Speed_kmph,Humidity_Percent,Previous_Delays,Passenger_Count,Delayed
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,1607.191,11.6132,11.685,19.2836,19.4424,54.0362,2.0412,174.173,0.863
std,804.072402,6.888445,6.90163,14.49317,11.557413,26.076926,1.433288,72.038158,0.343882
min,200.0,0.0,0.0,-5.0,0.0,10.0,0.0,50.0,0.0
25%,912.5,6.0,6.0,7.0,9.0,31.0,1.0,113.0,1.0
50%,1608.0,12.0,12.0,19.0,19.0,54.0,2.0,173.0,1.0
75%,2294.0,17.0,18.0,32.0,30.0,77.0,3.0,236.0,1.0
max,2999.0,23.0,23.0,44.0,39.0,99.0,4.0,299.0,1.0
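
The mean of 0.863 for Delayed in the summary above implies that about 86% of the 5,000 flights are labelled delayed (4,315 delayed, 685 on time). A minimal sketch of checking the class balance explicitly follows. The Series is rebuilt from those counts as a stand-in for `df["Delayed"]`, so it is an illustration rather than one of the notebook's own cells.

```python
import pandas as pd

# Stand-in for df["Delayed"], rebuilt from describe(): mean 0.863 over 5,000 rows.
delayed = pd.Series([1] * 4315 + [0] * 685, name="Delayed")

counts = delayed.value_counts()                 # absolute count per class
shares = delayed.value_counts(normalize=True)   # fraction per class

print(counts)
print(shares.round(3))
```

In the notebook itself, the same two calls on `df["Delayed"]` would give the real distribution.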


In [78]:
df.columns

Index(['Flight_Distance_km', 'Departure_Hour', 'Arrival_Hour', 'Temperature_C',
       'Wind_Speed_kmph', 'Humidity_Percent', 'Previous_Delays',
       'Passenger_Count', 'Airline', 'Weather_Condition', 'Day_Type',
       'Airport', 'Flight_Type', 'Delayed'],
      dtype='object')

In [80]:
df.isnull().sum()

Flight_Distance_km    0
Departure_Hour        0
Arrival_Hour          0
Temperature_C         0
Wind_Speed_kmph       0
Humidity_Percent      0
Previous_Delays       0
Passenger_Count       0
Airline               0
Weather_Condition     0
Day_Type              0
Airport               0
Flight_Type           0
Delayed               0
dtype: int64

In [82]:
df.duplicated().sum()

0

In [84]:
X = df.drop("Delayed", axis=1)
y = df["Delayed"]


In [109]:
X

,Flight_Distance_km,Departure_Hour,Arrival_Hour,Temperature_C,Wind_Speed_kmph,Humidity_Percent,Previous_Delays,Passenger_Count,Airline,Weather_Condition,Day_Type,Airport,Flight_Type
0,1060,12,13,30,22,19,0,276,Vistara,Rain,Weekend,BOM,Domestic
1,1494,19,3,32,11,83,4,141,AirIndia,Clear,Weekday,BLR,International
2,1330,16,0,10,39,27,4,87,AirIndia,Clear,Weekday,DEL,Domestic
3,1295,8,23,15,2,33,1,176,Vistara,Rain,Weekend,DEL,Domestic
4,1838,3,3,22,1,10,3,107,Vistara,Rain,Weekday,BOM,Domestic
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,1251,22,0,38,34,10,4,195,AirIndia,Rain,Weekend,DEL,Domestic
4996,1773,12,13,10,9,64,0,123,Indigo,Storm,Weekend,BOM,International
4997,492,14,5,21,11,24,4,233,Indigo,Rain,Weekday,BOM,Domestic
4998,2424,10,9,5,17,94,4,192,Indigo,Storm,Weekend,DEL,International


In [111]:
y

0       0
1       1
2       1
3       0
4       1
       ..
4995    1
4996    1
4997    1
4998    1
4999    1
Name: Delayed, Length: 5000, dtype: int64

In [86]:

# Numeric columns are scaled; categorical columns are encoded.
num_cols = ["Flight_Distance_km", "Departure_Hour", "Arrival_Hour", "Temperature_C",
            "Wind_Speed_kmph", "Humidity_Percent", "Previous_Delays", "Passenger_Count"]

cat_cols = ["Airline", "Weather_Condition", "Day_Type", "Airport", "Flight_Type"]


In [88]:

from sklearn.model_selection import train_test_split

# Hold out 20% for testing; stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [113]:
X_train.shape

(4000, 13)

In [115]:
X_test.shape

(1000, 13)

In [117]:
y_train.shape

(4000,)

In [119]:
y_test.shape

(1000,)
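
Passing `stratify=y` matters here because the classes are imbalanced. The sketch below illustrates the effect, assuming stand-in labels with the dataset's class counts (4,315 delayed, 685 on time) and a placeholder feature column. The names `labels`, `features`, `y_tr` and `y_te` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the dataset's class counts; not the real rows.
labels = np.array([1] * 4315 + [0] * 685)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratification, both splits keep roughly the same share of delayed flights.
print(len(y_tr), round(y_tr.mean(), 3))
print(len(y_te), round(y_te.mean(), 3))
```

This is consistent with the classification report later in the notebook, where the test set holds 863 delayed and 137 on-time flights.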

In [123]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

# Scale numeric columns and integer-encode categorical ones.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), num_cols),
    ("cat", OrdinalEncoder(), cat_cols),
])


In [125]:
preprocessor

In [92]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=5, random_state=42)


In [93]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[("preprocess", preprocessor), ("model", model)])
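
The hyperparameters in this notebook are fixed by hand. If a search were wanted, one possible approach is `GridSearchCV` over the pipeline. The sketch below is an assumption about how that could look, not something the notebook runs: it uses synthetic data from `make_classification` and a small illustrative grid, and `search_pipeline`, `X_demo` and `y_demo` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook this would be X_train, y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

search_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),
])

# The "model__" prefix routes each value to the pipeline step named "model".
param_grid = {
    "model__n_estimators": [50, 100],
    "model__learning_rate": [0.05, 0.1],
    "model__max_depth": [3, 5],
}

# 3-fold cross-validation; balanced accuracy is less flattering on imbalanced labels.
search = GridSearchCV(search_pipeline, param_grid, cv=3, scoring="balanced_accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Searching only on the training split keeps the test set untouched for the final evaluation.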


In [96]:

pipeline.fit(X_train, y_train)




In [97]:

y_pred = pipeline.predict(X_test)



In [107]:

from sklearn.metrics import accuracy_score, classification_report

print("\nAccuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))


Accuracy: 1.0

Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       137
           1       1.00      1.00      1.00       863

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000



# 🤖 Machine Learning Model

This project uses supervised machine learning to solve a binary classification problem, where the goal is to predict whether a flight will be delayed (1) or on time (0).

The dataset has 5,000 rows with eight numerical features (flight distance, departure and arrival hour, temperature, wind speed, humidity, previous delays, and passenger count) and five categorical features (airline, weather condition, day type, airport, and flight type).

🔹 Data Preprocessing

- Numerical features are standardized using StandardScaler
- Categorical features are integer-encoded using OrdinalEncoder
- Both transformations are combined in a ColumnTransformer
- Data is split 80/20 into training and testing sets (4,000 and 1,000 rows) using stratified sampling
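
To make the two transformations concrete, here is a small sketch on three toy rows. The values are hypothetical, and only two of the dataset's column names are used. The `handle_unknown` setting is an addition for illustration and is not in the notebook's preprocessor: it maps categories unseen during fitting to -1 instead of raising an error, which can matter once users type in new inputs.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy rows (hypothetical values) using two of the dataset's column names.
toy = pd.DataFrame({
    "Flight_Distance_km": [500, 1500, 2500],
    "Weather_Condition": ["Rain", "Clear", "Storm"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Flight_Distance_km"]),
    # Illustrative addition: unseen categories are encoded as -1.
    ("cat",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["Weather_Condition"]),
])

# Column 0: distance with zero mean and unit variance.
# Column 1: category codes in alphabetical order (Clear=0, Rain=1, Storm=2).
encoded = toy_preprocessor.fit_transform(toy)
print(encoded)

# A category that was not present during fitting.
unseen = pd.DataFrame({"Flight_Distance_km": [1500], "Weather_Condition": ["Fog"]})
unseen_encoded = toy_preprocessor.transform(unseen)
print(unseen_encoded)
```

Ordinal codes impose an arbitrary order on the categories. Tree-based models such as gradient boosting can usually split around that, whereas a linear model would more often call for one-hot encoding.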
  
🔹 Model Selection

A Gradient Boosting Classifier is used due to its ability to:

- Handle non-linear relationships
- Capture feature interactions
- Provide high accuracy on structured tabular data
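
As a toy illustration of the feature-interaction point, the sketch below uses synthetic XOR-style data (not the flight dataset) where the label depends only on whether two features have opposite signs. A linear model has no way to express that rule, while boosted trees can. All names with the `_toy` suffix are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the label is 1 only when the two features have opposite signs.
rng = np.random.default_rng(42)
X_toy = rng.uniform(-1, 1, size=(2000, 2))
y_toy = ((X_toy[:, 0] > 0) ^ (X_toy[:, 1] > 0)).astype(int)

X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# A linear decision boundary cannot separate an XOR pattern.
linear_acc = LogisticRegression().fit(X_tr_toy, y_tr_toy).score(X_te_toy, y_te_toy)

# Same hyperparameters as the notebook's model.
boosted = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=5, random_state=42
)
boosted_acc = boosted.fit(X_tr_toy, y_tr_toy).score(X_te_toy, y_te_toy)

print("logistic regression:", round(linear_acc, 3))
print("gradient boosting:  ", round(boosted_acc, 3))
```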
  
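🔹 Model Training & Evaluation

- Preprocessing and the classifier are combined in a scikit-learn Pipeline and fitted on the training set
- Hyperparameters are set by hand (200 estimators, learning rate 0.05, maximum tree depth 5); the notebook does not run a hyperparameter search
- On the 1,000-flight test set (137 on time, 863 delayed), the model scores 1.00 on accuracy, precision, recall, and F1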
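
Because 863 of the 1,000 test flights are labelled delayed, accuracy is easier to judge against a trivial baseline. The sketch below is an illustration and not part of the original notebook: it rebuilds labels with the test set's class counts from the classification report and scores a classifier that always predicts the majority class. `y_true_like` and `X_placeholder` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Labels with the same class counts as the test set (137 on time, 863 delayed).
y_true_like = np.array([0] * 137 + [1] * 863)
X_placeholder = np.zeros((len(y_true_like), 1))  # features are ignored by the baseline

# Always predicts the most frequent class, here "delayed" (1).
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true_like)
y_baseline = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_true_like, y_baseline)
baseline_balanced = balanced_accuracy_score(y_true_like, y_baseline)

print("baseline accuracy:", baseline_accuracy)
print("baseline balanced accuracy:", baseline_balanced)
```

The baseline reaches 0.863 accuracy without learning anything, while its balanced accuracy is 0.5. Reporting per-class recall or balanced accuracy alongside accuracy gives a clearer picture on data this imbalanced.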
  
🔹 Deployment

The trained model is deployed using Streamlit, enabling users to input flight details through a web interface and receive real-time predictions.
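
The notebook does not include the app code, so the following is only a sketch of what such a Streamlit app might look like. It makes three assumptions. First, the fitted pipeline was saved with `joblib.dump(pipeline, "flight_delay_pipeline.joblib")`; that file name is hypothetical. Second, the category options are limited to the values visible in the notebook's data preview. Third, the numeric ranges follow the min and max values from `df.describe()`. The helper names `build_input_row` and `run_app` are also hypothetical.

```python
import pandas as pd

# Column order used at training time (df.columns without the target).
FEATURE_COLUMNS = [
    "Flight_Distance_km", "Departure_Hour", "Arrival_Hour", "Temperature_C",
    "Wind_Speed_kmph", "Humidity_Percent", "Previous_Delays", "Passenger_Count",
    "Airline", "Weather_Condition", "Day_Type", "Airport", "Flight_Type",
]

# Assumption: only the category values visible in the data preview are listed.
AIRLINES = ["AirIndia", "Indigo", "Vistara"]
WEATHER = ["Clear", "Rain", "Storm"]
DAY_TYPES = ["Weekday", "Weekend"]
AIRPORTS = ["BLR", "BOM", "DEL"]
FLIGHT_TYPES = ["Domestic", "International"]


def build_input_row(**values):
    """Return a one-row DataFrame with the columns in training order."""
    missing = [col for col in FEATURE_COLUMNS if col not in values]
    if missing:
        raise ValueError(f"Missing inputs: {missing}")
    return pd.DataFrame([{col: values[col] for col in FEATURE_COLUMNS}])


def run_app(model_path="flight_delay_pipeline.joblib"):  # hypothetical file name
    # Imported here so the helper above can be used without Streamlit installed.
    import joblib
    import streamlit as st

    pipeline = joblib.load(model_path)
    st.title("Flight Delay Prediction")

    values = {
        "Flight_Distance_km": st.number_input("Flight distance (km)", min_value=200, max_value=2999, value=1000),
        "Departure_Hour": st.slider("Departure hour", 0, 23, 12),
        "Arrival_Hour": st.slider("Arrival hour", 0, 23, 14),
        "Temperature_C": st.number_input("Temperature (°C)", min_value=-5, max_value=44, value=20),
        "Wind_Speed_kmph": st.number_input("Wind speed (km/h)", min_value=0, max_value=39, value=10),
        "Humidity_Percent": st.slider("Humidity (%)", 10, 99, 50),
        "Previous_Delays": st.slider("Previous delays", 0, 4, 0),
        "Passenger_Count": st.number_input("Passenger count", min_value=50, max_value=299, value=150),
        "Airline": st.selectbox("Airline", AIRLINES),
        "Weather_Condition": st.selectbox("Weather condition", WEATHER),
        "Day_Type": st.selectbox("Day type", DAY_TYPES),
        "Airport": st.selectbox("Airport", AIRPORTS),
        "Flight_Type": st.selectbox("Flight type", FLIGHT_TYPES),
    }

    if st.button("Predict"):
        prediction = int(pipeline.predict(build_input_row(**values))[0])
        st.write("Delayed" if prediction == 1 else "On Time")
```

To try it, save the code as a script (for example `app.py`), add a call to `run_app()` at the bottom, and launch it with `streamlit run app.py`. Keeping `build_input_row` separate from the UI makes the column ordering easy to check on its own.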