# Flight Delay Prediction

This notebook processes flight data from 2024 to predict flight delays.

**Data:** The dataset contains information about various flights, including:
*   Flight dates (year, month, day of week)
*   Carrier information (unique carrier, flight number)
*   Origin and destination airports
*   Departure and arrival delays
*   Flight distance
*   Cancellation and diversion status

**Prediction Goal:** The primary objective is to predict whether a flight will be 'Delayed'. A flight is classified as 'Delayed' if its arrival delay (`arr_delay`) is greater than 15 minutes. This is a binary classification problem.
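
To make the threshold concrete, the following minimal sketch applies the same rule to a few made-up delay values (the numbers are illustrative, not taken from the dataset). A delay of exactly 15 minutes is not counted as delayed, and a missing `arr_delay` compares as False and therefore becomes 0.

```python
import numpy as np
import pandas as pd

# Hypothetical arrival delays in minutes; NaN stands for a missing value
arr_delay = pd.Series([-3.0, 15.0, 16.0, np.nan], name="arr_delay")

# Same rule as the notebook: strictly greater than 15 minutes
delayed = (arr_delay > 15).astype(int)

print(delayed.tolist())  # prints [0, 0, 1, 0]
```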

### Data Loading and Target Creation

This section handles the initial loading of the flight data from a CSV file. It performs the following steps:

1.  **Load Dataset**: Reads `flight_data_2024.csv` into a pandas DataFrame, falling back to `flight_data_2024_sample.csv` if the full file is not present.
2.  **Create Target Variable**: A new column named `Delayed` is created. A flight is considered 'Delayed' if its arrival delay (`arr_delay`) is greater than 15 minutes. This is a binary classification target (1 for delayed, 0 for not delayed).
3.  **Select Relevant Columns**: Keeps only the columns needed for analysis and model training.
4.  **Save to Database**: The processed DataFrame is then saved into a SQLite database named `flights2024.db` as a table called `flights2024`. This step ensures that the data is persistently stored and can be easily queried later.

Loading the CSV file and creating a target

In [3]:
from pathlib import Path
import pandas as pd
from sqlalchemy import create_engine

# Robust CSV loading: prefer full dataset, fall back to sample, else instruct user
csv_path = Path("../data/flight_data_2024.csv")
if not csv_path.exists():
    sample_path = csv_path.with_name("flight_data_2024_sample.csv")
    if sample_path.exists():
        print("Full dataset not found — using sample dataset at: {}".format(sample_path))
        csv_path = sample_path
    else:
        raise FileNotFoundError("Dataset not found. Please run 'dvc pull' or place 'flight_data_2024.csv' in the project 'data/' folder.")

# Load dataset (either full or sample)
df = pd.read_csv(csv_path)

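# Binary target: 1 if arrival delay exceeds 15 minutes, else 0
# (a missing arr_delay, e.g. for a cancelled flight, compares False and becomes 0)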
df["Delayed"] = (df["arr_delay"] > 15).astype(int)

# Keep relevant columns
df = df[[
    "year", "month", "day_of_week", "fl_date",
    "op_unique_carrier", "op_carrier_fl_num",
    "origin", "dest", "dep_delay", "arr_delay",
    "distance", "cancelled", "diverted", "Delayed"
]]



In [None]:

# Save to DB
engine = create_engine("sqlite:///flights2024.db")
df.to_sql("flights2024", con=engine, if_exists="replace", index=False)

print("✅ flights2024 table created successfully")
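
As a quick sanity check after the write, the table can be read back and its row count and class balance inspected. The sketch below is self-contained, so it uses an in-memory SQLite database and a tiny made-up DataFrame in place of `flights2024.db` and the real data. It also uses a standard-library `sqlite3` connection rather than the SQLAlchemy engine, which pandas accepts for SQLite.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the processed DataFrame (values are made up)
toy = pd.DataFrame({
    "origin": ["JFK", "LAX", "ORD", "ATL"],
    "arr_delay": [5.0, 42.0, 16.0, -2.0],
})
toy["Delayed"] = (toy["arr_delay"] > 15).astype(int)

# In-memory database instead of flights2024.db (assumption for this sketch)
conn = sqlite3.connect(":memory:")
toy.to_sql("flights2024", con=conn, if_exists="replace", index=False)

# Read back the row count and the number of delayed flights
check = pd.read_sql(
    "SELECT COUNT(*) AS n_rows, SUM(Delayed) AS n_delayed FROM flights2024",
    conn,
)
n_rows = int(check.loc[0, "n_rows"])
n_delayed = int(check.loc[0, "n_delayed"])
conn.close()

print(n_rows, n_delayed)  # prints 4 2
```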

### Data Querying and Preprocessing

This section prepares the data for model training by querying the necessary features from the database and applying preprocessing steps:

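1.  **Query Data**: Selects the model features from the `flights2024` table and excludes cancelled and diverted flights, since those flights generally lack a meaningful arrival delay and would make the `Delayed` label unreliable. The query still selects the `cancelled` and `diverted` columns, which are constant 0 after the filter and therefore add no information. Because `dep_delay` is among the features, predictions are only possible once a flight's departure delay is known.
2.  **Encode Categorical Features**: The categorical columns `op_unique_carrier`, `origin`, and `dest` are converted to integer codes with `LabelEncoder`, because the model requires numerical input. The codes follow the sorted order of the labels and carry no inherent meaning.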
3.  **Feature and Target Split**: The dataset is divided into features (`X`) and the target variable (`y`, which is 'Delayed').
4.  **Train/Test Split**: The data is further split into training and testing sets (`X_train`, `X_test`, `y_train`, `y_test`) using `train_test_split`. This allows the model to be trained on one portion of the data and evaluated on unseen data to assess its generalization performance. `stratify=y` ensures that the proportion of delayed vs. non-delayed flights is maintained in both sets.
5.  **Scale Features**: All columns of `X_train` and `X_test`, including the integer-encoded categorical ones, are standardized with `StandardScaler` (zero mean, unit variance). The scaler is fit on the training set only and then applied to the test set. Tree-based models such as XGBoost are largely insensitive to feature scaling, so this step mainly matters if a scale-sensitive model is substituted later.

Querying and preprocessing

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

query = """
SELECT year, month, day_of_week, dep_delay, distance,
       op_unique_carrier, origin, dest, cancelled, diverted, Delayed
FROM flights2024
WHERE cancelled = 0 AND diverted = 0
"""
df = pd.read_sql(query, engine)

# Encode categorical columns
for col in ["op_unique_carrier", "origin", "dest"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop("Delayed", axis=1)
y = df["Delayed"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale all features (scaler is fit on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
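
The cell above fits the encoders on the full dataset and scales every column. One alternative, sketched here under assumptions, is to fit all preprocessing on the training split only, using scikit-learn's `ColumnTransformer` with `OrdinalEncoder`; categories that appear only in the test split are mapped to -1. The toy data and the grouping of columns are illustrative and are not the notebook's method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data with column names borrowed from the query (values are made up)
toy = pd.DataFrame({
    "dep_delay": [0, 30, 5, 60, -2, 45, 10, 90],
    "distance": [300, 1200, 800, 2400, 500, 950, 700, 1800],
    "op_unique_carrier": ["AA", "DL", "AA", "UA", "DL", "UA", "AA", "DL"],
    "origin": ["JFK", "LAX", "ORD", "ATL", "JFK", "LAX", "ORD", "ATL"],
    "Delayed": [0, 1, 0, 1, 0, 1, 0, 1],
})
X = toy.drop(columns="Delayed")
y = toy["Delayed"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

categorical = ["op_unique_carrier", "origin"]
numeric = ["dep_delay", "distance"]

# Encode categoricals and scale numerics; unseen categories become -1
preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), categorical),
    ("num", StandardScaler(), numeric),
])

# Fit on the training split only, then apply the fitted transform to the test split
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)

print(X_train_t.shape, X_test_t.shape)  # prints (6, 4) (2, 4)
```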


### Model Training and Evaluation

This section involves training a machine learning model and assessing its performance:

1.  **Initialize XGBoost Classifier**: An `XGBClassifier` is initialized with `n_estimators=300`, `max_depth=6`, `learning_rate=0.05`, `random_state=42`, and `n_jobs=-1` (use all available CPU cores). XGBoost is a gradient-boosted decision tree library.
2.  **Train Model**: The model is trained using the preprocessed training data (`X_train`, `y_train`).
3.  **Make Predictions**: Once trained, the model makes predictions on the unseen test data (`X_test`).
4.  **Evaluate Performance**: The model's performance is evaluated using common classification metrics:
    *   **Accuracy Score**: The proportion of correctly classified instances.
    *   **Classification Report**: Provides a detailed breakdown of precision, recall, and f1-score for each class (delayed vs. not delayed), along with support (the number of actual occurrences of each class in the specified dataset).

Training the model

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


✅ Accuracy: 0.9326932050013854
              precision    recall  f1-score   support

           0       0.94      0.98      0.96   1112376
           1       0.92      0.73      0.81    280678

    accuracy                           0.93   1393054
   macro avg       0.93      0.86      0.89   1393054
weighted avg       0.93      0.93      0.93   1393054
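
Because only about 20% of the test flights are delayed, accuracy alone can flatter the model. The first part of the sketch below computes the majority-class baseline from the support counts in the report above. The second part shows, on synthetic data, how lowering the decision threshold on `predict_proba` trades precision for recall. It uses `LogisticRegression` as a lightweight stand-in for the XGBoost model, and the threshold of 0.3 is an arbitrary illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Part 1: accuracy of always predicting "not delayed", from the report's support
n_not_delayed, n_delayed = 1112376, 280678
baseline_acc = n_not_delayed / (n_not_delayed + n_delayed)
print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")

# Part 2: threshold adjustment on synthetic, imbalanced data (not flight data)
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# A lower threshold flags more flights as delayed, so recall cannot decrease
recall_default = recall_score(y_te, (proba >= 0.5).astype(int))
recall_lower = recall_score(y_te, (proba >= 0.3).astype(int))
print(recall_default, recall_lower)
```

The baseline of roughly 0.80 is the figure the 0.93 accuracy should be compared against, and the recall of 0.73 on delayed flights is the number a threshold change would most directly affect.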



### Save Predictions

This final section uses the trained model to make predictions on the entire dataset and saves the results:

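1.  **Predict on Full Dataset**: The trained model makes predictions (`Predicted`) on all queried rows (`X_full`, which is `X` transformed with the already-fitted scaler). About 80% of these rows were used for training, so agreement between `Delayed` and `Predicted` in this table will be optimistic compared with the test-set metrics.
2.  **Store Predictions**: The `Predicted` column is added to `df`, which at this point holds the queried rows with label-encoded (but unscaled) categorical columns.
3.  **Save to Database**: The DataFrame, now including the predictions, is saved to the SQLite database `flights2024.db` as a new table named `flight_preds_2024`. This allows the predictions to be queried alongside the features and the true label; note that the categorical columns are stored as integer codes.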

Save Predictions

In [None]:
# Predict on full dataset
X_full = scaler.transform(X)
df["Predicted"] = model.predict(X_full)

df.to_sql("flight_preds_2024", con=engine, if_exists="replace", index=False)

print("✅ Predictions saved to flight_preds_2024 table")

✅ Predictions saved to flight_preds_2024 table
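
Once the predictions are stored, they can be analyzed with plain SQL. As a sketch, the query below counts each (actual, predicted) pair, which amounts to a confusion matrix. To stay self-contained it builds a tiny made-up `flight_preds_2024` table in an in-memory SQLite database rather than opening `flights2024.db`.

```python
import sqlite3
import pandas as pd

# Made-up labels and predictions standing in for the real table
toy = pd.DataFrame({
    "Delayed":   [0, 0, 0, 1, 1, 1],
    "Predicted": [0, 0, 1, 1, 1, 0],
})

conn = sqlite3.connect(":memory:")
toy.to_sql("flight_preds_2024", con=conn, if_exists="replace", index=False)

# Count every (actual, predicted) combination
confusion = pd.read_sql(
    """
    SELECT Delayed, Predicted, COUNT(*) AS n
    FROM flight_preds_2024
    GROUP BY Delayed, Predicted
    ORDER BY Delayed, Predicted
    """,
    conn,
)
conn.close()

print(confusion)
```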
