# Manage Flight Data: Data Preparation, Modeling, and Export

This notebook demonstrates how to load, clean, model, and export data from a flight delay dataset. We will:
- Import required libraries
- Load and inspect the dataset
- Clean and preprocess the data
- Train a machine learning model to predict flight delays
- Save the trained model
- Extract and export unique airport information

In [1]:
# Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import joblib

In [5]:
# Load Flight Delay Dataset
df = pd.read_csv('data/flights.csv')
df.head()

,Year,Month,DayofMonth,DayOfWeek,Carrier,OriginAirportID,OriginAirportName,OriginCity,OriginState,DestAirportID,DestAirportName,DestCity,DestState,CRSDepTime,DepDelay,DepDel15,CRSArrTime,ArrDelay,ArrDel15,Cancelled
0,2013,9,16,1,DL,15304,Tampa International,Tampa,FL,12478,John F. Kennedy International,New York,NY,1539,4,0.0,1824,13,0,0
1,2013,9,23,1,WN,14122,Pittsburgh International,Pittsburgh,PA,13232,Chicago Midway International,Chicago,IL,710,3,0.0,740,22,1,0
2,2013,9,7,6,AS,14747,Seattle/Tacoma International,Seattle,WA,11278,Ronald Reagan Washington National,Washington,DC,810,-3,0.0,1614,-7,0,0
3,2013,7,22,1,OO,13930,Chicago O'Hare International,Chicago,IL,11042,Cleveland-Hopkins International,Cleveland,OH,804,35,1.0,1027,33,1,0
4,2013,5,16,4,DL,13931,Norfolk International,Norfolk,VA,10397,Hartsfield-Jackson Atlanta International,Atlanta,GA,545,-1,0.0,728,-9,0,0


In [6]:
# Identify and Replace Null Values
print('Columns with null values:')
print(df.isnull().sum())

# Replace every missing value with 0 (per the output, only DepDel15 has nulls)
df_filled = df.fillna(0)
print('Null values after filling:')
print(df_filled.isnull().sum())

Columns with null values:
Year                    0
Month                   0
DayofMonth              0
DayOfWeek               0
Carrier                 0
OriginAirportID         0
OriginAirportName       0
OriginCity              0
OriginState             0
DestAirportID           0
DestAirportName         0
DestCity                0
DestState               0
CRSDepTime              0
DepDelay                0
DepDel15             2761
CRSArrTime              0
ArrDelay                0
ArrDel15                0
Cancelled               0
dtype: int64
Null values after filling:
Year                 0
Month                0
DayofMonth           0
DayOfWeek            0
Carrier              0
OriginAirportID      0
OriginAirportName    0
OriginCity           0
OriginState          0
DestAirportID        0
DestAirportName      0
DestCity             0
DestState            0
CRSDepTime           0
DepDelay             0
DepDel15             0
CRSArrTime           0
ArrDelay             0
ArrDel15             0
Cancelled            0
dtype: int64
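
Filling every null with 0 marks all 2,761 missing `DepDel15` entries as "not delayed", whatever `DepDelay` says. A minimal sketch of a more targeted fill is shown below. It runs on made-up rows, and it assumes `DepDel15` flags departures delayed by 15 minutes or more; that threshold is an assumption, not something the notebook states.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mirrors the DepDelay / DepDel15 columns above.
sample = pd.DataFrame({
    "DepDelay": [4, 35, -3, 20, 0],
    "DepDel15": [0.0, 1.0, np.nan, np.nan, np.nan],
})

# Assumption: DepDel15 flags departures delayed by 15 minutes or more.
derived = (sample["DepDelay"] >= 15).astype(float)

# Fill only the missing flags; existing values are left untouched.
sample["DepDel15"] = sample["DepDel15"].fillna(derived)
print(sample)
```

If the missing flags in the real file all belong to rows where `DepDelay` is 0, this gives the same result as `fillna(0)`; the sketch only makes the rule explicit.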


In [8]:
# Feature Selection and Preprocessing
# Columns used: 'DayOfWeek' and 'OriginAirportID' (features), 'ArrDelay' (target)

# Create target: 1 if ArrDelay > 15, else 0
df_filled['DELAYED'] = (df_filled['ArrDelay'] > 15).astype(int)

# Select features; both columns are passed to the model as raw integers (no encoding is applied)
features = ['DayOfWeek', 'OriginAirportID']
X = df_filled[features]
y = df_filled['DELAYED']
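
The cell above feeds `OriginAirportID` to the model as a plain number, so a linear model treats airport 15304 as "larger" than 14122, which has no real meaning. One possible alternative, sketched here on made-up rows, is to one-hot encode both columns with `pd.get_dummies`. The variable names `sample` and `X_encoded` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up rows using the same column names as the feature cell above.
sample = pd.DataFrame({
    "DayOfWeek": [1, 1, 6, 4],
    "OriginAirportID": [15304, 14122, 14747, 15304],
})

# Treat both columns as categories and expand them into indicator columns.
X_encoded = pd.get_dummies(
    sample.astype({"DayOfWeek": "category", "OriginAirportID": "category"}),
    dtype=int,
)
print(X_encoded.columns.tolist())
```

A model trained on encoded columns needs the same encoding applied at prediction time, so whatever serves the saved model would have to reproduce these columns.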

In [9]:
# Train Machine Learning Model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate model
preds = model.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.80      1.00      0.89     43760
           1       0.00      0.00      0.00     10628

    accuracy                           0.80     54388
   macro avg       0.40      0.50      0.45     54388
weighted avg       0.65      0.80      0.72     54388



  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
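
The report above shows that the model never predicts class 1 (recall 0.00 across 10,628 delayed flights), so the 0.80 accuracy only reflects the share of on-time flights in the test set. The `_warn_prf` lines appear to be scikit-learn's warning that precision is undefined for a class that is never predicted. The sketch below uses synthetic data, not the flight dataset, to show two options that may help: `class_weight="balanced"` when fitting, and `zero_division=0` when reporting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic, imbalanced data (roughly 20% positives); not the flight dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 1.9).astype(int)

# class_weight="balanced" reweights samples inversely to class frequency.
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X, y)
preds = model.predict(X)

# zero_division=0 reports 0.0 without a warning if a class is never predicted.
print(classification_report(y, preds, zero_division=0))
```

Balanced weights usually trade some accuracy for recall on the minority class; whether that trade is worthwhile depends on how the predictions will be used.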


In [11]:
# Save Trained Model to File
joblib.dump(model, 'server/model.pkl')
print('Model saved to server/model.pkl')

Model saved to server/model.pkl
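
A saved model is only useful if it can be loaded and queried later. The following sketch round-trips a small stand-in model through `joblib.dump` and `joblib.load` and asks it for a delay probability. The training rows are made up, and a temporary file is used in place of `server/model.pkl`.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training rows; this stands in for the fitted model from the notebook.
train = pd.DataFrame({
    "DayOfWeek": [1, 2, 3, 4, 5, 6, 7, 1],
    "OriginAirportID": [15304, 14122, 14747, 13930, 13931, 15304, 14122, 14747],
})
labels = [0, 1, 0, 1, 0, 1, 0, 1]
model = LogisticRegression(max_iter=1000).fit(train, labels)

# Round-trip through a temporary file instead of server/model.pkl.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# Query with the same column names and order used for training.
query = pd.DataFrame({"DayOfWeek": [1], "OriginAirportID": [15304]})
delay_probability = loaded.predict_proba(query)[0, 1]
print(f"Estimated delay probability: {delay_probability:.2f}")
```

Pickled scikit-learn models are generally expected to be loaded with the same library version that saved them, so it may be worth recording that version next to the file.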


In [12]:
# Extract and Export Unique Airport Names and IDs
# Use columns: 'OriginAirportID', 'OriginAirportName'
airports = df_filled[['OriginAirportID', 'OriginAirportName']].drop_duplicates()
airports.to_csv('server/airports.csv', index=False)
print('Airport data saved to server/airports.csv')

Airport data saved to server/airports.csv
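
The export above covers origin airports only, so an airport that appears solely as a destination would be missing from `airports.csv`. If a complete list is wanted, one option is to stack the origin and destination columns before de-duplicating. This is sketched below on made-up rows; the output column names `AirportID` and `AirportName` are hypothetical.

```python
import pandas as pd

# Made-up rows with the same airport columns as the dataset.
sample = pd.DataFrame({
    "OriginAirportID": [15304, 14122, 15304],
    "OriginAirportName": ["Tampa International", "Pittsburgh International",
                          "Tampa International"],
    "DestAirportID": [12478, 13232, 14122],
    "DestAirportName": ["John F. Kennedy International",
                        "Chicago Midway International",
                        "Pittsburgh International"],
})

# Rename both column pairs to a shared (hypothetical) schema.
origins = sample[["OriginAirportID", "OriginAirportName"]].rename(
    columns={"OriginAirportID": "AirportID", "OriginAirportName": "AirportName"}
)
dests = sample[["DestAirportID", "DestAirportName"]].rename(
    columns={"DestAirportID": "AirportID", "DestAirportName": "AirportName"}
)

# Stack, keep one row per airport ID, and sort by name for readability.
airports = (
    pd.concat([origins, dests], ignore_index=True)
    .drop_duplicates(subset="AirportID")
    .sort_values("AirportName")
    .reset_index(drop=True)
)
print(airports)
```

De-duplicating on `AirportID` alone also guards against the same ID appearing with two spellings of its name, which `drop_duplicates()` on both columns would keep as separate rows.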
