# Title

## Summary

## Introduction

## Methods and Results

To start, we will import the required libraries for our analysis, set the random state to generate reproducible results, and read in the data.

In [1]:
# import required libraries for analysis
import altair as alt
import pandas as pd
import numpy as np
import sklearn
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import FunctionTransformer, Normalizer, OneHotEncoder, StandardScaler, normalize, scale
from sklearn.compose import make_column_transformer
from sklearn.metrics import (make_scorer, accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, ConfusionMatrixDisplay, classification_report)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.neighbors import KNeighborsClassifier

# set random state to have reproducible results
random_state=12

# read in data
raw_full_flight_data = pd.read_csv("../data/raw/full_data_flightdelay.csv")

Let's see how big the data set is.

In [2]:
raw_full_flight_data.shape

(6489062, 26)

There are 6,489,062 observations in the raw data set. Since working with such a large data set would take a lot of computing power and time, we will take a random sample of 20,000 observations and use it in our analysis.

In [3]:
# sample 20,000 observations from the raw data set
raw_sample_flight_data = raw_full_flight_data.sample(n=20000, random_state=random_state)

# save the sample data set
raw_sample_flight_data.to_csv("../data/processed/raw_sample_flight_data.csv")

# check shape of sample to confirm the sampling worked
raw_sample_flight_data.shape

(20000, 26)

Let's clean our sample data by only keeping the features of interest and the target column in our data.  

Features:
- Month (`MONTH`)
- Day of Week (`DAY_OF_WEEK`)
- Number of concurrent flights leaving from the airport in the same departure block (`CONCURRENT_FLIGHTS`)
- Carrier (`CARRIER_NAME`)
- Number of flight attendants per passenger for airline (`FLT_ATTENDANTS_PER_PASS`)
- Number of ground service employees (service desk) per passenger for airline (`GROUND_SERV_PER_PASS`)
- Age of departing aircraft (`PLANE_AGE`)
- Inches of snowfall on departure day (`SNOW`)
- Max wind speed on departure day (`AWND`)

Target:
- Whether the departing flight is delayed by more than 15 minutes (`DEP_DEL15`, `0` = no and `1` = yes)

Then, we'll split the data into training and testing sets.

In [4]:
# list of features and target (DEP_DEL15) columns
list_of_features_and_target = ['MONTH', 'DAY_OF_WEEK', 'DEP_DEL15', 'CONCURRENT_FLIGHTS', 'CARRIER_NAME',
 'FLT_ATTENDANTS_PER_PASS', 'GROUND_SERV_PER_PASS', 'PLANE_AGE', 'SNOW', 'AWND']

# only keep the features of interest and the target column in the data set
filtered_sample_flight_data = raw_sample_flight_data[list_of_features_and_target]

# save the filtered sample data set
filtered_sample_flight_data.to_csv("../data/processed/filtered_sample_flight_data.csv")

# split filtered sample data into training and testing splits.
flight_train, flight_test = train_test_split(
    filtered_sample_flight_data,
    test_size=0.2,
    random_state=random_state,
    stratify=filtered_sample_flight_data["DEP_DEL15"],
)
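The `stratify` argument keeps the share of delayed flights roughly equal in both splits. The following is a minimal sketch of that effect on synthetic data; the toy DataFrame, its `feature` column, and the assumed 19% delay rate are illustrative stand-ins, not the flight data itself.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flight sample (assumption: about 19% delayed)
rng = np.random.default_rng(12)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "DEP_DEL15": (rng.random(1000) < 0.19).astype(int),
})

# Same call pattern as the split above: 80/20 split, stratified on the target
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=12, stratify=toy["DEP_DEL15"]
)

# With stratification, the delayed share is nearly identical in both splits
train_rate = toy_train["DEP_DEL15"].mean()
test_rate = toy_test["DEP_DEL15"].mean()
print(f"train delayed share: {train_rate:.3f}, test delayed share: {test_rate:.3f}")
```

Running the same check on `flight_train` and `flight_test` would confirm that the class balance was preserved in the real split.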

### Exploratory Data Analysis

Let's preview the training set and have a look at some information about the data.

In [5]:
flight_train.head()

|  | MONTH | DAY_OF_WEEK | DEP_DEL15 | CONCURRENT_FLIGHTS | CARRIER_NAME | FLT_ATTENDANTS_PER_PASS | GROUND_SERV_PER_PASS | PLANE_AGE | SNOW | AWND |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3808173 | 8 | 4 | 0 | 31 | United Air Lines Inc. | 0.000254 | 0.000229 | 18 | 0.0 | 8.28 |
| 2861285 | 6 | 7 | 0 | 20 | United Air Lines Inc. | 0.000254 | 0.000229 | 17 | 0.0 | 12.53 |
| 1880876 | 4 | 5 | 0 | 64 | American Eagle Airlines Inc. | 0.000348 | 0.000107 | 15 | 0.0 | 14.32 |
| 2861238 | 6 | 7 | 0 | 27 | American Airlines Inc. | 9.8e-05 | 0.000177 | 2 | 0.0 | 12.53 |
| 5617638 | 11 | 6 | 0 | 57 | United Air Lines Inc. | 0.000254 | 0.000229 | 21 | 0.0 | 6.26 |


*Table 1. Preview of the training flight data.*

In [6]:
flight_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16000 entries, 3808173 to 2731743
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   MONTH                    16000 non-null  int64  
 1   DAY_OF_WEEK              16000 non-null  int64  
 2   DEP_DEL15                16000 non-null  int64  
 3   CONCURRENT_FLIGHTS       16000 non-null  int64  
 4   CARRIER_NAME             16000 non-null  object 
 5   FLT_ATTENDANTS_PER_PASS  16000 non-null  float64
 6   GROUND_SERV_PER_PASS     16000 non-null  float64
 7   PLANE_AGE                16000 non-null  int64  
 8   SNOW                     16000 non-null  float64
 9   AWND                     16000 non-null  float64
dtypes: float64(4), int64(5), object(1)
memory usage: 1.3+ MB


We can see that there are no null values in any of the columns. To gain more insight into the training data, the following data visualizations were made to see the distribution of the different feature variables and target variable.

In [7]:
## EDA vizs

### Preprocessing the Data

Since the KNN algorithm uses Euclidean distance to determine which training observations are nearest to a given data point, we will center and scale each numeric feature in our preprocessing so that each has a comparable influence on which neighbors are chosen. We will treat `MONTH` and `DAY_OF_WEEK` as numeric features, as they are represented in the dataset, to preserve their ordinal nature. However, we concede that this makes the model consider Saturday (`7`) and Sunday (`1`) to be far apart. The same wrap-around won't be a problem for `MONTH`, since the data only has observations from 2019.

Because Saturdays and Sundays are both weekend days and should be considered close together, we will change Sunday's value to `8`. As a result, the model will consider Sundays and Mondays to be further apart, but we feel that this is the better trade-off.
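The trade-off can be checked with simple arithmetic on the day codes described above (Sunday = `1`, Saturday = `7`). The sketch below also shows a sine/cosine encoding, which is a different technique that is not used in this analysis, as one possible way to keep both pairs of days close; `cyclical_encode` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

# Day-of-week codes as described in the text: Sunday = 1, Monday = 2, Saturday = 7
SUN_ORIGINAL, MON, SAT = 1, 2, 7
SUN_REMAPPED = 8  # the recoding used in this analysis

# Absolute gaps between day codes before and after the recoding
gap_sat_sun_before = abs(SAT - SUN_ORIGINAL)
gap_sat_sun_after = abs(SAT - SUN_REMAPPED)
gap_sun_mon_before = abs(SUN_ORIGINAL - MON)
gap_sun_mon_after = abs(SUN_REMAPPED - MON)

def cyclical_encode(day, period=7):
    """Hypothetical helper: place a day code on the unit circle."""
    angle = 2 * np.pi * day / period
    return np.array([np.sin(angle), np.cos(angle)])

# On the circle, Saturday-Sunday and Sunday-Monday are equally close
dist_sat_sun = np.linalg.norm(cyclical_encode(SAT) - cyclical_encode(SUN_ORIGINAL))
dist_sun_mon = np.linalg.norm(cyclical_encode(SUN_ORIGINAL) - cyclical_encode(MON))

print(gap_sat_sun_before, gap_sat_sun_after, gap_sun_mon_before, gap_sun_mon_after)
```

The recoding swaps which pair of adjacent days is separated by the large gap, whereas the cyclical encoding avoids the gap at the cost of replacing one column with two.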

To allow our categorical features to be used as predictors, we will preprocess them using one-hot encoding.

No imputation is needed since there are no missing values in the dataset.

In [8]:
# replace Sunday's value from 1 to 8, to be closer to Saturday's value (7)
flight_train.loc[flight_train['DAY_OF_WEEK']==1, 'DAY_OF_WEEK']=8
flight_test.loc[flight_test['DAY_OF_WEEK']==1, 'DAY_OF_WEEK']=8

# save training and testing splits
flight_train.to_csv("../data/processed/training_flight_data.csv")
flight_test.to_csv("../data/processed/testing_flight_data.csv")

# check if the replacement worked
print(flight_train['DAY_OF_WEEK'].describe(), flight_test['DAY_OF_WEEK'].describe())

count    16000.000000
mean         4.979125
std          2.009681
min          2.000000
25%          3.000000
50%          5.000000
75%          7.000000
max          8.000000
Name: DAY_OF_WEEK, dtype: float64 count    4000.000000
mean        4.988750
std         2.012371
min         2.000000
25%         3.000000
50%         5.000000
75%         7.000000
max         8.000000
Name: DAY_OF_WEEK, dtype: float64


In [9]:
# separate feature vectors from target
X_train = flight_train.drop(columns = ["DEP_DEL15"])
y_train = flight_train["DEP_DEL15"]
X_test = flight_test.drop(columns = ["DEP_DEL15"])
y_test = flight_test["DEP_DEL15"]

# preprocess features
numeric_features = ['MONTH', 'DAY_OF_WEEK', 'CONCURRENT_FLIGHTS', 'FLT_ATTENDANTS_PER_PASS',
                    'GROUND_SERV_PER_PASS', 'PLANE_AGE', 'SNOW', 'AWND']
numeric_transformer = make_pipeline(StandardScaler())

categorical_features = ['CARRIER_NAME']
categorical_transformer = make_pipeline(OneHotEncoder(sparse_output=False, dtype='int'))

preprocessor = make_column_transformer((numeric_transformer, numeric_features),
                                       (categorical_transformer, categorical_features))

X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
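To illustrate why the `StandardScaler` step matters for a distance-based model, here is a minimal sketch on three made-up rows with two features on very different scales. The values are illustrative only (loosely modeled on `CONCURRENT_FLIGHTS` and `FLT_ATTENDANTS_PER_PASS`), and `euclidean` is a hypothetical helper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy rows: column 0 is on a scale of tens, column 1 on a scale of 1e-4
X_toy = np.array([
    [20.0, 0.000098],
    [21.0, 0.000348],
    [64.0, 0.000098],
])

def euclidean(a, b):
    """Hypothetical helper: Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Unscaled: distances are driven almost entirely by column 0
raw_d01 = euclidean(X_toy[0], X_toy[1])
raw_d02 = euclidean(X_toy[0], X_toy[2])

# Scaled: both columns have mean 0 and unit variance, so both contribute
X_toy_scaled = StandardScaler().fit_transform(X_toy)
scaled_d01 = euclidean(X_toy_scaled[0], X_toy_scaled[1])
scaled_d02 = euclidean(X_toy_scaled[0], X_toy_scaled[2])

print(f"raw:    d(0,1)={raw_d01:.3f}, d(0,2)={raw_d02:.3f}")
print(f"scaled: d(0,1)={scaled_d01:.3f}, d(0,2)={scaled_d02:.3f}")
```

Before scaling, row 1 looks far closer to row 0 than row 2 does, even though it differs substantially in the second feature; after scaling, the two distances are of similar size.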

### Baseline Model

To create a baseline model to compare the final KNN model to, we will use a `DummyClassifier` with the `stratified` strategy. It randomly predicts whether a flight departure will be delayed or not, with probabilities matching the class distribution in the training data.

In [10]:
# create baseline model to compare final model to
scoring = {'accuracy': 'accuracy',
           'precision': make_scorer(precision_score, pos_label=1),
           'recall': make_scorer(recall_score, pos_label=1),
           'f1': make_scorer(f1_score, pos_label=1) }

dummy_classifier = DummyClassifier(strategy = "stratified", random_state = random_state)

dummy_scores = pd.DataFrame(
    cross_validate(
        dummy_classifier, X_train, y_train, cv = 5, return_train_score = True, scoring = scoring
    )
)

dummy_mean = dummy_scores.mean()
dummy_mean

fit_time           0.008740
score_time         0.024389
test_accuracy      0.695625
train_accuracy     0.692484
test_precision     0.188156
train_precision    0.189657
test_recall        0.183681
train_recall       0.191113
test_f1            0.185891
train_f1           0.190382
dtype: float64

As shown above, the DummyClassifier has a mean validation accuracy of 69.6%.
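As a rough sanity check on this figure: a stratified dummy classifier guesses "delayed" with probability p, independently of the true label, so its expected accuracy is p² + (1 − p)². The sketch below assumes p ≈ 0.19; that value is inferred from the dummy recall of about 0.19 reported above and is not computed directly in this notebook.

```python
# Assumed share of delayed flights in the training data (inferred from the
# dummy recall reported above, not computed directly from the data)
p_delayed = 0.19

# The dummy is correct when guess and truth are both 1 (p * p)
# or both 0 ((1 - p) * (1 - p))
expected_stratified_accuracy = p_delayed**2 + (1 - p_delayed)**2

# Always predicting the majority class ("not delayed") would score 1 - p
majority_class_accuracy = 1 - p_delayed

print(f"stratified: {expected_stratified_accuracy:.3f}")
print(f"majority class: {majority_class_accuracy:.3f}")
```

Under that assumption the expected stratified accuracy is about 0.69, in line with the cross-validated value, while a majority-class baseline (`strategy="most_frequent"`) would score about 0.81 and may be a more demanding point of comparison.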

### KNN Classifier

#### Parameter Tuning

To find the optimal value of k that maximizes the accuracy of the model, we will use 5-fold cross-validation for values of k from 10 to 40, in increments of 2.

In [11]:
# find the k value that yields the best accuracy estimate
results_dict = {
    "n_neighbors": [],
    "mean_train_score": [],
    "mean_cv_score": []}

for n in range(10,41, 2):
    knn_model = KNeighborsClassifier(n_neighbors=n)
    cv_scores = cross_validate(knn_model, X_train, y_train, cv=5, return_train_score=True)
    results_dict["n_neighbors"].append(n)
    results_dict["mean_train_score"].append(cv_scores["train_score"].mean())
    results_dict["mean_cv_score"].append(cv_scores["test_score"].mean())

results_df = pd.DataFrame(results_dict)

results_df.sort_values(by=["mean_cv_score"], ascending=False).head(1)

|  | n_neighbors | mean_train_score | mean_cv_score |
| --- | --- | --- | --- |
| 12 | 34 | 0.810953 | 0.810875 |


In [12]:
best_k = int(results_df.loc[results_df['mean_cv_score'].idxmax()]['n_neighbors'])
best_k

34

As shown above, 34 is the best value of k among those tested (10 to 40, in increments of 2). It yields a mean validation accuracy of 81.1%, which is higher than that of the DummyClassifier (69.6%).
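The manual search loop above could also be written with `GridSearchCV`, which is already imported. This is a minimal sketch of that alternative, not the method used in this analysis; it runs on synthetic data from `make_classification` (an assumption standing in for the preprocessed flight features), so its selected k will not match the value found above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed training data (assumption)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=12
)

# Same grid as the loop above: k = 10, 12, ..., 40
param_grid = {"n_neighbors": list(range(10, 41, 2))}

grid_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)
grid_search.fit(X_toy, y_toy)

# Best k and its mean cross-validated accuracy on the toy data
toy_best_k = grid_search.best_params_["n_neighbors"]
toy_best_cv_accuracy = grid_search.best_score_
print(toy_best_k, round(toy_best_cv_accuracy, 3))
```

By default `GridSearchCV` also refits the best model on the full training data, so `grid_search.best_estimator_` can be used directly for prediction.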

#### Training and Testing

In [13]:
# make new model with best k
best_model = KNeighborsClassifier(n_neighbors=best_k)

# retrain classifier
best_model.fit(X_train, y_train)

# get predictions on test data
predictions = best_model.predict(X_test)

# get estimate of accuracy of classifier on test data
test_score = best_model.score(X_test, y_test)
test_score

0.81075

The accuracy of the KNN classifier with k=34 on the test set is 81.1%.
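Because the classes are imbalanced, accuracy alone may hide how well delayed flights are identified. The sketch below shows how `confusion_matrix`, `precision_score`, and `recall_score` (all imported earlier) could be used to look at this; it runs on synthetic data from `make_classification`, which is an assumption, so its numbers say nothing about the flight model itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the preprocessed flight data (assumption)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=12
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12, stratify=y_toy
)

toy_model = KNeighborsClassifier(n_neighbors=34).fit(X_tr, y_tr)
toy_pred = toy_model.predict(X_te)

# Rows are true classes, columns are predicted classes (order: 0, 1)
cm = confusion_matrix(y_te, toy_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Precision and recall for the positive ("delayed") class
toy_precision = precision_score(y_te, toy_pred, pos_label=1, zero_division=0)
toy_recall = recall_score(y_te, toy_pred, pos_label=1, zero_division=0)

print(cm)
print(f"precision: {toy_precision:.3f}, recall: {toy_recall:.3f}")
```

Applied to this analysis, the same calls would take `y_test` and `best_model.predict(X_test)` as inputs.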

#### Exploring the Model

Let’s explore the model with more visualizations.

In [14]:
predictions = best_model.predict(X_test)

# make the predictions into a dataframe and rename the prediction column to "prediction"
prediction_df = pd.DataFrame(predictions)
prediction_df = prediction_df.rename(columns={prediction_df.columns[0]:'prediction'})

# reset index of testing dataframe
flight_test = flight_test.reset_index()

# concatenate the prediction dataframe to the testing dataframe
flight_test_predict = pd.concat([flight_test, prediction_df], axis=1)

# preview the testing dataframe with the model predictions
flight_test_predict.head()

|  | index | MONTH | DAY_OF_WEEK | DEP_DEL15 | CONCURRENT_FLIGHTS | CARRIER_NAME | FLT_ATTENDANTS_PER_PASS | GROUND_SERV_PER_PASS | PLANE_AGE | SNOW | AWND | prediction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1742730 | 4 | 7 | 1 | 79 | American Airlines Inc. | 9.8e-05 | 0.000177 | 9 | 0.0 | 4.47 | 0 |
| 1 | 4205374 | 8 | 5 | 1 | 55 | Southwest Airlines Co. | 6.2e-05 | 9.9e-05 | 12 | 0.0 | 8.5 | 0 |
| 2 | 2994400 | 6 | 4 | 0 | 6 | United Air Lines Inc. | 0.000254 | 0.000229 | 18 | 0.0 | 7.38 | 0 |
| 3 | 2706197 | 6 | 6 | 0 | 25 | American Eagle Airlines Inc. | 0.000348 | 0.000107 | 14 | 0.0 | 6.04 | 0 |
| 4 | 1193332 | 3 | 3 | 0 | 28 | Endeavor Air Inc. | 0.0 | 9.4e-05 | 10 | 0.0 | 16.33 | 0 |


*Table 2. Preview of the testing data with the model predictions.*
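The `reset_index()` call before the concatenation matters because `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below illustrates this on a three-row toy frame (made-up values) and shows `DataFrame.assign` as one possible alternative that attaches the predictions by position while keeping the original index.

```python
import numpy as np
import pandas as pd

# Toy test set with a non-default index, like the sampled flight data
toy_test = pd.DataFrame(
    {"DEP_DEL15": [1, 0, 0]}, index=[1742730, 4205374, 2994400]
)
toy_predictions = np.array([0, 0, 0])

# concat aligns on index labels: without reset_index the rows do not line up,
# producing six rows padded with NaN instead of three
misaligned = pd.concat(
    [toy_test, pd.DataFrame({"prediction": toy_predictions})], axis=1
)

# assign attaches the array by position and keeps the original index
aligned = toy_test.assign(prediction=toy_predictions)

print(misaligned.shape, aligned.shape)
```

Keeping the original index this way makes it possible to trace each prediction back to its row in the raw data set.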

## Discussion

## References