# Title

## Summary

## Introduction

## Methods and Results

To start, we will import the required libraries for our analysis, set the random state to generate reproducible results, and read in the data.

In [1]:
# import required libraries for analysis
import altair as alt
import numpy as np
import pandas as pd
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    precision_score,
    recall_score,
)
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    Normalizer,
    OneHotEncoder,
    StandardScaler,
    normalize,
    scale,
)

# set random state to have reproducible results
random_state = 12

# read in data
raw_full_flight_data = pd.read_csv("../data/raw/full_data_flightdelay.csv")

Let's see how big the data set is.

In [2]:
raw_full_flight_data.shape

(6489062, 26)

There are 6,489,062 observations in the raw data set. Because working with a data set this large would take considerable computing power and time, we will take a random sample of 20,000 observations and use it in our analysis.

In [3]:
# sample 20,000 observations from the raw data set, reusing the seed set above
raw_sample_flight_data = raw_full_flight_data.sample(n=20000, random_state=random_state)

# save the sample data set
raw_sample_flight_data.to_csv("../data/processed/raw_sample_flight_data.csv")

# check shape of sample to confirm the sampling worked
raw_sample_flight_data.shape

(20000, 26)
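
The role of the fixed random state in the sampling step can be illustrated with a minimal sketch. The toy DataFrame below is a hypothetical stand-in for the raw flight data (its values are randomly generated, not taken from the real data set); only `DataFrame.sample` and its `random_state` argument mirror the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw flight data (hypothetical values, not the real data set)
rng = np.random.default_rng(0)
toy_data = pd.DataFrame({
    "MONTH": rng.integers(1, 13, size=1_000),
    "DEP_DEL15": rng.integers(0, 2, size=1_000),
})

random_state = 12

# Sampling twice with the same seed returns exactly the same rows
sample_a = toy_data.sample(n=100, random_state=random_state)
sample_b = toy_data.sample(n=100, random_state=random_state)
same_rows = sample_a.index.equals(sample_b.index)
print(same_rows)
```

Note that `sample` keeps the original row labels, which is why the index of the sampled data shows values such as 3808173 rather than 0 to 19,999.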

Let's clean our sample data by keeping only the features of interest and the target column.

Features:
- Month (`MONTH`)
- Day of Week (`DAY_OF_WEEK`)
- Number of concurrent flights leaving from the airport in the same departure block (`CONCURRENT_FLIGHTS`)
- Carrier (`CARRIER_NAME`)
- Number of flight attendants per passenger for airline (`FLT_ATTENDANTS_PER_PASS`)
- Number of ground service employees (service desk) per passenger for airline (`GROUND_SERV_PER_PASS`)
- Age of departing aircraft (`PLANE_AGE`)
- Departing airport (`DEPARTING_AIRPORT`)
- Previous airport that the aircraft departed from (`PREVIOUS_AIRPORT`)
- Inches of snowfall on the departure day (`SNOW`)
- Max wind speed on the departure day (`AWND`)

Target:
- Whether the departing flight is delayed by more than 15 minutes (`DEP_DEL15`)

Then, we'll split the data into training and testing sets.

In [4]:
# list of features and target (DEP_DEL15) columns
list_of_features_and_target = ['MONTH', 'DAY_OF_WEEK', 'DEP_DEL15', 'CONCURRENT_FLIGHTS', 'CARRIER_NAME',
 'FLT_ATTENDANTS_PER_PASS', 'GROUND_SERV_PER_PASS', 'PLANE_AGE',
 'DEPARTING_AIRPORT', 'PREVIOUS_AIRPORT', 'SNOW', 'AWND']

# only keep the features of interest and the target column in the data set
filtered_sample_flight_data = raw_sample_flight_data[list_of_features_and_target]

# save the filtered sample data set
filtered_sample_flight_data.to_csv("../data/processed/filtered_sample_flight_data.csv")

# split filtered sample data into training and testing splits
flight_train, flight_test = train_test_split(
    filtered_sample_flight_data,
    test_size=0.2,
    random_state=random_state,
    stratify=filtered_sample_flight_data["DEP_DEL15"],
)

# save training and test splits
flight_train.to_csv("../data/processed/training_flight_data.csv")
flight_test.to_csv("../data/processed/testing_flight_data.csv")
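
The effect of the `stratify` argument in the split above can be checked with a small sketch. The toy data here is hypothetical (an assumed 20% share of delayed flights, chosen only for illustration); the point is that a stratified split should give both partitions roughly the same share of the positive class.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an imbalanced binary target (hypothetical: 20% positives)
toy = pd.DataFrame({
    "feature": np.arange(1000),
    "DEP_DEL15": np.array([1] * 200 + [0] * 800),
})

# Same call pattern as the split above: 80/20 split, stratified on the target
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=12, stratify=toy["DEP_DEL15"]
)

# Share of delayed flights in each partition
train_rate = toy_train["DEP_DEL15"].mean()
test_rate = toy_test["DEP_DEL15"].mean()
print(train_rate, test_rate)
```

Without `stratify`, the two rates could drift apart by chance, which matters more when the positive class is rare.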

## Exploratory Data Analysis

Let's preview the training set and have a look at some information about the data.

In [5]:
flight_train.head()

|  | MONTH | DAY_OF_WEEK | DEP_DEL15 | CONCURRENT_FLIGHTS | CARRIER_NAME | FLT_ATTENDANTS_PER_PASS | GROUND_SERV_PER_PASS | PLANE_AGE | DEPARTING_AIRPORT | PREVIOUS_AIRPORT | SNOW | AWND |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3808173 | 8 | 4 | 0 | 31 | United Air Lines Inc. | 0.000254 | 0.000229 | 18 | Los Angeles International | NONE | 0.0 | 8.28 |
| 2861285 | 6 | 7 | 0 | 20 | United Air Lines Inc. | 0.000254 | 0.000229 | 17 | San Francisco International | NONE | 0.0 | 12.53 |
| 1880876 | 4 | 5 | 0 | 64 | American Eagle Airlines Inc. | 0.000348 | 0.000107 | 15 | Chicago O'Hare International | Truax Field | 0.0 | 14.32 |
| 2861238 | 6 | 7 | 0 | 27 | American Airlines Inc. | 0.000098 | 0.000177 | 2 | San Francisco International | NONE | 0.0 | 12.53 |
| 5617638 | 11 | 6 | 0 | 57 | United Air Lines Inc. | 0.000254 | 0.000229 | 21 | Chicago O'Hare International | Pittsburgh International | 0.0 | 6.26 |


*Table 1. Preview of the training flight data.*

In [6]:
filtered_sample_flight_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20000 entries, 4784886 to 2124261
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   MONTH                    20000 non-null  int64  
 1   DAY_OF_WEEK              20000 non-null  int64  
 2   DEP_DEL15                20000 non-null  int64  
 3   CONCURRENT_FLIGHTS       20000 non-null  int64  
 4   CARRIER_NAME             20000 non-null  object 
 5   FLT_ATTENDANTS_PER_PASS  20000 non-null  float64
 6   GROUND_SERV_PER_PASS     20000 non-null  float64
 7   PLANE_AGE                20000 non-null  int64  
 8   DEPARTING_AIRPORT        20000 non-null  object 
 9   PREVIOUS_AIRPORT         20000 non-null  object 
 10  SNOW                     20000 non-null  float64
 11  AWND                     20000 non-null  float64
dtypes: float64(4), int64(5), object(3)
memory usage: 2.0+ MB


Every column has 20,000 non-null entries, so the filtered sample contains no missing values. To gain more insight into the training data, we visualized the distributions of the feature variables and the target variable.
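
As a complement to the `info()` output, the missing-value check and a first look at the target's class balance can also be computed directly. This is a minimal sketch on a hypothetical stand-in for the training split (the values are illustrative only); in the notebook, the same calls would be applied to `flight_train`.

```python
import pandas as pd

# Hypothetical stand-in for the training split (values are illustrative only)
toy_train = pd.DataFrame({
    "PLANE_AGE": [18, 17, 15, 2, 21, 9],
    "SNOW": [0.0, 0.0, 0.0, 0.0, 0.0, 1.2],
    "DEP_DEL15": [0, 0, 0, 0, 1, 0],
})

# Count missing values per column; all zeros means nothing needs imputing
null_counts = toy_train.isna().sum()

# Share of each target class, useful for judging class imbalance
class_share = toy_train["DEP_DEL15"].value_counts(normalize=True)

print(null_counts)
print(class_share)
```

Running these checks on the training split alone keeps the exploration from drawing on the test data.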

## Discussion

## References