# W261 Final Project - Flight Delay Prediction

Team 11 - Amber (Shu Ying) Chen, Jeffrey Day, Sanjay Elangovan, and Menglu He

Summer 2021

## Table of Contents

Stage 1 - Question Formulation and Evaluation Metrics Choice

Stage 2 - Exploratory Data Analysis

Stage 3 - Feature Engineering

Stage 4 - Algorithm Theory and Implementation

Stage 5 - Conclusions

Course Concepts

Citations

## Stage 1 - Question Formulation and Evaluation Metrics Choice

#### Data Context and Question Formulation

In 2019, flight delays cost airlines approximately $8 billion and passengers approximately $18 billion. Per the FAA, costs of delay have continued to increase over the past four years\\(^{[3]}\\). To that end, we propose the following research question: **Given 2 hours of lead time based on provided data sets, will a given flight be delayed by more than 15 mins?**

We will focus the problem on travelers. The airline likely cannot make changes with only 2 hours of lead time, but customers can alter their plans to get to the airport. Ultimately, the traveler is the customer, and it is beneficial to the airline to keep travelers well informed.

As we evaluate our model, we will prioritize minimizing false positives. In this situation, informing customers that their flight will be late when it is actually on time would be very damaging since customers may miss their flight. By contrast, a false negative would result in the customer staying in the airport longer. 

#### Baselines and State-of-the-Art Literature

Several companies have attempted to solve this problem using statistical inference and ML \\(^{[4]}\\). For example, FlightCaster utilizes Hadoop MapReduce to make predictions regarding flight delays\\(^{[5]}\\). Below are some metrics on how existing prediction products perform:
- KnowDelay: 90% accurate
- FlightCaster: 85% accurate
- DelayCast: 80-90% accurate

We would hope for our model to be on par with these existing options. We should note, however, that their definitions are slightly different from ours. For example, KnowDelay predicts a 30+ minute delay, which may be easier to predict than a 15-minute delay because a 30-minute delay may require more intense weather conditions. Another consideration when evaluating the model is that a model which returns only negative results would still be 80% accurate. This value comes from the baseline flight departure delay distribution: far more on-time departures than late departures.
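The point about the all-negative baseline can be checked with a few lines of arithmetic. The sketch below uses illustrative counts (800 on-time and 200 delayed flights, chosen only to match the roughly 80/20 split described above) and a hypothetical helper name; it is not taken from the project notebooks.

```python
def majority_baseline_metrics(n_on_time, n_delayed):
    """Metrics for a classifier that always predicts 'on time' (the negative class)."""
    total = n_on_time + n_delayed
    accuracy = n_on_time / total  # every on-time flight counts as correct
    recall = 0.0                  # no delayed flight is ever flagged
    return accuracy, recall


# Illustrative counts only: an 80/20 split of on-time to delayed flights.
accuracy, recall = majority_baseline_metrics(n_on_time=800, n_delayed=200)
print(f"accuracy={accuracy:.3f}, recall={recall:.1f}")
```

Accuracy alone therefore says little about how well delays are detected, which is why other metrics are needed alongside it.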

#### Evaluation Metrics

There are several factors to consider when evaluating our model:

- _Precision_: This metric focuses on false positives. It measures how often a flight is actually delayed when the classifier predicts that it will be delayed.
$$ 
Precision = \frac{True\ Positives}{True\ Positives +False\ Positives}
$$
- _Recall_: This metric weighs false negatives against true positives. It measures how often the classifier correctly predicts a delay when a flight is actually delayed.
$$ 
Recall = \frac{True\ Positives}{True\ Positives +False\ Negatives}
$$
- _Area under the ROC Curve (AUC ROC)_: The ROC curve represents the tradeoff between the true positive rate (TPR) and the false positive rate (FPR) as the classification threshold varies, and the AUC summarizes that curve in a single number.

<img src="https://blog.floydhub.com/content/images/2019/10/ROC_curve.svg" width=30%>

We use the AUC ROC to gauge the performance of our models during the hyperparameter tuning stage because it quantifies how well a model separates delayed flights from on-time flights, which is what we want our model to do well.
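One way to read the AUC ROC as a measure of separability is as the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight. The sketch below computes it that way in plain Python on made-up scores. It is a quadratic-time illustration with a hypothetical function name, not the evaluator used in the project.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (delayed, on-time) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = delayed) and model scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = auc_roc(labels, scores)
print(round(auc, 4))
```

Here five of the six (delayed, on-time) pairs are ordered correctly, so the AUC is 5/6.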

Since false positives have higher consequences than false negatives in our business case, we also use the precision and recall metrics to check whether false positives or false negatives are a bigger part of our error set. Therefore, if two models have similar AUC ROC scores, we would pick the one with a higher precision score.

Using a confusion matrix, as shown below, we can quickly identify how many false positives and false negatives the model produces, which helps us assess its accuracy.

Confusion Matrix:

<img src="https://lh3.googleusercontent.com/LUXlprZjYIZUcP7f2g4hZkx_Skuhy8NZRIWyt7KbfwhYeSSBjtDy6DhUzKZ8ckg2hA5gek0nC7Vp8zPPEayO3PRdfM_LLlvRaHUrMma3fa2GV2yS_OhsvMNZ_F15h1RmEiyT1-BrktM" width=20%>
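The metrics above follow directly from the four cells of a confusion matrix. As a minimal sketch, with a hypothetical helper name and made-up counts rather than the project's results:

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts: 100 delayed flights (40 caught), 400 on-time flights (10 wrongly flagged).
metrics = classification_metrics(tp=40, fp=10, fn=60, tn=390)
print(metrics)
```

In this example precision is high (0.8) while recall is low (0.4), the pattern we would prefer over the reverse given the cost of false positives for travelers.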

## Stage 2 - Exploratory Data Analysis

For detailed code notebooks, see below:

- EDA: https://github.com/ShuYingAmberChen/Machine-learning-projects/blob/main/Flight_delay_prediction/EDA/EDA-Full%20Dataset.ipynb
- Graph Visualizations: https://github.com/ShuYingAmberChen/Machine-learning-projects/tree/main/Flight_delay_prediction/Feature%20Engineering

At a very high level, our EDA's major tasks consist of understanding the information included in each dataset, identifying missing data in both the airlines and weather datasets and identifying correlated features for selection and engineering.

### Airline Dataset

**Airline dataset overview:**

- Airline table has 31,178,801 rows and 107 columns.
- The number of distinct airports is: 369
- The number of distinct airlines is: 19
- The first date in the airline table is: 2015-01-01
- The last date in the airline table is: 2019-12-31
- 18.1% of the flights are delayed by > 15 mins

**Explore Temporal Impacts on Departure Delay**

This section explores the effect of time-related factors on flight departure delays. More specifically, we looked at:
- Time of Day
- Day of Week
- Day of Month
- Month of Year
- Quarter
- Year

From the distribution graphs below, we can observe that:

- Time of Day: The number of delays increases in the afternoon and peaks during the 5-7 pm period.
- Day Of Week: Monday, Thursday, and Friday have more departure delays than other days of the week.
- Day of Month: No obvious pattern.
- Month: June, July, August, and December have more departure delays than other months of the year, most likely related to summer and holiday travel.

The diagram suggests that including temporal features in our machine learning model can contribute valuable information to the prediction.

<img src="https://lh5.googleusercontent.com/NuR5DIFqbUHzKoGZJsINxLCHVl8ds0ogVjoIzhtYsqh8D1vdcWQ5iIRz2S1Z80fY_JwKUNor52Tb-mDLsmNP2BnN296blCEghLp-oHqqeU7ptXXMFOUvFuYW6R6BYSkgl_Hp4PYVWhI" width=70%>

We also visualized the data geographically. In these visualizations, line width reflects the number of flights that travel from origin to destination. 
- The most popular trips from both airports are to large cities, like New York, Dallas, and L.A.



**Exploring Airline and Airport Factors on Departure Delay**

As we can see from the distribution graph, there are clear relationships between the number of departure delays, airlines, and airports:

- Airlines: The airlines with the highest number of delays are: WN: Southwest Airlines, AA: American Airlines, DL: Delta Airlines
- Airports: The top 3 airports with the highest number of delays are: ATL, ORD, and DFW. This makes sense since ATL and ORD are the busiest airports/connection hubs in the U.S.

<img src="https://lh5.googleusercontent.com/qb-dJ7Fa2qzRKS5aIULD0YZUbFn6DXtoUMIiatP7vyunLV_MF6wLtQ0iPdmE66niQGO1AEKrZxPfX6cHo-_0hT75JD-7BoBLySPNwDf6XuaOasa9ykuN64iPZHPOw_Vs6cOp8c6lpQU" width=70%>


**Weather Dataset**

We created a correlation matrix on the fully joined airline and weather dataset to examine the relationship between the weather variables and departure delay. As the matrix below shows, there is no clear correlation between the weather variables and our departure delay outcome variable. This is consistent with our external research, which found that extreme weather accounts for only around 5% of all airline delays.\\(^{[6]}\\)

<img src="https://lh4.googleusercontent.com/-TSaWXIMkkzX-MtidVygfH7qqPa9T2C6xc2-kOb78NpB2-4l7jzPmhXiuGSTAWh7gcXtB1HzRjsWsT7vJ_ltkyUbcXJiVz57B6R8IG2oxP-4uNi4LT898ihKF74O3Gt42fJ2qqYpuZ8" width=50%>

We also explored the relationship between our key weather variables and the delay minutes. See the distribution diagram below:

<img src="https://lh5.googleusercontent.com/DJFD2PyXZRtu9lnooSjgasqDk_EYk3MNgKOW_uf83L8_kUOO0JJ9LLbxqp4Oux2H9HN8G19FoPqBDrwVnsLr5JRy2WLEPpbpnikYeKXqKY2FJStqiJafJpqgNQJRL_Csoo9LwVcYrqY" width=70%>

## Stage 3 - Feature Engineering

### Data Ingestion
Please see Data Ingestion - Complete Dataset notebook: https://github.com/ShuYingAmberChen/Machine-learning-projects/tree/main/Flight_delay_prediction/Data_Ingestion


When connecting weather records to flights, we identified the following two gaps:

1) The Airline dataset has only IATA airport codes, but the Weather dataset uses ICAO airport codes; and

2) While the departure and arrival times are in the local timezone in the airline dataset, the weather data has timestamps in UTC format.

To reconcile these differences, we introduced an external dataset from the OpenFlights\\(^{[2]}\\) website. This dataset provides a reference of IATA and ICAO airport codes, locations, and timezone information for airports across the globe. We joined this external dataset to the airline data on the IATA code and used the timezone fields to convert flight times to UTC. We then found the weather station closest to each airport. Lastly, we joined the weather and airline datasets on ICAO codes for both the ORIGIN and DESTINATION airports and attached the latest weather record, taken from the station closest to the departure airport, that was available 2 hours before the scheduled departure time.

The final dataset had over 31M data points with no flight data duplicated/lost in combining.
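The three joining steps (local time to UTC, nearest weather station, latest weather record at least 2 hours before departure) can be sketched in plain Python. This is an illustration only: the project ran these joins in Spark, and the function names, station records, and observation values below are hypothetical. The fixed UTC offset is a simplifying assumption that ignores daylight saving time.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone
from math import asin, cos, radians, sin, sqrt


def local_to_utc(local_naive, utc_offset_hours):
    """Convert a naive local timestamp to UTC using a fixed offset (assumption: no DST)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local_naive.replace(tzinfo=tz).astimezone(timezone.utc)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def nearest_station(airport, stations):
    """Pick the station with the smallest great-circle distance to the airport."""
    return min(
        stations,
        key=lambda s: haversine_km(airport["lat"], airport["lon"], s["lat"], s["lon"]),
    )


def latest_weather_before(obs_times, obs_values, cutoff):
    """Return the latest observation at or before cutoff; obs_times must be sorted ascending."""
    i = bisect_right(obs_times, cutoff)
    return obs_values[i - 1] if i else None


# Hypothetical airport and station records.
airport = {"iata": "ATL", "lat": 33.64, "lon": -84.43, "utc_offset": -5}
stations = [
    {"id": "near", "lat": 33.63, "lon": -84.44},
    {"id": "far", "lat": 34.00, "lon": -84.00},
]
station = nearest_station(airport, stations)

# Scheduled departure in local time, converted to UTC, minus the 2-hour lead time.
dep_utc = local_to_utc(datetime(2019, 1, 15, 14, 0), airport["utc_offset"])
cutoff = dep_utc - timedelta(hours=2)

# Hypothetical observation times (UTC) for the chosen station.
obs_times = [
    datetime(2019, 1, 15, 16, 0, tzinfo=timezone.utc),
    datetime(2019, 1, 15, 16, 52, tzinfo=timezone.utc),
    datetime(2019, 1, 15, 17, 30, tzinfo=timezone.utc),
]
obs_values = ["obs_1600", "obs_1652", "obs_1730"]
weather = latest_weather_before(obs_times, obs_values, cutoff)
print(station["id"], dep_utc.isoformat(), weather)
```

The 17:30 observation is excluded because it falls after the 17:00 UTC cutoff, so no information from inside the 2-hour lead window leaks into the features.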

<img src="https://lh3.googleusercontent.com/O0652QCcbYJn8u04Ukjy7ITD22Y5RJ61AZLIpc9tFTGDNKw9da7jAKHwLfjrV6R6GKBLXvQBYDH6ONgMgiSHDtSJPfNgPtDiLd4hR11kuINjs0FAlKxFPg4SS1YUGyn5BeQsXPKuH7c" width=60%>

### Feature Engineering

#### Notebooks
- Flight Delay Features: https://github.com/ShuYingAmberChen/Machine-learning-projects/blob/main/Flight_delay_prediction/Feature%20Engineering/snowball_temporal_delay_feature_engineering.ipynb
- Graph Features: https://github.com/ShuYingAmberChen/Machine-learning-projects/blob/main/Flight_delay_prediction/Feature%20Engineering/graph_features.py
- Normalized Variables: https://github.com/ShuYingAmberChen/Machine-learning-projects/blob/main/Flight_delay_prediction/Feature%20Engineering/normalizing_variables.py

#### Graph Features
Using the flight data and NetworkX, we generated a graph and calculated some metrics about the graph. Since most of the nodes in the graph are not connected, metrics like betweenness or distance are not particularly helpful. We focused on two graph metrics: PageRank and Degree Centrality. NetworkX has functions for both metrics, which makes calculating these variables relatively straightforward. However, it is interesting to note that these metrics do not seem to correlate highly to the response variable. It is plausible that they do not have much effect on the model. 

<img src="https://lh3.googleusercontent.com/oT1FWl2L_yz6pGUuyFtJJmtTjqbaJzVgGU7qsUSrh22D8gQo-HtQEoqtjN89FaMkiygKSVLUa2mYKMmN6GdS7h-tnKTthJe9p3QX-MZ92IHDbT-dVxnrhW5z41SP1Nb3rTDMzDrAwSY" width=50%><img src="https://lh4.googleusercontent.com/D2Irtm1Xjh9Oz9HfSwqVqfm9K--aivFxm8ZZgDJl5UchnwsAxdifPDSkZX1N3GYuZYJ6Wqqye2j4yLsP8VEo6NuFXdgXaioEwF6tylM3IDY6oeGPAPDbu9ULio0fUvqNh4U4fcZk-lQ" width=50%>

<img src="https://lh3.googleusercontent.com/keep-bbsk/AGk0z-OFqCj_Owm98P_B5siOphqPM8CdWG02ytrrax0Rswj8qmoWv9emW8Wrd7sZE5nuTa5SqgDf29mCMS27yeiy3pSs4AYI8okXewKxAcM" width=50%>
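To make the two graph metrics concrete, here is a from-scratch sketch in plain Python on a tiny made-up route list. It is not the NetworkX code used in the project: the power-iteration PageRank and the degree centrality below (in-degree plus out-degree, divided by the number of other nodes) are simplified illustrations, and the damping factor of 0.85 is an assumed default.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over directed (origin, destination) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: set() for n in nodes}
    for src, dst in edges:
        out_links[src].add(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing routes spreads its rank evenly over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def degree_centrality(edges):
    """(In-degree + out-degree) of each node divided by the number of other nodes."""
    nodes = {n for edge in edges for n in edge}
    degree = {n: 0 for n in nodes}
    for src, dst in set(edges):
        degree[src] += 1
        degree[dst] += 1
    return {n: d / (len(nodes) - 1) for n, d in degree.items()}


# Made-up routes between three airports.
routes = [("ATL", "ORD"), ("ORD", "ATL"), ("ATL", "DFW"), ("DFW", "ATL"), ("ORD", "DFW")]
ranks = pagerank(routes)
centrality = degree_centrality(routes)
print(ranks)
print(centrality)
```

In this toy graph ATL receives the highest PageRank because both other airports route into it, and DFW sends all of its outgoing rank there.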

#### Normalized Variables
One key problem with this dataset is that many of the numerical variables have values that are in differing ranges. For example, Air Temperature is often 2 orders of magnitude smaller than Ceiling Height. We chose to normalize the variables to counter these imbalances. For each variable, we subtracted the mean of the column from each record and divided by the standard deviation. This gave us a Z-Score that we could easily compare across variables. 
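As a small illustration of the Z-score step, the sketch below standardizes two made-up columns on very different scales. It uses the population standard deviation from Python's `statistics` module, which is an assumption for the example; a scaler in a modeling library may use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Subtract the column mean and divide by the column standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no signal
    return [(v - mu) / sigma for v in values]


# Made-up values: air temperature and ceiling height differ by two orders of magnitude.
air_temperature = [10.0, 20.0, 30.0]
ceiling_height = [1000.0, 2000.0, 3000.0]
z_temp = z_scores(air_temperature)
z_ceiling = z_scores(ceiling_height)
print(z_temp)
print(z_ceiling)
```

After standardization both columns take the same values, so neither dominates a model simply because of its units.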

#### New Features

In addition to the features provided, we added two new features to capture the potential for previous delays to cause future delays:
- DEP_DEL15_PREV_FIN - Was the most recent flight of the aircraft (by tail number) assigned to the current flight delayed by at least 15 minutes?
- NUM_PREV_DELAYS_ORIGIN - How many previous flights at the current flight's origin airport were delayed by at least 15 minutes?

Both of these features can only consider flights more than 135 minutes in the past, since we are trying to predict the current flight's delay two hours ahead.
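The 135-minute cutoff is what keeps these features free of leakage. A minimal sketch of the previous-flight feature follows, assuming hypothetical record fields (`tail_num`, `sched_dep`, `dep_del15`) and comparing scheduled departure times. The project notebook may define "in the past" differently, for example using actual departure times.

```python
from datetime import datetime, timedelta

CUTOFF = timedelta(minutes=135)


def add_prev_flight_delay(flights):
    """Attach the delay flag of the same aircraft's latest flight older than the cutoff."""
    history_by_tail = {}
    for flight in sorted(flights, key=lambda f: f["sched_dep"]):
        history = history_by_tail.setdefault(flight["tail_num"], [])
        latest_visible = None
        for prev in history:  # history is kept in chronological order
            if flight["sched_dep"] - prev["sched_dep"] > CUTOFF:
                latest_visible = prev
        flight["dep_del15_prev_fin"] = (
            latest_visible["dep_del15"] if latest_visible is not None else None
        )
        history.append(flight)
    return flights


# Made-up flights for a single aircraft.
flights = [
    {"tail_num": "N1", "sched_dep": datetime(2019, 1, 15, 8, 0), "dep_del15": 1},
    {"tail_num": "N1", "sched_dep": datetime(2019, 1, 15, 9, 30), "dep_del15": 0},
    {"tail_num": "N1", "sched_dep": datetime(2019, 1, 15, 11, 0), "dep_del15": 0},
]
add_prev_flight_delay(flights)
print([f["dep_del15_prev_fin"] for f in flights])
```

For the 11:00 flight, the 09:30 flight is only 90 minutes earlier and is therefore ignored; the feature falls back to the 08:00 flight, which was delayed.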


#### Other Feature Engineering
Below are some of the other tasks we completed during feature engineering:
- **Split up Weather Columns:** Many of the columns in the weather dataset had multiple comma-separated values in each row. One of our first tasks was to split these variables into their own columns. 
- **Quality Codes:** The weather columns also come with quality codes and nullable data. For example, records of 999 in wind direction are really nulls, and quality scores of 2, 3, 7, or 9 all indicate suspect data. In order to have an accurate dataset, we needed to remove these values.
- **UTC Time Conversion:** In the airlines dataset, we also converted the local times into UTC so that we could merge with the weather dataset.
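The column splitting and quality-code filtering described above can be sketched for a single wind field. The field layout assumed here (direction, direction quality, type, speed, speed quality) and the function name are assumptions for illustration; the missing-value code 999 and the suspect quality codes 2, 3, 7, and 9 are the ones named in the list.

```python
SUSPECT_QUALITY = {"2", "3", "7", "9"}
MISSING_DIRECTION = "999"


def parse_wind(raw):
    """Split one comma-separated wind field and null out missing or suspect values."""
    # Assumed layout: direction, direction quality, type code, speed, speed quality.
    direction, direction_quality, _type_code, speed, speed_quality = raw.split(",")
    if direction == MISSING_DIRECTION or direction_quality in SUSPECT_QUALITY:
        wind_direction = None
    else:
        wind_direction = int(direction)
    wind_speed = None if speed_quality in SUSPECT_QUALITY else int(speed)
    return {"wind_direction": wind_direction, "wind_speed": wind_speed}


# Made-up raw values.
clean = parse_wind("260,1,N,0036,1")
missing_direction = parse_wind("999,9,C,0000,1")
suspect_direction = parse_wind("180,3,N,0051,1")
print(clean, missing_direction, suspect_direction)
```

Returning `None` rather than dropping the whole row keeps the other fields of the record usable.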


#### Final Feature Selection
Our final set of features is:
- Response variable: DEP_DEL15 (binary)
- Explanatory variables: 
  - Airline:
    - Carrier, departure airport, arrival airport, distance, scheduled departure time, scheduled arrival time, time of day, day of week
  - Weather (for origin and destination, normalized):
    - Elevation, wind direction, wind speed, ceiling height, visibility distance, air temperature, dew point temperature, sea level pressure, liquid precipitation
  - New Features:
    - ‘DEP_DEL15_PREV_FIN’, ‘NUM_PREV_DELAYS_ORIGIN’ 
    -  PageRank, Degree Centrality (Origin and Destination)

## Stage 4 - Algorithm Theory and Implementation

### Algorithm Theory


**Handling Imbalanced Data**

Through our EDA of the airline dataset, we noticed an imbalance: only 18% of flights are delayed. Our initial baseline model on the three-month dataset also produced low recall and F1 scores, caused by many false negatives (the model predicts that a flight is not delayed when it actually is). To account for the imbalance, we tried three options for the baseline model:

- a class weight option, where a weight column applies a balancing ratio
- adjusting the class mix of the training sample (increasing the sample size of the minority class and decreasing the sample size of the majority class) relative to the actual mix of delays to no delays in the real data
- a SMOTE model to correct for the imbalance

After exploring these three approaches, we ultimately decided to oversample the minority class and undersample the majority class. The parameters for the oversampling and undersampling were optimized by trial and error. We also tried class weights but ultimately did not use them because they underperformed the over- and under-sampling approach. We also tested SMOTE on our three-month dataset, but it significantly increased the time to run the model. Due to time and resource constraints, we did not apply SMOTE to our final complete dataset.
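A minimal sketch of combined over- and under-sampling is shown below. The sampling ratios (`minority_factor`, `majority_fraction`) are hypothetical placeholders, not the values we settled on by trial and error, and the project applied resampling in Spark rather than in plain Python.

```python
import random


def resample(rows, label_key="DEP_DEL15", minority_factor=2.0, majority_fraction=0.5, seed=0):
    """Oversample delayed flights with replacement and undersample on-time flights."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    n_majority = int(len(majority) * majority_fraction)
    n_minority = int(len(minority) * minority_factor)
    sampled_majority = rng.sample(majority, n_majority)  # without replacement
    sampled_minority = rng.choices(minority, k=n_minority)  # with replacement
    resampled = sampled_majority + sampled_minority
    rng.shuffle(resampled)
    return resampled


# Made-up training rows with roughly the imbalance seen in the airline data.
rows = [{"DEP_DEL15": 1} for _ in range(180)] + [{"DEP_DEL15": 0} for _ in range(820)]
balanced = resample(rows)
delayed_share = sum(r["DEP_DEL15"] for r in balanced) / len(balanced)
print(len(balanced), round(delayed_share, 3))
```

Resampling like this should be applied to the training split only, so that validation and test metrics still reflect the real class mix.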

**Cross Validation**

With time-series data, performing cross-validation is more complicated because the model cannot use data from the future to predict data in the past. In order to properly create cross-validation splits for time series data, the splits must continually grow in a temporal fashion where the test set contains data in the future of the training and validation sets. For k-fold cross-validation, each fold builds upon the previous fold. The test set of the last fold gets appended to the training and validation set of the previous fold, and a similar-sized chunk of data in the future constitutes the test set for the next fold. See the diagram below for a useful illustration:

<img src="https://lh4.googleusercontent.com/QtTJtB61Mt2mzlGJUZ0khwaDjooik4Sz9LzepupBFmmxxxSVdnm8UMQX-z2WkDQ0BvuV-weRdn6LaRzUqrFXFIvW5vzvYvQpRBKfvInbQqspXkAC1JyHouDqHISQaofFjuyLwVkZUKM" width=30%>

We utilize the above cross-validation method for our time series, employing the training set to train our model, the validation set to optimize hyperparameter choices, and the test set to quantify model error.
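The expanding-window scheme can be expressed as a small generator over ordered time periods (for example, months or quarters). The function name and the choice of one validation period and one test period per fold are assumptions for illustration, not the exact split sizes used in the CV notebook.

```python
def expanding_window_splits(n_periods, n_folds, val_size=1, test_size=1):
    """Yield (train, validation, test) period indices; each fold extends the previous one."""
    initial_train = n_periods - n_folds * test_size - val_size
    if initial_train < 1:
        raise ValueError("not enough periods for the requested number of folds")
    for fold in range(n_folds):
        train_end = initial_train + fold * test_size
        val_end = train_end + val_size
        test_end = val_end + test_size
        yield (
            list(range(0, train_end)),
            list(range(train_end, val_end)),
            list(range(val_end, test_end)),
        )


# Six ordered periods split into three folds.
splits = list(expanding_window_splits(n_periods=6, n_folds=3))
for train, val, test in splits:
    print("train", train, "val", val, "test", test)
```

In every fold the test period lies strictly after the training and validation periods, and the test period of one fold is absorbed into the earlier data of the next fold.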

**Logistic Regression**

For our baseline model, we decided to use logistic regression, which predicts the binary variable of a delay greater than 15 minutes (DEP_DEL15). This baseline model used nine weather features (elevation, wind direction, wind speed, ceiling height, visibility distance, air temperature, dew point temperature, sea level pressure, liquid precipitation), eight features from the flight dataset (carrier, departure airport, arrival airport, distance, scheduled departure time, scheduled arrival time, time of day, day of week), and two temporal delay features (the number of previous delays at the origin airport and whether the aircraft's previous flight was delayed).

In general, logistic regression models the log-odds that a binary variable equals 1 as a linear function of the predictors. With five predictors, for example:
$$
l(y=1) = \log\frac{p}{1-p} =\beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5
$$
where \\(p = P(y=1)\\). Based on this, we can use the coefficients \\(\beta_i\\) to understand the change in the log-odds of a delay associated with each unit increase in one of our predictor variables.
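To connect the log-odds formula to a predicted probability, the sketch below inverts it with the logistic (sigmoid) function and shows how a coefficient translates into an odds ratio. The intercept, coefficients, and feature values are made up for illustration and are not fitted values from our model.

```python
from math import exp


def predict_delay_probability(intercept, coefficients, features):
    """Compute the log-odds as a linear combination, then map it to a probability."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + exp(-log_odds))


def odds_ratio(coefficient):
    """Multiplicative change in the odds of a delay per unit increase in the feature."""
    return exp(coefficient)


# Made-up coefficients: log-odds = -1.5 + 0.8 * 1 + 0.1 * 2 = -0.5
probability = predict_delay_probability(-1.5, [0.8, 0.1], [1.0, 2.0])
print(round(probability, 4), round(odds_ratio(0.8), 4))
```

A coefficient of 0.8 on the first feature means that a one-unit increase multiplies the odds of a delay by about 2.2, holding the other features fixed.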


**Random Forest**

Next, we tried the Random Forest model, which consists of many individual decision trees that operate as an ensemble.\\(^{[7]}\\) It employs bagging and randomized feature selection to build a forest of trees. Each tree in the random forest outputs a class prediction, and the class with the most votes becomes our model’s prediction. Random Forest typically outperforms a single Decision Tree because each tree is trained on a different random sample. An individual tree might have high variance with respect to its particular set of training data, but the forest as a whole has lower variance without a corresponding increase in bias. Because of this, Random Forest is less likely to overfit than a single Decision Tree. The main limitation of Random Forest, which we also encountered in our case, is that many trees can make the algorithm too slow to run: a more accurate prediction requires more trees, which results in a slower model. In addition, we did not see a significant improvement in performance from our Random Forest model compared to our baseline logistic model. This might be because our data were highly consistent and correlated, so the randomization and majority vote did not add much signal to the overall classification. The weather dataset contributes little signal to the model, and the airline features were highly correlated.


**Gradient Boosted Trees**

Similar to the Random Forest algorithm, the Gradient Boosted Trees (GBT) algorithm \\(^{[1]}\\) is another ensembling method that combines outputs from individual trees. It differs from Random Forest in that it uses boosting: weak learner trees are combined sequentially so that each new tree corrects the errors of the previous ones. We evaluate how well a tree performs using a loss function. There are many loss functions we could choose; one example is Cross-Entropy \\(Loss(p,q)=-\sum{p(x)\log q(x)}\\), where \\(p\\) is the true label and \\(q\\) is the prediction. The loss is high when the label and prediction do not agree, and the loss is 0 when they match. Each new tree tries to move the ensemble in the direction of lower loss. The model can be mathematically represented as follows:
$$
Boosted\ Ensemble = First\ Tree + \alpha * Second\ Tree,\ where\ \alpha\ is\ the\ learning\ rate
$$
$$
Loss(Boosted\ Ensemble) < Loss(First\ Tree)
$$

**Toy Example**

<img src="https://lh6.googleusercontent.com/cUJIwXgOT286w2Stc8WtQ62-c0_5xjXJKrqZYZ_K0oXwHxf1szI7_D4lWvjxfmhXiU_k5wyy8mkS1CrRfpj6CT1UKbI8QP4RaJe_IE-yICsFCvr2WOWwWyDqkqMuZJxtgwiJUbsEqPs" width=70%>
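The same idea can be traced numerically with a single boosting step. In this sketch the "first tree" is simplified to a constant prediction (the log-odds of the base delay rate), the "second tree" is a one-split stump fitted to the residuals, and the loss is cross-entropy. The data, the split threshold, and the learning rate of 0.5 are all made up; this is an illustration of the boosting update, not the Spark GBT implementation we used.

```python
from math import exp, log


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def log_loss(labels, log_odds):
    """Mean cross-entropy between binary labels and predictions given as log-odds."""
    total = 0.0
    for y, f in zip(labels, log_odds):
        p = sigmoid(f)
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(labels)


def fit_stump(xs, residuals, threshold):
    """One-split tree: predict the mean residual on each side of the threshold."""
    left = [r for x, r in zip(xs, residuals) if x <= threshold]
    right = [r for x, r in zip(xs, residuals) if x > threshold]
    left_value = sum(left) / len(left) if left else 0.0
    right_value = sum(right) / len(right) if right else 0.0
    return lambda x: left_value if x <= threshold else right_value


# Made-up data: x could be the number of earlier delays at the origin, y = 1 if delayed.
xs = [0, 1, 2, 5, 6, 7]
ys = [0, 0, 0, 1, 1, 0]

# First stage: a constant prediction equal to the log-odds of the base delay rate.
base_rate = sum(ys) / len(ys)
first_tree = log(base_rate / (1 - base_rate))
stage0 = [first_tree for _ in xs]

# Second stage: fit a stump to the residuals (label minus predicted probability).
residuals = [y - sigmoid(f) for y, f in zip(ys, stage0)]
second_tree = fit_stump(xs, residuals, threshold=3)

alpha = 0.5  # learning rate
stage1 = [f + alpha * second_tree(x) for f, x in zip(stage0, xs)]

loss_before = log_loss(ys, stage0)
loss_after = log_loss(ys, stage1)
print(round(loss_before, 4), round(loss_after, 4))
```

The loss of the two-stage ensemble is lower than the loss of the first stage alone, which is the inequality stated in the second equation above.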

### Implementation

Please go to the following notebooks in the team repo.

#### Cross Validation

- CV Implementation Notebook: https://github.com/ShuYingAmberChen/Machine-learning-projects/tree/main/Flight_delay_prediction/Model%20Code

#### A flow diagram of our model pipeline

<img src = "https://lh3.googleusercontent.com/fFa9Gzy8V0Xq6Qse688KDgf65j6v9EmJDkAtMyTaCwDposSLwtagVklYW8N9LfXEPEYYH4OLJUY3LJEfkB5PXwbnV8AQKaCDPHUYAZBmiKOiMK48BZSGb72IUBUUj-E0bSoApvCjVGA" width =80%>

#### Logistic Regression

- Baseline Logistic Regression - Final notebook: https://github.com/ShuYingAmberChen/Machine-learning-projects/tree/main/Flight_delay_prediction/Model%20Code

Evaluation results:

- The accuracy is: 0.8364
- The recall is: 0.3994
- The precision is: 0.6109
- The f1_score is: 0.483

<img src = "https://lh3.googleusercontent.com/keep-bbsk/AGk0z-POLq6DRcuk0QqnuC68EwI5AnhySOJrgJEk-GAgsEkCHZ5pWIWbZplfKZVW6GOV6xZMYQB0-zHUvWYI5RiK2mVvE7wLPxq7usBcyFg" width =70%>


**Interpretation:**

Accuracy: The logistic model correctly classified about 84% of the flights in 2019.

Precision: When the classifier predicts that a flight will be delayed, there is a 61% chance that the flight is actually delayed.

Recall: When a flight is actually delayed, there is only a 40% chance that the classifier predicts it correctly.

#### Random Forest

- Random Forest - Final notebook: https://github.com/ShuYingAmberChen/Machine-learning-projects/tree/main/Flight_delay_prediction/Model%20Code


Evaluation results:

- The accuracy is: 0.8136
- The recall is: 0.4898
- The precision is: 0.5136
- The f1_score is: 0.5014

<img src = "https://lh4.googleusercontent.com/2lwX4Qqohym9qXGUpQ7chn9hYv4826O9yGskbpojIKKm1dadOqIW-OYsfnwk6_6yN2HkXg0G97JjYQ95sftJ7EzgBf7ywArrvLpugfFNqxVVLdCytYsS74i89mXh1HyurZXlLeHFZN0" width =70%>

**Interpretation:**

Accuracy: The RF model correctly classified 81% of the flights in 2019.

Precision: When the classifier predicts that a flight will be delayed, there is a 51% chance that the flight is actually delayed.

Recall: When a flight is actually delayed, there is only a 49% chance that the classifier predicts it correctly.
 


#### Gradient Boosted Tree

GBT Model - Final notebook: https://github.com/ShuYingAmberChen/Machine-learning-projects/tree/main/Flight_delay_prediction/Model%20Code

Evaluation results:

<img src = "https://lh4.googleusercontent.com/yPBeCMN7ASdqhvN6p5oCM0b5ulCWpQyoCK1xLVSBkfKEnqwfiDnGkOAd8cpUPuFmVf8nexKGUVwFxmuJm1yr6T7xjbpj0Icgq1Yqqo_F5ZLEytWIIaRx9SrnftJEieidkBMzon696-M" width=50%>

<img src = "https://lh5.googleusercontent.com/1moeRcOOli_QqbMdvhk2AeDGjfbek_jXmPXHcZ0MfNzwW31wdS4BZCI6FJAgeF0VTp4GxSXPYmfTEu4sLAcykTvy9Rz3sRj9vLNtzWJ331ohDTdSJHZNVMLuqrQI0iIwmFhXvNexbG4" width=40%>
<img src = "https://lh3.googleusercontent.com/1gcdyttpNoTjrUiKp7aJXR93EIXmnVK8nextp0Odke_d8oVvqhvPWhRZ8iiRB3X6QwH1mhkg8ceXfZNPxdogYQ_V4hfh5lIYfkznFYllz31r1cvOAD9hwhUBpIZKEVuCIzR3jEBS4KU" width=40%>

**Interpretation**

Accuracy: The GBT classifier correctly classified 85% of the flights in 2019.

Precision: When the classifier predicts that a flight will be delayed, there is a 71% chance that the flight is actually delayed.

Recall: When a flight is actually delayed, there is only a 42% chance that the classifier predicts it correctly.

The confusion matrix indicates that our model focuses on reducing false positives, which make up 3.33% of all test predictions (47606/1428214), compared with 11.76% for false negatives (167907/1428214). As mentioned in the GBT notebook, the evaluation scores of the training and test datasets indicate that the GBT algorithm does not suffer from overfitting.

Hence, the GBT model not only achieved our primary objective of 80% accuracy, but also kept the share of false positives low.

Feature Importance:

<img src = "https://lh3.googleusercontent.com/lxcwdlWHmuK-N-dnHSnWxiDQy7sE2xAFdzKryo6bod-KUiSL0gLjq3gUm-xb-F-tJM21XO-izQUUwbnla7jV2sLGIlSnqgsEL9Q2H_XvlYhu_wk83w9R9yxpWJL9jneeap6X3-9qcP0" width=70%>

The top 20 most important features from the GBT model are mostly origin and destination airports and temporal factors.

## Stage 5 - Conclusions

Based on our choice of evaluation metrics, we selected Gradient Boosted Trees as our final model. It has the highest AUC ROC score amongst the three models, as shown below. For our business problem, the second metric we focus on is precision. Comparing the precision scores, we see that the gradient boosted tree is still the winner. We care more about false positives than false negatives since our target audience is travelers. In the 2019 test dataset, false positives made up only 3.33% of the GBT model's predictions, compared with 11.76% for false negatives. This low share of false positives indicates that our model makes fairly good predictions at a low cost of false positives.

<img src="https://lh6.googleusercontent.com/3KWLWSLGqpD3mRMQO0sbcNAr4CrQZ4IIPzO9oMnZEWmn_Hebaw8_P0Fu0hpBLy-8vQzWW3rySd-xyeyl16dw4SQEz9Kia3_YbbHTCSjwx8_Ty1GXOMaPbASCqoCilRt030H0Vd_Rfgk" width=80%>

Furthermore, as mentioned in the GBT Model - Final notebook, we observed no overfitting in our models. We attribute this to the ensembling methods and to time-series cross-validation, which helps avoid biased error estimates.

**Future Improvements**

We identified three areas of improvement: hyperparameter tuning, algorithm choice, and feature engineering. Given the time and resource constraints, we had to limit hyperparameter tuning to a few variations. However, we believe the Random Forest and the Gradient Boosted Trees models can be further improved by exploring greater tree depth, more iterations, and different step sizes. Secondly, we would like to explore other algorithms such as neural networks. Finally, we would like to improve feature engineering and derive features like extreme weather indicators and holidays that may be more important in model prediction.

## Course Concepts

**1. RDD vs Spark Dataframe vs Pandas Dataframe**

Unlike Spark Dataframes, RDDs do not automatically define a dataset's schema and are slower at simple operations like grouping the data. Therefore, we mostly used Dataframes to improve performance and eliminate manual schema setup.
Compared to Pandas Dataframes, Spark Dataframes benefit from lazy evaluation. Lazy evaluation reduces the number of passes over the data by grouping operations, which saves computation and increases speed. We used Spark Dataframes as much as possible since our dataset is so large. We converted Spark Dataframes to Pandas Dataframes only when we needed to do complex calculations or visualization.

**2. Modeling Pipeline**

Setting up an MLFlow Pipeline allows us to apply many feature transformations on different datasets (e.g., train, validation, test) with minimal repeated code. Some of the transformations we used include one-hot encoding categorical data, normalizing numerical data using a standard scaler, and vectorizing our datasets so that our models could use them.

**3. Model Scalability**

We found that it took much less time to run Logistic Regression during our modeling stage compared to the Random Forest model. Random Forest models are much more computationally expensive, especially when adding more features and during the hyperparameter search and cross-validation stages. Our Random Forest models frequently crashed due to memory and network shuffle constraints. We ultimately needed to reduce the number of trees and max bins for the Random Forest model, which brought down the accuracy score. In contrast, additional features increased the training times of the Logistic Regression models, but we did not encounter memory or network shuffle issues.

**4. PageRank**

During the Feature Engineering portion of the project, we calculated PageRank to identify how important each node in the graph is. We then used these calculations in our ML models. The nature of the flight data lent itself to graph analysis features quite well, with airports being nodes and flights between them being edges.

## Citations
[1] A Visual Guide to Gradient Boosted Trees (XGBoost): https://towardsdatascience.com/a-visual-guide-to-gradient-boosted-trees-8d9ed578b33

[2] OpenFlights: https://openflights.org/data.html

[3] Cost of Delay Estimates: https://www.faa.gov/data_research/aviation_data_statistics/media/cost_delay_estimates.pdf

[4] Predictive Modelling: Flight Delays and Associated Factors, Hartsfield–Jackson Atlanta International Airport: https://www.sciencedirect.com/science/article/pii/S1877050918317319

[5] How FlightCaster Squeezes Predictions from Flight Data: http://www.datawrangling.org/how-flightcaster-squeezes-predictions-from-flight-data/

[6] 10 Common Reasons for Flight Delays: https://skyrefund.com/en/blog/ten-reasons-for-flight-delays

[7] Random Forest: https://www.ibm.com/cloud/learn/random-forest