# Modeling the Effect of Weather on Flight Delays at SeaTac
### Authors: Beichen Liang, Max Zhou, Vanely Ruiz, and Will Bowers

## Problem Overview

For our final project, we ask what effect, if any, the weather has on outgoing flights. More specifically, we are interested in the weather in Seattle and its effect on flights leaving SeaTac. We believe this is an important topic because thousands of people fly through Seattle every day and, as residents know, Seattle's weather can be both variable and rainy. We hope this information is useful both to Seattle travelers, who can plan their flights better, and to SeaTac, which can analyze and improve how its systems respond to weather delays.

Flight delays can have costly consequences. In the U.S. alone, flight delays are estimated to have a 40.7 billion dollar impact. Additionally, the time planes spend on the tarmac results in excess fuel use and more emissions (Fleurquin). Flight delays also significantly disrupt aviation safety, and the resulting decrease in traffic leads to losses for the airlines (Gao, 68). Passengers who experience delays with a certain carrier also come to prefer other airlines (Tae-Hwee Lee, 277). Causes of flight delays range from the unpreventable, such as severe weather (Gao, 68), to the preventable, such as crew mishaps and flight order rotation (Fleurquin).

The specific question we hope to answer with this data is: __given a particular day at SeaTac International Airport, can we predict the average departure delay for all flights based upon the weather?__ For the purposes of statistical modeling, which we'll cover later, our null hypothesis is: _there is no relationship between the average daily departure delay at SeaTac and the weather at SeaTac_. Conversely, our alternative hypothesis is: _there is a relationship between the average daily departure delay at SeaTac and the weather at SeaTac_. By exploring our data with statistical models, we will be able to either reject or fail to reject our null hypothesis.

![title](img/plane_departing_seatac.jpg)
<i><center>Plane Departing SeaTac</center></i>
<i><center>Image courtesy of https://news.theregistryps.com/with-growing-region-seatac-prepares-for-expansion/</center></i>

## Data Preparation

For this project we used two data sets: __weather data__ gathered by the National Centers for Environmental Information (NCEI), part of NOAA, and __flight data__ gathered by the Bureau of Transportation Statistics.


### Weather Data

We used Local Climatological Data gathered by NOAA's National Centers for Environmental Information (NCEI) as the weather data source for our project. This dataset includes hourly observations made at SeaTac from Dec. 1, 2017 to Nov. 30, 2018, matching the period covered by our flight data. [The source for our data can be found here](https://www.ncdc.noaa.gov/cdo-web/datatools/lcd).

The raw weather data presented several challenges to our analysis. 

Many columns were dominated by null values, though such columns tend to be minor measurements that the weather station rarely takes. We ended up selecting 15 weather metrics from the dataset that we think provide a full picture of the weather conditions that might affect flight delays.

Among the observations we selected, some columns are in formats other than the floats that our machine learning process can work with.

Specifically, some numeric columns occasionally contain special characters that denote special conditions. For example, the hourly precipitation columns occasionally use the letter "T" to indicate a trace amount of rain that can't be accurately measured. Because such values are rare and can practically be regarded as 0 rainfall, we decided to convert any non-numeric values in those columns to null, to be imputed later using forward-filling.
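
The coercion step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper name `to_float_or_none` and the sample readings are hypothetical, and `None` stands in for a null value.

```python
def to_float_or_none(raw):
    """Return raw as a float, or None when it is not a plain number."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        # Covers markers such as "T" (trace precipitation) and missing entries.
        return None


# Hypothetical hourly precipitation readings, stored as raw strings.
readings = ["0.00", "T", "0.12", None, "0.05"]
cleaned = [to_float_or_none(r) for r in readings]
print(cleaned)  # [0.0, None, 0.12, None, 0.05]
```

If the data is held in a pandas DataFrame (an assumption, since the tooling is not stated above), `pandas.to_numeric(column, errors="coerce")` performs the same conversion and yields NaN for entries it cannot parse.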

Some columns, like "hourly pressure change", use null values to indicate the absence of any change. We deduced this from the fact that the values in those columns are either non-zero numbers or NaN. We filled those NaN values with 0.

Some columns needed to be transformed into new features. The "hourly sky conditions" column contains strings such as "FEW:02 38 BKN:07 190" that describe sky conditions. According to the dataset documentation, the two-digit numbers, like "02" and "07", indicate the thickness of the clouds. We simplified this feature into the dummy columns "cloud_0" for no or light cloud and "cloud_1" for heavily clouded conditions.
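
A minimal sketch of how the sky-condition strings could be reduced to the two dummy columns. The regular expression, the helper name `cloud_dummies`, and especially the cutoff of 7 are assumptions made for illustration; the threshold actually used to separate light from heavy cloud is not stated here.

```python
import re

# Matches tokens such as "FEW:02" or "BKN:07" and captures the two-digit code.
LAYER_PATTERN = re.compile(r"[A-Z]+:(\d{2})")

# Assumed cutoff, not taken from the project: a layer code at or above this
# value marks the hour as heavily clouded.
HEAVY_CLOUD_THRESHOLD = 7


def cloud_dummies(sky_conditions):
    """Map a sky-condition string to the cloud_0 / cloud_1 dummy values."""
    codes = [int(code) for code in LAYER_PATTERN.findall(sky_conditions)]
    # A string with no layer codes is treated as no cloud (also an assumption).
    heavy = bool(codes) and max(codes) >= HEAVY_CLOUD_THRESHOLD
    return {"cloud_0": int(not heavy), "cloud_1": int(heavy)}


print(cloud_dummies("FEW:02 38 BKN:07 190"))  # {'cloud_0': 0, 'cloud_1': 1}
print(cloud_dummies("FEW:02 38"))             # {'cloud_0': 1, 'cloud_1': 0}
```

Taking the maximum code across layers means a single dense layer is enough to flag the hour as heavily clouded; a different rule, such as using only the lowest layer, would be an equally plausible design.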

The "hourly wind direction" column expresses wind direction in degrees, from 0 to 360. We think wind direction could be an important factor for flights, but we only need a categorical variable that indicates an approximate direction. So we converted the degrees to north, east, south, or west and created a dummy column for each.
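
The conversion from degrees to a cardinal direction could look like the sketch below. It assumes 90-degree sectors centered on each direction (north covers 315 degrees up to, but not including, 45 degrees); the exact boundaries used in the project are not stated, and the column names `wind_north`, `wind_east`, `wind_south`, and `wind_west` are hypothetical.

```python
DIRECTIONS = ("north", "east", "south", "west")


def cardinal_direction(degrees):
    """Bin a compass bearing into one of four 90-degree sectors."""
    # Shifting by 45 degrees centers each sector on its cardinal direction.
    sector = int(((degrees % 360) + 45) // 90) % 4
    return DIRECTIONS[sector]


def wind_dummies(degrees):
    """Return one 0/1 indicator per cardinal direction."""
    direction = cardinal_direction(degrees)
    return {f"wind_{name}": int(name == direction) for name in DIRECTIONS}


print(cardinal_direction(350))  # north
print(wind_dummies(200))
# {'wind_north': 0, 'wind_east': 0, 'wind_south': 1, 'wind_west': 0}
```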

As mentioned previously, we used forward-filling to handle any null values remaining in the dataset. Forward-filling fills a null value with the previous entry for that feature. It is useful when the data is ordered in time, because it assumes that the weather in one hour is highly likely to be the same as in the previous hour.
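
Forward-filling itself is simple enough to show in a few lines. The sketch below is illustrative only: the function name `forward_fill` and the sample temperatures are hypothetical, and `None` stands in for a null value.

```python
def forward_fill(values):
    """Replace each None with the most recent non-null value before it."""
    filled = []
    last_seen = None  # stays None until the first real observation appears
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Hypothetical hourly temperature readings with gaps.
hourly_temperature = [51.0, None, None, 53.0, None]
print(forward_fill(hourly_temperature))  # [51.0, 51.0, 51.0, 53.0, 53.0]
```

Note that a null at the very start of the series has no previous entry to copy, so it stays null. For data held in pandas (again an assumption about the tooling), `Series.ffill()` and `DataFrame.ffill()` apply the same rule.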

### Flight Data



## Exploratory Data Analysis



## Statistical Modeling

## Interpretation

## Works Cited
