# Motivation and Data Acquisition Process

This notebook describes the motivation for our project, related work, and the data acquisition process. We put this in a separate notebook because we downloaded and scraped data from several different sources.

* [1 Motivation](#1-Motivation)
    * [1.1 Why is My Flight Delayed?](#1.1-Why-is-My-Flight-Delayed?)
* [2 Data Acquisition Process](#2-Data-Acquisition-Process)
    * [2.1 Flight Delay Central Database](#2.1-Flight-Delay-Central-Database)
    * [2.2 Aircraft Information Data](#2.2-Aircraft-Information-Data)
    * [2.3 Airport Information](#2.3-Airport-Information)
    * [2.4 Weather Data](#2.4-Weather-Data)

## 1 Motivation

### 1.1 Why is My Flight Delayed?

Flight delays are a major problem. Almost everybody has experienced a delayed or cancelled flight and knows how annoying it is to wait at airports and perhaps even miss important meetings at the destination. Delays are a problem not only for individual customers but also for the airlines and the US economy in general: in 2010, researchers at the University of California, Berkeley found that flight delays lead to total costs of more than $32.9 billion.

However, customers usually do not know the reason for their delayed arrival. We therefore want to understand what causes the delays and whether we can estimate the expected delay for a given future flight.

A few of the main questions we plan to investigate in our project are:

* Is there a difference between airports? Which airports are most heavily affected by delays?
* Is there a similar difference between airlines? Which airlines usually arrive on time, and which airline is the worst?
* Can we detect any seasonality in the data? Are there more delays in the winter (e.g. because of bad weather) or in the summer (e.g. because of summer holidays)? At what time of day should one fly to avoid delays?
* Sometimes flights can't leave the airport because of last-minute repairs or other problems caused by the carrier. Does the age of the aircraft have any influence on these carrier delays?
* Is it possible to build a model that estimates the delay for a given flight?

## 2 Data Acquisition Process

The following sections outline our data acquisition process and describe the most important characteristics and features of the datasets.
<img src="images/DataSources.png" align="left" width="500" height="500">
    
We used four different data sources:

* The official flight database for every domestic flight in the US 
* Historical weather data
* Airport information with geodata and names (e.g. for visualization and interpretation of results) 
* Information about aircraft models

### 2.1 Flight Delay Central Database

Our main source of data was the Bureau of Transportation Statistics (BTS), which is a statistical agency of the US Department of Transportation (<http://www.transtats.bts.gov/>). Luckily, the BTS publishes detailed data for every domestic flight in the US (<http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time>). However, the data cannot be downloaded for an arbitrary period of time: each download covers only one month of a given year. To get the data automatically, we developed a scraper tool in Python that performs the requests and downloads the files. The scraper can be found in the `src` folder. Scraping the data for ~25 years takes around 2-3 hours, as requests are processed slowly on the server side. Furthermore, the BTS provides lookup tables for airline codes, which we downloaded manually.
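The month-by-month download loop described above can be sketched as follows. This is only a minimal illustration, not the scraper from the `src` folder: `fetch_month` is a hypothetical callable standing in for the actual request, and `iter_months`, `download_all` and `pause_seconds` are names chosen for this example.

```python
import time
from typing import Callable, Iterator, List, Tuple


def iter_months(first_year: int, last_year: int) -> Iterator[Tuple[int, int]]:
    """Yield every (year, month) pair in the inclusive year range."""
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            yield year, month


def download_all(
    fetch_month: Callable[[int, int], bytes],
    first_year: int,
    last_year: int,
    pause_seconds: float = 1.0,
) -> List[Tuple[int, int, int]]:
    """Call fetch_month once per month and record the size of each download.

    fetch_month is a hypothetical callable that performs the request for one
    month; it is passed in so the loop does not depend on the site's details.
    """
    sizes = []
    for year, month in iter_months(first_year, last_year):
        payload = fetch_month(year, month)
        sizes.append((year, month, len(payload)))
        time.sleep(pause_seconds)  # pause between requests to the slow server
    return sizes
```

Passing the request function in as an argument keeps the loop easy to test with a stub before running it against the real server for several hours.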

Since the uncompressed, comma-separated data for each month is around 250-300 MB, we needed to filter this dataset. A first step is to restrict the features. A description of all available columns can be found at <http://www.transtats.bts.gov/TableInfo.asp?Table_ID=236&DB_Short_Name=On-Time&Info_Only=0>. In our analysis we use a subset of 30 features that we identified as relevant.
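Restricting a monthly file to a subset of columns might look like the sketch below. The column names and the sample rows are assumptions made up for illustration; they are not the 30 features actually selected for the analysis.

```python
import csv
import io
from typing import Iterable, List


def keep_columns(csv_text: str, wanted: Iterable[str]) -> List[dict]:
    """Return the rows of a CSV string reduced to the wanted columns."""
    wanted = list(wanted)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row[name] for name in wanted} for row in reader]


# Tiny made-up sample; the real monthly files contain many more columns.
sample = (
    "FlightDate,TailNum,Origin,ArrDelay,Unused\n"
    "2005-10-03,N123AA,JFK,12,foo\n"
    "2005-10-03,N456BB,BOS,-3,bar\n"
)
rows = keep_columns(sample, ["FlightDate", "TailNum", "Origin", "ArrDelay"])
```

When working with pandas, the `usecols` argument of `pandas.read_csv` serves the same purpose and avoids loading the unwanted columns into memory.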

### 2.2 Aircraft Information Data

Many of us have experienced it before: a flight is delayed because of last-minute repairs or other problems with the aircraft. We are curious whether the manufacturer or the age of the aircraft influences the probability of delays. We therefore need more detailed data about the aircraft used for each flight. The delay dataset described in 2.1 gives us the tail number of the aircraft for every single flight. This tail number is comparable to a car license plate and helps us identify the manufacturer and age of the airplane.

We can get this information from the Federal Aviation Administration (FAA). This institution maintains a database of tail numbers (so-called N-Numbers) for each aircraft in the US and also publishes other datasets with information about the respective aircraft (e.g. manufacturer of turbines, owner, etc.). We downloaded the database from the following website:
<http://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download/>.
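Linking a flight's tail number to the registry in order to derive the aircraft age could be sketched as below. It rests on two assumptions that should be checked against the downloaded files: that registry keys are stored without the leading "N", and that a year of manufacture is available per aircraft. The lookup dictionary and both function names are hypothetical.

```python
from typing import Dict, Optional


def normalize_tail_number(tail_number: str) -> str:
    """Strip whitespace and a leading 'N' so both sources share one key."""
    cleaned = tail_number.strip().upper()
    return cleaned[1:] if cleaned.startswith("N") else cleaned


def aircraft_age(
    tail_number: str,
    flight_year: int,
    year_built_by_n_number: Dict[str, int],
) -> Optional[int]:
    """Return the aircraft age in years, or None if the aircraft is unknown.

    year_built_by_n_number is a hypothetical lookup built from the registry,
    keyed by N-Number without the leading 'N' (an assumption).
    """
    year_built = year_built_by_n_number.get(normalize_tail_number(tail_number))
    if year_built is None:
        return None
    return flight_year - year_built
```

Returning `None` for unmatched tail numbers keeps missing registry entries visible instead of silently producing a wrong age.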

### 2.3 Airport Information

In addition, we need the exact geolocations of the airports in the dataset, for example to create good visualizations using Tableau Public or other mapping tools. Furthermore, airport names help to interpret the data (the original dataset contains only the short IATA abbreviations). This data is easy to find online as CSV files (see <http://openflights.org/data.html>). We stored it in the `data` folder.
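A lookup from IATA code to airport name and coordinates could be built roughly like this. The column positions (name in column 1, IATA code in column 4, latitude and longitude in columns 6 and 7, no header row) are an assumption about the OpenFlights airport file and should be verified against the downloaded copy; the sample rows are illustrative.

```python
import csv
import io
from typing import Dict, NamedTuple


class Airport(NamedTuple):
    name: str
    latitude: float
    longitude: float


def load_airports(csv_text: str) -> Dict[str, Airport]:
    """Map IATA codes to airport name and coordinates.

    Column positions are assumed, not taken from the notebook.
    """
    airports = {}
    for row in csv.reader(io.StringIO(csv_text)):
        iata = row[4]
        if len(iata) != 3:  # skip entries without a usable IATA code
            continue
        airports[iata] = Airport(row[1], float(row[6]), float(row[7]))
    return airports


# Illustrative rows with rounded coordinates; the second has no IATA code.
sample = (
    '1,"John F Kennedy International Airport","New York","United States",'
    '"JFK","KJFK",40.6398,-73.7789,13\n'
    '2,"Some Airfield","Nowhere","United States","\\N","KXXX",10.0,20.0,5\n'
)
airports = load_airports(sample)
```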

### 2.4 Weather Data

The weather seems to be a natural cause of many delays, so we decided to include weather data in our analysis. For this we needed historical weather information for the airports. Unfortunately, we could not find any publicly available source that provides a suitable free dataset. However, Wunderground provides a web interface that can be queried by the IATA / ICAO code of an airport (see e.g. <http://www.wunderground.com/history/airport/EDDF/2005/10/3/DailyHistory.html?req_city=Frankfurt+%2F+Main&req_state=&req_statename=Germany&reqdb.zip=00000&reqdb.magic=5&reqdb.wmo=10637>). Writing a script allowed us to get historical data for individual airports. The scraped data contains the following fields:

<table>
<tr><td><b>events</b></td><td>a list containing strings of weather events, e.g. "Rain", "Fog", "Snow"</td></tr>
<tr><td><b>humidity</b></td><td>humidity measured in percent</td></tr>
<tr><td><b>precipitation</b></td><td>precipitation measured in inches</td></tr>
<tr><td><b>sealevelpressure</b></td><td>pressure at sea level in inches of mercury</td></tr>
<tr><td><b>snowdepth</b></td><td>snow depth in inches</td></tr>
<tr><td><b>snowfall</b></td><td>snowfall in inches</td></tr>
<tr><td><b>temperature</b></td><td>temperature in degrees Fahrenheit</td></tr>
<tr><td><b>visibility</b></td><td>visibility in miles</td></tr>
<tr><td><b>windspeed</b></td><td>wind speed in miles per hour</td></tr>
</table>
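The example URL quoted in this section follows a regular pattern of airport code, year, month and day, so the per-day queries can be generated as sketched below. The function names are chosen for this illustration, the optional query string of the example URL is left out, and whether the site still serves this path has not been verified here.

```python
from datetime import date, timedelta
from typing import Iterator

BASE_URL = "http://www.wunderground.com/history/airport"


def daily_history_url(airport_code: str, day: date) -> str:
    """Build the daily history URL for one airport and one day."""
    return (
        f"{BASE_URL}/{airport_code}/"
        f"{day.year}/{day.month}/{day.day}/DailyHistory.html"
    )


def iter_days(first: date, last: date) -> Iterator[date]:
    """Yield every day from first to last, both inclusive."""
    current = first
    while current <= last:
        yield current
        current += timedelta(days=1)


# URLs for the first three days of October 2005 at Frankfurt (EDDF).
urls = [
    daily_history_url("EDDF", day)
    for day in iter_days(date(2005, 10, 1), date(2005, 10, 3))
]
```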
 
One drawback of this method is similar to that of getting the data from the BTS: the server processes requests slowly, and matching the huge BTS dataset requires a large number of requests (ca. 15 min for a single year and just one airport). We therefore decided to focus only on the weather at John F. Kennedy International Airport (New York City) and Boston Logan International Airport (Boston). Although we used just these two airports, we gained valuable insights into the effect of weather on delays (see the process notebook for the exploratory analysis).
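A rough calculation shows why restricting the weather data to two airports was necessary. It uses the figure of about 15 minutes per airport and year quoted above; the 25-year span is borrowed from section 2.1 and the count of 300 airports is purely hypothetical.

```python
MINUTES_PER_AIRPORT_YEAR = 15  # approximate figure quoted in the text


def scrape_hours(n_airports: int, n_years: int) -> float:
    """Estimate the total scraping time in hours."""
    return n_airports * n_years * MINUTES_PER_AIRPORT_YEAR / 60


two_airports = scrape_hours(2, 25)     # 12.5 hours
many_airports = scrape_hours(300, 25)  # 1875.0 hours, about 78 days
```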