## Airline Delay Dataset

This is a standard benchmark dataset for **Gaussian Processes**.

It consists of data from all commercial flights in the US in 2008. It was first used by Hensman et al. (2013), who only looked at the first 4 months of 2008 and subsampled to 800k points.

The whole 2008 dataset **should** contain just shy of 6M data points (the preprocessed file loaded in this notebook has 5,929,413 rows).


The flight data provides 7 input variables:
 - distance that needs to be covered (miles),
 - air time (minutes),
 - departure time,
 - arrival time,
 - day of the week,
 - day of the month,
 - month,

with the goal of predicting the amount of delay in minutes.

But all GP papers also add another variable:
 - aircraft_age, the age of the aircraft (number of years since deployment; stored as `plane_age` in the code),

which we join to the main data in this notebook.


#### Download Procedure
1. Download flight data from https://www.transtats.bts.gov/Fields.asp?Table_ID=236
    from January 2008 to December 2008
2. Download plane data from http://stat-computing.org/dataexpo/2009/supplemental-data.html (plane-data.csv)

#### Preprocessing
1. Merge all flight data
2. Join flight and plane data
3. Drop NaNs
4. Calculate aircraft age
5. Save useful columns to hdf5
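
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the exact code that produced `airline.csv`: the join key (`TailNum` in the flight data, `tailnum` in `plane-data.csv`) and the manufacturing-year column `year` are assumptions about the raw files, and `preprocess` is a hypothetical helper name. Aircraft age is taken here as flight year minus manufacturing year.

```python
import pandas as pd

# Columns kept for the GP benchmark (same names as in the DataFrame shown later).
USEFUL_COLUMNS = [
    "Year", "Month", "DayofMonth", "DayOfWeek", "DepTime", "ArrTime",
    "ArrDelay", "AirTime", "Distance", "plane_age",
]


def preprocess(monthly_flights, planes):
    """Sketch of the preprocessing steps; column names of the raw files are assumed."""
    # 1. Merge all flight data (one DataFrame per downloaded month).
    flights = pd.concat(monthly_flights, ignore_index=True)

    # 2. Join flight and plane data on the tail number (assumed key names).
    planes = planes.rename(columns={"tailnum": "TailNum", "year": "plane_year"})
    # Non-numeric year entries become NaN so that step 3 removes them.
    planes["plane_year"] = pd.to_numeric(planes["plane_year"], errors="coerce")
    df = flights.merge(planes[["TailNum", "plane_year"]], on="TailNum", how="left")

    # 3. Drop rows with a missing value in any column we need.
    needed = [c for c in USEFUL_COLUMNS if c != "plane_age"] + ["plane_year"]
    df = df.dropna(subset=needed)

    # 4. Calculate aircraft age in years.
    df = df.assign(plane_age=df["Year"] - df["plane_year"])

    # 5. Keep only the useful columns (writing them to HDF5 is a separate step).
    return df[USEFUL_COLUMNS].reset_index(drop=True)
```

Calling `preprocess([january_df, february_df, ...], planes_df)` would return a single frame with one row per complete flight record.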

#### References
http://stat-computing.org/dataexpo/2009/the-data.html

In [118]:
import h5py  # needed for the HDF5 export in the last cell
import pandas as pd

In [119]:
file_name = "/data/DATASETS/FLIGHTS/airline.csv"
df = pd.read_csv(file_name, index_col=0)


In [120]:
df.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,ArrDelay,AirTime,Distance,plane_age
0,2008,1,3,4,2003.0,2211.0,-14.0,116.0,810,10.0
1,2008,1,3,4,734.0,958.0,-22.0,314.0,2283,10.0
2,2008,1,3,4,1052.0,1603.0,-17.0,175.0,1521,10.0
3,2008,1,3,4,1653.0,1932.0,2.0,79.0,577,10.0
4,2008,1,4,5,1338.0,1440.0,10.0,48.0,239,10.0


In [122]:
X_data = df[[
    "Month", "DayofMonth", "DayOfWeek", "plane_age", 
    "Distance", "AirTime", "DepTime", "ArrTime"]].to_numpy()
Y_data = df["ArrDelay"].to_numpy().reshape(-1, 1)
print(X_data.shape, X_data.dtype)
print(Y_data.shape, Y_data.dtype)

(5929413, 8) float64
(5929413, 1) float64


In [123]:
with h5py.File("/data/DATASETS/FLIGHTS/flights.hdf5", "w") as hdf5_file:
    hdf5_file.create_dataset("X", data=X_data, compression="gzip", compression_opts=5)
    hdf5_file.create_dataset("Y", data=Y_data, compression="gzip", compression_opts=5)
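
To read the arrays back for an experiment, something along these lines should work. It is only a sketch: `load_flights` is a hypothetical helper name, and it assumes the file contains the two datasets `X` and `Y` written by the cell above.

```python
import h5py


def load_flights(path):
    """Load the feature matrix X and the delay targets Y from an HDF5 file."""
    with h5py.File(path, "r") as hdf5_file:
        # Slicing with [:] reads (and decompresses) the full dataset into a NumPy array.
        X = hdf5_file["X"][:]
        Y = hdf5_file["Y"][:]
    return X, Y
```

For example, `X, Y = load_flights("/data/DATASETS/FLIGHTS/flights.hdf5")` should give back arrays with the shapes printed earlier.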