### Linear Regression Prediction

In [98]:
%matplotlib inline
# import required modules for prediction tasks
import numpy as np
import pandas as pd
import math
import random
import requests
import zipfile
import StringIO
import re
import json
import os

In [99]:
# load data
df = pd.read_csv('cache/linear_model_data.csv')

For our model, we need only some of the columns.

In [100]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,index,FL_DATE,UNIQUE_CARRIER,ORIGIN,DEST,CRS_DEP_TIME,CRS_ARR_TIME,ARR_DELAY,DISTANCE,ORIGIN_CITY_NAME,DEST_CITY_NAME,AIRCRAFT_YEAR,AIRCRAFT_MFR,LAT,LONG
0,0,0,2014-01-01,AA,JFK,LAX,900,1225,13,2475,"New York, NY","Los Angeles, CA",1987,BOEING,40.633333,-73.783333
1,1,1,2014-01-02,AA,JFK,LAX,900,1225,1,2475,"New York, NY","Los Angeles, CA",1987,BOEING,40.633333,-73.783333


In our model, we want to use the age of the aircraft as a variable. Unfortunately, this information is not available for some aircraft. A quick check reveals that around 2.4% of all rows (about 4M in total) are affected. We remove that data.

In [101]:
# share of rows (in percent) whose aircraft year is blank (four spaces)
(df['AIRCRAFT_YEAR'] == '    ').sum() * 100. / len(df)

2.358077559760376

In [102]:
df = df[df['AIRCRAFT_YEAR'] != '    ']
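The check and the filter above can be sketched on a toy frame. The values and the `TAIL` column below are hypothetical, not taken from the real data; the only thing carried over is the four-space placeholder for a missing year. As an assumption that goes slightly beyond the notebook, the mask strips whitespace, so it would also catch placeholders of a different width.

```python
import pandas as pd

# Toy data (hypothetical values); a missing year is stored as four spaces
toy = pd.DataFrame({
    'TAIL': ['A', 'B', 'C', 'D'],
    'AIRCRAFT_YEAR': ['1987', '    ', '2001', '1995'],
})

# Boolean mask of rows without a usable year
# (assumption: any whitespace-only string counts as missing)
missing = toy['AIRCRAFT_YEAR'].str.strip() == ''

# Share of affected rows in percent
pct_missing = missing.mean() * 100.

# Keep only rows with a year; .copy() gives an independent frame,
# so adding columns later does not trigger chained-assignment warnings
toy_clean = toy[~missing].copy()
```

Building the mask once and reusing it for both the percentage and the filter keeps the two steps consistent with each other.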

Using the aircraft year, we compute the age of the planes (note that all data used is from 2014!)

In [103]:
# add a new column with the aircraft age (all data is from 2014, so use that year)
df['AIRCRAFT_AGE'] = 2014 - df['AIRCRAFT_YEAR'].astype(int)
# keep only the columns needed for the model
df = df[['FL_DATE', 'UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'ARR_DELAY', 'DISTANCE', 'AIRCRAFT_MFR', 'AIRCRAFT_AGE']]

In the next step, we transform the flight date into binary features. Typically, the day of the week and the current season have an effect on the delay time. Hence, we add 4 variables indicating the season and 7 indicating the weekday.

In [104]:
# add columns filled with zero to df
dateFeaturesColumns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Spring', 'Summer', 'Autumn', 'Winter']

for col in dateFeaturesColumns:
    df[col] = 0
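To make the encoding concrete, here is a small sketch that maps a single date string to the two columns that receive a 1. It assumes the indexing used by the loop further down in this notebook: the weekday index selects one of the first seven columns, and the month is grouped into calendar quarters that select one of the last four. The helper name `date_feature_names` is made up for this illustration.

```python
import datetime

# Same column order as dateFeaturesColumns: 7 weekdays, then 4 seasons
columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun',
           'Spring', 'Summer', 'Autumn', 'Winter']

def date_feature_names(datestr):
    """Return the (weekday, season) column names that get a 1 for a date string."""
    d = datetime.datetime.strptime(datestr, '%Y-%m-%d')
    weekday_col = columns[d.weekday()]            # weekday() is 0 for Monday
    season_col = columns[7 + (d.month - 1) // 3]  # Jan-Mar -> index 7, Apr-Jun -> 8, ...
    return weekday_col, season_col

print(date_feature_names('2014-01-01'))  # ('Wed', 'Spring')
print(date_feature_names('2014-07-04'))  # ('Fri', 'Autumn')
```

With this grouping the season labels are really calendar quarters rather than meteorological seasons, which is why a July date falls under `'Autumn'`.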

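The next cell loops over every row, so sit back and relax: it might take a while. Note that `%time` on a line of its own only times an empty statement, so the timing printed below the cell does not reflect the loop; putting the `%%time` cell magic on the first line of the cell would time the whole cell.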

In [138]:
%time 

import datetime

# set binary features
for index, row in df.iterrows():
    
    datestr = row['FL_DATE']
    dtobj = datetime.datetime.strptime(datestr, '%Y-%m-%d')
    
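    # set 1's where necessary: weekday() is 0 for Monday, and months are
    # grouped into quarters (Jan-Mar -> 'Spring', ..., Oct-Dec -> 'Winter')
    # df.at replaces the removed df.set_value; // keeps the index an integer
    df.at[index, dateFeaturesColumns[dtobj.weekday()]] = 1
    df.at[index, dateFeaturesColumns[7 + (dtobj.month - 1) // 3]] = 1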

CPU times: user 1e+03 ns, sys: 1 µs, total: 2 µs
Wall time: 5.01 µs
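Because the row-by-row loop is slow on millions of rows, a vectorized variant may be worth considering. The sketch below is not the notebook's method; it assumes the same column order and the same quarter-based season index, and runs on a toy frame with hypothetical dates. It parses the date column once with `pd.to_datetime` and derives every indicator column from whole-column comparisons.

```python
import pandas as pd

# Same column order as dateFeaturesColumns: 7 weekdays, then 4 seasons
columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun',
           'Spring', 'Summer', 'Autumn', 'Winter']

# Toy data (hypothetical rows) with the same date format as FL_DATE
toy = pd.DataFrame({'FL_DATE': ['2014-01-01', '2014-01-04',
                                '2014-07-04', '2014-12-25']})

# Parse all dates in one pass
dates = pd.to_datetime(toy['FL_DATE'], format='%Y-%m-%d')

weekday_idx = dates.dt.dayofweek        # 0 = Monday, same convention as datetime.weekday()
season_idx = (dates.dt.month - 1) // 3  # 0..3, one value per calendar quarter

# One 0/1 column per weekday and per season
for i, col in enumerate(columns[:7]):
    toy[col] = (weekday_idx == i).astype(int)
for i, col in enumerate(columns[7:]):
    toy[col] = (season_idx == i).astype(int)
```

Each row ends up with exactly one weekday flag and one season flag set, which is a cheap sanity check to run on the full frame as well.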


In [139]:
df.head()

Unnamed: 0,FL_DATE,UNIQUE_CARRIER,ORIGIN,DEST,ARR_DELAY,DISTANCE,AIRCRAFT_MFR,AIRCRAFT_AGE,Mon,Tue,Wed,Thu,Fri,Sat,Sun,Spring,Summer,Autumn,Winter
0,2014-01-01,AA,JFK,LAX,13,2475,BOEING,27,0,0,1,0,0,0,0,1,1,0,0
1,2014-01-02,AA,JFK,LAX,1,2475,BOEING,27,0,0,0,1,0,0,0,1,1,0,0
2,2014-01-04,AA,JFK,LAX,59,2475,BOEING,28,0,0,0,0,0,1,0,1,1,0,0
3,2014-01-06,AA,JFK,LAX,-8,2475,BOEING,29,1,0,0,0,0,0,0,1,1,0,0
4,2014-01-09,AA,JFK,LAX,-21,2475,BOEING,26,0,0,0,1,0,0,0,1,1,0,0
