### Linear Regression Prediction

In [212]:
%matplotlib inline
# import required modules for prediction tasks
import numpy as np
import pandas as pd
import math
import random
import requests
import zipfile
import StringIO
import re
import json
import os

In [213]:
# load data
df = pd.read_csv('cache/linear_model_data.csv')

For our model, we need only some of the columns.

In [214]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,index,FL_DATE,UNIQUE_CARRIER,ORIGIN,DEST,CRS_DEP_TIME,CRS_ARR_TIME,ARR_DELAY,DISTANCE,ORIGIN_CITY_NAME,DEST_CITY_NAME,AIRCRAFT_YEAR,AIRCRAFT_MFR,LAT,LONG
0,0,0,2014-01-01,AA,JFK,LAX,900,1225,13,2475,"New York, NY","Los Angeles, CA",1987,BOEING,40.633333,-73.783333
1,1,1,2014-01-02,AA,JFK,LAX,900,1225,1,2475,"New York, NY","Los Angeles, CA",1987,BOEING,40.633333,-73.783333


In our model, we want to use the age of the aircraft as a variable. Unfortunately, this data is not available for some aircraft. A quick check reveals that about 2.4% of all rows (4M in total) are affected. We remove that data.

In [215]:
df[df['AIRCRAFT_YEAR']=='    '].count()[0] * 100. / df.count()[0]

2.358077559760376

In [216]:
df = df[df['AIRCRAFT_YEAR'] != '    ']

Using the aircraft year, we compute the age of the planes (note that all data used is from 2014!)

In [217]:
# manually add a new column for the aircraft age (data is from 2014, so use this year!)
df['AIRCRAFT_AGE'] = 2014 - df['AIRCRAFT_YEAR'].astype(int)
df = df[['FL_DATE', 'UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'ARR_DELAY', 'DISTANCE', 'AIRCRAFT_MFR', 'AIRCRAFT_AGE']]
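The filter-and-convert steps above can also be sketched with `pd.to_numeric`, which turns anything that is not a number into `NaN` in one pass. This is only a minimal sketch on a made-up toy frame (`toy`, `year` and `missing_pct` are illustrative names, not part of the notebook); it assumes a pandas version that provides `Series.isna`.

```python
import pandas as pd

# Toy frame standing in for the real data; the blank string marks a missing year.
toy = pd.DataFrame({'AIRCRAFT_YEAR': ['1987', '    ', '2001', '1999']})

# Strip whitespace first, then coerce: empty or non-numeric entries become NaN.
year = pd.to_numeric(toy['AIRCRAFT_YEAR'].str.strip(), errors='coerce')

# Share of rows without a usable year, in percent.
missing_pct = year.isna().mean() * 100.0

# Keep only rows with a year and compute the age relative to 2014.
toy_clean = toy[year.notna()].copy()
toy_clean['AIRCRAFT_AGE'] = 2014 - year[year.notna()].astype(int)

print(missing_pct)
print(toy_clean)
```

Compared with matching the exact four-space placeholder, this variant would also catch other malformed year strings, which may or may not be desirable.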

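In the next step, we transform the flight date into binary features. Typically, the day of the week and the season have an effect on the delay time. Hence, we add 7 variables indicating the weekday and 4 indicating the season. Note that the code below assigns the season by calendar quarter, so January to March is labelled Spring, April to June Summer, and so on.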

In [218]:
# add columns filled with zero to df
dateFeaturesColumns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Spring', 'Summer', 'Autumn', 'Winter']

for col in dateFeaturesColumns:
    df[col] = 0

To run this code, sit back and relax. It may take a while.

In [219]:
%%time

import datetime

# set binary features
for index, row in df.iterrows():
    
    datestr = row['FL_DATE']
    dtobj = datetime.datetime.strptime(datestr, '%Y-%m-%d')
    
    # set 1's where necessary (.at replaces the deprecated set_value;
    # // keeps the quarter index an integer on Python 3 as well)
    df.at[index, dateFeaturesColumns[dtobj.weekday()]] = 1
    df.at[index, dateFeaturesColumns[7 + (dtobj.month - 1) // 3]] = 1

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.96 µs
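Iterating row by row is slow on millions of rows. A vectorised variant of the same weekday and quarter rule might look like the following. It is a sketch on a small made-up frame (`toy`, `cols`, `weekday` and `quarter` are illustrative names) and deliberately keeps the quarter-based season labels used in the loop above.

```python
import pandas as pd

# Toy data standing in for the FL_DATE column.
toy = pd.DataFrame({'FL_DATE': ['2014-01-01', '2014-01-04', '2014-07-14', '2014-12-25']})
cols = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun',
        'Spring', 'Summer', 'Autumn', 'Winter']

# Parse all dates at once instead of calling strptime per row.
dates = pd.to_datetime(toy['FL_DATE'], format='%Y-%m-%d')
weekday = dates.dt.dayofweek          # Monday = 0 ... Sunday = 6
quarter = (dates.dt.month - 1) // 3   # 0..3, same quarter rule as the loop

# One 0/1 column per weekday and per quarter-based season label.
for i, col in enumerate(cols[:7]):
    toy[col] = (weekday == i).astype(int)
for i, col in enumerate(cols[7:]):
    toy[col] = (quarter == i).astype(int)

print(toy)
```

Each row ends up with exactly one weekday flag and one season flag set, as in the loop version.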


In [220]:
df.head()

Unnamed: 0,FL_DATE,UNIQUE_CARRIER,ORIGIN,DEST,ARR_DELAY,DISTANCE,AIRCRAFT_MFR,AIRCRAFT_AGE,Mon,Tue,Wed,Thu,Fri,Sat,Sun,Spring,Summer,Autumn,Winter
0,2014-01-01,AA,JFK,LAX,13,2475,BOEING,27,0,0,1,0,0,0,0,1,0,0,0
1,2014-01-02,AA,JFK,LAX,1,2475,BOEING,27,0,0,0,1,0,0,0,1,0,0,0
2,2014-01-04,AA,JFK,LAX,59,2475,BOEING,28,0,0,0,0,0,1,0,1,0,0,0
3,2014-01-06,AA,JFK,LAX,-8,2475,BOEING,29,1,0,0,0,0,0,0,1,0,0,0
4,2014-01-09,AA,JFK,LAX,-21,2475,BOEING,26,0,0,0,1,0,0,0,1,0,0,0


### Dealing with the aircraft manufacturer

Next, we generate features for the aircraft manufacturer. A first step is to clean the manufacturer strings up (i.e. remove the whitespace) and get a first statistic.

In [221]:
df['AIRCRAFT_MFR'] = df['AIRCRAFT_MFR'].map(lambda x: x.strip())

In [222]:
mfr_stats = df['AIRCRAFT_MFR'].value_counts()
mfr_stats

BOEING                           1485383
BOMBARDIER INC                    689292
EMBRAER                           567350
AIRBUS INDUSTRIE                  500511
AIRBUS                            401410
MCDONNELL DOUGLAS AIRCRAFT CO     164145
MCDONNELL DOUGLAS                 118159
CANADAIR                           23805
MCDONNELL DOUGLAS CORPORATION      21170
EMBRAER S A                         7264
CESSNA                              7014
PIPER                               3270
BEECH                               2984
CIRRUS DESIGN CORP                  2178
CANADAIR LTD                        1791
GULFSTREAM AEROSPACE                1576
BELL                                1527
MARZ BARRY                          1508
KILDALL GARY                        1433
LEBLANC GLENN T                     1349
ROBINSON HELICOPTER CO              1122
FRIEDEMANN JON                       723
GROSS ROBERT                         466
SOCATA                               415
RAYTHEON AIRCRAF

A quick investigation of the data shows that there are many manufacturers whose airplanes serve a negligible number of flights. Also, we merge some companies that are basically the same. To simplify the model, flights served by airplanes of manufacturers with less than 1% market share are grouped together.

In [223]:
market_share = mfr_stats.values * 100. / np.sum(mfr_stats.values)
idxs = np.where(market_share < 1.)
names = np.array([el for el in list(mfr_stats.keys())])

# get labels for small manufacturers
smallMFR = names[idxs]
smallMFR

array(['CANADAIR', 'MCDONNELL DOUGLAS CORPORATION', 'EMBRAER S A',
       'CESSNA', 'PIPER', 'BEECH', 'CIRRUS DESIGN CORP', 'CANADAIR LTD',
       'GULFSTREAM AEROSPACE', 'BELL', 'MARZ BARRY', 'KILDALL GARY',
       'LEBLANC GLENN T', 'ROBINSON HELICOPTER CO', 'FRIEDEMANN JON',
       'GROSS ROBERT', 'SOCATA', 'RAYTHEON AIRCRAFT COMPANY', 'AGUSTA SPA',
       'DOUGLAS', 'SIKORSKY', 'AVIAT AIRCRAFT INC', 'BENHAM JOHN'], 
      dtype='|S29')
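The market-share threshold can also be expressed directly on the `value_counts` result. The following is a minimal sketch on made-up counts (`mfr`, `share` and `small` are illustrative names); it assumes `value_counts(normalize=True)`, which returns relative frequencies.

```python
import pandas as pd

# Toy manufacturer column: 150 + 49 + 1 = 200 flights.
mfr = pd.Series(['BOEING'] * 150 + ['AIRBUS'] * 49 + ['PIPER'])

# Relative frequency per manufacturer, scaled to percent.
share = mfr.value_counts(normalize=True) * 100.0

# Labels of all manufacturers below the 1% threshold.
small = share[share < 1.0].index.tolist()

print(share)
print(small)
```

Here `PIPER` accounts for 0.5% of the toy flights and is therefore the only label below the threshold.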

In [224]:
mfr_stats.keys()

Index([u'BOEING', u'BOMBARDIER INC', u'EMBRAER', u'AIRBUS INDUSTRIE',
       u'AIRBUS', u'MCDONNELL DOUGLAS AIRCRAFT CO', u'MCDONNELL DOUGLAS',
       u'CANADAIR', u'MCDONNELL DOUGLAS CORPORATION', u'EMBRAER S A',
       u'CESSNA', u'PIPER', u'BEECH', u'CIRRUS DESIGN CORP', u'CANADAIR LTD',
       u'GULFSTREAM AEROSPACE', u'BELL', u'MARZ BARRY', u'KILDALL GARY',
       u'LEBLANC GLENN T', u'ROBINSON HELICOPTER CO', u'FRIEDEMANN JON',
       u'GROSS ROBERT', u'SOCATA', u'RAYTHEON AIRCRAFT COMPANY', u'AGUSTA SPA',
       u'DOUGLAS', u'SIKORSKY', u'AVIAT AIRCRAFT INC', u'BENHAM JOHN'],
      dtype='object')

In [225]:
# save the progress
df.to_csv('cache/cached_features.csv')

In [234]:
# perform merging for the big companies
# McDonnell Douglas airplanes
df.loc[df['AIRCRAFT_MFR'] == 'MCDONNELL DOUGLAS AIRCRAFT CO', 'AIRCRAFT_MFR'] = 'MCDONNELL DOUGLAS'
df.loc[df['AIRCRAFT_MFR'] == 'MCDONNELL DOUGLAS CORPORATION', 'AIRCRAFT_MFR'] = 'MCDONNELL DOUGLAS'
# rows labelled plain 'DOUGLAS' are not merged here; they fall into 'SMALL' below

# Embraer
df.loc[df['AIRCRAFT_MFR'] == 'EMBRAER S A', 'AIRCRAFT_MFR'] = 'EMBRAER'

# Airbus
df.loc[df['AIRCRAFT_MFR'] == 'AIRBUS INDUSTRIE', 'AIRCRAFT_MFR'] = 'AIRBUS'

# the small manufacturers
for name in smallMFR:
    df.loc[df['AIRCRAFT_MFR'] == name, 'AIRCRAFT_MFR'] = 'SMALL'

In [235]:
df['AIRCRAFT_MFR'].value_counts()

BOEING               1485383
AIRBUS                901921
BOMBARDIER INC        689292
EMBRAER               574614
MCDONNELL DOUGLAS     303474
SMALL                  52147
dtype: int64
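The same relabelling could be written with a single mapping instead of one `.loc` assignment per name. This is only a sketch on a toy series; `merge_map` and `small` are hypothetical names, and the list of small manufacturers is shortened for illustration.

```python
import pandas as pd

# Toy manufacturer labels standing in for df['AIRCRAFT_MFR'].
mfr = pd.Series(['BOEING', 'AIRBUS INDUSTRIE', 'EMBRAER S A',
                 'MCDONNELL DOUGLAS CORPORATION', 'PIPER'])

# Variant spellings mapped to one canonical name each.
merge_map = {
    'MCDONNELL DOUGLAS AIRCRAFT CO': 'MCDONNELL DOUGLAS',
    'MCDONNELL DOUGLAS CORPORATION': 'MCDONNELL DOUGLAS',
    'EMBRAER S A': 'EMBRAER',
    'AIRBUS INDUSTRIE': 'AIRBUS',
}
small = ['PIPER']  # shortened stand-in for the list of small manufacturers

# First merge the variants, then collapse the small manufacturers.
merged = mfr.replace(merge_map)
merged = merged.where(~merged.isin(small), 'SMALL')

print(merged.value_counts())
```

Keeping the mapping in one dictionary makes it easier to see which labels are merged and to spot a name that was left out.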

### Creating binary features for the flight

Now we create binary features for the airports (origin and destination), the manufacturer, and the airline.

In [236]:
# get airlines
airline_labels = df['UNIQUE_CARRIER'].unique()
mfr_labels = df['AIRCRAFT_MFR'].unique()
airport_labels_o = df['ORIGIN'].unique()
airport_labels_d = df['DEST'].unique()

A quick check shows that some airports are served only as an origin or only as a destination. Here, we include those cases. However, since we model one factor per airport involved, it is worth considering whether other modelling choices would be more adequate.

In [247]:
set(airport_labels_o) ^ set(airport_labels_d)

{'BKG',
 'BQN',
 'CSG',
 'ECP',
 'GUM',
 'MCN',
 'MVY',
 'PPG',
 'PSE',
 'SJU',
 'STT',
 'STX',
 'TXK'}

In [251]:
airport_labels = list(set(airport_labels_o) | set(airport_labels_d))
len(airport_labels)

324
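The two set operations used above differ in what they keep: `^` (symmetric difference) returns the airports that appear on one side only, while `|` (union) returns every airport that appears at all. A small illustration with made-up airport sets:

```python
# Toy origin and destination sets; the codes are chosen for illustration only.
origins = {'JFK', 'LAX', 'SJU'}
dests = {'JFK', 'LAX', 'GUM'}

# Airports that occur only as origin or only as destination.
one_sided = origins ^ dests

# Every airport that occurs at all; sorted to get a stable column order.
all_airports = sorted(origins | dests)

print(one_sided)
print(all_airports)
```

Sorting the union is optional, but it gives the feature columns the same order on every run.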

In [252]:
%%time
# similar to the date features, first add the columns!
for col in airline_labels:
    df[col] = 0
for col in mfr_labels:
    df[col] = 0
for col in airport_labels:
    df[col] = 0

As before, running this cell might take some time!

In [267]:
%%time

# now add the features
for index, row in df.iterrows():
    # set 1's where necessary (do lookup based on labels);
    # .at replaces the deprecated set_value
    df.at[index, row['UNIQUE_CARRIER']] = 1
    df.at[index, row['ORIGIN']] = 1
    df.at[index, row['DEST']] = 1
    df.at[index, row['AIRCRAFT_MFR']] = 1

CPU times: user 1 µs, sys: 0 ns, total: 1 µs
Wall time: 16.2 µs
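As with the date features, the row loop can be replaced by a vectorised construction. The sketch below uses `pd.get_dummies` on a toy frame and, like the loop above, keeps a single column per airport that is 1 if the airport is either the origin or the destination. All variable names are illustrative, and the approach assumes that carrier codes, manufacturer names and airport codes do not collide as column names.

```python
import pandas as pd

# Toy flights standing in for the real data.
toy = pd.DataFrame({
    'UNIQUE_CARRIER': ['AA', 'DL', 'AA'],
    'ORIGIN': ['JFK', 'LAX', 'SFO'],
    'DEST': ['LAX', 'JFK', 'JFK'],
    'AIRCRAFT_MFR': ['BOEING', 'AIRBUS', 'BOEING'],
})

# One 0/1 column per carrier and per manufacturer.
carrier = pd.get_dummies(toy['UNIQUE_CARRIER']).astype(int)
mfr = pd.get_dummies(toy['AIRCRAFT_MFR']).astype(int)

# Align origin and destination dummies on the union of all airports.
airports = sorted(set(toy['ORIGIN']) | set(toy['DEST']))
origin = pd.get_dummies(toy['ORIGIN']).reindex(columns=airports, fill_value=0).astype(int)
dest = pd.get_dummies(toy['DEST']).reindex(columns=airports, fill_value=0).astype(int)

# An airport column is 1 if the flight starts or ends there.
airport = ((origin + dest) > 0).astype(int)

features = pd.concat([toy, carrier, mfr, airport], axis=1)
print(features)
```

If origin and destination should be modelled as separate factors, the two dummy frames could instead be kept apart with a prefix, for example via the `prefix` argument of `pd.get_dummies`.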


In [268]:
print('done')

done


In [269]:
%%time
# save to csv file
df.to_csv('cache/linear_model_features.csv')

In [270]:
print('done')

done
