Thinkful Bootcamp Course

Author: Ian Heaton

Email: iheaton@gmail.com

Mentor: Nemanja Radojkovic

Date: 2017/04/27


In [1]:
from IPython.display import display
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.linear_model import LogisticRegression

%matplotlib inline

sb.set_style('darkgrid')
my_dpi = 76

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Classification of Airline Delays with SVM


## Question:


### Data:

The data comes from the Kaggle dataset ‘Airlines Delay’ [1].


### Context:

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website. BTS began collecting details on the causes of flight delays in June 2003.

### Content:
      
+ Year: 1987-2008
+ Month: 1-12
+ DayofMonth: 1-31
+ DayOfWeek: 1 (Monday) - 7 (Sunday)
+ DepTime: actual departure time (local, hhmm)
+ CRSDepTime: scheduled departure time (local, hhmm)
+ ArrTime: actual arrival time (local, hhmm)
+ CRSArrTime: scheduled arrival time (local, hhmm)
+ UniqueCarrier: unique carrier code
+ FlightNum: flight number
+ TailNum: plane tail number
+ ActualElapsedTime: in minutes
+ CRSElapsedTime: in minutes
+ AirTime: in minutes
+ ArrDelay: arrival delay, in minutes
+ DepDelay: departure delay, in minutes
+ Origin: origin IATA airport code
+ Dest: destination IATA airport code
+ Distance: in miles
+ TaxiIn: taxi-in time, in minutes
+ TaxiOut: taxi-out time, in minutes
+ Cancelled: was the flight cancelled?
+ CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
+ Diverted: 1 = yes, 0 = no
+ CarrierDelay: in minutes
+ WeatherDelay: in minutes
+ NASDelay: in minutes
+ SecurityDelay: in minutes
+ LateAircraftDelay: in minutes
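
The four time columns (DepTime, CRSDepTime, ArrTime, CRSArrTime) are encoded as hhmm numbers, so 1455 means 14:55 rather than 1455 minutes. Before such a column is used as a numeric feature, it may help to convert it to minutes after midnight. The sketch below is one possible way to do that; the helper name `hhmm_to_minutes` and the sample values are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd


def hhmm_to_minutes(hhmm):
    """Convert local times encoded as hhmm (e.g. 1455) to minutes after midnight.

    Hypothetical helper for illustration; missing values stay missing (NaN).
    """
    hhmm = pd.to_numeric(hhmm, errors="coerce")
    hours = hhmm // 100    # leading digits are the hour
    minutes = hhmm % 100   # last two digits are the minute
    return hours * 60 + minutes


# Made-up sample values in the same encoding as the DepTime column
sample = pd.Series([5.0, 1455.0, 2359.0, None], name="DepTime")
dep_minutes = hhmm_to_minutes(sample)
print(dep_minutes)
```

Whether this transformation helps the classifier is an open question; it only removes the artificial gaps between, say, 1259 and 1300.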


In [2]:
# Columns that will not add information to the analysis (candidates for exclusion)
ex_columns = ['FlightNum', 'TailNum', 'UniqueCarrier']
# Read the CSV file of delayed flights
data_file = '/media/ianh/space/ThinkfulData/AirlinesDelay/DelayedFlights.csv'
flights = pd.read_csv(data_file)
print("\nObservations : %d\n" % (flights.shape[0]))


Observations : 1936758
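
The cell above defines `ex_columns` but still reads every column from the file. One way to apply the exclusion at read time is to pass a callable to the `usecols` parameter of `pd.read_csv`. The sketch below assumes that approach and uses a small in-memory CSV with made-up rows in place of DelayedFlights.csv; the notebook itself may drop the columns differently.

```python
import io

import pandas as pd

ex_columns = ['FlightNum', 'TailNum', 'UniqueCarrier']

# Hypothetical three-row stand-in for DelayedFlights.csv (values are made up)
csv_text = io.StringIO(
    "Year,Month,UniqueCarrier,FlightNum,TailNum,DepDelay\n"
    "2008,1,WN,335,N712SW,8.0\n"
    "2008,1,WN,3231,N772SW,19.0\n"
    "2008,2,XE,448,N464WN,34.0\n"
)

# The callable is evaluated per column name; a column is kept when it returns True
subset = pd.read_csv(csv_text, usecols=lambda name: name not in ex_columns)
print(subset.columns.tolist())
```

Filtering at read time avoids loading the unwanted columns into memory, which can matter for a file with almost two million rows.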



## Preprocessing and exploratory data analysis

In [16]:
# Check for missing data
print("%s\n" % (flights.isnull().sum()))

# Let's confirm that every column has the expected type: object, int, or float
print("\n%s\n" % (flights.dtypes))

# We are interested in whether a flight was delayed, not in the reason why,
# so we create a new column called Delayed. An observation is considered
# delayed only if at least one delay cause exceeds 10 minutes.
flights['Delayed'] = np.where(((flights.CarrierDelay > 10) | (flights.WeatherDelay > 10) | (flights.NASDelay > 10)
                               | (flights.SecurityDelay > 10) | (flights.LateAircraftDelay > 10)), 1, 0)

Unnamed: 0                0
Year                      0
Month                     0
DayofMonth                0
DayOfWeek                 0
DepTime                   0
CRSDepTime                0
ArrTime                7110
CRSArrTime                0
UniqueCarrier             0
FlightNum                 0
TailNum                   5
ActualElapsedTime      8387
CRSElapsedTime          198
AirTime                8387
ArrDelay               8387
DepDelay                  0
Origin                    0
Dest                      0
Distance                  0
TaxiIn                 7110
TaxiOut                 455
Cancelled                 0
CancellationCode          0
Diverted                  0
CarrierDelay         689270
WeatherDelay         689270
NASDelay             689270
SecurityDelay        689270
LateAircraftDelay    689270
Delayed                   0
dtype: int64


Unnamed: 0             int64
Year                   int64
Month                  int64
DayofMonth             int64
D
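
The missing-value counts show 689270 rows with no value in any of the five delay-cause columns. A comparison such as `NaN > 10` evaluates to False, so the rule used for the Delayed column labels those rows 0 rather than leaving them missing. The sketch below reproduces that behaviour on a small made-up frame; the `toy` DataFrame and the `delay_cols` list are illustrative assumptions, and the `gt(10).any(axis=1)` form is an equivalent rewrite of the chained `|` expression, not the notebook's own code.

```python
import numpy as np
import pandas as pd

delay_cols = ['CarrierDelay', 'WeatherDelay', 'NASDelay',
              'SecurityDelay', 'LateAircraftDelay']

# Made-up rows: one carrier delay, one NAS delay, one all-missing row,
# and one row whose largest delay is exactly 10 minutes
toy = pd.DataFrame({
    'CarrierDelay':      [15.0, 0.0, np.nan, 5.0],
    'WeatherDelay':      [0.0, 0.0, np.nan, 10.0],
    'NASDelay':          [0.0, 12.0, np.nan, 0.0],
    'SecurityDelay':     [0.0, 0.0, np.nan, 0.0],
    'LateAircraftDelay': [0.0, 0.0, np.nan, 0.0],
})

# A row is flagged when any cause strictly exceeds 10 minutes;
# NaN comparisons are False, so the all-missing row becomes 0
toy['Delayed'] = toy[delay_cols].gt(10).any(axis=1).astype(int)
print(toy['Delayed'].tolist())
```

If rows with missing cause columns should be treated differently from flights that were not delayed, they would need to be handled before this step.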

## Conclusions

## References

1. https://www.kaggle.com/giovamata/airlinedelaycauses