In [1]:
#Importing the libraries
import pandas as pd
import numpy as np
# the url
file = "https://raw.githubusercontent.com/jtopor/DAV-5400/master/Project1/hflights.csv"
# the data frame
df = pd.read_csv(file)

# Model Goal, Dependent and Independent Variables
Before starting the feature engineering process, we first have to establish our goal.
We need to identify the dependent and independent features so that we have a clear picture of what type of feature engineering is needed and can direct our efforts in the right direction.

- The goal is to determine whether or not a flight has been cancelled, given a group of features.

So far, in this context, the dependent and independent variables are as follows:
- Dependent feature(s): Cancelled
- Independent features: all columns except Cancelled
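As a minimal sketch of that split, assuming a small made-up frame that only borrows a few of the hflights column names (the values and the names `toy`, `features` and `y` are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy frame with a few hflights-style columns; the values are made up.
toy = pd.DataFrame({
    "Month": [1, 1, 2, 2],
    "UniqueCarrier": ["AA", "B6", "AA", "WN"],
    "Distance": [224, 1428, 964, 224],
    "Cancelled": [0, 1, 0, 0],
})

target_col = "Cancelled"
y = toy[target_col]                         # dependent feature
features = toy.drop(columns=[target_col])   # independent features

print(features.columns.tolist())
print(y.tolist())
```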

In [2]:
df.shape

(20000, 21)

In [3]:
df[df['Cancelled']==1].head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
194,2011,1,24,1,,,AA,1700,N3BGAA,,...,,,IAH,MIA,964,,,1,A,0
210,2011,1,9,7,,,AA,1820,N4XCAA,,...,,,IAH,DFW,224,,,1,B,0
323,2011,1,11,2,,,B6,624,N537JB,,...,,,HOU,JFK,1428,,,1,B,0
335,2011,1,19,3,,,B6,624,N504JB,,...,,,HOU,JFK,1428,,,1,A,0
347,2011,1,27,4,,,B6,624,N569JB,,...,,,HOU,JFK,1428,,,1,B,0


3904

Let's examine the dependent feature. The values present in the Cancelled column are 1 and 0, which indicates that this is a binary classification machine learning problem.

In [5]:
df['Cancelled'].unique()

array([0, 1], dtype=int64)
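For a binary target it may also be worth looking at how often each class occurs. A small sketch on made-up labels (the series below is an assumption, not the real Cancelled column):

```python
import pandas as pd

# Made-up labels standing in for df['Cancelled'].
cancelled = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="Cancelled")

counts = cancelled.value_counts()                # absolute count per class
share = cancelled.value_counts(normalize=True)   # fraction per class

print(counts)
print(share)
```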

Now that we have established our goal and assigned the dependent and independent variables, leaving us with 1 dependent feature vector and 20 independent feature vectors, we turn our attention to feature engineering. The first feature engineering method we are going to use is the removal of unused attributes.


## 1] Removal of unused attributes/features: can we?
Yes, we can. Here is how and why.

We will follow a blacklisting approach, where we list all the features deemed unhelpful to our statistical model.
I believe the following features are not essential to our goal:


- CancellationCode: reason for cancellation: A = carrier, B = weather, C = national air system, D = security  
- TailNum: airplane tail number 
- FlightNum: flight number  
- TaxiIn: taxi in time in minutes  
- TaxiOut: taxi out time in minutes  
- Diverted: diverted indicator: 1 = Yes, 0 = No

##### I excluded CancellationCode because it is not logically sound to predict flight cancellation while using the cancellation code: it is a piece of information that we are not supposed to have beforehand.

###### TailNum and FlightNum are two forms of ID that are so specific that they would not help our statistical model generalize; keeping them could actually harm the model and lead to overfitting. However, I kept the unique carrier ID (UniqueCarrier) because the specific airline could give our model an indication of whether a flight is going to be cancelled; this needs to be empirically tested in the model development section.

##### Based on my domain knowledge, how much time is spent taxiing does not affect whether or not a flight is cancelled; this is more common sense than domain knowledge, really. TaxiIn and TaxiOut are also empty in the cancelled rows shown above.

###### Diverted: if the plane was diverted, it means that it was not cancelled, so it is also not logically sound to keep this feature.

Note: I chose to do this step first because I would like to spend effort only on the features that I will be using for predictions; it is always better to define your scope.
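The blacklist can also be written down by column name instead of by position. A minimal sketch on a made-up two-row frame (the frame `toy` and the names `blacklist` and `kept` are illustrative assumptions):

```python
import pandas as pd

# Made-up rows using hflights-style column names.
toy = pd.DataFrame({
    "Month": [1, 2],
    "UniqueCarrier": ["AA", "B6"],
    "FlightNum": [428, 624],
    "TailNum": ["N576AA", "N537JB"],
    "Distance": [224, 1428],
    "TaxiIn": [7.0, None],
    "TaxiOut": [13.0, None],
    "Cancelled": [0, 1],
    "CancellationCode": [None, "B"],
    "Diverted": [0, 0],
})

# Features deemed unhelpful in the list above.
blacklist = ["CancellationCode", "TailNum", "FlightNum",
             "TaxiIn", "TaxiOut", "Diverted"]

# errors="ignore" skips any blacklisted name that is not in the frame.
kept = toy.drop(columns=blacklist, errors="ignore")
print(kept.columns.tolist())
```

Dropping by name keeps the intent readable and does not depend on the column order of the file.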

In [6]:
df.head(1)

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,...,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted
0,2011,1,1,6,1400.0,1500.0,AA,428,N576AA,60.0,...,-10.0,0.0,IAH,DFW,224,7.0,13.0,0,,0


In [7]:
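# Keep the first 16 columns (Year through Distance); this leaves out
# TaxiIn, TaxiOut, Cancelled, CancellationCode and Diverted.
data = df.iloc[:, :16]
# Drop FlightNum. Note that TailNum is still present in X after this step.
X = data.drop(['FlightNum'], axis=1)
X.head()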

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,UniqueCarrier,TailNum,ActualElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance
0,2011,1,1,6,1400.0,1500.0,AA,N576AA,60.0,40.0,-10.0,0.0,IAH,DFW,224
1,2011,1,2,7,1401.0,1501.0,AA,N557AA,60.0,45.0,-9.0,1.0,IAH,DFW,224
2,2011,1,3,1,1352.0,1502.0,AA,N541AA,70.0,48.0,-8.0,-8.0,IAH,DFW,224
3,2011,1,4,2,1403.0,1513.0,AA,N403AA,70.0,39.0,3.0,3.0,IAH,DFW,224
4,2011,1,5,3,1405.0,1507.0,AA,N492AA,62.0,44.0,-3.0,5.0,IAH,DFW,224


## 2] Combining Sparse Classifications: can we, after the removal?
We turn our attention to the categorical features that are left:

- UniqueCarrier
- Dest
- Origin

We first check whether they have missing values. They appear not to have any; however, there are missing values in other columns of the data that need to be dealt with later on, outside the scope of this step.

In [8]:
X.isnull().sum(axis = 0)

Year                   0
Month                  0
DayofMonth             0
DayOfWeek              0
DepTime              215
ArrTime              234
UniqueCarrier          0
TailNum              108
ActualElapsedTime    262
AirTime              262
ArrDelay             262
DepDelay             215
Origin                 0
Dest                   0
Distance               0
dtype: int64

In [9]:
print(X['UniqueCarrier'].unique())
print('\n')
print('\n')
print(X['Origin'].unique())
print('\n')
print('\n')
print(X['Dest'].unique())

['AA' 'AS' 'B6' 'CO' 'DL' 'OO' 'UA' 'US' 'WN' 'EV' 'F9' 'FL' 'MQ' 'XE']




['IAH' 'HOU']




['DFW' 'MIA' 'SEA' 'JFK' 'HNL' 'MSY' 'SAT' 'AUS' 'LAX' 'DEN' 'EWR' 'ORD'
 'ONT' 'DCA' 'SFO' 'LAS' 'TPA' 'PDX' 'SJU' 'PHX' 'BWI' 'LGA' 'CLE' 'RDU'
 'BOS' 'IND' 'SAN' 'IAD' 'SJC' 'TUL' 'MFE' 'PIT' 'SLC' 'SNA' 'MCO' 'MSP'
 'EGE' 'ATL' 'PHL' 'DTW' 'FLL' 'CLT' 'RSW' 'MCI' 'SMF' 'ELP' 'OKC' 'ABQ'
 'TUS' 'PBI' 'OMA' 'HDN' 'GUC' 'MTJ' 'BHM' 'CMH' 'COS' 'GRR' 'ASE' 'MKE'
 'BNA' 'CRP' 'DAL' 'ECP' 'HRL' 'JAN' 'JAX' 'LIT' 'MAF' 'MDW' 'OAK' 'STL'
 'MEM' 'CVG' 'SDF' 'BTR' 'RIC' 'LRD' 'ORF' 'PNS' 'CRW' 'XNA' 'AEX' 'BRO'
 'DAY' 'GPT' 'GSO' 'SAV' 'ICT' 'HSV' 'AVL' 'LEX' 'LFT' 'LCH' 'LBB' 'DSM'
 'CHS' 'CAE' 'GSP' 'SHV' 'TYS' 'MOB' 'VPS' 'AMA' 'GRK' 'RNO']


After investigating the unique categorical values of the UniqueCarrier and Origin columns, I have decided to keep them as they are, since they cannot be grouped. However, it would be a good idea to put a pin in Dest: if we ever encounter a problem with our model in the future, we can investigate whether grouping its values helps, although I highly doubt that it would be a problem.
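If grouping Dest ever becomes necessary, one possible approach is to fold rarely seen destinations into a single "Other" class. This is only a sketch on made-up values; the threshold `min_count` and the label "Other" are assumptions:

```python
import pandas as pd

# Made-up destinations standing in for X['Dest'].
dest = pd.Series(["DFW", "DFW", "DFW", "JFK", "JFK", "MIA", "SEA"], name="Dest")

min_count = 2  # hypothetical threshold for a "sparse" class
counts = dest.value_counts()
rare = counts[counts < min_count].index

# Keep frequent destinations, replace rare ones with a shared label.
dest_grouped = dest.where(~dest.isin(rare), "Other")
print(dest_grouped.tolist())
```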

Conclusion.
There is no need to combine sparse classes. However, we do need to impute the missing values with other numerical values (mean, median, etc.). Let us go ahead and perform the third method of feature engineering, which is creating dummy variables.
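The imputation itself is not carried out in this notebook. A minimal sketch of median imputation on a made-up frame (the values are illustrative; only the column names come from the data) could look like this:

```python
import numpy as np
import pandas as pd

# Made-up numeric columns with gaps.
toy = pd.DataFrame({
    "DepDelay": [0.0, 1.0, np.nan, 5.0],
    "AirTime": [40.0, np.nan, 48.0, 44.0],
})

medians = toy.median()          # per-column median, NaN values are skipped
imputed = toy.fillna(medians)   # each NaN is filled with its column's median
print(imputed)
```

Swapping `toy.median()` for `toy.mean()` gives mean imputation instead.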

## 3] Creating Dummy Variables: should we?
Yes, we absolutely should, and here is how.
- All the categorical variables that are present need to be changed into numerical values that do not imply any form of precedence (ordering) to the machine learning model, because we want our model to be able to use them, and to use them fairly.

In [10]:
encoded_columns = pd.get_dummies(X['UniqueCarrier'])
X = X.join(encoded_columns).drop('UniqueCarrier', axis=1)

encoded_columns = pd.get_dummies(X['Origin'])
X = X.join(encoded_columns).drop('Origin', axis=1)

encoded_columns = pd.get_dummies(X['Dest'])
X = X.join(encoded_columns).drop('Dest', axis=1)
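The three steps above can also be done in one call by passing the column names to `pd.get_dummies`. A sketch on a made-up frame; note that, unlike the cell above, this variant prefixes each new column with the name of its source column, and `dtype=int` is used here to get 0/1 values:

```python
import pandas as pd

# Made-up rows using the three categorical columns.
toy = pd.DataFrame({
    "Distance": [224, 1428, 964],
    "UniqueCarrier": ["AA", "B6", "AA"],
    "Origin": ["IAH", "HOU", "IAH"],
    "Dest": ["DFW", "JFK", "MIA"],
})

# Encode all three columns at once; the originals are dropped automatically.
encoded = pd.get_dummies(toy, columns=["UniqueCarrier", "Origin", "Dest"], dtype=int)
print(encoded.columns.tolist())
```

The prefixes avoid a clash if the same code ever appears in two columns, for example an airport that is both an origin and a destination.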

In [11]:
X.iloc[:,12:].head()

Unnamed: 0,AA,AS,B6,CO,DL,EV,F9,FL,MQ,OO,...,SLC,SMF,SNA,STL,TPA,TUL,TUS,TYS,VPS,XNA
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
X

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,TailNum,ActualElapsedTime,AirTime,ArrDelay,...,SLC,SMF,SNA,STL,TPA,TUL,TUS,TYS,VPS,XNA
0,2011,1,1,6,1400.0,1500.0,N576AA,60.0,40.0,-10.0,...,0,0,0,0,0,0,0,0,0,0
1,2011,1,2,7,1401.0,1501.0,N557AA,60.0,45.0,-9.0,...,0,0,0,0,0,0,0,0,0,0
2,2011,1,3,1,1352.0,1502.0,N541AA,70.0,48.0,-8.0,...,0,0,0,0,0,0,0,0,0,0
3,2011,1,4,2,1403.0,1513.0,N403AA,70.0,39.0,3.0,...,0,0,0,0,0,0,0,0,0,0
4,2011,1,5,3,1405.0,1507.0,N492AA,62.0,44.0,-3.0,...,0,0,0,0,0,0,0,0,0,0
5,2011,1,6,4,1359.0,1503.0,N262AA,64.0,45.0,-7.0,...,0,0,0,0,0,0,0,0,0,0
6,2011,1,7,5,1359.0,1509.0,N493AA,70.0,43.0,-1.0,...,0,0,0,0,0,0,0,0,0,0
7,2011,1,8,6,1355.0,1454.0,N477AA,59.0,40.0,-16.0,...,0,0,0,0,0,0,0,0,0,0
8,2011,1,9,7,1443.0,1554.0,N476AA,71.0,41.0,44.0,...,0,0,0,0,0,0,0,0,0,0
9,2011,1,10,1,1443.0,1553.0,N504AA,70.0,45.0,43.0,...,0,0,0,0,0,0,0,0,0,0


## 4] Creating interaction features
I believe Distance should be binned, as it has a plethora of distinct values ranging from 127 to 3904, and it would be a lot more useful to group them. We could use pd.cut() to establish the bins, as we learned in the grouping lectures; I believe that would be the best approach.

For example, the quartiles of Distance could serve as bin edges:

In [23]:
from scipy.stats.mstats import mquantiles
mquantiles(X.Distance)

array([ 351.,  787., 1034.])
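A sketch of how `pd.cut()` might use these values, assuming the quartiles above plus the 127 and 3904 mentioned earlier as bin edges. The sample distances and the bin labels are made up for illustration:

```python
import pandas as pd

# Made-up distances standing in for X['Distance'].
distance = pd.Series([127, 224, 351, 787, 964, 1034, 1428, 3904], name="Distance")

edges = [127, 351, 787, 1034, 3904]                 # min, quartiles, max
labels = ["short", "medium", "long", "very_long"]   # hypothetical labels

# include_lowest=True makes the first bin closed on the left, so 127 is kept.
distance_bin = pd.cut(distance, bins=edges, labels=labels, include_lowest=True)
print(distance_bin.tolist())
```

`pd.qcut(distance, q=4)` would be an alternative that derives quantile-based edges directly from the data.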

## 5] Creating Indicators
I think we should create a new column holding the season, derived from the time-related columns, which tells us in which of the four seasons the flight took place. I believe that it will have a huge impact on the model, as there are seasons in which the weather tends to be bad. Our newly created indicator would be classified as 1, 2, 3, 4, one value for each of the seasons.

In [None]:
from datetime import date, datetime

Y = 2000 # dummy leap year to allow input X-02-29 (leap day)
seasons = [('winter', (date(Y,  1,  1),  date(Y,  3, 20))),
           ('spring', (date(Y,  3, 21),  date(Y,  6, 20))),
           ('summer', (date(Y,  6, 21),  date(Y,  9, 22))),
           ('autumn', (date(Y,  9, 23),  date(Y, 12, 20))),
           ('winter', (date(Y, 12, 21),  date(Y, 12, 31)))]

def get_season(now):
    if isinstance(now, datetime):
        now = now.date()
    now = now.replace(year=Y)
    return next(season for season, (start, end) in seasons
                if start <= now <= end)

print(get_season(date.today()))
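To get the 1, 2, 3, 4 coding described above directly on the data frame, a simpler month-based approximation is possible. This is a different technique from `get_season`: it treats December to February as winter (1), March to May as spring (2), June to August as summer (3) and September to November as autumn (4), and ignores the exact start dates. The frame `toy` is made up:

```python
import pandas as pd

# Made-up rows using the hflights date columns.
toy = pd.DataFrame({
    "Year": [2011, 2011, 2011, 2011, 2011],
    "Month": [1, 4, 7, 10, 12],
    "DayofMonth": [15, 15, 15, 15, 15],
})

# Month % 12 maps December to 0, so Dec/Jan/Feb share the first group.
toy["Season"] = (toy["Month"] % 12) // 3 + 1
print(toy[["Month", "Season"]])
```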

References:
https://stackoverflow.com/questions/16139306/determine-season-given-timestamp-in-python-using-datetime