# Foundations of Data Science - Group 12



Flight Delays

Let us begin by importing the required libraries into our code. We will use the `pandas` library for dataframe operations.

In [None]:
import pandas as pd
import numpy as np
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics, preprocessing
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import CondensedNearestNeighbour

We read the dataset from the CSV file into our dataframe object `flights_data`, and get an initial feel for it by printing the first 5 rows.

In [None]:
flights_data = pd.read_csv('Flight_delay.csv')
flights_data.head()

Unnamed: 0,DayOfWeek,Date,DepTime,ArrTime,CRSArrTime,UniqueCarrier,Airline,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Org_Airport,Dest,Dest_Airport,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,4,03-01-2019,1829,1959,1925,WN,Southwest Airlines Co.,3920,N464WN,90,90,77,34,34,IND,Indianapolis International Airport,BWI,Baltimore-Washington International Airport,515,3,10,0,N,0,2,0,0,0,32
1,4,03-01-2019,1937,2037,1940,WN,Southwest Airlines Co.,509,N763SW,240,250,230,57,67,IND,Indianapolis International Airport,LAS,McCarran International Airport,1591,3,7,0,N,0,10,0,0,0,47
2,4,03-01-2019,1644,1845,1725,WN,Southwest Airlines Co.,1333,N334SW,121,135,107,80,94,IND,Indianapolis International Airport,MCO,Orlando International Airport,828,6,8,0,N,0,8,0,0,0,72
3,4,03-01-2019,1452,1640,1625,WN,Southwest Airlines Co.,675,N286WN,228,240,213,15,27,IND,Indianapolis International Airport,PHX,Phoenix Sky Harbor International Airport,1489,7,8,0,N,0,3,0,0,0,12
4,4,03-01-2019,1323,1526,1510,WN,Southwest Airlines Co.,4,N674AA,123,135,110,16,28,IND,Indianapolis International Airport,TPA,Tampa International Airport,838,4,9,0,N,0,0,0,0,0,16


We see that the data is in a raw format, with no normalization or scaling done. Let's find the data types of the attributes.

In [None]:
flights_data.dtypes

We see that most attributes have the right types, but we can do better. `DayOfWeek`, `FlightNum`, `Cancelled` and `Diverted` are categorical codes, so we store them as strings, and the time columns are converted into timestamps.

In [None]:
for column in ['DayOfWeek', 'FlightNum', 'Cancelled', 'Diverted']:
  flights_data[column] = flights_data[column].apply(str)

In [None]:
for column in ['DepTime', 'ArrTime', 'CRSArrTime']:
  # Pad to four digits: 958 -> 0958
  flights_data[column] = flights_data[column].map("{:04}".format)
  # Insert a colon after the first two characters: 0958 -> 09:58
  flights_data[column] =flights_data[column].astype(str).replace(r"(\d{2})(\d+)", r"\1:\2", regex=True)
  # Change 24:00 to 00:00, because pd.to_datetime raises an error on an hour of 24
  flights_data[column] = flights_data[column].replace(to_replace ='24:', value = '00:', regex = True)
  # Prefix the time with the value from the 'Date' column
  flights_data[column] = flights_data.Date.map(str) + " " + flights_data[column]
  # Parse the combined string as a timestamp. Note that pd.to_datetime reads the ambiguous
  # date 03-01-2019 month-first (March 1); pass dayfirst=True if the file is day-first.
  flights_data[column] = pd.to_datetime(flights_data[column])
flights_data.head()

Unnamed: 0,DayOfWeek,Date,DepTime,ArrTime,CRSArrTime,UniqueCarrier,Airline,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Org_Airport,Dest,Dest_Airport,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,4,03-01-2019,2019-03-01 18:29:00,2019-03-01 19:59:00,2019-03-01 19:25:00,WN,Southwest Airlines Co.,3920,N464WN,90,90,77,34,34,IND,Indianapolis International Airport,BWI,Baltimore-Washington International Airport,515,3,10,0,N,0,2,0,0,0,32
1,4,03-01-2019,2019-03-01 19:37:00,2019-03-01 20:37:00,2019-03-01 19:40:00,WN,Southwest Airlines Co.,509,N763SW,240,250,230,57,67,IND,Indianapolis International Airport,LAS,McCarran International Airport,1591,3,7,0,N,0,10,0,0,0,47
2,4,03-01-2019,2019-03-01 16:44:00,2019-03-01 18:45:00,2019-03-01 17:25:00,WN,Southwest Airlines Co.,1333,N334SW,121,135,107,80,94,IND,Indianapolis International Airport,MCO,Orlando International Airport,828,6,8,0,N,0,8,0,0,0,72
3,4,03-01-2019,2019-03-01 14:52:00,2019-03-01 16:40:00,2019-03-01 16:25:00,WN,Southwest Airlines Co.,675,N286WN,228,240,213,15,27,IND,Indianapolis International Airport,PHX,Phoenix Sky Harbor International Airport,1489,7,8,0,N,0,3,0,0,0,12
4,4,03-01-2019,2019-03-01 13:23:00,2019-03-01 15:26:00,2019-03-01 15:10:00,WN,Southwest Airlines Co.,4,N674AA,123,135,110,16,28,IND,Indianapolis International Airport,TPA,Tampa International Airport,838,4,9,0,N,0,0,0,0,0,16
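The conversion steps above can be checked on a small hand-made frame. This is only a sketch: the helper `hhmm_to_timestamp` and the frame `sample` are hypothetical names, and it assumes the `Date` column is day-first (DD-MM-YYYY), which would need to be confirmed against the source of the CSV. Like the loop above, it maps an hour of 24 to 00 on the same date rather than rolling over to the next day.

```python
import pandas as pd

def hhmm_to_timestamp(dates, hhmm, dayfirst=True):
    """Combine a date string column with integer HHMM times into timestamps."""
    # Zero-pad to four digits: 958 -> "0958"
    padded = hhmm.astype(int).astype(str).str.zfill(4)
    # An hour of 24 is not a valid clock hour; map it to 00
    padded = padded.str.replace(r"^24", "00", regex=True)
    text = dates.astype(str) + " " + padded.str[:2] + ":" + padded.str[2:]
    # An explicit format removes the day-first / month-first ambiguity
    fmt = "%d-%m-%Y %H:%M" if dayfirst else "%m-%d-%Y %H:%M"
    return pd.to_datetime(text, format=fmt)

# Hypothetical two-row sample in the same layout as the flight data
sample = pd.DataFrame({"Date": ["03-01-2019", "03-01-2019"], "DepTime": [1829, 958]})
sample["DepTime"] = hhmm_to_timestamp(sample["Date"], sample["DepTime"])
print(sample)
```

Passing an explicit `format` makes the intended reading of the date visible in the code, instead of leaving it to the parser's default.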


We will now find the range of each column to see if normalization to a similar scale is necessary. We check only the continuous variables, since categorical variables do not require normalization.

In [None]:
#Name of the columns having numeric values
numeric=flights_data.select_dtypes(include=np.number).columns.tolist()
print(numeric)

['ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Distance', 'TaxiIn', 'TaxiOut', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']


In [None]:
for column in numeric:
    print("{:40} Min:{:6} \tMax:{:6}".format(column, flights_data[column].min(), flights_data[column].max()))

ActualElapsedTime                        Min:    15 	Max:   727
CRSElapsedTime                           Min:   -21 	Max:   602
AirTime                                  Min:     0 	Max:   609
ArrDelay                                 Min:    15 	Max:  1707
DepDelay                                 Min:     6 	Max:  1710
Distance                                 Min:    31 	Max:  4502
TaxiIn                                   Min:     0 	Max:   207
TaxiOut                                  Min:     0 	Max:   383
CarrierDelay                             Min:     0 	Max:  1707
WeatherDelay                             Min:     0 	Max:  1148
NASDelay                                 Min:     0 	Max:  1357
SecurityDelay                            Min:     0 	Max:   392
LateAircraftDelay                        Min:     0 	Max:  1254


There is no strong reason to normalize, but we will go ahead with normalization for now, just for comparison. We will make use of the `apply` function to rescale these columns to values between 0 and 1.

In [None]:
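cols_to_norm = numeric
# Work on a copy so that the raw values in flights_data are preserved
normalized_flights_data = flights_data.copy()
normalized_flights_data[cols_to_norm] = flights_data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
normalized_flights_data.head()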
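As a small illustration of the min-max rescaling used here, the sketch below applies the same formula to a toy frame. The helper name `min_max_normalize` and the frame `demo` are hypothetical, and the guard for constant columns is an addition that the cell above does not have (a constant column would otherwise divide by zero).

```python
import pandas as pd

def min_max_normalize(frame, columns):
    """Return a copy of `frame` with `columns` rescaled to the range [0, 1]."""
    out = frame.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        span = hi - lo
        # A constant column has zero span; map it to 0.0 instead of dividing by zero
        out[col] = 0.0 if span == 0 else (out[col] - lo) / span
    return out

# Hypothetical toy values, not taken from the dataset
demo = pd.DataFrame({"Distance": [31, 515, 4502], "TaxiIn": [3, 3, 3]})
scaled = min_max_normalize(demo, ["Distance", "TaxiIn"])
print(scaled)
```

Returning a copy keeps the raw frame available for models that should see the unscaled values.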

In [None]:
normalized_flights_data.to_csv('Normalized Flight data.csv')

As we see here, all the continuous variables have been normalized to a value between 0 and 1. Now let's perform some rudimentary analysis to see how much correlation exists between the attributes.

In [None]:
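# Restrict to the numeric columns; recent pandas versions raise an error on non-numeric columns
corr_matrix = flights_data[numeric].corr(method = 'pearson')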
high_corr = []
for row in numeric:
    for column in numeric:
        if row != column:
            if abs(corr_matrix[row][column]) >= 0.75:
                if [column,row] not in high_corr:
                    high_corr.append([row,column])
                    print("{} x {} : {}".format(row,column,corr_matrix[row][column]))
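The same pair search can be written without the nested loops by keeping only the upper triangle of the correlation matrix, so that each pair appears once. This is a sketch on a toy frame; `high_correlation_pairs` and the columns `a`, `b`, `c` are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame, threshold=0.75):
    """Return the column pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.corr(method="pearson")
    # Keep the strict upper triangle so each pair is listed once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Missing entries compare as False, so they are filtered out here as well
    return pairs[pairs.abs() >= threshold]

# Hypothetical toy data: b is almost a multiple of a, c is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
})
strong = high_correlation_pairs(toy)
print(strong)
```

The result is a Series indexed by `(row, column)` pairs, which is convenient if one feature of each pair is to be dropped later.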

### TODO:
We see that there are six pairs of attributes that have high correlation between them. I have chosen 0.75 as an arbitrary threshold for feature selection.

Now, let's see if there are any columns that are constant, that is, columns that have only one value across all the observations.

In [None]:
single_valued_columns = flights_data.columns[flights_data.nunique() <= 1]
single_valued_columns

Index(['Cancelled', 'CancellationCode', 'Diverted'], dtype='object')

Three columns have a constant value and thus have no information to contribute to our models.

In [None]:
flights_data = flights_data.drop(single_valued_columns, axis = 1)

Now let's learn something about the target column.

In [None]:
delay_columns = ["CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay"]
flights_data["Delayed"] = flights_data[delay_columns].sum(axis=1)
flights_data["Delayed"] = [1 if (i > 0) else 0 for i in flights_data["Delayed"]]
flights_data["Delayed"] = flights_data["Delayed"].astype('category')
len(flights_data["Delayed"].unique())

1

There is only 1 unique target class, not the 2 we expected. Let's now have a look at how the observations are distributed.

In [None]:
flights_data['Delayed'].value_counts()

1    484551
Name: Delayed, dtype: int64

All 484,551 observations fall into class 1: every flight in this dataset is delayed (the minimum `ArrDelay` is 15 minutes). With no on-time flights there is no second class to sample from, so sampling methods cannot help and a delayed/not-delayed target cannot be learned from this data. We therefore drop the column.

In [None]:
flights_data = flights_data.drop(['Delayed'], axis = 1)

Initially we'll just try to do multiclass classification, before we step into prediction. So, I am replacing the five delay columns with a single column that records which type of delay contributed the most minutes.

In [None]:
target_column = []
for index, data in flights_data.iterrows():
  max_value = -1
  target_class = None
  for column in delay_columns:
    if data[column] >= max_value:
      max_value = data[column]
      target_class = column
  target_column.append(target_class)
flights_data['target'] = target_column
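The row-by-row loop above can be slow on a frame of this size. A vectorized sketch of the same rule follows; `dominant_delay` and the frame `demo` are hypothetical names, and the sketch assumes the delay columns contain no missing values. Because the loop compares with `>=`, ties go to the last listed column, and reversing the column order before `idxmax` reproduces that behaviour.

```python
import pandas as pd

delay_columns = ["CarrierDelay", "WeatherDelay", "NASDelay",
                 "SecurityDelay", "LateAircraftDelay"]

def dominant_delay(frame, columns):
    """Name the delay column with the largest value in each row."""
    # idxmax returns the first column holding the row maximum, so reversing the
    # order makes ties resolve to the last listed column, as the `>=` loop does
    return frame[columns[::-1]].idxmax(axis=1)

# Hypothetical rows: a clear winner, two ties, and an all-zero row
demo = pd.DataFrame({
    "CarrierDelay":      [2, 10, 5, 0],
    "WeatherDelay":      [0,  0, 5, 0],
    "NASDelay":          [0,  0, 0, 0],
    "SecurityDelay":     [0,  0, 0, 0],
    "LateAircraftDelay": [32, 10, 0, 0],
})
labels = dominant_delay(demo, delay_columns)
print(labels.tolist())
```

Note that a row with no delay minutes at all is still labelled with the last column, which is also what the loop does.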

In [None]:
multi_class_flight_data = flights_data.drop(delay_columns, axis = 1)
multi_class_flight_data['target'].value_counts()

LateAircraftDelay    230607
CarrierDelay         144832
NASDelay              90316
WeatherDelay          17771
SecurityDelay          1025
Name: target, dtype: int64

In [None]:
multi_class_flight_data.to_csv('MultiClassFlight.csv')

In [None]:
multi_class_flight_data.head()

Unnamed: 0,DayOfWeek,Date,DepTime,ArrTime,CRSArrTime,UniqueCarrier,Airline,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Org_Airport,Dest,Dest_Airport,Distance,TaxiIn,TaxiOut,target
0,4,03-01-2019,2019-03-01 18:29:00,2019-03-01 19:59:00,2019-03-01 19:25:00,WN,Southwest Airlines Co.,3920,N464WN,90,90,77,34,34,IND,Indianapolis International Airport,BWI,Baltimore-Washington International Airport,515,3,10,LateAircraftDelay
1,4,03-01-2019,2019-03-01 19:37:00,2019-03-01 20:37:00,2019-03-01 19:40:00,WN,Southwest Airlines Co.,509,N763SW,240,250,230,57,67,IND,Indianapolis International Airport,LAS,McCarran International Airport,1591,3,7,LateAircraftDelay
2,4,03-01-2019,2019-03-01 16:44:00,2019-03-01 18:45:00,2019-03-01 17:25:00,WN,Southwest Airlines Co.,1333,N334SW,121,135,107,80,94,IND,Indianapolis International Airport,MCO,Orlando International Airport,828,6,8,LateAircraftDelay
3,4,03-01-2019,2019-03-01 14:52:00,2019-03-01 16:40:00,2019-03-01 16:25:00,WN,Southwest Airlines Co.,675,N286WN,228,240,213,15,27,IND,Indianapolis International Airport,PHX,Phoenix Sky Harbor International Airport,1489,7,8,LateAircraftDelay
4,4,03-01-2019,2019-03-01 13:23:00,2019-03-01 15:26:00,2019-03-01 15:10:00,WN,Southwest Airlines Co.,4,N674AA,123,135,110,16,28,IND,Indianapolis International Airport,TPA,Tampa International Airport,838,4,9,LateAircraftDelay


## TODO: Class Balancing

## TODO: Encode categorical variables

In [None]:
target = multi_class_flight_data['target']
cols_to_drop = ['Date', 'DepTime', 'ArrTime', 'CRSArrTime', 'UniqueCarrier', 'Airline', 'FlightNum', 'TailNum', 'Origin', 'Org_Airport', 'Dest', 'Dest_Airport', 'target']
multi_class_flight_data = multi_class_flight_data.drop(cols_to_drop, axis = 1)

flights_data_train, flights_data_test, target_train, target_test = train_test_split(multi_class_flight_data, target, test_size = 0.3, random_state = 11, shuffle = True, stratify = target)
# 28% of the remaining 70% gives a validation set of about 20% of the full data
flights_data_train, flights_data_validation, target_train, target_validation = train_test_split(flights_data_train, target_train, test_size = 0.28, shuffle = True, stratify = target_train)

In [None]:
multi_class_flight_data.head()

We now have the data divided into training (about 50%), validation (about 20%) and testing (30%) sets. Class balancing is still a TODO above, so the stratified splits keep the original class proportions. Before we begin building our models, let's define a function `PerformanceMetrics` to evaluate them.
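The split proportions follow from applying the second `test_size` to what remains after the first split. A short worked calculation, with the hypothetical helper `split_fractions`:

```python
def split_fractions(test_size, validation_size):
    """Fractions of the full data that end up in each split after two successive splits."""
    remainder = 1.0 - test_size
    # The validation share is taken from the remainder, not from the full data
    validation = remainder * validation_size
    train = remainder - validation
    return {"train": train, "validation": validation, "test": test_size}

fractions = split_fractions(0.30, 0.28)
print({name: round(value, 3) for name, value in fractions.items()})
```

With `test_size = 0.3` and then `0.28`, the validation share is 0.7 × 0.28 = 0.196 of the full data, which agrees with the 94,972 validation rows out of 484,551 reported in the outputs further down.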


In [None]:
def PerformanceMetrics(actual_values, predicted_values):
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(actual_values, predicted_values))
    print("\nAccuracy:", metrics.accuracy_score(actual_values, predicted_values))
    print("\nClassification Metrics:")
    print(metrics.classification_report(actual_values, predicted_values))
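To see what these metrics look like, here is a sketch on five hand-made labels. The function `summarize_predictions` and the labels `a`, `b`, `c` are hypothetical; it returns the values instead of printing them so that they can be inspected. A class that is never predicted has an undefined precision, which scikit-learn reports with a warning; `zero_division=0` sets the value to 0 explicitly.

```python
from sklearn import metrics

def summarize_predictions(actual_values, predicted_values):
    """Return the confusion matrix, accuracy and per-class report as a dict."""
    return {
        "confusion": metrics.confusion_matrix(actual_values, predicted_values),
        "accuracy": metrics.accuracy_score(actual_values, predicted_values),
        # zero_division=0 reports 0.0 for a class that is never predicted
        "report": metrics.classification_report(
            actual_values, predicted_values, zero_division=0),
    }

# Hypothetical labels: class "c" occurs once and is never predicted
actual = ["a", "a", "b", "b", "c"]
predicted = ["a", "b", "b", "b", "a"]
summary = summarize_predictions(actual, predicted)
print(summary["confusion"])
print(summary["accuracy"])
print(summary["report"])
```

Rows of the confusion matrix are the actual classes and columns are the predicted classes, both in sorted label order.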

### Multinomial Logistic Regression

We begin with Multinomial Logistic Regression. Logistic regression is mainly defined for binary classes, but scikit-learn's `LogisticRegression` can be adapted to a multiclass implementation with a single parameter, `multi_class`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Import Logistic Regression model
from sklearn.linear_model import LogisticRegression
#Create a LR Classifier
LRmodel = LogisticRegression(multi_class = 'auto', verbose = 1)
# Train the model using the training sets
LRmodel.fit(flights_data_train, target_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   19.1s finished


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=1,
                   warm_start=False)

The model did not take long to build, although the solver stopped at its iteration limit before converging. We will now look at how it performs on the validation and test data.

In [None]:
#Evaluate model performance on the validation set
target_pred_val = LRmodel.predict(flights_data_validation)
PerformanceMetrics(target_validation, target_pred_val)

Confusion Matrix:
[[  451 26016  1914     0     6]
 [  449 41902  2848     0     0]
 [   41  7994  9666     0     1]
 [    4   177    20     0     0]
 [    8  2915   558     0     2]]

Accuracy: 0.5477509160594701

Classification Metrics:


                   precision    recall  f1-score   support

     CarrierDelay       0.47      0.02      0.03     28387
LateAircraftDelay       0.53      0.93      0.67     45199
         NASDelay       0.64      0.55      0.59     17702
    SecurityDelay       0.00      0.00      0.00       201
     WeatherDelay       0.22      0.00      0.00      3483

         accuracy                           0.55     94972
        macro avg       0.37      0.30      0.26     94972
     weighted avg       0.52      0.55      0.44     94972



In [None]:
#Evaluate model performance on the test set
target_pred = LRmodel.predict(flights_data_test)
PerformanceMetrics(target_test, target_pred)

Confusion Matrix:
[[  701 39760  2979     0    10]
 [  704 63988  4489     0     1]
 [   70 12336 14686     0     3]
 [    4   273    31     0     0]
 [   15  4469   840     0     7]]

Accuracy: 0.5460836784392499

Classification Metrics:


                   precision    recall  f1-score   support

     CarrierDelay       0.47      0.02      0.03     43450
LateAircraftDelay       0.53      0.92      0.67     69182
         NASDelay       0.64      0.54      0.59     27095
    SecurityDelay       0.00      0.00      0.00       308
     WeatherDelay       0.33      0.00      0.00      5331

         accuracy                           0.55    145366
        macro avg       0.39      0.30      0.26    145366
     weighted avg       0.52      0.55      0.44    145366



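LR does not seem to be a very good learner. It assigns most observations to `LateAircraftDelay` (recall 0.92 on the test set), does moderately well on `NASDelay` (F1 of 0.59), and almost never predicts the other three classes: `CarrierDelay` has a recall of 0.02, while `SecurityDelay` and `WeatherDelay` are effectively ignored. This could be a symptom of how the classes are distributed in the training data.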
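The solver output for this model stops at the iteration limit and suggests scaling the data or raising `max_iter`. A minimal sketch of that suggestion follows, on synthetic data rather than the flight data; the names `scaled_lr`, `X` and `y` are hypothetical, and whether this would change the scores reported above is not something this sketch can tell us.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flight features (assumption: not the real data)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=11)
X[:, 0] *= 1000.0  # mimic one column on a much larger scale than the others

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=11, stratify=y)

# The scaler is fitted on the training data only and reused at prediction time
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)
val_accuracy = scaled_lr.score(X_val, y_val)
print("Validation accuracy:", val_accuracy)
```

Putting the scaler inside the pipeline keeps validation and test rows out of the scaling statistics.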

### Is class balancing the culprit?
Let us take a diversion here, and see how the LR model performs on the unbalanced dataset. This dataset is still normalized and the unnecessary features have been removed.

In [None]:
# Note: `coverdata` is not created earlier in this notebook; load the unbalanced dataset into it before running this cell
coverdata_train, coverdata_test, target_train, target_test = train_test_split(coverdata, target, test_size = 0.3, random_state = 11, shuffle = True, stratify = target)
coverdata_train, coverdata_validation, target_train, target_validation = train_test_split(coverdata_train, target_train, test_size = 0.28, shuffle = True, stratify = target_train)

#Import Logistic Regression model
from sklearn.linear_model import LogisticRegression
#Create a LR Classifier
LRmodel = LogisticRegression(multi_class = 'auto', verbose = 1)
# Train the model using the training sets
LRmodel.fit(coverdata_train, target_train)

We define a new model so that what was learned from the balanced dataset does not affect the decisions of this model.

In [None]:
#Evaluate model performance on the validation set
target_pred_val = LRmodel.predict(coverdata_validation)
PerformanceMetrics(target_validation, target_pred_val)

In [None]:
#Evaluate model performance on the test set
target_pred = LRmodel.predict(coverdata_test)
PerformanceMetrics(target_test, target_pred)

Performance changes markedly. It does seem that the class balancing has affected fewer classes than expected. Classes 1, 2 and 3 performed better with the unbalanced set (not surprising, as they had the most samples), whereas the performance with class 7 is not as bad as expected. But the significant gains in the balanced set are with classes 4, 5 and 6, which were well above expectations.

The overall accuracy is lower after balancing the dataset, which might suggest the model was overfitting the unbalanced dataset. This is reinforced by the smaller performance gap between the classes in the balanced implementation.

It seems to me that we might have balanced a little more than necessary, and we could have still retained some more samples for the underperforming classes.

### Naive-Bayes

Let's try implementing Naive-Bayes. scikit-learn's `MultinomialNB` does not work with negative values, so we will first find which columns have negative values in them.

In [None]:
(flights_data_train.iloc[:,1:8] < 0 ).any()

In [None]:
(flights_data_validation.iloc[:,1:8] < 0 ).any()

In [None]:
(flights_data_test.iloc[:,1:8] < 0 ).any()

Initially, I ran Naive-Bayes without normalizing the data, which is why these checks were done. Now there are no negative values due to normalization, so we can proceed without making any changes.

In [None]:
#Import Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
#Create a Multinomial NB Classifier
NBmodel = MultinomialNB()
# Train the model using the training sets
NBmodel.fit(flights_data_train, target_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Now with the model built, let's look at performance.

In [None]:
#Evaluate model performance on the validation set
target_pred_val = NBmodel.predict(flights_data_validation)
PerformanceMetrics(target_validation, target_pred_val)

Confusion Matrix:
[[ 2525  2881  4100  9649  9232]
 [ 4812  5920  5645 12245 16577]
 [  552   651  5680  6098  4721]
 [   17    17    41    92    34]
 [  236   291   463   687  1806]]

Accuracy: 0.16871288379732974

Classification Metrics:
                   precision    recall  f1-score   support

     CarrierDelay       0.31      0.09      0.14     28387
LateAircraftDelay       0.61      0.13      0.22     45199
         NASDelay       0.36      0.32      0.34     17702
    SecurityDelay       0.00      0.46      0.01       201
     WeatherDelay       0.06      0.52      0.10      3483

         accuracy                           0.17     94972
        macro avg       0.27      0.30      0.16     94972
     weighted avg       0.45      0.17      0.21     94972



In [None]:
#Evaluate model performance on the test set
target_pred = NBmodel.predict(flights_data_test)
PerformanceMetrics(target_test, target_pred)

Confusion Matrix:
[[ 4038  4296  6250 14900 13966]
 [ 7350  9195  8430 18608 25599]
 [  817  1032  8666  9444  7136]
 [   20    26    63   146    53]
 [  336   438   750  1007  2800]]

Accuracy: 0.1709134185435384

Classification Metrics:
                   precision    recall  f1-score   support

     CarrierDelay       0.32      0.09      0.14     43450
LateAircraftDelay       0.61      0.13      0.22     69182
         NASDelay       0.36      0.32      0.34     27095
    SecurityDelay       0.00      0.47      0.01       308
     WeatherDelay       0.06      0.53      0.10      5331

         accuracy                           0.17    145366
        macro avg       0.27      0.31      0.16    145366
     weighted avg       0.46      0.17      0.21    145366



This is much worse than Logistic Regression overall, with accuracy falling from about 55% to about 17%. `LateAircraftDelay` and `NASDelay` lose most of their recall, while the two smallest classes, `SecurityDelay` and `WeatherDelay`, reach a recall of about 0.5. Their precision stays close to zero, however, because the model predicts them far more often than they occur.

I am not exactly sure what to conclude here, but it looks like we have overshot the equilibrium between classes that is needed for good performance on all classes in Logistic Regression and Naive Bayes. To verify this conclusion, we would have to train and test both classification algorithms on data with various degrees of class balancing, which I am omitting for this Homework.

It is however important to note that the overall performance of Naive-Bayes has nearly doubled from an abysmal mid-30% to 59%. This was achieved by normalizing the data to values between 0 and 1.

### K-Nearest Neighbours

In [None]:
#Import KNN model
from sklearn.neighbors import KNeighborsClassifier
#Create a KNN Classifier
KNNmodel = KNeighborsClassifier(n_neighbors = 1)
# Train the model using the training sets
KNNmodel.fit(flights_data_train, target_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

Manually running k-NN with `n_neighbors` from 1 to 15 showed little change in accuracy, with the best performance, given below, at `n_neighbors = 1`.

In [None]:
#Evaluate model performance on the validation set
target_pred_val = KNNmodel.predict(flights_data_validation)
PerformanceMetrics(target_validation, target_pred_val)

Confusion Matrix:
[[10885 13630  2739    58  1075]
 [13432 26126  4014   103  1524]
 [ 2998  4164 10013    26   501]
 [   79    86    23     3    10]
 [ 1142  1578   514     6   243]]

Accuracy: 0.4977256454533968

Classification Metrics:
                   precision    recall  f1-score   support

     CarrierDelay       0.38      0.38      0.38     28387
LateAircraftDelay       0.57      0.58      0.58     45199
         NASDelay       0.58      0.57      0.57     17702
    SecurityDelay       0.02      0.01      0.02       201
     WeatherDelay       0.07      0.07      0.07      3483

         accuracy                           0.50     94972
        macro avg       0.32      0.32      0.32     94972
     weighted avg       0.50      0.50      0.50     94972



In [None]:
#Evaluate model performance on the test set
target_pred = KNNmodel.predict(flights_data_test)
PerformanceMetrics(target_test, target_pred)

Confusion Matrix:
[[16402 20890  4368   106  1684]
 [20365 40111  6300   136  2270]
 [ 4563  6509 15267    32   724]
 [  121   142    36     4     5]
 [ 1737  2367   813    11   403]]

Accuracy: 0.4965879228980642

Classification Metrics:
                   precision    recall  f1-score   support

     CarrierDelay       0.38      0.38      0.38     43450
LateAircraftDelay       0.57      0.58      0.58     69182
         NASDelay       0.57      0.56      0.57     27095
    SecurityDelay       0.01      0.01      0.01       308
     WeatherDelay       0.08      0.08      0.08      5331

         accuracy                           0.50    145366
        macro avg       0.32      0.32      0.32    145366
     weighted avg       0.50      0.50      0.50    145366



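Overall accuracy (about 50%) is slightly below that of Logistic Regression (about 55%), but k-NN is more even across the three large classes: the F1 score for `CarrierDelay` rises from 0.03 to 0.38. The trend for the rare classes is still the same, with `SecurityDelay` and `WeatherDelay` remaining poorly predicted.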

Another odd observation is that k-NN performs best when `n_neighbors = 1`. This suggests high similarity between pairs of samples, but a single-neighbour model is unlikely to generalize well, particularly considering our limited data.
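The manual search over `n_neighbors` can be scripted. The sketch below uses synthetic data, since the point is only the shape of the loop; `sweep_neighbors` is a hypothetical helper, and the candidate values of `k` are an arbitrary choice within the 1 to 15 range mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def sweep_neighbors(X_train, y_train, X_val, y_val, k_values):
    """Fit one k-NN model per value of k and return its validation accuracy."""
    scores = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        scores[k] = model.score(X_val, y_val)
    return scores

# Synthetic stand-in for the flight features (assumption: not the real data)
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

scores = sweep_neighbors(X_train, y_train, X_val, y_val, range(1, 16, 2))
best_k = max(scores, key=scores.get)
print(scores)
print("Best k on the validation set:", best_k)
```

Choosing `k` on the validation set and reporting the test score only once keeps the test set out of the tuning.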

### Multinomial SVM

In [None]:
#import Multinomial SVM
from sklearn import svm
#Create an SVM Classifier
SVMmodel = svm.SVC(kernel = 'rbf', verbose = 1, decision_function_shape = 'ovr')
# Train the model using the training sets
SVMmodel.fit(flights_data_train, target_train)

This has been one of our slowest models yet, taking the longest time to build. The build time was much worse before we balanced the data, often taking no less than 6 minutes each time. This was expected, though, since SVMs are among the most complex models to train.

In [None]:
#Evaluate model performance on the validation set
target_pred_val = SVMmodel.predict(flights_data_validation)
PerformanceMetrics(target_validation, target_pred_val)

In [None]:
#Evaluate model performance on the test set
target_pred = SVMmodel.predict(flights_data_test)
PerformanceMetrics(target_test, target_pred)

The trend between the classes continues, and performance is slightly better than Logistic Regression and Naive-Bayes. Before balancing the data, the overall accuracy was slightly better at 70%, as was the case for many previous models.

### Decision Tree

In [None]:
#Import tree
from sklearn import tree
#Create a Decision Tree Classifier
DTmodel = tree.DecisionTreeClassifier(random_state = 0)
# Train the model using the training sets
DTmodel.fit(flights_data_train, target_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

A very quick build. To ensure that randomness is not a factor in building the tree on each run, we have set `random_state = 0`.

In [None]:
#Evaluate model performance on the validation set
target_pred_val = DTmodel.predict(flights_data_validation)
PerformanceMetrics(target_validation, target_pred_val)

Confusion Matrix:
[[10600 13464  2923    99  1301]
 [13858 25130  4264   116  1831]
 [ 2730  4022 10333    37   580]
 [   84    84    23     0    10]
 [ 1119  1561   528     8   267]]

Accuracy: 0.48782799140799393

Classification Metrics:
                   precision    recall  f1-score   support

     CarrierDelay       0.37      0.37      0.37     28387
LateAircraftDelay       0.57      0.56      0.56     45199
         NASDelay       0.57      0.58      0.58     17702
    SecurityDelay       0.00      0.00      0.00       201
     WeatherDelay       0.07      0.08      0.07      3483

         accuracy                           0.49     94972
        macro avg       0.32      0.32      0.32     94972
     weighted avg       0.49      0.49      0.49     94972



In [None]:
#Evaluate model performance on the test set
target_pred = DTmodel.predict(flights_data_test)
PerformanceMetrics(target_test, target_pred)

Confusion Matrix:
[[15990 20963  4424   136  1937]
 [21333 38337  6576   183  2753]
 [ 4201  6141 15868    46   839]
 [  102   156    41     2     7]
 [ 1764  2418   785     8   356]]

Accuracy: 0.4853473301872515

Classification Metrics:
                   precision    recall  f1-score   support

     CarrierDelay       0.37      0.37      0.37     43450
LateAircraftDelay       0.56      0.55      0.56     69182
         NASDelay       0.57      0.59      0.58     27095
    SecurityDelay       0.01      0.01      0.01       308
     WeatherDelay       0.06      0.07      0.06      5331

         accuracy                           0.49    145366
        macro avg       0.31      0.32      0.32    145366
     weighted avg       0.49      0.49      0.49    145366



Performance is close to that of k-NN (about 49% accuracy against about 50%). The trend between the different classes is still apparent.

### Random Forest

In [None]:
#Import RF model
from sklearn.ensemble import RandomForestClassifier
#Create a RandomForest Classifier
RFmodel = RandomForestClassifier(verbose = 1)
# Train the model using the training sets
RFmodel.fit(flights_data_train, target_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  1.2min finished


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=1, warm_start=False)

Let us look at the Python equivalent of `varImpPlot` for the variables. For this, we use the `feature_importances_` attribute of the fitted `RandomForestClassifier`, and plot a bar graph.

In [None]:
# get importance
importance = RFmodel.feature_importances_
# plot feature importance
plt.figure(figsize = (15,2))
plt.bar(range(len(importance)), importance, tick_label = flights_data_train.columns)
plt.xticks(rotation = 90)
plt.show()
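A ranked table is a useful companion to the bar graph. The sketch below trains a small forest on synthetic data only to show the two lines that build the table; the feature names `f0` to `f4` and the variable `ranked` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the flight features (assumption: not the real data)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           random_state=0)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, y)

# Pair each importance with its column name and sort from most to least important
ranked = pd.Series(model.feature_importances_, index=features.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

Applied to `RFmodel` and `flights_data_train.columns`, the same two lines would list the features behind the bar graph in ranked order.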

There isn't much difference in terms of feature importance, hence we will proceed without removing existing features. Now let's evaluate this model.

In [None]:
#Evaluate model performance on the validation set
target_pred_val = RFmodel.predict(flights_data_validation)
PerformanceMetrics(target_validation, target_pred_val)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    4.3s finished


Confusion Matrix:
[[ 8817 18341  1198     1    30]
 [ 7795 35728  1639     1    36]
 [ 1553  5153 10978     0    18]
 [   66   119    16     0     0]
 [  880  2243   337     0    23]]

Accuracy: 0.5848671187297309

Classification Metrics:
                   precision    recall  f1-score   support

     CarrierDelay       0.46      0.31      0.37     28387
LateAircraftDelay       0.58      0.79      0.67     45199
         NASDelay       0.77      0.62      0.69     17702
    SecurityDelay       0.00      0.00      0.00       201
     WeatherDelay       0.21      0.01      0.01      3483

         accuracy                           0.58     94972
        macro avg       0.41      0.35      0.35     94972
     weighted avg       0.57      0.58      0.56     94972



In [None]:
#Evaluate model performance on the test set
target_pred = RFmodel.predict(flights_data_test)
PerformanceMetrics(target_test, target_pred)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    6.6s finished


Confusion Matrix:
[[13371 28204  1828     0    47]
 [12018 54587  2530     1    46]
 [ 2495  7843 16730     0    27]
 [   92   193    21     1     1]
 [ 1310  3444   529     0    48]]

Accuracy: 0.5829217286022866

Classification Metrics:
                   precision    recall  f1-score   support

     CarrierDelay       0.46      0.31      0.37     43450
LateAircraftDelay       0.58      0.79      0.67     69182
         NASDelay       0.77      0.62      0.69     27095
    SecurityDelay       0.50      0.00      0.01       308
     WeatherDelay       0.28      0.01      0.02      5331

         accuracy                           0.58    145366
        macro avg       0.52      0.35      0.35    145366
     weighted avg       0.57      0.58      0.56    145366



The random forest algorithm is an ensemble extension of the `tree` model, and we can see that it performs better than not only the Decision Tree but also all the other models we have implemented. This is the advantage of combining classifiers, which we will explore in more depth in the next Homework.
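
To make the "ensemble of trees" idea concrete, the sketch below fits a small forest on synthetic data and shows that, in scikit-learn, the forest's class probabilities are the average of the probabilities of its individual trees. The data and the names `X_demo`, `y_demo`, `forest` and `averaged` are illustrative assumptions, not part of our pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class problem (assumption, stands in for the flight data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_; average their probabilities
tree_probas = np.stack([t.predict_proba(X_demo) for t in forest.estimators_])
averaged = tree_probas.mean(axis=0)

# The class with the highest averaged probability is the forest's prediction
manual_pred = forest.classes_[averaged.argmax(axis=1)]

print(np.allclose(averaged, forest.predict_proba(X_demo)))
```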

Before we conclude, we will estimate the variance of all the models we have used so far with the function `VarianceEstimator` defined below. It reports the aggregate performance of repeated instances of the same classification algorithm across several training set sizes. For each training set size, we fit the model `number_iter` times, each time on a different random subset. For example, for a training set size of 50% of the training data, we fit the model 20 times, each time on a random 50% subset. The same is done for 10% and 90%.

In [None]:
def VarianceEstimator():

    number_iter = 20

    coverdata = pd.read_csv('BalancedCover.csv', index_col = 0)
    target = pd.read_csv('BalancedTarget.csv', index_col = 0)

    target = target['target']

    coverdata_train, coverdata_test, target_train, target_test = train_test_split(coverdata, target, test_size = 0.1, random_state = 11, shuffle = True, stratify = target)
    models = ['LR','NB','KNN','SVM','DT']

    for model_name in models:
        modelVE(model_name, number_iter, coverdata_train, coverdata_test, target_train, target_test)



def modelVE(model_name, number_iter, X_train, X_test, y_train, y_test):

    if model_name == 'LR':
        from sklearn.linear_model import LogisticRegression
        model = LogisticRegression(max_iter = 1000, multi_class = 'auto')
    elif model_name == 'NB':
        from sklearn.naive_bayes import MultinomialNB
        model = MultinomialNB()
    elif model_name == 'KNN':
        from sklearn.neighbors import KNeighborsClassifier
        model = KNeighborsClassifier(n_neighbors = 1)
    elif model_name == 'SVM':
        from sklearn.svm import SVC
        model = SVC(kernel = 'rbf', probability = True, decision_function_shape = 'ovr')
    elif model_name == 'DT':
        from sklearn import tree
        model = tree.DecisionTreeClassifier(random_state = 0)

    print("\nModel Name: ", model_name)

    test_size_list = [0.1, 0.5, 0.9]

    #Estimate Variance for a number of training sizes
    for test_size in test_size_list:
        #round() avoids floating point noise such as 0.09999999999999998
        print("\nTRAINING SIZE = ", round(1 - test_size, 2))
        print("\nModel loading.", end = "")

        accuracy_list = []
        auc_list = []
        accuracy_list_val = []

        #Run for number of iterations
        for i in range(number_iter):
            if ((i+1)%3):
                print(".", end = "")
            else:
                print("\b\b", end = "")
            #Draw a new stratified random split; a fraction test_size is held out for validation
            X_train_i, X_val_i, y_train_i, y_val_i = train_test_split(X_train, y_train, test_size = test_size, random_state = i, shuffle = True, stratify = y_train)
            #Training model with this instance of data
            model.fit(X_train_i, y_train_i)
            #Accuracy on the held-out validation part of this split
            target_pred_val = model.predict(X_val_i)
            accuracy_list_val.append(metrics.accuracy_score(y_val_i, target_pred_val))
            #Accuracy and AUC on the fixed test set
            target_pred = model.predict(X_test)
            accuracy_list.append(metrics.accuracy_score(y_test, target_pred))
            auc_list.append(metrics.roc_auc_score(y_test, model.predict_proba(X_test), multi_class = 'ovo'))
        print("\rAverage Validation Accuracy = {}\nVariance Estimate:\n\tAverage Testing Accuracy = {}\n\tAverage AUC Score = {}".format(sum(accuracy_list_val)/len(accuracy_list_val), sum(accuracy_list)/len(accuracy_list), sum(auc_list)/len(auc_list)))

In [None]:
VarianceEstimator()

As expected, for every model, performance decreases as the amount of training data decreases. Even though each of the roughly 300 fitted models (5 algorithms × 3 training sizes × 20 iterations) is trained on a different random subset, performance stays consistent across runs, which suggests that the learners are stable and not overfitting.
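
The spread across runs can also be reported directly instead of being judged from the averages alone. The sketch below repeats the same resampling idea on synthetic data with a single decision tree and returns the mean and standard deviation of the test accuracy for each training size. The helper `accuracy_spread` and the generated data are assumptions made for illustration; they are not part of `VarianceEstimator`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

def accuracy_spread(X_train, y_train, X_test, y_test, test_size, number_iter=20):
    """Fit on number_iter random subsets and return (mean, std) of test accuracy."""
    scores = []
    for i in range(number_iter):
        # Keep a random (1 - test_size) fraction of the training data for fitting
        X_sub, _, y_sub, _ = train_test_split(
            X_train, y_train, test_size=test_size, random_state=i, stratify=y_train)
        model = DecisionTreeClassifier(random_state=0).fit(X_sub, y_sub)
        scores.append(metrics.accuracy_score(y_test, model.predict(X_test)))
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic three-class data (assumption, stands in for the balanced dataset)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=11, stratify=y)

results = {}
for test_size in [0.1, 0.5, 0.9]:
    results[test_size] = accuracy_spread(X_train, y_train, X_test, y_test, test_size)
    mean_acc, std_acc = results[test_size]
    print(f"training size {round(1 - test_size, 2)}: mean={mean_acc:.3f}, std={std_acc:.3f}")
```

A small standard deviation at a given training size means the model's test accuracy barely depends on which subset it was trained on.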

## Conclusion

Up to now we have applied many data processing methods to our data, and we have seen their effects on a multitude of classification algorithms. Even though there is a significant difference between the performance of the different models, the relative performance among the target classes follows a similar trend across all of them. We have seen that normalization greatly improved training times for some models, and sometimes even the performance (in the case of Naive Bayes). Class balancing has given us mixed results: it reduced the gap in performance between the classes, but it also reduced overall accuracy (which might not be entirely bad, since it may have prevented the models from overfitting). It is worth noting that removing the correlated features also contributed significantly to model performance in many cases, though this is not shown in the code above.

So far our best classifiers have been k-NN, Decision Tree, SVM and Logistic Regression, in that order. We leave out Random Forest because it is an ensemble method, which we will use in the next Homework. It is clear that, for our data, non-parametric classification algorithms work better than parametric ones. One possible reason is that the underlying target function is highly complex, and the parametric models cannot emulate it with their limited capacity. The large amount of data and the large number of attributes do not favor parametric models either. This is reinforced by the fact that the non-parametric models not only score better than the parametric ones, but also reach some of the highest values (in most cases greater than 80%). Thus it is reasonable to say that the spread and complexity of the data have led to the performance levels we have observed.