# Data Modelling - Flight Delays in the USA


# Table of contents<a class="anchor" id="table"></a>

* [1 Data Loading, Cleaning, Labelling, and Exploration](#1)
* [1.1 Data Loading](#1.1)
* [1.1.1 Creating SparkSession & SparkContext](#OneOneOne)
* [1.1.2 Load datasets and display number of records](#OneOneTwo)
* [1.1.3 Obtain number of columns](#1.1.3)
* [1.2 Data Cleaning](#1.2)
* [1.2.1 Check for missing values](#1.2.1)
* [1.2.2 Remove some columns and rows using threshold values](#1.2.2)
* [1.2.2.a) Using Python function ](#1.2.2.a)
* [1.2.2.b Removing Null or NaN columns](#1.2.2.b)
* [1.2.2.c. Drop rows with Null and NaN values](#1.2.2.c)
* [1.3 Data labelling ](#1.3)
* [1.3.1 Generating labels ](#1.3.1)
* [1.3.1.a Binary labels](#1.3.1.a)
* [1.3.1.b Multiclass labeling](#1.3.1.b)
* [1.3.2 Auto labelling flightsDf using function](#1.3.2)
* [1.4 Data Exploration / Exploratory Analysis](#1.4)
* [2 Feature extraction and ML Training ](#2)
* [2.1. Discuss the feature selection and prepare the feature columns](#2.1)
* [2.1.1 Define dataframes and loading scheme](#2.1.1)
* [2.1.2 Creating the analytical dataset consisting of relevant columns](#2.1.2)
* [2.2. Preparing any Spark ML Transformers/ Estimators for features and models](#2.2)
* [2.2.1 Creating Transformers/Estimators for transforming/assembling the columns](#2.2.1)
* [2.4.1 Binary classification](#2.4.1)
* [2.4.1.a ML Pipelines to train the models](#2.4.1.a)
* [2.4.1.b Display the count of each combination of late/not late label and prediction label](#2.4.1.b)
* [2.4.1.c Compute the AUC, accuracy, recall, and precision ](#2.4.1.c)
* [2.4.1.d Which is the better model, and persist the better model](#2.4.1.d)
* [2.4.1.e. Ways the performance can be improved for classifiers](#2.4.1.e)
* [2.4.1.f. Top-3 feature with each corresponding feature importance](#2.4.1.f)
* [2.4.1.g. Ways the performance can be improved for both classifiers](#2.4.1.g)
* [2.4.2 Multiclass classification](#2.4.2)
* [2.4.2.a ML Pipelines to train the models](#2.4.2.a)
* [2.4.2.b Display the count of each combination of early/on-time/late label and prediction label](#2.4.2.b)
* [2.4.2.c Compute the AUC, accuracy, recall, and precision ](#2.4.2.c)
* [2.4.2.d Which is the better model, and persist the better model](#2.4.2.d)
* [2.4.2.e. Ways the performance can be improved for classifiers](#2.4.2.e)


# 1 Data Loading, Cleaning, Labelling, and Exploration <a class="anchor" id="1"></a>
## 1.1 Data Loading<a class="anchor" id="1.1"></a>
### 1.1.1 Create SparkSession and SparkContext<a class="anchor" id="OneOneOne"></a>
[Back to top](#table)

In [1]:
# Import SparkConf class
from pyspark import SparkConf

# local[*]: run Spark in local mode with as many working processors as logical cores on your machine
# If we want Spark to run locally with 'k' worker threads, we can specify as "local[k]".
master = "local[*]"
# The `appName` field is a name to be shown on the Spark cluster UI page
app_name = "Data Modelling"
# Setup configuration parameters for Spark
spark_conf = SparkConf().setMaster(master).setAppName(app_name)

# Import SparkContext and SparkSession classes
from pyspark import SparkContext # Spark
from pyspark.sql import SparkSession # Spark SQL

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')

### 1.1.2 Load datasets and display number of records <a class="anchor" id="OneOneTwo"></a>
[Back to top](#table)

In [2]:
# Load all flight CSV files into a single DataFrame
flightsRawDf = spark.read.csv("Datasets/flight*.csv", header=True, inferSchema=True)

# Load the airports data into a DataFrame
airportsDf = spark.read.csv("Datasets/airports.csv", header=True, inferSchema=True)

In [3]:
flightsRawDf.printSchema()

root
 |-- YEAR: integer (nullable = true)
 |-- MONTH: integer (nullable = true)
 |-- DAY: integer (nullable = true)
 |-- DAY_OF_WEEK: integer (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- FLIGHT_NUMBER: integer (nullable = true)
 |-- TAIL_NUMBER: string (nullable = true)
 |-- ORIGIN_AIRPORT: string (nullable = true)
 |-- DESTINATION_AIRPORT: string (nullable = true)
 |-- SCHEDULED_DEPARTURE: integer (nullable = true)
 |-- DEPARTURE_TIME: integer (nullable = true)
 |-- DEPARTURE_DELAY: integer (nullable = true)
 |-- TAXI_OUT: integer (nullable = true)
 |-- WHEELS_OFF: integer (nullable = true)
 |-- SCHEDULED_TIME: integer (nullable = true)
 |-- ELAPSED_TIME: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- WHEELS_ON: integer (nullable = true)
 |-- TAXI_IN: integer (nullable = true)
 |-- SCHEDULED_ARRIVAL: integer (nullable = true)
 |-- ARRIVAL_TIME: integer (nullable = true)
 |-- ARRIVAL_DELAY: integer (null

In [4]:
airportsDf.printSchema()

root
 |-- IATA_CODE: string (nullable = true)
 |-- AIRPORT: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- STATE: string (nullable = true)
 |-- COUNTRY: string (nullable = true)
 |-- LATITUDE: double (nullable = true)
 |-- LONGITUDE: double (nullable = true)



In [5]:
print(f"Number of Records in flights delay dataset: {flightsRawDf.count()}")

print(f"Number of Records in Airports dataset: {airportsDf.count()}")

Number of Records in flights delay dataset: 582184
Number of Records in Airports dataset: 322


### 1.1.3 Obtain number of columns <a class="anchor" id="1.1.3"></a>
[Back to top](#table)

In [6]:
allColumnFlights = len(flightsRawDf.columns)
print(f"Number of columns in flights delay dataset: {allColumnFlights}")

Number of columns in flights delay dataset: 31


## 1.2 Data Cleaning <a class="anchor" id="1.2"></a>
### 1.2.1 Check for missing values <a class="anchor" id="1.2.1"></a>
[Back to top](#table)

In [7]:
from pyspark.sql.functions import isnan, when, count, col

# Count the NaN and null values in each column
flightsRawDf.select([count(when(isnan(each) | col(each).isNull(), each)).alias(each) for each in flightsRawDf.columns]).show()


+----+-----+---+-----------+-------+-------------+-----------+--------------+-------------------+-------------------+--------------+---------------+--------+----------+--------------+------------+--------+--------+---------+-------+-----------------+------------+-------------+--------+---------+-------------------+----------------+--------------+-------------+-------------------+-------------+
|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|TAIL_NUMBER|ORIGIN_AIRPORT|DESTINATION_AIRPORT|SCHEDULED_DEPARTURE|DEPARTURE_TIME|DEPARTURE_DELAY|TAXI_OUT|WHEELS_OFF|SCHEDULED_TIME|ELAPSED_TIME|AIR_TIME|DISTANCE|WHEELS_ON|TAXI_IN|SCHEDULED_ARRIVAL|ARRIVAL_TIME|ARRIVAL_DELAY|DIVERTED|CANCELLED|CANCELLATION_REASON|AIR_SYSTEM_DELAY|SECURITY_DELAY|AIRLINE_DELAY|LATE_AIRCRAFT_DELAY|WEATHER_DELAY|
+----+-----+---+-----------+-------+-------------+-----------+--------------+-------------------+-------------------+--------------+---------------+--------+----------+--------------+------------+--------+-

In [8]:
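# Null count per column; the loop variable is called `name` to avoid confusion with `col` from pyspark.sql.functions
null_dict = {name: flightsRawDf.filter(flightsRawDf[name].isNull()).count() for name in flightsRawDf.columns}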
null_dict

{'YEAR': 0,
 'MONTH': 0,
 'DAY': 0,
 'DAY_OF_WEEK': 0,
 'AIRLINE': 0,
 'FLIGHT_NUMBER': 0,
 'TAIL_NUMBER': 1462,
 'ORIGIN_AIRPORT': 0,
 'DESTINATION_AIRPORT': 0,
 'SCHEDULED_DEPARTURE': 0,
 'DEPARTURE_TIME': 8633,
 'DEPARTURE_DELAY': 8633,
 'TAXI_OUT': 8891,
 'WHEELS_OFF': 8891,
 'SCHEDULED_TIME': 1,
 'ELAPSED_TIME': 10455,
 'AIR_TIME': 10455,
 'DISTANCE': 0,
 'WHEELS_ON': 9257,
 'TAXI_IN': 9257,
 'SCHEDULED_ARRIVAL': 0,
 'ARRIVAL_TIME': 9257,
 'ARRIVAL_DELAY': 10455,
 'DIVERTED': 0,
 'CANCELLED': 0,
 'CANCELLATION_REASON': 573213,
 'AIR_SYSTEM_DELAY': 475831,
 'SECURITY_DELAY': 475831,
 'AIRLINE_DELAY': 475831,
 'LATE_AIRCRAFT_DELAY': 475831,
 'WEATHER_DELAY': 475831}

### 1.2.2 Remove some columns and rows using threshold values <a class="anchor" id="1.2.2"></a>
### 1.2.2.a) Using Python function <a class="anchor" id="1.2.2.a"></a>
[Back to top](#table)

In [9]:
x = 10  # maximum percentage of missing values tolerated per column

def find_removed_columns(x, flightsRawDf):
    """Return the columns whose null count (taken from null_dict) exceeds x% of the rows."""
    removed_columns = []

    number_of_records = flightsRawDf.count()
    threshold_value = x * number_of_records / 100
    print("The threshold value:", threshold_value)

    for key, value in null_dict.items():
        if value > threshold_value:
            removed_columns.append(key)

    return removed_columns

removed_columns = find_removed_columns(x, flightsRawDf)
print("The following column names are unworthy due to the abundance of missing values: ")
removed_columns

The threshold value: 58218.4
The following column names are unworthy due to the abundance of missing values: 


['CANCELLATION_REASON',
 'AIR_SYSTEM_DELAY',
 'SECURITY_DELAY',
 'AIRLINE_DELAY',
 'LATE_AIRCRAFT_DELAY',
 'WEATHER_DELAY']
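
The threshold rule can also be written without relying on the global `null_dict`. The following is a minimal sketch in plain Python, assuming the null counts have already been collected into a dictionary. The function name `columns_over_threshold` and the abbreviated `sample_null_counts` dictionary are illustrative and not part of the notebook.

```python
def columns_over_threshold(null_counts, total_rows, percent):
    """Return the columns whose null count exceeds `percent` % of `total_rows`."""
    threshold = percent * total_rows / 100
    return [name for name, nulls in null_counts.items() if nulls > threshold]


# A subset of the null counts reported in section 1.2.1 (illustrative sample).
sample_null_counts = {
    "TAIL_NUMBER": 1462,
    "ARRIVAL_DELAY": 10455,
    "CANCELLATION_REASON": 573213,
    "AIR_SYSTEM_DELAY": 475831,
    "WEATHER_DELAY": 475831,
}

over = columns_over_threshold(sample_null_counts, total_rows=582184, percent=10)
print(over)
```

Passing the counts as an argument keeps the function usable for any DataFrame whose null counts were computed separately.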

### 1.2.2.b Removing Null or NaN columns <a class="anchor" id="1.2.2.b"></a>
[Back to top](#table)

In [10]:
def eliminate_columns(removed_columns, flightsRawDf):
    # dropping the list of columns from dataframe
    flightsRawDf = flightsRawDf.drop(*removed_columns)
    
    return flightsRawDf
    
    
flightsRawDf = eliminate_columns(removed_columns, flightsRawDf)

print(f"Number of rows in flights delay dataset: {flightsRawDf.count()}")
print(f"Number of columns in flights delay dataset: {len(flightsRawDf.columns)}")

Number of rows in flights delay dataset: 582184
Number of columns in flights delay dataset: 25


### 1.2.2.c. Drop rows with Null and NaN values <a class="anchor" id="1.2.2.c"></a>
[Back to top](#table)

In [11]:
flightsDf = flightsRawDf.na.drop(how="any")
print(f"Number of rows in cleaned flights delay dataset: {flightsDf.count()}")
print(f"Number of columns in cleaned flights delay dataset: {len(flightsDf.columns)}")

Number of rows in cleaned flights delay dataset: 571729
Number of columns in cleaned flights delay dataset: 25


### Observation:

At least as many rows must be removed as the largest null count of any remaining column (10455, for 'ARRIVAL_DELAY', 'ELAPSED_TIME' and 'AIR_TIME').

The number of rows removed is 582184 - 571729 = 10455, which matches that count. This is consistent with all rows containing Null or NaN values having been successfully removed from the dataset.
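
The check in this observation can be reproduced with a few lines of arithmetic. This is a small sketch in plain Python that uses only the figures printed earlier in the notebook; the variable names are illustrative.

```python
rows_before = 582184  # rows reported after dropping the sparse columns
rows_after = 571729   # rows reported after na.drop(how="any")

# Null counts of the retained columns that had any nulls (from section 1.2.1).
retained_null_counts = {
    "TAIL_NUMBER": 1462, "DEPARTURE_TIME": 8633, "DEPARTURE_DELAY": 8633,
    "TAXI_OUT": 8891, "WHEELS_OFF": 8891, "SCHEDULED_TIME": 1,
    "ELAPSED_TIME": 10455, "AIR_TIME": 10455, "WHEELS_ON": 9257,
    "TAXI_IN": 9257, "ARRIVAL_TIME": 9257, "ARRIVAL_DELAY": 10455,
}

rows_removed = rows_before - rows_after
lower_bound = max(retained_null_counts.values())  # at least this many rows must go
upper_bound = sum(retained_null_counts.values())  # reached only if no row has two nulls
print(rows_removed, lower_bound, upper_bound)
```

Because the number of removed rows equals the lower bound, every row with a null in any retained column must also have a null in 'ARRIVAL_DELAY'.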

## 1.3 Data labelling <a class="anchor" id="1.3"></a>
### 1.3.1 Generating labels <a class="anchor" id="1.3.1"></a>
### 1.3.1.a Binary labels <a class="anchor" id="1.3.1.a"></a>
[Back to top](#table)

In [12]:
import pyspark.sql.functions as F

flightsDf = flightsDf.withColumn('binaryArrDelay', F.when(F.col("ARRIVAL_DELAY") > 0, 1)
    .otherwise(0))

# flightsDf.groupBy('binaryArrDelay').count().orderBy('binaryArrDelay').show()

In [13]:
flightsDf = flightsDf.withColumn('binaryDeptDelay', F.when(F.col("DEPARTURE_DELAY") > 0, 1)
    .otherwise(0))

# flightsDf.groupBy('binaryDeptDelay').count().orderBy('binaryDeptDelay').show()

### 1.3.1.b Multiclass labeling <a class="anchor" id="1.3.1.b"></a>
[Back to top](#table)

In [14]:
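# Multiclass labels: 0 = delay under 5 minutes, 1 = delay of 5 to 20 minutes (inclusive), 2 = delay over 20 minutes.
# The new column names are multiClassArrDelay and multiClassDeptDelay.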
flightsDf = flightsDf.withColumn('multiClassArrDelay',
    F.when(F.col('ARRIVAL_DELAY') < 5, 0) \
    .when(F.col('ARRIVAL_DELAY').between(5,20), 1) \
    .when(F.col('ARRIVAL_DELAY') > 20, 2)
)

# flightsDf.groupBy('multiClassArrDelay').count().orderBy('multiClassArrDelay').show()

In [15]:
flightsDf = flightsDf.withColumn('multiClassDeptDelay',
    F.when(F.col('DEPARTURE_DELAY') < 5, 0) \
    .when(F.col('DEPARTURE_DELAY').between(5,20), 1) \
    .when(F.col('DEPARTURE_DELAY') > 20, 2)
)

# flightsDf.groupBy('multiClassDeptDelay').count().orderBy('multiClassDeptDelay').show()
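
To make the class boundaries explicit, the multiclass rule can be sketched for a single value in plain Python. The function name `multiclass_delay_label` is hypothetical. The sketch assumes the semantics of the Spark expression used in this section: `between` is inclusive on both ends, and a `when` chain without `otherwise` yields null when no branch matches.

```python
def multiclass_delay_label(delay_minutes):
    """Mirror the when-chain: 0 below 5 min, 1 for 5 to 20 min, 2 above 20 min."""
    if delay_minutes is None:
        return None  # no branch matches a null and there is no otherwise()
    if delay_minutes < 5:
        return 0
    if 5 <= delay_minutes <= 20:  # between() includes both end points
        return 1
    return 2


for value in (-82, 4, 5, 20, 21, 1665):
    print(value, multiclass_delay_label(value))
```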

### 1.3.2 Auto labelling flightsDf using function <a class="anchor" id="1.3.2"></a>

[Back to top](#table)

In [16]:
flightsDf.describe('ARRIVAL_DELAY').show()

+-------+-----------------+
|summary|    ARRIVAL_DELAY|
+-------+-----------------+
|  count|           571729|
|   mean|4.467084930098001|
| stddev| 39.7870861908589|
|    min|              -82|
|    max|             1665|
+-------+-----------------+



In [17]:
# flightsDf.toPandas().hist(column = 'ARRIVAL_DELAY')

In [18]:
from pyspark.sql.window import Window

def create_range(dataframe, column_name, output_column):
    """Label column_name as 0, 1 or 2 using the upper bounds of its three ntile buckets."""
    category_range = []

    # A window without partition columns orders the whole DataFrame in a single partition
    windowSpec = Window().partitionBy().orderBy(dataframe[column_name])

    # Split the ordered rows into 3 equally sized buckets: early, on-time and late
    ranked = dataframe.withColumn("rank", F.ntile(3).over(windowSpec))

    # Collect the largest value of each bucket as the upper bound of its range
    for i in range(1, 4):
        bucket = ranked.filter(F.col('rank') == i)
        max_range_value = bucket.agg({column_name: "max"}).collect()[0][0]
        category_range.append(max_range_value)

    print("The range for ", column_name, " is ", category_range)

    # Values equal to the first bound are labelled 1, so the class sizes are only roughly equal
    dataframe = dataframe.withColumn(output_column,
            F.when(F.col(column_name) < category_range[0], 0) \
                .when(F.col(column_name).between(category_range[0], category_range[1]), 1) \
                .when(F.col(column_name) > category_range[1], 2))

    return dataframe
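
The bucketing logic of `create_range` can be illustrated on a small list in plain Python. This is a sketch, not the notebook's implementation. It assumes the usual `ntile` behaviour, where the sorted rows are split into `n` buckets and earlier buckets receive any remainder rows. The helper names and the sample delays are made up for illustration.

```python
from collections import Counter


def ntile_upper_bounds(values, n=3):
    """Upper bound of each of the `n` ntile buckets over the sorted values."""
    if len(values) < n:
        raise ValueError("need at least one value per bucket")
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n)
    bounds, start = [], 0
    for bucket in range(n):
        size = base + (1 if bucket < extra else 0)  # earlier buckets take the remainder
        bounds.append(ordered[start + size - 1])
        start += size
    return bounds


def label_from_bounds(value, bounds):
    """Apply the same comparison rule as create_range."""
    if value < bounds[0]:
        return 0
    if value <= bounds[1]:
        return 1
    return 2


sample_delays = [-12, -10, -10, -3, 0, 2, 2, 9, 30]  # hypothetical delays in minutes
bounds = ntile_upper_bounds(sample_delays)
label_counts = Counter(label_from_bounds(v, bounds) for v in sample_delays)
print(bounds, dict(label_counts))
```

Each bucket holds three values here, yet the labels are not split three ways. Values equal to a bucket's upper bound are assigned by comparison rather than by bucket, which is one reason the class counts produced by this approach need not be exactly equal.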
    

In [19]:
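%time
# Note: a bare `%time` line times an empty statement, so the timings printed below do not measure this cell; `%%time` as the first line of the cell would time it.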
flightsDf = create_range(flightsDf, 'ARRIVAL_DELAY', 'autolabelMultiClassArrDelay')

flightsDf.groupBy('autolabelMultiClassArrDelay').count().orderBy('autolabelMultiClassArrDelay').show()

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.77 µs
The range for  ARRIVAL_DELAY  is  [-10, 2, 1665]
+---------------------------+------+
|autolabelMultiClassArrDelay| count|
+---------------------------+------+
|                          0|188468|
|                          1|196227|
|                          2|187034|
+---------------------------+------+



In [20]:
%time
flightsDf = create_range(flightsDf, 'DEPARTURE_DELAY', 'autolabelMultiClassDeptDelay')

flightsDf.groupBy('autolabelMultiClassDeptDelay').count().orderBy('autolabelMultiClassDeptDelay').show()


CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 5.25 µs
The range for  DEPARTURE_DELAY  is  [-4, 2, 1670]
+----------------------------+------+
|autolabelMultiClassDeptDelay| count|
+----------------------------+------+
|                           0|155337|
|                           1|233135|
|                           2|183257|
+----------------------------+------+



## 1.4 Data Exploration / Exploratory Analysis <a class="anchor" id="1.4"></a>
[Back to top](#table)

In [21]:
import pandas as pd
pd.set_option("display.max_columns", None)

In [22]:
flightsDf.printSchema()

root
 |-- YEAR: integer (nullable = true)
 |-- MONTH: integer (nullable = true)
 |-- DAY: integer (nullable = true)
 |-- DAY_OF_WEEK: integer (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- FLIGHT_NUMBER: integer (nullable = true)
 |-- TAIL_NUMBER: string (nullable = true)
 |-- ORIGIN_AIRPORT: string (nullable = true)
 |-- DESTINATION_AIRPORT: string (nullable = true)
 |-- SCHEDULED_DEPARTURE: integer (nullable = true)
 |-- DEPARTURE_TIME: integer (nullable = true)
 |-- DEPARTURE_DELAY: integer (nullable = true)
 |-- TAXI_OUT: integer (nullable = true)
 |-- WHEELS_OFF: integer (nullable = true)
 |-- SCHEDULED_TIME: integer (nullable = true)
 |-- ELAPSED_TIME: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- WHEELS_ON: integer (nullable = true)
 |-- TAXI_IN: integer (nullable = true)
 |-- SCHEDULED_ARRIVAL: integer (nullable = true)
 |-- ARRIVAL_TIME: integer (nullable = true)
 |-- ARRIVAL_DELAY: integer (null

In [23]:
# Split the columns by data type into categorical (string) and numerical (integer) columns.
# The next cell reports the number of unique categories and the most frequent category of each categorical column.
categorical_features = []
numerical_column = []
for i in flightsDf.dtypes:
    if(i[1] == 'string'):
        categorical_features.append(i[0])
    elif(i[1] == 'int'):
        numerical_column.append(i[0])
    
        
print("Categorical columns are:",categorical_features)

print("Numerical columns are:",numerical_column)


Categorical columns are: ['AIRLINE', 'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']
Numerical columns are: ['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'FLIGHT_NUMBER', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'binaryArrDelay', 'binaryDeptDelay', 'multiClassArrDelay', 'multiClassDeptDelay', 'autolabelMultiClassArrDelay', 'autolabelMultiClassDeptDelay']


In [24]:

for column in categorical_features:
    total_unique_categories = flightsDf.select(column).distinct().count()
    print("The total unique categories in", column, "is: ", total_unique_categories)
    
    subCatDf = flightsDf.groupBy(column).count().orderBy(col('count').desc())
    
    mostFrequentCat = subCatDf.take(1)
    print("The highest occurring category in", column, "is",mostFrequentCat[0][0], "with the frequency of", mostFrequentCat[0][1],"\n")
    
    subCatDf.show()
    
    print("--------------------------------------------------------------------------------------------")
    

The total unique categories in AIRLINE is:  14
The highest occurring category in AIRLINE is WN with the frequency of 123912 

+-------+------+
|AIRLINE| count|
+-------+------+
|     WN|123912|
|     DL| 87307|
|     AA| 71242|
|     OO| 57665|
|     EV| 55349|
|     UA| 50952|
|     MQ| 27899|
|     B6| 26286|
|     US| 19519|
|     AS| 17286|
|     NK| 11522|
|     F9|  9020|
|     HA|  7670|
|     VX|  6100|
+-------+------+

--------------------------------------------------------------------------------------------
The total unique categories in TAIL_NUMBER is:  4802
The highest occurring category in TAIL_NUMBER is N480HA with the frequency of 398 

+-----------+-----+
|TAIL_NUMBER|count|
+-----------+-----+
|     N480HA|  398|
|     N483HA|  397|
|     N488HA|  381|
|     N489HA|  369|
|     N484HA|  367|
|     N491HA|  365|
|     N478HA|  362|
|     N493HA|  347|
|     N479HA|  339|
|     N477HA|  338|
|     N486HA|  337|
|     N487HA|  331|
|     N481HA|  328|
|     N485HA|  31

In [25]:
month_list = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
from pyspark.sql.functions import lit
from pyspark.sql.functions import udf

    

def lateFlightsPerMonth():
    """Return, per month, the number of late arrivals, the total flights and the late percentage."""
    # Total number of flights per month
    monthDf = flightsDf.groupBy('MONTH').agg({"MONTH":"count"})\
          .withColumnRenamed("COUNT(MONTH)", "TOTAL_FLIGHTS") 
    
    # Number of late arrivals (binaryArrDelay == 1) per month
    lateArrivalDf = flightsDf.filter(col('binaryArrDelay') == 1).groupBy('MONTH').agg({"MONTH":"count"})\
          .withColumnRenamed("COUNT(MONTH)", "LATE_FLIGHTS")
    
    joinedDf =  lateArrivalDf.join(monthDf, monthDf.MONTH==lateArrivalDf.MONTH,how='inner').drop(monthDf.MONTH)

    # Percentage of late arrivals among all flights of the month, kept as a numeric column
    joinedDf = joinedDf.withColumn('PERCENT', 
                                  (F.col('LATE_FLIGHTS') * 100)/F.col('TOTAL_FLIGHTS')).orderBy('MONTH')
    
#     joinedDf = joinedDf.withColumn('MONTH_NAME',month_list).orderBy('MONTH')

    return joinedDf
    
lateArrMonthDf = lateFlightsPerMonth()
# Displaying the result
lateArrMonthDf.show()

+------------+-----+-------------+------------------+
|LATE_FLIGHTS|MONTH|TOTAL_FLIGHTS|           PERCENT|
+------------+-----+-------------+------------------+
|       18401|    1|        45900|40.089324618736384|
|       17405|    2|        40684| 42.78094582636909|
|       19223|    3|        49580| 38.77168212989108|
|       17326|    4|        48221| 35.93040376599407|
|       17616|    5|        48977|35.967903301549704|
|       20769|    6|        49158| 42.24948126449408|
|       20073|    7|        51415|39.041135855295146|
|       17902|    8|        49866| 35.90021256968676|
|       13349|    9|        46459|28.732861232484556|
|       14238|   10|        48357|29.443513865624418|
|       14891|   11|        46203| 32.22950890634807|
|       17862|   12|        46909|38.077980771280565|
+------------+-----+-------------+------------------+
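
The percentages in this table can be recomputed and paired with month names in plain Python. This is a sketch that copies the counts from the table above; the names `MONTH_NAMES`, `monthly_counts` and `late_share` are illustrative.

```python
MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# month number -> (late flights, total flights), copied from the table above
monthly_counts = {
    1: (18401, 45900), 2: (17405, 40684), 3: (19223, 49580),
    4: (17326, 48221), 5: (17616, 48977), 6: (20769, 49158),
    7: (20073, 51415), 8: (17902, 49866), 9: (13349, 46459),
    10: (14238, 48357), 11: (14891, 46203), 12: (17862, 46909),
}

# Percentage of late arrivals per month, keyed by month name
late_share = {
    MONTH_NAMES[month - 1]: 100 * late / total
    for month, (late, total) in monthly_counts.items()
}

worst_month = max(late_share, key=late_share.get)
best_month = min(late_share, key=late_share.get)
total_flights = sum(total for _, total in monthly_counts.values())
print(worst_month, best_month, total_flights)
```

The monthly totals add up to the 571729 rows of the cleaned dataset, which is a quick consistency check on the aggregation.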



# 2 Feature extraction and ML Training <a class="anchor" id="2"></a>
## 2.1. Discuss the feature selection and prepare the feature columns <a class="anchor" id="2.1"></a>
### 2.1.1 Define dataframes and loading scheme<a class="anchor" id="2.1.1"></a>
[Back to top](#table)

In [26]:
flightsDf.toPandas().describe()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,binaryArrDelay,binaryDeptDelay,multiClassArrDelay,multiClassDeptDelay,autolabelMultiClassArrDelay,autolabelMultiClassDeptDelay
count,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0,571729.0
mean,2015.0,6.54226,15.711267,3.932858,2160.62128,1328.704886,1334.71451,9.323501,16.076809,1357.015107,141.885302,137.028886,113.517327,824.298112,1472.260561,7.43475,1494.01055,1477.347904,4.467085,0.0,0.0,0.365654,0.369455,0.443938,0.434655,0.997492,1.048834
std,0.0,3.398908,8.768838,1.984087,1752.191061,483.322122,496.152643,37.430095,8.921954,497.697168,75.204014,74.127482,72.122705,607.467311,521.202226,5.606216,506.257498,525.302164,39.787086,0.0,0.0,0.481613,0.482658,0.739339,0.736535,0.810418,0.768013
min,2015.0,1.0,1.0,1.0,1.0,1.0,1.0,-48.0,1.0,1.0,18.0,14.0,7.0,31.0,1.0,1.0,1.0,1.0,-82.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2015.0,4.0,8.0,2.0,727.0,916.0,921.0,-5.0,11.0,935.0,86.0,82.0,60.0,373.0,1055.0,4.0,1110.0,1059.0,-13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2015.0,7.0,16.0,4.0,1678.0,1325.0,1330.0,-2.0,14.0,1343.0,123.0,118.0,94.0,650.0,1508.0,6.0,1520.0,1512.0,-5.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
75%,2015.0,9.0,23.0,6.0,3202.0,1730.0,1740.0,7.0,19.0,1754.0,174.0,169.0,144.0,1065.0,1912.0,9.0,1918.0,1917.0,8.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0
max,2015.0,12.0,31.0,7.0,7438.0,2359.0,2400.0,1670.0,200.0,2400.0,683.0,726.0,684.0,4983.0,2400.0,202.0,2400.0,2400.0,1665.0,0.0,0.0,1.0,1.0,2.0,2.0,2.0,2.0


In [27]:
# Pairwise Pearson correlation between the numerical columns (string columns are excluded)
flightsDf.select(numerical_column).toPandas().corr()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,binaryArrDelay,binaryDeptDelay,multiClassArrDelay,multiClassDeptDelay,autolabelMultiClassArrDelay,autolabelMultiClassDeptDelay
YEAR,,,,,,,,,,,,,,,,,,,,,,,,,,,
MONTH,,1.0,0.004354,-0.011835,-0.020922,-0.000739,-0.005156,-0.021136,-0.012162,-0.005567,0.009833,0.001209,0.002682,0.010167,-0.010184,0.000831,-0.011723,-0.010672,-0.036217,,,-0.053875,-0.037683,-0.050762,-0.039383,-0.05822,-0.028773
DAY,,0.004354,1.0,0.001057,0.000132,-0.001213,-0.002348,-0.001327,-0.002073,-0.003012,0.003074,0.00171,0.002066,0.003293,-0.00278,-0.000672,-0.001633,-0.00212,-0.003873,,,-0.006956,-0.003501,-0.008453,-0.005845,-0.006518,-0.002735
DAY_OF_WEEK,,-0.011835,0.001057,1.0,0.01427,0.006439,0.00454,-0.009496,-0.022014,0.002679,0.015879,0.012699,0.015607,0.017207,0.004201,0.002166,0.0054,0.003631,-0.015288,,,-0.011677,-0.001282,-0.013372,-0.007741,-0.016847,-0.002143
FLIGHT_NUMBER,,-0.020922,0.000132,0.01427,1.0,-0.004104,0.001425,-0.008599,0.049131,0.008011,-0.315371,-0.305986,-0.318995,-0.329182,-0.004609,-0.020246,-0.013307,-0.001136,0.017929,,,0.013503,-0.052005,0.010942,-0.019303,0.019645,-0.071714
SCHEDULED_DEPARTURE,,-0.000739,-0.001213,0.006439,-0.004104,1.0,0.963318,0.108216,0.006438,0.938467,-0.019202,-0.021204,-0.018974,-0.012057,0.661917,-0.046524,0.709157,0.634908,0.098594,,,0.125526,0.186498,0.144811,0.188996,0.118987,0.168828
DEPARTURE_TIME,,-0.005156,-0.002348,0.00454,0.001425,0.963318,1.0,0.166282,0.013999,0.972847,-0.023988,-0.025175,-0.024353,-0.019487,0.682338,-0.041851,0.714185,0.654593,0.154869,,,0.159455,0.224339,0.191623,0.240559,0.151739,0.206981
DEPARTURE_DELAY,,-0.021136,-0.001327,-0.009496,-0.008599,0.108216,0.166282,1.0,0.058651,0.158186,0.027801,0.03098,0.023384,0.024531,0.058625,0.015464,0.096108,0.049747,0.945931,,,0.426323,0.477504,0.599489,0.660057,0.409149,0.47041
TAXI_OUT,,-0.012162,-0.002073,-0.022014,0.049131,0.006438,0.013999,0.058651,1.0,0.039321,0.114383,0.208094,0.089963,0.074333,0.034209,0.002702,0.025135,0.031775,0.226676,,,0.277381,0.052396,0.283746,0.061457,0.282204,0.053052
WHEELS_OFF,,-0.005567,-0.003012,0.002679,0.008011,0.938467,0.972847,0.158186,0.039321,1.0,-0.031549,-0.03039,-0.032987,-0.030807,0.701473,-0.040035,0.725603,0.673536,0.151829,,,0.162454,0.220059,0.193162,0.235712,0.155297,0.202996


## Observation:

1. Features like **'YEAR', 'DIVERTED', 'CANCELLED'** take a single value across the whole dataset, so they carry no information for predicting the label. The correlation table reflects this: their correlations are undefined (NaN).


2. For "autolabelMultiClassArrDelay": features like **'FLIGHT_NUMBER', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'ELAPSED_TIME', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME'** show a positive correlation with the label, meaning the arrival-delay class tends to increase as they increase.


3. However, features like **'MONTH', 'DAY', 'DAY_OF_WEEK', 'SCHEDULED_TIME', 'AIR_TIME' and 'DISTANCE'** show a negative correlation with it.


4. Thus, for the model to predict the ARRIVAL_DELAY label as early, on-time or late, we can keep the significant features and discard the irrelevant ones from our new dataset.


5. For "autolabelMultiClassDeptDelay": features like **'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY'** show a positive correlation with the label, meaning the departure-delay class tends to increase as they increase.


6. However, features like **'MONTH', 'DAY', 'DAY_OF_WEEK', 'FLIGHT_NUMBER', 'TAXI_IN'** show a negative correlation with it.


7. Thus, for the model to predict the DEPARTURE_DELAY label as early, on-time or late, we can keep the significant features and discard the irrelevant ones from our new dataset.


8. Moreover, the **categorical variables** 'AIRLINE', 'TAIL_NUMBER', 'ORIGIN_AIRPORT' and 'DESTINATION_AIRPORT' differ both in their number of unique values and in how frequently each value occurs. We therefore keep these categorical variables for our further analysis.


9. Note that a few features, such as **'MONTH', 'DAY', 'DAY_OF_WEEK'**, are stored as integers but are qualitative (nominal) in nature: the numbers act as category codes rather than quantities. Such coding facilitates data collection and data management, so these variables are transformed before any statistical analysis.


10. Moreover, some features are highly correlated with each other. For instance, **'SCHEDULED_DEPARTURE' and 'DEPARTURE_TIME'** have a correlation of 0.96, and **'SCHEDULED_TIME' and 'ELAPSED_TIME'** of 0.98. Such redundant features **increase model complexity and can contribute to overfitting on the dataset.**
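
To see why constant columns such as 'YEAR' produce NaN in the correlation table, the Pearson coefficient can be sketched in plain Python. This is an illustrative implementation, not the pandas one, and the function name `pearson` is an assumption. The coefficient divides the covariance by the product of the standard deviations, so it is undefined when either column has zero variance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))              # perfectly linear, increasing
print(pearson([2015, 2015, 2015, 2015], [1, 2, 3, 4]))  # constant column
```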


### 2.1.2 Creating the analytical dataset consisting of relevant columns<a class="anchor" id="2.1.2"></a>
[Back to top](#table)

In [28]:
# Checking the total_unique_instances in numerical columns also
for column in numerical_column:
    total_unique_instances = flightsDf.select(column).distinct().count()
    print("The total unique instances in", column, "is: ", total_unique_instances)

The total unique instances in YEAR is:  1
The total unique instances in MONTH is:  12
The total unique instances in DAY is:  31
The total unique instances in DAY_OF_WEEK is:  7
The total unique instances in FLIGHT_NUMBER is:  6691
The total unique instances in SCHEDULED_DEPARTURE is:  1281
The total unique instances in DEPARTURE_TIME is:  1416
The total unique instances in DEPARTURE_DELAY is:  723
The total unique instances in TAXI_OUT is:  162
The total unique instances in WHEELS_OFF is:  1417
The total unique instances in SCHEDULED_TIME is:  529
The total unique instances in ELAPSED_TIME is:  633
The total unique instances in AIR_TIME is:  614
The total unique instances in DISTANCE is:  1328
The total unique instances in WHEELS_ON is:  1440
The total unique instances in TAXI_IN is:  136
The total unique instances in SCHEDULED_ARRIVAL is:  1393
The total unique instances in ARRIVAL_TIME is:  1440
The total unique instances in ARRIVAL_DELAY is:  752
The total unique instances in DIVERT

In [29]:
# Dropping irrelevant columns 'YEAR', 'DIVERTED', 'CANCELLED'
irrelevant_columns = ['YEAR', 'DIVERTED', 'CANCELLED']
relevantFlightsDf = eliminate_columns(irrelevant_columns, flightsDf)


# As per above observation, adding 'MONTH','DAY', 'DAY_OF_WEEK' as categorical qualitative nominal variables
nominal_variables = ['MONTH','DAY', 'DAY_OF_WEEK']
for each in nominal_variables:
    if (each not in categorical_features):
        categorical_features.append(each)

print("Categorical columns are:",categorical_features)

# Removing nominal_variables from numerical column list
numerical_column = []
for each in relevantFlightsDf.columns:
    if (each not in categorical_features) and (each not in numerical_column):
        numerical_column.append(each)
        
print("Numerical columns are:",numerical_column)

label_list = ['binaryArrDelay', 'binaryDeptDelay', 'multiClassArrDelay', 'multiClassDeptDelay', 'autolabelMultiClassArrDelay', 'autolabelMultiClassDeptDelay']
numerical_features = []
for each in numerical_column:
    if (each not in label_list):
        numerical_features.append(each)
        
print("Numerical features are:",numerical_features)


Categorical columns are: ['AIRLINE', 'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'MONTH', 'DAY', 'DAY_OF_WEEK']
Numerical columns are: ['FLIGHT_NUMBER', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'binaryArrDelay', 'binaryDeptDelay', 'multiClassArrDelay', 'multiClassDeptDelay', 'autolabelMultiClassArrDelay', 'autolabelMultiClassDeptDelay']
Numerical features are: ['FLIGHT_NUMBER', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY']


In [30]:
binaryClassRemoval = ['multiClassArrDelay', 'multiClassDeptDelay', 'autolabelMultiClassArrDelay', 'autolabelMultiClassDeptDelay']
binaryClassDf = eliminate_columns(binaryClassRemoval, relevantFlightsDf)
# binaryClassDf.printSchema()

In [31]:
binaryArrRemoval = ['ARRIVAL_DELAY','binaryDeptDelay']
binaryArrDf = eliminate_columns(binaryArrRemoval, binaryClassDf)
print("binaryArrDf: ")
binaryArrDf.printSchema()

binaryDepRemoval = ['DEPARTURE_DELAY', 'binaryArrDelay']
binaryDepDf = eliminate_columns(binaryDepRemoval, binaryClassDf)
print("binaryDepDf: ")
binaryDepDf.printSchema()

binaryArrDf: 
root
 |-- MONTH: integer (nullable = true)
 |-- DAY: integer (nullable = true)
 |-- DAY_OF_WEEK: integer (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- FLIGHT_NUMBER: integer (nullable = true)
 |-- TAIL_NUMBER: string (nullable = true)
 |-- ORIGIN_AIRPORT: string (nullable = true)
 |-- DESTINATION_AIRPORT: string (nullable = true)
 |-- SCHEDULED_DEPARTURE: integer (nullable = true)
 |-- DEPARTURE_TIME: integer (nullable = true)
 |-- DEPARTURE_DELAY: integer (nullable = true)
 |-- TAXI_OUT: integer (nullable = true)
 |-- WHEELS_OFF: integer (nullable = true)
 |-- SCHEDULED_TIME: integer (nullable = true)
 |-- ELAPSED_TIME: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- WHEELS_ON: integer (nullable = true)
 |-- TAXI_IN: integer (nullable = true)
 |-- SCHEDULED_ARRIVAL: integer (nullable = true)
 |-- ARRIVAL_TIME: integer (nullable = true)
 |-- binaryArrDelay: integer (nullable = false)

binaryD

In [32]:
multiArrRemoval = ['ARRIVAL_DELAY', 'multiClassDeptDelay','binaryArrDelay', 'binaryDeptDelay', 'autolabelMultiClassArrDelay', 'autolabelMultiClassDeptDelay']
multiArrDf = eliminate_columns(multiArrRemoval, relevantFlightsDf)
print("multiArrDf: ")
multiArrDf.printSchema()

multiDepRemoval = ['DEPARTURE_DELAY','multiClassArrDelay', 'binaryArrDelay', 'binaryDeptDelay', 'autolabelMultiClassArrDelay', 'autolabelMultiClassDeptDelay']
multiDepDf = eliminate_columns(multiDepRemoval, relevantFlightsDf)
print("multiDepDf: ")
multiDepDf.printSchema()


multiArrDf: 
root
 |-- MONTH: integer (nullable = true)
 |-- DAY: integer (nullable = true)
 |-- DAY_OF_WEEK: integer (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- FLIGHT_NUMBER: integer (nullable = true)
 |-- TAIL_NUMBER: string (nullable = true)
 |-- ORIGIN_AIRPORT: string (nullable = true)
 |-- DESTINATION_AIRPORT: string (nullable = true)
 |-- SCHEDULED_DEPARTURE: integer (nullable = true)
 |-- DEPARTURE_TIME: integer (nullable = true)
 |-- DEPARTURE_DELAY: integer (nullable = true)
 |-- TAXI_OUT: integer (nullable = true)
 |-- WHEELS_OFF: integer (nullable = true)
 |-- SCHEDULED_TIME: integer (nullable = true)
 |-- ELAPSED_TIME: integer (nullable = true)
 |-- AIR_TIME: integer (nullable = true)
 |-- DISTANCE: integer (nullable = true)
 |-- WHEELS_ON: integer (nullable = true)
 |-- TAXI_IN: integer (nullable = true)
 |-- SCHEDULED_ARRIVAL: integer (nullable = true)
 |-- ARRIVAL_TIME: integer (nullable = true)
 |-- multiClassArrDelay: integer (nullable = true)

multi

## 2.2. Preparing any Spark ML Transformers/ Estimators for features and models <a class="anchor" id="2.2"></a>
### 2.2.1 Creating Transformers/Estimators for transforming/assembling the columns <a class="anchor" id="2.2.1"></a>
[Back to top](#table)

In [33]:
%time

# StringIndexer maps each categorical value to a numeric index
from pyspark.ml.feature import StringIndexer
# OneHotEncoder turns the indexed values into one-hot vectors
from pyspark.ml.feature import OneHotEncoder
# VectorAssembler merges all feature columns into a single vector column
from pyspark.ml.feature import VectorAssembler


# Create the estimators and the assembler for a dataframe. Returns the
# transformed dataframe together with the stages reused later in a Pipeline.
def create_estimators_assember(df, delay_type):

    # Index every categorical column into '<column>_index'
    inputCols = categorical_features
    outputCols = [f'{x}_index' for x in categorical_features]
    indexer = StringIndexer(inputCols=inputCols, outputCols=outputCols)
    df_indexed = indexer.fit(df).transform(df)

    # The StringIndexer output columns are the OneHotEncoder input columns
    inputCols_OHE = outputCols
    outputCols_OHE = [f'{x}_vec' for x in inputCols]
    encoder = OneHotEncoder(inputCols=inputCols_OHE,
                            outputCols=outputCols_OHE)
    df_encoded = encoder.fit(df_indexed).transform(df_indexed)

    # Combine the numerical columns with the one-hot encoded columns
    inputCols = fetch_numerical_features(delay_type) + outputCols_OHE

    # Assemble everything into a single 'features' vector column
    assembler = VectorAssembler(inputCols=inputCols, outputCol="features")
    df_final = assembler.transform(df_encoded)

    # Stages that will be used later in pipelining
    stages = [indexer, encoder, assembler]

    return df_final, stages


# Return the numerical feature names without the delay column being predicted.
# delay_type is 'arrival' or 'departure'; any other value returns an empty list.
def fetch_numerical_features(delay_type):
    if delay_type == 'arrival':
        return [x for x in numerical_features if x != 'ARRIVAL_DELAY']
    elif delay_type == 'departure':
        return [x for x in numerical_features if x != 'DEPARTURE_DELAY']
    return []
    

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 4.05 µs


In [34]:
%time 

binaryArrivalDf, binaryArrivalStages = create_estimators_assember(binaryArrDf, 'arrival')
binaryArrivalDf

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 3.81 µs


DataFrame[MONTH: int, DAY: int, DAY_OF_WEEK: int, AIRLINE: string, FLIGHT_NUMBER: int, TAIL_NUMBER: string, ORIGIN_AIRPORT: string, DESTINATION_AIRPORT: string, SCHEDULED_DEPARTURE: int, DEPARTURE_TIME: int, DEPARTURE_DELAY: int, TAXI_OUT: int, WHEELS_OFF: int, SCHEDULED_TIME: int, ELAPSED_TIME: int, AIR_TIME: int, DISTANCE: int, WHEELS_ON: int, TAXI_IN: int, SCHEDULED_ARRIVAL: int, ARRIVAL_TIME: int, binaryArrDelay: int, MONTH_index: double, AIRLINE_index: double, DESTINATION_AIRPORT_index: double, DAY_OF_WEEK_index: double, ORIGIN_AIRPORT_index: double, DAY_index: double, TAIL_NUMBER_index: double, TAIL_NUMBER_vec: vector, ORIGIN_AIRPORT_vec: vector, DESTINATION_AIRPORT_vec: vector, DAY_vec: vector, DAY_OF_WEEK_vec: vector, AIRLINE_vec: vector, MONTH_vec: vector, features: vector]

In [35]:
%time 

binaryDepartureDf, binaryDepartureStages = create_estimators_assember(binaryDepDf, 'departure')
binaryDepartureDf

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.15 µs


DataFrame[MONTH: int, DAY: int, DAY_OF_WEEK: int, AIRLINE: string, FLIGHT_NUMBER: int, TAIL_NUMBER: string, ORIGIN_AIRPORT: string, DESTINATION_AIRPORT: string, SCHEDULED_DEPARTURE: int, DEPARTURE_TIME: int, TAXI_OUT: int, WHEELS_OFF: int, SCHEDULED_TIME: int, ELAPSED_TIME: int, AIR_TIME: int, DISTANCE: int, WHEELS_ON: int, TAXI_IN: int, SCHEDULED_ARRIVAL: int, ARRIVAL_TIME: int, ARRIVAL_DELAY: int, binaryDeptDelay: int, MONTH_index: double, AIRLINE_index: double, DESTINATION_AIRPORT_index: double, DAY_OF_WEEK_index: double, ORIGIN_AIRPORT_index: double, DAY_index: double, TAIL_NUMBER_index: double, TAIL_NUMBER_vec: vector, ORIGIN_AIRPORT_vec: vector, DESTINATION_AIRPORT_vec: vector, DAY_vec: vector, DAY_OF_WEEK_vec: vector, AIRLINE_vec: vector, MONTH_vec: vector, features: vector]

In [36]:
%time 

multiArrivalDf, multiArrivalStages = create_estimators_assember(multiArrDf, 'arrival')
multiArrivalDf

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs


DataFrame[MONTH: int, DAY: int, DAY_OF_WEEK: int, AIRLINE: string, FLIGHT_NUMBER: int, TAIL_NUMBER: string, ORIGIN_AIRPORT: string, DESTINATION_AIRPORT: string, SCHEDULED_DEPARTURE: int, DEPARTURE_TIME: int, DEPARTURE_DELAY: int, TAXI_OUT: int, WHEELS_OFF: int, SCHEDULED_TIME: int, ELAPSED_TIME: int, AIR_TIME: int, DISTANCE: int, WHEELS_ON: int, TAXI_IN: int, SCHEDULED_ARRIVAL: int, ARRIVAL_TIME: int, multiClassArrDelay: int, MONTH_index: double, AIRLINE_index: double, DESTINATION_AIRPORT_index: double, DAY_OF_WEEK_index: double, ORIGIN_AIRPORT_index: double, DAY_index: double, TAIL_NUMBER_index: double, TAIL_NUMBER_vec: vector, ORIGIN_AIRPORT_vec: vector, DESTINATION_AIRPORT_vec: vector, DAY_vec: vector, DAY_OF_WEEK_vec: vector, AIRLINE_vec: vector, MONTH_vec: vector, features: vector]

In [37]:
%time 

multiDepartureDf, multiDepartureStages = create_estimators_assember(multiDepDf, 'departure')
multiDepartureDf

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 8.34 µs


DataFrame[MONTH: int, DAY: int, DAY_OF_WEEK: int, AIRLINE: string, FLIGHT_NUMBER: int, TAIL_NUMBER: string, ORIGIN_AIRPORT: string, DESTINATION_AIRPORT: string, SCHEDULED_DEPARTURE: int, DEPARTURE_TIME: int, TAXI_OUT: int, WHEELS_OFF: int, SCHEDULED_TIME: int, ELAPSED_TIME: int, AIR_TIME: int, DISTANCE: int, WHEELS_ON: int, TAXI_IN: int, SCHEDULED_ARRIVAL: int, ARRIVAL_TIME: int, ARRIVAL_DELAY: int, multiClassDeptDelay: int, MONTH_index: double, AIRLINE_index: double, DESTINATION_AIRPORT_index: double, DAY_OF_WEEK_index: double, ORIGIN_AIRPORT_index: double, DAY_index: double, TAIL_NUMBER_index: double, TAIL_NUMBER_vec: vector, ORIGIN_AIRPORT_vec: vector, DESTINATION_AIRPORT_vec: vector, DAY_vec: vector, DAY_OF_WEEK_vec: vector, AIRLINE_vec: vector, MONTH_vec: vector, features: vector]

### 2.2.2 BONUS TASK: Custom Transformer that allows you to map Months to Season <a class="anchor" id="2.2.2"></a>
[Back to top](#table)
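
The notebook leaves this bonus task empty. As a minimal sketch of the mapping such a transformer would apply, the plain-Python function below converts a month number to a season. The names `SEASON_BY_MONTH` and `month_to_season` are hypothetical, and the season boundaries assume meteorological seasons in the Northern Hemisphere; wrapping the function in a Spark custom transformer is not shown here.

```python
# Assumption: meteorological seasons for the Northern Hemisphere
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}


def month_to_season(month):
    """Map a month number (1-12) to a season name; raise ValueError otherwise."""
    try:
        return SEASON_BY_MONTH[int(month)]
    except (KeyError, TypeError, ValueError):
        raise ValueError(f"month must be an integer from 1 to 12, got {month!r}")


# Example: every MONTH value in the dataset would map to one of four seasons
print([month_to_season(m) for m in (1, 4, 7, 10)])
```

A season column produced this way could then go through the same indexing and one-hot encoding steps as the other categorical features.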

### 2.2.3 Create ML model Estimators for Decision Tree and Gradient Boosted Tree model <a class="anchor" id="2.2.3"></a>
[Back to top](#table)

### Binary classification for arrival delay for Decision Tree Model

In [38]:
from pyspark.ml.classification import DecisionTreeClassifier


# Train a Decision Tree model on the training set and return its predictions on the test set
def createDTModel(trainDf, testDf):
    dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
    # Fitting the model with Train dataset
    model = dt.fit(trainDf)

    # Testing the model on test dataset
    predictions = model.transform(testDf)
    
    return predictions


### Binary classification for Departure delay for Gradient Boosting Model

In [39]:
from pyspark.ml.classification import GBTClassifier

# Train a Gradient Boosting model on the training set and return its predictions on the test set
def createGradientBoostingModel(trainDf, testDf):
    gbt = GBTClassifier(labelCol="label",featuresCol="features", maxIter=10)
    # Fitting the model with Train dataset
    gbtModel = gbt.fit(trainDf)
    
    # Testing the model on test dataset
    predictions = gbtModel.transform(testDf)
    
    return predictions


### 2.2.4 ML model Estimators for Naive Bayes model for multiclass classification <a class="anchor" id="2.2.4"></a>
[Back to top](#table)

In [40]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import NaiveBayes

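def createNaiveBayesModel(trainDf, testDf):
    
    # Note: multinomial Naive Bayes requires non-negative feature values, so
    # features that can be negative (e.g. a delay column) have to be shifted
    # or dropped before fitting
    nb = NaiveBayes(labelCol="label",featuresCol="features", smoothing=1, modelType="multinomial")
    # Fitting the model with Train dataset
    model = nb.fit(trainDf)
    # Testing the model on test dataset
    predictions = model.transform(testDf)
    
    return predictions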

### 2.2.5 Transformers/Estimators into pipelines <a class="anchor" id="2.2.5"></a>
[Back to top](#table)

In [41]:
from pyspark.ml import Pipeline

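def creating_pipeline(df, stages):
    # Fit the indexer/encoder/assembler stages on df
    pipeline = Pipeline(stages=stages)
    model = pipeline.fit(df)
    # Return the transformed dataframe (it includes the 'features' column)
    transformed_df = model.transform(df)
    
    return transformed_df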


In [42]:
def renamedDf(df, labeled_column):
    df = df.withColumnRenamed(labeled_column, 'label')
    
    return df
    

In [43]:
%time

binaryArrTransformer = creating_pipeline(renamedDf(binaryArrDf, 'binaryArrDelay'), binaryArrivalStages)

binaryDepTransformer = creating_pipeline(renamedDf(binaryDepDf,'binaryDeptDelay'), binaryDepartureStages)

multiArrTransformer = creating_pipeline(renamedDf(multiArrDf,'multiClassArrDelay'), multiArrivalStages)

multiDepTransformer = creating_pipeline(renamedDf(multiDepDf,'multiClassDeptDelay'), multiDepartureStages)

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 4.77 µs


## 2.3. Preparing the training and testing data <a class="anchor" id="2.3"></a>
[Back to top](#table)

In [44]:
# Split the data into 80% for training and 20% for testing (fixed seed for reproducibility)
def splitDf(df):
    train, test = df.randomSplit([0.8, 0.2], seed=111)
    
    return train, test

In [45]:
binaryArrTrain, binaryArrTest = splitDf(binaryArrTransformer)
binaryDepTrain, binaryDepTest = splitDf(binaryDepTransformer)
multiArrTrain, multiArrTest = splitDf(multiArrTransformer)
multiDepTrain, multiDepTest = splitDf(multiDepTransformer)

## 2.4 Training and evaluating models <a class="anchor" id="2.4"></a>
### 2.4.1.a ML Pipelines to train the models <a class="anchor" id="2.4.1.a"></a>

[Back to top](#table)

### Decision Tree model for arrival delay

In [46]:
%time 

predictionsArrDT = createDTModel(binaryArrTrain, binaryArrTest)
predictionsArrDT.select('label','rawPrediction', 'prediction', 'probability').show(10)

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.01 µs
+-----+------------------+----------+--------------------+
|label|     rawPrediction|prediction|         probability|
+-----+------------------+----------+--------------------+
|    0|[255169.0,38531.0]|       0.0|[0.86880830779707...|
|    1|    [23.0,41434.0]|       1.0|[5.54791711894251...|
|    0|[255169.0,38531.0]|       0.0|[0.86880830779707...|
|    0|[255169.0,38531.0]|       0.0|[0.86880830779707...|
|    1|  [10776.0,6444.0]|       0.0|[0.62578397212543...|
|    0|[255169.0,38531.0]|       0.0|[0.86880830779707...|
|    1|   [1502.0,5667.0]|       1.0|[0.20951318175477...|
|    0|[255169.0,38531.0]|       0.0|[0.86880830779707...|
|    0|[255169.0,38531.0]|       0.0|[0.86880830779707...|
|    1|[255169.0,38531.0]|       0.0|[0.86880830779707...|
+-----+------------------+----------+--------------------+
only showing top 10 rows



### Decision Tree model for departure delay

In [47]:
%time 

predictionsDepDT = createDTModel(binaryDepTrain, binaryDepTest)
predictionsDepDT.select('label','rawPrediction', 'prediction', 'probability').show(10)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs
+-----+-----------------+----------+--------------------+
|label|    rawPrediction|prediction|         probability|
+-----+-----------------+----------+--------------------+
|    0| [29174.0,9428.0]|       0.0|[0.75576395005440...|
|    1| [1981.0,57190.0]|       1.0|[0.03347923814030...|
|    0| [30562.0,8774.0]|       0.0|[0.77694732560504...|
|    0|[114398.0,9671.0]|       0.0|[0.92205143911855...|
|    0| [1981.0,57190.0]|       1.0|[0.03347923814030...|
|    0| [30562.0,8774.0]|       0.0|[0.77694732560504...|
|    1| [8887.0,32349.0]|       1.0|[0.21551556892036...|
|    0|[114398.0,9671.0]|       0.0|[0.92205143911855...|
|    0|[114398.0,9671.0]|       0.0|[0.92205143911855...|
|    0| [30562.0,8774.0]|       0.0|[0.77694732560504...|
+-----+-----------------+----------+--------------------+
only showing top 10 rows



### Gradient Boosting model for arrival delay

In [None]:
%time

predictionsArrGB = createGradientBoostingModel(binaryArrTrain, binaryArrTest)
predictionsArrGB.select('label','rawPrediction', 'prediction', 'probability').show(10)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs


### Gradient Boosting model for departure delay

In [None]:
%time

predictionsDepGB = createGradientBoostingModel(binaryDepTrain, binaryDepTest)
predictionsDepGB.select('label','rawPrediction', 'prediction', 'probability').show(10)

### 2.4.1.b Display the count of each combination of late/not late label and prediction label <a class="anchor" id="2.4.1.b"></a>
[Back to top](#table)

In [None]:
print('Binary Arrival Delay Classification Decision Tree')
predictionsArrDT.groupBy('label', 'prediction').count().show()

print('Binary Departure Delay Classification Decision Tree')
predictionsDepDT.groupBy('label', 'prediction').count().show()

print('Binary Arrival Delay Classification Gradient Boosting')
predictionsArrGB.groupBy('label', 'prediction').count().show()

print('Binary Departure Delay Classification Gradient Boosting')
predictionsDepGB.groupBy('label', 'prediction').count().show()

### 2.4.1.c Compute the AUC, accuracy, recall, and precision <a class="anchor" id="2.4.1.c"></a>
[Back to top](#table)

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator


def evaluateAUC(predictions):
    evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
    # Area under the ROC curve (the evaluator's default metric)
    auc = evaluator.evaluate(predictions)
    
    print("The Evaluation metric is:", evaluator.getMetricName(), "with", auc)
    
    return auc
    
def calculate_metrics(predictions):
    # Confusion-matrix counts, with label 1 (late) as the positive class
    TN = predictions.filter('prediction = 0 AND label = 0').count()
    TP = predictions.filter('prediction = 1 AND label = 1').count()

    FN = predictions.filter('prediction = 0 AND label = 1').count()
    FP = predictions.filter('prediction = 1 AND label = 0').count()

    accuracy = (TP + TN) / (TP + TN + FN + FP)
    # Guard against division by zero when a class is never predicted or never present
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    # F1 is the harmonic mean of precision and recall
    f1_score = (2 * precision * recall / (precision + recall)
                if (precision + recall) > 0 else 0.0)
    
    print("Accuracy =", accuracy)
    print("Precision =", precision)
    print("Recall =", recall)
    print("F Measure =", f1_score)
    
    evaluateAUC(predictions)
    
    return accuracy, precision, recall, f1_score


In [None]:
print('Calculating metrics for Binary Arrival Delay Classification Decision Tree')
calculate_metrics(predictionsArrDT)
print('----------------------------------------------------------------------------')

print('Calculating metrics for Binary Departure Delay Classification Decision Tree')
calculate_metrics(predictionsDepDT)
print('----------------------------------------------------------------------------')


print('Calculating metrics for Binary Arrival Delay Classification Gradient Boosting')
calculate_metrics(predictionsArrGB)
print('----------------------------------------------------------------------------')


print('Calculating metrics for Binary Departure Delay Classification Gradient Boosting')
calculate_metrics(predictionsDepGB)
print('----------------------------------------------------------------------------')


### 2.4.1.d Which metric is more appropriate for measuring model performance <a class="anchor" id="2.4.1.d"></a>
[Back to top](#table)

**Accuracy** reports the fraction of observations that are labelled correctly. On imbalanced data, however, accuracy can be high even when the model predicts poorly, because predicting the majority class is already correct most of the time. It also weights every prediction equally, so when every class matters equally it cannot show whether each class is predicted well.

**F1 score** is the better measure when the dataset is heavily imbalanced or when the positive class matters most.

**Area under ROC** is the better measure when the task is about ranking predictions and calibrated probabilities are not needed.


Thus, for our case, **Area under the ROC curve** is the better measure of model performance.
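
To illustrate why accuracy can mislead on imbalanced data, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not results from this notebook, and `binary_metrics` is an illustrative helper rather than part of the notebook's code.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical test set: 900 flights that are not late and 100 late flights.
# A model that always predicts "not late" misses every late flight.
always_not_late = binary_metrics(tp=0, tn=900, fp=0, fn=100)
print(always_not_late)
```

With these made-up counts the accuracy is 0.9 even though recall and F1 are both 0, which is the situation the paragraph on accuracy describes.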

### 2.4.1.e Which is the better model, and persist the better model <a class="anchor" id="2.4.1.e"></a>
[Back to top](#table)

In [None]:
%time

gbt = GBTClassifier(labelCol="label",featuresCol="features", maxIter=10)
# Fitting the model with Train dataset
bestArrModel = gbt.fit(binaryArrTrain)

bestDepModel = gbt.fit(binaryDepTrain)


**Gradient Boosting** performs better on our dataset for binary classification: for **arrival_delay, areaUnderROC is 0.9266792553453875**, and for **departure_delay, areaUnderROC is 0.8945453682503985**.

### 2.4.1.f. Top-3 feature with each corresponding feature importance <a class="anchor" id="2.4.1.f"></a>
[Back to top](#table)


### 2.4.1.g. Ways the performance can be improved for both classifiers<a class="anchor" id="2.4.1.g"></a>
[Back to top](#table)


**F1 score** is the better measure when the dataset is heavily imbalanced or when the positive class matters most.

**Area under ROC** is the better measure when the task is about ranking predictions and calibrated probabilities are not needed.

The difference between the **F1 score and ROC AUC** is that the F1 score takes predicted classes as input, whereas AUC takes predicted scores. The F1 score therefore depends on the threshold that assigns observations to classes, and that threshold can be optimised.

Therefore, model performance can be improved by adjusting the threshold (0.5 by default).

For the Decision Tree, switching to a Random Forest classification model can also yield better performance.

Moreover, hyperparameter tuning optimises how the model is built and leads to better predictions, which improves the performance of the classifier.
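
The threshold adjustment mentioned above can be sketched in plain Python. This is an illustration under assumptions, not the notebook's method: the scores and labels are hypothetical, and `f1_at_threshold` and `best_threshold` are illustrative helper names. The idea is to evaluate F1 at several candidate thresholds and keep the one with the highest value.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the positive class when predicting 1 if score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels, candidates):
    """Return the (threshold, f1) pair with the highest F1; ties keep the first."""
    return max(((t, f1_at_threshold(scores, labels, t)) for t in candidates),
               key=lambda pair: pair[1])


# Hypothetical predicted probabilities of "late" and the true labels
scores = [0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

best = best_threshold(scores, labels, candidates=[0.3, 0.5, 0.7])
print(best)
```

In practice the threshold should be chosen on a validation set rather than on the test set, so that the reported test metrics stay unbiased.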

## 2.4.2 Multiclass classification <a class="anchor" id="2.4.2"></a>

### 2.4.2.a ML Pipelines to train the models <a class="anchor" id="2.4.2.a"></a>

[Back to top](#table)

In [None]:
%time

predictionsMultiArr = createNaiveBayesModel(multiArrTrain, multiArrTest)

predictionsMultiArr.select('label','rawPrediction', 'prediction', 'probability').show(10)


### 2.4.2.b Display the count of each combination of early/on-time/late label and prediction label<a class="anchor" id="2.4.2.b"></a>
[Back to top](#table)

In [None]:
predictionsMultiArr.groupBy('label', 'prediction').count().show()

### 2.4.2.c Compute the AUC, accuracy, recall, and precision <a class="anchor" id="2.4.2.c"></a>
[Back to top](#table)

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",metricName="accuracy")

# Calculating the accuracy 
accuracy = evaluator.evaluate(predictionsMultiArr)
    
print("The Evaluation metric is:", evaluator.getMetricName(), "with", accuracy)


In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# One evaluator is reused for every multiclass metric
evaluatorMulti = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")

# Get metrics from the Naive Bayes predictions computed above
accuracy = evaluatorMulti.evaluate(predictionsMultiArr, {evaluatorMulti.metricName: "accuracy"})
f1_score = evaluatorMulti.evaluate(predictionsMultiArr, {evaluatorMulti.metricName: "f1"})
precision = evaluatorMulti.evaluate(predictionsMultiArr, {evaluatorMulti.metricName: "weightedPrecision"})
recall = evaluatorMulti.evaluate(predictionsMultiArr, {evaluatorMulti.metricName: "weightedRecall"})

# areaUnderROC from BinaryClassificationEvaluator is defined for binary labels,
# so it is not computed for the three-class (early/on-time/late) labels

print("Accuracy =", accuracy)
print("Precision =", precision)
print("Recall =", recall)
print("F Measure =", f1_score)
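
To make the weighted metrics above more concrete, the sketch below computes weighted precision and weighted recall in plain Python. It assumes the usual definition, in which each class's precision or recall is weighted by that class's share of the true labels. The `(label, prediction)` pairs are hypothetical and `weighted_precision_recall` is an illustrative helper name.

```python
from collections import Counter


def weighted_precision_recall(pairs):
    """pairs: iterable of (label, prediction).

    Returns (weighted_precision, weighted_recall), each class weighted by
    its share of the true labels.
    """
    pairs = list(pairs)
    total = len(pairs)
    label_counts = Counter(label for label, _ in pairs)
    pred_counts = Counter(pred for _, pred in pairs)
    correct = Counter(label for label, pred in pairs if label == pred)

    w_precision = 0.0
    w_recall = 0.0
    for label, n in label_counts.items():
        precision = correct[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = correct[label] / n
        weight = n / total
        w_precision += weight * precision
        w_recall += weight * recall
    return w_precision, w_recall


# Hypothetical (label, prediction) pairs for three classes 0, 1, 2
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 2), (2, 2), (2, 0)]
wp, wr = weighted_precision_recall(pairs)
print(wp, wr)
```

Under this definition the weighted recall equals the overall accuracy, because each class's recall is weighted by how often that class occurs.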

### 2.4.2.d Which is the better model, and persist the better model <a class="anchor" id="2.4.2.d"></a>
[Back to top](#table)

**F1 score** is the better measure when the dataset is heavily imbalanced or when the positive class matters most.
Thus, we choose **F1** as the performance measure.

### 2.4.2.e. Ways the performance can be improved for classifiers<a class="anchor" id="2.4.2.e"></a>
[Back to top](#table)

**F1 score** is the better measure when the dataset is heavily imbalanced or when the positive class matters most.

**Area under ROC** is the better measure when the task is about ranking predictions and calibrated probabilities are not needed.

The difference between the **F1 score and ROC AUC** is that the F1 score takes predicted classes as input, whereas AUC takes predicted scores. The F1 score therefore depends on the thresholds that assign observations to classes, and those thresholds can be optimised.

Therefore, model performance can be improved by adjusting the decision thresholds.

Moreover, hyperparameter tuning optimises how the model is built and leads to better predictions, which improves the performance of the classifier.


