<a href="https://colab.research.google.com/github/Bishop1303/ML_FlightDelayClassifier/blob/master/ML_FlightDelayClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install the required software:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar -xvf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark


# To use spark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"

# SparkSession
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark


In [3]:
# Mount the drive containing the data:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
flights = spark.read.csv("/content/drive/My Drive/flights.csv", inferSchema=True, header=True, mode='FAILFAST')
flights.show()
flights.printSchema()

Tune the raw dataset:

1.   Remove an uninformative column (**flight**).

2.   Remove rows that lack information about whether or not a flight was delayed (condition on the **delay** column).



In [None]:
# Remove the 'flight' column
flights_drop_column = flights.drop('flight')

# Remove records whose 'delay' value is missing (null or the string 'NA')
flights_none_missing = flights_drop_column.filter((flights_drop_column.delay.isNotNull()) &
                                                  (flights_drop_column.delay != 'NA'))

# Change the 'delay' column type from str to int
flights_none_missing = flights_none_missing.withColumn('delay', flights_none_missing['delay'].cast('int'))

# Check on dataframe
print('The Schema is: ')
flights_none_missing.printSchema()
print('=====================================================')
print('Rows remaining after dropping records with a missing delay: ', flights_none_missing.count())

# Tweaking the data:


1.   Convert the units of distance, replacing the *mile* column with a *km* column.
2.   Create a binary *label* column indicating whether or not a flight was delayed (a delay of 15 minutes or more counts as delayed).
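As a quick sanity check of both transformations, the plain-Python sketch below applies the same arithmetic to a few made-up rows. The sample values are hypothetical and not taken from the dataset; only the 1.60934 km-per-mile factor and the 15-minute threshold come from this notebook.

```python
# Hypothetical sample rows: (distance in miles, delay in minutes)
sample_rows = [(316, 30), (2586, -8), (599, 15)]

KM_PER_MILE = 1.60934
DELAY_THRESHOLD_MIN = 15

converted = []
for mile, delay in sample_rows:
    km = float(round(mile * KM_PER_MILE))      # rounded to the nearest km
    label = int(delay >= DELAY_THRESHOLD_MIN)  # 1 = delayed, 0 = not delayed
    converted.append((km, delay, label))

for row in converted:
    print(row)
```

Note that a delay of exactly 15 minutes is labelled as delayed, because the comparison is `>=`.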



In [None]:
# Import Spark's round function (note: this shadows Python's built-in round)
from pyspark.sql.functions import round

# Conversion: 'mile' to 'km' (rounded to the nearest km), then drop the 'mile' column
flights_km = flights_none_missing.withColumn('km', round(flights_none_missing.mile * 1.60934, 0)).drop('mile')

# Creating 'label' column indicating whether flight delayed (1) or not (0)
flights_km = flights_km.withColumn('label', (flights_km.delay >= 15).cast('integer'))

# Check records
flights_km.show(5)
flights_km.printSchema()

# Indexing: from string to unique index

Transform the categorical columns (**carrier**, **org**) into numerical values.
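By default `StringIndexer` orders categories by descending frequency, so the most common value receives index 0. The plain-Python sketch below imitates that ordering on a hypothetical list of carrier codes. The helper name `frequency_index` is made up for illustration and is not part of PySpark.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical carrier codes (not taken from the dataset)
carriers = ['UA', 'AA', 'UA', 'OO', 'AA', 'UA']
mapping = frequency_index(carriers)
print(mapping)
```

The practical consequence is that the index assigned to a given airport or carrier depends on the data it was fitted on.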

In [None]:
from pyspark.ml.feature import StringIndexer

# Create an indexer
indexer = StringIndexer(inputCol='carrier', outputCol='carrier_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(flights_km)

# Indexer creates a new column with numeric index values
flights_indexed = indexer_model.transform(flights_km)

# Repeat the process for the other categorical feature
flights_indexed = StringIndexer(inputCol='org', outputCol='org_idx').fit(flights_indexed).transform(flights_indexed)

# Check result
flights_indexed.show()

# Vector Assembler

Consolidate the predictor columns into a single column called **features**, which the *Decision Tree* requires as input.

In [None]:
# Import the necessary class
from pyspark.ml.feature import VectorAssembler

# Create an assembler object
assembler = VectorAssembler(inputCols=['mon','dom', 'dow', 'carrier_idx',
                                       'org_idx', 'km', 'depart', 'duration'],
                             outputCol='features')

# Consolidate predictor columns
flights_assembled = assembler.transform(flights_indexed)

# Check the resulting column
flights_assembled.select('features', 'delay').show(5, truncate=False)

# **Decision Tree**

**Root node**: contains all the data.

**Child nodes**: each node is split in two according to a classification criterion.

The same splitting is then applied recursively to every child node.

In this case the decision tree must use **features** to predict **label**, i.e. whether or not the flight was delayed.
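To make the splitting criterion concrete, here is a minimal sketch of how one binary split on a single numeric feature could be chosen by minimising Gini impurity (the default impurity measure of Spark's `DecisionTreeClassifier`). This is an illustration only: the function names and data are hypothetical, and Spark's real implementation differs in its details, for instance by binning continuous features.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float('inf'))
    # The largest value is skipped: it would leave the right child empty
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: departure hour and delay label
depart = [6.0, 8.0, 9.0, 17.0, 18.0, 20.0]
label = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(depart, label)
print(threshold, impurity)
```

A full tree would repeat this search over every feature and then recurse into the two child nodes.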


# **Split train/test**

Randomly split the main dataset into two sets: training and test. Usually the training set is 80% of the total data, four times larger than the test set.

Splitting the data is important: do not test a model on the data it was trained on, because it will of course perform well on data it has already seen.

1.   Training data (used to train the model): about 80% of the total data.
2.   Testing data (used to test the model): the remaining 20% of the data.
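The sketch below shows why the split is only approximately 80:20. It assumes each row is assigned to one of the two sets independently at random, which is how a weighted random split typically behaves. The `random_split` helper and the integer rows are hypothetical stand-ins.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to the training or the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(10_000))  # stand-in for the dataset rows
train, test = random_split(rows, 0.8, seed=7)
print('Training share:', len(train) / len(rows))
```

Fixing the seed makes the split reproducible, which is why the notebook passes `seed=7`.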




In [None]:
# Split into training and testing sets in a 80:20 ratio
flights_train, flights_test = flights_assembled.randomSplit([0.8, 0.2], seed=7)

# Check that the training set is about 4 times the size of the test set (80:20)
training_ratio = flights_train.count() / flights_test.count()
print('Training/Test data ratio is: ', training_ratio)

# Build a Decision Tree
Fit a **Decision Tree model** on *flights_train* and generate predictions for *flights_test*.



In [None]:
# Import the Decision Tree Classifier class
from pyspark.ml.classification import DecisionTreeClassifier

# Classifier object
tree = DecisionTreeClassifier()

# Fit the model to the training data
tree_model = tree.fit(flights_train)

# Create predictions for the testing data
prediction = tree_model.transform(flights_test)

# Check the predictions
prediction.select('label', 'prediction', 'probability').show(5, False)

# **Confusion Matrix**

A confusion matrix gives a useful breakdown of predictions versus known values. It has four cells which represent the counts of:

* **True Negatives** (**TN**) — model predicts negative outcome & known outcome is negative  

*   **True Positives** (**TP**) — model predicts positive outcome & known outcome is positive

* **False Negatives** (**FN**) — model predicts negative outcome but known outcome is positive

* **False Positives** (**FP**) — model predicts positive outcome but known outcome is negative.

$$\text{accuracy} :=\frac{(TN+TP)}{(TN+TP+FN+FP)}$$
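The four counts and the accuracy formula can be checked by hand on a tiny example. The pairs below are hypothetical (label, prediction) values, not output from the model.

```python
from collections import Counter

# Hypothetical (label, prediction) pairs
pairs = [(1, 1), (0, 0), (1, 0), (0, 1), (1, 1), (0, 0), (0, 0), (1, 0)]
counts = Counter(pairs)

tn = counts[(0, 0)]  # predicted negative, actually negative
tp = counts[(1, 1)]  # predicted positive, actually positive
fn = counts[(1, 0)]  # predicted negative, actually positive
fp = counts[(0, 1)]  # predicted positive, actually negative

accuracy = (tn + tp) / (tn + tp + fn + fp)
print(tn, tp, fn, fp, accuracy)
```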


In [None]:
# Confusion matrix
prediction.groupBy('label', 'prediction').count().show()

# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label = 1').count()
FP = prediction.filter('prediction = 1 AND label = 0').count()

# Accuracy measures the proportion of correct predictions
accuracy = (TN + TP) / (TN + TP + FN + FP)

# Check accuracy
print('The model accuracy is: ', '{:.3f}'.format(accuracy))

# Logistic Regression

In [None]:
# Import the logistic regression class
from pyspark.ml.classification import LogisticRegression

# Create a classifier object and train on training data
logistic = LogisticRegression().fit(flights_train)

# Create predictions for the testing data and show confusion matrix
prediction = logistic.transform(flights_test)

# Confusion matrix
prediction.groupBy('label', 'prediction').count().show()

# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label = 1').count()
FP = prediction.filter('prediction = 1 AND label = 0').count()

# **Evaluate the Logistic Regression model**

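Accuracy is generally not a very reliable metric because it can be biased by the most common target class. For example, if 90% of flights were on time, a model that always predicted "on time" would reach 90% accuracy without identifying a single delay.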



*   **precision**: the proportion of positive predictions which are correct, i.e. of all flights predicted to be delayed, what proportion is actually delayed?
$$\text{precision} := \frac{TP}{(TP+FP)}$$


*   **recall**: the proportion of positive outcomes which are correctly predicted, i.e. of all delayed flights, what proportion is correctly predicted by the model?
$$\text{recall} := \frac{TP}{(TP+FN)}$$


*The precision and recall are generally formulated in terms of the positive target class. But it's also possible to calculate weighted versions of these metrics which look at both target classes.*
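As a worked example, the sketch below computes precision and recall for the positive class and then a weighted precision over both classes. The counts are hypothetical. The weighting shown (each class's precision weighted by its share of true instances) is the usual definition and, to the best of our knowledge, matches what `weightedPrecision` reports, but treat that as an assumption.

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 2, 1, 2, 3

# Positive class (delayed flights)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Negative class: correct negative predictions over all negative predictions
precision_neg = tn / (tn + fn)

# Weight each class's precision by the number of true instances of that class
n_pos = tp + fn
n_neg = tn + fp
weighted_precision = (precision_pos * n_pos + precision_neg * n_neg) / (n_pos + n_neg)

print('precision = {:.3f}, recall = {:.3f}, weighted precision = {:.3f}'.format(
    precision_pos, recall_pos, weighted_precision))
```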


In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

# Calculate precision and recall
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print('precision = {:.2f}\nrecall = {:.2f}'.format(precision, recall))

# Find weighted precision
multi_evaluator = MulticlassClassificationEvaluator()
weighted_precision = multi_evaluator.evaluate(prediction, {multi_evaluator.metricName: 'weightedPrecision'})

# Find AUC
binary_evaluator = BinaryClassificationEvaluator()
auc = binary_evaluator.evaluate(prediction, {binary_evaluator.metricName: 'areaUnderROC'})

# AUC: the closer to 1, the better the classifier
print('The area under the curve is: ','{:.2f}'.format(auc))

# Encoding flight origin (One-Hot Encoding)

The **org** column in the flights data is a categorical variable giving the airport from which a flight departs.

Some values are:

* ORD — O'Hare International Airport (Chicago).
* SFO — San Francisco International Airport.
* JFK — John F Kennedy International Airport (New York).
* LGA — La Guardia Airport (New York).
* SMF — Sacramento.
* SJC — San Jose.
* TUS — Tucson International Airport.
* OGG — Kahului (Hawaii).
* ...
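One-hot encoding turns each category index into a vector of dummy variables. With the default `dropLast=True` setting of `OneHotEncoder`, n categories are encoded in n - 1 dummy columns and the last index is represented by all zeros. The sketch below imitates this with dense Python lists. Spark stores the result as sparse vectors, and the helper name `one_hot_drop_last` is hypothetical.

```python
def one_hot_drop_last(index, n_categories):
    """Dummy vector of length n_categories - 1; the last index maps to all zeros."""
    vector = [0.0] * (n_categories - 1)
    if index < n_categories - 1:
        vector[index] = 1.0
    return vector

# Hypothetical example with 4 categories
for idx in range(4):
    print(idx, one_hot_drop_last(idx, 4))
```

The category encoded as all zeros acts as the reference level when the dummy variables are used in a regression.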

In [None]:
# Import the one hot encoder class
from pyspark.ml.feature import OneHotEncoder

# Create an instance of the one hot encoder
onehot = OneHotEncoder(inputCols=['org_idx'], outputCols=['org_dummy'])

# Fit the one hot encoder to the flights_indexed data and apply it
onehot = onehot.fit(flights_indexed)
flights_onehot = onehot.transform(flights_indexed)

# Check the results
flights_onehot.select('org', 'org_idx', 'org_dummy').distinct().sort('org_idx').show()

# Flight duration model: Just distance

The objective is to predict flight duration (the duration column) by using only the distance of the flight (the km column) as a predictor.
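For a single predictor, the least-squares line and its RMSE can be computed directly. The sketch below does so in plain Python on hypothetical flights chosen to lie exactly on the line duration = 40 + 0.08 × km, so the fit should recover those two numbers up to floating-point error. It illustrates ordinary least squares and is not a reproduction of Spark's solver.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical flights: distance in km and duration in minutes
km = [500.0, 1000.0, 2000.0, 4000.0]
duration = [80.0, 120.0, 200.0, 360.0]

intercept, slope = fit_line(km, duration)
predicted = [intercept + slope * d for d in km]
print(intercept, slope, rmse(duration, predicted))
```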


In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler

# Create an assembler object
assembler = VectorAssembler(inputCols=['km'], outputCol='features')

# Consolidate predictor columns
flights_assembled = assembler.transform(flights_onehot)

# Keep only the columns needed downstream
flights_assembled = flights_assembled.select('mon','dom','dow','carrier','org','depart','duration','delay','km',
                                             'org_idx','org_dummy','features')

# Split into training and testing sets in a 80:20 ratio
flights_train, flights_test = flights_assembled.randomSplit([0.8, 0.2], seed=7)

# Create a regression object and train on training data
regression = LinearRegression(labelCol='duration').fit(flights_train)

# Create predictions for the testing data and take a look at the predictions
predictions = regression.transform(flights_test)
predictions.select('duration', 'prediction').show(5, False)

# Calculate the RMSE
print('The RMSE of the model is:')
RegressionEvaluator(labelCol='duration').evaluate(predictions)


# Interpreting the coefficients
The linear regression model for flight duration as a function of distance takes the form:

$$\text{duration} = \alpha + \beta \times \text{distance}$$

Where:

 * $\alpha$ is the *intercept*, i.e. component of duration which does not depend on distance.

 * $\beta$ is the coefficient (or the *slope*), i.e. the rate at which duration increases as a function of distance.

By looking at the coefficients of your model you will be able to infer:

* How much of the average flight duration is actually spent on the ground.

* What the average speed is during a flight.
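The arithmetic behind both interpretations is short. The intercept and slope below are hypothetical placeholders rather than values fitted in this notebook.

```python
intercept = 44.0  # hypothetical: minutes that do not depend on distance
slope = 0.075     # hypothetical: minutes per km

# Minutes per km -> km per hour
avg_speed_kmh = 60 / slope

# Predicted duration of a 1,000 km flight
duration_1000_km = intercept + slope * 1000

print('Average speed: {:,.1f} km/h'.format(avg_speed_kmh))
print('Predicted duration for 1,000 km: {:,.1f} minutes'.format(duration_1000_km))
```

With these placeholder values the speed comes out at about 800 km/h, and about 44 of the roughly 119 predicted minutes are not explained by distance.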

In [None]:
# Intercept (average minutes on ground)
inter = regression.intercept
print('The intercept of your model is: ', inter)

# Coefficients
#coefs = regression.coefficients
#print('The regression coefficient  ',coefs)

# Average minutes per km
minutes_per_km = regression.coefficients[0]
print('The regression coefficient represents the minutes per km: ', '{:,.3f}'.format(minutes_per_km))

# Average speed in km per hour
avg_speed = 60 / minutes_per_km
print('The average speed is: ', '{:,.2f}'.format(avg_speed), 'km/h')

# Flight duration model: Adding origin airport

It stands to reason that the duration of a flight might depend not only on the distance being covered but also the airport from which the flight departs.

Include the departure airport as a predictor to see if the statement is true.

In [None]:
# Create an assembler object
assembler = VectorAssembler(inputCols=['km','org_dummy'], outputCol='features2')
# The vector has km at position 0, so all the org_idx values are shifted by one:
# e.g. ORD had index 0 in org_idx and is now at index 1 in the features2 vector.


# Consolidate predictor columns
flights_assembled2 = assembler.transform(flights_onehot)

# Keep only the columns needed downstream
flights_assembled2 = flights_assembled2.select('mon','dom','dow','carrier','org','depart','duration','delay','km',
                                             'org_idx','org_dummy','features2')

# Split into training and testing sets in a 80:20 ratio
flights_train2, flights_test2 = flights_assembled2.randomSplit([0.8, 0.2], seed=7)

# Create a regression object and train on training data
regression = LinearRegression(featuresCol='features2',labelCol='duration').fit(flights_train2)


# Create predictions for the testing data
predictions = regression.transform(flights_test2)
predictions.select('duration', 'prediction').show(5, False)

# Calculate the RMSE
print('The RMSE of the model is:')
RegressionEvaluator(labelCol='duration').evaluate(predictions)

# Interpreting coefficients

Origin airport, **org**, has eight possible values (ORD, SFO, JFK, LGA, SMF, SJC, TUS and OGG) which have been one-hot encoded to seven dummy variables in **org_dummy**.  

The values for **km** and **org_dummy** have been assembled into **features2**, which has eight columns stored in a *sparse representation*. Column indices in **features2** are as follows:

* 0 — km,
* 1 — ORD,
* 2 — SFO,
* 3 — JFK,
* 4 — LGA,
* 5 — SMF,
* 6 — SJC,
* 7 — TUS.

Note that **OGG** does not appear in this list because it is the *reference level* for the origin airport category. The intercept is therefore the average time on the ground at OGG, and for every other airport the average time on the ground equals the intercept plus the coefficient for that airport, indexed as in the list above.
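The sketch below applies this rule to every airport at once. The intercept and coefficient values are hypothetical placeholders. Only the index layout (km at 0, then ORD to TUS at 1 to 7, with OGG as the reference level) follows the list above.

```python
# Hypothetical fitted values (not output from this notebook)
intercept = 15.0  # average ground minutes at the reference airport, OGG
coefficients = [0.074, 28.0, 20.0, 52.0, 46.0, 18.0, 16.0, 17.0]

# Position of each airport's dummy variable in the feature vector
airport_index = {'ORD': 1, 'SFO': 2, 'JFK': 3, 'LGA': 4,
                 'SMF': 5, 'SJC': 6, 'TUS': 7}

# The reference level has no dummy, so only the intercept applies
ground_minutes = {'OGG': intercept}
for airport, i in airport_index.items():
    ground_minutes[airport] = intercept + coefficients[i]

for airport, minutes in ground_minutes.items():
    print('{}: {:,.1f} minutes'.format(airport, minutes))
```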

In [None]:
# Coefficients
coefs = regression.coefficients
print('The regression coefficients are:\n',coefs)

# Average speed in km per hour
avg_speed_hour = 60 / regression.coefficients[0]
print('The average speed is: ','{:,.2f}'.format(avg_speed_hour), 'km/h')

# Average minutes on ground at OGG
inter = regression.intercept
print('The average time on ground at OGG is: ','{:,.1f}'.format(inter), 'minutes')

# Average minutes on ground at JFK
avg_ground_jfk = inter + regression.coefficients[3]
print('The average time on ground at JFK is: ','{:,.1f}'.format(avg_ground_jfk), 'minutes')

# Average minutes on ground at LGA
avg_ground_lga = inter + regression.coefficients[4]
print('The average time on ground at LGA is: ','{:,.1f}'.format(avg_ground_lga), 'minutes')