# Introduction

We are interested in applying Decision Trees and Random Forests to binary classification problems in Spark. Here we use a dataset of flight data and predict whether a flight is cancelled. This dataset is not balanced, since most flights are not cancelled. For the sake of the exercise we reduce the dataset so that approximately 50% of the flights are cancelled.

We combine our data processing into a pipeline, then train a Decision Tree model followed by a Random Forest. Since we are only interested in exploring these models here, we do not split our data into train/test sets or evaluate the models.

# Exploring and tidying the dataset

We explore the data and note that there are many more non-cancelled flights than cancelled ones. We downsample the non-cancelled flights to get a balanced dataset.

In [0]:
# Show available datasets
display(dbutils.fs.ls('/databricks-datasets'))

path,name,size,modificationTime
dbfs:/databricks-datasets/COVID/,COVID/,0,0
dbfs:/databricks-datasets/README.md,README.md,976,1532468253000
dbfs:/databricks-datasets/Rdatasets/,Rdatasets/,0,0
dbfs:/databricks-datasets/SPARK_README.md,SPARK_README.md,3359,1455043490000
dbfs:/databricks-datasets/adult/,adult/,0,0
dbfs:/databricks-datasets/airlines/,airlines/,0,0
dbfs:/databricks-datasets/amazon/,amazon/,0,0
dbfs:/databricks-datasets/asa/,asa/,0,0
dbfs:/databricks-datasets/atlas_higgs/,atlas_higgs/,0,0
dbfs:/databricks-datasets/bikeSharing/,bikeSharing/,0,0


In [0]:
display(dbutils.fs.ls('/databricks-datasets/airlines/part-00000'))

path,name,size,modificationTime
dbfs:/databricks-datasets/airlines/part-00000,part-00000,67108879,1436493184000


In [0]:
path = '/databricks-datasets/airlines/part-00000'
df = spark.read.option("header", "true").csv(path)
df.show(5)

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|IsArrDelayed|IsDepDelayed|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|1987|   10|    

In [0]:
# We can see that very few flights are cancelled. This will lead to difficulty with predictions, as we can trivially predict all flights not to be cancelled and be almost always correct.
# We will change the data to something more suitable for the experiments below, but of course we shouldn't do this with real data.
cancelled_ratio = df.filter(df['Cancelled'] == 1).count() / df.filter(df['Cancelled'] == 0).count()
cancelled_ratio

Out[26]: 0.008062401678028316
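The point about trivial predictions can be made concrete with a little arithmetic. The sketch below assumes only the ratio printed above; the variable names `cancelled_share` and `baseline_accuracy` are illustrative and do not appear elsewhere in the notebook.

```python
# Ratio of cancelled to non-cancelled flights, copied from the output above
cancelled_ratio = 0.008062401678028316

# Share of all flights that are cancelled: c / (c + n) = r / (1 + r)
cancelled_share = cancelled_ratio / (1 + cancelled_ratio)

# Accuracy of a "model" that always predicts "not cancelled"
baseline_accuracy = 1 - cancelled_share

print(f"cancelled share:   {cancelled_share:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

A classifier that never predicts a cancellation would therefore be right about 99.2% of the time on this data, which is why accuracy alone says little on the unbalanced dataset.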

In [0]:
cancelled = df.filter(df['Cancelled'] == 1)

In [0]:
# We take a sample of the non-cancelled data so that we have approximately the same number of cancelled and non-cancelled flights
non_cancelled_sample = df.filter(df['Cancelled'] == 0).sample(False, cancelled_ratio, seed=0)

In [0]:
# Now we can see we have approximately the same number of cancelled and non-cancelled flights
non_cancelled_sample.count() / cancelled.count()

Out[29]: 0.9796747967479674
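The ratio is close to 1 but not exactly 1. This is consistent with `sample` keeping each row independently with the given probability, so the fraction is an expected value and not a guaranteed count. The plain-Python sketch below imitates that behaviour with made-up numbers: `n_cancelled` and `n_non_cancelled` are hypothetical counts, not values taken from the dataset.

```python
import random

# Hypothetical counts, chosen only to resemble the imbalance seen above
n_cancelled = 800
n_non_cancelled = 100_000

# Same idea as cancelled_ratio: the fraction of non-cancelled rows to keep
fraction = n_cancelled / n_non_cancelled

# Keep each non-cancelled row independently with probability `fraction`
rng = random.Random(0)
kept = sum(1 for _ in range(n_non_cancelled) if rng.random() < fraction)

# The expected value of this ratio is 1, but any single draw fluctuates
ratio_after_sampling = kept / n_cancelled
print(kept, ratio_after_sampling)
```

Changing the seed changes `kept` slightly, in the same way that a different `seed` in the Spark call would give a slightly different ratio.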

In [0]:
# We now rejoin the data
balanced_data = non_cancelled_sample.union(cancelled)
balanced_data.show()

+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|IsArrDelayed|IsDepDelayed|
+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+------------+------------+
|1987|   10|    

In [0]:
# Now that we have balanced data we select only the columns we want.
reduced = balanced_data[['Year', 'Month', 'UniqueCarrier', 'Cancelled']]

In [0]:
reduced.show(5)

+----+-----+-------------+---------+
|Year|Month|UniqueCarrier|Cancelled|
+----+-----+-------------+---------+
|1987|   10|           PS|        0|
|1987|   10|           PS|        0|
|1987|   10|           PS|        0|
|1987|   10|           PS|        0|
|1987|   10|           PS|        0|
+----+-----+-------------+---------+
only showing top 5 rows



In [0]:
# Null values cause problems for the Decision Tree, so we drop rows that contain them
reduced = reduced.dropna()

# Decision Tree Classifier

We prepare the data in three steps: cast the "Cancelled" column to int, use the String Indexer to replace categorical values with indices, and pass the indexed columns to the Vector Assembler, which combines all features into a single vector. We then combine the indexer, the assembler and the classifier into a pipeline and fit it to the data.

In [0]:
from pyspark.ml.classification import DecisionTreeClassifier
dec_tree = DecisionTreeClassifier(labelCol="Cancelled")

In [0]:
# Check data types
reduced.printSchema()

root
 |-- Year: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- Cancelled: string (nullable = true)



In [0]:
# Cancelled needs to be numerical for Decision Tree
from pyspark.sql.functions import col
reduced = reduced.select(['Year', 'Month', 'UniqueCarrier', col('Cancelled').cast('int')])

In [0]:
# We need to use the String Indexer for our categorical data

from pyspark.ml.feature import StringIndexer

catCols = [field for (field, dataType) in reduced.dtypes if dataType == "string"]
numericCols = []

# Names for our columns after they have been indexed
indexOutputCols = [x + "Index" for x in catCols]

str_indexer = StringIndexer(inputCols=catCols, outputCols=indexOutputCols, handleInvalid='skip')
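To make the indexing step less opaque, here is a plain-Python sketch of the idea behind it. It assumes the indexer's default ordering, where the most frequent label gets index 0 and ties are broken alphabetically; the helper `fit_string_index` and the sample `carriers` list are hypothetical and are not part of Spark or of this notebook's data.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically (an assumption about the default ordering).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical carrier codes, for illustration only
carriers = ["PS", "AA", "PS", "UA", "AA", "PS"]

mapping = fit_string_index(carriers)
indexed = [mapping[c] for c in carriers]

print(mapping)
print(indexed)
```

Under this ordering a small index means a common category, which is worth keeping in mind when reading index sets in a fitted tree.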

In [0]:
# Finally we pass all inputs to the Vector Assembler

from pyspark.ml.feature import VectorAssembler

assemblerInputs = indexOutputCols + numericCols
vecAssembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")

In [0]:
# We can turn this to a pipeline

from pyspark.ml import Pipeline

stages = [str_indexer, vecAssembler, dec_tree]
pipeline = Pipeline(stages=stages)

In [0]:
# Fit the pipeline to the reduced data
pipelineModel = pipeline.fit(reduced)

In [0]:
# We can see the Tree's branches and predictions
dtModel = pipelineModel.stages[-1]
print(dtModel.toDebugString)


DecisionTreeClassificationModel: uid=DecisionTreeClassifier_55fd76276d3b, depth=5, numNodes=15, numClasses=2, numFeatures=3
  If (feature 2 in {0.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0})
   If (feature 2 in {5.0,6.0,9.0,10.0,11.0,12.0,13.0})
    If (feature 2 in {13.0})
     Predict: 1.0
    Else (feature 2 not in {13.0})
     If (feature 1 in {1.0})
      If (feature 2 in {12.0})
       Predict: 1.0
      Else (feature 2 not in {12.0})
       Predict: 0.0
     Else (feature 1 not in {1.0})
      Predict: 0.0
   Else (feature 2 not in {5.0,6.0,9.0,10.0,11.0,12.0,13.0})
    If (feature 1 in {1.0})
     Predict: 1.0
    Else (feature 1 not in {1.0})
     If (feature 2 in {7.0,8.0})
      Predict: 1.0
     Else (feature 2 not in {7.0,8.0})
      Predict: 0.0
  Else (feature 2 not in {0.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0})
   Predict: 1.0
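
The debug string refers to features by position only. Assuming the assembler inputs are ordered as constructed above (the indexed string columns in schema order, followed by the empty list of numeric columns), the positions can be mapped back to column names. This is a plain-Python sketch that restates that ordering; it does not query the fitted model, and the snake_case names are illustrative.

```python
# String columns in the order shown by printSchema(), after Cancelled was cast to int
cat_cols = ["Year", "Month", "UniqueCarrier"]
numeric_cols = []

# Same construction as indexOutputCols + numericCols in the notebook
assembler_inputs = [name + "Index" for name in cat_cols] + numeric_cols

# Position in the feature vector -> input column name
feature_names = dict(enumerate(assembler_inputs))

for position, name in feature_names.items():
    print(f"feature {position} -> {name}")
```

Read this way, the splits on `feature 2` are splits on the indexed carrier and the splits on `feature 1` are splits on the indexed month; the values inside the braces are the indices assigned by the String Indexer, not the original strings.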



# Random Forest Classification

Now we try a Random Forest classifier. We can reuse the preprocessing stages defined above and only swap the final classifier stage.
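
As a rough mental model, a forest trains several trees and combines their individual predictions. The sketch below shows a simplified hard majority vote in plain Python. The three stand-in trees (`tree_a`, `tree_b`, `tree_c`) and the helper `forest_predict` are hypothetical, and Spark's own aggregation may differ in detail (for example by combining per-tree class probabilities), so treat this as an illustration of the idea only.

```python
from collections import Counter

# Hypothetical stand-in trees: each maps a feature vector to a class label
def tree_a(features):
    return 1.0 if features[1] == 1.0 else 0.0

def tree_b(features):
    return 1.0 if features[2] == 13.0 else 0.0

def tree_c(features):
    return 1.0 if features[1] == 1.0 or features[2] == 1.0 else 0.0

def forest_predict(trees, features):
    """Return the label predicted by the largest number of trees."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]

# Two of three trees vote "cancelled" for this vector
print(forest_predict(trees, [0.0, 1.0, 0.0]))
# Only one tree votes "cancelled" for this vector
print(forest_predict(trees, [0.0, 0.0, 13.0]))
```

With an odd number of trees and two classes, a hard vote like this cannot tie.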

In [0]:
# Let's try the same but with a Random Forest
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol="Cancelled", maxBins=40, seed=42)

In [0]:
# The pipeline is the same here, apart from the final classifier stage
stages = [str_indexer, vecAssembler, rf]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(reduced)

In [0]:
# We can check the output in the same way
rfModel = pipelineModel.stages[-1]
print(rfModel.toDebugString)

RandomForestClassificationModel: uid=RandomForestClassifier_61d0bc3e5c78, numTrees=20, numClasses=2, numFeatures=3
  Tree 0 (weight 1.0):
    If (feature 1 in {1.0})
     Predict: 1.0
    Else (feature 1 not in {1.0})
     If (feature 2 in {0.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0})
      Predict: 0.0
     Else (feature 2 not in {0.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0})
      Predict: 1.0
  Tree 1 (weight 1.0):
    If (feature 1 in {1.0})
     Predict: 1.0
    Else (feature 1 not in {1.0})
     If (feature 2 in {0.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0})
      Predict: 0.0
     Else (feature 2 not in {0.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0})
      Predict: 1.0
  Tree 2 (weight 1.0):
    If (feature 2 in {0.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0})
     If (feature 1 in {1.0})
      If (feature 2 in {10.0})
       Predict: 0.0
      Else (feature 2 not in {10.0})
       Predict: 1.0
     Else (feature 1 not in