# Building a Decision Tree with PySpark

Spark is a powerful, general-purpose tool for working with Big Data. It transparently handles the distribution of compute tasks across a cluster. This makes operations fast, and it lets you focus on the analysis rather than on the technical details.

### Index 
1. A brief reminder of the characteristics of Spark
2. Loading the data
3. Data Preparation
4. Train/test split
5. Build a Decision Tree
6. Evaluate the Decision Tree

In [1]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession            # Entry point for working with DataFrames
from pyspark.sql.functions import round         # Column-wise rounding (shadows Python's built-in round)
from pyspark.ml.feature import StringIndexer    # Maps string categories to numeric indices
from pyspark.ml.feature import VectorAssembler  # Combines feature columns into one vector column
from pyspark.ml.classification import DecisionTreeClassifier  # Decision Tree classifier

## 1. Characteristics of Spark
Spark is one of the most popular technologies for processing large quantities of data. It can handle enormous data volumes, and it does so efficiently.

### Components in a Spark Cluster
Spark is a distributed computing platform. It achieves efficiency by distributing data and computation across a cluster of computers. A Spark cluster consists of a number of hardware and software components which work together.

In [2]:
# Create SparkSession object
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('test') \
                    .getOrCreate()

# What version of Spark?
print(spark.version)

2.4.7


## 2. Loading flights data

In [3]:
path = "/home/danae/Documents/pySparkTraining/files/"
# Read data from CSV file
flights = spark.read.csv(path + 'flights.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue='NA')

# Get number of records
print("The data contain %d records." % flights.count())

# View the first five records
flights.show(5)

# Check column data types
print(flights.dtypes)

The data contain 50000 records.
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows

[('mon', 'int'), ('dom', 'int'), ('dow', 'int'), ('carrier', 'string'), ('flight', 'int'), ('org', 'string'), ('mile', 'int'), ('depart', 'double'), ('duration', 'int'), ('delay', 'int')]


## 3. Data Preparation

You previously loaded airline flight data from a CSV file. You're going to develop a model which will predict whether or not a given flight will be delayed.

In this exercise you need to trim those data down by:

- removing an uninformative column and
- removing rows which do not have information about whether or not a flight was delayed.


In [4]:
# Remove the 'flight' column
flights_drop_column = flights.drop('flight')

# Number of records with missing 'delay' values
flights_drop_column.filter('delay IS NULL').count()

# Remove records with missing 'delay' values
flights_valid_delay = flights_drop_column.filter('delay IS NOT NULL')

# Remove records with missing values in any column and get the number of remaining rows
flights_none_missing = flights_valid_delay.dropna()
print(flights_none_missing.count())

47022


### 3.1 Column manipulation

The Federal Aviation Administration (FAA) considers a flight to be "delayed" when it arrives 15 minutes or more after its scheduled time.

The next step of preparing the flight data has two parts:

1. convert the units of distance, replacing the `mile` column with a `km` column; and
2. create a binary `label` column (1 or 0) indicating whether or not a flight was delayed.

In [5]:
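# Convert 'mile' to 'km' and drop 'mile' column
# (this starts from the original `flights` DataFrame, so the 'flight' column
# and rows with a null 'delay' are still present at this point)
flights_km = flights.withColumn('km', round(flights.mile * 1.60934, 0)) \
                    .drop('mile')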

# Create 'label' column indicating whether flight delayed (1) or not (0)
flights_km = flights_km.withColumn('label', (flights_km.delay >= 15).cast('integer'))

# Check first five records
flights_km.show(5)

+---+---+---+-------+------+---+------+--------+-----+------+-----+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|label|
+---+---+---+-------+------+---+------+--------+-----+------+-----+
| 11| 20|  6|     US|    19|JFK|  9.48|     351| null|3465.0| null|
|  0| 22|  2|     UA|  1107|ORD| 16.33|      82|   30| 509.0|    1|
|  2| 20|  4|     UA|   226|SFO|  6.17|      82|   -8| 542.0|    0|
|  9| 13|  1|     AA|   419|ORD| 10.33|     195|   -5|1989.0|    0|
|  4|  2|  5|     AA|   325|ORD|  8.92|      65| null| 415.0| null|
+---+---+---+-------+------+---+------+--------+-----+------+-----+
only showing top 5 rows
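
As a quick check on the output above, the same arithmetic can be reproduced in plain Python without Spark. This is only an illustrative sketch: the helper names `miles_to_km` and `delay_label` are hypothetical, and rounding half up is used on the assumption that it mirrors Spark's `round` for these positive values.

```python
import math

KM_PER_MILE = 1.60934  # conversion factor used in the notebook

def miles_to_km(miles):
    """Convert miles to whole kilometres, rounding half up (positive inputs only)."""
    return float(math.floor(miles * KM_PER_MILE + 0.5))

def delay_label(delay):
    """Return 1 if delayed by 15 minutes or more, 0 otherwise, None if delay is missing."""
    if delay is None:
        return None  # a missing delay yields a missing label, like the null rows above
    return int(delay >= 15)

# (mile, delay) pairs taken from the first five rows of the flights data
rows = [(2153, None), (316, 30), (337, -8), (1236, -5), (258, None)]
converted = [(miles_to_km(m), delay_label(d)) for m, d in rows]
print(converted)
```

The resulting pairs line up with the `km` and `label` columns shown in the table, including the missing labels for rows whose `delay` is null.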



### 3.2 Categorical columns
In the flights data there are two columns, `carrier` and `org`, which hold categorical data. You need to transform those columns into indexed numerical values.

In [6]:
# Create an indexer
indexer = StringIndexer(inputCol='carrier', outputCol='carrier_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(flights)

# Indexer creates a new column with numeric index values
flights_indexed = indexer_model.transform(flights)

# Repeat the process for the other categorical feature
flights_indexed = StringIndexer(inputCol='org', outputCol='org_idx')\
                                .fit(flights_indexed)\
                                .transform(flights_indexed)

The first step in encoding categorical features is to create a `StringIndexer`. Instances of this class are **Estimators** that take a DataFrame with a column of strings and map each unique string to a number.

Then, the *Estimator* returns a **Transformer** that takes a DataFrame, attaches the mapping to it as metadata, and returns a new DataFrame with a numeric column corresponding to the string column.
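
The fit/transform split described above can be imitated in a few lines of plain Python. This is a rough sketch rather than Spark's implementation: the function names and the toy `carriers` list are hypothetical. It follows `StringIndexer`'s default ordering, where the most frequent label receives index 0; the alphabetical tie-break is an assumption made here for determinism.

```python
from collections import Counter

def fit_string_indexer(values):
    """'Fit' step: learn a label -> index mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (an assumption here)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_string_indexer(mapping, values):
    """'Transform' step: replace each label with its learned index."""
    return [mapping[v] for v in values]

carriers = ['UA', 'AA', 'UA', 'OO', 'AA', 'UA']  # hypothetical toy column
mapping = fit_string_indexer(carriers)
indexed = transform_string_indexer(mapping, carriers)
print(mapping)
print(indexed)
```

Keeping the learned mapping separate from its application is what allows the same mapping to be reused on new data.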

In [7]:
flights_indexed.select('carrier', 'carrier_idx').distinct().orderBy('carrier_idx').show(5)

+-------+-----------+
|carrier|carrier_idx|
+-------+-----------+
|     UA|        0.0|
|     AA|        1.0|
|     OO|        2.0|
|     WN|        3.0|
|     B6|        4.0|
+-------+-----------+
only showing top 5 rows



In [8]:
flights_indexed.select('org', 'org_idx').distinct().orderBy('org_idx').show(5)

+---+-------+
|org|org_idx|
+---+-------+
|ORD|    0.0|
|SFO|    1.0|
|JFK|    2.0|
|LGA|    3.0|
|SJC|    4.0|
+---+-------+
only showing top 5 rows



### 3.3 Assembling columns
The final stage of data preparation is to consolidate all of the predictor columns into a single column.

This has to be done before modeling because every Spark modeling routine expects the data to be in this form.

In [9]:
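# Combine the indexed columns with the 'km' and 'label' columns
# by joining the two DataFrames on the columns they share
flights = flights_indexed.join(flights_km, ['flight', 'mon', 'dom', 'dow', 'carrier',
                                            'org', 'depart', 'duration', 'delay'])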
flights.show(5)

+------+---+---+---+-------+---+------+--------+-----+----+-----------+-------+------+-----+
|flight|mon|dom|dow|carrier|org|depart|duration|delay|mile|carrier_idx|org_idx|    km|label|
+------+---+---+---+-------+---+------+--------+-----+----+-----------+-------+------+-----+
|  1107|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 316|        0.0|    0.0| 509.0|    1|
|   226|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 337|        0.0|    1.0| 542.0|    0|
|   419|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1236|        1.0|    0.0|1989.0|    0|
|   704|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 550|        0.0|    1.0| 885.0|    0|
|   380|  7|  2|  6|     AA|ORD| 10.83|     135|   54| 733|        1.0|    0.0|1180.0|    1|
+------+---+---+---+-------+---+------+--------+-----+----+-----------+-------+------+-----+
only showing top 5 rows
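
Note that the first row of the original data (flight 19, with a null `delay`) no longer appears after the join. A likely explanation is SQL null semantics: an equality join never matches a null key, so rows with a null `delay` are dropped. The plain-Python sketch below imitates that behaviour on two toy rows. `inner_join` is a hypothetical helper, not a Spark function, and the `carrier_idx` value for flight 19 is made up for illustration.

```python
def inner_join(left, right, keys):
    """Inner join two lists of dicts on `keys`, treating None like SQL NULL."""
    joined = []
    for l in left:
        for r in right:
            # NULL never compares equal to anything, including another NULL
            if all(l[k] is not None and l[k] == r[k] for k in keys):
                joined.append({**l, **r})
    return joined

indexed = [{'flight': 19, 'delay': None, 'carrier_idx': 5.0},   # hypothetical index value
           {'flight': 1107, 'delay': 30, 'carrier_idx': 0.0}]
with_km = [{'flight': 19, 'delay': None, 'km': 3465.0},
           {'flight': 1107, 'delay': 30, 'km': 509.0}]

result = inner_join(indexed, with_km, ['flight', 'delay'])
print(result)
```

Only the row with a known `delay` survives, which matches the join output shown above.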



In [10]:
flights = flights.select('mon', 'dom', 'dow', 'carrier_idx', 
                         'org_idx', 'km', 'depart', 'duration', 'delay', 'label')

# Create an assembler object
assembler = VectorAssembler(inputCols=[
    'mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration'
], outputCol='features')

# Consolidate predictor columns
flights_assembled = assembler.transform(flights)

# Check the resulting column
flights_assembled.select('features','label').show(5, truncate=False)

+-----------------------------------------+-----+
|features                                 |label|
+-----------------------------------------+-----+
|[0.0,22.0,2.0,0.0,0.0,509.0,16.33,82.0]  |1    |
|[2.0,20.0,4.0,0.0,1.0,542.0,6.17,82.0]   |0    |
|[9.0,13.0,1.0,1.0,0.0,1989.0,10.33,195.0]|0    |
|[5.0,2.0,1.0,0.0,1.0,885.0,7.98,102.0]   |0    |
|[7.0,2.0,6.0,1.0,0.0,1180.0,10.83,135.0] |1    |
+-----------------------------------------+-----+
only showing top 5 rows
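
Conceptually, the assembler concatenates the selected columns of each row, in the order given, into one numeric vector. A minimal plain-Python sketch of that idea follows. `assemble` is a hypothetical helper; the real `VectorAssembler` returns Spark vector objects and may store them sparsely.

```python
def assemble(row, input_cols):
    """Concatenate the named columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]

input_cols = ['mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration']

# First row of the joined flights data shown earlier
row = {'mon': 0, 'dom': 22, 'dow': 2, 'carrier_idx': 0.0, 'org_idx': 0.0,
       'km': 509.0, 'depart': 16.33, 'duration': 82, 'delay': 30, 'label': 1}

features = assemble(row, input_cols)
print(features)
```

Columns that are not listed in `input_cols`, such as `delay` and `label`, are left out of the vector, so the target never leaks into the predictors.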



# Decision Tree

## 4. Train/test split

To objectively assess a Machine Learning model you need to be able to test it on an independent set of data. You can't use the same data that you used to train the model: of course the model will perform (relatively) well on those data!

You will split the data into two components:

- training data (used to train the model) and
- testing data (used to test the model).


In [11]:
# Split into training and testing sets in an 80:20 ratio
flights_train, flights_test = flights_assembled.randomSplit([0.8, 0.2], seed=17)

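# Check that training set has around 80% of records
# (compare against the DataFrame that was actually split; the row count is the same as for `flights`)
training_ratio = flights_train.count() / flights_assembled.count()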
print(training_ratio)

0.7980732423121092
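
The ratio is close to 0.8 but not exactly 0.8. One way to see why is the sketch below, which assumes that the split is made by an independent random draw for each row, so that the weights act as probabilities rather than exact quotas. `random_split` is a hypothetical stand-in, not Spark's `randomSplit`, and it will not reproduce Spark's row assignments even with the same seed.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(10000))  # stand-in for DataFrame rows
train, test = random_split(rows, train_weight=0.8, seed=17)
print(len(train) / len(rows))
```

Fixing the seed makes the split repeatable, which is useful when comparing models on the same train/test partition.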


In [12]:
flights_train.show(5)

+---+---+---+-----------+-------+-----+------+--------+-----+-----+--------------------+
|mon|dom|dow|carrier_idx|org_idx|   km|depart|duration|delay|label|            features|
+---+---+---+-----------+-------+-----+------+--------+-----+-----+--------------------+
|  0|  1|  2|        0.0|    0.0|378.0| 21.33|      69|   70|    1|[0.0,1.0,2.0,0.0,...|
|  0|  1|  2|        0.0|    0.0|386.0| 13.17|      68|   68|    1|[0.0,1.0,2.0,0.0,...|
|  0|  1|  2|        0.0|    0.0|386.0| 21.25|      68|   85|    1|[0.0,1.0,2.0,0.0,...|
|  0|  1|  2|        0.0|    0.0|476.0| 13.75|      75|   68|    1|[0.0,1.0,2.0,0.0,...|
|  0|  1|  2|        0.0|    0.0|538.0| 22.45|      79|   86|    1|[0.0,1.0,2.0,0.0,...|
+---+---+---+-----------+-------+-----+------+--------+-----+-----+--------------------+
only showing top 5 rows



## 5. Build a Decision Tree
Now that you've split the flights data into training and testing sets, you can use the training set to fit a Decision Tree model.

In [13]:
# Create a classifier object and fit to the training data
tree = DecisionTreeClassifier()
tree_model = tree.fit(flights_train)

# Create predictions for the testing data and take a look at the predictions
prediction = tree_model.transform(flights_test)
prediction.select('label', 'prediction', 'probability').show(5, False)

+-----+----------+---------------------------------------+
|label|prediction|probability                            |
+-----+----------+---------------------------------------+
|1    |1.0       |[0.389808712425655,0.6101912875743449] |
|1    |1.0       |[0.389808712425655,0.6101912875743449] |
|0    |1.0       |[0.3186365751512331,0.6813634248487669]|
|1    |1.0       |[0.389808712425655,0.6101912875743449] |
|1    |1.0       |[0.389808712425655,0.6101912875743449] |
+-----+----------+---------------------------------------+
only showing top 5 rows
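
A decision tree is grown by repeatedly choosing the split that makes the resulting groups as pure as possible. `DecisionTreeClassifier` uses Gini impurity as its default criterion. The sketch below illustrates the idea for a single numeric feature. It is a simplification: it tries every observed value as a threshold, whereas Spark bins continuous features and considers all features at once. The names `gini` and `best_split` and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) minimising impurity for one feature."""
    n = len(values)
    best = (None, gini(labels))  # start from the impurity of the unsplit data
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: departure hour and delayed (1) / on time (0)
depart = [6.0, 7.5, 9.0, 16.0, 18.5, 21.0]
label = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(depart, label)
print(threshold, impurity)
```

The `probability` column in the predictions above plays a related role: it reflects the class proportions in the leaf a row ends up in, which is why several rows share identical probabilities.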



## 6. Evaluate the Decision Tree
You can assess the quality of your model by evaluating how well it performs on the testing data. Because the model was not trained on these data, this represents an objective assessment of the model.

A confusion matrix gives a useful breakdown of predictions versus known values. It has four cells which represent the counts of:

- True Negatives (TN) — model predicts negative outcome & known outcome is negative
- True Positives (TP) — model predicts positive outcome & known outcome is positive
- False Negatives (FN) — model predicts negative outcome but known outcome is positive
- False Positives (FP) — model predicts positive outcome but known outcome is negative.

In [14]:
# Create a confusion matrix
prediction.groupBy('label', 'prediction').count().show()

# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label != prediction').count()
FP = prediction.filter('prediction = 1 AND label != prediction').count()

# Accuracy measures the proportion of correct predictions
accuracy = (TN+TP)/(TN+TP+FN+FP)
print(accuracy)

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0| 1243|
|    0|       0.0| 2400|
|    1|       1.0| 3628|
|    0|       1.0| 2224|
+-----+----------+-----+

0.6348604528699315
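
Accuracy is only one summary of the confusion matrix. As a small worked example, the counts printed above can also be turned into precision and recall, two closely related ratios. The sketch below hard-codes the counts from the table and needs no Spark session; the variable names are illustrative.

```python
# Counts read from the confusion matrix above
TN, TP, FN, FP = 2400, 3628, 1243, 2224

accuracy = (TN + TP) / (TN + TP + FN + FP)   # share of all predictions that are correct
precision = TP / (TP + FP)                   # share of predicted delays that were real delays
recall = TP / (TP + FN)                      # share of real delays that were predicted

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Looking at precision and recall separately shows whether the errors are mostly false alarms (FP) or missed delays (FN), which a single accuracy figure hides.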


In [15]:
# Stop the SparkSession and release its resources
spark.stop()