# Decision Trees in Spark
## 1. Basic Overview
* Decision trees can be used for regression and classification tasks
* They sequentially segment your data into smaller and smaller subsets based on specific feature values/thresholds in order to produce the highest class purity (for classification) or closest mean value (for regression) outputs
* The aim is to reduce error or misclassification at each step of the process in a greedy manner (i.e. focus on current step specifically rather than looking forwards to future steps)
* Pruning reduces tree depth (and hence the number of splits) once the full tree has been built, which limits the overfitting that unpruned decision trees are prone to
* Decision trees are easy to interpret and visualize, following human logic in terms of simple splits based on clear criteria
* However, a single tree often doesn't perform as well as other methods (e.g. linear regression) on comparable problems, so enhancements on simple trees such as bagging, random forests and boosting are commonly used
    * Bagging (bootstrap aggregation) involves bootstrapping (drawing many random samples from the training data with replacement) and then aggregation (fitting an independent tree on each sample, often 100s or 1000s of them, before averaging their predictions, or taking a majority vote for classification, to create a lower-variance model)
    * Random Forests are similar to bagging except that each split may only consider a random subset of the features, which decorrelates the trees and prevents every tree from focusing on the same few most important features
    * Boosting builds the model sequentially from shallow trees (often stumps with a depth of 1), fitting each new tree to the errors of the current ensemble and applying a shrinkage parameter (learning rate) to create a slow, gradual learning method that helps prevent overfitting
* Across all methods and variants, the aim is to reduce the error of the model whilst preventing overfitting
    * For regression, the residual sum of squares (RSS) measures error by comparing each actual value with the mean of the training values that end up in the same terminal node
    * For classification, classification error, the Gini index or cross-entropy is used to assess class purity at each node, with the aim of achieving the best class purity possible at terminal nodes; predicted classes are then compared with actual classes to evaluate the model
* NOTE: regression versions of these tree-based models are available in the pyspark.ml.regression module
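
The greedy, purity-driven splitting described in the list above can be sketched in plain Python. This is a minimal illustration, not Spark's implementation (Spark bins continuous features and works on distributed data); the function names `gini` and `best_split` and the toy data are assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Greedily pick the threshold on one feature that minimises weighted Gini.

    Returns (threshold, weighted_impurity); rows with value <= threshold go left.
    A threshold of None means no split improved on the parent node.
    """
    n = len(values)
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    # skip the largest value: using it would leave the right side empty
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# toy data (made up for illustration): one numeric feature, binary labels
feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
target = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(feature, target)
print(threshold, impurity)
```

A full tree would repeat this search over every feature and then recurse into the left and right subsets, which is why the procedure is called greedy: each split is chosen without looking ahead to later ones.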

In [29]:
# load libs
import findspark

# store location of spark files
findspark.init('/home/matt/spark-3.0.2-bin-hadoop3.2')

# load libs
import pyspark
from pyspark.sql import SparkSession

# start new session
spark = SparkSession.builder.appName('tree').getOrCreate()

# load other libs
# note that these are classifier models, regression models exist too
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier, DecisionTreeClassifier

# read in data
df = spark.read.format('libsvm').load('Data/sample_libsvm_data.txt')

# show schema
df.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



In [30]:
# peek at data
# already formatted for Spark
df.show(3)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
+-----+--------------------+
only showing top 3 rows



In [31]:
# split train/test data
train, test = df.randomSplit([0.7, 0.3])

# build classifier models
# can change maxDepth etc. here
# testing tree amounts and accuracy is useful for seeing
# where you stop gaining benefits by increasing tree num
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier(numTrees=100)
gbtc = GBTClassifier()

# fit models to training data
dtc_model = dtc.fit(train)
rfc_model = rfc.fit(train)
gbtc_model = gbtc.fit(train)

# make predictions
dtc_preds = dtc_model.transform(test)
rfc_preds = rfc_model.transform(test)
gbtc_preds = gbtc_model.transform(test)

# check outputs
dtc_preds.show(3)

+-----+--------------------+-------------+-----------+----------+
|label|            features|rawPrediction|probability|prediction|
+-----+--------------------+-------------+-----------+----------+
|  0.0|(692,[95,96,97,12...|   [33.0,0.0]|  [1.0,0.0]|       0.0|
|  0.0|(692,[122,123,148...|   [33.0,0.0]|  [1.0,0.0]|       0.0|
|  0.0|(692,[124,125,126...|   [33.0,0.0]|  [1.0,0.0]|       0.0|
+-----+--------------------+-------------+-----------+----------+
only showing top 3 rows



In [32]:
# import evaluators
# the binary evaluator only gives you area under the ROC curve or area under the PR curve
# multiclass still works on binary data but also lets you pull more metrics (see below)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# create evaluator
acc_eval = MulticlassClassificationEvaluator(metricName='accuracy')

# evaluate decision tree
# this accuracy is perfect which is worrying
# but this is sample data and so highly separable
# also, basic decision trees are prone to overfitting
# this analysis is a simple example and not a real world scenario
print('DTC Accuracy: ')
acc_eval.evaluate(dtc_preds)

DTC Accuracy: 


1.0

In [33]:
# get feature importances from model
# trees determine which features are most significant
# in relation to the outcome labels
# shows importance by feature
rfc_model.featureImportances

SparseVector(692, {149: 0.001, 154: 0.0003, 155: 0.0016, 177: 0.0007, 178: 0.001, 179: 0.0023, 183: 0.0005, 210: 0.0025, 215: 0.0009, 234: 0.0028, 236: 0.0059, 237: 0.0005, 241: 0.0015, 244: 0.0057, 260: 0.0006, 262: 0.0076, 263: 0.0071, 266: 0.0004, 270: 0.0004, 271: 0.0026, 272: 0.0073, 273: 0.0075, 288: 0.0003, 295: 0.001, 297: 0.0007, 300: 0.0074, 302: 0.0063, 317: 0.0153, 319: 0.0004, 323: 0.0006, 325: 0.0009, 326: 0.0004, 330: 0.0129, 332: 0.001, 344: 0.006, 345: 0.0075, 346: 0.0046, 347: 0.001, 350: 0.0183, 351: 0.0254, 352: 0.0011, 355: 0.001, 356: 0.0143, 357: 0.0121, 358: 0.0005, 360: 0.0012, 369: 0.0005, 372: 0.011, 373: 0.0099, 374: 0.0167, 375: 0.0061, 377: 0.0001, 378: 0.0285, 379: 0.0161, 381: 0.0005, 383: 0.0005, 385: 0.0214, 386: 0.0086, 388: 0.0022, 398: 0.0006, 401: 0.0005, 403: 0.0022, 404: 0.0004, 405: 0.01, 406: 0.0226, 407: 0.0393, 409: 0.001, 410: 0.0017, 411: 0.0033, 413: 0.0229, 415: 0.0003, 425: 0.0027, 426: 0.0054, 427: 0.0056, 429: 0.0224, 433: 0.0209, 434:
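
The raw vector above is hard to read, so it can help to rank the non-zero importances. The sketch below is one possible helper, assuming the importances come back as a SparseVector (as in the output above) whose `indices` and `values` attributes can be passed in; `top_features` is a hypothetical name, and the sample numbers are copied from the start of the output purely for illustration.

```python
def top_features(indices, values, k=5):
    """Return the k (feature_index, importance) pairs with the largest importance."""
    pairs = zip((int(i) for i in indices), (float(v) for v in values))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# first few entries of the output above, used here as stand-in data
sample_indices = [149, 154, 155, 177, 178, 179]
sample_values = [0.001, 0.0003, 0.0016, 0.0007, 0.001, 0.0023]
print(top_features(sample_indices, sample_values, k=3))

# with the fitted model from the cell above, the call would look like:
# importances = rfc_model.featureImportances
# top_features(importances.indices, importances.values, k=10)
```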

## 2. Real World Data
### Building the Model
* Here, we will process more realistic data by converting features into a Spark-friendly format
* A Pipeline can be used to chain together the multiple stages involved; the cells below apply each stage by hand

In [34]:
# load data
df = spark.read.csv('Data/College.csv', inferSchema=True, header=True)

# show schema
df.printSchema()

root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)



In [35]:
# peek at data
df.head(1)

[Row(School='Abilene Christian University', Private='Yes', Apps=1660, Accept=1232, Enroll=721, Top10perc=23, Top25perc=52, F_Undergrad=2885, P_Undergrad=537, Outstate=7440, Room_Board=3300, Books=450, Personal=2200, PhD=70, Terminal=78, S_F_Ratio=18.1, perc_alumni=12, Expend=7041, Grad_Rate=60)]

In [36]:
# load libs
from pyspark.ml.feature import VectorAssembler, StringIndexer

# create assembler
assembler = VectorAssembler(inputCols=['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F_Undergrad',
                                       'P_Undergrad', 'Outstate', 'Room_Board', 'Books', 'Personal',
                                       'PhD', 'Terminal', 'S_F_Ratio', 'perc_alumni', 'Expend', 'Grad_Rate'],
                            outputCol='features')

# apply assembler to format features
output = assembler.transform(df)

# encode categorical vars (private = Yes/No)
private_indexer = StringIndexer(inputCol='Private', outputCol='PrivateIndex')
output_enc = private_indexer.fit(output).transform(output)

# extract relevant vars
final_df = output_enc.select('features', 'PrivateIndex')

# train/test split
train, test = final_df.randomSplit([0.7, 0.3])

### Evaluating the Model
* The results below show that our single decision tree is already a reasonable predictor: an area under the ROC curve of about 0.89 from a single tree indicates that the classes are fairly easy to separate
* The random forest (about 0.98) and gradient boosted trees (about 0.97) clearly outperform the decision tree
* This is the usual pattern, since both are ensembles built from many decision trees, but it is not guaranteed for every dataset or every random train/test split

In [37]:
# load libs
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier, DecisionTreeClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# build model instances
dtc = DecisionTreeClassifier(labelCol='PrivateIndex', featuresCol='features')
rfc = RandomForestClassifier(labelCol='PrivateIndex', featuresCol='features', numTrees=150)
gbt = GBTClassifier(labelCol='PrivateIndex', featuresCol='features')

# fit models to training data
dtc_model = dtc.fit(train)
rfc_model = rfc.fit(train)
gbt_model = gbt.fit(train)

# make predictions
dtc_preds = dtc_model.transform(test)
rfc_preds = rfc_model.transform(test)
gbt_preds = gbt_model.transform(test)

# create evaluator (the default metric is areaUnderROC)
binary_eval = BinaryClassificationEvaluator(labelCol='PrivateIndex')

# show results per model
print('DTC:')
print(binary_eval.evaluate(dtc_preds))

DTC:
0.8907056798623063


In [38]:
# show results per model
print('RFC:')
print(binary_eval.evaluate(rfc_preds))

RFC:
0.9842226047045317


In [39]:
# show results per model
print('GBT:')
print(binary_eval.evaluate(gbt_preds))

GBT:
0.9731306177089308
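
The three scores above are areas under the ROC curve, the evaluator's default metric. One way to read that number is as the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below illustrates that interpretation on made-up scores; `auc_from_scores` is a hypothetical helper, and Spark computes its value from the ROC curve itself, so results on real data may differ slightly.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError('need at least one positive and one negative example')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# made-up labels and predicted probabilities of the positive class
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(toy_labels, toy_scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the scores above are not directly comparable with an accuracy figure.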


In [41]:
# the binary evaluator above doesn't give us many metrics
# the multiclass evaluator lets us look at accuracy, precision, recall etc.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# create evaluator for accuracy
# you can check other metrics by tweaking the metricName
# check documentation for available options
acc_eval = MulticlassClassificationEvaluator(labelCol='PrivateIndex',
                                             metricName='accuracy')

# evaluate rfc model
rfc_acc = acc_eval.evaluate(rfc_preds)
rfc_acc

0.9475982532751092
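
To make the difference between accuracy, precision and recall concrete, the sketch below computes them by hand from (label, prediction) pairs. It is a plain-Python illustration on made-up data; `binary_metrics` is a hypothetical helper that reports precision and recall for one chosen positive class, whereas the evaluator's weighted metrics average across classes, so the numbers are not expected to match it exactly.

```python
def binary_metrics(labels, predictions, positive=1.0):
    """Accuracy plus precision and recall for the chosen positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    return {
        'accuracy': correct / len(pairs),
        # guard against dividing by zero when a class is never predicted or never present
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }


# made-up labels and predictions for illustration
toy_y = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
toy_pred = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
metrics = binary_metrics(toy_y, toy_pred)
print(metrics)
```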