<h1>Decision trees</h1>

A [decision tree](https://en.wikipedia.org/wiki/Decision_tree_learning) can be thought of as a sequence of **hierarchical if-else statements** that test feature values to predict a class.

To train a decision tree from data using [MLlib](https://spark.apache.org/docs/2.0.2/mllib-decision-tree.html), we carry out the following steps:

- read the dataset
- split it into training and test sets
- train a decision tree model on the training set
- measure the error of the model on the test set
    
Start off with the usual setting up and imports:

In [1]:
val sparkVersion = "2.0.1"
val scalaVersion = scala.util.Properties.versionNumberString

sparkVersion: String = "2.0.1"
scalaVersion: String = "2.11.8"

In [2]:
classpath.add(
    "org.apache.spark" %% "spark-yarn" % sparkVersion,
    "org.apache.spark" %% "spark-mllib" % sparkVersion
)

147 new artifact(s)
147 new artifacts in macro
147 new artifacts in runtime
147 new artifacts in compile

In [3]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils

// decision tree imports
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// importing CSV data into the expected format
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.sql.Row

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row

The simple decision tree example in the cell below reads a dataset, splits it into training and test sets, trains a decision tree model on the training set and then measures the error of the model on the held-out test set. We use the [Spambase](http://archive.ics.uci.edu/ml/datasets/Spambase) dataset, replicated for Jupyter at

    files/spambase.data

In [4]:
// Create Spark session
val sparkSession = SparkSession.builder
    .master("local[1]")
    .appName("Decision Tree example")
    .getOrCreate()

// Load the data
val text = sparkSession.sparkContext.textFile("files/spambase.data")

// Parse each comma-separated line: the first 57 columns are the features, the last column (index 57) is the label
val data = text.map(line => line.split(',').map(_.toDouble)).map(t => LabeledPoint(t(57), Vectors.dense(t.take(57))))

// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification tree model:\n" + model.toDebugString)


Test Error = 0.08931185944363104



Learned classification tree model:
DecisionTreeModel classifier of depth 5 with 49 nodes
  If (feature 51 <= 0.057)
   If (feature 6 <= 0.04)
    If (feature 23 <= 0.0)
     If (feature 15 <= 0.1)
      If (feature 52 <= 0.182)
       Predict: 0.0
      Else (feature 52 > 0.182)
       Predict: 1.0
     Else (feature 15 > 0.1)
      If (feature 4 <= 1.07)
       Predict: 0.0
      Else (feature 4 > 1.07)
       Predict: 1.0
    Else (feature 23 > 0.0)
     If (feature 24 <= 0.05)
      If (feature 55 <= 9.0)
       Predict: 0.0
      Else (feature 55 > 9.0)
       Predict: 1.0
     Else (feature 24 > 0.05)
      Predict: 0.0
   Else (feature 6 > 0.04)
    If (feature 26 <= 0.0)
     If (feature 24 <= 0.26)
      If (feature 49 <= 0.375)
       Predict: 1.0
      Else (feature 49 > 0.375)
       Predict: 0.0
     Else (feature 24 > 0.26)
      If (feature 25 <= 0.29)
       Predict: 0.0
      Else (feature 25 > 0.29)
       Predict: 1.0
    Else (feature 26 > 0.0)
     Predict: 0.0
  El

sparkSession: SparkSession = org.apache.spark.sql.SparkSession@626e290b
text: org.apache.spark.rdd.RDD[String] = files/spambase.data MapPartitionsRDD[1] at textFile at Main.scala:34
data: org.apache.spark.rdd.RDD[LabeledPoint] = MapPartitionsRDD[3] at map at Main.scala:37
splits: Array[org.apache.spark.rdd.RDD[LabeledPoint]] = Array(
  MapPartitionsRDD[4] at randomSplit at Main.scala:40,
  MapPartitionsRDD[5] at randomSplit at Main.scala:40
)
trainingData: org.apache.spark.rdd.RDD[LabeledPoint] = MapPartitionsRDD[4] at randomSplit at Main.scala:40
testData: org.apache.spark.rdd.RDD[LabeledPoint] = MapPartitionsRDD[5] at randomSplit at Main.s

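The output contains a lot of information: in this case, the model is a classifier of depth 5 with 49 nodes. The structure of the decision tree is also printed, as nested if-else tests on feature values. For comparison, a much smaller classifier of depth 1 with 3 nodes would be printed as: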

    If (feature 434 <= 0.0)
        Predict: 0.0
    Else (feature 434 > 0.0)
        Predict: 1.0
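
The small tree printed above maps directly onto ordinary code. As a minimal sketch, here is a hand-written Scala equivalent of that two-leaf tree; the name `predictSmallTree` and the example input are hypothetical and only serve to illustrate the "hierarchical if-else" view, they are not part of MLlib.

```scala
// Hand-written equivalent of the two-leaf tree shown above.
// `predictSmallTree` is a hypothetical helper, not an MLlib function.
def predictSmallTree(features: Array[Double]): Double =
  if (features(434) <= 0.0) 0.0 // left leaf: predict class 0
  else 1.0                      // right leaf: predict class 1

// Made-up input: 435 features, all zero
val example = Array.fill(435)(0.0)

println(predictSmallTree(example))                   // prints 0.0
println(predictSmallTree(example.updated(434, 0.5))) // prints 1.0
```

A deeper tree, such as the depth-5 model learned from Spambase, would simply nest further `if`/`else` expressions inside each branch.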
        
The example above contains the variable <tt>impurity</tt>. The node [impurity](https://spark.apache.org/docs/2.0.2/mllib-decision-tree.html#node-impurity-and-information-gain) is a measure of the homogeneity of the labels at the node. The current implementation includes two impurity measures for classification, Gini impurity and entropy, which are selected by passing the relevant value (<tt>gini</tt> or <tt>entropy</tt>) to the classifier.
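
To make the two measures concrete, the sketch below computes them in plain Scala (no Spark required) from the class counts at a node. It follows the definitions in the linked MLlib guide, where Gini impurity is the sum of `f * (1 - f)` over the class fractions `f` and entropy is the sum of `-f * log(f)`. The function names are hypothetical, the base-2 logarithm is an assumption made here, and the code assumes a node with at least one sample; it is an illustration, not MLlib's own implementation.

```scala
// Fraction of the samples at a node that belong to each class
def classFractions(counts: Seq[Long]): Seq[Double] = {
  val total = counts.sum.toDouble
  counts.map(_ / total)
}

// Gini impurity: sum over classes of f * (1 - f)
def giniImpurity(counts: Seq[Long]): Double =
  classFractions(counts).map(f => f * (1.0 - f)).sum

// Entropy: sum over classes of -f * log2(f); classes with no samples contribute nothing
def entropyImpurity(counts: Seq[Long]): Double =
  classFractions(counts)
    .filter(_ > 0.0)
    .map(f => -f * math.log(f) / math.log(2.0))
    .sum

// A node with 5 samples of each of two classes is maximally impure
println(giniImpurity(Seq(5L, 5L)))    // prints 0.5
println(entropyImpurity(Seq(5L, 5L))) // prints 1.0
```

Both measures are zero for a node whose samples all share one label, and largest when the classes are evenly mixed, which is why the tree prefers splits that reduce them.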

<h1>Exercises</h1>

<h2>Exercise 1</h2>

Make the decision tree code a standalone program to run on HPC. Make the <tt>impurity</tt> value an argument to the program (with the values <tt>gini</tt> or <tt>entropy</tt>). Run this on the [default of credit cards](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) dataset. Note that this dataset has a different format to the Spambase dataset above: you will need to convert it from XLS format to, say, CSV before using the data. You can use any available tool for this: for example, Excel has an export option, or there is a command line tool <tt>xls2csv</tt> available on Linux.

<h2>Exercise 2</h2>

Modify your program to run the decision tree as part of a pipeline (see Notebook 3 for a refresher on pipelines). The pipeline model can be used to find the best set of parameters using cross validation. An example of a cross-validator can be found [here](http://spark.apache.org/docs/2.1.0/ml-tuning.html#cross-validation). In your case, make <tt>paramGrid</tt> contain different values for <tt>maxDepth</tt>, <tt>maxBins</tt> and <tt>impurity</tt> and find the best parameters, and associated test error, for both datasets.
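
As a starting point, the setup could be sketched as follows. This uses the DataFrame-based `spark.ml` API (shipped in the same `spark-mllib` artifact added earlier) rather than the RDD-based API from the example above. The candidate values in the grid are illustrative assumptions, and `training` and `test` in the trailing comments are hypothetical DataFrames with `label` and `features` columns that you would need to build yourself.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Decision tree estimator reading the "label" and "features" columns
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Single-stage pipeline; feature transformers could be added before the tree
val pipeline = new Pipeline().setStages(Array(dt))

// Candidate values below are illustrative assumptions, not recommendations
val paramGrid = new ParamGridBuilder()
  .addGrid(dt.maxDepth, Array(3, 5, 10))
  .addGrid(dt.maxBins, Array(16, 32))
  .addGrid(dt.impurity, Array("gini", "entropy"))
  .build()

// Accuracy on the predicted labels; test error is 1 - accuracy
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// With hypothetical DataFrames `training` and `test`:
// val cvModel = cv.fit(training)
// val testError = 1.0 - evaluator.evaluate(cvModel.transform(test))
```

The grid above yields 3 x 2 x 2 = 12 parameter combinations, each trained once per fold, so keep the grid small while you are still debugging the program.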