<h1>Bigger decision trees</h1>

This notebook presents bigger examples of decision trees and introduces regression using a decision tree.

In [1]:
val sparkVersion = "2.0.1"
val scalaVersion = scala.util.Properties.versionNumberString

sparkVersion: String = "2.0.1"
scalaVersion: String = "2.11.8"

In [2]:
classpath.add(
    "org.apache.spark" %% "spark-yarn" % sparkVersion,
    "org.apache.spark" %% "spark-mllib" % sparkVersion
)

146 new artifact(s)

146 new artifacts in macro
146 new artifacts in runtime
146 new artifacts in compile

In [3]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils

// decision tree imports
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// importing CSV data into the expected format
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.sql.Row

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row

<h2>Regression</h2>

Using the <tt>spambase</tt> dataset from the previous notebook, we fit a regression tree with variance as the impurity measure and a maximum tree depth of 5. The mean squared error (MSE) on a held-out test set is computed at the end to evaluate goodness of fit.
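To make the two quantities in this paragraph concrete, here is a minimal plain-Scala sketch of variance as an impurity measure and of the mean squared error. The object and function names (`VarianceImpuritySketch`, `variance`, `splitImpurity`, `mse`) are illustrative assumptions and not part of the Spark API; Spark's own training routine searches over binned candidate thresholds rather than evaluating a single threshold like this.

```scala
object VarianceImpuritySketch {
  // Variance of a set of labels: the impurity a regression tree tries to reduce.
  def variance(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val mean = labels.sum / labels.size
      labels.map(y => math.pow(y - mean, 2)).sum / labels.size
    }

  // Size-weighted impurity of the two children obtained by splitting
  // (featureValue, label) pairs at `threshold`. Lower is better.
  def splitImpurity(points: Seq[(Double, Double)], threshold: Double): Double = {
    val (left, right) = points.partition { case (x, _) => x <= threshold }
    val n = points.size.toDouble
    (left.size / n) * variance(left.map(_._2)) +
      (right.size / n) * variance(right.map(_._2))
  }

  // Mean squared error over (label, prediction) pairs.
  def mse(labelsAndPredictions: Seq[(Double, Double)]): Double =
    labelsAndPredictions.map { case (v, p) => math.pow(v - p, 2) }.sum /
      labelsAndPredictions.size
}
```

On four toy points `Seq((0.0, 0.0), (0.1, 0.0), (0.8, 1.0), (0.9, 1.0))`, a threshold of 0.5 separates the labels perfectly, so `splitImpurity` returns 0, whereas the unsplit labels have variance 0.25.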

In [5]:
// Create Spark session
val sparkSession = SparkSession.builder
    .master("local[1]")
    .appName("Decision Tree example")
    .getOrCreate()

// Load the data
val text = sparkSession.sparkContext.textFile("files/spambase.data")

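// Parse each comma-separated line into doubles; the first 57 columns are the
// features and the last column (index 57) is the label
val data = text
    .map(line => line.split(',').map(_.toDouble))
    .map(t => LabeledPoint(t(57), Vectors.dense(t.take(57))))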

// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity,
  maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression tree model:\n" + model.toDebugString)

Test Mean Squared Error = 0.0851854980284374
Learned regression tree model:
DecisionTreeModel regressor of depth 5 with 45 nodes
  If (feature 52 <= 0.055)
   If (feature 6 <= 0.05)
    If (feature 51 <= 0.378)
     If (feature 15 <= 0.19)
      If (feature 22 <= 0.29)
       Predict: 0.061284619917501476
      Else (feature 22 > 0.29)
       Predict: 0.8125
     Else (feature 15 > 0.19)
      If (feature 36 <= 0.19)
       Predict: 0.525
      Else (feature 36 > 0.19)
       Predict: 0.06060606060606061
    Else (feature 51 > 0.378)
     If (feature 55 <= 10.0)
      If (feature 16 <= 0.0)
       Predict: 0.2072072072072072
      Else (feature 16 > 0.0)
       Predict: 1.0
     Else (feature 55 > 10.0)
      If (feature 38 <= 0.27)
       Predict: 0.9074074074074074
      Else (feature 38 > 0.27)
       Predict: 0.0
   Else (feature 6 > 0.05)
    If (feature 26 <= 0.0)
     If (feature 45 <= 0.0)
      If (feature 29 <= 0.18)
       Predict: 0.9674418604651163
      Else (feature 29 >

sparkSession: SparkSession = org.apache.spark.sql.SparkSession@71fb6a24
text: org.apache.spark.rdd.RDD[String] = files/spambase.data MapPartitionsRDD[34] at textFile at Main.scala:49
data: org.apache.spark.rdd.RDD[LabeledPoint] = MapPartitionsRDD[36] at map at Main.scala:52
splits: Array[org.apache.spark.rdd.RDD[LabeledPoint]] = Array(
  MapPartitionsRDD[37] at randomSplit at Main.scala:55,
  MapPartitionsRDD[38] at randomSplit at Main.scala:55
)
trainingData: org.apache.spark.rdd.RDD[LabeledPoint] = MapPartitionsRDD[37] at randomSplit at Main.scala:55
testData: org.apache.spark.rdd.RDD[LabeledPoint] = MapPartitionsRDD[38] at randomSplit at 

<h1>Exercises</h1>

<h2>Exercise 1</h2>

Run your standalone decision tree program on the [Physical Activity Monitoring](http://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring) and the [Physics](http://archive.ics.uci.edu/ml/datasets/HIGGS) datasets, methodically experimenting with the <tt>maxDepth</tt> and <tt>maxBins</tt> values. Obtain timings for each experiment. Note that the Physical Activity Monitoring dataset contains <tt>NaN</tt> (not a number) entries where values are missing. Try dealing with this in two ways:

1. Drop lines containing <tt>NaN</tt>
2. Replace <tt>NaN</tt> with the average value from that column

Run experiments with both options.
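The two options can be sketched on plain in-memory rows as follows. This is only an illustration under stated assumptions: the names (`NaNHandlingSketch`, `dropNaNRows`, `columnMeans`, `imputeWithColumnMean`, `parseLine`) are hypothetical, the input is assumed to be whitespace-separated with missing values written as the literal text `NaN`, and in a real run the same logic would be applied to the feature columns of an RDD rather than to a local `Seq`.

```scala
object NaNHandlingSketch {
  type Row = Array[Double]

  // Assumption: fields are whitespace-separated; "NaN".toDouble yields Double.NaN.
  def parseLine(line: String): Row = line.trim.split("\\s+").map(_.toDouble)

  // Option 1: drop every row that contains at least one NaN.
  def dropNaNRows(rows: Seq[Row]): Seq[Row] =
    rows.filterNot(_.exists(_.isNaN))

  // Per-column mean over the non-NaN entries only.
  // A column with no valid entries keeps NaN as its mean.
  def columnMeans(rows: Seq[Row]): Array[Double] = {
    val numCols = rows.headOption.map(_.length).getOrElse(0)
    Array.tabulate(numCols) { j =>
      val valid = rows.map(_(j)).filterNot(_.isNaN)
      if (valid.isEmpty) Double.NaN else valid.sum / valid.size
    }
  }

  // Option 2: replace each NaN with the mean of its column.
  def imputeWithColumnMean(rows: Seq[Row]): Seq[Row] = {
    val means = columnMeans(rows)
    rows.map(_.zipWithIndex.map { case (v, j) => if (v.isNaN) means(j) else v })
  }
}
```

Dropping rows shrinks the training set, while imputation keeps every row but smooths over the missing measurements, so the two options may give different accuracy and different timings.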

<h2>Exercise 2</h2>

Determine which features are the most important for classification (start by fixing your <tt>maxDepth</tt> and <tt>maxBins</tt> values). Restrict the decision tree program to only these features and compare performance against the full feature set.
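One rough way to approach this, sketched below, is to read the split features off the text returned by `model.toDebugString`, whose `If (feature N <= ...)` line format appears in the regression output earlier in this notebook. The names `FeatureUsageSketch`, `featureUsage`, `ranked` and `selectFeatures` are hypothetical, and counting splits is only a crude proxy for importance: a feature used near the root usually matters more than one used once in a deep leaf, which is why the indentation is recorded as well.

```scala
object FeatureUsageSketch {
  // Matches lines such as "   If (feature 6 <= 0.05)".
  // "Else" lines repeat the same split, so they are deliberately not counted.
  private val SplitLine = """^(\s*)If \(feature (\d+) .*""".r

  // feature index -> (number of splits using it, smallest indentation at which it appears)
  def featureUsage(debugString: String): Map[Int, (Int, Int)] = {
    val hits = debugString.split("\n").toSeq.collect {
      case SplitLine(indent, feature) => (feature.toInt, indent.length)
    }
    hits.groupBy(_._1).map { case (f, hs) => f -> ((hs.size, hs.map(_._2).min)) }
  }

  // Features ordered by shallowest use first, then by split count (descending).
  def ranked(debugString: String): Seq[Int] =
    featureUsage(debugString).toSeq
      .sortBy { case (f, (count, indent)) => (indent, -count, f) }
      .map(_._1)

  // Keep only the chosen feature columns of one row, in the given order.
  def selectFeatures(row: Array[Double], keep: Seq[Int]): Array[Double] =
    keep.map(i => row(i)).toArray
}
```

The output of `ranked` can be truncated to the top few indices and passed to `selectFeatures` when building each feature vector, after which the restricted model is trained and evaluated with the same fixed <tt>maxDepth</tt> and <tt>maxBins</tt>.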

<h2>Exercise 3</h2>

Repeat both exercises using the decision tree regression program above.
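Because the same grid of <tt>maxDepth</tt> and <tt>maxBins</tt> values is run for both the classifier and the regressor, a small timing harness that takes the training step as a function may help keep the experiments methodical. The sketch below is plain Scala; `TimingGridSketch`, `timed`, `Run`, `grid` and the `trainAndEvaluate` parameter are hypothetical names, and the harness measures wall-clock time for a single run of each setting, so repeated runs are advisable before drawing conclusions.

```scala
object TimingGridSketch {
  // Runs `body` once and returns its result with the elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e6)
  }

  // One row of the experiment log.
  case class Run(maxDepth: Int, maxBins: Int, error: Double, millis: Double)

  // `trainAndEvaluate` is a stand-in supplied by the caller: given
  // (maxDepth, maxBins) it trains a model and returns its test error.
  def grid(maxDepths: Seq[Int], maxBinsValues: Seq[Int])(
      trainAndEvaluate: (Int, Int) => Double): Seq[Run] =
    for {
      depth <- maxDepths
      bins <- maxBinsValues
    } yield {
      val (error, millis) = timed(trainAndEvaluate(depth, bins))
      Run(depth, bins, error, millis)
    }
}
```

In the regression cell above, the function passed to `grid` would wrap the call to `DecisionTree.trainRegressor` and the MSE computation. Caching the training and test RDDs beforehand is worth considering, so that re-reading and re-parsing the input file is not counted in every timing.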