# <center>Machine Learning with Spark MLlib</center>
## <center>Decision Trees and Random Forests</center>
### <center>July 20,2016</center>

<img src = "https://ibm.box.com/shared/static/wfbduwkbx22nx3i2psbp9g27s2p9s86v.png", width="500" align = 'center'>

## <b>Welcome to the third lab in the course, Machine Learning with Spark MLlib.</b>
### <b>Spark has many libraries, namely under MLlib (Machine Learning Library)! Spark allows for quick and easy scalability of practical machine learning!</b>

In this lab exercise, you will learn how to create Classification and Regression DecisionTree and RandomForest Models, as well as how to tune the parameters for each to create more optimal trees and ensembles of trees.

### Some Notebook Commands
#### In case you haven't dealt with a Jupyter Notebook before, here are some quick, useful commands that may be handy to get started.
<ul>
    <li>Run a cell: CTRL + ENTER</li>
    <li>Create a cell above a cell: a</li>
    <li>Create a cell below a cell: b</li>
    <li>Change a cell to Markdown: m</li>
    
    <li>Change a cell to code: y</li>
</ul>

<b> If you are interested in more keyboard shortcuts, go to Help -> Keyboard Shortcuts </b>

### How this lab will operate:
In this lab, you will be presented with a walk-through of a Regression DecisionTree model with how to tune some of the parameters. Then, you will create a Classification DecisionTree model yourself. You will also be presented with a walk-through of a Classification RandomForest model with how to tune some of the parameters, then you will create a Regression RandomForest model yourself.

## DecisionTree (Regression)

Import the following libraries:
<ul>
    <li>DecisionTree, DecisionTreeModel from pyspark.mllib.tree</li>
    <li>MLUtils from pyspark.mllib.util</li>
    <li>time</li>
</ul>

In [1]:
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
import time

Next, we will load in the <b>poker.txt</b> LibSVM file, which is a dataset based on poker hands. Use <b>MLUtils.loadLibSVMFile</b> and pass in the spark context (<b>sc</b>) and the path to the file <b>'resources/poker.txt'</b>. Store this into a variable called <b>regDT_data</b> 

In [3]:
regDT_data = MLUtils.loadLibSVMFile(sc, '/resources/data/poker.txt')

Next, we need to split the data into a training dataset (called <b>regDT_train</b>) and testing dataset (called <b>regDT_test</b>). This will be done by running the <b>.randomSplit</b> function on <b>regDT_data</b>. The input into .randomSplit will be <b>[0.7, 0.3]</b>. <br> <br>

This will give us a training dataset containing 70% of the data, and a testing dataset containing 30% of the data.

In [4]:
(regDT_train, regDT_test) = regDT_data.randomSplit([0.7, 0.3])

Next, we need to create the Regression Decision Tree called <b>regDT_model</b>. To instantiate the regressor, use <b>DecisionTree.trainRegressor</b>. We will pass in the following parameters:
<ul>
    <li>1st: The input data. In our case, we will use <b>regDT_train</b></li>
    <li>2nd: The categorical features info. For our dataset, have <b>categoricalFeaturesInfo</b> equal <b>{}</b></li>
    <li>3rd: The type of impurity. Since we're dealing with <b>Regression</b>, we will be have <b>impurity</b> set to <b>'variance'</b></li>
    <li>4th: The maximum depth of the tree. For now, set <b>maxDepth</b> to <b>5</b>, which is the default value</li>
    <li>5th: The maximum number of bins. For now, set <b>maxBins</b> to <b>32</b>, which is the default value</li>
    <li>6th: The minimum instances required per node. For now, set <b>minInstancesPerNode</b> to <b>1</b>, which is the default value</li>
    <li>7th: The minimum required information gain per node. For now, set <b>minInfoGain</b> to <b>0.0</b>, which is the default value</li>
</ul> <br> <br>

We will also be timing how long it takes to create the model, so run <b>start = time.time()</b> before creating the model and <b>print(time.time()-start)</b> after the model has been created. <br>
<b>Note</b>: The timings differ on run and by computer, therefore some statements throughout the lab may not directly align with the results you get, which is okay! There are many factors that can affect the time output.

In [5]:
start = time.time()
regDT_model = DecisionTree.trainRegressor(regDT_train, categoricalFeaturesInfo={},
                                    impurity='variance', maxDepth=5, maxBins=32,
                                    minInstancesPerNode=1, minInfoGain=0.0)
print (time.time()-start)

4.51244497299


Next, we want to get the models prediction on the test data, which we will call <b>regDT_pred</b>. We will run <b>.predict</b> on regDT_model, passing in the testing data, <b>regDT_test</b> that is mapped using <b>.map</b> which maps the features by passing in a lambda function (<b>lambda x: x.features</b>).

In [6]:
regDT_pred = regDT_model.predict(regDT_test.map(lambda x: x.features))

Now create a variable called <b>regDT_label_pred</b> which uses a <b>.map</b> on <b>regDT_test</b>. Pass <b>lambda l: l.label</b> into the mapping function. Outside of the mapping function, add a <b>.zip(regDT_pred)</b>. This will merge the label with the prediction</b> 

In [7]:
regDT_label_pred = regDT_test.map(lambda l: l.label).zip(regDT_pred)

Now we will calculate the Mean Squared Error for this prediction, which we will call <b>regDT_MSE</b>. This will equate to <b>regDT_label_pred.map(lambda (v, p): (v - p)**2).sum() / float(regDT_test.count())</b>, which will take the difference of the actual value and the predicted response, square it, and sum that with the rest of the values. Afterwards, it is divided by the total number of values in the testing data.

In [8]:
regDT_MSE = regDT_label_pred.map(lambda (v, p): (v - p)**2).sum() / float(regDT_test.count())

Next, print out the MSE prediction value (<b>str(regDT_MSE)</b>), as well as the learned regression tree model (<b>regDT_model.toDebugString()</b>), so you have an idea of what the tree looks like.

In [9]:
print('Test Mean Squared Error = ' + str(regDT_MSE))
print('Learned Regression Tree Model: ' + regDT_model.toDebugString())

Test Mean Squared Error = 0.604814685923
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 61 nodes
  If (feature 7 <= 2.0)
   If (feature 1 <= 2.0)
    If (feature 5 <= 2.0)
     If (feature 9 <= 2.0)
      Predict: 3.0
     Else (feature 9 > 2.0)
      If (feature 3 <= 2.0)
       Predict: 2.25
      Else (feature 3 > 2.0)
       Predict: 1.425
    Else (feature 5 > 2.0)
     If (feature 9 <= 2.0)
      If (feature 3 <= 2.0)
       Predict: 2.25
      Else (feature 3 > 2.0)
       Predict: 1.2941176470588236
     Else (feature 9 > 2.0)
      If (feature 3 <= 5.0)
       Predict: 1.030612244897959
      Else (feature 3 > 5.0)
       Predict: 0.5538461538461539
   Else (feature 1 > 2.0)
    If (feature 5 <= 2.0)
     If (feature 3 <= 2.0)
      If (feature 9 <= 2.0)
       Predict: 2.0
      Else (feature 9 > 2.0)
       Predict: 1.1951219512195121
     Else (feature 3 > 2.0)
      If (feature 9 <= 2.0)
       Predict: 1.2222222222222223
      Else (feature 9 >

Now that we've created the basic Regression Decision Tree, let's start tuning some parameters! To speed up the process and reduce the amount of code that appears in this notebook, I've made a function that encorporates all of the code above. This way, we can tune the parameters in a single line of code. <br> <br>

Read over the code, and it should be apparent what each of the inputs should be. But just to reiterate:
<ul>
    <li>1st: maxDepthValue is the value for maxDepth (Type:Int, Range: 0 to 30)</li>
    <li>2nd: maxBinsValue is the value for maxBins (Type: Int, Range: >= 2)</li>
    <li>3rd: minInstancesValue is the value for minInstancesPerNode (Type: Int, Range: >=1)</li>
    <li>4th: minInfoGainValue is the value for minInfoGain (Type: Float)</li>
    <ul>
        <li><b>NOTE</b>: The input for minInfoGain MUST contain a decimal (ex. -3.0, 0.1, etc.) or else you will get an error</li>
    </ul>
</ul>

In [10]:
def regDT_tuner(maxDepthValue, maxBinsValue, minInstancesValue, minInfoGainValue):
    start = time.time()
    regDT_model = DecisionTree.trainRegressor(regDT_train, categoricalFeaturesInfo={},
                                        impurity='variance', maxDepth=maxDepthValue, maxBins=maxBinsValue,
                                        minInstancesPerNode=minInstancesValue, minInfoGain=minInfoGainValue)
    print (time.time()-start)

    regDT_pred = regDT_model.predict(regDT_test.map(lambda x: x.features))
    regDT_label_pred = regDT_test.map(lambda l: l.label).zip(regDT_pred)
    regDT_MSE = regDT_label_pred.map(lambda (v, p): (v - p)**2).sum() / float(regDT_test.count())

    print('Test Mean Squared Error = ' + str(regDT_MSE))
    print('Learned Regression Tree Model: ' + regDT_model.toDebugString())

Start off by re-creating the original tree. That requires the inputs: <b>(5, 32, 1, 0.0)</b> into <b>regDT_tuner</b>

In [11]:
regDT_tuner(5, 32, 1, 0.0)

1.61899995804
Test Mean Squared Error = 0.604814685923
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 61 nodes
  If (feature 7 <= 2.0)
   If (feature 1 <= 2.0)
    If (feature 5 <= 2.0)
     If (feature 9 <= 2.0)
      Predict: 3.0
     Else (feature 9 > 2.0)
      If (feature 3 <= 2.0)
       Predict: 2.25
      Else (feature 3 > 2.0)
       Predict: 1.425
    Else (feature 5 > 2.0)
     If (feature 9 <= 2.0)
      If (feature 3 <= 2.0)
       Predict: 2.25
      Else (feature 3 > 2.0)
       Predict: 1.2941176470588236
     Else (feature 9 > 2.0)
      If (feature 3 <= 5.0)
       Predict: 1.030612244897959
      Else (feature 3 > 5.0)
       Predict: 0.5538461538461539
   Else (feature 1 > 2.0)
    If (feature 5 <= 2.0)
     If (feature 3 <= 2.0)
      If (feature 9 <= 2.0)
       Predict: 2.0
      Else (feature 9 > 2.0)
       Predict: 1.1951219512195121
     Else (feature 3 > 2.0)
      If (feature 9 <= 2.0)
       Predict: 1.2222222222222223
      Els

Remember that when we are tuning a specific parameter, that we will keep the other parameters at their original value

### maxDepth Parameter 
Let's start by tuning the <b>maxDepth</b> parameter. Begin by setting it to a lower value, such as <b>1</b>

In [12]:
regDT_tuner(1, 32, 1, 0.0)

1.07931900024
Test Mean Squared Error = 0.650456393425
Learned Regression Tree Model: DecisionTreeModel regressor of depth 1 with 3 nodes
  If (feature 7 <= 2.0)
   Predict: 0.5831238779174147
  Else (feature 7 > 2.0)
   Predict: 0.6239956421081302



By decreasing the maxDepth parameter, you can see that the run-time slightly decreased, presenting a smaller tree as well. You may also see a slight increase in the error, which is to be expected since the tree is too small to make accurate predictions.

Now try increasing to value of <b>maxDepth</b> to a large number, such as <b>30</b>, which is the maximum value.

In [13]:
regDT_tuner(30, 32, 1, 0.0)

5.37232494354
Test Mean Squared Error = 1.15386656055
Learned Regression Tree Model: DecisionTreeModel regressor of depth 30 with 12745 nodes
  If (feature 7 <= 2.0)
   If (feature 1 <= 2.0)
    If (feature 5 <= 2.0)
     If (feature 9 <= 2.0)
      Predict: 3.0
     Else (feature 9 > 2.0)
      If (feature 3 <= 2.0)
       If (feature 4 <= 1.0)
        Predict: 2.0
       Else (feature 4 > 1.0)
        Predict: 3.0
      Else (feature 3 > 2.0)
       If (feature 3 <= 7.0)
        If (feature 0 <= 1.0)
         If (feature 6 <= 2.0)
          Predict: 1.0
         Else (feature 6 > 2.0)
          If (feature 2 <= 1.0)
           Predict: 1.0
          Else (feature 2 > 1.0)
           Predict: 3.0
        Else (feature 0 > 1.0)
         Predict: 1.0
       Else (feature 3 > 7.0)
        If (feature 3 <= 10.0)
         If (feature 8 <= 3.0)
          If (feature 0 <= 2.0)
           If (feature 5 <= 1.0)
            If (feature 6 <= 1.0)
             Predict: 2.0
            Else (featu

With a large value for maxDepth, you can see that the run-time increased greatly, along with the size of the tree. The MSE has increased greatly compared to the original, which is due to overfitting of the training data from having a deep tree.

### maxBins Parameter
Now let's tune the <b>maxBins</b> variable. Start by decreasing the value to 2, to see what the lower end of this value does to the tree.

In [14]:
regDT_tuner(5, 2, 1, 0.0)

1.18987512589
Test Mean Squared Error = 0.608460553012
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 63 nodes
  If (feature 7 <= 6.0)
   If (feature 5 <= 7.0)
    If (feature 9 <= 7.0)
     If (feature 1 <= 6.0)
      If (feature 3 <= 7.0)
       Predict: 1.2367906066536203
      Else (feature 3 > 7.0)
       Predict: 0.7012195121951219
     Else (feature 1 > 6.0)
      If (feature 3 <= 7.0)
       Predict: 0.7216338880484114
      Else (feature 3 > 7.0)
       Predict: 0.5072463768115942
    Else (feature 9 > 7.0)
     If (feature 1 <= 6.0)
      If (feature 3 <= 7.0)
       Predict: 0.6706827309236948
      Else (feature 3 > 7.0)
       Predict: 0.5178571428571429
     Else (feature 1 > 6.0)
      If (feature 2 <= 2.0)
       Predict: 0.5333333333333333
      Else (feature 2 > 2.0)
       Predict: 0.46923076923076923
   Else (feature 5 > 7.0)
    If (feature 9 <= 7.0)
     If (feature 1 <= 6.0)
      If (feature 3 <= 7.0)
       Predict: 0.653688524590163

Comparing this to the original tree, we can see a small decrease in the training time, but not much of a difference in regards to MSE or the size of the tree.

Now let's take a look at the upper end, with a value of 15000

In [15]:
regDT_tuner(5, 15000, 1, 0.0)

1.2347369194
Test Mean Squared Error = 0.604814685923
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 61 nodes
  If (feature 7 <= 2.0)
   If (feature 1 <= 2.0)
    If (feature 5 <= 2.0)
     If (feature 9 <= 2.0)
      Predict: 3.0
     Else (feature 9 > 2.0)
      If (feature 3 <= 2.0)
       Predict: 2.25
      Else (feature 3 > 2.0)
       Predict: 1.425
    Else (feature 5 > 2.0)
     If (feature 9 <= 2.0)
      If (feature 3 <= 2.0)
       Predict: 2.25
      Else (feature 3 > 2.0)
       Predict: 1.2941176470588236
     Else (feature 9 > 2.0)
      If (feature 3 <= 5.0)
       Predict: 1.030612244897959
      Else (feature 3 > 5.0)
       Predict: 0.5538461538461539
   Else (feature 1 > 2.0)
    If (feature 5 <= 2.0)
     If (feature 3 <= 2.0)
      If (feature 9 <= 2.0)
       Predict: 2.0
      Else (feature 9 > 2.0)
       Predict: 1.1951219512195121
     Else (feature 3 > 2.0)
      If (feature 9 <= 2.0)
       Predict: 1.2222222222222223
      Else

With a very large maxBin value, we don't see too much of a change in the overall time or in the MSE. The model still has the same depth and nodes, as expected.

### minInstancesPerNode parameter
Next we will look at tuning the <b>minInstancesPerNode</b> parameter. It starts off at the lowest value of 1, but let's see what happens if we keep increasing the value. Starting off with the value <b>100</b>

In [16]:
regDT_tuner(5, 32, 100, 0.0)

1.22303390503
Test Mean Squared Error = 0.617423137205
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 49 nodes
  If (feature 7 <= 2.0)
   If (feature 1 <= 2.0)
    If (feature 9 <= 4.0)
     Predict: 1.2357723577235773
    Else (feature 9 > 4.0)
     If (feature 3 <= 6.0)
      Predict: 0.9528301886792453
     Else (feature 3 > 6.0)
      Predict: 0.6233766233766234
   Else (feature 1 > 2.0)
    If (feature 5 <= 2.0)
     If (feature 3 <= 4.0)
      Predict: 0.9454545454545454
     Else (feature 3 > 4.0)
      If (feature 9 <= 7.0)
       Predict: 0.8057553956834532
      Else (feature 9 > 7.0)
       Predict: 0.6132075471698113
    Else (feature 5 > 2.0)
     If (feature 3 <= 2.0)
      If (feature 0 <= 2.0)
       Predict: 0.8289473684210527
      Else (feature 0 > 2.0)
       Predict: 0.6305732484076433
     Else (feature 3 > 2.0)
      If (feature 9 <= 2.0)
       Predict: 0.61328125
      Else (feature 9 > 2.0)
       Predict: 0.41295546558704455
  Else

With minInstancesPerNode set to 100, we don't see much of a change in time and MSE, but we can see that there are less nodes in the tree. Try now with a value of <b>1000</b>

In [21]:
regDT_tuner(5, 32, 1000, 0.0)

1.12289094925
Test Mean Squared Error = 0.634591568985
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 15 nodes
  If (feature 7 <= 2.0)
   If (feature 5 <= 5.0)
    Predict: 0.6787072243346007
   Else (feature 5 > 5.0)
    Predict: 0.5251009809578765
  Else (feature 7 > 2.0)
   If (feature 5 <= 2.0)
    If (feature 3 <= 6.0)
     Predict: 0.6546341463414634
    Else (feature 3 > 6.0)
     Predict: 0.5008038585209004
   Else (feature 5 > 2.0)
    If (feature 9 <= 2.0)
     Predict: 0.5269521410579345
    Else (feature 9 > 2.0)
     If (feature 1 <= 2.0)
      Predict: 0.48072562358276644
     Else (feature 1 > 2.0)
      If (feature 3 <= 2.0)
       Predict: 0.45174825174825173
      Else (feature 3 > 2.0)
       Predict: 0.736391268306162



With a value of 1000, we may see more of a decrease in the time, but the MSE has also increased a little bit. As well, the number of nodes in the model has decreased once again. Let's take it one step further and try with a value of <b>8000</b>

In [20]:
regDT_tuner(5, 32, 8000, 0.0)

0.940607070923
Test Mean Squared Error = 0.649847290714
Learned Regression Tree Model: DecisionTreeModel regressor of depth 1 with 3 nodes
  If (feature 7 <= 6.0)
   Predict: 0.6062953399729497
  Else (feature 7 > 6.0)
   Predict: 0.6272221032340972



With a value of 8000, we may see that the run-time to build the model is starting to decrease a lot more, with only a small increase in MSE compared to when the value was set to 1000. The main difference we see is that the tree has become a lot smaller! This is to be expected since we are tuning a stopping parameter, which determines when the model finishes building.

### minInfoGain Parameter
For the last parameter, we will look at the minInfoGain parameter, which was initially set to 0.0. This value works well with negative values, and is very sensitive with values greater than 0.0. Try setting the value to a low number, such as -100.0

In [19]:
regDT_tuner(5, 32, 1, -100.0)

1.16571116447
Test Mean Squared Error = 0.604814685923
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 61 nodes
  If (feature 7 <= 2.0)
   If (feature 1 <= 2.0)
    If (feature 5 <= 2.0)
     If (feature 9 <= 2.0)
      Predict: 3.0
     Else (feature 9 > 2.0)
      If (feature 3 <= 2.0)
       Predict: 2.25
      Else (feature 3 > 2.0)
       Predict: 1.425
    Else (feature 5 > 2.0)
     If (feature 9 <= 2.0)
      If (feature 3 <= 2.0)
       Predict: 2.25
      Else (feature 3 > 2.0)
       Predict: 1.2941176470588236
     Else (feature 9 > 2.0)
      If (feature 3 <= 5.0)
       Predict: 1.030612244897959
      Else (feature 3 > 5.0)
       Predict: 0.5538461538461539
   Else (feature 1 > 2.0)
    If (feature 5 <= 2.0)
     If (feature 3 <= 2.0)
      If (feature 9 <= 2.0)
       Predict: 2.0
      Else (feature 9 > 2.0)
       Predict: 1.1951219512195121
     Else (feature 3 > 2.0)
      If (feature 9 <= 2.0)
       Predict: 1.2222222222222223
      Els

Overall, we don't see much of a change at all to anything. Now try changing the value to 0.0003

In [22]:
regDT_tuner(5, 32, 1, 0.0003)

0.994264125824
Test Mean Squared Error = 0.649837699394
Learned Regression Tree Model: DecisionTreeModel regressor of depth 0 with 1 nodes
  Predict: 0.6174803960849408



We can see that small values greater than zero can cause drastic changes in how the model looks. Here, we see a small decrease in the training time, and small increase in the MSE value. But now the tree only contains one node in it. The affect of this parameter on the tree is similar to minInstancesPerNode, since they are both stopping parameters.

---

## DecisionTree (Classification)

Now it's time for you to try it out for yourself! Build a Classification DecisionTree in a similar way that the Regression DecisionTree was built. Please note that you will be using the same dataset in this section (regDT_train, regDT_test), therefore you do not need to re-initialize that section.<br> <br> 

Try to only reference the above section when you are experiencing a lot of difficulty. This section is mainly for you to apply your learning.

For some help with the variables:
<ul>
    <li><b>numClasses</b>: The number of classes for this dataset is <b>10</b> (parameter doesn't require tuning)</li>
    <li><b>categoricalFeaturesInfo</b>: Has a value of <b>{}</b> (parameter doesn't require tuning)</li>
    <li><b>impurity</b>: There are two types of impurites you can use -- <b>'gini'</b> or <b>'entropy'</b> <i>(Default: 'gini')</i></li>
    <li><b>maxDepth</b>: Values range between <b>0 and 30</b> <i>(Default: 5)</i></li>
    <li><b>maxBins</b>: Value ranges between <b>2 and 2147483647</b> (largest value for 32-bits) <i>(Default: 32)</i></li>
    <li><b>minInstancesPerNode</b> ranges between <b>1 and 2147483647</b> <i>(Default: 1)</i></li>
    <li><b>minInfoGain</b>: Ensure it is a float (has a decimal in the value) <i>(Default: 0.0)</i></li>
</ul>

When displaying the <b>Training Error</b>, use the following formula and print statement instead of MSE: <br>
<b>classDT_error = classDT_label_pred.filter(lambda (v, p): v != p).count() / float(regDT_test.count())</b> <br>
<b>print('Test Error = ' + str(classDT_error))</b>

### The Goal
Try to create a model that is better than the model with default values. Challenge yourself by trying to create the best model you can!


### Note
We want a model that doesn't take too long to train and will cause overfitting. Remember that a very large model with high accuracy but long run time may not be good because the model may have overfit the data.

In [23]:
start = time.time()
classDT_model = DecisionTree.trainClassifier(regDT_train, numClasses = 10, 
                                     categoricalFeaturesInfo = {},
                                     impurity = 'gini', maxDepth = 9,
                                     maxBins = 25, minInstancesPerNode = 4,
                                     minInfoGain = -3.0)
print(time.time() - start)
# Evaluate model on test instances and compute test error
classDT_pred = classDT_model.predict(regDT_test.map(lambda x: x.features))
classDT_label_pred = regDT_test.map(lambda lp: lp.label).zip(classDT_pred)
classDT_error = classDT_label_pred.filter(lambda (v, p): v != p).count() / float(regDT_test.count())
print('Test Error = ' + str(classDT_error))
print('Learned classification tree model:' + classDT_model.toDebugString())

1.55005598068
Test Error = 0.460140602202
Learned classification tree model:DecisionTreeModel classifier of depth 9 with 891 nodes
  If (feature 5 <= 4.0)
   If (feature 9 <= 4.0)
    If (feature 3 <= 4.0)
     If (feature 1 <= 4.0)
      If (feature 7 <= 4.0)
       If (feature 3 <= 2.0)
        If (feature 7 <= 1.0)
         Predict: 1.0
        Else (feature 7 > 1.0)
         If (feature 4 <= 3.0)
          Predict: 2.0
         Else (feature 4 > 3.0)
          Predict: 2.0
       Else (feature 3 > 2.0)
        If (feature 9 <= 3.0)
         If (feature 4 <= 2.0)
          Predict: 1.0
         Else (feature 4 > 2.0)
          Predict: 3.0
        Else (feature 9 > 3.0)
         Predict: 1.0
      Else (feature 7 > 4.0)
       If (feature 3 <= 3.0)
        If (feature 8 <= 2.0)
         If (feature 9 <= 2.0)
          If (feature 5 <= 1.0)
           Predict: 1.0
          Else (feature 5 > 1.0)
           Predict: 1.0
         Else (feature 9 > 2.0)
          If (feature 8 <= 1.0)


---

## RandomForest (Classifier)

Now that we've run through the DecisionTree model, let's work with RandomForests now. The process for this will be similar with the DecisionTree section.

Import the following libraries:
<ul>
    <li>RandomForest, RandomForestModel from pyspark.mllib.tree</li>
    <li>MLUtils from pyspark.mllib.util</li>
    <li>time</li>
</ul>

In [24]:
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
import time

Next, we will load in the <b>pendigits.txt</b> LibSVM file, which is a dataset based on Pen-Based Recognition of Handwritten Digits. Use <b>MLUtils.loadLibSVMFile</b> and pass in the spark context (<b>sc</b>) and the path to the file <b>'resources/pendigits.txt'</b>. Store this into a variable called <b>classRF_data</b> <br> <br>

Note: You can also try out this section with the poker.txt dataset if you want to compare results from both sections!

In [27]:
classRF_data = MLUtils.loadLibSVMFile(sc, '/resources/data/poker.txt')

Next, we need to split the data into a training dataset (called <b>classRF_train</b>) and testing dataset (called <b>classRF_test</b>). This will be done by running the <b>.randomSplit</b> function on <b>classRF_data</b>. The input into .randomSplit will be <b>[0.7, 0.3]</b>. <br> <br>

This will give us a training dataset containing 70% of the data, and a testing dataset containing 30% of the data.

In [28]:
(classRF_train, classRF_test) = classRF_data.randomSplit([0.7, 0.3])

Next, we need to create the Random Forest Classifier called <b>classRF_model</b>. To instantiate the classifier, use <b>RandomForest.trainClassifier</b>. We will pass in the following parameters:
<ul>
    <li>1st: The input data. In our case, we will use <b>classRF_train</b></li>
    <li>2nd: The number of classes. For this dataset, there will be 10 classes, so set <b>numClasses</b> equal to <b>10</b>
    <li>3rd: The categorical features info. For our dataset, have <b>categoricalFeaturesInfo</b> equal <b>{}</b></li>
    <li>4th: The number of trees. We will set <b>numTrees = 3</b>
    <li>5th: The feature Subset Strategy. There are various inputs for this parameter, but for the sake of this section we will set <b>featureSubsetStrategy</b> equal to <b>"auto"</b></li>
    <li>6th: The type of impurity. Since we're dealing with <b>Classification</b>, we will be have <b>impurity</b> set to <b>'gini'</b></li>
    <li>7th: The maximum depth of the tree. For now, set <b>maxDepth</b> to <b>5</b>, which is the default value</li>
    <li>8th: The maximum number of bins. For now, set <b>maxBins</b> to <b>32</b>, which is the default value</li>
    <li>9th: The seed to generate random data. For now, set <b>seed</b> to <b>None</b></li>
</ul> <br> <br>

We will also be timing how long it takes to create the model, so run <b>start = time.time()</b> before creating the model and <b>print(time.time()-start)</b> after the model has been created. <br>
<b>Note</b>: The timings differ on run and by computer, therefore some statements throughout the lab may not directly align with the results you get, which is okay! There are many factors that can affect the time output.

In [29]:
start = time.time()
classRF_model = RandomForest.trainClassifier(classRF_train, numClasses = 10, categoricalFeaturesInfo={},
                                           featureSubsetStrategy="auto", numTrees=3,
                                           impurity='gini', maxDepth=4, maxBins=32, seed=None)
print (time.time()-start)

1.26477193832


Next, we want to get the models prediction on the test data, which we will call <b>classRF_pred</b>. We will run <b>.predict</b> on classRF_model, passing in the testing data, <b>classRF_test</b> that is mapped using <b>.map</b> which maps the features using a lambda function (<b>lambda x: x.features</b>).

In [30]:
classRF_pred = classRF_model.predict(classRF_test.map(lambda x: x.features))

Now create a variable called <b>classRF_label_pred</b> which uses a <b>.map</b> on <b>classRF_test</b>. Pass <b>lambda l: l.label</b> into the mapping function. Outside of the mapping function, add a <b>.zip(classRF_pred)</b>. This will merge the label with the prediction</b> 

In [31]:
classRF_label_pred = classRF_test.map(lambda l: l.label).zip(classRF_pred)

Now we will calculate the Test Error for this prediction, which we will call <b>classRF_error</b>. This will equate to <b>classRF_label_pred.filter(lambda (v, p): v != p).count() / float(classRF_test.count())</b>, which will count the number of incorrectly predicted values and divide it by the total number of predictions.

In [32]:
classRF_error = classRF_label_pred.filter(lambda (v, p): v != p).count() / float(classRF_test.count())

Next, print out the test error value (<b>str(classRF_error)</b>, as well as the learned regression tree model (<b>classRF_model.toDebugString()</b>), so you have an idea of what the ensemble looks like.

In [33]:
print('Test Error = ' + str(classRF_error))
print('Learned classification tree model:' + classRF_model.toDebugString())

Test Error = 0.495921914694
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 4 <= 2.0)
     If (feature 5 <= 1.0)
      If (feature 7 <= 1.0)
       If (feature 3 <= 1.0)
        Predict: 3.0
       Else (feature 3 > 1.0)
        Predict: 1.0
      Else (feature 7 > 1.0)
       If (feature 3 <= 1.0)
        Predict: 1.0
       Else (feature 3 > 1.0)
        Predict: 0.0
     Else (feature 5 > 1.0)
      If (feature 9 <= 1.0)
       If (feature 3 <= 1.0)
        Predict: 1.0
       Else (feature 3 > 1.0)
        Predict: 0.0
      Else (feature 9 > 1.0)
       If (feature 7 <= 5.0)
        Predict: 0.0
       Else (feature 7 > 5.0)
        Predict: 0.0
    Else (feature 4 > 2.0)
     If (feature 0 <= 3.0)
      If (feature 5 <= 2.0)
       If (feature 1 <= 2.0)
        Predict: 1.0
       Else (feature 1 > 2.0)
        Predict: 0.0
      Else (feature 5 > 2.0)
       If (feature 1 <= 10.0)
        Predict: 0.0
       Else (feature 1 

Now that we've created the basic Classification Random Forest, let's start tuning some parameters! This is similar to the previous section, but since most of the tuning parameters have been covered in the Decision Tree section, there will only be two parameter to tune in this section. <br> <br>

Read over the code and understand how to build the Classification Random Forest as a whole. For the inputs, we have:
<ul>
    <li>1st: numTreesValue is the value for numTrees (Type: Int, Range: > 0, Default: 3)</li>
    <li>2nd: featureSubsetStrategyValue is the value for featureSubsetStrategyValue (Default: "auto")</li>
    <ul>
        <li>Values include: "auto", "all", "sqrt", "log2", "onethird"</li>
    </ul>
</ul>

In [34]:
def classRF_tuner(numTreesValue, featureSubsetStrategyValue):
    start = time.time()
    classRF_model = RandomForest.trainClassifier(classRF_train, numClasses = 10, categoricalFeaturesInfo={},
                                           featureSubsetStrategy=featureSubsetStrategyValue, numTrees=numTreesValue,
                                           impurity='gini', maxDepth=4, maxBins=32, seed=None)
    print (time.time()-start)

    classRF_pred = classRF_model.predict(classRF_test.map(lambda x: x.features))
    classRF_label_pred = classRF_test.map(lambda l: l.label).zip(classRF_pred)
    classRF_error = classRF_label_pred.filter(lambda (v, p): v != p).count() / float(classRF_test.count())
    
    print('Test Error = ' + str(classRF_error))
    print('Learned classification tree model:' + classRF_model.toDebugString())

Start off by re-creating the original Random Forest. That requires the input: <b>(3)</b> and <b>"auto"</b> into <b>classRF_tuner</b>

In [35]:
classRF_tuner(3, "auto")

1.17312097549
Test Error = 0.488166867228
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 9 <= 10.0)
     If (feature 3 <= 10.0)
      If (feature 7 <= 10.0)
       If (feature 5 <= 10.0)
        Predict: 1.0
       Else (feature 5 > 10.0)
        Predict: 0.0
      Else (feature 7 > 10.0)
       If (feature 1 <= 10.0)
        Predict: 0.0
       Else (feature 1 > 10.0)
        Predict: 1.0
     Else (feature 3 > 10.0)
      If (feature 7 <= 11.0)
       If (feature 3 <= 11.0)
        Predict: 0.0
       Else (feature 3 > 11.0)
        Predict: 0.0
      Else (feature 7 > 11.0)
       If (feature 8 <= 2.0)
        Predict: 1.0
       Else (feature 8 > 2.0)
        Predict: 1.0
    Else (feature 9 > 10.0)
     If (feature 1 <= 10.0)
      If (feature 2 <= 3.0)
       If (feature 8 <= 3.0)
        Predict: 0.0
       Else (feature 8 > 3.0)
        Predict: 0.0
      Else (feature 2 > 3.0)
       If (feature 3 <= 6.0)
        Predict:

### numTrees Parameter 
Let's start by tuning the <b>numTrees</b> parameter. Begin by setting it to a lower value, such as <b>1</b>

In [36]:
classRF_tuner(1, "auto")

1.13493800163
Test Error = 0.500334269287
Learned classification tree model:TreeEnsembleModel classifier with 1 trees

  Tree 0:
    If (feature 8 <= 2.0)
     If (feature 5 <= 1.0)
      If (feature 7 <= 1.0)
       If (feature 1 <= 9.0)
        Predict: 1.0
       Else (feature 1 > 9.0)
        Predict: 1.0
      Else (feature 7 > 1.0)
       If (feature 9 <= 1.0)
        Predict: 1.0
       Else (feature 9 > 1.0)
        Predict: 0.0
     Else (feature 5 > 1.0)
      If (feature 1 <= 10.0)
       If (feature 5 <= 9.0)
        Predict: 0.0
       Else (feature 5 > 9.0)
        Predict: 0.0
      Else (feature 1 > 10.0)
       If (feature 5 <= 10.0)
        Predict: 0.0
       Else (feature 5 > 10.0)
        Predict: 1.0
    Else (feature 8 > 2.0)
     If (feature 2 <= 2.0)
      If (feature 3 <= 2.0)
       If (feature 9 <= 5.0)
        Predict: 1.0
       Else (feature 9 > 5.0)
        Predict: 0.0
      Else (feature 3 > 2.0)
       If (feature 9 <= 2.0)
        Predict: 0.0
      

By setting numTrees to a value of 1, we see a slightly higher test error. Note that with numTrees equal to 1, the classifier acts as a Decision Tree, since there is only one tree in the ensemble.

Now let's try setting it to a numTrees to a larger value, such as 180. 

In [38]:
classRF_tuner(180, "auto")

4.18126797676
Test Error = 0.501671346437
Learned classification tree model:TreeEnsembleModel classifier with 180 trees

  Tree 0:
    If (feature 0 <= 3.0)
     If (feature 1 <= 9.0)
      If (feature 3 <= 10.0)
       If (feature 9 <= 10.0)
        Predict: 0.0
       Else (feature 9 > 10.0)
        Predict: 0.0
      Else (feature 3 > 10.0)
       If (feature 7 <= 10.0)
        Predict: 0.0
       Else (feature 7 > 10.0)
        Predict: 1.0
     Else (feature 1 > 9.0)
      If (feature 5 <= 9.0)
       If (feature 3 <= 7.0)
        Predict: 0.0
       Else (feature 3 > 7.0)
        Predict: 0.0
      Else (feature 5 > 9.0)
       If (feature 9 <= 7.0)
        Predict: 0.0
       Else (feature 9 > 7.0)
        Predict: 1.0
    Else (feature 0 > 3.0)
     If (feature 4 <= 2.0)
      If (feature 8 <= 3.0)
       If (feature 1 <= 6.0)
        Predict: 0.0
       Else (feature 1 > 6.0)
        Predict: 1.0
      Else (feature 8 > 3.0)
       If (feature 5 <= 7.0)
        Predict: 0.0
  

With a lot more trees in the ensemble, the training error has decreased a lot! But the training time has increased substantially as well. Remember that the training time increases roughly linearly with the number of trees.

### featureSubsetStrategy Parameter

Remember that the featureSubsetStrategy parameter only changes the number of features used as candidates for splitting. The default is set to <b>"auto"</b>, which will select "all", "sqrt", or "onethird" based on the value of numTrees. Since we are basing our analysis off of the default values, we have a numTrees value of 3, which means "sqrt" is selected. So let's start by changing it it <b>"all"</b>, which will use all of the features

In [39]:
classRF_tuner(3, "all")

1.15256810188
Test Error = 0.479342158043
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 6 <= 2.0)
     If (feature 3 <= 2.0)
      If (feature 5 <= 3.0)
       If (feature 9 <= 3.0)
        Predict: 1.0
       Else (feature 9 > 3.0)
        Predict: 1.0
      Else (feature 5 > 3.0)
       If (feature 1 <= 2.0)
        Predict: 1.0
       Else (feature 1 > 2.0)
        Predict: 0.0
     Else (feature 3 > 2.0)
      If (feature 3 <= 12.0)
       If (feature 7 <= 12.0)
        Predict: 0.0
       Else (feature 7 > 12.0)
        Predict: 0.0
      Else (feature 3 > 12.0)
       If (feature 9 <= 12.0)
        Predict: 0.0
       Else (feature 9 > 12.0)
        Predict: 1.0
    Else (feature 6 > 2.0)
     If (feature 7 <= 1.0)
      If (feature 9 <= 1.0)
       If (feature 3 <= 1.0)
        Predict: 3.0
       Else (feature 3 > 1.0)
        Predict: 1.0
      Else (feature 9 > 1.0)
       If (feature 5 <= 1.0)
        Predict: 1.0
    

We can see that there is a small increase in the building time of the model, which is expected since we are considering all of the features. As well, there is a small increase in the test error. A possibility to the increase in test error is that there are some features that aren't "good" in the model, causing an increase in the test error. Next, we will try with <b>"sqrt"</b>

In [40]:
classRF_tuner(3, "sqrt")

1.13319277763
Test Error = 0.475865757454
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 7 <= 5.0)
     If (feature 7 <= 2.0)
      If (feature 3 <= 2.0)
       If (feature 0 <= 3.0)
        Predict: 1.0
       Else (feature 0 > 3.0)
        Predict: 1.0
      Else (feature 3 > 2.0)
       If (feature 5 <= 2.0)
        Predict: 1.0
       Else (feature 5 > 2.0)
        Predict: 0.0
     Else (feature 7 > 2.0)
      If (feature 3 <= 5.0)
       If (feature 7 <= 4.0)
        Predict: 0.0
       Else (feature 7 > 4.0)
        Predict: 0.0
      Else (feature 3 > 5.0)
       If (feature 9 <= 2.0)
        Predict: 0.0
       Else (feature 9 > 2.0)
        Predict: 0.0
    Else (feature 7 > 5.0)
     If (feature 5 <= 5.0)
      If (feature 1 <= 5.0)
       If (feature 3 <= 5.0)
        Predict: 1.0
       Else (feature 3 > 5.0)
        Predict: 0.0
      Else (feature 1 > 5.0)
       If (feature 6 <= 1.0)
        Predict: 0.0
       Els

This has very similar values to the "auto", which is correct since "auto" is using "sqrt" for featureSubsetStrategy, since our numTrees value was set to 3. Let's try using "onethird" now, which uses one third of the features.

In [41]:
classRF_tuner(3, "onethird")

1.1675388813
Test Error = 0.465703971119
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 7 <= 8.0)
     If (feature 1 <= 8.0)
      If (feature 5 <= 8.0)
       If (feature 9 <= 8.0)
        Predict: 1.0
       Else (feature 9 > 8.0)
        Predict: 0.0
      Else (feature 5 > 8.0)
       If (feature 1 <= 1.0)
        Predict: 0.0
       Else (feature 1 > 1.0)
        Predict: 0.0
     Else (feature 1 > 8.0)
      If (feature 5 <= 10.0)
       If (feature 8 <= 2.0)
        Predict: 0.0
       Else (feature 8 > 2.0)
        Predict: 0.0
      Else (feature 5 > 10.0)
       If (feature 9 <= 8.0)
        Predict: 0.0
       Else (feature 9 > 8.0)
        Predict: 1.0
    Else (feature 7 > 8.0)
     If (feature 1 <= 8.0)
      If (feature 1 <= 4.0)
       If (feature 9 <= 12.0)
        Predict: 0.0
       Else (feature 9 > 12.0)
        Predict: 0.0
      Else (feature 1 > 4.0)
       If (feature 9 <= 4.0)
        Predict: 0.0
       

We see that the run-time is similar to the default, but the testing error has decreased a little bit. It's possible that there is about the same number of features when you take one third of them, as if you take the square root of them for this particular dataset. Let's try with the last type, which is <b>"log2"</b>

In [42]:
classRF_tuner(3, "log2")

1.12230682373
Test Error = 0.485893836074
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 5 <= 3.0)
     If (feature 3 <= 3.0)
      If (feature 1 <= 3.0)
       If (feature 7 <= 5.0)
        Predict: 1.0
       Else (feature 7 > 5.0)
        Predict: 1.0
      Else (feature 1 > 3.0)
       If (feature 7 <= 3.0)
        Predict: 1.0
       Else (feature 7 > 3.0)
        Predict: 0.0
     Else (feature 3 > 3.0)
      If (feature 8 <= 3.0)
       If (feature 7 <= 6.0)
        Predict: 1.0
       Else (feature 7 > 6.0)
        Predict: 0.0
      Else (feature 8 > 3.0)
       If (feature 1 <= 1.0)
        Predict: 1.0
       Else (feature 1 > 1.0)
        Predict: 0.0
    Else (feature 5 > 3.0)
     If (feature 3 <= 3.0)
      If (feature 7 <= 3.0)
       If (feature 9 <= 4.0)
        Predict: 1.0
       Else (feature 9 > 4.0)
        Predict: 0.0
      Else (feature 7 > 3.0)
       If (feature 9 <= 1.0)
        Predict: 0.0
       Els

When using <b>"log2"</b>, there is a decrease in run-time, along with testing error!

---

## RandomForest (Regression)

Now it's time for you to try it out for yourself! Build a Regression RandomForest in a similar way that the Classification RandomForest was built. Please note that you will be using the same dataset in this section (classRF_train, classRF_test), therefore you do not need to re-initialize that section.<br> <br> 

Try to only reference the above section when you are experiencing a lot of difficulty. This section is mainly for you to apply your learning.

For some help with the variables:
<ul>
    <li><b>categoricalFeaturesInfo</b>: Has a value of <b>{}</b> (parameter doesn't require tuning)</li>
    <li><b>featureSubsetStrategy</b>: Can change these values between <b>"auto"</b>, <b>"all"</b>, <b>"sqrt"</b>, <b>"log2"</b>, and <b>"onethird"</b></li>
    <li><b>numTrees</b>: Values range from <b>1</b> to infinity<i>(Default: 3)</i></li>
    <ul>
        <li>Note: If the value is too large, the system can run out of memory and not run.</li>
    </ul>
    <li><b>impurity</b>: For Regression, the value must be set to <b>'variance'</b> <i>(Default: 'variance')</i></li>
    <li><b>maxDepth</b>: Values range between <b>0 and 30</b> <i>(Default: 5)</i></li>
    <li><b>maxBins</b>: Value ranges between <b>2 and 2147483647</b> (largest value for 32-bits) <i>(Default: 32)</i></li>
    <li><b>seed</b> Can be set to any value, or to a value based on system time with <i>None</i> <i>(Default: None)</i></li>
</ul>

When displaying the <b>Mean Squared Error</b>, use the following formula and print statement instead of Training Error: <br>
<b>regRF_MSE = regRF_label_pred.map(lambda (v, p): (v - p)**2).sum() / float(classRF_test.count())</b> <br>
<b>print('Test Error = ' + str(regRF_MSE))</b>

### The Goal
Try to create a model that is better than the model with default values.

### Try to beat!
With some parameter tuning, I was able to get a run-time increase of the model by ~0.9 seconds and a Test error decrease of ~2.54. Try to get a value similar to this, or better.

### Note
We want a model that doesn't take too long to train and will cause overfitting. Remember that a very large model with high accuracy but long run time may not be good because the model may have overfit the data.

In [43]:
start = time.time()
regRF_model = RandomForest.trainRegressor(classRF_train, categoricalFeaturesInfo={},
                                    numTrees=14, featureSubsetStrategy="onethird",
                                    impurity='variance', maxDepth=11, maxBins=24, seed=None)
print(time.time() - start)
# Evaluate model on test instances and compute test error
regRF_pred = regRF_model.predict(classRF_train.map(lambda x: x.features))
regRF_label_pred = classRF_train.map(lambda lp: lp.label).zip(regRF_pred)
regRF_MSE = regRF_label_pred.map(lambda (v, p): (v - p) ** 2).sum()/\
                                   float(classRF_train.count())
print('Test Mean Squared Error = ' + str(regRF_MSE))
print('Learned regression forest model: ' + regRF_model.toDebugString())


3.78572702408
Test Mean Squared Error = 0.346675558881
Learned regression forest model: TreeEnsembleModel regressor with 14 trees

  Tree 0:
    If (feature 5 <= 8.0)
     If (feature 3 <= 8.0)
      If (feature 1 <= 8.0)
       If (feature 7 <= 8.0)
        If (feature 9 <= 8.0)
         If (feature 5 <= 6.0)
          If (feature 7 <= 6.0)
           If (feature 9 <= 6.0)
            If (feature 7 <= 2.0)
             If (feature 4 <= 1.0)
              If (feature 9 <= 3.0)
               Predict: 1.0
              Else (feature 9 > 3.0)
               Predict: 0.4166666666666667
             Else (feature 4 > 1.0)
              If (feature 1 <= 3.0)
               Predict: 1.8571428571428572
              Else (feature 1 > 3.0)
               Predict: 0.9083333333333333
            Else (feature 7 > 2.0)
             If (feature 3 <= 6.0)
              If (feature 9 <= 4.0)
               Predict: 1.3857142857142857
              Else (feature 9 > 4.0)
               Predict: 2.089