# <center>Machine Learning with Spark MLlib</center>
## <center>Decision Trees and Random Forests</center>
### <center>July 20, 2016</center>

<img src="https://ibm.box.com/shared/static/wfbduwkbx22nx3i2psbp9g27s2p9s86v.png" width="500" align="center">

## <b>Welcome to the third lab in the course, Machine Learning with Spark MLlib.</b>
### <b>Spark ships with several libraries, including MLlib (Machine Learning Library), which makes practical machine learning quick and easy to scale!</b>

In this lab exercise, you will learn how to create Classification and Regression DecisionTree and RandomForest Models, as well as how to tune the parameters for each to create more optimal trees and ensembles of trees.

### Some Notebook Commands
#### In case you haven't dealt with a Jupyter Notebook before, here are some useful keyboard shortcuts to get started. The single-letter shortcuts work in command mode (press Esc first).
<ul>
    <li>Run a cell: CTRL + ENTER</li>
    <li>Create a cell above a cell: a</li>
    <li>Create a cell below a cell: b</li>
    <li>Change a cell to Markdown: m</li>
    <li>Change a cell to code: y</li>
</ul>

<b> If you are interested in more keyboard shortcuts, go to Help -> Keyboard Shortcuts </b>

### How this lab will operate:
In this lab, you will be walked through a Regression DecisionTree model, including how to tune some of its parameters. Then, you will create a Classification DecisionTree model yourself. You will also be walked through a Classification RandomForest model and how to tune some of its parameters, after which you will create a Regression RandomForest model yourself.

## DecisionTree (Regression)

Import the following libraries:
<ul>
    <li>DecisionTree, DecisionTreeModel from pyspark.mllib.tree</li>
    <li>MLUtils from pyspark.mllib.util</li>
    <li>time</li>
</ul>

In [2]:
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
import time
from pyspark import SparkContext

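Next, we will load the <b>poker.txt</b> LibSVM file, a dataset based on poker hands. Use <b>MLUtils.loadLibSVMFile</b>, passing in the Spark context (<b>sc</b>) and the path to the file, and store the result in a variable called <b>regDT_data</b>. The cells below create <b>sc</b> and load <b>"poker.txt"</b> from the working directory; if your copy lives elsewhere (for example <b>'resources/poker.txt'</b>), adjust the path accordingly.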

In [3]:
sc = SparkContext.getOrCreate()  # reuse the notebook's context if one already exists

In [4]:
regDT_data = MLUtils.loadLibSVMFile(sc, "poker.txt")

Next, we need to split the data into a training dataset (called <b>regDT_train</b>) and testing dataset (called <b>regDT_test</b>). This will be done by running the <b>.randomSplit</b> function on <b>regDT_data</b>. The input into .randomSplit will be <b>[0.7, 0.3]</b>. <br> <br>

This will give us a training dataset containing 70% of the data, and a testing dataset containing 30% of the data.

In [5]:
regDT_train, regDT_test = regDT_data.randomSplit([0.7,0.3])
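Because <b>.randomSplit</b> assigns each record at random, the two datasets are only approximately 70% and 30% of the data, and they change on every run unless you pass a seed (for example <b>regDT_data.randomSplit([0.7, 0.3], seed=42)</b>). The pure-Python sketch below imitates that behaviour on a plain list so you can see why. Note that <b>random_split</b> is a hypothetical helper written for this illustration and is not part of Spark.

```python
import random

def random_split(records, weights, seed=None):
    """Assign each record to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding
    splits = [[] for _ in weights]
    for record in records:
        u = rng.random()  # uniform draw in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(record)
                break
    return splits

train, test = random_split(range(10000), [0.7, 0.3], seed=42)
print(len(train))
print(len(test))
```

The sizes come out close to 7000 and 3000 but rarely exactly so, and the same seed always reproduces the same split.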

Next, we need to create the Regression Decision Tree called <b>regDT_model</b>. To instantiate the regressor, use <b>DecisionTree.trainRegressor</b>. We will pass in the following parameters:
<ul>
    <li>1st: The input data. In our case, we will use <b>regDT_train</b></li>
    <li>2nd: The categorical features info. For our dataset, have <b>categoricalFeaturesInfo</b> equal <b>{}</b></li>
    <li>3rd: The type of impurity. Since we're dealing with <b>Regression</b>, we will set <b>impurity</b> to <b>'variance'</b></li>
    <li>4th: The maximum depth of the tree. For now, set <b>maxDepth</b> to <b>5</b>, which is the default value</li>
    <li>5th: The maximum number of bins. For now, set <b>maxBins</b> to <b>32</b>, which is the default value</li>
    <li>6th: The minimum instances required per node. For now, set <b>minInstancesPerNode</b> to <b>1</b>, which is the default value</li>
    <li>7th: The minimum required information gain per node. For now, set <b>minInfoGain</b> to <b>0.0</b>, which is the default value</li>
</ul> <br> <br>

We will also be timing how long it takes to create the model, so run <b>start = time.time()</b> before creating the model and <b>print(time.time()-start)</b> after the model has been created. <br>
<b>Note</b>: Timings differ from run to run and from computer to computer, so some statements throughout the lab may not align exactly with the results you get, which is okay! Many factors can affect the time output.

In [6]:
start = time.time()
regDT_model = DecisionTree.trainRegressor(regDT_train,categoricalFeaturesInfo={},
                            impurity='variance',
                            maxDepth=5,
                            maxBins=32,
                            minInstancesPerNode=1,
                            minInfoGain=0.0)
print(time.time() - start)

3.35649108887
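To get a feel for what the <b>'variance'</b> impurity measures, here is a small pure-Python sketch of how a candidate split can be scored for regression: the gain is the parent's variance minus the size-weighted variance of the two children. The helper names <b>variance</b> and <b>variance_gain</b> are hypothetical, and this is a simplified illustration rather than Spark's actual implementation.

```python
def variance(labels):
    """Population variance of a list of numeric labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    mean = sum(labels) / float(n)
    return sum((y - mean) ** 2 for y in labels) / n

def variance_gain(parent, left, right):
    """Impurity reduction obtained by splitting parent into left and right."""
    n = float(len(parent))
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted

# Toy labels, not taken from poker.txt
parent = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
left, right = [0.0, 0.0, 1.0], [1.0, 2.0, 2.0]
print(variance(parent))
print(variance_gain(parent, left, right))
```

A split that separates low labels from high labels lowers the weighted variance of the children, so it earns a positive gain.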


Next, we want to get the model's predictions on the test data, which we will call <b>regDT_pred</b>. We will run <b>.predict</b> on regDT_model, passing in the testing data <b>regDT_test</b> after using <b>.map</b> with a lambda function (<b>lambda x: x.features</b>) to keep only the features of each example.

In [7]:
regDT_pred = regDT_model.predict(regDT_test.map(lambda x: x.features))

Now create a variable called <b>regDT_label_pred</b> which uses a <b>.map</b> on <b>regDT_test</b>. Pass <b>lambda l: l.label</b> into the mapping function. Outside of the mapping function, add a <b>.zip(regDT_pred)</b>. This will pair each label with its prediction.

In [11]:
regDT_label_pred = regDT_test.map(lambda l: l.label).zip(regDT_pred)

Now we will calculate the Mean Squared Error for this prediction, which we will call <b>regDT_MSE</b>. This equates to <b>regDT_label_pred.map(lambda vp: (vp[0] - vp[1])**2).sum() / float(regDT_test.count())</b>, which takes the difference between each actual value and its predicted value, squares it, and sums the results; the sum is then divided by the number of values in the testing data. Each element of <b>regDT_label_pred</b> is a <b>(label, prediction)</b> pair, indexed here as <b>vp[0]</b> and <b>vp[1]</b> because tuple-unpacking lambdas such as <b>lambda (v, p): ...</b> are a syntax error in Python 3.

In [19]:
regDT_MSE = regDT_label_pred.map(lambda vp: (vp[0] - vp[1])**2).sum() / float(regDT_test.count())
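If the RDD expression feels dense, the same Mean Squared Error calculation can be sketched on an ordinary Python list of <b>(label, prediction)</b> pairs. The function name <b>mean_squared_error</b> and the toy pairs are made up for this illustration.

```python
def mean_squared_error(label_pred_pairs):
    """Average of the squared differences between labels and predictions."""
    pairs = list(label_pred_pairs)
    return sum((v - p) ** 2 for v, p in pairs) / float(len(pairs))

# Toy (label, prediction) pairs, not taken from the lab's data
pairs = [(0.0, 0.5), (1.0, 0.5), (2.0, 1.0), (0.0, 0.0)]
print(mean_squared_error(pairs))
```

The squared errors here are 0.25, 0.25, 1.0 and 0.0, so the result is 1.5 / 4.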

Next, print out the MSE value (<b>str(regDT_MSE)</b>), as well as the learned regression tree model (<b>regDT_model.toDebugString()</b>), so you have an idea of what the tree looks like.

In [20]:
print(str(regDT_MSE))
print(regDT_model.toDebugString())

0.582881410701
DecisionTreeModel regressor of depth 5 with 63 nodes
  If (feature 5 <= 4.5)
   If (feature 3 <= 4.5)
    If (feature 7 <= 4.5)
     If (feature 1 <= 4.5)
      If (feature 9 <= 5.5)
       Predict: 2.4
      Else (feature 9 > 5.5)
       Predict: 1.411764705882353
     Else (feature 1 > 4.5)
      If (feature 9 <= 4.5)
       Predict: 1.179245283018868
      Else (feature 9 > 4.5)
       Predict: 0.7937219730941704
    Else (feature 7 > 4.5)
     If (feature 9 <= 4.5)
      If (feature 1 <= 5.5)
       Predict: 1.1981981981981982
      Else (feature 1 > 5.5)
       Predict: 0.7638190954773869
     Else (feature 9 > 4.5)
      If (feature 1 <= 4.5)
       Predict: 0.78125
      Else (feature 1 > 4.5)
       Predict: 0.5045372050816697
   Else (feature 3 > 4.5)
    If (feature 9 <= 4.5)
     If (feature 7 <= 4.5)
      If (feature 1 <= 4.5)
       Predict: 1.3047619047619048
      Else (feature 1 > 4.5)
       Predict: 0.7537878787878788
     Else (feature 7 > 4.5)
      

In [21]:
def regDT_tuner(maxDepthValue, maxBinsValue, minInstancesValue, minInfoGainValue):
    # Train a regression tree on regDT_train with the given settings and time it
    start = time.time()
    regDT_model = DecisionTree.trainRegressor(regDT_train, categoricalFeaturesInfo={},
                                              impurity='variance', maxDepth=maxDepthValue, maxBins=maxBinsValue,
                                              minInstancesPerNode=minInstancesValue, minInfoGain=minInfoGainValue)
    print(time.time() - start)

    # Predict on the test features and pair each true label with its prediction
    regDT_pred = regDT_model.predict(regDT_test.map(lambda x: x.features))
    regDT_label_pred = regDT_test.map(lambda l: l.label).zip(regDT_pred)
    # Index the (label, prediction) pair so the lambda is valid in Python 2 and 3
    regDT_MSE = regDT_label_pred.map(lambda vp: (vp[0] - vp[1])**2).sum() / float(regDT_test.count())

    print('Test Mean Squared Error = ' + str(regDT_MSE))
    print('Learned Regression Tree Model: ' + regDT_model.toDebugString())

Start off by re-creating the original tree by passing the inputs <b>(5, 32, 1, 0.0)</b> into <b>regDT_tuner</b>.

In [22]:
regDT_tuner(5,32,1,0.0)

1.31900596619
Test Mean Squared Error = 0.582881410701
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 63 nodes
  If (feature 5 <= 4.5)
   If (feature 3 <= 4.5)
    If (feature 7 <= 4.5)
     If (feature 1 <= 4.5)
      If (feature 9 <= 5.5)
       Predict: 2.4
      Else (feature 9 > 5.5)
       Predict: 1.411764705882353
     Else (feature 1 > 4.5)
      If (feature 9 <= 4.5)
       Predict: 1.179245283018868
      Else (feature 9 > 4.5)
       Predict: 0.7937219730941704
    Else (feature 7 > 4.5)
     If (feature 9 <= 4.5)
      If (feature 1 <= 5.5)
       Predict: 1.1981981981981982
      Else (feature 1 > 5.5)
       Predict: 0.7638190954773869
     Else (feature 9 > 4.5)
      If (feature 1 <= 4.5)
       Predict: 0.78125
      Else (feature 1 > 4.5)
       Predict: 0.5045372050816697
   Else (feature 3 > 4.5)
    If (feature 9 <= 4.5)
     If (feature 7 <= 4.5)
      If (feature 1 <= 4.5)
       Predict: 1.3047619047619048
      Else (feature 1 > 4.5

Remember that when we are tuning a specific parameter, we keep the other parameters at their original values.

### maxDepth Parameter 
Let's start by tuning the <b>maxDepth</b> parameter. Begin by setting it to a lower value, such as <b>1</b>

In [23]:
regDT_tuner(1,32,1,0.0)

0.846998214722
Test Mean Squared Error = 0.624734400454
Learned Regression Tree Model: DecisionTreeModel regressor of depth 1 with 3 nodes
  If (feature 5 <= 4.5)
   Predict: 0.6421967455621301
  Else (feature 5 > 4.5)
   Predict: 0.6123986569486528



By decreasing the maxDepth parameter, you can see that the run-time decreased slightly and the tree became much smaller (3 nodes instead of 63). You may also see a slight increase in the error (here from about 0.58 to about 0.62), which is to be expected since the tree is too small to make accurate predictions.

Now try increasing the value of <b>maxDepth</b> to a large number, such as <b>30</b>, which is the maximum value.

In [24]:
regDT_tuner(30,32,1,0.0)

3.43908286095
Test Mean Squared Error = 1.10458665945
Learned Regression Tree Model: DecisionTreeModel regressor of depth 25 with 12919 nodes
  If (feature 5 <= 4.5)
   If (feature 3 <= 4.5)
    If (feature 7 <= 4.5)
     If (feature 1 <= 4.5)
      If (feature 9 <= 5.5)
       If (feature 2 <= 3.5)
        If (feature 0 <= 2.5)
         If (feature 8 <= 2.5)
          If (feature 6 <= 2.5)
           If (feature 0 <= 1.5)
            Predict: 6.0
           Else (feature 0 > 1.5)
            Predict: 8.0
          Else (feature 6 > 2.5)
           If (feature 0 <= 1.5)
            Predict: 4.0
           Else (feature 0 > 1.5)
            Predict: 2.0
         Else (feature 8 > 2.5)
          If (feature 7 <= 1.5)
           Predict: 6.0
          Else (feature 7 > 1.5)
           If (feature 9 <= 1.5)
            Predict: 3.0
           Else (feature 9 > 1.5)
            If (feature 0 <= 1.5)
             Predict: 1.0
            Else (feature 0 > 1.5)
             Predict: 2.0
     

In [29]:
regDT_tuner(3,32,1,0.0)

0.649353027344
Test Mean Squared Error = 0.612700146771
Learned Regression Tree Model: DecisionTreeModel regressor of depth 3 with 15 nodes
  If (feature 5 <= 4.5)
   If (feature 3 <= 4.5)
    If (feature 7 <= 4.5)
     Predict: 1.1269487750556793
    Else (feature 7 > 4.5)
     Predict: 0.680184331797235
   Else (feature 3 > 4.5)
    If (feature 9 <= 4.5)
     Predict: 0.6767422334172963
    Else (feature 9 > 4.5)
     Predict: 0.5303764442787924
  Else (feature 5 > 4.5)
   If (feature 7 <= 4.5)
    If (feature 3 <= 4.5)
     Predict: 0.6619483763530392
    Else (feature 3 > 4.5)
     Predict: 0.49204587495375507
   Else (feature 7 > 4.5)
    If (feature 3 <= 4.5)
     Predict: 0.5130597014925373
    Else (feature 3 > 4.5)
     Predict: 0.7069486404833837



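With a large value for maxDepth, the run-time increased greatly, along with the size of the tree: the model built with <b>maxDepth=30</b> stopped at depth 25 with 12919 nodes. Its MSE (about 1.10) is far higher than the original (about 0.58), which is due to the deep tree overfitting the training data. An intermediate value such as <b>3</b> gives a small tree of 15 nodes with an MSE (about 0.61) between those of depth 1 and depth 5.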
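The node counts reported in the outputs follow from simple arithmetic: a binary tree of depth d can hold at most 2^(d+1) - 1 nodes. The short sketch below prints that upper bound for the depths tried in this section; <b>max_nodes</b> is a hypothetical helper name.

```python
def max_nodes(depth):
    """Upper bound on the node count of a binary tree with the given depth."""
    return 2 ** (depth + 1) - 1

for depth in (1, 3, 5, 30):
    print(str(depth) + ' -> ' + str(max_nodes(depth)))
```

The trees of depth 1, 3 and 5 above are completely full (3, 15 and 63 nodes), while the run with a limit of 30 stopped long before its bound because the stopping rules ended the splitting first.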

### maxBins Parameter
Now let's tune the <b>maxBins</b> parameter. Start by decreasing the value to <b>2</b>, to see what the lower end of this value does to the tree.

In [30]:
regDT_tuner(5,2,1,0.0)

0.738765954971
Test Mean Squared Error = 0.583933151733
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 63 nodes
  If (feature 7 <= 6.5)
   If (feature 5 <= 6.5)
    If (feature 9 <= 6.5)
     If (feature 3 <= 7.5)
      If (feature 1 <= 7.5)
       Predict: 1.2930232558139534
      Else (feature 1 > 7.5)
       Predict: 0.7592592592592593
     Else (feature 3 > 7.5)
      If (feature 1 <= 7.5)
       Predict: 0.7374701670644391
      Else (feature 1 > 7.5)
       Predict: 0.583547557840617
    Else (feature 9 > 6.5)
     If (feature 3 <= 7.5)
      If (feature 1 <= 7.5)
       Predict: 0.7819420783645656
      Else (feature 1 > 7.5)
       Predict: 0.5286885245901639
     Else (feature 3 > 7.5)
      If (feature 8 <= 2.5)
       Predict: 0.5168986083499006
      Else (feature 8 > 2.5)
       Predict: 0.44176706827309237
   Else (feature 5 > 6.5)
    If (feature 0 <= 2.5)
     If (feature 4 <= 2.5)
      If (feature 8 <= 2.5)
       Predict: 0.507722007722007

Comparing this to the original tree, we can see a small decrease in the training time, but not much of a difference in MSE or in the size of the tree, although the split thresholds themselves have changed.

Now let's take a look at the upper end, with a value of <b>15000</b>.

In [31]:
regDT_tuner(5,15000,1,0.0)

0.744964838028
Test Mean Squared Error = 0.582881410701
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 63 nodes
  If (feature 5 <= 4.5)
   If (feature 3 <= 4.5)
    If (feature 7 <= 4.5)
     If (feature 1 <= 4.5)
      If (feature 9 <= 5.5)
       Predict: 2.4
      Else (feature 9 > 5.5)
       Predict: 1.411764705882353
     Else (feature 1 > 4.5)
      If (feature 9 <= 4.5)
       Predict: 1.179245283018868
      Else (feature 9 > 4.5)
       Predict: 0.7937219730941704
    Else (feature 7 > 4.5)
     If (feature 9 <= 4.5)
      If (feature 1 <= 5.5)
       Predict: 1.1981981981981982
      Else (feature 1 > 5.5)
       Predict: 0.7638190954773869
     Else (feature 9 > 4.5)
      If (feature 1 <= 4.5)
       Predict: 0.78125
      Else (feature 1 > 4.5)
       Predict: 0.5045372050816697
   Else (feature 3 > 4.5)
    If (feature 9 <= 4.5)
     If (feature 7 <= 4.5)
      If (feature 1 <= 4.5)
       Predict: 1.3047619047619048
      Else (feature 1 > 4.

With a very large maxBins value, we don't see much of a change in the overall time or in the MSE. The model still has the same depth and number of nodes, as expected.
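One way to picture why a huge maxBins value changes nothing here: a continuous feature is discretised into at most maxBins bins, which leaves at most maxBins - 1 candidate thresholds, and a feature with only a handful of distinct values cannot supply more thresholds than it has gaps between values. The sketch below is a simplified illustration of that idea, not Spark's actual binning algorithm. The helper <b>candidate_thresholds</b> is hypothetical, and the feature taking the integer values 1 to 13 (as a card rank would) is an assumption made for the example.

```python
def candidate_thresholds(values, max_bins):
    """Return at most max_bins - 1 split thresholds for one continuous feature."""
    distinct = sorted(set(values))
    # Midpoints between consecutive distinct values, e.g. 4.5 between 4 and 5
    midpoints = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    max_splits = max_bins - 1
    if len(midpoints) <= max_splits:
        return midpoints
    # Too many midpoints: keep max_splits of them, evenly spaced by position
    step = len(midpoints) / float(max_splits + 1)
    return [midpoints[int(step * (i + 1))] for i in range(max_splits)]

ranks = list(range(1, 14))  # assumed feature values 1..13
print(candidate_thresholds(ranks, 32))
print(candidate_thresholds(ranks, 15000) == candidate_thresholds(ranks, 32))
print(candidate_thresholds(ranks, 2))
```

With 32 or 15000 bins every gap gets its own threshold, so the candidates are identical; with 2 bins only a single threshold per feature remains.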

### minInstancesPerNode parameter
Next we will look at tuning the <b>minInstancesPerNode</b> parameter. It starts off at the lowest value of 1, but let's see what happens as we keep increasing it, starting with the value <b>100</b>.

In [32]:
regDT_tuner(5,32,100,0.0)

0.748516082764
Test Mean Squared Error = 0.586733905564
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 61 nodes
  If (feature 5 <= 4.5)
   If (feature 3 <= 4.5)
    If (feature 7 <= 4.5)
     If (feature 1 <= 4.5)
      Predict: 1.7
     Else (feature 1 > 4.5)
      If (feature 9 <= 4.5)
       Predict: 1.179245283018868
      Else (feature 9 > 4.5)
       Predict: 0.7937219730941704
    Else (feature 7 > 4.5)
     If (feature 9 <= 4.5)
      If (feature 1 <= 5.5)
       Predict: 1.1981981981981982
      Else (feature 1 > 5.5)
       Predict: 0.7638190954773869
     Else (feature 9 > 4.5)
      If (feature 1 <= 4.5)
       Predict: 0.78125
      Else (feature 1 > 4.5)
       Predict: 0.5045372050816697
   Else (feature 3 > 4.5)
    If (feature 9 <= 4.5)
     If (feature 7 <= 4.5)
      If (feature 1 <= 4.5)
       Predict: 1.3047619047619048
      Else (feature 1 > 4.5)
       Predict: 0.7537878787878788
     Else (feature 7 > 4.5)
      If (feature 1 <= 5.5

With minInstancesPerNode set to 100, we don't see much of a change in time and MSE, but we can see that there are fewer nodes in the tree (61 instead of 63). Now try a value of <b>1000</b>.

In [35]:
regDT_tuner(5,32,1000,0.0)

0.711501121521
Test Mean Squared Error = 0.60473386159
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 23 nodes
  If (feature 5 <= 4.5)
   If (feature 3 <= 4.5)
    Predict: 0.8109517601043025
   Else (feature 3 > 4.5)
    If (feature 9 <= 4.5)
     Predict: 0.6767422334172963
    Else (feature 9 > 4.5)
     If (feature 8 <= 2.5)
      Predict: 0.5618904726181545
     Else (feature 8 > 2.5)
      Predict: 0.49925925925925924
  Else (feature 5 > 4.5)
   If (feature 7 <= 4.5)
    If (feature 3 <= 4.5)
     Predict: 0.6619483763530392
    Else (feature 3 > 4.5)
     If (feature 1 <= 8.5)
      Predict: 0.48002421307506055
     Else (feature 1 > 8.5)
      Predict: 0.510941960038059
   Else (feature 7 > 4.5)
    If (feature 3 <= 4.5)
     If (feature 4 <= 2.5)
      Predict: 0.5338918507235338
     Else (feature 4 > 2.5)
      Predict: 0.493050475493782
    Else (feature 3 > 4.5)
     If (feature 1 <= 4.5)
      Predict: 0.5237837837837838
     Else (feature 1 > 

With a value of 1000, we may see more of a decrease in the time, but the MSE has also increased slightly. The number of nodes in the model has decreased once again, to 23. Let's take it one step further and try a value of <b>8000</b>.

In [36]:
regDT_tuner(5,32,8000,0.0)

0.595345973969
Test Mean Squared Error = 0.624312500997
Learned Regression Tree Model: DecisionTreeModel regressor of depth 1 with 3 nodes
  If (feature 7 <= 6.5)
   Predict: 0.6125743415463042
  Else (feature 7 > 6.5)
   Predict: 0.6294243070362473



With a value of 8000, the run-time to build the model decreases further, with only a small increase in MSE compared to when the value was set to 1000. The main difference is that the tree has become a lot smaller: it now has depth 1 and only 3 nodes! This is to be expected since we are tuning a stopping parameter, which determines when the model stops splitting.

### minInfoGain Parameter
For the last parameter, we will look at <b>minInfoGain</b>, which was initially set to 0.0. Negative values are accepted but have little effect, whereas the model is very sensitive to values greater than 0.0. Try setting the value to a low number, such as <b>-100.0</b>.

In [37]:
regDT_tuner(5,32,1,-100.0)

0.704613924026
Test Mean Squared Error = 0.582881410701
Learned Regression Tree Model: DecisionTreeModel regressor of depth 5 with 63 nodes
  If (feature 5 <= 4.5)
   If (feature 3 <= 4.5)
    If (feature 7 <= 4.5)
     If (feature 1 <= 4.5)
      If (feature 9 <= 5.5)
       Predict: 2.4
      Else (feature 9 > 5.5)
       Predict: 1.411764705882353
     Else (feature 1 > 4.5)
      If (feature 9 <= 4.5)
       Predict: 1.179245283018868
      Else (feature 9 > 4.5)
       Predict: 0.7937219730941704
    Else (feature 7 > 4.5)
     If (feature 9 <= 4.5)
      If (feature 1 <= 5.5)
       Predict: 1.1981981981981982
      Else (feature 1 > 5.5)
       Predict: 0.7638190954773869
     Else (feature 9 > 4.5)
      If (feature 1 <= 4.5)
       Predict: 0.78125
      Else (feature 1 > 4.5)
       Predict: 0.5045372050816697
   Else (feature 3 > 4.5)
    If (feature 9 <= 4.5)
     If (feature 7 <= 4.5)
      If (feature 1 <= 4.5)
       Predict: 1.3047619047619048
      Else (feature 1 > 4.

Overall, we don't see any real change: the MSE and the tree are the same as the original. Now try changing the value to <b>0.003</b>.

In [38]:
regDT_tuner(5,32,1,0.003)

0.537302970886
Test Mean Squared Error = 0.62440172626
Learned Regression Tree Model: DecisionTreeModel regressor of depth 0 with 1 nodes
  Predict: 0.6215449230943867



We can see that small values greater than zero can cause drastic changes in how the model looks. Here, we see a small decrease in the training time and a small increase in the MSE value, but the tree now contains only one node. The effect of this parameter on the tree is similar to that of minInstancesPerNode, since they are both stopping parameters.
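Both stopping parameters can be thought of as a check applied to every candidate split. The sketch below is a simplified illustration: <b>split_allowed</b> is a hypothetical helper, the instance counts and gain are made-up numbers, and the exact boundary handling inside Spark is an assumption.

```python
def split_allowed(left_count, right_count, gain,
                  min_instances_per_node=1, min_info_gain=0.0):
    """Return True if a candidate split passes both stopping rules."""
    # Each child must keep at least min_instances_per_node training examples
    if left_count < min_instances_per_node or right_count < min_instances_per_node:
        return False
    # The split must improve impurity by at least min_info_gain
    return gain >= min_info_gain

# Made-up split: 5000 examples go left, 9000 go right, gain of 0.002
print(split_allowed(5000, 9000, 0.002))
print(split_allowed(5000, 9000, 0.002, min_instances_per_node=8000))
print(split_allowed(5000, 9000, 0.002, min_info_gain=0.003))
print(split_allowed(5000, 9000, 0.002, min_info_gain=-100.0))
```

A gain is never negative, so a threshold of -100.0 rejects nothing, while a threshold slightly above the gains available in the data rejects every split and leaves a single-node tree.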

---

## DecisionTree (Classification)

Now it's time for you to try it out for yourself! Build a Classification DecisionTree in the same way that the Regression DecisionTree was built. Please note that you will be using the same datasets in this section (regDT_train, regDT_test), so you do not need to load or split the data again.<br> <br> 

Try to only reference the above section when you are experiencing a lot of difficulty. This section is mainly for you to apply your learning.

For some help with the variables:
<ul>
    <li><b>numClasses</b>: The number of classes for this dataset is <b>10</b> (parameter doesn't require tuning)</li>
    <li><b>categoricalFeaturesInfo</b>: Has a value of <b>{}</b> (parameter doesn't require tuning)</li>
    <li><b>impurity</b>: There are two types of impurity you can use, <b>'gini'</b> or <b>'entropy'</b> <i>(Default: 'gini')</i></li>
    <li><b>maxDepth</b>: Values range between <b>0 and 30</b> <i>(Default: 5)</i></li>
    <li><b>maxBins</b>: Values range between <b>2 and 2147483647</b> (the largest 32-bit signed integer) <i>(Default: 32)</i></li>
    <li><b>minInstancesPerNode</b>: Values range between <b>1 and 2147483647</b> <i>(Default: 1)</i></li>
    <li><b>minInfoGain</b>: Ensure it is a float (has a decimal in the value) <i>(Default: 0.0)</i></li>
</ul>
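To see how the two classification impurities differ, here is a small pure-Python sketch that computes both on toy label lists. The helper names <b>gini</b> and <b>entropy</b> are hypothetical and the labels are made up; this illustrates the standard formulas rather than Spark's internal code.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    if not labels:
        return 0.0
    n = float(len(labels))
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: sum over classes of -p * log2(p)."""
    if not labels:
        return 0.0
    n = float(len(labels))
    return sum(-(c / n) * math.log(c / n, 2) for c in Counter(labels).values())

pure = [0, 0, 0, 0]    # every example has the same class
mixed = [0, 0, 1, 1]   # two classes, evenly split
print(gini(pure))
print(entropy(pure))
print(gini(mixed))
print(entropy(mixed))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so either one rewards splits that separate the classes.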

When displaying the <b>Test Error</b>, use the following formula and print statement instead of MSE: <br>
<b>classDT_error = classDT_label_pred.filter(lambda vp: vp[0] != vp[1]).count() / float(regDT_test.count())</b> <br>
<b>print('Test Error = ' + str(classDT_error))</b>
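The test error is simply the fraction of test examples whose predicted class differs from the true label. The same calculation on a plain Python list looks like this; <b>error_rate</b> and the toy pairs are hypothetical and only serve as an illustration.

```python
def error_rate(label_pred_pairs):
    """Fraction of (label, prediction) pairs where the two values differ."""
    pairs = list(label_pred_pairs)
    wrong = sum(1 for v, p in pairs if v != p)
    return wrong / float(len(pairs))

# Toy (label, prediction) pairs, not taken from the lab's data
pairs = [(0.0, 0.0), (1.0, 0.0), (2.0, 2.0), (1.0, 1.0), (0.0, 1.0)]
print(error_rate(pairs))
```

Two of the five pairs disagree, so the error is 0.4.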

### The Goal
Try to create a model that is better than the model with default values. Challenge yourself by trying to create the best model you can!


### Note
We want a model that doesn't take too long to train and doesn't overfit. Remember that a very large model with high accuracy but a long run-time may not be good, because the model may have overfit the training data.

In [113]:
def classDT_tuner(impurityValue, maxDepthValue, maxBinsValue, minInstancesValue, minInfoGainValue):
    # regDT_train and regDT_test (created earlier) are used as the data
    start = time.time()
    classDT_model = DecisionTree.trainClassifier(regDT_train,
                                                 numClasses=10,
                                                 categoricalFeaturesInfo={},
                                                 impurity=impurityValue,
                                                 maxDepth=maxDepthValue,
                                                 maxBins=maxBinsValue,
                                                 minInstancesPerNode=minInstancesValue,
                                                 minInfoGain=minInfoGainValue)
    print(time.time() - start)

    classDT_pred = classDT_model.predict(regDT_test.map(lambda x: x.features))
    classDT_label_pred = regDT_test.map(lambda l: l.label).zip(classDT_pred)
    # Fraction of test examples whose predicted class differs from the label
    classDT_error = classDT_label_pred.filter(lambda vp: vp[0] != vp[1]).count() / float(regDT_test.count())

    print('Test Error = ' + str(classDT_error))
    print('Learned Classification Tree Model: ' + classDT_model.toDebugString())

In [139]:
classDT_tuner('gini',maxBinsValue=3,maxDepthValue=15,minInfoGainValue=0.0,minInstancesValue=80)

0.831645011902
Test Error = 0.4606954404
Learned Classification Tree Model: DecisionTreeModel classifier of depth 11 with 297 nodes
  If (feature 6 <= 1.5)
   If (feature 4 <= 1.5)
    If (feature 0 <= 1.5)
     If (feature 1 <= 4.5)
      Predict: 0.0
     Else (feature 1 > 4.5)
      Predict: 0.0
    Else (feature 0 > 1.5)
     If (feature 8 <= 1.5)
      Predict: 0.0
     Else (feature 8 > 1.5)
      If (feature 3 <= 4.5)
       Predict: 1.0
      Else (feature 3 > 4.5)
       If (feature 1 <= 4.5)
        Predict: 0.0
       Else (feature 1 > 4.5)
        If (feature 9 <= 4.5)
         Predict: 0.0
        Else (feature 9 > 4.5)
         If (feature 3 <= 9.5)
          Predict: 1.0
         Else (feature 3 > 9.5)
          Predict: 0.0
   Else (feature 4 > 1.5)
    If (feature 2 <= 1.5)
     If (feature 9 <= 9.5)
      If (feature 3 <= 9.5)
       If (feature 5 <= 9.5)
        If (feature 4 <= 3.5)
         If (feature 3 <= 4.5)
          Predict: 1.0
         Else (feature 3 > 4.5)
  

---

## RandomForest (Classifier)

Now that we've run through the DecisionTree model, let's work with RandomForests. The process will be similar to the DecisionTree section.

Import the following libraries:
<ul>
    <li>RandomForest, RandomForestModel from pyspark.mllib.tree</li>
    <li>MLUtils from pyspark.mllib.util</li>
    <li>time</li>
</ul>

In [141]:
import time
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

Next, we will load the <b>pendigits.txt</b> LibSVM file, a dataset based on Pen-Based Recognition of Handwritten Digits. Use <b>MLUtils.loadLibSVMFile</b>, passing in the Spark context (<b>sc</b>) and the path to the file, and store the result in a variable called <b>classRF_data</b>. The cell below loads <b>"pendigits.txt"</b> from the working directory; if your copy lives elsewhere (for example <b>'resources/pendigits.txt'</b>), adjust the path accordingly. <br> <br>

Note: You can also try out this section with the poker.txt dataset if you want to compare results from both sections!

In [142]:
classRF_data = MLUtils.loadLibSVMFile(sc, "pendigits.txt")

Next, we need to split the data into a training dataset (called <b>classRF_train</b>) and testing dataset (called <b>classRF_test</b>). This will be done by running the <b>.randomSplit</b> function on <b>classRF_data</b>. The input into .randomSplit will be <b>[0.7, 0.3]</b>. <br> <br>

This will give us a training dataset containing 70% of the data, and a testing dataset containing 30% of the data.

In [144]:
classRF_train, classRF_test = classRF_data.randomSplit([0.7,0.3])

Next, we need to create the Random Forest Classifier called <b>classRF_model</b>. To instantiate the classifier, use <b>RandomForest.trainClassifier</b>. We will pass in the following parameters:
<ul>
    <li>1st: The input data. In our case, we will use <b>classRF_train</b></li>
    <li>2nd: The number of classes. For this dataset, there are 10 classes, so set <b>numClasses</b> equal to <b>10</b></li>
    <li>3rd: The categorical features info. For our dataset, have <b>categoricalFeaturesInfo</b> equal <b>{}</b></li>
    <li>4th: The number of trees. We will set <b>numTrees = 3</b></li>
    <li>5th: The feature subset strategy. There are various inputs for this parameter, but for the sake of this section we will set <b>featureSubsetStrategy</b> equal to <b>"auto"</b></li>
    <li>6th: The type of impurity. Since we're dealing with <b>Classification</b>, we will set <b>impurity</b> to <b>'gini'</b></li>
    <li>7th: The maximum depth of each tree. For now, set <b>maxDepth</b> to <b>5</b> (the default for <b>RandomForest.trainClassifier</b> is <b>4</b>)</li>
    <li>8th: The maximum number of bins. For now, set <b>maxBins</b> to <b>32</b>, which is the default value</li>
    <li>9th: The random seed used for bootstrapping and choosing feature subsets. For now, set <b>seed</b> to <b>None</b>, which generates a seed from the system time</li>
</ul> <br> <br>

We will also be timing how long it takes to create the model, so run <b>start = time.time()</b> before creating the model and <b>print(time.time()-start)</b> after the model has been created. <br>
<b>Note</b>: Timings differ from run to run and from computer to computer, so some statements throughout the lab may not align exactly with the results you get, which is okay! Many factors can affect the time output.

In [147]:
start = time.time()
classRF_model = RandomForest.trainClassifier(classRF_train,
                                             numClasses=10,
                                             categoricalFeaturesInfo={},
                                             numTrees=3,
                                             featureSubsetStrategy="auto",
                                             impurity='gini',
                                             maxDepth=5,
                                             maxBins=32,
                                             seed=None)
print(time.time() - start)

0.524893045425
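A random forest classifier combines its trees by majority vote: every tree predicts a class and the most common class wins. The sketch below shows that idea on made-up votes from three trees. The helper <b>forest_predict</b> is hypothetical, and breaking ties by the smallest label is an assumption made for this illustration, not necessarily what Spark does.

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Return the class predicted by the most trees (ties go to the smallest label)."""
    votes = Counter(tree_predictions)
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)

# Made-up predictions from three trees for two different examples
print(forest_predict([5.0, 8.0, 5.0]))
print(forest_predict([1.0, 9.0, 0.0]))
```

With only three trees a single disagreeing tree is outvoted, but a three-way disagreement has no clear winner, which is one reason larger values of numTrees tend to give more stable predictions.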


Next, we want to get the model's predictions on the test data, which we will call <b>classRF_pred</b>. We will run <b>.predict</b> on classRF_model, passing in the testing data <b>classRF_test</b> after using <b>.map</b> with a lambda function (<b>lambda x: x.features</b>) to keep only the features of each example.

In [148]:
classRF_pred = classRF_model.predict(classRF_test.map(lambda x: x.features))

Now create a variable called <b>classRF_label_pred</b> which uses a <b>.map</b> on <b>classRF_test</b>. Pass <b>lambda l: l.label</b> into the mapping function. Outside of the mapping function, add a <b>.zip(classRF_pred)</b>. This will pair each label with its prediction.

In [149]:
classRF_label_pred = classRF_test.map(lambda l: l.label).zip(classRF_pred)

Now we will calculate the Test Error for this prediction, which we will call <b>classRF_error</b>. This equates to <b>classRF_label_pred.filter(lambda vp: vp[0] != vp[1]).count() / float(classRF_test.count())</b>, which counts the number of incorrectly predicted values and divides it by the total number of predictions. Each element is a <b>(label, prediction)</b> pair, indexed as <b>vp[0]</b> and <b>vp[1]</b> so that the lambda is valid in both Python 2 and Python 3.

In [150]:
classRF_error = classRF_label_pred.filter(lambda vp: vp[0] != vp[1]).count() / float(classRF_test.count())

Next, print out the test error value (<b>str(classRF_error)</b>), as well as the learned random forest model (<b>classRF_model.toDebugString()</b>), so you have an idea of what the ensemble looks like.

In [152]:
print(str(classRF_error))
print(classRF_model.toDebugString())

0.137511271416
TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 0 <= 2.5)
     If (feature 3 <= 89.5)
      If (feature 9 <= 50.5)
       If (feature 8 <= 25.5)
        If (feature 12 <= 59.5)
         Predict: 5.0
        Else (feature 12 > 59.5)
         Predict: 8.0
       Else (feature 8 > 25.5)
        If (feature 6 <= 45.5)
         Predict: 8.0
        Else (feature 6 > 45.5)
         Predict: 1.0
      Else (feature 9 > 50.5)
       If (feature 13 <= 36.5)
        If (feature 6 <= 48.5)
         Predict: 9.0
        Else (feature 6 > 48.5)
         Predict: 1.0
       Else (feature 13 > 36.5)
        If (feature 2 <= 19.5)
         Predict: 0.0
        Else (feature 2 > 19.5)
         Predict: 9.0
     Else (feature 3 > 89.5)
      If (feature 10 <= 99.5)
       If (feature 2 <= 34.5)
        If (feature 11 <= 22.5)
         Predict: 2.0
        Else (feature 11 > 22.5)
         Predict: 7.0
       Else (feature 2 > 34.5)
        If (feature 9 <= 33.5)
     

Now that we've created the basic Classification Random Forest, let's start tuning some parameters! This is similar to the previous section, but since most of the tuning parameters have been covered in the Decision Tree section, there will only be two parameters to tune in this section. <br> <br>

Read over the code and understand how to build the Classification Random Forest as a whole. For the inputs, we have:
<ul>
    <li>1st: numTreesValue is the value for numTrees (Type: Int, Range: > 0; the model above used 3)</li>
    <li>2nd: featureSubsetStrategyValue is the value for featureSubsetStrategy (Default: "auto")
        <ul>
            <li>Values include: "auto", "all", "sqrt", "log2", "onethird"</li>
        </ul>
    </li>
</ul>

In [153]:
def classRF_tuner(numTreesValue, featureSubsetStrategyValue):
    start = time.time()
    classRF_model = RandomForest.trainClassifier(classRF_train, numClasses = 10, categoricalFeaturesInfo={},
                                           featureSubsetStrategy=featureSubsetStrategyValue, numTrees=numTreesValue,
                                           impurity='gini', maxDepth=4, maxBins=32, seed=None)
    print (time.time()-start)

    classRF_pred = classRF_model.predict(classRF_test.map(lambda x: x.features))
    classRF_label_pred = classRF_test.map(lambda l: l.label).zip(classRF_pred)
    # vp is a (label, prediction) tuple; indexing works in both Python 2 and Python 3
    classRF_error = classRF_label_pred.filter(lambda vp: vp[0] != vp[1]).count() / float(classRF_test.count())
    
    print('Test Error = ' + str(classRF_error))
    print('Learned classification tree model:' + classRF_model.toDebugString())

Start off by re-creating the original Random Forest by passing <b>3</b> and <b>"auto"</b> to <b>classRF_tuner</b>.

In [154]:
classRF_tuner(3, "auto")

0.614810943604
Test Error = 0.268259693417
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 13 <= 60.5)
     If (feature 5 <= 63.5)
      If (feature 14 <= 27.5)
       If (feature 8 <= 78.5)
        Predict: 6.0
       Else (feature 8 > 78.5)
        Predict: 4.0
      Else (feature 14 > 27.5)
       If (feature 15 <= 0.5)
        Predict: 4.0
       Else (feature 15 > 0.5)
        Predict: 7.0
     Else (feature 5 > 63.5)
      If (feature 3 <= 89.5)
       If (feature 6 <= 72.5)
        Predict: 9.0
       Else (feature 6 > 72.5)
        Predict: 1.0
      Else (feature 3 > 89.5)
       If (feature 13 <= 18.5)
        Predict: 2.0
       Else (feature 13 > 18.5)
        Predict: 7.0
    Else (feature 13 > 60.5)
     If (feature 4 <= 37.5)
      If (feature 10 <= 54.5)
       If (feature 3 <= 86.5)
        Predict: 5.0
       Else (feature 3 > 86.5)
        Predict: 8.0
      Else (feature 10 > 54.5)
       If (feature 12 <= 99.5)

### numTrees Parameter 
Let's start by tuning the <b>numTrees</b> parameter. Begin by setting it to a lower value, such as <b>1</b>

In [155]:
classRF_tuner(1, "auto")

0.527186870575
Test Error = 0.23219116321
Learned classification tree model:TreeEnsembleModel classifier with 1 trees

  Tree 0:
    If (feature 15 <= 20.5)
     If (feature 10 <= 39.5)
      If (feature 3 <= 92.5)
       If (feature 12 <= 55.5)
        Predict: 1.0
       Else (feature 12 > 55.5)
        Predict: 0.0
      Else (feature 3 > 92.5)
       If (feature 12 <= 26.5)
        Predict: 1.0
       Else (feature 12 > 26.5)
        Predict: 2.0
     Else (feature 10 > 39.5)
      If (feature 5 <= 66.5)
       If (feature 9 <= 10.5)
        Predict: 6.0
       Else (feature 9 > 10.5)
        Predict: 4.0
      Else (feature 5 > 66.5)
       If (feature 9 <= 58.5)
        Predict: 3.0
       Else (feature 9 > 58.5)
        Predict: 9.0
    Else (feature 15 > 20.5)
     If (feature 13 <= 60.5)
      If (feature 0 <= 31.5)
       If (feature 1 <= 71.5)
        Predict: 8.0
       Else (feature 1 > 71.5)
        Predict: 7.0
      Else (feature 0 > 31.5)
       If (feature 1 <= 93.5)


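With numTrees set to 1, the test error (0.232) is higher than the original model's 0.138, although it is close to the 0.268 from the re-run above. Because seed=None, results vary from run to run. Note that with numTrees equal to 1, the classifier acts as a Decision Tree, since there is only one tree in the ensemble.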
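
The point about a one-tree forest behaving like a single decision tree follows from how a classification forest combines its trees: each tree casts one vote and the most common class wins. Here is a minimal sketch of that voting step in plain Python (the function name `majority_vote` and the sample predictions are illustrative, not part of the MLlib API):

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a list holding the single (label, count) winner
    return counts.most_common(1)[0][0]

# With one tree, the forest's answer is just that tree's answer
print(majority_vote([8.0]))

# With three trees, two votes for 8.0 outweigh one vote for 1.0
print(majority_vote([8.0, 1.0, 8.0]))
```

With more trees, a single badly split tree is more likely to be outvoted, which is one reason larger ensembles tend to give steadier predictions.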

Now let's try setting numTrees to a larger value, such as 180.

In [156]:
classRF_tuner(180, "auto")

1.94899487495
Test Error = 0.142921550947
Learned classification tree model:TreeEnsembleModel classifier with 180 trees

  Tree 0:
    If (feature 4 <= 42.5)
     If (feature 13 <= 60.5)
      If (feature 7 <= 56.5)
       If (feature 7 <= 31.5)
        Predict: 6.0
       Else (feature 7 > 31.5)
        Predict: 4.0
      Else (feature 7 > 56.5)
       If (feature 9 <= 45.5)
        Predict: 1.0
       Else (feature 9 > 45.5)
        Predict: 9.0
     Else (feature 13 > 60.5)
      If (feature 10 <= 61.5)
       If (feature 0 <= 43.5)
        Predict: 5.0
       Else (feature 0 > 43.5)
        Predict: 8.0
      Else (feature 10 > 61.5)
       If (feature 6 <= 2.5)
        Predict: 0.0
       Else (feature 6 > 2.5)
        Predict: 0.0
    Else (feature 4 > 42.5)
     If (feature 13 <= 18.5)
      If (feature 14 <= 55.5)
       If (feature 8 <= 53.5)
        Predict: 5.0
       Else (feature 8 > 53.5)
        Predict: 3.0
      Else (feature 14 > 55.5)
       If (feature 9 <= 48.5)
  

With a lot more trees in the ensemble, the test error has decreased a lot (0.143, compared with 0.268 for three trees)! But the training time has increased substantially as well (about 1.9 seconds, compared with 0.6 seconds). Remember that the training time increases roughly linearly with the number of trees.

### featureSubsetStrategy Parameter

Remember that the featureSubsetStrategy parameter only changes the number of features used as candidates for splitting. The default is <b>"auto"</b>, which selects "all" when numTrees is 1 and, for more than one tree, "sqrt" for classification or "onethird" for regression. Since we are basing our analysis on the default values, we have a numTrees value of 3 in a classification forest, which means "sqrt" is selected. So let's start by changing it to <b>"all"</b>, which will use all of the features.

In [157]:
classRF_tuner(3, "all")

0.484307050705
Test Error = 0.220468890893
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 15 <= 20.5)
     If (feature 10 <= 39.5)
      If (feature 12 <= 26.5)
       If (feature 6 <= 78.5)
        Predict: 1.0
       Else (feature 6 > 78.5)
        Predict: 1.0
      Else (feature 12 > 26.5)
       If (feature 3 <= 89.5)
        Predict: 1.0
       Else (feature 3 > 89.5)
        Predict: 2.0
     Else (feature 10 > 39.5)
      If (feature 5 <= 63.5)
       If (feature 9 <= 10.5)
        Predict: 6.0
       Else (feature 9 > 10.5)
        Predict: 4.0
      Else (feature 5 > 63.5)
       If (feature 9 <= 54.5)
        Predict: 3.0
       Else (feature 9 > 54.5)
        Predict: 9.0
    Else (feature 15 > 20.5)
     If (feature 13 <= 60.5)
      If (feature 0 <= 31.5)
       If (feature 1 <= 77.5)
        Predict: 8.0
       Else (feature 1 > 77.5)
        Predict: 7.0
      Else (feature 0 > 31.5)
       If (feature 15 <= 50.5)


Considering all of the features at every split generally increases the build time of the model, since more candidate splits are evaluated. The test error here (0.220) is higher than the original model's 0.138. One possible reason is that some features aren't "good" for the model, and always considering them hurts the trees. Keep in mind that seed=None, so timings and errors vary from run to run (the "auto" re-run above took 0.61 seconds and scored 0.268). Next, we will try with <b>"sqrt"</b>.

In [158]:
classRF_tuner(3, "sqrt")

0.466526985168
Test Error = 0.215509467989
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 13 <= 60.5)
     If (feature 15 <= 20.5)
      If (feature 11 <= 31.5)
       If (feature 9 <= 33.5)
        Predict: 2.0
       Else (feature 9 > 33.5)
        Predict: 3.0
      Else (feature 11 > 31.5)
       If (feature 1 <= 99.5)
        Predict: 1.0
       Else (feature 1 > 99.5)
        Predict: 4.0
     Else (feature 15 > 20.5)
      If (feature 0 <= 31.5)
       If (feature 7 <= 74.5)
        Predict: 7.0
       Else (feature 7 > 74.5)
        Predict: 8.0
      Else (feature 0 > 31.5)
       If (feature 10 <= 30.5)
        Predict: 8.0
       Else (feature 10 > 30.5)
        Predict: 6.0
    Else (feature 13 > 60.5)
     If (feature 9 <= 45.5)
      If (feature 8 <= 57.5)
       If (feature 10 <= 30.5)
        Predict: 5.0
       Else (feature 10 > 30.5)
        Predict: 8.0
      Else (feature 8 > 57.5)
       If (feature 7 <= 18.5

These values are in the same range as the "auto" run, which is expected since "auto" uses "sqrt" for featureSubsetStrategy when numTrees is 3 in a classification forest. The remaining difference comes from the random seed. Let's try using "onethird" now, which uses one third of the features.

In [159]:
classRF_tuner(3, "onethird")

0.451642990112
Test Error = 0.284490532011
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 5 <= 63.5)
     If (feature 15 <= 0.5)
      If (feature 13 <= 14.5)
       If (feature 0 <= 55.5)
        Predict: 2.0
       Else (feature 0 > 55.5)
        Predict: 9.0
      Else (feature 13 > 14.5)
       If (feature 9 <= 10.5)
        Predict: 6.0
       Else (feature 9 > 10.5)
        Predict: 4.0
     Else (feature 15 > 0.5)
      If (feature 8 <= 64.5)
       If (feature 15 <= 63.5)
        Predict: 6.0
       Else (feature 15 > 63.5)
        Predict: 5.0
      Else (feature 8 > 64.5)
       If (feature 7 <= 44.5)
        Predict: 0.0
       Else (feature 7 > 44.5)
        Predict: 9.0
    Else (feature 5 > 63.5)
     If (feature 15 <= 26.5)
      If (feature 14 <= 86.5)
       If (feature 9 <= 58.5)
        Predict: 3.0
       Else (feature 9 > 58.5)
        Predict: 9.0
      Else (feature 14 > 86.5)
       If (feature 3 <= 86.5)
 

We see that the run-time is similar to the default, but in this run the test error (0.284) is a little higher than with "auto" (0.268) or "sqrt" (0.216). With seed=None, differences of this size can come from run-to-run variation. It's also possible that taking one third of the features gives a count close to taking their square root for this particular dataset. Let's try with the last type, which is <b>"log2"</b>.

In [160]:
classRF_tuner(3, "log2")

0.428912162781
Test Error = 0.229486023445
Learned classification tree model:TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 3 <= 95.5)
     If (feature 4 <= 37.5)
      If (feature 7 <= 12.5)
       If (feature 8 <= 25.5)
        Predict: 8.0
       Else (feature 8 > 25.5)
        Predict: 0.0
      Else (feature 7 > 12.5)
       If (feature 8 <= 68.5)
        Predict: 6.0
       Else (feature 8 > 68.5)
        Predict: 4.0
     Else (feature 4 > 37.5)
      If (feature 15 <= 13.5)
       If (feature 6 <= 64.5)
        Predict: 9.0
       Else (feature 6 > 64.5)
        Predict: 1.0
      Else (feature 15 > 13.5)
       If (feature 15 <= 90.5)
        Predict: 8.0
       Else (feature 15 > 90.5)
        Predict: 5.0
    Else (feature 3 > 95.5)
     If (feature 8 <= 45.5)
      If (feature 15 <= 20.5)
       If (feature 14 <= 55.5)
        Predict: 5.0
       Else (feature 14 > 55.5)
        Predict: 2.0
      Else (feature 15 > 20.5)
       If (feature 13 <= 60.5)

When using <b>"log2"</b>, both the run-time (0.43 seconds) and the test error (0.229) are lower than in the "auto" run (0.61 seconds, 0.268)!
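
To compare the strategies tried above, it helps to see how many candidate features each one considers per split. The sketch below is an assumption-laden illustration: it assumes 16 features (the tree printouts reference features 0 through 15), and the rounding rules reflect my understanding of Spark's implementation, so check them against your Spark version. The function name `num_candidate_features` is hypothetical.

```python
import math

def num_candidate_features(strategy, num_features, num_trees=3, classification=True):
    """Approximate how many features are candidates at each split."""
    if strategy == "auto":
        # "auto" resolves to another strategy based on numTrees and the task
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if classification else "onethird"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == "log2":
        # (n - 1).bit_length() equals ceil(log2(n)) for integers n >= 1
        return max(1, (num_features - 1).bit_length())
    if strategy == "onethird":
        return int(math.ceil(num_features / 3.0))
    raise ValueError("unknown featureSubsetStrategy: " + strategy)

# Assumed feature count, inferred from the feature indices in the tree output
for name in ["auto", "all", "sqrt", "log2", "onethird"]:
    print(name + ": " + str(num_candidate_features(name, 16)))
```

Under these assumptions, "sqrt" and "log2" both consider 4 of the 16 features and "onethird" considers 6, so differences between those runs are likely dominated by the random seed.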

---

## RandomForest (Regression)

Now it's time for you to try it out for yourself! Build a Regression RandomForest in the same way the Classification RandomForest was built. Note that you will use the same dataset in this section (classRF_train, classRF_test), so you do not need to load or split the data again.<br> <br>

Try to only reference the above section when you are experiencing a lot of difficulty. This section is mainly for you to apply your learning.

For some help with the variables:
<ul>
    <li><b>categoricalFeaturesInfo</b>: Has a value of <b>{}</b> (parameter doesn't require tuning)</li>
    <li><b>featureSubsetStrategy</b>: Can change these values between <b>"auto"</b>, <b>"all"</b>, <b>"sqrt"</b>, <b>"log2"</b>, and <b>"onethird"</b></li>
    <li><b>numTrees</b>: Values range from <b>1</b> to infinity <i>(Default: 3)</i></li>
    <ul>
        <li>Note: If the value is too large, the system can run out of memory and not run.</li>
    </ul>
    <li><b>impurity</b>: For Regression, the value must be set to <b>'variance'</b> <i>(Default: 'variance')</i></li>
    <li><b>maxDepth</b>: Values range between <b>0 and 30</b> <i>(Default: 4)</i></li>
    <li><b>maxBins</b>: Value ranges between <b>2 and 2147483647</b> (largest value for 32-bits) <i>(Default: 32)</i></li>
    <li><b>seed</b>: Can be set to any value, or to a value based on system time with <i>None</i> <i>(Default: None)</i></li>
</ul>

When displaying the <b>Mean Squared Error</b>, use the following formula and print statement instead of the classification test error: <br>
<b>regRF_MSE = regRF_label_pred.map(lambda vp: (vp[0] - vp[1])**2).sum() / float(classRF_test.count())</b> <br>
<b>print('Test Error = ' + str(regRF_MSE))</b>
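
To see what the formula above computes, here is a minimal sketch of the same calculation on a plain Python list rather than an RDD. The helper name `mean_squared_error` and the sample values are made up for illustration.

```python
def mean_squared_error(label_pred_pairs):
    """Average of the squared differences between labels and predictions."""
    pairs = list(label_pred_pairs)
    if not pairs:
        raise ValueError("need at least one (label, prediction) pair")
    squared_errors = [(pair[0] - pair[1]) ** 2 for pair in pairs]
    return sum(squared_errors) / float(len(pairs))

# Hypothetical (label, prediction) pairs, not taken from the lab's dataset
sample_pairs = [(3.0, 2.5), (0.0, 1.0), (7.0, 7.0), (4.0, 6.0)]

# Squared errors are 0.25, 1.0, 0.0 and 4.0, so the mean is 5.25 / 4
print('Test Error = ' + str(mean_squared_error(sample_pairs)))
```

Unlike the classification error, this value is not bounded by 1: a prediction that is off by 2 classes contributes 4 to the sum, so large misses are penalized much more than small ones.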

### The Goal
Try to create a model that is better than the model with default values.

### Try to beat!
With some parameter tuning, I increased the model's training time by ~0.9 seconds and decreased the test error by ~2.54. Try to get a similar result, or better.

### Note
We want a model that doesn't take too long to train and doesn't overfit. Remember that a very large model with high accuracy but a long run time may not be good, because the model may have overfit the data.

In [264]:
# Reuses classRF_train and classRF_test as is; only the model and the metric change.
# Note: this redefines classRF_tuner from the classification section.
def classRF_tuner(numTreesValue, featureSubsetStrategyValue, maxDepthValue, maxBinsValue):
    start = time.time()
    regRF_model = RandomForest.trainRegressor(classRF_train, categoricalFeaturesInfo={},
                                           featureSubsetStrategy=featureSubsetStrategyValue, numTrees=numTreesValue,
                                           impurity='variance', maxDepth=maxDepthValue, maxBins=maxBinsValue, seed=None)
    model_time = time.time()-start

    regRF_pred = regRF_model.predict(classRF_test.map(lambda x: x.features))
    regRF_label_pred = classRF_test.map(lambda l: l.label).zip(regRF_pred)
    # Mean squared error over the test set; vp is a (label, prediction) tuple
    regRF_MSE = regRF_label_pred.map(lambda vp: (vp[0] - vp[1])**2).sum() / float(classRF_test.count())

    # Return the MSE as a number (not a string) so results can be compared directly
    return (regRF_MSE, model_time)

In [278]:
# Grid over numTrees (1 to 79) and every featureSubsetStrategy,
# with maxDepth=10 and maxBins=20 held fixed
results = []
timing_array = []

for numOfTrees in range(1,80):
    for strategy in ["auto", "all", "log2", "sqrt", "onethird"]:
        result, timing = classRF_tuner(numOfTrees,strategy,10,20)
        results.append(result)
        timing_array.append(timing)

0.265739238925
370
0.501277923584
0
0.501277923584
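
The loop above stores the errors and timings but not the parameters that produced them, which makes it hard to tell which configuration won. One possible way to keep them together is sketched below. `fake_tuner` is a hypothetical stand-in with made-up numbers so the example runs without Spark; in the lab you would call the real tuner instead, and convert the MSE with float() if it is stored as a string.

```python
def fake_tuner(num_trees, strategy):
    """Hypothetical stand-in for the real tuner; returns (mse, seconds)."""
    penalty = {"auto": 0.2, "all": 0.0, "log2": 0.3, "sqrt": 0.3, "onethird": 0.2}[strategy]
    mse = 4.0 / num_trees + penalty   # made-up error that shrinks with more trees
    seconds = 0.05 * num_trees        # made-up time that grows with more trees
    return mse, seconds

runs = []
for num_trees in (1, 5, 10):
    for strategy in ("auto", "all", "log2", "sqrt", "onethird"):
        mse, seconds = fake_tuner(num_trees, strategy)
        # Keep the parameters next to the results they produced
        runs.append({"numTrees": num_trees, "strategy": strategy,
                     "mse": mse, "seconds": seconds})

# Pick the run with the lowest mean squared error
best = min(runs, key=lambda run: run["mse"])
print("best: numTrees=%d, strategy=%s, mse=%.3f, seconds=%.2f"
      % (best["numTrees"], best["strategy"], best["mse"], best["seconds"]))
```

Since the goal also mentions training time, you could instead sort by error and then inspect the timings of the top few runs before choosing.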
