# H-1B Visa Approvals (2011-2017): Part 4


## Spark MLlib machine learning on an AWS EC2 Instance
  
  
**AMOD-5410H: Big Data**   
**Winter 2018**  
**Nicholas Hopewell - 0496633**

One unique thing about MLlib is that you need to reduce your data down to two columns for supervised learning (Features, Labels) and one column for unsupervised learning (Features). This means all columns containing features will eventually be packed into this one feature column: each entry in the feature column is a single vector holding that row's values from the original columns.

MLlib requires more preprocessing than libraries like scikit-learn, but this format allows it to work with distributed data.
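
To make the two-column layout concrete, here is a minimal plain-Python sketch of the idea. It is not the MLlib API, and the column names and values are made up for illustration: each row's feature columns are packed into one list, which sits next to the label.

```python
# Illustrative rows only; the column names and values are hypothetical.
rows = [
    {"wage": 65000.0, "full_time": 1, "label": 1},
    {"wage": 42000.0, "full_time": 0, "label": 0},
]
feature_cols = ["wage", "full_time"]


def to_features_and_label(row, feature_cols, label_col):
    """Pack the chosen feature columns into one list and pair it with the label."""
    features = [float(row[c]) for c in feature_cols]
    return {"Features": features, "Label": row[label_col]}


# One 'Features' entry and one 'Label' entry per row, as MLlib expects conceptually.
mllib_style = [to_features_and_label(r, feature_cols, "label") for r in rows]
```

In Spark, the same packing is done in a distributed way rather than with a Python loop.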

Link to the documentation: https://spark.apache.org/docs/latest/ml-guide.html  
Key link in the document - look at this in detail: https://spark.apache.org/docs/latest/ml-features.html

### Tree-Based Methods

In this notebook, I want to explore and explain tree-based methods. In particular, I want to go over decision trees and their extensions (random forests and gradient boosted trees).

In [1]:
import findspark
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')
import pyspark
# start spark session
from pyspark.sql import SparkSession

In [2]:
# create session
spark = SparkSession.builder.appName('tree_models').getOrCreate()

Note that in my preprocessing notebook I modified the data to transform the problem from a 4-class classification problem to a binary classification problem. That is why I am importing the binary evaluator below. See that notebook for the reasoning behind my decision to make this a binary classification problem.


In [3]:
# evaluation metrics
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [4]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import (DecisionTreeClassifier, 
                                       RandomForestClassifier, 
                                       GBTClassifier)

Note: for a regression problem, you can import the same models as regression trees from pyspark.ml.regression instead of pyspark.ml.classification. Each class name ends in ...Regressor instead of ...Classifier (for example, DecisionTreeRegressor). In classification, ensembles take votes on the class label for each record. In regression, ensembles take averages to predict outcomes.
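
The difference between voting and averaging can be sketched in a few lines of plain Python. This is a conceptual illustration with made-up per-tree predictions, not how MLlib implements its ensembles; it matches the random forest idea most closely, since gradient boosted trees combine their trees additively rather than by a simple vote.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Classification: return the class label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Regression: return the mean of the individual tree outputs."""
    return mean(tree_predictions)


# Hypothetical predictions from individual trees for a single record.
votes = [1, 0, 1, 1, 0]
estimates = [10.0, 12.0, 14.0]

predicted_class = majority_vote(votes)            # class 1 wins three votes to two
predicted_value = average_prediction(estimates)   # mean of the three estimates
```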


In [5]:
# read the csv file
new_h1b_data = spark.read.csv("/home/ubuntu/csv/updated_2017_data.csv",inferSchema=True,header=True)

Looking at the schema is very important before trying to use MLlib. To transform the data into a format MLlib can use, the transformer method requires the data to be numeric, bool, or vector type. I explain this and link to the documentation just a bit further down this notebook (see VectorAssembler documentation). 

In [6]:
new_h1b_data.printSchema()

root
 |-- CASE_STATUS: integer (nullable = true)
 |-- TOTAL_WORKERS: integer (nullable = true)
 |-- NEW_EMPLOYMENT: integer (nullable = true)
 |-- CONTINUED_EMPLOYMENT: integer (nullable = true)
 |-- CHANGE_PREVIOUS_EMPLOYMENT: integer (nullable = true)
 |-- NEW_CONCURRENT_EMPLOYMENT: integer (nullable = true)
 |-- CHANGE_EMPLOYER: integer (nullable = true)
 |-- AMENDED_PETITION: integer (nullable = true)
 |-- PREVAILING_WAGE: double (nullable = true)
 |-- WAGE_RATE_OF_PAY_FROM: double (nullable = true)
 |-- FULL_TIME_POSITION_N: integer (nullable = true)
 |-- FULL_TIME_POSITION_Y: integer (nullable = true)
 |-- PW_SOURCE_CBA: integer (nullable = true)
 |-- PW_SOURCE_DBA: integer (nullable = true)
 |-- PW_SOURCE_OES: integer (nullable = true)
 |-- PW_SOURCE_Other: integer (nullable = true)
 |-- PW_SOURCE_SCA: integer (nullable = true)
 |-- WAGE_RATE_OF_PAY_TO_N: integer (nullable = true)
 |-- WAGE_RATE_OF_PAY_TO_Y: integer (nullable = true)
 |-- H1B_DEPENDENT_N: integer (nullable = tru

This data set has quite a few columns. Recall that these columns do need to be in a specific format for MLlib. Because I am doing supervised learning, I need to have two columns in the end: one 'Features' column where each element is an array of column values, and one 'Labels' column where the class labels are contained.

Here is a list of columns (to copy and paste later):


In [7]:
new_h1b_data.columns

['CASE_STATUS',
 'TOTAL_WORKERS',
 'NEW_EMPLOYMENT',
 'CONTINUED_EMPLOYMENT',
 'CHANGE_PREVIOUS_EMPLOYMENT',
 'NEW_CONCURRENT_EMPLOYMENT',
 'CHANGE_EMPLOYER',
 'AMENDED_PETITION',
 'PREVAILING_WAGE',
 'WAGE_RATE_OF_PAY_FROM',
 'FULL_TIME_POSITION_N',
 'FULL_TIME_POSITION_Y',
 'PW_SOURCE_CBA',
 'PW_SOURCE_DBA',
 'PW_SOURCE_OES',
 'PW_SOURCE_Other',
 'PW_SOURCE_SCA',
 'WAGE_RATE_OF_PAY_TO_N',
 'WAGE_RATE_OF_PAY_TO_Y',
 'H1B_DEPENDENT_N',
 'H1B_DEPENDENT_Y',
 'WILLFUL_VIOLATOR_N',
 'WILLFUL_VIOLATOR_Y']

In [8]:
from pyspark.ml.feature import VectorAssembler

The feature columns need to be passed into VectorAssembler. I start by listing the input columns I want as predictor variables and naming the output column 'Features':

In [9]:
# call VecAssem and initialize in and out
vec_assem = VectorAssembler(inputCols = ['TOTAL_WORKERS',
                                         'NEW_EMPLOYMENT',
                                         'CONTINUED_EMPLOYMENT',
                                         'CHANGE_PREVIOUS_EMPLOYMENT',
                                         'NEW_CONCURRENT_EMPLOYMENT',
                                         'CHANGE_EMPLOYER',
                                         'AMENDED_PETITION',
                                         'PREVAILING_WAGE',
                                         'WAGE_RATE_OF_PAY_FROM',
                                         'FULL_TIME_POSITION_N',
                                         'FULL_TIME_POSITION_Y',
                                         'PW_SOURCE_CBA',
                                         'PW_SOURCE_DBA',
                                         'PW_SOURCE_OES',
                                         'PW_SOURCE_Other',
                                         'PW_SOURCE_SCA',
                                         'WAGE_RATE_OF_PAY_TO_N',
                                         'WAGE_RATE_OF_PAY_TO_Y',
                                         'H1B_DEPENDENT_N',
                                         'H1B_DEPENDENT_Y',
                                         'WILLFUL_VIOLATOR_N',
                                         'WILLFUL_VIOLATOR_Y'], outputCol = 'Features')

Here, I am using every column because during the preprocessing phase I already selected only the columns I needed for classification.

Below, I call the assembler declared above on my data frame, which adds the 'Features' column in the format MLlib requires.

In [10]:
# call vec_assem
transformed_data = vec_assem.transform(new_h1b_data)

If you get an error here, it is most likely because one of your columns is of a type not supported by VectorAssembler. See the documentation here: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler - notice the data can be any numeric type, bool, or vector type. If you have strings, like I originally did, make sure to one-hot encode the nominal data into numeric representations. Pure strings (not representing nominal categories) will not be appropriate. However, you can often turn strings into categories by grouping things like job titles into a few categories (such as 'tech') and then one-hot encoding those categories. Similarly, if you think thresholds of numeric data matter, such as high vs moderate vs low earners, you can bin the data into a number of categories and pass those new values instead.

I did the encoding during preprocessing. To handle string columns within MLlib, you could start with StringIndexer, which maps each distinct string to a numeric index (a one-hot step such as OneHotEncoder can follow if needed). I provide example code below:
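
As a rough illustration of binning followed by one-hot encoding, here is a minimal plain-Python sketch. The wage cutoffs and band names are hypothetical and were not used in my preprocessing; the point is only to show the shape of the transformation.

```python
def wage_band(wage, low_cutoff=50000.0, high_cutoff=100000.0):
    """Bin a numeric wage into 'low', 'moderate', or 'high' (cutoffs are hypothetical)."""
    if wage < low_cutoff:
        return "low"
    if wage < high_cutoff:
        return "moderate"
    return "high"


def one_hot(value, categories):
    """Return a 0/1 indicator list with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


bands = ["low", "moderate", "high"]

# Each wage becomes three numeric indicator columns, which VectorAssembler can accept.
encoded = [one_hot(wage_band(w), bands) for w in (42000.0, 65000.0, 120000.0)]
```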

In [None]:
# from pyspark.ml.feature import StringIndexer
# string_encode = StringIndexer(inputCol = 'string col you want to encode', outputCol = 'name of new column')
# cols_corrected = string_encode.fit(transformed_data).transform(transformed_data)
# cols_corrected.printSchema()  # now it should be correct
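
For intuition about what string indexing produces, here is a small plain-Python sketch. It mimics StringIndexer's default behaviour of giving the most frequent string the index 0.0; the alphabetical tie-breaking and the sample job categories are my own assumptions for illustration, not something taken from Spark.

```python
from collections import Counter


def string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that rule is an assumption of this sketch.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {v: float(i) for i, v in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


# Hypothetical job categories.
jobs = ["tech", "finance", "tech", "health", "tech", "finance"]
indexed, mapping = string_index(jobs)
```

Note that these indices imply an ordering that nominal categories do not have, which is why a one-hot step often follows.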

Now if I print the schema of the transformed data, notice the new 'Features' vector at the bottom.

In [11]:
transformed_data.printSchema()

root
 |-- CASE_STATUS: integer (nullable = true)
 |-- TOTAL_WORKERS: integer (nullable = true)
 |-- NEW_EMPLOYMENT: integer (nullable = true)
 |-- CONTINUED_EMPLOYMENT: integer (nullable = true)
 |-- CHANGE_PREVIOUS_EMPLOYMENT: integer (nullable = true)
 |-- NEW_CONCURRENT_EMPLOYMENT: integer (nullable = true)
 |-- CHANGE_EMPLOYER: integer (nullable = true)
 |-- AMENDED_PETITION: integer (nullable = true)
 |-- PREVAILING_WAGE: double (nullable = true)
 |-- WAGE_RATE_OF_PAY_FROM: double (nullable = true)
 |-- FULL_TIME_POSITION_N: integer (nullable = true)
 |-- FULL_TIME_POSITION_Y: integer (nullable = true)
 |-- PW_SOURCE_CBA: integer (nullable = true)
 |-- PW_SOURCE_DBA: integer (nullable = true)
 |-- PW_SOURCE_OES: integer (nullable = true)
 |-- PW_SOURCE_Other: integer (nullable = true)
 |-- PW_SOURCE_SCA: integer (nullable = true)
 |-- WAGE_RATE_OF_PAY_TO_N: integer (nullable = true)
 |-- WAGE_RATE_OF_PAY_TO_Y: integer (nullable = true)
 |-- H1B_DEPENDENT_N: integer (nullable = tru

The only two columns I need now are the Features vector and the target column (CASE_STATUS).

In [12]:
# MLlib ready data
h1b_data = transformed_data.select('Features', 'CASE_STATUS')

Now, these data can be split into training and testing data:

*note: I am using an 80-20 split here.

In [13]:
train_data,test_data = h1b_data.randomSplit([0.8, 0.2]) 
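
One thing to keep in mind is that a random split assigns rows at random, so the 80-20 proportions are approximate, and the split changes between runs unless a seed is fixed (randomSplit accepts an optional seed argument). The plain-Python sketch below illustrates the idea only; it is not Spark's implementation, and the function name is hypothetical.

```python
import random


def random_split(rows, train_fraction=0.8, seed=None):
    """Send each row to train with probability `train_fraction`, otherwise to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))

# The same seed reproduces the same split; the sizes are close to 800/200 but not exact.
train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)
```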

Now I can assign the three classification models to some local objects:

In [14]:
# decision tree, rand forest, g boosted
dtc = DecisionTreeClassifier(labelCol='CASE_STATUS', featuresCol='Features')
rfc = RandomForestClassifier(labelCol='CASE_STATUS', featuresCol='Features')
gbt = GBTClassifier(labelCol='CASE_STATUS', featuresCol='Features')

Now I can fit these models to the training data:

*note: this will take time, especially on a micro instance. Three models are being fit in one Jupyter cell. The data is not terribly large, but I am only on a single micro instance. On a larger cluster, expect the time to be cut down.

In fact, there is a very good chance you will get an error reading something like "Py4JNetworkError("Answer from Java side is empty")". This usually means the Java process behind Spark stopped responding, most likely because the computation needed more memory than the instance has. Training a model is the computationally intensive part, not testing.
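
If memory is indeed the problem, one possible workaround is to fit on a random subset of the training data first. The helper below is a hypothetical sketch, not something I ran in this notebook: it assumes the training data exposes Spark's DataFrame.sample(withReplacement, fraction, seed) method and that the estimator exposes .fit(dataset), as the MLlib classifiers do.

```python
def fit_on_sample(estimator, train_data, fraction=0.1, seed=42):
    """Fit `estimator` on a random subset of `train_data` to reduce memory pressure.

    Assumes `train_data` has Spark's DataFrame.sample(withReplacement, fraction, seed)
    and `estimator` has .fit(dataset). The default fraction and seed are arbitrary.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    subset = train_data.sample(False, fraction, seed)  # sample without replacement
    return estimator.fit(subset)


# Hypothetical usage with the objects defined in this notebook:
# random_forest = fit_on_sample(rfc, train_data, fraction=0.1)
```

A model trained on a sample will generally be less accurate than one trained on all of the data, so this is a trade-off rather than a fix.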

In [15]:
decision_tree = dtc.fit(train_data)

In [19]:
random_forest = rfc.fit(train_data)

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1035, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
    response = connection.send_command(command)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37014)
Traceback (most recent call last):
  File "/home

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37014)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-19-1be1847e8aea>", line 1, in <module>
    random_forest = rfc.fit(train_data)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 236, in _fit
    java_model = self._fit_java(dataset)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 233, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  Fil

Py4JError: An error occurred while calling o54.fit

In [15]:
# fit to training data
#decision_tree = dtc.fit(train_data)
#random_forest = rfc.fit(train_data)
#boosted_tree = gbt.fit(train_data)

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1035, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
    response = connection.send_command(command)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37757)
Traceback (most recent call last):
  File "/home

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37757)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-c81ba6aafce2>", line 3, in <module>
    random_forest = rfc.fit(train_data)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 236, in _fit
    java_model = self._fit_java(dataset)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 233, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  Fil

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37757)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-c81ba6aafce2>", line 3, in <module>
    random_forest = rfc.fit(train_data)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 236, in _fit
    java_model = self._fit_java(dataset)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 233, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  Fil

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37757)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-c81ba6aafce2>", line 3, in <module>
    random_forest = rfc.fit(train_data)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 236, in _fit
    java_model = self._fit_java(dataset)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 233, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  Fil

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37757)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-c81ba6aafce2>", line 3, in <module>
    random_forest = rfc.fit(train_data)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 236, in _fit
    java_model = self._fit_java(dataset)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 233, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/home/ubuntu/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  Fil

Py4JError: An error occurred while calling o54.fit

**More complex models are simply not running on my micro EC2 instance; the error above is a memory allocation failure. I am going to try using a Zeppelin notebook running on AWS Elastic MapReduce (EMR).**
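Before moving to a larger cluster, one option is to shrink the forest itself. The snippet below is only a sketch under assumptions: the parameter values are illustrative rather than tuned, `featuresCol='features'` is an assumed name for the assembled feature column, and `build_light_forest` is a hypothetical helper, not something used elsewhere in this notebook. There is no guarantee that a smaller forest will fit in memory on a micro instance.

```python
# Illustrative, smaller-than-default settings for a random forest.
# The values are assumptions for demonstration, not tuned results.
LIGHT_FOREST_PARAMS = {
    'labelCol': 'CASE_STATUS',   # label column used in this notebook
    'featuresCol': 'features',   # ASSUMED name of the assembled feature column
    'numTrees': 10,              # fewer trees than the default of 20
    'maxDepth': 4,               # shallower trees than the default of 5
    'subsamplingRate': 0.5,      # each tree trains on about half of the rows
}


def build_light_forest(params=None):
    """Hypothetical helper: create a RandomForestClassifier from a dict of settings.

    pyspark is imported inside the function so the settings can be
    inspected even when no Spark session is running.
    """
    from pyspark.ml.classification import RandomForestClassifier
    settings = dict(LIGHT_FOREST_PARAMS if params is None else params)
    return RandomForestClassifier(**settings)
```

With a Spark session running, this could be used as `random_forest = build_light_forest().fit(train_data)` in place of the failing cell.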

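Now the models can be compared on the test data through another transformation. Because only the decision tree finished training, the lines for the random forest and gradient boosted trees are commented out.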

In [16]:
# apply to test
dtc_preds = decision_tree.transform(test_data)
#rfc_preds = random_forest.transform(test_data)
#gbt_preds = boosted_tree.transform(test_data)

Now for the evaluation metrics to see how well the models fit the data:

In [17]:
binary_eval = BinaryClassificationEvaluator(labelCol = 'CASE_STATUS')

In [18]:
print('Decision Tree fit:')
print(binary_eval.evaluate(dtc_preds))

Decision Tree fit:
0.6015052489170445


This is pretty low, only a little better than guessing (the evaluator's default metric is area under the ROC curve, where 0.5 corresponds to random guessing). Keep in mind that a single decision tree has limited predictive power compared with many modern models. This number does, however, tell me that my data are not easily separable: if the decision tree had fit the data very well with default parameters, I would conclude the data are easily separable.
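An area-under-ROC score can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The pure-Python sketch below illustrates that reading on made-up toy data. It is not how Spark computes the metric internally, and `auc_by_ranking` is a hypothetical helper written only for this illustration.

```python
def auc_by_ranking(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError('need at least one example of each class')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up toy data (not from the H-1B data set)
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2]
toy_auc = auc_by_ranking(toy_labels, toy_scores)  # 6 of 9 pairs ranked correctly
```

A model that assigns every case the same score gets 0.5 under this definition, and a model that ranks every positive above every negative gets 1.0, which is why a score near 0.60 reads as only modestly better than chance.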