## Spark ML examples
Adapted from [this PySpark and MLlib notebook](https://github.com/susanli2016/PySpark-and-MLlib/blob/master/Machine%20Learning%20PySpark%20and%20MLlib.ipynb).

In [22]:
from pyspark.sql import SparkSession

## Load the data

In [3]:
s3_path = "s3n://pg-sample-spark/bank-full.csv"

In [23]:
spark = SparkSession.builder.appName('ml-bank').getOrCreate()
df = spark.read.option("delimiter", ";").csv(s3_path, header = True, inferSchema = True)
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- y: string (nullable = true)

## What does the data look like?

In [5]:
df.take(5)

[Row(age=58, job='management', marital='married', education='tertiary', default='no', balance=2143, housing='yes', loan='no', contact='unknown', day=5, month='may', duration=261, campaign=1, pdays=-1, previous=0, poutcome='unknown', y='no'), Row(age=44, job='technician', marital='single', education='secondary', default='no', balance=29, housing='yes', loan='no', contact='unknown', day=5, month='may', duration=151, campaign=1, pdays=-1, previous=0, poutcome='unknown', y='no'), Row(age=33, job='entrepreneur', marital='married', education='secondary', default='no', balance=2, housing='yes', loan='yes', contact='unknown', day=5, month='may', duration=76, campaign=1, pdays=-1, previous=0, poutcome='unknown', y='no'), Row(age=47, job='blue-collar', marital='married', education='unknown', default='no', balance=1506, housing='yes', loan='no', contact='unknown', day=5, month='may', duration=92, campaign=1, pdays=-1, previous=0, poutcome='unknown', y='no'), Row(age=33, job='unknown', marital='si

## Set up pipeline for data analysis and transformation

The code below is adapted from Databricks' official site. It indexes each categorical column with a StringIndexer, then converts the indexed categories into one-hot encoded vectors, which are appended to each row as new columns. A second StringIndexer encodes the label column `y` as numeric label indices.

Next, we use the VectorAssembler to combine all the feature columns into a single vector column.
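
To make the encoding concrete, here is a minimal plain-Python sketch of what the indexing and one-hot steps compute for a single column. It is an illustration, not Spark's implementation: the helpers `string_index` and `one_hot` and the sample values are hypothetical, and the sketch assumes Spark's default behaviour of ordering labels by descending frequency and dropping the last category from the one-hot vector.

```python
from collections import Counter

def string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Alphabetical tie-breaking is an assumption made here for determinism.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 vector; the last category becomes all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Made-up sample of a categorical column.
marital = ["married", "single", "married", "divorced", "married", "single"]
mapping = string_index(marital)
encoded = [one_hot(mapping[m], len(mapping)) for m in marital]

print(mapping)
print(encoded[:4])
```

With three categories, each row gets a two-element vector; the least frequent category is represented by all zeros, which avoids a redundant column.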

In [6]:
numeric_features = [t[0] for t in df.dtypes if t[1] == 'int']

In [7]:
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
categoricalColumns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome']
stages = []
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
label_stringIdx = StringIndexer(inputCol = 'y', outputCol = 'label')
stages += [label_stringIdx]
numericCols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

In [8]:
print(stages)

[StringIndexer_45778a9ff40f, OneHotEncoderEstimator_bafbe8034fe3, StringIndexer_e43228c98aa2, OneHotEncoderEstimator_47494f12d620, StringIndexer_d81f6a95b8d3, OneHotEncoderEstimator_eff036e6c502, StringIndexer_55c9408d0b7c, OneHotEncoderEstimator_57dc7d6d7c9d, StringIndexer_91b66f43fef3, OneHotEncoderEstimator_15a550f1bb0b, StringIndexer_a95a292ef457, OneHotEncoderEstimator_af616a26925e, StringIndexer_cf3ba5955545, OneHotEncoderEstimator_765efef57e54, StringIndexer_5ad26969719a, OneHotEncoderEstimator_5d25df5949fb, StringIndexer_1b33cf2ff559, VectorAssembler_82748ff1c2c2]

In [9]:
df = df.select('age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y')
cols = df.columns

## Pipeline
We use Pipeline to chain multiple Transformers and Estimators together to specify our machine learning workflow. A Pipeline’s stages are specified as an ordered array.
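
The fit-then-transform chaining can be sketched in plain Python. The classes below (`MeanCenterer`, `Doubler`, `ToyPipeline`) are hypothetical stand-ins rather than Spark classes; they only illustrate the idea that each stage is fitted on the output of the stages before it, and that the fitted stages are then applied in the same order.

```python
class MeanCenterer:
    """Toy estimator: learns the mean of its input during fit."""
    def fit(self, rows):
        self.mean = sum(rows) / len(rows)
        return self

    def transform(self, rows):
        return [r - self.mean for r in rows]

class Doubler:
    """Toy transformer: has nothing to learn, so fit is a no-op."""
    def fit(self, rows):
        return self

    def transform(self, rows):
        return [2 * r for r in rows]

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages  # ordered list of stages

    def fit(self, rows):
        # Fit each stage on the data produced by the previous stages.
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        # Apply the already-fitted stages in order.
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

toy = ToyPipeline([MeanCenterer(), Doubler()])
result = toy.fit([1.0, 2.0, 3.0]).transform([1.0, 2.0, 3.0])
print(result)  # [-2.0, 0.0, 2.0]
```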

In [10]:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
selectedCols = ['label', 'features'] + cols
df = df.select(selectedCols)
df.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- y: string (nullable = true)

## Try different models

In [11]:
train, test = df.randomSplit([0.7, 0.3], seed = 2018)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

Training Dataset Count: 31512
Test Dataset Count: 13699

In [14]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=10)
lrModel = lr.fit(train)

In [15]:
predictions = lrModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

+---+-----------+-----+--------------------+----------+--------------------+
|age|        job|label|       rawPrediction|prediction|         probability|
+---+-----------+-----+--------------------+----------+--------------------+
| 25|blue-collar|  0.0|[-0.0074897467672...|       1.0|[0.49812757206119...|
| 25|blue-collar|  0.0|[3.69942973985946...|       0.0|[0.97585954821538...|
| 26|blue-collar|  0.0|[0.67389844135347...|       0.0|[0.66237553507928...|
| 28|blue-collar|  0.0|[2.95848552534532...|       0.0|[0.95066300919706...|
| 28|blue-collar|  0.0|[3.19444399099684...|       0.0|[0.96062465858797...|
| 28|blue-collar|  0.0|[2.18706956038553...|       0.0|[0.89908232877387...|
| 28|blue-collar|  0.0|[3.58762605707277...|       0.0|[0.97308076654559...|
| 29|blue-collar|  0.0|[3.62724352882160...|       0.0|[0.97409930693882...|
| 29|blue-collar|  0.0|[4.21665335950071...|       0.0|[0.98546642202779...|
| 29|blue-collar|  0.0|[3.37984658370938...|       0.0|[0.96706871973232...|
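
For binary logistic regression, the `rawPrediction` and `probability` columns are related by the logistic (sigmoid) function. As a quick check in plain Python, outside Spark, applying the sigmoid to the first raw score of the first two rows reproduces the first probability shown. The values are copied from the truncated table output, so they agree only to the displayed digits.

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# (first element of rawPrediction, first element of probability), from the table
rows = [
    (-0.0074897467672, 0.49812757206119),
    (3.69942973985946, 0.97585954821538),
]

for raw, shown in rows:
    print(raw, sigmoid(raw), shown)
```

In the first row the probability of class 0 is just under 0.5, which is why that row is predicted as 1.0.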

In [16]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC', evaluator.evaluate(predictions))

Test Area Under ROC 0.8831402282344049
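
By default, `BinaryClassificationEvaluator` reports the area under the ROC curve. One way to read that number is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The plain-Python sketch below computes it that way on made-up scores. `pairwise_auc` is a hypothetical helper for intuition only; Spark derives the metric from the curve itself.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Made-up labels and scores, not taken from the bank dataset.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.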

In [15]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10])
             .build())

cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

cvModel = cv.fit(train)
predictions = cvModel.transform(test)
print('Test Area Under ROC', evaluator.evaluate(predictions))

Test Area Under ROC 0.8869487141751714
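
The cost of this search grows with the product of the grid sizes. As a rough sketch in plain Python (the variable names are illustrative), the grid above has 27 parameter combinations, and 5-fold cross-validation fits each of them once per fold before refitting the best combination on the full training set.

```python
from itertools import product

# Values copied from the ParamGridBuilder call above.
reg_params = [0.01, 0.5, 2.0]
elastic_net_params = [0.0, 0.5, 1.0]
max_iters = [1, 5, 10]
num_folds = 5

combinations = list(product(reg_params, elastic_net_params, max_iters))
fits_during_search = len(combinations) * num_folds
# One more fit: the best combination is refitted on all of the training data.
total_fits = fits_during_search + 1

print(len(combinations), fits_during_search, total_fits)  # 27 135 136
```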

## Decision Tree Classifier
Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

In [17]:
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 3)
dtModel = dt.fit(train)
predictions = dtModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

+---+-----------+-----+----------------+----------+--------------------+
|age|        job|label|   rawPrediction|prediction|         probability|
+---+-----------+-----+----------------+----------+--------------------+
| 25|blue-collar|  0.0|   [439.0,629.0]|       1.0|[0.41104868913857...|
| 25|blue-collar|  0.0|[25495.0,1984.0]|       0.0|[0.92779941045889...|
| 26|blue-collar|  0.0| [1922.0,1043.0]|       0.0|[0.64822934232715...|
| 28|blue-collar|  0.0|[25495.0,1984.0]|       0.0|[0.92779941045889...|
| 28|blue-collar|  0.0|[25495.0,1984.0]|       0.0|[0.92779941045889...|
| 28|blue-collar|  0.0|[25495.0,1984.0]|       0.0|[0.92779941045889...|
| 28|blue-collar|  0.0|[25495.0,1984.0]|       0.0|[0.92779941045889...|
| 29|blue-collar|  0.0|[25495.0,1984.0]|       0.0|[0.92779941045889...|
| 29|blue-collar|  0.0|[25495.0,1984.0]|       0.0|[0.92779941045889...|
| 29|blue-collar|  0.0|[25495.0,1984.0]|       0.0|[0.92779941045889...|
+---+-----------+-----+----------------+----------+--------------------+
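
For a decision tree, `rawPrediction` holds the per-class counts of training rows in the leaf that a test row falls into, and `probability` is those counts normalized to sum to 1. The plain-Python check below reproduces the first probability of each distinct leaf in the table; it runs outside Spark and uses only the values displayed above.

```python
# Distinct rawPrediction vectors from the table: [count of label 0, count of label 1]
leaf_counts = [
    [439.0, 629.0],
    [25495.0, 1984.0],
    [1922.0, 1043.0],
]

probabilities = []
for counts in leaf_counts:
    total = sum(counts)
    probabilities.append([c / total for c in counts])

for counts, probs in zip(leaf_counts, probabilities):
    print(counts, probs)
```

Because a depth-3 tree has only a handful of leaves, many test rows share exactly the same score, which is visible in the repeated rows of the table.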

In [18]:
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.3226729919510868

## Random Forest Classifier
A single shallow decision tree performed poorly here (test AUC of about 0.32); it is too weak a learner for this range of features. Ensemble methods, which combine many trees, can improve on the prediction accuracy of an individual tree.



In [19]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label')
rfModel = rf.fit(train)
predictions = rfModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

+---+-----------+-----+--------------------+----------+--------------------+
|age|        job|label|       rawPrediction|prediction|         probability|
+---+-----------+-----+--------------------+----------+--------------------+
| 25|blue-collar|  0.0|[13.8316359122759...|       0.0|[0.69158179561379...|
| 25|blue-collar|  0.0|[18.7672623845050...|       0.0|[0.93836311922525...|
| 26|blue-collar|  0.0|[12.9139012863136...|       0.0|[0.64569506431568...|
| 28|blue-collar|  0.0|[18.7672623845050...|       0.0|[0.93836311922525...|
| 28|blue-collar|  0.0|[18.7672623845050...|       0.0|[0.93836311922525...|
| 28|blue-collar|  0.0|[18.6236605734411...|       0.0|[0.93118302867205...|
| 28|blue-collar|  0.0|[18.7394752687360...|       0.0|[0.93697376343680...|
| 29|blue-collar|  0.0|[18.7672623845050...|       0.0|[0.93836311922525...|
| 29|blue-collar|  0.0|[18.7672623845050...|       0.0|[0.93836311922525...|
| 29|blue-collar|  0.0|[18.7672623845050...|       0.0|[0.93836311922525...|
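
In Spark's random forest, each tree contributes a class-probability vector and `rawPrediction` is the sum of those vectors, so dividing by the number of trees recovers `probability`. The plain-Python check below assumes `numTrees` is 20, Spark's default, since `rf` was created without overriding it. The values are copied from the truncated table output.

```python
num_trees = 20  # assumption: Spark's default numTrees

# First element of rawPrediction and of probability for two rows of the table.
raw_first = [13.8316359122759, 18.7672623845050]
shown_first = [0.69158179561379, 0.93836311922525]

recovered = [raw / num_trees for raw in raw_first]
for value, shown in zip(recovered, shown_first):
    print(value, shown)
```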

In [19]:
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.8564152099432955

## Gradient-Boosted Tree Classifier

In [20]:

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)
predictions = gbtModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

+---+-----------+-----+--------------------+----------+--------------------+
|age|        job|label|       rawPrediction|prediction|         probability|
+---+-----------+-----+--------------------+----------+--------------------+
| 25|blue-collar|  0.0|[-0.3067465338488...|       1.0|[0.35126279431950...|
| 25|blue-collar|  0.0|[1.14993152651692...|       0.0|[0.90886569646160...|
| 26|blue-collar|  0.0|[0.09478166523072...|       0.0|[0.54724942786082...|
| 28|blue-collar|  0.0|[1.14338723694234...|       0.0|[0.90777576743999...|
| 28|blue-collar|  0.0|[1.16754982963295...|       0.0|[0.91174255903043...|
| 28|blue-collar|  0.0|[1.01712064459232...|       0.0|[0.88434557816390...|
| 28|blue-collar|  0.0|[1.22492384426637...|       0.0|[0.92055031185947...|
| 29|blue-collar|  0.0|[1.25178196394701...|       0.0|[0.9243912872082,...|
| 29|blue-collar|  0.0|[1.29684498598191...|       0.0|[0.93045437123828...|
| 29|blue-collar|  0.0|[1.19024615283953...|       0.0|[0.91532759723223...|

In [21]:
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

Test Area Under ROC: 0.8861918511262175

## Extension
Among the models trained without tuning, the gradient-boosted tree classifier achieved the best test AUC (about 0.886), so we will try tuning it with the ParamGridBuilder and the CrossValidator.

In [22]:

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 60])
             .addGrid(gbt.maxIter, [10, 20])
             .build())

cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations.  This can take about 6 minutes since it is training over 20 trees!
cvModel = cv.fit(train)
predictions = cvModel.transform(test)
evaluator.evaluate(predictions)

0.8998086453386656