**Machine Learning Packages: spark.mllib vs. spark.ml**

Spark 2.x goes further with spark.ml.
spark.ml execution is much faster than spark.mllib, by a factor of between 10 and 100.
*Ease of hyperparameter tuning.*
*Both libraries offer powerful support; however, execution speed in Spark 2.x is clearly higher.*

- RDDs and Spark 1.x

RDDs are still the fundamental building blocks of Spark

All operations in Spark are performed on in-memory objects

An RDD is a **collection** of entities
- rows, records

In [3]:
!pyspark --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
      /_/
                        
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_211
Branch 
Compiled by user  on 2019-02-04T13:00:46Z
Revision 
Url 
Type --help for more information.


spark.mllib                     | spark.ml
- Older                         | - Newer
- RDDs                          | - DataFrames (faster!)
- More functionality            | - Functionality catching up
- ETL is hard, no pipelines     | - Support for ML pipelines
- Hyperparameter tuning is hard | - Tools for hyperparameter tuning

- Decision Trees model for classification

In [5]:
#Implement classification using decision trees in spark.mllib

In [6]:
# The data is available in CSV as well as LIBSVM format

In [3]:
# load the CSV file through the SparkContext as an RDD of text lines and assign it to "rawData"

In [7]:
rawData = sc.textFile("datasets/wine.data")

In [8]:
rawData.take(5)  

['1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065',
 '1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050',
 '1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185',
 '1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480',
 '1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735']

In [9]:
# set the data in a form to be fed into a decision tree learning model

In [10]:
from pyspark.mllib.regression import LabeledPoint

def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[0], values[1:])

In [11]:
#use the function to perform a map operation on the data

In [17]:
parsedData = rawData.map(parsePoint)
parsedData

PythonRDD[9] at RDD at PythonRDD.scala:52

In [18]:
parsedData.take(5)

[LabeledPoint(1.0, [14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0]),
 LabeledPoint(1.0, [13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0]),
 LabeledPoint(1.0, [13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0]),
 LabeledPoint(1.0, [14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0]),
 LabeledPoint(1.0, [13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0])]

In [19]:
# split the data into training (70%) and test (30%) sets

In [21]:
(trainingData, testData) = parsedData.randomSplit([0.7, 0.3])

In [22]:
trainingData

PythonRDD[11] at RDD at PythonRDD.scala:52

In [23]:
trainingData.take(5)

[LabeledPoint(1.0, [14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0]),
 LabeledPoint(1.0, [13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0]),
 LabeledPoint(1.0, [14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0]),
 LabeledPoint(1.0, [13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0]),
 LabeledPoint(1.0, [14.2,1.76,2.45,15.2,112.0,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450.0])]

In [24]:
# here is the built-in machine learning decision tree model

In [25]:
from pyspark.mllib.tree import DecisionTree

# labels in wine.data are 1, 2 and 3; MLlib expects labels in 0..numClasses-1, hence numClasses=4
model = DecisionTree.trainClassifier(trainingData,
                                    numClasses=4,
                                    categoricalFeaturesInfo={},
                                    impurity='gini',
                                    maxDepth=3,
                                    maxBins=32)

In [26]:
# run the trained model on the test features using "predict"

In [27]:
predictions = model.predict(testData.map(lambda x: x.features))
predictions.take(5)

[1.0, 1.0, 1.0, 3.0, 1.0]

In [28]:
# compare whether the predictions are good or bad by comparing them with the actual labels

In [29]:
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
labelsAndPredictions.take(5)

[(1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 3.0), (1.0, 1.0)]

In [30]:
# Defining our logic to calculate accuracy

In [36]:
testAcc = labelsAndPredictions.filter(
    lambda lp: lp[0] == lp[1]).count() / float(testData.count())

print('Test Accuracy = ', testAcc)

Test Accuracy =  0.7924528301886793


In [37]:
#Evaluating how the model performs on test data

In [38]:
from pyspark.mllib.evaluation import MulticlassMetrics

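# note: MulticlassMetrics expects (prediction, label) pairs; passing (label, prediction)
# leaves accuracy unchanged but swaps per-label precision and recall
metrics = MulticlassMetrics(labelsAndPredictions)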

In [39]:
#using the built-in accuracy measure available in MultiClassMetrics

In [40]:
metrics.accuracy

0.7924528301886793

In [41]:
# "fMeasure()" is another evaluation method; called without a label it returns the overall F-measure, which equals the accuracy here

In [42]:
metrics.fMeasure()

0.7924528301886793

In [43]:
# per-label scores can be extracted as well

In [44]:
metrics.precision(1.0)

0.8571428571428571

In [45]:
metrics.precision(3.0)

0.6

In [47]:
# "confusionMatrix()" returns a matrix of counts of actual labels vs. predicted labels

In [46]:
metrics.confusionMatrix()

DenseMatrix(3, 3, [12.0, 0.0, 2.0, 0.0, 18.0, 1.0, 0.0, 8.0, 12.0], 0)
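
The `DenseMatrix` printed above lists its values in column-major order (the trailing `0` means the matrix is not stored transposed). As a minimal, Spark-free sketch, the values can be unpacked into rows and the earlier numbers recomputed by hand: accuracy is the diagonal sum over the total, and the precision that `MulticlassMetrics` reports for a label is its diagonal entry over its column sum. The helper names `to_rows` and `column_precision` are hypothetical.

```python
# Values copied from the DenseMatrix(3, 3, [...], 0) output above (column-major).
values = [12.0, 0.0, 2.0, 0.0, 18.0, 1.0, 0.0, 8.0, 12.0]
n = 3


def to_rows(values, n):
    """Unpack column-major values into a list of n rows (hypothetical helper)."""
    return [[values[col * n + row] for col in range(n)] for row in range(n)]


def column_precision(matrix, i):
    """Diagonal entry over its column sum for class index i (hypothetical helper)."""
    return matrix[i][i] / sum(row[i] for row in matrix)


matrix = to_rows(values, n)

# Accuracy: the diagonal (matching pairs) over all samples.
accuracy = sum(matrix[i][i] for i in range(n)) / sum(values)

print(matrix)
print(accuracy)
print(column_precision(matrix, 0), column_precision(matrix, 2))
```

Class index 0 corresponds to label 1.0 and index 2 to label 3.0, so these values should agree with `metrics.accuracy`, `metrics.precision(1.0)` and `metrics.precision(3.0)` above.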

In [48]:
# view the decision rules learned by the model

In [49]:
print(model.toDebugString())

DecisionTreeModel classifier of depth 3 with 15 nodes
  If (feature 12 <= 760.0)
   If (feature 9 <= 4.85)
    If (feature 9 <= 4.300000000000001)
     Predict: 2.0
    Else (feature 9 > 4.300000000000001)
     Predict: 2.0
   Else (feature 9 > 4.85)
    If (feature 1 <= 2.0700000000000003)
     Predict: 2.0
    Else (feature 1 > 2.0700000000000003)
     Predict: 3.0
  Else (feature 12 > 760.0)
   If (feature 6 <= 2.2800000000000002)
    If (feature 1 <= 1.525)
     Predict: 2.0
    Else (feature 1 > 1.525)
     Predict: 3.0
   Else (feature 6 > 2.2800000000000002)
    If (feature 3 <= 26.75)
     Predict: 1.0
    Else (feature 3 > 26.75)
     Predict: 2.0
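
The debug string above reads directly as nested conditionals. A minimal sketch that transcribes the printed rules into a plain Python function (the name `predict_from_rules` is hypothetical) and applies it to the features of the first row shown by `parsedData.take(5)`:

```python
def predict_from_rules(features):
    """Hand transcription of the depth-3 tree printed by toDebugString()."""
    if features[12] <= 760.0:
        if features[9] <= 4.85:
            return 2.0  # both child leaves of this node predict 2.0
        return 2.0 if features[1] <= 2.0700000000000003 else 3.0
    if features[6] <= 2.2800000000000002:
        return 2.0 if features[1] <= 1.525 else 3.0
    return 1.0 if features[3] <= 26.75 else 2.0


# Features of the first LabeledPoint shown earlier (its label is 1.0).
first_row = [14.23, 1.71, 2.43, 15.6, 127.0, 2.8, 3.06,
             0.28, 2.29, 5.64, 1.04, 3.92, 1065.0]
print(predict_from_rules(first_row))
```

Feature indices are zero-based positions in the feature vector, so feature 12 is the last CSV column.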



In [50]:
#Working with the LIBSVM Data Format

In [51]:
from pyspark.mllib.util import MLUtils

In [56]:
libsvmData = MLUtils.loadLibSVMFile(sc, 'datasets/wine.scale')

In [58]:
libsvmData.take(5)

[LabeledPoint(1.0, (13,[0,1,2,3,4,5,6,7,8,9,10,11,12],[0.68421,-0.616601,0.144385,-0.484536,0.23913,0.255172,0.147679,-0.433962,0.18612,-0.255973,-0.089431,0.941392,0.122682])),
 LabeledPoint(1.0, (13,[0,1,2,3,4,5,6,7,8,9,10,11,12],[0.142105,-0.588933,-0.165775,-0.938144,-0.347826,0.151724,0.0210971,-0.509434,-0.451104,-0.47099,-0.0731708,0.56044,0.101284])),
 LabeledPoint(1.0, (13,[0,1,2,3,4,5,6,7,8,9,10,11,12],[0.121053,-0.359684,0.40107,-0.175258,-0.326087,0.255172,0.223629,-0.358491,0.514196,-0.249147,-0.105691,0.391941,0.293866])),
 LabeledPoint(1.0, (13,[0,1,2,3,4,5,6,7,8,9,10,11,12],[0.757895,-0.521739,0.219251,-0.360825,-0.0652174,0.97931,0.329114,-0.584906,0.116719,0.112628,-0.382114,0.59707,0.714693])),
 LabeledPoint(1.0, (13,[0,1,2,3,4,5,6,7,8,9,10,11,12],[0.163158,-0.268775,0.614973,0.0721649,0.0434783,0.255172,-0.00843878,-0.018868,-0.11041,-0.481229,-0.089431,0.216117,-0.348074]))]
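
For orientation, each line of a LIBSVM file has the form `label index:value index:value ...` with one-based feature indices, which `loadLibSVMFile` converts to the zero-based indices seen in the sparse vectors above. The following is a rough, Spark-free sketch of that parsing step; `parse_libsvm_line` is a hypothetical helper and the sample line is reconstructed from the first three features of the first row, not copied from the file.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, {zero_based_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)  # LIBSVM indices start at 1
    return label, features


# Hypothetical line, reconstructed from the first row of the output above.
sample = "1 1:0.68421 2:-0.616601 3:0.144385"
label, features = parse_libsvm_line(sample)
print(label, features)
```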

In [59]:
(trainingData, testData) = libsvmData.randomSplit([0.7, 0.3])

In [60]:
libsvmModel = DecisionTree.trainClassifier(trainingData,
                                          numClasses=4,
                                          categoricalFeaturesInfo={},
                                          impurity='gini',
                                          maxDepth=5,
                                          maxBins=32)

In [61]:
predictions = libsvmModel.predict(testData.map(lambda x: x.features))

In [62]:
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
labelsAndPredictions.take(5)

[(1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]

In [63]:
metrics = MulticlassMetrics(labelsAndPredictions)

In [64]:
metrics.accuracy

0.94

In [65]:
metrics.confusionMatrix().toArray()

array([[17.,  1.,  0.],
       [ 0., 11.,  1.],
       [ 0.,  1., 19.]])

In [66]:
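# note: this prints the earlier CSV-trained "model"; use libsvmModel.toDebugString() to inspect the LIBSVM model
print(model.toDebugString())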

DecisionTreeModel classifier of depth 3 with 15 nodes
  If (feature 12 <= 760.0)
   If (feature 9 <= 4.85)
    If (feature 9 <= 4.300000000000001)
     Predict: 2.0
    Else (feature 9 > 4.300000000000001)
     Predict: 2.0
   Else (feature 9 > 4.85)
    If (feature 1 <= 2.0700000000000003)
     Predict: 2.0
    Else (feature 1 > 2.0700000000000003)
     Predict: 3.0
  Else (feature 12 > 760.0)
   If (feature 6 <= 2.2800000000000002)
    If (feature 1 <= 1.525)
     Predict: 2.0
    Else (feature 1 > 1.525)
     Predict: 3.0
   Else (feature 6 > 2.2800000000000002)
    If (feature 3 <= 26.75)
     Predict: 1.0
    Else (feature 3 > 26.75)
     Predict: 2.0



In [71]:
#SPARK.ML

DataFrame

Used to represent the ML dataset for training

Different columns for features, labels, predictions

Transformers

Algorithm to convert one DataFrame to another

DataFrame with features -> DataFrame with predictions

Estimator

An algorithm that fits on a DataFrame to produce a Transformer

An ML algorithm -> trains on input data -> produces a model

Pipeline

Chains Estimators and Transformers to form a machine learning workflow

Chains a series of operations to be performed on DataFrames

Parameter

Design settings in ML algorithms that can be tuned

Transformers and Estimators have a common API for parameters

#Pipeline Stages

Estimator Stages                                        | Transformer Stages
DataFrame in, Transformer out                           | DataFrame in, DataFrame out
Implement a fit() method                                | Implement a transform() method
Obtain trained machine learning model by invoking fit() | Transform features or carry out prediction using transform()
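
To make the fit()/transform() contract concrete, here is a toy, Spark-free sketch. Every class name in it (`ToyScaler`, `ToyScalerModel`, `ToyPipeline`) is hypothetical, rows are plain dicts rather than DataFrames, and `ToyPipeline.fit` simply returns the list of fitted stages, whereas Spark's own Pipeline returns a single fitted pipeline model.

```python
class ToyScaler:
    """Estimator: fit() learns the column maximum and returns a Transformer."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        largest = max(row[self.input_col] for row in rows)
        return ToyScalerModel(self.input_col, self.output_col, largest)


class ToyScalerModel:
    """Transformer: transform() returns new rows with one extra column."""

    def __init__(self, input_col, output_col, largest):
        self.input_col = input_col
        self.output_col = output_col
        self.largest = largest

    def transform(self, rows):
        return [dict(row, **{self.output_col: row[self.input_col] / self.largest})
                for row in rows]


class ToyPipeline:
    """Chains stages: Estimators are fitted, then their models transform the data."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # An Estimator is fitted first; a Transformer is used as it is.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return fitted


# Proline values of the first three wine rows shown earlier.
rows = [{"Proline": 1065.0}, {"Proline": 1050.0}, {"Proline": 1185.0}]
models = ToyPipeline([ToyScaler("Proline", "scaled")]).fit(rows)
scaled = models[0].transform(rows)
print(scaled[2])
```

The input rows are left untouched and each stage adds a column, which mirrors how a Transformer maps one DataFrame to another.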

    Implement classification using Decision Trees in spark.ml

    Evaluate the model using the F1 score



spark.ml has:

    High-level abstractions such as Estimators and Transformers

    Chained together in a pipeline i.e. machine learning workflow

    Special libraries for feature engineering

    Evaluating classifiers using the confusion matrix

    Decision trees and random forests for classification

    Specialized regression models such as Lasso and Ridge regression



In [1]:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName('Predicting the grape variety from wine characteristics')\
    .getOrCreate()

rawData = spark.read\
            .format('csv')\
            .option('header', 'false')\
            .load('wine.data')

In [2]:
rawData

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string, _c9: string, _c10: string, _c11: string, _c12: string, _c13: string]

In [3]:
rawData.take(5)

[Row(_c0='1', _c1='14.23', _c2='1.71', _c3='2.43', _c4='15.6', _c5='127', _c6='2.8', _c7='3.06', _c8='.28', _c9='2.29', _c10='5.64', _c11='1.04', _c12='3.92', _c13='1065'),
 Row(_c0='1', _c1='13.2', _c2='1.78', _c3='2.14', _c4='11.2', _c5='100', _c6='2.65', _c7='2.76', _c8='.26', _c9='1.28', _c10='4.38', _c11='1.05', _c12='3.4', _c13='1050'),
 Row(_c0='1', _c1='13.16', _c2='2.36', _c3='2.67', _c4='18.6', _c5='101', _c6='2.8', _c7='3.24', _c8='.3', _c9='2.81', _c10='5.68', _c11='1.03', _c12='3.17', _c13='1185'),
 Row(_c0='1', _c1='14.37', _c2='1.95', _c3='2.5', _c4='16.8', _c5='113', _c6='3.85', _c7='3.49', _c8='.24', _c9='2.18', _c10='7.8', _c11='.86', _c12='3.45', _c13='1480'),
 Row(_c0='1', _c1='13.24', _c2='2.59', _c3='2.87', _c4='21', _c5='118', _c6='2.8', _c7='2.69', _c8='.39', _c9='1.82', _c10='4.32', _c11='1.04', _c12='2.93', _c13='735')]

In [4]:
dataset = rawData.toDF('Label',
                'Alcohol',
                'MalicAcid',
                'Ash',
                'AshAlkalinity',
                'Magnesium',
                'TotalPhenols',
                'Flavanoids',
                'NonflavanoidPhenols',
                'Proanthocyanins',
                'ColorIntensity',
                'Hue',
                'OD',
                'Proline'
                )

In [5]:
dataset

DataFrame[Label: string, Alcohol: string, MalicAcid: string, Ash: string, AshAlkalinity: string, Magnesium: string, TotalPhenols: string, Flavanoids: string, NonflavanoidPhenols: string, Proanthocyanins: string, ColorIntensity: string, Hue: string, OD: string, Proline: string]

In [6]:
dataset.show(5)

+-----+-------+---------+----+-------------+---------+------------+----------+-------------------+---------------+--------------+----+----+-------+
|Label|Alcohol|MalicAcid| Ash|AshAlkalinity|Magnesium|TotalPhenols|Flavanoids|NonflavanoidPhenols|Proanthocyanins|ColorIntensity| Hue|  OD|Proline|
+-----+-------+---------+----+-------------+---------+------------+----------+-------------------+---------------+--------------+----+----+-------+
|    1|  14.23|     1.71|2.43|         15.6|      127|         2.8|      3.06|                .28|           2.29|          5.64|1.04|3.92|   1065|
|    1|   13.2|     1.78|2.14|         11.2|      100|        2.65|      2.76|                .26|           1.28|          4.38|1.05| 3.4|   1050|
|    1|  13.16|     2.36|2.67|         18.6|      101|         2.8|      3.24|                 .3|           2.81|          5.68|1.03|3.17|   1185|
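|    1|  14.37|     1.95| 2.5|         16.8|      113|        3.85|      3.49|                .24|           2.18|           7.8| .86|3.45|   1480|
|    1|  13.24|     2.59|2.87|           21|      118|         2.8|      2.69|                .39|           1.82|          4.32|1.04|2.93|    735|
+-----+-------+---------+----+-------------+---------+------------+----------+-------------------+---------------+--------------+----+----+-------+
only showing top 5 rows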

In [7]:
from pyspark.ml.linalg import Vectors

def vectorize(data):
    return data.rdd.map(lambda r: [r[0], Vectors.dense(r[1:])]).toDF(['label', 'features'])

In [8]:
vectorizedData = vectorize(dataset)

In [9]:
vectorizedData.show(5)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|    1|[14.23,1.71,2.43,...|
|    1|[13.2,1.78,2.14,1...|
|    1|[13.16,2.36,2.67,...|
|    1|[14.37,1.95,2.5,1...|
|    1|[13.24,2.59,2.87,...|
+-----+--------------------+
only showing top 5 rows



In [10]:
vectorizedData.take(5)

[Row(label='1', features=DenseVector([14.23, 1.71, 2.43, 15.6, 127.0, 2.8, 3.06, 0.28, 2.29, 5.64, 1.04, 3.92, 1065.0])),
 Row(label='1', features=DenseVector([13.2, 1.78, 2.14, 11.2, 100.0, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.4, 1050.0])),
 Row(label='1', features=DenseVector([13.16, 2.36, 2.67, 18.6, 101.0, 2.8, 3.24, 0.3, 2.81, 5.68, 1.03, 3.17, 1185.0])),
 Row(label='1', features=DenseVector([14.37, 1.95, 2.5, 16.8, 113.0, 3.85, 3.49, 0.24, 2.18, 7.8, 0.86, 3.45, 1480.0])),
 Row(label='1', features=DenseVector([13.24, 2.59, 2.87, 21.0, 118.0, 2.8, 2.69, 0.39, 1.82, 4.32, 1.04, 2.93, 735.0]))]

In [11]:
from pyspark.ml.feature import StringIndexer

# map the string labels to numeric indices in a new column
labelIndexer = StringIndexer(inputCol='label',
                             outputCol='indexedLabel')

In [12]:
indexedData = labelIndexer.fit(vectorizedData).transform(vectorizedData)
indexedData.take(2)

[Row(label='1', features=DenseVector([14.23, 1.71, 2.43, 15.6, 127.0, 2.8, 3.06, 0.28, 2.29, 5.64, 1.04, 3.92, 1065.0]), indexedLabel=1.0),
 Row(label='1', features=DenseVector([13.2, 1.78, 2.14, 11.2, 100.0, 2.65, 2.76, 0.26, 1.28, 4.38, 1.05, 3.4, 1050.0]), indexedLabel=1.0)]

In [13]:
indexedData

DataFrame[label: string, features: vector, indexedLabel: double]

In [14]:
indexedData.select('label').distinct().show()

+-----+
|label|
+-----+
|    3|
|    1|
|    2|
+-----+



In [15]:
indexedData.select('indexedLabel').distinct().show()

+------------+
|indexedLabel|
+------------+
|         0.0|
|         1.0|
|         2.0|
+------------+
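
By default, StringIndexer orders labels by descending frequency and gives index 0.0 to the most frequent one, which would explain why label '1' maps to indexedLabel 1.0 in the rows above rather than 0.0. A small sketch of that ordering in plain Python follows; the class counts (59, 71, 48) are an assumption taken from the commonly cited description of the UCI wine data, since this notebook does not list them.

```python
from collections import Counter

# Assumed class counts for the wine data (not shown in this notebook).
labels = ['1'] * 59 + ['2'] * 71 + ['3'] * 48

# Most frequent label first; its position in the ordering becomes the index.
ordered = [label for label, _ in Counter(labels).most_common()]
index_of = {label: float(i) for i, label in enumerate(ordered)}
print(index_of)
```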



In [16]:
(trainingData, testData) = indexedData.randomSplit([0.8, 0.2])

DecisionTree Classifier

    Specify the features and label columns
    maxDepth: The maximum depth of the decision tree
    impurity: We use gini rather than entropy. Gini impurity is the probability that a randomly chosen sample would be classified incorrectly if it were labeled at random according to the class distribution in the node. Entropy is a measure of information (splits seek to maximize information gain). Outputs generally don't vary much when either option is chosen, but entropy may take longer to compute because it requires a logarithm.
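
The two impurity measures can be sketched in a few lines of plain Python, computed from the per-class sample counts in a node. This is only an illustration of the formulas, not Spark's implementation, and the function names are hypothetical.

```python
import math


def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy in bits: sum of -p * log2(p) over the non-empty classes."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


pure_node = [44, 0, 0]   # every sample belongs to one class
mixed_node = [10, 10]    # two classes, evenly mixed

print(gini(pure_node), entropy(pure_node))
print(gini(mixed_node), entropy(mixed_node))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why they usually lead to similar splits.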



In [17]:
from pyspark.ml.classification import DecisionTreeClassifier

dtree = DecisionTreeClassifier(
    labelCol='indexedLabel', 
    featuresCol='features',
    maxDepth=3,
    impurity='gini'
)

**Train the model using the training data**

In [18]:
model = dtree.fit(trainingData)

**Use Spark ML's MulticlassClassificationEvaluator to evaluate the model**

    - Used to evaluate classification models
    - It takes a set of labels and predictions as input
    - Similar to (but not the same as) MulticlassMetrics in MLlib
    - metricName: in Spark 2.x, can be f1, weightedPrecision, weightedRecall or accuracy



In [19]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel',
                                              predictionCol='prediction', 
                                              metricName='f1')

**Transform the test data using our model to include predictions**

In [20]:
transformed_data = model.transform(testData)
transformed_data.show(5)

+-----+--------------------+------------+--------------+-------------+----------+
|label|            features|indexedLabel| rawPrediction|  probability|prediction|
+-----+--------------------+------------+--------------+-------------+----------+
|    1|[13.16,2.36,2.67,...|         1.0|[0.0,44.0,0.0]|[0.0,1.0,0.0]|       1.0|
|    1|[13.2,1.78,2.14,1...|         1.0|[0.0,44.0,0.0]|[0.0,1.0,0.0]|       1.0|
|    1|[13.51,1.8,2.65,1...|         1.0|[0.0,44.0,0.0]|[0.0,1.0,0.0]|       1.0|
|    1|[13.72,1.43,2.5,1...|         1.0|[0.0,44.0,0.0]|[0.0,1.0,0.0]|       1.0|
|    1|[13.75,1.73,2.41,...|         1.0|[0.0,44.0,0.0]|[0.0,1.0,0.0]|       1.0|
+-----+--------------------+------------+--------------+-------------+----------+
only showing top 5 rows



Measure the F1 score of the model on the test data

In [21]:
print(evaluator.getMetricName(), 
      'score:', 
      evaluator.evaluate(transformed_data))

f1 score: 0.9245538720538721
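
The f1 metric of MulticlassClassificationEvaluator is a weighted F-measure: the per-class F1 scores averaged with weights proportional to how often each class occurs among the true labels. A minimal, Spark-free sketch of that calculation follows. It uses the 3x3 confusion matrix printed in the LIBSVM run earlier purely as sample input, assumes rows are true labels and columns are predictions, and the name `weighted_f1` is hypothetical.

```python
# Sample input: the confusion matrix from the LIBSVM run earlier in this notebook.
# Assumption: rows are the true labels, columns the predicted labels.
matrix = [[17.0, 1.0, 0.0],
          [0.0, 11.0, 1.0],
          [0.0, 1.0, 19.0]]


def weighted_f1(matrix):
    """Per-class F1 averaged with weights equal to each class's share of true labels."""
    total = sum(sum(row) for row in matrix)
    score = 0.0
    for i in range(len(matrix)):
        tp = matrix[i][i]
        actual = sum(matrix[i])                    # row sum: samples whose true class is i
        predicted = sum(row[i] for row in matrix)  # column sum: samples predicted as i
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += f1 * actual / total               # weight by class frequency
    return score


print(weighted_f1(matrix))
```

This is a different run from the spark.ml model above, so the number it prints is not expected to match the evaluator's output.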
