## 202 - Training and Evaluating CNTK Models in Spark ML Pipelines

This notebook repeats the book review classification task, this time using Spark's
`Word2Vec` Estimator to build the features. Because `Word2Vec` yields a dense,
lower-dimensional feature representation, Spark's tree-based learners are practical
in this scenario.

In [1]:
import pandas as pd



In [2]:
data = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/BookReviewsFromAmazon10K.parquet")
data.limit(10).toPandas()

   rating                                               text
0       5  I LOVED THIS BOOK!  This was my first Jodi Pic...
1       5  My Sister's Keeper:  A Novel  What a very touc...
2       4  hooked by chapter one; tear jerker in the end ...
3       4  A thought-provoking book  A very interesting p...
4       5  Fiction mimics the future  Very well written, ...
5       5  Page turner until the end!  This was a fantast...
6       4  Makes you aware  Makes you aware of some of th...
7       5  A Hands-on Book for Connecting With Students  ...
8       5  Dozens of positive, effective strategies  A vi...
9       5  Hard To Put Down  Harry Bosch and his partner ...

Add a Boolean `label` column that is true when the rating is greater than 3, and keep only the `text` and `label` columns.

In [3]:
processedData = data.withColumn("label", data["rating"] > 3) \
                    .select(["text", "label"])
processedData.limit(5).toPandas()

                                                text  label
0  I LOVED THIS BOOK!  This was my first Jodi Pic...   True
1  My Sister's Keeper:  A Novel  What a very touc...   True
2  hooked by chapter one; tear jerker in the end ...   True
3  A thought-provoking book  A very interesting p...   True
4  Fiction mimics the future  Very well written, ...   True
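
The thresholding rule can be checked in plain Python. The sketch below applies it to the ten ratings shown in the first sample output; the helper name `to_label` is made up for this illustration and is not part of the notebook.

```python
# Ratings copied from the ten sample rows displayed earlier in this notebook.
sample_ratings = [5, 5, 4, 4, 5, 5, 4, 5, 5, 5]


def to_label(rating, threshold=3):
    """Return True for a positive review (rating strictly above the threshold)."""
    return rating > threshold


labels = [to_label(r) for r in sample_ratings]
positive_share = sum(labels) / len(labels)

print(labels)
print(f"share of positive labels in the sample: {positive_share:.2f}")
```

Every one of these ten rows ends up positive, so it is worth keeping the class balance in mind when reading the accuracy figure later on.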

Split the dataset into train, test and validation sets using a 60/20/20 ratio. No seed is passed, so the split differs between runs.

In [4]:
train, test, validation = processedData.randomSplit([0.60, 0.20, 0.20])

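
To make the behaviour of a weighted random split concrete, here is a conceptual sketch in plain Python. It is not Spark's implementation: the function `random_split` and the integer "rows" are assumptions for illustration. Each row gets one uniform draw and lands in the split whose cumulative-weight interval contains that draw, so split sizes are only approximately 60/20/20.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed):
    """Assign each row to exactly one split using a uniform draw in [0, 1)."""
    total = sum(weights)
    # Cumulative upper bounds of each split's interval, roughly [0.6, 0.8, 1.0].
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point round-off in the last bound.
            splits[-1].append(row)
    return splits


rows = list(range(10_000))
train_rows, test_rows, validation_rows = random_split(rows, [0.60, 0.20, 0.20], seed=42)
print(len(train_rows), len(test_rows), len(validation_rows))
```

Fixing the seed, as done here, makes the sketch reproducible; the notebook cell above leaves the seed unset.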



Use `Tokenizer` and `Word2Vec` to generate the features. The featurization pipeline is fit on the training set only.

In [5]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec
tokenizer = Tokenizer(inputCol="text", outputCol="words")
partitions = train.rdd.getNumPartitions()
word2vec = Word2Vec(maxIter=4, seed=42, inputCol="words", outputCol="features",
                    numPartitions=partitions)
textFeaturizer = Pipeline(stages = [tokenizer, word2vec]).fit(train)

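
As a rough picture of what the tokenization step produces, the sketch below lowercases a string and splits it on each single whitespace character. This is an approximation written for illustration (the helper `simple_tokenize` is hypothetical), assuming Spark's `Tokenizer` behaves this way; the input is the visible beginning of the first review.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split on every single whitespace character."""
    return re.split(r"\s", text.lower())


tokens = simple_tokenize("I LOVED THIS BOOK!  This was my first")
print(tokens)
```

Under this assumption punctuation stays attached to the word (`book!`) and a double space yields an empty token. Spark's `RegexTokenizer` is the usual alternative when that matters.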



Transform each of the train, test and validation datasets.

In [6]:
ptrain = textFeaturizer.transform(train).select(["label", "features"])
ptest = textFeaturizer.transform(test).select(["label", "features"])
pvalidation = textFeaturizer.transform(validation).select(["label", "features"])
ptrain.limit(5).toPandas()

   label                                           features
0  False  [0.016259915026960968, 0.028176641930675126, -...
1  False  [0.029078174612228655, -0.0014457819621408088,...
2  False  [0.014679526520410071, 0.056803625366754006, 0...
3  False  [0.053565987438345564, 0.05298001414988763, 0....
4  False  [0.0192281842113328, 0.02463517978851777, -0.0...
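
Each entry in the `features` column is one dense vector per review. A fitted `Word2Vec` model turns a tokenized document into such a vector by averaging the vectors of its words. The sketch below mimics that with a made-up three-dimensional embedding table; the names `toy_embeddings` and `document_vector` are hypothetical, and dividing by the total token count (so that unknown words pull the average toward zero) is an assumption about Spark's behaviour.

```python
# Hypothetical 3-dimensional word vectors, chosen only for illustration.
toy_embeddings = {
    "great": [0.5, 0.25, -0.25],
    "book": [0.25, -0.5, 0.75],
}


def document_vector(tokens, embeddings, dim):
    """Average the word vectors of a document; unknown words add nothing to the sum."""
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    # Assumption: divide by the number of tokens, known or not.
    return [t / len(tokens) for t in total]


features = document_vector(["great", "book"], toy_embeddings, dim=3)
print(features)  # [0.375, -0.125, 0.25]
```

The vector length is fixed by the model (Spark's `vectorSize` parameter, 100 by default) and does not grow with the vocabulary, which is what keeps the feature space small enough for tree-based learners.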

Generate several models with different parameters from the training data.

In [7]:
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
from mmlspark.train import TrainClassifier
import itertools

lrHyperParams       = [0.05, 0.2]
logisticRegressions = [LogisticRegression(regParam = hyperParam)
                       for hyperParam in lrHyperParams]
lrmodels            = [TrainClassifier(model=lrm, labelCol="label").fit(ptrain)
                       for lrm in logisticRegressions]

rfHyperParams       = itertools.product([5, 10], [2, 3])
randomForests       = [RandomForestClassifier(numTrees=hyperParam[0], maxDepth=hyperParam[1])
                       for hyperParam in rfHyperParams]
rfmodels            = [TrainClassifier(model=rfm, labelCol="label").fit(ptrain)
                       for rfm in randomForests]

gbtHyperParams      = itertools.product([8, 16], [2, 3])
gbtclassifiers      = [GBTClassifier(maxBins=hyperParam[0], maxDepth=hyperParam[1])
                       for hyperParam in gbtHyperParams]
gbtmodels           = [TrainClassifier(model=gbt, labelCol="label").fit(ptrain)
                       for gbt in gbtclassifiers]

trainedModels       = lrmodels + rfmodels + gbtmodels

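
The cell above builds its hyperparameter grids with `itertools.product`. The following standalone sketch, which needs no Spark session, enumerates the same grids to show how many candidate models result. The variable names are chosen for this illustration.

```python
import itertools

lr_reg_params = [0.05, 0.2]                          # regParam values
rf_grid = list(itertools.product([5, 10], [2, 3]))   # (numTrees, maxDepth)
gbt_grid = list(itertools.product([8, 16], [2, 3]))  # (maxBins, maxDepth)

print(rf_grid)  # [(5, 2), (5, 3), (10, 2), (10, 3)]
total_models = len(lr_reg_params) + len(rf_grid) + len(gbt_grid)
print(total_models)  # 10

# product() returns a one-shot iterator: a second pass over it yields nothing.
lazy_grid = itertools.product([5, 10], [2, 3])
first_pass = list(lazy_grid)
second_pass = list(lazy_grid)
print(len(first_pass), len(second_pass))  # 4 0
```

Because the iterator is exhausted after one pass, wrap it in `list(...)` if the grid needs to be reused, for example to pair each trained model with its parameters.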



Find the model with the best AUC on the test dataset.

In [8]:
from mmlspark.automl import FindBestModel
bestModel = FindBestModel(evaluationMetric="AUC", models=trainedModels).fit(ptest)
bestModel.getEvaluationResults().show()
bestModel.getBestModelMetrics().show()
bestModel.getAllModelMetrics().show()

+-------------------+--------------------+
|false_positive_rate|  true_positive_rate|
+-------------------+--------------------+
|                0.0|                 0.0|
|                0.0|6.211180124223603E-4|
|                0.0|0.001242236024844...|
|                0.0|0.001863354037267...|
|                0.0|0.002484472049689441|
|                0.0|0.003105590062111801|
|                0.0|0.003726708074534...|
|                0.0|0.004347826086956522|
|                0.0|0.004968944099378882|
|                0.0|0.005590062111801...|
|                0.0|0.006211180124223602|
|                0.0|0.006832298136645...|
|                0.0|0.007453416149068323|
|                0.0|0.008074534161490683|
|                0.0|0.008695652173913044|
|                0.0|0.009316770186335404|
|                0.0|0.009937888198757764|
|                0.0|0.010559006211180125|
|                0.0|0.011180124223602485|
|                0.0|0.011801242236024845|
+-------------------+--------------------+
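
The table printed above lists ROC points (`false_positive_rate`, `true_positive_rate`). As a rough illustration of how an AUC value relates to such points, the sketch below integrates a ROC curve with the trapezoidal rule and picks the curve with the largest area. The curves and model names are invented for the example, and this is not necessarily how `FindBestModel` computes the metric internally.

```python
def auc_trapezoid(roc_points):
    """Approximate the area under a ROC curve given (fpr, tpr) points."""
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Area of the trapezoid between two neighbouring points.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC curves for two candidate models.
toy_curves = {
    "model_a": [(0.0, 0.0), (0.25, 0.5), (1.0, 1.0)],
    "model_b": [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)],
}

aucs = {name: auc_trapezoid(points) for name, points in toy_curves.items()}
best_name = max(aucs, key=aucs.get)
print(aucs)       # {'model_a': 0.625, 'model_b': 0.5}
print(best_name)  # model_a
```

`model_b` lies on the diagonal, which corresponds to random guessing (AUC 0.5); any useful classifier should score above that.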

Get the accuracy and AUC of the best model on the validation dataset.

In [9]:
from mmlspark.train import ComputeModelStatistics
predictions = bestModel.transform(pvalidation)
metrics = ComputeModelStatistics().transform(predictions)
print("Best model's accuracy on validation set = "
      + "{0:.2f}%".format(metrics.first()["accuracy"] * 100))
print("Best model's AUC on validation set = "
      + "{0:.2f}%".format(metrics.first()["AUC"] * 100))

Best model's accuracy on validation set = 82.62%
Best model's AUC on validation set = 85.43%
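
For reference, accuracy is simply the share of rows whose prediction matches the label. The sketch below computes it on a small invented set of labels and predictions, next to the accuracy of always predicting the majority class. None of these numbers come from the notebook's data.

```python
def accuracy(labels, predictions):
    """Return the fraction of predictions that equal the corresponding label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Invented labels and predictions, for illustration only.
toy_labels = [True, True, True, False, True, False, True, False]
toy_predictions = [True, True, False, False, True, True, True, False]

model_accuracy = accuracy(toy_labels, toy_predictions)
positive_share = sum(toy_labels) / len(toy_labels)
majority_baseline = max(positive_share, 1 - positive_share)

print("{0:.2f}%".format(model_accuracy * 100))     # 75.00%
print("{0:.2f}%".format(majority_baseline * 100))  # 62.50%
```

Comparing against the majority-class baseline is a quick sanity check when one class dominates, as positive reviews may here.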

In [10]:
spark.stop()


