## 203 - Hyperparameter Tuning with MMLSpark

MMLSpark supports distributed hyperparameter tuning using a randomized search over a user-defined parameter space.

First, we import the packages:

In [1]:
import pandas as pd





Now let's read the data and split it into tuning and test sets:

In [2]:
data = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/BreastCancer.parquet")
tune, test = data.randomSplit([0.80, 0.20])
tune.limit(10).toPandas()


   Label  Clump_Thickness  ...  Normal_Nucleoli  Mitoses
0      0                1  ...                1        1
1      0                1  ...                1        1
2      0                1  ...                1        1
3      0                1  ...                1        1
4      0                2  ...                1        1
5      0                3  ...                1        1
6      0                3  ...                1        1
7      0                3  ...                1        1
8      0                3  ...                6        1
9      0                4  ...                6        1

[10 rows x 10 columns]

Next, define the models that will be tuned:

In [3]:
from mmlspark.automl import TuneHyperparameters
from mmlspark.train import TrainClassifier
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
logReg = LogisticRegression()
randForest = RandomForestClassifier()
gbt = GBTClassifier()
smlmodels = [logReg, randForest, gbt]
mmlmodels = [TrainClassifier(model=model, labelCol="Label") for model in smlmodels]




We specify the hyperparameter search space using the HyperparamBuilder.
Each entry is either a DiscreteHyperParam (a fixed list of candidate values) or a RangeHyperParam (a lower and an upper bound).
TuneHyperparameters then randomly chooses values for each entry from a uniform distribution.

In [4]:
from mmlspark.automl import *

paramBuilder = \
  HyperparamBuilder() \
    .addHyperparam(logReg, logReg.regParam, RangeHyperParam(0.1, 0.3)) \
    .addHyperparam(randForest, randForest.numTrees, DiscreteHyperParam([5,10])) \
    .addHyperparam(randForest, randForest.maxDepth, DiscreteHyperParam([3,5])) \
    .addHyperparam(gbt, gbt.maxBins, RangeHyperParam(8,16)) \
    .addHyperparam(gbt, gbt.maxDepth, DiscreteHyperParam([3,5]))
searchSpace = paramBuilder.build()
# The search space pairs each param with a tuple of (estimator, hyperparam)
print(searchSpace)
randomSpace = RandomSpace(searchSpace)


dict_items([(Param(parent='LogisticRegression_93ca77483e6f', name='regParam', doc='regularization parameter (>= 0).'), (LogisticRegression_93ca77483e6f, <mmlspark.automl.HyperparamBuilder.RangeHyperParam object at 0x7f816b5b3780>)), (Param(parent='RandomForestClassifier_f25f794c99fc', name='numTrees', doc='Number of trees to train (>= 1).'), (RandomForestClassifier_f25f794c99fc, <mmlspark.automl.HyperparamBuilder.DiscreteHyperParam object at 0x7f816b5a6f28>)), (Param(parent='RandomForestClassifier_f25f794c99fc', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'), (RandomForestClassifier_f25f794c99fc, <mmlspark.automl.HyperparamBuilder.DiscreteHyperParam object at 0x7f816b5b37b8>)), (Param(parent='GBTClassifier_a66ae16ec1d2', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'), (GBTClassifier_a66ae16ec1d2,
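
To make the sampling step concrete, here is a minimal pure-Python sketch of drawing random configurations from a space of discrete and range entries. It is not MMLSpark's implementation: `RangeSketch`, `DiscreteSketch`, and `sample_config` are hypothetical names invented for illustration, and treating integer bounds as an inclusive integer range is an assumption of this sketch.

```python
import random


class RangeSketch:
    """Hypothetical stand-in for a range entry: uniform draw between low and high."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def sample(self, rng):
        # Assumption of this sketch: integer bounds give an inclusive integer draw,
        # any other bounds give a float draw.
        if isinstance(self.low, int) and isinstance(self.high, int):
            return rng.randint(self.low, self.high)
        return rng.uniform(self.low, self.high)


class DiscreteSketch:
    """Hypothetical stand-in for a discrete entry: uniform draw from a fixed list."""

    def __init__(self, values):
        self.values = list(values)

    def sample(self, rng):
        return rng.choice(self.values)


def sample_config(space, rng):
    """Draw one value per hyperparameter name."""
    return {name: dist.sample(rng) for name, dist in space.items()}


# Entries mirror the kinds of values used in the notebook cell above.
space = {
    "regParam": RangeSketch(0.1, 0.3),
    "numTrees": DiscreteSketch([5, 10]),
    "maxBins": RangeSketch(8, 16),
}

rng = random.Random(0)  # fixed seed so repeated runs draw the same configurations
configs = [sample_config(space, rng) for _ in range(3)]
for config in configs:
    print(config)
```

Each call to `sample_config` yields one candidate configuration; a random search evaluates a fixed number of such candidates instead of enumerating every combination.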

Next, run TuneHyperparameters to get the best model.

In [5]:
bestModel = TuneHyperparameters(
              evaluationMetric="accuracy", models=mmlmodels, numFolds=2,
              numRuns=len(mmlmodels) * 2, parallelism=1,
              paramSpace=randomSpace.space(), seed=0).fit(tune)
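
The general idea behind a call like this, a fixed number of randomly drawn candidates each scored by k-fold cross-validation, can be sketched in plain Python. This is an illustrative toy under stated assumptions, not what TuneHyperparameters does internally: the midpoint classifier, the `shift` hyperparameter, and every function name below are hypothetical.

```python
import random


def k_fold_indices(n, num_folds):
    """Yield (train_idx, valid_idx) pairs using contiguous folds over range(n)."""
    sizes = [n // num_folds + (1 if i < n % num_folds else 0) for i in range(num_folds)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, valid
        start += size


def fit_threshold(xs, ys, shift):
    """Toy model: midpoint between the two class means, moved by the hyperparameter `shift`."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2 + shift


def accuracy(threshold, xs, ys):
    """Fraction of points whose thresholded prediction matches the label."""
    correct = sum(1 for x, y in zip(xs, ys) if (1 if x > threshold else 0) == y)
    return correct / len(xs)


def random_search(xs, ys, num_runs, num_folds, seed):
    """Return (best_shift, best_mean_accuracy) over num_runs random candidates."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_runs):
        shift = rng.uniform(-6.0, 6.0)  # one randomly drawn candidate
        scores = []
        for train, valid in k_fold_indices(len(xs), num_folds):
            threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train], shift)
            scores.append(accuracy(threshold, [xs[i] for i in valid], [ys[i] for i in valid]))
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[1]:
            best = (shift, mean_score)
    return best


# Toy data, interleaved so every training fold contains both classes.
xs = [0.0, 10.0, 1.0, 11.0, 2.0, 12.0, 3.0, 13.0]
ys = [0, 1, 0, 1, 0, 1, 0, 1]

best_shift, best_score = random_search(xs, ys, num_runs=6, num_folds=2, seed=0)
print(best_shift, best_score)
```

In this sketch the cost grows with `num_runs * num_folds` model fits, which is why the notebook keeps both values small for a quick demonstration.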




We can view the best model's parameters and retrieve the underlying best model pipeline:

In [6]:
print(bestModel.getBestModelInfo())
print(bestModel.getBestModel())


cacheNodeIds: false, checkpointInterval: 10, featureSubsetStrategy: auto, featuresCol: TrainClassifier_e1e873f4db55_features, impurity: gini, labelCol: Label, maxBins: 32, maxDepth: 5, maxMemoryInMB: 256, minInfoGain: 0.0, minInstancesPerNode: 1, numTrees: 10, predictionCol: prediction, probabilityCol: probability, rawPredictionCol: rawPrediction, seed: -5387697053847413545, subsamplingRate: 1.0
TrainClassifier_0b104efd2b1e

We can score against the test set and view metrics.

In [7]:
from mmlspark.train import ComputeModelStatistics
prediction = bestModel.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()


  evaluation_type  ...       AUC
0  Classification  ...  0.986696

[1 rows x 6 columns]
  Unsupported type in conversion to Arrow: MatrixUDT
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
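
The output above elides most of the metric columns. As a reminder of how common binary classification metrics are derived from labels and predictions, here is a small pure-Python sketch. It makes no claim about which columns ComputeModelStatistics returns or how it computes them; the function names and the sample data are hypothetical.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for 0/1 labels and predictions."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            tp += 1
        elif y == 0 and p == 1:
            fp += 1
        elif y == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(labels, predictions):
    """Derive accuracy, precision, and recall from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(labels, predictions)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when no positives are predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels and predictions, for illustration only.
labels = [0, 0, 1, 1, 1, 0]
predictions = [0, 1, 1, 1, 0, 0]

metrics_sketch = classification_metrics(labels, predictions)
print(metrics_sketch)
```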

In [8]:
spark.stop()


