## 203 - Hyperparameter Tuning with MMLSpark

MMLSpark supports distributed hyperparameter tuning through a randomized search over a grid of candidate values.

First, we import the packages:

In [None]:
import pandas as pd
import mmlspark
from pyspark.sql.types import IntegerType, StringType, FloatType, StructType, StructField

Now let's read the data and split it into tuning and test sets:

In [None]:
dataFilePath = "BreastCancer.csv"
textSchema = StructType([StructField("Label", IntegerType(), False),
                         StructField("Clump Thickness", IntegerType(), False),
                         StructField("Uniformity of Cell Size", IntegerType(), False),
                         StructField("Uniformity of Cell Shape", IntegerType(), False),
                         StructField("Marginal Adhesion", IntegerType(), False),
                         StructField("Single Epithelial Cell Size", IntegerType(), False),
                         StructField("Bare Nuclei", FloatType(), False),
                         StructField("Bland Chromatin", IntegerType(), False),
                         StructField("Normal Nucleoli", IntegerType(), False),
                         StructField("Mitoses", IntegerType(), False),])
import os, urllib.request  # urlretrieve lives in the urllib.request submodule, which must be imported explicitly
if not os.path.isfile(dataFilePath):
    urllib.request.urlretrieve("https://mmlspark.azureedge.net/datasets/" + dataFilePath, dataFilePath)
data = spark.createDataFrame(pd.read_csv(dataFilePath, sep=",", header=0, na_values="?"), textSchema)
tune, test = data.randomSplit([0.80, 0.20])
tune.limit(10).toPandas()

Next, define the models that will be tuned:

In [None]:
from mmlspark import TuneHyperparameters
from mmlspark.TrainClassifier import TrainClassifier
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
logReg = LogisticRegression()
randForest = RandomForestClassifier()
gbt = GBTClassifier()
smlmodels = [logReg, randForest, gbt]
mmlmodels = [TrainClassifier(model=model, labelCol="Label") for model in smlmodels]

We specify the hyperparameter search space with the HyperparamBuilder.
Each entry is either a DiscreteHyperParam (a list of candidate values) or a RangeHyperParam (a numeric range).
TuneHyperparameters then randomly chooses values for each entry from a uniform distribution.

In [None]:
from mmlspark import HyperparamBuilder
from mmlspark import RangeHyperParam
from mmlspark import DiscreteHyperParam
from mmlspark import RandomSpace
paramBuilder = \
  HyperparamBuilder() \
    .addHyperparam(logReg, logReg.regParam, RangeHyperParam(0.1, 0.3, isDouble=True)) \
    .addHyperparam(randForest, randForest.numTrees, DiscreteHyperParam([5,10])) \
    .addHyperparam(randForest, randForest.maxDepth, DiscreteHyperParam([3,5])) \
    .addHyperparam(gbt, gbt.maxBins, RangeHyperParam(8,16)) \
    .addHyperparam(gbt, gbt.maxDepth, DiscreteHyperParam([3,5]))
randomSpace = RandomSpace(paramBuilder.build())
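
To make the sampling behaviour concrete, here is a minimal plain-Python sketch of drawing random configurations from a space shaped like the one above. The classes `DiscreteChoice` and `UniformRange` and the function `sample_configuration` are hypothetical stand-ins written for illustration, not MMLSpark APIs. The integer range is assumed to include both endpoints, which may differ from how RangeHyperParam treats its upper bound.

```python
import random


class DiscreteChoice:
    """Hypothetical stand-in: pick one of the listed values uniformly."""

    def __init__(self, values):
        self.values = list(values)

    def sample(self, rng):
        return rng.choice(self.values)


class UniformRange:
    """Hypothetical stand-in: pick a value uniformly between low and high."""

    def __init__(self, low, high, is_double=False):
        self.low = low
        self.high = high
        self.is_double = is_double

    def sample(self, rng):
        if self.is_double:
            return rng.uniform(self.low, self.high)
        # Assumption: both endpoints are included for integer ranges.
        return rng.randint(self.low, self.high)


# Mirror of the search space built in the cell above, keyed by (model, parameter).
space = {
    ("logReg", "regParam"): UniformRange(0.1, 0.3, is_double=True),
    ("randForest", "numTrees"): DiscreteChoice([5, 10]),
    ("randForest", "maxDepth"): DiscreteChoice([3, 5]),
    ("gbt", "maxBins"): UniformRange(8, 16),
    ("gbt", "maxDepth"): DiscreteChoice([3, 5]),
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter entry."""
    return {key: dist.sample(rng) for key, dist in space.items()}


rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [sample_configuration(space, rng) for _ in range(3)]
for s in samples:
    print(s)
```

Each printed dictionary is one candidate configuration. A tuner would keep only the entries that belong to the model being fitted in a given run.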

Next, run TuneHyperparameters to get the best model.

In [None]:
bestModel = TuneHyperparameters(
              evaluationMetric="accuracy", models=mmlmodels, numFolds=2,
              numRuns=len(mmlmodels) * 2, parallelism=1,
              paramSpace=randomSpace.space(), seed=0).fit(tune)
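
The combination of `numRuns` candidates and `numFolds` cross-validation folds can be pictured with a small plain-Python sketch. It is not MMLSpark's implementation: `k_fold_indices`, `random_search`, and the toy threshold "model" are hypothetical names invented for illustration, and the toy scorer does no real training. The sketch only shows the general pattern of scoring each candidate on every fold and keeping the one with the best mean score.

```python
import random


def k_fold_indices(n_rows, num_folds, rng):
    """Yield (train, valid) row-index lists for each fold."""
    idx = list(range(n_rows))
    rng.shuffle(idx)
    folds = [idx[i::num_folds] for i in range(num_folds)]
    for i in range(num_folds):
        valid = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, valid


def random_search(candidates, score_fn, n_rows, num_folds, seed):
    """Return (best_candidate, mean_score) over k-fold cross-validation."""
    rng = random.Random(seed)
    best = None
    for cand in candidates:
        scores = [score_fn(cand, train, valid)
                  for train, valid in k_fold_indices(n_rows, num_folds, rng)]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[1]:
            best = (cand, mean)
    return best


# Toy data: the label is 1 exactly when x > 5.
xs = list(range(10))
ys = [1 if x > 5 else 0 for x in xs]
calls = []  # records every (candidate, fold) evaluation


def threshold_accuracy(threshold, train, valid):
    """Toy scorer: accuracy of predicting x > threshold on the validation rows.
    A real model would be fitted on the train rows first."""
    calls.append(threshold)
    correct = sum(1 for i in valid if (1 if xs[i] > threshold else 0) == ys[i])
    return correct / len(valid)


best = random_search([2, 5, 8], threshold_accuracy, len(xs), num_folds=2, seed=0)
print(best)        # best candidate and its mean validation accuracy
print(len(calls))  # number of candidate-fold evaluations performed
```

With three candidates and two folds the sketch performs six evaluations. By the same arithmetic, the call above with `numRuns=len(mmlmodels) * 2` and `numFolds=2` would evaluate six candidates on two folds each, assuming TuneHyperparameters follows this pattern.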

We can view the best model's parameters and retrieve the underlying best model pipeline:

In [None]:
print(bestModel.getBestModelInfo())
print(bestModel.getBestModel())

We can score against the test set and view metrics.

In [None]:
from mmlspark import ComputeModelStatistics
prediction = bestModel.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()
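
For readers who want to see what metrics such as accuracy mean in terms of raw predictions, the following plain-Python sketch computes a few of them from label and prediction lists. The function `binary_classification_metrics` and the sample data are hypothetical and made up for illustration. The sketch does not reproduce the columns or the implementation of ComputeModelStatistics.

```python
def binary_classification_metrics(labels, predictions):
    """Compute accuracy, precision and recall for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels and predictions, for illustration only.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
result = binary_classification_metrics(labels, predictions)
print(result)
```

Accuracy is the metric that `evaluationMetric="accuracy"` asked the tuner to optimize, so it is the natural number to compare between the tuning folds and the held-out test set.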