<img src="../ucsb_logo_seal.png"> 

## ML Model Selection and Tuning

### PSTAT 135 / 235: Big Data Analytics
### University of California, Santa Barbara
### Last Updated: Sep 4, 2019


---  


**Sources:**  
Learning Spark, Chapter 11: Machine Learning with MLlib  
https://spark.apache.org/docs/2.1.1/ml-tuning.html  



### OBJECTIVES
- Discuss cross validation  
- Discuss hyperparameter tuning  
- Discuss model evaluation  


### CONCEPTS

- Data Splitting  
- Train/Validation/Test sets  
- K-Fold Cross Validation  
- CrossValidator  


---

**Model Tuning**

Oftentimes, a model will include hyperparameters that need to be tuned for optimal performance.  

We have seen many examples, such as the cost parameter in the support vector machine and the regularization parameter in L2-regularized regression.  

The optimal value of the hyperparameter cannot be determined in advance, as it depends on the data.  

Before a model is trained on data, a plan should be made for *data splitting*.  The purpose of the data splitting step is to accomplish the following:  


**1. Model Performance Evaluation**  
Set aside a fraction of the data which has not been used for training or tuning.  This test set will be used to evaluate the performance of the model.  If the same data used in training/tuning is also used for evaluation, the results will be too optimistic.  

**2. Training and Tuning**  
After setting aside the test set, the remaining data will be used for training and tuning.  This train/validation data is often applied in a k-fold cross validation (cv) procedure.  We outline an example cv procedure below.  Typical values for $k$ (the number of folds) are 5 and 10.  

The fractions used in the train/validation/test sets will vary depending on factors including the size of the dataset.  

Additionally, some users may use more elaborate splitting schemes (e.g., extra validation sets or test sets), depending on the specific problem.  

**Example: 5-Fold Cross Validation with a Separate Test Set**  
The Training/Validation Sets are 80% of the data; each of the 5 folds is 16% of the data (80% / 5).  
The Test Set is 20% of the data.
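
The split described above can be sketched in plain Python. This is a minimal illustration only, not Spark's implementation: the helper name `split_with_test_set` and the choice of 100 rows are assumptions made for the example.

```python
import random

def split_with_test_set(n_rows, test_fraction=0.2, k=5, seed=0):
    """Shuffle row indices, hold out a test set, and deal the rest into k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    test_indices = indices[:n_test]
    train_val = indices[n_test:]
    # Deal the remaining rows into k folds of (nearly) equal size.
    folds = [train_val[i::k] for i in range(k)]
    return folds, test_indices

folds, test_indices = split_with_test_set(n_rows=100)
print("test rows:", len(test_indices))
print("rows per fold:", [len(fold) for fold in folds])

# Each fold takes one turn as the validation set; the other folds are used for training.
for i, validation in enumerate(folds):
    training = [row for j, fold in enumerate(folds) if j != i for row in fold]
    print("fold", i, "train rows:", len(training), "validation rows:", len(validation))
```

The test rows never appear in any fold, so they stay untouched until the final evaluation.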



**Cross Validation Illustration**  
<img src="cross_validation_img.png">  

**Spark Implementation of Model Tuning**  

First some quick definitions:  


- `ParamMaps`: parameters to choose from, aka parameter grid
- `Estimator`: algorithm or `Pipeline` to tune
- `Evaluator`: metric to measure how well a fitted `Model` does on held-out test data

Tuning can be done on models or pipelines.
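
To make the idea of a parameter grid concrete, here is a small plain-Python sketch that enumerates every combination of candidate values. The parameter names and values are illustrative assumptions; in Spark, `ParamGridBuilder` produces the equivalent list of `ParamMaps`.

```python
from itertools import product

# Hypothetical candidate values for two hyperparameters.
grid_values = {
    "numFeatures": [10, 100, 1000],
    "regParam": [0.1, 0.01],
}

# The grid is the Cartesian product of the candidate values.
names = list(grid_values)
param_maps = [dict(zip(names, combo))
              for combo in product(*(grid_values[name] for name in names))]

for param_map in param_maps:
    print(param_map)
print("number of combinations:", len(param_maps))
```

The grid size is the product of the list lengths (here 3 x 2), so adding one more hyperparameter with several values multiplies the tuning cost.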

**Important Note:**  
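What the Spark documentation calls the "test" dataset within each (training, test) pair is what we call the validation set above; it is not the separate test set held out for final evaluation.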

**Methods available for model selection:**  

`CrossValidator`

`TrainValidationSplit`


From https://spark.apache.org/docs/2.1.1/ml-tuning.html

Spark `CrossValidator`

`CrossValidator` begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular `ParamMap`, `CrossValidator` computes the average evaluation metric for the 3 Models produced by fitting the `Estimator` on the 3 different (training, test) dataset pairs.
After identifying the best `ParamMap`, `CrossValidator` finally re-fits the `Estimator` using the best `ParamMap` and the entire dataset.
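
The procedure in the quoted paragraph can be sketched without Spark. The code below is a simplified illustration under stated assumptions: the model is a one-parameter ridge fit of $y \approx w x$, the data are synthetic, and the helper names (`fit_ridge_slope`, `cross_validate`) are made up for this example. The metric is mean squared error, so the smallest average wins; Spark evaluators report the direction through `isLargerBetter()`.

```python
import random

def fit_ridge_slope(points, reg_param):
    """Closed-form ridge fit of y ~ w * x: w = sum(x*y) / (sum(x^2) + reg_param)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + reg_param)

def mean_squared_error(w, points):
    return sum((y - w * x) ** 2 for x, y in points) / len(points)

def cross_validate(points, reg_params, k):
    """Average the validation error over k folds for each candidate, then re-fit."""
    folds = [points[i::k] for i in range(k)]
    avg_metrics = []
    for reg_param in reg_params:
        fold_errors = []
        for i in range(k):
            validation = folds[i]
            training = [p for j, fold in enumerate(folds) if j != i for p in fold]
            w = fit_ridge_slope(training, reg_param)
            fold_errors.append(mean_squared_error(w, validation))
        avg_metrics.append(sum(fold_errors) / k)
    best_index = min(range(len(reg_params)), key=lambda i: avg_metrics[i])
    best_reg_param = reg_params[best_index]
    # Final step: re-fit with the best candidate on the entire dataset.
    final_w = fit_ridge_slope(points, best_reg_param)
    return best_reg_param, avg_metrics, final_w

# Synthetic data: y = 3x plus Gaussian noise (an assumption for the demo).
rng = random.Random(0)
points = [(i / 10, 3.0 * (i / 10) + rng.gauss(0.0, 1.0)) for i in range(1, 31)]

best_reg_param, avg_metrics, final_w = cross_validate(
    points, reg_params=[0.01, 0.1, 1.0, 10.0], k=3)
print("average MSE per candidate:", avg_metrics)
print("best regParam:", best_reg_param, "final slope:", final_w)
```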


**`CrossValidator` Example**

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

print('training: {}'.format(training))
print('type(training): {}'.format(type(training)))

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# The grid has 3 x 2 = 6 parameter combinations.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

print('len(paramGrid): {}'.format(len(paramGrid)))

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")

for row in selected.collect():
    print(row)
```
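
After fitting, it is often useful to see how each parameter combination scored. The sketch below pairs each `ParamMap` with its average metric and sorts them. It assumes the fitted `CrossValidatorModel` exposes an `avgMetrics` list in the same order as the parameter grid (true in the PySpark versions I am aware of, but worth checking for yours). The helper name `rank_param_maps` and the demo numbers are made up so the sketch runs without Spark.

```python
def rank_param_maps(param_maps, avg_metrics, larger_is_better=True):
    """Pair each parameter map with its average metric, best first."""
    rows = []
    for param_map, metric in zip(param_maps, avg_metrics):
        # PySpark ParamMap keys are Param objects with a .name attribute;
        # plain string keys (as in the demo below) fall back to str().
        settings = {getattr(param, "name", str(param)): value
                    for param, value in param_map.items()}
        rows.append((metric, settings))
    return sorted(rows, key=lambda row: row[0], reverse=larger_is_better)

# Stand-in values so the sketch runs without Spark; the metrics are made up.
demo_param_maps = [
    {"numFeatures": 10, "regParam": 0.1},
    {"numFeatures": 10, "regParam": 0.01},
    {"numFeatures": 100, "regParam": 0.1},
]
demo_avg_metrics = [0.70, 0.65, 0.90]

ranked = rank_param_maps(demo_param_maps, demo_avg_metrics)
for metric, settings in ranked:
    print(metric, settings)

# With the fitted objects from a CrossValidator run, the call would be:
# rank_param_maps(paramGrid, cvModel.avgMetrics)
```

The default metric of `BinaryClassificationEvaluator` is area under the ROC curve, where larger is better, hence the default `larger_is_better=True`.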

**IMPORTANT NOTE**  
The following call resulted in a Hive error:  
`spark.createDataFrame([`  

To work around it, I edited the `shell.py` file in  
`C:\spark\spark-2.2.0-bin-hadoop2.7\python\pyspark`  

and commented out `enableHiveSupport()` in:  

        spark = SparkSession.builder\
            .enableHiveSupport()\
            .getOrCreate()

**`TrainValidationSplit`**  

This method performs only one split (unlike the $k$ splits of `CrossValidator`).  
Advantage: runtime is lower, since each parameter combination is trained and evaluated only once rather than $k$ times.  
Disadvantage: results may not be as reliable out-of-sample if the training dataset isn't sufficiently large.  
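
A rough count of model fits makes the runtime difference concrete. This is a back-of-the-envelope sketch: the function names are hypothetical, and the count assumes one fit per parameter combination per split plus one final re-fit of the best combination on the full data.

```python
def fits_cross_validator(num_param_maps, num_folds):
    # One fit per (parameter combination, fold), plus the final re-fit.
    return num_param_maps * num_folds + 1

def fits_train_validation_split(num_param_maps):
    # One fit per parameter combination, plus the final re-fit.
    return num_param_maps + 1

for k in (2, 5, 10):
    print("k =", k,
          "CrossValidator fits:", fits_cross_validator(6, k),
          "TrainValidationSplit fits:", fits_train_validation_split(6))
```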

Method takes parameter `trainRatio`  

For example, with `trainRatio` = 0.6, the train/validation sets will be 60%/40% of the data, respectively.  

For an example:  
https://spark.apache.org/docs/2.1.1/ml-tuning.html
