# Lesson 31 - Cross-Validation

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

from pyspark.ml.feature import VectorAssembler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression 
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

spark = SparkSession.builder.getOrCreate()

## Train/Test Split
In the previous lesson, we introduced the train/test split technique for estimating a model's out-of-sample performance. Using this approach, we first split the labeled data into two sets, called the training set and the test set. We train our model on the training set, and then score it on the test set. The score calculated on the test set provides us with an estimate of how the model will perform on out-of-sample data. 

This approach is used often in practice, but it does have certain flaws. The most serious is that an estimate generated in this way can depend heavily on exactly which observations were randomly selected to compose the test set. As a result, estimates created using this approach can have high variance. 

Let's illustrate this idea with an example. Suppose that two data scientists are working with the same dataset. They each decide to randomly set aside 20% of the observations in the dataset for testing and will then train a model on the remaining 80% of the data. The type of model used by the two data scientists will be exactly the same, but the test sets that they each randomly select will be different. They both score their trained models on their own version of the test set. Suppose that one data scientist reports a test set accuracy of 85%, while the other data scientist reports a test set accuracy of 67%. These test scores provide VERY different estimates for how well the model will perform on out-of-sample data. 

Obviously, the test accuracies in this example were manufactured to be far apart for the purpose of making a point. But it is not unreasonable to see differences this large, or even larger, in test scores calculated on two different test sets. It should be pointed out, however, that these estimates tend to be more stable when working with very large datasets. 
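The effect described above can be sketched with a small simulation. The code below is only an illustration under stated assumptions: the toy dataset, the midpoint-threshold "model", and the helper names `make_data`, `fit_threshold`, `accuracy`, and `split_score` are all hypothetical and are not part of this lesson's dataset or of PySpark. It scores the same type of model on ten different random 80/20 splits of one small dataset.

```python
import random

def make_data(n, seed):
    """Generate toy (x, y) rows where the label y is a noisy function of x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 1 if x + rng.gauss(0, 1) > 0 else 0
        rows.append((x, y))
    return rows

def fit_threshold(train):
    """Toy model: a threshold halfway between the mean x of each class."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(threshold, rows):
    """Fraction of rows whose label matches the prediction x > threshold."""
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)

def split_score(data, split_seed, test_frac=0.2):
    """Shuffle with the given seed, hold out test_frac of rows, return test accuracy."""
    rows = data[:]
    random.Random(split_seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    test, train = rows[:n_test], rows[n_test:]
    return accuracy(fit_threshold(train), test)

data = make_data(100, seed=0)

# Same data, same type of model; only the random test set changes.
scores = [split_score(data, split_seed=s) for s in range(10)]

print('Test accuracies:', [round(s, 2) for s in scores])
print('Lowest:', min(scores), ' Highest:', max(scores))
```

With only 20 observations in each test set, every misclassified row moves the accuracy by 5 percentage points, so the spread between the lowest and highest score can be substantial.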

We will now introduce a more sophisticated approach for estimating a model's out-of-sample performance.

## K-Fold Cross-Validation
K-fold cross-validation is a popular technique for estimating a model's out-of-sample performance. It is similar to, but more sophisticated than, the train/test split approach. The process for scoring a model using K-fold cross-validation is detailed below.

1. We start by randomly splitting the labeled data into K roughly-equal-sized pieces called **folds**.
2. We then train K versions of the model, each using the same set of hyperparameters.
3. Each version of the model is trained on K-1 folds and scored on the remaining fold. That is to say, we estimate the out-of-sample score for each model using the single fold on which that particular model was not trained. Each fold is thus used as the test set for exactly one model.
4. This will result in K out-of-sample estimates. We average these scores together and report that as the cross-validation score for the model.
5. We then retrain the model on the entire collection of labeled data that is available to us.

Since the cross-validation score is calculated as the average of several out-of-sample estimates, it tends to provide a more stable estimate of the model's out-of-sample performance than that obtained using a single test set.
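The five steps above can be sketched in plain Python. This is a minimal illustration, not how Spark implements cross-validation: the toy data, the threshold "model", and the function names `k_fold_indices`, `fit`, `accuracy`, and `cross_val_score` are assumptions made for this example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Step 1: shuffle the row indices and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit(train):
    """Toy model: a threshold halfway between the mean x of each class."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(threshold, rows):
    """Fraction of rows whose label matches the prediction x > threshold."""
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)

def cross_val_score(rows, k, seed=0):
    folds = k_fold_indices(len(rows), k, seed)
    fold_scores = []
    for held_out in folds:
        # Steps 2-3: train on the other k-1 folds, score on the held-out fold.
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        test = [rows[i] for i in held_out]
        fold_scores.append(accuracy(fit(train), test))
    # Step 4: the cross-validation score is the average of the k fold scores.
    return sum(fold_scores) / k, fold_scores

# Hypothetical toy data: the label is a noisy function of a single feature.
rng = random.Random(1)
rows = []
for _ in range(200):
    x = rng.gauss(0, 1)
    rows.append((x, 1 if x + rng.gauss(0, 1) > 0 else 0))

cv_score, fold_scores = cross_val_score(rows, k=5)

# Step 5: retrain on all of the labeled data.
final_model = fit(rows)

print('Fold scores:', [round(s, 3) for s in fold_scores])
print('CV score:   ', round(cv_score, 3))
```

Each row lands in exactly one fold, so every observation is used for testing exactly once and for training K-1 times.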

Common values for K are 3, 5, 10, and n, where n is the number of labeled observations. When K = n, each fold contains a single observation, and we refer to the technique as leave-one-out cross-validation, or LOOCV.
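To make the effect of K concrete, the short sketch below computes the fold sizes produced by different values of K. The helper `fold_sizes` and the sample size of 23 are hypothetical, chosen only so that the folds do not divide evenly.

```python
def fold_sizes(n, k):
    """Sizes of k folds covering n observations, differing by at most one."""
    base, extra = divmod(n, k)
    return [base + 1] * extra + [base] * (k - extra)

n = 23  # hypothetical number of labeled observations

for k in (3, 5, 10, n):
    sizes = fold_sizes(n, k)
    # Each model trains on everything except its own held-out fold.
    print(f'K={k:>2}  fold sizes={sizes}  models trained={k}')
```

As K grows, each model trains on a larger share of the data, but more models must be fit. With K = n, every fold has size 1 and n models are trained.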

The process of performing K-Fold cross-validation is illustrated in the figure below.

![K-Fold CV](https://drbeane.github.io/files/images/417/kfold_cv.jpeg)

## Load and Explore Data

To demonstrate the use of cross-validation to estimate out-of-sample performance, we will return to the synthetic dataset introduced in the previous lesson. Recall that the contents of this dataset have been split into two equal-sized parts, which are stored in separate CSV files. We will load these datasets into DataFrames named `df1` and `df2` and will then combine these DataFrames into a single DataFrame named `df`.

In [0]:
df_schema = (
    'c01 DOUBLE, c02 DOUBLE, c03 DOUBLE, c04 DOUBLE, c05 DOUBLE, '
    'c06 DOUBLE, c07 DOUBLE, c08 DOUBLE, c09 DOUBLE, c10 DOUBLE, '
    'c11 STRING, c12 STRING, c13 STRING, c14 STRING, c15 STRING, '
    'c16 STRING, c17 STRING, c18 STRING, c19 STRING, c20 STRING, '
    'label INTEGER'
)

df1 = (
    spark.read
    .option('delimiter', ',')
    .option('header', True)
    .schema(df_schema)
    .csv('/FileStore/tables/synthetic_data_1.csv')
)

df2 = (
    spark.read
    .option('delimiter', ',')
    .option('header', True)
    .schema(df_schema)
    .csv('/FileStore/tables/synthetic_data_2.csv')
)

df = df1.union(df2)

df.printSchema()

In [0]:
df.show(5)

In [0]:
N = df.count()

print(N)

## Distribution of Label Values

We will now determine the distribution of label values in the dataset.

In [0]:
(df.select('label')
   .groupby('label')
   .agg(
       expr('COUNT(*) as count'),
       expr(f'ROUND(COUNT(*)/{N},4) as prop')
   )
   .show()
)

## Identify Numerical and Categorical Features

The first 10 columns of our DataFrame represent numerical features and the next 10 columns represent categorical features. The last column represents the label.

In [0]:
num_features = df.columns[:10]
cat_features = df.columns[10:-1]

## Define Pipeline Stages

We will now create several stages to perform processing and modeling tasks on our dataset. We will also create an evaluator to calculate the accuracy score for our model.

In [0]:
ix_features = [c + '_ix' for c in cat_features]
vec_features = [c + '_vec' for c in cat_features]

feature_indexer = StringIndexer(inputCols=cat_features, outputCols=ix_features)

encoder = OneHotEncoder(inputCols=ix_features, outputCols=vec_features, dropLast=False)

assembler = VectorAssembler(inputCols=num_features + vec_features, outputCol='features')

logreg = LogisticRegression(featuresCol='features', labelCol='label')

In [0]:
accuracy_eval = MulticlassClassificationEvaluator(
    predictionCol='prediction', labelCol='label', metricName='accuracy')

## Preprocessing

When using K-fold cross-validation, our model will be fit K+1 times: once for each of the K folds and once more on the full dataset. We could combine all of the stages for preprocessing and modeling into a pipeline and use that in our cross-validation process, but then each one of these stages would have to be fit each of the K+1 times. This would be inefficient and time-consuming. Instead, we will create a pipeline that performs only the preprocessing tasks and will leave the `LogisticRegression` object out of the pipeline. We will then apply the preprocessing pipeline to the data, and will fit only the `LogisticRegression` when performing cross-validation.
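The saving can be estimated with a rough count of fit calls. The sketch below is simple arithmetic under stated assumptions: the helper `estimator_fits` is hypothetical, and it assumes that two of our preprocessing stages (`StringIndexer` and `OneHotEncoder`) must be fit to data, while `VectorAssembler` is a plain transformer that requires no fitting. It counts fit calls only and ignores the cost of transforming the data.

```python
def estimator_fits(k, n_pre_estimators, preprocess_inside_cv):
    """Count estimator fit calls for K-fold CV plus the final refit."""
    model_fits = k + 1  # one fit per fold, plus one on the full dataset
    if preprocess_inside_cv:
        # Every preprocessing estimator is refit alongside the model each time.
        return (n_pre_estimators + 1) * model_fits
    # Preprocessing estimators are fit once, up front.
    return n_pre_estimators + model_fits

K = 10
inside = estimator_fits(K, n_pre_estimators=2, preprocess_inside_cv=True)
outside = estimator_fits(K, n_pre_estimators=2, preprocess_inside_cv=False)

print('Fits with preprocessing inside CV: ', inside)
print('Fits with preprocessing done first:', outside)
```

Under these assumptions, 10-fold cross-validation requires 33 fit calls when the full pipeline is cross-validated, compared with 13 when preprocessing is applied once beforehand.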

In [0]:
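# Fit the preprocessing stages once, on the full dataset.
pre_pipeline = Pipeline(stages=[feature_indexer, encoder, assembler]).fit(df)

# Apply the fitted stages and cache the result, since it is reused by every model fit below.
train = pre_pipeline.transform(df)
train.persist()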

train.select('features').show(5, truncate=False)

## Train Model 

Before using cross-validation to estimate out-of-sample performance, we will first train the logistic regression model on the entire dataset and will calculate the training accuracy. We do this so that we can compare the results.

In [0]:
logreg_model = logreg.fit(train)
pred = logreg_model.transform(train)

score = accuracy_eval.evaluate(pred)

print('Training Score:', score)

## Cross-Validation

We will now use the `CrossValidator` class to perform cross-validation. Notice that when creating the `CrossValidator` instance, we specify values for 5 parameters. 

- The `estimator` parameter represents the model being fit and then evaluated on each of the K folds.
- The `estimatorParamMaps` parameter will be discussed in future lessons. For now, we will simply set it to a list containing an empty dictionary. 
- The `evaluator` parameter represents the `MulticlassClassificationEvaluator` used to score each version of the model. 
- The `numFolds` parameter specifies the number of folds that the data is being split into. 
- The `parallelism` parameter specifies the number of threads to use when performing cross-validation.

After creating an instance of the `CrossValidator` class, we must then fit it to the training data. This will return an object of type `CrossValidatorModel`. This new object will contain an `avgMetrics` attribute, which is a list containing one cross-validation score for each entry in `estimatorParamMaps`. Since we supplied a single (empty) parameter map, the list has one element: the cross-validation estimate for the model's out-of-sample performance.

In [0]:
cv = CrossValidator(estimator=logreg, estimatorParamMaps=[{}], 
                    evaluator=accuracy_eval, numFolds=10, parallelism=6)

cv_model = cv.fit(train)

print('\nCross-Validation Estimate of Out-Of-Sample Performance:', cv_model.avgMetrics[0])