# Lesson 30 - Overfitting

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

from pyspark.ml.feature import VectorAssembler, OneHotEncoder, StringIndexer
from pyspark.ml.classification import LogisticRegression 
from pyspark.ml import Pipeline

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()

## Out-of-Sample Data

Up to this point, when we have measured a model's performance, we have used the training data to calculate some metric (typically accuracy, when working with a classification model). However, when we apply a model in a production setting, we will be applying it to new observations that the model was not trained on, and for which we do not know the true label values. Observations that are not included within the training set are referred to as **out-of-sample** observations. We are typically much more interested in how the model will perform on out-of-sample data than we are in how well the model performs on the training data. 

We typically cannot measure a model's true out-of-sample performance, since that would require us to score the model on every possible observation that it was not trained on, the vast majority of which will likely have labels that are unknown to us. There are techniques that can be used to estimate a model's out-of-sample performance, however. The most obvious technique might be to use a metric calculated on the training data as an estimate of the same metric on out-of-sample data. This approach has some serious issues, however.

## Overfitting

When we train a machine learning model, the training algorithm is designed to optimize performance on the training set. The algorithm will try to take advantage of any correlations between the feature values and label values that might be present within the training set, even if these correlations are the result of noise. As a result, the training algorithm might pick up on patterns that are present within the training data but that do not generalize to out-of-sample observations. This phenomenon is known as **overfitting**.

As a result of overfitting, you should always assume that any model metric calculated from the training set will give an overly optimistic view of how well the model will perform when presented with out-of-sample data. The degree to which a training score might overestimate out-of-sample performance varies from dataset to dataset and from model to model. In some situations, the training score might be a good estimate of the model's out-of-sample performance. In other cases, the training score might indicate that a model makes nearly perfect predictions, when in fact the model has nearly no value when it comes to generating predictions on out-of-sample data.

The key take-away is that you should almost never trust a training score by itself, and should generally use other methods for estimating a model's out-of-sample performance.
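
The gap between training and out-of-sample scores can be illustrated with a small, self-contained sketch. It does not use the logistic regression model trained later in this lesson. Instead, it swaps in a 1-nearest-neighbor rule, which effectively memorizes the training set, and applies it to made-up data whose labels are coin flips that are unrelated to the features. All names below (`make_noise_data`, `predict_1nn`, `accuracy`) are hypothetical helpers written for this illustration.

```python
import random

def make_noise_data(n, n_features, rng):
    """Random features with coin-flip labels, so there is no real signal to learn."""
    X = [[rng.random() for _ in range(n_features)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y

def predict_1nn(X_train, y_train, x):
    """Return the label of the training point closest to x (squared Euclidean distance)."""
    best_label, best_dist = None, float('inf')
    for xi, yi in zip(X_train, y_train):
        dist = sum((a - b) ** 2 for a, b in zip(xi, x))
        if dist < best_dist:
            best_label, best_dist = yi, dist
    return best_label

def accuracy(X_train, y_train, X, y):
    """Fraction of observations in (X, y) that the 1-NN rule labels correctly."""
    hits = sum(predict_1nn(X_train, y_train, x) == label for x, label in zip(X, y))
    return hits / len(y)

rng = random.Random(0)
X_train, y_train = make_noise_data(300, 5, rng)   # synthetic training set
X_new, y_new = make_noise_data(1000, 5, rng)      # synthetic out-of-sample observations

train_acc = accuracy(X_train, y_train, X_train, y_train)
new_acc = accuracy(X_train, y_train, X_new, y_new)

print('Training accuracy:     ', train_acc)
print('Out-of-sample accuracy:', new_acc)
```

Each training point is its own nearest neighbor, so the training accuracy is perfect. The labels carry no information about new observations, so the out-of-sample accuracy should land near 0.5, which is no better than guessing.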

## Test Sets

One common approach to estimating a model's out-of-sample performance is to split the labeled data into two sets: a training set upon which we will train a model and a test set that we will use to evaluate the model. For example, suppose that we randomly select and set aside 20% of our labeled data for testing. We can then train a model on the remaining 80% and score that model on the test set, which represents out-of-sample data. This will provide us with an estimate of the model's performance on all out-of-sample observations. After obtaining this estimate, we will typically retrain the model on the entire dataset, since models tend to perform better if they are provided more data to train on.

This train/test split approach can work reasonably well, but it does have some drawbacks. One major flaw is that the estimate of out-of-sample performance can depend heavily on exactly which observations are selected for the test set, so estimates generated in this way can be very volatile. We will discuss this concern in more detail in the next section.
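
The 80/20 split described above can be sketched in plain Python as follows. The helper `train_test_split` and the stand-in `rows` list are hypothetical and exist only for this illustration. In PySpark, `DataFrame.randomSplit([0.8, 0.2], seed=...)` plays a similar role, although it assigns each row at random, so the resulting sizes are only approximately 80% and 20%.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=None):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows form the test set; the rest form the training set.
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))   # stand-ins for 100 labeled observations
train, test = train_test_split(rows, test_fraction=0.2, seed=1)

print(len(train), len(test))   # 80 20
```

Changing `seed` changes which observations land in the test set. That is the source of the volatility mentioned above.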

## Load and Explore Data

We will now demonstrate the phenomenon of overfitting. We will be using a synthetic dataset that has been split into two equal-sized parts. The two parts are stored in separate CSV files named `synthetic_data_1.csv` and `synthetic_data_2.csv`. We will now load both of these datasets into DataFrames named `df1` and `df2`.

In [0]:
df_schema = (
    'c01 DOUBLE, c02 DOUBLE, c03 DOUBLE, c04 DOUBLE, c05 DOUBLE, '
    'c06 DOUBLE, c07 DOUBLE, c08 DOUBLE, c09 DOUBLE, c10 DOUBLE, '
    'c11 STRING, c12 STRING, c13 STRING, c14 STRING, c15 STRING, '
    'c16 STRING, c17 STRING, c18 STRING, c19 STRING, c20 STRING, '
    'label INTEGER'
)

df1 = (
    spark.read
    .option('delimiter', ',')
    .option('header', True)
    .schema(df_schema)
    .csv('/FileStore/tables/synthetic_data_1.csv')
)

df2 = (
    spark.read
    .option('delimiter', ',')
    .option('header', True)
    .schema(df_schema)
    .csv('/FileStore/tables/synthetic_data_2.csv')
)

df1.printSchema()

We will now display a few rows of both DataFrames. Note that the structure of the two DataFrames is very similar. It is important to keep in mind that the data contained in these DataFrames were generated using the same process.

In [0]:
df1.show(5)
df2.show(5)

In [0]:
N1 = df1.count()
N2 = df2.count()

print(N1)
print(N2)

## Distribution of Label Values

We will now determine the distribution of label values in the two datasets.

In [0]:
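# Count and proportion of each label value in df1.
(df1.groupby('label')
    .agg(
        expr('COUNT(*) as count'),
        expr(f'ROUND(COUNT(*)/{N1},4) as prop')
    ).show()
)

# Count and proportion of each label value in df2 (divide by N2, not N1).
(df2.groupby('label')
    .agg(
        expr('COUNT(*) as count'),
        expr(f'ROUND(COUNT(*)/{N2},4) as prop')
    ).show()
)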

## Identify Numerical and Categorical Features

The first 10 columns of our DataFrames represent numerical features and the next 10 columns represent categorical features. The last column represents the label.

In [0]:
num_features = df1.columns[:10]
cat_features = df1.columns[10:-1]
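
The positional slices above rely on the column order. A type-based selection is one possible alternative, sketched below in plain Python. It assumes the input has the same shape as `df1.dtypes`, which is a list of `(column_name, type_name)` pairs. The helper `split_features_by_dtype` and the hand-written `example_dtypes` list are hypothetical and stand in for the real DataFrame.

```python
def split_features_by_dtype(dtypes, label_col='label'):
    """Split (column_name, type_name) pairs into numerical and categorical column names."""
    numeric_types = {'double', 'float', 'int', 'bigint'}
    num_cols = [name for name, dtype in dtypes
                if name != label_col and dtype in numeric_types]
    cat_cols = [name for name, dtype in dtypes
                if name != label_col and dtype == 'string']
    return num_cols, cat_cols

# Hand-written stand-in for df1.dtypes, based on the schema defined earlier.
example_dtypes = (
    [(f'c{i:02d}', 'double') for i in range(1, 11)]
    + [(f'c{i:02d}', 'string') for i in range(11, 21)]
    + [('label', 'int')]
)

num_cols, cat_cols = split_features_by_dtype(example_dtypes)
print(num_cols)   # c01 through c10
print(cat_cols)   # c11 through c20
```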

## Define Pipeline Stages

We will now create several stages to perform processing and modeling tasks on our dataset.

In [0]:
ix_features = [c + '_ix' for c in cat_features]
vec_features = [c + '_vec' for c in cat_features]

feature_indexer = StringIndexer(inputCols=cat_features, outputCols=ix_features)

encoder = OneHotEncoder(inputCols=ix_features, outputCols=vec_features, dropLast=False)

assembler = VectorAssembler(inputCols=num_features + vec_features, outputCol='features')

logreg = LogisticRegression(featuresCol='features', labelCol='label')
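
To give a rough idea of what the indexing and encoding stages do to a single categorical column, here is a simplified plain-Python sketch. It is not Spark's implementation. It assumes that categories are indexed from most to least frequent, which is the default ordering of `StringIndexer`. The alphabetical tie-breaking is an assumption of this sketch. It produces one indicator slot per category, as with `dropLast=False`. The helper names and the sample column are made up, and Spark would store the encoded values as sparse vectors rather than Python lists.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: ix for ix, value in enumerate(ordered)}

def one_hot(index, n_categories):
    """Indicator list with one slot per category (no category is dropped)."""
    return [1.0 if i == index else 0.0 for i in range(n_categories)]

column = ['B', 'A', 'B', 'C', 'B', 'A']   # made-up categorical values
mapping = fit_string_index(column)
encoded = [one_hot(mapping[v], len(mapping)) for v in column]

print(mapping)      # {'B': 0, 'A': 1, 'C': 2}
print(encoded[0])   # [1.0, 0.0, 0.0]
```

The assembler stage then concatenates these indicator vectors with the numerical columns to form the single `features` vector that the logistic regression model consumes.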

## Create and Fit the Pipeline

In the cell below, we create a pipeline object containing the relevant stages and then fit the pipeline to the data in `df1`. We then use the `transform()` method of the fitted pipeline model to generate predictions for both datasets, `df1` and `df2`.

In [0]:
model = Pipeline(stages=[feature_indexer, encoder, assembler, logreg]).fit(df1)

pred1 = model.transform(df1)
pred2 = model.transform(df2)

pred1.select(['probability', 'prediction', 'label']).show(5, truncate=False)

pred2.select(['probability', 'prediction', 'label']).show(5, truncate=False)

## Score the Model

We will now score our model on the data contained in `df1`, as well as the data contained in `df2`. Note that the model was actually trained on `df1`.

In [0]:
accuracy_eval = MulticlassClassificationEvaluator(
    predictionCol='prediction', labelCol='label', metricName='accuracy')

score1 = accuracy_eval.evaluate(pred1)
score2 = accuracy_eval.evaluate(pred2)

print('Training Accuracy (df1):      ', score1)
print('Out-of-Sample  Accuracy (df2):', score2)
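
For reference, the accuracy metric used above is the fraction of observations whose predicted label matches the true label. The sketch below computes it in plain Python. The `accuracy` helper is hypothetical, and the prediction and label values are made up for illustration. They are not results from this lesson.

```python
def accuracy(predictions, labels):
    """Fraction of observations whose predicted label equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError('predictions and labels must have the same length')
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)

# Made-up values: predictions are floats and labels are integers,
# mirroring the column types in the prediction DataFrames.
preds = [1.0, 0.0, 1.0, 1.0, 0.0]
labels = [1, 0, 0, 1, 1]

print(accuracy(preds, labels))   # 0.6
```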

## Swap Training and Test Sets

As you can see, the model performed dramatically better on `df1` (which it was trained on) than on `df2`. You might suspect that some characteristic of `df1` or `df2` simply makes it harder for a model to perform well on `df2`. To dispel that notion, we will now train the model on `df2` and evaluate it on `df1`. In this case, you can see that the model performs significantly better on `df2` than on `df1`.

In [0]:
model = Pipeline(stages=[feature_indexer, encoder, assembler, logreg]).fit(df2)

pred1 = model.transform(df1)
pred2 = model.transform(df2)

score1 = accuracy_eval.evaluate(pred1)
score2 = accuracy_eval.evaluate(pred2)

print('Training Accuracy (df2):     ', score2)
print('Out-of-Sample Accuracy (df1):', score1)