# Model Validation Code Along

In this code along, we will work through several examples of model validation, demonstrating different ways of performing it.  We will move from a basic train-test split towards the most rigorous method, which uses a pipeline.  

The overall goal of validation is to assess how your model will perform on unseen data.  The key to doing so is to compare how well the model performs on data it has been fit on against data it has not been fit on.  We select a metric appropriate to the problem (R^2, RMSE, etc.) and look at the performance on each.  We can then say:  
  
1. It is a low bias, high variance model if it performs well on the training set, but poorly on the test set.
2. It is a high bias, low variance model if it performs poorly on both the training and test sets.
3. It is a low bias, low variance model if it performs well on both the training and test sets.
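
As a rough illustration of the first case, the sketch below fits an unconstrained decision tree to noisy synthetic data. The dataset (`make_regression`), the `DecisionTreeRegressor`, and all variable names are assumptions made for this example and are not part of the code along; the point is only the gap between the training and test R^2.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data (an assumption; not the housing data used below)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A tree with no depth limit can memorize the training set
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_tr, y_tr)

train_r2 = tree.score(X_tr, y_tr)
test_r2 = tree.score(X_te, y_te)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

A large drop from the training score to the test score is the signature of a low bias, high variance model.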



Let's load in a data set with a continuous target variable.

In [1]:
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs scikit-learn < 1.2
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
import seaborn as sns

In [2]:
data = load_boston()
X = pd.DataFrame(data['data'])
X.columns = data['feature_names'] 
y = data['target']

In [3]:
X.shape

(506, 13)

In [4]:
X.head()

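| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |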


In [None]:
df = X.copy()
df['target'] = y

In [None]:
df.corr()['target']

# Basic Train-Test Split

The simplest form of model validation is to perform a single train-test split and compare performance on the training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Create a split

In this version, the model-building process would include testing hypotheses about which features and transformations to include, and judging which model has the best R^2 on the test set. Moreover, we can gauge overfitting by looking at the difference between the train and test scores.

In [None]:
# First, fit a linear model to the training set using only RM and LSTAT

In [None]:
# Look at the score on the train set

In [None]:
# Then we score the model on the test set

In [None]:
# Compare training and test scores
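
The steps in this section could be filled in roughly as follows. This is a minimal sketch on stand-in data: `make_regression`, the `feat_*` column names, and the `*_demo` variable names are assumptions made so the block runs on its own. In the notebook you would use `X`, `y`, and the `['RM', 'LSTAT']` columns instead.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption): in the notebook, use X and y from the cells above
X_arr, y_demo = make_regression(n_samples=500, n_features=6, n_informative=6,
                                noise=15.0, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

# Create a split
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit a linear model on a subset of features (in the notebook: ['RM', 'LSTAT'])
cols = ["feat_0", "feat_1"]
lr_demo = LinearRegression()
lr_demo.fit(X_tr[cols], y_tr)

# Score on the train set, then on the test set; the model is never fit on the test set
train_score = lr_demo.score(X_tr[cols], y_tr)
test_score = lr_demo.score(X_te[cols], y_te)

# Compare training and test scores
print(f"train R^2: {train_score:.3f}, test R^2: {test_score:.3f}, "
      f"gap: {train_score - test_score:.3f}")
```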

Notice, we do not fit our model on the test set.  This concept will be repeated in many forms.  Do not fit on your test set.

We would continue trying out different hypotheses and selecting the model that works best on the test set.  The issue is that every time we score a model on the test set and use the result to decide what to change next, information about the test set leaks into our modeling decisions.  We end up adjusting each hypothesis until the test score is high, so the test score becomes an optimistic estimate of performance on truly unseen data.

# Best Practice: split off test set and don't touch until the end

Best practice is to split off a portion of the data and not touch it until the very end.  None of your model-building will involve this hold out set.  Its only purpose will be as a final check to make sure that the model performs on unseen data as you would expect.

Let's split the data once more:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, test_size=.2)

Now, we will "put" the test set away, and not touch it until the very end.

# Secondary Train-Test Split

Our test set allowed us to see the drop-off in performance between data our model has been trained on and data it has not been trained on.  Now that we have set the test set aside, we need another way to judge whether our model is overfit.  

The simplest way to do that is by performing a secondary split and building our model in the same way as above, as if the training set were the only data in existence.

In [None]:
X_tt, X_val, y_tt, y_val = train_test_split(X_train, y_train, random_state = 42, test_size = .2)

In [None]:
lr = LinearRegression()
lr.fit(X_tt[['RM', 'LSTAT']], y_tt)
lr.score(X_tt[['RM', 'LSTAT']], y_tt)

In [None]:
lr.score(X_val[['RM', 'LSTAT']], y_val)

# Transformations

If we were to apply transformations to our data, we would have to be careful.  
Take, for example, a standard scaler.    

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()


A standard scaler learns the mean and standard deviation of our dataset in order to convert each data point into a z-score.

If we fit the scaler on the RM column using all of our data, the learned mean will be different than if we fit it on X_train alone.


In [None]:
ss.fit(X[['RM']])
ss.mean_

In [None]:
ss.fit(X_train[['RM']])
ss.mean_

Therefore, if we fit our standard scaler on the whole data set and then use it to transform our training set, we introduce information from the test set into the training set.  This is a form of data leakage.

To prevent this, we only ever fit our transformations (scalers, one-hot encoders, upsamplers/downsamplers) on the training set.  

This applies even within the secondary split. 

So, with our secondary split example, we would proceed as follows:

In [None]:
# Fit both our scaler and our model on the train set

In [None]:
# Transform the validation set with the fit scaler
# Do not refit the scaler
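
One way these two cells might look is sketched below. The synthetic data and the `*_demo` names are assumptions made so the block is self-contained; in the notebook you would use `X_tt`, `X_val`, `y_tt`, and `y_val` from the secondary split.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the training data (an assumption)
X_demo, y_demo = make_regression(n_samples=400, n_features=2, noise=10.0, random_state=2)
X_tt_demo, X_val_demo, y_tt_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Fit both the scaler and the model on the train portion only
scaler_demo = StandardScaler()
X_tt_scaled = scaler_demo.fit_transform(X_tt_demo)
model_demo = LinearRegression().fit(X_tt_scaled, y_tt_demo)

# Transform the validation portion with the already-fit scaler (no refit)
X_val_scaled = scaler_demo.transform(X_val_demo)

print("train R^2:", model_demo.score(X_tt_scaled, y_tt_demo))
print("val R^2:  ", model_demo.score(X_val_scaled, y_val_demo))
```

Because the scaler only saw the train portion, the scaled validation columns will generally not have a mean of exactly zero. That is expected and is not a sign of a mistake.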


Consider the following case: our best model uses every feature, scaled.  

In [None]:
# Fit the scaler on your training set



In [None]:
# Fit and score on training set

In [None]:
# transform and score on the validation set

Since this is what we hypothesize to be our best model, we want to test it out on the hold out set.

To do so, we have to transform the test data exactly as we transformed our training data. In this example, our model was trained on scaled training features, so we have to scale the test features, or else our model will perform horribly.  

However, we want to train on as much information in the training set as we can.  So, once we have selected our final model, we then refit everything on the entire training set.

In [None]:
# Fit on entire training set

In [None]:
# Transform holdout
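
A minimal sketch of this final step, assuming stand-in data from `make_regression` and hypothetical `*_demo` names (in the notebook these would be `X_train`, `X_test`, `y_train`, and `y_test`):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=3)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Refit the scaler and the model on the entire training set
final_scaler = StandardScaler()
X_train_scaled = final_scaler.fit_transform(X_train_demo)
final_model = LinearRegression().fit(X_train_scaled, y_train_demo)

# Transform the holdout with the same fitted scaler, then score it once
X_test_scaled = final_scaler.transform(X_test_demo)
holdout_r2 = final_model.score(X_test_scaled, y_test_demo)
print(f"holdout R^2: {holdout_r2:.3f}")
```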

# K-Folds



K-fold cross-validation is an even more rigorous manner of performing validation.  It splits our training data into a specified number of folds, optionally shuffling first.  Each fold takes one turn as the validation set while the model is fit on the remaining folds.

We perform our k-folds within the training set, and each time we fit our model and transformations, we have to be careful to fit only on the training portion of that split.  This can get tricky.

In [None]:
from sklearn.model_selection import train_test_split, KFold
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.2)

In [None]:
from sklearn.model_selection import KFold

In [None]:
# instantiate a KFold object and specify the number of splits
# use 4 splits, so each fold holds out a quarter of the training set

# loop through each fold
    
    # instantiate a Linear Regression object for each fold
    
    # instantiate a scaler for each fold
    
    # using the indices, create the split associated with each loop
    
    # fit_transform the scaler on tt, then transform val without refitting
    
    # fit model on tt
    
    # score both training and validation
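
The loop outlined in the comments above might be written as follows. This is a sketch under stated assumptions: the data comes from `make_regression` as a stand-in for the training set, `shuffle=True` and `random_state=42` are choices made here, and the `*_demo` and `*_fold` names are hypothetical. With a DataFrame such as `X_train`, index rows with `.iloc[tt_idx]` instead of `[tt_idx]`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Stand-in for the training set (an assumption)
X_train_demo, y_train_demo = make_regression(n_samples=400, n_features=8,
                                             noise=20.0, random_state=4)

# Instantiate a KFold object with 4 splits
kf = KFold(n_splits=4, shuffle=True, random_state=42)
train_scores, val_scores = [], []

# Loop through each fold
for tt_idx, val_idx in kf.split(X_train_demo):
    # Fresh model and scaler for each fold, so nothing carries over
    lr_fold = LinearRegression()
    ss_fold = StandardScaler()

    # Use the indices to build this fold's split
    X_tt_fold, X_val_fold = X_train_demo[tt_idx], X_train_demo[val_idx]
    y_tt_fold, y_val_fold = y_train_demo[tt_idx], y_train_demo[val_idx]

    # Fit the scaler on tt only, then reuse it on val
    X_tt_scaled = ss_fold.fit_transform(X_tt_fold)
    X_val_scaled = ss_fold.transform(X_val_fold)

    # Fit the model on tt, then score both portions
    lr_fold.fit(X_tt_scaled, y_tt_fold)
    train_scores.append(lr_fold.score(X_tt_scaled, y_tt_fold))
    val_scores.append(lr_fold.score(X_val_scaled, y_val_fold))

print("mean train R^2:", np.mean(train_scores))
print("mean val R^2:  ", np.mean(val_scores))
```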
    

In [None]:
# Refit on entire training set

In [None]:
# Score on holdout

# Pipelines

Writing the KFold loop by hand can be tricky.  It allows for a lot of control, but it also leaves room for errors, such as forgetting to apply a transformation to the validation or test set.  Pipelines allow you to package the transformations and the model in one place.  

They can also be used within cross-validation and grid search methods (you will learn about grid search soon).   

The same process as above, fitting the scaler on multiple subsets of the training set and scoring on multiple validation sets, can be performed in a couple of lines.


In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(StandardScaler(), LinearRegression())

cross_val_score(pipe, X_train, y_train, cv=4)
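
To check that the pipeline really does repeat the manual procedure, the sketch below compares `cross_val_score` on a pipeline against a hand-written loop over the same folds. The synthetic data and the `*_demo` and `*_fold` names are assumptions for this example. An unshuffled `KFold(n_splits=4)` is used because that is what an integer `cv=4` resolves to for a regressor.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the training set (an assumption)
X_train_demo, y_train_demo = make_regression(n_samples=400, n_features=8,
                                             noise=20.0, random_state=5)

pipe_demo = make_pipeline(StandardScaler(), LinearRegression())
kf = KFold(n_splits=4)  # no shuffling, matching cv=4 for a regressor

# Pipeline version: the scaler is refit inside each fold automatically
pipe_scores = cross_val_score(pipe_demo, X_train_demo, y_train_demo, cv=kf)

# Manual version: fit the scaler and model on tt, score on val
manual_scores = []
for tt_idx, val_idx in kf.split(X_train_demo):
    ss_fold = StandardScaler().fit(X_train_demo[tt_idx])
    lr_fold = LinearRegression().fit(ss_fold.transform(X_train_demo[tt_idx]),
                                     y_train_demo[tt_idx])
    manual_scores.append(lr_fold.score(ss_fold.transform(X_train_demo[val_idx]),
                                       y_train_demo[val_idx]))

print("pipeline:", pipe_scores)
print("manual:  ", np.array(manual_scores))
```

The two sets of validation scores should agree up to floating-point error, which is why the pipeline is the safer default: the fit-on-train-only rule is enforced for you.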


In [None]:
pipe.fit(X_train, y_train)

In [None]:
pipe.score(X_test, y_test)

In [None]:
y_hat = pipe.predict(X_test)