# Cross Validation

This section documents how to do cross-validation in Scikit-learn.  Cross-validation is our
main tool for model evaluation.  It estimates how a model would perform on unseen
data by splitting the available data into training and testing samples.  To keep things simple we will stick
with the basic linear data-generating model that we used for Monte Carlo examples in class.  Also,
the only model fit will be a basic linear regression.  Everything done here can
easily be extended to any of the other models in Scikit-learn.

In [1]:
# Load helpers
# Only the pieces needed for this section are imported
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit

## Linear model data generation
This function is from the class notes. It generates data from a simple linear model with M predictors.  We used it to show that overfitting can occur even with a linear model.

In [2]:
# Function to generate linear data experiments
def genLinData(N,M,noise):
    # y = x_1 + x_2 + ... + x_M + eps
    # Each x_j has variance 1/M, so the explained part (the sum of the x's) has variance 1,
    # which matches the noise variance when std(eps) = 1
    sigNoise = np.sqrt(1./M)
    X = np.random.normal(size=(N,M),loc=0,scale=sigNoise)
    eps = np.random.normal(size=N,loc=0,scale=noise)
    y = np.sum(X,axis=1)+eps
    return X,y
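
The scaling comment in `genLinData` can be checked numerically. The sketch below is not part of the original notebook: `gen_lin_data_check` is a hypothetical copy of the function that takes an explicit random generator and also returns the noise, and the seed and sample size are arbitrary choices. If the signal and noise variances are both close to 1, the best possible (population) R-squared for this data is about 0.5, which is a useful benchmark for the test scores that follow.

```python
import numpy as np

def gen_lin_data_check(n, m, noise, rng):
    # Hypothetical variant of genLinData: same construction, but it uses an
    # explicit generator and also returns eps so the variances can be compared.
    sig = np.sqrt(1.0 / m)
    X = rng.normal(loc=0, scale=sig, size=(n, m))
    eps = rng.normal(loc=0, scale=noise, size=n)
    y = np.sum(X, axis=1) + eps
    return X, y, eps

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
X_chk, y_chk, eps_chk = gen_lin_data_check(20000, 50, 1.0, rng)

signal_var = np.var(np.sum(X_chk, axis=1))  # variance of the explained part
noise_var = np.var(eps_chk)                 # variance of the noise
print(signal_var, noise_var)
print(1.0 - noise_var / np.var(y_chk))      # approximate population R-squared
```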

### Overfitting in one run using train_test_split

In [4]:
# Basic overfitting example
X, y = genLinData(200,50,1.0)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25)
# Now run the regression
# Print the score (R-squared) on the training set, then on the test set
lr = LinearRegression()
lr.fit(X_train, y_train)
print(lr.score(X_train,y_train))
print(lr.score(X_test,y_test))

0.6490379774856974
0.2253399639542668


### Raw Python loop to simulate many test scores

In [5]:
nmc = 100
X, y = genLinData(200,50,1.0)
scoreVec = np.zeros(nmc)
for i in range(nmc):
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25)
    # Now run the regression
    # Store the test-set score, which is R-squared (fit)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    scoreVec[i] = lr.score(X_test,y_test)
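print(np.mean(scoreVec))    # mean test R-squared across splits
print(np.std(scoreVec))     # spread of test R-squared across splits
print(np.mean(scoreVec<0))  # fraction of splits with a negative test R-squared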

0.44224466961870573
0.1097046822176599
0.0


### Automate this by building a function

In [6]:
# A function to automate MC experiments
def MCtraintest(nmc,X,y,modelObj,testFrac):
    trainScore = np.zeros(nmc)
    testScore  = np.zeros(nmc)
    for i in range(nmc):
        X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=testFrac)
        modelObj.fit(X_train,y_train)
        trainScore[i] = modelObj.score(X_train,y_train)
        testScore[i]  = modelObj.score(X_test,y_test)
    return trainScore,testScore

In [7]:
nmc = 100
lr = LinearRegression()
trainS, testS = MCtraintest(nmc,X,y,lr,0.25)
print(np.mean(trainS))
print(np.std(trainS))
print(np.mean(testS))
print(np.std(testS))

0.761162190524635
0.02293850535797592
0.46753101981207146
0.12839692764931585


# Scikit-learn functions
* Scikit-learn has many built-in functions for cross-validation.
* Here are a few of them.

### cross_validate
* This general-purpose function does many things.
* This first example uses it on a data set and performs a more basic cross-validation than the randomized splits we have been doing.
* This is called k-fold cross-validation.
* It splits the data set into k parts, trains on k-1 parts, and tests on the remaining part.  This is repeated k times so that each part serves as the test set exactly once.
* This is a very standard cross-validation scheme.
* It returns a dictionary of results, including fit times and test scores (and train scores when `return_train_score=True`).
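
The k-fold procedure described above can be written out by hand, which may make it clearer what `cross_validate` does when it is given `cv=5` for a regression model. This is only a sketch: the data (`X_demo`, `y_demo`) are generated inline in the same way as `genLinData` with an arbitrary seed, and the loop uses `KFold` without shuffling, so the folds are consecutive blocks of rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Illustrative data, built the same way as genLinData(200, 50, 1.0)
rng = np.random.default_rng(0)  # arbitrary seed
X_demo = rng.normal(scale=np.sqrt(1.0 / 50), size=(200, 50))
y_demo = X_demo.sum(axis=1) + rng.normal(size=200)

kf = KFold(n_splits=5)  # no shuffling: each fold is a consecutive block of rows
fold_scores = []
test_rows_seen = []
for train_idx, test_idx in kf.split(X_demo):
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])        # train on k-1 parts
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))  # test on the rest
    test_rows_seen.extend(test_idx.tolist())

print(len(fold_scores))      # one test score per fold
print(np.mean(fold_scores))  # average test R-squared
```

Every row lands in exactly one test fold, which is the main difference from the randomized splits used earlier in this section.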

In [None]:
# X, y = genLinData(200,50,1.0)
lr = LinearRegression()
CVInfo = cross_validate(lr, X, y, cv=5,return_train_score=True)
print(np.mean(CVInfo['train_score']))
print(np.mean(CVInfo['test_score']))
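
To see what else is in the returned dictionary, its keys and array shapes can be listed. The sketch below regenerates illustrative data inline (`X_demo`, `y_demo`, arbitrary seed) so it runs on its own. With the default options plus `return_train_score=True`, the dictionary should hold one array per key, with one entry per fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Illustrative data, built the same way as genLinData(200, 50, 1.0)
rng = np.random.default_rng(0)  # arbitrary seed
X_demo = rng.normal(scale=np.sqrt(1.0 / 50), size=(200, 50))
y_demo = X_demo.sum(axis=1) + rng.normal(size=200)

cv_info = cross_validate(LinearRegression(), X_demo, y_demo,
                         cv=5, return_train_score=True)

# Each value is a NumPy array with one entry per fold
for key in sorted(cv_info):
    print(key, cv_info[key].shape)
```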

### ShuffleSplit
* `ShuffleSplit` is a splitter class that supplies randomized train/test splits to `cross_validate`
* Pass an instance of it as the `cv` argument, as shown here

In [None]:
# X, y = genLinData(200,50,1.0)
lr = LinearRegression()
shuffle = ShuffleSplit(n_splits=100, test_size=.25, random_state=0)
CVInfo = cross_validate(lr, X, y, cv=shuffle,return_train_score=True)
print(np.mean(CVInfo['train_score']))
print(np.mean(CVInfo['test_score']))
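
It can help to look at what a `ShuffleSplit` object actually produces. The sketch below uses a small hypothetical splitter (`shuffle_demo`, only 3 splits) on a placeholder array with 200 rows. Each split is a pair of integer index arrays. With `test_size=.25`, the expectation is 150 training rows and 50 test rows per split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.zeros((200, 50))  # placeholder: only the number of rows matters here

shuffle_demo = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)
split_sizes = []
overlaps = []
for train_idx, test_idx in shuffle_demo.split(X_demo):
    split_sizes.append((len(train_idx), len(test_idx)))
    # Within one split, no row is in both the train and the test set
    overlaps.append(len(np.intersect1d(train_idx, test_idx)))
    print(len(train_idx), len(test_idx), test_idx[:5])
```

Unlike k-fold, the test sets of different splits are drawn independently, so a row can appear in several test sets or in none.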

### cross_val_score
* This is a simpler cross-validation function
* It returns a plain array of test-set scores only
* This example also passes a `ShuffleSplit` object as the `cv` argument

In [None]:
# X, y = genLinData(200,50,1.0)
lr = LinearRegression()
shuffle = ShuffleSplit(n_splits=100, test_size=.25, random_state=0)
CVScores = cross_val_score(lr, X, y, cv=shuffle)
print(np.mean(CVScores))
print(CVScores)
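
Since both functions accept the same `cv` argument, the scores from `cross_val_score` should match the `test_score` entry from `cross_validate` when the splitter is seeded with an integer `random_state`. The sketch below checks this on illustrative data generated inline (`X_demo`, `y_demo`, with an arbitrary seed and a smaller number of splits than above).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score, cross_validate

# Illustrative data, built the same way as genLinData(200, 50, 1.0)
rng = np.random.default_rng(0)  # arbitrary seed
X_demo = rng.normal(scale=np.sqrt(1.0 / 50), size=(200, 50))
y_demo = X_demo.sum(axis=1) + rng.normal(size=200)

# An integer random_state makes the splitter produce the same splits on every call
shuffle_demo = ShuffleSplit(n_splits=20, test_size=.25, random_state=0)

scores_a = cross_val_score(LinearRegression(), X_demo, y_demo, cv=shuffle_demo)
scores_b = cross_validate(LinearRegression(), X_demo, y_demo, cv=shuffle_demo)["test_score"]

print(np.allclose(scores_a, scores_b))
```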

# Summary
* Cross-validation is the **gold standard** for testing our models
* From now on we will use randomized cross-validation in most cases
* This section outlines several ways to do this
* See the **Scikit-learn** documentation, since there are many options and ways to call the various cross-validation functions