# Cross-Validation and Scikit-Learn Model Template

Cross-validation is a technique that uses the training data, for which we have outcome labels, to estimate the performance and bias of a classifier.  The most common form is *k*-fold cross-validation (*k*-fold CV), in which we partition the training data into *k* groups.  We hold one of those groups out as testing data, train the model on the remaining groups, and test on the held-out group to get an estimate of performance.  This is repeated so that each group is held out exactly once, and the performance is then averaged across all **folds** (groups).  Here is a summary:

 - Randomly split the data into *k* folds
 - For *i* in 1...*k*, leave out fold *i* for testing
 - Train the classifier on all folds except for fold *i* (the left out fold)
 - Assess performance by testing on the left out fold *i*
 - Repeat and average performance across all folds
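
The steps above can be sketched by hand with scikit-learn's `KFold` splitter. This is a minimal illustration rather than the course template: the data is synthetic (generated with `make_classification` as a stand-in for a real feature matrix), and the names `X_demo`, `y_demo`, and `fold_scores` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; swap in your own X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Randomly split the data into k folds
k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    # Train on every fold except the held-out one
    model = Perceptron(random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Assess performance (accuracy) on the held-out fold
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# Average performance across all folds
print('Mean CV accuracy: {:.3f}'.format(np.mean(fold_scores)))
```

In practice `cross_val_score` wraps this loop in a single call, so the manual version is mostly useful for seeing what happens in each fold.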

Scikit-learn has built-in functions for creating CV folds and computing CV metrics.  An example is given below with the Perceptron neural network classifier.  Use this format on your own classifier to develop results for Week 3.

In [2]:
import pandas as pd
import numpy as np

In [16]:
# Load the data
train_full = pd.read_csv('../data/train_complete.csv')

# Drop the row indices
train_full = train_full.drop(train_full.columns[0], axis=1)

# Log-transform Fare so it is closer to normally distributed
# (adding 1 first means a fare of 0 maps to 0 rather than -inf)
train_full['Fare'] = np.log10(train_full['Fare'] + 1)

train_full.head()

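|   | Survived | Pclass | Fare | Age | Embarked_C | Embarked_S | Embarked_Q | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Rare_Title | Sex_Female |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3.0 | 0.916454 | 22.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 1.0 | 1.859038 | 38.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 1.0 | 3.0 | 0.950608 | 26.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 1.0 | 1.733197 | 35.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 0.0 | 3.0 | 0.956649 | 35.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |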


In [25]:
# Ignore warnings from sklearn (omit this if you're still experimenting with code)
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

# Import the things we need for this code block
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Perceptron

# Separate the target from the features
# (cross_val_score creates the CV folds itself)
X = train_full.drop('Survived', axis=1)
y = train_full['Survived']

# Fit and score the classifier using 10-fold CV
classifier = Perceptron(random_state=154)
scores = cross_val_score(classifier, X, y, cv=10)

print('The mean and std deviation of the CV scores is {:.3f} (+/- {:.3f})'.format(np.mean(scores), np.std(scores)))

The mean and std deviation of the CV scores is 0.691 (+/- 0.079)
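
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling, and the score is the classifier's default (accuracy). If you want the random split described in the summary, one option is to pass an explicit splitter. The sketch below assumes synthetic data in place of the Titanic features, and `cv_splitter` and `demo_scores` are names chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; use your own X and y instead)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Shuffled, stratified 10-fold splitter: each fold keeps roughly the
# same class balance as the full data set
cv_splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=154)

demo_scores = cross_val_score(Perceptron(random_state=154), X_demo, y_demo, cv=cv_splitter)

print('CV accuracy: {:.3f} (+/- {:.3f})'.format(np.mean(demo_scores), np.std(demo_scores)))
```

Fixing `random_state` on the splitter keeps the folds the same from run to run, which makes it easier to compare classifiers on equal footing.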


In [39]:
# Examine the coefficients for each feature using a single train/test split
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (20% of the data for testing)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=154)

# Train and score the classifier
classifier2 = Perceptron(random_state=154)
classifier2.fit(Xtrain, ytrain)
scores2 = classifier2.score(Xtest, ytest)

print('The score of this Perceptron is {:.3f}'.format(scores2))

# Feature coefficients
print('\n\nFeature: \tCoefficient')
for i, colname in enumerate(Xtrain.columns):
    print('{}:\t{:.3f}'.format(str(colname), float(classifier2.coef_[0][i])))

The score of this Perceptron is 0.721


Feature: 	Coefficient
Pclass:	-119.000
Fare:	154.698
Age:	-13.529
Embarked_C:	70.000
Embarked_S:	-30.000
Embarked_Q:	10.000
Title_Master:	55.000
Title_Miss:	101.000
Title_Mr:	-210.000
Title_Mrs:	112.000
Title_Rare_Title:	-4.000
Sex_Female:	225.000


**These Coefficients are Interpretable**

Since this neural network has only one layer, we can interpret the sign of each coefficient as a positive or negative relationship with the outcome.  The absolute values aren't as interpretable, and relative values should be compared with care because the features are on different scales (Age spans decades while the indicator columns are 0 or 1).  The signs suggest that a higher passenger class number (that is, a lower class of ticket) is associated with lower survival, that fare is positively related to survival, and that being a woman or child is also positively related.  If we had used more than one layer, the coefficients would not be interpretable in this way, which is a disadvantage of neural networks.
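
One way to make coefficient magnitudes easier to compare is to standardize the features before fitting. The sketch below is an illustration under stated assumptions, not part of the template: the two columns `age_like` and `flag_like` are hypothetical toy features on very different scales, and the target is constructed artificially.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical toy features: one spans 0-80, the other is a 0/1 flag
demo = pd.DataFrame({
    'age_like': rng.uniform(0, 80, size=300),
    'flag_like': rng.randint(0, 2, size=300),
})
# Artificial target that depends on both features
y_demo = ((demo['age_like'] < 30) | (demo['flag_like'] == 1)).astype(int)

# Standardize each feature to zero mean and unit variance, then fit
pipe = make_pipeline(StandardScaler(), Perceptron(random_state=0))
pipe.fit(demo, y_demo)

# Coefficients now refer to features measured in standard deviations
coefs = pd.Series(pipe.named_steps['perceptron'].coef_[0], index=demo.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))
```

Even after scaling, a single fit of a Perceptron can give unstable coefficients, so treat the ordering as a rough guide rather than a ranking.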

**Not all Classifier Types have Coefficients**

The Perceptron and some regression models have coefficients that can indicate feature importance.  Other models, like RandomForest, calculate feature importance in different ways.  See if you can google a way to do that for your particular model, and ask questions on Slack if you need help.
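
As one example, scikit-learn's `RandomForestClassifier` exposes a `feature_importances_` attribute after fitting. The sketch below uses synthetic data with placeholder column names (`f0` to `f4`), so the numbers it prints say nothing about the Titanic features. Other model types may need a different approach.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names (an assumption)
X_arr, y_demo = make_classification(n_samples=300, n_features=5,
                                    n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=['f{}'.format(i) for i in range(5)])

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances: non-negative and normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Unlike Perceptron coefficients, these importances have no sign, so they show how much a feature is used by the model but not the direction of its relationship with the outcome.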