You want to evaluate how well your model will work in the real world.

Create a pipeline that preprocesses the data, trains the model, and then evaluates
it using cross-validation:



In [5]:
# Load libraries

from sklearn import metrics
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [6]:
# Load digits dataset
digits = datasets.load_digits()
# Create features matrix
features = digits.data
# Create target vector
target = digits.target

In [7]:
# Create standardizer
standardizer = StandardScaler()

# Create logistic regression object
logit = LogisticRegression()


# Create a pipeline that standardizes, then runs logistic regression
pipeline = make_pipeline(standardizer, logit)



In [8]:
# Create k-Fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Conduct k-fold cross-validation
cv_results = cross_val_score(pipeline,            # Pipeline
                             features,            # Feature matrix
                             target,              # Target vector
                             cv=kf,               # Cross-validation technique
                             scoring="accuracy",  # Performance metric
                             n_jobs=-1)           # Use all CPU cores

In [9]:
# Calculate mean
cv_results.mean()

0.9693916821849783

At first consideration, evaluating supervised-learning models might appear
straightforward: train a model and then calculate how well it did using some
performance metric (accuracy, squared errors, etc.). However, this approach is
fundamentally flawed. If we train a model using our data, and then evaluate how
well it did on that data, we are not achieving our desired goal. Our goal is not to
evaluate how well the model does on our training data, but how well it does on
data it has never seen before (e.g., a new customer, a new crime, a new image).
For this reason, our method of evaluation should help us understand how well
models are able to make predictions from data they have never seen before.

One strategy might be to hold back a slice of data for testing. This is called
validation (or hold-out). In validation, our observations (features and targets)
are split into two sets, traditionally called the training set and the test set. We take
the test set and put it off to the side, pretending that we have never seen it before.
Next we train our model using our training set, using the features and target
vector to teach the model how to make the best prediction. Finally, we simulate
having never before seen external data by evaluating how our model trained on
our training set performs on our test set. However, the validation approach has
two major weaknesses. First, the performance of the model can be highly
dependent on which few observations were selected for the test set. Second, the
model is not being trained using all the available data, and not being evaluated
on all the available data.
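
The first weakness can be seen with a minimal sketch that repeats a single hold-out split under several random seeds. The seed range, the variable names, and `max_iter=1000` (set only to help the solver converge) are assumptions for illustration and are not part of the recipe.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the digits data used throughout this recipe
digits = datasets.load_digits()
features, target = digits.data, digits.target

# Repeat one hold-out split with different seeds (seed values are arbitrary)
holdout_scores = []
for seed in range(5):
    features_train, features_test, target_train, target_test = train_test_split(
        features, target, test_size=0.1, random_state=seed)
    # max_iter=1000 is an assumption made here to help convergence
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(features_train, target_train)
    # Accuracy on the held-out 10% of observations
    holdout_scores.append(model.score(features_test, target_test))

# The spread between these values reflects which observations landed in the test set
print(holdout_scores)
```

Any difference between the printed scores comes only from which observations were held out, since the model and data are otherwise identical.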


A better strategy, which overcomes these weaknesses, is called k-fold
cross-validation (KFCV). In KFCV, we split the data into k parts called “folds.” The
model is then trained using k - 1 folds, combined into one training set, and
the remaining fold is used as a test set. We repeat this k times, each time using a
different fold as the test set. The performance of the model on each of the k
iterations is then averaged to produce an overall measurement.

In our solution, we conducted k-fold cross-validation using 10 folds and
output the evaluation scores to cv_results:

In [10]:
# View score for all 10 folds
cv_results


array([0.97777778, 0.98888889, 0.96111111, 0.94444444, 0.97777778,
       0.98333333, 0.95555556, 0.98882682, 0.97765363, 0.93854749])
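
To make the k-fold procedure concrete, the loop that cross_val_score runs can be written out by hand. This is a rough sketch rather than scikit-learn's actual implementation. The names `manual_scores` and `fold_model` are hypothetical, and `max_iter=1000` is an assumption, so the scores may differ slightly from those shown above.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
features, target = digits.data, digits.target

# max_iter=1000 is an assumption made here to help convergence
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
kf = KFold(n_splits=10, shuffle=True, random_state=1)

manual_scores = []
for train_index, test_index in kf.split(features):
    # Start from an unfitted copy so no fold sees another fold's fit
    fold_model = clone(pipeline)
    # Train on the k - 1 folds combined into one training set
    fold_model.fit(features[train_index], target[train_index])
    # Evaluate accuracy on the one remaining fold
    manual_scores.append(fold_model.score(features[test_index], target[test_index]))

manual_scores = np.array(manual_scores)
# Average the k per-fold scores into one overall measurement
print(manual_scores.mean())
```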

There are three important points to consider when we are using KFCV. First,
KFCV assumes that each observation was created independently of the others
(i.e., the data is independent and identically distributed [IID]). If the data is IID, it is
a good idea to shuffle observations when assigning to folds. In scikit-learn we
can set shuffle=True to perform shuffling.

Second, when we are using KFCV to evaluate a classifier, it is often beneficial to
have folds containing roughly the same percentage of observations from each of
the different target classes (called stratified k-fold). For example, if our target
vector contained gender and 80% of the observations were male, then each fold
would contain 80% male and 20% female observations. In scikit-learn, we can
conduct stratified k-fold cross-validation by replacing the KFold class with
StratifiedKFold.
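
As an illustration of stratification, the sketch below splits a small, made-up target vector with an 80/20 class balance and records the share of the minority class in each test fold. The arrays `target_demo` and `features_demo` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 80 observations of class 0, 20 of class 1
target_demo = np.array([0] * 80 + [1] * 20)
# Placeholder features; only the target drives the stratification
features_demo = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

fold_class_shares = []
for _, test_index in skf.split(features_demo, target_demo):
    # Fraction of class 1 among the observations in this test fold
    fold_class_shares.append(target_demo[test_index].mean())

# Each fold should hold the same 20% share of class 1 as the full data
print(fold_class_shares)
```

Unlike KFold, the split method of StratifiedKFold needs the target vector, because the class labels decide how observations are assigned to folds.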

Finally, when we are using validation sets or cross-validation, it is important to
preprocess data based on the training set and then apply those transformations to
both the training and test sets. For example, when we fit our standardization
object, standardizer, we calculate the mean and variance of only the training
set. Then we apply that transformation (using transform) to both the training
and test sets:


In [11]:
# Import library
from sklearn.model_selection import train_test_split
# Create training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.1, random_state=1)
# Fit standardizer to training set
standardizer.fit(features_train)
# Apply to both training and test sets
features_train_std = standardizer.transform(features_train)
features_test_std = standardizer.transform(features_test)
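
One way to check that the standardizer learned its statistics from the training set alone is to compare its stored means with the training-set means. This is a small sketch; `means_match_training`, `train_mean_abs`, and `test_mean_abs` are names introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
features_train, features_test, target_train, target_test = train_test_split(
    digits.data, digits.target, test_size=0.1, random_state=1)

# Fit on the training set only
standardizer = StandardScaler()
standardizer.fit(features_train)

# The learned per-feature means equal the training-set means
means_match_training = np.allclose(standardizer.mean_, features_train.mean(axis=0))

# Largest absolute per-feature mean after standardization:
# essentially zero for the training set, only approximately zero for the test set
train_mean_abs = np.abs(standardizer.transform(features_train).mean(axis=0)).max()
test_mean_abs = np.abs(standardizer.transform(features_test).mean(axis=0)).max()
print(means_match_training, train_mean_abs, test_mean_abs)
```

The test set is not perfectly centered after the transformation, which is expected: it was scaled with statistics it did not contribute to.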

scikit-learn’s pipeline module makes this easy to do while using cross-validation
techniques. We first create a pipeline that preprocesses the data (e.g.,
standardizer) and then trains a model (logistic regression, logit):

In [12]:
# Create a pipeline
pipeline = make_pipeline(standardizer, logit)


Then we run KFCV using that pipeline and scikit-learn does all the work for us:

In [13]:
# Do k-fold cross-validation
cv_results = cross_val_score(pipeline,            # Pipeline
                             features,            # Feature matrix
                             target,              # Target vector
                             cv=kf,               # Cross-validation technique
                             scoring="accuracy",  # Performance metric
                             n_jobs=-1)           # Use all CPU cores

cross_val_score has three parameters that we have not yet discussed and that
are worth noting. cv determines our cross-validation technique. K-fold is the
most common by far, but there are others, such as leave-one-out cross-validation,
where the number of folds k equals the number of observations. The scoring
parameter defines our metric for success; a number of metrics are discussed in
other recipes in this chapter. Finally, n_jobs=-1 tells scikit-learn to use every
core available. For example, if your computer has four cores (a common number
for laptops), then scikit-learn will use all four cores at once to speed up the
operation.
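
The sketch below varies the cv and scoring parameters described above. It assumes a 100-observation subset of the digits data, chosen here only to keep leave-one-out fast, and uses `n_jobs=1` for a simple single-process run. The names `features_small`, `target_small`, `loo_results`, and `f1_results` are hypothetical.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
# Small subset (an assumption) so that one fit per observation stays quick
features_small, target_small = digits.data[:100], digits.target[:100]

# max_iter=1000 is an assumption made here to help convergence
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Leave-one-out: the number of folds equals the number of observations
loo_results = cross_val_score(pipeline, features_small, target_small,
                              cv=LeaveOneOut(), scoring="accuracy", n_jobs=1)

# A different success metric: macro-averaged F1 over 5 folds
f1_results = cross_val_score(pipeline, features_small, target_small,
                             cv=5, scoring="f1_macro", n_jobs=1)

# One score per observation for leave-one-out, one per fold for 5-fold
print(len(loo_results), loo_results.mean(), f1_results.mean())
```

With leave-one-out, each per-fold accuracy is either 0 or 1 because every test set contains a single observation, so only the mean is informative.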
