Holdout validation is actually a specific example of a larger class of validation techniques called k-fold cross-validation. K-fold cross-validation works by:

    splitting the full dataset into k equal-length partitions,
        selecting k-1 partitions as the training set and
        selecting the remaining partition as the test set
    training the model on the training set,
    using the trained model to predict labels on the test set,
    computing an error metric (e.g. simple accuracy) and setting aside the value for later,
    repeating the selection, training, and scoring steps k-1 more times, until each partition has been used as the test set exactly once,
    calculating the mean of the k error values.
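
The steps above can be sketched in plain NumPy. This is only an illustration, not the implementation used later in this section: the helper name k_fold_accuracy, the fit/predict callables, and the majority-class stand-in model are all hypothetical.

```python
import numpy as np

def k_fold_accuracy(X, y, k, fit, predict, seed=0):
    """Return (mean accuracy, per-fold accuracies) over k folds.

    fit and predict are caller-supplied functions (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(X))
    folds = np.array_split(shuffled, k)  # k nearly equal partitions of row positions
    scores = []
    for i in range(k):
        test_idx = folds[i]  # one partition is the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        predictions = predict(model, X[test_idx])
        scores.append(np.mean(predictions == y[test_idx]))  # simple accuracy
    return float(np.mean(scores)), scores

# Hypothetical stand-in model: always predict the most common training label.
def fit_majority(X_train, y_train):
    values, counts = np.unique(y_train, return_counts=True)
    return values[np.argmax(counts)]

def predict_majority(model, X_test):
    return np.full(len(X_test), model)

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.array([0] * 14 + [1] * 6)
mean_accuracy, fold_accuracies = k_fold_accuracy(X, y, 5, fit_majority, predict_majority)
print(fold_accuracies, mean_accuracy)
```

Any model can be swapped in by passing different fit and predict functions; the fold bookkeeping stays the same.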

Using 5 or 10 folds is common for k-fold cross-validation. Here's a diagram describing each iteration of 5-fold cross-validation:

Since you're training k models, the more folds you use, the longer it takes. When working with large datasets, often only a few folds are used because of the time and cost involved; the tradeoff is acceptable because having more training examples helps accuracy even with fewer folds.
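
To make the tradeoff concrete, here is a small arithmetic sketch of how the number of models and the size of each training set change with k. It assumes a dataset of 644 rows, and fold_sizes is a hypothetical helper that rounds down to the smallest fold.

```python
def fold_sizes(n_rows, k):
    """Return (test rows, training rows) for one iteration, using the smallest fold."""
    test_rows = n_rows // k
    return test_rows, n_rows - test_rows

for k in (2, 5, 10):
    test_rows, train_rows = fold_sizes(644, k)
    # k models are trained in total, one per fold.
    print(f"k={k}: {k} models, about {train_rows} training rows and {test_rows} test rows each")
```

More folds means more models to train, but each one sees a larger share of the data.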

In [2]:
# Partition the dataset into five folds

import pandas as pd
import numpy as np
admissions = pd.read_csv("admissions.csv")
admissions["actual_label"] = admissions["admit"]
admissions = admissions.drop("admit", axis=1)

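# Shuffle the rows, then reset the index so the row labels run from 0 to n-1.
# reset_index() keeps the original row labels in a new "index" column.
shuffled_index = np.random.permutation(admissions.index)
shuffled_admissions = admissions.loc[shuffled_index]
admissions = shuffled_admissions.reset_index()

# Assign fold numbers by row range. .loc label slices include both endpoints
# (.ix was removed in pandas 1.0).
admissions.loc[0:128, "fold"] = 1
admissions.loc[129:257, "fold"] = 2
admissions.loc[258:386, "fold"] = 3
admissions.loc[387:514, "fold"] = 4
admissions.loc[515:643, "fold"] = 5
# Ensure the column is set to integer type.
admissions["fold"] = admissions["fold"].astype('int')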


print(admissions.head())
print(admissions.tail())

   index       gpa         gre  actual_label  fold
0     90  3.498628  632.133663             0     1
1    109  3.285823  528.973583             0     1
2     99  2.916654  661.170680             0     1
3    123  3.119102  548.824829             0     1
4    629  2.944457  624.275140             1     1
     index       gpa         gre  actual_label  fold
639    380  2.894841  555.032255             0     5
640    254  3.108220  617.103330             0     5
641    223  3.364396  624.524777             0     5
642    371  2.746944  529.966555             0     5
643    144  2.687906  540.035697             0     5
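
The row ranges in the cell above are written out by hand. As an alternative, here is a sketch that derives the fold boundaries from the row count. The assign_folds helper is hypothetical, and the DataFrame is a synthetic stand-in for admissions.csv (same column names assumed, random values). Unlike the cell above, it drops the old row labels instead of keeping them in an "index" column.

```python
import numpy as np
import pandas as pd

def assign_folds(df, k, seed=1):
    """Shuffle the rows and add a 'fold' column numbered 1..k with near-equal sizes."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    fold_labels = np.empty(len(shuffled), dtype=int)
    # array_split gives the first (len % k) chunks one extra row each.
    chunks = np.array_split(np.arange(len(shuffled)), k)
    for fold_number, positions in enumerate(chunks, start=1):
        fold_labels[positions] = fold_number
    shuffled["fold"] = fold_labels
    return shuffled

# Synthetic stand-in data: the values are random, not real admissions records.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "gpa": rng.uniform(2.0, 4.0, 644),
    "gre": rng.uniform(400, 800, 644),
    "actual_label": rng.integers(0, 2, 644),
})
folded = assign_folds(demo, k=5)
print(folded["fold"].value_counts().sort_index())

# For one iteration, fold 1 is the test set and the other folds form the training set.
test_rows = folded[folded["fold"] == 1]
train_rows = folded[folded["fold"] != 1]
```

With 644 rows and 5 folds this yields four folds of 129 rows and one of 128, so the sizes differ slightly from the hand-written ranges.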


In a production environment, however, you should use scikit-learn, which has several tools that make cross-validation easy. Just as you instantiate a LinearRegression or LogisticRegression object before you can train one of those models, you need to instantiate the KFold class before you can perform k-fold cross-validation:

kf = KFold(n_splits=5, shuffle=False, random_state=None)

where:

    n_splits is the number of folds you want to use,
    shuffle is used to toggle shuffling of the ordering of the observations in the dataset,
    random_state is used to specify a seed value if shuffle is set to True.

This is the signature in sklearn.model_selection, where KFold has lived since scikit-learn 0.18. The older sklearn.cross_validation module, removed in scikit-learn 0.20, used KFold(n, n_folds, shuffle=False, random_state=None), where n was the number of observations in the dataset.

You'll notice that none of these parameters depends on the dataset. This is because a KFold object only describes how to split the data: its split method yields the row indices for each fold, but it won't handle the training and testing of models. If we're primarily interested in accuracy and error metrics for each fold, we can use the KFold class in conjunction with the cross_val_score function, which handles training and testing of the models in each fold.
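
To see what KFold produces on its own, here is a minimal sketch. It assumes a recent scikit-learn release, where KFold is imported from sklearn.model_selection, takes n_splits, and yields index arrays from its split method. The ten-row array is toy data.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy observations, one feature
kf = KFold(n_splits=5, shuffle=True, random_state=8)

test_rows_seen = []
for fold_number, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Each iteration yields integer row positions, not the data itself.
    print(f"fold {fold_number}: train={train_idx}, test={test_idx}")
    test_rows_seen.extend(test_idx.tolist())
```

Every row position lands in exactly one test fold, and training a model on each split is left to the caller.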

Here are the relevant parameters for the cross_val_score function:

cross_val_score(estimator, X, y, scoring=None, cv=None)

where:

    estimator is a sklearn model that implements the fit method (e.g. instance of LinearRegression or LogisticRegression),
    X is the list or 2D array containing the features you want to train on,
    y is a list containing the values you want to predict (target column),
    scoring is a string describing the scoring criteria (the accepted values are listed in the scikit-learn documentation),
    cv describes the number of folds. Here are some examples of accepted values:
        an instance of the KFold class,
        an integer representing the number of folds.

Whichever scoring criterion you specify (e.g. accuracy or average_precision), cross_val_score returns an array of values, one value for each fold.
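
The two kinds of cv value can be compared in a short sketch. It assumes a recent scikit-learn release (sklearn.model_selection) and uses synthetic data: the gpa values and the rule that generates the labels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: a made-up, noisy relationship between gpa and the label.
rng = np.random.default_rng(0)
gpa = rng.uniform(2.0, 4.0, size=200)
labels = (gpa + rng.normal(0, 0.5, size=200) > 3.2).astype(int)
X = gpa.reshape(-1, 1)

lr = LogisticRegression()

# cv as an integer: for a classifier, scikit-learn uses stratified folds without shuffling.
scores_int = cross_val_score(lr, X, labels, scoring="accuracy", cv=5)

# cv as a KFold instance: you control shuffling and the seed yourself.
kf = KFold(n_splits=5, shuffle=True, random_state=8)
scores_kf = cross_val_score(lr, X, labels, scoring="accuracy", cv=kf)

print(scores_int.shape, scores_kf.shape)
print(scores_int.mean(), scores_kf.mean())
```

Passing a KFold instance is the more explicit choice when you want the split to be reproducible across runs.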

Here's the general workflow for performing k-fold cross-validation using the classes we just described:

    instantiate the model class you want to fit (e.g. LogisticRegression),
    instantiate the KFold class and use the parameters to specify the k-fold cross-validation attributes you want,
    use the cross_val_score function to return the scoring metric you're interested in.


In [3]:
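import pandas as pd
# KFold and cross_val_score have lived in sklearn.model_selection since scikit-learn 0.18;
# the old sklearn.cross_validation module was removed in 0.20.
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

admissions = pd.read_csv("admissions.csv")
admissions["actual_label"] = admissions["admit"]
admissions = admissions.drop("admit", axis=1)

# Five shuffled folds; n_splits replaces the old n and n_folds arguments.
kf = KFold(n_splits=5, shuffle=True, random_state=8)
lr = LogisticRegression()
# Train and score on each fold using only the gpa column (the default scoring is accuracy).
# Exact scores can vary with the scikit-learn version.
accuracies = cross_val_score(estimator=lr, X=admissions[["gpa"]], y=admissions["actual_label"], cv=kf)
print(accuracies)
print(accuracies.mean())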

[ 0.6124031   0.65891473  0.64341085  0.6744186   0.6328125 ]
0.644391957364


Interpretation

Using 5-fold cross-validation, we achieved an average accuracy score of 64.4%, which closely matches the 63.6% accuracy score we achieved using holdout validation. For simple univariate models, holdout validation is often more than enough, and the similar accuracy scores support this. When you're using multiple features to train a model (multivariate models), performing k-fold cross-validation can give you a better sense of the accuracy you should expect when you use the model on data it wasn't trained on.
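
One way to explore this for a multivariate model is to compare several single holdout splits with a 5-fold average. The sketch below assumes a recent scikit-learn release and uses synthetic gpa and gre values with an invented admission rule, so the numbers it prints say nothing about the real admissions data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic two-feature data with a made-up, noisy admission rule.
rng = np.random.default_rng(1)
n = 600
gpa = rng.uniform(2.0, 4.0, n)
gre = rng.uniform(400, 800, n)
y = ((gpa - 3.0) + (gre - 600) / 200 + rng.normal(0, 0.7, n) > 0).astype(int)
X = np.column_stack([gpa, gre / 100])  # crude rescaling so both features have similar ranges

# Holdout validation: one train/test split per seed, so the score depends on the split.
holdout_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    holdout_scores.append(LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation: every row is used for testing exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=8)
kfold_scores = cross_val_score(LogisticRegression(), X, y, cv=kf)

print("holdout scores:", np.round(holdout_scores, 3))
print("k-fold mean:", round(float(kfold_scores.mean()), 3))
```

The spread among the holdout scores gives a rough sense of how much a single split can mislead, which the k-fold average smooths out.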