## k-fold cross-validation

As introduced in Section 5.3.4, k-fold cross-validation is a robust extension of the hold out method: the data is split into k portions and the hold out procedure is repeated k times, where in each instance (or fold) we treat a different portion as the testing set and the remaining k − 1 portions as the training set. The hold out calculations are then made, as detailed previously, on each fold, and the value of M with the lowest testing error averaged over the k folds is chosen. This produces a more robust choice of M, because potentially poor hold out choices on individual folds are averaged out.
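The procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used to produce the figures in this section: the helper names `k_fold_indices` and `k_fold_select` are our own, and for brevity the usage example swaps in a one-dimensional least squares polynomial fit (via `np.polyfit`) on synthetic data in place of the softmax cost used in the examples that follow.

```python
import numpy as np

def k_fold_indices(num_points, k, seed=0):
    """Randomly partition the indices 0..num_points-1 into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(num_points), k)

def k_fold_select(x, y, candidates, fit, error, k=3, seed=0):
    """Return the candidate with the lowest average testing error,
    together with the average testing error of every candidate."""
    folds = k_fold_indices(len(y), k, seed)
    test_errors = np.zeros((k, len(candidates)))
    for i, test_idx in enumerate(folds):
        # the remaining k - 1 portions form the training set for this fold
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        for j, M in enumerate(candidates):
            model = fit(x[train_idx], y[train_idx], M)
            test_errors[i, j] = error(model, x[test_idx], y[test_idx])
    avg = test_errors.mean(axis=0)  # average each candidate over the k folds
    return candidates[int(np.argmin(avg))], avg

# Toy usage (assumed setup): synthetic 1-D data, least squares polynomial fit.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(30)

fit = lambda x, y, D: np.polyfit(x, y, D)
error = lambda w, x, y: np.mean((np.polyval(w, x) - y) ** 2)

best_D, avg_errors = k_fold_select(x, y, [1, 2, 3, 4, 5, 6, 7, 8], fit, error, k=3)
```

Passing `fit` and `error` as arguments keeps the fold bookkeeping separate from the choice of cost function, so the same loop applies whether the candidates are indexed by M or by D.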

#### <span style="color:#a50e3e;">Example 2: </span> k-fold cross-validation for classification using polynomial features


In Fig. 6.13 we illustrate the result of applying k-fold cross-validation to choose the ideal number M of polynomial features for the dataset shown in Example 6.5, where it was originally used to illustrate the hold out method. As in the previous example, here we set k = 3, use the softmax cost, and try the values M = 2, 5, 9, 14, 20, 27, 35, 44, which correspond (see footnote 5) to polynomial degrees D = 1, 2, . . . , 8 (note that for clarity panels in the figure are indexed by D).
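The listed values of M are consistent with counting every non-constant monomial of degree at most D in two input variables, that is, M = (D + 1)(D + 2)/2 − 1. The footnote itself is not reproduced here, so this reading is an assumption that we have checked only against the values above. A short Python sketch (the function name `num_poly_features` is ours):

```python
from math import comb

def num_poly_features(D, N=2):
    """Number of non-constant monomials of total degree <= D in N variables."""
    return comb(N + D, D) - 1

feature_counts = [num_poly_features(D) for D in range(1, 9)]
print(feature_counts)  # prints [2, 5, 9, 14, 20, 27, 35, 44]
```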

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_6_13.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 6.13:</strong> <em> Result of performing k-fold cross-validation with k = 3 (see text for further details). The top three rows display the result of performing the hold out method on each fold. The left, middle, and right columns show each fold’s training/testing sets (drawn as thick and thin points respectively), training and testing errors over the range of M tried, and the final model (fit to the entire dataset) chosen by picking the value of M providing the lowest testing error. Due to the split of the data, performing hold out on each fold results in a poor overfitting (first two folds) or underfitting (final fold) model for the data. However, as illustrated in the final row, by averaging the testing errors (bottom middle panel) and choosing the model with minimum associated average test error we average out these problems (finding that D⋆ = 4 or M⋆ = 14) and determine an excellent model for the phenomenon (as shown in the bottom right panel). </em>  </figcaption> 
</figure>

In the top three rows of Fig. 6.13 we show the result of applying hold out on each fold. In each row we show a fold’s training and testing data in the left panel, the training/testing errors for each M on the fold (as computed in Equation (6.25)) in the middle panel, and in the right panel the final model (fit to the entire dataset) provided by the choice of M with lowest testing error. As can be seen, the particular split leads to an overfitting result on the first two folds and an underfitting result on the third fold. In the middle panel of the final row we show the result of averaging the training/testing errors over all k = 3 folds, and in the right panel the result of choosing the overall best M⋆ = 14 (or equivalently D⋆ = 4) providing the lowest average testing error. By taking this value we average out the poor choices determined on each fold, and end up with a model that fits both the data and underlying function quite well.
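To make the difference between per-fold choices and the averaged choice concrete, the following sketch compares the two on a small table of testing errors. The numbers are invented purely to illustrate the mechanics; they are not read off Fig. 6.13 and do not reproduce its results.

```python
import numpy as np

degrees = np.arange(1, 9)  # D = 1, ..., 8

# Hypothetical testing errors (rows: folds, columns: D = 1..8).
# These values are made up for illustration only.
test_errors = np.array([
    [0.30, 0.25, 0.20, 0.12, 0.14, 0.13, 0.10, 0.18],
    [0.28, 0.24, 0.18, 0.11, 0.13, 0.09, 0.15, 0.20],
    [0.10, 0.14, 0.13, 0.12, 0.22, 0.30, 0.33, 0.35],
])

# hold out applied separately on each fold: one choice per row
per_fold_choice = degrees[np.argmin(test_errors, axis=1)]

# k-fold: average the testing errors over folds, then choose once
average_choice = degrees[np.argmin(test_errors.mean(axis=0))]

print(per_fold_choice)  # prints [7 6 1]
print(average_choice)   # prints 4
```

In this made-up table no single fold selects D = 4, yet it has the lowest average error because it is never far from the best choice on any fold.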

#### <span style="color:#a50e3e;">Example 3: </span> Warning examples


When a k-fold determined set of features performs poorly this is almost always indicative of a poorly structured dataset (i.e., there is little relationship between the input/output data), like the one shown in the left panel of Fig. 6.14. However, a high performing k-fold model can also be misleading as to how well we understand a phenomenon. In the middle and right panels of the figure we show two such instances that the reader should keep in mind when using k-folds, where we have either too little data (middle panel) or poorly distributed data (right panel).

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_6_14.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 6.14:</strong> <em> (left panel) A low accuracy k-fold fit to a dataset indicates that it has little structure (i.e., that there is little to no relationship between the input and output). It is possible that a high accuracy k-fold fit fails to capture the true nature of an underlying function, as when (middle panel) we have too little data (the k-fold linear separator is shown in black, and the true nonlinear separator is shown dashed) and (right panel) when we have poorly distributed data (again the k-fold separator is shown in black, the original separator dashed). See text for further details. </em>  </figcaption> 
</figure>

In the first instance we have generated a small sample of points based on the second indicator function shown in Fig. 6.3, which has a nonlinear boundary in the original feature space. However, the sample of data is so small that it is perfectly linearly separable, and thus applying k-fold cross-validation with, e.g., polynomial basis features will (due to the small amount of data) recover a line to distinguish between the two classes. Clearly, data generated via the same underlying process in the future will violate this linear boundary, and thus our model will perform poorly. This sort of problem arises in applications such as automatic medical diagnosis (see Example 1.6) where access to data is limited. Unless we can gather additional data to fill out the space (making the nonlinear boundary more visible) this problem is unavoidable.

In the second instance shown in the right panel of the figure, we have plenty of data (generated using the indicator function originally shown in Fig. 6.4) but it is poorly distributed. In particular, we have no samples from the blue class in the lower half of the space. In this case the k-fold method (again here using polynomial features) properly determines a separating boundary that perfectly distinguishes the two classes. However, many of the blue class points we would receive in the future in the lower half of the space will be misclassified given the learned k-fold model. This sort of issue can arise in practice, e.g., when performing face detection (see Example 1.4), if we do not collect a thorough dataset of blue (e.g., “non-face”) examples. Again, unless we can gather further data to fill out the space this problem is unavoidable.

## k-fold cross-validation for one-versus-all multiclass classification

Employing the one-versus-all (OvA) framework for multiclass classification, we can immediately apply the k-fold method described previously. For a C class problem we simply apply the k-fold method to each of the C two class classification problems, and combine the resulting classifiers as shown in Equation (6.21). We show the result of applying k-fold cross-validation with k = 3 and OvA on two datasets with C = 3 and C = 5 classes in Figs. 6.15 and 6.16 respectively, where we have used polynomial features with M = 2, 5, 9, 14, 20, 27, 35, 44 or equivalently of degree D = 1, 2, ..., 8 for each two class subproblem. Displayed in each figure are the nonlinear boundaries determined for each two class subproblem, as well as the combined result in the right panel of each figure. In both instances the combined boundaries separate the different classes of data very well.
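The recipe above can be sketched as follows. This is a hedged illustration under several assumptions of ours: the function names are hypothetical, a least squares fit to ±1 labels stands in for the softmax cost to keep the code short, the toy data is synthetic, and we take Equation (6.21) to be the usual OvA fusion rule that assigns each point to the class whose two class model evaluates largest.

```python
import numpy as np

def poly_features(X, D):
    """Map 2-D inputs to all monomials x1^a * x2^b with 0 <= a + b <= D (bias included)."""
    cols = [X[:, 0] ** a * X[:, 1] ** (d - a) for d in range(D + 1) for a in range(d + 1)]
    return np.stack(cols, axis=1)

def fit_two_class(X, y, D):
    """Least squares fit to labels y in {-1, +1}; a stand-in for the softmax cost."""
    w, *_ = np.linalg.lstsq(poly_features(X, D), y, rcond=None)
    return w

def evaluate(w, X, D):
    """Evaluate a fitted degree-D polynomial model at the rows of X."""
    return poly_features(X, D) @ w

def k_fold_degree(X, y, degrees, k=3, seed=0):
    """Pick the degree with the lowest average misclassification rate over k folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    avg_errors = []
    for D in degrees:
        errs = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            w = fit_two_class(X[train], y[train], D)
            errs.append(np.mean(np.sign(evaluate(w, X[test], D)) != y[test]))
        avg_errors.append(np.mean(errs))
    return degrees[int(np.argmin(avg_errors))]

def ova_k_fold(X, labels, degrees, k=3):
    """Run k-fold on each class-versus-all subproblem, then refit on all the data."""
    models = []
    for c in np.unique(labels):
        y = np.where(labels == c, 1.0, -1.0)
        D = k_fold_degree(X, y, degrees, k)
        models.append((c, D, fit_two_class(X, y, D)))
    return models

def ova_predict(models, X):
    """Assumed fusion rule: assign each point to the class whose model evaluates largest."""
    scores = np.stack([evaluate(w, X, D) for _, D, w in models], axis=1)
    classes = np.array([c for c, _, _ in models])
    return classes[np.argmax(scores, axis=1)]

# Toy usage on synthetic data: three Gaussian blobs, C = 3.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])
X = np.concatenate([c + 0.5 * rng.standard_normal((30, 2)) for c in centers])
labels = np.repeat(np.arange(3), 30)

models = ova_k_fold(X, labels, degrees=[1, 2, 3, 4])
predictions = ova_predict(models, X)
```

Note that each two class subproblem selects its own degree independently, which is why different subproblems can end up with different values of D⋆.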

<figure>
  <img src= '../../mlrefined_images/nonlinear_images/Fig_6_15.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 6.15:</strong> <em> Result of performing k = 3 fold cross-validation on the C = 3 class dataset first shown in Fig. 6.10 using OvA (see text for further details). The left three panels show the result for the red class versus all, blue class versus all, and green class versus all subproblems. For the red/green versus all problems the optimal degree found was D⋆ = 2, while for the blue versus all D⋆ = 4 (note how this produces a better fit than the D = 2 fit shown originally in Fig. 6.10). The right panel shows the combined boundary determined by Equation (6.21), which perfectly separates the three classes. </em>  </figcaption> 
</figure>