##  Nonlinear Supervised Learning Series

# Cross-validation

In the ideal instance of regression, where we look to approximate a continuous function using a fixed or adjustable (neural network) basis of features, we saw in Section 5.1 that using more elements of a basis results in a better approximation (see e.g., Fig. 5.3). In short, in the context of continuous function approximation more (basis elements) is always better. Does the same principle apply in the real instance of regression, i.e., in the case of a noisily sampled function approximation problem? Unfortunately, no.

Take for example the semi-ideal and realistic sinusoidal datasets shown in Fig. 5.13, along with polynomial fits of degrees three (in blue) and ten (in purple). In the left panel of this figure, which shows the discrete sinusoidal dataset with evenly spaced points, by increasing the number of basis features M from 3 to 10 the corresponding polynomial model fits the data and the underlying function better. Conversely, in the right panel while the model fits the data better as we increase the number of polynomial features from M = 3 to 10, the representation of the underlying data-generating function actually gets worse. Since the underlying function is the object we truly wish to understand, this is a problem.

<figure>
  <img src= '../../mlrefined_images/unsupervised_images/Fig_5_13.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 5.13:</strong> <em> Plots of (left panel) discretized and (right panel) noisy samples of the data-generating function
y (x) = sin (2π x), along with its degree three and degree ten polynomial approximations in blue and purple, respectively. While the higher degree polynomial does a better job at modeling both the discretized data and underlying function, it only fits the noisy sample data better, providing a worse approximation of the underlying data-generating function than the lower degree polynomial. Using cross-validation we can determine the more appropriate model, in this case the degree three polynomial, for such a dataset. </em>  </figcaption> 
</figure>

The phenomenon illustrated through this simple example is in fact true more generally: by increasing the number M of any type of basis features (fixed or neural network) we can indeed produce better fitting models of a dataset, but at the potential cost of creating poorer representations of the data-generating function we care foremost about. Stated formally, given any dataset we can drive the value of the Least Squares cost to zero via solving the minimization problem

\begin{equation}
\underset{b,\,\mathbf{w},\Theta}{\mbox{minimize}}\,\,\underset{p=1}{\overset{P}{\sum}}\left(b+\mathbf{f}_{p}^{T}\mathbf{w}-y_{p}\right)^{2}
\end{equation}

by increasing M, where $\mathbf{f}_{p}=\left[\begin{array}{cccc} f_{1}\left(\mathbf{x}_{p}\right) & f_{2}\left(\mathbf{x}_{p}\right) & \cdots & f_{M}\left(\mathbf{x}_{p}\right)\end{array}\right]^{T}$. Therefore, choosing M correctly is extremely important. Note that in the language of machine learning, a model corresponding to too large a choice of M is said to overfit the data. Likewise, when M is chosen too small the model is said to underfit the data. For instance, using a degree M = 1 polynomial feature we can only find the best linear fit to the data in Fig. 5.13, which would be not only a poor fit to the observed data but also a poor representation of the underlying sinusoidal pattern.
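The effect of increasing M on the training fit can be checked numerically. The following is a minimal sketch, assuming polynomial features and a synthetic noisy sample of sin(2πx); the seed, sample size, noise level, and helper names (`poly_features`, `least_squares_fit`, `mse`) are illustrative choices, not values taken from the text.

```python
import numpy as np

def poly_features(x, M):
    """Return the P x M matrix whose columns are x, x^2, ..., x^M."""
    return np.column_stack([x ** m for m in range(1, M + 1)])

def least_squares_fit(F, y):
    """Solve min over (b, w) of sum_p (b + f_p^T w - y_p)^2."""
    A = np.column_stack([np.ones(len(y)), F])  # column of ones carries the bias b
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[0], coeffs[1:]

def mse(F, y, b, w):
    """Mean squared error of the model b + F w on the pairs (F, y)."""
    return float(np.mean((b + F @ w - y) ** 2))

# Synthetic noisy samples of sin(2*pi*x) (assumed setup, for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

# Training error for an underfitting, a moderate, and a very flexible model
train_errors = {}
for M in (1, 3, 10):
    F = poly_features(x, M)
    b, w = least_squares_fit(F, y)
    train_errors[M] = mse(F, y, b, w)
    print(f"M = {M:2d}: training error = {train_errors[M]:.6f}")
```

Because the degree 3 features are a subset of the degree 10 features, the training error cannot increase as M grows. This says nothing about how well each model represents the underlying sinusoid, which is the point of the discussion above.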

In this section we describe cross-validation, an effective framework for choosing the proper value for M automatically and intelligently so as to prevent the problem of underfitting/overfitting. For example, in the case of the data shown in the right panel of Fig. 5.13 cross-validation will determine M = 3 to be the better model, as opposed to M = 10. This discussion will culminate in the description of a specific procedure known as k-fold cross-validation which is commonly used in practice.

#### <span style="color:#a50e3e;">Example 1: </span> Overfitting and underfitting Galileo’s ramp data

In the left panel of Fig. 5.14 we show the data from Galileo’s classic ramp experiment, initially described in Example 1.7, performed in order to understand the relationship between time and the acceleration of an object due to (the force we today know as) gravity. Also shown in this figure are (left panel) the kind of quadratic fit Galileo used to describe the underlying relationship traced out by the data, along with two other possible model choices (right panel): a linear fit in green, as well as a degree 12 polynomial fit in magenta. Of course the linear model is inappropriate, as with this data any line would have large squared error (see e.g., Fig. 3.3) and would thus be a poor representation of the data. On the other hand, while the degree 12 polynomial fits the dataset perfectly, with corresponding squared error value of zero, the model itself just “looks wrong.”

<figure>
  <img src= '../../mlrefined_images/unsupervised_images/Fig_5_14.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 5.14:</strong> <em> Data from Galileo’s simple ramp experiment from Example 3.3, exploring the relationship between time and the distance an object falls due to gravity. (left panel) Galileo fit a simple quadratic to the data. (right panel) A linear model (shown in green) is not flexible enough and as a result, underfits the data. A degree twelve polynomial (shown in magenta) overfits the data, being too complicated and unnatural (between the start and 0.25 of the way down the ramp the ball travels a negative distance!) to be a model of a simple natural phenomenon. </em>  </figcaption> 
</figure>

Examining the right panel of this figure raises an obvious question: why, when traveling between the beginning and a quarter of the way down the ramp, does the distance the ball travels become negative? This kind of behavior does not at all match our intuition or expectation about how gravity should operate on an object. This is why Galileo chose a quadratic, rather than a higher degree polynomial, to fit such a dataset: because he expected that the rules which govern our universe are explanatory yet simple.

This principle, that the rules we use to describe our universe should be flexible yet simple, is often called Occam’s Razor and lies at the heart of essentially all scientific inquiry past and present. Since machine learning can be thought of as a set of tools for making sense of arbitrary kinds of data, i.e., not only data relating to a physical system or law, we want the relationship learned in solving a regression (or classification) problem to also satisfy this basic Occam’s Razor principle. In the context of machine learning, Occam’s Razor manifests itself geometrically, i.e., we expect the model (or function) underlying our data to be simple yet flexible enough to explain the data we have. The linear model in Fig. 5.14, being too rigid and inflexible to establish the relationship between time and the distance an object falls due to gravity, fits very poorly. As previously mentioned, in machine learning such a model is said to underfit the data we have. On the other hand, the degree 12 polynomial model is needlessly complicated, resulting in a very close fit to the data we have, but is far too oscillatory to be representative of the underlying phenomenon and is said to overfit the data.

## Diagnosing the problem of overfitting/underfitting

A reasonable diagnosis of the overfitting/underfitting problems is that both fail at representing new data, generated via the same process by which the current data was made, that we can potentially receive in the future. For example, the overfitting degree ten polynomial shown in the right panel of Fig. 5.13 would poorly model any future data generated by the same process since it poorly represents the underlying data-generating function (a sinusoid). This data-centric perspective suggests a practical criterion for determining an ideal choice of M for a given dataset: the number M of basis features used should be such that the corresponding model fits well to both the current dataset as well as to new data we will receive in the future.

## Hold out cross-validation

While we of course do not have access to any “new data we will receive in the future,” we can simulate such a scenario by splitting our data into two subsets: a larger training set of data we already have, and a smaller testing set of data that we “will receive in the future.” Then, we can try a range of values for M by fitting each to the training set of known data, and pick the one that performs the best on our testing set of unknown data. By keeping a larger portion of the original data as the training set we can safely assume that the learned model which best represents the testing data will also fit the training set fairly well. In short, by employing this sort of procedure for comparing a set of models, referred to as hold out cross-validation, we can determine a candidate that approximately satisfies our criterion for an ideal well-fitting model.

What portion of our dataset should we save for testing? There is no hard rule, and in practice typically between 1/10 and 1/3 of the data is assigned to the testing set. One general rule of thumb is that the larger the dataset (given that it is relatively clean and well distributed) the bigger the portion of the original data that may be assigned to the testing set (e.g., 1/3 may be placed in the testing set) since the data is plentiful enough for the training data to still accurately represent the underlying phenomenon. Conversely, in general with smaller or less rich (i.e., more noisy or poorly distributed) datasets we should assign a smaller portion to the testing set (e.g., 1/10 may be placed in the testing set) so that the relatively larger training set retains what little information of the underlying phenomenon was captured by the original data.


> In general the larger/smaller the original dataset the larger/smaller the portion of the original data that should be assigned to the testing set.

As illustrated in Fig. 5.15, to form the training and testing sets we split the original data randomly into k non-overlapping parts and assign one part to the testing set (1/k of the original data) and the remaining k − 1 parts to the training set ((k − 1)/k of the original data).
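The random split can be sketched as follows. This is a minimal illustration assuming the data is addressed by indices 0, ..., P − 1; the function names `split_into_folds` and `hold_out_split` and the seed are hypothetical, not part of the text.

```python
import numpy as np

def split_into_folds(P, k, seed=0):
    """Randomly split the indices 0..P-1 into k non-overlapping parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(P)
    # array_split tolerates P not being divisible by k (part sizes differ by at most one)
    return np.array_split(shuffled, k)

def hold_out_split(P, k, seed=0):
    """Assign one part (about 1/k of the data) to testing, the other k-1 parts to training."""
    folds = split_into_folds(P, k, seed)
    test_idx = folds[0]                      # the parts are already random, so any one will do
    train_idx = np.concatenate(folds[1:])    # requires k >= 2
    return train_idx, test_idx

train_idx, test_idx = hold_out_split(P=30, k=3)
print(len(train_idx), len(test_idx))
```

With P = 30 and k = 3 this yields 20 training indices and 10 testing indices with no index shared between the two sets.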

<figure>
  <img src= '../../mlrefined_images/unsupervised_images/Fig_5_15.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 5.15:</strong> <em> Hold out cross-validation. The original data (left panel) shown here as the entire circular mass is split randomly (middle panel) into k non-overlapping sets (here k = 3). (right panel) One piece, or 1/k of the original dataset, is then taken randomly as the testing set with the remaining pieces, or (k − 1)/k of the original data, taken as the training set. </em>  </figcaption> 
</figure>

Regardless of the value we choose for k, we train our model on the training set using a range of different values of M. We then evaluate how well each model (or in other words, each value of M) fits to both the training and testing sets, via measuring the model’s training error and testing error, respectively. The best-fitting model is chosen as the one providing the lowest testing error or the best fit to the “unseen” testing data. Finally, in order to leverage the full power of our data we use the optimal number of basis features M to train our model, this time using the entire data (both training and testing sets).

#### <span style="color:#a50e3e;">Example 2: </span> Hold out for regression using Fourier features

To solidify these details, in Fig. 5.16 we show an example of applying hold out cross-validation using a dataset of P = 30 points generated via the function y (x) shown in Fig. 5.3. To perform hold out cross-validation on this dataset we randomly partition it into k = 3 equal-sized (ten points each) non-overlapping subsets, using two partitions together as the training set and the final part as the testing set, as illustrated in the left panel of Fig. 5.16. The points in this panel are colored blue and yellow indicating that they belong to the training and testing sets respectively. We then train our model on the training set (blue points) by solving several instances of the Least Squares problem in (5.18). In particular we use a range of even values for M Fourier features M = 2, 4, 6, . . . , 16 (since Fourier elements naturally come in pairs of two as shown in Equation (5.7)) which corresponds to the range of degrees D = 1, 2, 3, . . . , 8 (note that for clarity panels in the figure are indexed by D).
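The correspondence M = 2D can be made concrete with a short sketch. Equation (5.7) is not reproduced in this section, so the exact form used here, a cosine/sine pair at each frequency m = 1, ..., D, is an assumption, as is the helper name `fourier_features`.

```python
import numpy as np

def fourier_features(x, D):
    """Return the P x 2D matrix of degree-D Fourier features.

    Assumed form: for each m = 1..D the pair cos(2*pi*m*x), sin(2*pi*m*x),
    so that a degree D basis has M = 2D elements.
    """
    x = np.asarray(x, dtype=float)
    columns = []
    for m in range(1, D + 1):
        columns.append(np.cos(2 * np.pi * m * x))
        columns.append(np.sin(2 * np.pi * m * x))
    return np.column_stack(columns)

# Degree D = 5 on P = 30 input points gives M = 10 features per point
F = fourier_features(np.linspace(0, 1, 30), D=5)
print(F.shape)
```

The bias b is kept separate from these features, matching the form b + f_p^T w of the Least Squares cost.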

<figure>
  <img src= '../../mlrefined_images/unsupervised_images/Fig_5_16.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 5.16:</strong> <em> An example of hold out cross-validation applied to a simple dataset using Fourier features. (left panel) The original data split into training and testing sets, with the points belonging to each set colored blue and yellow respectively. (middle eight panels) The fit resulting from each set of degree D Fourier features in the range D = 1, 2, . . . , 8 is shown in blue in each panel. Note how the lower degree fits underfit the data, while the higher degree fits overfit the data. (second from right panel) The training and testing errors, in blue and yellow respectively, of each fit over the range of degrees tested. From this we see that D⋆ = 5 (or M⋆ = 10) provides the best fit. Also note how the training error always decreases as we increase the degree/number of basis elements, which will always occur regardless of the dataset/feature basis type used. (right panel) The final model using M⋆ = 10 trained on the entire dataset (shown in red) fits the data well and closely matches the underlying data generating function (shown in dashed black).</em>  </figcaption> 
</figure>

Based on the models learned for each value of M (see the middle set of eight panels of the figure) we plot training and testing errors (in the panel second from the right), measuring how well each model fits the training and testing data respectively, over the entire range of values. Note that unlike the testing error, the training error always decreases as we increase M (which occurs more generally regardless of the dataset/feature basis used). The model that provides the smallest testing error (M⋆ = 10 or equivalently D⋆ = 5) is then trained again on the entire dataset, giving the final regression model shown in red in the rightmost panel of Fig. 5.16.

## Hold out calculations

Here we give a complete set of hold out cross-validation calculations in a general setting. We denote the collection of points belonging to the training and testing sets respectively by their indices as

\begin{equation}
\begin{array}{c}
\Omega_{\textrm{train}}=\left\{ p\,\vert\,\left(\mathbf{x}_{p},\,y_{p}\right)\,\mbox{belongs to the training set}\right\} \\
\Omega_{\textrm{test}}=\left\{ p\,\vert\,\left(\mathbf{x}_{p},\,y_{p}\right)\,\mbox{belongs to the testing set}\right\}.
\end{array}
\end{equation}

We then choose a basis type (e.g., polynomial, Fourier, neural network) and choose a range for the number of basis features over which we search for an ideal value for M. To determine the training and testing error of each value of M tested we first form the corresponding feature vector $\mathbf{f}_{p}=\left[\begin{array}{cccc} f_{1}\left(\mathbf{x}_{p}\right) & f_{2}\left(\mathbf{x}_{p}\right) & \cdots & f_{M}\left(\mathbf{x}_{p}\right)\end{array}\right]^{T}$ and fit a model to the training set by solving the corresponding Least Squares problem

\begin{equation}
\underset{b,\,\mathbf{w},\Theta}{\mbox{minimize}}\,\,\underset{p\in\Omega_{\textrm{train}}}{\sum}\left(b+\mathbf{f}_{p}^{T}\mathbf{w}-y_{p}\right)^{2}.
\end{equation}

Once again, for a fixed basis this problem may be solved in closed form since $\Theta$ is empty, while for neural networks it must be solved via gradient descent (see e.g., Example [example-Single-hidden-layer-regression-example]).

Denoting a solution to the problem above as $\left(b_{M}^{\star},\,\mathbf{w}_{M}^{\star},\,\Theta_{M}^{\star}\right)$ we find the training and testing errors for the current value of M by simply computing the mean squared error using these parameters over the training and testing sets, respectively

\begin{equation}
\begin{array}{c}
\mbox{Training error}=\frac{1}{\left|\Omega_{\textrm{train}}\right|}\underset{p\in\Omega_{\textrm{train}}}{\sum}\left(b_{M}^{\star}+\mathbf{f}_{p}^{T}\mathbf{w}_{M}^{\star}-y_{p}\right)^{2}\\
\mbox{Testing error}=\frac{1}{\left|\Omega_{\textrm{test}}\right|}\underset{p\in\Omega_{\textrm{test}}}{\sum}\left(b_{M}^{\star}+\mathbf{f}_{p}^{T}\mathbf{w}_{M}^{\star}-y_{p}\right)^{2},
\end{array}
\end{equation}

where the notation $\left|\Omega_{\textrm{train}}\right|$ and $\left|\Omega_{\textrm{test}}\right|$ denotes the cardinality or number of points in the training and testing sets, respectively. Once we have performed these calculations for all values of M we wish to test, we choose the one that provides the lowest testing error, denoted by $M^{\star}$.

Finally we form the feature vector $\mathbf{f}_{p}=\left[\begin{array}{cccc} f_{1}\left(\mathbf{x}_{p}\right) & f_{2}\left(\mathbf{x}_{p}\right) & \cdots & f_{M^{\star}}\left(\mathbf{x}_{p}\right)\end{array}\right]^{T}$ for all the points in the entire dataset, and solve the Least Squares problem over the entire dataset to form the final model

\begin{equation}
\underset{b,\,\mathbf{w},\Theta}{\mbox{minimize}}\,\,\underset{p=1}{\overset{P}{\sum}}\left(b+\mathbf{f}_{p}^{T}\mathbf{w}-y_{p}\right)^{2}.
\end{equation}
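The hold out calculations above translate fairly directly into code. The sketch below assumes a fixed polynomial basis (so that Θ is empty and each Least Squares problem can be solved in closed form) and synthetic data; the function names, seed, and noise level are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def poly_features(x, M):
    """P x M matrix with columns x, x^2, ..., x^M (an assumed fixed basis)."""
    return np.column_stack([x ** m for m in range(1, M + 1)])

def fit(F, y):
    """Least Squares solution (b, w) for the model b + F w."""
    A = np.column_stack([np.ones(len(y)), F])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[0], coeffs[1:]

def mse(F, y, b, w):
    """Mean squared error, i.e. the cost divided by the number of points."""
    return float(np.mean((b + F @ w - y) ** 2))

def hold_out(x, y, train_idx, test_idx, M_values, features=poly_features):
    """Return M_star, the final (b, w) refit on all data, and both error curves."""
    train_err, test_err = {}, {}
    for M in M_values:
        F = features(x, M)
        b, w = fit(F[train_idx], y[train_idx])            # train on Omega_train only
        train_err[M] = mse(F[train_idx], y[train_idx], b, w)
        test_err[M] = mse(F[test_idx], y[test_idx], b, w)
    M_star = min(test_err, key=test_err.get)              # lowest testing error
    final_model = fit(features(x, M_star), y)             # refit on the entire dataset
    return M_star, final_model, train_err, test_err

# Synthetic example (assumed data, for illustration only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
perm = rng.permutation(30)
test_idx, train_idx = perm[:10], perm[10:]                # k = 3: one third held out

M_star, (b_star, w_star), train_err, test_err = hold_out(
    x, y, train_idx, test_idx, range(1, 9))
print("chosen M:", M_star)
```

Passing a different `features` function (for instance a Fourier basis) changes the basis type without altering the procedure. A neural network basis would instead require replacing `fit` with an iterative solver.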

## k-fold cross-validation

While the hold out method previously described is an intuitive approach to determining proper fitting models, it suffers from an obvious flaw: having been chosen at random, the points assigned to the training set may not adequately describe the original data. However, we can easily extend and robustify the hold out method as we now describe.

As illustrated in Fig. 5.17 for k = 3, with k-fold cross-validation we once again randomly split our data into k non-overlapping parts. By combining k − 1 parts we can, as with the hold out method, create a large training set and use the remaining single fold as a test set. With k-fold cross-validation we will repeat this procedure k times (each instance being referred to as a fold), in each instance using a different single portion of the split as testing set and the remaining k − 1 parts as the corresponding training set, and computing the training and testing errors of all values of M as described in the previous section. We then choose the value of M that has the lowest average testing error, a more robust choice than the hold out method provides, that can average out a scenario where one particular choice of training set inadequately describes the original data.

<figure>
  <img src= '../../mlrefined_images/unsupervised_images/Fig_5_17.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 5.17:</strong> <em> k-fold cross-validation for k = 3. The original data shown here as the entire circular mass (top left) is split into k non-overlapping sets (top right) just as with the hold out method. However with k-fold cross-validation we repeat the hold out calculations k times (bottom), once per “fold,” in each instance, keeping a different portion of the split data as the testing set while merging the remaining k − 1 pieces as the training set.</em>  </figcaption> 
</figure>

Note, however, that this advantage comes at a cost: k-fold cross-validation is (approximately) k times more computationally costly than its hold out counterpart. In fact performing k-fold cross-validation is often the most computationally expensive process performed to solve a regression problem.

> Performing k-fold cross-validation is often the most computationally expensive component in solving a general regression problem.


There is again no universal rule for the number k of non-overlapping partitions (or the number of folds) to break the original data into. However, the same intuition previously described for choosing k with the hold out method also applies here, as well as the same convention with popular values of k ranging from k = 3 . . . 10 in practice.

For convenience we provide a pseudo-code for applying k-fold cross-validation in Algorithm 5.1.

### Algorithm 5.1: k-fold cross-validation

<hr style="height:1px;border:none;color:#555;background-color:#555;">
<p style="line-height: 1.7;">
<strong>1:</strong>&nbsp;&nbsp; <strong>input:</strong> dataset $\left\{ \left(\mathbf{x}_{p},\,y_{p}\right)\right\} _{p=1}^{P}$, number of folds $k$, a range of values of $M$ to try, and a type of basis feature <br>

<strong>2:</strong>&nbsp;&nbsp; randomly split the data into $k$ non-overlapping parts of (as nearly as possible) equal size<br>

<strong>3:</strong>&nbsp;&nbsp; <code>for</code> $\,\,s = 1,\ldots,k$<br>

<strong>4:</strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <code>for</code> each $M$ in the range of values to try<br>

<strong>5:</strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; train a model with $M$ basis features on the $s^{th}$ fold's training set (all parts except the $s^{th}$)<br>

<strong>6:</strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; compute the corresponding testing error on the $s^{th}$ part<br>

<strong>7:</strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <code>end for</code><br>

<strong>8:</strong>&nbsp;&nbsp; <code>end for</code><br>

<strong>9:</strong>&nbsp;&nbsp; <strong>output:</strong> $M^{\star}$ with the lowest average testing error over all $k$ folds<br>

<hr style="height:1px;border:none;color:#555;background-color:#555;">
</p>
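The k-fold procedure described in this section can also be sketched in Python. As before, this is a minimal illustration assuming a fixed polynomial basis and synthetic data; `k_fold_cv` and its helpers are hypothetical names, and the seed and noise level are arbitrary.

```python
import numpy as np

def poly_features(x, M):
    """P x M matrix with columns x, x^2, ..., x^M (an assumed fixed basis)."""
    return np.column_stack([x ** m for m in range(1, M + 1)])

def fit(F, y):
    """Least Squares solution (b, w) for the model b + F w."""
    A = np.column_stack([np.ones(len(y)), F])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[0], coeffs[1:]

def mse(F, y, b, w):
    return float(np.mean((b + F @ w - y) ** 2))

def k_fold_cv(x, y, k, M_values, features=poly_features, seed=0):
    """Return the M with lowest average testing error, plus the averaged errors."""
    M_values = list(M_values)
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(y)), k)    # k non-overlapping parts
    test_err = np.zeros((k, len(M_values)))
    for s in range(k):
        test_idx = parts[s]                               # one part held out per fold
        train_idx = np.concatenate([parts[j] for j in range(k) if j != s])
        for i, M in enumerate(M_values):
            F = features(x, M)
            b, w = fit(F[train_idx], y[train_idx])
            test_err[s, i] = mse(F[test_idx], y[test_idx], b, w)
    avg_test_err = test_err.mean(axis=0)                  # average over the k folds
    return M_values[int(np.argmin(avg_test_err))], avg_test_err

# Synthetic example (assumed data, for illustration only)
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
M_star, avg_test_err = k_fold_cv(x, y, k=3, M_values=range(1, 9))
print("chosen M:", M_star)
```

The nested loops make the cost explicit: k times the number of candidate values of M Least Squares problems are solved, compared with one per candidate value for the hold out method. The final model would then be obtained by refitting with M⋆ features on the entire dataset.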

#### <span style="color:#a50e3e;">Example 3: </span> k-fold cross-validation for regression using Fourier features

In Fig. 5.18 we illustrate the result of applying k-fold cross-validation to choose the ideal number M of Fourier features for the dataset shown in Example 5.5, where it was originally used to illustrate the hold out method. As in the previous example, here we set k = 3 and try M in the range M = 2, 4, 6, . . . , 16, which corresponds to the range of degrees D = 1, 2, 3, . . . , 8 (note that for clarity, panels in the figure are indexed by D).

<figure>
  <img src= '../../mlrefined_images/unsupervised_images/Fig_5_18.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 5.18:</strong> <em> Result of performing k-fold cross-validation with k = 3 (see text for further details). The top three rows display the result of performing the hold out method on each fold. The left, middle, and right columns show each fold’s training/testing sets (colored blue and yellow respectively), the training and testing errors over the range of M tried, and the final model (fit to the entire dataset) chosen by picking the value of M providing the lowest testing error. Due to the split of the data, performing hold out on the first fold (top row) results in a poor underfitting model for the data. However, as illustrated in the final row, by averaging the testing errors (bottom middle panel) and choosing the model with minimum associated average test error, we average out this problem (finding that D⋆ = 5 or M⋆ = 10) and determine an excellent model for the phenomenon (as shown in the bottom right panel).</em>  </figcaption> 
</figure>


In the top three rows of Fig. 5.18 we show the result of applying hold out on each fold. In each row we show a fold’s training and testing data colored blue and yellow respectively in the left panel, the training/testing errors for each M on the fold (as computed in Equation (5.26)) in the middle panel, and in the right panel the final model (fit to the entire dataset) provided by the choice of M with lowest testing error. As can be seen in the top row, the particular split of the first fold leads to too low a value of M being chosen, and thus an underfitting model. In the middle panel of the final row we show the result of averaging the training/testing errors over all k = 3 folds, and in the right panel the result of choosing the overall best M⋆ = 10 (or equivalently D⋆ = 5) providing the lowest average testing error. By taking this value we average out the poor choice determined on the first fold, and end up with a model that fits both the data and underlying function quite well.

#### <span style="color:#a50e3e;">Example 4: </span> Leave-one-out cross-validation for Galileo’s ramp data

In Fig. 5.19 we show how using k = P fold cross-validation (since we have only P = 6 data points, intuition suggests, see Section 5.3.2, that we use a large value for k), sometimes referred to as leave-one-out cross-validation, allows us to recover precisely the quadratic fit Galileo made by eye. Note that by choosing k = P this means that every data point will take a turn being the testing set. Here we search over the polynomial basis features of degree M = 1 . . . 6. While not all of the hold out models over the six folds fit the data well, the average k-fold result is indeed the M⋆ = 2 quadratic polynomial fit originally proposed by Galileo!

<figure>
  <img src= '../../mlrefined_images/unsupervised_images/Fig_5_19.png' width="60%" height="auto" alt=""/>
  <figcaption>   
<strong>Figure 5.19:</strong> <em> (six panels on the left) Each fold of training/testing sets shown in blue/yellow respectively of a k-fold run on the Galileo’s ramp data, along with their individual hold out model (shown in blue). Only the model learned on the fourth fold overfits the data. By choosing the model with minimum average testing error over the k = 6 folds we recover the desired quadratic M⋆ = 2 fit originally proposed by Galileo (shown in magenta in the right panel).</em>  </figcaption> 
</figure>
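Leave-one-out cross-validation is simply the k = P case, where each point serves as the testing set exactly once. The sketch below illustrates this on a small synthetic quadratic dataset with P = 6 points; these are stand-in values generated for illustration and are not Galileo's measurements, and `leave_one_out_cv` is a hypothetical helper name.

```python
import numpy as np

def poly_features(x, M):
    """P x M matrix with columns x, x^2, ..., x^M."""
    return np.column_stack([x ** m for m in range(1, M + 1)])

def leave_one_out_cv(x, y, M_values):
    """k = P fold cross-validation: every point takes a turn as the testing set."""
    M_values = list(M_values)
    P = len(y)
    errors = np.zeros((P, len(M_values)))
    for p in range(P):
        train = np.arange(P) != p                      # boolean mask leaving out point p
        for i, M in enumerate(M_values):
            A = np.column_stack([np.ones(P), poly_features(x, M)])
            # lstsq returns the minimum-norm solution when M + 1 exceeds P - 1
            coeffs, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
            errors[p, i] = (A[p] @ coeffs - y[p]) ** 2  # squared error on the held-out point
    avg_err = errors.mean(axis=0)
    return M_values[int(np.argmin(avg_err))], avg_err

# Synthetic stand-in data: a noisy quadratic sampled at P = 6 points (assumed values)
rng = np.random.default_rng(3)
t = np.linspace(0, 1, 6)
d = t ** 2 + 0.01 * rng.standard_normal(6)

M_star, avg_err = leave_one_out_cv(t, d, range(1, 7))
print("chosen degree:", M_star)
print("average testing errors:", np.round(avg_err, 5))
```

On data like this the linear model (M = 1) has a clearly larger average testing error than the quadratic, mirroring the underfitting discussed in Example 1. Which of the higher degrees wins depends on the particular noise draw.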
