In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

The issues associated with validation and 
cross-validation are some of the most important
aspects of the practice of machine learning.  Selecting the optimal model
for your data is vital, and is a piece of the problem that is not often
appreciated by machine learning practitioners.

Of core importance is the following question:

**If our estimator is underperforming, how should we move forward?**

- Use a simpler or more complicated model?
- Add more features to each observed data point?
- Add more training samples?

The answer is often counter-intuitive.  In particular, **sometimes using a
more complicated model will give _worse_ results.**  Also, **sometimes adding
training data will not improve your results.**  The ability to determine
what steps will improve your model is what separates the successful machine
learning practitioners from the unsuccessful.

## Learning Curves and Validation Curves

One way to address this issue is to use what are often called **validation curves**.
Given a particular dataset and a model we'd like to fit (e.g. k-neighbors regression), we'd
like to tune the value of the *hyperparameter* ``n_neighbors`` to give us the best fit.

Let's go back to our regression problem from earlier:

In [None]:
from figures import plot_kneighbors_regularization
plot_kneighbors_regularization()
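
The tuning described above can also be done numerically. The following is a minimal sketch, assuming synthetic noisy sine data as a stand-in for the notebook's own dataset (the data and the `n_neighbors_range` grid are illustrative assumptions). It uses scikit-learn's `validation_curve` to score each value of `n_neighbors` with cross-validation:

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsRegressor

# Illustrative data (an assumption): noisy samples of a sine curve.
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)
X = x[:, np.newaxis]  # scikit-learn expects a 2D feature matrix

# Candidate hyperparameter values (hypothetical grid).
n_neighbors_range = [1, 2, 5, 10, 20, 50]

# Each row holds the per-fold R^2 scores for one value of n_neighbors.
train_scores, valid_scores = validation_curve(
    KNeighborsRegressor(), X, y,
    param_name="n_neighbors", param_range=n_neighbors_range, cv=5)

mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for k, tr, va in zip(n_neighbors_range, mean_train, mean_valid):
    print(f"n_neighbors={k:2d}  train R^2={tr:.3f}  validation R^2={va:.3f}")

# Pick the value with the best mean validation score.
best_n_neighbors = n_neighbors_range[int(np.argmax(mean_valid))]
print("best n_neighbors:", best_n_neighbors)
```

Small values of `n_neighbors` give a very flexible model (the training score for `n_neighbors=1` is perfect, because each training point is its own nearest neighbor), while large values smooth the prediction more heavily. The validation score is the one to use when choosing between them.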

### Learning Curves

Which model is right for a dataset depends critically on how much data we have. More data allows us to be more confident about building a complex model. Let's build some intuition for why that is. Look at the following datasets:


In [None]:
from figures import plot_regression_datasets
plot_regression_datasets()

They all come from the same underlying process. But if you were asked to make a prediction, you would be more likely to draw a straight line for the left-most one, as there are only very few data points and no real rule is apparent. For the dataset in the middle, some structure is recognizable, though the exact shape of the true function is perhaps not obvious. With even more data, on the right-hand side, you would probably be very comfortable drawing a sinusoidal line with a lot of certainty.

A great way to explore how a model fit evolves with different dataset sizes is to use learning curves.
A learning curve plots the training and validation error for a given model against different training set sizes.

But first, take a moment to think about what we're going to see:

**Questions:**

- **As the number of training samples is increased, what do you expect to see for the training error?  For the validation error?**
- **Would you expect the training error to be higher or lower than the validation error?  Would you ever expect this to change?**

We can run the following code to plot the learning curve for a k-neighbors regression model with ``n_neighbors=5``:

# FIXME WORST EXAMPLE EVER

In [None]:
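from figures import make_dataset
# learning_curve lives in sklearn.model_selection
# (the old sklearn.learning_curve module was removed in scikit-learn 0.20)
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsRegressor

x, y = make_dataset(n_samples=100)
X = x[:, np.newaxis]  # scikit-learn expects a 2D feature matrix

# Scores are R^2 values (higher is better), one column per cross-validation fold
training_sizes, train_scores, test_scores = learning_curve(
    KNeighborsRegressor(n_neighbors=5), X, y, cv=10)
plt.plot(training_sizes, train_scores.mean(axis=1), label="training scores")
plt.plot(training_sizes, test_scores.mean(axis=1), label="test scores")
plt.xlabel("number of training samples")
plt.ylabel("score")
plt.legend(loc='best')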

You can see that for the very complex model with ``n_neighbors=2``, the score increase 

Notice that the validation error *generally decreases* with a growing training set,
while the training error *generally increases* with a growing training set.  From
this we can infer that as the training size increases, they will converge to a single
value.

From the above discussion, we know that `d = 1` is a high-bias estimator which
under-fits the data. This is indicated by the fact that both the
training and validation errors are very high. When confronted with this type of learning curve,
we can expect that adding more training data will not help matters: both
lines will converge to a relatively high error.

**When the learning curves have converged to a high error, we have an underfitting model.**

An underfitting model can be improved by:

- Using a more sophisticated model (i.e. in this case, increasing ``d``).
- Gathering more features for each sample.
- Decreasing regularization in a regularized model.

An underfitting model cannot be improved, however, by increasing the number of training
samples (do you see why?).
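
To make the underfitting case concrete, here is a minimal sketch, assuming `d` denotes the degree of a polynomial fit and using synthetic noisy sine data (both the data and the pipeline are illustrative assumptions, not the notebook's `make_dataset` helper). It computes a learning curve for a `d = 1` model in terms of mean squared error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data (an assumption): noisy samples of a sine curve.
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)
X = x[:, np.newaxis]

# d = 1: a straight line, too simple to follow a sine curve.
model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())

# Negated MSE is returned so that "higher is better"; flip the sign back below.
sizes, train_scores, valid_scores = learning_curve(
    model, X, y, cv=5, scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 5))

train_error = -train_scores.mean(axis=1)
valid_error = -valid_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_error, valid_error):
    print(f"n={n:3d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
```

In this sketch the noise variance is 0.09, so an ideal model could reach an error close to that value. The straight-line fit should instead level off at a noticeably higher error for both curves, and adding samples only brings the two curves together without lowering the plateau.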

Now let's look at an over-fit model:

# FIXME ADD EXAMPLE HERE

Here we show the learning curve for `d = 20`. From the above
discussion, we know that `d = 20` is an estimator
which **over-fits** the data. This is indicated by the fact that the
training error is much less than the validation error. As
we add more samples to this training set, the training error will
continue to climb, while the cross-validation error will continue
to decrease, until they meet in the middle. In this case, our
intrinsic error was set to 1.0, and we can infer that adding more
data will allow the estimator to very closely match the best
possible cross-validation error.

**When the learning curves have not yet converged with our full training set, it indicates an over-fit model.**
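
A learning curve of this kind could be produced along the following lines. This is a minimal sketch, assuming `d` is a polynomial degree and using a small synthetic noisy sine dataset with noise variance 0.09 (illustrative assumptions that differ from the intrinsic error of 1.0 quoted above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data (an assumption): a small, noisy sample of a sine curve.
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=60)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)
X = x[:, np.newaxis]

# d = 20: far more flexibility than 60 noisy points can pin down.
model = make_pipeline(PolynomialFeatures(degree=20), LinearRegression())

# Negated MSE is returned so that "higher is better"; flip the sign back below.
sizes, train_scores, valid_scores = learning_curve(
    model, X, y, cv=5, scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.3, 1.0, 5))

train_error = -train_scores.mean(axis=1)
valid_error = -valid_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_error, valid_error):
    print(f"n={n:3d}  train MSE={tr:.3g}  validation MSE={va:.3g}")
```

The exact numbers depend on the random data, but the training error should stay below the validation error at every training set size, and the gap between the two is the signature of over-fitting.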

An overfitting model can be improved by:

- Gathering more training samples.
- Using a less-sophisticated model (i.e. in this case, make ``d`` smaller)
- Increasing regularization.

In particular, gathering more features for each sample will not help the results.

## Summary

We’ve seen above that an under-performing algorithm can be due
to two possible situations: under-fitting and over-fitting.
Using the technique of learning curves, we can train on progressively
larger subsets of the data, evaluating the training error and
cross-validation error to determine whether our algorithm is overfitting or underfitting. But what do we do with this information?

#FIXME redo discussion here

### Underfitting

If our algorithm is **underfitting**, the following actions might help:

- **Add more features**. In our example of predicting home prices,
  it may be helpful to make use of information such as the neighborhood
  the house is in, the year the house was built, the size of the lot, etc.
  Adding these features to the training and test sets can improve
  the fit.
- **Use a more sophisticated model**. Adding complexity to the model can
  help improve the fit. For a polynomial fit, this can be accomplished
  by increasing the degree ``d``. Each learning technique has its own
  methods of adding complexity.
- **Use fewer samples**. Though this will not improve the fit,
  an underfitting algorithm can attain nearly the same error with a smaller
  training sample. For algorithms which are computationally expensive,
  reducing the training sample size can lead to very large improvements
  in speed.
- **Decrease regularization**. Regularization is a technique used to impose
  simplicity in some machine learning models, by adding a penalty term that
  depends on the characteristics of the parameters. If a model is underfitting,
  decreasing the regularization can lead to better results.
  
### Overfitting

If our algorithm shows signs of **overfitting**, the following actions might help:

- **Use fewer features**. Using a feature selection technique may be
  useful, and decrease the over-fitting of the estimator.
- **Use a simpler model**.  Model complexity and over-fitting go hand-in-hand.
- **Use more training samples**. Adding training samples can reduce
  the effect of over-fitting.
- **Increase regularization**. Regularization is designed to prevent
  over-fitting, so increasing regularization
  can lead to better results for overfitting models.
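
The effect of regularization strength can be checked in the same way as any other hyperparameter. The following is a minimal sketch, assuming a degree-20 polynomial model with a ridge penalty on synthetic noisy sine data. The dataset, the pipeline, and the `alphas` grid are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative data (an assumption): a small, noisy sample of a sine curve.
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=60)
y = np.sin(x) + rng.normal(scale=0.3, size=x.shape)
X = x[:, np.newaxis]

# Flexible model plus an L2 penalty; features are standardized so the
# penalty treats every polynomial term on the same scale.
model = make_pipeline(
    PolynomialFeatures(degree=20, include_bias=False),
    StandardScaler(),
    Ridge(solver="svd"))  # SVD solver copes with near-collinear features

# Sweep the penalty strength from very weak to very strong (hypothetical grid).
alphas = np.logspace(-4, 3, 8)
train_scores, valid_scores = validation_curve(
    model, X, y, param_name="ridge__alpha", param_range=alphas,
    cv=5, scoring="neg_mean_squared_error")

train_error = -train_scores.mean(axis=1)
valid_error = -valid_scores.mean(axis=1)
for a, tr, va in zip(alphas, train_error, valid_error):
    print(f"alpha={a:8.4f}  train MSE={tr:.3g}  validation MSE={va:.3g}")

# The alpha with the lowest mean validation error.
best_alpha = alphas[int(np.argmin(valid_error))]
print("best alpha:", best_alpha)
```

Very small values of `alpha` leave the model free to over-fit, while very large values shrink it toward an underfitting model, so the training error rises as `alpha` grows. The validation error indicates where the balance lies for a given dataset.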

These choices become very important in real-world situations. For example,
due to limited telescope time, astronomers must seek a balance between
observing a large number of objects, and observing a large number of
features for each object. Determining which is more important for a
particular learning task can inform the observing strategy that the
astronomer employs. In a later exercise, we will explore the use of
learning curves for the photometric redshift problem.