# Footnotes
Let's review the classic evaluation recipes:
* Simple hold-out validation
* K-fold validation
* Iterated K-fold validation with shuffling

### Simple Hold-out validation
Set apart some fraction of your data as your test set. Train on the remaining data, and
evaluate on the test set. As you saw in the previous sections, in order to prevent
information leaks, you shouldn’t tune your model based on the test set, and therefore you
should also reserve a validation set.
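A minimal sketch of the hold-out protocol with NumPy, assuming the whole dataset fits in one array. The function name `holdout_split`, the 20% fractions, and the seed are illustrative choices, not values from the text.

```python
import numpy as np

def holdout_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle `data`, then carve out test and validation sets; the rest is training data."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))  # shuffle so the split is not order-dependent
    num_test = int(len(data) * test_fraction)
    num_val = int(len(data) * val_fraction)
    test_idx = indices[:num_test]
    val_idx = indices[num_test:num_test + num_val]
    train_idx = indices[num_test + num_val:]
    return data[train_idx], data[val_idx], data[test_idx]

data = np.arange(100).reshape(50, 2)  # 50 toy samples with 2 features each
train, val, test = holdout_split(data)
print(train.shape, val.shape, test.shape)
```

The validation set is what you tune against; the test set is touched only once, at the very end.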

#### Cons
This is the simplest evaluation protocol, and it suffers from one flaw: if little data is
available, then your validation and test sets may contain too few samples to be
statistically representative of the data at hand.


### K-Fold Validation
With this approach, you split your data into K partitions of equal size. For each
partition i, train a model on the remaining K – 1 partitions, and evaluate it on partition i.
Your final score is then the average of the K scores obtained.
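The procedure can be sketched roughly as follows. This is an illustration, not a reference implementation: `train_and_evaluate` is a hypothetical callback that trains a fresh model and returns a validation score, and `mean_baseline` is a stand-in "model" used only so the example runs.

```python
import numpy as np

def k_fold_score(data, targets, k, train_and_evaluate):
    """Average the validation score over K folds (assumes k >= 2)."""
    # array_split tolerates lengths that are not an exact multiple of k
    folds = np.array_split(np.arange(len(data)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_evaluate(
            data[train_idx], targets[train_idx], data[val_idx], targets[val_idx]))
    return float(np.mean(scores))

def mean_baseline(x_train, y_train, x_val, y_val):
    # Stand-in model: always predict the training mean; the score is validation MAE.
    prediction = y_train.mean()
    return float(np.abs(y_val - prediction).mean())

x = np.arange(12, dtype="float32").reshape(12, 1)
y = np.ones(12, dtype="float32")
score = k_fold_score(x, y, k=4, train_and_evaluate=mean_baseline)
print(score)
```

Each sample is used for validation exactly once and for training K – 1 times.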

### Iterated K-fold With Shuffling
This one is for situations in which you have relatively little data available and you need
to evaluate your model as precisely as possible. I’ve found it to be extremely helpful in
Kaggle competitions. It consists of applying K-fold validation multiple times, shuffling
the data every time before splitting it K ways. The final score is the average of the
scores obtained at each run of K-fold validation. Note that you end up training and
evaluating `P × K` models (where `P` is the number of iterations you use), which can be very
expensive.
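A rough sketch of the iterated variant, under the same assumptions as a plain K-fold loop: `train_and_evaluate` is a hypothetical callback, and `counting_baseline` is a toy stand-in that also records how many models were trained.

```python
import numpy as np

def iterated_k_fold_score(data, targets, k, iterations, train_and_evaluate, seed=0):
    """Run K-fold validation `iterations` times, reshuffling before each run."""
    rng = np.random.default_rng(seed)
    run_scores = []
    for _ in range(iterations):
        order = rng.permutation(len(data))  # new shuffle before each K-way split
        folds = np.array_split(order, k)
        fold_scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            fold_scores.append(train_and_evaluate(
                data[train_idx], targets[train_idx], data[val_idx], targets[val_idx]))
        run_scores.append(np.mean(fold_scores))
    return float(np.mean(run_scores))

calls = []

def counting_baseline(x_train, y_train, x_val, y_val):
    # Stand-in model: predict the training mean and record that a "model" was trained.
    calls.append(len(x_val))
    return float(np.abs(y_val - y_train.mean()).mean())

x = np.arange(20, dtype="float32").reshape(20, 1)
y = 2.0 * x[:, 0]
score = iterated_k_fold_score(x, y, k=4, iterations=3,
                              train_and_evaluate=counting_baseline)
print(len(calls))  # 12, that is P x K = 3 x 4 models trained
```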


## Data preprocessing, feature engineering, and feature learning
* Vectorization
* Value normalization

### Vectorization
Whatever data you need to process (sound, images, text), you must first turn it into
tensors, a step called data vectorization. For instance, in the two previous
text-classification examples, we started from text represented as lists of integers
(standing for sequences of words), which had to be turned into tensors.
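As an illustration, here is one way lists of integers could be turned into a tensor. The multi-hot encoding shown is an assumption (the text above does not say which encoding was used), and `vectorize_sequences` and `dimension` are names chosen for this sketch.

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    """Multi-hot encode lists of word indices into a float32 matrix."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for row, sequence in enumerate(sequences):
        results[row, sequence] = 1.0  # mark every word index present in the sequence
    return results

sequences = [[0, 3], [1, 1, 4]]  # toy data: two "texts" as word indices
x = vectorize_sequences(sequences, dimension=5)
print(x)
```

Note that this encoding discards word order and repetition; it only records which words occur.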

### Value Normalization
When features take values in different ranges, you should normalize each feature
independently before feeding the data into your network, so that it has a mean of 0
and a standard deviation of 1.
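Per-feature normalization might look like the sketch below. Computing the statistics on the training data only and reusing them for other data is an added assumption (a common practice to avoid leaking information), and the sketch assumes no feature is constant, since that would give a standard deviation of 0.

```python
import numpy as np

def standardize(train_data, other_data):
    """Scale each feature (column) to mean 0 and standard deviation 1."""
    mean = train_data.mean(axis=0)  # one mean per feature
    std = train_data.std(axis=0)    # one standard deviation per feature
    # Statistics come from the training data and are reused for the other set.
    return (train_data - mean) / std, (other_data - mean) / std

train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 200.0]])
train_norm, test_norm = standardize(train, test)
print(train_norm.mean(axis=0), train_norm.std(axis=0))
```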

In general, it isn’t safe to feed into a neural network data that takes relatively large
values (for example, multidigit integers, which are much larger than the initial values taken
by the weights of a network) or data that is heterogeneous (features whose values lie in
very different ranges).


### What should be done
To make learning easier for your network, your data should:
* Take small values - typically in the 0-1 range
* Be homogeneous - that is, all features should take values in roughly the same range

## Overfitting and Underfitting
The fundamental issue in machine learning is the tension between optimization
and generalization.

**Optimization** refers to the process of adjusting a model to get the
best performance possible on the training data. A model that can still improve
this way is in an underfit state.

The process of fighting overfitting is called **regularization**. Let’s review
some of the most common regularization techniques.

### Reduce the network's size
The simplest way to prevent overfitting is to reduce the size of the model: the number
of learnable parameters in the model (which is determined by the number of layers
and the number of units per layer).
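To make the link between layers, units, and parameter count concrete, the sketch below counts parameters for a stack of fully connected layers with biases. The dense-layer assumption, the input size of 10000, and the layer widths are all illustrative and not taken from the text.

```python
def count_dense_parameters(input_dim, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for units in layer_units:
        total += previous * units + units  # weight matrix plus bias vector
        previous = units
    return total

larger = count_dense_parameters(10000, [16, 16, 1])
smaller = count_dense_parameters(10000, [4, 4, 1])
print(larger)   # 160305
print(smaller)  # 40029
```

Cutting the hidden layers from 16 to 4 units shrinks the model to roughly a quarter of its original parameter count.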

If the network has limited memorization resources, it won’t be able to simply memorize
a mapping from training samples to their targets; thus, in order to minimize its loss, it will have to
resort to learning compressed representations that have predictive power regarding
the targets—precisely the type of representations we’re interested in. At the same
time, keep in mind that you should use models that have enough parameters that they
don’t underfit: