# Chapter 4

# Validation tests

## Simple hold-out validation

Split the available training data into a training set and a validation set. Train on the first and evaluate on the second. That's all.
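A minimal sketch of the hold-out split in NumPy, assuming the data fits in memory as arrays. The names `holdout_split` and `num_validation` and the toy data are illustrative, not part of any library.

```python
import numpy as np

def holdout_split(data, labels, num_validation, seed=0):
    """Shuffle the samples, then reserve the first `num_validation` for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))  # same shuffle for data and labels
    data, labels = data[order], labels[order]
    val = (data[:num_validation], labels[:num_validation])
    train = (data[num_validation:], labels[num_validation:])
    return train, val

# Toy data standing in for a real training set (hypothetical shapes).
x = np.arange(100, dtype="float32").reshape(50, 2)
y = np.arange(50)
(train_x, train_y), (val_x, val_y) = holdout_split(x, y, num_validation=10)
```

Shuffling before the split matters when the samples are stored in some order (for example, sorted by class).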

In [2]:
from IPython.display import Image
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
Image(url= "https://s3-us-west-2.amazonaws.com/mishalaskin/dl/simpleholdoutvalidation.png")



## K fold validation

Split the data into $K$ partitions of equal size. For each partition $i$, train a model on the remaining $K-1$ partitions and evaluate it on partition $i$. The final score is the average of the $K$ scores.

To get a more reliable estimate, you can run K-fold validation multiple times, shuffling the data each time before splitting it into $K$ partitions. The final score is the average over all runs.
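The splitting logic can be sketched in NumPy as follows. This only produces index arrays; training a fresh model per fold and averaging the scores is left out. The function names `k_fold_indices` and `iterated_k_fold` are illustrative assumptions, and the sketch assumes $K \geq 2$.

```python
import numpy as np

def k_fold_indices(num_samples, k):
    """Yield one (train_idx, val_idx) pair per fold."""
    folds = np.array_split(np.arange(num_samples), k)
    for i in range(k):
        val_idx = folds[i]
        # Everything except fold i is used for training.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx

def iterated_k_fold(num_samples, k, iterations, seed=0):
    """Repeat K-fold `iterations` times, reshuffling before each run."""
    rng = np.random.default_rng(seed)
    for _ in range(iterations):
        order = rng.permutation(num_samples)
        for train_idx, val_idx in k_fold_indices(num_samples, k):
            yield order[train_idx], order[val_idx]

splits = list(k_fold_indices(10, k=5))
repeated = list(iterated_k_fold(10, k=5, iterations=3))
```

Note that the iterated version trains `iterations * k` models, so it can get expensive.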

In [5]:
Image(url= "https://s3-us-west-2.amazonaws.com/mishalaskin/dl/kfoldvalidation.png")



# Data preprocessing

## Vectorization

Turning raw data into tensors (typically of floating-point values) to feed the neural network. One-hot encoding is one form of vectorization.
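As an illustration, here is a small NumPy sketch of one-hot encoding integer class labels. The helper name `one_hot` is an assumption made for this example; Keras ships its own utility for this (`keras.utils.to_categorical`).

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into a (num_samples, num_classes) float tensor."""
    labels = np.asarray(labels)
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    # Set a single 1 per row, at the column given by the label.
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

encoded = one_hot([0, 2, 1], num_classes=3)
```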

## Value normalization

Values should be normalized to a small range, typically the $[0, 1]$ interval.

For learning to be easy for the network, the data should

1. Take small values: typically most values should be in the 0-1 range.
2. Be homogeneous: all features should take values in roughly the same range.

In addition, the following stricter normalization convention is common and can help:

3. Normalize each feature independently to have a mean of 0.
4. Normalize each feature independently to have a standard deviation of 1.
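A sketch of per-feature normalization to zero mean and unit standard deviation in NumPy. It assumes 2D arrays of shape `(samples, features)`; the name `standardize`, the small `eps` guard against division by zero, and the random toy data are all illustrative choices. The statistics are computed on the training data only and reused for the test data, so no information from the test set leaks into preprocessing.

```python
import numpy as np

def standardize(train, test, eps=1e-7):
    """Scale each feature using the mean and std of the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / (std + eps), (test - mean) / (std + eps)

# Toy data in a pixel-like 0-255 range (hypothetical).
rng = np.random.default_rng(0)
train = rng.uniform(0, 255, size=(100, 3))
test = rng.uniform(0, 255, size=(20, 3))
train_n, test_n = standardize(train, test)

# The simpler alternative for bounded data: rescale to the [0, 1] interval.
train_01 = train / 255.0
```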

## Missing values

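Enter missing values as 0, provided 0 is not already a meaningful value. The network needs to see such samples during training so that it learns that 0 means missing data and starts ignoring it. If missing values can appear at test time but not in your training data, artificially create training samples with missing entries.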
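A small NumPy sketch of both ideas, assuming missing entries are stored as `NaN`. The helper names `fill_missing` and `drop_features_at_random` and the `rate` parameter are hypothetical, chosen for this example.

```python
import numpy as np

def fill_missing(x):
    """Replace NaN entries with 0."""
    return np.where(np.isnan(x), 0.0, x)

def drop_features_at_random(x, rate, seed=0):
    """Zero out a random fraction of entries so training data contains 'missing' values."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) < rate
    return np.where(mask, 0.0, x)

filled = fill_missing(np.array([[1.0, np.nan], [np.nan, 4.0]]))
augmented = drop_features_at_random(np.ones((1000, 4)), rate=0.1)
```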

## Feature engineering

Transforming the raw data, using what you know about the problem, to make learning easier. For example, if your data is a collection of stock price chart images, training on the images themselves is computationally expensive. Instead you can extract the coordinates of the points on each chart and use those as the input data.

Basically just use common sense when preparing input data.

## Overfitting / Underfitting

If you can, get more training data. More data beats everything.

If you can't, try regularization.

# Regularization techniques

## 1. Reducing network size

Capacity is the number of learnable parameters in the network. Large networks have high capacity; as capacity grows without bound, the network can approximate almost any nonlinear function.

But this also means it can quickly overfit the training data. Taking parameters out of the model (fewer layers or fewer units per layer) makes it simpler and less likely to overfit.
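To make "number of parameters" concrete, here is a sketch that counts the weights and biases of a stack of fully connected layers. The helper `dense_param_count` is hypothetical; the `(10000,)` input and the 16-unit layers mirror the Keras example later in these notes, while the 4-unit variant is an assumed smaller alternative.

```python
def dense_param_count(input_dim, layer_units):
    """Count parameters of stacked dense layers: weights plus biases per layer."""
    total = 0
    for units in layer_units:
        total += input_dim * units + units
        input_dim = units  # this layer's output feeds the next layer
    return total

big = dense_param_count(10000, [16, 16, 1])    # 160305 parameters
small = dense_param_count(10000, [4, 4, 1])    # 40029 parameters
```

Shrinking the hidden layers from 16 to 4 units cuts the capacity to roughly a quarter.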

## 2. Weight regularization

Limits the values the weights can take by adding a cost for large weights to the loss function. Forcing the weights toward small values simplifies the model, which makes it less likely to overfit.

There are two common types of weight regularization

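1. L1 regularization - $cost \propto \sum_i |w_i|$ (the absolute values of the weights)
2. L2 regularization - $cost \propto \sum_i w_i^2$ (the squared values of the weights, also called weight decay)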

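Here is an example of adding L2 regularization to a layer in Keras (it assumes `from keras import layers, regularizers` and an existing `model`):

model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu'))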

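The value 0.001 is the decay coefficient: every weight in the layer adds 0.001 times its squared value to the total loss. The larger the coefficient, the stronger the regularization. The penalty is only added during training, so the training loss will be higher than the test loss.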
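What the penalty adds to the loss can be sketched in plain NumPy. The function names `l1_penalty` and `l2_penalty` and the toy weight matrix are assumptions for illustration; this is not Keras code.

```python
import numpy as np

def l1_penalty(weights, coeff):
    """L1 cost: coefficient times the sum of absolute weight values."""
    return coeff * np.sum(np.abs(weights))

def l2_penalty(weights, coeff):
    """L2 cost: coefficient times the sum of squared weight values."""
    return coeff * np.sum(np.square(weights))

w = np.array([[1.0, -2.0], [0.5, 0.0]])  # toy weight matrix
l1_cost = l1_penalty(w, 0.001)
l2_cost = l2_penalty(w, 0.001)
```

The L2 penalty grows quadratically, so it punishes a few large weights more than many small ones.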

Here are the weight regularizers you can add in Keras:

L1:
regularizers.l1(0.001)

L2:
regularizers.l2(0.001)

L1 and L2:
regularizers.l1_l2(l1=0.001, l2=0.001)

## 3. Dropout

Randomly drops out (sets to zero) a fraction of a layer's output features during training. If a handful of features are artificially dominating the network, dropout combats this bias by taking them out at random.

![Image](https://s3-us-west-2.amazonaws.com/mishalaskin/dl/dropout.png)


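In Keras, dropout is simple: add a `Dropout` layer right after the layer whose output you want to drop. Here's an example (it assumes `from keras import models, layers`):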

model = models.Sequential()

model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))

model.add(layers.Dropout(0.5))

model.add(layers.Dense(16, activation='relu'))

model.add(layers.Dropout(0.5))

model.add(layers.Dense(1, activation='sigmoid'))

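The 0.5 argument is the dropout rate, i.e. the fraction of features that are zeroed out (here 50%). Dropout is only active during training; at test time no features are dropped.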
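What a dropout layer does to its input can be sketched in NumPy. This is the "inverted dropout" variant, where the surviving values are scaled up during training so nothing needs rescaling at test time. The function name `dropout` and its `training` and `seed` arguments are illustrative, not the Keras signature.

```python
import numpy as np

def dropout(layer_output, rate, training, seed=0):
    """Zero out a random fraction `rate` of features while training."""
    if not training:
        return layer_output  # no features are dropped at test time
    rng = np.random.default_rng(seed)
    keep_mask = rng.random(layer_output.shape) >= rate
    # Scale up the survivors so the expected sum of the output is unchanged.
    return layer_output * keep_mask / (1.0 - rate)

activations = np.ones((100, 100))
dropped = dropout(activations, rate=0.5, training=True)
unchanged = dropout(activations, rate=0.5, training=False)
```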

# Final thoughts

## Choosing last layer activation function

![Image](https://s3-us-west-2.amazonaws.com/mishalaskin/dl/lastlayeractivation.png)


