# HDS-LEE Course on Hyperparameter Optimization - Part 1

<img src='https://raw.githubusercontent.com/DLR-SC/Hyperparameter_tutorial/master/img/hds_lee_title.png' width=500px>

In [None]:
import keras
from keras import models
from keras import layers
import numpy as np
keras.__version__

## Overview

The aim of this tutorial is the following:

* Presentation of a (typical) regression problem. In the following, we consider the 'Boston house price'
regression problem.
* A (very) short introduction of Keras.
* Hyperparameter optimization - The manual approach (i.e. without using a dedicated library like Talos)

Note that most of the following code samples can be found in Chapter 3, Section 6 of 
[Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff).
The author of this book has made his code available on [GitHub](https://github.com/fchollet/deep-learning-with-python-notebooks).
This notebook is an adaptation that allows for hyperparameter optimization with Talos.

### Table of Contents

##### 1. <a href=#one>Regression problem</a>
##### 2. <a href=#two>Keras -  a very short introduction </a>
##### 3. <a href=#three>Hyperparameter optimization (manual approach)</a>

----

## 1. Introduction of the regression problem: Predicting Boston house prices <a name="one"></a> 

We will be attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given a few data points about the 
suburb at the time, such as the crime rate, the local property tax rate, etc.

The dataset we will be using has very few data points, only 506 in 
total, split between 404 training samples and 102 test samples, and each "feature" in the input data (e.g. the crime rate is a feature) has 
a different scale. For instance, some values are proportions, which take values between 0 and 1, others take values between 1 and 12, 
others between 0 and 100...

Let's take a look at the data (index 0: number of samples, index 1: number of features):

In [None]:
from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) =  boston_housing.load_data()

In [None]:
train_data.shape

In [None]:
test_data.shape


As you can see, we have 404 training samples and 102 test samples. The data comprises 13 features. The 13 features in the input data are as 
follows:

1. Per capita crime rate.
2. Proportion of residential land zoned for lots over 25,000 square feet.
3. Proportion of non-retail business acres per town.
4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. Nitric oxides concentration (parts per 10 million).
6. Average number of rooms per dwelling.
7. Proportion of owner-occupied units built prior to 1940.
8. Weighted distances to five Boston employment centres.
9. Index of accessibility to radial highways.
10. Full-value property-tax rate per $10,000.
11. Pupil-teacher ratio by town.
12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town.
13. % lower status of the population.

The targets are the median values of owner-occupied homes, in thousands of dollars:

In [None]:
train_targets


The prices are typically between 10,000 and 50,000 dollars. Compared to current prices this sounds very cheap, but remember that these are
house prices from the 1970s, not adjusted for inflation.

## 2. Keras - a very short introduction <a name="two"></a> 

### Preparing the data


It would be problematic to use values for a neural network that all take wildly different ranges. The network might be able to
automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal 
with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we 
will subtract the mean of the feature and divide by the standard deviation, so that the feature will be centered around 0 and will have a 
unit standard deviation. This is easily done in Numpy:

In [None]:
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

test_data -= mean
test_data /= std


Note that the quantities that we use for normalizing the test data have been computed using the training data. We should never use in our 
workflow any quantity computed on the test data, even for something as simple as data normalization.
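
As a minimal sketch of this rule on made-up numbers (the arrays `toy_train` and `toy_test` are illustrative and not part of the Boston data), the statistics are computed once on the training split and then reused for the test split:

```python
import numpy as np

# Toy data, illustrative only: 4 training rows, 2 test rows,
# 2 features on very different scales
toy_train = np.array([[0.1, 10.0], [0.2, 20.0], [0.3, 30.0], [0.4, 40.0]])
toy_test = np.array([[0.25, 50.0], [0.5, 25.0]])

# The statistics come from the training split only
toy_mean = toy_train.mean(axis=0)
toy_std = toy_train.std(axis=0)

toy_train_norm = (toy_train - toy_mean) / toy_std
toy_test_norm = (toy_test - toy_mean) / toy_std

print(toy_train_norm.mean(axis=0))  # close to 0 for every feature
print(toy_train_norm.std(axis=0))   # close to 1 for every feature
print(toy_test_norm.mean(axis=0))   # generally not 0: the test split is only shifted and scaled
```

The normalized test split is not expected to have exactly zero mean and unit standard deviation. That is fine, since it is transformed in the same way as the data the network was trained on.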

### Building a simple neural network with Keras


Because we have no previous knowledge about optimal parameters or an optimal network architecture, we will be using a very small network 
with just one hidden layer of 32 units. In general, the less training data you have, the worse overfitting will be, and using 
a small network is one way to mitigate overfitting.

In [None]:
def build_model():
    # Because we will need to instantiate
    # the same model multiple times,
    # we use a function to construct it.
    model = models.Sequential()
    model.add(layers.Dense(32, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model
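
To get a feeling for how small this network is, its trainable parameters can be counted by hand. The following is a sketch that assumes the default `Dense` layer with a bias term; the variable names are introduced only for illustration:

```python
n_features = 13   # number of input features in the Boston housing data
n_hidden = 32     # units in the hidden layer of build_model()

# A Dense layer has one weight per (input, unit) pair plus one bias per unit
hidden_params = n_features * n_hidden + n_hidden
output_params = n_hidden * 1 + 1
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 448 33 481
```

The total should agree with the parameter count reported by `build_model().summary()`. Even this small network has more parameters than there are training samples (404), which is one reason why overfitting is a concern here.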


Our network ends with a single unit and no activation function (i.e. it is a linear layer).
This is a typical setup for scalar regression (i.e. regression where we are trying to predict a single continuous value). 
Applying an activation function would constrain the range that the output can take; for instance if 
we applied a `sigmoid` activation function to our last layer, the network could only learn to predict values between 0 and 1. Here, because 
the last layer is purely linear, the network is free to learn to predict values in any range.
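
The effect of an output activation can be illustrated with plain NumPy. This is only a sketch: the `sigmoid` helper and the values in `raw_outputs` are chosen for illustration and are not part of the model above.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative pre-activation values of the last layer
raw_outputs = np.array([-50.0, -5.0, 0.0, 5.0, 50.0])

print(raw_outputs)           # a linear output layer passes these through unchanged
print(sigmoid(raw_outputs))  # squashed into the range between 0 and 1
```

Since the targets in this problem lie roughly between 10 and 50, a sigmoid output could never come close to them without rescaling the targets.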

Note that we are compiling the network with the `mse` loss function -- Mean Squared Error, the mean of the squared differences between the 
predictions and the targets, a widely used loss function for regression problems.

We are also monitoring a metric during training: `mae`. This stands for Mean Absolute Error. It is the mean of the absolute values of the 
differences between the predictions and the targets. For instance, an MAE of 0.5 on this problem would mean that our predictions are off by 
\$500 on average.
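
Both quantities are easy to compute by hand. The following sketch uses made-up predictions and targets (in thousands of dollars, not taken from the dataset) to show the arithmetic:

```python
import numpy as np

# Illustrative values in thousands of dollars, not taken from the dataset
y_true = np.array([20.0, 30.0, 25.0, 15.0])
y_pred = np.array([22.0, 29.0, 24.0, 18.0])

errors = y_pred - y_true        # [ 2, -1, -1,  3]
mse = np.mean(errors ** 2)      # (4 + 1 + 1 + 9) / 4 = 3.75
mae = np.mean(np.abs(errors))   # (2 + 1 + 1 + 3) / 4 = 1.75

print(mse, mae)  # 3.75 1.75
```

Because the errors are squared, the MSE penalizes the single large error (3) more strongly than the MAE does. The MAE stays in the units of the target, so 1.75 here would correspond to an average error of 1,750 dollars.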

### Validating our approach using K-fold validation


To evaluate our network while we keep adjusting its parameters (such as the number of epochs used for training), we could simply split the 
data into a training set and a validation set. However, because we have so few data points, the 
validation set would end up being very small (e.g. about 100 examples). A consequence is that our validation scores may change a lot 
depending on _which_ data points we choose to use for validation and which we choose for training, i.e. the validation scores may have a 
high _variance_ with regard to the validation split. This would prevent us from reliably evaluating our model.

The best practice in such situations is to use K-fold cross-validation. It consists of splitting the available data into K partitions 
(typically K=4 or 5), then instantiating K identical models, and training each one on K-1 partitions while evaluating on the remaining 
partition. The validation score for the model used would then be the average of the K validation scores obtained.
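
For the 404 training samples used here and K=4, the partition boundaries can be sketched as follows. The helper `fold_slices` is a hypothetical name introduced only for this illustration:

```python
def fold_slices(n_samples, k):
    """Return the (start, stop) index pairs of the k validation partitions."""
    fold_size = n_samples // k
    return [(i * fold_size, (i + 1) * fold_size) for i in range(k)]

print(fold_slices(404, 4))
# [(0, 101), (101, 202), (202, 303), (303, 404)]
```

If the number of samples is not divisible by K, the integer division leaves a few samples at the end that never appear in a validation partition. With this slicing scheme they are simply part of the training data in every fold. The full procedure, including model training, builds on the same slicing.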

In terms of code, this is straightforward:

In [None]:
# hint: if the code runs too slowly, you can set k=2
k = 4 # typically K=4 or 5
num_val_samples = len(train_data) // k
num_epochs = 50
all_scores = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition # i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    model.fit(partial_train_data, partial_train_targets,
              epochs=num_epochs, batch_size=1, verbose=0)
    # Evaluate the model on the validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)

In [None]:
# the outputs of the k different runs
all_scores

In [None]:
# the average as a more reliable value
np.mean(all_scores)


As you can see, the different runs do indeed show rather different validation scores, from 2.1 to 2.9. Their average ($\approx$ 2.4) is a much more 
reliable metric than any single one of these scores -- that's the entire point of K-fold cross-validation. In this case, we are off by 2,400 dollars on 
average, which is still significant considering that the prices range from 10,000 to 50,000 dollars. 
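
Besides the mean, the spread of the fold scores is worth reporting. The sketch below uses made-up scores chosen to match the range quoted above; they are not actual outputs of the notebook, and your own values in `all_scores` will differ from run to run:

```python
import numpy as np

# Made-up fold scores for illustration only (MAE in thousands of dollars)
example_scores = [2.1, 2.3, 2.9, 2.3]

mean_score = np.mean(example_scores)
std_score = np.std(example_scores)

print(f"MAE: {mean_score:.2f} +/- {std_score:.2f}")  # MAE: 2.40 +/- 0.30
```

A large standard deviation relative to the differences between two candidate models suggests that the comparison between them is not yet conclusive.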

Let's try to optimize some hyperparameters of our neural network to obtain better results.

## 3. Hyperparameter optimization (manual approach) <a name="three"></a> 

List of neural network hyperparameters that can be optimized (not complete):

* Number of hidden layers
* Number of neurons in each layer
* batch size
* number of epochs (length of training)
* Regularization (dropout, L1/L2-regularization)
* Optimizer + Learning rate
* Loss function
* Activation function


## Tasks ## 
__Exercise 1:__
 - Play around with the different hyperparameters to build a better regression model.
 - Can you achieve a better result than <code>build_better_model()</code>?
 - What are the important hyperparameters for this specific application?

In [None]:
from keras import regularizers

def build_better_model_task():

    model = models.Sequential()
    
    # replace 'number_of_neurons' with a value of your choice, e.g. 8, 16, 32, 64, 128
    number_of_neurons = 32
    model.add(layers.Dense(number_of_neurons, activation='relu', input_shape=(train_data.shape[1],)))
    # maybe add a second hidden layer?
    # model.add(layers.Dense(XXX, activation='relu'))    
    
    # or a third hidden layer?
    # model.add(layers.Dense(XXX, activation='relu'))      
    
    # try dropout regularization, here: XXX \in [0, 1.]
    # model.add(layers.Dropout(0.1))
    
    # or a layer with L1/L2-regularization
    # model.add(layers.Dense(32, activation='relu', kernel_regularizer = regularizers.l2(0.001)))   
    
    # or a different activation function elu, tanh, sigmoid, ...
    # model.add(layers.Dense(XXX, activation='elu'))  
    
    model.add(layers.Dense(1))    
    
    # try other optimizers such as Adam, Nadam, ... or maybe a different loss function?
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

Let's have a look at how your better model performs. Keep in mind that the number of epochs (`num_epochs`)
is also a hyperparameter.

In [None]:
# just a copy of the previous cell in which 'build_model' has been replaced with 'build_better_model_task()'

num_epochs = 50 # <- further hyperparameter, try different values

k = 4 # typically K=4 or 5
num_val_samples = len(train_data) // k
all_scores = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition # i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # Build the improved Keras model (already compiled)
    model = build_better_model_task()
    # Train the model (in silent mode, verbose=0)
    model.fit(partial_train_data, partial_train_targets,
              epochs=num_epochs, batch_size=1, verbose=0)
    # Evaluate the model on the validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)

In [None]:
# the outputs of the k different runs
all_scores

In [None]:
# the average as a more reliable value
np.mean(all_scores)
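
Since the cross-validation loop is now used a second time, it can be convenient to wrap it in a function instead of copying the cell for every new model. The following is a sketch of such a refactoring. The name `k_fold_mae` and its signature are assumptions made for this illustration; the body follows the loop used above.

```python
import numpy as np

def k_fold_mae(build_fn, data, targets, k=4, num_epochs=50, batch_size=1):
    """Return the validation MAE of each of the k folds.

    build_fn must return a fresh, compiled model whose evaluate()
    returns (loss, mae), as build_model() does in this notebook.
    """
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        # Validation data: partition i
        val_data, val_targets = data[start:stop], targets[start:stop]
        # Training data: all other partitions
        part_data = np.concatenate([data[:start], data[stop:]], axis=0)
        part_targets = np.concatenate([targets[:start], targets[stop:]], axis=0)

        # A new model per fold, so that no weights leak between folds
        model = build_fn()
        model.fit(part_data, part_targets,
                  epochs=num_epochs, batch_size=batch_size, verbose=0)
        _, val_mae = model.evaluate(val_data, val_targets, verbose=0)
        scores.append(val_mae)
    return scores
```

In this notebook it could be called as `all_scores = k_fold_mae(build_better_model_task, train_data, train_targets)`. Passing the build function as an argument makes it easy to compare several candidate models with identical folds.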

### Try optimized model on the test set


Once we are done tuning the hyperparameters of our model, we 
can train a final "production" model on all of the training data, with the best parameters, then look at its performance on the test data:

In [None]:
# Get a fresh, compiled model.
model = build_better_model_task()
# Train it on the entirety of the training data.
model.fit(train_data, train_targets,
          epochs=50, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)

In [None]:
# Final result on test set
test_mae_score

## Wrapping up


Here's what you should take away from this example:

* When there is little data available, using K-fold validation is a great way to reliably evaluate a model.
* When little training data is available, it is preferable to use a small network with very few hidden layers (typically only one or two), 
in order to avoid severe overfitting.
* Finding the optimal hyperparameters manually can be very difficult.

In the next chapter, you will use Talos to simplify the finding of adequate hyperparameters.