## Regression Analysis with Keras

Neural networks are not only able to perform classification tasks, as previously introduced, but can also forecast continuous values. This type of prediction task is often called regression or regression analysis. In principle, the network architectures are constructed analogously to the previous examples, except for a few changes. First, the activation function of the output neuron needs to be changed. A value between zero and one, as the sigmoid function would output, is meaningless for a target that is not confined to that range. Therefore, it is replaced by the linear identity function $f(x)=x$. Second, and equally important, is the choice of the loss function to be optimized. Instead of measuring a binary choice, the distance between the prediction $\hat{y}$ and the continuous target $y$ needs to be expressed. There is a wide variety of possible loss functions. In practice, however, the mean squared error (MSE) $\overline{(y - \hat{y})^2}$ or the mean absolute error (MAE) $\overline{|y - \hat{y}|}$ is frequently used.
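
To make the two loss functions concrete, here is a minimal NumPy sketch. The helper names `mean_squared_error` and `mean_absolute_error` and the sample values are made up for illustration; they are not part of the notebook's code and not Keras' own implementation.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# hypothetical targets and predictions, for illustration only
y_true = np.array([10.0, 8.0, 12.0, 9.0])
y_pred = np.array([9.0, 8.0, 15.0, 9.5])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print('MSE:', mse)
print('MAE:', mae)
```

In this toy example the single large miss (three units) dominates the MSE, while the MAE weights every deviation linearly. This is one reason the MAE is often considered less sensitive to outliers.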


### Setup

For a detailed explanation of the modules used here, please refer to the respective sections in the [introductory notebook](0_MNIST_dataset.ipynb) and [logistic regression notebook](1_logistic_regression.ipynb).

In [None]:
import h5py
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
plt.style.use('seaborn')

SEED = 42
np.random.seed(SEED)
tf.set_random_seed(SEED)

from keras.callbacks import Callback
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import Sequence
from keras.regularizers import l2
from keras import optimizers

### Loading the Data

This time, we are going to work on a different dataset called 'abalone'. It stems from the field of biology and records physical properties of 4177 abalones ("sea snails"), like their height, width, weight etc. Again, the data is stored in an HDF5 file with two datasets, one containing the actual data, and another with the names of the measured properties.

In [None]:
def load_data(path='abalone.h5'):
    """
    Loads a dataset and its column names from the HDF5 file specified by the path.
    It is assumed that the HDF5 dataset containing the data is called 'data' and the columns are called 'columns'.

    Parameters
    ----------
    path : str, optional
        The absolute or relative path to the HDF5 file, defaults to abalone.h5.

    Returns
    -------
    data_and_columns : tuple(np.array[samples, features], list[str])
        a tuple with a numpy array containing the data and a list of the column names
    """
    with h5py.File(path, 'r') as handle:
        return np.array(handle['data']), list(column.decode('utf-8') for column in handle['columns'])
    
data, columns = load_data()
data.shape, columns

### The Data at a Glance

Let's have a very brief look at the data by plotting each of the nine features in a histogram. You will note one curiosity about the 'sex' column, which is actually categorical. The creators of the dataset have encoded three possible values: -1 for male, 1 for female and additionally 0 for infants. None of the other observables show any peculiarities.

The prediction task at hand is to forecast how old an abalone is, which is its ring count plus 1.5 years. It is possible to determine the ring count using a combination of colorants and microscopic work. This is a somewhat time-consuming and expensive procedure, which we would like to replace by a solid prediction based on proxy values, i.e. physical properties of the abalone.

In [None]:
def plot_bins(data, columns):
    """
    Plots the histograms of each of the passed columns of the data set.

    Parameters
    ----------
    data : np.array([samples, features])
        The data to be plotted.
    columns : list
        The corresponding column names
    """
    features = data.shape[1]
    figure, axis = plt.subplots(3, features // 3, figsize=(16, 8))
    axis = np.array(axis).flatten()
    
    for i, variable in enumerate(columns):
        axis[i].hist(data[:, i], bins=30)
        axis[i].set_xlabel(variable)
        axis[i].set_ylabel('count')
        
    plt.tight_layout()
    plt.show()
    
plot_bins(data, columns)

### Data Preprocessing

The abalone dataset is already in good shape when it comes to data cleaning, i.e. no missing values, removed noise, etc., but it needs to be preprocessed further before we are able to feed it to a neural network. While somewhat close, the value ranges of the input variables are spread differently. We need to normalize them to a common scale. Here, the mean of each individual feature will be subtracted and the result divided by the feature's standard deviation. This way, all variables are expressed as multiples of their own variation. It is important to determine these values from the training data partition alone. Otherwise, information from the test data would leak into the training process. Additionally, we are going to split off the feature to be predicted (rings) and divide the data into training and test data.
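
The normalization step can be sketched in isolation as follows. The example uses synthetic random data as a stand-in (not the abalone measurements), and the names `features`, `train_scaled` and `test_scaled` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# synthetic stand-in for two features on very different scales
features = np.column_stack([
    rng.normal(loc=0.5, scale=0.1, size=200),
    rng.normal(loc=50.0, scale=20.0, size=200),
])

# partition first, so that the statistics never see the test samples
split_point = int(features.shape[0] * 0.75)
train, test = features[:split_point], features[split_point:]

# mean and standard deviation are computed from the training partition only
mean, sigma = train.mean(axis=0), train.std(axis=0)

# the same training statistics are applied to both partitions
train_scaled = (train - mean) / sigma
test_scaled = (test - mean) / sigma

print('train mean/std:', train_scaled.mean(axis=0), train_scaled.std(axis=0))
print('test mean/std: ', test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

After scaling, the training features have zero mean and unit standard deviation by construction. The test features only come close to these values, which is expected: they were transformed with statistics they did not contribute to.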

In [None]:
def split_and_preprocess(data, train_fraction=3.0/4.0):    
    """
    Preprocesses the dataset by normalizing each feature with its mean and standard deviation
    and partitioning the data into training and test data and corresponding labels.

    Parameters
    ----------
    data : np.array([samples, features])
        The data to be preprocessed and normalized
    train_fraction : float
        Fraction of samples to be assigned to the training dataset

    Returns
    -------
    train_data, train_labels, test_data, test_labels : 
    tuple(np.array[train samples, features], np.array[train samples],
          np.array[test samples, features], np.array[test samples]
    )
        a tuple with four numpy arrays containing the training and test data and labels
    """
    split_point = int(data.shape[0] * train_fraction)
    
    # split the data in train, test and corresponding labels
    label_index = -1
    train_labels, test_labels = data[:split_point, label_index], data[split_point:, label_index]
    train, test = data[:split_point, :label_index], data[split_point:, :label_index]
    
    # calculate the mean and standard deviation for each feature for normalization
    mean, sigma = train.mean(axis=0), train.std(axis=0)
    # do not normalize the categorical 'sex' column
    mean[0], sigma[0] = 0.0, 1.0
    
    return (train - mean) / sigma, train_labels, (test - mean) / sigma, test_labels
    
train_data, train_labels, test_data, test_labels = split_and_preprocess(data)

### Building an Appropriate Network

As described in the introduction, we will be using one of the previously introduced neural network architectures, the fully-connected network. The most notable changes are the linear activation function of the output neuron, the MAE loss function and a different optimizer called Nadam [1] for faster convergence.

In [None]:
def build_model(data):
    """
    Constructs a fully-connected neural network model for the given dataset

    Parameters
    ----------
    data : np.array[samples, features]
        the training data, used to determine the number of input features

    Returns
    -------
    model : keras.Model
        the fully-connected neural network
    """
    model = Sequential()
    
    model.add(Dense(20, activation='tanh', kernel_regularizer=l2(0.1), input_shape=(data.shape[1],)))
    model.add(Dense(5, activation='tanh', kernel_regularizer=l2(0.1)))
    model.add(Dropout(0.2))
    
    model.add(Dense(1, activation='linear'))
    
    model.compile(optimizer=optimizers.Nadam(lr=1e-4), loss='mae')
    
    return model

model = build_model(train_data)
model.summary()

### Tracking the Training Progress

Tracking the training history of a Keras model using a callback is already familiar to you. Here, we have made a slight modification to track the loss only epoch-wise. In this case, the $\texttt{history}$ object returned by Keras' $\texttt{fit}$ call would provide you with the same information.

In [None]:
class TrainingHistory(Callback):
    """
    Class for tracking the training progress/history of the neural network. Implements the keras.Callback interface.
    """
    def on_train_begin(self, logs):
        self.loss = []
        self.validation_loss = []
            
    def on_epoch_end(self, _, logs):
        """
        Callback invoked after each training epoch.
        Tracks the training and validation loss in the respective members.

        Parameters
        ----------
        _ : int
            unused, int corresponding to the epoch number
        logs : dict{str -> float}
            a dictionary mapping from the observed quantity to the actual value
        """
        if 'loss' in logs:
            self.loss.append(logs['loss'])
        if 'val_loss' in logs:
            self.validation_loss.append(logs['val_loss'])

### Training the Model

Training a regression model works exactly the same way as training a convolutional neural network in Keras.

In [None]:
def train_model(model, train_data, train_labels, test_data, test_labels, epochs=500, batch_size=32):
    """
    Trains a fully-connected neural network given training and test data/labels.

    Parameters
    ----------
    model : keras.Model
        the fully-connected neural network
    train_data : np.array[train samples, features]
        the training data
    train_labels : np.array[train samples]
        the labels, aka. the vector containing the ring count
    test_data : np.array[test samples, features]
        the test data
    test_labels : np.array[test samples]
        the labels, aka. the vector containing the ring count
    epochs: positive int, optional
        the number of epochs for which the neural network is trained, defaults to 500
    batch_size: positive int, optional
        the size of the training batches, defaults to 32

    Returns
    -------
    history : TrainingHistory
        the tracked training and test history
    """
    history = TrainingHistory()
    model.fit(
        train_data, train_labels, validation_data=(test_data, test_labels),
        epochs=epochs, batch_size=batch_size, shuffle=True, callbacks=[history]
    )
    
    return history
    
history = train_model(model, train_data, train_labels, test_data, test_labels)

### Visualization of the Training Progress

Using matplotlib, we are plotting the model's loss during the training phase. It should result in an almost textbook-like smooth decay.

In [None]:
def plot_history(history):
    """
    Plots the epoch-wise training and validation loss.

    Parameters
    ----------
    history : TrainingHistory
        an instance of TrainingHistory monitoring callback
    """
    figure, axis = plt.subplots(1, 1, figsize=(16, 5))
    
    # plot the training and validation loss
    axis.set_xlabel('epoch')
    axis.set_ylabel('loss')
    
    epochs = np.arange(len(history.loss))
    axis.plot(epochs, history.loss, color='C0', label='loss')
    axis.plot(epochs, history.validation_loss, color='C4', label='validation loss')
    axis.set_ylim(bottom=0.0)
    
    # display a legend
    axis.legend(loc=1)
    plt.show()

plot_history(history)

### Very Brief Evaluation of the Model

Using Keras' evaluate function we can once again obtain the loss of the model on the test data. You should be seeing an MAE of roughly 1.75, meaning that on average our prediction is off by about that many rings. So is this prediction useful? That is highly dependent on the particular application domain of the predictor. However, we may see whether the network is better than the mere fluctuation within the data.

Note: a further optimization of the network will result in an MAE of roughly 1.0.

In [None]:
print('MAE of the prediction:', model.evaluate(test_data, test_labels))
print('Standard deviation of the labels:', test_labels.std())
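
The standard deviation printed above is a squared-error quantity, so it is not measured in quite the same way as the MAE. A comparison in like-for-like terms is the MAE of a trivial predictor that always outputs a single constant. The sketch below uses the median of the training labels, since the median is the constant that minimizes the mean absolute error. The helper name `constant_baseline_mae` and the toy values are assumptions for illustration and not part of the original notebook.

```python
import numpy as np

def constant_baseline_mae(train_labels, test_labels):
    """MAE on the test labels when always predicting the training median."""
    prediction = np.median(train_labels)
    return np.mean(np.abs(np.asarray(test_labels, dtype=float) - prediction))

# hypothetical ring counts, for illustration only
toy_train = np.array([7.0, 9.0, 10.0, 11.0, 13.0])
toy_test = np.array([8.0, 10.0, 12.0, 14.0])

baseline = constant_baseline_mae(toy_train, toy_test)
print('MAE of the constant baseline:', baseline)
```

In the notebook, calling `constant_baseline_mae(train_labels, test_labels)` would give a number that the model's MAE can be compared against directly. A model that does not clearly beat this baseline has learned little beyond the typical ring count.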

### References

[1] **Incorporating Nesterov Momentum into Adam**, *Dozat, Timothy*, (2016).