## Introduction

Deep learning has become a hype, and everyone wants to try it for better model performance. The embarrassing thing, however, is that a neural network is sometimes (maybe often) no better than a primitive linear model baseline, not to mention xgboost or lightgbm. One reason is that neural networks are sensitive to the distribution and scale of their inputs, which usually needs far less attention when fitting linear or tree-based models.

Normalizing the input features (centring them and scaling them to unit variance) should fix most of the problem, but it cannot help inside a deep neural network, where the distribution of each layer's outputs can shift during training. The batch normalization layer was introduced to normalize the outputs of neural network layers and so achieve faster and more stable convergence.
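
As a rough illustration of the input normalization described above, here is a minimal NumPy sketch. The helper names `fit_standardizer` and `standardize` and the synthetic two-column data are my own assumptions, not part of the original notebook. The statistics are computed on the training split only and then reused for the validation split.

```python
import numpy as np

def fit_standardizer(x_train, eps=1e-8):
    """Per-feature mean and standard deviation of the training data (hypothetical helper)."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0) + eps   # eps guards against constant columns
    return mean, std

def standardize(x, mean, std):
    """Centre each feature and scale it to (roughly) unit variance."""
    return (x - mean) / std

rng = np.random.default_rng(0)
# made-up features on very different scales
x_train = rng.normal(loc=[0.5, 400.0], scale=[0.1, 170.0], size=(404, 2))
x_valid = rng.normal(loc=[0.5, 400.0], scale=[0.1, 170.0], size=(102, 2))

mean, std = fit_standardizer(x_train)
z_train = standardize(x_train, mean, std)
z_valid = standardize(x_valid, mean, std)   # reuse the training statistics
print(z_train.mean(axis=0), z_train.std(axis=0))
```

The validation features will not be exactly zero-mean and unit-variance, which is expected: they are scaled with the training statistics so that no information leaks from the validation set.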

In this post, I will compare a neural network model with a linear regression baseline and show how a batch normalization layer speeds up the training process.

In [1]:
import pandas as pd
import numpy as np

# keras was used to build a basic neural network (multilayer perceptron)
from keras.layers import Input, Dense, BatchNormalization, Activation
from keras.models import Model, Sequential
from keras.callbacks import EarlyStopping, TensorBoard

# sklearn was used for train/valid split, linear model and regression metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr

Using TensorFlow backend.


## Data
The Boston housing data (which can be downloaded from http://lib.stat.cmu.edu/datasets/boston) is used in this demonstration. I didn't find a good way to load this dataset directly, so some hacks are used to parse the data correctly.

The columns are defined in the data dictionary:

* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - proportion of owner-occupied units built prior to 1940
* DIS - weighted distances to five Boston employment centres
* RAD - index of accessibility to radial highways
* TAX - full-value property-tax rate per `$10,000`
* PTRATIO - pupil-teacher ratio by town
* B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT - % lower status of the population
* MEDV - Median value of owner-occupied homes in `$1000's`


`MEDV` is the target variable we are going to predict.

In [2]:
# load the data
res = pd.read_fwf('./boston.txt', skiprows = 22, header = None)
rows_odd = res.loc[np.arange(0,len(res),2)].reset_index(drop = True)
rows_even = res.loc[np.arange(1,len(res),2)].reset_index(drop = True)
res = pd.concat([rows_odd, rows_even], axis=1).iloc[:,:14]
res.columns = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS',
               'RAD','TAX','PTRATIO','B1000','LSTAT','MEDV']
res.head()

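| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B1000 | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |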
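
`read_fwf` infers the column boundaries from the first rows of the file, so it may be worth cross-checking the result against a parse that only relies on whitespace. The sketch below is an alternative under stated assumptions, not the loader used in this post: it assumes each record is wrapped over two lines with 14 whitespace-separated values in total, and the helper name `parse_wrapped_records` is hypothetical. The sample text mimics the first two records shown above.

```python
COLUMNS = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
           'RAD', 'TAX', 'PTRATIO', 'B1000', 'LSTAT', 'MEDV']

def parse_wrapped_records(text, skiprows=0):
    """Split on whitespace and regroup the tokens into 14-value records."""
    lines = text.splitlines()[skiprows:]
    tokens = [float(tok) for line in lines for tok in line.split()]
    if len(tokens) % len(COLUMNS) != 0:
        raise ValueError('token count is not a multiple of %d' % len(COLUMNS))
    return [dict(zip(COLUMNS, tokens[i:i + len(COLUMNS)]))
            for i in range(0, len(tokens), len(COLUMNS))]

# two records laid out the way I assume the raw file wraps them
sample = """\
 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30
  396.90   4.98  24.00
 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80
  396.90   9.14  21.60
"""
records = parse_wrapped_records(sample)
print(records[0]['CRIM'], records[1]['MEDV'])
```

For the real file, the text would be read from `./boston.txt` with `skiprows=22`, and `pd.DataFrame(records)` would turn the list of dictionaries into a data frame.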


In [3]:
res.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B1000 | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 1.71629 | 11.166008 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.696228 | 4.332016 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 2.65351 | 22.991219 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 1.999689 | 1.417166 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 0.5857 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.0819 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.0737 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.250895 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.1073 | 4.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 2.326718 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.112625 | 5.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 9.96654 | 95.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 9.2229 | 8.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


In [4]:
train, valid = train_test_split(res, train_size = 0.8)
print('train set: ' + str(train.shape))
print('valid set: ' + str(valid.shape))

train set: (404, 14)
valid set: (102, 14)


In [5]:
train = train.iloc[:,:13].values, train.iloc[:,13].values
valid = valid.iloc[:,:13].values, valid.iloc[:,13].values

## Linear regression model – Baseline

Let’s define a function that scores models using mean absolute error (MAE), mean squared error (MSE) and Pearson correlation. It can be reused for subsequent models and saves us some typing.

In [6]:
def score_model(model, valid):
    x, y = valid
    preds = model.predict(x).reshape(-1)
    label = y
    mae = mean_absolute_error(label, preds)
    mse = mean_squared_error(label, preds)
    cor = pearsonr(label, preds)[0]
    return mae, mse, cor

# Then we simply train and evaluate a vanilla linear model from sklearn.
model_lm = LinearRegression()
model_lm.fit(X = train[0], y = train[1])
print('linear model => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_lm, valid)))

linear model => mae: 3.79, mse: 30.17, cor :0.84


This is a simple linear regression model without regularization. Its validation mean absolute error is 3.79, which serves as the baseline for the models that follow.
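
For reference, the three scores reported by `score_model` can be written out in a few lines of NumPy. This is only a sketch of the formulas as I understand them. The function names are my own, and the predictions in `y_pred` are made up (the targets are the first four `MEDV` values shown earlier).

```python
import numpy as np

def mean_abs_err(y, p):
    """MAE: average absolute difference between labels and predictions."""
    return np.mean(np.abs(y - p))

def mean_sq_err(y, p):
    """MSE: average squared difference between labels and predictions."""
    return np.mean((y - p) ** 2)

def pearson_cor(y, p):
    """Pearson correlation: covariance divided by the product of the spreads."""
    yc, pc = y - y.mean(), p - p.mean()
    return np.sum(yc * pc) / np.sqrt(np.sum(yc ** 2) * np.sum(pc ** 2))

y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 32.7, 35.4])   # hypothetical predictions

print(mean_abs_err(y_true, y_pred),
      mean_sq_err(y_true, y_pred),
      pearson_cor(y_true, y_pred))
```

MAE and MSE penalize the size of the errors (MSE more heavily for large ones), while the correlation only measures how well the predictions track the ordering and spread of the labels, so a model can have a high correlation and still be biased.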

## Comparison with a single-layer neural network
Next, a single perceptron (one dense unit with a linear activation) is used for a fair comparison with the simple linear model.



In [7]:
model_mlp = Sequential()
model_mlp.add(Dense(1, kernel_initializer='normal',
                    activation='linear', input_dim = 13))
model_mlp.compile('adam','mean_squared_error')
model_mlp.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1)                 14        
Total params: 14
Trainable params: 14
Non-trainable params: 0
_________________________________________________________________


After the first 200 epochs of training, the model performance is disappointingly poor (MAE @ 5.08).



In [8]:
model_mlp.fit(x = train[0], y = train[1], 
              epochs = 200, batch_size = 32, 
              verbose=0)
print('mlp model => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp, valid)))

mlp model => mae: 5.08, mse: 51.39, cor :0.73


When we fit for another 2000 epochs, the score gets close to the linear model baseline. This implies that the model is as capable as a linear model, but its convergence is extremely slow.



In [9]:
model_mlp.fit(x = train[0], y = train[1], 
              epochs = 2000, batch_size = 32, 
              verbose=0)
print('mlp model => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp, valid)))

mlp model => mae: 3.77, mse: 32.89, cor :0.83
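
One way to build intuition for the slow convergence is a toy experiment with plain gradient descent on a linear model. This is a simplified sketch, not a reproduction of the run above: it uses full-batch gradient descent instead of Adam, the data are synthetic, and the learning rates are my own choices (each is picked to be stable for its feature scaling).

```python
import numpy as np

def gd_mse(x, y, lr, steps):
    """Full-batch gradient descent on a linear model with bias; returns the final MSE."""
    xb = np.hstack([x, np.ones((len(x), 1))])   # append a bias column
    w = np.zeros(xb.shape[1])
    for _ in range(steps):
        grad = 2.0 * xb.T @ (xb @ w - y) / len(y)
        w -= lr * grad
    return np.mean((xb @ w - y) ** 2)

rng = np.random.default_rng(0)
n = 400
# two made-up features on very different scales
x = np.column_stack([rng.normal(0.5, 0.1, n), rng.normal(400.0, 170.0, n)])
y = 10.0 * x[:, 0] + 0.02 * x[:, 1] + rng.normal(0.0, 1.0, n)

z = (x - x.mean(axis=0)) / x.std(axis=0)        # standardized copy of the features

mse_raw = gd_mse(x, y, lr=1e-6, steps=200)      # step size limited by the large-scale feature
mse_std = gd_mse(z, y, lr=0.1, steps=200)
print(mse_raw, mse_std)
```

On the raw features, the largest-scale column dictates how small the step size has to be, so the weights for the other column and the bias barely move in 200 steps. After standardization, all directions can be updated at a similar rate.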


## Neural networks with batch normalization
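Now we define two more functions to test different model scenarios: `build_mlp` builds a network with optional batch normalization layers (one on the input and one between the hidden dense layer and its activation), and `train_model` trains it with early stopping and TensorBoard logging.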


In [10]:
def build_mlp(hidden_dim = None, bn = False):
 
    input_layer = Input((13,), name = 'input')
 
    if bn:   # add batch normalization layer
        out = BatchNormalization()(input_layer)
    else:
        out = input_layer
 
    if hidden_dim is not None:
        out = Dense(hidden_dim)(out)
        if bn:
            out = BatchNormalization()(out)
        out = Activation('relu')(out)
 
    out = Dense(1, kernel_initializer='normal',
                activation='linear')(out)
    model = Model(input_layer, out)
    model.compile('adam','mean_squared_error',['mae'])
    return model
 
def train_model(model, train, valid, epochs, batch_size, model_name):
 
    es = EarlyStopping(patience=5)
    tb = TensorBoard(log_dir='./tensorboard/'+ model_name)
    callbacks = [es, tb]
 
    model.fit(x=train[0], y = train[1],
              batch_size=batch_size,
              verbose = 0,
              epochs=epochs,
              callbacks = callbacks,
              shuffle=True,
              validation_data = valid)
    return model
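
To make it clearer what the `BatchNormalization` layers in `build_mlp` compute during training, here is a minimal NumPy sketch of the forward pass. It is my own simplified version, not the Keras implementation: the function names are hypothetical, the mini-batch is synthetic, and the `eps` and `momentum` defaults are chosen to match what I believe are the Keras defaults (0.001 and 0.99).

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize each feature with the statistics of the current mini-batch."""
    mu = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # centred, roughly unit variance
    return gamma * x_hat + beta, mu, var      # learnable scale and shift

def update_moving(moving, batch_stat, momentum=0.99):
    """Exponential moving average kept for use at inference time."""
    return momentum * moving + (1.0 - momentum) * batch_stat

rng = np.random.default_rng(0)
# made-up mini-batch of 16 rows with two features on very different scales
batch = rng.normal(loc=[6.0, 400.0], scale=[0.7, 170.0], size=(16, 2))
gamma, beta = np.ones(2), np.zeros(2)

out, mu, var = batch_norm_train(batch, gamma, beta)
moving_mean = update_moving(np.zeros(2), mu)
print(out.mean(axis=0), out.std(axis=0))
```

Placing such a layer directly on the input, as `build_mlp` does when `bn = True`, acts much like standardizing the features, except that the statistics come from each mini-batch during training and from the moving averages at prediction time.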

Let’s run some scenarios to examine the model performance with a fixed maximum number of epochs.



In [11]:
# define models for different scenarios
model_mlp_baseline = build_mlp()    # baseline neural network
model_mlp_bn = build_mlp(bn = True)   # neural network + batch normalization
model_mlp_bn_h16 = build_mlp(hidden_dim=16, bn = True)    # extra hidden layer
model_mlp_bn_h64 = build_mlp(hidden_dim=64, bn = True)    # extra wider hidden layer

In [12]:
epochs = 200
batch_size = 16

In [13]:
# train all the models
model_mlp_baseline = train_model(model_mlp_baseline, train, valid, epochs, batch_size, 'mlp_baseline')
model_mlp_bn = train_model(model_mlp_bn, train, valid, epochs, batch_size, 'mlp_bn')
model_mlp_bn_h16 = train_model(model_mlp_bn_h16, train, valid, epochs, batch_size, 'mlp_bn_h16')
model_mlp_bn_h64 = train_model(model_mlp_bn_h64, train, valid, epochs, batch_size, 'mlp_bn_h64')

In [14]:
# print model performance
print('mlp baseline model => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp_baseline, valid)))
print('mlp model with batch normalization => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp_bn, valid)))
print('mlp model with batch normalization + extra hidden layer => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp_bn_h16, valid)))
print('mlp model with batch normalization + wider hidden layer => mae: %3.2f, mse: %3.2f, cor :%3.2f'%(
    score_model(model_mlp_bn_h64, valid)))

mlp baseline model => mae: 5.79, mse: 69.34, cor :0.61
mlp model with batch normalization => mae: 3.46, mse: 30.12, cor :0.84
mlp model with batch normalization + extra hidden layer => mae: 3.11, mse: 20.40, cor :0.91
mlp model with batch normalization + wider hidden layer => mae: 2.86, mse: 18.56, cor :0.92


![](https://raw.githubusercontent.com/6chaoran/data-story/master/deep-learning/batch_normalization/tensorboard_mae.png)

Compared with the linear model baseline (MAE @ 3.79):

* NN baseline: stopped prematurely (because of the early stopping callback) with MAE @ 5.79
* NN + Batch Normalization: converged stably with MAE @ 3.46
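
The premature stop of the baseline network follows from how patience-based early stopping works. The sketch below is a simplified re-implementation of that logic as I understand it, not the Keras callback itself. The function name `early_stopping_epoch` and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss      # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# hypothetical slow, noisy run: the best loss is at epoch 2, and the big
# improvement at epoch 8 is never reached
slow_run = [70.0, 68.0, 69.0, 68.5, 68.2, 68.1, 68.3, 60.0]
steady_run = [70.0, 60.0, 50.0, 40.0, 30.0, 20.0]

print(early_stopping_epoch(slow_run))     # stops at epoch 7
print(early_stopping_epoch(steady_run))   # None: the loss keeps improving
```

A model that converges slowly and noisily can go five epochs without beating its best validation loss, even though it would keep improving with more training, which is what appears to have happened to the baseline network here.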

Increasing the complexity of the NN, on top of batch normalization:

* NN + BN + hidden 16-dimension layer: MAE @ 3.11
* NN + BN + hidden 64-dimension layer: MAE @ 2.86

Just one additional hidden dense layer makes the neural network outperform the linear model, with the help of batch normalization layers.

To conclude, batch normalization clearly makes the training of the neural network more stable, and it matters even more for complex, deep neural networks, where it speeds up model convergence.