# State Farm Distracted Driver Detection - VGG16 Continued

This notebook contains the final attempt at the Kaggle State Farm Distracted Driver Detection competition, using the VGG model extended with batchnorm. The purpose is to train the best possible model and to test the external vgg16, utils and plot libraries.

## Initial Setup

Import libraries and functions for future use.

In [2]:
# Plots displayed inline in notebook
%matplotlib inline

# Make Python 2 behave like Python 3 (print function, true division)
from __future__ import print_function, division

# Make help libraries available
import sys

sys.path.append('/home/ubuntu/personal-libraries')

In [3]:
import numpy as np
import pandas as pd
import gc

from kerastools.vgg16 import Vgg16
from kerastools.utils import get_batches, save_array, load_array, get_classes, do_clip

from keras.models import Model, Sequential
from keras.layers import Input, Dense, BatchNormalization, Flatten, Dropout
from keras.layers.convolutional import MaxPooling2D
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
from keras.preprocessing import image

Using Theano backend.
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)


## Define model

We set up our initial VGG16 model with batch normalisation.

In [3]:
vgg = Vgg16(use_batchnorm = True)
vgg.model.summary()

Downloading data from http://files.fast.ai/models/vgg16_bn.h5
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 3, 224, 224)       0         
_________________________________________________________________
norm_layer (Lambda)          (None, 3, 224, 224)       0         
_________________________________________________________________
zero_padding2d_1 (ZeroPaddin (None, 3, 226, 226)       0         
_________________________________________________________________
conv_layer_1_0 (Conv2D)      (None, 64, 224, 224)      1792      
_________________________________________________________________
zero_padding2d_2 (ZeroPaddin (None, 64, 226, 226)      0         
_________________________________________________________________
conv_layer_1_1 (Conv2D)      (None, 64, 224, 224)      36928     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 64, 112, 112)      0         
______________

## Setup batches

We define our validation and training batches for modelling.

In [3]:
batch_size = 32

#path = ''
path = 'sample/'

train_batches = vgg.get_batches(path + 'train', batch_size = batch_size)
val_batches = vgg.get_batches(path + 'valid', batch_size = batch_size, shuffle = False)

NameError: name 'vgg' is not defined

## Finetune model - Sample

We need to adjust the standard VGG model to our new input with 10 classes, so we finetune it.

In [5]:
vgg.finetune(train_batches)
vgg.model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 3, 224, 224)       0         
_________________________________________________________________
norm_layer (Lambda)          (None, 3, 224, 224)       0         
_________________________________________________________________
zero_padding2d_1 (ZeroPaddin (None, 3, 226, 226)       0         
_________________________________________________________________
conv_layer_1_0 (Conv2D)      (None, 64, 224, 224)      1792      
_________________________________________________________________
zero_padding2d_2 (ZeroPaddin (None, 64, 226, 226)      0         
_________________________________________________________________
conv_layer_1_1 (Conv2D)      (None, 64, 224, 224)      36928     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 64, 112, 112)      0         
__________

We train the model using the default learning rate of 0.001 for a single epoch.

In [6]:
vgg.fit_batch(train_batches, val_batches, 1)

Epoch 1/1


We see that the accuracy increases fine on the sample, so we increase the learning rate.

In [7]:
vgg.model.optimizer.lr = 0.1

vgg.fit_batch(train_batches, val_batches, 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


Try 4 more epochs with lower learning rate.

In [8]:
vgg.model.optimizer.lr = 0.001

vgg.fit_batch(train_batches, val_batches, 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


It seems this is as far as we can get on the sample data set: a pretty good baseline, in the area of 0.5 to 0.66.

## Finetune model - Full data

We continue our finetuning on the full data set.

In [9]:
path = ''

train_batches = vgg.get_batches(path + 'train', batch_size = batch_size)
val_batches = vgg.get_batches(path + 'valid', batch_size = batch_size, shuffle = False)

Found 19624 images belonging to 10 classes.
Found 2800 images belonging to 10 classes.


We start with a single epoch.

In [10]:
vgg.fit_batch(train_batches, val_batches, 1)

Epoch 1/1


We are doing much better now. We increase the learning rate and see where that takes us.

In [11]:
vgg.model.optimizer.lr = 0.1

vgg.fit_batch(train_batches, val_batches, 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


And then we lower the learning rate again, and see where we end up.

In [12]:
vgg.model.optimizer.lr = 0.001

vgg.fit_batch(train_batches, val_batches, 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


The validation accuracy is pretty good, but we are overfitting quite a lot. Let's try making more layers trainable and see if that helps things along.

In [13]:
layers = vgg.model.layers
# Get the index of the first dense layer...
first_dense_idx = [index for index, layer in enumerate(layers) if type(layer) is Dense][0]
# ...and set this and all subsequent layers to trainable
for layer in layers[first_dense_idx:]: layer.trainable = True
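
The unfreezing step above can be tried in isolation. The following is a minimal sketch that uses stand-in `Layer`, `Conv` and `Dense` classes (hypothetical, carrying only a `trainable` flag) instead of real Keras layers, so the helper name `unfreeze_from_first_dense` and the toy layer list are assumptions for illustration only.

```python
# Stand-in layer classes (hypothetical; real Keras layers carry far more state).
class Layer(object):
    def __init__(self):
        self.trainable = False


class Conv(Layer):
    pass


class Dense(Layer):
    pass


def unfreeze_from_first_dense(layers):
    """Mark the first Dense layer and every layer after it as trainable."""
    first_dense_idx = next(i for i, layer in enumerate(layers)
                           if isinstance(layer, Dense))
    for layer in layers[first_dense_idx:]:
        layer.trainable = True
    return first_dense_idx


# Toy model: two frozen convolutional layers followed by two dense layers.
layers = [Conv(), Conv(), Dense(), Dense()]
idx = unfreeze_from_first_dense(layers)
flags = [layer.trainable for layer in layers]
print(idx, flags)
```

In Keras itself, a change to `trainable` is generally only picked up once the model is compiled again, so it may be worth recompiling after this step before drawing conclusions from the next training run.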

And then we rerun the training. First one epoch with low learning rate.

In [14]:
vgg.fit_batch(train_batches, val_batches, 1)

Epoch 1/1


Then four epochs with a higher learning rate.

In [15]:
vgg.model.optimizer.lr = 0.1

vgg.fit_batch(train_batches, val_batches, 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


And then four epochs with a lower learning rate again.

In [16]:
vgg.model.optimizer.lr = 0.001

vgg.fit_batch(train_batches, val_batches, 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


That did little to improve things. Let's try an even lower learning rate.

In [17]:
vgg.model.optimizer.lr = 0.00001

vgg.fit_batch(train_batches, val_batches, 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


We only seem to be stabilizing. Let's save the weights and try a different approach.

In [18]:
vgg.model.save_weights('models/base_vgg16_norm.h5')

## Improved VGG
We continue using the VGG16 network with batchnorm, but attempt to improve it. That is, we want to keep the pretrained convolutional layers fixed and increase our training speed.
We start by defining a new `Vgg16` model.

In [4]:
vgg = Vgg16(use_batchnorm = True)
vgg.model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 3, 224, 224)       0         
_________________________________________________________________
norm_layer (Lambda)          (None, 3, 224, 224)       0         
_________________________________________________________________
zero_padding2d_1 (ZeroPaddin (None, 3, 226, 226)       0         
_________________________________________________________________
conv_layer_1_0 (Conv2D)      (None, 64, 224, 224)      1792      
_________________________________________________________________
zero_padding2d_2 (ZeroPaddin (None, 64, 226, 226)      0         
_________________________________________________________________
conv_layer_1_1 (Conv2D)      (None, 64, 224, 224)      36928     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 64, 112, 112)      0         
__________

We then proceed to find the last max pooling layer of the model.

In [5]:
# Define convolutional layers
last_conv_idx = [i for i, l in enumerate(vgg.model.layers) if type(l) is MaxPooling2D][-1]
conv_layers = vgg.model.layers[:last_conv_idx + 1]

We can then define a model using only the convolutional layers.

In [6]:
conv_model = Sequential(conv_layers)
conv_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 3, 224, 224)       0         
_________________________________________________________________
norm_layer (Lambda)          (None, 3, 224, 224)       0         
_________________________________________________________________
zero_padding2d_1 (ZeroPaddin (None, 3, 226, 226)       0         
_________________________________________________________________
conv_layer_1_0 (Conv2D)      (None, 64, 224, 224)      1792      
_________________________________________________________________
zero_padding2d_2 (ZeroPaddin (None, 64, 226, 226)      0         
_________________________________________________________________
conv_layer_1_1 (Conv2D)      (None, 64, 224, 224)      36928     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 64, 112, 112)      0         
__________

The idea is now to pre-compute the output of the convolutional layers for all of our data. This will drastically reduce the training time once we start experimenting with the dense model architecture.
We start by defining our batches.

In [7]:
path = ''

train_batches = get_batches(path + 'train', batch_size = 44, target_size = (224, 224), shuffle = False)

valid_batches = get_batches(path + 'valid', batch_size = 50, target_size = (224, 224), shuffle = False)

test_batches = get_batches(path + 'test', batch_size = 2, target_size = (224, 224), shuffle = False, class_mode = None)

Found 19624 images belonging to 10 classes.
Found 2800 images belonging to 10 classes.
Found 79726 images belonging to 1 classes.


We also extract labels and classes for each dataset.

In [8]:
(val_classes, trn_classes, val_labels, trn_labels, 
 val_filenames, filenames, test_filenames) = get_classes(path)

Found 19624 images belonging to 10 classes.
Found 2800 images belonging to 10 classes.
Found 79726 images belonging to 1 classes.


We then pre-compute each of our datasets and save the numpy arrays. This eats a lot of memory on the poor AWS instance, so after each computation we do some cleanup to release the memory. We save and load using bcolz, as it offers good compression and very fast I/O.

In [12]:
conv_feat = conv_model.predict_generator(train_batches, int(train_batches.samples / train_batches.batch_size))
save_array(path + 'results/conv_computed/conv_feat_norm.dat', conv_feat)

del conv_feat
gc.collect()

0

In [13]:
conv_val_feat = conv_model.predict_generator(valid_batches, int(valid_batches.samples / valid_batches.batch_size))
save_array(path + 'results/conv_computed/conv_val_feat_norm.dat', conv_val_feat)

del conv_val_feat
gc.collect()

0

In [14]:
conv_test_feat = conv_model.predict_generator(test_batches, int(test_batches.samples / test_batches.batch_size))
save_array(path + 'results/conv_computed/conv_test_feat_norm.dat', conv_test_feat)

del conv_test_feat
gc.collect()

0
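
The step count passed to `predict_generator` above is a truncating division, so it only covers every image when the batch size divides the sample count exactly. The sketch below (the helper `steps_for` is a hypothetical name, not part of the notebook's libraries) checks this for the three batch sizes used here and shows what would happen with a batch size that does not divide evenly.

```python
import math


def steps_for(samples, batch_size):
    """Return (number of full batches, steps needed to cover every sample)."""
    full_batches = samples // batch_size
    steps_needed = int(math.ceil(samples / float(batch_size)))
    return full_batches, steps_needed


# Sample counts and batch sizes as reported in the cells above.
datasets = [('train', 19624, 44), ('valid', 2800, 50), ('test', 79726, 2)]
for name, samples, batch_size in datasets:
    full_batches, steps_needed = steps_for(samples, batch_size)
    print(name, full_batches, steps_needed, samples % batch_size)

# A batch size that does not divide the validation set evenly:
# truncation would silently skip the final partial batch.
print(steps_for(2800, 32))
```

With 44, 50 and 2 the remainder is zero in each case, which is why the truncated count is safe here. For other batch sizes, rounding up is the safer choice.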

And finally, we can load the three feature sets.

In [15]:
conv_feat = load_array('results/conv_computed/conv_feat_norm.dat')
conv_val_feat = load_array('results/conv_computed/conv_val_feat_norm.dat')
conv_test_feat = load_array('results/conv_computed/conv_test_feat_norm.dat')

## VGG model with batchnorm and precomputed augmentation

We precompute some augmented data in order to reduce the overfitting of our model. We start by setting the level of preprocessing and then define a new set of batches.

We then train the model on the augmented images. First define a new model.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 3, 224, 224)       0         
_________________________________________________________________
norm_layer (Lambda)          (None, 3, 224, 224)       0         
_________________________________________________________________
zero_padding2d_40 (ZeroPaddi (None, 3, 226, 226)       0         
_________________________________________________________________
conv_layer_1_0 (Conv2D)      (None, 64, 224, 224)      1792      
_________________________________________________________________
zero_padding2d_41 (ZeroPaddi (None, 64, 226, 226)      0         
_________________________________________________________________
conv_layer_1_1 (Conv2D)      (None, 64, 224, 224)      36928     
_________________________________________________________________
max_pooling2d_16 (MaxPooling (None, 64, 112, 112)      0         
__________

Define batches

In [23]:
gen_t = image.ImageDataGenerator(rotation_range = 15,
                                 height_shift_range = 0.05,
                                 shear_range = 0.1,
                                 channel_shift_range = 20,
                                 width_shift_range = 0.1)
path = ''

da_batches = vgg.get_batches(path + 'train',
                             gen_t,
                             batch_size = 44,
                             shuffle = True,
                             target_size = (224, 224))
val_batches = vgg.get_batches(path + 'valid', batch_size = batch_size, shuffle = False)

Found 19624 images belonging to 10 classes.
Found 2800 images belonging to 10 classes.


Compile model

In [24]:
vgg.compile()

Finetune and make all dense layers trainable

In [25]:
vgg.finetune(da_batches)

layers = vgg.model.layers
# Get the index of the first dense layer...
first_dense_idx = [index for index, layer in enumerate(layers) if type(layer) is Dense][0]
# ...and set this and all subsequent layers to trainable
for layer in layers[first_dense_idx:]: layer.trainable = True

Then train the model using the augmented batches. First run some epochs at default learning rate.

In [26]:
vgg.fit_batch(da_batches, val_batches, 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


Then increase the learning rate and run some more epochs.

In [29]:
vgg.model.optimizer.lr = 0.01

vgg.fit_batch(da_batches, val_batches, 4)

Epoch 1/4
 40/446 [=>............................] - ETA: 440s - loss: 2.3440 - acc: 0.5835

KeyboardInterrupt: 

Finally decrease the learning rate and run some more epochs.

In [None]:
vgg.model.optimizer.lr = 0.001

vgg.fit_batch(da_batches, val_batches, 4)

## Pseudolabeling

We're going to try using a combination of [pseudo labeling](http://deeplearning.net/wp-content/uploads/2013/03/pseudo_label_final.pdf) and [knowledge distillation](https://arxiv.org/abs/1503.02531) to allow us to use unlabeled data (i.e. do semi-supervised learning). For our initial experiment we'll use the validation set as the unlabeled data, so that we can see that it is working without using the test set. Afterwards we add the test set as well.

In [22]:
val_pseudo = vgg.predict(val_batches, batch_size = 50)

We concatenate these pseudo labels with our training labels.

In [24]:
comb_pseudo = np.concatenate([da_trn_labels, val_pseudo])
comb_feat = np.concatenate([da_conv_feat, conv_val_feat])
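
As a rough illustration of what the concatenation does, the sketch below builds toy arrays (all shapes and values are illustrative assumptions, not the notebook's data) and stacks one-hot training labels with soft pseudo labels. The pseudo labels are kept as full probability rows rather than hard class picks, which is the part borrowed from knowledge distillation.

```python
import numpy as np

rng = np.random.RandomState(0)
n_classes = 10

# Toy stand-ins for the real arrays (shapes are illustrative assumptions).
trn_feat = rng.rand(6, 4)                                   # labelled features
trn_labels = np.eye(n_classes)[rng.randint(0, n_classes, size=6)]  # one-hot rows
unl_feat = rng.rand(3, 4)                                   # unlabelled features

# Pretend model output for the unlabelled rows: each row sums to 1.
logits = rng.rand(3, n_classes)
pseudo = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Features and labels must be stacked in the same order so rows stay aligned.
comb_feat = np.concatenate([trn_feat, unl_feat])
comb_labels = np.concatenate([trn_labels, pseudo])
print(comb_feat.shape, comb_labels.shape)
```

Because both label blocks are valid probability distributions over the same ten classes, the combined array can be fed to a categorical cross-entropy loss without further conversion.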

And train our model using the extended data set.

In [26]:
bn_model_bigger.optimizer.lr = 0.001

bn_model_bigger.fit(x = comb_feat,
                    y = comb_pseudo,
                    batch_size = batch_size,
                    epochs = 1,
                    validation_data = (conv_val_feat, val_labels))

Train on 120544 samples, validate on 2800 samples
Epoch 1/1


<keras.callbacks.History at 0x7f1f34107ed0>

We do not really see much of an improvement. Let's try 4 more epochs.

In [27]:
bn_model_bigger.fit(x = comb_feat,
                    y = comb_pseudo,
                    batch_size = batch_size,
                    epochs = 4,
                    validation_data = (conv_val_feat, val_labels))

Train on 120544 samples, validate on 2800 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f1f348ef5d0>

Now we are crossing the 0.9 threshold of accuracy. Let's lower the learning rate, train for 4 more epochs, and see where that gets us.

In [29]:
bn_model_bigger.optimizer.lr = 0.00001

bn_model_bigger.fit(x = comb_feat,
                    y = comb_pseudo,
                    batch_size = batch_size,
                    epochs = 4,
                    validation_data = (conv_val_feat, val_labels))

Train on 120544 samples, validate on 2800 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f1f34887f10>

So we do see a pretty nice improvement, enough to warrant trying with the entire test set.

In [30]:
test_pseudo = bn_model_bigger.predict(conv_test_feat, batch_size = 2)

We concatenate these pseudo labels with our training and validation pseudo labels.

In [31]:
comb_pseudo = np.concatenate([comb_pseudo, test_pseudo])
comb_feat = np.concatenate([comb_feat, conv_test_feat])

And train our model using the extended data set.

In [32]:
bn_model_bigger.optimizer.lr = 0.001

bn_model_bigger.fit(x = comb_feat,
                    y = comb_pseudo,
                    batch_size = batch_size,
                    epochs = 1,
                    validation_data = (conv_val_feat, val_labels))

Train on 200270 samples, validate on 2800 samples
Epoch 1/1


<keras.callbacks.History at 0x7f1f34183290>

Hmm, it is too early to say, but the validation accuracy is holding steady. Let's run 4 more epochs.

In [33]:
bn_model_bigger.fit(x = comb_feat,
                    y = comb_pseudo,
                    batch_size = batch_size,
                    epochs = 4,
                    validation_data = (conv_val_feat, val_labels))

Train on 200270 samples, validate on 2800 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f1f58620f50>

Let's lower the learning rate again and run 4 more epochs.

In [34]:
bn_model_bigger.optimizer.lr = 0.00001

bn_model_bigger.fit(x = comb_feat,
                    y = comb_pseudo,
                    batch_size = batch_size,
                    epochs = 4,
                    validation_data = (conv_val_feat, val_labels))

Train on 200270 samples, validate on 2800 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f1f58650fd0>

We are improving slowly. Let's run for 10 more epochs.

In [36]:
bn_model_bigger.fit(x = comb_feat,
                    y = comb_pseudo,
                    batch_size = batch_size,
                    epochs = 10,
                    validation_data = (conv_val_feat, val_labels))

Train on 200270 samples, validate on 2800 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f1f348ef8d0>

Having trained the model, we save the weights.

In [37]:
bn_model_bigger.save_weights('models/batchnorm_vgg16.h5')

## Submitting to Kaggle

We finally submit the improved model to Kaggle.

We start by finding the optimal level of clipping.

In [48]:
valid_batches_pred = get_batches(path + 'valid', batch_size = 50, target_size = (224, 224), shuffle = False, class_mode = None)
conv_val_feat_pred = conv_model.predict_generator(valid_batches_pred, int(valid_batches_pred.samples / valid_batches_pred.batch_size))

val_predictions = bn_model_bigger.predict(conv_val_feat_pred, batch_size = 50)

Found 2800 images belonging to 10 classes.


In [57]:
def do_clip(arr, mx): return np.clip(arr, (1 - mx) / 9, mx)
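
To see why clipping can help under a log-loss metric, the sketch below applies the same `do_clip` to a single, deliberately extreme toy prediction (an assumption for illustration) that is confidently wrong. The `log_loss` helper is a plain NumPy stand-in written for this example, not the Keras function used in the notebook.

```python
import numpy as np


def do_clip(arr, mx):
    return np.clip(arr, (1 - mx) / 9, mx)


def log_loss(y_true, y_pred):
    """Mean multi-class log loss for one-hot y_true."""
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))


# One sample whose true class is 0, but almost all mass is put on class 1.
y_true = np.eye(10)[[0]]
y_pred = np.full((1, 10), 1e-6)
y_pred[0, 1] = 1 - 9e-6

raw = log_loss(y_true, y_pred)
clipped = log_loss(y_true, do_clip(y_pred, 0.89))
print(raw, clipped)
```

The lower bound `(1 - mx) / 9` spreads the mass removed from the top class evenly over the other nine classes, which caps the penalty for a confident miss. The price is a slightly higher loss on confident predictions that were correct, and the clipped rows no longer need to sum to exactly 1.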

We then proceed to determine the optimal level of clipping using the validation data set.

In [58]:
test_clip = []
for i in np.arange(0.70, 1.0, 0.01):
    test_clip.append([i, categorical_crossentropy(val_labels, do_clip(val_predictions, i)).eval().mean()])

min(test_clip, key = lambda x: x[1])

[1.0000000000000002, 0.48164355679327142]
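
The reported optimum of `1.0000000000000002` sits just past the intended end of the range, which is a floating-point side effect of stepping with `np.arange`. A minimal sketch of the same sweep, assuming toy predictions and a NumPy-only `log_loss` helper (both are illustrative stand-ins, and `best_clip` is a hypothetical name), uses `np.linspace` to keep the endpoints under control:

```python
import numpy as np


def do_clip(arr, mx):
    return np.clip(arr, (1 - mx) / 9, mx)


def log_loss(y_true, y_pred):
    """Mean multi-class log loss for one-hot y_true."""
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))


def best_clip(y_true, y_pred, thresholds):
    """Return the (threshold, loss) pair with the lowest mean log loss."""
    scored = [(float(t), log_loss(y_true, do_clip(y_pred, t)))
              for t in thresholds]
    return min(scored, key=lambda pair: pair[1])


# Toy data (illustrative assumption): three confident hits, one confident miss.
y_true = np.eye(10)[[0, 1, 2, 3]]
y_pred = np.full((4, 10), 0.001)
y_pred[[0, 1, 2], [0, 1, 2]] = 0.991   # correct class gets the mass
y_pred[3, 4] = 0.991                   # wrong class gets the mass

thresholds = np.linspace(0.70, 0.99, 30)   # 0.70, 0.71, ..., 0.99
best = best_clip(y_true, y_pred, thresholds)
print(best)
```

On this toy set a quarter of the confident predictions are wrong, so a fairly aggressive clip wins. On the notebook's validation set the sweep preferred no clipping at all, which suggests the model's confident predictions there were rarely wrong.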

This suggests that no clipping is best, which is odd. Let's submit two versions, one clipped and one not. First we compute the predictions.

In [59]:
test_predictions = bn_model_bigger.predict(conv_test_feat, batch_size = 2)

Define classes

In [61]:
classes = sorted(valid_batches.class_indices, key = valid_batches.class_indices.get)

Then we clip the predictions at 0.89, based on earlier experience.

In [62]:
sumbit_pred = do_clip(test_predictions, 0.89)

Then we prepare a submission without clipping.

In [65]:
submission_no_clip = pd.DataFrame(test_predictions, columns = classes)
submission_no_clip.insert(0, 'img', [a[8:] for a in test_batches.filenames])
submission_no_clip.head()

|   | img | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | img_81601.jpg | 0.046099 | 0.006723 | 0.004307 | 0.000682 | 0.001594 | 0.001567 | 0.026646 | 0.020867 | 0.016716 | 0.874798 |
| 1 | img_14887.jpg | 0.68964 | 0.007839 | 0.000621 | 0.002267 | 0.001608 | 0.003189 | 0.001825 | 0.000393 | 0.005292 | 0.287327 |
| 2 | img_62885.jpg | 0.005301 | 0.000154 | 0.000257 | 0.017661 | 0.973055 | 0.000493 | 0.000514 | 1.6e-05 | 0.000978 | 0.001571 |
| 3 | img_45125.jpg | 0.002003 | 0.007311 | 0.017353 | 0.000462 | 0.002328 | 0.000509 | 0.656182 | 0.006239 | 0.302687 | 0.004926 |
| 4 | img_22633.jpg | 0.138601 | 0.055678 | 0.011123 | 0.001665 | 0.004929 | 0.012565 | 0.022692 | 0.007558 | 0.186375 | 0.558814 |
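
The `a[8:]` slice used to build the `img` column assumes that every test filename starts with a prefix of exactly eight characters. The sketch below compares it with `os.path.basename`, which does not depend on the prefix length. The subfolder name `unknown` is an assumption made for this example, since the notebook does not show the actual directory name.

```python
import os

# Hypothetical generator filenames; the subfolder name `unknown` is an assumption.
filenames = ['unknown/img_81601.jpg', 'unknown/img_14887.jpg']

# Slicing works only while the folder prefix is exactly 8 characters long.
by_slice = [name[8:] for name in filenames]

# basename strips whatever directory part is present.
by_basename = [os.path.basename(name) for name in filenames]

print(by_slice, by_basename)
```

Both approaches give the same result here, so the slice is fine for this notebook. The `basename` form is simply more robust if the test folder is ever renamed.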


And a submission with clipping.

In [66]:
submission_clip = pd.DataFrame(sumbit_pred, columns = classes)
submission_clip.insert(0, 'img', [a[8:] for a in test_batches.filenames])
submission_clip.head()

|   | img | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | img_81601.jpg | 0.046099 | 0.012222 | 0.012222 | 0.012222 | 0.012222 | 0.012222 | 0.026646 | 0.020867 | 0.016716 | 0.874798 |
| 1 | img_14887.jpg | 0.68964 | 0.012222 | 0.012222 | 0.012222 | 0.012222 | 0.012222 | 0.012222 | 0.012222 | 0.012222 | 0.287327 |
| 2 | img_62885.jpg | 0.012222 | 0.012222 | 0.012222 | 0.017661 | 0.89 | 0.012222 | 0.012222 | 0.012222 | 0.012222 | 0.012222 |
| 3 | img_45125.jpg | 0.012222 | 0.012222 | 0.017353 | 0.012222 | 0.012222 | 0.012222 | 0.656182 | 0.012222 | 0.302687 | 0.012222 |
| 4 | img_22633.jpg | 0.138601 | 0.055678 | 0.012222 | 0.012222 | 0.012222 | 0.012565 | 0.022692 | 0.012222 | 0.186375 | 0.558814 |


Finally, we save the two submissions.

In [67]:
submission_file_name_no_clip = 'results/augmented-pseudo-vgg-no-clip.gz'
submission_no_clip.to_csv(submission_file_name_no_clip, index = False, compression = 'gzip')

submission_file_name_clip = 'results/augmented-pseudo-vgg-clip.gz'
submission_clip.to_csv(submission_file_name_clip, index = False, compression = 'gzip')

In [70]:
from IPython.display import FileLink
FileLink('results/augmented-pseudo-vgg-no-clip.gz')

In [71]:
FileLink('results/augmented-pseudo-vgg-clip.gz')

It turns out that in this case the unclipped submission actually performed best, by an absolute 0.03.