MDR, June 2017, personal study.

This notebook records a personal learning project, carried out during the fast.ai course, in which a convnet is trained to identify distracted drivers. The dataset comes from an old Kaggle competition. This is the second experiment, in which I build a CNN from scratch, following the fast.ai statefarm notebook.

# Planning

This notebook assumes that the data has already been downloaded and organised. The plan:
* setup packages, paths
* review the task
* set up a simple model
    * with batchnorm (not dropout?)
* train the network 
* review the val_acc
* assess errors
* try additional techniques for improving the accuracy, reducing overfitting
    * data augmentation
    * ensembling
* submit to Kaggle    

# Setup (every session)

## Import packages (every new session)

<strong> N.B. </strong> You may need to ensure that the files...<strong>
* 'utils.py' (by fast.ai), 
* 'vgg16.py' (adapted by fast.ai?), and
* 'vgg16bn' </strong>
...are findable by Python, e.g. present in the current working directory.

In [1]:
%matplotlib inline
import numpy as np
import os

In [2]:
import utils
from utils import *
#from imp import reload  # fixes a P2-P3 incompatibility
#reload(utils)

 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1070 (CNMeM is disabled, cuDNN 5110)
Using Theano backend.


A function for checking for a folder name and creating one if needed.

In [3]:
import errno

def make_sure_path_exists(path):
    try:
        os.makedirs(path)
    except OSError as exception:
        if exception.errno != errno.EEXIST:
            raise
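
On Python 3 the same behaviour is available directly from the standard library. The following is a minimal sketch, assuming Python 3.2 or later; the directory name is made up for illustration.

```python
import os
import tempfile

def make_sure_path_exists(path):
    # exist_ok=True suppresses the error when the directory is already there,
    # but still raises if the path exists and is a regular file.
    os.makedirs(path, exist_ok=True)

# Calling it twice on the same path is safe.
demo_dir = os.path.join(tempfile.mkdtemp(), "results", "demo")  # hypothetical path
make_sure_path_exists(demo_dir)
make_sure_path_exists(demo_dir)
print(os.path.isdir(demo_dir))
```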

## Set up the path variables

<strong>DO re-run these cells, too...</strong>

In [5]:
current_dir = os.getcwd()
HOME_DIR = current_dir
DATA_HOME_DIR = current_dir + '/data'
print(' ', HOME_DIR, '\n ', DATA_HOME_DIR)

  /home/mark/Study/dl.fast.ai/deeplearning1/_MDR/2_distracted_driver 
  /home/mark/Study/dl.fast.ai/deeplearning1/_MDR/2_distracted_driver/data


This relative approach to folder referencing also works...

In [6]:
all_data_path = "data/"
sampled_path = all_data_path + "sample"
test_path = all_data_path + "test"

# The following allows for easy switching...
data_path = all_data_path   
#data_path = sampled_path      

%ls $all_data_path

driver_imgs_list.csv      results/                   test/      valid/
driver_imgs_list.csv.zip  sample/                    train/
imgs.zip                  sample_submission.csv.zip  unzipped/


# Review the task and data

## The task

Quoted from the Kaggle data page: 

<blockquote>In this competition you are given driver images, each taken in a car with a driver doing something in the car (texting, eating, talking on the phone, makeup, reaching behind, etc). Your goal is to predict the likelihood of what the driver is doing in each picture. 

The 10 classes to predict are:

    c0: safe driving
    c1: texting - right
    c2: talking on the phone - right
    c3: texting - left
    c4: talking on the phone - left
    c5: operating the radio
    c6: drinking
    c7: reaching behind
    c8: hair and makeup
    c9: talking to passenger

To ensure that this is a computer vision problem, we have removed metadata such as creation dates. The train and test data are split on the drivers, such that one driver can only appear on either train or test set. 

To discourage hand labeling, <strong>we have supplemented the test dataset with some images that are resized. These processed images are ignored and don't count towards your score.</strong>
</blockquote>
File descriptions

    imgs.zip - zipped folder of all (train/test) images
    sample_submission.csv - a sample submission file in the correct format
    driver_imgs_list.csv - a list of training images, their subject (driver) id, and class id

In [73]:
%ls -l $all_data_path

total 4203272
-rw-r--r--  1 mark mark     491359 Apr  7  2016 driver_imgs_list.csv
-rw-rw-r--  1 mark mark      95118 May  6 09:41 driver_imgs_list.csv.zip
-rw-rw-r--  1 mark mark 4292669227 May  6 09:41 imgs.zip
drwxrwxr-x  4 mark mark       4096 May 14 22:08 results/
drwxrwxr-x  5 mark mark       4096 May  9 21:09 sample/
-rw-rw-r--  1 mark mark     211199 May  6 09:18 sample_submission.csv.zip
drwxr-xr-x  3 mark mark   10620928 May 14 21:32 test/
drwxr-xr-x 12 mark mark       4096 May  7 22:04 train/
drwxrwxr-x  2 mark mark       4096 May 13 09:23 unzipped/
drwxrwxr-x 12 mark mark       4096 May 11 22:08 valid/


## Import and inspect the class data (as a data frame)

In [6]:
import pandas as pd
# Path to the CSV that maps each training image to its driver and class
class_file_path = all_data_path + 'driver_imgs_list.csv'
images_df = pd.read_csv(class_file_path, sep=',')

In [10]:
print(images_df.shape)
print(images_df.columns)

(22424, 3)
Index(['subject', 'classname', 'img'], dtype='object')


List unique values - classes.

In [11]:
# List unique values in the 'classname' column
images_df.classname.unique()

array(['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9'], dtype=object)

... and drivers ...

In [12]:
l = len(images_df.subject.unique())          ; print(l)
s = sorted(list(images_df.subject.unique())) ; print(s)

26
['p002', 'p012', 'p014', 'p015', 'p016', 'p021', 'p022', 'p024', 'p026', 'p035', 'p039', 'p041', 'p042', 'p045', 'p047', 'p049', 'p050', 'p051', 'p052', 'p056', 'p061', 'p064', 'p066', 'p072', 'p075', 'p081']


This suggests that, to achieve roughly an 80/20 split between training and validation, we'd have to select 5 of the 26 drivers to move into the 'valid' set. We'll store this list of drivers for use when we build the valid directory.

In [13]:
v = int(l * .2) ; print(v)
valid_subj = s[-v:]
print(valid_subj)

5
['p064', 'p066', 'p072', 'p075', 'p081']
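
One way to use this list when building the valid directory is sketched below. It assumes rows of (subject, classname, img), as in driver_imgs_list.csv, and a `<root>/train/<class>/<img>` layout. `plan_valid_moves` is a hypothetical helper and the sample rows are made up; it only plans the moves, it does not touch any files.

```python
import os

def plan_valid_moves(rows, valid_subjects, root="data"):
    """Return (src, dst) pairs for every image whose driver is held out."""
    held_out = set(valid_subjects)
    moves = []
    for subject, classname, img in rows:
        if subject in held_out:
            src = os.path.join(root, "train", classname, img)
            dst = os.path.join(root, "valid", classname, img)
            moves.append((src, dst))
    return moves

# Made-up rows in the same (subject, classname, img) layout as the CSV.
sample_rows = [
    ("p002", "c0", "img_1.jpg"),
    ("p081", "c0", "img_2.jpg"),
    ("p075", "c3", "img_3.jpg"),
]
moves = plan_valid_moves(sample_rows, ["p064", "p066", "p072", "p075", "p081"])
for src, dst in moves:
    print(src, "->", dst)  # pass each pair to shutil.move to apply it
```

Because drivers contribute different numbers of images, holding out 20% of the drivers will not give exactly 20% of the images.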


## Check the directory structure

Let's check that against the directory structure of the unzipped images (imgs).

In [14]:
!tree -d $all_data_path/unzipped

data/unzipped

0 directories


Q: How many training images are there, across all 10 classes? A: 22,424 images

In [16]:
src = all_data_path + 'train/**/*.jpg'   # all_data_path already ends in '/'
print(src)
g = glob(src, recursive=True)
train_set_size = len(g)
print(train_set_size)

data/train/**/*.jpg
22424


['data/train/c8/img_22971.jpg',
 'data/train/c8/img_76367.jpg',
 'data/train/c8/img_25239.jpg',
 'data/train/c8/img_23496.jpg',
 'data/train/c8/img_44057.jpg',
 'data/train/c8/img_70207.jpg',
 'data/train/c8/img_47484.jpg',
 'data/train/c8/img_72616.jpg',
 'data/train/c8/img_63477.jpg',
 'data/train/c8/img_93136.jpg',
 'data/train/c8/img_37289.jpg',
 'data/train/c8/img_98495.jpg',
 'data/train/c8/img_5355.jpg',
 'data/train/c8/img_26813.jpg',
 'data/train/c8/img_56050.jpg',
 'data/train/c8/img_8631.jpg',
 'data/train/c8/img_18910.jpg',
 'data/train/c8/img_14072.jpg',
 'data/train/c8/img_56207.jpg',
 'data/train/c8/img_74098.jpg',
 'data/train/c8/img_101737.jpg',
 'data/train/c8/img_100480.jpg',
 'data/train/c8/img_43109.jpg',
 'data/train/c8/img_76691.jpg',
 'data/train/c8/img_88573.jpg',
 'data/train/c8/img_88803.jpg',
 'data/train/c8/img_17506.jpg',
 'data/train/c8/img_28833.jpg',
 'data/train/c8/img_37131.jpg',
 'data/train/c8/img_55033.jpg',
 'data/train/c8/img_27663.jpg',
 'data/t

# Set up a simple model

## Set up the batches

In [7]:
# Grab a few images at a time for training and validation.
# NB: They must be in subdirectories named based on their category
print(data_path+'train')
batch_size = 64
batches = get_batches(data_path+'train', batch_size = batch_size)
val_batches = get_batches(data_path+'valid', batch_size = batch_size*2, shuffle=False)

data/train
Found 18587 images belonging to 10 classes.
Found 3837 images belonging to 10 classes.


In [8]:
(val_classes, trn_classes, val_labels, trn_labels, val_filenames, filenames, test_filenames) = get_classes(data_path)

Found 18587 images belonging to 10 classes.
Found 3837 images belonging to 10 classes.
Found 79726 images belonging to 1 classes.


## Inspecting the imported class data

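Let's make some sense of what that last command achieved. Judging by the three 'Found ... images' lines it prints, get_classes (from fast.ai's utils.py) builds batch generators over the train, valid and test directories. For train and valid it returns the integer class index of each image, the one-hot encoded labels and the filenames; for test it returns the filenames only.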

In [65]:
print('val_classes:', len(val_classes))
print(val_classes[454:458])
print(val_labels[454:458])
print(val_filenames[454:458])
print('trn_classes:', len(trn_classes))
print(trn_classes[2031:2034])
print(trn_labels[2031:2034])
print('filenames:', len(filenames))
print(filenames[2031:2034])
print('test_filenames:', len(test_filenames))
print(test_filenames[2031:2034])

val_classes: 3837
[0 0 1 1]
[[ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]]
['c0/img_75810.jpg', 'c0/img_25783.jpg', 'c1/img_59482.jpg', 'c1/img_86904.jpg']
trn_classes: 18587
[0 0 1]
[[ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]]
filenames: 18587
['c0/img_57235.jpg', 'c0/img_88177.jpg', 'c1/img_80626.jpg']
test_filenames: 79726
['unknown/img_74200.jpg', 'unknown/img_49022.jpg', 'unknown/img_39851.jpg']
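
The label rows printed above are one-hot encodings of the integer class indices. Here is a minimal pure-Python sketch of that mapping; `one_hot` is an illustrative name, not the fast.ai helper itself.

```python
def one_hot(class_indices, num_classes):
    """Turn integer class indices into rows with a single 1.0 per row."""
    rows = []
    for idx in class_indices:
        row = [0.0] * num_classes
        row[idx] = 1.0
        rows.append(row)
    return rows

# The class indices printed for val_classes[454:458] above.
labels = one_hot([0, 0, 1, 1], num_classes=10)
for row in labels:
    print(row)
```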


## Importing the data

The lines below need <strong>only be run once</strong> (they take about 4 minutes). In later sessions, one can simply load the saved arrays instead (see the load_array cell further down).

In [67]:
trn = get_data(data_path + 'train')
val = get_data(data_path + 'valid')

Found 18587 images belonging to 10 classes.
Found 3837 images belonging to 10 classes.


Save the data arrays, for faster reloading.

In [68]:
save_array(data_path + 'results/val.dat', val)
save_array(data_path + 'results/trn.dat', trn)

In [9]:
val = load_array(data_path + 'results/val.dat')
trn = load_array(data_path + 'results/trn.dat')
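
The "build once, load afterwards" pattern can also be wrapped in a small helper. This is only a sketch under stated assumptions: `load_or_build` is a hypothetical name, and `pickle` stands in for save_array/load_array so that the example runs without the course's utils.py.

```python
import os
import pickle
import tempfile

def load_or_build(cache_path, build_fn):
    """Load a cached object if present; otherwise build it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data

calls = []

def slow_build():
    calls.append(1)          # record that the expensive step actually ran
    return list(range(5))    # placeholder for get_data(...)

cache_file = os.path.join(tempfile.mkdtemp(), "val.pkl")  # hypothetical cache file
first = load_or_build(cache_file, slow_build)
second = load_or_build(cache_file, slow_build)
print(first == second, len(calls))
```

The second call reads from the cache, so the expensive step runs only once.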

# Experiments

## A simple model: two conv layers

In [7]:
def conv1(batches):
    model = Sequential([
            BatchNormalization(axis=1, input_shape=(3,224,224)),
            Convolution2D(32,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Convolution2D(64,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Flatten(),
            Dense(200, activation='relu'),
            BatchNormalization(),
            Dense(10, activation='softmax')
        ])

    model.compile(Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    model.optimizer.lr = 0.001
    model.fit_generator(batches, batches.nb_sample, nb_epoch=4, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    return model
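
To see how large the first dense layer becomes, the spatial sizes can be traced through the stack by hand. The sketch below assumes the Keras 1.x defaults for these layers ('valid' convolutions with stride 1, and pooling whose stride equals the pool size); `conv_out` and `pool_out` are illustrative helpers.

```python
def conv_out(size, kernel):
    # 'valid' convolution with stride 1: the kernel must fit entirely inside
    return size - kernel + 1

def pool_out(size, pool):
    # non-overlapping pooling: stride equals the pool size
    return size // pool

size = 224
size = conv_out(size, 3)   # 222
size = pool_out(size, 3)   # 74
size = conv_out(size, 3)   # 72
size = pool_out(size, 3)   # 24

flat = 64 * size * size            # 64 feature maps after the second conv
dense_params = flat * 200 + 200    # weights plus biases of Dense(200)
print(size, flat, dense_params)
```

Under those assumptions, almost all of the model's parameters sit in the first dense layer, which is one reason a model like this can overfit readily.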

### Train the model

In [11]:
model = conv1(batches)

Epoch 1/2
Epoch 2/2
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


### Discussion

I'm really puzzled by this, on three counts (the first two may be the same issue).
1. The model took almost 2x as long as the fast.ai notebook, when it usually runs much faster.
2. The temperature profiles were the opposite of normal - this time it was the CPU that ran hot. The GPU barely hit 37deg.
3. More importantly, the val_acc was far lower than Howard's (shown below).

On further reflection, I suspect that the simplicity of the model explains the temperature differences.

One possibility is that Howard used a different approach to setting up the validation set. Hypothesis: I ensured that the same drivers could not be found in both sets, but he didn't? One way to test this would be to see what I get against the Kaggle test set. 
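
A quicker check of my own half of that hypothesis is to confirm that no driver appears in both sets. The sketch below uses made-up data; in this notebook the lookup could be built with `dict(zip(images_df.img, images_df.subject))`, and the file lists would be `filenames` and `val_filenames`. `drivers_for` is a hypothetical helper.

```python
import os

def drivers_for(filenames, img_to_subject):
    """Map 'c0/img_1.jpg'-style names to the set of drivers they show."""
    return {img_to_subject[os.path.basename(name)] for name in filenames}

# Made-up lookup and file lists, in the formats used earlier in the notebook.
img_to_subject = {"img_1.jpg": "p002", "img_2.jpg": "p081", "img_3.jpg": "p002"}
train_files = ["c0/img_1.jpg", "c3/img_3.jpg"]
valid_files = ["c0/img_2.jpg"]

shared = drivers_for(train_files, img_to_subject) & drivers_for(valid_files, img_to_subject)
print(sorted(shared))  # an empty list means no driver appears in both sets
```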

## Augment the data

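Data augmentation applies random transformations (here: shifts, rotations, shear and colour-channel shifts) to the training images, so that the model is less able to overfit on fairly simple characteristics of the originals. It's built into Keras.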

In [9]:
batch_size = 64
gen_t = image.ImageDataGenerator(channel_shift_range=20, height_shift_range=0.05, 
                                 rotation_range=15, shear_range=0.1, width_shift_range=0.1)  
batches = get_batches(data_path+'train', gen_t, batch_size=batch_size)

Found 18587 images belonging to 10 classes.
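
To get a feel for how strong these settings are, the fractional ranges can be converted into pixels. This is a rough sketch, assuming the shift ranges are fractions of the image dimension and the rotation range is in degrees (as the Keras documentation describes them), and that images are resized to 224x224, matching the input_shape of the models here.

```python
image_size = 224  # input height and width used by the models in this notebook

width_shift_range = 0.1
height_shift_range = 0.05
rotation_range = 15

max_dx = width_shift_range * image_size    # largest horizontal shift, in pixels
max_dy = height_shift_range * image_size   # largest vertical shift, in pixels
print("horizontal shift: up to +/- %.1f px" % max_dx)
print("vertical shift:   up to +/- %.1f px" % max_dy)
print("rotation:         up to +/- %d degrees" % rotation_range)
```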


In [13]:
model = conv1(batches)

Epoch 1/2
Epoch 2/2
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


### Discussion

* The weird temps haven't changed, but the amount of time that an epoch takes has gone up. This makes sense. Not sure whether raising the batch size might *increase* overfitting, though?
* By Epoch 2-4, train-acc is lower and val_acc is higher, so overfitting has been reduced. Good.

At this point, JH reduces the learning rate and re-runs the same model and data. Perhaps it would be smarter to reduce the LR after the second epoch in the second set?

## Employ JH's model with an adapted training process

In [14]:
def conv2(batches):
    model = Sequential([
            BatchNormalization(axis=1, input_shape=(3,224,224)),
            Convolution2D(32,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Convolution2D(64,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D((3,3)),
            Flatten(),
            Dense(200, activation='relu'),
            BatchNormalization(),
            Dense(10, activation='softmax')
        ])

    model.compile(Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    model.optimizer.lr = 0.001
    # 2, not 4, epochs here - MDR
    model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    model.optimizer.lr = 0.0001
    # added - MDR
    model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                     nb_val_samples=val_batches.nb_sample)
    return model
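
The three training phases in conv2 amount to a step schedule, which can be written as a plain function of the epoch index. This is a sketch, not what was run above. One caveat, offered tentatively: in some Keras versions, assigning a plain float to model.optimizer.lr after training has started may not change the rate used by the already-built training function. It is worth checking this in the installed version; K.set_value(model.optimizer.lr, value) or a scheduler callback are more explicit alternatives.

```python
def step_schedule(epoch):
    """Learning rate for a zero-based epoch index, mirroring conv2's three phases."""
    if epoch < 2:
        return 1e-4   # warm-up at the low rate
    if epoch < 4:
        return 1e-3   # faster learning in the middle phase
    return 1e-4       # drop back down for the final epochs

rates = [step_schedule(e) for e in range(6)]
print(rates)

# Assumption: with Keras available, this function could be handed to
# keras.callbacks.LearningRateScheduler(step_schedule) and passed via
# callbacks=[...] to a single fit_generator call of 6 epochs.
```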

In [16]:
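# NB: batches was created above with batch_size=64; reassigning the variable
# here does not change the existing generator.
batch_size = 128
model = conv2(batches)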

Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2
Epoch 1/2
Epoch 2/2


One other change: I experimented with turning the GPU fan back to automatic settings, and then turned it back to manual (75%).

### Discussion

* val_acc increased monotonically throughout this experiment - GOOD. 
* acc hit a fairly high point, though (.9432), so we're still overfitting.
* Re the fan: temps did increase, when I turned it to auto. They rose by 15deg (35-50deg), but this took five minutes. This steady rise is still very much slower than the profile for VGG, but it's reassuring to see that it did happen. Putting the fan back on dropped the temps again - but not quite to the previous baseline (at least, not within 12 mins). 
* JH notes that his results were very 'unstable' - the val_acc was jumping around (unlike mine).

## A deeper architecture: three conv-pooling pairs, plus dropout

Three things change here; the last one reverts a change I made above, whose rationale I now doubt:

* the architecture has changed: it has more conv and pool layers, and it has dropout
* the data augmentation generator is re-created (as written, the parameters are the same as those used above)
* I've switched the batch_size back to 64, to match that used in setting up the batches at the top of the notebook - shouldn't they match?

In [18]:
batch_size = 64   # reset explicitly; it was last set to 128
gen_t = image.ImageDataGenerator(channel_shift_range=20, height_shift_range=0.05, 
                                 rotation_range=15, shear_range=0.1, width_shift_range=0.1)  
batches = get_batches(data_path+'train', gen_t, batch_size=batch_size)

Found 18587 images belonging to 10 classes.


Note - this time, we haven't wrapped the model in a function (that returns a model). Both ways work.

In [19]:
model = Sequential([
        BatchNormalization(axis=1, input_shape=(3,224,224)),
        Convolution2D(32,3,3, activation='relu'),
        BatchNormalization(axis=1),
        MaxPooling2D(),                               # before, he defined the size (3,3)
        Convolution2D(64,3,3, activation='relu'),
        BatchNormalization(axis=1),
        MaxPooling2D(),
        Convolution2D(128,3,3, activation='relu'),    # new
        BatchNormalization(axis=1),                   # new
        MaxPooling2D(),                               # new
        Flatten(),              
        Dense(200, activation='relu'),                # new
        BatchNormalization(),                         # new
        Dropout(0.5),                                 # new
        Dense(200, activation='relu'),
        BatchNormalization(),
        Dropout(0.5),                                 # new
        Dense(10, activation='softmax')
    ])

JH's LR is really low, here.

In [20]:
model.compile(Adam(lr=0.00001), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                    nb_val_samples=val_batches.nb_sample)             

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fb59bbf3b00>

 Surely way too slow!

In [21]:
model.compile(Adam(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                    nb_val_samples=val_batches.nb_sample)  

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fb599af08d0>

In [22]:
model.optimizer.lr = 0.0001
model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                    nb_val_samples=val_batches.nb_sample)  

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fb598d6f208>

I wonder if it's worth keeping going?

In [23]:
model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                    nb_val_samples=val_batches.nb_sample)  

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fb598d6feb8>

Perhaps with a lower learning rate at this point...

In [24]:
model.optimizer.lr = 0.00001
model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                    nb_val_samples=val_batches.nb_sample) 

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fb599af0a20>

### Discussion

I'm stopping this one here. It worked ok - 63% acc isn't bad, but we're overfitting still. It's likely that I'll have more luck applying the same techniques in fine-tuning VGG.