# Udacity Machine Learning Capstone Project

## State Farm Distracted Driver Detection (Can computer vision spot distracted drivers?)

---
### The Road Ahead

We break the notebook into separate steps.  Feel free to use the links below to navigate the notebook.

* [Step 0](#step0): Import Datasets
* [Step 1](#step1): Data Analysis and Preprocessing
* [Step 2](#step2): Create a CNN to classify driver images (from scratch)
* [Step 3](#step3): Use a CNN to classify driver images (using transfer learning with VGG16)
* [Step 4](#step4): Create a CNN to classify driver images (using transfer learning with ResNet50)
* [Step 5](#step5): Algorithm test result

---
<a id='step0'></a>
## Step 0: Import Datasets

### Import Driver Image Dataset

In the code cell below, we import a dataset of driver images. Using the `load_files` function from the scikit-learn library, we populate the following variables:
- `train_files`, `valid_files`, `test_files` - numpy arrays containing file paths to images
- `train_targets`, `valid_targets`, `test_targets` - numpy arrays containing one-hot-encoded classification labels
- `label_names` - list of string-valued label codes of driver behaviors, used to translate labels:

    * c0: normal driving
    * c1: texting - right
    * c2: talking on the phone - right
    * c3: texting - left
    * c4: talking on the phone - left
    * c5: operating the radio
    * c6: drinking
    * c7: reaching behind
    * c8: hair and makeup
    * c9: talking to passenger
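
The label codes above can be kept next to their descriptions in a small lookup table. The sketch below is illustrative only: the names `LABEL_DESCRIPTIONS` and `one_hot` are hypothetical and not part of the notebook, and `one_hot` mirrors in plain Python the kind of one-hot vector stored in the `*_targets` arrays, assuming class indices follow the sorted label codes.

```python
# Hypothetical helper names; the descriptions are copied from the list above.
LABEL_DESCRIPTIONS = {
    'c0': 'normal driving',
    'c1': 'texting - right',
    'c2': 'talking on the phone - right',
    'c3': 'texting - left',
    'c4': 'talking on the phone - left',
    'c5': 'operating the radio',
    'c6': 'drinking',
    'c7': 'reaching behind',
    'c8': 'hair and makeup',
    'c9': 'talking to passenger',
}


def one_hot(label_code, num_classes=10):
    """Return a one-hot list for a label code such as 'c3'."""
    index = int(label_code[1:])  # 'c3' -> 3
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


print(LABEL_DESCRIPTIONS['c6'])
print(one_hot('c3'))
```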

In [None]:
from sklearn.datasets import load_files
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
import numpy as np
from glob import glob

# define function to load train and test image datasets provided by State Farm
def load_dataset(path):
    # load_files treats each subfolder of `path` as one class
    data = load_files(path)
    driver_files = np.array(data['filenames'])
    # one-hot encode the integer class labels (10 driver behavior classes)
    driver_targets = to_categorical(np.array(data['target']), 10)
    return driver_files, driver_targets

# load original train and test datasets provided by State Farm
train_files, train_targets = load_dataset('imgs/train')
test_files, test_targets = load_dataset('imgs/test')

print('There are %s total driver images provided by State Farm:' % len(np.hstack([train_files, test_files])))
print('%d train driver images' % len(train_files))
print('%d test driver images\n' % len(test_files))

# load list of label codes of driver behaviors
label_names = [item[11:13] for item in sorted(glob("imgs/train/*/"))]

print('There are %d total driver behavior categories.\n' % len(label_names))

In [None]:
# The original train set provided by State Farm is too large to convert into tensors in memory at once.
# To avoid running out of memory when training the CNN models, we randomly keep about one third of it:
# test_size=0.66 sets aside 66% of the images (ignored), leaving the remaining 34% for us to use.
use_files, non_use_files, use_targets, non_use_targets = \
            train_test_split(train_files, train_targets, test_size=0.66, random_state=5)

print('To avoid running out of memory,')
print('we select %s images from original train set' % len(use_files))
print('and ignore the remaining %s images.\n' % len(non_use_files))

# shuffle and split the sampled "use" dataset into training set and validation set
train_files, valid_files, train_targets, valid_targets = \
            train_test_split(use_files, use_targets, test_size=0.2, random_state=5)

print('Among the %s randomly selected images,' % len(use_files))
print('we use %d images for training and' % len(train_files))
print('we use %d images for validation.' % len(valid_files))

In [None]:
test_image_filename_list = [test_file_path[15:] for test_file_path in test_files]

---
<a id='step1'></a>
## Step 1: Data Analysis and Preprocessing

### Data Analysis

In the code cells below, we read the **driver_imgs_list.csv** file provided by State Farm. This CSV file lists the original training images together with their subject (driver id) and classname (label id). We then analyze this original training set. The classes are only slightly imbalanced: among the 10 classes, 'c0' has the most images (2489) and 'c8' has the fewest (1911).
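
To put the size of the imbalance into a number, the two counts quoted above can be compared directly. This is a small sketch; `imbalance_ratio` is a hypothetical helper, and only the two class counts stated in the text are used.

```python
def imbalance_ratio(class_counts):
    """Ratio between the largest and the smallest class size."""
    return max(class_counts.values()) / min(class_counts.values())


# Only the largest and the smallest class are quoted in the text above.
quoted_counts = {'c0': 2489, 'c8': 1911}
print('largest / smallest = %.2f' % imbalance_ratio(quoted_counts))
```

A ratio close to 1 means the classes are nearly balanced, so plain accuracy and log loss are reasonable metrics without class reweighting.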

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
from IPython.display import display

training_images_df = pd.read_csv("driver_imgs_list.csv")
display(training_images_df.head())

In [None]:
display(training_images_df.describe())

In [None]:
print(training_images_df['classname'].value_counts(sort=True))

### Data Preprocessing

When using TensorFlow as backend, Keras CNNs require a 4D array (which we'll also refer to as a 4D tensor) as input, with shape

$$
(\text{nb_samples}, \text{rows}, \text{columns}, \text{channels}),
$$

where `nb_samples` corresponds to the total number of images (or samples), and `rows`, `columns`, and `channels` correspond to the number of rows, columns, and channels for each image, respectively.  

The `path_to_tensor` function below takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN.  The function first loads the image and resizes it to a square image that is $112 \times 112$ pixels.  Next, the image is converted to a 3D array, which is then expanded to a 4D tensor by adding a leading sample axis.  Since we are working with color images, each image has three channels, and since we are processing a single image (or sample), the returned tensor will always have shape

$$
(1, 112, 112, 3).
$$

The `paths_to_tensor` function takes a numpy array of string-valued image paths as input and returns a 4D tensor with shape 

$$
(\text{nb_samples}, 112, 112, 3).
$$

Here, `nb_samples` is the number of samples, or number of images, in the supplied array of image paths.  It is best to think of `nb_samples` as the number of 3D tensors (where each 3D tensor corresponds to a different image) in your dataset!
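
The shape also determines how much memory a tensor needs, because the images are stored as `float32` values (4 bytes each). The sketch below is a rough estimate only: `tensor_bytes` is a hypothetical helper, and the image count of 79726 is the test-set total shown by the progress bar printed later in this notebook.

```python
def tensor_bytes(nb_samples, rows=112, columns=112, channels=3, bytes_per_value=4):
    """Approximate size in bytes of a float32 tensor of shape
    (nb_samples, rows, columns, channels)."""
    return nb_samples * rows * columns * channels * bytes_per_value


one_image = tensor_bytes(1)
all_test_images = tensor_bytes(79726)
print('one image: %d bytes' % one_image)
print('79726 images: about %.1f GB' % (all_test_images / 1e9))
```

An estimate like this suggests why converting every test image into a single tensor can exhaust memory on a typical machine, while converting one image at a time does not.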

In [None]:
from keras.preprocessing import image                  
from tqdm import tqdm

def path_to_tensor(img_path):
    # loads RGB image as PIL.Image.Image type
    img = image.load_img(img_path, target_size=(112, 112))
    # convert PIL.Image.Image type to 3D tensor with shape (112, 112, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, 112, 112, 3) and return 4D tensor
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)

In the code cell below, we rescale the images by dividing every pixel in every image by 255.

In [None]:
from PIL import ImageFile                            
ImageFile.LOAD_TRUNCATED_IMAGES = True                 

# pre-process the data for Keras
train_tensors = paths_to_tensor(train_files).astype('float32')/255
valid_tensors = paths_to_tensor(valid_files).astype('float32')/255

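# Converting all test images into one tensor runs out of memory,
# so test images are converted one at a time at prediction time instead.
#test_tensors = paths_to_tensor(test_files).astype('float32')/255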

---
<a id='step2'></a>
## Step 2: Create a CNN to classify driver images (from scratch)

We use Keras with the TensorFlow backend to implement our CNN models. In this step, we design
a first CNN architecture from scratch and then evaluate its performance on the test set.

### Model Architecture

We create a CNN to classify driver behaviors. At the end of the code cell, we summarize the layers of the CNN model by executing the line:

        model.summary()

The architecture stacks three convolutional blocks, each containing one convolution layer and one max pooling layer, followed by two fully connected layers. The design choices are as follows:

- The number of filters doubles from one convolution layer to the next (a common practice): 16, 32, and 64 filters extract the feature maps (regional information).
- The filter window size in each convolution layer and the pool size in each max pooling layer are both (2,2), and the padding parameter is 'same' so that no information is lost near the matrix boundaries.
- Each max pooling layer uses a stride of 2, a typical setting that halves both the height and the width of each 2D feature map (dimension reduction).
- The activation function is ReLU in every layer except the output, which helps with the vanishing gradient problem; the output layer uses softmax to compute probabilities over the 10 classes.
- Batch normalization layers reduce internal covariate shift and accelerate training. In the code below, they follow the max pooling layer in each convolutional block and sit between each dense layer and its activation.
- Dropout layers with probability 0.2 reduce the risk of overfitting.
- A GlobalAveragePooling2D layer before the fully connected layers sharply reduces the number of parameters, which helps avoid overfitting and saves training time.
- The first fully connected layer has 64 nodes as an initial choice, and the output layer has 10 nodes, one per driver behavior class.
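
The effect of these choices on feature-map sizes and parameter counts can be worked out by hand. The following sketch uses the standard formulas for convolution and dense layers; the helper names are hypothetical, and batch normalization parameters are left out for brevity.

```python
import math


def same_pool_output(size, stride=2):
    """Spatial size after a pooling layer with padding='same'."""
    return math.ceil(size / stride)


def conv2d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a 2D convolution layer."""
    return (kernel_size * kernel_size * in_channels + 1) * filters


def dense_params(in_features, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (in_features + 1) * units


size, channels = 112, 3
for filters in (16, 32, 64):
    params = conv2d_params(2, channels, filters)
    # A 'same' convolution keeps the spatial size; the pooling layer halves it.
    size = same_pool_output(size)
    channels = filters
    print('conv %2d filters: %5d parameters, output %dx%dx%d'
          % (filters, params, size, size, channels))

# Global average pooling leaves one value per channel (64 here).
print('dense 64:', dense_params(channels, 64), 'parameters')
print('dense 10:', dense_params(64, 10), 'parameters')
```

The same dense-layer formula can be checked against the summary printed in Step 4, where a 64-node layer on 2048 features reports 131136 parameters.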

In [None]:
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense, Activation
from keras.models import Sequential
from keras.layers import BatchNormalization

model = Sequential()

model.add(Conv2D(filters=16, kernel_size=2, padding='same', input_shape=(112, 112, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=2, strides=2, padding='same'))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2, strides=2, padding='same'))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=2, strides=2, padding='same'))
model.add(BatchNormalization())
model.add(Dropout(0.2))

model.add(GlobalAveragePooling2D())

model.add(Dense(64))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(10))
model.add(BatchNormalization())
model.add(Activation('softmax'))

model.summary()

### Compile the Model

In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

### Train the Model

We train the first CNN model we designed in the code cell below. We use model checkpointing to save the model that attains the best validation loss.
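
The idea behind saving only the best model can be illustrated without Keras. The sketch below is a plain-Python illustration of the rule, not the Keras implementation; `epochs_that_save` is a hypothetical helper and the loss values are made up.

```python
def epochs_that_save(val_losses):
    """Return the epoch indices at which a 'save best only' rule would save."""
    best = float('inf')
    saved = []
    for epoch, val_loss in enumerate(val_losses):
        # A NaN loss never compares as smaller, so it never triggers a save.
        if val_loss < best:
            best = val_loss
            saved.append(epoch)
    return saved


# Made-up validation losses, for illustration only.
print(epochs_that_save([2.0, 1.5, 1.7, float('nan'), 1.2]))
```

Under this rule the weights on disk always belong to the epoch with the lowest validation loss seen so far, which is the model reloaded before testing.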

In [None]:
from keras.callbacks import ModelCheckpoint  

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.from_scratch.hdf5', 
                               verbose=1, save_best_only=True)

model.fit(train_tensors, train_targets, 
          validation_data=(valid_tensors, valid_targets),
          epochs=20, batch_size=20, callbacks=[checkpointer], verbose=1)

### Load the Model with the Best Validation Loss

In [None]:
model.load_weights('saved_models/weights.best.from_scratch.hdf5')

### Test the Model

In the code cell below, we test our first CNN model on the test set of driver images. The predicted class probabilities for all test images are written to the CSV file **CNN_1_test_probability.csv**, following the submission format defined by Kaggle.
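
The layout of such a file can be sketched with the standard library alone. The header matches the one passed to `np.savetxt` in this notebook; `write_submission`, the file name, and the probabilities are hypothetical and only serve as an illustration.

```python
import csv
import io

CLASS_CODES = ['c%d' % i for i in range(10)]


def write_submission(rows, stream):
    """Write (image_name, probabilities) pairs as CSV rows with a header."""
    writer = csv.writer(stream, lineterminator='\n')
    writer.writerow(['img'] + CLASS_CODES)
    for image_name, probabilities in rows:
        writer.writerow([image_name] + ['%.6f' % p for p in probabilities])


# Made-up file name and uniform probabilities, for illustration only.
buffer = io.StringIO()
write_submission([('img_example.jpg', [0.1] * 10)], buffer)
print(buffer.getvalue())
```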

**The score of our first CNN model (evaluation metric: multi-class logarithmic loss) is 2.86036.**

**This test result ranks 1367 out of 1440 on the public leaderboard.**
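
For orientation, multi-class logarithmic loss averages the negative log of the probability assigned to the true class. The sketch below is a minimal version under stated assumptions: `multiclass_log_loss` is a hypothetical helper, and the clipping constant `eps` is an assumption rather than the exact value used by the competition.

```python
import math


def multiclass_log_loss(probabilities, true_indices, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for row, true_index in zip(probabilities, true_indices):
        # Clip to avoid taking the log of 0 or 1 exactly.
        p = min(max(row[true_index], eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_indices)


# Predicting 0.1 for every class gives ln(10), whatever the true label is.
uniform = [[0.1] * 10]
print('%.5f' % multiclass_log_loss(uniform, [3]))
```

Since ln(10) is about 2.3026, a score of 2.86036 lies above this uniform baseline, which suggests the first model's probabilities are poorly calibrated on the test drivers.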

In [None]:
driver_behavior_predictions = []
for test_file in tqdm(test_files):
    # convert one test image at a time to a (1, 112, 112, 3) tensor to limit memory use
    test_tensor = path_to_tensor(test_file).astype('float32')/255
    # model.predict returns one row of 10 class probabilities for the single image
    driver_behavior_predictions.append(model.predict(test_tensor)[0])

test_image_probability_csv = np.column_stack((np.asarray(test_image_filename_list), \
                                              np.asarray(driver_behavior_predictions, dtype=np.float32)))

np.savetxt('submission/CNN_1_test_probability.csv', test_image_probability_csv, delimiter=',', \
           comments='', newline='\n', fmt='%s', header='img,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9')

---
<a id='step3'></a>
## Step 3: Use a CNN to classify driver images (using transfer learning)

To reduce training time without sacrificing accuracy, we will train a CNN model using
transfer learning. In this step, our CNN model will use the pre-trained VGG-16 model as a
fixed feature extractor, where the last convolutional output of VGG-16 is fed as input to our
model.

## VGG16
### Obtain Bottleneck Features

In [None]:
from keras.applications.vgg16 import VGG16

# https://keras.io/applications/#vgg16
# NOT include the 3 fully-connected layers at the top of the network
model = VGG16(include_top=False)

model.summary()

In [None]:
# run all images through the convolutional base in batches instead of one predict call per image
bottleneck_features_train = model.predict(train_tensors).astype(np.float32)

bottleneck_features_valid = model.predict(valid_tensors).astype(np.float32)

In [None]:
# np.save accepts a file path directly, so no file handle is left open
np.save('bottleneck_features/driver_VGG16_train.npy', bottleneck_features_train)
np.save('bottleneck_features/driver_VGG16_valid.npy', bottleneck_features_valid)

In [None]:
# bottleneck_features_train = np.load('bottleneck_features/driver_VGG16_train.npy')
# bottleneck_features_valid = np.load('bottleneck_features/driver_VGG16_valid.npy')

### Model Architecture

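In our second CNN model, which uses transfer learning with VGG16, we add only a small classifier on top of the VGG16 bottleneck features: a global average pooling layer, a fully connected layer with 64 nodes (followed by batch normalization, ReLU, and dropout), and a fully connected output layer that contains one node for each driver behavior category and is equipped with a softmax.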
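
Global average pooling replaces each feature map by its mean, so a stack of feature maps becomes a single vector with one value per channel. The sketch below shows this on a tiny made-up example in plain Python; `global_average_pool` is a hypothetical helper, not the Keras layer.

```python
def global_average_pool(feature_map):
    """Average an H x W x C nested list over H and W, returning C values."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    sums = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                sums[c] += value
    return [s / (height * width) for s in sums]


# Tiny made-up 2 x 2 feature map with 3 channels.
toy_map = [[[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
           [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]]]
print(global_average_pool(toy_map))
```

Because the pooled vector has one entry per channel, the dense layers that follow stay small regardless of the spatial size of the bottleneck features.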

In [None]:
VGG16_model = Sequential()

VGG16_model.add(GlobalAveragePooling2D(input_shape=bottleneck_features_train.shape[1:]))

VGG16_model.add(Dense(64))
VGG16_model.add(BatchNormalization())
VGG16_model.add(Activation('relu'))
VGG16_model.add(Dropout(0.2))

VGG16_model.add(Dense(10))
VGG16_model.add(BatchNormalization())
VGG16_model.add(Activation('softmax'))

VGG16_model.summary()

### Compile the Model

In [None]:
VGG16_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

### Train the Model

In [None]:
checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.VGG16.hdf5', 
                               verbose=1, save_best_only=True)

VGG16_model.fit(bottleneck_features_train, train_targets, 
                validation_data=(bottleneck_features_valid, valid_targets),
                epochs=20, batch_size=20, callbacks=[checkpointer], verbose=1)

### Load the Model with the Best Validation Loss

In [None]:
VGG16_model.load_weights('saved_models/weights.best.VGG16.hdf5')

### Test the Model

In the code cell below, we test our second CNN model (transfer learning with VGG16) on the test set of driver images. The predicted class probabilities for all test images are written to the CSV file **CNN_VGG16_test_probability.csv**, following the submission format defined by Kaggle.

**The score of our second CNN model (transfer learning with VGG16; evaluation metric: multi-class logarithmic loss) is .**

**The test result of our second CNN model (transfer learning with VGG16) is ranked    out of 1440 on the public leaderboard.**

In [None]:
driver_behavior_predictions = []
for test_file in tqdm(test_files):
    # convert one test image at a time to a (1, 112, 112, 3) tensor to limit memory use
    test_tensor = path_to_tensor(test_file).astype('float32')/255
    # extract VGG16 bottleneck features, then classify them with our top model
    test_bottleneck_feature = model.predict(test_tensor)
    driver_behavior_predictions.append(VGG16_model.predict(test_bottleneck_feature)[0])

test_image_probability_csv = np.column_stack((np.asarray(test_image_filename_list), \
                                              np.asarray(driver_behavior_predictions, dtype=np.float32)))

np.savetxt('submission/CNN_VGG16_test_probability.csv', test_image_probability_csv, delimiter=',', \
           comments='', newline='\n', fmt='%s', header='img,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9')

---
<a id='step4'></a>
## Step 4: Create a CNN to classify driver images (using transfer learning)

In this step, instead of VGG16, we try another pre-trained model, such as VGG19, ResNet50, InceptionV3, or Xception, as the basis for transfer learning. We can then compare
the chosen CNN model with the VGG16 model above and check the difference between their prediction scores.

Here we choose ResNet50 as the pre-trained model for transfer learning.

## ResNet50
### Obtain Bottleneck Features

In [22]:
from keras.applications.resnet50 import ResNet50

# https://keras.io/applications/#resnet50
# NOT include the fully-connected layer at the top of the network.
model = ResNet50(include_top=False)

model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_2 (InputLayer)             (None, None, None, 3) 0                                            
____________________________________________________________________________________________________
conv1 (Conv2D)                   (None, None, None, 64 9472        input_2[0][0]                    
____________________________________________________________________________________________________
bn_conv1 (BatchNormalization)    (None, None, None, 64 256         conv1[0][0]                      
____________________________________________________________________________________________________
activation_5 (Activation)        (None, None, None, 64 0           bn_conv1[0][0]                   
___________________________________________________________________________________________

In [23]:
# run all images through the convolutional base in batches instead of one predict call per image
bottleneck_features_train = model.predict(train_tensors).astype(np.float32)

bottleneck_features_valid = model.predict(valid_tensors).astype(np.float32)

In [24]:
# np.save accepts a file path directly, so no file handle is left open
np.save('bottleneck_features/driver_ResNet50_train.npy', bottleneck_features_train)
np.save('bottleneck_features/driver_ResNet50_valid.npy', bottleneck_features_valid)

In [26]:
# bottleneck_features_train = np.load('bottleneck_features/driver_ResNet50_train.npy')
# bottleneck_features_valid = np.load('bottleneck_features/driver_ResNet50_valid.npy')

### Model Architecture

In [28]:
ResNet50_model = Sequential()

ResNet50_model.add(GlobalAveragePooling2D(input_shape=bottleneck_features_train.shape[1:]))

ResNet50_model.add(Dense(64))
ResNet50_model.add(BatchNormalization())
ResNet50_model.add(Activation('relu'))
ResNet50_model.add(Dropout(0.2))

ResNet50_model.add(Dense(10))
ResNet50_model.add(BatchNormalization())
ResNet50_model.add(Activation('softmax'))

ResNet50_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
global_average_pooling2d_4 ( (None, 2048)              0         
_________________________________________________________________
dense_7 (Dense)              (None, 64)                131136    
_________________________________________________________________
batch_normalization_10 (Batc (None, 64)                256       
_________________________________________________________________
activation_56 (Activation)   (None, 64)                0         
_________________________________________________________________
dropout_7 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 10)                650       
_________________________________________________________________
batch_normalization_11 (Batc (None, 10)                40        
__________

### Compile the Model

In [29]:
ResNet50_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

### Train the Model

In [30]:
checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.ResNet50.hdf5', 
                               verbose=1, save_best_only=True)

ResNet50_model.fit(bottleneck_features_train, train_targets, 
                   validation_data=(bottleneck_features_valid, valid_targets),
                   epochs=20, batch_size=20, callbacks=[checkpointer], verbose=1)

Train on 179 samples, validate on 45 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
 20/179 [==>...........................] - ETA: 0s - loss: nan - acc: 0.1000Epoch 00008: val_loss did not improve
Epoch 10/20
 20/179 [==>...........................] - ETA: 0s - loss: nan - acc: 0.2500Epoch 00009: val_loss did not improve
Epoch 11/20
Epoch 12/20
 20/179 [==>...........................] - ETA: 0s - loss: nan - acc: 0.1000Epoch 00011: val_loss did not improve
Epoch 13/20
 20/179 [==>...........................] - ETA: 0s - loss: nan - acc: 0.1000Epoch 00012: val_loss did not improve
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
 20/179 [==>...........................] - ETA: 0s - loss: nan - acc: 0.1500Epoch 00016: val_loss did not improve
Epoch 18/20
Epoch 19/20
Epoch 20/20
 20/179 [==>...........................] - ETA: 0s - loss: nan - acc: 0.0500Epoch 00019: val_loss did not improve


<keras.callbacks.History at 0x1ff5fbcdb00>

### Load the Model with the Best Validation Loss

In [31]:
ResNet50_model.load_weights('saved_models/weights.best.ResNet50.hdf5')

### Test the Model

In the code cell below, we test our CNN model (transfer learning with ResNet50) on the test set of driver images. The predicted class probabilities for all test images are written to the CSV file **CNN_ResNet50_test_probability.csv**, following the submission format defined by Kaggle.

**The score of our CNN model using transfer learning with ResNet50 (evaluation metric: multi-class logarithmic loss) is .**

**The test result of our CNN model using transfer learning with ResNet50 is ranked    out of 1440 on the public leaderboard.**

In [32]:
driver_behavior_predictions = []
for test_file in tqdm(test_files):
    # convert one test image at a time to a (1, 112, 112, 3) tensor to limit memory use
    test_tensor = path_to_tensor(test_file).astype('float32')/255
    # extract ResNet50 bottleneck features, then classify them with our top model
    test_bottleneck_feature = model.predict(test_tensor)
    driver_behavior_predictions.append(ResNet50_model.predict(test_bottleneck_feature)[0])

test_image_probability_csv = np.column_stack((np.asarray(test_image_filename_list), \
                                              np.asarray(driver_behavior_predictions, dtype=np.float32)))

np.savetxt('submission/CNN_ResNet50_test_probability.csv', test_image_probability_csv, delimiter=',', \
           comments='', newline='\n', fmt='%s', header='img,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9')

  0%|                                                                             | 43/79726 [00:15<7:40:12,  2.89it/s]

KeyboardInterrupt: 

---
<a id='step5'></a>
## Step 5: Algorithm test result

We choose the best result (the lowest multi-class log loss) among the CNN models constructed above as the final output of our proposed algorithm.

**The CNN model constructed above with the lowest log loss is:  , and this CNN model has the score of  .**
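
The selection rule itself is a minimum over the recorded scores. In the sketch below, `best_model` is a hypothetical helper; only the from-scratch score is stated in this notebook, so the other two entries are deliberately left as `None` rather than filled in.

```python
def best_model(scores):
    """Return (name, score) with the lowest log loss, skipping missing scores."""
    recorded = {name: s for name, s in scores.items() if s is not None}
    if not recorded:
        return None
    name = min(recorded, key=recorded.get)
    return name, recorded[name]


# Scores that are not recorded in the notebook stay as None.
scores = {'CNN from scratch': 2.86036, 'VGG16': None, 'ResNet50': None}
print(best_model(scores))
```

With only one score recorded, the comparison is not yet meaningful; the outcome may change once the VGG16 and ResNet50 submissions are scored.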