    10 May 2017 - Lecture 2 JNB Code Along - WH Nixalo

[Notebook](https://github.com/fastai/courses/blob/ed1fb08d86df277d2736972a1ff1ac39ea1ac733/deeplearning1/nbs/lesson2.ipynb) | Lecture [1:20:00](https://www.youtube.com/watch?v=e3aM6XTekJc)
## 1 Linear models with CNN features

In [1]:
# This is to point Python to my utils folder
import sys; import os
# DIR = %pwd
sys.path.insert(1, os.path.join('../utils'))

# Rather than importing everything manually, we'll make things easy
#   and load them all in utils.py, and just import them from there.
import utils; reload(utils)
from utils import *
%matplotlib inline

Using Theano backend.


## 1.1  Intro

We need to find a way to convert the imagenet predictions to a probability of being a cat or a dog, since that is what the Kaggle competition requires us to submit. We could use the imagenet hierarchy to download a list of all the imagenet categories in each of the dog and cat groups, and could then solve our problem in various ways, such as:

* Finding the largest probability that's either a cat or a dog, and using that label
* Averaging the probability of all the cat categories and comparing it to the average of all the dog categories.
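
As a rough illustration of those two hand-coded options, here is a minimal NumPy sketch. It uses a made-up 6-class prediction vector instead of the real 1,000, and the index lists `cat_idx` and `dog_idx` are hypothetical; the real lists would have to come from the ImageNet hierarchy.

```python
import numpy as np

# Toy stand-in for one image's ImageNet predictions (6 classes, not 1,000).
preds = np.array([0.02, 0.40, 0.05, 0.30, 0.18, 0.05])

# Hypothetical category indices; the real ones come from the ImageNet hierarchy.
cat_idx = [0, 1]
dog_idx = [3, 4]

def label_by_max(preds, cat_idx, dog_idx):
    """Option 1: use whichever group holds the single largest probability."""
    return 'cat' if preds[cat_idx].max() >= preds[dog_idx].max() else 'dog'

def label_by_mean(preds, cat_idx, dog_idx):
    """Option 2: compare the average probability of each group."""
    return 'cat' if preds[cat_idx].mean() >= preds[dog_idx].mean() else 'dog'

max_label = label_by_max(preds, cat_idx, dog_idx)
mean_label = label_by_mean(preds, cat_idx, dog_idx)
```

On this toy vector the two rules disagree (`max_label` is `'cat'`, `mean_label` is `'dog'`), which hints at how arbitrary a hand-written rule can be.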

But these approaches have some downsides:

* They require manual coding for something that we should be able to learn from the data
* They ignore information available in the predictions; for instance, if the models predict that there is a bone in the image, it's more likely to be a dog than a cat.

A very simple solution to both of these problems is to learn a linear model that is trained using the 1,000 predictions from the imagenet model for each image as input, and the dog/cat label as target.

In [2]:
%matplotlib inline
from __future__ import division, print_function
import os, json
from glob import glob
import numpy as np
import scipy
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
np.set_printoptions(precision=4,  linewidth=100)
from matplotlib import pyplot as plt
import utils; reload(utils)
from utils import plots, get_batches, plot_confusion_matrix, get_data

In [3]:
from numpy.random import random, permutation
from scipy import misc, ndimage
from scipy.ndimage.interpolation import zoom

import keras
from keras import backend as K
from keras.utils.data_utils import get_file
from keras.models import Sequential
from keras.layers import Input
from keras.layers.core import Flatten, Dense, Dropout, Lambda
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.preprocessing import image

## 1.2 Linear models in keras

Let's forget the motivating example for a second and see how we can create a simple Linear model in Keras:

Each of the ```Dense()``` layers is just a *linear* model, followed by a simple *activation function*.

In a linear model each row is calculated as ```sum(row * weights)```, where weights need to be learnt from the data & will be the same for every row. Let's create some data that we know is linearly related:

In [4]:
# we'll create a random matrix w/ 2 columns; & do a MatMul to get our 
# y value using a vector [2, 3] & adding a constant of 1.
x = random((30, 2))
y = np.dot(x, [2., 3.]) + 1.

In [5]:
x[:5]

array([[ 0.4769,  0.0115],
       [ 0.2924,  0.2354],
       [ 0.5415,  0.4835],
       [ 0.6453,  0.0165],
       [ 0.3601,  0.9353]])

In [6]:
y[:5]

array([ 1.9884,  2.2909,  3.5334,  2.3402,  4.5262])
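
To check that the row-wise formula ```sum(row * weights)``` and the matrix multiply really are the same thing, here is a small NumPy sketch. The fixed seed and the names `y_rowwise` / `y_matmul` are mine, not from the lecture.

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed so the check is repeatable
x = rng.random_sample((30, 2))          # 30 rows, 2 columns, like the cell above
weights = np.array([2., 3.])
bias = 1.

# One output per row: sum(row * weights) plus the constant.
y_rowwise = np.array([np.sum(row * weights) + bias for row in x])

# The same thing as a single matrix-vector product.
y_matmul = np.dot(x, weights) + bias

same = np.allclose(y_rowwise, y_matmul)
```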

We can use Keras to create a simple linear model (```Dense()``` - with no activation - in Keras) and optimize it using SGD to minimize mean squared error.

In [7]:
# Keras calls the Linear Model "Dense"; aka. "Fully-Connected" in other 
# libraries.
# So when we go 'Dense' w/ an input of 2 columns, & output of 1 col,
# we're defining a linear model that can go from the 2 col array above, to 
# the 1 col output of y above.
# Sequential() is a way of building multiple-layer networks. It takes an 
# array containing all the layers in your NN. A LM is a single Dense layer.
# This automatically initializes the weights sensibly & calc derivatives.
# We just tell it how to optimize the weights: SGD w/ LR=0.1, minz(MSE).
lm = Sequential([Dense(1, input_shape=(2,))])
lm.compile(optimizer=SGD(lr=0.1), loss='mse')

In [10]:
# find out our loss function w random weights
lm.evaluate(x, y, verbose=0)

20.422586441040039

In [12]:
# now run SGD for 5 epochs & watch the loss improve
# lm.fit(..) does the solving
lm.fit(x, y, nb_epoch=5, batch_size=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x114d8f610>

In [13]:
# now evaluate and see the improvement:
lm.evaluate(x, y, verbose=0)

0.028723947703838348

In [14]:
# take a look at the weights, they should be virt. equal to 2, 3, and 1:
lm.get_weights()

[array([[ 1.3697],
        [ 2.6763]], dtype=float32), array([ 1.5691], dtype=float32)]

In [16]:
# so let's run another 5 epochs and see if this improves things:
lm.fit(x, y, nb_epoch=5, batch_size=1)
lm.evaluate(x, y, verbose=0)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.003555244067683816

In [17]:
# and take a look at the new weights:
lm.get_weights()

[array([[ 1.7646],
        [ 2.8917]], dtype=float32), array([ 1.1795], dtype=float32)]

Above is everything Keras is doing behind the scenes.
So, if we pass multiple layers to Keras via ```Sequential(..)```, we can start to build & optimize Deep Neural Networks.
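
To make "behind the scenes" a bit more concrete, here is a minimal NumPy sketch of SGD with a batch size of 1 on the same kind of data. This is not Keras's actual implementation: the zero initialization, the fixed seed, and the 200 epochs are assumptions of mine, chosen so the loop settles close to the true values of 2, 3 and 1.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.random_sample((30, 2))
y = np.dot(x, [2., 3.]) + 1.

w = np.zeros(2)   # weights (assumption: start at zero rather than random)
b = 0.            # bias
lr = 0.1          # learning rate, same value as in the Keras cell

def mse(x, y, w, b):
    """Mean squared error of the linear model over all rows."""
    return np.mean((np.dot(x, w) + b - y) ** 2)

loss_before = mse(x, y, w, b)

for epoch in range(200):
    for i in rng.permutation(len(x)):      # batch_size=1: one row per update
        err = np.dot(x[i], w) + b - y[i]   # prediction error for this row
        w -= lr * 2 * err * x[i]           # d(err^2)/dw = 2 * err * x
        b -= lr * 2 * err                  # d(err^2)/db = 2 * err

loss_after = mse(x, y, w, b)
```

Five epochs, as in the cells above, only get part of the way there; the loop simply keeps repeating the same update until the weights stop moving.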

Before that, we can still use the single-layer LM to create a pretty decent entry to the dogs-vs-cats Kaggle competition.

## 1.3 Train Linear Model on Predictions

Forgetting finetuning -- how do we take the output of an ImageNet network and, as simply as possible, create a good entry to the cats-vs-dogs competition? -- Our current ImageNet network returns a thousand probabilities but we need just cat vs dog. We don't want to manually write code to roll up the hierarchy into cats/dogs.

So what we can do is learn a Linear Model that takes the output of the ImageNet model, all its 1000 predictions, as input, and uses the dog/cat label as the target -- and that LM would solve our problem.

### 1.3.1 Training the model

We start with some basic config steps. We copy a small amount of our data into a 'sample' directory, with the exact same structure as our 'train' directory -- this is *always* a good idea in *all* Machine Learning, since we should do all of our initial testing using a dataset small enough that we never have to wait for it.

In [29]:
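# setup the directories (skip any that already exist)
if not os.path.exists('data'): os.mkdir('data')
if not os.path.exists('data/dogscats'): os.mkdir('data/dogscats')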

path = "data/dogscats/"
model_path = path + 'models/'
# if the path to our models DNE, make it
if not os.path.exists(model_path): os.mkdir(model_path)
# NOTE: os.mkdir(..) only works for a single folder
#       Also will throw error if dir already exists

We'll process as many images at a time as we can. This is a case of trial and error to find the max batch size that doesn't cause a memory error.

In [30]:
batch_size = 100

We need to start with our VGG 16 model, since we're using its predictions & features

In [31]:
from vgg16 import Vgg16
vgg = Vgg16()
model = vgg.model

Our overall approach here will be:
1. Get the true labels for every image
2. Get the 1,000 ImageNet category predictions for every image (so that's a thousand floats for every image)
3. Feed these predictions as input to a simple linear model.

We use the output of step 2 as the input to our LM and the output of step 1 as its target, then create the LM & build predictions.

As usual, we start by creating our batches & validation batches:

In [32]:
# Use batch size of 1 since we're just doing preprocessing on the CPU
val_batches = get_batches(path + 'valid', shuffle=False, batch_size=1)
batches = get_batches(path + 'train', shuffle=False, batch_size=1)

Found 50 images belonging to 2 classes.
Found 352 images belonging to 2 classes.


Getting the 1,000 categories for each image will take a long time & there's no reason to do it again & again. So after we do it the first time, let's save the resulting arrays.

In [33]:
import bcolz
def save_array(fname, arr): c=bcolz.carray(arr, rootdir=fname, mode='w'); c.flush()
def load_array(fname): return bcolz.open(fname)[:]

It's also time consuming to convert all the images to the 224x224 format VGG 16 expects. So ```get_data``` will also store a Numpy array of the results of that conversion.

In [34]:
# ?? shows you the source code
??get_data

In [36]:
val_data = get_data(path + 'valid')
trn_data = get_data(path + 'train')

Found 50 images belonging to 2 classes.
Found 352 images belonging to 2 classes.


In [38]:
# so what the above does is create a Numpy array with our full set of
# training images -- 352 imgs, ea. of which is 3 colors, and 224x224
trn_data.shape

(352, 3, 224, 224)

In [40]:
save_array(model_path + 'train_data.bc', trn_data)
save_array(model_path + 'valid_data.bc', val_data)

& Now we can load our training & validation data later without recalculating them

In [41]:
trn_data = load_array(model_path + 'train_data.bc')
val_data = load_array(model_path + 'valid_data.bc')

In [42]:
val_data.shape # our 50 validatn imgs

(50, 3, 224, 224)

Most Deep Learning is done w/ One-Hot Encoding: the correct class = 1, all other classes = 0; & Keras expects labels in this very specific format. Example of One-Hot Encoding:
```
Class: 1Ht Enc:
   0    100
   1    010
   2    001
   1    010
   0    100
```
1Ht Encoding is used because you can perform a MatMul since the num. weights == encoding length. In the above example W would be a vector of: ```w1, w2, w3```

This lets you do Deep Learning very easily with categorical variables
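
Here is a small NumPy sketch of the table above and of the MatMul point. The weight values in `w` are made up purely for illustration.

```python
import numpy as np

classes = np.array([0, 1, 2, 1, 0])      # the class column from the table above
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
onehot = np.eye(n_classes)[classes]

# A hypothetical weight vector (w1, w2, w3), one weight per class.
w = np.array([10., 20., 30.])

# Multiplying by a one-hot row just picks out that class's weight.
picked = np.dot(onehot, w)
```

Each one-hot row has exactly as many entries as there are weights, so the product is always defined, and it selects the weight belonging to that row's class.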

Keras returns *classes* as a single column, so we convert to 1Ht.

In [43]:
def onehot(x): return np.array(OneHotEncoder().fit_transform(x.reshape(-1, 1)).todense())

In [45]:
# So, next thing we want to do is grab our labels and One-Hot Encode them
val_classes = val_batches.classes
trn_classes = batches.classes
val_labels = onehot(val_classes)
trn_labels = onehot(trn_classes)

In [50]:
trn_classes.shape # Keras single col of all imgs

(352,)

In [47]:
trn_labels.shape # One-Hot Encoded: 2 bit-width col <--> 2 classes

(352, 2)

In [52]:
trn_classes[:4] # taking a look at 1st 4 classes

array([0, 0, 0, 0], dtype=int32)

In [51]:
trn_labels[:4] # seeing the 1st 4 labels are 1Ht encoded

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.]])

Now we can finally do Step No.2: get the 1,000 ImageNet categ. preds for every image. Keras makes this easy for us. We can simply call ```model.predict(..)``` and pass in our data

In [53]:
trn_features = model.predict(trn_data, batch_size=batch_size)
val_features = model.predict(val_data, batch_size=batch_size)

In [54]:
trn_features.shape # we can see it is indeed No. imgs x 1000 categories

(352, 1000)

In [55]:
# let's take a look at one of the images (displaying all its categs)
trn_features[0]

array([  4.2062e-07,   1.5420e-03,   6.3156e-06,   1.6525e-05,   1.7104e-05,   9.9630e-06,
         7.9745e-06,   1.0407e-04,   6.6819e-06,   8.2504e-08,   8.4365e-07,   2.0677e-07,
         8.5876e-07,   1.1499e-06,   4.7719e-08,   6.0018e-06,   9.5073e-07,   1.2924e-06,
         2.9988e-07,   2.5385e-07,   1.5620e-07,   8.7231e-07,   4.5659e-07,   9.3004e-07,
         1.2746e-07,   1.3403e-05,   2.4640e-05,   8.9075e-05,   1.8840e-05,   2.1417e-04,
         3.9174e-06,   1.6371e-05,   1.2553e-05,   8.5407e-06,   2.0350e-06,   1.5323e-06,
         1.3331e-05,   3.3202e-05,   1.8858e-05,   1.0838e-05,   1.7524e-05,   6.8391e-07,
         8.3252e-06,   4.1266e-05,   7.1616e-06,   3.5467e-05,   2.6231e-05,   1.2236e-05,
         1.5050e-06,   1.7549e-06,   7.6077e-06,   9.2585e-04,   7.3585e-07,   4.2305e-07,
         3.6098e-06,   2.8183e-06,   1.5933e-06,   2.2448e-07,   4.4545e-07,   1.8274e-06,
         1.7588e-05,   5.1704e-06,   6.2866e-07,   1.8056e-07,   3.1049e-06,   1.7660e-07,

Not surprisingly, nearly all of these numbers are near zero.

Now we can define our linear model, just like we did earlier; now that we have our 1000 features for each image



In [56]:
# 1000 inputs, since those're the saved features, and 2 outputs: dog & cat
lm = Sequential([Dense(2, activation='softmax', input_shape=(1000,))])
lm.compile(optimizer=RMSprop(lr=0.1), loss='categorical_crossentropy', metrics=['accuracy'])

& Now we're ready to fit the model!  RMSprop is somewhat better than SGD. It's a minor tweak on SGD that tends to be much faster.
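
For reference, this is roughly what the ```softmax``` activation and the ```categorical_crossentropy``` loss compute, written out in NumPy. It's a sketch under my own assumptions (random stand-in features, small random weights, a tiny `eps` to avoid `log(0)`), not Keras's internal code.

```python
import numpy as np

def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(probs, onehot_labels):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-7                              # avoid log(0)
    return -np.mean(np.sum(onehot_labels * np.log(probs + eps), axis=1))

rng = np.random.RandomState(0)
features = rng.random_sample((4, 1000))     # stand-in for 4 rows of trn_features
W = rng.normal(scale=0.01, size=(1000, 2))  # hypothetical weights: 1000 in, 2 out
b = np.zeros(2)
labels = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])

# Linear model followed by softmax, i.e. a Dense layer with 2 softmax outputs.
probs = softmax(np.dot(features, W) + b)
loss = categorical_crossentropy(probs, labels)
```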

In [57]:
batch_size=4

In [59]:
lm.fit(trn_features, trn_labels, batch_size=batch_size, nb_epoch=3, 
       validation_data = (val_features, val_labels))

Train on 352 samples, validate on 50 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1158afcd0>

In [60]:
# let's have a look at our model
lm.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
dense_6 (Dense)                  (None, 2)             2002        dense_input_3[0][0]              
Total params: 2,002
Trainable params: 2,002
Non-trainable params: 0
____________________________________________________________________________________________________


So it ran almost instantly because running 3 epochs on a single layer with ~2,000 parameters is really quick for my little i5 MacBook :3
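
The 2,002 in the summary above can be reproduced by hand: one weight per input/output pair, plus one bias per output.

```python
n_inputs = 1000    # one input per ImageNet category
n_outputs = 2      # the two classes, cat and dog

n_weights = n_inputs * n_outputs   # one weight per input/output pair
n_biases = n_outputs               # one bias per output
n_params = n_weights + n_biases    # matches "Total params" in lm.summary()
```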

We got an accuracy of ```.92```. Let's run another 3 epochs and see if this changes:

In [63]:
lm.fit(trn_features, trn_labels, batch_size=batch_size, nb_epoch=3,
       validation_data = (val_features, val_labels))

Train on 352 samples, validate on 50 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x11dd78c10>

(I actually ran 9, bc on a tiny set of 350 images it took a bit more to improve: no change on the 1st, dropped to ```.90``` on the 2nd, and finally up to ```.94``` on the final)

Here we haven't done any finetuning. All we did was take the ImageNet model's predictions, and build a model that maps from those predictions to either 'Cat' or 'Dog'

This is actually what most amateur Machine Learning researchers do. They take a pretrained model, grab the outputs, and stick them into a linear model -- and it actually often works pretty well!

To get this 94% accuracy, we haven't used any magical libraries at all. We just grabbed our batches, turned the images into a Numpy array, ran ```model.predict(..)``` on that array, grabbed our labels and One-Hot Encoded them, and finally took the 1Ht Enc labels and the 1,000 probabilities and fed them to a Linear Model with a thousand inputs and 2 outputs - then trained it and ended up with a validation accuracy of ```0.9400```

### 1.3.3 About Activation Functions

The last thing we're going to do is take this and turn it into a finetuning model. For that we need to understand activation functions. We've been looking at our Linear Model as a series of matrix multiplies. But a series of matrix multiplies is itself a matrix multiply --> a series of linear models is itself a linear model. Deep Learning must be doing something more than just this. At each stage (layer) it is putting the activations, the results of the previous layer, through a non-Linearity of some sort. ```tanh```, ```sigmoid```, ```max(0,x)``` (ReLU), etc.

Using the activation functions at each layer, we now have a genuine, modern (ca.2017), Deep Learning Neural Network. Given enough units, this kind of NN is capable of approximating any continuous function to arbitrary accuracy.

A series of matrix-multiplies & activations (such as ReLU) is actually what's going on in a DLNN.
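
The claim that a series of linear models is itself a linear model can be checked numerically. In this NumPy sketch the layer sizes, the seed, and all variable names are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.RandomState(1)
x = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

# Two linear layers back to back...
two_layers = np.dot(np.dot(x, W1) + b1, W2) + b2

# ...equal one linear layer with merged weights and bias.
W_merged = np.dot(W1, W2)
b_merged = np.dot(b1, W2) + b2
one_layer = np.dot(x, W_merged) + b_merged
collapses = np.allclose(two_layers, one_layer)

# With a ReLU in between, the merged layer no longer reproduces the output.
relu = lambda z: np.maximum(0, z)
with_relu = np.dot(relu(np.dot(x, W1) + b1), W2) + b2
still_linear = np.allclose(with_relu, one_layer)
```

`collapses` comes out `True` and `still_linear` comes out `False`: the non-linearity is what stops the stack from folding back into a single matrix multiply.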

Remember how we defined our model:

```
lm = Sequential([Dense(2, activation='softmax', input_shape=(1000,))])
```

And the definition of a fully connected layer in the original VGG:

```
model.add(Dense(4096, activation='relu'))
```

What that ```activation``` parameter says is "after you do the matrix multiply, apply an activation of (in this case): ```max(0, x)```"

## 2 Modifying the Model
## 2.1 Retrain last layer's Linear Model
So what we need to do is take our final layer, which has a matrix multiply & an activation function, and remove it. To understand why, take a look at our DLNN layers:

In [64]:
vgg.model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
lambda_1 (Lambda)                (None, 3, 224, 224)   0           lambda_input_1[0][0]             
____________________________________________________________________________________________________
zeropadding2d_1 (ZeroPadding2D)  (None, 3, 226, 226)   0           lambda_1[0][0]                   
____________________________________________________________________________________________________
convolution2d_1 (Convolution2D)  (None, 64, 224, 224)  1792        zeropadding2d_1[0][0]            
____________________________________________________________________________________________________
zeropadding2d_2 (ZeroPadding2D)  (None, 64, 226, 226)  0           convolution2d_1[0][0]            
___________________________________________________________________________________________

The last layer is a Dense (FC/Linear) layer. It doesn't make sense to add another dense layer on top of a dense layer that's already tuned to classify the 1,000 ImageNet categories. We'll remove it, and use the previous Dense layer with its 4096 activations to find Cats & Dogs.

We do this by calling ```model.pop()``` to pop off the last layer, and set all remaining layers to be fixed, so they aren't altered.

In [80]:
model.pop()
for layer in model.layers: layer.trainable=False

Now we add our final Cat vs Dog layer

In [83]:
model.add(Dense(2, activation='softmax'))
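
To see what "fixed" layers plus one new trainable layer means in practice, here is a toy NumPy sketch. Everything in it is a stand-in of my own (a single random frozen layer instead of VGG, made-up data, plain gradient descent instead of RMSprop); the point is only that the frozen weights never get updated while the new layer does.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=(40, 8))               # toy inputs
y = (x[:, 0] > 0).astype(int)              # toy binary labels
labels = np.eye(2)[y]                      # one-hot targets

W_frozen = rng.normal(size=(8, 16))        # stands in for the kept, pretrained layers
W_frozen_before = W_frozen.copy()
W_new = np.zeros((16, 2))                  # the freshly added 2-output layer
b_new = np.zeros(2)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x):
    hidden = np.maximum(0, np.dot(x, W_frozen))   # frozen part: forward pass only
    return hidden, softmax(np.dot(hidden, W_new) + b_new)

def loss(probs):
    return -np.mean(np.sum(labels * np.log(probs + 1e-7), axis=1))

loss_before = loss(forward(x)[1])
for step in range(200):
    hidden, probs = forward(x)
    grad_logits = (probs - labels) / len(x)       # cross-entropy gradient w.r.t. logits
    W_new -= 0.01 * np.dot(hidden.T, grad_logits) # only the new layer is updated
    b_new -= 0.01 * grad_logits.sum(axis=0)
loss_after = loss(forward(x)[1])
```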

To see what happened when we called ```vgg.finetune()``` earlier:
Basically what it does is a ```model.pop()``` and a ```model.add(Dense(..))```

In [84]:
??vgg.finetune()

After we add our new final layer, we'll setup our batches to use preprocessed images (and we'll also *shuffle* the training batches to add more randomness when using multiple epochs):

In [86]:
gen = image.ImageDataGenerator()
batches = gen.flow(trn_data, trn_labels, batch_size=batch_size, shuffle=True)
val_batches = gen.flow(val_data, val_labels, batch_size=batch_size, shuffle=False)

Now we have a model designed to classify Cats vs Dogs directly, instead of classifying the 1,000 ImageNet categories & THEN mapping those to Cats vs Dogs. After this, everything is done the same as before: compile the model & choose an optimizer, then fit the model. (Btw, whenever we work with batches in Keras, we'll be using the ```_generator``` variant of a method, e.g. ```model.fit_generator(..)``` instead of ```model.fit(..)```.)

So let's do that and see what we get after 2 epochs of training:
We'll also define a function for fitting models to save time typing.

In [87]:
# NOTE: now use batches.n instead of batches.N
def fit_model(model, batches, val_batches, nb_epoch=1):
    model.fit_generator(batches, samples_per_epoch=batches.n, nb_epoch=nb_epoch,
                        validation_data=val_batches, nb_val_samples=val_batches.n)

It'll run a bit slowly since it has to calculate all previous layers in order to know what input to pass to the new final layer. We can save time by precalculating the output of the penultimate layer, like we did for the final layer earlier. Note for later work.
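
The reason precalculating is safe: frozen layers give the same output for the same image on every epoch, so computing it once loses nothing. A toy NumPy sketch of that idea follows; `frozen_forward` and the array shapes are hypothetical stand-ins, not VGG.

```python
import numpy as np

rng = np.random.RandomState(0)
images = rng.normal(size=(8, 16))          # stand-in for preprocessed images
W_frozen = rng.normal(size=(16, 6))        # stand-in for all the frozen layers

def frozen_forward(x):
    """Everything up to the penultimate layer; never changes during training."""
    return np.maximum(0, np.dot(x, W_frozen))

# Naive: recompute the frozen part in every epoch.
per_epoch = [frozen_forward(images) for epoch in range(3)]

# Cached: compute once, reuse for every epoch.
cached = frozen_forward(images)

identical = all(np.array_equal(f, cached) for f in per_epoch)
```

With the cached features in hand, training the new last layer is the same cheap linear-model fit we did on the 1,000 predictions earlier.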

In [88]:
# compile the new model
opt = RMSprop(lr=0.1)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [89]:
# then fit it
fit_model(model, batches, val_batches, nb_epoch=2)

Epoch 1/2
Epoch 2/2


Note how little actual code was needed to finetune the model. Because this is such an important and common operation, Keras is set up to make it as easy as possible. No external helper functions were needed.

It's a good idea to save weights of all your models, so you can re-use them later. Be sure to note the git commit hash of your code when keeping a research journal of your results.

In [90]:
model.save_weights(model_path + 'finetune1.h5')

In [91]:
# We can now use this as a good starting point for future Dogs v Cats models
model.load_weights(model_path + 'finetune1.h5')

In [92]:
model.evaluate(val_data, val_labels)



[0.32237322405329905, 0.97999999046325681]

Week 2 Assignments:

**Take it further** -- now that you know what's going on with finetuning and linear layers -- think about everything you know: the evaluation function, the categorical cross entropy loss function, finetuning: and see if you can find ways to make your model better and see how high up the rankings in Kaggle you can get.

**If you want to push yourself** -- see if you can do the same thing by writing all the code yourself. Don't use the class notebooks at all -- build it all from scratch.

**If you want to go *Even* further** -- see if you can enter another Kaggle competition (Galaxy Zoo, Plankton, Statefarm Distracted Driver, etc)


-- end of lecture 2 --

10 May 2017 WNx

We can look at the earlier prediction example visualizations by redefining *probs* and *preds* and re-using our earlier code.

In [None]:
preds = model.predict_classes(val_data, batch_size=batch_size)
probs = model.predict_proba(val_data, batch_size=batch_size)[:,0]
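
For a rough idea of what those two arrays hold, here is a NumPy sketch on a made-up 4-image probability matrix. It assumes ```predict_classes``` takes the argmax over the softmax outputs and ```predict_proba``` returns the raw probabilities, which is my understanding of Keras 1.x Sequential models rather than something shown in the lecture.

```python
import numpy as np

# Stand-in for the softmax output on 4 validation images; column 0 is the
# first class, column 1 is the second.
proba = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.05, 0.95]])

preds = proba.argmax(axis=1)   # index of the most likely class per image
probs = proba[:, 0]            # probability of class 0 for each image
```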

### 2.2 Retraining more layers

### 2.2.1 An Introduction to back-propagation

### 2.2.2 Training multiple layers in Keras