## Transfer Learning

In [37]:
import datetime
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D

For this exercise, we will use the MNIST dataset. It contains images of the handwritten digits 0-9, which we will attempt to classify from their pixel arrays. The goal of transfer learning is to train a model on one task and see how well that knowledge transfers to a related task, ideally with greater accuracy or less training. In this example, I will train a model on the digits 5-9, then retrain only the last layers on the digits 0-4 and see how accurately it classifies them.


### Creating a method for building a CNN

In [38]:
#These are the hyperparameters that we will tune for the model
batch_size = 128
num_classes = 5
epochs = 5 

img_rows, img_cols = 28, 28
filters = 32
pool_size = 2
kernel_size = 3

***Creating a function that trains a model and takes four inputs:***

1. Model
2. Train set
3. Test set
4. The number of classes

In [39]:
now = datetime.datetime.now  # Used to record time for training and testing
def train_model(model, train, test, num_classes):
    X_train = train[0].reshape((train[0].shape[0],) + input_shape) #Reshape to (samples, rows, cols, channels); input_shape is a global set before this function is called
    X_test = test[0].reshape((test[0].shape[0],) + input_shape)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255  #Scaling
    X_test /= 255
    print('X_train shape:', X_train.shape)
    print(X_train.shape[0], 'train samples') #The amount of training samples in the dataset
    print(X_test.shape[0], 'test samples') #The amount of test samples in the dataset

    # One-hot encode the labels, as required by the categorical_crossentropy loss
    y_train = keras.utils.to_categorical(train[1], num_classes)
    y_test = keras.utils.to_categorical(test[1], num_classes)

    model.compile(loss='categorical_crossentropy',optimizer='adadelta',metrics=['accuracy'])

    t = now()
    model.fit(X_train, y_train,batch_size=batch_size,epochs=epochs,verbose=1,validation_data=(X_test, y_test))
    print('Training time: %s' % (now() - t))

    score = model.evaluate(X_test, y_test, verbose=1)
    print('Test score:', score[0])
    print('Test accuracy:', score[1])
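
The preprocessing steps inside `train_model` can be tried in isolation on a tiny synthetic batch. The sketch below uses plain NumPy; `one_hot` is a hypothetical helper standing in for `keras.utils.to_categorical`, and the random array is made-up data, not MNIST.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical stand-in for keras.utils.to_categorical (illustration only)."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# Made-up batch of four 28x28 grayscale "images" with pixel values 0-255
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
labels = np.array([0, 3, 4, 1])

# Add the trailing channel axis that Conv2D expects, then scale to [0, 1]
input_shape = (28, 28, 1)
X = images.reshape((images.shape[0],) + input_shape).astype("float32") / 255
y = one_hot(labels, num_classes=5)

print(X.shape)  # (4, 28, 28, 1)
print(y)        # one row per label, with a single 1.0 in the label's column
```

Each row of `y` sums to 1, which is the target format that a softmax output trained with categorical cross-entropy expects.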

### Let's test this model on the MNIST data

In [40]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [41]:
#Digits 0-4
X_train_lt5 = X_train[y_train < 5] #lt = less than 
y_train_lt5 = y_train[y_train < 5]
X_test_lt5 = X_test[y_test < 5]
y_test_lt5 = y_test[y_test < 5]

#Digits 5-9
X_train_gte5 = X_train[y_train >= 5] #gte = greater than or equal to 5
y_train_gte5 = y_train[y_train >= 5] - 5
X_test_gte5 = X_test[y_test >= 5]
y_test_gte5 = y_test[y_test >= 5] - 5 

In [42]:
y_test_gte5 #These rows are the digits 5-9, but the labels are shifted to 0-4 so they match the five output classes

array([2, 4, 0, ..., 4, 0, 1], dtype=uint8)
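
The masking and label shift can be checked on a small example. This is a minimal sketch with made-up labels in place of `y_train`:

```python
import numpy as np

# Made-up labels standing in for y_train (illustration only)
y = np.array([7, 2, 9, 0, 5, 4, 6])

mask_gte5 = y >= 5          # True where the digit is 5-9
y_gte5 = y[mask_gte5] - 5   # 5..9 becomes 0..4

mask_lt5 = y < 5
y_lt5 = y[mask_lt5]         # already in the range 0..4

print(y_gte5)  # [2 4 0 1]
print(y_lt5)   # [2 0 4]
```

Both subsets end up with labels in the range 0-4, so the same five-unit softmax layer can serve either task.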

#### Creating the convolutional layer, flattening the image, adding dropout and activations

In [43]:
input_shape = (28, 28, 1)
feature_layers = [Conv2D(filters, kernel_size,padding='valid',input_shape=input_shape), Activation('relu'),
    Conv2D(filters, kernel_size),Activation('relu'),MaxPooling2D(pool_size=pool_size),Dropout(0.25),Flatten(),]

#### Creating the output layer

In [44]:
classification_layers = [Dense(128),Activation('relu'),Dropout(0.2),Dense(num_classes),Activation('softmax')]

In [45]:
model = Sequential(feature_layers + classification_layers)

In [46]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_4 (Conv2D)           (None, 26, 26, 32)        320       
                                                                 
 activation_8 (Activation)   (None, 26, 26, 32)        0         
                                                                 
 conv2d_5 (Conv2D)           (None, 24, 24, 32)        9248      
                                                                 
 activation_9 (Activation)   (None, 24, 24, 32)        0         
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 12, 12, 32)       0         
 2D)                                                             
                                                                 
 dropout_4 (Dropout)         (None, 12, 12, 32)        0         
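
The output shapes and parameter counts in the summary follow from simple arithmetic. The sketch below recomputes them, assuming unpadded ('valid') 3x3 convolutions with stride 1 and 2x2 pooling as configured above; the helper names are illustrative only.

```python
def conv_output_size(size, kernel, stride=1):
    """Output side length of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1

def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

side = 28
side = conv_output_size(side, 3)   # first Conv2D  -> 26
conv1 = conv_params(3, 1, 32)      # 320 parameters
side = conv_output_size(side, 3)   # second Conv2D -> 24
conv2 = conv_params(3, 32, 32)     # 9248 parameters
side = side // 2                   # 2x2 max pooling -> 12
flat = side * side * 32            # length after Flatten

print(side, conv1, conv2, flat)    # 12 320 9248 4608
```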
                                                      

#### Training the model on digits 5-9

In [47]:
train_model(model,(X_train_gte5, y_train_gte5),(X_test_gte5, y_test_gte5), num_classes)

X_train shape: (29404, 28, 28, 1)
29404 train samples
4861 test samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Training time: 0:01:21.476132
Test score: 1.4116288423538208
Test accuracy: 0.6428718566894531


#### Freezing layers

Keras allows layers to be "frozen" during the training process. That is, some layers have their weights updated during training while others do not. This ability to train just the last one or several layers is a core part of transfer learning.

Note also that a lot of the training time is spent back-propagating the gradients to the first layer. Therefore, if we only need to compute the gradients back through a small number of layers, each training iteration is much quicker. This is in addition to the savings gained by being able to train on a smaller data set.
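
A rough way to see where the time goes is to count multiply-accumulate operations (MACs) per image in the forward pass as a proxy for each layer's cost. This is a back-of-the-envelope sketch that assumes the layer sizes used in this notebook and ignores pooling, activations, and framework overhead.

```python
# Layer sizes assumed from this notebook: 28x28x1 input, two 3x3 convolutions
# with 32 filters, 2x2 max pooling, then Dense(128) and Dense(5).
def conv_macs(out_side, kernel, in_channels, filters):
    """Multiply-accumulates for one 'valid' convolution on one image."""
    return out_side * out_side * filters * kernel * kernel * in_channels

feature_macs = conv_macs(26, 3, 1, 32) + conv_macs(24, 3, 32, 32)
classifier_macs = (12 * 12 * 32) * 128 + 128 * 5

share = feature_macs / (feature_macs + classifier_macs)
print(f"feature layers:        {feature_macs:,} MACs")
print(f"classification layers: {classifier_macs:,} MACs")
print(f"share of forward cost in the feature layers: {share:.0%}")
```

By this estimate the convolutional layers account for most of the arithmetic, so skipping their weight gradients removes a large part of the backward pass. The forward pass through them still runs, which limits the overall speedup.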

In [34]:
# Freeze the feature layers; train_model recompiles the model, so the change takes effect
for l in feature_layers:
    l.trainable = False

In [36]:
train_model(model,(X_train_lt5, y_train_lt5),(X_test_lt5, y_test_lt5), num_classes)

X_train shape: (30596, 28, 28, 1)
30596 train samples
5139 test samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Training time: 0:00:22.744790
Test score: 1.4418450593948364
Test accuracy: 0.657326340675354


#### Conclusion: 

Transfer learning is useful when the new task involves images similar to those the model was originally trained on. We reuse the beginning layers, freeze them so their weights stay fixed (which also acts as a form of regularization), and train only the last few layers, which map the learned features to the new classes. In this run, training on digits 0-4 with frozen feature layers was much faster than the original training on digits 5-9 (about 23 seconds versus 81 seconds) and reached a similar test accuracy (about 0.66 versus 0.64).
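
The speedup can be quantified from the training times printed by the two runs above. This is a small sketch using only the standard library; the two runs used slightly different numbers of samples (29404 versus 30596), so the ratio is approximate.

```python
from datetime import timedelta

# Training times reported by the two runs above
full_training = timedelta(minutes=1, seconds=21.476132)   # digits 5-9, all layers trainable
frozen_training = timedelta(seconds=22.744790)            # digits 0-4, feature layers frozen

speedup = full_training / frozen_training  # timedelta / timedelta gives a float
print(f"Speedup with frozen feature layers: {speedup:.2f}x")
```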