### Transfert Learning: Explanation


One of the most popular idea in Deep Learning is that sometime you can take knowledge to neural network has learned for one task and apply that knowledge to a separate task. So for example, may be you can have a neural network learnt to recognize object like 'cat' and you use that knowledge to help you do a better job within extend cat. This is called:  Transfert Learning. 
Let take a look.

### When to use transfer learning?

Transfer learning allows you to transfer knowledge from one model to another. For example, you could transfer image recognition knowledge from a cat recognition app to a radiology diagnosis.
Implementing transfer learning involves retraining the last few layers of the network used for a similar application domain with much more data. 
The idea is that hidden units earlier in the network have a much broader application which is usually not specific to the exact task that you are using the network for.
In summary, transfert learning works when both taks you are trying to learn from has much more data than the task you are trying to train.

Let say we create a neural network on image recognition. 

x ----> FL1 ----> SL ----> TL ----> FL2 ----> FL3 ----> FL4 ----> LL ----> y^hat

(x,y) is given such that x represent an image and y represent some objects in image (like cat, dog, ...)

If you want to take this network and adapt or tranfert what is learn for different task as radiology diagnosis, what you can do is take the last layer of the network and just delete that, delete also the weight between this layer and the previous one. And create a new weight by random initialize a new layer and that have a new output for radiology diagnosis. 

So during the first training, you train for image recognition, you train all the weights and all the layers to get your output y^hat for image recognition. Having train that neural network, what we now do to implement transfert learning is swapping a new dataset (x,y) where now:

                     x: ---> radiology images
                     y: ---> diagnosis we want to predict
                     
And we do it by initialize the last layer-weight randomly and now we train a neural network on this new radiology dataset, we might retrain the weight of the last layer and keep the rest of the parameters fix, if you have enough data, you can also retrain all the layers of the rest of the network.
And we retrain all the parameters of the network, then this initial phase of training on image recongition, is sometimes called:
          
                    pre-training, because we are using image recognition data to reinitialize or really pretrain the rest of the weights of the network. 
                    
And then if you are updating all the weights and training on the radiology data, sometimes that os called:

                    fine-tuning
                    
So, we hear the words pre-training and fine-tuning in deep learning context, that is how the use it fro transfert learning tasks. So we have done in this example is that we take knowledge learnt from image recognition data and transfert it to radiology diagnosis. The reason this can be helpful is that having learn on more data for image recognition can help to learn better or faster for radiology dataset.




Here is another example for speech recognition system:

x ----> FL1 ----> SL ----> TL ----> FL2 ----> FL3 ----> FL4 ----> LL ----> y^hat

Where (x,y) is given such that :
                  
                  x: represent audio 
                  y: represent transcript

And let say you now want to build a wakeword detection

So what we are doing here is to delete the last layer or weight in speech recognition and create a new weight and multiple layer for the output y^hat for wakework detection.
And also depending on the size of the data we have, we can only retrain the new weight and layer for wakeword if we have small data or retrain all the weights and layer in the network if we have very lage of data.

So, transfert learning make sence, if we have a lot of data for the problem you are transfering from and usually less than the problem you are transfering to. 

So, transfert learning make sense when you transfert knowledge from a lot of data to a small dataset. 
                 
                Example: 1 000 000 image recognition
                         100       radiology diagnosis

Summary:
    
    When transfert learning makes sense?
    
    If you are trying to learn from a task A to a task B:
        
        
        1- Task A and B have the same input x
        2- You have a lot more data for task A than B
        3- Low level features from A could be helpful for learning B

Three types of transfer of learning:
    
    -Positive transfer: When learning in one situation facilitates learning in another task harder. It is known as positive transfer.
    -Negative transfer: When learning of one task makes the learning of another task harder. It is known as negative transfer
    -Neutral transfer

 ### When to use multi-task learning?

Multi-task learning forces a single neural network to learn multiple tasks at he same time (as opposed to having a separate neural network for each task). Andrew Ng explains that the approach works well when the set of tasks could benefit from having shared lower-level features and when the amount of data you have for each task is similar in magnitude.

###### Ref: Andrew Ng 

### Transfert Learning in Keras

We will be using the Cifar-10 dataset and the keras framework to implement our model. In this post, we will first build a model from scratch and then try to improve it by implementing transfer learning. Before we start to code, let’s discuss the Cifar-10 dataset in brief. Cifar-10 dataset consists of 60,000 32*32 color images in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 testing images. Let’s begin by importing the dataset. Since this dataset is present in the keras database, we will import it from keras directly.

In [1]:
import numpy as np
from keras.datasets import cifar10

#Load the dataset:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


Let’s check the shape of the train and test dataset.

In [2]:
print("There are {} train images and {} test images.".format(X_train.shape[0], X_test.shape[0]))
print('There are {} unique classes to predict.'.format(np.unique(y_train).shape[0]))

There are 50000 train images and 10000 test images.
There are 10 unique classes to predict.


We, can see that there are 50,000 train images and 10,000 test images with 10 unique classes to predict. Next, we will one-hot label our train and test labels.

In [3]:
#One-hot encoding the labels
num_classes = 10
from keras.utils import np_utils
y_train = np_utils.to_categorical(y_train, num_classes)
y_test = np_utils.to_categorical(y_test, num_classes)

Let’s visualize our training data. We will display the first eight images in the training data.

In [4]:
import matplotlib.pyplot as plt   # importing matplotlib
import numpy as np                # importing numpy
#%matplotlib inline                # see plot in Jupyter notebook


fig = plt.figure(figsize=(10, 10))

for i in range(1, 9):
    img = X_train[i-1]
    fig.add_subplot(2, 4, i)
    plt.imshow(img)

print('Shape of each image in the training data: ', X_train.shape[1:])


Shape of each image in the training data:  (32, 32, 3)


Each image in the dataset is of size: 32*32*3. Now, that we have got an idea of the dataset, let’s build a model from scratch. We will be sticking with the keras framework to build our model as it is easy to understand, but you may use other frameworks also.

In [5]:
#Importing the necessary libraries 
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPooling2D
from keras.layers import Dropout, Flatten, GlobalAveragePooling2D

#Building up a Sequential model
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu',input_shape = X_train.shape[1:]))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(GlobalAveragePooling2D())
model.add(Dense(10, activation='softmax'))
model.summary()


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 30, 30, 32)        896       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 13, 13, 32)        9248      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 6, 6, 32)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 4, 4, 64)          18496     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 2, 2, 64)          0         
_________________________________________________________________
global_average_pooling2d_1 ( (None, 64)              

Fig 2. Model summary of the model build from scratch.

From Fig 2., we can see that our model contains three convolutional layers, each followed by a max pooling layer and finally a Global Average Pooling layer followed by a dense layer with ‘softmax’ as the activation function. There are a total of 29,290 parameters to train. We will be using ‘binary cross-entropy’ as the loss function, ‘adam’ as the optimizer and ‘accuracy’ as the performance metric.

In [6]:
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

Finally, we will rescale our data. Rescale is a value by which we will multiply the data such that the resultant values lie in the range (0-1). So, in general, scaling ensures that just because some features are big in magnitude, it doesn’t mean they act as the main features in predicting the label.

In [7]:
X_train_scratch = X_train/255.
X_test_scratch = X_test/255.

Next, we will create a checkpointer to save the weights of the best model (i.e. the model with minimum loss).

In [8]:
#Creating a checkpointer 
from keras.callbacks import ModelCheckpoint

checkpointer = ModelCheckpoint(filepath='scratchmodel.best.hdf5', 
                               verbose=1,save_best_only=True)

Finally, we will fit the model to the training data points and labels. We will split the whole training data in batches of 32 and train the model for 10 epochs. We will use be 20 percent of our training data as our validation data. Hence, we will train the model on 10000 samples and validate of 10000 samples.

In [9]:
#Fitting the model on the train data and labels.
model.fit(X_train, y_train, batch_size=32, epochs=10, 
          verbose=1, callbacks=[checkpointer], validation_split=0.2, shuffle=True)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Train on 40000 samples, validate on 10000 samples
Epoch 1/10

Epoch 00001: val_loss improved from inf to 0.24949, saving model to scratchmodel.best.hdf5
Epoch 2/10

Epoch 00002: val_loss improved from 0.24949 to 0.22839, saving model to scratchmodel.best.hdf5
Epoch 3/10

Epoch 00003: val_loss improved from 0.22839 to 0.21902, saving model to scratchmodel.best.hdf5
Epoch 4/10

Epoch 00004: val_loss did not improve from 0.21902
Epoch 5/10

Epoch 00005: val_loss improved from 0.21902 to 0.20340, saving model to scratchmodel.best.hdf5
Epoch 6/10

Epoch 00006: val_loss improved from 0.20340 to 0.20140, saving model to scratchmodel.best.hdf5
Epoch 7/10

Epoch 00007: val_loss improved from 0.20140 to 0.19184, saving model to scratchmodel.best.hdf5
Epoch 8/10

Epoch 00008: val_loss improved from 0.19184 to 0.19105, saving model to scratchmodel.best.hdf5
Epoch 9/10

Epoch 00009: val_loss improved from

<keras.callbacks.callbacks.History at 0x7f1278268898>

The best model produces an accuracy of 82.01% on the training samples and 81.96% on the validation samples. Let’s evaluate the performance of the model on the test dataset.

In [10]:
#Evaluate the model on the test data
score = model.evaluate(X_test, y_test)

#Accuracy on test data
print('Accuracy on the Test Images: ', score[1])

Accuracy on the Test Images:  0.9294396638870239


So, our CNN model produces an accuracy of 82% on the test dataset. That’s great, but can we do better. Let’s implement transfer learning and check if we can improve the model. We will be using the Resnet50 model, pre-trained on the ‘Imagenet weights’ to implement transfer learning. We are using ResNet50 model but may use other models (VGG16, VGG19, InceptionV3, etc.) also.

In [11]:
#Importing the ResNet50 model
from keras.applications.resnet50 import ResNet50, preprocess_input

#Loading the ResNet50 model with pre-trained ImageNet weights
model = ResNet50(weights='imagenet', include_top=False, input_shape=(200, 200, 3))



The Cifar-10 dataset is small and similar to the ‘ImageNet’ dataset. So, we will remove the fully connected layers of the pre-trained network near the end. To implement this, we set ‘include_top = False’, while loading the ResNet50 model.

In [19]:
X_train[i].shape

(32, 32, 3)

In [18]:
#Reshaping the training data
from scipy.misc import imresize    # pip3 install scipy==1.1.0 --user

X_train_new = np.array([imresize(X_train[i], (200, 200, 3)) for i in range(0, len(X_train))]).astype('float32')

#Preprocessing the data, so that it can be fed to the pre-trained ResNet50 model. 
resnet_train_input = preprocess_input(X_train_new)

#Creating bottleneck features for the training data
train_features = model.predict(resnet_train_input)

#Saving the bottleneck features
np.savez('resnet_features_train', features=train_features)

`imresize` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``skimage.transform.resize`` instead.
  after removing the cwd from sys.path.


MemoryError: Unable to allocate array with shape (50000, 200, 200, 3) and data type float32

As the minimum size of the image that can be supplied to the ResNet50 model is (197 * 197 * 3), we resize our training images to the size (200 * 200 * 3). Next, we preprocess the resized data so that it can be fed to the pre-trained ResNet50 model as input.

Finally, we will use the pre-trained ResNet50 model to create bottleneck features for the training data. Next, we will store these bottleneck features offline because calculating them could be computationally expensive, especially when you're working on the CPU, and we want to only do it once. Note that this prevents us from using data augmentation.

In [13]:
#Reshaping the testing data
X_test_new = np.array([imresize(X_test[i], (200, 200, 3)) for i in range(0, len(X_test))]).astype('float32')

#Preprocessing the data, so that it can be fed to the pre-trained ResNet50 model.
resnet_test_input = preprocess_input(X_test_new)

#Creating bottleneck features for the testing data
test_features = model.predict(resnet_test_input)

#Saving the bottleneck features
np.savez('resnet_features_test', features=test_features)

`imresize` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``skimage.transform.resize`` instead.
  


We will use the same process to create bottleneck features for the testing data. Now, that we have created the bottleneck features, we will supply them as input to a sequential model with newly added fully connected layers that match the number of classes in the Cifar-10 dataset.

In [18]:
model = Sequential()
model.add(GlobalAveragePooling2D(input_shape=train_features.shape[1:]))
model.add(Dropout(0.3))
model.add(Dense(10, activation='softmax'))
model.summary()

NameError: name 'train_features' is not defined

Fig 3., represents the model summary of our resnet50 transfer model. We can see that the number of trainable parameters has reduce to 20,490, when compared to the trainable parameters in the CNN model that was build from scratch. Next, we will compile the model. We will use the same ‘categorical cross-entropy‘ as our loss function, ‘adam’ as our optimizer and ‘accuracy’ as the performance metric.

In [19]:
model.compile(loss='categorical_crossentropy', optimizer='adam', 
              metrics=['accuracy'])

We will create a model checkpointer to save the best model and call the ‘fit’ method to train the model for 10 epochs. The model trains on 40000 samples and validates on the remaining 10000 samples.

In [20]:
model.fit(train_features, y_train, batch_size=32, epochs=10,
          validation_split=0.2, callbacks=[checkpointer], verbose=1, shuffle=True)

NameError: name 'train_features' is not defined

The model produces an accuracy of 90.01% and 88.68% on the training data and validation data respectively. Lastly, we evaluate our model on the test data.

In [21]:
#Evaluate the model on the test data
score  = model.evaluate(test_features, y_test)

#Accuracy on test data
print('Accuracy on the Test Images: ', score[1])

ValueError: Error when checking target: expected sequential_3_input to have 4 dimensions, but got array with shape (10000, 10)

The model produces an accuracy of 88.58% on the test data.

### Conclusion

We see that by using pre-trained features, the accuracy of the model jumped from 82% to 88.58% on the test data. Also, the number of trainable parameters in the transfer model is low as compared to our scratch model. Apart from this, the CNN scratch model took around 15 minutes to train on CPU, while the transfer model took less than a minute to train the model. We can conclude that the use of transfer learning not only improves the performance of the model but also is computationally efficient.

Now, the question one may ask is if we can further improve the model, and the answer is yes. We may use techniques such as the following: - Implement data Augmentation - Fine-tuning the optimizer and loss function - Use L1 and L2 regularization - Use a different pre-trained model - Fine-tune the layers of the pre-trained model

Next, I encourage you to apply transfer learning on Cifar-100 dataset (or any other dataset of your choice) and explore the results.

Have anything to say? Feel free to drop your suggestions, recommendations, or concerns in comments below.

### References

1-Transfer Learning, Lisa Torrey and Jude Shavlik, University of Wisconsin, Madison, WI, USA. \\
2- CS231n Convolutional Neural Networks for Visual Recognition. \\
3- Yosinski J, Clune J, Bengio Y, and Lipson H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27 (NIPS ’14), NIPS Foundation, 2014.\\
4- https://www.cs.toronto.edu/~kriz/cifar.html