# Transfer Learning and Regularization

Proprietary material - Under Creative Commons 4.0 licence CC-BY-NC-ND https://creativecommons.org/licenses/by-nc-nd/4.0/



# Transfer Learning

## The Current State of Deep Learning Training

The most recent breakthroughs in DL have come with a predictable cost: training needs more computational power and more time.

There is a constant tension between designing new operators and techniques that reduce computing costs and designing ever larger architectures that deliver better performance.

<img src="CNN_Backbones_Complexity.png">
<img src="CNN_Backbones_Cost.png">

Is there a way to train complex architectures for our applications on easily accessible hardware in a reasonable time?

Nowadays, services such as Google Cloud and AWS rent out servers that let us train our models on several GPUs. Still, saving training time reduces our costs and lets us iterate more on our model.

## Initial Approach

The internet already offers a multitude of neural nets trained on huge datasets such as ImageNet, COCO, Pascal and many more, with great performance.

These networks may have taken weeks of training on very powerful GPUs, so replicating them could cost us months or large amounts of money.

Instead of repeating the training process, let's use these trained weights as the starting point for our training, effectively 'transferring' what they learned to our model.

This could also help in cases where we don't have many samples in our dataset.

Let's see how we can transfer the knowledge from one net to another with an example.

## Transfer Example


Let's suppose we are tasked with the objective of training a classifier for four species of birds, but the amount of data that we have is not enough to train a CNN from scratch.

Instead, let's download a model from the internet together with its weights, trained on 1000 classes, like the following net:

<img src="TransferLearning-Page-1.png">

To adapt it to our task we can replace the last layer like so:

<img src="TransferLearning-Page-2.png">

This allows the new layer to use the features extracted by the downloaded weights. But instead of using all the weights, we could keep only the shallower layers, which extract the lowest-level features, like this:

<img src="TransferLearning-Page-3.png">

Once again, this allows our model to use the feature extraction from the model we downloaded, but also lets it learn a higher-level extractor for its own task.

And if we need a model larger than the one we downloaded, we can still grab the lower layers to use as feature extractors and just modify the deeper layers like so:

<img src="TransferLearning-Page-4.png">
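The three variants above can be sketched with plain NumPy arrays standing in for layers. This is only a toy illustration: the layer sizes, the `new_layer` and `transfer` helpers and the random "pretrained" weights are assumptions made up for the example, not a real downloaded model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def new_layer(n_in, n_out):
    """Freshly initialised weight matrix for a dense layer (biases omitted)."""
    return rng.normal(scale=0.01, size=(n_in, n_out))


# Stand-in for the downloaded net: 64 inputs -> 32 -> 16 -> 1000 classes.
pretrained = [new_layer(64, 32), new_layer(32, 16), new_layer(16, 1000)]


def transfer(pretrained_layers, n_kept, new_layer_sizes):
    """Copy the first n_kept layers and stack freshly initialised ones on top."""
    kept = [weights.copy() for weights in pretrained_layers[:n_kept]]
    n_in = kept[-1].shape[1]
    fresh = []
    for n_out in new_layer_sizes:
        fresh.append(new_layer(n_in, n_out))
        n_in = n_out
    return kept + fresh


# Replace only the last layer: 1000 classes -> 4 bird species.
bird_net = transfer(pretrained, n_kept=2, new_layer_sizes=[4])
# Keep only the shallowest layer and learn new higher-level layers.
shallow_net = transfer(pretrained, n_kept=1, new_layer_sizes=[16, 4])
# Build a deeper model on top of the transferred feature extractor.
larger_net = transfer(pretrained, n_kept=2, new_layer_sizes=[32, 16, 4])

for name, net in [("bird", bird_net), ("shallow", shallow_net), ("larger", larger_net)]:
    print(name, [layer.shape for layer in net])
```

In every variant the copied layers keep the downloaded values, while the new layers start from a fresh initialisation and are the ones that still need training.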

## Fine Tuning

But what if we trained the whole model after loading the weights? 

This is called Fine Tuning, and it is basically just training from a starting solution, like this:

<img src="TransferLearning-Page-5.png">

The idea is that the downloaded weights provide a great initialization for our optimization problem. Before applying Fine Tuning it is very important to reduce the learning rate of the optimizer. If you don't reduce it, large updates might push the model away from the solution that was already found.

Fine Tuning and Transfer Learning are not an either-or choice. It is usually recommended to apply a bit of Fine Tuning after training with Transfer Learning to "round off" the model.
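To see why the learning rate matters when starting from an already good solution, here is a toy sketch. It is not a real network: a two-parameter quadratic stands in for the training loss, and `w_pretrained`, the curvatures, the learning rates and the number of steps are all illustrative assumptions.

```python
# Toy illustration: gradient descent started near a good solution.
# The quadratic loss and every constant here are assumptions for the demo.

CURVATURE = [10.0, 1.0]   # how steep the loss is along each parameter
W_OPTIMAL = [1.0, -2.0]   # minimum of the toy loss


def loss(w):
    return 0.5 * sum(h * (wi - wo) ** 2
                     for h, wi, wo in zip(CURVATURE, w, W_OPTIMAL))


def gradient(w):
    return [h * (wi - wo) for h, wi, wo in zip(CURVATURE, w, W_OPTIMAL)]


def fine_tune(w_start, learning_rate, steps=5):
    """Run plain gradient descent from w_start and return the final weights."""
    w = list(w_start)
    for _ in range(steps):
        g = gradient(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


# "Pretrained" weights: already close to the optimum.
w_pretrained = [1.1, -1.9]

loss_start = loss(w_pretrained)
loss_large_lr = loss(fine_tune(w_pretrained, learning_rate=0.25))
loss_small_lr = loss(fine_tune(w_pretrained, learning_rate=0.01))

print("loss at the pretrained weights:", loss_start)
print("after 5 steps, large learning rate:", loss_large_lr)
print("after 5 steps, small learning rate:", loss_small_lr)
```

With the large learning rate the updates overshoot along the steep direction and the loss ends up higher than where it started, while the small learning rate keeps improving the pretrained solution. In Keras, the corresponding step would be to pass a smaller `learning_rate` to the optimizer and compile the model again after unfreezing it.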

## Transfer Learning in CNNs

Thanks to their general structure, CNNs provide a great base for Transfer Learning, but it's important to understand that structure before applying it.

<img src="TransferLearning-Page-6.png">

The Fully Connected (FC) part of a CNN is often discarded when using Transfer Learning, while the convolutional layers are kept.

This is because the convolutional layers are the ones that extract the features from the input.

If extra layers are added, they have to come after the frozen layers, so the learned information is not lost.

## How much to train

The most difficult part of Transfer Learning is usually deciding which layers to freeze and which to train. There is no easy solution, but there are some rules of thumb that we can follow.

<img src="TransferLearning-Page-7.png">
<img src="TransferLearning-Page-8.png">

Usually, the more data we have and the more the original task differs from our own, the more layers we need to train. With very little data, training many layers invites overfitting, so it is safer to keep more of them frozen.

For example, suppose we downloaded a model trained for face recognition. To train a face mask detector we could get away with training fewer layers than if we wanted to reuse the same model for fruit classification.

# Regularization

A common problem with neural nets is that the models tend to overfit. As a reminder, let's briefly go over overfitting and underfitting.

## Over and Under Fitting


<img src="overunderfitting.png">

Source: Kaggle

This figure shows the differences between underfitting and overfitting. Probably the easiest way to introduce regularization to our model is called "early stopping".

## Early Stopping

Early stopping is, as the name indicates, just stopping the training of the model early so it doesn't adjust too much to the training data.

The best time to stop is when the training loss or metrics keep improving but the validation loss starts to go up. An example of this behavior can be seen here:

<img src="early.png">

Source: Kaggle
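The stopping rule can be sketched as a small function that scans the validation loss epoch by epoch. This is a minimal illustration, assuming a hypothetical `patience` parameter (how many epochs without improvement we tolerate) and made-up loss values.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights should be kept.

    Training is considered stopped once the validation loss has not
    improved for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # the validation loss keeps rising: stop here
    return best_epoch


# Made-up validation losses: they improve at first, then start rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(val_losses))  # prints 3
```

Keras offers the same behavior through the `keras.callbacks.EarlyStopping` callback, which takes `monitor`, `patience` and `restore_best_weights` arguments.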

## L1 and L2 Regularization

L1 and L2 regularization share the same idea: add a penalty on the weights of the model to avoid solutions that require very large weights. This keeps the model from finding very specific solutions that can't generalize well.

To apply L1 and L2 regularization you only need to add the following terms to the loss function:

L1:

$$
\hat{L} = L + \lambda \sum_{i=1}^{M} |W_i|
$$

L2:

$$
\hat{L} = L + \lambda \sum_{i=1}^{M} W_i^2
$$

Here $L$ is our previous loss, $M$ is the number of weights, $W_i$ is the $i$-th weight and $\lambda$ is a positive constant often called the regularization rate.
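As a small numeric sketch of the two formulas, the penalties can be computed directly from a list of weights. The base loss, the weights and the value of `lam` below are made-up numbers for illustration only.

```python
def l1_penalty(weights, lam):
    """lam times the sum of absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


base_loss = 0.30                  # made-up value of the unregularized loss
weights = [0.5, -2.0, 0.0, 1.5]   # made-up model weights
lam = 0.01                        # regularization rate

loss_l1 = base_loss + l1_penalty(weights, lam)
loss_l2 = base_loss + l2_penalty(weights, lam)

print("L1-regularized loss:", loss_l1)
print("L2-regularized loss:", loss_l2)
```

Note how the large weight (-2.0) dominates the L2 penalty, since it enters squared. In Keras these penalties are usually attached per layer through the `kernel_regularizer` argument, using `keras.regularizers.l1` or `keras.regularizers.l2`.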

The effects of applying L1 or L2 regularization to the solution can be seen in the following figure:

<img src="l1l2reg.png">

We can see that the L1 solution is sparse, while the L2 solution is not. This allows L1 regularization to perform feature selection: the features that end up multiplied by 0 are removed and stop propagating through the rest of the model.

Another advantage of L1 over L2 is that it is usually more robust to outliers, although this can be mitigated for L2 with other regularization techniques. The main advantage of L2 over L1 is that it is much less computationally intensive and tends to be easier to optimize, since the L2 penalty is differentiable everywhere while the L1 penalty is not differentiable at zero.
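One way to see where the sparsity comes from is a simplified one-weight problem, assumed here purely for illustration: the unregularized optimum of a weight is `a`, and we minimize `0.5 * (w - a)**2` plus the penalty. Both cases have a closed-form solution, sketched below with hypothetical helper names.

```python
import math


def l1_solution(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * abs(w) (soft-thresholding)."""
    magnitude = max(abs(a) - lam, 0.0)
    return math.copysign(magnitude, a) if magnitude > 0 else 0.0


def l2_solution(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * w**2 (uniform shrinkage)."""
    return a / (1.0 + 2.0 * lam)


unregularized = [3.0, -0.4, 0.2, -1.5]   # made-up unregularized weights
lam = 0.5

l1_weights = [l1_solution(a, lam) for a in unregularized]
l2_weights = [l2_solution(a, lam) for a in unregularized]

print("L1:", l1_weights)   # prints L1: [2.5, 0.0, 0.0, -1.0]
print("L2:", l2_weights)   # prints L2: [1.5, -0.2, 0.1, -0.75]
```

L1 sets the small weights to exactly zero and shifts the rest by `lam`, while L2 only scales every weight down, so none of them becomes exactly zero.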

## Dropout

Dropout is the final regularization technique that we will cover today, but it might be one of the most relevant.

Similarly to L1 regularization, it multiplies features by 0, but it does so randomly, which helps the net find a more general solution. To help illustrate this, let's look at an example.

We start with a FC net with five inputs:

<img src="TransferLearning-Page-9.png">

Then, at each training step we randomly turn off some connections like so:

<img src="TransferLearning-Page-10.png">

Because it's random, the following step could have two other connections turned off like this:

<img src="TransferLearning-Page-11.png">

The idea behind dropout is that these random disconnections force the model to spread the solution across all the connections instead of relying on just a couple of them.
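A minimal NumPy sketch of the idea follows. It implements the common "inverted dropout" variant, which zeroes whole activations (and with them all their outgoing connections) and rescales the survivors, rather than the per-connection picture in the figures. The function name, the rate and the input are assumptions for illustration.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training.

    The surviving activations are divided by (1 - rate) so that their
    expected value stays the same; at inference nothing is dropped.
    """
    if not training or rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(seed=0)
x = np.ones((2, 5))   # a batch of 2 samples with 5 features each

train_out = dropout(x, rate=0.4, rng=rng)                   # random mask
test_out = dropout(x, rate=0.4, rng=rng, training=False)    # unchanged

print(train_out)
print(test_out)
```

A new mask is drawn on every call, so each training step sees a different thinned network. In Keras the same behavior is provided by the `keras.layers.Dropout` layer.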



# Introduction

This week we will cover the basics of how to apply Transfer Learning to a Neural Net in Keras.

Most, if not all, of the big Deep Learning libraries come with tools to apply Transfer Learning from another model. Some even include trained, ready-to-deploy models that can be used directly for Transfer Learning and, luckily for us, Keras is one of them.

## Layer Freezing

Before using Transfer Learning, it's necessary to go over how to freeze a model's layers and what that means.

In [None]:
# Importing relevant libraries
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications.vgg16 import VGG16

First, let's load a Neural Net from the Keras library.

The architecture is VGG16, trained on the ImageNet dataset.



In [None]:
vgg_net = VGG16(weights='imagenet', input_shape=(224, 224, 3), include_top=True) 
vgg_net.summary()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5
Model: "vgg16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 224, 224, 3)]     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584

A frozen layer is a layer whose weights won't be updated during training, so to apply Transfer Learning you have to control which layers are frozen or unfrozen, either before training or in the middle of it.

To count the number of unfrozen or frozen weights, we can access the lists '*trainable\_weights*' and  '*non\_trainable\_weights*' respectively.

**Note:** There are 32 weight tensors, double the number of weight layers in the model (VGG16 has 16). This is because each FC and Conv layer has two weight tensors (the kernel and the bias).

In [None]:
print("Number of weights:", len(vgg_net.weights))
print("Number of trainable weights:", len(vgg_net.trainable_weights))
print("Number of frozen weights:", len(vgg_net.non_trainable_weights))

Number of weights: 32
Number of trainable weights: 32
Number of frozen weights: 0
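The kernel-plus-bias structure also explains the `Param #` column of the summary. As a quick sketch, the count for a convolutional layer can be reproduced by hand, given that VGG16 uses 3x3 kernels; the helper name `conv2d_params` is made up for this example.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Number of parameters of a Conv2D layer: kernel weights plus biases."""
    kernel_h, kernel_w = kernel_size
    kernel_weights = kernel_h * kernel_w * in_channels * out_channels
    biases = out_channels   # one bias per output channel
    return kernel_weights + biases


print(conv2d_params((3, 3), 3, 64))      # block1_conv1: 1792
print(conv2d_params((3, 3), 64, 64))     # block1_conv2: 36928
print(conv2d_params((3, 3), 64, 128))    # block2_conv1: 73856
print(conv2d_params((3, 3), 128, 128))   # block2_conv2: 147584
```

Pooling layers have no weights at all, which is why they show 0 parameters and do not contribute to the 32 weight tensors.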


We can freeze a layer by setting its '*trainable*' attribute to False.
As a test, let's freeze the layers 'block1\_conv2' and 'block4\_conv3', at indices 2 and 13 of the layers list.

In [None]:
vgg_net.layers[2].trainable = False
vgg_net.layers[13].trainable = False

Now let's check that those layers are frozen.

In [None]:
print("Number of weights:", len(vgg_net.weights))
print("Number of trainable weights:", len(vgg_net.trainable_weights))
print("Number of frozen weights:", len(vgg_net.non_trainable_weights))

Number of weights: 32
Number of trainable weights: 28
Number of frozen weights: 4


Setting '*trainable*' to False on a model applies to all of its sublayers, so if we want to freeze all layers we can do this:

In [None]:
vgg_net.trainable = False

print("Number of weights:", len(vgg_net.weights))
print("Number of trainable weights:", len(vgg_net.trainable_weights))
print("Number of frozen weights:", len(vgg_net.non_trainable_weights))

Number of weights: 32
Number of trainable weights: 0
Number of frozen weights: 32
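Between freezing two layers and freezing everything, a common middle ground is to freeze only the first layers and leave the rest trainable. The helper below is a hypothetical sketch: it only relies on the model having a `layers` list and on every object having a `trainable` attribute, so it is shown here on a stand-in object instead of the real `vgg_net`.

```python
from types import SimpleNamespace


def freeze_first_layers(model, n_frozen):
    """Freeze the first n_frozen layers of model and leave the rest trainable."""
    # Re-enable the container first: a model frozen as a whole reports no
    # trainable weights, even if some of its layers are unfrozen afterwards.
    model.trainable = True
    for index, layer in enumerate(model.layers):
        layer.trainable = index >= n_frozen
    return model


# Stand-in object so the sketch runs without downloading any weights.
toy_model = SimpleNamespace(
    trainable=False,
    layers=[SimpleNamespace(name=f"layer_{i}", trainable=False) for i in range(6)],
)

freeze_first_layers(toy_model, n_frozen=4)
flags = [layer.trainable for layer in toy_model.layers]
print(flags)   # prints [False, False, False, False, True, True]
```

Applied to a Keras model, the same loop would be followed by compiling the model again so that the new `trainable` flags take effect during training.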


The summary also indicates how many parameters in the model can be trained (at the bottom).

In [None]:
vgg_net.summary()

Model: "vgg16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 224, 224, 3)]     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0     

# Transfer Learning

Now let's finally prepare our model to train with Transfer Learning!

First, let's grab one of the CNNs that come with Keras, without the FC layers at the end.

This can be done by passing the 'include\_top' parameter as False.

In [None]:
vgg_net = VGG16(weights='imagenet', input_shape=(224, 224, 3), include_top=False) 
vgg_net.trainable = False
vgg_net.summary()

Model: "vgg16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 224, 224, 3)]     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0     

Now let's add a couple of FC layers on top. For simplicity, let's assume for now that our output is a binary classifier.

In [None]:
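# Use the tensorflow.keras namespace, consistent with the imports above
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras import optimizers

model = Sequential()

model.add(vgg_net)                          # frozen convolutional base
model.add(Flatten())                        # (7, 7, 512) -> 25088 features
model.add(Dense(4096, activation='relu'))
model.add(Dense(1024, activation='relu'))
model.add(Dense(1, activation='sigmoid'))   # single output for binary classification

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.Adam(),
              metrics=['accuracy'])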
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
vgg16 (Functional)           (None, 7, 7, 512)         14714688  
_________________________________________________________________
flatten_5 (Flatten)          (None, 25088)             0         
_________________________________________________________________
dense_12 (Dense)             (None, 4096)              102764544 
_________________________________________________________________
dense_13 (Dense)             (None, 1024)              4195328   
_________________________________________________________________
dense_14 (Dense)             (None, 1)                 1025      
Total params: 121,675,585
Trainable params: 106,960,897
Non-trainable params: 14,714,688
_________________________________________________________________
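The parameter counts in this summary can be checked by hand, which also makes clear that only the new FC layers are trainable. In this sketch the helper name `dense_params` is made up, and the frozen count is copied from the `vgg16` row of the summary.

```python
def dense_params(n_inputs, n_units):
    """Number of parameters of a Dense layer: weight matrix plus biases."""
    return n_inputs * n_units + n_units


flatten_size = 7 * 7 * 512   # output of the VGG16 base after Flatten: 25088

head_params = [
    dense_params(flatten_size, 4096),   # 102764544
    dense_params(4096, 1024),           # 4195328
    dense_params(1024, 1),              # 1025
]

trainable = sum(head_params)   # only the new FC layers
frozen = 14714688              # frozen VGG16 base, from the summary
total = trainable + frozen

print("Trainable params:", trainable)    # prints Trainable params: 106960897
print("Non-trainable params:", frozen)
print("Total params:", total)            # prints Total params: 121675585
```

Most of the trainable parameters sit in the first FC layer, because it is connected to all 25088 flattened features.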


Now we have our model ready for the training loop!

But let's leave that for the exercise ;)

Before getting to that, try to keep in mind the following considerations:



1.   To accelerate the training, you can run the data through the frozen layers once to extract the features, and then train the rest of the model on those features.

2.   The previous point can't be used if you apply data augmentation during training, but the augmentation can be done beforehand.

3.  At the end of the training, it's recommended to apply a bit of Fine Tuning to the model by unfreezing it and reducing the learning rate.

4.  You can't add new layers before the frozen layers. If you need to, Fine Tuning is the only option.

5. When using a CNN for Transfer Learning, remember to resize your data so it has the input size the model expects.
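The first consideration can be sketched with NumPy. Everything here is a toy stand-in chosen so the example runs on its own: the "frozen backbone" is a fixed random projection followed by a ReLU instead of a pretrained CNN, the dataset is random, and the trainable head is a plain logistic regression.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for a frozen pretrained backbone: fixed random projection + ReLU.
FROZEN_WEIGHTS = rng.normal(size=(20, 8))


def frozen_backbone(inputs):
    return np.maximum(inputs @ FROZEN_WEIGHTS, 0.0)


# Toy dataset: 200 samples with 20 inputs each and a binary label.
inputs = rng.normal(size=(200, 20))
labels = (inputs[:, 0] > 0).astype(float)

# 1) Run the expensive frozen part ONCE and cache its output.
cached_features = frozen_backbone(inputs)

# 2) Train only the head, reusing the cached features at every step.
head_w = np.zeros(cached_features.shape[1])
head_b = 0.0


def head_probs(w, b):
    return 1.0 / (1.0 + np.exp(-(cached_features @ w + b)))


def head_loss(w, b):
    probs = np.clip(head_probs(w, b), 1e-12, 1.0 - 1e-12)
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))


initial_loss = head_loss(head_w, head_b)
for _ in range(300):
    error = head_probs(head_w, head_b) - labels
    head_w -= 0.01 * cached_features.T @ error / len(labels)
    head_b -= 0.01 * error.mean()
final_loss = head_loss(head_w, head_b)

print("loss before training the head:", initial_loss)
print("loss after training the head:", final_loss)
```

The backbone is evaluated once instead of once per training step, which is where the speed-up comes from. With a real Keras base, the cached features would come from a single call such as `vgg_net.predict(images)`.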






## Exercise 

Let's build a cat vs dog image classifier with a CNN. You are free to approach it as you want, but we recommend at least some of the following steps:

1. Download the data from this link https://www.kaggle.com/c/dogs-vs-cats/data. If you want to use some other dataset, you are free to do so.

2. Explore the data, visualize some examples and check the size of the images.

3.  Choose an architecture from https://keras.io/api/applications/ to your liking. 

4. Resize the images to the same size. For this you can use OpenCV like this:



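```
import cv2
import os

path = 'path to directory'  # folder containing the images
image_paths = os.listdir(path)

for file in image_paths:
    # OpenCV loads images in BGR order, so convert them to RGB
    image = cv2.imread(os.path.join(path, file))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Resize to the input size expected by the network
    image = cv2.resize(image, (224, 224))

    ...
```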

Then you can save the images to a list or another data structure inside the loop. Some files include their class in the name, so this loop offers a great opportunity to prepare the labels too.



5. If the shape of the data is different from the ImageNet data, you need to change the input shape of the network. For this you can add the parameter 'input\_tensor' when loading the trained model. For example, for an input of 128x128:



```
from tensorflow.keras.layers import Input

vgg_net = VGG16(weights="imagenet", include_top=False,
                input_tensor=Input(shape=(128, 128, 3)))
```

6. Freeze the model and add some new FC and Dropout layers with appropriate sizes.

7. Split the data into 80% train and 20% validation.

8. Train the model on the train data. Use the Adam optimizer and choose a loss appropriate to your problem. For example, for a binary classifier use Binary Cross Entropy.

9. Optional: Unfreeze the model at the end and train it once more using Fine Tuning.

10. Analyze the training time and the performance of the model on the test data. Optional: Share some of your results with the team and discuss the architecture, loss and performance of your model.
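As a starting point for the label preparation and the 80/20 split, here is a minimal sketch in plain Python. It assumes file names of the form `cat.0.jpg` and `dog.0.jpg`; adapt the parsing if your files are named differently. The helper names and the fake file list are made up for the example.

```python
import random


def label_from_filename(filename):
    """Return 0 for cats and 1 for dogs, assuming names like 'cat.0.jpg'."""
    prefix = filename.split(".")[0].lower()
    if prefix == "cat":
        return 0
    if prefix == "dog":
        return 1
    raise ValueError(f"Cannot infer a label from {filename!r}")


def train_val_split(items, val_fraction=0.2, seed=0):
    """Shuffle the items and split them into (train, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Fake file list standing in for os.listdir(path).
filenames = [f"cat.{i}.jpg" for i in range(5)] + [f"dog.{i}.jpg" for i in range(5)]

samples = [(name, label_from_filename(name)) for name in filenames]
train_samples, val_samples = train_val_split(samples)

print(len(train_samples), "train samples,", len(val_samples), "validation samples")
```

Shuffling before splitting matters here because the file list is ordered by class, and fixing the seed keeps the split reproducible between runs.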


We understand that some of these steps might be a bit complicated or unclear, so don't be afraid to ask the team questions in the public channels or to check the references for guidance.



## References


The official Keras tutorial

https://keras.io/guides/transfer_learning/

List and examples of some architectures in Keras 

https://keras.io/api/applications/

A great tutorial that goes deeper into Transfer Learning. We recommend checking it out once you are done with the exercise, and I trust you will.

https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a