# Transfer Learning
## Keras Xception for Stanford Cars Dataset

#### Students
Gizatullin Ramil

Grebenkin Ivan
#### Repository
https://github.com/Lolik111/transfer-learning-tf

## Introduction
Pre-trained neural networks make it possible to solve computer vision tasks without spending a significant amount of time on training. Such networks have a large number of layers, achieve high accuracy, and are trained on large GPU clusters.

Transfer learning makes it possible to apply pre-trained neural networks to problems of a new type, different from those the networks were originally trained for.

### Domain
To explore this domain we decided to work with Keras models. The Keras Applications module includes a set of pre-trained neural networks, among them popular architectures such as [InceptionV3](https://keras.io/applications/#inceptionv3), [InceptionResNetV2](https://keras.io/applications/#inceptionresnetv2), [ResNet50](https://keras.io/applications/#resnet50) and [Xception](https://keras.io/applications/#xception). Xception seemed the most interesting to us; it is known for its use of depthwise separable convolutions.

### Data set
Xception is pre-trained on the ImageNet dataset, which includes 1.2 million images across one thousand classes. For our experiments the [Stanford Cars Dataset](http://ai.stanford.edu/~jkrause/cars/car_dataset.html) was chosen. This dataset contains about 16 thousand images of 196 classes of cars. Each class was split roughly 70-30 into train and test sets. Several examples of class labels:
* AM General Hummer SUV 2000
* Acura RL Sedan 2012

## Related work

### Transfer Learning concept

In our study, we relied on the formal understanding of transfer learning presented in the [Stanford CS231N educational program materials](http://cs231n.github.io/).
Neural networks trained to solve image classification problems consist of two parts:
* The convolutional part, which extracts characteristic features from the image.
* The fully-connected part, which determines what kind of object is in the image based on the features extracted by the convolutional part.

The essence of transfer learning is as follows. During training, the convolutional part of the network learns to distinguish characteristic features in images. If these features are sufficiently general, they can be applied to other classes. Our approach was therefore to use transfer learning: we change the architecture of the Xception model, which is pre-trained on the diverse ImageNet dataset, and tune the weights of the layers below the top. To make the model suitable for classification on the Cars Dataset, it must then be re-trained on that dataset.

### Xception model

To implement transfer learning, we need to replace the classifier in a previously trained neural network. We considered the work of Francois Chollet from Google, Inc., ["Xception: Deep Learning with Depthwise Separable Convolutions"](https://arxiv.org/abs/1610.02357). The motivation for this work was that adding more layers to improve a network is not always possible, because performance limitations do not allow the use of such deep networks. This became the impulse for using depthwise separable convolutions and creating the Xception architecture.

A convolutional layer simultaneously processes both spatial information (correlation of neighboring points within one channel) and inter-channel information, since convolution is applied to all channels at once. The Xception architecture is based on the assumption that these two types of information can be processed in series without losing the quality of the network, and decomposes the conventional convolution into pointwise convolution (which handles only inter-channel correlation) and the spatial convolution (which only processes spatial correlation within a single channel).

Consider a standard convolutional layer with $C_2$ filters, each of size $3 \times 3$. The input tensor has dimensions $M \times M \times C_1$, where $M$ is the width and height of the tensor and $C_1$ is the number of channels.
A standard convolutional layer convolves all channels of the input simultaneously with $C_2$ different filters. The output tensor has dimensions $(M - 2) \times (M - 2) \times C_2$. In the depthwise separable convolution architecture there are instead two consecutive steps.

1. Convolve the original tensor with a $1 \times 1$ convolution, obtaining a new tensor. This operation is called pointwise convolution.
2. Convolve each channel separately with a $3 \times 3$ convolution (in this case the number of channels does not change, since we do not combine all the channels together as in a usual convolutional layer). This operation is called depthwise spatial convolution.

<img src="imgs/xception.png">
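
To make the saving concrete, the following minimal sketch counts the weights of both layer types for the ordering described above (pointwise first, then depthwise). The function names and the example channel counts are illustrative, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution: every filter spans all input channels."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k=3):
    """Weights of a pointwise (1 x 1) convolution followed by a depthwise k x k one."""
    pointwise = c_in * c_out   # 1 x 1 filters mix the channels: C_1 -> C_2
    depthwise = k * k * c_out  # one k x k filter per channel, channels are not mixed
    return pointwise + depthwise


if __name__ == "__main__":
    c1, c2 = 128, 256
    print(standard_conv_params(c1, c2))   # 294912
    print(separable_conv_params(c1, c2))  # 35072
```

For these example sizes the separable variant needs roughly 8 times fewer weights than the standard layer.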

### Xception model fine-tuning

Fine-tuning goes a step further and improves the quality of the pre-trained network on a new task. For this purpose, not only the new classifier is trained, but also some layers of the previously trained neural network. This is especially effective when the new data set is quite different from the original set on which the network was trained.

For comparison, we considered the work of Derrick Liu and Yushi Wang from Stanford University, ["Image Classification of Vehicle Make and Model Using Convolutional Neural Networks and Transfer Learning"](http://cs231n.stanford.edu/reports/2015/pdfs/lediurfinal.pdf).

The authors used the following algorithm:
1. Replace the classifier of the previously trained neural network with a new classifier suitable for our task.
2. "Freeze" the convolutional layers of the previously trained neural network. As a result, these layers are not trained.
3. Train a composite network with a new classifier on a new data set.
4. "Unfreeze" several layers of the convolutional part of the previously trained neural network.
5. Train the network with the unfrozen convolutional layers on the new data set.

The results of their approach are shown in the <b>Results</b> section.
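
The freeze, train, unfreeze procedure listed above could be sketched in Keras roughly as follows. This is an illustration under assumptions, not the code of the cited paper or of this project: the helper names, the classifier layout, and the number of unfrozen layers are hypothetical, and a TensorFlow version that ships `tensorflow.keras` is assumed.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import xception


def build_frozen_model(num_classes=196, weights='imagenet'):
    """Steps 1-2: attach a new classifier and freeze the convolutional base."""
    base = xception.Xception(include_top=False, weights=weights,
                             input_shape=(299, 299, 3))
    for layer in base.layers:
        layer.trainable = False  # frozen layers keep their pre-trained weights
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(1024, activation='relu')(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inputs=base.input, outputs=outputs), base


def unfreeze_top_layers(base, n_layers):
    """Step 4: make the last n_layers of the convolutional base trainable."""
    for layer in base.layers[-n_layers:]:
        layer.trainable = True


# Usage sketch (train_data is a hypothetical dataset of image/label batches):
# model, base = build_frozen_model()
# model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit(train_data)                    # step 3: train only the new classifier
# unfreeze_top_layers(base, n_layers=20)   # step 4: 20 is an arbitrary example
# model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit(train_data)                    # step 5: fine-tune the unfrozen layers
```

The model is compiled again after the `trainable` flags change, because Keras only picks up such changes at compile time.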

## Implemented Model

### Keras + Tensorflow

Keras and TensorFlow were chosen for the implementation. Keras provides a high-level neural network API, written in Python and capable of running on top of TensorFlow. It lets us use the pre-trained Xception model with the ImageNet weights and easily add or remove layers, change the learning rate, and so on. At the same time, TensorFlow provides an effective way to build a parallel input pipeline, so that the CPU handles training-time preprocessing and augmentation while the GPU trains the neural network.  
<img src="imgs/pipelining.png">
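
A parallel input pipeline of this kind can be sketched with `tf.data`. The example below is only an illustration: it assumes TensorFlow 2.4 or newer (for `tf.data.AUTOTUNE`), reads from in-memory tensors instead of the TFRecord shards used in the project, and the names `preprocess` and `make_dataset` as well as the scaling to [-1, 1] are assumptions.

```python
import tensorflow as tf


def preprocess(image, label):
    """CPU-side, per-example work: scale pixel values to [-1, 1]."""
    image = tf.cast(image, tf.float32) / 127.5 - 1.0
    return image, label


def make_dataset(images, labels, batch_size=32):
    """Build a pipeline whose preprocessing overlaps with training."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.shuffle(buffer_size=1024)
    # Run preprocessing on several CPU threads in parallel.
    ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size)
    # Prepare the next batches while the accelerator works on the current one.
    return ds.prefetch(tf.data.AUTOTUNE)
```

The `map` call spreads preprocessing over CPU threads, and `prefetch` keeps batches ready so the GPU does not wait for input.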

### Xception Keras model to TF estimator

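At the first stage, we load the previously trained Xception network without the classifier and "freeze" the convolutional layers in it:
```python
from tensorflow.keras.applications import xception
base_model = xception.Xception(include_top=False, weights='imagenet', input_shape=(299, 299, 3))
```

Then we create a composite network with Xception and a new classifier:
```python
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
x = GlobalAveragePooling2D()(base_model.output)  # pool the convolutional feature maps
x = Dense(1024, activation='relu')(x)
predictions = Dense(196, activation='softmax')(x)  # one output per car class
model = Model(inputs=base_model.input, outputs=predictions)
```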

The last initialisation step is to wrap the Keras Xception model in a TensorFlow estimator:

```python
# Keep at most 5 checkpoints and write summaries every 10 steps.
run_config = tf.estimator.RunConfig()
run_config = run_config.replace(keep_checkpoint_max=5, save_summary_steps=10)
# `auc` is a metric function defined elsewhere (not shown here).
model.compile(optimizer=optimizers.Adam(lr=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy', metrics.top_k_categorical_accuracy,
                       metrics.mean_absolute_error, auc])
est = tf.keras.estimator.model_to_estimator(model, model_dir='x_input', config=run_config)
```

We started with a learning rate of `lr=0.001` and then conducted a series of training runs with progressively lower learning rates. The best result was obtained with `lr=0.0001`. The learning rate is a powerful tool: it lets us regulate the degree of re-training without plunging into the complexity of choosing particular layers.

Moreover, we cannot simply pick some layers by hand for tuning, if only because all of them make a substantial contribution. The reason for this is the residual nature of Xception.

### Preprocessing

In this work, preprocessing can be divided into two parts: preliminary preprocessing and training-time preprocessing.

Preliminary data preparation included cropping the pictures to their bounding boxes, filtering noise, and grouping them into TFRecords shards of approximately 110 MB each. In addition, because the dataset contains black-and-white pictures, these had to be converted into three-channel pictures.
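
The cropping and channel conversion can be illustrated with a small NumPy sketch. The function name is hypothetical, and the bounding box format `(x1, y1, x2, y2)` with an exclusive lower-right corner is an assumption made for the example, not necessarily the convention used in the project.

```python
import numpy as np


def crop_and_to_rgb(image, bbox):
    """Crop an image to bbox=(x1, y1, x2, y2) and make sure it has 3 channels."""
    x1, y1, x2, y2 = bbox
    cropped = image[y1:y2, x1:x2]  # rows are indexed by y, columns by x
    if cropped.ndim == 2:
        # Grayscale picture: repeat the single channel three times.
        cropped = np.stack([cropped] * 3, axis=-1)
    return cropped
```

For example, a 10 x 10 grayscale array cropped with `bbox=(2, 3, 7, 9)` yields an array of shape `(6, 5, 3)`.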

Training-time preprocessing included parallel reading, fetching, and data augmentation, a powerful and common tool for increasing dataset size and model generalizability. Essentially, data augmentation is the process of artificially increasing the size of a dataset via transformations. We tried different approaches, including random crops, shifts, shears, and rotations, but in the end limited ourselves to random horizontal flips and the original images because of the specifics of the dataset (the images and the cars in them vary in size and shape).

A helper class was used for all of these purposes:
```python 
class ImageCoder(object):
    """Helper class that provides TensorFlow image coding utilities."""

    def __init__(self):
        self._sess = tf.Session()
        
    ...    
```
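
The random horizontal flip used for augmentation can be illustrated in plain NumPy. This is a sketch with a hypothetical function name, not the project's code; in a TensorFlow pipeline the same effect is available through `tf.image.random_flip_left_right`.

```python
import numpy as np


def random_horizontal_flip(image, rng, p=0.5):
    """Mirror an H x W x C image left-to-right with probability p."""
    if rng.random() < p:
        return image[:, ::-1, :]  # reverse the width axis
    return image


# Usage sketch:
# rng = np.random.default_rng(seed=0)
# augmented = random_horizontal_flip(image, rng)
```

A mirrored car is still a valid example of the same class, which is why this transformation is safe for the dataset.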

## Experiments and Evaluation

All training was done in the [Google Colaboratory research project](https://colab.research.google.com). It is a Jupyter notebook environment that requires no setup, runs entirely in the cloud, and is equipped with a powerful Tesla K80 GPU.

The change in loss during training is very revealing:
<img src="imgs/loss.jpg">

A demonstration example (available in [Final Submission.ipynb](/Final Submission.ipynb)):

<img src="imgs/demo.png">


### Results

|Source|Algorithm|Top-1 accuracy|Top-5 accuracy|Final loss|AUC|
| :-: | :-: | :-: | :-: | :-: | :-: |
|Paper|GoogLeNet|0.774|0.943|1.25|NA|
|Paper|VGGNet|0.789|0.942|1.01|NA|
|Our implementation|Xception|0.893|0.980|0.45|0.982|

Our results are better than those reported in the paper considered, bearing in mind that we used only fairly simple augmentation operations.
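
The top-1 and top-5 accuracies in the table count a prediction as correct when the true class is among the 1 or 5 highest-scoring classes. A minimal NumPy sketch of this metric, with a hypothetical function name and toy data, looks like this:

```python
import numpy as np


def top_k_accuracy(probs, labels, k=5):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()


if __name__ == "__main__":
    probs = np.array([[0.1, 0.2, 0.7],
                      [0.5, 0.3, 0.2],
                      [0.2, 0.5, 0.3]])
    labels = np.array([2, 1, 0])
    print(top_k_accuracy(probs, labels, k=1))  # 1 of 3 rows is correct
    print(top_k_accuracy(probs, labels, k=2))  # 2 of 3 rows are correct
```

In the project itself these values come from the `'accuracy'` and `metrics.top_k_categorical_accuracy` metrics passed to `model.compile`.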

## Analysis and Observations

For comparison, consider two really deep convolutional network architectures, ResNet50 and InceptionResNetV2.

ResNet50 has ~25 million weights, and the pre-trained model in Keras takes ~100 MB. The accuracy achieved by this model on the ImageNet dataset is ~77-89% (89% in our case).

InceptionResNetV2 has ~55 million trainable parameters and takes ~200 MB, reaching an accuracy of 91.5%.

The Xception network has ~22 million weights and takes ~90 MB. At the same time, its classification accuracy on ImageNet is 89%.

Thus, we get a network architecture that surpasses ResNet50 in accuracy and is only slightly inferior to InceptionResNetV2, while being significantly smaller and therefore requiring fewer resources for both training and use.
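
The parameter counts quoted above can be checked directly in Keras. The sketch below is illustrative (the helper name is hypothetical) and assumes a TensorFlow installation that provides `tensorflow.keras.applications`.

```python
from tensorflow.keras import applications


def count_parameters(constructors):
    """Return {name: number of parameters} for Keras application constructors."""
    # weights=None builds the architecture without downloading pre-trained weights;
    # the parameter count does not depend on the weight values.
    return {name: build(weights=None).count_params()
            for name, build in constructors.items()}


# Usage sketch:
# count_parameters({
#     'ResNet50': applications.ResNet50,
#     'InceptionResNetV2': applications.InceptionResNetV2,
#     'Xception': applications.Xception,
# })
```

Building all three models takes a while, so the usage is left as a comment.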

Of course, additional experiments are required in order to compare these models. This may be interesting for further work.