# Transfer Learning
## Keras Xception for Stanford Cars Dataset

#### Students
Gizatullin Ramil

Grebenkin Ivan
#### Repository
https://github.com/Lolik111/transfer-learning-tf

## Introduction
Pre-trained neural networks make it possible to solve computer vision tasks without spending a significant amount of time training a network. Such networks have a large number of layers, achieve high accuracy, and are trained on large computing clusters with GPUs.

Transfer learning makes it possible to use pre-trained neural networks for new kinds of problems, other than those the networks were originally trained for.

### Domain
To explore this domain, we decided to work with Keras models. The Keras Applications module includes a set of pre-trained neural networks, among them such popular architectures as [InceptionV3](https://keras.io/applications/#inceptionv3), [InceptionResNetV2](https://keras.io/applications/#inceptionresnetv2), [ResNet50](https://keras.io/applications/#resnet50) and [Xception](https://keras.io/applications/#xception). Xception seemed the most interesting to us; it is known for its use of depthwise separable convolutions.

### Data set
The Xception network is pre-trained on the large ImageNet dataset, which includes about 14 million images in about 21 thousand classes. For our experiments, the [Stanford Cars Dataset](http://ai.stanford.edu/~jkrause/cars/car_dataset.html) was chosen. It contains about 16 thousand images of 196 classes of cars. The data is split into about 8 thousand training images and approximately the same number of testing images. Several examples of labels:
* AM General Hummer SUV 2000
* Acura RL Sedan 2012

## Related work

### Transfer Learning concept
In our study, we relied on the formal understanding of transfer learning gathered from the [Stanford CS231N educational program materials](http://cs231n.github.io/).
Neural networks trained to solve image classification problems consist of two parts:
* The convolutional part, which extracts characteristic features from the image.
* The fully-connected part, which determines what kind of object is in the image based on the features extracted by the convolutional part.

The essence of transfer learning is as follows. During training, the convolutional part of the network learns to distinguish characteristic features in images. If these features are sufficiently general, they can be applied to other classes. Thus, our goal was to use the transfer learning approach. The concept is to modify the architecture of the Xception model, which is pre-trained on the diverse ImageNet dataset. To do that, we tune the weights of the layers below the top. Then, to make the model suitable for classification on the Cars Dataset, it must be re-trained on that dataset.
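The idea can be sketched in Keras roughly as follows. This is a minimal illustration and not necessarily the code used in the repository: the function name `build_transfer_model`, the 299x299 input size, the global average pooling, and the decision to freeze the whole convolutional base are all assumptions made for the example.

```python
import tensorflow as tf

NUM_CLASSES = 196  # number of classes in the Stanford Cars Dataset


def build_transfer_model(num_classes=NUM_CLASSES, weights="imagenet"):
    """Xception base without its ImageNet classifier, plus a new softmax head.

    Hypothetical helper for illustration; pass weights=None to skip
    downloading the pre-trained ImageNet weights.
    """
    # Convolutional part only (include_top=False drops the original classifier).
    base = tf.keras.applications.Xception(
        include_top=False,
        weights=weights,
        input_shape=(299, 299, 3),
        pooling="avg",  # global average pooling -> one feature vector per image
    )
    # Assumption: keep the pre-trained features fixed. Setting this to True
    # (for all or some layers) would fine-tune the base instead.
    base.trainable = False

    inputs = tf.keras.Input(shape=(299, 299, 3))
    features = base(inputs, training=False)
    # New fully-connected classifier for the car classes.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    return tf.keras.Model(inputs, outputs)
```

With the base frozen, only the weights of the new `Dense` layer are updated when the model is compiled and fitted on the car images.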

### Xception model

To implement transfer learning, we need to replace the classifier in a previously trained neural network. We considered the work of François Chollet (Google, Inc.), "Xception: Deep Learning with Depthwise Separable Convolutions". The motivation for that work was that improving a network by adding more layers is not always possible, because performance limitations do not allow the use of such deep networks. This was the impetus for using depthwise separable convolutions and creating the Xception architecture.

A convolutional layer simultaneously processes both spatial information (correlation of neighboring points within one channel) and inter-channel information, since convolution is applied to all channels at once. The Xception architecture is based on the assumption that these two types of information can be processed in series without losing the quality of the network, and decomposes the conventional convolution into pointwise convolution (which handles only inter-channel correlation) and the spatial convolution (which only processes spatial correlation within a single channel).
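One consequence of this decomposition is a much smaller number of weights per layer. The sketch below compares the two variants for a kernel of size `k`; the function names and the example channel counts (128 and 256) are hypothetical, and biases are ignored.

```python
def standard_conv_weights(c_in, c_out, k=3):
    """Weights of a regular convolution: each of the c_out filters spans
    a k x k window over all c_in input channels."""
    return k * k * c_in * c_out


def separable_conv_weights(c_in, c_out, k=3):
    """Weights of a pointwise (1 x 1) convolution followed by a per-channel
    k x k spatial convolution on the c_out resulting channels."""
    pointwise = c_in * c_out   # mixes channels only
    spatial = k * k * c_out    # one k x k kernel per channel
    return pointwise + spatial


print(standard_conv_weights(128, 256))   # 294912
print(separable_conv_weights(128, 256))  # 35072
```

For these example sizes the separable variant needs roughly 8 times fewer weights than the regular convolution.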

Consider a standard convolutional layer with $C_2$ filters, each of size $3\times3$. The input tensor has dimensions $M \times M \times C_1$, where $M$ is the width and height of the tensor and $C_1$ is the number of channels.
The standard convolutional layer convolves all channels of the input simultaneously with $C_2$ different filters. The output tensor has dimensions $(M - 2) \times (M - 2) \times C_2$. In the depthwise separable convolution architecture, by contrast, there are two consecutive steps:

1. Convolve the original tensor with a $1 \times 1$ convolution, obtaining a new tensor. This operation is called pointwise convolution.
2. Convolve each channel separately with a $3 \times 3$ convolution (in this case, the number of channels does not change, since we do not combine all the channels together, as in a usual convolutional layer). This operation is called depthwise spatial convolution.
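The two steps can be illustrated with a small NumPy sketch. It is only meant to show the tensor shapes and is not how Keras implements the layer; the helper names are hypothetical, no padding is used, and (as is usual in deep learning) the "convolution" is computed as a cross-correlation.

```python
import numpy as np


def pointwise_conv(x, w):
    """1 x 1 convolution: x has shape (M, M, C1), w has shape (C1, C2).
    Every pixel's channel vector is multiplied by w, giving (M, M, C2)."""
    return x @ w


def depthwise_conv3x3(x, k):
    """3 x 3 convolution applied to each channel separately.
    x has shape (H, W, C), k has shape (3, 3, C); output is (H-2, W-2, C)."""
    h, w_, c = x.shape
    out = np.zeros((h - 2, w_ - 2, c))
    for i in range(3):
        for j in range(3):
            # Shifted window of the input, weighted per channel by k[i, j].
            out += x[i:i + h - 2, j:j + w_ - 2, :] * k[i, j, :]
    return out


M, C1, C2 = 8, 3, 4                    # example sizes, chosen arbitrarily
x = np.random.rand(M, M, C1)
y = pointwise_conv(x, np.random.rand(C1, C2))
z = depthwise_conv3x3(y, np.random.rand(3, 3, C2))
print(y.shape, z.shape)                # (8, 8, 4) (6, 6, 4)
```

The final shape $(M - 2) \times (M - 2) \times C_2$ matches the output of the standard convolutional layer described above.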

<img src="imgs/xception.png">

## Implemented Model

### Preprocessing


### Modeling (Training)


## Experiments and Evaluation


|Epoch N   | Timestamp | Loss     |
|----------|-----------|----------|
|epoch 10  | 17:26:44  | 6.850505 |
|epoch 20  | 17:35:02  | 5.449171 |
|epoch 30  | 17:43:44  | 5.444791 |
|epoch 40  | 17:51:56  | 5.442232 |
|epoch 50  | 17:59:52  | 5.440430 |
|epoch 60  | 18:08:01  | 5.438744 |
|epoch 70  | 18:16:52  | 5.437168 |
|epoch 80  | 18:25:35  | 5.435552 |
|epoch 90  | 18:35:08  | 5.433686 |
|epoch 100 | 18:43:33  | 5.432810 |


### Results

|Tuning rate | 10     | 12.5   | 9.2    | 15     | 16     |
|------------|--------|--------|--------|--------|--------|
|Precision   | 0.5194 | 0.6061 | 0.5000 | 0.6154 | 0.5750 |
|Recall      | 0.3350 | 0.3000 | 0.3550 | 0.2000 | 0.1150 |
|Accuracy    | 0.51   | 0.55   | 0.50   | 0.54   | 0.52   |


|Tuning rate | 10     | 12.5   | 9.2    |
|------------|--------|--------|--------|
|Precision   | 0.74   | 0.75   | 0.701  |
|Recall      | 0.69   | 0.75   | 0.725  |
|Accuracy    | 0.63   | 0.70   | 0.73   |

## Analysis and Observations


## Notes
Of course, the pre-trained networks are not trained on the entire ImageNet set, but on a subset of it covering 1000 object classes.
