This is my second Item Recognition experimental project.
Previously, I trained a Siamese network with triplet loss to identify images of food products (GitHub repo: TripletLoss_ItemRecognition). I built that project assuming the model would be deployed in a wholesale company's food product stock management web app, so that staff could look up or update stock information by taking a photo of a product instead of typing a lengthy product code in an environment where they handle a large variety of products. However, it didn't achieve satisfying performance. This time I try a technique that uses a combination of softmax loss and center loss as the loss function. This technique was originally proposed to improve face recognition tasks.
The data set is exactly the same as in the previous triplet loss project: around 15,000 images of food and drink products sold in Australia and Japan, collected from the internet. After selecting and splitting the data, I ended up with 12,070 images of 706 items as training data and 1,447 images of 89 items as test data.
In most cases, each item (class) includes a variety of looks. I sorted images with different package designs or shapes into the same product as long as they share the same product name and flavour. Bottles, cans, bags, and boxes can all belong to the same class for a given item. Sometimes it is quite confusing even to the human eye.
Example 1) MATSO'S GINGER BEER
Example 2) SMITHS Thinly Cut Sour Cream & Onion
Example 3) POM Juice
According to *A Discriminative Feature Learning Approach for Deep Face Recognition*, the total loss is the softmax loss combined with the center loss:

$$\mathcal{L} = \mathcal{L}_S + \lambda \mathcal{L}_C = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_i + b_j}} + \frac{\lambda}{2} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2$$

where

- $x_i$: the deep feature of the $i$-th sample, which belongs to class $y_i$
- $c_{y_i}$: the $y_i$-th class center of deep features
- $\lambda$: a parameter used for balancing the two loss functions

The deep feature is the output of the second-last layer, which is a fully connected layer. The distance between each sample point in the deep feature space and its corresponding class center is calculated, and the center loss is derived from these distances. The centers are recalculated at every iteration as the deep features are updated.

$\lambda$ balances the two loss functions; $\lambda = 0$ is equivalent to the normal softmax loss.
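As a rough illustration, the combined loss and the per-iteration center update from the paper can be sketched in NumPy as follows. This is a sketch of the formulas above, not the code used in this project; the function names and default values are my own.

```python
import numpy as np

def softmax_center_loss(logits, features, labels, centers, lam=0.001):
    """Total loss L = L_softmax + lam * L_center for one mini-batch.

    logits:   (N, n_classes) raw class scores
    features: (N, d) deep features (output of the second-last layer)
    labels:   (N,) integer class labels
    centers:  (n_classes, d) current class centers
    """
    # Numerically stable softmax cross-entropy
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    softmax_loss = -log_probs[np.arange(len(labels)), labels].mean()

    # Center loss: half the squared distance to each sample's class center
    diff = features - centers[labels]
    center_loss = 0.5 * (diff ** 2).sum(axis=1).mean()
    return softmax_loss + lam * center_loss

def update_centers(features, labels, centers, alpha=0.5):
    """Per-iteration center update from the paper:
    c_j <- c_j - alpha * sum_i (c_j - x_i) / (1 + n_j), summed over
    the mini-batch samples of class j."""
    new_centers = centers.copy()
    for j in np.unique(labels):
        mask = labels == j
        delta = (centers[j] - features[mask]).sum(axis=0) / (1 + mask.sum())
        new_centers[j] = centers[j] - alpha * delta
    return new_centers
```

Note that only the centers of classes present in the mini-batch move, which keeps the update cheap compared to recomputing all centers over the whole training set.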
- Paper: A Discriminative Feature Learning Approach for Deep Face Recognition
- Paper: Understanding Center Loss Based Network for Image Retrieval with Few Training Data
- GitHub: handongfeng/MNIST-center-loss
I implemented the center loss layer and the related parts by referring to this GitHub, which helped me a lot!
The base model is a pretrained InceptionV3, followed by one global average pooling layer, one dropout layer, and two dense layers. The second dense layer connects to two output layers: one is the softmax layer and the other is the center loss layer.

I used fine tuning to train the pretrained InceptionV3. First I froze the base model and trained for 20 epochs, then fine-tuned the whole network over the following 30 epochs. The optimizer is SGD.
First I played around with some values of lambda while keeping the pretrained model frozen. I found that:

- lambda = 0.01: Most of the points in the deep feature space became so tightly packed that they looked like a single point in the plot by the end of the first epoch. It didn't seem to work well in this situation.
- lambda = 0.0007 to 0.003: Better than 0.01, but after around the 10th epoch the center loss started to increase and kept increasing, whereas it had been decreasing reasonably until that point. The softmax loss, on the other hand, kept decreasing throughout training.

I finally chose lambda = 0.001.
In fine tuning, training proceeded more slowly. I tried lambda = 0.01 and lambda = 0.001, and my final choice was the latter. At first lambda = 0.01 seemed more reasonable because its center loss decreased along with the softmax loss, whereas the center loss kept increasing with lambda = 0.001. However, the model trained with lambda = 0.001 actually produced a better result in evaluation.
I plotted the transition of loss values below:
Below is an animation of how the plot of deep features changed over epochs (in the fine tuning phase). I selected 10 random classes from the training data to monitor and plotted them during training. The points spread out into a larger space and form distinctive groups over time.
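When the deep features have more than two dimensions, one simple way to produce such a 2-D plot is a PCA projection. This is a generic sketch (not necessarily the projection used for the animation above):

```python
import numpy as np

def project_2d(features):
    """Project d-dimensional deep features onto their first two principal
    components (PCA via SVD) so they can be scatter-plotted."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

The 2-D coordinates returned for each monitored class can then be drawn per epoch (e.g. with matplotlib) and stitched into an animation.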
The test data set used here contains 89 classes (items), none of which was used in training (i.e. the model was never trained to classify these items).
I randomly selected 10 classes from the test data, fed the images in those classes to the trained model (center loss lambda = 0.001), and extracted their deep features. The plot of the deep features and the corresponding sample images are shown below.

In the plot, the classes are not clearly separated, but some sort of categorisation is visible.
As a metric, I used the CMC (Cumulative Match Characteristic) curve and evaluated the model on top-1 to top-10 ranking.

For evaluation, I selected one representative image per class, giving 89 representative item images. Then I randomly selected one image from each class, which I call the target. For each target, I calculated the distances between it and each representative image in the deep feature space and picked the top-K closest representative images. I counted the cases where the representative image of the target's own class was included in the top K. The identification rate shown in the following plot is the percentage of such cases out of all 89 items.
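The evaluation procedure just described can be sketched as follows (the function name and array layout are my own):

```python
import numpy as np

def cmc(rep_features, rep_labels, target_features, target_labels, max_rank=10):
    """Cumulative Match Characteristic.

    For each target, rank all representative images by Euclidean distance in
    deep-feature space; a target scores a hit at rank K if its own class's
    representative appears among the K closest. Returns the identification
    rate (fraction of hits) for ranks 1..max_rank."""
    rep_labels = np.asarray(rep_labels)
    hits = np.zeros(max_rank)
    for feat, label in zip(target_features, target_labels):
        dists = np.linalg.norm(rep_features - feat, axis=1)
        ranked = rep_labels[np.argsort(dists)]
        rank = int(np.where(ranked == label)[0][0])  # 0-based rank of correct class
        if rank < max_rank:
            hits[rank:] += 1  # a hit at rank r also counts for every larger rank
    return hits / len(target_features)
```

By construction the curve is non-decreasing in K, which is why CMC plots always rise toward 100% as the rank grows.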
I computed CMC curves for two models (center loss lambda = 0.01 and 0.001). As the plot above shows, the model with lambda = 0.001 performed better, as mentioned earlier.
To compare this model with the previous triplet loss model, I also calculated the CMC curve for the triplet loss model I trained before, using the same test data set. In this experiment, the triplet loss model outperformed the center loss model.
Unfortunately, I couldn't improve the performance of my item recognition model by employing the center loss method. However, plotting the deep features and evaluating the CMC curve through this experiment gave me a lot of insight into image classification with deep learning.
For further improvement:

- The model would likely improve with more training data; face recognition studies generally use millions of images.
- This task may be too challenging for the model because of the wide variety of examples within the same class, as mentioned in the "Training and Test Data" section. If I refined the data to have less variety (e.g. using only images with the same package design for a class, or only images of bottles for a drink), the model might achieve better performance with both center loss and triplet loss.
Notebooks:
- `itemrecognition_centerloss_finetuning2.ipynb`:
  - In untrainable transfer learning, center loss lambda = 0.001
  - In fine tuning, center loss lambda = 0.01
- `itemrecognition_centerloss_finetuning3.ipynb`:
  - In untrainable transfer learning, center loss lambda = 0.001
  - In fine tuning, center loss lambda = 0.001
- `loss_plots.ipynb`: Plotting loss values from training histories
- `itemrecognition_centerloss_evaluate2.ipynb`: Evaluation of `itemrecognition_centerloss_finetuning2.ipynb`
- `itemrecognition_centerloss_evaluate3.ipynb`: Evaluation of `itemrecognition_centerloss_finetuning3.ipynb`
The data set and model files (`*.h5`) are not included in this repository because they are too large.