# Rakuten France Multimodal Product Data Classification

## Context

This challenge addresses large-scale multimodal classification of product type codes from text and image data. The aim is to predict the type code of each product in the Rakuten France catalog.

Categorizing product listings by title and image is a core task for any e-commerce marketplace, with applications such as personalized search, recommendations, and query understanding. Manual and rule-based approaches to categorization do not scale because there are so many classes of commercial products. Multimodal approaches are useful for e-commerce companies, which struggle to categorize products from merchant-supplied images and labels, especially when, as at Rakuten, the catalog mixes new and used products from professional and non-professional merchants. However, the lack of real data from actual commercial catalogs has limited progress in this area of research. The challenge raises several interesting research questions owing to the noisy product labels and images, the large size of modern e-commerce catalogs, and the typically imbalanced class distribution.

## Problem Description

The goal of this data challenge is large-scale multimodal (text and image) product data classification into product type codes.

To provide an example, consider a product in the Rakuten France catalog with a French name or title "Klarstein Présentoir 2 Montres Optique Fibre" and associated image, and possibly an additional description. This product is classified under the 1500 product type code. There are other products with varying titles, images, and descriptions that fall under the same product type code. This challenge aims to develop a classifier that can accurately categorize products into their corresponding product type code based on information such as the example given above.

## Metric

The metric used in this challenge to rank the participants is the *weighted-F1 score*.

The scikit-learn package provides an F1 score implementation that can be used for this challenge with its average parameter set to "weighted".
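To make the metric concrete, here is a minimal pure-Python sketch of the weighted-F1 computation: the F1 score of each class is weighted by that class's share of the true labels. The function name `weighted_f1` and the toy labels are illustrative assumptions, not part of the challenge material; scikit-learn's `f1_score(y_true, y_pred, average="weighted")` is expected to give the same value.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total  # weight by class frequency
    return score


# Toy labels (illustrative only): one product of class 10 is mispredicted as 2280.
y_true = [10, 10, 2280, 1500]
y_pred = [10, 2280, 2280, 1500]
score = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, errors on frequent product type codes cost more than errors on rare ones.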

## Data Description

For this challenge, Rakuten France is releasing approximately 99K product listings in CSV format, split into a training set (84,916) and a test set (13,812). The dataset consists of product designations, product descriptions, product images and their corresponding product type codes.

The data are divided along two criteria, training or test and input or output, which gives four sets. Three of them are provided as files; the test output is what participants must predict.

* X_train.csv: training input file
* Y_train.csv: training output file
* X_test.csv: test input file

Additionally, an images.zip file is supplied containing all the images. Uncompressing this file produces a folder named images with two subfolders, image_training and image_test, containing the training and test images respectively.

The first line of the input files contains the header, and the columns are separated by commas (","). The columns are:

* An integer ID for the product. This ID is used to associate the product with its corresponding product type code.
* **designation** - The product title, a short text summarizing the product.
* **description** - A more detailed text describing the product. Not all merchants use this field, so, to keep the data as originally supplied, **the description field can contain NaN values for many products**.
* **productid** - A unique ID for the product.
* **imageid** - A unique ID for the image associated with the product.

The fields **imageid** and **productid** are used to retrieve the images from the respective image folder. For a particular product the image file name is image_imageid_product_productid.jpg.

Here is an example of an input file:

,designation,description,productid,imageid
0,Olivia: Personalisiertes Notizbuch 150 Seiten Punktraster Ca Din A5 Rosen-Design,,3804725264,1263597046
1,Journal Des Arts (Le) N° 133 Du 28/09/2001 - L'art Et Son Marche Salon D'art Asiatique A Paris - Jacques Barrere - Francois Perrier - La Reforme Des Ventes Aux Encheres Publiques - Le Sna Fete Ses Cent Ans.,,436067568,1008141237

For the first product the corresponding image file name is image_1263597046_product_3804725264.jpg, and for the second product it is image_1008141237_product_436067568.jpg. Recall that all the images for the training products listed in X_train.csv are in the image_training subfolder, and all the images for the test products listed in X_test.csv are in the image_test subfolder.
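The file layout described above can be sketched with the standard library alone. This is a minimal illustration, assuming the archive was extracted to a folder named images; the helper names `image_filename` and `read_products` are hypothetical, and the sample holds only the first example row.

```python
import csv
import io
from pathlib import PurePosixPath

# First row of the example input file shown above.
SAMPLE = (
    ",designation,description,productid,imageid\n"
    "0,Olivia: Personalisiertes Notizbuch 150 Seiten Punktraster Ca Din A5 "
    "Rosen-Design,,3804725264,1263597046\n"
)


def image_filename(imageid, productid):
    """Build the image file name from the two IDs."""
    return f"image_{imageid}_product_{productid}.jpg"


def read_products(handle, split):
    """Parse an input file and attach the expected image path to each row."""
    folder = "image_training" if split == "train" else "image_test"
    products = []
    for row in csv.DictReader(handle):
        name = image_filename(row["imageid"], row["productid"])
        products.append({
            "id": int(row[""]),  # the integer ID column has an empty header
            "designation": row["designation"],
            "description": row["description"] or None,  # empty field -> missing
            "image_path": str(PurePosixPath("images", folder, name)),
        })
    return products


products = read_products(io.StringIO(SAMPLE), split="train")
```

For the real files, the same function can be given a handle opened with `open("X_train.csv", newline="", encoding="utf-8")`; the encoding is an assumption and may need adjusting.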

The training output file (Y_train.csv) contains the prdtypecode, the category for the classification task, for each integer id in the training input file (X_train.csv). Here also the first line of the file is the header and columns are separated by commas.

Here is an example of the output file:

,prdtypecode
0,10
1,2280
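Product type codes such as 10 and 2280 are not contiguous, so a classifier with one output unit per class usually needs them mapped to indices 0..K-1 and back. The following is a minimal sketch of reading the output file and building that mapping; the helper names `read_labels` and `build_label_index` are hypothetical.

```python
import csv
import io

# The example output file shown above.
Y_SAMPLE = ",prdtypecode\n0,10\n1,2280\n"


def read_labels(handle):
    """Map each integer id to its product type code."""
    return {int(row[""]): int(row["prdtypecode"]) for row in csv.DictReader(handle)}


def build_label_index(labels):
    """Assign a contiguous class index to every distinct type code."""
    codes = sorted(set(labels.values()))
    return {code: index for index, code in enumerate(codes)}


labels = read_labels(io.StringIO(Y_SAMPLE))
code_to_index = build_label_index(labels)
# Inverse mapping, needed to turn predicted indices back into type codes.
index_to_code = {index: code for code, index in code_to_index.items()}
```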

For the test input file X_test.csv, participants need to provide a test output file in the same format as the training output file (associating each integer id with the predicted prdtypecode). The first line of this test output file should contain the header ,prdtypecode.
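A minimal sketch of writing predictions in this format with Python's csv module is shown here. The function name `write_submission` is hypothetical, and the ids and codes are toy values copied from the example above rather than real predictions.

```python
import csv
import io


def write_submission(handle, ids, predicted_codes):
    """Write the header ',prdtypecode' followed by one 'id,code' row per product."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["", "prdtypecode"])  # the id column has an empty header
    for row_id, code in zip(ids, predicted_codes):
        writer.writerow([row_id, code])


# Toy values only; real ids come from X_test.csv.
buffer = io.StringIO()
write_submission(buffer, ids=[0, 1], predicted_codes=[10, 2280])
submission_text = buffer.getvalue()
```

To write an actual file, pass a handle opened with `open(path, "w", newline="")` instead of the in-memory buffer.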

## Benchmark Model

The benchmark algorithm uses two separate models for the images and the text. This gives participants an idea of the performance achievable when each source of information is used separately. Participants are encouraged to use both sources when designing a classifier, since they contain complementary information.

For the image data, a version of the Residual Network (ResNet) model (**reference**) is used, with the ResNet50 implementation from Keras as the base model. The details of the basic benchmark model can be found **in this notebook**. The model is a ResNet50 pre-trained on the ImageNet dataset. The top 27 layers, which include 8 convolutional layers, are unfrozen for training. The final network contains 12,144,667 trainable and 23,643,035 non-trainable parameters.

For the text data, a simplified CNN classifier is used. Only the designation fields (product titles) are used in this benchmark model. The input size is the maximum possible designation length, 34 in this case. Shorter inputs are zero-padded. The architecture consists of an embedding layer and 6 convolution and max-pooling blocks. The embeddings are trained with the entire architecture. The model architecture is as follows:

| Layer (type) | Output Shape | Number of Params | Connected to |
| --- | --- | --- | --- |
| InputLayer | (None, 34) | 0 | |
| Embedding Layer | (None, 34, 300) | 17320500 | InputLayer |
| Reshape | (None, 34, 300, 1) | 0 | Embedding Layer |
| Conv2D Block 1 | (None, 34, 1, 512) | 154112 | Reshape |
| MaxPooling2D Block 1 | (None, 1, 1, 512) | 0 | Conv2D Block 1 |
| Conv2D Block 2 | (None, 33, 1, 512) | 307712 | Reshape |
| MaxPooling2D Block 2 | (None, 1, 1, 512) | 0 | Conv2D Block 2 |
| Conv2D Block 3 | (None, 32, 1, 512) | 461312 | Reshape |
| MaxPooling2D Block 3 | (None, 1, 1, 512) | 0 | Conv2D Block 3 |
| Conv2D Block 4 | (None, 31, 1, 512) | 614912 | Reshape |
| MaxPooling2D Block 4 | (None, 1, 1, 512) | 0 | Conv2D Block 4 |
| Conv2D Block 5 | (None, 30, 1, 512) | 768512 | Reshape |
| MaxPooling2D Block 5 | (None, 1, 1, 512) | 0 | Conv2D Block 5 |
| Conv2D Block 6 | (None, 29, 1, 512) | 922112 | Reshape |
| MaxPooling2D Block 6 | (None, 1, 1, 512) | 0 | Conv2D Block 6 |
| Concatenate | (None, 6, 1, 512) | 0 | All MaxPooling2D Blocks |
| Flatten | (None, 3072) | 0 | Concatenate |
| Dropout Layer | (None, 3072) | 0 | Flatten |
| Dense Layer | (None, 27) | 82971 | Dropout Layer |

This architecture contains a total of 20,632,143 trainable parameters.
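The parameter counts and output shapes of the text model can be checked with a little arithmetic. The sketch below assumes, as the listed counts suggest, that convolution block k uses a k × 300 kernel with bias, unit stride, and no padding; these kernel settings are inferred rather than stated in the benchmark description. Under that assumption the dense layer works out to 3072 × 27 + 27 = 82,971 parameters, and the layers sum exactly to the stated total.

```python
SEQ_LEN = 34                   # maximum designation length
EMBED_DIM = 300                # embedding dimension
FILTERS = 512                  # filters per convolution block
NUM_CLASSES = 27               # output units of the dense layer
EMBEDDING_PARAMS = 17_320_500  # as listed for the embedding layer

# The embedding matrix has vocab_size x EMBED_DIM entries.
vocab_size = EMBEDDING_PARAMS // EMBED_DIM


def conv_params(kernel_height):
    """Weights of a (kernel_height x EMBED_DIM x 1) kernel per filter, plus biases."""
    return kernel_height * EMBED_DIM * 1 * FILTERS + FILTERS


def conv_output_length(kernel_height):
    """Output length of an unpadded, stride-1 convolution over the sequence axis."""
    return SEQ_LEN - kernel_height + 1


conv_counts = [conv_params(k) for k in range(1, 7)]
conv_lengths = [conv_output_length(k) for k in range(1, 7)]

flatten_size = 6 * FILTERS  # six pooled 512-dimensional vectors
dense_params = flatten_size * NUM_CLASSES + NUM_CLASSES
total_params = EMBEDDING_PARAMS + sum(conv_counts) + dense_params
```

The embedding count also implies a vocabulary of 57,735 entries (17,320,500 / 300).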


## Benchmark Performance

The weighted-F1 scores obtained with the benchmark models described above are:

Text: 0.8113

Images: 0.5534

Because the text benchmark model performs better, the Y benchmark file contains its output.