<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/24286/logos/header.png?t=2021-01-07-16-57-37" style="width:80%;">

Do you scan online retailers in search of the best deals? You're joined by the many savvy shoppers who don't like paying extra for the same product depending on where they shop. Retail companies use a variety of methods to assure customers that their products are the cheapest. Among them is product matching, which allows a company to offer products at rates that are competitive to the same product sold by another retailer. To perform these matches automatically requires a thorough machine learning approach, which is where your data science skills could help.

Two different images of similar wares may represent the same product or two completely different items. Retailers want to avoid misrepresentations and other issues that could come from conflating two dissimilar products. Currently, a combination of deep learning and traditional machine learning analyzes image and text information to compare similarity. But major differences in images, titles, and product descriptions prevent these methods from being entirely effective.

Shopee is the leading e-commerce platform in Southeast Asia and Taiwan. Customers appreciate its easy, secure, and fast online shopping experience tailored to their region. The company also provides strong payment and logistical support along with a 'Lowest Price Guaranteed' feature on thousands of Shopee's listed products.

In this competition, you’ll apply your machine learning skills to build a model that predicts which items are the same products.

The applications go far beyond Shopee or other retailers. Your contributions to product matching could support more accurate product categorization and uncover marketplace spam. Customers will benefit from more accurate listings of the same or similar products as they shop. Perhaps most importantly, this will aid you and your fellow shoppers in your hunt for the very best deals.

Hi everyone, 

In this notebook we will explore the Shopee dataset with simple visualizations. This is an initial, incomplete version. I will also share some solution ideas in the coming days.

Upvotes are much appreciated if you like the notebook.

# Let's start with the imports

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2

In [None]:
BASE_DATA_DIR = Path("../input/shopee-product-matching/")
!ls {BASE_DATA_DIR}

We have only 3 public test images for the submission. We will use them to prepare our prediction pipeline. After submission, the notebook will run on approximately 70,000 images.

In [None]:
!ls {BASE_DATA_DIR / "test_images"}

# Reading the files

In [None]:
df_train = pd.read_csv(BASE_DATA_DIR / "train.csv")
df_test = pd.read_csv(BASE_DATA_DIR / "test.csv")
df_sub = pd.read_csv(BASE_DATA_DIR / "sample_submission.csv")

Here we group the images by `label_group`. The competition page defines it as:

> ID code for all postings that map to the same product. Not provided for the test set.

So all images that belong to the same group are considered the same product. We will also visualize and explore those images later in the notebook.
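
To make the definition concrete, here is a minimal sketch on a toy frame. The rows are made up for illustration and only mimic the `image` and `label_group` columns of `train.csv`; the `same_product_images` column name is hypothetical.

```python
import pandas as pd

# Toy stand-in for train.csv; these rows are made up for illustration.
toy = pd.DataFrame(
    {
        "image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "label_group": [10, 10, 20, 10],
    }
)

# Collect the images of each label_group, then map that list back to every row.
images_per_group = toy.groupby("label_group")["image"].apply(list)
toy["same_product_images"] = toy["label_group"].map(images_per_group)

print(toy)
```

Each row now carries the full list of images that share its group, including itself.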

# Grouping based on label_group

In [None]:
group_by_label_images = df_train.groupby("label_group")["image"].apply(list)
len_groups = group_by_label_images.apply(len)

In [None]:
# Grouped images
group_by_label_images.head()

In [None]:
# Length of the groups
len_groups.head()

Let's check out the statistics about the grouped images (by length):

In [None]:
len_groups.describe()

Here we can see that most items have at most two other items similar to them. Let's investigate this further by plotting two histograms of the same data: the right plot is the same as the left one, but with a log-scaled y-axis. Because most groups are small and large groups are rare, the log scale makes the tail visible.
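
As a complement to `describe()`, the share of groups at or below a given size can be read from a cumulative count. The sketch below uses made-up sizes in place of `len_groups` (the name `toy_len_groups` is hypothetical), so the numbers say nothing about the real data.

```python
import pandas as pd

# Made-up group sizes standing in for `len_groups`.
toy_len_groups = pd.Series([2, 2, 2, 3, 3, 4, 8, 51])

# Number of groups at each size, ordered by size.
size_counts = toy_len_groups.value_counts().sort_index()

# Cumulative share of groups whose size is at most a given value.
cumulative_share = size_counts.cumsum() / size_counts.sum()

print(cumulative_share)
```

Running the same three lines on the real `len_groups` would show directly how much of the data sits in the smallest groups.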

# Histogram plots

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
# `binwidth` overrides `bins` in seaborn, so only `binwidth` is passed.
sns.histplot(len_groups, binwidth=3, ax=ax[0])
sns.histplot(len_groups, binwidth=3, ax=ax[1])
ax[0].set_title("Histogram of label group sizes")
ax[1].set_title("Histogram of label group sizes (log scale)")
ax[1].set_yscale('log')
plt.show()

Now let's visualize some of the similar items to see how similar they are. First, we define a couple of helper functions to read the images and plot them.

# Image Visualizations

In [None]:
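def read_img_and_cvt_format(img_path, clr_format=cv2.COLOR_BGR2RGB):
    # cv2.imread returns None (instead of raising) when the file cannot be read.
    img = cv2.imread(img_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {img_path}")
    # OpenCV loads images as BGR; convert for matplotlib, which expects RGB.
    return cv2.cvtColor(img, clr_format)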


def visualize_batch(label_group, img_ids, texts, nrows=3, ncols=3, figsize=(24, 14)):
    
    plt.figure(figsize=figsize)
    plt.suptitle(f"Label group: {label_group}")
    for idx, (img_id, text) in enumerate(zip(img_ids, texts)):
        plt.subplot(nrows, ncols, idx + 1)
        img_fn = str(BASE_DATA_DIR / "train_images" / img_id)
        img = read_img_and_cvt_format(img_fn)
        plt.imshow(img)
        plt.title(f"{text}", fontsize=8, wrap=True)
        plt.axis("off")
        
    plt.show()

In [None]:
# Let's choose some ids for plotting.
len_groups.sort_values(ascending=False)[:5]

In [None]:
single_label_group = 1163569239
df_single = df_train[df_train.label_group == single_label_group]
df_single.head()

In [None]:
# Let's take the first 16 images
img_ids, texts = df_single.image.values[:16], df_single.title.values[:16]

visualize_batch(single_label_group, img_ids, texts, nrows=4, ncols=4)

In [None]:
single_label_group = 159351600
df_single = df_train[df_train.label_group == single_label_group]
df_single.head()

In [None]:
# Again, let's visualize the first 16 images
img_ids, texts = df_single.image.values[:16], df_single.title.values[:16]

visualize_batch(single_label_group, img_ids, texts, nrows=4, ncols=4)

In [None]:
single_label_group = 3627744656
df_single = df_train[df_train.label_group == single_label_group]
df_single.head()

In [None]:
img_ids, texts = df_single.image.values[:16], df_single.title.values[:16]

visualize_batch(single_label_group, img_ids, texts, nrows=4, ncols=4)

Before going further, let's stop here and explore some of the images together. Please share your thoughts in the comments section as well.

* The first thing I noticed is the **diversity** in both the images and the texts. Some of the images seem to have been taken at home, some look more professional, and some are just catalog images. Some images also contain text. Our models should be robust to this diversity.

* The second thing to notice is the titles. They also seem diverse. They mostly include the brand of the product, but there is no particular format. Careful preprocessing could bring additional improvements here.
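
As one possible starting point for the title preprocessing mentioned above, here is a minimal sketch. `normalize_title` is a hypothetical helper and the example title is made up; which steps actually help would need to be checked against the real titles.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Hypothetical helper: lowercase, drop punctuation, collapse whitespace."""
    # Fold compatibility characters (e.g. full-width letters) to a canonical form.
    text = unicodedata.normalize("NFKC", title).lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_title("  ACME  Body-Lotion, 400ml!! "))
```

On the real data this could be applied with `df_train["title"].map(normalize_title)` before any text features are extracted.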

Now, let's visualize some images from the other tail of the distribution:

In [None]:
len_groups.sort_values()[:5]

Here we will visualize the two images of each group together on the same row.

In [None]:
all_image_ids = []
all_titles = []
all_label_groups = []

for index in len_groups.sort_values()[:5].index:
    df_single = df_train[df_train.label_group == index]
    image_ids, titles = df_single.image.values, df_single.title.values
    all_image_ids.extend(image_ids.tolist())
    all_titles.extend(titles.tolist())
    all_label_groups.append(index)

In [None]:
all_label_groups

In [None]:
visualize_batch("\n" + "\n".join(map(str, all_label_groups)), 
                all_image_ids, all_titles, nrows=5, ncols=2, figsize=(14, 18))

# Starter Ideas

* In contrast to the large groups, the images here are much more similar and less diverse.

One thing to notice here is the language of the titles. Not all titles are in English; some are in Indonesian, as we might expect. So, if we want to extract features from a language model like BERT, we should also consider language detection or multilingual models.
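
To get a first feel for the language mix, a very crude word-list heuristic might look like the sketch below. The word lists and the `guess_language` helper are hypothetical and far from complete; a dedicated language-identification library or a multilingual model would likely be more robust, especially since product titles often contain few common words.

```python
# Tiny, hand-picked lists of common words; hypothetical and far from complete.
ENGLISH_WORDS = {"the", "and", "for", "with", "of", "new", "free"}
INDONESIAN_WORDS = {"dan", "yang", "untuk", "dengan", "dari", "baru", "gratis"}


def guess_language(title: str) -> str:
    """Crude guess: 'en', 'id', or 'unknown', based on common-word hits."""
    tokens = set(title.lower().split())
    en_hits = len(tokens & ENGLISH_WORDS)
    id_hits = len(tokens & INDONESIAN_WORDS)
    # A tie (including zero hits on both sides) is reported as unknown.
    if en_hits == id_hits:
        return "unknown"
    return "en" if en_hits > id_hits else "id"


for title in ["tas wanita untuk sekolah", "bag for women", "xyz 123"]:
    print(title, "->", guess_language(title))
```

Titles that land in the `unknown` bucket are a reminder of how limited such a heuristic is.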

# To Be Continued...

So, I will stop here for now. I plan to extend this notebook with some text exploration. I will also share some potential solution approaches, such as metric learning, in the coming days.

Upvotes would be much appreciated if you liked this notebook. Thanks and stay safe! 🤗 🤗