### Overview


Do you scan online retailers in search of the best deals? You're joined by the many savvy shoppers who don't like paying extra for the same product depending on where they shop. Retail companies use a variety of methods to assure customers that their products are the cheapest. Among them is product matching, which allows a company to offer products at prices that are competitive with those of the same product sold by another retailer. Performing these matches automatically requires a thorough machine learning approach, which is where your data science skills could help.

Two different images of similar wares may represent the same product or two completely different items. Retailers want to avoid misrepresentations and other issues that could come from conflating two dissimilar products. Currently, a combination of deep learning and traditional machine learning analyzes image and text information to compare similarity. But major differences in images, titles, and product descriptions prevent these methods from being entirely effective.

Shopee is the leading e-commerce platform in Southeast Asia and Taiwan. Customers appreciate its easy, secure, and fast online shopping experience tailored to their region. The company also provides strong payment and logistical support along with a 'Lowest Price Guaranteed' feature on thousands of Shopee's listed products.

In this competition, you’ll apply your machine learning skills to build a model that predicts which items are the same products.

The applications go far beyond Shopee or other retailers. Your contributions to product matching could support more accurate product categorization and uncover marketplace spam. Customers will benefit from more accurate listings of the same or similar products as they shop. Perhaps most importantly, this will aid you and your fellow shoppers in your hunt for the very best deals.

### Data description


Finding near-duplicates in large datasets is an important problem for many online businesses. In Shopee's case, everyday users can upload their own images and write their own product descriptions, adding an extra layer of challenge. Your task is to identify which products have been posted repeatedly. The differences between related products may be subtle while photos of identical products may be wildly different!

As this is a code competition, only the first few rows/images of the test set are published; the remainder are only available to your notebook when it is submitted. Expect to find roughly 70,000 images in the hidden test set. The few test rows and images that are provided are intended to illustrate the hidden test set format and folder structure.

Files

- [train/test].csv - the training/test set metadata. Each row contains the data for a single posting. Multiple postings might have the exact same image ID but different titles, or vice versa.
  - posting_id - the ID code for the posting.
  - image - the image id/md5sum.
  - image_phash - a perceptual hash of the image.
  - title - the product description for the posting.
  - label_group - ID code for all postings that map to the same product. Not provided for the test set.
- [train/test]_images - the images associated with the postings.
- sample_submission.csv - a sample submission file in the correct format.
  - posting_id - the ID code for the posting.
  - matches - Space-delimited list of all posting IDs that match this posting. Posts always self-match. Group sizes were capped at 50, so there's no need to predict more than 50 matches.
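
The `matches` format described above can be sketched in plain Python. This is only an illustration: the helper name `format_matches` and the posting IDs are made up for the example, and the cap of 50 is taken from the description rather than enforced by the competition.

```python
def format_matches(posting_id, candidates, max_matches=50):
    """Build the space-delimited `matches` string for one posting.

    The posting itself is listed first (posts always self-match),
    duplicates are dropped, and the list is cut to `max_matches` entries.
    """
    ordered = [posting_id]
    for cand in candidates:
        if cand not in ordered:
            ordered.append(cand)
    return " ".join(ordered[:max_matches])


# Hypothetical posting IDs, for illustration only.
row = format_matches("train_1", ["train_2", "train_1", "train_3", "train_2"])
print(row)  # train_1 train_2 train_3
```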

In [None]:
import pandas as pd
import numpy as np
import cupy as cp
import cudf, cuml

import os

import re
import string
import nltk
from nltk.corpus import stopwords

#from sklearn.feature_extraction.text import TfidfVectorizer
#from sklearn.neighbors import NearestNeighbors

from cuml.feature_extraction.text import TfidfVectorizer
from cuml.neighbors import NearestNeighbors
import tensorflow as tf
from tensorflow.keras.applications import EfficientNetB0
import gc


import cv2

import time

In [None]:
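def getMetric(col):
    # Returns a row-wise scorer that compares the predictions in `col` with `target`
    def f1score(row):
        # Number of posting IDs shared by the ground truth and the prediction
        n = len( np.intersect1d(row.target,row[col]) )
        # F1 = 2 * |intersection| / (|target| + |prediction|)
        return 2*n / (len(row.target)+len(row[col]))
    return f1score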
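
To make the metric concrete, here is a small worked example of the same row-wise F1 in plain Python. It is a sketch that assumes each list holds unique posting IDs; the name `row_f1` and the IDs are hypothetical.

```python
def row_f1(target, predicted):
    """Row-wise F1: 2 * |overlap| / (|target| + |predicted|)."""
    target, predicted = set(target), set(predicted)
    overlap = len(target & predicted)
    return 2 * overlap / (len(target) + len(predicted))


# Hypothetical posting IDs: two of the three true matches are found,
# plus two false positives.
target = ["a", "b", "c"]
predicted = ["a", "b", "d", "e"]
score = row_f1(target, predicted)  # 2 * 2 / (3 + 4) = 4/7
print(round(score, 4))  # 0.5714
```

The competition score is the mean of this value over all postings, which is what `dataset.f1.mean()` computes later in the notebook.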

In [None]:
TRAIN = False


PATH = 'train_images' if TRAIN else 'test_images'
CSV_FN = 'train.csv' if TRAIN else 'test.csv'
DATA_PATH = '../input/shopee-product-matching/'
IMG_PATH = os.path.join(DATA_PATH, PATH)


N_WORKERS = 4
BATCH_SIZE = 1024*4

In [None]:
dataset = pd.read_csv(os.path.join(DATA_PATH, CSV_FN))
dataset_cudf = cudf.read_csv(os.path.join(DATA_PATH, CSV_FN))

In [None]:
if TRAIN:
    tmp = dataset.groupby('label_group').posting_id.agg('unique').to_dict()
    dataset['target'] = dataset.label_group.map(tmp)
    print(f'Dataset shape {dataset.shape}')
    
dataset.head()

In [None]:
print('Computing text embeddings...')
model = TfidfVectorizer(stop_words='english', binary=True, max_features=25_000)
text_embeddings = model.fit_transform(dataset_cudf.title).toarray()
print('text embeddings shape',text_embeddings.shape)

In [None]:
preds = []

print('Finding similar titles...')
CTS = len(dataset)//BATCH_SIZE
if len(dataset)%BATCH_SIZE!=0: CTS += 1
for j in range( CTS ):
    
    a = j*BATCH_SIZE
    b = (j+1)*BATCH_SIZE
    b = min(b,len(dataset))
    print('chunk',a,'to',b)
    
    # COSINE SIMILARITY: TF-IDF rows are L2-normalized by default, so the dot product is the cosine similarity
    cts = cp.matmul(text_embeddings,text_embeddings[a:b].T).T
    
    for k in range(b-a):
        IDX = cp.where(cts[k,]>0.6)[0]
        o = dataset.iloc[cp.asnumpy(IDX)].posting_id.values
        preds.append(o)
    
del model, text_embeddings
_ = gc.collect()
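
The chunked similarity search above can be reproduced on the CPU with NumPy. The sketch below is a stand-in for the CuPy code, not a copy of it: it normalizes the rows explicitly (TF-IDF output is assumed to be L2-normalized already), and the function name, toy embeddings, and chunk size are made up for illustration.

```python
import numpy as np


def chunked_threshold_matches(embeddings, threshold=0.6, chunk_size=2):
    """For each row, return indices of rows with cosine similarity > threshold."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against all-zero rows (e.g. a title made only of stop words).
    unit = embeddings / np.where(norms == 0, 1.0, norms)
    matches = []
    for start in range(0, len(unit), chunk_size):
        stop = min(start + chunk_size, len(unit))
        sims = unit[start:stop] @ unit.T  # shape: (chunk rows, all rows)
        for row in sims:
            matches.append(np.where(row > threshold)[0])
    return matches


# Toy embeddings: rows 0 and 1 point in nearly the same direction.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
result = chunked_threshold_matches(emb)
print([m.tolist() for m in result])  # [[0, 1], [0, 1], [2]]
```

Processing a few thousand rows at a time keeps the similarity matrix at `chunk x N` instead of `N x N`, which is why the notebook loops over `BATCH_SIZE` slices.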

In [None]:
dataset['preds'] = preds

In [None]:
class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, df, img_size=256, batch_size=32, path=''): 
        self.df = df
        self.img_size = img_size
        self.batch_size = batch_size
        self.path = path
        self.indexes = np.arange( len(self.df) )
        
    def __len__(self):
        'Denotes the number of batches per epoch'
        ct = len(self.df) // self.batch_size
        ct += int(( (len(self.df)) % self.batch_size)!=0)
        return ct

    def __getitem__(self, index):
        'Generate one batch of data'
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        X = self.__data_generation(indexes)
        return X
            
    def __data_generation(self, indexes):
        'Generates data containing batch_size samples' 
        X = np.zeros((len(indexes),self.img_size,self.img_size,3),dtype='float32')
        df = self.df.iloc[indexes]
        for i,(index,row) in enumerate(df.iterrows()):
            filename = os.path.join(self.path, row.image)
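            # cv2.imread returns BGR pixels in [0, 255]; no manual rescaling is applied
            # because Keras EfficientNet models normalize their inputs internally
            img = cv2.imread(filename)
            X[i,] = cv2.resize(img,(self.img_size,self.img_size))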
        return X
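
The batch bookkeeping in `__len__` and `__getitem__` is a ceiling division followed by slicing. A minimal sketch of that arithmetic, using a hypothetical helper `batch_slices`:

```python
import math


def batch_slices(n_items, batch_size):
    """Return (start, stop) index pairs covering n_items in batches."""
    n_batches = math.ceil(n_items / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, n_items))
        for i in range(n_batches)
    ]


print(batch_slices(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

The last batch is shorter whenever the number of items is not a multiple of the batch size, which is why `__data_generation` sizes its array with `len(indexes)` rather than `batch_size`.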

In [None]:
WGT = '../input/effnetb0/efficientnetb0_notop.h5'
model = EfficientNetB0(weights=WGT,include_top=False, pooling='avg', input_shape=None)

embeds = []

print('Computing image embeddings...')

CTS = len(dataset)//BATCH_SIZE
if len(dataset)%BATCH_SIZE!=0: CTS += 1
    
for i,j in enumerate(range(CTS)):
    
    a = j*BATCH_SIZE
    b = (j+1)*BATCH_SIZE
    b = min(b,len(dataset))
    print('chunk',a,'to',b)
    
    data_gen = DataGenerator(dataset.iloc[a:b], batch_size=32, path=IMG_PATH)
    image_embeddings = model.predict(data_gen,verbose=1,use_multiprocessing=True, workers=N_WORKERS)
    embeds.append(image_embeddings)

    #if i>=1: break

image_embeddings = np.concatenate(embeds)
print('image embeddings shape',image_embeddings.shape)

del model

In [None]:
# Only a few test rows are published, so fall back to 2 neighbors when the dataset is tiny
n_neighbors = 50 if len(dataset) > 3 else 2
model = NearestNeighbors(n_neighbors=n_neighbors)
model.fit(image_embeddings)

In [None]:
preds = []


print('Finding similar images...')
CTS = len(image_embeddings)//BATCH_SIZE
if len(image_embeddings)%BATCH_SIZE!=0: CTS += 1
    
for j in range( CTS ):
    
    a = j*BATCH_SIZE
    b = (j+1)*BATCH_SIZE
    b = min(b,len(image_embeddings))
    print('chunk',a,'to',b)
    distances, indices = model.kneighbors(image_embeddings[a:b,])
    
    for k in range(b-a):
        IDX = np.where(distances[k,]<7.0)[0]
        IDS = indices[k,IDX]
        o = dataset.iloc[IDS].posting_id.values
        preds.append(o)
        
#del model, image_embeddings
_ = gc.collect()
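
The image search keeps only those of the `n_neighbors` nearest neighbors whose distance falls below a cutoff. The sketch below shows the same idea with a brute-force NumPy search in place of cuML's `NearestNeighbors`; Euclidean distance is assumed, and the function name and toy data are hypothetical.

```python
import numpy as np


def neighbors_within(embeddings, k=2, max_dist=7.0):
    """Brute-force k nearest neighbors, keeping those closer than max_dist."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # (n, n) Euclidean distances
    order = np.argsort(dist, axis=1)[:, :k]   # k nearest per row, self included
    kept = []
    for i, idx in enumerate(order):
        kept.append(idx[dist[i, idx] < max_dist])
    return kept


# Toy embeddings: rows 0 and 1 are 5 units apart, row 2 is far from both.
emb = np.array([[0.0, 0.0], [3.0, 4.0], [30.0, 40.0]])
result = neighbors_within(emb, k=2, max_dist=7.0)
print([sorted(r.tolist()) for r in result])  # [[0, 1], [0, 1], [2]]
```

Because each row is at distance 0 from itself, every posting keeps at least its own ID, which matches the rule that posts always self-match.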

In [None]:
dataset['preds2'] = preds

In [None]:
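# `target` exists only for the training set, so score the image matches only when TRAIN is set
if TRAIN:
    dataset['f1'] = dataset.apply(getMetric('preds2'),axis=1)
    print('CV Score =', dataset.f1.mean())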

In [None]:
tmp = dataset.groupby('image_phash').posting_id.agg('unique').to_dict()
dataset['preds3'] = dataset.image_phash.map(tmp)
dataset.head()

In [None]:
tmp = dataset.groupby('image').posting_id.agg('unique').to_dict()
dataset['preds4'] = dataset.image.map(tmp)
dataset.head()
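
The two `groupby` cells above give every posting the list of postings that share its exact `image_phash` or `image` value. The same mapping can be sketched without pandas; `group_matches` and the sample values are hypothetical.

```python
from collections import defaultdict


def group_matches(posting_ids, keys):
    """For each posting, list all postings that share its key."""
    groups = defaultdict(list)
    for pid, key in zip(posting_ids, keys):
        groups[key].append(pid)
    return [groups[key] for key in keys]


# Made-up posting IDs and perceptual hashes.
posting_ids = ["p1", "p2", "p3", "p4"]
phashes = ["aaaa", "bbbb", "aaaa", "cccc"]
by_phash = group_matches(posting_ids, phashes)
print(by_phash)  # [['p1', 'p3'], ['p2'], ['p1', 'p3'], ['p4']]
```

Exact matching only catches postings whose hash is identical; images that are similar but not identical rely on the embedding search earlier in the notebook.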

In [None]:
def combine_for_sub(row):
    x = np.concatenate([row.preds,row.preds2, row.preds3, row.preds4])
    return ' '.join( np.unique(x) )

def combine_for_cv(row):
    x = np.concatenate([row.preds,row.preds2, row.preds3, row.preds4])
    return np.unique(x)
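
Both helpers take the union of the four prediction sources; `np.unique` removes duplicates and sorts the result. A plain-Python sketch of the same merge, with made-up inputs and a hypothetical function name:

```python
def combine_predictions(*sources):
    """Union of several prediction lists, de-duplicated and sorted."""
    merged = set()
    for src in sources:
        merged.update(src)
    return sorted(merged)


# Hypothetical predictions for one posting from three sources.
text_preds = ["p1", "p2"]
image_preds = ["p1", "p3"]
phash_preds = ["p1"]
combined = combine_predictions(text_preds, image_preds, phash_preds)
print(" ".join(combined))  # p1 p2 p3
```

Taking the union favors recall: a posting is reported as a match if any single source flags it, so overly loose thresholds in one source can lower precision for the combined result.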

In [None]:
if TRAIN:
    tmp = dataset.groupby('label_group').posting_id.agg('unique').to_dict()
    dataset['target'] = dataset.label_group.map(tmp)
    dataset['oof'] = dataset.apply(combine_for_cv,axis=1)
    dataset['f1'] = dataset.apply(getMetric('oof'),axis=1)
    print('CV Score =', dataset.f1.mean())

dataset['matches'] = dataset.apply(combine_for_sub,axis=1)

CV Score = 0.7248077230326005 - base

CV Score = 0.7322238396656401 - tune

In [None]:
dataset[['posting_id','matches']].to_csv('submission.csv',index=False)

In [None]:
#sub = pd.read_csv('submission.csv')
#sub.head()