# Shopee Price Match Guarantee: Before we start
![](https://storage.googleapis.com/kaggle-competitions/kaggle/24286/logos/header.png?t=2021-01-07-16-57-37)

Do you scan online retailers in search of the best deals? You're joined by the many savvy shoppers who don't like paying extra for the same product depending on where they shop. Retail companies use a variety of methods to assure customers that their products are the cheapest. Among them is product matching, which allows a company to offer products at rates that are competitive with the same product sold by another retailer. Performing these matches automatically requires a thorough machine learning approach, which is where your data science skills could help.

Two different images of similar wares may represent the same product or two completely different items. Retailers want to avoid misrepresentations and other issues that could come from conflating two dissimilar products. Currently, a combination of deep learning and traditional machine learning analyzes image and text information to compare similarity. But major differences in images, titles, and product descriptions prevent these methods from being entirely effective.

Shopee is the leading e-commerce platform in Southeast Asia and Taiwan. Customers appreciate its easy, secure, and fast online shopping experience tailored to their region. The company also provides strong payment and logistical support along with a 'Lowest Price Guaranteed' feature on thousands of Shopee's listed products.

In this competition, you’ll apply your machine learning skills to build a model that predicts which items are the same products.

The applications go far beyond Shopee or other retailers. Your contributions to product matching could support more accurate product categorization and uncover marketplace spam. Customers will benefit from more accurate listings of the same or similar products as they shop. Perhaps most importantly, this will aid you and your fellow shoppers in your hunt for the very best deals.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tqdm
from tqdm.auto import tqdm as tqdmp
tqdmp.pandas()

# Work with phash
import imagehash

import cv2, os
import skimage.io as io
from PIL import Image

# ignoring warnings
import warnings
warnings.simplefilter("ignore")

<h2 style='color:white; background:#f15335; border:0'><center>Work directory</center></h2>

In [None]:
WORK_DIR = '../input/shopee-product-matching'
os.listdir(WORK_DIR)

<h2 style='color:white; background:#f15335; border:0'><center>Quick look at the data</center></h2>

In [None]:
train = pd.read_csv(os.path.join(WORK_DIR, 'train.csv'))
test = pd.read_csv(os.path.join(WORK_DIR, 'test.csv'))
ss = pd.read_csv(os.path.join(WORK_DIR, 'sample_submission.csv'), index_col = 0)
print('-'*40, 'Train head', '-'*40)
print(train.head())
print('-'*40, 'Test head', '-'*40)
print(test.head())
print('-'*30, 'Sample submission head', '-'*30)
print(ss.head())

In [None]:
print('Train images: %d' %len(os.listdir(os.path.join(WORK_DIR, "train_images"))))
print('Test images: %d' %len(os.listdir(os.path.join(WORK_DIR, "test_images"))))

In [None]:
train_images = WORK_DIR + "/train_images/" + train['image']
train['path'] = train_images

test_images = WORK_DIR + "/test_images/" + test['image']
test['path'] = test_images

train.head()

In [None]:
print('label_group unique values: {}'.format(train['label_group'].nunique()))

In [None]:
sns.set_style("whitegrid")
plt.figure(figsize = (10, 6))
plt.title('Distribution of title length', fontsize = '15')
sns.kdeplot(train['title'].apply(lambda x: len(x)), fill = True, 
            color = '#f15335', 
            edgecolor = 'black', alpha = 0.9)
plt.xlabel('Title length')
plt.show()

<h4 style='color:white; background:#f15335; border:0'><center>Image shapes distribution</center></h4>

In [None]:
# Shape columns
train['img_shape'] = train['path'].progress_apply(lambda x: np.shape(io.imread(x)))

In [None]:
# np.shape of an image array is (height, width, channels)
shapes = pd.DataFrame.from_records(train['img_shape'])
shapes.columns = ['Height', 'Width', 'Colors']

sns.set_style("white")
sns.jointplot(x = shapes['Width'].astype('float32'), 
              y = shapes['Height'].astype('float32'),
              height = 8, color = '#f15335')
plt.show()

<h2 style='color:white; background:#f15335; border:0'><center>Work with image PHASH</center></h2>

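The data includes an `image_phash` value for every image, which can greatly simplify our work.

The idea behind a perceptual hash (phash) is simple. As implemented in the `imagehash` library, the image is converted to grayscale (luminance only, without color information) and shrunk to a small square, a discrete cosine transform (DCT) is applied, and only the 8x8 block of lowest-frequency coefficients is kept. Each of those 64 values becomes True or False depending on whether it lies above or below the median, which gives an 8x8 boolean matrix (stored in the data as a 16-character hex string). To analyze the similarity, we "subtract" one phash matrix from another: in `imagehash` the `-` operator returns the Hamming distance, that is, the number of positions where the two matrices differ. Matching positions contribute 0 (True vs True, False vs False) and differing positions contribute 1. The closer the distance is to zero, the more similar the images are.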
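
To make the DCT step concrete, here is a minimal NumPy sketch of the idea. It is an illustration only: `dct2_matrix` and `phash_sketch` are hypothetical helper names, the input is assumed to be already grayscale and resized to a small square, and the result is not guaranteed to match `imagehash.phash` bit for bit.

```python
import numpy as np

def dct2_matrix(n):
    """Unnormalized DCT-II basis: row k holds 2*cos(pi*k*(2i+1)/(2n))."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    return 2.0 * np.cos(np.pi * k * (2 * i + 1) / (2 * n))

def phash_sketch(gray, hash_size=8):
    """
    Illustrative DCT-based hash of a square grayscale array.
    Assumes `gray` was already resized (e.g. to 32x32).
    """
    gray = np.asarray(gray, dtype=np.float64)
    m = dct2_matrix(gray.shape[0])
    coeffs = m @ gray @ m.T                 # 2-D DCT: along columns, then rows
    low = coeffs[:hash_size, :hash_size]    # keep the lowest frequencies
    return low > np.median(low)             # hash_size x hash_size boolean matrix

rng = np.random.default_rng(0)
img = rng.random((32, 32))                  # stand-in for a resized grayscale image
bits = phash_sketch(img)
print(bits.shape, bits.dtype)
# Doubling every pixel value scales all coefficients and the median alike,
# so the hash does not change.
print(np.array_equal(bits, phash_sketch(2 * img)))
```

Because only low-frequency structure is kept, small changes such as uniform brightness scaling or mild compression noise tend to flip few bits, if any.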

For instance, the phash matrix of the first image looks like this:

In [None]:
imagehash.hex_to_hash(train['image_phash'][0])
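
The distance between two such hashes can also be computed directly from the hex strings, without building the matrices. The sketch below uses only the standard library; `hex_hamming` is a hypothetical helper and the three hashes are made up for illustration, not taken from the dataset.

```python
def hex_hamming(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree; count those bits.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

# Made-up 64-bit hashes (16 hex digits each).
h1 = "ffff0000ffff0000"
h2 = "ffff0000ffff0001"   # differs from h1 in the lowest bit only
h3 = "0000ffff0000ffff"   # every bit flipped relative to h1

print(hex_hamming(h1, h1))  # 0
print(hex_hamming(h1, h2))  # 1
print(hex_hamming(h1, h3))  # 64
```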

Let's write a little test function.

In [None]:
def match_matrix(phash_array):
    """
    A function that checks for matches by phash value.
    Takes phash values (hex strings) as input.
    Output - phash diff matrix (pandas data frame), 0 = identical hashes
    """
    # Convert every hex string to an ImageHash once
    phashs = phash_array.apply(imagehash.hex_to_hash)
    columns = []
    for h in tqdm.tqdm(phashs, desc = 'Progress:', position = 0, leave = True):
        # ImageHash subtraction returns the number of differing bits
        columns.append(phashs - h)
    # Concatenate once at the end instead of growing the frame inside the loop
    phash_matrix = pd.concat(columns, axis = 1)
    phash_matrix.columns = range(len(phash_array))
    return phash_matrix

Building this matrix is quite resource-intensive, because the number of comparisons grows quadratically with the number of images. For the demonstration, we will take only the first thousand images.
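
If the loop over `ImageHash` objects becomes the bottleneck, the same distances can be computed in one vectorized step. The following is a sketch under the assumption that every hash is a 16-character hex string; `hex_to_bits` and `pairwise_hamming` are hypothetical helpers, and the sample hashes are made up.

```python
import numpy as np

def hex_to_bits(hex_hash: str) -> np.ndarray:
    """Unpack a hex hash string into a flat array of 0/1 bits."""
    raw = np.frombuffer(bytes.fromhex(hex_hash), dtype=np.uint8)
    return np.unpackbits(raw)

def pairwise_hamming(hex_hashes) -> np.ndarray:
    """Return an (n, n) matrix of Hamming distances between hex hashes."""
    bits = np.stack([hex_to_bits(h) for h in hex_hashes])   # shape (n, 64)
    # Broadcasting compares every row with every other row at once.
    return (bits[:, None, :] != bits[None, :, :]).sum(axis=2)

hashes = ["ffff0000ffff0000", "ffff0000ffff0001", "0000ffff0000ffff"]
dist = pairwise_hamming(hashes)
print(dist)
```

The intermediate comparison array has n x n x 64 elements, so this is comfortable for a thousand hashes but would need to be processed in chunks for much larger inputs.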

In [None]:
train_part = train.iloc[:1000, :]
matches = match_matrix(train_part['image_phash'])
matches

In [None]:
test_match = match_matrix(test['image_phash'][:3])
test_match

In [None]:
match = []
for i in range(len(matches)):
    match.append(matches.iloc[i, :][(matches.iloc[i, :] == 0)].index.values)
match = pd.Series(match)

match[match.apply(lambda x: len(x) > 1)]

Let's take a look at a few matches.

In [None]:
def image_viz(image_path):
    """
    Function for visualization.
    Takes path to image as input.
    """
    img = cv2.imread(image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    
    plt.imshow(img)
    plt.axis('off')

In [None]:
train_part.loc[[11,12],['posting_id','image_phash','title','label_group']]

In [None]:
plt.figure(figsize = (15, 10))
for idx, i in enumerate([train_part.loc[11, 'path'], 
                         train_part.loc[12, 'path']]):
    plt.subplot(1, 2, idx + 1)
    image_viz(i)
plt.show()

In [None]:
train_part.loc[[889,890,891],['posting_id','image_phash','title','label_group']]

In [None]:
plt.figure(figsize = (15, 10))
for idx, i in enumerate([train_part.loc[889, 'path'], 
                         train_part.loc[890, 'path'], 
                         train_part.loc[891, 'path']]):
    plt.subplot(1, 3, idx + 1)
    image_viz(i)
plt.show()

In [None]:
train_part.loc[[997,520],['posting_id','image_phash','title','label_group']]

In [None]:
plt.figure(figsize = (15, 10))
for idx, i in enumerate([train_part.loc[997, 'path'], 
                         train_part.loc[520, 'path']]):
    plt.subplot(1, 2, idx + 1)
    image_viz(i)
plt.show()

Phash analysis finds not only exact copies (distance 0) but also near-duplicates. For instance, here are the images whose distance to another image is between 1 and 5:

In [None]:
match = []
for i in range(len(matches)):
    match.append(matches.iloc[i, :][(matches.iloc[i, :] > 0) & 
                                    (matches.iloc[i, :] <= 5)].index.values)
match = pd.Series(match)

match[match.apply(lambda x: len(x) >= 1)]

In [None]:
plt.figure(figsize = (15, 10))
for idx, i in enumerate([train_part.loc[55, 'path'], 
                         train_part.loc[312, 'path']]):
    plt.subplot(1, 2, idx + 1)
    image_viz(i)
plt.show()

In [None]:
plt.figure(figsize = (15, 10))
for idx, i in enumerate([train_part.loc[128, 'path'], 
                         train_part.loc[515, 'path']]):
    plt.subplot(1, 2, idx + 1)
    image_viz(i)
plt.show()

In [None]:
plt.figure(figsize = (15, 10))
for idx, i in enumerate([train_part.loc[216, 'path'], 
                         train_part.loc[567, 'path']]):
    plt.subplot(1, 2, idx + 1)
    image_viz(i)
plt.show()

On these examples it works very well. In this competition, we don't need to compute phash ourselves because it is already provided in the `image_phash` column, but it can easily be done with the [imagehash library](https://pypi.org/project/ImageHash/).

<h2 style='color:white; background:#f15335; border:0'><center>Baseline prediction</center></h2>

In [None]:
# Work functions
# Both helpers assume the default RangeIndex, so labels equal positions.
def phash_match(phash_array, element):
    """
    A function that calculates phash diffs.
    Takes phashs array and element as input.
    Output - phash diff (Hamming distance to every hash in the array)
    """
    phash_diff = phash_array - phash_array[element]
    return phash_diff

def add_match(phash, i, dataset = train, threshold = 5):
    """
    A function that returns match names.
    Takes phash array, i element, dataset and threshold (default = 5).
    Output - space-separated posting_ids, own id first.
    """
    diffs = phash_match(phash, i)
    # Rows within the threshold, excluding the element itself
    matches = diffs[diffs <= threshold].index.drop(i)
    # posting_id is the first column of the dataset
    ids = [dataset.iloc[i, 0]] + [dataset.iloc[j, 0] for j in matches]
    return ' '.join(ids)

In [None]:
phashs = train['image_phash'][:1000].apply(lambda x: imagehash.hex_to_hash(x))
str_matches = []

for i in tqdm.tqdm(range(len(phashs)), desc = 'Progress:', position = 0, leave = True):
    str_matches.append(add_match(phashs, i))

str_matches[:15]

<h4 style='color:white; background:#f15335; border:0'><center>Test images</center></h4>

In [None]:
plt.figure(figsize = (15, 10))
# Show only the first three images: the 1x3 grid has room for no more
for idx, i in enumerate(test['path'][:3]):
    plt.subplot(1, 3, idx + 1)
    image_viz(i)
plt.show()

In [None]:
test

In [None]:
test_phashs = test['image_phash'].apply(lambda x: imagehash.hex_to_hash(x))
test_matches = []

for i in tqdm.tqdm(range(len(test_phashs)), desc = 'Progress:', 
                   position = 0, leave = True):
    test_matches.append(add_match(test_phashs, i, test, threshold = 7))

test_matches

In [None]:
ss['matches'] = test_matches
ss.to_csv("submission.csv")
ss

This analysis is convenient and simple, but it has one disadvantage: speed. Every image is compared with every other image, so it does not scale to a large test set. For large test data we therefore restrict ourselves to finding, for each image, all images with an identical PHASH code.

In [None]:
def simple_match(dataset, element):
    """
    A function that returns match names.
    Takes dataset and i element (row label; assumes the default RangeIndex).
    Output - space-separated posting_ids with the same phash, own id first.
    """
    same_hash = dataset['image_phash'] == dataset['image_phash'][element]
    # Other rows with an identical phash, excluding the element itself
    matches = dataset.loc[same_hash, 'posting_id'].drop(element).values
    return ' '.join([dataset['posting_id'][element], *matches])

In [None]:
train_for_s = train[['posting_id', 'image_phash']]
str_matches = []

for i in tqdm.tqdm(range(len(train_for_s)), desc = 'Progress:', 
                   position = 0, leave = True):
    str_matches.append(simple_match(train_for_s, i))

str_matches[:15]
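
The loop above still scans the whole table once per row. Since only identical hashes count here, one possible alternative is to bucket the ids by hash in a single pass. This is a sketch with a hypothetical helper `group_by_hash` and toy data, not the notebook's method:

```python
from collections import defaultdict

def group_by_hash(posting_ids, phashes):
    """
    For each row, return a space-separated string of the posting ids that
    share its hash, with the row's own id first.
    """
    posting_ids, phashes = list(posting_ids), list(phashes)
    groups = defaultdict(list)
    for pid, h in zip(posting_ids, phashes):
        groups[h].append(pid)               # one pass: hash -> ids
    return [' '.join([pid] + [p for p in groups[h] if p != pid])
            for pid, h in zip(posting_ids, phashes)]

# Toy data, not taken from the competition files.
ids = ['a', 'b', 'c', 'd']
hashes = ['x1', 'x2', 'x1', 'x3']
print(group_by_hash(ids, hashes))   # ['a c', 'b', 'c a', 'd']
```

With the notebook's frames this could be called as `group_by_hash(train['posting_id'], train['image_phash'])`, which should give the same groups as `simple_match` for each row.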

In [None]:
test_for_s = test.loc[:2, ['posting_id', 'image_phash']]
test_matches = []

for i in tqdm.tqdm(range(len(test_for_s)), desc = 'Progress:', 
                   position = 0, leave = True):
    test_matches.append(simple_match(test_for_s, i))
    
test_matches

In [None]:
# ss['matches'] = test_matches
# ss

In [None]:
# ss.to_csv("submission.csv")

<h2 style='color:white; background:#f15335; border:0'><center>WORK IN PROGRESS...</center></h2>