# Shopee-Product-Matching
![Shopee](https://cdn.lynda.com/course/563030/563030-636270778700233910-16x9.jpg)


1. [Business Problem](#motivation)  
2. [EDA and Dataset Exploration](#eda)
3. [Convolutional Autoencoder](#cae)
4. [Load Models and Create Indexing](#indexing)
5. [Visualize Top Similar Products](#testing)


# 1. Business Problem


Shopee is the leading e-commerce platform in Southeast Asia and Taiwan. Customers appreciate its easy, secure, and fast online shopping experience tailored to their region. The company also provides strong payment and logistical support along with a 'Lowest Price Guaranteed' feature on thousands of Shopee's listed products.

Finding near-duplicates in large datasets is an important problem for many online businesses. In Shopee's case, everyday users can upload their own images and write their own product descriptions, adding an extra layer of challenge. Your task is to identify which products have been posted repeatedly. The differences between related products may be subtle while photos of identical products may be wildly different!

Two different images of similar wares may represent the same product or two completely different items. Retailers want to avoid misrepresentations and other issues that could come from conflating two dissimilar products. Currently, a combination of deep learning and traditional machine learning analyzes image and text information to compare similarity. But major differences in images, titles, and product descriptions prevent these methods from being entirely effective.

In this competition, we’ll apply our machine learning skills to build a model that predicts which items are the same products.


## 2. EDA and Dataset Exploration

In this competition, each item has an image and a title. In the training data, the label_group column is the ground truth: items that share a label_group are the same product. We need to build a model that finds these matching items from their images and title text. This notebook explores some tools to help us.
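
As a minimal sketch of how label_group can be turned into a per-item list of matches, the snippet below groups a toy data frame and maps each row to every posting in its group. The toy values are made up, and the posting_id column name is an assumption about train.csv that is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are hypothetical.
toy_df = pd.DataFrame({
    "posting_id": ["a", "b", "c", "d", "e"],
    "label_group": [10, 20, 10, 30, 20],
})

# Map every label_group to the list of postings it contains ...
group_to_postings = toy_df.groupby("label_group")["posting_id"].agg(list)
# ... then give each row the full list of postings from its own group.
toy_df["matches"] = toy_df["label_group"].map(group_to_postings)

print(toy_df)
```

Each row's matches list includes the row itself, so a group of size one yields a single-element list.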


### 2.1 Load Libraries

In [None]:
%config Completer.use_jedi = False

In [None]:
## Load Libraries

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.image as im
import tqdm
import cv2
%matplotlib inline
import PIL
import gc
import time
from skimage import io, transform
import pickle
## Deep learning library (TensorFlow / Keras)
import tensorflow as tf
import tensorflow.keras.layers as layers
import tensorflow.keras.models as models






### 2.2 Load Dataset

In [None]:
# configuration params
BASE_PATH = '../input/shopee-product-matching/'
TRAIN_PATH = BASE_PATH + "train_images/"
TEST_PATH = BASE_PATH + "test_images/"

BATCH_SIZE=64
INP_WIDTH=256
INP_HEIGHT=256
MODEL_PATH='output/'

In [None]:
test_df = pd.read_csv(BASE_PATH+'test.csv')
train_df = pd.read_csv(BASE_PATH+'train.csv')


### 2.3 Display Duplicated items

The label_group feature indicates which postings belong to the same product, so we can use it to display duplicated products.

In [None]:
labelGroups = train_df.label_group.value_counts()
plt.figure(figsize=(15,5))
plt.plot(np.arange(len(labelGroups)), labelGroups.values)
plt.xlabel("Index for unique label_group_item", size=12)
plt.ylabel("duplicated image count", size=12)
plt.title("duplicated item vs duplicated image count", size=15)
plt.show()


In [None]:
plt.figure(figsize=(15,5))
plt.bar(labelGroups.index[:30].astype('str'), labelGroups.values[:30])
plt.xlabel("label_group", size=14)
plt.xticks(rotation = 45)
plt.ylabel("duplicated count", size=14)
plt.title("top 30 duplicated image count", size=16)
plt.show()

In [None]:
def display_image(df, COLS=6, ROWS=4, path=BASE_PATH, random=False):
    # draw ROWS figures, each holding COLS images side by side
    for k in range(ROWS):
        # for each row we will set the size of figure
        plt.figure(figsize=(20,5))
        for j in range(COLS):
            # pick a random row if requested, otherwise walk the frame in order
            if random:
                row = np.random.randint(0,len(df))
            else:
                row = COLS*k + j
            # stop once the data frame has no more rows to show
            if row >= len(df):
                break
            # file name (joined with path below) and title of the posting
            name = df.iloc[row]['image']
            title = df.iloc[row]['title']
            img = im.imread(path+name)
            plt.subplot(1,COLS,j+1)
            plt.title(title[:30])
            plt.axis('off')
            # display image
            plt.imshow(img)
        plt.show()
        


In [None]:
label_group_sample =train_df[train_df['label_group'] == 994676122]
display_image(label_group_sample, random=False, ROWS=1, COLS=4, path = BASE_PATH + 'train_images/')

### 2.4 Observation

- All of the images above belong to the same label_group, yet some of them **(first row)** look different from the others.
- Some images are almost identical; the only differences are **zoom level, background, and orientation**.
- **Zoom Level Variation:** [Row 1, Col 2] [Row 2, Col 3]
- **Background and Orientation Variation:** [last row]
- The titles of these products also differ slightly.
- This variation makes the problem very interesting!


In [None]:
for k in range(2):
    print('*'*40)
    print('*** TOP %i DUPLICATED ITEM:'%(k+1),labelGroups.index[k])
    print('*'*40)
    top = train_df.loc[train_df.label_group==labelGroups.index[k]]
    display_image(top, random=False, ROWS=1, COLS=4, path = BASE_PATH + 'train_images/')

In [None]:
train_df['title'].describe()

In [None]:
top_frequent_title_df = train_df[train_df['title'] == 'Koko syubbanul muslimin koko azzahir koko baju']
top_frequent_title_df



### 2.5 Top Frequent Title and Label groups

In [None]:

# create a 2 x 3 grid of subplots
figure, ax = plt.subplots(nrows=2, ncols=3, figsize=(20,10))
ax = ax.flatten()

for idx,imageIndexId in enumerate(top_frequent_title_df.index[:6]):
    imageId = train_df.loc[imageIndexId]['image']
    target_label = train_df.loc[imageIndexId]['label_group']
    ax[idx].imshow(im.imread("../input/shopee-product-matching/train_images/{}".format(imageId)).squeeze())
    ax[idx].title.set_text("Label: {}".format(target_label))
    


### 2.6 Top Frequent imageHash and labelGroups


In [None]:
top_frequent_image_hash = train_df[train_df['image_phash'] == 'fad28daa2ad05595' ] 
top_frequent_image_hash = top_frequent_image_hash.sort_values(by='title')
top_frequent_image_hash
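
The cell above matches postings whose image_phash is exactly equal. Perceptual hashes are usually compared by Hamming distance instead, so that near-identical images can also be found. The sketch below assumes the hashes are fixed-length hexadecimal strings, as the value above suggests; the second hash is hypothetical and the helper name is ours.

```python
def phash_hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the bits that differ between two hex-encoded perceptual hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree; count those bits.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


query_hash = "fad28daa2ad05595"   # hash used in the cell above
near_hash = "fad28daa2ad05594"    # hypothetical hash with the last bit flipped

print(phash_hamming_distance(query_hash, query_hash))  # 0
print(phash_hamming_distance(query_hash, near_hash))   # 1
```

A small distance suggests visually similar images; the threshold for "similar enough" would have to be chosen by experiment.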

In [None]:

# create a 1 x 3 grid of subplots
figure, ax = plt.subplots(nrows=1, ncols=3, figsize=(20,10))
# flatten the index so we can easily put image using indexing
ax = ax.flatten()

for idx,imageIndexId in enumerate(top_frequent_image_hash.index[:3]):
    
    
    imageId = train_df.loc[imageIndexId]['image']
    ax[idx].imshow(im.imread("../input/shopee-product-matching/train_images/{}".format(imageId)).squeeze())
    ax[idx].title.set_text("imageId: {}".format(imageId))
    

### 2.7 Top Frequent image url and label group


In [None]:
top_frequent_image_url = train_df[train_df['image'] == "0cca4afba97e106abd0843ce72881ca4.jpg"]
top_frequent_image_url

The table above shows that the same image appears under two different label_group values (2403374241 and 4198148727).
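
To look for every image with this kind of conflict rather than a single file, one option is to count distinct label_group values per image. The sketch below uses a toy data frame with hypothetical file names; only the two label_group values come from the table above.

```python
import pandas as pd

# Toy stand-in for train.csv; file names are hypothetical.
toy_df = pd.DataFrame({
    "image": ["x.jpg", "x.jpg", "y.jpg", "z.jpg", "z.jpg"],
    "label_group": [2403374241, 4198148727, 111, 222, 222],
})

# Number of distinct label_group values each image file appears under.
groups_per_image = toy_df.groupby("image")["label_group"].nunique()
conflicting_images = groups_per_image[groups_per_image > 1].index.tolist()

print(conflicting_images)  # ['x.jpg']
```

An image that repeats within one label_group (z.jpg here) is not flagged, because it has only one distinct group.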

In [None]:
# create a 1 x 5 grid of subplots
figure, ax = plt.subplots(nrows=1, ncols=5, figsize=(15,15))
ax = ax.flatten()

for idx,imageIndexId in enumerate(top_frequent_image_url.index[:5]):
    imageId = train_df.loc[imageIndexId]['image']
    label_group = train_df.loc[imageIndexId]['label_group']
    ax[idx].imshow(im.imread("../input/shopee-product-matching/train_images/{}".format(imageId)).squeeze())
    ax[idx].title.set_text("label_group: {}".format(label_group))
    

In [None]:
top_frequent_image_group_by_labels = top_frequent_image_url.groupby('label_group')
top_frequent_image_group_by_labels.groups

In [None]:
filtered_2403374241 = train_df[train_df['label_group'] == 2403374241]
filtered_2403374241

In [None]:
filtered_4198148727 = train_df[train_df['label_group'] == 4198148727]
filtered_4198148727



# 3. Convolutional Autoencoder

In [None]:
# Plan of action
from matplotlib.pyplot import figure

figure(figsize=(25, 20), dpi=80)
plt.imshow(plt.imread("../input/cbir-pipelin/CBIR.jpeg") )

In [None]:

figure(figsize=(15, 10), dpi=80)
plt.imshow(plt.imread("../input/encoder-decoder-arch/Encoder_decoder.png"))


## 3.1 Dataset Creation and preprocessing


In [None]:
# train_df = pd.read_csv(BASE_PATH+'train.csv')
train_df.head(10)

In [None]:
"""
Preprocess the images to feed into the convolutional neural network
1) read the image file
2) decode the JPEG bytes
3) resize
4) cast to float32 and scale to [0, 1]
"""
def preprocessImages(path,_):
    
    path = TRAIN_PATH + path
    # read the raw file contents using tf.io
    image = tf.io.read_file(path)
    # decode the JPEG bytes into a uint8 tensor with 3 channels
    image = tf.image.decode_jpeg(image, channels=3)
    # resize the image; tf.image.resize expects [height, width]
    image = tf.image.resize(image, [INP_HEIGHT,INP_WIDTH])
    # cast to float32 and scale pixel values to [0, 1]
    image = tf.cast(image, tf.float32)/255.0
    
    # an autoencoder reconstructs its input, so the image is also the target
    return image, image

    
    

In [None]:
# create train dataset 
train_dataset = tf.data.Dataset.from_tensor_slices((train_df['image'].values, train_df['label_group'].values))
# preprocess the train dataset to feed into deep learning model
train_dataset = train_dataset.map(preprocessImages)

## 3.2 Visualize dataset




In [None]:
images = next(iter(train_dataset))
plt.imshow(images[0])

In [None]:
# convert dataset to batches
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(buffer_size = tf.data.experimental.AUTOTUNE)




## 3.3 Build Model




In [None]:
filters =[16,32,64]

def build_convolution_auto_encoder(input_size, filters=[16,32,64]):
    # create input layer
    inputs = layers.Input(shape=input_size)
    
    # Encoder block, repeated once per entry in filters:
    # Conv -> BatchNorm -> MaxPool
    # iterate over filters, pass the input through the layers and get the latent features
    for idx,_filter in enumerate(filters):
        if idx==0:
            latent_features = layers.Conv2D(filters=_filter, kernel_size=(3,3), padding='same', activation='relu')(inputs)
        else:
            latent_features = layers.Conv2D(filters=_filter, kernel_size=(3,3), padding='same', activation='relu')(latent_features)
        latent_features= layers.BatchNormalization()(latent_features)
        latent_features = layers.MaxPooling2D(pool_size=(2,2), padding="same")(latent_features)
        
    
    
    # Decoder: iterate over the filters in reverse order and use
    # convolution followed by upsampling (not transposed convolution)
    # to reconstruct the input image from the latent features
    
    # Conv -> UpSampling
    for idx,_filter in enumerate(reversed(filters)):
        if idx==0:
            decoded_features = layers.Conv2D(filters=_filter,kernel_size=(3,3), padding="same", activation="relu")(latent_features)
        else: 
            decoded_features = layers.Conv2D(filters=_filter,kernel_size=(3,3), padding="same", activation="relu")(decoded_features)
        decoded_features = layers.UpSampling2D(size = (2, 2))(decoded_features)
        
    decoded_features = layers.Conv2D(filters = 3, kernel_size = (3, 3), padding = "same", activation = "sigmoid")(decoded_features)
    
    
    # Encoder Part model only
    encoder_model  = models.Model(inputs=inputs, outputs = latent_features ) 
    # Encoder Decoder model
    encoder_decoder = models.Model(inputs = inputs, outputs = decoded_features)
    encoder_decoder.compile(optimizer = "Adam", loss = "binary_crossentropy")
    return encoder_decoder, encoder_model
    
    

In [None]:
tf.keras.backend.clear_session()
encoder_decoder, encoder = build_convolution_auto_encoder((256, 256, 3))
encoder_decoder.summary()
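
The summary can be cross-checked by hand. Each encoder block keeps the spatial size in its convolution (padding "same") and halves it in the pooling layer, so the latent shape follows from simple arithmetic. The helper below is a sketch of that calculation only; its name is ours and it does not touch the Keras model.

```python
import math

def latent_shape(height, width, filters=(16, 32, 64), pool=2):
    """Encoder output shape: every block halves H and W, rounding up
    because the pooling layers use padding="same"."""
    for _ in filters:
        height = math.ceil(height / pool)
        width = math.ceil(width / pool)
    return height, width, filters[-1]

h, w, c = latent_shape(256, 256)
print((h, w, c), h * w * c)  # (32, 32, 64) 65536
```

A 256 x 256 x 3 input holds 196,608 values, so the flattened latent vector of 65,536 values is one third of the input size.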

In [None]:
IS_TRAINING = False
# Fit Model
# use callbacks
if IS_TRAINING:
    history = encoder_decoder.fit(
        train_dataset, epochs = 3,
        callbacks = [
            # Keras reports the training loss under the key "loss"
            tf.keras.callbacks.EarlyStopping(monitor = "loss", patience = 3, mode = "min"),
            tf.keras.callbacks.ModelCheckpoint(filepath = "encoder_decoder.h5", monitor = "loss", mode = "min", save_best_only = True, save_weights_only = True)
        ]
    )
    # Save model to disk
    encoder_decoder.save_weights('encoder_decoder.h5')
    encoder.save_weights('encoder.h5')
else:
    encoder_decoder.load_weights('../input/output-model/encoder_decoder.h5')
    encoder.load_weights('../input/output-model/encoder.h5')

## 3.4 Visualize Predictions



In [None]:
def visualize_predictions(predictions,truth, samples=5):
    # initialize our list of output images
    outputs = None
#     tf.enable_eager_execution()
    # loop over our number of output samples
    for i in range(0, samples):
        # grab the original image and reconstructed image
        true_image = (truth[i].numpy() * 255).astype("uint8")
        predicted_image = (predictions[i] * 255).astype("uint8")

        # stack the original and reconstructed image side-by-side
        output = np.hstack([true_image, predicted_image])

        # if the outputs array is empty, initialize it as the current
        # side-by-side image display
        if outputs is None:
            outputs = output

        # otherwise, vertically stack the outputs
        else:
            outputs = np.vstack([outputs, output])

    # return the output images
#     plt.imshow(outputs)
    return outputs
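
The stacking logic in visualize_predictions can be checked without a trained model. The sketch below uses random arrays as stand-ins for the original and reconstructed images and confirms the shape of the resulting grid.

```python
import numpy as np

samples, height, width = 5, 256, 256
# Random uint8 arrays stand in for the original and reconstructed images.
rng = np.random.default_rng(0)
originals = rng.integers(0, 256, size=(samples, height, width, 3), dtype=np.uint8)
reconstructions = rng.integers(0, 256, size=(samples, height, width, 3), dtype=np.uint8)

# hstack joins along the width axis, vstack along the height axis.
rows = [np.hstack([o, r]) for o, r in zip(originals, reconstructions)]
grid = np.vstack(rows)

print(grid.shape)  # (1280, 512, 3)
```

Each row of the grid is one original next to its reconstruction, so five samples give a 1280 x 512 image.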

In [None]:
# create samples and prepare to array

random_sample = []
predictions_sample = []
for idx, batch in enumerate(train_dataset):
    # Extract first 5 sample from 1st batch
    random_sample=list(batch[0][:5,:])
    # get prediction for first 5 sample from 1st batch
    predictions_sample = list(encoder_decoder.predict(batch[0][:5,:]))
    break


In [None]:

vis = visualize_predictions(predictions_sample, random_sample)
# write image to output (OpenCV expects BGR channel order, so convert from RGB)
cv2.imwrite('viz.jpeg', cv2.cvtColor(vis, cv2.COLOR_RGB2BGR))
plt.figure(figsize=(15, 30))
plt.imshow(vis)

## 3.5  Observation

After a single epoch of training, the autoencoder already reconstructs images that resemble the originals, but the reconstructions are blurry. Possible causes are the pixel-wise reconstruction loss (the model above is compiled with binary cross-entropy, not MSE), the architecture, and training issues.
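
The model in section 3.3 is compiled with binary cross-entropy. To make the difference from MSE concrete, the NumPy sketch below evaluates both losses on a perfectly reconstructed grey patch. The function names and the clipping constant are illustrative assumptions, not the Keras implementation.

```python
import numpy as np

def mse(target, prediction):
    return float(np.mean((target - prediction) ** 2))

def binary_crossentropy(target, prediction, eps=1e-7):
    # Clip so log() never sees exactly 0 or 1.
    p = np.clip(prediction, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

# A mid-grey patch that is reconstructed perfectly.
target = np.full((4, 4), 0.5)
perfect = target.copy()

print(mse(target, perfect))                  # 0.0
print(binary_crossentropy(target, perfect))  # about 0.693 (= ln 2)
```

Both losses are minimised when the prediction equals the target, but binary cross-entropy does not reach zero for pixel values strictly between 0 and 1, so its absolute value should not be read the same way as an MSE value.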


# 4. Load Models and Create Indexing




In [None]:
tf.keras.backend.clear_session()
encoder_decoder, encoder = build_convolution_auto_encoder((256, 256, 3))
encoder_decoder.load_weights('../input/output-model/encoder_decoder.h5')
encoder.load_weights('../input/output-model/encoder.h5')

In [None]:
def preprocessImages(path,index):
    
    path = TRAIN_PATH + path
    # read the raw file contents using tf.io
    image = tf.io.read_file(path)
    # decode the JPEG bytes into a uint8 tensor with 3 channels
    image = tf.image.decode_jpeg(image, channels=3)
    # resize the image; tf.image.resize expects [height, width]
    image = tf.image.resize(image, [INP_HEIGHT,INP_WIDTH])
    # cast to float32 and scale pixel values to [0, 1]
    image = tf.cast(image, tf.float32)/255.0
    
    return image, index
# create train dataset 
train_dataset = tf.data.Dataset.from_tensor_slices((train_df['image'].values, train_df.index.values))
# preprocess the train dataset to feed into deep learning model
train_dataset = train_dataset.map(preprocessImages)
    
    

In [None]:
train_dataset_sample = train_df.sample(n = 1000,random_state=10)
indexes=train_dataset_sample.index
train_dataset_sample = tf.data.Dataset.from_tensor_slices((train_dataset_sample['image'].values, train_dataset_sample.index.values))
train_dataset_sample = train_dataset_sample.map(preprocessImages)


In [None]:
# encode the sampled training images with the encoder, batch by batch
from tqdm import tqdm
test_encoded = []

for image in tqdm(train_dataset_sample.batch(64)):
    batch_size = image[0].shape[0]
    encoded = encoder.predict(image[0])
    test_encoded.append(encoded.reshape(batch_size, -1))
    
test_encoded = np.concatenate(test_encoded, axis = 0)
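
The batch, flatten, and concatenate pattern above can be tried without the trained encoder. In the sketch below, fake_encoder is a hypothetical stand-in (2 x 2 average pooling) for encoder.predict, and the image sizes are kept tiny so the example runs quickly.

```python
import numpy as np

def fake_encoder(batch):
    """Hypothetical stand-in for encoder.predict: 2x2 average pooling."""
    n, h, w, c = batch.shape
    return batch.reshape(n, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))

images = np.random.default_rng(0).random((10, 8, 8, 3), dtype=np.float32)
batch_size = 4

encoded_batches = []
for start in range(0, len(images), batch_size):
    batch = images[start:start + batch_size]      # the last batch may be smaller
    feature_maps = fake_encoder(batch)            # shape (n, 4, 4, 3)
    # flatten each feature map into one row vector per image
    encoded_batches.append(feature_maps.reshape(len(batch), -1))

features = np.concatenate(encoded_batches, axis=0)
print(features.shape)  # (10, 48)
```

Row i of features corresponds to image i, which is why the saved index can pair features with data frame indexes by position.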
    


In [None]:
data_dict = {"indexes": indexes, "features": test_encoded}

# write the data dictionary to disk
print("[INFO] saving index...")
with open('indexing1.pickle', "wb") as f:
    pickle.dump(data_dict, f)
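
A quick round trip is a cheap way to confirm that a saved index loads back unchanged. The sketch below writes a toy index to a temporary directory; the file name and values are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

index_to_save = {
    "indexes": np.array([7, 42, 99]),   # hypothetical row ids
    "features": np.arange(12, dtype=np.float32).reshape(3, 4),
}

path = os.path.join(tempfile.mkdtemp(), "toy_index.pickle")

# The context manager closes the file even if pickling raises.
with open(path, "wb") as f:
    pickle.dump(index_to_save, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["features"].shape)  # (3, 4)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.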

In [None]:
def euclidean(a, b):
    return np.linalg.norm(a - b)

# 5. Visualize Top Similar Products

In [None]:
def search_similar_images(queryFeatures, index, maxResults=5):
    results=[]
    
    # loop over our index
    for i in range(0, len(index["features"])):
        # compute the Euclidean distance between the query features
        # and the features of the current image in the index
        dist = euclidean(queryFeatures, index["features"][i])
        results.append((dist, index['indexes'][i]))

    # sort the results and grab the top ones
    results = sorted(results)[:maxResults]

    # return the list of results
    return results
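
The loop above computes one distance per Python iteration. A vectorized variant, sketched below under the assumption that the index features form a 2-D array with one row per image, should give the same ranking (ties aside) with a single NumPy call. The function name and the toy index are ours.

```python
import numpy as np

def search_similar_vectorized(query_features, index, max_results=5):
    """Rank indexed images by Euclidean distance to the query."""
    features = np.asarray(index["features"])
    # Broadcasting subtracts the query from every row; the norm over axis 1
    # gives one Euclidean distance per indexed image.
    dists = np.linalg.norm(features - np.ravel(query_features), axis=1)
    order = np.argsort(dists, kind="stable")[:max_results]
    return [(float(dists[i]), index["indexes"][i]) for i in order]

toy_index = {
    "indexes": [10, 11, 12, 13],   # hypothetical row ids
    "features": np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [6.0, 8.0]]),
}
query = np.array([[0.0, 0.0]])     # shape (1, n_features), like feature_query

print(search_similar_vectorized(query, toy_index, max_results=3))
# [(0.0, 10), (1.0, 12), (5.0, 11)]
```

For a larger index, a dedicated nearest-neighbour library may be a better fit than a full distance scan.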

In [None]:
# load the saved index used to search for similar images
with open('../input/indexingfile/indexing.pickle', "rb") as f:
    index = pickle.load(f)


    


In [None]:
train_indexes = index['indexes']
train_Features = index['features']

indexes_Features=dict()
for k,v in  zip(train_indexes,train_Features):
    indexes_Features[k]=v

In [None]:

# read the file using tf.io
qimage = tf.io.read_file('../input/shopee-product-matching/train_images/0000a68812bc7e98c42888dfb1c07da0.jpg')
# decode the JPEG bytes into a uint8 tensor with 3 channels
qimage = tf.image.decode_jpeg(qimage, channels=3)
# resize the image; tf.image.resize expects [height, width]
qimage = tf.image.resize(qimage, [INP_HEIGHT,INP_WIDTH])
# cast to float32 and scale pixel values to [0, 1]
qimage = tf.cast(qimage, tf.float32)/255.0

In [None]:
qimage =tf.expand_dims(qimage, 0)
qimage = encoder.predict(qimage)
feature_query = qimage.reshape(1, -1)
feature_query.shape

In [None]:
def search_similar_images(queryFeatures, index, maxResults=5):
    results=[]
    
    # loop over our index
    for i in range(0, len(index["features"])):
        # compute the Euclidean distance between the query features
        # and the features of the current image in the index
        dist = euclidean(queryFeatures, index["features"][i])
        # debug output: print the distance to the image at train index 13168
        if index['indexes'][i] == 13168:
            print(dist)
        results.append((dist, index['indexes'][i]))

    # sort the results and grab the top ones
    results = sorted(results)[:maxResults]

    # return the list of results
    return results

In [None]:
results = search_similar_images(feature_query,index)

In [None]:
results
# the query image (0000a68812bc7e98c42888dfb1c07da0.jpg) is taken to be row 0 of train_df,
# so it goes first in the list of indexes to display
res_index= [0]
res_index.extend([i[1] for i in results])
distance_matrix=[(0,0)]
distance_matrix.extend(results)
print(res_index)
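
Because label_group is the ground truth, the retrieved indexes can also be scored instead of only inspected by eye. The sketch below computes the fraction of retrieved items that share the query's label_group; the function name and the example group ids are hypothetical.

```python
def precision_at_k(query_group, retrieved_groups):
    """Fraction of retrieved items whose label_group equals the query's."""
    if not retrieved_groups:
        return 0.0
    hits = sum(1 for g in retrieved_groups if g == query_group)
    return hits / len(retrieved_groups)

# Hypothetical label_group values for a query and its five nearest neighbours.
print(precision_at_k(1, [1, 2, 1, 3, 4]))  # 0.4
```

With the notebook's variables, the group of each retrieved index could be looked up in the label_group column of train_df before calling this helper.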

In [None]:
# take indexes
sample_indexes = res_index
# set seed 
np.random.seed(100)

# create a 2 x 3 grid of subplots
figure, ax = plt.subplots(nrows=2, ncols=3, figsize=(20,15))
ax = ax.flatten()

for idx,data in enumerate(distance_matrix):
    img_id = train_df.loc[data[1]]["image"]
    ax[idx].imshow(im.imread("../input/shopee-product-matching/train_images/{}".format(img_id)).squeeze())
    
    if idx == 0:
        ax[idx].title.set_text("Query Image :   dist {}".format( data[0]))
    else:
        ax[idx].title.set_text("prediction Image :   dist {}".format( data[0]) )                      
        