# Duplicate Product Detection from E-commerce Data

## Preview 

This notebook explains the entire project, from data preprocessing to the intuition behind the decisions taken to find duplicates. The index is below:

1. [Overview](#Overview)
2. [Exploratory Analysis and Data Cleaning](#Exploratory-Analysis-and-Data-Cleaning)
3. [Finding Duplicates using Locality Sensitive Hashing and Cosine Similarity of Images](#Finding-Duplicates-using-Locality-Sensitive-Hashing-and-Cosine-Similarity)
4. [Results](#Results)
5. [Further Improvements](#Further-Improvements)
6. [References](#References)


## Overview

As the number of sellers on e-commerce platforms grows across a wide range of products, there is a need to filter out redundant listings so that customers can shop easily. These redundancies can arise from differences in product colour, size, design or nomenclature. While removing duplicates, it is essential to account for the degree of similarity between products and to keep the results accurate. I have used Locality Sensitive Hashing (LSH), the VGG-19 neural network and cosine similarity to identify redundancies in the dataset. Cataloguing products efficiently in this way aims to give customers a user-friendly, smooth experience as they shop for their favourite products online!


The following points define a duplicate product:

> A product is a duplicate if it has the same image as another product.

> The same product can have different prices, sizes, colors and sellers.

> A product can be a duplicate only within the same brand.



## Exploratory Analysis and Data Cleaning

In [5]:
import os, string, re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

pd.set_option('display.max_columns', 35)

In [6]:
df = pd.read_csv('2oq-c1r.zip')
df.shape



(4057189, 32)

In [7]:
df.columns

Index(['productId', 'title', 'description', 'imageUrlStr', 'mrp',
       'sellingPrice', 'specialPrice', 'productUrl', 'categories',
       'productBrand', 'productFamily', 'inStock', 'codAvailable', 'offers',
       'discount', 'shippingCharges', 'deliveryTime', 'size', 'color',
       'sizeUnit', 'storage', 'displaySize', 'keySpecsStr', 'detailedSpecsStr',
       'specificationList', 'sellerName', 'sellerAverageRating',
       'sellerNoOfRatings', 'sellerNoOfReviews', 'sleeve', 'neck', 'idealFor'],
      dtype='object')

In [8]:
df = df.loc[df.categories.fillna('').str.contains('Top')]
df.shape

(648722, 32)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 648722 entries, 18237 to 4026363
Data columns (total 32 columns):
productId              648722 non-null object
title                  648717 non-null object
description            418353 non-null object
imageUrlStr            647463 non-null object
mrp                    648722 non-null float64
sellingPrice           644333 non-null float64
specialPrice           644333 non-null float64
productUrl             648722 non-null object
categories             648722 non-null object
productBrand           648719 non-null object
productFamily          642954 non-null object
inStock                648722 non-null bool
codAvailable           648722 non-null bool
offers                 0 non-null float64
discount               648630 non-null float64
shippingCharges        642961 non-null float64
deliveryTime           307966 non-null object
size                   648711 non-null object
color                  645033 non-null object
sizeUnit     

> A few columns contain only NaN values. These need to be removed.

In [10]:
df = df.dropna(how='all', axis = 1)
df.shape

(648722, 27)

In [11]:
df.head()

Unnamed: 0,productId,title,description,imageUrlStr,mrp,sellingPrice,specialPrice,productUrl,categories,productBrand,productFamily,inStock,codAvailable,discount,shippingCharges,deliveryTime,size,color,sizeUnit,keySpecsStr,detailedSpecsStr,sellerName,sellerAverageRating,sellerNoOfRatings,sellerNoOfReviews,sleeve,neck
18237,TUNE9CG3ZHRAYBFY,Vea Kupia Printed Women's Tunic,,http://img.fkcdn.com/image/tunic/b/f/y/1-1-342...,1399.0,802.0,761.0,http://dl.flipkart.com/dl/vea-kupia-printed-wo...,"Apparels>Women>Western Wear>Shirts, Tops & Tun...",Vea Kupia,"TUNE9CG3CDTCKVHK,TUNE9CG3GT2KMM2R",False,True,42.0,0.0,,L,Black,,Occasion: Casual;Fabric: Georgette;Hemline: Hi...,Occasion: Casual;Fabric: Georgette;Hemline: Hi...,Shruti,4.1,542,4,Full Sleeve,Shirt Collar
18238,TUNE9CG3CDTCKVHK,Vea Kupia Printed Women's Tunic,,http://img.fkcdn.com/image/tunic/b/f/y/1-1-342...,1399.0,802.0,761.0,http://dl.flipkart.com/dl/vea-kupia-printed-wo...,"Apparels>Women>Western Wear>Shirts, Tops & Tun...",Vea Kupia,"TUNE9CG3GT2KMM2R,TUNE9CG3ZHRAYBFY",False,True,42.0,0.0,,M,Black,,Occasion: Casual;Fabric: Georgette;Hemline: Hi...,Occasion: Casual;Fabric: Georgette;Hemline: Hi...,Shruti,4.1,542,4,Full Sleeve,Shirt Collar
18239,TUNEYFGDFZTBSDPZ,U&F Solid Women's Tunic,"If you are hip, stylish, with loads of attitud...",http://img.fkcdn.com/image/tunic/n/b/u/14193-2...,1299.0,469.0,445.0,http://dl.flipkart.com/dl/u-f-solid-women-s-tu...,"Apparels>Women>Western Wear>Shirts, Tops & Tun...",U&F,"TUNEYFGBATZHYCV7,TUNEYFGBG7ZDYNBU,TUNEYFGCAFA8...",False,True,66.0,0.0,,L,Aqua,,Occasion: Casual;Fabric: Cotton;Sleeve: 3/4th ...,Occasion: Casual;Fabric: Cotton;Sleeve: 3/4th ...,Sunil Jain,3.8,997,8,3/4th Sleeve,
18240,TUNEYFGDUW3WSS5Z,U&F Solid Women's Tunic,"If you are hip, stylish, with loads of attitud...",http://img.fkcdn.com/image/tunic/n/b/u/14193-2...,1299.0,469.0,469.0,http://dl.flipkart.com/dl/u-f-solid-women-s-tu...,"Apparels>Women>Western Wear>Shirts, Tops & Tun...",U&F,"TUNEYFGBATZHYCV7,TUNEYFGBG7ZDYNBU,TUNEYFGCAFA8...",False,True,63.0,0.0,,S,Aqua,,Occasion: Casual;Fabric: Cotton;Sleeve: 3/4th ...,Occasion: Casual;Fabric: Cotton;Sleeve: 3/4th ...,Sunil Jain,3.8,999,8,3/4th Sleeve,
18241,TUNEYFGBEKUYYXKZ,U&F Solid Women's Tunic,"If you are hip, stylish, with loads of attitud...",http://img.fkcdn.com/image/tunic/h/f/j/14190-6...,1299.0,379.0,379.0,http://dl.flipkart.com/dl/u-f-solid-women-s-tu...,"Apparels>Women>Western Wear>Shirts, Tops & Tun...",U&F,"TUNEYFGB4MXJK2KV,TUNEYFGB6ZKK5BP4,TUNEYFGBF5GR...",False,True,70.0,0.0,,M,Olive Green,,Occasion: Casual;Fabric: Cotton;Sleeve: 3/4th ...,Occasion: Casual;Fabric: Cotton;Sleeve: 3/4th ...,Sunil Jain,3.8,999,8,3/4th Sleeve,


In [12]:
#Counting maximum productFamily values in a row
df.productFamily.str.count(',').max()+1

428.0

In [13]:
df.isnull().sum()

productId                   0
title                       5
description            230369
imageUrlStr              1259
mrp                         0
sellingPrice             4389
specialPrice             4389
productUrl                  0
categories                  0
productBrand                3
productFamily            5768
inStock                     0
codAvailable                0
discount                   92
shippingCharges          5761
deliveryTime           340756
size                       11
color                    3689
sizeUnit               638420
keySpecsStr              7863
detailedSpecsStr        10466
sellerName               5768
sellerAverageRating         0
sellerNoOfRatings           0
sellerNoOfReviews           0
sleeve                  13033
neck                   282004
dtype: int64

> Size unit has too many null values

In [14]:
(df['keySpecsStr']==df['detailedSpecsStr']).sum()

637976

> `keySpecsStr` and `detailedSpecsStr` have the same values in most rows. `detailedSpecsStr` has a higher number of null values.
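
One caveat worth keeping in mind: an element-wise `==` treats two missing values as different, so the count above may exclude rows where both columns are null. A minimal sketch on made-up data (the series `a` and `b` are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Two toy columns; the second row is missing in both.
a = pd.Series(["x", np.nan, "z"])
b = pd.Series(["x", np.nan, "y"])

# Plain equality: NaN == NaN evaluates to False.
naive_equal = int((a == b).sum())

# NaN-aware equality: also count rows where both values are missing.
nan_aware_equal = int(((a == b) | (a.isna() & b.isna())).sum())

print(naive_equal, nan_aware_equal)
```

Whether rows that are null in both columns should count as "the same" is a judgment call; the sketch only shows how the two counts can differ.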

In [15]:
df.iloc[:5]['sleeve']

18237     Full Sleeve
18238     Full Sleeve
18239    3/4th Sleeve
18240    3/4th Sleeve
18241    3/4th Sleeve
Name: sleeve, dtype: object

In [16]:
df.loc[df.neck.notnull(), ['neck', 'keySpecsStr']].head().values.tolist()

[['Shirt Collar',
  'Occasion: Casual;Fabric: Georgette;Hemline: High-low;Sleeve: Full Sleeve;Pattern: Printed'],
 ['Shirt Collar',
  'Occasion: Casual;Fabric: Georgette;Hemline: High-low;Sleeve: Full Sleeve;Pattern: Printed'],
 ['Fashion Neck',
  'Occasion: Casual;Fabric: 100% Polyester;Sleeve: 3/4th Sleeve;Pattern: Solid'],
 ['Constructed Stitch design on neck',
  'Occasion: Casual;Fabric: 100% Khadi Cotton;Hemline: High-low;Sleeve: Fashion Sleeve;Pattern: Solid'],
 ['Wide scoop neck',
  'Occasion: Casual;Fabric: 100% Khadi Cotton;Hemline: Diagonal;Sleeve: Half Sleeve;Pattern: Floral Print']]

The `neck` column holds the same kind of information as `keySpecsStr`, so we can combine both columns into a single one.

In [17]:
concat_neck = df['keySpecsStr']+';Neck: '+df['neck']
df['keySpecsStr'] = np.where(df.neck.notnull(), concat_neck, df['keySpecsStr'])
df.loc[df.neck.notnull(), ['neck', 'keySpecsStr']].head().values.tolist()

[['Shirt Collar',
  'Occasion: Casual;Fabric: Georgette;Hemline: High-low;Sleeve: Full Sleeve;Pattern: Printed;Neck: Shirt Collar'],
 ['Shirt Collar',
  'Occasion: Casual;Fabric: Georgette;Hemline: High-low;Sleeve: Full Sleeve;Pattern: Printed;Neck: Shirt Collar'],
 ['Fashion Neck',
  'Occasion: Casual;Fabric: 100% Polyester;Sleeve: 3/4th Sleeve;Pattern: Solid;Neck: Fashion Neck'],
 ['Constructed Stitch design on neck',
  'Occasion: Casual;Fabric: 100% Khadi Cotton;Hemline: High-low;Sleeve: Fashion Sleeve;Pattern: Solid;Neck: Constructed Stitch design on neck'],
 ['Wide scoop neck',
  'Occasion: Casual;Fabric: 100% Khadi Cotton;Hemline: Diagonal;Sleeve: Half Sleeve;Pattern: Floral Print;Neck: Wide scoop neck']]

`keySpecsStr` is in a dictionary-like `key: value` format. This column contains valuable information for differentiating products, so let's extract only the values and drop the repeated keys.

In [18]:
# Raw string so that \s and \- are passed to the regex engine unchanged
key_df = df['keySpecsStr'].str.extractall(r':\s?([A-Za-z0-9\s/\-%]*)').unstack().fillna('')
key_df.shape

(639864, 8)

In [19]:
key_df.head()

Unnamed: 0_level_0,0,0,0,0,0,0,0,0
match,0,1,2,3,4,5,6,7
18237,Casual,Georgette,High-low,Full Sleeve,Printed,Shirt Collar,,
18238,Casual,Georgette,High-low,Full Sleeve,Printed,Shirt Collar,,
18239,Casual,Cotton,3/4th Sleeve,Solid,,,,
18240,Casual,Cotton,3/4th Sleeve,Solid,,,,
18241,Casual,Cotton,3/4th Sleeve,Solid,,,,
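
To make the extraction step easier to follow, here is a small sketch that applies the same regular expression to two made-up spec strings and joins the extracted values into one text field. The sample strings and the names `specs`, `values` and `spec_text` are assumptions for illustration only.

```python
import pandas as pd

# Toy "key: value;key: value" strings in the same shape as keySpecsStr.
specs = pd.Series(
    [
        "Occasion: Casual;Fabric: Georgette;Neck: Shirt Collar",
        "Occasion: Casual;Fabric: 100% Cotton",
    ],
    index=["A", "B"],
)

# Capture only what follows each colon, one row per match,
# then pivot the matches into columns and blank out the gaps.
values = (
    specs.str.extractall(r":\s?([A-Za-z0-9\s/\-%]*)")[0]
    .unstack()
    .fillna("")
)

# Join the value columns into a single space-separated string per row.
spec_text = values.apply(" ".join, axis=1).str.strip()
print(spec_text.tolist())
```

Rows with fewer specs than the widest row are padded with empty strings, which is why the final `str.strip()` is needed.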


In [20]:
key_specs_text = ''
for col in key_df.select_dtypes('object'):
    key_specs_text += ' ' + key_df[col]
key_specs_text.iloc[:5]

18237     Casual Georgette High-low Full Sleeve Printed...
18238     Casual Georgette High-low Full Sleeve Printed...
18239                 Casual Cotton 3/4th Sleeve Solid    
18240                 Casual Cotton 3/4th Sleeve Solid    
18241                 Casual Cotton 3/4th Sleeve Solid    
Name: (0, 0), dtype: object

In [21]:
df['key_specs_text'] = key_specs_text.str.strip()

In [22]:
df.productBrand.nunique()

3444

In [23]:
df.productBrand.unique()[:10]

array(['Vea Kupia', 'U&F', 'Taurus', 'R Factor', 'Bohemian You', 'Elle',
       'Moda Vastra', 'India House', 'Highness Nyc', 'East West'],
      dtype=object)

On further analysis of the CSV files, there seem to be a few typos in the `productBrand` column. The following changes are based solely on intuition.

In [24]:
df['productBrand'] = df['productBrand'].replace({'Again? Vintage':'Again Vintage', '109°F': '109F', '20Dresses': '20 Dresses',
                                                '99Affair': '99 Affair', 'AKFASHION':'AK Fashion',
                                                'AVFASHION': 'AV Fashions', 'Adidas Originals': 'Adidas',
                                                'AltoModa by Pantaloons': 'Alto Moda by Pantaloons',
                                                'Being Human Clothing': 'Being Human',
                                                'ELYWOMEN': 'ELYwomen', 'F Fashionstylus': 'F Fashion Stylus',
                                                'IraSoleil': 'Ira Soleil', 'STYLESLAND': 'STYLES LAND',
                                               'SassyStripes': 'Sassy Stripes', 'Sassystripes': 'Sassy Stripes',
                                               'Sportelle USA  India':'Sportelle USA India',
                                               'Style Quotient by Noi': 'Style Quotient By Noi',
                                               'Superdry': 'Super Dry', 'YouthenX': 'Youthen', 'ZOEFEMME':'ZOE FEMME'})
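
The mapping above was put together by hand. As a rough aid for spotting such variants, one could group brand names that become identical once case, spaces and punctuation are ignored. This is only a sketch of that idea, not the method used in this notebook; the helper `brand_collisions` and the sample list are hypothetical.

```python
import re
from collections import defaultdict

def brand_collisions(brands):
    """Group brand spellings that match after lowercasing and dropping non-alphanumerics."""
    groups = defaultdict(set)
    for brand in brands:
        key = re.sub(r"[^a-z0-9]", "", brand.lower())
        groups[key].add(brand)
    # Keep only the keys that have more than one spelling.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}

sample_brands = ["SassyStripes", "Sassystripes", "Sassy Stripes",
                 "20Dresses", "20 Dresses", "Elle"]
collisions = brand_collisions(sample_brands)
print(collisions)
```

A check like this only surfaces candidates; deciding which spelling is canonical (or whether two brands really are the same) still needs a manual look.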

In [25]:
df.categories.nunique()

26

In [26]:
df.categories.unique()

array(['Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tunics',
       'Apparels>Women>Fusion Wear>Shirts, Tops & Tunics>Tunics',
       'Apparels>Women>Maternity Wear>Shirts, Tops & Tunics>Tunics',
       'Apparels>Women>Western Wear>Shirts, Tops & Tunics>Polos & T-Shirts',
       'Apparels>Women>Fusion Wear>Shirts, Tops & Tunics>Polos & T-Shirts',
       'Apparels>Kids>Girls>T-Shirts & Tops>T-Shirts',
       'Apparels>Women>Western Wear>Shirts, Tops & Tunics>Shirts',
       'Apparels>Women>Winter & Seasonal Wear>Thermals>Thermal Tops',
       'Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops',
       'Apparels>Women>Fusion Wear>Shirts, Tops & Tunics>Tops',
       'Apparels>Women>Fusion Wear>Shirts, Tops & Tunics>Shirts',
       'Apparels>Women>Maternity Wear>Shirts, Tops & Tunics>Shirts',
       'Apparels>Women>Western Wear>Shirts, Tops & Tunics>Kaftans',
       'Apparels>Women>Maternity Wear>Shirts, Tops & Tunics>Tops',
       'Apparels>Women>Sports & Gym Wear>Track Tops',


Let's extract the last subcategory (Thermal Tops, Shrugs, etc.) and the gender category (Kids, Women, Men).

In [27]:
df['sub_category1'] = df.categories.str.split('>').str[1]
df['sub_category2'] = df.categories.str.split('>').str[-1]


In [28]:
df.sub_category1.unique()

array(['Women', 'Kids', 'Men'], dtype=object)

In [29]:
df.sub_category2.unique()

array(['Tunics', 'Polos & T-Shirts', 'T-Shirts', 'Shirts', 'Thermal Tops',
       'Tops', 'Kaftans', 'Track Tops', 'Polos & T-Shirts (OLD)',
       'Bodysuits', 'Top', 'Shrugs'], dtype=object)

Encoding these Categories 

In [30]:
cat_dict = {'Women':1, 'Kids':2, 'Men': 3}
cat_dict

{'Women': 1, 'Kids': 2, 'Men': 3}

In [31]:
#encoding category2
cat2_dict = {v:i  for i, v in enumerate(df.sub_category2.unique(), 1)}
cat2_dict

{'Tunics': 1,
 'Polos & T-Shirts': 2,
 'T-Shirts': 3,
 'Shirts': 4,
 'Thermal Tops': 5,
 'Tops': 6,
 'Kaftans': 7,
 'Track Tops': 8,
 'Polos & T-Shirts (OLD)': 9,
 'Bodysuits': 10,
 'Top': 11,
 'Shrugs': 12}

In [32]:
df['sub_category2'] = df['sub_category2'].map(cat2_dict)
df['sub_category1'] = df['sub_category1'].map(cat_dict)


In [33]:
#Count maximum urls in a row
df.imageUrlStr.str.count(';').max()+1

14.0

In [34]:
df.imageUrlStr.unique()[:5]

array(['http://img.fkcdn.com/image/tunic/b/f/y/1-1-3422-vea-kupia-s-original-imae7sr4t7unbrwz.jpeg;http://img.fkcdn.com/image/tunic/b/f/y/1-1-3422-vea-kupia-s-200x200-imae7sr4t7unbrwz.jpeg;http://img.fkcdn.com/image/tunic/b/f/y/1-1-3422-vea-kupia-s-400x400-imae7sr4t7unbrwz.jpeg;http://img.fkcdn.com/image/tunic/b/f/y/1-1-3422-vea-kupia-s-800x800-imae7sr4t7unbrwz.jpeg',
       'http://img.fkcdn.com/image/tunic/n/b/u/14193-2-u-f-m-original-imaeymtcqatbrpgm.jpeg;http://img.fkcdn.com/image/tunic/n/b/u/14193-2-u-f-m-200x200-imaeymtcqatbrpgm.jpeg;http://img.fkcdn.com/image/tunic/n/b/u/14193-2-u-f-m-400x400-imaeymtcqatbrpgm.jpeg;http://img.fkcdn.com/image/tunic/n/b/u/14193-2-u-f-m-800x800-imaeymtcqatbrpgm.jpeg',
       'http://img.fkcdn.com/image/tunic/h/f/j/14190-6-u-f-s-original-imaeymtcdrfczvaw.jpeg;http://img.fkcdn.com/image/tunic/h/f/j/14190-6-u-f-s-200x200-imaeymtcdrfczvaw.jpeg;http://img.fkcdn.com/image/tunic/h/f/j/14190-6-u-f-s-400x400-imaeymtcdrfczvaw.jpeg;http://img.fkcdn.com/image/t

In [35]:
#extracting only the first URL
df['imageUrl'] = df.imageUrlStr.str.split(';').str[0]

In [36]:
#Considering only subcategory tops. Top and Tops are considered the same category
df = df[df['sub_category2'].isin([6, 11])]
df.shape

(347713, 31)

In [37]:
df.to_csv('Only Tops Data.csv', index = False)

In [108]:
df['imageUrl'].duplicated().sum()

259728

As per the definition, two products are duplicates if they have the same image. Let's find the productIds that share each image URL.

In [109]:
df1 = (df.groupby('imageUrl')
       .apply(lambda x: tuple(x.productId))
       .reset_index(name='dup_idx'))
df1.head()

Unnamed: 0,imageUrl,dup_idx
0,http://img.fkcdn.com/image/apparels-combo/2/f/...,"(TOPEG4YDVT9WX6FB, TOPEG4YD3FK3BFSF, TOPEG4YDY..."
1,http://img.fkcdn.com/image/apparels-combo/3/s/...,"(TOPEDBEXXFA89JWG,)"
2,http://img.fkcdn.com/image/apparels-combo/6/r/...,"(TOPEGSUWZRYGNVPS, TOPEGSUW67FUGTQS, TOPEGSUWF..."
3,http://img.fkcdn.com/image/apparels-combo/b/a/...,"(TOPEG4YDYDNYUVGE, TOPEG4YDFJUTPGZD, TOPEG4YD4..."
4,http://img.fkcdn.com/image/apparels-combo/c/g/...,"(TOPEAZU6K7UGGHYH, TOPEAZU6TBXZNAUX, TOPEAZU6Y..."


In [66]:
duplicate_products = {value[0]: list(value[1:]) for value in df1.dup_idx}


In [67]:
duplicate_products['TOPEG4YDVT9WX6FB']

['TOPEG4YD3FK3BFSF', 'TOPEG4YDYQRQTGVX', 'TOPEG4YDVSHHYFWD']

In [110]:
df[df.productId.isin(duplicate_products['TOPEG4YDVT9WX6FB'])]['imageUrl'].tolist()

['http://img.fkcdn.com/image/apparels-combo/2/f/a/ru-5116-5120-ruhaan-s-original-imaeg3szjzepcmwz.jpeg',
 'http://img.fkcdn.com/image/apparels-combo/2/f/a/ru-5116-5120-ruhaan-s-original-imaeg3szjzepcmwz.jpeg',
 'http://img.fkcdn.com/image/apparels-combo/2/f/a/ru-5116-5120-ruhaan-s-original-imaeg3szjzepcmwz.jpeg']
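
The grouping logic above can be checked on a tiny made-up frame. In this sketch the product IDs and URLs are invented; the first ID seen for each image URL becomes the key and the remaining IDs become its duplicates.

```python
import pandas as pd

toy = pd.DataFrame({
    "productId": ["P1", "P2", "P3", "P4"],
    "imageUrl": ["u1", "u1", "u2", "u1"],
})

# Collect the product IDs that share each image URL.
ids_per_image = toy.groupby("imageUrl")["productId"].apply(list)

# First ID of each group is the key; the rest are its duplicates.
toy_duplicates = {ids[0]: ids[1:] for ids in ids_per_image}
print(toy_duplicates)
```

Note that a product whose image is unique still appears as a key, mapped to an empty list.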

Let's save the `duplicate_products` dictionary to a pickle file. It will be combined with future results.

In [124]:
import pickle
with open('duplicates.pickle', 'wb') as f:
    pickle.dump(duplicate_products, f)

These products have the same URL, hence the same image, and are therefore the same product. Now we can drop the rows with duplicate `imageUrl` values.

In [111]:
df = df.drop_duplicates(subset='imageUrl')
df.shape

(87985, 30)

In [112]:
df.iloc[:2]['productUrl'].values.tolist()

['http://dl.flipkart.com/dl/citrine-casual-short-sleeve-printed-women-s-pink-white-top/p/itmenvftgmhgdu83?pid=TOPE9ABBZU3HZRHN',
 'http://dl.flipkart.com/dl/leelan-casual-short-sleeve-solid-women-s-black-top/p/itme9azp85xahk6g?pid=TOPE9AZZSMSZFYAM']

`productUrl` only holds the link to the product page and doesn't add any useful information.

### Dropping Unimportant columns

- **Columns related to the seller**: These don't add any useful information about the **product**. They include: shippingCharges, deliveryTime, sellerAverageRating, sellerNoOfRatings, sellerName, sellerNoOfReviews
- **Redundant columns**: The relevant data has already been extracted above. These include: categories, imageUrlStr, sleeve, neck, detailedSpecsStr, sub_category2 (Top and Tops are assumed to be the same category)
- **Other unimportant columns**: These are expected to vary for the same product and hence shouldn't be used. They include: size, sizeUnit, inStock, productUrl, discount, productFamily

In [113]:
delete_columns = ['size', 'sizeUnit', 'inStock', 'productUrl', 'shippingCharges', 'deliveryTime',
                  'sellerAverageRating', 'sellerNoOfRatings', 'sellerName',
                  'sellerNoOfReviews', 'codAvailable', 'detailedSpecsStr', 'neck',
                  'discount', 'categories', 'imageUrlStr', 'sleeve', 'sub_category2',
                  'productFamily']

df = df.drop(columns=delete_columns)

In [118]:
df.shape

(87983, 11)

In [115]:
df.productId.duplicated().sum()

2

In [None]:
df=df.drop_duplicates(subset='productId')

In [132]:
df.columns

Index(['productId', 'title', 'description', 'mrp', 'sellingPrice',
       'specialPrice', 'productBrand', 'color', 'keySpecsStr', 'sub_category1',
       'imageUrl', 'key_specs_text'],
      dtype='object')

In [117]:
df.head()

Unnamed: 0,productId,title,description,mrp,sellingPrice,specialPrice,productBrand,productFamily,color,keySpecsStr,sub_category1,imageUrl
100759,TOPE9ABBZU3HZRHN,Citrine Casual Short Sleeve Printed Women's Pi...,This beautiful printed modal top from Citrine ...,1099.0,329.0,329.0,Citrine,"TOPE9ABBBTJYDSQE,TOPE9ABBHJ8HGGGK,TOPE9ABBPDAN...","Pink, Off White","Round Neck, Short Sleeve;Fabric: Modal;Pattern...",1,http://img.fkcdn.com/image/top/r/h/n/1-1-wwtpw...
100761,TOPE9AZZSMSZFYAM,Leelan Casual Short Sleeve Solid Women's Black...,,524.0,262.0,262.0,Leelan,"TOPE9AZPDMGTYHKR,TOPE9AZPGGQ32HWG,TOPE9AZPJSS2...",Black,"Round Neck, Short Sleeve;Fabric: Cotton;Patter...",1,http://img.fkcdn.com/image/top/y/a/m/1-1-10009...
100770,TOPE6ZCYFCQ3H6EV,Cottinfab Casual Sleeveless Solid Women's Purp...,"Round neck, sleeveless stylish top with pack o...",1399.0,799.0,719.0,Cottinfab,"TOPE6ZCYRR54GY6S,TOPE6ZCYYJZFZGQS,TOPE6ZCYYUGJ...",Purple,"V-Neck, Sleeveless;Fabric: Knit;Pattern: Solid...",1,http://img.fkcdn.com/image/top/6/e/v/1-1-newds...
100773,TOPE6ZCYHTJEMZMW,Cottinfab Casual Sleeveless Solid Women's Purp...,"V Neck with black net on front yoke, sleeveles...",799.0,249.0,224.0,Cottinfab,"TOPE5K6B5DY95JP2,TOPE5K6B7JKVT2XJ,TOPE5K6BAFC5...",Purple,"V-Neck, Sleeveless;Fabric: Knit;Pattern: Solid...",1,http://img.fkcdn.com/image/top/x/7/r/1-1-newds...
100790,TOPE6XZPUVT9C7RU,Butterfly Wears Casual Short Sleeve Solid Wome...,,799.0,799.0,799.0,Butterfly Wears,"TOPE6XZP6GKPSFWZ,TOPE6XZPJRNVSAQS,TOPE6XZPVF6H...",Multicolor,"Round Neck, Short Sleeve;Fabric: Cotton;Patter...",1,http://img.fkcdn.com/image/top/y/h/c/5245-butt...


This is the data I will work with to find the remaining duplicate products.

In [120]:
df.to_csv('Final Data.csv', index=False)

## Finding Duplicates using Locality Sensitive Hashing and Cosine Similarity

### Locality Sensitive Hashing

**Locality Sensitive Hashing** (LSH) is an **approximation algorithm**. Approximate algorithms can be used to find nearest neighbours in a data set. They don't guarantee the exact answer, but they provide a good approximation while being **faster** and more **scalable** than an exact search. One application of LSH is finding duplicate documents, which is what our use case requires.

From [this article](https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134):
> LSH refers to a family of functions (known as LSH families) to hash data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. This makes it easier to identify observations with various degrees of similarity.

LSH returns **candidate pairs**. These pairs are likely to be similar.

Steps in LSH:

1. **Shingling**
    - This converts each document into a set of k-grams or k-shingles. I have used **k=3**. Similar documents have similar shingles.
    <img src='https://miro.medium.com/max/399/1*SYzH6iJ3Y-5kGbA4psnTDw.png' alt='Shingling Example' title='Document Shingling Matrix'
    width="100" height="75"/>
    
    - Jaccard similarity is used to measure the similarity between two documents: the size of the intersection of their shingle sets divided by the size of their union. The more shingles two documents share, the higher the Jaccard similarity and the higher the probability that the documents are similar.
    
2. **Min Hashing**

    - The document matrix, as shown above, is a sparse matrix and takes up unnecessary space. Hashing is used to convert each document to a signature. Min hashing is the hashing function used.
    <img src='https://miro.medium.com/max/697/1*wkjgq-9UPR2rDjQ68bI9tg.png' alt='Signature Matrix' title='Signature Matrix'
    width="300" height="200"/>
    - To create a signature, the rows (shingles) of the document matrix are given a random order (a permutation of the row indices). The min-hash value of a column C (representing a document) is the index of the first row, in the permuted order, in which C has a 1. This is for one permutation. Performing multiple permutations gives the signature of a column.
    > The higher the number of permutations, the lower the error and the better the results. I have used 256 permutations.
    - The similarity of two signatures is the fraction of the min-hash functions on which they agree. The expected similarity of two signatures equals the Jaccard similarity of the corresponding columns.
    > Signature Similarity is a proxy for Jaccard Similarity.
    - This step solves the space complexity problem, since a short signature replaces each sparse column.
    
3. **Locality Sensitive Hashing**:

    - Using the signature matrix, find candidate documents whose Jaccard similarity is at least a threshold t.
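
The first two steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in this project: it assumes character-level shingles, replaces true row permutations with random linear hash functions of the form `(a * x + b) % PRIME` (a common stand-in), and uses made-up product texts. All function names are hypothetical.

```python
import random
import zlib

PRIME = 4294967311  # a prime just above 2**32, the range of crc32 values

def shingles(text, k=3):
    """Return the set of character k-shingles of a string (assumes len(text) >= k)."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

def make_hash_params(num_perm=256, seed=0):
    """Random (a, b) pairs; each pair stands in for one row permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_perm)]

def minhash_signature(shingle_set, hash_params):
    """For each hash function, keep the minimum hash over the document's shingles."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]

def signature_similarity(sig1, sig2):
    """Fraction of positions on which two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

doc_a = "casual cotton round neck short sleeve solid top"
doc_b = "casual cotton round neck short sleeve printed top"

params = make_hash_params()
set_a, set_b = shingles(doc_a), shingles(doc_b)
exact = jaccard(set_a, set_b)
estimate = signature_similarity(minhash_signature(set_a, params),
                                minhash_signature(set_b, params))
print(f"exact Jaccard: {exact:.3f}, MinHash estimate: {estimate:.3f}")
```

With 256 hash functions the estimate is usually close to the exact value, while each document is stored as 256 integers regardless of how many shingles it has.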
    
I have used LSH to find the similarity between the text data of products. This is used to filter the products whose images will be compared for similarity.
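
How candidate pairs come out of the signatures is not spelled out above. A standard approach is banding: split each signature into bands and treat two documents as a candidate pair if they are identical in at least one band. The sketch below assumes that approach; the function name and the toy signatures are hypothetical, and the notebook's actual LSH implementation may differ.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands):
    """Return pairs of keys whose signatures match exactly in at least one band.

    signatures: dict mapping key -> list of ints, all of equal length,
    with the length divisible by the number of bands.
    """
    length = len(next(iter(signatures.values())))
    rows = length // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            # Documents with the same slice of the signature share a bucket.
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[chunk].append(key)
        for keys in buckets.values():
            candidates.update(combinations(sorted(keys), 2))
    return candidates

toy_signatures = {
    "P1": [3, 7, 1, 9, 4, 4],
    "P2": [3, 7, 1, 9, 5, 2],
    "P3": [8, 2, 6, 0, 5, 1],
}
pairs = lsh_candidate_pairs(toy_signatures, bands=3)
print(pairs)
```

More bands with fewer rows each make the filter more permissive; fewer bands with more rows each make it stricter, which is how a similarity threshold is approximated in this scheme.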

### Cosine Similarity

Each vector, e.g. $a = [1, 2, 3]^{T}$, can be represented in a coordinate system whose dimensionality equals the length of the vector. Each element of the vector is the coordinate along one dimension. If two vectors have similar directions (orientation), they are considered similar.

The cosine similarity between two vectors is a measure that calculates the cosine of the angle between them. This metric is a measurement of orientation.

I have used cosine similarity to find the similarity between images. Feature vectors of the images are extracted and compared.
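
For two vectors $a$ and $b$, the cosine similarity is $\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. A minimal NumPy sketch on made-up vectors (the function name `cosine_sim` is just for illustration):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1, 2, 3]
b = [2, 4, 6]    # same direction as a, twice the length
c = [-3, 0, 1]   # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

print(cosine_sim(a, b), cosine_sim(a, c))
```

Because only the direction matters, `a` and `b` are as similar as possible even though their lengths differ, while `a` and `c` have no similarity at all.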

### Algorithm used to find Duplicate Products

Below are the steps of the algorithm:

1. Group by `productBrand` and `sub_category1` (Women, Kids, etc.).
    - Although most data in `sub_category1` belongs to one category in the final `df`, it's more efficient and intuitive to compare women's products only with other women's products.
    - If we were processing the entire data set that is provided, then it'd be better to group by `productBrand`, `sub_category1` and `sub_category2` (Shrugs, Tunics, etc.).
    
2. Each group's `keySpecsStr`, `description` and `title` is concatenated into a single column `full_text`. 

3. From the `full_text` column, the 5 most frequent words (computed over the entire column) are removed from each row, if they are present.
    - This removes redundancy from the data, so that only the words that distinguish a row are used to find duplicate/similar text.
    
4. Apply the **LSH** to `full_text` column. Each row of `full_text` is the input with the productId as the key.
    - To obtain the results, a productId is given as input and a list of productIds is returned.
    - This step filters out the images that are irrelevant to compare against for a product, which decreases computation time.
    
5. For each key and its results from LSH, the images are compared and a dictionary is built in which each key is a productId and its value is the list of duplicate productIds for that key.
    - The final result will be the IDs that satisfy a certain threshold on **cosine_similarity**.
    - The images are converted to **grayscale**.
        - It's important to consider only the grayscale version of the images, as the same product can come in different colors. Grayscaling will remove the color difference between variants of the same product.
    - The features are extracted from the images using the VGG19 network.
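
Steps 2 and 3 can be sketched on a toy group. This is an illustration under assumptions, not the project's source: the helper `remove_top_words`, whitespace tokenisation and the sample rows are made up, and the example removes only the top 2 words so that the effect is easy to see on three short rows.

```python
from collections import Counter

import pandas as pd

def remove_top_words(texts, n=5):
    """Drop the n most frequent words (counted over all rows) from every row."""
    tokenised = texts.str.lower().str.split()
    counts = Counter(word for words in tokenised for word in words)
    top_words = {word for word, _ in counts.most_common(n)}
    return tokenised.apply(lambda words: " ".join(w for w in words if w not in top_words))

group = pd.DataFrame({
    "title": ["Casual Women Top", "Casual Women Top", "Casual Women"],
    "description": ["Red", None, "Shirt"],
    "keySpecsStr": ["", "Blue", ""],
})

# Step 2: concatenate the text columns, treating missing values as empty strings.
full_text = (group["title"].fillna("") + " "
             + group["description"].fillna("") + " "
             + group["keySpecsStr"].fillna(""))

# Step 3: remove the most frequent words (n=2 for this tiny example).
cleaned = remove_top_words(full_text, n=2)
print(cleaned.tolist())
```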


The LSH threshold for finding similar documents is 0.9, and the cosine similarity threshold for images is 0.45.
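
The final filtering in step 5 can be sketched as follows. The feature vectors here are tiny made-up arrays standing in for VGG19 features, `filter_duplicates` is a hypothetical helper, and treating the threshold as inclusive (`>=`) is an assumption.

```python
import numpy as np

def filter_duplicates(candidates, features, threshold=0.45):
    """Keep, for each key, the candidate IDs whose feature vectors are similar enough."""
    result = {}
    for key, candidate_ids in candidates.items():
        query = features[key]
        kept = []
        for cid in candidate_ids:
            other = features[cid]
            sim = query @ other / (np.linalg.norm(query) * np.linalg.norm(other))
            if sim >= threshold:  # assumption: threshold is inclusive
                kept.append(cid)
        result[key] = kept
    return result

# Stand-ins for image feature vectors (real ones would come from the network).
toy_features = {
    "P1": np.array([1.0, 0.0, 1.0]),
    "P2": np.array([1.0, 0.1, 0.9]),
    "P3": np.array([0.0, 1.0, 0.0]),
}
# Stand-in for the LSH output: candidate IDs per product.
toy_candidates = {"P1": ["P2", "P3"]}

image_duplicates = filter_duplicates(toy_candidates, toy_features)
print(image_duplicates)
```

Only the candidates returned by LSH are compared, so the number of image comparisons stays small even when a brand has many products.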

See the [notebook](Source Code Explanation.ipynb) or the [html](Source Code Explanation.html) version for the main code.

## Results

Considering that duplicates are defined as products with the same or similar images, the results are promising. Testing was done with a few utility functions that, given a productId, print the URLs of its duplicate IDs. [Click here](Evaluation.ipynb) or [here](Evaluation.html) (html) to check the functions.

### Summarizing results:

- The overall designs of duplicate images are the same or similar. For example, all sleeveless tops are clustered together.
- Differently colored clothes that belong to the same product are also clustered together.

**Scope for improvement**: The method could better distinguish between the prints and patterns of the clothing. For example, the outline might be similar, but the prints or text on the clothing could be different.

## Further Improvements

- A more sophisticated method to extract features from the images could be used. For example, only the features of the tops should be extracted, but this method extracts features from the entire image, including the pose of the model. This adds noise to the feature vectors.
- Different shingle lengths and numbers of permutations could be tried for LSH.
- Prints and patterns could be distinguished better, as noted under Results: two tops with a similar outline can still carry different prints or text.

## References

[Cosine Similarity](http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/)

[Locality Sensitive Hashing](https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134)

[Understanding Duplicate Detection in Ecommerce](https://medium.com/walmartlabs/product-matching-in-ecommerce-4f19b6aebaca)