# Product Matching

Using ML/DL techniques, match similar products from the Flipkart dataset with the Amazon dataset. Once similar products are matched, display the retail price from FK and AMZ side by side. Please explore as many techniques as possible before choosing the final technique.

**Dataset Link:** https://www.dropbox.com/sh/aypq6h3254207bs/AACzMLvo-XtK9sYAAma6FW0la?dl=0



In [None]:
# Importing Warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Import the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Reading the amazon dataset - (here we are taking only 5000 products)
amazon = pd.read_csv("/content/amz_com-ecommerce_sample.csv", encoding = 'unicode_escape')
amazon = amazon.head(5000)

In [None]:
# Top 5 rows
amazon.head()

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,982,438,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
1,7f7036a6d550aaa89d34c77bd39a5e48,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGU7MFYJFY,32143,29121,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
2,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH4GRSUBJGZXE,991,551,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",False,Key Features of AW Bellies Sandals Wedges Heel...,No rating available,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""..."
3,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2F6HUZMQ6SJ,694,325,"[""http://img5a.flixcart.com/image/short/6/2/h/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,2016-03-25 22:59:23 +0000,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",PSOEH3ZYDMSYARJ5,208,258,"[""http://img5a.flixcart.com/image/pet-shampoo/...",False,Specifications of Sicons All Purpose Arnica Do...,No rating available,No rating available,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",..."


In [None]:
# Shape of the Amazon dataset
amazon.shape

(5000, 15)

In [None]:
# Unwanted columns: uniq_id, crawl_timestamp, product_url, image, is_FK_Advantage_product, product_rating, overall_rating, product_category_tree, product_specifications, brand (pid is kept)

In [None]:
# Dropping Unwanted Columns
amazon.drop(["uniq_id","crawl_timestamp","product_url","image","is_FK_Advantage_product","product_rating","overall_rating","product_category_tree","product_specifications","brand"],inplace=True,axis=1)

In [None]:
amazon.shape

(5000, 5)
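
The read, truncate and drop steps above can also be folded into the read itself. This is a minimal sketch, assuming the column names shown earlier; the in-memory CSV text and the `KEEP` list are made-up stand-ins for illustration, not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real CSV file.
csv_text = (
    "uniq_id,product_name,pid,retail_price,discounted_price,description,brand\n"
    "u1,Item A,PID1,100,80,first item,BrandX\n"
    "u2,Item B,PID2,200,150,second item,BrandY\n"
    "u3,Item C,PID3,300,250,third item,BrandZ\n"
)

# Columns to keep (assumed to be the same five kept in the notebook).
KEEP = ["product_name", "pid", "retail_price", "discounted_price", "description"]

# usecols keeps only the listed columns; nrows stops reading after that many rows.
small = pd.read_csv(io.StringIO(csv_text), usecols=KEEP, nrows=2)
print(small.shape)
```

Reading only the needed columns avoids loading data that is dropped straight away, which may matter if the full files are large.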

In [None]:
# Information about Amazon Dataset
amazon.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   product_name      5000 non-null   object
 1   pid               5000 non-null   object
 2   retail_price      5000 non-null   int64 
 3   discounted_price  5000 non-null   int64 
 4   description       4999 non-null   object
dtypes: int64(2), object(3)
memory usage: 195.4+ KB


In [None]:
# Create a new column that records which website each product comes from: a = Amazon, f = Flipkart
amazon["product_on"]="a" #---->label for amazon-a

In [None]:
# Top 5 rows of amazon dataset
amazon.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
0,Alisha Solid Women's Cycling Shorts,SRTEH2FF9KEDEFGF,982,438,Key Features of Alisha Solid Women's Cycling S...,a
1,FabHomeDecor Fabric Double Sofa Bed,SBEEH3QGU7MFYJFY,32143,29121,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,a
2,AW Bellies,SHOEH4GRSUBJGZXE,991,551,Key Features of AW Bellies Sandals Wedges Heel...,a
3,Alisha Solid Women's Cycling Shorts,SRTEH2F6HUZMQ6SJ,694,325,Key Features of Alisha Solid Women's Cycling S...,a
4,Sicons All Purpose Arnica Dog Shampoo,PSOEH3ZYDMSYARJ5,208,258,Specifications of Sicons All Purpose Arnica Do...,a


In [None]:
# Getting null values from description column
amazon[amazon["description"].isna()]

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
553,Ozel Studio Casual Sleeveless Printed Women's Top,TOPEYV38KYVJKM54,1278,781,,a


In [None]:
# Filling the null value with "no description" (assigning back instead of inplace on a column selection).
amazon["description"]=amazon["description"].fillna("no description")

In [None]:
# checking the null value in description.
amazon[amazon["description"].isna()]

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on


In [None]:
# Information about the Amazon dataset - it looks clean now.
amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   product_name      5000 non-null   object
 1   pid               5000 non-null   object
 2   retail_price      5000 non-null   int64 
 3   discounted_price  5000 non-null   int64 
 4   description       5000 non-null   object
 5   product_on        5000 non-null   object
dtypes: int64(2), object(4)
memory usage: 234.5+ KB


In [None]:
# Reading the Flipkart dataset (only the first 5000 products are used)
flipkart=pd.read_csv("/content/flipkart_com-ecommerce_sample.csv",encoding= 'unicode_escape')
flipkart=flipkart.head(5000)

In [None]:
# Shape of Flipkart Dataset
flipkart.shape

(5000, 15)

In [None]:
# Top 10 rows in flipkart dataset.
flipkart.head(10)

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
1,7f7036a6d550aaa89d34c77bd39a5e48,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGU7MFYJFY,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
2,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",False,Key Features of AW Bellies Sandals Wedges Heel...,No rating available,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""..."
3,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,2016-03-25 22:59:23 +0000,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",PSOEH3ZYDMSYARJ5,220.0,210.0,"[""http://img5a.flixcart.com/image/pet-shampoo/...",False,Specifications of Sicons All Purpose Arnica Do...,No rating available,No rating available,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",..."
5,c2a17313954882c1dba461863e98adf2,2016-03-25 22:59:23 +0000,http://www.flipkart.com/eternal-gandhi-super-s...,Eternal Gandhi Super Series Crystal Paper Weig...,"[""Eternal Gandhi Super Series Crystal Paper We...",PWTEB7H2E4KCYUE3,430.0,430.0,"[""http://img5a.flixcart.com/image/paper-weight...",False,Key Features of Eternal Gandhi Super Series Cr...,No rating available,No rating available,Eternal Gandhi,"{""product_specification""=>[{""key""=>""Model Name..."
6,ce5a6818f7707e2cb61fdcdbba61f5ad,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FVVKRBAXHB,1199.0,479.0,"[""http://img6a.flixcart.com/image/short/p/j/z/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
7,8542703ca9e6ebdf6d742638dfb1f2ca,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGYGHFUEXN,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/e/x...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
8,29c8d290caa451f97b1c32df64477a2c,2016-03-25 22:59:23 +0000,http://www.flipkart.com/dilli-bazaaar-bellies-...,"dilli bazaaar Bellies, Corporate Casuals, Casuals","[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH3DZBFR88SCK,699.0,349.0,"[""http://img6a.flixcart.com/image/shoe/b/p/n/p...",False,"Key Features of dilli bazaaar Bellies, Corpora...",No rating available,No rating available,dilli bazaaar,"{""product_specification""=>[{""key""=>""Occasion"",..."
9,4044c0ac52c1ee4b28777417651faf42,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FVUHAAVH9X,1199.0,479.0,"[""http://img5a.flixcart.com/image/short/5/z/c/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."


In [None]:
# Dropping the unwanted columns.
flipkart.drop(["uniq_id","crawl_timestamp","product_url","image","is_FK_Advantage_product","product_rating","overall_rating","product_category_tree","product_specifications","brand"],inplace=True,axis=1)

In [None]:
# checking the shape of dataset
flipkart.shape

(5000, 5)

In [None]:
# Information about Flipkart Dataset.
flipkart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      5000 non-null   object 
 1   pid               5000 non-null   object 
 2   retail_price      4989 non-null   float64
 3   discounted_price  4989 non-null   float64
 4   description       4999 non-null   object 
dtypes: float64(2), object(3)
memory usage: 195.4+ KB


###### Comment: the retail_price, discounted_price and description columns contain null values. Null values in retail_price and discounted_price are replaced by 0 (because the actual price is unknown), and the null value in description is replaced by "no description".

In [None]:
# Checking the null values in description column.
flipkart[flipkart["description"].isna()]

Unnamed: 0,product_name,pid,retail_price,discounted_price,description
553,Ozel Studio Casual Sleeveless Printed Women's Top,TOPEYV38KYVJKM54,1290.0,645.0,


In [None]:
# Filling the null value with "no description" (assigning back instead of inplace on a column selection).
flipkart["description"]=flipkart["description"].fillna("no description")

In [None]:
#Checking the null value.
flipkart[flipkart["description"].isna()]

Unnamed: 0,product_name,pid,retail_price,discounted_price,description


In [None]:
# filling the null values in retail_price column with 0
flipkart["retail_price"]=flipkart["retail_price"].fillna(0)

In [None]:
# filling the null values in discounted_price column with 0
flipkart["discounted_price"]=flipkart["discounted_price"].fillna(0)

In [None]:
# Information about flipkart dataset
flipkart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      5000 non-null   object 
 1   pid               5000 non-null   object 
 2   retail_price      5000 non-null   float64
 3   discounted_price  5000 non-null   float64
 4   description       5000 non-null   object 
dtypes: float64(2), object(3)
memory usage: 195.4+ KB
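
The three fills above can also be written as a single call. This is a minimal sketch on a made-up two-row frame (`toy` is hypothetical), using the same replacement values as the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the same three columns that contain nulls in the Flipkart data.
toy = pd.DataFrame({
    "retail_price": [999.0, np.nan],
    "discounted_price": [379.0, np.nan],
    "description": ["Key features ...", None],
})

# A dict passed to fillna gives each column its own replacement value.
filled = toy.fillna({
    "retail_price": 0,
    "discounted_price": 0,
    "description": "no description",
})
print(filled.isna().sum().sum())
```

Since 0 stands for "price unknown" here, it is probably worth treating a 0 price as missing when prices are compared later.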


In [None]:
# Create a new column that records which website each product comes from: a = Amazon, f = Flipkart
flipkart["product_on"]="f" #----> Label for flipkart - f

In [None]:
# Top 5 Rows in dataset
flipkart.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
0,Alisha Solid Women's Cycling Shorts,SRTEH2FF9KEDEFGF,999.0,379.0,Key Features of Alisha Solid Women's Cycling S...,f
1,FabHomeDecor Fabric Double Sofa Bed,SBEEH3QGU7MFYJFY,32157.0,22646.0,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,f
2,AW Bellies,SHOEH4GRSUBJGZXE,999.0,499.0,Key Features of AW Bellies Sandals Wedges Heel...,f
3,Alisha Solid Women's Cycling Shorts,SRTEH2F6HUZMQ6SJ,699.0,267.0,Key Features of Alisha Solid Women's Cycling S...,f
4,Sicons All Purpose Arnica Dog Shampoo,PSOEH3ZYDMSYARJ5,220.0,210.0,Specifications of Sicons All Purpose Arnica Do...,f


In [None]:
# description of 1st row/product
flipkart["description"][0]

"Key Features of Alisha Solid Women's Cycling Shorts Cotton Lycra Navy, Red, Navy,Specifications of Alisha Solid Women's Cycling Shorts Shorts Details Number of Contents in Sales Package Pack of 3 Fabric Cotton Lycra Type Cycling Shorts General Details Pattern Solid Ideal For Women's Fabric Care Gentle Machine Wash in Lukewarm Water, Do Not Bleach Additional Details Style Code ALTHT_3P_21 In the Box 3 shorts"

In [None]:
# description of the 4th row/product (index 3)
flipkart["description"][3]

"Key Features of Alisha Solid Women's Cycling Shorts Cotton Lycra Black, Red,Specifications of Alisha Solid Women's Cycling Shorts Shorts Details Number of Contents in Sales Package Pack of 2 Fabric Cotton Lycra Type Cycling Shorts General Details Pattern Solid Ideal For Women's Fabric Care Gentle Machine Wash in Lukewarm Water, Do Not Bleach Additional Details Style Code ALTGHT_11 In the Box 2 shorts"

In [None]:
# Checking the duplicate rows in amazon dataset
amazon.duplicated().sum()

0

In [None]:
# checking the duplicate rows in flipkart dataset
flipkart.duplicated().sum()

0

In [None]:
# shape of amazon dataset
amazon.shape

(5000, 6)

In [None]:
# shape of flipkart dataset
flipkart.shape

(5000, 6)

In [None]:
# Creating a new dataset by concatenating the Amazon and Flipkart datasets
df=pd.concat([amazon,flipkart],ignore_index=True)

In [None]:
# Top 5 rows of new dataset 
df.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
0,Alisha Solid Women's Cycling Shorts,SRTEH2FF9KEDEFGF,982.0,438.0,Key Features of Alisha Solid Women's Cycling S...,a
1,FabHomeDecor Fabric Double Sofa Bed,SBEEH3QGU7MFYJFY,32143.0,29121.0,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,a
2,AW Bellies,SHOEH4GRSUBJGZXE,991.0,551.0,Key Features of AW Bellies Sandals Wedges Heel...,a
3,Alisha Solid Women's Cycling Shorts,SRTEH2F6HUZMQ6SJ,694.0,325.0,Key Features of Alisha Solid Women's Cycling S...,a
4,Sicons All Purpose Arnica Dog Shampoo,PSOEH3ZYDMSYARJ5,208.0,258.0,Specifications of Sicons All Purpose Arnica Do...,a


In [None]:
# Shuffling the rows (because in the new dataset the top 5000 rows are Amazon and the bottom 5000 rows are Flipkart)
df=df.sample(frac = 1)

In [None]:
# after shuffling,top 5 rows in new dataset
df.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
4386,URY Steel Necklace,NKCE7S69FPFG9S5C,1291.0,316.0,URY Steel Necklace - Buy URY Steel Necklace on...,a
2817,Wolfie Designred110 Women's Push-up Bra,BRAEBMWVSDWZYNK8,982.0,363.0,Wolfie Designred110 Women's Push-up Bra - Buy ...,a
150,Casela CAS-W-13 Basic Analog Watch - For Boys...,WATEBJ37KBB3SEBR,3979.0,1064.0,Casela CAS-W-13 Basic Analog Watch - For Boys...,a
2763,Grafion Comfortable Women's Full Coverage Bra,BRAE3TSHYRVTD8GA,486.0,168.0,Grafion Comfortable Women's Full Coverage Bra ...,a
3771,Allure Auto sur-4619 Car Mat Honda Mobilio,CRTECUWZXMV2YZ9G,3696.0,1979.0,Buy Allure Auto sur-4619 Car Mat Honda Mobilio...,a


In [None]:
# resetting the index of dataset
df=df.reset_index(drop=True)

In [None]:
# top 5 rows in dataset
df.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
0,URY Steel Necklace,NKCE7S69FPFG9S5C,1291.0,316.0,URY Steel Necklace - Buy URY Steel Necklace on...,a
1,Wolfie Designred110 Women's Push-up Bra,BRAEBMWVSDWZYNK8,982.0,363.0,Wolfie Designred110 Women's Push-up Bra - Buy ...,a
2,Casela CAS-W-13 Basic Analog Watch - For Boys...,WATEBJ37KBB3SEBR,3979.0,1064.0,Casela CAS-W-13 Basic Analog Watch - For Boys...,a
3,Grafion Comfortable Women's Full Coverage Bra,BRAE3TSHYRVTD8GA,486.0,168.0,Grafion Comfortable Women's Full Coverage Bra ...,a
4,Allure Auto sur-4619 Car Mat Honda Mobilio,CRTECUWZXMV2YZ9G,3696.0,1979.0,Buy Allure Auto sur-4619 Car Mat Honda Mobilio...,a


In [None]:
# Shape of new dataset - 10000 rows, 6 columns
df.shape

(10000, 6)

###### In these datasets, the product names on Amazon and Flipkart are the same, and the same product also appears in different pack sizes. To tell the rows apart, the product name column is extended with the product id and the website label of the product.

In [None]:
# modifying the product name column
df["product_name"]=df["product_name"]+" "+df["pid"]+" "+df["product_on"]

In [None]:
# Top 5 rows of dataset
df.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
0,URY Steel Necklace NKCE7S69FPFG9S5C a,NKCE7S69FPFG9S5C,1291.0,316.0,URY Steel Necklace - Buy URY Steel Necklace on...,a
1,Wolfie Designred110 Women's Push-up Bra BRAEBM...,BRAEBMWVSDWZYNK8,982.0,363.0,Wolfie Designred110 Women's Push-up Bra - Buy ...,a
2,Casela CAS-W-13 Basic Analog Watch - For Boys...,WATEBJ37KBB3SEBR,3979.0,1064.0,Casela CAS-W-13 Basic Analog Watch - For Boys...,a
3,Grafion Comfortable Women's Full Coverage Bra ...,BRAE3TSHYRVTD8GA,486.0,168.0,Grafion Comfortable Women's Full Coverage Bra ...,a
4,Allure Auto sur-4619 Car Mat Honda Mobilio CRT...,CRTECUWZXMV2YZ9G,3696.0,1979.0,Buy Allure Auto sur-4619 Car Mat Honda Mobilio...,a


In [None]:
# product name of 1st product.
df["product_name"][0]

'URY Steel Necklace NKCE7S69FPFG9S5C a'
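
Because the combined name is built as title, product id and website label separated by spaces, it can be split back into its parts later. This is a minimal sketch, assuming the product id itself never contains a space; the column names `title`, `pid` and `product_on` are chosen here for illustration.

```python
import pandas as pd

# Two combined names as they appear in this notebook.
names = pd.Series([
    "URY Steel Necklace NKCE7S69FPFG9S5C a",
    "Glory Fashion Shoulder Bag HMBEDCJHT4FCFKGD f",
])

# Split from the right at most twice: title, pid, website label.
parts = names.str.rsplit(" ", n=2, expand=True)
parts.columns = ["title", "pid", "product_on"]
print(parts)
```

Splitting from the right keeps titles with many words intact, since only the last two space-separated tokens are peeled off.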

In [None]:
# dropping the extra columns.
df.drop(["pid","product_on"],inplace=True,axis=1)

In [None]:
# top 5 rows of dataset
df.head()

Unnamed: 0,product_name,retail_price,discounted_price,description
0,URY Steel Necklace NKCE7S69FPFG9S5C a,1291.0,316.0,URY Steel Necklace - Buy URY Steel Necklace on...
1,Wolfie Designred110 Women's Push-up Bra BRAEBM...,982.0,363.0,Wolfie Designred110 Women's Push-up Bra - Buy ...
2,Casela CAS-W-13 Basic Analog Watch - For Boys...,3979.0,1064.0,Casela CAS-W-13 Basic Analog Watch - For Boys...
3,Grafion Comfortable Women's Full Coverage Bra ...,486.0,168.0,Grafion Comfortable Women's Full Coverage Bra ...
4,Allure Auto sur-4619 Car Mat Honda Mobilio CRT...,3696.0,1979.0,Buy Allure Auto sur-4619 Car Mat Honda Mobilio...


### Text Preprocessing

In [None]:
# Importing the libraries (only those used in the cells below)
from tqdm import tqdm
from bs4 import BeautifulSoup

In [None]:
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [None]:
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [None]:
from tqdm import tqdm
preprocessed_description = []
# tqdm is for printing the status bar
for sentence in tqdm(df["description"].values):
    sentence = re.sub(r"http\S+", "", sentence)               # remove URLs
    sentence = BeautifulSoup(sentence, 'lxml').get_text()     # strip HTML tags
    sentence = decontracted(sentence)                         # expand contractions
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()      # remove tokens that contain digits
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)            # keep letters only
    # lowercase every word and drop the stopwords
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocessed_description.append(sentence.strip())

100%|██████████| 10000/10000 [00:04<00:00, 2395.06it/s]
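
To see what the cleaning loop does to a single description, here is a minimal sketch of the same regex steps on one made-up string. The helper `clean_text`, the tiny `STOPWORDS` set and the example text are assumptions for illustration; the HTML and contraction steps are left out to keep the sketch dependency-free.

```python
import re

# Tiny illustrative subset of the stopword list used in the notebook.
STOPWORDS = {"of", "the", "in", "for", "is"}

def clean_text(text, stopwords=STOPWORDS):
    text = re.sub(r"http\S+", "", text)            # drop URLs
    text = re.sub(r"\S*\d\S*", "", text).strip()   # drop tokens that contain a digit
    text = re.sub(r"[^A-Za-z]+", " ", text)        # replace everything except letters with a space
    # lowercase every word and drop the stopwords
    return " ".join(w.lower() for w in text.split() if w.lower() not in stopwords)

example = "Pack of 3 Cotton Lycra Shorts, Style Code ALTHT_3P_21 http://example.com/x"
print(clean_text(example))  # pack cotton lycra shorts style code
```

Note that the digit rule removes pack sizes and style codes, so two descriptions that differ only in pack size end up almost identical after cleaning.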


In [None]:
# Shuffling the rows with a fixed random_state, so that the same order is produced every time the program runs.
sample_data = df.sample(n = 10000,random_state=2)
sample_data.head()

Unnamed: 0,product_name,retail_price,discounted_price,description
7878,FS Mini Klub Printed Baby Girl's Basic Shorts ...,548.0,639.0,Specifications of FS Mini Klub Printed Baby Gi...
3224,Sixthsense Women of African Tribes Showpiece ...,1100.0,1100.0,Buy Sixthsense Women of African Tribes Showpie...
1919,SR Crafts Showpiece - 15.24 cm SHIECPXYNJ8UH...,1850.0,1399.0,"SR Crafts Showpiece - 15.24 cm (Polyresin, M..."
4432,Rockmantra Green Eyes Cat Ceramic Mug MUGEAGGK...,499.0,199.0,Buy Rockmantra Green Eyes Cat Ceramic Mug for ...
4835,Glory Fashion Shoulder Bag HMBEDCJHT4FCFKGD f,1099.0,399.0,Flipkart.com: Buy Glory Fashion Shoulder Bag f...


In [None]:
# Arrange preprocessed_description to follow the index order of sample_data
sample_description = [ preprocessed_description[i] for i in sample_data.index.values]
sample_description[0]

'specifications fs mini klub printed baby girl basic shorts shorts details number contents sales package pack fabric cotton type basic shorts general details pattern printed ideal baby girl occasion casual additional details style code fuschia box basic shorts'
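
The reordering step works because the shuffled index still holds the original row positions. This is a minimal sketch on a made-up four-row frame (`df_toy` and `cleaned` are hypothetical), assuming the frame has the default 0..n-1 index before shuffling.

```python
import pandas as pd

df_toy = pd.DataFrame({"description": ["A one", "B two", "C three", "D four"]})

# Stand-in for the preprocessing loop: one cleaned string per row, in row order.
cleaned = [d.lower() for d in df_toy["description"]]

# A fixed random_state returns the same shuffled order on every run.
shuffled = df_toy.sample(n=len(df_toy), random_state=2)

# The shuffled index holds the original positions, so it selects the
# cleaned string that belongs to each row.
shuffled["preprocessed"] = [cleaned[i] for i in shuffled.index.values]
print(shuffled)
```

If the index had been reset before this step, the lookup would pair rows with the wrong cleaned text, so the order of the two operations matters.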

In [None]:
# Create the new preprocessed column
sample_data['preprocessed'] = sample_description
sample_data.head()

Unnamed: 0,product_name,retail_price,discounted_price,description,preprocessed
7878,FS Mini Klub Printed Baby Girl's Basic Shorts ...,548.0,639.0,Specifications of FS Mini Klub Printed Baby Gi...,specifications fs mini klub printed baby girl ...
3224,Sixthsense Women of African Tribes Showpiece ...,1100.0,1100.0,Buy Sixthsense Women of African Tribes Showpie...,buy sixthsense women african tribes showpiece ...
1919,SR Crafts Showpiece - 15.24 cm SHIECPXYNJ8UH...,1850.0,1399.0,"SR Crafts Showpiece - 15.24 cm (Polyresin, M...",sr crafts showpiece cm polyresin multicolor pr...
4432,Rockmantra Green Eyes Cat Ceramic Mug MUGEAGGK...,499.0,199.0,Buy Rockmantra Green Eyes Cat Ceramic Mug for ...,buy rockmantra green eyes cat ceramic mug onli...
4835,Glory Fashion Shoulder Bag HMBEDCJHT4FCFKGD f,1099.0,399.0,Flipkart.com: Buy Glory Fashion Shoulder Bag f...,flipkart com buy glory fashion shoulder bag rs...


In [None]:
# Dropping the description column
sample_data.drop("description",axis=1,inplace=True)
sample_data.head()

Unnamed: 0,product_name,retail_price,discounted_price,preprocessed
7878,FS Mini Klub Printed Baby Girl's Basic Shorts ...,548.0,639.0,specifications fs mini klub printed baby girl ...
3224,Sixthsense Women of African Tribes Showpiece ...,1100.0,1100.0,buy sixthsense women african tribes showpiece ...
1919,SR Crafts Showpiece - 15.24 cm SHIECPXYNJ8UH...,1850.0,1399.0,sr crafts showpiece cm polyresin multicolor pr...
4432,Rockmantra Green Eyes Cat Ceramic Mug MUGEAGGK...,499.0,199.0,buy rockmantra green eyes cat ceramic mug onli...
4835,Glory Fashion Shoulder Bag HMBEDCJHT4FCFKGD f,1099.0,399.0,flipkart com buy glory fashion shoulder bag rs...


In [None]:
# resetting the index
sample_data=sample_data.reset_index(drop=True)

In [None]:
# top 1 row in sample_data
sample_data.head(1)

Unnamed: 0,product_name,retail_price,discounted_price,preprocessed
0,FS Mini Klub Printed Baby Girl's Basic Shorts ...,548.0,639.0,specifications fs mini klub printed baby girl ...


In [None]:
# Lemmatization is used to get the root word.
# (Lemmatization is used because its output is always an English word; PorterStemmer also gives a root, but not always an English word.)
from nltk.stem import WordNetLemmatizer

In [None]:
import nltk
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
# object creation 
lemmatizer = WordNetLemmatizer()

In [None]:
# User-defined function that lemmatizes each word of a text.
def lema(text):
    y=[]
    for i in text.split():
        y.append(lemmatizer.lemmatize(i))
    
    return " ".join(y)

In [None]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
# Applying lemmatization to each word in every document.
sample_data["preprocessed"]=sample_data["preprocessed"].map(lema)

In [None]:
# top 5 rows after lemmatization.
sample_data.head()

Unnamed: 0,product_name,retail_price,discounted_price,preprocessed
0,FS Mini Klub Printed Baby Girl's Basic Shorts ...,548.0,639.0,specification f mini klub printed baby girl ba...
1,Sixthsense Women of African Tribes Showpiece ...,1100.0,1100.0,buy sixthsense woman african tribe showpiece c...
2,SR Crafts Showpiece - 15.24 cm SHIECPXYNJ8UH...,1850.0,1399.0,sr craft showpiece cm polyresin multicolor pri...
3,Rockmantra Green Eyes Cat Ceramic Mug MUGEAGGK...,499.0,199.0,buy rockmantra green eye cat ceramic mug onlin...
4,Glory Fashion Shoulder Bag HMBEDCJHT4FCFKGD f,1099.0,399.0,flipkart com buy glory fashion shoulder bag r ...
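
The reason given for preferring lemmatization, that a stem is not always an English word, can be checked quickly with the stemmer alone. This is a minimal sketch using NLTK's `PorterStemmer`, which needs no corpus download; the word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words of the kind found in the product descriptions.
words = ["specifications", "accessories", "shorts"]

# Map every word to its Porter stem.
stems = {w: stemmer.stem(w) for w in words}
print(stems)
```

Here "specifications" is cut down to "specif", which is not a dictionary word, while the lemmatized table above keeps readable forms such as "specification".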


In [None]:
# Shape of dataset.
sample_data.shape

(10000, 4)

In [None]:
# Vectorization - convert text to vectors
# TF-IDF is used because it weights words by importance, so each word gets a different weight.
from sklearn.feature_extraction.text import TfidfVectorizer #---> converts text to vectors; only the 5000 most frequent words are kept.
tfidf = TfidfVectorizer(max_features=5000)

In [None]:
# Creating the vectors
vectors=tfidf.fit_transform(sample_data["preprocessed"]).toarray()

In [None]:
vectors

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
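
To illustrate how TF-IDF gives different weights to different words, here is a minimal sketch on three made-up documents (`docs` and `tfidf_toy` are hypothetical names). A word that appears in every document should get a lower weight than a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "pack" is in all three documents, "cotton" in two, "shorts" in one.
docs = ["cotton shorts pack", "cotton shirt pack", "ceramic mug pack"]

tfidf_toy = TfidfVectorizer()
weights = tfidf_toy.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column in the weight matrix.
col = tfidf_toy.vocabulary_
first = weights[0]
print({w: round(float(first[col[w]]), 3) for w in ["shorts", "cotton", "pack"]})
```

In the first document the three words occur once each, so the difference in weight comes only from how many documents contain each word.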

In [None]:
# The 5000 selected features (get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it)
tfidf.get_feature_names_out().tolist()

['aa',
 'aaa',
 'aadivasi',
 'aadyaa',
 'aahana',
 'aaina',
 'aaishwarya',
 'aakash',
 'aakshi',
 'aaliya',
 'aao',
 'aapno',
 'aara',
 'aaradhi',
 'ab',
 'abaya',
 'abdominal',
 'abhira',
 'ability',
 'able',
 'abony',
 'abrasive',
 'abroad',
 'absolute',
 'absolutely',
 'absorbing',
 'absorbs',
 'absorption',
 'abstract',
 'abstrcts',
 'absurd',
 'ac',
 'accent',
 'accentuate',
 'accentuates',
 'accesories',
 'access',
 'accessible',
 'accessoreez',
 'accessorise',
 'accessorize',
 'accessory',
 'accident',
 'accord',
 'according',
 'accu',
 'acheived',
 'achieve',
 'achievement',
 'acid',
 'acm',
 'across',
 'acrylic',
 'act',
 'action',
 'active',
 'activity',
 'actn',
 'actual',
 'actually',
 'ada',
 'adaa',
 'adapater',
 'adapt',
 'adapter',
 'adaptive',
 'add',
 'added',
 'addiction',
 'adding',
 'addition',
 'additional',
 'addons',
 'address',
 'addyvero',
 'aden',
 'adhesive',
 'adidas',
 'adimani',
 'adiwalk',
 'adjust',
 'adjustable',
 'adjustble',
 'adjusting',
 'adjustmen

In [None]:
# used cosine similarity to get similarity score between vectors. 
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Similarity matrix
similarity=cosine_similarity(vectors)

In [None]:
#similarity scores of 1st product with remaining products.
similarity[0]

array([1., 0., 0., ..., 0., 0., 0.])
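
For reference, the score between two vectors is their dot product divided by the product of their lengths. This is a minimal sketch that checks the formula by hand against `cosine_similarity` on two made-up vectors (`a` and `b` are hypothetical).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# cos(a, b) = (a . b) / (|a| * |b|)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity expects 2-D inputs: one row per vector.
from_sklearn = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]
print(manual, from_sklearn)
```

A score of 1 means the two vectors point in the same direction, and a score of 0 means the two descriptions share no weighted words.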

In [None]:
# Top 5 similar products of 1st product.
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])[1:6]

[(8982, 1.0000000000000002),
 (282, 0.9645928647364957),
 (7655, 0.9645928647364957),
 (7575, 0.9446629918339415),
 (8246, 0.9446629918339415)]
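
The `[1:6]` slice assumes that the product itself is sorted first. The output above shows another product with a score of 1.0000000000000002, so that assumption may not always hold. This is a minimal sketch of a variant that drops the query product by its index instead; `top_k_similar` and the toy `row` are hypothetical.

```python
import numpy as np

def top_k_similar(scores, self_index, k=5):
    """Return (index, score) pairs of the k highest scores, excluding self_index."""
    order = np.argsort(scores)[::-1]          # indices from highest to lowest score
    order = order[order != self_index][:k]    # drop the query product explicitly
    return [(int(i), float(scores[i])) for i in order]

# Product 2 is an exact duplicate of product 0, so both score 1.0.
row = np.array([1.0, 0.2, 1.0, 0.9, 0.0])
print(top_k_similar(row, self_index=0, k=2))  # [(2, 1.0), (3, 0.9)]
```

With this variant an exact duplicate on the other website stays in the candidate list regardless of how ties are ordered.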

In [None]:
# User-defined function that returns the details of the selected product and of the most similar product on the other website.
def match(product):
    p1_name=product
    p1_index=sample_data[sample_data["product_name"]==product].index[0]
    p1_retail_price=sample_data.iloc[p1_index].retail_price
    p1_discounted_price=sample_data.iloc[p1_index].discounted_price
    
    # rank all products by similarity; [1:6] assumes the selected product itself is ranked first
    distances=similarity[p1_index]
    product_lists=sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6]
    
    # keep only the candidates from the other website (the last character of the name is the website label)
    products=[]
    for i in product_lists:
        if sample_data.iloc[i[0]].product_name[-1]!=product[-1]:
            products.append(sample_data.iloc[i[0]].product_name)
    # without this check, products[0] fails with an unclear IndexError
    if not products:
        raise ValueError("no product from the other website among the top 5 similar products")
    p2_name=products[0]
    p2_index=sample_data[sample_data["product_name"]==products[0]].index[0]
    p2_retail_price=sample_data.iloc[p2_index].retail_price
    p2_discounted_price=sample_data.iloc[p2_index].discounted_price
    
    final_products=[[p1_name,p1_retail_price,p1_discounted_price],[p2_name,p2_retail_price,p2_discounted_price]]
    
    return final_products

In [None]:
# Testing
sample_data["product_name"][1]

'Sixthsense Women of African Tribes Showpiece  -  9 cm SHIE53GCQDQEK7SF f'

In [None]:
# Testing
match("Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a")

[['Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a', 439.0, 250.0],
 ['Printland PMR1329 Ceramic Mug MUGEACY8CDCZTUMD f', 650.0, 299.0]]
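
The task asks for the prices from both websites side by side, so the pair returned by `match` can be turned into a one-row table. This is a minimal sketch, assuming the name format used in this notebook (title, pid, then the website label as the last character); `side_by_side` and its column names are hypothetical.

```python
import pandas as pd

# Result copied from the match(...) call above.
result = [
    ["Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a", 439.0, 250.0],
    ["Printland PMR1329 Ceramic Mug MUGEACY8CDCZTUMD f", 650.0, 299.0],
]

def side_by_side(pair):
    """Build a one-row table with Amazon and Flipkart prices next to each other."""
    # The last character of each name is the website label ('a' or 'f');
    # name[:-2] removes the label and the space before it.
    by_site = {name[-1]: (name[:-2], retail, discounted) for name, retail, discounted in pair}
    amz, fk = by_site["a"], by_site["f"]
    return pd.DataFrame([{
        "product_amazon": amz[0],
        "retail_price_amazon": amz[1],
        "discounted_price_amazon": amz[2],
        "product_flipkart": fk[0],
        "retail_price_flipkart": fk[1],
        "discounted_price_flipkart": fk[2],
    }])

table = side_by_side(result)
print(table.T)
```

Printing the transposed table keeps the long product names readable in a narrow output cell.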

In [None]:
# Importing the pickle
import pickle

In [None]:
# Dumping the sample data (the with block closes the file after writing)
with open("product_dict.pkl","wb") as f:
    pickle.dump(sample_data.to_dict(),f)

In [None]:
# Dumping the similarity matrix (the with block closes the file after writing)
with open("similarity.pkl","wb") as f:
    pickle.dump(similarity,f)
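
For completeness, this is how the two pickle files could be read back later, for example in a separate script. It is a minimal sketch on made-up stand-ins (`toy_data`, `toy_similarity`) written to a temporary folder, not the real files.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Toy stand-ins for sample_data and similarity.
toy_data = pd.DataFrame({"product_name": ["A a", "A f"], "retail_price": [10.0, 12.0]})
toy_similarity = np.array([[1.0, 0.8], [0.8, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    dict_path = os.path.join(tmp, "product_dict.pkl")
    sim_path = os.path.join(tmp, "similarity.pkl")

    # Write both objects: the frame as a dict, the matrix as it is.
    with open(dict_path, "wb") as f:
        pickle.dump(toy_data.to_dict(), f)
    with open(sim_path, "wb") as f:
        pickle.dump(toy_similarity, f)

    # Read them back: the dict becomes a DataFrame again.
    with open(dict_path, "rb") as f:
        restored_data = pd.DataFrame(pickle.load(f))
    with open(sim_path, "rb") as f:
        restored_similarity = pickle.load(f)

print(restored_data.shape, restored_similarity.shape)
```

Pickle files can run arbitrary code when loaded, so only files created by this notebook (or another trusted source) should be opened this way.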