# **APPROACH**
* Problem definition
* Data
* Evaluation
* Features
* Modelling
* Experimentation


# **About the problem**
* Using ML/DL techniques, match similar products from the Flipkart dataset with the Amazon dataset. 
* Once similar products are matched, display the retail price from FK and AMZ side by side. 

# **Data**
* The data used in this problem can be found here:
  * https://www.dropbox.com/sh/aypq6h3254207bs/AACzMLvo-XtK9sYAAma6FW0la?dl=0

# **Evaluation**
 * Used **`sentence_transformers`** to embed the data, converting each product name into a tensor.
 * Used **`PyTorch`** to compute the similarity between product names from the Amazon and Flipkart data.

# Importing the required libraries.
* Libraries for data handling and visualization.

In [1]:
#Import required module
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn

# Mounting Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Unzipping the content of the file.

In [3]:
# Unzip the data available in the drive and save it there for our use.
#!unzip "/content/drive/MyDrive/Internship_2/DS - Assignment Part 2 data set.zip" -d "/content/drive/MyDrive/Internship_2/"

In [4]:
df_amazon  = pd.read_csv("/content/drive/MyDrive/Internship_2/amz_com-ecommerce_sample.csv",encoding='latin-1')
df_flipkart = pd.read_csv("/content/drive/MyDrive/Internship_2/flipkart_com-ecommerce_sample.csv",encoding='latin-1')

# Extracting the columns needed for our problem from each dataframe.

In [5]:
amazon_data = df_amazon[['product_name','retail_price','discounted_price']]
flipkart_data = df_flipkart[['product_name','retail_price','discounted_price']]

# Installing the `sentence_transformers`.

In [6]:
!pip install sentence_transformers


## **`sentence_transformers`**
* SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings.
* The framework generates one embedding for each input sentence.
* Sentences are passed as a list of strings.

## **`Torch`**
* PyTorch is a Python package that provides two high-level features:
   * Tensor computation (like NumPy) with strong GPU acceleration.
   * Deep neural networks built on a tape-based autograd system.

In [7]:
from sentence_transformers import SentenceTransformer, util
import torch

## **`all-MiniLM-L6-v2`**
* This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

## **`cosine similarity`**
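* It measures the similarity between two vectors as the cosine of the angle between them: the dot product of the vectors divided by the product of their lengths. A value of 1 means the vectors point in the same direction and 0 means they are orthogonal.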
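
As a small illustration of the definition, the sketch below computes cosine similarity in plain Python for two short vectors. The function name `cosine_similarity` is hypothetical and chosen only for this example; in the notebook the same calculation is done for all embeddings at once by `util.cos_sim`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score (up to rounding) 1.0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0.0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the score depends only on direction, not on vector length, it suits embeddings whose magnitudes carry no meaning of their own.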

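## **`convert_to_tensor`**
* Setting this to `True` makes `model.encode` return the embeddings as a single PyTorch tensor instead of a NumPy array.
* The product names become numerical vectors either way; the tensor form keeps them ready for PyTorch operations when computing their similarity.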

In [8]:
# Instantiate a model of the SentenceTransformer class
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the Amazon and Flipkart product names; each call returns one tensor with a row per product.
amazon_embeds = model.encode(amazon_data.product_name.tolist(),convert_to_tensor=True)
flipkart_embeds = model.encode(flipkart_data.product_name.tolist(),convert_to_tensor=True)

# Cosine similarity between every Amazon name and every Flipkart name,
# used to find the products with the most similar names in the two files.
cosine_similarities = util.cos_sim(amazon_embeds,flipkart_embeds)

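
The full similarity matrix holds one score for every Amazon/Flipkart pair, so its size grows with the product of the two row counts. The arithmetic below is a rough estimate only, assuming about 20,000 products on each side and 4-byte `float32` scores; both figures are assumptions for illustration, not values measured in this notebook.

```python
# Assumed sizes: roughly 20,000 products per site and 4-byte float32 scores.
n_amazon = 20_000
n_flipkart = 20_000
bytes_per_score = 4

matrix_bytes = n_amazon * n_flipkart * bytes_per_score
matrix_gb = matrix_bytes / 1e9
print(f"{matrix_gb:.1f} GB")  # prints "1.6 GB"
```

For noticeably larger catalogues, computing the scores in batches of rows would keep memory use bounded.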

## Storing the index of a similar Flipkart product corresponding to each Amazon product. 

In [9]:
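similar_prod_indexes = []
# For each Amazon product, look at its five most similar Flipkart products.
for i in range(cosine_similarities.shape[0]):
    max_score_idxs = torch.topk(cosine_similarities[i],k=5,largest=True,sorted=True).indices
    for idx in max_score_idxs:
        # Skip the Flipkart row at the same position as the Amazon row and
        # keep the best-scoring candidate among the rest.
        if idx != i:
            similar_prod_indexes.append(idx)
            break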
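
The highest-scoring candidate is not always the same product; the pair printed for product 2936 in this notebook, for instance, matches a "Solid" top with a "Printed" one. One possible refinement, which is not part of the original notebook, is to accept a match only when its score clears a minimum threshold. The sketch below shows the idea in plain Python on a toy score matrix; the function `best_matches` and the threshold of 0.8 are hypothetical choices for illustration.

```python
def best_matches(scores, min_score=0.8):
    """For each row, return (column index, score) of the highest score,
    or None when even the best score is below min_score."""
    matches = []
    for row in scores:
        best_idx = max(range(len(row)), key=lambda j: row[j])
        best = row[best_idx]
        matches.append((best_idx, best) if best >= min_score else None)
    return matches

# Toy similarity matrix: 3 "Amazon" rows x 3 "Flipkart" columns.
toy_scores = [
    [0.95, 0.10, 0.30],
    [0.20, 0.55, 0.40],
    [0.05, 0.15, 0.88],
]
matches = best_matches(toy_scores)
print(matches)  # [(0, 0.95), None, (2, 0.88)]
```

A suitable threshold would have to be chosen by inspecting real matches, since it depends on the embedding model and on how product names are written.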

In [10]:
similar_prod_indexes = [x.item() for x in similar_prod_indexes]
amazon_data.product_name[2936], flipkart_data.product_name[similar_prod_indexes[2936]]

("Folklore Festive Sleeveless Solid Women's Top",
 "Folklore Festive Sleeveless Printed Women's Top")

## Saving the matched rows into two lists.

In [11]:
flipkart_prod_data = []
amazon_prod_data = []

for idx, prod_name in enumerate(amazon_data.product_name):
    flipkart_prod_data.append(flipkart_data.iloc[similar_prod_indexes[idx]])
    amazon_prod_data.append(amazon_data.iloc[idx])

## Combining the final result and exporting it to a single CSV file.
* `pd.concat()`
  * This function joins, column-wise (`axis=1`), the two tables built above from the matched rows, so each Flipkart product sits next to its Amazon counterpart.
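
With `axis=1`, `pd.concat` pairs rows by index label rather than by position, which is why the Flipkart table has its index reset before the join. The sketch below illustrates the difference on two tiny made-up tables; the column names and prices are toy values, not data from the files.

```python
import pandas as pd

# Toy tables whose index labels do not match.
left = pd.DataFrame({"flipkart_price": [100.0, 200.0]}, index=[7, 3])
right = pd.DataFrame({"amazon_price": [110.0, 210.0]}, index=[0, 1])

# Index labels differ, so concat aligns on labels and pads with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first pairs the rows by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
```

Here `misaligned` has four rows with gaps, while `aligned` has the intended two complete rows.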

In [12]:
flipkart_prod_data = pd.DataFrame(flipkart_prod_data)
flipkart_prod_data.columns = ['Product name in Flipkart','Retail Price in Flipkart','Discounted Price in Flipkart']
amazon_prod_data = pd.DataFrame(amazon_prod_data)
amazon_prod_data.columns = ['Product name in Amazon','Retail Price in Amazon','Discounted Price in Amazon']

In [13]:
amazon_prod_data['Retail Price in Amazon'] = amazon_prod_data['Retail Price in Amazon'].astype(np.float64)
amazon_prod_data['Discounted Price in Amazon'] = amazon_prod_data['Discounted Price in Amazon'].astype(np.float64)
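
`astype(np.float64)` raises an error if a price cell contains text that cannot be parsed as a number. Should that happen, one alternative is `pd.to_numeric` with `errors="coerce"`, sketched below on a toy column; the values are made up and the notebook itself does not use this call.

```python
import pandas as pd

# Toy price column with one missing and one non-numeric entry.
raw_prices = pd.Series(["699", None, "n/a", "1299.5"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
prices = pd.to_numeric(raw_prices, errors="coerce")
```

The coerced NaN values can then be inspected or dropped before prices are compared.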

In [14]:
# Reset the Flipkart index so its rows line up positionally with the Amazon rows.
flipkart_prod_final_data = flipkart_prod_data.reset_index(drop=True)
# Join the two tables column-wise and export the result.
final_prod_data = pd.concat([flipkart_prod_final_data,amazon_prod_data],axis=1)
final_prod_data.to_csv('final_result.csv')

In [15]:
final_prod_data 

| | Product name in Flipkart | Retail Price in Flipkart | Discounted Price in Flipkart | Product name in Amazon | Retail Price in Amazon | Discounted Price in Amazon |
|---|---|---|---|---|---|---|
| 0 | Alisha Solid Women's Cycling Shorts | 699.0 | 267.0 | Alisha Solid Women's Cycling Shorts | 982.0 | 438.0 |
| 1 | FabHomeDecor Fabric Double Sofa Bed | 32157.0 | 22646.0 | FabHomeDecor Fabric Double Sofa Bed | 32143.0 | 29121.0 |
| 2 | i-Sports Bellies | 899.0 | 399.0 | AW Bellies | 991.0 | 551.0 |
| 3 | Alisha Solid Women's Cycling Shorts | 999.0 | 379.0 | Alisha Solid Women's Cycling Shorts | 694.0 | 325.0 |
| 4 | Sicons Conditioning Conditoner Dog Shampoo | 110.0 | 100.0 | Sicons All Purpose Arnica Dog Shampoo | 208.0 | 258.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 19995 | WallDesign Small Vinyl Sticker | 1220.0 | 600.0 | WALLDESIGN SMALL VINYL STICKER | 1498.0 | 876.0 |
| 19996 | Wallmantra Large Vinyl Stickers Sticker | 727.0 | 582.0 | WALLMANTRA LARGE VINYL STICKERS STICKER | 1415.0 | 1424.0 |
| 19997 | Elite Collection Medium Acrylic Sticker | 1499.0 | 1199.0 | ELITE COLLECTION MEDIUM ACRYLIC STICKER | 1284.0 | 1196.0 |
| 19998 | Elite Collection Medium Acrylic Sticker | 1299.0 | 999.0 | ELITE COLLECTION MEDIUM ACRYLIC STICKER | 1492.0 | 1364.0 |
