# Systematic Similarity

## Data Preparation:
- The code starts by importing necessary libraries and loading a cleaned dataset (t2.csv).
- It defines functions to clean text data by removing punctuation and stopwords using NLTK.
- The dataset is then sliced to at most its first 10,000 rows (df1) because of computational limitations; this file has only 2,683 rows, so the slice keeps all of them.

In [1]:
import numpy as np   #for numerical calculation
import pandas as pd  #for data analysis

import string 
import regex # regular expression
from nltk.corpus import stopwords # to remove stopwords using nltk library

In [2]:
df=pd.read_csv('t2.csv') # reading the data which we cleaned before

In [3]:
df.info() # data information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2683 entries, 0 to 2682
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   2683 non-null   int64  
 1   category     2683 non-null   object 
 2   description  2683 non-null   object 
 3   title        2683 non-null   object 
 4   brand        2683 non-null   object 
 5   date         2683 non-null   object 
 6   price        2683 non-null   float64
 7   asin         2683 non-null   object 
 8   imageURL     2683 non-null   object 
 9   overall      2683 non-null   int64  
 10  verified     2683 non-null   bool   
 11  reviewTime   2683 non-null   object 
 12  reviewText   2683 non-null   object 
 13  reviewerID   2683 non-null   object 
 14  Sentimental  2683 non-null   object 
dtypes: bool(1), float64(1), int64(2), object(11)
memory usage: 296.2+ KB


In [4]:
df.isnull().sum() # finding null values

Unnamed: 0     0
category       0
description    0
title          0
brand          0
date           0
price          0
asin           0
imageURL       0
overall        0
verified       0
reviewTime     0
reviewText     0
reviewerID     0
Sentimental    0
dtype: int64

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,category,description,title,brand,date,price,asin,imageURL,overall,verified,reviewTime,reviewText,reviewerID,Sentimental
0,128,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,False,"01 25, 2008",These Delorme Atlas & Gazetters are wondeful. ...,A2IUHI0QMEC9US,Positive
1,129,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,3,False,"10 28, 2007",I purchased the maps so I could see the elevat...,APPM2Z3VPETEX,Neutral
2,130,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,True,"05 7, 2007",Great Product! Nearly as good as having a sepe...,A210MD07WALT56,Positive
3,131,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,True,"04 3, 2007",I have nothing but praise for DeLorme. We hav...,A2DTG02DSNOLQY,Positive
4,132,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,True,"02 15, 2007",This is a must have for finding the tucked awa...,A1T4WK0L835QHQ,Positive


In [6]:
df1=df[:10000] # slicing the data because my PC cannot handle large datasets
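Because df1 is a slice of df, pandas raises a SettingWithCopyWarning when new columns are assigned to it in later cells. The following is a minimal sketch of one common way to avoid that, assuming it is acceptable to work on an independent copy; the toy frame and its values are made up for illustration and are not part of t2.csv.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values)
df = pd.DataFrame({"title": ["Paper Map", "Battle Rope", "Hatchet"]})

# .copy() makes df1 an independent DataFrame instead of a slice of df
df1 = df[:2].copy()

# Adding a column now changes df1 only and leaves df untouched
df1["Product_name"] = df1["title"].str.lower()

print(df1.columns.tolist())
print(df.columns.tolist())
```

The trade-off is a little extra memory for the copy, which is small at this dataset size.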

## Text Cleaning:
- The product names, categories, and descriptions are cleaned using the defined text processing functions.
- Cleaned text is converted back to strings and combined into a single column (combined_text) for each product.

In [7]:
def text_process(text): # cleaning the data
    nopunc = [char for char in  text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

In [8]:
def list_to_string(s): # converting list to string
    str1 = " "
    return (str1.join(s))
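text_process calls stopwords.words('english') once for every word, which reloads the stopword list each time and can be slow on long descriptions. The sketch below shows one possible alternative that builds the stopword set once and returns a string directly, so the separate list_to_string step is not needed. The names clean_text, STOPWORDS and _PUNCT_TABLE are hypothetical, and the tiny hand-picked stopword set is an assumption used only to keep the example self-contained; in the notebook it would be replaced by set(stopwords.words('english')).

```python
import string

# Small hand-picked stopword set for illustration only (an assumption);
# the notebook would use set(stopwords.words('english')) from NLTK instead.
STOPWORDS = frozenset({"a", "an", "the", "on", "for", "and", "of", "to", "in", "is"})

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text, stop_words=STOPWORDS):
    """Remove punctuation and stopwords, returning a single string."""
    no_punct = text.translate(_PUNCT_TABLE)
    kept = [word for word in no_punct.split() if word.lower() not in stop_words]
    return " ".join(kept)

print(clean_text("Rely on the atlas & gazetteer, for trip planning!"))
```

Set membership is a constant-time lookup, so the cost per word no longer depends on the length of the stopword list.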

In [9]:
# cleaning the dataset
df1['Product_name']=df1['title'].apply(text_process)
df1['category_for_rec']=df1['category'].apply(text_process)
df1['description_for_rec']=df1['description'].apply(text_process)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['Product_name']=df1['title'].apply(text_process)


In [10]:
# converting data from list to string
df1['Product_name']=df1['Product_name'].apply(list_to_string)
df1['category_for_rec']=df1['category_for_rec'].apply(list_to_string)
df1['description_for_rec']=df1['description_for_rec'].apply(list_to_string)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['Product_name']=df1['Product_name'].apply(list_to_string)

In [13]:
df1.head()

Unnamed: 0.1,Unnamed: 0,category,description,title,brand,date,price,asin,imageURL,overall,verified,reviewTime,reviewText,reviewerID,Sentimental,Product_name,category_for_rec,description_for_rec
0,128,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,False,"01 25, 2008",These Delorme Atlas & Gazetters are wondeful. ...,A2IUHI0QMEC9US,Positive,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...
1,129,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,3,False,"10 28, 2007",I purchased the maps so I could see the elevat...,APPM2Z3VPETEX,Neutral,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...
2,130,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,True,"05 7, 2007",Great Product! Nearly as good as having a sepe...,A210MD07WALT56,Positive,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...
3,131,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,True,"04 3, 2007",I have nothing but praise for DeLorme. We hav...,A2DTG02DSNOLQY,Positive,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...
4,132,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,True,"02 15, 2007",This is a must have for finding the tucked awa...,A1T4WK0L835QHQ,Positive,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...
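The category and description columns in the table above hold Python lists stored as strings, so text_process only removes the brackets and quotes because they count as punctuation. As a hedged alternative, the sketch below parses such a value back into a real list before cleaning. The helper name parse_list_column and the sample value are hypothetical and chosen for illustration.

```python
import ast

def parse_list_column(value):
    """Parse a stringified list; fall back to a one-item list if parsing fails."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a Python literal, so treat the raw text as a single item
        return [value]
    if isinstance(parsed, (list, tuple)):
        return list(parsed)
    return [str(parsed)]

# Illustrative value in the same shape as the category column
raw_category = "['Sports & Outdoors', 'Outdoor Recreation']"
print(" ".join(parse_list_column(raw_category)))
```

ast.literal_eval only evaluates literals, so it is safer than eval for data read from a file.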


In [14]:
df1['combined_text']=df1['Product_name']+' '+df1['category_for_rec']+' '+df1['description_for_rec'] # combining the three cleaned columns

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['combined_text']=df1['Product_name']+' '+df1['category_for_rec']+' '+df1['description_for_rec'] # combining the three cleaned columns


In [15]:
df1.head()

Unnamed: 0.1,Unnamed: 0,category,description,title,brand,date,price,asin,imageURL,overall,verified,reviewTime,reviewText,reviewerID,Sentimental,Product_name,category_for_rec,description_for_rec,combined_text
0,128,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,False,"01 25, 2008",These Delorme Atlas & Gazetters are wondeful. ...,A2IUHI0QMEC9US,Positive,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...
1,129,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,3,False,"10 28, 2007",I purchased the maps so I could see the elevat...,APPM2Z3VPETEX,Neutral,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...
2,130,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,True,"05 7, 2007",Great Product! Nearly as good as having a sepe...,A210MD07WALT56,Positive,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...
3,131,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,True,"04 3, 2007",I have nothing but praise for DeLorme. We hav...,A2DTG02DSNOLQY,Positive,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...
4,132,"['Sports & Outdoors', 'Outdoor Recreation', 'C...","[""Rely on delorme ATLAS & gazetteer paper maps...",Garmin DeLorme Atlas &amp; Gazetteer Paper Map...,Garmin,2003-08-04,21.96,899333257,['https://images-na.ssl-images-amazon.com/imag...,5,True,"02 15, 2007",This is a must have for finding the tucked awa...,A1T4WK0L835QHQ,Positive,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...


In [16]:
df2=df1[['Product_name','category_for_rec','description_for_rec','combined_text']] # taking only the required columns for further analysis

In [17]:
df2.combined_text[0]

'Garmin DeLorme Atlas amp Gazetteer Paper Maps Arizona AA000005000 Sports Outdoors Outdoor Recreation Camping Hiking Navigation Electronics Topographic Maps Rely delorme ATLAS gazetteer paper maps utmost trip Planning backcountry access available paperback 11inches x 155Inches 50 States'

In [18]:
#pip install sentence-transformers

## Sentence Embedding:
- The Sentence Transformers library is used to load a pre-trained model (all-MiniLM-L6-v2) from the Hugging Face Hub.
- Each combined text is encoded into a vector representation using the pre-trained model, resulting in combined_embedding.

In [19]:
from sentence_transformers import SentenceTransformer # sentence transformer


In [20]:
model = SentenceTransformer('all-MiniLM-L6-v2') # a pre-trained model from the Hugging Face Hub

## Embedding:

In [22]:
df2['combined_embedding']=df2['combined_text'].apply(lambda x:model.encode(x)) # encoding the text to vector

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['combined_embedding']=df2['combined_text'].apply(lambda x:model.encode(x)) # encoding the text to vector


In [23]:
df2.sample(20)

Unnamed: 0,Product_name,category_for_rec,description_for_rec,combined_text,combined_embedding
1186,Intex Explorer 200 2Person Inflatable Boat Set...,Sports Outdoors Sports Fitness Boating Sailing...,Explorer 200 Set Pool Boat Marketing Informati...,Intex Explorer 200 2Person Inflatable Boat Set...,"[-0.0050696256, -0.062518805, 0.021104803, 0.0..."
478,SOG Multitool Pliers Pocket Tool ndash ldquoPo...,Sports Outdoors Sports Fitness Hunting Fishing...,PowerLock Government Services Administrationap...,SOG Multitool Pliers Pocket Tool ndash ldquoPo...,"[-0.07354514, -0.03458747, -0.039415684, -0.04..."
292,Case Black Ridgeback Hunter Knife,Sports Outdoors Sports Fitness Hunting Fishing...,Black Ridgeback Hunter Blackie Collins design ...,Case Black Ridgeback Hunter Knife Sports Outdo...,"[-0.06744544, -0.006403379, -0.021779224, 0.00..."
2226,Intex Explorer 200 2Person Inflatable Boat Set...,Sports Outdoors Sports Fitness Boating Sailing...,Explorer 200 Set Pool Boat Marketing Informati...,Intex Explorer 200 2Person Inflatable Boat Set...,"[-0.0050696256, -0.062518805, 0.021104803, 0.0..."
1639,Intex Explorer 200 2Person Inflatable Boat Set...,Sports Outdoors Sports Fitness Boating Sailing...,Explorer 200 Set Pool Boat Marketing Informati...,Intex Explorer 200 2Person Inflatable Boat Set...,"[-0.0050696256, -0.062518805, 0.021104803, 0.0..."
107,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,"[-0.0075754165, -0.0015259096, -0.023112586, 0..."
1404,Intex Explorer 200 2Person Inflatable Boat Set...,Sports Outdoors Sports Fitness Boating Sailing...,Explorer 200 Set Pool Boat Marketing Informati...,Intex Explorer 200 2Person Inflatable Boat Set...,"[-0.0050696256, -0.062518805, 0.021104803, 0.0..."
495,SOG Multitool Pliers Pocket Tool ndash ldquoPo...,Sports Outdoors Sports Fitness Hunting Fishing...,PowerLock Government Services Administrationap...,SOG Multitool Pliers Pocket Tool ndash ldquoPo...,"[-0.07354514, -0.03458747, -0.039415684, -0.04..."
2355,Intex Explorer 200 2Person Inflatable Boat Set...,Sports Outdoors Sports Fitness Boating Sailing...,Explorer 200 Set Pool Boat Marketing Informati...,Intex Explorer 200 2Person Inflatable Boat Set...,"[-0.0050696256, -0.062518805, 0.021104803, 0.0..."
29,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,Sports Outdoors Outdoor Recreation Camping Hik...,Rely delorme ATLAS gazetteer paper maps utmost...,Garmin DeLorme Atlas amp Gazetteer Paper Maps ...,"[-0.0075754165, -0.0015259096, -0.023112586, 0..."
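The sample above shows the same product on many rows with identical embeddings, since the dataset has one row per review. A possible optimisation, sketched below under that assumption, is to encode each distinct combined_text once and map the vectors back to the rows. fake_encode is a hypothetical stand-in for model.encode so that the example runs without downloading a model; the toy texts are made up.

```python
import numpy as np
import pandas as pd

def fake_encode(texts):
    """Stand-in for model.encode (an assumption, for illustration only):
    maps each text to a tiny deterministic vector."""
    return np.array([[len(t), t.count(" ")] for t in texts], dtype=np.float32)

# Toy frame with repeated product texts, as in the review-level dataset
df2 = pd.DataFrame({"combined_text": ["boat set", "boat set", "paper map", "boat set"]})

# Encode each distinct text only once
unique_texts = df2["combined_text"].drop_duplicates().tolist()
vectors = fake_encode(unique_texts)
lookup = dict(zip(unique_texts, vectors))

# Attach the matching vector to every row
df2["combined_embedding"] = [lookup[text] for text in df2["combined_text"]]

print(len(unique_texts), "encoder inputs for", len(df2), "rows")
```

With the real model, passing the whole list of unique texts to a single model.encode call also lets the library batch the work instead of encoding one row at a time.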


In [24]:
df2.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2683 entries, 0 to 2682
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Product_name         2683 non-null   object
 1   category_for_rec     2683 non-null   object
 2   description_for_rec  2683 non-null   object
 3   combined_text        2683 non-null   object
 4   combined_embedding   2683 non-null   object
dtypes: object(5)
memory usage: 104.9+ KB


In [25]:
df2.combined_embedding[0]

array([-7.57541647e-03, -1.52590964e-03, -2.31125858e-02,  2.05572005e-02,
       -1.74666988e-03, -9.97751858e-03,  2.38002557e-02, -4.12319638e-02,
       -6.25356361e-02,  1.97021253e-02, -1.22488877e-02,  4.10493900e-04,
        7.93758184e-02,  1.02800932e-02,  1.20936586e-02, -1.01885740e-02,
       -4.06932272e-02,  3.04490384e-02,  2.67656073e-02, -3.08387000e-02,
        1.89062543e-02,  8.22362229e-02, -2.40688194e-02,  4.74595204e-02,
       -3.99705302e-03,  3.45882103e-02,  1.56669831e-03,  2.29063630e-02,
       -7.30910376e-02, -2.73126140e-02, -1.75094530e-02,  2.77200826e-02,
        4.41192463e-02, -5.23010828e-03,  4.37136963e-02, -9.90534201e-03,
       -2.77040303e-02, -2.89203003e-02, -2.26457808e-02, -2.78614424e-02,
       -1.98608655e-02, -3.91041562e-02,  1.10147186e-02,  7.66877756e-02,
       -5.95434681e-02,  4.96196561e-03, -5.23751043e-02,  3.00502181e-02,
        4.40588109e-02,  1.04640141e-01,  7.60663534e-04,  4.42886874e-02,
       -5.28513044e-02, -

In [26]:
#df2.to_csv('data_with_embeddings.csv',index=False)


## Cosine Similarity:
- Cosine similarity is calculated between the input text and the embeddings of all products in the dataset.
- The most similar products are identified based on the cosine similarity scores.
- The top similar products are retrieved and displayed.
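To make the steps above concrete, here is a small NumPy sketch of cosine similarity and top-n selection. It is not the notebook's implementation, which relies on sentence_transformers.util; the function names and the two-dimensional toy vectors are assumptions chosen for illustration.

```python
import numpy as np

def cosine_scores(query_vec, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    row_units = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return row_units @ query_unit

def top_n_indices(scores, n=1):
    """Indices of the n largest scores, best first."""
    return np.argsort(-scores)[:n]

# Toy product vectors and query (made up for illustration)
product_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.1])

scores = cosine_scores(query, product_vecs)
print(top_n_indices(scores, n=2))
```

Cosine similarity depends only on the angle between vectors, so the length of the query or product text does not by itself raise or lower the score.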

In [27]:
from sentence_transformers import util # cosine similarity

In [32]:
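def get_similar_products(combined_input, df, top_n=1):
    # encode the query text with the same model used for the products
    combined_embedding = model.encode(combined_input)

    # stack the per-row vectors of the given DataFrame into one 2-D array
    product_embeddings = np.stack(df['combined_embedding'].to_numpy())
    similarities = util.pytorch_cos_sim(combined_embedding, product_embeddings)

    print(similarities)

    # positions of the top_n highest similarity scores
    similar_indices = similarities.argsort(descending=True, dim=1)[0][:top_n]
    print(similar_indices)

    # use the df argument rather than the global df2
    similar_items_df = df.iloc[similar_indices.tolist()][['Product_name','category_for_rec','description_for_rec']]
    return similar_items_df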

In [33]:
product_info_to_search = "Colorful Chalkboard"

similar_items = get_similar_products(product_info_to_search,df2)
similar_items

tensor([[0.0721, 0.0721, 0.0721,  ..., 0.0148, 0.0148, 0.0148]])
tensor([150])


Unnamed: 0,Product_name,category_for_rec,description_for_rec
150,ZENY 15quotx30ft Battle Rope Workout Training ...,Sports Outdoors Sports Fitness Exercise Fitness,Practice Level Guide br Levels Sizes Rope br J...


In [34]:
print(product_info_to_search)
print("\nSimilar Products:")
for idx,row in similar_items.iterrows():
    print(f"Product : {row['Product_name']}")

Colorful Chalkboard

Similar Products:
Product : ZENY 15quotx30ft Battle Rope Workout Training Undulation Rope Fitness Rope Exercise


In [35]:
product_info_to_search = input("Enter Keywords : ")

similar_items = get_similar_products(product_info_to_search,df2)
print(product_info_to_search)
print("\nSimilar Products:")
for idx,row in similar_items.iterrows():
    print(f"Product : {row['Product_name']}")
similar_items

Enter Keywords : BAT
tensor([[0.1312, 0.1312, 0.1312,  ..., 0.0733, 0.0733, 0.0733]])
tensor([1022])
BAT

Similar Products:
Product : Vaughan RB 28Ounce Rig Builders Hatchet Hickory Handle heavy construction 17Inch Long


Unnamed: 0,Product_name,category_for_rec,description_for_rec
1022,Vaughan RB 28Ounce Rig Builders Hatchet Hickor...,Sports Outdoors Outdoor Recreation Camping Hik...,Forged American made high carbon steel fully p...
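Because the same product appears on many rows with the same score, asking for more than one result could return one product several times. The sketch below is one way to keep only the best row per product name before taking the top results. The function name top_unique_products, the toy catalog and the scores are hypothetical and not taken from the notebook's output.

```python
import pandas as pd

def top_unique_products(df, scores, top_n=3):
    """Rank rows by score, keep the best row per product name, return the top_n."""
    ranked = df.assign(score=scores).sort_values("score", ascending=False)
    unique = ranked.drop_duplicates(subset="Product_name", keep="first")
    return unique.head(top_n)

# Hypothetical rows and scores, made up for illustration
catalog = pd.DataFrame(
    {"Product_name": ["Hatchet", "Hatchet", "Boat Set", "Paper Map", "Boat Set"]}
)
scores = [0.41, 0.41, 0.35, 0.12, 0.35]

result = top_unique_products(catalog, scores, top_n=2)
print(result["Product_name"].tolist())
```

In the notebook, the scores would come from the similarity tensor, for example after converting its first row to a list.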


**This project aims to provide users with personalized and relevant product recommendations based on textual input, thereby enhancing their shopping experience and increasing engagement with the platform.**