# Systematic Similarity

## Data Preparation:
- The code starts by importing the necessary libraries and loading a previously cleaned dataset (t1.csv).
- It defines functions to clean text data by removing punctuation and stopwords using NLTK.
- The dataset is then truncated to its first 10,000 rows (df1) because of limited computing resources.

In [3]:
import numpy as np   #for numerical calculation
import pandas as pd  #for data analysis

import string 
import regex # regular expression
from nltk.corpus import stopwords # to remove stopwords using nltk library

In [4]:
df=pd.read_csv('t1.csv') # reading the data which we cleaned before

In [5]:
df.info() # data information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33573 entries, 0 to 33572
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   33573 non-null  int64  
 1   category     33573 non-null  object 
 2   description  33573 non-null  object 
 3   title        33573 non-null  object 
 4   brand        33573 non-null  object 
 5   date         33573 non-null  object 
 6   price        33573 non-null  float64
 7   asin         33573 non-null  object 
 8   imageURL     33573 non-null  object 
 9   overall      33573 non-null  int64  
 10  verified     33573 non-null  bool   
 11  reviewTime   33573 non-null  object 
 12  reviewText   33573 non-null  object 
 13  reviewerID   33573 non-null  object 
 14  Sentimental  33573 non-null  object 
dtypes: bool(1), float64(1), int64(2), object(11)
memory usage: 3.6+ MB


In [6]:
df.isnull().sum() # finding null values

Unnamed: 0     0
category       0
description    0
title          0
brand          0
date           0
price          0
asin           0
imageURL       0
overall        0
verified       0
reviewTime     0
reviewText     0
reviewerID     0
Sentimental    0
dtype: int64

In [7]:
df.head()

Unnamed: 0.1,Unnamed: 0,category,description,title,brand,date,price,asin,imageURL,overall,verified,reviewTime,reviewText,reviewerID,Sentimental
0,0,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],3,False,"11 3, 2006","I thought the book was entertaining and cute, ...",A2WJLOXXIB7NF3,Neutral
1,1,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"05 9, 2006",This adorable story is an all time favorite fa...,A1RKICUK0GG6VF,Positive
2,2,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"03 11, 2006",Lisa's bear Corduroy gets lost in the laundrom...,A1QA5E50M398VW,Positive
3,3,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"01 24, 2001",In this installment of Corduroy's adventures w...,A3N0HBW8IP8CZQ,Positive
4,4,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"07 30, 2000",Researchers constantly find that reading to ch...,A1K1JW1C5CUSUZ,Positive


In [8]:
df1=df[:10000] # keep only the first 10,000 rows because my PC cannot handle the full dataset

## Text Cleaning:
- The product names, categories, and descriptions are cleaned using the defined text processing functions.
- Cleaned text is converted back to strings and combined into a single column (combined_text) for each product.
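
As a rough, self-contained illustration of the cleaning steps described above (strip punctuation, drop stopwords, join the remaining words back into a string), the sketch below uses a tiny hand-written stopword set in place of NLTK's English list so that it runs without downloading the corpus. The names `SAMPLE_STOPWORDS` and `clean_text` are hypothetical and do not appear in the notebook.

```python
import string

# Assumption: a tiny stand-in for NLTK's English stopword list.
SAMPLE_STOPWORDS = {"a", "an", "and", "for", "the", "to", "with"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text):
    """Remove punctuation and stopwords, then return the words as one string."""
    no_punct = text.translate(_PUNCT_TABLE)
    kept = [w for w in no_punct.split() if w.lower() not in SAMPLE_STOPWORDS]
    return " ".join(kept)

print(clean_text("A Pocket for Corduroy"))
print(clean_text("['Office Products', 'Office & School Supplies']"))
```

Using a set for the stopwords keeps each membership check cheap, which matters when the function is applied to thousands of rows.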

In [9]:
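STOPWORDS = set(stopwords.words('english')) # build the stopword set once instead of once per word

def text_process(text): # remove punctuation and English stopwords, return a list of words
    nopunc = [char for char in  text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]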

In [10]:
def list_to_string(s): # converting list to string
    str1 = " "
    return (str1.join(s))

In [11]:
# cleaning the dataset
df1['Product_name']=df1['title'].apply(text_process)
df1['category_for_rec']=df1['category'].apply(text_process)
df1['description_for_rec']=df1['description'].apply(text_process)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['Product_name']=df1['title'].apply(text_process)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['category_for_rec']=df1['category'].apply(text_process)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['description_for_rec']=df1['description'].apply(text_process)
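
The `SettingWithCopyWarning` shown above appears because `df1` was created as a slice of `df`, so pandas cannot tell whether the new columns are meant for the slice or for the original frame. One common way to avoid it, sketched here on a small made-up frame rather than the real dataset, is to take an explicit copy with `.copy()` when slicing. The column name `title_lower` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the real dataset (values chosen only for illustration).
df = pd.DataFrame(
    {"title": ["A Pocket for Corduroy", "Standard Pocket Chart", "File Folders"]}
)

# .copy() makes df1 an independent DataFrame, so adding columns to it
# does not raise SettingWithCopyWarning and never touches df.
df1 = df[:2].copy()
df1["title_lower"] = df1["title"].str.lower()

print(df1)
print("title_lower" in df.columns)
```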


In [12]:
# converting data from list to string
df1['Product_name']=df1['Product_name'].apply(list_to_string)
df1['category_for_rec']=df1['category_for_rec'].apply(list_to_string)
df1['description_for_rec']=df1['description_for_rec'].apply(list_to_string)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['Product_name']=df1['Product_name'].apply(list_to_string)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['category_for_rec']=df1['category_for_rec'].apply(list_to_string)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['description_for_rec']=df1['description_for_rec'].apply(list_to_strin

In [13]:
df1.head()

Unnamed: 0.1,Unnamed: 0,category,description,title,brand,date,price,asin,imageURL,overall,verified,reviewTime,reviewText,reviewerID,Sentimental,Product_name,category_for_rec,description_for_rec
0,0,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],3,False,"11 3, 2006","I thought the book was entertaining and cute, ...",A2WJLOXXIB7NF3,Neutral,Pocket Corduroy,Office Products Office School Supplies Educati...,Corduroy bear goes launderette Lisa overhears ...
1,1,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"05 9, 2006",This adorable story is an all time favorite fa...,A1RKICUK0GG6VF,Positive,Pocket Corduroy,Office Products Office School Supplies Educati...,Corduroy bear goes launderette Lisa overhears ...
2,2,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"03 11, 2006",Lisa's bear Corduroy gets lost in the laundrom...,A1QA5E50M398VW,Positive,Pocket Corduroy,Office Products Office School Supplies Educati...,Corduroy bear goes launderette Lisa overhears ...
3,3,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"01 24, 2001",In this installment of Corduroy's adventures w...,A3N0HBW8IP8CZQ,Positive,Pocket Corduroy,Office Products Office School Supplies Educati...,Corduroy bear goes launderette Lisa overhears ...
4,4,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"07 30, 2000",Researchers constantly find that reading to ch...,A1K1JW1C5CUSUZ,Positive,Pocket Corduroy,Office Products Office School Supplies Educati...,Corduroy bear goes launderette Lisa overhears ...


In [14]:
df1['combined_text']=df1['Product_name']+' '+df1['category_for_rec']+' '+df1['description_for_rec'] # combining the three cleaned columns

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['combined_text']=df1['Product_name']+' '+df1['category_for_rec']+' '+df1['description_for_rec'] #combining both column


In [15]:
df1.head()

Unnamed: 0.1,Unnamed: 0,category,description,title,brand,date,price,asin,imageURL,overall,verified,reviewTime,reviewText,reviewerID,Sentimental,Product_name,category_for_rec,description_for_rec,combined_text
0,0,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],3,False,"11 3, 2006","I thought the book was entertaining and cute, ...",A2WJLOXXIB7NF3,Neutral,Pocket Corduroy,Office Products Office School Supplies Educati...,Corduroy bear goes launderette Lisa overhears ...,Pocket Corduroy Office Products Office School ...
1,1,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"05 9, 2006",This adorable story is an all time favorite fa...,A1RKICUK0GG6VF,Positive,Pocket Corduroy,Office Products Office School Supplies Educati...,Corduroy bear goes launderette Lisa overhears ...,Pocket Corduroy Office Products Office School ...
2,2,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"03 11, 2006",Lisa's bear Corduroy gets lost in the laundrom...,A1QA5E50M398VW,Positive,Pocket Corduroy,Office Products Office School Supplies Educati...,Corduroy bear goes launderette Lisa overhears ...,Pocket Corduroy Office Products Office School ...
3,3,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"01 24, 2001",In this installment of Corduroy's adventures w...,A3N0HBW8IP8CZQ,Positive,Pocket Corduroy,Office Products Office School Supplies Educati...,Corduroy bear goes launderette Lisa overhears ...,Pocket Corduroy Office Products Office School ...
4,4,"['Office Products', 'Office & School Supplies'...",['Corduroy the bear goes to the launderette wi...,A Pocket for Corduroy,Ingram Book & Distributor,2006-09-14,0.95,140503528,[],5,False,"07 30, 2000",Researchers constantly find that reading to ch...,A1K1JW1C5CUSUZ,Positive,Pocket Corduroy,Office Products Office School Supplies Educati...,Corduroy bear goes launderette Lisa overhears ...,Pocket Corduroy Office Products Office School ...


In [16]:
df2=df1[['Product_name','category_for_rec','description_for_rec','combined_text']] # keeping only the columns required for further analysis

In [17]:
df2.combined_text[0]

'Pocket Corduroy Office Products Office School Supplies Education Crafts Early Childhood Education Materials Corduroy bear goes launderette Lisa overhears mother warn taking things pockets washing clothes Corduroy discovers pocket begins search find one'

In [18]:
#pip install sentence-transformers

## Sentence Embedding:
- Hugging Face's Sentence Transformers library is used to load a pre-trained model (all-MiniLM-L6-v2).
- Each combined text is encoded into a vector representation using the pre-trained model, resulting in combined_embedding.
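
Encoding one row at a time works, but `SentenceTransformer.encode` also accepts a list of sentences and processes them in batches, which is usually faster. The helper below is a minimal sketch of that idea. `embed_texts` is a hypothetical name, and the batch size of 64 is an arbitrary assumption rather than a tuned value.

```python
def embed_texts(texts, model, batch_size=64):
    """Encode every text in batched calls instead of one call per row.

    `model` is expected to be a SentenceTransformer (or any object with a
    compatible `encode` method).
    """
    return model.encode(list(texts), batch_size=batch_size, show_progress_bar=False)

# Possible usage with the notebook's objects (not executed here):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# embeddings = embed_texts(df2["combined_text"], model)  # one 384-dimensional row per text
# df2["combined_embedding"] = list(embeddings)
```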

In [19]:
from sentence_transformers import SentenceTransformer # sentence transformer


In [20]:
model = SentenceTransformer('all-MiniLM-L6-v2') # a pre-trained model hosted on Hugging Face

# Embedding

In [21]:
df2['combined_embedding']=df2['combined_text'].apply(lambda x:model.encode(x)) # encoding the text to vector

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['combined_embedding']=df2['combined_text'].apply(lambda x:model.encode(x)) # encoding the text to vector


In [22]:
df2.sample(20)

Unnamed: 0,Product_name,category_for_rec,description_for_rec,combined_text,combined_embedding
1868,Scholastic Teachers Friend Schedule Cards Pock...,Office Products Office amp School Supplies Edu...,use Scholastic Daily Schedule Pocket Chart,Scholastic Teachers Friend Schedule Cards Pock...,"[-0.026725624, 0.00227817, 0.028070852, -0.032..."
1340,Scholastic File Organizer Pocket Chart TF5104,Office Products Office amp School Supplies Edu...,Pocket chart features 10 sturdy pockets fit le...,Scholastic File Organizer Pocket Chart TF5104 ...,"[0.0060373796, 0.0038241984, -0.016716043, -0...."
3119,Carson Dellosa 5239 KidDrawn Christian Faith S...,Office Products Office amp School Supplies Edu...,Perfect reward recognition pack acidfree ligni...,Carson Dellosa 5239 KidDrawn Christian Faith S...,"[-0.060244426, -0.03650533, 0.008916414, 0.003..."
3395,MarkMyTime Digital Bookmark Reading Timer Neon...,Office Products Office amp School Supplies Boo...,bTheres product quite like accurately track ch...,MarkMyTime Digital Bookmark Reading Timer Neon...,"[-0.066248976, -0.025308834, 0.005696673, 0.04..."
5451,Freedom Journal Best Daily Planner Accomplish ...,Office Products Office amp School Supplies Cal...,,Freedom Journal Best Daily Planner Accomplish ...,"[-0.03452844, 0.071058184, 0.028710531, 0.0805..."
7059,CarsonDellosa File Folders CDP136002,Office Products Office School Supplies Filing ...,Make organizing fun cool bubbly blues file fol...,CarsonDellosa File Folders CDP136002 Office Pr...,"[-0.011596656, 0.0046337475, -0.013802588, 0.0..."
1550,Standard Pocket Chart,Office Products Office amp School Supplies Edu...,Pocket chart features 10 seethrough plastic po...,Standard Pocket Chart Office Products Office a...,"[-0.03116564, 0.021455582, -0.004495336, -0.02..."
4377,pretty simple cards quotThank Cardsquot 12 Car...,Office Products Office School Supplies Paper C...,Simple Direct Understated Thank Notes Black Sc...,pretty simple cards quotThank Cardsquot 12 Car...,"[-0.05072979, 0.066627346, 0.019013071, 0.0749..."
5113,Freedom Journal Best Daily Planner Accomplish ...,Office Products Office amp School Supplies Cal...,,Freedom Journal Best Daily Planner Accomplish ...,"[-0.03452844, 0.071058184, 0.028710531, 0.0805..."
1706,Scholastic Primary Math Charts Bulletin Board ...,,allinone math reference set includes place val...,Scholastic Primary Math Charts Bulletin Board ...,"[0.027112152, -0.04390265, -0.0042294986, -0.0..."


In [23]:
df2.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Product_name         10000 non-null  object
 1   category_for_rec     10000 non-null  object
 2   description_for_rec  10000 non-null  object
 3   combined_text        10000 non-null  object
 4   combined_embedding   10000 non-null  object
dtypes: object(5)
memory usage: 390.8+ KB


In [24]:
df2.combined_embedding[0]

array([-1.13985658e-01,  2.96170972e-02,  2.20395047e-02,  1.28198704e-02,
        3.35913268e-04,  2.03762064e-03,  1.04566537e-01,  3.95324789e-02,
        3.63291777e-03,  1.43586588e-03,  5.99806979e-02,  4.42380346e-02,
       -8.74844193e-03,  3.83620858e-02, -2.00432912e-03,  5.50881736e-02,
        2.62357015e-02,  6.41684681e-02,  5.43782562e-02,  9.08802729e-03,
        5.65564372e-02,  5.05657531e-02,  6.05862029e-02, -7.71827204e-03,
       -3.28158997e-02,  1.20924011e-01, -2.05528997e-02, -1.44500554e-01,
        1.27150456e-03, -5.79404943e-02,  1.79672372e-02,  8.07602555e-02,
        6.75414205e-02,  5.05498238e-02,  4.71836179e-02,  2.55447384e-02,
        8.65877196e-02,  2.16620881e-02,  3.92201096e-02,  8.00263211e-02,
        8.73124693e-03, -3.34960967e-02, -2.30383556e-02, -4.29795682e-02,
       -1.20390449e-02, -4.59842309e-02, -1.72692221e-02,  2.66106017e-02,
        5.10983244e-02, -1.76471863e-02, -1.84013788e-02, -7.52899125e-02,
       -4.58665602e-02,  

In [25]:
#df2.to_csv('data_with_embeddings.csv',index=False)


## Cosine Similarity:
- Cosine similarity is calculated between the input text and the embeddings of all products in the dataset.
- The most similar products are identified based on the cosine similarity scores.
- The top similar products are retrieved and displayed.
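
To make the ranking step concrete, here is a small pure-Python sketch of cosine similarity, cos(a, b) = (a · b) / (‖a‖ ‖b‖), followed by a top-n selection. The two-dimensional vectors are made up for illustration, whereas the real embeddings have 384 dimensions. `cosine_similarity` and `top_n_similar` are hypothetical helper names, not functions from `sentence_transformers`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_similar(query, vectors, top_n=2):
    """Return the positions of the top_n vectors most similar to the query."""
    scores = [cosine_similarity(query, v) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order[:top_n]

# Made-up "product embeddings" and a made-up query vector.
product_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [2.0, 0.0]

print(top_n_similar(query_vector, product_vectors))
```

Because cosine similarity depends only on direction, the query `[2.0, 0.0]` scores the same against `[1.0, 0.0]` as a unit-length query would.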

In [26]:
from sentence_transformers import util # cosine similarity

In [27]:
def get_similar_products(combined_input, df, top_n=2):
    # encode the query text with the same model used for the products
    combined_embedding = model.encode(combined_input)
    
    # cosine similarity between the query and every product embedding in df
    similarities = util.pytorch_cos_sim(combined_embedding, df['combined_embedding'])
    
    print(similarities)
    
    # positions of the top_n highest scores, best match first
    similar_indices = similarities.argsort(descending=True, dim=1)[0][:top_n]
    print(similar_indices)
    
    # use the df argument rather than the global df2
    similar_items_df = df.iloc[similar_indices][['Product_name','category_for_rec','description_for_rec']]
    return similar_items_df

In [28]:
product_info_to_search = "Colorful Chalkboard"

similar_items = get_similar_products(product_info_to_search,df2)
similar_items

tensor([[0.2228, 0.2228, 0.2228,  ..., 0.2378, 0.2378, 0.2378]])
tensor([7102, 7114])



Unnamed: 0,Product_name,category_for_rec,description_for_rec
7102,Colorful Chalkboard EZ Letters,Office Products Office School Supplies Educati...,DIVEasily create eyecatching messages contempo...
7114,Colorful Chalkboard EZ Letters,Office Products Office School Supplies Educati...,DIVEasily create eyecatching messages contempo...


In [29]:
print(product_info_to_search)
print("\nSimilar Products:")
for idx,row in similar_items.iterrows():
    print(f"Product : {row['Product_name']}")

Colorful Chalkboard

Similar Products:
Product : Colorful Chalkboard EZ Letters
Product : Colorful Chalkboard EZ Letters
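
Both results above are the same product, most likely because the dataset holds one row per review, so a product with several reviews appears several times. One possible remedy, sketched below with made-up names, is to walk the ranked results and keep only the first occurrence of each product name. `unique_top_n` is a hypothetical helper. In the notebook itself, calling `drop_duplicates(subset='combined_text')` on the DataFrame before computing the embeddings would be another option.

```python
def unique_top_n(ranked_names, top_n=2):
    """Keep the first top_n distinct names, preserving the ranked order."""
    seen = set()
    unique = []
    for name in ranked_names:
        if name not in seen:
            seen.add(name)
            unique.append(name)
        if len(unique) == top_n:
            break
    return unique

# Made-up ranking in which the best match appears twice.
ranked = ["Colorful Chalkboard EZ Letters", "Colorful Chalkboard EZ Letters",
          "Standard Pocket Chart", "File Folders"]
print(unique_top_n(ranked))
```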


In [30]:
product_info_to_search = input("Enter Keywords : ")

similar_items = get_similar_products(product_info_to_search,df2)
print(product_info_to_search)
print("\nSimilar Products:")
for idx,row in similar_items.iterrows():
    print(f"Product : {row['Product_name']}")
similar_items

Enter Keywords : notebook
tensor([[0.3006, 0.3006, 0.3006,  ..., 0.2912, 0.2912, 0.2912]])
tensor([3600, 3586])
notebook

Similar Products:
Product : Scientific Notebook Laboratory Notebook 192 numbered pages black hardcover 2001HC
Product : Scientific Notebook Laboratory Notebook 192 numbered pages black hardcover 2001HC


Unnamed: 0,Product_name,category_for_rec,description_for_rec
3600,Scientific Notebook Laboratory Notebook 192 nu...,Office Products Office School Supplies Paper N...,Bound laboratory notebook archival quality sec...
3586,Scientific Notebook Laboratory Notebook 192 nu...,Office Products Office School Supplies Paper N...,Bound laboratory notebook archival quality sec...


**This project aims to provide users with personalized and relevant product recommendations based on textual input, thereby enhancing their shopping experience and increasing engagement with the platform.**