<h1 align="center">Electronics Recommender System</h1>

### Introduction to Recommender Systems: Addressing Information Overload
We live in an era saturated with content, where the sheer volume of movies, news articles, shopping products, and websites overwhelms individual attention spans. The average Google search yields over a million results, yet how often do we venture beyond the first page of links? This phenomenon, known as the "long tail problem," highlights how a small fraction of content receives disproportionate attention, while the majority remains undiscovered.

In the face of this challenge, service providers must ask: "How do I curate a manageable selection of content for users that is both relevant and desired?" Thankfully, decades of research have produced a solution: recommender systems.

### Understanding Recommender Systems
Recommender systems predict a user's preference for an item, allowing service providers to offer a tailored selection of content, thereby enhancing user engagement and broadening content exploration.

### Fundamental Concepts
#### Terminology: Users, Items, and Ratings
In the realm of recommender systems, two primary entities exist: Users and Items.

 - Items are the content being consumed: movies, articles, products, etc. They remain passive, with fixed properties.
 - Users interact with these items, providing ratings based on their preferences. Ratings can be explicit (e.g., giving a movie a star rating) or implicit (e.g., watching a movie without rating it directly).

### Implementing Content-Based Filtering: An Example
Let's delve into one of the primary methods employed in recommender systems: content-based filtering. In this context, we'll focus on building an "Electronics Recommender System."






## Measuring Similarity 

<br></br>

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Cosine_similarity.jpg"
     alt="Cosine Similarity "
     style="float: center; padding-bottom=0.5em"
     width=600px/>
Measuring the similarity between the ratings of two users (A) and (B) for the books 'Harry Potter and the Philosopher's Stone' and 'The Diary of a Young Girl', using the Cosine similarity metric.  
</div>


Having learnt about the entities that exist within recommender systems, we may wonder how they function. While this is something that we'll learn throughout this entire train, one fundamental principle that we need to understand is that recommender systems are built by utilising the _relations_ which exist between items and users. As such, these systems always need a mechanism to measure how related or _similar_ a user is to another user, or an item is to another item.

We accomplish this measurement of similarity through, you guessed it, a _similarity metric_.  

Generally speaking, a similarity metric can be thought of as being the inverse of a distance measure: if two things are considered to be very similar they should be assigned a high similarity value (close to 1), while dissimilar items should receive a low similarity value (close to zero). Other [important properties](https://online.stat.psu.edu/stat508/lesson/1b/1b.2/1b.2.1) include:
 - (Symmetry) $Sim(A,B) = Sim(B,A)$ 
 - (Identity) $Sim(A,A) = 1$
 - (Uniqueness) $Sim(A,B) = 1 \leftrightarrow A = B$
 
While there are many similarity metrics to choose from when building a recommender system (and more than one can certainly be used simultaneously), a popular choice is the **Cosine similarity**. We won't go into the fundamental trig here (we hope that you remember this from high school), but recall that as an angle becomes smaller (approaching $0^\circ$) the value of its cosine increases. Conversely, as the angle increases the cosine value decreases. It turns out that this behavior makes the cosine of the angle between two p-dimensional vectors desirable as a [similarity metric](https://en.wikipedia.org/wiki/Cosine_similarity) which can easily be computed.

Using the figure above to help guide our understanding, the Cosine similarity between two p-dimensional vectors $A$ and $B$ can be given as:

$$ \begin{align}
Sim(A,B)  &= \frac{A \cdot B}{||A|| \times ||B||} \\ \\
& = \frac{\sum_{i=1}^{p}A_{i}B_{i}}{\sqrt{\sum_{i=1}^{p}A_{i}^2} \sqrt{\sum_{i=1}^{p}B_{i}^2}}
\end{align} $$
  

To make things a little more concrete, let's work out the cosine similarity using our provided example above. Here, each vector represents the ratings given by one of two *users*, $A$ and $B$, who have each rated two books (rating#1 $ \rightarrow r_1$, and rating#2 $ \rightarrow r_2$). To work out how similar these two users are based on their supplied ratings, we can use the Cosine similarity definition as follows:   


$$ \begin{align}
Sim(A,B)  & = \frac{(A_{r1} \times B_{r1})+(A_{r2} \times B_{r2})}{\sqrt{A_{r1}^2 + A_{r2}^2} \times \sqrt{B_{r1}^2 + B_{r2}^2}} \\ \\
& = \frac{(3 \times 5) + (4 \times 2)}{\sqrt{9 + 16} \times \sqrt{25 + 4}} \\ \\
& = \frac{23}{26.93} \\ \\
& = 0.854
\end{align} $$
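
The arithmetic above can be double-checked with a few lines of plain Python. This is only an illustrative sketch: the helper name `cosine_sim` is our own, and the two vectors are the users' ratings from the worked example. It also checks the symmetry and identity properties listed earlier, and shows that rescaling a vector still gives a score of 1, so for cosine similarity the uniqueness property holds only up to the length of the vectors.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

user_a = [3, 4]  # ratings r1, r2 given by user A
user_b = [5, 2]  # ratings r1, r2 given by user B

sim_ab = cosine_sim(user_a, user_b)
print(round(sim_ab, 3))  # 0.854

# Symmetry and identity
print(math.isclose(sim_ab, cosine_sim(user_b, user_a)))
print(math.isclose(cosine_sim(user_a, user_a), 1.0))

# A rescaled copy of A points in the same direction, so it also scores 1
print(math.isclose(cosine_sim(user_a, [6, 8]), 1.0))
```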

It would be a pain to work this out manually each time! Thankfully, we can obtain this same result using the `cosine_similarity` function provided to us in `sklearn`. 
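
As a minimal sketch of that call, assuming `numpy` and `scikit-learn` are installed, the two users' ratings can be stacked as rows of one array. `cosine_similarity` then returns the full pairwise matrix, and the off-diagonal entry is the similarity between user A and user B. The variable names here are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# One row per user, one column per book rating
ratings = np.array([[3, 4],   # user A
                    [5, 2]])  # user B

pairwise = cosine_similarity(ratings)  # shape (2, 2): every user against every user
sim_ab = float(pairwise[0, 1])         # similarity between user A and user B
print(round(sim_ab, 3))
```

The diagonal of `pairwise` holds each user's similarity with themselves, which is why the recommender code later skips the best-scoring entry.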

As usual, before we can use this function we need to import the libraries that we will need.

In [176]:
## Importing Libraries
# Import our regular old heroes
import numpy as np
import pandas as pd
import scipy as sp # <-- The sister of NumPy, used in our code for numerical efficiency.
import matplotlib.pyplot as plt
import seaborn as sns

# Entity featurization and similarity computation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from surprise import SVD, Reader, Dataset

# Text preprocessing
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer
from nltk import SnowballStemmer, PorterStemmer, LancasterStemmer

# Model persistence
import pickle

# Libraries used during sorting procedures.
import operator # <-- Convenient item retrieval during iteration
import heapq # <-- Efficient sorting of large lists

# Imported for our sanity
import warnings
warnings.filterwarnings('ignore')

In [177]:
data = pd.read_csv("data/electronics_products_pricing.csv")
data.head()

Unnamed: 0,id,prices.availability,prices.condition,prices.currency,prices.dateSeen,prices.isSale,prices.merchant,prices.shipping,prices.sourceURLs,asins,...,imageURLs,keys,manufacturer,manufacturerNumber,name,primaryCategories,sourceURLs,upc,weight,price
0,AVphrugr1cnluZ0-FOeH,Yes,New,USD,"2017-05-10T20:00:00Z,2017-05-09T15:00:00Z",False,Bestbuy.com,,http://www.bestbuy.com/site/products/7100293.p...,B00I9HD8PK,...,https://i5.walmartimages.com/asr/dd5f42c4-076c...,"819127010485,ecoxgearecostonebluetoothspeaker/...",Ecoxgear,GDI-EGST701,EcoXGear Ecostone Bluetooth Speaker,Electronics,http://www.walmart.com/ip/EcoXGear-Ecostone-Bl...,819000000000.0,3 pounds,92.99
1,AVrI6FDbv8e3D1O-lm4R,Yes,New,USD,"2017-10-10T02:00:00Z,2017-08-12T03:00:00Z,2017...",False,Bestbuy.com,,https://www.bestbuy.com/site/lenovo-100s-14ibr...,B06ZY63J8H,...,https://i5.walmartimages.com/asr/fcc50cce-a3c1...,"190793918948,lenovo100s14ibr14laptopintelceler...",,100s-14ibr,Lenovo - 100S-14IBR 14 Laptop - Intel Celeron ...,Electronics,https://www.walmart.com/ip/Lenovo-100S-14IBR-1...,191000000000.0,4.3 pounds,229.99
2,AVpiLlubilAPnD_xBoTa,Yes,New,USD,"2017-10-10T19:00:00Z,2017-09-12T14:00:00Z,2017...",False,Bestbuy.com,,https://www.bestbuy.com/site/house-of-marley-s...,B00G3P9UMU,...,https://i5.walmartimages.com/asr/c124aa15-b9e3...,"0846885007037,houseofmarleysmilejamaicainearea...",House Of Marley,EM-JE041-MI,House of Marley Smile Jamaica In-Ear Earbuds,Electronics,https://www.walmart.com/ip/House-of-Marley-Smi...,847000000000.0,0.6 ounces,16.99
3,AVpgQP5vLJeJML43LQbd,Yes,New,USD,"2017-09-08T05:00:00Z,2017-09-18T13:00:00Z,2017...",False,Bestbuy.com,,https://www.bestbuy.com/site/products/6311012....,B00TTWZFFA,...,https://i5.walmartimages.com/asr/1be435f7-5f3a...,"sonyultraportablebluetoothspeaker/sosrsx11bk,s...",Sony,SRSX11/BLK,Sony Ultra-Portable Bluetooth Speaker,Electronics,https://www.walmart.com/ip/Sony-Ultra-Portable...,27242886599.0,1 pounds,69.99
4,AV1YDsmoGV-KLJ3adcbe,More on the Way,New,USD,2017-12-05T13:00:00Z,True,bhphotovideo.com,Free Expedited Shipping for most orders over $49,https://www.bhphotovideo.com/c/product/1105014...,B00MHPAF38,...,http://i.ebayimg.com/thumbs/images/g/TBUAAOSwd...,sonyalphaa5100digitalcamerakitwith1650mmlenswh...,,ILCE5100L/W,Alpha a5100 Mirrorless Digital Camera with 16-...,Electronics,https://reviews.bestbuy.com/3545/8429343/revie...,27242883246.0,9.98 oz 4.09 oz,846.0


In [178]:
data.columns

Index(['id', 'prices.availability', 'prices.condition', 'prices.currency',
       'prices.dateSeen', 'prices.isSale', 'prices.merchant',
       'prices.shipping', 'prices.sourceURLs', 'asins', 'brand', 'categories',
       'dateAdded', 'dateUpdated', 'ean', 'imageURLs', 'keys', 'manufacturer',
       'manufacturerNumber', 'name', 'primaryCategories', 'sourceURLs', 'upc',
       'weight', 'price'],
      dtype='object')

In [179]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5436 entries, 0 to 5435
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   5436 non-null   object 
 1   prices.availability  5436 non-null   object 
 2   prices.condition     5436 non-null   object 
 3   prices.currency      5436 non-null   object 
 4   prices.dateSeen      5436 non-null   object 
 5   prices.isSale        5436 non-null   bool   
 6   prices.merchant      5436 non-null   object 
 7   prices.shipping      3199 non-null   object 
 8   prices.sourceURLs    5436 non-null   object 
 9   asins                5436 non-null   object 
 10  brand                5436 non-null   object 
 11  categories           5436 non-null   object 
 12  dateAdded            5436 non-null   object 
 13  dateUpdated          5436 non-null   object 
 14  ean                  1175 non-null   object 
 15  imageURLs            5436 non-null   o

In [180]:
new_data = data[['name', 'brand', 'categories']].copy()  # copy so new columns are not added to a view of data
new_data.head()

Unnamed: 0,name,brand,categories
0,EcoXGear Ecostone Bluetooth Speaker,Grace Digital,"Electronics,Home Audio & Theater,Home Audio,Al..."
1,Lenovo - 100S-14IBR 14 Laptop - Intel Celeron ...,Lenovo,"Electronics,Computers,Laptops,Laptops By Brand..."
2,House of Marley Smile Jamaica In-Ear Earbuds,House of Marley,"Headphones,Consumer Electronics,Portable Audio..."
3,Sony Ultra-Portable Bluetooth Speaker,Sony,"Electronics,Home Audio & Theater,Home Audio,Al..."
4,Alpha a5100 Mirrorless Digital Camera with 16-...,Sony,"Digital Cameras,Cameras & Photo,Used:Digital P..."


In [181]:
new_data.isnull().sum()

name          0
brand         0
categories    0
dtype: int64

### Data Cleaning

In [182]:
# Concatenate the features before cleaning

new_data['tags'] = new_data['brand'] + " " + new_data['categories']
new_data['tags']

0       Grace Digital Electronics,Home Audio & Theater...
1       Lenovo Electronics,Computers,Laptops,Laptops B...
2       House of Marley Headphones,Consumer Electronic...
3       Sony Electronics,Home Audio & Theater,Home Aud...
4       Sony Digital Cameras,Cameras & Photo,Used:Digi...
                              ...                        
5431    Apple iPhones,All Cell Phones with Plans,iPhon...
5432    ZAGG Computers,Bags, Cases & Sleeves,Computer ...
5433    360fly Cameras & Photo,360 Cameras,VR 360 Vide...
5434    Alpine Auto & Tires,Auto Electronics,Car Speak...
5435    Pioneer Speaker Separates tdrbbzebscxdcufzwatt...
Name: tags, Length: 5436, dtype: object

In [183]:
new_data['tags'][0]

'Grace Digital Electronics,Home Audio & Theater,Home Audio,All Home Speakers,Speaker Systems,Portable Audio & Video,Portable Speakers & Docks,Portable Bluetooth Speakers,Audio,Bluetooth & Wireless Speakers,Stereos,Electrical,Home Electronics,Portable Audio'

In [184]:
# Preprocessing text
def preprocess_text(text):
    # Remove leading and trailing whitespace (including newlines)
    text = text.strip()
    # Remove numbers using regular expressions
    #text = re.sub(r'\d+', '', text)
    
    # Remove punctuation and convert to lowercase
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    text = text.lower()
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stopwords and lemmatize
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    preprocessed_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    
    # Join the lemmatized tokens back into a string
    preprocessed_text = ' '.join(preprocessed_tokens)
    
    return preprocessed_text
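
One side effect of deleting punctuation outright is visible in the outputs of this section: comma-separated categories such as `Electronics,Home Audio` fuse into a single token (`electronicshome`). A hedged alternative, assuming that fusing is not intended, is to replace each punctuation character with a space before tokenizing. The helper name `split_on_punctuation` is hypothetical, and the sketch uses only the standard library, so it leaves out stopword removal and lemmatization.

```python
import string

# Map every punctuation character to a space instead of deleting it
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def split_on_punctuation(text):
    """Lowercase the text and split it on whitespace and punctuation."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()

tags = "Grace Digital Electronics,Home Audio & Theater,Home Audio"
tokens = split_on_punctuation(tags)
print(tokens)
# ['grace', 'digital', 'electronics', 'home', 'audio', 'theater', 'home', 'audio']
```

With tokens separated this way, two products that share the category "Home Audio" would share the terms `home` and `audio`, rather than depending on which category happened to sit next to them in the string.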

In [186]:
# test = preprocess_text('EcoXGear Ecostone Bluetooth Speaker')
# test

In [185]:
new_data['processed_tags'] = new_data['tags'].apply(lambda x: preprocess_text(x))
new_data

Unnamed: 0,name,brand,categories,tags,processed_tags
0,EcoXGear Ecostone Bluetooth Speaker,Grace Digital,"Electronics,Home Audio & Theater,Home Audio,Al...","Grace Digital Electronics,Home Audio & Theater...",grace digital electronicshome audio theaterhom...
1,Lenovo - 100S-14IBR 14 Laptop - Intel Celeron ...,Lenovo,"Electronics,Computers,Laptops,Laptops By Brand...","Lenovo Electronics,Computers,Laptops,Laptops B...",lenovo electronicscomputerslaptopslaptops bran...
2,House of Marley Smile Jamaica In-Ear Earbuds,House of Marley,"Headphones,Consumer Electronics,Portable Audio...","House of Marley Headphones,Consumer Electronic...",house marley headphonesconsumer electronicspor...
3,Sony Ultra-Portable Bluetooth Speaker,Sony,"Electronics,Home Audio & Theater,Home Audio,Al...","Sony Electronics,Home Audio & Theater,Home Aud...",sony electronicshome audio theaterhome audioal...
4,Alpha a5100 Mirrorless Digital Camera with 16-...,Sony,"Digital Cameras,Cameras & Photo,Used:Digital P...","Sony Digital Cameras,Cameras & Photo,Used:Digi...",sony digital camerascameras photouseddigital p...
...,...,...,...,...,...
5431,Apple iPhone SE Gold 16GB for Sprint ( MLY92LL...,Apple,"iPhones,All Cell Phones with Plans,iPhone SE,C...","Apple iPhones,All Cell Phones with Plans,iPhon...",apple iphonesall cell phone plansiphone secarr...
5432,Rugged Book Keyboard and Case for iPad Air 2,ZAGG,"Computers,Bags, Cases & Sleeves,Computer Acces...","ZAGG Computers,Bags, Cases & Sleeves,Computer ...",zagg computersbags case sleevescomputer access...
5433,4K Video Camera,360fly,"Cameras & Photo,360 Cameras,VR 360 Video,Camco...","360fly Cameras & Photo,360 Cameras,VR 360 Vide...",360fly camera photo360 camerasvr 360 videocamc...
5434,"Alpine - 5 x 7"" 2-Way Coaxial Car Speakers wit...",Alpine,"Auto & Tires,Auto Electronics,Car Speakers and...","Alpine Auto & Tires,Auto Electronics,Car Speak...",alpine auto tiresauto electronicscar speaker s...


In [187]:
new_df = new_data[['name', 'processed_tags']]
new_df

Unnamed: 0,name,processed_tags
0,EcoXGear Ecostone Bluetooth Speaker,grace digital electronicshome audio theaterhom...
1,Lenovo - 100S-14IBR 14 Laptop - Intel Celeron ...,lenovo electronicscomputerslaptopslaptops bran...
2,House of Marley Smile Jamaica In-Ear Earbuds,house marley headphonesconsumer electronicspor...
3,Sony Ultra-Portable Bluetooth Speaker,sony electronicshome audio theaterhome audioal...
4,Alpha a5100 Mirrorless Digital Camera with 16-...,sony digital camerascameras photouseddigital p...
...,...,...
5431,Apple iPhone SE Gold 16GB for Sprint ( MLY92LL...,apple iphonesall cell phone plansiphone secarr...
5432,Rugged Book Keyboard and Case for iPad Air 2,zagg computersbags case sleevescomputer access...
5433,4K Video Camera,360fly camera photo360 camerasvr 360 videocamc...
5434,"Alpine - 5 x 7"" 2-Way Coaxial Car Speakers wit...",alpine auto tiresauto electronicscar speaker s...


In [189]:
new_df = new_df.drop_duplicates()  # reassign instead of modifying a slice in place

In [192]:
df = new_df.reset_index(drop=True)
df

Unnamed: 0,name,processed_tags
0,EcoXGear Ecostone Bluetooth Speaker,grace digital electronicshome audio theaterhom...
1,Lenovo - 100S-14IBR 14 Laptop - Intel Celeron ...,lenovo electronicscomputerslaptopslaptops bran...
2,House of Marley Smile Jamaica In-Ear Earbuds,house marley headphonesconsumer electronicspor...
3,Sony Ultra-Portable Bluetooth Speaker,sony electronicshome audio theaterhome audioal...
4,Alpha a5100 Mirrorless Digital Camera with 16-...,sony digital camerascameras photouseddigital p...
...,...,...
814,"NB-13L Lithium-Ion Battery Pack (3.6V, 1250mAh)",canon officecamera photo accessoriesdigital ca...
815,151 SE Outdoor Environmental Speakers (White),bose audio video accessoriesoutdoor speakersco...
816,TiVo - Roamio OTA VOX 1TB Digital Video Record...,tivo consumer electronicstv video home audiodv...
817,Sanus VLF410B1 10-Inch Super Slim Full-Motion ...,sanus audio video accessoriestv mountstv acces...



In [220]:
df.to_csv('new_df.csv')

In [71]:
# def cosine_sim(txt1, txt2):
#     #vectorize text
#     vect = TfidfVectorizer()
#     #saving the vectorizer
#     model_save_path1 = "model/vect.pkl"
#     with open(model_save_path1,'wb') as file:
#         pickle.dump(vect ,file)

#     matrix = vect.fit_transform([txt1, txt2])
#     #finding cosine_similarity
#     similarity = cosine_similarity(matrix)[0][1]
#     #saving sim file
#     model_save_path2 = "model/cosim.pkl"
#     with open(model_save_path2,'wb') as file:
#         pickle.dump(similarity ,file)
        
#     return similarity

In [72]:
# def recommend_product(query):
#     cleaned_query = preprocess_text(query)
#     new_data['similarity'] = new_data['concat'].apply(lambda x: cosine_sim(cleaned_query, x))
#     final_output = new_data.sort_values(by=['similarity'], ascending=False).head(30)
#     return final_output

In [29]:
#recommend_product('HDR-AS200V Full HD Action Cam')

In [203]:
# Vectorise: fit the TF-IDF vectorizer first, then persist the fitted object
vectorizer = TfidfVectorizer()
feature_vectors = vectorizer.fit_transform(df['processed_tags']).toarray()
model_save_path = "model/vectorizer.pkl"
with open(model_save_path, 'wb') as file:
    pickle.dump(vectorizer, file)
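
To make the vectorisation step less of a black box, here is a small standard-library sketch of the weighting that `TfidfVectorizer` applies with its default settings, as we understand them: raw term counts, a smoothed inverse document frequency $\ln\frac{1+n}{1+df} + 1$, and L2 normalisation of each row. The function name `tfidf_rows` and the three toy documents are illustrative assumptions, and tokenisation here is a plain whitespace split, which is simpler than scikit-learn's.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows), each row being an L2-normalised TF-IDF vector."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm for w in row])
    return vocab, rows

docs = [
    "sony bluetooth speaker",
    "sony digital camera",
    "sony bluetooth headphones",
]
vocab, rows = tfidf_rows(docs)
first = dict(zip(vocab, rows[0]))
# "sony" appears in every document, so it is weighted below the rarer "speaker"
print(first["sony"] < first["bluetooth"] < first["speaker"])
```

Because every row has unit length, the cosine similarity between two rows reduces to their dot product.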

In [211]:
similarity = cosine_similarity(feature_vectors)
model_save_path1 = "model/cos_sim.pkl"
with open(model_save_path1, 'wb') as file:
    pickle.dump(similarity, file)

In [212]:
similarity[0]

array([1.        , 0.        , 0.11429785, 0.48680742, 0.03552561,
       0.        , 0.02083956, 0.04997838, 0.40666965, 0.        ,
       0.1694426 , 0.        , 0.0643331 , 0.        , 0.        ,
       0.        , 0.02027563, 0.2392944 , 0.        , 0.23659137,
       0.        , 0.        , 0.18954632, 0.        , 0.        ,
       0.27291387, 0.        , 0.        , 0.        , 0.        ,
       0.03620778, 0.        , 0.00859011, 0.        , 0.        ,
       0.01762886, 0.2218484 , 0.06674088, 0.08043256, 0.02345596,
       0.1178657 , 0.01090919, 0.        , 0.03386455, 0.        ,
       0.        , 0.04028423, 0.04876271, 0.28195869, 0.        ,
       0.        , 0.02148143, 0.08350179, 0.06127462, 0.0608724 ,
       0.        , 0.        , 0.02053582, 0.00580283, 0.01274635,
       0.23440893, 0.        , 0.        , 0.        , 0.04600969,
       0.02669258, 0.        , 0.01555645, 0.154366  , 0.        ,
       0.04961893, 0.09694406, 0.00371221, 0.03012892, 0.     

In [214]:
sorted(list(enumerate(similarity[0])),reverse=True, key=lambda x:x[1])[1:20]

[(142, 0.5727284996116648),
 (3, 0.48680741873255995),
 (811, 0.45159470317416117),
 (8, 0.40666964880053574),
 (614, 0.3729719257819732),
 (278, 0.36602209603452335),
 (155, 0.3091165453737273),
 (744, 0.2913071484406734),
 (252, 0.2819785401788548),
 (48, 0.28195868795694784),
 (772, 0.27803678832868484),
 (25, 0.2729138691071112),
 (198, 0.27056422622866394),
 (421, 0.26865092340037977),
 (537, 0.2628656214479896),
 (102, 0.2505348101539833),
 (704, 0.2472479735759074),
 (224, 0.2449243971276365),
 (509, 0.24255200276486183)]
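
Sorting the whole row just to read off the first twenty entries does more work than necessary. Since `heapq` is already imported, one option is `heapq.nlargest`, which keeps only the best `k` candidates. The sketch below uses a short made-up similarity row rather than the real `similarity[0]`, and the helper name `top_k_similar` is an assumption.

```python
import heapq

def top_k_similar(row, item_index, k):
    """Return the k (index, score) pairs with the highest score, excluding item_index."""
    candidates = ((i, score) for i, score in enumerate(row) if i != item_index)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

toy_row = [1.0, 0.0, 0.11, 0.49, 0.04, 0.57]  # made-up similarities to item 0
top = top_k_similar(toy_row, item_index=0, k=3)
print(top)  # [(5, 0.57), (3, 0.49), (2, 0.11)]
```

Excluding the query by its index, rather than slicing off the first sorted entry, also behaves predictably when another item happens to score exactly 1.0.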

In [215]:
# indices = pd.Series(new_df.index, index=new_df['name'])

names = df['name']

In [216]:
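def recommend(product):
    names = df['name']
    # Row position of the first product whose name matches exactly
    product_index = df[df['name']==product].index[0]
    # Similarity of that product to every product in df
    matrix = similarity[product_index]
    # Sort by score, highest first; skip position 0 (the product itself) and keep the next 9
    product_list = sorted(list(enumerate(matrix)), reverse=True, key=lambda x:x[1])[1:10]
    product_indices = [i[0] for i in product_list]
    # Convert the indexes back into product names
    return names.iloc[product_indices]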

In [218]:
recommend('EcoXGear Ecostone Bluetooth Speaker')

142    Monster SuperStar BackFloat High-Definition Bl...
3                  Sony Ultra-Portable Bluetooth Speaker
811    ECOXGEAR - SolJam Portable Bluetooth Speaker -...
8      KICKER - 6.5 2-Way Full-Range Speakers (Pair) ...
614    Kicker 41IK5BT2V2 Amphitheater High-Performanc...
278    Sabrent Sp-byta Speaker System - 2 W Rms - Wir...
155    ECOXGEAR ECOXBT Rugged and Waterproof Wireless...
744         ECOXGEAR ECOXBT Waterproof Bluetooth Speaker
252                                      House of Marley
Name: name, dtype: object

In [121]:
# def recommend(product):
#     product_index = new_df[new_df['name']==product].index[0]
#     matrix = similarity[product_index]    
#     product_list = sorted(list(enumerate(matrix)), reverse=True, key=lambda x:x[1])[1:50]
#     # Use a set to track seen product names and avoid duplicates
#     seen_products = set()
    
#     for i in product_list:
#         product_name = new_df.iloc[i[0]]['name']
#         # Only print the product if it's not already in the seen_products set
#         if product_name not in seen_products:
#             print(f"Index: {i[0]}, Product Name: {product_name}")
#             seen_products.add(product_name)


In [202]:
recommend('Rugged Book Keyboard and Case for iPad Air 2')

21     Logitech iPad Slim Folio: Case with Wireless K...
200                 Samsung Galaxy Tab S3 Keyboard Cover
658    Logitech Focus Case with Integrated Keyboard f...
521     Case for Microsoft Surface Pro and Pro 4 (Black)
725    Logitech iPad Pro 12.9 inch Keyboard Case SLIM...
176    Microsoft Surface Pro 4 Type Cover with Finger...
477                  ND Case for iPad mini 1/2/3 (Black)
495      PELICAN - ProGear Case for Most Tablets - Black
105    Ultimate Keyboard Case for iPad 2nd, 3rd, 4th Gen
Name: name, dtype: object

In [217]:
df['name'][0]

'EcoXGear Ecostone Bluetooth Speaker'

In [61]:
data['price'][167]

118.8

In [13]:
print("1st Sub Array Shape: ",similarity[0].shape)
print("\n", "-"*35, "\n Similarity of First Product [EcoXGear Ecostone Bluetooth Speaker] ","\n","-"*35, "\n", similarity[0])

1st Sub Array Shape:  (5436,)

 ----------------------------------- 
 Similarity of First Product [EcoXGear Ecostone Bluetooth Speaker]  
 ----------------------------------- 
 [1.         0.         0.0591506  ... 0.         0.09626645 0.07860426]


In [29]:
indices = pd.Series(data.index, index=data['name'])

names = new_data['name']

In [56]:
def content_generate_top_N_recommendations(product_name, N=10):
    # Convert the string product name to a numeric index for our similarity matrix
    num_index = indices[product_name]
    # Extract all similarity values computed with the reference product
    sim_scores = list(enumerate(similarity[num_index]))
    # Sort the (index, value) pairs; note that key x[0] orders them by index, not by score
    sim_scores = sorted(sim_scores, key=lambda x: x[0], reverse=True)
    # Skip the first entry and keep the next N-1 for recommendation
    sim_scores = sim_scores[1:N]
    # Collect indexes
    product_indices = [i[0] for i in sim_scores]
    # Convert the indexes back into product names
    return names.iloc[product_indices]

In [57]:
# Example usage
recommendations = content_generate_top_N_recommendations('HDR-AS200V Full HD Action Cam', N=10)
recommendations

16    Samsung - 960 PRO 512GB Internal PCI Express 3...
15                Sony - BC-TRX Battery Charger - Black
14    Samsung EVO+ 256GB UHS-I microSDXC U3 Memory C...
13    Logitech - Harmony 665 10-Device Universal Rem...
12    Sandisk Extreme CompactFlash Memory Card - 64 ...
11    Russound - Acclaim 5 Series 5-1/4 2-Way Indoor...
10    Turtle Beach Ear Force Recon 320 7.1 Surround ...
9     Alpha a5100 Mirrorless Digital Camera with 16-...
8     KICKER - 6.5 2-Way Full-Range Speakers (Pair) ...
Name: name, dtype: object
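
The list above counts down by row index (16, 15, ..., 8) instead of by similarity. That is what one would expect if the sort key is the index (`x[0]`) rather than the score (`x[1]`), and if the product name occurs several times in `data`, so that `indices[product_name]` returns more than one row. A hedged sketch of a variant that takes the first matching row and ranks by score follows. The toy names and matrix are made up, and `top_n_by_score` is a hypothetical helper, not part of the notebook.

```python
def top_n_by_score(product_name, names, similarity, n=3):
    """Return up to n names most similar to product_name, best first."""
    # Use the first row whose name matches, even if the name is duplicated
    matches = [i for i, name in enumerate(names) if name == product_name]
    if not matches:
        return []
    row_index = matches[0]
    # Keep (index, score) pairs for every row that is not the query product
    scored = [
        (i, score)
        for i, score in enumerate(similarity[row_index])
        if names[i] != product_name
    ]
    # Rank by the similarity score (second element), highest first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [names[i] for i, _ in scored[:n]]

names = ["action cam", "memory card", "action cam", "camera mount"]
similarity = [
    [1.0, 0.2, 1.0, 0.6],
    [0.2, 1.0, 0.2, 0.1],
    [1.0, 0.2, 1.0, 0.6],
    [0.6, 0.1, 0.6, 1.0],
]
result = top_n_by_score("action cam", names, similarity, n=2)
print(result)  # ['camera mount', 'memory card']
```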

In [8]:
data['name_cate'] = (pd.Series(data[['name', 'categories']]
                      .fillna('')
                      .values.tolist()).str.join(' '))

# Convenient indexes to map between product names and indexes of the product dataframe
names = data['name']
indices = pd.Series(data.index, index=data['name'])

In [18]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2),
                     stop_words='english')

# Produce a feature matrix, where each row corresponds to a product,
# with TF-IDF features as columns
matrix = tf.fit_transform(data['name_cate'])

In [19]:
cosine_sim_matrix = cosine_similarity(matrix,
                                        matrix)
print(cosine_sim_matrix.shape)

(5436, 5436)


In [12]:
cosine_sim_matrix[:5]

array([[1.00000000e+00, 1.23189627e-03, 8.89510179e-02, ...,
        1.22820258e-02, 1.45482662e-01, 2.06954706e-01],
       [1.23189627e-03, 1.00000000e+00, 1.45887650e-03, ...,
        4.78963816e-04, 1.65962544e-03, 1.30056032e-03],
       [8.89510179e-02, 1.45887650e-03, 1.00000000e+00, ...,
        1.50789146e-03, 1.15885939e-02, 3.59218603e-02],
       [6.39580799e-01, 9.13454640e-03, 1.11187730e-01, ...,
        8.42002732e-03, 2.05751914e-01, 2.55866877e-01],
       [1.06699437e-03, 4.01364673e-04, 1.26359099e-03, ...,
        2.76346874e-01, 1.43746763e-03, 1.12646704e-03]])

In [16]:
# def content_generate_top_N_recommendations(product_name, N=10):
#     # Convert the string product name to a numeric index for our similarity matrix
#     num_index = indices[product_name]
#     # Extract all similarity values computed with the reference book title
#     sim_scores = list(enumerate(cosine_sim_matrix[num_index]))
#     # Sort the values, keeping a copy of the original index of each value
#     sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=False)
#     # Select the top-N values for recommendation
#     sim_scores = sim_scores[1:N]
#     # Collect indexes 
#     product_indices = [i[0] for i in sim_scores]
#     # Convert the indexes back into titles 
#     return names.iloc[product_indices]

In [154]:
#content_generate_top_N_recommendations('HDR-AS200V Full HD Action Cam', N=10)

In [24]:
# Apply NLP cleaning: remove punctuation, lowercase, tokenize, remove stopwords and lemmatize

def preprocess_text_column(text_column):
    """
    Preprocess a text column by normalising whitespace, removing punctuation,
    converting text to lowercase, tokenizing, removing stopwords and
    lemmatizing with the NLTK library, then joining the tokens back
    into sentence form.

    Args:
        text_column (pandas.Series): The text column to be preprocessed.

    Returns:
        pandas.Series: The preprocessed text, one string per row.
    """
    # Convert any non-string values (e.g. float NaN) to strings
    text_column = text_column.astype(str)
    # Noise removal: collapse repeated whitespace (number removal is left disabled)
    #text_column = text_column.apply(lambda x: re.sub(r'\d+', '', x))
    text_column = text_column.apply(lambda x: ' '.join(x.split()))

    # Punctuation removal
    punctuation_table = str.maketrans("", "", string.punctuation)
    text_column = text_column.apply(lambda x: x.translate(punctuation_table))

    # Converting text to lowercase
    text_column = text_column.apply(lambda x: x.lower())

    # Tokenizing
    text_column = text_column.apply(lambda x: word_tokenize(x))

    # Stopword removal
    stop_words = set(stopwords.words('english'))
    text_column = text_column.apply(lambda x: [token for token in x if token not in stop_words])

    # Lemmatizing
    lemmatizer = WordNetLemmatizer()
    text_column = text_column.apply(lambda x: [lemmatizer.lemmatize(token) for token in x])

    # Final output in sentence form
    text_column = text_column.apply(lambda x: ' '.join(x))

    return text_column

cleaned_data  = preprocess_text_column(data['concat'])
cleaned_data

0       ecoxgear ecostone bluetooth speaker grace digi...
1       lenovo 100s14ibr 14 laptop intel celeron 2gb m...
2       house marley smile jamaica inear earbuds house...
3       sony ultraportable bluetooth speaker sony elec...
4       alpha a5100 mirrorless digital camera 1650mm l...
                              ...                        
5431    apple iphone se gold 16gb sprint mly92lla appl...
5432    rugged book keyboard case ipad air 2 zagg comp...
5433    4k video camera 360fly camera photo360 cameras...
5434    alpine 5 x 7 2way coaxial car speaker polymica...
5435    spfs52 andrew jones designed floorstanding lou...
Name: concat, Length: 5436, dtype: object

In [25]:
## Vectorize data
def vectorize_text(text_column: pd.Series):
    """Vectorizes a text column using CountVectorizer.

    Args:
      text_column: A pandas Series containing the text to be vectorized.

    Returns:
      A SciPy sparse matrix of token counts, one row per input text.
    """

    vectorizer = CountVectorizer()
    # Fit the vectorizer, then pickle it for later use during development of our app
    v = vectorizer.fit(text_column)
    model_save_path = "model/vectorizer.pkl"
    with open(model_save_path, 'wb') as file:
        pickle.dump(v, file)
    vectorized_text = v.transform(text_column)
    return vectorized_text

vectorized_data = vectorize_text(cleaned_data)
vectorized_data

<5436x6054 sparse matrix of type '<class 'numpy.int64'>'
	with 138493 stored elements in Compressed Sparse Row format>

In [26]:
#getting cosine similarity

similarity = cosine_similarity(vectorized_data)
model_save_path = "model/sim.pkl"
with open(model_save_path,'wb') as file:
    pickle.dump(similarity ,file)

In [31]:
print("1st Sub Array Shape: ",similarity[0].shape)
print("\n", "-"*35, "\n Similarity of First product [EcoXGear Ecostone Bluetooth Speaker] ","\n","-"*35, "\n", similarity[0])

1st Sub Array Shape:  (5436,)

 ----------------------------------- 
 Similarity of First product [EcoXGear Ecostone Bluetooth Speaker]  
 ----------------------------------- 
 [1.         0.         0.15569979 ... 0.         0.20313469 0.2540839 ]


In [27]:
# indices = pd.Series(data.index, index=data['name'])

# name = data['name']

In [28]:
# def recommend(product_name, N=6):
    
#     product_name = input('Enter product name:').lower()  
#     b_idx = indices[product_name]
#     # Extract all similarity values computed with the reference book title
#     sim_scores = list(enumerate(similarity[b_idx]))
#     # Sort the values, keeping a copy of the original index of each value
#     sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
#     # Select the top-N values for recommendation
#     sim_scores = sim_scores[1:N]
#     # Collect indexes 
#     product_indices = [i[0] for i in sim_scores]
#     # Convert the indexes back into titles 
#     return name.iloc[product_indices]

In [155]:

#recommend("")

In [32]:
# Recommender function that prints product names only
def recommend_product_names(xProduct):
    # Get the index of the given product (case-insensitive match on name)
    product_index = data[data["name"].str.lower() == xProduct.lower()].index[0]
    distances = similarity[product_index]
    listofproducts = sorted(list(enumerate(distances)), reverse=True, key=lambda x:x[1])[1:6]

    for i in listofproducts:
        print(data.iloc[i[0]]["name"])

In [33]:
# Recommended product names for two example products
product_list = ["EcoXGear Ecostone Bluetooth Speaker", "4K Video Camera"]
for x in product_list:
    print("-"*20, "Recommendations for [", x,"]", "-"*20)
    recommend_product_names(x)

-------------------- Recommendations for [ EcoXGear Ecostone Bluetooth Speaker ] --------------------
EcoXGear Ecostone Bluetooth Speaker
EcoXGear Ecostone Bluetooth Speaker
EcoXGear Ecostone Bluetooth Speaker
EcoXGear Ecostone Bluetooth Speaker
EcoXGear Ecostone Bluetooth Speaker
-------------------- Recommendations for [ 4K Video Camera ] --------------------
4K Video Camera
4K Video Camera
4K Video Camera
4K Video Camera
4K Video Camera
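
Each query above returns five copies of the query product itself. That is consistent with `data` containing many rows that share the same name: the de-duplicated frame built earlier ends at index 817, while `data` has 5436 rows. One hedged way to keep the printed list informative is to skip names that have already been seen, including the query's own name. The helper `unique_recommendations` and the toy data below are assumptions for illustration.

```python
def unique_recommendations(query, names, scores, n=5):
    """Return up to n distinct names ranked by score, skipping the query's own name."""
    # Row positions ordered from the highest score to the lowest
    ranked = sorted(range(len(names)), key=lambda i: scores[i], reverse=True)
    seen = {query.lower()}
    picked = []
    for i in ranked:
        key = names[i].lower()
        if key in seen:
            continue  # same product listed again, or the query itself
        seen.add(key)
        picked.append(names[i])
        if len(picked) == n:
            break
    return picked

names = ["4K Video Camera", "4K Video Camera", "Action Cam", "Action Cam", "Tripod"]
scores = [1.0, 1.0, 0.7, 0.7, 0.3]  # made-up similarities to the query row
recommended = unique_recommendations("4K Video Camera", names, scores, n=2)
print(recommended)  # ['Action Cam', 'Tripod']
```

Computing the similarity matrix on the de-duplicated frame, as the earlier `recommend` function does, avoids the problem at its source.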
