<h1 align="center">Electronics Recommender System</h1>

### Introduction to Recommender Systems: Addressing Information Overload
We live in an era saturated with content, where the sheer volume of movies, news articles, shopping products, and websites overwhelms individual attention spans. The average Google search yields over a million results, yet how often do we venture beyond the first page of links? This phenomenon, known as the "long tail problem," highlights how a small fraction of content receives disproportionate attention, while the majority remains undiscovered.

In the face of this challenge, service providers must ask: "How do I curate a manageable selection of content for users that is both relevant and desired?" Thankfully, decades of research have produced a solution: recommender systems.

### Understanding Recommender Systems
Recommender systems predict a user's preference for an item, allowing service providers to offer a tailored selection of content, thereby enhancing user engagement and broadening content exploration.

### Fundamental Concepts
#### Terminology: Users, Items, and Ratings
In the realm of recommender systems, two primary entities exist: Users and Items.

 - **Items** are the content being consumed: movies, articles, products, etc. They remain passive, with fixed properties.
 - **Users** interact with these items, providing ratings based on their preferences. Ratings can be explicit (e.g., giving a movie a star rating) or implicit (e.g., watching a movie without rating it directly).

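To make the terminology concrete, here is a minimal sketch of how explicit and implicit ratings might be stored. The `events` table, its column names, and its values are hypothetical and are not part of the electronics dataset used later.

```python
import pandas as pd

# Hypothetical interaction log: one row per (user, item) interaction.
# A missing rating means the user interacted with the item but did not rate it.
events = pd.DataFrame(
    {
        "user": ["A", "A", "B", "B", "C"],
        "item": ["speaker", "laptop", "speaker", "camera", "laptop"],
        "rating": [3.0, 4.0, 5.0, None, None],
    }
)

# Explicit feedback: a user-by-item matrix of star ratings (NaN = no rating given).
explicit = events.pivot(index="user", columns="item", values="rating")

# Implicit feedback: 1 if the user interacted with the item at all, otherwise 0.
implicit = pd.crosstab(events["user"], events["item"]).clip(upper=1)

print(explicit)
print(implicit)
```

Each row of either matrix is a vector describing one user, and each column is a vector describing one item. These are the vectors that a similarity metric compares.
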
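### Implementing Content-Based Filtering: An Example
Let's delve into one of the primary methods employed in recommender systems: content-based filtering. This approach recommends items whose properties resemble those of an item the user has already shown interest in. In this context, we'll focus on building an "Electronics Recommender System" that compares products by their name, brand, and categories.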






## Measuring Similarity 

<br></br>

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Cosine_similarity.jpg"
     alt="Cosine Similarity "
     style="float: center; padding-bottom=0.5em"
     width=600px/>
Measuring the similarity between the ratings of two users (A) and (B) for the books 'Harry Potter and the Philosopher's Stone' and 'The Diary of a Young Girl', using the Cosine similarity metric.  
</div>


Having learnt about the entities that exist within recommender systems, we may wonder how these systems function. While this is something that we'll learn throughout this entire train, one fundamental principle that we need to understand is that recommender systems are built by utilising the _relations_ which exist between items and users. As such, these systems always need a mechanism to measure how related or _similar_ a user is to another user, or an item is to another item.

We accomplish this measurement of similarity through, you guessed it, a _similarity metric_.  

Generally speaking, a similarity metric can be thought of as being the inverse of a distance measure: if two things are considered to be very similar they should be assigned a high similarity value (close to 1), while dissimilar items should receive a low similarity value (close to zero). Other [important properties](https://online.stat.psu.edu/stat508/lesson/1b/1b.2/1b.2.1) include:
 - (Symmetry) $Sim(A,B) = Sim(B,A)$ 
 - (Identity) $Sim(A,A) = 1$
 - (Uniqueness) $Sim(A,B) = 1 \leftrightarrow A = B$
 
While there are many similarity metrics to choose from when building a recommender system (and more than one can certainly be used simultaneously), a popular choice is the **Cosine similarity**. We won't go into the fundamental trig here (we hope that you remember this from high school), but recall that as an angle becomes smaller (approaching $0^\circ$) the value of its cosine increases. Conversely, as the angle increases the cosine value decreases. It turns out that this behavior makes the cosine of the angle between two p-dimensional vectors desirable as a [similarity metric](https://en.wikipedia.org/wiki/Cosine_similarity) which can easily be computed.

Using the figure above to help guide our understanding, the Cosine similarity between two p-dimensional vectors $A$ and $B$ can be given as:

$$ \begin{align}
Sim(A,B)  &= \frac{A \cdot B}{||A|| \times ||B||} \\ \\
& = \frac{\sum_{i=1}^{p}A_{i}B_{i}}{\sqrt{{\sum_{i=1}^{p}A_{i}^2}} \sqrt{\sum_{i=1}^{p}B_{i}^2}}, \\
\end{align} $$ 
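
The formula above can be sketched directly in NumPy. The helper name `cosine_sim` and the vectors `u` and `v` are illustrative choices, not part of the original notebook. The last check also shows a caveat: cosine similarity depends only on direction, so a vector and any positive multiple of it score 1, which means the uniqueness property holds only up to scaling.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the Euclidean norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


u = [1.0, 2.0, 3.0]
v = [3.0, 0.0, 1.0]

symmetric = np.isclose(cosine_sim(u, v), cosine_sim(v, u))        # Sim(A,B) = Sim(B,A)
identity = np.isclose(cosine_sim(u, u), 1.0)                      # Sim(A,A) = 1
parallel = np.isclose(cosine_sim(u, [2.0, 4.0, 6.0]), 1.0)        # 2u also scores 1

print(symmetric, identity, parallel)
```
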
  

To make things a little more concrete, let's work out the cosine similarity using our provided example above. Here, each vector represents the ratings given by one of two *users*, $A$ and $B$, who have each rated two books (rating 1 $ \rightarrow r_1$, and rating 2 $ \rightarrow r_2$). User $A$ gave the ratings $(3, 4)$ and user $B$ gave $(5, 2)$. To work out how similar these two users are based on their supplied ratings, we can use the Cosine similarity definition as follows:


$$ \begin{align}
Sim(A,B)  & = \frac{(A_{r1} \times B_{r1})+(A_{r2} \times B_{r2})}{\sqrt{A_{r1}^2 + A_{r2}^2} \times \sqrt{B_{r1}^2 + B_{r2}^2}} \\ \\
& = \frac{(3 \times 5) + (4 \times 2)}{\sqrt{9 + 16} \times \sqrt{25 + 4}} \\ \\
& = \frac{23}{26.93} \\ \\
& = 0.854
\end{align} $$

It would be a pain to work this out manually each time! Thankfully, we can obtain this same result using the `cosine_similarity` function provided to us in `sklearn`. 
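
As a quick check of the worked example, the snippet below passes the two rating vectors to `cosine_similarity`. The variable names are illustrative. The function expects 2-D inputs (one row per entity), so each user is wrapped in an outer list.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Ratings from the worked example: one row per user, one column per book.
user_a = np.array([[3, 4]])
user_b = np.array([[5, 2]])

# The result is a matrix of pairwise scores; here it has a single entry.
sim_ab = cosine_similarity(user_a, user_b)[0, 0]
print(round(sim_ab, 3))  # 0.854
```
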

As usual, before we go any further we need to import the libraries that the rest of the notebook relies on, including `cosine_similarity`.

In [1]:
##Importing Libraries


# Import our regular old heroes 
import numpy as np
import pandas as pd
import scipy as sp # <-- The sister of Numpy, used in our code for numerical efficiency.
import matplotlib.pyplot as plt
import seaborn as sns

# Entity featurization and similarity computation
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.feature_extraction.text import TfidfVectorizer
from surprise import SVD, Reader, Dataset
import re
import string
# The NLTK steps below need the 'punkt', 'stopwords' and 'wordnet' resources,
# which can be fetched once with nltk.download(...)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk import SnowballStemmer, PorterStemmer, LancasterStemmer
import pickle


# Libraries used during sorting procedures.
import operator # <-- Convenient item retrieval during iteration
import heapq # <-- Efficient sorting of large lists

# Imported for our sanity
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv("data/electronics_products_pricing.csv")
data.head()

Unnamed: 0,id,prices.availability,prices.condition,prices.currency,prices.dateSeen,prices.isSale,prices.merchant,prices.shipping,prices.sourceURLs,asins,...,imageURLs,keys,manufacturer,manufacturerNumber,name,primaryCategories,sourceURLs,upc,weight,price
0,AVphrugr1cnluZ0-FOeH,Yes,New,USD,"2017-05-10T20:00:00Z,2017-05-09T15:00:00Z",False,Bestbuy.com,,http://www.bestbuy.com/site/products/7100293.p...,B00I9HD8PK,...,https://i5.walmartimages.com/asr/dd5f42c4-076c...,"819127010485,ecoxgearecostonebluetoothspeaker/...",Ecoxgear,GDI-EGST701,EcoXGear Ecostone Bluetooth Speaker,Electronics,http://www.walmart.com/ip/EcoXGear-Ecostone-Bl...,819000000000.0,3 pounds,92.99
1,AVrI6FDbv8e3D1O-lm4R,Yes,New,USD,"2017-10-10T02:00:00Z,2017-08-12T03:00:00Z,2017...",False,Bestbuy.com,,https://www.bestbuy.com/site/lenovo-100s-14ibr...,B06ZY63J8H,...,https://i5.walmartimages.com/asr/fcc50cce-a3c1...,"190793918948,lenovo100s14ibr14laptopintelceler...",,100s-14ibr,Lenovo - 100S-14IBR 14 Laptop - Intel Celeron ...,Electronics,https://www.walmart.com/ip/Lenovo-100S-14IBR-1...,191000000000.0,4.3 pounds,229.99
2,AVpiLlubilAPnD_xBoTa,Yes,New,USD,"2017-10-10T19:00:00Z,2017-09-12T14:00:00Z,2017...",False,Bestbuy.com,,https://www.bestbuy.com/site/house-of-marley-s...,B00G3P9UMU,...,https://i5.walmartimages.com/asr/c124aa15-b9e3...,"0846885007037,houseofmarleysmilejamaicainearea...",House Of Marley,EM-JE041-MI,House of Marley Smile Jamaica In-Ear Earbuds,Electronics,https://www.walmart.com/ip/House-of-Marley-Smi...,847000000000.0,0.6 ounces,16.99
3,AVpgQP5vLJeJML43LQbd,Yes,New,USD,"2017-09-08T05:00:00Z,2017-09-18T13:00:00Z,2017...",False,Bestbuy.com,,https://www.bestbuy.com/site/products/6311012....,B00TTWZFFA,...,https://i5.walmartimages.com/asr/1be435f7-5f3a...,"sonyultraportablebluetoothspeaker/sosrsx11bk,s...",Sony,SRSX11/BLK,Sony Ultra-Portable Bluetooth Speaker,Electronics,https://www.walmart.com/ip/Sony-Ultra-Portable...,27242886599.0,1 pounds,69.99
4,AV1YDsmoGV-KLJ3adcbe,More on the Way,New,USD,2017-12-05T13:00:00Z,True,bhphotovideo.com,Free Expedited Shipping for most orders over $49,https://www.bhphotovideo.com/c/product/1105014...,B00MHPAF38,...,http://i.ebayimg.com/thumbs/images/g/TBUAAOSwd...,sonyalphaa5100digitalcamerakitwith1650mmlenswh...,,ILCE5100L/W,Alpha a5100 Mirrorless Digital Camera with 16-...,Electronics,https://reviews.bestbuy.com/3545/8429343/revie...,27242883246.0,9.98 oz 4.09 oz,846.0


In [3]:
data.columns

Index(['id', 'prices.availability', 'prices.condition', 'prices.currency',
       'prices.dateSeen', 'prices.isSale', 'prices.merchant',
       'prices.shipping', 'prices.sourceURLs', 'asins', 'brand', 'categories',
       'dateAdded', 'dateUpdated', 'ean', 'imageURLs', 'keys', 'manufacturer',
       'manufacturerNumber', 'name', 'primaryCategories', 'sourceURLs', 'upc',
       'weight', 'price'],
      dtype='object')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5436 entries, 0 to 5435
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   5436 non-null   object 
 1   prices.availability  5436 non-null   object 
 2   prices.condition     5436 non-null   object 
 3   prices.currency      5436 non-null   object 
 4   prices.dateSeen      5436 non-null   object 
 5   prices.isSale        5436 non-null   bool   
 6   prices.merchant      5436 non-null   object 
 7   prices.shipping      3199 non-null   object 
 8   prices.sourceURLs    5436 non-null   object 
 9   asins                5436 non-null   object 
 10  brand                5436 non-null   object 
 11  categories           5436 non-null   object 
 12  dateAdded            5436 non-null   object 
 13  dateUpdated          5436 non-null   object 
 14  ean                  1175 non-null   object 
 15  imageURLs            5436 non-null   o

In [5]:
data = data[['name', 'brand', 'categories', 'manufacturer']]
data.head()

Unnamed: 0,name,brand,categories,manufacturer
0,EcoXGear Ecostone Bluetooth Speaker,Grace Digital,"Electronics,Home Audio & Theater,Home Audio,Al...",Ecoxgear
1,Lenovo - 100S-14IBR 14 Laptop - Intel Celeron ...,Lenovo,"Electronics,Computers,Laptops,Laptops By Brand...",
2,House of Marley Smile Jamaica In-Ear Earbuds,House of Marley,"Headphones,Consumer Electronics,Portable Audio...",House Of Marley
3,Sony Ultra-Portable Bluetooth Speaker,Sony,"Electronics,Home Audio & Theater,Home Audio,Al...",Sony
4,Alpha a5100 Mirrorless Digital Camera with 16-...,Sony,"Digital Cameras,Cameras & Photo,Used:Digital P...",


In [6]:
data.isnull().sum()

name               0
brand              0
categories         0
manufacturer    2959
dtype: int64

In [7]:
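# Drop 'manufacturer': 2959 of its 5436 values are missing.
# drop() returns a new DataFrame, so the result must be assigned back.
data = data.drop("manufacturer", axis=1)
data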

Unnamed: 0,name,brand,categories
0,EcoXGear Ecostone Bluetooth Speaker,Grace Digital,"Electronics,Home Audio & Theater,Home Audio,Al..."
1,Lenovo - 100S-14IBR 14 Laptop - Intel Celeron ...,Lenovo,"Electronics,Computers,Laptops,Laptops By Brand..."
2,House of Marley Smile Jamaica In-Ear Earbuds,House of Marley,"Headphones,Consumer Electronics,Portable Audio..."
3,Sony Ultra-Portable Bluetooth Speaker,Sony,"Electronics,Home Audio & Theater,Home Audio,Al..."
4,Alpha a5100 Mirrorless Digital Camera with 16-...,Sony,"Digital Cameras,Cameras & Photo,Used:Digital P..."
...,...,...,...
5431,Apple iPhone SE Gold 16GB for Sprint ( MLY92LL...,Apple,"iPhones,All Cell Phones with Plans,iPhone SE,C..."
5432,Rugged Book Keyboard and Case for iPad Air 2,ZAGG,"Computers,Bags, Cases & Sleeves,Computer Acces..."
5433,4K Video Camera,360fly,"Cameras & Photo,360 Cameras,VR 360 Video,Camco..."
5434,"Alpine - 5 x 7"" 2-Way Coaxial Car Speakers wit...",Alpine,"Auto & Tires,Auto Electronics,Car Speakers and..."


### Data Cleaning

In [8]:
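# Concatenate the text features into one string per product before cleaning
# (after this step, `data` is a pandas Series rather than a DataFrame)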

data = data['name'] + " " + data['brand'] + " " + data['categories']
data

0       EcoXGear Ecostone Bluetooth Speaker Grace Digi...
1       Lenovo - 100S-14IBR 14 Laptop - Intel Celeron ...
2       House of Marley Smile Jamaica In-Ear Earbuds H...
3       Sony Ultra-Portable Bluetooth Speaker Sony Ele...
4       Alpha a5100 Mirrorless Digital Camera with 16-...
                              ...                        
5431    Apple iPhone SE Gold 16GB for Sprint ( MLY92LL...
5432    Rugged Book Keyboard and Case for iPad Air 2 Z...
5433    4K Video Camera 360fly Cameras & Photo,360 Cam...
5434    Alpine - 5 x 7" 2-Way Coaxial Car Speakers wit...
5435    SP-FS52 Andrew Jones Designed Floorstanding Lo...
Length: 5436, dtype: object

In [9]:
# Clean the text: strip punctuation, lowercase, tokenize, remove stopwords, and lemmatize

def preprocess_text_column(text_series):
    """
    Preprocess the concatenated product text using the NLTK library.

    Steps: whitespace normalization, punctuation removal, lowercasing,
    tokenizing, stopword removal, and lemmatizing. The tokens are then
    joined back into a single string per row.

    Args:
        text_series (pandas.Series): The concatenated text, one entry per product.

    Returns:
        pandas.Series: The cleaned text, one space-separated string per product.
    """
    # Make sure every entry is a string (e.g. NaN or numeric values)
    text_series = text_series.astype(str)

    # Noise removal: collapse runs of whitespace into a single space
    # (numbers are kept; digit removal is left disabled)
    text_series = text_series.apply(lambda x: ' '.join(x.split()))

    # Punctuation removal. Punctuation is deleted, not replaced by a space,
    # so "Photo,360" becomes "photo360" in the output.
    punctuation_table = str.maketrans("", "", string.punctuation)
    text_series = text_series.apply(lambda x: x.translate(punctuation_table))

    # Converting text to lowercase
    text_series = text_series.str.lower()

    # Tokenizing
    text_series = text_series.apply(word_tokenize)

    # Stopword removal
    stop_words = set(stopwords.words('english'))
    text_series = text_series.apply(
        lambda tokens: [token for token in tokens if token not in stop_words])

    # Lemmatizing
    lemmatizer = WordNetLemmatizer()
    text_series = text_series.apply(
        lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])

    # Final output in sentence form
    text_series = text_series.apply(' '.join)

    return text_series

cleaned_data  = preprocess_text_column(data)
cleaned_data

0       ecoxgear ecostone bluetooth speaker grace digi...
1       lenovo 100s14ibr 14 laptop intel celeron 2gb m...
2       house marley smile jamaica inear earbuds house...
3       sony ultraportable bluetooth speaker sony elec...
4       alpha a5100 mirrorless digital camera 1650mm l...
                              ...                        
5431    apple iphone se gold 16gb sprint mly92lla appl...
5432    rugged book keyboard case ipad air 2 zagg comp...
5433    4k video camera 360fly camera photo360 cameras...
5434    alpine 5 x 7 2way coaxial car speaker polymica...
5435    spfs52 andrew jones designed floorstanding lou...
Length: 5436, dtype: object
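
Notice tokens such as `photo360` and `inear` in the output above: because punctuation is deleted outright, words separated only by a comma or hyphen are fused together. The sketch below contrasts that behaviour with one possible alternative, mapping punctuation to spaces. The sample string is made up for illustration, and the alternative is a suggestion rather than what the notebook does.

```python
import string

raw = "Cameras & Photo,360 Cameras,In-Ear Earbuds"

# Deleting punctuation (the approach used above) fuses neighbouring words.
delete_table = str.maketrans("", "", string.punctuation)
deleted = " ".join(raw.translate(delete_table).split()).lower()

# Alternative: turn every punctuation character into a space, then collapse
# repeated whitespace so the words stay separate.
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(raw.translate(space_table).split()).lower()

print(deleted)  # cameras photo360 camerasinear earbuds
print(spaced)   # cameras photo 360 cameras in ear earbuds
```

The trade-off is that hyphenated model numbers such as `100S-14IBR` would also be split into two tokens, so which behaviour is preferable depends on the data.
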

In [12]:
## Vectorize data
def vectorize_text(text_series: pd.Series):
    """Vectorizes a text column using CountVectorizer.

    Args:
        text_series: A pandas Series containing the text to be vectorized.

    Returns:
        A SciPy sparse matrix of token counts, with one row per entry in
        text_series and one column per vocabulary term.
    """
    vectorizer = CountVectorizer()
    # Fit the vectorizer, then pickle it for later use during development of our app
    # (the "model/" directory must already exist)
    vectorizer.fit(text_series)
    model_save_path = "model/vectorizer.pkl"
    with open(model_save_path, 'wb') as file:
        pickle.dump(vectorizer, file)
    vectorized_text = vectorizer.transform(text_series)
    return vectorized_text

vectorized_data = vectorize_text(cleaned_data)
vectorized_data

<5436x6054 sparse matrix of type '<class 'numpy.int64'>'
	with 138493 stored elements in Compressed Sparse Row format>

In [11]:
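# Compute the pairwise cosine similarity between all products
# (the result is a dense 5436 x 5436 array, so the pickle file is large)

similarity = cosine_similarity(vectorized_data)
model_save_path = "model/sim.pkl"
with open(model_save_path, 'wb') as file:
    pickle.dump(similarity, file)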
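
With the similarity matrix in hand, a recommendation is a lookup of the highest-scoring entries in one row. The sketch below shows one way this might be done on a made-up four-product corpus. The `recommend` helper and the sample products are assumptions for illustration, not code from the original app.

```python
import heapq
import operator

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the cleaned product text.
products = [
    "sony bluetooth speaker portable audio",
    "ecoxgear bluetooth speaker waterproof audio",
    "sony mirrorless digital camera",
    "lenovo laptop intel celeron",
]

vectors = CountVectorizer().fit_transform(products)
similarity = cosine_similarity(vectors)  # shape (n_products, n_products)


def recommend(index, similarity, top_n=2):
    """Return (item_index, score) pairs for the top_n items most similar to `index`."""
    # Skip the item itself, which always has similarity 1 with itself
    scores = [(i, float(s)) for i, s in enumerate(similarity[index]) if i != index]
    return heapq.nlargest(top_n, scores, key=operator.itemgetter(1))


top = recommend(0, similarity)
for item_index, score in top:
    print(products[item_index], round(score, 3))
```

For the first product, the other Bluetooth speaker ranks highest because it shares three of its five tokens, followed by the Sony camera, which shares only the brand name.
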