# Chapter 3 - Content-Based Recommender Systems

Content-based filtering recommends products or items that are very similar to those a user has clicked or liked. Recommendations are based on the description of an item and a profile of the user’s interests. Content-based recommender systems are widely used in e-commerce platforms, and the technique is one of the basic algorithms in a recommendation engine. It can be triggered by any event, for example, a click, a purchase, or an add-to-cart action.

### Approach
The following steps build a content-based recommender engine.
1. Collect the data (each item should have a complete description).
2. Preprocess the data.
3. Convert text to features.
4. Compute similarity measures.
5. Recommend products.

<div style="text-align:center;">
    <img src='images/cbrs.JPG' width='600'>
</div>
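
The steps above can be sketched end to end on a tiny example. This is only a minimal illustration, not the notebook's actual pipeline: the three-item `items` catalog is hypothetical, preprocessing is reduced to lowercasing, and TF-IDF with cosine similarity is just one of the feature and similarity choices covered later in this chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: hypothetical toy catalog standing in for real item descriptions
items = [
    "red cotton t-shirt for men",
    "blue cotton t-shirt for men",
    "stainless steel kitchen knife",
]

# Step 2: preprocessing (only lowercasing in this sketch)
cleaned = [text.lower() for text in items]

# Step 3: convert text to features
features = TfidfVectorizer(stop_words="english").fit_transform(cleaned)

# Step 4: pairwise similarity between all items
similarity = cosine_similarity(features)

# Step 5: recommend the item most similar to item 0, excluding itself
query = 0
ranked = similarity[query].argsort()[::-1]
recommended = [i for i in ranked if i != query][0]
print(items[recommended])
```

The two shirts share most of their words, so the second shirt is recommended for the first, while the knife shares no terms with it and scores zero.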

### Links to download required word embeddings

Download Word2Vec:
`gdown https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM`

Download GloVe:
`wget https://nlp.stanford.edu/data/glove.6B.zip`

Download fastText:
`gdown https://drive.google.com/uc?id=1vz6659Atv9OOXiakzj1xaKhZ9jxJkeFF`

In [19]:
#Importing the libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, manhattan_distances, euclidean_distances
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from gensim import models
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style
%matplotlib inline
from gensim.models import FastText as ft
from IPython.display import Image
import os

import warnings
warnings.filterwarnings("ignore")

In [2]:
#read csv data
df = pd.read_csv('data/Rec_sys_content.csv')

#view first 5 rows
df.head()

Unnamed: 0,StockCode,Product Name,Description,Category,Brand,Unit Price
0,22629,Ganma Superheroes Ordinary Life Case For Samsu...,"New unique design, great gift.High quality pla...",Cell Phones|Cellphone Accessories|Cases & Prot...,Ganma,13.99
1,21238,Eye Buy Express Prescription Glasses Mens Wome...,Rounded rectangular cat-eye reading glasses. T...,Health|Home Health Care|Daily Living Aids,Eye Buy Express,19.22
2,22181,MightySkins Skin Decal Wrap Compatible with Ni...,Each Nintendo 2DS kit is printed with super-hi...,Video Games|Video Game Accessories|Accessories...,Mightyskins,14.99
3,84879,Mediven Sheer and Soft 15-20 mmHg Thigh w/ Lac...,The sheerest compression stocking in its class...,Health|Medicine Cabinet|Braces & Supports,Medi,62.38
4,84836,Stupell Industries Chevron Initial Wall D cor,Features: -Made in the USA. -Sawtooth hanger o...,Home Improvement|Paint|Wall Decals|All Wall De...,Stupell Industries,35.99


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3958 entries, 0 to 3957
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   StockCode     3958 non-null   object 
 1   Product Name  3958 non-null   object 
 2   Description   3958 non-null   object 
 3   Category      3856 non-null   object 
 4   Brand         3818 non-null   object 
 5   Unit Price    3943 non-null   float64
dtypes: float64(1), object(5)
memory usage: 185.7+ KB


### Data Preparation

In [4]:
#null check
df.isnull().sum().sort_values(ascending=False)

Brand           140
Category        102
Unit Price       15
StockCode         0
Product Name      0
Description       0
dtype: int64

In [9]:
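#Drop N/A
# Note: dropna() returns a new DataFrame, so this statement leaves df unchanged
# (the shape below still shows all 3958 rows). To actually drop the rows,
# assign the result: df = df.dropna().reset_index(drop=True)
df.dropna().reset_index(inplace = True)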

# Data Shape
df.shape

(3958, 6)

In [7]:
df.head()

Unnamed: 0,StockCode,Product Name,Description,Category,Brand,Unit Price
0,22629,Ganma Superheroes Ordinary Life Case For Samsu...,"New unique design, great gift.High quality pla...",Cell Phones|Cellphone Accessories|Cases & Prot...,Ganma,13.99
1,21238,Eye Buy Express Prescription Glasses Mens Wome...,Rounded rectangular cat-eye reading glasses. T...,Health|Home Health Care|Daily Living Aids,Eye Buy Express,19.22
2,22181,MightySkins Skin Decal Wrap Compatible with Ni...,Each Nintendo 2DS kit is printed with super-hi...,Video Games|Video Game Accessories|Accessories...,Mightyskins,14.99
3,84879,Mediven Sheer and Soft 15-20 mmHg Thigh w/ Lac...,The sheerest compression stocking in its class...,Health|Medicine Cabinet|Braces & Supports,Medi,62.38
4,84836,Stupell Industries Chevron Initial Wall D cor,Features: -Made in the USA. -Sawtooth hanger o...,Home Improvement|Paint|Wall Decals|All Wall De...,Stupell Industries,35.99


### Loading pretrained models

In [13]:
# Importing Word2Vec
word2vecModel = models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [16]:
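# Loading the pretrained FastText model
# Note: load_fasttext_format is deprecated in gensim 4.x;
# gensim.models.fasttext.load_facebook_model is the documented replacement.
fasttext_model=ft.load_fasttext_format("data/cc.en.300.bin.gz")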


In [18]:
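# Loading GloVe: each row holds a token followed by its 300 vector values
# (quoting=3 is csv.QUOTE_NONE, so quote characters in tokens are read literally)
glove_df = pd.read_csv('data/glove.6B.300d.txt', sep=" ",
                       quoting=3, header=None, index_col=0)
# Map each token to its vector as a NumPy array
glove_model = {key: value.values for key, value in glove_df.T.items()}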

**CountVectorizer**- 
The drawback to the OHE approach is if a word appears multiple times in a sentence, i gets the same importance as any other word that appears only once. CountVectorize helps overcome this because it counts the tokens present in an observation instead o tagging everything as 1 or 0
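
The difference can be seen on two short sentences. This is a small sketch with made-up input; `binary=True` is used here to mimic one-hot style presence flags, which is an assumption about how the OHE baseline would be built.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences: "good" appears twice in the first one
docs = ["good good product", "good price"]

# Presence/absence flags (one-hot style)
binary_vec = CountVectorizer(binary=True)
binary_matrix = binary_vec.fit_transform(docs).toarray()

# Actual token counts
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(docs).toarray()

# Column position of the token "good" in the learned vocabulary
good = count_vec.vocabulary_["good"]
print("binary:", binary_matrix[0, good], "count:", count_matrix[0, good])
```

With presence flags, the repeated word is indistinguishable from a word that occurs once; with counts, the repetition is preserved in the feature value.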

**Term Frequency–Inverse Document Frequency (TF-IDF)**- 
CountVectorizer won’t answer all questions. If the length of the sentences is inconsistent or a word is repeated in all the sentences, raw counts become misleading. TF-IDF addresses these problems.

The term frequency (TF) is the number of times the token appears in a document of the corpus divided by the total number of tokens in that document.

The inverse document frequency (IDF) is the log of the total number of documents in the corpus divided by the number of documents that contain the selected word. It helps provide more weight to rare words in the corpus.

Multiplying them gives the TF-IDF score for a word in a document.

<div style="text-align:center;">
    <img src='images/tiidf.JPG' width='500'>
</div>
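
The definitions above can be worked through by hand. The following is a minimal sketch on a made-up three-document corpus, with whitespace tokenization and the helper names `tf`, `idf`, and `tf_idf` chosen only for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this plain formula.

```python
import math
from collections import Counter

# Made-up corpus; tokens are split on whitespace
docs = [
    "soft cotton shirt",
    "cotton shirt for men",
    "steel kitchen knife",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, tokens):
    # Occurrences of the term divided by the document length
    return Counter(tokens)[term] / len(tokens)

def idf(term):
    # Assumes the term occurs in at least one document
    docs_with_term = sum(term in tokens for tokens in tokenized)
    return math.log(n_docs / docs_with_term)

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

common = tf_idf("cotton", tokenized[0])  # appears in 2 of 3 documents
rare = tf_idf("soft", tokenized[0])      # appears in 1 of 3 documents
print(round(common, 3), round(rare, 3))
```

Both words occur once in the first document, so they have the same TF, but the rarer word "soft" receives the larger weight because of its higher IDF.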

In [20]:
# Initializing CountVectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Initializing TF-IDF vectorizer (unigrams to trigrams)
tfidf_vec = TfidfVectorizer(stop_words='english', analyzer='word', ngram_range=(1,3))

### Preprocessing

The following describes the preprocessing steps.
1. Remove duplicates.
2. Convert the string to lowercase.
3. Remove special characters.

In [21]:
# Combining Product Name and Description
df['Description'] = df['Product Name'] + ' ' + df['Description']

# Dropping duplicates and keeping the first record
# (.copy() so the column assignments below act on an independent DataFrame)
unique_df = df.drop_duplicates(subset=['Description'], keep='first').copy()

# Converting string to lowercase
unique_df['desc_lowered'] = unique_df['Description'].apply(lambda x: x.lower())

# Removing special characters
unique_df['desc_lowered'] = unique_df['desc_lowered'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

# Converting descriptions to a list
desc_list = list(unique_df['desc_lowered'])

In [22]:
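# Renumber the rows from 0 after dropping duplicates
unique_df= unique_df.reset_index(drop=True)

# Keep the new row number as an explicit 'index' column
unique_df.reset_index(inplace=True)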

### Similarity Measures

#### Manhattan distance
Manhattan distance is calculated as the sum of the absolute differences between the corresponding components of the two vectors.

<div style="text-align:center;">
    <img src='images/md.png' width='500'>
</div>
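
As a quick check of the definition, here is a small sketch with two made-up vectors. The `manhattan_distances` function imported earlier from scikit-learn computes the same quantity for every pair of rows in two matrices.

```python
import numpy as np

# Made-up example vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# |1 - 4| + |2 - 0| + |3 - 3|
manhattan = np.abs(a - b).sum()
print(manhattan)
```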

#### Euclidean distance
Euclidean distance is calculated as the square root of the sum of the squared differences between the corresponding components of the two vectors.

<div style="text-align:center;">
    <img src='images/euclidean.png' width='500'>
</div>
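
The same kind of check works for Euclidean distance. This is a small sketch with made-up vectors; `np.linalg.norm(a - b)` is an equivalent shorthand, and the `euclidean_distances` function imported earlier applies it to every pair of rows in two matrices.

```python
import numpy as np

# Made-up example vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# sqrt((1 - 4)^2 + (2 - 6)^2 + (3 - 3)^2)
euclidean = np.sqrt(np.sum((a - b) ** 2))
print(euclidean)
```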

#### Cosine Similarity
It is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of the two vectors' lengths (or magnitudes).

<div style="text-align:center;">
    <img src='images/coss.png' width='500'>
</div>
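
The following sketch computes cosine similarity directly from the definition, using made-up vectors and a hypothetical helper named `cosine`. It also shows that scaling a vector does not change the result, since only the direction matters; this is one reason the measure is a common choice when comparing texts of different lengths.

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up example vectors
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
c = np.array([1.0, 2.0, 3.0])

partial_overlap = cosine(a, b)   # vectors share one of two non-zero components
same_direction = cosine(c, 2 * c)  # c scaled by 2 points the same way
print(partial_overlap, same_direction)
```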

### Functions for Ranking