# Books-recommender-system-from-scratch-in-Python-Content-Based Filtering

#### Content-Based Filtering
A content-based recommender system suggests items to users based on similarity in content. It is a simple way to generate recommendations from a customer’s preferences for particular content. Its main disadvantage is that it cannot suggest items unlike those the user has already interacted with. For instance, if a reader has only read motivational books, the model will never suggest a romance or comedy book. The user therefore never gets a recommendation outside the genres they have already explored.

In [132]:
#### Importing the libraries

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [133]:
# Reading the Dataset

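# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions, pass on_bad_lines='skip' instead.
data = pd.read_csv(r'C:\Users\avisa\Downloads\BX-CSV-Dump\BX-Books.csv', error_bad_lines=False, encoding='latin-1', sep=';')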
data.head()

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |
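The loading step can be tried without the original file. The sketch below is only an illustration under stated assumptions: the in-memory CSV and its broken row are made up, and it assumes pandas 1.3 or newer, where `on_bad_lines` replaces `error_bad_lines`. Passing `dtype=str` is an optional extra that keeps ISBNs as text.

```python
import io
import pandas as pd

# Hypothetical in-memory sample standing in for BX-Books.csv.
# The last row contains an extra ';' and therefore has one field too many.
raw = (
    "ISBN;Book-Title;Book-Author\n"
    "0195153448;Classical Mythology;Mark P. O. Morford\n"
    "0002005018;Clara Callan;Richard Bruce Wright\n"
    "0000000000;Broken; Row;Some Author\n"
)

# on_bad_lines='skip' drops rows with too many fields instead of raising.
# dtype=str keeps every column as text, so leading zeros in ISBNs survive.
sample = pd.read_csv(io.StringIO(raw), sep=";", dtype=str, on_bad_lines="skip")
print(sample)
```

Only the two well-formed rows should remain, which mirrors the "Skipping line ..." messages printed for the real file.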


Notice that the dataframe contains information about each book, such as its author, publisher, and title. We will use this data to build a recommendation system that suggests what a user should read next based on their current book preferences.

In [93]:
# Now, let’s list these variables to better understand them

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


The dataframe above has over 271K rows of data. We will randomly sample 15,000 rows to build the recommender system, because the pairwise similarity matrix computed later grows quadratically with the number of books and would not fit in memory for the full dataset.

Also, we will only keep three columns to build this recommender system: “Book-Title,” “Book-Author,” and “Publisher.”

## Pre-processing Data to Build the Recommendation System
### 1. Removing Duplicates

First, let us check if there are any duplicate book titles. These are redundant to the algorithm and must be removed:

In [94]:
data.duplicated(subset='Book-Title').sum()

29225

In [95]:
data.drop_duplicates(subset='Book-Title', inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 242135 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 242135 non-null  object
 1   Book-Title           242135 non-null  object
 2   Book-Author          242134 non-null  object
 3   Year-Of-Publication  242135 non-null  object
 4   Publisher            242134 non-null  object
 5   Image-URL-S          242135 non-null  object
 6   Image-URL-M          242135 non-null  object
 7   Image-URL-L          242132 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB
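One detail worth checking is that `drop_duplicates` compares titles exactly, so titles that differ only in letter case are kept as separate books. The sketch below uses a made-up toy frame to show the difference; the case-insensitive variant is an assumption about what one might want, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy titles: two exact duplicates and one case variant.
toy = pd.DataFrame({"Book-Title": ["Jupiter", "Jupiter", "JUPITER", "Clara Callan"]})

# Exact-match deduplication, as in the cell above.
exact = toy.drop_duplicates(subset="Book-Title")

# Case-insensitive variant: flag repeats among lowercased titles,
# but keep the original rows and spelling.
mask = toy["Book-Title"].str.lower().duplicated()
case_insensitive = toy[~mask]

print(len(exact), len(case_insensitive))
```

Since the titles are lowercased in a later step and then used as row and column labels, case variants that survive here can make lookups by title ambiguous.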


### 2. Random Sampling

We need to randomly sample 15,000 rows from the dataframe to avoid running into memory errors.

In [143]:
sam_size = 15000
df = data.sample(n=sam_size, replace=False, random_state=1000).reset_index(drop=True)
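To see why sampling is needed, it helps to estimate the size of the dense book-by-book similarity matrix built later. This is a rough back-of-the-envelope sketch; the helper name `dense_similarity_gib` is made up for illustration, and it assumes 8-byte floating-point values.

```python
def dense_similarity_gib(n_rows, bytes_per_value=8):
    """Approximate size in GiB of a dense n_rows x n_rows float64 matrix."""
    return n_rows * n_rows * bytes_per_value / 2**30

full = dense_similarity_gib(242_135)    # all de-duplicated titles
sampled = dense_similarity_gib(15_000)  # the 15,000-row sample

print(f"full dataset: {full:.0f} GiB, sample: {sampled:.2f} GiB")
```

Under these assumptions the full dataset would need hundreds of GiB, while the sample needs under 2 GiB.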

### 3. Data Cleaning         

Now, let us print the head of the dataframe again:

In [144]:
df.head()

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0671872524 | Warped (Star Trek Deep Space Nine) | K. W. Jeter | 1995 | Pocket Books | http://images.amazon.com/images/P/0671872524.0... | http://images.amazon.com/images/P/0671872524.0... | http://images.amazon.com/images/P/0671872524.0... |
| 1 | 8571641145 | A Grande Arte | Rubem Fonseca | 1998 | Companhia das Letras | http://images.amazon.com/images/P/8571641145.0... | http://images.amazon.com/images/P/8571641145.0... | http://images.amazon.com/images/P/8571641145.0... |
| 2 | 0373194994 | Boss'S Baby Mistake (Silhouette Romance) | Raye Morgan | 2001 | Silhouette | http://images.amazon.com/images/P/0373194994.0... | http://images.amazon.com/images/P/0373194994.0... | http://images.amazon.com/images/P/0373194994.0... |
| 3 | 0312872178 | Jupiter | Ben Bova | 2000 | Tor Books | http://images.amazon.com/images/P/0312872178.0... | http://images.amazon.com/images/P/0312872178.0... | http://images.amazon.com/images/P/0312872178.0... |
| 4 | 3442132916 | Der Tod und die lachende Jungfrau / Hexenflug.... | Ellis Peters | 2000 | Goldmann | http://images.amazon.com/images/P/3442132916.0... | http://images.amazon.com/images/P/3442132916.0... | http://images.amazon.com/images/P/3442132916.0... |


The dataframe contains columns that are not relevant to the model, such as each book’s ISBN code, its year of publication, and a link to its image.

We only need the “Book-Title,” “Book-Author,” and “Publisher” columns to build the model. Since this is text data, we need to clean it and then transform it into a vector representation.

In [145]:
# Combine each author’s first and last names into a single string

def clean_text(author):
    # str() turns a missing author (NaN) into the string 'nan'
    result = str(author).lower()
    return result.replace(' ', '')

In [146]:
df['Book-Author'] = df['Book-Author'].apply(clean_text)
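The effect of `clean_text` on a few values, including a missing author, can be checked on a small made-up series. The vectorized alternative at the end is an assumption about a possible variation, not the notebook's method; it differs in that missing values stay missing instead of becoming the word 'nan'.

```python
import numpy as np
import pandas as pd

def clean_text(author):
    # Same logic as the cell above: lowercase and remove spaces.
    return str(author).lower().replace(" ", "")

# Hypothetical sample, including one missing author.
authors = pd.Series(["Ben Bova", "K. W. Jeter", np.nan])

applied = authors.apply(clean_text)
print(applied.tolist())

# Assumed alternative: pandas string methods leave missing values as NaN.
cleaned = authors.str.lower().str.replace(" ", "", regex=False)
print(cleaned.tolist())
```

This explains why `Book-Author` shows no missing values in the `df.info()` output after cleaning: a missing author has become the literal text 'nan'.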

In [147]:
# Now, let’s convert the book title and publisher to lowercase:

df['Book-Title'] = df['Book-Title'].str.lower()
df['Publisher'] = df['Publisher'].str.lower()

In [148]:
display(df.head())
df.info()

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0671872524 | warped (star trek deep space nine) | k.w.jeter | 1995 | pocket books | http://images.amazon.com/images/P/0671872524.0... | http://images.amazon.com/images/P/0671872524.0... | http://images.amazon.com/images/P/0671872524.0... |
| 1 | 8571641145 | a grande arte | rubemfonseca | 1998 | companhia das letras | http://images.amazon.com/images/P/8571641145.0... | http://images.amazon.com/images/P/8571641145.0... | http://images.amazon.com/images/P/8571641145.0... |
| 2 | 0373194994 | boss's baby mistake (silhouette romance) | rayemorgan | 2001 | silhouette | http://images.amazon.com/images/P/0373194994.0... | http://images.amazon.com/images/P/0373194994.0... | http://images.amazon.com/images/P/0373194994.0... |
| 3 | 0312872178 | jupiter | benbova | 2000 | tor books | http://images.amazon.com/images/P/0312872178.0... | http://images.amazon.com/images/P/0312872178.0... | http://images.amazon.com/images/P/0312872178.0... |
| 4 | 3442132916 | der tod und die lachende jungfrau / hexenflug.... | ellispeters | 2000 | goldmann | http://images.amazon.com/images/P/3442132916.0... | http://images.amazon.com/images/P/3442132916.0... | http://images.amazon.com/images/P/3442132916.0... |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ISBN                 15000 non-null  object
 1   Book-Title           15000 non-null  object
 2   Book-Author          15000 non-null  object
 3   Year-Of-Publication  15000 non-null  object
 4   Publisher            15000 non-null  object
 5   Image-URL-S          15000 non-null  object
 6   Image-URL-M          15000 non-null  object
 7   Image-URL-L          14999 non-null  object
dtypes: object(8)
memory usage: 937.6+ KB


In [149]:
# We only need the “Book-Title,” “Book-Author,” and “Publisher” columns, so drop the rest.

df1 = df.drop(columns=['ISBN', 'Year-Of-Publication', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L'])
df1.head()

| | Book-Title | Book-Author | Publisher |
| --- | --- | --- | --- |
| 0 | warped (star trek deep space nine) | k.w.jeter | pocket books |
| 1 | a grande arte | rubemfonseca | companhia das letras |
| 2 | boss's baby mistake (silhouette romance) | rayemorgan | silhouette |
| 3 | jupiter | benbova | tor books |
| 4 | der tod und die lachende jungfrau / hexenflug.... | ellispeters | goldmann |


In [150]:
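# Combine the text columns into one string per book.
# Note: df1.columns[1:] skips 'Book-Title', so only the author and publisher
# are joined (see the 'data' column in the output below).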

df1['data'] = df1[df1.columns[1:]].apply(lambda x: ' '.join(x.dropna().astype(str)),axis=1)
df1.head()

| | Book-Title | Book-Author | Publisher | data |
| --- | --- | --- | --- | --- |
| 0 | warped (star trek deep space nine) | k.w.jeter | pocket books | k.w.jeter pocket books |
| 1 | a grande arte | rubemfonseca | companhia das letras | rubemfonseca companhia das letras |
| 2 | boss's baby mistake (silhouette romance) | rayemorgan | silhouette | rayemorgan silhouette |
| 3 | jupiter | benbova | tor books | benbova tor books |
| 4 | der tod und die lachende jungfrau / hexenflug.... | ellispeters | goldmann | ellispeters goldmann |
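In the output above, the `data` column holds only the author and publisher, because `df1.columns[1:]` starts after `Book-Title`. If the title should also contribute to the similarity, one possible variation is to name the columns explicitly. The sketch below uses a one-row toy frame, and the `text_cols` list is an assumed choice rather than the notebook's own code.

```python
import pandas as pd

# Hypothetical one-row frame with the same three columns as df1.
toy = pd.DataFrame({
    "Book-Title": ["jupiter"],
    "Book-Author": ["benbova"],
    "Publisher": ["tor books"],
})

# Naming the columns explicitly avoids depending on column order.
text_cols = ["Book-Title", "Book-Author", "Publisher"]  # assumed choice
toy["data"] = toy[text_cols].apply(
    lambda row: " ".join(row.dropna().astype(str)), axis=1
)

combined = toy.loc[0, "data"]
print(combined)
```

Including the title would change the similarity values and recommendations shown in the rest of this walkthrough, which were produced from author and publisher only.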


### 4. Vectorize the Dataframe/Processing Text Data

#### Note:
The biggest limitation of CountVectorizer is that it takes only word frequency into account. This means that less important words like “and”, “a”, and “the” are given the same weight as highly informative words.

However, CountVectorizer is suitable for this specific use case, since we are not working with complete sentences. We instead deal with short fields such as a book’s title, author, and publisher, and we can treat each word with equal importance.

After converting these fields into word-count vectors, we will measure the likeness between every pair of books based on the words they have in common. This will be achieved using a similarity measure called cosine similarity, which is explained below.

We can apply Scikit-Learn’s CountVectorizer() on the combined text data.

In [151]:
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
vectored = vector.fit_transform(df1['data'])

The variable “vectored” is a sparse matrix with a numeric representation of the strings we extracted: one row per book and one column per distinct word.
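To make the bag-of-words representation concrete, here is a small sketch that runs `CountVectorizer` with its default settings on two of the combined strings shown earlier (rows 0 and 3). The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Combined author + publisher strings for rows 0 and 3 of df1['data'].
docs = ["k.w.jeter pocket books", "benbova tor books"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column; sorting the keys gives column order.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(counts.toarray())
```

With the default token pattern, words shorter than two characters are dropped and punctuation splits tokens, so "k.w.jeter" contributes only "jeter". Authors written with initials are therefore represented by fewer words than their full names suggest.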

## Building the Recommendation System 

Now, we will use cosine similarity to measure the resemblance between each pair of bag-of-words vectors. Cosine similarity is the cosine of the angle between two vectors, that is, their dot product divided by the product of their lengths, and it indicates how closely they point in the same direction.

In general, cosine similarity ranges from -1 to 1. Because word counts are never negative, here it ranges between 0 and 1. A value of 0 indicates that the two books share no words, while 1 tells us that their word counts are proportional, which in practice usually means their combined text is identical.
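As a worked example, the sketch below computes the cosine similarity by hand for two small count vectors and compares it with scikit-learn's result. The vectors are what the default `CountVectorizer` would be expected to produce for "k.w.jeter pocket books" and "benbova tor books"; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Count vectors over the vocabulary [benbova, books, jeter, pocket, tor].
a = np.array([0, 1, 1, 1, 0])  # "jeter pocket books"
b = np.array([1, 1, 0, 0, 1])  # "benbova tor books"

# Dot product divided by the product of the vector lengths.
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The same quantity computed by scikit-learn (expects 2-D inputs).
from_sklearn = cosine_similarity([a], [b])[0, 0]

print(manual, from_sklearn)
```

The two books share one word ("books") out of three each, giving 1 / (√3 · √3) = 1/3. This matches the 0.333333 entry between "warped (star trek deep space nine)" and "jupiter" in the similarity table further down.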

In [152]:
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(vectored)

In [153]:
print(similarities)

[[1. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]


We have a square matrix of values ranging between 0 and 1, where row i and column j hold the similarity between book i and book j. Since the book titles are not shown here, we need to map this matrix back to the titles in the previous dataframe:

In [154]:
df = pd.DataFrame(similarities, columns=df['Book-Title'], index=df['Book-Title']).reset_index()

df.head()

| | Book-Title | warped (star trek deep space nine) | a grande arte | boss's baby mistake (silhouette romance) | jupiter | der tod und die lachende jungfrau / hexenflug. zwei romane in einem band. | megalodon: the prehistoric shark (dig and discover) | the energy of nature | from the land of shadows: the making of grey owl | doctor zhivago | ... | women and russia: feminist writings from the soviet union | picking apples & pumpkins (read with me) | the secret war against hanoi : the untold story of spies, saboteurs, and covert warriors in north vietnam | speed cleaning | learning windows 95 | halls of the arcanum: pilgrims of the glittering path (mage) | klippenmond. (life). ( ab 13 j.). | holiday entertaining for dummies | bicycling magazine's mountain biking skills (bicycling magazine) | starry night (christy miller) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | warped (star trek deep space nine) | 1.0 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | a grande arte | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | boss's baby mistake (silhouette romance) | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | jupiter | 0.333333 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | der tod und die lachende jungfrau / hexenflug.... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Observe that we have converted the similarity matrix into a dataframe with book titles listed vertically and horizontally. The dataframe values represent the cosine similarity between different books.

Also, notice that the diagonal is always 1.0, since it displays the similarity of each book with itself.

## Displaying User Recommendations

Let’s use the dataframe above to display book recommendations. When a book title is given as input, the 10 most similar books should be returned.

In [174]:
input_book = "chicken soup for the surviving soul: 101 healing stories to comfort cancer patients and their loved ones"
recommendations = pd.DataFrame(df.nlargest(11,input_book)['Book-Title'])
recommendations = recommendations[recommendations['Book-Title']!=input_book]
print(recommendations)

                                              Book-Title
1876   chicken soup for the soul christmas treasury f...
2754   chicken soup for the nurse's soul: 101 stories...
8007   chicken soup for the jewish soul : 101 stories...
8066     chicken soup for the soul: a christmas treasury
8232   chicken soup for the teenage soul (chicken sou...
8718   chicken soup for the preteen soul - 101 storie...
9495   chicken soup for the unsinkable soul - stories...
11616  chicken soup for the father's soul, 101 storie...
13304  chicken soup for the christian family soul : s...
13208  mentors, masters and mrs. macgregor : stories ...
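The lookup above can be wrapped in a small reusable function. The sketch below is one possible way to do it, not the notebook's own code: the name `recommend` is hypothetical, and it assumes a square similarity dataframe whose index and columns are both book titles (that is, built without the final `reset_index()`). The toy matrix is made up.

```python
import pandas as pd

def recommend(sim_df, title, k=10):
    """Return the k titles most similar to `title`, excluding the title itself.

    sim_df is assumed to be a square DataFrame whose index and columns
    are both book titles.
    """
    if title not in sim_df.columns:
        raise KeyError(f"unknown title: {title!r}")
    # Drop the input book first, then take the k highest scores.
    scores = sim_df[title].drop(labels=title)
    return scores.nlargest(k)

# Hypothetical 3-book similarity matrix for demonstration.
titles = ["a", "b", "c"]
toy = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 1.0]],
    index=titles, columns=titles,
)

top = recommend(toy, "b", k=2)
print(top)
```

Removing the input title before ranking, instead of taking 11 rows and filtering afterwards, keeps the result at exactly `k` books even when several other books tie with a similarity of 1.0, which can happen when they share the same author and publisher.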


### Summary

* The main drawback of content-based filtering is that similar items are grouped together, so users are not recommended products whose content differs from what they have previously liked.
* Notice that even in this example, nearly all of the recommended books are "chicken soup" titles, since we used one as input.
* To refine the algorithm and ensure that we are not solely recommending products with the same content, a collaborative-filtering based recommender system can be used.
