# Book Recommendation System

A recommendation engine is a machine learning system that offers relevant suggestions to a customer.

## Types Of Recommendation System

A recommendation system is usually built using one of three techniques: **content-based filtering**, **collaborative filtering**, or a combination of both.

1) **Content-Based Filtering** - The algorithm recommends products that are similar to the ones a user has already used or watched. In simple words, we try to find look-alike items.

2) **Collaborative Filtering** - Collaborative filtering recommender systems are based on past interactions between users and target items. In simple words, we try to find look-alike customers and offer products based on what those look-alikes have chosen.

3) **Hybrid Filtering Method** - It is a combination of the two methods above. It is a more complex model, which recommends products based on your own history as well as on users similar to you.
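
The difference between the first two techniques can be sketched with a toy example. Everything below is hypothetical (the book names, the feature order, and the ratings are made up for illustration): content-based filtering compares books by their own attributes, while collaborative filtering compares them by the ratings users gave them.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Content-based view: each book is described by its own attributes.
# Hypothetical feature order: [fantasy, romance, history]
book_features = {
    "book_x": np.array([1.0, 0.0, 0.0]),
    "book_y": np.array([1.0, 1.0, 0.0]),
    "book_z": np.array([0.0, 0.0, 1.0]),
}

# Collaborative view: each book is described by the ratings users gave it.
# Hypothetical rating order: [user_1, user_2, user_3]
book_ratings = {
    "book_x": np.array([9.0, 8.0, 0.0]),
    "book_y": np.array([0.0, 1.0, 9.0]),
    "book_z": np.array([10.0, 7.0, 0.0]),
}

content_sim_xy = cosine_similarity(book_features["book_x"], book_features["book_y"])
content_sim_xz = cosine_similarity(book_features["book_x"], book_features["book_z"])
collab_sim_xy = cosine_similarity(book_ratings["book_x"], book_ratings["book_y"])
collab_sim_xz = cosine_similarity(book_ratings["book_x"], book_ratings["book_z"])

print("content-based:", round(content_sim_xy, 3), round(content_sim_xz, 3))
print("collaborative:", round(collab_sim_xy, 3), round(collab_sim_xz, 3))
```

In this toy data the two views disagree: by attributes `book_x` is closer to `book_y`, but by user ratings it is closer to `book_z`. A hybrid method would combine both signals.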

A **book recommendation system** is a type of recommendation system in which we recommend similar books to a reader based on their interests.

## Dataset

* Books – Contains information about each book, such as author, title, and publication year.
* Users – Contains information about registered users, such as user id and location.
* Ratings – Contains the rating each user has given to each book.

## Loading the data

While loading the files we run into a few problems:
* The values in the CSV files are separated by semicolons, not commas.
* Some lines are malformed: they contain more fields than the header declares, so pandas cannot parse them and raises an error.
* The files are encoded in Latin-1.

So we have to handle these cases while loading the data. Running the code below produces warnings that show which lines were skipped during loading.

In [1]:
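import numpy as np
import pandas as pd

# sep=';' and encoding='latin-1' match the file format described above.
# error_bad_lines=False skips malformed lines; it was deprecated in pandas 1.3
# and removed in pandas 2.0, where on_bad_lines='warn' (or 'skip') replaces it.
books = pd.read_csv("Data/BX-Books.csv", sep=';', encoding="latin-1", error_bad_lines=False)
users = pd.read_csv("Data/BX-Users.csv", sep=';', encoding="latin-1", error_bad_lines=False)
ratings = pd.read_csv("Data/BX-Book-Ratings.csv", sep=';', encoding="latin-1", error_bad_lines=False)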

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
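
On newer pandas versions (1.3 and later), skipping malformed lines is done with the `on_bad_lines` parameter instead. The following is a minimal sketch on a made-up in-memory file, not the real dataset; the column names and rows are hypothetical.

```python
import io
import pandas as pd

# Toy semicolon-separated file. The third line has an extra field,
# for example because a title contained an unescaped ';'.
raw = "ISBN;title\n111;Book One\n222;Book; Two\n333;Book Three\n"

# on_bad_lines='skip' drops malformed rows silently;
# on_bad_lines='warn' drops them and reports each one.
toy = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip")
print(toy)
```

Only the two well-formed rows survive, which mirrors what happens to the skipped lines reported above.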


## Preprocessing data

In the books file, we have some extra columns that are not required for our task, such as image URLs. We will also rename the columns in each dataset to make them easier to use.

In [2]:
# Keep only the columns we need; .copy() avoids a SettingWithCopyWarning on the in-place rename below
books = books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher']].copy()
# Shorter, consistent column names for all three dataframes
books.rename(columns = {'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'}, inplace=True)
users.rename(columns = {'User-ID':'user_id', 'Location':'location', 'Age':'age'}, inplace=True)
ratings.rename(columns = {'User-ID':'user_id', 'Book-Rating':'rating'}, inplace=True)

In [3]:
books.head()

| | ISBN | title | author | year | publisher |
|---|---|---|---|---|---|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada |
| 2 | 60973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial |
| 3 | 374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux |
| 4 | 393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company |


In [4]:
users.head()

| | user_id | location | age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | NaN |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | NaN |


In [5]:
ratings.head()

| | user_id | ISBN | rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


In [6]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   ISBN       271360 non-null  object
 1   title      271360 non-null  object
 2   author     271359 non-null  object
 3   year       271360 non-null  object
 4   publisher  271358 non-null  object
dtypes: object(5)
memory usage: 10.4+ MB


In [7]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   user_id   278858 non-null  int64  
 1   location  278858 non-null  object 
 2   age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [8]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype 
---  ------   --------------    ----- 
 0   user_id  1149780 non-null  int64 
 1   ISBN     1149780 non-null  object
 2   rating   1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


We have data on **271360** books, approximately **278000** registered users, and **over one million** ratings. The dataset is therefore large enough to work with, although `users.info()` shows that `age` is missing for many users.

## Problem

Our goal is to find a user **A** who has read and liked books *x* and *y*, and a user **B** who has also read and liked books *x* and *y*. If user **A** then reads a new book *z* and likes it, we can recommend *z* to user **B**. This is collaborative filtering.
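
The idea can be written down directly with sets. This is only an illustrative sketch: the users, books, the helper name `recommend_from_lookalikes`, and the `min_shared` threshold are all hypothetical and are not part of the notebook's pipeline.

```python
# Books each user liked (hypothetical toy data).
liked = {
    "A": {"x", "y", "z"},
    "B": {"x", "y"},
    "C": {"w"},
}

def recommend_from_lookalikes(target, liked, min_shared=2):
    """Suggest books liked by users who share at least `min_shared` liked books with `target`."""
    suggestions = set()
    for other, books in liked.items():
        if other == target:
            continue
        # A "look-alike" is a user with enough liked books in common.
        if len(books & liked[target]) >= min_shared:
            # Recommend what the look-alike liked and the target has not read yet.
            suggestions |= books - liked[target]
    return sorted(suggestions)

print(recommend_from_lookalikes("B", liked))  # ['z']
```

User **B** gets *z* because **A** shares *x* and *y* with them; user **C** shares nothing with anyone and gets no suggestions.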

We can achieve this with a book-by-user rating matrix: the columns are users, the index is books, and the values are individual ratings. Books with similar rating patterns can then be found with a nearest-neighbors search over the rows of this matrix.

For this approach to work well, we need to limit which users and books we use. We will keep only users who have rated more than 200 books and books that have been rated at least 50 times, so that every row and column of the matrix has enough ratings to compare.

In [9]:
ratings['user_id'].value_counts()

11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
          ...  
271728        1
245123        1
234886        1
259466        1
187812        1
Name: user_id, Length: 105283, dtype: int64

We can see that only 105283 of the roughly 278000 registered users have rated at least one book. We will extract the user ids of those who have given more than 200 ratings, and then keep only these users' ratings in the ratings dataframe.

In [12]:
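# Boolean Series indexed by user_id: True where the user has more than 200 ratings
x = ratings['user_id'].value_counts() > 200
y = x[x].index  # user_ids of those users
print(y.shape)
# Keep only the ratings given by those users
ratings = ratings[ratings['user_id'].isin(y)]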

(899,)
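
The same filtering pattern is easier to follow on a tiny dataframe. The sketch below uses made-up ratings and a threshold of 2 instead of 200; the variable names are hypothetical.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "a", "b", "a"],
    "rating": [5, 0, 7, 8, 0, 10],
})

min_ratings = 2  # the notebook uses 200

# Number of ratings per user, then the ids of users above the threshold.
counts = toy_ratings["user_id"].value_counts()
active_users = counts[counts > min_ratings].index

# Keep only rows that belong to those users.
filtered = toy_ratings[toy_ratings["user_id"].isin(active_users)]
print(filtered)
```

Only user 1 has more than two ratings, so only that user's three rows remain.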


We have about 900 users who have given about 520000 ratings. Now we will merge ratings with books on the ISBN code, so that each rating row also carries the book's title, author, year, and publisher. Books that a user has not rated do not appear at this stage; those gaps are filled with zero later, when we build the pivot table.

In [11]:
rating_with_books = ratings.merge(books, on='ISBN')
rating_with_books.head()

| | user_id | ISBN | rating | title | author | year | publisher |
|---|---|---|---|---|---|---|---|
| 0 | 277427 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc |
| 1 | 3363 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc |
| 2 | 11676 | 002542730X | 6 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc |
| 3 | 12538 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc |
| 4 | 13552 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc |


In [13]:
rating_with_books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 487671 entries, 0 to 487670
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   user_id    487671 non-null  int64 
 1   ISBN       487671 non-null  object
 2   rating     487671 non-null  int64 
 3   title      487671 non-null  object
 4   author     487671 non-null  object
 5   year       487671 non-null  object
 6   publisher  487669 non-null  object
dtypes: int64(2), object(5)
memory usage: 29.8+ MB
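
The merged dataframe has fewer rows than the filtered ratings because `merge` performs an inner join by default: ratings whose ISBN has no match in `books` are dropped. A small sketch on made-up data shows this, and uses the `indicator` option of `merge` to list the unmatched rows. The toy frames and names are hypothetical.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3],
    "ISBN": ["a", "b", "zzz"],  # "zzz" has no entry in toy_books
    "rating": [5, 0, 7],
})
toy_books = pd.DataFrame({"ISBN": ["a", "b", "c"], "title": ["A", "B", "C"]})

# Default how="inner": only ISBNs present in both frames are kept.
inner = toy_ratings.merge(toy_books, on="ISBN")

# A left join with indicator=True marks where each row came from.
check = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = check[check["_merge"] == "left_only"]

print(len(toy_ratings), "ratings before,", len(inner), "after the inner merge")
print(unmatched[["user_id", "ISBN", "rating"]])
```

Running the same check on the real data would show which ratings refer to books missing from the books file, for instance the lines skipped while loading.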


The dataframe is now smaller, with 487671 entries, because the merge keeps only ratings whose ISBN also appears in the books dataframe. Now we will count the number of ratings each book has received, by grouping on title and counting the ratings.

In [15]:
# Number of ratings per title
number_rating = rating_with_books.groupby('title')['rating'].count().reset_index()
number_rating.rename(columns= {'rating':'number_of_ratings'}, inplace=True)
final_rating = rating_with_books.merge(number_rating, on='title')
print(final_rating.shape)
# Keep books with at least 50 ratings, then keep one rating per (user, title) pair
final_rating = final_rating[final_rating['number_of_ratings'] >= 50]
final_rating = final_rating.drop_duplicates(['user_id','title'])

(487671, 8)


We have to drop duplicates because, if the same user has rated the same title more than once, there would be more than one value for a single cell of the matrix. At last we have our final dataset, containing only users who rated more than 200 books and books that have been rated at least 50 times. The final dataframe has 59850 rows and 8 columns.

In [16]:
final_rating.head()

| | user_id | ISBN | rating | title | author | year | publisher | number_of_ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | 277427 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |
| 1 | 3363 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |
| 2 | 11676 | 002542730X | 6 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |
| 3 | 12538 | 002542730X | 10 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |
| 4 | 13552 | 002542730X | 0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner | 1994 | John Wiley & Sons Inc | 82 |
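
The count-then-merge step can also be expressed with `groupby(...).transform('count')`, which writes the per-title count straight onto every row. This is an alternative sketch on made-up data with a threshold of 3 instead of 50, not the notebook's own code.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1],
    "title": ["A", "A", "A", "B", "A"],
    "rating": [5, 0, 7, 8, 6],
})

min_count = 3  # the notebook uses 50

# transform('count') returns one value per row: the size of that row's title group.
toy["number_of_ratings"] = toy.groupby("title")["rating"].transform("count")

# Keep popular titles, then keep the first rating for each (user, title) pair.
popular = toy[toy["number_of_ratings"] >= min_count]
popular = popular.drop_duplicates(["user_id", "title"])
print(popular)
```

As in the notebook, the count is taken before duplicates are dropped, so title `A` is counted with four ratings even though only three rows remain afterwards.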


## Creating the matrix

We will create a pivot table in which the columns are user ids, the index is book titles, and the values are ratings. Where a user has not rated a book, the value is NaN, which we then replace with 0.

In [17]:
book_pivot = final_rating.pivot_table(columns='user_id', index='title', values="rating")
book_pivot.fillna(0, inplace=True)

We can notice that about 11 more users have been removed, because all of their ratings were for books that received fewer than 50 ratings.
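
To see what the pivot does, here is a minimal sketch on four made-up ratings. The data and names are hypothetical; the calls are the same `pivot_table` and `fillna` used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "title": ["A", "A", "B", "B"],
    "rating": [5, 8, 7, 10],
})

# Rows are titles, columns are users, cells are ratings.
toy_pivot = toy.pivot_table(columns="user_id", index="title", values="rating")

# User 3 never rated "A" and user 2 never rated "B", so those cells are NaN.
has_gaps = bool(toy_pivot.isna().any().any())

# Replace the gaps with 0 so the matrix is fully numeric.
toy_pivot = toy_pivot.fillna(0)
print(toy_pivot)
```

One consequence worth keeping in mind: after `fillna(0)`, a book the user never rated is indistinguishable from a book the user rated 0, and the ratings file does contain 0 values.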

## Modelling

We have prepared our dataset for modeling. We will use the nearest neighbors algorithm, the unsupervised neighbor search that also underlies k-nearest neighbors. With scikit-learn's default settings it ranks neighbors by Euclidean distance.

Our pivot table has lots of zero values, which wastes memory and computation. We will convert the pivot table to a sparse matrix and then feed it to the model.

In [19]:
from scipy.sparse import csr_matrix
book_sparse = csr_matrix(book_pivot)
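
A quick way to see why the sparse format helps is to count how many cells actually hold a value. The sketch below uses a small made-up matrix; on the real data the same two lines (`nnz` and the shape) could be applied to `book_sparse`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 4 rating matrix that is mostly zeros.
dense = np.array([
    [0, 0, 9, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 8],
], dtype=float)

sparse = csr_matrix(dense)

# CSR stores only the non-zero entries plus their positions.
stored = sparse.nnz
density = stored / (dense.shape[0] * dense.shape[1])
print(f"{stored} stored values, density {density:.2f}")
```

The conversion is lossless: `sparse.toarray()` gives back the original dense matrix.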

Now we will fit the nearest neighbors model. Here we specify the *brute* algorithm, meaning that each query computes the distance to every point in the dataset.

In [20]:
from sklearn.neighbors import NearestNeighbors
model = NearestNeighbors(algorithm='brute')
model.fit(book_sparse)

NearestNeighbors(algorithm='brute')
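
To make the brute-force search concrete, the sketch below fits the same kind of model on a small made-up matrix and compares its result with distances computed by hand in NumPy. The toy matrix is hypothetical, and the check assumes the default Euclidean metric.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical ratings: 4 books (rows) rated by 3 users (columns).
toy = np.array([
    [10.0, 0.0, 8.0],
    [9.0, 0.0, 7.0],
    [0.0, 6.0, 0.0],
    [0.0, 5.0, 1.0],
])

knn = NearestNeighbors(n_neighbors=2, algorithm="brute")
knn.fit(csr_matrix(toy))

# Query with the first book; the closest match is the book itself.
dist, idx = knn.kneighbors(toy[0].reshape(1, -1))

# The same distances by hand: Euclidean distance from row 0 to every row.
manual = np.sqrt(((toy - toy[0]) ** 2).sum(axis=1))

print(idx[0], dist[0])
print(manual)
```

Because Euclidean distance depends on the magnitude of the ratings, a different `metric` (for example `metric='cosine'`, which scikit-learn also accepts here) may rank neighbors differently; which one suits this dataset better is not something this sketch settles.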

Let’s make a prediction and see whether it suggests sensible books. We will find the nearest neighbors of the input book and print the 5 books with the smallest distance; the first of these is the input book itself, at distance 0. We are going to pass a Harry Potter book, which is at index 237. This returns the distances and the row indices of the books at those distances.

In [21]:
distances, suggestions = model.kneighbors(book_pivot.iloc[237, :].values.reshape(1, -1))

In [23]:
distances

array([[ 0.        , 68.78953409, 69.5413546 , 72.64296249, 76.83098333]])

In [24]:
suggestions

array([[237, 240, 238, 241, 184]])

We can now print the suggested books.

In [22]:
for i in range(len(suggestions)):
  print(book_pivot.index[suggestions[i]])

Index(['Harry Potter and the Chamber of Secrets (Book 2)',
       'Harry Potter and the Prisoner of Azkaban (Book 3)',
       'Harry Potter and the Goblet of Fire (Book 4)',
       "Harry Potter and the Sorcerer's Stone (Book 1)", 'Exclusive'],
      dtype='object', name='title')
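
Looking books up by row number is awkward, so the steps above could be wrapped in a small helper that takes a title. The function `recommend` below is a hypothetical sketch, shown on a made-up pivot table; with the real data one would pass `book_pivot` and the fitted `model` instead.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend(title, pivot, fitted_model, n=3):
    """Return the n titles closest to `title`, excluding the title itself."""
    row = pivot.index.get_loc(title)
    query = pivot.iloc[row, :].values.reshape(1, -1)
    # Ask for one extra neighbor because the query book is its own nearest neighbor.
    _, idx = fitted_model.kneighbors(query, n_neighbors=n + 1)
    return [t for t in pivot.index[idx[0]] if t != title][:n]

# Hypothetical pivot table: titles as index, user ids as columns.
toy_pivot = pd.DataFrame(
    [[10, 0, 8], [9, 0, 7], [0, 6, 0], [0, 5, 1]],
    index=pd.Index(["Book A", "Book B", "Book C", "Book D"], name="title"),
    columns=[1, 2, 3],
    dtype=float,
)
toy_model = NearestNeighbors(algorithm="brute").fit(toy_pivot.values)

print(recommend("Book A", toy_pivot, toy_model, n=2))
```

The helper assumes titles are unique in the index, which holds for a pivot table built with `index='title'`.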
