# Book Recommendation System Using Collaborative Filtering

Reg. No. : 1004

Name: Mananya Ugadhi

Table Of Contents
1. Introduction
2. Exploratory Data Analysis
3. Model Selection and Evaluation
4. Conclusions
5. References

1. Introduction

The amount of information available, especially on the Internet, is growing very rapidly, and finding the information one needs is becoming more difficult. Recommendation systems aim to solve this kind of problem: they let a user reach relevant items quickly without searching the web manually. Many websites therefore use recommendation systems to promote and sell their products. A wide range of products, such as music, movies, and articles, can be recommended to customers based on their profiles in online shops or social networks, their browsing history (visited links), their browsing activity (number and duration of visits), and other online behavior. Online shops are increasing their sales using such technologies.

In this paper we propose using a recommendation system to recommend books. We developed a system that learns user preferences by asking users to rate books and choose favorite categories, and then generates a list of books the user would most probably like to read.

Existing recommendation services, despite their power, need rich user profile information and history. Users register with such systems, browse books, rate them, write feedback, recommend books to others, share them, and read related information; the system bases its recommendations on this information. Examples of such services are whichbook.net, whatshouldireadnext.com, and lazylibrary.com. Our recommender system instead focuses on simplicity and speed. The user registers and is asked to select 10 favorite books from at least 3 categories (genres). Based on this information the system makes recommendations. The user can then continue to rate books, buy them, and add them to a reading list, which improves the quality of the recommendations. The system overview is shown in Fig. 1. Using an intuitive search and filtering interface, a user updates the database by rating books and then receives recommendations. The recommendations are calculated with the collaborative filtering method.

1.1 Dataset Description

The Book-Crossing dataset comprises 3 tables.
BX-Users
Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values.

BX-Books
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.

BX-Book-Ratings
Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
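
The distinction between explicit and implicit ratings can be illustrated on a small made-up table. This is only a sketch: the rows in `toy_ratings` are invented for illustration, and only the column names follow the dataset description above.

```python
import pandas as pd

# Toy ratings in the Book-Crossing layout (values are made up for illustration).
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 10, 0],
})

# Explicit ratings lie on the 1-10 scale; a 0 marks an implicit interaction,
# not a score of zero.
explicit = toy_ratings[toy_ratings["Book-Rating"] > 0]
implicit = toy_ratings[toy_ratings["Book-Rating"] == 0]

print(len(explicit), "explicit rows;", len(implicit), "implicit rows")
```

Keeping the two kinds of rows apart matters when averaging, because a 0 would otherwise pull the mean rating down.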


Importing Libraries and Reading the Dataset

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as pl

In [2]:
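# on_bad_lines='warn' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0;
# malformed rows are skipped and reported, as in the output below
books = pd.read_csv('https://internships-data.s3.ap-south-1.amazonaws.com/Projects/Data/1004_BX-Books.csv', sep=';', on_bad_lines='warn', encoding='latin-1')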

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


In [3]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [4]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


In [5]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

In [6]:
books = books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher']]

We rename the columns of the dataset to make them easier to work with. The project can be done without renaming the columns, but short, consistent column names make the code easier to read and write.

In [7]:
books.rename(columns={'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'}, inplace=True)

In [8]:
books.head(2)

Unnamed: 0,ISBN,title,author,year,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada


In [9]:
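# on_bad_lines='warn' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
users = pd.read_csv('https://internships-data.s3.ap-south-1.amazonaws.com/Projects/Data/1004_BX-Users.csv', sep=';', on_bad_lines='warn', encoding='latin-1')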

In [10]:
users.head(2)

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0


In [11]:
users.rename(columns={'User-ID': 'user_id', 'Location':'location', 'Age':'age'}, inplace=True)

In [12]:
users.head(2)

Unnamed: 0,user_id,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0


In [13]:
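# on_bad_lines='warn' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
ratings = pd.read_csv('https://internships-data.s3.ap-south-1.amazonaws.com/Projects/Data/1004_BX-Book-Ratings.csv', sep=';', on_bad_lines='warn', encoding='latin-1')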

In [14]:
ratings.head(2)

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5


In [15]:
ratings.rename(columns={'User-ID':'user_id', 'Book-Rating':'rating'}, inplace=True)

In [16]:
books.shape

(271360, 5)

In [17]:
users.shape

(278858, 3)

In [18]:
ratings.shape

(1149780, 3)

In [19]:
ratings['user_id'].value_counts().shape

(105283,)

Out of the 278858 users present, only 105283 appear in the ratings table. The other users have not given any ratings.

In [20]:
x = ratings['user_id'].value_counts()>200

In [21]:
x[x].shape
# 899 users have rated more than 200 books

(899,)

Out of all the users, only 899 have rated more than 200 books. To improve the accuracy of the recommendations, we consider only these users, who have read and rated books extensively. A second reason is that the full dataset is too large to process comfortably, so some constraints have to be applied to it to get the results we need.
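
The thresholding step described above can be sketched on a tiny made-up table. The data and the `min_ratings` value of 2 are hypothetical stand-ins (the notebook uses 200 on the real ratings).

```python
import pandas as pd

# Toy ratings table: user 1 has three ratings, user 2 has two, user 3 has one.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "d", "e", "f"],
})

min_ratings = 2  # hypothetical threshold for this toy example

# Count ratings per user, keep the users strictly above the threshold,
# then keep only the rows that belong to those users.
counts = toy["user_id"].value_counts()
active_users = counts[counts > min_ratings].index
filtered = toy[toy["user_id"].isin(active_users)]

print(filtered)
```

Only user 1 exceeds the threshold here, so `filtered` keeps that user's three rows.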

In [22]:
y = x[x].index
y

Int64Index([ 11676, 198711, 153662,  98391,  35859, 212898, 278418,  76352,
            110973, 235105,
            ...
            116122,  28634, 188951,  59727, 155916, 274808,  73681,   9856,
            268622,  44296],
           dtype='int64', length=899)

In [23]:
ratings = ratings[ratings['user_id'].isin(y)]

In [24]:
ratings.shape

(526356, 3)

In [25]:
ratings.head()

Unnamed: 0,user_id,ISBN,rating
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0


Now, we are merging the `ratings` dataset and the `books` dataset on the column `ISBN` since that is the common column among those tables. This enables us to know which books have how many ratings. In other words, the ratings of the book will be present against the book title.

In [26]:
ratings_with_books = ratings.merge(books, on='ISBN')

In [27]:
ratings_with_books.shape
# ratings whose ISBN had no match in books were dropped, so the row count decreased

(487671, 7)

The row count has decreased because `merge` performs an inner join by default: ratings whose `ISBN` has no match in the `books` table were dropped automatically.
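
One way to check how many ratings are lost in such a merge is a left join with `indicator=True`, which labels each row by where it was found. The sketch below uses invented toy tables; only the column names mirror the notebook.

```python
import pandas as pd

# Toy tables: ISBN "999" is rated but missing from the books table.
toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3],
    "ISBN": ["111", "222", "999"],
    "rating": [8, 0, 5],
})
toy_books = pd.DataFrame({
    "ISBN": ["111", "222"],
    "title": ["Book A", "Book B"],
})

# Default merge is an inner join: unmatched ratings disappear silently.
inner = toy_ratings.merge(toy_books, on="ISBN")

# A left join with indicator=True keeps every rating and adds a `_merge`
# column, so the unmatched rows can be inspected.
audit = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(len(inner), "matched rows;", len(unmatched), "unmatched rows")
```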

In [28]:
number_ratings = ratings_with_books.groupby('title')['rating'].count().reset_index()

In [29]:
number_ratings.rename(columns={'rating':'number of ratings'}, inplace=True)

In [30]:
number_ratings

Unnamed: 0,title,number of ratings
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1
...,...,...
160264,Ã?Â?ber die Pflicht zum Ungehorsam gegen den S...,3
160265,Ã?Â?lpiraten.,1
160266,Ã?Â?rger mit Produkt X. Roman.,1
160267,Ã?Â?stlich der Berge.,1


In [31]:
final_ratings = ratings_with_books.merge(number_ratings, on='title')

In [32]:
final_ratings

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,number of ratings
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
...,...,...,...,...,...,...,...,...
487666,275970,1892145022,0,Here Is New York,E. B. White,1999,Little Bookroom,1
487667,275970,1931868123,0,There's a Porcupine in My Outhouse: Misadventu...,Mike Tougias,2002,Capital Books (VA),1
487668,275970,3411086211,10,Die Biene.,Sybil GrÃ?Â¤fin SchÃ?Â¶nfeldt,1993,"Bibliographisches Institut, Mannheim",1
487669,275970,3829021860,0,The Penis Book,Joseph Cohen,1999,Konemann,1


The `final_ratings` dataset consists of 8 columns that carry the information we need, so this dataset will be used to build the recommendation model.
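
The count-then-merge steps that produce the `number of ratings` column could also be written in one step with `groupby(...).transform('count')`. This is an alternative sketch on made-up data, not the approach used in the notebook, and `min_count` is a hypothetical threshold.

```python
import pandas as pd

# Toy merged table: "Book A" has three ratings, "Book B" has one.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 1],
    "title": ["Book A", "Book A", "Book A", "Book B"],
    "rating": [8, 0, 10, 5],
})

# transform('count') returns one value per original row, so the per-title
# count can be attached without building a separate table and merging it.
toy["number of ratings"] = toy.groupby("title")["rating"].transform("count")

min_count = 2  # hypothetical threshold for this toy example
popular = toy[toy["number of ratings"] >= min_count]

print(popular)
```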

In [33]:
final_ratings.shape

(487671, 8)

In [34]:
final_ratings.columns

Index(['user_id', 'ISBN', 'rating', 'title', 'author', 'year', 'publisher',
       'number of ratings'],
      dtype='object')

In [35]:
final_ratings = final_ratings[final_ratings['number of ratings']>=50]

In [36]:
final_ratings.shape

(61853, 8)

In [37]:
# Reassign instead of using inplace=True on a filtered slice, which avoids the SettingWithCopyWarning
final_ratings = final_ratings.drop_duplicates(['user_id','title'])


In [38]:
final_ratings.shape

(59850, 8)

Pivot Table

Given a dataset, pivot tables are a powerful tool to analyze it, answer questions, and tell stories. Pivoting is not limited to analysis in Excel; it is a general technique for transforming data, often into a more human-interpretable format.

A pivot table can answer questions such as:

-- How many of each type of thing did we make?
-- What was the average length for each type of thing?
-- What was the fastest and slowest speed for each type of thing?

The powerful part is that the answers can be presented in multiple formats depending on the intended audience's preference. Instead of one row per type of thing, one small tweak displays one row per measurement with a column for each type in the input dataset. In this project, the pivot table has one row per book title and one column per user, with the rating as the cell value.

Missing Data

As with any aggregation, missing data must be dealt with. In a sales dataset, for example, a month recorded with 0 sales most probably means that data wasn’t collected for that month, not that there were actually zero sales. There’s also the scenario where data is missing entirely and the “cell” is blank or contains a null value. Whatever the program, you need to accept (and be aware of) its default behaviour for these cases or specify what to do. In our scenario, a user who has not rated a book produces a NaN cell, and we replace these NaN values with 0.0.
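
The effect of the default behaviour versus an explicit fill can be seen on a small made-up table. The data is invented; `fill_value` is a standard `pivot_table` parameter, shown here as an alternative to calling `fillna` afterwards.

```python
import pandas as pd

# Toy ratings: user 2 has not rated "Book B".
toy = pd.DataFrame({
    "user_id": [1, 2, 1],
    "title": ["Book A", "Book A", "Book B"],
    "rating": [8, 6, 10],
})

# Default: combinations with no rating become NaN.
pivot_nan = toy.pivot_table(index="title", columns="user_id", values="rating")

# Explicit choice: treat a missing rating as 0.
pivot_zero = toy.pivot_table(index="title", columns="user_id",
                             values="rating", fill_value=0)

print(pivot_nan)
print(pivot_zero)
```

Note that in this dataset a rating of 0 also denotes an implicit rating, so after filling, "not rated" and "implicitly rated" can no longer be told apart.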

In [39]:
book_pivot = final_ratings.pivot_table(columns='user_id', index='title', values='rating')

In [40]:
book_pivot

title \ user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,,,,,,,,,,...,,,,,,0.0,,,,
1st to Die: A Novel,,,,,,,,,,,...,,,,,,,,,,
2nd Chance,,10.0,,,,,,,,,...,,,,0.0,,,,,0.0,
4 Blondes,,,,,,,,,,0.0,...,,,,,,,,,,
84 Charing Cross Road,,,,,,,,,,,...,,,,,,10.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,,,,7.0,,,,,7.0,,...,,,,,,0.0,,,,
You Belong To Me,,,,,,,,,,,...,,,,,,,,,,
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,,,,,0.0,,,,,0.0,...,,,,,,0.0,,,,
Zoya,,,,,,,,,,,...,,,,,,,,,,


In [41]:
book_pivot.shape

(742, 888)

In [42]:
# books = 742
# users = 888

In [43]:
book_pivot.fillna(0, inplace=True)

In [44]:
book_pivot

title \ user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84 Charing Cross Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Cosine Similarity

Cosine similarity is a metric that helps determine how similar data objects are irrespective of their size. It is often used to measure the similarity between two sentences; here we use it to compare books. In cosine similarity, each data object in a dataset is treated as a vector. The formula for the cosine similarity between two vectors is:

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
where,

-- x . y = dot product of the vectors ‘x’ and ‘y’.
-- ||x|| and ||y|| = lengths (norms) of the two vectors ‘x’ and ‘y’.
-- ||x|| * ||y|| = ordinary product of the two lengths (a scalar, not a cross product).

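The angle between the two vectors is ‘θ’, and the cosine similarity is cos θ.
If θ = 0°, the ‘x’ and ‘y’ vectors point in the same direction and the similarity is 1, so they are maximally similar.
If θ = 90°, the ‘x’ and ‘y’ vectors are orthogonal and the similarity is 0, so they are dissimilar.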

The cosine similarity is beneficial because even if two similar data objects are far apart by Euclidean distance because of their size, they can still have a small angle between them. The smaller the angle, the higher the similarity.

When plotted on a multi-dimensional space, the cosine similarity captures the orientation (the angle) of the data objects and not the magnitude.
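
The point that cosine similarity depends on orientation and not on magnitude can be checked with a few lines of NumPy. This is a minimal sketch: the helper `cosine` and the example vectors are made up for illustration, and the helper assumes neither vector is all zeros.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity of two non-zero vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Dot product divided by the product of the two lengths.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = [1, 2, 3]
b = [10, 20, 30]  # same direction as a, ten times longer
c = [3, 0, -1]    # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

print(cosine(a, b))  # close to 1: same orientation despite different length
print(cosine(a, c))  # close to 0: orthogonal vectors
```

By Euclidean distance `a` and `b` are far apart, yet their cosine similarity is essentially 1 because they point in the same direction.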

In [45]:
from sklearn.metrics.pairwise import cosine_similarity

In [46]:
sim_score = cosine_similarity(book_pivot)

In [47]:
sim_score.shape

(742, 742)

In [48]:
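def recommend(book_name):
    # Row position of the requested title in the pivot table
    # (raises IndexError if the title is not in book_pivot)
    index = np.where(book_pivot.index==book_name)[0][0]
    # Sort all books by similarity to the requested one; position 0 is
    # normally the book itself, so keep positions 1 to 4
    similar_items = sorted(list(enumerate(sim_score[index])),key=lambda x:x[1],reverse=True)[1:5]

    data = []
    for i in similar_items:
        item = []
        # Look up title and author of the similar book, one row per title
        temp_df = books[books['title'] == book_pivot.index[i[0]]].drop_duplicates('title')
        item.extend(list(temp_df['title'].values))
        item.extend(list(temp_df['author'].values))

        data.append(item)

    return data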

In [49]:
recommend('Animal Farm')

[['1984', 'George Orwell'],
 ['Angus, Thongs and Full-Frontal Snogging: Confessions of Georgia Nicolson',
  'Louise Rennison'],
 ['Midnight', 'Dean R. Koontz'],
 ['Second Nature', 'Alice Hoffman']]
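
The lookup in `recommend` fails with an `IndexError` when the requested title is not in the pivot table. A guarded variant can be sketched on toy data as follows. Everything here is hypothetical: the titles, the rating matrix, and the name `recommend_safe` are invented, and the similarity matrix is computed with plain NumPy instead of scikit-learn to keep the sketch self-contained.

```python
import numpy as np

titles = ["Book A", "Book B", "Book C"]  # hypothetical titles
# Rows are books, columns are users (made-up ratings, no all-zero rows).
ratings = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 0.0, 3.0],
    [0.0, 5.0, 0.0],
])

# Cosine similarity between all pairs of rows: normalise each row to unit
# length, then take dot products.
unit = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
sim = unit @ unit.T

def recommend_safe(book_name, top_n=2):
    """Return up to top_n most similar titles, or [] for an unknown title."""
    if book_name not in titles:
        return []
    idx = titles.index(book_name)
    order = np.argsort(-sim[idx])  # most similar first
    # Skip the queried book by index instead of assuming it sorts first.
    return [titles[i] for i in order if i != idx][:top_n]

print(recommend_safe("Book A"))
print(recommend_safe("Not In Catalogue"))
```

Excluding the queried book by its index avoids relying on it being first in the sorted list, which may not hold when several books tie at the top similarity.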

Conclusion

In this project, we presented a recommendation system based on the collaborative filtering method. The main goal was speed of recommendation, i.e., to create a system that can give good-quality recommendations to its users without requiring a long registration history, profile information, browsing history, etc. The sample query above shows that the method can return relevant recommendations, for example 1984 for Animal Farm.

The project can be improved by creating a front-end interface in which the user enters a book and gets recommendations. More information could also be shown to the user, such as the year of publication, reviews by other users, and more books by the same author.

The proposed work can be applied to other domains to suggest items such as movies, music, and other products.

References

1. https://en.wikipedia.org/wiki/Recommender_system
2. https://www.researchgate.net/publication/300412849_Online_book_recommendation_system
3. https://www.geeksforgeeks.org/cosine-similarity/
4. https://www.geeksforgeeks.org/collaborative-filtering-ml/
5. https://www.youtube.com/watch?v=sf93xpq8vaA



Connect: 

LinkedIn: https://www.linkedin.com/in/mananyaugadhi/
GitHub Repo: https://github.com/Mananya07/Book-Recommendation-System