# Collaborative Filtering on Book Recommender System

# Introduction

A recommender system, also known as a recommendation system, is a type of information filtering system that aims to predict the "rating" or "preference" a user would give to an item. Such systems are mainly applied in commercial settings. (Reference: Wikipedia) To suggest products and services to customers, machine learning-based recommendation systems primarily use three types of filtering:

* Content Filtering
* Collaborative Filtering
* Hybrid Filtering

Here, we use collaborative filtering: we look for customers with similar tastes and recommend products based on what those similar customers have chosen. This approach can be very effective, but it takes a lot of time and resources.

# Dataset Collection

The dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. It contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.
Link: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

## Format

The Book-Crossing dataset comprises 3 tables.
    
* BX-Users: 
Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values.

* BX-Books:
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.

* BX-Book-Ratings:
Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.

Let's start by importing the necessary libraries.

In [1]:
import os
import numpy as np
import pandas as pd

# Importing and Exploring Dataset

In [2]:
books = pd.read_csv('BX-Books.csv', sep=";", error_bad_lines=False, encoding='latin-1')



b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8
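The `error_bad_lines` argument used above was deprecated in pandas 1.3 and removed in pandas 2.0, where `on_bad_lines` takes its place. The following is a minimal sketch of an equivalent loader for newer pandas versions, assuming the same semicolon-separated, latin-1 encoded files. The helper name `load_bx_csv` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def load_bx_csv(path):
    """Read a Book-Crossing CSV file, skipping malformed rows (pandas >= 1.3)."""
    return pd.read_csv(
        path,
        sep=";",
        encoding="latin-1",
        on_bad_lines="skip",  # replaces error_bad_lines=False
    )
```

It would be called as `books = load_bx_csv('BX-Books.csv')`. Passing `on_bad_lines="warn"` instead reports each skipped line, similar to the messages shown above.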

In [58]:
books.head(2)

Unnamed: 0,ISBN,title,author,year,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada


In [4]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

In [5]:
books=books[['ISBN','Book-Title','Book-Author','Year-Of-Publication','Publisher']]

In [6]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


In [7]:
books.rename(columns={'Book-Title':'title','Book-Author':'author','Year-Of-Publication':'year','Publisher':'publisher'},inplace=True)

In [8]:
books.head()

Unnamed: 0,ISBN,title,author,year,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


In [9]:
users = pd.read_csv('BX-Users.csv', sep=";", error_bad_lines=False, encoding='latin-1')





In [10]:
users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [11]:
users.rename(columns={'User-ID':'user_id','Location':'location','Age':'age'}, inplace=True)

In [12]:
users.head()

Unnamed: 0,user_id,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [13]:
ratings=pd.read_csv('BX-Book-Ratings.csv', sep=";", error_bad_lines=False, encoding='latin-1')





In [14]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [15]:
ratings.rename(columns={'User-ID':'user_id','Book-Rating':'rating'}, inplace=True)

In [16]:
ratings.head()

Unnamed: 0,user_id,ISBN,rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
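As described in the Format section, a rating of 0 is implicit while 1-10 is an explicit score, so the two kinds can be told apart with a simple filter. The sketch below uses a small made-up table in the same layout as `ratings`; the values are illustrative only.

```python
import pandas as pd

# Toy ratings in the same layout as the renamed ratings table (made-up values)
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "rating": [0, 8, 5, 0],
})

# A rating of 0 is implicit; 1-10 is an explicit score
explicit = toy_ratings[toy_ratings["rating"] > 0]
implicit = toy_ratings[toy_ratings["rating"] == 0]

print(len(explicit), "explicit,", len(implicit), "implicit")
```

The notebook itself keeps both kinds of ratings; the split is shown only to make the distinction concrete.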


In [17]:
print(books.shape)
print(ratings.shape)
print(users.shape)

(271360, 5)
(1149780, 3)
(278858, 3)


In [18]:
ratings['user_id'].value_counts().shape

(105283,)

# Feature Engineering
We keep only the users who have submitted more than 200 ratings.


In [19]:
ratings['user_id'].value_counts()>200

11676      True
198711     True
153662     True
98391      True
35859      True
          ...  
116180    False
116166    False
116154    False
116137    False
276723    False
Name: user_id, Length: 105283, dtype: bool

In [20]:
x=ratings['user_id'].value_counts()>200

In [21]:
x[x]

11676     True
198711    True
153662    True
98391     True
35859     True
          ... 
274808    True
28634     True
59727     True
268622    True
188951    True
Name: user_id, Length: 899, dtype: bool

In [22]:
x[x].shape

(899,)

In [23]:
y=x[x].index

In [24]:
y

Int64Index([ 11676, 198711, 153662,  98391,  35859, 212898, 278418,  76352,
            110973, 235105,
            ...
            260183,  73681,  44296, 155916,   9856, 274808,  28634,  59727,
            268622, 188951],
           dtype='int64', length=899)

In [25]:
ratings = ratings[ratings['user_id'].isin(y)]

In [26]:
ratings.shape

(526356, 3)
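The filtering steps above (count ratings per user, keep the IDs above the threshold, then select the matching rows) can be sketched as a single helper. The function name `keep_active_users` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def keep_active_users(ratings, min_ratings):
    """Keep only rows from users with more than `min_ratings` ratings."""
    counts = ratings["user_id"].value_counts()
    active_ids = counts[counts > min_ratings].index
    return ratings[ratings["user_id"].isin(active_ids)]


# Toy data: user 1 has three ratings, user 2 has one, user 3 has two
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "ISBN": list("abcdef"),
    "rating": [5, 0, 7, 3, 0, 9],
})
filtered = keep_active_users(toy, min_ratings=2)
```

With the real data, `keep_active_users(ratings, 200)` would correspond to the cells above.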

In [27]:
ratings.head()

Unnamed: 0,user_id,ISBN,rating
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0


**Merging the ratings and books dataframes on ISBN so that each rating carries its book title.**

In [28]:
ratings_with_books=ratings.merge(books, on='ISBN')

In [29]:
ratings_with_books

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
...,...,...,...,...,...,...,...
487666,275970,1892145022,0,Here Is New York,E. B. White,1999,Little Bookroom
487667,275970,1931868123,0,There's a Porcupine in My Outhouse: Misadventu...,Mike Tougias,2002,Capital Books (VA)
487668,275970,3411086211,10,Die Biene.,Sybil GrÃ?Â¤fin SchÃ?Â¶nfeldt,1993,"Bibliographisches Institut, Mannheim"
487669,275970,3829021860,0,The Penis Book,Joseph Cohen,1999,Konemann


In [30]:
ratings_with_books.shape 

(487671, 7)

**The number of ratings dropped from 526,356 to 487,671 because the default inner merge discards ratings whose ISBN is not present in the books table.**
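One way to check how many ratings have no matching book is a left merge with `indicator=True`, which adds a `_merge` column marking each row as matched or not. This is a sketch on made-up data, not a step from the original notebook.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3],
    "ISBN": ["111", "222", "999"],  # "999" has no entry in toy_books
    "rating": [8, 0, 6],
})
toy_books = pd.DataFrame({
    "ISBN": ["111", "222"],
    "title": ["Book One", "Book Two"],
})

# A left merge keeps every rating; `_merge` records whether a book matched
checked = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = (checked["_merge"] == "left_only").sum()

# The default inner merge drops the unmatched rating
inner = toy_ratings.merge(toy_books, on="ISBN")
```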

We group ratings_with_books by the title column, select the rating column, and apply count() to get the number of ratings for each book title. The result is a Series indexed by title.

In [31]:
ratings_with_books.groupby('title')['rating'].count()

title
 A Light in the Storm: The Civil War Diary of Amelia Martin, Fenwick Island, Delaware, 1861 (Dear America)    2
 Always Have Popsicles                                                                                        1
 Apple Magic (The Collector's series)                                                                         1
 Beyond IBM: Leadership Marketing and Finance for the 1990s                                                   1
 Clifford Visita El Hospital (Clifford El Gran Perro Colorado)                                                1
                                                                                                             ..
Ã?Â?ber die Pflicht zum Ungehorsam gegen den Staat.                                                           3
Ã?Â?lpiraten.                                                                                                 1
Ã?Â?rger mit Produkt X. Roman.                                                                    

Converting the grouped result into a new DataFrame with reset_index(), so that each row holds a book title and its count of ratings. The count column is still called "rating" at this point and is renamed below.

In [32]:
number_rating = ratings_with_books.groupby('title')['rating'].count().reset_index()

In [33]:
number_rating.rename(columns={'rating':'number of ratings'}, inplace=True)

In [34]:
number_rating

Unnamed: 0,title,number of ratings
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1
...,...,...
160264,Ã?Â?ber die Pflicht zum Ungehorsam gegen den S...,3
160265,Ã?Â?lpiraten.,1
160266,Ã?Â?rger mit Produkt X. Roman.,1
160267,Ã?Â?stlich der Berge.,1


Merging the rating counts back into the original DataFrame.

In [35]:
#joining dataframe with title
final_rating= ratings_with_books.merge(number_rating, on='title')

In [36]:
final_rating.shape

(487671, 8)

In [37]:
final_rating= final_rating[final_rating['number of ratings']>=50]

In [38]:
final_rating.shape 

(61853, 8)

In [39]:
# Reassign instead of inplace=True to avoid modifying a filtered slice in place
final_rating = final_rating.drop_duplicates(['user_id','title'])

In [40]:
final_rating.shape  # books with at least 50 ratings, users with more than 200 ratings

(59850, 8)
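The count, merge, filter, and de-duplicate steps above can also be written without a separate counts table by using `groupby(...).transform('count')`. The sketch below is an alternative formulation on made-up data, with a threshold of 3 standing in for the notebook's 50.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 2, 4],
    "title": ["A", "A", "A", "B", "B", "B", "C"],
    "rating": [5, 0, 7, 3, 9, 9, 10],
})

min_count = 3  # the notebook uses 50

# Count ratings per title and broadcast the count back onto every row
toy["number of ratings"] = toy.groupby("title")["rating"].transform("count")

# Keep popular titles, then keep one row per (user, title) pair
popular = toy[toy["number of ratings"] >= min_count]
popular = popular.drop_duplicates(["user_id", "title"])
```

Title "C" is dropped because it has a single rating, and the repeated rating of "B" by user 2 is collapsed into one row.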

We now have the final_rating dataframe, restricted to books with at least 50 ratings and users with more than 200 ratings. Next, we pivot the table.

In the resulting book_pivot pivot table, the user_id values are the columns, the title values are the rows, and the rating values are the cells. Each row is therefore the rating vector of one book, which is what the nearest-neighbor search compares.

In [41]:
book_pivot=final_rating.pivot_table(columns='user_id', index='title', values='rating')

In [42]:
book_pivot

title / user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,,,,,,,,,,...,,,,,,0.0,,,,
1st to Die: A Novel,,,,,,,,,,,...,,,,,,,,,,
2nd Chance,,10.0,,,,,,,,,...,,,,0.0,,,,,0.0,
4 Blondes,,,,,,,,,,0.0,...,,,,,,,,,,
84 Charing Cross Road,,,,,,,,,,,...,,,,,,10.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,,,,7.0,,,,,7.0,,...,,,,,,0.0,,,,
You Belong To Me,,,,,,,,,,,...,,,,,,,,,,
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,,,,,0.0,,,,,0.0,...,,,,,,0.0,,,,
Zoya,,,,,,,,,,,...,,,,,,,,,,


In [43]:
book_pivot.shape

(742, 888)

In [44]:
book_pivot.fillna(0, inplace=True)

In [45]:
book_pivot

title / user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84 Charing Cross Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
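The pivot and the fill can be combined through the `fill_value` argument of `pivot_table`. The sketch below uses made-up ratings to show the resulting layout. Note that after filling, a missing rating and an implicit rating of 0 look the same.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "title": ["A", "A", "B", "B"],
    "rating": [9, 0, 10, 7],
})

# Rows are titles, columns are users; missing combinations become 0
toy_pivot = toy.pivot_table(
    index="title", columns="user_id", values="rating", fill_value=0
)
print(toy_pivot)
```

Here user 2 gave "A" an implicit 0 while user 3 never rated "A", yet both cells hold 0 in `toy_pivot`.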


**Creating the sparse matrix from the book_pivot DataFrame, which contains the rating data for different users and books.**

In [46]:
# Build a sparse matrix from the pivot table for the nearest-neighbor search
from scipy.sparse import csr_matrix
book_sparse=csr_matrix(book_pivot)

In [47]:
type(book_sparse)

scipy.sparse.csr.csr_matrix
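To see why a sparse format suits this kind of matrix, one can compare the number of stored values with the total number of cells. This is a small sketch on a made-up array; on the real data the same two attributes (`nnz` and `shape`) would be read from `book_sparse`.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [9.0, 0.0, 0.0, 0.0],
    [0.0, 10.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 7.0],
])
sparse = csr_matrix(dense)

# nnz is the number of stored (non-zero) entries
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(f"{sparse.nnz} stored values, density {density:.2%}")
```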

# Nearest Neighbours

The NearestNeighbors class from scikit-learn is an unsupervised learner for neighbor searches. Here it is used to identify the nearest neighbors of a given book, based on how similar its rating vector is to those of the other books in the dataset.

Once fitted on a dataset, the model instance can be queried for the nearest neighbors of a book, and these neighbors serve as recommendations. The algorithm parameter is set to 'brute', which means the model computes the distance from the query to every book in the dataset. Since no metric is specified, the default (Euclidean distance) is used.

In [48]:
from sklearn.neighbors import NearestNeighbors
model=NearestNeighbors(algorithm='brute')

Fitting the model instance on the book_sparse sparse matrix, which contains the rating data for different users and books in a compressed format.

With the brute-force algorithm, fitting essentially stores the data. The distances between the query book and all other books are computed when neighbors are requested, and the sparse matrix representation keeps that computation efficient.

In [49]:
model.fit(book_sparse)

NearestNeighbors(algorithm='brute')
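What the brute-force search does at query time can be reproduced by hand: compute the Euclidean distance from the query row to every row and sort. The sketch below compares that manual calculation with `NearestNeighbors` on a small made-up matrix; it is an illustration, not a description of scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows are books, columns are users (made-up values)
X = np.array([
    [9.0, 0.0, 0.0],
    [8.0, 0.0, 1.0],
    [0.0, 10.0, 0.0],
    [0.0, 9.0, 7.0],
])

query = X[0]

# Manual brute force: Euclidean distance from the query to every row
manual_dist = np.linalg.norm(X - query, axis=1)
manual_order = np.argsort(manual_dist)[:3]

# The same query through scikit-learn with the default metric
nn = NearestNeighbors(algorithm="brute").fit(X)
dist, idx = nn.kneighbors(query.reshape(1, -1), n_neighbors=3)
```

Both approaches return the query row itself first (distance 0), followed by the rows closest to it.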

In [59]:
distances, suggestions=model.kneighbors(book_pivot.iloc[237, :].values.reshape(1,-1), n_neighbors=6)

In [60]:
distances

array([[ 0.        , 68.78953409, 69.5413546 , 72.64296249, 76.83098333,
        77.28518616]])

In [61]:
suggestions

array([[237, 240, 238, 241, 184, 536]], dtype=int64)

In [62]:
for i in range(len(suggestions)):
    print(book_pivot.index[suggestions[i]])

Index(['Harry Potter and the Chamber of Secrets (Book 2)',
       'Harry Potter and the Prisoner of Azkaban (Book 3)',
       'Harry Potter and the Goblet of Fire (Book 4)',
       'Harry Potter and the Sorcerer's Stone (Book 1)', 'Exclusive',
       'The Cradle Will Fall'],
      dtype='object', name='title')
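The distances above are Euclidean, so they depend on the magnitude of the ratings as well as on the rating pattern. As a point of comparison only, and not the method used in this notebook, the sketch below shows how `metric='cosine'` can rank neighbors differently on a made-up matrix.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows are books, columns are users (made-up values)
X = np.array([
    [5.0, 0.0, 0.0, 4.0],   # book 0
    [10.0, 0.0, 0.0, 8.0],  # book 1: same pattern as book 0, larger scale
    [4.0, 1.0, 0.0, 4.0],   # book 2: close to book 0 in raw values
    [0.0, 5.0, 5.0, 0.0],   # book 3
])

euclid = NearestNeighbors(algorithm="brute").fit(X)
cosine = NearestNeighbors(algorithm="brute", metric="cosine").fit(X)

# Two nearest neighbors of book 0 under each metric (book 0 itself is one of them)
query = X[0].reshape(1, -1)
_, euclid_idx = euclid.kneighbors(query, n_neighbors=2)
_, cosine_idx = cosine.kneighbors(query, n_neighbors=2)
```

Besides book 0 itself, the Euclidean model picks book 2, while the cosine model picks book 1, whose ratings point in the same direction as book 0's.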


In [54]:
book_pivot.index[54]

'Animal Farm'

In [55]:
np.where(book_pivot.index=='Animal Farm')[0][0]

54

# Recommend_book Function
The function recommend_book takes a book name as input and returns a list of 5 book recommendations based on the k-nearest neighbors of the input book in the book_pivot dataframe.

The function first finds the index of the input book in the book_pivot dataframe using the np.where function and assigns it to the variable book_id.

It then calls the kneighbors() method of the model instance with the feature vector of the input book to find the distances and indices of the 6 nearest neighbors in the book_pivot dataframe.

The function then loops over the 6 nearest neighbors and prints the names of the top 5 recommendations, excluding the input book itself.

In [74]:
def recommend_book(book_name):
    # Look up the row position of the book in the pivot table
    book_id = np.where(book_pivot.index==book_name)[0][0]
    # Ask for 6 neighbors: the book itself plus 5 recommendations
    distances, suggestions=model.kneighbors(book_pivot.iloc[book_id, :].values.reshape(1,-1), n_neighbors=6)
    
    print("The suggestions for", book_name,"are :")
    # suggestions has shape (1, 6), so iterate over its first row
    for idx in suggestions[0]:
        # Skip the input book, which is returned as its own nearest neighbor
        if idx != book_id:
            print(book_pivot.index[idx])

In [75]:
recommend_book('Animal Farm')

The suggestions for Animal Farm are :
Exclusive
Jacob Have I Loved
Second Nature
Pleading Guilty
No Safe Place
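For use outside a notebook, it can be handier to return the recommendations than to print them. The following is a sketch under stated assumptions: the name `recommend_titles`, its parameters, and the toy pivot table are hypothetical, and the pivot and fitted model are passed in explicitly instead of being read from globals.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def recommend_titles(pivot, model, book_name, k=5):
    """Return up to k (title, distance) pairs, excluding the query book."""
    book_id = pivot.index.get_loc(book_name)  # raises KeyError if unknown
    query = pivot.iloc[book_id, :].values.reshape(1, -1)
    distances, suggestions = model.kneighbors(query, n_neighbors=k + 1)
    results = []
    for dist, idx in zip(distances[0], suggestions[0]):
        if idx != book_id:  # skip the query book itself
            results.append((pivot.index[idx], float(dist)))
    return results[:k]


# Toy pivot table: rows are titles, columns are users (made-up values)
toy_pivot = pd.DataFrame(
    [[9.0, 0.0, 0.0], [8.0, 0.0, 1.0], [0.0, 10.0, 0.0], [0.0, 9.0, 7.0]],
    index=pd.Index(["A", "B", "C", "D"], name="title"),
)
toy_model = NearestNeighbors(algorithm="brute").fit(toy_pivot.values)
recs = recommend_titles(toy_pivot, toy_model, "A", k=2)
```

Returning the distances alongside the titles makes it possible to judge how close each suggestion is to the query.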


# Conclusion

To sum up, collaborative filtering is a well-established method for building book recommender systems. By examining user behavior and preferences, collaborative filtering algorithms can suggest books that a user is likely to enjoy. These systems can also help readers find new books that they might not otherwise have come across, making for a fuller and more enjoyable reading experience. As the amount of available data grows, such systems are likely to become more precise, offering readers an ever-expanding universe of books to explore. Overall, collaborative filtering has the potential to change the way we discover and consume literature, making it easier to find the right book for any occasion.