# Book Recommendation System Using Collaborative Filtering

# Introduction

In this notebook, I build a book recommendation system that lists the Top 50 books by popularity and also recommends books using collaborative filtering. The goal is to analyze user rating data and suggest books that align with individual preferences, improving the user experience on book platforms. The project covers data exploration, preprocessing, fitting several unsupervised machine learning models, and finally writing a book recommendation function.


### Import Data

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

/kaggle/input/Ratings.csv
/kaggle/input/Users.csv
/kaggle/input/classicRec.png
/kaggle/input/Books.csv
/kaggle/input/DeepRec.png
/kaggle/input/recsys_taxonomy2.png


### Importing Necessary Libraries

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

### Loading the Datasets into pandas DataFrames

In [3]:
books = pd.read_csv("/kaggle/input/Books.csv")
users = pd.read_csv("/kaggle/input/Users.csv")
ratings = pd.read_csv("/kaggle/input/Ratings.csv")
print('Books:',books.shape)
print('Users:',users.shape)
print('Ratings:',ratings.shape)

Books: (271360, 8)
Users: (278858, 3)
Ratings: (1149780, 3)


# Data Exploration & Preprocessing

## Books

In [4]:
print('Books:',books.shape)

Books: (271360, 8)


In [5]:
books.sample(5)

,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
241150,0684196301,OUT OF CONTROL GLOBAL TURMOIL ON THE EVE OF TH...,Zbigniew Brzezinski,1993,Scribner,http://images.amazon.com/images/P/0684196301.0...,http://images.amazon.com/images/P/0684196301.0...,http://images.amazon.com/images/P/0684196301.0...
231818,0671671685,Talk That Talk : An Anthology of African-Ameri...,Linda Goss,1989,Touchstone,http://images.amazon.com/images/P/0671671685.0...,http://images.amazon.com/images/P/0671671685.0...,http://images.amazon.com/images/P/0671671685.0...
96192,0446516910,A Love Divine,Alexandra Ripley,1996,Warner Books,http://images.amazon.com/images/P/0446516910.0...,http://images.amazon.com/images/P/0446516910.0...,http://images.amazon.com/images/P/0446516910.0...
158333,0394744128,A Practical Handbook for the Actor,Melissa Bruder,1986,Vintage Books USA,http://images.amazon.com/images/P/0394744128.0...,http://images.amazon.com/images/P/0394744128.0...,http://images.amazon.com/images/P/0394744128.0...
153249,0893751928,L. Frank Baum's Dorothy and the Wizard (Wizard...,Corinne J. Naden,1980,Troll Communications Llc,http://images.amazon.com/images/P/0893751928.0...,http://images.amazon.com/images/P/0893751928.0...,http://images.amazon.com/images/P/0893751928.0...


In [6]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

In [7]:
books.rename(columns={'Book-Title':'Title','Book-Author':'Author','Year-Of-Publication':'Publication_Year'},inplace=True)

In [8]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   ISBN              271360 non-null  object
 1   Title             271360 non-null  object
 2   Author            271358 non-null  object
 3   Publication_Year  271360 non-null  object
 4   Publisher         271358 non-null  object
 5   Image-URL-S       271360 non-null  object
 6   Image-URL-M       271360 non-null  object
 7   Image-URL-L       271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


In [9]:
books.isnull().sum()

ISBN                0
Title               0
Author              2
Publication_Year    0
Publisher           2
Image-URL-S         0
Image-URL-M         0
Image-URL-L         3
dtype: int64

In [10]:
books.duplicated().sum()

0

In [11]:
books.shape

(271360, 8)

In [12]:
books['Image-URL-S'] = books['Image-URL-S'].str.replace('http://', 'https://')
books['Image-URL-M'] = books['Image-URL-M'].str.replace('http://', 'https://')
books['Image-URL-L'] = books['Image-URL-L'].str.replace('http://', 'https://')

## Users

In [13]:
print('Users:',users.shape)

Users: (278858, 3)


In [14]:
users.sample(5)

,User-ID,Location,Age
227662,227663,"bloomer, wisconsin, usa",51.0
252341,252342,"lakeland, florida, usa",28.0
191039,191040,"santiago, santiago, chile",34.0
57217,57218,"glen ridge, new jersey, usa",34.0
211819,211820,"nanjing, jiangsu,nanjing, china",


In [15]:
users.rename(columns={'User-ID':'User_ID'},inplace=True)

In [16]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   User_ID   278858 non-null  int64  
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [17]:
users.isnull().sum()

User_ID          0
Location         0
Age         110762
dtype: int64

In [18]:
users.duplicated().sum()

0

## Ratings

In [19]:
print('Ratings:',ratings.shape)

Ratings: (1149780, 3)


In [20]:
ratings.sample(5)

,User-ID,ISBN,Book-Rating
264284,60809,2020564777,7
279992,66483,345833422,0
41422,10502,373264216,0
435007,104123,515132624,0
205364,47090,1562314416,0


In [21]:
ratings.rename(columns={'User-ID':'User_ID','Book-Rating':'Rating'},inplace=True)

In [22]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype 
---  ------   --------------    ----- 
 0   User_ID  1149780 non-null  int64 
 1   ISBN     1149780 non-null  object
 2   Rating   1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [23]:
ratings.duplicated().sum()

0

# Top 50 Books based on Popularity

Create the working dataset by merging **Ratings** and **Books** on `ISBN`.

In [24]:
ratings_with_books = ratings.merge(books,on='ISBN')
ratings_with_books.shape

(1031136, 10)
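
The merged frame has fewer rows than `ratings` (1,031,136 versus 1,149,780). `merge` performs an inner join by default, so this is presumably because ratings whose ISBN does not appear in `books` are dropped. The sketch below reproduces the effect on toy data; the frames `toy_ratings` and `toy_books` and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical values)
toy_ratings = pd.DataFrame({
    "User_ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "Z"],   # "Z" has no matching book
    "Rating": [8, 0, 5, 7],
})
toy_books = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "Title": ["Book A", "Book B", "Book C"],
})

# An inner merge (the default) keeps only ratings whose ISBN exists in toy_books
inner = toy_ratings.merge(toy_books, on="ISBN")

# A left merge with indicator=True shows which ratings would be dropped
checked = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(toy_ratings), "->", len(inner))
print(unmatched[["User_ID", "ISBN", "Rating"]])
```

Running the same `indicator=True` check on the real frames would show how many ratings are lost and which ISBNs are involved.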

In [25]:
num = ratings_with_books.groupby('Title')['Rating'].count().reset_index().rename(columns={'Rating':'Number_of_Ratings'})
num

,Title,Number_of_Ratings
0,A Light in the Storm: The Civil War Diary of ...,4
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1
4,Beyond IBM: Leadership Marketing and Finance ...,1
...,...,...
241066,Ã?Â?lpiraten.,2
241067,Ã?Â?rger mit Produkt X. Roman.,4
241068,Ã?Â?sterlich leben.,1
241069,Ã?Â?stlich der Berge.,3


In [26]:
avg = ratings_with_books.groupby('Title')['Rating'].mean().reset_index().rename(columns={'Rating':'Average_Rating'})
avg

,Title,Average_Rating
0,A Light in the Storm: The Civil War Diary of ...,2.250000
1,Always Have Popsicles,0.000000
2,Apple Magic (The Collector's series),0.000000
3,"Ask Lily (Young Women of Faith: Lily Series, ...",8.000000
4,Beyond IBM: Leadership Marketing and Finance ...,0.000000
...,...,...
241066,Ã?Â?lpiraten.,0.000000
241067,Ã?Â?rger mit Produkt X. Roman.,5.250000
241068,Ã?Â?sterlich leben.,7.000000
241069,Ã?Â?stlich der Berge.,2.666667


In [27]:
Popularity_df = num.merge(avg,on='Title')
Popularity_df.sample(5)

,Title,Number_of_Ratings,Average_Rating
2110,"A Caress of Twilight (Meredith Gentry, 2)",1,8.0
196526,The Mad King: The Life and Times of Ludwig II ...,1,0.0
28119,"Brit-Think, Ameri-Think",2,0.0
23033,Beware of the Dog: A Cliff Hardy Novel,1,0.0
102765,La lirica: [romanzo] (Collana La salamandra),1,0.0


In [28]:
Popularity_df = Popularity_df[Popularity_df['Number_of_Ratings']>250].sort_values('Average_Rating',ascending=False).head(50)
Popularity_df.head(5)

,Title,Number_of_Ratings,Average_Rating
80434,Harry Potter and the Prisoner of Azkaban (Book 3),428,5.852804
80422,Harry Potter and the Goblet of Fire (Book 4),387,5.824289
80441,Harry Potter and the Sorcerer's Stone (Book 1),278,5.73741
80426,Harry Potter and the Order of the Phoenix (Boo...,347,5.501441
80414,Harry Potter and the Chamber of Secrets (Book 2),556,5.183453
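
The count, the mean, and the threshold filter can also be computed in one pass with named aggregation. The sketch below uses toy data, and the threshold `min_ratings` is a hypothetical value chosen to suit it. Note that a rating of 0 is included in the mean, which pulls the average down in the same way as the many 0 ratings visible in the sample above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["A", "A", "A", "B", "B", "C"],
    "Rating": [10, 8, 0, 9, 9, 10],
})

# Count and mean in a single groupby pass using named aggregation
popularity = (
    toy.groupby("Title")["Rating"]
       .agg(Number_of_Ratings="count", Average_Rating="mean")
       .reset_index()
)

min_ratings = 1  # hypothetical threshold for the toy data (the notebook uses 250)
top = (
    popularity[popularity["Number_of_Ratings"] > min_ratings]
    .sort_values("Average_Rating", ascending=False)
)
print(top)
```

Title "C" has the highest single rating but is filtered out because it has only one rating, which is the purpose of the count threshold.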


In [29]:
books1 = books.drop_duplicates('Title')

In [30]:
df = Popularity_df.merge(books1,on='Title')

In [31]:
df.shape

(50, 10)

In [32]:
Popularity_Final = df[['ISBN','Title','Author','Publication_Year','Publisher','Image-URL-L','Number_of_Ratings', 'Average_Rating']]

### Top 50 Popular Books

In [33]:
Popularity_Final

,ISBN,Title,Author,Publication_Year,Publisher,Image-URL-L,Number_of_Ratings,Average_Rating
0,0439136350,Harry Potter and the Prisoner of Azkaban (Book 3),J. K. Rowling,1999,Scholastic,https://images.amazon.com/images/P/0439136350....,428,5.852804
1,0439139597,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2000,Scholastic,https://images.amazon.com/images/P/0439139597....,387,5.824289
2,0590353403,Harry Potter and the Sorcerer's Stone (Book 1),J. K. Rowling,1998,Scholastic,https://images.amazon.com/images/P/0590353403....,278,5.73741
3,043935806X,Harry Potter and the Order of the Phoenix (Boo...,J. K. Rowling,2003,Scholastic,https://images.amazon.com/images/P/043935806X....,347,5.501441
4,0439064872,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,2000,Scholastic,https://images.amazon.com/images/P/0439064872....,556,5.183453
5,0345339681,The Hobbit : The Enchanting Prelude to The Lor...,J.R.R. TOLKIEN,1986,Del Rey,https://images.amazon.com/images/P/0345339681....,281,5.007117
6,0345339703,The Fellowship of the Ring (The Lord of the Ri...,J.R.R. TOLKIEN,1986,Del Rey,https://images.amazon.com/images/P/0345339703....,368,4.94837
7,059035342X,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books,https://images.amazon.com/images/P/059035342X....,575,4.895652
8,0345339711,"The Two Towers (The Lord of the Rings, Part 2)",J.R.R. TOLKIEN,1986,Del Rey,https://images.amazon.com/images/P/0345339711....,260,4.880769
9,0446310786,To Kill a Mockingbird,Harper Lee,1988,Little Brown &amp; Company,https://images.amazon.com/images/P/0446310786....,510,4.7


# Collaborative Filtering Recommendation system

## Creating Dataset for ML algorithm

* aGoodUser is a user with more than 200 ratings in the merged dataset.
* aGoodBook is a book (title) with more than 50 ratings from those users.

In [34]:
aGoodBook = 50
aGoodUser = 200

In [35]:
x = ratings_with_books.groupby('User_ID')['Rating'].count()>aGoodUser
good_users_index = x[x].index
filtered_ratings = ratings_with_books[ratings_with_books['User_ID'].isin(good_users_index)]
filtered_ratings.shape

(474007, 10)

In [36]:
y = filtered_ratings.groupby('Title')['Rating'].count()>aGoodBook
good_book_index = y[y].index
filtered_ratings_books = filtered_ratings[filtered_ratings['Title'].isin(good_book_index)]
filtered_ratings_books.shape

(57236, 10)
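
The two filtering steps can be sketched on toy data as follows. The thresholds `min_user_ratings` and `min_book_ratings` are hypothetical values scaled down for the example; the comparison is strict (`>`), as in the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 2, 2, 3],
    "Title":   ["A", "B", "C", "A", "B", "D", "A"],
    "Rating":  [5, 0, 7, 8, 9, 0, 10],
})
min_user_ratings = 2  # hypothetical; the notebook uses aGoodUser = 200
min_book_ratings = 1  # hypothetical; the notebook uses aGoodBook = 50

# Step 1: keep users with more than min_user_ratings ratings
user_counts = toy.groupby("User_ID")["Rating"].count()
active_users = user_counts[user_counts > min_user_ratings].index
by_user = toy[toy["User_ID"].isin(active_users)]

# Step 2: among those rows, keep titles with more than min_book_ratings ratings
title_counts = by_user.groupby("Title")["Rating"].count()
popular_titles = title_counts[title_counts > min_book_ratings].index
filtered = by_user[by_user["Title"].isin(popular_titles)]
print(filtered)
```

Because the title counts are computed after the user filter, a title only survives if enough of the retained users rated it.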

In [37]:
pt = filtered_ratings_books.pivot_table(index='Title',columns='User_ID',values='Rating')

In [38]:
pt.fillna(0,inplace=True)

In [39]:
pt.shape

(679, 810)

In [40]:
pt.head()

Title / User_ID,254,2276,2766,2977,3363,4017,4385,6251,6323,6543,...,271705,273979,274004,274061,274301,274308,275970,277427,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Bend in the Road,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
final_dataset = pt.copy()

In [42]:
final_dataset_sparse = csr_matrix(final_dataset)
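
The pivot, fill, and sparse-conversion steps can be sketched on toy data as below. Two details are worth noting: `pivot_table` averages duplicate (Title, User_ID) pairs by default, and after `fillna(0)` an unrated book is indistinguishable from a book rated 0. The frame `toy` and its values are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

toy = pd.DataFrame({
    "Title":   ["A", "A", "B", "C", "C"],
    "User_ID": [1, 2, 2, 1, 1],      # user 1 rated title "C" twice
    "Rating":  [9, 7, 8, 6, 10],
})

# pivot_table averages duplicate (Title, User_ID) pairs by default (aggfunc="mean")
toy_pt = toy.pivot_table(index="Title", columns="User_ID", values="Rating").fillna(0)
toy_sparse = csr_matrix(toy_pt.values)

# Fraction of cells that hold a non-zero value
density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])
print(toy_pt)
print(f"density: {density:.2f}")
```

The same density calculation on `final_dataset_sparse` would show how sparse the 679 x 810 matrix is, which is the reason for storing it in CSR form.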

## Algorithm Development

### Nearest Neighbors

In [43]:
nn_model = NearestNeighbors(metric='cosine',algorithm='brute')
nn_model.fit(final_dataset_sparse)

### KMeans

In [44]:
kmeans_model = KMeans(n_clusters=3,init="k-means++",random_state=12)
km = kmeans_model.fit(final_dataset)
km_pred = km.predict(final_dataset)

silhouette_score(final_dataset,km_pred)

0.1173160456070885
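
Silhouette values range from -1 to 1, and a value near 0 such as the one above suggests overlapping clusters. One way to check whether a different number of clusters fits better is to scan several values of `n_clusters`. The sketch below does this on synthetic data; the array `X` is a hypothetical stand-in for `final_dataset`, and the range of k is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(12)
# Synthetic stand-in for the item-by-user matrix: three loose groups of rows
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 5)) for c in (0.0, 3.0, 6.0)])

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=12)
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```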

### GMM

In [45]:
gm_model = GaussianMixture(n_components=3,random_state=12)
gm = gm_model.fit(final_dataset)
# Predict with the fitted mixture model (gm), not the KMeans model (km)
gm_pred = gm.predict(final_dataset)

silhouette_score(final_dataset,gm_pred)
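
A Gaussian mixture provides both hard labels (`predict`) and soft memberships (`predict_proba`), and the silhouette score should be computed from the mixture's own labels. The sketch below shows this on synthetic data; the array `X` is hypothetical, and `covariance_type="diag"` is an assumption made here to keep the number of parameters small, not a setting taken from the notebook.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(12)
# Synthetic stand-in for the item-by-user matrix: three loose groups of rows
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 5)) for c in (0.0, 3.0, 6.0)])

# covariance_type="diag" is an assumption for this sketch
gm = GaussianMixture(n_components=3, covariance_type="diag", random_state=12).fit(X)

gm_labels = gm.predict(X)        # hard cluster assignment from the mixture itself
gm_probs = gm.predict_proba(X)   # soft assignment: one probability per component

print("silhouette:", silhouette_score(X, gm_labels))
print("first row membership:", gm_probs[0].round(3))
```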

# Making Recommendation Function

In [46]:
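def recommend_nn(book_name):
    # Look up the rating vector for the requested title (a one-row DataFrame)
    query = final_dataset[final_dataset.index == book_name]
    # Ask for 6 neighbours, since the closest match is normally the book itself
    dist, sugg = nn_model.kneighbors(query, n_neighbors=6)
    print('Book Recommendation for',book_name,'are:')
    # Skip position 0 (the query book) and print the remaining 5 titles
    for i in range(1, len(sugg[0])):
        print(i, final_dataset.index[sugg[0][i]])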

In [47]:
recommend_nn('Message in a Bottle')

Book Recommendation for Message in a Bottle are:
1 Nights in Rodanthe
2 The Mulberry Tree
3 A Walk to Remember
4 River's End
5 Nightmares &amp; Dreamscapes
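
A variant of the recommendation function that returns its results, reports the cosine distance, and tolerates unknown titles could look like the sketch below. It runs on a toy matrix; `toy_pt`, `toy_model`, and `recommend_titles` are hypothetical names introduced for illustration, not part of the notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy item-by-user matrix (hypothetical titles and ratings)
toy_pt = pd.DataFrame(
    [[9, 8, 0, 0], [8, 9, 0, 1], [0, 0, 7, 9], [0, 1, 8, 8]],
    index=["Book A", "Book B", "Book C", "Book D"],
    columns=[101, 102, 103, 104],
    dtype=float,
)
toy_sparse = csr_matrix(toy_pt.values)
toy_model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_sparse)

def recommend_titles(book_name, n=2):
    """Return up to n (title, cosine distance) pairs, or [] for an unknown title."""
    if book_name not in toy_pt.index:
        return []
    row = toy_pt.index.get_loc(book_name)
    # One extra neighbour is requested, since the query row matches itself
    k = min(n + 1, len(toy_pt))
    dist, idx = toy_model.kneighbors(toy_sparse[row], n_neighbors=k)
    # Drop the query row explicitly instead of assuming it comes first
    pairs = [(toy_pt.index[i], float(d)) for d, i in zip(dist[0], idx[0]) if i != row]
    return pairs[:n]

print(recommend_titles("Book A"))
print(recommend_titles("Not in the matrix"))
```

Returning a list rather than printing makes the function easier to test and to reuse, for example in a web front end.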


# Notes / Scope of Improvement / Future Work:

* Is scaling of the ratings required?
* Convert Publication_Year from object to a numeric data type
* Model evaluation
* More exploratory data analysis (EDA)
* Model deployment
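
For the Publication_Year item, one possible approach is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into missing values instead of raising. The sketch below uses a hypothetical sample column; treating year 0 as missing is an assumption about how unknown years are encoded, not something verified in this notebook.

```python
import pandas as pd

# Hypothetical sample: a column read as strings, with one stray non-numeric value
toy_books = pd.DataFrame(
    {"Publication_Year": ["1999", "2003", "0", "Unknown Publisher"]}
)

# errors="coerce" turns anything unparseable into NaN instead of raising
years = pd.to_numeric(toy_books["Publication_Year"], errors="coerce")

# Treat year 0 as missing too (an assumption about how missing years are encoded)
years = years.where(years > 0)

# Nullable integer dtype keeps whole-number years alongside missing values
toy_books["Publication_Year"] = years.astype("Int64")
print(toy_books)
```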