Book Recommendation System

The purpose of this project is to develop a machine learning-based book recommendation system that provides personalized book suggestions to users based on their preferences. By analyzing user ratings and book attributes, the system aims to predict which books a user is most likely to enjoy, enhancing their overall experience.

To identify relevant book recommendations, the system uses cosine similarity, which measures how similar two books are by comparing their vectors of user ratings. Books whose rating patterns most closely match those of a book the user has shown interest in are recommended, so that the suggestions stay close to the user’s taste.
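As a minimal sketch of the idea, the snippet below computes cosine similarity by hand with NumPy. The function name `cosine_sim` and the three small rating vectors are made up for illustration; they are not part of the project's data.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two rating vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # an all-zero vector has no direction, so treat its similarity as 0
    return 0.0 if denom == 0 else float(u @ v / denom)

# made-up ratings of three books by the same four users (0 = not rated)
book_a = [9, 0, 8, 0]
book_b = [8, 0, 9, 0]
book_c = [0, 7, 0, 6]

sim_ab = cosine_sim(book_a, book_b)  # rated highly by the same users
sim_ac = cosine_sim(book_a, book_c)  # no raters in common
```

Books rated by the same users in a similar way get a score close to 1, while books with no raters in common get 0.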

The required datasets for this project, including book details, user ratings, and other relevant attributes, are imported in the form of CSV files sourced from the Kaggle website. These datasets provide the foundation for building the recommendation model, which processes and analyzes the data to generate accurate book suggestions.

This system uses collaborative filtering, which assumes that if users A and B both liked book X, and user B also likes book Y, then user A might like book Y as well. Book Y can therefore be recommended to user A.
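The A/B/X/Y reasoning can be sketched on a tiny made-up table. The dataframe `toy` and the helper `suggest_for` are hypothetical and only illustrate the principle; the notebook itself works with book-to-book similarity rather than this direct lookup.

```python
import pandas as pd

# made-up ratings: rows are users, columns are books (0 = not rated)
toy = pd.DataFrame(
    {"X": [9, 8, 0], "Y": [0, 9, 0], "Z": [0, 0, 7]},
    index=["A", "B", "C"],
)

def suggest_for(user, ratings):
    """Suggest unrated books that were rated by users sharing a book with `user`."""
    liked = ratings.columns[ratings.loc[user] > 0]
    # users who rated at least one of the same books, excluding `user`
    shares_a_book = (ratings[liked] > 0).any(axis=1).to_numpy()
    peers = ratings.index[shares_a_book].drop(user)
    # total rating each book received from those peers
    candidates = ratings.loc[peers].sum()
    # keep books the peers rated and `user` has not
    candidates = candidates[(candidates > 0) & (ratings.loc[user] == 0)]
    return list(candidates.sort_values(ascending=False).index)

suggestions = suggest_for("A", toy)
```

Here user B shares book X with user A and has also rated Y, so Y is the only suggestion for A; book Z is ignored because its only rater, C, has nothing in common with A.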

In [None]:
import numpy as np # python library for array creation and manipulation for data analysis
import pandas as pd # python library for data manipulation and analysis received through CSV files

# load the necessary CSV files as pandas dataframes

books = pd.read_csv('books.csv')
users = pd.read_csv('users.csv')
ratings = pd.read_csv('ratings.csv')


Let's view the first 5 rows of all three dataframes to check that they were loaded properly.

In [4]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [5]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [6]:
users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, new york, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


Let's also check the number of rows and columns in each dataframe.

In [7]:
print(books.shape)
print(ratings.shape)
print(users.shape)

(271360, 8)
(1149780, 3)
(278858, 3)


The dataframe 'books' has 271360 rows and 8 columns.
The dataframe 'ratings' has 1149780 rows and 3 columns.
The dataframe 'users' has 278858 rows and 3 columns.

Let's check the number of missing values in each column of all three dataframes.

In [8]:
books.isnull().sum()

ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64

In [9]:
ratings.isnull().sum()

User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

In [10]:
users.isnull().sum()

User-ID          0
Location         0
Age         110762
dtype: int64

Now let's check the number of duplicate rows in all three dataframes.

In [16]:
print(books.duplicated().sum())

0


In [17]:
print(ratings.duplicated().sum())

0


In [19]:
print(users.duplicated().sum())

0


As we can see, there are no duplicate rows in any of the three dataframes.

In [None]:
ratings_with_name = ratings.merge(books,on='ISBN') # inner-joins dataframes 'ratings' and 'books' on the value of column 'ISBN'
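Because `merge` performs an inner join by default, ratings whose ISBN has no match in 'books' are silently dropped. The sketch below shows this on two made-up dataframes (`toy_ratings` and `toy_books` are hypothetical); the same row-count comparison could be run on the real dataframes to see how many ratings are lost.

```python
import pandas as pd

# made-up data: the rating for ISBN '999' has no matching book
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["111", "222", "999"],
    "Book-Rating": [8, 0, 5],
})
toy_books = pd.DataFrame({
    "ISBN": ["111", "222"],
    "Book-Title": ["Book One", "Book Two"],
})

merged = toy_ratings.merge(toy_books, on="ISBN")  # default is how='inner'
dropped = len(toy_ratings) - len(merged)          # ratings without a matching book
```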

In [None]:
# group dataframe 'ratings_with_name' by book title and count how many times each book was rated;
# 'reset_index' turns the grouped result back into a regular dataframe
num_rating_df = ratings_with_name.groupby('Book-Title').count()['Book-Rating'].reset_index()

num_rating_df.rename(columns={'Book-Rating':'num_ratings'},inplace=True) # rename 'Book-Rating' to 'num_ratings'

In [43]:
# now we do the same thing again but instead of count, we use mean/average as aggregate function
avg_rating_df = ratings_with_name.groupby('Book-Title')['Book-Rating'].mean().reset_index()
avg_rating_df.rename(columns={'Book-Rating':'avg_rating'}, inplace=True)

In [44]:
popular_df = num_rating_df.merge(avg_rating_df,on='Book-Title') # create a dataframe by merging both the dataframes we created on 'Book-Title' column.
# This dataframe consists of book title, number of ratings and average ratings

In [45]:
popular_df = popular_df[popular_df['num_ratings'] >= 250].sort_values('avg_rating', ascending=False).head(50)
# keeps the 50 most loved books: first keep only books with at least 250 ratings,
# then sort those books in descending order of average rating so that the best-rated books come first,
# and finally keep only the first 50 rows in 'popular_df'

In [46]:
popular_df = popular_df.merge(books,on='Book-Title').drop_duplicates('Book-Title')[['Book-Title','Book-Author','Image-URL-M','num_ratings','avg_rating']]
# removes duplicate rows for the same book title and keeps only the necessary columns which are book title, author, image of cover page, number of ratings and average ratings
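The count, mean, filter, and sort steps above can also be written as a single function using named aggregation. This is only an alternative sketch: the name `top_rated`, its parameters, and the small `toy` dataframe are assumptions made for illustration, and the book-details merge is left out.

```python
import pandas as pd

def top_rated(ratings_with_name, min_ratings=250, n=50):
    """Return the n best-rated titles among those with at least min_ratings ratings."""
    stats = (
        ratings_with_name.groupby("Book-Title")["Book-Rating"]
        .agg(num_ratings="count", avg_rating="mean")  # one pass instead of two groupbys
        .reset_index()
    )
    return (
        stats[stats["num_ratings"] >= min_ratings]
        .sort_values("avg_rating", ascending=False)
        .head(n)
    )

# made-up ratings: 'C' has the best average but too few ratings to qualify
toy = pd.DataFrame({
    "Book-Title": ["A", "A", "A", "B", "B", "C"],
    "Book-Rating": [10, 8, 9, 4, 6, 10],
})
top = top_rated(toy, min_ratings=2, n=2)
```

The minimum-ratings threshold is what keeps a title with a single perfect rating from outranking widely read books.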

In [22]:
x = ratings_with_name.groupby('User-ID').count()['Book-Rating'] > 200 # creates a boolean pandas Series indexed by User-ID
# that is True for users who rated more than 200 books and False otherwise

In [23]:
flag = x[x].index # returns the User-IDs from Series 'x' whose flag is True, i.e. users who rated more than 200 books

In [25]:
filtered_rating = ratings_with_name[ratings_with_name['User-ID'].isin(flag)] # filters dataframe 'ratings_with_name'
# so that it keeps only the ratings given by users who rated more than 200 books

In [26]:
y = filtered_rating.groupby('Book-Title').count()['Book-Rating']>=50 # creates a boolean Series 'y' indexed by book title
# that is True for books with 50 or more ratings in 'filtered_rating'

In [27]:
famous_books = y[y].index # extracts the book titles from Series 'y' whose flag is True
# i.e. titles of books that have 50 or more ratings

In [28]:
final_ratings = filtered_rating[filtered_rating['Book-Title'].isin(famous_books)] # returns a filtered form of dataframe
# 'filtered_rating' that keeps only the rows whose book title is a famous book, i.e. one with 50 or more ratings
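The two filtering stages can be sketched as one function on made-up data. The name `filter_active`, its parameters, and the `toy` dataframe are hypothetical; the thresholds are lowered so the effect is visible on seven rows.

```python
import pandas as pd

def filter_active(df, min_user_ratings, min_book_ratings):
    """Keep ratings from active users, then keep only books those users rated often."""
    user_counts = df.groupby("User-ID")["Book-Rating"].count()
    active = user_counts[user_counts > min_user_ratings].index
    kept = df[df["User-ID"].isin(active)]
    # book counts are taken AFTER the user filter, as in the cells above
    book_counts = kept.groupby("Book-Title")["Book-Rating"].count()
    famous = book_counts[book_counts >= min_book_ratings].index
    return kept[kept["Book-Title"].isin(famous)]

# made-up ratings: user 3 rated only one book, books C and D have one rating each
toy = pd.DataFrame({
    "User-ID":     [1, 1, 1, 2, 2, 2, 3],
    "Book-Title":  ["A", "B", "C", "A", "B", "D", "A"],
    "Book-Rating": [5, 6, 7, 8, 9, 4, 10],
})
result = filter_active(toy, min_user_ratings=2, min_book_ratings=2)
```

The order of the two stages matters: a book's count here only includes ratings from active users, so it can be lower than the book's total number of ratings.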

In [30]:
pt = final_ratings.pivot_table(index='Book-Title',columns='User-ID',values='Book-Rating') # creates a table with one row per
# book title and one column per User-ID; the cell at row 'x' and column 'y' is the rating of book 'x' by the user with User-ID 'y'
# if the value of a cell is 'NaN', it means that the user with User-ID 'y' has not rated the book with title 'x'

In [31]:
pt.fillna(0,inplace=True) # modifies table 'pt' so that cells with value 'NaN' are replaced by 0, i.e. a missing rating becomes 0
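The pivot and fill steps can be tried on a three-row example. The dataframe `toy` is made up, and the sketch uses the `fill_value` argument of `pivot_table` as a one-step alternative to calling `fillna` afterwards.

```python
import pandas as pd

# made-up ratings in the same long format as 'final_ratings'
toy = pd.DataFrame({
    "Book-Title": ["1984", "1984", "Zoya"],
    "User-ID": [254, 2276, 254],
    "Book-Rating": [9, 7, 5],
})

# rows become book titles, columns become users; missing pairs are filled with 0
toy_pt = toy.pivot_table(
    index="Book-Title", columns="User-ID", values="Book-Rating", fill_value=0
)
```

Two details are worth keeping in mind. `pivot_table` averages the values (its default `aggfunc` is the mean) when the same user and title appear more than once. And since the ratings preview earlier already contains stored ratings of 0, a filled-in 0 cannot be told apart from a stored 0 after this step.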

In [33]:
print(pt) # checking the table formed

User-ID                                             254     2276    2766    \
Book-Title                                                                   
1984                                                   9.0     0.0     0.0   
1st to Die: A Novel                                    0.0     0.0     0.0   
2nd Chance                                             0.0    10.0     0.0   
4 Blondes                                              0.0     0.0     0.0   
A Bend in the Road                                     0.0     0.0     7.0   
...                                                    ...     ...     ...   
Year of Wonders                                        0.0     0.0     0.0   
You Belong To Me                                       0.0     0.0     0.0   
Zen and the Art of Motorcycle Maintenance: An I...     0.0     0.0     0.0   
Zoya                                                   0.0     0.0     0.0   
\O\" Is for Outlaw"                                    0.0     0

We see that the table formed has 706 rows and 810 columns.

Now that we have all the necessary data for a recommendation system, we can start building one using cosine similarity, as implemented in Python's scikit-learn library.

In [34]:
from sklearn.metrics.pairwise import cosine_similarity # import the cosine similarity algorithm from sklearn library

In [35]:
similarity_scores = cosine_similarity(pt) # create cosine similarity matrix from table 'pt'

In [36]:
similarity_scores.shape # checking out the matrix formed

(706, 706)

We see that the matrix formed has 706 rows and 706 columns, one for each book in 'pt'.

Now let's create the function that recommends books, taking the name of a book as input.
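To make the shape of this matrix less mysterious, here is a NumPy re-derivation of a pairwise cosine similarity matrix on a small made-up table. The helper `pairwise_cosine` is hypothetical and is not the scikit-learn implementation; it only illustrates why the result is square, symmetric, and has 1 on the diagonal.

```python
import numpy as np

def pairwise_cosine(matrix):
    """Cosine similarity between every pair of rows (hypothetical helper)."""
    m = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid dividing by zero for all-zero rows
    unit = m / norms          # scale every row to length 1
    return unit @ unit.T      # dot products of unit rows are cosines

# made-up table: 3 books (rows) rated by 3 users (columns)
toy = np.array([[9, 0, 8],
                [8, 0, 9],
                [0, 7, 0]])
sims = pairwise_cosine(toy)   # shape (3, 3): one row and one column per book
```

Entry `sims[i][j]` is the similarity between book `i` and book `j`, which is why a table with 706 books produces a 706 by 706 matrix regardless of the number of users.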

In [37]:
def recommend(book_name):
    # position of the first occurrence of the given book name in the index of table 'pt'
    # (raises IndexError if the title is not in 'pt')
    index = np.where(pt.index==book_name)[0][0]
    # pair every book position with its similarity to the given book and sort in descending order
    # so that the most similar book comes first; then skip the first entry (normally the given book itself)
    # and keep the next 4
    similar_items = sorted(list(enumerate(similarity_scores[index])),key=lambda x:x[1],reverse=True)[1:5]

    data = [] # list in which the similar books will be stored

    for i in similar_items: # i is a (row position, similarity score) pair
        item = [] # title, author, and cover image URL of one similar book

        # look up the similar book in dataframe 'books', keeping one row per title
        temp_df = books[books['Book-Title'] == pt.index[i[0]]].drop_duplicates('Book-Title')

        # add book title, author name, and image URL to 'item'
        item.extend(list(temp_df['Book-Title'].values))
        item.extend(list(temp_df['Book-Author'].values))
        item.extend(list(temp_df['Image-URL-M'].values))

        data.append(item) # 'data' is a list of 'item' lists

    return data # return the 2D list formed

In [38]:
recommend('1984') # testing the 'recommend' function by checking it with book titled '1984'

[['Animal Farm',
  'George Orwell',
  'http://images.amazon.com/images/P/0451526341.01.MZZZZZZZ.jpg'],
 ["The Handmaid's Tale",
  'Margaret Atwood',
  'http://images.amazon.com/images/P/0449212602.01.MZZZZZZZ.jpg'],
 ['Brave New World',
  'Aldous Huxley',
  'http://images.amazon.com/images/P/0060809833.01.MZZZZZZZ.jpg'],
 ['The Vampire Lestat (Vampire Chronicles, Book II)',
  'ANNE RICE',
  'http://images.amazon.com/images/P/0345313860.01.MZZZZZZZ.jpg']]
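The function above raises an error when the title is not in 'pt'. A more defensive variant could look like the sketch below, which returns an empty list for unknown titles and excludes the input book by position rather than by assuming it sorts first. The name `most_similar`, its parameters, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def most_similar(title, pivot, scores, n=4):
    """Return up to n (title, score) pairs, most similar first; [] if title is unknown."""
    if title not in pivot.index:
        return []
    idx = pivot.index.get_loc(title)          # row position of the input book
    order = np.argsort(scores[idx])[::-1]     # positions sorted by descending similarity
    order = [j for j in order if j != idx][:n]  # drop the input book itself
    return [(pivot.index[j], float(scores[idx][j])) for j in order]

# made-up pivot table and its cosine similarity matrix
toy_pt = pd.DataFrame([[9, 0, 8], [8, 0, 9], [0, 7, 0]],
                      index=["1984", "Animal Farm", "Zoya"])
unit = toy_pt.to_numpy(dtype=float)
unit = unit / np.linalg.norm(unit, axis=1, keepdims=True)
toy_scores = unit @ unit.T

result = most_similar("1984", toy_pt, toy_scores, n=2)
missing = most_similar("Not In Table", toy_pt, toy_scores)
```

Returning the similarity score next to each title also makes it easier to judge how strong a recommendation is.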

In [48]:
# import the pickle module to store dataframe 'popular_df' as a byte stream in a file, so it can be reloaded quickly later
import pickle
with open('popular.pkl','wb') as f: # the 'with' block closes the file once the dump is done
    pickle.dump(popular_df,f)

In [49]:
books.drop_duplicates('Book-Title') # preview 'books' with one row per book title; the result is not assigned, so 'books' itself is unchanged

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
...,...,...,...,...,...,...,...,...
271354,0449906736,Flashpoints: Promise and Peril in a New World,Robin Wright,1993,Ballantine Books,http://images.amazon.com/images/P/0449906736.0...,http://images.amazon.com/images/P/0449906736.0...,http://images.amazon.com/images/P/0449906736.0...
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...


In [50]:
# store pivot table 'pt', dataframe 'books', and the cosine similarity matrix as byte streams
with open('pt.pkl','wb') as f:
    pickle.dump(pt,f)
with open('books.pkl','wb') as f:
    pickle.dump(books,f)
with open('similarity_scores.pkl','wb') as f:
    pickle.dump(similarity_scores,f)
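For completeness, the stored files can be read back with `pickle.load`. The sketch below round-trips a small made-up array through a temporary directory so it can run on its own; the variable names and the temporary path are assumptions, and in practice the file names would be the ones written above.

```python
import os
import pickle
import tempfile

import numpy as np

toy_scores = np.eye(3)  # stand-in for the real similarity matrix
path = os.path.join(tempfile.mkdtemp(), "similarity_scores.pkl")

with open(path, "wb") as f:   # write the object as a byte stream
    pickle.dump(toy_scores, f)

with open(path, "rb") as f:   # read it back into an equal object
    restored = pickle.load(f)
```

Pickle files should only be loaded when they come from a trusted source, because unpickling can execute arbitrary code.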