**Name:** Hrithik Singh

**Project** : Book Recommendation System [Collaborative filter method]

**Business Context:**
During the last few decades, with the rise of YouTube, Amazon, Netflix, and many other such web services, recommender systems have taken up more and more space in our lives. From e-commerce (suggesting articles that could interest buyers) to online advertising (showing users content that matches their preferences), recommender systems are today unavoidable in our daily online journeys.

In a very general way, recommender systems are algorithms aimed at suggesting relevant items to users (items being movies to watch, text to read, products to buy, or anything else depending on the industry). Recommender systems are critical in some industries: when they are efficient they can generate a huge amount of income, and they can also be a way to stand out significantly from competitors. The main objective of this project is to create a book recommendation system for users.

Dataset Description

The Book-Crossing dataset comprises 3 files.

Users:

Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL values.

Books:

Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in the case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon website.

Ratings:

Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.
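The split between explicit and implicit ratings can be illustrated on a tiny made-up table. This is only a sketch: `toy_ratings` and its values are hypothetical and are not taken from the Book-Crossing files.

```python
import pandas as pd

# Hypothetical toy ratings in the same shape as the Ratings file.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "Book-Rating": [0, 8, 10, 0],
})

# A rating of 0 marks an implicit interaction, not a score of zero.
is_explicit = toy_ratings["Book-Rating"] > 0
explicit = toy_ratings[is_explicit]
implicit = toy_ratings[~is_explicit]

print(len(explicit), "explicit ratings,", len(implicit), "implicit ratings")
print("Mean explicit rating:", explicit["Book-Rating"].mean())
```

Keeping the two kinds apart matters whenever ratings are averaged, because an implicit 0 would otherwise pull the mean down.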

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import files
uploaded = files.upload()

Saving Books.csv to Books.csv


In [None]:
books = pd.read_csv('Books.csv' , encoding = 'latin-1' )


In [None]:
books.head(2)
list(books.columns)

#Removing the cover image URL columns ['Image-URL-S','Image-URL-M','Image-URL-L'] as they are of no use for building the recommendation system
books.drop(columns =['Image-URL-S','Image-URL-M','Image-URL-L'] , inplace = True)

#Renaming columns by suitable names
books.rename(columns = {'Book-Title': 'title' , 'Book-Author' : 'author' , 'Year-Of-Publication' : 'year' , 'Publisher' : 'publisher'} , inplace = True)

In [None]:
books.head(1)


Unnamed: 0,ISBN,title,author,year,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press


In [None]:
from google.colab import files
uploaded = files.upload()

Saving Users.csv to Users.csv


In [None]:
users = pd.read_csv('Users.csv' , encoding = 'latin-1')

In [None]:
users.head(2)
#Renaming columns with suitable names
users.rename(columns = {'User-ID' : 'user-id' , 'Location' : 'location' , 'Age' : 'age'} , inplace = True)

In [None]:
users.head(1)


Unnamed: 0,user-id,location,age
0,1,"nyc, new york, usa",


In [None]:
from google.colab import files
uploaded = files.upload()

Saving Ratings.csv to Ratings.csv


In [None]:
rating = pd.read_csv('Ratings.csv' , encoding = 'Latin-1')

In [None]:
rating.head(2)
#Renaming columns with suitable names
rating.rename(columns = {'User-ID' : 'user-id' , 'Book-Rating' : 'ratings'} , inplace = True)

In [None]:
rating.head(1)


Unnamed: 0,user-id,ISBN,ratings
0,276725,034545104X,0


In [None]:
books.shape

(271360, 5)

In [None]:
users.shape

(278858, 3)

In [None]:
rating.shape

(1149780, 3)

**Approach:** Since we are building a book recommendation system, we cannot rely on a person who has read just 1 book. Such a user tells us very little about reading preferences, so we cannot base recommendations on their ratings. Hence, we will keep only those users who have given more than 200 ratings.

On the other side, we cannot just take all the books. Consider a book that no one has read, or that almost no one has rated: it gives the model nothing to compare. Hence, we will keep only those books whose number of ratings is greater than or equal to 50.

In [None]:
#Take only those users who have given more than 200 ratings
x = rating['user-id'].value_counts()>200 #boolean Series: True for users with more than 200 ratings
y = x[x].index #IDs of the users who have given more than 200 ratings
y

Index([ 11676, 198711, 153662,  98391,  35859, 212898, 278418,  76352, 110973,
       235105,
       ...
       260183,  73681,  44296, 155916,   9856, 274808,  28634,  59727, 268622,
       188951],
      dtype='int64', name='user-id', length=899)

In [None]:
rating = rating[rating['user-id'].isin(y)] #keep only the rows whose user-id is one of the selected users in y
rating.shape

(526356, 3)

In [None]:
rating.head()

Unnamed: 0,user-id,ISBN,ratings
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0


In [None]:
#merge ratings and books dataframe
ratings_with_books = rating.merge(books , on = 'ISBN')

In [None]:
ratings_with_books.shape

(487671, 7)

We can see that 38,685 rows are missing (526,356 - 487,671). This is because merge performs an inner join by default: every rating whose ISBN has no matching entry in the books dataframe is dropped.
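To check which ratings have no matching book before they disappear, a left merge with `indicator=True` can be used. The sketch below runs on hypothetical toy frames (`toy_rating`, `toy_books`) rather than the real data.

```python
import pandas as pd

# Hypothetical toy data: ISBN "222" has no entry in toy_books.
toy_rating = pd.DataFrame({
    "user-id": [1, 1, 2],
    "ISBN": ["111", "222", "333"],
    "ratings": [5, 0, 9],
})
toy_books = pd.DataFrame({
    "ISBN": ["111", "333"],
    "title": ["Book A", "Book C"],
})

# A left merge keeps every rating; indicator=True adds a `_merge` column
# that says whether a matching ISBN was found in toy_books.
checked = toy_rating.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

# The default inner merge drops the unmatched rows without any message.
inner = toy_rating.merge(toy_books, on="ISBN")

print(len(toy_rating) - len(inner), "rating(s) dropped by the inner merge")
print(unmatched[["user-id", "ISBN"]])
```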

In [None]:
#Counting the number of ratings each title has received
number_rating = ratings_with_books.groupby('title')['ratings'].count().reset_index()
number_rating.rename(columns = {'ratings' : 'number_of_rating'} , inplace = True)

#merging number_rating and ratings_with_books
final_ratings = ratings_with_books.merge(number_rating , on = 'title')
final_ratings.shape

(487671, 8)

In [None]:
#Keeping only those books whose number of ratings is equal to or greater than 50
final_ratings=final_ratings[final_ratings['number_of_rating']>=50]
final_ratings.shape

(61853, 8)

In [None]:
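#Dropping duplicate rows so that each (user-id, title) pair appears only once
#Reassigning instead of inplace = True avoids modifying a filtered slice in place
final_ratings = final_ratings.drop_duplicates(['user-id' , 'title'])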

In [None]:
final_ratings.shape

(59850, 8)
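A small sketch of what the de-duplication step does. It assumes, purely for illustration, that duplicates arise when one user rates two editions (two ISBNs) of the same title; the notebook itself does not state where they come from. The frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: user 7 rated two ISBNs that share one title.
toy = pd.DataFrame({
    "user-id": [7, 7, 8],
    "ISBN": ["111", "112", "111"],
    "title": ["Book A", "Book A", "Book A"],
    "ratings": [9, 0, 6],
})

# Rows that repeat an earlier (user-id, title) pair
dupes = toy[toy.duplicated(["user-id", "title"])]

# drop_duplicates keeps the first occurrence of each pair by default
deduped = toy.drop_duplicates(["user-id", "title"])

print(len(dupes), "duplicate row(s) found;", len(deduped), "rows remain")
```

Which rating survives depends on row order, since only the first occurrence is kept.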

Our data is now clean, so we can go ahead and build the book-by-user rating table.

In [None]:
final_ratings.head(2)

Unnamed: 0,user-id,ISBN,ratings,title,author,year,publisher,number_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc,82


In [None]:
pivot_table = pd.pivot_table(final_ratings , columns = 'user-id' , values = 'ratings' , index = 'title')
pivot_table.shape

(742, 888)

In [None]:
pivot_table.fillna(0 , inplace = True)
pivot_table

title \ user-id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84 Charing Cross Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
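The pivot and fill steps can be reproduced on a few made-up rows to see what each cell means. This is a sketch on a hypothetical frame `toy`, not on the real ratings.

```python
import pandas as pd

# Hypothetical toy ratings: three users, three titles.
toy = pd.DataFrame({
    "user-id": [1, 1, 2, 3],
    "title": ["Book A", "Book B", "Book A", "Book C"],
    "ratings": [9, 0, 6, 10],
})

# Rows are titles, columns are users; pairs without a rating become NaN.
toy_pivot = pd.pivot_table(toy, index="title", columns="user-id", values="ratings")
n_missing = int(toy_pivot.isna().sum().sum())
print(n_missing, "empty cells before filling")

# Replace the missing pairs with 0, as done for the real table.
toy_pivot = toy_pivot.fillna(0)
print(toy_pivot)
```

After filling, a book the user never rated and a book with an implicit rating of 0 look the same in the table.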


**Model Building:** We will build a model that is similar to the KNN algorithm but is not the same. KNN is usually used as a supervised model for classification (or regression). Here we have no labels, so we use the unsupervised Nearest Neighbors search instead: for a given book, it calculates the distance to the other books and returns the closest ones.

But there's a problem: most entries in the pivot table are 0. Storing and processing all those zeros is of no use and increases memory use and computation time. I only want the model to work with the non-zero values.

So I will convert the table to a CSR (compressed sparse row) matrix, which stores only the non-zero entries.
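What a CSR matrix actually stores can be seen on a small array. The sketch below uses a hypothetical 3 x 4 matrix `dense` in place of the real pivot table.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the pivot table: mostly zeros.
dense = np.array([
    [9.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 10.0, 0.0],
    [0.0, 7.0, 0.0, 7.0],
])
sparse = csr_matrix(dense)

print("stored values:", sparse.nnz, "of", dense.size)
print("data:   ", sparse.data)     # the non-zero ratings, row by row
print("indices:", sparse.indices)  # column position of each stored rating
print("indptr: ", sparse.indptr)   # where each row starts inside `data`
```

Only 4 of the 12 cells are stored here; on a table as sparse as the real one, the saving is far larger.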

In [None]:
from scipy.sparse import csr_matrix
book_sparse = csr_matrix(pivot_table)
type(book_sparse)

In [None]:
from sklearn.neighbors import NearestNeighbors
#By default (algorithm = 'auto') scikit-learn chooses the search method itself; 'brute' explicitly compares the query with every book, which is also what sparse input requires
model = NearestNeighbors(algorithm = 'brute')

In [None]:
model.fit(book_sparse)

In [None]:
distance , suggestion = model.kneighbors(pivot_table.iloc[358, :].values.reshape(1,-1) , n_neighbors = 6)

In [None]:
for i in range(len(suggestion)):
  print(pivot_table.index[suggestion[i]])

Index(['Naked', 'No Safe Place', 'Deck the Halls (Holiday Classics)',
       'Long After Midnight', 'Exclusive', 'Lake Wobegon days'],
      dtype='object', name='title')
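The `distance` array returned by `kneighbors` is not printed above, but it tells how close each suggestion is. A minimal sketch on a hypothetical four-book table (`toy_pivot`, `toy_model`) pairs each title with its distance; with the default settings the distance is Euclidean.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical book-by-user table (rows: titles, columns: user IDs).
toy_pivot = pd.DataFrame(
    [[9, 0, 0], [8, 0, 0], [0, 7, 0], [0, 0, 10]],
    index=pd.Index(["Book A", "Book B", "Book C", "Book D"], name="title"),
    columns=[1, 2, 3],
    dtype=float,
)
toy_model = NearestNeighbors(algorithm="brute").fit(csr_matrix(toy_pivot.values))

# Query with the ratings row of "Book A"
query = toy_pivot.loc["Book A"].values.reshape(1, -1)
distance, suggestion = toy_model.kneighbors(query, n_neighbors=3)

# Both arrays have shape (1, 3): one row per query, one column per neighbor.
for d, pos in zip(distance[0], suggestion[0]):
    print(f"{toy_pivot.index[pos]}: {d:.2f}")
```

The first neighbor is the queried book itself at distance 0, which is why 'Naked' heads its own suggestion list.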


In [None]:
pivot_table.index[358]

'Naked'

In [None]:
np.where(pivot_table.index == 'Naked')[0][0] #Get the row position of a book just by giving its name

358

In [None]:
def recommendation(book_name):
  #Row position of the book in the pivot table
  bookid = np.where(pivot_table.index == book_name)[0][0]
  #Positions of the 6 closest books (the queried book itself is normally the first one)
  distance , suggestion = model.kneighbors(pivot_table.iloc[bookid, :].values.reshape(1,-1) , n_neighbors = 6)
  print('The suggestions for' , book_name , 'are:')
  #suggestion has shape (1, 6): one row of neighbor positions for the single query
  print(pivot_table.index[suggestion[0]])


In [None]:
recommendation('Animal Farm')

The suggestions for Animal Farm are:
Index(['Animal Farm', 'Exclusive', 'Jacob Have I Loved', 'Second Nature',
       'Pleading Guilty', 'No Safe Place'],
      dtype='object', name='title')


In [None]:
recommendation('Naked')

The suggestions for Naked are:
Index(['Naked', 'No Safe Place', 'Deck the Halls (Holiday Classics)',
       'Long After Midnight', 'Exclusive', 'Lake Wobegon days'],
      dtype='object', name='title')


**Conclusion:** In this project, we successfully built a book recommendation system using the Nearest Neighbors algorithm. This system can recommend books based on the ratings provided by users, allowing us to identify books that are similar in terms of user preferences.

**Key Steps and Insights**

**Data Preparation:**

We started by organizing our dataset into a pivot table format where rows represent books, columns represent users, and the values are the ratings given by the users. This structure is essential for applying the KNN algorithm effectively.

**Model Training:**

The KNN algorithm was used to identify similar books based on their ratings. We fitted the model using the NearestNeighbors class from the sklearn.neighbors module. The model was fitted on the sparse (CSR) version of the pivot table, which enables it to find the nearest neighbors for any given book.

**Recommendation Function:**

We implemented a recommendation function that takes a book name as input, finds its index in the pivot table, and uses the trained KNN model to find and print similar books. This function leverages the kneighbors method to obtain the nearest neighbors.
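One possible variation of this function returns the titles as a list, leaves out the queried book, and handles unknown titles. This is a sketch under assumptions, not the function used in the notebook: `recommend`, `toy_pivot` and `toy_model` are hypothetical names, and the data is made up.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical book-by-user table (rows: titles, columns: user IDs).
toy_pivot = pd.DataFrame(
    [[9, 0, 0], [8, 0, 0], [0, 7, 0], [0, 0, 10]],
    index=pd.Index(["Book A", "Book B", "Book C", "Book D"], name="title"),
    columns=[1, 2, 3],
    dtype=float,
)
toy_model = NearestNeighbors(algorithm="brute").fit(csr_matrix(toy_pivot.values))


def recommend(book_name, pivot, model, n=2):
    """Return up to n titles closest to book_name, excluding the book itself."""
    if book_name not in pivot.index:
        return []  # unknown title: nothing to recommend
    bookid = pivot.index.get_loc(book_name)
    query = pivot.iloc[bookid, :].values.reshape(1, -1)
    # Ask for one extra neighbor because the book is its own nearest neighbor
    n_neighbors = min(n + 1, len(pivot))
    _, suggestion = model.kneighbors(query, n_neighbors=n_neighbors)
    titles = [pivot.index[pos] for pos in suggestion[0] if pos != bookid]
    return titles[:n]


print(recommend("Book A", toy_pivot, toy_model))
print(recommend("Unknown Book", toy_pivot, toy_model))
```

Returning a list instead of printing makes the result easier to reuse, for example in a web page or a test.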
