# Collaborative Recommender System

### "We need to create a recommendation model for books that suggests 5 similar books based on their features or characteristics."

### Data Set

#### BX-Users:
Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data (`Location`, `Age`) is provided if available; otherwise, these fields contain NULL values.

#### BX-Books:

Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in the case of several authors, only the first is provided. URLs linking to cover images are also given in three sizes (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, and large. These URLs point to the Amazon web site.

#### BX-Book-Ratings:
Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1 to 10 (higher values denoting higher appreciation), or implicit, expressed by 0.

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
import numpy as np
import pandas as pd

In [4]:
# sep=";" is needed because the file is semicolon-separated rather than comma-separated.
# error_bad_lines=False skips rows with an unexpected number of fields
# (this argument was deprecated in pandas 1.3 and removed in 2.0 in favour of on_bad_lines).
# encoding="latin-1" is needed to decode the non-ASCII characters in the file.
books=pd.read_csv("BX-Books.csv",sep=";",error_bad_lines = False,encoding = "latin-1")

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
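
On pandas 2.0 and later the `error_bad_lines` argument no longer exists. A minimal sketch of the equivalent call, assuming pandas >= 1.3 and using a small made-up in-memory sample instead of the real file:

```python
import io
import pandas as pd

# Hypothetical sample: the second data row has one field too many.
sample = "ISBN;title\n001;Book A\n002;Book B;extra\n003;Book C\n"

# on_bad_lines="skip" drops malformed rows silently; "warn" also reports them.
# dtype=str keeps leading zeros in ISBN-like identifiers.
sample_df = pd.read_csv(io.StringIO(sample), sep=";", on_bad_lines="skip", dtype=str)
print(sample_df)
```

Reading the ISBN column as text is a design choice for this sketch: ISBNs are identifiers rather than numbers, and some end in `X`.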


In [5]:
#check the data
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [6]:
#check column name
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

In [7]:
# The Image-URL-S, Image-URL-M, and Image-URL-L columns carry no useful information for the model, so we drop them.

In [8]:
books=books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher']]
books

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company
...,...,...,...,...,...
271355,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm)
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press


In [9]:
# Rename the columns for convenience
books.rename(columns={'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'},inplace=True)

In [10]:
books

Unnamed: 0,ISBN,title,author,year,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company
...,...,...,...,...,...
271355,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm)
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press


In [11]:
#load 2nd data set of users
users = pd.read_csv('BX-users.csv',sep=";",error_bad_lines = False,encoding = "latin-1")

In [12]:
#check data
users

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",
...,...,...,...
278853,278854,"portland, oregon, usa",
278854,278855,"tacoma, washington, united kingdom",50.0
278855,278856,"brampton, ontario, canada",
278856,278857,"knoxville, tennessee, usa",


In [13]:
#change column name
users.rename(columns={'User-ID':'user_id','Location':'location','Age':'age'},inplace = True)

In [14]:
users.head()

Unnamed: 0,user_id,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [15]:
#load 3rd data set of rating
rating=pd.read_csv("BX-Book-Ratings.csv", sep=';',error_bad_lines=False,encoding='latin-1')

In [16]:
#check data set
rating.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [17]:
# Rename the columns for convenience
rating.rename(columns={'User-ID':'user_id','Book-Rating':'book_rating'},inplace=True)

In [18]:
books.shape

(271360, 5)

In [19]:
users.shape

(278858, 3)

In [20]:
rating.shape

(1149780, 3)

In [21]:
# Approach
# We are building a collaborative recommender system.
# We do not compare users or books by their own attributes (content-based similarity).
# Instead: suppose user A and user B have both read book 1,
# and user A has also read book 2, which user B has not read. Then we recommend book 2 to user B.
# That is the problem; the solution starts from a matrix
# with columns = users, index = books, values = ratings.
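
The idea in the comments above can be sketched in plain Python. This is only an illustration of the "user A / user B" reasoning, not the method used later in the notebook; the data and the function name `suggest_for` are made up:

```python
# Hypothetical reading histories: user -> set of books read.
read_by = {
    "A": {"book 1", "book 2"},
    "B": {"book 1"},
}

def suggest_for(user, histories):
    """Suggest books read by users who share at least one book with `user`."""
    own = histories[user]
    suggestions = set()
    for other, books in histories.items():
        # Only look at other users who overlap with this user's history.
        if other != user and own & books:
            suggestions |= books - own
    return sorted(suggestions)

print(suggest_for("B", read_by))
```

User B shares book 1 with user A, so the one book A has read and B has not is suggested to B.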

In [22]:
# Using every book and every user would be a problem. Why?
# The matrix would have 271,360 rows and 278,858 columns, and almost all of its entries would be empty.

In [23]:
# Filter the users.
# There are 278,858 users, so we first find out how many of them actually rate books.
# We keep only the users who have rated more than 200 books.

In [24]:
# Same for books.
# There are 271,360 books, so we keep only those with enough ratings to be informative.
# We keep only the books that have been rated at least 50 times.

In [25]:
# Only 105,283 users have given at least one rating
rating['user_id'].value_counts().shape 

(105283,)

In [26]:
# Flag the users who rated more than 200 books; this gives a boolean Series indexed by user_id
rating_count = rating['user_id'].value_counts()>200
# Index the boolean Series with itself to keep only the True entries, i.e. the users with more than 200 ratings
rating_count[rating_count]

11676     True
198711    True
153662    True
98391     True
35859     True
          ... 
274808    True
28634     True
59727     True
268622    True
188951    True
Name: user_id, Length: 899, dtype: bool

In [27]:
# These 899 users are the ones we build the model on.
# Extract their user IDs.
rating_count_index=rating_count[rating_count].index
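
The filtering pattern used in the last few cells (count, boolean mask, index, `isin`) can be seen on a small made-up table. This is a sketch with a threshold of 1 instead of 200, and `toy` is a hypothetical stand-in for `rating`:

```python
import pandas as pd

# Hypothetical ratings: user 1 rated three books, user 2 one, user 3 two.
toy = pd.DataFrame({"user_id": [1, 1, 1, 2, 3, 3],
                    "ISBN": ["a", "b", "c", "a", "b", "c"]})

counts = toy["user_id"].value_counts()   # number of ratings per user
mask = counts > 1                        # boolean Series indexed by user_id
active_users = mask[mask].index          # labels where the mask is True

# Keep the full rating rows of the active users only.
filtered = toy[toy["user_id"].isin(active_users)]
print(sorted(active_users))
print(len(filtered))
```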

In [28]:
# We have the user IDs; now keep the full rating rows (user_id, ISBN, book_rating) of those users
rating=rating[rating['user_id'].isin(rating_count_index)]
rating

Unnamed: 0,user_id,ISBN,book_rating
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0
...,...,...,...
1147612,275970,3829021860,0
1147613,275970,4770019572,0
1147614,275970,896086097,0
1147615,275970,9626340762,8


In [29]:
# These 899 users account for roughly 520,000 ratings.
# In total, 105,283 users gave 1,149,780 ratings,
# so the other 104,384 users account for only the remaining 620,000 or so.
# That is why we build the model on these 899 users alone.

In [30]:
rating.head()

Unnamed: 0,user_id,ISBN,book_rating
1456,277427,002542730X,10
1457,277427,0026217457,0
1458,277427,003008685X,8
1459,277427,0030615321,0
1460,277427,0060002050,0


In [31]:
# Join the rating and books data frames on ISBN, so that each rating carries the book's title and details
ratings_with_books = rating.merge(books,on='ISBN')

In [32]:
ratings_with_books

Unnamed: 0,user_id,ISBN,book_rating,title,author,year,publisher
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
...,...,...,...,...,...,...,...
487666,275970,1892145022,0,Here Is New York,E. B. White,1999,Little Bookroom
487667,275970,1931868123,0,There's a Porcupine in My Outhouse: Misadventu...,Mike Tougias,2002,Capital Books (VA)
487668,275970,3411086211,10,Die Biene.,Sybil Gräfin Schönfeldt,1993,"Bibliographisches Institut, Mannheim"
487669,275970,3829021860,0,The Penis Book,Joseph Cohen,1999,Konemann


In [33]:
# The data has shrunk by approximately 40,000 rows.
# Why? Some rated ISBNs have no entry in the books table, so the merge drops those ratings.
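
One way to check which ratings are lost in such a merge is a left merge with `indicator=True`. A minimal sketch on made-up data; `toy_rating` and `toy_books` are hypothetical stand-ins for `rating` and `books`:

```python
import pandas as pd

toy_rating = pd.DataFrame({"user_id": [1, 2, 3], "ISBN": ["a", "b", "zzz"]})
toy_books = pd.DataFrame({"ISBN": ["a", "b"], "title": ["Book A", "Book B"]})

# The default (inner) merge silently drops ratings whose ISBN is not in toy_books.
inner = toy_rating.merge(toy_books, on="ISBN")

# A left merge with indicator=True keeps every rating and adds a `_merge`
# column that marks rows without a matching book as "left_only".
checked = toy_rating.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(inner))
print(unmatched[["user_id", "ISBN"]])
```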

In [34]:
# Now find the books that have enough ratings.
# Group by title to count the ratings each book received.
number_rating = ratings_with_books.groupby('title')['book_rating'].count().reset_index()

In [35]:
number_rating.rename(columns={'book_rating':'number_of_rating'},inplace=True)

In [36]:
#join the data ratings_with_books and number_rating to get final_rating dataset
final_rating=ratings_with_books.merge(number_rating, on='title')

In [37]:
final_rating.shape

(487671, 8)

In [38]:
# Keep only the books with at least 50 ratings
final_rating=final_rating[final_rating['number_of_rating']>=50]
final_rating.shape

(61853, 8)
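
The count, merge, and filter steps above can also be written with `groupby(...).transform("count")`, which attaches the per-title count to every row directly. This is an alternative sketch on made-up data with a threshold of 3 instead of 50, not what the notebook itself does:

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "A", "A", "B"],
                    "book_rating": [5, 0, 8, 7]})

# transform("count") returns one value per row: the size of that row's group.
toy["number_of_rating"] = toy.groupby("title")["book_rating"].transform("count")

# Keep only titles with at least 3 ratings.
popular = toy[toy["number_of_rating"] >= 3]
print(popular)
```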

In [39]:
final_rating.drop_duplicates(['title','user_id'],inplace=True)

In [40]:
final_rating.head()

Unnamed: 0,user_id,ISBN,book_rating,title,author,year,publisher,number_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,82
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,82
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,82
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,82


In [41]:
# This drops roughly 2,000 duplicate rows (the same user rating the same title more than once).
# The remaining 59,850 ratings come from users with more than 200 ratings, on books with at least 50 ratings.

In [42]:
# Create a pivot table: titles as rows, user_ids as columns, ratings as values
book_pivot=final_rating.pivot_table(columns='user_id',index='title',values='book_rating')
book_pivot.shape

(742, 888)

In [43]:
# We started with 899 users, but only 888 columns remain:
# the other 11 users did not rate any of the books that have at least 50 ratings.
# So the final matrix covers 742 books with at least 50 ratings each and 888 users with more than 200 ratings each.

In [44]:
#fill NAN value with 0 in book_pivot
book_pivot.fillna(0,inplace=True)
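
A small made-up example shows what the pivot and the fill do. The names `toy` and `toy_pivot` are hypothetical; note that `pivot_table` averages duplicate (title, user) pairs by default, and that after `fillna(0)` an unrated book looks the same as an implicit rating of 0:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "title": ["A", "B", "A", "B"],
    "book_rating": [10, 0, 6, 8],
})

# Rows = titles, columns = users, cells = ratings (NaN where a user gave no rating).
toy_pivot = toy.pivot_table(columns="user_id", index="title", values="book_rating")

# Replace the missing ratings with 0.
toy_pivot = toy_pivot.fillna(0)
print(toy_pivot)
```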

In [45]:
book_pivot

user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
title
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
84 Charing Cross Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zoya,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
# Next we use a nearest-neighbours algorithm: it computes the distances between books and returns the closest ones.
# The problem is that most entries in the matrix are 0 (no rating), so storing it densely wastes memory and time.
# To address this, we convert the matrix into a sparse matrix.

In [47]:
# Import csr_matrix from SciPy to convert the pivot table into a sparse matrix.
# A sparse matrix stores only the non-zero entries, which saves memory and computation time.
# This matrix is the input to the model.
from scipy.sparse import csr_matrix
book_sparse=csr_matrix(book_pivot)

In [48]:
# Check the matrix type. Internally only the non-zero values are stored; the zeros are implicit, so the distances come out the same as for the dense matrix.
type(book_sparse)

scipy.sparse._csr.csr_matrix
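
To make the point about sparse storage concrete, the sketch below fits the same nearest-neighbours model on a small made-up matrix in dense and in CSR form and compares the results. The array `dense` is hypothetical data, not taken from the book ratings:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

dense = np.array([[9.0, 0.0, 0.0, 0.0],
                  [0.0, 10.0, 0.0, 0.0],
                  [9.0, 0.0, 7.0, 0.0]])
sparse = csr_matrix(dense)

# CSR keeps only the non-zero entries.
print(sparse.nnz, "stored values out of", dense.size)

# Same model, same data, two storage formats.
d_dense, i_dense = NearestNeighbors(algorithm="brute").fit(dense).kneighbors(dense, n_neighbors=2)
d_sparse, i_sparse = NearestNeighbors(algorithm="brute").fit(sparse).kneighbors(sparse, n_neighbors=2)

print(np.allclose(d_dense, d_sparse), np.array_equal(i_dense, i_sparse))
```

The storage format changes memory use and speed, not the neighbours that are found.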

In [49]:
# Import NearestNeighbors from scikit-learn and select the 'brute' algorithm.
# Brute force computes the distance from the query to every point; it is also the method scikit-learn uses for sparse input.
from sklearn.neighbors import NearestNeighbors
model = NearestNeighbors(algorithm='brute')

In [50]:
model.fit(book_sparse)

NearestNeighbors(algorithm='brute')

In [51]:
# book_pivot.iloc[237] selects the row of the book whose nearest neighbours we want.
# Row 237 = Harry Potter and the Chamber of Secrets.
# n_neighbors=6 returns the 5 nearest books plus the query book itself.
# reshape(1,-1) turns the row into a 2-D array with 1 row and 888 columns, as kneighbors expects.
distances, suggestions = model.kneighbors(book_pivot.iloc[237,:].values.reshape(1,-1),n_neighbors=6)

In [52]:
# Distances to the 6 nearest rows: the first is 0 (the book itself), followed by the 5 nearest books
distances

array([[ 0.        , 68.78953409, 69.5413546 , 72.64296249, 76.83098333,
        77.28518616]])
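
With its default settings, `NearestNeighbors` uses the Minkowski metric with `p=2`, i.e. the Euclidean distance between rating vectors. The sketch below checks this by hand on a small made-up matrix; `ratings_toy` and `model_toy` are hypothetical names:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

ratings_toy = np.array([[9.0, 0.0, 0.0],
                        [0.0, 10.0, 0.0],
                        [9.0, 0.0, 7.0]])

model_toy = NearestNeighbors(algorithm="brute").fit(ratings_toy)
distances_toy, indices_toy = model_toy.kneighbors(ratings_toy[[0]], n_neighbors=3)

# Recompute the distances from row 0 to the returned rows with plain NumPy.
by_hand = np.linalg.norm(ratings_toy[indices_toy[0]] - ratings_toy[0], axis=1)

print(indices_toy[0])
print(distances_toy[0])
print(by_hand)
```

Because the raw ratings are used, two books are close when the same users gave them similar scores and far apart when many users rated one but not the other.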

In [53]:
#it will give the nearest suggested books
suggestions

array([[237, 240, 238, 241, 184, 536]], dtype=int64)

In [54]:
# How do we see the titles of those books?
# Use a for loop:
# book_pivot.index holds the titles, and suggestions[i] holds the row positions of the suggested books.
# Indexing one with the other gives the titles the model suggests for the chosen book.
for i in range(len(suggestions)):
    print(book_pivot.index[suggestions[i]])

Index(['Harry Potter and the Chamber of Secrets (Book 2)',
       'Harry Potter and the Prisoner of Azkaban (Book 3)',
       'Harry Potter and the Goblet of Fire (Book 4)',
       "Harry Potter and the Sorcerer's Stone (Book 1)", 'Exclusive',
       'The Cradle Will Fall'],
      dtype='object', name='title')


In [55]:
# cross check for id = 536 , book_name = 'The Cradle Will Fall'
distances, suggestions = model.kneighbors(book_pivot.iloc[536,:].values.reshape(1,-1),n_neighbors=6)
for i in range(len(suggestions)):
    print(book_pivot.index[suggestions[i]])
suggestions

Index(['The Cradle Will Fall', 'Exclusive', 'The Long Road Home',
       'Eyes of a Child', 'Jacob Have I Loved', 'No Safe Place'],
      dtype='object', name='title')


array([[536, 184, 597, 187, 291, 372]], dtype=int64)

In [56]:
#cross check for id = 372, book  = 'no safe place'
distances, suggestions = model.kneighbors(book_pivot.iloc[372,:].values.reshape(1,-1),n_neighbors=6)
for i in range(len(suggestions)):
    print(book_pivot.index[suggestions[i]])


Index(['No Safe Place', 'Long After Midnight', 'A Civil Action', 'Exclusive',
       'Table For Two', 'Lake Wobegon days'],
      dtype='object', name='title')


In [60]:
# A function that takes a book title and prints the books nearest to it.
# np.where(book_pivot.index == book_name) returns a tuple of arrays of matching positions;
# [0] selects the array and the second [0] takes the first match, i.e. the row number of the book.
def recommend_book(book_name):
    book_id=np.where(book_pivot.index == book_name)[0][0]
    distances, suggestions = model.kneighbors(book_pivot.iloc[book_id,:].values.reshape(1,-1),n_neighbors=6)
    # suggestions has shape (1, 6), so len(suggestions) is 1 and the loop runs only for i == 0.
    # That is why a test such as `if i != 0` never prints anything; the single
    # print below shows all 6 titles, including the queried book itself.
    for i in range(len(suggestions)):
        if i == 0:
            print("The suggestions for " , book_name , "are : ")
        if not i  :
            print(book_pivot.index[suggestions[i]])

In [61]:
recommend_book('Animal Farm')

The suggestions for  Animal Farm are : 
Index(['Animal Farm', 'Exclusive', 'Jacob Have I Loved', 'Second Nature',
       'Pleading Guilty', 'No Safe Place'],
      dtype='object', name='title')
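
The output above still lists 'Animal Farm' itself. A possible variant that leaves out the queried book and returns the titles as a list is sketched below on a small made-up matrix. The names `toy_pivot`, `toy_model`, and `recommend_titles` are hypothetical, and this is one way to write it rather than the notebook's own function:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical book x user matrix: A and B are rated alike, as are C and D.
toy_pivot = pd.DataFrame(
    [[9, 0, 8, 0], [8, 0, 9, 0], [0, 10, 0, 7], [0, 9, 0, 8]],
    index=pd.Index(["Book A", "Book B", "Book C", "Book D"], name="title"),
    dtype=float,
)
toy_model = NearestNeighbors(algorithm="brute").fit(toy_pivot.values)

def recommend_titles(book_name, pivot, fitted_model, n=2):
    """Return the n titles nearest to book_name, excluding the book itself."""
    matches = np.where(pivot.index == book_name)[0]
    if len(matches) == 0:
        return []  # unknown title: nothing to recommend
    book_id = matches[0]
    # Ask for n + 1 neighbours because the nearest one is the book itself.
    _, suggestions = fitted_model.kneighbors(
        pivot.iloc[book_id, :].values.reshape(1, -1), n_neighbors=n + 1)
    # suggestions has shape (1, n + 1); iterate over its single row.
    return [pivot.index[j] for j in suggestions[0] if j != book_id][:n]

print(recommend_titles("Book A", toy_pivot, toy_model))
```

Returning a list instead of printing keeps the function easy to test and reuse; an unknown title gives an empty list rather than an `IndexError`.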
