# Book Recommendation System

### Downloading Dataset

In [101]:
!wget -O /content/BX-CSV-Dump.zip http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
print('unzipping ...')
!unzip -o -j /content/BX-CSV-Dump.zip

--2020-09-23 14:00:46--  http://www2.informatik.uni-freiburg.de/~cziegler/BX/BX-CSV-Dump.zip
Resolving www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)... 132.230.105.133
Connecting to www2.informatik.uni-freiburg.de (www2.informatik.uni-freiburg.de)|132.230.105.133|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘/content/BX-CSV-Dump.zip’


2020-09-23 14:00:48 (16.0 MB/s) - ‘/content/BX-CSV-Dump.zip’ saved [26085508/26085508]

unzipping ...
Archive:  /content/BX-CSV-Dump.zip
  inflating: BX-Book-Ratings.csv     
  inflating: BX-Books.csv            
  inflating: BX-Users.csv            


### Importing Libraries

In [102]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [103]:
BR = pd.read_csv("BX-Book-Ratings.csv",encoding= 'unicode_escape',low_memory=False,sep=';') 

In [104]:
BR.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [105]:
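#Rows with the wrong number of fields are skipped instead of raising an error.
#error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions use on_bad_lines='skip'.
B = pd.read_csv("BX-Books.csv",encoding='unicode_escape',low_memory=False,sep=';',error_bad_lines=False)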

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 9\nSkipping line 51751: expected 8 fields, saw 9\nSkipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\nSkipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\nSkipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
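The skipped-line messages above come from rows that contain more fields than the header declares. As a minimal sketch of the same behaviour on a recent pandas (assuming version 1.3 or later, where `on_bad_lines` is available), the snippet below reads a small in-memory file. The three-row file and its column names are made up for illustration and are not part of the BX dump.

```python
import io
import pandas as pd

# Hypothetical three-row file; the second data row has one field too many
raw = (
    "ISBN;Book-Title;Year\n"
    "0001;Good Book;1999\n"
    "0002;Bad;Row;2001\n"
    "0003;Another Book;2005\n"
)

# on_bad_lines='skip' drops malformed rows silently;
# dtype=str for ISBN keeps leading zeros intact
books = pd.read_csv(
    io.StringIO(raw), sep=";", dtype={"ISBN": str}, on_bad_lines="skip"
)
print(books)
```

Reading the ISBN column as a string matters here because ISBNs can start with zeros or end in `X`, so they are identifiers rather than numbers.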


In [106]:
B.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [107]:
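#As above, error_bad_lines only exists in pandas before 2.0; newer versions use on_bad_lines='skip'.
U = pd.read_csv("BX-Users.csv",encoding='unicode_escape',low_memory=False,sep=';',error_bad_lines=False)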

In [108]:
U.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


## Collaborative Filtering

Collaborative filtering, in its user-user form, recommends items to an input user based on the ratings of other users. It finds users whose preferences are similar to the input user's and then recommends the items those users rated highly. Here, the Pearson correlation coefficient is used to measure how similar two users are.
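The idea can be sketched on a toy ratings table before applying it to the real data. The users, books, and ratings below are invented for illustration, and `Series.corr` (Pearson by default) stands in for the hand-written correlation used later in this notebook.

```python
import pandas as pd

# Hypothetical ratings (rows: users, columns: books), for illustration only
ratings = pd.DataFrame(
    {
        "book_1": [5, 1, 4],
        "book_2": [3, 3, 2],
        "book_3": [1, 5, 1],
        "book_4": [9, 2, 8],
    },
    index=["user_a", "user_b", "user_c"],
)
# The input user has rated three of the four books
input_user = pd.Series({"book_1": 4.5, "book_2": 3.0, "book_3": 1.5})

# Step 1: Pearson similarity of every user to the input user on the shared books
shared = ratings[input_user.index]
similarity = shared.apply(lambda row: row.corr(input_user), axis=1)

# Step 2: take the most similar user and suggest their best-rated unseen book
neighbour = similarity.idxmax()
unseen = ratings.columns.difference(input_user.index)
suggestion = ratings.loc[neighbour, unseen].idxmax()
print(similarity, neighbour, suggestion)
```

This sketch uses a single neighbour for brevity; the notebook instead combines the ratings of the 50 most similar users.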

### Books read by the User

In [109]:
userInput = [
            {'Book-Title':'The Satanic Verses', 'Book-Rating':3.5},
            {'Book-Title':'Don Quixote', 'Book-Rating':5},
            {'Book-Title':'To Kill a Mockingbird', 'Book-Rating':4.5},
            {'Book-Title':'A Passage to India', 'Book-Rating':2},
            {'Book-Title':'Beloved', 'Book-Rating':5}               
         ] 
inputBooks = pd.DataFrame(userInput)
inputBooks

Unnamed: 0,Book-Title,Book-Rating
0,The Satanic Verses,3.5
1,Don Quixote,5.0
2,To Kill a Mockingbird,4.5
3,A Passage to India,2.0
4,Beloved,5.0


In [110]:
#Keep only the catalogue rows whose title matches one of the input books
inputId = B[B['Book-Title'].isin(inputBooks['Book-Title'].tolist())]
#Merge on the title to attach an ISBN to each input rating (one row per matching edition)
inputBooks = pd.merge(inputId, inputBooks, on='Book-Title')
#Drop the columns that are not required (the positional axis argument was removed in pandas 2.0)
inputBooks = inputBooks.drop(columns=['Book-Author', 'Year-Of-Publication', 'Publisher',
                                      'Image-URL-S', 'Image-URL-M', 'Image-URL-L'])
inputBooks.head()

Unnamed: 0,ISBN,Book-Title,Book-Rating
0,0446310786,To Kill a Mockingbird,4.5
1,0446310492,To Kill a Mockingbird,4.5
2,0060935464,To Kill a Mockingbird,4.5
3,0397001517,To Kill a Mockingbird,4.5
4,006017322X,To Kill a Mockingbird,4.5
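The output above lists the same title several times because the catalogue holds one row per edition, each with its own ISBN, and the merge copies the input rating onto every edition. A small sketch with a made-up catalogue shows the effect:

```python
import pandas as pd

# Hypothetical catalogue: the same title appears under two ISBNs (two editions)
catalogue = pd.DataFrame({
    "ISBN": ["111", "222", "333"],
    "Book-Title": ["Beloved", "Beloved", "Don Quixote"],
})
# One rating per title, as in the user input
ratings = pd.DataFrame({
    "Book-Title": ["Beloved", "Don Quixote"],
    "Book-Rating": [5.0, 5.0],
})

# Merging on the title yields one row per edition, each carrying the same rating
merged = pd.merge(catalogue, ratings, on="Book-Title")
print(merged)
```

Two input titles become three rows here, so any later step that counts "books in common" is really counting editions in common.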


### Users who have read the same Books

In [111]:
NewReaderSubset = BR[BR['ISBN'].isin(inputBooks['ISBN'].tolist())] 
NewReaderSubset

Unnamed: 0,User-ID,ISBN,Book-Rating
648,276953,0446310786,10
2827,277743,0446310786,9
3176,277954,0446310492,0
4125,278243,0156711427,8
5005,278418,0345327853,0
...,...,...,...
1144914,275306,0446310786,10
1145467,275520,0446310786,8
1147047,275970,0670825379,0
1149450,276680,0060188707,0


In [112]:
#Grouping by User-ID (a scalar key, so each group is labelled with the plain user ID rather than a 1-tuple)
NewReaderSubsetGroup = NewReaderSubset.groupby('User-ID')

In [113]:
#Sorting the groups so that users who share the most books with the input come first
NewReaderSubsetGroup = sorted(NewReaderSubsetGroup,  key=lambda x: len(x[1]), reverse=True)
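Iterating over a pandas `groupby` object yields `(key, sub-DataFrame)` pairs, which is why `sorted` can order the users by the size of their group. A minimal sketch on invented ratings:

```python
import pandas as pd

# Hypothetical ratings: user 3 rated three of the books, user 2 two, user 1 one
ratings = pd.DataFrame({
    "User-ID": [1, 2, 2, 3, 3, 3],
    "ISBN": ["a", "a", "b", "a", "b", "c"],
    "Book-Rating": [8, 0, 7, 9, 0, 10],
})

# Each element of the sorted list is a (user_id, rows_for_that_user) pair
groups = sorted(ratings.groupby("User-ID"), key=lambda pair: len(pair[1]), reverse=True)
order = [int(user_id) for user_id, _ in groups]
print(order)
```

The result is a plain Python list of tuples, not a `groupby` object, so it can be sliced and looped over directly.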

In [114]:
NewReaderSubsetGroup[0:5]

[(11676,        User-ID        ISBN  Book-Rating
  45730    11676  0060935464           10
  48925    11676  0394535979           10
  50007    11676  0446310786            0
  52282    11676  0670825379            0
  55139    11676  0899668585            8),
 (55492,         User-ID        ISBN  Book-Rating
  238498    55492  0156711427            0
  239467    55492  0446310786            0
  239759    55492  0452280621            0
  240613    55492  0963270702            0),
 (271705,          User-ID        ISBN  Book-Rating
  1132864   271705  0156711427            7
  1132936   271705  0446310786            0
  1132954   271705  0452280621            9
  1133030   271705  0963270702            8),
 (23768,         User-ID        ISBN  Book-Rating
  101937    23768  0446310786            0
  102045    23768  0451161394            0
  102857    23768  1580601200            0),
 (23902,         User-ID        ISBN  Book-Rating
  104007    23902  0156711427            8
  104336   

### Using Pearson Correlation

In [115]:
#Storing the Pearson correlation of each user with the input user, keyed by User-ID
pearsonCorrelationDict = {}
#Sorting the input once so its ratings line up with each ISBN-sorted group
inputBooks = inputBooks.sort_values(by='ISBN')
#For every user group in our subset
for name, group in NewReaderSubsetGroup:
    #Sorting the current user group by ISBN so the two rating lists are aligned
    group = group.sort_values(by='ISBN')
    nRatings = len(group)
    #Getting the input ratings for the books that both users have in common
    temp_df = inputBooks[inputBooks['ISBN'].isin(group['ISBN'].tolist())]
    #Storing them as a list to simplify the calculations below
    tempRatingList = temp_df['Book-Rating'].tolist()
    #Putting the current user's ratings for the same books in a list
    tempGroupList = group['Book-Rating'].tolist()
    #Calculating the Pearson correlation between the two users, x (input) and y (current user)
    Sxx = sum(i**2 for i in tempRatingList) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum(i**2 for i in tempGroupList) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum(i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    #If the denominator is non-zero, divide; otherwise record a correlation of 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
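The sums-of-squares form used in the loop can be checked against `numpy.corrcoef` on a small example. The helper name `pearson_sums` and the rating lists are assumptions made for this sketch. The last line also shows that two users with only two books in common always get a correlation of exactly +1 or -1 (when it is defined), which may explain the many `1.0` and `-1.0` values in the output that follows.

```python
from math import sqrt
import numpy as np

def pearson_sums(x, y):
    """Pearson correlation via the sums-of-squares form used in the loop above."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A zero denominator means one list is constant; report 0 as the notebook does
    return sxy / sqrt(sxx * syy) if sxx != 0 and syy != 0 else 0.0

# Hypothetical ratings of four shared books (input user on a 1-5 scale, other user on 0-10)
x = [3.5, 5.0, 4.5, 2.0]
y = [7, 10, 8, 3]
r_sums = pearson_sums(x, y)
r_numpy = float(np.corrcoef(x, y)[0, 1])

# With only two shared books the two points always lie on a line
two_point = pearson_sums([3.5, 5.0], [8, 9])
print(r_sums, r_numpy, two_point)
```

Because Pearson correlation is invariant to scale and offset, the input ratings and the dataset ratings do not need to share a scale for the comparison to work.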


In [116]:
pearsonCorrelationDict.items()

dict_items([(11676, 0.6877119754621323), (55492, 0), (271705, -0.18516401995451032), (23768, 0), (23902, -0.43355498476206183), (87555, 0), (142524, 0), (149907, 0), (185275, 0), (192093, 0), (209516, -0.6286185570937125), (211426, 0), (230522, 0.6286185570937124), (4017, 0), (8253, 1.0), (11601, -1.0), (21014, 0), (28150, 1.0), (30511, 1.0), (32440, 0), (36836, 0), (36907, 0), (39646, 0), (46398, -1.0), (50225, 0), (55548, 1.0), (60277, 0), (62558, 0), (66942, 1.0), (70882, 0), (76352, 0), (80538, 1.0), (93426, 0), (95359, 0), (104636, 0), (105979, -1.0), (110973, 0), (113519, -1.0), (114414, 0), (114868, 0), (115490, 0), (117594, 0), (129074, 0), (135831, 0), (138198, 0), (144255, 1.0), (153662, 0), (158254, 1.0), (168144, 0), (168816, 1.0), (170861, 0), (177432, 0), (185233, 0), (187145, 1.0), (197364, -1.0), (197775, 0), (201526, 1.0), (203075, 0), (210485, 1.0), (211919, 1.0), (222296, 0), (225986, 0), (228681, 0), (231210, 1.0), (234623, 0), (238120, 0), (239594, 1.0), (256407, -

In [117]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['Similarity Index']
pearsonDF['User-ID'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,Similarity Index,User-ID
0,0.687712,11676
1,0.0,55492
2,-0.185164,271705
3,0.0,23768
4,-0.433555,23902


### Most Similar Users

In [118]:
#Keeping the 50 users with the highest similarity to the input user
topUsers = pearsonDF.sort_values(by='Similarity Index', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,Similarity Index,User-ID
31,1.0,80538
17,1.0,28150
59,1.0,211919
58,1.0,210485
56,1.0,201526


### Ratings of Selected Users for all Books

In [119]:
topUsersRating = topUsers.merge(BR, on='User-ID', how='inner')
topUsersRating.head()

Unnamed: 0,Similarity Index,User-ID,ISBN,Book-Rating
0,1.0,80538,0006499465,0
1,1.0,80538,0006499503,0
2,1.0,80538,0006510884,0
3,1.0,80538,002542730X,7
4,1.0,80538,0028604199,0


In [120]:
topUsersRating['Weighted Rating'] = topUsersRating['Similarity Index']*topUsersRating['Book-Rating']
topUsersRating.head()

Unnamed: 0,Similarity Index,User-ID,ISBN,Book-Rating,Weighted Rating
0,1.0,80538,0006499465,0,0.0
1,1.0,80538,0006499503,0,0.0
2,1.0,80538,0006510884,0,0.0
3,1.0,80538,002542730X,7,7.0
4,1.0,80538,0028604199,0,0.0


In [121]:
#Summing the similarity and the weighted rating per book over all selected users
tempTopUsersRating = topUsersRating.groupby('ISBN')[['Similarity Index','Weighted Rating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

ISBN,sum_similarityIndex,sum_weightedRating
9022906116,0.687712,4.813984
*0452281903,1.0,0.0
0 7336 1053 6,0.687712,0.0
0000000000,0.687712,6.189408
00000000000,0.687712,5.501696


In [122]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()
#Taking the weighted average
recommendation_df['Weighted Average Recommendation Score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['ISBN'] = tempTopUsersRating.index
recommendation_df.head()

ISBN,Weighted Average Recommendation Score,ISBN
9022906116,7.0,9022906116
*0452281903,0.0,*0452281903
0 7336 1053 6,0.0,0 7336 1053 6
0000000000,9.0,0000000000
00000000000,8.0,00000000000
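The score computed above is a similarity-weighted average: for each book, the sum of (similarity × rating) over the selected users divided by the sum of their similarities. A worked sketch on invented numbers, using the same column names as the notebook:

```python
import pandas as pd

# Hypothetical neighbours: similarity to the input user and their rating of each book
neighbour_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3],
    "ISBN": ["A", "B", "A", "B", "B"],
    "Similarity Index": [1.0, 1.0, 0.5, 0.5, 0.25],
    "Book-Rating": [8, 4, 2, 10, 8],
})
neighbour_ratings["Weighted Rating"] = (
    neighbour_ratings["Similarity Index"] * neighbour_ratings["Book-Rating"]
)

# Per book: sum of weights and sum of weighted ratings, then their ratio
sums = neighbour_ratings.groupby("ISBN")[["Similarity Index", "Weighted Rating"]].sum()
scores = sums["Weighted Rating"] / sums["Similarity Index"]
print(scores)
```

For book A this gives (1.0·8 + 0.5·2) / (1.0 + 0.5) = 6.0. A book rated by a single user simply inherits that user's rating, whatever the similarity, which is worth keeping in mind when reading scores of exactly 10.0.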


In [123]:
recommendation_df = recommendation_df.sort_values(by='Weighted Average Recommendation Score', ascending=False)
recommendation_df.head()

ISBN,Weighted Average Recommendation Score,ISBN
0440977096,10.0,0440977096
0439288568,10.0,0439288568
0766607631,10.0,0766607631
0770428312,10.0,0770428312
077042239X,10.0,077042239X


### Recommended Books for New User

In [124]:
B.loc[B['ISBN'].isin(recommendation_df.head(10)['ISBN'].tolist())]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
5316,0767914767,The Devil Wears Prada : A Novel,LAUREN WEISBERGER,2004,Broadway Books,http://images.amazon.com/images/P/0767914767.0...,http://images.amazon.com/images/P/0767914767.0...,http://images.amazon.com/images/P/0767914767.0...
9131,0345469674,Girls' Poker Night,JILL DAVIS,2004,Ballantine Books,http://images.amazon.com/images/P/0345469674.0...,http://images.amazon.com/images/P/0345469674.0...,http://images.amazon.com/images/P/0345469674.0...
58789,0440977096,The Secret Garden,Frances Hodgson Burnett,1989,Laure Leaf,http://images.amazon.com/images/P/0440977096.0...,http://images.amazon.com/images/P/0440977096.0...,http://images.amazon.com/images/P/0440977096.0...
97751,0770428312,The Dionnes,Ellie Tesher,2000,Bantam Books,http://images.amazon.com/images/P/0770428312.0...,http://images.amazon.com/images/P/0770428312.0...,http://images.amazon.com/images/P/0770428312.0...
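The lookup above returns fewer than ten rows, in catalogue order rather than score order, because `isin` only filters `B` and some recommended ISBNs have no catalogue entry. One possible alternative, sketched here on invented data, is a left merge from the recommendations into the catalogue, which keeps the ranking and makes the missing titles visible:

```python
import pandas as pd

# Hypothetical catalogue and ranked recommendations; ISBN "0009" is not in the catalogue
books = pd.DataFrame({
    "ISBN": ["0001", "0002", "0003"],
    "Book-Title": ["First", "Second", "Third"],
})
recommendations = pd.DataFrame({
    "ISBN": ["0003", "0009", "0001"],
    "score": [10.0, 9.5, 9.0],
})

# A left merge preserves the order of the recommendations and leaves NaN for unknown ISBNs
ranked = recommendations.merge(books, on="ISBN", how="left")
print(ranked)
```

Rows with a missing title can then be dropped with `dropna(subset=["Book-Title"])` if only catalogued books should be shown.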
