**About Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and implicit ratings are expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Execute the cell below to load the datasets

In [4]:
#Loading data
books = pd.read_csv("/content/drive/My Drive/Great_Lakes_Assignments/Lab External | Residency 5/books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('/content/drive/My Drive/Great_Lakes_Assignments/Lab External | Residency 5/users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('/content/drive/My Drive/Great_Lakes_Assignments/Lab External | Residency 5/ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
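The `error_bad_lines` argument used above was deprecated in pandas 1.3 in favour of `on_bad_lines`, and it was removed in pandas 2.0. The following is a minimal sketch of the newer argument on made-up data (the `raw` string and `toy_books` name are illustrative assumptions, not part of the dataset):

```python
import io
import pandas as pd

# Toy semicolon-separated data; the third data row has one field too many.
raw = (
    "ISBN;bookTitle;publisher\n"
    "111;Book A;Pub A\n"
    "222;Book B;Pub B\n"
    "333;Book; C;Pub C\n"
    "444;Book D;Pub D\n"
)

# on_bad_lines="skip" drops malformed rows silently;
# on_bad_lines="warn" drops them and also reports each skipped line.
toy_books = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip")
print(toy_books.shape)
```

With `"warn"` the behaviour is closest to the "Skipping line ..." messages shown above.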


### Check the number of records and features in each dataset

In [13]:
print(books.shape)
print("No. of Records :- ", books.shape[0])
print("No. of Features :- ", books.shape[1])

(271360, 8)
No. of Records :-  271360
No. of Features :-  8


In [14]:
print(users.shape)
print("No. of Records :- ", users.shape[0])
print("No. of Features :- ", users.shape[1])

(278858, 3)
No. of Records :-  278858
No. of Features :-  3


In [15]:
print(ratings.shape)
print("No. of Records :- ", ratings.shape[0])
print("No. of Features :- ", ratings.shape[1])

(1149780, 3)
No. of Records :-  1149780
No. of Features :-  3


## Exploring books dataset

In [16]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [17]:
books.columns

Index(['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher',
       'imageUrlS', 'imageUrlM', 'imageUrlL'],
      dtype='object')

In [0]:
books = books.drop(columns=['imageUrlS', 'imageUrlM', 'imageUrlL'])

In [19]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [20]:
print(books.shape)
print("No. of Records :- ", books.shape[0])
print("No. of Features :- ", books.shape[1])

(271360, 5)
No. of Records :-  271360
No. of Features :-  5


**yearOfPublication**

### Check unique values of yearOfPublication


In [22]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
dtypes: object(5)
memory usage: 10.4+ MB


In [21]:
books['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As can be seen above, this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of malformed rows in the CSV file.

In addition, some entries are strings while the same years appear elsewhere as numbers. Both issues are fixed in the steps that follow.
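One way to locate such entries without knowing the offending strings in advance is `pd.to_numeric` with `errors="coerce"`. The snippet below is only a sketch on a made-up column (`toy_years` is a hypothetical stand-in for `books['yearOfPublication']`):

```python
import pandas as pd

# Toy column mixing integers, numeric strings and a misplaced publisher name.
toy_years = pd.Series([2002, "2000", "DK Publishing Inc", 0, "1995"], dtype=object)

# errors="coerce" converts anything non-numeric to NaN instead of raising.
parsed = pd.to_numeric(toy_years, errors="coerce")

# Rows that failed to parse are the ones holding non-year text.
bad_rows = toy_years[parsed.isna()]
print(bad_rows.tolist())

# After dropping them, numeric strings and integers share one integer dtype.
clean = parsed.dropna().astype(int)
print(clean.tolist())
```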

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [25]:
books[books['yearOfPublication'].str.contains("DK Publishing Inc")== True]


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


In [27]:
books[books['yearOfPublication'].str.contains("DK Publishing Inc")== True].count()

ISBN                 2
bookTitle            2
bookAuthor           2
yearOfPublication    2
publisher            2
dtype: int64

### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [29]:
books[books['yearOfPublication'].str.contains("Gallimard")== True]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...


In [36]:
books[books['yearOfPublication'].str.contains("Gallimard")== True].count()

ISBN                 1
bookTitle            1
bookAuthor           1
yearOfPublication    1
publisher            1
dtype: int64

In [37]:
books = books[~books['yearOfPublication'].str.contains("Gallimard",na=False)]
books = books[~books['yearOfPublication'].str.contains("DK Publishing Inc",na=False)]
print(books[books['yearOfPublication'].str.contains("Gallimard")== True].count())
print(books[books['yearOfPublication'].str.contains("DK Publishing Inc")== True].count())

ISBN                 0
bookTitle            0
bookAuthor           0
yearOfPublication    0
publisher            0
dtype: int64
ISBN                 0
bookTitle            0
bookAuthor           0
yearOfPublication    0
publisher            0
dtype: int64


In [38]:
print(books.shape)
print("No. of Records :- ", books.shape[0])
print("No. of Features :- ", books.shape[1])

(271357, 5)
No. of Records :-  271357
No. of Features :-  5


We can see that a total of 3 records were deleted.

### Change the datatype of yearOfPublication to 'int'

In [0]:
books['yearOfPublication'] = books['yearOfPublication'].astype(int)

In [42]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [47]:
books.shape

(271357, 5)

In [52]:
books  = books[~books['publisher'].isna()]
books.shape

(271355, 5)

## Exploring Users dataset

In [53]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [55]:
users['Age'].unique()

array([ nan,  18.,  17.,  61.,  26.,  14.,  25.,  19.,  46.,  55.,  32.,
        24.,  20.,  34.,  23.,  51.,  31.,  21.,  44.,  30.,  57.,  43.,
        37.,  41.,  54.,  42.,  50.,  39.,  53.,  47.,  36.,  28.,  35.,
        13.,  58.,  49.,  38.,  45.,  62.,  63.,  27.,  33.,  29.,  66.,
        40.,  15.,  60.,   0.,  79.,  22.,  16.,  65.,  59.,  48.,  72.,
        56.,  67.,   1.,  80.,  52.,  69.,  71.,  73.,  78.,   9.,  64.,
       103., 104.,  12.,  74.,  75., 231.,   3.,  76.,  83.,  68., 119.,
        11.,  77.,   2.,  70.,  93.,   8.,   7.,   4.,  81., 114., 230.,
       239.,  10.,   5., 148., 151.,   6., 101., 201.,  96.,  84.,  82.,
        90., 123., 244., 133.,  91., 128.,  94.,  85., 141., 110.,  97.,
       219.,  86., 124.,  92., 175., 172., 209., 212., 237.,  87., 162.,
       100., 156., 136.,  95.,  89., 106.,  99., 108., 210.,  88., 199.,
       147., 168., 132., 159., 186., 152., 102., 116., 200., 115., 226.,
       137., 207., 229., 138., 109., 105., 228., 18

The `Age` column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

### Values below 5 and above 90 do not make much sense for our book rating case, so replace them with NaNs

In [56]:
users['Age'] = users['Age'].apply(lambda x:np.nan if x > 90 else x)
users['Age'] = users['Age'].apply(lambda x:np.nan if x < 5 else x)
users['Age'].unique()

array([nan, 18., 17., 61., 26., 14., 25., 19., 46., 55., 32., 24., 20.,
       34., 23., 51., 31., 21., 44., 30., 57., 43., 37., 41., 54., 42.,
       50., 39., 53., 47., 36., 28., 35., 13., 58., 49., 38., 45., 62.,
       63., 27., 33., 29., 66., 40., 15., 60., 79., 22., 16., 65., 59.,
       48., 72., 56., 67., 80., 52., 69., 71., 73., 78.,  9., 64., 12.,
       74., 75., 76., 83., 68., 11., 77., 70.,  8.,  7., 81., 10.,  5.,
        6., 84., 82., 90., 85., 86., 87., 89., 88.])
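The two `apply` calls above can also be written as a single vectorised step. This is a sketch on a made-up series (`toy_age` is hypothetical), using `Series.between`, which is inclusive on both ends, and `Series.where`, which keeps values where the condition holds and writes NaN elsewhere:

```python
import numpy as np
import pandas as pd

# Toy ages covering the edge cases: missing, too low, boundaries, too high.
toy_age = pd.Series([np.nan, 18.0, 0.0, 4.0, 5.0, 90.0, 91.0, 231.0])

# Keep ages in [5, 90]; everything else (including existing NaN) becomes NaN.
cleaned = toy_age.where(toy_age.between(5, 90))
print(cleaned.tolist())
```

The boundary values 5 and 90 are kept, matching the `x < 5` and `x > 90` conditions used in the notebook.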

### Replace null values in column `Age` with mean

In [59]:
user_age_mean = users['Age'].mean()
user_age_mean

34.72384041634689

In [61]:
users['Age'] = users['Age'].replace(np.nan,user_age_mean)
users['Age'].unique()

array([34.72384042, 18.        , 17.        , 61.        , 26.        ,
       14.        , 25.        , 19.        , 46.        , 55.        ,
       32.        , 24.        , 20.        , 34.        , 23.        ,
       51.        , 31.        , 21.        , 44.        , 30.        ,
       57.        , 43.        , 37.        , 41.        , 54.        ,
       42.        , 50.        , 39.        , 53.        , 47.        ,
       36.        , 28.        , 35.        , 13.        , 58.        ,
       49.        , 38.        , 45.        , 62.        , 63.        ,
       27.        , 33.        , 29.        , 66.        , 40.        ,
       15.        , 60.        , 79.        , 22.        , 16.        ,
       65.        , 59.        , 48.        , 72.        , 56.        ,
       67.        , 80.        , 52.        , 69.        , 71.        ,
       73.        , 78.        ,  9.        , 64.        , 12.        ,
       74.        , 75.        , 76.        , 83.        , 68.  

In [62]:
users['Age'].isna().sum()

0

In [63]:
users['Age'].isnull().sum()

0

### Change the datatype of `Age` to `int`

In [0]:
users['Age'] = users['Age'].astype(int)

In [65]:
users.dtypes

userID       int64
Location    object
Age          int64
dtype: object

In [66]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
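Note that `astype(int)` truncates rather than rounds, so the imputed mean of 34.72 is stored as 34. A small sketch of the difference, on a made-up series (`toy_age` is hypothetical; `mean_age` is the mean reported above):

```python
import numpy as np
import pandas as pd

toy_age = pd.Series([18.0, np.nan, 61.0])
mean_age = 34.72384041634689  # mean reported earlier in the notebook

filled = toy_age.fillna(mean_age)

# astype(int) drops the fractional part.
truncated = filled.astype(int)

# Rounding first gives the nearest whole year instead.
rounded = filled.round().astype(int)

print(truncated.tolist(), rounded.tolist())
```

Which of the two is preferable depends on how the imputed age is used later; the notebook uses truncation.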


## Exploring the Ratings Dataset

### Check the shape

In [67]:
ratings.shape

(1149780, 3)

In [0]:
n_users = users.shape[0]
n_books = books.shape[0]

In [70]:
print('No. of Users :-',n_users)
print('No. of Books :-',n_books)

No. of Users :- 278858
No. of Books :- 271355


In [72]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should contain only books that exist in our books dataset. Drop the remaining rows

In [77]:
print(books.columns)
print(books.shape[0])

Index(['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher'], dtype='object')
271355


In [79]:
print(ratings.columns)
print(ratings.shape[0])

Index(['userID', 'ISBN', 'bookRating'], dtype='object')
1149780


In [80]:
ratings = ratings[ratings['ISBN'].isin(books['ISBN'])]
ratings.shape

(1031130, 3)

### The ratings dataset should contain only ratings from users who exist in the users dataset. Drop the remaining rows

In [87]:
ratings = ratings[ratings['userID'].isin(users['userID'])]
ratings.shape

(1031130, 3)
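Both filters are `isin` checks against a key column, and the second one leaves the shape unchanged at (1031130, 3), meaning every remaining rating already belongs to a known user. A sketch of the same idea on made-up frames (all `toy_*` names are hypothetical), combining the two conditions into one mask:

```python
import pandas as pd

toy_books = pd.DataFrame({"ISBN": ["A1", "B2"]})
toy_users = pd.DataFrame({"userID": [1, 2, 3]})
toy_ratings = pd.DataFrame({
    "userID": [1, 2, 9, 3],
    "ISBN": ["A1", "ZZ", "B2", "B2"],
    "bookRating": [5, 0, 7, 8],
})

# Keep a rating only if both its book and its user are known.
mask = (
    toy_ratings["ISBN"].isin(toy_books["ISBN"])
    & toy_ratings["userID"].isin(toy_users["userID"])
)
filtered = toy_ratings[mask]
print(filtered)
```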

### Keep only explicit ratings from 1 to 10 and drop the 0s (implicit ratings) in column `bookRating`

In [89]:
ratings.sample(5)

Unnamed: 0,userID,ISBN,bookRating
672716,163687,385505833,5
386224,93130,345313097,0
85651,18133,804106304,0
296629,70186,316769495,0
320952,76499,679723420,10


In [90]:
ratings = ratings[ratings['bookRating']>=1]
ratings = ratings[ratings['bookRating']<=10]
print(ratings.shape)

(383839, 3)


In [91]:
ratings.sample(5)

Unnamed: 0,userID,ISBN,bookRating
478094,114368,505524503,5
275099,64241,590425919,8
12466,1211,2070364283,7
774834,187517,449006522,8
180767,39400,8403598505,8


### Find out which rating has been given highest number of times

In [104]:
ratings['bookRating'].value_counts()

8     91804
10    71225
7     66401
9     60776
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64

We can see that a rating of 8 has been given the highest number of times (91804).

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, consider only users who have rated at least 100 books

In [108]:
(ratings['userID'].value_counts() > 99 ).sum()

449

In [105]:
user_counts = ratings['userID'].value_counts()
users_100b = user_counts[user_counts > 99].index   # users with at least 100 ratings
users_100b.shape

(449,)

In [107]:
users_100b

Int64Index([ 11676,  98391, 189835, 153662,  23902, 235105,  76499, 171118,
             16795, 248718,
            ...
            117384,  36299, 169682, 211919, 156300,  95010,  33145,  26544,
            208406,  36609],
           dtype='int64', length=449)

In [109]:
ratings_users_100b = ratings[ratings['userID'].isin(users_100b)]
ratings_users_100b.shape

(103269, 3)

In [113]:
ratings_users_100b.sample(5)

Unnamed: 0,userID,ISBN,bookRating
289168,69078,0807002097,8
925591,225087,0849911788,10
369125,88693,051513628X,8
422327,100906,0345350499,9
390419,94347,0525477950,10


In [114]:
# Confirm the number of users with at least 100 ratings in the new ratings dataset
(ratings_users_100b['userID'].value_counts() > 99 ).sum()

449
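The same selection can be expressed in one pass with a grouped count broadcast back to each row. This is a sketch on made-up data with a lower threshold (`toy` and `min_ratings` are hypothetical; the notebook uses 100):

```python
import pandas as pd

toy = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3],
    "ISBN": list("abcdef"),
    "bookRating": [5, 6, 7, 8, 9, 10],
})
min_ratings = 3  # the notebook's threshold is 100

# transform("count") returns, for every row, how many ratings that row's user has.
counts = toy.groupby("userID")["bookRating"].transform("count")
active = toy[counts >= min_ratings]
print(active["userID"].unique())
```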

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [119]:
ex_ratings = ratings_users_100b.pivot(index = 'userID', columns = 'ISBN', values = 'bookRating').fillna(0)
ex_ratings.head()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,0001056107,0001845039,0001935968,0001944711,0001952803,0001953877,0002000547,0002005018,0002005050,0002005557,0002006588,0002115328,0002116286,0002118580,0002154900,0002158973,0002163713,0002176181,0002176432,0002179695,0002181924,0002184974,0002190915,0002197154,0002223929,0002228394,000223257X,0002233509,0002239183,0002240114,...,987960170X,9974643058,999058284X,9992003766,9992059958,9993584185,9994256963,9994348337,9997405137,9997406567,9997406990,999740923X,9997409728,9997411757,9997411870,9997412044,9997412958,9997507002,999750805X,9997508769,9997512952,9997519086,9997555635,9998914140,B00001U0CP,B00005TZWI,B00006CRTE,B00006I4OX,B00007FYKW,B00008RWPV,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID
2033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [120]:
ex_ratings.shape

(449, 66572)

In [125]:
# cross checking na values
print("Checking NA :-", ex_ratings.isna().sum().sum())

Checking NA :- 0


In [126]:
ex_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 449 entries, 2033 to 278418
Columns: 66572 entries, 0000913154 to B000234N3A
dtypes: float64(66572)
memory usage: 228.1 MB
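The dense matrix holds 449 × 66572 cells but only 103269 ratings, so roughly 0.35% of the cells are non-zero. A sparse matrix stores just those entries, and `scipy.sparse.linalg.svds` accepts sparse input. The following is a sketch on made-up data, not the notebook's method (`toy` and the category-based encoding are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

toy = pd.DataFrame({
    "userID": [2033, 2033, 2110, 2276],
    "ISBN": ["A", "B", "B", "C"],
    "bookRating": [8, 5, 10, 7],
})

# Categorical codes give each user and each ISBN a row/column position.
user_cat = toy["userID"].astype("category")
isbn_cat = toy["ISBN"].astype("category")

sparse_ratings = csr_matrix(
    (
        toy["bookRating"].astype(float).to_numpy(),
        (user_cat.cat.codes.to_numpy(), isbn_cat.cat.codes.to_numpy()),
    ),
    shape=(len(user_cat.cat.categories), len(isbn_cat.cat.categories)),
)

# For comparison: the dense pivot used in the notebook.
dense = toy.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)
print(sparse_ratings.nnz, dense.shape)
```

`user_cat.cat.categories` and `isbn_cat.cat.categories` keep the mapping from row and column positions back to the original identifiers.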


### Generate the predicted ratings using SVD with the number of singular values set to 50

In [0]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(ex_ratings.to_numpy(), k = 50)  # pass a NumPy array rather than a DataFrame

In [129]:
U

array([[-0.0287257 ,  0.00276858, -0.00163491, ...,  0.00468381,
         0.00146285,  0.00115382],
       [-0.00316014,  0.00280843, -0.00656488, ...,  0.00106169,
         0.00106767,  0.00048237],
       [ 0.02266219,  0.00362367,  0.00392174, ...,  0.00025662,
         0.00290815,  0.001983  ],
       ...,
       [ 0.01809824, -0.00283048,  0.01413323, ...,  0.00053466,
         0.00421407,  0.00171786],
       [ 0.01331746, -0.00389798,  0.00813826, ...,  0.00109245,
         0.00485681,  0.00238477],
       [-0.00396598, -0.00564164,  0.00528432, ...,  0.00287071,
         0.00101021,  0.0006922 ]])

In [130]:
sigma

array([147.92121613, 149.3438051 , 150.07400599, 152.20116297,
       152.87416391, 154.61308307, 154.80093432, 155.95760177,
       158.05646578, 159.21079484, 159.81670657, 162.01963916,
       162.77851768, 163.33054635, 166.02489324, 166.8162391 ,
       168.04972004, 170.77485167, 171.01325686, 173.29428498,
       174.57624968, 176.65724713, 178.61913749, 180.29517222,
       182.25079063, 184.10706957, 187.61687534, 189.75276623,
       190.96966388, 195.14643609, 199.83133018, 201.70083339,
       202.18713912, 203.48697581, 207.26449173, 209.92986988,
       213.23598777, 216.88280493, 224.26954726, 231.66186197,
       235.67095629, 249.9581775 , 252.02866425, 261.24756904,
       267.98197504, 281.0120779 , 293.69539562, 379.58327277,
       634.72875357, 680.30978318])

In [131]:
Vt

array([[-3.92054115e-04, -1.12885905e-03, -7.52572700e-04, ...,
         1.21259171e-04,  1.40728079e-03, -9.39266652e-04],
       [ 1.83812046e-04,  8.04527056e-05,  5.36351371e-05, ...,
        -2.03616968e-05, -1.13338633e-05,  8.89199391e-05],
       [-6.45220069e-04, -1.81903383e-04, -1.21268922e-04, ...,
         6.52663017e-05,  5.77432315e-05, -2.03720513e-03],
       ...,
       [ 9.70964741e-05,  8.85065735e-05,  5.90043823e-05, ...,
         9.93843072e-07, -1.57012541e-06,  1.95172262e-04],
       [ 1.24358727e-04,  1.40102246e-04,  9.34014973e-05, ...,
         6.42409858e-06,  3.12147472e-05,  1.36580955e-04],
       [ 5.27949012e-05,  4.51140032e-05,  3.00760021e-05, ...,
         1.98299389e-06,  9.61038459e-06,  4.83048138e-05]])
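To make the roles of `U`, `sigma` and `Vt` concrete, here is a sketch of a truncated SVD on a small random matrix (the matrix `toy`, the seed and `k = 3` are illustrative assumptions). Multiplying the three factors back together gives a rank-`k` approximation of the input, which is what the predicted ratings matrix is in this notebook:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
toy = rng.random((6, 8))   # stand-in for the users x books matrix
k = 3                      # number of singular values to keep

U, sigma, Vt = svds(toy, k=k)
print(U.shape, sigma.shape, Vt.shape)   # (6, k), (k,), (k, 8)

# Rank-k reconstruction: the low-rank "predicted" matrix.
approx = U @ np.diag(sigma) @ Vt

# Cross-check against the full SVD, whose singular values come sorted descending.
full_sigma = np.linalg.svd(toy, compute_uv=False)
error = np.linalg.norm(toy - approx)
print(np.sort(sigma)[::-1], full_sigma[:k], error)
```

The reconstruction error equals the energy in the discarded singular values, so a larger `k` reproduces the input more closely while a smaller `k` generalises more aggressively.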

### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [0]:
userID = 2110

In [0]:
user_id = 2  # positional (zero-based) row index into the ratings matrix and predicted matrix
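The first five user IDs shown by `ex_ratings.head()` are 2033, 2110, 2276, 4017 and 4385, so with zero-based positions user 2110 sits at position 1, while position 2 belongs to user 2276. Rather than hard-coding the position, it can be looked up from the index. A sketch on a made-up frame with the same five IDs (`toy_ratings` and `toy_preds` are hypothetical stand-ins):

```python
import pandas as pd

# Stand-in for ex_ratings: same leading user IDs, two dummy ISBN columns.
toy_ratings = pd.DataFrame(
    0.0,
    index=pd.Index([2033, 2110, 2276, 4017, 4385], name="userID"),
    columns=["A", "B"],
)

# Zero-based row position of a given user ID.
row_position = toy_ratings.index.get_loc(2110)
print(row_position)

# Keeping the userID index on the predictions frame allows label-based lookup,
# so no separate row number is needed.
toy_preds = pd.DataFrame(
    toy_ratings.to_numpy(),
    index=toy_ratings.index,
    columns=toy_ratings.columns,
)
print(toy_preds.loc[2110])
```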

### Get the predicted ratings for userID `2110` and sort them in descending order

In [134]:
sigma = np.diag(sigma)
sigma

array([[147.92121613,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        , 149.3438051 ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        ,   0.        , 150.07400599, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [  0.        ,   0.        ,   0.        , ..., 379.58327277,
          0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
        634.72875357,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        , 680.30978318]])

In [0]:
all_users_predicted_ratings = np.dot(np.dot(U, sigma), Vt)

In [136]:
preds_df = pd.DataFrame(all_users_predicted_ratings, columns = ex_ratings.columns)
preds_df.head()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,0001056107,0001845039,0001935968,0001944711,0001952803,0001953877,0002000547,0002005018,0002005050,0002005557,0002006588,0002115328,0002116286,0002118580,0002154900,0002158973,0002163713,0002176181,0002176432,0002179695,0002181924,0002184974,0002190915,0002197154,0002223929,0002228394,000223257X,0002233509,0002239183,0002240114,...,987960170X,9974643058,999058284X,9992003766,9992059958,9993584185,9994256963,9994348337,9997405137,9997406567,9997406990,999740923X,9997409728,9997411757,9997411870,9997412044,9997412958,9997507002,999750805X,9997508769,9997512952,9997519086,9997555635,9998914140,B00001U0CP,B00005TZWI,B00006CRTE,B00006I4OX,B00007FYKW,B00008RWPV,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.025341,-0.002146,-0.001431,-0.002146,-0.002146,0.002971,-0.00392,0.007035,0.007035,0.012316,0.054595,-0.01294,0.006665,-0.010082,0.053074,-0.001822,0.007457,-0.013443,0.017193,0.010776,0.00123,0.000226,0.005593,-0.015124,0.007875,-0.020879,-0.001908,-0.00164,-0.008402,0.010628,0.007035,-0.005855,-0.001822,0.00689,0.004294,0.000226,0.002308,-0.007902,-0.001662,0.003342,...,-0.008402,-0.011763,6e-06,-0.001192,0.030001,-0.040839,0.021513,-0.000155,0.004755,0.003854,-0.001669,0.000219,-0.001908,-0.001669,0.004798,0.006792,0.013479,0.007035,-0.001431,0.08038,0.013479,-0.001192,-0.001908,0.005367,0.164765,-0.032351,0.008362,0.010628,0.019123,-0.009564,0.00018,0.000226,0.042081,-0.016804,-0.080028,0.004746,0.028314,0.00012,-0.001693,0.067503
1,-0.010012,-0.003669,-0.002446,-0.003669,-0.003669,0.001075,0.00144,-0.0035,-0.0035,0.001612,0.001259,-0.005109,-0.000171,0.000662,-0.003104,9e-06,0.001131,0.000883,0.005803,0.001411,-0.000134,0.000403,0.000848,0.000993,0.002255,-0.005046,-0.003262,8e-06,0.000552,0.007652,-0.0035,0.002155,9e-06,0.001973,0.003213,0.000403,0.000251,-0.003468,-0.00265,0.00121,...,0.000552,0.000773,0.000784,-0.002039,-0.005804,0.120308,-0.005266,3.2e-05,0.021818,-0.000255,-0.002854,-0.00066,-0.003262,-0.002854,0.001023,0.031169,0.002922,-0.0035,-0.002446,0.008922,0.002922,-0.002039,-0.003262,0.004016,-0.015303,-0.020895,0.001265,0.007652,-0.004681,0.003764,-0.000363,0.000403,0.008142,0.001104,-0.029224,0.000999,0.002363,-0.000242,2.9e-05,-0.013059
2,-0.015054,-0.015457,-0.010304,-0.015457,-0.015457,0.007281,-0.014033,0.011941,0.011941,0.011796,-0.004049,0.023737,0.006648,0.003442,-0.036117,-0.002065,0.00807,0.00459,0.015218,0.010321,0.000954,0.001907,0.006052,0.005163,0.001332,0.018083,-0.013739,-0.001858,0.002869,0.013533,0.011941,0.016617,-0.002065,0.001166,0.030942,0.001907,0.007312,-0.01054,-0.004909,0.008192,...,0.002869,0.004016,0.003425,-0.008587,-0.025641,-0.012612,0.013397,0.002103,-0.037849,0.006883,-0.012022,0.013289,-0.013739,-0.012022,0.003307,-0.05407,0.012416,0.011941,-0.010304,0.011615,0.012416,-0.008587,-0.013739,0.038678,0.006883,-0.072622,0.003984,0.013533,0.011908,-0.009658,-0.000455,0.001907,0.047982,0.005737,0.117859,0.006945,0.003119,-0.000304,0.009009,-0.057692
3,-0.021499,0.035602,0.023735,0.035602,0.035602,0.030307,0.024215,-0.001053,-0.001053,0.067579,0.023864,-0.025956,0.033893,-0.005291,0.263229,0.001122,0.035044,-0.007054,0.099573,0.059131,0.00766,0.009912,0.026283,-0.007936,0.001811,0.04912,0.031646,0.00101,-0.004409,0.025253,-0.001053,0.033152,0.001122,0.001585,0.040178,0.009912,0.0413,0.060636,0.087445,0.034096,...,-0.004409,-0.006172,0.018133,0.019779,-0.020739,-0.091102,-0.059037,0.000851,-0.088358,0.02952,0.02769,0.024363,0.031646,0.02769,0.003501,-0.126226,0.063417,-0.001053,0.023735,0.029974,0.063417,0.019779,0.031646,0.050222,-0.059763,-0.143405,0.011302,0.025253,-0.052477,-0.037072,0.002971,0.009912,0.086248,-0.008818,0.016154,0.028848,-0.000125,0.001981,0.031201,-0.046664
4,0.002077,-0.007965,-0.00531,-0.007965,-0.007965,0.002947,0.003057,0.000231,0.000231,0.00608,-0.021917,-0.001408,0.006661,0.005652,-0.026389,3.6e-05,0.005533,0.007536,0.007065,0.00532,0.000705,0.001597,0.00415,0.008478,-0.001238,0.026154,-0.00708,3.3e-05,0.00471,-0.001892,0.000231,0.011853,3.6e-05,-0.001083,-0.004126,0.001597,0.004514,-0.000233,0.00208,0.003315,...,0.00471,0.006594,-0.000443,-0.004425,0.021233,0.005352,-0.007687,0.001126,-0.038776,0.009884,-0.006195,0.010187,-0.00708,-0.006195,-0.002295,-0.055394,0.006868,0.000231,-0.00531,-0.016919,0.006868,-0.004425,-0.00708,-0.005157,0.002095,-0.007807,-0.0035,-0.001892,-0.006833,0.003431,0.00212,0.001597,-0.012181,0.00942,0.673459,0.002591,-0.008229,0.001413,0.004918,0.047773


In [0]:
def recommend_books(predictions_df, userID, userId, books_df, original_ratings_df, num_recommendations = 5):
    # userID is the row label of the user in predictions_df;
    # userId is the actual user ID as it appears in original_ratings_df.
    user_row_number = userID
    sorted_user_predictions = predictions_df.loc[user_row_number].sort_values(ascending = False)
    
    # Books this user has already rated, joined with their metadata and sorted by rating.
    user_data = original_ratings_df[original_ratings_df.userID == (userId)]
    user_full = (user_data.merge(books_df, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
                sort_values(['bookRating'], ascending = False)
                )
    print('User {0} has already rated {1} books.'.format(userID, user_full.dropna().shape[0]))
    print('Recommending the highest {0} predicted ratings books not read yet.'.format(num_recommendations))
    
    # Unrated books ranked by predicted rating; iloc[..., :-1] drops the 'Predictions' column.
    recommendations = (books_df[~books_df['ISBN'].isin(user_full['ISBN'])].
                      merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
                           left_on = 'ISBN',
                           right_on = 'ISBN').
                      rename(columns = {user_row_number: 'Predictions'}).
                      sort_values('Predictions', ascending = False).
                      iloc[:num_recommendations, :-1])
    # user_full is returned twice so that existing five-value unpacking keeps working.
    return user_full, recommendations, sorted_user_predictions, user_data, user_full
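
The steps inside `recommend_books` can be traced on a tiny example. The sketch below is only an illustration under stated assumptions: `books_toy`, `ratings_toy`, `preds_toy` and all of their values are made up and are not part of the Book Crossing data. It sorts one user's predicted scores, removes the books that user has already rated, and keeps the top N of what is left.

```python
import pandas as pd

# Hypothetical stand-ins for books, ratings and preds_df (values are invented).
books_toy = pd.DataFrame({
    "ISBN": ["A", "B", "C", "D"],
    "bookTitle": ["Alpha", "Beta", "Gamma", "Delta"],
})
ratings_toy = pd.DataFrame({
    "userID": [10, 10, 20],
    "ISBN": ["A", "C", "B"],
    "bookRating": [8, 6, 9],
})
# One row per user (row label 0), one column per ISBN.
preds_toy = pd.DataFrame(
    [[0.9, 0.4, 0.7, 0.1]],
    columns=pd.Index(["A", "B", "C", "D"], name="ISBN"),
)

row, user_id, top_n = 0, 10, 2

# Step 1: predicted scores for this user, best first.
scores = preds_toy.loc[row].sort_values(ascending=False)

# Step 2: ISBNs the user has already rated.
rated = ratings_toy.loc[ratings_toy["userID"] == user_id, "ISBN"]

# Step 3: drop rated books, attach the scores, keep the top N.
candidates = books_toy[~books_toy["ISBN"].isin(rated)]
top = (
    candidates.merge(scores.rename("Predictions").reset_index(), on="ISBN", how="left")
    .sort_values("Predictions", ascending=False)
    .head(top_n)
)
print(top)
```

User 10 has rated books A and C, so only B and D remain as candidates, even though A and C carry the highest predicted scores.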

In [148]:
already_rated, predictions, sorted_user_predictions, user_data, user_full = recommend_books(preds_df, 2,2110,books, ratings, 10)

User 2 has already rated 103 books.
Recommending the highest 10 predicted ratings books not read yet.


In [163]:
already_rated.shape

(103, 7)

In [164]:
predictions.shape

(10, 5)

In [149]:
predictions

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
407,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
2116,0345350499,The Mists of Avalon,MARION ZIMMER BRADLEY,1987,Del Rey
2438,0440214041,The Pelican Brief,John Grisham,1993,Dell
455,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group
521,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,1998,Picador USA
20670,0345318862,Golem in the Gears (Xanth Novels (Paperback)),PIERS ANTHONY,1986,Del Rey
4810,0345313151,Bearing an Hourglass (Incarnations of Immortal...,Piers Anthony,1991,Del Rey Books
6320,0380752891,"Man from Mundania (Xanth Trilogy, No 12)",Piers Anthony,1990,Harper Mass Market Paperbacks
44448,051511605X,Undue Influence,Steven Paul Martini,1995,Jove Books
8977,043936213X,Harry Potter and the Sorcerer's Stone (Book 1),J. K. Rowling,2001,Scholastic


In [150]:
sorted_user_predictions

ISBN
0316666343    1.015397
059035342X    0.778665
0345350499    0.697309
0440214041    0.665439
044021145X    0.663549
                ...   
0380709562   -0.162346
0553213164   -0.173301
0440219078   -0.174497
0807508527   -0.204146
0743235150   -0.209452
Name: 2, Length: 66572, dtype: float64

### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly rated

In [151]:
user_data

Unnamed: 0,userID,ISBN,bookRating
14448,2110,0060987529,7
14449,2110,0064472779,8
14450,2110,0140022651,10
14452,2110,0142302163,8
14453,2110,0151008116,5
...,...,...,...
14603,2110,1558504184,8
14605,2110,1561008931,7
14606,2110,1565111575,10
14608,2110,1570420564,10


In [142]:
user_data.shape

(103, 3)

In [141]:
user_data.head()

Unnamed: 0,userID,ISBN,bookRating
14448,2110,0060987529,7
14449,2110,0064472779,8
14450,2110,0140022651,10
14452,2110,0142302163,8
14453,2110,0151008116,5


In [152]:
user_full

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
76,2110,067166865X,10,STAR TREK YESTERDAY'S SON (Star Trek: The Orig...,A.C. Crispin,1988,Audioworks
52,2110,0590109715,10,"The Andalite Chronicles (Elfangor's Journey, A...",Katherine Applegate,1997,Apple
64,2110,0590629786,10,"The Visitor (Animorphs, No 2)",K. A. Applegate,1996,Scholastic
63,2110,0590629778,10,"The Invasion (Animorphs, No 1)",K. A. Applegate,1996,Scholastic
61,2110,059046678X,10,The Yearbook,Peter Lerangis,1994,Scholastic
...,...,...,...,...,...,...,...
49,2110,0515134384,5,The Cat Who Went Up the Creek,Lilian Jackson Braun,2003,Jove Books
19,2110,037361490X,5,Age of War (Super Bolan #90),Don Pendleton,2003,Gold Eagle
4,2110,0151008116,5,Life of Pi,Yann Martel,2002,Harcourt
50,2110,0515136557,3,The Cat Who Brought Down the House,Lilian Jackson Braun,2004,Jove Books


### Combine `user_data` and the corresponding book data (`book_data`) into a single dataframe named `user_full_info`

In [0]:
user_full_info = user_data.merge(books, how = 'left', left_on ='ISBN' , right_on = 'ISBN')
user_full_info = user_full_info.drop(columns=['userID', 'bookRating'])

In [155]:
user_full_info.shape

(103, 5)

In [156]:
user_full_info.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0060987529,Confessions of an Ugly Stepsister : A Novel,Gregory Maguire,2000,Regan Books
1,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy
2,0140022651,Journey to the Center of the Earth,Jules Verne,1965,Penguin Books
3,0142302163,The Ghost Sitter,Peni R. Griffin,2002,Puffin Books
4,0151008116,Life of Pi,Yann Martel,2002,Harcourt
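
The left merge above keeps one output row per rating only if each ISBN occurs once in the book table. As a minimal sketch, assuming made-up frames `user_toy` and `books_toy` (not the notebook's data), the `validate` argument of `DataFrame.merge` can assert that assumption and raise instead of silently duplicating rows.

```python
import pandas as pd

# Hypothetical toy frames; the values are invented for illustration.
user_toy = pd.DataFrame({"userID": [1, 1], "ISBN": ["A", "B"], "bookRating": [7, 9]})
books_toy = pd.DataFrame({"ISBN": ["A", "B", "C"], "bookTitle": ["Alpha", "Beta", "Gamma"]})

# validate="many_to_one" checks that ISBN is unique on the right-hand side.
info = user_toy.merge(books_toy, how="left", on="ISBN", validate="many_to_one")
info = info.drop(columns=["userID", "bookRating"])
print(info)

# With a duplicated ISBN in the book table, the same merge raises MergeError.
books_dup = pd.concat([books_toy, books_toy.iloc[[0]]], ignore_index=True)
try:
    user_toy.merge(books_dup, how="left", on="ISBN", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("duplicate ISBN detected:", raised)
```

Without the check, a duplicated ISBN in the book table would produce more rows than the user has ratings, and a shape such as `(103, 5)` would no longer correspond to 103 rated books.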


### Get the top 10 recommendations for the userID above from the books that user has not already rated

In [162]:
predictions.shape

(10, 5)

In [157]:
predictions

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
407,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
2116,0345350499,The Mists of Avalon,MARION ZIMMER BRADLEY,1987,Del Rey
2438,0440214041,The Pelican Brief,John Grisham,1993,Dell
455,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group
521,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,1998,Picador USA
20670,0345318862,Golem in the Gears (Xanth Novels (Paperback)),PIERS ANTHONY,1986,Del Rey
4810,0345313151,Bearing an Hourglass (Incarnations of Immortal...,Piers Anthony,1991,Del Rey Books
6320,0380752891,"Man from Mundania (Xanth Trilogy, No 12)",Piers Anthony,1990,Harper Mass Market Paperbacks
44448,051511605X,Undue Influence,Steven Paul Martini,1995,Jove Books
8977,043936213X,Harry Potter and the Sorcerer's Stone (Book 1),J. K. Rowling,2001,Scholastic


In [0]:
sorted_user_predictions1 = preds_df.loc[2].sort_values(ascending = False)

In [0]:
recommendations = (books[~books['ISBN'].isin(user_full['ISBN'])].
                      merge(pd.DataFrame(sorted_user_predictions1).reset_index(), how = 'left',
                           left_on = 'ISBN',
                           right_on = 'ISBN').rename(columns = {2: 'Predictions'}).
                      sort_values('Predictions', ascending = False).
                   iloc[:10, :-1])

In [160]:
recommendations

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
407,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
2116,0345350499,The Mists of Avalon,MARION ZIMMER BRADLEY,1987,Del Rey
2438,0440214041,The Pelican Brief,John Grisham,1993,Dell
455,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group
521,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,1998,Picador USA
20670,0345318862,Golem in the Gears (Xanth Novels (Paperback)),PIERS ANTHONY,1986,Del Rey
4810,0345313151,Bearing an Hourglass (Incarnations of Immortal...,Piers Anthony,1991,Del Rey Books
6320,0380752891,"Man from Mundania (Xanth Trilogy, No 12)",Piers Anthony,1990,Harper Mass Market Paperbacks
44448,051511605X,Undue Influence,Steven Paul Martini,1995,Jove Books
8977,043936213X,Harry Potter and the Sorcerer's Stone (Book 1),J. K. Rowling,2001,Scholastic
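
The `iloc[:10, :-1]` slice removes the last column, which after the merge is `Predictions`, so the table shows the ranked books without their scores. The sketch below, which assumes an invented `books_toy` frame and `scores` series rather than the notebook's objects, shows the same ranking with and without the score column, and where a book with no prediction ends up after the left merge.

```python
import pandas as pd

# Hypothetical toy data; "C" deliberately has no predicted score.
books_toy = pd.DataFrame({"ISBN": ["A", "B", "C"], "bookTitle": ["Alpha", "Beta", "Gamma"]})
scores = pd.Series({"A": 0.2, "B": 0.8}, name=2)
scores.index.name = "ISBN"

ranked = (
    books_toy.merge(scores.rename("Predictions").reset_index(), on="ISBN", how="left")
    .sort_values("Predictions", ascending=False)  # NaN scores sort last by default
)

with_scores = ranked.head(2)            # keeps the Predictions column
without_scores = ranked.iloc[:2, :-1]   # same rows, last column dropped

print(with_scores)
print(without_scores)
```

Keeping the `Predictions` column makes it easier to compare the recommended rows against `sorted_user_predictions`.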
