In [98]:
import warnings
warnings.filterwarnings('ignore')

**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denoting higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the cells below to load the datasets

In [99]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 

In [100]:
#Loading data
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


### Check the number of records and features in each dataset

In [101]:
print ('shape of books data set record ',books.shape[0])
print ('shape of books data set features ',books.shape[1])

shape of books data set record  271360
shape of books data set features  8


In [102]:
print ('shape of users data set record ',users.shape[0])
print ('shape of users data set features ',users.shape[1])

shape of users data set record  278858
shape of users data set features  3


In [103]:
print ('shape of ratings data set record ',ratings.shape[0])
print ('shape of ratings data set features ',ratings.shape[1])

shape of ratings data set record  1149780
shape of ratings data set features  3


## Exploring books dataset

In [104]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [105]:
books = books.drop(columns=['imageUrlS', 'imageUrlM', 'imageUrlL'])

In [106]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [107]:
books['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

In [108]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
dtypes: object(5)
memory usage: 10.4+ MB


As the output above shows, this field contains some incorrect entries. It looks like the publisher names 'DK Publishing Inc' and 'Gallimard' have been loaded as yearOfPublication for a few rows because of errors in the CSV file.


Some entries are also stored as strings, while the same years appear as integers elsewhere. We will fix both issues in the following steps.
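One way to locate such entries without naming each bad string is to coerce the column to numbers and look at what fails. The sketch below uses a small made-up frame (`toy`), not the real `books` data, and is only meant to illustrate the idea.

```python
import pandas as pd

# Toy column mixing ints, numeric strings and misplaced publisher names
# (illustrative data only, not taken from books.csv).
toy = pd.DataFrame({
    'yearOfPublication': [2002, '1999', 'DK Publishing Inc', 0, 'Gallimard']
})

# Anything that cannot be parsed as a number becomes NaN.
years = pd.to_numeric(toy['yearOfPublication'], errors='coerce')

# Rows whose year could not be parsed.
bad_rows = toy[years.isna()]

# Keep the parseable rows and store the year as an integer.
clean = toy[years.notna()].assign(
    yearOfPublication=years.dropna().astype(int)
)

print(bad_rows)
print(clean)
```

This handles the string and integer versions of the same year in one pass, at the cost of hiding which specific strings were bad unless `bad_rows` is inspected.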

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [109]:
books.isna().sum().sum()


3

In [110]:
yr_bad_entries = books[books['yearOfPublication'].str.contains("DK Publishing Inc", na=False)]
yr_bad_entries




Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [111]:
books = books[~books['yearOfPublication'].str.contains("Gallimard",na=False)]
books = books[~books['yearOfPublication'].str.contains("DK Publishing Inc",na=False)]
books

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company
...,...,...,...,...,...
271355,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm)
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press


>- The number of rows drops by three (from 271360 to 271357) after dropping the rows that have publisher names in `yearOfPublication`.

### Change the datatype of yearOfPublication to 'int'

In [112]:
books['yearOfPublication'] = books['yearOfPublication'].astype(int)

In [113]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [114]:
books  = books[~books['publisher'].isna()]
books.shape

(271355, 5)

## Exploring Users dataset

In [115]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values for column `Age`

In [116]:
users['Age'].unique()

array([ nan,  18.,  17.,  61.,  26.,  14.,  25.,  19.,  46.,  55.,  32.,
        24.,  20.,  34.,  23.,  51.,  31.,  21.,  44.,  30.,  57.,  43.,
        37.,  41.,  54.,  42.,  50.,  39.,  53.,  47.,  36.,  28.,  35.,
        13.,  58.,  49.,  38.,  45.,  62.,  63.,  27.,  33.,  29.,  66.,
        40.,  15.,  60.,   0.,  79.,  22.,  16.,  65.,  59.,  48.,  72.,
        56.,  67.,   1.,  80.,  52.,  69.,  71.,  73.,  78.,   9.,  64.,
       103., 104.,  12.,  74.,  75., 231.,   3.,  76.,  83.,  68., 119.,
        11.,  77.,   2.,  70.,  93.,   8.,   7.,   4.,  81., 114., 230.,
       239.,  10.,   5., 148., 151.,   6., 101., 201.,  96.,  84.,  82.,
        90., 123., 244., 133.,  91., 128.,  94.,  85., 141., 110.,  97.,
       219.,  86., 124.,  92., 175., 172., 209., 212., 237.,  87., 162.,
       100., 156., 136.,  95.,  89., 106.,  99., 108., 210.,  88., 199.,
       147., 168., 132., 159., 186., 152., 102., 116., 200., 115., 226.,
       137., 207., 229., 138., 109., 105., 228., 18

The Age column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

### Values below 5 and above 90 do not make much sense for book ratings, so replace them with NaN

In [117]:
users['Age'] = users['Age'].apply(lambda x:np.nan if x > 90 else x)
users['Age'] = users['Age'].apply(lambda x:np.nan if x < 5 else x)
users['Age'].mean()

34.72384041634689

### Replace null values in column `Age` with mean

In [118]:
users['Age'] = users['Age'].replace(np.nan,users['Age'].mean())
users['Age'].isnull().sum()

0
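The age clean-up above can also be written without `apply`. The following is a minimal sketch on a made-up series (`toy_age`), assuming the same 5 to 90 bounds; it is not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy ages with a missing value, a too-low value and a too-high value
# (illustrative data only).
toy_age = pd.Series([np.nan, 18, 3, 120, 40])

# Keep values inside [5, 90]; everything else (including NaN) becomes NaN.
valid_age = toy_age.where(toy_age.between(5, 90))

# Impute the mean of the remaining valid ages, then store as int.
filled_age = valid_age.fillna(valid_age.mean()).astype(int)

print(filled_age.tolist())
```

Note that mean imputation concentrates a large share of users at a single age, which matters if `Age` is later used as a feature.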

### Change the datatype of `Age` to `int`

In [119]:
users['Age'] = users['Age'].astype(int)


In [120]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


## Exploring the Ratings Dataset

### Check the shape

In [121]:
ratings.shape

(1149780, 3)

In [122]:
n_users = users.shape[0]
n_books = books.shape[0]

In [123]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


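### The ratings dataset should only contain books that exist in the books dataset. Drop the remaining rows

In [124]:
# NOTE: this matches on the row index, not on ISBN values; a value-based filter would use ratings['ISBN'].isin(books['ISBN'])
ratings = ratings[ratings.index.isin(books.index)]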


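### The ratings dataset should only contain ratings from users that exist in the users dataset. Drop the remaining rows

In [125]:
# NOTE: this matches on the row index, not on userID values; a value-based filter would use ratings['userID'].isin(users['userID'])
ratings = ratings[ratings.index.isin(users.index)]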
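`ratings.index.isin(books.index)` compares row labels rather than ISBN or userID values, so it keeps a row whenever its position also exists in the other table. A value-based version of both filters could look like the sketch below. The frames `toy_books`, `toy_users` and `toy_ratings` are made up for illustration, and applying this to the real data would give different row counts from those shown in this notebook.

```python
import pandas as pd

# Toy tables (illustrative data only).
toy_books = pd.DataFrame({'ISBN': ['A1', 'B2', 'C3']})
toy_users = pd.DataFrame({'userID': [1, 2]})
toy_ratings = pd.DataFrame({
    'userID': [1, 2, 3, 1],
    'ISBN': ['A1', 'Z9', 'B2', 'C3'],
    'bookRating': [5, 0, 8, 10],
})

# Compare column values, not row labels.
known_book = toy_ratings['ISBN'].isin(toy_books['ISBN'])
known_user = toy_ratings['userID'].isin(toy_users['userID'])
filtered = toy_ratings[known_book & known_user]

# For contrast: the index-based mask keeps the unknown ISBN 'Z9'
# simply because row label 1 also exists in toy_books.
index_mask = toy_ratings.index.isin(toy_books.index)

print(filtered)
print(index_mask)
```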

### Keep only the explicit ratings from 1 to 10 and drop the 0s in column `bookRating`

In [126]:
ratings = ratings[ratings['bookRating']>=1]
ratings = ratings[ratings['bookRating']<=10]
ratings.shape

(106348, 3)

### Find out which rating has been given the highest number of times

In [127]:
ratings['bookRating'].value_counts()

8     25276
10    19815
7     19480
9     15329
5     12005
6      9367
4      2247
3      1543
2       724
1       562
Name: bookRating, dtype: int64
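To read the most frequent rating off programmatically rather than from the printed table, `idxmax` on the counts is enough. A small sketch on a made-up series (`toy`):

```python
import pandas as pd

# Toy ratings (illustrative data only).
toy = pd.Series([8, 10, 8, 7, 8, 10], name='bookRating')

counts = toy.value_counts()

# The index label with the largest count is the most common rating.
most_common_rating = counts.idxmax()

print(most_common_rating, counts.max())
```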

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [128]:
# Count ratings per user and keep the IDs of users with at least 100 ratings
user_counts = ratings['userID'].value_counts()
more = user_counts[user_counts > 99].index
more

Int64Index([11676, 23902, 16795, 56399, 35859,  3757, 23872, 49144, 60244,
            31826,
            ...
            56554, 23288,  2110, 53174, 10560, 36609, 26544, 49460, 55892,
            36299],
           dtype='int64', length=116)

In [129]:
(ratings['userID'].value_counts() > 99 ).sum()

116
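The thresholding step can be tried on a small example. The sketch below uses a made-up frame and a threshold of 3 instead of 100; `min_ratings`, `active_users` and `active_ratings` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy ratings: user 1 has three ratings, user 2 one, user 3 two
# (illustrative data only).
toy_ratings = pd.DataFrame({
    'userID': [1, 1, 1, 2, 3, 3],
    'ISBN': list('abcdef'),
    'bookRating': [5, 8, 9, 7, 10, 6],
})

min_ratings = 3  # the notebook uses 100

# Number of ratings per user, then the IDs that meet the threshold.
counts = toy_ratings['userID'].value_counts()
active_users = counts[counts >= min_ratings].index

# Keep only the ratings made by those users.
active_ratings = toy_ratings[toy_ratings['userID'].isin(active_users)]

print(active_ratings)
```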

In [130]:
ratings = ratings[ratings['userID'].isin(more)]
ratings

Unnamed: 0,userID,ISBN,bookRating
1456,277427,002542730X,10
1458,277427,003008685X,8
1461,277427,0060006641,10
1465,277427,0060542128,7
1474,277427,0061009059,9
...,...,...,...
265399,61147,3898119122,10
265400,61147,3923214111,10
265401,61147,9631330702,10
265402,61147,9976100248,10


In [131]:
ratings['userID'].value_counts()>99

11676    True
23902    True
16795    True
56399    True
35859    True
         ... 
26544    True
36609    True
49460    True
55892    True
36299    True
Name: userID, Length: 116, dtype: bool

In [132]:
book_data1 = pd.merge(ratings, books, on='ISBN')
book_data1

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
1,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
2,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
3,52584,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley &amp; Sons Inc
4,277427,003008685X,8,Pioneers,James Fenimore Cooper,1974,Thomson Learning
...,...,...,...,...,...,...,...
29220,61147,3872491377,10,Dictionary of gastronomy: For the translation ...,E Neiger,1992,Carl Gerber Verlag
29221,61147,3891045336,10,"Vogelnamen: Englisch-Deutsch-Latein, Deutsch-E...",Raimar Bernard,1993,Aula-Verlag
29222,61147,9631330702,10,A guide to birdwatching in Hungary,Gerard Gorman,1991,Corvina
29223,61147,9976100248,10,Kamusi ya wanyama na nyoka wa Tanzania =: A gl...,Musa Maimu,1982,Tanzania Pub. House


### Generating ratings matrix from explicit ratings


#### Note: since the training algorithm cannot handle NaNs, replace them with 0, which indicates the absence of a rating

In [133]:
from sklearn.model_selection import train_test_split

trainDF, tempDF = train_test_split(ratings, test_size = 0.2, random_state = 100)



In [134]:
testDF = tempDF.copy()
testDF

Unnamed: 0,userID,ISBN,bookRating
57603,11676,3821829699,7
140371,31556,0590485857,7
82398,16966,0812551621,6
158277,35859,0373263562,8
173316,37644,044840561X,10
...,...,...,...
55425,11676,1551430096,8
154714,35433,3442415578,7
211515,49144,8422691493,5
63054,12824,0375416218,7


In [135]:
ratings = pd.concat([trainDF, tempDF])
ratings

Unnamed: 0,userID,ISBN,bookRating
226679,52584,068480042X,8
130561,30276,0373764596,8
41577,10560,0451458559,9
238211,55490,0898794161,8
47544,11676,0345321065,10
...,...,...,...
55425,11676,1551430096,8
154714,35433,3442415578,7
211515,49144,8422691493,5
63054,12824,0375416218,7


In [137]:
R_df = ratings.pivot(index = 'userID', columns = 'ISBN', values = 'bookRating').fillna(0)
R_df.head()


ISBN,9022906116,0000000000,00000000000,0001046438,000104687X,0001047213,0001047973,0001048082,0001055666,0001845039,...,O439060737,O446611638,O590418262,O67174142X,O9088446X,Q380708353,X000000000,ZR903CX0003,"\0432534220\""""","\2842053052\"""""
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [138]:
R_df.isna().sum().sum()

0
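To make the shape of the ratings matrix concrete, here is a minimal sketch of the same `pivot` and `fillna(0)` step on a made-up frame (`toy_ratings`). It assumes each (userID, ISBN) pair occurs at most once, since `pivot` raises an error on duplicates.

```python
import pandas as pd

# Toy long-format ratings (illustrative data only).
toy_ratings = pd.DataFrame({
    'userID': [1, 1, 2, 3],
    'ISBN': ['A1', 'B2', 'B2', 'C3'],
    'bookRating': [5, 8, 7, 10],
})

# One row per user, one column per book; unrated cells become 0.
toy_matrix = (
    toy_ratings
    .pivot(index='userID', columns='ISBN', values='bookRating')
    .fillna(0)
)

print(toy_matrix)
```

Because 0 also falls below every explicit rating, the factorisation treats "not rated" as a very low score. This is a simplification made by this approach.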

### Generate the predicted ratings using SVD with 50 singular values

In [139]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_df.values, k = 50)  # pass the underlying NumPy array

In [140]:
sigma

array([104.91481077, 105.34776482, 105.73758018, 106.19092993,
       106.93367875, 107.25037768, 108.79693364, 109.21883131,
       109.34110865, 110.11058242, 110.75468188, 112.48934078,
       114.15599089, 115.44737305, 116.4335195 , 116.97347873,
       117.70489997, 119.94932496, 120.51116605, 121.84996405,
       122.57971135, 123.98945371, 124.76889643, 125.92373266,
       127.14947398, 127.52891667, 128.72070994, 128.97554671,
       130.2802364 , 133.42341713, 134.74240925, 139.98776332,
       142.46274561, 142.56073249, 144.2831655 , 146.65076276,
       150.10521593, 158.27407731, 160.93394681, 164.90547921,
       175.61601509, 176.06762228, 184.67047969, 192.90220215,
       199.41325337, 220.21841493, 233.51184847, 286.18848506,
       293.01742971, 699.09563409])

### Take a particular user_id

### Let's find the recommendations for the user with ID `2110`

#### Note: Execute the cells below to load the variables

In [141]:
userID = 2110

In [142]:
user_id = 2  # 0-based row label used in the predicted matrix; note that R_df.head() above lists userID 2110 at position 1

### Get the predicted ratings for userID `2110` and sort them in descending order

In [144]:
sigma = np.diag(sigma)

In [145]:
sigma

array([[104.91481077,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        , 105.34776482,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        ,   0.        , 105.73758018, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [  0.        ,   0.        ,   0.        , ..., 286.18848506,
          0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
        293.01742971,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        , 699.09563409]])

In [146]:
all_users_predicted_ratings = np.dot(np.dot(U, sigma), Vt)
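The product `U · diag(sigma) · Vt` is a rank-k approximation of the ratings matrix. The sketch below illustrates this on a small made-up matrix `R`. As an assumption for portability it uses `np.linalg.svd` and truncates the result, rather than `scipy.sparse.linalg.svds` as in the notebook. NumPy returns singular values in descending order, whereas the `sigma` printed above is in ascending order.

```python
import numpy as np

# Toy 4x4 ratings matrix, 0 = not rated (illustrative data only).
R = np.array([[5, 0, 3, 0],
              [4, 0, 0, 1],
              [0, 2, 0, 5],
              [0, 3, 4, 4]], dtype=float)

k = 2  # number of singular values to keep (the notebook uses 50)

# Full SVD, then keep the k largest singular triplets.
U_full, s_full, Vt_full = np.linalg.svd(R, full_matrices=False)
U_k, s_k, Vt_k = U_full[:, :k], s_full[:k], Vt_full[:k, :]

# Rank-k reconstruction: these are the "predicted ratings".
R_hat = U_k @ np.diag(s_k) @ Vt_k

# Frobenius error equals the norm of the discarded singular values.
err = np.linalg.norm(R - R_hat)

print(np.round(R_hat, 2))
print(err)
```

Cells that were 0 in `R` generally get non-zero values in `R_hat`, and these filled-in values are what the recommender ranks.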

In [147]:
preds_df = pd.DataFrame(all_users_predicted_ratings, columns = R_df.columns)
preds_df

ISBN,9022906116,0000000000,00000000000,0001046438,000104687X,0001047213,0001047973,0001048082,0001055666,0001845039,...,O439060737,O446611638,O590418262,O67174142X,O9088446X,Q380708353,X000000000,ZR903CX0003,"\0432534220\""""","\2842053052\"""""
0,-0.004781,-0.006147,-0.005464,-0.009786,-0.006524,-0.009786,-0.009786,-0.008699,-0.006147,-0.013427,...,-0.006830,-0.004781,0.136111,-0.018376,-0.005464,0.018021,-0.006830,-0.000683,-0.004098,-0.004781
1,0.003978,0.005115,0.004547,-0.000097,-0.000064,-0.000097,-0.000097,-0.000086,0.005115,-0.001646,...,0.005683,0.003978,-0.002004,-0.006160,0.004547,-0.000364,0.005683,0.000568,0.003410,0.003978
2,-0.001755,-0.002257,-0.002006,-0.001762,-0.001175,-0.001762,-0.001762,-0.001566,-0.002257,0.001163,...,-0.002508,-0.001755,-0.022913,0.001868,-0.002006,-0.002472,-0.002508,-0.000251,-0.001505,-0.001755
3,-0.000019,-0.000025,-0.000022,-0.000087,-0.000058,-0.000087,-0.000087,-0.000077,-0.000025,0.000019,...,-0.000027,-0.000019,-0.000318,0.000464,-0.000022,-0.001004,-0.000027,-0.000003,-0.000016,-0.000019
4,-0.001914,-0.002461,-0.002187,-0.000828,-0.000552,-0.000828,-0.000828,-0.000736,-0.002461,0.006011,...,-0.002734,-0.001914,0.005400,-0.007949,-0.002187,0.246078,-0.002734,-0.000273,-0.001640,-0.001914
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111,-0.000432,-0.000556,-0.000494,0.000334,0.000222,0.000334,0.000334,0.000297,-0.000556,-0.001008,...,-0.000617,-0.000432,0.001092,-0.002521,-0.000494,-0.051990,-0.000617,-0.000062,-0.000370,-0.000432
112,0.002837,0.003648,0.003242,-0.012918,-0.008612,-0.012918,-0.012918,-0.011483,0.003648,0.016197,...,0.004053,0.002837,-0.042630,-0.011356,0.003242,0.139566,0.004053,0.000405,0.002432,0.002837
113,-0.002968,-0.003816,-0.003392,-0.011977,-0.007985,-0.011977,-0.011977,-0.010646,-0.003816,-0.005561,...,-0.004240,-0.002968,0.008887,-0.025091,-0.003392,-0.026021,-0.004240,-0.000424,-0.002544,-0.002968
114,0.010595,0.013622,0.012109,0.001956,0.001304,0.001956,0.001956,0.001739,0.013622,0.056711,...,0.015136,0.010595,0.087364,0.018192,0.012109,-0.044840,0.015136,0.001514,0.009081,0.010595
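`preds_df` is labelled 0, 1, 2, ... by row, so looking up a user requires knowing that user's position in `R_df`. In the `R_df.head()` output above, userID 2110 is the second row, which is position 1. A sketch of two ways to avoid hard-coding the position, using made-up frames (`toy_R`, `toy_preds`) and placeholder predictions:

```python
import pandas as pd

# Toy ratings matrix indexed by userID (illustrative data only).
toy_R = pd.DataFrame(
    [[5.0, 0.0], [0.0, 3.0], [4.0, 4.0]],
    index=pd.Index([2033, 2110, 2276], name='userID'),
    columns=pd.Index(['A1', 'B2'], name='ISBN'),
)

# Option 1: look up the 0-based row position of a userID.
row_pos = toy_R.index.get_loc(2110)

# Option 2: carry the userID index over to the predictions frame so that
# rows can be selected by userID directly. The values here are a stand-in
# for real SVD predictions.
toy_preds = pd.DataFrame(
    toy_R.to_numpy() * 0.5, index=toy_R.index, columns=toy_R.columns
)
user_preds = toy_preds.loc[2110].sort_values(ascending=False)

print(row_pos)
print(user_preds)
```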


In [148]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33346 entries, 226679 to 133466
Data columns (total 3 columns):
userID        33346 non-null int64
ISBN          33346 non-null object
bookRating    33346 non-null int64
dtypes: int64(2), object(1)
memory usage: 1.0+ MB


In [149]:
def recommend_books(predictions_df, userID, userId, books_df, original_ratings_df, num_recommendations = 10):
    # userID is the row label in predictions_df; userId is the actual user ID in the ratings data
    user_row_number = userID
    sorted_user_predictions = predictions_df.loc[user_row_number].sort_values(ascending = False)
    
    # Books the user has already rated, joined with the book details
    user_data = original_ratings_df[original_ratings_df.userID == userId]
    user_full = (user_data.merge(books_df, how = 'left', on = 'ISBN').
                sort_values(['bookRating'], ascending = False)
                )
    print('User {0} has already rated {1} books.'.format(userID, user_full.dropna().shape[0]))
    print('Recommending the highest {0} predicted ratings books not already rated.'.format(num_recommendations))
    
    # Rank the books not yet rated by predicted rating and keep the top N (the Predictions column is dropped)
    recommendations = (books_df[~books_df['ISBN'].isin(user_full['ISBN'])].
                      merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left', on = 'ISBN').
                      rename(columns = {user_row_number: 'Predictions'}).
                      sort_values('Predictions', ascending = False).
                      iloc[:num_recommendations, :-1])
    return user_full, recommendations, sorted_user_predictions, user_data, user_full

In [150]:

already_rated, predictions, sorted_user_predictions, user_data, user_full = recommend_books(preds_df, 2,2110,books, ratings, 10)

User 2 has already rated 103 books.
Recommending the highest 10 predicted ratings books not already rated.
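The core of the recommendation step is "drop what the user has already rated, then take the highest predictions". A stripped-down sketch of that logic on a made-up series follows. The names `user_preds`, `already_rated_isbns` and `top_n` are illustrative and not part of `recommend_books`.

```python
import pandas as pd

# Toy predicted ratings for one user, keyed by ISBN (illustrative data only).
user_preds = pd.Series({'A1': 4.2, 'B2': 9.1, 'C3': 7.5, 'D4': 0.3})

# ISBNs this user has already rated.
already_rated_isbns = ['B2']

# Remove rated books, then keep the two highest predictions.
top_n = (
    user_preds
    .drop(index=already_rated_isbns, errors='ignore')
    .nlargest(2)
)

print(top_n)
```

Joining `top_n.index` back to the books table then yields titles and authors, which is what the merge inside `recommend_books` does.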


### Already rated by user 

In [151]:
already_rated

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
51,2110,0345362276,10,Wizard at Large (Rookies Series),Terry Brooks,1989,Del Rey Books
86,2110,0441000150,10,Quantum Leap: The Wall (Quantum Leap),Ashley McConnell,1994,Ace Books
26,2110,1565111575,10,Return of the Jedi: The Original Radio Drama,Anthony Daniels,1996,Highbridge Audio
29,2110,0345283929,10,Empire Strikes Back Wars,Donald F Glut,1980,Ballantine Books
32,2110,0439222303,10,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple
...,...,...,...,...,...,...,...
88,2110,0515134384,5,The Cat Who Went Up the Creek,Lilian Jackson Braun,2003,Jove Books
40,2110,037361490X,5,Age of War (Super Bolan #90),Don Pendleton,2003,Gold Eagle
102,2110,0151008116,5,Life of Pi,Yann Martel,2002,Harcourt
59,2110,0515136557,3,The Cat Who Brought Down the House,Lilian Jackson Braun,2004,Jove Books


### Recommendations for the user

In [152]:
predictions

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
3642,451456734,Queen of the Darkness (Black Jewels Trilogy),Anne Bishop,2000,Roc
16079,886774632,By the Sword (Kerowyn's Tale),Mercedes Lackey,1991,Daw Books
16078,886775639,"Winds of Change (The Mage Winds, Book 2)",Mercedes Lackey,1994,Daw Books
12424,886775167,"Winds of Fate (The Mage Winds, Book 1)",Mercedes Lackey,1997,Daw Books
16077,886776120,"Winds of Fury (The Mage Winds, Book 3)",Mercedes Lackey,1994,Daw Books
16148,671524313,The Girlfriends' Guide to Pregnancy,Vicki Iovine,1995,Pocket
3681,446612790,2nd Chance,James Patterson,2003,Warner Vision
1796,440236673,The Brethren,John Grisham,2000,Island
3245,440221471,The Runaway Jury,JOHN GRISHAM,1997,Dell
16449,671797050,FIRST WIVES CLUB,Olivia Goldsmith,1993,Pocket


In [153]:
#user_records[220]

In [154]:
sorted_user_predictions

ISBN
0451456734    10.799932
0886774632    10.797675
0886775167    10.604632
0886776120    10.604632
0886775639    10.604632
                ...    
0064400018    -0.334082
0064400557    -0.347000
0786817070    -0.349782
0385484518    -0.397992
0743418174    -0.496362
Name: 2, Length: 27878, dtype: float64

### User Data 

In [156]:
user_data

Unnamed: 0,userID,ISBN,bookRating
14502,2110,0394707745,8
14599,2110,0898863538,8
14579,2110,068808527X,8
14493,2110,0373765649,8
14483,2110,0373642857,6
...,...,...,...
14508,2110,0439240700,10
14601,2110,093317490X,7
14524,2110,0486270718,10
14503,2110,0394843509,10


In [157]:
user_data.shape 

(103, 3)

In [1036]:
user_full

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
51,2110,0345362276,10,Wizard at Large (Rookies Series),Terry Brooks,1989,Del Rey Books
86,2110,0441000150,10,Quantum Leap: The Wall (Quantum Leap),Ashley McConnell,1994,Ace Books
26,2110,1565111575,10,Return of the Jedi: The Original Radio Drama,Anthony Daniels,1996,Highbridge Audio
29,2110,0345283929,10,Empire Strikes Back Wars,Donald F Glut,1980,Ballantine Books
32,2110,0439222303,10,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple
...,...,...,...,...,...,...,...
88,2110,0515134384,5,The Cat Who Went Up the Creek,Lilian Jackson Braun,2003,Jove Books
40,2110,037361490X,5,Age of War (Super Bolan #90),Don Pendleton,2003,Gold Eagle
102,2110,0151008116,5,Life of Pi,Yann Martel,2002,Harcourt
59,2110,0515136557,3,The Cat Who Brought Down the House,Lilian Jackson Braun,2004,Jove Books


### Create a dataframe named `user_data_2100` containing the books explicitly rated by userID `2110`

In [158]:
user_data_2100 = ratings[ratings['userID'] == 2110]
user_data_2100.head()

Unnamed: 0,userID,ISBN,bookRating
14502,2110,0394707745,8
14599,2110,0898863538,8
14579,2110,068808527X,8
14493,2110,0373765649,8
14483,2110,0373642857,6


In [159]:
user_data_2100.shape

(103, 3)

### Combine `user_data_2100` and the corresponding book data (`books`) into a single dataframe named `user_full_info`

In [160]:
user_full_info = user_data_2100.merge(books, how='left', on='ISBN')
user_full_info = user_full_info.drop(columns=['userID', 'bookRating'])

In [161]:
user_full_info.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0394707745,Hungry Hikers Book of Good Cooking,Gretchen McHugh,1982,Alfred A. Knopf
1,0898863538,"A Hiker's Companion: 12,000 Miles of Trail-Tes...",Cindy Ross,1993,Mountaineers Books
2,068808527X,Close Friends,Peter Jenkins,1989,Harpercollins
3,0373765649,Breathless For The Bachelor (Silhouette Desire...,Cindy Gerard,2004,Silhouette
4,0373642857,"Final Strike (Executioner #285) (Executioner,...",Don Pendleton,2002,Gold Eagle


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [163]:
predictions

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
3642,451456734,Queen of the Darkness (Black Jewels Trilogy),Anne Bishop,2000,Roc
16079,886774632,By the Sword (Kerowyn's Tale),Mercedes Lackey,1991,Daw Books
16078,886775639,"Winds of Change (The Mage Winds, Book 2)",Mercedes Lackey,1994,Daw Books
12424,886775167,"Winds of Fate (The Mage Winds, Book 1)",Mercedes Lackey,1997,Daw Books
16077,886776120,"Winds of Fury (The Mage Winds, Book 3)",Mercedes Lackey,1994,Daw Books
16148,671524313,The Girlfriends' Guide to Pregnancy,Vicki Iovine,1995,Pocket
3681,446612790,2nd Chance,James Patterson,2003,Warner Vision
1796,440236673,The Brethren,John Grisham,2000,Island
3245,440221471,The Runaway Jury,JOHN GRISHAM,1997,Dell
16449,671797050,FIRST WIVES CLUB,Olivia Goldsmith,1993,Pocket


In [164]:
sorted_user_predictions1 = preds_df.loc[2].sort_values(ascending = False)

In [165]:
recommendations = (books[~books['ISBN'].isin(user_full['ISBN'])].
                      merge(pd.DataFrame(sorted_user_predictions1).reset_index(), how = 'left',
                           left_on = 'ISBN',
                           right_on = 'ISBN').rename(columns = {2: 'Predictions'}).
                      sort_values('Predictions', ascending = False).
                   iloc[:10, :-1])

In [166]:
recommendations

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
3642,451456734,Queen of the Darkness (Black Jewels Trilogy),Anne Bishop,2000,Roc
16079,886774632,By the Sword (Kerowyn's Tale),Mercedes Lackey,1991,Daw Books
16078,886775639,"Winds of Change (The Mage Winds, Book 2)",Mercedes Lackey,1994,Daw Books
12424,886775167,"Winds of Fate (The Mage Winds, Book 1)",Mercedes Lackey,1997,Daw Books
16077,886776120,"Winds of Fury (The Mage Winds, Book 3)",Mercedes Lackey,1994,Daw Books
16148,671524313,The Girlfriends' Guide to Pregnancy,Vicki Iovine,1995,Pocket
3681,446612790,2nd Chance,James Patterson,2003,Warner Vision
1796,440236673,The Brethren,John Grisham,2000,Island
3245,440221471,The Runaway Jury,JOHN GRISHAM,1997,Dell
16449,671797050,FIRST WIVES CLUB,Olivia Goldsmith,1993,Pocket


ISBN
 9022906116      -0.001755
0000000000       -0.002257
00000000000      -0.002006
0001046438       -0.001762
000104687X       -0.001175
                    ...   
Q380708353       -0.002472
X000000000       -0.002508
ZR903CX0003      -0.000251
\0432534220\""   -0.001505
\2842053052\""   -0.001755
Name: 2, Length: 27878, dtype: float64