**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and implicit ratings are expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

In [1]:
#Import all the necessary modules
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

#### Execute the below cell to load the datasets

In [2]:
#Loading data
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on_bad_lines="warn" keeps the same behaviour (skip malformed rows and warn about them).
books = pd.read_csv("books.csv", sep=";", on_bad_lines="warn", encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


### Check the number of records and features in each dataset

In [3]:
books.shape

(271360, 8)

In [4]:
users.shape

(278858, 3)

In [5]:
ratings.shape

(1149780, 3)

## Exploring books dataset

In [6]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [7]:
books = books.drop(labels=["imageUrlS","imageUrlM","imageUrlL"], axis=1)

In [8]:
print("Books Shape: ", books.shape)
books.head()

Books Shape:  (271360, 5)


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [9]:
books["yearOfPublication"].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

In [10]:
print("Total unique values for yearOfPublication: ", len(books["yearOfPublication"].unique()))

Total unique values for yearOfPublication:  202


As the output above shows, this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of errors in the CSV file.

Some entries are also strings, so the same year appears both as a number and as a string. The following steps fix these problems.
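
As a side note, the mixed string and integer years can also be normalised in one pass. The following is a minimal sketch on a toy Series (the values are made up to mirror the problem described above); it is an alternative to the drop-and-cast steps used in this notebook, not a replacement for them.

```python
import pandas as pd

# Toy column mixing ints, numeric strings and a misplaced publisher name
# (hypothetical values, not taken from books.csv).
years = pd.Series([2002, "1999", "0", "DK Publishing Inc", 1985], dtype=object)

# errors="coerce" turns anything that cannot be parsed as a number into NaN
# instead of raising an exception.
years_clean = pd.to_numeric(years, errors="coerce")

# Rows whose year could not be parsed; these are the ones to inspect or drop.
bad_rows = years_clean.isna()

# The remaining rows can be cast safely to an integer dtype.
years_int = years_clean[~bad_rows].astype("int32")

print(bad_rows.tolist())
print(years_int.tolist())
```

With this approach the malformed rows are found by the parse failure itself, so the offending publisher names do not need to be known in advance.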

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [11]:
books[books["yearOfPublication"].isin(["DK Publishing Inc"])].shape

(2, 5)

In [12]:
books[books["yearOfPublication"].isin(["DK Publishing Inc"])].head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [13]:
dropIndexs = books[books["yearOfPublication"].isin(["DK Publishing Inc","Gallimard"])].index

In [14]:
books.drop(dropIndexs, inplace=True)

In [15]:
print("Books Shape: ", books.shape)

Books Shape:  (271357, 5)


### Change the datatype of yearOfPublication to 'int'

In [16]:
books["yearOfPublication"] = books["yearOfPublication"].astype('int32')

In [17]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [18]:
books[books["publisher"].isnull()].head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


In [19]:
books.dropna(subset= ['publisher'], inplace = True)

In [20]:
books[books["publisher"].isnull()].head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


In [21]:
books.isna().sum()

ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            0
dtype: int64

In [22]:
# the "bookAuthor" column also contains a null value
books.dropna(subset= ['bookAuthor'], inplace = True)

In [23]:
books.isna().sum()

ISBN                 0
bookTitle            0
bookAuthor           0
yearOfPublication    0
publisher            0
dtype: int64

In [24]:
print("Books Shape: ", books.shape)

Books Shape:  (271354, 5)


## Exploring Users dataset

In [25]:
print("User Shape:", users.shape)
users.head()

User Shape: (278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [26]:
age = users["Age"].unique()
print(sorted(age))

[nan, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0, 110.0, 111.0, 113.0, 114.0, 115.0, 116.0, 118.0, 119.0, 123.0, 124.0, 127.0, 128.0, 132.0, 133.0, 136.0, 137.0, 138.0, 140.0, 141.0, 143.0, 146.0, 147.0, 148.0, 151.0, 152.0, 156.0, 157.0, 159.0, 162.0, 168.0, 172.0, 175.0, 183.0, 186.0, 189.0, 199.0, 200.0, 201.0, 204.0, 207.0, 208.0, 209.0, 210.0, 212.0, 219.0, 220.0, 223.0, 226.0

The Age column has some invalid entries, such as NaN, 0, and implausibly high values of 100 and above.

### Ages below 5 or above 90 are implausible for book ratings, so replace them with NaN

In [27]:
users["Age"].isnull().sum()

110762

In [28]:
users.loc[(users['Age'] < 5) | (users['Age'] > 90), 'Age'] = np.nan
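
The same masking can be written without in-place assignment. This is a small sketch on a toy Series (the ages are made-up values), assuming the same inclusive bounds of 5 and 90 as above.

```python
import numpy as np
import pandas as pd

# Toy ages, including missing, too-low and too-high values (hypothetical data).
ages = pd.Series([np.nan, 0.0, 4.0, 5.0, 34.0, 90.0, 91.0, 226.0])

# between() is inclusive on both ends by default and is False for NaN,
# so where() keeps values in [5, 90] and sets everything else to NaN.
ages_valid = ages.where(ages.between(5, 90))

print(ages_valid.tolist())
```

Because `where` returns a new Series, the original column stays untouched until the result is assigned back.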

In [29]:
age = users["Age"].unique()
print(sorted(age))

[nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0]


### Replace null values in column `Age` with mean

In [30]:
users['Age'] = users['Age'].fillna(users['Age'].mean())

### Change the datatype of `Age` to `int`

In [31]:
users['Age'] = users['Age'].astype('int')
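
One detail worth knowing about the two steps above: casting floats with `astype('int')` truncates the fractional part, so an imputed mean is rounded down. The sketch below uses toy values (not the real Age column) to show the difference between truncating and rounding first.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing value (hypothetical data).
ages = pd.Series([20.0, 30.0, np.nan, 57.0])

# mean() skips NaN: (20 + 30 + 57) / 3 = 35.67 (approximately)
mean_age = ages.mean()
filled = ages.fillna(mean_age)

# astype drops the fractional part, so 35.67 becomes 35.
truncated = filled.astype("int64")

# Rounding first gives the nearest integer, 36.
rounded = filled.round().astype("int64")

print(truncated.tolist(), rounded.tolist())
```

Either choice is defensible for this dataset; the point is only that the cast is not a rounding operation.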

In [32]:
age = users["Age"].unique()
print(sorted(age))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


## Exploring the Ratings Dataset

### Check the shape

In [33]:
ratings.shape

(1149780, 3)

In [34]:
n_users = users.shape[0]
n_books = books.shape[0]

print("Users: ",n_users)
print("Books: ",n_books)

Users:  278858
Books:  271354


In [35]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should only contain books that exist in the books dataset. Drop the remaining rows

In [36]:
dropIndexs = ratings[~ratings["ISBN"].isin(books["ISBN"])].index

In [37]:
print("Num of entries to be dropped due to mismatch in book ISBN: ", dropIndexs.shape[0])

Num of entries to be dropped due to mismatch in book ISBN:  118651


In [38]:
ratings.drop(dropIndexs, inplace=True)

In [39]:
ratings.shape

(1031129, 3)
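
The ISBN filter above can be illustrated on a toy example. This sketch uses made-up frames (`books_toy` and `ratings_toy` are hypothetical names) and builds a boolean mask instead of collecting indices to drop; the effect is the same.

```python
import pandas as pd

# Toy catalogue and ratings (hypothetical data).
books_toy = pd.DataFrame({"ISBN": ["A1", "B2", "C3"]})
ratings_toy = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A1", "Z9", "B2", "Z9"],
    "bookRating": [0, 5, 8, 7],
})

# True for ratings whose ISBN appears in the catalogue.
known = ratings_toy["ISBN"].isin(books_toy["ISBN"])

# Keep the matching rows and count the ones that would be dropped.
ratings_kept = ratings_toy[known].reset_index(drop=True)
n_dropped = int((~known).sum())

print(n_dropped)
print(ratings_kept)
```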

### The ratings dataset should only contain ratings from users that exist in the users dataset. Drop the remaining rows

In [40]:
dropIndexs = ratings[~ratings["userID"].isin(users["userID"])].index

In [41]:
print("Num of entries to be dropped due to mismatch in userID: ", dropIndexs.shape[0])

Num of entries to be dropped due to mismatch in userID:  0


In [42]:
ratings.drop(dropIndexs, inplace=True)
ratings.shape

(1031129, 3)

In [43]:
print("Unique ratings-ISBN:" , len(ratings["ISBN"].unique()))
print("Unique books-ISBN:" , len(books["ISBN"].unique()))

Unique ratings-ISBN: 270145
Unique books-ISBN: 271354


In [44]:
print("Unique ratings-userID:" , len(ratings["userID"].unique()))
print("Unique users-userID:" , len(users["userID"].unique()))

Unique ratings-userID: 92106
Unique users-userID: 278858


### Consider only ratings from 1 to 10 and leave out the 0s in column `bookRating`

In [45]:
ratings["bookRating"].unique()
# all ratings are in the range 1-10 or equal to 0; there is no rating outside this range.

array([ 0,  5,  3,  6,  7,  9,  8, 10,  1,  4,  2], dtype=int64)

In [46]:
rating_1to10 = ratings[ratings["bookRating"] != 0]

### Find out which rating has been given the highest number of times

In [47]:
ratings_counts = rating_1to10["bookRating"].value_counts().sort_values(ascending=False)
ratings_counts

8     91803
10    71225
7     66401
9     60776
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64

In [48]:
print("Rating highest number of times: ", ratings_counts.index[0])

Rating highest number of times:  8


### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [49]:
ratings.shape

(1031129, 3)

In [50]:
rated_count = ratings["userID"].value_counts()
usersId_below100rating = rated_count[rated_count<100].index

In [51]:
dropIndexs = ratings[ratings["userID"].isin(usersId_below100rating)].index

In [52]:
ratings.drop(dropIndexs, inplace=True)
ratings.shape

(592991, 3)
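
The activity filter can be sketched on toy data as follows. The frame and the threshold of 3 are made up for illustration (the notebook uses 100); the idea is to map each row to its user's rating count and keep rows at or above the threshold.

```python
import pandas as pd

# Toy ratings (hypothetical data): user 1 has 3 ratings, user 2 has 1, user 3 has 2.
ratings_toy = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3],
    "ISBN": ["A", "B", "C", "A", "B", "C"],
    "bookRating": [5, 0, 8, 7, 0, 9],
})
min_ratings = 3  # illustrative threshold; the notebook uses 100

# Number of ratings per user, then broadcast back onto each row.
counts = ratings_toy["userID"].value_counts()
rows_per_user = ratings_toy["userID"].map(counts)

# Keep only rows belonging to sufficiently active users.
active = ratings_toy[rows_per_user >= min_ratings]

print(active)
```

Note that the count includes implicit ratings (0s), as it does in the cells above.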

In [53]:
# ratings now contains only users who have rated at least 100 books
ratings["userID"].value_counts()

11676     11144
198711     6456
153662     5814
98391      5777
35859      5646
212898     4289
278418     3996
76352      3329
110973     2971
235105     2943
16795      2920
230522     2857
234623     2594
204864     2461
36836      2458
245963     2395
185233     2382
55492      2361
52584      2340
232131     2329
227447     2312
102967     2285
129358     2275
98741      2235
171118     2219
60244      2204
190925     2088
135149     2061
231210     1986
189835     1966
          ...  
4157        102
42400       102
159732      102
37293       102
254377      101
54206       101
77270       101
187410      101
98723       101
110746      101
65584       101
126697      101
175636      101
201548      101
266283      101
44089       101
259114      101
152435      101
30806       101
28619       100
123597      100
236955      100
160735      100
26443       100
39345       100
70999       100
143909      100
13935       100
100782      100
246507      100
Name: userID, Length: 16

### Generating ratings matrix from explicit ratings


#### Note: Training algorithms cannot handle NaNs, so missing ratings are replaced by 0, which indicates the absence of a rating

In [54]:
ratings.isnull().sum()

userID        0
ISBN          0
bookRating    0
dtype: int64

### Generate the predicted ratings using SVD with 50 singular values

In [55]:
from sklearn.model_selection import train_test_split
trainDF, tempDF = train_test_split(ratings, test_size=0.2, random_state=100)
# Keep an untouched copy of the held-out rows as testDF
testDF = tempDF.copy()
# Mask the held-out ratings in a separate copy so they are hidden from training
tempDF = tempDF.copy()
tempDF["bookRating"] = np.nan

In [56]:
#Creating ratings with trainDF and tempDF
df_ratings = pd.concat([trainDF, tempDF]).reset_index()
print("Shape:" ,df_ratings.shape)
df_ratings.head()

Shape: (592991, 4)


Unnamed: 0,index,userID,ISBN,bookRating
0,54828,11676,0851402607,0.0
1,909465,221445,0307113248,0.0
2,1132590,271622,0316781460,8.0
3,468285,112001,0373511167,0.0
4,531467,128208,0618526412,0.0


In [57]:
# Reshape the ratings into a matrix with one row per user and one column per book.
# Fill unavailable values with 0.0; the resulting matrix is mostly zeros (sparse)
R_df = df_ratings.pivot(index = 'userID', columns ='ISBN', values = 'bookRating').fillna(0)
R_df.tail()

ISBN,0000913154,0001010565,0001046438,000104687X,0001047213,0001047647,0001047663,0001047868,0001047973,000104799X,...,B0001GDNCK,B0001GMSV2,B0001I1JII,B0001I1KOG,B0001PBXMS,B0001PIOX4,B000234N3A,B000234N76,B000234NC6,B00029DGGO
userID,,,,,,,,,,,,,,,,,,,,,
277478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
277639,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278188,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
R_df.shape

(1659, 198934)
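
The dense pivot above holds 1659 × 198934 values, roughly 330 million cells, almost all of them zero. A possible alternative is to build a SciPy sparse matrix directly from the long-format ratings. The following is a minimal sketch on toy data (`ratings_toy` is a hypothetical frame), using categorical codes as row and column positions.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format ratings (hypothetical data).
ratings_toy = pd.DataFrame({
    "userID": [10, 10, 20, 30],
    "ISBN": ["A", "C", "B", "C"],
    "bookRating": [5.0, 8.0, 7.0, 9.0],
})

# Categoricals give each distinct user and book a stable integer code
# (categories are sorted, so codes follow sorted order).
user_cat = ratings_toy["userID"].astype("category")
book_cat = ratings_toy["ISBN"].astype("category")

# Build the matrix from (value, (row, column)) triplets; unlisted cells are 0.
R_sparse = csr_matrix(
    (
        ratings_toy["bookRating"].to_numpy(),
        (user_cat.cat.codes.to_numpy(), book_cat.cat.codes.to_numpy()),
    ),
    shape=(len(user_cat.cat.categories), len(book_cat.cat.categories)),
)

print(R_sparse.toarray())
```

`user_cat.cat.categories` and `book_cat.cat.categories` play the role of the pivot's index and columns, so a row or column position can be mapped back to a userID or ISBN. One difference to keep in mind: duplicate (user, book) pairs are summed here, whereas `pivot` raises an error on them.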

In [59]:
from scipy.sparse.linalg import svds
# Truncated singular value decomposition:
# compute the k largest singular values and vectors of the ratings matrix.
# k: number of singular values and vectors to compute; must satisfy 1 <= k < min(R_df.shape)
# sigma holds the singular values
U, sigma, Vt = svds(R_df.to_numpy(), k = 50)

In [60]:
print(U.shape)
print(sigma)

(1659, 50)
[131.96294707 133.36434343 135.62609053 137.04791183 137.69129822
 138.72807432 139.21195085 141.08514561 143.4642771  143.58916699
 144.61544351 145.58748345 146.64200502 147.52207607 149.55442063
 150.25537026 151.39988441 152.52483272 153.86773366 154.48554689
 156.27970496 156.50373899 160.17606769 161.75120559 163.81442648
 166.51271555 168.39958187 170.08446333 172.0307092  173.43542893
 174.24007386 175.28657769 180.05794095 182.6158473  183.75951803
 188.26387287 191.26980664 195.19128804 201.76203255 208.3054046
 210.72341193 222.66402581 227.44877641 234.27932516 241.23540489
 251.13753004 262.60226215 338.33013548 570.55500256 608.14921106]


In [61]:
# Convert the vector of singular values into a diagonal matrix
sigma = np.diag(sigma)

In [62]:
print("U shape: ", U.shape)
print("sigma shape: ", sigma.shape)
print("Vt shape: ", Vt.shape)

U shape:  (1659, 50)
sigma shape:  (50, 50)
Vt shape:  (50, 198934)
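
To make the shapes above concrete, here is a small sketch of a truncated SVD on a toy matrix. The matrix is constructed to have rank 2, so keeping k = 2 singular values reconstructs it up to floating-point error. In the output above the singular values appear in ascending order; the sketch sorts them explicitly rather than relying on the order `svds` returns.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# A 6 x 8 matrix of rank exactly 2 (product of 6 x 2 and 2 x 8 factors).
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 8))

# Compute the 2 largest singular triplets.
U, sigma, Vt = svds(A, k=2)

# Reorder the triplets by descending singular value.
order = np.argsort(sigma)[::-1]
U, sigma, Vt = U[:, order], sigma[order], Vt[order, :]

# Low-rank reconstruction: U (6 x 2) @ diag(sigma) (2 x 2) @ Vt (2 x 8).
A_hat = U @ np.diag(sigma) @ Vt

print(U.shape, sigma.shape, Vt.shape)
print(np.allclose(A, A_hat, atol=1e-6))
```

For the real ratings matrix the rank is far higher than 50, so the product is only an approximation, and that approximation is what serves as the predicted ratings.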


In [63]:
# Reconstruct the rank-50 approximation of the ratings matrix with matrix products.
# No user means were subtracted before the SVD, so none need to be added back here.
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
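
A common variant, which this notebook does not apply, is to subtract each user's mean rating before the decomposition and add it back afterwards. The sketch below shows that variant on a toy matrix (made-up values); as in the notebook, zeros are treated as ordinary values when computing the mean, which is itself a simplification.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy user x book matrix, 0 meaning "no rating" (hypothetical data).
R = np.array([
    [5, 0, 8, 0, 0, 7],
    [0, 7, 0, 6, 0, 0],
    [9, 0, 0, 0, 4, 0],
    [0, 3, 0, 0, 0, 10],
    [6, 0, 7, 0, 0, 8],
], dtype=float)

# One mean per user (row), zeros included.
user_means = R.mean(axis=1)

# Centre each row so it has mean zero.
R_centred = R - user_means[:, np.newaxis]

# Rank-2 approximation of the centred matrix.
U, sigma, Vt = svds(R_centred, k=2)

# Add the user means back to return to the rating scale.
preds = U @ np.diag(sigma) @ Vt + user_means[:, np.newaxis]

print(preds.round(2))
```

Centring removes each user's overall rating level, so the factors model deviations from that level instead of the level itself.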

In [64]:
# predicted book ratings
preds_df.head()

ISBN,0000913154,0001010565,0001046438,000104687X,0001047213,0001047647,0001047663,0001047868,0001047973,000104799X,...,B0001GDNCK,B0001GMSV2,B0001I1JII,B0001I1KOG,B0001PBXMS,B0001PIOX4,B000234N3A,B000234N76,B000234NC6,B00029DGGO
0,0.0,0.0,0.003337,0.002225,0.003337,0.0,0.0,0.0,0.003337,0.001557,...,0.0,0.005863,0.0,0.001423,0.0,0.0,-0.021418,0.0,0.0,0.0
1,0.0,0.0,-0.003259,-0.002173,-0.003259,0.0,0.0,0.0,-0.003259,0.002642,...,0.0,0.002155,0.0,0.001651,0.0,0.0,-0.007241,0.0,0.0,0.0
2,0.0,0.0,0.004528,0.003019,0.004528,0.0,0.0,0.0,0.004528,0.005333,...,0.0,0.008032,0.0,0.004172,0.0,0.0,-0.019157,0.0,0.0,0.0
3,0.0,0.0,0.006791,0.004527,0.006791,0.0,0.0,0.0,0.006791,0.003157,...,0.0,0.002506,0.0,0.001703,0.0,0.0,-0.005419,0.0,0.0,0.0
4,0.0,0.0,0.011036,0.007358,0.011036,0.0,0.0,0.0,0.011036,0.002979,...,0.0,0.004345,0.0,0.001875,0.0,0.0,-0.014563,0.0,0.0,0.0


### Take a particular user ID

### Let's find the recommendations for the user with ID `2110`

#### Note: Execute the cells below to load the variables

In [65]:
userID = 2110

In [66]:
user_id = 2  # 1-based row position of userID 2110 in the ratings matrix and the predicted matrix (the 2nd row)

### Get the predicted ratings for userID `2110` and sort them in descending order

In [67]:
# Get and sort the user's predictions
num_recommendations = 10
user_row_number = user_id - 1  # convert the 1-based row position to a 0-based index for iloc
sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False)
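
The row position is hard-coded above. It can also be looked up from the index of the ratings matrix, which avoids off-by-one mistakes. This is a sketch on a toy frame; `R_toy` and the neighbouring user IDs 254 and 2276 are hypothetical, and only 2110 comes from the notebook.

```python
import pandas as pd

# Toy ratings matrix indexed by userID (hypothetical data).
R_toy = pd.DataFrame(
    [[0.0, 5.0], [9.0, 0.0], [0.0, 7.0]],
    index=pd.Index([254, 2110, 2276], name="userID"),
    columns=pd.Index(["A", "B"], name="ISBN"),
)

target_user = 2110

# get_loc returns the 0-based position of the label in a unique index.
row_pos = R_toy.index.get_loc(target_user)

print(row_pos)
print(R_toy.iloc[row_pos])
```

In the notebook the equivalent call would be on `R_df.index`, since `preds_df` has the same row order as `R_df`.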

In [68]:
sorted_user_predictions.head(10)

ISBN
0316666343    0.526064
0312195516    0.379362
0385504209    0.274298
0142001740    0.250667
0060928336    0.231939
067976402X    0.231578
0446310786    0.226412
0743418174    0.212866
0156027321    0.201935
0375727345    0.191695
Name: 1, dtype: float64

### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly interacted with

In [69]:
# Get the user's data 
user_data = df_ratings[df_ratings.userID == (userID)]

In [70]:
user_data.head()

Unnamed: 0,index,userID,ISBN,bookRating
3155,14479,2110,0373638078,9.0
12828,14530,2110,0553073273,9.0
20308,14552,2110,0590629794,10.0
26444,14600,2110,0929634063,0.0
29232,14461,2110,0345307674,10.0


In [71]:
user_data.shape

(163, 4)

In [72]:
print ('User {0} has already rated {1} books.'.format(userID, user_data.shape[0]))

User 2110 has already rated 163 books.


### Combine `user_data` and the corresponding book data (from `books`) in a single dataframe named `user_full_info`

In [73]:
user_full_info = (user_data.merge(books, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
             sort_values(['bookRating'], ascending=False))

In [74]:
books.shape

(271354, 5)

In [75]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [76]:
print("Shape, user full info : ", user_full_info.shape)
user_full_info.head()

Shape, user full info :  (163, 8)


Unnamed: 0,index,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
63,14593,2110,0812505042,10.0,The Time Machine,H. G. Wells,1995,Tor Books
89,14576,2110,0679805265,10.0,Long Shot (Three Investigators Crimebusters (P...,Megan Stine,1993,Random House Children's Books
35,14496,2110,0380570009,10.0,How to Talk So Kids Will Listen and Listen So ...,Adele Faber,1991,Back Bay Books
49,14608,2110,1570420564,10.0,A Dream Is a Wish Your Heart Makes: My Story,Annette Funicello,1994,Time Warner Audio Books
33,14464,2110,0345335287,10.0,The Black Unicorn (Magic Kingdom of Landover N...,Terry Brooks,1990,Del Rey Books


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [77]:
# Recommend the books with the highest predicted ratings that the user has not rated yet.
# Remove the books the user has already rated from the book list,
# then left-join the sorted user predictions on 'ISBN'
# and rename the prediction column to 'Predictions'.
recommendations = (books[~books['ISBN'].isin(user_full_info['ISBN'])].
     merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left', left_on = 'ISBN', right_on = 'ISBN').
     rename(columns = {user_row_number: 'Predictions'}))

#sort the prediction values
recommendations = (recommendations.sort_values('Predictions', ascending = False).iloc[:num_recommendations, :-1])
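
The core of the recommendation step (exclude what the user has already rated, then take the top N by predicted score) can be sketched in isolation. The scores and ISBN placeholders below are made up; the merge with book metadata is left out to keep the example short.

```python
import pandas as pd

# Toy predicted scores for one user, indexed by ISBN (hypothetical data).
preds = pd.Series({"A": 0.9, "B": 0.1, "C": 0.7, "D": 0.4}, name="Predictions")

# Books this user has already rated.
already_rated = {"A", "D"}
top_n = 2

# Drop already-rated books, then keep the highest remaining scores.
unseen = preds[~preds.index.isin(already_rated)]
recs = unseen.sort_values(ascending=False).head(top_n)

print(recs)
```

Filtering before sorting matters: taking the top N first and filtering afterwards could return fewer than N recommendations.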

In [78]:
recommendations

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
405,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
519,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,1998,Picador USA
744,0385504209,The Da Vinci Code,Dan Brown,2003,Doubleday
353,0142001740,The Secret Life of Bees,Sue Monk Kidd,2003,Penguin Books
1098,0060928336,Divine Secrets of the Ya-Ya Sisterhood: A Novel,Rebecca Wells,1997,Perennial
1911,067976402X,Snow Falling on Cedars,David Guterson,1995,Vintage Books USA
37,0446310786,To Kill a Mockingbird,Harper Lee,1988,Little Brown & Company
1486,0743418174,Good in Bed,Jennifer Weiner,2002,Washington Square Press
560,0156027321,Life of Pi,Yann Martel,2003,Harvest Books
4368,0375727345,House of Sand and Fog,Andre Dubus III,2000,Vintage Books
