**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and implicit ratings are expressed as 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the cell below to load the datasets

In [1]:
import pandas as pd
import numpy as np

from scipy.sparse.linalg import svds

In [45]:
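# Load the three tables; rows with too many fields are skipped with a warning.
# Note: `error_bad_lines` was removed in pandas 2.0; on pandas >= 1.3 use on_bad_lines="warn" instead.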
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
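The `Skipping line ...` messages above come from rows that contain more `;`-separated fields than the header declares. A minimal sketch of the same behaviour on a tiny in-memory sample, assuming pandas 1.3 or later (where `on_bad_lines` takes over from `error_bad_lines`). The sample text and the names `sample_csv` and `sample` are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical sample: the second data row has one field too many.
sample_csv = io.StringIO(
    "ISBN;bookTitle;bookAuthor\n"
    "0001;Title A;Author A\n"
    "0002;Title B;Author B;stray field\n"
    "0003;Title C;Author C\n"
)

# on_bad_lines="skip" silently drops malformed rows instead of raising an error;
# dtype=str keeps leading zeros in identifiers such as ISBNs.
sample = pd.read_csv(sample_csv, sep=";", on_bad_lines="skip", dtype=str)
print(sample.shape)
```

Reading identifiers as strings is a design choice worth considering for the real files too, since ISBNs are labels rather than numbers.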


### Check the number of records and features in each dataset

In [46]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB


In [47]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [48]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


## Exploring books dataset

In [49]:
books.head()

,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [50]:
books.drop(['imageUrlS','imageUrlM','imageUrlL'],axis=1,inplace=True)

In [51]:
books.head()

,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [52]:
books.groupby('yearOfPublication').count()

yearOfPublication,ISBN,bookTitle,bookAuthor,publisher
0,3570,3570,3570,3570
1806,1,1,1,1
1900,1,1,1,1
1901,7,7,7,7
1902,2,2,2,2
1904,1,1,1,1
1906,1,1,1,1
1908,1,1,1,1
1910,1,1,1,1
1911,10,10,10,10


The output above shows that this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as `yearOfPublication` because of errors in the CSV file.

Also, some entries are stored as strings while the same years are stored as numbers elsewhere. We will fix these issues in the following steps.
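One way to surface such stray values without listing them by hand is to attempt a numeric conversion and look at what fails. A minimal sketch on a made-up series (`years` is hypothetical and only mimics the mix of strings, numbers and publisher names described above):

```python
import pandas as pd

# Hypothetical column mixing numeric strings, integers and a stray publisher name.
years = pd.Series(["2002", 1999, "DK Publishing Inc", "0"], name="yearOfPublication")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_number = pd.to_numeric(years, errors="coerce")

# The entries that failed to parse are the ones to inspect or drop.
bad_entries = years[as_number.isna()]
print(bad_entries.tolist())
```

Note that this only catches non-numeric text; implausible years such as 0 still parse as numbers and need a separate range check.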

### Check the rows having `'DK Publishing Inc'` or `'Gallimard'` as `yearOfPublication`

In [53]:
books[(books.yearOfPublication=='DK Publishing Inc')|(books.yearOfPublication=='Gallimard')]

,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [54]:
indexname=books[(books.yearOfPublication=='DK Publishing Inc')|(books.yearOfPublication=='Gallimard')].index
indexname

Int64Index([209538, 220731, 221678], dtype='int64')

In [55]:
books.drop(indexname,inplace=True)
books[(books.yearOfPublication=='DK Publishing Inc')|(books.yearOfPublication=='Gallimard')]

,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


### Change the datatype of yearOfPublication to 'int'

In [56]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
dtype: object

In [57]:
books['yearOfPublication']=books['yearOfPublication'].astype('int64')

In [58]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271357 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271357 non-null object
bookTitle            271357 non-null object
bookAuthor           271356 non-null object
yearOfPublication    271357 non-null int64
publisher            271355 non-null object
dtypes: int64(1), object(4)
memory usage: 12.4+ MB


### Drop NaNs in `'publisher'` column


In [59]:
books.drop(books[books.publisher.isna()].index,inplace=True)

In [60]:
books[books.publisher.isna()]

,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


## Exploring Users dataset

In [61]:
print(users.shape)
users.head()

(278858, 3)


,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [62]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


### Get all unique values in ascending order for column `Age`

In [63]:
sorted(users.Age.unique())

[nan,
 0.0,
 1.0,
 2.0,
 3.0,
 4.0,
 5.0,
 6.0,
 7.0,
 8.0,
 9.0,
 10.0,
 11.0,
 12.0,
 13.0,
 14.0,
 15.0,
 16.0,
 17.0,
 18.0,
 19.0,
 20.0,
 21.0,
 22.0,
 23.0,
 24.0,
 25.0,
 26.0,
 27.0,
 28.0,
 29.0,
 30.0,
 31.0,
 32.0,
 33.0,
 34.0,
 35.0,
 36.0,
 37.0,
 38.0,
 39.0,
 40.0,
 41.0,
 42.0,
 43.0,
 44.0,
 45.0,
 46.0,
 47.0,
 48.0,
 49.0,
 50.0,
 51.0,
 52.0,
 53.0,
 54.0,
 55.0,
 56.0,
 57.0,
 58.0,
 59.0,
 60.0,
 61.0,
 62.0,
 63.0,
 64.0,
 65.0,
 66.0,
 67.0,
 68.0,
 69.0,
 70.0,
 71.0,
 72.0,
 73.0,
 74.0,
 75.0,
 76.0,
 77.0,
 78.0,
 79.0,
 80.0,
 81.0,
 82.0,
 83.0,
 84.0,
 85.0,
 86.0,
 87.0,
 88.0,
 89.0,
 90.0,
 91.0,
 92.0,
 93.0,
 94.0,
 95.0,
 96.0,
 97.0,
 98.0,
 99.0,
 100.0,
 101.0,
 102.0,
 103.0,
 104.0,
 105.0,
 106.0,
 107.0,
 108.0,
 109.0,
 110.0,
 111.0,
 113.0,
 114.0,
 115.0,
 116.0,
 118.0,
 119.0,
 123.0,
 124.0,
 127.0,
 128.0,
 132.0,
 133.0,
 136.0,
 137.0,
 138.0,
 140.0,
 141.0,
 143.0,
 146.0,
 147.0,
 148.0,
 151.0,
 152.0,
 156.0,
 157.0,
 159.0,


The `Age` column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

### Values below 5 and above 90 do not make much sense for our book rating case, so replace them with NaN

In [64]:
# Use .loc so the assignment targets the DataFrame itself rather than a possible copy
users.loc[users['Age'] < 5, 'Age'] = np.nan


In [65]:
users.loc[users['Age'] > 90, 'Age'] = np.nan


### Replace null values in column `Age` with mean

In [66]:
# Assign the result back instead of relying on an in-place operation on a column selection
users['Age'] = users['Age'].fillna(users['Age'].mean())

In [67]:
users.head()

,userID,Location,Age
0,1,"nyc, new york, usa",34.72384
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",34.72384
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",34.72384
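The masking and imputation steps above can also be written without chained indexing. A minimal sketch on a made-up frame (`ages` is hypothetical; the 5 to 90 bounds are the ones used in this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value and two implausible ones.
ages = pd.DataFrame({"Age": [np.nan, 18.0, 0.0, 17.0, 151.0]})

# Keep ages within [5, 90]; everything else (including NaN) becomes NaN.
ages["Age"] = ages["Age"].where(ages["Age"].between(5, 90))

# Fill the gaps with the mean of the remaining valid ages.
ages["Age"] = ages["Age"].fillna(ages["Age"].mean())
print(ages["Age"].tolist())
```

Mean imputation keeps the column complete but concentrates many users on a single value, which is visible in the `34.72384` entries above.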


### Change the datatype of `Age` to `int`

In [68]:
users['Age']=users['Age'].astype('int64')

In [69]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null int64
dtypes: int64(2), object(1)
memory usage: 6.4+ MB


In [70]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


## Exploring the Ratings Dataset

### Check the shape

In [71]:
ratings.shape

(1149780, 3)

In [72]:
n_users = users.shape[0]
n_books = books.shape[0]

In [73]:
ratings.head(5)

,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should contain only books that exist in our books dataset. Drop the remaining rows

In [74]:
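# Caution: `.index` of a boolean mask is the full index, so this call returns an
# empty copy and leaves `ratings` unchanged; the actual filtering is done in cell 76.
ratings.drop((ratings['ISBN'].isin(books.ISBN)!=True).index)

,userID,ISBN,bookRating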


In [75]:
print(' Total Records in DataSet    :',ratings.ISBN.count(),'\n'
     ,'Total Rec with NO Book Ref. :',ratings[ratings['ISBN'].isin(books.ISBN)!=True].ISBN.count(),'\n'
     ,'Total Rec with Book ref     :',ratings.ISBN.count()-ratings[ratings['ISBN'].isin(books.ISBN)!=True].ISBN.count(),'\n'
     )


 Total Records in DataSet    : 1149780 
 Total Rec with NO Book Ref. : 118650 
 Total Rec with Book ref     : 1031130 



In [76]:
ratings.drop(ratings[ratings['ISBN'].isin(books.ISBN)!=True].index,inplace=True)
print('Total records in Dataset after clearing non book references :',ratings.ISBN.count())

Total records in Dataset after clearing non book references : 1031130
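The same filter can be expressed as a boolean mask, which avoids building an index of rows to drop. A minimal sketch with made-up frames (`books_demo` and `ratings_demo` are hypothetical stand-ins for `books` and `ratings`):

```python
import pandas as pd

# Hypothetical catalogue with two known ISBNs.
books_demo = pd.DataFrame({"ISBN": ["A1", "B2"]})

# Hypothetical ratings; "Z9" does not exist in the catalogue.
ratings_demo = pd.DataFrame(
    {"userID": [1, 1, 2], "ISBN": ["A1", "Z9", "B2"], "bookRating": [5, 0, 8]}
)

# True for ratings whose ISBN is present in the catalogue.
known = ratings_demo["ISBN"].isin(books_demo["ISBN"])

# Keep only the rows with a known book; (~known).sum() counts the orphans.
filtered = ratings_demo[known]
print(len(filtered), (~known).sum())
```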


### The ratings dataset should contain ratings only from users who exist in the users dataset. Drop the remaining rows

In [77]:
print(' Total Records in DataSet    :',ratings.userID.count(),'\n'
     ,'Total Rec with NO User Ref. :',ratings[ratings['userID'].isin(users.userID)!=True].userID.count(),'\n'
     ,'Total Rec with User ref     :',ratings.userID.count()-ratings[ratings['userID'].isin(users.userID)!=True].userID.count(),'\n'
     )


 Total Records in DataSet    : 1031130 
 Total Rec with NO User Ref. : 0 
 Total Rec with User ref     : 1031130 



**Observation:** Every rating has a matching user, so no action is required.

### Consider only ratings from 1 to 10 and drop the 0s (implicit ratings) in column `bookRating`

In [78]:
# Check count of each rating before dropping 0 book rating for books
ratings.groupby('bookRating').count()

bookRating,userID,ISBN
0,647291,647291
1,1481,1481
2,2375,2375
3,5118,5118
4,7617,7617
5,45355,45355
6,31687,31687
7,66401,66401
8,91804,91804
9,60776,60776


In [79]:
ratings.drop(ratings[(ratings.bookRating==0)].index,inplace=True)

In [80]:
# Check count of each rating after dropping 0 book rating for books
ratings.groupby('bookRating').count()

bookRating,userID,ISBN
1,1481,1481
2,2375,2375
3,5118,5118
4,7617,7617
5,45355,45355
6,31687,31687
7,66401,66401
8,91804,91804
9,60776,60776
10,71225,71225


### Find out which rating has been given the highest number of times

In [81]:
Final_ratings=ratings.groupby('bookRating').count()

In [82]:
Final_ratings.sort_values(by='ISBN',ascending=False).head()

bookRating,userID,ISBN
8,91804,91804
10,71225,71225
7,66401,66401
9,60776,60776
5,45355,45355
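For a single column, `value_counts` gives the same ranking more directly. A minimal sketch on a made-up series (`demo_ratings` is hypothetical):

```python
import pandas as pd

# Hypothetical explicit ratings.
demo_ratings = pd.Series([8, 10, 8, 7, 8, 10], name="bookRating")

# value_counts() sorts by frequency, most common first.
counts = demo_ratings.value_counts()

# idxmax() returns the rating value with the highest count.
most_common = counts.idxmax()
print(most_common, counts.loc[most_common])
```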


### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [83]:
user_100=ratings.groupby('userID').count()

In [84]:
user_100.shape

(68091, 2)

In [85]:
user_100.drop(user_100[user_100['ISBN']<100].index, inplace=True)

In [86]:
user_100.shape

(449, 2)

In [87]:
user_100.head()

userID,ISBN,bookRating
2033,129,129
2110,103,103
2276,196,196
4017,154,154
4385,212,212


In [88]:
ratings[ratings['userID'].isin(user_100.index)!=True].count()

userID        280570
ISBN          280570
bookRating    280570
dtype: int64

In [89]:
# drop(..., inplace=True) returns None, so the result is not assigned to a variable
ratings.drop(ratings[ratings['userID'].isin(user_100.index)!=True].index,inplace=True)

In [90]:
ratings.groupby('userID').count().sample(5)

userID,ISBN,bookRating
218552,140,140
126736,136,136
197659,781,781
23768,210,210
264321,273,273


### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [91]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103269 entries, 1456 to 1147615
Data columns (total 3 columns):
userID        103269 non-null int64
ISBN          103269 non-null object
bookRating    103269 non-null int64
dtypes: int64(2), object(1)
memory usage: 3.2+ MB


In [92]:
ratings.groupby(ratings.isna().ISBN).count()

ISBN,userID,ISBN,bookRating
False,103269,103269,103269


**Observation:** `userID` and `bookRating` are already `int64`, so they cannot contain NaN values. Grouping on `ISBN.isna()` shows that `ISBN` has no NaN values either.

In [93]:
ratings.shape

(103269, 3)

In [95]:
rating_m=ratings.pivot_table(index='userID',columns='ISBN',values='bookRating').fillna(0)
rating_m.head(10)

userID,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
2033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5582,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6543,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [96]:
rating_m.shape

(449, 66572)
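With 103269 ratings spread over 449 × 66572 cells, at most about 0.35% of the matrix above is non-zero. A minimal sketch of building such a matrix and measuring its density, on made-up data (`demo`, `matrix` and `density` are hypothetical names):

```python
import pandas as pd

# Hypothetical explicit ratings from three users on three books.
demo = pd.DataFrame(
    {
        "userID": [1, 1, 2, 3],
        "ISBN": ["a", "b", "b", "c"],
        "bookRating": [5, 7, 8, 10],
    }
)

# Rows are users, columns are books; missing pairs become 0 ("no rating").
matrix = demo.pivot_table(index="userID", columns="ISBN", values="bookRating").fillna(0)

# Share of cells that hold an actual rating.
density = (matrix.values != 0).sum() / matrix.size
print(matrix.shape, round(density, 3))
```

Because 0 stands in for "not rated" here, it is only safe once the implicit 0 ratings have been removed, as was done earlier.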

### Generate the predicted ratings using SVD with 50 singular values

In [97]:
R = rating_m.values  # `as_matrix()` was removed in pandas 1.0
user_ratings_mean = np.mean(R, axis = 1)
#R_demeaned = R - user_ratings_mean.reshape(-1, 1)


In [99]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(rating_m, k = 50)
U.shape

(449, 50)

In [100]:
sigma = np.diag(sigma)

In [101]:
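# Note: the row means are added back although they were never subtracted before the SVD
# (the demeaning step is commented out), so this shifts each user's predictions by a constant.
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = rating_m.columns)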

In [102]:
preds_df

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.042240,0.014753,0.015468,0.014753,0.014753,0.019870,0.012979,0.023934,0.023934,0.029215,...,0.017079,0.017125,0.058980,0.000095,-0.063129,0.021645,0.045213,0.017019,0.015206,0.084402
1,0.002801,0.009144,0.010367,0.009144,0.009144,0.013888,0.014253,0.009313,0.009313,0.014426,...,0.012450,0.013216,0.020955,0.013917,-0.016411,0.013812,0.015176,0.012571,0.012842,-0.000246
2,0.009566,0.009163,0.014316,0.009163,0.009163,0.031901,0.010587,0.036561,0.036561,0.036416,...,0.024165,0.026527,0.072602,0.030357,0.142479,0.031565,0.027739,0.024316,0.033629,-0.033072
3,-0.000770,0.056331,0.044464,0.056331,0.056331,0.051037,0.044944,0.019677,0.019677,0.088308,...,0.023701,0.030641,0.106978,0.011912,0.036883,0.049577,0.020605,0.022710,0.051930,-0.025934
4,0.033082,0.023039,0.025694,0.023039,0.023039,0.033951,0.034061,0.031235,0.031235,0.037084,...,0.033124,0.032601,0.018823,0.040424,0.704463,0.033595,0.022775,0.032417,0.035922,0.078777
5,0.014312,0.034972,0.028768,0.034972,0.034972,0.024324,0.039341,0.011191,0.011191,0.034656,...,0.018637,0.019701,0.046087,0.002929,-0.053399,0.024441,0.019689,0.017877,0.024877,0.088440
6,-0.003722,0.032418,0.025678,0.032418,0.032418,0.026391,0.028974,0.009859,0.009859,0.043690,...,0.014794,0.015864,0.042222,0.024652,0.042265,0.025784,0.013505,0.013928,0.026414,0.007648
7,0.017290,0.018114,0.021464,0.018114,0.018114,0.051538,0.013992,0.045252,0.045252,0.085276,...,0.028306,0.034946,0.149845,0.027515,-0.291014,0.047009,0.073759,0.028259,0.043067,0.040642
8,0.060758,-0.010524,-0.000406,-0.010524,-0.010524,0.038301,-0.011759,0.006550,0.006550,0.057688,...,0.021898,0.025083,0.090278,0.027632,-0.057385,0.034185,0.025748,0.021208,0.036355,-0.007432
9,0.048619,0.020978,0.022367,0.020978,0.020978,0.079572,0.009448,0.013409,0.013409,0.140070,...,0.031764,0.037811,0.197392,0.014243,-0.150410,0.075018,0.037993,0.029558,0.077003,0.001377
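The cell above adds the per-user mean to the reconstruction while the demeaning step is commented out. For comparison, here is a minimal sketch of the variant in which the mean is subtracted before the factorization and added back afterwards. The 4 × 5 matrix is made up, `k=2` is used where the notebook uses 50, and this is an illustration rather than a re-run of the results shown:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical ratings matrix: rows are users, columns are books, 0 = not rated.
R = np.array(
    [
        [5.0, 0.0, 3.0, 0.0, 1.0],
        [4.0, 0.0, 0.0, 2.0, 0.0],
        [0.0, 8.0, 0.0, 9.0, 0.0],
        [0.0, 7.0, 6.0, 0.0, 10.0],
    ]
)

# Subtract each user's mean rating (computed over all columns, zeros included).
user_mean = R.mean(axis=1, keepdims=True)
R_demeaned = R - user_mean

# Truncated SVD; k must be smaller than both matrix dimensions.
U, sigma, Vt = svds(R_demeaned, k=2)

# Rank-k reconstruction, with the user means added back.
predicted = U @ np.diag(sigma) @ Vt + user_mean
print(predicted.shape)
```

The signs of the singular vectors returned by `svds` are not unique, but the reconstructed matrix is unaffected by that.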


### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the cells below to get the variables loaded

In [103]:
userID = 2110

In [104]:
user_id = 2 # positional (0-based) row index: position 2 is the third row (userID 2276); userID 2110 is at position 1

In [105]:
user_predictions = preds_df.iloc[user_id ]
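The row position used above is hard-coded. A minimal sketch of looking it up from the userID instead, on a made-up matrix (`demo_matrix` is hypothetical; its index mirrors the first three users shown in `rating_m`):

```python
import pandas as pd

# Hypothetical ratings matrix indexed by userID, in the same order as rating_m.
demo_matrix = pd.DataFrame(
    [[0.0, 5.0], [7.0, 0.0], [0.0, 9.0]],
    index=pd.Index([2033, 2110, 2276], name="userID"),
    columns=["a", "b"],
)

# get_loc returns the 0-based position of a label in the index.
row_position = demo_matrix.index.get_loc(2110)
print(row_position)
```

Since positions are 0-based, userID 2110 sits at position 1 and position 2 belongs to userID 2276. Alternatively, building `preds_df` with `index=rating_m.index` would allow a lookup by userID with `.loc`.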

### Get the predicted ratings for userID `2110` and sort them in descending order

In [106]:
user_predictions =user_predictions.sort_values(ascending=False)

### Create a dataframe with name `user_data` containing userID `2110` explicitly interacted books

In [107]:
user_data = pd.DataFrame(user_predictions)

In [108]:
user_data.head()

ISBN,2
0316666343,1.040017
059035342X,0.803285
0345350499,0.721929
0440214041,0.690059
044021145X,0.688169


### Combine `user_data` and the corresponding book data (the `books` dataframe) in a single dataframe named `user_full_info`

In [124]:
user_full_info = (user_data.merge(books, how = 'left', left_on = 'ISBN', right_on = 'ISBN'))

In [125]:
user_full_info.shape

(66572, 6)

In [126]:
books.head()

,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [128]:
user_full_info.head(10)

,ISBN,2,bookTitle,bookAuthor,yearOfPublication,publisher
0,0316666343,1.040017,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
1,059035342X,0.803285,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books
2,0345350499,0.721929,The Mists of Avalon,MARION ZIMMER BRADLEY,1987,Del Rey
3,0440214041,0.690059,The Pelican Brief,John Grisham,1993,Dell
4,044021145X,0.688169,The Firm,John Grisham,1992,Bantam Dell Publishing Group
5,0312195516,0.66746,The Red Tent (Bestselling Backlist),Anita Diamant,1998,Picador USA
6,0345318862,0.664084,Golem in the Gears (Xanth Novels (Paperback)),PIERS ANTHONY,1986,Del Rey
7,0345313151,0.656066,Bearing an Hourglass (Incarnations of Immortal...,Piers Anthony,1991,Del Rey Books
8,0380752891,0.653763,"Man from Mundania (Xanth Trilogy, No 12)",Piers Anthony,1990,Harper Mass Market Paperbacks
9,051511605X,0.642575,Undue Influence,Steven Paul Martini,1995,Jove Books
