**About Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books, and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and implicit ratings are expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [520]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

In [521]:
# Load the three Book-Crossing tables (semicolon-separated, Latin-1 encoded).
# error_bad_lines=False skips rows with too many fields instead of raising an error.
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
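The `Skipping line ...` messages above are emitted for rows that contain more fields than the eight-column header. The `error_bad_lines` flag used in the loading cell was deprecated in pandas 1.3 in favour of `on_bad_lines`. A minimal sketch of the newer form follows, assuming pandas 1.3 or later; the inline sample data is made up for illustration and is not taken from `books.csv`.

```python
import io
import pandas as pd

# Hypothetical sample: the second data row has one field too many.
raw = (
    "ISBN;bookTitle;yearOfPublication\n"
    "0001;Good Row;1999\n"
    "0002;Bad;Row;2001\n"
    "0003;Another Good Row;2003\n"
)

# on_bad_lines="skip" (pandas >= 1.3) drops malformed rows without a message;
# reading ISBN as str keeps leading zeros intact.
sample = pd.read_csv(
    io.StringIO(raw), sep=";", on_bad_lines="skip", dtype={"ISBN": str}
)
print(sample)
```

Passing `on_bad_lines="warn"` instead should report each skipped row, which is closer to the behaviour seen in the output above.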


### Check the number of records and features in each dataset

In [522]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB


In [523]:
print('no. of records and features in books dataframe:' , books.shape)

no. of records and features in books dataframe: (271360, 8)


In [524]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [525]:
print('no. of records and features in users dataframe:' , users.shape)

no. of records and features in users dataframe: (278858, 3)


In [526]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [527]:
print('no. of records and features in ratings dataframe:' , ratings.shape)

no. of records and features in ratings dataframe: (1149780, 3)


## Exploring books dataset

In [528]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop the last three columns, which contain image URLs that are not required for the analysis

In [529]:
books=books.iloc[:,:-3]

In [530]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [531]:
books['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As the output above shows, this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of malformed rows in the CSV file.

In addition, some years are stored as strings while the same years appear elsewhere as integers. Both problems are fixed in the steps that follow.
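One way to locate such entries without scanning the full list of unique values is `pd.to_numeric` with `errors="coerce"`, which turns anything that is not a number into NaN. The sketch below uses a hypothetical four-row table (`toy_books`) that mimics the mixed integer, string, and publisher-name values described above.

```python
import pandas as pd

# Hypothetical miniature of the books table with the same kinds of problems.
toy_books = pd.DataFrame({
    "ISBN": ["A", "B", "C", "D"],
    "yearOfPublication": [2002, "1999", "DK Publishing Inc", "0"],
})

# Values that cannot be parsed as numbers become NaN.
years = pd.to_numeric(toy_books["yearOfPublication"], errors="coerce")

# Rows whose year is not numeric at all (shifted columns in the CSV).
bad_rows = toy_books[years.isna()]
print(bad_rows)

# The remaining rows can be cast safely, which unifies 1999 and "1999".
clean = toy_books[years.notna()].copy()
clean["yearOfPublication"] = years[years.notna()].astype(int)
print(clean["yearOfPublication"].tolist())
```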

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [532]:
books[books['yearOfPublication']=='DK Publishing Inc' ]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [533]:
books[(books['yearOfPublication'] == 'DK Publishing Inc') | (books['yearOfPublication'] == 'Gallimard')]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


In [534]:
books[(books['yearOfPublication'] == 'DK Publishing Inc') | (books['yearOfPublication'] == 'Gallimard')].index

Int64Index([209538, 220731, 221678], dtype='int64')

In [535]:
index1=books[(books['yearOfPublication'] == 'DK Publishing Inc') | (books['yearOfPublication'] == 'Gallimard')].index

In [536]:
books.drop(index1,inplace=True)

In [537]:
books.shape

(271357, 5)

### Change the datatype of yearOfPublication to 'int'

In [538]:
books['yearOfPublication']=books['yearOfPublication'].astype(int)

In [539]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [540]:
books.isna().sum()

ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            2
dtype: int64

In [541]:
books[books['publisher'].isna()]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


In [542]:
books.dropna(subset=['publisher'],inplace=True)

In [543]:
books.shape

(271355, 5)

## Exploring Users dataset

In [544]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [545]:
np.sort(users['Age'].unique())

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 113., 114., 115., 116., 118., 119., 123., 124., 127.,
       128., 132., 133., 136., 137., 138., 140., 141., 143., 146., 147.,
       148., 151., 152., 156., 157., 159., 162., 168., 172., 175., 183.,
       186., 189., 199., 200., 201., 204., 207., 20

The Age column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

### Ages below 5 and above 90 are implausible for book ratings, so replace them with NaN

In [546]:
users['Age'] = np.where((users['Age'] >=5) & (users['Age']<=90),users['Age'],np.nan)

In [547]:
np.sort(users['Age'].unique())

array([ 5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17.,
       18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30.,
       31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43.,
       44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55., 56.,
       57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68., 69.,
       70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81., 82.,
       83., 84., 85., 86., 87., 88., 89., 90., nan])
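The same range filter can also be expressed with `Series.between` and `Series.where`, which some readers may find easier to follow than `np.where`. This is a sketch on a hypothetical series of ages, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including out-of-range values and a missing one.
ages = pd.Series([0.0, 4.0, 5.0, 34.0, 90.0, 91.0, 244.0, np.nan])

# between() is inclusive on both ends by default; where() keeps values for
# which the condition holds and writes NaN everywhere else.
cleaned = ages.where(ages.between(5, 90))
print(cleaned.tolist())
```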

### Replace null values in column `Age` with mean

In [548]:
np.mean(users['Age'])

34.72384041634689

In [549]:
# Assign the filled column back instead of filling in place on a column selection
users['Age'] = users['Age'].fillna(users['Age'].mean())

In [550]:
np.sort(users['Age'].unique())

array([ 5.        ,  6.        ,  7.        ,  8.        ,  9.        ,
       10.        , 11.        , 12.        , 13.        , 14.        ,
       15.        , 16.        , 17.        , 18.        , 19.        ,
       20.        , 21.        , 22.        , 23.        , 24.        ,
       25.        , 26.        , 27.        , 28.        , 29.        ,
       30.        , 31.        , 32.        , 33.        , 34.        ,
       34.72384042, 35.        , 36.        , 37.        , 38.        ,
       39.        , 40.        , 41.        , 42.        , 43.        ,
       44.        , 45.        , 46.        , 47.        , 48.        ,
       49.        , 50.        , 51.        , 52.        , 53.        ,
       54.        , 55.        , 56.        , 57.        , 58.        ,
       59.        , 60.        , 61.        , 62.        , 63.        ,
       64.        , 65.        , 66.        , 67.        , 68.        ,
       69.        , 70.        , 71.        , 72.        , 73.  

### Change the datatype of `Age` to `int`

In [551]:
users['Age']=users['Age'].astype('int')

In [552]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
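The cast to `int` truncates rather than rounds, so the imputed mean of 34.72 ends up as 34 in the list above. The sketch below, on a hypothetical four-value series, shows the difference between truncating and rounding before the cast; which one is preferable is a judgment call.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18.0, np.nan, 30.0, 59.0])  # hypothetical ages
mean_age = ages.mean()                        # NaN is ignored: 107 / 3

filled = ages.fillna(mean_age)
truncated = filled.astype(int)                # fractional part is discarded
rounded = filled.round().astype(int)          # nearest integer instead

print(truncated.tolist())
print(rounded.tolist())
```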


## Exploring the Ratings Dataset

### Check the shape

In [553]:
ratings.shape

(1149780, 3)

In [554]:
n_users = users.shape[0]
n_books = books.shape[0]

In [555]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should contain only books that exist in the books dataset. Drop the remaining rows

In [556]:
ratings=pd.merge(ratings,books,on='ISBN',how='inner').iloc[:,:3]

In [557]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,2313,034545104X,5
2,6543,034545104X,0
3,8680,034545104X,5
4,10314,034545104X,9


In [558]:
ratings.shape

(1031130, 3)
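An inner merge acts as a pure filter only when `ISBN` is unique in the right-hand table; the notebook does not check whether `books` contains repeated ISBNs. A sketch of the alternative, a semi-join with `isin`, on hypothetical toy tables in which one ISBN is deliberately duplicated:

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userID": [1, 1, 2],
    "ISBN": ["X1", "X2", "X9"],
    "bookRating": [5, 0, 8],
})
# Hypothetical catalogue in which ISBN "X1" appears twice.
toy_books = pd.DataFrame({
    "ISBN": ["X1", "X1", "X2"],
    "bookTitle": ["a", "a (dup)", "b"],
})

merged = pd.merge(toy_ratings, toy_books, on="ISBN", how="inner").iloc[:, :3]
filtered = toy_ratings[toy_ratings["ISBN"].isin(toy_books["ISBN"])]

print(len(merged))    # the duplicated ISBN yields an extra rating row
print(len(filtered))  # one row per original rating
```

The `isin` form also keeps the original row order and index, whereas the merge reorders rows and resets the index, as the `ratings.head()` output above shows.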

### The ratings dataset should contain only ratings from users that exist in the users dataset. Drop the remaining rows

In [559]:
ratings=pd.merge(ratings,users,on='userID',how='inner').iloc[:,:3]

In [560]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,2313,034545104X,5
2,2313,0812533550,9
3,2313,0679745580,8
4,2313,0060173289,9


In [561]:
ratings.shape

(1031130, 3)

### Consider only explicit ratings from 1 to 10 and drop the 0s in column `bookRating`

In [562]:
ratings['bookRating'].value_counts()

0     647291
8      91804
10     71225
7      66401
9      60776
5      45355
6      31687
4       7617
3       5118
2       2375
1       1481
Name: bookRating, dtype: int64

In [563]:
ratings['bookRating'].replace(0,np.nan,inplace=True)

In [564]:
ratings.dropna(subset=['bookRating'],inplace=True)

In [565]:
ratings['bookRating'].value_counts()

8.0     91804
10.0    71225
7.0     66401
9.0     60776
5.0     45355
6.0     31687
4.0      7617
3.0      5118
2.0      2375
1.0      1481
Name: bookRating, dtype: int64

In [566]:
ratings.shape

(383839, 3)
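Replacing 0 with NaN converts the column to float, which is why the counts above are labelled `8.0`, `10.0`, and so on. A boolean mask gives the same rows while keeping the integer dtype. A minimal sketch on a hypothetical four-row table:

```python
import pandas as pd

toy = pd.DataFrame({"userID": [1, 1, 2, 3], "bookRating": [0, 7, 10, 0]})

explicit = toy[toy["bookRating"] != 0]   # explicit ratings, 1 to 10
implicit = toy[toy["bookRating"] == 0]   # implicit interactions, coded as 0

print(explicit["bookRating"].tolist())
print(explicit["bookRating"].dtype)
```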

### Find out which rating has been given highest number of times

In [567]:
ratings['bookRating'].value_counts()

8.0     91804
10.0    71225
7.0     66401
9.0     60776
5.0     45355
6.0     31687
4.0      7617
3.0      5118
2.0      2375
1.0      1481
Name: bookRating, dtype: int64

Rating 8 was given the highest number of times (91,804).

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, consider only users who have rated at least 100 books

In [568]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
1,2313,034545104X,5.0
2,2313,0812533550,9.0
3,2313,0679745580,8.0
4,2313,0060173289,9.0
5,2313,0385482388,5.0


In [569]:
RatingCount=ratings.groupby(by='userID').count()

In [570]:
RatingCount.head()

Unnamed: 0_level_0,ISBN,bookRating
userID,Unnamed: 1_level_1,Unnamed: 2_level_1
8,7,7
9,1,1
12,1,1
14,3,3
16,1,1


In [571]:
Rate_100books=RatingCount[RatingCount['bookRating']>=100]

In [572]:
Rate_100books.head()

Unnamed: 0_level_0,ISBN,bookRating
userID,Unnamed: 1_level_1,Unnamed: 2_level_1
2033,129,129
2110,103,103
2276,196,196
4017,154,154
4385,212,212


In [573]:
Rate_100books.shape

(449, 2)

In [574]:
Rate_100books.index

Int64Index([  2033,   2110,   2276,   4017,   4385,   5582,   6242,   6251,
              6543,   6575,
            ...
            269566, 270713, 271448, 271705, 273113, 274061, 274301, 275970,
            277427, 278418],
           dtype='int64', name='userID', length=449)

In [575]:
ratings = ratings[ratings['userID'].isin(Rate_100books.index)]

In [576]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
43,6543,446605484,10.0
47,6543,805062971,8.0
48,6543,345342968,8.0
49,6543,446610038,9.0
55,6543,61009059,8.0


In [577]:
ratings.shape

(103269, 3)
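The count-then-filter steps above can also be collapsed into one expression with `groupby(...).transform("count")`, which returns a per-row count aligned with the original table. The sketch uses a hypothetical six-row table and a threshold of 2 instead of 100 so that the result is easy to inspect.

```python
import pandas as pd

toy = pd.DataFrame({
    "userID": [10, 10, 10, 20, 30, 30],
    "ISBN": ["a", "b", "c", "a", "b", "c"],
    "bookRating": [8.0, 9.0, 5.0, 7.0, 10.0, 6.0],
})
min_ratings = 2  # the notebook uses 100; 2 keeps the toy example small

# Number of ratings by the same user, repeated on every row of that user.
counts = toy.groupby("userID")["bookRating"].transform("count")
active = toy[counts >= min_ratings]

print(sorted(active["userID"].unique().tolist()))
```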

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace them with 0, which indicates the absence of a rating

In [578]:
ratings['bookRating'].value_counts()

8.0     23904
10.0    22904
9.0     18552
7.0     15242
5.0     13314
6.0      6500
4.0      1260
3.0       772
2.0       440
1.0       381
Name: bookRating, dtype: int64

There are no NaNs in `bookRating` at this point: every remaining row is an explicit rating.
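Missing entries arise only when the long-format table is pivoted into a user-by-book matrix, because most users have not rated most books. A minimal sketch on a hypothetical three-row table shows where the NaNs come from and how `fillna(0)` handles them:

```python
import pandas as pd

toy = pd.DataFrame({
    "userID": [1, 1, 2],
    "ISBN": ["a", "b", "a"],
    "bookRating": [8.0, 5.0, 10.0],
})

# One row per user, one column per book; unrated pairs become NaN.
wide = toy.pivot(index="userID", columns="ISBN", values="bookRating")
print(wide)

# User 2 never rated book "b", so that cell is filled with 0.
matrix = wide.fillna(0)
print(matrix)
```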

In [579]:
from sklearn.model_selection import train_test_split

trainDF, tempDF = train_test_split(ratings, test_size = 0.2, random_state = 100)

In [580]:
testDF = tempDF.copy()

In [581]:
tempDF.bookRating = np.nan

In [582]:
tempDF.head()

Unnamed: 0,userID,ISBN,bookRating
446631,197659,842304673,
135668,184299,345358791,
84242,115003,1400031354,
193585,153662,380761319,
103772,16795,394891139,


In [583]:
testDF = testDF.dropna()

In [584]:
testDF.head()

Unnamed: 0,userID,ISBN,bookRating
446631,197659,842304673,7.0
135668,184299,345358791,8.0
84242,115003,1400031354,9.0
193585,153662,380761319,10.0
103772,16795,394891139,6.0


In [585]:
ratings = pd.concat([trainDF, tempDF]).reset_index()

In [586]:
ratings

Unnamed: 0,index,userID,ISBN,bookRating
0,425933,150979,0679460152,9.0
1,278665,60244,0393049566,7.0
2,8536,98391,067104222X,9.0
3,145844,207782,0874060273,7.0
4,264590,189835,1560549653,5.0
5,260876,94347,0425061426,6.0
6,220575,16634,0441010512,7.0
7,86428,140358,0345409469,5.0
8,291060,76626,0140283587,7.0
9,76835,60337,0689111940,7.0


In [587]:
ratings.fillna(0,inplace=True)

In [588]:
ratings.isna().sum()

index         0
userID        0
ISBN          0
bookRating    0
dtype: int64

### Generate the predicted ratings using SVD with 50 singular values

In [589]:
R_df = ratings.pivot(index = 'userID', columns = 'ISBN', values = 'bookRating').fillna(0)

In [590]:
R_df.tail()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
274061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
274301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
275970,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
277427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [591]:
from scipy.sparse.linalg import svds

In [592]:
# Pass the underlying NumPy array; svds may not accept a DataFrame directly
U, sigma, Vt = svds(R_df.values, k = 50)

In [593]:
sigma

array([131.07954208, 132.44479902, 132.61470995, 133.96010817,
       134.94232624, 136.38117803, 137.0634911 , 138.04647807,
       140.45935247, 141.29908114, 142.26811037, 143.88305269,
       144.27243066, 144.93753168, 149.39109893, 149.62291223,
       149.94512384, 152.15710138, 152.98116567, 154.23600256,
       155.64958852, 156.98587955, 158.30450983, 161.41139495,
       164.36235669, 164.60938522, 166.22369888, 168.8872909 ,
       173.19509942, 174.99507662, 176.37245022, 178.41205733,
       180.20327794, 181.26833216, 184.19621481, 186.26397001,
       190.17666439, 194.12064112, 202.52424067, 206.23585733,
       210.1876945 , 219.80287636, 223.09823012, 232.70628393,
       237.36014895, 252.56483856, 257.35846413, 338.84909015,
       567.12180411, 605.76299262])

In [594]:
sigma = np.diag(sigma)

In [595]:
sigma

array([[131.07954208,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        , 132.44479902,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        ,   0.        , 132.61470995, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [  0.        ,   0.        ,   0.        , ..., 338.84909015,
          0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
        567.12180411,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        , 605.76299262]])

In [596]:
all_users_predicted_ratings = np.dot(np.dot(U, sigma), Vt)
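To make the reconstruction step concrete, here is a small worked example of a rank-k approximation. It is a sketch under stated assumptions: it uses `np.linalg.svd` (a full decomposition, then keeping the top k components) as a stand-in for `svds`, and the 3-by-4 matrix `R` is hypothetical.

```python
import numpy as np

# Hypothetical user-by-book matrix; 0 marks a missing rating.
R = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 2.0, 0.0, 5.0],
])
k = 2

# np.linalg.svd returns singular values in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep the k largest components and multiply the factors back together.
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error of the rank-k approximation equals the norm of the
# dropped singular values (here a single value, s[2]).
err = np.linalg.norm(R - R_k)
print(R_k.round(2))
print(err, s[k])
```

The entries of `R_k` at positions where `R` is 0 are what the notebook treats as predicted ratings.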

In [597]:
preds_df = pd.DataFrame(all_users_predicted_ratings, columns = R_df.columns)

In [598]:
preds_df

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.007815,-0.012798,-0.008532,-0.012798,0.0,0.005981,-0.002933,0.010925,0.010925,0.014187,...,0.001694,0.001188,0.013656,0.0,0.0,0.003294,0.014820,0.001129,0.003141,0.025547
1,-0.009052,-0.003543,-0.002362,-0.003543,0.0,0.001077,0.003381,-0.001985,-0.001985,0.002332,...,-0.000213,0.000924,0.006919,0.0,0.0,0.002442,0.000632,-0.000142,0.000155,-0.011345
2,-0.007289,-0.014245,-0.009497,-0.014245,0.0,0.008952,0.004210,0.015236,0.015236,0.020061,...,0.001029,-0.000127,0.048170,0.0,0.0,0.010027,0.003202,0.000686,0.009559,-0.053938
3,-0.020563,0.036033,0.024022,0.036033,0.0,0.027612,0.001397,0.003328,0.003328,0.064300,...,0.004822,0.000270,0.086069,0.0,0.0,0.023547,0.010329,0.003214,0.024629,-0.054396
4,-0.001996,-0.004971,-0.003314,-0.004971,0.0,0.002777,0.003071,-0.000250,-0.000250,0.006595,...,0.001129,0.000011,0.001815,0.0,0.0,0.003064,-0.001733,0.000753,0.002813,0.043207
5,0.002886,0.024017,0.016012,0.024017,0.0,0.007126,0.001949,-0.007713,-0.007713,0.016297,...,0.001977,0.000471,0.019840,0.0,0.0,0.006871,0.003491,0.001318,0.005549,0.095981
6,-0.015164,0.037423,0.024949,0.037423,0.0,0.012221,0.004590,-0.009147,-0.009147,0.027800,...,0.002001,-0.000293,0.028798,0.0,0.0,0.009141,0.004512,0.001334,0.010302,0.012920
7,0.008502,0.011246,0.007497,0.011246,0.0,0.025981,-0.012858,0.025985,0.025985,0.055775,...,0.002708,0.001801,0.070756,0.0,0.0,0.016543,0.026284,0.001805,0.016773,-0.029342
8,0.031977,-0.021332,-0.014221,-0.021332,0.0,0.015388,0.003333,-0.012689,-0.012689,0.031379,...,0.002464,0.001298,0.044266,0.0,0.0,0.009379,0.007422,0.001642,0.010417,-0.028545
9,0.037238,0.046507,0.031005,0.046507,0.0,0.043128,0.006078,-0.014510,-0.014510,0.087791,...,0.004656,-0.001341,0.128374,0.0,0.0,0.030903,0.014746,0.003104,0.031449,-0.071833
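The held-out ratings kept in `testDF` can be compared with the reconstructed matrix, for example through the root-mean-square error. The sketch below assumes a prediction table that is indexed by `userID` (in the notebook `preds_df` has a default integer index, so the user labels would have to be attached first); `toy_preds` and `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions, indexed by userID with ISBN columns.
toy_preds = pd.DataFrame(
    [[4.0, 1.0], [2.0, 9.0]],
    index=pd.Index([2033, 2110], name="userID"),
    columns=pd.Index(["a", "b"], name="ISBN"),
)
# Hypothetical held-out ratings in the same long format as testDF.
toy_test = pd.DataFrame({
    "userID": [2033, 2110],
    "ISBN": ["a", "b"],
    "bookRating": [5.0, 7.0],
})

# Translate each (userID, ISBN) pair into a (row, column) position.
rows = toy_preds.index.get_indexer(toy_test["userID"].to_numpy())
cols = toy_preds.columns.get_indexer(toy_test["ISBN"].to_numpy())
predicted = toy_preds.to_numpy()[rows, cols]

rmse = np.sqrt(np.mean((predicted - toy_test["bookRating"].to_numpy()) ** 2))
print(rmse)
```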


### Take a particular user_id

### Let's find the recommendations for the user with ID `2110`

#### Note: Execute the below cells to get the variables loaded

In [599]:
userID = 2110

In [600]:
user_id = 2 #2nd row in ratings matrix and predicted matrix
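The hard-coded row number depends on the sort order of the pivoted matrix. As a sketch, it could be derived from the matrix index with `Index.get_loc` instead; `user_index` below is a hypothetical stand-in for `R_df.index`, and the `+ 1` reflects that `recommend_books` subtracts 1 from the row number it receives.

```python
import pandas as pd

# Hypothetical stand-in for R_df.index (userIDs in sorted order).
user_index = pd.Index([2033, 2110, 2276], name="userID")

userID = 2110
zero_based_row = user_index.get_loc(userID)  # position of the user in the matrix
user_id = zero_based_row + 1                 # 1-based, as recommend_books expects

print(zero_based_row, user_id)
```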

### Get the predicted ratings for userID `2110` and sort them in descending order

In [601]:
def recommend_books(predictions_df, userID, user_row_number, books_df, original_ratings_df, num_recommendations = 10):
    # user_row_number is 1-based; convert it to the zero-based row label of predictions_df
    user_row_number = user_row_number - 1
    sorted_user_predictions = predictions_df.loc[user_row_number].sort_values(ascending = False)
    
    # Books the user has already rated, joined with their metadata
    user_data = original_ratings_df[original_ratings_df.userID == (userID)]
    user_full = (user_data.merge(books_df, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
                sort_values(['bookRating'], ascending = False)
                )
    print('User {0} has already rated {1} books.'.format(userID, user_full.dropna().shape[0]))
    print('Recommending the highest {0} predicted ratings books not already rated.'.format(num_recommendations))
    
    # Books not yet rated by the user, ranked by predicted rating;
    # the helper column 'Predictions' is dropped from the returned frame
    recommendations = (books_df[~books_df['ISBN'].isin(user_full['ISBN'])].
                      merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
                           left_on = 'ISBN',
                           right_on = 'ISBN').
                      rename(columns = {user_row_number: 'Predictions'}).
                      sort_values('Predictions', ascending = False).
                      iloc[:num_recommendations, :-1])
    return user_full, recommendations, sorted_user_predictions, user_data, user_full

In [602]:
already_rated, predictions, sorted_user_predictions, user_data, user_full = recommend_books(preds_df, userID, user_id,books, ratings, 10)

User 2110 has already rated 103 books.
Recommending the highest 10 predicted ratings books not already rated.


In [603]:
sorted_user_predictions

ISBN
059035342X    0.670207
044021145X    0.335010
0345370775    0.326200
0345384911    0.323808
0440213525    0.303738
0880389117    0.294164
0380759497    0.286410
0439064872    0.280313
0316666343    0.269477
0441845630    0.258484
0060392452    0.250820
0345317580    0.248826
0345348109    0.248578
043935806X    0.247523
0618002235    0.245761
0345353145    0.242574
0345335287    0.238556
0451160525    0.230056
0451180232    0.229447
0425109720    0.218258
0425147622    0.213593
0441172717    0.212400
1558531564    0.211671
1560763361    0.210520
0439139597    0.208522
043936213X    0.207282
0449221512    0.202139
0345367693    0.200714
0345380371    0.199478
0440206154    0.195434
                ...   
0515114006   -0.038246
0062501860   -0.038576
0684810395   -0.038649
0671673688   -0.038670
0553285785   -0.038878
0553801430   -0.038890
0515127833   -0.039842
051511779X   -0.040447
0515114693   -0.040472
0060977744   -0.040884
0385311109   -0.041742
0385316895   -0.041742
045140

### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly rated

In [604]:
user_data.head()

Unnamed: 0,index,userID,ISBN,bookRating
209,670445,2110,553073273,9.0
334,670434,2110,440920221,6.0
1977,670404,2110,373642849,6.0
4797,670415,2110,373765649,8.0
7907,670461,2110,671038931,8.0


In [605]:
user_data.shape

(103, 4)

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [606]:
user_data.head()

Unnamed: 0,index,userID,ISBN,bookRating
209,670445,2110,553073273,9.0
334,670434,2110,440920221,6.0
1977,670404,2110,373642849,6.0
4797,670415,2110,373765649,8.0
7907,670461,2110,671038931,8.0


In [607]:
book_data=books

In [608]:
book_data.shape

(271355, 5)

In [609]:
book_data.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [610]:
user_full_info = (user_data.merge(book_data, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
                sort_values(['bookRating'], ascending = False)
                )

In [611]:
user_full_info.head()

Unnamed: 0,index,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
51,670425,2110,0394843509,10.0,Reel Trouble (The Three Investigators Crimebus...,G. H. Stone,1989,Random House Children's Books
53,670429,2110,0439222303,10.0,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple
21,670345,2110,0590629786,10.0,"The Visitor (Animorphs, No 2)",K. A. Applegate,1996,Scholastic
22,670456,2110,0590629778,10.0,"The Invasion (Animorphs, No 1)",K. A. Applegate,1996,Scholastic
23,670454,2110,059046678X,10.0,The Yearbook,Peter Lerangis,1994,Scholastic


### Get the top 10 recommendations for the above userID from the books not already rated by that user

In [612]:
predictions

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
455,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group
1192,0345370775,Jurassic Park,Michael Crichton,1999,Ballantine Books
6184,0345384911,Crystal Line,Anne McCaffrey,1993,Del Rey Books
3781,0440213525,The Client,John Grisham,1994,Dell Publishing Company
54653,0880389117,Flint the King (Dragonlance: Preludes),Mary Kirchoff,1990,Wizards of the Coast
976,0380759497,Xanth 15: The Color of Her Panties,Piers Anthony,1992,Eos
3413,0439064872,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,2000,Scholastic
407,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
81942,0441845630,Unicorn Point (Apprentice Adept (Paperback)),Piers Anthony,1990,ACE Charter
2758,0060392452,Stupid White Men ...and Other Sorry Excuses fo...,Michael Moore,2002,Regan Books
