**About Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and implicit ratings are expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [44]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [45]:
#Loading data
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
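Newer pandas releases no longer accept `error_bad_lines` (it was deprecated in pandas 1.3 and removed in 2.0); `on_bad_lines="skip"` is the replacement. The sketch below is a minimal illustration on an in-memory CSV, assuming a reasonably recent pandas. The `raw` text, including its malformed second row, is made up for the demonstration and is not taken from the dataset files.

```python
import io
import pandas as pd

# Hypothetical three-column sample; the second data row has one field too many.
raw = (
    "ISBN;bookTitle;publisher\n"
    "0195153448;Classical Mythology;Oxford University Press\n"
    "0000000000;Bad;Row;Some Publisher\n"
    "0002005018;Clara Callan;HarperFlamingo Canada\n"
)

# on_bad_lines="skip" silently drops rows with too many fields.
# dtype=str for ISBN keeps leading zeros intact.
sample = pd.read_csv(
    io.StringIO(raw), sep=";", on_bad_lines="skip", dtype={"ISBN": str}
)
print(sample)
```

Reading `ISBN` as a string matters here because the ratings table is later joined to the books table on that column.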


### Check no.of records and features given in each dataset

In [46]:
# No. of records in each dataset
books.shape[0],users.shape[0], ratings.shape[0]

(271360, 278858, 1149780)

In [47]:
# No. of features in each dataset
books.shape[1],users.shape[1], ratings.shape[1]

(8, 3, 3)

## Exploring books dataset

In [48]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [49]:
books.drop(['imageUrlS','imageUrlM','imageUrlL'],axis =1, inplace = True)

In [50]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [51]:
books['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As can be seen above, this field contains some incorrect entries. It looks like the publisher names 'DK Publishing Inc' and 'Gallimard' were incorrectly loaded as yearOfPublication because of errors in the CSV file.

Also, some entries are strings while the same years appear as numbers elsewhere. We will fix these issues in the following steps.
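One way to locate such entries without knowing the offending strings in advance is to coerce the column to numbers and inspect what fails. This is a minimal sketch on a small hand-made Series (the `years` values are illustrative, not read from `books`); the notebook itself drops the rows by matching the two publisher names.

```python
import pandas as pd

# Mixed column: ints, numeric strings, and stray publisher names.
years = pd.Series(
    [2002, "2000", "DK Publishing Inc", 0, "Gallimard", 1999], dtype=object
)

# errors="coerce" turns anything that is not a number into NaN.
numeric_years = pd.to_numeric(years, errors="coerce")

# Rows that could not be parsed as a year.
bad_rows = years[numeric_years.isna()]

# Remaining rows, now with a single integer dtype.
clean_years = numeric_years.dropna().astype("int64")

print(bad_rows)
print(clean_years)
```

The same mask (`numeric_years.isna()`) could be used to drop the affected rows from a dataframe in one step.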

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [52]:
ind = books[books['yearOfPublication'] == 'DK Publishing Inc'].index
ind

Int64Index([209538, 221678], dtype='int64')

In [53]:
ind1 = books[books['yearOfPublication'] == 'Gallimard'].index
ind1

Int64Index([220731], dtype='int64')

### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [54]:
books.drop(ind, inplace = True)


In [55]:
books.drop(ind1, inplace = True)

In [56]:
books[books['yearOfPublication'] == 'DK Publishing Inc']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


In [57]:
books[books['yearOfPublication'] == 'Gallimard']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


### Change the datatype of yearOfPublication to 'int'

In [58]:
books['yearOfPublication'] = books['yearOfPublication'].astype('int64')

In [59]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [60]:
books.isna().sum(), books.shape[0]
# bookAuthor has 1 NaN and publisher has 2 NaNs; these rows are dropped in the next cell

(ISBN                 0
 bookTitle            0
 bookAuthor           1
 yearOfPublication    0
 publisher            2
 dtype: int64, 271357)

In [61]:
books.dropna(inplace = True)

In [62]:
books.isna().sum()

ISBN                 0
bookTitle            0
bookAuthor           0
yearOfPublication    0
publisher            0
dtype: int64

In [63]:
books.shape[0]

271354
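`dropna()` with no arguments removes a row if any column is missing. When only certain columns matter, the `subset` argument restricts the check. The following sketch uses a made-up `books_demo` frame (its `note` column is hypothetical and only there to show a NaN that is ignored).

```python
import numpy as np
import pandas as pd

books_demo = pd.DataFrame(
    {
        "ISBN": ["a", "b", "c"],
        "bookAuthor": ["X", np.nan, "Z"],
        "publisher": ["P", "Q", np.nan],
        "note": [np.nan, "n", "m"],  # hypothetical extra column
    }
)

# Only NaNs in bookAuthor or publisher cause a row to be dropped.
cleaned = books_demo.dropna(subset=["bookAuthor", "publisher"])
print(cleaned)
```

In the books table above every remaining column is needed, so the plain `dropna()` and the `subset` form would give the same result there.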

## Exploring Users dataset

In [64]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [65]:
pd.DataFrame(users['Age'].unique())[0].sort_values()

47       0.0
57       1.0
79       2.0
72       3.0
84       4.0
90       5.0
93       6.0
83       7.0
82       8.0
64       9.0
89      10.0
77      11.0
68      12.0
33      13.0
5       14.0
45      15.0
50      16.0
2       17.0
1       18.0
7       19.0
12      20.0
17      21.0
49      22.0
14      23.0
11      24.0
6       25.0
4       26.0
40      27.0
31      28.0
42      29.0
       ...  
157    157.0
135    159.0
120    162.0
133    168.0
115    172.0
114    175.0
150    183.0
136    186.0
164    189.0
131    199.0
140    200.0
95     201.0
151    204.0
144    207.0
155    208.0
116    209.0
129    210.0
117    212.0
110    219.0
161    220.0
153    223.0
142    226.0
149    228.0
145    229.0
87     230.0
71     231.0
118    237.0
88     239.0
101    244.0
0        NaN
Name: 0, Length: 166, dtype: float64

The Age column has some invalid entries such as NaN, 0, and implausibly high values (100 and above).

### Values below 5 and above 90 do not make much sense for book ratings, so replace them with NaN

In [68]:
users.loc[(users['Age']<5) | (users['Age']>90),'Age'].count()

1312

In [69]:
import numpy as np
users.loc[(users['Age']<5) | (users['Age']>90),'Age'] = np.nan

In [70]:
users[(users['Age']<5) | (users['Age']>90)]


Unnamed: 0,userID,Location,Age


In [71]:
users.loc[(users['Age']<5) | (users['Age']>90),'Age'].count()

0
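The same range check can be written with `Series.between` and `Series.where`, which avoids repeating the boolean condition. This is a small sketch on made-up ages; the bounds 5 and 90 are the ones chosen above.

```python
import numpy as np
import pandas as pd

ages = pd.Series([np.nan, 18.0, 0.0, 17.0, 244.0, 61.0])

# between() is inclusive on both ends by default; NaN compares as False.
plausible = ages.between(5, 90)

# where() keeps values where the mask is True and puts NaN elsewhere.
cleaned_ages = ages.where(plausible)
print(cleaned_ages)
```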

### Replace null values in column `Age` with mean

In [72]:
users['Age'] = users['Age'].fillna(users['Age'].mean())

In [73]:
users.isna().sum()

userID      0
Location    0
Age         0
dtype: int64

### Change the datatype of `Age` to `int`

In [74]:
users['Age'].astype('int64')

0         34
1         18
2         34
3         17
4         34
5         61
6         34
7         34
8         34
9         26
10        14
11        34
12        26
13        34
14        34
15        34
16        34
17        25
18        14
19        19
20        46
21        34
22        34
23        19
24        55
25        34
26        32
27        24
28        19
29        24
          ..
278828    34
278829    28
278830    34
278831    62
278832    25
278833    34
278834    18
278835    47
278836    34
278837    15
278838    34
278839    45
278840    34
278841    34
278842    28
278843    28
278844    34
278845    23
278846    34
278847    34
278848    23
278849    34
278850    33
278851    32
278852    17
278853    34
278854    50
278855    34
278856    34
278857    34
Name: Age, Length: 278858, dtype: int64

In [75]:
users.dtypes
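# Age is still float64: astype() returns a new Series and the result above was not assigned back,
# so the column keeps the fractional mean value used to fill the missing ages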

userID        int64
Location     object
Age         float64
dtype: object

In [76]:
print(sorted(users.Age.unique()))

[5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 34.72384041634689, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0]
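If integer ages are actually wanted, the converted Series has to be assigned back to the column. The sketch below shows this on a made-up `users_demo` frame; rounding the mean before casting is an assumption made here for illustration (a plain `astype('int64')` would truncate instead).

```python
import numpy as np
import pandas as pd

users_demo = pd.DataFrame(
    {"userID": [1, 2, 3, 4, 5], "Age": [np.nan, 18.0, np.nan, 17.0, 20.0]}
)

# Fill missing ages with the mean of the known ages (55 / 3 here).
users_demo["Age"] = users_demo["Age"].fillna(users_demo["Age"].mean())

# Round, cast, and assign back so the dtype change persists.
users_demo["Age"] = users_demo["Age"].round().astype("int64")
print(users_demo.dtypes)
```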


## Exploring the Ratings Dataset

### Check the shape

In [77]:
ratings.shape

(1149780, 3)

In [78]:
n_users = users.shape[0]
n_books = books.shape[0]
n_ratings = ratings.shape[0]
n_users,n_books, n_ratings


(278858, 271354, 1149780)

In [79]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should contain only books that exist in our books dataset. Drop the remaining rows

In [80]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [81]:
users.head()

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",34.72384
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",34.72384
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",34.72384


In [82]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [84]:
ratingbooks_Left = pd.merge(ratings,books, how = 'left',on = 'ISBN')
ratingbooks_Left

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002.0,Ballantine Books
1,276726,0155061224,5,Rites of Passage,Judith Rae,2001.0,Heinle
2,276727,0446520802,0,The Notebook,Nicholas Sparks,1996.0,Warner Books
3,276729,052165615X,3,Help!: Level 1,Philip Prowse,1999.0,Cambridge University Press
4,276729,0521795028,6,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,2001.0,Cambridge University Press
5,276733,2080674722,0,Les Particules Elementaires,Michel Houellebecq,1998.0,Flammarion
6,276736,3257224281,8,,,,
7,276737,0600570967,6,,,,
8,276744,038550120X,7,A Painted House,JOHN GRISHAM,2001.0,Doubleday
9,276745,342310538,10,,,,


In [102]:
ratingbooks_Left.isna().sum(), ratingbooks_Left.shape
# 118651 rows in the ratings table refer to ISBNs that have no record in the books table, so we drop these rows.

(userID                    0
 ISBN                      0
 bookRating                0
 bookTitle            118651
 bookAuthor           118651
 yearOfPublication    118651
 publisher            118651
 dtype: int64, (1149780, 7))

In [103]:
ratingbooks_Left.dropna(inplace = True)

In [104]:
ratingbooks_Left.isna().sum(), ratingbooks_Left.shape

(userID               0
 ISBN                 0
 bookRating           0
 bookTitle            0
 bookAuthor           0
 yearOfPublication    0
 publisher            0
 dtype: int64, (1031129, 7))

In [119]:
ratings = ratingbooks_Left[['userID','ISBN','bookRating']].copy()  # copy so later in-place drops do not act on a slice
ratings
# This is our revised ratings table, from which ratings for books missing from the books table have been removed

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
5,276733,2080674722,0
8,276744,038550120X,7
10,276746,0425115801,0
11,276746,0449006522,0
12,276746,0553561618,0


### The ratings dataset should contain only ratings from users who exist in the users dataset. Drop the remaining rows

In [120]:
ratingUser_Left = pd.merge(ratings,users, how = 'left',on = 'userID')
ratingUser_Left

Unnamed: 0,userID,ISBN,bookRating,Location,Age
0,276725,034545104X,0,"tyler, texas, usa",34.72384
1,276726,0155061224,5,"seattle, washington, usa",34.72384
2,276727,0446520802,0,"h, new south wales, australia",16.00000
3,276729,052165615X,3,"rijeka, n/a, croatia",16.00000
4,276729,0521795028,6,"rijeka, n/a, croatia",16.00000
5,276733,2080674722,0,"paris, n/a, france",37.00000
6,276744,038550120X,7,"torrance, california, usa",34.72384
7,276746,0425115801,0,"fort worth, ,",34.72384
8,276746,0449006522,0,"fort worth, ,",34.72384
9,276746,0553561618,0,"fort worth, ,",34.72384


In [121]:
ratingUser_Left['Location'].isnull().sum()
# All users in Ratings are present in Users dataframe.

0
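Checking a joined column such as `Location` for nulls relies on that column never being missing in the users table. A more direct check is the `indicator` option of `merge`, sketched here on made-up frames (user `9` is deliberately absent from `users_demo`).

```python
import pandas as pd

ratings_demo = pd.DataFrame(
    {"userID": [1, 2, 9], "ISBN": ["a", "b", "c"], "bookRating": [5, 8, 7]}
)
users_demo = pd.DataFrame({"userID": [1, 2, 3], "Age": [18.0, 17.0, 34.0]})

# indicator=True adds a _merge column: "both" or "left_only" for a left join.
merged = ratings_demo.merge(users_demo, on="userID", how="left", indicator=True)

n_unknown = int((merged["_merge"] == "left_only").sum())
known_only = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(n_unknown)
```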

### Consider only explicit ratings from 1 to 10 and drop the 0s (implicit ratings) in column `bookRating`

In [123]:
ratings.count()

userID        1031129
ISBN          1031129
bookRating    1031129
dtype: int64

In [124]:
ratings[(ratings['bookRating'] == 0)].count()

userID        647291
ISBN          647291
bookRating    647291
dtype: int64

In [125]:
ind = ratings[ratings['bookRating'] == 0].index

In [126]:
ratings.drop(ind, inplace = True)

In [127]:
ratings[(ratings['bookRating'] == 0)].count()

userID        0
ISBN          0
bookRating    0
dtype: int64

In [128]:
ratings.count()

userID        383838
ISBN          383838
bookRating    383838
dtype: int64
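The index-then-drop steps above can also be written as a single boolean filter, which keeps the implicit ratings available in their own frame should they be needed later. A minimal sketch on made-up data:

```python
import pandas as pd

ratings_demo = pd.DataFrame(
    {
        "userID": [1, 1, 2, 2],
        "ISBN": ["a", "b", "a", "c"],
        "bookRating": [0, 5, 0, 8],
    }
)

is_explicit = ratings_demo["bookRating"] != 0

# Explicit ratings (1-10) and implicit ratings (0) as separate frames.
explicit_ratings = ratings_demo[is_explicit].copy()
implicit_ratings = ratings_demo[~is_explicit].copy()
print(len(explicit_ratings), len(implicit_ratings))
```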

### Find out which rating has been given the highest number of times

In [129]:
ratings['bookRating'].value_counts()
# Rating 8 is given the highest number of times.

8     91803
10    71225
7     66401
9     60776
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64
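The most frequent rating can also be read off programmatically instead of by eye. This sketch uses a short made-up Series; `normalize=True` additionally reports each rating's share of the total.

```python
import pandas as pd

book_ratings = pd.Series([8, 8, 10, 7, 8, 10], name="bookRating")

counts = book_ratings.value_counts()
most_common = counts.idxmax()          # rating with the largest count
shares = book_ratings.value_counts(normalize=True)
print(most_common, shares[most_common])
```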

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [130]:
hund = (ratings.groupby('userID').count() >= 100)

In [133]:
var = hund[hund['bookRating'] == True]

In [134]:
var
# Users who have rated at least 100 books

Unnamed: 0_level_0,ISBN,bookRating
userID,Unnamed: 1_level_1,Unnamed: 2_level_1
2033,True,True
2110,True,True
2276,True,True
4017,True,True
4385,True,True
5582,True,True
6242,True,True
6251,True,True
6543,True,True
6575,True,True


In [135]:
p = pd.DataFrame(var.index)
# Here we have extracted the userIDs that have rated at least 100 books

In [136]:
p.count()
# No. of userIDs that have rated at least 100 books.

userID    449
dtype: int64

In [137]:
temp_ratings = pd.merge(ratings,p, how = 'inner', on = 'userID')
    

In [138]:
ratings.count(), temp_ratings.count()

(userID        383838
 ISBN          383838
 bookRating    383838
 dtype: int64, userID        103269
 ISBN          103269
 bookRating    103269
 dtype: int64)

In [139]:
temp_ratings.groupby('userID').count()
# There are 449 users who rated at least 100 books. Below is the list of those users with the number of books each has rated

Unnamed: 0_level_0,ISBN,bookRating
userID,Unnamed: 1_level_1,Unnamed: 2_level_1
2033,129,129
2110,103,103
2276,196,196
4017,154,154
4385,212,212
5582,132,132
6242,134,134
6251,217,217
6543,174,174
6575,233,233


In [140]:
temp_ratings
# This is the subset of the ratings table containing only users who rated at least 100 books.
# We shall use this table for the recommendations that follow.

Unnamed: 0,userID,ISBN,bookRating
0,277427,002542730X,10
1,277427,003008685X,8
2,277427,0060006641,10
3,277427,0060542128,7
4,277427,0061009059,9
5,277427,0062507109,8
6,277427,0132220598,8
7,277427,0140283374,6
8,277427,014039026X,8
9,277427,0140390715,7


### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [141]:
R_df = temp_ratings.pivot(index = 'userID', columns = 'ISBN', values = 'bookRating').fillna(0)
R_df

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5582,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6242,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6543,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
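Because almost every cell in the matrix above is 0, it can be useful to quantify how sparse it is before factorizing. The sketch below builds a tiny user-by-ISBN matrix the same way and computes the fraction of filled cells; the `demo` data is made up.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "userID": [1, 1, 2, 3],
        "ISBN": ["a", "b", "a", "c"],
        "bookRating": [5, 8, 7, 10],
    }
)

# Rows are users, columns are ISBNs; unrated cells become 0.
R_demo = demo.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)

# Fraction of cells holding an explicit rating.
density = (R_demo.to_numpy() != 0).mean()
print(R_demo.shape, density)
```

Note that `pivot` raises an error if the same user has rated the same ISBN more than once; `pivot_table` with an aggregation function is the usual fallback in that case.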


### Generate the predicted ratings using SVD with no.of singular values to be 50

In [142]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_df.values, k = 50)  # pass the underlying NumPy array rather than the DataFrame

In [143]:
sigma 

array([147.92121613, 149.3438051 , 150.07400599, 152.20116297,
       152.87416391, 154.61308307, 154.80093432, 155.95760177,
       158.05646578, 159.21079484, 159.81670657, 162.01963916,
       162.77851768, 163.33054635, 166.02489324, 166.8162391 ,
       168.04972004, 170.77485167, 171.01325686, 173.29428498,
       174.57624968, 176.65724713, 178.61913749, 180.29517222,
       182.25079063, 184.10706957, 187.61687534, 189.75276623,
       190.96966388, 195.14643609, 199.83133018, 201.70083339,
       202.18713912, 203.48697581, 207.26449173, 209.92986988,
       213.23598777, 216.88280493, 224.26954726, 231.66186197,
       235.67095629, 249.9581775 , 252.02866425, 261.24756904,
       267.98197504, 281.0120779 , 293.69539562, 379.58327277,
       634.72875357, 680.30978318])

In [144]:
sigma = np.diag(sigma)
sigma

array([[147.92121613,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        , 149.3438051 ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        ,   0.        , 150.07400599, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [  0.        ,   0.        ,   0.        , ..., 379.58327277,
          0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
        634.72875357,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        , 680.30978318]])

In [145]:
all_users_predicted_ratings = np.dot(np.dot(U, sigma), Vt)

In [146]:
preds_df = pd.DataFrame(all_users_predicted_ratings, columns = R_df.columns)
preds_df

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.025341,-0.002146,-1.430820e-03,-0.002146,-0.002146,0.002971,-0.003920,0.007035,0.007035,0.012316,...,0.000180,0.000226,0.042081,-0.016804,-0.080028,0.004746,0.028314,0.000120,-0.001693,0.067503
1,-0.010012,-0.003669,-2.446297e-03,-0.003669,-0.003669,0.001075,0.001440,-0.003500,-0.003500,0.001612,...,-0.000363,0.000403,0.008142,0.001104,-0.029224,0.000999,0.002363,-0.000242,0.000029,-0.013059
2,-0.015054,-0.015457,-1.030440e-02,-0.015457,-0.015457,0.007281,-0.014033,0.011941,0.011941,0.011796,...,-0.000455,0.001907,0.047982,0.005737,0.117859,0.006945,0.003119,-0.000304,0.009009,-0.057692
3,-0.021499,0.035602,2.373467e-02,0.035602,0.035602,0.030307,0.024215,-0.001053,-0.001053,0.067579,...,0.002971,0.009912,0.086248,-0.008818,0.016154,0.028848,-0.000125,0.001981,0.031201,-0.046664
4,0.002077,-0.007965,-5.310012e-03,-0.007965,-0.007965,0.002947,0.003057,0.000231,0.000231,0.006080,...,0.002120,0.001597,-0.012181,0.009420,0.673459,0.002591,-0.008229,0.001413,0.004918,0.047773
5,-0.002046,0.018614,1.240949e-02,0.018614,0.018614,0.007966,0.022983,-0.005167,-0.005167,0.018298,...,0.002278,0.003343,0.029729,-0.013429,-0.069757,0.008082,0.003330,0.001519,0.008519,0.072081
6,-0.015920,0.020221,1.348068e-02,0.020221,0.020221,0.014194,0.016776,-0.002339,-0.002339,0.031493,...,0.002596,0.003667,0.030025,0.012455,0.030068,0.013586,0.001308,0.001731,0.014216,-0.004550
7,-0.010875,-0.010051,-6.700652e-03,-0.010051,-0.010051,0.023373,-0.014173,0.017087,0.017087,0.057111,...,0.000141,0.006781,0.121680,-0.000650,-0.319179,0.018844,0.045594,0.000094,0.014902,0.012477
8,0.040930,-0.030352,-2.023461e-02,-0.030352,-0.030352,0.018473,-0.031587,-0.013278,-0.013278,0.037860,...,0.002069,0.005255,0.070450,0.007803,-0.077213,0.014357,0.005920,0.001380,0.016526,-0.027260
9,0.023473,-0.004168,-2.778375e-03,-0.004168,-0.004168,0.054426,-0.015698,-0.011736,-0.011736,0.114924,...,0.006618,0.012665,0.172246,-0.010902,-0.175556,0.049872,0.012847,0.004412,0.051858,-0.023768


### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [198]:
def recommend_movies(predictions_df, userID, books_df, original_ratings_df, rowid, num_recommendations = 5):
    # Note: despite its name, this function recommends books.
    # rowid is the 1-based row position of the user in the ratings matrix;
    # predictions_df has a 0-based positional index, hence the -1.
    user_row_number = rowid - 1
    sorted_user_predictions = predictions_df.loc[user_row_number].sort_values(ascending = False)
    
    # Ratings already given by this user, joined with the book details
    user_data = original_ratings_df[original_ratings_df.userID == (userID)]
    user_full = (user_data.merge(books_df, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
                sort_values(['bookRating'], ascending = False)
                )
    print('User {0} has already rated {1} books.'.format(userID, user_full.dropna().shape[0]))
    print('Recommending the highest {0} predicted ratings books not already rated.'.format(num_recommendations))
    
    # Keep books the user has not rated, attach the predicted ratings and take the top N;
    # the final iloc drops the trailing 'Predictions' column from the result.
    recommendations = (books_df[~books_df['ISBN'].isin(user_full['ISBN'])].
                      merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
                           left_on = 'ISBN',
                           right_on = 'ISBN').
                      rename(columns = {user_row_number: 'Predictions'}).
                      sort_values('Predictions', ascending = False).
                      iloc[:num_recommendations, :-1])
                       
    # user_full is returned twice so that the five-name unpacking in the next cell works
    return user_full, recommendations, sorted_user_predictions, user_data, user_full

In [199]:
already_rated, predictions, sorted_user_predictions, user_data, user_full = recommend_movies(preds_df, 2110, books, temp_ratings, 2, 10)
# rowid = 2 because user 2110 is in the 2nd row of the ratings matrix R_df (and therefore of preds_df).

User 2110 has already rated 103 books.
Recommending the highest 10 predicted ratings books not already rated.


In [55]:
userID = 2110

In [56]:
user_id = 2 #2nd row in ratings matrix and predicted matrix


In [149]:
predictions
# Recommendation for user 2110

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
1192,0345370775,Jurassic Park,Michael Crichton,1999,Ballantine Books
6184,0345384911,Crystal Line,Anne McCaffrey,1993,Del Rey Books
5458,043935806X,Harry Potter and the Order of the Phoenix (Boo...,J. K. Rowling,2003,Scholastic
455,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group
2031,0451151259,Eyes of the Dragon,Stephen King,1988,Penguin Putnam~mass
5383,0439139597,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2000,Scholastic
3413,0439064872,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,2000,Scholastic
976,0380759497,Xanth 15: The Color of Her Panties,Piers Anthony,1992,Eos
2435,0345353145,Sphere,MICHAEL CRICHTON,1988,Ballantine Books
6048,0451167317,The Dark Half,Stephen King,1994,Signet Book
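The core of the recommendation step is to remove the books the user has already rated from that user's predicted ratings and then take the highest remaining values. The sketch below isolates that step on a made-up prediction Series (the ISBN labels and scores are illustrative).

```python
import pandas as pd

preds_for_user = pd.Series(
    {"isbn_a": 0.68, "isbn_b": 0.37, "isbn_c": 0.33, "isbn_d": -0.04},
    name="Predictions",
)
already_rated = ["isbn_a"]  # books this user has rated

# Drop rated books, then keep the N highest predicted ratings.
unseen = preds_for_user[~preds_for_user.index.isin(already_rated)]
top_n = unseen.nlargest(2)
print(top_n)
```

This filtering is what keeps a book the user has already rated out of the recommendations even when it has the highest predicted rating.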


### Get the predicted ratings for userID `2110` and sort them in descending order

In [150]:
sorted_user_predictions

ISBN
059035342X    0.682444
0345370775    0.368946
0345384911    0.333624
043935806X    0.333209
044021145X    0.329336
0451151259    0.313295
0439139597    0.305088
0439064872    0.290587
0380759497    0.278563
0345353145    0.250941
0451167317    0.249254
0439136369    0.242676
0618002235    0.239957
0880389117    0.239552
0618002227    0.239242
0451160525    0.234959
0446310786    0.231819
0451173317    0.229402
0440213525    0.228038
0060392452    0.227935
0439136350    0.226968
0345335287    0.223613
1560768304    0.221496
0441845630    0.221496
0451156609    0.221396
0451180232    0.221054
0345317580    0.219552
0451142934    0.218949
0312980140    0.218348
0670835382    0.216858
                ...   
0553576925   -0.042664
0688088686   -0.043207
0786000899   -0.043301
0553567683   -0.043553
042518630X   -0.043825
0671673661   -0.044025
0345361571   -0.044076
042517770X   -0.044668
0446603090   -0.044738
0684195984   -0.047424
0684195976   -0.047690
0679405283   -0.047838
055380

### Create a dataframe named `user_data` containing the books explicitly rated by userID `2110`

In [151]:
user_data
# The userID 2110 rated 103 books

Unnamed: 0,userID,ISBN,bookRating
381,2110,0060987529,7
382,2110,0064472779,8
383,2110,0140022651,10
384,2110,0142302163,8
385,2110,0151008116,5
386,2110,015216250X,8
387,2110,0345260627,10
388,2110,0345283554,10
389,2110,0345283929,10
390,2110,034528710X,10


In [152]:
user_data.head()

Unnamed: 0,userID,ISBN,bookRating
381,2110,0060987529,7
382,2110,0064472779,8
383,2110,0140022651,10
384,2110,0142302163,8
385,2110,0151008116,5


In [153]:
user_data.shape

(103, 3)

### Combine `user_data` and the corresponding book data in a single dataframe (named `user_full` here)

In [154]:
user_full

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
76,2110,067166865X,10,STAR TREK YESTERDAY'S SON (Star Trek: The Orig...,A.C. Crispin,1988,Audioworks
52,2110,0590109715,10,"The Andalite Chronicles (Elfangor's Journey, A...",Katherine Applegate,1997,Apple
64,2110,0590629786,10,"The Visitor (Animorphs, No 2)",K. A. Applegate,1996,Scholastic
63,2110,0590629778,10,"The Invasion (Animorphs, No 1)",K. A. Applegate,1996,Scholastic
61,2110,059046678X,10,The Yearbook,Peter Lerangis,1994,Scholastic
55,2110,059035342X,10,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books
93,2110,0812505042,10,The Time Machine,H. G. Wells,1995,Tor Books
54,2110,0590213040,10,The Andalite's Gift (Animorphs : Megamorphs 1),K. A. Applegate,1997,Scholastic
53,2110,0590109960,10,Watchers #1: Last Stop,Peter Lerangis,1998,Scholastic
82,2110,0679805265,10,Long Shot (Three Investigators Crimebusters (P...,Megan Stine,1993,Random House Children's Books


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [157]:
predictions
# Top 10 recommendation for userid 2110 from books not already rated

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
1192,0345370775,Jurassic Park,Michael Crichton,1999,Ballantine Books
6184,0345384911,Crystal Line,Anne McCaffrey,1993,Del Rey Books
5458,043935806X,Harry Potter and the Order of the Phoenix (Boo...,J. K. Rowling,2003,Scholastic
455,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group
2031,0451151259,Eyes of the Dragon,Stephen King,1988,Penguin Putnam~mass
5383,0439139597,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2000,Scholastic
3413,0439064872,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,2000,Scholastic
976,0380759497,Xanth 15: The Color of Her Panties,Piers Anthony,1992,Eos
2435,0345353145,Sphere,MICHAEL CRICHTON,1988,Ballantine Books
6048,0451167317,The Dark Half,Stephen King,1994,Signet Book
