**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books, and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denoting higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the cells below to load the datasets

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

# Better example of SVD at: https://surprise.readthedocs.io/en/stable/getting_started.html
from surprise import SVD, Dataset, Reader

# Alternative way of doing singular value decomposition: https://beckernick.github.io/matrix-factorization-recommender/
from scipy.sparse.linalg import svds

In [2]:
# Load the data. Malformed lines are skipped and reported.
# `on_bad_lines` requires pandas >= 1.3; older releases used `error_bad_lines=False` instead.
books = pd.read_csv("books.csv", sep=";", on_bad_lines="warn", encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
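The skipped-line messages above come from rows with more fields than the header declares. A minimal sketch of the same behaviour on an in-memory sample follows; the three-column data is hypothetical, and `on_bad_lines` assumes pandas 1.3 or later.

```python
import io
import pandas as pd

# Hypothetical three-column sample; the second data row has an extra ';' in the title.
raw = io.StringIO(
    "ISBN;bookTitle;bookAuthor\n"
    "0001;First Book;Author A\n"
    "0002;Broken; Title;Author B\n"
    "0003;Third Book;Author C\n"
)

# on_bad_lines="skip" drops rows whose field count exceeds the header's.
# dtype=str keeps leading zeros in ISBN-like identifiers.
sample = pd.read_csv(raw, sep=";", on_bad_lines="skip", dtype=str)
print(sample)
```

Reading identifiers as strings is a design choice worth keeping in mind here, since an ISBN parsed as an integer loses its leading zeros.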


### Check the number of records and features in each dataset

In [3]:
print("Books: ", books.shape)
print("Users: ", users.shape)
print("Ratings: ", ratings.shape)

Books:  (271360, 8)
Users:  (278858, 3)
Ratings:  (1149780, 3)


In [4]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB


In [5]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [6]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


## Exploring books dataset

In [7]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [8]:
books.drop(books.columns[-3:], axis=1, inplace=True)

In [9]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [10]:
pd.unique(books.yearOfPublication)

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

The output above shows some incorrect entries in this field. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication, most likely because malformed rows in the CSV file shifted their fields by one column.

Some years are also stored as strings while the same years appear elsewhere as integers. Both problems are fixed in the steps below.
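One way to spot both problems at once is to coerce the column to numbers and look at what fails to parse. The sketch below uses a small hypothetical series rather than the real `books` table; `pd.to_numeric` with `errors="coerce"` is standard pandas.

```python
import pandas as pd

# Hypothetical column mixing ints, numeric strings, and a misplaced publisher name.
years = pd.Series([2002, "2000", "0", 1999, "DK Publishing Inc"], dtype=object)

# errors="coerce" turns anything non-numeric into NaN instead of raising.
parsed = pd.to_numeric(years, errors="coerce")

bad_rows = years[parsed.isna()]          # entries that are not years at all
clean = parsed.dropna().astype("int16")  # remaining values as small integers

print(bad_rows.tolist())
print(clean.tolist())
```

Numeric strings such as `"2000"` parse to the same value as the integer `2000`, so the string/integer duplication disappears once the non-numeric rows are handled.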

### Check the rows having 'DK Publishing Inc' or 'Gallimard' as yearOfPublication

In [11]:
books[(books.yearOfPublication=='DK Publishing Inc') | (books.yearOfPublication=='Gallimard')]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [12]:
books.drop(books.index[(books.yearOfPublication=='DK Publishing Inc') | (books.yearOfPublication=='Gallimard')], 
           inplace=True)

books[(books.yearOfPublication=='DK Publishing Inc') | (books.yearOfPublication=='Gallimard')]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


### Change the datatype of yearOfPublication to 'int'

In [13]:
books.yearOfPublication = pd.to_numeric(books.yearOfPublication, downcast='integer')
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271357 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271357 non-null object
bookTitle            271357 non-null object
bookAuthor           271356 non-null object
yearOfPublication    271357 non-null int16
publisher            271355 non-null object
dtypes: int16(1), object(4)
memory usage: 10.9+ MB


In [14]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int16
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [15]:
# Inspect the rows with a missing publisher before dropping them

books[books.publisher.isna()]

#pub = pd.unique(books.publisher)
#len(pub)

# Why can't the book titles with an empty publisher be found when the file is opened in Excel?
    # Excel does not format the values properly and mishandles values that contain a comma (,).
    # A value copied from Excel should therefore be searched for here with `str.contains` rather than an exact match.
    # Avoid using Excel to view this data.
#books[books.bookTitle.str.contains('The Ruby in the Smoke (Sally Lockhart Trilogy', regex=False)]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


In [16]:
books.drop(books.index[books.publisher.isnull()], inplace=True)

# After dropping the rows, search again. The result should be empty if the rows were dropped successfully.
books[books.publisher.isna()]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


## Exploring Users dataset

In [17]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [18]:
np.sort(pd.unique(users.Age))

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 113., 114., 115., 116., 118., 119., 123., 124., 127.,
       128., 132., 133., 136., 137., 138., 140., 141., 143., 146., 147.,
       148., 151., 152., 156., 157., 159., 162., 168., 172., 175., 183.,
       186., 189., 199., 200., 201., 204., 207., 20

The Age column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

### Ages below 5 and above 90 are implausible for book ratings, so replace them with NaN

In [19]:
out_of_range = (users.Age < 5) | (users.Age > 90)
print("Number of users below 5 or above 90: ", out_of_range.sum())

# Assign through .loc to avoid chained assignment, which may modify a copy instead of `users`
users.loc[out_of_range, "Age"] = np.nan

print("Number of users below 5 or above 90, after replacement with NaN: ", len(users[(users.Age < 5) | (users.Age > 90)]))

Number of users below 5 or above 90:  1312
Number of users below 5 or above 90, after replacement with NaN:  0


### Replace null values in column `Age` with mean

In [20]:
print("Number of rows in users with null age: ", users.Age.isnull().sum())

mean_age = int(np.round(users.Age.mean()))
print("Mean age: ", mean_age)

# Assign through .loc to avoid chained assignment, which may modify a copy instead of `users`
users.loc[users.Age.isnull(), "Age"] = mean_age

print("Number of rows in users with null age, after replacement with mean: ", users.Age.isnull().sum())

Number of rows in users with null age:  112074
Mean age:  35
Number of rows in users with null age, after replacement with mean:  0
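The two age-cleaning steps (mask implausible values, then impute the mean) can also be chained without any in-place assignment. This is a minimal sketch on a hypothetical series, not the notebook's `users` table; the 5 to 90 bounds are the ones chosen above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including missing, zero, and implausibly high values.
ages = pd.Series([np.nan, 18.0, 0.0, 17.0, 120.0, 61.0], name="Age")

# Keep ages within [5, 90]; everything else (including NaN) becomes NaN.
valid = ages.where(ages.between(5, 90))

# Impute with the rounded mean of the valid ages, then store compactly.
mean_age = int(round(valid.mean()))
cleaned = valid.fillna(mean_age).astype("int8")

print(mean_age, cleaned.tolist())
```

Note that mean imputation concentrates a large share of users on a single age, which is visible in the real data where more than 110,000 rows receive the value 35.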


### Change the datatype of `Age` to `int`

In [21]:
users.Age = pd.to_numeric(users.Age).astype(dtype="int8")

users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null int8
dtypes: int64(1), int8(1), object(1)
memory usage: 4.5+ MB


In [22]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


## Exploring the Ratings Dataset

### Check the shape

In [23]:
ratings.shape

(1149780, 3)

In [24]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should only contain books that exist in the books dataset. Drop the remaining rows

In [25]:
print(len(books))
books.head(1)

271355


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press


In [26]:
print(len(ratings))
ratings.head(1)

1149780


Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0


In [27]:
ratings = pd.merge(ratings, books, how="inner", on = "ISBN").iloc[:, :3]
print("Length after removing ratings without matching book: ", len(ratings))
ratings.head()

Length after removing ratings without matching book:  1031130


Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,2313,034545104X,5
2,6543,034545104X,0
3,8680,034545104X,5
4,10314,034545104X,9
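The inner merge above works, but it also reorders the rows (as the changed `head()` output shows) and would duplicate a rating if an ISBN appeared twice in `books`. A boolean filter with `isin` is a possible alternative; the sketch below uses hypothetical demo frames, and the ISBN `9999999999` is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the books and ratings tables.
books_demo = pd.DataFrame({"ISBN": ["034545104X", "0155061224"]})
ratings_demo = pd.DataFrame({
    "userID": [276725, 276726, 276727],
    "ISBN": ["034545104X", "0155061224", "9999999999"],  # last ISBN is not in books_demo
    "bookRating": [0, 5, 0],
})

# Keep only ratings whose ISBN is present in the books table; row order is preserved.
kept = ratings_demo[ratings_demo["ISBN"].isin(books_demo["ISBN"])]

print(len(ratings_demo) - len(kept), "rating(s) dropped")
```

The same pattern applies to the user filter in the next step, with `userID` in place of `ISBN`.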


### The ratings dataset should only contain ratings from users that exist in the users dataset. Drop the remaining rows

In [28]:
users.head(1)

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",35


In [29]:
ratings = pd.merge(ratings, users, how="inner", on="userID").iloc[:, :3]
print("Length after removing ratings without matching user: ", len(ratings))
ratings.head()

Length after removing ratings without matching user:  1031130


Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,2313,034545104X,5
2,2313,0812533550,9
3,2313,0679745580,8
4,2313,0060173289,9


### Consider only explicit ratings from 1-10 and drop the 0s (implicit ratings) in column `bookRating`

In [30]:
# Check that the column bookRating only contains values from 0 to 10
pd.unique(ratings.bookRating)

array([ 0,  5,  9,  8,  7,  6, 10,  3,  4,  2,  1], dtype=int64)

In [31]:
# Check how many ratings have the value 0
print("Number of ratings with value 0: ", len(ratings[ratings.bookRating==0]))

ratings.drop(ratings.index[ratings.bookRating==0], inplace=True)

print("Number of ratings, after dropping ratings with value 0: ", len(ratings))

Number of ratings with value 0:  647291
Number of ratings, after dropping ratings with value 0:  383839


### Find out which rating has been given highest number of times

In [32]:
#sorted(ratings.bookRating.value_counts(), reverse=True)
ratings.bookRating.value_counts()

8     91804
10    71225
7     66401
9     60776
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64

8 is the rating given the highest number of times.
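The most frequent rating can also be read off programmatically instead of by eye. A small sketch on a hypothetical series (not the real `ratings` table):

```python
import pandas as pd

# Hypothetical explicit ratings.
demo = pd.Series([8, 10, 8, 7, 8, 10, 5], name="bookRating")

counts = demo.value_counts()          # sorted by frequency, most common first
most_common = counts.idxmax()         # the rating value with the highest count
share = counts.max() / counts.sum()   # fraction of all ratings with that value

print(most_common, round(share, 3))
```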

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [33]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
1,2313,034545104X,5
2,2313,0812533550,9
3,2313,0679745580,8
4,2313,0060173289,9
5,2313,0385482388,5


In [34]:
# Generate a dataframe with each user and the corresponding number of ratings.
# rename_axis/reset_index yields plain 'userID' and 'frequency' columns on both old and new pandas versions.
temp = ratings.userID.value_counts().rename_axis("userID").reset_index(name="frequency")


# Filter ratings by comparing with the above dataframe
print("Total number of ratings is ", ratings.shape[0])
ratings = pd.merge(ratings, temp[temp.frequency>=100], how="inner", on="userID").iloc[:, :3]
print("Number of ratings from users with at least 100 ratings is ", ratings.shape[0])
ratings.head()

Total number of ratings is  383839
Number of ratings from users with at least 100 ratings is  103269


Unnamed: 0,userID,ISBN,bookRating
0,6543,0446605484,10
1,6543,0805062971,8
2,6543,0345342968,8
3,6543,0446610038,9
4,6543,0061009059,8
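The count-then-merge filter can be expressed without a helper dataframe by broadcasting each user's rating count back onto the rows. This is a sketch on hypothetical data with a threshold of 3 instead of the notebook's 100.

```python
import pandas as pd

# Hypothetical ratings: user 1 has three ratings, user 2 one, user 3 two.
demo = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3],
    "ISBN": ["a", "b", "c", "a", "b", "c"],
    "bookRating": [8, 9, 7, 10, 5, 6],
})
min_ratings = 3  # the notebook uses 100

# transform("size") returns, for every row, the number of rows of that row's user.
counts = demo.groupby("userID")["ISBN"].transform("size")
active = demo[counts >= min_ratings]

print(active)
```

Because no merge is involved, the row order and the original index of the kept ratings stay unchanged.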


### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [35]:
ratings[ratings.isnull().any(axis=1)]

Unnamed: 0,userID,ISBN,bookRating


No rows found with NaN in the ratings dataframe

### Generate the predicted ratings using SVD with the number of singular values set to 50

In [36]:
#from surprise import SVD, Dataset, Reader
#from surprise import Reader

In [37]:
# The dataframe passed to Dataset.load_from_df must have exactly three columns, in this order: [userId, itemId, rating]
#reader = Reader(rating_scale=(1, 10))
#data = Dataset.load_from_df(ratings, reader)

#trainset = data.build_full_trainset()
#trainset

In [38]:
# What does trainset.ur do?
    # It is a dictionary structure keyed by user.
    # For each user, the value is a list of (item, rating) pairs.
    # Note that users and items are represented by their inner index instead of their actual (raw) id.
    # This way of representing data is called: ???
#trainset.ur

In [39]:
#algo = SVD()
#algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
#testset = trainset.build_anti_testset()
#testset

In [40]:
#predictions = algo.test(testset)
#predictions

# Calling the test method above runs out of memory

#### Alternate method of SVD

In [41]:
# Format ratings matrix to one row per user and one column per book, with 0 where not rated
r_df = ratings.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)

r_df.head()

userID,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
2033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
# De-mean the data (normalize by each user's mean) and convert it from a dataframe to a numpy array
r_matrix = r_df.values   # DataFrame.as_matrix() was removed in pandas 1.0
user_rating_mean = np.mean(r_matrix, axis=1)   # Mean of each row; unrated books count as 0 in this mean
r_demeaned = r_matrix - user_rating_mean.reshape(-1, 1)     # Subtract each user's mean from that user's row

r_demeaned

array([[-0.016899  , -0.016899  , -0.016899  , ..., -0.016899  ,
        -0.016899  , -0.016899  ],
       [-0.01281319, -0.01281319, -0.01281319, ..., -0.01281319,
        -0.01281319, -0.01281319],
       [-0.02461996, -0.02461996, -0.02461996, ..., -0.02461996,
        -0.02461996, -0.02461996],
       ...,
       [-0.01804062, -0.01804062, -0.01804062, ..., -0.01804062,
        -0.01804062, -0.01804062],
       [-0.0185213 , -0.0185213 , -0.0185213 , ..., -0.0185213 ,
        -0.0185213 , -0.0185213 ],
       [-0.00962867, -0.00962867, -0.00962867, ..., -0.00962867,
        -0.00962867, -0.00962867]])
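The de-meaned values above are all close to zero because the row mean is taken over every book, including the many unrated ones stored as 0. The sketch below reproduces that computation on a tiny hypothetical matrix and, purely for comparison, also computes the mean over rated entries only; the rated-only mean is an illustration and is not used anywhere in this notebook.

```python
import numpy as np

# Hypothetical 2-user x 4-book matrix; 0 means "not rated".
R = np.array([
    [10.0, 0.0, 0.0, 8.0],
    [0.0, 5.0, 0.0, 0.0],
])

# Same computation as the cell above: mean over ALL columns, zeros included.
row_mean = R.mean(axis=1)
R_demeaned = R - row_mean.reshape(-1, 1)

# For comparison only: mean over the rated (non-zero) entries of each row.
rated_mean = R.sum(axis=1) / (R != 0).sum(axis=1)

print(row_mean)    # pulled toward 0 by the unrated entries
print(rated_mean)  # reflects only the ratings actually given
```

With tens of thousands of columns and roughly a hundred ratings per user, the all-column mean shrinks to a few hundredths, which matches the magnitude of the values printed above.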

In [43]:
#from scipy.sparse.linalg import svds

# Use latent-factor value 50 for approximation. It is the dimension of the sigma matrix.
U, sigma, Vt = svds(r_demeaned, k = 50)

In [44]:
sigma

array([147.85406428, 149.03873604, 150.00542937, 152.11508432,
       152.86090591, 154.577847  , 154.73136341, 155.94010654,
       158.00953632, 159.20282665, 159.69203677, 161.70248304,
       162.65252849, 163.05501865, 165.93119223, 166.66652357,
       167.91825361, 170.51932114, 170.95306073, 172.95274052,
       173.77510684, 176.64996981, 178.02982499, 180.29330376,
       182.07325734, 183.90232841, 187.40100654, 189.42369442,
       190.70522784, 195.12547971, 198.00444269, 200.36773193,
       202.00827912, 203.42551365, 207.19628835, 209.88598921,
       212.68581391, 215.59150477, 220.67458007, 231.04636103,
       234.76022319, 244.27172613, 249.99796019, 255.2411884 ,
       263.58816449, 278.15145742, 291.33530188, 374.71881607,
       599.45284918, 651.92625064])

In [45]:
# Convert sigma into a diagonal matrix
sigma = np.diag(sigma)
sigma

array([[147.85406428,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        , 149.03873604,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        ,   0.        , 150.00542937, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [  0.        ,   0.        ,   0.        , ..., 374.71881607,
          0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
        599.45284918,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        , 651.92625064]])

In [46]:
# We now have everything we need to make rating predictions for every user.
# We can do it all at once by doing matrix multiplication between U, Σ, and Vt.
# The multiplication will give back the rank k=50 approximation of R.
# We also need to add the mean rating by each user back to get the predicted n-star (10 in this case) ratings.

# Get all user ratings by multiplication of U, sigma and Vt
allusers_ratings_predications = np.dot(np.dot(U, sigma), Vt) + user_rating_mean.reshape(-1, 1)
pred_df = pd.DataFrame(allusers_ratings_predications, columns=r_df.columns, index=r_df.index)     # Convert to Dataframe

pred_df.head()

userID,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
2033,0.026782,-0.002117,-0.000493,-0.002117,-0.002117,0.00403,-0.001473,0.006066,0.006066,0.012757,...,-0.000761,-0.002984,0.045408,-0.015242,-0.078286,0.00483,0.030557,0.000411,-0.006341,0.069061
2110,-0.007081,-0.00352,0.002556,-0.00352,-0.00352,0.01517,0.000451,-0.00233,-0.00233,0.014413,...,0.01409,0.014776,0.019194,0.00451,-0.019647,0.014777,0.015427,0.014296,0.014224,-0.007096
2276,-0.010854,-0.016486,-0.002251,-0.016486,-0.016486,0.032315,-0.017119,0.011673,0.011673,0.034113,...,0.02533,0.027419,0.067758,0.012525,0.133598,0.031094,0.02594,0.025626,0.034242,-0.048261
4017,-0.020304,0.035857,0.023889,0.035857,0.035857,0.028706,0.028975,-0.000805,-0.000805,0.066419,...,-0.000948,0.002945,0.087435,-0.008658,0.015234,0.025709,-0.000298,-0.000648,0.023094,-0.045901
4385,0.011244,-0.011626,0.01209,-0.011626,-0.011626,0.060196,-0.013531,-0.011274,-0.011274,0.050544,...,0.065934,0.069227,0.037958,0.030262,0.711175,0.058916,0.041851,0.063797,0.071857,0.070338
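To see what the product of U, Σ, and Vt gives back, it can help to try the reconstruction on a small dense matrix. The sketch below is an illustration under stated assumptions: it uses `np.linalg.svd` (which returns singular values in descending order) instead of `scipy.sparse.linalg.svds`, a random hypothetical matrix instead of the ratings, and a helper named `rank_k` that does not exist in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))  # hypothetical small "ratings" matrix

# Thin SVD: A == U @ diag(s) @ Vt, with s sorted from largest to smallest.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k(k):
    """Reconstruct A from its k largest singular values (hypothetical helper)."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err2 = np.linalg.norm(A - rank_k(2))  # Frobenius error of the rank-2 approximation
err4 = np.linalg.norm(A - rank_k(4))  # more factors, smaller error
full = rank_k(len(s))                 # all factors recover A

print(err2, err4)
```

The Frobenius error of the rank-k approximation equals the square root of the sum of the squared discarded singular values, so the size of the values left out by `k = 50` indicates how much of the matrix the approximation ignores.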


### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [47]:
userID = 2110

In [48]:
user_id = 2  # userID 2110 is the 2nd row (counting from 1) in the ratings matrix and the predicted matrix

### Get the predicted ratings for userID `2110` and sort them in descending order

In [49]:
#pred_2110 = r_df.loc[2110]
#print(pred_2110.loc["0441000150"])

pred_2110 = pred_df.loc[userID]
pred_2110 = pred_2110.sort_values(ascending=False)

### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly rated

In [50]:
user_data = ratings[ratings.userID==2110]

In [51]:
user_data.head()

Unnamed: 0,userID,ISBN,bookRating
94453,2110,059035342X,10
94454,2110,0590448595,8
94455,2110,0451137965,9
94456,2110,0590629786,10
94457,2110,0590629794,10


In [52]:
user_data.shape

(103, 3)

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [53]:
# No mention of book_data before. Assuming that it is the prediction data
book_data = pd.DataFrame(pred_2110)
book_data["ISBN"] = book_data.index
book_data.columns = ["Predicted_rating", "ISBN"]
book_data.head()

ISBN,Predicted_rating,ISBN
059035342X,0.536383,059035342X
0345384911,0.293114,0345384911
0451151259,0.2781,0451151259
0380759497,0.257534,0380759497
0345370775,0.255887,0345370775


In [54]:
book_data.shape

(66572, 2)

In [55]:
book_data.head()

ISBN,Predicted_rating,ISBN
059035342X,0.536383,059035342X
0345384911,0.293114,0345384911
0451151259,0.2781,0451151259
0380759497,0.257534,0380759497
0345370775,0.255887,0345370775


In [56]:
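# Combine book_data and user_data.
# book_data has 'ISBN' both as its index name and as a column, which makes the merge key ambiguous on recent pandas versions, so the index is dropped first.
user_full_info = pd.merge(book_data.reset_index(drop=True), user_data, how="left", on="ISBN").drop("userID", axis=1).fillna(0)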

In [57]:
user_full_info.head()

Unnamed: 0,Predicted_rating,ISBN,bookRating
0,0.536383,059035342X,10.0
1,0.293114,0345384911,0.0
2,0.2781,0451151259,0.0
3,0.257534,0380759497,0.0
4,0.255887,0345370775,0.0


In [58]:
#user_data[user_data.ISBN=="059035342X"]

### Get top 10 recommendations for above given userID from the books not already rated by that user

In [59]:
# Get rows which are not rated by the user
unrated_ratings = user_full_info[user_full_info.bookRating==0]

sorted_ratings = unrated_ratings.sort_values(by="Predicted_rating", ascending=False)
#sorted_ratings

In [60]:
# Select top 10 ratings
recommendations = sorted_ratings[0:10]

# Get book information
recommendations = pd.merge(recommendations, books, how="inner", on="ISBN")
print("Recommendations: ")
recommendations

Recommendations: 


Unnamed: 0,Predicted_rating,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,0.293114,0345384911,0.0,Crystal Line,Anne McCaffrey,1993,Del Rey Books
1,0.2781,0451151259,0.0,Eyes of the Dragon,Stephen King,1988,Penguin Putnam~mass
2,0.257534,0380759497,0.0,Xanth 15: The Color of Her Panties,Piers Anthony,1992,Eos
3,0.255887,0345370775,0.0,Jurassic Park,Michael Crichton,1999,Ballantine Books
4,0.245707,044021145X,0.0,The Firm,John Grisham,1992,Bantam Dell Publishing Group
5,0.223661,0880389117,0.0,Flint the King (Dragonlance: Preludes),Mary Kirchoff,1990,Wizards of the Coast
6,0.223525,0345353145,0.0,Sphere,MICHAEL CRICHTON,1988,Ballantine Books
7,0.221413,1560768304,0.0,"The Dragons of Krynn (Dragonlance Dragons, Vol...",Margaret Weis,1994,Wizards of the Coast
8,0.221413,0441845630,0.0,Unicorn Point (Apprentice Adept (Paperback)),Piers Anthony,1990,ACE Charter
9,0.208325,0439064872,0.0,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,2000,Scholastic
