**About Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books, and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denoting higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project builds a book recommender system using user-based and item-based collaborative filtering approaches.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np

#### Execute the below cell to load the datasets

In [3]:
# Content-based recommendation for items using their text description:
# the description is converted into TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel # for cosine similarity

In [4]:
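# Load the data. Malformed rows are skipped; `error_bad_lines=False` was
# replaced by `on_bad_lines="skip"` in pandas 1.3 and removed in pandas 2.0.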
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


### Check the number of records and features in each dataset

In [5]:
books.shape

(271360, 8)

In [6]:
users.shape

(278858, 3)

In [7]:
ratings.shape

(1149780, 3)

In [8]:
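# min_df=1 keeps every term, as min_df=0 did; recent scikit-learn releases reject an integer min_df below 1
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')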
tfidf_matrix = tf.fit_transform(books['bookTitle'])

In [9]:
tfidf_matrix.shape

(271360, 1085188)
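
The `linear_kernel` import above is not used in the cells shown. As a minimal sketch of how a TF-IDF matrix could feed a content-based lookup, the following computes title-to-title similarity on a handful of toy titles. `sample_titles` and `most_similar` are hypothetical names introduced for illustration. For the full 271,360-row matrix, the similarity would have to be computed one query row at a time rather than as a dense all-pairs matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy titles standing in for books['bookTitle']
sample_titles = [
    "The Time Machine",
    "The Invisible Man",
    "Return of the Jedi",
    "The Time Traveler's Wife",
]

tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")
tfidf = tf.fit_transform(sample_titles)

# TfidfVectorizer L2-normalises each row by default, so the plain dot
# product computed by linear_kernel equals cosine similarity.
sim = linear_kernel(tfidf, tfidf)

def most_similar(idx, sim_matrix):
    """Return the position of the title most similar to `idx`, excluding itself."""
    scores = sim_matrix[idx].copy()
    scores[idx] = -1.0  # never return the query title itself
    return int(np.argmax(scores))

print(sample_titles[most_similar(0, sim)])
```
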

In [10]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
imageUrlS            object
imageUrlM            object
imageUrlL            object
dtype: object

## Exploring books dataset

In [11]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [12]:
books = books.iloc[:, :-3]

In [13]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [14]:
books['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

In [15]:
books['yearOfPublication'].value_counts()

2002         13903
2001         13715
1999         13414
2000         13373
1998         12116
2003         11610
1997         11494
1996         10687
1995         10259
1994          8857
1993          7920
1992          7390
1991          6926
1990          6394
1989          5825
1988          5545
1987          4761
2004          4629
1986          4258
1999          4017
1985          3912
2000          3859
2002          3724
1998          3650
2001          3644
1984          3631
0             3570
1997          3396
1996          3343
1983          3297
             ...  
1910             1
1930             1
1906             1
1904             1
1938             1
2021             1
1900             1
1806             1
1897             1
1926             1
2026             1
1378             1
1924             1
1926             1
2011             1
2024             1
1944             1
1924             1
2008             1
Gallimard        1
1927             1
2012        

As the output above shows, this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of malformed rows in the CSV file.

Some entries are also strings, so the same year appears both as a number and as a string. The following steps fix these problems.
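
One way to locate the non-numeric entries and unify the mixed types in a single pass is `pd.to_numeric` with `errors="coerce"`. This is a sketch on a toy Series, not the approach the notebook takes below; the `years` values are made up to mirror the kinds of entries seen above.

```python
import pandas as pd

# Hypothetical miniature of the yearOfPublication column: ints, numeric
# strings, and publisher names that slipped in from malformed rows.
years = pd.Series(
    [2002, "1999", 0, "Gallimard", "DK Publishing Inc", 1995], dtype="object"
)

# errors="coerce" turns anything non-numeric into NaN instead of raising
numeric_years = pd.to_numeric(years, errors="coerce")

bad_rows = years[numeric_years.isna()]            # the mis-parsed entries
clean_years = numeric_years.dropna().astype(int)  # unified integer years

print(bad_rows.tolist())
print(clean_years.tolist())
```

The same mask (`numeric_years.isna()`) could be used to drop the affected rows from the full `books` frame.
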

### Check the rows having 'DK Publishing Inc' or 'Gallimard' as yearOfPublication

In [16]:
books.shape

(271360, 5)

In [17]:
#books['yearOfPublication']
print('"DK Publishing Inc" Count = ', len(books[books['yearOfPublication'] == 'DK Publishing Inc']))
print('"Gallimard" Count = ', len(books[books['yearOfPublication'] == 'Gallimard']))

"DK Publishing Inc" Count =  2
"Gallimard" Count =  1


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [18]:
books = books[books['yearOfPublication'] != 'DK Publishing Inc']
books = books[books['yearOfPublication'] != 'Gallimard']

In [19]:
books.shape

(271357, 5)

### Change the datatype of yearOfPublication to 'int'

In [20]:
books['yearOfPublication'] = books['yearOfPublication'].astype(int)

In [21]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [22]:
books['publisher'].isnull().any()
books['publisher'].isnull().sum()

True

2

In [23]:
books.dropna(subset=['publisher'], inplace=True)

In [24]:
books['publisher'].isnull().any()
books['publisher'].isnull().sum()

False

0

## Exploring Users dataset

In [25]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [26]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [27]:
unique_ages = users['Age'].unique()
unique_ages

array([ nan,  18.,  17.,  61.,  26.,  14.,  25.,  19.,  46.,  55.,  32.,
        24.,  20.,  34.,  23.,  51.,  31.,  21.,  44.,  30.,  57.,  43.,
        37.,  41.,  54.,  42.,  50.,  39.,  53.,  47.,  36.,  28.,  35.,
        13.,  58.,  49.,  38.,  45.,  62.,  63.,  27.,  33.,  29.,  66.,
        40.,  15.,  60.,   0.,  79.,  22.,  16.,  65.,  59.,  48.,  72.,
        56.,  67.,   1.,  80.,  52.,  69.,  71.,  73.,  78.,   9.,  64.,
       103., 104.,  12.,  74.,  75., 231.,   3.,  76.,  83.,  68., 119.,
        11.,  77.,   2.,  70.,  93.,   8.,   7.,   4.,  81., 114., 230.,
       239.,  10.,   5., 148., 151.,   6., 101., 201.,  96.,  84.,  82.,
        90., 123., 244., 133.,  91., 128.,  94.,  85., 141., 110.,  97.,
       219.,  86., 124.,  92., 175., 172., 209., 212., 237.,  87., 162.,
       100., 156., 136.,  95.,  89., 106.,  99., 108., 210.,  88., 199.,
       147., 168., 132., 159., 186., 152., 102., 116., 200., 115., 226.,
       137., 207., 229., 138., 109., 105., 228., 18

In [28]:
## Print Unique Ages in Ascending Order
np.sort(unique_ages, axis=-1, kind='quicksort', order=None)

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 113., 114., 115., 116., 118., 119., 123., 124., 127.,
       128., 132., 133., 136., 137., 138., 140., 141., 143., 146., 147.,
       148., 151., 152., 156., 157., 159., 162., 168., 172., 175., 183.,
       186., 189., 199., 200., 201., 204., 207., 20

In [29]:
print(sorted(users.Age.unique()))

[nan, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0, 110.0, 111.0, 113.0, 114.0, 115.0, 116.0, 118.0, 119.0, 123.0, 124.0, 127.0, 128.0, 132.0, 133.0, 136.0, 137.0, 138.0, 140.0, 141.0, 143.0, 146.0, 147.0, 148.0, 151.0, 152.0, 156.0, 157.0, 159.0, 162.0, 168.0, 172.0, 175.0, 183.0, 186.0, 189.0, 199.0, 200.0, 201.0, 204.0, 207.0, 208.0, 209.0, 210.0, 212.0, 219.0, 220.0, 223.0, 226.0

The Age column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

### Values below 5 and above 90 do not make much sense for book ratings, so replace them with NaN

In [30]:
print('Count of records Age less than 5 = ', len(users[users.Age < 5]))
print('Count of records Age greater than 90 = ', len(users[users.Age > 90]))

Count of records Age less than 5 =  882
Count of records Age greater than 90 =  430


In [31]:
users.loc[users['Age'] < 5, 'Age'] = np.nan
users.loc[users['Age'] > 90, 'Age'] = np.nan

In [32]:
print('Count of records Age less than 5 = ', (users.Age < 5).sum())
print('Count of records Age greater than 90 = ', len(users[users.Age > 90]))

Count of records Age less than 5 =  0
Count of records Age greater than 90 =  0


### Replace null values in column `Age` with mean

In [33]:
users['Age'].isnull().any()

True

In [34]:
# Fill missing ages with the mean of the remaining valid ages
users['Age'] = users['Age'].fillna(users['Age'].mean())

In [35]:
users['Age'].isnull().any()

False
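
The two cleaning steps above (masking implausible ages, then imputing the mean) can be sketched compactly on a toy Series. The values in `ages` are made up for illustration; the 5 to 90 bounds are the ones used in this notebook.

```python
import numpy as np
import pandas as pd

ages = pd.Series([np.nan, 3.0, 18.0, 40.0, 120.0])  # hypothetical toy data

# Keep ages in [5, 90]; everything else (including NaN) becomes NaN
valid = ages.where(ages.between(5, 90))

# Impute with the mean of the remaining valid ages, then cast to int
mean_age = valid.mean()
filled = valid.fillna(mean_age).astype(int)

print(mean_age)
print(filled.tolist())
```

Note that `astype(int)` truncates a fractional mean. With roughly 40% of ages missing in `users.info()` above, mean imputation also concentrates many users on a single age, which is worth keeping in mind if Age is later used as a feature.
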

### Change the datatype of `Age` to `int`

In [36]:
users['Age'] = users['Age'].astype(int)

In [37]:
users.dtypes

userID       int64
Location    object
Age          int32
dtype: object

In [38]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


In [39]:
#print(sorted(users.Age.unique()))
unique_ages = users['Age'].unique()
np.sort(unique_ages, axis=-1, kind='quicksort', order=None)

array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
       39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
       56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
       73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
       90], dtype=int64)

## Exploring the Ratings Dataset

### Check the shape

In [40]:
ratings.shape

(1149780, 3)

In [41]:
n_users = users.shape[0]
n_books = books.shape[0]
print('Number of Users : ', n_users)
print('Number of Books : ', n_books)

Number of Users :  278858
Number of Books :  271355


In [42]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should contain only books that exist in the books dataset. Drop the remaining rows

In [43]:
ratings.shape
books.shape

(1149780, 3)

(271355, 5)

In [44]:
ratings['bookRating'].value_counts()

0     716109
8     103736
10     78610
7      76457
9      67541
5      50974
6      36924
4       8904
3       5996
2       2759
1       1770
Name: bookRating, dtype: int64

In [45]:
ratings = ratings[ratings.ISBN.isin(books.ISBN)]

In [46]:
ratings.shape

(1031130, 3)

### The ratings dataset should contain only ratings from users that exist in the users dataset. Drop the remaining rows

In [47]:
users.shape

(278858, 3)

In [48]:
ratings = ratings[ratings.userID.isin(users.userID)]

In [49]:
ratings.shape

(1031130, 3)
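
Both referential checks can also be combined into a single boolean mask. The sketch below uses toy frames with the notebook's column names; all values are made up. In the notebook itself the user check removes no rows, since the shape is unchanged.

```python
import pandas as pd

# Hypothetical toy frames mirroring the notebook's column names
books = pd.DataFrame({"ISBN": ["A1", "B2", "C3"]})
users = pd.DataFrame({"userID": [1, 2]})
ratings = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A1", "Z9", "B2", "C3"],
    "bookRating": [8, 5, 0, 7],
})

known_book = ratings["ISBN"].isin(books["ISBN"])
known_user = ratings["userID"].isin(users["userID"])

# Keep only rows whose book AND user both exist in the reference tables
ratings_clean = ratings[known_book & known_user]

print(len(ratings), "->", len(ratings_clean))
```
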

### Consider only explicit ratings from 1-10 and drop the 0s (implicit ratings) in column `bookRating`

In [50]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [51]:
ratings = ratings[ratings.bookRating > 0]
ratings.shape

(383839, 3)

### Find out which rating has been given the highest number of times

In [52]:
ratings['bookRating'].unique()

array([ 5,  3,  6,  7,  9,  8, 10,  1,  4,  2], dtype=int64)

In [53]:
max(ratings['bookRating'])

10

In [54]:
ratings['bookRating'].value_counts()

8     91804
10    71225
7     66401
9     60776
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64

In [55]:
# Rating 8 was given the highest number of times

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results only consider users who have rated at least 100 books

In [56]:
print('Columns in Books dataset : ', books.columns)
print('Columns in Ratings dataset : ', ratings.columns)
print('Columns in Users dataset : ', users.columns)

Columns in Books dataset :  Index(['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher'], dtype='object')
Columns in Ratings dataset :  Index(['userID', 'ISBN', 'bookRating'], dtype='object')
Columns in Users dataset :  Index(['userID', 'Location', 'Age'], dtype='object')


In [57]:
ratings['userID'].unique()

array([276726, 276729, 276744, ..., 276704, 276709, 276721], dtype=int64)

In [58]:
ratings.shape

(383839, 3)

In [59]:
group_data = ratings.groupby('userID')

In [60]:
ratings = ratings.groupby('userID').filter(lambda x: len(x) >= 100)
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
1456,277427,002542730X,10
1458,277427,003008685X,8
1461,277427,0060006641,10
1465,277427,0060542128,7
1474,277427,0061009059,9
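
`groupby(...).filter` calls the Python lambda once per user, which may be slow with many users. A possible alternative, sketched here on toy data, broadcasts each user's rating count back to the rows with `transform` and filters with a vectorised comparison. `min_ratings` is a hypothetical name, and the threshold of 3 only keeps the example small (the notebook uses 100).

```python
import pandas as pd

ratings = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3],
    "ISBN": ["A", "B", "C", "A", "B", "C"],
    "bookRating": [8, 7, 9, 10, 5, 6],
})
min_ratings = 3  # the notebook uses 100

# Number of (non-null) ratings per user, repeated on every row of that user
counts = ratings.groupby("userID")["bookRating"].transform("count")
active = ratings[counts >= min_ratings]

print(sorted(active["userID"].unique()))
```
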


### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [61]:
ratings.isna().sum()

userID        0
ISBN          0
bookRating    0
dtype: int64

In [62]:
ratings.shape

(103269, 3)

In [63]:
from sklearn.model_selection import train_test_split

In [64]:
trainDF, tempDF = train_test_split(ratings, test_size=0.2, random_state=100)
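# Keep a copy of the held-out rows as testDF
testDF = tempDF.copy()
# Intended to blank out the held-out ratings. The column is named `bookRating`,
# so this line only sets an attribute called `rating` and leaves the ratings in
# tempDF unchanged (see the output of the next cell).
tempDF.rating = np.nan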

In [65]:
tempDF.head()

Unnamed: 0,userID,ISBN,bookRating
733900,177432,192835947,6
273626,63714,3829060114,9
173342,37644,590457225,7
363370,87555,141000198,8
224143,52350,2203001143,9


In [66]:
#Remove missing values in testDF
testDF = testDF.dropna()
testDF.head()

Unnamed: 0,userID,ISBN,bookRating
733900,177432,192835947,6
273626,63714,3829060114,9
173342,37644,590457225,7
363370,87555,141000198,8
224143,52350,2203001143,9
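
As the `tempDF.head()` output shows, the held-out ratings are still present, so they end up in the matrix that is factorised later. A minimal sketch of masking them by column name follows. It uses toy data, and `DataFrame.sample` stands in for `train_test_split` to keep the example free of extra dependencies; `train_df`, `test_df` and `masked` are hypothetical names.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "userID": np.repeat([1, 2], 5),
    "ISBN": list("ABCDE") * 2,
    "bookRating": [8, 7, 9, 10, 5, 6, 4, 9, 8, 7],
})

# Hold out 20% of the rows for evaluation
test_df = ratings.sample(frac=0.2, random_state=100)
train_df = ratings.drop(test_df.index)

# Mask the held-out ratings in a copy so test_df keeps the true values
masked = test_df.copy()
masked["bookRating"] = np.nan

combined = pd.concat([train_df, masked])
R = combined.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)

# The two held-out cells are now 0 in the matrix used for training
print(R.shape, int((R.values == 0).sum()))
```
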


In [67]:
#Creating ratings with trainDF and tempDF
ratings = pd.concat([trainDF, tempDF]).reset_index()

In [68]:
ratings.shape
ratings.head()

(103269, 4)

Unnamed: 0,index,userID,ISBN,bookRating
0,695529,169682,0553580337,8
1,426444,101851,0890878234,10
2,48772,11676,0385474016,10
3,284111,67840,019285304X,10
4,410090,98391,0756400503,9


In [69]:
# Fill unavailable values with 0.0 (sparse matrix)
R_df = ratings.pivot(index = 'userID', columns ='ISBN', values = 'bookRating').fillna(0)
R_df.tail()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
274061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
274301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
275970,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
277427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Generate the predicted ratings using SVD with 50 singular values

In [70]:
from scipy.sparse.linalg import svds
# Singular value decomposition:
# compute the k largest singular values/vectors of the ratings matrix.
# k: number of singular values and vectors to compute; must satisfy 1 <= k < min(R_df.shape)
# sigma holds the singular values
# .values passes the underlying NumPy array rather than the DataFrame
U, sigma, Vt = svds(R_df.values, k = 50)

In [71]:
# Convert the 1-D array of singular values into a diagonal matrix
sigma = np.diag(sigma)

In [72]:
sigma

array([[147.92121613,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        , 149.3438051 ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        ,   0.        , 150.07400599, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [  0.        ,   0.        ,   0.        , ..., 379.58327277,
          0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
        634.72875357,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        , 680.30978318]])

In [73]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)

In [74]:
# Predicted ratings
preds_df.head()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.025341,-0.002146,-0.001431,-0.002146,-0.002146,0.002971,-0.00392,0.007035,0.007035,0.012316,...,0.00018,0.000226,0.042081,-0.016804,-0.080028,0.004746,0.028314,0.00012,-0.001693,0.067503
1,-0.010012,-0.003669,-0.002446,-0.003669,-0.003669,0.001075,0.00144,-0.0035,-0.0035,0.001612,...,-0.000363,0.000403,0.008142,0.001104,-0.029224,0.000999,0.002363,-0.000242,2.9e-05,-0.013059
2,-0.015054,-0.015457,-0.010304,-0.015457,-0.015457,0.007281,-0.014033,0.011941,0.011941,0.011796,...,-0.000455,0.001907,0.047982,0.005737,0.117859,0.006945,0.003119,-0.000304,0.009009,-0.057692
3,-0.021499,0.035602,0.023735,0.035602,0.035602,0.030307,0.024215,-0.001053,-0.001053,0.067579,...,0.002971,0.009912,0.086248,-0.008818,0.016154,0.028848,-0.000125,0.001981,0.031201,-0.046664
4,0.002077,-0.007965,-0.00531,-0.007965,-0.007965,0.002947,0.003057,0.000231,0.000231,0.00608,...,0.00212,0.001597,-0.012181,0.00942,0.673459,0.002591,-0.008229,0.001413,0.004918,0.047773
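
The rows of `preds_df` above are labelled 0, 1, 2, ... rather than by userID, because no `index` was passed when the frame was built. The sketch below repeats the truncated-SVD reconstruction on a small toy matrix and keeps the userIDs as row labels. The matrix values, userIDs and ISBN labels are made up; the matrix is constructed to have rank 2 so that `k=2` reproduces it almost exactly.

```python
import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds

# Hypothetical 4-user x 5-book ratings matrix of rank 2
R_df = pd.DataFrame(
    np.outer([1, 2, 0, 1], [5, 0, 4, 0, 3])
    + np.outer([0, 1, 2, 1], [0, 4, 0, 2, 1]),
    index=[11, 22, 33, 44],             # userIDs
    columns=["A", "B", "C", "D", "E"],  # ISBNs
    dtype=float,
)

U, sigma, Vt = svds(R_df.values, k=2)   # k must be < min(R_df.shape)
approx = U @ np.diag(sigma) @ Vt        # rank-k reconstruction

# Passing index= keeps userIDs as row labels, so rows can be looked up by ID
preds_df = pd.DataFrame(approx, index=R_df.index, columns=R_df.columns)

print(np.allclose(preds_df.values, R_df.values, atol=1e-6))
```

With real data the matrix is not low rank, so the reconstruction only approximates the observed ratings and, in doing so, assigns non-zero scores to unrated cells.
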


### Take a particular user_id

### Let's find the recommendations for user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [75]:
userID = 2110

In [76]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

In [77]:
ratings_2ndrow = ratings.iloc[0:1,:]  # note: iloc[0:1] selects the first row (position 0)

In [78]:
ratings_2ndrow

Unnamed: 0,index,userID,ISBN,bookRating
0,695529,169682,553580337,8


### Get the predicted ratings for userID `2110` and sort them in descending order

In [79]:
ratings = ratings[ratings['userID']==userID]

In [80]:
# Reassign instead of sorting a filtered slice in place
ratings = ratings.sort_values('bookRating', ascending=False)

In [81]:
ratings

Unnamed: 0,index,userID,ISBN,bookRating
863,14463,2110,0345317580,10
74612,14466,2110,0345362276,10
43895,14507,2110,0439222303,10
45804,14571,2110,0671751174,10
49337,14524,2110,0486270718,10
2560,14606,2110,1565111575,10
56994,14461,2110,0345307674,10
62498,14568,2110,0671695304,10
65854,14593,2110,0812505042,10
68826,14552,2110,0590629794,10
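
The cells above filter and sort the ratings the user actually gave, not the values in `preds_df`. A sketch of reading one user's predicted ratings and sorting them follows. It assumes a predictions frame whose rows are labelled by userID; the numbers and the second userID are made up.

```python
import pandas as pd

# Hypothetical predicted-ratings frame: rows are userIDs, columns are ISBNs
preds_df = pd.DataFrame(
    [[0.1, 2.3, 0.7], [1.5, 0.2, 0.9]],
    index=[2110, 2276],
    columns=["0345317580", "0812505042", "0671527215"],
)

userID = 2110
# Label-based lookup of that user's row, highest predicted rating first
sorted_user_predictions = preds_df.loc[userID].sort_values(ascending=False)

print(sorted_user_predictions.index.tolist())
```

If the predictions frame keeps the default integer index, the row position could instead be found with `R_df.index.get_loc(userID)` and passed to `preds_df.iloc`.
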


### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly rated

In [82]:
user_data = ratings

In [83]:
user_data.head()

Unnamed: 0,index,userID,ISBN,bookRating
863,14463,2110,345317580,10
74612,14466,2110,345362276,10
43895,14507,2110,439222303,10
45804,14571,2110,671751174,10
49337,14524,2110,486270718,10


In [84]:
user_data.shape

(103, 4)

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [85]:
books.columns

Index(['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher'], dtype='object')

In [86]:
book_data = books[books.ISBN.isin(user_data.ISBN)]

In [87]:
book_data.shape

(103, 5)

In [88]:
book_data.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
246,0151008116,Life of Pi,Yann Martel,2002,Harcourt
904,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books
1000,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy
1302,0345307674,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books
1472,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket


In [89]:
user_test = pd.merge(user_data, book_data, on='ISBN', how='outer')

In [90]:
user_test.head()

Unnamed: 0,index,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,14463,2110,345317580,10,Magic Kingdom for Sale - Sold! (Magic Kingdom ...,Terry Brooks,1990,Del Rey Books
1,14466,2110,345362276,10,Wizard at Large (Rookies Series),Terry Brooks,1989,Del Rey Books
2,14507,2110,439222303,10,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple
3,14571,2110,671751174,10,First Evil (Fear Street Cheerleaders 1) : Firs...,R.L. Stine,1992,Simon Pulse
4,14524,2110,486270718,10,The Invisible Man (Dover Thrift Editions),H. G. Wells,1992,Dover Publications


In [91]:
user_test.shape

(103, 8)

In [92]:
user_full_info = pd.merge(user_data, book_data, on='ISBN', how='outer')

In [93]:
user_full_info = user_full_info.drop('index',axis=1)

In [94]:
user_full_info.head()

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,345317580,10,Magic Kingdom for Sale - Sold! (Magic Kingdom ...,Terry Brooks,1990,Del Rey Books
1,2110,345362276,10,Wizard at Large (Rookies Series),Terry Brooks,1989,Del Rey Books
2,2110,439222303,10,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple
3,2110,671751174,10,First Evil (Fear Street Cheerleaders 1) : Firs...,R.L. Stine,1992,Simon Pulse
4,2110,486270718,10,The Invisible Man (Dover Thrift Editions),H. G. Wells,1992,Dover Publications


In [95]:
user_full_info.shape

(103, 7)

### Get the top 10 recommendations for the userID given above from the books not already rated by that user

In [96]:
user_full_info.head(10)

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,345317580,10,Magic Kingdom for Sale - Sold! (Magic Kingdom ...,Terry Brooks,1990,Del Rey Books
1,2110,345362276,10,Wizard at Large (Rookies Series),Terry Brooks,1989,Del Rey Books
2,2110,439222303,10,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple
3,2110,671751174,10,First Evil (Fear Street Cheerleaders 1) : Firs...,R.L. Stine,1992,Simon Pulse
4,2110,486270718,10,The Invisible Man (Dover Thrift Editions),H. G. Wells,1992,Dover Publications
5,2110,1565111575,10,Return of the Jedi: The Original Radio Drama,Anthony Daniels,1996,Highbridge Audio
6,2110,345307674,10,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books
7,2110,671695304,10,"FOREVER : A Novel of Good and Evil, Love and Hope",Judy Blume,1989,Pocket
8,2110,812505042,10,The Time Machine,H. G. Wells,1995,Tor Books
9,2110,590629794,10,"The Encounter (Animorphs , No 3)",K. A. Applegate,1996,Scholastic
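
The table above lists books the user has already rated. To produce recommendations, the user's predicted ratings would be restricted to unrated books and ranked. The sketch below shows one way to do that on toy data; `recommend`, `user_predictions` and the `predictedRating` column are hypothetical names, and the values are made up.

```python
import pandas as pd

books = pd.DataFrame({
    "ISBN": ["A", "B", "C", "D"],
    "bookTitle": ["Book A", "Book B", "Book C", "Book D"],
})
user_data = pd.DataFrame({
    "userID": [2110, 2110],
    "ISBN": ["A", "C"],
    "bookRating": [10, 8],
})
# Predicted ratings for this user, indexed by ISBN (hypothetical values)
user_predictions = pd.Series(
    {"A": 4.2, "B": 1.1, "C": 3.9, "D": 2.5}, name="predictedRating"
)

def recommend(user_predictions, user_data, books, n=10):
    """Top-n books by predicted rating, excluding ones the user already rated."""
    already_rated = user_predictions.index.isin(user_data["ISBN"])
    unseen = user_predictions[~already_rated]
    top = unseen.sort_values(ascending=False).head(n)
    top = top.rename_axis("ISBN").reset_index()
    # A left merge attaches titles while preserving the ranking order
    return top.merge(books, on="ISBN", how="left")

recs = recommend(user_predictions, user_data, books, n=10)
print(recs[["ISBN", "bookTitle", "predictedRating"]])
```
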
