**About Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and implicit ratings are expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [1]:
import pandas as pd

In [2]:
#Loading data
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
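
The `error_bad_lines` argument used above was deprecated in pandas 1.3 and removed in pandas 2.0. On newer versions, a sketch along the following lines should behave similarly. The inline CSV text is made up for illustration and stands in for `books.csv`; `dtype=str` is an added assumption that keeps leading zeros in ISBN-like values.

```python
import io
import pandas as pd

# Made-up sample standing in for books.csv: the second data row has an extra field
sample = (
    "ISBN;bookTitle;publisher\n"
    "001;Book A;Pub A\n"
    "002;Book; B;Pub B\n"
    "003;Book C;Pub C\n"
)

# on_bad_lines="skip" drops rows with too many fields (pandas 1.3 or later)
demo = pd.read_csv(io.StringIO(sample), sep=";", on_bad_lines="skip", dtype=str)
print(demo)
```

Unlike `error_bad_lines=False`, `on_bad_lines="skip"` does not print a message for each skipped line; `on_bad_lines="warn"` can be used when those messages are wanted.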


### Check the number of records and features in each dataset

In [3]:
print ("No. of records & features in books file:", books.shape)
print ("No. of records & features in users file:", users.shape)
print ("No. of records & features in ratings file:", ratings.shape)

No. of records & features in books file: (271360, 8)
No. of records & features in users file: (278858, 3)
No. of records & features in ratings file: (1149780, 3)


## Exploring books dataset

In [4]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop the last three columns, which contain image URLs that are not required for the analysis

In [5]:
books = books.drop(['imageUrlS', 'imageUrlM', 'imageUrlL'],axis=1)

In [6]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [7]:
books.yearOfPublication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As can be seen above, this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of errors in the CSV file.

Some entries are also strings, while the same years appear as numbers elsewhere. These issues are fixed in the steps that follow.
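
As an alternative to naming the bad values by hand, the mixed types described above can be detected with `pd.to_numeric`. This is a minimal sketch on made-up values, not the notebook's own method; the variable names are hypothetical.

```python
import pandas as pd

# Made-up values mimicking the mixed types seen in yearOfPublication
years = pd.Series([2002, "1999", "0", "DK Publishing Inc", "Gallimard"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric_years = pd.to_numeric(years, errors="coerce")

# Entries that failed to parse are the misaligned rows
bad_rows = years[numeric_years.isna()]

# The remaining entries can be converted to int safely
clean_years = numeric_years.dropna().astype(int)

print(bad_rows.tolist())
print(clean_years.tolist())
```

This approach also catches misplaced values that were not spotted by eye, at the cost of silently coercing anything non-numeric.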

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [8]:
books.loc[books['yearOfPublication']=='DK Publishing Inc']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [9]:
array = ['Gallimard','DK Publishing Inc']

In [10]:
books = books.drop(books.loc[books['yearOfPublication'].isin(array)].index)

### Change the datatype of yearOfPublication to 'int'

In [11]:
books['yearOfPublication'] = books['yearOfPublication'].astype(int)

In [12]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [13]:
books['publisher'].isna().sum()

2

In [14]:
#books = books.drop(books['publisher'].isna().index)
books.dropna(subset=['publisher'], inplace=True)

## Exploring Users dataset

In [15]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values for column `Age`

In [16]:
users.Age.unique()

array([ nan,  18.,  17.,  61.,  26.,  14.,  25.,  19.,  46.,  55.,  32.,
        24.,  20.,  34.,  23.,  51.,  31.,  21.,  44.,  30.,  57.,  43.,
        37.,  41.,  54.,  42.,  50.,  39.,  53.,  47.,  36.,  28.,  35.,
        13.,  58.,  49.,  38.,  45.,  62.,  63.,  27.,  33.,  29.,  66.,
        40.,  15.,  60.,   0.,  79.,  22.,  16.,  65.,  59.,  48.,  72.,
        56.,  67.,   1.,  80.,  52.,  69.,  71.,  73.,  78.,   9.,  64.,
       103., 104.,  12.,  74.,  75., 231.,   3.,  76.,  83.,  68., 119.,
        11.,  77.,   2.,  70.,  93.,   8.,   7.,   4.,  81., 114., 230.,
       239.,  10.,   5., 148., 151.,   6., 101., 201.,  96.,  84.,  82.,
        90., 123., 244., 133.,  91., 128.,  94.,  85., 141., 110.,  97.,
       219.,  86., 124.,  92., 175., 172., 209., 212., 237.,  87., 162.,
       100., 156., 136.,  95.,  89., 106.,  99., 108., 210.,  88., 199.,
       147., 168., 132., 159., 186., 152., 102., 116., 200., 115., 226.,
       137., 207., 229., 138., 109., 105., 228., 18

The Age column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

### Ages below 5 and above 90 do not make much sense for book ratings, so treat them as missing (the cells below use 0 as a placeholder)

In [17]:
import numpy as np

In [18]:
users.dtypes

userID        int64
Location     object
Age         float64
dtype: object

In [19]:
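# The one-line version below raised
# TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
# users['Age'] = np.where((users['Age']>90 & users['Age']<5),'nan',users['Age'])
# Cause: `&` binds more tightly than `>` and `<`, so each comparison needs its own parentheses.
# The conditions should also be joined with `|`, since no age is both above 90 and below 5,
# and np.nan should be used rather than the string 'nan':
# users['Age'] = np.where((users['Age'] > 90) | (users['Age'] < 5), np.nan, users['Age'])
# The cells below apply the two conditions one at a time instead.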

In [20]:
users['Age'] = np.where(users['Age']>90,0,users['Age'])

In [21]:
users['Age'] = np.where(users['Age']<5,0,users['Age'])

In [22]:
users['Age'] = users['Age'].fillna(0)

### Replace null values in column `Age` with mean

In [23]:
users['Age'].mean()

20.768208191983017

In [24]:
users['Age'] = np.where(users['Age']==0,users['Age'].mean(),users['Age'])
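
Because the cells above use 0 as the placeholder for invalid ages, the mean of 20.77 is computed with those zeros included, which pulls it down. A minimal sketch of the same cleaning with NaN as the placeholder is shown here on made-up ages; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ages, including missing and implausible values
ages = pd.Series([np.nan, 18.0, 0.0, 34.0, 231.0, 4.0, 50.0])

# Parenthesise each comparison and combine them with | (or)
invalid = (ages > 90) | (ages < 5)

# Mark invalid ages as missing so they do not distort the mean
cleaned = ages.mask(invalid)

# Series.mean() skips NaN, so this is the mean of 18, 34 and 50 only
valid_mean = cleaned.mean()

# Fill every missing age with that mean and convert to int
filled = cleaned.fillna(valid_mean).astype(int)
print(valid_mean, filled.tolist())
```

Whether mean imputation is appropriate at all depends on how `Age` is used later; the sketch only shows how to keep the placeholder out of the statistic.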

### Change the datatype of `Age` to `int`

In [25]:
users['Age'] = users['Age'].astype(int)

In [26]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


## Exploring the Ratings Dataset

### Check the shape

In [27]:
ratings.shape

(1149780, 3)

In [28]:
n_users = users.shape[0]
n_books = books.shape[0]

In [29]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [30]:
ratings.dtypes

userID         int64
ISBN          object
bookRating     int64
dtype: object

### The ratings dataset should contain only books that exist in our books dataset. Drop the remaining rows

In [31]:
booklist = books.ISBN.unique()

In [32]:
ratings = ratings.loc[ratings['ISBN'].isin(booklist)]

In [33]:
ratings.shape

(1031130, 3)

### The ratings dataset should contain only ratings from users who exist in the users dataset. Drop the remaining rows

In [34]:
userlist = users.userID.unique()

In [35]:
ratings = ratings.loc[ratings['userID'].isin(userlist)]

In [36]:
ratings.shape
# After dropping ratings with unknown users or unknown ISBNs, the ratings dataset has 1031130 records.

(1031130, 3)
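
The two filtering steps above can also be combined into a single boolean mask. This is a small sketch on made-up frames; the `*_demo` names are hypothetical and stand in for `books`, `users` and `ratings`.

```python
import pandas as pd

# Made-up stand-ins for the books, users and ratings tables
books_demo = pd.DataFrame({"ISBN": ["A", "B"]})
users_demo = pd.DataFrame({"userID": [1, 2]})
ratings_demo = pd.DataFrame({
    "userID": [1, 2, 3, 1],
    "ISBN": ["A", "C", "B", "B"],
    "bookRating": [5, 0, 7, 9],
})

# A rating is kept only if both its book and its user are known
known_book = ratings_demo["ISBN"].isin(books_demo["ISBN"])
known_user = ratings_demo["userID"].isin(users_demo["userID"])
filtered = ratings_demo[known_book & known_user]
print(filtered)
```

Keeping the two masks as separate variables makes it easy to count how many rows each condition removes.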

### Consider only explicit ratings from 1 to 10 and drop the 0s (implicit ratings) in column `bookRating`

In [37]:
ratings = ratings[ratings['bookRating']!=0]

### Find out which rating has been given the highest number of times

In [38]:
ratings['bookRating'].value_counts().idxmax()
# Rating 8 is given highest number of times

8

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [39]:
rating1 = ratings['userID'].value_counts()

In [40]:
# Strict > keeps users with 101 or more ratings; use >= 100 to include users with exactly 100
rating1 = rating1[rating1>100].index

In [41]:
rating1

Int64Index([ 11676,  98391, 189835, 153662,  23902, 235105,  76499, 171118,
             16795, 248718,
            ...
             76223, 193898,  86189, 235935,  66942, 146113, 224525, 109901,
            172888, 117384],
           dtype='int64', length=440)

In [42]:
#ratings.loc[ratings['userID']==11676]
rating100 = ratings[ratings['userID'].isin(rating1)]

In [43]:
rating100.userID.value_counts()

11676     6943
98391     5689
189835    1899
153662    1845
23902     1180
235105    1020
76499     1012
171118     962
16795      959
248718     941
56399      837
197659     781
35859      777
185233     698
95359      606
114368     603
158295     567
101851     563
177458     524
204864     504
93047      501
69078      499
182085     498
135149     487
100906     484
107784     482
78973      479
23872      478
60244      476
257204     475
          ... 
236058     106
86947      106
224646     105
136348     105
184532     105
7286       105
242465     105
14422      105
227705     105
35836      104
183958     104
164096     104
113270     104
163804     104
250405     104
2110       103
164323     103
132492     103
86189      102
10560      102
193898     102
148966     102
76223      102
235935     102
172888     101
117384     101
66942      101
224525     101
146113     101
109901     101
Name: userID, Length: 440, dtype: int64
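
The same user filter can be written without an intermediate index by attaching each user's rating count to every row. This is a sketch on made-up data; `min_ratings` is a hypothetical parameter, set low here so the toy example has something to drop.

```python
import pandas as pd

# Made-up ratings: user 1 has three ratings, user 2 has two, user 3 has one
ratings_demo = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 3],
    "bookRating": [8, 7, 9, 5, 6, 10],
})
min_ratings = 2  # hypothetical threshold; the notebook uses more than 100

# transform("count") repeats each user's rating count on every one of their rows
counts = ratings_demo.groupby("userID")["bookRating"].transform("count")
active = ratings_demo[counts >= min_ratings]
print(active)
```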

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings
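
A user-by-book ratings matrix of the kind described in this step could be built with `DataFrame.pivot`. This is a minimal sketch on made-up ratings, assuming each (userID, ISBN) pair appears only once; `ratings_demo` and `ratings_matrix` are hypothetical names.

```python
import pandas as pd

# Made-up explicit ratings
ratings_demo = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "B", "C"],
    "bookRating": [8, 5, 10, 7],
})

# Rows are users, columns are books; unrated pairs come out as NaN
ratings_matrix = ratings_demo.pivot(index="userID", columns="ISBN", values="bookRating")

# Replace NaN with 0 to indicate the absence of a rating
ratings_matrix = ratings_matrix.fillna(0)
print(ratings_matrix)
```

If the same user could rate the same book twice, `pivot` would raise an error and `pivot_table` with an aggregation function would be needed instead.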

In [44]:
rating100.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102369 entries, 1456 to 1147615
Data columns (total 3 columns):
userID        102369 non-null int64
ISBN          102369 non-null object
bookRating    102369 non-null int64
dtypes: int64(2), object(1)
memory usage: 3.1+ MB


In [45]:
rating100.isna().sum()
#There is no null value in any field

userID        0
ISBN          0
bookRating    0
dtype: int64

### Generate the predicted ratings using SVD with the number of singular values set to 50

In [53]:
from surprise import Dataset,Reader
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise import accuracy

In [55]:
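# NOTE: explicit ratings in this dataset run from 1 to 10, so rating_scale=(1, 10) would match the data.
# With (1, 5), Surprise clips every estimate to at most 5.0, which is likely why the predictions below all show 5.0.
reader = Reader(rating_scale=(1,5))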
data = Dataset.load_from_df(rating100[['userID','ISBN','bookRating']],reader)

In [58]:
trainset,testset = train_test_split(data,test_size=0.2,random_state=100)

In [59]:
svd_model = SVD(n_factors=50,biased=False)
svd_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1261a0320>

In [60]:
test_pred = svd_model.test(testset)

In [61]:
test_pred_df = pd.DataFrame([[x.uid,x.iid,x.est] for x in test_pred])
test_pred_df.columns = ['userID','ISBN','estRating']
test_pred_df.sort_values(by= ['userID','estRating'],ascending=False,inplace=True)

In [62]:
test_pred_df.head()

Unnamed: 0,userID,ISBN,estRating
816,278418,60392185,5.0
3137,278418,312923651,5.0
3895,278418,812531558,5.0
3933,278418,446520608,5.0
6947,278418,812543181,5.0
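
Surprise's `SVD` learns latent factors by stochastic gradient descent, and `n_factors=50` sets the number of those factors rather than computing singular values directly. For comparison, a truncated singular value decomposition of a ratings matrix can be sketched with NumPy. The matrix here is made up, and `k = 2` stands in for the 50 singular values the exercise asks for.

```python
import numpy as np

# Made-up user x book ratings matrix (0 means no rating)
R = np.array([
    [8.0, 0.0, 7.0, 0.0],
    [0.0, 9.0, 0.0, 6.0],
    [7.0, 0.0, 8.0, 0.0],
    [0.0, 5.0, 0.0, 10.0],
])
k = 2  # number of singular values to keep

# Full decomposition: R = U @ diag(s) @ Vt, with s sorted in descending order
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep the k largest singular values and rebuild the predicted ratings
R_pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_pred, 2))
```

For a large sparse matrix, `scipy.sparse.linalg.svds` computes only the leading `k` singular values and avoids the cost of the full decomposition.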


### Take a particular user_id

### Let's find the recommendations for the user with ID `2110`

#### Note: Execute the below cells to get the variables loaded

In [None]:
userID = 2110

In [None]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

### Get the predicted ratings for userID `2110` and sort them in descending order

In [63]:
test_pred_df.sort_values(by = 'estRating',ascending=False,inplace=True)
test_pred_df[test_pred_df.userID == 2110]

Unnamed: 0,userID,ISBN,estRating
14070,2110,0345283929,5.0
18908,2110,0373638078,5.0
17644,2110,0394563131,5.0
16825,2110,0373642849,5.0
16421,2110,0671038931,5.0
15638,2110,059046678X,5.0
14536,2110,0671802283,5.0
13112,2110,1570420564,5.0
12385,2110,0590880748,5.0
12208,2110,0345260627,5.0


### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly rated

In [65]:
user_data = rating100[rating100.userID == 2110]

In [66]:
user_data.head()

Unnamed: 0,userID,ISBN,bookRating
14448,2110,60987529,7
14449,2110,64472779,8
14450,2110,140022651,10
14452,2110,142302163,8
14453,2110,151008116,5


In [67]:
user_data.shape

(103, 3)

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [68]:
# ISBNs explicitly rated by user 2110
ISBNlist = user_data.ISBN.unique()
ISBNlist = pd.DataFrame(ISBNlist)
ISBNlist.columns = ['ISBN']

# Ratings given by user 2110 to those books
ratingtemp = ratings.query('ISBN in @ISBNlist.ISBN')
ratingtemp = ratingtemp.query('userID in [2110]')

# Book details for the same ISBNs; .copy() avoids SettingWithCopyWarning on the assignments below
usertemp = books.query('ISBN in @ISBNlist.ISBN').copy()
usertemp['userID'] = 2110
usertemp['Location'] = 'charlotte, north carolina, usa'
usertemp['Age'] = 34

# Join book details with the user's ratings and tidy up the duplicated userID columns
user_full_info = pd.merge(usertemp, ratingtemp, on='ISBN')
user_full_info.drop('userID_y', axis=1, inplace=True)
user_full_info.rename(columns={'userID_x': 'user_ID'}, inplace=True)


In [70]:
user_full_info.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,user_ID,Location,Age,bookRating
0,0151008116,Life of Pi,Yann Martel,2002,Harcourt,2110,"charlotte, north carolina, usa",34,5
1,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books,2110,"charlotte, north carolina, usa",34,8
2,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy,2110,"charlotte, north carolina, usa",34,8
3,0345307674,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books,2110,"charlotte, north carolina, usa",34,10
4,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket,2110,"charlotte, north carolina, usa",34,9


### Get the top 10 recommendations for the userID given above from the books not already rated by that user

In [71]:
test_pred_df.sort_values(by = 'estRating',ascending=False,inplace=True)

In [72]:
test_pred_df[test_pred_df.userID == 2110]

Unnamed: 0,userID,ISBN,estRating
6573,2110,0590956159,5.0
4161,2110,042516540X,5.0
3754,2110,0898863538,5.0
12208,2110,0345260627,5.0
14070,2110,0345283929,5.0
18908,2110,0373638078,5.0
17644,2110,0394563131,5.0
16825,2110,0373642849,5.0
16421,2110,0671038931,5.0
15638,2110,059046678X,5.0
