**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the cell below to load the datasets

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

In [2]:
# Loading data
# error_bad_lines was removed in pandas 2.0; on_bad_lines="warn" (pandas >= 1.3) also skips malformed rows and reports them
books = pd.read_csv("books.csv", sep=";", on_bad_lines="warn", encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


### Check the number of records and features in each dataset

In [3]:
books.shape

(271360, 8)

In [4]:
users.shape

(278858, 3)

In [5]:
ratings.shape

(1149780, 3)

In [6]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB


In [7]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [8]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


## Exploring books dataset

In [9]:
columns=books.columns
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [10]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [11]:
columns

Index(['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher',
       'imageUrlS', 'imageUrlM', 'imageUrlL'],
      dtype='object')

In [12]:
books.drop(columns=['imageUrlS', 'imageUrlM', 'imageUrlL'], inplace=True)

In [13]:
columns=books.columns
columns

Index(['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher'], dtype='object')

**yearOfPublication**

### Check unique values of yearOfPublication


In [14]:
books['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As the output above shows, this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of malformed rows in the CSV file, which shift the fields by one column.

The column also mixes types: the same years appear as integers in some rows and as strings in others. The following steps fix both problems.
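
One way to locate such entries without scanning the full list of unique values is to coerce the column to numbers and look at what fails to parse. The sketch below uses a small hand-made Series as a stand-in for `books['yearOfPublication']`; the values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for books['yearOfPublication'] (illustrative values only)
years = pd.Series([2002, "2000", 0, "DK Publishing Inc", "Gallimard", 1999], dtype=object)

# Coerce everything to numbers; entries that cannot be parsed become NaN
as_number = pd.to_numeric(years, errors="coerce")

# Original values that are not valid numbers
bad_entries = years[as_number.isna()]
print(bad_entries.tolist())
```

Numeric strings such as `"2000"` parse cleanly, so only the misplaced publisher names are reported.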

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [15]:
books[books.yearOfPublication == 'DK Publishing Inc']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [16]:
books[(books.yearOfPublication == 'DK Publishing Inc') | (books.yearOfPublication == 'Gallimard')]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


In [17]:
# Drop a row by condition
books.drop(books[(books.yearOfPublication == 'DK Publishing Inc') | (books.yearOfPublication == 'Gallimard')].index,inplace=True)

In [18]:
books['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

### Change the datatype of yearOfPublication to 'int'

In [19]:
books['yearOfPublication']=books['yearOfPublication'].astype(int)

In [20]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [21]:
if not books.isnull().values.any():
    print("There are no nulls/NaN values in our data frame")
else:
    print("There are nulls/NaN values in our data frame")

There are nulls/NaN values in our data frame


In [22]:
books[books.publisher.isnull().values]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


In [23]:
books.drop(books[books.publisher.isnull().values].index,inplace=True)

In [24]:
books[books.publisher.isnull().values]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
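
The same result can probably be reached more directly with `dropna(subset=...)`, which removes rows that are null in the named columns only. A minimal sketch on a toy frame (the rows are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the books table (illustrative rows only)
toy_books = pd.DataFrame({
    "ISBN": ["A1", "B2", "C3"],
    "publisher": ["Oxford University Press", np.nan, "HarperPerennial"],
})

# Drop only the rows whose publisher is missing; other columns are not checked
cleaned = toy_books.dropna(subset=["publisher"])
print(cleaned)
```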


## Exploring Users dataset

In [25]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [26]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


### Get all unique values in ascending order for column `Age`

In [27]:
users.sort_values(by='Age', ascending=True, na_position='first')['Age'].unique()

array([ nan,   0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,
        10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,
        21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,
        32.,  33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,
        43.,  44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,
        54.,  55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,
        65.,  66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,
        76.,  77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,
        87.,  88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,
        98.,  99., 100., 101., 102., 103., 104., 105., 106., 107., 108.,
       109., 110., 111., 113., 114., 115., 116., 118., 119., 123., 124.,
       127., 128., 132., 133., 136., 137., 138., 140., 141., 143., 146.,
       147., 148., 151., 152., 156., 157., 159., 162., 168., 172., 175.,
       183., 186., 189., 199., 200., 201., 204., 20

The Age column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

### Values below 5 and above 90 do not make much sense for our book rating case, so replace them with NaN

In [28]:
users[~((users.Age < 5) | (users.Age > 90))]

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",
...,...,...,...
278853,278854,"portland, oregon, usa",
278854,278855,"tacoma, washington, united kingdom",50.0
278855,278856,"brampton, ontario, canada",
278856,278857,"knoxville, tennessee, usa",


In [29]:
users[((users.Age < 5) | (users.Age > 90))]

Unnamed: 0,userID,Location,Age
219,220,"bogota, bogota, colombia",0.0
469,470,"indianapolis, indiana, usa",0.0
561,562,"adfdaf, australian capital territory, albania",0.0
612,613,"ankara, n/a, turkey",1.0
670,671,"jeddah, jeddah, saudi arabia",1.0
...,...,...,...
277107,277108,"quinto, ticino, switzerland",104.0
277503,277504,"san diego, california, usa",103.0
277558,277559,"lake george, new york, usa",98.0
277908,277909,"phoenix, arizona, usa",2.0


In [30]:
## user id 220 has age 0.0

In [31]:
users['Age']=np.where((users.Age<5) | (users.Age > 90),np.nan,users.Age)

In [32]:
users[((users.Age < 5) | (users.Age > 90))]

Unnamed: 0,userID,Location,Age


In [33]:
users[users.userID==220]

Unnamed: 0,userID,Location,Age
219,220,"bogota, bogota, colombia",


In [34]:
##value converted to NaN

In [35]:
users[users.userID==2]

Unnamed: 0,userID,Location,Age
1,2,"stockton, california, usa",18.0
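
As an alternative to `np.where`, pandas offers `Series.mask`, which replaces values with NaN wherever a condition holds. The sketch below applies the same 5 to 90 bounds to a toy Series; the ages are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for users['Age'] (illustrative values only)
ages = pd.Series([0.0, 18.0, np.nan, 17.0, 104.0, 50.0])

# mask() sets entries to NaN where the condition is True; existing NaNs stay NaN
cleaned = ages.mask((ages < 5) | (ages > 90))
print(cleaned.tolist())
```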


### Replace null values in column `Age` with mean

In [36]:
users[users.userID==220]

Unnamed: 0,userID,Location,Age
219,220,"bogota, bogota, colombia",


In [37]:
users[users.userID==2]

Unnamed: 0,userID,Location,Age
1,2,"stockton, california, usa",18.0


In [38]:
# Fill missing ages with the column mean; fillna on Age alone leaves the other columns untouched
users['Age'] = users['Age'].fillna(users['Age'].mean())

In [39]:
users[users.userID==220]

Unnamed: 0,userID,Location,Age
219,220,"bogota, bogota, colombia",34.72384


In [40]:
users[users.userID==2]

Unnamed: 0,userID,Location,Age
1,2,"stockton, california, usa",18.0


In [41]:
## user id 220, which previously had NaN, now holds the mean value 34.72

### Change the datatype of `Age` to `int`

In [42]:
print(sorted(users.Age.unique()))

[5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 34.72384041634689, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0]


In [43]:
users['Age']=users['Age'].astype(int)

In [44]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null int32
dtypes: int32(1), int64(1), object(1)
memory usage: 5.3+ MB
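
Note that `astype(int)` truncates toward zero, so the imputed mean of 34.72 becomes 34 rather than 35. If rounding is preferred, the column can be rounded before the cast. A small sketch on toy data (the ages are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the users table (illustrative values only)
toy_users = pd.DataFrame({"userID": [1, 2, 3, 4], "Age": [np.nan, 18.0, 17.0, 69.0]})

# mean() skips NaN: (18 + 17 + 69) / 3 = 34.67
mean_age = toy_users["Age"].mean()
filled = toy_users["Age"].fillna(mean_age)

truncated = filled.astype(int)         # drops the fractional part
rounded = filled.round().astype(int)   # rounds to the nearest integer first
print(truncated.tolist(), rounded.tolist())
```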


## Exploring the Ratings Dataset

### Check the shape

In [45]:
ratings.shape

(1149780, 3)

In [46]:
books.shape

(271355, 5)

In [47]:
n_users = users.shape[0]
n_books = books.shape[0]
n_ratings=ratings.shape[0]

In [48]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [49]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271355 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271355 non-null object
bookTitle            271355 non-null object
bookAuthor           271354 non-null object
yearOfPublication    271355 non-null int32
publisher            271355 non-null object
dtypes: int32(1), object(4)
memory usage: 11.4+ MB


In [50]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


### The ratings dataset should contain only books that exist in our books dataset. Drop the remaining rows

In [51]:
n_ratings

1149780

In [52]:
ratings=ratings.merge(books,on='ISBN',how='inner')

In [53]:
ratings.shape

(1031130, 7)
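
The inner merge keeps only ratings whose ISBN appears in `books`, but it also appends the book columns to `ratings` (hence 7 columns). If only the row filter is wanted, `isin` is one alternative. A sketch on toy frames, with all values hypothetical:

```python
import pandas as pd

# Toy stand-ins for the books and ratings tables (illustrative values only)
toy_books = pd.DataFrame({"ISBN": ["A1", "B2"], "bookTitle": ["First", "Second"]})
toy_ratings = pd.DataFrame({
    "userID": [1, 2, 3],
    "ISBN": ["A1", "ZZ", "B2"],
    "bookRating": [5, 0, 8],
})

# Keep only ratings whose ISBN exists in the books table; columns are unchanged
filtered = toy_ratings[toy_ratings["ISBN"].isin(toy_books["ISBN"])]

# The inner merge keeps the same rows but adds the book columns
merged = toy_ratings.merge(toy_books, on="ISBN", how="inner")
print(filtered.shape, merged.shape)
```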

### The ratings dataset should contain only ratings from users who exist in the users dataset. Drop the remaining rows

In [54]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null int32
dtypes: int32(1), int64(1), object(1)
memory usage: 5.3+ MB


In [60]:
ratings=ratings.merge(users,on='userID',how='inner')

In [61]:
ratings.shape

(383839, 11)

### Consider only ratings from 1-10 and leave out the 0s in column `bookRating`

In [62]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383839 entries, 0 to 383838
Data columns (total 11 columns):
userID               383839 non-null int64
ISBN                 383839 non-null object
bookRating           383839 non-null int64
bookTitle            383839 non-null object
bookAuthor           383838 non-null object
yearOfPublication    383839 non-null int32
publisher            383839 non-null object
Location_x           383839 non-null object
Age_x                383839 non-null int32
Location_y           383839 non-null object
Age_y                383839 non-null int32
dtypes: int32(3), int64(2), object(6)
memory usage: 30.7+ MB
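
The `Location_x`/`Age_x` and `Location_y`/`Age_y` columns listed above are the suffixes pandas adds when both frames in a merge share non-key column names. That pattern is consistent with the users merge having been applied to a frame that already contained the user columns. A minimal sketch on toy data (all values hypothetical):

```python
import pandas as pd

# Toy stand-ins for the ratings and users tables (illustrative values only)
toy_ratings = pd.DataFrame({"userID": [1, 2], "bookRating": [5, 8]})
toy_users = pd.DataFrame({"userID": [1, 2], "Location": ["a", "b"], "Age": [30, 40]})

once = toy_ratings.merge(toy_users, on="userID", how="inner")
# Merging the same table again duplicates Location and Age with _x/_y suffixes
twice = once.merge(toy_users, on="userID", how="inner")
print(list(twice.columns))
```

Running the merge cell only once on a fresh `ratings` frame avoids the duplicated columns.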


In [63]:
ratings.loc[ratings['bookRating']==0]

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher,Location_x,Age_x,Location_y,Age_y


In [64]:
ratings.drop(ratings.loc[ratings['bookRating']==0].index,axis=0,inplace=True)

In [65]:
ratings.loc[ratings['bookRating']==0]

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher,Location_x,Age_x,Location_y,Age_y


In [66]:
ratings.loc[ratings['bookRating']==10]

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher,Location_x,Age_x,Location_y,Age_y
14,2313,0394756827,10,"Godel, Escher, Bach: An Eternal Golden Braid",Douglas R. Hofstadter,1989,Vintage Books USA,"cincinnati, ohio, usa",23,"cincinnati, ohio, usa",23
28,6543,0446605484,10,Roses Are Red (Alex Cross Novels),James Patterson,2001,Warner Vision,"strafford, missouri, usa",34,"strafford, missouri, usa",34
35,6543,038548951X,10,Sister of My Heart,Chitra Banerjee Divakaruni,2000,Anchor Pub,"strafford, missouri, usa",34,"strafford, missouri, usa",34
55,6543,0060987103,10,Wicked: The Life and Times of the Wicked Witch...,Gregory Maguire,1996,Regan Books,"strafford, missouri, usa",34,"strafford, missouri, usa",34
60,6543,0380813815,10,"Lamb : The Gospel According to Biff, Christ's ...",Christopher Moore,2003,Perennial,"strafford, missouri, usa",34,"strafford, missouri, usa",34
...,...,...,...,...,...,...,...,...,...,...,...
383814,274623,1561483761,10,The Little Book of Restorative Justice (The Li...,Howard Zehr,2002,Good Books,"chicago, illinois, usa",57,"chicago, illinois, usa",57
383819,275040,1403331715,10,The Sorceress of Atunluck,Aaron Dean Hall,2003,Authorhouse,"salt lake city, utah, usa",30,"salt lake city, utah, usa",30
383828,276067,0694004162,10,If You Give a Mouse a Cookie/Mini Book and Mou...,Laura Joffe Numeroff,1992,HarperCollins Publishers,"reston, virginia, usa",36,"reston, virginia, usa",36
383829,276142,0671038672,10,"Lights Out : Sleep, Sugar, and Survival",T. S. Wiley,2000,Atria,"delta, british columbia, canada",34,"delta, british columbia, canada",34


### Find out which rating has been given the highest number of times

In [67]:
max_value=np.max(ratings.groupby('bookRating').size())
test=ratings.groupby('bookRating').size()
print("Maximum rating given is for ",test[test==max_value])

Maximum rating given is for  bookRating
8    91804
dtype: int64
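
A more compact way to get the same answer is `value_counts()` followed by `idxmax()`, which returns the rating value itself rather than a one-row Series. A sketch on a toy Series (the ratings are made up):

```python
import pandas as pd

# Toy stand-in for ratings['bookRating'] (illustrative values only)
toy_ratings = pd.Series([8, 10, 8, 7, 8, 10], name="bookRating")

counts = toy_ratings.value_counts()   # how often each rating occurs
most_common = counts.idxmax()         # the rating with the highest count
print(most_common, counts.max())
```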


### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, consider only users who have rated at least 100 books

In [69]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383839 entries, 0 to 383838
Data columns (total 11 columns):
userID               383839 non-null int64
ISBN                 383839 non-null object
bookRating           383839 non-null int64
bookTitle            383839 non-null object
bookAuthor           383838 non-null object
yearOfPublication    383839 non-null int32
publisher            383839 non-null object
Location_x           383839 non-null object
Age_x                383839 non-null int32
Location_y           383839 non-null object
Age_y                383839 non-null int32
dtypes: int32(3), int64(2), object(6)
memory usage: 30.7+ MB


In [70]:
ratings[ratings['userID']==6543]

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher,Location_x,Age_x,Location_y,Age_y
28,6543,0446605484,10,Roses Are Red (Alex Cross Novels),James Patterson,2001,Warner Vision,"strafford, missouri, usa",34,"strafford, missouri, usa",34
29,6543,0805062971,8,Fight Club,Chuck Palahniuk,1999,Owl Books,"strafford, missouri, usa",34,"strafford, missouri, usa",34
30,6543,0345342968,8,Fahrenheit 451,RAY BRADBURY,1987,Del Rey,"strafford, missouri, usa",34,"strafford, missouri, usa",34
31,6543,0446610038,9,1st to Die: A Novel,James Patterson,2002,Warner Vision,"strafford, missouri, usa",34,"strafford, missouri, usa",34
32,6543,0061009059,8,One for the Money (Stephanie Plum Novels (Pape...,Janet Evanovich,1995,HarperTorch,"strafford, missouri, usa",34,"strafford, missouri, usa",34
...,...,...,...,...,...,...,...,...,...,...,...
197,6543,1570629137,5,Confessions of a Pagan Nun : A Novel,KATE HORSLEY,2002,Shambhala,"strafford, missouri, usa",34,"strafford, missouri, usa",34
198,6543,1573222127,8,Miracle at St. Anna,James McBride,2002,Riverhead Books,"strafford, missouri, usa",34,"strafford, missouri, usa",34
199,6543,1841954608,6,Vernon God Little: A 21st Century Comedy in th...,D. B. C. Pierre,2003,Canongate Books,"strafford, missouri, usa",34,"strafford, missouri, usa",34
200,6543,1850514496,7,Collecting Art Nouveau,Philippe Garner,0,Treasure Press,"strafford, missouri, usa",34,"strafford, missouri, usa",34


In [71]:
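# Plotly imports required by this cell
import plotly.graph_objects as go
from plotly.offline import iplot

# Number of ratings per user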
data = ratings.groupby('userID')['bookRating'].count().clip(upper=100)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 100,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per User (Clipped at 100)',
                   xaxis = dict(title = 'Ratings Per User'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.5)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [72]:
df_rated_100_books=ratings.groupby(['userID']).filter(lambda s: s.bookRating.count()>=100)

In [None]:
ratings.shape

In [None]:
df_rated_100_books.shape
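
`groupby(...).filter` with a Python lambda calls the function once per user, which can be slow on a large frame. An alternative that should select the same rows is to broadcast the per-user count with `transform` and filter with a boolean mask. The sketch below uses toy data and a hypothetical threshold of 3 instead of 100:

```python
import pandas as pd

# Toy stand-in for the ratings table (illustrative values only)
toy = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 3],
    "bookRating": [5, 6, 7, 8, 9, 10],
})
min_ratings = 3  # hypothetical threshold; the notebook uses 100

# Per-row count of ratings made by that row's user
counts = toy.groupby("userID")["bookRating"].transform("count")
active = toy[counts >= min_ratings]

# The filter-based form used in the notebook, for comparison
same = toy.groupby("userID").filter(lambda s: s.bookRating.count() >= min_ratings)
print(active)
```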

### Generating ratings matrix from explicit ratings


#### Note: since the training algorithm cannot handle NaNs, replace them with 0, which indicates the absence of a rating

In [74]:
if not df_rated_100_books.isnull().values.any():
    print("There are no nulls/NaN values in our data frame")
else:
    print("There are nulls/NaN values in our data frame")

There are no nulls/NaN values in our data frame


### Generate the predicted ratings using SVD with 50 singular values

In [83]:
from sklearn.model_selection import train_test_split
from scipy.sparse.linalg import svds
from surprise import Dataset, Reader

In [76]:
df_rated_100_books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103269 entries, 28 to 328224
Data columns (total 11 columns):
userID               103269 non-null int64
ISBN                 103269 non-null object
bookRating           103269 non-null int64
bookTitle            103269 non-null object
bookAuthor           103269 non-null object
yearOfPublication    103269 non-null int32
publisher            103269 non-null object
Location_x           103269 non-null object
Age_x                103269 non-null int32
Location_y           103269 non-null object
Age_y                103269 non-null int32
dtypes: int32(3), int64(2), object(6)
memory usage: 8.3+ MB


In [77]:
ratings_trim = df_rated_100_books.filter(['userID', 'ISBN', 'bookRating'], axis=1).copy()

In [80]:
ratings_trim.shape

(103269, 3)

In [90]:
# Hold out 20% of the ratings as a test set
trainDF, tempDF = train_test_split(ratings_trim, test_size = 0.2, random_state = 100)
testDF = tempDF.copy()
# Work on an explicit copy so that blanking the ratings does not trigger SettingWithCopyWarning
tempDF = tempDF.copy()
tempDF['bookRating'] = np.nan
testDF = testDF.dropna()
# Training ratings plus the held-out rows with their ratings hidden
ratings = pd.concat([trainDF, tempDF]).reset_index()



In [91]:
ratings_trim.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103269 entries, 28 to 328224
Data columns (total 3 columns):
userID        103269 non-null int64
ISBN          103269 non-null object
bookRating    103269 non-null int64
dtypes: int64(2), object(1)
memory usage: 3.2+ MB


In [93]:
R_df = ratings.pivot(index = 'userID', columns = 'ISBN', values = 'bookRating').fillna(0)

In [393]:
R_df.shape

(449, 55763)
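
To make the shape of the ratings matrix concrete, the sketch below pivots a three-row toy table into a user-by-book matrix and fills the unrated cells with 0, as the note above describes. The IDs and ratings are hypothetical.

```python
import pandas as pd

# Toy long-format ratings (illustrative values only)
toy = pd.DataFrame({
    "userID": [1, 1, 2],
    "ISBN": ["A", "B", "B"],
    "bookRating": [8, 5, 10],
})

# One row per user, one column per book; missing pairs become 0 (no rating)
R = toy.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)

# Share of cells with no rating
sparsity = (R == 0).to_numpy().mean()
print(R)
print(sparsity)
```

With real data almost every cell is 0, since each user rates only a small fraction of the books.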

In [394]:
df_rated_100_books.loc[df_rated_100_books.userID==2110]

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher,Location,Age
670337,2110,059035342X,10,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books,"charlotte, north carolina, usa",34
670341,2110,0590448595,8,Karen's School Trip (Baby-Sitters Little Siste...,Ann M. Martin,1992,Scholastic Paperbacks (Mm),"charlotte, north carolina, usa",34
670344,2110,0451137965,9,Thinner,Stephen King,1985,New Amer Library,"charlotte, north carolina, usa",34
670345,2110,0590629786,10,"The Visitor (Animorphs, No 2)",K. A. Applegate,1996,Scholastic,"charlotte, north carolina, usa",34
670346,2110,0590629794,10,"The Encounter (Animorphs , No 3)",K. A. Applegate,1996,Scholastic,"charlotte, north carolina, usa",34
...,...,...,...,...,...,...,...,...,...
670494,2110,1558504184,8,Bradymania!: Everything You Always Wanted to K...,Elizabeth Moran,1995,Adams Media Corporation,"charlotte, north carolina, usa",34
670495,2110,1561008931,7,Knee Deep in Paradise (Nova Audio Books),Brett Butler,1996,Brilliance Audio - Trade,"charlotte, north carolina, usa",34
670496,2110,1565111575,10,Return of the Jedi: The Original Radio Drama,Anthony Daniels,1996,Highbridge Audio,"charlotte, north carolina, usa",34
670498,2110,1570420564,10,A Dream Is a Wish Your Heart Makes: My Story,Annette Funicello,1994,Time Warner Audio Books,"charlotte, north carolina, usa",34


In [425]:
R_df.head()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [426]:
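# Truncated SVD keeping the 50 largest singular values
U, sigma, Vt = svds(R_df.to_numpy(), k = 50)
sigma = np.diag(sigma)
# The low-rank reconstruction of the ratings matrix gives the predicted ratings
all_users_predicted_ratings = np.dot(np.dot(U, sigma), Vt)
preds_df = pd.DataFrame(all_users_predicted_ratings, columns = R_df.columns)
preds_df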

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.025341,-0.002146,-0.001431,-0.002146,-0.002146,0.002971,-0.003920,0.007035,0.007035,0.012316,...,0.000180,0.000226,0.042081,-0.016804,-0.080028,0.004746,0.028314,0.000120,-0.001693,0.067503
1,-0.010012,-0.003669,-0.002446,-0.003669,-0.003669,0.001075,0.001440,-0.003500,-0.003500,0.001612,...,-0.000363,0.000403,0.008142,0.001104,-0.029224,0.000999,0.002363,-0.000242,0.000029,-0.013059
2,-0.015054,-0.015457,-0.010304,-0.015457,-0.015457,0.007281,-0.014033,0.011941,0.011941,0.011796,...,-0.000455,0.001907,0.047982,0.005737,0.117859,0.006945,0.003119,-0.000304,0.009009,-0.057692
3,-0.021499,0.035602,0.023735,0.035602,0.035602,0.030307,0.024215,-0.001053,-0.001053,0.067579,...,0.002971,0.009912,0.086248,-0.008818,0.016154,0.028848,-0.000125,0.001981,0.031201,-0.046664
4,0.002077,-0.007965,-0.005310,-0.007965,-0.007965,0.002947,0.003057,0.000231,0.000231,0.006080,...,0.002120,0.001597,-0.012181,0.009420,0.673459,0.002591,-0.008229,0.001413,0.004918,0.047773
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
444,-0.013295,-0.002811,-0.001874,-0.002811,-0.002811,-0.023810,0.250610,0.005716,0.005716,-0.028896,...,0.001848,-0.007856,0.062856,-0.001133,0.128216,-0.020683,0.007356,0.001232,-0.024207,0.100999
445,0.017231,0.020953,0.013969,0.020953,0.020953,0.016303,0.045661,0.017187,0.017187,0.009598,...,-0.002726,0.002173,0.238855,-0.003137,0.058286,0.014197,0.010563,-0.001817,0.008224,-0.071487
446,0.003814,-0.011141,-0.007427,-0.011141,-0.011141,0.007776,-0.012431,0.005990,0.005990,0.022312,...,0.002610,0.003690,0.015291,0.021539,-0.051916,0.008105,0.005769,0.001740,0.008526,0.126062
447,0.078020,-0.024439,-0.016292,-0.024439,-0.024439,0.011760,-0.018174,-0.005821,-0.005821,0.027239,...,0.000453,0.003468,0.060661,0.010400,-0.074412,0.012083,0.008547,0.000302,0.010178,0.035976
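
The predicted-ratings frame above is indexed by row position (0, 1, 2, ...) rather than by `userID`, so looking up a user requires mapping back to the original index. The sketch below shows one way to keep the user IDs as the index and to read off top-N suggestions among books the user has not rated. It runs on a small toy matrix with `k = 2`; the IDs, ratings and the `recommend` helper are all hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds

# Toy ratings matrix: 6 users x 5 books, 0 means "not rated" (illustrative values only)
R = pd.DataFrame(
    [[8, 0, 0, 7, 0],
     [9, 0, 0, 8, 0],
     [0, 6, 10, 0, 0],
     [0, 7, 9, 0, 5],
     [7, 0, 0, 0, 6],
     [0, 5, 0, 9, 0]],
    index=[101, 102, 103, 104, 105, 106],
    columns=["isbn_a", "isbn_b", "isbn_c", "isbn_d", "isbn_e"],
    dtype=float,
)

# Rank-2 truncated SVD (k must be smaller than both matrix dimensions)
U, sigma, Vt = svds(R.to_numpy(), k=2)

# Reconstruct and keep the user IDs as the row index
preds = pd.DataFrame(U @ np.diag(sigma) @ Vt, index=R.index, columns=R.columns)

def recommend(user_id, n=2):
    """Hypothetical helper: top-n predicted books the user has not rated."""
    unrated = R.loc[user_id] == 0
    return preds.loc[user_id][unrated].sort_values(ascending=False).head(n).index.tolist()

recs = recommend(101, n=2)
print(recs)
```

Passing `index=R_df.index` when building `preds_df` would give the same user-ID lookup in the notebook.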


In [422]:
R_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 449 entries, 2033 to 278418
Columns: 66572 entries, 0000913154 to B000234N3A
dtypes: float64(66572)
memory usage: 228.1 MB


In [413]:
preds_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 449 entries, 0 to 448
Columns: 66572 entries, 0000913154 to B000234N3A
dtypes: float64(66572)
memory usage: 228.0 MB
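
The two `info()` outputs above differ in their index: `R_df` is indexed by userID (2033 to 278418), while `preds_df` has a `RangeIndex` from 0 to 448. The following is a minimal sketch, on made-up toy data, of carrying the user index over when wrapping a prediction array, so that rows can be looked up by userID directly. The names `toy_R`, `toy_pred_array` and `toy_preds`, and all values, are illustrative assumptions rather than the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy ratings matrix (hypothetical values): rows are userIDs, columns are ISBNs.
toy_R = pd.DataFrame(
    [[5.0, 0.0, 3.0], [0.0, 8.0, 0.0]],
    index=pd.Index([2033, 2110], name="userID"),
    columns=pd.Index(["0000913154", "0001046438", "000104687X"], name="ISBN"),
)

# Stand-in for a predicted-ratings array with the same shape as toy_R.
toy_pred_array = np.array([[0.2, 0.1, 0.4], [0.3, 0.6, 0.05]])

# Passing index= keeps the userID labels instead of a 0..n-1 RangeIndex.
toy_preds = pd.DataFrame(toy_pred_array, index=toy_R.index, columns=toy_R.columns)

# Rows can now be selected by userID.
print(toy_preds.loc[2110])
```

With the index preserved, no arithmetic on the userID is needed to find the matching row.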


In [414]:
all_users_predicted_ratings

array([[ 2.53413396e-02, -2.14623005e-03, -1.43082004e-03, ...,
         1.19754797e-04, -1.69335636e-03,  6.75032653e-02],
       [-1.00123158e-02, -3.66944542e-03, -2.44629695e-03, ...,
        -2.41938710e-04,  2.91532927e-05, -1.30588632e-02],
       [-1.50543571e-02, -1.54565959e-02, -1.03043973e-02, ...,
        -3.03637070e-04,  9.00887045e-03, -5.76922555e-02],
       ...,
       [ 3.81433419e-03, -1.11407816e-02, -7.42718773e-03, ...,
         1.74016504e-03,  8.52624532e-03,  1.26062064e-01],
       [ 7.80204879e-02, -2.44385752e-02, -1.62923834e-02, ...,
         3.02176041e-04,  1.01777285e-02,  3.59761492e-02],
       [ 8.05621362e-03,  1.16247449e-02,  7.74982991e-03, ...,
         8.58215920e-05, -3.24030447e-04,  7.47563413e-03]])

In [408]:
def recommend_books(predictions_df, userID, books_df, original_ratings_df, num_recommendations = 5):
    # predictions_df has a 0-based RangeIndex, so userID is treated here as a 1-based row position
    user_row_number = userID - 1
    # Predicted ratings for that row, highest first
    sorted_user_predictions = predictions_df.loc[user_row_number].sort_values(ascending = False)
    
    # Ratings given by this userID only
    user_data = original_ratings_df[original_ratings_df.userID == (userID)]
    # Full ratings frame (all users); used for the count below and for the exclusion filter
    user_full = original_ratings_df
    print('User {0} has already rated {1} movies.'.format(userID, user_full.dropna().shape[0]))
    print('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Books whose ISBN is absent from user_full, joined with the predictions and ranked by them;
    # iloc[..., :-1] drops the trailing 'Predictions' column from the result
    recommendations = (books_df[~books_df['ISBN'].isin(user_full['ISBN'])].
                      merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
                           left_on = 'ISBN',
                           right_on = 'ISBN').
                      rename(columns = {user_row_number: 'Predictions'}).
                      sort_values('Predictions', ascending = False).
                      iloc[:num_recommendations, :-1])
    return user_full, recommendations, sorted_user_predictions, user_data, user_full

### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the cells below to load the variables

In [336]:
userID = 2110

In [350]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271355 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271355 non-null object
bookTitle            271355 non-null object
bookAuthor           271354 non-null object
yearOfPublication    271355 non-null int32
publisher            271355 non-null object
dtypes: int32(1), object(4)
memory usage: 11.4+ MB


In [419]:
user_id=2

In [None]:
preds_df.head()

In [409]:
df_rated_100_books.loc[df_rated_100_books.userID==2110].index

Int64Index([670337, 670341, 670344, 670345, 670346, 670347, 670348, 670350,
            670352, 670353,
            ...
            670488, 670490, 670491, 670492, 670493, 670494, 670495, 670496,
            670498, 670499],
           dtype='int64', length=103)

In [420]:
already_rated, predictions, sorted_user_predictions, user_data, user_full = recommend_books(preds_df, user_id, books, df_rated_100_books, 10)

User 2 has already rated 103269 movies.
Recommending the highest 10 predicted ratings movies not already rated.


In [None]:
already_rated.loc

### Get the predicted ratings for userID `2110` and sort them in descending order

In [371]:
sorted_user_predictions

ISBN
059035342X    0.670207
044021145X    0.335010
0345370775    0.326200
0345384911    0.323808
0440213525    0.303738
                ...   
042511984X   -0.049180
0345313860   -0.049305
0671450344   -0.050616
0553569783   -0.055609
0553561618   -0.059308
Name: 1, Length: 55763, dtype: float64
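
When the predictions frame has a positional `RangeIndex`, a userID has to be translated to its row position before the row can be sorted. The sketch below, on toy data, uses `Index.get_loc` on the ratings-matrix index for that translation. `toy_R`, `toy_preds`, `row_number` and the values are assumptions made for illustration.

```python
import pandas as pd

# Toy ratings matrix (hypothetical values) indexed by userID.
toy_R = pd.DataFrame(
    [[5.0, 0.0, 3.0], [0.0, 8.0, 0.0]],
    index=pd.Index([2033, 2110], name="userID"),
    columns=pd.Index(["0000913154", "0001046438", "000104687X"], name="ISBN"),
)

# Toy predictions with a plain 0..n-1 index, in the same row order as toy_R.
toy_preds = pd.DataFrame(
    [[0.2, 0.1, 0.4], [0.3, 0.6, 0.05]],
    columns=toy_R.columns,
)

user_id = 2110
# Position of this userID among the rows of the ratings matrix.
row_number = toy_R.index.get_loc(user_id)

# Predicted ratings for that user, highest first.
sorted_toy_predictions = toy_preds.iloc[row_number].sort_values(ascending=False)
print(sorted_toy_predictions)
```

This relies on the predictions keeping the same row order as the ratings matrix they were computed from.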

### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly rated

In [374]:
user_data=already_rated.copy()

In [375]:
user_data.head()

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher,Location,Age
43,6543,446605484,10,Roses Are Red (Alex Cross Novels),James Patterson,2001,Warner Vision,"strafford, missouri, usa",34
47,6543,805062971,8,Fight Club,Chuck Palahniuk,1999,Owl Books,"strafford, missouri, usa",34
48,6543,345342968,8,Fahrenheit 451,RAY BRADBURY,1987,Del Rey,"strafford, missouri, usa",34
49,6543,446610038,9,1st to Die: A Novel,James Patterson,2002,Warner Vision,"strafford, missouri, usa",34
55,6543,61009059,8,One for the Money (Stephanie Plum Novels (Pape...,Janet Evanovich,1995,HarperTorch,"strafford, missouri, usa",34


In [376]:
user_data.shape

(103269, 9)
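
Since the dataset marks implicit ratings with 0 and explicit ratings with 1 to 10, one user's explicit interactions can be selected with two boolean masks. This is a minimal sketch on a toy ratings frame. `toy_ratings`, `target_user` and the rows are made up for illustration, with column names following the notebook's `ratings` table.

```python
import pandas as pd

# Toy ratings table (hypothetical rows); bookRating 0 denotes an implicit rating.
toy_ratings = pd.DataFrame({
    "userID": [2110, 2110, 2110, 6543],
    "ISBN": ["0446605484", "0805062971", "0345342968", "0446605484"],
    "bookRating": [10, 0, 8, 9],
})

target_user = 2110
is_target = toy_ratings["userID"] == target_user
is_explicit = toy_ratings["bookRating"] > 0

# Keep only the target user's explicitly rated books.
toy_user_data = toy_ratings[is_target & is_explicit]
print(toy_user_data)
```

The resulting frame holds only rows for the chosen user, so its row count equals the number of books that user has explicitly rated.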

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [377]:
user_full_info=user_data.head()
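
Combining a user's ratings with book metadata amounts to a join on `ISBN`. The sketch below shows one way to do it with `DataFrame.merge`. The frames `toy_user_data` and `toy_book_data` are small stand-ins assumed for illustration, reusing a few titles from the output above.

```python
import pandas as pd

# Toy per-user ratings (hypothetical rows).
toy_user_data = pd.DataFrame({
    "userID": [2110, 2110],
    "ISBN": ["0446605484", "0345342968"],
    "bookRating": [10, 8],
})

# Toy book metadata keyed by ISBN.
toy_book_data = pd.DataFrame({
    "ISBN": ["0446605484", "0345342968", "0805062971"],
    "bookTitle": ["Roses Are Red (Alex Cross Novels)", "Fahrenheit 451", "Fight Club"],
    "bookAuthor": ["James Patterson", "RAY BRADBURY", "Chuck Palahniuk"],
})

# Left join: one output row per rating, with title and author attached.
toy_user_full_info = toy_user_data.merge(toy_book_data, on="ISBN", how="left")
print(toy_user_full_info)
```

A left join keeps every rating even when a book is missing from the metadata table; those rows would simply carry `NaN` in the title and author columns.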

### Get the top 10 recommendations for the userID given above from the books not already rated by that user

In [379]:
predictions.head(10)

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
2,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company
3,0679425608,Under the Black Flag: The Romance and the Real...,David Cordingly,1996,Random House
4,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner
5,0771074670,Nights Below Station Street,David Adams Richards,1988,Emblem Editions
6,080652121X,Hitler's Secret Bankers: The Myth of Swiss Neu...,Adam Lebor,2000,Citadel Press
7,0887841740,The Middle Stories,Sheila Heti,2004,House of Anansi Press
8,1552041778,Jane Doe,R. J. Kaiser,1999,Mira Books
9,1558746218,A Second Chicken Soup for the Woman's Soul (Ch...,Jack Canfield,1998,Health Communications
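
Selecting the top N unrated books can be sketched as three steps: drop the ISBNs the user has already rated from the sorted predictions, take the first N, and attach the book metadata. The example below runs on toy data. The ISBNs are borrowed from the sorted predictions shown earlier, while the scores, the titles and the names `toy_sorted_predictions`, `already_rated_isbns` and `toy_books` are placeholders assumed for illustration.

```python
import pandas as pd

# Toy predicted ratings for one user (hypothetical scores), indexed by ISBN.
toy_sorted_predictions = pd.Series(
    [0.67, 0.33, 0.30, -0.05],
    index=pd.Index(["059035342X", "044021145X", "0440213525", "0553561618"], name="ISBN"),
    name="Predictions",
)

# ISBNs this user has already rated (assumed).
already_rated_isbns = {"044021145X"}

# Toy book metadata with placeholder titles.
toy_books = pd.DataFrame({
    "ISBN": ["059035342X", "044021145X", "0440213525", "0553561618"],
    "bookTitle": ["Title A", "Title B", "Title C", "Title D"],
})

# 1) Remove already-rated books, 2) keep the N highest predictions.
candidates = toy_sorted_predictions.drop(labels=list(already_rated_isbns), errors="ignore")
top_n = candidates.sort_values(ascending=False).head(2)

# 3) Attach titles; the Predictions column stays next to each book.
toy_recommendations = top_n.reset_index().merge(toy_books, on="ISBN", how="left")
print(toy_recommendations)
```

Filtering on the single user's rated ISBNs, rather than on every ISBN in the ratings table, leaves books rated by other users available as candidates.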
