**About Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books, and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and implicit ratings are expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#### Execute the cells below to load the datasets

In [3]:
books_df = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books_df.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


In [4]:
#Loading data

users_df = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users_df.columns = ['userID', 'Location', 'Age']


In [5]:
ratings_df = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings_df.columns = ['userID', 'ISBN', 'bookRating']
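
The `error_bad_lines` argument used above is not accepted by recent pandas releases; since pandas 1.3 the equivalent option is `on_bad_lines="skip"`. The sketch below uses a small made-up in-memory sample (not the real `books.csv`) to show that option, together with `dtype=str` so that ISBNs keep their leading zeros. The sample rows and the names `sample` and `sample_df` are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample in the same semicolon-separated layout;
# the second data row has an extra field and is therefore malformed.
sample = io.StringIO(
    '"ISBN";"bookTitle";"yearOfPublication"\n'
    '"0195153448";"Classical Mythology";"2002"\n'
    '"0000000000";"Broken row";"extra field";"1999"\n'
    '"0002005018";"Clara Callan";"2001"\n'
)

# on_bad_lines="skip" drops rows with too many fields;
# dtype=str reads every column as text, preserving leading zeros.
sample_df = pd.read_csv(sample, sep=";", on_bad_lines="skip", dtype=str)
print(sample_df)
```

Reading every column as text is a deliberate choice here: numeric columns such as `yearOfPublication` can be converted later, once the bad values have been inspected.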

### Check the number of records and features in each dataset

In [6]:
books_df.shape

(271360, 8)

In [7]:
users_df.shape

(278858, 3)

In [8]:
ratings_df.shape

(1149780, 3)

In [9]:
books_df.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
imageUrlS            object
imageUrlM            object
imageUrlL            object
dtype: object

In [10]:
users_df.dtypes

userID        int64
Location     object
Age         float64
dtype: object

In [11]:
ratings_df.dtypes

userID         int64
ISBN          object
bookRating     int64
dtype: object

In [12]:
users_df['userID'] = users_df['userID'].astype(str)

In [13]:
ratings_df.userID = ratings_df.userID.astype(str)

## Exploring books dataset

In [14]:
books_df.head(10)

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
5,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...
6,0425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.0...,http://images.amazon.com/images/P/0425176428.0...,http://images.amazon.com/images/P/0425176428.0...
7,0671870432,PLEADING GUILTY,Scott Turow,1993,Audioworks,http://images.amazon.com/images/P/0671870432.0...,http://images.amazon.com/images/P/0671870432.0...,http://images.amazon.com/images/P/0671870432.0...
8,0679425608,Under the Black Flag: The Romance and the Real...,David Cordingly,1996,Random House,http://images.amazon.com/images/P/0679425608.0...,http://images.amazon.com/images/P/0679425608.0...,http://images.amazon.com/images/P/0679425608.0...
9,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner,http://images.amazon.com/images/P/074322678X.0...,http://images.amazon.com/images/P/074322678X.0...,http://images.amazon.com/images/P/074322678X.0...


In [15]:
books_df.sample(10)

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
150639,0679749810,Balkan Ghosts: A Journey Through History (Vint...,Robert D. Kaplan,1994,Vintage Books USA,http://images.amazon.com/images/P/0679749810.0...,http://images.amazon.com/images/P/0679749810.0...,http://images.amazon.com/images/P/0679749810.0...
72810,0385336446,Hunting Midnight,Richard Zimler,2003,Delacorte Press,http://images.amazon.com/images/P/0385336446.0...,http://images.amazon.com/images/P/0385336446.0...,http://images.amazon.com/images/P/0385336446.0...
183100,0874313465,Mysterious Cairo,Greg Gorden,1992,West End Games,http://images.amazon.com/images/P/0874313465.0...,http://images.amazon.com/images/P/0874313465.0...,http://images.amazon.com/images/P/0874313465.0...
206894,081931126X,Elephant Goes to School,Jerry Smath,1984,Parents Magazine Press,http://images.amazon.com/images/P/081931126X.0...,http://images.amazon.com/images/P/081931126X.0...,http://images.amazon.com/images/P/081931126X.0...
46017,0061090409,Keeper of the Light : Keeper of the Light,Diane Chamberlain,1993,HarperTorch,http://images.amazon.com/images/P/0061090409.0...,http://images.amazon.com/images/P/0061090409.0...,http://images.amazon.com/images/P/0061090409.0...
114318,059040623X,Mooncake,Frank Asch,1983,"Scholastic, Inc.",http://images.amazon.com/images/P/059040623X.0...,http://images.amazon.com/images/P/059040623X.0...,http://images.amazon.com/images/P/059040623X.0...
251170,0385338082,Can You Keep a Secret?,SOPHIE KINSELLA,2005,Delta,http://images.amazon.com/images/P/0385338082.0...,http://images.amazon.com/images/P/0385338082.0...,http://images.amazon.com/images/P/0385338082.0...
222205,3404203976,Der Feuerkreis 03. Schattentempel.,Janny Wurts,2000,LÃ?Â¼bbe,http://images.amazon.com/images/P/3404203976.0...,http://images.amazon.com/images/P/3404203976.0...,http://images.amazon.com/images/P/3404203976.0...
11637,0671027581,Open Season,Linda Howard,2002,Pocket,http://images.amazon.com/images/P/0671027581.0...,http://images.amazon.com/images/P/0671027581.0...,http://images.amazon.com/images/P/0671027581.0...
163438,0451163834,Like Love: An 87th Precinct Mystery,Ed McBain,1994,Signet Book,http://images.amazon.com/images/P/0451163834.0...,http://images.amazon.com/images/P/0451163834.0...,http://images.amazon.com/images/P/0451163834.0...


In [16]:
books_df.tail(10)

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
271350,0762412119,"Burpee Gardening Cyclopedia: A Concise, Up to ...",Allan Armitage,2002,Running Press Book Publishers,http://images.amazon.com/images/P/0762412119.0...,http://images.amazon.com/images/P/0762412119.0...,http://images.amazon.com/images/P/0762412119.0...
271351,1582380805,Tropical Rainforests: 230 Species in Full Colo...,"Allen M., Ph.D. Young",2001,Golden Guides from St. Martin's Press,http://images.amazon.com/images/P/1582380805.0...,http://images.amazon.com/images/P/1582380805.0...,http://images.amazon.com/images/P/1582380805.0...
271352,1845170423,Cocktail Classics,David Biggs,2004,Connaught,http://images.amazon.com/images/P/1845170423.0...,http://images.amazon.com/images/P/1845170423.0...,http://images.amazon.com/images/P/1845170423.0...
271353,014002803X,Anti Death League,Kingsley Amis,1975,Viking Press,http://images.amazon.com/images/P/014002803X.0...,http://images.amazon.com/images/P/014002803X.0...,http://images.amazon.com/images/P/014002803X.0...
271354,0449906736,Flashpoints: Promise and Peril in a New World,Robin Wright,1993,Ballantine Books,http://images.amazon.com/images/P/0449906736.0...,http://images.amazon.com/images/P/0449906736.0...,http://images.amazon.com/images/P/0449906736.0...
271355,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...
271359,0767409752,A Guided Tour of Rene Descartes' Meditations o...,Christopher Biffle,2000,McGraw-Hill Humanities/Social Sciences/Languages,http://images.amazon.com/images/P/0767409752.0...,http://images.amazon.com/images/P/0767409752.0...,http://images.amazon.com/images/P/0767409752.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [17]:
books_df.drop(['imageUrlM','imageUrlS','imageUrlL'],axis=1,inplace=True)

In [18]:
books_df.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [120]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271355 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271355 non-null object
bookTitle            271355 non-null object
bookAuthor           271354 non-null object
yearOfPublication    271355 non-null int64
publisher            271355 non-null object
dtypes: int64(1), object(4)
memory usage: 12.4+ MB


**yearOfPublication**

### Check unique values of yearOfPublication


In [19]:
books_df.yearOfPublication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As the output above shows, this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of errors in the CSV file.

Some entries are also strings, so the same year appears both as a number and as a string. The following steps fix these issues.

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [20]:
books_df.loc[books_df['yearOfPublication'] == "DK Publishing Inc"]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [21]:
todrop= books_df[(books_df['yearOfPublication'] == 'DK Publishing Inc') | (books_df['yearOfPublication'] == 'Gallimard')].index

In [22]:
todrop

Int64Index([209538, 220731, 221678], dtype='int64')

In [23]:
books_df.drop(index= todrop, inplace=True)

In [24]:
books_df.loc[(books_df['yearOfPublication'] == "DK Publishing Inc") | (books_df['yearOfPublication'] == "Gallimard") ]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


In [25]:
books_df.yearOfPublication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

### Change the datatype of yearOfPublication to 'int'

In [26]:
books_df.yearOfPublication = pd.to_numeric(books_df.yearOfPublication)

In [27]:
books_df.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object
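
As a complement to dropping the bad rows by hand, `pd.to_numeric` can convert a mixed column in one pass when called with `errors="coerce"`, which turns unparseable entries into NaN. The following is a minimal sketch on a made-up Series (the values and the names `years`, `parsed`, and `clean_years` are assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical mix of integers, numeric strings, and a misplaced publisher name
years = pd.Series([2002, "2000", "DK Publishing Inc", "0"], dtype=object)

# errors="coerce" turns anything that cannot be parsed into NaN
parsed = pd.to_numeric(years, errors="coerce")

# Drop the unparseable entries, then cast the remainder to int
clean_years = parsed.dropna().astype(int)
print(clean_years.tolist())
```

The coerced NaN values make the bad rows easy to locate with `parsed.isna()` before deciding whether to drop or repair them.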

### Drop NaNs in `'publisher'` column


In [28]:
books_df.isnull().sum()

ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            2
dtype: int64

In [29]:
books_df.loc[books_df['publisher'].isnull()]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


In [30]:
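# Note: this acts on the 'publisher' column only and does not remove the rows from books_df;
# the rows are dropped by index in the cells below.
books_df['publisher'].dropna(axis=0, inplace= True)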

In [31]:
books_df['publisher'].isnull().sum()

0

In [32]:
books_df.isnull().any()

ISBN                 False
bookTitle            False
bookAuthor            True
yearOfPublication    False
publisher             True
dtype: bool

In [33]:
books_df.loc[(books_df['ISBN'] == '193169656X') | (books_df['ISBN'] == '1931696993')]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


In [34]:
nullvalue = books_df.loc[(books_df['ISBN'] == '193169656X') | (books_df['ISBN'] == '1931696993')].index

In [35]:
nullvalue

Int64Index([128890, 129037], dtype='int64')

In [36]:
books_df.drop(index= nullvalue,inplace= True)

In [37]:
books_df.isnull().sum()

ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            0
dtype: int64

###### One null value remains in the bookAuthor column.
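
The publisher rows above were removed by looking up their index labels. A shorter route, sketched here on a made-up three-row frame, is to call `dropna` on the DataFrame with `subset=`. The frame `demo_books` and the other names are assumptions for illustration.

```python
import pandas as pd

demo_books = pd.DataFrame({
    "ISBN": ["193169656X", "0195153448", "1931696993"],
    "publisher": [None, "Oxford University Press", None],
})

# Calling dropna on the column returns a new Series; demo_books keeps all rows
publisher_only = demo_books["publisher"].dropna()

# Dropping on the DataFrame with subset= removes the rows themselves
demo_clean = demo_books.dropna(subset=["publisher"])
print(len(demo_books), len(publisher_only), len(demo_clean))
```

The same call with `subset=["bookAuthor"]` would handle the remaining null author, if that row should be dropped as well.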

## Exploring Users dataset

In [38]:
print(users_df.shape)
users_df.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [39]:
uni = users_df['Age'].unique()

In [40]:
uni.sort()
print (uni)

[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
  14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
  28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
  42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.
  56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
  70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.
  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
 113. 114. 115. 116. 118. 119. 123. 124. 127. 128. 132. 133. 136. 137.
 138. 140. 141. 143. 146. 147. 148. 151. 152. 156. 157. 159. 162. 168.
 172. 175. 183. 186. 189. 199. 200. 201. 204. 207. 208. 209. 210. 212.
 219. 220. 223. 226. 228. 229. 230. 231. 237. 239. 244.  nan]


The Age column has some invalid entries, such as NaN, 0, and implausibly high values of 100 and above.

### Ages below 5 and above 90 are implausible for book ratings, so replace them with NaNs

In [41]:
users_df.dtypes

userID       object
Location     object
Age         float64
dtype: object

In [42]:
users_df.isnull().sum()

userID           0
Location         0
Age         110762
dtype: int64

In [None]:
## 110762 null values are already present

In [43]:
findNan= users_df[(users_df['Age'] < 5) | (users_df['Age']> 90)]

In [44]:
findNan.sample(15)

Unnamed: 0,userID,Location,Age
142681,142682,"colorado, springs co, colorado, usa",4.0
229838,229839,"80335 münchen, bayern, germany",104.0
224633,224634,"fuerteventura, las palmas, spain",111.0
20201,20202,"pfungstadt, hessen, germany",104.0
153484,153485,"melaka, melaka, malaysia",102.0
217182,217183,"los angeles, california, usa",2.0
275284,275285,"watauga, texas, usa",0.0
253872,253873,"aveiro, aveiro, portugal",1.0
250061,250062,"blof, hamburg, congo",103.0
131921,131922,"bremen, bremen, germany",93.0


In [45]:
# Set ages outside the 5-90 range to NaN with a boolean mask
users_df.loc[(users_df['Age'] < 5) | (users_df['Age'] > 90), 'Age'] = np.nan

In [46]:
users_df['Age'].unique()

array([nan, 18., 17., 61., 26., 14., 25., 19., 46., 55., 32., 24., 20.,
       34., 23., 51., 31., 21., 44., 30., 57., 43., 37., 41., 54., 42.,
       50., 39., 53., 47., 36., 28., 35., 13., 58., 49., 38., 45., 62.,
       63., 27., 33., 29., 66., 40., 15., 60., 79., 22., 16., 65., 59.,
       48., 72., 56., 67., 80., 52., 69., 71., 73., 78.,  9., 64., 12.,
       74., 75., 76., 83., 68., 11., 77., 70.,  8.,  7., 81., 10.,  5.,
        6., 84., 82., 90., 85., 86., 87., 89., 88.])

### Replace null values in column `Age` with mean

In [47]:
users_df['Age'].replace(np.nan,0,inplace=True)

In [48]:
users_df.isnull().sum()

userID      0
Location    0
Age         0
dtype: int64

In [49]:
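# Caution: this mean is computed after the NaNs were set to 0, so the zeros pull it down (20.77 in the output below)
users_df['Age']= users_df['Age'].replace(0,users_df['Age'].mean())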

In [50]:
users_df['Age'].unique()

array([20.76820819, 18.        , 17.        , 61.        , 26.        ,
       14.        , 25.        , 19.        , 46.        , 55.        ,
       32.        , 24.        , 20.        , 34.        , 23.        ,
       51.        , 31.        , 21.        , 44.        , 30.        ,
       57.        , 43.        , 37.        , 41.        , 54.        ,
       42.        , 50.        , 39.        , 53.        , 47.        ,
       36.        , 28.        , 35.        , 13.        , 58.        ,
       49.        , 38.        , 45.        , 62.        , 63.        ,
       27.        , 33.        , 29.        , 66.        , 40.        ,
       15.        , 60.        , 79.        , 22.        , 16.        ,
       65.        , 59.        , 48.        , 72.        , 56.        ,
       67.        , 80.        , 52.        , 69.        , 71.        ,
       73.        , 78.        ,  9.        , 64.        , 12.        ,
       74.        , 75.        , 76.        , 83.        , 68.  
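
The imputed value of 20.77 is lower than it would be if only valid ages were averaged, because the zeros that stood in for NaN were included in the mean. The sketch below shows an alternative on made-up ages: mask the out-of-range values, then let `fillna` use a mean that ignores NaN. The sample values and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including missing and out-of-range values
ages = pd.Series([np.nan, 18.0, 17.0, 0.0, 104.0, 61.0])

# Keep ages between 5 and 90 (inclusive); everything else becomes NaN
valid_ages = ages.where(ages.between(5, 90))

# mean() skips NaN by default, so only the valid ages contribute
mean_age = valid_ages.mean()
imputed = valid_ages.fillna(mean_age)
print(mean_age, imputed.tolist())
```

Whether mean imputation is appropriate at all is a separate question; with roughly 40% of ages missing, the median or an explicit "unknown" flag may be worth considering.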

### Change the datatype of `Age` to `int`

In [51]:
users_df['Age']=users_df['Age'].astype(int)

In [52]:
users_df.dtypes

userID      object
Location    object
Age          int32
dtype: object

## Exploring the Ratings Dataset

### Check the shape

In [53]:
ratings_df.shape

(1149780, 3)

In [54]:
n_users = users_df.shape[0]
n_books = books_df.shape[0]

In [55]:
ratings_df.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [59]:
ratings_df['userID'].count() 

1149780

### The ratings dataset should contain only books that exist in the books dataset. Drop the remaining rows

In [60]:
ratings_df= pd.merge(ratings_df, books_df, on = ['ISBN'], how='inner')

In [61]:
ratings_df.shape

(1031130, 7)

In [59]:
## Another way: ratings = ratings_df[ratings_df['ISBN'].isin(books_df['ISBN'])]

In [60]:
## ratings.shape

In [62]:
ratings_df.head(10)

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
1,2313,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
2,6543,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
3,8680,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
4,10314,034545104X,9,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
5,23768,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
6,28266,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
7,28523,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
8,39002,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
9,50403,034545104X,9,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books


### The ratings dataset should contain only ratings from users who exist in the users dataset. Drop the remaining rows

In [63]:
ratings_df = pd.merge(ratings_df, users_df['userID'], how='inner') ## ratings = ratings_df[ratings_df['userID'].isin(users_df['userID'])]

In [64]:
ratings_df.shape

(1031130, 7)
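
The two inner merges above both filter the ratings and attach the book columns. When only the filtering is wanted, the `isin` variant mentioned in the comments keeps the original three columns. A minimal sketch on made-up data follows; all values and names are assumptions.

```python
import pandas as pd

demo_ratings = pd.DataFrame({
    "userID": ["1", "2", "3", "9"],
    "ISBN": ["A", "B", "Z", "A"],
    "bookRating": [5, 0, 8, 7],
})
known_isbns = pd.Series(["A", "B"])       # stands in for books_df['ISBN']
known_users = pd.Series(["1", "2", "3"])  # stands in for users_df['userID']

# Keep a rating only if both its book and its user are known
keep = demo_ratings["ISBN"].isin(known_isbns) & demo_ratings["userID"].isin(known_users)
filtered = demo_ratings[keep]
print(filtered)
```

Unlike a merge, this never duplicates rows when a key appears more than once on the right-hand side.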

### Consider only ratings from 1 to 10 and leave out the 0s in column `bookRating`

In [65]:
ratings_df['bookRating'].unique()

array([ 0,  5,  9,  8,  7,  6, 10,  3,  4,  2,  1], dtype=int64)

In [66]:
ratings= ratings_df.loc[ratings_df['bookRating']!= 0]

In [67]:
ratings.shape

(383839, 7)

In [68]:
ratings['bookRating'].unique()

array([ 5,  9,  8,  7,  6, 10,  3,  4,  2,  1], dtype=int64)

### Find out which rating has been given the most times

In [69]:
ratings['bookRating'].value_counts()

8     91804
10    71225
7     66401
9     60776
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64

The rating 8 is the most frequent; it has been given 91804 times.

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [70]:
rat = ratings.groupby(['userID'], sort = False, as_index= False).count()

In [71]:
rat.head(15)

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2313,28,28,28,28,28,28
1,6543,174,174,174,174,174,174
2,8680,16,16,16,16,16,16
3,10314,64,64,64,64,64,64
4,23768,210,210,210,210,210,210
5,28523,57,57,57,57,57,57
6,50403,1,1,1,1,1,1
7,56157,15,15,15,15,15,15
8,59102,3,3,3,3,3,3
9,59287,2,2,2,2,2,2


In [72]:
rat= rat.loc[rat['ISBN'] >= 100]

In [73]:
rat.head(15)

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
1,6543,174,174,174,174,174,174
4,23768,210,210,210,210,210,210
15,98391,5689,5689,5689,5689,5689,5689
18,123981,136,136,136,136,136,136
37,208406,100,100,100,100,100,100
41,227447,182,182,182,182,182,182
42,227520,159,159,159,159,159,159
44,240144,227,227,227,227,227,227
52,271448,129,129,129,129,129,129
55,278418,106,106,106,106,106,106


In [74]:
ratings = ratings[ratings['userID'].isin(rat['userID'])]

In [75]:
ratings.shape

(103269, 7)
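
The count-then-`isin` steps above can also be written as a single filter with `groupby(...).transform("count")`, which returns each user's rating count aligned with the original rows. The sketch below uses a made-up frame and a small threshold so the result is easy to read; `demo`, `min_ratings`, and `active` are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({
    "userID": ["a", "a", "a", "b", "c", "c"],
    "bookRating": [8, 9, 7, 10, 5, 6],
})
min_ratings = 2  # the notebook uses 100; a small value keeps the demo readable

# Number of ratings per user, repeated on every row of that user
ratings_per_user = demo.groupby("userID")["bookRating"].transform("count")

# Keep rows belonging to users at or above the threshold
active = demo[ratings_per_user >= min_ratings]
print(active["userID"].unique())
```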

In [76]:
ratings.dtypes

userID               object
ISBN                 object
bookRating            int64
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

### Generating ratings matrix from explicit ratings


#### Note: because training algorithms cannot handle NaNs, replace them with 0, which indicates the absence of a rating

In [77]:
ratings.isnull().sum()

userID               0
ISBN                 0
bookRating           0
bookTitle            0
bookAuthor           0
yearOfPublication    0
publisher            0
dtype: int64

In [None]:
## Currently, there are no null values present in the data.

In [78]:
from surprise import Dataset, Reader
reader =  Reader(rating_scale=(1,10))

In [79]:
recommend = Dataset.load_from_df(ratings[['userID','bookTitle','bookRating']], reader)

In [80]:
recommend

<surprise.dataset.DatasetAutoFolds at 0x2fb48c3f7b8>

In [81]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(recommend, test_size=.25,random_state=123)

In [82]:
type(trainset)

surprise.trainset.Trainset

In [83]:
len(testset)

25818

In [84]:
user_records = trainset.ur
type(user_records)

collections.defaultdict

In [85]:
user_records

defaultdict(list,
            {0: [(0, 5.0),
              (245, 10.0),
              (608, 10.0),
              (1266, 9.0),
              (1283, 7.0),
              (1877, 9.0),
              (2747, 7.0),
              (3077, 5.0),
              (3468, 10.0),
              (3535, 8.0),
              (3789, 10.0),
              (3864, 10.0),
              (4145, 10.0),
              (5121, 7.0),
              (6917, 10.0),
              (7476, 8.0),
              (7906, 10.0),
              (7986, 9.0),
              (8857, 8.0),
              (10036, 7.0),
              (2699, 10.0),
              (10223, 10.0),
              (10407, 9.0),
              (10488, 8.0),
              (10542, 7.0),
              (10861, 4.0),
              (11123, 8.0),
              (7715, 4.0),
              (11438, 8.0),
              (12477, 8.0),
              (12590, 7.0),
              (12934, 10.0),
              (13131, 9.0),
              (13557, 10.0),
              (5917, 5.0),
              

In [86]:
print(trainset.to_raw_uid(192))
trainset.to_raw_iid(192)

2110


'Children Are from Heaven : Positive Parenting Skills for Raising Cooperative, Confident, and Compassionate Children'

In [87]:
trainset.to_inner_iid('Clara Callan')

5305

### Generate the predicted ratings using SVD with the number of singular values (`n_factors`) set to 50

In [88]:
from surprise import SVD
from surprise import accuracy

In [89]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(recommend, test_size=.25,random_state=123)

In [90]:
recomend_svd = SVD(n_factors=50,biased= False)

In [91]:
recomend_svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2fb4cc1a6d8>

In [92]:
test_pred = recomend_svd.test(testset)

In [93]:
test_pred[196]

Prediction(uid='98391', iid="What She Doesn't Know (Zebra Romantic Suspense)", r_ui=8.0, est=7.8288853597758585, details={'was_impossible': True, 'reason': 'User and item are unkown.'})

In [94]:
test_pred

[Prediction(uid='13552', iid='Cruel & Unusual (Kay Scarpetta Mysteries (Paperback))', r_ui=7.0, est=7.26765709795908, details={'was_impossible': False}),
 Prediction(uid='235105', iid='Dismissed With Prejudice: A J. P. Beaumont Mystery', r_ui=7.0, est=7.8288853597758585, details={'was_impossible': True, 'reason': 'User and item are unkown.'}),
 Prediction(uid='242465', iid="The Doctor's Book of Home Remedies : Thousands of Tips and Techniques Anyone Can Use to Heal Everyday Health Problems", r_ui=10.0, est=2.8875214136706004, details={'was_impossible': False}),
 Prediction(uid='13552', iid='In Her Shoes : A Novel', r_ui=9.0, est=7.303580240753087, details={'was_impossible': False}),
 Prediction(uid='38281', iid="Ender's Shadow", r_ui=9.0, est=6.7441063264729415, details={'was_impossible': False}),
 Prediction(uid='225595', iid='Quarantine', r_ui=8.0, est=7.8288853597758585, details={'was_impossible': True, 'reason': 'User and item are unkown.'}),
 Prediction(uid='23768', iid='Let t
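
Several predictions above carry `'was_impossible': True` and share the same estimate (7.8288853597758585), which appears to be a fallback value used when the model cannot score a user/item pair. Such entries can be separated out by checking the `details` dictionary. The sketch below uses a stand-in namedtuple with the same field names as the printed tuples, so it runs without the `surprise` package; the stand-in and the name `demo_pred` are assumptions.

```python
from collections import namedtuple

# Stand-in with the same field names as the Prediction tuples printed above
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

demo_pred = [
    Prediction("13552", "In Her Shoes : A Novel", 9.0, 7.303580240753087,
               {"was_impossible": False}),
    Prediction("225595", "Quarantine", 8.0, 7.8288853597758585,
               {"was_impossible": True, "reason": "User and item are unkown."}),
]

# Keep only predictions the model could actually compute
scored = [p for p in demo_pred if not p.details["was_impossible"]]
print(len(scored), scored[0].iid)
```

Applied to `test_pred`, the same list comprehension would show how many of the test predictions are genuine model outputs rather than fallbacks.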

In [102]:
test_pred_df= pd.DataFrame([[x.uid,x.iid,x.est] for x in test_pred])

In [103]:
test_pred_df.head()

Unnamed: 0,0,1,2
0,13552,Cruel & Unusual (Kay Scarpetta Mysteries (...,7.267657
1,235105,Dismissed With Prejudice: A J. P. Beaumont Mys...,7.828885
2,242465,The Doctor's Book of Home Remedies : Thousands...,2.887521
3,13552,In Her Shoes : A Novel,7.30358
4,38281,Ender's Shadow,6.744106


In [104]:
test_pred_df.columns = ["userID","bookTitle","est_ratings"]
test_pred_df.sort_values(by= ["userID", "est_ratings"], ascending= False, inplace= True)

In [105]:
test_pred_df.head()

Unnamed: 0,userID,bookTitle,est_ratings
6003,98758,On Writing,9.234665
12558,98758,Stand,8.467609
17807,98758,Gerald's Game,8.282596
175,98758,Coldheart Canyon,7.828885
6378,98758,Weird Florida,7.828885


In [106]:
test_pred_df.userID.value_counts()

11676     1741
98391     1448
153662     489
189835     450
23902      270
76499      263
171118     252
235105     246
248718     235
16795      218
35859      204
56399      204
185233     191
197659     185
114368     150
95359      149
101851     148
69078      142
158295     135
87141      131
93047      129
182085     128
60244      127
100906     125
177458     124
23872      124
257204     121
189334     121
107784     114
135149     113
          ... 
235935      24
240568      24
199416      24
183958      24
55492       23
10560       23
126492      23
2110        23
177072      23
140000      23
52350       23
146113      23
241666      22
88283       22
132492      22
163759      22
94242       22
208406      21
190708      21
76223       21
86947       20
15418       20
26544       20
187256      20
148966      20
33145       19
97754       19
36609       18
182993      16
117384      16
Name: userID, Length: 449, dtype: int64

In [131]:
accuracy.rmse(test_pred)

RMSE: 2.7226


2.722644388191621
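
The RMSE reported above is the square root of the mean squared difference between each true rating `r_ui` and its estimate `est`. As a minimal sketch of that arithmetic, the helper below (a hypothetical function, not part of `surprise`) computes it from plain pairs, using the first five predictions printed earlier as input.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(r_ui - est) ** 2 for r_ui, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# (r_ui, est) values copied from the first five predictions printed earlier
first_five = [
    (7.0, 7.26765709795908),
    (7.0, 7.8288853597758585),
    (10.0, 2.8875214136706004),
    (9.0, 7.303580240753087),
    (9.0, 6.7441063264729415),
]
print(rmse(first_five))
```

Because errors are squared before averaging, a single large miss such as the third pair (true rating 10, estimate about 2.89) dominates the result.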

### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [107]:
user2110_recom = test_pred_df.loc[test_pred_df['userID'] == '2110']

In [108]:
user2110_recom

Unnamed: 0,userID,bookTitle,est_ratings
1284,2110,In the Time of Dinosaurs (Animorphs Megamorphs...,7.828885
2165,2110,Under a War-Torn Sky,7.828885
6827,2110,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",7.828885
8813,2110,Mack Bolan: Dark Truth,7.828885
8869,2110,"The Visitor (Animorphs, No 2)",7.828885
11671,2110,The Lives of John Lennon,7.828885
12393,2110,Han Solo at Stars' End,7.828885
13198,2110,Bradymania!: Everything You Always Wanted to K...,7.828885
14419,2110,Blood Trade (The Executioner #291),7.828885
14670,2110,The Secret of Terror Castle (Three Investigato...,7.828885


### Get the predicted ratings for userID `2110` and sort them in descending order

In [None]:
## Already displayed above: test_pred_df was sorted by est_ratings in descending order before filtering
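
For completeness, the filter-then-sort step can also be written out explicitly. The following is a minimal sketch on a made-up frame. `toy_pred_df` and `toy_user_pred` are hypothetical stand-ins for `test_pred_df` and the per-user result, and the ratings in it are invented.

```python
import pandas as pd

# Hypothetical stand-in for test_pred_df; the values are made up.
toy_pred_df = pd.DataFrame({
    "userID": ["2110", "2110", "98758", "2110"],
    "bookTitle": ["Book A", "Book B", "Book C", "Book D"],
    "est_ratings": [7.8, 9.1, 8.4, 6.5],
})

# Keep one user's rows, then order them from highest to lowest estimate.
toy_user_pred = (
    toy_pred_df.loc[toy_pred_df["userID"] == "2110"]
    .sort_values(by="est_ratings", ascending=False)
    .reset_index(drop=True)
)
print(toy_user_pred)
```

The comparison uses the string `"2110"` because the notebook filters with `'2110'`. If `userID` were stored as an integer, the same comparison would match no rows.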

### Create a dataframe named `user_data` containing the books that userID `2110` explicitly interacted with

In [111]:
user_data= ratings_df.loc[ratings_df['userID'] == '2110']

In [113]:
user_data

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
670337,2110,059035342X,10,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books
670338,2110,0886774829,0,"Stronghold (Dragon Star, Book 1)",Melanie Rawn,1994,Daw Books
670339,2110,0886775426,0,"The Dragon Token (Dragon Star, Book 2)",Melanie Rawn,1994,Daw Books
670340,2110,0886775957,0,"Skybowl (Dragon Star, Book 3)",Melanie Rawn,1994,Daw Books
670341,2110,0590448595,8,Karen's School Trip (Baby-Sitters Little Siste...,Ann M. Martin,1992,Scholastic Paperbacks (Mm)
670342,2110,0671042858,0,The Girl Who Loved Tom Gordon,Stephen King,2000,Pocket
670343,2110,0670820555,0,Spy Catcher: The Candid Autobiography of a Sen...,Peter Wright,1987,Penguin USA
670344,2110,0451137965,9,Thinner,Stephen King,1985,New Amer Library
670345,2110,0590629786,10,"The Visitor (Animorphs, No 2)",K. A. Applegate,1996,Scholastic
670346,2110,0590629794,10,"The Encounter (Animorphs , No 3)",K. A. Applegate,1996,Scholastic


In [118]:
user_data['bookTitle'].unique()

array(["Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))",
       'Stronghold (Dragon Star, Book 1)',
       'The Dragon Token (Dragon Star, Book 2)',
       'Skybowl (Dragon Star, Book 3)',
       "Karen's School Trip (Baby-Sitters Little Sister, 24)",
       'The Girl Who Loved Tom Gordon',
       'Spy Catcher: The Candid Autobiography of a Senior Intelligence Officer',
       'Thinner', 'The Visitor (Animorphs, No 2)',
       'The Encounter (Animorphs , No 3)',
       'The Message (Animorphs , No 4)', 'Return of the Jedi (Star Wars)',
       'The Five People You Meet in Heaven',
       "BODY SWITCHERS FROM OUTER SPACE: R L STINE'S GHOSTS OF FEAR STREET #14 (GHOSTS OF FEAR STREET)",
       '50 Simple Things You Can Do to Save the Earth',
       'The Cat Who Went Up the Creek', 'Sword of Shannara',
       'The Black Unicorn (Magic Kingdom of Landover Novel)',
       'Quantum Leap: The Wall (Quantum Leap)',
       'Quantum Leap: Too Close for Comfort (Quantum Leap)',
  

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [127]:
user_full_info = pd.merge(ratings, users_df, on= ['userID'], how= 'inner')

In [128]:
user_full_info.head(10)

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher,Location,Age
0,6543,0446605484,10,Roses Are Red (Alex Cross Novels),James Patterson,2001,Warner Vision,"strafford, missouri, usa",34
1,6543,0805062971,8,Fight Club,Chuck Palahniuk,1999,Owl Books,"strafford, missouri, usa",34
2,6543,0345342968,8,Fahrenheit 451,RAY BRADBURY,1987,Del Rey,"strafford, missouri, usa",34
3,6543,0446610038,9,1st to Die: A Novel,James Patterson,2002,Warner Vision,"strafford, missouri, usa",34
4,6543,0061009059,8,One for the Money (Stephanie Plum Novels (Pape...,Janet Evanovich,1995,HarperTorch,"strafford, missouri, usa",34
5,6543,0142001740,9,The Secret Life of Bees,Sue Monk Kidd,2003,Penguin Books,"strafford, missouri, usa",34
6,6543,0345436911,8,The Dress Lodger (Ballantine Reader's Circle),Sheri Holman,2001,Ballantine Books,"strafford, missouri, usa",34
7,6543,038548951X,10,Sister of My Heart,Chitra Banerjee Divakaruni,2000,Anchor Pub,"strafford, missouri, usa",34
8,6543,0399149562,6,Short &amp; Tall Tales: Moose County Legends C...,Lilian Jackson Braun,2002,Putnam Publishing Group,"strafford, missouri, usa",34
9,6543,0553574566,8,A Monstrous Regiment of Women,LAURIE R. KING,1996,Bantam,"strafford, missouri, usa",34
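
The cell above joins `ratings` with `users_df` on `userID`, which is why the output starts with a user other than `2110`. As a sketch of the combination the heading describes, one user's ratings could instead be joined to book metadata on `ISBN`. The frames below are made up, and `toy_user_data`, `toy_book_data` and `toy_user_full_info` are hypothetical names, not variables from this notebook.

```python
import pandas as pd

# Hypothetical ratings for a single user (values are made up).
toy_user_data = pd.DataFrame({
    "userID": ["2110", "2110"],
    "ISBN": ["111", "222"],
    "bookRating": [10, 0],
})

# Hypothetical book metadata keyed by ISBN.
toy_book_data = pd.DataFrame({
    "ISBN": ["111", "222", "333"],
    "bookTitle": ["Book A", "Book B", "Book C"],
    "bookAuthor": ["Author A", "Author B", "Author C"],
})

# An inner join keeps only the books this user has rated.
toy_user_full_info = pd.merge(toy_user_data, toy_book_data, on="ISBN", how="inner")
print(toy_user_full_info)
```

Naming the key with `on="ISBN"` makes the join explicit, which helps when the two frames share more than one column name.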


### Get the top 10 recommendations for the userID given above from the books not already rated by that user

In [151]:
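# Books whose ISBN never appears in `ratings`, joined to the predictions
# on the shared `bookTitle` column (the default when `on` is omitted).
recom = (
    books_df[~books_df['ISBN'].isin(ratings['ISBN'])]
    .merge(test_pred_df.reset_index(), how='inner')
)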

In [157]:
recommen = recom.sort_values(by=['est_ratings'], ascending = False)

In [159]:
recommen.head(10)

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,index,userID,est_ratings
6412,1570429669,The Rescue,Nicholas Sparks,2000,Time Warner Audio Major,16955,236283,10.0
9612,B00008WFXL,The Da Vinci Code,Dan Brown,0,Doubleday,11472,98391,10.0
9628,0739313126,The Da Vinci Code,DAN BROWN,2003,Random House Audio,11472,98391,10.0
9631,0739313126,The Da Vinci Code,DAN BROWN,2003,Random House Audio,5224,70594,10.0
3736,0749336021,The Joy Luck Club,Amy Tan,1994,Minerva,18013,11676,10.0
8542,0505521474,The Night Before Christmas,Victoria Alexander,1996,Leisure Books,3705,104399,10.0
8539,0786806087,The Night Before Christmas,William Wegman,2000,Hyperion Books for Children,3705,104399,10.0
8536,0689840535,The Night Before Christmas,Clement Clarke Moore,2001,Atheneum/Anne Schwartz Books,3705,104399,10.0
8533,0694004243,The Night Before Christmas,Clement C. Moore,1992,HarperCollins Publishers,3705,104399,10.0
8530,0843134453,The Night Before Christmas,Kees Moerbeek,1992,Price Stern Sloan,3705,104399,10.0


In [161]:
recommendations = recommen.iloc[:10,:-1]

In [162]:
recommendations

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,index,userID
6412,1570429669,The Rescue,Nicholas Sparks,2000,Time Warner Audio Major,16955,236283
9612,B00008WFXL,The Da Vinci Code,Dan Brown,0,Doubleday,11472,98391
9628,0739313126,The Da Vinci Code,DAN BROWN,2003,Random House Audio,11472,98391
9631,0739313126,The Da Vinci Code,DAN BROWN,2003,Random House Audio,5224,70594
3736,0749336021,The Joy Luck Club,Amy Tan,1994,Minerva,18013,11676
8542,0505521474,The Night Before Christmas,Victoria Alexander,1996,Leisure Books,3705,104399
8539,0786806087,The Night Before Christmas,William Wegman,2000,Hyperion Books for Children,3705,104399
8536,0689840535,The Night Before Christmas,Clement Clarke Moore,2001,Atheneum/Anne Schwartz Books,3705,104399
8533,0694004243,The Night Before Christmas,Clement C. Moore,1992,HarperCollins Publishers,3705,104399
8530,0843134453,The Night Before Christmas,Kees Moerbeek,1992,Price Stern Sloan,3705,104399
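
The table above mixes several userIDs because the filter is applied against every ISBN in `ratings` rather than against one user's history. A per-user version could look like the sketch below, which assumes a predictions frame with `userID`, `bookTitle` and `est_ratings` columns and a ratings frame with `userID` and `bookTitle`. `top_n_for_user` and the toy frames are hypothetical, and the values are made up.

```python
import pandas as pd


def top_n_for_user(pred_df, rated_df, user_id, n=10):
    """Return the n highest-estimated titles for user_id, skipping rated titles."""
    # Titles this user has already rated.
    seen = set(rated_df.loc[rated_df["userID"] == user_id, "bookTitle"])
    # Predictions for this user only, minus the titles already rated.
    candidates = pred_df.loc[
        (pred_df["userID"] == user_id) & ~pred_df["bookTitle"].isin(seen)
    ]
    return (
        candidates.drop_duplicates(subset="bookTitle")
        .sort_values(by="est_ratings", ascending=False)
        .head(n)
        .reset_index(drop=True)
    )


# Hypothetical predictions and rating history.
toy_pred = pd.DataFrame({
    "userID": ["2110", "2110", "2110", "98391"],
    "bookTitle": ["Book A", "Book B", "Book C", "Book D"],
    "est_ratings": [9.0, 7.5, 8.2, 10.0],
})
toy_rated = pd.DataFrame({
    "userID": ["2110", "98391"],
    "bookTitle": ["Book A", "Book B"],
})

toy_top = top_n_for_user(toy_pred, toy_rated, "2110", n=2)
print(toy_top)
```

Dropping duplicate titles before taking the head avoids the repeated rows seen above, where one title matches several editions.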
