**About Book Crossing Dataset**<br>

This dataset has been compiled by Cai-Nicolas Ziegler in 2004, and it comprises of three tables for users, books and ratings. Explicit ratings are expressed on a scale from 1-10 (higher values denoting higher appreciation) and implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [0]:
import pandas as pd
import numpy as np

In [4]:
from google.colab import drive
drive.mount('/content/drive') 

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [5]:
#Loading data
books = pd.read_csv("/content/drive/My Drive/GreatlakesAIML/books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('/content/drive/My Drive/GreatlakesAIML/users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('/content/drive/My Drive/GreatlakesAIML/ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
  interactivity=interactivity, compiler=compiler, result=result)


### Check no.of records and features given in each dataset

In [6]:
!pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise (from surprise)
[?25l  Downloading https://files.pythonhosted.org/packages/f5/da/b5700d96495fb4f092be497f02492768a3d96a3f4fa2ae7dea46d4081cfa/scikit-surprise-1.1.0.tar.gz (6.4MB)
[K     |████████████████████████████████| 6.5MB 2.8MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... [?25l[?25hdone
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.0-cp36-cp36m-linux_x86_64.whl size=1678047 sha256=9862ca43ebbd6d1f17a9fcdb8c334eddd731e12f5d024695f1c028fd17e095dc
  Stored in directory: /root/.cache/pip/wheels/cc/fa/8c/16c93fccce688ae1bde7d979ff102f7bee980d9cfeb8641bcf
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.0 surprise-0

In [0]:
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

In [8]:
books.head(10)

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
5,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...
6,0425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.0...,http://images.amazon.com/images/P/0425176428.0...,http://images.amazon.com/images/P/0425176428.0...
7,0671870432,PLEADING GUILTY,Scott Turow,1993,Audioworks,http://images.amazon.com/images/P/0671870432.0...,http://images.amazon.com/images/P/0671870432.0...,http://images.amazon.com/images/P/0671870432.0...
8,0679425608,Under the Black Flag: The Romance and the Real...,David Cordingly,1996,Random House,http://images.amazon.com/images/P/0679425608.0...,http://images.amazon.com/images/P/0679425608.0...,http://images.amazon.com/images/P/0679425608.0...
9,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner,http://images.amazon.com/images/P/074322678X.0...,http://images.amazon.com/images/P/074322678X.0...,http://images.amazon.com/images/P/074322678X.0...


In [9]:
books.describe().transpose() # one record missing in bookauthor and 2 records missing in publisher

Unnamed: 0,count,unique,top,freq
ISBN,271360,271360,0486270602,1
bookTitle,271360,242135,Selected Poems,27
bookAuthor,271359,102023,Agatha Christie,632
yearOfPublication,271360,202,2002,13903
publisher,271358,16807,Harlequin,7535
imageUrlS,271360,271044,http://images.amazon.com/images/P/044990587X.0...,2
imageUrlM,271360,271044,http://images.amazon.com/images/P/042513976X.0...,2
imageUrlL,271357,271041,http://images.amazon.com/images/P/055326981X.0...,2


In [10]:
users.describe().transpose() # we have outliars like age 244 

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
userID,278858.0,139429.5,80499.51502,1.0,69715.25,139429.5,209143.75,278858.0
Age,168096.0,34.751434,14.428097,0.0,24.0,32.0,44.0,244.0


In [11]:
users.columns

Index(['userID', 'Location', 'Age'], dtype='object')

In [12]:
ratings.columns

Index(['userID', 'ISBN', 'bookRating'], dtype='object')

In [13]:
ratings.describe().transpose() # Maximum users have given 7 & above rating. 

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
userID,1149780.0,140386.395126,80562.277718,2.0,70345.0,141010.0,211028.0,278854.0
bookRating,1149780.0,2.86695,3.854184,0.0,0.0,0.0,7.0,10.0


## Exploring books dataset

In [14]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [0]:
books_d=books.drop(['imageUrlS','imageUrlM','imageUrlL'],axis=1)

In [16]:
books_d

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company
5,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group
6,0425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000,Berkley Publishing Group
7,0671870432,PLEADING GUILTY,Scott Turow,1993,Audioworks
8,0679425608,Under the Black Flag: The Romance and the Real...,David Cordingly,1996,Random House
9,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner


In [17]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


**yearOfPublication**

### Check unique values of yearOfPublication


In [18]:
books_d['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As it can be seen from above that there are some incorrect entries in this field. It looks like Publisher names 'DK Publishing Inc' and 'Gallimard' have been incorrectly loaded as yearOfPublication in dataset due to some errors in csv file.


Also some of the entries are strings and same years have been entered as numbers in some places. We will try to fix these things in the coming questions.

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [19]:
books_d[books_d['yearOfPublication'].str.contains('DK Publishing Inc',na = False)]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


In [20]:
books_d[books_d['yearOfPublication'].str.contains('Gallimard',na = False)]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [0]:
books_dp=books_d[~books_d['yearOfPublication'].isin(['DK Publishing Inc','Gallimard'])]

### Change the datatype of yearOfPublication to 'int'

In [22]:
books_dp ['yearOfPublication']=books_dp.yearOfPublication.astype('int')
books_dp.dtypes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

In [23]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
imageUrlS            object
imageUrlM            object
imageUrlL            object
dtype: object

### Drop NaNs in `'publisher'` column


In [0]:
books_dpc=books_dp.dropna(subset=['publisher'])

## Exploring Users dataset

In [25]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [26]:
users_a=users['Age'].unique()
users_a.sort()
users_a

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 113., 114., 115., 116., 118., 119., 123., 124., 127.,
       128., 132., 133., 136., 137., 138., 140., 141., 143., 146., 147.,
       148., 151., 152., 156., 157., 159., 162., 168., 172., 175., 183.,
       186., 189., 199., 200., 201., 204., 207., 20

Age column has some invalid entries like nan, 0 and very high values like 100 and above

### Values below 5 and above 90 do not make much sense for our book rating case...hence replace these by NaNs

In [0]:
users = pd.read_csv('/content/drive/My Drive/GreatlakesAIML/users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

In [28]:
users.head()

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [29]:
users_copy= users.copy()
users.copy()

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",
5,6,"santa monica, california, usa",61.0
6,7,"washington, dc, usa",
7,8,"timmins, ontario, canada",
8,9,"germantown, tennessee, usa",
9,10,"albacete, wisconsin, spain",26.0


In [30]:
a = np.array(users['Age'].values.tolist())
print (a)
users['Age'] = np.where(a < 5, 0 , a).tolist()
print (users)



[nan 18. nan ... nan nan nan]
        userID                                          Location   Age
0            1                                nyc, new york, usa   NaN
1            2                         stockton, california, usa  18.0
2            3                   moscow, yukon territory, russia   NaN
3            4                         porto, v.n.gaia, portugal  17.0
4            5                farnborough, hants, united kingdom   NaN
5            6                     santa monica, california, usa  61.0
6            7                               washington, dc, usa   NaN
7            8                          timmins, ontario, canada   NaN
8            9                        germantown, tennessee, usa   NaN
9           10                        albacete, wisconsin, spain  26.0
10          11                    melbourne, victoria, australia  14.0
11          12                       fort bragg, california, usa   NaN
12          13                       barcelona,

  This is separate from the ipykernel package so we can avoid doing imports until


In [31]:
users_a=users['Age'].unique()
users_a.sort()
users_a

array([  0.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,  14.,
        15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,  23.,  24.,  25.,
        26.,  27.,  28.,  29.,  30.,  31.,  32.,  33.,  34.,  35.,  36.,
        37.,  38.,  39.,  40.,  41.,  42.,  43.,  44.,  45.,  46.,  47.,
        48.,  49.,  50.,  51.,  52.,  53.,  54.,  55.,  56.,  57.,  58.,
        59.,  60.,  61.,  62.,  63.,  64.,  65.,  66.,  67.,  68.,  69.,
        70.,  71.,  72.,  73.,  74.,  75.,  76.,  77.,  78.,  79.,  80.,
        81.,  82.,  83.,  84.,  85.,  86.,  87.,  88.,  89.,  90.,  91.,
        92.,  93.,  94.,  95.,  96.,  97.,  98.,  99., 100., 101., 102.,
       103., 104., 105., 106., 107., 108., 109., 110., 111., 113., 114.,
       115., 116., 118., 119., 123., 124., 127., 128., 132., 133., 136.,
       137., 138., 140., 141., 143., 146., 147., 148., 151., 152., 156.,
       157., 159., 162., 168., 172., 175., 183., 186., 189., 199., 200.,
       201., 204., 207., 208., 209., 210., 212., 21

In [32]:
a = np.array(users['Age'].values.tolist())
print (a)
users['Age'] = np.where(a > 90, 0 , a).tolist()
print (users)

[nan 18. nan ... nan nan nan]
        userID                                          Location   Age
0            1                                nyc, new york, usa   NaN
1            2                         stockton, california, usa  18.0
2            3                   moscow, yukon territory, russia   NaN
3            4                         porto, v.n.gaia, portugal  17.0
4            5                farnborough, hants, united kingdom   NaN
5            6                     santa monica, california, usa  61.0
6            7                               washington, dc, usa   NaN
7            8                          timmins, ontario, canada   NaN
8            9                        germantown, tennessee, usa   NaN
9           10                        albacete, wisconsin, spain  26.0
10          11                    melbourne, victoria, australia  14.0
11          12                       fort bragg, california, usa   NaN
12          13                       barcelona,

  This is separate from the ipykernel package so we can avoid doing imports until


In [33]:
users_a=users['Age'].unique()
users_a.sort()
users_a

array([ 0.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16.,
       17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29.,
       30., 31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42.,
       43., 44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55.,
       56., 57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68.,
       69., 70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81.,
       82., 83., 84., 85., 86., 87., 88., 89., 90., nan])

In [34]:
users['Age']= users['Age'].replace(0, np.nan) # replacing 0 with NAN
users

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",
5,6,"santa monica, california, usa",61.0
6,7,"washington, dc, usa",
7,8,"timmins, ontario, canada",
8,9,"germantown, tennessee, usa",
9,10,"albacete, wisconsin, spain",26.0


In [35]:
users_a=users['Age'].unique()
users_a.sort()
users_a

array([ 5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17.,
       18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30.,
       31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43.,
       44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55., 56.,
       57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68., 69.,
       70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81., 82.,
       83., 84., 85., 86., 87., 88., 89., 90., nan])

### Replace null values in column `Age` with mean

In [36]:
users['Age'] = users['Age'].fillna(users['Age'].mean())
users

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",34.72384
1,2,"stockton, california, usa",18.00000
2,3,"moscow, yukon territory, russia",34.72384
3,4,"porto, v.n.gaia, portugal",17.00000
4,5,"farnborough, hants, united kingdom",34.72384
5,6,"santa monica, california, usa",61.00000
6,7,"washington, dc, usa",34.72384
7,8,"timmins, ontario, canada",34.72384
8,9,"germantown, tennessee, usa",34.72384
9,10,"albacete, wisconsin, spain",26.00000


### Change the datatype of `Age` to `int`

In [0]:
users ['Age']=users.Age.astype('int')

In [38]:
users.dtypes

userID       int64
Location    object
Age          int64
dtype: object

In [39]:
users.dtypes

userID       int64
Location    object
Age          int64
dtype: object

In [40]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


In [41]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


## Exploring the Ratings Dataset

### check the shape

In [42]:
ratings.shape

(1149780, 3)

In [0]:
n_users = users.shape[0]
n_books = books.shape[0]

In [44]:
n_users

278858

In [45]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [46]:
ratings.dtypes

userID         int64
ISBN          object
bookRating     int64
dtype: object

### Ratings dataset should have books only which exist in our books dataset. Drop the remaining rows

---



In [47]:
book_ratings=ratings[ratings.ISBN.isin(books.ISBN)]
book_ratings.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Ratings dataset should have ratings from users which exist in users dataset. Drop the remaining rows

In [48]:
book_ratings=book_ratings[book_ratings.userID.isin(users.userID)]
book_ratings.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Consider only ratings from 1-10 and leave 0s in column `bookRating`

In [49]:
ratings_10=book_ratings[book_ratings.bookRating != 0]
ratings_10.head()

Unnamed: 0,userID,ISBN,bookRating
1,276726,0155061224,5
3,276729,052165615X,3
4,276729,0521795028,6
8,276744,038550120X,7
16,276747,0060517794,9


### Find out which rating has been given highest number of times

In [50]:
ratings_10.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
userID,383842.0,136031.46126,80482.299401,8.0,67591.0,133789.0,206219.0,278854.0
bookRating,383842.0,7.626701,1.841339,1.0,7.0,8.0,9.0,10.0


In [51]:
ratings_10.groupby('bookRating')['userID'].count().sort_values(ascending=False).head()  

bookRating
8     91804
10    71225
7     66402
9     60778
5     45355
Name: userID, dtype: int64

In [52]:
ratings_10.groupby('bookRating')['ISBN'].count().sort_values(ascending=False).head()  # 8 has been rated highest number of times

bookRating
8     91804
10    71225
7     66402
9     60778
5     45355
Name: ISBN, dtype: int64

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results only consider users who have rated atleast 100 books

In [53]:
ratings_11= ratings_10.groupby('userID')['bookRating'].count().sort_values(ascending=False).head()  
ratings_11

userID
11676     6943
98391     5691
189835    1899
153662    1845
23902     1180
Name: bookRating, dtype: int64

In [54]:
user_counts= ratings_10['userID'].value_counts()
user_counts.head()

11676     6943
98391     5691
189835    1899
153662    1845
23902     1180
Name: userID, dtype: int64

In [55]:
ratings_10=ratings_10[ratings_10['userID'].isin(user_counts[user_counts >=100].index)]
ratings_10.head()

Unnamed: 0,userID,ISBN,bookRating
1456,277427,002542730X,10
1458,277427,003008685X,8
1461,277427,0060006641,10
1465,277427,0060542128,7
1474,277427,0061009059,9


### Generating ratings matrix from explicit ratings


In [56]:
ratings_matrix = ratings_10.pivot(index = 'userID', columns='ISBN',values='bookRating')
userID = ratings_matrix.index
ISBN = ratings_matrix.columns
print(ratings_matrix.shape)

(449, 66574)


In [57]:
ratings_matrix.head()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,0001056107,0001845039,0001935968,0001944711,0001952803,0001953877,0002000547,0002005018,0002005050,0002005557,0002006588,0002115328,0002116286,0002118580,0002154900,0002158973,0002163713,0002176181,0002176432,0002179695,0002181924,0002184974,0002190915,0002197154,0002223929,0002228394,000223257X,0002233509,0002239183,0002240114,...,987960170X,9974643058,999058284X,9992003766,9992059958,9993584185,9994256963,9994348337,9997405137,9997406567,9997406990,999740923X,9997409728,9997411757,9997411870,9997412044,9997412958,9997507002,999750805X,9997508769,9997512952,9997519086,9997555635,9998914140,B00001U0CP,B00005TZWI,B00006CRTE,B00006I4OX,B00007FYKW,B00008RWPV,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
2033,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2110,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2276,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4017,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4385,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [0]:
#ratings_10 = ratings_10['bookRating'].fillna(0, inplace=True)

In [59]:
ratings_10.isna().sum()

userID        0
ISBN          0
bookRating    0
dtype: int64

In [60]:
ratings_10

Unnamed: 0,userID,ISBN,bookRating
1456,277427,002542730X,10
1458,277427,003008685X,8
1461,277427,0060006641,10
1465,277427,0060542128,7
1474,277427,0061009059,9
1477,277427,0062507109,8
1483,277427,0132220598,8
1488,277427,0140283374,6
1490,277427,014039026X,8
1491,277427,0140390715,7


### Generate the predicted ratings using SVD with no.of singular values to be 50

---



In [0]:
from collections import defaultdict
from surprise import SVD
from surprise import Dataset
from surprise import Reader

In [62]:
reader = Reader(rating_scale=(0, 9))
data = Dataset.load_from_df(ratings_10[['userID', 'ISBN', 'bookRating']], reader)
data

<surprise.dataset.DatasetAutoFolds at 0x7f8b9bdf2668>

In [0]:
trainset = data.build_full_trainset()

In [64]:
trainset

<surprise.trainset.Trainset at 0x7f8bcf6c90b8>

In [65]:
trainset.ur

Output hidden; open in https://colab.research.google.com to view.

In [66]:
algo = SVD(n_epochs=50)
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f8b9be486d8>

In [0]:
testset = trainset.build_anti_testset()

In [68]:
testset

[(277427, '0006542808', 7.825420495589275),
 (277427, '0060392185', 7.825420495589275),
 (277427, '0140367209', 7.825420495589275),
 (277427, '0140546499', 7.825420495589275),
 (277427, '0192816071', 7.825420495589275),
 (277427, '0307022196', 7.825420495589275),
 (277427, '0310912520', 7.825420495589275),
 (277427, '0312850131', 7.825420495589275),
 (277427, '0312860862', 7.825420495589275),
 (277427, '0312923651', 7.825420495589275),
 (277427, '0312970234', 7.825420495589275),
 (277427, '0321096983', 7.825420495589275),
 (277427, '0345300742', 7.825420495589275),
 (277427, '0345318617', 7.825420495589275),
 (277427, '0345334590', 7.825420495589275),
 (277427, '0345352459', 7.825420495589275),
 (277427, '0345369335', 7.825420495589275),
 (277427, '0345431464', 7.825420495589275),
 (277427, '037302973X', 7.825420495589275),
 (277427, '0373071388', 7.825420495589275),
 (277427, '0373074646', 7.825420495589275),
 (277427, '0373106939', 7.825420495589275),
 (277427, '0373110138', 7.825420

In [0]:
predictions = algo.test(testset)

In [70]:
predictions

[Prediction(uid=277427, iid='0006542808', r_ui=7.825420495589275, est=7.891936517071817, details={'was_impossible': False}),
 Prediction(uid=277427, iid='0060392185', r_ui=7.825420495589275, est=8.543707861996333, details={'was_impossible': False}),
 Prediction(uid=277427, iid='0140367209', r_ui=7.825420495589275, est=8.937108944352971, details={'was_impossible': False}),
 Prediction(uid=277427, iid='0140546499', r_ui=7.825420495589275, est=8.809214480500406, details={'was_impossible': False}),
 Prediction(uid=277427, iid='0192816071', r_ui=7.825420495589275, est=8.602393697469019, details={'was_impossible': False}),
 Prediction(uid=277427, iid='0307022196', r_ui=7.825420495589275, est=8.324573020830321, details={'was_impossible': False}),
 Prediction(uid=277427, iid='0310912520', r_ui=7.825420495589275, est=9, details={'was_impossible': False}),
 Prediction(uid=277427, iid='0312850131', r_ui=7.825420495589275, est=8.081282984814559, details={'was_impossible': False}),
 Prediction(uid=

In [0]:
def get_top_n(predictions, n=10):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, bookRating in top_n.items():
        bookRating.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = bookRating[:n]

    return top_n

In [0]:
top_n = get_top_n(predictions, n=10)

In [73]:
top_n

defaultdict(list,
            {2033: [('002542730X', 9),
              ('0062507109', 9),
              ('0310435706', 9),
              ('0385486804', 9),
              ('0385504209', 9),
              ('0399501487', 9),
              ('0517573636', 9),
              ('0671037692', 9),
              ('0671876821', 9),
              ('0679417648', 9)],
             2110: [('002542730X', 9),
              ('0441627404', 9),
              ('0553351672', 9),
              ('0553571656', 9),
              ('0679417648', 9),
              ('0786866845', 9),
              ('0897330536', 9),
              ('0898157803', 9),
              ('1572302399', 9),
              ('0060392185', 9)],
             2276: [('0060006641', 9),
              ('0201000822', 9),
              ('0375751513', 9),
              ('0425047962', 9),
              ('0440236738', 9),
              ('0446600415', 9),
              ('0451191153', 9),
              ('0553571656', 9),
              ('0553573926', 9),
     

### Take a particular user_id

### Lets find the recommendations for user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [0]:
userID = 2110

In [0]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

### Get the predicted ratings for userID `2110` and sort them in descending order

In [0]:
uid = str(2110)

In [0]:
def get_top_n(predictions, n=10):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, bookRating in top_n.items():
        bookRating.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = bookRating[:n]

    return top_n

In [0]:
top_n = get_top_n(predictions, n=10)

In [79]:
top_n

defaultdict(list,
            {2033: [('002542730X', 9),
              ('0062507109', 9),
              ('0310435706', 9),
              ('0385486804', 9),
              ('0385504209', 9),
              ('0399501487', 9),
              ('0517573636', 9),
              ('0671037692', 9),
              ('0671876821', 9),
              ('0679417648', 9)],
             2110: [('002542730X', 9),
              ('0441627404', 9),
              ('0553351672', 9),
              ('0553571656', 9),
              ('0679417648', 9),
              ('0786866845', 9),
              ('0897330536', 9),
              ('0898157803', 9),
              ('1572302399', 9),
              ('0060392185', 9)],
             2276: [('0060006641', 9),
              ('0201000822', 9),
              ('0375751513', 9),
              ('0425047962', 9),
              ('0440236738', 9),
              ('0446600415', 9),
              ('0451191153', 9),
              ('0553571656', 9),
              ('0553573926', 9),
     

In [80]:
# get RMSE
print("User-based Model : Test Set")
accuracy.rmse(predictions, verbose=True)

User-based Model : Test Set
RMSE: 0.9251


0.925114106477752

### Create a dataframe with name `user_data` containing userID `2110` explicitly interacted books

In [0]:
user_data = ratings_10[ratings_10['userID'] == 2110] 



In [82]:
user_data.head()

Unnamed: 0,userID,ISBN,bookRating
14448,2110,60987529,7
14449,2110,64472779,8
14450,2110,140022651,10
14452,2110,142302163,8
14453,2110,151008116,5


In [83]:
user_data.shape

(103, 3)

### Combine the user_data and and corresponding book data(`book_data`) in a single dataframe with name `user_full_info`

In [84]:
user_full_info = pd.merge(user_data, books, on='ISBN')
user_full_info.head()

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,2110,60987529,7,Confessions of an Ugly Stepsister : A Novel,Gregory Maguire,2000,Regan Books,http://images.amazon.com/images/P/0060987529.0...,http://images.amazon.com/images/P/0060987529.0...,http://images.amazon.com/images/P/0060987529.0...
1,2110,64472779,8,All-American Girl,Meg Cabot,2003,HarperTrophy,http://images.amazon.com/images/P/0064472779.0...,http://images.amazon.com/images/P/0064472779.0...,http://images.amazon.com/images/P/0064472779.0...
2,2110,140022651,10,Journey to the Center of the Earth,Jules Verne,1965,Penguin Books,http://images.amazon.com/images/P/0140022651.0...,http://images.amazon.com/images/P/0140022651.0...,http://images.amazon.com/images/P/0140022651.0...
3,2110,142302163,8,The Ghost Sitter,Peni R. Griffin,2002,Puffin Books,http://images.amazon.com/images/P/0142302163.0...,http://images.amazon.com/images/P/0142302163.0...,http://images.amazon.com/images/P/0142302163.0...
4,2110,151008116,5,Life of Pi,Yann Martel,2002,Harcourt,http://images.amazon.com/images/P/0151008116.0...,http://images.amazon.com/images/P/0151008116.0...,http://images.amazon.com/images/P/0151008116.0...


In [0]:
book_data.shape

In [0]:
book_data.head()

In [0]:
user_full_info.head()

### Get top 10 recommendations for above given userID from the books not already rated by that user

---

