**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and implicit ratings are expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [2]:
import os
import numpy as np  
import pandas as pd

In [3]:
#Loading data
books = pd.read_csv("books/books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('books/users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('books/ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
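
In pandas 1.3 and later, `error_bad_lines` is deprecated (and it was removed in pandas 2.0) in favour of `on_bad_lines`. The following is a minimal sketch of the newer option, assuming a small in-memory sample rather than the real `books/books.csv`; the sample rows and the name `sample_books` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample; the second data row has one field too many.
sample = io.StringIO(
    "ISBN;bookTitle;bookAuthor\n"
    "0195153448;Classical Mythology;Mark P. O. Morford\n"
    "0000000000;Broken; row;Someone\n"
    "0002005018;Clara Callan;Richard Bruce Wright\n"
)

# on_bad_lines="skip" drops rows with too many fields (pandas >= 1.3).
# dtype=str keeps ISBNs as text so leading zeros are preserved.
sample_books = pd.read_csv(sample, sep=";", on_bad_lines="skip", dtype=str)
print(sample_books.shape)
```

Reading ISBNs as strings is a design choice here: ISBNs are identifiers, not numbers, and some end in `X`.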


### Check the number of records and features in each dataset

In [4]:
print (books.shape)
print (books.info())

(271360, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB
None


In [5]:
print (users.info())
print (users.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB
None
(278858, 3)


In [6]:
print (ratings.info())
print (ratings.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB
None
(1149780, 3)


## Exploring books dataset

In [7]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [8]:
# drop the three image URL columns
books.drop(columns = ['imageUrlS','imageUrlM','imageUrlL'], inplace = True)

In [9]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [10]:
books.yearOfPublication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As the output above shows, this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of malformed rows in the CSV file.


Some entries are also strings, while the same years appear elsewhere as integers. Both problems are fixed in the following steps.
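
One way to locate such misloaded rows without knowing the offending values in advance is to coerce the column to numbers and inspect what fails to parse. This is a minimal sketch on a made-up sample; `years`, `years_num` and `bad_rows` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample mixing ints, numeric strings and a misplaced publisher name.
years = pd.Series([2002, "1999", "0", "DK Publishing Inc", 2050])

# errors="coerce" turns anything non-numeric into NaN instead of raising.
years_num = pd.to_numeric(years, errors="coerce")

# Entries that failed to parse are the misloaded ones.
bad_rows = years[years_num.isna()]
print(bad_rows.tolist())
```

The same coercion also unifies `'1999'` and `1999` into one numeric value.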

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [11]:
books.loc[books.yearOfPublication =='DK Publishing Inc',:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [12]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
dtypes: object(5)
memory usage: 10.4+ MB


In [13]:
# Filter out rows where a publisher name was loaded into yearOfPublication
df_books = books[books.yearOfPublication != 'DK Publishing Inc']
new_books = df_books[df_books.yearOfPublication != 'Gallimard']

In [14]:
new_books.yearOfPublication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

In [15]:
df_books.shape

(271358, 5)

### Change the datatype of yearOfPublication to 'int'

In [16]:
new_books= new_books.astype({"yearOfPublication": int})

In [17]:
new_books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [18]:
new_books.dropna(subset=['publisher'],inplace=True)

In [19]:
new_books.publisher.unique()

array(['Oxford University Press', 'HarperFlamingo Canada',
       'HarperPerennial', ..., 'Tempo', 'Life Works Books', 'Connaught'],
      dtype=object)

## Exploring Users dataset

In [20]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [21]:
users.Age.unique()

array([ nan,  18.,  17.,  61.,  26.,  14.,  25.,  19.,  46.,  55.,  32.,
        24.,  20.,  34.,  23.,  51.,  31.,  21.,  44.,  30.,  57.,  43.,
        37.,  41.,  54.,  42.,  50.,  39.,  53.,  47.,  36.,  28.,  35.,
        13.,  58.,  49.,  38.,  45.,  62.,  63.,  27.,  33.,  29.,  66.,
        40.,  15.,  60.,   0.,  79.,  22.,  16.,  65.,  59.,  48.,  72.,
        56.,  67.,   1.,  80.,  52.,  69.,  71.,  73.,  78.,   9.,  64.,
       103., 104.,  12.,  74.,  75., 231.,   3.,  76.,  83.,  68., 119.,
        11.,  77.,   2.,  70.,  93.,   8.,   7.,   4.,  81., 114., 230.,
       239.,  10.,   5., 148., 151.,   6., 101., 201.,  96.,  84.,  82.,
        90., 123., 244., 133.,  91., 128.,  94.,  85., 141., 110.,  97.,
       219.,  86., 124.,  92., 175., 172., 209., 212., 237.,  87., 162.,
       100., 156., 136.,  95.,  89., 106.,  99., 108., 210.,  88., 199.,
       147., 168., 132., 159., 186., 152., 102., 116., 200., 115., 226.,
       137., 207., 229., 138., 109., 105., 228., 18

In [22]:
sorted(users.Age.unique())

[nan,
 0.0,
 1.0,
 2.0,
 3.0,
 4.0,
 5.0,
 6.0,
 7.0,
 8.0,
 9.0,
 10.0,
 11.0,
 12.0,
 13.0,
 14.0,
 15.0,
 16.0,
 17.0,
 18.0,
 19.0,
 20.0,
 21.0,
 22.0,
 23.0,
 24.0,
 25.0,
 26.0,
 27.0,
 28.0,
 29.0,
 30.0,
 31.0,
 32.0,
 33.0,
 34.0,
 35.0,
 36.0,
 37.0,
 38.0,
 39.0,
 40.0,
 41.0,
 42.0,
 43.0,
 44.0,
 45.0,
 46.0,
 47.0,
 48.0,
 49.0,
 50.0,
 51.0,
 52.0,
 53.0,
 54.0,
 55.0,
 56.0,
 57.0,
 58.0,
 59.0,
 60.0,
 61.0,
 62.0,
 63.0,
 64.0,
 65.0,
 66.0,
 67.0,
 68.0,
 69.0,
 70.0,
 71.0,
 72.0,
 73.0,
 74.0,
 75.0,
 76.0,
 77.0,
 78.0,
 79.0,
 80.0,
 81.0,
 82.0,
 83.0,
 84.0,
 85.0,
 86.0,
 87.0,
 88.0,
 89.0,
 90.0,
 91.0,
 92.0,
 93.0,
 94.0,
 95.0,
 96.0,
 97.0,
 98.0,
 99.0,
 100.0,
 101.0,
 102.0,
 103.0,
 104.0,
 105.0,
 106.0,
 107.0,
 108.0,
 109.0,
 110.0,
 111.0,
 113.0,
 114.0,
 115.0,
 116.0,
 118.0,
 119.0,
 123.0,
 124.0,
 127.0,
 128.0,
 132.0,
 133.0,
 136.0,
 137.0,
 138.0,
 140.0,
 141.0,
 143.0,
 146.0,
 147.0,
 148.0,
 151.0,
 152.0,
 156.0,
 157.0,
 159.0,


The Age column has some invalid entries, such as NaN, 0 and implausibly high values above 100.

### Values below 5 and above 100 do not make much sense for book ratings, so replace them with NaN

In [23]:
users.loc[(users.Age<5) | (users.Age>100), 'Age'] = np.nan
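
An equivalent way to express the same rule is to keep only ages inside the accepted range and let everything else become NaN. The sketch below runs on a small made-up series; `ages` and `ages_clean` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including missing and implausible values.
ages = pd.Series([np.nan, 18.0, 0.0, 4.0, 5.0, 100.0, 101.0, 244.0])

# between() is inclusive on both ends; where() keeps matching values
# and replaces the rest (including existing NaNs) with NaN.
ages_clean = ages.where(ages.between(5, 100))
print(ages_clean.tolist())
```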

In [24]:
sorted(users.Age.unique())

[nan,
 5.0,
 6.0,
 7.0,
 8.0,
 9.0,
 10.0,
 11.0,
 12.0,
 13.0,
 14.0,
 15.0,
 16.0,
 17.0,
 18.0,
 19.0,
 20.0,
 21.0,
 22.0,
 23.0,
 24.0,
 25.0,
 26.0,
 27.0,
 28.0,
 29.0,
 30.0,
 31.0,
 32.0,
 33.0,
 34.0,
 35.0,
 36.0,
 37.0,
 38.0,
 39.0,
 40.0,
 41.0,
 42.0,
 43.0,
 44.0,
 45.0,
 46.0,
 47.0,
 48.0,
 49.0,
 50.0,
 51.0,
 52.0,
 53.0,
 54.0,
 55.0,
 56.0,
 57.0,
 58.0,
 59.0,
 60.0,
 61.0,
 62.0,
 63.0,
 64.0,
 65.0,
 66.0,
 67.0,
 68.0,
 69.0,
 70.0,
 71.0,
 72.0,
 73.0,
 74.0,
 75.0,
 76.0,
 77.0,
 78.0,
 79.0,
 80.0,
 81.0,
 82.0,
 83.0,
 84.0,
 85.0,
 86.0,
 87.0,
 88.0,
 89.0,
 90.0,
 91.0,
 92.0,
 93.0,
 94.0,
 95.0,
 96.0,
 97.0,
 98.0,
 99.0,
 100.0]

### Replace null values in column `Age` with mean

In [25]:
users.Age


0          NaN
1         18.0
2          NaN
3         17.0
4          NaN
5         61.0
6          NaN
7          NaN
8          NaN
9         26.0
10        14.0
11         NaN
12        26.0
13         NaN
14         NaN
15         NaN
16         NaN
17        25.0
18        14.0
19        19.0
20        46.0
21         NaN
22         NaN
23        19.0
24        55.0
25         NaN
26        32.0
27        24.0
28        19.0
29        24.0
          ... 
278828     NaN
278829    28.0
278830     NaN
278831    62.0
278832    25.0
278833     NaN
278834    18.0
278835    47.0
278836     NaN
278837    15.0
278838     NaN
278839    45.0
278840     NaN
278841     NaN
278842    28.0
278843    28.0
278844     NaN
278845    23.0
278846     NaN
278847     NaN
278848    23.0
278849     NaN
278850    33.0
278851    32.0
278852    17.0
278853     NaN
278854    50.0
278855     NaN
278856     NaN
278857     NaN
Name: Age, Length: 278858, dtype: float64

In [26]:
users['Age'] = users['Age'].fillna((users['Age'].mean()))

In [27]:
users.Age.isnull()

0         False
1         False
2         False
3         False
4         False
5         False
6         False
7         False
8         False
9         False
10        False
11        False
12        False
13        False
14        False
15        False
16        False
17        False
18        False
19        False
20        False
21        False
22        False
23        False
24        False
25        False
26        False
27        False
28        False
29        False
          ...  
278828    False
278829    False
278830    False
278831    False
278832    False
278833    False
278834    False
278835    False
278836    False
278837    False
278838    False
278839    False
278840    False
278841    False
278842    False
278843    False
278844    False
278845    False
278846    False
278847    False
278848    False
278849    False
278850    False
278851    False
278852    False
278853    False
278854    False
278855    False
278856    False
278857    False
Name: Age, Length: 27885

### Change the datatype of `Age` to `int`

In [28]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [29]:
users= users.astype({"Age": int})
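
Note that casting a mean-filled column to `int` truncates the imputed value instead of rounding it. The sketch below shows the difference on made-up data; whether truncation or rounding is preferable is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value; the mean of the rest is 35.67.
ages = pd.Series([18.0, np.nan, 61.0, 28.0])
mean_age = ages.mean()

# astype(int) drops the fractional part of the imputed mean.
truncated = ages.fillna(mean_age).astype(int)

# Rounding first gives the nearest whole year instead.
rounded = ages.fillna(mean_age).round().astype(int)

print(truncated.tolist(), rounded.tolist())
```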

In [30]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]


## Exploring the Ratings Dataset

### Check the shape

In [31]:
ratings.shape

(1149780, 3)

In [32]:
n_users = users.shape[0]
n_books = books.shape[0]

In [33]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [34]:
ratings.dtypes

userID         int64
ISBN          object
bookRating     int64
dtype: object

### The ratings dataset should only contain books that exist in our books dataset. Drop the remaining rows

In [35]:
print(f'Books table size: {len(books)}')
print(f'Ratings table size: {len(ratings)}')
books_with_ratings = ratings.join(books.set_index('ISBN'), on='ISBN')
print(f'New table size: {len(books_with_ratings)}')

Books table size: 271360
Ratings table size: 1149780
New table size: 1149780


In [36]:

print(f'There are {books_with_ratings.bookTitle.isnull().sum()} ratings with no title/author information.')
print(f'This represents {books_with_ratings.bookTitle.isnull().sum()/len(books_with_ratings)*100:.2f}% of the ratings dataset.')

There are 118644 ratings with no title/author information.
This represents 10.32% of the ratings dataset.


There seem to be quite a few ISBNs in the ratings table that did not match an ISBN in the books table: about 10% of all entries!

In [37]:
books_with_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 7 columns):
userID               1149780 non-null int64
ISBN                 1149780 non-null object
bookRating           1149780 non-null int64
bookTitle            1031136 non-null object
bookAuthor           1031135 non-null object
yearOfPublication    1031136 non-null object
publisher            1031134 non-null object
dtypes: int64(2), object(5)
memory usage: 61.4+ MB


We'll remove rows for which `bookTitle` is missing, as the title is the most crucial piece of data needed to identify a book.

In [38]:
books_with_ratings.dropna(subset=['bookTitle'], inplace=True) # remove rows with missing title/author data
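
As an alternative to joining and then dropping rows with a missing title, unmatched ISBNs can be identified directly with `DataFrame.merge(..., indicator=True)`, or excluded with an inner join. This is a sketch on made-up tables; `ratings_s` and `books_s` are hypothetical stand-ins for `ratings` and `books`.

```python
import pandas as pd

ratings_s = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "C", "A"],
    "bookRating": [0, 5, 7, 8],
})
books_s = pd.DataFrame({"ISBN": ["A", "B"], "bookTitle": ["Book A", "Book B"]})

# indicator=True adds a "_merge" column telling where each row came from.
merged = ratings_s.merge(books_s, on="ISBN", how="left", indicator=True)
unmatched_share = (merged["_merge"] == "left_only").mean() * 100
print(f"{unmatched_share:.2f}% of ratings have no matching book")

# An inner join keeps only ratings whose ISBN exists in the books table.
matched = ratings_s.merge(books_s, on="ISBN", how="inner")
print(len(matched))
```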

In [39]:
books_with_ratings

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
1,276726,0155061224,5,Rites of Passage,Judith Rae,2001,Heinle
2,276727,0446520802,0,The Notebook,Nicholas Sparks,1996,Warner Books
3,276729,052165615X,3,Help!: Level 1,Philip Prowse,1999,Cambridge University Press
4,276729,0521795028,6,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,2001,Cambridge University Press
5,276733,2080674722,0,Les Particules Elementaires,Michel Houellebecq,1998,Flammarion
8,276744,038550120X,7,A Painted House,JOHN GRISHAM,2001,Doubleday
10,276746,0425115801,0,Lightning,Dean R. Koontz,1996,Berkley Publishing Group
11,276746,0449006522,0,Manhattan Hunt Club,JOHN SAUL,2002,Ballantine Books
12,276746,0553561618,0,Dark Paradise,TAMI HOAG,1994,Bantam


### The ratings dataset should only contain ratings from users who exist in the users dataset. Drop the remaining rows

In [40]:
print(f'Users table size: {len(users)}')
print(f'Ratings table size: {len(ratings)}')
Users_with_ratings = ratings.join(users.set_index('userID'), on='userID')
print(f'New table size: {len(Users_with_ratings)}')

Users table size: 278858
Ratings table size: 1149780
New table size: 1149780


In [41]:
Users_with_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 5 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
Location      1149780 non-null object
Age           1149780 non-null int32
dtypes: int32(1), int64(2), object(2)
memory usage: 39.5+ MB


In [42]:
# After the left join, a rating from an unknown user would have a null Location
print(f'There are {Users_with_ratings.Location.isnull().sum()} ratings from users missing in the users table.')


There are 0 ratings from users missing in the users table.


### Consider only ratings from 1-10 and leave 0s in column `bookRating`

In [43]:
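# Set any rating outside the 1-10 range to 0 (implicit); existing 0s are left as they are
Users_with_ratings.loc[(Users_with_ratings.bookRating<1)|(Users_with_ratings.bookRating>10), 'bookRating'] = 0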
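
If the explicit and implicit ratings need to be handled separately, they can be split with a boolean mask. This is a minimal sketch using the first few rows shown in `ratings.head()`; `explicit` and `implicit` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    "userID": [276725, 276726, 276727, 276729],
    "ISBN": ["034545104X", "0155061224", "0446520802", "052165615X"],
    "bookRating": [0, 5, 0, 3],
})

# Explicit ratings lie in 1..10; 0 marks an implicit interaction.
explicit = sample[sample.bookRating.between(1, 10)]
implicit = sample[sample.bookRating == 0]
print(len(explicit), len(implicit))
```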


In [44]:
Users_with_ratings

Unnamed: 0,userID,ISBN,bookRating,Location,Age
0,276725,034545104X,0,"tyler, texas, usa",34
1,276726,0155061224,5,"seattle, washington, usa",34
2,276727,0446520802,0,"h, new south wales, australia",16
3,276729,052165615X,3,"rijeka, n/a, croatia",16
4,276729,0521795028,6,"rijeka, n/a, croatia",16
5,276733,2080674722,0,"paris, n/a, france",37
6,276736,3257224281,8,"salzburg, salzburg, austria",34
7,276737,0600570967,6,"sydney, new south wales, australia",14
8,276744,038550120X,7,"torrance, california, usa",34
9,276745,342310538,10,"berlin, berlin, germany",27


### Find out which rating has been given the highest number of times

In [45]:
# count() gives the number of ratings per value (sum() would add up the rating values instead)
books_with_ratings.groupby('bookRating').bookRating.count()

bookRating
0     647294
1       1481
2       2375
3       5118
4       7617
5      45355
6      31687
7      66402
8      91804
9      60778
10     71225
Name: bookRating, dtype: int64
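
For reference, `value_counts` gives the frequency of each rating directly, and `idxmax` picks the most frequent one. This is a small sketch on made-up ratings; dropping the 0 entry restricts the answer to explicit ratings.

```python
import pandas as pd

# Hypothetical ratings; 0 stands for an implicit rating.
r = pd.Series([0, 0, 0, 8, 8, 10, 5], name="bookRating")

counts = r.value_counts().sort_index()
most_common = counts.idxmax()                   # over all ratings
most_common_explicit = counts.drop(0).idxmax()  # ignoring implicit 0s
print(most_common, most_common_explicit)
```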

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books
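
The cells that follow join the tables but do not apply the rating-count threshold themselves. A minimal sketch of such a filter is given here, assuming a frame with `userID` and `ISBN` columns; `filter_active_users` is a hypothetical helper, and the demo uses a threshold of 2 only so that the toy data stays small.

```python
import pandas as pd

def filter_active_users(df, min_ratings=100):
    """Keep rows from users with at least `min_ratings` ratings."""
    # transform("count") broadcasts each user's rating count to their rows.
    counts = df.groupby("userID")["ISBN"].transform("count")
    return df[counts >= min_ratings]

demo = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3],
    "ISBN": list("abcdef"),
    "bookRating": [5, 7, 0, 9, 3, 8],
})
active = filter_active_users(demo, min_ratings=2)
print(sorted(active.userID.unique().tolist()))
```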

In [47]:
print(f'Books+Ratings table size: {len(books_with_ratings)}')
print(f'Users table size: {len(users)}')
books_users_ratings = books_with_ratings.join(users.set_index('userID'), on='userID')
print(f'New "books_users_ratings" table size: {len(books_users_ratings)}')

Books+Ratings table size: 1031136
Users table size: 278858
New "books_users_ratings" table size: 1031136


In [48]:
books_users_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031136 entries, 0 to 1149778
Data columns (total 9 columns):
userID               1031136 non-null int64
ISBN                 1031136 non-null object
bookRating           1031136 non-null int64
bookTitle            1031136 non-null object
bookAuthor           1031135 non-null object
yearOfPublication    1031136 non-null object
publisher            1031134 non-null object
Location             1031136 non-null object
Age                  1031136 non-null int32
dtypes: int32(1), int64(2), object(6)
memory usage: 74.7+ MB


In [49]:
user_item_rating = books_users_ratings[['userID', 'ISBN', 'bookRating']]
user_item_rating.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [50]:
user_item_rating.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031136 entries, 0 to 1149778
Data columns (total 3 columns):
userID        1031136 non-null int64
ISBN          1031136 non-null object
bookRating    1031136 non-null int64
dtypes: int64(2), object(1)
memory usage: 31.5+ MB
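
A user-by-book ratings matrix of the kind described in the note above could be built with `DataFrame.pivot`, filling missing user/book pairs with 0. The sketch below uses a made-up frame; a dense matrix over all users and books would be very large, which is one reason to restrict the users first.

```python
import pandas as pd

demo = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "bookRating": [5, 8, 7, 10],
})

# Rows are users, columns are books; unrated pairs are NaN, then set to 0.
ratings_matrix = demo.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)
print(ratings_matrix)
```

`pivot` assumes each (userID, ISBN) pair occurs at most once; duplicated pairs would need to be aggregated first, for example with `pivot_table`.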


### Generate the predicted ratings using SVD with the number of singular values set to 50
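
The `SVD` class in surprise is a matrix-factorization model fitted by stochastic gradient descent, with the number of latent factors set through `n_factors` (default 100). For comparison, a literal rank-k truncated SVD of a dense ratings matrix can be sketched with NumPy as below. The matrix `R` and the helper `truncated_svd_predict` are made up for illustration, and zeros are treated as ratings when centring, which is a simplification.

```python
import numpy as np

def truncated_svd_predict(R, k):
    """Rank-k reconstruction of a dense ratings matrix R via SVD."""
    # Centre each user's row so the factors model deviations from the mean.
    user_means = R.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(R - user_means, full_matrices=False)
    # Keep only the k largest singular values and add the means back.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_means

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4]], dtype=float)
R_hat = truncated_svd_predict(R, k=2)
print(np.round(R_hat, 2))
```

On the real data one would pass `k=50` and, given the matrix size, probably use a sparse solver instead.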

In [54]:

from surprise import Reader, Dataset

# First need to create a 'Reader' object to set the scale/limit of the ratings field
reader = Reader(rating_scale=(1, 10))

data = Dataset.load_from_df(user_item_rating, reader)

In [55]:
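from surprise import SVD, NMF, model_selection, accuracy
# Load the SVD algorithm with default settings (n_factors=100);
# pass n_factors=50 to use 50 latent factors instead
model = SVD()

# Evaluate with 5-fold cross-validation on the books dataset
%time model_selection.cross_validate(model, data, measures=['RMSE'], cv=5, verbose=True)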

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    3.5044  3.4948  3.5026  3.4942  3.5052  3.5002  0.0048  
Fit time          63.74   67.10   60.66   60.50   62.14   62.83   2.44    
Test time         2.28    1.97    1.94    1.95    2.08    2.04    0.13    
Wall time: 5min 40s


{'test_rmse': array([3.50440622, 3.49476222, 3.50256613, 3.49419266, 3.50524015]),
 'fit_time': (63.742999792099,
  67.09799981117249,
  60.65899991989136,
  60.503000020980835,
  62.13599991798401),
 'test_time': (2.2840001583099365,
  1.9720001220703125,
  1.940000057220459,
  1.9489998817443848,
  2.0759999752044678)}

In [59]:
# set test set to 20%.
trainset, testset = model_selection.train_test_split(data, test_size=0.2)

# Instantiate the SVD model.
model = SVD()

# Train the algorithm on the training set, and predict ratings for the test set
model.fit(trainset)
predictions = model.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 3.5015


3.501503294689561
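
RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below computes it by hand on made-up numbers. By default surprise clips estimates to the `rating_scale` given to `Reader`, here (1, 10), so every implicit 0 left in the data contributes an error of at least 1, which may partly explain the high value above.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical example: one implicit 0 predicted as 4, two exact predictions.
print(rmse([0, 5, 8], [4, 5, 8]))
```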

### Take a particular user_id

### Let's find the recommendations for the user with ID `2110`

#### Note: Execute the below cells to get the variables loaded

In [60]:
userID = 2110

In [66]:
from collections import defaultdict

def get_top_n(predictions, n=10):
    """Return the top-n predicted items per user.

    Maps each user ID to a list of (item ID, estimated rating) tuples,
    sorted by estimated rating in descending order.
    """
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [72]:
def get_reading_list(userid):
    """
    Map the titles of the top-10 predicted books for `userid` to their
    predicted ratings, using the 'books_users_ratings' dataframe for titles.
    """
    reading_list = {}
    top_n = get_top_n(predictions, n=10)
    for book, rating in top_n[userid]:
        # Look up the title for this ISBN
        title = books_users_ratings.loc[books_users_ratings.ISBN == book].bookTitle.unique()[0]
        reading_list[title] = rating
    return reading_list

### Get the predicted ratings for userID `2110` and sort them in descending order

In [73]:

example_reading_list = get_reading_list(userID)
for book, rating in example_reading_list.items():
    print(f'{book}: {rating}')

Thinner: 10
Just As Long As We're Together: 9.451138421650015
The Girl Who Loved Tom Gordon: 7.618248132369063
Close Friends: 7.191701488333628
Watchers #1: Last Stop: 7.011945134778619
Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback)): 6.537124555542003
Han Solo and the Lost Legacy: 6.468524348637647
The Lives of John Lennon: 6.410425414768008
The Dragon Token (Dragon Star, Book 2): 6.368527004617085
It: 6.34419127934035


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [74]:

example_reading_list = get_reading_list(userID)
for book, rating in example_reading_list.items():
    print(f'{book}: {rating}')

Thinner: 10
Just As Long As We're Together: 9.451138421650015
The Girl Who Loved Tom Gordon: 7.618248132369063
Close Friends: 7.191701488333628
Watchers #1: Last Stop: 7.011945134778619
Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback)): 6.537124555542003
Han Solo and the Lost Legacy: 6.468524348637647
The Lives of John Lennon: 6.410425414768008
The Dragon Token (Dragon Star, Book 2): 6.368527004617085
It: 6.34419127934035
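
The predictions used above come from `testset`, that is, from books the user has already rated. A sketch of recommending unseen books is given below: collect the items a user has not rated, score each one, and keep the top n. `recommend_unseen` and the fixed `demo_scores` lookup are hypothetical stand-ins; with surprise, the scorer could be `lambda u, i: model.predict(u, i).est`.

```python
import heapq

def recommend_unseen(user_id, ratings, score, n=10):
    """Return the n highest-scoring items the user has not rated yet.

    ratings: iterable of (user_id, item_id, rating) tuples
    score:   callable (user_id, item_id) -> estimated rating
    """
    ratings = list(ratings)
    all_items = {iid for _, iid, _ in ratings}
    seen = {iid for uid, iid, _ in ratings if uid == user_id}
    candidates = all_items - seen
    scored = ((iid, score(user_id, iid)) for iid in candidates)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

demo_ratings = [(2110, "A", 10), (2110, "B", 7), (7, "C", 8), (7, "D", 3), (9, "E", 6)]
# Hypothetical scorer: a fixed lookup instead of a trained model.
demo_scores = {"C": 8.2, "D": 4.1, "E": 6.5}
top = recommend_unseen(2110, demo_ratings, lambda u, i: demo_scores[i], n=2)
print(top)
```

Scoring every unseen book for one user is cheap compared with building predictions for all user/book pairs, which is why the sketch filters by user first.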
