### Residency-5 : Internal Lab

**Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

### Load the required libraries

In [1]:
import numpy as np
import pandas as pd

from surprise import Dataset,Reader, KNNWithMeans, accuracy
from surprise.model_selection import train_test_split


#### Execute the cell below to load the datasets

In [2]:
#Loading data
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher', 'Image-Url-S', 'Image-Url-M', 'Image-Url-L']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['User-ID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['User-ID', 'ISBN', 'Book-Rating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
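The `error_bad_lines` argument used above was deprecated in pandas 1.3 and removed in pandas 2.0. A minimal sketch of the equivalent call, assuming pandas 1.3 or later; the inline sample in `raw` is made up for illustration and is not taken from `books.csv`:

```python
import io

import pandas as pd

# Hypothetical three-column sample; the second data row has one field too many.
raw = (
    "ISBN;Book-Title;Year\n"
    "0001;Good Row;2001\n"
    "0002;Bad;Row;2002\n"
    "0003;Another Good Row;2003\n"
)

# on_bad_lines="skip" drops malformed rows, as error_bad_lines=False used to do.
# dtype=str keeps every column as text, so leading zeros in ISBNs survive.
toy = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip", dtype=str)
print(toy)
```

Reading every column as text also avoids the mixed int/string values that show up later in `Year-Of-Publication`, at the cost of converting numeric columns explicitly afterwards.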


### Check the number of records and features in each dataset

In [3]:
# Books data-set
books.shape

(271360, 8)

In [4]:
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-Url-S', 'Image-Url-M', 'Image-Url-L'],
      dtype='object')

In [5]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                   271360 non-null object
Book-Title             271360 non-null object
Book-Author            271359 non-null object
Year-Of-Publication    271360 non-null object
Publisher              271358 non-null object
Image-Url-S            271360 non-null object
Image-Url-M            271360 non-null object
Image-Url-L            271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB


In [6]:
# user data-set
users.shape

(278858, 3)

In [7]:
users.columns

Index(['User-ID', 'Location', 'Age'], dtype='object')

In [8]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
User-ID     278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [9]:
# ratings data-set
ratings.shape

(1149780, 3)

In [10]:
ratings.columns

Index(['User-ID', 'ISBN', 'Book-Rating'], dtype='object')

In [11]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
User-ID        1149780 non-null int64
ISBN           1149780 non-null object
Book-Rating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


## Exploring books dataset

In [12]:
books.head().T

Unnamed: 0,0,1,2,3,4
ISBN,0195153448,0002005018,0060973129,0374157065,0393045218
Book-Title,Classical Mythology,Clara Callan,Decision in Normandy,Flu: The Story of the Great Influenza Pandemic...,The Mummies of Urumchi
Book-Author,Mark P. O. Morford,Richard Bruce Wright,Carlo D'Este,Gina Bari Kolata,E. J. W. Barber
Year-Of-Publication,2002,2001,1991,1999,1999
Publisher,Oxford University Press,HarperFlamingo Canada,HarperPerennial,Farrar Straus Giroux,W. W. Norton & Company
Image-Url-S,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0393045218.0...
Image-Url-M,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0393045218.0...
Image-Url-L,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [13]:
books.drop(columns=['Image-Url-S','Image-Url-M','Image-Url-L'],inplace=True)

In [14]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [15]:
books['Book-Title'].value_counts()

Selected Poems                                                                         27
Little Women                                                                           24
Wuthering Heights                                                                      21
Adventures of Huckleberry Finn                                                         20
The Secret Garden                                                                      20
Dracula                                                                                20
Jane Eyre                                                                              19
The Night Before Christmas                                                             18
Pride and Prejudice                                                                    18
Great Expectations                                                                     17
Black Beauty                                                                           16
Masquerade

### Inference:
The same book title appears many times in the dataset, so a single title can have more than one publisher (and more than one ISBN).

### Check unique values of `Year-Of-Publication`


In [16]:
books['Year-Of-Publication'].nunique()


202

In [17]:
books['Year-Of-Publication'].unique()


array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

In [18]:
books[books['Year-Of-Publication'].str.isnumeric() == False]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


As can be seen above, there are some incorrect entries in this field. The publisher names 'DK Publishing Inc' and 'Gallimard' have been loaded as `Year-Of-Publication` because of errors in the csv file that shifted the columns of those rows.


Some of the entries are also strings, while the same years appear as numbers elsewhere. These issues are fixed in the following steps.
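As an aside, one way to handle both problems at once is to parse the column with `pd.to_numeric`. This is only a sketch on a made-up toy Series (`toy_years`), not the approach the following cells take:

```python
import pandas as pd

# Toy column mixing ints, numeric strings and a misplaced publisher name
# (illustrative values only).
toy_years = pd.Series([2002, "2000", "DK Publishing Inc", 0], dtype=object)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
parsed_years = pd.to_numeric(toy_years, errors="coerce")

# Rows that failed to parse can be inspected before being dropped.
bad_rows = toy_years[parsed_years.isna()]
clean_years = parsed_years.dropna().astype("int64")

print(bad_rows.tolist())
print(clean_years.tolist())
```

The numeric strings and the plain integers end up with the same dtype, and only the genuinely non-numeric entry is flagged.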

### Check the rows having 'DK Publishing Inc' as `Year-Of-Publication`

In [19]:
books[books['Year-Of-Publication'] == 'DK Publishing Inc']

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `Year-Of-Publication`

In [20]:
books.drop(labels=books[books['Year-Of-Publication'].str.isnumeric() == False].index,inplace=True)

In [21]:
books[books['Year-Of-Publication'].str.isnumeric() == False]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher


In [22]:
books['Year-Of-Publication'].count()

271357

### Change the datatype of `Year-Of-Publication` to `int`

In [23]:
books['Year-Of-Publication'] = books['Year-Of-Publication'].astype(np.int64)


In [24]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271357 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                   271357 non-null object
Book-Title             271357 non-null object
Book-Author            271356 non-null object
Year-Of-Publication    271357 non-null int64
Publisher              271355 non-null object
dtypes: int64(1), object(4)
memory usage: 12.4+ MB


### Drop NaNs in `'Publisher'` column


In [25]:
books.isna().sum()

ISBN                   0
Book-Title             0
Book-Author            1
Year-Of-Publication    0
Publisher              2
dtype: int64

In [26]:
books[books['Publisher'].isna() == True]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


In [27]:
books['Publisher'].nunique()

16804

#### Drop the rows with 'NaN' in the 'Publisher' column
A missing publisher cannot be sensibly replaced by a placeholder string or by a value such as 0, so the two affected rows are dropped.

In [28]:
books.drop(labels=books[books['Publisher'].isna() == True].index,inplace=True)

In [29]:
books[books['Publisher'].isna() == True]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher


In [30]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271355 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                   271355 non-null object
Book-Title             271355 non-null object
Book-Author            271354 non-null object
Year-Of-Publication    271355 non-null int64
Publisher              271355 non-null object
dtypes: int64(1), object(4)
memory usage: 12.4+ MB


## Exploring Users dataset

In [31]:
users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [32]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
User-ID     278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [33]:
users.isna().sum()

User-ID          0
Location         0
Age         110762
dtype: int64

**Column 'Age' has a value 'NaN' for 110762 rows**

### Get all unique values in ascending order for column `Age`

In [34]:
users['Age'].nunique()

165

In [35]:
np.sort(users['Age'].unique())

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 113., 114., 115., 116., 118., 119., 123., 124., 127.,
       128., 132., 133., 136., 137., 138., 140., 141., 143., 146., 147.,
       148., 151., 152., 156., 157., 159., 162., 168., 172., 175., 183.,
       186., 189., 199., 200., 201., 204., 207., 20

The `Age` column has some invalid entries: NaN, 0 and implausibly high values (100 and above).

### Ages below 5 and above 90 do not make much sense for book ratings, so replace them with NaN

In [36]:
users[(users['Age'] < 5) | (users['Age'] > 90)].count()

User-ID     1312
Location    1312
Age         1312
dtype: int64

In [37]:
users[(users['Age'] < 5) | (users['Age'] > 90)].sample(5)

Unnamed: 0,User-ID,Location,Age
169984,169985,"florence, alabama, usa",0.0
194105,194106,"arlington, texas, usa",1.0
237564,237565,"athina, alberta, greece",104.0
114730,114731,"bendigo, victoria, australia",0.0
150238,150239,"london, england, united kingdom",0.0


In [38]:
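# Assign the result back instead of using inplace=True on a column selection,
# which does not reliably update the parent DataFrame in newer pandas versions
users['Age'] = users['Age'].mask(users['Age'] < 5)

In [39]:
users['Age'] = users['Age'].mask(users['Age'] > 90)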

In [40]:
np.sort(users['Age'].unique())

array([ 5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17.,
       18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30.,
       31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43.,
       44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55., 56.,
       57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68., 69.,
       70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81., 82.,
       83., 84., 85., 86., 87., 88., 89., 90., nan])

In [41]:
users['Age'].isna().sum()

112074
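The two masking steps can also be written as a single range check. A minimal sketch on a made-up Series (`toy_age`), using the same 5 to 90 bounds as above:

```python
import numpy as np
import pandas as pd

# Illustrative ages, including values outside the accepted range.
toy_age = pd.Series([0, 4, 5, 34, 90, 91, 104, np.nan])

# between() is inclusive on both ends by default; where() keeps values for
# which the condition holds and sets everything else to NaN.
cleaned_age = toy_age.where(toy_age.between(5, 90))

print(cleaned_age.tolist())
```

Values that were already NaN stay NaN, because `between` evaluates to `False` for them.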

### Replace null values in column `Age` with mean

In [42]:
users['Age'].mean()

34.72384041634689

In [43]:
# Fill missing ages with the column mean and assign the result back
users['Age'] = users['Age'].fillna(users['Age'].mean())

In [44]:
users['Age'].isna().sum()

0

### Change the datatype of `Age` to `int`

In [45]:
users['Age'] = users['Age'].astype(np.int64)

In [46]:
users['Age'].dtype

dtype('int64')
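Note that casting a float column to `int` truncates rather than rounds, so the imputed mean of about 34.72 is stored as 34. A small sketch of the difference, on made-up ages chosen so that the mean is fractional:

```python
import numpy as np
import pandas as pd

# Illustrative ages; the mean of 18, 17 and 69 is about 34.67.
toy_age = pd.Series([18.0, 17.0, 69.0, np.nan])
filled = toy_age.fillna(toy_age.mean())

truncated = filled.astype(np.int64)          # drops the fractional part
rounded = filled.round().astype(np.int64)    # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

Whether truncation or rounding is preferable is a judgment call; either way, a large share of users ends up with the same imputed age.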

## Exploring the Ratings Dataset

### Check the shape

In [47]:
ratings.shape

(1149780, 3)

In [48]:
ratings.head(5)

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [49]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
User-ID        1149780 non-null int64
ISBN           1149780 non-null object
Book-Rating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [50]:
ratings.isna().sum()

User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

In [51]:
ratings_new = ratings.copy()

### The ratings dataset should contain only books that exist in the books dataset. Drop the remaining rows

In [52]:
ratings_new['ISBN'].nunique()

340556

In [53]:
books['ISBN'].nunique()

271355

In [54]:
ratings_new = ratings_new[ratings_new['ISBN'].isin(books['ISBN'].values)]

In [55]:
ratings_new['ISBN'].nunique()

270146

In [56]:
ratings_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031130 entries, 0 to 1149778
Data columns (total 3 columns):
User-ID        1031130 non-null int64
ISBN           1031130 non-null object
Book-Rating    1031130 non-null int64
dtypes: int64(2), object(1)
memory usage: 31.5+ MB


### The ratings dataset should contain only ratings from users who exist in the users dataset. Drop the remaining rows

In [57]:
ratings_new['User-ID'].nunique()

92106

In [58]:
users['User-ID'].nunique()

278858

In [59]:
ratings_new = ratings_new[ratings_new['User-ID'].isin(users['User-ID'].values)]

In [60]:
ratings_new['User-ID'].nunique()

92106

In [61]:
ratings_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031130 entries, 0 to 1149778
Data columns (total 3 columns):
User-ID        1031130 non-null int64
ISBN           1031130 non-null object
Book-Rating    1031130 non-null int64
dtypes: int64(2), object(1)
memory usage: 31.5+ MB
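The two filters above can be combined into one boolean mask. A minimal sketch on made-up toy frames (`toy_books`, `toy_users`, `toy_ratings` are hypothetical names, not the notebook's data):

```python
import pandas as pd

toy_books = pd.DataFrame({"ISBN": ["0001", "0002"]})
toy_users = pd.DataFrame({"User-ID": [1, 2]})
toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 3, 1],
    "ISBN": ["0001", "0002", "0001", "9999"],
    "Book-Rating": [0, 8, 5, 7],
})

# Keep a rating only if both its book and its user are known.
known_book = toy_ratings["ISBN"].isin(toy_books["ISBN"])
known_user = toy_ratings["User-ID"].isin(toy_users["User-ID"])
toy_filtered = toy_ratings[known_book & known_user]

print(toy_filtered)
```

In the toy data, the rating from user 3 and the rating for ISBN `9999` are dropped. In the notebook itself the user filter removes nothing, since the row count stays at 1031130.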


### Keep only the explicit ratings (1-10) and drop the 0s in column `Book-Rating`

In [62]:
ratings_new['Book-Rating'].nunique()

11

In [63]:
ratings_new['Book-Rating'].unique()

array([ 0,  5,  3,  6,  7,  9,  8, 10,  1,  4,  2], dtype=int64)

In [64]:
ratings_new[ ratings_new['Book-Rating'] == 0 ].count()

User-ID        647291
ISBN           647291
Book-Rating    647291
dtype: int64

In [65]:
# As noted above, ratings 1-10 are explicit; 0 marks an implicit rating
explicit_ratings = ratings_new[ ratings_new['Book-Rating'] != 0 ].copy()

In [66]:
explicit_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383839 entries, 1 to 1149778
Data columns (total 3 columns):
User-ID        383839 non-null int64
ISBN           383839 non-null object
Book-Rating    383839 non-null int64
dtypes: int64(2), object(1)
memory usage: 11.7+ MB


### Find out which rating has been given highest number of times

In [67]:
explicit_ratings['Book-Rating'].value_counts()

8     91804
10    71225
7     66401
9     60776
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: Book-Rating, dtype: int64

**It can be inferred that rating 8 has been given the highest number of times (91,804).**

#### Top 5 books rated by the largest number of users

In [68]:
books[books['ISBN'].isin(explicit_ratings['ISBN'].value_counts().head().index)]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
26,0971880107,Wild Animus,Rich Shapero,2004,Too Far
408,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
522,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,1998,Picador USA
748,0385504209,The Da Vinci Code,Dan Brown,2003,Doubleday
1105,0060928336,Divine Secrets of the Ya-Ya Sisterhood: A Novel,Rebecca Wells,1997,Perennial


#### Top 5 books by total rating score (sum of ratings)

In [69]:
gby=explicit_ratings.groupby('ISBN')['Book-Rating'].sum()

In [70]:
gby.sort_values(ascending=False).head(5)

ISBN
0316666343    5787
0385504209    4108
0312195516    3134
059035342X    2798
0142001740    2595
Name: Book-Rating, dtype: int64

In [71]:
books[books['ISBN'].isin(gby.sort_values(ascending=False).head(5).index)]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
356,0142001740,The Secret Life of Bees,Sue Monk Kidd,2003,Penguin Books
408,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
522,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,1998,Picador USA
748,0385504209,The Da Vinci Code,Dan Brown,2003,Doubleday
2143,059035342X,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books
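Ranking by number of ratings and ranking by the sum of ratings answer different questions, and neither is the same as the average rating. A small sketch, on made-up ratings, of computing all three per book in one pass:

```python
import pandas as pd

# Illustrative ratings for three hypothetical books A, B and C.
toy_ratings = pd.DataFrame({
    "ISBN": ["A", "A", "A", "B", "B", "C"],
    "Book-Rating": [5, 5, 5, 10, 9, 10],
})

# count = how many users rated the book, sum = total score, mean = average score
summary = toy_ratings.groupby("ISBN")["Book-Rating"].agg(["count", "sum", "mean"])

print(summary.sort_values("sum", ascending=False))
```

In this toy data, A has the most ratings, B the highest total and C the highest average, which shows how the choice of aggregate changes the "top" list.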


#### Top 5 users who have given the most ratings

In [72]:
users[users['User-ID'].isin(explicit_ratings['User-ID'].value_counts().head().index)]

Unnamed: 0,User-ID,Location,Age
11675,11676,"n/a, n/a, n/a",34
23901,23902,"london, england, united kingdom",34
98390,98391,"morrow, georgia, usa",52
153661,153662,"ft. stewart, georgia, usa",44
189834,189835,"honolulu, hawaii, usa",34


In [73]:
gby = explicit_ratings[explicit_ratings['User-ID'].isin(explicit_ratings['User-ID'].value_counts().head().index)]
gby.groupby(['User-ID','Book-Rating']).count()

User-ID,Book-Rating,ISBN
11676,1,94
11676,2,86
11676,3,166
11676,4,230
11676,5,680
11676,6,685
11676,7,1252
11676,8,1764
11676,9,861
11676,10,1125


### **Collaborative Filtering Based Recommendation Systems**

**Item based Collaborative Filtering**

Fitting the model on the full dataset raised a memory error, so the data is restricted to users who have given at least 100 ratings and books that have received at least 100 ratings.

In [74]:
user_rating = explicit_ratings['User-ID'].value_counts()
book_rating = explicit_ratings['ISBN'].value_counts()

In [75]:
explicit_ratings = explicit_ratings[explicit_ratings['User-ID'].isin(user_rating[user_rating.values >= 100].index)]

In [76]:
explicit_ratings = explicit_ratings[explicit_ratings['ISBN'].isin(book_rating[book_rating.values >= 100].index)]

In [77]:
explicit_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3149 entries, 1474 to 1146852
Data columns (total 3 columns):
User-ID        3149 non-null int64
ISBN           3149 non-null object
Book-Rating    3149 non-null int64
dtypes: int64(2), object(1)
memory usage: 98.4+ KB
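The same thresholding idea can be sketched on a toy frame. The threshold `min_count = 2` and the data are made up for illustration; the notebook uses 100:

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3, 3, 3],
    "ISBN": ["A", "B", "A", "A", "B", "C"],
    "Book-Rating": [8, 7, 9, 6, 10, 5],
})
min_count = 2  # hypothetical threshold for this toy example

# Count ratings per user and per book on the unfiltered data.
user_counts = toy_ratings["User-ID"].value_counts()
book_counts = toy_ratings["ISBN"].value_counts()

active_users = user_counts[user_counts >= min_count].index
popular_books = book_counts[book_counts >= min_count].index

toy_subset = toy_ratings[
    toy_ratings["User-ID"].isin(active_users)
    & toy_ratings["ISBN"].isin(popular_books)
]
print(toy_subset)
```

As in the cells above, both counts are taken before any filtering, so a book kept here may have fewer than `min_count` ratings left in the final subset.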


In [78]:
rating_scale = np.sort(explicit_ratings['Book-Rating'].value_counts().index)
rating_scale

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int64)

In [79]:
reader = Reader(rating_scale=(np.min(rating_scale), np.max(rating_scale)))

In [80]:
data = Dataset.load_from_df(explicit_ratings[['User-ID', 'ISBN', 'Book-Rating']], reader)

In [81]:
trainset, testset = train_test_split(data, test_size=.25,random_state=0)

In [82]:
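# Similarity measures to compare for item-based KNN.
# A list (rather than a set) keeps the evaluation order deterministic.
sim_measure = ['cosine', 'pearson_baseline', 'msd', 'pearson']

for sim in sim_measure:
    algo = KNNWithMeans(sim_options={'name': sim, 'user_based': False})
    algo.fit(trainset)              # build the item-item similarity matrix
    test_pred = algo.test(testset)  # predict the held-out ratings
    accuracy.rmse(test_pred)
    print('\n')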

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.6814


Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
RMSE: 1.8276


Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.6951


Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.8363




### Inference:
   
From the above it can be inferred that the cosine similarity measure gives the lowest error of the four (RMSE 1.6814, with the default k=40).
Trying other values of k (10, 20, ..., 100) gave the same RMSE for each similarity measure.
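To make the comparison more concrete, here is a small sketch of how two of the measures behave on a pair of item rating vectors. It follows the formulas in the Surprise documentation as I understand them (cosine over the users who rated both items, and MSD similarity defined as 1 / (MSD + 1)). The vectors `item_i` and `item_j` are made-up values, and Surprise's own implementation additionally handles sparsity and `min_support`:

```python
import numpy as np

# Ratings given to two hypothetical items by the same three users.
item_i = np.array([8.0, 10.0, 6.0])
item_j = np.array([7.0, 9.0, 5.0])

# Cosine similarity: dot product divided by the product of the vector norms.
cosine_sim = item_i @ item_j / (np.linalg.norm(item_i) * np.linalg.norm(item_j))

# Mean squared difference, turned into a similarity in (0, 1].
msd = np.mean((item_i - item_j) ** 2)
msd_sim = 1.0 / (msd + 1.0)

print(round(cosine_sim, 4), msd_sim)
```

Because raw ratings are all positive, cosine similarity tends to sit close to 1 for most item pairs, whereas MSD penalises a constant offset between the two vectors.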

**'Cosine' similarity measure**

In [83]:
algo = KNNWithMeans(sim_options={'name': 'cosine', 'user_based': False})
algo.fit(trainset)
test_pred = algo.test(testset)
accuracy.rmse(test_pred)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.6814


1.6814004422554982
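For reference, RMSE is the square root of the mean squared difference between estimated and actual ratings. A minimal sketch on made-up values (not the notebook's predictions), with the mean absolute error alongside for comparison:

```python
import math

# Illustrative actual and estimated ratings for three predictions.
actual = [10.0, 8.0, 1.0]
estimated = [10.0, 7.0, 5.0]

squared_errors = [(est - act) ** 2 for act, est in zip(actual, estimated)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

absolute_errors = [abs(est - act) for act, est in zip(actual, estimated)]
mae = sum(absolute_errors) / len(absolute_errors)

print(rmse, mae)
```

RMSE weights large misses more heavily than MAE, so a few badly wrong predictions can dominate the score.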

In [84]:
test_pred_df = pd.DataFrame(test_pred)

# absolute error = |estimated rating - actual rating|
test_pred_df['err'] = abs(test_pred_df.est - test_pred_df.r_ui)

In [85]:
best_predictions = test_pred_df.sort_values(by='err').head(10)
best_predictions

Unnamed: 0,uid,iid,r_ui,est,details,err
683,153662,0439139600,10.0,10.0,"{'actual_k': 11, 'was_impossible': False}",0.0
491,208406,0439136369,10.0,10.0,"{'actual_k': 3, 'was_impossible': False}",0.0
493,101851,0439139597,10.0,10.0,"{'actual_k': 8, 'was_impossible': False}",0.0
242,104399,0439136369,10.0,10.0,"{'actual_k': 1, 'was_impossible': False}",0.0
530,208406,0439139597,10.0,10.0,"{'actual_k': 3, 'was_impossible': False}",0.0
102,141902,0316666343,10.0,10.0,"{'actual_k': 3, 'was_impossible': False}",0.0
573,244627,0440234743,10.0,10.0,"{'actual_k': 4, 'was_impossible': False}",0.0
387,226965,0439139597,10.0,10.0,"{'actual_k': 7, 'was_impossible': False}",0.0
243,184532,0439136369,10.0,10.0,"{'actual_k': 2, 'was_impossible': False}",0.0
467,225087,0060915544,10.0,10.0,"{'actual_k': 3, 'was_impossible': False}",0.0


In [86]:
worst_predictions = test_pred_df.sort_values(by='err').tail(10)
worst_predictions

Unnamed: 0,uid,iid,r_ui,est,details,err
82,66942,1573229326,1.0,6.478665,"{'actual_k': 9, 'was_impossible': False}",5.478665
288,46398,0385484518,3.0,8.536435,"{'actual_k': 9, 'was_impossible': False}",5.536435
566,162738,0440224675,1.0,6.678304,"{'actual_k': 6, 'was_impossible': False}",5.678304
758,219683,0316769487,1.0,6.846914,"{'actual_k': 3, 'was_impossible': False}",5.846914
602,75819,0316769487,2.0,8.233892,"{'actual_k': 9, 'was_impossible': False}",6.233892
632,224525,0971880107,10.0,3.466667,"{'actual_k': 0, 'was_impossible': False}",6.533333
146,38023,0971880107,9.0,2.207407,"{'actual_k': 1, 'was_impossible': False}",6.792593
50,11676,0671003755,1.0,7.982911,"{'actual_k': 36, 'was_impossible': False}",6.982911
533,11676,0440222656,1.0,8.019289,"{'actual_k': 39, 'was_impossible': False}",7.019289
332,97874,0446611867,10.0,1.0,"{'actual_k': 1, 'was_impossible': False}",9.0


### Find the books most similar to the first item in the best-predictions list

In [87]:
sim_books = algo.get_neighbors(trainset.to_inner_iid(best_predictions.iloc[0,1]),10)
books[books['ISBN'] == best_predictions.iloc[0,1]]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
6932,0439139600,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2002,Scholastic Paperbacks


In [88]:
books[books['Book-Title'] == 'Harry Potter and the Goblet of Fire (Book 4)']

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
5431,0439139597,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2000,Scholastic
6932,0439139600,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2002,Scholastic Paperbacks


In [89]:
similar_books = []
for x in sim_books:
    similar_books.append(trainset.to_raw_iid(x))
books[books['ISBN'].isin(similar_books)]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher
94,0679429220,Midnight in the Garden of Good and Evil: A Sav...,John Berendt,1994,Random House
712,0142000205,Icy Sparks,Gwyn Hyman Rubio,2001,Penguin Books
953,0440211727,A Time to Kill,JOHN GRISHAM,1992,Dell
1071,0743237188,Fall On Your Knees (Oprah #45),Ann-Marie MacDonald,2002,Touchstone
1325,0385335482,Confessions of a Shopaholic (Summer Display Op...,SOPHIE KINSELLA,2001,Delta
1680,0440226430,Summer Sisters,Judy Blume,1999,Dell Publishing Company
3171,0380002930,Watership Down,Richard Adams,1976,Avon
5431,0439139597,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2000,Scholastic
6196,0312983271,Full House (Janet Evanovich's Full Series),Janet Evanovich,2002,St. Martin's Paperbacks
8981,0684801523,The Great Gatsby,F. Scott Fitzgerald,1995,Scribner


In [90]:
ratings_for_similar_books = explicit_ratings[explicit_ratings['ISBN'].isin(similar_books)==True]
ratings_for_similar_books.sample(5)

Unnamed: 0,User-ID,ISBN,Book-Rating
367534,88283,0385335482,10
370135,88733,0679429220,8
738999,178522,0440226430,8
429309,102647,0440211727,10
396069,95359,0440226430,6


In [91]:
gby = ratings_for_similar_books.groupby('ISBN')['Book-Rating']
gby.sum()

ISBN
0142000205    108
0312983271    156
0380002930    220
0385335482    150
0439139597    331
0440211727    283
0440226430    166
0679429220    135
0684801523    105
0743237188    116
Name: Book-Rating, dtype: int64

### Inference:
Many books have more than one ISBN (for example, the two editions of 'Harry Potter and the Goblet of Fire (Book 4)' shown above), so ratings keyed by ISBN are split across editions. Counting ratings per ISBN therefore uses only part of the rating information available for a book.

Here the item should be 'Book-Title' rather than 'ISBN'.
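A minimal sketch of that idea, assuming ratings are joined to titles through `ISBN` and that a user's ratings for several editions of one title are averaged. The toy frames and the averaging rule are assumptions for illustration, not something the notebook implements:

```python
import pandas as pd

toy_books = pd.DataFrame({
    "ISBN": ["0439139597", "0439139600", "0684801523"],
    "Book-Title": [
        "Harry Potter and the Goblet of Fire (Book 4)",
        "Harry Potter and the Goblet of Fire (Book 4)",
        "The Great Gatsby",
    ],
})
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["0439139597", "0439139600", "0439139600", "0684801523"],
    "Book-Rating": [10, 8, 9, 7],
})

# Attach the title to every rating, then collapse editions per (user, title).
ratings_by_title = (
    toy_ratings.merge(toy_books, on="ISBN", how="inner")
    .groupby(["User-ID", "Book-Title"], as_index=False)["Book-Rating"]
    .mean()
)
print(ratings_by_title)
```

The resulting frame has one row per user and title and could be passed to `Dataset.load_from_df` in place of the ISBN-keyed ratings.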

# The questions below are optional (will not be graded)

### Generating ratings matrix from explicit ratings (Optional)


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

### Generate the predicted ratings using SVD with no.of singular values to be 50 (Optional)
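These optional steps are left unanswered in this notebook. As a rough sketch of what they could look like, the block below builds a user-by-book matrix from made-up ratings, fills missing entries with 0, and reconstructs it from a truncated SVD. The toy data and `k = 2` are assumptions; with the real matrix, `k` would be 50 and a sparse solver would likely be needed:

```python
import numpy as np
import pandas as pd

# Illustrative explicit ratings for three users and three hypothetical books.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3, 3],
    "ISBN": ["A", "B", "A", "C", "B", "C"],
    "Book-Rating": [8, 6, 9, 5, 7, 10],
})

# Users as rows, books as columns; unrated pairs become 0.
matrix = toy_ratings.pivot(index="User-ID", columns="ISBN", values="Book-Rating").fillna(0)

# Full SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(matrix.to_numpy(dtype=float), full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Predicted ratings, labelled like the original matrix.
predicted = pd.DataFrame(approx, index=matrix.index, columns=matrix.columns)
print(predicted.round(2))
```

Sorting a user's row of `predicted` in descending order and removing the books that user has already rated would then give the recommendation list asked for in the later steps.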

### Take a particular user_id 

### Let's find the recommendations for user with id `2110` (Optional)

#### Note: Execute the below cells to get the variables loaded

### Get the predicted ratings for userID `2110` and sort them in descending order (Optional)

### Create a dataframe with name `user_data` containing userID `2110` explicitly interacted books (Optional)

### Combine the user_data and the corresponding book data (`book_data`) in a single dataframe with name `user_full_info` (Optional)

### Get top 10 recommendations for above given userID from the books not already rated by that user (Optional)