**About Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and an implicit rating is expressed as 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project builds a book recommender system using user-based and item-based collaborative filtering.

#### Execute the below cell to load the datasets

In [1]:
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
#Loading data
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
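
In pandas 1.3 and later, `error_bad_lines` is deprecated in favour of `on_bad_lines`, and it was removed in pandas 2.0. The following is a minimal sketch of the equivalent call, assuming a recent pandas version. The in-memory sample and its column names are made up for illustration, and reading `ISBN` as a string is an optional extra that keeps leading zeros.

```python
import io
import pandas as pd

# Made-up sample: the third line has 4 fields instead of the expected 3
raw = (
    "ISBN;bookTitle;bookAuthor\n"
    "0001;Good Row;Author A\n"
    "0002;Bad;Row;Author B\n"
    "0003;Another Good Row;Author C\n"
)

# on_bad_lines="skip" drops malformed rows, like error_bad_lines=False did
demo = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip", dtype={"ISBN": str})
print(demo)
```

Only the two well-formed rows survive, and the ISBNs keep their leading zeros because the column is read as text.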


### Check the number of records and features in each dataset

In [3]:
books.shape
books.head()

(271360, 8)

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [4]:
users.shape
users.head()

(278858, 3)

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [5]:
ratings.shape
ratings.head()

(1149780, 3)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


## Exploring books dataset

In [6]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [7]:
books = books.drop(labels = ['imageUrlS','imageUrlM','imageUrlL'], axis = 1)

In [8]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [9]:
books[books["publisher"] =='DK Publishing Inc']["yearOfPublication"].unique()
books[books["publisher"] =='Gallimard']["yearOfPublication"].unique()
books["yearOfPublication"].unique()
books.dtypes

array([2003, 2000, 1997, 2002, 1998, 2001, 1999, 1996, 1994, 1995, 1993,
       1992, 2004, '1993', '2004', '1997', '2001', '2000', '2002', '1995',
       '1999', '2003', '1998', '1994', '1996', '1991'], dtype=object)

array([2002, 1991, 2001, 1972, 2003, 1973, 2000, 1983, 1994, 1995, 1976,
       1999, 1996, 1993, 1982, 1992, 1997, 1998, 1987, 1984, 1977, 1981,
       1989, 1986, 1988, 1974, 1978, 1975, 1980, 1967, 1990, 1979, 1985,
       1970, 1969, 1966, '2003', '2000', '2002', '1973', '1997', '1995',
       '1983', '1994', '1988', '1985', '1986', '1972', '1996', '1984',
       '1999', '1979', '1990', '1998', '1981', '1982', '1978', '2001',
       '1975', '1967', '1991', '1987', '1977', '1989', '1962', '1993',
       '1992', 1949], dtype=object)

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
dtype: object

As the output above shows, this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as `yearOfPublication` values because of malformed rows in the CSV file.

In addition, the column mixes types: the same years appear as integers in some rows and as strings in others. The following steps fix these issues.
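
One way to locate such entries without listing them by hand is to coerce the column to numbers and look at what fails. This is only a sketch on a made-up series that mimics the mixed content described above; it is not the approach the notebook itself takes.

```python
import pandas as pd

# Made-up sample mixing integers, numeric strings and misplaced publisher names
years = pd.Series([2002, "1999", "DK Publishing Inc", 0, "Gallimard"], dtype=object)

# errors="coerce" turns anything that is not a number into NaN
numeric = pd.to_numeric(years, errors="coerce")

# The entries that could not be parsed are the misplaced publisher names
bad_entries = years[numeric.isna()]
print(bad_entries.tolist())
```

The numeric strings are converted along the way, so the same call also addresses the mixed integer and string years.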

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [10]:
books[books["yearOfPublication"] =='DK Publishing Inc']["yearOfPublication"].count()
books[books["yearOfPublication"] =='Gallimard']["yearOfPublication"].count()
books[books["yearOfPublication"] ==0]["yearOfPublication"].count()
books.dtypes
print("3 rows have a publisher name as yearOfPublication and 3570 rows have year 0; yearOfPublication has object dtype")

2

1

3570

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
dtype: object

3 rows have a publisher name as yearOfPublication and 3570 rows have year 0; yearOfPublication has object dtype


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [11]:
books1 = books
books1[books1["yearOfPublication"].isin(['DK Publishing Inc','Gallimard'])]
#books[books["yearOfPublication"] =='DK Publishing Inc']["yearOfPublication"].count()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


In [12]:

books.drop(index = books[books["yearOfPublication"].isin(['DK Publishing Inc','Gallimard'])].index.tolist(),inplace =True)

In [13]:
books.shape # (271357, 5)
print("Dropped only the rows whose yearOfPublication held a publisher name; rows with year 0 are kept because the exercise does not ask to drop them")

(271357, 5)

Dropped only the rows whose yearOfPublication held a publisher name; rows with year 0 are kept because the exercise does not ask to drop them


### Change the datatype of yearOfPublication to 'int'

In [14]:
books["yearOfPublication"] = books["yearOfPublication"].astype("int64")

In [15]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [16]:
import numpy as np
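# Note: comparing with == np.nan never matches (NaN != NaN), so this selects no rows;
# isnull(), used on the next line, is the reliable way to find missing values.
books[books["publisher"]==np.nan]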
books.isnull().sum()
books.shape
books = books.dropna(subset=['publisher'])
books.shape


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            2
dtype: int64

(271357, 5)

(271355, 5)
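
The empty frame printed above comes from the `== np.nan` comparison, not from the data. A small sketch on a made-up frame shows the difference between the two tests:

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing publisher
demo = pd.DataFrame({"publisher": ["Penguin", np.nan, "Gallimard"]})

# NaN is not equal to anything, including itself, so this never matches
eq_matches = int((demo["publisher"] == np.nan).sum())

# isna() (alias: isnull()) is the intended test for missing values
isna_matches = int(demo["publisher"].isna().sum())

print(eq_matches, isna_matches)
```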

## Exploring Users dataset

In [17]:
users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [18]:
users.dtypes
users.sort_values(by=['Age'])["Age"].unique()

userID        int64
Location     object
Age         float64
dtype: object

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 113., 114., 115., 116., 118., 119., 123., 124., 127.,
       128., 132., 133., 136., 137., 138., 140., 141., 143., 146., 147.,
       148., 151., 152., 156., 157., 159., 162., 168., 172., 175., 183.,
       186., 189., 199., 200., 201., 204., 207., 20

The `Age` column has some invalid entries: NaN, 0, and implausibly high values (100 and above).

### Ages below 5 and above 90 do not make much sense for book ratings, so replace them with NaN

In [19]:
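# Keep only ages 5 to 89 inclusive (the original loop also excluded 90); everything else becomes NaN.
# between() is False for NaN, so missing ages simply stay NaN.
users.loc[~users["Age"].between(5, 89), "Age"] = np.nan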


In [20]:
users["Age"].unique()
users["Age"].mean()

array([nan, 18., 17., 61., 26., 14., 25., 19., 46., 55., 32., 24., 20.,
       34., 23., 51., 31., 21., 44., 30., 57., 43., 37., 41., 54., 42.,
       50., 39., 53., 47., 36., 28., 35., 13., 58., 49., 38., 45., 62.,
       63., 27., 33., 29., 66., 40., 15., 60., 79., 22., 16., 65., 59.,
       48., 72., 56., 67., 80., 52., 69., 71., 73., 78.,  9., 64., 12.,
       74., 75., 76., 83., 68., 11., 77., 70.,  8.,  7., 81., 10.,  5.,
        6., 84., 82., 85., 86., 87., 89., 88.])

34.722183248490516

### Replace null values in column `Age` with mean

In [21]:
users["Age"] = users["Age"].replace(np.nan,users["Age"].mean())
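
The three age-cleaning steps (mask implausible values, fill with the mean, cast to integer) can be sketched on a small made-up series. The bounds 5 and 89 mirror the ages the cell above keeps; the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up ages, including missing, too-low and too-high values
ages = pd.Series([np.nan, 18, 3, 244, 61, 89, 90])

# Step 1: keep ages 5..89, turn everything else into NaN
valid = ages.where(ages.between(5, 89))

# Step 2: fill the gaps with the mean of the remaining valid ages
filled = valid.fillna(valid.mean())

# Step 3: cast to integer (this truncates any fractional mean)
as_int = filled.astype("int64")
print(as_int.tolist())
```

Mean imputation keeps the column complete but concentrates many users on a single value, which is worth keeping in mind if `Age` is later used as a feature.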

In [22]:
users["Age"].unique()

array([34.72218325, 18.        , 17.        , 61.        , 26.        ,
       14.        , 25.        , 19.        , 46.        , 55.        ,
       32.        , 24.        , 20.        , 34.        , 23.        ,
       51.        , 31.        , 21.        , 44.        , 30.        ,
       57.        , 43.        , 37.        , 41.        , 54.        ,
       42.        , 50.        , 39.        , 53.        , 47.        ,
       36.        , 28.        , 35.        , 13.        , 58.        ,
       49.        , 38.        , 45.        , 62.        , 63.        ,
       27.        , 33.        , 29.        , 66.        , 40.        ,
       15.        , 60.        , 79.        , 22.        , 16.        ,
       65.        , 59.        , 48.        , 72.        , 56.        ,
       67.        , 80.        , 52.        , 69.        , 71.        ,
       73.        , 78.        ,  9.        , 64.        , 12.        ,
       74.        , 75.        , 76.        , 83.        , 68.  

### Change the datatype of `Age` to `int`

In [23]:
users["Age"] = users["Age"].astype('int64')

In [24]:
users["Age"].unique()

array([34, 18, 17, 61, 26, 14, 25, 19, 46, 55, 32, 24, 20, 23, 51, 31, 21,
       44, 30, 57, 43, 37, 41, 54, 42, 50, 39, 53, 47, 36, 28, 35, 13, 58,
       49, 38, 45, 62, 63, 27, 33, 29, 66, 40, 15, 60, 79, 22, 16, 65, 59,
       48, 72, 56, 67, 80, 52, 69, 71, 73, 78,  9, 64, 12, 74, 75, 76, 83,
       68, 11, 77, 70,  8,  7, 81, 10,  5,  6, 84, 82, 85, 86, 87, 89, 88])

In [25]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]


## Exploring the Ratings Dataset

### Check the shape

In [26]:
ratings.shape
ratings.dtypes

(1149780, 3)

userID         int64
ISBN          object
bookRating     int64
dtype: object

In [27]:
n_users = users.shape[0]
n_books = books.shape[0]
n_users
n_books

278858

271355

In [28]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should only contain books that exist in our books dataset. Drop the remaining rows

In [29]:
#r = ratings[ratings["userID"].isin([276725,276726])]
#u = users[users["userID"].isin([276725])]

#new_rating = pd.merge(r,u,on = "userID")
new_rating = pd.merge(ratings,books,how = 'inner',on = "ISBN")
new_rating.shape

(1031130, 7)

### The ratings dataset should only contain ratings from users who exist in the users dataset. Drop the remaining rows

In [30]:
new_rating = pd.merge(new_rating,users,how = 'inner',on = "userID")
new_rating.shape
new_ratings = new_rating.loc[:,["userID","ISBN","bookRating"]]

(1031130, 9)
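
An inner merge filters the rows but also appends every column of the right-hand table, which is why the shape grows to 7 and then 9 columns before the three rating columns are selected again. A hedged alternative, sketched here on made-up frames, filters with `isin` and leaves the columns untouched:

```python
import pandas as pd

# Made-up miniature versions of the three tables
ratings_demo = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["a", "b", "a", "z"],
    "bookRating": [5, 0, 8, 7],
})
books_demo = pd.DataFrame({"ISBN": ["a", "b"]})
users_demo = pd.DataFrame({"userID": [1, 2]})

# Keep ratings whose book AND user are both known
known_book = ratings_demo["ISBN"].isin(books_demo["ISBN"])
known_user = ratings_demo["userID"].isin(users_demo["userID"])
kept = ratings_demo[known_book & known_user]
print(kept)
```

Unlike a merge, `isin` cannot duplicate rating rows if a key happens to appear more than once in the lookup table.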

In [31]:
new_ratings.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,2313,034545104X,5
2,2313,0812533550,9
3,2313,0679745580,8
4,2313,0060173289,9


### Keep only explicit ratings (1-10) and drop the 0s in column `bookRating`

In [32]:
#new_ratings[new_ratings["bookRating"]==0]
new_ratings[new_ratings["bookRating"]==0].count()
new_ratings = new_ratings[new_ratings["bookRating"] != 0].copy()  # keep explicit ratings (1-10) only

userID        647291
ISBN          647291
bookRating    647291
dtype: int64

In [33]:
new_ratings.shape

(383839, 3)

### Find out which rating has been given the highest number of times

In [34]:
new_ratings["bookRating"].value_counts()

8     91804
10    71225
7     66401
9     60776
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64
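
`value_counts()` sorts by frequency, so the first row (rating 8 above) is the answer. To obtain it programmatically, `idxmax()` on the counts returns the most frequent value. A tiny sketch with made-up ratings:

```python
import pandas as pd

# Made-up ratings; 8 occurs three times
demo_ratings = pd.Series([8, 10, 8, 7, 8, 10])

counts = demo_ratings.value_counts()
most_common = counts.idxmax()  # the rating value with the highest count
print(most_common, counts[most_common])
```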

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results only consider users who have rated at least 100 books

In [35]:
l = new_ratings.groupby(["userID"],as_index =False).count()
new_ratings_users = l[l["ISBN"]>=100]
new_ratings_users.head()



Unnamed: 0,userID,ISBN,bookRating
481,2033,129,129
508,2110,103,103
554,2276,196,196
967,4017,154,154
1055,4385,212,212


In [36]:
new_ratings_users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 449 entries, 481 to 67980
Data columns (total 3 columns):
userID        449 non-null int64
ISBN          449 non-null int64
bookRating    449 non-null int64
dtypes: int64(3)
memory usage: 14.0 KB


In [37]:
collabrative_filtering = pd.merge(new_ratings,new_ratings_users,how = 'inner',on = "userID")

In [38]:
collabrative_filtering.shape
collabrative_filtering = collabrative_filtering.head(1000)
collabrative_filtering.shape

(103269, 5)

(1000, 5)

In [39]:
collabrative_filtering = collabrative_filtering.loc[:,["userID","ISBN_x","bookRating_x"]]
collabrative_filtering =collabrative_filtering.rename(columns={"userID": "userID", "ISBN_x": "ISBN","bookRating_x":"bookRating"})
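
The merge above produces suffixed columns (`ISBN_x`, `bookRating_x`) that then have to be selected and renamed. As a sketch of an alternative, assuming the same goal of keeping only active users, `groupby(...).transform("count")` yields a per-row count that can be used directly as a filter. The frame and the threshold of 2 are made up; the notebook uses 100.

```python
import pandas as pd

# Made-up ratings: user 1 has 3 ratings, user 2 has 1, user 3 has 2
demo = pd.DataFrame({
    "userID": [1, 1, 1, 2, 3, 3],
    "ISBN": list("abcdef"),
    "bookRating": [5, 6, 7, 8, 9, 10],
})

min_ratings = 2  # hypothetical threshold for this toy example

# Number of ratings by the same user, repeated on every row of that user
ratings_per_user = demo.groupby("userID")["ISBN"].transform("count")
active = demo[ratings_per_user >= min_ratings]
print(active)
```

The column names stay unchanged, so no rename step is needed afterwards.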

In [40]:
import sys
!pip install surprise



In [41]:
!pip install --upgrade pip

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (19.0.3)


In [42]:
from surprise import Dataset,Reader
reader = Reader(rating_scale=(1, 10))

In [43]:
collabrative_filtering.userID=collabrative_filtering.userID.astype(str)
collabrative_filtering.ISBN=collabrative_filtering.ISBN.astype(str)

In [44]:
data = Dataset.load_from_df(collabrative_filtering[['userID', 'ISBN', 'bookRating']], reader)

In [45]:
data
# user item rating data can be obtained as follows


#for keys in user_records.keys():
 #   print(keys)


<surprise.dataset.DatasetAutoFolds at 0x7fce224807b8>

In [46]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=.25,random_state=123)


In [47]:
#user_records[3]
print(trainset.to_raw_uid(0))
print(trainset.to_raw_iid(0))

6543
0515075345


### Generating ratings matrix from explicit ratings


In [48]:
from surprise import KNNWithMeans
from surprise import accuracy
from surprise import Prediction

In [49]:
algo = KNNWithMeans(k=1, sim_options={'name': 'pearson', 'user_based': False})
algo.fit(trainset)


Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7fce23062cc0>

In [50]:
len(testset[:])

250

In [51]:
# Evaluate on test set
test_pred = algo.test(testset)

# compute RMSE
accuracy.rmse(test_pred)


RMSE: 1.2990


1.2989808189499952
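
RMSE is the square root of the mean squared difference between predicted and actual ratings. A minimal NumPy sketch on made-up numbers, to show what the figure above measures:

```python
import numpy as np

# Made-up actual and predicted ratings
actual = np.array([8.0, 7.0, 10.0, 5.0])
predicted = np.array([8.5, 7.0, 9.0, 6.0])

# Square the errors, average them, take the square root
rmse = float(np.sqrt(np.mean((predicted - actual) ** 2)))
print(rmse)
```

On the 1-10 scale used here, an RMSE of about 1.3 means predictions are typically off by a little more than one rating point.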

In [52]:
test_pred[:2]

[Prediction(uid='23768', iid='0684196387', r_ui=8.0, est=8.428, details={'was_impossible': True, 'reason': 'User and/or item is unkown.'}),
 Prediction(uid='98391', iid='0312983271', r_ui=8.0, est=8.428, details={'was_impossible': True, 'reason': 'User and/or item is unkown.'})]

In [53]:
# convert results to dataframe
test_pred_df = pd.DataFrame(test_pred)
test_pred_df["was_impossible"] = [x["was_impossible"] for x in test_pred_df["details"]]
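
In the two predictions shown above, `was_impossible` is `True` and both estimates equal 8.428: when Surprise cannot compute a neighbourhood-based estimate, it falls back to a default prediction, which is the global mean of the training ratings. It is therefore worth checking how many test predictions are such fallbacks. The sketch below uses a stand-in namedtuple with the same field names as `surprise.Prediction` and made-up values, so that it runs without Surprise installed.

```python
from collections import namedtuple

import pandas as pd

# Stand-in for surprise.Prediction (same field names); values are made up
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

demo_pred = [
    Prediction("u1", "i1", 8.0, 8.4, {"was_impossible": True}),
    Prediction("u1", "i2", 7.0, 7.5, {"was_impossible": False}),
    Prediction("u2", "i1", 9.0, 8.4, {"was_impossible": True}),
    Prediction("u2", "i3", 6.0, 7.0, {"was_impossible": False}),
]

demo_df = pd.DataFrame(demo_pred)
demo_df["was_impossible"] = demo_df["details"].apply(lambda d: d["was_impossible"])

# Share of predictions that are only the fallback value
share_impossible = demo_df["was_impossible"].mean()

# Predictions that were actually computed from neighbours
real_estimates = demo_df[~demo_df["was_impossible"]]
print(share_impossible, len(real_estimates))
```

If most test predictions are fallbacks, the reported RMSE mainly reflects how well the global mean fits, not the quality of the neighbourhood model.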

#### Note: the training algorithms cannot handle NaNs, so replace them with 0, which indicates the absence of a rating

In [54]:
#collabrative_filtering = pd.merge(new_ratings,new_ratings_users,how = 'inner',on = "userID")
#collabrative_filtering.shape

### Generate the predicted ratings using SVD (the exercise asks for 50 singular values; the 3-user sample below allows only `k = 2`)

In [55]:
collabrative_filtering.head()
collabrative_filtering.dtypes
from sklearn.model_selection import train_test_split
# Note: this rebinds `trainset` from the Surprise trainset above to a pandas DataFrame
trainset, tempDF = train_test_split(collabrative_filtering, test_size=.25,random_state=123)
testDF = tempDF.copy()  # keep the true held-out ratings for later comparison
tempDF = tempDF.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
tempDF["bookRating"] = np.nan  # mask the held-out ratings

Unnamed: 0,userID,ISBN,bookRating
0,6543,446605484,10
1,6543,805062971,8
2,6543,345342968,8
3,6543,446610038,9
4,6543,61009059,8


userID        object
ISBN          object
bookRating     int64
dtype: object


In [56]:
collabrative_filtering["userID"].unique()

array(['6543', '23768', '98391'], dtype=object)

In [57]:
testDF = testDF.dropna()
testDF.head()

Unnamed: 0,userID,ISBN,bookRating
131,6543,515075345,8
203,23768,671880187,8
50,6543,451207246,7
585,98391,399242562,9
138,6543,553575279,9


In [58]:
ratings = pd.concat([trainset, tempDF]).reset_index()
#ratings["bookRating"] ==np.nan

In [130]:
tempDF.head()
ratings[(ratings["userID"]=='98391')].isna().sum()
ratings[(ratings["userID"]=='98391')].count()
ratings.shape
ratings[(ratings["userID"]=='98391')&(ratings["ISBN"]=='0312984863') ]

Unnamed: 0,userID,ISBN,bookRating
131,6543,515075345,
203,23768,671880187,
50,6543,451207246,
585,98391,399242562,
138,6543,553575279,


index           0
userID          0
ISBN            0
bookRating    147
dtype: int64

index         616
userID        616
ISBN          616
bookRating    469
dtype: int64

(1000, 4)

Unnamed: 0,index,userID,ISBN,bookRating
113,918,98391,312984863,10.0


In [60]:

R_df = ratings.pivot(index = 'userID', columns ='ISBN', values = 'bookRating').fillna(0)
R_df.tail()

userID,0002712172,0026888130,0028621697,0030850746,0060001445,0060001453,0060002050,0060002433,006000469X,0060005556,...,1885222645,1885222653,1885222661,188522267X,1885222696,188522270X,1885222718,1885222726,1885222734,1931402213
23768,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,7.0,0.0,7.0,7.0,0.0,7.0,7.0,0.0
6543,7.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98391,0.0,0.0,0.0,0.0,8.0,9.0,8.0,0.0,8.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0
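
The pivot turns the long list of ratings into a user-by-book matrix. A small sketch on made-up data shows that, after `fillna(0)`, a masked (held-out) rating and a book the user never rated look identical:

```python
import numpy as np
import pandas as pd

# Made-up long-format ratings; u1's rating of book "b" is masked (NaN)
demo = pd.DataFrame({
    "userID": ["u1", "u1", "u2"],
    "ISBN": ["a", "b", "a"],
    "bookRating": [8.0, np.nan, 6.0],
})

# Rows are users, columns are books, missing combinations become 0
R = demo.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)
print(R)
```

Here `R.loc["u1", "b"]` (masked) and `R.loc["u2", "b"]` (never rated) are both 0, which is exactly the "absence of rating" convention described in the note above.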


In [61]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_df, k = 2)
U.shape
sigma.shape
Vt.shape

(3, 2)

(2,)

(2, 978)

In [62]:
sigma = np.diag(sigma)
sigma.shape
sigma
type(U)

(2, 2)

array([[ 99.05718244,   0.        ],
       [  0.        , 191.56211831]])

numpy.ndarray

In [63]:
# Reconstruct the rank-2 approximation; plain matrix products avoid overwriting U in place
all_user_predicted_ratings = U @ sigma @ Vt

preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
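
The reconstruction step can be illustrated on a tiny made-up matrix. This sketch swaps in `np.linalg.svd` (which returns singular values in descending order) and truncates by hand, instead of the `scipy.sparse.linalg.svds` call used in the notebook; the idea of keeping `k` singular values and multiplying the factors back together is the same.

```python
import numpy as np

# Made-up 3-user x 4-book ratings matrix (0 = no rating)
R = np.array([
    [8.0, 0.0, 7.0, 0.0],
    [0.0, 9.0, 0.0, 6.0],
    [8.0, 0.0, 0.0, 7.0],
])
k = 2  # number of singular values to keep

# Full (thin) SVD, then keep the k largest singular values
U_full, s_full, Vt_full = np.linalg.svd(R, full_matrices=False)
U_k, s_k, Vt_k = U_full[:, :k], s_full[:k], Vt_full[:k, :]

# Rank-k approximation of R: these are the "predicted ratings"
R_hat = U_k @ np.diag(s_k) @ Vt_k

# Frobenius-norm error of the approximation
approx_error = np.linalg.norm(R - R_hat)
print(np.round(R_hat, 2))
```

With only one singular value dropped, the error equals that dropped singular value, so the reconstruction stays close to the input. This is also why the predictions printed below are nearly identical to the original ratings when `k = 2` is used on a 3-user matrix.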

In [64]:
preds_df.head()

ISBN,0002712172,0026888130,0028621697,0030850746,0060001445,0060001453,0060002050,0060002433,006000469X,0060005556,...,1885222645,1885222653,1885222661,188522267X,1885222696,188522270X,1885222718,1885222726,1885222734,1931402213
0,-0.008054,7.999989,0.0,0.0,0.000182,0.000204,0.000182,-0.006903,0.000182,0.000204,...,0.0,0.0,6.999991,0.0,6.999991,6.999991,0.0,6.999991,6.999991,0.000182
1,0.002736,-0.009204,0.0,0.0,0.157863,0.177596,0.157863,0.002345,0.157863,0.177596,...,0.0,0.0,-0.008054,0.0,-0.008054,-0.008054,0.0,-0.008054,-0.008054,0.157863
2,0.13813,0.000182,0.0,0.0,7.996884,8.996494,7.996884,0.118397,7.996884,8.996494,...,0.0,0.0,0.000159,0.0,0.000159,0.000159,0.0,0.000159,0.000159,7.996884


### Take a particular user_id

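### Let's find the recommendations for the user with id `2110` (the cells below use `98391`, because `2110` is not among the three users in the 1000-row sample)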

#### Note: Execute the below cells to get the variables loaded

In [131]:
userID = '98391'
print("Using user id 98391 as per my data set")

Using user id 98391 as per my data set


In [91]:
user_id = 2 # zero-based row index; user 98391 is the 3rd row of the ratings and prediction matrices

In [107]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


### Get the predicted ratings for userID `2110` and sort them in descending order

In [143]:
def recommend_books(predictions_df, matrix_id,userID, books_df, original_ratings_df, num_recommendations=10):

    # Get and sort the user's predictions
    user_row_number = matrix_id - 1 # matrix_id is 1-based; rows of predictions_df are 0-based
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)

    # Get the user's ratings and merge in the book information
    user_data = original_ratings_df[original_ratings_df.userID == (userID)]
    # Adds title, author, year and publisher
    user_full = (user_data.merge(books_df, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
                     sort_values(['bookRating'], ascending=False)
                 )

    print ('User {0} has already rated {1} Books.'.format(userID, user_full.shape[0]))
    print ('Recommending the highest {0} predicted ratings Books not already rated.'.format(num_recommendations))
    print(num_recommendations)

    # Recommend the books with the highest predicted ratings that the user has not rated yet
    recommendations = (books_df[~books_df['ISBN'].isin(user_full['ISBN'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'ISBN',
               right_on = 'ISBN').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations, sorted_user_predictions, user_data, user_full

already_rated, predictions, sorted_user_predictions, user_data, user_full = recommend_books(preds_df,3, userID, books, ratings, 10)

User 98391 has already rated 616 Books.
Recommending the highest 10 predicted ratings Books not already rated.
10


In [79]:
ratings.head()

Unnamed: 0,index,userID,ISBN,bookRating
0,894,98391,006000469X,8.0
1,941,98391,084395275X,10.0
2,285,23768,0385469683,8.0
3,462,98391,051513628X,10.0
4,370,23768,1885222580,7.0


In [104]:
ratings["userID"].unique()

array(['98391', '23768', '6543'], dtype=object)

In [138]:
# user_id (row index 2) was set above; it selects user 98391's row in preds_df
sorted_user_predictions = preds_df.iloc[user_id].sort_values(ascending=False)


In [127]:
ratings[ratings["userID"] == '98391'].count()

index         616
userID        616
ISBN          616
bookRating    469
dtype: int64

In [144]:
predictions

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
14801,836218051,The Essential Calvin and Hobbes,Bill Watterson,1988,Andrews McMeel Publishing
18915,451165624,On the Road,Jack Kerouac,1981,Signet Book
18794,375502971,"A Dog Year: Twelve Months, Four Dogs, and Me",Jon Katz,2002,Villard Books
1652,872860175,Howl and Other Poems (Pocket Poets),Allen Ginsberg,1956,City Lights Publishers
18843,451179285,The Stand: The Complete & Uncut Edition,Stephen King,1994,New Amer Library
18841,553251481,Jitterbug Perfume,Tom Robbins,1985,Bantam Books
215,446605484,Roses Are Red (Alex Cross Novels),James Patterson,2001,Warner Vision
1244,380813815,"Lamb : The Gospel According to Biff, Christ's ...",Christopher Moore,2003,Perennial
13280,312099436,Women of the Silk : A Novel,Gail Tsukiyama,1993,St. Martin's Griffin
18856,688021174,The Highwayman,Alfred Noyes,1983,Harpercollins Juvenile Books
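
Stripped of the merges, the selection logic behind this table is: take the user's predicted ratings, remove the books already rated, and keep the top N. A minimal sketch with made-up ISBN labels and scores:

```python
import pandas as pd

# Made-up predicted ratings for one user, indexed by (hypothetical) ISBN
predicted = pd.Series({"A": 9.1, "B": 7.4, "C": 8.8, "D": 2.0})

# Books the user has already rated
already_rated_isbns = {"A"}

# Drop rated books, then keep the 2 highest predictions
top = predicted.drop(labels=list(already_rated_isbns)).nlargest(2)
print(top)
```

Book "A" has the highest score but is excluded because the user already rated it, so "C" and "B" are recommended.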


### Create a dataframe named `user_data` containing the books that userID `2110` explicitly interacted with

In [109]:
already_rated.head()

Unnamed: 0,index,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
65,918,98391,0312984863,10.0,The Impostor : The Liar's Club (Liars Club),Celeste Bradley,2003,St. Martin's Paperbacks
401,539,98391,0425190927,10.0,The Wife Test,Betina Krahn,2003,Berkley Publishing Group
150,443,98391,039915180X,10.0,Hidden Prey,John Sandford,2004,Putnam Publishing Group
152,848,98391,0451210727,10.0,Murder of a Barbie and Ken (National Bestselli...,Denise Swanson,2003,Signet Book
86,429,98391,0743444477,10.0,Letters for Emily,Camron Wright,2003,Pocket


In [110]:
user_data.head()

Unnamed: 0,index,userID,ISBN,bookRating
0,894,98391,006000469X,8.0
1,941,98391,084395275X,10.0
3,462,98391,051513628X,10.0
6,621,98391,0743211375,9.0
7,415,98391,0743422732,8.0


In [133]:
user_data.shape
user_data[(user_data["userID"]=='98391')].count()

(616, 4)

index         616
userID        616
ISBN          616
bookRating    469
dtype: int64

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [132]:
book_data.shape

NameError: name 'book_data' is not defined

In [None]:
book_data.head()

In [None]:
user_full_info.head()
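
`book_data` is never defined in this notebook, hence the `NameError`; the recommendation function above already builds the equivalent frame and returns it as `user_full`. A minimal sketch of the requested merge, assuming `book_data` is simply a table of book details keyed by `ISBN`. The first two rows reuse titles shown earlier; the third row is a made-up placeholder.

```python
import pandas as pd

# Two of the user's ratings, as shown in the outputs above
user_data = pd.DataFrame({
    "userID": ["98391", "98391"],
    "ISBN": ["0312984863", "0425190927"],
    "bookRating": [10.0, 10.0],
})

# Assumed shape of book_data; the last row is a hypothetical unrated book
book_data = pd.DataFrame({
    "ISBN": ["0312984863", "0425190927", "0000000000"],
    "bookTitle": [
        "The Impostor : The Liar's Club (Liars Club)",
        "The Wife Test",
        "Placeholder Title",
    ],
})

# A left merge keeps every rating and attaches the book details
user_full_info = user_data.merge(book_data, on="ISBN", how="left")
print(user_full_info)
```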

### Get top 10 recommendations for above given userID from the books not already rated by that user

In [145]:
print("Done in previous steps")

Done in previous steps
