**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denoting higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [1]:
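# Loading data. error_bad_lines=False skips malformed rows; it was deprecated in pandas 1.3
# and removed in 2.0, so on newer versions use on_bad_lines="skip" instead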
import numpy as np
import pandas as pd
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


### Check the number of records and features in each dataset

In [2]:
books.shape

(271360, 8)

In [5]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
imageUrlS            object
imageUrlM            object
imageUrlL            object
dtype: object

### The books dataframe consists of 271360 rows and 8 columns. All columns are of object dtype and hold the ISBN, title, author, year of publication, publisher and three image URLs of each book.

In [3]:
users.shape

(278858, 3)

In [6]:
users.dtypes

userID        int64
Location     object
Age         float64
dtype: object

### The users dataframe consists of 278858 rows and three columns: 'userID' (int64), 'Location' (object) and 'Age' (float64).

In [7]:
ratings.shape

(1149780, 3)

In [8]:
ratings.dtypes

userID         int64
ISBN          object
bookRating     int64
dtype: object

### The ratings dataframe consists of 1149780 rows and three columns: 'userID', 'ISBN' and 'bookRating'.

## Exploring books dataset

In [9]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [17]:
books.drop(["imageUrlS", "imageUrlM", "imageUrlL"], axis=1, inplace=True)

In [18]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication
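The cell for this step is not shown in the notebook. As a minimal sketch, one way to inspect the column is `Series.unique()`, followed by `pd.to_numeric(..., errors="coerce")` to flag entries that are not years. The frame `toy_books` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the books table; values are illustrative only
toy_books = pd.DataFrame({
    "ISBN": ["a", "b", "c", "d"],
    "yearOfPublication": [2002, "2002", "DK Publishing Inc", "1999"],
})

# Distinct raw values: 2002 (int) and "2002" (str) are reported separately
unique_years = toy_books["yearOfPublication"].unique()
print(unique_years)

# Entries that cannot be parsed as numbers become NaN with errors="coerce"
parsed = pd.to_numeric(toy_books["yearOfPublication"], errors="coerce")
bad_rows = toy_books[parsed.isna()]
print(bad_rows)
```

On the real table, the same two calls would be applied to `books['yearOfPublication']`.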


The unique values of this field reveal some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' have been loaded as yearOfPublication, apparently because of malformed rows in the CSV file that shifted the columns.


In addition, the same year appears both as a string and as a number. Both problems are fixed in the steps that follow.

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [24]:
books[books['yearOfPublication']=="DK Publishing Inc"]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


In [25]:
books[books['yearOfPublication']=="Gallimard"]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [26]:
books.drop(books[books['yearOfPublication']=="DK Publishing Inc"].index, inplace=True)
books.drop(books[books['yearOfPublication']=="Gallimard"].index, inplace=True)

### Change the datatype of yearOfPublication to 'int'

In [27]:
# astype returns a new object; assign the result so the conversion persists
books['yearOfPublication'] = books['yearOfPublication'].astype('int32')

In [28]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [30]:
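# Drop the rows themselves; Series.dropna() alone returns a copy and leaves books unchanged
books.dropna(subset=['publisher'], inplace=True)
books['publisher']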

0                                  Oxford University Press
1                                    HarperFlamingo Canada
2                                          HarperPerennial
3                                     Farrar Straus Giroux
4                                   W. W. Norton & Company
5                                         Putnam Pub Group
6                                 Berkley Publishing Group
7                                               Audioworks
8                                             Random House
9                                                 Scribner
10                                         Emblem Editions
11                                           Citadel Press
12                                   House of Anansi Press
13                                              Mira Books
14                                   Health Communications
15                                Brilliance Audio - Trade
16                             Kensington Publishing Cor

## Exploring Users dataset

In [31]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [52]:
ageunique=users['Age'].unique()
age=pd.DataFrame(ageunique)  # wrap the array of unique ages so it can be sorted
age.sort_values(by=0, ascending=True)

Unnamed: 0,0
47,0.0
57,1.0
79,2.0
72,3.0
84,4.0
90,5.0
93,6.0
83,7.0
82,8.0
64,9.0


The Age column has some invalid entries, such as NaN, 0 and very high values (100 and above).

### Values below 5 and above 90 do not make much sense for book ratings, so replace them with NaNs

In [65]:
users.loc[(users['Age']<5),'Age']=np.nan
users.loc[(users['Age']>90),'Age']=np.nan

### Replace null values in column `Age` with mean

In [97]:
# Mean of the distinct non-null ages (47.5 here); note that this is not the mean over all users
a=np.nanmean(users['Age'].unique())
users['Age']=users['Age'].fillna(a)
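The cell above averages the distinct ages, so every age counts once regardless of how many users share it. The sketch below, on a made-up series `toy_age` (illustrative values, not from the dataset), shows how that differs from the column mean, which weights each user equally. Which of the two is intended is a modelling choice this notebook does not spell out.

```python
import numpy as np
import pandas as pd

# Toy ages; values are illustrative only
toy_age = pd.Series([20.0, 20.0, 20.0, 60.0, np.nan])

# Mean over all non-missing rows (each user counts once)
column_mean = toy_age.mean()

# Mean over distinct values only (each age counts once, however common)
unique_mean = np.nanmean(toy_age.unique())

# Impute missing ages with the column mean
filled = toy_age.fillna(column_mean)
print(column_mean, unique_mean)
```

On the real table the column-mean variant would be `users['Age'].fillna(users['Age'].mean())`.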

### Change the datatype of `Age` to `int`

In [98]:
# astype returns a new object; assign the result so the conversion persists (47.5 is truncated to 47)
users['Age'] = users['Age'].astype('int32')
users.dtypes

userID       int64
Location    object
Age          int32
dtype: object

In [99]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


## Exploring the Ratings Dataset

### Check the shape

In [100]:
ratings.shape

(1149780, 3)

In [101]:
n_users = users.shape[0]
n_books = books.shape[0]

In [102]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Ratings dataset should have books only which exist in our books dataset. Drop the remaining rows

In [218]:
ratings = ratings.merge(books, on="ISBN", how = 'inner')
ratings.drop(["bookTitle", "bookAuthor", "yearOfPublication", "publisher"], axis=1, inplace=True)
ratings.shape

(383841, 3)
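The merge-then-drop above acts as a filter on `ISBN`. As a hedged alternative sketch, the same filter can be written with `Series.isin`, which leaves the columns of the ratings table untouched and cannot duplicate rows if an ISBN were listed twice in the books table. The frames `toy_books` and `toy_ratings` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the books and ratings tables; values are illustrative only
toy_books = pd.DataFrame({"ISBN": ["A", "B"], "bookTitle": ["t1", "t2"]})
toy_ratings = pd.DataFrame({
    "userID": [1, 1, 2],
    "ISBN": ["A", "Z", "B"],
    "bookRating": [5, 0, 8],
})

# Keep only ratings whose ISBN appears in the books table
filtered = toy_ratings[toy_ratings["ISBN"].isin(toy_books["ISBN"])]
print(filtered)
```

The same pattern applies to the user filter in the next step, with `userID` in place of `ISBN`.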

### Ratings dataset should have ratings from users which exist in users dataset. Drop the remaining rows

In [221]:
ratings = ratings.merge(users, on="userID", how = 'inner')
ratings.drop(["Location", "Age"], axis=1, inplace=True)
ratings

Unnamed: 0,userID,ISBN,bookRating
0,2313,034545104X,5
1,2313,0812533550,9
2,2313,0679745580,8
3,2313,0060173289,9
4,2313,0385482388,5
5,2313,0399146431,5
6,2313,0345348036,9
7,2313,0553278223,7
8,2313,0020442602,9
9,2313,0295955252,8


### Consider only explicit ratings from 1-10 and leave out the 0s in column `bookRating`

In [222]:
bookRating=ratings[ratings['bookRating']==0]  # implicit ratings (0), set aside
ratings=ratings[(ratings['bookRating']>=1) & (ratings['bookRating']<=10)]  # explicit ratings only
ratings

Unnamed: 0,userID,ISBN,bookRating
0,2313,034545104X,5
1,2313,0812533550,9
2,2313,0679745580,8
3,2313,0060173289,9
4,2313,0385482388,5
5,2313,0399146431,5
6,2313,0345348036,9
7,2313,0553278223,7
8,2313,0020442602,9
9,2313,0295955252,8


### Find out which rating has been given the highest number of times

In [223]:
ratings['bookRating'].value_counts()


8     91804
10    71225
7     66401
9     60778
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [224]:
uid=ratings['userID'].value_counts()
ratings_exp=ratings[ratings['userID'].isin(uid[uid>=100].index)]

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [230]:
rating_matrix=ratings_exp.pivot(index='ISBN', columns='userID', values='bookRating').fillna(0)
rating_matrix.head()

ISBN,2033,2110,2276,4017,4385,5582,6242,6251,6543,6575,...,269566,270713,271448,271705,273113,274061,274301,275970,277427,278418
0000913154,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0001046438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000104687X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0001047213,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0001047973,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


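### Generate the predicted ratings using SVD with 50 singular values

In [231]:
from scipy.sparse.linalg import svds
# Truncated SVD of the book-by-user matrix, keeping the 50 largest singular values
u,s,vt=svds(rating_matrix.values, k=50)
s=np.diag(s)
# Rank-50 reconstruction; rows follow rating_matrix.index (ISBNs), columns are userIDs
all_user_ratings=np.dot(np.dot(u,s), vt)
predictions=pd.DataFrame(all_user_ratings, columns=rating_matrix.columns)
predictions.head()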

userID,2033,2110,2276,4017,4385,5582,6242,6251,6543,6575,...,269566,270713,271448,271705,273113,274061,274301,275970,277427,278418
0,0.025341,-0.010012,-0.015054,-0.021499,0.002077,-0.002046,-0.01592,-0.010875,0.04093,0.023473,...,-0.013891,-0.042651,-0.026051,0.012979,-0.000171,-0.013295,0.017231,0.003814,0.07802,0.008056
1,-0.002146,-0.003669,-0.015457,0.035602,-0.007965,0.018614,0.020221,-0.010051,-0.030352,-0.004168,...,-0.008549,-0.007464,-0.030714,0.069724,-0.002851,-0.002811,0.020953,-0.011141,-0.024439,0.011625
2,-0.001431,-0.002446,-0.010304,0.023735,-0.00531,0.012409,0.013481,-0.006701,-0.020235,-0.002778,...,-0.005699,-0.004976,-0.020476,0.046483,-0.001901,-0.001874,0.013969,-0.007427,-0.016292,0.00775
3,-0.002146,-0.003669,-0.015457,0.035602,-0.007965,0.018614,0.020221,-0.010051,-0.030352,-0.004168,...,-0.008549,-0.007464,-0.030714,0.069724,-0.002851,-0.002811,0.020953,-0.011141,-0.024439,0.011625
4,-0.002146,-0.003669,-0.015457,0.035602,-0.007965,0.018614,0.020221,-0.010051,-0.030352,-0.004168,...,-0.008549,-0.007464,-0.030714,0.069724,-0.002851,-0.002811,0.020953,-0.011141,-0.024439,0.011625
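To make the reconstruction step concrete, here is a minimal sketch on a made-up 5x4 matrix (`toy_matrix`, illustrative values only). It keeps the two largest singular values and multiplies the factors back together, which is the same operation as above with `k=2` instead of `k=50`.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy book-by-user matrix (rows: books, columns: users); 0 means "not rated"
toy_matrix = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
    [5.0, 0.0, 0.0, 0.0],
])

# k must be smaller than min(toy_matrix.shape)
u, s, vt = svds(toy_matrix, k=2)

# Rank-k reconstruction: U * diag(s) * V^T
approx = u @ np.diag(s) @ vt
print(np.round(approx, 2))
```

Entries that were 0 in the input generally come back as small non-zero scores, and these scores are what the notebook ranks as predicted ratings.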


### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [207]:
userID = 2110

In [208]:
user_id = 2 # userID 2110 is the 2nd column in the ratings matrix and the predicted matrix

### Get the predicted ratings for userID `2110` and sort them in descending order

In [235]:
predictions[2110].sort_values(ascending=False)

35086    0.682443
11783    0.368945
11932    0.333624
23462    0.333209
23924    0.329336
27719    0.313295
23356    0.305088
23308    0.290587
17767    0.278563
11654    0.250941
27856    0.249253
23351    0.242676
36193    0.239957
53790    0.239552
36192    0.239242
27791    0.234959
25396    0.231818
27907    0.229402
23958    0.228038
1307     0.227935
23350    0.226968
11481    0.223613
25242    0.221496
58868    0.221496
27745    0.221396
27962    0.221054
11319    0.219552
27678    0.218949
9824     0.218348
36544    0.216858
           ...   
34192   -0.042664
41554   -0.043207
46228   -0.043301
33945   -0.043553
22991   -0.043825
38277   -0.044025
11707   -0.044076
22821   -0.044668
26089   -0.044738
40833   -0.047424
40832   -0.047690
39619   -0.047838
34526   -0.048401
28599   -0.048595
19091   -0.048595
19024   -0.048595
22882   -0.048705
21433   -0.049460
33352   -0.050022
33489   -0.050099
43203   -0.052753
21373   -0.052800
41668   -0.053488
23093   -0.054607
33899   -0

In [258]:
def recommend_books(predictions_df, userID, books_df, original_ratings_df, isbn_index, num_recommendations=10):
    # Users are the columns of predictions_df; its rows follow the order of isbn_index
    sorted_user_predictions = (pd.Series(predictions_df[userID].values, index=isbn_index, name='Predictions')
                               .sort_values(ascending=False))

    # Books the user has already rated, with the book details attached
    user_data = original_ratings_df[original_ratings_df.userID == userID]
    user_full = (user_data.merge(books_df, how='left', on='ISBN')
                 .sort_values(['bookRating'], ascending=False))
    print('User {0} has already rated {1} books.'.format(userID, user_full.dropna().shape[0]))
    print('Recommending the highest {0} predicted ratings books not already rated.'.format(num_recommendations))

    # Attach the predictions to the unrated books and keep the top N (the score column is dropped)
    recommendations = (books_df[~books_df['ISBN'].isin(user_full['ISBN'])]
                       .merge(sorted_user_predictions.rename_axis('ISBN').reset_index(), how='inner', on='ISBN')
                       .sort_values('Predictions', ascending=False)
                       .iloc[:num_recommendations, :-1])
    return user_full, recommendations, sorted_user_predictions, user_data

In [259]:
already_rated, recommendations, sorted_user_predictions, user_data = recommend_books(predictions, 2110, books, ratings_exp, rating_matrix.index, 10)

User 2110 has already rated 103 books.
Recommending the highest 10 predicted ratings books not already rated.

### Create a dataframe named `user_data` containing the books explicitly rated by userID `2110`

In [67]:
user_data.head()

Unnamed: 0,userID,ISBN,bookRating
14448,2110,0060987529,7
14449,2110,0064472779,8
14450,2110,0140022651,10
14452,2110,0142302163,8
14453,2110,0151008116,5


In [68]:
user_data.shape

(103, 3)

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`
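The cells that build `book_data` and `user_full_info` are not shown in the notebook. A minimal sketch of one way to construct them follows, using made-up frames (`toy_books`, `toy_ratings`) and a made-up user id of 7. The variable names match the notebook, but the construction itself is an assumption.

```python
import pandas as pd

# Toy stand-ins for the books and explicit-ratings tables; values are illustrative only
toy_books = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "bookTitle": ["t1", "t2", "t3"],
})
toy_ratings = pd.DataFrame({
    "userID": [7, 7, 9],
    "ISBN": ["A", "C", "B"],
    "bookRating": [8, 5, 10],
})

# Explicit ratings of one user
user_data = toy_ratings[toy_ratings["userID"] == 7]

# Book details for the ISBNs that user rated
book_data = toy_books[toy_books["ISBN"].isin(user_data["ISBN"])]

# One row per rating, with the book details attached
user_full_info = user_data.merge(book_data, on="ISBN", how="left")
print(user_full_info)
```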

In [70]:
book_data.shape

(103, 5)

In [71]:
book_data.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
246,0151008116,Life of Pi,Yann Martel,2002,Harcourt
904,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books
1000,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy
1302,0345307674,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books
1472,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket


In [73]:
user_full_info.head()

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,0060987529,7,Confessions of an Ugly Stepsister : A Novel,Gregory Maguire,2000,Regan Books
1,2110,0064472779,8,All-American Girl,Meg Cabot,2003,HarperTrophy
2,2110,0140022651,10,Journey to the Center of the Earth,Jules Verne,1965,Penguin Books
3,2110,0142302163,8,The Ghost Sitter,Peni R. Griffin,2002,Puffin Books
4,2110,0151008116,5,Life of Pi,Yann Martel,2002,Harcourt


### Get top 10 recommendations for above given userID from the books not already rated by that user
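No cell is shown for this step, so here is a hedged sketch of the idea on made-up data: take the user's predicted scores indexed by ISBN, drop the books already rated, keep the highest-scoring ones and attach the book details. The helper `top_n_unrated` and all names and values below are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Toy predicted scores for one user, indexed by ISBN (illustrative values)
toy_scores = pd.Series({"A": 0.9, "B": 0.7, "C": 0.4, "D": 0.1})
toy_books = pd.DataFrame({
    "ISBN": ["A", "B", "C", "D"],
    "bookTitle": ["t1", "t2", "t3", "t4"],
})
already_rated = {"A"}  # ISBNs the user has rated


def top_n_unrated(scores, books_df, rated_isbns, n=10):
    """Hypothetical helper: top-n predicted books the user has not rated."""
    # Drop rated books, then keep the n highest predicted scores
    candidates = scores[~scores.index.isin(rated_isbns)].nlargest(n)
    candidates = candidates.rename("Predictions").rename_axis("ISBN").reset_index()
    # Attach book details; a left merge keeps the score order of the left frame
    return candidates.merge(books_df, on="ISBN", how="left")


top = top_n_unrated(toy_scores, toy_books, already_rated, n=2)
print(top)
```

In the notebook's setting, `scores` would be the user's column of the predicted matrix re-indexed by `rating_matrix.index`, and `rated_isbns` would be `user_data['ISBN']`.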