**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books, and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project builds a book recommender system for users with user-based and item-based collaborative filtering approaches.

#### Execute the cell below to load the datasets

In [6]:
import pandas as pd

# Load the three tables (';'-separated, Latin-1 encoded).
# on_bad_lines="warn" (pandas >= 1.3) skips malformed rows and reports them;
# it replaces error_bad_lines=False, which was removed in pandas 2.0.
books = pd.read_csv("books.csv", sep=";", on_bad_lines="warn", encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']



b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
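
The `Skipping line ...` messages come from rows that contain more fields than the header declares. In pandas 1.3 and later, this behaviour is controlled by the `on_bad_lines` parameter of `read_csv`. The following is a minimal sketch on made-up in-memory data (not the Book-Crossing files), assuming a recent pandas version:

```python
import io
import pandas as pd

# Hypothetical three-column sample; the second data row has an extra field.
raw = "a;b;c\n1;2;3\n4;5;6;7\n8;9;10\n"

# on_bad_lines="skip" drops malformed rows silently;
# on_bad_lines="warn" drops them and also reports each one.
df = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip")
print(df.shape)
```

Only the two well-formed rows survive, which is why the row count of `books` can be lower than the number of lines in the file.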


### Check the number of records and features in each dataset

In [7]:
books.shape

(271360, 8)

In [8]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
imageUrlS            object
imageUrlM            object
imageUrlL            object
dtype: object

In [9]:
users.shape

(278858, 3)

In [10]:
ratings.shape

(1149780, 3)

## Exploring books dataset

In [11]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [0]:
books=books.drop(labels=['imageUrlS','imageUrlM','imageUrlL'],axis=1)

In [13]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [14]:
books['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

The output above shows several incorrect entries in this field. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of malformed rows in the CSV file.

Some entries are also stored as strings, while the same years appear elsewhere as integers. Both problems are fixed in the steps that follow.
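
One way to locate such entries without reading the whole array is to coerce the column to numbers and inspect what fails. This is a sketch on made-up values that imitate the mix seen above; the variable names are illustrative and not part of the notebook:

```python
import pandas as pd

# Made-up values imitating the mix of ints, numeric strings and stray text.
years = pd.Series([2002, "1999", "0", "Gallimard", 1995], dtype=object)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_numbers = pd.to_numeric(years, errors="coerce")

bad_rows = years[as_numbers.isna()]      # entries that are not years at all
clean = as_numbers.dropna().astype(int)  # remaining years with a single dtype
print(bad_rows.tolist())
print(clean.tolist())
```

The same mask could be used to drop the offending rows instead of looking up their index labels by hand.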

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [15]:
books[books['yearOfPublication']=='DK Publishing Inc']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


In [16]:
books[books['yearOfPublication']=='Gallimard']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [0]:
books=books.drop([209538,221678,220731])

In [18]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [0]:
import numpy as np


### Change the datatype of yearOfPublication to 'int'

In [0]:
# np.int was removed in NumPy 1.24; the built-in int gives the same int64 result
books['yearOfPublication']=books['yearOfPublication'].astype(int)

In [21]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [0]:
# Note: without `subset`, dropna removes rows with a NaN in any column, not only 'publisher'
books=books.dropna(axis=0)
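
To restrict the drop to a single column, `dropna` accepts a `subset` argument. A small sketch on a made-up frame (the values are illustrative only) shows the difference between the two calls:

```python
import numpy as np
import pandas as pd

# Made-up frame: row 1 lacks an author, row 2 lacks a publisher.
toy = pd.DataFrame({
    "bookAuthor": ["A", np.nan, "C"],
    "publisher": ["P1", "P2", np.nan],
})

any_nan = toy.dropna(axis=0)                        # drops rows 1 and 2
publisher_only = toy.dropna(subset=["publisher"])   # drops row 2 only
print(len(any_nan), len(publisher_only))
```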

In [23]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


## Exploring Users dataset

In [24]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [25]:
users.dtypes

userID        int64
Location     object
Age         float64
dtype: object

### Get all unique values in ascending order for column `Age`

In [26]:
users.sort_values('Age',ascending=True)['Age'].unique()

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 113., 114., 115., 116., 118., 119., 123., 124., 127.,
       128., 132., 133., 136., 137., 138., 140., 141., 143., 146., 147.,
       148., 151., 152., 156., 157., 159., 162., 168., 172., 175., 183.,
       186., 189., 199., 200., 201., 204., 207., 20

The `Age` column has invalid entries such as NaN and 0, as well as implausibly high values of 100 and above.

### Values below 5 and above 90 are implausible for book ratings, so replace them with NaNs

In [27]:
((users['Age']<5) | (users['Age']>90)).value_counts()

False    277546
True       1312
Name: Age, dtype: int64

In [0]:
users['Age']=np.where(((users['Age']<5) | (users['Age']>90)),np.nan,users['Age'])

In [29]:
users.sort_values('Age',ascending=True)['Age'].unique()

array([ 5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17.,
       18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30.,
       31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43.,
       44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55., 56.,
       57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68., 69.,
       70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81., 82.,
       83., 84., 85., 86., 87., 88., 89., 90., nan])
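
The same range filter can be written with `Series.between` and `Series.where`, which keeps values inside the range and sets everything else to NaN. A sketch on made-up ages:

```python
import numpy as np
import pandas as pd

# Made-up ages, including values outside the accepted 5..90 range.
ages = pd.Series([0.0, 4.0, 5.0, 34.0, 90.0, 91.0, np.nan])

# between() is inclusive on both ends by default; where() keeps values
# for which the condition holds and replaces the rest with NaN.
cleaned = ages.where(ages.between(5, 90))
print(cleaned.tolist())
```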

### Replace null values in column `Age` with mean

In [0]:
users['Age']=users['Age'].fillna(users['Age'].mean())  # mean() skips NaNs

### Change the datatype of `Age` to `int`

In [0]:
# np.int was removed in NumPy 1.24; astype(int) truncates the fractional part of the imputed mean
users['Age']=users['Age'].astype(int)

In [32]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
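
Because the mean is usually not a whole number, casting to `int` truncates the imputed value. Whether that matters depends on the use case; the sketch below, on made-up ages, compares truncation with rounding:

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value.
ages = pd.Series([20.0, 25.0, 30.0, 32.0, np.nan])

mean_age = ages.mean()  # NaN is skipped: (20 + 25 + 30 + 32) / 4
truncated = ages.fillna(mean_age).astype(int)
rounded = ages.fillna(mean_age).round().astype(int)
print(mean_age, truncated.iloc[-1], rounded.iloc[-1])
```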


## Exploring the Ratings Dataset

### Check the shape

In [33]:
ratings.shape

(1149780, 3)

In [0]:
n_users = users.shape[0]
n_books = books.shape[0]

In [35]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should contain only books that exist in our books dataset. Drop the remaining rows

In [0]:
ratings=ratings[ratings['ISBN'].isin(books['ISBN'])]

### The ratings dataset should contain only ratings from users that exist in the users dataset. Drop the remaining rows

In [0]:
ratings=ratings[ratings['userID'].isin(users['userID'])]

### Keep only the explicit ratings from 1 to 10 and drop the 0s in column `bookRating`

In [0]:
ratings=ratings[ratings['bookRating']!=0]
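
The three filters above can also be combined into one boolean mask. The following sketch uses tiny made-up frames (the `_toy` names are illustrative) to show which rows survive:

```python
import pandas as pd

books_toy = pd.DataFrame({"ISBN": ["111", "222"]})
users_toy = pd.DataFrame({"userID": [1, 2]})
ratings_toy = pd.DataFrame({
    "userID": [1, 2, 3, 1],
    "ISBN": ["111", "999", "222", "222"],
    "bookRating": [7, 5, 9, 0],
})

keep = (
    ratings_toy["ISBN"].isin(books_toy["ISBN"])          # book must be known
    & ratings_toy["userID"].isin(users_toy["userID"])    # user must be known
    & (ratings_toy["bookRating"] != 0)                   # explicit ratings only
)
explicit = ratings_toy[keep]
print(explicit)
```

Only the first row passes all three conditions: the second refers to an unknown ISBN, the third to an unknown user, and the fourth is an implicit rating.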


### Find out which rating has been given the highest number of times

In [39]:
ratings['bookRating'].value_counts() # 8 is the most frequently given rating

8     91803
10    71225
7     66401
9     60776
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [0]:
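# Caution: (... >= 100).index holds every userID (the index of the boolean Series),
# and iloc treats those IDs as row positions, so this does not filter by rating count.
ratings=ratings.iloc[(ratings['userID'].value_counts()>=100).index]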

In [41]:
ratings.shape

(68091, 3)
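
A filter that actually keeps only the ratings of users with a minimum number of ratings could be sketched as follows. The data is made up and the threshold is lowered to 3 so the example stays small; applying it to the real data would change the shapes and results shown in this notebook.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "userID": [10, 10, 10, 20, 30, 30],
    "ISBN": ["a", "b", "c", "a", "b", "c"],
    "bookRating": [8, 7, 9, 5, 6, 10],
})
min_ratings = 3  # the notebook's stated threshold is 100

counts = ratings_toy["userID"].value_counts()          # ratings per user
active_users = counts[counts >= min_ratings].index     # IDs that pass the threshold
active = ratings_toy[ratings_toy["userID"].isin(active_users)]
print(active)
```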

### Generating ratings matrix from explicit ratings


In [0]:
ratings=ratings.head(45000)

In [43]:
ratings.shape

(45000, 3)

In [0]:
R_df=ratings.pivot(index = 'userID', columns ='ISBN', values = 'bookRating').fillna(0)

#### Note: training algorithms cannot handle NaNs, so they are replaced by 0, which indicates the absence of a rating
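
To make the shape of the ratings matrix concrete, here is a sketch of the same `pivot` and `fillna` steps on four made-up ratings. Note that `pivot` requires each (userID, ISBN) pair to be unique and raises a `ValueError` otherwise.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["a", "b", "b", "c"],
    "bookRating": [8, 6, 9, 7],
})

# One row per user, one column per book, 0 where no rating exists.
R_toy = ratings_toy.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)

# Share of empty cells: 5 of the 9 cells have no rating.
sparsity = (R_toy.to_numpy() == 0).mean()
print(R_toy)
print(sparsity)
```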

### Generate the predicted ratings using SVD with 50 singular values

In [0]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_df.to_numpy(), k = 50)  # pass a plain float array; k must be smaller than min(R_df.shape)

In [0]:
sigma = np.diag(sigma)

In [0]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
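
The reconstruction step can be checked on a small matrix. The sketch below builds a made-up matrix of exact rank 2, so a truncated SVD with `k=2` should reproduce it up to numerical error; with real ratings and `k = 50` the product is only a low-rank approximation.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
# Made-up 6x5 matrix of exact rank 2, standing in for the ratings matrix.
R = rng.random((6, 2)) @ rng.random((2, 5))

U, s, Vt = svds(R, k=2)          # k must be smaller than min(R.shape)
R_hat = U @ np.diag(s) @ Vt      # rank-k reconstruction, same shape as R
print(np.abs(R - R_hat).max())
```

When wrapping the result in a DataFrame, passing `index=R_df.index` as well as `columns=R_df.columns` would keep the user IDs attached to the rows.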

### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the cells below to load the variables

In [0]:
userID = 2110

In [0]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

In [71]:
ratings.head()


Unnamed: 0,userID,ISBN,bookRating
34358,8143,0060505885,7
284180,67840,0385720653,8
543388,130554,014034537X,8
432949,103067,0345439244,10
63693,12982,0380752999,9


In [81]:
# Return the books with the highest predicted rating that the specified user hasn't already rated.
# The function and parameter names refer to movies, but the function operates on books;
# `movies_df` receives the books dataframe.
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):

    # Get and sort the user's predictions.
    # The row is taken from the global `user_id` (1-based row position in the matrix), not from `userID`.
    user_row_number = user_id - 1
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)

    # Get the user's ratings and merge in the book information.
    user_data = original_ratings_df[original_ratings_df.userID == (userID)]
    print(user_data)
    # Add title, author, year and publisher
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
                     sort_values(['bookRating'], ascending=False)
                  )
    print(user_full)
    print ('User {0} has already rated {1} books.'.format(userID, user_full.shape[0]))
    print ('Recommending the highest {0} predicted ratings books not already rated.'.format(num_recommendations))

    # Recommend the highest predicted rating books that the user hasn't rated yet.
    recommendations = (movies_df[~movies_df['ISBN'].isin(user_full['ISBN'])].
          merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
                left_on = 'ISBN',
               right_on = 'ISBN').
         rename(columns = {user_row_number: 'Predictions'}).
          sort_values('Predictions', ascending = False).iloc[:num_recommendations, :-1]
                       )

    # user_full is returned twice because the call below unpacks five values
    return user_full, recommendations, sorted_user_predictions, user_data, user_full

already_rated, predictions, sorted_user_predictions, user_data, user_full = recommend_movies(preds_df, userID, books, ratings, 10)

       userID        ISBN  bookRating
14582    2110  0743486625          10
14464    2110  0345335287          10
14521    2110  0451137965           9
14538    2110  0590448285           8
14549    2110  059047054X           7
14506    2110  0439176824           8
14601    2110  093317490X           7
   userID        ISBN  bookRating  \
0    2110  0743486625          10   
1    2110  0345335287          10   
2    2110  0451137965           9   
3    2110  0590448285           8   
5    2110  0439176824           8   
4    2110  059047054X           7   
6    2110  093317490X           7   

                                           bookTitle      bookAuthor  \
0                                    Damnation Alley   Roger Zelazny   
1  The Black Unicorn (Magic Kingdom of Landover N...    Terry Brooks   
2                                            Thinner    Stephen King   
3  Karen's Tea Party (Baby-Sitters Little Sister,...   Ann M. Martin   
5               The Fall (The Seventh T

### Create a dataframe with name `user_data` containing the books that userID `2110` has rated explicitly

In [82]:
already_rated.head()

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,743486625,10,Damnation Alley,Roger Zelazny,2004,I Books
1,2110,345335287,10,The Black Unicorn (Magic Kingdom of Landover N...,Terry Brooks,1990,Del Rey Books
2,2110,451137965,9,Thinner,Stephen King,1985,New Amer Library
3,2110,590448285,8,"Karen's Tea Party (Baby-Sitters Little Sister,...",Ann M. Martin,1992,Scholastic
5,2110,439176824,8,"The Fall (The Seventh Tower, Book 1)",Garth Nix,2000,Scholastic


In [83]:
predictions

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
2801,0060392452,Stupid White Men ...and Other Sorry Excuses fo...,Michael Moore,2002,Regan Books
6378,0451163524,"The Drawing of the Three (The Dark Tower, Book 2)",Stephen King,1997,Signet Book
1175,0380789035,American Gods,Neil Gaiman,2002,HarperTorch
2143,059035342X,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books
456,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group
2104,0451524934,1984,George Orwell,1990,Signet Book
408,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
2865,0446611212,Violets Are Blue,James Patterson,2002,Warner Vision
90,0316769487,The Catcher in the Rye,J.D. Salinger,1991,"Little, Brown"
225,0446605484,Roses Are Red (Alex Cross Novels),James Patterson,2001,Warner Vision


### Combine the user_data and the corresponding book data (`book_data`) in a single dataframe with name `user_full_info`

In [86]:
user_full

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,0743486625,10,Damnation Alley,Roger Zelazny,2004,I Books
1,2110,0345335287,10,The Black Unicorn (Magic Kingdom of Landover N...,Terry Brooks,1990,Del Rey Books
2,2110,0451137965,9,Thinner,Stephen King,1985,New Amer Library
3,2110,0590448285,8,"Karen's Tea Party (Baby-Sitters Little Sister,...",Ann M. Martin,1992,Scholastic
5,2110,0439176824,8,"The Fall (The Seventh Tower, Book 1)",Garth Nix,2000,Scholastic
4,2110,059047054X,7,Claudia and the Clue in the Photograph (Baby-S...,Ann M. Martin,1994,Scholastic
6,2110,093317490X,7,The Yucatan: A Guide to the Land of Maya Myste...,Antoinette May,1993,Wide World Publishing


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [90]:
sorted_user_predictions[0:11]

ISBN
0060392452    0.023993
0451163524    0.019897
0380789035    0.019573
059035342X    0.016791
044021145X    0.016342
0451524934    0.013605
0316666343    0.013233
0446611212    0.012915
0316769487    0.012659
0446605484    0.012176
0060929871    0.011990
Name: 1, dtype: float64
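
Note that `sorted_user_predictions` ranks every book, including those the user has already rated, and the user's row is chosen by position. A sketch of an alternative that looks the user up by ID and skips rated books is shown below. It assumes the prediction frame is indexed by userID (for example built with `index=R_df.index`); `top_n_unrated` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def top_n_unrated(preds, rated_isbns, user, n=10):
    """Return the n highest predicted scores for `user`, skipping rated ISBNs."""
    scores = preds.loc[user]  # label-based lookup by userID
    scores = scores.drop(labels=list(rated_isbns), errors="ignore")
    return scores.sort_values(ascending=False).head(n)

# Made-up predictions for two users and three books.
preds_toy = pd.DataFrame(
    [[0.9, 0.1, 0.5], [0.2, 0.8, 0.4]],
    index=pd.Index([2110, 8143], name="userID"),
    columns=pd.Index(["a", "b", "c"], name="ISBN"),
)

top = top_n_unrated(preds_toy, rated_isbns={"a"}, user=2110, n=2)
print(top)
```

Book `a` has the highest score for user 2110 but is excluded because it is already rated, so `c` and `b` are returned in that order.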