**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books, and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [1]:
#import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import defaultdict
from surprise import SVD
from surprise import Dataset
%matplotlib inline

In [3]:
# Load the data (note: on pandas >= 1.3, error_bad_lines=False is replaced by on_bad_lines='skip')
books_df = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books_df.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

In [4]:
ratings_df = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings_df.columns = ['userID', 'ISBN', 'bookRating']

In [5]:
users_df = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users_df.columns = ['userID', 'Location', 'Age']

In books.csv, a few rows had **&amp; and an extra ;**, mainly in bookTitle. Replacing **&amp; with "and" and removing the ; from the title** in those entries solved the issue.
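
The manual fix described above could also be scripted. The following is a minimal sketch, not the procedure actually used here: it assumes unquoted fields separated by `;` and that any surplus separators sit inside the title (the second field). The names `repair_line` and `EXPECTED_FIELDS` and the sample line are hypothetical.

```python
EXPECTED_FIELDS = 8  # books.csv has 8 columns


def repair_line(line, expected=EXPECTED_FIELDS):
    """Replace '&amp;' with 'and' and fold surplus ';' pieces back into the title."""
    # The entity itself contains a ';', so it has to go before splitting.
    line = line.rstrip("\n").replace("&amp;", "and")
    fields = line.split(";")
    extra = len(fields) - expected
    if extra > 0:
        # Assumption: all surplus pieces belong to the title (field index 1).
        title = " ".join(part.strip() for part in fields[1:2 + extra])
        fields = [fields[0], title] + fields[2 + extra:]
    return ";".join(fields)


# Made-up line with one stray ';' in the title and one '&amp;' in the publisher
broken = "0393045218;The Mummies; of Urumchi;E. J. W. Barber;1999;W. W. Norton &amp; Company;s;m;l"
repaired = repair_line(broken)
print(repaired)
```

A line that already has eight fields and no entity passes through unchanged, so the function can be applied to every line of the file before loading it with `pd.read_csv`.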

### Check the number of records and features in each dataset

In [161]:
books_df.shape

(271360, 8)

In [162]:
users_df.shape

(278858, 3)

In [163]:
ratings_df.shape

(1149780, 3)

## Exploring books dataset

In [164]:
books_df.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [165]:
books_df.drop(columns=['imageUrlS', 'imageUrlM', 'imageUrlL'], inplace=True)

In [166]:
books_df.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [167]:
# The yearOfPublication attribute has 202 unique values
books_df['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As seen above, this field contains some incorrect entries. It looks like the publisher names 'DK Publishing Inc' and 'Gallimard' were loaded as yearOfPublication because of errors in the CSV file.


Also, some entries are strings, while the same years appear as numbers elsewhere. We will fix these issues in the following steps.
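
One way to locate such entries without reading through the whole array is to coerce the column to numbers and look at what fails. This is a small sketch on a made-up Series (`years` is a hypothetical stand-in for `books_df['yearOfPublication']`), using `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Toy stand-in for the column: ints, numeric strings and one publisher name
years = pd.Series([2002, "2000", "0", "DK Publishing Inc", 1999])

# Values that cannot be parsed as numbers become NaN
numeric_years = pd.to_numeric(years, errors="coerce")

# The rows that failed to parse are the misloaded ones
bad_rows = years[numeric_years.isna()]
print(bad_rows)
```

The numeric strings such as `"2000"` parse cleanly, so only the genuinely misplaced values are flagged.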

### Check the rows having 'DK Publishing Inc' or 'Gallimard' as yearOfPublication

In [168]:
# 2 rows have 'DK Publishing Inc' and 1 row has 'Gallimard' as yearOfPublication
books_df[(books_df['yearOfPublication'] == 'DK Publishing Inc') | (books_df['yearOfPublication'] == 'Gallimard')]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [169]:
books_df = books_df[~((books_df['yearOfPublication'] == 'DK Publishing Inc') | (books_df['yearOfPublication'] == 'Gallimard'))]

In [170]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271357 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271357 non-null object
bookTitle            271357 non-null object
bookAuthor           271356 non-null object
yearOfPublication    271357 non-null object
publisher            271355 non-null object
dtypes: object(5)
memory usage: 12.4+ MB


### Change the datatype of yearOfPublication to 'int'

In [171]:
books_df = books_df.astype({'yearOfPublication':int})

In [172]:
books_df.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [173]:
books_df.shape

(271357, 5)

In [174]:
books_df.dropna(subset=['publisher'], inplace=True)

In [175]:
books_df.shape

(271355, 5)

## Exploring Users dataset

In [176]:
print(users_df.shape)
users_df.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [177]:
users_df.dtypes

userID        int64
Location     object
Age         float64
dtype: object

### Get all unique values in ascending order for column `Age`

In [178]:
print(sorted(users_df['Age'].unique()))

[nan, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0, 98.0, 99.0, 100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0, 109.0, 110.0, 111.0, 113.0, 114.0, 115.0, 116.0, 118.0, 119.0, 123.0, 124.0, 127.0, 128.0, 132.0, 133.0, 136.0, 137.0, 138.0, 140.0, 141.0, 143.0, 146.0, 147.0, 148.0, 151.0, 152.0, 156.0, 157.0, 159.0, 162.0, 168.0, 172.0, 175.0, 183.0, 186.0, 189.0, 199.0, 200.0, 201.0, 204.0, 207.0, 208.0, 209.0, 210.0, 212.0, 219.0, 220.0, 223.0, 226.0

The Age column has some invalid entries such as NaN, 0, and implausibly high values (100 and above).

### Ages below 5 and above 90 do not make much sense for book ratings, so replace them with NaN

In [179]:
# Note: this mean is taken before the out-of-range ages are masked, so it still includes them
users_mean = users_df['Age'].mean()

In [180]:
# Keep ages between 5 and 90 (inclusive); everything else becomes NaN
users_df['Age'] = users_df['Age'].where(users_df['Age'].between(5, 90))

In [181]:
print(sorted(users_df['Age'].unique()))

[nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 81.0, 82.0, 83.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0]


In [182]:
# Verify that NaN was introduced only in the Age column
users_df.head(220)

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",
...,...,...,...
215,216,"barcelona, georgia, spain",42.0
216,217,"spencer, massachusetts, usa",
217,218,"hyderabad, alabama, india",41.0
218,219,"bismark, sachsen-anhalt, greece",


### Replace null values in column `Age` with mean

In [183]:
users_df['Age'] = users_df['Age'].fillna(users_mean)

In [184]:
users_df.head()

Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",34.751434
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",34.751434
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",34.751434
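
Because `users_mean` is computed before the out-of-range ages are replaced, the fill value still reflects entries such as 0 or 226. The sketch below uses a made-up Series to show how much the order of the two steps can matter. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy ages: two missing-or-invalid low values, one implausibly high value
ages = pd.Series([np.nan, 0, 18, 17, 200, 40], dtype=float)

# Mean taken on the raw column (outliers included)
mean_before = ages.mean()

# Mask ages outside [5, 90], then take the mean of what remains
valid = ages.where(ages.between(5, 90))
mean_after = valid.mean()

# Impute with the mean of the valid ages only
filled = valid.fillna(mean_after)
print(mean_before, mean_after)
```

On this toy data the two means differ substantially. On the real column the gap may be much smaller, but computing the mean after masking keeps the imputed value consistent with the range considered valid.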


### Change the datatype of `Age` to `int`

In [185]:
users_df = users_df.astype({'Age':int})
users_df.dtypes

userID       int64
Location    object
Age          int64
dtype: object

## Exploring the Ratings Dataset

### Check the shape

In [186]:
ratings_df.shape

(1149780, 3)

In [187]:
n_users = users_df.shape[0]
n_books = books_df.shape[0]

In [188]:
ratings_df.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should only contain books that exist in our books dataset. Drop the remaining rows

In [189]:
books_unique = sorted(books_df['ISBN'].unique())
print("unique books count ", len(books_unique))
ratings_df = ratings_df[ratings_df['ISBN'].isin(books_unique)]

unique books count  271355


In [190]:
ratings_df.shape

(1031130, 3)

### The ratings dataset should only contain ratings from users that exist in the users dataset. Drop the remaining rows

In [191]:
users_unique = sorted(users_df['userID'].unique())
print("unique users count ", len(users_unique))
ratings_df = ratings_df[ratings_df['userID'].isin(users_unique)]
ratings_df.shape

unique users count  278858


(1031130, 3)

### Consider only explicit ratings from 1 to 10 and drop the 0s (implicit ratings) in column `bookRating`

In [192]:
ratings_df = ratings_df[(ratings_df['bookRating'] >= 1) & (ratings_df['bookRating'] <= 10)]

In [193]:
ratings_df.head()

Unnamed: 0,userID,ISBN,bookRating
1,276726,0155061224,5
3,276729,052165615X,3
4,276729,0521795028,6
8,276744,038550120X,7
16,276747,0060517794,9


### Find out which rating has been given most often

In [194]:
ratings_df['bookRating'].value_counts()

8     91804
10    71225
7     66401
9     60776
5     45355
6     31687
4      7617
3      5118
2      2375
1      1481
Name: bookRating, dtype: int64

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [195]:
counts1=ratings_df['userID'].value_counts()
ratings_df=ratings_df[ratings_df['userID'].isin(counts1[counts1>=100].index)]
ratings_df.head(50)


Unnamed: 0,userID,ISBN,bookRating
1456,277427,002542730X,10
1458,277427,003008685X,8
1461,277427,0060006641,10
1465,277427,0060542128,7
1474,277427,0061009059,9
1477,277427,0062507109,8
1483,277427,0132220598,8
1488,277427,0140283374,6
1490,277427,014039026X,8
1491,277427,0140390715,7


In [196]:
ratings_df['userID'].value_counts()

11676     6943
98391     5689
189835    1899
153662    1845
23902     1180
          ... 
211919     100
208406     100
36299      100
156300     100
95010      100
Name: userID, Length: 449, dtype: int64

### Generating ratings matrix from explicit ratings


#### Note: since the training algorithm cannot handle NaNs, replace them with 0, which indicates the absence of a rating

In [197]:
from sklearn.model_selection import train_test_split
trainDF, tempDF = train_test_split(ratings_df, test_size=0.2, random_state=100)
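# Keep a copy of tempDF as testDF (the held-out ratings)
testDF = tempDF.copy()
# Intended to blank out the held-out ratings. Note: the column is named 'bookRating',
# so assigning to tempDF.rating only sets an attribute and leaves the ratings unchanged
# (see the output below).
tempDF.rating = np.nan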

tempDF.head()

Unnamed: 0,userID,ISBN,bookRating
733900,177432,0192835947,6
273626,63714,3829060114,9
173342,37644,0590457225,7
363370,87555,0141000198,8
224143,52350,2203001143,9
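
The `tempDF.head()` output above still shows the original ratings, because `tempDF.rating` does not refer to the `bookRating` column. Here is a minimal sketch of hiding the held-out ratings by assigning to the column by name. It uses a made-up frame and `DataFrame.sample` in place of `train_test_split`, purely to stay self-contained. All names are illustrative.

```python
import numpy as np
import pandas as pd

toy_ratings = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "ISBN": list("abcdefghij"),
    "bookRating": [8, 5, 10, 7, 9, 6, 8, 3, 10, 7],
})

# Hold out 20% of the rows; the rest is the training set
test_df = toy_ratings.sample(frac=0.2, random_state=100)
train_df = toy_ratings.drop(test_df.index)

# Copy of the held-out rows with the rating column blanked out
masked_df = test_df.assign(bookRating=np.nan)

# The user/book pairs stay in the data, but their ratings are hidden
ratings_for_training = pd.concat([train_df, masked_df])
print(ratings_for_training["bookRating"].isna().sum())
```

With the ratings masked this way, the later `fillna(0)` on the pivoted matrix would treat the held-out pairs as unrated, which is what an evaluation on `testDF` would need.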


In [198]:
#Remove missing values in testDF
testDF = testDF.dropna()
testDF.head()

Unnamed: 0,userID,ISBN,bookRating
733900,177432,0192835947,6
273626,63714,3829060114,9
173342,37644,0590457225,7
363370,87555,0141000198,8
224143,52350,2203001143,9


In [199]:
#Creating ratings with trainDF and tempDF
ratings = pd.concat([trainDF, tempDF]).reset_index()
ratings.shape


(103269, 4)

In [200]:
ratings.head()

Unnamed: 0,index,userID,ISBN,bookRating
0,695529,169682,0553580337,8
1,426444,101851,0890878234,10
2,48772,11676,0385474016,10
3,284111,67840,019285304X,10
4,410090,98391,0756400503,9


In [201]:
# Fill unavailable ratings with 0.0 (the resulting matrix is mostly zeros)
R_df = ratings.pivot(index = 'userID', columns ='ISBN', values = 'bookRating').fillna(0)
R_df.tail()

userID,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
274061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
274301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
275970,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
277427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [202]:
R_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 449 entries, 2033 to 278418
Columns: 66572 entries, 0000913154 to B000234N3A
dtypes: float64(66572)
memory usage: 228.1 MB
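
The dense pivot above takes about 228 MB even though almost every cell is 0. As an alternative, the same matrix could be held in a SciPy sparse format, which `svds` also accepts. The following is a rough sketch on made-up data. The frame `toy` and the variable names are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

toy = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["a", "b", "a", "c"],
    "bookRating": [8, 5, 10, 7],
})

# Map users and books to consecutive row/column positions
user_cat = toy["userID"].astype("category")
book_cat = toy["ISBN"].astype("category")
rows = user_cat.cat.codes.to_numpy()
cols = book_cat.cat.codes.to_numpy()
values = toy["bookRating"].to_numpy(dtype=float)

# Only the observed ratings are stored; missing cells are implicit zeros
R_sparse = csr_matrix(
    (values, (rows, cols)),
    shape=(len(user_cat.cat.categories), len(book_cat.cat.categories)),
)
print(R_sparse.shape, R_sparse.nnz)
```

`user_cat.cat.categories` and `book_cat.cat.categories` keep the mapping from matrix positions back to the original userID and ISBN values.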


### Generate the predicted ratings using SVD with 50 singular values

In [203]:
from scipy.sparse.linalg import svds
# Singular value decomposition
# svds computes the k largest singular values/vectors of a matrix.
# k: number of singular values and vectors to compute; must satisfy 1 <= k < min(R_df.shape)
# R_df is the matrix to decompose
# sigma holds the singular values
U, sigma, Vt = svds(R_df, k = 50)

In [204]:
# Turn the vector of singular values into a diagonal matrix
sigma = np.diag(sigma)

In [205]:
sigma

array([[147.92121613,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        , 149.3438051 ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        ,   0.        , 150.07400599, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [  0.        ,   0.        ,   0.        , ..., 379.58327277,
          0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
        634.72875357,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        , 680.30978318]])

In [206]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
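
To make the reconstruction step concrete, here is a small self-contained sketch on a random matrix (not the ratings data). It keeps the `k` largest singular values, rebuilds the matrix as `U @ diag(s) @ Vt`, and compares the approximation error with the discarded singular values. The matrix `A` and the value of `k` are arbitrary choices for illustration.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.random((6, 8))  # toy matrix standing in for the ratings matrix
k = 3                   # number of singular values to keep

# Truncated SVD: U is (6, k), s is (k,), Vt is (k, 8)
U, s, Vt = svds(A, k=k)

# Rank-k approximation of A
approx = U @ np.diag(s) @ Vt

# The Frobenius error of the rank-k approximation equals the
# root sum of squares of the singular values that were dropped
full = np.linalg.svd(A, compute_uv=False)  # all singular values, descending
error = np.linalg.norm(A - approx)
expected_error = np.sqrt(np.sum(full[k:] ** 2))
print(error, expected_error)
```

The larger `k` is, the closer the reconstruction is to the input. For recommendations, a small `k` is what lets the reconstruction assign non-zero scores to books a user has not rated.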

In [207]:
# Predicted ratings
preds_df.head()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.025341,-0.002146,-0.001431,-0.002146,-0.002146,0.002971,-0.00392,0.007035,0.007035,0.012316,...,0.00018,0.000226,0.042081,-0.016804,-0.080028,0.004746,0.028314,0.00012,-0.001693,0.067503
1,-0.010012,-0.003669,-0.002446,-0.003669,-0.003669,0.001075,0.00144,-0.0035,-0.0035,0.001612,...,-0.000363,0.000403,0.008142,0.001104,-0.029224,0.000999,0.002363,-0.000242,2.9e-05,-0.013059
2,-0.015054,-0.015457,-0.010304,-0.015457,-0.015457,0.007281,-0.014033,0.011941,0.011941,0.011796,...,-0.000455,0.001907,0.047982,0.005737,0.117859,0.006945,0.003119,-0.000304,0.009009,-0.057692
3,-0.021499,0.035602,0.023735,0.035602,0.035602,0.030307,0.024215,-0.001053,-0.001053,0.067579,...,0.002971,0.009912,0.086248,-0.008818,0.016154,0.028848,-0.000125,0.001981,0.031201,-0.046664
4,0.002077,-0.007965,-0.00531,-0.007965,-0.007965,0.002947,0.003057,0.000231,0.000231,0.00608,...,0.00212,0.001597,-0.012181,0.00942,0.673459,0.002591,-0.008229,0.001413,0.004918,0.047773


### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [208]:
userID = 2110

In [209]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

In [210]:
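# Note: iloc[0:1] selects the first row of the long-format ratings table, not a row of the ratings matrix
ratings_2ndrow = ratings.iloc[0:1,:]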

In [211]:
ratings_2ndrow

Unnamed: 0,index,userID,ISBN,bookRating
0,695529,169682,0553580337,8


### Get the predicted ratings for userID `2110` and sort them in descending order

In [212]:
ratings = ratings[ratings['userID']==userID]

In [213]:

# Sort this user's ratings from highest to lowest
ratings = ratings.sort_values('bookRating', ascending=False)

In [214]:
ratings

Unnamed: 0,index,userID,ISBN,bookRating
863,14463,2110,0345317580,10
74612,14466,2110,0345362276,10
43895,14507,2110,0439222303,10
45804,14571,2110,0671751174,10
49337,14524,2110,0486270718,10
...,...,...,...,...
83158,14473,2110,037361490X,5
31726,14528,2110,0515134384,5
60291,14453,2110,0151008116,5
66710,14529,2110,0515136557,3
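
The table above lists the ratings user `2110` actually gave. To look up the predicted ratings instead, the prediction matrix needs to be indexed by userID. This is a minimal sketch on made-up numbers: `R_toy`, `toy_predictions` and `preds_toy` are hypothetical stand-ins for `R_df`, `all_user_predicted_ratings` and `preds_df`, and the ISBN labels are placeholders.

```python
import numpy as np
import pandas as pd

# Toy ratings matrix with userID as index and ISBN as columns
R_toy = pd.DataFrame(
    [[10, 0, 0, 8], [0, 9, 0, 0], [7, 0, 6, 0]],
    index=pd.Index([2033, 2110, 2276], name="userID"),
    columns=pd.Index(["isbn_a", "isbn_b", "isbn_c", "isbn_d"], name="ISBN"),
    dtype=float,
)

# Made-up predicted scores, same shape as the ratings matrix
toy_predictions = np.array([
    [9.1, 0.2, 0.4, 7.5],
    [0.3, 8.7, 1.9, 0.6],
    [6.8, 0.1, 5.9, 0.2],
])

# Reusing the index of the ratings matrix lets rows be selected by userID
preds_toy = pd.DataFrame(toy_predictions, index=R_toy.index, columns=R_toy.columns)

# Predicted ratings for one user, highest first
sorted_user_predictions = preds_toy.loc[2110].sort_values(ascending=False)
print(sorted_user_predictions)
```

Passing `index=R_df.index` when building `preds_df` would allow the same `.loc[userID]` lookup on the real predictions, instead of relying on the row position.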


### Create a dataframe named `user_data` containing the books that userID `2110` explicitly rated

In [215]:
user_data = ratings

In [216]:
user_data.head()

Unnamed: 0,index,userID,ISBN,bookRating
863,14463,2110,0345317580,10
74612,14466,2110,0345362276,10
43895,14507,2110,0439222303,10
45804,14571,2110,0671751174,10
49337,14524,2110,0486270718,10


In [217]:
user_data.shape

(103, 4)

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [218]:
books_df.columns

Index(['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher'], dtype='object')

In [219]:
book_data = books_df[books_df.ISBN.isin(user_data.ISBN)]

In [220]:
book_data.shape

(103, 5)

In [221]:
book_data.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
246,0151008116,Life of Pi,Yann Martel,2002,Harcourt
904,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books
1000,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy
1302,0345307674,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books
1472,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket


In [222]:

user_test = pd.merge(user_data, book_data, on='ISBN', how='outer')

In [223]:
user_test.head()

Unnamed: 0,index,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,14463,2110,0345317580,10,Magic Kingdom for Sale - Sold! (Magic Kingdom ...,Terry Brooks,1990,Del Rey Books
1,14466,2110,0345362276,10,Wizard at Large (Rookies Series),Terry Brooks,1989,Del Rey Books
2,14507,2110,0439222303,10,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple
3,14571,2110,0671751174,10,First Evil (Fear Street Cheerleaders 1) : Firs...,R.L. Stine,1992,Simon Pulse
4,14524,2110,0486270718,10,The Invisible Man (Dover Thrift Editions),H. G. Wells,1992,Dover Publications


In [224]:
user_test.shape

(103, 8)

In [225]:
user_full_info = pd.merge(user_data, book_data, on='ISBN', how='outer')

In [226]:
user_full_info = user_full_info.drop('index',axis=1)

In [227]:
user_full_info.head()

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,0345317580,10,Magic Kingdom for Sale - Sold! (Magic Kingdom ...,Terry Brooks,1990,Del Rey Books
1,2110,0345362276,10,Wizard at Large (Rookies Series),Terry Brooks,1989,Del Rey Books
2,2110,0439222303,10,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple
3,2110,0671751174,10,First Evil (Fear Street Cheerleaders 1) : Firs...,R.L. Stine,1992,Simon Pulse
4,2110,0486270718,10,The Invisible Man (Dover Thrift Editions),H. G. Wells,1992,Dover Publications


In [228]:

user_full_info.shape

(103, 7)

### Get top 10 recommendations for above given userID from the books not already rated by that user

In [229]:

user_full_info.head(10)

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,0345317580,10,Magic Kingdom for Sale - Sold! (Magic Kingdom ...,Terry Brooks,1990,Del Rey Books
1,2110,0345362276,10,Wizard at Large (Rookies Series),Terry Brooks,1989,Del Rey Books
2,2110,0439222303,10,"Poof! Rabbits Everywhere! (Abracadabra!, 1)",Peter Lerangis,2002,Little Apple
3,2110,0671751174,10,First Evil (Fear Street Cheerleaders 1) : Firs...,R.L. Stine,1992,Simon Pulse
4,2110,0486270718,10,The Invisible Man (Dover Thrift Editions),H. G. Wells,1992,Dover Publications
5,2110,1565111575,10,Return of the Jedi: The Original Radio Drama,Anthony Daniels,1996,Highbridge Audio
6,2110,0345307674,10,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books
7,2110,0671695304,10,"FOREVER : A Novel of Good and Evil, Love and Hope",Judy Blume,1989,Pocket
8,2110,0812505042,10,The Time Machine,H. G. Wells,1995,Tor Books
9,2110,0590629794,10,"The Encounter (Animorphs , No 3)",K. A. Applegate,1996,Scholastic
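
The rows above are the books the user has already rated. Recommending unseen books requires removing those ISBNs from the user's predicted ratings and taking the highest remaining scores. The sketch below shows one possible way to do that. The function `recommend_books` and the toy inputs are hypothetical and use placeholder ISBN labels.

```python
import pandas as pd


def recommend_books(user_predictions, rated_isbns, books, n=10):
    """Return the n highest-scored books that the user has not rated yet."""
    # Drop the books the user already rated (ignore ISBNs absent from the predictions)
    unrated = user_predictions.drop(labels=list(rated_isbns), errors="ignore")
    # Keep the n best remaining predictions
    top = unrated.sort_values(ascending=False).head(n)
    top_df = top.rename("predictedRating").rename_axis("ISBN").reset_index()
    # Attach book details; a left join keeps the ranking order
    return top_df.merge(books, on="ISBN", how="left")


# Made-up predicted scores for one user, indexed by ISBN
user_predictions = pd.Series({"isbn_a": 0.3, "isbn_b": 8.7, "isbn_c": 1.9, "isbn_d": 0.6})
rated_isbns = ["isbn_b"]  # books this user has already rated
books = pd.DataFrame({
    "ISBN": ["isbn_a", "isbn_b", "isbn_c", "isbn_d"],
    "bookTitle": ["Title A", "Title B", "Title C", "Title D"],
})

recommendations = recommend_books(user_predictions, rated_isbns, books, n=2)
print(recommendations)
```

With the real data, `user_predictions` would be the user's row of the prediction matrix, `rated_isbns` would come from `user_data['ISBN']`, and `books` would be `books_df`.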
