**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books, and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denoting higher appreciation), and implicit ratings are expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

In [506]:
import numpy as np
from sklearn.decomposition import TruncatedSVD

#### Execute the below cell to load the datasets

In [507]:
#Loading data
import pandas as pd
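#on_bad_lines="warn" (pandas >= 1.3) skips malformed rows and reports them; older pandas used error_bad_lines=False
books = pd.read_csv("books/books.csv", sep=";", on_bad_lines="warn", encoding="latin-1")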
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']



b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


### Check the number of records and features in each dataset

In [508]:
books.shape

(271360, 8)

In [509]:
#The users and ratings datasets are loaded further below, where they are first used, to avoid reloading them repeatedly

## Exploring books dataset

In [510]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [511]:
columns = ['imageUrlS', 'imageUrlM', 'imageUrlL']
books.drop(columns, axis=1, inplace=True)

In [512]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [513]:
#export the data to CSV so it can be eyeballed
books.to_csv('books_pruned.csv')

In [514]:
books['yearOfPublication'].unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As seen above, this field contains some incorrect entries. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of errors in the CSV file.


Some entries are also stored as strings, while the same years appear as integers elsewhere. These issues are fixed in the steps that follow.
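One way to locate such entries without scanning the full `unique()` output is to coerce the column to numbers and look at what fails. The following is a minimal sketch on a hypothetical toy Series (`years` and its values are illustrative, not taken from the dataset), not the approach the notebook itself uses:

```python
import pandas as pd

# Toy column mimicking the mixed content described above (hypothetical values).
years = pd.Series([2002, "1999", "0", "DK Publishing Inc", 1806, "Gallimard"], dtype=object)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric_years = pd.to_numeric(years, errors="coerce")

# Rows that failed to parse are the ones holding publisher names.
bad_rows = years[numeric_years.isna()]
print(bad_rows.tolist())

# The remaining values can be cast to a single integer dtype.
clean_years = numeric_years.dropna().astype(int)
print(clean_years.tolist())
```

The same coercion also unifies `'1999'` and `1999` into one value, which addresses the string/number duplication at the same time.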

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [515]:
books.loc[books['yearOfPublication'] == "DK Publishing Inc",'yearOfPublication']


209538    DK Publishing Inc
221678    DK Publishing Inc
Name: yearOfPublication, dtype: object

### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [516]:
#drop the rows for each junk value in turn and print the shape after every drop
junk_yop = ['DK Publishing Inc','Gallimard']
for i in junk_yop:
    books.drop(books.loc[books['yearOfPublication'] == i].index, axis=0, inplace=True)
    print(books.shape)

(271358, 5)
(271357, 5)


### Change the datatype of yearOfPublication to 'int'

In [517]:
books['yearOfPublication']=books['yearOfPublication'].astype(int)
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [518]:
books.info()
#shows that the publisher column has 2 null values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271357 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271357 non-null object
bookTitle            271357 non-null object
bookAuthor           271356 non-null object
yearOfPublication    271357 non-null int64
publisher            271355 non-null object
dtypes: int64(1), object(4)
memory usage: 12.4+ MB


In [519]:
books.publisher.isna().value_counts()

False    271355
True          2
Name: publisher, dtype: int64

In [520]:
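#drop the rows with a missing publisher; dropna on the column alone would not remove them from books
books.dropna(subset=['publisher'], inplace=True)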
books.publisher.isna().value_counts()

False    271355
Name: publisher, dtype: int64
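Calling `dropna` on a single column and calling it on the whole frame behave differently. A minimal sketch on a hypothetical toy frame (the names `toy`, `col_only`, and `toy_clean` are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ISBN": ["a", "b", "c"],
    "publisher": ["Scholastic", np.nan, "Signet"],
})

# dropna on the column returns a shorter Series; the frame keeps all its rows.
col_only = toy["publisher"].dropna()
print(len(col_only), len(toy))

# dropna with subset= removes the whole row from the frame.
toy_clean = toy.dropna(subset=["publisher"])
print(toy_clean["ISBN"].tolist())
```

Only the `subset=` form guarantees that later joins against the frame no longer see rows with a missing publisher.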

## Exploring Users dataset

In [521]:
users = pd.read_csv('books/users.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

In [522]:
print(users.shape)
users.head(100)

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",
5,6,"santa monica, california, usa",61.0
6,7,"washington, dc, usa",
7,8,"timmins, ontario, canada",
8,9,"germantown, tennessee, usa",
9,10,"albacete, wisconsin, spain",26.0


### Get all unique values in ascending order for column `Age`

In [523]:
users['Age'].unique()

array([ nan,  18.,  17.,  61.,  26.,  14.,  25.,  19.,  46.,  55.,  32.,
        24.,  20.,  34.,  23.,  51.,  31.,  21.,  44.,  30.,  57.,  43.,
        37.,  41.,  54.,  42.,  50.,  39.,  53.,  47.,  36.,  28.,  35.,
        13.,  58.,  49.,  38.,  45.,  62.,  63.,  27.,  33.,  29.,  66.,
        40.,  15.,  60.,   0.,  79.,  22.,  16.,  65.,  59.,  48.,  72.,
        56.,  67.,   1.,  80.,  52.,  69.,  71.,  73.,  78.,   9.,  64.,
       103., 104.,  12.,  74.,  75., 231.,   3.,  76.,  83.,  68., 119.,
        11.,  77.,   2.,  70.,  93.,   8.,   7.,   4.,  81., 114., 230.,
       239.,  10.,   5., 148., 151.,   6., 101., 201.,  96.,  84.,  82.,
        90., 123., 244., 133.,  91., 128.,  94.,  85., 141., 110.,  97.,
       219.,  86., 124.,  92., 175., 172., 209., 212., 237.,  87., 162.,
       100., 156., 136.,  95.,  89., 106.,  99., 108., 210.,  88., 199.,
       147., 168., 132., 159., 186., 152., 102., 116., 200., 115., 226.,
       137., 207., 229., 138., 109., 105., 228., 18

The Age column has some invalid entries, such as NaN, 0, and implausibly high values of 100 and above.

### Values below 5 and above 90 do not make much sense for book ratings, so replace them with NaNs

In [524]:
users.loc[(users['Age']<5) | (users['Age']>90),'Age']= np.nan
users['Age'].isna().value_counts()

False    166784
True     112074
Name: Age, dtype: int64

### Replace null values in column `Age` with mean

In [525]:
users['Age'] = users['Age'].replace(np.nan,users['Age'].mean())
users['Age'].isna().value_counts()

False    278858
Name: Age, dtype: int64
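The two cleaning steps above (mask out-of-range ages, then fill with the mean of the valid ones) can be sketched on a small hypothetical Series. The values below are made up so the arithmetic is easy to follow:

```python
import numpy as np
import pandas as pd

ages = pd.Series([18.0, np.nan, 0.0, 61.0, 231.0, 26.0])

# Keep ages in [5, 90]; everything else becomes NaN.
ages = ages.where(ages.between(5, 90))

# Mean of the valid ages: (18 + 61 + 26) / 3 = 35.0
mean_age = ages.mean()

# Fill the gaps with the mean, then cast to int.
filled = ages.fillna(mean_age).astype(int)
print(filled.tolist())
```

Because the mean is computed after masking, the implausible values do not distort it. Note that `astype(int)` truncates, so a non-integer mean is rounded down.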

### Change the datatype of `Age` to `int`

In [526]:
users['Age'] = users['Age'].astype(int)

In [527]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]


## Exploring the Ratings Dataset

In [528]:
ratings = pd.read_csv('books/ratings.csv', sep=';', on_bad_lines="warn", encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

### Check the shape

In [529]:
ratings.shape

(1149780, 3)

In [530]:
n_users = users.shape[0]
n_books = books.shape[0]

In [531]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should contain only books that exist in our books dataset. Drop the remaining rows

In [532]:
#ISBN is a unique book identifier, so we can use it as the key to join the books and ratings tables
#rows whose book columns come back as NaN have no match in books and can then be removed
books_ratings = ratings.join(books.set_index('ISBN'), on ="ISBN")


In [533]:
books_ratings.head(20)

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002.0,Ballantine Books
1,276726,0155061224,5,Rites of Passage,Judith Rae,2001.0,Heinle
2,276727,0446520802,0,The Notebook,Nicholas Sparks,1996.0,Warner Books
3,276729,052165615X,3,Help!: Level 1,Philip Prowse,1999.0,Cambridge University Press
4,276729,0521795028,6,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,2001.0,Cambridge University Press
5,276733,2080674722,0,Les Particules Elementaires,Michel Houellebecq,1998.0,Flammarion
6,276736,3257224281,8,,,,
7,276737,0600570967,6,,,,
8,276744,038550120X,7,A Painted House,JOHN GRISHAM,2001.0,Doubleday
9,276745,342310538,10,,,,


In [534]:
books_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 7 columns):
userID               1149780 non-null int64
ISBN                 1149780 non-null object
bookRating           1149780 non-null int64
bookTitle            1031132 non-null object
bookAuthor           1031131 non-null object
yearOfPublication    1031132 non-null float64
publisher            1031130 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 61.4+ MB


In [535]:
#removing the extra columns that got added due to join
books_ratings.dropna(inplace=True)
columns = ['bookTitle','bookAuthor', 'yearOfPublication','publisher']
ratings = books_ratings.drop(columns, axis=1)
ratings.head(10)
ratings.shape

(1031129, 3)
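The join-then-`dropna` filter above can also be written as a membership test. This is a sketch of an alternative, not the notebook's method, using hypothetical toy frames `books_toy` and `ratings_toy`:

```python
import pandas as pd

books_toy = pd.DataFrame({"ISBN": ["111", "222"], "bookTitle": ["A", "B"]})
ratings_toy = pd.DataFrame({
    "userID": [1, 1, 2],
    "ISBN": ["111", "999", "222"],
    "bookRating": [5, 8, 0],
})

# Keep only the ratings whose ISBN appears in the books table.
kept = ratings_toy[ratings_toy["ISBN"].isin(books_toy["ISBN"])]
print(kept["ISBN"].tolist())
```

One difference to be aware of: `isin` only checks that the ISBN exists, whereas joining and calling `dropna()` also discards ratings for books that exist but have a missing author or publisher.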

### The ratings dataset should contain only ratings from users that exist in the users dataset. Drop the remaining rows

In [536]:
#same process for users and ratings: userID is unique in the users dataset, while the ratings dataset repeats a userID for every book that user rated

users_ratings = ratings.join(users.set_index('userID'), on ='userID')
users_ratings.head(10)

Unnamed: 0,userID,ISBN,bookRating,Location,Age
0,276725,034545104X,0,"tyler, texas, usa",34
1,276726,0155061224,5,"seattle, washington, usa",34
2,276727,0446520802,0,"h, new south wales, australia",16
3,276729,052165615X,3,"rijeka, n/a, croatia",16
4,276729,0521795028,6,"rijeka, n/a, croatia",16
5,276733,2080674722,0,"paris, n/a, france",37
8,276744,038550120X,7,"torrance, california, usa",34
10,276746,0425115801,0,"fort worth, ,",34
11,276746,0449006522,0,"fort worth, ,",34
12,276746,0553561618,0,"fort worth, ,",34


In [537]:
users_ratings.shape
users_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031129 entries, 0 to 1149778
Data columns (total 5 columns):
userID        1031129 non-null int64
ISBN          1031129 non-null object
bookRating    1031129 non-null int64
Location      1031129 non-null object
Age           1031129 non-null int64
dtypes: int64(3), object(2)
memory usage: 47.2+ MB


In [538]:
users_ratings.dropna(inplace=True)
columns = ['Location','Age']
ratings = users_ratings.drop(columns, axis=1)
ratings.head(10)
ratings.shape

(1031129, 3)

In [539]:
ratings.head(10)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
5,276733,2080674722,0
8,276744,038550120X,7
10,276746,0425115801,0
11,276746,0449006522,0
12,276746,0553561618,0


### Consider only ratings from 1 to 10 and leave out the 0s in column `bookRating`

In [540]:
ratings.drop(ratings.loc[ratings['bookRating'] <= 0].index, axis=0, inplace=True)
ratings.shape

(383838, 3)

### Find out which rating has been given the highest number of times

In [541]:
ratings['bookRating'].mode()
#Rating 8 is the most frequently given rating

0    8
dtype: int64

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results only consider users who have rated at least 100 books

In [542]:
g = ratings.groupby('userID')
count_df=g.count()
print(count_df)
#keep users who have rated more than 100 books (strictly greater; use >= 100 to also include users with exactly 100)
rating_frequency = count_df.loc[count_df.ISBN > 100].index
ratings = ratings[ratings['userID'].isin(rating_frequency)]
ratings

        ISBN  bookRating
userID                  
8          7           7
9          1           1
12         1           1
14         3           3
16         1           1
17         4           4
19         1           1
22         1           1
26         2           2
32         1           1
39         2           2
42         1           1
44         1           1
51         1           1
53         4           4
56         2           2
64         1           1
67         1           1
69         1           1
70         1           1
73         1           1
75         1           1
78         1           1
81         1           1
82         1           1
83         1           1
85         1           1
86         1           1
87         2           2
88         1           1
...      ...         ...
278750     1           1
278752     1           1
278755     1           1
278759     1           1
278760     1           1
278767     1           1
278772     1           1


Unnamed: 0,userID,ISBN,bookRating
1456,277427,002542730X,10
1458,277427,003008685X,8
1461,277427,0060006641,10
1465,277427,0060542128,7
1474,277427,0061009059,9
1477,277427,0062507109,8
1483,277427,0132220598,8
1488,277427,0140283374,6
1490,277427,014039026X,8
1491,277427,0140390715,7


### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [543]:
user_bookrating_pivot = ratings.pivot_table(index='userID', columns='ISBN', values='bookRating')
user_bookrating_pivot.head(20)

userID,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
2033,,,,,,,,,,,...,,,,,,,,,,
2110,,,,,,,,,,,...,,,,,,,,,,
2276,,,,,,,,,,,...,,,,,,,,,,
4017,,,,,,,,,,,...,,,,,,,,,,
4385,,,,,,,,,,,...,,,,,,,,,,
5582,,,,,,,,,,,...,,,,,,,,,,
6242,,,,,,,,,,,...,,,,,,,,,,
6251,,,,,,,,,,,...,,,,,,,,,,
6543,,,,,,,,,,,...,,,,,,,,,,
6575,,,,,,,,,,,...,,,,,,,,,,


In [544]:
user_bookrating_pivot.shape

(440, 66074)

In [545]:
#replace NaNs with 0
user_bookrating_pivot = user_bookrating_pivot.replace(np.nan, 0)
user_bookrating_pivot.head()

userID,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
2033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [546]:
user_bookrating_pivot.shape

(440, 66074)
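To make the pivot and the zero-filling concrete, here is a small sketch on hypothetical toy ratings. It also measures how many cells hold a real rating, which is worth checking before treating 0 as "no rating" (the variable names are illustrative):

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["a", "b", "b", "c"],
    "bookRating": [10, 7, 8, 5],
})

# One row per user, one column per book; unrated cells are NaN.
pivot = ratings_toy.pivot_table(index="userID", columns="ISBN", values="bookRating")
filled = pivot.fillna(0)

# Share of user/book cells that hold an actual rating.
density = pivot.notna().to_numpy().mean()
print(filled.to_numpy().tolist())
print(round(density, 3))
```

In this toy case 4 of 9 cells are filled. A real user-by-book matrix of the size shown above is far sparser, so most of the zeros stand for missing information rather than low ratings.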

In [547]:
user_bookrating_pivot_matrix = user_bookrating_pivot.to_numpy()  #as_matrix() was removed in pandas 1.0

### Generate the predicted ratings using SVD with 50 singular values

In [548]:

from scipy.sparse.linalg import svds
U, sigma, Vt = svds(user_bookrating_pivot_matrix, k = 50)

In [549]:
sigma

array([147.86505334, 149.3252639 , 150.04583148, 152.18779748,
       152.85092339, 154.59835563, 154.79234893, 155.89656853,
       158.0387241 , 159.20452803, 159.79868146, 162.01559917,
       162.66913244, 163.3027848 , 166.02017651, 166.80982507,
       168.03377923, 170.76490745, 171.0090041 , 173.2664674 ,
       174.55625832, 176.63532917, 178.61625736, 180.29249907,
       182.22447299, 184.10415849, 187.59494573, 189.74901812,
       190.94518579, 195.13457875, 199.79880776, 201.70062063,
       202.18428336, 203.46475662, 207.23913363, 209.91536471,
       213.20180251, 216.84285916, 224.26669747, 231.63466808,
       235.66562174, 249.9299874 , 252.00477066, 261.17800888,
       267.92933045, 281.00386615, 293.69040211, 379.57255682,
       634.71795583, 680.3033617 ])

In [550]:
#sigma is a 1-D array; convert it to a diagonal matrix so it can be used in the dot product
sigma = np.diag(sigma)
sigma.shape

(50, 50)

In [551]:
U.shape

(440, 50)

### Take a particular user_id

### Let's find the recommendations for user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [552]:
userID = 2110

In [553]:
user_id = 1 #userID 2110 is the 2nd row of the ratings and predicted matrices, i.e. zero-based position 1

### Get the predicted ratings for userID `2110` and sort them in descending order

In [554]:
predicted_ratings = np.dot(np.dot(U,sigma),Vt)

In [555]:
predicted_ratings.shape

(440, 66074)
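The reconstruction `U · diag(sigma) · Vt` is a rank-`k` approximation of the ratings matrix. A minimal sketch on a hypothetical random matrix of known rank 2 (the matrix `R` and its size are assumptions chosen for illustration) shows that it recovers the input when `k` equals the true rank:

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Hypothetical rank-2 "ratings" matrix: 6 users x 5 books.
R = rng.random((6, 2)) @ rng.random((2, 5))

# Truncated SVD keeping the 2 largest singular values.
U, sigma, Vt = svds(R, k=2)

# Rebuild the matrix from the three factors.
R_hat = U @ np.diag(sigma) @ Vt
print(U.shape, sigma.shape, Vt.shape)
print(np.allclose(R, R_hat, atol=1e-6))
```

On real ratings data the matrix is not low-rank, so with `k = 50` the product is only an approximation. The entries that were 0 in the input take small nonzero values in the output, and those values are what get ranked as predictions.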

In [556]:
predicted_ratings

array([[ 2.54346049e-02, -2.17942041e-03, -1.45294694e-03, ...,
         1.38371883e-04, -1.52201741e-03,  6.78833822e-02],
       [-9.92664648e-03, -3.61714450e-03, -2.41142967e-03, ...,
        -2.39714456e-04,  2.59002312e-05, -1.29297616e-02],
       [-1.49239303e-02, -1.55912320e-02, -1.03941546e-02, ...,
        -2.88437201e-04,  9.09621510e-03, -5.80539457e-02],
       ...,
       [ 3.82645533e-03, -1.12336045e-02, -7.48906966e-03, ...,
         1.75255666e-03,  8.58025749e-03,  1.25919856e-01],
       [ 7.81152862e-02, -2.45904850e-02, -1.63936567e-02, ...,
         3.20930289e-04,  1.02531543e-02,  3.56639120e-02],
       [ 8.08819830e-03,  1.16211149e-02,  7.74740995e-03, ...,
         8.47168300e-05, -3.15558368e-04,  7.43558594e-03]])

In [557]:
pred_ratings_df = pd.DataFrame(predicted_ratings, columns = user_bookrating_pivot.columns, index = user_bookrating_pivot.index)

pred_ratings_df['userID'] = pred_ratings_df.index

In [558]:
pred_ratings_df.columns

Index(['0000913154', '0001046438', '000104687X', '0001047213', '0001047973',
       '000104799X', '0001048082', '0001053736', '0001053744', '0001055607',
       ...
       'B00009EF82', 'B00009NDAN', 'B0000DYXID', 'B0000T6KHI', 'B0000VZEJQ',
       'B0000X8HIE', 'B00013AX9E', 'B0001I1KOG', 'B000234N3A', 'userID'],
      dtype='object', name='ISBN', length=66075)

### Create a dataframe with name `user_data` containing userID `2110` explicitly interacted books

In [562]:

pred_df = pred_ratings_df[pred_ratings_df['userID'] == 2110].transpose()
user_data = pred_df.sort_values(userID,ascending = False)
user_data.head(5)

ISBN,2110
userID,2110.0
059035342X,0.666278
0345370775,0.356946
0345384911,0.332482
044021145X,0.32819


In [563]:
user_data.head(20)

ISBN,2110
userID,2110.0
059035342X,0.666278
0345370775,0.356946
0345384911,0.332482
044021145X,0.32819
043935806X,0.305998
0451151259,0.302311
0439139597,0.284296
0439064872,0.278464
0380759497,0.27808


In [564]:
user_data.shape

(66075, 1)
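In the output above, the first row is `userID, 2110.0` and the shape is 66075 rather than 66074, because the `userID` column added to `pred_ratings_df` is transposed together with the ISBN columns. A sketch of one way to avoid that, assuming the user IDs stay in the index (the toy frame `pred` is hypothetical):

```python
import pandas as pd

pred = pd.DataFrame(
    [[0.1, 0.7, 0.3], [0.9, 0.2, 0.4]],
    index=pd.Index([2033, 2110], name="userID"),
    columns=pd.Index(["a", "b", "c"], name="ISBN"),
)

# Select the user's row by index label, then sort the scores.
user_scores = pred.loc[2110].sort_values(ascending=False)
print(user_scores.index.tolist())
print(len(user_scores))
```

The result is a Series indexed by ISBN with exactly one entry per book, so no stray `userID` row needs to be filtered out later.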

### Combine the user_data and the corresponding book data (`book_data`) in a single dataframe with name `user_full_info`

In [565]:
def user_info(userID, books_df, original_ratings_df):
    # All ratings given by this user
    user_data = original_ratings_df.loc[original_ratings_df.userID == userID]

    # Merge in the book details and sort by rating, highest first
    user_full_info = (
        user_data.merge(books_df, how='left', on='ISBN')
        .sort_values(['bookRating'], ascending=False)
    )
    return user_full_info

user_full_info = user_info(2110, books, ratings)
user_full_info.head()

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
76,2110,067166865X,10,STAR TREK YESTERDAY'S SON (Star Trek: The Orig...,A.C. Crispin,1988,Audioworks
52,2110,0590109715,10,"The Andalite Chronicles (Elfangor's Journey, A...",Katherine Applegate,1997,Apple
64,2110,0590629786,10,"The Visitor (Animorphs, No 2)",K. A. Applegate,1996,Scholastic
63,2110,0590629778,10,"The Invasion (Animorphs, No 1)",K. A. Applegate,1996,Scholastic
61,2110,059046678X,10,The Yearbook,Peter Lerangis,1994,Scholastic


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [566]:

recommendations = (
    books[~books['ISBN'].isin(user_full_info['ISBN'])]
    .merge(pd.DataFrame(user_data).reset_index(), how='left', on='ISBN')
    .rename(columns={userID: 'Predictions'})
    .sort_values('Predictions', ascending=False)
    .iloc[:10, :-1]  # top 10 rows, dropping the Predictions column
)

recommendations


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
1192,0345370775,Jurassic Park,Michael Crichton,1999,Ballantine Books
6184,0345384911,Crystal Line,Anne McCaffrey,1993,Del Rey Books
455,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group
5458,043935806X,Harry Potter and the Order of the Phoenix (Boo...,J. K. Rowling,2003,Scholastic
2031,0451151259,Eyes of the Dragon,Stephen King,1988,Penguin Putnam~mass
5383,0439139597,Harry Potter and the Goblet of Fire (Book 4),J. K. Rowling,2000,Scholastic
3413,0439064872,Harry Potter and the Chamber of Secrets (Book 2),J. K. Rowling,2000,Scholastic
976,0380759497,Xanth 15: The Color of Her Panties,Piers Anthony,1992,Eos
6048,0451167317,The Dark Half,Stephen King,1994,Signet Book
2435,0345353145,Sphere,MICHAEL CRICHTON,1988,Ballantine Books
