**About Book Crossing Dataset**<br>

This dataset has been compiled by Cai-Nicolas Ziegler in 2004, and it comprises of three tables for users, books and ratings. Explicit ratings are expressed on a scale from 1-10 (higher values denoting higher appreciation) and implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mp
import seaborn as sns
%matplotlib inline
#Loading data
books = pd.read_csv(r'C:\Users\ShruthiMrinalan\Downloads\books.csv', sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv(r'C:\Users\ShruthiMrinalan\Downloads\users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv(r'C:\Users\ShruthiMrinalan\Downloads\ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
  interactivity=interactivity, compiler=compiler, result=result)


### Check no.of records and features given in each dataset

In [2]:
books.shape

(271360, 8)

In [3]:
users.shape

(278858, 3)

In [4]:
ratings.shape

(1149780, 3)

## Exploring books dataset

In [5]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop last three columns containing image URLs which will not be required for analysis

In [6]:
books = books.iloc[:,:-3]

In [7]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [8]:
books.yearOfPublication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

As it can be seen from above that there are some incorrect entries in this field. It looks like Publisher names 'DK Publishing Inc' and 'Gallimard' have been incorrectly loaded as yearOfPublication in dataset due to some errors in csv file.


Also some of the entries are strings and same years have been entered as numbers in some places. We will try to fix these things in the coming questions.

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [9]:
books.query('yearOfPublication == "DK Publishing Inc"')

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [10]:
#books.drop(["DK Publishing Inc", "Gallimard"], axis = 1, inplace = True)
#books = books~(books["yearOfPublication"].isin(['DK Publishing Inc','Gallimard']))]


books=books[books["yearOfPublication"] != 'Gallimard']

In [11]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company


### Change the datatype of yearOfPublication to 'int'

In [12]:
books["yearOfPublication"] =pd.Categorical(books["yearOfPublication"]).codes 

In [13]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int16
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [14]:
books=books.dropna(subset=['publisher'])

In [15]:
books.isna().any()

ISBN                 False
bookTitle            False
bookAuthor            True
yearOfPublication    False
publisher            False
dtype: bool

## Exploring Users dataset

In [16]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [17]:
users = pd.read_csv(r'C:\Users\ShruthiMrinalan\Downloads\users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

In [18]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


### Get all unique values in ascending order for column `Age`

In [19]:
users['Age'].unique()

array([ nan,  18.,  17.,  61.,  26.,  14.,  25.,  19.,  46.,  55.,  32.,
        24.,  20.,  34.,  23.,  51.,  31.,  21.,  44.,  30.,  57.,  43.,
        37.,  41.,  54.,  42.,  50.,  39.,  53.,  47.,  36.,  28.,  35.,
        13.,  58.,  49.,  38.,  45.,  62.,  63.,  27.,  33.,  29.,  66.,
        40.,  15.,  60.,   0.,  79.,  22.,  16.,  65.,  59.,  48.,  72.,
        56.,  67.,   1.,  80.,  52.,  69.,  71.,  73.,  78.,   9.,  64.,
       103., 104.,  12.,  74.,  75., 231.,   3.,  76.,  83.,  68., 119.,
        11.,  77.,   2.,  70.,  93.,   8.,   7.,   4.,  81., 114., 230.,
       239.,  10.,   5., 148., 151.,   6., 101., 201.,  96.,  84.,  82.,
        90., 123., 244., 133.,  91., 128.,  94.,  85., 141., 110.,  97.,
       219.,  86., 124.,  92., 175., 172., 209., 212., 237.,  87., 162.,
       100., 156., 136.,  95.,  89., 106.,  99., 108., 210.,  88., 199.,
       147., 168., 132., 159., 186., 152., 102., 116., 200., 115., 226.,
       137., 207., 229., 138., 109., 105., 228., 18

Age column has some invalid entries like nan, 0 and very high values like 100 and above

### Values below 5 and above 90 do not make much sense for our book rating case...hence replace these by NaNs

In [20]:
users1 = users[ (users['Age'] >=5) & (users['Age'] <= 90) ]

In [21]:
users1['Age'].unique()

array([18., 17., 61., 26., 14., 25., 19., 46., 55., 32., 24., 20., 34.,
       23., 51., 31., 21., 44., 30., 57., 43., 37., 41., 54., 42., 50.,
       39., 53., 47., 36., 28., 35., 13., 58., 49., 38., 45., 62., 63.,
       27., 33., 29., 66., 40., 15., 60., 79., 22., 16., 65., 59., 48.,
       72., 56., 67., 80., 52., 69., 71., 73., 78.,  9., 64., 12., 74.,
       75., 76., 83., 68., 11., 77., 70.,  8.,  7., 81., 10.,  5.,  6.,
       84., 82., 90., 85., 86., 87., 89., 88.])

In [22]:
users1.head()

Unnamed: 0,userID,Location,Age
1,2,"stockton, california, usa",18.0
3,4,"porto, v.n.gaia, portugal",17.0
5,6,"santa monica, california, usa",61.0
9,10,"albacete, wisconsin, spain",26.0
10,11,"melbourne, victoria, australia",14.0


In [23]:
users1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 166784 entries, 1 to 278854
Data columns (total 3 columns):
userID      166784 non-null int64
Location    166784 non-null object
Age         166784 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 5.1+ MB


### Replace null values in column `Age` with mean

In [24]:
[users1["Age"].fillna(users1["Age"].mean(), inplace=True)]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)


[None]

In [25]:
users1["Age"]=users1["Age"].astype('int64')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


### Change the datatype of `Age` to `int`

In [26]:
users1["Age"] =pd.Categorical(users1["Age"]).codes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [27]:
print(sorted(users1.Age.unique()))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85]


## Exploring the Ratings Dataset

### check the shape

In [28]:
ratings.shape

(1149780, 3)

In [29]:
n_users = users.shape[0]
n_books = books.shape[0]

In [30]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Ratings dataset should have books only which exist in our books dataset. Drop the remaining rows

In [31]:
result = pd.merge(ratings,
                 books[['ISBN']],
                 on='ISBN')
result.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,2313,034545104X,5
2,6543,034545104X,0
3,8680,034545104X,5
4,10314,034545104X,9


In [32]:
result.shape

(1031133, 3)

### Ratings dataset should have ratings from users which exist in users dataset. Drop the remaining rows

In [33]:
result1 = pd.merge(result,
                 users[['userID']],
                 on='userID')
result1.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,2313,034545104X,5
2,2313,0812533550,9
3,2313,0679745580,8
4,2313,0060173289,9


In [34]:
result1['bookRating'].unique()

array([ 0,  5,  9,  8,  7,  6, 10,  3,  4,  2,  1], dtype=int64)

### Consider only ratings from 1-10 and leave 0s in column `bookRating`

In [35]:
result1 = result1[ (result1['bookRating'] >=1) & (result1['bookRating'] <= 10) ]

### Find out which rating has been given highest number of times

In [36]:

counts2 = result1['bookRating'].value_counts().to_dict()

In [37]:
counts2

{8: 91804,
 10: 71225,
 7: 66402,
 9: 60776,
 5: 45355,
 6: 31687,
 4: 7617,
 3: 5118,
 2: 2375,
 1: 1481}

rating 8 has the highest count

In [38]:
result1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383840 entries, 1 to 1031132
Data columns (total 3 columns):
userID        383840 non-null int64
ISBN          383840 non-null object
bookRating    383840 non-null int64
dtypes: int64(2), object(1)
memory usage: 11.7+ MB


In [39]:
rating_count=pd.DataFrame(result1.groupby(['ISBN'])['bookRating'].sum())

In [40]:
top10= rating_count.sort_values('bookRating', ascending = False).head(10)

In [41]:
print("Following books are recommended")
top10.merge(books, left_index= True, right_on= 'ISBN')

Following books are recommended


Unnamed: 0,bookRating,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
408,5787,0316666343,The Lovely Bones: A Novel,Alice Sebold,93,"Little, Brown"
748,4108,0385504209,The Da Vinci Code,Dan Brown,94,Doubleday
522,3134,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,89,Picador USA
2143,2798,059035342X,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,90,Arthur A. Levine Books
356,2595,0142001740,The Secret Life of Bees,Sue Monk Kidd,94,Penguin Books
26,2551,0971880107,Wild Animus,Rich Shapero,95,Too Far
1105,2524,0060928336,Divine Secrets of the Ya-Ya Sisterhood: A Novel,Rebecca Wells,88,Perennial
706,2402,0446672211,Where the Heart Is (Oprah's Book Club (Paperba...,Billie Letts,89,Warner Books
231,2219,0452282152,Girl with a Pearl Earring,Tracy Chevalier,92,Plume Books
118,2179,0671027360,Angels &amp; Demons,Dan Brown,92,Pocket Star


In [42]:
result1.shape

(383840, 3)

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results only consider users who have rated atleast 100 books

In [43]:
counts1 = result1['userID'].value_counts()

In [44]:
ratings_explicit= result1[result1['userID'].isin(counts1[counts1 >= 100].index)]

In [45]:
ratings_explicit.shape

(103269, 3)

counts = ratings_explicit['bookRating'].value_counts()

ratings_explicit= result1[result1['bookRating'].isin(counts[counts >= 100].index)]

In [46]:
ratings_explicit.shape

(103269, 3)

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [47]:
ratings_explicit.head()

Unnamed: 0,userID,ISBN,bookRating
43,6543,446605484,10
47,6543,805062971,8
48,6543,345342968,8
49,6543,446610038,9
55,6543,61009059,8


In [48]:
ratings_matrix = pd.merge(books, ratings_explicit, on='ISBN')

In [49]:
ratings_matrix.tail()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,userID,bookRating
103264,451205618,The Banished Bride (Signet Regency Romance),Andrea Pickens,93,Signet Book,268110,7
103265,553211994,The Jungle Books and Just So Stories,Kipling Rudyard,77,Bantam Classics,269566,8
103266,843101083,"Off-The-Wall (Mad Libs, No. 6)",Roger Price,61,Price Stern Sloan,269566,6
103267,395611563,"Walking With the Great Apes: Jane Goodall, Dia...",Sy Montgomery,83,Mariner Books,270713,10
103268,1845170423,Cocktail Classics,David Biggs,95,Connaught,275970,7


### Generate the predicted ratings using SVD with no.of singular values to be 50

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)
SVD()

from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(X,Y,test_size=0.3,random_state=1)

In [50]:
from surprise import Reader, Dataset, SVD, evaluate,accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import KFold



In [51]:
reader = Reader(rating_scale=(1,10))


data = Dataset.load_from_df(ratings_matrix[['ISBN', 'userID', 'bookRating']], reader)
trainset, testset = train_test_split(data, test_size=.25)


In [None]:
#I tried it with 50 but then it takes lot of time , I tried reusing the function from inclass example it gives in correct answer hence going with this method

In [52]:
kf = KFold(n_splits=3)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)
print(accuracy.rmse(predictions, verbose=True))
for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 1.4977
1.4977184472461131
RMSE: 1.5166
RMSE: 1.5043
RMSE: 1.4891


### Take a particular user_id

### Lets find the recommendations for user with id `2110`

#### Note: Execute the below cells to get the variables loaded

In [53]:
userID = 2110

In [54]:
user_id = 2 #2nd row in ratings matrix and predicted matrix

In [61]:
def recommendation(user_id):
    user = ratings_matrix.copy()
    already_read = ratings_matrix[ratings_matrix['userID'] == user_id]['ISBN'].unique()
    user = user.reset_index()
    user = user[~user['ISBN'].isin(already_read)]
    user['Estimate_Score']=user['ISBN'].apply(lambda x: algo.predict(user_id, x).est)
    user = user.drop('ISBN', axis = 1)
    user = user.sort_values('Estimate_Score', ascending=False)
    data1 = pd.DataFrame(user)
    print(user.head(10))

In [62]:
recommendation(2)

       index                                          bookTitle  \
0          0                                       Clara Callan   
68843  68843                                        The Captive   
68853  68853                                    Nachbarn: Roman   
68852  68852                                       Loving Jenny   
68851  68851                           Mrs. Drew Plays Her Hand   
68850  68850  Cowboy Comes Home  (Conard County) (Intimate M...   
68849  68849  Cowboy Comes Home  (Conard County) (Intimate M...   
68848  68848  Practically Married (Conveniently Yours) (Spec...   
68847  68847  Practically Married (Conveniently Yours) (Spec...   
68846  68846                          Das teuflische Testament.   

                 bookAuthor  yearOfPublication              publisher  userID  \
0      Richard Bruce Wright                 92  HarperFlamingo Canada   11676   
68843    Parris Afton Bonds                 84          Leisure Books  112001   
68853         Sibyl

In [None]:
recommendation(2)

### Get the predicted ratings for userID `2110` and sort them in descending order

In [63]:
recommendation(2110)

       index                                          bookTitle  \
0          0                                       Clara Callan   
68935  68935           A Magnificent Affair (Loveswept, No 528)   
68945  68945                                         Body Count   
68944  68944        Keine Louise: Nur die andern kriegen Kinder   
68943  68943  Alone With the Devil : Famous Cases of a Court...   
68942  68942  Kostbare Stunden: Ein Bericht Ã¼ber Sterben, T...   
68941  68941  The Man Who Killed Boys : The John Wayne Gacy,...   
68940  68940  RÃ?Â¼ckwÃ?Â¤rts. Und alles vergessen. Anna und...   
68939  68939                      Once A Hero (Historical, 505)   
68938  68938                      Once A Hero (Historical, 505)   

                   bookAuthor  yearOfPublication                publisher  \
0        Richard Bruce Wright                 92    HarperFlamingo Canada   
68935         Fayrene Preston                 83                Loveswept   
68945              Burl Barer  

### Create a dataframe with name `user_data` containing userID `2110` explicitly interacted books

In [None]:
user_data

In [None]:
user_data1.shape

### Combine the user_data and and corresponding book data(`book_data`) in a single dataframe with name `user_full_info`

In [None]:
book_data.shape

In [None]:
book_data.head()

In [None]:
user_full_info.head()

### Get top 10 recommendations for above given userID from the books not already rated by that user

In [101]:
rating_count=pd.DataFrame(result1.groupby(['ISBN'])['bookRating'].sum())
top10= rating_count.sort_values('bookRating', ascending = False).head(10)
print("Following books are recommended")
top10.merge(books, left_index= True, right_on= 'ISBN')

Following books are recommended


Unnamed: 0,bookRating,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
408,5787,0316666343,The Lovely Bones: A Novel,Alice Sebold,93,"Little, Brown"
748,4108,0385504209,The Da Vinci Code,Dan Brown,94,Doubleday
522,3134,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,89,Picador USA
2143,2798,059035342X,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,90,Arthur A. Levine Books
356,2595,0142001740,The Secret Life of Bees,Sue Monk Kidd,94,Penguin Books
26,2551,0971880107,Wild Animus,Rich Shapero,95,Too Far
1105,2524,0060928336,Divine Secrets of the Ya-Ya Sisterhood: A Novel,Rebecca Wells,88,Perennial
706,2402,0446672211,Where the Heart Is (Oprah's Book Club (Paperba...,Billie Letts,89,Warner Books
231,2219,0452282152,Girl with a Pearl Earring,Tracy Chevalier,92,Plume Books
118,2179,0671027360,Angels &amp; Demons,Dan Brown,92,Pocket Star


In [None]:
#I am getting a lot of errors with SVD 

In [66]:
from sklearn.model_selection import train_test_split

trainDF, tempDF = train_test_split(ratings_explicit, test_size = 0.2, random_state = 100)

In [67]:
testDF = tempDF.copy()

In [68]:
tempDF.bookRating = np.nan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [69]:
tempDF.head()

Unnamed: 0,userID,ISBN,bookRating
446633,197659,842304673,
135669,184299,345358791,
84242,115003,1400031354,
193586,153662,380761319,
103773,16795,394891139,


In [70]:
testDF.head()

Unnamed: 0,userID,ISBN,bookRating
446633,197659,842304673,7
135669,184299,345358791,8
84242,115003,1400031354,9
193586,153662,380761319,10
103773,16795,394891139,6


In [72]:
ratings_f = pd.concat([trainDF, tempDF]).reset_index()

In [73]:
ratings_f.shape

(103269, 4)

In [74]:
ratings_f.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103269 entries, 0 to 103268
Data columns (total 4 columns):
index         103269 non-null int64
userID        103269 non-null int64
ISBN          103269 non-null object
bookRating    82615 non-null float64
dtypes: float64(1), int64(2), object(1)
memory usage: 3.2+ MB


In [75]:
R_df = ratings_f.pivot(index = 'userID', columns = 'ISBN', values = 'bookRating').fillna(0)

In [76]:
R_df.tail()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
274061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
274301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
275970,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
277427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [77]:
from scipy.sparse.linalg import svds

In [78]:
U, sigma, Vt = svds(R_df, k = 50)

In [79]:
sigma

array([131.07954208, 132.44479902, 132.61470995, 133.96010817,
       134.94232624, 136.38117803, 137.0634911 , 138.04647807,
       140.45935247, 141.29908114, 142.26811037, 143.88305269,
       144.27243066, 144.93753168, 149.39109893, 149.62291223,
       149.94512384, 152.15710138, 152.98116567, 154.23600256,
       155.64958852, 156.98587955, 158.30450983, 161.41139495,
       164.36235669, 164.60938522, 166.22369888, 168.8872909 ,
       173.19509942, 174.99507662, 176.37245022, 178.41205733,
       180.20327794, 181.26833216, 184.19621481, 186.26397001,
       190.17666439, 194.12064112, 202.52424067, 206.23585733,
       210.1876945 , 219.80287636, 223.09823012, 232.70628393,
       237.36014895, 252.56483856, 257.35846413, 338.84909015,
       567.12180411, 605.76299262])

In [80]:
sigma = np.diag(sigma)

In [81]:
sigma

array([[131.07954208,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        , 132.44479902,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [  0.        ,   0.        , 132.61470995, ...,   0.        ,
          0.        ,   0.        ],
       ...,
       [  0.        ,   0.        ,   0.        , ..., 338.84909015,
          0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
        567.12180411,   0.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        , 605.76299262]])

In [82]:
all_users_predicted_ratings = np.dot(np.dot(U, sigma), Vt)

In [83]:
preds_df = pd.DataFrame(all_users_predicted_ratings, columns = R_df.columns)

In [84]:
preds_df

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.007815,-0.012798,-0.008532,-0.012798,0.0,0.005981,-0.002933,0.010925,0.010925,0.014187,...,0.001694,0.001188,0.013656,0.0,0.0,0.003294,0.014820,0.001129,0.003141,0.025547
1,-0.009052,-0.003543,-0.002362,-0.003543,0.0,0.001077,0.003381,-0.001985,-0.001985,0.002332,...,-0.000213,0.000924,0.006919,0.0,0.0,0.002442,0.000632,-0.000142,0.000155,-0.011345
2,-0.007289,-0.014245,-0.009497,-0.014245,0.0,0.008952,0.004210,0.015236,0.015236,0.020061,...,0.001029,-0.000127,0.048170,0.0,0.0,0.010027,0.003202,0.000686,0.009559,-0.053938
3,-0.020563,0.036033,0.024022,0.036033,0.0,0.027612,0.001397,0.003328,0.003328,0.064300,...,0.004822,0.000270,0.086069,0.0,0.0,0.023547,0.010329,0.003214,0.024629,-0.054396
4,-0.001996,-0.004971,-0.003314,-0.004971,0.0,0.002777,0.003071,-0.000250,-0.000250,0.006595,...,0.001129,0.000011,0.001815,0.0,0.0,0.003064,-0.001733,0.000753,0.002813,0.043207
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
444,0.004413,-0.002982,-0.001988,-0.002982,0.0,-0.025783,0.236056,0.000565,0.000565,-0.031114,...,-0.000349,0.000329,0.005720,0.0,0.0,-0.017353,-0.002892,-0.000232,-0.024972,0.049300
445,0.020105,0.032336,0.021557,0.032336,0.0,0.020929,0.032592,0.007314,0.007314,0.027900,...,0.000255,-0.000319,0.203065,0.0,0.0,0.018857,0.006650,0.000170,0.012404,-0.098271
446,0.015862,-0.011977,-0.007985,-0.011977,0.0,0.007549,0.001465,0.009916,0.009916,0.019330,...,0.002869,0.001230,0.013882,0.0,0.0,0.004386,0.003565,0.001912,0.006804,0.137016
447,0.061603,-0.015970,-0.010647,-0.015970,0.0,0.011279,0.003281,-0.001239,-0.001239,0.026109,...,0.001439,0.001413,0.033679,0.0,0.0,0.009377,0.007068,0.000960,0.009139,-0.002388


In [102]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations = False):
    user_row_number = userID - 1  #UserID starts at zero not 1
    sorted_user_predictions = predictions_df.loc[user_row_number].sort_values(ascending = False)
    
    user_data = original_ratings_df[original_ratings_df.userID == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'ISBN', right_on = 'ISBN').
                sort_values(['rating'], ascending = False)
                )
    print('User {0} has already rated {1} movies.'.format(userID, user_full.dropna().shape[0]))
    print('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    recommendations = (movies_df[~movies_df['ISBN'].isin(user_full['ISBN'])].
                      merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
                           left_on = 'ISBN',
                           right_on = 'ISBN').
                      rename(columns = {user_row_number: 'Predictions'}).
                      sort_values('Predictions', ascending = False).
                      iloc[:num_recommendations, :-1])
    return user_full, recommendations, sorted_user_predictions, user_data, user_full

In [103]:
already_rated, predictions, sorted_user_predictions, user_data, user_full = recommend_movies(preds_df, 2 , ratings, books, 10)

AttributeError: 'DataFrame' object has no attribute 'userID'