__Author: Samuel Estacio__

# __Book Rental Recommendation__
## Course-end Project 2
## __Description__
   
Book Rent is the largest online and offline book rental chain in India. They provide books of various genres, such as thrillers, mysteries, romances, and science fiction. The company charges a fixed rental fee for a book per month. Lately, the company has been losing its user base. The main reason for this is that users are not able to choose the right books for themselves. The company wants to solve this problem and increase its revenue and profit. 
   
## __Project Objective:__
   
You, as an ML expert, should focus on improving the user experience by personalizing it to the user's needs. You have to model a recommendation engine so that users get recommendations for books based on the behavior of similar users. This will ensure that users are renting the books based on their tastes and traits.
   
__Note:__ You have to perform user-based collaborative filtering and item-based collaborative filtering.
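
The two approaches differ mainly in which axis of the user-by-book rating matrix is compared. The following is a minimal sketch on made-up ratings, not the project data; the names `ratings`, `item_sim`, and `user_sim` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy user-by-book rating matrix; NaN means the user has not rated the book.
ratings = pd.DataFrame(
    {
        "Book A": [9, 8, np.nan, 2],
        "Book B": [8, 9, 3, np.nan],
        "Book C": [2, np.nan, 9, 8],
    },
    index=["u1", "u2", "u3", "u4"],
)

# Item-based: correlate columns (books) over the users who rated both books.
item_sim = ratings.corr(method="pearson", min_periods=2)

# User-based: correlate rows (users) over the books both users rated.
user_sim = ratings.T.corr(method="pearson", min_periods=2)

print(item_sim.round(2))
print(user_sim.round(2))
```

Pairs with fewer than `min_periods` ratings in common come back as NaN, which is why sparse rating data tends to produce sparse similarity matrices.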
   
## __Dataset description:__

 ###  __BX-Users:__ It contains the information of users.

   __user_id__- These have been anonymized and mapped to integers

   __Location__- Demographic data is provided

   __Age__- Demographic data is provided

   Location and Age are provided if available; otherwise, these fields contain __NULL-values__.
  
  
### __BX-Books:__ 

   __isbn__ - Books are identified by their respective ISBNs. Invalid ISBNs have already been removed from the dataset.

   __book_title__

   __book_author__

   __year_of_publication__

   __publisher__
   
   
   
### __BX-Book-Ratings:__ Contains the book rating information. 

   __user_id__

   __isbn__

   __rating__ - Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1–10 (higher values denoting higher appreciation), or implicit, expressed by 0.
   
    
   
__Note:__ Download the “BX-Book-Ratings.csv”, “BX-Books.csv”, “BX-Users.csv”, and “Recommend.csv” using the link given in the Book Rental Recommendation project problem statement.

## Read the books dataset and explore it

In [1]:
import pandas as pd
import numpy as np


df_books = pd.read_csv('BX-Books.csv',encoding='iso-8859-1')
df_users = pd.read_csv('BX-Users.csv',encoding='iso-8859-1')
df_br = pd.read_csv('BX-Book-Ratings.csv',encoding='iso-8859-1')
df_rec = pd.read_csv('Recommend.csv')




In [2]:
df_books.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [3]:
df_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271379 entries, 0 to 271378
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   isbn                 271379 non-null  object
 1   book_title           271379 non-null  object
 2   book_author          271378 non-null  object
 3   year_of_publication  271379 non-null  object
 4   publisher            271377 non-null  object
dtypes: object(5)
memory usage: 10.4+ MB


In [4]:
df_books.isnull().sum(axis=0)

isbn                   0
book_title             0
book_author            1
year_of_publication    0
publisher              2
dtype: int64

In [5]:
df_books[df_books.duplicated()]

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher


In [6]:
df_books[df_books.isna().any(axis=1)]

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
128896,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129043,1931696993,Finders Keepers,Linnea Sinclair,2001,
187700,9627982032,The Credit Suisse Guide to Managing Your Perso...,,1995,Edinburgh Financial Publishing


<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> Only three of the 270K+ books have missing information (one author and two publishers). There is no need to omit these books here, although in a real project the data should be cleansed properly.
</div>
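
One possible cleanup, sketched on a toy frame rather than on `df_books` itself, is to replace the missing text fields with an explicit placeholder. The placeholder value `"Unknown"` and the names `books` and `books_clean` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_books with the same kind of gaps.
books = pd.DataFrame(
    {
        "isbn": ["193169656X", "9627982032", "0195153448"],
        "book_author": ["Elaine Corvidae", np.nan, "Mark P. O. Morford"],
        "publisher": [np.nan, "Edinburgh Financial Publishing", "Oxford University Press"],
    }
)

# Fill each text column with a placeholder (an assumed convention, not a standard).
books_clean = books.fillna({"book_author": "Unknown", "publisher": "Unknown"})

print(books_clean.isnull().sum())
```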

## Clean up NaN values

In [7]:
df_books.isnull().sum(axis=0)

isbn                   0
book_title             0
book_author            1
year_of_publication    0
publisher              2
dtype: int64

In [8]:
df_br.isnull().sum(axis=0)

user_id    0
isbn       0
rating     0
dtype: int64

In [9]:
df_users.isnull().sum(axis=0)

user_id          0
Location         1
Age         110763
dtype: int64

In [10]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278859 entries, 0 to 278858
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   user_id   278859 non-null  object 
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), object(2)
memory usage: 6.4+ MB


In [11]:
df_users[df_users.isna().any(axis=1)]

Unnamed: 0,user_id,Location,Age
0,1,"nyc, new york, usa",
2,3,"moscow, yukon territory, russia",
4,5,"farnborough, hants, united kingdom",
6,7,"washington, dc, usa",
7,8,"timmins, ontario, canada",
...,...,...,...
278850,278850,"sergnano, lombardia, italy",
278854,278854,"portland, oregon, usa",
278856,278856,"brampton, ontario, canada",
278857,278857,"knoxville, tennessee, usa",


In [12]:
non_numeric_values = df_users[df_users['user_id'].apply(lambda x: not str(x).isnumeric())]
non_numeric_values

Unnamed: 0,user_id,Location,Age
275081,", milan, italy""",,


<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> Many ages are missing (110,763 of 278,859 users), possibly because of family, child, or group accounts. Until we can determine the cause, we omit these users. Only one row has a null Location, which on its own would be usable, but its user_id is malformed (it contains location text). This is probably a CSV parsing issue, so we omit that row as well.
</div>
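
An alternative to dropping every row with a missing value, shown here only as a sketch and not as what this notebook does, is to coerce `user_id` to a number and drop just the rows that fail to parse. The toy frame `users` and the name `users_valid` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_users, including one malformed id like the row found above.
users = pd.DataFrame(
    {
        "user_id": ["1", "2", ', milan, italy"', "4"],
        "Location": [
            "nyc, new york, usa",
            "stockton, california, usa",
            np.nan,
            "porto, v.n.gaia, portugal",
        ],
        "Age": [np.nan, 18.0, np.nan, 17.0],
    }
)

# Non-numeric ids become NaN instead of raising an error.
users["user_id"] = pd.to_numeric(users["user_id"], errors="coerce")

# Drop only rows whose id could not be parsed; rows with a missing Age are kept.
users_valid = users.dropna(subset=["user_id"]).copy()
users_valid["user_id"] = users_valid["user_id"].astype("int32")

print(users_valid)
```

Keeping users without an age would leave more ratings available after the later merge, at the cost of carrying incomplete demographic data.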

In [13]:
df_users = df_users.dropna()
df_users.isnull().sum(axis=0)

user_id     0
Location    0
Age         0
dtype: int64

In [14]:
df_users.head()

Unnamed: 0,user_id,Location,Age
1,2,"stockton, california, usa",18.0
3,4,"porto, v.n.gaia, portugal",17.0
5,6,"santa monica, california, usa",61.0
9,10,"albacete, wisconsin, spain",26.0
10,11,"melbourne, victoria, australia",14.0


## Read the data where ratings are given by users

In [15]:
df_br.head()

Unnamed: 0,user_id,isbn,rating
0,276725,034545104X,0
1,276726,155061224,5
2,276727,446520802,0
3,276729,052165615X,3
4,276729,521795028,6


In [16]:
df_br.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype 
---  ------   --------------    ----- 
 0   user_id  1048575 non-null  int64 
 1   isbn     1048575 non-null  object
 2   rating   1048575 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 24.0+ MB


In [17]:
df_br.isnull().sum(axis=0)

user_id    0
isbn       0
rating     0
dtype: int64

In [18]:
df_br[df_br.duplicated()]

Unnamed: 0,user_id,isbn,rating
11338,709,9.78E+12,9
21604,4334,6.31E+11,0
28622,6575,6.31E+11,0
58876,11676,9.78E+12,9
58877,11676,9.78E+12,9
...,...,...,...
1047765,250634,9.78E+12,0
1047766,250634,9.78E+12,10
1047767,250634,9.78E+12,10
1047768,250634,9.78E+12,10


## Take a quick look at the number of unique users and books

In [19]:
unique_users = df_br['user_id'].nunique()
unique_books = df_br['isbn'].nunique()

print(f"Unique Users: {unique_users}")
print(f"Unique Books: {unique_books}")


Unique Users: 95513
Unique Books: 322102


<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> Let's look at all the other data sets and do some prep work too if needed.
</div>

In [20]:
df_books.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [21]:
df_users.head()

Unnamed: 0,user_id,Location,Age
1,2,"stockton, california, usa",18.0
3,4,"porto, v.n.gaia, portugal",17.0
5,6,"santa monica, california, usa",61.0
9,10,"albacete, wisconsin, spain",26.0
10,11,"melbourne, victoria, australia",14.0


In [22]:
df_br.head()

Unnamed: 0,user_id,isbn,rating
0,276725,034545104X,0
1,276726,155061224,5
2,276727,446520802,0
3,276729,052165615X,3
4,276729,521795028,6


In [23]:
df_rec.head()

Unnamed: 0,196,242,3,881250949
0,186,302,3,891717742
1,22,377,1,878887116
2,244,51,2,880606923
3,166,346,1,886397596
4,298,474,4,884182806


## Convert ISBN variables to numeric numbers in the correct order

In [24]:
df_books = df_books.sort_values(by='book_title')
df_books = df_books.reset_index(drop=True)
df_books['bookid'] = df_books.index
df_books['bookid'] = df_books['bookid'].astype('int32')
df_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271379 entries, 0 to 271378
Data columns (total 6 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   isbn                 271379 non-null  object
 1   book_title           271379 non-null  object
 2   book_author          271378 non-null  object
 3   year_of_publication  271379 non-null  object
 4   publisher            271377 non-null  object
 5   bookid               271379 non-null  int32 
dtypes: int32(1), object(5)
memory usage: 11.4+ MB


## Convert the user_id variable to numeric numbers in the correct order

<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> This request may not be necessary since user_id is already numeric. 
</div>

In [25]:
df_users['user_id'] = df_users['user_id'].astype('int32')
df_users['Age'] = df_users['Age'].astype('int16')
df_users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168096 entries, 1 to 278855
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   user_id   168096 non-null  int32 
 1   Location  168096 non-null  object
 2   Age       168096 non-null  int16 
dtypes: int16(1), int32(1), object(1)
memory usage: 3.5+ MB


In [26]:
df_br['user_id'] = df_br['user_id'].astype('int32')
df_br.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype 
---  ------   --------------    ----- 
 0   user_id  1048575 non-null  int32 
 1   isbn     1048575 non-null  object
 2   rating   1048575 non-null  int64 
dtypes: int32(1), int64(1), object(1)
memory usage: 20.0+ MB


<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> We'll combine the books and book ratings dataframes to make a new dataframe with the book titles.
</div>

In [27]:
df_combined = pd.merge(df_books,df_br,on='isbn')
df_combined = pd.merge(df_combined,df_users,on='user_id')
df_combined = df_combined[['book_title','user_id','rating']]

In [28]:
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 685498 entries, 0 to 685497
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   book_title  685498 non-null  object
 1   user_id     685498 non-null  int32 
 2   rating      685498 non-null  int64 
dtypes: int32(1), int64(1), object(1)
memory usage: 18.3+ MB


In [29]:
df_combined.head()

Unnamed: 0,book_title,user_id,rating
0,A Light in the Storm: The Civil War Diary of ...,55927,0
1,A Break with Charity: A Story about the Salem ...,55927,7
2,A Coal Miner's Bride: The Diary of Anetka Kami...,55927,0
3,A Long Fatal Love Chase,55927,8
4,A Ride into Morning: The Story of Tempe Wick,55927,7


## Convert both user_id and ISBN to the ordered list, i.e., from 0...n-1

In [30]:
df_combined = df_combined.sort_values(by=['book_title','user_id'], ascending=[True,True])
df_combined.head() 

Unnamed: 0,book_title,user_id,rating
0,A Light in the Storm: The Civil War Diary of ...,55927,0
187,Always Have Popsicles,172742,0
1872,Apple Magic (The Collector's series),198711,0
8328,Dark Justice,98391,10
14107,Deceived,170595,0


<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> For each book we can calculate the popularity (count of all ratings explicit and implicit), and the average rating (when the rating isn't 0).
</div>
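
The two quantities can be seen side by side on a toy example. This is a sketch with made-up ratings; `combined`, `pop`, `avg`, and `summary` are illustrative names.

```python
import pandas as pd

# Toy stand-in for df_combined; a rating of 0 is an implicit rating.
combined = pd.DataFrame(
    {
        "book_title": ["1984", "1984", "1984", "Animal Farm", "Animal Farm"],
        "user_id": [1, 2, 3, 1, 4],
        "rating": [9, 0, 7, 8, 0],
    }
)

# Popularity: number of ratings per title, implicit (0) ones included.
pop = combined.groupby("book_title").size().rename("pop")

# Average rating: mean over explicit (non-zero) ratings only.
explicit = combined[combined["rating"] != 0]
avg = explicit.groupby("book_title")["rating"].mean().rename("avg_rating")

# Align both measures on the book title.
summary = pd.concat([pop, avg], axis=1).reset_index()
print(summary)
```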

In [31]:
df_pop = df_combined.groupby('book_title').size().reset_index(name='pop')
df_pop.describe()

Unnamed: 0,pop
count,195346.0
mean,3.509148
std,12.14086
min,1.0
25%,1.0
50%,1.0
75%,3.0
max,1931.0


In [32]:
pop_filter = df_pop[df_pop['pop']>20] 
pop_filter

Unnamed: 0,book_title,pop
60,'Salem's Lot,34
100,09-Nov,24
169,10 Lb. Penalty,39
358,101 Dalmatians,33
563,"14,000 Things to Be Happy About",21
...,...,...
195017,Zoya,68
195172,"\O\"" Is for Outlaw""",150
195186,"\Surely You're Joking, Mr. Feynman!\"": Adventu...",36
195252,e,30


In [33]:
rating_filter = df_combined[df_combined['rating']!=0] 
df_avgrating = rating_filter.groupby('book_title')['rating'].mean().reset_index()
df_avgrating.head()

Unnamed: 0,book_title,rating
0,Dark Justice,10.0
1,Earth Prayers From around the World: 365 Pray...,8.333333
2,Final Fantasy Anthology: Official Strategy Gu...,10.0
3,Flight of Fancy: American Heiresses (Zebra Ba...,8.0
4,Garfield Bigger and Better (Garfield (Numbere...,7.0


In [34]:
df_filtered = df_combined[df_combined['book_title'].isin(pop_filter['book_title'])]
df_filtered = df_filtered[df_filtered['rating']!=0] 
df_filtered = df_filtered.sort_values(by=['book_title','user_id'], ascending=[True,True])
df_filtered = df_filtered.reset_index(drop=True)
df_filtered

Unnamed: 0,book_title,user_id,rating
0,'Salem's Lot,33283,10
1,'Salem's Lot,56044,8
2,'Salem's Lot,60263,10
3,'Salem's Lot,70065,5
4,'Salem's Lot,71712,5
...,...,...,...
88481,stardust,215929,8
88482,stardust,218552,5
88483,stardust,234359,7
88484,stardust,236754,10


<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> We used book popularity to narrow the data down to books with more than 20 ratings (explicit or implicit). The threshold was chosen arbitrarily, but it is well above the 75th percentile (3 ratings). We then added a second filter that removes implicit (0) ratings, so only explicit ratings remain.
</div>
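
To see how a cutoff relates to the distribution of popularity counts, it can be compared with a quantile directly. A small sketch on made-up counts follows; the constant name `MIN_RATINGS` is an assumption, set to the same cutoff used for `pop_filter`.

```python
import pandas as pd

# Toy popularity counts per book; most books have very few ratings.
pop = pd.Series([1, 1, 1, 2, 3, 3, 25, 40], name="pop")

MIN_RATINGS = 20  # same cutoff as the pop_filter step

q75 = pop.quantile(0.75)          # 75th percentile of the counts
kept = pop[pop > MIN_RATINGS]     # books that survive the filter
share_kept = len(kept) / len(pop)

print(f"75th percentile: {q75}, books kept: {len(kept)} ({share_kept:.0%})")
```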

## Re-index the columns to build a matrix

In [35]:
df_pcmat = df_filtered.pivot_table(index=['user_id'],columns=['book_title'],values='rating')
df_pcmat

book_title,'Salem's Lot,09-Nov,10 Lb. Penalty,101 Dalmatians,"14,000 Things to Be Happy About",16 Lighthouse Road,1984,1st to Die: A Novel,2010: Odyssey Two,204 Rosewood Lane,...,Yukon Ho!,Zen and the Art of Motorcycle Maintenance,Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,Zlata's Diary: A Child's Life in Sarajevo,Zodiac: The Eco-Thriller,Zoya,"\O\"" Is for Outlaw""","\Surely You're Joking, Mr. Feynman!\"": Adventures of a Curious Character""",e,stardust
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
19,,,,,,,,,,,...,,,,,,,,,,
42,,,,,,,,,,,...,,,,,,,,,,
44,,,,,,,,,,,...,,,,,,,,,,
51,,,,,,,,,,,...,,,,,,,,,,
56,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
278843,,,,,,,,,,,...,,,,,,,,,,
278844,,,,,,,,,,,...,,,,,,,,,,
278846,,,,,,,,,,,...,,,,,,,,,,
278849,,,,,,,,,,,...,,,,,,,,,,


In [36]:
corrMatrix = df_pcmat.corr(method='pearson',min_periods=30)
corrMatrix.head()

book_title,'Salem's Lot,09-Nov,10 Lb. Penalty,101 Dalmatians,"14,000 Things to Be Happy About",16 Lighthouse Road,1984,1st to Die: A Novel,2010: Odyssey Two,204 Rosewood Lane,...,Yukon Ho!,Zen and the Art of Motorcycle Maintenance,Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,Zlata's Diary: A Child's Life in Sarajevo,Zodiac: The Eco-Thriller,Zoya,"\O\"" Is for Outlaw""","\Surely You're Joking, Mr. Feynman!\"": Adventures of a Curious Character""",e,stardust
book_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'Salem's Lot,,,,,,,,,,,...,,,,,,,,,,
09-Nov,,,,,,,,,,,...,,,,,,,,,,
10 Lb. Penalty,,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians,,,,,,,,,,,...,,,,,,,,,,
"14,000 Things to Be Happy About",,,,,,,,,,,...,,,,,,,,,,


<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> We also chose 30 as the min_periods value arbitrarily; some statistical analysis could test what an acceptable threshold would be. Lower values keep more book pairs, each backed by fewer common raters, so the correlations are noisier and there is more to process downstream.
</div>
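
The effect of `min_periods` can be checked on a small matrix. In this sketch with made-up ratings, a pair of books with only two raters in common gets a correlation when `min_periods=1` but becomes NaN when `min_periods=3`. The names `mat`, `loose`, and `strict` are illustrative.

```python
import numpy as np
import pandas as pd

# Rows are users, columns are books; NaN means no rating.
mat = pd.DataFrame(
    {
        "Book A": [9, 8, 7, 3, np.nan],
        "Book B": [8, 9, 6, 2, np.nan],
        "Book C": [np.nan, np.nan, 5, 4, 9],
    }
)

loose = mat.corr(method="pearson", min_periods=1)
strict = mat.corr(method="pearson", min_periods=3)

# Book A and Book C share only two raters, so the strict matrix drops the pair.
print(loose.round(2))
print(strict.round(2))
print("non-NaN entries:", int(loose.notna().sum().sum()), "vs", int(strict.notna().sum().sum()))
```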

## Split your data into two sets (training and testing)

In [37]:
df_pcmat[df_pcmat.count(axis=1) >= 7]

book_title,'Salem's Lot,09-Nov,10 Lb. Penalty,101 Dalmatians,"14,000 Things to Be Happy About",16 Lighthouse Road,1984,1st to Die: A Novel,2010: Odyssey Two,204 Rosewood Lane,...,Yukon Ho!,Zen and the Art of Motorcycle Maintenance,Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,Zlata's Diary: A Child's Life in Sarajevo,Zodiac: The Eco-Thriller,Zoya,"\O\"" Is for Outlaw""","\Surely You're Joking, Mr. Feynman!\"": Adventures of a Curious Character""",e,stardust
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
254,,,,,,,9.0,,,,...,,,,,,,,,,
388,,,,,,,,,,,...,,,,,,,,,,
503,,,,,,,,,,,...,,,,,,,,,,
638,,,,,,,,,,,...,,,,,,,,,,
805,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
278221,,,,,,,,,,,...,,,,,,,,,,
278356,,,,,,,,,,,...,,,,,,,,,,
278390,,,,,,,,,,,...,,,,,,,,,,
278554,,,,,,,,,,,...,,,,,,,,,,


<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> These 2,290 users have each rated at least 7 of the filtered books, which makes them good examples to test our model against. We will look at the first of them, user_id=254, and make predictions on related books using the correlation matrix.
</div>
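
The cells here select test users but do not hold any ratings out. One way an explicit split could look, sketched on toy data and not taken from this notebook, is to sample a fraction of each user's ratings as the test set. This assumes pandas 1.1 or later for `DataFrameGroupBy.sample`; the 20% fraction and the names `train` and `test` are illustrative.

```python
import pandas as pd

# Toy long-format ratings: two users with five rated books each.
ratings = pd.DataFrame(
    {
        "user_id": [254] * 5 + [388] * 5,
        "book_title": [f"Book {i}" for i in range(5)] * 2,
        "rating": [9, 10, 8, 5, 7, 6, 7, 8, 9, 10],
    }
)

# Hold out 20% of each user's ratings for testing (fixed seed for reproducibility).
test = ratings.groupby("user_id").sample(frac=0.2, random_state=42)

# Everything that was not sampled stays in the training set.
train = ratings.drop(test.index)

print(len(train), "training rows,", len(test), "test rows")
```

With such a split, the correlation matrix would be built from `train` only, and predictions would be scored against the held-out ratings in `test`.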

In [38]:
friend_ratings = df_pcmat.loc[254].dropna()
friend_ratings

book_title
1984                                                                         9.0
American Gods                                                               10.0
American Gods: A Novel                                                       9.0
Animal Farm                                                                  8.0
Complete Chronicles of Narnia                                                5.0
Familiar Lullaby (Fear Familiar) (Harlequin Intrigue, No 614)                5.0
Harry Potter and the Chamber of Secrets (Book 2)                             9.0
Harry Potter and the Goblet of Fire (Book 4)                                 9.0
Harry Potter and the Prisoner of Azkaban (Book 3)                            9.0
Harry Potter and the Sorcerer's Stone (Book 1)                               9.0
Making Minty Malone                                                          7.0
Martian Chronicles                                                           8.5
Neverwhere       

## Make predictions based on user and item variables

In [39]:
simcandidates = pd.Series(dtype='float64')
for title, rating in friend_ratings.items():
    print('Adding similars to ', title)

    print('--------------------------------')
    # Correlations between this book and all others (NaN = too few common raters).
    sims = corrMatrix[title].dropna()
    # Scale by the user's rating so that lower-rated books carry less weight.
    sims = sims * rating
    # Series.append was removed in pandas 2.0; pd.concat is the replacement.
    simcandidates = pd.concat([simcandidates, sims])

    print('sorting')

    simcandidates.sort_values(inplace=True, ascending=False)

    print(simcandidates.head(10))
    print(" ")
    print(" ")

Adding similars to  1984
--------------------------------
sorting
1984    9.0
dtype: float64
 
 
Adding similars to  American Gods
--------------------------------
sorting
American Gods    10.0
1984              9.0
dtype: float64
 
 
Adding similars to  American Gods: A Novel
--------------------------------
sorting
American Gods    10.0
1984              9.0
dtype: float64
 
 
Adding similars to  Animal Farm
--------------------------------
sorting
American Gods    10.0
1984              9.0
Animal Farm       8.0
dtype: float64
 
 
Adding similars to  Complete Chronicles of Narnia
--------------------------------
sorting
American Gods    10.0
1984              9.0
Animal Farm       8.0
dtype: float64
 
 
Adding similars to  Familiar Lullaby (Fear Familiar) (Harlequin Intrigue, No 614)
--------------------------------
sorting
American Gods    10.0
1984              9.0
Animal Farm       8.0
dtype: float64
 
 
Adding similars to  Harry Potter and the Chamber of Secrets (Book 2)
-------


<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> For each book the user rated, we list up to ten similar books along with a similarity-weighted score (the correlation multiplied by the user's rating), which we treat as the predicted rating. Some scores are low, so we should be careful about recommending low-scoring books. Below is a summary that averages the predicted scores for each candidate book.
</div>
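
The weighting and averaging described above can be traced by hand on a tiny example. This sketch uses a made-up 3x3 correlation matrix, and it also drops titles the user has already rated, which is an assumption about the desired output and not what the notebook does (the notebook keeps them so predictions can be compared with actual ratings).

```python
import pandas as pd

# Toy inputs: two rated books and a small, made-up correlation matrix.
friend_ratings = pd.Series({"1984": 9.0, "Animal Farm": 8.0})
titles = ["1984", "Animal Farm", "Brave New World"]
corr = pd.DataFrame(
    [[1.0, 0.8, 0.5], [0.8, 1.0, 0.25], [0.5, 0.25, 1.0]],
    index=titles,
    columns=titles,
)

# One weighted similarity column per rated book, stacked into a single Series.
scores = pd.concat(
    [corr[title].dropna() * rating for title, rating in friend_ratings.items()]
)

# Average the weighted scores for each candidate title.
scores = scores.groupby(level=0).mean()

# Keep only titles the user has not rated yet, best first.
recommendations = scores.drop(friend_ratings.index, errors="ignore")
recommendations = recommendations.sort_values(ascending=False)
print(recommendations)
```

Here "Brave New World" gets (0.5 × 9 + 0.25 × 8) / 2 = 3.25.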

In [40]:
simcandidates = simcandidates.groupby(simcandidates.index).mean()
simcandidates.sort_values(inplace=True,ascending=False)
simcandidates

Neverwhere                                                                  10.000000
American Gods                                                               10.000000
The Secret Life of Bees                                                      9.000000
The Hobbit: or There and Back Again                                          9.000000
The Bonesetter's Daughter                                                    9.000000
1984                                                                         9.000000
Harry Potter and the Sorcerer's Stone (Book 1)                               8.094333
The Subtle Knife (His Dark Materials, Book 2)                                8.049340
The Golden Compass (His Dark Materials, Book 1)                              8.008125
The Dark Half                                                                8.000000
The Fellowship of the Ring (The Lord of the Rings, Part 1)                   8.000000
Animal Farm                                           

## Use RMSE to evaluate the predictions

In [42]:
df_actual = friend_ratings.to_frame()
df_actual.reset_index(inplace=True)
df_actual.rename(columns={254:'actual'}, inplace=True)
df_actual

Unnamed: 0,book_title,actual
0,1984,9.0
1,American Gods,10.0
2,American Gods: A Novel,9.0
3,Animal Farm,8.0
4,Complete Chronicles of Narnia,5.0
5,Familiar Lullaby (Fear Familiar) (Harlequin In...,5.0
6,Harry Potter and the Chamber of Secrets (Book 2),9.0
7,Harry Potter and the Goblet of Fire (Book 4),9.0
8,Harry Potter and the Prisoner of Azkaban (Book 3),9.0
9,Harry Potter and the Sorcerer's Stone (Book 1),9.0


In [43]:
df_predicted = simcandidates.to_frame()
df_predicted.reset_index(inplace=True)
df_predicted.rename(columns={'index':'book_title',0:'predicted'}, inplace=True)
df_predicted

Unnamed: 0,book_title,predicted
0,Neverwhere,10.0
1,American Gods,10.0
2,The Secret Life of Bees,9.0
3,The Hobbit: or There and Back Again,9.0
4,The Bonesetter's Daughter,9.0
5,1984,9.0
6,Harry Potter and the Sorcerer's Stone (Book 1),8.094333
7,"The Subtle Knife (His Dark Materials, Book 2)",8.04934
8,"The Golden Compass (His Dark Materials, Book 1)",8.008125
9,The Dark Half,8.0


In [45]:
avp = pd.merge(df_actual, df_predicted, on='book_title', how='inner')
avp['se'] = (avp['actual']-avp['predicted'])**2
avp

Unnamed: 0,book_title,actual,predicted,se
0,1984,9.0,9.0,0.0
1,American Gods,10.0,10.0,0.0
2,Animal Farm,8.0,8.0,0.0
3,Harry Potter and the Chamber of Secrets (Book 2),9.0,7.258399,3.033172
4,Harry Potter and the Goblet of Fire (Book 4),9.0,7.396533,2.571106
5,Harry Potter and the Prisoner of Azkaban (Book 3),9.0,7.603723,1.94959
6,Harry Potter and the Sorcerer's Stone (Book 1),9.0,8.094333,0.820233
7,Neverwhere,10.0,10.0,0.0
8,She's Come Undone (Oprah's Book Club),7.0,7.0,0.0
9,The Bonesetter's Daughter,9.0,9.0,0.0


In [44]:
mse = avp['se'].mean()
rmse = np.sqrt(mse)
print("RMSE: ",rmse)

RMSE:  0.7483234643710263


<div class="note" style="background-color: #b7e1cd; padding: 10px;">
    <strong>Note:</strong> Through iterative testing, we found that the more books we included in the correlation matrix (by lowering the pop_filter threshold), the lower our RMSE. However, including more books was also more computationally expensive: it took longer to calculate and sometimes crashed the server for lack of resources. Adjusting the min_periods hyperparameter also changed the RMSE. With the inputs demonstrated here, the model is good at finding similar books but not great at predicting ratings.
</div>
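
For repeated experiments with different thresholds, the RMSE calculation can be wrapped in a small helper. This is a sketch; the function name `rmse` and the toy values are assumptions, and the column names mirror the `avp` frame.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy actual-vs-predicted frame with the same column names as avp.
avp_toy = pd.DataFrame(
    {"actual": [9.0, 10.0, 8.0, 9.0], "predicted": [9.0, 10.0, 8.0, 7.0]}
)

score = rmse(avp_toy["actual"], avp_toy["predicted"])
print("RMSE:", score)
```

One caveat when reading the score: a book always correlates perfectly with itself, so a rated title whose only match is itself gets a prediction equal to its actual rating and contributes zero error.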