<div id="container" style="position:relative;">
<div style="float:left"><h1> Capstone Project Modeling: Funk SVD  </h1></div>
<div style="position:relative; float:right"><img style="height:65px" src ="https://drive.google.com/uc?export=view&id=1EnB0x-fdqMp6I5iMoEBBEuxB_s7AmE2k" />
</div>
</div>

<br>
<br>
<br>


### Camilo Salazar <br> BrainStation <br> November 10, 2023

## Introduction

In this notebook, we build a hybrid recommendation system that combines collaborative filtering with content-based methods to produce personalized book recommendations. The collaborative component uses the latent features learned by a Funk SVD model, while the content component uses book genres and user-assigned tags. The two are then combined into a single weighted score that drives the final recommendations.


In [2]:
# import useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.dates as mdates
import random

# import Surprise to run the model
from surprise import Dataset
from surprise.reader import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise import accuracy

# Filter warnings
from warnings import filterwarnings
filterwarnings('ignore')

In [3]:
#loading all dataframes
book_df = pd.read_csv('data_clean/books.csv',index_col=[0])
tags_df = pd.read_csv('data/tags.csv')
book_tags_df = pd.read_csv('data_clean/book_tags.csv',index_col=[0])
ratings_df = pd.read_csv('data/ratings.csv')

---
## Table of Contents

- [Collaborative Filtering Recommender](#part-1.0)
- [Content Filtering Recommender](#part-2.0)
    - [Genres Content Recommendations](#part-2.1)
    - [Tags Content Recommendations](#part-2.2)
- [Hybrid Recommender](#part-3.0)
- [Final Recommendations DataFrame](#part-4.0)

---
## Collaborative Filtering Recommender <a class="anchor" id="part-1.0"></a>

In this first section of the notebook, we will construct our Collaborative Filtering Recommender using the tuned hyperparameters obtained in the previous notebook. Utilizing the latent features created by the Funk SVD for each book, crucial for reconstructing the product-user matrix, we can compare books liked by similar users. This is achieved through cosine similarity, enabling us to generate top recommendations based on the highest similarity scores.

The first step is to fit the Funk SVD model.

In [69]:
# set reader of the rating
reader = Reader(rating_scale=(1, 5))
my_data = Dataset.load_from_df(ratings_df, reader)
# Split data to train and test split
trainset, testset = train_test_split(my_data, test_size=.10, random_state = 42)

In [7]:
# Run the Funk SVD model with the tuned hyperparameters
final_model = FunkSVD( n_factors = 20,
                 n_epochs = 20,
                 lr_all = 0.0075,
                 biased = False,
                 random_state = 42)

final_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x15fdb6297d0>

Now that we have run the model, we can extract the latent features for the books and also create an indexing dictionary that will help us assign the correct book_id to the correct index.

In [10]:
# book latent features
book_latent = final_model.qi
# book indexing dataframe 
book_simind = pd.DataFrame(
    list(trainset._raw2inner_id_items.items()), columns=['book_id', 'Vindex']
).set_index('book_id', drop=True)
book_simind.head(5)

book_id,Vindex
2757,0
134,1
1463,2
71,3
3339,4


In [13]:
from sklearn.metrics.pairwise import cosine_similarity 
# Create the similarity array for all pairs of books
coll_similarities = cosine_similarity(book_latent, dense_output=False)
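
To make the similarity step concrete, here is a minimal sketch that computes cosine similarity by hand on a toy latent matrix. The three two-factor vectors are made-up values, not factors from the fitted model, and `toy_latent` is a name introduced only for this illustration. Normalising the rows and taking dot products is the standard definition of cosine similarity.

```python
import numpy as np

# Toy item-factor matrix: 3 hypothetical books, 2 latent factors each
toy_latent = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as book 0, different magnitude
    [0.0, 3.0],   # orthogonal to book 0
])

# Normalise each row to unit length, then take all pairwise dot products
unit_rows = toy_latent / np.linalg.norm(toy_latent, axis=1, keepdims=True)
toy_similarities = unit_rows @ unit_rows.T

print(np.round(toy_similarities, 3))
```

Books 0 and 1 point in the same direction and therefore score 1 regardless of magnitude, while book 2 is orthogonal to both and scores 0.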

After comparing the similarity of every book pair, we can begin building our recommendation function. To do this, we will first write a few functions that will help us convert information between different formats, such as converting index position to book id and title to book id.

In [14]:
def id_bookinfo(b_id):
    '''
    Retrieves book information based on a given book ID.

    Parameters
    ----------
    b_id: int
        The book ID (an integer) for the book to retrieve information about. Should be between 1 and 10,000.

    Returns
    -------
    result: pandas.DataFrame
        A DataFrame containing information about the book: title, authors, original publication year, and average rating.
    '''
    # check that the book id is an integer within the valid range
    if (not isinstance(b_id, (int, np.integer))) or ((b_id < 1) or (b_id > 10000)):
        raise ValueError("Invalid book id: pick an integer between 1 and 10000")
    
    result = book_df[book_df['book_id'] == b_id][['title', 'authors', 'original_publication_year','average_rating']]
    return result

def title_to_id(b_title):
    '''
    Retrieves the book ID based on a given book title.

    Parameters
    ----------
    b_title: str
        The title of the book to find the corresponding book ID for.

    Returns
    -------
    result: int
        The book ID associated with the given title.
    '''
    
    result = book_df[book_df['title'] == b_title]['book_id']
    return result.values[0]

def vin_to_id(vin, b_ind):
    '''
    Retrieves the book ID based on a given Vindex (index in a similarity matrix).

    Parameters
    ----------
    vin: int
        The Vindex value representing a book's index in a similarity matrix.
    b_ind: pandas.DataFrame
        Index DataFrame keyed by book_id with a 'Vindex' column.

    Returns
    -------
    result: int
        The book ID associated with the given Vindex.
    '''
    result = b_ind[b_ind['Vindex'] == vin].index[0]
    return result

 
def id_to_vin(inb, b_ind ):
    '''
    Retrieves the Vindex (index in a similarity matrix) based on a given book ID.

    Parameters
    ----------
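    inb: int
        The book ID for which the Vindex is needed.
    b_ind: pandas.DataFrame
        Index DataFrame keyed by book_id with a 'Vindex' column.

    Returns
    -------
    result: int
        The Vindex associated with the given book ID.
    '''

    # label-based lookup avoids the deprecated positional [0] on a Series
    result = b_ind.loc[inb, 'Vindex']
    return result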

def title_to_vin(b_title, b_ind):
    '''
    Retrieves the Vindex (index in a similarity matrix) based on a given book title.

    Parameters
    ----------
    b_title: str
        The title of the book for which the Vindex is needed.
    b_ind: pandas.DataFrame
        Index DataFrame keyed by book_id with a 'Vindex' column.

    Returns
    -------
    result: int
        The Vindex associated with the given book title.
    '''
        
    result = id_to_vin(title_to_id(b_title),b_ind)
    return result
    
    

With that completed, we can write our recommender function. Upon taking a title, it will find the top 10 similar books from the similarity results and return a dataframe containing the book information such as title, author, original publication year, and average rating, along with the similarity score of each book. This allows us to make ten recommendations based on collaborative filtering.

In [15]:
def Recommender(b_title, sim_arr, bi):
    '''
    Recommends books similar to a given book title based on book similarities.

    Parameters
    ----------
    b_title: str
        The title of the book for which you want book recommendations.
    sim_arr: np.ndarray
        A 2D numpy array of pairwise book similarities, where row i holds the similarities of the book at Vindex i.
    bi: pd.DataFrame
        Index DataFrame keyed by book_id with a 'Vindex' column, ordered by Vindex.

    Returns
    -------
    results: pd.DataFrame
        A DataFrame containing book recommendations and their similarities to the input book.
    '''
    
    # Create a copy of the book indices DataFrame
    botoind = bi.copy()
    # Get the Vindex (index in similarity matrix) of the input book title
    vin = title_to_vin(b_title,bi)
    # Extract similarity data for the input book
    data = sim_arr[vin]
    # Add the Similarities column to the book indices DataFrame
    botoind['Similarities'] = data
    # Sort books by similarity in descending order
    botoind.sort_values('Similarities', ascending=False, inplace=True)
    # Remove the input book from the recommendations
    botoind = botoind.drop(vin_to_id(vin,bi))
    # Get the top 10 book indices
    top10ind = botoind.head(10).index
    results = pd.DataFrame([])
    # Retrieve book information for the top 10 recommended books
    for b_id in top10ind:
        res = id_bookinfo(b_id)
        results = pd.concat([results, res])
    # Add the Similarities column to the recommendations DataFrame
    results['Similarities'] = botoind.head(10)['Similarities'].values
    
    return results
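
The ranking step inside `Recommender` can be sketched in isolation. The helper `top_n_similar` and the toy similarity row below are hypothetical and only illustrate the idea of sorting one row of the similarity matrix and dropping the query book; they are not part of the notebook.

```python
import numpy as np

def top_n_similar(sim_row, query_index, n=10):
    """Return indices of the n most similar items, excluding the query itself.

    Hypothetical helper for illustration only.
    """
    order = np.argsort(sim_row)[::-1]      # highest similarity first
    order = order[order != query_index]    # drop the query item
    return order[:n]

# Similarities of item 0 to every item (toy values)
toy_row = np.array([1.0, 0.2, 0.9, 0.5, 0.7])
top3 = top_n_similar(toy_row, query_index=0, n=3)
print(top3)
```

The returned positions correspond to Vindex values, which is why the notebook needs the index DataFrame to translate them back to book ids.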


Finally, we can test the function by providing some book titles.

In [20]:
Recommender("Words of Radiance (The Stormlight Archive, #2)",coll_similarities,book_simind)

Unnamed: 0,title,authors,original_publication_year,average_rating,Similarities
561,"The Way of Kings (The Stormlight Archive, #1)",Brandon Sanderson,2010.0,4.64,0.987111
1340,"Golden Son (Red Rising, #2)",Pierce Brown,2015.0,4.46,0.971419
306,"The Wise Man's Fear (The Kingkiller Chronicle,...",Patrick Rothfuss,2011.0,4.57,0.969749
140,The Martian,Andy Weir,2012.0,4.39,0.968499
9140,"The Way of Kings, Part 1 (The Stormlight Archi...",Brandon Sanderson,2011.0,4.67,0.964987
1373,"A Memory of Light (Wheel of Time, #14)","Robert Jordan, Brandon Sanderson",2012.0,4.5,0.964794
1807,"Morning Star (Red Rising, #3)",Pierce Brown,2016.0,4.5,0.963389
1199,"The Alloy of Law (Mistborn, #4)",Brandon Sanderson,2011.0,4.2,0.962975
2888,"Mistborn Trilogy Boxed Set (Mistborn, #1-3)",Brandon Sanderson,2009.0,4.55,0.95933
9523,"The Locket (The Locket, #1)",Richard Paul Evans,1998.0,4.1,0.959034


In [21]:
Recommender("Caliban's War (The Expanse, #2)",coll_similarities, book_simind)

Unnamed: 0,title,authors,original_publication_year,average_rating,Similarities
9202,"Batman: No Man's Land, Vol. 1","Bob Gale, Devin Grayson, Alex Maleev, Dale Eag...",1999.0,4.11,0.979077
7247,"Kill the Dead (Sandman Slim, #2)",Richard Kadrey,2010.0,4.06,0.977375
9901,"Death Without Company (Walt Longmire, #2)",Craig Johnson,2006.0,4.23,0.970572
7148,"Forever Peace (The Forever War, #2)",Joe Haldeman,1997.0,3.73,0.968576
5107,Chinese Cinderella: The True Story of an Unwan...,Adeline Yen Mah,1999.0,4.04,0.968555
2727,"Abaddon's Gate (The Expanse, #3)",James S.A. Corey,2013.0,4.18,0.968357
8093,What We Keep,Elizabeth Berg,1998.0,3.91,0.967049
9273,"The Pagan Lord (The Saxon Stories, #7)",Bernard Cornwell,2013.0,4.34,0.965922
9684,Cape Fear,John D. MacDonald,1958.0,4.07,0.964591
9695,"Perfect Shadow (Night Angel, #0.5)",Brent Weeks,2011.0,4.18,0.964388


---
## Content Filtering Recommender  <a class="anchor" id="part-2.0"></a>

For this next section, we will be building two Content Filtering Recommenders: one using genres and the other using user-assigned tags. This will be done to compare both approaches and decide which, if not both, to use for building our final hybrid recommender.

### Genres Content Recommendations <a class="anchor" id="part-2.1"></a>

For our first content recommender, we will be using genres, which come in the form of a string. Therefore, we first need to build our custom tokenizer to isolate each individual genre.

In [38]:
# taking a look at the genres format 
book_df['genres']

0       ['young-adult', 'fiction', 'fantasy', 'science...
1       ['fantasy', 'fiction', 'young-adult', 'classics']
2       ['young-adult', 'fantasy', 'romance', 'fiction...
3       ['classics', 'fiction', 'historical-fiction', ...
4       ['classics', 'fiction', 'historical-fiction', ...
                              ...                        
9995      ['fantasy', 'romance', 'paranormal', 'fiction']
9996               ['biography', 'history', 'nonfiction']
9997                    ['historical-fiction', 'fiction']
9998                         ['nonfiction', 'psychology']
9999      ['history', 'nonfiction', 'historical-fiction']
Name: genres, Length: 10000, dtype: object

In [39]:
import re

# Build custom tokenizer to eliminate unnecessary characters and divide genres into a list of individual genres

def custom_tokenizer(text):
    """
    Custom tokenizer function that preprocesses text by removing commas, quotes and square brackets,
    converting to lowercase, and splitting into a list of genres.

    Parameters:
    - text (str): Input text containing genres.

    Returns:
    - list: List of individual genres after tokenization.
    """
    
    text = re.sub(r"[,'\]\[]", ' ', text)  # Replace commas, quotes and brackets with spaces; hyphens are kept
    text = text.lower()  # Convert to lowercase
    # Split sentence into genres
    list_of_words = text.split()
    return list_of_words

With our custom tokenizer that will allow us to split the genres, we can now use the CountVectorizer model to obtain a matrix of token counts where the tokens are each genre. We can then use this matrix to compare the similarity between book pairs.

In [41]:
from sklearn.feature_extraction.text import CountVectorizer
# split the book genres using CountVectorizer with the custom tokenizer

text_model = CountVectorizer(tokenizer=custom_tokenizer)

text_pop = text_model.fit_transform(book_df['genres'])
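
What the tokenizer and the resulting count matrix look like can be sketched without scikit-learn. The function `tokenize_genres` is an illustrative restatement of the tokenizer, and the two genre strings are made-up examples in the same format as `book_df['genres']`.

```python
import re
import numpy as np

def tokenize_genres(text):
    """Split a stringified genre list into lowercase genre tokens (illustrative restatement)."""
    return re.sub(r"[,'\[\]]", " ", text).lower().split()

# Two hypothetical books in the same string format as the genres column
toy_genres = [
    "['fantasy', 'fiction', 'young-adult']",
    "['science-fiction', 'fiction']",
]

tokens = [tokenize_genres(g) for g in toy_genres]
vocabulary = sorted({t for row in tokens for t in row})

# One row per book, one column per genre, entries are token counts
counts = np.array([[row.count(g) for g in vocabulary] for row in tokens])

print(vocabulary)
print(counts)
```

Hyphenated genres such as `young-adult` survive as single tokens because the hyphen is not among the characters that are replaced.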

In [40]:
#create dataframe out of results of count vectoriser
cont_max = pd.DataFrame(columns = text_model.get_feature_names_out(), data = text_pop.toarray())
cont_max.head()

Unnamed: 0,art,biography,books,business,chick-lit,christian,classics,comics,contemporary,cookbooks,...,romance,science,science-fiction,self-help,spirituality,sports,suspense,thriller,travel,young-adult
0,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,1
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [43]:
#compares similarity based on genres 
cont1_similarities = cosine_similarity(cont_max, dense_output=False)

In [44]:
# define the index DataFrame mapping book_id to row position
tag_ind = book_df[['book_id']].copy()
tag_ind['Vindex'] = range(len(book_df))
tag_ind = tag_ind.set_index(['book_id'])
tag_ind.head()

book_id,Vindex
1,0
2,1
3,2
4,3
5,4


We are now able to reuse the same recommender function built in the previous section by just replacing the similarity matrix and index dictionary. We can now generate recommendations based on the book genres.

In [36]:
Recommender("Words of Radiance (The Stormlight Archive, #2)", cont1_similarities, tag_ind)

Unnamed: 0,title,authors,original_publication_year,average_rating,Similarities
6775,"The Legend of Huma (Dragonlance: Heroes, #1)",Richard A. Knaak,1988.0,4.03,1.0
1904,"Blood Song (Raven's Shadow, #1)",Anthony Ryan,2011.0,4.47,1.0
4005,A Knight of the Seven Kingdoms (The Tales of D...,"George R.R. Martin, Gary Gianni",2013.0,4.19,1.0
1428,"Beyond the Shadows (Night Angel, #3)",Brent Weeks,2008.0,4.29,1.0
5066,"The Dragon Keeper (Rain Wild Chronicles, #1)",Robin Hobb,2009.0,3.93,1.0
858,"The Way of Shadows (Night Angel, #1)",Brent Weeks,2008.0,4.15,1.0
3277,"The Mad Ship (Liveship Traders, #2)",Robin Hobb,1999.0,4.21,1.0
4377,"Promise of Blood (Powder Mage, #1)",Brian McClellan,2013.0,4.16,1.0
3988,The Dark Elf Trilogy Collector's Edition (Forg...,R.A. Salvatore,1998.0,4.33,1.0
9407,"The Darkest Road (The Fionavar Tapestry, #3)",Guy Gavriel Kay,1986.0,4.18,1.0


In [45]:
Recommender("Caliban's War (The Expanse, #2)", cont1_similarities, tag_ind)

Unnamed: 0,title,authors,original_publication_year,average_rating,Similarities
282,Good Omens: The Nice and Accurate Prophecies o...,"Terry Pratchett, Neil Gaiman",1990.0,4.25,1.0
5590,"Virtual Light (Bridge, #1)",William Gibson,1993.0,3.84,1.0
2155,"Red Mars (Mars Trilogy, #1)",Kim Stanley Robinson,1993.0,3.84,1.0
5665,"Excession (Culture, #5)",Iain M. Banks,1996.0,4.19,1.0
3572,"Burning Chrome (Sprawl, #0)","William Gibson, Bruce Sterling",1986.0,4.05,1.0
1336,"Going Postal (Discworld, #33; Moist von Lipwig...",Terry Pratchett,2004.0,4.36,1.0
3631,"Use of Weapons (Culture, #3)",Iain M. Banks,1990.0,4.18,1.0
3643,"On Basilisk Station (Honor Harrington, #1)",David Weber,1992.0,4.11,1.0
5014,"Daughter of the Empire (The Empire Trilogy, #1)","Raymond E. Feist, Janny Wurts",1987.0,4.24,1.0
1728,Seveneves,Neal Stephenson,2015.0,3.98,1.0


While we are able to generate recommendations based on genres, upon exploring the similarity scores of each book, we find that all 10 books have a perfect score. This is likely due to the fact that genres are too broad, and on average, each book has only around 6 genres assigned, with some having only two. This results in many books sharing the same genres, making recommendations generated with this method less useful on its own. Thus, tags may be a better method for content recommendations, as they offer a much larger variety.
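
The perfect scores follow directly from how cosine similarity treats identical vectors. The sketch below uses hypothetical binary genre vectors (the genre order and the three books are assumptions made for illustration) to show that two books with the same genre set score 1, while a single extra genre already lowers the score.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical binary genre vectors over [fantasy, fiction, romance, science-fiction]
book_a = np.array([1, 1, 0, 0])
book_b = np.array([1, 1, 0, 0])   # same genre set as book_a
book_c = np.array([1, 1, 1, 0])   # one extra genre

sim_ab = cosine(book_a, book_b)
sim_ac = cosine(book_a, book_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

With a small genre vocabulary, many books end up with exactly the same vector, so the top of the ranking is filled with ties that the genre data alone cannot break.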

### Tag Recommendations <a class="anchor" id="part-2.2"></a>

For this section, we will use the book tags to create another content-based recommender. As a first step, we filter the tags, removing those that appear in 10 or fewer books (0.1% of the 10,000 books), as they don't occur frequently enough for accurate comparisons. We also eliminate tags that appear in 66% or more of the books, as they are too common to provide meaningful distinctions. (This step is done in this notebook in case we want to adjust the thresholds.)

In [46]:
# Reduce the tags to those that appear in more than 10 books (0.1%) but in fewer than 66% of them
tag_list = book_tags_df['tag_id'].value_counts()/book_df.shape[0]*100

tag_list = list(tag_list[(tag_list > 0.1) & (tag_list < 66)].index)

book_tags_df2 = book_tags_df[(book_tags_df['tag_id'].isin(tag_list))]

book_tags_df3 = pd.merge(book_tags_df2, book_df[['book_id','goodreads_book_id']] ,on ='goodreads_book_id',how = 'left')

book_tags_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768559 entries, 0 to 768558
Data columns (total 4 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   goodreads_book_id  768559 non-null  int64
 1   tag_id             768559 non-null  int64
 2   count              768559 non-null  int64
 3   book_id            768559 non-null  int64
dtypes: int64(4)
memory usage: 29.3 MB
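
The frequency filter can be illustrated on a toy table. The data and the 25% lower threshold below are assumptions chosen so that the small example has something to filter; the notebook itself uses 0.1% and 66%, and divides by the number of rows in `book_df` rather than by the number of distinct books.

```python
import pandas as pd

# Toy book-tag pairs: 4 hypothetical books, tags 'a' to 'c'
toy_book_tags = pd.DataFrame({
    "book": [1, 2, 3, 4, 1, 2, 1],
    "tag":  ["a", "a", "a", "a", "b", "b", "c"],
})
n_books = toy_book_tags["book"].nunique()

# Share of books carrying each tag, in percent
tag_share = toy_book_tags["tag"].value_counts() / n_books * 100

# Keep tags on more than 25% and fewer than 66% of the books (toy thresholds)
kept_tags = sorted(tag_share[(tag_share > 25) & (tag_share < 66)].index)
filtered = toy_book_tags[toy_book_tags["tag"].isin(kept_tags)]

print(kept_tags)
print(filtered)
```

Tag `a` is dropped because every book carries it, and tag `c` is dropped because it appears on a single book, leaving only `b`.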


For the next step, we will transform the data. This is necessary because the number of times a tag is assigned varies greatly with the popularity of the book. By dividing the count of each tag for a book by the count of that book's most popular tag, we preserve the importance of each tag relative to its book and ensure that all books share the same range, with a maximum of 1.

In [81]:
# normalising tag count
unq_book = book_tags_df3['goodreads_book_id'].unique()
book_tags_df4 = book_tags_df3.copy()

# loop over each individual book and divide each tag count by the book's maximum count
for i in unq_book:
    max_con = book_tags_df4[book_tags_df4['goodreads_book_id'] == i]['count'].max() # find the max count
    book_tags_df4.loc[book_tags_df4['goodreads_book_id'] == i,'count'] /= max_con # divide all counts of the book by that value
    
book_tags_df4.tail()

Unnamed: 0,goodreads_book_id,tag_id,count,book_id
768554,33288638,29299,0.018182,8892
768555,33288638,2101,0.018182,8892
768556,33288638,21303,0.018182,8892
768557,33288638,17271,0.018182,8892
768558,33288638,1126,0.018182,8892
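
The same per-book normalisation can also be written without an explicit loop. The sketch below is an alternative formulation on toy data, assuming the same column names as `book_tags_df3`; it uses `groupby(...).transform('max')` to broadcast each book's maximum count back to its rows.

```python
import pandas as pd

# Toy tag counts for two hypothetical books
toy_counts = pd.DataFrame({
    "goodreads_book_id": [10, 10, 10, 20, 20],
    "tag_id":            [1, 2, 3, 1, 4],
    "count":             [200, 50, 10, 8, 2],
})

# Largest tag count within each book, repeated on every row of that book
max_per_book = toy_counts.groupby("goodreads_book_id")["count"].transform("max")

# Relative weight of each tag within its book; the top tag of every book becomes 1
toy_counts["count"] = toy_counts["count"] / max_per_book
print(toy_counts)
```

Because the division happens in one vectorised step, this form tends to scale better than filtering the full frame once per book.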


In [76]:
# transform the book-tag dataframe into a book-by-tag matrix for scaling
book_tag_max = book_tags_df4.groupby(['book_id','tag_id']).sum().unstack().fillna(0)
book_tag_max.head()

Unnamed: 0_level_0,goodreads_book_id,goodreads_book_id,goodreads_book_id,goodreads_book_id,goodreads_book_id,goodreads_book_id,goodreads_book_id,goodreads_book_id,goodreads_book_id,goodreads_book_id,...,count,count,count,count,count,count,count,count,count,count
tag_id,27,47,71,90,98,115,134,177,190,192,...,34003,34011,34031,34051,34148,34153,34155,34157,34206,34242
book_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
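
The output above contains two column groups, `goodreads_book_id` and `count`, because `.sum()` aggregates every remaining column before unstacking. If only the normalised counts are meant to serve as features (an assumption about intent, not something the notebook states), selecting the `count` column first yields a matrix with one column per tag. A minimal sketch on toy data:

```python
import pandas as pd

# Toy rows with the same columns as the normalised book-tag frame
toy_tags = pd.DataFrame({
    "goodreads_book_id": [111, 111, 222],
    "book_id":           [1, 1, 2],
    "tag_id":            [27, 47, 27],
    "count":             [1.0, 0.5, 1.0],
})

# Select 'count' before unstacking so tag_id is the only column level
toy_matrix = (
    toy_tags.groupby(["book_id", "tag_id"])["count"]
    .sum()
    .unstack()
    .fillna(0)
)
print(toy_matrix)
```

Note that switching to this form in the notebook would change the feature matrix and therefore the similarity scores reported below.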


The last step we can take before comparing the similarity of each book pair is to scale the tag data using StandardScaler. This helps level the playing field by allowing less frequently assigned tags to contribute meaningfully to the similarity scores, preventing any bias from the more frequent tags.

In [77]:
from sklearn.preprocessing import StandardScaler

scale_model =  StandardScaler()

scale_m = scale_model.fit_transform(book_tag_max)

In [78]:
cont2_similarities = cosine_similarity(scale_m, dense_output=False)
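
The effect of the scaling step can be seen on a small example. The matrix below is made up for illustration, with one common tag and one rare tag; the standardisation is done by hand with NumPy, using the population standard deviation that `StandardScaler` also uses.

```python
import numpy as np

# Toy book-by-tag matrix of normalised counts:
# column 0 is a common tag, column 1 a rare one
toy_tag_matrix = np.array([
    [1.0, 0.0],
    [0.8, 0.0],
    [0.9, 0.1],
    [0.7, 0.0],
])

# Column-wise standardisation: subtract the mean, divide by the standard deviation
col_mean = toy_tag_matrix.mean(axis=0)
col_std = toy_tag_matrix.std(axis=0)   # ddof=0, i.e. population standard deviation
toy_scaled = (toy_tag_matrix - col_mean) / col_std

print(np.round(toy_scaled, 3))
```

After scaling, both columns have zero mean and unit variance, so the single book carrying the rare tag stands out on that column as strongly as the variation on the common tag.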

Using the same function, we can now see the recommendations generated by the tags.

In [79]:
Recommender("Words of Radiance (The Stormlight Archive, #2)", cont2_similarities, tag_ind)

Unnamed: 0,title,authors,original_publication_year,average_rating,Similarities
561,"The Way of Kings (The Stormlight Archive, #1)",Brandon Sanderson,2010.0,4.64,0.764004
3340,"The Bands of Mourning (Mistborn, #6)",Brandon Sanderson,2016.0,4.45,0.518539
1199,"The Alloy of Law (Mistborn, #4)",Brandon Sanderson,2011.0,4.2,0.515263
2159,"The Blinding Knife (Lightbringer, #2)",Brent Weeks,2012.0,4.45,0.473415
2791,"Shadows of Self (Mistborn, #5)",Brandon Sanderson,2015.0,4.3,0.464496
7992,"Secret History (Mistborn, #3.5)",Brandon Sanderson,2016.0,4.38,0.374148
2977,The Emperor's Soul,Brandon Sanderson,2012.0,4.33,0.342964
3724,"Rise of Empire (The Riyria Revelations, #3-4)",Michael J. Sullivan,2010.0,4.35,0.338344
3797,"Heir of Novron (The Riyria Revelations, #5-6)",Michael J. Sullivan,2012.0,4.46,0.335325
3533,"The Broken Eye (Lightbringer, #3)",Brent Weeks,2014.0,4.46,0.31529


In [80]:
Recommender("Caliban's War (The Expanse, #2)", cont2_similarities, tag_ind)

Unnamed: 0,title,authors,original_publication_year,average_rating,Similarities
3669,"Cibola Burn (The Expanse, #4)",James S.A. Corey,2014.0,4.12,0.674643
4460,"Nemesis Games (The Expanse, #5)",James S.A. Corey,2015.0,4.37,0.579745
2727,"Abaddon's Gate (The Expanse, #3)",James S.A. Corey,2013.0,4.18,0.535179
9033,"Babylon's Ashes (The Expanse, #6)",James S.A. Corey,2016.0,4.18,0.525851
1262,"Leviathan Wakes (The Expanse, #1)",James S.A. Corey,2011.0,4.2,0.468206
5900,Ancillary Mercy (Imperial Radch #3),Ann Leckie,2015.0,4.2,0.279748
6813,"The Human Division (Old Man's War, #5)",John Scalzi,2013.0,4.07,0.270363
6417,The Temporal Void,Peter F. Hamilton,2008.0,4.24,0.26788
6928,Terms of Enlistment (Frontlines #1),Marko Kloos,2013.0,3.96,0.257692
4789,"Authority (Southern Reach, #2)",Jeff VanderMeer,2014.0,3.55,0.253433


--- 
## Hybrid Recommender <a class="anchor" id="part-3.0"></a>

In [55]:
def hybrid_recommender(book_title, exclude_author = True):
    """
    Generate book recommendations using a hybrid recommender system.
    
    Parameters:
    - book_title (str): Title of the book for which recommendations are requested.
    - exclude_author (bool): If True, excludes books by the same author from the recommendations.
    
    Returns:
    - DataFrame: Top 10 book recommendations based on a combination of collaborative and content-based filtering.
    """
    # Get the book_id
    book_id = title_to_id(book_title)
    
    # Get indices for collaborative and content-based filtering
    coll_index = id_to_vin(book_id, book_simind)
    cont_index = id_to_vin(book_id, tag_ind)
    
    # Get similarity data
    coll_data = coll_similarities[coll_index]    # collaborative: Funk SVD latent features
    cont_data1 = cont2_similarities[cont_index]  # content: tags
    cont_data2 = cont1_similarities[cont_index]  # content: genres
    
    # Create DataFrames for collaborative and content-based filtering
    coll_df = book_simind.copy()
    cont_df = tag_ind.copy()
    
    # Attach the similarity scores to each index DataFrame
    coll_df['SIM1'] = coll_data
    cont_df['SIM2'] = cont_data1
    cont_df['SIM3'] = cont_data2
    
    # Merge the DataFrames on book_id
    merge_df = pd.merge(cont_df, coll_df, on='book_id', how='left')
    merge_df = pd.merge(merge_df, book_df[['book_id', 'average_rating']], on='book_id', how='right').set_index('book_id')
    
    merge_df.columns = merge_df.columns.map(''.join)
    # Calculate the total score
    merge_df['score'] = weight_score(merge_df)
    
    # Exclude books by the same author if specified
    if exclude_author:
        author = book_df[book_df['book_id'] == book_id]['authors'].values[0]
        same_author_ids = list(book_df[book_df['authors'] == author]['book_id'].values)
        merge_df = merge_df.drop(same_author_ids)
    else:
        merge_df = merge_df.drop(book_id)
    
    # Sort and retrieve top 10 recommendations
    merge_df = merge_df.sort_values('score', ascending=False)
    top10_ids = merge_df.head(10).index
    results = pd.DataFrame([])
    
    # Retrieve book information for the top 10 recommended books
    for book_id in top10_ids:
        book_info = id_bookinfo(book_id)
        results = pd.concat([results, book_info])
    
    # Add the 'score' column to the recommendations DataFrame
    results['score'] = merge_df.head(10)['score'].values / 1.325  # Normalize by the maximum attainable score (1 + 0.25 + 0.05 + 0.005 * 5)
    
    return results


def weight_score(merge_df):
    """
    Calculate the total weighted score for each book in the hybrid recommender system.
    
    Parameters:
    - merge_df (DataFrame): DataFrame containing similarity and rating data for books.
    
    Returns:
    - Series: Total weighted score for each book.
    """
    # Calculate the total weighted score using collaborative and content-based similarities and average rating
    score = merge_df['SIM1'] + 0.25 * merge_df['SIM2'] + 0.05 * merge_df['SIM3'] + 0.005 * merge_df['average_rating']
    
    return score 
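
The constant 1.325 used to normalise the score in `hybrid_recommender` is the largest value `weight_score` can return. The short worked example below reproduces that arithmetic; the weight names are introduced here for readability, and the bounds assume a cosine similarity of at most 1 and the 1 to 5 rating scale set in the `Reader`.

```python
# Weights copied from weight_score (SIM1, SIM2, SIM3, average_rating)
W_COLLAB, W_TAGS, W_GENRES, W_RATING = 1.0, 0.25, 0.05, 0.005

max_similarity = 1.0   # upper bound of a cosine similarity
max_rating = 5.0       # top of the 1-5 rating scale

max_score = (W_COLLAB * max_similarity
             + W_TAGS * max_similarity
             + W_GENRES * max_similarity
             + W_RATING * max_rating)
print(round(max_score, 3))
```

Dividing by this value keeps the reported scores at or below 1, and the weights show that the collaborative similarity dominates, with tags, genres and the average rating acting as progressively smaller adjustments.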

In [82]:
hybrid_recommender("Words of Radiance (The Stormlight Archive, #2)")

Unnamed: 0,title,authors,original_publication_year,average_rating,score
2159,"The Blinding Knife (Lightbringer, #2)",Brent Weeks,2012.0,4.45,0.855968
306,"The Wise Man's Fear (The Kingkiller Chronicle,...",Patrick Rothfuss,2011.0,4.57,0.832913
3797,"Heir of Novron (The Riyria Revelations, #5-6)",Michael J. Sullivan,2012.0,4.46,0.830187
3533,"The Broken Eye (Lightbringer, #3)",Brent Weeks,2014.0,4.46,0.825966
6308,"The Crimson Campaign (Powder Mage, #2)",Brian McClellan,2014.0,4.35,0.807225
3724,"Rise of Empire (The Riyria Revelations, #3-4)",Michael J. Sullivan,2010.0,4.35,0.804137
1340,"Golden Son (Red Rising, #2)",Pierce Brown,2015.0,4.46,0.802357
2605,"The Daylight War (Demon Cycle, #3)",Peter V. Brett,2013.0,4.23,0.799723
6227,"Fool's Quest (The Fitz and The Fool, #2)",Robin Hobb,2015.0,4.53,0.798141
7196,The Providence of Fire (Chronicle of the Unhew...,Brian Staveley,2015.0,4.16,0.796822


In [83]:
hybrid_recommender("Caliban's War (The Expanse, #2)", False)

Unnamed: 0,title,authors,original_publication_year,average_rating,score
3669,"Cibola Burn (The Expanse, #4)",James S.A. Corey,2014.0,4.12,0.888075
2727,"Abaddon's Gate (The Expanse, #3)",James S.A. Corey,2013.0,4.18,0.885322
4460,"Nemesis Games (The Expanse, #5)",James S.A. Corey,2015.0,4.37,0.872243
1262,"Leviathan Wakes (The Expanse, #1)",James S.A. Corey,2011.0,4.2,0.852277
9033,"Babylon's Ashes (The Expanse, #6)",James S.A. Corey,2016.0,4.18,0.84221
6417,The Temporal Void,Peter F. Hamilton,2008.0,4.24,0.818752
6813,"The Human Division (Old Man's War, #5)",John Scalzi,2013.0,4.07,0.815957
7119,"The Dark Forest (Remembrance of Earth’s Past, #2)","Liu Cixin, Joel Martinsen",2008.0,4.38,0.809092
1340,"Golden Son (Red Rising, #2)",Pierce Brown,2015.0,4.46,0.806274
9120,Great North Road,Peter F. Hamilton,2012.0,4.06,0.803091


In [84]:
hybrid_recommender("The Hunger Games (The Hunger Games, #1)")

Unnamed: 0,title,authors,original_publication_year,average_rating,score
11,"Divergent (Divergent, #1)",Veronica Roth,2011.0,4.24,0.807585
716,The Hunger Games: Official Illustrated Movie C...,Kate Egan,2012.0,4.51,0.783479
279,"Delirium (Delirium, #1)",Lauren Oliver,2011.0,3.99,0.778005
90,"The Maze Runner (Maze Runner, #1)",James Dashner,2009.0,4.02,0.777965
5307,"Independent Study (The Testing, #2)",Joelle Charbonneau,2014.0,3.97,0.771688
3336,"UnWholly (Unwind, #2)",Neal Shusterman,2012.0,4.25,0.770104
4192,"The Crown of Embers (Fire and Thorns, #2)",Rae Carson,2012.0,4.2,0.768002
2376,"Through the Ever Night (Under the Never Sky, #2)",Veronica Rossi,2013.0,4.17,0.767778
6721,"Horde (Razorland, #3)",Ann Aguirre,2013.0,4.28,0.765928
524,"Pandemonium (Delirium, #2)",Lauren Oliver,2012.0,4.07,0.765614


---
## Final Recommendations DataFrame <a class="anchor" id="part-4.0"></a>

In this final section, we create two dataframes, each indexed by book title and holding the book_id of that book's top 15 recommendations ranked by weighted score: one that allows books by the same author and one that excludes them. These dataframes are saved for use in our interactive showcase, so that results can be served quickly without complex calculations that could slow down the demonstration. To achieve this, we first build a new function, similar to the hybrid recommender above, that returns a one-row dataframe with the book ids of the top 15 recommendations.

In [61]:
def demo_Recommendations(book_title, exclude_author = True):

    """
    Generate book recommendations for demonstration purposes.

    Parameters:
    - book_title (str): Title of the book for which recommendations are desired.
    - exclude_author (bool, optional): Option to exclude books by the same author from recommendations. Default is True.

    Returns:
    - results (pd.DataFrame): One-row DataFrame, indexed by the book title, containing the top 15 recommended book IDs.

    Note:
    - The recommendation is based on a hybrid approach combining collaborative filtering and content-based filtering.
    """
        
    # Get the book_id
    book_id = title_to_id(book_title)
    
    # Get indices for collaborative and content-based filtering
    coll_index = id_to_vin(book_id, book_simind)
    cont_index = id_to_vin(book_id, tag_ind)
    
    # Get similarity data
    coll_data = coll_similarities[coll_index]    # collaborative: Funk SVD latent features
    cont_data1 = cont2_similarities[cont_index]  # content: tags
    cont_data2 = cont1_similarities[cont_index]  # content: genres
    
    # Create DataFrames for collaborative and content-based filtering
    coll_df = book_simind.copy()
    cont_df = tag_ind.copy()
    
    # Attach the similarity scores to each index DataFrame
    coll_df['SIM1'] = coll_data
    cont_df['SIM2'] = cont_data1
    cont_df['SIM3'] = cont_data2
    
    # Merge the DataFrames on book_id
    merge_df = pd.merge(cont_df, coll_df, on='book_id', how='left')
    merge_df = pd.merge(merge_df, book_df[['book_id', 'average_rating']], on='book_id', how='right').set_index('book_id')
    
    merge_df.columns = merge_df.columns.map(''.join)
    # Calculate the total score
    merge_df['score'] = weight_score(merge_df)
    
    # Exclude books by the same author if specified
    if exclude_author:
        author = book_df[book_df['book_id'] == book_id]['authors'].values[0]
        same_author_ids = list(book_df[book_df['authors'] == author]['book_id'].values)
        merge_df = merge_df.drop(same_author_ids)
    else:
        merge_df = merge_df.drop(book_id)
    
    # Sort and retrieve top 15 recommendations
    merge_df = merge_df.sort_values('score', ascending=False)
    top15_ids = merge_df.head(15).index
    results = pd.DataFrame([top15_ids], index = [book_title], columns = list(range(1, 16)))
    
    return results


In [65]:
book_demo = pd.DataFrame([])

for i in range(1,10001):
    tb = book_df[book_df['book_id'] == i]['title'].values[0]
    rank_df = demo_Recommendations(tb, False)
    book_demo = pd.concat([book_demo,rank_df])

book_demo.shape

(10000, 15)

In [66]:
book_demo.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
"The Hunger Games (The Hunger Games, #1)",17,507,20,3712,4720,1531,12,717,280,91,5308,3337,4193,2377,6722
"Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",18,23,24,27,25,3753,21,2101,422,3275,469,9048,6141,1286,3054
"Twilight (Twilight, #1)",49,52,992,834,2021,220,56,3075,636,1619,732,1109,1609,73,954
To Kill a Mockingbird,265,32,42,66,116,58,87,93,225,5,15,1094,650,540,468
The Great Gatsby,8,32,4,483,28,63,736,65,899,95,468,523,387,715,456


In [67]:
book_demo2 = pd.DataFrame([])

for i in range(1,10001):
    tb = book_df[book_df['book_id'] == i]['title'].values[0]
    rank_df = demo_Recommendations(tb, True)
    book_demo2 = pd.concat([book_demo2,rank_df])

book_demo2.shape

(10000, 15)

In [68]:
book_demo2.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
"The Hunger Games (The Hunger Games, #1)",12,717,280,91,5308,3337,4193,2377,6722,525,8577,69,4226,327,422
"Harry Potter and the Sorcerer's Stone (Harry Potter, #1)",18,3753,422,469,9048,6141,1286,3054,6009,7888,1,3739,1531,3712,4415
"Twilight (Twilight, #1)",992,220,3075,636,1109,1609,954,3854,9486,2749,1737,9936,9500,6052,9106
To Kill a Mockingbird,265,32,42,66,116,58,87,93,225,5,15,1094,650,540,468
The Great Gatsby,8,32,4,483,28,63,736,65,899,95,468,523,387,715,456


In [None]:
book_demo.to_csv('Streamlit/book_demo_auth.csv')
book_demo2.to_csv('Streamlit/book_demo_noauth.csv')
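
Once saved, the tables can serve recommendations with a simple lookup. The sketch below is a hypothetical illustration of that idea: it reads a small in-memory stand-in for the CSV (toy titles and ids, three ranks instead of fifteen), and `precomputed_recommendations` is a helper name introduced here rather than part of the showcase code.

```python
import io
import pandas as pd

# Stand-in for a saved CSV: titles as the index, ranked book_ids in columns 1..3
toy_csv = io.StringIO(
    ",1,2,3\n"
    "Book A,12,717,280\n"
    "Book B,18,23,24\n"
)
lookup = pd.read_csv(toy_csv, index_col=0)

def precomputed_recommendations(title, table, n=3):
    """Return the first n stored book_ids for a title (hypothetical helper)."""
    return table.loc[title].head(n).tolist()

print(precomputed_recommendations("Book A", lookup, n=2))
```

Because each request reduces to a single row lookup, no similarity matrix has to be loaded or recomputed at demonstration time.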