# Recommender Systems Using Content-Based Method

> *Recommender Systems*  
> *MSc in Data Science, Department of Informatics*  
> *Athens University of Economics and Business*

---

Find a dataset that can be used to inform a ***content-based*** recommender system.

Build a Python notebook that:

- Loads the dataset
- Creates a content-based recommender system
- Uses quantitative metrics to evaluate the recommendations of the system

## Introduction

### *Libraries*

In [1]:
import pandas as pd
import numpy as np
from dataclasses import dataclass
import torch
from torch import Tensor
from sentence_transformers import SentenceTransformer, util
import csv
import re
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import scipy
# note: `random` below is the function random.random(), not the module
from random import random, sample, shuffle
# the NLTK 'stopwords' and 'wordnet' corpora must be available locally,
# e.g. via nltk.download('stopwords') and nltk.download('wordnet')

### *Data*

- The data contain books sold on Amazon
- There are two files which contain the book info and the corresponding reviews
- The book attributes we will focus on are the following:
    - *Title*
    - *Genre*
    - *Description*
    - *Authors*
    - *Publisher*
    - *Published Year*
    
The data are available at the following link:

https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews/data

##### *Book class definition*

In [4]:
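from dataclasses import field

@dataclass
class book: 
    # annotated so that @dataclass treats these as per-instance fields
    title: str = ""
    description: str = ""
    authors: set = field(default_factory=set)
    publisher: str = ""
    published_year: int = 0
    genre: str = ""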

##### *Loading Sentence BERT*

In [3]:
#load SBERT
sbert = SentenceTransformer('all-MiniLM-L6-v2')

#### Loading Book Dataset

* Define the function that loads the book info dataset
* Define the function that computes the cosine similarities between all book genres
* Using cosine similarity to determine genre similarity is optional because of its high execution time (approx. 18 min)

In [4]:
def load_books(input_file:str):
    
    # function that loads the books from the file
    
    title_index={}# title to index 
    books=[]
    with open(input_file, encoding='utf-8') as f:
        descriptions=[]
        genres = []
        for row in csv.DictReader(f):# for each book
            
            # make a new book object
            new_bk = book()
            
            # set book title
            new_bk.title = row['Title']
            
            # set book description
            desc = process_description(row['description'])
            new_bk.description = desc
            
            # set the book authors
            s = [x.strip() for x in row['authors'].split(',')]
            new_bk.authors = set([re.sub(r'[^.,a-zA-Z0-9 \n]', '', x) for x in s])
            
            # set the book publisher
            new_bk.publisher = row['publisher'] if len(row['publisher']) > 0 else None
            
            # set the books published year
            date = re.sub(r'[^.,a-zA-Z0-9 \n]', '', row['publishedDate'])
            new_bk.published_year = int(date[:4]) if len(date) >= 4 else None
            
            # set the books genre
            cat = row['categories'].strip()
            g = re.sub(r"[\[\]']+", '', cat) if len(row['categories']) > 0 else ''
            new_bk.genre = g
            
            #store the description separately
            descriptions.append(desc)
            genres.append(g)
            
            #update the index             
            title_index[row['Title']] = len(books)             
            books.append(new_bk)
        
    tfidf_matrix = tf_idf(descriptions, 500)    
    
    return books,genres,title_index,tfidf_matrix
    
def genre_similarity(genres: list): #execution time approximately 18 minutes
    
    # function that computes the cosine similarities of all genres encountered in the dataset
    
    keys = list(set(genres))
    genres_dict = {key: i+1 for i, key in enumerate(keys)}

    # encode all genres, in the same order as the indices stored in genres_dict
    embedded=sbert.encode(keys,convert_to_tensor=True)
    
    #Compute cosine-similarities
    sim_matrix = util.cos_sim(embedded, embedded)
    
    return genres_dict,sim_matrix

def process_description(doc: str):
    
    # function that pre-processes the book descriptions
    
    # Replace non-word characters (punctuation, symbols) with spaces; digits and underscores are kept
    document = re.sub(r'\W', ' ', str(doc))

    # Remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
      
    # Substitute multiple spaces with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Convert to Lowercase
    document = document.lower()

    # Split the description based on whitespaces (--> List of words)
    document = document.split()
    
    # Lemmatization
    stemmer = WordNetLemmatizer()
    document = [stemmer.lemmatize(word) for word in document]
    
    # Reconstruct the description by joining the words on each whitespace
    document = ' '.join(document)
    
    return document

def tf_idf(descriptions: list, # book descriptions,
           max_feat: int # max_features parameter for the tf_idf vectorizer
          ):
    
    # returns the tf_idf matrix for the book descriptions
    
    tfidf = TfidfVectorizer(
    ngram_range = (1, 2), 
    max_features = max_feat,
    sublinear_tf = True, 
    stop_words = stopwords.words('english'))
    
    tfidf_matrix = tfidf.fit_transform(descriptions)
    
    return tfidf_matrix
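
To see what the regex steps of `process_description` do to a raw description, here is a minimal sketch that reproduces only those steps and leaves out the lemmatization (which needs the NLTK WordNet corpus). The helper name `clean_text` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def clean_text(doc: str) -> str:
    """Regex-only part of the preprocessing (hypothetical helper, no lemmatization)."""
    text = re.sub(r'\W', ' ', str(doc))           # non-word characters -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # drop isolated single letters
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace
    return text.lower().strip()

print(clean_text("A boy's quest: the Alchemist & his dream!"))
```

This prints `a boy quest the alchemist his dream`: the stray possessive `s` is dropped, while the leading `A` survives because no whitespace precedes it. Digits are word characters, so a string such as `Top 10 books` keeps its number.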

In [2]:
# load books
books,genres,title_index,tfidf_matrix = load_books('books_data.csv')

In [6]:
# compute genre similarity
genres_dict,genre_sim_matrix = genre_similarity(genres)

#### *Function to calculate book similarity*

In [7]:
def book_similarity(title1:str, 
           title2:str, 
           books:list,
           weights:dict, 
           genre_sim_matrix:Tensor,
           tfidf_matrix: scipy.sparse._csr.csr_matrix,
           genre_dict: dict,
           title_index:dict)->float:
        
    mid1,mid2=title_index[title1],title_index[title2]# get the book ids
    m1=books[mid1]  
    m2=books[mid2]

    scores=dict() # stores the score for each factor from the weights dict
    
    #authors jaccard
    scores['authors'] = len(m1.authors.intersection(m2.authors))/len(m1.authors.union(m2.authors))

    #publisher similarity
    scores['publisher'] = 1 if m1.publisher == m2.publisher else 0
    
    #exact-match genre similarity if the genre similarity tensor is empty
    if m1.genre is not None and m2.genre is not None and genre_sim_matrix.numel() == 0:
        scores['genre'] = 1 if m1.genre == m2.genre else 0
        #len(set(m1.genre).intersection(set(m2.genre)))/len(set(m1.genre).union(set(m2.genre)))
        
    #genre cosine similarity if the tensor has been computed
    if genre_sim_matrix.numel() > 0:
        
        # look up the precomputed similarity of the two genres (genre_dict indices are 1-based);
        # m1 and m2 were already fetched via title_index, so no scan over all books is needed
        scores['genre'] = genre_sim_matrix[genre_dict[m1.genre] - 1, genre_dict[m2.genre] - 1].item()
    
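    # published year difference in centuries (note: this value grows as the years move apart)
    try:
        scores['published_year']=abs(m1.published_year-m2.published_year)/100
    except TypeError: # one of the years is missing (None)
        scores['published_year']=0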
        
    # description similarity
    
    if m1.description != '' and m2.description != '':
        scores['description'] = cosine_similarity(tfidf_matrix[mid1], tfidf_matrix[mid2])[0][0]
    else:
        scores['description'] = 0

    #create the sim dict 
    factors={x:round(scores[x]*weights[x],2) for x in scores}
    
    #sort factors by sim
    sorted_factors=[factor for factor in sorted(factors.items(), key=lambda x:x[1],reverse=True) if factor[1]>0]
    
    #return overall score and explanations
    return round(np.sum(list(factors.values())),2),sorted_factors

In [8]:
# execute
book_similarity('The Alchemist', 
       'Redeeming Love',
       books,     
       {'published_year':1, 'genre':1, 'publisher':1, 'authors':1, 'description': 1},
       genre_sim_matrix,
       tfidf_matrix,
       genres_dict,
       title_index)

(1.24, [('genre', 1.0), ('description', 0.21), ('published_year', 0.03)])
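
The way `book_similarity` combines its factors can be sketched in isolation. Note that `abs(m1.published_year - m2.published_year)/100` grows as the years move apart, so as written it behaves like a distance rather than a similarity. The sketch below keeps the notebook's weighting and explanation logic but swaps in a bounded year similarity; `jaccard`, `year_similarity`, `combine`, and the `span` parameter are hypothetical names introduced here for illustration, not the notebook's own method.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def year_similarity(y1, y2, span: int = 100) -> float:
    """Assumed alternative: 1.0 for the same year, 0.0 at `span` or more years apart."""
    if y1 is None or y2 is None:
        return 0.0
    return max(0.0, 1.0 - abs(y1 - y2) / span)

def combine(scores: dict, weights: dict):
    """Weight each factor; return the total and the non-zero factors, largest first."""
    factors = {name: round(scores[name] * weights[name], 2) for name in scores}
    explanation = sorted(
        ((name, value) for name, value in factors.items() if value > 0),
        key=lambda item: item[1],
        reverse=True,
    )
    return round(sum(factors.values()), 2), explanation

# toy inputs: one shared author out of two, books published ten years apart
scores = {
    'authors': jaccard({'a'}, {'a', 'b'}),
    'published_year': year_similarity(1990, 2000),
}
total, explanation = combine(scores, {'authors': 1.0, 'published_year': 0.5})
print(total, explanation)
```

With a bounded similarity, every factor contributes at most its weight, which keeps a single factor from dominating the ranking.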

#### *Book Recommendation*

In [12]:
def recommend(input_title:str, 
              k:int,
              books:list,
              weights:dict,
              genre_sim_matrix:Tensor,
              tfidf_matrix: scipy.sparse._csr.csr_matrix,
              genre_dict:dict,
              title_index:dict,
              subset_size: float
              )->list:

    results={} # recommendations 
    
    # number of books to select
    num_items_to_select = int(len(books) * subset_size)

    # select a random subset of books
    random_subset = sample(books, num_items_to_select)

    for candidate in random_subset: # for each candidate
        
        if candidate.title != input_title:
            
            #get the similarity and the explanation
            my_sim,my_exp=book_similarity(candidate.title,input_title,books, weights,genre_sim_matrix,tfidf_matrix,genre_dict,title_index)
    
            #remember
            results[candidate.title]=(my_sim,my_exp)

    #store, slice, return
    return sorted(results.items(),key=lambda x:x[1][0],reverse=True)[:k]

In [13]:
# going through all the books takes approx. 90 minutes, so we use a subset
recommend('The Alchemist',
          30,
          books,
          {'published_year':0.5, 'genre':1.1, 'publisher':1, 'authors':1.1, 'description': 1.2},
          genre_sim_matrix,
          tfidf_matrix,
          genres_dict,
          title_index,
          0.08)

[('The Valkyries',
  (3.41,
   [('authors', 1.1),
    ('genre', 1.1),
    ('publisher', 1),
    ('description', 0.2),
    ('published_year', 0.01)])),
 ('Animal Dreams',
  (2.46,
   [('genre', 1.1),
    ('publisher', 1),
    ('description', 0.35),
    ('published_year', 0.01)])),
 ('Warrior Angel',
  (2.41,
   [('genre', 1.1),
    ('publisher', 1),
    ('description', 0.3),
    ('published_year', 0.01)])),
 ('The Faithful Gardener',
  (2.38,
   [('genre', 1.1),
    ('publisher', 1),
    ('description', 0.22),
    ('published_year', 0.06)])),
 ('Masques', (2.37, [('published_year', 2.13), ('genre', 0.24)])),
 ('Vacuum Diagrams: Stories of the Xeelee Sequence',
  (2.37,
   [('genre', 1.1),
    ('publisher', 1),
    ('description', 0.26),
    ('published_year', 0.01)])),
 ('Adultery',
  (2.36,
   [('authors', 1.1),
    ('genre', 1.1),
    ('description', 0.12),
    ('published_year', 0.04)])),
 ("Something's Cooking: An Angie Amalfi Mystery (Angie Amalfi Mysteries)",
  (2.33,
   [('genre'
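
`recommend` sorts every scored candidate and then slices the first `k`. When only a short list is needed, the same result can be obtained without a full sort. The sketch below assumes a `results` dictionary shaped like the one built inside `recommend` (title mapped to a `(score, explanation)` tuple); `top_k` and the toy data are hypothetical.

```python
import heapq

def top_k(results: dict, k: int) -> list:
    """Return the k (title, (score, explanation)) pairs with the highest score."""
    return heapq.nlargest(k, results.items(), key=lambda item: item[1][0])

toy_results = {'A': (2.4, []), 'B': (3.1, []), 'C': (0.7, [])}
print(top_k(toy_results, 2))
```

`heapq.nlargest` is documented as equivalent to `sorted(iterable, key=key, reverse=True)[:n]`, so the ordering matches the original return statement.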

## Evaluation

#### User Recommendations / Evaluation

#### *Define Class for User*

In [15]:
@dataclass
class User:
    User_id: str #user id
    read_books:list #books the user has read
    ratings: list #user book ratings
    likes:list # known books that the user has liked 
    dislikes:list # known books that the user has disliked
    rec_likes:list # books that the user will like 
    rec_dislikes:list # books that the user will dislike
    weights:dict # user preferences 
    like_threshold:float # similarity threshold for liking a book

#### *Functions to get user sample and initialize their respective objects*

- This is required because the book reviews file is too large to process in full

In [16]:
def get_user_sample(ratings_file: str,
                    sample_size: int):
    
    users_df = pd.read_csv(ratings_file, usecols=['User_id'])
    users_df.dropna(inplace = True)
    
    samples = users_df['User_id'].sample(n = sample_size, replace=False).unique()
    
    return list(samples)

def initialize_user_objects(sample_users: list,
                            factors: list):    
    users = []
    
    for user_id in sample_users:
        # Initialize a random weight for each factor
        weights = {factor: round(random(), 2) for factor in factors}
        
        # Create and append a new User object
        users.append(User(user_id, [], [], [], [], [], [], weights, 0))
    
    return users
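
One detail of `get_user_sample` is worth noting: it samples review rows first and removes duplicate IDs afterwards, so it can return fewer than `sample_size` users when one user wrote several of the sampled reviews. A minimal sketch of the alternative order (deduplicate, then sample) is shown below on a plain list of IDs; `sample_distinct_users` and its `seed` parameter are assumptions made for illustration.

```python
from random import Random

def sample_distinct_users(user_ids: list, sample_size: int, seed=None) -> list:
    """Sample up to `sample_size` distinct, non-missing user IDs."""
    unique_ids = sorted({u for u in user_ids if u})  # drop missing IDs, deduplicate
    rng = Random(seed)                               # local RNG, reproducible with a seed
    return rng.sample(unique_ids, min(sample_size, len(unique_ids)))

ids = ['u1', 'u1', 'u2', None, 'u3', 'u2']
print(sample_distinct_users(ids, 2, seed=0))
```

Using a local `Random` instance avoids touching the name `random`, which in this notebook refers to the function imported from the `random` module.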

#### *Functions to fill user information and define like threshold*    

In [17]:
def load_user_info(input_file: str,
                   sample_users: list,
                   users: list):
    
    """
    Loads missing user information from an input file for a given list of sample users.

    Parameters:
    - input_file (str): The file path of the input file.
    - sample_users (List[str]): A list of user IDs in the sample.
    - users (List[User]): A list of User objects to update.

    Returns:
    - List[User]: The updated list of User objects with loaded information.
    """
    
    with open(input_file, encoding='utf-8') as f:
    
        for row in csv.DictReader(f):
        
            if row['User_id'] in sample_users: # if the user is in the sample
            
                target_user = [user for user in users if user.User_id == row['User_id']][0]
                
                if row['Title'] not in target_user.read_books: # if the title has not already been read
                    
                    target_user.read_books.append(row['Title'])
                    target_user.ratings.append(row['review/score'])
            
                    if float(row['review/score']) > 3:
                        target_user.likes.append(row['Title'])
                    elif float(row['review/score']) <= 3:
                        target_user.dislikes.append(row['Title'])
    
    return users

def define_like_threshold(users: list,
                          books:list,
                          genre_sim_matrix:Tensor,
                          tfidf_matrix: scipy.sparse._csr.csr_matrix,
                          genre_dict:dict,
                          title_index:dict,
                          subset_size: float,
                          std_multiplier:float=1.5
                         ):
    
    """
    Calculates the user's like threshold based on book similarity.

    Parameters:
    - users (List[User]): A list of User objects.
    - books (List[Book]): A list of Book objects.
    - genre_sim_matrix (Tensor): The matrix representing genre similarity.
    - tfidf_matrix (csr_matrix): The sparse TF-IDF matrix of the book descriptions.
    - genre_dict (dict): A dictionary mapping genres to indices.
    - title_index (dict): A dictionary mapping book titles to indices.
    - subset_size (float): The fraction of books to consider for similarity computation.
    - std_multiplier (float, optional): The multiplier for the standard deviation in threshold calculation.

    Returns:
    - List[User]: The list of User objects with updated like thresholds.
    """
    
    for user in users:
        
        sim_scores=[] 
        
        for book in user.likes:
            
            # number of books to select
            num_items_to_select = int(len(books) * subset_size)

            # select a random subset of books
            random_subset = sample(books, num_items_to_select)
            
            for candidate in random_subset:
                if candidate.title != book:
                    
                    # Calculate similarity value between candidate and liked book
                    val = book_similarity(candidate.title, 
                                    book,
                                    books,     
                                   user.weights,
                                   genre_sim_matrix,
                                   tfidf_matrix,
                                   genre_dict,
                                   title_index)[0]
                    
                    sim_scores.append(val) # Store similarity score
         
        # Calculate the threshold as the mean plus std_multiplier times the standard deviation
        threshold=np.mean(sim_scores)+ std_multiplier*np.std(sim_scores)
        user.like_threshold = round(threshold,2)
        
    return users    

In [18]:
# get sample users
sample_users = get_user_sample('Books_rating.csv', 5)

In [19]:
# initialize objects
users = initialize_user_objects(sample_users,['published_year', 'genre', 'publisher', 'authors', 'description'])

# fill info
users = load_user_info('Books_rating.csv',sample_users,users)

# keep only the users that have posted ratings
users = [user for user in users if len(user.read_books) > 0]

In [151]:
# define like threshold
users = define_like_threshold(users,
                              books,
                              genre_sim_matrix,
                              tfidf_matrix,
                              genres_dict,
                              title_index,
                              0.009)
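
The threshold rule used by `define_like_threshold` is the mean of the similarity scores plus `std_multiplier` times their standard deviation. A small worked sketch with the standard library follows; it uses `pstdev` because `np.std` computes the population standard deviation by default. The helper name `like_threshold` and the toy scores are illustrative assumptions.

```python
from statistics import mean, pstdev

def like_threshold(sim_scores: list, std_multiplier: float = 1.5) -> float:
    """mean + std_multiplier * population std, rounded to two decimals."""
    return round(mean(sim_scores) + std_multiplier * pstdev(sim_scores), 2)

print(like_threshold([1.0, 2.0, 3.0, 4.0, 5.0]))
```

For these scores the mean is 3 and the population standard deviation is the square root of 2, so the threshold is 5.12. Because the threshold sits above the typical similarity between a liked book and a random candidate, only unusually similar books count as predicted likes.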

#### *Function that recommends/evaluates books for each user*    

In [25]:
def random_user_recommendations(users: list,
                               books:list,
                               genre_sim_matrix:Tensor,
                               tfidf_matrix: scipy.sparse._csr.csr_matrix,
                               genre_dict:dict,
                               title_index:dict,
                               recnum_per_user:int = 25):
    
    """
    Draws random books for each user and predicts whether the user will like each one.

    Parameters:
    - users (list): A list of User objects.
    - books (list): A list of Book objects.
    - genre_sim_matrix (Tensor): The matrix representing genre similarity.
    - tfidf_matrix (csr_matrix): The sparse TF-IDF matrix of the book descriptions.
    - genre_dict (dict): A dictionary mapping genres to indices.
    - title_index (dict): A dictionary mapping book titles to indices.
    - recnum_per_user (int, optional): Number of recommendations to generate per user.
    """
    
    for user in users:
    
        #initialize likes and dislikes for this user
        user.rec_likes=[] 
        user.rec_dislikes=[]
    
        recommended=set() # remember titles of recommended books 
    
        for i in range(recnum_per_user): # for each recommendation to make 
            
            rec_book=None # book to be recommended
            
            # pick a random book that has not been recommended before 
            while rec_book is None or rec_book.title in recommended:
                rec_book=sample(books,1)[0]
            
            #remember the recommendation
            recommended.add(rec_book.title)
            
            found_similar_seed=False # becomes true if the random book is similar to an already read book
            
            for sm in user.read_books: # for each read book for this user
                
                # compute the sim between the random book and the read book
                val=book_similarity(rec_book.title, 
                       sm,
                       books,     
                       user.weights,
                       genre_sim_matrix,
                       tfidf_matrix,
                       genre_dict,
                       title_index)[0]
                
                # if the sim is over the like threshold of this user
                if val>user.like_threshold:
                    found_similar_seed=True
                    break
                    
            if found_similar_seed: # similar seed found, the user will like this book
                user.rec_likes.append(rec_book)
                print('YES',rec_book.title,len(user.rec_likes))
            else: 
                user.rec_dislikes.append(rec_book)
                print('NO',rec_book.title,len(user.rec_dislikes))
                    
        print('\n\nTotal Likes out of ' + str(recnum_per_user) + ' :', len(user.rec_likes)
              ,'\nRecommendations Performance: ' + str(round(100 * len(user.rec_likes) / recnum_per_user, 1)) + ' %'
              ,'\n\n-----------------------\n')

In [240]:
random_user_recommendations(users,
                            books,
                            genre_sim_matrix,
                            tfidf_matrix,
                            genres_dict,
                            title_index,
                            25
                           )

YES Horses in the Air and Other Poems (Spanish Edition) 1
NO We Have Always Lived in the Castle 1
NO Lady Hellfire 2
YES AN EXPERIMENT IN CRITICISM 2
YES Beyond : Visions Of The Interplanetary Probes 3
YES Every Brilliant Eye 4
NO Riddle & the Knight 3
NO This is Earl Nightingale 4
YES Food & Wine Cocktails 2005: The Best Drinks from America's Hottest Bars, Lounges and Restaurants 5
YES Highland Sunset (Onyx) 6
NO My Blue-Checker Corker and Me 5
YES Keep The Ring: How to make your marriage sparkle forever 7
YES The Whole Craft of Spinning: From the Raw Material to the Finished Yarn 8
YES Kicking Dogs - A Novel Set In Bangkok 9
YES Blood of the Martyrs: How the Slaves in Rome Found Victory in Christ (Christian Epics) 10
NO The Superlawyers 6
NO Wintermute 7
NO The Lake At The End Of The World 8
NO Making Mead (Honey Wine). History, Recipes, Methods and Equipment 9
YES Heidegger's Hidden Sources: East-Asian Influences on his Work 11
YES The spell of the Yukon 12
YES The Paid Companion 13
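
The percentage printed by `random_user_recommendations` is the share of random books predicted as likes; it does not compare predictions with ratings the user actually gave. One possible way to do that, sketched under the assumption that predictions are made for titles the user has rated, is to compute precision and recall against the `likes` and `dislikes` lists. The function `precision_recall` and the toy titles are hypothetical.

```python
def precision_recall(predicted_likes: set, known_likes: set, known_dislikes: set):
    """Precision and recall of predicted likes, restricted to titles with a known rating."""
    rated = known_likes | known_dislikes
    predicted = predicted_likes & rated          # ignore titles the user never rated
    true_positives = len(predicted & known_likes)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known_likes) if known_likes else 0.0
    return precision, recall

# toy example: 'X' is unrated, 'A' is a correct prediction, 'B' is a false positive
precision, recall = precision_recall({'A', 'B', 'X'}, {'A', 'C'}, {'B'})
print(precision, recall)
```

In the toy example both values are 0.5: one of the two rated predictions is correct, and one of the two known likes is recovered.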