# Book Recommender using TensorFlow Recommenders
$by:\space Jeremiah\space Chinyelugo$

This notebook shows how we can build a simple recommender system using TensorFlow Recommenders.

**Disclaimer:** This notebook is by no means an exhaustive introduction to building recommendation systems. To understand the basics, please refer to Google's [resources](https://www.tensorflow.org/recommenders/examples/basic_retrieval), which go into detail on how to build recommendation systems using the MovieLens dataset. Extremely helpful!


#### What is a Recommender system?
A recommender system is a model, algorithm, or technique used to suggest items, products, or content to users based on their preferences or attributes. These systems analyze large amounts of user data, such as past behavior, ratings, purchase history, or browsing patterns, to generate personalized recommendations.

Recommender systems are important because they help companies increase sales and conversions, help users discover new items or content, and enhance the user experience.

#### How do recommender systems work?
Recommender systems in practical applications typically consist of two main phases:

The first stage, known as retrieval, focuses on selecting an initial group of hundreds of potential candidates from the entire pool of available options. The primary goal of this stage is to efficiently filter out any candidates that are unlikely to be of interest to the user. Due to the potentially large number of candidates involved, the retrieval model must be designed to perform computations swiftly and effectively.

Following the retrieval stage is the ranking stage, which refines the outputs of the retrieval model to identify the best possible subset of recommendations. Its objective is to narrow down the set of items that the user might find appealing to a concise list of highly probable candidates.
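The two stages described above can be sketched in plain Python. This is only a toy illustration, not the model built later in this notebook: the embeddings, the `predicted_rating` values, and the helper names `retrieve` and `rank` are all made up for the example.

```python
from typing import Callable, Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec: Sequence[float],
             item_vecs: Dict[str, Sequence[float]],
             k: int) -> List[str]:
    """Stage 1: cheap similarity score over every item, keep the k best."""
    ranked = sorted(item_vecs,
                    key=lambda name: dot(user_vec, item_vecs[name]),
                    reverse=True)
    return ranked[:k]


def rank(candidates: List[str], score: Callable[[str], float]) -> List[str]:
    """Stage 2: a more expensive scorer reorders the short candidate list."""
    return sorted(candidates, key=score, reverse=True)


# Toy 2-d embeddings (made up for illustration).
user = [1.0, 0.0]
items = {"A": [0.9, 0.1], "B": [0.8, 0.3], "C": [-0.5, 0.9], "D": [0.1, 0.2]}

# Hypothetical ratings that a ranking model might predict for this user.
predicted_rating = {"A": 6.5, "B": 8.0, "C": 9.0, "D": 5.0}

shortlist = retrieve(user, items, k=2)
final = rank(shortlist, score=lambda name: predicted_rating[name])
print(shortlist)  # ['A', 'B']
print(final)      # ['B', 'A']
```

Note that item `C` has the highest hypothetical rating but never reaches the ranking stage, because retrieval already filtered it out. That is the trade-off of the two-stage design: the ranker only sees what retrieval lets through.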


## Contents
1. Importing the packages
2. Preparing our dataset
3. Model building
4. Training and evaluating the model
5. Creating a function that will recommend Books for a user based on their `User ID`, `Age`, and `Specific Author`

<br/>

**As mentioned above, recommendation systems consist of two parts, and in this notebook, we look at both parts (Retrieval & Ranking)**

#### Data
The [data](https://github.com/caserec/Datasets-for-Recommender-Systems/tree/master/Processed%20Datasets/BookCrossing) used for this project was obtained from GitHub. The Book-Crossing dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems.

### 1. Importing the packages

In [1]:
import pandas as pd
import numpy as np
import pprint
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
import tensorflow_recommenders as tfrs
import tempfile
import os
import pyinputplus as pyip

### 2. Preparing the dataset

In [2]:
# loading the dataset

# raw string so that '\s' is treated as a regex, not a Python escape sequence
ratings = pd.read_csv('../../Downloads/book_crossing/book_crossing/book_ratings.dat', delimiter=r'\s+')
items = pd.read_excel('../../Downloads/book_crossing/book_crossing/items_info.xlsx')
users = pd.read_csv('../../Downloads/book_crossing/book_crossing/users_info.dat', delimiter='\t')

In [3]:
# renaming columns in the users dataframe and dropping features not relevant to this project

users = users.reset_index()
users.rename(columns={'index':'User-ID', 'User-ID':'Location', 'Location':'Age', 'Age':'nan'}, inplace=True)
users.drop(['nan', 'Location'], axis=1, inplace=True)

items = items[[' Book_ID','ISBN', 'Book-Title', 'Book-Author']]


**So why are we dropping some features?**

We drop some features because we only want features that will be available during inference, i.e. when the model is being used by users. Remember we are building a recommendation system that should be able to recommend books for **Users** based on their user ID, age, and favourite author, so we only include useful features that will be available once the model has been deployed.

In [4]:
# merging our datasets into one encompassing dataset

df1 = pd.merge(ratings, items, left_on='item', right_on=' Book_ID')
df1.head(2)

| | user | item | rating | Book_ID | ISBN | Book-Title | Book-Author |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 6264 | 7.0 | 6264 | 553280325 | Something Wicked This Way Comes | Ray Bradbury |
| 1 | 496 | 6264 | 8.0 | 6264 | 553280325 | Something Wicked This Way Comes | Ray Bradbury |


In [5]:
df = pd.merge(df1, users, left_on='user', right_on='User-ID')
df.head(2)

| | user | item | rating | Book_ID | ISBN | Book-Title | Book-Author | User-ID | Age |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6264 | 7.0 | 6264 | 553280325 | Something Wicked This Way Comes | Ray Bradbury | 1 | 24 |
| 1 | 1 | 4350 | 7.0 | 4350 | 345441184 | The Mists of Avalon | MARION ZIMMER BRADLEY | 1 | 24 |


In [6]:
# dropping duplicate and non-relevant features

df.drop(['User-ID', 'item', ' Book_ID', 'ISBN'], axis=1, inplace=True)

In [7]:
# checking the number of unique values we have in each feature

for col in df.columns:
    print(f"{col} has {df[col].nunique():,} values")

user has 1,295 values
rating has 10 values
Book-Title has 14,016 values
Book-Author has 8,500 values
Age has 72 values


In [8]:
df.dropna(inplace=True)
df.isna().sum()

user           0
rating         0
Book-Title     0
Book-Author    0
Age            0
dtype: int64

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62651 entries, 0 to 62655
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   user         62651 non-null  int64  
 1   rating       62651 non-null  float64
 2   Book-Title   62651 non-null  object 
 3   Book-Author  62651 non-null  object 
 4   Age          62651 non-null  int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 2.9+ MB


A brief look through our `Age` feature shows that some users are listed as over 100 years old, with a maximum of 204. So we need to trim the `Age` feature by dropping instances where the user is over the age of 99.

In [10]:
df['Age'].max()

204

In [11]:
# trimming age range

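# .copy() gives us an independent dataframe, so later column assignments do not trigger SettingWithCopyWarning
df = df[~df['Age'].between(100, 300)].copy()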
df.head()

| | user | rating | Book-Title | Book-Author | Age |
|---|---|---|---|---|---|
| 0 | 1 | 7.0 | Something Wicked This Way Comes | Ray Bradbury | 24 |
| 1 | 1 | 7.0 | The Mists of Avalon | MARION ZIMMER BRADLEY | 24 |
| 2 | 1 | 5.0 | Sacred Sins | Nora Roberts | 24 |
| 3 | 1 | 9.0 | What a Wonderful World: A Lifetime of Recordings | Bob Thiele | 24 |
| 4 | 1 | 6.0 | A Coral Kiss | Jayne Ann Krentz | 24 |


To train our model, we need to convert our pandas dataframe to a TensorFlow dataset object, and that is what the following cells do.

To create an effective model, proper preprocessing of string (str) and integer (int) features is of utmost importance. In machine learning, there are two predominant approaches to preprocess data, each with its own associated drawbacks.

The first approach involves preprocessing the data prior to feeding it into the model. This method is often favored when operating on low-performance devices, such as laptops, as including a preprocessing step within the model can slow down training. However, a notable drawback arises during deployment: a separate preprocessing step must be developed and integrated into the deployment pipeline. Furthermore, if that step encounters unfamiliar data, it may not handle it appropriately, leading to suboptimal model performance.

The second approach entails integrating the preprocessing step directly within the model itself. While this may marginally impact training time, it offers the advantage of simplified model deployment. By incorporating the preprocessing step as an integral part of the model, the need for separate preprocessing code during deployment is eliminated, streamlining the overall process.

It is essential to acknowledge that both approaches involve trade-offs, and the selection between them hinges upon factors such as available computational resources, deployment requisites, and the inherent characteristics of the data under consideration.

We will be including our preprocessing step in our model, and to do that we need the unique values of each of our features.

In [12]:
# converting categorical & numerical features to string & integer respectively

for col in df.columns:
    if col not in ['rating','Age']:
        df[col] = df[col].astype(str)
    else:
        df[col] = df[col].astype(int)

In [13]:
df.dtypes

user           object
rating          int32
Book-Title     object
Book-Author    object
Age             int32
dtype: object

In [14]:
# converting our df to dictionary

df_dict = {name: np.array(val) for name, val in df.items()}

# converting our dataframe dictionary to a tensorflow dataset
data = tf.data.Dataset.from_tensor_slices(df_dict)

Next, we build the vocabularies used for the lookup and text vectorization of books and authors.

In [15]:
# getting a dictionary of unique values in our features

vocabularies = {}

for feature in df_dict:
    if feature != 'rating':
        vocab = np.unique(df_dict[feature])
        vocabularies[feature] = vocab

In [16]:
# converting book-title to a tensorflow dataset
book_titles = tf.data.Dataset.from_tensor_slices(vocabularies['Book-Title'])
book_authors = df['Book-Author'].unique()
user_age = df['Age'].values

In [17]:
# shuffling and splitting our dataset into train, validation and test
tf.random.set_seed(42)

shuffled = data.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(46_797)
validation = shuffled.skip(46_797).take(9_359)
test = shuffled.skip(56_156).take(6_240)
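
The three sizes hard-coded above sum to 62,396 rows. The notebook does not say how they were chosen, but they are consistent with a 75/15/10 split, which is an inference on our part. A small helper like the one below (the name `split_sizes` is hypothetical) would derive them instead of hard-coding them:

```python
from typing import Tuple


def split_sizes(n_rows: int, train_frac: float, val_frac: float) -> Tuple[int, int, int]:
    """Return (train, validation, test) sizes; the test split takes the remainder."""
    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)
    return n_train, n_val, n_rows - n_train - n_val


# 62,396 is the sum of the three sizes used in the cell above.
sizes = split_sizes(62_396, train_frac=0.75, val_frac=0.15)
print(sizes)  # (46797, 9359, 6240)
```

Giving the remainder to the test split guarantees the three parts always add up to the full dataset, even when the fractions do not divide it evenly.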

### 3. Model Building

Recommender systems often consist of retrieval and ranking models. We will build both models in this section.

In these models, the required feature preprocessing steps will be included. This will reduce the chances of error that might be introduced when deployed in production, and also make deployment easier.

We will create 3 python classes:
- *UserModel:* This class will be responsible for preprocessing the user's attributes using Embeddings
- *TitleModel:* This class will preprocess the book titles in our dataset using Embeddings
- *FullModel:* This class will incorporate output of the UserModel and TitleModel (i.e, Embeddings) to perform a retrieval and ranking task. The retrieval and ranking tasks will be created using tensorflow recommenders.

**More info on the Entire model**

Our model, which incorporates all the classes mentioned above, will include:
- User attribute embeddings for user ID and book author, plus a normalized age feature
- Title embeddings
- Deep & Cross Network
- Dense layers
- Retrieval task layer to retrieve the top k books that align with the user's attributes
- Ranking task layer to rank the retrieved books
- `call()` method to build the model
- `compute_loss()` method

The user attribute embeddings reduce categorical features with a large number of unique values into a more manageable form. To create the embeddings, each feature is passed to a `StringLookup` layer, which assigns an index to each unique value in the vocabulary we created earlier; the index is then passed to an `Embedding` layer that creates an n-dimensional representation of our feature. In this case, we will be using a dimension of 32.
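
To make the lookup-then-embed idea concrete, here is a minimal pure-Python sketch. It is not the Keras implementation: the toy vocabulary, the random table, and the helper names `lookup` and `embed` are assumptions for illustration. It does mirror one detail of the classes below: with `mask_token=None`, index 0 is kept for values outside the vocabulary, which is why the embedding tables are sized `len(vocabulary) + 1`.

```python
import random
from typing import List

vocabulary = ["Nora Roberts", "Ray Bradbury"]  # toy vocabulary

# Index 0 is reserved for out-of-vocabulary values, so known values start at 1.
token_to_index = {token: i + 1 for i, token in enumerate(vocabulary)}


def lookup(value: str) -> int:
    """Map a string to its integer index, falling back to 0 for unseen values."""
    return token_to_index.get(value, 0)


embedding_dim = 4  # the notebook uses 32; 4 keeps the example small
rng = random.Random(0)

# One row per index: len(vocabulary) known values + 1 out-of-vocabulary row.
embedding_table: List[List[float]] = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(len(vocabulary) + 1)
]


def embed(value: str) -> List[float]:
    """Lookup followed by a row fetch from the embedding table."""
    return embedding_table[lookup(value)]


print(lookup("Ray Bradbury"))        # 2
print(lookup("Unknown Author"))      # 0
print(len(embed("Unknown Author")))  # 4
```

In the real model the rows of the table are trainable weights, so values that co-occur with similar ratings end up with similar vectors.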

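Book titles and authors each get two embeddings: one that looks up the whole value (`StringLookup`) and one that tokenizes its text (`TextVectorization`). The user ID only gets the lookup embedding, and the numerical `Age` feature is normalized rather than embedded.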

The Deep & Cross Network layer is great for ranking tasks, where we have a lot of features and need additional information from feature crossing. By crossing our features, the model can identify patterns in our data by looking at how the features interact.
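
As a rough illustration of what "crossing" means, the sketch below implements a single cross step of the form $x_{l+1} = x_0 \odot (W x_l + b) + x_l$ in plain Python. We assume this is the full-rank form used when `projection_dim=None`; the weights here are hand-picked rather than learned, and `cross_layer` is a hypothetical helper, not the TFRS layer.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def cross_layer(x0: Vector, x: Vector, W: Matrix, b: Vector) -> Vector:
    """One cross step: x0 * (W @ x + b) + x, with the outer product element-wise."""
    projected = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]
    return [x0_i * p_i + x_i for x0_i, p_i, x_i in zip(x0, projected, x)]


x0 = [1.0, 2.0]   # two toy input features
W = [[0.0, 1.0],  # hand-picked weights that swap the two features
     [1.0, 0.0]]
b = [0.0, 0.0]

out = cross_layer(x0, x0, W, b)
print(out)  # [3.0, 4.0]
```

With these weights each output is `feature_1 * feature_2` plus the original feature, so the layer has produced an explicit product of two inputs, which a plain dense layer can only approximate.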

The Dense layers are a stack of densely connected layers whose neurons allow an arbitrary nonlinear mapping between inputs and outputs.

The retrieval task layer efficiently weeds out books that a user will not be interested in, by reducing the number of potential candidates.

The ranking task layer ranks the candidates that were retrieved by the retrieval layer.

The call method executes the steps above and produces the model's outputs.

The compute_loss method measures how well the model is performing 
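
The sketch below shows, on toy numbers, how a retrieval loss and a rating loss can be combined into one weighted objective. It assumes the retrieval loss is an in-batch softmax cross-entropy summed over the batch (our understanding of the TFRS default, not something this notebook configures explicitly) and uses mean squared error for the ratings. The vectors, labels, and helper names are made up.

```python
import math
from typing import List


def in_batch_softmax_loss(user_vecs: List[List[float]],
                          item_vecs: List[List[float]]) -> float:
    """Sum over the batch of -log softmax score of each user's own item.

    Row i of item_vecs is the positive for user i; the other rows act as negatives.
    """
    total = 0.0
    for i, u in enumerate(user_vecs):
        logits = [sum(a * b for a, b in zip(u, v)) for v in item_vecs]
        log_norm = math.log(sum(math.exp(z) for z in logits))
        total += log_norm - logits[i]
    return total


def mse(labels: List[float], preds: List[float]) -> float:
    """Mean squared error between true and predicted ratings."""
    return sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)


users = [[1.0, 0.0], [0.0, 1.0]]
items = [[2.0, 0.0], [0.0, 2.0]]
retrieval_loss = in_batch_softmax_loss(users, items)
rating_loss = mse([7.0, 5.0], [6.0, 5.0])

# Same 0.5 / 0.5 weighting as rating_weight and retrieval_weight in FullModel.
rating_weight, retrieval_weight = 0.5, 0.5
total_loss = rating_weight * rating_loss + retrieval_weight * retrieval_loss
print(round(retrieval_loss, 4), rating_loss, round(total_loss, 4))
```

Because the two losses live on different scales, the weights are hyper-parameters worth tuning rather than fixed constants.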

In [18]:
class UserModel(tf.keras.Model):
  
    def __init__(self):
        super().__init__()
        
        max_tokens = 10_000
        
        # 1. User ID
        self.user_id_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=vocabularies['user'],
                mask_token=None),
            tf.keras.layers.Embedding(len(vocabularies['user'])+1, 32)
        ])
             
        
        #2. Book Authors
        self.author_vectorizer = keras.layers.TextVectorization(max_tokens=max_tokens)
        self.author_vectorizer.adapt(book_authors)
        self.author_text_embedding = keras.Sequential([
            self.author_vectorizer,
            keras.layers.Embedding(max_tokens, 32, mask_zero=True),
            keras.layers.GlobalAveragePooling1D()
        ])
        
        self.author_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=vocabularies['Book-Author'],
                mask_token=None),
            tf.keras.layers.Embedding(len(vocabularies['Book-Author'])+1, 32)
        ])
         
        
        # 3. User age
        self.normalized_age = keras.layers.Normalization()
        self.normalized_age.adapt(vocabularies['Age'].reshape(-1,1))
        
    # call method passes our input features to the layers above, executes them and returns the concatenated output
    def call(self, inputs):
        
        return tf.concat([
            self.user_id_embedding(inputs['user']),
            self.author_embedding(inputs['Book-Author']),
            self.author_text_embedding(inputs['Book-Author']),
            tf.reshape(self.normalized_age(inputs['Age']), (-1,1))
        ], axis=1) 

In [19]:
class TitleModel(tf.keras.Model):
    
    def __init__(self,):
        super().__init__()
        
        max_tokens = 10_000
        
        #1. Book-Titles
        self.book_vectorizer = keras.layers.TextVectorization(max_tokens=max_tokens)
        self.book_vectorizer.adapt(book_titles)
        self.book_text_embedding = keras.Sequential([
            self.book_vectorizer,
            keras.layers.Embedding(max_tokens, 32, mask_zero=True),
            keras.layers.GlobalAveragePooling1D()
        ])
        
        self.book_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=vocabularies['Book-Title'],
                mask_token=None),
            tf.keras.layers.Embedding(len(vocabularies['Book-Title'])+1, 32)
        ])
        
        
    # call method passes book titles to the embedding layers above, executes them and returns the output embeddings
    def call(self, inputs):
        
        return tf.concat([
            self.book_embedding(inputs),
            self.book_text_embedding(inputs),
        ], axis=1)

In [20]:
tf.random.set_seed(7)
np.random.seed(7)


class FullModel(tfrs.models.Model):
    
    def __init__(self,):
        super().__init__()
        
        # handles how much weight we want to assign to the rating and retrieval task when computing loss
        self.rating_weight = 0.5
        self.retrieval_weight = 0.5
        
        #User model
        self.user_model = tf.keras.Sequential([
            UserModel(),
            tf.keras.layers.Dense(32),
        ])
        
        # Title model
        self.title_model = tf.keras.Sequential([
            TitleModel(),
            tf.keras.layers.Dense(32)
        ])
        
        
        # Deep & Cross layer
        self._cross_layer = tfrs.layers.dcn.Cross(projection_dim=None, kernel_initializer='he_normal')
        
        # Dense layers with l2 regularization to prevent overfitting
        self._deep_layers = [
            keras.layers.Dense(512, activation='relu', kernel_regularizer='l2'),
            keras.layers.Dense(256, activation='relu', kernel_regularizer='l2'),
            keras.layers.BatchNormalization(),
            keras.layers.Dropout(0.2),
            keras.layers.Dense(128, activation='relu', kernel_regularizer='l2'),
            keras.layers.BatchNormalization(),
            keras.layers.Dropout(0.3),
            keras.layers.Dense(64, activation='relu', kernel_regularizer='l2'),
            keras.layers.Dense(32, activation='relu', kernel_regularizer='l2'),
        ]
        
        # output layer
        self._logit_layer = keras.layers.Dense(1)
    
        # Multi-task Retrieval & Ranking
        self.rating_task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
            loss=tf.keras.losses.MeanSquaredError(),
            metrics=[tf.keras.metrics.RootMeanSquaredError()]
        )
        self.retrieval_task: tf.keras.layers.Layer = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=book_titles.batch(128).map(self.title_model)
            )
        )
       
            
    def call(self, features) -> tf.Tensor:
        user_embeddings = self.user_model({
            'user': features['user'],
            'Book-Author': features['Book-Author'],
            'Age': features['Age'],
        })
        
        
        title_embeddings = self.title_model(
            features['Book-Title']
        )
        
        x = self._cross_layer(tf.concat([
                user_embeddings,
                title_embeddings], axis=1))
        
        for layer in self._deep_layers:
            x = layer(x)
            
        
        return (
            user_embeddings, 
            title_embeddings,
            self._logit_layer(x)
        )
        
        
        

    def compute_loss(self, features, training=False) -> tf.Tensor:
        user_embeddings, title_embeddings, rating_predictions = self.call(features)
        # Retrieval loss
        retrieval_loss = self.retrieval_task(user_embeddings, title_embeddings)
        # Rating loss
        rating_loss = self.rating_task(
            labels=features['rating'],
            predictions=rating_predictions
        )
        
        # Combine two losses with hyper-parameters (to be tuned)
        return (self.rating_weight * rating_loss + self.retrieval_weight * retrieval_loss)

### 4. Training and evaluating the Model

In [21]:
# batching and caching our datasets to improve performance

cached_train = train.shuffle(143_000).batch(2000).cache()
cached_validation = validation.shuffle(30_000).batch(2000).cache()
cached_test = test.batch(1000).cache()

In [22]:
# calling our FullModel and training it

keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

# calling and training our model

model = FullModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

model.fit(cached_train, validation_data=cached_validation, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x261ed59aee0>

**Evaluating our model on our test dataset**

In [23]:
scores = model.evaluate(cached_test, return_dict=True, verbose=False)

In [24]:
scores

{'root_mean_squared_error': 1.8871006965637207,
 'factorized_top_k/top_1_categorical_accuracy': 0.29391026496887207,
 'factorized_top_k/top_5_categorical_accuracy': 0.7072115540504456,
 'factorized_top_k/top_10_categorical_accuracy': 0.7903845906257629,
 'factorized_top_k/top_50_categorical_accuracy': 0.8823717832565308,
 'factorized_top_k/top_100_categorical_accuracy': 0.8939102292060852,
 'loss': 241.9789276123047,
 'regularization_loss': 12.987698554992676,
 'total_loss': 254.9666290283203}

In [25]:
print(f"Root mean square error: {scores['root_mean_squared_error']:.3f}")
print(f"Top 1 accuracy: {scores['factorized_top_k/top_1_categorical_accuracy']:.2%}")
print(f"Top 5 accuracy: {scores['factorized_top_k/top_5_categorical_accuracy']:.2%}")
print(f"Top 10 accuracy: {scores['factorized_top_k/top_10_categorical_accuracy']:.2%}")
print(f"Top 50 accuracy: {scores['factorized_top_k/top_50_categorical_accuracy']:.2%}")
print(f"Top 100 accuracy: {scores['factorized_top_k/top_100_categorical_accuracy']:.2%}")

Root mean square error: 1.887
Top 1 accuracy: 29.39%
Top 5 accuracy: 70.72%
Top 10 accuracy: 79.04%
Top 50 accuracy: 88.24%
Top 100 accuracy: 89.39%


**So what do these metrics mean?**

In the context of recommender systems, 
- Root Mean Square Error (RMSE): Measures the average prediction error between the predicted ratings and actual ratings. Lower values indicate better accuracy.

- Top 1 Accuracy: Represents the percentage of times the top-ranked recommendation matches the user's preference.

- Top 5 Accuracy: Represents the percentage of instances where the user's preferred item is within the top 5 recommendations.

- Top 10 Accuracy: Represents the percentage of times the user's preferred item appears in the top 10 recommendations.

- Top 50 Accuracy: Represents the percentage of cases where the user's preferred item is among the top 50 recommendations.

- Top 100 Accuracy: Represents the percentage of times the user's preferred item appears in the top 100 recommendations.

Overall, our model performs well: the book a user actually rated appears in the top 10 retrieved candidates about 79% of the time, and the predicted ratings have an RMSE of about 1.89.
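
For readers who want to see how these metrics are computed, here is a minimal pure-Python sketch on made-up data. The function names `rmse` and `top_k_accuracy` are our own and are not the Keras or TFRS metric classes used above.

```python
import math
from typing import List, Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Square root of the mean squared difference between actual and predicted ratings."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def top_k_accuracy(true_items: Sequence[str],
                   ranked_lists: Sequence[List[str]],
                   k: int) -> float:
    """Fraction of examples whose true item appears in the first k ranked candidates."""
    hits = sum(true in ranked[:k] for true, ranked in zip(true_items, ranked_lists))
    return hits / len(true_items)


# Toy data (made up): three test interactions and the ranked candidates for each.
true_books = ["A", "B", "C"]
ranked = [["A", "D", "B"], ["D", "B", "A"], ["D", "A", "B"]]

print(top_k_accuracy(true_books, ranked, k=1))
print(top_k_accuracy(true_books, ranked, k=2))
print(rmse([7.0, 5.0], [6.0, 7.0]))
```

Top-k accuracy can only stay the same or grow as k increases, which matches the pattern in the scores reported above.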

### 5. Creating a function that will recommend Books for a user based on their `User ID`, `Age`, and `Specific Author`

**Note:** The function below takes the attributes listed above and validates them with custom functions: the age must fall within a range of integer values (0 to 99), and the author and user ID must exist in our vocabularies.

Sure, we could still recommend books for users who are not in our vocabularies (all we have to do is get rid of the validation functions), but for this project we will validate inputs.

In [26]:
# creating our input validation functions

def validate_number(value):
    try:
        number = int(value)
        if number in range(0,100):
            return number
        else:
            raise ValueError("Invalid Age")
    except ValueError:
        raise ValueError("Invalid Age")


def validate_author(value):
    if value in vocabularies['Book-Author']:
        return value
    else:
        raise ValueError("Invalid Author Name")
    
    
def validate_user(value):
    if value in vocabularies['user']:
        return value
    else:
        raise ValueError("Invalid User-ID")

In [27]:
# creating our recommendation functions

def Recommend():
    input_user = pyip.inputCustom(validate_user, prompt="Enter your User-ID: \n")
    input_author = pyip.inputCustom(validate_author, prompt="Enter an Author name: \n")
    input_age = pyip.inputCustom(validate_number, prompt="Enter your Age: \n")
    top_k = pyip.inputNum("Number of recommendations: \n")
    
    print(f"\nGetting your {top_k} book recommendations. Please be patient")
    print("=================================================================================================================================")
    
    index = tfrs.layers.factorized_top_k.BruteForce(model.user_model, k=top_k)
    index.index_from_dataset(
    tf.data.Dataset.zip((book_titles.batch(1000), book_titles.batch(1000).map(model.title_model)))
    )
    
    raw_input = {
        'Age': input_age,
        'Book-Author': input_author,
        'user': input_user
    }
    
    input_dict = {key: tf.constant(np.array([value])) for key, value in raw_input.items()}
    
    _, titles = index(input_dict)
    
    test_rating = {}
    for book in titles.numpy()[0]:
        raw_input['Book-Title'] = book

        input_dict = {key: tf.constant(np.array([value])) for key, value in raw_input.items()}

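        # the model returns (user embeddings, title embeddings, predicted rating)
        _, _, predicted_rating = model(input_dict)
        # store the rating as a plain float so the sort below compares numbers, not tensors
        test_rating[book] = float(predicted_rating.numpy()[0, 0])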


    sorted_dict = sorted(test_rating.items(), key=lambda x: x[1], reverse=True)
    
    
    print("=================================================================================================================================")
    print(f"Top {top_k} recommendations for User: {input_user}\n")
    for i, (k, v) in enumerate(sorted_dict):
        print(' '*2,'-',k,)

In [28]:
# calling our recommendation function

Recommend()

Enter your User-ID: 
Enter an Author name: 
Enter your Age: 
Number of recommendations: 

Getting your 10 book recommendations. Please be patient
Top 10 recommendations for User: 2376

   - b'The Black Gryphon (Daw Book Collectors)'
   - b"Castle of Deception (The Bard's Tale, Book 1)"
   - b'Fiddler Fair'
   - b'Burning Water (Burning Water)'
   - b"Arrow's Flight (Heralds of Valdemar)"
   - b'Burning Water'
   - b"Bedlam's Bard"
   - b'Children of the Night: A Diana Tregarde Investigation'
   - b"The Serpent's Shadow"
   - b'LAMMAS NIGHT'


#### The end