# Content-Based Filtering: NLP Based Book Recommender Using BERT-Embeddings

In [None]:
__author__ = "Amoli Rajgor"
__email__ = "amoli.rajgor@gmail.com"
__website__ = "amolir.github.io"

# Introduction
- Content-based filtering is one of the two common techniques used in recommender systems (the other being collaborative filtering). As the name suggests, it uses the content of the entity to be recommended to find other items similar to it. In simpler terms, the system finds the keywords or attributes of a product that the user likes, then uses this information to recommend other products with similar attributes.
- For a book recommendation system, given a book name, the recommender suggests books that are similar to it. The choice is based on concise information about the book, such as its theme, author, series, and a summary of its description.

## Book Recommendation System
- The succinct keyword data provided to the recommender system is generated using NLP techniques such as word embeddings. Keywords that best describe the book are extracted from the book description using BERT embeddings. This word collection is further reduced using TF-IDF, a frequency-based feature extraction method that ranks words by their frequency in the book and across the corpus.
- Once the numeric vector representation of every book is generated, each book's vector is compared against all the others, and similar vectors (books) are found using cosine similarity.
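
As a rough illustration of the last two steps, here is a minimal sketch that vectorizes a few keyword summaries with TF-IDF and ranks books by cosine similarity. The book identifiers, the keyword strings, and the `recommend` helper are hypothetical and not taken from the project's dataset or notebooks; the actual pipeline may differ in its vectorizer settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical keyword summaries, one string per book (illustrative only).
summaries = {
    "book_a": "wizard school magic friendship dark lord",
    "book_b": "magic wizard apprentice dragon school",
    "book_c": "detective murder london mystery clue",
}
titles = list(summaries)

# One TF-IDF row vector per book (default settings, L2-normalised rows).
tfidf = TfidfVectorizer().fit_transform(list(summaries.values()))

# Pairwise cosine similarity between all book vectors: shape (n_books, n_books).
similarity = cosine_similarity(tfidf)


def recommend(title, top_n=2):
    """Return the top_n books most similar to `title`, excluding the book itself."""
    idx = titles.index(title)
    scores = similarity[idx].copy()
    scores[idx] = -1.0  # push the query book to the bottom of the ranking
    ranked = np.argsort(scores)[::-1][:top_n]
    return [titles[i] for i in ranked]


print(recommend("book_a", top_n=1))
```

Here `book_a` shares three keywords with `book_b` and none with `book_c`, so `book_b` is ranked first.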
   

![architecture](../images/book_recommendation_system.svg)

---
# Environment and Project Flow
- The project pipeline is divided into three stages: 
1. [![Open Notebook](https://img.shields.io/badge/Jupyter-Open_Notebook-blue?logo=Jupyter)](eda.ipynb) **Cleaning**
2. [![Open Notebook](https://img.shields.io/badge/Jupyter-Open_Notebook-blue?logo=Jupyter)](feature_engineering.ipynb) **Feature Extraction**
3. [![Open Notebook](https://img.shields.io/badge/Jupyter-Open_Notebook-blue?logo=Jupyter)](model.ipynb) **Modeling**

- There’s a dedicated notebook for each of these stages containing a detailed implementation of all intermediate steps. At the end of each stage, the processed data is stored as a CSV file. This notebook serves as a summarised overview of the project.
- I will be using the following list of packages for the project.
> <h4 style="color:blue"> ℹ️ Dependencies </h4> <div style="background-color:#dbeaff">  &#10148; numpy &ge; 1.22.3 <br> &#10148; pandas &ge; 1.4.1 <br> &#10148; scikit-learn &ge; 1.0.2 <br> &#10148; keybert &ge; 0.5.1 <br> &#10148; nltk &ge; 3.5 <br> &#10148; matplotlib &ge; 3.5.1 <br> &#10148; altair &ge; 4.2.0 <br> </div>

In [None]:
# Data Manipulation
import pandas as pd
import numpy as np

# RegEx and String Manipulation
import re
import string

# Language Detection
from nltk.classify.textcat import TextCat

# BERT-Embeddings
from keybert import KeyBERT

# TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# Plotting Heatmap of TF-IDF vectors 
import altair as alt

# Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity

---
# Text Cleaning
- Remove books with missing **Description**.
- Remove URLs and HTML tags from the **Description**.
- Remove punctuation from the **Description**.
- Convert book **Name, Authors, Publishers** and **Description** to lowercase and clip extra spaces.
- Remove books whose descriptions are too short.
- Remove variants of the same book.
- Extract and remove book series information from the **Name** of the book.
- Impute missing **Language** information using the language of the book **Name**.
- Remove double quotes from **Publisher** name.
- Transform Book **Name** and **Authors** into a single token.
- Merge all the textual information into a single summary column.
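
A few of the description-level steps above can be sketched with the standard library alone. This is only an illustration: the regular expressions, the function names `clean_description` and `has_enough_words`, and the `min_words` threshold are assumptions, not the exact rules used in the cleaning notebook.

```python
import re
import string

# Assumed patterns; the cleaning notebook may use different ones.
TAG_PATTERN = re.compile(r"<[^>]+>")                # HTML tags such as <p> or </a>
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")  # bare URLs in the text
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_description(text):
    """Strip HTML tags, URLs and punctuation, lowercase, and clip extra spaces."""
    text = TAG_PATTERN.sub(" ", text)   # tags first, so URLs inside attributes go with them
    text = URL_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    text = text.lower()
    return " ".join(text.split())       # collapse runs of whitespace


def has_enough_words(text, min_words=5):
    """Hypothetical length filter; the real threshold is not stated here."""
    return len(text.split()) >= min_words


raw = "<p>A  Great Book!</p> See https://example.com/page for more."
print(clean_description(raw))
```

Removing tags before URLs is a deliberate choice in this sketch: a link's visible text survives, while the address inside the `href` attribute is dropped together with the tag.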

# Feature Engineering

## Keyword Extraction Using KeyBERT

## Vectorization using TF-IDF

# Cosine Similarity

# Recommendation

# Conclusion