# Overview

This project demonstrates a book recommendation system built on text embeddings from a BERT-based model and cosine similarity. The goal is to recommend books whose content is semantically similar to the books a user has read or selected.

### Key Highlights:
1. **Dataset Preparation**: The dataset is preprocessed to combine relevant book details, including title, author, category, and description, into a single text input for embedding generation.
2. **Embedding Generation**: The text embeddings are created using the `bert-base-uncased` model from the Hugging Face Transformers library. The embeddings are generated in batches for efficiency and use GPU acceleration if available.
3. **Cosine Similarity for Recommendations**: Embeddings are compared with cosine distance (1 minus cosine similarity, as computed by `scipy.spatial.distance.cosine`); the books with the smallest distance to a user's input or history are recommended.
4. **Recommendation Scenarios**:
   - **Single Book Recommendation**: A book selected by the user is used to find similar titles.
   - **Multi-Book Recommendation**: Given the user's reading history (e.g., three books from the "History" category), the system computes the mean of their embeddings and suggests the closest titles, excluding books already in the history.
5. **Outputs**: The system provides a list of recommended books, showcasing its ability to find thematically or contextually relevant titles.
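
As a minimal sketch of the measure behind highlight 3 (not the notebook's own code, which calls `scipy.spatial.distance.cosine`), the following computes cosine similarity and the corresponding cosine distance for small hand-made vectors. The function names and the toy vectors are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance as 1 - cosine similarity (smaller means more similar)."""
    return 1.0 - cosine_similarity(a, b)

# Parallel vectors point the same way, so their distance is close to 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Orthogonal vectors share no direction, so their distance is close to 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the measure depends only on the angle between vectors, scaling an embedding does not change its distance to other embeddings.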

In [20]:
from transformers import BertTokenizer, BertModel
from scipy.spatial.distance import cosine
import pandas as pd
import numpy as np

### Download and prepare the dataset

In [2]:
data = pd.read_csv('/content/BooksDataset.csv')
data.head()

Unnamed: 0,Title,Authors,Description,Category,Publisher,Publish Date,Price
0,Goat Brothers,"By Colton, Larry",,"History , General",Doubleday,"Friday, January 1, 1993",Price Starting at $8.79
1,The Missing Person,"By Grumbach, Doris",,"Fiction , General",Putnam Pub Group,"Sunday, March 1, 1981",Price Starting at $4.99
2,Don't Eat Your Heart Out Cookbook,"By Piscatella, Joseph C.",,"Cooking , Reference",Workman Pub Co,"Thursday, September 1, 1983",Price Starting at $4.99
3,When Your Corporate Umbrella Begins to Leak: A...,"By Davis, Paul D.",,,Natl Pr Books,"Monday, April 1, 1991",Price Starting at $4.99
4,Amy Spangler's Breastfeeding : A Parent's Guide,"By Spangler, Amy",,,Amy Spangler,"Saturday, February 1, 1997",Price Starting at $5.32


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103082 entries, 0 to 103081
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   Title         103082 non-null  object
 1   Authors       103082 non-null  object
 2   Description   70213 non-null   object
 3   Category      76912 non-null   object
 4   Publisher     103074 non-null  object
 5   Publish Date  103082 non-null  object
 6   Price         103082 non-null  object
dtypes: object(7)
memory usage: 5.5+ MB


In [4]:
data = data.dropna()
data.shape

(65305, 7)

In [5]:
def preprocess_data(data):
    # Combine the text fields of each row into one string for embedding.
    texts = []
    for _, row in data.iterrows():
        text = f"""Title: {row['Title']}\nAuthor: {row['Authors']}\nCategory: {row['Category']}\nDescription: {row['Description']}"""
        texts.append(text)
    return texts

In [6]:
texts = preprocess_data(data)
print(texts[0])

Title: Journey Through Heartsongs
Author: By Stepanek, Mattie J. T.
Category:  Poetry , General
Description: Collects poems written by the eleven-year-old muscular dystrophy patient, sharing his feelings and thoughts about his life, the deaths of his siblings, nature, faith, and hope.


### Create embeddings

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


In [8]:
import torch

def create_embeddings(texts, batch_size=16):
    # Run on the GPU when one is available, otherwise on the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    embeddings = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        # Truncate to BERT's 512-token limit and pad to the longest text in the batch.
        encoded_input = tokenizer(batch_texts, return_tensors='pt', max_length=512, truncation=True, padding=True)
        encoded_input = {key: value.to(device) for key, value in encoded_input.items()}

        with torch.no_grad():
            output = model(**encoded_input)
            # Mean over all token positions (padding included) gives one 768-dim vector per text.
            batch_embeddings = output.last_hidden_state.mean(dim=1).detach().cpu().numpy()
            embeddings.extend(batch_embeddings)

    return embeddings
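
`create_embeddings` averages `last_hidden_state` over every token position, so with `padding=True` the vectors at padding positions also enter the mean, and an embedding can depend on which other texts share its batch. A common alternative is to weight the mean by the attention mask. The sketch below illustrates the idea with NumPy arrays standing in for the model output; `masked_mean_pool` and the toy arrays are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: (batch, seq_len, hidden) array of token vectors.
    attention_mask: (batch, seq_len) array, 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# One sequence with 2 real tokens followed by 2 padding positions (toy values).
hidden = np.array([[[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0, 0]])

pooled = masked_mean_pool(hidden, mask)  # uses only the first two tokens
naive = hidden.mean(axis=1)              # also averages the padding vectors
print(pooled, naive)
```

With the tokenizer output used in the notebook, `encoded_input['attention_mask']` would play the role of `mask`.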

In [9]:
embeddings = create_embeddings(texts)
embeddings[0]

array([ 5.21900915e-02, -2.50997134e-02,  1.85730696e-01, -3.66387703e-02,
        2.31121168e-01,  4.58556898e-02,  1.27627075e-01,  2.04293400e-01,
        1.86326176e-01, -1.31288618e-01, -2.31094897e-01, -2.14743763e-01,
       -9.83988401e-04,  6.00608885e-02,  2.14387521e-01,  2.69950777e-01,
        1.12746460e-02, -4.22885157e-02,  1.40971437e-01,  8.24230015e-02,
       -7.09740864e-03,  2.69285217e-02,  1.44482344e-01,  1.17421784e-01,
        2.62233198e-01,  1.78257048e-01, -5.19039631e-02, -1.67517066e-01,
       -4.54299778e-01,  1.17470277e-02,  3.23596708e-02,  3.01385317e-02,
        1.04342317e-02, -9.18301195e-02, -5.76857664e-02, -9.10483077e-02,
       -2.58942813e-01, -1.87880829e-01,  2.23053852e-04,  3.43350694e-02,
       -1.91764742e-01, -1.15205705e-01,  1.26576340e-02, -2.71109402e-01,
        7.61848912e-02, -1.99734896e-01,  1.90485016e-01, -2.89497264e-02,
       -3.66237283e-01,  3.76341790e-02, -1.74928516e-01, -8.97033960e-02,
        1.96519077e-01, -

### Recommend a book based on the last book read by the user

In [21]:
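def find_nclosest(query_embedding, embeddings, n=5):
  # scipy's `cosine` returns cosine distance (1 - cosine similarity),
  # so smaller values mean more similar books.
  distances = []
  for index, embedding in enumerate(embeddings):
    dist = cosine(query_embedding, embedding)
    # `index` is the row position in `embeddings`; look it up with data.iloc.
    distances.append({'index': index, 'distance': dist})
  sorted_distances = sorted(distances, key=lambda x: x['distance'])
  return sorted_distances[:n]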
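
The loop above calls `cosine` once per book, which can be slow for tens of thousands of embeddings. As a hedged alternative, the same ranking can be computed with one matrix operation. `find_nclosest_vectorized` and the toy vectors are hypothetical names and data, and the function returns `(index, distance)` tuples rather than the dictionaries used in the notebook.

```python
import numpy as np

def find_nclosest_vectorized(query_embedding, embeddings, n=5):
    """Return the n smallest cosine distances as (index, distance) pairs."""
    matrix = np.asarray(embeddings, dtype=np.float64)      # (num_books, dim)
    query = np.asarray(query_embedding, dtype=np.float64)  # (dim,)
    # Cosine similarity of the query against every row at once.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    distances = 1.0 - (matrix @ query) / norms
    order = np.argsort(distances)[:n]                      # ascending distance
    return [(int(i), float(distances[i])) for i in order]

toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
closest = find_nclosest_vectorized([1.0, 0.1], toy_embeddings, n=2)
print(closest)
```

The sketch assumes no embedding is the zero vector, since that would make the norm product zero.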

In [11]:
book = data.sample(1)
book

Unnamed: 0,Title,Authors,Description,Category,Publisher,Publish Date,Price
67810,Three Dog Bakery Cookbook: Over 50 Recipes for...,"By Dye, Dan, Beckloff, Mark, and Three Dog Bak...",Three Dog Bakery stores are legendary. Stocked...,"Pets , Dogs , General",Andrews McMeel Publishing,"Thursday, October 1, 1998",Price Starting at $5.29


In [15]:
processed_book = preprocess_data(book)
print(processed_book[0])

Title: Three Dog Bakery Cookbook: Over 50 Recipes for All-Natural Treats for Your Dog
Author: By Dye, Dan, Beckloff, Mark, and Three Dog Bakery (COR)
Category:  Pets , Dogs , General
Description: Three Dog Bakery stores are legendary. Stocked with cleverly named canine confections--from SnickerPoodles to Scotty Biscotti to Big Scary Kitties -- the pooch patisserie has grown into an international operation, featuring its fresh-baked, all-natural bakery treats for dogs.Three Dog Bakery&#39;s 1996 autobiography, Short Tails and Treats from Three Dog Bakery, tells all about how Dan Dye and Mark Beckloff, with inspiration from their three dogs, came to run 12 retail bakeries around the world, as well as wholesale and mail-order divisions. Now, Three Dog Bakery is sharing its secrets with dog devotees everywhere. With this new Three Dog Bakery Cookbook, readers will be able to concoct the kind of tasty treats that canines crave.Featuring more than 50 recipes--from Banana Mutt Cake to Great D

In [18]:
query_embedding = create_embeddings(processed_book)[0]
results = find_nclosest(query_embedding, embeddings)
results

[{'index': 43509, 'distance': 0.021235620028272728},
 {'index': 34237, 'distance': 0.05668585091772216},
 {'index': 55833, 'distance': 0.05753625363609549},
 {'index': 55439, 'distance': 0.05902721282041867},
 {'index': 9039, 'distance': 0.059494921850771765}]

In [19]:
for result in results:
  print(data.iloc[result['index']]['Title'])

Three Dog Bakery Cookbook: Over 50 Recipes for All-Natural Treats for Your Dog
Pillsbury: Best of the Bake-off Cookbook: 350 Recipes from Ameria's Favorite Cooking Contest
Perfect Cakes
Fix-It and Enjoy-It Healthy Cookbook: 400 Great Stove-Top And Oven Recipes
Cake Mix Magic


For “Three Dog Bakery Cookbook: Over 50 Recipes for All-Natural Treats for Your Dog”, the closest match is the query book itself (cosine distance of about 0.02), which is expected because the book is part of the dataset. The other four results are cookbooks about baking, recipes, and desserts. This fits the query's description of bakery-style treats, although, judging by their titles, none of them is specifically about dogs.

### Recommend a book based on multiple recent books read by the user

In [38]:
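def find_nclosest_multiple(query_embeddings, embeddings, history_indexes, n=5):
  # Represent the reading history by the mean of its book embeddings.
  mean_embedding = np.mean(query_embeddings, axis=0)
  # `history_indexes` must hold row positions in `embeddings`, not DataFrame labels.
  history_positions = set(history_indexes)
  distances = []
  for index, embedding in enumerate(embeddings):
    if index in history_positions:
      continue  # skip books the user has already read
    dist = cosine(mean_embedding, embedding)
    distances.append({'index': index, 'distance': dist})
  sorted_distances = sorted(distances, key=lambda x: x['distance'])
  return sorted_distances[:n]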
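
`find_nclosest_multiple` compares `history_indexes` with the positions produced by `enumerate(embeddings)`, so the values passed in need to be row positions. After `dropna()` the DataFrame keeps its original index labels, which no longer equal row positions. The sketch below, on a hypothetical four-row frame, shows one way to translate labels into positions with pandas' `Index.get_indexer`.

```python
import pandas as pd

# Toy frame whose index labels have gaps, as happens after dropna().
books = pd.DataFrame({"Title": ["A", "B", "C", "D"]}, index=[0, 2, 5, 9])

history = books.loc[[2, 9]]      # two rows picked by label
labels = history.index.tolist()  # index labels of the history rows
# Translate each label into its row position within `books`.
positions = books.index.get_indexer(history.index).tolist()

print(labels, positions)
print(books.iloc[positions]["Title"].tolist())
```

An alternative under the same assumption is to call `reset_index(drop=True)` right after `dropna()`, which makes labels and positions identical from then on.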

In [28]:
user_history = data[data['Category'].str.contains('History')].sample(3)
user_history

Unnamed: 0,Title,Authors,Description,Category,Publisher,Publish Date,Price
60434,The Greek Way,"By Hamilton, Edith","""Five hundred years before Christ in a little ...","History , Ancient , Greece",W. W. Norton & Company,"Sunday, August 1, 1993",Price Starting at $129.99
74032,"Retribution: The Battle for Japan, 1944-45","By Hastings, Max",Hailed in Britain as “Spectacular . . . Searin...,"History , Military , World War II",Knopf,"Saturday, March 1, 2008",Price Starting at $10.99
28575,Crosscurrents in Quiet Waters: Portraits of th...,"By White, Dan",Text and photographs portray the daily life an...,"History , General",Taylor Publishing,"Thursday, October 1, 1987",Price Starting at $5.84


In [33]:
processed_history = preprocess_data(user_history)
for text in processed_history:
  print(text, '\n')

Title: The Greek Way
Author: By Hamilton, Edith
Category:  History , Ancient , Greece
Description: "Five hundred years before Christ in a little town on the far western border of the settled and civilizaed world, a strange new power was at work. . . . Athens had entered upon her brief and magnificent flowering of genius which so molded the world of mind and of spirit that our mind and spirit today are different. . . . What was then produced of art and of thought has never been surpasses and very rarely equalled, and the stamp of it is upon all the art and all the thought of the Western world."A perennial favorite in many different editions, Edith Hamilton's best-selling The Greek Way captures the spirit and achievements of Greece in the fifth century B.C. A retired headmistress when she began her writing career in the 1930s, Hamilton immediately demonstrated a remarkable ability to bring the world of ancient Greece to life, introducing that world to the twentieth century. The New York 

In [39]:
embeddings_history = create_embeddings(processed_history)
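# Convert index labels to row positions: after dropna() the labels no longer match positions in `embeddings`.
history_indexes = data.index.get_indexer(user_history.index).tolist()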
results_history = find_nclosest_multiple(embeddings_history, embeddings, history_indexes)
results_history

[{'index': 38803, 'distance': 0.045011358070725316},
 {'index': 40971, 'distance': 0.04833045481265985},
 {'index': 52570, 'distance': 0.048787347560359406},
 {'index': 64899, 'distance': 0.04919429428382527},
 {'index': 24563, 'distance': 0.049496085661216216}]

In [41]:
for result in results_history:
  print(data.iloc[result['index']]['Title'])

The Greek Way
China: A New History
A Little History of the World (Little Histories)
V is for Victory: America Remembers World War II
Outrage, Passion, and Uncommon Sense: How Editorial Writers Have Taken On and Helped Shape the Great American Issues of the Past 150 Years


The results show that the system surfaces books related to the user's interests. "**China: A New History**" and "**A Little History of the World**" match the interest in history and broaden it toward world history. "**V is for Victory: America Remembers World War II**" aligns with the military history of "**Retribution: The Battle for Japan, 1944-45**", while "**Outrage, Passion, and Uncommon Sense**" approaches American history through editorial writing. The list therefore stays on theme while extending beyond the exact subjects of the three history books. One caveat: "**The Greek Way**" is itself in the user's history and should have been excluded. It appears because this output was produced with `history_indexes` holding DataFrame index labels (`user_history.index`), whereas `find_nclosest_multiple` compares them against row positions in `embeddings`; after `dropna()` the two no longer coincide.