For context, this notebook was ran with Google Colab and this dataset was used: https://www.kaggle.com/datasets/elvinrustam/books-dataset

In [2]:
import pandas as pd
import csv

In [None]:
df = pd.read_csv("/content/BooksDatasetClean.csv", on_bad_lines="skip").dropna()

Author names in the data is formated strangley sometimes and generally as Last Name, First Name. Wanted to change this so that it is consistent with inference time data and a comma is only used to separate names.

In [None]:
def rearrange_author(row):
  row_list = row.split(', ')

  new_row_list = []
  for i in range(0, len(row_list), 2):
    new_row_list.append(row_list[i + 1].strip() + ' ' + row_list[i].strip())
  return ', '.join(new_row_list)

df['Authors'] = df['Authors'].str.replace(',and ', ',', regex=False)
df['Authors'] = df['Authors'].str.replace(' and ', ', ', regex=False)
df['Authors'] = df['Authors'].str.replace(r'\(.*?\)', '', regex=True)
df = df[df['Authors'].apply(lambda row: len(row.split(', ')) % 2 == 0)]
df['Authors'] = df['Authors'].apply(rearrange_author)

#Including categories related to textbooks

In [2]:
categories_to_consider = {
    'Study & Teaching',
    'Mathematics',
    'Language Arts & Disciplines',
    'Science',
    'Technology',
    'Business & Economics',
    'Cognitive Psychology & Cognition',
    'Medicine'
}

In [3]:
category_to_rows = {}

for cat in categories_to_consider:
  category_to_rows[cat] = df[df['Category'].notna() & df['Category'].str.contains(cat)]

{
    k: len(v) for k, v in category_to_rows.items()
}

{'Technology': 437,
 'Study & Teaching': 17,
 'Language Arts & Disciplines': 478,
 'Cognitive Psychology & Cognition': 18,
 'Science': 5407,
 'Medicine': 38,
 'Mathematics': 181,
 'Business & Economics': 2434}

In [4]:
subsampled_df = pd.concat([
    table.sample(min(300, len(table))) for table in category_to_rows.values()
]).drop_duplicates().dropna().reset_index()

subsampled_df

Unnamed: 0.2,index,Unnamed: 0.1,Unnamed: 0,Title,Authors,Description,Category
0,24520,28493,41047,"Information Technology Project Management, Sec...",Kathy Schwalbe,Each and every recent innovation in Informatio...,"Computers , Information Technology"
1,55812,63714,92173,John Deere: Touch and Feel: Tractor (Touch & F...,DK Publishing Parachute Press,"Preschoolers can touch chunky tractor tires, s...","Juvenile Nonfiction , Technology , Agriculture"
2,53440,61074,88090,"Newton's Telecom Dictionary, 21st Edition: Cov...",Harry Newton,"Newton, who has been called a ""telecom industr...","Technology & Engineering , Telecommunications"
3,29318,33989,48847,Doing What Scientists Do: Children Learn to In...,Ellen Doris,Teachers and administrators wanting to make el...,"Education , Teaching Methods & Materials , Sc..."
4,61480,70033,102732,A Fishkeeper's Guide to Aquarium Plants: A Sup...,Barry James,"Describes the needs of aquarium plants, discus...","Technology & Engineering , General"
...,...,...,...,...,...,...,...
1435,49148,56303,81096,Unlimited Real Estate Profit,"Marc Stephan Garrison, Paula Tripp-Garrison",A guide to creating wealth through real estate...,"Business & Economics , Real Estate , General"
1436,6160,7084,9975,Enlightened Leadership: Getting to the Heart o...,"Ed Oakley, Doug Krug",Being able to change to keep pace with a rapid...,"Business & Economics , Leadership"
1437,40901,47055,67440,Beyond Rational Management: Mastering the Para...,Robert E. Quinn,Draws together extensive research on leadershi...,"Business & Economics , General"
1438,13682,15740,22669,The 7 Irrefutable Rules of Small Business Growth,Steven S. Little,Starting a small business and making it a succ...,"Business & Economics , Small Business"


In [None]:
subsampled_df = subsampled_df[['Title', 'Authors', 'Description', 'Category']]
subsampled_df['Title'] = subsampled_df['Title'].apply(lambda s: s.strip().lower())
subsampled_df['Authors'] = subsampled_df['Authors'].apply(lambda s: s.strip().lower())
subsampled_df['Description'] = subsampled_df['Description'].apply(lambda s: s.strip().lower())
subsampled_df['Category'] = subsampled_df['Category'].apply(lambda s: s.strip().lower().replace("&", "and"))

In [11]:
subsampled_df.to_csv("SubsampledTrainingTable.csv")

In [5]:
subsampled_df = pd.read_csv('/content/SubsampledTrainingTable.csv')[['Authors', 'Title', 'Description', 'Category']]

In [6]:
subsampled_df.head()

Unnamed: 0,Authors,Title,Description,Category
0,"mark fainaru-wada, lance williams","game of shadows: barry bonds, balco, and the s...",chronicles the 2004 federal investigation that...,"medical , sports medicine"
1,isadore rosenfeld,dr. rosenfeld's guide to alternative medicine:...,here at long last is an unbiased look at alter...,"medical , alternative and complementary medicine"
2,philip goldberg,pain remedies: over 1000 quick and easy pain r...,"this book gives you authoritative, practical a...","medical , pain medicine"
3,"brenda adderly, brian beale",the arthritis cure for pets,a groundbreaking book on a common pet health i...,"medical , veterinary medicine , small animal"
4,"maribeth riggs, lawrence alma-tadema",the healing bath: holistic bubbles and soothin...,bathing in water enriched with concentrated ba...,"medical , holistic medicine"


In [None]:
!pip install keybert

In [8]:
from keybert import KeyBERT

model = KeyBERT()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
some_desc = list(subsampled_df[['Title', 'Description']].sample(5))

for i, row in subsampled_df.sample(5).iterrows():
  title = row['Title']
  desc = row['Description']

  print("Title:", title, "\nTitle Keywords:", model.extract_keywords(title))
  print("Description:", desc)
  print("Description Keywords:", model.extract_keywords(desc))
  print()

In [10]:
import random

query_templates = [
    'what are some books about {{topic}}',
    'are there any books that explain {{topic}}',
    'can you recommend books about {{topic}}',
    'what are some books in {{category}} that are related to {{topic}}',
    'where can i read about {{topic}}',
    'i am interested in {{category}}, can you tell me some books that talk about it and {{topic}}',
    'i am interested in {{category}}, where can i read about it and {{topic}}',
    'are there any {{category}} books about {{topic}}',
    'can you recommend {{category}} books about {{topic}}',
    '{{category}} books about {{topic}}',
    'what are some books you can recommend that will help me get into {{topic}}',
    'how can i get into {{topic}}',
    'want to get into {{topic}} and {{category}}',
    'are there any books by {{author}} about {{topic}}',
    'want to read {{author}}\'s work in {{topic}}',
    'want to read {{author}}\'s {{category}} work in {{topic}}',
    'which of {{author}}\'s book in {{category}} should I read',
    'which of {{author}}\'s work in {{topic}} should I read',
    'which of {{author}}\'s book in {{category}} about {{topic}} should I read',
    'which books by {{author}} are related to {{topic}}',
    'what are the best {{category}} books to learn about {{topic}}',
    'what is the best stuff in {{category}} learn about {{topic}}',
    'what are some books in {{category}} and {{category2}} about {{topic}}',
    'what are some works in {{category}} and {{category2}} about {{topic}}',
    'anything related to {{category}} and {{category2}} about {{topic}}',
    'what are some books in {{category}} or {{category2}} about {{topic}}',
    'what is a book related to {{category}} or {{category2}} about {{topic}}',
    'something related to {{category}} or science about {{topic}}',
    'where can i learn about {{category}} and {{topic}}',
    'where can i learn about {{topic}}',
    'want to learn about {{category}} and {{topic}}',
    'want to learn about {{topic}}',
    'what books combine {{category}} and {{topic}}',
    'combine {{topic}} and {{category}}',
    '{{topic}} and {{category}}',
    'where can i get information on {{topic}}',
    'information on {{topic}}',
    'where can i explore the relationship between {{category}} and {{topic}}',
    'relationship between {{category}} and {{topic}}',
    'are there any books that discuss {{topic}} in the context of {{category}}',
    'discussion on {{topic}} in the context of {{category}}',
    'what are some books that address {{topic}} within {{category}}',
    'addressing {{topic}} within {{category}}',
    'are there books about {{topic}} in {{category}} and {{category2}}',
    'what are books about {{category}} that also touch on {{topic}}',
    'what are books about {{topic}} that also mention {{category}}',
    'interested in {{topic}} but also {{category}}',
    'what are books about {{topic}} that also mention {{category}} and {{category2}}',
    'what are books about {{topic}} that also mention {{category}} or {{category2}}',
    'what are books that mention {{topic}}',
    'something about {{topic}}',
    'what are books that mention {{topic}} in {{category}}',
    'something about {{topic}} in {{category}}',
    'can you list some books about {{topic}} in the context of {{category}}',
    'list things about {{topic}} in the context of {{category}}',
    'are there books combining {{category}} and {{topic}}',
    'combine {{topic}} and {{category}}',
    'what are some resources for understanding {{topic}} in {{category}}',
    'want to understand {{topic}} in {{category}}',
    'can you point me to books that cover {{category}} and also mention {{topic}}',
    'want to cover {{topic}} and also learn about {{category}}',
    'are there books in {{category}} that discuss {{topic}}',
    'discussion about {{topic}}',
    'what books focus on {{category}} while also exploring {{topic}}',
    'focus on {{category}} while also exploring {{topic}}',
    'exploration of {{topic}}',
    'exploration of {{topic}} and {{category}}',
    'can you provide examples of books about {{topic}} related to {{category}}',
    'do any books in {{category}} reference {{topic}}',
    'what are some introductions to {{topic}} within the scope of {{category}}',
    'books related to {{category}} that explain {{topic}}',
    'find books that include {{category}} and talk about {{topic}}',
    'are there books about {{topic}} from a perspective of {{category}}',
    'exploration of {{topic}} through {{category}}',
    'show books on {{topic}} that fit under {{category}}',
    'discussion of {{topic}} with a focus on {{category}}'
]

category_templates = [
    'i am interested in {{category}}, can you tell me some books that are related?',
    'anything related to {{category}}',
    'anything in {{category}}',
    'can you recommend me a book in {{category}}',
    'what should i read if i am interested in {{category}}',
    'can you recommend me something in {{category}}',
    'what are some books about {{category}}',
    'can you give me some information about {{category}}',
    'where can i find information about {{category}}',
    'where can i learn about {{category}}',
    'where can i read about {{category}}',
    'what are some books in {{category}} and {{category2}}',
    'what is a book related to {{category}} and {{category2}}',
    'what are some books in {{category}} or {{category2}}',
    'what is a book related to {{category}} or {{category2}}',
    'what books combine {{category}} and {{category2}}',
    'anything that combines {{category}} and {{category2}}',
    'what are books that mention {{category}}',
    'things in {{category}}',
    'focus on {{category}}',
    '{{category}}',
    '{{category}} and {{category2}}',
]

author_templates = [
    'anything by {{author}}?',
    '{{author}}',
    'any books by {{author}}',
    'where is the stuff by {{author}}',
    'where is the stuff by {{author}} and {{author2}}',
    'can you recommend any book by {{author}}',
    'are there any books by {{author}}',
    'are there any books co-authored by {{author}} and {{author2}}',
    'what books are by {{author}}',
    'can you list books written by {{author}}',
    'are there any works by {{author}}',
    'do you have a list of books by {{author}}',
    'can you find books authored by {{author}}',
    'can you provide books written by {{author}}',
    'what are some titles by {{author}}',
    'do you know any books by {{author}}',
    'is there anything available by {{author}}',
    'are there books published by {{author}}',
    'can you show me books from {{author}}',
    'what are {{author}}’s published works',
    'does {{author}} have any books',
    'can you share the works of {{author}}',
    'are there books credited to {{author}}'
]

title_templates = [
    'is there the book {{title}}?',
    'where is {{title}}',
    '{{title}}'
]

training_data_positive = []
training_data_negative = []

for i, row in subsampled_df.iterrows():
  title, authors, description, categories = row

  #combining both to save the cost of separate embeddings
  title_author = title + "; author: " + authors
  authors = set(authors.split(', '))

  categories = set(categories.split(' , '))
  categories.discard("general")

  topics = set()

  #generating a list of topics to embed into query templates
  desc_topics = model.extract_keywords(description)
  for i in range(len(desc_topics)):
    if desc_topics[0][1] - 0.10 > desc_topics[i][1] or len(topics) >= 3:
      break
    topics.add(desc_topics[i][0])

  title_topics = model.extract_keywords(title)
  for i in range(len(title_topics)):
    if title_topics[0][1] - 0.10 > title_topics[i][1] or len(topics) >= 5:
      break
    topics.add(title_topics[i][0])

  #generating contrastive examples of books that differ in some category
  negative_examples = set()
  for i, other_row in subsampled_df.sample(frac=1).iterrows():
      other_categories = set(other_row["Category"].split(' , '))
      other_authors = set(other_row['Authors'].split(', '))

      if categories != other_categories and not authors.intersection(other_authors):
          negative_examples.add((other_row["Description"],
                                  other_row["Title"] + "; author: " + other_row["Authors"]))
          if len(negative_examples) >= 3:
              break

  #turning sets to lists so we can sample from them
  categories = list(categories)
  topics = list(topics)
  negative_examples = list(negative_examples)
  authors = list(authors)

  #matching description to random queries that incorporate a variety of information
  query = random.sample(query_templates, 1)[0]
  topic = random.sample(topics, 1)[0]
  categories_sample = random.sample(categories, min(len(categories), 2))

  query = query.replace("{{topic}}", topic)
  query = query.replace("{{category}}", categories_sample[0])
  if len(categories) < 2:
    query = query.replace(" and {{category2}}", "")
    query = query.replace(" or {{category2}}", "")
  else:
    query = query.replace("{{category2}}", categories_sample[1])
  query = query.replace("{{author}}", random.sample(authors, 1)[0])

  training_data_positive.append((query, random.choice((description, title_author)), 1))

  for neg_description, neg_title_author in negative_examples:
    training_data_negative.append((query, random.choice((neg_description, neg_title_author)), 0))

  #matching description and title/author to random queries about the category
  #doing so with 50% probability to not have too many queries per book
  if random.random() < 0.5:
    categories_sample = random.sample(categories, min(len(categories), 2))
    category_query = random.sample(category_templates, 1)[0].replace("{{category}}", categories_sample[0])

    if len(categories) < 2:
      category_query = category_query.replace(" and {{category2}}", "")
      category_query = category_query.replace(" or {{category2}}", "")
    else:
      category_query = category_query.replace("{{category2}}", categories_sample[1])

    training_data_positive.append((category_query, random.choice((description, title_author)), 1))

    for neg_description, neg_title_author in negative_examples:
      training_data_negative.append((category_query, random.choice((neg_description, neg_title_author)), 0))

  #matching title/author concatenation to random queries about the title and author
  authors_sample = random.sample(authors, min(len(authors), 2))
  author_query = random.sample(author_templates, 1)[0].replace("{{author}}", authors_sample[0])
  if len(authors) < 2:
    author_query = author_query.replace(" and {{author2}}", "")
  else:
    author_query = author_query.replace("{{author2}}", authors_sample[1])

  title_query = random.sample(title_templates, 1)[0].replace("{{title}}", title)
  training_data_positive.append((random.choice((author_query, title_query)), title_author, 1))

  for neg_description, neg_title_author in negative_examples:
    training_data_negative.append((random.choice((author_query, title_query)), neg_title_author, 0))

len(training_data_positive), len(training_data_negative)

(3037, 9111)

In [11]:
random.sample(training_data_positive, 10)

[('what are some books you can recommend that will help me get into cats',
  'tom howard; author: the love of cats',
  1),
 ('where is murray feshbach',
  'murray feshbach; author: ecological disaster: cleaning up the hidden legacy of the soviet regime : a twentieth century fund report (russia in transition)',
  1),
 ('can you recommend me something in mathematics',
  'do you want to double or triple the speed with which you calculate? how to calculate quickly is a tried and true method for helping you in the mathematics of daily life — addition, subtraction, multiplication, division, and fractions. the author can awaken for you a faculty which is surprisingly dormant in accountants, engineers, scientists, businesspeople, and others who work with figures. this is "number sense" — or the ability to recognize relations between numbers considered as whole quantities. lack of this number sense makes it entirely possible for a scientist to be proficient in higher mathematics, but to bog dow

In [12]:
random.sample(training_data_negative, 10)

[('i am interested in grammar and punctuation, where can i read about it and conner',
  "what we've lost addresses the fragile state of u.s. democracy with a critical review of the bush administration by one of our leading magazine editors, graydon carter. carter has expressed his deep dissatisfaction with the current state of the nation in his monthly editor's letters in vanity fair--which have aroused widespread comment--and now provides a sweeping, painstakingly detailed account of the ruinous effects of this president.the invasion of iraq, which has proven so costly for the u.s. in lives, dollars, and international standing, is only the tip of the iceberg. it is the war at home, a quiet, covert, and in many ways more lasting and damaging war, that carter is most wary of. the bush white house has chipped away at decades' worth of advances in personal rights, women's rights, the economy, and the environment. it is difficult to point to a single element of american society that comes 

In [None]:
!pip install -U sentence-transformers
!pip install -U datasets

In [14]:
from datasets import Dataset, IterableDataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, InputExample, losses
import torch.nn as nn

embedding_model = SentenceTransformer('Snowflake/snowflake-arctic-embed-m-v1.5')

In [15]:
training_set = training_data_positive + training_data_negative
random.shuffle(training_set)

train_query, train_document, train_score = zip(*training_set[:int(len(training_set) * 0.8)])
validation_query, validation_document, validation_score = zip(*training_set[int(len(training_set) * 0.8):])

train_examples = {
    "query": train_query,
    "document": train_document,
    "score": train_score
}

validation_examples = {
    "query": validation_query,
    "document": validation_document,
    "score": validation_score
}

train_dataset = Dataset.from_dict(train_examples)
validation_dataset = Dataset.from_dict(validation_examples)

In [17]:
loss = losses.CosineSimilarityLoss(embedding_model)

trainer = SentenceTransformerTrainer(
    model=embedding_model,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    loss=loss
)
trainer.train()
embedding_model.save_pretrained("/content/finetuned-snowflake-arctic-embed-m-v1.5")

Step,Training Loss
500,0.0477
1000,0.0366
1500,0.0285
2000,0.0214
2500,0.021
3000,0.0116
3500,0.0129


In [None]:
!zip -r /content/model_finetuned.zip /content/finetuned-snowflake-arctic-embed-m-v1.5