# Books Semantic Search With FAISS
Build a semantic search engine with FAISS (Facebook AI Similarity Search) that finds the books most closely related to a quote or query, across both English and Italian books.

### Install & Import required dependencies
- `faiss-gpu`
- `transformers`
- `datasets`
- `torch`

In [1]:
!pip install -r requirements.txt


In [2]:
import torch
from datasets import load_dataset, concatenate_datasets
from transformers import AutoTokenizer, AutoModel
import pandas as pd

### Load the datasets, tokenizer, model, and set up the device
- For English books: the `IsmaelMousa/books` dataset.
- For Italian books: the `IsmaelMousa/libri-in-italiano` dataset.
- For embeddings: the `sentence-transformers/multi-qa-mpnet-base-dot-v1` model.
- Device: `cuda` (GPU) if it is available, otherwise `cpu`.

In [14]:
books_id = "IsmaelMousa/books"
libri_id = "IsmaelMousa/libri-in-italiano"

checkpoint = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

books = load_dataset(books_id, split="train")
libri = load_dataset(libri_id, split="train")

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

### Concatenate the datasets
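Concatenate the `books` and `libri` datasets vertically (row-wise). From `books`, keep only the `category`, `author`, `title`, and `EN` columns; from `libri`, keep the equivalent `categoria`, `autore`, `titolo`, and `contenuto` columns and rename them to match. Finally, rename the `EN` column to `content`.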

In [15]:
books_relevant = books.select_columns(["category", "author", "title", "EN"])
libri_relevant = libri.select_columns(["categoria", "autore", "titolo", "contenuto"])

libri_relevant = libri_relevant.rename_column('categoria', 'category')
libri_relevant = libri_relevant.rename_column('autore', 'author')
libri_relevant = libri_relevant.rename_column('titolo', 'title')
libri_relevant = libri_relevant.rename_column('contenuto', 'EN')

english_and_italy_books = concatenate_datasets([books_relevant, libri_relevant])

english_and_italy_books = english_and_italy_books.rename_column('EN', 'content')

print(english_and_italy_books)

Dataset({
    features: ['category', 'author', 'title', 'content'],
    num_rows: 58
})
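
The renaming step matters because two tables can only be stacked row-wise when their column names agree. As a plain-Python illustration of the same idea, the sketch below renames the keys of a row and then combines two lists of rows. The `rename_keys` helper and the sample rows are made up for this sketch; they are not taken from the datasets or from the `datasets` library.

```python
def rename_keys(row, mapping):
    """Return a copy of `row` with its keys renamed according to `mapping`."""
    return {mapping.get(key, key): value for key, value in row.items()}

# Italian column names mapped to the shared English schema.
italian_to_english = {
    "categoria": "category",
    "autore": "author",
    "titolo": "title",
    "contenuto": "content",
}

# Hypothetical sample rows, one per language.
english_rows = [
    {"category": "Adventure", "author": "A. Writer", "title": "Sample", "content": "..."}
]
italian_rows = [
    {"categoria": "Giallo", "autore": "U. Scrittore", "titolo": "Esempio", "contenuto": "..."}
]

# After renaming, both lists share one schema and can simply be joined.
combined = english_rows + [rename_keys(row, italian_to_english) for row in italian_rows]
print(len(combined), list(combined[1]))
```

Every row in `combined` now exposes the same four keys, which is the property the concatenation above relies on.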


### Embedding the text
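First, the tokenizer splits the plain text into tokens and maps them to vocabulary indices. The model then turns these indices into contextual embedding vectors, and we keep the vector of the first token (CLS pooling) as the embedding of the whole text. Because `truncation=True` is set, any text longer than the tokenizer's maximum length is cut off, so each book is represented by its opening passage only. Finally, we build a FAISS index over the embeddings.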

In [16]:
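def embeddings(text_list):
    # Tokenize, pad to the longest text in the batch, and truncate to the maximum length.
    encoded_input = tokenizer(text_list, padding=True, truncation=True, return_tensors="pt")
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

    # Inference only, so skip gradient tracking.
    with torch.no_grad():
        model_output = model(**encoded_input)

    # CLS pooling: use the first token's hidden state as the text embedding.
    return model_output.last_hidden_state[:, 0]

# Embed one book per call and store its vector as a NumPy array.
embeddings_dataset = english_and_italy_books.map(
    lambda x: {"embeddings": embeddings(x["content"]).detach().cpu().numpy()[0]}
)

# Build a FAISS index over the new column.
embeddings_dataset.add_faiss_index(column="embeddings")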


Dataset({
    features: ['category', 'author', 'title', 'content', 'embeddings'],
    num_rows: 58
})
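
To make the role of the index concrete, the sketch below reproduces in plain Python what an exhaustive ("flat") index does conceptually: compare the query against every stored vector and keep the `k` closest. It assumes squared L2 distance, which to our understanding is what a flat FAISS index uses when no other metric is requested. The function `flat_l2_search` and the toy vectors are hypothetical and are not part of FAISS or `datasets`.

```python
import heapq

def flat_l2_search(vectors, query, k=5):
    """Brute-force k-nearest-neighbour search using squared L2 distance."""
    distances = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        distances.append((dist, idx))
    # Smallest distance first: the closest vectors are the best matches.
    nearest = heapq.nsmallest(k, distances)
    scores = [dist for dist, _ in nearest]
    indices = [idx for _, idx in nearest]
    return scores, indices

# Hypothetical 2-dimensional "embeddings" standing in for the 768-dimensional ones.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 0.0]]
scores, indices = flat_l2_search(toy_vectors, query=[1.0, 0.75], k=2)
print(scores, indices)
```

A real FAISS index performs the same comparison with optimized, vectorized code, which is why it scales far beyond a Python loop.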

### Test the search engine
Test our search engine with a query and print the five nearest books.

In [20]:
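def search_and_print(query, column_name):
    # Embed the query with the same model used for the books.
    query_embedding = embeddings([query]).cpu().detach().numpy()

    # Retrieve the 5 nearest books and their scores from the FAISS index.
    scores, samples = embeddings_dataset.get_nearest_examples(column_name, query_embedding, k=5)

    samples = pd.DataFrame.from_dict(samples)
    samples["scores"] = scores
    # Sort by score, highest first. NOTE (assumption): if the index measures L2 distance,
    # which we believe is the default when no metric_type is passed, lower scores are closer.
    samples.sort_values("scores", ascending=False, inplace=True)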

    print(f"Query: {query}")
    print("=" * 50, "\n")
    for _, row in samples.iterrows():
        print(f"Score: {row.scores:.2f}")
        print(f"Name: {row.title}")
        print(f"Author: {row.author}")
        print(f"Category: {row.category}")
        print("=" * 50, "\n")
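
Whether a larger or a smaller score means a better match depends on the metric of the index, so `sort_values(..., ascending=False)` puts the best match first only when the score is a similarity. The plain-Python sketch below ranks the same toy candidates under squared L2 distance (lower is better) and under dot product (higher is better). The helper `rank_hits` and the data are hypothetical, and which metric our index actually uses is worth checking before reading the printed scores.

```python
def rank_hits(titles, scores, higher_is_better):
    """Return titles ordered from best to worst match."""
    paired = sorted(zip(scores, titles), reverse=higher_is_better)
    return [title for _, title in paired]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and candidate vectors.
query = [1.0, 0.0]
candidates = {"A": [0.9, 0.1], "B": [3.0, 0.0], "C": [0.0, 1.0]}
titles = list(candidates)

l2_scores = [squared_l2(query, vec) for vec in candidates.values()]
ip_scores = [dot(query, vec) for vec in candidates.values()]

# Distance: sort ascending. Similarity: sort descending.
l2_ranking = rank_hits(titles, l2_scores, higher_is_better=False)
ip_ranking = rank_hits(titles, ip_scores, higher_is_better=True)
print(l2_ranking)
print(ip_ranking)
```

The two rankings differ here because the candidate vectors are not normalized: `B` points in the same direction as the query but is much longer, so it wins on dot product while `A` is closest in distance.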

### Test in English
Test the search with a query in English.

In [21]:
query_en = "Most people are so busy preparing for life that they forget to actually live it"
search_and_print(query_en, "embeddings")

Query: Most people are so busy preparing for life that they forget to actually live it

Score: 49.25
Name: Robinson Crusoe
Author: Daniel Defoe
Category: Adventure

Score: 48.97
Name: The Time Machine
Author: H.G. Wells
Category: Science Fiction

Score: 47.91
Name: The Woman in White
Author: Wilkie Collins
Category: Mystery

Score: 47.39
Name: Moby Dick
Author: Herman Melville
Category: Classics

Score: 46.77
Name: The Picture of Dorian Gray
Author: Oscar Wilde
Category: Classics


### Test in Italian
Test the search with a query in Italian.

In [22]:
query_it = "La maggior parte delle persone è così impegnata a prepararsi alla vita che dimentica di viverla davvero"
search_and_print(query_it, "embeddings")

Query: La maggior parte delle persone è così impegnata a prepararsi alla vita che dimentica di viverla davvero

Score: 31.41
Name: La guerra nell'aria
Author: H.G. Wells
Category: Fantascienza

Score: 30.71
Name: La donna in bianco
Author: Wilkie Collins
Category: Giallo

Score: 29.10
Name: La guerra dei mondi
Author: H.G. Wells
Category: Fantascienza

Score: 28.64
Name: La macchina del tempo
Author: H.G. Wells
Category: Fantascienza

Score: 28.37
Name: Ventimila leghe sotto i mari
Author: Jules Verne
Category: Fantascienza
