# Semantic search with FAISS (PyTorch)

**Reference:** 

[1] This notebook provided by Hugging Face: https://huggingface.co/learn/llm-course/en/chapter5/6

[2] FAISS: https://github.com/facebookresearch/faiss/wiki/Getting-started

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

# Libs

In [1]:
# ! pip install datasets evaluate transformers[sentencepiece]

# reference:
# [1] https://stackoverflow.com/questions/58957169/faiss-error-could-not-find-a-version-that-satisfies-the-requirement-faiss-from/58957380
# [1] Self-summary: 
#   1.1 Python version too high (for example: 3.13 has problem with installing faiss)
#   1.2 must state the cuda version explicitly while installing faiss -> (after install torch) check 'nvidia-...' version in  `conda list > requirement.txt` 
# ! pip install faiss-gpu-cu12 # 

In [1]:
import pandas as pd

# Data processing

In [3]:
# from datasets import load_dataset

# issues_dataset = load_dataset("lewtun/github-issues", split="train")
# issues_dataset

In [12]:
df_1 = pd.read_csv("/home/lephuonglantran/EPO2024/df_combine.csv")
print(f"row counts in df_1: {len(df_1)}")
df_2 = pd.read_csv("/home/lephuonglantran/EPO2024/df_combine_val.csv")
print(f"row counts in df_2: {len(df_2)}")
df = pd.concat([df_1, df_2], ignore_index=True)

print(f"row counts in df_epo: {len(df)}")

row counts in df_1: 1848
row counts in df_2: 22
row counts in df_epo: 1870


In [20]:
df["claims"][0], df["title"][0]

("A method of assaying nucleic acids in a sample, comprising the steps of: a) adding multiple sets of probes into the sample to form a mixture, each set of probes comprising: i. a first probe having a first portion at least partially complementary to a first region of a target nucleic acid in the sample and a second portion forming a first primer binding site;ii. a second probe having a first portion at least partially complementary to a second region of the target nucleic acid in the sample and a second portion forming a second primer binding site, wherein the 5' end of the first probe is adjacent to the 3' end of the second probe when both probes are hybridized to the target nucleic acid;b) denaturing nucleic acids in the mixture;c) hybridizing the set of probes to the complementary regions of the target nucleic acid;d) performing a ligation reaction with a ligase enzyme on the set of hybridized probes to connect the adjacent 5' end of the first probe and the 3' end of the second pro

In [2]:
# convert panda data frame to dataset
from datasets import Dataset

  from .autonotebook import tqdm as notebook_tqdm


In [4]:
dataset = Dataset.from_pandas(df)
dataset

Dataset({
    features: ['title', 'description', 'claims', 'ipc'],
    num_rows: 1870
})

In [5]:
# issues_dataset = issues_dataset.filter(
#     lambda x: (x["is_pull_request"] == False and len(x["comments"]) > 0)
# )
# issues_dataset

> We can see that there are a lot of columns in our dataset, most of which we don’t need to build our search engine. From a search perspective, the most informative columns are `title`, `body`, and `comments`, while `html_url` provides us with a link back to the source issue. Let’s use the `Dataset.remove_columns()` function to drop the rest:  

In [5]:
# columns = issues_dataset.column_names
# columns_to_keep = ["title", "body", "html_url", "comments"]
# columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
# issues_dataset = issues_dataset.remove_columns(columns_to_remove)
# issues_dataset

columns = dataset.column_names
columns_to_keep = ["title", "description", "ipc"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
dataset = dataset.remove_columns(columns_to_remove)
dataset


Dataset({
    features: ['title', 'description', 'ipc'],
    num_rows: 1870
})

> Now that we **have one comment per row**, let’s **create a new comments_length column** that **contains the number of words per comment**:  

In [6]:
# comments_dataset = comments_dataset.map(
#     lambda x: {"comment_length": len(x["comments"].split())}
# )

description_dataset = dataset.map(
    lambda x: {"description_length": len(x["description"].split())}
)

Map: 100%|██████████| 1870/1870 [00:02<00:00, 714.13 examples/s] 


> We can **use this new column to filter out short comments**, which typically **include things like “cc @lewtun” or “Thanks!” that are not relevant** for our search engine. There’s **no precise number to select for the filter**, **but around 15 words** seems like a good start:  

In [7]:
# comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
# comments_dataset

description_dataset = description_dataset.filter(lambda x: x["description_length"] > 15)
description_dataset

Filter: 100%|██████████| 1870/1870 [00:00<00:00, 16374.73 examples/s]


Dataset({
    features: ['title', 'description', 'ipc', 'description_length'],
    num_rows: 1870
})

In [8]:
# def concatenate_text(examples):
#     return {
#         "text": examples["title"]
#         + " \n "
#         + examples["description"]
#     }

# # comments_dataset = comments_dataset.map(concatenate_text)
# description_dataset = description_dataset.map(concatenate_text)

In [9]:
description_dataset[0]['description_length']

47423

In [10]:
max = 0
for i in range(len(description_dataset)):
    length_description_dataset = description_dataset[i]['description_length']
    if i== 0:
        min = length_description_dataset
    if length_description_dataset <= min:
        min = length_description_dataset
    else:
        max = length_description_dataset
print(f"min length in dataset: {min} words\nmax length in dataset: {max}")

min length in dataset: 359 words
max length in dataset: 21706


Split the long description into small chunks

reference:

[1] https://saturncloud.io/blog/how-to-split-text-in-a-column-into-multiple-rows-using-pandas/

[2] joing list of words into a string: https://stackoverflow.com/questions/67560768/join-list-element-after-split-into-str

In [11]:
def spilt_into_smaller_descriptions(examples):
    res = []
    index = 0
    num_words_per_chunk = 359
    total_chunks = examples["description"].split()
    total_len = examples["description_length"]
    while index < total_len:
        chunk = ' '.join(total_chunks[index: index+num_words_per_chunk]) 
                        # the elem with index = index + num_words_per_chunk 
                        # is excluded
        res.append(chunk)
        index = index + num_words_per_chunk
    last_chunk = ' '.join(total_chunks[index - num_words_per_chunk: total_len])
    res.append(last_chunk)
    return {
        "description": res
    }

In [33]:
# reference: https://discuss.huggingface.co/t/how-can-i-grab-the-first-n-rows-of-a-dataset-as-a-dataset-object/33093/2
# small_sample = description_dataset.select(range(10))
# small_sample

In [34]:
# small_sample[0]

In [35]:
# small_sample = small_sample.map(spilt_into_smaller_descriptions)

In [18]:
# small_sample[0]

In [38]:
description_dataset = description_dataset.map(spilt_into_smaller_descriptions)

Map: 100%|██████████| 1870/1870 [00:03<00:00, 494.21 examples/s]


convert to dataframe to use `explode`

In [39]:
# small_sample.set_format("pandas")
# df_small_sample = small_sample[:]
description_dataset.set_format("pandas")
df_description_dataset = description_dataset[:]

In [40]:
# df_small_sample_explode = df_small_sample.explode("description", ignore_index=True)
# df_small_sample_explode.head(4)
df_description_dataset_explode = df_description_dataset.explode("description", ignore_index=True)
df_description_dataset_explode.head(4)

Unnamed: 0,title,description,ipc,description_length
0,METHOD FOR MULTIPLEX NUCLEIC ACID ANALYSIS,FIELD OF INVENTIONThe present invention relate...,C,47423
1,METHOD FOR MULTIPLEX NUCLEIC ACID ANALYSIS,kits based on the present invention may be sui...,C,47423
2,METHOD FOR MULTIPLEX NUCLEIC ACID ANALYSIS,"example, the presence, absence or quantity of ...",C,47423
3,METHOD FOR MULTIPLEX NUCLEIC ACID ANALYSIS,the stuffer sequence may have about 1 to about...,C,47423


Convert the dataframe back to dataset 

In [42]:
description_dataset = Dataset.from_pandas(df_description_dataset_explode)
description_dataset

Dataset({
    features: ['title', 'description', 'ipc', 'description_length'],
    num_rows: 60154
})

# Creating text embeddings

In [3]:
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


In [4]:
import torch

device = torch.device("cuda")
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

 > As we mentioned earlier, we’d **like to represent each entry in our GitHub issues corpus as a single vector**, so we **need to “pool” or average our token embeddings** in some way. One popular approach is to **perform CLS pooling on our model’s outputs**, where we **simply collect the last hidden state for the special [CLS] token**. The following function does the trick for us:

In [5]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

> Next, we’ll create a helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs:

In [6]:
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [47]:
# embedding = get_embeddings(comments_dataset["text"][0])
# embedding.shape
embedding = get_embeddings(description_dataset["description"][0])
embedding.shape

torch.Size([1, 768])

In [48]:
# embeddings_dataset = comments_dataset.map(
#     lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
# )
embeddings_dataset = description_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["description"]).detach().cpu().numpy()[0]}
)

Map: 100%|██████████| 60154/60154 [24:40<00:00, 40.64 examples/s]


In [49]:
embeddings_dataset.add_faiss_index(column="embeddings")

  0%|          | 0/61 [00:00<?, ?it/s]

100%|██████████| 61/61 [00:00<00:00, 147.07it/s]


Dataset({
    features: ['title', 'description', 'ipc', 'description_length', 'embeddings'],
    num_rows: 60154
})

In [14]:
# question = "How can I load a dataset offline?"
question = "How to test nucleic acids in a sample"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

In [15]:
# scores, samples = embeddings_dataset.get_nearest_examples(
#     "embeddings", question_embedding, k=5
# )

# Loaded version
scores, samples = load_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

In [16]:
samples

{'title': ['Yeast hybrid vectors and their use for the production of polypeptides',
  'PANCREATIC CANCER DETECTION KIT, DEVICE, AND DETECTION METHOD',
  'PANCREATIC CANCER DETECTION KIT, DEVICE, AND DETECTION METHOD',
  'VACCINE COMPOSITION COMPRISING MUTANT CALRETICULIN',
  'NOVEL CRISPR ENZYMES AND SYSTEMS'],
 'description': ['nucleic acids are precipitated with 2 volumes of ethanol at -20°C for 10 h. The precipitate is collected by centrifugation (HB-4 rotor, 20 min, 10 000 rpm, 0°C) and dissolved in 20 µl dye mix containing 90% (v/v) formamide (Merck, pro analysis), 1 mM EDTA, 0.05% bromo-phenol blue and 0.05% xylene cyanol blue. The sample is heated at 90°C for 2 min and applied on a 5% polyacrylamide gel in Tris-borate-EDTA (cf. Peacock et al. (39). A single band is visible on the autoradiogram which migrates between the 267 bp and 435 bp 32P-labeled marker DNA fragments obtained from the Hae III digest of the plasmid pBR 322. The 32P-labelled cDNA fragment is extracted from the 

In [18]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [19]:
samples_df

Unnamed: 0,title,description,ipc,description_length,scores
4,NOVEL CRISPR ENZYMES AND SYSTEMS,"acid-targeting complex, including the guide se...",C,195429,30.071918
3,VACCINE COMPOSITION COMPRISING MUTANT CALRETIC...,"208, 212, 216, 220, 224, 228, 232, 236, 240, 2...",C,47968,29.956596
2,"PANCREATIC CANCER DETECTION KIT, DEVICE, AND D...","is hsa-miR-3178, miR-3656 is hsa-miR-3656, miR...",C,53527,29.729515
1,"PANCREATIC CANCER DETECTION KIT, DEVICE, AND D...","miR-187-5p, miR-1908-5p, miR-371a-5p, and miR-...",C,53527,29.226179
0,Yeast hybrid vectors and their use for the pro...,nucleic acids are precipitated with 2 volumes ...,C,29757,29.01017


In [21]:
for _, row in samples_df.iterrows():
    print(f"TITLE: {row.title}")
    print(f"SCORE: {row.scores}")
    print(f"DESCRIPTION: {row.description}")
    print("=" * 50)
    print()

TITLE: NOVEL CRISPR ENZYMES AND SYSTEMS
SCORE: 30.071918487548828
DESCRIPTION: acid-targeting complex, including the guide sequence to be tested and a control guide sequence different from the test guide sequence, and comparing binding or rate of cleavage at or in the vicinity of the target sequence between the test and control guide sequence reactions. Other assays are possible, and will occur to those skilled in the art.A guide sequence may be selected to target any target sequence. In some embodiments, the target sequence is a sequence within a gene transcript or mRNA.In some embodiments, the target sequence is a sequence within a genome of a cell.In some embodiments, a guide sequence is selected to reduce the degree of secondary structure within the guide sequence. Secondary structure may be determined by any suitable polynucleotide folding algorithm. Some programs are based on calculating the minimal Gibbs free energy. An example of one such algorithm is mFold, as described by Zuk

# Save and reload FAISS database

**references:**

[1] https://huggingface.co/docs/datasets/v1.2.0/faiss_and_ea.html

[2] https://discuss.huggingface.co/t/save-and-load-datasets/9260

## Save

In [51]:
description_dataset.save_to_disk('epo_dataset')

Saving the dataset (1/1 shards): 100%|██████████| 60154/60154 [00:00<00:00, 392558.11 examples/s]


In [50]:
# ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')
embeddings_dataset.save_faiss_index('embeddings', 'epo_index.faiss')

## Load

In [7]:
# ds = load_dataset('crime_and_punish', split='train[:100]')
# ds.load_faiss_index('embeddings', 'my_index.faiss')
from datasets import load_from_disk
load_dataset = load_from_disk('./epo_dataset')

In [8]:
load_dataset.load_faiss_index('embeddings', 'epo_index.faiss')