# Semantic search with FAISS (PyTorch)

**Reference:** 

[1] This notebook provided by Hugging Face: https://huggingface.co/learn/llm-course/en/chapter5/6

[2] FAISS: https://github.com/facebookresearch/faiss/wiki/Getting-started

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

# Libs

In [1]:
# ! pip install datasets evaluate transformers[sentencepiece]

# reference:
# [1] https://stackoverflow.com/questions/58957169/faiss-error-could-not-find-a-version-that-satisfies-the-requirement-faiss-from/58957380
# [1] Self-summary: 
#   1.1 Python version too high (for example: 3.13 has problem with installing faiss)
#   1.2 must state the cuda version explicitly while installing faiss -> (after install torch) check 'nvidia-...' version in  `conda list > requirement.txt` 
# ! pip install faiss-gpu-cu12 # 

In [2]:
# ! pip install peft 
# ! pip install joblib
# ! pip install scikit-learn

In [3]:
from transformers import (
    AutoModel,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

  from .autonotebook import tqdm as notebook_tqdm


In [4]:
import pandas as pd

# Data processing

In [3]:
# from datasets import load_dataset

# issues_dataset = load_dataset("lewtun/github-issues", split="train")
# issues_dataset

In [5]:
df_1 = pd.read_csv("/home/lephuonglantran/EPO2024/df_combine.csv")
print(f"row counts in df_1: {len(df_1)}")
df_2 = pd.read_csv("/home/lephuonglantran/EPO2024/df_combine_val.csv")
print(f"row counts in df_2: {len(df_2)}")
df = pd.concat([df_1, df_2], ignore_index=True)

print(f"row counts in df_epo: {len(df)}")

row counts in df_1: 1848
row counts in df_2: 22
row counts in df_epo: 1870


In [6]:
# df["claims"][0], df["title"][0]

In [7]:
# convert panda data frame to dataset
from datasets import Dataset

In [8]:
dataset = Dataset.from_pandas(df)
dataset

Dataset({
    features: ['title', 'description', 'claims', 'ipc'],
    num_rows: 1870
})

In [9]:
# issues_dataset = issues_dataset.filter(
#     lambda x: (x["is_pull_request"] == False and len(x["comments"]) > 0)
# )
# issues_dataset

> We can see that there are a lot of columns in our dataset, most of which we don’t need to build our search engine. From a search perspective, the most informative columns are `title`, `body`, and `comments`, while `html_url` provides us with a link back to the source issue. Let’s use the `Dataset.remove_columns()` function to drop the rest:  

In [10]:
# columns = issues_dataset.column_names
# columns_to_keep = ["title", "body", "html_url", "comments"]
# columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
# issues_dataset = issues_dataset.remove_columns(columns_to_remove)
# issues_dataset

columns = dataset.column_names
columns_to_keep = ["title", "description", "ipc"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
dataset = dataset.remove_columns(columns_to_remove)
dataset


Dataset({
    features: ['title', 'description', 'ipc'],
    num_rows: 1870
})

> Now that we **have one comment per row**, let’s **create a new comments_length column** that **contains the number of words per comment**:  

In [11]:
# comments_dataset = comments_dataset.map(
#     lambda x: {"comment_length": len(x["comments"].split())}
# )

description_dataset = dataset.map(
    lambda x: {"description_length": len(x["description"].split())}
)

Map: 100%|██████████| 1870/1870 [00:02<00:00, 681.33 examples/s] 


> We can **use this new column to filter out short comments**, which typically **include things like “cc @lewtun” or “Thanks!” that are not relevant** for our search engine. There’s **no precise number to select for the filter**, **but around 15 words** seems like a good start:  

In [8]:
# comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
# comments_dataset

description_dataset = description_dataset.filter(lambda x: x["description_length"] > 15)
description_dataset

Filter: 100%|██████████| 1870/1870 [00:00<00:00, 16375.00 examples/s]


Dataset({
    features: ['title', 'description', 'ipc', 'description_length'],
    num_rows: 1870
})

In [8]:
# def concatenate_text(examples):
#     return {
#         "text": examples["title"]
#         + " \n "
#         + examples["description"]
#     }

# # comments_dataset = comments_dataset.map(concatenate_text)
# description_dataset = description_dataset.map(concatenate_text)

In [44]:
description_dataset[0]['description_length']

47423

In [13]:
max = 0
for i in range(len(description_dataset)):
    length_description_dataset = description_dataset[i]['description_length']
    if i== 0:
        min = length_description_dataset
        max = length_description_dataset
    else:
        if length_description_dataset <= min:
            min = length_description_dataset
        
        if length_description_dataset >= max:
            max = length_description_dataset
print(f"min length in dataset: {min} words\nmax length in dataset: {max}")

min length in dataset: 359 words
max length in dataset: 195429


Split the long description into small chunks

reference:

[1] https://saturncloud.io/blog/how-to-split-text-in-a-column-into-multiple-rows-using-pandas/

[2] joing list of words into a string: https://stackoverflow.com/questions/67560768/join-list-element-after-split-into-str

In [16]:
def spilt_into_smaller_descriptions(examples):
    res = []
    index = 0
    num_words_per_chunk = 359
    total_chunks = examples["description"].split()
    total_len = examples["description_length"]
    while index < total_len:
        chunk = ' '.join(total_chunks[index: index+num_words_per_chunk]) 
                        # the elem with index = index + num_words_per_chunk 
                        # is excluded
        res.append(chunk)
        index = index + num_words_per_chunk
    last_chunk = ' '.join(total_chunks[index - num_words_per_chunk: total_len])
    res.append(last_chunk)
    return {
        "description": res
    }

In [54]:
# reference: https://discuss.huggingface.co/t/how-can-i-grab-the-first-n-rows-of-a-dataset-as-a-dataset-object/33093/2
# small_sample = description_dataset.select(range(20))
small_sample = description_dataset.select(range(20, 40))
small_sample

Dataset({
    features: ['title', 'description', 'ipc', 'description_length'],
    num_rows: 20
})

In [55]:
# small_sample[0]

In [56]:
sm_description_dataset = small_sample.map(spilt_into_smaller_descriptions)

Map: 100%|██████████| 20/20 [00:00<00:00, 461.86 examples/s]


In [57]:
# sm_description_dataset[0]

In [58]:
# description_dataset = description_dataset.map(spilt_into_smaller_descriptions)

convert to dataframe to use `explode`

In [59]:
# small_sample.set_format("pandas")
# df_small_sample = small_sample[:]

# description_dataset.set_format("pandas")
# df_description_dataset = description_dataset[:]

sm_description_dataset
sm_description_dataset.set_format("pandas")
df_sm_description_dataset = sm_description_dataset[:]

In [60]:
# df_small_sample_explode = df_small_sample.explode("description", ignore_index=True)
# df_small_sample_explode.head(4)

# df_description_dataset_explode = df_description_dataset.explode("description", ignore_index=True)
# df_description_dataset_explode.head(4)

df_sm_description_dataset_explode = df_sm_description_dataset.explode("description", ignore_index=True)
df_sm_description_dataset_explode.head(4)

Unnamed: 0,title,description,ipc,description_length
0,INHALATION PARTICLES: method of preparation,The present invention relates to methods for t...,A,5279
1,INHALATION PARTICLES: method of preparation,It is well acknowledged that the adhesion and ...,A,5279
2,INHALATION PARTICLES: method of preparation,"conditions. However, solid state properties (p...",A,5279
3,INHALATION PARTICLES: method of preparation,when all conventional techniques have failed. ...,A,5279


Convert the dataframe back to dataset 

In [61]:
# description_dataset = Dataset.from_pandas(df_description_dataset_explode)
# description_dataset

sm_description_dataset = Dataset.from_pandas(df_sm_description_dataset_explode)
sm_description_dataset

Dataset({
    features: ['title', 'description', 'ipc', 'description_length'],
    num_rows: 826
})

# Creating text embeddings

In [21]:
from transformers import AutoTokenizer, AutoModel
# model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
model_ckpt = "sadickam/sdg-classification-bert"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

In [22]:
import torch

device = torch.device("cuda")
model.to(device)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

 > As we mentioned earlier, we’d **like to represent each entry in our GitHub issues corpus as a single vector**, so we **need to “pool” or average our token embeddings** in some way. One popular approach is to **perform CLS pooling on our model’s outputs**, where we **simply collect the last hidden state for the special [CLS] token**. The following function does the trick for us:

In [23]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

> Next, we’ll create a helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs:

In [24]:
def get_embeddings(text_list):
    # encoded_input = tokenizer(
    #     text_list, padding=True, truncation=True, return_tensors="pt"
    # )
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [25]:
# embedding = get_embeddings(comments_dataset["text"][0])
# embedding.shape

# embedding = get_embeddings(description_dataset["description"][0])
# embedding.shape

embedding = get_embeddings(sm_description_dataset["description"][0])
embedding.shape

torch.Size([1, 768])

In [62]:
# embeddings_dataset = comments_dataset.map(
#     lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
# )
embeddings_dataset = sm_description_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["description"]).detach().cpu().numpy()[0]}
)

Map: 100%|██████████| 826/826 [00:10<00:00, 77.78 examples/s]


In [63]:
embeddings_dataset.add_faiss_index(column="embeddings")

100%|██████████| 1/1 [00:00<00:00, 330.16it/s]


Dataset({
    features: ['title', 'description', 'ipc', 'description_length', 'embeddings'],
    num_rows: 826
})

In [64]:
# question = "How can I load a dataset offline?"
question = "How to test nucleic acids in a sample"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

In [65]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

# Loaded version
# scores, samples = load_dataset.get_nearest_examples(
#     "embeddings", question_embedding, k=5
# )

In [66]:
# samples

In [67]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [68]:
# samples_df

In [69]:
for _, row in samples_df.iterrows():
    print(f"TITLE: {row.title}")
    print(f"SCORE: {row.scores}")
    print(f"DESCRIPTION: {row.description}")
    print("=" * 50)
    print()

TITLE: PHARMACEUTICAL COMPOSITION
SCORE: 215.9669189453125
DESCRIPTION: analysis was performed using an Agilent XDB-C18 reverse-phase column (250x4.6 mm, 120 Ǻ, 5 µm), thermostated to 50 °C, with detection at 254 nm. Eluent solvents were as follows: solvent A, 95% Acetonitrile, 5% Water, 4.8 mM phosphoric acid; solvent B, 95% Isopropanol, 5% Water, 4.8 mM phosphoric acid. A gradient from 10% B to 70% B was applied during 30 min with a flow rate of 1 ml/min.4.1.3 Liposome surface potentialTau liposomal construct samples were diluted 100-fold with PBS. Analysis was performed using a Zetasizer Nano (Malvern, USA) at 25 °C. Measurement duration and voltage selection were performed in automatic mode, with a typical applied voltage of 50 mV. Data was transformed using the Smoluchowski equation automatically using DTS 5.0 (Malvern) software to calculate the zeta potential. As the tau liposomal constructs are composed of a mixture of DMPC/DMPG/Cholesterol/MPLA at molar ratio of 9:1:7:0.2; the

# Save and reload FAISS database

**references:**

[1] https://huggingface.co/docs/datasets/v1.2.0/faiss_and_ea.html

[2] https://discuss.huggingface.co/t/save-and-load-datasets/9260

## Save

IMPORTANT:

[1] must save the dataset which contains the corresponding computed embeddings

In [70]:
embeddings_dataset.drop_index('embeddings')
embeddings_dataset.save_to_disk('./data_embeddings/epo_dataset2')

Saving the dataset (1/1 shards): 100%|██████████| 826/826 [00:00<00:00, 48132.03 examples/s]


In [52]:
# ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')
# embeddings_dataset.save_faiss_index('embeddings', './data_embeddings/epo_index1.faiss')

## Load

In [7]:
# # ds = load_dataset('crime_and_punish', split='train[:100]')
# # ds.load_faiss_index('embeddings', 'my_index.faiss')
# from datasets import load_from_disk
# load_dataset = load_from_disk('./data_embeddings/epo_dataset')

In [53]:
# load_dataset.load_faiss_index('embeddings', './data_embeddings/epo_index.faiss')

# Searching in multiple databse using FAISS index

[1] https://huggingface.co/learn/cookbook/en/semantic_cache_chroma_vector_database

[2] https://www.pinecone.io/learn/series/faiss/faiss-tutorial/

In [71]:
from datasets import load_from_disk
load_all_dataset = []
for i in range(2):
    dataset_name = './data_embeddings/epo_dataset' + str(i+1)
    loaded_dataset = load_from_disk(dataset_name)
    # dataset_faiss_name = './data_embeddings/epo_index' + str(i+1) + '.faiss'
    # loaded_dataset.load_faiss_index('embeddings', dataset_faiss_name)
    load_all_dataset.append(loaded_dataset)
print(len(load_all_dataset))

2


In [73]:
print(
    load_all_dataset[1],
    load_all_dataset[0]['title'][0],
    load_all_dataset[1]['title'][0]
)

Dataset({
    features: ['title', 'description', 'ipc', 'description_length', 'embeddings'],
    num_rows: 826
}) METHOD FOR MULTIPLEX NUCLEIC ACID ANALYSIS INHALATION PARTICLES: method of preparation


In [74]:
from datasets import concatenate_datasets
all_dataset = concatenate_datasets(load_all_dataset)

In [75]:
all_dataset

Dataset({
    features: ['title', 'description', 'ipc', 'description_length', 'embeddings'],
    num_rows: 1669
})

In [76]:
all_dataset.add_faiss_index(column="embeddings")

100%|██████████| 2/2 [00:00<00:00, 261.92it/s]


Dataset({
    features: ['title', 'description', 'ipc', 'description_length', 'embeddings'],
    num_rows: 1669
})

In [77]:
question = "How to test nucleic acids in a sample"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

In [78]:

scores, samples = all_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=10
)
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

for _, row in samples_df.iterrows():
    print(f"TITLE: {row.title}")
    print(f"SCORE: {row.scores}")
    print(f"DESCRIPTION: {row.description}")
    print("=" * 50)
    print()


TITLE: COMPOSITIONS FOR INCREASING POLYPEPTIDE STABILITY AND ACTIVITY, AND RELATED METHODS
SCORE: 157.3109893798828
DESCRIPTION: activity at a temperature between about -20 °C to about 35 °C, wherein said polypeptide is encoded by a nucleic acid sequence having a eukaryotic translation initiation sequence. In some embodiments, the polypeptide is a thermostable protein. In some embodiments, the thermostable protein is an enzyme. In some embodiments, the enzyme is a polymerase, a pyrophosphatase, or a deaminase. In some embodiments, the polymerase is a DNA polymerase I, Thermus aquaticus DNA polymerase I (Taq), or Thermococcus gorgonarius DNA polymerase (Tgo). In some embodiments, the polymerase is a Taq polymerase. In some embodiments, the polymerase is not Taq polymerase. In some embodiments, the pyrophosphatase is Thermoplasma acidophilum pyrophosphatase (TAPP). In some embodiments, the deaminase is Pyrococcus horikoshii dCTP deaminase. In some embodiments, the deaminase is a cytidine