# Semantic search with FAISS

In [None]:
import datasets
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import torch

### Loading and preparing the dataset

In [2]:
# Load the dataset
issues_dataset = datasets.load_dataset(
    "json", data_files="transformers-issues-with-comments.jsonl", split="train"
)

Map over the dataset to add an "is_pull_request" field that records whether each entry is a pull request:

In [3]:
issues_dataset = issues_dataset.map(
    lambda x: {"is_pull_request": x["pull_request"] is not None}
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 28908
})

Next we keep only issues that are not pull requests and that have at least one comment. These are the most helpful for user queries, whereas pull requests and uncommented issues would only introduce noise in the search engine:

In [4]:
issues_dataset = issues_dataset.filter(
    lambda x: not x["is_pull_request"] and len(x["comments"]) > 0
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 13447
})
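To make the two steps above concrete, here is a minimal sketch in plain Python that applies the same flag-then-filter logic to a few made-up records. The record contents are invented for illustration and only mimic the shape of the real fields.

```python
# Toy records standing in for GitHub entries (all values are made up)
records = [
    {"title": "Bug in tokenizer", "pull_request": None, "comments": ["Can you share a snippet?"]},
    {"title": "Fix typo in docs", "pull_request": {"url": "..."}, "comments": ["LGTM"]},
    {"title": "Question about Trainer", "pull_request": None, "comments": []},
]

# Same rule as the map step: an entry is a pull request when the field is not None
for record in records:
    record["is_pull_request"] = record["pull_request"] is not None

# Same rule as the filter step: keep issues (not pull requests) that have comments
kept = [r for r in records if not r["is_pull_request"] and len(r["comments"]) > 0]

print([r["title"] for r in kept])
```

Only the first record survives: the second is a pull request and the third has no comments.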

There are a lot of columns in our dataset, most of which we don’t need to build our search engine. We will keep the most informative columns and drop the rest:

In [5]:
# Get the column names from the dataset
columns = issues_dataset.column_names

# Define the columns to keep in the dataset
columns_to_keep = ["title", "body", "html_url", "comments"]

# Every other column gets removed (a plain set difference; sorted for a stable order)
columns_to_remove = sorted(set(columns) - set(columns_to_keep))

# Remove the columns from the dataset
issues_dataset = issues_dataset.remove_columns(columns_to_remove)

# Display the updated dataset
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 13447
})
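When every name in columns_to_keep exists in the dataset, a symmetric difference and a plain set difference produce the same list of columns to remove. They diverge as soon as a name to keep is missing from the dataset. The sketch below uses made-up column lists (the missing "labels" entry is deliberate) to show the difference:

```python
# Toy column lists; "labels" is deliberately absent from `columns`
columns = ["url", "html_url", "title", "body", "comments", "state"]
columns_to_keep = ["title", "body", "html_url", "comments", "labels"]

# Symmetric difference: names that appear in exactly one of the two collections
sym = set(columns_to_keep).symmetric_difference(columns)

# Set difference: names in the dataset that we do not want to keep
diff = set(columns) - set(columns_to_keep)

print(sorted(sym))   # ['labels', 'state', 'url']
print(sorted(diff))  # ['state', 'url']
```

The symmetric difference includes "labels", a column that does not exist, so asking the dataset to remove it would be a mistake. The set difference only ever contains columns that are actually present.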

To create our embeddings we’ll augment each comment with the issue’s title and body, since these fields often include useful contextual information. Because our comments column is currently a list of comments for each issue, we need to “explode” the column so that each row consists of an (html_url, title, body, comment) tuple. In Pandas we can do this with the DataFrame.explode() function, which creates a new row for each element in a list-like column, while replicating all the other column values. To see this in action, let’s first switch to the Pandas DataFrame format:

In [6]:
# Set the format of the issues_dataset to "pandas"
issues_dataset.set_format("pandas")
# Create a new dataframe df by copying the entire issues_dataset
df = issues_dataset[:]

In [7]:
# Drop rows with missing values (e.g. an issue without a body), since we will concatenate these fields later
df.dropna(axis=0, how="any", inplace=True)

If we inspect the first row in this DataFrame we can see there are four comments associated with this issue:

In [8]:
df["comments"][0].tolist()

['Hi @arda1906, thanks for raising an issue!\r\n\r\nWithout more information about the error i.e. what does it mean to "not work" and what is the expected behaviour? we won\'t be able to help you.  \r\n\r\nFrom the snippet, it\'s not entirely clear how the code is being run, but there are two separate commands which should be entered on separate lines or cells\r\n\r\n```py\r\nfrom huggingface_hub import notebook_login\r\n\r\nnotebook_login()\r\n```',
 'hi,I am giving details\r\n> I am trying this code to train the model\r\n\r\n>```python\r\n>trainer = Trainer(model=model,args=training_args,\r\n                 compute_metrics=compute_metrics,\r\n                 train_dataset=emotion_encoded["train"],\r\n                 eval_dataset=emotion_encoded["validation"],\r\n                 tokenizer=tokenizer)\r\ntrainer.train()\r\n\r\n>and I am facing this error:\r\n>LocalTokenNotFoundError: Token is required (`token=True`), but no token found. You need to provide a token or be logged in to

When we explode df, we expect to get one row for each of these comments. Let’s check if that’s the case:

In [9]:
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/transformers/is... | To enter token in jupyter notebook issue | Hi @arda1906, thanks for raising an issue!\r\n... | I run this [from huggingface_hub import notebo... |
| 1 | https://github.com/huggingface/transformers/is... | To enter token in jupyter notebook issue | hi,I am giving details\r\n> I am trying this c... | I run this [from huggingface_hub import notebo... |
| 2 | https://github.com/huggingface/transformers/is... | To enter token in jupyter notebook issue | Hi @arda1906, are you running the notebook log... | I run this [from huggingface_hub import notebo... |
| 3 | https://github.com/huggingface/transformers/is... | To enter token in jupyter notebook issue | > Hi @arda1906, are you running the notebook l... | I run this [from huggingface_hub import notebo... |
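For intuition, the explode operation can be sketched in plain Python: every element of the list-valued column gets its own row, and all other values are copied along. The `explode` helper and the toy rows below are illustrative assumptions, not the Pandas implementation. One difference to note is that this sketch silently drops rows whose list is empty, whereas Pandas keeps such a row with a missing value.

```python
def explode(rows, column):
    """Yield one copy of each row per element of the list stored in `column`."""
    for row in rows:
        for item in row[column]:
            new_row = dict(row)      # replicate the other column values
            new_row[column] = item   # replace the list with a single element
            yield new_row

# Toy rows: one issue with two comments, one issue with a single comment
rows = [
    {"title": "Issue A", "comments": ["first", "second"]},
    {"title": "Issue B", "comments": ["only one"]},
]

exploded = list(explode(rows, "comments"))
for row in exploded:
    print(row)
```

Two input rows become three output rows, one per comment, each still carrying its issue title.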


It worked: each comment now has its own row, with the other columns replicated. Now that we’re finished with Pandas, we can quickly switch back to a Dataset by loading the DataFrame into memory:

In [10]:
comments_dataset = datasets.Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 62546
})

Now that we have one comment per row, let’s create a new comment_length column that contains the number of words per comment:

In [11]:
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

Map: 100%|██████████| 62546/62546 [00:03<00:00, 18942.89 examples/s]


We can use this new column to filter out short comments, which typically include things like “cc @lewtun” or “Thanks!” that are not relevant for our search engine. There’s no precise number to select for the filter, but around 15 words seems like a good start:

In [12]:
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

Filter: 100%|██████████| 62546/62546 [00:00<00:00, 146027.82 examples/s]


Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 46959
})
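The same length filter can be sketched on a handful of made-up comments. The threshold name `MIN_WORDS` and the example strings are assumptions for illustration, and words are counted by splitting on whitespace, as in the cell above.

```python
# Made-up comments: two short acknowledgements and one substantive answer
comments = [
    "cc @lewtun",
    "Thanks!",
    "You can pass a local folder to from_pretrained once all the model files "
    "have been downloaded to that directory.",
]

MIN_WORDS = 15  # same cut-off as above; tune it for your own corpus

# Whitespace-separated word count for every comment
lengths = [len(comment.split()) for comment in comments]

# Keep only comments that are strictly longer than the threshold
kept = [c for c, n in zip(comments, lengths) if n > MIN_WORDS]

print(lengths)
print(len(kept))
```

Only the third comment passes the filter. A whitespace split is a crude word count (code snippets and URLs count as words too), which is one reason the threshold is a rough starting point rather than a precise rule.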

Having cleaned up our dataset a bit, let’s concatenate the issue title, description, and comments together in a new text column. We’ll write a simple function that we can pass to Dataset.map():

In [13]:
def concatenate_text(examples):
    # Concatenate title, body, and comments
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }

comments_dataset = comments_dataset.map(concatenate_text)

Map: 100%|██████████| 46959/46959 [00:05<00:00, 8300.16 examples/s]
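String concatenation with `+` fails if any of the fields is None, which is why rows with missing values need to be gone before this step. As an alternative sketch, the hypothetical variant below (`concatenate_text_safe` is not part of the notebook) treats a missing field as an empty string instead:

```python
def concatenate_text_safe(example):
    """Join title, body, and comment, treating missing (None) fields as empty strings."""
    parts = [example["title"], example["body"], example["comments"]]
    return {"text": " \n ".join(part or "" for part in parts)}

# Toy example with a missing body
example = {"title": "T", "body": None, "comments": "C"}
result = concatenate_text_safe(example)
print(repr(result["text"]))
```

Whether to drop incomplete rows or keep them with an empty field is a design choice: dropping keeps the corpus cleaner, while the tolerant variant retains comments on issues that have no description.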


Let's check out our new text column by inspecting one of the rows:

In [14]:
comments_dataset[1]

{'html_url': 'https://github.com/huggingface/transformers/issues/29161',
 'title': 'To enter token in jupyter notebook issue',
 'comments': 'hi,I am giving details\r\n> I am trying this code to train the model\r\n\r\n>```python\r\n>trainer = Trainer(model=model,args=training_args,\r\n                 compute_metrics=compute_metrics,\r\n                 train_dataset=emotion_encoded["train"],\r\n                 eval_dataset=emotion_encoded["validation"],\r\n                 tokenizer=tokenizer)\r\ntrainer.train()\r\n\r\n>and I am facing this error:\r\n>LocalTokenNotFoundError: Token is required (`token=True`), but no token found. You need to provide a token or be logged in to Hugging Face with `huggingface-cli login` or `huggingface_hub.login`. See https://huggingface.co/settings/tokens.\r\n>I have thought to apply the my token in the jupyter notebook like this:\r\n>```\r\n> ```python\r\n> from huggingface_hub import notebook_login\r\n> \r\n> notebook_login()\r\n>\r\n> ```\r\n>help me 

### Creating text embeddings

We can obtain token embeddings by using the AutoModel class; all we need to do is pick a suitable checkpoint to load the model from. Fortunately, there’s a library called sentence-transformers that is dedicated to creating embeddings. As described in the library’s [documentation](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search), our use case is an example of asymmetric semantic search because we have a short query whose answer we’d like to find in a longer document, like an issue comment. The handy [model overview](https://www.sbert.net/docs/pretrained_models.html#model-overview) table in the documentation indicates that the multi-qa-mpnet-base-dot-v1 checkpoint has the best performance for semantic search, so we’ll use that for our application. We’ll also load the tokenizer using the same checkpoint:

In [9]:
# Load the pre-trained model checkpoint for multi-qa-mpnet-base-dot-v1
model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

# Initialize the tokenizer using the pre-trained model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Initialize the model using the pre-trained model checkpoint
model = AutoModel.from_pretrained(model_ckpt)

To speed up the embedding process, it helps to place the model and inputs on a GPU device if one is available (falling back to the CPU otherwise), so let’s do that now:

In [16]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

We’d like to represent each entry in our GitHub issues corpus as a single vector, so we need to “pool” or average our token embeddings in some way. One popular approach is to perform CLS pooling on our model’s outputs, where we simply collect the last hidden state for the special [CLS] token. The following function does the trick for us:

In [14]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
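To see what the `[:, 0]` slice selects, here is a sketch with nested Python lists standing in for the `last_hidden_state` tensor. The numbers are made up, and the mean-pooling variant is included only for contrast (a real implementation would also need to ignore padding tokens via the attention mask).

```python
# Toy last_hidden_state: 2 sequences x 3 tokens x 2 hidden dimensions
last_hidden_state = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]

# CLS pooling: keep the vector at token position 0 of every sequence
cls_vectors = [sequence[0] for sequence in last_hidden_state]

# Mean pooling, for contrast: average each hidden dimension over the tokens
mean_vectors = [
    [sum(dim) / len(sequence) for dim in zip(*sequence)]
    for sequence in last_hidden_state
]

print(cls_vectors)
print(mean_vectors)
```

Either way, each sequence is reduced from one vector per token to a single vector, which is what we need to index documents.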

Next, we’ll create a helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs:

In [7]:
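def get_embeddings(text_list):
    # Tokenize the texts, padding to the longest one and truncating to the model's maximum length
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Move every input tensor to the same device as the model
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    # We only run inference here, so skip gradient tracking to save memory
    with torch.no_grad():
        model_output = model(**encoded_input)
    return cls_pooling(model_output)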

We can test the function works by feeding it the first text entry in our corpus and inspecting the output shape:

In [21]:
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

Great, we’ve converted the first entry in our corpus into a 768-dimensional vector! We can use Dataset.map() to apply our get_embeddings() function to each row in our corpus, so let’s create a new embeddings column as follows:

In [22]:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

Map: 100%|██████████| 46959/46959 [16:17<00:00, 48.04 examples/s]


Notice that we’ve converted the embeddings to NumPy arrays — that’s because 🤗 Datasets requires this format when we try to index them with FAISS, which we’ll do next.
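The map call above embeds one row at a time. Because the helper accepts a list of texts, one possible speed-up is to feed it several texts per call; Dataset.map() supports this through its `batched=True` and `batch_size` arguments. The sketch below shows only the batching idea in plain Python: `iter_batches` and `fake_embed` are hypothetical stand-ins (the latter replaces the real model), and no timing claims are implied.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed(texts):
    # Stand-in for the real embedding function: one 2-dimensional "embedding" per text
    return [[float(len(t)), float(len(t.split()))] for t in texts]

texts = ["first comment", "second", "a third comment here", "fourth one", "fifth"]

embeddings = []
calls = 0
for batch in iter_batches(texts, batch_size=2):
    embeddings.extend(fake_embed(batch))  # one call per batch instead of one per text
    calls += 1

print(calls, len(embeddings))
```

Five texts are embedded with three calls. With a real model, the best batch size depends on the available memory and the length of the texts.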

In [None]:
# save embeddings dataset to disk
embeddings_dataset.save_to_disk('embeddings_dataset')

### Using FAISS for efficient similarity search

Now that we have a dataset of embeddings, we need some way to search over them. To do this, we’ll use a special data structure in 🤗 Datasets called a [FAISS](https://faiss.ai/) index. FAISS (short for Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

The basic idea behind FAISS is to create a special data structure called an index that allows one to find which embeddings are similar to an input embedding. Creating a FAISS index in 🤗 Datasets is simple — we use the Dataset.add_faiss_index() function and specify which column of our dataset we’d like to index:

In [None]:
# Load the embeddings dataset from disk
embeddings_dataset = datasets.load_from_disk("embeddings_dataset")

In [None]:
# add faiss index
embeddings_dataset.add_faiss_index(column="embeddings")
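Conceptually, an exact ("flat") index answers a query by comparing it against every stored vector and returning the closest ones. The sketch below does this by brute force in plain Python. The squared L2 distance, the helper names, and the toy vectors are assumptions for illustration; FAISS uses heavily optimized routines and also offers approximate index types.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vectors, query, k):
    """Return (distance, position) pairs for the k stored vectors closest to `query`."""
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

# Toy 2-dimensional "embeddings" and a query
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
query = [1.0, 0.5]

result = nearest(vectors, query, k=2)
print(result)
```

Here the vectors at positions 1 and 3 are equally close to the query. With a distance like this one, smaller scores mean better matches; with an inner-product metric it would be the other way round.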

We can now perform queries on this index by doing a nearest neighbor lookup with the Dataset.get_nearest_examples() function. Let’s test this out by first embedding a question as follows:

In [15]:

# Define the question
question = "How can I load a dataset offline?"

# Get the embeddings for the question
question_embedding = get_embeddings([question]).cpu().detach().numpy()

# Print the shape of the question embedding
question_embedding.shape

(1, 768)

Just like with the documents, we now have a 768-dimensional vector representing the query, which we can compare against the whole corpus to find the most similar embeddings:

In [16]:
# Compute the nearest examples using embeddings dataset
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

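The Dataset.get_nearest_examples() function returns a tuple of scores that rank the match between the query and each document, and a corresponding set of samples (here, the 5 best matches). How to read a score depends on the metric the index was built with: for a distance such as L2 a smaller value means a closer match, while for an inner product a larger value does. Let’s collect these in a pandas.DataFrame so we can easily sort them: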

In [18]:
# Convert samples dictionary to DataFrame and sort by scores
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
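The samples come back as a dictionary that maps each column name to a list of values, which is why they convert so easily to a DataFrame. The same reshaping and sorting can be sketched in plain Python. The values below are made up, and the `reverse=True` direction mirrors the cell above; flip it if your index metric treats smaller scores as better.

```python
# Made-up results in the same column-oriented shape
samples = {
    "title": ["A", "B", "C"],
    "comments": ["comment a", "comment b", "comment c"],
}
scores = [12.5, 40.1, 33.0]

# Turn the dict of columns into a list of row dicts
rows = [dict(zip(samples, values)) for values in zip(*samples.values())]

# Attach the score to each row, then sort from highest to lowest score
for row, score in zip(rows, scores):
    row["scores"] = score
rows.sort(key=lambda r: r["scores"], reverse=True)

print([r["title"] for r in rows])
```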

Now we can iterate over the first few rows to see how well our query matched the available comments:

In [19]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: The relevant docs to load from local data can be found here: https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.

The `from_pretrained` method accepts either a `repo_id` to a repo on the 🤗 hub or a local path to a folder.
SCORE: 43.051761627197266
TITLE: Provide a different API solution instead of offline mode
URL: https://github.com/huggingface/transformers/issues/23117

COMMENT: Yes, if you place all files which you find on the model page on the hub in a directory, then it will work.
SCORE: 42.095550537109375
TITLE: Getting a model to work on a system with no internet access
URL: https://github.com/huggingface/transformers/issues/10900

COMMENT: > You can do it, instead of loading `from_pretrained(roberta.large)` like this download the respective `config.json` and `<mode_name>.bin` and save it on your folder then just write `.from_pretrained('Users/<location>/<your folder name>')` and thats about it.

This appr

Not bad! All the resulting outputs are helpful, though the first two match the query best. Let's try another query.

In [43]:
# Get the embedding for the question
question = "How do I speed up inference?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()

# Get the nearest examples from the embeddings dataset
scores, samples = embeddings_dataset.get_nearest_examples("embeddings", question_embedding, k=5)

# Convert the samples to a DataFrame and add the scores
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

# Print the information for each sample
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: Hi! I've used this:


distilbert-base-uncased-distilled-squad |  
-- | --
or 

distilbert-base-cased-distilled-squad |  
-- | --

It improved quite a bit! 

SCORE: 32.76213836669922
TITLE: How to speed up inference step in BertQuestionAnswering?
URL: https://github.com/huggingface/transformers/issues/4535

COMMENT: Sorry maybe I was not precise enough:
Which model of the `transformers` library (e.g. Bert, GPT2) did you use? And can you copy / paste the exact code which has a `transformers` model in it that was slow for inference.
SCORE: 32.67070007324219
TITLE: How to speed up the transformer inference?
URL: https://github.com/huggingface/transformers/issues/3753

COMMENT: > I suggest you use Stack Overflow, where you will more likely receive answers to your question.

OK, thx
SCORE: 32.13627624511719
TITLE: How to speed up the transformer inference?
URL: https://github.com/huggingface/transformers/issues/3753

COMMENT: Hi @hahadashi, 

Can you add a code snippet so that we kn

Great! The last result answered our query. 