# Semantic search with FAISS

## Loading and preparing the dataset

In [1]:
from datasets import load_dataset

issues_dataset = load_dataset("DrSly/github-issues", split="train")
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'type', 'active_lock_reason', 'sub_issues_summary', 'issue_dependencies_summary', 'body', 'closed_by', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 5000
})

The first order of business is to filter out the pull requests, as these are rarely used for answering user queries and would introduce noise in our search engine. Let’s also filter out rows with no comments, since these provide no answers to user queries:

In [2]:
issues_dataset = issues_dataset.filter(
    lambda x: (x['is_pull_request'] == False and len(x['comments']) > 0)
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'type', 'active_lock_reason', 'sub_issues_summary', 'issue_dependencies_summary', 'body', 'closed_by', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 1934
})

We can see that there are a lot of columns in our dataset, most of which we don’t need to build our search engine. From a search perspective, the most informative columns are title, body, and comments, while html_url provides us with a link back to the source issue. Let’s drop the rest:

In [3]:
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 1934
})
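
One caveat about symmetric_difference(): it only gives the intended answer when every name in columns_to_keep actually exists in the dataset. The sketch below uses plain Python sets and a hypothetical column list (the "summary" column is invented for illustration) to show how a plain set difference avoids asking to remove a column that is not there:

```python
# Hypothetical column names, for illustration only
columns = ["url", "title", "body", "html_url", "comments", "labels"]
# "summary" is deliberately a column that does not exist in `columns`
columns_to_keep = ["title", "body", "html_url", "comments", "summary"]

# symmetric_difference keeps names found in exactly one of the two sets,
# so the missing "summary" column ends up in the removal list
sym_diff = set(columns_to_keep).symmetric_difference(columns)

# A plain difference only returns names that are present in `columns`
plain_diff = set(columns) - set(columns_to_keep)

print(sorted(sym_diff))
print(sorted(plain_diff))
```

With the four columns kept in this section, both expressions give the same result, because all of them are present in the dataset.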

To create our embeddings we’ll augment each comment with the issue’s title and body, since these fields often include useful contextual information. Because our comments column is currently a list of comments for each issue, we need to “explode” the column so that each row consists of an (html_url, title, body, comment) tuple. We’ll do this in Pandas with DataFrame.explode(), which creates a new row for each element in a list-like column while replicating all the other column values.
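
To make the reshaping concrete, here is a plain-Python sketch of what “exploding” means. It is not how Pandas implements it; the rows, URLs, and the explode_comments() helper are all made up for illustration:

```python
# Toy rows mimicking the dataset's layout (all values are invented)
rows = [
    {"html_url": "https://example.com/1", "title": "Issue A", "body": "Body A",
     "comments": ["first comment", "second comment"]},
    {"html_url": "https://example.com/2", "title": "Issue B", "body": "Body B",
     "comments": ["only comment"]},
]


def explode_comments(rows):
    """Return one row per comment, copying the other fields of the issue."""
    exploded = []
    for row in rows:
        for comment in row["comments"]:
            new_row = dict(row)            # shallow copy of the issue-level fields
            new_row["comments"] = comment  # replace the list with a single comment
            exploded.append(new_row)
    return exploded


exploded = explode_comments(rows)
for row in exploded:
    print(row["title"], "|", row["comments"])
```

Unlike DataFrame.explode(), which keeps an issue with an empty list as a single row with a missing value, this sketch simply drops it. That difference does not matter here, since issues without comments were filtered out earlier.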

In [4]:
issues_dataset.set_format("pandas")
df = issues_dataset[:]
df = df.fillna("") 
df["comments"][0].tolist()

['I suggest metion this in docs specifically for attention with use, tell users explicitly to pass arguments with `fn_kwargs` param or using `functools.partial` to create a pure funcion.']

In [5]:
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,`Dataset.map()` causes cache miss/fingerprint ...,I suggest metion this in docs specifically for...,### Describe the bug\n\nWhen using `.map()` wi...
1,https://github.com/huggingface/datasets/issues...,"cast_column(..., Audio) fails with load_datase...",The following code *does* work:\n```py\nfrom d...,### Describe the bug\n\nAttempt to load a data...
2,https://github.com/huggingface/datasets/issues...,"cast_column(..., Audio) fails with load_datase...",Thanks for reporing ! Are you using pandas v3 ...,### Describe the bug\n\nAttempt to load a data...
3,https://github.com/huggingface/datasets/issues...,"cast_column(..., Audio) fails with load_datase...",pandas 3.0.0 was present but I've also reprodu...,### Describe the bug\n\nAttempt to load a data...


Great, we can see the rows have been replicated, with the comments column containing the individual comments! Now that we’re finished with Pandas, we can quickly switch back to a Dataset by loading the DataFrame in memory:

In [6]:
from datasets import Dataset
comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 7255
})

Now that we have one comment per row, let’s create a new comment_length column that contains the number of words per comment:

In [7]:
comments_dataset = comments_dataset.map(lambda x: {"comment_length": len(x["comments"].split())})

Map:   0%|          | 0/7255 [00:00<?, ? examples/s]

We can use this new column to filter out short comments, which typically include things like “cc @lewtun” or “Thanks!” that are not relevant for our search engine. Here we keep only comments with more than 15 words:

In [8]:
comments_dataset = comments_dataset.filter(lambda x: x['comment_length'] > 15)
comments_dataset

Filter:   0%|          | 0/7255 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 5273
})
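
The word count used above is a simple whitespace split, so it only approximates how many tokens the model will later see. A small sketch on invented comments (the word_count() helper and MIN_WORDS constant are illustrative names, not part of the notebook) shows which kinds of comments survive the threshold:

```python
# Made-up comments for illustration
comments = [
    "cc @lewtun",
    "Thanks!",
    "You can pass extra arguments through fn_kwargs or wrap the function "
    "with functools.partial so that the mapped function stays pure and cacheable.",
]


def word_count(text):
    # str.split() with no arguments splits on any run of whitespace
    return len(text.split())


MIN_WORDS = 15  # same threshold as the filter above: strictly more than 15 words

kept = [c for c in comments if word_count(c) > MIN_WORDS]
print([word_count(c) for c in comments])
print(len(kept))
```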

Having cleaned up our dataset a bit, let’s concatenate the issue title, description, and comments together in a new text column:

In [9]:
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)

Map:   0%|          | 0/5273 [00:00<?, ? examples/s]

## Creating text embeddings

In [10]:
from transformers import AutoModel, AutoTokenizer

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

In [11]:
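import torch

# Prefer Apple's MPS backend, then CUDA, and fall back to the CPU otherwise
device = torch.device(
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available() else "cpu"
)
model.to(device)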

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

As we mentioned earlier, we’d like to represent each entry in our GitHub issues corpus as a single vector, so we need to “pool” or average our token embeddings in some way. One popular approach is to perform CLS pooling on our model’s outputs, where we simply collect the last hidden state for the special [CLS] token.

In [12]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
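
For comparison, another common way to reduce token embeddings to one vector is mean pooling, which averages over the real (non-padding) tokens. The sketch below contrasts the two on a single toy sequence, using nested Python lists in place of tensors so that it runs without a model; the numbers and the cls_pool() and mean_pool() helpers are invented for illustration. Which pooling is appropriate depends on how a given checkpoint was trained.

```python
# Toy "last hidden state" for one sequence: 4 token positions x 3 dimensions.
# The last position is padding (attention mask 0). All values are made up.
last_hidden_state = [
    [1.0, 2.0, 3.0],  # first token, the position CLS pooling reads
    [3.0, 0.0, 1.0],
    [2.0, 4.0, 2.0],
    [9.0, 9.0, 9.0],  # padding, must not influence mean pooling
]
attention_mask = [1, 1, 1, 0]


def cls_pool(hidden):
    # Take the vector at position 0, like last_hidden_state[:, 0] for one sequence
    return hidden[0]


def mean_pool(hidden, mask):
    # Average the vectors of real tokens only, skipping padded positions
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    dim = len(hidden[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


print(cls_pool(last_hidden_state))
print(mean_pool(last_hidden_state, attention_mask))
```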

Next, we’ll create a helper function that will tokenize a list of documents, place the tensors on the same device as the model, feed them to the model, and finally apply CLS pooling to the outputs:

In [13]:
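def get_embeddings(text_list):
    # Tokenize, padding to the longest document and truncating to the model's limit
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    # Move every input tensor to the same device as the model
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    # Inference only, so skip gradient tracking to save memory
    with torch.no_grad():
        model_output = model(**encoded_input)
    return cls_pooling(model_output)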

We can test that the function works by feeding it the first text entry in our corpus and inspecting the output shape:

In [14]:
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

Great, we’ve converted the first entry in our corpus into a 768-dimensional vector! We can use Dataset.map() to apply our get_embeddings() function to each row in our corpus, so let’s create a new embeddings column:

In [15]:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

Map:   0%|          | 0/5273 [00:00<?, ? examples/s]

Notice that we’ve converted the embeddings to NumPy arrays. That’s because 🤗 Datasets requires this format when we try to index them with FAISS.

## Using FAISS for efficient similarity search

Now that we have a dataset of embeddings, we need some way to search over them. To do this, we’ll use a special data structure in 🤗 Datasets called a FAISS index. FAISS (short for Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

The basic idea behind FAISS is to create a special data structure called an index that allows one to find which embeddings are similar to an input embedding. Creating a FAISS index in 🤗 Datasets is simple: we call Dataset.add_faiss_index() and specify which column of our dataset we’d like to index:

In [16]:
embeddings_dataset.add_faiss_index(column="embeddings")

  0%|          | 0/6 [00:00<?, ?it/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 5273
})
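
Conceptually, the simplest kind of index compares the query against every stored vector and returns the best matches. The sketch below does exactly that in plain Python. It is not FAISS and uses none of its API; the 2-D vectors, document names, and the nearest() helper are all invented for illustration. It also shows that the direction of “best” depends on the metric: for a squared L2 distance lower scores are closer, while for an inner product higher scores are closer.

```python
# Toy 2-D "embeddings" (made up); a real index would hold 768-D vectors
corpus = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.8, 0.6],
    "doc_c": [-1.0, 0.0],
}
query = [0.9, 0.1]


def squared_l2(u, v):
    # Distance: smaller means more similar
    return sum((a - b) ** 2 for a, b in zip(u, v))


def inner_product(u, v):
    # Similarity: larger means more similar
    return sum(a * b for a, b in zip(u, v))


def nearest(query, corpus, k, score_fn, higher_is_better):
    """Score every vector against the query and return the k best (name, score) pairs."""
    scored = [(name, score_fn(query, vec)) for name, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=higher_is_better)
    return scored[:k]


print(nearest(query, corpus, 2, squared_l2, higher_is_better=False))
print(nearest(query, corpus, 2, inner_product, higher_is_better=True))
```

When reading the scores returned by a real index, it is worth checking which metric the index was built with before deciding whether to sort them in ascending or descending order.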

We can now perform queries on this index by doing a nearest neighbor lookup with the Dataset.get_nearest_examples() function. Let’s test this out by first embedding a question as follows:

In [17]:
question = "How can i load a dataset from the Hugging Face Hub offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

Just like with the documents, we now have a 768-dimensional vector representing the query, which we can compare against the whole corpus to find the most similar embeddings:

In [18]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

The Dataset.get_nearest_examples() function returns a tuple of scores that rank how closely each document matches the query, and a corresponding set of samples (here, the 5 best matches). Let’s collect these in a pandas.DataFrame so we can easily sort them:

In [19]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

Now we can iterate over the first few rows to see how well our query matched the available comments:

In [20]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: Hi ! The `gen_kwargs` dictionary is passed to `_generate_examples`, so in your case it must be defined this way:
```python
def _generate_examples(self, filepath):
    ...
```

And here is an additional tip: you can use `os.path.join(downloaded_file, "dataset/testing_data")` instead of `f"downloaded_file}/dataset/testing_data/"` to get compatibility with Windows and streaming.

Indeed Windows uses a backslash separator, not a slash, and streaming uses chained URLs (like `zip://dataset/testing_data::https://https://guillaumejaume.github.io/FUNSD/dataset.zip` for example)
SCORE: 29.43366813659668
TITLE: ❓ Dataset loading script from Hugging Face Hub
URL: https://github.com/huggingface/datasets/issues/3300

COMMENT: Also I think the viewer will be updated when you fix the dataset script, let me know if it doesn't
SCORE: 29.401784896850586
TITLE: ❓ Dataset loading script from Hugging Face Hub
URL: https://github.com/huggingface/datasets/issues/3300

COMMENT: Thanks for you quic