## Using embeddings for semantic search

In this notebook, we will utilize embeddings to build a semantic search engine.

## Loading and preparing the dataset

In [2]:
# load dataset
from datasets import load_dataset

The **GitHub Issues** dataset contains issues and pull requests associated with the 🤗 **Datasets** repository. It is designed for educational use and can support tasks such as **semantic search** and **multilabel text classification**. The dataset provides a rich resource of English-language GitHub issues, focusing on the domain of **datasets** for **NLP**, **computer vision**, and related fields.

### Dataset Information:
- **Name**: GitHub Issues
- **Repository**: [🤗 Datasets GitHub Repo](https://github.com/huggingface/datasets)
- **Use Cases**: Semantic search, multilabel text classification
- **Domain**: Datasets for NLP, computer vision, and beyond
- **Language**: English

For more details and to access the dataset, visit the official repository:
- [🤗 GitHub Issues Dataset](https://huggingface.co/datasets/github-issues) on Hugging Face.

This dataset is an excellent resource for those interested in machine learning tasks related to **text analysis** in the context of real-world GitHub issues and pull requests.

In [3]:
dataset = load_dataset('lewtun/github-issues', split='train')


In [4]:
dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

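The first thing we do is filter out pull requests, as they are rarely useful for answering user queries and introduce noise into our dataset. We also keep only issues that have at least one comment, since the comments are what we will embed later.

For that purpose, we will use the `Dataset.filter()` method.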

In [5]:
revised_dataset = dataset.filter(
    lambda x: (x['is_pull_request'] == False and len(x["comments"]) > 0)
)

In [6]:
revised_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})
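
As a rough illustration of what the predicate passed to `Dataset.filter()` does, the sketch below applies the same condition to a few hand-written records in plain Python. The records and the `keep_example` name are invented for this example; only the field names `is_pull_request` and `comments` come from the dataset.

```python
# Toy records that mimic the two fields the filter inspects (values are invented).
records = [
    {"title": "Issue with comments", "is_pull_request": False, "comments": ["a", "b"]},
    {"title": "Issue without comments", "is_pull_request": False, "comments": []},
    {"title": "A pull request", "is_pull_request": True, "comments": ["c"]},
]


def keep_example(example):
    """Keep only issues (not pull requests) that have at least one comment."""
    return (not example["is_pull_request"]) and len(example["comments"]) > 0


kept = [record for record in records if keep_example(record)]
print([record["title"] for record in kept])
```

Only the first record passes both conditions, which is why the row count drops so sharply after filtering.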

We also see that there are many columns in our dataset that are not informative, so we need to drop them. From a search engine perspective, the most informative columns are `title`, `body`, `comments` and `html_url`.

Let's use the `Dataset.remove_columns()` function to drop the rest:

In [7]:
columns = revised_dataset.column_names
columns

['url',
 'repository_url',
 'labels_url',
 'comments_url',
 'events_url',
 'html_url',
 'id',
 'node_id',
 'number',
 'title',
 'user',
 'labels',
 'state',
 'locked',
 'assignee',
 'assignees',
 'milestone',
 'comments',
 'created_at',
 'updated_at',
 'closed_at',
 'author_association',
 'active_lock_reason',
 'pull_request',
 'body',
 'timeline_url',
 'performed_via_github_app',
 'is_pull_request']

In [8]:
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_keep

['title', 'body', 'html_url', 'comments']

In [9]:
columns_to_remove = set(columns) - set(columns_to_keep)
columns_to_remove

{'active_lock_reason',
 'assignee',
 'assignees',
 'author_association',
 'closed_at',
 'comments_url',
 'created_at',
 'events_url',
 'id',
 'is_pull_request',
 'labels',
 'labels_url',
 'locked',
 'milestone',
 'node_id',
 'number',
 'performed_via_github_app',
 'pull_request',
 'repository_url',
 'state',
 'timeline_url',
 'updated_at',
 'url',
 'user'}

In [10]:
revised_dataset = revised_dataset.remove_columns(column_names=columns_to_remove)
revised_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})
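
The column selection above is a set operation: "all columns except the ones to keep". The sketch below, using a shortened and partly made-up column list, suggests why set difference expresses this more safely than symmetric difference. The two agree only when every name in the keep-list is an actual column. The misspelled name `boddy` is hypothetical and is there only to show the failure mode.

```python
all_columns = ["url", "title", "body", "comments", "html_url", "state"]
keep = ["title", "body", "html_url", "comments"]

# Set difference: everything in all_columns that is not in keep.
to_remove = set(all_columns) - set(keep)

# Symmetric difference gives the same result here because keep is a subset of all_columns.
same_here = set(keep).symmetric_difference(all_columns)

# If the keep-list contains a name that is not a real column (a typo, say),
# symmetric difference leaks that name into the columns to remove.
keep_with_typo = keep + ["boddy"]
leaky = set(keep_with_typo).symmetric_difference(all_columns)
safe = set(all_columns) - set(keep_with_typo)

print(sorted(to_remove), sorted(leaky), sorted(safe))
```

In the leaky case the result contains a name that is not a column at all. Passing such a name on to a column-removal call would most likely fail, whereas the set difference is unaffected.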

To create our embeddings we’ll augment each comment with the issue’s title and body, since these fields often include useful contextual information.

For this purpose, we will use the Pandas `DataFrame.explode()` function.

In [11]:
revised_dataset.set_format("pandas")
df = revised_dataset[:]

If we look at the first row of our dataset, we can see that there are two comments associated with the issue:

In [12]:
df["comments"][0].tolist()

['Cool, I think we can do both :)',
 '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']

When we explode `df`, we get one row for each comment. Let's see:

In [13]:
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(2)

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | Cool, I think we can do both :) | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Protect master branch | @lhoestq now the 2 are implemented.\r\n\r\nPle... | After accidental merge commit (91c55355b634d0d... |
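
To make the "one row per comment" behaviour concrete, here is a minimal plain-Python sketch of the same idea. The `explode` helper and the toy rows are written for this illustration and are not part of pandas. Unlike `DataFrame.explode()`, which as far as I know keeps a row with a missing value for an empty list, this sketch skips such rows.

```python
rows = [
    {"title": "Protect master branch", "comments": ["first comment", "second comment"]},
    {"title": "Another issue", "comments": ["only comment"]},
]


def explode(rows, column):
    """Return one row per element of row[column], copying the other fields."""
    exploded = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)     # shallow copy so the input row is left untouched
            new_row[column] = item  # replace the list with a single element
            exploded.append(new_row)
    return exploded


exploded_rows = explode(rows, "comments")
for row in exploded_rows:
    print(row)
```

The title is repeated on every exploded row, which is what later lets us attach the issue title and body to each individual comment.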


Next, we will convert the `DataFrame` back into an in-memory `Dataset`:

In [14]:
from datasets import Dataset

In [15]:
comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

Now that we have one comment per row, let's create a `comment_length` column that contains the number of words per comment.

In [16]:
comments_dataset = comments_dataset.map(
    lambda x: {'comment_length': len(x["comments"].split())}
)

In [17]:
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2964
})

As a next step, we will concatenate the `title`, `body` and `comments` columns into a new `text` column:

In [18]:
def concatenate_text(example):
  return {
      "text": example["title"] + ' \n ' +
              example["body"] + ' \n ' +
              example["comments"]
  }

In [19]:
comments_dataset = comments_dataset.map(concatenate_text)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text'],
    num_rows: 2964
})
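
Both `map()` calls above follow the same pattern: the function receives one example and returns a dict, and the keys of that dict become new columns. The sketch below mimics that merge step on a single plain dict. The `apply_map` helper and the sample values are assumptions made for illustration and are not part of the `datasets` API.

```python
def add_comment_length(example):
    # Number of whitespace-separated words in the comment.
    return {"comment_length": len(example["comments"].split())}


def concatenate_text(example):
    # Join title, body and comment with the same separator used in the notebook.
    return {
        "text": example["title"] + " \n " + example["body"] + " \n " + example["comments"]
    }


def apply_map(example, fn):
    """Merge the dict returned by fn into a copy of the example."""
    updated = dict(example)
    updated.update(fn(example))
    return updated


example = {
    "title": "Protect master branch",
    "body": "After accidental merge commit",
    "comments": "Cool, I think we can do both :)",
}
example = apply_map(example, add_comment_length)
example = apply_map(example, concatenate_text)
print(example["comment_length"])
print(example["text"])
```

Note that string concatenation like this fails if any of the three fields is `None`, so a real pipeline may need to guard against missing bodies.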

## Creating text embeddings

To create text embeddings, we will use a model from the `sentence-transformers` library, which is dedicated to creating embeddings. Here we load its checkpoint through `transformers`.

As described in the library's documentation, our use case is an example of **asymmetric semantic search** because we have a short query whose answer we'd like to find in a longer document, such as an issue comment.

In [20]:
from transformers import AutoTokenizer, AutoModel

In [21]:
model_checkpoint = "sentence-transformers/multi-qa-mpnet-base-dot-v1"

In [22]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModel.from_pretrained(model_checkpoint)



To speed up the embedding process, we will use a GPU if one is available:

In [23]:
import torch

In [24]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [25]:
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

We’d like to represent each entry in our GitHub issues corpus as a single vector, so we need to “pool” or average our token embeddings in some way. One popular approach is to perform CLS pooling on our model’s outputs, where we simply collect the last hidden state for the special `[CLS]` token. The following function does the trick for us:

In [26]:
def cls_pooling(model_output):
  # Take the hidden state of the first token of every sequence in the batch
  return model_output.last_hidden_state[:, 0]

Next, we’ll create a helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs:

In [27]:
def get_embeddings(text_list):
  # Tokenize, padding to the longest text and truncating overly long ones
  encoded_input = tokenizer(
      text_list,
      padding=True,
      truncation=True,
      return_tensors='pt'
  )

  # Move every input tensor to the same device as the model
  encoded_input = {k:v.to(device) for k, v in encoded_input.items()}
  model_output = model(**encoded_input)
  return cls_pooling(model_output)
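
To see what the `[:, 0]` slice in the pooling function selects, the sketch below reproduces it on nested Python lists with the layout `[batch][token][hidden]`. The numbers are invented. `mean_pooling` is shown only for contrast: it is an alternative pooling strategy, not something this notebook uses, and this toy version does not mask padding tokens as a real implementation would need to.

```python
# Toy "last hidden state": 2 sequences, 3 tokens each, hidden size 2.
last_hidden_state = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
]


def first_token_pooling(hidden):
    """Keep the first token's vector of every sequence (the [:, 0] slice)."""
    return [sequence[0] for sequence in hidden]


def mean_pooling(hidden):
    """Average all token vectors per sequence (no padding mask in this sketch)."""
    pooled = []
    for sequence in hidden:
        dims = len(sequence[0])
        pooled.append(
            [sum(token[d] for token in sequence) / len(sequence) for d in range(dims)]
        )
    return pooled


cls_vectors = first_token_pooling(last_hidden_state)
mean_vectors = mean_pooling(last_hidden_state)
print(cls_vectors)
print(mean_vectors)
```

Either way, each sequence collapses to a single vector of the hidden size, which is why the embedding of one text has shape `[1, 768]` for this model.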

Let's test this function by passing it the first entry in the corpus:

In [28]:
embedding = get_embeddings(comments_dataset["text"][0])
embedding

tensor([[-2.3901e-01, -9.6629e-02, -7.5987e-02, -9.8427e-02, -9.9332e-02,
         -1.1464e-01,  7.8535e-03,  2.3955e-01, -2.9256e-02, -7.6068e-02,
          3.0458e-01, -6.9792e-02, -1.5569e-01,  2.4725e-01, -9.1881e-02,
          2.5280e-01,  1.6807e-01,  9.9753e-05, -9.9873e-02,  8.6820e-02,
         -7.9818e-03, -1.0165e-01,  1.2426e-01,  4.6468e-02, -2.1741e-01,
          4.1782e-02,  4.9948e-02,  1.3912e-01, -4.1841e-01, -4.5949e-01,
          1.3835e-01,  1.7075e-01,  3.3363e-03,  5.6527e-01, -1.0426e-04,
          2.1617e-02,  2.3902e-01,  1.0892e-02, -1.4058e-01, -1.7697e-01,
         -5.2251e-01, -4.0497e-01, -1.4152e-01,  2.8701e-02,  7.4875e-02,
          6.8656e-02, -9.1020e-02,  2.4530e-03,  3.4009e-01,  9.7650e-02,
          2.4080e-01, -1.9418e-01,  1.7939e-01, -7.8935e-03,  3.3115e-01,
          5.5396e-01, -5.1814e-02,  2.2530e-01,  4.5613e-03, -1.0242e-02,
          1.1259e-01,  2.0863e-01, -8.3750e-02, -2.6130e-01,  2.9129e-01,
         -1.7616e-01,  5.2985e-01, -3.

In [29]:
embedding.shape

torch.Size([1, 768])

Great, we’ve converted the first entry in our corpus into a 768-dimensional vector! We can use `Dataset.map()` to apply our `get_embeddings()` function to each row in our corpus, so let’s create a new `embeddings` column as follows:

In [30]:
embedding_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

## Using FAISS for efficient similarity search

To search through our embedding dataset efficiently, we'll use a FAISS index, provided by the Facebook AI Similarity Search (FAISS) library. FAISS offers algorithms for fast searching and clustering of embedding vectors, enabling us to find similar embeddings based on an input vector.

In Hugging Face Datasets, we can easily create a FAISS index using the `Dataset.add_faiss_index()` method. By specifying the dataset column containing the embeddings, we can build an index for quick similarity searches. This approach significantly speeds up tasks like nearest neighbor searches over large embedding datasets.

In [31]:
embedding_dataset.add_faiss_index(column="embeddings")

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2964
})

We can now perform queries on this index by conducting a nearest neighbor search using the `Dataset.get_nearest_examples()` function. Let's test this by embedding a question as follows:

In [56]:
question = "how can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()

In [57]:
question_embedding.shape

(1, 768)

In [62]:
scores, samples = embedding_dataset.get_nearest_examples(
    "embeddings",
    question_embedding,
    k = 5
)
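
Conceptually, a nearest neighbor query scores the question vector against every stored vector and keeps the best `k`. The plain-Python sketch below does this exhaustively on a tiny invented corpus. It is only meant to convey the idea, not to reflect how FAISS is implemented. Scoring by dot product is an assumption made for this illustration; the metric an actual index uses depends on how it was created. The names `nearest_examples`, `corpus` and `query` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nearest_examples(corpus_embeddings, query_embedding, k):
    """Score every corpus vector against the query and return the top k."""
    scored = [
        (dot(vector, query_embedding), index)
        for index, vector in enumerate(corpus_embeddings)
    ]
    # Higher dot product means more similar under this (assumed) metric.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return [score for score, _ in top], [index for _, index in top]


corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.2]
top_scores, top_indices = nearest_examples(corpus, query, k=2)
print(top_scores, top_indices)
```

This exhaustive scan costs one dot product per stored vector for every query, which is the work that a dedicated similarity-search library is designed to speed up on large collections.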


The `Dataset.get_nearest_examples()` function returns a tuple that includes similarity scores and a set of samples (e.g., the top 5 matches). To facilitate sorting and analysis, you can collect these results into a `pandas.DataFrame`. This will allow you to easily organize the data and review the ranked matches.

In [63]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [64]:
samples_df

| | html_url | title | comments | body | comment_length | text | embeddings | scores |
|---|---|---|---|---|---|---|---|---|
| 4 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | Requiring online connection is a deal breaker ... | `datasets.load_dataset("csv", ...)` breaks if ... | 57 | Discussion using datasets in offline mode \n `... | [-0.4731806814670563, 0.24578382074832916, -0.... | 25.50502 |
| 3 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | The local dataset builders (csv, text , json a... | `datasets.load_dataset("csv", ...)` breaks if ... | 38 | Discussion using datasets in offline mode \n `... | [-0.4490852952003479, 0.20950652658939362, -0.... | 24.555538 |
| 2 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | I opened a PR that allows to reload modules th... | `datasets.load_dataset("csv", ...)` breaks if ... | 179 | Discussion using datasets in offline mode \n `... | [-0.4716479778289795, 0.2902272641658783, -0.0... | 24.148989 |
| 1 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | > here is my way to load a dataset offline, bu... | `datasets.load_dataset("csv", ...)` breaks if ... | 76 | Discussion using datasets in offline mode \n `... | [-0.4992601275444031, 0.22699788212776184, -0.... | 22.894001 |
| 0 | https://github.com/huggingface/datasets/issues... | Discussion using datasets in offline mode | here is my way to load a dataset offline, but ... | `datasets.load_dataset("csv", ...)` breaks if ... | 47 | Discussion using datasets in offline mode \n `... | [-0.4902574121952057, 0.22889623045921326, -0.... | 22.406656 |


In [65]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505020141601562
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
```python
datasets = load_dataset('text', data_files=data_files)
```

We'll do a new release soon
SCORE: 24.555538177490234
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's n