# Load dataset (GitHub issues)

The issues come from the 🤗 Datasets repository. We'll try to find answers to questions about this library.

In [3]:
from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

Repo card metadata block was not found. Setting CardData to empty.


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

# Filter dataset
We remove pull requests and issues without comments, since they carry no information for our task.

In [4]:
issues_dataset = issues_dataset.filter(
    lambda x: not x["is_pull_request"] and len(x["comments"]) > 0
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})

# Remove columns
We keep only the columns that may help the search task: the issue title, body, URL, and comments. We do not need, for example, the labels assigned to an issue or the author's username.

In [5]:
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
# Everything that is not explicitly kept gets removed
columns_to_remove = set(columns) - set(columns_to_keep)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})
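The column selection is plain set arithmetic and can be checked in isolation. The sketch below uses a made-up subset of column names. It shows that a set difference and a symmetric difference agree whenever every column to keep exists, and diverge once a kept name is missing or misspelled.

```python
# Toy column list standing in for issues_dataset.column_names (hypothetical subset).
columns = ["url", "title", "body", "html_url", "comments", "labels"]
columns_to_keep = ["title", "body", "html_url", "comments"]

# Both operations agree when every kept column exists in the dataset.
by_difference = set(columns) - set(columns_to_keep)
by_symmetric = set(columns_to_keep).symmetric_difference(columns)
print(sorted(by_difference))  # ['labels', 'url']
print(by_difference == by_symmetric)  # True

# They diverge once a kept name is absent: the symmetric difference
# also returns the missing name, which cannot be removed from the dataset.
with_typo = ["title", "body", "html_url", "coments"]  # deliberate typo
print(sorted(set(columns) - set(with_typo)))  # ['comments', 'labels', 'url']
print(sorted(set(with_typo).symmetric_difference(columns)))  # ['coments', 'comments', 'labels', 'url']
```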

# Explode comments into separate rows
We keep the source issue (its title, link, and body) alongside each comment, which gives every comment a meaningful context.

In [6]:
issues_dataset.set_format("pandas")
df = issues_dataset[:]
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

| | html_url | title | comments | body |
|---|---|---|---|---|
| 0 | https://github.com/huggingface/datasets/issues... | Protect master branch | Cool, I think we can do both :) | After accidental merge commit (91c55355b634d0d... |
| 1 | https://github.com/huggingface/datasets/issues... | Protect master branch | @lhoestq now the 2 are implemented.\r\n\r\nPle... | After accidental merge commit (91c55355b634d0d... |
| 2 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | Hi ! I guess the caching mechanism should have... | ## Describe the bug\r\nAfter upgrading to data... |
| 3 | https://github.com/huggingface/datasets/issues... | Backwards compatibility broken for cached data... | If it's easy enough to implement, then yes ple... | ## Describe the bug\r\nAfter upgrading to data... |
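To make the explode step concrete, here is a dependency-free sketch of the same idea on two made-up issues. The function name `explode_comments` and the sample rows are hypothetical. Unlike pandas, which keeps a row with a missing value for an empty list, this sketch simply drops issues that have no comments.

```python
# Two toy issues; names and values are made up for illustration.
issues = [
    {"title": "Issue A", "body": "Body A", "comments": ["first", "second"]},
    {"title": "Issue B", "body": "Body B", "comments": ["only one"]},
]


def explode_comments(rows):
    """Return one row per comment, copying the other fields of the parent issue."""
    return [
        {**row, "comments": comment}
        for row in rows
        for comment in row["comments"]
    ]


exploded = explode_comments(issues)
for row in exploded:
    print(row["title"], "|", row["comments"])
```

Each output row repeats the title and body of its issue, which is what later lets us build a self-contained text per comment.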


In [7]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

# Add comment length column & filter by length

Very short comments rarely contain useful information, so we keep only comments longer than 15 words.

In [8]:
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

Map: 100%|██████████| 2964/2964 [00:00<00:00, 35596.04 examples/s]
Filter: 100%|██████████| 2964/2964 [00:00<00:00, 280680.87 examples/s]


Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})
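The length measure here is a whitespace word count, not a character or token count. A small sketch on made-up comments shows which ones would pass the filter; the helper `word_count` and the sample strings are illustrative only.

```python
def word_count(text):
    """Count whitespace-separated words, the same measure as len(text.split())."""
    return len(text.split())


comments = [
    "Thanks!",
    "Cool, I think we can do both :)",
    "I ran into the same problem after upgrading, and clearing the cache "
    "directory before reloading the dataset fixed it for me on both machines.",
]

MIN_WORDS = 15  # same threshold as the filter: keep comments with more than 15 words
kept = [c for c in comments if word_count(c) > MIN_WORDS]
print([word_count(c) for c in comments])  # [1, 8, 24]
print(len(kept))  # 1
```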

# Concatenate textual columns into a single text

Concatenating the title, body, and comment yields short texts that can later be vectorized for semantic search.

In [9]:
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)

Map: 100%|██████████| 2175/2175 [00:00<00:00, 18888.72 examples/s]


# Create embeddings for texts

The model (`sentence-transformers/multi-qa-mpnet-base-dot-v1`) is one of the Sentence Transformers models trained for semantic search. Among them, it achieves the highest "Performance Semantic Search" score in the overview linked below.

[List with models and their scores](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview)

In [10]:
from transformers import AutoTokenizer, AutoModel
import torch

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

In [11]:
# We get the [CLS] token hidden state from the transformer output and treat it as the text embedding
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
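The indexing `[:, 0]` selects the vector at token position 0 of every sequence in the batch. The sketch below mimics that with nested lists instead of tensors, purely to keep it dependency-free; the numbers and the name `cls_pooling_lists` are made up.

```python
# A fake "last hidden state" with shape (batch=2, tokens=3, hidden=4).
# Real models return a tensor; nested lists are used here only for illustration.
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]


def cls_pooling_lists(hidden):
    """Pick the vector at token position 0 of every sequence, like tensor[:, 0]."""
    return [sequence[0] for sequence in hidden]


pooled = cls_pooling_lists(last_hidden_state)
print(pooled)  # [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
```

The result has one vector per input text, regardless of how many tokens each text contains.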

In [12]:
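def get_embeddings(text_list):
    # Tokenize a string or a list of strings; overly long inputs are truncated
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    # Inference only: skip gradient tracking to save memory
    with torch.no_grad():
        model_output = model(**encoded_input)
    return cls_pooling(model_output)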

In [13]:
# Example: embed one text
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

In [14]:
# Embed all texts
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

Map: 100%|██████████| 2175/2175 [00:24<00:00, 88.87 examples/s] 
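The call above embeds one text per forward pass. Since `get_embeddings` accepts a list of texts, grouping texts into batches is a common way to reduce the number of model calls. The sketch below shows only the batching logic; `iter_batches` and `fake_embed` are hypothetical helpers, and `fake_embed` is a stand-in that does not call the model.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(text_list):
    """Stand-in for get_embeddings: one short vector per text (hypothetical)."""
    return [[float(len(text)), float(len(text.split()))] for text in text_list]


texts = ["first text", "second", "a third text", "fourth one", "fifth"]
embeddings = []
for batch in iter_batches(texts, batch_size=2):
    embeddings.extend(fake_embed(batch))  # one call per batch, not per text

print(len(embeddings))  # 5
```

With five texts and a batch size of 2, the embedding function is called three times instead of five.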


# Create FAISS index

A FAISS index is a faster alternative to a naive scan over the list of embedding vectors. It is optimized for finding the vectors most similar to a query vector.

FAISS works on NumPy arrays, which is why we converted the embeddings to NumPy arrays.

Because the dataset is very small and we run only a few searches, we can use a flat FAISS index, which compares the query against every stored vector: slow on large collections, but exact. `IndexFlatIP` ranks by inner product, which matches the dot-product scoring this model was trained for.
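Conceptually, a flat inner-product search scores every stored vector against the query and keeps the best `k`. The following is a minimal pure-Python sketch of that idea, not how FAISS is implemented; the function name `flat_ip_search` and the toy vectors are made up.

```python
import heapq


def flat_ip_search(vectors, query, k):
    """Exhaustive inner-product search: score every vector, return the top k."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    top = heapq.nlargest(k, range(len(vectors)), key=scores.__getitem__)
    return [(scores[i], i) for i in top]


vectors = [
    [1.0, 0.0],
    [0.0, 0.5],
    [2.0, 2.0],
    [-1.0, 0.5],
]
query = [1.0, 1.0]
print(flat_ip_search(vectors, query, k=2))  # [(4.0, 2), (1.0, 0)]
```

The cost grows linearly with the number of stored vectors, which is acceptable for a couple of thousand comments.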

In [21]:
len(embeddings_dataset["embeddings"][0])

768

In [23]:
import faiss

index = faiss.IndexFlatIP(len(embeddings_dataset["embeddings"][0]))
embeddings_dataset.add_faiss_index(column="embeddings", custom_index=index)

100%|██████████| 3/3 [00:00<00:00, 682.93it/s]


Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2175
})

# Find similar texts - example

In [24]:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

In [25]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

In [26]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: here is my way to load a dataset offline, but it **requires** an online machine
1. (online machine)
```
import datasets
data = datasets.load_dataset(...)
data.save_to_disk(/YOUR/DATASET/DIR)
```
2. copy the dir from online to the offline machine
3. (offline machine)
```
import datasets
data = datasets.load_from_disk(/SAVED/DATA/DIR)
```

HTH.
SCORE: 30.05308723449707
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: > here is my way to load a dataset offline, but it **requires** an online machine
> 
> 1. (online machine)
> 
> ```
> 
> import datasets
> 
> data = datasets.load_dataset(...)
> 
> data.save_to_disk(/YOUR/DATASET/DIR)
> 
> ```
> 
> 2. copy the dir from online to the offline machine
> 
> 3. (offline machine)
> 
> ```
> 
> import datasets
> 
> data = datasets.load_from_disk(/SAVED/DATA/DIR)
> 
> ```
> 
> 
> 
> HTH.


SCORE: 29.860294342041016
TITLE: Discussion using datasets in offline mode
URL: https:/

# Find texts related to other questions

In [27]:
import textwrap

questions = ["How can I load an XLSX file?"]


def get_similar_comments(questions):
    question_embeddings = get_embeddings(questions).cpu().detach().numpy()
    scores, samples = embeddings_dataset.get_nearest_examples(
        "embeddings", question_embeddings, k=5
    )
    samples_df = pd.DataFrame.from_dict(samples)
    samples_df["scores"] = scores
    samples_df.sort_values("scores", ascending=False, inplace=True)
    return samples_df


def print_similar_comments(questions):
    for question in questions:
        print(f"QUESTION: {question}")
        print()
        similar_comments = get_similar_comments([question])
        for _, row in similar_comments.iterrows():
            print(f"SCORE: {row.scores}")
            print(f"TITLE: {row.title}")
            wrapped = "\n".join(textwrap.wrap(row.comments, 100))
            print(f"COMMENT: {wrapped}")
            print()


print_similar_comments(questions)

QUESTION: How can I load an XLSX file?

SCORE: 15.626007080078125
TITLE: can't load "german_legal_entity_recognition" dataset
COMMENT: > Please if you could tell me more about the error?  >   > 1. Please check the directory you've been
working on  > 2. Check for any typos    Error happens during the execution of this line:  dataset =
load_dataset("german_legal_entity_recognition")    Also, when I try to open mentioned links via
Opera I have errors "404: Not Found" and "This XML file does not appear to have any style
information associated with it. The document tree is shown below." respectively.

SCORE: 15.515144348144531
TITLE: can't load "german_legal_entity_recognition" dataset
COMMENT: Please if you could tell me more about the error?     1. Please check the directory you've been
working on  2. Check for any typos

SCORE: 15.150965690612793
TITLE: viewer "fake_news_english" error
COMMENT: Thanks for reporting !  The viewer doesn't have all the dependencies of the datasets. We may a

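The question `How can I load an XLSX file?` returns unrelated comments: Datasets does not support XLSX out of the box, but the nearest-neighbor search always returns the `k` closest comments, however weak the match. Note that the scores (about 15) are far lower than those for the offline-loading question (about 30).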
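Because the search always returns `k` results, one simple guard is to discard hits below a minimum score. The sketch below uses scores rounded from the outputs in this notebook, placeholder comment texts, and a hypothetical cut-off `MIN_SCORE`. Dot-product scores are not calibrated, and the longer VoxPopuli question further down scores about 24 for an unrelated comment, so a fixed threshold is a crude filter at best and would need tuning.

```python
def filter_by_score(results, min_score):
    """Keep only (score, comment) pairs whose score reaches min_score."""
    return [(score, text) for score, text in results if score >= min_score]


# Scores rounded from the outputs in this notebook; comment texts are placeholders.
offline_hits = [(30.05, "offline comment 1"), (29.86, "offline comment 2")]
xlsx_hits = [(15.63, "xlsx comment 1"), (15.52, "xlsx comment 2"), (15.15, "xlsx comment 3")]

MIN_SCORE = 20.0  # hypothetical cut-off, chosen only to separate these two examples
print(len(filter_by_score(offline_hits, MIN_SCORE)))  # 2
print(len(filter_by_score(xlsx_hits, MIN_SCORE)))  # 0
```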

In [28]:
print_similar_comments(["How can I load a CSV file?"])

QUESTION: How can I load a CSV file?

SCORE: 22.139331817626953
TITLE: load_dataset with 'csv' is not working. while the same file is loading with 'text' mode or with pandas
COMMENT: This did help to load the data. But the problem now is that I get:  ArrowInvalid: CSV parse error:
Expected 5 columns, got 187    It seems that this change the parsing so I changed the table to tab-
separated and tried to load it directly from pyarrow  But I got a similar error, again it loaded
fine in pandas so I am not sure what to do.

SCORE: 22.139331817626953
TITLE: load_dataset with 'csv' is not working. while the same file is loading with 'text' mode or with pandas
COMMENT: We should expose the [`block_size` argument](https://arrow.apache.org/docs/python/generated/pyarrow.
csv.ReadOptions.html#pyarrow.csv.ReadOptions) of Apache Arrow csv `ReadOptions` in the
[script](https://github.com/huggingface/datasets/blob/master/datasets/csv/csv.py).      In the
meantime you can specify yourself the `ReadOptio

The question `How can I load a CSV file?` returns loosely related comments: CSV is widely used, so many issues mention it. The question is still very general, though, so the results are not precise.
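The two hits above share exactly the same score. One plausible explanation, inferred from the output rather than verified in this notebook, is that the title and body of that issue alone exceed the model's input limit, so `truncation=True` cuts the text before the comment begins and both rows end up with the same embedding. A minimal sketch of one possible guard follows: cap the body so the comment always fits. The names `build_text` and `max_body_words` are hypothetical, and words are only a rough stand-in for tokens.

```python
def build_text(title, body, comment, max_body_words=200):
    """Join title, a capped body, and the comment so the comment is not cut off."""
    # Note: splitting and re-joining also collapses whitespace inside the body.
    body_words = (body or "").split()
    if len(body_words) > max_body_words:
        body_words = body_words[:max_body_words] + ["..."]
    return title + " \n " + " ".join(body_words) + " \n " + comment


long_body = "word " * 1000
text_a = build_text("Same issue", long_body, "first comment")
text_b = build_text("Same issue", long_body, "second comment")
print(text_a == text_b)  # False
print(len(text_a.split()))  # 205
```

An alternative with the same effect would be to place the comment before the body, so that truncation removes the tail of the body instead.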

In [29]:
print_similar_comments(["How to write README.md for dataset?"])

QUESTION: How to write README.md for dataset?

SCORE: 32.54219055175781
TITLE: Add documentaton for dataset README.md files
COMMENT: @lhoestq hmm - ok thanks for the answer.  To be honest I am not sure if this issue can be closed
now.  I just wanted to point out that this should either be documented or linked in the
documentation.  If you feel like it is (will be) please just close this.

SCORE: 32.13023376464844
TITLE: Add documentaton for dataset README.md files
COMMENT: We're still working on the validation+documentation in this.  Feel free to keep this issue open till
we've added them

SCORE: 31.412778854370117
TITLE: Add documentaton for dataset README.md files
COMMENT: Hi ! We are using the [datasets-tagging app](https://github.com/huggingface/datasets-tagging) to
select the tags to add.    We are also adding the full list of tags in #2107   This covers
multilinguality, language_creators, licenses, size_categories and task_categories.    In general if
you want to add a tag that d

The question `How to write README.md for dataset?` returns strongly related comments because an existing issue addresses exactly this question (`Add documentaton for dataset README.md files`).

In [34]:
print_similar_comments(
    ["Do we have VoxPopuli dataset in the HuggingFace Datasets library?"]
)

QUESTION: Do we have VoxPopuli dataset in the HuggingFace Datasets library?

SCORE: 24.042560577392578
TITLE: FileNotFound remotly, can't load a dataset
COMMENT: This dataset will be available in version-2 of the library. If you want to use this dataset now,
install datasets from `master` branch rather.    Command to install datasets from `master` branch:
`!pip install git+https://github.com/huggingface/datasets.git@master`

SCORE: 23.409957885742188
TITLE: wikiann dataset is missing columns 
COMMENT: Hi !  Apparently you can get the spans from the NER tags using `tags_to_spans` defined here:    http
s://github.com/tensorflow/datasets/blob/c7096bd38e86ed240b8b2c11ecab9893715a7d55/tensorflow_datasets
/text/wikiann/wikiann.py#L81-L126    It would be nice to include the `spans` field in this dataset
as in TFDS. This could be a good first issue for new contributors !    The objective is to use
`tags_to_spans` in the `_generate_examples` method [here](https://github.com/huggingface/nlp/blob

The question `Do we have VoxPopuli dataset in the HuggingFace Datasets library?` returns no related comments. This is not because the dataset lacks one, but because the question contains too many unnecessary words. Compare it with the following, shorter question:

In [35]:
print_similar_comments(["Do we have VoxPopuli dataset?"])

QUESTION: Do we have VoxPopuli dataset?

SCORE: 21.702285766601562
TITLE: Add VoxPopuli
COMMENT: I'm happy to take this on:) One question: The original unlabelled data is stored unsegmented (see
e.g. https://github.com/facebookresearch/voxpopuli/blob/main/voxpopuli/get_unlabelled_data.py#L30),
but segmenting the audio in the dataset would require a dependency on something like soundfile or
torchaudio. An alternative could be to provide the segments start and end times as a Sequence and
then it's up to the user to perform the segmentation on-the-fly if they wish?

SCORE: 21.026845932006836
TITLE: Does both 'bookcorpus' and 'wikipedia' belong to the same datasets which Google used for pretraining BERT?
COMMENT: No they are other similar copies but they are not provided by the official Bert models authors.

SCORE: 20.259286880493164
TITLE: Add VoxPopuli
COMMENT: Hey @jfainberg,    This sounds great! I think adding a dependency would not be a big problem,
however automatically segmenting t

As we can see, there is in fact an issue that targets this specific question (`Add VoxPopuli`).