# The 🤗 Datasets library
## [Introduction](https://huggingface.co/course/chapter5/1?fw=pt)

In [Chapter 3](https://huggingface.co/course/chapter3) you got your first taste of the 🤗 Datasets library and saw that there were three main steps when it came to fine-tuning a model:
- Load a dataset from the Hugging Face Hub.
- Preprocess the data with `Dataset.map()`.
- Load and compute metrics.

But this is just scratching the surface of what 🤗 Datasets can do! In this chapter, we will take a deep dive into the library. Along the way, we'll find answers to the following questions:
- What do you do when your dataset is not on the Hub?
- How can you slice and dice a dataset? (And what if you *really* need to use Pandas?)
- What do you do when your dataset is huge and will melt your laptop's RAM?
- What the heck are "memory mapping" and Apache Arrow?
- How can you create your own dataset and push it to the Hub?

The techniques you learn here will prepare you for the advanced tokenization and fine-tuning tasks in [Chapter 6](https://huggingface.co/course/chapter6) and [Chapter 7](https://huggingface.co/course/chapter7) — so grab a coffee and let's get started!

## [What if my dataset isn't on the Hub?](https://huggingface.co/course/chapter5/2?fw=pt)

You know how to use the Hugging Face Hub to download datasets, but you'll often find yourself working with data that is stored either on your laptop or on a remote server. In this section we'll show you how 🤗 Datasets can be used to load datasets that aren't available on the Hugging Face Hub.

In [27]:
from IPython.display import HTML
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/HyQgpJTkRdE" allowfullscreen></iframe>')



### Working with local and remote datasets
🤗 Datasets provides loading scripts to handle the loading of local and remote datasets. It supports several common data formats, such as:

| Data format | Loading script | Example
| ---- | ------- | ---- |
| CSV & TSV | `csv` | `load_dataset("csv", data_files="my_file.csv")` |
| Text files | `text` | `load_dataset("text", data_files="my_file.txt")` |
| JSON & JSON Lines | `json` | `load_dataset("json", data_files="my_file.jsonl")` |
| Pickled DataFrames | `pandas` | `load_dataset("pandas", data_files="my_dataframe.pkl")` |

As shown in the table, for each data format we just need to specify the type of loading script in the `load_dataset()` function, along with a `data_files` argument that specifies the path to one or more files. Let's start by loading a dataset from local files; later we'll see how to do the same with remote files.

### Loading a local dataset

For this example we'll use the [SQuAD-it](https://github.com/crux82/squad-it/) dataset, which is a large-scale dataset for question answering in Italian.

The training and test splits are hosted on GitHub, so we can download them with a simple `wget` command:

In [28]:
# the following commands to download the data files and move them to the "data" folder need to run only once
#!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
#!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz
#!mv SQuAD_it-train.json.gz data
#!mv SQuAD_it-test.json.gz data

This will download two compressed files called *SQuAD_it-train.json.gz* and *SQuAD_it-test.json.gz*, which we can decompress with the Linux `gzip` command:

In [29]:
# the following command needs to run only once
#!gzip -dkv data/SQuAD_it-*.json.gz

We can see that the compressed files have been replaced with *SQuAD_it-train.json* and *SQuAD_it-test.json*, and that the data is stored in the JSON format.
> <font color="darkgreen">✎ If you're wondering why there's a `!` character in the above shell commands, that's because we're running them within a Jupyter notebook. Simply remove the prefix if you want to download and unzip the dataset within a terminal.</font>

To load a JSON file with the `load_dataset()` function, we just need to know if we're dealing with ordinary JSON (similar to a nested dictionary) or JSON Lines (line-separated JSON). Like many question answering datasets, SQuAD-it uses the nested format, with all the text stored in a `data` field. This means we can load the dataset by specifying the `field` argument as follows:

In [30]:
from datasets import load_dataset
squad_it_dataset = load_dataset("json", data_files="data/SQuAD_it-train.json", field="data")

Using custom data configuration default-1bbfd74a7c0a2c55
Reusing dataset json (/Users/matthias/.cache/huggingface/datasets/json/default-1bbfd74a7c0a2c55/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/1 [00:00<?, ?it/s]

By default, loading local files creates a `DatasetDict` object with a `train` split. We can see this by inspecting the `squad_it_dataset` object:

In [31]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

This shows us the number of rows and the column names associated with the training set. We can view one of the examples by indexing into the `train` split as follows:

In [32]:
squad_it_dataset["train"][0]

{'title': 'Terremoto del Sichuan del 2008',
 'paragraphs': [{'context': "Il terremoto del Sichuan del 2008 o il terremoto del Gran Sichuan, misurato a 8.0 Ms e 7.9 Mw, e si è verificato alle 02:28:01 PM China Standard Time all' epicentro (06:28:01 UTC) il 12 maggio nella provincia del Sichuan, ha ucciso 69.197 persone e lasciato 18.222 dispersi.",
   'qas': [{'answers': [{'answer_start': 29, 'text': '2008'}],
     'id': '56cdca7862d2951400fa6826',
     'question': 'In quale anno si è verificato il terremoto nel Sichuan?'},
    {'answers': [{'answer_start': 232, 'text': '69.197'}],
     'id': '56cdca7862d2951400fa6828',
     'question': 'Quante persone sono state uccise come risultato?'},
    {'answers': [{'answer_start': 29, 'text': '2008'}],
     'id': '56d4f9902ccc5a1400d833c0',
     'question': 'Quale anno ha avuto luogo il terremoto del Sichuan?'},
    {'answers': [{'answer_start': 78, 'text': '8.0 Ms e 7.9 Mw'}],
     'id': '56d4f9902ccc5a1400d833c1',
     'question': 'Che cosa ha

Great, we've loaded our first local dataset! But while this worked for the training set, what we really want is to include both the `train` and `test` splits in a single `DatasetDict` object so we can apply `Dataset.map()` functions across both splits at once. To do this, we can provide a dictionary to the `data_files` argument that maps each split name to a file associated with that split:

In [33]:
data_files = {"train": "data/SQuAD_it-train.json", "test": "data/SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

Using custom data configuration default-4ce099a266f33e7b
Reusing dataset json (/Users/matthias/.cache/huggingface/datasets/json/default-4ce099a266f33e7b/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

This is exactly what we wanted. Now, we can apply various preprocessing techniques to clean up the data, tokenize the reviews, and so on.

> The `data_files` argument of the `load_dataset()` function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting `data_files="*.json"`). See the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) for more details.

The loading scripts in 🤗 Datasets actually support automatic decompression of the input files, so we could have skipped the use of `gzip` by pointing the `data_files` argument directly to the compressed files:

In [34]:
data_files = {"train": "data/SQuAD_it-train.json.gz", "test": "data/SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

Using custom data configuration default-a17967084772b575
Reusing dataset json (/Users/matthias/.cache/huggingface/datasets/json/default-a17967084772b575/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/2 [00:00<?, ?it/s]

This can be useful if you don't want to manually decompress many GZIP files. The automatic decompression also applies to other common formats like ZIP and TAR, so you just need to point `data_files` to the compressed files and you're good to go!

Now that you know how to load local files on your laptop or desktop, let's take a look at loading remote files.
### Loading a remote dataset

If you're working as a data scientist or coder in a company, there's a good chance the datasets you want to analyze are stored on some remote server. Fortunately, loading remote files is just as simple as loading local ones! Instead of providing a path to local files, we point the `data_files` argument of `load_dataset()` to one or more URLs where the remote files are stored. For example, for the SQuAD-it dataset hosted on GitHub, we can just point `data_files` to the <i>SQuAD_it-\*.json.gz</i> URLs as follows:

In [35]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

Using custom data configuration default-57dcee3ea6992346
Reusing dataset json (/Users/matthias/.cache/huggingface/datasets/json/default-57dcee3ea6992346/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/2 [00:00<?, ?it/s]

This returns the same `DatasetDict` object obtained above, but saves us the step of manually downloading and decompressing the <i>SQuAD_it-\*.json.gz</i> files. This wraps up our foray into the various ways to load datasets that aren't hosted on the Hugging Face Hub. Now that we've got a dataset to play with, let's get our hands dirty with various data-wrangling techniques!

> ✏️ Try it out! <font color="darkgreen">Pick another dataset hosted on GitHub or the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and try loading it both locally and remotely using the techniques introduced above. For bonus points, try loading a dataset that's stored in a CSV or text format (see the [documentation](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) for more information on these formats).</font>

In [36]:
# Trying it out
## load dataset locally, from csv
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00611/accelerometer.csv"
!mv accelerometer.csv data
accelerometer_dataset = load_dataset("csv", data_files="data/accelerometer.csv")
print("Accelerometer dataset:\n{}".format(accelerometer_dataset))
## load dataset remotely
data_files = {"train": "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"}
iris_dataset = load_dataset("csv", data_files=data_files)
text = "Add some preprocessing to correct the `features` and to add the first instance (=current `features`)!"
print("Iris dataset:\n{}\n{}".format(iris_dataset, text))

--2022-05-09 08:51:11--  https://archive.ics.uci.edu/ml/machine-learning-databases/00611/accelerometer.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3731094 (3,6M) [application/x-httpd-php]
Saving to: ‘accelerometer.csv’


2022-05-09 08:51:15 (1005 KB/s) - ‘accelerometer.csv’ saved [3731094/3731094]



Using custom data configuration default-529bdd5cde81af21


Downloading and preparing dataset csv/default to /Users/matthias/.cache/huggingface/datasets/csv/default-529bdd5cde81af21/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /Users/matthias/.cache/huggingface/datasets/csv/default-529bdd5cde81af21/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Accelerometer dataset:
DatasetDict({
    train: Dataset({
        features: ['wconfid', 'pctid', 'x', 'y', 'z'],
        num_rows: 153000
    })
})


Using custom data configuration default-24f66c6afbe12d7b


Downloading and preparing dataset csv/default to /Users/matthias/.cache/huggingface/datasets/csv/default-24f66c6afbe12d7b/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/4.55k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /Users/matthias/.cache/huggingface/datasets/csv/default-24f66c6afbe12d7b/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Iris dataset:
DatasetDict({
    train: Dataset({
        features: ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
        num_rows: 149
    })
})
Add some preprocessing to correct the `features` and to add the first instance (=current `features`)!


## [Time to slice and dice](https://huggingface.co/course/chapter5/3?fw=pt)

Most of the time, the data you work with won't be perfectly prepared for training models. In this section we'll explore the various features that 🤗 Datasets provides to clean up your datasets.

In [37]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/tqfSFcPMgOI" allowfullscreen></iframe>')

### Slicing and dicing our data

Similar to Pandas, 🤗 Datasets provides several functions to manipulate the contents of `Dataset` and `DatasetDict` objects. We already encountered the `Dataset.map()` method in Chapter 3, and in this section we'll explore some of the other functions at our disposal.

For this example we'll use the [Drug Review Dataset](https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29) that's hosted on the [UC Irvine Machine Learning Repository}(https://archive.ics.uci.edu/ml/index.php), which contains patient reviews on various drugs, along with the condition being treated and a 10-star rating of the patient's satisfaction.

First we need to download and extract the data, which can be done with the `wget` and `unzip` commands:

In [38]:
# the following commands need to run only once
#!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
#!mv drugsCom_raw.zip data
#!unzip data/drugsCom_raw.zip -d data

Since TSV is just a variant of CSV that uses tabs instead of commas as the separator, we can load these files by using the csv loading script and specifying the `delimiter` argument in the `load_dataset()` function as follows:

In [39]:
data_files = {"train": "data/drugsComTrain_raw.tsv", "test": "data/drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

Using custom data configuration default-936f472160ee3f45
Reusing dataset csv (/Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/2 [00:00<?, ?it/s]

A good practice when doing any sort of data analysis is to grab a small random sample to get a quick feel for the type of data you're working with. In 🤗 Datasets, we can create a random sample by chaining the `Dataset.shuffle()` and `Dataset.select()` functions together:

In [40]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

Loading cached shuffled indices for dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-de03cc33fccffb38.arrow


{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

Note that we've fixed the seed in `Dataset.shuffle()` for reproducibility purposes. `Dataset.select()` expects an iterable of indices, so we've passed `range(1000)` to grab the first 1,000 examples from the shuffled dataset. From this sample we can already see a few quirks in our dataset:
- The `Unnamed: 0` column looks suspiciously like an anonymized ID for each patient.
- The `condition` column includes a mix of uppercase and lowercase labels.
- The `reviews` are of varying length and contain a mix of Python line separators (`\r\n`) as well as HTML character codes like `&\#039;`.

Let's see how we can use 🤗 Datasets to deal with each of these issues. To test the patient ID hypothesis for the `Unnamed: 0` column, we can use the `Dataset.unique()` function to verify that the number of IDs matches the number of rows in each split:

In [41]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

This seems to confirm our hypothesis, so let's clean up the dataset a bit by renaming the `Unnamed: 0` column to something a bit more interpretable. We can use the `DatasetDict.rename_column()` function to rename the column across both splits in one go:

In [42]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

> ✏️ Try it out! <font color="darkgreen">Use the `Dataset.unique()` function to find the number of unique drugs and conditions in the training and test sets.</font>

In [43]:
# Trying it out
print("unique drugs in the 'train' set:\t{}".format(len(drug_dataset["train"].unique("drugName"))))
print("unique drugs in the 'test' set: \t{}".format(len(drug_dataset["test"].unique("drugName"))))
print("unique conditions in the 'train' set:\t{}".format(len(drug_dataset["train"].unique("condition"))))
print("unique conditions in the 'test' set:\t{}".format(len(drug_dataset["test"].unique("condition"))))

unique drugs in the 'train' set:	3436
unique drugs in the 'test' set: 	2637
unique conditions in the 'train' set:	885
unique conditions in the 'test' set:	709


Next, let's normalize all the `condition` labels using `Dataset.map()`. As we did with tokenization in [Chapter 3](https://huggingface.co/course/chapter3), we can define a simple function that can be applied across all the rows of each split in `drug_dataset`:

```python
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}
drug_dataset.map(lowercase_condition)

AttributeError: 'NoneType' object has no attribute 'lower'
```

Oh no, we've run into a problem with our map function! From the error we can infer that some of the entries in the `condition` column are `None`, which cannot be lowercased as they're not strings. Let's drop these rows using `Dataset.filter()`, which works in a similar way to `Dataset.map()` and expects a function that receives a single example of the dataset. Instead of writing an explicit function like:
```python
def filter_nones(x):
    return x["condition"] is not None
```
and then running `drug_dataset.filter(filter_nones)`, we can do this in one line using a *lambda function*. In Python, lambda functions are small functions that you can define without explicitly naming them. They take the general form:
```python
lambda <arguments> : <expression>
```
where `lambda` is one of Python's special [keywords](https://docs.python.org/3/reference/lexical_analysis.html#keywords), `<arguments>` is a list/set of comma-separated values that define the inputs to the function, and `<expression>` represents the operations you wish to execute. For example, we can define a simple lambda function that squares a number as follows:

In [44]:
lambda x : x * x

<function __main__.<lambda>(x)>

To apply this function to an input, we need to wrap it and the input in parentheses:

In [45]:
(lambda x: x * x)(3)

9

Similarly, we can define lambda functions with multiple arguments by separating them with commas. For example, we can compute the area of a triangle as follows:

In [46]:
(lambda base, height: 0.5 * base * height)(4, 8)

16.0

Lambda functions are handy when you want to define small, single-use functions (for more information about them, we recommend reading the excellent [Real Python tutorial](https://realpython.com/python-lambda/) by Andre Burgaud). In the 🤗 Datasets context, we can use lambda functions to define simple map and filter operations, so let's use this trick to eliminate the `None` entries in our dataset:

In [47]:
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-f64f09b63c706564.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-7fea3cdfc8bb1b98.arrow


With the `None` entries removed, we can normalize our `condition` column:

In [48]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-77429251a248855c.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-213c6b1d44c7f94e.arrow


['left ventricular dysfunction', 'adhd', 'birth control']

It works! Now that we've cleaned up the labels, let's take a look at cleaning up the reviews themselves.

### Creating new columns
Whenever you're dealing with customer reviews, a good practice is to check the number of words in each review. A review might be just a single word like "Great!" or a full-blown essay with thousands of words, and depending on the use case you'll need to handle these extremes differently. To compute the number of words in each review, we'll use a rough heuristic based on splitting each text by whitespace.

Let's define a simple function that counts the number of words in each review:

In [49]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

Unlike our `lowercase_condition()` function, `compute_review_length()` returns a dictionary whose key does not correspond to one of the column names in the dataset. In this case, when `compute_review_length()` is passed to `Dataset.map()`, it will be applied to all the rows in the dataset to create a new `review_length` column:

In [50]:
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-193c1d46a6f267ec.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-86816261ceb2b42f.arrow


{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}

As expected, we can see a `review_length` column has been added to our training set. We can sort this new column with `Dataset.sort()` to see what the extreme values look like:

In [51]:
drug_dataset["train"].sort("review_length")[:3]

Loading cached sorted indices for dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-f8818bfccacb538c.arrow


{'patient_id': [103488, 23627, 20558],
 'drugName': ['Loestrin 21 1 / 20', 'Chlorzoxazone', 'Nucynta'],
 'condition': ['birth control', 'muscle spasm', 'pain'],
 'review': ['"Excellent."', '"useless"', '"ok"'],
 'rating': [10.0, 1.0, 6.0],
 'date': ['November 4, 2008', 'March 24, 2017', 'August 20, 2016'],
 'usefulCount': [5, 2, 10],
 'review_length': [1, 1, 1]}

As we suspected, some reviews contain just a single word, which, although it may be okay for sentiment analysis, would not be informative if we want to predict the condition.

> <font color="darkgreen">🙋 An alternative way to add new columns to a dataset is with the `Dataset.add_column()` function. This allows you to provide the column as a Python list or NumPy array and can be handy in situations where `Dataset.map()` is not well suited for your analysis.</font>

Let's use the `Dataset.filter()` function to remove reviews that contain fewer than 30 words. Similarly to what we did with the condition column, we can filter out the very short reviews by requiring that the reviews have a length above this threshold:

In [52]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-9162fbabc86b82d2.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-b480588b5a2d2126.arrow


{'train': 138514, 'test': 46108}


As you can see, this has removed around 15% of the reviews from our original training and test sets.

> ✏️ Try it out! <font color="darkgreen">Use the `Dataset.sort()` function to inspect the reviews with the largest numbers of words. See the documentation to see which argument you need to use to sort the reviews by length in descending order.</font>

In [53]:
# Trying it out
drug_dataset_sorted_by_review_length = drug_dataset["train"].sort("review_length", reverse=True)
print(drug_dataset_sorted_by_review_length)
for i in range(3):
    review = drug_dataset_sorted_by_review_length["review"][i]
    review_length = drug_dataset_sorted_by_review_length["review_length"][i]
    print("\nreview {}\nlength {}\n{}".format(i, review_length, review))
# documentation: https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes#datasets.Dataset.sort

Dataset({
    features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
    num_rows: 138514
})

review 0
length 1894
"Two and a half months ago I was prescribed Venlafaxine to help prevent chronic migraines.
It did help the migraines (reduced them by almost half), but with it came a host of side effects that were far worse than the problem I was trying to get rid of.
Having now come off of the stuff, I would not recommend anyone ever use Venlafaxine unless they suffer from extreme / suicidal depression. I mean extreme in the most emphatic sense of the word. 
Before trying Venlafaxine, I was a writer. While on Venlafaxine, I could barely write or speak or communicate at all. More than that, I just didn&#039;t want to. Not normal for a usually outgoing extrovert.
Now, I&#039;m beginning to write again - but my ability to speak and converse with others has deteriorated by about 95%. Writing these words is taking forever; keeping up in 


review 1
length 1162
"I don&rsquo;t find a lot of positive stories about antidepressants, or I find stories where people are taking the antidepressant the wrong way.

I wanted to share my experience.  A positive one.

I&rsquo;ve had generalized anxiety disorder, SEVERE OCD, and panic disorder for as long as I can remember.  My first memory of having an episode was when I was 4 years old at my kindergarten interview.  I feel as though I was born with the illnesses mentioned above, right from the womb.  When I was a child I was extremely anxious, had bad separation anxiety from my parents and had extreme OCD, I was just a kid and thought that the way I was feeling is how all kids felt, I didn&rsquo;t realize that I was different.  This went on, and got even worse in middle school.  I began developing trichtilomania in middle school.  In high school I went from being a 90% above student, to failing every class within a couple of years.  I couldn&rsquo;t leave the house.  My panic disorde

The last thing we need to deal with is the presence of HTML character codes in our reviews. We can use Python's `html`
 module to unescape these characters, like so:

In [54]:
import html
text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

We'll use `Dataset.map()` to unescape all the HTML characters in our corpus:

In [55]:
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-61eca8a742e40f3e.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-935d9d6f9521c0b6.arrow


As you can see, the `Dataset.map()` method is quite useful for processing data — and we haven't even scratched the surface of everything it can do!

### The `map()` method's superpowers

The `Dataset.map()` method takes a batched argument that, if set to `True`, causes it to send a batch of examples to the map function at once (the batch size is configurable but defaults to 1,000). For instance, the previous map function that unescaped all the HTML took a bit of time to run (you can read the time taken from the progress bars). We can speed this up by processing several elements at the same time using a list comprehension.

When you specify `batched=True` the function receives a dictionary with the fields of the dataset, but each value is now a *list of values*, and not just a single value. The return value of `Dataset.map()` should be the same: a dictionary with the fields we want to update or add to our dataset, and a list of values. For example, here is another way to unescape all HTML characters, but using `batched=True`:

In [56]:
new_drug_dataset = drug_dataset.map(lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True)

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-a0cbbd2ffb7355cd.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-32491582872fd4f2.arrow


If you're running this code in a notebook, you'll see that this command executes way faster than the previous one. And it's not because our reviews have already been HTML-unescaped — if you re-execute the instruction from the previous section (without `batched=True`), it will take the same amount of time as before. This is because list comprehensions are usually faster than executing the same code in a `for` loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.

Using `Dataset.map()` with `batched=True` will be essential to unlock the speed of the "fast" tokenizers that we'll encounter in [Chapter 6](https://huggingface.co/course/chapter6), which can quickly tokenize big lists of texts. For instance, to tokenize all the drug reviews with a fast tokenizer, we could use a function like this:

In [57]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

As you saw in [Chapter 3](https://huggingface.co/course/chapter3), we can pass one or several examples to the tokenizer, so we can use this function with or without `batched=True`. Let's take this opportunity to compare the performance of the different options. In a notebook, you can time a one-line instruction by adding `%time` before the line of code you wish to measure:

In [58]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-ecb7a569a9a15e82.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-9bd5e7e841461c1f.arrow


CPU times: user 19.6 ms, sys: 17.6 ms, total: 37.2 ms
Wall time: 65.6 ms


You can also time a whole cell by putting `%%time` at the beginning of the cell. On the hardware we executed this on, it showed 10.8s for this instruction (it's the number written after "Wall time").
> ✏️ Try it out! <font color="darkgreen">Execute the same instruction with and without `batched=True`, then try it with a slow tokenizer (add `use_fast=False` in the `AutoTokenizer.from_pretrained()` method) so you can see what numbers you get on your hardware.</font>

In [59]:
# Trying it out
## a fast tokenizer
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=False)
## not a fast tokenizer
not_a_fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
def not_a_fast_tokenize_function(examples):
    return not_a_fast_tokenizer(examples["review"], truncation=True)
%time tokenized_dataset = drug_dataset.map(not_a_fast_tokenize_function, batched=True)
%time tokenized_dataset = drug_dataset.map(not_a_fast_tokenize_function, batched=False)

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-ecb7a569a9a15e82.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-9bd5e7e841461c1f.arrow


CPU times: user 19.1 ms, sys: 13.1 ms, total: 32.2 ms
Wall time: 30.6 ms


  0%|          | 0/138514 [00:00<?, ?ex/s]

  0%|          | 0/46108 [00:00<?, ?ex/s]

CPU times: user 1min 3s, sys: 518 ms, total: 1min 4s
Wall time: 1min 4s


  0%|          | 0/139 [00:00<?, ?ba/s]

  0%|          | 0/47 [00:00<?, ?ba/s]

CPU times: user 3min 41s, sys: 498 ms, total: 3min 41s
Wall time: 3min 41s


  0%|          | 0/138514 [00:00<?, ?ex/s]

  0%|          | 0/46108 [00:00<?, ?ex/s]

CPU times: user 4min 3s, sys: 1.94 s, total: 4min 5s
Wall time: 4min 5s


Here are the results we obtained with and without batching, with a fast and a slow tokenizer:

|Options|Fast tokenizer|Slow tokenizer|
|-------|--------------|--------------|
|`batched=True`|10.8s|4min41s|
|`batched=False`|59.2s|5min3s|

This means that using a fast tokenizer with the `batched=True` option is 30 times faster than its slow counterpart with no batching — this is truly amazing! That's the main reason why fast tokenizers are the default when using `AutoTokenizer` (and why they are called "fast"). They're able to achieve such a speedup because behind the scenes the tokenization code is executed in Rust, which is a language that makes it easy to parallelize code execution.

Parallelization is also the reason for the nearly 6x speedup the fast tokenizer achieves with batching: you can't parallelize a single tokenization operation, but when you want to tokenize lots of texts at the same time you can just split the execution across several processes, each responsible for its own texts.

`Dataset.map()` also has some parallelization capabilities of its own. Since they are not backed by Rust, they won't let a slow tokenizer catch up with a fast one, but they can still be helpful (especially if you're using a tokenizer that doesn't have a fast version). To enable multiprocessing, use the `num_proc` argument and specify the number of processes to use in your call to `Dataset.map()`:

In [60]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)
tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-436b0d1c59b8aa41.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-71c08cd5e57ebaa3.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-43d5021c26bef8ba.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-55edafcd9a34f2bc.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-e224d450c4c9362c.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-b944340046e7161b.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-9bae87aa32d50eba.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-6b0330801c7f46fb.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-d7b37f1960d2065a.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-f4d01ea38426a5a9.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-4dd898bdc91be19a.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-67a290b17857f9e0.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-1347d520ff37f5c6.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-34fd335e0e4c5e56.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-654d4aebd38208d6.arrow


 

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-e07ce084fa2eaf19.arrow


You can experiment a little with timing to determine the optimal number of processes to use; in our case 8 seemed to produce the best speed gain. Here are the numbers we got with and without multiprocessing:

|Options|Fast tokenizer|Slow tokenizer|
|-------|--------------|--------------|
|`batched=True`|10.8s|4min41s|
|`batched=False`|59.2s|5min3s|
|`batched=True`, `num_proc=8`|6.52s|41.3s|
|`batched=False`, `num_proc=8`|9.49s|45.2s|

Those are much more reasonable results for the slow tokenizer, but the performance of the fast tokenizer was also substantially improved. Note, however, that won't always be the case — for values of `num_proc` other than 8, our tests showed that it was faster to use `batched=True` without that option. In general, we don't recommend using Python multiprocessing for fast tokenizers with `batched=True`.
> <font color="darkgreen">Using `num_proc` to speed up your processing is usually a great idea, as long as the function you are using is not already doing some kind of multiprocessing of its own.</font>

All of this functionality condensed into a single method is already pretty amazing, but there's more! With `Dataset.map()` and `batched=True` you can change the number of elements in your dataset. This is super useful in many situations where you want to create several training features from one example, and we will need to do this as part of the preprocessing for several of the NLP tasks we'll undertake in [Chapter 7](https://huggingface.co/course/chapter7).
> <font color="darkgreen">💡 In machine learning, an *example* is usually defined as the set of *features* that we feed to the model. In some contexts, these features will be the set of columns in a `Dataset`, but in others (like here and for question answering), multiple features can be extracted from a single example and belong to a single column.</font>

Let's have a look at how it works! Here we will tokenize our examples and truncate them to a maximum length of 128, but we will ask the tokenizer to return *all* the chunks of the texts instead of just the first one. This can be done with `return_overflowing_tokens=True`:

In [61]:
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

Let's test this on one example before using `Dataset.map()` on the whole dataset:

In [62]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

[128, 49]

So, our first example in the training set became two features because it was tokenized to more than the maximum number of tokens we specified: the first one of length 128 and the second one of length 49. Now let's do this for all elements of the dataset!
```python
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

ArrowInvalid: Column 1 named condition expected length 1463 but got length 1000
```

Oh no! That didn't work! Why not? Looking at the error message will give us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you've looked at the `Dataset.map()` documentation, you may recall that it's the number of samples passed to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.

The problem is that we're trying to mix two different datasets of different sizes: the `drug_dataset` columns will have a certain number of examples (the 1,000 in our error), but the `tokenized_dataset` we are building will have more (the 1,463 in the error message). That doesn't work for a `Dataset`, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the `remove_columns` argument:

In [63]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-00d65e85202c60ee.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-dd2c01358151e4b1.arrow


Now this works without error. We can check that our new dataset has many more elements than the original dataset by comparing the lengths:

In [64]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

We mentioned that we can also deal with the mismatched length problem by making the old columns the same size as the new ones. To do this, we will need the `overflow_to_sample_mapping` field the tokenizer returns when we set `return_overflowing_tokens=True`. It gives us a mapping from a new feature index to the index of the sample it originated from. Using this, we can associate each key present in our original dataset with a list of values of the right size by repeating the values of each example as many times as it generates new features:

In [65]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

We can see it works with `Dataset.map()` without us needing to remove the old columns:

In [66]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-4cc454344d95ff24.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-cc8dd0e4a7bcff76.arrow


DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

We get the same number of training features as before, but here we've kept all the old fields. If you need them for some post-processing after applying your model, you might want to use this approach.

You've now seen how 🤗 Datasets can be used to preprocess a dataset in various ways. Although the processing functions of 🤗 Datasets will cover most of your model training needs, there may be times when you'll need to switch to Pandas to access more powerful features, like `DataFrame.groupby()` or high-level APIs for visualization. Fortunately, 🤗 Datasets is designed to be interoperable with libraries such as Pandas, NumPy, PyTorch, TensorFlow, and JAX. Let's take a look at how this works.

### From `Datasets` to `DataFrames` and back

In [28]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/tfcY1067A5Q" allowfullscreen></iframe>')



To enable the conversion between various third-party libraries, 🤗 Datasets provides a `Dataset.set_format()` function. This function only changes the *output format* of the dataset, so you can easily switch to another format without affecting the underlying *data format*, which is Apache Arrow. The formatting is done in place. To demonstrate, let's convert our dataset to Pandas:

In [68]:
drug_dataset.set_format("pandas")

Now when we access elements of the dataset we get a `pandas.DataFrame` instead of a dictionary:

In [69]:
drug_dataset["train"][:3]

Unnamed: 0,patient_id,drugName,condition,review,rating,date,usefulCount,review_length
0,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192,141
1,92703,Lybrel,birth control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17,134
2,138000,Ortho Evra,birth control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10,89


Let's create a `pandas.DataFrame` for the whole training set by selecting all the elements of `drug_dataset["train"]`:

In [70]:
train_df = drug_dataset["train"][:]

> <font color="darkgreen">🚨 Under the hood, `Dataset.set_format()` changes the return format for the dataset's `__getitem__()` dunder method. This means that when we want to create a new object like `train_df` from a `Dataset` in the `"pandas"` format, we need to slice the whole dataset to obtain a `pandas.DataFrame`. You can verify for yourself that the type of `drug_dataset["train"]` is `Dataset`, irrespective of the output format.</font>

From here we can use all the Pandas functionality that we want. For example, we can do fancy chaining to compute the class distribution among the `condition` entries:

In [88]:
frequencies = (
    train_df["condition"]
    .value_counts()
    .to_frame()
    .reset_index()
    .rename(columns={"index": "condition", "condition": "frequency"})
)
frequencies.head()

Unnamed: 0,condition,frequency
0,birth control,27655
1,depression,8023
2,acne,5209
3,anxiety,4991
4,pain,4744


And once we're done with our Pandas analysis, we can always create a new `Dataset` object by using the `Dataset.from_pandas()` function as follows:

In [72]:
from datasets import Dataset
freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})

> ✏️ Try it out! <font color="darkgreen">Compute the average rating per drug and store the result in a new `Dataset`.</font>

In [139]:
# Trying it out
## https://stackoverflow.com/questions/30482071/how-to-calculate-mean-values-grouped-on-another-column-in-pandas
## https://stackoverflow.com/questions/17141558/how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns
ratings = (
    train_df
    .groupby("drugName", as_index=False)["rating"]
    .mean()
    .sort_values(["rating", "drugName"], ascending=[False, True])
)
ratings

Unnamed: 0,drugName,rating
0,A + D Cracked Skin Relief,10.0
1,A / B Otic,10.0
8,Abiraterone,10.0
12,Absorbine Jr.,10.0
17,Accolate,10.0
...,...,...
2999,Zileuton,1.0
3024,Zostavax,1.0
3025,Zoster vaccine live,1.0
3027,Zostrix Diabetic Foot Pain,1.0


This wraps up our tour of the various preprocessing techniques available in 🤗 Datasets. To round out the section, let's create a validation set to prepare the dataset for training a classifier on. Before doing so, we'll reset the output format of `drug_dataset` from `"pandas"` to `"arrow"`:

In [140]:
drug_dataset.reset_format()

### Creating a validation set

Although we have a test set we could use for evaluation, it's a good practice to leave the test set untouched and create a separate validation set during development. Once you are happy with the performance of your models on the validation set, you can do a final sanity check on the test set. This process helps mitigate the risk that you'll overfit to the test set and deploy a model that fails on real-world data.

🤗 Datasets provides a `Dataset.train_test_split()` function that is based on the famous functionality from `scikit-learn`. Let's use it to split our training set into `train` and `validation` splits (we set the seed argument for reproducibility):

In [141]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

Loading cached split indices for dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-fa2e5035d44eb731.arrow and /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-77e4b8011fe95505.arrow


DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

Great, we've now prepared a dataset that's ready for training some models on! In section 5 we'll show you how to upload datasets to the Hugging Face Hub, but for now let's cap off our analysis by looking at a few ways you can save datasets on your local machine.

### Saving a dataset

In [29]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/blF9uxYcKHo" allowfullscreen></iframe>')

Although 🤗 Datasets will cache every downloaded dataset and the operations performed on it, there are times when you'll want to save a dataset to disk (e.g., in case the cache gets deleted). As shown in the table below, 🤗 Datasets provides three main functions to save your dataset in different formats:

|Data format|Function|
|-----------|--------|
|Arrow|`Dataset.save_to_disk()`|
|CSV|`Dataset.to_csv()`|
|JSON|`Dataset.to_json()`|

For example, let's save our cleaned dataset in the Arrow format:

In [143]:
drug_dataset_clean.save_to_disk("drug-reviews")

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-c4dec85fb2b4f24b.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/csv/default-936f472160ee3f45/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-2971f6df953fc535.arrow


This will create a directory with the following structure:
```
drug-reviews/
├── dataset_dict.json
├── test
│   ├── dataset.arrow
│   ├── dataset_info.json
│   └── state.json
├── train
│   ├── dataset.arrow
│   ├── dataset_info.json
│   ├── indices.arrow
│   └── state.json
└── validation
    ├── dataset.arrow
    ├── dataset_info.json
    ├── indices.arrow
    └── state.json
```
where we can see that each split is associated with its own *dataset.arrow* table, and some metadata in *dataset_info.json* and *state.json*. You can think of the Arrow format as a fancy table of columns and rows that is optimized for building high-performance applications that process and transport large datasets.

Once the dataset is saved, we can load it by using the `load_from_disk()` function as follows:

In [144]:
from datasets import load_from_disk
drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

For the CSV and JSON formats, we have to store each split as a separate file. One way to do this is by iterating over the keys and values in the `DatasetDict` object:

In [145]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

Creating json from Arrow format:   0%|          | 0/12 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/5 [00:00<?, ?ba/s]

This saves each split in [JSON Lines format](https://jsonlines.org/), where each row in the dataset is stored as a single line of JSON. Here's what the first example looks like:

In [146]:
!mv drug-reviews-test.jsonl data
!mv drug-reviews-train.jsonl data
!mv drug-reviews-validation.jsonl data
!head -n 1 data/drug-reviews-train.jsonl

{"patient_id":89879,"drugName":"Cyclosporine","condition":"keratoconjunctivitis sicca","review":"\"I have used Restasis for about a year now and have seen almost no progress.  For most of my life I've had red and bothersome eyes. After trying various eye drops, my doctor recommended Restasis.  He said it typically takes 3 to 6 months for it to really kick in but it never did kick in.  When I put the drops in it burns my eyes for the first 30 - 40 minutes.  I've talked with my doctor about this and he said it is normal but should go away after some time, but it hasn't. Every year around spring time my eyes get terrible irritated  and this year has been the same (maybe even worse than other years) even though I've been using Restasis for a year now. The only difference I notice was for the first couple weeks, but now I'm ready to move on.\"","rating":2.0,"date":"April 20, 2013","usefulCount":69,"review_length":147}


We can then use the techniques from [section 2](https://huggingface.co/course/chapter5/2) to load the JSON files as follows:

In [147]:
data_files = {
    "train": "data/drug-reviews-train.jsonl",
    "validation": "data/drug-reviews-validation.jsonl",
    "test": "data/drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)
drug_dataset_reloaded

Using custom data configuration default-5a5ec922f92efec5


Downloading and preparing dataset json/default to /Users/matthias/.cache/huggingface/datasets/json/default-5a5ec922f92efec5/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /Users/matthias/.cache/huggingface/datasets/json/default-5a5ec922f92efec5/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

And that's it for our excursion into data wrangling with 🤗 Datasets! Now that we have a cleaned dataset for training a model on, here are a few ideas that you could try out:
1. Use the techniques from [Chapter 3](https://huggingface.co/course/chapter3) to train a classifier that can predict the patient condition based on the drug review.
1. Use the `summarization` pipeline from [Chapter 1](https://huggingface.co/course/chapter1) to generate summaries of the reviews.

Next, we'll take a look at how 🤗 Datasets can enable you to work with huge datasets without blowing up your laptop!

## [Big data? 🤗 Datasets to the rescue!](https://huggingface.co/course/chapter5/4?fw=pt)

Nowadays it is not uncommon to find yourself working with multi-gigabyte datasets, especially if you're planning to pretrain a transformer like BERT or GPT-2 from scratch. In these cases, even *loading* the data can be a challenge. For example, the WebText corpus used to pretrain GPT-2 consists of over 8 million documents and 40 GB of text — loading this into your laptop's RAM is likely to give it a heart attack!

Fortunately, 🤗 Datasets has been designed to overcome these limitations. It frees you from memory management problems by treating datasets as *memory-mapped* files, and from hard drive limits by *streaming* the entries in a corpus.

In [30]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/JwISwTCPPWo" allowfullscreen></iframe>')

In this section we'll explore these features of 🤗 Datasets with a huge 825 GB corpus known as [the Pile](https://pile.eleuther.ai/). Let's get started!

### What is the Pile?

The Pile is an English text corpus that was created by [EleutherAI](https://www.eleuther.ai/) for training large-scale language models. It includes a diverse range of datasets, spanning scientific articles, GitHub code repositories, and filtered web text. The training corpus is available in [14 GB chunks](https://mystic.the-eye.eu/public/AI/pile/), and you can also download several of the [individual components](https://mystic.the-eye.eu/public/AI/pile_preliminary_components/). Let's start by taking a look at the PubMed Abstracts dataset, which is a corpus of abstracts from 15 million biomedical publications on [PubMed](https://pubmed.ncbi.nlm.nih.gov/). The dataset is in [JSON Lines format](https://jsonlines.org/) and is compressed using the `zstandard` library, so first we need to install that:

In [149]:
# the following command needs to run only once
#!conda install -c conda-forge zstandard

Next, we can load the dataset using the method for remote files that we learned in [section 2](https://huggingface.co/course/chapter5/2):

In [150]:
# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/"
data_files += "PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

Using custom data configuration default-6ad3aefcb3b64942
Reusing dataset json (/Users/matthias/.cache/huggingface/datasets/json/default-6ad3aefcb3b64942/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


Dataset({
    features: ['meta', 'text'],
    num_rows: 15518009
})

We can see that there are 15,518,009 rows and 2 columns in our dataset — that's a lot!
> <font color="darkgreen">✎ By default, 🤗 Datasets will decompress the files needed to load a dataset. If you want to preserve hard drive space, you can pass `DownloadConfig(delete_extracted=True)` to the `download_config` argument of `load_dataset()`. See the [documentation](https://huggingface.co/docs/datasets/package_reference/builder_classes.html?#datasets.utils.DownloadConfig) for more details.</font>

Let's inspect the contents of the first example:

In [151]:
pubmed_dataset[0]

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that i

Okay, this looks like the abstract from a medical article. Now let's see how much RAM we've used to load the dataset!

### The magic of memory mapping
A simple way to measure memory usage in Python is with the [`psutil`](https://psutil.readthedocs.io/en/latest/) library, which can be installed with `conda` as follows:

In [152]:
# the following command needs to run only once
#!conda install -c conda-forge psutil

It provides a `Process` class that allows us to check the memory usage of the current process as follows:

In [153]:
import psutil
# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 2978.64 MB


Here, the `rss` attribute refers to the *resident set size*, which is the fraction of memory that a process occupies in RAM. This measurement also includes the memory used by the Python interpreter and the libraries we've loaded, so the actual amount of memory used to load the dataset is a bit smaller. For comparison, let's see how large the dataset is on disk, using the `dataset_size` attribute. Since the result is expressed in bytes like before, we need to manually convert it to gigabytes:

In [155]:
print(f"Number of files in dataset : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

Number of files in dataset : 20978892555
Dataset size (cache file) : 19.54 GB


Nice — despite it being almost 20 GB large, we're able to load and access the dataset with much less RAM!
> ✏️ Try it out! <font color="darkgreen">Pick one of the subsets from the Pile that is larger than your laptop or desktop's RAM, load it with 🤗 Datasets, and measure the amount of RAM used. Note that to get an accurate measurement, you'll want to do this in a new process. You can find the decompressed sizes of each subset in Table 1 of the Pile paper.</font>

In [156]:
# Trying it out
## https://pile.eleuther.ai/
## https://arxiv.org/pdf/2101.00027.pdf
## https://mystic.the-eye.eu/public/AI/pile_preliminary_components/
data_files = "https://mystic.the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst"
FreeLaw_dataset = load_dataset("json", data_files=data_files, split="train")
print(f"Number of files in dataset : {FreeLaw_dataset.dataset_size}")
size_gb = FreeLaw_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

Using custom data configuration default-ecdac2973eb354f0


Downloading and preparing dataset json/default to /Users/matthias/.cache/huggingface/datasets/json/default-ecdac2973eb354f0/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/17.0G [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /Users/matthias/.cache/huggingface/datasets/json/default-ecdac2973eb354f0/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.
Number of files in dataset : 55157239146
Dataset size (cache file) : 51.37 GB


If you're familiar with Pandas, this result might come as a surprise because of Wes Kinney's famous [rule of thumb](https://wesmckinney.com/blog/apache-arrow-pandas-internals/) that you typically need 5 to 10 times as much RAM as the size of your dataset. So how does 🤗 Datasets solve this memory management problem? 🤗 Datasets treats each dataset as a [memory-mapped file](https://en.wikipedia.org/wiki/Memory-mapped_file), which provides a mapping between RAM and filesystem storage that allows the library to access and operate on elements of the dataset without needing to fully load it into memory.

Memory-mapped files can also be shared across multiple processes, which enables methods like `Dataset.map()` to be parallelized without needing to move or copy the dataset. Under the hood, these capabilities are all realized by the [Apache Arrow](https://arrow.apache.org/) memory format and [`pyarrow`](https://arrow.apache.org/docs/python/index.html) library, which make the data loading and processing lightning fast. (For more details about Apache Arrow and comparisons to Pandas, check out [Dejan Simic's blog post](https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a).) To see this in action, let's run a little speed test by iterating over all the elements in the PubMed Abstracts dataset:

In [157]:
import timeit
code_snippet = """
batch_size = 1000
for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""
time = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print("Iterated over {} examples (about {:.1f} GB) in {:.1f}s, i.e., {:.3f} GB/s".format(
    len(pubmed_dataset), size_gb, time, size_gb/time
))

Iterated over 15518009 examples (about 51.4 GB) in 147.5s, i.e., 0.348 GB/s


Here we've used Python's `timeit` module to measure the execution time taken by `code_snippet`. You'll typically be able to iterate over a dataset at speeds of a few tenths of a GB/s to several GB/s. This works great for the vast majority of applications, but sometimes you'll have to work with a dataset that is too large to even store on your laptop's hard drive. For example, if we tried to download the Pile in its entirety, we'd need 825 GB of free disk space! To handle these cases, 🤗 Datasets provides a streaming feature that allows us to download and access elements on the fly, without needing to download the whole dataset. Let's take a look at how this works.
> <font color="darkgreen">💡 In Jupyter notebooks, you can also time cells using the [`%%timeit` magic function](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit).</font>

### Streaming datasets
To enable dataset streaming you just need to pass the `streaming=True` argument to the `load_dataset()` function. For example, let's load the PubMed Abstracts dataset again, but in streaming mode:

In [158]:
pubmed_dataset_streamed = load_dataset("json", data_files=data_files, split="train", streaming=True)

Using custom data configuration default-ecdac2973eb354f0


Instead of the familiar `Dataset` that we've encountered elsewhere in this chapter, the object returned with `streaming=True` is an `IterableDataset`. As the name suggests, to access the elements of an `IterableDataset` we need to iterate over it. We can access the first element of our streamed dataset as follows:

In [159]:
next(iter(pubmed_dataset_streamed))

{'meta': {'case_jurisdiction': 'scotus.tar.gz',
  'case_ID': '110921.json',
  'date_created': '2010-04-28T17:12:49Z'},
 'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General.\nRobert Gilbert Johnston argued the cause for respondent. With him on the brief was Clayton C. Ikei.[*]\n*240 JUSTICE BLACKMUN delivered the opinion of the Court.\nThe issue in this case is whether the transfer of a prisoner from a state prison in Hawaii to one in California implicates a liberty interest within the meaning of the Due Process Clause of the Fourteenth Amendment.\n\nI\n\nA\nRespondent Delbert Kaahanui Wakinekona is serving a sentence of life imprisonment withou

The elements from a streamed dataset can be processed on the fly using `IterableDataset.map()`, which is useful during training if you need to tokenize the inputs. The process is exactly the same as the one we used to tokenize our dataset in [Chapter 3](https://huggingface.co/course/chapter3), with the only difference being that outputs are returned one by one:

In [160]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))

Token indices sequence length is longer than the specified maximum sequence length for this model (10289 > 512). Running this sequence through the model will result in indexing errors


{'meta': {'case_jurisdiction': 'scotus.tar.gz',
  'case_ID': '110921.json',
  'date_created': '2010-04-28T17:12:49Z'},
 'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General.\nRobert Gilbert Johnston argued the cause for respondent. With him on the brief was Clayton C. Ikei.[*]\n*240 JUSTICE BLACKMUN delivered the opinion of the Court.\nThe issue in this case is whether the transfer of a prisoner from a state prison in Hawaii to one in California implicates a liberty interest within the meaning of the Due Process Clause of the Fourteenth Amendment.\n\nI\n\nA\nRespondent Delbert Kaahanui Wakinekona is serving a sentence of life imprisonment withou

> <font color="darkgreen">💡 To speed up tokenization with streaming you can pass `batched=True`, as we saw in the last section. It will process the examples batch by batch; the default batch size is 1,000 and can be specified with the `batch_size` argument.</font>

You can also shuffle a streamed dataset using `IterableDataset.shuffle()`, but unlike `Dataset.shuffle()` this only shuffles the elements in a predefined `buffer_size`:

In [161]:
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset))

{'meta': {'case_jurisdiction': 'scotus.tar.gz',
  'case_ID': '127009.json',
  'date_created': '2010-04-28T17:22:54Z'},
 'text': '537 U.S. 1176\nMONTUEv.CALIFORNIA DEPARTMENT OF CORRECTIONS.\nNo. 02-7879.\nSupreme Court of United States.\nJanuary 27, 2003.\n\n1\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT.\n\n\n2\nC. A. 9th Cir. Certiorari denied. Reported below: 48 Fed. Appx. 654.\n\n'}

In this example, we selected a random example from the first 10,000 examples in the buffer. Once an example is accessed, its spot in the buffer is filled with the next example in the corpus (i.e., the 10,001st example in the case above). You can also select elements from a streamed dataset using the `IterableDataset.take()` and `IterableDataset.skip()` functions, which act in a similar way to `Dataset.select()`. For example, to select the first 5 examples in the PubMed Abstracts dataset we can do the following:

In [162]:
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

[{'meta': {'case_jurisdiction': 'scotus.tar.gz',
   'case_ID': '110921.json',
   'date_created': '2010-04-28T17:12:49Z'},
  'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General.\nRobert Gilbert Johnston argued the cause for respondent. With him on the brief was Clayton C. Ikei.[*]\n*240 JUSTICE BLACKMUN delivered the opinion of the Court.\nThe issue in this case is whether the transfer of a prisoner from a state prison in Hawaii to one in California implicates a liberty interest within the meaning of the Due Process Clause of the Fourteenth Amendment.\n\nI\n\nA\nRespondent Delbert Kaahanui Wakinekona is serving a sentence of life imprisonment wi

Similarly, you can use the `IterableDataset.skip()` function to create training and validation splits from a shuffled dataset as follows:

In [163]:
# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)

Let's round out our exploration of dataset streaming with a common application: combining multiple datasets together to create a single corpus. 🤗 Datasets provides an `interleave_datasets()` function that converts a list of `IterableDataset` objects into a single `IterableDataset`, where the elements of the new dataset are obtained by alternating among the source examples. This function is especially useful when you're trying to combine large datasets, so as an example let's stream the FreeLaw subset of the Pile, which is a 51 GB dataset of legal opinions from US courts:

In [164]:
law_dataset_streamed = load_dataset(
    "json",
    data_files="https://mystic.the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
    split="train",
    streaming=True,
)
next(iter(law_dataset_streamed))

Using custom data configuration default-ecdac2973eb354f0


{'meta': {'case_jurisdiction': 'scotus.tar.gz',
  'case_ID': '110921.json',
  'date_created': '2010-04-28T17:12:49Z'},
 'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General.\nRobert Gilbert Johnston argued the cause for respondent. With him on the brief was Clayton C. Ikei.[*]\n*240 JUSTICE BLACKMUN delivered the opinion of the Court.\nThe issue in this case is whether the transfer of a prisoner from a state prison in Hawaii to one in California implicates a liberty interest within the meaning of the Due Process Clause of the Fourteenth Amendment.\n\nI\n\nA\nRespondent Delbert Kaahanui Wakinekona is serving a sentence of life imprisonment withou

This dataset is large enough to stress the RAM of most laptops, yet we've been able to load and access it without breaking a sweat! Let's now combine the examples from the FreeLaw and PubMed Abstracts datasets with the `interleave_datasets()` function:

In [165]:
from itertools import islice
from datasets import interleave_datasets
combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))

[{'meta': {'case_jurisdiction': 'scotus.tar.gz',
   'case_ID': '110921.json',
   'date_created': '2010-04-28T17:12:49Z'},
  'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued January 19, 1983.\nDecided April 26, 1983.\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE NINTH CIRCUIT\n*239 Michael A. Lilly, First Deputy Attorney General of Hawaii, argued the cause for petitioners. With him on the brief was James H. Dannenberg, Deputy Attorney General.\nRobert Gilbert Johnston argued the cause for respondent. With him on the brief was Clayton C. Ikei.[*]\n*240 JUSTICE BLACKMUN delivered the opinion of the Court.\nThe issue in this case is whether the transfer of a prisoner from a state prison in Hawaii to one in California implicates a liberty interest within the meaning of the Due Process Clause of the Fourteenth Amendment.\n\nI\n\nA\nRespondent Delbert Kaahanui Wakinekona is serving a sentence of life imprisonment wi

Here we've used the `islice()` function from Python's `itertools` module to select the first two examples from the combined dataset, and we can see that they match the first examples from each of the two source datasets.

Finally, if you want to stream the Pile in its 825 GB entirety, you can grab all the prepared files as follows:

In [166]:
base_url = "https://mystic.the-eye.eu/public/AI/pile/"
data_files = {
    "train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
    "validation": base_url + "val.jsonl.zst",
    "test": base_url + "test.jsonl.zst",
}
pile_dataset = load_dataset("json", data_files=data_files, streaming=True)
next(iter(pile_dataset["train"]))

Resolving data files:   0%|          | 0/30 [00:00<?, ?it/s]

Using custom data configuration default-ad49e2168ced215d


{'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on the web works, but you have to simulate multi-touch for table moving and that can be a bit confusing.\n\nThere’s a lot I’d like to talk about. I’ll go through every topic, insted of making the typical what went right/wrong list.\n\nConcept\n\nWorking over the theme was probably one of the hardest tasks I had to face.\n\nOriginally, I had an idea of what kind of game I wanted to develop, gameplay wise – something with lots of enemies/actors, simple graphics, maybe set in space, controlled from a top-down view. I was confident I could fit any theme around it.\n\nIn the end, the problem with a theme like “Evolution” in a game is that evolution is unassisted. It happens through several seemingly random mutations over time, with the most apt permutation surviving. This genetic car simulator is, in my opinion, a great example of actual evolution of a species facing a challenge.

> ✏️ Try it out! <font color="darkgreen">Use one of the large Common Crawl corpora like [`mc4`](https://huggingface.co/datasets/mc4) or [`oscar`](https://huggingface.co/datasets/oscar) to create a streaming multilingual dataset that represents the spoken proportions of languages in a country of your choice. For example, the four national languages in Switzerland are German, French, Italian, and Romansh, so you could try creating a Swiss corpus by sampling the Oscar subsets according to their spoken proportion.</font>

In [205]:
# Trying it out
## https://huggingface.co/datasets/mc4#dataset-summary
mc4_DeFrEn_train = load_dataset("mc4", languages=["de", "en", "fr"], split="train", streaming=True)
## https://huggingface.co/docs/datasets/stream#shuffle
shuffled_mc4_DeFrEn_train = mc4_DeFrEn_train.shuffle(buffer_size=300, seed=42)
i = 0
for inst in iter(shuffled_mc4_DeFrEn_train):
    if i==3:
        break
    print("\nsample text {}:\n{}".format(i+1, inst["text"]))
    i += 1

Using custom data configuration de+en+fr-81dec6f972abeec4



sample text 1:
﻿ education Archives - muzmatch Blog education Archives - muzmatch Blog
Marriage education is a must for single and engaged Muslims
Being a spouse and a parent are among the most important jobs you’ll ever have. Marriage education, premarital advisement and counseling can help singles and engaged people obtain the knowledge and skills they need...

sample text 2:
Black Color Net Saree [VOL6-138] - USD $126.50 : Designer Sarees, Indian Saree Online, Wedding Bridal Lehenga Saris, Salwar Kameez, Buy Sarees Online Shopping
Home :: Fancy Sarees :: Black Color Net Saree
Model: VOL6-138 Price: USD $126.50
Quantity: DescriptionItem Code-VOL6-138 .
Color -Black .
I received my saree yesterday. it is very nice exactly as on website. It is great to order from Sangini. The service was quick and the shipping was...-Orange brasso sareeHi vinaybhai

sample text 3:
Troxel Fallon Taylor Helmet - Vintage Cactus | HorseLoverZ
> Troxel Fallon Taylor Helmet - Vintage Cactus
Troxel Fallon Ta

You now have all the tools you need to load and process datasets of all shapes and sizes — but unless you're exceptionally lucky, there will come a point in your NLP journey where you'll have to actually create a dataset to solve the problem at hand. That's the topic of the next section!

## [Creating your own dataset](https://huggingface.co/course/chapter5/5?fw=pt)

Sometimes the dataset that you need to build an NLP application doesn't exist, so you'll need to create it yourself. In this section we'll show you how to create a corpus of [GitHub issues](https://github.com/features/issues/), which are commonly used to track bugs or features in GitHub repositories. This corpus could be used for various purposes, including:
- Exploring how long it takes to close open issues or pull requests
- Training a *multilabel classifier* that can tag issues with metadata based on the issue's description (e.g., "bug", "enhancement", or "question")
- Creating a semantic search engine to find which issues match a user's query

Here we'll focus on creating the corpus, and in the next section we'll tackle the semantic search application. To keep things meta, we'll use the GitHub issues associated with a popular open source project: 🤗 Datasets! Let's take a look at how to get the data and explore the information contained in these issues.

### Getting the data

You can find all the issues in 🤗 Datasets by navigating to the repository's [Issues tab](https://github.com/huggingface/datasets/issues). As shown in the following screenshot, at the time of writing there were 331 open issues and 668 closed ones.

<img style="float=center;" src="images/503a11cba6d2a53a2e1a1e6d8ff681cc2128fc8bd57f724252f72dc50fb04e9c.png">

If you click on one of these issues you'll find it contains a title, a description, and a set of labels that characterize the issue. An example is shown in the screenshot below.

<img style="float=center;" src="images/04d9715957f0c0073e90edca667cd90d9ba1b34340b51a72f20a7eeee1636a8a.png">

To download all the repository's issues, we'll use the [GitHub REST API](https://docs.github.com/en/rest) to poll the [`Issues` endpoint](https://docs.github.com/en/rest/reference/issues#list-repository-issues). This endpoint returns a list of JSON objects, with each object containing a large number of fields that include the title and description as well as metadata about the status of the issue and so on.

A convenient way to download the issues is via the `requests` library, which is the standard way for making HTTP requests in Python. You can install the library by running:

In [1]:
# the following command needs to run only once
#!conda install -c anaconda requests

Once the library is installed, you can make GET requests to the `Issues` endpoint by invoking the `requests.get()` function. For example, you can run the following command to retrieve the first issue on the first page:

In [2]:
import requests
url = "https://api.github.com/repos/huggingface/datasets/issues?page=1&per_page=1"
response = requests.get(url)

The `response` object contains a lot of useful information about the request, including the HTTP status code:

In [3]:
response.status_code

200

where a `200` status means the request was successful (you can find a list of possible HTTP status codes [here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)). What we are really interested in, though, is the *payload*, which can be accessed in various formats like bytes, strings, or JSON. Since we know our issues are in JSON format, let's inspect the payload as follows:

In [4]:
response.json()

[{'url': 'https://api.github.com/repos/huggingface/datasets/issues/4376',
  'repository_url': 'https://api.github.com/repos/huggingface/datasets',
  'labels_url': 'https://api.github.com/repos/huggingface/datasets/issues/4376/labels{/name}',
  'comments_url': 'https://api.github.com/repos/huggingface/datasets/issues/4376/comments',
  'events_url': 'https://api.github.com/repos/huggingface/datasets/issues/4376/events',
  'html_url': 'https://github.com/huggingface/datasets/issues/4376',
  'id': 1242218144,
  'node_id': 'I_kwDODunzps5KCr6g',
  'number': 4376,
  'title': 'irc_disentagle viewer error',
  'user': {'login': 'labouz',
   'id': 25671683,
   'node_id': 'MDQ6VXNlcjI1NjcxNjgz',
   'avatar_url': 'https://avatars.githubusercontent.com/u/25671683?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/labouz',
   'html_url': 'https://github.com/labouz',
   'followers_url': 'https://api.github.com/users/labouz/followers',
   'following_url': 'https://api.github.com/users/

Whoa, that's a lot of information! We can see useful fields like `title`, `body`, and `number` that describe the issue, as well as information about the GitHub user who opened the issue.
> ✏️ Try it out! <font color="darkgreen">Click on a few of the URLs in the JSON payload above to get a feel for what type of information each GitHub issue is linked to.</font>

In [5]:
# Trying it out
try_str = """
The top url (https://api.github.com/repos/huggingface/datasets/issues/4296) leads to a json object that seems to be
identical with the one depicted above. Apparantly, this json object contains all the information or links (to links)
to all the information that specify this issue (4296).

The "html_url" (https://github.com/huggingface/datasets/pull/4296) leads to a conversation concerning a pull request
(follow the link to see it).

The "avatar_url" (https://avatars.githubusercontent.com/u/8515462?v=4) leads to the avatar / photo of the user who
submitted the pull request.

The "timeline_url" (https://api.github.com/repos/huggingface/datasets/issues/4296/timeline) leads to a list json
objects that specify the timeline of this issue
"""
print(try_str)


The top url (https://api.github.com/repos/huggingface/datasets/issues/4296) leads to a json object that seems to be
identical with the one depicted above. Apparantly, this json object contains all the information or links (to links)
to all the information that specify this issue (4296).

The "html_url" (https://github.com/huggingface/datasets/pull/4296) leads to a conversation concerning a pull request
(follow the link to see it).

The "avatar_url" (https://avatars.githubusercontent.com/u/8515462?v=4) leads to the avatar / photo of the user who
submitted the pull request.

The "timeline_url" (https://api.github.com/repos/huggingface/datasets/issues/4296/timeline) leads to a list json
objects that specify the timeline of this issue



As described in the GitHub [documentation](https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting), unauthenticated requests are limited to 60 requests per hour. Although you can increase the `per_page` query parameter to reduce the number of requests you make, you will still hit the rate limit on any repository that has more than a few thousand issues. So instead, you should follow GitHub's [instructions](https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token) on creating a *personal access token* so that you can boost the rate limit to 5,000 requests per hour. Once you have your token, you can include it as part of the request header:

In [6]:
GITHUB_TOKEN = "gh===p_2lfsxDUkHuwlUBXpSfV7gcZgDOuZph4aqicf" # remove "===" (a new token might be required)
headers = {"Authorization": f"token {GITHUB_TOKEN}"}

> <font color="darkred">⚠️ Do not share a notebook with your `GITHUB_TOKEN` pasted in it. We recommend you delete the last cell once you have executed it to avoid leaking this information accidentally. Even better, store the token in a *.env* file and use the [`python-dotenv` library](https://github.com/theskumar/python-dotenv) to load it automatically for you as an environment variable</font>.

Now that we have our access token, let's create a function that can download all the issues from a GitHub repository:

In [7]:
import time
import math
from pathlib import Path
import pandas as pd
from tqdm.notebook import tqdm

def fetch_issues(
    owner="huggingface",
    repo="datasets",
    num_issues=10000,
    rate_limit=5000,
    issues_path=Path("./data")
):
    if not issues_path.is_dir():
        issues_path.mkdir(exist_ok=True)
    batch = []
    all_issues = []
    per_page = 100  # Number of issues to return per page
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"
    for page in tqdm(range(num_pages)):
        # Query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}", headers=headers)
        batch.extend(issues.json())
        print(f"page: {page}\t batch length: {len(batch)}\t query: {query}", end="\r")
        if len(batch) > rate_limit and len(all_issues) < num_issues:
            all_issues.extend(batch)
            batch = []  # Flush batch for next time period
            print()
            print("Reached GitHub rate limit. Sleeping for one hour ...")
            time.sleep(60 * 60 + 1)
    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)
    print()
    print(f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl")

Now when we call `fetch_issues()` it will download all the issues in batches to avoid exceeding GitHub's limit on the number of requests per hour; the result will be stored in a *repository_name-issues.jsonl* file, where each line is a JSON object the represents an issue. Let's use this function to grab all the issues from 🤗 Datasets:

In [8]:
# Depending on your internet connection, this can take several minutes to run...
# The following line needs to run only once (run 'fetch_issues(repo="transformers")' for "Try it out" further below)
#fetch_issues()

Once the issues are downloaded we can load them locally using our newfound skills from [section 2](https://huggingface.co/course/chaper5/2):

In [9]:
from datasets import load_dataset
issues_dataset = load_dataset("json", data_files="data/datasets-issues.jsonl", split="train")

Using custom data configuration default-7a2b59943d24c3c3
Reusing dataset json (/Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


Great, we've created our first dataset from scratch! But why are there several thousand issues when the [Issues tab](https://github.com/huggingface/datasets/issues) of the 🤗 Datasets repository only shows around 1,000 issues in total 🤔? As described in the GitHub [documentation](https://docs.github.com/en/rest/reference/issues#list-issues-assigned-to-the-authenticated-user), that's because we've downloaded all the pull requests as well:

> <i>"GitHub's REST API v3 considers every pull request an issue, but not every *issue* is a pull request. For this reason, "*Issues*" endpoints may return both issues and pull requests in the response. You can identify pull requests by the `pull_request` key. Be aware that the `id` of a pull request returned from "*Issues*" endpoints will be an issue id."</i>

Since the contents of issues and pull requests are quite different, let's do some minor preprocessing to enable us to distinguish between them.

### Cleaning up the data

The above snippet from GitHub's documentation tells us that the `pull_request` column can be used to differentiate between issues and pull requests. Let's look at a random sample to see what the difference is. As we did in [section 3](https://huggingface.co/course/chapter5/3), we'll chain `Dataset.shuffle()` and `Dataset.select()` to create a random sample and then zip the `html_url` and `pull_request` columns so we can compare the various URLs:

In [10]:
issues_dataset.info

DatasetInfo(description='', citation='', homepage='', license='', features={'url': Value(dtype='string', id=None), 'repository_url': Value(dtype='string', id=None), 'labels_url': Value(dtype='string', id=None), 'comments_url': Value(dtype='string', id=None), 'events_url': Value(dtype='string', id=None), 'html_url': Value(dtype='string', id=None), 'id': Value(dtype='int64', id=None), 'node_id': Value(dtype='string', id=None), 'number': Value(dtype='int64', id=None), 'title': Value(dtype='string', id=None), 'user': {'login': Value(dtype='string', id=None), 'id': Value(dtype='int64', id=None), 'node_id': Value(dtype='string', id=None), 'avatar_url': Value(dtype='string', id=None), 'gravatar_id': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'html_url': Value(dtype='string', id=None), 'followers_url': Value(dtype='string', id=None), 'following_url': Value(dtype='string', id=None), 'gists_url': Value(dtype='string', id=None), 'starred_url': Value(dtype='string', id=

In [11]:
sample = issues_dataset.shuffle(seed=666).select(range(3))
print(sample) # maybe remove this extra print statement
# Print out the URL and pull request entries
for url, pr in zip(sample["html_url"], sample["pull_request"]):
    print(f">> URL: {url}")
    print(f">> Pull request: {pr}\n")

Loading cached shuffled indices for dataset at /Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-0e4efc3f6aff7381.arrow


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app'],
    num_rows: 3
})
>> URL: https://github.com/huggingface/datasets/pull/978
>> Pull request: {'url': 'https://api.github.com/repos/huggingface/datasets/pulls/978', 'html_url': 'https://github.com/huggingface/datasets/pull/978', 'diff_url': 'https://github.com/huggingface/datasets/pull/978.diff', 'patch_url': 'https://github.com/huggingface/datasets/pull/978.patch', 'merged_at': None}

>> URL: https://github.com/huggingface/datasets/pull/308
>> Pull request: {'url': 'https://api.github.com/repos/huggingface/datasets/pulls/308', 'html_url': 'https://github.com/huggingface/datasets/pull/308'

Here we can see that each pull request is associated with various URLs, while ordinary issues have a `None` entry. We can use this distinction to create a new `is_pull_request` column that checks whether the `pull_request` field is `None` or not:

In [12]:
issues_dataset = issues_dataset.map(lambda x: {"is_pull_request": False if x["pull_request"] is None else True})
issues_dataset

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-22cd926f1f95de72.arrow


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 4387
})

> ✏️ Try it out! <font color="darkgreen">Calculate the average time it takes to close issues in 🤗 Datasets. You may find the `Dataset.filter()` function useful to filter out the pull requests and open issues, and you can use the `Dataset.set_format()` function to convert the dataset to a `DataFrame` so you can easily manipulate the `created_at` and `closed_at` timestamps. For bonus points, calculate the average time it takes to close pull requests.</font>

In [13]:
# Trying it out
import datetime
## get closed issues and pull requests
print("total items:\t\t{}".format(issues_dataset.num_rows))
closed_dataset = issues_dataset.filter(lambda x: x["closed_at"] is not None)
print("closed items:\t\t{}".format(closed_dataset.num_rows))
closed_issues_dataset = closed_dataset.filter(lambda x: x["pull_request"] is None)
print("closed issues:\t\t{}".format(closed_issues_dataset.num_rows))
closed_pullRequests_dataset = closed_dataset.filter(lambda x: x["pull_request"] is not None)
print("closed pull requests:\t{}".format(closed_pullRequests_dataset.num_rows))
## define helper functions
### format time
def formatTime(tstr, value, unit, trail):
    if tstr!="" or value!=0:
        tstr += "{}{}{}".format(int(value), unit, trail)
    return tstr
### turn seconds into time string
def secs2DHMS(secs):
    tstr=""
    days = secs // (24 * 3600)
    secs %= 24 * 3600
    tstr = formatTime(tstr, days, "d", " ")
    hours = secs // 3600
    secs %= 3600
    tstr = formatTime(tstr, hours, "h", ":")
    mins = secs // 60
    secs %= 60
    tstr = formatTime(tstr, mins, "m", ":")
    tstr = formatTime(tstr, secs, "s", "")
    return tstr
### get mean closing time (in seconds) of github issues or pull requests
def get_mean_closing_time(github_dataset):
    durations = []
    for item in github_dataset:
        start = str(item["created_at"])
        end = str(item["closed_at"])
        # https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
        start_seconds = time.mktime(datetime.datetime.strptime(start, "%Y-%m-%d %H:%M:%S").timetuple())
        end_seconds = time.mktime(datetime.datetime.strptime(end, "%Y-%m-%d %H:%M:%S").timetuple())
        durations.append(end_seconds - start_seconds)
    mean = sum(durations) / len(durations)
    return secs2DHMS(mean)
## produce output
print("average duration until closing for GitHub issues:\t\t{}".format(get_mean_closing_time(closed_issues_dataset)))
print("average duration until closing for GitHub pull requests:\t{}".format(
    get_mean_closing_time(closed_pullRequests_dataset)
))

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-5204adedd9e5bce1.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-de6ce268abc3869f.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-aeefa77fd0374a31.arrow


total items:		4387
closed items:		3802
closed issues:		1067
closed pull requests:	2735
average duration until closing for GitHub issues:		33d 22h:17m:51s
average duration until closing for GitHub pull requests:	5d 23h:52m:14s


Although we could proceed to further clean up the dataset by dropping or renaming some columns, it is generally a good practice to keep the dataset as "raw" as possible at this stage so that it can be easily used in multiple applications.

Before we push our dataset to the Hugging Face Hub, let's deal with one thing that's missing from it: the comments associated with each issue and pull request. We'll add them next with — you guessed it — the GitHub REST API!

### Augmenting the dataset

As shown in the following screenshot, the comments associated with an issue or pull request provide a rich source of information, especially if we're interested in building a search engine to answer user queries about the library.

<img style="float=center;" src="images/9d275b8a98797d3d66002f688bca85b0c3002d1f758352e76c79d27ec21d31f5.png">

The GitHub REST API provides a [`Comments` endpoint](https://docs.github.com/en/rest/reference/issues#list-issue-comments) that returns all the comments associated with an issue number. Let's test the endpoint to see what it returns:

In [14]:
issue_number = 2792
url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
response = requests.get(url, headers=headers)
response.json()

[{'url': 'https://api.github.com/repos/huggingface/datasets/issues/comments/897594128',
  'html_url': 'https://github.com/huggingface/datasets/pull/2792#issuecomment-897594128',
  'issue_url': 'https://api.github.com/repos/huggingface/datasets/issues/2792',
  'id': 897594128,
  'node_id': 'IC_kwDODunzps41gDMQ',
  'user': {'login': 'bhavitvyamalik',
   'id': 19718818,
   'node_id': 'MDQ6VXNlcjE5NzE4ODE4',
   'avatar_url': 'https://avatars.githubusercontent.com/u/19718818?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/bhavitvyamalik',
   'html_url': 'https://github.com/bhavitvyamalik',
   'followers_url': 'https://api.github.com/users/bhavitvyamalik/followers',
   'following_url': 'https://api.github.com/users/bhavitvyamalik/following{/other_user}',
   'gists_url': 'https://api.github.com/users/bhavitvyamalik/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/bhavitvyamalik/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/

We can see that the comment is stored in the `body` field, so let's write a simple function that returns all the comments associated with an issue by picking out the `body` contents for each element in `response.json()`:

In [15]:
def get_comments(issue_number):
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=headers)
    return [r["body"] for r in response.json()]
# Test our function works as expected
get_comments(2792)

["@albertvillanova my tests are failing here:\r\n```\r\ndataset_name = 'gooaq'\r\n\r\n    def test_load_dataset(self, dataset_name):\r\n        configs = self.dataset_tester.load_all_configs(dataset_name, is_local=True)[:1]\r\n>       self.dataset_tester.check_load_dataset(dataset_name, configs, is_local=True, use_local_dummy_data=True)\r\n\r\ntests/test_dataset_common.py:234: \r\n_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \r\ntests/test_dataset_common.py:187: in check_load_dataset\r\n    self.parent.assertTrue(len(dataset[split]) > 0)\r\nE   AssertionError: False is not true\r\n```\r\nWhen I try loading dataset on local machine it works fine. Any suggestions on how can I avoid this error?",
 'Thanks for the help, @albertvillanova! All tests are passing now.']

This looks good, so let's use `Dataset.map()` to add a new `comments` column to each issue in our dataset:

In [16]:
# Own edit: speed up the following processes by focussing on the first 300 issues
issues_dataset_300 = issues_dataset.select(range(300))
# Depending on your internet connection, this can take a few minutes...
issues_with_comments_dataset_300 = issues_dataset_300.map(lambda x: {"comments": get_comments(x["number"])})

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-8e204e7a8cb7643a.arrow


The final step is to save the augmented dataset alongside our raw data so we can push them both to the Hub:

In [17]:
issues_with_comments_dataset_300.to_json("data/issues-datasets-with-comments.jsonl")

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

1284903

### Uploading the dataset to the Hugging Face Hub

In [18]:
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/HaN6qCr_Afc" allowfullscreen></iframe>')

NameError: name 'HTML' is not defined

Now that we have our augmented dataset, it's time to push it to the Hub so we can share it with the community! To upload the dataset we'll use the [🤗 Hub library](https://github.com/huggingface/huggingface_hub), which allows us to interact with the Hugging Face Hub through a Python API. 🤗 Hub comes preinstalled with 🤗 Transformers, so we can use it directly. For example, we can use the `list_datasets()` function to get information about all the public datasets currently hosted on the Hub:

In [19]:
from huggingface_hub import list_datasets
all_datasets = list_datasets()
print(f"Number of datasets on Hub: {len(all_datasets)}")
print(all_datasets[0])

Number of datasets on Hub: 4929
Dataset Name: acronym_identification, Tags: ['arxiv:2010.14678', 'annotations_creators:expert-generated', 'language_creators:found', 'languages:en', 'licenses:mit', 'multilinguality:monolingual', 'size_categories:10K<n<100K', 'source_datasets:original', 'task_categories:token-classification', 'task_ids:token-classification-other-acronym-identification', 'pretty_name:Acronym Identification Dataset']


We can see that there are currently over 4,700 datasets on the Hub, and the `list_datasets()` function also provides some basic metadata about each dataset repository.

For our purposes, the first thing we need to do is create a new dataset repository on the Hub. To do that we need an authentication token, which can be obtained by first logging into the Hugging Face Hub with the `notebook_login()` function:

In [20]:
from huggingface_hub import notebook_login
notebook_login()

Login successful
Your token has been saved to /Users/matthias/.huggingface/token


This will create a widget where you can enter your username and password, and an API token will be saved in *~/.huggingface/token*. If you're running the code in a terminal, you can log in via the CLI instead:
```
huggingface-cli login
```
Once we've done this, we can create a new dataset repository with the `create_repo()` function:

In [21]:
from huggingface_hub import create_repo
#repo_url = create_repo(name="github-issues", repo_type="dataset") # 1 run either this (create dataset repo)
repo_url = "https://huggingface.co/datasets/mdroth/github-issues"  # 2 or run that (dataset repo url)
repo_url

'https://huggingface.co/datasets/mdroth/github-issues'

In this example, we've created an empty dataset repository called `github-issues` under the `mdroth` username (the username should be your Hub username when you're running this code!).

> ✏️ Try it out! <font color="darkgreen">Use your Hugging Face Hub username and password to obtain a token and create an empty repository called `github-issues`. Remember to **never save your credentials** in Colab or any other repository, as this information can be exploited by bad actors.</font>

In [22]:
# Trying it out
## data splits: test (20%), validation (16%), training (64%)
## current state: training = 100% => split off 16% for validation
## https://discuss.huggingface.co/t/how-to-split-main-dataset-into-train-dev-test-as-datasetdict/1090/9
from datasets import DatasetDict
train_validTest = issues_with_comments_dataset_300.train_test_split(shuffle=True, seed=42, test_size=0.36)
valid_test = train_validTest["test"].train_test_split(shuffle=True, seed=42, test_size=5/9)
issues_dataset_300 = DatasetDict({
    "train": train_validTest["train"],
    "valid": valid_test["train"],
    "test": valid_test["test"]
})
print(issues_dataset_300)
## pushing to hub
## https://discuss.huggingface.co/t/save-datasetdict-to-huggingface-hub/12075/4
issues_dataset_300.push_to_hub(repo_id="github_issues_300")

Loading cached split indices for dataset at /Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-6e9c1b47618f53de.arrow and /Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-d20ac262fb2b6bbe.arrow
Loading cached split indices for dataset at /Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-69796d564f495533.arrow and /Users/matthias/.cache/huggingface/datasets/json/default-7a2b59943d24c3c3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-d9656842970cf3a6.arrow
Pushing split train to the Hub.


DatasetDict({
    train: Dataset({
        features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
        num_rows: 192
    })
    valid: Dataset({
        features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
        num_rows: 48
    })
    test: Dataset({
        features: ['url', 'repo

The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing split valid to the Hub.
The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing split test to the Hub.
The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Next, let's clone the repository from the Hub to our local machine and copy our dataset file into it. 🤗 Hub provides a handy `Repository` class that wraps many of the common Git commands, so to clone the remote repository we simply need to provide the URL and local path we wish to clone to:

In [23]:
from huggingface_hub import Repository
repo = Repository(local_dir="github-issues", clone_from=repo_url)
repo.git_pull()
!cp data/issues-datasets-with-comments.jsonl github-issues/
repo_url

/Users/matthias/Desktop/Huggingface/Huggingface-course/github-issues is already a clone of https://huggingface.co/datasets/mdroth/github-issues. Make sure you pull the latest changes with `repo.git_pull()`.


'https://huggingface.co/datasets/mdroth/github-issues'

By default, various file extensions (such as *.bin*, *.gz*, and *.zip*) are tracked with Git LFS so that large files can be versioned within the same Git workflow. You can find a list of tracked file extensions inside the repository's *.gitattributes* file. To include the JSON Lines format in the list, we can run the following command:

In [24]:
repo.lfs_track("*.jsonl")

Then we can use `Repository.push_to_hub()` to push the dataset to the Hub:

In [25]:
repo.push_to_hub()

If we navigate to the URL contained in `repo_url`, we should now see that our dataset file has been uploaded.

<img style="float=center;" src="images/github-issues.png">

From here, anyone can download the dataset by simply providing `load_dataset()` with the repository ID as the `path` argument:

In [26]:
remote_dataset = load_dataset("mdroth/github-issues", split="train")
remote_dataset

Using custom data configuration mdroth--github-issues-e6b7052b14b3688c
Reusing dataset json (/Users/matthias/.cache/huggingface/datasets/json/mdroth--github-issues-e6b7052b14b3688c/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 300
})

Cool, we've pushed our dataset to the Hub and it's available for others to use! There's just one important thing left to do: adding a *dataset card* that explains how the corpus was created and provides other useful information for the community.
> <font color="darkgreen">💡 You can also upload a dataset to the Hugging Face Hub directly from the terminal by using `huggingface-cli` and a bit of Git magic. See the [🤗 Datasets guide](https://huggingface.co/docs/datasets/share.html#add-a-community-dataset) for details on how to do this.</font>

### Creating a dataset card

Well-documented datasets are more likely to be useful to others (including your future self!), as they provide the context to enable users to decide whether the dataset is relevant to their task and to evaluate any potential biases in or risks associated with using the dataset.

On the Hugging Face Hub, this information is stored in each dataset repository's *README.md* file. There are two main steps you should take before creating this file:

1. Use the [`datasets-tagging` application](https://huggingface.co/datasets/tagging/) to create metadata tags in YAML format. These tags are used for a variety of search features on the Hugging Face Hub and ensure your dataset can be easily found by members of the community. Since we have created a custom dataset here, you'll need to clone the `datasets-tagging` repository and run the application locally. Here's what the interface looks like:
<img style="float=center;" width="900" src="images/datasetCard.png">
1. Read the [🤗 Datasets guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md) on creating informative dataset cards and use it as a template.

You can create the *README.md* file directly on the Hub, and you can find a template dataset card in the `lewtun/github-issues` dataset repository. A screenshot of the filled-out dataset card is shown below.
<img style="float=center;" width="900" src="images/d25674d363f79ac314b3c34cad8606c402178f53b435b80c688f1bcf4563a45f.png">

> ✏️ Try it out! <font color="darkgreen">Use the `dataset-tagging` application and [🤗 Datasets guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md) to complete the *README.md* file for your GitHub issues dataset.</font>

In [27]:
# Trying it out
print("Procedure")
print('1. Read the above section on "Creating a dataset card".')
print('2. Create tags with the "HuggingFace Dataset Tagger" at {}.'.format(
    "https://huggingface.co/spaces/huggingface/datasets-tagging"
))
print('3. Create a README.md file / dataset card with the "card-creator" at {}.'.format(
    "https://huggingface.co/datasets/card-creator"
))
print('4. Upload the README.md file / dataset card to the model repo (in this case {}):'.format(
    "https://huggingface.co/datasets/mdroth/github_issues_300"
))
print('=> "Files and versions" > "Add file" > "Upload file" => Upload the README.md file')
print('\nSee {} for a guide and {} for an example.'.format(
    "https://huggingface.co/docs/datasets/v2.0.0/dataset_card",
    "https://huggingface.co/datasets/mdroth/github_issues_300"
))

Procedure
1. Read the above section on "Creating a dataset card".
2. Create tags with the "HuggingFace Dataset Tagger" at https://huggingface.co/spaces/huggingface/datasets-tagging.
3. Create a README.md file / dataset card with the "card-creator" at https://huggingface.co/datasets/card-creator.
4. Upload the README.md file / dataset card to the model repo (in this case https://huggingface.co/datasets/mdroth/github_issues_300):
=> "Files and versions" > "Add file" > "Upload file" => Upload the README.md file

See https://huggingface.co/docs/datasets/v2.0.0/dataset_card for a guide and https://huggingface.co/datasets/mdroth/github_issues_300 for an example.


That's it! We've seen in this section that creating a good dataset can be quite involved, but fortunately uploading it and sharing it with the community is not. In the next section we'll use our new dataset to create a semantic search engine with 🤗 Datasets that can match questions to the most relevant issues and comments.

> ✏️ Try it out! <font color="darkgreen">Go through the steps we took in this section to create a dataset of GitHub issues for your favorite open source library (pick something other than 🤗 Datasets, of course!). For bonus points, fine-tune a multilabel classifier to predict the tags present in the `labels` field.</font>

In [28]:
# Trying it out
import json
## load the file "transformers-issues.jsonl" that has been created by using 'fetch_issues(repo="transformers")' ...
## ... instead of 'fetch_issues()' just below the definition of the 'fetch_issues()' function further above
transformers_issues_dataset = load_dataset("json", data_files="data/transformers-issues.jsonl", split="train")
## add columns "text" and "num_labels"
print(transformers_issues_dataset)
transformers_issues_text_dataset_1 = transformers_issues_dataset.rename_column(
    original_column_name="repository_url", new_column_name="text"
)
transformers_issues_text_dataset_0 = transformers_issues_text_dataset_1.rename_column(
    original_column_name="labels_url", new_column_name="num_labels"
)
transformers_issues_text_dataset = transformers_issues_text_dataset_0.rename_column(
    original_column_name="comments_url", new_column_name="arr_labels"
)
del transformers_issues_dataset, transformers_issues_text_dataset_0, transformers_issues_text_dataset_1
## combine the columns "title", "comments", "reactions", and "body" into a single string in the new "text" column
feature_keys = ["title", "comments", "reactions", "body"]
reaction_keys = ["+1", "-1", "laugh", "hooray", "heart", "rocket", "eyes"]
def make_text(item):
    text = ""
    for fk_i in feature_keys:
        if fk_i=="reactions":
            text += f"\n\n{fk_i.upper()}"
            reactions = item[fk_i]
            reactions_json = json.loads(json.dumps(reactions, indent = 4))
            for rk_i in reaction_keys:
                rk_iCount = reactions_json[rk_i]
                text += f"\n{rk_i}: {rk_iCount}"
        else:
            text += f"\n\n{fk_i.upper()}\n{item[fk_i]}"
    item["text"] = text
    return item
transformers_issues_text_dataset = transformers_issues_text_dataset.map(make_text)
## build labels (list of strings)
def make_labels(item):
    labels = item["labels"]
    label_list = []
    for label in labels:
        label_json = json.loads(json.dumps(label, indent=4))
        label_name = label_json["name"]
        label_list.append(label_name)
    item["labels"] = label_list
    return item
transformers_issues_text_dataset = transformers_issues_text_dataset.map(make_labels)
## build num_labels (list of numbers) ...
## ... see also https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-a-list-of-lists
flat_label_list = [label for labels_i in transformers_issues_text_dataset["labels"] for label in labels_i]
unique_labels = list(set(flat_label_list))
n_unique_labels = len(unique_labels)
def make_num_labels(item):
    label_list = item["labels"]
    num_label_list = []
    for label in label_list:
        num_label = unique_labels.index(label)
        num_label_list.append(num_label)
    item["num_labels"] = num_label_list
    return item
transformers_issues_text_dataset = transformers_issues_text_dataset.map(make_num_labels)
## build arr_labels (full list of 0s and 1s)
def make_arr_labels(item):
    num_labels = item["num_labels"]
    arr_label_list = [0 for _ in range(n_unique_labels)]
    for num_label in num_labels:
        arr_label_list[num_label] = 1
    item["arr_labels"] = arr_label_list
    return item
transformers_issues_text_dataset = transformers_issues_text_dataset.map(make_arr_labels)
## filter for instances with labels!=[]
idx = 0
print(f'Before filtering\n=> Empty labels for index {idx}:\t\t{transformers_issues_text_dataset["labels"][idx]}')
transformers_issues_text_dataset = transformers_issues_text_dataset.filter(lambda x: x["labels"]!=[])
print(f'After filtering\n=> Non-empty labels for index {idx}:\t{transformers_issues_text_dataset["labels"][idx]}')
## remove all columns but "labels", "text", and "url"
keep_keys = ["text", "labels", "num_labels", "arr_labels", "url"]
remove_keys = [key for key in list(transformers_issues_text_dataset.features.keys()) if key not in keep_keys]
transformers_issues_text_dataset = transformers_issues_text_dataset.remove_columns(remove_keys)
## show dataset
print(f"\nDataset with num_labels:\n{transformers_issues_text_dataset}")
idx = 5 # 5 or 13
print(f'\n{3*"#"+" "}text{" "+71*"#"}{transformers_issues_text_dataset["text"][idx]}')
print(f'\n{3*"#"+" "}labels{" "+69*"#"}\n\n{transformers_issues_text_dataset["labels"][idx]}')
print(f'\n{3*"#"+" "}num_labels{" "+65*"#"}\n\n{transformers_issues_text_dataset["num_labels"][idx]}')
print(f'\n{3*"#"+" "}arr_labels{" "+65*"#"}\n\n{transformers_issues_text_dataset["arr_labels"][idx]}')
print(f'\n{3*"#"+" "}url{" "+72*"#"}\n\n{transformers_issues_text_dataset["url"][idx]}')
print(f'\n{80*"#"}\n')
## try to add class names
#
#https://discuss.huggingface.co/t/how-to-create-custom-classlabels/13650
#from datasets import ClassLabel
#features = transformers_issues_text_dataset.features.copy()
#print(features)
#features["arr_labels"] = ClassLabel(names=unique_labels)
#print(features)
#transformers_issues_text_dataset = transformers_issues_text_dataset.map(
#    lambda batch: batch, batched=False, features=features
#)
#
## DatasetDict: make splits, build, print, and push to hub
train_dev = transformers_issues_text_dataset.train_test_split(shuffle=True, seed=421, test_size=0.003)
train_validTest = train_dev["train"].train_test_split(shuffle=True, seed=42, test_size=0.36)
valid_test = train_validTest["test"].train_test_split(shuffle=True, seed=42, test_size=5/9)
transformers_issues_text_dataset = DatasetDict({
    "train": train_validTest["train"],
    "valid": valid_test["train"],
    "test": valid_test["test"],
    "dev": train_dev["test"]
})
print(transformers_issues_text_dataset)
transformers_issues_text_dataset.push_to_hub(repo_id="transformers_issues_labels")

Using custom data configuration default-141858a465b2fe0e
Reusing dataset json (/Users/matthias/.cache/huggingface/datasets/json/default-141858a465b2fe0e/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/json/default-141858a465b2fe0e/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-b05d5a409dd115cb.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/json/default-141858a465b2fe0e/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-5f7c0994dd04b4d9.arrow


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app'],
    num_rows: 10000
})


  0%|          | 0/10000 [00:00<?, ?ex/s]

  0%|          | 0/10000 [00:00<?, ?ex/s]

Before filtering
=> Empty labels for index 0:		[]


  0%|          | 0/10 [00:00<?, ?ba/s]

Pushing split train to the Hub.


After filtering
=> Non-empty labels for index 0:	['bug']

Dataset with num_labels:
Dataset({
    features: ['url', 'text', 'num_labels', 'arr_labels', 'labels'],
    num_rows: 1385
})

### text #######################################################################

TITLE
[Kernel Fusion] Training benchmarks of Torchdynamo + AOTAutograd (many models)

COMMENTS
5

REACTIONS
+1: 0
-1: 0
laugh: 0
hooray: 0
heart: 1
rocket: 0
eyes: 0

BODY
Note to maintainers: We are using this PR to collaborate and there is no intention yet to merge anything, so please ignore unless you want to experiment with the latest auto-speedups.

## What was the issue with the previous AOTAutograd integration?
So, there was some investigation into applying AOTAutograd a couple months ago in this PR (https://github.com/huggingface/transformers/pull/15264). Although the performance results were quite promising, @stas00 and I found one major blocker - the potential for incorrect semantics. AOTAutograd is a tracing-b

The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing split valid to the Hub.
The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing split test to the Hub.
The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing split dev to the Hub.
The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

In [179]:
# 5
checkpoint = "bert-base-cased"
#tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer = BertTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
transformers_datasets = load_dataset("mdroth/transformers_issues_labels")
transformers_dev = transformers_datasets["valid"]
def transformers_tokenize_function(item):                         # tokenization function for .map method
    return tokenizer(item["text"], truncation=True)
tokenized_transformers_datasets = transformers_datasets.map(transformers_tokenize_function, batched=True)
transformers_dev_samples = tokenized_transformers_datasets["dev"][:3]
# keys: 'url', 'text', 'num_labels', 'arr_labels', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'
# keep: 'input_ids', 'token_type_ids', 'attention_mask', 'labels'
drop_list = ["url", "text", "num_labels", "labels", "labels", "num_labels", "num_labels", "num_labels"]
transformers_dev_purged = {k: v for k, v in transformers_dev_samples.items() if k not in drop_list}
print(transformers_dev_purged.keys())
transformers_dev_batch = data_collator(transformers_dev_purged)
{k: v.shape for k, v in transformers_dev_batch.items()}

loading file https://huggingface.co/bert-base-cased/resolve/main/vocab.txt from cache at /Users/matthias/.cache/huggingface/transformers/6508e60ab3c1200bffa26c95f4b58ac6b6d95fba4db1f195f632fa3cd7bc64cc.437aa611e89f6fc6675a049d2b5545390adbc617e7d655286421c191d2be2791
loading file https://huggingface.co/bert-base-cased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-base-cased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-base-cased/resolve/main/tokenizer_config.json from cache at /Users/matthias/.cache/huggingface/transformers/ec84e86ee39bfe112543192cf981deebf7e6cbe8c91b8f7f8f63c9be44366158.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /Users/matthias/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388

  0%|          | 0/4 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

dict_keys(['arr_labels', 'input_ids', 'token_type_ids', 'attention_mask'])


{'arr_labels': torch.Size([3, 57]),
 'input_ids': torch.Size([3, 512]),
 'token_type_ids': torch.Size([3, 512]),
 'attention_mask': torch.Size([3, 512])}

In [171]:
# 3
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
sst2_raw_datasets = load_dataset("glue", "sst2")
sst2_train = sst2_raw_datasets["train"]
def sst2_tokenize_function(item):                         # tokenization function for .map method
    return tokenizer(item["sentence"], truncation=True)
tokenized_sst2_datasets = sst2_raw_datasets.map(sst2_tokenize_function, batched=True) # batch-tokenize all datasets
sst2_train_samples = tokenized_sst2_datasets["train"][:3] # get first 3 tokenized samples of the training set
sst2_train_purged = {k: v for k, v in sst2_train_samples.items() if k not in ["idx", "sentence"]}
# keep: input_ids', 'token_type_ids', 'attention_mask', 'labels'
sst2_train_batch = data_collator(sst2_train_purged)       # use data_collator to turn samples into a batch
{k: v.shape for k, v in sst2_train_batch.items()}

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /Users/matthias/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.17.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/bert-base-uncased/

  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-4c0157f618b9067d.arrow


  0%|          | 0/1 [00:00<?, ?ba/s]

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-50c7d36b9e5220b0.arrow


{'input_ids': torch.Size([3, 15]),
 'token_type_ids': torch.Size([3, 15]),
 'attention_mask': torch.Size([3, 15]),
 'labels': torch.Size([3])}

Still trying it out...<br>
Dataset complete - model, training, and inference next.

In [155]:
from transformers import BertTokenizer, BertForSequenceClassification, AutoTokenizer, AutoModelForSequenceClassification
from transformers import DefaultDataCollator, TrainingArguments, Trainer, DataCollatorWithPadding
import torch
num_labels = len(unique_labels)
# instantiate model
#tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
#model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=num_labels)
checkpoint = "bert-base-cased"
transformers_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
transformers_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
# produce model outputs (https://huggingface.co/docs/transformers/main_classes/output)
inputs = transformers_tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = transformers_model(**inputs, labels=labels)
# find examples on how to:
# (check previous chapters)
# > turn logits to predictions
# > turn logits + labels to loss
# > use loss for optimization
# > ...?
print(outputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /Users/matthias/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388a3ef3a50de93335926ccfa20610217db589307
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.17.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file https://huggingface.co/bert-base-cased/resolv

SequenceClassifierOutput(loss=tensor(3.8668, grad_fn=<NllLossBackward>), logits=tensor([[-0.1130,  0.2962, -0.7920, -0.4224, -0.5464, -0.1320, -0.4443,  0.0040,
         -0.3799, -0.5812, -0.3424,  0.1943,  0.6863,  0.4740,  0.4720,  0.1674,
         -0.3009,  0.5291, -0.3000, -0.2796,  0.5407, -0.1862, -0.1723, -0.0596,
         -0.0424,  0.2617, -0.5475, -0.0855, -0.5546,  0.5824,  0.3372,  0.0988,
          0.1305,  0.0407, -0.4151, -0.2011, -0.2987, -0.8742,  0.3055, -0.0754,
         -0.5178, -1.0479,  0.5191,  1.4233,  0.7420,  0.6309, -0.3406,  0.0797,
         -0.2000,  0.0123,  0.2187,  1.2506, -0.5081,  0.6031, -0.3315,  0.1463,
         -0.1579]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
tensor([[0.0139, 0.0209, 0.0070, 0.0102, 0.0090, 0.0136, 0.0100, 0.0156, 0.0106,
         0.0087, 0.0110, 0.0189, 0.0309, 0.0250, 0.0249, 0.0184, 0.0115, 0.0264,
         0.0115, 0.0118, 0.0267, 0.0129, 0.0131, 0.0147, 0.0149, 0.0202, 0.0090,
         0.0143, 0.0089, 0.

In [156]:
# inspect each line!
## load dataset, tokenize, and apply datacollator
transformers_dataset = load_dataset("mdroth/transformers_issues_labels")
#checkpoint = "bert-base-cased"
#transformers_tokenizer = BertTokenizer.from_pretrained(checkpoint)
def transformers_tokenize_function(item):                         # tokenization function for .map method
    return transformers_tokenizer(item["text"], padding=True, truncation=True)
transformers_tokenized_datasets = transformers_dataset.map(transformers_tokenize_function, batched=True)
# 'arr_labels', 'labels'
# 'labels' -> "int_labels"
# 'arr_labels' -> "labels"
transformers_tokenized_datasets = transformers_tokenized_datasets.rename_column("labels", "int_labels")
transformers_tokenized_datasets = transformers_tokenized_datasets.rename_column("arr_labels", "labels")
transformers_tokenized_datasets

Using custom data configuration mdroth--transformers_issues_labels-e1a55ed64424aafd
Reusing dataset parquet (/Users/matthias/.cache/huggingface/datasets/parquet/mdroth--transformers_issues_labels-e1a55ed64424aafd/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901)


  0%|          | 0/4 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

DatasetDict({
    valid: Dataset({
        features: ['url', 'text', 'num_labels', 'labels', 'int_labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 220
    })
    dev: Dataset({
        features: ['url', 'text', 'num_labels', 'labels', 'int_labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5
    })
    test: Dataset({
        features: ['url', 'text', 'num_labels', 'labels', 'int_labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 277
    })
    train: Dataset({
        features: ['url', 'text', 'num_labels', 'labels', 'int_labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 883
    })
})

In [9]:
# inspect each line!
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer
from datasets import load_dataset
import torch
## load dataset, tokenize, adapt columns, and apply datacollator
checkpoint = "bert-base-cased"
transformers_tokenizer = BertTokenizer.from_pretrained(checkpoint)
def transformers_tokenize_function(item):
    return transformers_tokenizer(item["text"], padding=True, truncation=True)
transformers_tokenized_datasets = (
    load_dataset("mdroth/transformers_issues_labels")
    .map(transformers_tokenize_function, batched=True)
    .remove_columns(column_names=["url", "text", "num_labels", "labels"])
    .rename_column("arr_labels", "labels") # https://discuss.huggingface.co/t/why-am-i-getting-keyerror-loss/6948
)
transformers_data_collator = DataCollatorWithPadding(tokenizer=transformers_tokenizer)
## training arguments
training_args = TrainingArguments(
    # https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments
    "5_try_transformers_dataset",
    evaluation_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4
)
## model
num_labels = 57#len(unique_labels) # =57
transformers_model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
## compute_metrics
## trainer
trainer = Trainer(
    transformers_model,
    training_args,
    train_dataset=transformers_tokenized_datasets["dev"],
    eval_dataset=transformers_tokenized_datasets["dev"],
    data_collator=transformers_data_collator,
    tokenizer=transformers_tokenizer,
    #preprocess_logits_for_metrics=lambda x: torch.reshape(x, (-1, 57))
)
## train
trainer.train()

loading file https://huggingface.co/bert-base-cased/resolve/main/vocab.txt from cache at /Users/matthias/.cache/huggingface/transformers/6508e60ab3c1200bffa26c95f4b58ac6b6d95fba4db1f195f632fa3cd7bc64cc.437aa611e89f6fc6675a049d2b5545390adbc617e7d655286421c191d2be2791
loading file https://huggingface.co/bert-base-cased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-base-cased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-base-cased/resolve/main/tokenizer_config.json from cache at /Users/matthias/.cache/huggingface/transformers/ec84e86ee39bfe112543192cf981deebf7e6cbe8c91b8f7f8f63c9be44366158.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /Users/matthias/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388

  0%|          | 0/4 [00:00<?, ?it/s]

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/parquet/mdroth--transformers_issues_labels-e1a55ed64424aafd/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901/cache-6abb82432ea4f160.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/parquet/mdroth--transformers_issues_labels-e1a55ed64424aafd/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901/cache-71378fd7107bbfbb.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/parquet/mdroth--transformers_issues_labels-e1a55ed64424aafd/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901/cache-697e1bed1cdd157e.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/parquet/mdroth--transformers_issues_labels-e1a55ed64424aafd/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901/cache-eb557e2d389372ac.arrow
PyTorch: setting up devices
The default 

ValueError: Expected input batch_size (4) to match target batch_size (228).

In [3]:
import torch
torch.__version__

'1.8.1'

In [8]:
def preprocess_logits_for_metrics(logits):
    x = torch.reshape(logits, (-1, 57))
    return x
logits = torch.randn(228)
preprocess_logits_for_metrics(logits).size(), logits.size()

torch.Size([4, 57])


(torch.Size([4, 57]), torch.Size([228]))

In [190]:
# inspect each line!
## load dataset, tokenize, and apply datacollator
transformers_dataset = load_dataset("mdroth/transformers_issues_labels")
checkpoint = "bert-base-cased"
transformers_tokenizer = BertTokenizer.from_pretrained(checkpoint)
def transformers_tokenize_function(item):
    return transformers_tokenizer(item["text"], padding=True, truncation=True)
transformers_tokenized_datasets = transformers_dataset.map(transformers_tokenize_function, batched=True)
remove_columns = ["url", "text", "num_labels", "labels"]
transformers_tokenized_datasets = transformers_tokenized_datasets.remove_columns(column_names=remove_columns)
# https://discuss.huggingface.co/t/why-am-i-getting-keyerror-loss/6948
transformers_tokenized_datasets = transformers_tokenized_datasets.rename_column("arr_labels", "labels")
transformers_data_collator = DataCollatorWithPadding(tokenizer=transformers_tokenizer)
## training arguments
training_args = TrainingArguments(
    # https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments
    "5_try_transformers_dataset",
    evaluation_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4
)
## model
num_labels = len(unique_labels)
transformers_model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
## compute_metrics
# comput_metrics = ?
## trainer
trainer = Trainer(
    transformers_model,
    training_args,
    train_dataset=transformers_tokenized_datasets["dev"], # adapt
    eval_dataset=transformers_tokenized_datasets["dev"],  # adapt
    data_collator=transformers_data_collator,
    tokenizer=transformers_tokenizer
    # comput_metrics = ?
)
## train
trainer.train()

Using custom data configuration mdroth--transformers_issues_labels-e1a55ed64424aafd
Reusing dataset parquet (/Users/matthias/.cache/huggingface/datasets/parquet/mdroth--transformers_issues_labels-e1a55ed64424aafd/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901)


  0%|          | 0/4 [00:00<?, ?it/s]

loading file https://huggingface.co/bert-base-cased/resolve/main/vocab.txt from cache at /Users/matthias/.cache/huggingface/transformers/6508e60ab3c1200bffa26c95f4b58ac6b6d95fba4db1f195f632fa3cd7bc64cc.437aa611e89f6fc6675a049d2b5545390adbc617e7d655286421c191d2be2791
loading file https://huggingface.co/bert-base-cased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-base-cased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-base-cased/resolve/main/tokenizer_config.json from cache at /Users/matthias/.cache/huggingface/transformers/ec84e86ee39bfe112543192cf981deebf7e6cbe8c91b8f7f8f63c9be44366158.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /Users/matthias/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388

DatasetDict({
    valid: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 220
    })
    dev: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 277
    })
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 883
    })
})


loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /Users/matthias/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388a3ef3a50de93335926ccfa20610217db589307
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18",
    "19": "LABEL_19",
    "20": "LABEL_20",
    "

ValueError: Expected input batch_size (4) to match target batch_size (228).

In [145]:
transformers_data_collator
#--> 724                 raise ValueError(
#    725                     "Unable to create tensor, you should probably activate truncation and/or padding "
#    726                     "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
#
#ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

DataCollatorWithPadding(tokenizer=PreTrainedTokenizer(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')

In [158]:
remove_cols = ["url", "num_labels", "labels"]
transformers_tokenized_datasets_rem = transformers_tokenized_datasets.remove_columns(remove_cols)
transformers_tokenized_datasets_rem = transformers_tokenized_datasets_rem.rename_column("arr_labels", "labels")
dataset_0 = transformers_tokenized_datasets_rem["dev"] # "dev", "valid", "test", "train"
print(dataset_0)
for i in range(len(dataset_0)):
    i_len = len(dataset_0["input_ids"][i])
    if i_len<512:
        print(i)
    print(dataset_0["input_ids"][i])
    print()

Dataset({
    features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 5
})
[101, 157, 12150, 17516, 11359, 7249, 1104, 169, 10454, 168, 22559, 1116, 168, 1106, 168, 5101, 169, 1107, 2393, 7488, 22559, 17260, 119, 18732, 25290, 11680, 11365, 127, 155, 12420, 16647, 24805, 1708, 116, 122, 131, 121, 118, 122, 131, 121, 4046, 131, 121, 16358, 6533, 1183, 131, 121, 1762, 131, 121, 8964, 131, 121, 1257, 131, 121, 139, 15609, 3663, 8790, 117, 1103, 2393, 7488, 22559, 17260, 24935, 1103, 169, 10454, 168, 22559, 1116, 168, 1106, 168, 5101, 169, 3053, 131, 18630, 131, 120, 120, 176, 7088, 10354, 119, 3254, 120, 19558, 10931, 120, 11303, 1468, 120, 171, 2858, 1830, 120, 171, 1161, 1568, 1181, 11049, 2087, 18202, 19203, 2087, 1568, 1181, 1830, 1568, 1162, 1568, 1527, 1161, 18910, 1665, 13976, 1830, 1475, 2087, 1571, 1559, 2093, 1568, 1665, 1580, 1475, 1665, 15292, 1477, 120, 188, 19878, 120, 11303, 1468, 120, 3584, 120, 2393, 7488, 120, 22559, 2734, 168, 2393, 

In [159]:
from torch.utils.data import DataLoader
dataloader = DataLoader(
    transformers_tokenized_datasets_rem["dev"],
    shuffle=True,
    batch_size=4,
    collate_fn=transformers_data_collator
)
dataloader

<torch.utils.data.dataloader.DataLoader at 0x7fe0a85dfb20>

In [160]:
for batch in dataloader:
    print(batch)
    break
{k: v.shape for k, v in batch.items()}

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

In [None]:
dev_dataloader = DataLoader(tokenized_datasets["dev"], shuffle=True, batch_size=8, collate_fn=data_collator)


In [82]:
#transformers_data_collator = DataCollatorWithPadding(tokenizer=transformers_tokenizer)
transformers_data_collator
for batch in transformers_data_collator:
    break
batch

TypeError: 'DataCollatorWithPadding' object is not iterable

In [78]:
split = transformers_tokenized_datasets["valid"]
print(split)
for i in range(split.num_rows):
    print(len(split["input_ids"][i]), len(split["token_type_ids"][i]), len(split["attention_mask"][i]))
    #print(split["input_ids"][i][-10:])

Dataset({
    features: ['url', 'text', 'num_labels', 'arr_labels', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 220
})
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512
512 512 512

In [109]:
model.config.id2label

{0: 'LABEL_0',
 1: 'LABEL_1',
 2: 'LABEL_2',
 3: 'LABEL_3',
 4: 'LABEL_4',
 5: 'LABEL_5',
 6: 'LABEL_6',
 7: 'LABEL_7',
 8: 'LABEL_8',
 9: 'LABEL_9',
 10: 'LABEL_10',
 11: 'LABEL_11',
 12: 'LABEL_12',
 13: 'LABEL_13',
 14: 'LABEL_14',
 15: 'LABEL_15',
 16: 'LABEL_16',
 17: 'LABEL_17',
 18: 'LABEL_18',
 19: 'LABEL_19',
 20: 'LABEL_20',
 21: 'LABEL_21',
 22: 'LABEL_22',
 23: 'LABEL_23',
 24: 'LABEL_24',
 25: 'LABEL_25',
 26: 'LABEL_26',
 27: 'LABEL_27',
 28: 'LABEL_28',
 29: 'LABEL_29',
 30: 'LABEL_30',
 31: 'LABEL_31',
 32: 'LABEL_32',
 33: 'LABEL_33',
 34: 'LABEL_34',
 35: 'LABEL_35',
 36: 'LABEL_36',
 37: 'LABEL_37',
 38: 'LABEL_38',
 39: 'LABEL_39',
 40: 'LABEL_40',
 41: 'LABEL_41',
 42: 'LABEL_42',
 43: 'LABEL_43',
 44: 'LABEL_44',
 45: 'LABEL_45',
 46: 'LABEL_46',
 47: 'LABEL_47',
 48: 'LABEL_48',
 49: 'LABEL_49',
 50: 'LABEL_50',
 51: 'LABEL_51',
 52: 'LABEL_52',
 53: 'LABEL_53',
 54: 'LABEL_54',
 55: 'LABEL_55',
 56: 'LABEL_56'}

In [148]:
num_labels = len(unique_labels)
# instantiate model
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=num_labels)
print(model)
# produce model outputs (https://huggingface.co/docs/transformers/main_classes/output)


Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [53]:
# https://huggingface.co/docs/transformers/main_classes/output#transformers.modeling_outputs.MultipleChoiceModelOutput
# https://huggingface.co/docs/transformers/main_classes/output#transformers.modeling_outputs.SequenceClassifierOutput
# https://huggingface.co/docs/transformers/main_classes/configuration
# https://huggingface.co/docs/transformers/tasks/sequence_classification
# https://huggingface.co/docs/transformers/tasks/multiple_choice
# https://huggingface.co/docs/transformers/training

# look up classification in chapter 3
# https://huggingface.co/transformers/v4.1.1/notebooks.html
# https://huggingface.co/docs/transformers/main_classes/output (=> logits)
# https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb#scrollTo=bRDimbGwQFrB
# https://discuss.huggingface.co/t/fine-tune-for-multiclass-or-multilabel-multiclass/4035/20

'4.17.0'

In [None]:
train_validTest = transformers_issues_text_dataset.train_test_split(shuffle=True, seed=42, test_size=0.36)
valid_test = train_validTest["test"].train_test_split(shuffle=True, seed=42, test_size=5/9)

transformers_issues_text_dataset = DatasetDict({
    "train": train_validTest["train"],
    "valid": valid_test["train"],
    "test": valid_test["test"]
})
print(transformers_issues_text_dataset)
transformers_issues_text_dataset.push_to_hub(repo_id="transformers_issues_labels")

In [90]:
# Trying it out

#####################
#                   #
#  copy and polish  #
#                   #
#####################

## load the file "transformers-issues.jsonl" that has been created by using 'fetch_issues(repo="transformers")' ...
## ... instead of 'fetch_issues()' just below the definition of the 'fetch_issues()' function further above
transformers_issues_dataset = load_dataset("json", data_files="data/transformers-issues.jsonl", split="train")
## add "text" column
transformers_issues_text_dataset = transformers_issues_dataset.rename_column(
    original_column_name="url", new_column_name="text"
)
## combine the columns "title", "comments", "reactions", and "body" into a single string and store that string ...
## ... in the new "text" column
feature_keys = ["title", "comments", "reactions", "body"]
reaction_keys = ["+1", "-1", "laugh", "hooray", "heart", "rocket", "eyes"]
def make_text(item):
    text = ""
    for fk_i in feature_keys:
        if fk_i=="reactions":
            text += f"\n\n{fk_i.upper()}"
            reactions = item[fk_i]
            reactions_json = json.loads(json.dumps(reactions, indent = 4))
            for rk_i in reaction_keys:
                rk_iCount = reactions_json[rk_i]
                text += f"\n{rk_i}: {rk_iCount}"
        else:
            text += f"\n\n{fk_i.upper()}\n{item[fk_i]}"
    item["text"] = text
    return item
transformers_issues_text_dataset = transformers_issues_dataset.map(make_text)
########################
## define and apply make_labels => for each instance "labels" holds a list of label names ...
## ... (e.g., ["benchmark", "performance"]) for the labels of the instance with index 13
def make_labels(item):
    labels = item["labels"]
    label_list = []
    #
    for label in labels:
        label_json = json.loads(json.dumps(label, indent = 4))
        label_name = label_json["name"]
        label_list.append(label_name)
    #
    item["labels"] = label_list
    return item
transformers_issues_text_dataset = transformers_issues_text_dataset.map(make_labels)
########################

# optionally filter for instances with labels!=[]
# make splits and build DatasetDict

## remove all columns but "labels", "text", and "url"
keep_keys = ["text", "labels", "url"]
remove_keys = [key for key in list(transformers_issues_text_dataset.features.keys()) if key not in keep_keys]
transformers_issues_text_dataset = transformers_issues_text_dataset.remove_columns(remove_keys)
print(transformers_issues_text_dataset)
idx = 1 # 13
print(f'\ntext:{transformers_issues_text_dataset["text"][idx]}')
print(f'\nlabels:\n{transformers_issues_text_dataset["labels"][idx]}')
print(f'\nurl:\n{transformers_issues_text_dataset["url"][idx]}')
########
# [1] https://discuss.huggingface.co/t/how-to-add-a-new-column-to-a-dataset/6453
# [2] https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes
# [3] https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.add_column
# [4] https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.remove_columns
# [5] https://huggingface.co/docs/datasets/v2.2.1/en/package_reference/main_classes#datasets.Dataset.map

Using custom data configuration default-141858a465b2fe0e
Reusing dataset json (/Users/matthias/.cache/huggingface/datasets/json/default-141858a465b2fe0e/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/json/default-141858a465b2fe0e/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-1ad8b64b4243b1d6.arrow
Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/json/default-141858a465b2fe0e/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-c55a828df18dcd38.arrow


Dataset({
    features: ['url', 'labels', 'text'],
    num_rows: 10000
})

text:

TITLE
Fixed incorrect error message on missing weight file.

COMMENTS
1

REACTIONS
+1: 0
-1: 0
laugh: 0
hooray: 0
heart: 0
rocket: 0
eyes: 0

BODY
# What does this PR do?
I just started using Hugging Face Transformers for the first time, and encountered this error.

    OSError: Error no file named pytorch_model.bin found in directory (...) but there is a file for Flax weights. Use `from_flax=True` to load this model from those weights.

Indeed, I forgot to download `pytorch_model.bin`, but the model I tried to use was not using Flax, so I dug a little bit to see which file was the library looking for.

For me it seems that there was a simple mistake...

## Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#sta

**Restart the kernel in preparation for the next section.**

In [None]:
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

## [Semantic search with FAISS](https://huggingface.co/course/chapter5/6?fw=pt)

In [section 5](https://huggingface.co/course/chapter5/5), we created a dataset of GitHub issues and comments from the 🤗 Datasets repository. In this section we'll use this information to build a search engine that can help us find answers to our most pressing questions about the library!

In [1]:
from IPython.display import HTML
HTML('<iframe width="640" height="360" src="https://www.youtube.com/embed/OATCgQtNX2o" allowfullscreen></iframe>')



### Using embeddings for semantic search
As we saw in [Chapter 1](https://huggingface.co/course/chapter1), Transformer-based language models represent each token in a span of text as an *embedding vector*. It turns out that one can "pool" the individual embeddings to create a vector representation for whole sentences, paragraphs, or (in some cases) documents. These embeddings can then be used to find similar documents in the corpus by computing the dot-product similarity (or some other similarity metric) between each embedding and returning the documents with the greatest overlap.

In this section we'll use embeddings to develop a semantic search engine. These search engines offer several advantages over conventional approaches that are based on matching keywords in a query with the documents.
<img style="float=center;" width="900" src="images/semantic-search.svg">
### Loading and preparing the dataset
The first thing we need to do is download our dataset of GitHub issues, so let's use the 🤗 Hub library to resolve the URL where our file is stored on the Hugging Face Hub:

In [2]:
from huggingface_hub import hf_hub_url
data_files = hf_hub_url(
    repo_id="lewtun/github-issues",
    filename="datasets-issues-with-comments.jsonl",
    repo_type="dataset"
)

With the URL stored in data_files, we can then load the remote dataset using the method introduced in [section 2](https://huggingface.co/course/chapter5/2):

In [3]:
from datasets import load_dataset
issues_dataset = load_dataset("json", data_files=data_files, split="train")
issues_dataset

Using custom data configuration default-6a579f365d89f2f1
Reusing dataset json (/Users/matthias/.cache/huggingface/datasets/json/default-6a579f365d89f2f1/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 3019
})

Here we've specified the default train split in `load_dataset()`, so it returns a `Dataset` instead of a `DatasetDict`. The first order of business is to filter out the pull requests, as these tend to be rarely used for answering user queries and will introduce noise in our search engine. As should be familiar by now, we can use the `Dataset.filter()` function to exclude these rows in our dataset. While we're at it, let's also filter out rows with no comments, since these provide no answers to user queries:

In [4]:
issues_dataset = issues_dataset.filter(lambda x: (x["is_pull_request"]==False and len(x["comments"]) > 0))
issues_dataset

Loading cached processed dataset at /Users/matthias/.cache/huggingface/datasets/json/default-6a579f365d89f2f1/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-7adeb9322eeeff4e.arrow


Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'timeline_url', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 808
})

We can see that there are a lot of columns in our dataset, most of which we don't need to build our search engine. From a search perspective, the most informative columns are `title`, `body`, and `comments`, while `html_url` provides us with a link back to the source issue. Let's use the `Dataset.remove_columns()` function to drop the rest:

In [5]:
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 808
})

To create our embeddings we'll augment each comment with the issue's title and body, since these fields often include useful contextual information. Because our comments column is currently a list of comments for each issue, we need to "explode" the column so that each row consists of an (`html_url`, `title`, `body`, `comment`) tuple. In Pandas, we can do this with the [`DataFrame.explode()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html), which creates a new row for each element in a list-like column, while replicating all the other column values. To see this in action, let's first switch to the Pandas `DataFrame` format:

In [6]:
issues_dataset.set_format("pandas")
df = issues_dataset[:]
df

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,Protect master branch,"[Cool, I think we can do both :), @lhoestq now...",After accidental merge commit (91c55355b634d0d...
1,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,[Hi ! I guess the caching mechanism should hav...,## Describe the bug\r\nAfter upgrading to data...
2,https://github.com/huggingface/datasets/issues...,OSCAR unshuffled_original_ko: NonMatchingSplit...,[I tried `unshuffled_original_da` and it is al...,## Describe the bug\r\n\r\nCannot download OSC...
3,https://github.com/huggingface/datasets/issues...,load_dataset using default cache on Windows ca...,"[Hi @daqieq, thanks for reporting.\r\n\r\nUnfo...",## Describe the bug\r\nStandard process to dow...
4,https://github.com/huggingface/datasets/issues...,to_tf_dataset keeps a reference to the open da...,"[I did some investigation and, as it seems, th...",To reproduce:\r\n```python\r\nimport datasets ...
...,...,...,...,...
803,https://github.com/huggingface/datasets/issues/6,Error when citation is not given in the Datase...,[Yes looks good to me.\r\nNote that we may ref...,The following error is raised when the `citati...
804,https://github.com/huggingface/datasets/issues/5,ValueError when a split is empty,[To fix this I propose to modify only the file...,"When a split is empty either TEST, VALIDATION ..."
805,https://github.com/huggingface/datasets/issues/4,[Feature] Keep the list of labels of a dataset...,[Yes! I see mostly two options for this:\r\n- ...,It would be useful to keep the list of the lab...
806,https://github.com/huggingface/datasets/issues/3,[Feature] More dataset outputs,[Yes!\r\n- pandas will be a one-liner in `arro...,Add the following dataset outputs:\r\n\r\n- Sp...


If we inspect the first row in this `DataFrame` we can see there are two comments associated with this issue:

In [7]:
print(df["comments"][0].tolist())
len(df["comments"][0].tolist())

['Cool, I think we can do both :)', '@lhoestq now the 2 are implemented.\r\n\r\nPlease note that for the the second protection, finally I have chosen to protect the master branch only from **merge commits** (see update comment above), so no need to disable/re-enable the protection on each release (direct commits, different from merge commits, can be pushed to the remote master branch; and eventually reverted without messing up the repo history).']


2

When we explode `df`, we expect to get one row for each of these comments. Let's check if that's the case:

In [8]:
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(9)

Unnamed: 0,html_url,title,comments,body
0,https://github.com/huggingface/datasets/issues...,Protect master branch,"Cool, I think we can do both :)",After accidental merge commit (91c55355b634d0d...
1,https://github.com/huggingface/datasets/issues...,Protect master branch,@lhoestq now the 2 are implemented.\r\n\r\nPle...,After accidental merge commit (91c55355b634d0d...
2,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Hi ! I guess the caching mechanism should have...,## Describe the bug\r\nAfter upgrading to data...
3,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,"If it's easy enough to implement, then yes ple...",## Describe the bug\r\nAfter upgrading to data...
4,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Well it can cause issue with anyone that updat...,## Describe the bug\r\nAfter upgrading to data...
5,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,"I just merged a fix, let me know if you're sti...",## Describe the bug\r\nAfter upgrading to data...
6,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Definitely works on several manual cases with ...,## Describe the bug\r\nAfter upgrading to data...
7,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Fixed by #2947.,## Describe the bug\r\nAfter upgrading to data...
8,https://github.com/huggingface/datasets/issues...,OSCAR unshuffled_original_ko: NonMatchingSplit...,I tried `unshuffled_original_da` and it is als...,## Describe the bug\r\n\r\nCannot download OSC...


Great, we can see the rows have been replicated, with the `comments` column containing the individual comments! Now that we're finished with Pandas, we can quickly switch back to a `Dataset` by loading the `DataFrame` in memory:

In [9]:
from datasets import Dataset
comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2964
})

Okay, this has given us a few thousand comments to work with!
> ✏️ Try it out! <font color="darkgreen">See if you can use `Dataset.map()` to explode the `comments` column of `issues_dataset` *without* resorting to the use of Pandas. This is a little tricky; you might find the ["Batch mapping"](https://huggingface.co/docs/datasets/v2.0.0/about_map_batch?batch-mapping#batch-mapping) section of the 🤗 Datasets documentation useful for this task.</font>

In [10]:
# Trying it out
def explode_comments(items):
    # initialize empty arrays "html_url_arr = []", "title_arr = []", "comments_arr = []", and "body_arr = []"
    # loop over items
    # for each item:
    ## get the values for "html_url", "title", and "body"
    ## get "n_i" = the number of copies that need to be made (length of this item's "comments_arr" array)
    ## loop index "ii" over "range(n_i)"
    ### in step ii of the loop, get the ii-th value (= i-th comment) of the current item's "comments_arr" array
    ### append all current values (html_url, title, comment, body) to their corresponding arrays
    # build the dictionary and return it
    html_url_arr = []
    title_arr = []
    comments_arr = [] # only nested array
    body_arr = []
    n_items = len(items["html_url"])
    for i in range(n_items):
        html_url = items["html_url"][i]
        title = items["title"][i]
        body = items["body"][i]
        n_i = len(items["comments"][i])
        for ii in range(n_i):
            comment = items["comments"][i][ii]
            html_url_arr.append(html_url)
            title_arr.append(title)
            comments_arr.append(comment)
            body_arr.append(body)
    return {"html_url": html_url_arr, "title": title_arr, "comments": comments_arr, "body": body_arr}
# https://huggingface.co/docs/datasets/v2.0.0/about_map_batch?batch-mapping#batch-mapping
try_issues_dataset_exploded_comments = issues_dataset.map(explode_comments, batched=True)
# https://discuss.huggingface.co/t/how-do-you-rename-a-column-in-a-dataset/15121
try_issues_dataset_exploded_comments = try_issues_dataset_exploded_comments.rename_column("comments", "comment")
print("try_issues_dataset_exploded_comments")
try_issues_dataset_exploded_comments.set_format("pandas")
try_issues_dataset_exploded_comments[:]

  0%|          | 0/1 [00:00<?, ?ba/s]

try_issues_dataset_exploded_comments


Unnamed: 0,html_url,title,comment,body
0,https://github.com/huggingface/datasets/issues...,Protect master branch,"Cool, I think we can do both :)",After accidental merge commit (91c55355b634d0d...
1,https://github.com/huggingface/datasets/issues...,Protect master branch,@lhoestq now the 2 are implemented.\r\n\r\nPle...,After accidental merge commit (91c55355b634d0d...
2,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Hi ! I guess the caching mechanism should have...,## Describe the bug\r\nAfter upgrading to data...
3,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,"If it's easy enough to implement, then yes ple...",## Describe the bug\r\nAfter upgrading to data...
4,https://github.com/huggingface/datasets/issues...,Backwards compatibility broken for cached data...,Well it can cause issue with anyone that updat...,## Describe the bug\r\nAfter upgrading to data...
...,...,...,...,...
2959,https://github.com/huggingface/datasets/issues/2,Issue to read a local dataset,My first bug report ❤️\r\nLooking into this ri...,"Hello,\r\n\r\nAs proposed by @thomwolf, I open..."
2960,https://github.com/huggingface/datasets/issues/2,Issue to read a local dataset,"Ok, there are some news, most good than bad :l...","Hello,\r\n\r\nAs proposed by @thomwolf, I open..."
2961,https://github.com/huggingface/datasets/issues/2,Issue to read a local dataset,"Ok great, so as discussed today, let's:\r\n- h...","Hello,\r\n\r\nAs proposed by @thomwolf, I open..."
2962,https://github.com/huggingface/datasets/issues/2,Issue to read a local dataset,Good plan!\r\n\r\nYes I do use `builder_kwargs...,"Hello,\r\n\r\nAs proposed by @thomwolf, I open..."


Now that we have one comment per row, let's create a new `comments_length` column that contains the number of words per comment:

In [11]:
comments_dataset = comments_dataset.map(lambda x: {"comment_length": len(x["comments"].split())})

  0%|          | 0/2964 [00:00<?, ?ex/s]

We can use this new column to filter out short comments, which typically include things like "cc @lewtun" or "Thanks!" that are not relevant for our search engine. There's no precise number to select for the filter, but around 15 words seems like a good start:

In [12]:
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

  0%|          | 0/3 [00:00<?, ?ba/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2175
})

Having cleaned up our dataset a bit, let's concatenate the issue title, description, and comments together in a new `text` column. As usual, we'll write a simple function that we can pass to `Dataset.map()`:

In [13]:
def concatenate_text(examples):
    return {"text": examples["title"] + " \n " + examples["body"] + " \n " + examples["comments"]}
comments_dataset = comments_dataset.map(concatenate_text)

  0%|          | 0/2175 [00:00<?, ?ex/s]

We're finally ready to create some embeddings! Let's take a look.

### Creating text embeddings
We saw in [Chapter 2](https://huggingface.co/course/chapter2) that we can obtain token embeddings by using the `AutoModel` class. All we need to do is pick a suitable checkpoint to load the model from. Fortunately, there's a library called `sentence-transformers` that is dedicated to creating embeddings. As described in the library's [documentation](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search), our use case is an example of *asymmetric semantic search* because we have a short query whose answer we'd like to find in a longer document, like a an issue comment. The handy [model overview table](https://www.sbert.net/docs/pretrained_models.html#model-overview) in the documentation indicates that the `multi-qa-mpnet-base-dot-v1` checkpoint has the best performance for semantic search, so we'll use that for our application. We'll also load the tokenizer using the same checkpoint:

In [15]:
from transformers import AutoTokenizer, AutoModel
model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

To speed up the embedding process, it helps to place the model and inputs on a GPU device, so let's do that now:

In [16]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use gpu if possible, else cpu
print(device)
model.to(device)

cpu


MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0): MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_features

As we mentioned earlier, we'd like to represent each entry in our GitHub issues corpus as a single vector, so we need to "pool" or average our token embeddings in some way. One popular approach is to perform *CLS pooling* on our model's outputs, where we simply collect the last hidden state for the special `[CLS]` token. The following function does the trick for us:

In [17]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

Next, we'll create a helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs:

In [18]:
def get_embeddings(text_list):
    encoded_input = tokenizer(text_list, padding=True, truncation=True, return_tensors="pt")
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

We can test the function works by feeding it the first text entry in our corpus and inspecting the output shape:

In [19]:
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

Great, we've converted the first entry in our corpus into a 768-dimensional vector! We can use `Dataset.map()` to apply our `get_embeddings()` function to each row in our corpus, so let's create a new `embeddings` column as follows:

In [20]:
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

  0%|          | 0/2175 [00:00<?, ?ex/s]

Notice that we've converted the embeddings to NumPy arrays — that's because 🤗 Datasets requires this format when we try to index them with FAISS, which we'll do next.

### Using FAISS for efficient similarity search
Now that we have a dataset of embeddings, we need some way to search over them. To do this, we'll use a special data structure in 🤗 Datasets called a *FAISS index*. [FAISS](https://faiss.ai/) (short for Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

The basic idea behind FAISS is to create a special data structure called an *index* that allows one to find which embeddings are similar to an input embedding. Creating a FAISS index in 🤗 Datasets is simple — we use the `Dataset.add_faiss_index()` function and specify which column of our dataset we'd like to index:

In [21]:
embeddings_dataset.add_faiss_index(column="embeddings")

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 2175
})

We can now perform queries on this index by doing a nearest neighbor lookup with the `Dataset.get_nearest_examples()` function. Let's test this out by first embedding a question as follows:

In [22]:
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

Just like with the documents, we now have a 768-dimensional vector representing the query, which we can compare against the whole corpus to find the most similar embeddings:

In [23]:
scores, samples = embeddings_dataset.get_nearest_examples("embeddings", question_embedding, k=5)

The `Dataset.get_nearest_examples()` function returns a tuple of scores that rank the overlap between the query and the document, and a corresponding set of samples (here, the 5 best matches). Let's collect these in a `pandas.DataFrame` so we can easily sort them:

In [24]:
import pandas as pd
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

Now we can iterate over the first few rows to see how well our query matched the available comments:

In [25]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.50501251220703
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
```python
datasets = load_dataset('text', data_files=data_files)
```

We'll do a new release soon
SCORE: 24.555553436279297
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's no

Not bad! Our second hit seems to match the query.
> ✏️ Try it out! <font color="darkgreen">Create your own query and see whether you can find an answer in the retrieved documents. You might have to increase the `k` parameter in `Dataset.get_nearest_examples()` to broaden the search.</font>

In [26]:
# Trying it out
## query and embeddings
query = "How does batch mapping work?" # custom query
query_embedding = get_embeddings([query]).cpu().detach().numpy()
print(query_embedding.shape)
## sample the k nearest instances as well as their score (wrt the query)
scores, samples = embeddings_dataset.get_nearest_examples("embeddings", query_embedding, k=15) # maybe adapt k
## turn samples into a pandas dataframe, add the scores as a column, and sort the rows by their scores
samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
## print the results sample by sample
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

(1, 768)
COMMENT: Hi ! Thanks for reporting. Indeed it looks like type inference makes it fail. We should probably just ignore this step until a non-empty batch is passed.
SCORE: 45.637550354003906
TITLE: Batched `map` not allowed to return 0 items
URL: https://github.com/huggingface/datasets/issues/2644

COMMENT: Sure if you're interested feel free to open a PR :)

You can also ping me anytime if you have questions or if I can help !
SCORE: 45.637550354003906
TITLE: Batched `map` not allowed to return 0 items
URL: https://github.com/huggingface/datasets/issues/2644

COMMENT: I fixed a bug that could cause this issue earlier today. Could you pull the latest version and try again ?
SCORE: 45.54084014892578
TITLE: Indices incorrect with multiprocessing
URL: https://github.com/huggingface/datasets/issues/597

COMMENT: Hi @albertvillanova, thanks for the reply. I just tried the new version and the problem still persists. 

Do I need to rebuild the saved dataset (which I load from disk)

## [🤗 Datasets, check!](https://huggingface.co/course/chapter5/7?fw=pt)
Well, that was quite a tour through the 🤗 Datasets library — congratulations on making it this far! With the knowledge that you've gained from this chapter, you should be able to:
- Load datasets from anywhere, be it the Hugging Face Hub, your laptop, or a remote server at your company.
- Wrangle your data using a mix of the `Dataset.map()` and `Dataset.filter()` functions.
- Quickly switch between data formats like Pandas and NumPy using `Dataset.set_format()`.
- Create your very own dataset and push it to the Hugging Face Hub.
- Embed your documents using a Transformer model and build a semantic search engine using FAISS.

In [Chapter 7](https://huggingface.co/course/chapter7), we'll put all of this to good use as we take a deep dive into the core NLP tasks that Transformer models are great for. Before jumping ahead, though, put your knowledge of 🤗 Datasets to the test with a quick quiz!

## [End-of-chapter quiz](https://huggingface.co/course/chapter5/8?fw=pt)
This chapter covered a lot of ground! Don't worry if you didn't grasp all the details; the next chapters will help you understand how things work under the hood.

Before moving on, though, let's test what you learned in this chapter.

**1. The `load_dataset()` function in 🤗 Datasets allows you to load a dataset from which of the following locations?**<br>
⚫️ Locally, e.g. on your laptop
> **Correct!** Correct! You can pass the paths of local files to the `data_files` argument of `load_dataset()` to load local datasets.

⚫️ The Hugging Face Hub
> **Correct!** Correct! You can load datasets on the Hub by providing the dataset ID, e.g. `load_dataset('emotion')`.

⚫️ A remote server
> **Correct!** Correct! You can pass URLs to the `data_files` argument of `load_dataset()` to load remote files.

**2. Suppose you load one of the GLUE tasks as follows:**
```python
from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")
```
Which of the following commands will produce a random sample of 50 elements from dataset?<br>
⚪️ `dataset.sample(50)`<br>
⚫️ `dataset.shuffle().select(range(50))`
> **Correct!** Correct! As you saw in this chapter, you first shuffle the dataset and then select the samples from it.

⚪️ `dataset.select(range(50)).shuffle()`

**3. Suppose you have a dataset about household pets called `pets_dataset`, which has a `name` column that denotes the name of each pet. Which of the following approaches would allow you to filter the dataset for all pets whose names start with the letter "L"?**<br>
⚫️ `pets_dataset.filter(lambda x : x['name'].startswith('L'))`
> **Correct!** Correct! Using a Python lambda function for these quick filters is a great idea. Can you think of another solution?

⚪️ `pets_dataset.filter(lambda x['name'].startswith('L'))`<br>
⚫️ Create a function like `def filter_names(x): return x['name'].startswith('L')` and run `pets_dataset.filter(filter_names)`.
> **Correct!** Correct! Just like with `Dataset.map()`, you can pass explicit functions to `Dataset.filter()`. This is useful when you have some complex logic that isn't suitable for a short lambda function. Which of the other solutions would work?

**4. What is memory mapping?**<br>
⚪️ A mapping between CPU and GPU RAM<br>
⚫️ A mapping between RAM and filesystem storage
> **Correct!** Correct! 🤗 Datasets treats each dataset as a memory-mapped file. This allows the library to access and operate on elements of the dataset without needing to fully load it into memory.

⚪️ A mapping between two files in the 🤗 Datasets cache

**5. Which of the following are the main benefits of memory mapping?**<br>
⚫️ Accessing memory-mapped files is faster than reading from or writing to disk.
> **Correct!** Correct! This allows 🤗 Datasets to be blazing fast. That's not the only benefit, though.

⚫️ Applications can access segments of data in an extremely large file without having to read the whole file into RAM first.
> **Correct!** Correct! This allows 🤗 Datasets to load multi-gigabyte datasets on your laptop without blowing up your CPU. What other advantage does memory mapping offer?

⚪️ It consumes less energy, so your battery lasts longer.

**6. Why does the following code fail?**
```python
from datasets import load_dataset
dataset = load_dataset("allocine", streaming=True, split="train")
dataset[0]
```
<br>
⚪️ It tries to stream a dataset that's too large to fit in RAM.<br>
⚫️ It tries to access an `IterableDataset`.
> **Correct!** Correct! An `IterableDataset` is a generator, not a container, so you should access its elements using `next(iter(dataset))`.

⚪️ The `allocine` dataset doesn't have a `train` split.

**7. Which of the following are the main benefits of creating a dataset card?**<br>
⚫️ It provides information about the intended use and supported tasks of the dataset so others in the community can make an informed decision about using it.
> **Correct!** Undocumented datasets may be used to train models that may not reflect the intentions of the dataset creators, or may produce models whose legal status is murky if they're trained on data that violates privacy or licensing restrictions. This isn't the only benefit, though!

⚫️ It helps draw attention to the biases that are present in a corpus.
> **Correct!** Correct! Almost all datasets have some form of bias, which can produce negative consequences downstream. Being aware of them helps model builders understand how to address the inherent biases. What else do dataset cards help with?

⚫️ It improves the chances that others in the community will use my dataset.
> **Correct!** Correct! A well-written dataset card will tend to lead to higher usage of your precious dataset. What other benefits does it offer?

**8. What is semantic search?**<br>
⚪️ A way to search for exact matches between the words in a query and the documents in a corpus<br>
⚫️ A way to search for matching documents by understanding the contextual meaning of a query
> **Correct!** Correct! Semantic search uses embedding vectors to represent queries and documents, and uses a similarity metric to measure the amount of overlap between them. How else might you describe it?

⚫️ A way to improve search accuracy<br>
> **Correct!** Correct! Semantic search engines can capture the intent of a query much better than keyword matching and typically retrieve documents with higher precision. But this isn't the only right answer - what else does semantic search provide?

**9. For asymmetric semantic search, you usually have:**<br>
⚫️ A short query and a longer paragraph that answers the query
> **Correct!** Correct!

⚪️ Queries and paragraphs that are of about the same length<br>
⚪️ A long query and a shorter paragraph that answers the query


**10. Can I use 🤗 Datasets to load data for use in other domains, like speech processing?**<br>
⚪️ No<br>
⚫️ Yes
> **Correct!** Correct! Check out the exciting developments with speech and vision in the 🤗 Transformers library to see how 🤗 Datasets is used in these domains.