## Hugging Face datasets library

### Goal of this tutorial:
- Learn how to use the `datasets` library from Hugging Face
- What will be covered:
    - Quick overview
    - Installation
    - Datasets
        - List available datasets
        - Online dataset explorer 
        - Loading yelp dataset
        - Inspect and modify yelp dataset
        - Train a BERT model using yelp dataset
    - Metrics
        - List available metrics
        - Load NER evaluation metric
    - Loading custom dataset
        - Loading from a python dictionary
        - Loading from a pandas dataframe
        - Using a custom dataset loading script (Bonus)
        

### General:
- This notebook was last tested on Python 3.6.4, PyTorch 1.4.0, transformers 2.0.0, datasets 1.1.2 
- We would like to acknowledge the tutorial on datasets library from huggingface (https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb) which we used as a reference.


### References:
To know more about the above-mentioned concepts, take a look at the following:
1. Original overview notebook from the Hugging Face GitHub repository (https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)
2. Documentation (https://huggingface.co/docs/datasets/)
3. Online dataset explorer (https://huggingface.co/nlp/viewer)

## Quick overview

`🤗Datasets` is a fast and efficient library to easily share and load datasets and evaluation metrics, already providing access to 150+ datasets and 12+ evaluation metrics.

The library has several interesting features (besides easy access to datasets/metrics):

- Built-in interoperability with PyTorch, Tensorflow 2, Pandas and Numpy
- Lightweight and fast library with a transparent and pythonic API
- Thrive on large datasets: frees you from RAM limits, since all datasets are memory-mapped on drive by default.
- Smart caching with an intelligent `tf.data`-like cache: never wait for your data to process several times

## Installation

Let's install the datasets library.

In [None]:
!pip install datasets

# Make sure that we have a recent version of pyarrow in the session before we continue - otherwise reboot Colab to activate it
import pyarrow
if int(pyarrow.__version__.split('.')[1]) < 16 and int(pyarrow.__version__.split('.')[0]) == 0:
    import os
    os.kill(os.getpid(), 9)



## Datasets

Let's import the library. We typically need at most four functions:

In [None]:
from datasets import list_datasets, list_metrics, load_dataset, load_metric
from pprint import pprint

### List available datasets
Let's list the currently available datasets

In [None]:
datasets = list_datasets()
print(f"Currently {len(datasets)} datasets are available on the hub:")
pprint(datasets, compact=True)

Currently 170 datasets are available on the hub:
['aeslc', 'ag_news', 'ai2_arc', 'allocine', 'amazon_us_reviews', 'anli', 'arcd',
 'art', 'aslg_pc12', 'billsum', 'biomrc', 'blended_skill_talk', 'blimp',
 'blog_authorship_corpus', 'bookcorpus', 'boolq', 'break_data', 'c4', 'cfq',
 'civil_comments', 'clue', 'cmrc2018', 'cnn_dailymail', 'coarse_discourse',
 'com_qa', 'common_gen', 'commonsense_qa', 'compguesswhat', 'conll2000',
 'conll2003', 'coqa', 'cornell_movie_dialog', 'cos_e', 'cosmos_qa', 'crd3',
 'crime_and_punish', 'csv', 'daily_dialog', 'definite_pronoun_resolution',
 'discofuse', 'docred', 'doqa', 'drop', 'eli5', 'emo', 'emotion',
 'empathetic_dialogues', 'eraser_multi_rc', 'esnli', 'event2Mind', 'fever',
 'flores', 'fquad', 'gap', 'germeval_14', 'gigaword', 'glue',
 'guardian_authorship', 'hans', 'hansards', 'hellaswag', 'hotpot_qa',
 'hyperpartisan_news_detection', 'imdb', 'iwslt2017', 'jeopardy', 'json',
 'kilt_tasks', 'kilt_wikipedia', 'kor_nli', 'lc_quad', 'lhoestq/squad',


### Online dataset viewer

All these datasets can also be browsed on 
* the HuggingFace Hub (https://huggingface.co/datasets) 
* the 🤗datasets viewer (https://huggingface.co/nlp/viewer/)

To make things concrete, we will look at the Yelp reviews polarity dataset.

#### Task
Binary Sentiment Classification: Given a yelp review, the task is to predict the sentiment for the given review.

#### Dataset
The yelp reviews polarity dataset is constructed by considering stars 1 and 2 negative, and 3 and 4 positive. For each polarity, 280,000 training samples and 19,000 testing samples are taken randomly. In total there are 560,000 training samples and 38,000 testing samples. Negative polarity is class 1, and positive polarity is class 2.

### Loading yelp dataset

Before downloading any dataset, we can access various attributes of the datasets. We will access the attributes of yelp dataset now: 

In [None]:
yelp_dataset = list_datasets(with_details=True)[datasets.index('yelp_polarity')]
pprint(yelp_dataset.__dict__)  # It's a simple python dataclass

{'author': None,
 'citation': '@article{zhangCharacterlevelConvolutionalNetworks2015,\n'
             '  archivePrefix = {arXiv},\n'
             '  eprinttype = {arxiv},\n'
             '  eprint = {1509.01626},\n'
             '  primaryClass = {cs},\n'
             '  title = {Character-Level {{Convolutional Networks}} for {{Text '
             'Classification}}},\n'
             '  abstract = {This article offers an empirical exploration on '
             'the use of character-level convolutional networks (ConvNets) for '
             'text classification. We constructed several large-scale datasets '
             'to show that character-level convolutional networks could '
             'achieve state-of-the-art or competitive results. Comparisons are '
             'offered against traditional models such as bag of words, n-grams '
             'and their TFIDF variants, and deep learning models such as '
             'word-based ConvNets and recurrent neural networks.},\n'
      

Let us now download and load yelp dataset.

In [None]:
from datasets import load_dataset
dataset = load_dataset('yelp_polarity', split='test[:1%]')

Reusing dataset yelp_polarity (/root/.cache/huggingface/datasets/yelp_polarity/plain_text/1.0.0/2b33212d89209ed1ea0522001bccc5f5a5c920dd9c326f3c828e67a22c51a98c)


This call to `datasets.load_dataset()` does the following steps under the hood:

1. Download the **Yelp polarity python processing script** from the HuggingFace AWS bucket and import it into the library, if it's not already stored there. You can find the Yelp processing script [here](https://github.com/huggingface/datasets/blob/master/datasets/yelp_polarity/yelp_polarity.py) for instance.

   Processing scripts are small python scripts which define the info (citation, description) and format of the dataset, and contain the URL of the original Yelp data files together with the code to load examples from them.


2. Run the Yelp python processing script which will:
    - **Download the Yelp dataset** from the original URL (see the script) if it's not already downloaded and cached.
    - **Process and cache** all Yelp reviews in a structured Arrow table for each standard split stored on the drive.

      Arrow tables are arbitrarily long tables, typed with types that can be mapped to numpy/pandas/python standard types, and they can store nested objects. They can be accessed directly from the drive, loaded in RAM or even streamed over the web.
    

3. Return a **dataset built from the splits** asked by the user (default: all), in the above example we create a dataset with the first 1% of the test split.
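
The split string `'test[:1%]'` selects a percentage of a split. As a rough arithmetic sketch, the row counts of the 1% slices used in this tutorial follow from the split sizes quoted in the dataset description. The helper `rows_in_percent_slice` is a hypothetical name introduced here, not a library function, and the library's own rounding for percentages that do not divide evenly may differ from the floor division assumed below.

```python
def rows_in_percent_slice(total_rows, percent):
    """Rows in the first `percent` percent of a split (assumes floor division)."""
    return total_rows * percent // 100


test_rows = rows_in_percent_slice(38_000, 1)    # 1% of the test split
train_rows = rows_in_percent_slice(560_000, 1)  # 1% of the train split
print(test_rows, train_rows)
```

These values agree with the `num_rows` fields reported in the cell outputs of this notebook.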

The returned `Dataset` object is a memory-mapped dataset that behaves similarly to a normal map-style dataset. It is backed by an Apache Arrow table, which enables many interesting features.

Let us get information on the dataset (description, citation, size, splits, format, ...):

In [None]:
# Dataset details are provided in `dataset.info` (a simple python dataclass) and also as direct attributes of the dataset object
pprint(dataset.info.__dict__)

{'builder_name': 'yelp_polarity',
 'citation': '@article{zhangCharacterlevelConvolutionalNetworks2015,\n'
             '  archivePrefix = {arXiv},\n'
             '  eprinttype = {arxiv},\n'
             '  eprint = {1509.01626},\n'
             '  primaryClass = {cs},\n'
             '  title = {Character-Level {{Convolutional Networks}} for {{Text '
             'Classification}}},\n'
             '  abstract = {This article offers an empirical exploration on '
             'the use of character-level convolutional networks (ConvNets) for '
             'text classification. We constructed several large-scale datasets '
             'to show that character-level convolutional networks could '
             'achieve state-of-the-art or competitive results. Comparisons are '
             'offered against traditional models such as bag of words, n-grams '
             'and their TFIDF variants, and deep learning models such as '
             'word-based ConvNets and recurrent neural netw

### Inspect and modify yelp dataset

Let's pretty print the dataset object:

In [None]:
pprint(dataset)

Dataset(features: {'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['1', '2'], names_file=None, id=None)}, num_rows: 380)


We can query its length like we would normally do with a python mapping.

In [None]:
print(f"👉Dataset len(dataset): {len(dataset)}")

👉Dataset len(dataset): 380


We can get items or slices like we would do normally with a python mapping. Let us get the first example:

In [None]:
print("\n👉First item 'dataset[0]':")
pprint(dataset[0])


👉First item 'dataset[0]':
{'label': 1,
 'text': 'Contrary to other reviews, I have zero complaints about the service '
         'or the prices. I have been getting tire service here for the past 5 '
         'years now, and compared to my experience with places like Pep Boys, '
         "these guys are experienced and know what they're doing. \\nAlso, "
         'this is one place that I do not feel like I am being taken advantage '
         'of, just because of my gender. Other auto mechanics have been '
         'notorious for capitalizing on my ignorance of cars, and have sucked '
         'my bank account dry. But here, my service and road coverage has all '
         'been well explained - and let up to me to decide. \\nAnd they just '
         'renovated the waiting room. It looks a lot better than it did in '
         'previous years.'}


Let us slice several examples (8th, 9th, 10th):

In [None]:
print("\n👉Slice of the three items 'dataset[7:10]':")
pprint(dataset[7:10])


👉Slice of the three items 'dataset[7:10]':
OrderedDict([('label', [0, 0, 1]),
             ('text',
              ['Ok! Let me tell you about my bad experience first. I went to '
               'D&B last night for a post wedding party - which, side note, is '
               "a great idea!\\n\\nIt was around midnight and the bar wasn't "
               'really populated. There were three bartenders and only one was '
               'actually making rounds to see if anyone needed anything. The '
               'two other bartenders were chatting on the far side of the bar '
               'that no one was sitting at. Kind of counter productive if you '
               'ask me. \\n\\nI stood there for about 5 minutes, which for a '
               'busy bar is fine but when I am the only one with my card out '
               'then, it just seems a little ridiculous. I made eye contact '
               'with the one girl twice and gave her a smile and she literally '
               'turned 

Let us get a full column of the dataset for first 5 examples by indexing with its name as a string:

In [None]:
pprint(dataset['text'][0:5])

['Contrary to other reviews, I have zero complaints about the service or the '
 'prices. I have been getting tire service here for the past 5 years now, and '
 'compared to my experience with places like Pep Boys, these guys are '
 "experienced and know what they're doing. \\nAlso, this is one place that I "
 'do not feel like I am being taken advantage of, just because of my gender. '
 'Other auto mechanics have been notorious for capitalizing on my ignorance of '
 'cars, and have sucked my bank account dry. But here, my service and road '
 'coverage has all been well explained - and let up to me to decide. \\nAnd '
 'they just renovated the waiting room. It looks a lot better than it did in '
 'previous years.',
 'Last summer I had an appointment to get new tires and had to wait a super '
 'long time. I also went in this week for them to fix a minor problem with a '
 'tire they put on. They \\""fixed\\"" it for free, and the very next morning '
 'I had the same issue. I called to com

The `__getitem__` method returns a different format depending on the type of query:

- Items like `dataset[0]` are returned as dict of elements.
- Slices like `dataset[7:10]` are returned as dict of lists of elements.
- Columns like `dataset['text']` are returned as a list of elements.
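
The three rules above can be mimicked with a small column-oriented container. This is only a sketch to make the return types concrete: `TinyTable` is a hypothetical class written for this illustration and says nothing about how `Dataset` is implemented on top of Arrow.

```python
class TinyTable:
    """Hypothetical column store that mimics the indexing rules listed above."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column name -> list of elements
            return list(self.columns[key])
        if isinstance(key, int):
            # Single index -> dict of elements
            return {name: values[key] for name, values in self.columns.items()}
        if isinstance(key, slice):
            # Slice -> dict of lists of elements
            return {name: values[key] for name, values in self.columns.items()}
        raise TypeError(f"Unsupported key type: {type(key).__name__}")


table = TinyTable({'text': ['good', 'bad', 'fine'], 'label': [1, 0, 1]})
print(table[0])       # dict of elements
print(table[0:2])     # dict of lists
print(table['text'])  # list of elements
```

Because every query reads from the same columns, indexing by row first or by column first reaches the same value.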

In particular, we can easily iterate along columns in slices, and we can also swap the order of consecutive indexing operations and get identical results, as shown here by swapping column indexing with element and slice indexing:

In [None]:
print(dataset[0]['text'])
print(dataset['text'][0]) # returns same result as previous command
print(dataset[0]['text'] == dataset['text'][0])

Contrary to other reviews, I have zero complaints about the service or the prices. I have been getting tire service here for the past 5 years now, and compared to my experience with places like Pep Boys, these guys are experienced and know what they're doing. \nAlso, this is one place that I do not feel like I am being taken advantage of, just because of my gender. Other auto mechanics have been notorious for capitalizing on my ignorance of cars, and have sucked my bank account dry. But here, my service and road coverage has all been well explained - and let up to me to decide. \nAnd they just renovated the waiting room. It looks a lot better than it did in previous years.
Contrary to other reviews, I have zero complaints about the service or the prices. I have been getting tire service here for the past 5 years now, and compared to my experience with places like Pep Boys, these guys are experienced and know what they're doing. \nAlso, this is one place that I do not feel like I am bei

Similarly, we can apply the same permutation to a slice of multiple indices:

In [None]:
print(dataset[0:5]['text'] == dataset['text'][0:5])

True


#### Datasets are internally typed and structured

The dataset is backed by one (or several) Apache Arrow tables, which are typed and allow fast retrieval and access as well as arbitrary-size memory mapping.

This means, respectively, that the format of the dataset is clearly defined and that you can load datasets of arbitrary size without worrying about RAM limits (basically the dataset takes no space in RAM; it is read directly from the drive when needed, with fast I/O access).
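
Memory mapping is a general operating-system feature and is not specific to Arrow. As a rough illustration using only the Python standard library, the sketch below reads one record from a file through `mmap` without loading the whole file. The fixed-size record layout and the helper `read_record` are assumptions made for this example and are unrelated to the Arrow file format.

```python
import mmap
import os
import tempfile


def read_record(path, index, record_size):
    """Return record `index` from a file of fixed-size records via a memory map."""
    with open(path, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = index * record_size
        return mm[start:start + record_size]  # only this byte range is copied out


# Build a small demo file holding 1,000 records of 16 bytes each.
record_size = 16
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(f'review-{i:04d}'.ljust(record_size).encode('ascii'))
    demo_path = tmp.name

record = read_record(demo_path, 42, record_size)
os.remove(demo_path)  # clean up the demo file
print(record.decode('ascii').strip())
```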

We can inspect the dataset column names:

In [None]:
print("Column names:")
pprint(dataset.column_names)

Column names:
['label', 'text']


We can inspect the dataset column types:

In [None]:
print("Features:")
pprint(dataset.features)

Features:
{'label': ClassLabel(num_classes=2, names=['1', '2'], names_file=None, id=None),
 'text': Value(dtype='string', id=None)}
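
The `label` column stores integer ids, while the `ClassLabel` feature keeps the class names. A plain-Python sketch of that mapping is shown below, using the names printed above and the polarity convention from the dataset description. The helpers `id_to_name` and `name_to_id` are hypothetical; the library's `ClassLabel` offers `int2str` and `str2int` for the same purpose.

```python
label_names = ['1', '2']  # copied from the ClassLabel feature printed above
polarity = {'1': 'negative', '2': 'positive'}  # per the dataset description


def id_to_name(label_id):
    """Integer id stored in the dataset -> class name."""
    return label_names[label_id]


def name_to_id(name):
    """Class name -> integer id stored in the dataset."""
    return label_names.index(name)


# The first test example shown earlier has label id 1.
print(id_to_name(1), polarity[id_to_name(1)])
```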


#### Modifying the dataset with `dataset.map`

Now that we know how to inspect our dataset, we also want to update it. For that there is a powerful method, `.map()`, which applies a function to each example, either independently or in batches.

`.map()` takes a callable accepting a dict as argument (the same dict as the one returned by `dataset[i]`) and iterates over the dataset, calling the function on each example.

Let us use `.map()` to print the length of every text:

In [None]:
dataset.map(lambda example: print(len(example['text']), end=','))

681,

681,374,93,300,783,1069,198,1720,858,352,1063,922,1124,632,217,199,2230,317,1157,414,862,505,153,1165,170,1772,575,1166,114,548,627,286,1133,725,261,828,657,2853,88,727,1008,383,176,617,180,439,1637,1963,1625,749,1693,1773,212,4125,118,266,858,403,564,845,257,1321,1176,1395,566,409,177,188,241,266,63,373,225,1040,283,813,189,244,499,184,124,1655,340,845,1348,167,2874,407,560,271,2837,190,476,55,1192,30,1727,1216,486,267,363,470,710,145,459,482,165,847,900,1237,1186,452,903,575,1993,445,414,1131,1686,1141,396,881,433,255,282,681,583,974,1361,3354,563,1333,885,845,407,718,509,207,334,458,305,1504,641,317,498,465,410,881,768,228,441,426,119,1863,878,238,2260,642,2246,158,941,922,175,823,699,427,239,122,146,1176,238,1102,423,125,1742,106,356,614,269,1555,348,1223,636,790,689,502,143,291,201,275,144,1005,649,1242,207,559,397,442,1049,1263,184,720,355,504,788,1316,2220,275,1051,921,373,431,753,784,1733,133,45,122,1084,43,492,414,203,187,710,611,1482,112,401,980,764,1578,206,650,992,708,642,5

Dataset(features: {'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['1', '2'], names_file=None, id=None)}, num_rows: 380)

This is basically the same as doing

```python
for example in dataset:
    function(example)
```

The main interest of `.map()` is to update and modify the content of the table and leverage smart caching and fast backend.

To use `.map()` to update elements in the table you need to provide a function with the following signature: `function(example: dict) -> dict`.
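
To make the update semantics concrete, here is a minimal pure-Python sketch of the idea, assuming the dataset is represented as a list of dicts. `map_examples` is a hypothetical helper and not the library's implementation, which additionally writes results to an Arrow cache.

```python
def map_examples(examples, function):
    """Apply `function` to each example and merge the returned dict into a copy of it."""
    mapped = []
    for example in examples:
        updated = dict(example)           # copy so the input rows stay untouched
        result = function(dict(example))  # the function sees one example as a dict
        if result is not None:            # a function returning None changes nothing
            updated.update(result)        # returned keys add or overwrite columns
        mapped.append(updated)
    return mapped


rows = [{'text': 'good food', 'label': 1}, {'text': 'bad', 'label': 0}]
with_length = map_examples(rows, lambda example: {'length': len(example['text'])})
print(with_length)
```

A function that only prints and returns `None`, like the lambda used earlier, leaves the rows unchanged in this sketch, which matches the unchanged `Dataset` printed after that call.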

Let us try to add a new field `cute_text` that contains prefix `My cute review:` to all reviews in the dataset:

In [None]:
def add_prefix_to_text(example):
    example['cute_text'] = 'My cute review: ' + example['text']
    return example

prefixed_dataset = dataset.map(add_prefix_to_text)
print(prefixed_dataset.column_names) # print column names
pprint(prefixed_dataset.unique('cute_text')[0:3])  # `.unique()` is a super fast way to get the unique elements in a column (see the doc for all the methods)

Loading cached processed dataset at /root/.cache/huggingface/datasets/yelp_polarity/plain_text/1.0.0/2b33212d89209ed1ea0522001bccc5f5a5c920dd9c326f3c828e67a22c51a98c/cache-b88eb39dd560101c.arrow


['cute_text', 'label', 'text']
['My cute review: Contrary to other reviews, I have zero complaints about the '
 'service or the prices. I have been getting tire service here for the past 5 '
 'years now, and compared to my experience with places like Pep Boys, these '
 "guys are experienced and know what they're doing. \\nAlso, this is one place "
 'that I do not feel like I am being taken advantage of, just because of my '
 'gender. Other auto mechanics have been notorious for capitalizing on my '
 'ignorance of cars, and have sucked my bank account dry. But here, my service '
 'and road coverage has all been well explained - and let up to me to decide. '
 '\\nAnd they just renovated the waiting room. It looks a lot better than it '
 'did in previous years.',
 'My cute review: Last summer I had an appointment to get new tires and had to '
 'wait a super long time. I also went in this week for them to fix a minor '
 'problem with a tire they put on. They \\""fixed\\"" it for free, and 

The function you provide to `.map()` should accept an input with the format of an item of the dataset: `function(dataset[0])` and return a python dict.

Let us remove the column `cute_text` by running map with the `remove_columns=List[str]` argument:

In [None]:
less_columns_dataset = prefixed_dataset.map(remove_columns=['cute_text'])
print(less_columns_dataset.column_names) # print column names
pprint(less_columns_dataset.unique('text')[0:3]) # print three texts

Loading cached processed dataset at /root/.cache/huggingface/datasets/yelp_polarity/plain_text/1.0.0/2b33212d89209ed1ea0522001bccc5f5a5c920dd9c326f3c828e67a22c51a98c/cache-56c4ecfed98f304d.arrow


['label', 'text']
['Contrary to other reviews, I have zero complaints about the service or the '
 'prices. I have been getting tire service here for the past 5 years now, and '
 'compared to my experience with places like Pep Boys, these guys are '
 "experienced and know what they're doing. \\nAlso, this is one place that I "
 'do not feel like I am being taken advantage of, just because of my gender. '
 'Other auto mechanics have been notorious for capitalizing on my ignorance of '
 'cars, and have sucked my bank account dry. But here, my service and road '
 'coverage has all been well explained - and let up to me to decide. \\nAnd '
 'they just renovated the waiting room. It looks a lot better than it did in '
 'previous years.',
 'Last summer I had an appointment to get new tires and had to wait a super '
 'long time. I also went in this week for them to fix a minor problem with a '
 'tire they put on. They \\""fixed\\"" it for free, and the very next morning '
 'I had the same issu

### Train a BERT model using yelp dataset

Let us start by tokenizing 1% of train dataset. For that, we need to load train dataset.

In [None]:
train_dataset = load_dataset('yelp_polarity', split='train[:1%]')
pprint(train_dataset)

Reusing dataset yelp_polarity (/root/.cache/huggingface/datasets/yelp_polarity/plain_text/1.0.0/2b33212d89209ed1ea0522001bccc5f5a5c920dd9c326f3c828e67a22c51a98c)


Dataset(features: {'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['1', '2'], names_file=None, id=None)}, num_rows: 5600)


Now let us tokenize all the reviews. We will use a tokenizer from the 🤗transformers library.

Input to the tokenizer: the tokenizers of the 🤗transformers library accept lists of texts as input and tokenize them efficiently in batch (the fast tokenizers in particular).

Output of the tokenizer: a dictionary-like object with three fields, `input_ids`, `token_type_ids` and `attention_mask`, corresponding to the model's required inputs. Each field contains a list (batch) of samples.

Let's load the tokenizer:

In [None]:
!pip install transformers 
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')  # same checkpoint as the model loaded later, so the vocabulary matches



Let's tokenize all the reviews now:

In [None]:
encoded_train_dataset = train_dataset.map(lambda examples: tokenizer(examples['text'], truncation=True, padding='max_length'), batched=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/yelp_polarity/plain_text/1.0.0/2b33212d89209ed1ea0522001bccc5f5a5c920dd9c326f3c828e67a22c51a98c/cache-e9bb4a6ed5bff8ae.arrow
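
With `batched=True`, the function passed to `.map()` receives a dict of lists (one list per column) instead of a single example, which is why the tokenizer can be called directly on `examples['text']`. The sketch below imitates that calling convention in plain Python. `map_batched` is a hypothetical helper, not the library code, and the word-count function merely stands in for a tokenizer.

```python
def map_batched(columns, function, batch_size=1000):
    """Call `function` on successive dict-of-lists batches and collect the merged columns."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        merged = {**batch, **function(batch)}  # returned columns are added or overwritten
        for name, values in merged.items():
            output.setdefault(name, []).extend(values)
    return output


columns = {'text': ['good food', 'bad', 'ok service here'], 'label': [1, 0, 1]}
result = map_batched(
    columns,
    lambda batch: {'num_words': [len(text.split()) for text in batch['text']]},
    batch_size=2,
)
print(result['num_words'])
```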


Let's look at the column names in the `encoded_train_dataset`:

In [None]:
print(encoded_train_dataset.column_names)

['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids']


Here are some details about each column:

`attention_mask`: List of 0/1 flags marking which tokens the model should attend to (1) and which are padding (0)

`input_ids`:  List of token ids to be fed to a model.

`token_type_ids`: List of token type ids to be fed to a model 

For more details on the three fields above, see https://huggingface.co/transformers/main_classes/tokenizer.html

`text`: Raw review

`label`: Raw label
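
The relation between `input_ids`, `attention_mask`, truncation and padding can be shown with a toy example. This is a deliberately simplified sketch: `toy_encode` and its tiny vocabulary are hypothetical, it splits on whitespace, and it adds no special tokens, whereas the real BERT tokenizer uses WordPiece and inserts `[CLS]` and `[SEP]`.

```python
def toy_encode(text, vocab, max_length):
    """Whitespace tokenizer with truncation and padding to `max_length`."""
    pad_id, unk_id = vocab['[PAD]'], vocab['[UNK]']
    ids = [vocab.get(word, unk_id) for word in text.lower().split()]
    ids = ids[:max_length]                 # truncation: drop tokens beyond max_length
    padding = max_length - len(ids)        # number of padding positions to append
    return {
        'input_ids': ids + [pad_id] * padding,
        'attention_mask': [1] * len(ids) + [0] * padding,  # 1 = real token, 0 = padding
    }


vocab = {'[PAD]': 0, '[UNK]': 1, 'great': 2, 'food': 3, 'service': 4}
short = toy_encode('Great food', vocab, max_length=5)
long = toy_encode('great food slow service today really', vocab, max_length=5)
print(short)
print(long)
```

In the short example the mask ends in zeros over the padded positions, which is the same pattern visible in the long `attention_mask` tensors printed in this notebook.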

Let's print one example:

In [None]:
pprint(encoded_train_dataset[0])

{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0

#### Formatting outputs for PyTorch, Tensorflow, Numpy, Pandas

Now that we have tokenized our inputs, we probably want to use this dataset in a `torch.utils.data.DataLoader`.

To be able to do this we need to tweak two things:

- format the indexing (`__getitem__`) to return numpy/pytorch/tensorflow tensors, instead of python objects, and probably
- format the indexing (`__getitem__`) to return only the subset of the columns that we need for our model inputs.

  We don't need the raw `text` column as input to train our model, but we may still want to keep it in the dataset, for instance for the evaluation of the model.
    
This is handled by the `.set_format(type: Union[None, str], columns: Union[None, str, List[str]])` method, where:

- `type` defines the return type of the dataset's `__getitem__` method and is one of `[None, 'numpy', 'pandas', 'torch', 'tensorflow']` (`None` means return python objects), and
- `columns` defines the columns returned by `__getitem__` and takes the name of a column in the dataset or a list of columns to return (`None` means return all columns).
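
The key point is that formatting changes what indexing returns, not what is stored. The sketch below illustrates only the column-selection half of that idea. `FormattedRows` is a hypothetical class written for this example, and the tensor conversion performed by the real `type` argument is left out.

```python
class FormattedRows:
    """Hypothetical wrapper: keeps every column, returns only the selected ones."""

    def __init__(self, rows):
        self.rows = rows     # list of dicts, all columns are kept
        self.columns = None  # None means: return every column

    def set_format(self, columns=None):
        """Choose which columns indexing returns; the stored data is not modified."""
        self.columns = columns

    def __getitem__(self, index):
        row = self.rows[index]
        names = self.columns if self.columns is not None else list(row)
        return {name: row[name] for name in names}


data = FormattedRows([{'text': 'good', 'input_ids': [2, 3], 'label': 1}])
data.set_format(columns=['input_ids', 'label'])
selected = data[0]      # only the model inputs are returned
data.set_format()       # reset: all columns are visible again
everything = data[0]
print(selected)
print(everything)
```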

Let us list the columns required for training:

In [None]:
columns_to_return = ['input_ids', 'attention_mask', 'label']

Let us change the format of the dataset to torch (suitable for `torch.utils.data.DataLoader`):

In [None]:
encoded_train_dataset.set_format(type='torch', columns=columns_to_return)

Our dataset indexing output is now ready to be used in a PyTorch data loader:

In [None]:
pprint(encoded_train_dataset[1], compact=True)

{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0


Let's instantiate a data loader that consumes our dataset indexing output:

In [None]:
import torch
dataloader = torch.utils.data.DataLoader(encoded_train_dataset, batch_size=2)
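
Conceptually, a data loader with `batch_size=2` walks over the dataset and groups consecutive examples into batches keyed by column name. The plain-Python sketch below shows only that grouping step. `iter_batches` is a hypothetical helper, and the real `DataLoader` additionally stacks the per-example tensors into batch tensors and supports shuffling and worker processes.

```python
def iter_batches(examples, batch_size):
    """Yield dicts of lists, each grouping `batch_size` consecutive examples."""
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        yield {name: [example[name] for example in chunk] for name in chunk[0]}


examples = [{'input_ids': [2, 3, 0], 'label': i % 2} for i in range(5)]
batches = list(iter_batches(examples, batch_size=2))
print(len(batches))
print(batches[0]['label'])
```

With five examples and a batch size of two, the last batch holds a single example.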

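Let's import the BERT base model for sequence classification from the transformers library and create an Adam optimizer from PyTorch. The classification loss is computed inside the model when labels are passed: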

In [None]:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 2, output_attentions = False, output_hidden_states = False,)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Let's train the BERT base model on our tokenized dataset for 1 step:

In [None]:
model.train() # toggle training mode
for i, batch in enumerate(dataloader):
    outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'], labels=batch['label']) # pass the input batch to model
    loss, logits = outputs[0], outputs[1] # index access works whether the model returns a tuple or an output object
    loss.backward() # backprop the grads (store in grad buffer)
    optimizer.step() # use grads from grad buffer to update the model
    model.zero_grad() # zero the grad buffer
    print(f'Step {i} - loss: {loss.item():.3}') # print the step loss
    break # stop after a single training step

Step 0 - loss: 0.715


## Metrics

`datasets` also provides easy access and sharing of metrics.

This aspect of the library is still experimental and the API may still evolve more than the datasets API.

Like datasets, metrics are added as small scripts wrapping common metrics in a common API.

There are several reasons you may want to use metrics with `datasets`, in particular:

- metrics for specific datasets like GLUE or SQuAD are provided out-of-the-box in a simple, convenient and consistent way, integrated with the dataset,
- metrics in `datasets` leverage the powerful backend to provide smart features out-of-the-box, like support for distributed evaluation in PyTorch.

Let's list available metrics:

In [None]:
metrics = list_metrics()
print(f"Currently {len(metrics)} metrics are available on the hub:")
pprint(metrics, compact=True)

Currently 13 metrics are available on the hub:
['bertscore', 'bleu', 'bleurt', 'coval', 'gleu', 'glue', 'meteor', 'rouge',
 'sacrebleu', 'seqeval', 'squad', 'squad_v2', 'xnli']


Let's look at an example metric: `seqeval`. `seqeval` is a Python framework for sequence labeling evaluation that can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on. 

For more details about `seqeval` metric, look at: https://huggingface.co/metrics/seqeval

Let's install the dependency for the metric and load that metric now:

In [None]:
!pip install seqeval 
ner_metric = load_metric('seqeval')



Let's generate sample references and predictions for NER task:

In [None]:
references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
predictions =  [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]

Let's use `seqeval` metric to score the predictions:

In [None]:
ner_metric.compute(predictions=predictions, references=references)

{'MISC': {'f1': 0, 'number': 1, 'precision': 0.0, 'recall': 0.0},
 'PER': {'f1': 1.0, 'number': 1, 'precision': 1.0, 'recall': 1.0},
 'overall_accuracy': 0.8,
 'overall_f1': 0.5,
 'overall_precision': 0.5,
 'overall_recall': 0.5}
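
To see where these numbers come from, here is a minimal pure-Python sketch of entity-level scoring under a simplified reading of the BIO tagging scheme. The helpers `extract_entities` and `score_entities` are hypothetical names written for this illustration. They are not the `seqeval` implementation, which supports more tagging schemes and edge cases.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans found in one BIO-tagged sentence."""
    entities = set()
    start, current_type = None, None
    for i, tag in enumerate(tags + ['O']):  # the sentinel 'O' closes a trailing entity
        prefix, _, tag_type = tag.partition('-')
        continues = prefix == 'I' and tag_type == current_type
        if not continues:
            if current_type is not None:
                entities.add((current_type, start, i - 1))  # close the open entity
            if prefix in ('B', 'I'):
                start, current_type = i, tag_type           # open a new entity
            else:
                start, current_type = None, None
    return entities


def score_entities(references, predictions):
    """Entity-level precision/recall/F1 plus tag-level accuracy."""
    true_entities, pred_entities = set(), set()
    correct_tags = total_tags = 0
    for sent_id, (ref, pred) in enumerate(zip(references, predictions)):
        true_entities |= {(sent_id, *entity) for entity in extract_entities(ref)}
        pred_entities |= {(sent_id, *entity) for entity in extract_entities(pred)}
        correct_tags += sum(r == p for r, p in zip(ref, pred))
        total_tags += len(ref)
    num_correct = len(true_entities & pred_entities)  # exact span and type match only
    precision = num_correct / len(pred_entities) if pred_entities else 0.0
    recall = num_correct / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'precision': precision, 'recall': recall, 'f1': f1,
            'accuracy': correct_tags / total_tags}


references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
scores = score_entities(references, predictions)
print(scores)
```

The predicted `MISC` entity starts one token too early, so its span does not match the reference exactly and it counts as an error. Only the `PER` entity is correct, giving precision, recall and F1 of 0.5, while 8 of the 10 individual tags agree, giving an accuracy of 0.8.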

## Loading local content

It’s also possible to create a dataset from local files or in-memory data.

Several file types are currently supported:
- CSV files 
- JSON files 
- text files (read as a line-by-line dataset)
- pandas pickled dataframe


To begin, we need to import Dataset class from the library:

In [None]:
from datasets import Dataset

### Loading from a python dictionary

Let's create a python dictionary:

In [None]:
my_dict = {'id': [0, 1, 2, 3], 'name': ['mary', 'bob', 'eve', 'rob'], 'age': [24, 53, 19, 25]}

Let's instantiate a dataset object from the above dictionary:

In [None]:
dataset = Dataset.from_dict(my_dict)
pprint(dataset)

Dataset(features: {'id': Value(dtype='int64', id=None), 'name': Value(dtype='string', id=None), 'age': Value(dtype='int64', id=None)}, num_rows: 4)


Let's print second row:

In [None]:
pprint(dataset[1])

{'age': 53, 'id': 1, 'name': 'bob'}


### Loading from a pandas dataframe

Similarly, let's create a pandas dataframe:

In [None]:
import pandas as pd
df = pd.DataFrame({"id": [0, 1, 2, 3], 'name': ['mary', 'bob', 'eve', 'rob'], 'age': [24, 53, 19, 25]})
pprint(df)

   id  name  age
0   0  mary   24
1   1   bob   53
2   2   eve   19
3   3   rob   25


Let's instantiate a dataset object from the above pandas frame:

In [None]:
dataset = Dataset.from_pandas(df)
pprint(dataset)

Dataset(features: {'id': Value(dtype='int64', id=None), 'name': Value(dtype='string', id=None), 'age': Value(dtype='int64', id=None)}, num_rows: 4)


Let's print second row:

In [None]:
pprint(dataset[1])

{'age': 53, 'id': 1, 'name': 'bob'}


### Using a custom dataset loading script (Bonus)

If the provided loading scripts for Hub datasets or for local files do not fit our use case, we can also easily write and use our own dataset loading script.

We can use a local loading script just by providing its path instead of the usual shortcut name:


```python
from datasets import load_dataset
dataset = load_dataset('PATH/TO/MY/LOADING/SCRIPT', data_files='PATH/TO/MY/FILE')
```

More details on how to create our own dataset generation script are on the [Writing a dataset loading script page](https://huggingface.co/docs/datasets/add_dataset.html), and we can also find some inspiration in the already provided loading scripts on the [GitHub repository](https://github.com/huggingface/datasets/tree/master/datasets).

That's it!