# HuggingFace Datasets

Datasets are available at [https://huggingface.co/datasets](https://huggingface.co/datasets). The following simple example demonstrates some of the functionality of the HuggingFace `datasets` library. Data sources:

* [GLUE](https://gluebenchmark.com/) (General Language Understanding Evaluation)

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer

In [2]:
# GLUE stands for General Language Understanding Evaluation
# Loads sst2 (sentiment analysis) task data from the GLUE dataset
raw_dataset = load_dataset("glue", "sst2")

In [3]:
print(f"Dataset:\n{raw_dataset}")

Dataset:
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})


In [9]:
# Type of the train dataset
type(raw_dataset["train"])

datasets.arrow_dataset.Dataset

In [10]:
# Contents of the train dataset
raw_dataset['train']

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 67349
})

In [16]:
# Show the types of each of the elements
raw_dataset['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [12]:
# Contents of the first element of the train dataset
raw_dataset['train'][0]

{'sentence': 'hide new secretions from the parental units ',
 'label': 0,
 'idx': 0}

In [15]:
# Contents of the first 5 elements
raw_dataset['train'][0:5]

{'sentence': ['hide new secretions from the parental units ',
  'contains no wit , only labored gags ',
  'that loves its characters and communicates something rather beautiful about human nature ',
  'remains utterly satisfied to remain the same throughout ',
  'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up '],
 'label': [0, 0, 1, 0, 0],
 'idx': [0, 1, 2, 3, 4]}

### Tokenization with Datasets

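The following shows how to tokenize a dataset. We define a function that tokenizes a batch of examples and pass it to the dataset's `map` method. With `batched=True`, the function receives a dictionary of lists (one list per column) rather than a single example, and the keys of the dictionary it returns are added to the dataset as new columns.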

In [6]:
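# Load the tokenizer that matches the pretrained checkpoint
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_fn(batch):
    # batch['sentence'] is a list of strings when map is called with batched=True;
    # truncation=True cuts sequences longer than the model's maximum length
    return tokenizer(batch['sentence'], truncation=True)

# Apply the tokenizer to every split (train, validation, test) of the DatasetDict
tokenized_dataset = raw_dataset.map(tokenize_fn, batched=True)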

In [7]:
print(tokenized_dataset['train'][0])

{'sentence': 'hide new secretions from the parental units ', 'label': 0, 'idx': 0, 'input_ids': [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
