# Purpose
This notebook explores how to use the Hugging Face `datasets` library, primarily by following the [official tutorial](https://huggingface.co/docs/datasets/tutorial).

# Load
You can inspect a dataset's metadata with `load_dataset_builder` before committing to a download. List the available splits with `get_dataset_split_names`, then load the data with `load_dataset`, passing `split` to select a single split.

In [1]:
from datasets import load_dataset_builder

dataset_name = "rotten_tomatoes"

ds_builder = load_dataset_builder(dataset_name)
print(ds_builder.info.description)
print("Features \n", ds_builder.info.features)

Movie Review Dataset.
This is a dataset of containing 5,331 positive and 5,331 negative processed
sentences from Rotten Tomatoes movie reviews. This data was first used in Bo
Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for
sentiment categorization with respect to rating scales.'', Proceedings of the
ACL, 2005.

Features 
 {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}


In [2]:
from datasets import get_dataset_split_names

print("Splits: ", get_dataset_split_names(dataset_name))

Splits:  ['train', 'validation', 'test']


In [3]:
from datasets import load_dataset

dataset = load_dataset(dataset_name, split="train")
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

# Indexing
You can index the dataset by row with an integer or by column with the column name. You can also slice it to get a range of rows.

In [5]:
dataset[5]["text"]

'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .'

In [6]:
dataset[3:6]["text"]

['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
 "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
 'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .']

# Tokenize
It is also possible to perform preprocessing tasks on the `dataset` object, including:
- tokenize text data
- resample audio data
- transform or augment image data

Tokenization is the most relevant to this project, so I explore it below.

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer(dataset[0]["text"])

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


{'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [8]:
def tokenization(example):
    return tokenizer(example["text"])

dataset = dataset.map(tokenization, batched=True)

Map: 100%|██████████| 8530/8530 [00:00<00:00, 11276.52 examples/s]


## Format
You can also choose which columns are returned and set their output format so the dataset works with a particular framework, such as PyTorch.

In [9]:
# The label column in this dataset is named "label", not "labels".
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
dataset.format['type']

ValueError: PyTorch needs to be installed to be able to return PyTorch tensors.