[![Colab Badge Link](https://img.shields.io/badge/open-in%20colab-blue)](https://colab.research.google.com/github/Glasgow-AI4BioMed/tutorials/blob/main/creating_a_huggingface_dataset_object.ipynb)

# Creating a Hugging Face Dataset and DatasetDict object

Hugging Face tutorials often wrap data in the library's own [Dataset](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset) and [DatasetDict](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.DatasetDict) objects. For example, they are used in the [token classification tutorial](https://huggingface.co/docs/transformers/tasks/token_classification). The API documentation for the two classes can be a little confusing, so creating your own dataset can be a bit fiddly. Here is some example code.


## Install dependencies

If needed, you can install the [datasets package](https://huggingface.co/docs/datasets/index) with the command below.

```
pip install datasets
```

## Creating datasets

Let's make up some arbitrary data (which we assume came from somewhere else). Here the data has already been tokenized and comes with associated named entity recognition (NER) tags, which in this case are nonsense. The data could also be untokenized, in which case your code would need to do the tokenization first.

In [None]:
training_data = [
  {'tokens': ['this', 'is', 'a', 'sentence'], 'ner_tags': ['O', 'O', 'O', 'B-WORD']},
  {'tokens': ['this', 'is', 'another', 'sentence'], 'ner_tags': ['O', 'O', 'O', 'B-WORD']},
  {'tokens': ['look', 'a', 'third', 'sentence'], 'ner_tags': ['O', 'O', 'O', 'B-WORD']}
]

validation_data = [
    {'tokens': ['this', 'is', 'a', 'sentence', 'in', 'the', 'validation', 'set'],
     'ner_tags': ['O', 'O', 'O', 'B-WORD', 'O', 'O', 'O', 'O']}
]
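
Each token needs exactly one tag, so it can be worth checking the alignment before building a dataset. The snippet below is a minimal sketch of such a check. The helper `find_misaligned` and the sample data are hypothetical, made up for illustration, and not part of the `datasets` library.

```python
# Hypothetical helper (not part of the datasets library): report samples whose
# token and tag lists differ in length.
def find_misaligned(samples):
    """Return the indices of samples where len(tokens) != len(ner_tags)."""
    return [
        i for i, sample in enumerate(samples)
        if len(sample['tokens']) != len(sample['ner_tags'])
    ]

# Made-up samples for illustration; the second one has one tag too many.
example_samples = [
    {'tokens': ['a', 'sentence'], 'ner_tags': ['O', 'B-WORD']},
    {'tokens': ['another', 'sentence'], 'ner_tags': ['O', 'O', 'B-WORD']},
]

misaligned = find_misaligned(example_samples)
print(misaligned)  # [1]
```

An empty list means every sample has one tag per token.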

The code below creates a Dataset object for each of the training and validation sets above using the `.from_list` method.

In [None]:
from datasets import Dataset

train_dataset = Dataset.from_list(training_data)
train_dataset

In [None]:
validation_dataset = Dataset.from_list(validation_data)
validation_dataset

Alternatively, if you have the fields stored as separate lists, you could create the dataset using the `.from_dict` method.

In [None]:
training_tokens  = [ ['this', 'is', 'a', 'sentence'], ['this', 'is', 'another', 'sentence'], ['look', 'a', 'third', 'sentence'] ]
training_nertags = [ ['O',    'O',  'O', 'B-WORD'],   ['O',    'O',  'O',       'B-WORD'],   ['O',    'O', 'O',     'B-WORD'] ]

validation_tokens  = [ ['this', 'is', 'a', 'sentence', 'in', 'the', 'validation', 'set' ] ]
validation_nertags = [ ['O',    'O',  'O', 'B-WORD',   'O',  'O',   'O',          'O'] ]

In [None]:
train_dataset = Dataset.from_dict({'tokens':training_tokens, 'ner_tags':training_nertags})
validation_dataset = Dataset.from_dict({'tokens':validation_tokens, 'ner_tags':validation_nertags})
train_dataset
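
The two constructors expect the same information in different layouts: `.from_list` takes one dict per sample, while `.from_dict` takes one list per field. If your data is in one layout and you want the other, plain Python can convert between them. This is a rough sketch on made-up samples, and it assumes every sample has the same keys.

```python
# Made-up samples in the row-oriented layout that Dataset.from_list expects.
rows = [
    {'tokens': ['a', 'sentence'], 'ner_tags': ['O', 'B-WORD']},
    {'tokens': ['look', 'another'], 'ner_tags': ['O', 'O']},
]

# Row-oriented -> column-oriented (the layout Dataset.from_dict expects).
# Assumes every sample has the same keys as the first one.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Column-oriented -> row-oriented again.
rows_again = [dict(zip(columns, values)) for values in zip(*columns.values())]

print(columns['tokens'])   # [['a', 'sentence'], ['look', 'another']]
print(rows_again == rows)  # True
```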

Sometimes, Hugging Face tutorials use a single object that combines all the data splits, so you can easily reference the training or validation parts. This uses a `DatasetDict` object, as below.

In [None]:
from datasets import DatasetDict

dataset = DatasetDict()
dataset['train'] = train_dataset
dataset['validation'] = validation_dataset

dataset

## Benefits

Why is this data structure useful? It lets you access the data either by sample (row) or by field (column).

So, we could get samples by their index:

In [None]:
dataset['train'][0]

But we could already do that with the previous structure:

In [None]:
training_data[0]

What's new is that you can select an individual field (without having to do anything special):

In [None]:
dataset['train']['tokens']

## Applying a function across the dataset

Another nice thing is that the structure works well with the `.map` method, which applies a function to every sample.

Let's make up a text dataset with `label` and `text` fields:

In [None]:
train_samples = [
    { 'label':'positive', 'text': "The restaurant was great" },
    { 'label':'negative', 'text': "Worst meal we've ever had" },
]

val_samples = [
    { 'label':'positive', 'text': "A highlight of the trip" },
    { 'label':'negative', 'text': "I'm still recovering." },
]

Put it into the datasets structure:

In [None]:
dataset = DatasetDict()
dataset['train'] = Dataset.from_list(train_samples)
dataset['validation'] = Dataset.from_list(val_samples)

A common step involves running a tokenizer on the text using a helper function as below.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenizer_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

Then you can use the `.map` method on a dataset as below.

In [None]:
tokenized_train = dataset['train'].map(tokenizer_function)
tokenized_train

Or better yet, run it across the whole dataset including the different splits (train and validation in this case):

In [None]:
tokenized_dataset = dataset.map(tokenizer_function)
tokenized_dataset