[![Colab Badge Link](https://img.shields.io/badge/open-in%20colab-blue)](https://colab.research.google.com/github/Glasgow-AI4BioMed/tutorials/blob/main/creating_a_huggingface_dataset_object.ipynb)

## Example code for creating a HuggingFace Dataset and DatasetDict object

HuggingFace tutorials often wrap datasets in the library's own [Dataset](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset) and [DatasetDict](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.DatasetDict) objects. For example, they are used in the [token classification tutorial](https://huggingface.co/docs/transformers/tasks/token_classification). The API documentation for the two classes can be a little confusing, which makes it fiddly to create your own dataset. So here is some example code.

Before we get stuck in, we need to install the dependencies.

## Install dependencies

If it is not already installed, install the [datasets package](https://huggingface.co/docs/datasets/index) with the command below.

```
pip install datasets
```

## Creating datasets

Let's make up some arbitrary data (which we assume we've got from somewhere else). Here the data has already been tokenized and comes with associated named entity recognition (NER) tags, which in this case are nonsense. The data could also arrive untokenized, in which case your code would need to tokenize it before building the datasets.

In [1]:
training_tokens  = [ ['this', 'is', 'a', 'sentence'], ['this', 'is', 'an', 'another', 'sentence'], ['look', 'a', 'third', 'sentence'] ]
training_nertags = [ ['O',    'O',  'O', 'B-WORD'],   ['O',    'O',  'O', 'O',       'B-WORD'],   ['O',    'O',  'O',    'B-WORD'] ]

validation_tokens  = [ ['this', 'is', 'a', 'sentence', 'in', 'the', 'validation', 'set' ] ]
validation_nertags = [ ['O',    'O',  'O', 'B-WORD',   'O',  'O',   'O',          'O'] ]
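
Each sentence needs exactly one tag per token, and a mismatch is easy to introduce when typing lists by hand. The sketch below is one possible sanity check in plain Python. The helper name `find_misaligned` and the toy data are made up for illustration and are not part of the `datasets` library.

```python
def find_misaligned(tokens, tags):
    """Return the indices of sentences whose token and tag counts differ."""
    return [
        i
        for i, (sentence_tokens, sentence_tags) in enumerate(zip(tokens, tags))
        if len(sentence_tokens) != len(sentence_tags)
    ]

# Toy data (hypothetical). The second sentence is missing a comma, so Python
# joins the adjacent string literals 'more' 'words' into one token 'morewords'.
example_tokens = [['a', 'sentence'], ['two', 'more' 'words']]
example_tags   = [['O', 'B-WORD'],   ['O',   'O',   'B-WORD']]

misaligned = find_misaligned(example_tokens, example_tags)
print(misaligned)
```

An empty list means every sentence lines up. Any index that is returned points at a sentence worth inspecting before building a Dataset from it.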

The code below creates a Dataset object from the training data above, using one dictionary key per column.

In [2]:
from datasets import Dataset

train_dataset = Dataset.from_dict({'tokens':training_tokens, 'ner_tags':training_nertags})
train_dataset

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 3
})

Then do the same for the validation dataset.

In [3]:
validation_dataset = Dataset.from_dict({'tokens':validation_tokens, 'ner_tags':validation_nertags})
validation_dataset

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 1
})

Sometimes, HuggingFace tutorials use a single object that combines all the data splits, so you can easily reference the training or validation parts by name. This is a DatasetDict object, built as shown below.

In [4]:
from datasets import DatasetDict

dataset = DatasetDict()
dataset['train'] = train_dataset
dataset['validation'] = validation_dataset

dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 3
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 1
    })
})