<a href="https://colab.research.google.com/github/TurkuNLP/textual-data-analysis-course/blob/main/sequence_labeling_dataset_examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence labeling dataset examples

Let's have a look at some sequence labeling datasets from the Hugging Face datasets repository (https://huggingface.co/datasets).

(You can find a tutorial to `datasets` here: https://huggingface.co/docs/datasets/tutorial)

First, install the `datasets` Python package:

In [None]:
!pip install --quiet datasets

Make loading a bit less verbose. (This only affects what shows on screen when loading.)

In [None]:
from datasets import disable_progress_bar

disable_progress_bar()

We'll again mainly use the `load_dataset` function to download datasets by name. For available datasets, see https://huggingface.co/datasets and filter by the "token classification" task.

## Example: CoNLL 2003

A classic reference dataset for Named Entity Recognition (NER). It is still relevant for comparing methods, although it has mostly been superseded by more recent resources for practical purposes.

In [None]:
from datasets import load_dataset, load_dataset_builder
from pprint import pprint

builder = load_dataset_builder('conll2003')

print(builder.info.description)

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on
four types of named entities: persons, locations, organizations and names of miscellaneous entities that do
not belong to the previous three groups.

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on
a separate line and there is an empty line after each sentence. The first item on each line is a word, the second
a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags
and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only
if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag
B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2
tagging scheme, whereas the original dataset uses IOB1.
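
The description above mentions two tagging schemes. As a rough illustration of the difference, here is a minimal sketch that converts IOB1 tags to IOB2. The helper name `iob1_to_iob2` and the example tag sequence are made up for illustration; neither comes from the `datasets` library.

```python
def iob1_to_iob2(tags):
    """Convert a list of IOB1 tag strings to IOB2.

    IOB1 uses B-TYPE only when a phrase directly follows another phrase
    of the same type; IOB2 starts every phrase with B-TYPE.
    """
    converted = []
    prev = 'O'  # tag of the previous token in the *original* sequence
    for tag in tags:
        # An I- tag starts a new phrase if the previous token was outside
        # any phrase or belonged to a phrase of a different type.
        if tag.startswith('I-') and (prev == 'O' or prev[2:] != tag[2:]):
            converted.append('B-' + tag[2:])
        else:
            converted.append(tag)
        prev = tag
    return converted


# Illustrative IOB1 sequence (an assumption, not read from the dataset files)
iob1_tags = ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']
print(iob1_to_iob2(iob1_tags))
```

In IOB2, a `B-` tag marks the start of every entity, which makes span boundaries easier to read off than in IOB1.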

Next, load the dataset itself:

In [None]:
conll2003 = load_dataset('conll2003')



Let's see a summary of the contents:

In [None]:
print(conll2003)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


We have the standard three-way split:

* `train`: 14,041 rows
* `validation`: 3,250 rows
* `test`: 3,453 rows

Each row of this dataset has an `id`, `tokens` and three sets of "labels": `pos_tags`, `chunk_tags` and `ner_tags`. The POS and chunk tags are included as part of the input for machine learning approaches with manually engineered features, but are rarely used in deep learning approaches.

Let's look at an example:

In [None]:
pprint(conll2003['train'][0]['tokens'])
print('\nLabel:', conll2003['train'][0]['ner_tags'])

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

Label: [3, 0, 7, 0, 0, 0, 7, 0, 0]


In contrast to text classification datasets:

* The text is pre-split into tokens to facilitate pairing labels (here, NER tags) with tokens. (Note that these will in general _not_ align fully with the tokenization created by a tokenizer for a deep learning model!)
* Each element of the dataset typically has multiple tokens and thus multiple labels, so the number of labeled pieces of data ("signal" for training) is many times larger than the number of rows in the dataset.
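
The first point matters in practice: when a model's tokenizer splits words into subwords, the word-level labels have to be redistributed. Below is a minimal sketch of one common convention, where only the first piece of each word keeps the label and the remaining pieces get an "ignore" value. Note that `toy_subword_split` is a made-up stand-in for a real tokenizer, and the ignore value of `-100` is an assumption (it matches the default `ignore_index` of the PyTorch cross-entropy loss).

```python
IGNORE = -100  # assumed "ignore this position" label value


def toy_subword_split(word, max_len=4):
    """Hypothetical stand-in for a subword tokenizer:
    cut each word into chunks of at most max_len characters."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def align_labels(tokens, labels):
    """Spread word-level labels over subword pieces.

    The first piece of each word keeps the word's label;
    all following pieces are marked with IGNORE.
    """
    pieces, piece_labels = [], []
    for token, label in zip(tokens, labels):
        subwords = toy_subword_split(token)
        pieces.extend(subwords)
        piece_labels.extend(
            label if i == 0 else IGNORE for i in range(len(subwords))
        )
    return pieces, piece_labels


# The first training example shown above
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

pieces, piece_labels = align_labels(tokens, ner_tags)
for piece, label in zip(pieces, piece_labels):
    print(f'{piece}\t{label}')
```

A real tokenizer will split words differently, but the bookkeeping stays the same: the number of labels must match the number of pieces the model actually sees.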

As before, to interpret the label IDs, we can look at `features`:

In [None]:
pprint(conll2003['train'].features)

{'chunk_tags': Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None),
 'id': Value(dtype='string', id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None),
 'pos_tags': Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}


In [None]:
print(conll2003['train'].features['ner_tags'])

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)


Label `0` corresponds to `O` ("outside", i.e. not part of any name), label `1` to `B-PER` ("begin person name"), etc.
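
To make the mapping concrete, here is a small sketch that decodes the label IDs of the example above and groups the tagged tokens into entity mentions. The list of names is copied from the output above (with the dataset loaded, the same list should be available as `conll2003['train'].features['ner_tags'].feature.names`). The helper `extract_spans` is a hypothetical name introduced for this illustration.

```python
NER_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
             'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def extract_spans(tokens, tag_ids, names=NER_NAMES):
    """Group IOB2-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = names[tag_id]
        # An I- tag of the same type continues the open span.
        if tag.startswith('I-') and current_type == tag[2:]:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, ' '.join(current_tokens)))
        if tag == 'O':
            current_type, current_tokens = None, []
        else:
            # A B- tag (or a stray I- tag, treated leniently) opens a new span.
            current_type, current_tokens = tag[2:], [token]
    if current_type is not None:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
tag_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]

print([NER_NAMES[i] for i in tag_ids])
print(extract_spans(tokens, tag_ids))
# → [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```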

---

## Example: OntoNotes NER

Following the same process as above:

In [None]:
builder = load_dataset_builder('conll2012_ontonotesv5', 'english_v12')

print(builder.info.description)

OntoNotes v5.0 is the final version of OntoNotes corpus, and is a large-scale, multi-genre,
multilingual corpus manually annotated with syntactic, semantic and discourse information.

This dataset is the version of OntoNotes v5.0 extended and is used in the CoNLL-2012 shared task.
It includes v4 train/dev and v9 test data for English/Chinese/Arabic and corrected version v12 train/dev/test data (English only).

The source of data is the Mendeley Data repo [ontonotes-conll2012](https://data.mendeley.com/datasets/zmycy7t9h9), which seems to be as the same as the official data, but users should use this dataset on their own responsibility.

See also summaries from paperwithcode, [OntoNotes 5.0](https://paperswithcode.com/dataset/ontonotes-5-0) and [CoNLL-2012](https://paperswithcode.com/dataset/conll-2012-1)

For more detailed info of the dataset like annotation, tag set, etc., you can refer to the documents in the Mendeley repo mentioned above.



In [None]:
ontov5eng = load_dataset('conll2012_ontonotesv5', 'english_v12')

Downloading and preparing dataset conll2012_ontonotesv5/english_v12 to /root/.cache/huggingface/datasets/conll2012_ontonotesv5/english_v12/1.0.0/c541e760a5983b07e403e77ccf1f10864a6ae3e3dc0b994112eff9f217198c65...
Dataset conll2012_ontonotesv5 downloaded and prepared to /root/.cache/huggingface/datasets/conll2012_ontonotesv5/english_v12/1.0.0/c541e760a5983b07e403e77ccf1f10864a6ae3e3dc0b994112eff9f217198c65. Subsequent calls will reuse this data.


In [None]:
print(ontov5eng)

DatasetDict({
    train: Dataset({
        features: ['document_id', 'sentences'],
        num_rows: 10539
    })
    validation: Dataset({
        features: ['document_id', 'sentences'],
        num_rows: 1370
    })
    test: Dataset({
        features: ['document_id', 'sentences'],
        num_rows: 1200
    })
})


Here we have the conventional split into train, validation and test sets, but this time on the document level:

* `train`: 10,539 documents
* `validation`: 1,370 documents
* `test`: 1,200 documents

Let's see what's in one of those documents:

In [None]:
print(type(ontov5eng['train'][0]['sentences']))

<class 'list'>


In [None]:
pprint(ontov5eng['train'][0]['sentences'][0])

{'coref_spans': [],
 'named_entities': [0, 0, 0, 0, 0],
 'parse_tree': '(TOP(SBARQ(WHNP(WHNP (WP What)  (NN kind) )(PP (IN of) (NP (NN '
               'memory) ))) (. ?) ))',
 'part_id': 0,
 'pos_tags': [48, 25, 18, 25, 8],
 'predicate_framenet_ids': [None, None, None, None, None],
 'predicate_lemmas': [None, None, None, 'memory', None],
 'speaker': 'Speaker#1',
 'srl_frames': [],
 'word_senses': [None, None, None, 1.0, None],
 'words': ['What', 'kind', 'of', 'memory', '?']}


This is quite a lot of annotation! For NER, we would only use `words` (tokens) and `named_entities`.
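
Since each row here is a whole document, getting sentence-level NER examples like those in CoNLL 2003 takes a flattening step. The sketch below runs on a small hand-made stand-in for the dataset so that it works without downloading anything: the first sentence is copied from the output above (keeping only the two fields we need), while the second sentence and the `document_id` are invented for illustration. The function name `flatten_documents` is likewise hypothetical.

```python
def flatten_documents(documents):
    """Yield one {'tokens', 'ner_tags'} dict per sentence,
    dropping all other annotation layers."""
    for document in documents:
        for sentence in document['sentences']:
            yield {
                'tokens': sentence['words'],
                'ner_tags': sentence['named_entities'],
            }


# Hand-made stand-in with the same nesting as the rows shown above
toy_documents = [
    {
        'document_id': 'toy/doc_0',  # invented ID
        'sentences': [
            # Copied from the output above
            {'words': ['What', 'kind', 'of', 'memory', '?'],
             'named_entities': [0, 0, 0, 0, 0]},
            # Invented example sentence
            {'words': ['Paris', 'is', 'nice', '.'],
             'named_entities': [9, 0, 0, 0]},
        ],
    },
]

flat = list(flatten_documents(toy_documents))
for example in flat:
    print(example)
```

With the real data loaded, passing `ontov5eng['train']` instead of `toy_documents` should work the same way, since iterating over a split yields one dictionary per document.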

Again, we can find the textual labels for numerical IDs in `features`:

In [None]:
pprint(ontov5eng['train'].features)

{'document_id': Value(dtype='string', id=None),
 'sentences': [{'coref_spans': Sequence(feature=Sequence(feature=Value(dtype='int32', id=None), length=3, id=None), length=-1, id=None),
                'named_entities': Sequence(feature=ClassLabel(names=['O', 'B-PERSON', 'I-PERSON', 'B-NORP', 'I-NORP', 'B-FAC', 'I-FAC', 'B-ORG', 'I-ORG', 'B-GPE', 'I-GPE', 'B-LOC', 'I-LOC', 'B-PRODUCT', 'I-PRODUCT', 'B-DATE', 'I-DATE', 'B-TIME', 'I-TIME', 'B-PERCENT', 'I-PERCENT', 'B-MONEY', 'I-MONEY', 'B-QUANTITY', 'I-QUANTITY', 'B-ORDINAL', 'I-ORDINAL', 'B-CARDINAL', 'I-CARDINAL', 'B-EVENT', 'I-EVENT', 'B-WORK_OF_ART', 'I-WORK_OF_ART', 'B-LAW', 'I-LAW', 'B-LANGUAGE', 'I-LANGUAGE'], id=None), length=-1, id=None),
                'parse_tree': Value(dtype='string', id=None),
                'part_id': Value(dtype='int32', id=None),
                'pos_tags': Sequence(feature=ClassLabel(names=['XX', '``', '$', "''", '*', ',', '-LRB-', '-RRB-', '.', ':', 'ADD', 'AFX', 'CC', 'CD', 'DT', 'EX', 'FW', 'HYPH',

In [None]:
print(ontov5eng['train'].features['sentences'][0]['named_entities'])

Sequence(feature=ClassLabel(names=['O', 'B-PERSON', 'I-PERSON', 'B-NORP', 'I-NORP', 'B-FAC', 'I-FAC', 'B-ORG', 'I-ORG', 'B-GPE', 'I-GPE', 'B-LOC', 'I-LOC', 'B-PRODUCT', 'I-PRODUCT', 'B-DATE', 'I-DATE', 'B-TIME', 'I-TIME', 'B-PERCENT', 'I-PERCENT', 'B-MONEY', 'I-MONEY', 'B-QUANTITY', 'I-QUANTITY', 'B-ORDINAL', 'I-ORDINAL', 'B-CARDINAL', 'I-CARDINAL', 'B-EVENT', 'I-EVENT', 'B-WORK_OF_ART', 'I-WORK_OF_ART', 'B-LAW', 'I-LAW', 'B-LANGUAGE', 'I-LANGUAGE'], id=None), length=-1, id=None)
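
That is a long list of labels. As a quick sketch, we can strip the `B-`/`I-` prefixes to see how many distinct entity types there are. The list below is copied from the output above; the variable names are only for illustration.

```python
ONTONOTES_NAMES = [
    'O', 'B-PERSON', 'I-PERSON', 'B-NORP', 'I-NORP', 'B-FAC', 'I-FAC',
    'B-ORG', 'I-ORG', 'B-GPE', 'I-GPE', 'B-LOC', 'I-LOC',
    'B-PRODUCT', 'I-PRODUCT', 'B-DATE', 'I-DATE', 'B-TIME', 'I-TIME',
    'B-PERCENT', 'I-PERCENT', 'B-MONEY', 'I-MONEY',
    'B-QUANTITY', 'I-QUANTITY', 'B-ORDINAL', 'I-ORDINAL',
    'B-CARDINAL', 'I-CARDINAL', 'B-EVENT', 'I-EVENT',
    'B-WORK_OF_ART', 'I-WORK_OF_ART', 'B-LAW', 'I-LAW',
    'B-LANGUAGE', 'I-LANGUAGE',
]

# Drop 'O' and the two-character 'B-' / 'I-' prefix, keeping first-seen order
entity_types = list(dict.fromkeys(
    name[2:] for name in ONTONOTES_NAMES if name != 'O'
))

print(len(ONTONOTES_NAMES), 'labels')    # → 37 labels
print(len(entity_types), 'entity types') # → 18 entity types
print(entity_types)
```

Compared with the four types in CoNLL 2003, the OntoNotes tag set is much more fine-grained and also covers non-name expressions such as dates, amounts of money and ordinals.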


---

## Example: `wikiann`

[`wikiann`](https://huggingface.co/datasets/wikiann/viewer/fi/validation) is a massively multilingual automatically annotated NER dataset based on Wikipedia. Let's have a look at the Finnish (`fi`) subset.

In [None]:
builder = load_dataset_builder('wikiann', 'fi')

pprint(builder.info.description)

('WikiANN (sometimes called PAN-X) is a multilingual named entity recognition '
 'dataset consisting of Wikipedia articles annotated with LOC (location), PER '
 '(person), and ORG (organisation) tags in the IOB2 format. This version '
 'corresponds to the balanced train, dev, and test splits of Rahimi et al. '
 '(2019), which supports 176 of the 282 languages from the original WikiANN '
 'corpus.')


In [None]:
wikiannfi = load_dataset('wikiann', 'fi')

Downloading and preparing dataset wikiann/fi to /root/.cache/huggingface/datasets/wikiann/fi/1.1.0/4bfd4fe4468ab78bb6e096968f61fab7a888f44f9d3371c2f3fea7e74a5a354e...
Dataset wikiann downloaded and prepared to /root/.cache/huggingface/datasets/wikiann/fi/1.1.0/4bfd4fe4468ab78bb6e096968f61fab7a888f44f9d3371c2f3fea7e74a5a354e. Subsequent calls will reuse this data.


In [None]:
print(wikiannfi)

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 20000
    })
})


The usual split:

* `train`: 20,000 rows
* `validation`: 10,000 rows
* `test`: 10,000 rows

In [None]:
pprint(wikiannfi['train'][0]['tokens'])
print('\nLabel:', wikiannfi['train'][0]['ner_tags'])

['Se',
 'sijaitsee',
 'Borneon',
 'saarella',
 'ja',
 'on',
 'Etelä-Kalimantanin',
 'provinssin',
 'pääkaupunki',
 '.']

Label: [0, 0, 5, 0, 0, 0, 5, 0, 0, 0]


In [None]:
print(wikiannfi['train'].features['ner_tags'])

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None)
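
Comparing the outputs, the `wikiann` label list looks like the CoNLL 2003 list without the two `MISC` labels. The following sketch checks this on the lists copied from the outputs above and decodes the Finnish example; the variable names are only for illustration.

```python
CONLL_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
               'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
WIKIANN_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

# The wikiann labels are the first seven CoNLL 2003 labels, in the same
# order, so the shared label IDs mean the same thing in both datasets.
print(CONLL_NAMES[:len(WIKIANN_NAMES)] == WIKIANN_NAMES)  # → True

# The first Finnish training example shown above
tokens = ['Se', 'sijaitsee', 'Borneon', 'saarella', 'ja', 'on',
          'Etelä-Kalimantanin', 'provinssin', 'pääkaupunki', '.']
tag_ids = [0, 0, 5, 0, 0, 0, 5, 0, 0, 0]

decoded = [(token, WIKIANN_NAMES[tag_id]) for token, tag_id in zip(tokens, tag_ids)]
for token, tag in decoded:
    print(f'{token}\t{tag}')
```

Both entity mentions in this sentence are single-token locations, so only `B-LOC` appears and no `I-LOC`.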
