# Tutorial on Token Classification (HF): Load Dataset and Preprocess

# Load data

First, let's load the CoNLL-2003 dataset (`conll2003`) and see what it looks like.

In [32]:
from datasets import load_dataset

raw_datasets = load_dataset("conll2003")

print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


The unique NER labels are stored in `raw_datasets["train"].features["ner_tags"]`:

In [34]:
raw_datasets["train"].features["ner_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
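
The `names` list defines the mapping between integer ids and tag strings: the id of a tag is simply its position in the list. A minimal sketch of building both lookup directions by hand, with the list copied from the output above (the variable names `label_names`, `id2label` and `label2id` are illustrative choices, not part of the dataset):

```python
# Tag names copied from the ClassLabel output above; a tag's id is its index.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# id -> name and name -> id lookups
id2label = dict(enumerate(label_names))
label2id = {name: i for i, name in id2label.items()}

print(id2label[7])
print(label2id["B-PER"])
```

The `ClassLabel` feature exposes the same mapping through its `int2str` and `str2int` methods, so the dictionaries are mainly useful when a plain Python mapping is needed.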

Print the NER, POS and chunk tags for each pre-tokenized word of one sample:

In [39]:
ner_tags = raw_datasets["train"].features["ner_tags"].feature.names
pos_tags = raw_datasets["train"].features["pos_tags"].feature.names
chunk_tags = raw_datasets["train"].features["chunk_tags"].feature.names
idx = 10
sample = raw_datasets["train"][idx]

line1 = ""
line2 = ""
line3 = ""
line4 = ""
for token, ner, pos_tag, chunk in zip(sample["tokens"], sample["ner_tags"], sample["pos_tags"], sample["chunk_tags"]):
    # Look up the string name of each integer tag id
    full_ner = ner_tags[ner]
    full_pos = pos_tags[pos_tag]
    full_chunk = chunk_tags[chunk]
    max_length = max(len(token), len(full_pos), len(full_chunk), len(full_ner))
    line1 += f"{token:>{max_length}s} "
    line2 += f"{full_ner:>{max_length}s} "
    line3 += f"{full_pos:>{max_length}s} "
    line4 += f"{full_chunk:>{max_length}s} "
    
print(line1)
print(line2)
print(line3)
print(line4)


Spanish Farm Minister Loyola    de Palacio  had earlier accused Fischler   at   an    EU farm ministers    ' meeting   of causing unjustified alarm through " dangerous generalisation . " 
 B-MISC    O        O  B-PER I-PER   I-PER    O       O       O    B-PER    O    O B-ORG    O         O    O       O    O       O           O     O       O O         O              O O O 
    NNP  NNP      NNP    NNP   NNP     NNP  VBD     RBR     VBN      NNP   IN   DT    JJ   NN       NNS  POS      NN   IN     VBG          JJ    NN      IN "        JJ             NN . " 
   B-NP I-NP     I-NP   I-NP  I-NP    I-NP B-VP    I-VP    I-VP     B-NP B-PP B-NP  I-NP I-NP      I-NP B-NP    I-NP B-PP    B-VP      B-ADJP  B-NP    B-PP O      B-NP           I-NP O O 
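
To read the NER row above: a `B-` tag opens an entity, the following `I-` tags of the same type continue it, and `O` marks tokens outside any entity. The sketch below groups tagged tokens into entities under that reading. The helper name `extract_entities` is hypothetical, and the tokens and tags are the first 14 columns of the printed sample.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A new entity starts (a stray I- tag is treated like a B- tag).
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["Spanish", "Farm", "Minister", "Loyola", "de", "Palacio", "had",
          "earlier", "accused", "Fischler", "at", "an", "EU", "farm"]
tags = ["B-MISC", "O", "O", "B-PER", "I-PER", "I-PER", "O",
        "O", "O", "B-PER", "O", "O", "B-ORG", "O"]

entities = extract_entities(tokens, tags)
print(entities)
```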


# Tokenize words

In the previous step, the texts were already pre-tokenized into words, each with its corresponding labels. However, we still need to tokenize them according to the model we are going to use.

In this case, as it is a BERT-like model, the tokenizer splits words into subwords, e.g. a split of the form "united" -> "unit" + "##ed".

In [7]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

This is a fast tokenizer (see the [Table of Fast Tokenizers](https://huggingface.co/transformers/#supported-frameworks)); fast tokenizers are explained in depth in [Chapter 6](https://huggingface.co/learn/nlp-course/en/chapter6/3).

Besides their parallelization capabilities, the key functionality of fast tokenizers is that they always keep track of the original span of text each final token comes from, a feature called offset mapping. This in turn unlocks features like mapping each word to the tokens it generated, or mapping each character of the original text to the token it is inside, and vice versa.

Useful attributes and methods:
- `tokenizer.is_fast` to check whether the tokenizer is a fast one.
- `encoding.tokens()` to get the tokens (as strings) produced for the input.
```python
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(encoding.tokens())
```

```python
['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']
```
- `encoding.word_ids()` to get, for each token, the index of the word it comes from: `None` for special tokens, and the same index for all tokens that belong to the same word.
```python
[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]
```
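
To make the role of the word ids concrete, the sketch below regroups the sub-tokens listed above into their original words using only the two lists shown. It assumes the WordPiece convention that continuation pieces start with `##`; the lists are copied from the example output, so no tokenizer is needed to run it.

```python
from itertools import groupby

# Copied from the example output above.
tokens = ['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I',
          'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

words = []
# Consecutive tokens sharing a word id belong to the same word.
for word_id, group in groupby(zip(word_ids, tokens), key=lambda pair: pair[0]):
    if word_id is None:
        continue  # skip special tokens such as [CLS] and [SEP]
    pieces = [tok for _, tok in group]
    # Assumption: a leading "##" marks a continuation piece and can be dropped.
    words.append("".join(p[2:] if p.startswith("##") else p for p in pieces))

print(words)
```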


In [8]:
tokenizer.is_fast

True

In [11]:
sample.keys()

dict_keys(['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'])

In [31]:
print(tokenizer(sample["tokens"], is_split_into_words=True).word_ids()[:10])
print(tokenizer(sample["tokens"], is_split_into_words=True).word_ids()[-10:])

[None, 0, 1, 2, 3, 4, 5, 5, 5, 6]
[19, 20, 21, 22, 23, 24, 24, 25, 26, None]


# Align tokens with labels

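The tokenizer has split some of our words into sub-words, so there are now more tokens than labels. We need to align the labels with the new tokens: special tokens get `-100`, the first sub-token of a word keeps the word's label, and the following sub-tokens get the matching `I-` label.

```
Fischler [B-PER] -> ['Fi', '##sch', '##ler'] [B-PER, I-PER, I-PER]
Palacio  [I-PER] -> ['Pa', '##la', '##cio']  [I-PER, I-PER, I-PER]
```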
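
In the `ner_tags` label list, every `B-XXX` tag has an odd id and its `I-XXX` counterpart is the next id, so turning a B tag into an I tag is simple arithmetic. The sketch below checks that property on the label list copied from the dataset features; the helper name `to_inside` is hypothetical. Note that the parity trick depends on this particular ordering and would not hold for an arbitrary label list.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Verify the ordering assumption: B- tags sit at odd ids, followed by their I- tag.
for i, name in enumerate(label_names):
    if name.startswith("B-"):
        assert i % 2 == 1
        assert label_names[i + 1] == "I-" + name[2:]


def to_inside(label_id):
    """Map a B-XXX id to its I-XXX id; leave O and I-XXX ids unchanged."""
    return label_id + 1 if label_id % 2 == 1 else label_id


converted = [label_names[to_inside(i)] for i in range(len(label_names))]
print(converted)
```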

In [42]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

new_labs = align_labels_with_tokens(sample["ner_tags"], tokenizer(sample["tokens"], is_split_into_words=True).word_ids())
print(sample["tokens"])
print(tokenizer(sample["tokens"], is_split_into_words=True).tokens())
print(new_labs)
# Decode with the NER label names: the aligned ids are NER tags, not chunk tags
print([raw_datasets["train"].features["ner_tags"].feature.names[i] for i in new_labs[1:-1]])
print(len(sample["tokens"]))
print(len(new_labs))

['Spanish', 'Farm', 'Minister', 'Loyola', 'de', 'Palacio', 'had', 'earlier', 'accused', 'Fischler', 'at', 'an', 'EU', 'farm', 'ministers', "'", 'meeting', 'of', 'causing', 'unjustified', 'alarm', 'through', '"', 'dangerous', 'generalisation', '.', '"']
['[CLS]', 'Spanish', 'Farm', 'Minister', 'Loyola', 'de', 'Pa', '##la', '##cio', 'had', 'earlier', 'accused', 'Fi', '##sch', '##ler', 'at', 'an', 'EU', 'farm', 'ministers', "'", 'meeting', 'of', 'causing', 'un', '##ju', '##st', '##ified', 'alarm', 'through', '"', 'dangerous', 'general', '##isation', '.', '"', '[SEP]']
[-100, 7, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
['B-MISC', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
27
37
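
The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labelled `-100` do not contribute to the loss. A common alternative to the strategy above is to keep the label only on the first sub-token of each word and mask the remaining sub-tokens with `-100`. A minimal sketch of that variant follows; the function name `align_first_subtoken_only` is hypothetical, and the toy `word_ids` correspond to the fragment "accused Fischler at" from the sample.

```python
def align_first_subtoken_only(labels, word_ids):
    """Label the first sub-token of each word; mask everything else with -100."""
    new_labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)             # special token
        elif word_id != previous_word:
            new_labels.append(labels[word_id])  # first sub-token of a word
        else:
            new_labels.append(-100)             # later sub-token of the same word
        previous_word = word_id
    return new_labels


# Tokens: [CLS] accused Fi ##sch ##ler at [SEP]
word_ids = [None, 0, 1, 1, 1, 2, None]
labels = [0, 1, 0]  # O, B-PER, O

aligned = align_first_subtoken_only(labels, word_ids)
print(aligned)
```

With this variant the B-to-I conversion is not needed, because continuation sub-tokens never receive a label.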


Now we apply this process to every example in the dataset: the words are tokenized and their labels are aligned with the resulting tokens.

In [43]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs


As the tokenizer is a fast one, it benefits from being applied in batches (`batched=True`):

In [44]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

Map: 100%|██████████| 3250/3250 [00:00<00:00, 22460.96 examples/s]
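
After mapping, each example should have exactly one label per token. A small check along these lines can catch alignment mistakes early. The helper name `check_alignment` is hypothetical, and the toy batch below, with made-up ids, stands in for a slice such as `tokenized_datasets["train"][:2]`, which is a dictionary of column lists.

```python
def check_alignment(batch):
    """Return the indices of examples whose labels and input_ids differ in length."""
    return [
        i
        for i, (ids, labels) in enumerate(zip(batch["input_ids"], batch["labels"]))
        if len(ids) != len(labels)
    ]


# Made-up ids for illustration; the second example is deliberately one label short.
toy_batch = {
    "input_ids": [[101, 7, 8, 9, 102], [101, 5, 102]],
    "labels": [[-100, 0, 1, 2, -100], [-100, 0]],
}

mismatched = check_alignment(toy_batch)
print(mismatched)
```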
