# Let's get started with the XTREME benchmark from Hugging Face Datasets
To import the dataset, we can use the `load_dataset` function from the `datasets` library.
This benchmark includes a variety of tasks across multiple languages, making it a great choice for evaluating multilingual models.
It uses the IOB format for sequence labeling tasks, which is a common format for named entity recognition (NER) and other similar tasks.

In [35]:
from datasets import get_dataset_config_names

xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")


XTREME has 183 configurations


Whoa, that’s a lot of configurations! `XTREME` includes a variety of tasks such as:
- Named Entity Recognition (NER)
- Part-of-Speech Tagging (POS)
- Question Answering (QA)
- Sentence Retrieval (SR)

But we'll focus on the `NER` task for this example.
Let’s narrow the search by looking only for the configurations that start with “`PAN`”.

**Why?**

Because `PAN-X` is the subset of `XTREME` that focuses on `NER` across multiple languages.

In [2]:
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]

['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg']

So, we have several configurations for `PAN-X`, each corresponding to a different language.
As you can see, each one ends with a two-letter language code, such as `en` for English, `de` for German, and `fr` for French. These codes follow the **ISO 639-1** standard.
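
As a small sketch of that naming scheme, the language code can be split off the end of a configuration name. The three names below are copied from the output above, so the list is deliberately incomplete.

```python
# Config names copied from the output above; the real list is longer.
panx_subsets = ["PAN-X.af", "PAN-X.ar", "PAN-X.bg"]

# Everything after the last dot is the language code.
lang_codes = [name.rsplit(".", 1)[-1] for name in panx_subsets]
print(lang_codes)  # ['af', 'ar', 'bg']
```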

Alright, if we want to use the German corpus, we can load it like this:

In [3]:
from datasets import load_dataset

load_dataset("xtreme", name="PAN-X.de")

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
})

But what if we want to load multiple languages at once? Take, for example, a Swiss corpus that includes German, French, Italian, and English.

This corpus is particularly interesting because it reflects the multilingual nature of Switzerland, where several languages are spoken in imbalanced proportions.

The spoken proportions are roughly:
- 62.9% German (de)
- 22.9% French (fr)
- 8.4% Italian (it)
- 5.9% English (en)

To keep track of each language, let’s create a Python `defaultdict` that stores the language code as the key and a `PAN-X` corpus of type `DatasetDict` as the value:

In [4]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)
for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows))))


To make sure we don't accidentally bias our dataset splits, we `shuffle` each split with a fixed seed before downsampling it according to the spoken proportion.
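
To see the idea without any library, here is a minimal plain-Python sketch of "shuffle with a fixed seed, then keep the first share". The `downsample` helper and the list of integers standing in for rows are assumptions made for illustration; they are not part of 🤗 Datasets.

```python
import random


def downsample(rows, frac, seed=0):
    """Shuffle a copy of `rows` with a fixed seed, then keep the first `frac` share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[: int(frac * len(shuffled))]


rows = list(range(20_000))  # stand-in for a 20,000-row training split
for lang, frac in zip(["de", "fr", "it", "en"], [0.629, 0.229, 0.084, 0.059]):
    print(lang, len(downsample(rows, frac)))

# Same seed, same subset: the sample is reproducible across runs.
print(downsample(rows, 0.084) == downsample(rows, 0.084))  # True
```

Shuffling first matters because taking the first rows of an unshuffled split would keep whatever ordering the original file happened to have.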

Let's take a look at the number of training examples in each language:

In [5]:
import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs},
             index=["Number of training examples"])

|  | de | fr | it | en |
| --- | --- | --- | --- | --- |
| Number of training examples | 12580 | 4580 | 1680 | 1180 |


As you can see, we have more training examples for German than for the other languages, which reflects the linguistic landscape of Switzerland.

So, we can use it as a starting point from which to perform zero-shot cross-lingual transfer to French, Italian, and English.

Let's take a look at a few examples from the German training set:

In [6]:
element = panx_ch["de"]["train"][0]
for key, value in element.items():
    print(f"{key}: {value}")


tokens: ['2.000', 'Einwohnern', 'an', 'der', 'Danziger', 'Bucht', 'in', 'der', 'polnischen', 'Woiwodschaft', 'Pommern', '.']
ner_tags: [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']


As you can see, each example consists of a sentence and its corresponding named entity tags in IOB format.

The `ner_tags` column maps each entity to a class ID. This is a bit cryptic, so let's add a column that maps each class ID to its corresponding entity label.

First, let's take a look at the features of the dataset to find the mapping:

In [7]:
for key, value in panx_ch["de"]["train"].features.items():
    print(f"{key}: {value}")

tokens: List(Value('string'))
ner_tags: List(ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']))
langs: List(Value('string'))


The `ner_tags` feature is of type `ClassLabel`, which means it has a predefined set of labels.

Let's pick up the mapping from class IDs to entity labels:

In [8]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)

ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'])


The `ClassLabel` object provides a method called `int2str` that allows us to convert class IDs to their corresponding string labels.
With the `map` method, we can easily create a new column in the dataset that contains the string labels for each entity tag.

In [9]:
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}


panx_de = panx_ch["de"].map(create_tag_names)

And now, let's take a look at the first example in the German training set with the new `ner_tags_str` column.
> Yeah, these are data analyst habits! 😅

In [10]:
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]],
             ['Tokens', 'Tags'])


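|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
| Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |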


The presence of the `LOC` tags makes sense since the sentence “2.000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern” means “2,000 inhabitants at the Gdansk Bay in the Polish voivodeship of Pomerania” in English. And “**Danziger Bucht**” is indeed a location, a bay in the Baltic Sea.
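
To make the IOB convention concrete, here is a minimal sketch that groups the tagged tokens into entity spans. The `iob_to_spans` helper is a hypothetical name written for this example (it is not part of 🤗 Datasets); the tokens and tags are copied from the table above.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_type = tag.partition("-")
        # An I- tag continues the open span only if the entity type matches.
        if prefix == "I" and current_type == tag_type:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        # B- opens a new span; a stray I- is treated as a span start as well.
        if prefix in ("B", "I"):
            current_tokens, current_type = [token], tag_type
        else:
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["2.000", "Einwohnern", "an", "der", "Danziger", "Bucht", "in", "der",
          "polnischen", "Woiwodschaft", "Pommern", "."]
tags = ["O", "O", "O", "O", "B-LOC", "I-LOC", "O", "O", "B-LOC", "B-LOC", "I-LOC", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
# [('Danziger Bucht', 'LOC'), ('polnischen', 'LOC'), ('Woiwodschaft Pommern', 'LOC')]
```

The `B-` prefix is what separates “polnischen” from “Woiwodschaft Pommern”: two adjacent entities of the same type would otherwise merge into one.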


Now, let's make a quick check that there is no unusual imbalance in the tags by looking at the distribution of each entity type across the splits.

In [11]:
from collections import Counter

split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")

|  | LOC | ORG | PER |
| --- | --- | --- | --- |
| train | 6186 | 5366 | 5810 |
| validation | 3172 | 2683 | 2893 |
| test | 3180 | 2573 | 3071 |


This is a pretty good distribution of entity tags across the `training`, `validation`, and `test` sets.

The relative frequencies of `LOC`, `PER`, and `ORG` are roughly the same in each split, which is what we want to see.
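
Raw counts are hard to compare across splits of different sizes, so here is a short sketch that turns the counts from the table above into percentage shares per split. The numbers are copied from that table; nothing is recomputed from the dataset.

```python
# Entity counts copied from the table above.
counts = {
    "train": {"LOC": 6186, "ORG": 5366, "PER": 5810},
    "validation": {"LOC": 3172, "ORG": 2683, "PER": 2893},
    "test": {"LOC": 3180, "ORG": 2573, "PER": 3071},
}

shares = {}
for split, freqs in counts.items():
    total = sum(freqs.values())
    # Share of each entity type within the split, in percent.
    shares[split] = {tag: round(100 * n / total, 1) for tag, n in freqs.items()}
    print(split, shares[split])
```

Each entity type stays within about two percentage points across the three splits.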


# What are we going to do next?

As I mentioned earlier, we are going to perform zero-shot cross-lingual transfer from German to French, Italian, and English.

What is Zero-shot cross-lingual transfer?

>In short, it means training a model on one language (German in this case) and then evaluating its performance on other languages (French, Italian, and English) without any additional training on those languages.

So, now, we need a model to evaluate.

One of the first multilingual transformers was `mBERT`, which uses the same architecture and pretraining objective as `BERT` but is trained on Wikipedia articles in many languages. Since then, other models like `XLM-R`, `mT5`, and `mDeBERTaV3` have shown even better performance on various multilingual benchmarks.

So we will focus on `XLM-R`, which applies the robustly optimized `RoBERTa` pretraining recipe to multilingual text (filtered Common Crawl data covering 100 languages) and has shown strong performance on various multilingual benchmarks, including `XTREME`.

`XLM-R` is a transformer-based model like `BERT`, but it uses a `SentencePiece` tokenizer instead of the `WordPiece` tokenizer used in `BERT`.

To get a feel for how SentencePiece compares to WordPiece, let's load the BERT tokenizer and the XLM-R tokenizer in 🤗(Hugging Face) Transformers.

In [12]:
from transformers import AutoTokenizer

bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

And now, let's take an example sentence:

In [13]:
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()
bert_tokens, xlmr_tokens

(['[CLS]', 'Jack', 'Spa', '##rrow', 'loves', 'New', 'York', '!', '[SEP]'],
 ['<s>', '▁Jack', '▁Spar', 'row', '▁love', 's', '▁New', '▁York', '!', '</s>'])

Here, one of the main differences is that instead of the `[CLS]` and `[SEP]` tokens that `BERT` uses for sentence classification tasks, `XLM-R` uses `<s>` and `</s>` to denote the start and end of a sequence.

Another difference is how the two tokenizers handle subword tokenization. `BERT` uses `##` to indicate that a token is a continuation of the previous token, while `XLM-R` uses `▁` to mark a token that is preceded by whitespace (the start of a new word, or a bare space).

Here, the `BERT` tokenizer lost the information that there is no whitespace between “York” and “!”.

Why?

>In short, the `WordPiece` tokenizer first splits text on whitespace and punctuation, then breaks each word into subword units from its vocabulary. It does not explicitly preserve whitespace information.
In contrast, the `SentencePiece` tokenizer treats the input as a raw character sequence and uses a statistical model (`Unigram` or `BPE`) to segment it, encoding spaces explicitly and requiring no pre-tokenization.
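
To see what is lost on the WordPiece side, here is a naive attempt to rebuild the sentence from the `BERT` tokens shown above. The join-and-replace rule is a simplification chosen for illustration, not the tokenizer's own decoding logic.

```python
# WordPiece tokens copied from the BERT output above, without [CLS] and [SEP].
bert_pieces = ["Jack", "Spa", "##rrow", "loves", "New", "York", "!"]

# Join on spaces, then glue continuation pieces back onto the previous piece.
naive_text = " ".join(bert_pieces).replace(" ##", "")
print(naive_text)  # Jack Sparrow loves New York !
```

The stray space before `!` shows that the original spacing cannot be recovered from the WordPiece tokens alone. Because SentencePiece stores the whitespace in the tokens themselves, reversing it is a plain string operation: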

In [14]:
"".join(xlmr_tokens).replace(u"\u2581", " ")

'<s> Jack Sparrow loves New York!</s>'

# Do you know what the transformer model architecture looks like?
In general, it has two main components:
1. The model body (the transformer layers)
2. The task-specific head (like a classification head for text classification tasks or a token classification head for NER tasks)

So, as a simple exercise, let's implement a token classification model using `XLM-R` as the model body.

> Note: In fact, 🤗 Transformers library already provides a pre-implemented class called `XLMRobertaForTokenClassification` that does exactly this. But implementing it from scratch is a great way to understand how these models work under the hood. And if someday you need a custom modification, you will know where to start.


In [15]:
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel


class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
    config_class = XLMRobertaConfig

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        # Load model body
        self.roberta = RobertaModel(config, add_pooling_layer=False)
        # Set up token classification head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Load and initialize weights
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None, **kwargs):
        # Use model body to get encoder representations
        outputs = self.roberta(input_ids, attention_mask=attention_mask,
                               token_type_ids=token_type_ids, **kwargs)
        # Apply classifier to encoder representation
        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output)
        # Calculate losses
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        # Return model output object
        return TokenClassifierOutput(loss=loss, logits=logits,
                                     hidden_states=outputs.hidden_states,
                                     attentions=outputs.attentions)


As you can see, we first instantiate the `XLM-R` model body using the `RobertaModel` class from 🤗 Transformers. Then, we set up a token classification head consisting of a dropout layer followed by a linear layer that maps the hidden states to the number of labels.

Our custom class inherits from `RobertaPreTrainedModel`, which provides useful methods for loading and saving pretrained models.

But you might be wondering: what does `config_class` do?
> In 🤗 Transformers, each model inherits from a base class like `PreTrainedModel` (here `RobertaPreTrainedModel`).
> These base classes define several utility methods such as:
> - from_pretrained(...)
> - save_pretrained(...)
> - from_config(...)
>
> These methods often need to know which configuration class to use for the specific model.
> By setting the `config_class` attribute, we inform the base class about the appropriate configuration class to use when instantiating the model from a configuration object.

Note that we set `add_pooling_layer=False` to ensure all `hidden states` are returned and not only the one associated with the first token (`[CLS]` in `BERT`, `<s>` in `XLM-R`).
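
Coming back to `config_class` for a moment, the lookup pattern can be sketched with toy classes. Everything below (`ToyConfig`, `ToyPreTrainedModel`, `from_settings`) is hypothetical and written only to illustrate the idea; it is not how 🤗 Transformers is implemented.

```python
class ToyConfig:
    """Hypothetical stand-in for a model configuration class."""

    def __init__(self, num_labels=2):
        self.num_labels = num_labels


class ToyPreTrainedModel:
    """Hypothetical base class; NOT the real transformers implementation."""

    config_class = None  # subclasses say which config type they expect

    def __init__(self, config):
        self.config = config

    @classmethod
    def from_settings(cls, **settings):
        # The base class never names a concrete config type: it looks it up
        # on whichever subclass the method was called on.
        config = cls.config_class(**settings)
        return cls(config)


class ToyTokenClassifier(ToyPreTrainedModel):
    config_class = ToyConfig


model = ToyTokenClassifier.from_settings(num_labels=7)
print(type(model.config).__name__, model.config.num_labels)  # ToyConfig 7
```

The base class stays generic, and each subclass only has to declare one attribute to plug in its own configuration type.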

Now we are ready to load our token classification model. First, however, we’ll need to provide some additional information beyond the model name, including the tags that we will use to label each entity and the mapping of each tag to an ID and vice versa.

In [16]:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}
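
A quick way to convince ourselves that the two dictionaries are exact inverses is a round trip. This sketch rebuilds them from the tag names printed earlier so that it runs on its own.

```python
# Tag names copied from the ClassLabel output above.
tag_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
index2tag = dict(enumerate(tag_names))
tag2index = {tag: idx for idx, tag in index2tag.items()}

# Every ID must survive the trip ID -> tag -> ID unchanged.
round_trip_ok = all(tag2index[index2tag[idx]] == idx for idx in index2tag)
print(round_trip_ok)  # True
print(index2tag[5], tag2index["B-LOC"])  # B-LOC 5
```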

We'll store these mappings and the `tags.num_classes` attribute in the `AutoConfig` object by passing them as keyword arguments to the `from_pretrained` method.

In [17]:
from transformers import AutoConfig

xlmr_config = AutoConfig.from_pretrained(xlmr_model_name,
                                         num_labels=tags.num_classes,
                                         id2label=index2tag, label2id=tag2index)

Now, we can load our token classification model using the `from_pretrained` method of our custom class and passing the configuration object we just created.

In [34]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xlmr_model = (XLMRobertaForTokenClassification
              .from_pretrained(xlmr_model_name, config=xlmr_config)
              .to(device))

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let's make a quick check to see if the model is working as expected

In [19]:
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])

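|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tokens | `<s>` | ▁Jack | ▁Spar | row | ▁love | s | ▁New | ▁York | ! | `</s>` |
| Input IDs | 0 | 21763 | 37456 | 15555 | 5161 | 7 | 2356 | 5753 | 38 | 2 |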


Finally, we pass the input IDs to the model and extract the predicted class for each token by taking the `argmax` of the output logits.

In [20]:
outputs = xlmr_model(input_ids.to(device)).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of outputs: {outputs.shape}")

Number of tokens in sequence: 10
Shape of outputs: torch.Size([1, 10, 7])


As you can see, the logits have the shape `[batch_size, num_tokens, num_tags]`, with each token having a score for each of the seven possible tags.
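
To make that shape concrete, here is a plain-Python sketch of the `argmax` step on made-up logits for a two-token sequence. The scores are invented for illustration and do not come from the model.

```python
tag_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Made-up logits with shape [batch_size=1, num_tokens=2, num_tags=7].
toy_logits = [[
    [0.1, 2.3, 0.0, -1.0, 0.2, 0.4, 0.3],  # highest score at index 1
    [0.0, 0.1, 0.2, 0.3, 0.1, 0.2, 1.9],   # highest score at index 6
]]


def argmax(scores):
    """Index of the largest score in a flat list."""
    return max(range(len(scores)), key=scores.__getitem__)


# Reduce the last dimension: one class ID per token.
predicted_ids = [[argmax(token_scores) for token_scores in seq] for seq in toy_logits]
predicted_tags = [[tag_names[i] for i in seq] for seq in predicted_ids]
print(predicted_ids, predicted_tags)  # [[1, 6]] [['B-PER', 'I-LOC']]
```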

By enumerating over the sequence, we can quickly see the predicted tag for each token.

In [21]:
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"])

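|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tokens | `<s>` | ▁Jack | ▁Spar | row | ▁love | s | ▁New | ▁York | ! | `</s>` |
| Tags | I-PER | I-PER | I-PER | I-PER | I-PER | I-PER | I-PER | I-PER | I-PER | I-PER |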


So... I don't think "loves" or "!" is a person, right? 😅

But well, with a randomly initialized classification head, what did I expect anyway 😅

Okay, time to fine-tune it on some labeled data and make it smarter 😎
Before that, though, let’s wrap the previous steps into a neat helper function.

In [22]:
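def tag_text(text, tags, model, tokenizer):
    # Get the tokens, including special tokens
    tokens = tokenizer(text).tokens()
    # Encode the text with the tokenizer that was passed in (not a global one)
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # Get the logits over the possible classes for each token
    outputs = model(input_ids)[0]
    # Take the argmax to get the most likely class per token
    predictions = torch.argmax(outputs, dim=2)
    # Convert class IDs to tag names and return a DataFrame
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
    return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])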

Before we start training our model, we need to tokenize the inputs and prepare the labels.

So far, the tokenizer and model have encoded a single example. Our next step is to tokenize the entire dataset so that we can feed it into the model for fine-tuning.

🤗 Datasets provides a convenient `map` method that applies a function to each example in the dataset.

For that, we need to define a function with this minimal signature:

`function(examples: Dict[str, List]) -> Dict[str, List]`
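
As a minimal sketch of that signature, here is a function that takes a batch laid out column by column and returns a new column. The `add_token_counts` name and the hand-built batch (including its tag IDs) are made up for this example; with `batched=True`, `map` passes batches in this dict-of-lists layout.

```python
def add_token_counts(examples):
    """Batched map-style function: dict of lists in, dict of lists out."""
    return {"n_tokens": [len(tokens) for tokens in examples["tokens"]]}


# A hand-built batch of two examples, laid out column by column.
batch = {
    "tokens": [["Jack", "Sparrow", "loves", "New", "York", "!"], ["Hallo", "Welt"]],
    "ner_tags": [[1, 2, 0, 5, 6, 0], [0, 0]],
}
print(add_token_counts(batch))  # {'n_tokens': [6, 2]}
```

Each returned list has one entry per example in the batch, which is what lets the result be attached to the dataset as a new column.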

The XLM-R tokenizer returns the input IDs, but we also need:
- the attention mask, which indicates which tokens are real tokens and which are padding tokens (the tokenizer call returns it alongside the input IDs);
- the label IDs, which say which tag (e.g., `B-PER`, `I-LOC`) corresponds to each token.

Let's look at how this works for a single example:

In [29]:
words, labels = de_example["tokens"], de_example["ner_tags"]
words, labels

(['2.000',
  'Einwohnern',
  'an',
  'der',
  'Danziger',
  'Bucht',
  'in',
  'der',
  'polnischen',
  'Woiwodschaft',
  'Pommern',
  '.'],
 [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0])

Next, we can use the tokenizer to tokenize the input words. Since our input is already split into words, we set the `is_split_into_words` parameter to `True` to tell the tokenizer so.

In [25]:
tokenized_input = xlmr_tokenizer(de_example["tokens"], is_split_into_words=True)
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
pd.DataFrame([tokens], index=["Tokens"])

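|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tokens | `<s>` | ▁2.000 | ▁Einwohner | n | ▁an | ▁der | ▁Dan | zi | ger | ▁Buch | ... | ▁Wo | i | wod | schaft | ▁Po | mmer | n | ▁ | . | `</s>` |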


As we can see, the tokenizer has split "Einwohnern" into two subwords: "▁Einwohner" and "n".

Since we're following the convention that only the first subword of a word (here "▁Einwohner") carries the word's label (here `O`), we need a way to mask the subword representations after the first one. Fortunately, the tokenized output provides a method called `word_ids()` that returns a list mapping each token to its corresponding word index in the original input.

In [28]:
word_ids = tokenized_input.word_ids()
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])

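|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tokens | `<s>` | ▁2.000 | ▁Einwohner | n | ▁an | ▁der | ▁Dan | zi | ger | ▁Buch | ... | ▁Wo | i | wod | schaft | ▁Po | mmer | n | ▁ | . | `</s>` |
| Word IDs | None | 0 | 1 | 1 | 2 | 3 | 4 | 4 | 4 | 5 | ... | 9 | 9 | 9 | 9 | 10 | 10 | 10 | 11 | 11 | None |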


Now, each token is associated with a word index, where `None` indicates special tokens like `<s>` and `</s>`.
The original words are indexed from `0` to `n-1`, where `n` is the number of words in the input.

So we can use this mapping to align the original labels with the tokenized input: we now know, for example, whether "n" is a continuation of the previous word or not.
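
To show where such a mapping comes from, here is a toy reconstruction that starts from the subword pieces of each word. The `flatten_with_word_ids` helper is hypothetical and only mimics the result for the first five words of the example; it is not how the tokenizer computes `word_ids()`.

```python
def flatten_with_word_ids(pieces_per_word, bos="<s>", eos="</s>"):
    """Flatten per-word subword pieces and record each piece's word index."""
    tokens, word_ids = [bos], [None]  # special tokens belong to no word
    for word_idx, pieces in enumerate(pieces_per_word):
        tokens.extend(pieces)
        word_ids.extend([word_idx] * len(pieces))
    tokens.append(eos)
    word_ids.append(None)
    return tokens, word_ids


# First five words of the example, with the pieces shown in the table above.
pieces_per_word = [["▁2.000"], ["▁Einwohner", "n"], ["▁an"], ["▁der"], ["▁Dan", "zi", "ger"]]
tokens, word_ids = flatten_with_word_ids(pieces_per_word)
print(word_ids)  # [None, 0, 1, 1, 2, 3, 4, 4, 4, None]
```

A repeated index means "same word as the previous token", which is exactly the signal needed to spot continuation pieces.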

Let's set `-100` for the tokens that we want to ignore during loss computation.

In [30]:
previous_word_idx = None
label_ids = []

for word_idx in word_ids:
    if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100)
    elif word_idx != previous_word_idx:
        label_ids.append(labels[word_idx])
    previous_word_idx = word_idx

labels = [index2tag[l] if l != -100 else "IGN" for l in label_ids]
index = ["Tokens", "Word IDs", "Label IDs", "Labels"]

pd.DataFrame([tokens, word_ids, label_ids, labels], index=index)

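|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tokens | `<s>` | ▁2.000 | ▁Einwohner | n | ▁an | ▁der | ▁Dan | zi | ger | ▁Buch | ... | ▁Wo | i | wod | schaft | ▁Po | mmer | n | ▁ | . | `</s>` |
| Word IDs | None | 0 | 1 | 1 | 2 | 3 | 4 | 4 | 4 | 5 | ... | 9 | 9 | 9 | 9 | 10 | 10 | 10 | 11 | 11 | None |
| Label IDs | -100 | 0 | 0 | -100 | 0 | 0 | 5 | -100 | -100 | 6 | ... | 5 | -100 | -100 | -100 | 6 | -100 | -100 | 0 | -100 | -100 |
| Labels | IGN | O | O | IGN | O | O | B-LOC | IGN | IGN | I-LOC | ... | B-LOC | IGN | IGN | IGN | I-LOC | IGN | IGN | O | IGN | IGN |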
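
The `-100` entries only matter once a loss is computed. Here is a plain-Python sketch of a cross-entropy that skips such positions; it illustrates the idea behind an ignore index, is not PyTorch's implementation, uses made-up logits, and assumes at least one label is kept.

```python
import math


def cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[label])
    return sum(losses) / len(losses)


# Three tokens, three classes; only the middle token carries a real label.
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [9.0, -3.0, 0.0]]
masked = cross_entropy(logits, [-100, 1, -100])
single = cross_entropy([logits[1]], [1])
print(math.isclose(masked, single))  # True
```

The masked positions change neither the sum nor the number of terms in the average, so the loss is the same as if those tokens were not there at all.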


You might wonder: **why ignore the special tokens and subword tokens during loss computation?**

> Special tokens such as `<s>` and `</s>` do not correspond to any word in the input text, and the continuation pieces of a word share the label already assigned to its first piece. Including them in the loss calculation could introduce noise and would let words that split into many subwords count more than once.

And second, **why set the label of subword tokens to -100 specifically?**

> PyTorch's `CrossEntropyLoss` has an `ignore_index` argument whose default value is `-100`. Targets equal to that value are skipped during loss computation, so using `-100` excludes those tokens from the loss and keeps the model focused on the relevant tokens only.

Now, we can wrap all these steps into a single function that we can use with the `map` method of the dataset.

In [31]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True,
                                      is_split_into_words=True)
    labels = []
    for idx, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
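
A cheap sanity check for this kind of alignment is that every word ends up with exactly one unmasked label. The sketch below repeats the inner loop on a hand-written prefix of the German example (five words, nine tokens including `<s>`); the `align_labels` helper is a standalone copy written for this check and assumes no truncation.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first piece of each word; mask special tokens and later pieces."""
    label_ids, previous = [], None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            label_ids.append(ignore_index)
        else:
            label_ids.append(word_labels[word_idx])
        previous = word_idx
    return label_ids


# Prefix of the German example, copied from the tables above.
word_ids = [None, 0, 1, 1, 2, 3, 4, 4, 4]
word_labels = [0, 0, 0, 0, 5]
label_ids = align_labels(word_ids, word_labels)
print(label_ids)  # [-100, 0, 0, -100, 0, 0, 5, -100, -100]

# Exactly one unmasked label per word.
kept = sum(label != -100 for label in label_ids)
print(kept == len(word_labels))  # True
```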


Now, we can write a function that applies it to every split of a corpus.

In [32]:
def encode_panx_dataset(corpus):
    return corpus.map(tokenize_and_align_labels, batched=True,
                      remove_columns=['langs', 'ner_tags', 'tokens'])

When we apply it to a `DatasetDict`, we get an encoded `Dataset` per split.

In [33]:
panx_de_encoded = encode_panx_dataset(panx_ch["de"])

Map: 100%|██████████| 12580/12580 [00:12<00:00, 1002.17 examples/s]
Map: 100%|██████████| 6290/6290 [00:01<00:00, 4757.25 examples/s]
Map: 100%|██████████| 6290/6290 [00:01<00:00, 6188.71 examples/s]
