# Let's get started with the XTREME benchmark from Hugging Face Datasets
To import the dataset, we can use the `load_dataset` function from the `datasets` library.
This benchmark includes a variety of tasks across multiple languages, making it a great choice for evaluating multilingual models.
It uses the IOB format for sequence labeling tasks, a common format for named entity recognition (NER) and similar tasks.

In [2]:
from datasets import get_dataset_config_names
xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")


XTREME has 183 configurations


Whoa, that’s a lot of configurations! `XTREME` includes a variety of tasks such as:
- Named Entity Recognition (NER)
- Part-of-Speech Tagging (POS)
- Question Answering (QA)
- Sentence Retrieval (SR)

But we'll focus on the `NER` task for this example.
Let’s narrow the search by looking only at the configurations that start with “`PAN`”.

**Why?**

Because `PAN-X` is the subset of `XTREME` that focuses on `NER` across multiple languages.

In [3]:
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]

['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg']

So, we have several configurations for `PAN-X`, each corresponding to a different language.
As you can see, each one ends with a two-letter language code, such as `en` for English, `de` for German, and `fr` for French. These codes follow the **ISO 639-1** standard.
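Because every configuration name listed above has the form `<task>.<language code>`, the code can be split off with ordinary string methods. The sketch below is only illustrative: `split_config_name` is a hypothetical helper, and the list holds just the configuration names that appear in this notebook, not the full output of `get_dataset_config_names`.

```python
def split_config_name(config_name):
    """Split a name like 'PAN-X.de' into its task and language code."""
    # rpartition splits on the last dot, so the task part may itself contain dots
    task, _, lang_code = config_name.rpartition(".")
    return task, lang_code

# Only names that appear in this notebook; the real list is much longer
example_configs = ["PAN-X.af", "PAN-X.ar", "PAN-X.bg", "PAN-X.de"]
lang_codes = [split_config_name(name)[1] for name in example_configs]
print(lang_codes)
```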

Alright, if we want to use the German corpus, we can load it like this:

In [5]:
from datasets import load_dataset
load_dataset("xtreme", name="PAN-X.de")

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
})

But what if we want to load multiple languages at once? For example, a Swiss corpus that includes German, French, Italian, and English.

This corpus is particularly interesting because it reflects the multilingual nature of Switzerland, where several languages are spoken but in very unequal proportions.

The spoken proportions are roughly:
- 62.9% German (de)
- 22.9% French (fr)
- 8.4% Italian (it)
- 5.9% English (en)

To keep track of each language, let’s create a Python `defaultdict` that stores the language code as the key and a `PAN-X` corpus of type `DatasetDict` as the value:

In [6]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)
for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows)))
        )


Generating train split: 100%|██████████| 20000/20000 [00:00<00:00, 938470.01 examples/s]
Generating validation split: 100%|██████████| 10000/10000 [00:00<00:00, 739892.75 examples/s]
Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 857327.64 examples/s]
Generating train split: 100%|██████████| 20000/20000 [00:00<00:00, 824190.21 examples/s]
Generating validation split: 100%|██████████| 10000/10000 [00:00<00:00, 386383.06 examples/s]
Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 499072.37 examples/s]
Generating train split: 100%|██████████| 20000/20000 [00:00<00:00, 599344.68 examples/s]
Generating validation split: 100%|██████████| 10000/10000 [00:00<00:00, 467905.40 examples/s]
Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 655564.86 examples/s]


To make sure we don't accidentally bias our dataset splits, we `shuffle` each split with a fixed seed before downsampling it according to the spoken proportion.
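The shuffle-then-select pattern can be tried out without downloading anything. The following is a minimal sketch that uses the standard library's `random` module instead of the Datasets `shuffle` and `select` methods. The toy rows, the `source` field, the fractions, and the `downsample` helper are all made up for illustration.

```python
import random
from collections import defaultdict

def downsample(rows, frac, seed=0):
    """Shuffle a copy of rows with a fixed seed, then keep the first frac of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[: int(frac * len(shuffled))]

# Toy stand-in for one split: 100 rows sorted by a made-up "source" field
toy_split = [{"id": i, "source": "a" if i < 50 else "b"} for i in range(100)]

# Missing keys get an empty dict, mirroring defaultdict(DatasetDict)
toy_corpus = defaultdict(dict)
for lang, frac in zip(["de", "fr"], [0.5, 0.25]):
    toy_corpus[lang]["train"] = downsample(toy_split, frac)

# Without shuffling, the first 50 rows would all come from source "a"
sources = {row["source"] for row in toy_corpus["de"]["train"]}
print(len(toy_corpus["de"]["train"]), len(toy_corpus["fr"]["train"]), sources)
```

Because the seed is fixed, rerunning the sketch selects the same rows every time, which is what keeps the downsampled splits reproducible.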

Let's take a look at the number of training examples in each language:

In [7]:
import pandas as pd
pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs},
             index=["Number of training examples"])

| | de | fr | it | en |
|---|---|---|---|---|
| Number of training examples | 12580 | 4580 | 1680 | 1180 |


As you can see, we have more training examples for German than for the other languages, which reflects the linguistic landscape of Switzerland.

So, we can use German as a starting point from which to perform zero-shot cross-lingual transfer to French, Italian, and English.

Let's take a look at a few examples from the German training set:

In [8]:
element = panx_ch["de"]["train"][0]
for key, value in element.items():
    print(f"{key}: {value}")


tokens: ['2.000', 'Einwohnern', 'an', 'der', 'Danziger', 'Bucht', 'in', 'der', 'polnischen', 'Woiwodschaft', 'Pommern', '.']
ner_tags: [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']


As you can see, each example consists of a sentence and its corresponding named entity tags in IOB format.

The `ner_tags` column stores the entity tag of each token as a class ID. This is a bit cryptic, so let's add a column that maps each class ID to its corresponding entity label.

First, let's take a look at the features of the dataset to find the mapping:

In [16]:
for key, value in panx_ch["de"]["train"].features.items():
    print(f"{key}: {value}")

tokens: List(Value('string'))
ner_tags: List(ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']))
langs: List(Value('string'))


The `ner_tags` feature is a list of `ClassLabel` values, which means each tag comes from a predefined set of labels.

Let's pull out the mapping from class IDs to entity labels:

In [17]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)

ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'])


The `ClassLabel` object provides a method called `int2str` that allows us to convert class IDs to their corresponding string labels.
With the `map()` method, we can easily create a new column in the dataset that contains the string label for each entity tag.

In [18]:
def create_tag_names(batch):
    # Without batched=True, map() passes one example at a time
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch["de"].map(create_tag_names)

Map: 100%|██████████| 12580/12580 [00:03<00:00, 4122.66 examples/s]
Map: 100%|██████████| 6290/6290 [00:01<00:00, 5520.57 examples/s]
Map: 100%|██████████| 6290/6290 [00:01<00:00, 5646.22 examples/s]
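Conceptually, the conversion amounts to a positional lookup into the list of label names. The following sketch illustrates that idea with a plain Python list. It is not how `ClassLabel` is implemented internally, and `ids_to_labels` is a hypothetical helper used only here.

```python
# Label names copied from the ClassLabel output above
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

def ids_to_labels(tag_ids):
    """Map each integer class ID to its string label by list position."""
    return [label_names[idx] for idx in tag_ids]

# Class IDs of the first German training example shown earlier
example_ids = [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]
example_labels = ids_to_labels(example_ids)
print(example_labels)
```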


And now, let's take a look at the first example in the German training set with the new `ner_tags_str` column:
> Yeah, this is a data analyst habit! 😅

In [19]:
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]],
             index=["Tokens", "Tags"])


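| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
| Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |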


The presence of the `LOC` tags makes sense, since the sentence “2.000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern” means “2,000 inhabitants at the Gdansk Bay in the Polish voivodeship of Pomerania” in English. And “**Danziger Bucht**” is indeed a location, a bay in the Baltic Sea.
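To see which tokens belong together as one entity, the `B-` and `I-` prefixes can be grouped into spans. The sketch below is one possible way to do that, using the tokens and tags from the table above. `iob_to_spans` is a hypothetical helper, and treating a stray `I-` tag as the start of a new entity is an assumption of this sketch.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_type = tag.partition("-")
        if prefix == "I" and tag_type == current_type:
            # Continuation of the entity that is currently open
            current_tokens.append(token)
            continue
        if current_tokens:
            # Any other tag closes the open entity
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            # "B-" starts a new entity; a stray "I-" is treated as a start too
            current_tokens, current_type = [token], tag_type
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = ["2.000", "Einwohnern", "an", "der", "Danziger", "Bucht", "in",
          "der", "polnischen", "Woiwodschaft", "Pommern", "."]
tags = ["O", "O", "O", "O", "B-LOC", "I-LOC", "O", "O",
        "B-LOC", "B-LOC", "I-LOC", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

With these tags the sentence yields three location spans: “Danziger Bucht”, “polnischen”, and “Woiwodschaft Pommern”.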


Now, let's run a quick check for any unusual imbalance in the tags by looking at the frequency of each entity type in each split.

In [20]:
from collections import Counter
split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")

| | LOC | ORG | PER |
|---|---|---|---|
| train | 6186 | 5366 | 5810 |
| validation | 3172 | 2683 | 2893 |
| test | 3180 | 2573 | 3071 |


This is a pretty good distribution of entity tags across the `train`, `validation`, and `test` splits.

The proportions of `LOC`, `PER`, and `ORG` are roughly the same in each split, which is what we want to see.
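Raw counts are hard to compare because the splits differ in size, so one way to back up this observation is to turn the counts into per-split shares. The sketch below uses only the counts from the table above. The variable names `shares` and `spread` are introduced here for illustration.

```python
# Entity counts copied from the table above
counts = {
    "train": {"LOC": 6186, "ORG": 5366, "PER": 5810},
    "validation": {"LOC": 3172, "ORG": 2683, "PER": 2893},
    "test": {"LOC": 3180, "ORG": 2573, "PER": 3071},
}

# Share of each entity type within its split
shares = {
    split: {tag: n / sum(freqs.values()) for tag, n in freqs.items()}
    for split, freqs in counts.items()
}

# Largest gap between splits for each entity type
spread = {
    tag: max(s[tag] for s in shares.values()) - min(s[tag] for s in shares.values())
    for tag in ("LOC", "ORG", "PER")
}

for split, split_shares in shares.items():
    print(split, {tag: f"{share:.1%}" for tag, share in split_shares.items()})
print({tag: f"{gap:.1%}" for tag, gap in spread.items()})
```

With these counts, the share of each entity type differs by less than two percentage points between splits.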
