# Building a MultiLingual NER Tagger

In [None]:
from datasets import get_dataset_config_names

xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")

In [None]:
# many config names; 183. Narrow to those that start with "PAN"
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]

In [None]:
# code passed as suffix
from datasets import load_dataset

load_dataset("xtreme", name="PAN-X.de")

Realistic Swiss corpus: sample the German (de), French (fr), Italian (it) and English (en) corpora from PAN-X according to their spoken proportions. This creates the kind of language imbalance that is very common in real-world datasets, so we can see how to build a model that works across all languages.

Create a Python *defaultdict* that stores the language code as the key and PAN-X corpus of type DatasetDict as the value:

In [None]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            # shuffle so we don't accidentally bias the dataset splits
            # select allows to downsample each corpus according to values in fracs
            ds[split].shuffle(seed=0).select(range(int(frac * ds[split].num_rows)))
        )

In [None]:
import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs}, index=["Number of training examples"])

There are more German examples than in all the other languages combined, so we can use German as the starting point for zero-shot cross-lingual transfer to French, Italian and English.

In [None]:
# inspect one of the examples in the German corpus
element = panx_ch["de"]["train"][0]
for key, value in element.items():
    print(f"{key}: {value}")

The keys of the example correspond to the column names of an Arrow table, while the values denote the entries in each column. We see that the *ner_tags* column maps each entity to a class ID. This is slightly cryptic to the human eye, so let's create a new column with the LOC, PER and ORG tags.

First we use `features` attribute of Dataset object that specifies the underlying data types associated with each column.

In [None]:
for key, value in panx_ch["de"]["train"].features.items():
    print(f"{key}: {value}")

In [None]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)

In [None]:
def create_tag_names(batch):
    # use the int2str method encountered in chapter 2 to create a new column in the training set
    # with class names for each tag
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch["de"].map(create_tag_names)

In [None]:
# look at how tokens and tags align for first example in training set
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]], ["Tokens", "Tags"])

In [None]:
from collections import Counter

# calculate frequencies of each entity across each split
split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")

Distributions of PER, LOC and ORG are roughly the same for each split; so validation and test sets should prove a good measure of NER tagger's ability to generalise.

## MultiLingual Transformers

Multilingual transformers use much the same architectures and training procedures as their monolingual counterparts, except that the corpus used for pretraining consists of documents in many languages. Remarkably, even though they receive no explicit information to differentiate among languages, the resulting representations generalise well across languages for a number of downstream tasks, in some cases producing results that are competitive with monolingual models.

To measure the progress of cross-lingual transfer for NER, the CoNLL-2002 and CoNLL-2003 datasets are often used as a benchmark for English, Dutch, Spanish and German. Multilingual transformer models are usually evaluated in three different ways:

- **en**: Fine-tune on English training data then evaluate on each language's test set
- **each**: Fine-tune on monolingual training data and evaluate on the same language's test set to measure per-language performance
- **all**: Fine-tune on all the training data, then evaluate on each language's test set

We will adopt a similar evaluation strategy for our NER task, but first need to select a model to evaluate. One of the first models for multilingual transformers was mBERT which uses the same architecture and pretraining objective as BERT but adds Wikipedia articles from many languages to pretraining corpus. Since then, mBERT has been superseded by XLM-RoBERTa, so that's the model we'll consider in this chapter. 

XLM-R is distinguished by the huge size of its pretraining corpus: Wikipedia dumps for each language plus 2.5TB of Common Crawl data from the web. Compared to its predecessors, XLM-R provides a significant boost for low-resource languages like Burmese and Swahili.

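The RoBERTa part of the name refers to the fact that the pretraining approach is the same as for monolingual RoBERTa models. RoBERTa improves on BERT in several ways, in particular by removing the next sentence prediction task. XLM-R also drops the language embeddings used in XLM and uses SentencePiece to tokenise raw texts directly. Its vocabulary is also much larger than that of its monolingual counterparts: 250,000 tokens vs 55,000!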

XLM-R is a great choice for multilingual NLU tasks; next we'll explore how it can efficiently tokenise across many languages.

## A Closer Look at Tokenisation

In [None]:
from transformers import AutoTokenizer

bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base" # uses SentencePiece trained on raw text of all 100 languages
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

In [None]:
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
print(bert_tokens)
xlmr_tokens = xlmr_tokenizer(text).tokens()
print(xlmr_tokens)

Note the difference in special tokens: BERT uses `[CLS]` and `[SEP]`, while XLM-R uses `<s>` and `</s>` for the start and end of a sequence. These tokens are added in the final stage of tokenization.

### Tokenizer Pipeline

At a high level, a tokenizer transforms strings into integers, though a closer look usually reveals four steps:
1. **Normalisation**: Makes a raw string "cleaner". Common operations include stripping whitespace, removing accented characters, Unicode normalisation and lowercasing, which can also help reduce the vocabulary size
2. **Pretokenisation**: Splits the text into smaller objects that give an upper bound to what the tokens will be at the end of training. The pretokenizer splits the text into "words" and the final tokens will be parts of those words. Splitting on whitespace is not always a good choice, for instance for languages that are written without spaces between words
3. **Tokeniser Model**: Applies a subword splitting model to the words; this is the part that needs to be trained on a corpus. It splits words into subwords to reduce the size of the vocabulary and the number of out-of-vocabulary tokens. Several algorithms exist, including BPE, Unigram and WordPiece. After this step we no longer have a list of strings but a list of integers
4. **Postprocessing**: Final transformations, e.g. adding special tokens like `[CLS]` and `[SEP]` at the beginning or end of the input sequence of token indices. The result can then be fed into the model
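The four steps above can be sketched with a toy tokenizer in plain Python. This is only an illustration: the vocabulary `VOCAB` and the helper names are hypothetical, the subword step is a greedy longest-match split in the spirit of WordPiece rather than a trained model, and for readability the special tokens are added before the final lookup of integer IDs.

```python
import unicodedata

# Hypothetical toy vocabulary; a real tokenizer learns its vocabulary from a corpus
VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "jack": 3, "spar": 4,
         "##row": 5, "loves": 6, "new": 7, "york": 8, "!": 9}

def normalise(text):
    # Step 1: Unicode-normalise, strip surrounding whitespace and lowercase
    return unicodedata.normalize("NFKC", text).strip().lower()

def pretokenise(text):
    # Step 2: split on whitespace and put each punctuation mark in its own "word"
    words = []
    for chunk in text.split():
        word = ""
        for ch in chunk:
            if ch.isalnum():
                word += ch
            else:
                if word:
                    words.append(word)
                    word = ""
                words.append(ch)
        if word:
            words.append(word)
    return words

def subword_split(word):
    # Step 3: greedy longest-match-first split into pieces found in VOCAB
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            # no prefix of the remaining characters is in the vocabulary
            return ["[UNK]"]
        start = end
    return pieces

def postprocess(pieces):
    # Step 4: add the special tokens that mark the start and end of the sequence
    return ["[CLS]"] + pieces + ["[SEP]"]

def toy_tokenize(text):
    pieces = [p for w in pretokenise(normalise(text)) for p in subword_split(w)]
    tokens = postprocess(pieces)
    return tokens, [VOCAB[t] for t in tokens]

tokens, ids = toy_tokenize("  Jack Sparrow loves New York! ")
print(tokens)
print(ids)
```

Here "sparrow" is not in the toy vocabulary, so it is split into the two pieces "spar" and "##row", while every other word maps to a single token.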

So comparing XLM-R and BERT, SentencePiece adds `<s>` and `</s>` instead of `[CLS]` and `[SEP]` in postprocessing. Let's look into the SentencePiece tokenizer to understand more of what makes it special.
    
### SentencePiece Tokeniser
    
SentencePiece is based on Unigram subword segmentation and encodes each input text as a sequence of Unicode characters. This is useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation and the fact that many languages do not separate words with whitespace. Whitespace itself is assigned a Unicode symbol, so the text can be detokenised without any ambiguity.
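To see why marking whitespace with its own symbol makes detokenisation unambiguous, here is a minimal plain-Python sketch. It does not call SentencePiece: `toy_segment` is a hypothetical stand-in that simply starts a new piece at every whitespace marker, and the leading marker mimics the prefix that XLM-R tokens typically carry.

```python
META = "\u2581"  # the "▁" symbol used to mark whitespace

def escape_whitespace(text):
    # Prefix the text with a marker and replace every space with the marker
    return META + text.replace(" ", META)

def toy_segment(escaped):
    # Hypothetical segmentation: start a new piece at every marker
    pieces, current = [], ""
    for ch in escaped:
        if ch == META and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces

def detokenise(pieces):
    # Join the pieces, turn markers back into spaces and drop the leading one
    return "".join(pieces).replace(META, " ")[1:]

text = "Jack Sparrow loves New York!"
pieces = toy_segment(escape_whitespace(text))
print(pieces)
print(detokenise(pieces) == text)
```

Because the whitespace information lives inside the pieces themselves, joining them and swapping the marker back recovers the original string exactly, including runs of several spaces.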

In [None]:
"".join(xlmr_tokens).replace(u"\u2581", " ")

Let's see how we can encode our simple example in a form suitable for NER; first we load the pretrained model with a token classification head. Rather than loading this head from the library, though, we will build it ourselves!

## Transformers for Named Entity Recognition

For text classification, BERT uses the special `[CLS]` token to represent an entire sequence of text; this representation is fed through a fully connected or dense layer to output the distribution over all the discrete label values. BERT and other encoder-only transformers take a similar approach for NER, except that the representation of each individual input token is fed into the same fully connected layer to output the entity of the token. Thus, NER is often framed as a *token classification* task.

When a word is split into several subwords, we can label only the first subword and mark the remaining ones as ignored with IGN. We can later propagate the predicted label of the first subword to the subsequent subwords in a postprocessing step. We could also have chosen to include the representation of the "##ista" subword by assigning it a copy of the B-LOC label, but this violates the IOB2 format.
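One way to read the postprocessing step is that the prediction on the first subword stands for the whole word. The sketch below shows that idea in plain Python, assuming we already have a list of word indices per subword (with `None` for special tokens) and one predicted tag per subword. The function name and the example values are hypothetical.

```python
def word_level_labels(word_ids, subword_preds):
    """Keep the prediction of the first subword of each word."""
    labels, previous_word_id = [], None
    for word_id, pred in zip(word_ids, subword_preds):
        # Special tokens have no word index; later subwords repeat the index
        if word_id is not None and word_id != previous_word_id:
            labels.append(pred)
        previous_word_id = word_id
    return labels

# Hypothetical example: word 0 is split into two subwords
word_ids = [None, 0, 0, 1, 2, 3, None]
subword_preds = ["O", "B-PER", "O", "O", "O", "B-LOC", "O"]
print(word_level_labels(word_ids, subword_preds))
```

Whatever the model predicts on the second subword of word 0 is discarded, so the output has exactly one tag per original word and stays consistent with the IOB2 format.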

Fortunately, all the architecture aspects we've seen in BERT carry over to XLM-R, since its architecture is based on RoBERTa, which is architecturally identical to BERT! Next we'll see how Transformers supports many other tasks with minor modifications.

## Anatomy of Transformers Model Class

Naming convention: `<ModelName>For<Task>`, or `AutoModelFor<Task>` when using the auto classes.
    
### Bodies and Heads
    
Last layer is the model head; it is the part that is task-specific. The rest is the body, including the token embeddings and transformer layers which are task-agnostic. There are pure body models, like `BertModel` or `GPT2Model`. This separation of bodies and heads allows us to build a custom head for any task and mount it on top of a pretrained model.
    
### Creating a Custom Model for Token Classification
    
Custom token classification head for XLM-R; uses same architecture as RoBERTa, so will use RoBERTa as base model, but augmented with settings specific to XLM-R. 

In [None]:
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel

# require data structure to represent XLM-R NER tagger
# need configuration object to initialise model and a forward() function to generate outputs
class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
    config_class = XLMRobertaConfig
    
    def __init__(self, config):
        
        # initialise RobertaPreTrainedModel class
        super().__init__(config)
        self.num_labels = config.num_labels
        # load model body; add_pooling_layer=False ensures we return all hidden states and not only the one associated with the [CLS] token
        self.roberta = RobertaModel(config, add_pooling_layer=False)
        # set up classification head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # load and initialise weights
        self.init_weights()
        
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None, **kwargs):
        # use model body to get encoder representations
        outputs = self.roberta(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, **kwargs)
        # apply classifier to encoder representation
        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output)
        # calculate losses directly if we have labels
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        # return model output object. Wrap in TokenClassifierOutput
        return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions)

### Loading a Custom Model

In [None]:
# provide label of each entity and mapping of each tag to an ID and vice-versa
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag:idx for idx, tag in enumerate(tags.names)}

In [None]:
# store these and tags.num_classes in AutoConfig
from transformers import AutoConfig

# holds blueprint of model architecture; usually pretrained model has one already, though if we want to modify
# then we can load the configuration with parameters we would like to customise
xlmr_config = AutoConfig.from_pretrained(xlmr_model_name, num_labels=tags.num_classes, id2label=index2tag, label2id=tag2index)

In [None]:
import torch

# load model weights as usual with from_pretrained() with additional config argument
# we get these weights for free by inheriting from RobertaPreTrainedModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xlmr_model = (XLMRobertaForTokenClassification.from_pretrained(xlmr_model_name, config=xlmr_config).to(device))

In [None]:
# quick check on initialising the tokeniser and model correctly
# test the predictions on small sequence of known entities
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])

In [None]:
# pass inputs to model and take the argmax over the logits to get the most likely class per token
outputs = xlmr_model(input_ids.to(device)).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of outputs: {outputs.shape}") # shape [batch_size, num_tokens, num_tags]; each token has a logit among seven possible NER tags

In [None]:
# see what pretrained model predicts
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"]) # unsurprisingly, random weights layer leaves a lot to be desired!

In [None]:
# wrap preceding steps into a helper function for later use
def tag_text(text, tags, model, tokenizer):
    # get tokens with special characters
    tokens = tokenizer(text).tokens()
    # encode sequence into IDs
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # get predictions as distribution over 7 possible classes
    outputs = model(input_ids)[0]
    # take argmax to get most likely class per token
    predictions = torch.argmax(outputs, dim=2)
    # convert to DataFrame
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
    return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])

Before we can train model, we also need to tokenize inputs and prepare labels. We'll do that next.

### Tokenising Texts for NER

Tokenise the whole dataset so we can pass it to the XLM-R model for fine-tuning. We can use the `map()` function to achieve this, but first let's work through a single example.

In [None]:
# collect words and tags as ordinary lists
words, labels = de_example["tokens"], de_example["ner_tags"]

In [None]:
# tokenise each word and use the is_split_into_words argument to tell tokeniser that our input sequence has already been split into words
tokenized_input = xlmr_tokenizer(de_example["tokens"], is_split_into_words=True)
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
pd.DataFrame([tokens], index=["Tokens"])

We want to mask the subword representations after the first subword; for example, Einwohnern is broken into "Einwohner" and "n". Luckily, `tokenized_input` is a `BatchEncoding` object that provides a `word_ids()` function to help us achieve this:

In [None]:
word_ids = tokenized_input.word_ids()
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])

`word_ids` has mapped each subword to the corresponding index in the `words` sequence, with the same word being mapped to the same index even if it is broken into multiple subwords. We also see that special tokens like `<s>` and `</s>` are mapped to `None`. Let's set -100 as the label for the special tokens and the subwords we wish to mask during training:

In [None]:
previous_word_idx = None
label_ids = []

for word_idx in word_ids:
    if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100)
    elif word_idx != previous_word_idx:
        label_ids.append(labels[word_idx])
    previous_word_idx = word_idx
    
labels = [index2tag[l] if l != -100 else "IGN" for l in label_ids]
index = ["Tokens", "Word IDs", "Label IDs", "Labels"]

pd.DataFrame([tokens, word_ids, label_ids, labels], index=index)

We select -100 as the ID because PyTorch's cross-entropy loss class, `torch.nn.CrossEntropyLoss`, has an attribute called `ignore_index` whose default value is -100. Targets with this index are ignored during training, so we use it here to ignore the tokens associated with consecutive subwords.
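The effect of an ignore index can be illustrated without PyTorch. The following is a plain-Python re-implementation for illustration only, not PyTorch's code: it averages the per-token cross-entropy over the positions whose label differs from `ignore_index`, so masked positions contribute nothing to the loss. The function name and the example logits are made up.

```python
import math

def mean_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over the tokens whose label is not ignore_index."""
    total, counted = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked token: skipped entirely
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[label]  # negative log-softmax of the true class
        counted += 1
    return total / counted if counted else float("nan")

# Made-up logits for three tokens and two classes; the middle token is masked
logits = [[2.0, 0.0], [0.0, 0.0], [5.0, -5.0]]
labels = [0, -100, 1]

masked = mean_cross_entropy(logits, labels)
without_row = mean_cross_entropy([logits[0], logits[2]], [0, 1])
print(masked == without_row)
```

The loss with the masked token is the same as the loss computed with that token removed, which is exactly the behaviour we rely on for special tokens and trailing subwords.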

In [None]:
# scale to whole dataset by defining a single function that wraps all the logic
def tokenize_and_align_labels(examples):
    tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    
    labels = []
    for idx, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [None]:
# write a function we can iterate over

def encode_panx_dataset(corpus):
    return corpus.map(tokenize_and_align_labels, batched=True, remove_columns=['langs', 'ner_tags', 'tokens'])

In [None]:
# encode our German corpus
panx_de_encoded = encode_panx_dataset(panx_ch["de"])

In [None]:
panx_de_encoded["train"]["labels"][0][:15]

## Performance Measures

All words of an entity need to be predicted correctly in order for a prediction to be counted as correct. We have a library called seqeval designed for these kinds of tasks; it can compute metrics via classification_report() function

In [None]:
!pip install seqeval

In [None]:
from seqeval.metrics import classification_report

y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"], ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"], ["B-PER", "I-PER", "O"]]
print(classification_report(y_true, y_pred))

`seqeval` expects the predictions and labels as lists of lists, with each list corresponding to a single example in our validation or test sets. So we need to write a function that converts the output of our model into the lists that `seqeval` expects.
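To make the "all words of an entity must be correct" rule concrete, here is a simplified plain-Python sketch of entity-level scoring. It is not `seqeval`'s implementation: the helper names are hypothetical, and as a simplifying assumption an `I-` tag that does not continue an entity of the same type is ignored.

```python
def extract_entities(tags):
    """Return the set of (entity_type, start, end) spans in an IOB2 tag list."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                entities.add((etype, start, i - 1))
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
            else:
                start, etype = None, None
    return entities

def entity_f1(y_true, y_pred):
    """Precision, recall and F1 where a span counts only on an exact match."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        tp += len(true_ents & pred_ents)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"], ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"], ["B-PER", "I-PER", "O"]]
print(entity_f1(y_true, y_pred))
```

In this example the predicted MISC span starts one token too early, so it counts as wrong even though three of its four tokens overlap the true entity; only the PER span is an exact match.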

In [None]:
import numpy as np

def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    labels_list, preds_list = [], []
    
    for batch_idx in range(batch_size):
        example_labels, example_preds = [], []
        for seq_idx in range(seq_len):
            # Ignore label IDs = -100
            if label_ids[batch_idx, seq_idx] != -100:
                example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
                example_preds.append(index2tag[preds[batch_idx][seq_idx]])
                
        labels_list.append(example_labels)
        preds_list.append(example_preds)
    return preds_list, labels_list

Now with a performance metric, we can move on to actually train the model.

## Fine-Tuning XLM-RoBERTa

Fine-tune base model on German subset of PAN-X, then evaluate its zero-shot cross-lingual performance on French, Italian and English. We will use Transformers Trainer to handle training loop, so first need to define the training attributes using the TrainingArguments class.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
from transformers import TrainingArguments

num_epochs = 3
batch_size = 24
logging_steps = len(panx_de_encoded["train"]) // batch_size
model_name = f"{xlmr_model_name}-finetuned-panx-de"
training_args = TrainingArguments(
    output_dir=model_name, log_level="error", num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",
    save_steps=1e6, weight_decay=0.01, disable_tqdm=False,
    logging_steps=logging_steps, push_to_hub=True
)

In [None]:
from seqeval.metrics import f1_score

# convert format to what is needed by seqeval to calculate f1 score
def compute_metrics(eval_pred):
    y_pred, y_true = align_predictions(eval_pred.predictions, eval_pred.label_ids)
    return {"f1": f1_score(y_true, y_pred)}

In [None]:
# define data collator to pad each input sequence to largest sequence length in a batch
# Huggingface Transformers provides a dedicated data collator for token classification that will pad labels along with inputs
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(xlmr_tokenizer)

Padding the labels is necessary, as unlike in text-classification, the labels are also sequences. Note here: The label sequences are padded with value `-100` which is ignored by PyTorch's loss functions.
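What padding labels along with inputs means can be sketched in plain Python. This is not the collator's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and the padding token ID is passed in explicitly rather than read from a tokenizer.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100):
    """Right-pad input IDs, attention masks and labels to the longest example."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padding positions get the ignore label so they never enter the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

# Made-up token IDs and labels for two examples of different length
features = [
    {"input_ids": [0, 10, 11, 12, 2], "labels": [-100, 1, 2, -100, -100]},
    {"input_ids": [0, 20, 2], "labels": [-100, 0, -100]},
]
batch = pad_batch(features, pad_token_id=1)
print(batch["labels"][1])
```

After padding, every list in the batch has the same length, and the extra label positions carry -100 just like the special tokens and masked subwords.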

Define a `model_init()` method that loads an untrained model and is called at the beginning of the `train()` call:

In [None]:
def model_init():
    return (XLMRobertaForTokenClassification.from_pretrained(xlmr_model_name, config=xlmr_config).to(device))

In [None]:
!sudo apt-get install git-lfs

In [None]:
# pass information to Trainer
from transformers import Trainer

trainer = Trainer(model_init=model_init, args=training_args, 
                  data_collator=data_collator, compute_metrics=compute_metrics,
                 train_dataset=panx_de_encoded["train"],
                 eval_dataset=panx_de_encoded["validation"],
                 tokenizer=xlmr_tokenizer)

In [None]:
trainer.train()
trainer.push_to_hub(commit_message="Training completed!")

In [None]:
# F1 scores are quite good for NER model; to confirm it works as expected, test on German Translation of a simple example
text_de = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
tag_text(text_de, tags, trainer.model, xlmr_tokenizer)


Great! Though we shouldn't get too excited over a single example; time for a more proper and thorough analysis of the model's errors.

## Error Analysis

There are several failure modes to bear in mind where it looks like the model is performing well, but in practice has some serious flaws. Some examples:

- We might accidentally mask too many tokens, and also mask some of our labels, and so get a really promising loss drop
- `compute_metrics()` function might have a bug that overestimates the true performance
- We might include the zero class or `O` entity in NER as a normal class, which would heavily skew accuracy and F1-score since it is the majority class by a large margin

When the model performs much worse than expected, looking at errors can yield useful insights and reveal bugs that would be hard to spot by just reviewing code. Even if the model performs well and there are no bugs, error analysis is a useful tool to understand the model's strengths and weaknesses and something we want to bear in mind when we deploy a model to a production environment.

One of the most powerful tools at our disposal is to look at validation examples with the highest loss, so now we can look at the loss per token in the same sequence.

In [None]:
# a method to apply to the validation set

from torch.nn.functional import cross_entropy

def forward_pass_with_label(batch):
    
    # convert dict of lists to list of dicts suitable for data collator
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    
    # pad inputs and labels and put all tensors on device
    batch = data_collator(features)
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)
    
    with torch.no_grad():
        # pass data through model
        output = trainer.model(input_ids, attention_mask) # logit size: [batch_size, sequence_length, classes]
        # predict class with largest logit value on classes axis
        predicted_label = torch.argmax(output.logits, axis=-1).cpu().numpy()
        
    # calculate loss per token after flattening batch dimension with view
    loss = cross_entropy(output.logits.view(-1, 7), labels.view(-1), reduction="none")
    # unflatten batch dimension and convert to numpy array
    loss = loss.view(len(input_ids), -1).cpu().numpy()
    
    return {"loss": loss, "predicted_label": predicted_label}

In [None]:
# apply to whole validation set using map() and load into dataframe for further analysis
valid_set = panx_de_encoded["validation"]
valid_set = valid_set.map(forward_pass_with_label, batched=True, batch_size=32)
df = valid_set.to_pandas()

In [None]:
# map back to strings from IDs to better read results; assign IGN to -100 labels. Also get rid of padding by truncating to length of inputs
index2tag[-100] = "IGN"
df["input_tokens"] = df["input_ids"].apply(lambda x: xlmr_tokenizer.convert_ids_to_tokens(x))
df["predicted_label"] = df["predicted_label"].apply(lambda x: [index2tag[i] for i in x])
df["labels"] = df["labels"].apply(lambda x: [index2tag[i] for i in x])
df["loss"] = df.apply(lambda x: x["loss"][:len(x["input_ids"])], axis=1)
df["predicted_label"] = df.apply(lambda x: x["predicted_label"][:len(x["input_ids"])], axis=1)

df.head()

In [None]:
# we can unpack by using pandas.Series.explode() fn; creating a row for each element in original rows list
# can do in parallel for all columns as all lists in one row have the same length
# also drop padding tokens as their loss is 0 and cast losses to standard floats
df_tokens = df.apply(pd.Series.explode)
df_tokens = df_tokens.query("labels != 'IGN'")
df_tokens["loss"] = df_tokens["loss"].astype(float).round(2)
df_tokens.head(7)

In [None]:
# can groupby input tokens and aggregate the losses for each token with count, mean and sum
# then sort by sum of losses and see which tokens have the most loss in val set
(df_tokens
    .groupby("input_tokens")[["loss"]]
    .agg(["count", "mean", "sum"])
    .droplevel(level=0, axis=1) # rid of multi-level columns
    .sort_values(by="sum", ascending=False)
    .reset_index()
    .round(2)
    .head(10)
.T)

Observations:
- The whitespace token has the highest total loss, which is unsurprising as it is the most common token in the list. However, its mean loss is much lower than that of the other tokens in the list, so the model doesn't struggle to classify it
- Words like "in", "von" and "und" appear relatively frequently. They often appear together with named entities and are sometimes part of them, which explains why the model might get mixed up
- Parentheses, slashes and capital letters at the beginning of words are rarer but have a relatively high average loss. We'll investigate these further

In [None]:
# can also group label ids and look at losses for each class
(
    df_tokens.groupby("labels")[["loss"]]
    .agg(["count", "mean", "sum"])
    .droplevel(level=0, axis=1)
    .sort_values(by="mean", ascending=False)
    .reset_index()
    .round(2)
    .T
)

I-LOC has the highest average loss, so determining the location subwords is a challenge for our model, as is B-ORG, the beginning of an organisation.

In [None]:
# go further by plotting confusion matrix where we see beginning of an organisation is often confused with subsequent I-ORG

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

def plot_confusion_matrix(y_preds, y_true, labels):
    # pass labels explicitly so the matrix rows/columns match the display labels
    cm = confusion_matrix(y_true, y_preds, labels=labels, normalize="true")
    fig, ax = plt.subplots(figsize=(6,6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
    
# predictions first, true labels second, matching the function signature
plot_confusion_matrix(df_tokens["predicted_label"], df_tokens["labels"], tags.names)

We see confusion between B-ORG and I-ORG; otherwise the model is quite good at classifying the remaining entities, which is clear from the near-diagonal nature of the confusion matrix.

Move on from token level. Now look at sequences with high losses. Revisit "unexploded" DataFrame and calculate losses by summing over loss per token

In [None]:
# first write a function to help us display the token sequences with labels and losses
def get_samples(df):
    for _, row in df.iterrows():
        labels, preds, tokens, losses = [], [], [], []
        # skip the <s> and </s> special tokens at the first and last positions
        last_idx = len(row["attention_mask"]) - 1
        for i, mask in enumerate(row["attention_mask"]):
            if i not in {0, last_idx}:
                labels.append(row["labels"][i])
                preds.append(row["predicted_label"][i])
                tokens.append(row["input_tokens"][i])
                losses.append(f"{row['loss'][i]:.2f}")
        df_tmp = pd.DataFrame({"tokens":tokens, "labels":labels, "preds":preds, "losses":losses}).T
        
        yield df_tmp
    
df["total_loss"] = df["loss"].apply(sum)
df_tmp = df.sort_values(by="total_loss", ascending=False).head(3)

for sample in get_samples(df_tmp):
    display(sample)

Some of the labels are incorrect! For example, United Nations ... is labelled as a person! The annotations for PAN-X were generated through an automated process and are referred to as "silver standard" (vs the "gold standard" of human-generated annotations), so it is no surprise that we see such cases of nonsensical labels. Even when humans annotate, mistakes can occur when the annotator misunderstands the task or loses concentration.

Let's also look at parentheses and slashes, which had a relatively high loss.

In [None]:
df_tmp = df.loc[df["input_tokens"].apply(lambda x: u"\u2581(" in x)].head(2)
for sample in get_samples(df_tmp):
    display(sample)

The automatic extraction appears to have annotated the parentheses and their contents as part of the named entity. These parentheses often contain geographic specifications. In Wikipedia articles, the titles often contain some explanation in parentheses. These are important details to know when we roll out the model, as they have implications for the downstream performance of the whole pipeline the model is part of.

So we have identified some weaknesses in both model and dataset. In real use-case we would iterate on each, clean the dataset and retrain the model; then re-analyze new errors until we are satisfied with the performance.

Though here we will move on, and look at performances across languages.

## Cross-Lingual Transfer

In [None]:
# evaluate ability to transfer to other languages via predict() method of trainer
def get_f1_score(trainer, dataset):
    return trainer.predict(dataset).metrics["test_f1"]

In [None]:
# evaluate performance on test set and keep track of scores in a dict
f1_scores = defaultdict(dict)
f1_scores["de"]["de"] = get_f1_score(trainer, panx_de_encoded["test"])
print(f"F1-score of [de] model on [de] dataset: {f1_scores['de']['de']:.3f}")

In [None]:
# see how the model fine-tuned on German performs on French
text_fr = "Jeff Dean est informaticien chez Google en Californie"
tag_text(text_fr, tags, trainer.model, xlmr_tokenizer)


In [None]:
def evaluate_lang_performance(lang, trainer):
    panx_ds = encode_panx_dataset(panx_ch[lang])
    return get_f1_score(trainer, panx_ds["test"])

In [None]:
f1_scores["de"]["fr"] = evaluate_lang_performance("fr", trainer)
print(f"F1-score of [de] model on [fr] dataset: {f1_scores['de']['fr']:.3f}")

This is a drop of about 15 points. Remember, our model has not seen a single labeled French example. Generally, the size of the performance drop is related to how "far away" the languages are from each other; Germanic and Romance languages are different families after all.

In [None]:
# look at Italian; also Romance language so similar to French result
f1_scores["de"]["it"] = evaluate_lang_performance("it", trainer)
print(f"F1-score of [de] model on [it] dataset: {f1_scores['de']['it']:.3f}")

In [None]:
# finally English, which belongs to the Germanic language family
f1_scores["de"]["en"] = evaluate_lang_performance("en", trainer)
print(f"F1-score of [de] model on [en] dataset: {f1_scores['de']['en']:.3f}")

The model fares *worst* on English, even though intuitively we would expect German to be more similar to English than to French. So let's next examine when it makes sense to fine-tune directly on the target language.
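
To compare the transfer gaps side by side, the drop relative to the source-language score can be computed from a nested dict shaped like `f1_scores`. This is a minimal sketch: `transfer_gaps` is a hypothetical helper, and the scores below are placeholders for illustration, not results from this notebook.

```python
def transfer_gaps(scores, source):
    """Return the F1 drop, in points, from the source language to each target language."""
    baseline = scores[source][source]
    return {
        target: round(100 * (baseline - f1), 1)
        for target, f1 in scores[source].items()
        if target != source
    }

# Placeholder scores for illustration only, not results from this notebook
example_scores = {"de": {"de": 0.80, "fr": 0.70, "en": 0.60}}
gaps = transfer_gaps(example_scores, "de")
print(gaps)
```

In the notebook this could be called as `transfer_gaps(f1_scores, "de")` once all target languages have been evaluated.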

### When Does Zero-Shot Transfer Make Sense?

Fine-tune on training sets of increasing size and track performance to determine up to which point zero-shot cross-lingual transfer is superior. In practice this can help guide decisions about whether to collect more labeled data.

Keep the hyperparameters from fine-tuning on the German corpus, and tweak the `logging_steps` argument of `TrainingArguments` to account for the changing training set sizes. Wrap it all in a simple function that takes a `DatasetDict` object corresponding to a monolingual corpus, downsamples it to `num_samples` examples, and fine-tunes XLM-R on that sample to return the metrics from the best epoch:

In [None]:
def train_on_subset(dataset, num_samples):
    train_ds = dataset["train"].shuffle(seed=42).select(range(num_samples))
    valid_ds = dataset["validation"]
    test_ds = dataset["test"]
    
    training_args.logging_steps = len(train_ds) // batch_size
    
    trainer = Trainer(model_init=model_init, args=training_args,
                     data_collator=data_collator, compute_metrics=compute_metrics,
                     train_dataset=train_ds, eval_dataset=valid_ds, tokenizer=xlmr_tokenizer)
    trainer.train()
    
    if training_args.push_to_hub:
        trainer.push_to_hub(commit_message="Training completed!")
        
    f1_score = get_f1_score(trainer, test_ds)
    return pd.DataFrame.from_dict(
        {"num_samples": [len(train_ds)], "f1_score": [f1_score]}
    )
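
The `logging_steps` line in `train_on_subset()` sets the logging interval to the number of optimizer steps in one epoch, so metrics are logged roughly once per epoch regardless of subset size. The arithmetic can be sketched as below. `logging_steps_for` is a hypothetical helper, `batch_size=24` is an assumed value, and the clamp to 1 is an addition not present in the original function (floor division yields 0 when the subset is smaller than one batch).

```python
def logging_steps_for(num_samples, batch_size):
    """Steps per epoch, so that metrics are logged about once per epoch."""
    # Floor division gives 0 when num_samples < batch_size, so clamp to 1
    return max(1, num_samples // batch_size)

# batch_size=24 is an assumed value for illustration
for n in [250, 500, 1000, 2000, 4000]:
    print(n, logging_steps_for(n, batch_size=24))
```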

In [None]:
# encode French corpus to input ids, attention masks and label ids
panx_fr_encoded = encode_panx_dataset(panx_ch["fr"])

In [None]:
# test with small training set of 250 examples
training_args.push_to_hub = False
metrics_df = train_on_subset(panx_fr_encoded, 250)
metrics_df

With only 250 examples, fine-tuning on French underperforms zero-shot transfer from German by a large margin. Let's see how results vary with increasing training set sizes:

In [None]:
for num_samples in [500, 1000, 2000, 4000]:
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    metrics_df = pd.concat(
        [metrics_df, train_on_subset(panx_fr_encoded, num_samples)],
        ignore_index=True)

In [None]:
# plot F1 score on the test set as a function of training set size
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.axhline(f1_scores["de"]["fr"], ls="--", color="r")
metrics_df.set_index("num_samples").plot(ax=ax)
plt.legend(["Zero-shot from de", "Fine-tuned on fr"], loc="lower right")
plt.ylim((0, 1))
plt.xlabel("Number of Training Samples")
plt.ylabel("F1 Score")
plt.show()

Zero-shot transfer remains competitive until about 750 training examples, after which fine-tuning on French reaches a level of performance similar to what we obtained when fine-tuning on German. Still, the zero-shot result is not to be sniffed at! Getting labels can be expensive, so zero-shot transfer learning can have a large business impact.
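
Rather than reading the crossover point off the plot, it can be estimated by linear interpolation between the measured training-set sizes. This is a rough sketch under stated assumptions: `estimate_crossover` is a hypothetical helper, and the sizes and scores below are placeholders, not results from this notebook.

```python
def estimate_crossover(sizes, scores, baseline):
    """Estimate the training-set size at which scores first reach the baseline.

    Uses linear interpolation between neighbouring measurements and returns
    None if the baseline is never reached.
    """
    points = list(zip(sizes, scores))
    if points and points[0][1] >= baseline:
        return float(points[0][0])
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 < baseline <= y1:
            return x0 + (baseline - y0) * (x1 - x0) / (y1 - y0)
    return None

# Placeholder numbers for illustration only, not results from this notebook
sizes = [250, 500, 1000, 2000]
scores = [0.10, 0.60, 0.80, 0.85]
crossover = estimate_crossover(sizes, scores, baseline=0.70)
print(crossover)
```

With the notebook's objects, the call would look like `estimate_crossover(metrics_df["num_samples"], metrics_df["f1_score"], f1_scores["de"]["fr"])`. With only a handful of measured sizes the estimate is coarse, so it is best treated as a guide for where to sample more densely.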

One final technique: fine-tuning on multiple languages at once!

## Fine-Tuning on Multiple Languages at Once

Fine-tune on multiple languages at the same time to mitigate the drop in performance seen with zero-shot transfer. First, use the `concatenate_datasets()` function from HuggingFace Datasets to concatenate the German and French corpora:

In [None]:
from datasets import concatenate_datasets

def concatenate_splits(corpora):
    multi_corpus = DatasetDict()
    for split in corpora[0].keys():
        multi_corpus[split] = concatenate_datasets(
            [corpus[split] for corpus in corpora]).shuffle(seed=42)
    return multi_corpus

panx_de_fr_encoded = concatenate_splits([panx_de_encoded, panx_fr_encoded])
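
The shuffle after concatenation matters: without it, the training split would contain all German examples followed by all French ones. The idea can be illustrated on plain Python lists, used here as a stand-in for `Dataset` objects. `concatenate_and_shuffle` and the toy examples are hypothetical and only meant to show the pattern.

```python
import random

def concatenate_and_shuffle(corpora, seed=42):
    """Concatenate lists of examples and shuffle them reproducibly."""
    merged = [example for corpus in corpora for example in corpus]
    # A seeded generator keeps the order identical across runs
    random.Random(seed).shuffle(merged)
    return merged

# Toy examples tagged with their language
de_examples = [("de", i) for i in range(6)]
fr_examples = [("fr", i) for i in range(2)]
mixed = concatenate_and_shuffle([de_examples, fr_examples])
print([lang for lang, _ in mixed])
```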

In [None]:
# use the same hyperparameters from previous sections, so we only need to update the logging steps, model and datasets in the trainer
training_args.logging_steps = len(panx_de_fr_encoded["train"]) // batch_size
training_args.push_to_hub = True
training_args.output_dir = "xlm-roberta-base-finetuned-panx-de-fr"

trainer = Trainer(model_init=model_init, args=training_args,
                 data_collator=data_collator, compute_metrics=compute_metrics, 
                 tokenizer=xlmr_tokenizer, train_dataset=panx_de_fr_encoded["train"],
                 eval_dataset=panx_de_fr_encoded["validation"])

trainer.train()
trainer.push_to_hub(commit_message="Training completed!")

In [None]:
# see how the model performs on each language's test set
for lang in langs:
    f1 = evaluate_lang_performance(lang, trainer)
    print(f"F1-score of [de-fr] model on [{lang}] dataset: {f1:.3f}")

Much better performance on French, matching German. Also increases performance on Italian and English by roughly 10 points. So adding data in another language improves the model performance on unseen languages!

Finalise by comparing performance of fine-tuning on each language separately against multilingual learning on all corpora. Can fine-tune on remaining languages with `train_on_subset()` with `num_samples` equal to number of examples in training set.

In [None]:
corpora = [panx_de_encoded]

# exclude German from iteration
for lang in langs[1:]:
    training_args.output_dir = f"xlm-roberta-base-finetuned-panx-{lang}"
    # finetune on monolingual corpus
    ds_encoded = encode_panx_dataset(panx_ch[lang])
    metrics = train_on_subset(ds_encoded, ds_encoded["train"].num_rows)
    # collect F1 scores in common dict
    f1_scores[lang][lang] = metrics["f1_score"][0]
    # add monolingual corpus to corpora to concatenate
    corpora.append(ds_encoded)

In [None]:
# now concatenate all the splits to create a multilingual corpus of all four languages; use concatenate_splits() as before
corpora_encoded = concatenate_splits(corpora)

In [None]:
# run familiar steps with the trainer
training_args.logging_steps = len(corpora_encoded["train"]) // batch_size
training_args.output_dir = "xlm-roberta-base-finetuned-panx-all"

trainer = Trainer(model_init=model_init, args=training_args,
                 data_collator=data_collator, compute_metrics=compute_metrics,
                 tokenizer=xlmr_tokenizer, train_dataset=corpora_encoded["train"],
                 eval_dataset=corpora_encoded["validation"])

trainer.train()
trainer.push_to_hub(commit_message="Training completed!")

In [None]:
# finally generate predictions on each language's test set
for idx, lang in enumerate(langs):
    f1_scores["all"][lang] = get_f1_score(trainer, corpora[idx]["test"])
    
scores_data = {"de": f1_scores["de"],
              "each": {lang: f1_scores[lang][lang] for lang in langs},
              "all": f1_scores["all"]}
f1_scores_df = pd.DataFrame(scores_data).T.round(4)
f1_scores_df.rename_axis(index="Fine-tune on", columns="Evaluated on", inplace=True)

f1_scores_df

A few conclusions:
- Multilingual learning can provide significant performance gains, especially if the low-resource languages for cross-lingual transfer belong to similar language families. German, French and Italian achieve similar performance in the `all` category, suggesting that these languages are more similar to each other than to English.
- As a general strategy, it is a good idea to focus attention on cross-lingual transfer within language families, especially when dealing with different scripts like Japanese.

## Conclusion

We saw how to tackle an NLP task on a multilingual corpus using a single transformer pretrained on 100 languages: XLM-R. Although we were able to show that cross-lingual transfer from German to French is competitive when only a small number of labeled examples is available for fine-tuning, this good performance generally does not occur if the target language is significantly different from the one the base model was fine-tuned on, or is not one of the 100 languages used during pretraining. Recent proposals like MAD-X are designed for such low-resource scenarios, and since MAD-X is built on top of HuggingFace Transformers you can easily adapt the code to work with it.

So far we have looked at sequence classification and token classification, both of which fall into the domain of natural language understanding, where text is synthesized into predictions. Next we can look at text generation, where both the input and the output are text.