# Multilingual Named Entity Recognition

In the notebooks so far, we've applied transformers to NLP problems on English corpora. But what if the dataset contains multiple languages? Maintaining multiple monolingual models in production is no fun.

Fortunately, a class of multilingual transformers comes to the rescue. Like BERT, they use masked language modeling (MLM) as a pretraining objective.

By pretraining on huge multilingual corpora, these models can achieve zero-shot cross-lingual transfer: a model fine-tuned on one language can be evaluated on another language without any explicit fine-tuning for that language.

In this notebook we'll use XLM-RoBERTa, pretrained on [2.5 terabytes of text based on the Common Crawl corpus](https://commoncrawl.org/).
The corpus contains no parallel texts (translations); an encoder was trained on it with MLM.

Some applications of NER:
* insights from documents
* augmenting quality of search engines
* building a structured database from a corpus

> Note: *Zero-shot transfer* or *zero-shot learning* usually refers to the task of training a model on one set of labels and then evaluating it on a different set of labels. In the context of transformers, zero-shot learning may also refer to situations where a language model like GPT-3 is evaluated on a downstream task that it wasn't even fine-tuned on.

Problem (assumption):

We want to perform NER for a customer based in Switzerland, where there are four national languages (with English often serving as a bridge between them).

## The Dataset

We'll use a subset of the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark called WikiANN or PAN-X. This dataset consists of Wikipedia articles in many languages, including the four languages from our problem. Each article is annotated with `LOC` (location), `PER` (person) and `ORG` (organisation) tags in [IOB2 format](https://oreil.ly/yXMUn).

In this format, a `B-` prefix indicates the beginning of an entity, and consecutive tokens belonging to the same entity are given an `I-` prefix. An `O` tag indicates that the token does not belong to any entity.

In [1]:
# Sample of the annotation scheme
import pandas as pd
toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
df = pd.DataFrame(data=[toks, lbls], index=['Tokens', 'Tags'])
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Tokens,Jeff,Dean,is,a,computer,scientist,at,Google,in,California
Tags,B-PER,I-PER,O,O,O,O,O,B-ORG,O,B-LOC
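To make the IOB2 scheme concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The helper name `iob2_to_spans` is hypothetical (it is not part of any library used in this notebook), and the handling of a stray `I-` tag is one possible choice among several.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # "O" tag (or an I- tag that does not continue the open entity)
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


toks = "Jeff Dean is a computer scientist at Google in California".split()
lbls = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
spans = iob2_to_spans(toks, lbls)
print(spans)
```

For the sample sentence this yields one `PER` span covering both name tokens, plus single-token `ORG` and `LOC` spans.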


In [2]:
from datasets import get_dataset_config_names

xtreme_subsets = get_dataset_config_names("xtreme")
print(f"Number of subsets in XTREME {len(xtreme_subsets)}")
print(f"Subset names in XTREME {xtreme_subsets}")

Number of subsets in XTREME 183
Subset names in XTREME ['XNLI', 'tydiqa', 'SQuAD', 'PAN-X.af', 'PAN-X.ar', 'PAN-X.bg', 'PAN-X.bn', 'PAN-X.de', 'PAN-X.el', 'PAN-X.en', 'PAN-X.es', 'PAN-X.et', 'PAN-X.eu', 'PAN-X.fa', 'PAN-X.fi', 'PAN-X.fr', 'PAN-X.he', 'PAN-X.hi', 'PAN-X.hu', 'PAN-X.id', 'PAN-X.it', 'PAN-X.ja', 'PAN-X.jv', 'PAN-X.ka', 'PAN-X.kk', 'PAN-X.ko', 'PAN-X.ml', 'PAN-X.mr', 'PAN-X.ms', 'PAN-X.my', 'PAN-X.nl', 'PAN-X.pt', 'PAN-X.ru', 'PAN-X.sw', 'PAN-X.ta', 'PAN-X.te', 'PAN-X.th', 'PAN-X.tl', 'PAN-X.tr', 'PAN-X.ur', 'PAN-X.vi', 'PAN-X.yo', 'PAN-X.zh', 'MLQA.ar.ar', 'MLQA.ar.de', 'MLQA.ar.vi', 'MLQA.ar.zh', 'MLQA.ar.en', 'MLQA.ar.es', 'MLQA.ar.hi', 'MLQA.de.ar', 'MLQA.de.de', 'MLQA.de.vi', 'MLQA.de.zh', 'MLQA.de.en', 'MLQA.de.es', 'MLQA.de.hi', 'MLQA.vi.ar', 'MLQA.vi.de', 'MLQA.vi.vi', 'MLQA.vi.zh', 'MLQA.vi.en', 'MLQA.vi.es', 'MLQA.vi.hi', 'MLQA.zh.ar', 'MLQA.zh.de', 'MLQA.zh.vi', 'MLQA.zh.zh', 'MLQA.zh.en', 'MLQA.zh.es', 'MLQA.zh.hi', 'MLQA.en.ar', 'MLQA.en.de', 'MLQA.en.vi', 'ML

In [3]:
# We're looking for subsets with PAN
panx_subsets = [subset for subset in xtreme_subsets if subset.startswith("PAN")]
panx_subsets[:3], len(panx_subsets)

(['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg'], 40)

There are 40 PAN-X subsets available.

Each PAN-X subset is suffixed with a two-letter [ISO 639-1 language code](https://oreil.ly/R8XNu). This means that to load the German corpus, we append the `de` code to the `PAN-X.` prefix and pass `PAN-X.de` to the `name` argument of `load_dataset()`.

In [4]:
from datasets import load_dataset
# Loading german PAN-X
panx_de_dataset = load_dataset(path="xtreme", name="PAN-X.de")

In [5]:
panx_de_dataset["train"][0]

{'tokens': ['als', 'Teil', 'der', 'Savoyer', 'Voralpen', 'im', 'Osten', '.'],
 'ner_tags': [0, 0, 0, 5, 6, 0, 0, 0],
 'langs': ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']}

To make a realistic Swiss corpus, we'll sample the German (`de`), French (`fr`), Italian (`it`) and English (`en`) corpora from PAN-X according to their spoken proportions. This will create a language imbalance that is very common in real-world datasets, where acquiring labeled examples in a minority language can be expensive due to the lack of domain experts who are fluent in that language. This imbalanced dataset simulates a common situation when working on multilingual applications, and we'll see how to build a model that works on all languages.

To keep track of each language, let's create a Python `defaultdict` that stores the language code as the key and a PAN-X corpus of type `DatasetDict` as the value:

### Creating multiple languages dataset dict

In [6]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]

# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to its fraction
    for split in ds:
        # Store the downsampled split in this language's DatasetDict
        n_samples = int(len(ds[split]) * frac)
        panx_ch[lang][split] = ds[split].shuffle(seed=42).select(range(n_samples))


In [7]:
panx_ch

defaultdict(datasets.dataset_dict.DatasetDict,
            {'de': DatasetDict({
                 train: Dataset({
                     features: ['tokens', 'ner_tags', 'langs'],
                     num_rows: 12580
                 })
                 validation: Dataset({
                     features: ['tokens', 'ner_tags', 'langs'],
                     num_rows: 6290
                 })
                 test: Dataset({
                     features: ['tokens', 'ner_tags', 'langs'],
                     num_rows: 6290
                 })
             }),
             'fr': DatasetDict({
                 train: Dataset({
                     features: ['tokens', 'ner_tags', 'langs'],
                     num_rows: 4580
                 })
                 validation: Dataset({
                     features: ['tokens', 'ner_tags', 'langs'],
                     num_rows: 2290
                 })
                 test: Dataset({
                     features: ['tokens', 'ner_tags', 'la

We've used the `shuffle()` method to avoid accidentally biasing our dataset splits, and `select()` to downsample each corpus according to the values in `fracs`.

Let's check how many training examples we have per language.

In [8]:
import pandas as pd
pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs},
             index=["Number of training examples"])

Unnamed: 0,de,fr,it,en
Number of training examples,12580,4580,1680,1180


By design, we have more examples in German than in all other languages combined, so we'll use it as the starting point from which to perform zero-shot cross-lingual transfer to French, Italian and English.

In [9]:
# Inspecting a sample from de
element = panx_ch["de"]["train"][0]
for k, v in element.items():
    print(f"{k}: {v}")

tokens: ['Olympique', 'Nîmes', ',', 'Auxerres', 'seinerzeitiger', 'drittklassiger', 'Endspielgegner', ',', 'hatte', 'sich', 'erst', 'gar', 'nicht', 'für', 'die', 'Hauptrunde', 'qualifizieren', 'können', '.']
ner_tags: [3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']


> **Note**: In each `Dataset` object, the keys correspond to column names of an Arrow table and values denote the entries in each column.

In [10]:
pd.DataFrame(panx_ch["de"]["train"][0])

Unnamed: 0,tokens,ner_tags,langs
0,Olympique,3,de
1,Nîmes,4,de
2,",",0,de
3,Auxerres,0,de
4,seinerzeitiger,0,de
5,drittklassiger,0,de
6,Endspielgegner,0,de
7,",",0,de
8,hatte,0,de
9,sich,0,de


In [11]:
# Let's add the ner tag labels to the data
tags = panx_ch["de"]["train"].features["ner_tags"].feature
tags

ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None)

In [12]:
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

In [13]:
panx_de = panx_ch["de"].map(create_tag_names)
pd.DataFrame(panx_de["train"][0])


Unnamed: 0,tokens,ner_tags,langs,ner_tags_str
0,Olympique,3,de,B-ORG
1,Nîmes,4,de,I-ORG
2,",",0,de,O
3,Auxerres,0,de,O
4,seinerzeitiger,0,de,O
5,drittklassiger,0,de,O
6,Endspielgegner,0,de,O
7,",",0,de,O
8,hatte,0,de,O
9,sich,0,de,O


In [14]:
pd.DataFrame(panx_de["train"][0])[["tokens", "ner_tags_str"]]

Unnamed: 0,tokens,ner_tags_str
0,Olympique,B-ORG
1,Nîmes,I-ORG
2,",",O
3,Auxerres,O
4,seinerzeitiger,O
5,drittklassiger,O
6,Endspielgegner,O
7,",",O
8,hatte,O
9,sich,O


In [15]:
# Let's check the balance of tags
from collections import Counter

split2freqs = defaultdict(Counter)

for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1

pd.DataFrame.from_dict(split2freqs, orient="index")

Unnamed: 0,ORG,PER,LOC
train,5397,5881,6169
validation,2639,2870,3172
test,2657,2971,3100
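Raw counts are hard to compare across splits of different sizes, so a quick sketch that turns them into per-split proportions can help. The numbers below are copied by hand from the table above; the variable names `counts` and `shares` are illustrative only.

```python
# Entity counts per split, copied from the frequency table above
counts = {
    "train": {"ORG": 5397, "PER": 5881, "LOC": 6169},
    "validation": {"ORG": 2639, "PER": 2870, "LOC": 3172},
    "test": {"ORG": 2657, "PER": 2971, "LOC": 3100},
}

# Convert counts to the share of each entity type within its split
shares = {
    split: {tag: n / sum(freqs.values()) for tag, n in freqs.items()}
    for split, freqs in counts.items()
}

for split, freqs in shares.items():
    print(split, {tag: round(share, 3) for tag, share in freqs.items()})
```

Each entity type makes up roughly a third of the entities in every split, which suggests the validation and test sets are a reasonable proxy for the training distribution.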


## Multilingual Transformers

Multilingual transformers use training procedures and architectures similar to those of their monolingual counterparts, except that the pretraining corpus consists of documents in multiple languages.

In some cases these models can compete with or outperform monolingual models, and they generalize well across languages for a variety of downstream tasks.

To measure the performance of cross-lingual transfer for NER, the [CoNLL-2002](https://huggingface.co/datasets/conll2002) and [CoNLL-2003](https://huggingface.co/datasets/conll2003) datasets are often used as a benchmark for English, Dutch, Spanish and German. This benchmark has the same three entity types as PAN-X (`LOC`, `PER` and `ORG`) plus an additional `MISC` label for other entities.

### Multilingual Transformers Evaluation

These models can be evaluated in three different ways:

`en`

Fine-tune on the English training data and then evaluate on each language's test set.

`each`

Fine-tune and test on each language individually.

`all`

Fine-tune on all the training data and evaluate on each language's test set.

### Models

mBERT is one of the first multilingual transformers. It differs from BERT only in its pretraining corpus (multilingual Wikipedia articles).

XLM-RoBERTa (or XLM-R for short) has since superseded mBERT.

XLM-R uses only MLM as a pretraining objective, for 100 languages. Its distinguishing feature is the size of its pretraining corpus: Wikipedia dumps for each language plus 2.5 TB of Common Crawl data. This is several orders of magnitude larger than the corpora used by previous models, and it gives a significant boost for low-resource languages where only a small number of Wikipedia articles exist.

The RoBERTa part of the model name refers to the pretraining approach, which is the same as for the monolingual RoBERTa models. The notable characteristics are:
* Removes the next sentence prediction task (as RoBERTa does relative to BERT)
* Drops the language embeddings used in XLM and applies a SentencePiece tokenizer to the raw text directly
* Uses a vocabulary of 250,000 tokens versus 55,000

XLM-R is a great choice for multilingual NLU tasks.

### Closer look at Tokenization

Let's compare the WordPiece tokenizer (BERT) with the SentencePiece tokenizer (XLM-R).

In [16]:
from transformers import AutoTokenizer

bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"

bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

In [17]:
# Let's encode a text to find the special tokens used by each
text = "Human's heart is mental!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()

In [18]:
pd.DataFrame([bert_tokens, xlmr_tokens])

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,[CLS],Human,',s,heart,is,mental,!,[SEP]
1,<s>,▁Human,',s,▁heart,▁is,▁mental,!,</s>


BERT uses `[CLS]` and `[SEP]` to indicate start and end of text,
XLM-R uses `<s>` and `</s>` to do the same.

How are these tokens added? Let's see...

### Tokenizer Pipeline

![alt tokenizer pipeline](../notes/images/4-multilingual-named-entity-recognition/tokeinzier-pipeline.png)

Assume the text `Jack Sparrow loves New York!` is passed through the pipeline.

#### Normalization

This step corresponds to the set of operations applied to raw text to make it `cleaner`. Some of them might be as follows,

1. Stripping whitespace
2. Removing accented characters
3. [Unicode normalization](https://unicode.org/reports/tr15/)
    - A normalization applied by many tokenizers to deal with the fact that there are often many ways to write the same character.
    - Without it, two versions of the same string can appear different.
    - Schemes like NFC, NFD, NFKC and NFKD replace the various ways of writing the same character with standard forms.
4. Lowercasing
    - If the model accepts and uses only lowercase characters, this reduces the size of the vocabulary it requires.

After normalization the text would look like,
`jack sparrow loves new york!`
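The normalization steps above can be sketched with Python's standard `unicodedata` module. The `normalize` helper is a hypothetical illustration, not the normalizer that BERT or XLM-R actually use; it strips whitespace, removes accents via NFD decomposition, and lowercases.

```python
import unicodedata


def normalize(text):
    """Strip whitespace, remove accents and lowercase a string."""
    text = text.strip()
    # NFD splits accented characters into a base character + combining marks
    text = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") to remove the accents
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text.lower()


# The same character written in two different ways
composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(normalize("  Jack Sparrow loves New York!  "))
```

The first two prints show why Unicode normalization matters: the two spellings of `é` compare unequal until they are brought to the same normal form.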

#### Pretokenization

* Pretokenization gives an upper bound on what the tokens can be at the end of training.
* One way to think about it is splitting a string into words based on whitespace, which works well for Indo-European languages. These words can then be split into simpler subwords with Byte-Pair Encoding or Unigram algorithms in the next step.
* Splitting on whitespace does not work well for languages like Chinese, Japanese and Korean, where grouping symbols into semantic units can be non-deterministic, with several equally valid groupings.
* For these languages, the best approach is to pretokenize using a language-specific library.

After pretokenization,
`["jack", "sparrow", "loves", "new", "york","!"]`
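A whitespace-and-punctuation pretokenizer can be sketched with a single regular expression. This is a simplified stand-in, assuming text where whitespace separates words; the `pretokenize` name is hypothetical, and real tokenizers ship their own pretokenizers.

```python
import re


def pretokenize(text):
    """Split text into words and standalone punctuation marks."""
    # \w+ matches runs of word characters; [^\w\s] matches a single
    # character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


words = pretokenize("jack sparrow loves new york!")
print(words)
```

Note that the `!` is separated from `york` even though no whitespace sits between them, which is exactly the information a later detokenization step has to recover.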

#### Tokenizer model

* The model splits the words from the pretokenizer into subwords.
* This reduces the size of the vocabulary and the number of out-of-vocabulary tokens.
* Several subword tokenization algorithms exist
    - BPE
    - Unigram
    - WordPiece
* Each resulting token is then mapped to an integer, giving us a list of input IDs.

Before the conversion to IDs, the text looks like this,
`["jack", "spa", "rrow", "loves", "new", "york", "!"]`
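As a rough sketch of what a subword model does, the snippet below applies greedy longest-match-first splitting against a tiny vocabulary and then maps the pieces to input IDs. The vocabulary `toy_vocab` is invented for illustration; real tokenizers learn their vocabulary from data with BPE, Unigram or WordPiece, and WordPiece additionally marks continuation pieces with a `##` prefix.

```python
def greedy_subwords(word, vocab, unk="<unk>"):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: the whole word is unknown
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary mapping tokens to input IDs
toy_vocab = {"<unk>": 0, "jack": 1, "spa": 2, "rrow": 3,
             "loves": 4, "new": 5, "york": 6, "!": 7}

words = ["jack", "sparrow", "loves", "new", "york", "!"]
tokens = [piece for word in words for piece in greedy_subwords(word, toy_vocab)]
input_ids = [toy_vocab[token] for token in tokens]

print(tokens)
print(input_ids)
```

Because `sparrow` is not in the toy vocabulary, it is covered by the two pieces `spa` and `rrow` instead of becoming an out-of-vocabulary token.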

#### Postprocessing

* This is the last piece of the tokenization pipeline, where special tokens are added.
* For example, `[CLS]` and `[SEP]` are added by the BERT tokenizer, and `<s>` and `</s>` by the XLM-R SentencePiece tokenizer.


### The SentencePiece Tokenizer

* The SentencePiece tokenizer is based on a type of subword segmentation called Unigram and encodes each input text as a sequence of Unicode characters.
* This feature is especially useful for multilingual corpora, as it allows SentencePiece to be agnostic about accents, punctuation and the fact that many languages, like Japanese, do not have whitespace characters.
* Another special feature is that SentencePiece replaces whitespace with the Unicode symbol U+2581, the `▁` character. This enables detokenization without ambiguities and without relying on language-specific pretokenizers.
* For example, WordPiece has lost the information that there is no whitespace between `York` and `!`, whereas SentencePiece preserves the whitespace in the tokenized text.
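The difference can be illustrated without loading any tokenizer, using the two token lists printed earlier in this notebook. The naive space-join for WordPiece is a deliberately simple assumption, not how the BERT tokenizer decodes text; it only shows that the whitespace information is no longer present in the tokens themselves.

```python
# Token lists copied from the tokenizer outputs shown earlier in this notebook
bert_tokens = ["[CLS]", "Human", "'", "s", "heart", "is", "mental", "!", "[SEP]"]
xlmr_tokens = ["<s>", "\u2581Human", "'", "s", "\u2581heart", "\u2581is",
               "\u2581mental", "!", "</s>"]

# WordPiece tokens carry no whitespace marker, so a naive join has to guess
# where the spaces were and inserts them everywhere
naive_bert_text = " ".join(bert_tokens[1:-1])

# SentencePiece marks each original space with U+2581, so the text can be
# rebuilt by concatenating the tokens and swapping the marker back
xlmr_text = "".join(xlmr_tokens[1:-1]).replace("\u2581", " ").strip()

print(naive_bert_text)
print(xlmr_text)
```

Only the SentencePiece version round-trips to the original string for this example.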

In [19]:
xlmr_tokens

['<s>', '▁Human', "'", 's', '▁heart', '▁is', '▁mental', '!', '</s>']

In [20]:
"".join(xlmr_tokens).replace(u"\u2581", " ")

"<s> Human's heart is mental!</s>"

Let's see how we can encode our simple example in a form suitable for NER.

We could load a pretrained model together with a ready-made token classification head. But instead of loading the head from Transformers, we're going to build it ourselves by diving into the Transformers API.

### Transformers for Named Entity Recognition

In our [2-text-classification.ipynb](../notebooks/2-text-classification.ipynb) notebook, BERT used the special token `[CLS]` to represent the entire sequence of text. This representation is then fed through dense layers and a classification head to predict the label.

*Sequence classification*
![alt](../notes/images/4-multilingual-named-entity-recognition/sequence-classification.png)

For NER we do the same, but for every token: each token's representation is fed to the classification layer and assigned an entity label. Hence NER is often framed as a *token classification* task.

*token classification*
![alt](../notes/images/4-multilingual-named-entity-recognition/token-classification.png)

How should subwords be handled in a token classification task?

When a word is split into subwords such as `Chr` and `##ista`, the first subword `Chr` is assigned the `B-PER` label and `##ista` is assigned an `IGN` (ignored) label. The ignored labels can be propagated from the first subword later, in a postprocessing step. This is the convention followed in the BERT paper.
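Here is a minimal sketch of that convention. It assumes we already have, for each subword, the index of the word it came from (the kind of list a fast tokenizer's `word_ids()` method returns); the `word_ids` list below is written by hand for illustration and does not come from a real tokenizer run. The value `-100` plays the role of `IGN` because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label only the first subword of each word; ignore everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special tokens and continuation subwords are ignored
            aligned.append(ignore_index)
        else:
            # First subword of a new word keeps the word-level label
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical example: 4 words, the first one split into two subwords,
# wrapped in special tokens (word id None)
word_ids = [None, 0, 0, 1, 2, 3, None]
word_labels = [1, 0, 0, 5]  # B-PER, O, O, B-LOC

aligned = align_labels(word_ids, word_labels)
print(aligned)
```

The aligned list has one entry per subword, so it can be used directly as the `labels` tensor for a token classification model.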

All architectural aspects of BERT carry over to XLM-R, since its architecture is based on RoBERTa, which is identical to BERT.

Next we'll see how Transformers supports many other tasks with minor modifications.

## The Anatomy of the Transformers Model Class

Transformers is organized around dedicated classes for each architecture and task. The model classes associated with different tasks are named according to a `<ModelName>For<Task>` convention, or `AutoModelFor<Task>` when using the `AutoModel` classes.

Suppose we want to try out a proof of concept with a particular model, but the task we need is not available for it. No need to worry: this is where the body and head concept of Transformers comes in.

### Bodies and Heads

As we saw in [2-text-classification.ipynb](../notebooks/2-text-classification.ipynb), we used DistilBERT's body and trained a classification head on top of it.

Here the body is task-agnostic: it is a set of weights pretrained on a corpus. A head can be attached to the body to leverage the features it has learned and use them for our downstream task.

> This is the main concept that makes transformers so versatile. The split of architecture into a `body` and `head`.

This structure is reflected in the Transformers code as well: the body of a model is implemented in a class such as `BertModel` or `GPT2Model` that returns the hidden states of the last layer. Task-specific models such as `BertForMaskedLM` or `BertForSequenceClassification` use the base model and add the necessary head on top of the hidden states.

*Body-Head architecture*
![alt](../notes/images/4-multilingual-named-entity-recognition/body-head.png)

This separation of bodies and heads allows us to build a custom head for any task and just mount it on top of a pretrained model.

### Creating a Custom Model for Token Classification

Let's build a custom token classification head for XLM-R. Since XLM-R uses the same architecture as RoBERTa, we'll use RoBERTa as the base model, augmented with settings specific to XLM-R.

To get started, we need a data structure that will represent our XLM-R NER tagger. As a first guess, we'll need a configuration object to initialize the model and a `forward()` function to generate the outputs.

In [39]:
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel, RobertaPreTrainedModel

class XLMRobertaTokenClassification(RobertaPreTrainedModel):
    config_class = XLMRobertaConfig
    
    def __init__(self, config):
        super().__init__(config)
        # Number of NER tags
        self.num_labels = config.num_labels
        # Load model body
        self.roberta = RobertaModel(config, add_pooling_layer=False)
        # Set up token classification head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.num_labels)
        # Load and initialize weights
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None,**kwargs):
        # Use model body to get encoder representations
        outputs = self.roberta(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, **kwargs)

        # Apply classifier to outputs
        sequence_output = outputs[0]
        logits = self.classifier(self.dropout(sequence_output))

        # Calculate losses
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

***Model explanation:***

* `config_class` ensures that the standard XLM-R settings are used when we initialize a new model. To change the default parameters, we can overwrite the default settings in the configuration.
* With `super().__init__(config)` we call the initialization function of the `RobertaPreTrainedModel` class. This abstract class handles the initialization and loading of pretrained weights.
* Then we load our model body, `RobertaModel`, and extend it with our own classification head consisting of a dropout layer and a standard feed-forward layer.
* `add_pooling_layer=False` ensures that all hidden states are returned, and not only the one associated with the `[CLS]` token.
* Finally we call the `init_weights()` method inherited from `RobertaPreTrainedModel` to initialize the weights. When the model is created with `from_pretrained()`, the pretrained weights are loaded into the model body, while the weights of our token classification head stay randomly initialized.
* `forward()`: we pass the inputs through the model body to get the hidden states, which are then passed through the dropout and classification layers. If labels are provided, we calculate the loss. When an attention mask is used, the loss should only be calculated for unmasked tokens; the code above has no explicit masking logic, but `nn.CrossEntropyLoss` skips positions whose label equals its default `ignore_index` of `-100`. Finally, all outputs are wrapped in a `TokenClassifierOutput` object, which lets us access them by name.
* By implementing `__init__()` and `forward()`, we can build our own custom transformer model. Since we inherit from a `PreTrainedModel`, we instantly get access to all the useful Transformers utilities, such as `from_pretrained()`.
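To see what skipping tokens in the loss amounts to, here is a pure-Python sketch of a mean cross-entropy that leaves out positions labeled `-100`. It is meant to mirror the default behaviour of `nn.CrossEntropyLoss` (mean reduction, `ignore_index=-100`) on a flattened batch; the function name `cross_entropy_ignore` and the toy logits are made up for illustration.

```python
import math


def cross_entropy_ignore(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            # Ignored positions (e.g. special tokens) contribute nothing
            continue
        # Numerically stable log-sum-exp for the softmax normalizer
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else float("nan")


# Three token positions, two classes; the middle position is ignored
toy_logits = [[0.0, 0.0], [5.0, 0.0], [0.0, 0.0]]
toy_labels = [0, -100, 1]

loss = cross_entropy_ignore(toy_logits, toy_labels)
print(loss)
```

The two remaining positions have uniform logits over two classes, so each contributes `log(2)` and the mean is `log(2)` regardless of the logits at the ignored position.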


### Loading a custom model

To load our custom model we need the mappings from tags to IDs and from IDs back to tags. The `tags` variable (a `ClassLabel`) holds this information.

In [40]:
tags

ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None)

In [41]:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2idx = {tag: idx for idx, tag in enumerate(tags.names)}

In [42]:
index2tag, tag2idx

({0: 'O',
  1: 'B-PER',
  2: 'I-PER',
  3: 'B-ORG',
  4: 'I-ORG',
  5: 'B-LOC',
  6: 'I-LOC'},
 {'O': 0,
  'B-PER': 1,
  'I-PER': 2,
  'B-ORG': 3,
  'I-ORG': 4,
  'B-LOC': 5,
  'I-LOC': 6})
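As a quick sanity check, these two mappings can be used to decode integer tags back to strings and to confirm that they are inverses of each other. The sketch below rebuilds the mappings from the tag names printed above so that it runs on its own; `sample_ids` holds the first four `ner_tags` of the German training example inspected earlier.

```python
# Tag names as printed by the ClassLabel feature above
tag_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

index2tag = {idx: tag for idx, tag in enumerate(tag_names)}
tag2idx = {tag: idx for idx, tag in enumerate(tag_names)}

# First four ner_tags of the German example shown earlier
sample_ids = [3, 4, 0, 0]
decoded = [index2tag[idx] for idx in sample_ids]
print(decoded)

# Round trip: tag -> id -> tag gives back every tag unchanged
round_trip_ok = all(index2tag[tag2idx[tag]] == tag for tag in tag_names)
print(round_trip_ok)
```

The decoded tags match the `ner_tags_str` column produced earlier with `tags.int2str()`.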

Let's store these two mappings, plus `num_labels`, in the model configuration using the `AutoConfig` class.

In [43]:
from transformers import AutoConfig
xlmr_config = AutoConfig.from_pretrained(
    pretrained_model_name_or_path=xlmr_model_name,
    num_labels=tags.num_classes,
    id2label=index2tag,
    label2id=tag2idx,
)

Model config XLMRobertaConfig {
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-PER",
    "2": "I-PER",
    "3": "B-ORG",
    "4": "I-ORG",
    "5": "B-LOC",
    "6": "I-LOC"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-LOC": 5,
    "B-ORG": 3,
    "B-PER": 1,
    "I-LOC": 6,
    "I-ORG": 4,
    "I-PER": 2,
    "O": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "m

The `AutoConfig` class contains the blueprint of a model's architecture. When we load a model with `from_pretrained()`, the configuration file is downloaded automatically. To overwrite some values, we can load the configuration first and pass those parameters as keyword arguments, as we did above with `num_labels`, `id2label` and `label2id`.

Now we can load the model weights as usual with the `from_pretrained()` function, passing the additional `config` argument. Note that we did not implement loading pretrained weights in our custom model class; we get this for free by inheriting from `RobertaPreTrainedModel`.

In [44]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xlmr_model = XLMRobertaTokenClassification.from_pretrained(
    pretrained_model_name_or_path=xlmr_model_name,
    config=xlmr_config,
).to(device)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaTokenClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.bias', 'lm_head.lay

To validate that the custom model loaded correctly, let's test it on a sequence.

In [46]:
RobertaPreTrainedModel.main_input_name

'input_ids'

In [51]:
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["tokens", "ids"])

Unnamed: 0,0,1,2,3,4,5,6,7,8
tokens,<s>,▁Human,',s,▁heart,▁is,▁mental,!,</s>
ids,0,28076,25,7,26498,83,13893,38,2


In [52]:
input_ids

tensor([[    0, 28076,    25,     7, 26498,    83, 13893,    38,     2]])