**Named Entity Recognition**
- NER is a common NLP task that identifies entities like people, organizations or locations in text. These entities can be used for various applications such as gaining insights from documents, augmenting the quality of search engines, or building a structured database from a corpus.

**Initialization**
- I place these three lines at the top of each of my notebooks to avoid problems when reloading the same project: the first two reload modified modules automatically before code runs, and the third renders plots inline within the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Downloading Libraries and Dependencies**
- All the libraries and dependencies required for the project are installed and imported in a single cell.

In [21]:
#@ IMPORTING MODULES: UNCOMMENT BELOW:
# !pip install transformers[sentencepiece]
# !pip install datasets
import torch
import torch.nn as nn
from transformers import AutoConfig
from transformers import AutoTokenizer
from transformers import AutoModel
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel
from datasets import load_dataset
from datasets import get_dataset_config_names
from datasets import DatasetDict
from collections import defaultdict
from collections import Counter
import pandas as pd

#@ IGNORING WARNINGS: 
import warnings
warnings.filterwarnings("ignore")

**The Dataset**
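- The dataset consists of Wikipedia articles in many languages. Each article is annotated with LOC (location), PER (person), and ORG (organization) tags in IOB2 format, where a B- prefix marks the first token of an entity, an I- prefix marks a continuation token, and O marks tokens outside any entity.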
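
- A minimal sketch of how IOB2 tags group tokens into entities, using the German example that appears later in this notebook. The helper name `iob2_to_spans` is hypothetical and not part of any library; stray `I-` tags with no open entity are simply dropped here, which is one possible design choice among several.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2 tags into (entity_type, start, end, text) spans; end is exclusive."""
    spans = []
    start, ent_type = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the entity type matches.
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            spans.append((ent_type, start, i, " ".join(tokens[start:i])))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    # Close a span that runs to the end of the sentence.
    if start is not None:
        spans.append((ent_type, start, len(tags), " ".join(tokens[start:])))
    return spans

tokens = ["Brighton", "&", "Hove", "Albion", "-", "Scunthorpe", "United", "3:2"]
tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "B-ORG", "I-ORG", "O"]
spans = iob2_to_spans(tokens, tags)
print(spans)
```

- The two `B-ORG` tags open two separate organization spans, and the `O` tags between and after them close each span.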

In [5]:
#@ LOADING PAN-X DATASET:
xtreme_subsets = get_dataset_config_names("xtreme")                     # Listing XTREME benchmark configurations.
print(f"XTREME has {len(xtreme_subsets)} configurations")               # Inspecting number of configurations.
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]       # Initializing PAN-X dataset.
panx_subsets[:5]                                                        # Inspection.

XTREME has 183 configurations


['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg', 'PAN-X.bn', 'PAN-X.de']

In [7]:
#@ LOADING PAN-X DATASET:
langs = ["de", "fr", "it", "en"]                                                # Languages to load.
fracs = [0.629, 0.229, 0.084, 0.059]                                            # Fraction of each corpus to keep.
panx_ch = defaultdict(DatasetDict)                                              # Initializing dictionary. 
for lang, frac in zip(langs, fracs):
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")                           # Loading monolingual corpus.
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=2022)                                                 # Shuffling. 
            .select(range(int(frac * ds[split].num_rows))))                     # Downsampling dataset. 



In [8]:
#@ LOADING PAN-X DATASET:
pd.DataFrame({lang:[panx_ch[lang]["train"].num_rows] for lang in langs},
             index=["Number of training examples"])                             # Creating dataframe.

| | de | fr | it | en |
|---|---|---|---|---|
| Number of training examples | 12580 | 4580 | 1680 | 1180 |
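
- The shuffle-then-select downsampling used above can be sketched with the standard library alone. This is an illustration under assumptions: `downsample` is a hypothetical helper, and it uses Python's `random` module rather than the shuffling done by `datasets`, so the selected rows would differ from the notebook's even with the same seed.

```python
import random

def downsample(rows, frac, seed=2022):
    """Shuffle a copy of rows with a fixed seed, then keep the first int(frac * len(rows))."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)          # Seeded, so the subset is reproducible.
    return shuffled[: int(frac * len(shuffled))]   # int() truncates toward zero.

corpus = list(range(1000))                         # Stand-in for a dataset split.
subset = downsample(corpus, 0.25)
print(len(subset))
```

- Shuffling before selecting avoids keeping only the first rows of the corpus, and fixing the seed makes every run pick the same subset.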


In [9]:
#@ INSPECTING GERMAN CORPUS:
element = panx_ch["de"]["train"][0]                         # Initializing german corpus.
for key, value in element.items():
    print(f"{key}: {value}")                                # Inspection.

tokens: ['Brighton', '&', 'Hove', 'Albion', '-', 'Scunthorpe', 'United', '3:2']
ner_tags: [3, 4, 4, 4, 0, 3, 4, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']


In [10]:
#@ INSPECTING GERMAN CORPUS:
element = panx_ch["de"]["train"].features                   # Initializing german corpus.
for key, value in element.items():
    print(f"{key}: {value}")                                # Inspection.
tags = panx_ch["de"]["train"].features["ner_tags"].feature  # Initializing ner tags.
print(tags)                                                 # Inspection.

tokens: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ner_tags: Sequence(feature=ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None)
langs: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None)


In [12]:
#@ INITIALIZING TAG NAMES: 
def create_tag_names(batch):                                                    # Defining function. 
    return {"ner_tag_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}    # Converting integer to strings.
panx_de = panx_ch["de"].map(create_tag_names)                                   # Implementation of function.



In [13]:
#@ INSPECTING TAG NAMES:
de_example = panx_de["train"][0]                                                # Initialization.
pd.DataFrame([de_example["tokens"], de_example["ner_tag_str"]], 
             ["Tokens", "Tags"])                                                # Creating a dataframe.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Tokens | Brighton | & | Hove | Albion | - | Scunthorpe | United | 3:2 |
| Tags | B-ORG | I-ORG | I-ORG | I-ORG | O | B-ORG | I-ORG | O |


In [14]:
#@ CALCULATING FREQUENCIES OF ENTITIES: GENERAL:
split2freqs = defaultdict(Counter)                              # Initialization.
for split, dataset in panx_de.items():
    for row in dataset["ner_tag_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")             # Creating dataframe. 

| | ORG | LOC | PER |
|---|---|---|---|
| train | 5465 | 6130 | 5812 |
| validation | 2742 | 3070 | 2801 |
| test | 2624 | 3070 | 3017 |


**Multilingual Transformers**
- Multilingual transformers use architectures and training procedures similar to those of their monolingual counterparts, except that the corpus used for pretraining consists of documents in many languages. 

**Tokenization**
- XLM-R uses a tokenizer called SentencePiece that is trained on the raw text of all one hundred languages.

In [17]:
#@ INITIALIZING TOKENIZATION:
bert_model_name = "bert-base-cased"                                 # Initializing bert checkpoint.
xlmr_model_name = "xlm-roberta-base"                                # Initializing xlmr checkpoint.
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)     # Initializing pretrained tokenizer.
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)     # Initializing pretrained tokenizer. 
text = "Jack Sparrow loves New York!"                               # Initializing text example.
bert_tokens = bert_tokenizer(text).tokens()                         # Initializing bert tokens.
xlmr_tokens = xlmr_tokenizer(text).tokens()                         # Initializing xlmr tokens.

**The Tokenizer Pipeline**
  - Normalization.
  - Pretokenization. 
  - Tokenizer Model. 
  - Postprocessing.
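
- The four stages above can be sketched as a toy pipeline in plain Python. Everything here is illustrative: the vocabulary `TOY_VOCAB` and all function names are hypothetical, the lowercasing step is only an example of normalization, and the model stage uses greedy longest-match segmentation for simplicity rather than the Unigram algorithm that XLM-R's tokenizer relies on.

```python
import re
import unicodedata

TOY_VOCAB = {"jack", "spar", "row", "love", "s", "new", "york", "!"}  # Hypothetical vocabulary.

def normalize(text):
    # Normalization: Unicode NFKC, lowercasing, and whitespace cleanup.
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())

def pretokenize(text):
    # Pretokenization: split into words and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def segment(word, vocab):
    # Tokenizer model: greedy longest-match-first subword segmentation.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:                 # Nothing in the vocabulary matches.
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces

def postprocess(pieces):
    # Postprocessing: wrap the sequence in start and end special tokens.
    return ["<s>"] + pieces + ["</s>"]

def tokenize(text, vocab=TOY_VOCAB):
    words = pretokenize(normalize(text))
    return postprocess([p for w in words for p in segment(w, vocab)])

pipeline_tokens = tokenize("Jack  Sparrow loves New York!")
print(pipeline_tokens)
```

- Keeping the stages as separate functions mirrors the pipeline: each stage can be swapped out without touching the others.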

**The SentencePiece Tokenizer**
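- The SentencePiece tokenizer is based on a type of subword segmentation called Unigram and encodes each input text as a sequence of Unicode characters. Whitespace is represented by the Unicode symbol U+2581 (▁), so the original text can be recovered from the tokens without ambiguity.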

In [16]:
#@ IMPLEMENTATION OF SENTENCEPIECE TOKENIZER:
"".join(xlmr_tokens).replace("\u2581", " ")

'<s> Jack Sparrow loves New York!</s>'
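
- A small sketch of the same detokenization idea without the tokenizer, using the XLM-R tokens shown in this notebook. The helper `detokenize` is hypothetical; it additionally drops the special tokens and the leading space so that the input sentence is recovered exactly.

```python
def detokenize(pieces, specials=("<s>", "</s>")):
    """Rebuild text from SentencePiece-style pieces; U+2581 marks a preceding space."""
    text = "".join(p for p in pieces if p not in specials)
    return text.replace("\u2581", " ").lstrip(" ")

pieces = ["<s>", "\u2581Jack", "\u2581Spar", "row", "\u2581love", "s",
          "\u2581New", "\u2581York", "!", "</s>"]
restored = detokenize(pieces)
# Pieces that start with U+2581 begin a new whitespace-separated word.
word_starts = sum(p.startswith("\u2581") for p in pieces)
print(restored, word_starts)
```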

**Transformers for Named Entity Recognition**

In [19]:
#@ CUSTOM MODEL FOR TOKEN CLASSIFICATION:
class XLMRobertaForTokenClassification(RobertaPreTrainedModel):                         # Defining class. 
    config_class = XLMRobertaConfig
    def __init__(self, config):                                                         # Initializing constructor function.
        super().__init__(config)
        self.num_labels = config.num_labels                                             # Initialization.
        self.roberta = RobertaModel(config, add_pooling_layer=False)                    # Initializing roberta model.
        self.dropout = nn.Dropout(config.hidden_dropout_prob)                           # Initializing dropout layer.
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)              # Initializing linear layer.
        self.init_weights()                                                             # Loading and initializing weights.
    
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, 
                labels=None, **kwargs):
        outputs = self.roberta(input_ids, attention_mask=attention_mask, 
                               token_type_ids=token_type_ids, **kwargs)                 # Getting encoder representations. 
        sequence_output = self.dropout(outputs[0])                                      # Applying dropout.
        logits = self.classifier(sequence_output)                                       # Applying classifier.
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()                                            # Initializing cross entropy loss function.
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))          # Calculating loss.
        return TokenClassifierOutput(loss=loss, logits=logits, 
                                     hidden_states=outputs.hidden_states,
                                     attentions=outputs.attentions)                     # Getting model output.
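
- The loss in `forward` flattens the logits to one row per token and averages the cross-entropy over tokens. `nn.CrossEntropyLoss` skips targets equal to its default `ignore_index` of -100, which is why masked tokens do not contribute. A plain-Python sketch of that behaviour follows; `token_cross_entropy` is a hypothetical helper written for illustration, not the PyTorch implementation.

```python
import math

def token_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over tokens, skipping positions labeled ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue                                   # Masked token: no contribution.
        m = max(row)                                   # Subtract the max for stability.
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]                    # Negative log-probability of the label.
        count += 1
    return total / count if count else float("nan")

logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]   # One row per token.
labels = [0, -100, 1]                                          # The middle token is masked.
loss = token_cross_entropy(logits, labels)
print(round(loss, 4))
```

- Because the middle token is masked, changing its logits leaves the loss unchanged.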

In [23]:
#@ LOADING CUSTOM MODEL:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}                            # Index to tags.
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}                            # Tags to index.
xlmr_config = AutoConfig.from_pretrained(xlmr_model_name, 
                                         num_labels=tags.num_classes, 
                                         id2label=index2tag, label2id=tag2index)        # Initializing configurations. 

loading configuration file https://huggingface.co/xlm-roberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/87683eb92ea383b0475fecf99970e950a03c9ff5e51648d6eee56fb754612465.dfaaaedc7c1c475302398f09706cbb21e23951b73c6e2b3162c1c8a99bb3b62a
Model config XLMRobertaConfig {
  "_name_or_path": "xlm-roberta-base",
  "architectures": [
    "XLMRobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-PER",
    "2": "I-PER",
    "3": "B-ORG",
    "4": "I-ORG",
    "5": "B-LOC",
    "6": "I-LOC"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-LOC": 5,
    "B-ORG": 3,
    "B-PER": 1,
    "I-LOC": 6,
    "I-ORG": 4,
    "I-PER": 2,
    "O": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-robe

In [25]:
#@ LOADING MODEL WEIGHTS: 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")       # Initializing gpu. 
xlmr_model = (XLMRobertaForTokenClassification.from_pretrained(
    xlmr_model_name, config=xlmr_config).to(device))                        # Initializing pretrained model.

loading weights file https://huggingface.co/xlm-roberta-base/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/97d0ea09f8074264957d062ec20ccb79af7b917d091add8261b26874daf51b5d.f42212747c1c27fcebaa0a89e2a83c38c6d3d4340f21922f892b88d882146ac2
Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForTokenClassification: ['roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identi

In [26]:
#@ IMPLEMENTATION OF MODEL:
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")                # Initializing ids of tokens.
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], 
             index=["Tokens", "Inputs"])                                    # Creating a dataframe.

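| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | ▁Jack | ▁Spar | row | ▁love | s | ▁New | ▁York | ! | `</s>` |
| Inputs | 0 | 21763 | 37456 | 15555 | 5161 | 7 | 2356 | 5753 | 38 | 2 |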


In [27]:
#@ INITIALIZING PREDICTIONS:
outputs = xlmr_model(input_ids.to(device)).logits                           # Implementation of model.
predictions = torch.argmax(outputs, dim=-1)                                 # Generating predictions.
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")                  # Inspection.
print(f"Shape of outputs: {outputs.shape}")                                 # Inspection.
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]               # Initializing predictions.
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"])                # Inspection.

Number of tokens in sequence: 10
Shape of outputs: torch.Size([1, 10, 7])


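| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | ▁Jack | ▁Spar | row | ▁love | s | ▁New | ▁York | ! | `</s>` |
| Tags | B-ORG | B-ORG | B-ORG | B-ORG | B-ORG | B-ORG | B-ORG | B-ORG | B-ORG | B-ORG |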
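
- The output shape `[1, 10, 7]` corresponds to one sequence, ten tokens, and seven tag scores per token; the prediction for each token is the index of its highest score. Every token receives the same tag here, presumably because the classification head is freshly initialized and has not been fine-tuned yet. A plain-Python sketch of the decoding step follows, with made-up scores (`toy_logits`) and a hypothetical helper `decode`.

```python
TAG_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

def decode(logits, names=TAG_NAMES):
    """Pick the highest-scoring tag for each token (argmax over the last axis)."""
    return [names[max(range(len(row)), key=row.__getitem__)] for row in logits]

toy_logits = [                                  # Made-up scores: one row per token.
    [0.1, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0],        # Highest score at index 3.
    [0.0, 0.0, 0.0, 0.2, 0.7, 0.0, 0.0],        # Highest score at index 4.
    [1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],        # Highest score at index 0.
]
decoded = decode(toy_logits)
print(decoded)
```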


In [32]:
#@ INITIALIZING TEXT TAGGER: IMPORTANT: 
def tag_text(text, tags, model, tokenizer):                                 # Defining function. 
    tokens = tokenizer(text).tokens()                                       # Initializing tokenization.
    input_ids = tokenizer(text, 
                          return_tensors="pt").input_ids.to(device)         # Encoding sequence into ids.
    outputs = model(input_ids)[0]                                           # Getting logits.
    predictions = torch.argmax(outputs, dim=2)                              # Generating predictions.
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()]           # Converting ids to tag names.
    return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])          # Returning dataframe.

**Tokenizing Texts for NER**

In [34]:
#@ TOKENIZING TEXTS FOR NER: EXAMPLE:
words, labels = de_example["tokens"], de_example["ner_tags"]                # Initialization.
tokenized_input = xlmr_tokenizer(de_example["tokens"], 
                                 is_split_into_words=True)                  # Tokenized inputs.
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"]) # Initializing tokens.
pd.DataFrame([tokens], index=["Tokens"])                                    # Inspection.

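| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | ▁Bright | on | ▁& | ▁Ho | ve | ▁Albi | on | ▁- | ▁Scu | nt | hor | pe | ▁United | ▁3: | 2 | `</s>` |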


In [35]:
#@ TOKENIZING TEXTS FOR NER:
word_ids = tokenized_input.word_ids()                                       # Initializing word ids.
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])              # Inspection. 

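| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | ▁Bright | on | ▁& | ▁Ho | ve | ▁Albi | on | ▁- | ▁Scu | nt | hor | pe | ▁United | ▁3: | 2 | `</s>` |
| Word IDs | | 0 | 0 | 1 | 2 | 2 | 3 | 3 | 4 | 5 | 5 | 5 | 5 | 6 | 7 | 7 | |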


In [36]:
#@ TOKENIZING TEXTS FOR NER:
previous_word_idx = None                                            # Initialization.
label_ids = []                                                      # Initialization.
for word_idx in word_ids:
    if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100)                                      # Masking special tokens and continuation subwords.
    else:
        label_ids.append(labels[word_idx])                          # Labeling the first subword of each word.
    previous_word_idx = word_idx                                    # Tracking the previous word id.
label_names = [index2tag[l] if l != -100 else "IGN" for l in label_ids]
index = ["Tokens", "Word IDs", "Label IDs", "Labels"]
pd.DataFrame([tokens, word_ids, label_ids, label_names], index=index)   # Initializing dataframe.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | ▁Bright | on | ▁& | ▁Ho | ve | ▁Albi | on | ▁- | ▁Scu | nt | hor | pe | ▁United | ▁3: | 2 | `</s>` |
| Word IDs | | 0 | 0 | 1 | 2 | 2 | 3 | 3 | 4 | 5 | 5 | 5 | 5 | 6 | 7 | 7 | |
| Label IDs | -100 | 3 | -100 | 4 | 4 | -100 | 4 | -100 | 0 | 3 | -100 | -100 | -100 | 4 | 0 | -100 | -100 |
| Labels | IGN | B-ORG | IGN | I-ORG | I-ORG | IGN | I-ORG | IGN | O | B-ORG | IGN | IGN | IGN | I-ORG | O | IGN | IGN |


In [38]:
#@ TOKENIZING TEXT FOR NER: IMPORTANT:
def tokenize_and_align_labels(examples):                                            # Defining function. 
    tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True, 
                                      is_split_into_words=True)                     # Initializing tokenized inputs. 
    labels = []                                                                     # Initialization.
    for idx, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)                       # Initializing word ids.
        previous_word_idx = None                                                    # Initialization.
        label_ids = []                                                              # Initialization.
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)                                              # -100 is ignored. 
            else:
                label_ids.append(label[word_idx])                                   # Adding labels.
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels                                             # Adding labels column.
    return tokenized_inputs

#@ ENCODING PAN-X DATASET:
def encode_panx_dataset(corpus):                                                    # Defining function. 
    return corpus.map(tokenize_and_align_labels, batched=True,
                      remove_columns=["langs", "ner_tags", "tokens"])               # Encoding dataset.
panx_de_encoded = encode_panx_dataset(panx_ch["de"])                                # Implementation.

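
- The alignment rule inside `tokenize_and_align_labels` can be exercised without a tokenizer. In this sketch the word IDs and tags are copied from the German example above, and `align_labels` is a hypothetical standalone helper: only the first subword of each word keeps the word's label, while special tokens and continuation subwords are masked with -100.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first subword of each word; mask special tokens and continuation subwords."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)           # Special token or continuation subword.
        else:
            aligned.append(word_labels[word_id])   # First subword of a new word.
        previous = word_id
    return aligned

word_ids = [None, 0, 0, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 7, 7, None]
ner_tags = [3, 4, 4, 4, 0, 3, 4, 0]
aligned = align_labels(word_ids, ner_tags)
print(aligned)
```

- After alignment, the number of unmasked positions equals the number of original words, so each word is counted exactly once in the loss and in the evaluation.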


**Performance Measures**