**Named Entity Recognition**
- NER is a common NLP task that identifies entities like people, organizations or locations in text. These entities can be used for various applications such as gaining insights from documents, augmenting the quality of search engines, or building a structured database from a corpus.

**Initialization**
- I put these three lines at the top of every notebook. The first two enable `autoreload`, so edited modules are re-imported automatically and stale code does not cause problems when the project is rerun. The third renders Matplotlib plots inline in the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Downloading Libraries and Dependencies**
- I install and import all the libraries and dependencies required for the project in a single cell.

In [19]:
#@ IMPORTING MODULES: UNCOMMENT BELOW:
# !pip install transformers[sentencepiece]
# !pip install datasets
from transformers import AutoTokenizer
from transformers import AutoModel
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
from datasets import get_dataset_config_names
from datasets import DatasetDict
from collections import defaultdict
from collections import Counter
import pandas as pd

#@ IGNORING WARNINGS: 
import warnings
warnings.filterwarnings("ignore")

**The Dataset**
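- The PAN-X dataset, part of the XTREME benchmark, consists of Wikipedia articles in many languages. Each article is annotated with LOC (location), PER (person) and ORG (organization) tags in IOB2 format: a `B-` prefix marks the first token of an entity, an `I-` prefix marks the following tokens of the same entity, and `O` marks tokens outside any entity.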
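
To make the IOB2 scheme concrete, here is a minimal sketch that groups a tagged token sequence into entity spans. The helper `iob2_to_spans` is a hypothetical name introduced for illustration and is not part of the `datasets` library; the tokens and tags are those of the first German training example inspected further down in this notebook.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Brighton", "&", "Hove", "Albion", "-", "Scunthorpe", "United", "3:2"]
tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "B-ORG", "I-ORG", "O"]
spans = iob2_to_spans(tokens, tags)
print(spans)
```

Because every entity starts with exactly one `B-` tag, counting `B-` tags counts entities, which is what the frequency calculation at the end of this section relies on.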

In [5]:
#@ LOADING PAN-X DATASET:
xtreme_subsets = get_dataset_config_names("xtreme")                     # Listing the XTREME configurations.
print(f"XTREME has {len(xtreme_subsets)} configurations")               # Inspecting number of configurations.
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]       # Keeping only the PAN-X subsets.
panx_subsets[:5]                                                        # Inspecting the first five subsets.

XTREME has 183 configurations


['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg', 'PAN-X.bn', 'PAN-X.de']

In [20]:
#@ LOADING PAN-X DATASET:
langs = ["de", "fr", "it", "en"]                                                # Languages to load.
fracs = [0.629, 0.229, 0.084, 0.059]                                            # Fraction of each corpus to keep.
panx_ch = defaultdict(DatasetDict)                                              # Mapping language -> DatasetDict.
for lang, frac in zip(langs, fracs):
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")                           # Loading monolingual corpus.
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=2022)                                                 # Shuffling. 
            .select(range(int(frac * ds[split].num_rows))))                     # Downsampling dataset. 


In [21]:
#@ LOADING PAN-X DATASET:
pd.DataFrame({lang:[panx_ch[lang]["train"].num_rows] for lang in langs},
             index=["Number of training examples"])                             # Creating dataframe.

| | de | fr | it | en |
|---|---|---|---|---|
| Number of training examples | 12580 | 4580 | 1680 | 1180 |
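
The shuffle-then-select pattern used in the loading cell can be sketched without the `datasets` library. This is only an illustration under stated assumptions: `downsample` is a hypothetical helper, the toy corpus is a list of integers, and Python's `random` module stands in for `Dataset.shuffle`, so the selected rows will not match what `datasets` picks for the same seed.

```python
import random

def downsample(rows, frac, seed=2022):
    """Shuffle a copy of rows with a fixed seed and keep the first int(frac * len(rows))."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[: int(frac * len(rows))]

corpus = list(range(1000))                      # Toy stand-in for one language's split.
fracs = {"de": 0.629, "fr": 0.229, "it": 0.084, "en": 0.059}
subsets = {lang: downsample(corpus, frac) for lang, frac in fracs.items()}
for lang, subset in subsets.items():
    print(lang, len(subset))
```

Fixing the seed makes each subset reproducible across runs, and `int(...)` truncates, so a fraction never selects more rows than requested.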


In [11]:
#@ INSPECTING GERMAN CORPUS:
element = panx_ch["de"]["train"][0]                         # First German training example.
for key, value in element.items():
    print(f"{key}: {value}")                                # Inspection.

tokens: ['Brighton', '&', 'Hove', 'Albion', '-', 'Scunthorpe', 'United', '3:2']
ner_tags: [3, 4, 4, 4, 0, 3, 4, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']


In [16]:
#@ INSPECTING GERMAN CORPUS:
element = panx_ch["de"]["train"].features                   # Feature schema of the German corpus.
for key, value in element.items():
    print(f"{key}: {value}")                                # Inspection.
tags = panx_ch["de"]["train"].features["ner_tags"].feature  # ClassLabel holding the NER tag names.
print(tags)                                                 # Inspection.

tokens: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ner_tags: Sequence(feature=ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None)
langs: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None)
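
The `ClassLabel` feature printed above is essentially an ordered list of tag names, and `int2str` looks an index up in that list. A minimal sketch of the same lookup in plain Python, using the `names` shown in the output (the `int2str` and `str2int` functions here are stand-ins written for illustration, not the library methods):

```python
tag_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

def int2str(idx):
    """Map an integer class id to its tag name (stand-in for ClassLabel.int2str)."""
    return tag_names[idx]

def str2int(name):
    """Map a tag name back to its integer class id."""
    return tag_names.index(name)

ner_tags = [3, 4, 4, 4, 0, 3, 4, 0]             # ner_tags of the first German example.
ner_tag_str = [int2str(idx) for idx in ner_tags]
print(ner_tag_str)
```

The position of each name in the list is its class id, so the order of `names` must never be changed once a model has been trained on the integer labels.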


In [22]:
#@ INITIALIZING TAG NAMES: 
def create_tag_names(batch):                                                    # Defining function.
    return {"ner_tag_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}    # Converting integers to tag strings.
panx_de = panx_ch["de"].map(create_tag_names)                                   # Applying the function to every example.


In [23]:
#@ INSPECTING TAG NAMES:
de_example = panx_de["train"][0]                                                # First German example with tag names.
pd.DataFrame([de_example["tokens"], de_example["ner_tag_str"]], 
             ["Tokens", "Tags"])                                                # Creating a dataframe.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Tokens | Brighton | & | Hove | Albion | - | Scunthorpe | United | 3:2 |
| Tags | B-ORG | I-ORG | I-ORG | I-ORG | O | B-ORG | I-ORG | O |


In [26]:
#@ CALCULATING FREQUENCIES OF ENTITIES: GENERAL:
split2freqs = defaultdict(Counter)                              # Entity counts per split.
for split, dataset in panx_de.items():
    for row in dataset["ner_tag_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")             # Creating dataframe. 

| | ORG | LOC | PER |
|---|---|---|---|
| train | 5465 | 6130 | 5812 |
| validation | 2742 | 3070 | 2801 |
| test | 2624 | 3070 | 3017 |
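
Raw counts are hard to compare across splits of different sizes. As a small sketch, the counts from the table above can be normalized into per-split proportions; the numbers are copied from that output, and the variable names are chosen for illustration.

```python
counts = {
    "train": {"ORG": 5465, "LOC": 6130, "PER": 5812},
    "validation": {"ORG": 2742, "LOC": 3070, "PER": 2801},
    "test": {"ORG": 2624, "LOC": 3070, "PER": 3017},
}

proportions = {}
for split, freqs in counts.items():
    total = sum(freqs.values())                 # Total entities in this split.
    proportions[split] = {tag: n / total for tag, n in freqs.items()}

for split, props in proportions.items():
    print(split, {tag: round(p, 3) for tag, p in props.items()})
```

Each entity type accounts for roughly 30 to 36 percent of the entities in every split, so the three splits have similar tag distributions.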
