## Deep Learning Ulaanbaatar (DLUB) 2022 - Summer School 🇲🇳
**Seminar: Mongolian Named Entity Recognition (NER) using HuggingFace Transformers **

Бид нар юу хийх гэж байгаа вэ?
1. Named entity recognition Монгол хэл дээр Transformers ашиглан сургаж үзэх гэж байна.
2. Өмнө нь MLM task дээр сурсан Encoder model ашиглана.
3. Fine-tune хийхийг сурна.

Энэхүү notebook нь NER task-д тусгайлан бэлдсэн датагүйгээр боломжгүй юм. Датаг бэлдэж ил болгосон [Төгөлдөр](https://github.com/tugstugi) болон [Энод](https://github.com/enod) нарт баярлалаа! 🙇

## Setup

In [1]:
# for huggingface hub integration
!apt install git-lfs
!git lfs install
!pip install seqeval
!pip install transformers[torch]
!pip install datasets

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
Git LFS initialized.
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 kB[0m [31m834.5 kB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... [?25l[?25hdone
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=c538484f7187277d56605da77fc4d47d0674548a3f3ceb2098ad5c42461d7658
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
Collecting accelerate>=0.21.0 (from transforme

In [None]:
!pip install peft

Collecting peft
  Downloading peft-0.10.0-py3-none-any.whl (199 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/199.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m194.6/199.1 kB[0m [31m7.1 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.1/199.1 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: peft
Successfully installed peft-0.10.0


In [None]:
import os
import sys
import json

import datasets
import numpy as np
import pandas as pd
from datasets import ClassLabel, load_dataset, load_metric

import transformers
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    HfArgumentParser,
    PretrainedConfig,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
    set_seed,
)

## Өгөгдлөө боловсруулах

Энэхүү датаг бэлдсэн

```json
{
    "text": "Харин \"Тавантолгой\" ХК-ийн уурхайчид 2017 онд 141 тэрбум төгрөгийн ашигтай ажиллаад удахгүй хувьцаа эзэмшигчдийнхээ хурлыг хийж ногдол ашгаа хуваарилах юм байна."
    "labels":[
        [
            7
            18
            string"ORG"
        ],..
    ]
}
```
Тайлбар: Энэ тохиолдолд `text` field-ийн 7-18 хоорондох-р substring нь ORG буюу Organization.
```bash
0123456[7..........]18...
Харин "[Тавантолгой]" ХК-ийн уурхайчид...
```

Датаг ашиглахад амар байлгах зорилгоор дараах байдлаар kaggle dataset болгон байрлуулав.
https://www.kaggle.com/datasets/bayartsogtya/mongolian-ner-v1

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# path = 'drive/MyDrive/Colab Notebooks/Machine Learning Course/lab04/'
path = 'drive/MyDrive/Colab Notebooks/ML/lab4/'

Mounted at /content/drive


In [None]:
with open(path + 'NER_v1.0.json', 'r') as reader:
    lines = reader.readlines()
lines = [json.loads(x) for x in lines]

In [None]:
# find all unique tags
labels = set()
for line in lines:
    for s, e, label in line['labels']:
        labels.add(label)

labels = sorted(list(labels))
labels = ['O'] + labels
label2idx = {x:i for i,x in enumerate(labels)}
idx2label = {i:x for i,x in enumerate(labels)}
num_labels = len(labels)
print(num_labels)
print(labels)
print(label2idx)
print(idx2label)

5
['O', 'LOC', 'MISC', 'ORG', 'PER']
{'O': 0, 'LOC': 1, 'MISC': 2, 'ORG': 3, 'PER': 4}
{0: 'O', 1: 'LOC', 2: 'MISC', 3: 'ORG', 4: 'PER'}


In [None]:
token_list = []
ner_tag_list = []

for line in lines:
    labels = line['labels']
    text = line['text']

    labels.sort()
    labels = [[-1, -1, '']] + labels + [[len(text), len(text), '']]

    tokens = []
    tags = []
    for pre, cur in zip(labels, labels[1:]):
        ps, pe, pl = pre
        cs, ce, cl = cur

        ll = text[pe+1: cs].strip().split(' ')
        tokens += ll
        tags += [0] * len(ll)

        if cl:
            ll = text[cs: ce].strip().split(' ')
            tokens += ll
            tags += [label2idx[cl]] * len(ll)
    token_list.append(tokens)
    ner_tag_list.append(tags)

In [None]:
print(token_list[10])
print(ner_tag_list[10])

['Өнгө', 'дагасан', 'хувцаслалт', 'Монголчууд', 'хүнийг', 'хувцсаар', 'нь', 'угтаж', 'ухаанаар', 'нь', 'үддэг', 'гэдэг.']
[0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0]


## pandas.DataFrame-ээс HF Dataset үүсгэх

In [None]:
df = pd.DataFrame()
df['tokens'] = token_list
df['ner_tags'] = ner_tag_list
df.head()

Unnamed: 0,tokens,ner_tags
0,"[Харин, "", Тавантолгой, ХК-ийн, уурхайчид, 201...","[0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,"[Харин, зөвшөөрөл, олгох, эрх, бүхий, албан, т...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0]"
2,"[Ингээд, Баянгол, дүүргийн, Засаг, даргаас, ""О...","[0, 1, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,"[Үүнээс, хойш, манайхан, ""Их, борооноос, урьта...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
4,"[Надад, одоогоор, УИХ, ын, Тамгын, газраас, ээ...","[0, 0, 3, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0]"


In [None]:
from datasets import Dataset
raw_dataset = Dataset.from_pandas(df)

In [None]:
raw_dataset

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 10162
})

## Бэлтгэл

- config, tokenizer, model from huggingface model hub
- model card: https://huggingface.co/bayartsogt/mongolian-roberta-base (trained data: OSCAR deduplicated_mn)

In [None]:
model_name = 'bayartsogt/mongolian-roberta-base'

In [None]:
config = AutoConfig.from_pretrained(
    model_name,
    num_labels=num_labels,
    finetuning_task='ner',
)



config.json:   0%|          | 0.00/670 [00:00<?, ?B/s]

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=True,
    add_prefix_space=True,
)

tokenizer_config.json:   0%|          | 0.00/291 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.33M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

In [None]:
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    config=config,
)

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at bayartsogt/mongolian-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model

RobertaForTokenClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (L

In [None]:
model.config.label2id = label2idx
model.config.id2label = idx2label

In [None]:
text_column_name = 'tokens'
label_column_name = 'ner_tags'
print(raw_dataset[2][text_column_name])
print(raw_dataset[2][label_column_name])

['Ингээд', 'Баянгол', 'дүүргийн', 'Засаг', 'даргаас', '"Оршин', 'суугчид', 'жагсаал', 'цуглааны', 'хэлбэрт', 'орчихсон', 'байгаа', 'бол', 'иргэдийн', 'аюулгүй', 'байдлыг', 'хангах', 'үүднээс', 'эмх', 'замбараагүй', 'байдлыг', 'хуулийн', 'хүрээнд', 'таслан', 'зогсоо"', 'гэсэн', 'даалгавар', 'өгчээ.']
[0, 1, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [None]:
# Tokenize all texts and align the labels with them.
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples[text_column_name],
        padding="max_length",
        truncation=True,
        max_length=128,
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples[label_column_name]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx

        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [None]:
# tokenized_dataset = raw_dataset.select(range(10)).map(
tokenized_dataset = raw_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    num_proc=2,
    desc="Running tokenizer on train dataset",
)

  self.pid = os.fork()


Running tokenizer on train dataset (num_proc=2):   0%|          | 0/10162 [00:00<?, ? examples/s]

In [None]:
def count_parameters(model):
    total  = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters {trainable}/{total}")

In [None]:
print(len(tokenized_dataset[0]['input_ids']))
print(len(tokenized_dataset[0]['labels']))
print(tokenized_dataset[0]['input_ids'])
print(tokenized_dataset[0]['labels'])

128
128
[1062, 713, 5690, 4671, 17, 292, 24952, 1632, 786, 36644, 1594, 2389, 3372, 9456, 4782, 2918, 11766, 1181, 7360, 895, 13582, 13586, 13850, 394, 385, 18, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100

In [None]:
# all_dataset = tokenized_dataset.select(range(300)).train_test_split()
all_dataset = tokenized_dataset.train_test_split()
all_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 7621
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2541
    })
})

In [None]:
count_parameters(model)

Trainable parameters 124058885/124058885


## PEFT

In [None]:
from peft import LoraConfig, TaskType

peft_config = LoraConfig(task_type=TaskType.TOKEN_CLS, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)


In [None]:
from peft import get_peft_model

model_peft = get_peft_model(model, peft_config)
model_peft.print_trainable_parameters()

trainable params: 298,757 || all params: 124,357,642 || trainable%: 0.24024016151737582


In [None]:
count_parameters(model_peft)

Trainable parameters 298757/124357642


## HuggingFace Trainer хэрэглээ

In [None]:
# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=None)

# Metrics
metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [labels[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [labels[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

  metric = load_metric("seqeval")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.47k [00:00<?, ?B/s]

In [None]:
# from huggingface_hub import notebook_login

# notebook_login()

In [None]:
OUTPUT_MODEL = 'roberta-base-ner-demo'

training_args = TrainingArguments(
    OUTPUT_MODEL,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16*2,
    dataloader_num_workers=2,

    evaluation_strategy = "epoch",
    logging_strategy="epoch",
    save_strategy="epoch",

    learning_rate=2e-5,
    weight_decay=0.01,
    report_to='tensorboard',
    log_level="warning",

    # automatic version handling with huggingface
    push_to_hub=False
)

In [None]:
# Set seed before initializing model.
set_seed(training_args.seed)

In [None]:
# Initialize our Trainer
trainer = Trainer(
    model=model_peft,
    args=training_args,
    train_dataset=all_dataset['train'],
    eval_dataset=all_dataset['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

  self.pid = os.fork()


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.6864,0.344946,0.403775,0.416501,0.410039,0.895384
2,0.2857,0.207946,0.532625,0.597802,0.563335,0.930648
3,0.2106,0.173949,0.58344,0.666269,0.62211,0.939728
4,0.1835,0.157965,0.609221,0.694742,0.649177,0.944586
5,0.1671,0.146136,0.631005,0.72017,0.672645,0.949504
6,0.1576,0.141017,0.642882,0.727851,0.682733,0.950858
7,0.1499,0.136721,0.653371,0.735399,0.691963,0.95265
8,0.1452,0.133175,0.660886,0.746921,0.701274,0.954144
9,0.1425,0.131906,0.666784,0.749967,0.705934,0.954781
10,0.1418,0.131093,0.667845,0.750894,0.706938,0.95504


  self.pid = os.fork()
    0    0 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100] seems not to be NE tag.
    0    0 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
 -100 -100 -100 -1

TrainOutput(global_step=4770, training_loss=0.2270422929487888, metrics={'train_runtime': 1620.9878, 'train_samples_per_second': 47.015, 'train_steps_per_second': 2.943, 'total_flos': 4995977516866560.0, 'train_loss': 0.2270422929487888, 'epoch': 10.0})

In [None]:
model.save_pretrained(path)

In [None]:
# kwargs = {
#     "finetuned_from": model_name,
#     "tasks": "token-classification",
#     "language": 'mn'
# }
# trainer.push_to_hub(**kwargs)

In [None]:
# raw_dataset.push_to_hub('mongolian-ner', token=hf_token)

## Төгсгөл
