## Deep Learning Ulaanbaatar (DLUB) 2022 - Summer School 🇲🇳
**Seminar: Mongolian Named Entity Recognition (NER) using HuggingFace Transformers**

What are we going to do?
1. Train a named entity recognition (NER) model for Mongolian using Transformers.
2. Use an encoder model that was previously pre-trained on the MLM (masked language modeling) task.
3. Learn how to fine-tune it.

This notebook would not be possible without data prepared specifically for the NER task. Thanks to [Төгөлдөр](https://github.com/tugstugi) and [Энод](https://github.com/enod) for preparing the data and making it public! 🙇

## Setup

In [1]:
# for huggingface hub integration
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("hf_token")

!apt install git-lfs
!git lfs install
!pip install seqeval

In [2]:
import os
import sys
import json

import datasets
import numpy as np
import pandas as pd
from datasets import ClassLabel, load_dataset, load_metric

import transformers
from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    HfArgumentParser,
    PretrainedConfig,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
    set_seed,
)

## Processing the data

Each record in the dataset has the following format:

```json
{
    "text": "Харин \"Тавантолгой\" ХК-ийн уурхайчид 2017 онд 141 тэрбум төгрөгийн ашигтай ажиллаад удахгүй хувьцаа эзэмшигчдийнхээ хурлыг хийж ногдол ашгаа хуваарилах юм байна.",
    "labels": [
        [7, 18, "ORG"],
        ...
    ]
}
```
Explanation: in this example, the substring of the `text` field from character offset 7 up to (but not including) offset 18 is tagged ORG, i.e. Organization.
```bash
0123456[7..........]18...
Харин "[Тавантолгой]" ХК-ийн уурхайчид...
```
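As a quick check of the offset convention, the snippet below slices a shortened copy of the example sentence with the span `[7, 18, "ORG"]`. It is only a minimal sketch: it assumes that offsets count Unicode characters and that the end offset is exclusive, which is what this one example suggests.

```python
# Shortened copy of the example record above. Offsets are assumed to be
# character positions with an exclusive end.
record = {
    "text": 'Харин "Тавантолгой" ХК-ийн уурхайчид',
    "labels": [[7, 18, "ORG"]],
}

for start, end, tag in record["labels"]:
    # Plain Python slicing recovers the entity text from its span.
    entity = record["text"][start:end]
    print(f"{tag}: {entity}")
```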

To make the data easy to use, it has been published as a Kaggle dataset:
https://www.kaggle.com/datasets/bayartsogtya/mongolian-ner-v1

In [3]:
with open('/kaggle/input/mongolian-ner-v1/NER_v1.0.json', 'r') as reader:
    lines = reader.readlines()
lines = [json.loads(x) for x in lines]

In [4]:
# find all unique tags
labels = set()
for line in lines:
    for s, e, label in line['labels']:
        labels.add(label)

labels = sorted(list(labels))
labels = ['O'] + labels
label2idx = {x:i for i,x in enumerate(labels)}
idx2label = {i:x for i,x in enumerate(labels)}
num_labels = len(labels)
print(num_labels)
print(labels)
print(label2idx)
print(idx2label)

In [5]:
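# Convert character-level spans into word-level tokens and tag ids.
token_list = []
ner_tag_list = []

for line in lines:
    text = line['text']
    # Use a separate name so the global `labels` (list of tag names) is not overwritten.
    # Sentinel spans at both ends let the loop also cover the text before the
    # first entity and after the last one.
    spans = [[0, 0, '']] + sorted(line['labels']) + [[len(text), len(text), '']]

    tokens = []
    tags = []
    for pre, cur in zip(spans, spans[1:]):
        ps, pe, pl = pre
        cs, ce, cl = cur

        # Text between two entities is tagged 'O' (index 0). Span ends are
        # exclusive, so the gap starts at `pe`; split() without arguments
        # avoids producing empty tokens.
        ll = text[pe:cs].split()
        tokens += ll
        tags += [0] * len(ll)

        if cl:
            # Every whitespace-separated word inside the entity gets its tag.
            ll = text[cs:ce].split()
            tokens += ll
            tags += [label2idx[cl]] * len(ll)
    token_list.append(tokens)
    ner_tag_list.append(tags)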

In [7]:
print(token_list[10])
print(ner_tag_list[10])
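
To see what the span-to-token conversion is meant to produce, here is a self-contained sketch for a single record. `spans_to_tags` is a hypothetical helper written for illustration, the two-entry label map is made up, and end offsets are assumed to be exclusive; it is not the notebook's exact code.

```python
def spans_to_tags(text, spans, label2idx):
    """Split `text` on whitespace and tag each word with its span label (0 = 'O')."""
    tokens, tags = [], []
    cursor = 0
    for start, end, label in sorted(spans):
        # Words before the entity are outside any span.
        outside = text[cursor:start].split()
        tokens += outside
        tags += [0] * len(outside)
        # Words inside the entity all share the entity's label.
        inside = text[start:end].split()
        tokens += inside
        tags += [label2idx[label]] * len(inside)
        cursor = end
    # Remaining words after the last entity.
    rest = text[cursor:].split()
    tokens += rest
    tags += [0] * len(rest)
    return tokens, tags


# Illustrative label map; the real one is built from the dataset.
demo_label2idx = {"O": 0, "ORG": 1}
demo_text = 'Харин "Тавантолгой" ХК-ийн уурхайчид'
demo_tokens, demo_tags = spans_to_tags(demo_text, [[7, 18, "ORG"]], demo_label2idx)
print(list(zip(demo_tokens, demo_tags)))
```

Note that the quotation marks around the entity end up as separate `O` tokens, because they sit outside the annotated span.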

## Creating an HF Dataset from a pandas.DataFrame

In [8]:
df = pd.DataFrame()
df['tokens'] = token_list
df['ner_tags'] = ner_tag_list
df.head()

In [9]:
from datasets import Dataset
raw_dataset = Dataset.from_pandas(df)

In [10]:
raw_dataset

## Preparation

- config, tokenizer, and model are loaded from the HuggingFace model hub
- model card: https://huggingface.co/bayartsogt/mongolian-roberta-base (training data: OSCAR deduplicated_mn)

In [11]:
model_name = 'bayartsogt/mongolian-roberta-base'

In [12]:
config = AutoConfig.from_pretrained(
    model_name,
    num_labels=num_labels,
    finetuning_task='ner',
)

In [13]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=True,
    add_prefix_space=True,
)

In [14]:
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    config=config,
)

In [15]:
model.config.label2id = label2idx
model.config.id2label = idx2label

In [36]:
text_column_name = 'tokens'
label_column_name = 'ner_tags'
print(raw_dataset[2][text_column_name])
print(raw_dataset[2][label_column_name])

In [17]:
# Tokenize all texts and align the labels with them.
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples[text_column_name],
        padding="max_length",
        truncation=True,
        max_length=128,
        is_split_into_words=True,  # inputs are already lists of words
    )
    labels = []
    for i, label in enumerate(examples[label_column_name]):
        # word_ids maps each sub-word token to the index of its source word
        # (None for special and padding tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # -100 is ignored by the loss function.
                label_ids.append(-100)
            else:
                # Every sub-word token inherits the label of its word.
                label_ids.append(label[word_idx])

        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
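
The alignment logic can be tried out without loading a tokenizer. The sketch below uses a hand-written `word_ids` list in place of the tokenizer's output; `align_labels` and the example values are hypothetical and only illustrate the mapping from word-level to token-level labels.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Expand word-level labels to token level using a word_ids list."""
    return [
        ignore_index if word_idx is None else word_labels[word_idx]
        for word_idx in word_ids
    ]


# Hypothetical word_ids for a 3-word input: a special token, word 0 split into
# two sub-word tokens, words 1 and 2 as single tokens, a special token, padding.
example_word_ids = [None, 0, 0, 1, 2, None, None]
example_word_labels = [1, 0, 0]  # e.g. ORG, O, O
aligned = align_labels(example_word_ids, example_word_labels)
print(aligned)
```

Positions with `None` receive `-100`, and both sub-word tokens of word 0 receive that word's label.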

In [18]:
# tokenized_dataset = raw_dataset.select(range(10)).map(
tokenized_dataset = raw_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    num_proc=2,
    desc="Running tokenizer on train dataset",
)

In [19]:
print(len(tokenized_dataset[0]['input_ids']))
print(len(tokenized_dataset[0]['labels']))
print(tokenized_dataset[0]['input_ids'])
print(tokenized_dataset[0]['labels'])

In [20]:
# all_dataset = tokenized_dataset.select(range(300)).train_test_split()
all_dataset = tokenized_dataset.train_test_split()
all_dataset

## Using the HuggingFace Trainer

In [21]:
# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=None)

# Metrics
metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens) and map label ids back to tag names.
    # `labels` here is the array of label ids, so the lookup must go through idx2label.
    true_predictions = [
        [idx2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [idx2label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
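
The filtering step in `compute_metrics` can be illustrated on toy data with plain Python lists. The helper name `decode_predictions`, the two-tag label map, and the scores below are all made up for illustration; the sketch only shows how the argmax is taken per token and how positions labeled `-100` are dropped before tag names are compared.

```python
def decode_predictions(logits, label_ids, id2label, ignore_index=-100):
    """Turn per-token scores into tag names, skipping ignored positions."""
    pred_tags, true_tags = [], []
    for seq_logits, seq_labels in zip(logits, label_ids):
        seq_pred, seq_true = [], []
        for scores, gold in zip(seq_logits, seq_labels):
            if gold == ignore_index:
                continue  # special or padding token, not scored
            best = max(range(len(scores)), key=scores.__getitem__)  # argmax
            seq_pred.append(id2label[best])
            seq_true.append(id2label[gold])
        pred_tags.append(seq_pred)
        true_tags.append(seq_true)
    return pred_tags, true_tags


toy_id2label = {0: "O", 1: "ORG"}  # illustrative label map
# One sequence of four tokens, two scores (one per tag) for each token.
toy_logits = [[[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.5, 0.5]]]
toy_labels = [[-100, 1, 1, -100]]
toy_pred, toy_true = decode_predictions(toy_logits, toy_labels, toy_id2label)
print(toy_pred, toy_true)
```

Only the two middle tokens survive the filter, so the metric would compare two predicted tags against two reference tags for this sequence.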

In [22]:
OUTPUT_MODEL = 'roberta-base-ner-demo'

training_args = TrainingArguments(
    OUTPUT_MODEL,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16*2,
    dataloader_num_workers=2,

    evaluation_strategy = "epoch",
    logging_strategy="epoch",
    save_strategy="epoch",

    learning_rate=2e-5,
    weight_decay=0.01,
    report_to='tensorboard',
    log_level="warning",

    # automatic version handling with huggingface
    push_to_hub=True,
    hub_token=hf_token,
)

In [23]:
# Set seed before initializing model.
set_seed(training_args.seed)

In [24]:
# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=all_dataset['train'],
    eval_dataset=all_dataset['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [25]:
trainer.train()

In [31]:
!cd roberta-base-ner-demo && git pull

In [32]:
kwargs = {
    "finetuned_from": model_name, 
    "tasks": "token-classification",
    "language": 'mn'
}
trainer.push_to_hub(**kwargs)

In [None]:
# raw_dataset.push_to_hub('mongolian-ner', token=hf_token)

## Conclusion
