## Deep Learning Ulaanbaatar (DLUB) 2022 - Summer School 🇲🇳

**Seminar: Mongolian Masked Language Modeling using HuggingFace Transformers**

What are we going to do?
```python
dataset = load_dataset("oscar", "unshuffled_deduplicated_mn", split="train")
tokenizer = BertTokenizerFast.from_pretrained('./dlub')
model = AutoModelForMaskedLM.from_config(config)
```


Today we cover:
- [ ] HuggingFace `transformers`, `tokenizers` and `datasets` libraries
- [ ] The dataset
- [ ] Transformers - `Config`
- [ ] Tokenization and `BertTokenizer`
- [ ] Data preparation
- [ ] Training
- [ ] Push it to HuggingFace model hub


In [None]:
# for huggingface hub integration
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("hf_token")

!apt install -y git-lfs
!git lfs install

In [None]:
import transformers, datasets, tokenizers

In [None]:
%env TOKENIZERS_PARALLELISM=false

In [None]:
transformers.__version__, datasets.__version__, tokenizers.__version__

## The dataset

The dataset we will use is OSCAR (Open Super-large Crawled ALMAnaCH coRpus), which is filtered out of Common Crawl.

The HuggingFace `datasets` library makes it much easier to download the data and to process it.


- OSCAR dataset view: https://huggingface.co/datasets/oscar/viewer/unshuffled_deduplicated_mn/train
- datasets library: https://github.com/huggingface/datasets
- examples for splits: https://huggingface.co/docs/datasets/v1.11.0/splits.html#examples

In [None]:
from datasets import load_dataset
# dataset = load_dataset("oscar", "unshuffled_deduplicated_mn", split="train") # -> take the whole split
# dataset = load_dataset("oscar", "unshuffled_deduplicated_mn", split="train[:5%]") # -> take the first 5%
dataset = load_dataset("oscar", "unshuffled_deduplicated_mn", split="train[:200]") # -> take the first 200 examples

In [None]:
dataset

## Transformers - `Config`

In [None]:
MODEL_NAME = 'dlub-2022-mlm-full'
model_dir = 'dlub'
%mkdir $model_dir

In [None]:
from transformers import BertConfig
config = BertConfig.from_pretrained("bert-base-uncased")
config.save_pretrained(model_dir)

In [None]:
!cat $model_dir/config.json
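
To get a feel for what the fields in `config.json` imply, the sketch below estimates the parameter count of a BERT encoder from a handful of config values. The values are assumptions: they are the commonly published `bert-base-uncased` defaults, typed in by hand rather than read from the saved file, and `count_bert_params` is a hypothetical helper, not part of `transformers`.

```python
def count_bert_params(vocab_size, hidden_size, num_hidden_layers,
                      intermediate_size, max_position_embeddings,
                      type_vocab_size):
    """Estimate the number of parameters of a BERT encoder with pooler."""
    layer_norm = 2 * hidden_size  # scale and bias

    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (
        vocab_size * hidden_size
        + max_position_embeddings * hidden_size
        + type_vocab_size * hidden_size
        + layer_norm
    )

    # Query, key, value and output projections (weight + bias each).
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two dense layers of the feed-forward block, plus LayerNorm.
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
        + layer_norm
    )
    encoder = num_hidden_layers * (attention + feed_forward)

    pooler = hidden_size * hidden_size + hidden_size
    return embeddings + encoder + pooler


# Assumed bert-base-uncased values (not read from config.json).
bert_base = dict(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    type_vocab_size=2,
)
n_params = count_bert_params(**bert_base)
print(f"{n_params:,}")
```

Under these assumptions the estimate lands at roughly 110M parameters; the masked-LM head used later adds a small amount on top because its output weights are tied to the word embeddings.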

In [None]:
# push the config to the model hub, authenticating with the token
config.push_to_hub(
    MODEL_NAME,
    use_auth_token=hf_token
)

## Tokenization and `BertTokenizer`

WordPiece is the tokenizer used in the original version of BERT; it is a language-independent method for building a tokenizer.

An early paper that describes it: [Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/pdf/1609.08144.pdf)

Many different tokenizers are available, and the library that brings them together and makes them easy to use is [tokenizers](https://github.com/huggingface/tokenizers).
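
To make the WordPiece idea concrete, here is a minimal pure-Python sketch of the encoding step: greedy longest-match-first splitting of one word against a fixed vocabulary, with `##` marking pieces that continue a word. It does not cover how the vocabulary is learned, and the toy vocabulary and the helper name `wordpiece_tokenize` are made up for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, not the one trained in this seminar.
toy_vocab = {"монгол", "улс", "иргэн", "##ын", "##ууд"}

print(wordpiece_tokenize("улсын", toy_vocab))     # ['улс', '##ын']
print(wordpiece_tokenize("монголын", toy_vocab))  # ['монгол', '##ын']
print(wordpiece_tokenize("хот", toy_vocab))       # ['[UNK]']
```

Because rare words fall back to smaller known pieces, the vocabulary can stay at a fixed size while still covering most of the text.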

In [None]:
%%time
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer()

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i: i + batch_size]["text"]

# Customized training
tokenizer.train_from_iterator(batch_iterator(), vocab_size=30522, min_frequency=2, special_tokens=[
    "[UNK]",
    "[SEP]",
    "[PAD]",
    "[CLS]",
    "[MASK]",
])

# Save files to disk
tokenizer.save(f"{model_dir}/tokenizer.json")

In [None]:
from transformers import BertTokenizerFast
bert_tokenizer = BertTokenizerFast.from_pretrained('./dlub')

In [None]:
# [CLS] and [SEP] are added automatically, so the text itself needs no special tokens
bert_tokenizer.encode_plus('би монгол улсын иргэн')

In [None]:
bert_tokenizer.save_pretrained(MODEL_NAME, push_to_hub=True, use_auth_token=hf_token)

## Data preparation

In [None]:
def tokenize_function(examples):
    return bert_tokenizer(examples["text"])

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=2, remove_columns=["id", "text"])

In [None]:
print(tokenized_datasets[0])

In [None]:
# block_size = bert_tokenizer.model_max_length
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. Padding it would be an alternative if the
    # model supported it; customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
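
To see what the grouping step does, the sketch below applies the same logic to a tiny hand-made batch with a block size of 4. `group_into_blocks` is a renamed copy of `group_texts` that takes `block_size` as a parameter, and the token ids are hypothetical.

```python
def group_into_blocks(examples, block_size):
    """Concatenate tokenized texts and cut them into equal-sized blocks."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of block_size; the tail is dropped.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three "documents" of 4, 3 and 6 tokens (hypothetical ids: 3=[CLS], 1=[SEP]).
toy_batch = {
    "input_ids": [[3, 10, 11, 1], [3, 12, 1], [3, 13, 14, 15, 16, 1]],
    "attention_mask": [[1] * 4, [1] * 3, [1] * 6],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])
# [[3, 10, 11, 1], [3, 12, 1, 3], [13, 14, 15, 16]]
```

The 13 input tokens become three blocks of 4, the final `[SEP]` is dropped, and block boundaries no longer line up with document boundaries.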

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=2,
)

## Training

In [None]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_config(config)

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    MODEL_NAME,
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32*2,
    dataloader_num_workers=2,

    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",

    learning_rate=2e-5,
    weight_decay=0.01,
    save_total_limit=10,
    report_to='tensorboard',

    # automatic version handling with huggingface
    push_to_hub=True,
    hub_token=hf_token,
)
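
As a rough check on how long training will run, the number of optimizer steps follows from the dataset size, the batch size and the epoch count. The sketch below assumes a single device and no gradient accumulation; the figure of 1,000 training blocks is a made-up example, and `total_training_steps` is a hypothetical helper rather than a `Trainer` API.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on one device with no gradient accumulation."""
    # The last, smaller batch of each epoch still counts as one step.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical dataset of 1,000 blocks with the settings used above.
steps = total_training_steps(num_examples=1000, batch_size=32, num_epochs=10)
print(steps)  # 320
```

With per-epoch logging, evaluation and saving, this would mean one log line, one evaluation pass and one checkpoint every 32 steps.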

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer, mlm_probability=0.15)

In [None]:
data_collator
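
The collator masks tokens on the fly each time a batch is built. The pure-Python sketch below imitates the BERT-style rule it follows: each non-special token is selected with probability `mlm_probability`; a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. Labels are `-100` everywhere else so the loss ignores those positions. The token ids, vocabulary size and the helper `mask_tokens` are hypothetical, and the real collator works on tensors rather than lists.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence, BERT-style."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 = position ignored by the loss
    for i, token in enumerate(input_ids):
        # Special tokens are never selected for prediction.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


# Hypothetical ids: 3=[CLS], 1=[SEP], 4=[MASK]
ids = [3] + list(range(10, 60)) + [1]
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=100, special_ids={1, 3})
print(sum(label != -100 for label in labels), "of", len(ids), "tokens selected")
```

Because the selection is random, the same text is masked differently in every epoch.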

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets,
    eval_dataset=lm_datasets,  # same data as training, so eval loss is not a held-out score
    data_collator=data_collator,
)

In [None]:
trainer.train()

In [None]:
trainer.push_to_hub()