__How to train a language model__: a notebook that highlights all the steps to effectively train a Transformer model on custom data
https://github.com/huggingface/transformers/tree/master/notebooks

https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

__Language Modeling__
https://github.com/huggingface/transformers/tree/master/examples/language-modeling
https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py

In [5]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"]="7"

from pathlib import Path
import torch

from functools import reduce

# from torch.utils.data import Dataset, DataLoader
# from tokenizers import CharBPETokenizer
# from tokenizers.processors import BertProcessing
# from tokenizers.normalizers import BertNormalizer

# from transformers import RobertaTokenizerFast, RobertaTokenizer

# From Simple Transformers
import logging

from simpletransformers.language_modeling import (
    LanguageModelingModel,
    LanguageModelingArgs,
)

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [2]:
# device = torch.device("cuda:1")
# device

In [3]:
# Check that PyTorch sees it
print(torch.cuda.is_available())
print(torch.cuda.device_count())

True
1


In [6]:
DATA_PATH = Path("..")

# DATA_RAW_PATH = DATA_PATH/"raw"
DATA_RAW_EXTRACTED_PATH = DATA_PATH/"raw_data_extraction"

# Output is in bytes - helper from Pathlib Path https://stackoverflow.com/questions/2104080/how-can-i-check-file-size-in-python
def getStat(prev_value, cur_value):
    if isinstance(prev_value, int):
        return prev_value + cur_value.stat().st_size
    return prev_value.stat().st_size + cur_value.stat().st_size

# 1. The data from thwiki
THWIKI_FOLDER = Path("thwiki-20200601-extracted")
WIKI_FILES = list((DATA_RAW_EXTRACTED_PATH/THWIKI_FOLDER).glob("*.txt"))
list(map(print , WIKI_FILES[:5]))
print(f"thwiki-20200601-extracted Amounts to a total of {reduce(getStat, WIKI_FILES)/1e6:.2f} MB")

# 2. The classification data from jung and ninja
CLASSIFICATION_JUNG_NINJA_FOLDER = Path("classification_dataset")
CLASSIFICATION_FILES = list((DATA_RAW_EXTRACTED_PATH/CLASSIFICATION_JUNG_NINJA_FOLDER).glob("*.txt"))
list(map(print , CLASSIFICATION_FILES[:5]))
print(f"classification_dataset Amounts to a total of {reduce(getStat, CLASSIFICATION_FILES)/1e6:.2f} MB")

# 3. The Data from p'Moo Crawlers
ANOTHER_WEBSITE_MOO_FOLDER = Path("another_website")
ANOTHER_WEBSITE_FILES = list((DATA_RAW_EXTRACTED_PATH/ANOTHER_WEBSITE_MOO_FOLDER).glob("*.txt"))
list(map(print , ANOTHER_WEBSITE_FILES[:5]))
print(f"another_website Amounts to a total of {reduce(getStat, ANOTHER_WEBSITE_FILES)/1e6:.2f} MB")

# 4. Senior Project Files
SENIOR_PROJ_FOLDER = Path("data_lm")
SENIOR_PROJ_FILES = list((DATA_RAW_EXTRACTED_PATH/SENIOR_PROJ_FOLDER).glob("*.txt"))
list(map(print , SENIOR_PROJ_FILES[:5]))
print(f"Senior Project Amounts to a total of {reduce(getStat, SENIOR_PROJ_FILES)/1e6:.2f} MB")

# 5. Guru Crawler Files
GURU_CRAWLER_FOLDER = Path("social_listening")
GURU_CRAWLER_FILES = list((DATA_RAW_EXTRACTED_PATH/GURU_CRAWLER_FOLDER).glob("*.txt"))
list(map(print , GURU_CRAWLER_FILES[:5]))
print(f"GuruCrawler Amounts to a total of {reduce(getStat, GURU_CRAWLER_FILES)/1e6:.2f} MB")

ALL_FILES = WIKI_FILES + CLASSIFICATION_FILES + ANOTHER_WEBSITE_FILES + SENIOR_PROJ_FILES + GURU_CRAWLER_FILES
print(f"\nI have a total of {len(ALL_FILES)} files!")





print(f"Amounts to a total of {reduce(getStat, ALL_FILES)/1e6:.2f} MB")

../raw_data_extraction/thwiki-20200601-extracted/WikiAD_3.txt
../raw_data_extraction/thwiki-20200601-extracted/WikiAE_3.txt
../raw_data_extraction/thwiki-20200601-extracted/WikiAF_0.txt
../raw_data_extraction/thwiki-20200601-extracted/WikiAF_2.txt
../raw_data_extraction/thwiki-20200601-extracted/WikiAD_1.txt
thwiki-20200601-extracted Amounts to a total of 566.79 MB
../raw_data_extraction/classification_dataset/dailynews_0.txt
../raw_data_extraction/classification_dataset/pptv36_0.txt
../raw_data_extraction/classification_dataset/prbangkok_0.txt
../raw_data_extraction/classification_dataset/siamrath_0.txt
../raw_data_extraction/classification_dataset/springnews_0.txt
classification_dataset Amounts to a total of 50.79 MB
../raw_data_extraction/another_website/khaosod_16.txt
../raw_data_extraction/another_website/pantip_470.txt
../raw_data_extraction/another_website/pantip_415.txt
../raw_data_extraction/another_website/naewna_2.txt
../raw_data_extraction/another_website/brighttv_5.txt
ano
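
The size totals above rely on `reduce(getStat, files)`, which raises a `TypeError` for an empty list and returns a `Path` rather than a byte count when the list holds exactly one file. A minimal sketch of a more forgiving alternative, assuming only `pathlib`; `total_size_mb` is a hypothetical helper name, not something from the original notebook:

```python
from pathlib import Path
import tempfile


def total_size_mb(files):
    """Return the combined size of `files` in megabytes (1 MB = 1e6 bytes)."""
    # sum() handles empty and single-element lists without special cases.
    return sum(Path(f).stat().st_size for f in files) / 1e6


# Small self-check on throwaway files (hypothetical data, not the Thai corpus).
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.txt"
    b = Path(tmp) / "b.txt"
    a.write_bytes(b"x" * 1_500_000)
    b.write_bytes(b"y" * 500_000)
    demo_sizes = (total_size_mb([]), total_size_mb([a]), total_size_mb([a, b]))

print(demo_sizes)
```

With this helper, a call such as `total_size_mb(WIKI_FILES)` could stand in for `reduce(getStat, WIKI_FILES)/1e6`.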

# Making Electra Model
The ELECTRA model consists of a generator model and a discriminator model.

From simpletransformers, [Configuring an ELECTRA model](https://simpletransformers.ai/docs/lm-specifics/#configuring-an-electra-model)  
[Reference from Docs: Training an ELECTRA model from scratch (appears to be incorrect)](https://simpletransformers.ai/docs/lm-minimal-start/#training-an-electra-model-from-scratch)  
[Another Reference from Github - Minimal Example For Language Model Training With ELECTRA](https://github.com/ThilinaRajapakse/simpletransformers#minimal-example-for-language-model-training-with-electra)  

_Note: these references do not agree with each other!_

- model_type must be set to electra.
- To load a saved ELECTRA model, you can provide the path to the save files as model_name

When training an ELECTRA language model from scratch, you can define the architecture by using the `generator_config` and `discriminator_config` in the args dict. The [default values](https://huggingface.co/transformers/model_doc/electra.html#electraconfig) will be used for any config parameters that aren’t specified.

```python
model_args = {
      "vocab_size": 52000,
      "generator_config": {
          "embedding_size": 128,
          "hidden_size": 256,
          "num_hidden_layers": 3,
      },
      "discriminator_config": {
          "embedding_size": 128,
          "hidden_size": 256,
      },
  }
```

These args are applied on top of the defaults defined in `LanguageModelingArgs` ([link](https://github.com/ThilinaRajapakse/simpletransformers/blob/master/simpletransformers/config/model_args.py#L137)).

Note: Tokenizers from huggingface/tokenizers are supported, as documented in [issue #511](https://github.com/ThilinaRajapakse/simpletransformers/issues/511)
>If you have a pre-trained Chinese tokenizer saved in the format that HugginFace uses, you can load it by providing the path to the tokenizer files as tokenizer_name.

It uses `tokenizer_class.from_pretrained(self.args.tokenizer_name, cache_dir=self.args.cache_dir)` [Github Code](https://github.com/ThilinaRajapakse/simpletransformers/blob/master/simpletransformers/language_modeling/language_modeling_model.py#L188)

`ElectraTokenizer` needs a plain-text vocab file with one token per line and no explicit indexes (see [load_vocab() of BertTokenizer](https://huggingface.co/transformers/_modules/transformers/tokenization_bert.html#BertTokenizer)), because, according to the Hugging Face docs, [ElectraTokenizer uses BertTokenizer](https://huggingface.co/transformers/model_doc/electra.html#electratokenizer).

In [45]:
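import json

# Load the vocab from the tokenizers JSON file and write it out as a BERT-style vocab file.
with open('thwiki-wordpiece-30522.tokenizer.json', 'r', encoding='utf-8') as f:
    tokenizer_dict = json.load(f)
vocab = tokenizer_dict['model']['vocab']
with open('temp_vocab.txt', 'w', encoding='utf-8') as f:
    # One token per line, ordered by id, because load_vocab() assigns ids by line number.
    f.write('\n'.join(sorted(vocab, key=vocab.get)))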
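
`load_vocab()` in `BertTokenizer` assigns ids purely by line position, so the order in which tokens are written to the text file matters. The sketch below uses a toy vocabulary (hypothetical tokens, not the real Thai vocab) and a simplified stand-in for that line-index logic to show a round trip that preserves ids when the tokens are sorted by id first:

```python
import os
import tempfile

# Toy vocab whose dict order deliberately differs from its id order.
toy_vocab = {"##b": 2, "[PAD]": 0, "a": 1, "[UNK]": 3}


def write_vocab_txt(vocab, path):
    """Write one token per line, ordered by token id."""
    tokens = sorted(vocab, key=vocab.get)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens))


def read_vocab_txt(path):
    """Simplified stand-in for load_vocab(): the id is the line index."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}


with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    write_vocab_txt(toy_vocab, vocab_path)
    round_trip = read_vocab_txt(vocab_path)

print(round_trip == toy_vocab)
```

Writing the tokens in plain dict order would only be safe if the JSON file happens to list them in id order.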

In [46]:
model_args = {
    "vocab_size": 30522,
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "tokenizer_name": "./temp_vocab.txt",
#     "overwrite_output_dir": True,
#       "generator_config": {
#           "embedding_size": 128,
#           "hidden_size": 256,
#           "num_hidden_layers": 3,
#       },
#       "discriminator_config": {
#           "embedding_size": 128,
#           "hidden_size": 256,
#       },
}

In [47]:
model = LanguageModelingModel(
    "electra",
    None,
    args=model_args,
    train_files=str(WIKI_FILES[0])
)


INFO:simpletransformers.language_modeling.language_modeling_model: Training language model from scratch


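The tokenizer strips the Thai vowel and tone marks (compare the input and output below), apparently because no extra arguments are passed to `from_pretrained` [here](https://github.com/ThilinaRajapakse/simpletransformers/blob/master/simpletransformers/language_modeling/language_modeling_model.py#L188)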

In [51]:
" ".join(model.tokenizer.tokenize("ฉันเคยเกือบพลาดสิ่งที่ดีที่สุดในชีวิต หากในวันที่ฉันล้มอยู่ ไม่มีหนึ่งใจของเธอ ฝันคงจบ"))

'ฉ ##น ##เคย ##เกอ ##บ ##พลาด ##สง ##ทด ##ท ##สด ##ใน ##ช ##วต หาก ##ใน ##วน ##ท ##ฉ ##น ##ลม ##อย ไม ##มห ##นง ##ใจของ ##เธอ ฝน ##คง ##จบ'
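
The missing vowel and tone marks in the output above are consistent with BERT-style accent stripping, which decomposes text with NFD and drops characters in Unicode category `Mn` (nonspacing marks); many Thai vowel and tone marks fall into that category. A minimal sketch using only the standard library (`strip_nonspacing_marks` is a hypothetical name that mirrors the idea and is not the library's own code):

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Drop combining marks (Unicode category Mn) after NFD normalization."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


original = "ฉันเคยเกือบพลาด"
stripped = strip_nonspacing_marks(original)
print(original, "->", stripped)
```

If this is indeed the cause, keeping the marks would require turning that normalization off in the tokenizer; which argument controls it depends on the transformers version in use.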

In [49]:
model.train_model(str(WIKI_FILES[0]))


INFO:simpletransformers.language_modeling.language_modeling_utils: Creating features from dataset file at cache_dir/


HBox(children=(FloatProgress(value=0.0, max=15244.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=9988.0), HTML(value='')))

INFO:simpletransformers.language_modeling.language_modeling_utils: Saving features into cached file cache_dir/electra_cached_lm_126_WikiAD_3.txt
INFO:simpletransformers.language_modeling.language_modeling_model: Training started



Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic


HBox(children=(FloatProgress(value=0.0, description='Epoch', max=1.0, style=ProgressStyle(description_width='i…

HBox(children=(FloatProgress(value=0.0, description='Running Epoch 0', max=1249.0, style=ProgressStyle(descrip…

Running loss: 44.114944Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0




Running loss: 44.132175Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Running loss: 44.137611Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Running loss: 44.145008Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Running loss: 44.116955Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Running loss: 32.887733



Running loss: 24.497189

KeyboardInterrupt: 

In [None]:

model.eval_model("wikitext-2/wiki.test.tokens")

In [None]:
from transformers import ElectraModel, ElectraConfig

# Initializing an ELECTRA electra-base-uncased style configuration
configuration = ElectraConfig()
configuration.vocab_size = 20000

# Initializing a model from the electra-base-uncased style configuration
# model = ElectraModel(configuration)

# # Accessing the model configuration
# configuration = model.config
configuration

In [None]:
configuration.to_dict()

In [None]:
# model

In [None]:
# model.num_parameters()
# # => 12 million parameters

In [None]:
# model.get_input_embeddings()

# Trying out Roberta per Notebook 

From __HuggingFace Notebooks__ https://huggingface.co/transformers/notebooks.html: 

How to train a language model: highlights all the steps to effectively train a Transformer model on custom data
- Colab (ipynb) version : https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
- MD version: https://github.com/huggingface/blog/blob/master/how-to-train.md

Pretrain Longformer: how to build a "long" version of existing pretrained models (Iz Beltagy)  
https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb

In [None]:
from transformers import RobertaConfig
from transformers import RobertaForMaskedLM

configuration = RobertaConfig(
    vocab_size=30522,
    max_position_embeddings=514, # 512 tokens + 2, because RoBERTa position ids start after the padding index
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
# configuration.vocab_size = 20000

model = RobertaForMaskedLM(config=configuration)

# Accessing the model configuration
model.config

In [None]:
model.num_parameters()
# => 102 million parameters
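
As a cross-check on the parameter count, the size of this configuration can be estimated by hand. The sketch below is a back-of-the-envelope calculation, assuming the `RobertaConfig` defaults `hidden_size=768` and `intermediate_size=3072`, tied input/output embeddings, and no pooler layer; `estimate_roberta_mlm_params` is a hypothetical helper, not a library function:

```python
def estimate_roberta_mlm_params(vocab_size, hidden_size, num_layers,
                                intermediate_size, max_positions,
                                type_vocab_size):
    """Rough parameter count for a RoBERTa-style masked LM with tied output weights."""
    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias

    layer_norm = 2 * hidden_size  # scale + shift

    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden_size + layer_norm
    attention = 4 * linear(hidden_size, hidden_size) + layer_norm  # Q, K, V, output
    feed_forward = (linear(hidden_size, intermediate_size)
                    + linear(intermediate_size, hidden_size) + layer_norm)
    encoder = num_layers * (attention + feed_forward)
    # LM head: dense + layer norm + output bias; the decoder weight is tied to the embeddings.
    lm_head = linear(hidden_size, hidden_size) + layer_norm + vocab_size
    return embeddings + encoder + lm_head


estimate = estimate_roberta_mlm_params(
    vocab_size=30522, hidden_size=768, num_layers=6,
    intermediate_size=3072, max_positions=514, type_vocab_size=1,
)
print(f"{estimate / 1e6:.1f}M parameters (estimate)")
```

Under these assumptions the estimate lands at roughly 67 million, lower than the figure in the comment above. The value printed by `model.num_parameters()` is the authoritative one and may differ slightly by library version, for example if a pooler layer is included.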

In [None]:
model

# Initializing Tokenizer

In [None]:
# from tokenizers import Tokenizer
# tokenizer = Tokenizer.from_file("./thwiki-sentencepiecebpe.tokenizer.json")
# encoded =  tokenizer.encode(u"สวัสดีครับ ผมชื่อไนท์ ตอนนี้ก็เป็นเวลาที่ผมต้องไปโรงเรียนแล้ว  นี่คือการเว้นวรรคสองทีครับ  จะได้ออกเป็นสอง Spaces")
# print(encoded.ids)
# print(encoded.tokens)

In [None]:
# tokenizer.enable_truncation(max_length=128)

In [None]:
# encoded =  tokenizer.encode(u"สวัสดีครับ ผมชื่อไนท์ ตอนนี้ก็เป็นเวลาที่ผมต้องไปโรงเรียนแล้ว  นี่คือการเว้นวรรคสองทีครับ  จะได้ออกเป็นสอง SpacesWhat is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. accented characters used in Esperanto – ĉ, ĝ, ĥ, ĵ, ŝ, and ŭ – are encoded natively. We also represent sequences in a more efficient manner. Here on this corpus, the average length of encoded sequences is ~30% smaller as when using the pretrained GPT-2 tokenizer.")
# print("This will not be over 128: ", len(encoded.ids), encoded.tokens)
# print(encoded.overflowing[0].tokens)

Wrap a tokenizer from the `tokenizers` library inside a `PreTrainedTokenizerFast` from `transformers`:

https://github.com/huggingface/tokenizers/issues/259

In [None]:
# from tokenizers import SentencePieceBPETokenizer
# from transformers import PreTrainedTokenizerFast


# class SentencePieceBPETokenizerFast(PreTrainedTokenizerFast):
#     def __init__(
#         self,
#         vocab_file,
#         merges_file,
#         bos_token="<s>",
#         eos_token="</s>",
#         sep_token="</s>",
#         cls_token="<s>",
#         unk_token="<unk>",
#         pad_token="<pad>",
#         mask_token="<mask>",
#         **kwargs
#     ):
#         super().__init__(
#             SentencePieceBPETokenizer(
#                 vocab_file=vocab_file,
#                 merges_file=merges_file,
#             ),
#              bos_token=bos_token,
#             eos_token=eos_token,
#             unk_token=unk_token,
#             sep_token=sep_token,
#             cls_token=cls_token,
#             pad_token=pad_token,
#             mask_token=mask_token,
#             **kwargs,
#         )

In [None]:
# import json

# # with open("./thwiki-sentencepiecebpe.tokenizer.json", 'r' ) as json_data:
# with open("./thwiki-charbpe-30522.tokenizer.json", 'r' ) as json_data:
#      data = json.load(json_data)
# vocab = data['model']['vocab']
# merges = data['model']['merges']


# with open('vocab.json', 'w', encoding='utf-8') as json_file:
#     json.dump(vocab, json_file, ensure_ascii=False)
# with open('merges.txt', 'w', encoding='utf-8') as f:
#     for merge_string in merges:
#         f.write(f'{merge_string}\n')

In [None]:
# pretrain_tokenizer = SentencePieceBPETokenizerFast(vocab_file='vocab.json',merges_file ='merges.txt' )

In [None]:
# from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast
# from tokenizers import Tokenizer
# from tokenizers.implementations import BaseTokenizer

# tokenizer = Tokenizer.from_file("./thwiki-sentencepiecebpe.tokenizer.json")
# base_tokenizer = BaseTokenizer(tokenizer) # Wrapper!! to PretrainTokenizerFast Tokenizer should be an instance of a Tokenizer provided by HuggingFace tokenizers library.
# base_tokenizer = SentencePieceBPETokenizer()
# pretrain_tokenizer = PreTrainedTokenizerFast(tokenizer=base_tokenizer)
# pretrain_tokenizer

In [None]:
# from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast
# from tokenizers import Tokenizer, CharBPETokenizer
# from tokenizers.implementations import BaseTokenizer

# # tokenizer = Tokenizer.from_file("./thwiki-charbpe-30522.tokenizer.json")
# # base_tokenizer = BaseTokenizer(tokenizer) # Wrapper!! to PretrainTokenizerFast Tokenizer should be an instance of a Tokenizer provided by HuggingFace tokenizers library.
# base_tokenizer = CharBPETokenizer(vocab_file='vocab.json',merges_file ='merges.txt')
# pretrain_tokenizer = PreTrainedTokenizerFast(tokenizer=base_tokenizer)
# pretrain_tokenizer

In [None]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./thwiki-seniorproj-bytebpe-30522", max_len=512)

In [None]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("./thwiki-seniorproj-bytebpe-30522", max_len=512)

# Building our dataset

Build it with `from torch.utils.data.dataset import Dataset` just like [TextDataset](https://github.com/huggingface/transformers/blob/448c467256332e4be8c122a159b482c1ef039b98/src/transformers/data/datasets/language_modeling.py) and [LineByLineTextDataset](https://github.com/huggingface/transformers/blob/448c467256332e4be8c122a159b482c1ef039b98/src/transformers/data/datasets/language_modeling.py#L78)

Note: Training with multiple files is currently not supported [issue/3445](https://github.com/huggingface/transformers/issues/3445)

padding documentation [link](https://github.com/huggingface/tokenizers/blob/master/bindings/python/tokenizers/implementations/base_tokenizer.py#L52)

Potential Improvements
- Make the Dataset tokenize dynamically and open files dynamically. Right now, a Dataset built on `torch.utils.data.dataset` tokenizes everything inside its constructor. For very large data this is probably unsuitable, because RAM would need to be about as large as the input data, which is unrealistic for very large corpora. I searched and found this thread on the PyTorch discussion forum: https://discuss.pytorch.org/t/how-to-use-a-huge-line-corpus-text-with-dataset-dataloader/30872 
  - Option 1: use a `pd.DataFrame` to read the file in small chunks of data: https://discuss.pytorch.org/t/data-processing-as-a-batch-way/14154/4?u=ptrblck
  - Option 2: use byte offsets into the large file and look lines up with `.seek()`: https://github.com/pytorch/text/issues/130#issuecomment-510412877
  - More examples: https://github.com/pytorch/text/blob/master/torchtext/datasets/unsupervised_learning.py , https://github.com/pytorch/text/blob/a5880a3da7928dd7dd529507eec943a307204de7/examples/text_classification/iterable_train.py#L169-L214
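
Option 2 above (byte offsets plus `.seek()`) can be sketched as follows. This is a minimal illustration in plain Python, not a drop-in replacement: `LazyLineCorpus` is a hypothetical class name, it returns raw text rather than token ids, and in a real pipeline it would subclass `torch.utils.data.Dataset` and tokenize inside `__getitem__`.

```python
import os
import tempfile


class LazyLineCorpus:
    """Map-style dataset that keeps only byte offsets in memory, not the text."""

    def __init__(self, file_path, encoding="utf-8"):
        self.file_path = file_path
        self.encoding = encoding
        self.offsets = []
        # One pass in binary mode to record where each non-empty line starts.
        with open(file_path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():
                    self.offsets.append(offset)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, i):
        # Reopen and seek on each access; tokenization would happen here, lazily.
        with open(self.file_path, "rb") as f:
            f.seek(self.offsets[i])
            return f.readline().decode(self.encoding).rstrip("\r\n")


# Tiny demo on a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("สวัสดีครับ\n\nsecond line\nthird line\n")
    corpus = LazyLineCorpus(path)
    demo_len = len(corpus)
    demo_items = [corpus[i] for i in range(demo_len)]

print(demo_len, demo_items)
```

Memory use then grows with the number of lines (one integer per line) rather than with the size of the text, at the cost of a file open and seek per example.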

In [None]:
%%time
from transformers import LineByLineTextDataset

train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=str(WIKI_FILES[0]),
    block_size=128,
)
val_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=str(WIKI_FILES[2]),
    block_size=128,
)

In [None]:
from torch.utils.data import Dataset

class THWikiDataset(Dataset):
    def __init__(self, tokenizer, evaluate = False, block_size=32):
        self.examples = []
        
        # Use AA and AB as train, AC as test
        src_files = Path("../data/text/AC/").glob("wiki*") if evaluate else Path("../data/text/").glob("A[AB]/wiki*")
        for src_file in src_files:
            print("🔥", src_file)
            lines = src_file.read_text(encoding="utf-8").splitlines()
#             self.examples += [x.ids for x in tokenizer.encode_batch(lines)]
            batch_encoding = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=block_size)
            # Accumulate across files; plain assignment would keep only the last file.
            self.examples += batch_encoding["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We’ll pad at the batch level.
        return torch.tensor(self.examples[i], dtype=torch.long)

In [None]:
dataset = THWikiDataset(tokenizer)
dataset[0]

In [None]:

# dataloader = DataLoader(dataset, batch_size=1, collate_fn=data_collator,
#                         shuffle=True, num_workers=4) # Still cant make more batch size!! Need collate function!
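
Batches larger than 1 need a collate function because the examples have different lengths. A minimal sketch of the padding step in plain Python follows; `pad_batch` and the pad id of 1 are assumptions for illustration, and in practice the result would be converted to tensors (or a ready-made collator would be used instead):

```python
def pad_batch(sequences, pad_id):
    """Right-pad variable-length id lists to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[0, 11, 12, 2], [0, 13, 2]], pad_id=1)
print(batch)
```

Padding per batch rather than to a global maximum keeps short batches small, which is the point of the "pad at the batch level" comment in the dataset class.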

In [None]:
# for i_batch, sample_batched in enumerate(dataloader):
#     print(i_batch, sample_batched)
#     oumodel()

In [None]:
# tokenizer = CharBPETokenizer(vocab_file='vocab.json',merges_file ='merges.txt' )
# no_accent_strip = BertNormalizer(strip_accents=False)
# tokenizer._tokenizer.normalizer = no_accent_strip
# tokenizer._tokenizer.post_processor = BertProcessing(
#     ("</s>", tokenizer.token_to_id("</s>")),
#     ("<s>", tokenizer.token_to_id("<s>")),
# )

# input_ids = torch.tensor(tokenizer.encode(u"สวัสดีครับ ผมชื่อไนท์ ตอนนี้ก็เป็นเวลาที่ผมต้องไปโรงเรียนแล้ว  นี่คือการเว้นวรรคสองทีครับ  จะได้ออกเป็นสอง Spaces").ids).unsqueeze(0)
# print(input_ids)
# outputs = model(input_ids, labels=input_ids)
# print(outputs)
# loss, prediction_scores = outputs[:2]
# print(loss, prediction_scores.shape)

In [None]:
# dataset.__getitem__(1).unsqueeze(0)

In [None]:
# input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1

In [None]:
# %%time
# from transformers import TextDataset, LineByLineTextDataset

# # dataset = LineByLineTextDataset(
# #     tokenizer=pretrain_tokenizer,
# #     file_path="../data/text/AA/wiki_01",
# #     block_size=128,
# # )

# dataset = TextDataset(
#     tokenizer=pretrain_tokenizer,
#     file_path="../data/text/AA/wiki_01",
#     block_size=128,
# )


In [None]:
# one_doc = list(Path("../data/text/AA/").glob("wiki*"))[0].read_text(encoding="utf-8").splitlines()
# tokenizer = Tokenizer.from_file("./thwiki-sentencepiecebpe.tokenizer.json")
# tokenizer.encode_batch(one_doc[:8])

In [None]:
# one_doc = list(Path("../data/text/AA/").glob("wiki*"))[0].read_text(encoding="utf-8").splitlines()
# tokenizer = RobertaTokenizerFast(vocab_file='vocab.json',merges_file ='merges.txt', max_len=512)
# tokenizer.batch_encode_plus(one_doc[:8])

In [None]:
# print(tokenizer.encode_batch(one_doc[:8])[5].tokens)

In [None]:
# one_doc[:8]

In [None]:
from transformers import DataCollatorForLanguageModeling
# tokenizer = RobertaTokenizer(vocab_file='vocab.json',merges_file ='merges.txt', max_len=512)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
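
To make `mlm=True, mlm_probability=0.15` concrete, the sketch below mimics the usual masked-language-modeling recipe on a plain list of ids: about 15% of non-special positions are selected, and of those 80% become the mask token, 10% become a random token, and 10% stay unchanged. It is an illustration only; `mask_tokens`, the mask id 4, and the special ids 0 and 2 are assumptions, and the real collator works on padded tensors.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) following the 80/10/10 MLM recipe."""
    rng = rng or random.Random()
    inputs, labels = list(input_ids), []
    for pos, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # position ignored by the loss
            continue
        labels.append(tok)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


ids = [0] + list(range(10, 60)) + [2]
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=30522,
                             special_ids={0, 2}, rng=random.Random(0))
print(sum(label != -100 for label in labels), "of", len(ids), "positions selected")
```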

# Transformers Trainer [link](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L133)

```python
class Trainer:
    """
    Trainer is a simple but feature-complete training and eval loop for PyTorch,
    optimized for Transformers.
    Args:
        prediction_loss_only:
            (Optional) in evaluation and prediction, only return the loss
    """
    def __init__(
        self,
        model: PreTrainedModel,
        args: TrainingArguments,
        data_collator: Optional[DataCollator] = None,
        train_dataset: Optional[Dataset] = None,
        eval_dataset: Optional[Dataset] = None,
        compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None,
        prediction_loss_only=False,
        tb_writer: Optional["SummaryWriter"] = None,
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR] = None,
```

[TrainingArguments](https://github.com/huggingface/transformers/blob/master/src/transformers/training_args.py#L33) is referenced here. 

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./test_RoBERTa2",
    overwrite_output_dir=True,  #"Use this to continue training if output_dir points to a checkpoint directory."
    
    
    do_train=True, #Whether to run training.
    do_eval=True, #Whether to run eval on the dev set.
#     do_predict=True, # Whether to run predictions on the test set.
    
    num_train_epochs=20, # Total number of training epochs to perform.
    
    
    per_device_train_batch_size=8, # Batch size per GPU/TPU core/CPU for training.
    per_device_eval_batch_size=8, # Batch size per GPU/TPU core/CPU for evaluation.
    
    learning_rate=5e-5,  #The initial learning rate for Adam.
    adam_epsilon=1e-8, #Epsilon for Adam optimizer.
    
    save_steps=10_000,  #Save checkpoint every X updates steps.
    save_total_limit=2, # Limit the total number of checkpoints; older checkpoints in output_dir are deleted. Default is unlimited.
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
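
As a rough sanity check on `num_train_epochs`, `per_device_train_batch_size`, and `save_steps`, the arithmetic below estimates how many optimizer steps and checkpoints this run would produce. It assumes a single device, no gradient accumulation, and 9988 training examples (a figure borrowed from the progress output of the ELECTRA run earlier; the actual size of `train_dataset` may differ):

```python
import math

num_examples = 9988          # assumption: taken from the earlier ELECTRA progress output
per_device_batch_size = 8    # from TrainingArguments
num_train_epochs = 20        # from TrainingArguments
save_steps = 10_000          # from TrainingArguments

steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_train_epochs
checkpoints_written = total_steps // save_steps

print(steps_per_epoch, total_steps, checkpoints_written)
```

Under these assumptions only a couple of checkpoints would be written over the whole run, so a smaller `save_steps` may be worth considering if intermediate checkpoints matter.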

In [None]:
%%time
trainer.train()

In [None]:
trainer.save_model("./EsperBERTo")

In [None]:
# `pretrain_tokenizer` is only defined in commented-out cells; use the `tokenizer` loaded above.
encoded = tokenizer.encode("สวัสดีครับ ผมชื่อไนท์ ตอนนี้ก็เป็นเวลาที่ผมต้องไปโรงเรียนแล้ว  นี่คือการเว้นวรรคสองทีครับ  จะได้ออกเป็นสอง Spaces")
encoded