This notebook shows how to train a “small” model (84M parameters: 6 layers, hidden size 768, 12 attention heads), with the same number of layers and heads as DistilBERT, on Esperanto. We’ll then fine-tune the model on a downstream part-of-speech tagging task.

## 1. Find a dataset
The Esperanto portion of the OSCAR dataset is only 299M, so we’ll concatenate it with the Esperanto sub-corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), which comprises text from diverse sources like news, literature, and Wikipedia.

The final training corpus is 3 GB, which is still small. For your own model, the more data you can pretrain on, the better your results will be.
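
The concatenation step is not shown in this notebook. A minimal sketch of one way to do it, assuming each corpus is a plain UTF-8 text file (the function name `concat_corpora` and any file names passed to it are hypothetical, not part of the original workflow):

```python
def concat_corpora(sources, destination):
    """Stream each source text file into `destination`, one after another."""
    with open(destination, "w", encoding="utf-8") as out:
        for src in sources:
            last = "\n"
            with open(src, encoding="utf-8") as f:
                # Copy line by line so multi-GB corpora never sit fully in memory.
                for line in f:
                    out.write(line)
                    last = line
            # Make sure the next corpus starts on a fresh line.
            if not last.endswith("\n"):
                out.write("\n")
```

Streaming line by line keeps memory use flat, which matters once the combined corpus reaches several gigabytes.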

In [1]:
!wget -c https://s3.amazonaws.com/datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

--2020-05-02 13:13:26--  https://s3.amazonaws.com/datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.113.85
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.113.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 312733741 (298M) [text/plain]
Saving to: ‘oscar.eo.txt’


2020-05-02 13:13:46 (15.3 MB/s) - ‘oscar.eo.txt’ saved [312733741/312733741]



## 2. Train a tokenizer

We choose to train a byte-level byte-pair encoding (BPE) tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than, say, a WordPiece tokenizer like BERT’s) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).
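
To make the idea concrete, here is a toy sketch of BPE merge learning over raw bytes. It is an illustration only, not how `tokenizers` is implemented: it skips whitespace pre-tokenization, special tokens, and the `min_frequency` and `vocab_size` bookkeeping, and the function name `learn_bpe` is hypothetical.

```python
from collections import Counter

def learn_bpe(text, num_merges):
    """Greedily merge the most frequent adjacent pair, starting from single bytes."""
    # The base alphabet is the set of byte values, so any string can be represented.
    seq = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing left that is worth merging
            break
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)  # replace the pair with one new symbol
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

merges, pieces = learn_bpe("la la la ĉapelo", num_merges=3)
# Joining the pieces always gives back the original bytes, even for "ĉ".
assert b"".join(pieces).decode("utf-8") == "la la la ĉapelo"
```

Because every piece is built from bytes, unseen words still decompose into known symbols, which is why an unknown-token fallback is never needed.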


In [14]:
!git clone https://github.com/huggingface/transformers.git \
    ;cd transformers \
    ;pip install . 

Cloning into 'transformers'...
remote: Enumerating objects: 361, done.
remote: Counting objects: 100% (361/361), done.
remote: Compressing objects: 100% (189/189), done.
remote: Total 26096 (delta 178), reused 322 (delta 163), pack-reused 25735
Receiving objects: 100% (26096/26096), 15.38 MiB | 9.11 MiB/s, done.
Resolving deltas: 100% (18216/18216), done.
Processing /content/transformers
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
     |████████████████████████████████| 3.8MB 2.5MB/s 
Building wheels for collected packages: transformers
  Building wheel for transformers (setup.py) ... done
  Created wheel for transformers: filename=transformers-2.8.0-cp36-none-any.whl size=595726 sha256=41605a9471418b226c4a7a70139ea638a7fedafc224ab7caafc8aa1a3198a585
  Stored in directory: /tmp/pip-ephem-whee

In [3]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 4min 57s, sys: 2.69 s, total: 5min
Wall time: 4min 58s


Now let's save the files to disk.

In [4]:
!mkdir EsperBERTo
tokenizer.save("EsperBERTo")

['EsperBERTo/vocab.json', 'EsperBERTo/merges.txt']

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```
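
The `Ġ` in the merges above stands for a leading space. GPT-2-style byte-level BPE remaps every byte to a printable character before learning merges. The sketch below reproduces that byte-to-character table as it appears in GPT-2's published encoder; that `tokenizers` uses the same table internally is an assumption here, though it is consistent with the `Ġ` entries shown.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Whitespace and control bytes are shifted above U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))

table = bytes_to_unicode()
print("".join(table[b] for b in " la".encode("utf-8")))  # Ġla
```

So the merge `Ġ la` builds the token for the word "la" preceded by a space.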

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. the accented characters used in Esperanto (`ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ`), are encoded natively. We also represent sequences more efficiently: on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.
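
The notebook does not show how the length comparison was measured. One hedged way to compute such a figure, given any two encoding functions, is sketched below (the helper names are hypothetical, and the whitespace and character "tokenizers" at the end are stand-ins for illustration only):

```python
def mean_token_count(encode, texts):
    """Average number of tokens `encode` produces per text."""
    return sum(len(encode(t)) for t in texts) / len(texts)

def relative_saving(encode_new, encode_baseline, texts):
    """Fraction by which `encode_new` shortens sequences relative to the baseline."""
    return 1 - mean_token_count(encode_new, texts) / mean_token_count(encode_baseline, texts)

# Stand-in encoders: whitespace splitting vs. one token per character.
sample = ["la hundo", "mi estas"]
print(relative_saving(str.split, list, sample))  # 0.75
```

With real tokenizers you would pass something like `lambda t: tokenizer.encode(t).ids` for each side and a held-out sample of the corpus as `texts`.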

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens. Of course, you’ll also be able to use it directly from `transformers`.


In [0]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)

In [0]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [7]:
tokenizer.encode("Mi estas Julien.")

Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])

In [8]:
tokenizer.encode("Mi estas Julien.").tokens

['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']
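
Conceptually, the post-processor and truncation settings above wrap each token sequence in `<s>` and `</s>` and cap the total length. A pure-Python sketch of that behavior follows; it is not the library's code, and it assumes the 512 limit counts the two special tokens:

```python
def wrap_with_special_tokens(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Truncate to leave room for two special tokens, then add them."""
    body = tokens[: max_length - 2]
    return [bos] + body + [eos]

print(wrap_with_special_tokens(["Mi", "Ġestas", "ĠJuli", "en", "."]))
# ['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']
```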

## 3. Train a language model from scratch

We will now train our language model using the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py) script from `transformers`. Just remember to leave `--model_name_or_path` unset (`None`) so that the model is trained from scratch rather than from an existing model or checkpoint.

> We’ll train a RoBERTa-like model, which is BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a *masked language modeling* task, i.e. predicting how to fill in tokens that we randomly mask in the dataset. This is taken care of by the example script.
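
For intuition, here is a minimal sketch of the masking step in plain Python. The 15% masking rate and the 80/10/10 split (mask token, random token, unchanged) are the defaults from the BERT paper, not values stated in this notebook. The `-100` label is an assumption: it matches the default `ignore_index` of PyTorch's cross-entropy loss. The function name `mask_tokens` is hypothetical.

```python
import random

def mask_tokens(ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = rng or random.Random()
    inputs = list(ids)
    labels = [-100] * len(ids)  # -100 marks positions the loss should ignore
    for i, tok in enumerate(ids):
        # Never mask special tokens; select other tokens with probability mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

# Usage with the special-token ids from vocab.json above (<s>=0 ... <mask>=4).
ids = [0] + list(range(5, 25)) + [2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2, 3, 4}, rng=random.Random(0))
```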


In [9]:
# Check that PyTorch sees the GPU
import torch
torch.cuda.is_available()

False

Here, as we only have one text file, we don't even need to customize our `LineByLineTextDataset`. We'll just run the `run_language_modeling.py` script out-of-the-box.
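
As a rough picture of what a line-by-line dataset does, the sketch below treats every non-empty line of the file as one training example and truncates it to a maximum length. It illustrates the idea only and is not the script's code; `read_line_examples` and its parameters are hypothetical names.

```python
def read_line_examples(path, encode, block_size=512):
    """Encode each non-empty line of `path` as one example of at most `block_size` tokens."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    return [encode(line)[:block_size] for line in lines]
```

Here `encode` could be something like `lambda s: tokenizer.encode(s).ids` with the tokenizer trained earlier.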

In [10]:
# Get the example scripts.
!wget -c https://raw.githubusercontent.com/huggingface/transformers/master/examples/run_language_modeling.py

--2020-05-02 13:31:04--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/run_language_modeling.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10500 (10K) [text/plain]
Saving to: ‘run_language_modeling.py’


2020-05-02 13:31:05 (97.7 MB/s) - ‘run_language_modeling.py’ saved [10500/10500]



### We'll define the following config for the model

In [0]:
import json
config = {
	"architectures": [
		"RobertaForMaskedLM"
	],
	"attention_probs_dropout_prob": 0.1,
	"hidden_act": "gelu",
	"hidden_dropout_prob": 0.1,
	"hidden_size": 768,
	"initializer_range": 0.02,
	"intermediate_size": 3072,
	"layer_norm_eps": 1e-05,
	"max_position_embeddings": 514,
	"model_type": "roberta",
	"num_attention_heads": 12,
	"num_hidden_layers": 6,
	"type_vocab_size": 1,
	"vocab_size": 52000
}
with open("./EsperBERTo/config.json", 'w') as fp:
    json.dump(config, fp)

tokenizer_config = {
	"max_len": 512
}
with open("./EsperBERTo/tokenizer_config.json", 'w') as fp:
    json.dump(tokenizer_config, fp)
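
As a sanity check on the 84M figure, the parameter count can be estimated directly from the config values. The sketch below assumes a standard BERT/RoBERTa layout (biases on every linear layer, LayerNorm after the embeddings and after each sub-layer, and a masked-LM head whose output weights are tied to the word embeddings). It is back-of-the-envelope arithmetic, not a count read from the actual model.

```python
def roberta_mlm_param_count(vocab_size=52_000, hidden=768, layers=6,
                            intermediate=3072, max_positions=514, type_vocab=1):
    """Estimate parameters of a RoBERTa-style masked LM from its config values."""
    layer_norm = 2 * hidden  # scale + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    encoder = layers * (attention + feed_forward)
    # LM head: dense + LayerNorm + output bias (output weights tied to embeddings).
    lm_head = (hidden * hidden + hidden) + layer_norm + vocab_size
    return embeddings + encoder + lm_head

total = roberta_mlm_param_count()
print(f"{total / 1e6:.1f}M parameters")  # 83.5M parameters
```

Note that the number of attention heads does not change the count, since the heads only partition the same projection matrices. Almost half of the parameters sit in the 52,000-entry word embedding table.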

Let's run our script with the following options:

In [0]:
cmd = """
    python run_language_modeling.py
    --train_data_file ./oscar.eo.txt
    --output_dir ./EsperBERTo-small-v1
    --model_type roberta
    --mlm
    --config_name ./EsperBERTo
    --tokenizer_name ./EsperBERTo
    --do_train
    --line_by_line
    --learning_rate 1e-4
    --num_train_epochs 1
    --save_total_limit 2
    --save_steps 2000
    --per_gpu_train_batch_size 16
    --seed 42
""".replace("\n", " ")

In [15]:
%%time
!{cmd}

2020-05-02 13:34:40.262434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
05/02/2020 13:34:42 - INFO - transformers.training_args -   PyTorch: setting up devices
05/02/2020 13:34:42 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./EsperBERTo-small-v1', overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, per_gpu_train_batch_size=16, per_gpu_eval_batch_size=8, gradient_accumulation_steps=1, learning_rate=0.0001, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_dir=None, logging_first_step=False, logging_steps=500, save_steps=2000, save_total_limit=2, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1)
05/02/2020 13:34:42 - INFO - transformers.configuration_utils -   loading configuration file ./EsperBERTo/config.json
05/02/2020 13:34:42 - I