<a href="https://colab.research.google.com/github/JayThibs/pretrain-nlp-models/blob/main/pretrain_bert_with_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-Training a BERT model with Hugging Face

We will be following the tutorial found here: https://huggingface.co/blog/how-to-train

In [1]:
!pip install transformers --quiet

In [2]:
# download dataset
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

--2021-08-20 20:52:24--  https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
Resolving cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)... 65.9.73.8, 65.9.73.71, 65.9.73.86, ...
Connecting to cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)|65.9.73.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 312733741 (298M) [text/plain]
Saving to: ‘oscar.eo.txt’


2021-08-20 20:52:31 (44.3 MB/s) - ‘oscar.eo.txt’ saved [312733741/312733741]



In [3]:
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path('.').glob('**/*.txt')]

# initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# customize training
tokenizer.train(files=paths, vocab_size=52000, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# save files to disk (writes esperberto-vocab.json and esperberto-merges.txt)
tokenizer.save_model('.', 'esperberto')

In [5]:
!mkdir -p EsperBERTo
tokenizer.save_model("EsperBERTo")

['EsperBERTo/vocab.json', 'EsperBERTo/merges.txt']

We now have two files: `vocab.json`, which maps the most frequent tokens (ranked by frequency) to their IDs, and `merges.txt`, which lists the learned merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}
```

```text
# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```
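
To make the role of `merges.txt` concrete, here is a minimal sketch of how an ordered merge list can be applied to a word. It is a simplified illustration, not the `tokenizers` implementation: the helper `apply_merges` is hypothetical, and the merge list is just the first few lines shown above. `Ġ` is the byte-level marker for a leading space.

```python
def apply_merges(word, merges):
    """Greedy BPE sketch: repeatedly merge the adjacent pair that appears
    earliest in the merge list, until no listed pair remains."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# First lines of merges.txt as shown above (illustrative subset only).
merges = [("l", "a"), ("Ġ", "k"), ("o", "n"), ("Ġ", "la"),
          ("t", "a"), ("Ġ", "e"), ("Ġ", "d"), ("Ġ", "p")]

la_tokens = apply_merges("Ġla", merges)
konta_tokens = apply_merges("Ġkonta", merges)
print(la_tokens)     # ['Ġla']
print(konta_tokens)  # ['Ġk', 'on', 'ta']
```

With the full merge list, longer words collapse into far fewer pieces, which is why common Esperanto words end up as single tokens.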

Because the tokenizer is trained on Esperanto, it is optimized for the language. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. the accented characters used in Esperanto (`ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ`), are encoded natively. Sequences are also represented more efficiently: on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.
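
A comparison like this can be measured by encoding the same lines with two tokenizers and comparing the mean number of tokens. The sketch below shows the arithmetic only. The helpers `mean_encoded_length` and `relative_reduction` are hypothetical names, and the two toy encoders (character-level and whitespace-level) are stand-ins so the snippet runs without downloading any model.

```python
from statistics import mean


def mean_encoded_length(encode, lines):
    """Average number of tokens produced by `encode` over `lines`."""
    return mean(len(encode(line)) for line in lines)


def relative_reduction(baseline_length, candidate_length):
    """Fraction by which the candidate is shorter than the baseline."""
    return 1 - candidate_length / baseline_length


# Stand-in encoders (assumptions for illustration, not real tokenizers).
char_level = list        # one token per character
word_level = str.split   # one token per whitespace-separated word

sample = ["Mi estas Jacques.", "La hundo kuras."]
baseline = mean_encoded_length(char_level, sample)
candidate = mean_encoded_length(word_level, sample)
reduction = relative_reduction(baseline, candidate)
print(f"{reduction:.2%} shorter")
```

For a real comparison, `encode` would wrap the trained tokenizer (for example `lambda s: tokenizer.encode(s).ids`) and a GPT-2 tokenizer loaded from `transformers`, evaluated over lines of `oscar.eo.txt`.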

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens. You will also be able to use it directly from `transformers`.


In [6]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)

In [7]:
tokenizer.encode("Mi estas Jacques.")

Encoding(num_tokens=4, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [8]:
tokenizer.encode("Mi estas Jacques.").tokens

['Mi', 'Ġestas', 'ĠJacques', '.']
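
The encoding above contains no `<s>` or `</s>` markers, because the tokenizer was loaded without a post-processor. One way to add the RoBERTa-style special tokens is to attach a `BertProcessing` post-processor. The sketch below is self-contained under one assumption: it trains a throwaway tokenizer on a two-sentence in-memory corpus instead of loading the `EsperBERTo` files, so `demo_tokenizer` and its tiny vocabulary are illustrative only.

```python
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

# Assumption: a tiny in-memory corpus stands in for the trained EsperBERTo
# files so that this snippet runs on its own.
demo_tokenizer = ByteLevelBPETokenizer()
demo_tokenizer.train_from_iterator(
    ["Mi estas Jacques.", "La hundo estas granda."],
    vocab_size=300,
    min_frequency=1,
    show_progress=False,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Wrap every encoded sequence as <s> ... </s>.
# BertProcessing takes the (token, id) pair for the separator first,
# then the pair for the start-of-sequence token.
demo_tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", demo_tokenizer.token_to_id("</s>")),
    ("<s>", demo_tokenizer.token_to_id("<s>")),
)
# Cap sequences at 512 tokens, a common limit for BERT-style models.
demo_tokenizer.enable_truncation(max_length=512)

tokens = demo_tokenizer.encode("Mi estas Jacques.").tokens
print(tokens[0], tokens[-1])  # <s> </s>
```

In the notebook itself, the same two post-processing calls can be applied to the `tokenizer` object loaded from `./EsperBERTo/vocab.json` and `./EsperBERTo/merges.txt`.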

# 3. Train a language model from scratch

