https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb

In [1]:
import os
os.environ['TRANSFORMERS_CACHE'] = '/home/vips/share/huggingface/transformers'
os.environ['HF_DATASETS_CACHE'] = '/home/vips/share/huggingface/datasets'

## 1. Find a dataset

Here we’ll use the Esperanto portion of the OSCAR corpus from INRIA. OSCAR is a huge multilingual corpus obtained by language classification and filtering of Common Crawl dumps of the Web. The final training corpus has a size of 3 GB, which is still small: the more data you can pretrain on, the better the results your model will get.

In [2]:
# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

--2022-02-11 07:31:32--  https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
Resolving cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)... 13.225.129.73, 13.225.129.30, 13.225.129.101, ...
Connecting to cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)|13.225.129.73|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



## 2. Train a tokenizer

We train a byte-level byte-pair encoding (BPE) tokenizer (the same kind as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily set its vocabulary size to 52,000.

We recommend training a byte-level BPE (rather than, say, a WordPiece tokenizer like BERT’s) because it builds its vocabulary from an alphabet of single bytes, so every word can be decomposed into tokens (no more <unk> tokens!).
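
To make the "alphabet of single bytes" point concrete, here is a minimal sketch in plain Python. It is not the `tokenizers` implementation; the helper name `byte_alphabet_decompose` is made up for this illustration. It only shows that any string, including Esperanto letters outside ASCII, breaks down into symbols from a fixed set of 256 byte values.

```python
# Illustrative sketch only: this is not how `tokenizers` is implemented internally.
def byte_alphabet_decompose(text):
    """Return the UTF-8 byte values of `text`; each value lies in range(256)."""
    return list(text.encode("utf-8"))


word = "ĉiuj"  # Esperanto word containing a non-ASCII letter
byte_ids = byte_alphabet_decompose(word)

# Every symbol falls inside the fixed 256-entry base alphabet,
# so no input ever has to be mapped to an unknown token.
assert all(0 <= b < 256 for b in byte_ids)

# The decomposition is lossless: the bytes decode back to the original text.
assert bytes(byte_ids).decode("utf-8") == word

print(byte_ids)  # [196, 137, 105, 117, 106]
```

The non-ASCII letter takes two bytes, which is why a byte-level vocabulary can always fall back to smaller pieces instead of `<unk>`.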

In [3]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])




CPU times: user 5min 5s, sys: 31.3 s, total: 5min 36s
Wall time: 37.8 s


In [4]:
# We now have both a vocab.json, which lists the most frequent tokens ranked by frequency, and a merges.txt, which lists the merges.

!mkdir EsperBERTo
tokenizer.save_model("EsperBERTo")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
mkdir: cannot create directory ‘EsperBERTo’: File exists


['EsperBERTo/vocab.json', 'EsperBERTo/merges.txt']
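
As a rough illustration of what a ranked list of merges is for, the sketch below applies a tiny merge list to a word. The merge rules are hypothetical and are not taken from the `merges.txt` produced above; the function is a simplified toy, not the library's encoder.

```python
# Toy illustration of applying a ranked merge list.
# The rules below are made up for the example, not read from merges.txt.
def apply_merges(symbols, merges):
    """Repeatedly merge the adjacent pair that appears earliest in `merges`."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


hypothetical_merges = [("e", "s"), ("t", "a"), ("es", "ta"), ("esta", "s")]

print(apply_merges("estas", hypothetical_merges[:2]))  # ['es', 'ta', 's']
print(apply_merges("estas", hypothetical_merges))      # ['estas']
```

With only the first two rules the word stays split in three pieces; with the full list it collapses into a single token, which is the effect a larger vocabulary has on frequent words.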

In [5]:
# Here’s how you can use it in tokenizers, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from transformers.

from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)

tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
tokenizer.encode("Mi estas Julien.").tokens

['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']
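
The effect of the post-processor and truncation settings in the cell above can be sketched in plain Python. This is a simplification, and it assumes that the truncation budget of 512 counts the two special tokens; `wrap_and_truncate` is a hypothetical helper, not part of `tokenizers`.

```python
# Simplified sketch of the post-processing step (not the library code).
# Assumption: max_length includes the <s> and </s> tokens.
def wrap_and_truncate(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Keep at most `max_length` tokens in total, then add the special tokens."""
    budget = max_length - 2  # reserve room for <s> and </s>
    return [bos] + list(tokens[:budget]) + [eos]


print(wrap_and_truncate(["Mi", "Ġestas", "ĠJuli", "en", "."]))
# ['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']

print(len(wrap_and_truncate(["x"] * 1000)))  # 512
```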

## 3. Train a language model from scratch

In [6]:
from transformers import RobertaConfig
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", model_max_length=512)


config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

# As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

file ./EsperBERTo/config.json not found
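
To get a feel for the size of the model this config describes, the following back-of-the-envelope count may help. It is an estimate under stated assumptions: the usual RoBERTa layer layout, the default hidden size of 768 and intermediate size of 3072, an output projection tied to the word embeddings, and no pooling layer. The function name is hypothetical, and the exact figure reported by the library can differ between versions.

```python
# Back-of-the-envelope parameter count (an estimate, not a library call).
# Assumptions: standard RoBERTa layout, tied output embeddings, no pooler.
def roberta_mlm_param_count(vocab_size=52_000, max_positions=514, type_vocab=1,
                            hidden=768, intermediate=3072, layers=6):
    layer_norm = 2 * hidden  # weight + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    # LM head: dense layer, layer norm, and an output bias (weights are tied).
    lm_head = (hidden * hidden + hidden) + layer_norm + vocab_size
    return embeddings + layers * (attention + feed_forward) + lm_head


print(f"{roberta_mlm_param_count():,}")  # 83,504,416
```

Note that almost half of the total sits in the 52,000-row embedding matrix, and that `num_attention_heads` does not change the count.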


In [7]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)

from transformers import DataCollatorForLanguageModeling
# A small helper that batches samples of the dataset together into tensors that PyTorch can backpropagate through (with mlm=True, it also selects 15% of the tokens at random for masking).
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)



CPU times: user 6min 11s, sys: 6.49 s, total: 6min 17s
Wall time: 49.8 s
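
What `mlm=True, mlm_probability=0.15` means can be sketched without any library. The version below follows the usual BERT-style recipe (of the selected tokens, 80% become `<mask>`, 10% become a random token, 10% stay unchanged) and uses `-100` as the "ignore" label. It is a simplified stand-in for the collator, and the ids in the usage example (including `4` for `<mask>`) are assumptions based on the order of the special tokens given at tokenizer training time.

```python
import random


# Simplified stand-in for masked-language-model batching (not the library code).
def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels); labels are -100 wherever no prediction is required."""
    rng = rng or random.Random()
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # position is ignored by the loss
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Assumed ids: <s>=0, <pad>=1, </s>=2, <unk>=3, <mask>=4.
example_ids = [0] + list(range(10, 30)) + [2]
inputs, labels = mask_tokens(example_ids, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2, 3, 4}, rng=random.Random(0))
print(inputs)
print(labels)
```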


In [8]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=16,  # deprecated name; newer releases prefer per_device_train_batch_size
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)


trainer.train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 974545
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 30455


Step,Training Loss
500,7.9859


KeyboardInterrupt: 
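
Two numbers in this log can be sanity-checked with a little arithmetic: the number of optimization steps, and how the first reported loss compares with a model that guesses uniformly over the vocabulary. The snippet below only reuses values printed above and assumes the loss is the usual cross-entropy in nats.

```python
import math

num_examples = 974_545   # "Num examples" from the log
total_batch_size = 32    # "Total train batch size" from the log
vocab_size = 52_000
loss_at_step_500 = 7.9859

# One epoch needs enough steps to cover every example once.
steps_per_epoch = math.ceil(num_examples / total_batch_size)

# Cross-entropy (in nats) of a uniform guess over the whole vocabulary.
uniform_loss = math.log(vocab_size)

print(steps_per_epoch)         # 30455
print(round(uniform_loss, 2))  # 10.86
print(loss_at_step_500 < uniform_loss)  # True
```

So after 500 of 30,455 steps the model is already better than chance, but the run was interrupted long before the end of the epoch.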

In [9]:
trainer.save_model("./EsperBERTo")

Saving model checkpoint to ./EsperBERTo
Configuration saved in ./EsperBERTo/config.json
Model weights saved in ./EsperBERTo/pytorch_model.bin


## 4. Check that the LM actually trained

In [10]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

loading configuration file ./EsperBERTo/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.10.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}


In [11]:
# The sun <mask>.
# =>

fill_mask("La suno <mask>.")

[{'sequence': 'La suno,.',
  'score': 0.021514812484383583,
  'token': 16,
  'token_str': ','},
 {'sequence': 'La suno..',
  'score': 0.01748129539191723,
  'token': 18,
  'token_str': '.'},
 {'sequence': 'La suno la.',
  'score': 0.01326251681894064,
  'token': 264,
  'token_str': ' la'},
 {'sequence': 'La suno:.',
  'score': 0.012589017860591412,
  'token': 30,
  'token_str': ':'},
 {'sequence': 'La suno estas.',
  'score': 0.012565826065838337,
  'token': 317,
  'token_str': ' estas'}]

In [12]:
# This is the beginning of a beautiful <mask>.
# =>

fill_mask("Jen la komenco de bela <mask>.")

[{'sequence': 'Jen la komenco de bela la.',
  'score': 0.08968336135149002,
  'token': 264,
  'token_str': ' la'},
 {'sequence': 'Jen la komenco de bela de.',
  'score': 0.041614122688770294,
  'token': 274,
  'token_str': ' de'},
 {'sequence': 'Jen la komenco de bela,.',
  'score': 0.02514728158712387,
  'token': 16,
  'token_str': ','},
 {'sequence': 'Jen la komenco de bela..',
  'score': 0.023026520386338234,
  'token': 18,
  'token_str': '.'},
 {'sequence': 'Jen la komenco de bela en.',
  'score': 0.021679924800992012,
  'token': 295,
  'token_str': ' en'}]
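
The pipeline returns a list of dictionaries, which is easier to read once condensed. The sketch below copies the predictions shown above into a literal, so it runs without the model; `summarize` is a hypothetical helper.

```python
# Predictions copied from the output above, so this runs without the model.
predictions = [
    {"sequence": "Jen la komenco de bela la.", "score": 0.08968336135149002, "token": 264, "token_str": " la"},
    {"sequence": "Jen la komenco de bela de.", "score": 0.041614122688770294, "token": 274, "token_str": " de"},
    {"sequence": "Jen la komenco de bela,.", "score": 0.02514728158712387, "token": 16, "token_str": ","},
    {"sequence": "Jen la komenco de bela..", "score": 0.023026520386338234, "token": 18, "token_str": "."},
    {"sequence": "Jen la komenco de bela en.", "score": 0.021679924800992012, "token": 295, "token_str": " en"},
]


def summarize(preds):
    """Return (token, rounded score) pairs in the order the pipeline gave them."""
    return [(p["token_str"].strip(), round(p["score"], 3)) for p in preds]


top5_mass = sum(p["score"] for p in predictions)

print(summarize(predictions))
print(round(top5_mass, 3))  # 0.201
```

The five best candidates are function words and punctuation and together hold only about a fifth of the probability mass, which is consistent with a model whose training was interrupted shortly after step 500.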