Notebook is based on https://huggingface.co/blog/how-to-train

## 1. Find a dataset

## 2. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).
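
To make the "no more `<unk>` tokens" argument concrete, here is a minimal pure-Python sketch of the idea behind byte-level BPE. It is not how the `tokenizers` library is implemented; the helper names (`to_byte_symbols`, `most_frequent_pair`, `merge_pair`) and the two-sentence corpus are made up for illustration. The real tokenizer additionally maps each byte to a printable character, which is why a space shows up as `Ġ` in the merges file.

```python
from collections import Counter


def to_byte_symbols(text):
    """Split text into single-byte symbols; the base alphabet has only 256 entries."""
    return [bytes([b]) for b in text.encode("utf-8")]


def most_frequent_pair(sequences):
    """Return the adjacent symbol pair that occurs most often in the corpus."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


# Toy corpus (illustrative only)
corpus = ["la lando", "la lingvo"]
sequences = [to_byte_symbols(t) for t in corpus]

# One training step: find the most frequent pair and merge it everywhere
pair = most_frequent_pair(sequences)
sequences = [merge_pair(s, pair) for s in sequences]

# Text never seen during training still decomposes into known single bytes
unseen = to_byte_symbols("ĉŝ🙂")
```

Repeating the merge step until the vocabulary reaches the target size is the core of BPE training. Because every string falls back to single bytes, nothing is ever out of vocabulary.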


In [None]:
# # We won't need TensorFlow here
# !pip uninstall -y tensorflow
# # Install `transformers` from master
# !pip install git+https://github.com/huggingface/transformers
# !pip list | grep -E 'transformers|tokenizers'
# # transformers version at notebook update --- 2.11.0
# # tokenizers version at notebook update --- 0.8.0rc1

In [None]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

Now let's save files to disk

In [None]:
!mkdir EsperBergman
tokenizer.save_model("EsperBergman")

🔥🔥 Wow, that was fast! ⚡️🔥

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. accented characters used in Esperanto – `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ` – are encoded natively. We also represent sequences in a more efficient manner. Here on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.
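
A comparison like this can be measured with a few lines of code. The sketch below is an assumption about how one might compute it, not the procedure used for the figure quoted here: `average_length` and `relative_reduction` are hypothetical helpers, and the character-level and word-level "tokenizers" are stand-ins for two real tokenizers' encode functions.

```python
def average_length(encode, texts):
    """Mean number of tokens produced by `encode` over `texts`."""
    return sum(len(encode(t)) for t in texts) / len(texts)


def relative_reduction(baseline_encode, candidate_encode, texts):
    """Fraction by which the candidate shortens sequences versus the baseline."""
    base = average_length(baseline_encode, texts)
    cand = average_length(candidate_encode, texts)
    return (base - cand) / base


# Stand-in tokenizers (illustrative only): one token per character vs. per word
char_level = list
word_level = str.split

texts = ["Mi estas Julien.", "La suno brilas."]
reduction = relative_reduction(char_level, word_level, texts)
print(f"Sequences are {reduction:.0%} shorter")
```

With real tokenizers you would pass functions that return the list of token ids for a string, and run the comparison over a held-out sample of the corpus.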

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.


In [None]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./EsperBergman/vocab.json",
    "./EsperBergman/merges.txt",
)

In [None]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [None]:
tokenizer.encode("Mi estas Julien.")

In [None]:
tokenizer.encode("Mi estas Julien.").tokens
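
As a rough mental model of what the post-processor and truncation settings above do to a sequence of token ids, consider the following sketch. The function `wrap_with_special_tokens` is hypothetical, and it assumes that the length budget includes the two special tokens; the ids 0 and 2 are the `<s>` and `</s>` entries from the `vocab.json` excerpt shown earlier.

```python
def wrap_with_special_tokens(ids, bos_id, eos_id, max_length):
    """Truncate `ids` so that <s> + ids + </s> fits within `max_length` tokens."""
    budget = max_length - 2  # reserve room for the two special tokens
    return [bos_id] + list(ids[:budget]) + [eos_id]


# <s> has id 0 and </s> has id 2 in the vocabulary listed above
short = wrap_with_special_tokens([10, 11, 12], bos_id=0, eos_id=2, max_length=512)

# A 1000-token input is cut down to exactly 512 tokens, special tokens included
long = wrap_with_special_tokens(list(range(5, 1005)), bos_id=0, eos_id=2, max_length=512)
```

The `tokens` attribute printed in the cell above should show the same pattern: `<s>` first, `</s>` last, and the encoded text in between.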

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is a BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *Masked language modeling*, i.e. to predict how to fill arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.


In [None]:
# Check that we have a GPU
!nvidia-smi

In [1]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

In [2]:
import time

def is_gpu_free(th=1E+9):
    free, total = torch.cuda.mem_get_info()
    return total - free < th
    
while not(is_gpu_free()):
    time.sleep(60)

### We'll define the following config for the model

In [1]:
from bergman import BergmanConfig

# # Feb24_04-34-34_raven
# # Mar04_04-08-58_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_matrix_heads=24,
#     num_hidden_layers=4,
#     type_vocab_size=1,
# )

# # Mar08_00-33-28_raven
# # 1035de06e
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=4,
#     type_vocab_size=1,
#     hidden_size=768,
# #     position_embedding_type="none",
#     matrix_norm_alg="-1",
#     matrix_dim=16,
#     num_matrix_heads=24,
#     vector_init_direction="one",
#     use_for_context=["lr", "rl"],
#     networks_for_heads=None,
#     matrix_norm_loss_type=None,
# )

# # Mar08_18-18-03_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=4,
#     type_vocab_size=1,
#     hidden_size=768,
# #     position_embedding_type="none",
#     matrix_norm_alg="-1",
#     matrix_dim=16,
#     num_matrix_heads=24,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl"],
#     networks_for_heads=None,
#     matrix_norm_loss_type=None,
# )

# # Mar09_14-45-05_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=2,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg="-1",
#     matrix_dim=16,
#     num_matrix_heads=24,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl"],
#     networks_for_heads=None,
#     matrix_norm_loss_type=None,
# )

# # Mar10_00-11-30_raven
# # 1035de06e
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=2,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg="-1",
#     matrix_dim=16,
#     num_matrix_heads=16,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r"],
#     networks_for_heads=None,
#     matrix_norm_loss_type=None,
# )

# # Mar15_22-56-53_raven
# # fb6bf8afb
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=2,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg="-1",
#     matrix_dim=16,
#     num_matrix_heads=16,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r"],
#     networks_for_heads=None,
#     matrix_encoder_two_layers=False,
#     #
#     matrix_norm_loss_type="MSE",
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=False,
#     detach_norm_vectors=False,
# )

# # Mar17_14-38-04_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=2,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg=-1,
#     matrix_dim=8,
#     num_matrix_heads=16,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r"],
#     networks_for_heads=None,
#     matrix_encoder_two_layers=False,
#     #
#     matrix_norm_loss_type="MSE",
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=False,
#     detach_norm_vectors=False,
#     complex_matrix=True,
#     complex_matrix_abs=False,
# )

# # Mar20_15-00-34_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=2,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg=None,
#     matrix_dim=16,
#     num_matrix_heads=16,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r"],
#     networks_for_heads=None,
#     matrix_encoder_two_layers=False,
#     #
#     matrix_norm_loss_type="MSE",
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=True,
#     complex_matrix=True,
#     complex_matrix_abs=True,
# )

# # Mar22_01-26-24_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=2,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg=None,
#     matrix_dim=16,
#     num_matrix_heads=16,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r"],
#     networks_for_heads=None,
#     matrix_encoder_two_layers=False,
#     #
#     matrix_norm_loss_type="MSE",
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=True,
#     complex_matrix=False,
#     complex_matrix_abs=True,
# )

# # Mar22_16-53-29_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=2,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg=None,
#     matrix_dim=32,
#     num_matrix_heads=8,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r"],
#     networks_for_heads=None,
#     matrix_encoder_two_layers=False,
#     #
#     matrix_norm_loss_type="MSE",
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=True,
#     complex_matrix=False,
#     complex_matrix_abs=True,
# )

# # Mar23_23-41-20_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=2,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg=None,
#     matrix_dim=16,
#     num_matrix_heads=12,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r", "local_l"],
#     networks_for_heads=None,
#     matrix_encoder_two_layers=False,
#     #
#     matrix_norm_loss_type="MSE",
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=True,
#     complex_matrix=True,
#     complex_matrix_abs=True,
# )

# # Mar25_16-06-46_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=4,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg=None,
#     matrix_dim=4,
#     num_matrix_heads=8,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r"],
#     networks_for_heads="common",
#     matrix_encoder_two_layers=False,
#     #
#     matrix_norm_loss_type="MSE",
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=True,
#     complex_matrix=True,
#     complex_matrix_abs=True,
# )

# # Mar26_04-07-52_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=4,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg=None,
#     matrix_dim=4,
#     num_matrix_heads=8,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r"],
#     networks_for_heads="common",
#     matrix_encoder_two_layers=True,
#     #
#     matrix_norm_loss_type="MSE",
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=True,
#     complex_matrix=True,
#     complex_matrix_abs=True,
# )

# # Mar26_15-38-56_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=4,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg=None,
#     matrix_dim=4,
#     num_matrix_heads=16,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r"],
#     networks_for_heads="common",
#     matrix_encoder_two_layers=True,
#     #
#     matrix_norm_loss_type="MSE",
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=True,
#     complex_matrix=True,
#     complex_matrix_abs=True,
# )

# # Mar26_15-38-56_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=4,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg=None,
#     matrix_dim=8,
#     num_matrix_heads=16,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl", "local_r"],
#     networks_for_heads="common",
#     matrix_encoder_two_layers=True,
#     #
#     matrix_norm_loss_type="MSE",
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=True,
#     complex_matrix=True,
#     complex_matrix_abs=True,
# )

# # Mar27_21-14-28_raven
# config = BergmanConfig(
#     vocab_size=52_000,
#     max_position_embeddings=514,
#     num_hidden_layers=4,
#     type_vocab_size=1,
#     hidden_size=768,
#     position_embedding_type="none",
#     matrix_norm_alg=None,
#     matrix_dim=4,
#     num_matrix_heads=16,
#     vector_init_direction="one",
#     use_for_context=["lr_excl", "rl_excl"],
#     networks_for_heads="common",
#     matrix_encoder_two_layers=True,
#     #
#     matrix_norm_loss_type=None,
#     matrix_norm_loss_k=0.0,
#     matrix_unitary_loss=None,
#     matrix_unitary_loss_k = 0.0,
#     norm_vectors=True,
#     complex_matrix=True,
#     complex_matrix_abs=True,
# )

#
config = BergmanConfig(
    vocab_size=52_000,
    max_position_embeddings=512,
    num_hidden_layers=4,
    type_vocab_size=1,
    hidden_size=768,
    position_embedding_type="none",
    matrix_norm_alg=None,
    matrix_dim=4,
    num_matrix_heads=16,
    vector_init_direction="one",
    use_for_context=["lr_excl", "rl_excl"],
    networks_for_heads="common",
    matrix_encoder_two_layers=True,
    #
    matrix_norm_loss_type=None,
    matrix_norm_loss_k=0.0,
    matrix_unitary_loss=None,
    matrix_unitary_loss_k = 0.0,
    norm_vectors=True,
    complex_matrix=True,
    complex_matrix_abs=True,
)

Now let's re-create our tokenizer in transformers

In [2]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBergman", max_len=512)

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [3]:
from bergman import BergmanForMaskedLM

model = BergmanForMaskedLM(config=config)

In [4]:
model.num_parameters()
# => roughly 67 million parameters

66579744
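
For a sense of where these parameters sit, a quick back-of-the-envelope calculation helps. It assumes the model holds a standard token embedding table of shape `vocab_size × hidden_size`; the internals of the `bergman` package are not shown in this notebook, so treat this as an estimate rather than a breakdown of the actual model.

```python
vocab_size = 52_000
hidden_size = 768
total_params = 66_579_744  # value reported by model.num_parameters() above

# One hidden_size-dimensional vector per vocabulary entry
embedding_params = vocab_size * hidden_size
embedding_share = embedding_params / total_params

print(f"{embedding_params:,} embedding weights, {embedding_share:.0%} of the model")
```

Under that assumption, the embedding table alone accounts for about 40 million weights, roughly 60% of the total, which is typical for small models with a large vocabulary.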

In [5]:
# for name, param in model.named_parameters():
#     if param.requires_grad:
#         print(name, param.size(), param.numel())

In [6]:
# model = model.from_pretrained("EsperBergman_Mar29_00-28-04_raven")

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just load it with the `text` loader from the `datasets` library, which yields one example per line.

In [7]:
from datasets import load_dataset

In [8]:
import multiprocessing
num_proc = multiprocessing.cpu_count()

In [9]:
dataset = load_dataset(
    "text",
    data_files="./oscar.eo.txt",
    split="train"
)

Found cached dataset text (/home/eugene/.cache/huggingface/datasets/text/default-7c2209e5d9f08436/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)


In [10]:
dataset = dataset.map(
    lambda examples: tokenizer(examples["text"], return_special_tokens_mask=True),
    batched=True,
    remove_columns=dataset.column_names,
    num_proc=num_proc,
)

Loading cached processed dataset at /home/eugene/.cache/huggingface/datasets/text/default-7c2209e5d9f08436/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2/cache-f857d8432e88c817_*_of_00024.arrow


In [11]:
# The tokenizer accepts up to `tokenizer.model_max_length` tokens,
# but we train on shorter sequences of 128 tokens.
max_seq_length = 128

In [12]:
# from itertools import chain

# # Main data processing function that will concatenate all texts from our dataset and generate chunks of
# # max_seq_length.
# def group_texts(examples):
#     # Concatenate all texts.
#     concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
#     total_length = len(concatenated_examples[list(examples.keys())[0]])
#     # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
#     # customize this part to your needs.
#     if total_length >= max_seq_length:
#         total_length = (total_length // max_seq_length) * max_seq_length
#     # Split by chunks of max_len.
#     result = {
#         k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
#         for k, t in concatenated_examples.items()
#     }
#     return result

In [13]:
merge_texts = True

In [14]:
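def group_texts(examples):
    # Greedily pack consecutive tokenized texts into chunks of at most
    # max_seq_length tokens (no packing when merge_texts is False).
    result = {}
    for k, v in examples.items():
        acc = [[]]
        for text in v:
            if len(acc[-1]) + len(text) <= max_seq_length and merge_texts:
                acc[-1].extend(text)
            else:
                acc.append(text[: max_seq_length])
                if len(text) > max_seq_length:
                    # Keep the final element of the truncated text
                    acc[-1][-1] = text[-1]  # sep_token or corresponding mask
        # Drop the empty seed chunk that remains when nothing was merged into it
        result[k] = [chunk for chunk in acc if chunk]

    return result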
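
To see what this greedy packing produces, here is a standalone sketch of the same logic on a toy batch. The function name `pack_sequences` and the token ids are made up for illustration; unlike the notebook function, it takes the length limit as a parameter and never emits an empty chunk.

```python
def pack_sequences(sequences, max_seq_length, merge=True):
    """Greedily pack consecutive sequences into chunks of at most max_seq_length."""
    packed = []
    for seq in sequences:
        if merge and packed and len(packed[-1]) + len(seq) <= max_seq_length:
            # The sequence still fits into the current chunk
            packed[-1].extend(seq)
        else:
            # Start a new chunk, truncating over-long sequences
            chunk = list(seq[:max_seq_length])
            if len(seq) > max_seq_length:
                chunk[-1] = seq[-1]  # keep the closing token of the original
            packed.append(chunk)
    return packed


# Toy input ids: 0 stands for <s>, 2 for </s>
batch = [
    [0, 11, 2],
    [0, 12, 13, 2],
    [0, 14, 15, 16, 17, 18, 19, 20, 2],
    [0, 21, 2],
]
packed = pack_sequences(batch, max_seq_length=8)
unpacked = pack_sequences(batch, max_seq_length=8, merge=False)
```

The first two sequences share a chunk, the third is truncated to eight tokens while keeping its closing token, and the last one starts a new chunk because the previous chunk is already full.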

In [15]:
dataset = dataset.remove_columns(['attention_mask'])

In [16]:
dataset = dataset.map(
    group_texts,
    batched=True,
    num_proc=num_proc,
)

Loading cached processed dataset at /home/eugene/.cache/huggingface/datasets/text/default-7c2209e5d9f08436/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2/cache-1ac19afc5093aa8e_*_of_00024.arrow


In [17]:
dataset.set_format(type="torch", columns=["input_ids", 'special_tokens_mask'])

Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a data_collator.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.

In [18]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
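
For intuition about what the collator does with `mlm=True`, here is a simplified pure-Python sketch of BERT-style masking on a single sequence. It is not the library's code: `mask_tokens` is a hypothetical helper that works on plain lists, while the real collator works on padded tensor batches. It follows the usual recipe: selected positions become the labels, and of those roughly 80% are replaced by `<mask>`, 10% by a random token, and 10% are left unchanged.

```python
import random

IGNORE_INDEX = -100  # label value that the loss skips


def mask_tokens(input_ids, special_tokens_mask, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_input_ids, labels) for one sequence."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, (token, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        # Never mask special tokens; select other positions with mlm_probability
        if is_special or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)
        # otherwise keep the original token in the input
    return masked, labels


# Toy sequence: 0 is <s>, 2 is </s>, 4 is <mask> in the vocabulary above
ids = [0, 316, 394, 968, 1170, 570, 2]
special = [1, 0, 0, 0, 0, 0, 1]
masked, labels = mask_tokens(ids, special, mask_id=4, vocab_size=52_000,
                             rng=random.Random(0))
```

This is why the dataset keeps the `special_tokens_mask` column: it tells the collator which positions must never be selected for masking.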

### Finally, we are all set to initialize our Trainer

In [19]:
from transformers import Trainer, TrainingArguments

In [25]:
training_args = TrainingArguments(
    output_dir="./EsperBergman",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=45,  # replaces the deprecated per_gpu_train_batch_size
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    logging_steps=10,
    learning_rate=5E-5,
    st
)

In [26]:
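from typing import Dict

import torch
from transformers.modeling_utils import unwrap_model
from transformers.trainer import (
    MODEL_FOR_CAUSAL_LM_MAPPING_NAMES,
    is_torch_tpu_available,
)

# `xm` is only needed (and only importable) when running on a TPU
if is_torch_tpu_available():
    import torch_xla.core.xla_model as xm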


class BergmanTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        """
        How the loss is computed by Trainer. By default, all models return the loss in the first element.

        Subclass and override for custom behavior.
        """
        if self.label_smoother is not None and "labels" in inputs:
            labels = inputs.pop("labels")
        else:
            labels = None
        outputs = model(**inputs)
        # Save past state if it exists
        # TODO: this needs to be fixed and made cleaner later.
        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]

        metrics = outputs["metrics"] if isinstance(outputs, dict) else outputs[-1]
        self.metrics = {
            m: v if isinstance(v, float) else v.detach() for m, v in metrics.items()
        }

        if labels is not None:
            if (
                unwrap_model(model)._get_name()
                in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values()
            ):
                loss = self.label_smoother(outputs, labels, shift_labels=True)
            else:
                loss = self.label_smoother(outputs, labels)
        else:
            if isinstance(outputs, dict) and "loss" not in outputs:
                raise ValueError(
                    "The model did not return a loss from the inputs, only the following keys: "
                    f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
                )
            # We don't use .loss here since the model may return tuples instead of ModelOutput.
            loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]

        return (loss, outputs) if return_outputs else loss

    def _maybe_log_save_evaluate(
        self, tr_loss, model, trial, epoch, ignore_keys_for_eval
    ):
        if not hasattr(self, "metrics_acc"):
            self.metrics_acc: Dict[str, torch.Tensor] = {}

        for m, v in self.metrics.items():
            if v is None:
                continue
            if m not in self.metrics_acc:
                self.metrics_acc[m] = torch.tensor(0.0).to(model.device)
            self.metrics_acc[m] += v

        if self.control.should_log:
            if is_torch_tpu_available():
                xm.mark_step()

            metrics = {
                m: self._nested_gather(v).mean().item()
                for m, v in self.metrics_acc.items()
            }
            # reset counters
            self.metrics_acc = {}

            logs = {
                m: round(
                    v / (self.state.global_step - self._globalstep_last_logged),
                    4,
                )
                for m, v in metrics.items()
            }

            # all_gather + mean() to get average loss over all processes
            tr_loss_scalar = self._nested_gather(tr_loss).mean().item()

            # reset tr_loss to zero
            tr_loss -= tr_loss

            logs["loss"] = round(
                tr_loss_scalar
                / (self.state.global_step - self._globalstep_last_logged),
                4,
            )
            logs["learning_rate"] = self._get_learning_rate()

            self._total_loss_scalar += tr_loss_scalar
            self._globalstep_last_logged = self.state.global_step
            self.store_flos()

            self.log(logs)

        metrics = None
        if self.control.should_evaluate:
            if isinstance(self.eval_dataset, dict):
                for eval_dataset_name, eval_dataset in self.eval_dataset.items():
                    metrics = self.evaluate(
                        eval_dataset=eval_dataset,
                        ignore_keys=ignore_keys_for_eval,
                        metric_key_prefix=f"eval_{eval_dataset_name}",
                    )
            else:
                metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
            self._report_to_hp_search(trial, self.state.global_step, metrics)

        if self.control.should_save:
            self._save_checkpoint(model, trial, metrics=metrics)
            self.control = self.callback_handler.on_save(
                self.args, self.state, self.control
            )
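
The metric handling in `_maybe_log_save_evaluate` boils down to "sum per-step values, then report the mean over the steps since the last log". Here is a minimal sketch of that pattern, using plain floats instead of tensors and leaving out the gathering across processes. The class `MetricAverager` and the metric name `norm_loss` are hypothetical.

```python
class MetricAverager:
    """Accumulate per-step metrics and report their mean at logging time."""

    def __init__(self):
        self.totals = {}
        self.last_logged_step = 0

    def update(self, metrics):
        # Called once per training step; missing (None) metrics are skipped
        for name, value in metrics.items():
            if value is None:
                continue
            self.totals[name] = self.totals.get(name, 0.0) + value

    def flush(self, global_step):
        # Average over the steps since the last log, then reset the counters
        steps = global_step - self.last_logged_step
        logs = {name: round(total / steps, 4) for name, total in self.totals.items()}
        self.totals = {}
        self.last_logged_step = global_step
        return logs


averager = MetricAverager()
for step_metrics in ({"norm_loss": 0.5}, {"norm_loss": 0.25}, {"norm_loss": None}):
    averager.update(step_metrics)
logs = averager.flush(global_step=3)
```

Note that, as in the trainer code above, a step whose metric is `None` still counts toward the divisor, so the reported value is the sum divided by all steps in the logging window.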

In [27]:
trainer = BergmanTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

### Start training

In [None]:
%%time
# with torch.autograd.detect_anomaly(True):
trainer.train()

#### 🎉 Save final model (+ tokenizer + config) to disk

In [None]:
trainer.save_model("./EsperBergman")

## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.
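
Conceptually, the pipeline turns the model's scores at the masked position into probabilities and returns the best candidates. The sketch below illustrates that step with a three-word toy vocabulary and made-up logits; `toy_fill_mask` is a hypothetical function, not part of `transformers`, and it ignores details such as leading spaces in tokens.

```python
import math


def toy_fill_mask(logits, vocab, template, k=2, mask_token="<mask>"):
    """Rank vocabulary entries for the masked slot by softmax probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [
        {
            "score": probs[i],
            "token": i,
            "token_str": vocab[i],
            "sequence": template.replace(mask_token, vocab[i]),
        }
        for i in best
    ]


# Made-up vocabulary and logits for the masked position
vocab = ["estas", "estis", "brilas"]
logits = [2.0, 1.0, 0.5]
results = toy_fill_mask(logits, vocab, "La suno <mask>.")
```

Each returned dictionary has the same keys as the pipeline output shown in the cells that follow: the probability, the token id, the token string, and the filled-in sequence.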



In [51]:
model = model.to("cpu")

In [52]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)

In [53]:
# The sun <mask>.
# =>

fill_mask("La suno <mask>.")

[{'score': 0.06196916475892067,
  'token': 316,
  'token_str': ' estas',
  'sequence': 'La suno estas.'},
 {'score': 0.02071121521294117,
  'token': 394,
  'token_str': ' estis',
  'sequence': 'La suno estis.'},
 {'score': 0.01644286885857582,
  'token': 968,
  'token_str': ' diris',
  'sequence': 'La suno diris.'},
 {'score': 0.015109127387404442,
  'token': 1170,
  'token_str': ' okazis',
  'sequence': 'La suno okazis.'},
 {'score': 0.011404418386518955,
  'token': 570,
  'token_str': ' mem',
  'sequence': 'La suno mem.'}]

Ok, simple syntax/grammar works. Let’s try a slightly more interesting prompt:

In [54]:
fill_mask("Jen la komenco de bela <mask>.")

[{'score': 0.0218302384018898,
  'token': 709,
  'token_str': ' lingvo',
  'sequence': 'Jen la komenco de bela lingvo.'},
 {'score': 0.009496492333710194,
  'token': 1209,
  'token_str': ' lingvoj',
  'sequence': 'Jen la komenco de bela lingvoj.'},
 {'score': 0.00902404636144638,
  'token': 956,
  'token_str': ' mondo',
  'sequence': 'Jen la komenco de bela mondo.'},
 {'score': 0.00879774522036314,
  'token': 1239,
  'token_str': ' landoj',
  'sequence': 'Jen la komenco de bela landoj.'},
 {'score': 0.007448229473084211,
  'token': 1087,
  'token_str': ' lando',
  'sequence': 'Jen la komenco de bela lando.'}]

### Save graph

In [None]:
torch.onnx.export(model, torch.LongTensor([[0,0,0,0,0]]), 'Bergman.onnx')



## 5. Share your model 🎉

Finally, when you have a nice model, please think about sharing it with the community:

- upload your model using the CLI: `transformers-cli upload`
- write a README.md model card and add it to the repository under `model_cards/`. Your model card should ideally include:
    - a model description,
    - training params (dataset, preprocessing, hyperparameters), 
    - evaluation results,
    - intended uses & limitations
    - whatever else is helpful! 🤓

### **TADA!**

➡️ Your model has a page on http://huggingface.co/models and everyone can load it using `AutoModel.from_pretrained("username/model_name")`.

[![tb](https://huggingface.co/blog/assets/01_how-to-train/model_page.png)](https://huggingface.co/julien-c/EsperBERTo-small)


If you want to take a look at models in different languages, check https://huggingface.co/models

[![all models](https://huggingface.co/front/thumbnails/models.png)](https://huggingface.co/models)
