In [3]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# Versions at the original notebook update: transformers 2.11.0, tokenizers 0.8.0rc1
# (this run installed newer versions; see the `pip list` output below)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-vowy9435
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-vowy9435
  Resolved https://github.com/huggingface/transformers to commit 197e7ce911d91d85eb2f91858720957c2d979cd2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
tokenizers                    0.13.2
transformers                  4.27.0.dev0


In [4]:
from pathlib import Path
paths = [str(x) for x in Path(".").glob("**/*.txt")]
paths

['greek.txt']

In [5]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 11 s, sys: 427 ms, total: 11.4 s
Wall time: 10.2 s


Now let's save files to disk

In [6]:
!mkdir -p GreekBERTo
tokenizer.save_model("GreekBERTo")

['GreekBERTo/vocab.json', 'GreekBERTo/merges.txt']



We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```
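To make the role of `merges.txt` concrete, here is a minimal sketch of how a ranked merge list can be applied to a word at encoding time. It is an illustration only, not the `tokenizers` implementation: the function name `apply_merges` is hypothetical, and the merge list is just the first four entries of the excerpt above.

```python
G = "\u0120"  # 'Ġ', the byte-level stand-in for a leading space


def apply_merges(symbols, merges):
    """Greedily apply BPE merges, always picking the earliest-listed pair."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        # The adjacent pair with the lowest rank is merged first.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no remaining pair appears in the merge list
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# First four merges from the merges.txt excerpt, in file order.
merges = [("l", "a"), (G, "k"), ("o", "n"), (G, "la")]

print(apply_merges([G, "l", "a"], merges))       # a single merged token
print(apply_merges([G, "k", "o", "n"], merges))  # two tokens remain
```

Earlier lines in `merges.txt` take priority, which is why `l a` is merged before `Ġ la` can apply.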

What is great is that our tokenizer is optimized for the language of our corpus (Ancient Greek here; the wording of this section was written for Esperanto). Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics are encoded natively: for Esperanto these are the accented characters `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ`, and for Greek the polytonic accents and breathings. We also represent sequences more efficiently. On the Esperanto corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.


In [7]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./GreekBERTo/vocab.json",
    "./GreekBERTo/merges.txt",
)

In [8]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [9]:
tokenizer.encode("ὅσα δὴ δέδηγμαι τὴν ἐμαυτοῦ καρδίαν.")

Encoding(num_tokens=11, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [10]:
tokenizer.encode("ὅσα δὴ δέδηγμαι τὴν ἐμαυτοῦ καρδίαν.").tokens

['<s>',
 'á½ħÏĥÎ±',
 'ĠÎ´á½´',
 'ĠÎ´ÎŃ',
 'Î´Î·',
 'Î³Î¼Î±Î¹',
 'ĠÏĦá½´Î½',
 'Ġá¼ĲÎ¼Î±ÏħÏĦÎ¿á¿¦',
 'ĠÎºÎ±ÏģÎ´Î¯Î±Î½',
 '.',
 '</s>']
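The tokens above look garbled because a byte-level BPE tokenizer shows every UTF-8 byte as one printable character. The sketch below rebuilds the GPT-2-style byte-to-character table (which, to my understanding, is the table byte-level BPE in `tokenizers` follows) and uses it to turn a token back into Greek. The helper names are hypothetical and this is not the library's own code.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style)."""
    printable = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (space, control bytes, ...) are moved to 256 and above.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Text -> byte-level string, as seen in `.tokens`."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token):
    """Byte-level token string -> the original text."""
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8")


# Escaped form of the third token in the output above ('ĠÎ´á½´').
token = "\u0120\u00ce\u00b4\u00e1\u00bd\u00b4"
print(repr(from_byte_level(token)))  # the Greek word, with its leading space
```

A single token can also end in the middle of a multi-byte character, in which case decoding it alone fails; joining all tokens of a sequence before decoding avoids that.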

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is a BERT-like model with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *masked language modeling*, i.e. predicting how to fill arbitrary tokens that we randomly mask in the dataset. In this notebook, the masking is handled by `DataCollatorForLanguageModeling`.
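As a rough illustration of what "randomly mask" means, the sketch below applies the usual BERT-style recipe to a list of token ids: each non-special token is selected with probability 0.15, and a selected token is replaced by `<mask>` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function `mask_tokens` is a hypothetical pure-Python stand-in, not the library's collator; the ids 0 to 4 are the special tokens from `vocab.json`.

```python
import random


def mask_tokens(ids, mask_id=4, vocab_size=52_000, special_ids=(0, 1, 2, 3, 4),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels); labels are -100 wherever no prediction is required."""
    rng = rng or random.Random()
    inputs = list(ids)
    labels = [-100] * len(ids)
    for i, token_id in enumerate(ids):
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = token_id  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Toy sequence: <s>, forty arbitrary token ids, </s>.
ids = [0] + list(range(100, 140)) + [2]
inputs, labels = mask_tokens(ids, rng=random.Random(0))
print(sum(label != -100 for label in labels), "positions selected out of", len(ids))
```

The `-100` label is the value the PyTorch cross-entropy loss ignores by default, so only the selected positions contribute to the training loss.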


In [11]:
# Check that we have a GPU
!nvidia-smi

Fri Feb  3 14:18:30 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P0    24W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [12]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

### We'll define the following config for the model

In [13]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [14]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./GreekBERTo", max_len=512)

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [15]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [16]:
model.num_parameters()
# => 84 million parameters

83504416
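The count can be reproduced by hand from the config. The sketch below assumes the default `hidden_size=768` and `intermediate_size=3072` (both visible in the saved `config.json` printed later), an LM head whose decoder weights are tied to the word embeddings, and no pooler layer. Those assumptions are consistent with the number above, since the sum matches exactly.

```python
vocab_size, hidden, max_positions, type_vocab = 52_000, 768, 514, 1
layers, intermediate = 6, 3072


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and bias

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

# One encoder layer: Q, K, V and output projections, then the feed-forward block.
attention = 4 * linear(hidden, hidden) + layer_norm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)

# LM head: dense + LayerNorm + output bias (decoder weights assumed tied, so not counted again).
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + encoder + lm_head
print(f"{total:,}")  # 83,504,416
```

Almost half of the parameters (about 40M) sit in the embedding matrix, a direct consequence of the 52,000-token vocabulary.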

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineTextDataset` out-of-the-box.
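Conceptually, a line-by-line dataset treats every non-empty line of the file as one training example and cuts it off at `block_size` tokens. The sketch below shows that idea with a toy stand-in for the tokenizer; `line_by_line_examples` and `toy_encode` are hypothetical names, and the real class additionally handles special tokens and tensors.

```python
import os
import tempfile


def line_by_line_examples(file_path, encode, block_size):
    """One example per non-blank line, truncated to at most block_size ids."""
    with open(file_path, encoding="utf-8") as f:
        lines = [line for line in f.read().splitlines() if line and not line.isspace()]
    return [encode(line)[:block_size] for line in lines]


def toy_encode(line):
    """Stand-in tokenizer: one 'id' per whitespace-separated word (its length)."""
    return [len(word) for word in line.split()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("one two three four\n\n   \nfive six\n")
    examples = line_by_line_examples(path, toy_encode, block_size=3)

print(examples)  # [[3, 3, 5], [4, 3]]
```

One consequence of this design is that the number of examples equals the number of non-blank lines in the corpus, regardless of how long each line is.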

In [17]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./greek.txt",
    block_size=128,  # each line is truncated to at most 128 tokens
)



CPU times: user 3.15 s, sys: 184 ms, total: 3.33 s
Wall time: 3.39 s


In [18]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

In [24]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./GreekBERTo",
    overwrite_output_dir=True,
    num_train_epochs=30,
    per_gpu_train_batch_size=64,  # deprecated (see the warning below); newer code should use per_device_train_batch_size
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


### Start training

In [25]:
%%time
trainer.train()

***** Running training *****
  Num examples = 26808
  Num Epochs = 30
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 12570
  Number of trainable parameters = 83504416


Step,Training Loss
500,6.4285
1000,6.2505
1500,6.1155
2000,5.9634
2500,5.8704
3000,5.7897
3500,5.7761
4000,5.7553
4500,5.6395
5000,5.5373


Saving model checkpoint to ./GreekBERTo/checkpoint-10000
Configuration saved in ./GreekBERTo/checkpoint-10000/config.json
Model weights saved in ./GreekBERTo/checkpoint-10000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 42min 27s, sys: 10.5 s, total: 42min 37s
Wall time: 42min 39s


TrainOutput(global_step=12570, training_loss=5.330826895750982, metrics={'train_runtime': 2559.8108, 'train_samples_per_second': 314.179, 'train_steps_per_second': 4.911, 'total_flos': 3676931937369600.0, 'train_loss': 5.330826895750982, 'epoch': 30.0})
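The numbers in the training log are easy to sanity-check. The short calculation below takes the values reported above (26,808 examples, total batch size 64, 30 epochs, runtime of 2559.8108 s) and recomputes the step count and throughput, assuming one optimizer step per batch and a final partial batch in each epoch.

```python
import math

num_examples = 26_808
batch_size = 64
num_epochs = 30
train_runtime_s = 2559.8108

# The last batch of an epoch is smaller but still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(steps_per_epoch, total_steps)   # 419 12570
print(round(samples_per_second, 3))   # 314.179
print(round(steps_per_second, 3))     # 4.911
```

With `save_steps=10_000` and 12,570 steps in total, exactly one intermediate checkpoint (`checkpoint-10000`) is written, which matches the log.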

#### Save final model (+ tokenizer + config) to disk

In [26]:
trainer.save_model("./GreekBERTo")

Saving model checkpoint to ./GreekBERTo
Configuration saved in ./GreekBERTo/config.json
Model weights saved in ./GreekBERTo/pytorch_model.bin


## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.
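Under the hood, the scores are a softmax over the vocabulary at the masked position, and the pipeline keeps the highest-scoring candidates. The sketch below shows only that ranking step on made-up numbers: `fill_mask_top_k`, the eight-entry vocabulary, and the logits are hypothetical and unrelated to the trained model.

```python
import math


def fill_mask_top_k(mask_logits, id_to_token, k=5):
    """Softmax over the logits at the <mask> position, then keep the k best tokens."""
    peak = max(mask_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in mask_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [{"score": probs[i], "token": i, "token_str": id_to_token[i]} for i in best]


# Toy vocabulary and logits (illustrative values only).
toy_vocab = ["<s>", "<pad>", "</s>", "<unk>", "<mask>", "α", "β", "γ"]
toy_logits = [-2.0, -5.0, -2.0, -3.0, -4.0, 1.5, 0.5, 1.0]

for candidate in fill_mask_top_k(toy_logits, toy_vocab, k=3):
    print(candidate)
```

Because the softmax runs over the whole vocabulary (52,000 entries here), top scores of around 0.01 mean the model is spreading its probability over many candidates.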



In [27]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./GreekBERTo",
    tokenizer="./GreekBERTo"
)

loading configuration file ./GreekBERTo/config.json
Model config RobertaConfig {
  "_name_or_path": "./GreekBERTo",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.27.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}


Let’s try a prompt from the corpus with one word masked:



In [28]:
fill_mask("γαμεῖ δὲ Κελεὸς Φαιναρέτην τήθην <mask>, 50ἐξ ἧς Λυκῖνος ἐγένε᾽: ἐκ τούτου δ᾽ ἐγὼ ἀθάνατός εἰμ᾽: ἐμοὶ δ᾽ ἐπέτρεψαν οἱ θεοὶ σπονδὰς ποιεῖσθαι πρὸς Λακεδαιμονίους μόνῳ.")

# Missing --> ἐμήν
# =>

[{'score': 0.014425074681639671,
  'token': 352,
  'token_str': 'όν',
  'sequence': 'γαμεῖ δὲ Κελεὸς Φαιναρέτην τήθηνόν, 50ἐξ ἧς Λυκῖνος ἐγένε᾽: ἐκ τούτου δ᾽ ἐγὼ ἀθάνατός εἰμ᾽: ἐμοὶ δ᾽ ἐπέτρεψαν οἱ θεοὶ σπονδὰς ποιεῖσθαι πρὸς Λακεδαιμονίους μόνῳ.'},
 {'score': 0.009696695022284985,
  'token': 324,
  'token_str': 'οι',
  'sequence': 'γαμεῖ δὲ Κελεὸς Φαιναρέτην τήθηνοι, 50ἐξ ἧς Λυκῖνος ἐγένε᾽: ἐκ τούτου δ᾽ ἐγὼ ἀθάνατός εἰμ᾽: ἐμοὶ δ᾽ ἐπέτρεψαν οἱ θεοὶ σπονδὰς ποιεῖσθαι πρὸς Λακεδαιμονίους μόνῳ.'},
 {'score': 0.008785399608314037,
  'token': 328,
  'token_str': 'ος',
  'sequence': 'γαμεῖ δὲ Κελεὸς Φαιναρέτην τήθηνος, 50ἐξ ἧς Λυκῖνος ἐγένε᾽: ἐκ τούτου δ᾽ ἐγὼ ἀθάνατός εἰμ᾽: ἐμοὶ δ᾽ ἐπέτρεψαν οἱ θεοὶ σπονδὰς ποιεῖσθαι πρὸς Λακεδαιμονίους μόνῳ.'},
 {'score': 0.008395455777645111,
  'token': 521,
  'token_str': 'ήν',
  'sequence': 'γαμεῖ δὲ Κελεὸς Φαιναρέτην τήθηνήν, 50ἐξ ἧς Λυκῖνος ἐγένε᾽: ἐκ τούτου δ᾽ ἐγὼ ἀθάνατός εἰμ᾽: ἐμοὶ δ᾽ ἐπέτρεψαν οἱ θεοὶ σπονδὰς ποιεῖσθαι πρὸς Λακεδαιμονίους μόνῳ.'},
