<a href="https://colab.research.google.com/github/QazQazaq/transformers/blob/master/Copy_of_01_how_to_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title
%%html
<div style="background-color: pink;">
  Notebook written in collaboration with <a href="https://github.com/aditya-malte">Aditya Malte</a>.
  <br>
  The Notebook is on GitHub, so contributions are more than welcome.
</div>
<br>
<div style="background-color: yellow;">
  Aditya wrote another notebook with a slightly different use case and methodology, please check it out.
  <br>
  <a target="_blank" href="https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b">
    https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b
  </a>
</div>


# How to train a new language model from scratch using Transformers and Tokenizers

### Notebook edition (link to blogpost [link](https://huggingface.co/blog/how-to-train)). Last update May 15, 2020


Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.

In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) – that’s the same number of layers & heads as DistilBERT – on **Esperanto**. We’ll then fine-tune the model on a downstream task of part-of-speech tagging.


## 1. Find a dataset

First, let us find a corpus of text in Esperanto. Here we’ll use the Esperanto portion of the [OSCAR corpus](https://traces1.inria.fr/oscar/) from INRIA.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of [Common Crawl](https://commoncrawl.org/) dumps of the Web.

<img src="https://huggingface.co/blog/assets/01_how-to-train/oscar.png" style="margin: auto; display: block; width: 260px;">

The Esperanto portion of the dataset is only 299M, so we’ll concatenate with the Esperanto sub-corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), which is comprised of text from diverse sources like news, literature, and wikipedia.

The final training corpus has a size of 3 GB, which is still small – for your model, you will get better results the more data you can get to pretrain on. 



In [None]:
# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
!wget -c https://cdn-datasets.huggingface.co/oscar/

--2021-03-11 14:32:44--  https://cdn-datasets.huggingface.co/oscar/
Resolving cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)... 13.226.50.79, 13.226.50.50, 13.226.50.129, ...
Connecting to cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)|13.226.50.79|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-03-11 14:32:44 ERROR 404: Not Found.



In [None]:
!wget https://object.pouta.csc.fi/Tatoeba-Challenge/abk-eng.tar
import tarfile
my_tar = tarfile.open('abk-eng.tar')
my_tar.extractall()
import gzip

input = gzip.GzipFile("/content/data/abk-eng/train.src.gz", 'rb')
s = input.read()
input.close()
output = open("/content/data/abk-eng/train.src", 'wb')
output.write(s)
output.close()

print("done")
import gzip

input = gzip.GzipFile("/content/data/abk-eng/train.trg.gz", 'rb')
s = input.read()
input.close()
output = open("/content/data/abk-eng/train.trg", 'wb')
output.write(s)
output.close()

print("done")

--2021-03-11 16:10:14--  https://object.pouta.csc.fi/Tatoeba-Challenge/abk-eng.tar
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1792000 (1.7M) [application/x-tar]
Saving to: ‘abk-eng.tar’


2021-03-11 16:10:18 (1.72 MB/s) - ‘abk-eng.tar’ saved [1792000/1792000]

done
done


## 2. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).


In [37]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!piamp list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

transformers-cli login



SyntaxError: ignored

In [None]:
%%time 
#from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

#paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files='/content/data/abk-eng/train.src', vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 6.08 s, sys: 1.02 s, total: 7.1 s
Wall time: 2.31 s


Now let's save files to disk

In [None]:
!mkdir EsperBERTo
tokenizer.save_model("EsperBERTo")

['EsperBERTo/vocab.json', 'EsperBERTo/merges.txt']

🔥🔥 Wow, that was fast! ⚡️🔥

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. accented characters used in Esperanto – `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ` – are encoded natively. We also represent sequences in a more efficient manner. Here on this corpus, the average length of encoded sequences is ~30% smaller as when using the pretrained GPT-2 tokenizer.

Here’s  how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.


In [None]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)

In [None]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [None]:
tokenizer.encode("Mi estas Julien.")

Encoding(num_tokens=13, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [None]:
tokenizer.encode("Mi estas Julien.").tokens

['<s>', 'M', 'i', 'Ġ', 'est', 'as', 'Ġ', 'J', 'ul', 'i', 'en', '.', '</s>']

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is a BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *Masked language modeling*, i.e. the predict how to fill arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.


In [None]:
# Check that we have a GPU
!nvidia-smi

Thu Mar 11 16:11:05 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

### We'll define the following config for the model

In [None]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [None]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [None]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [None]:
model.num_parameters()
# => 84 million parameters

83504416

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineDataset` out-of-the-box.

In [None]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/content/data/abk-eng/train.src",
    block_size=128,
)



CPU times: user 2.99 s, sys: 532 ms, total: 3.52 s
Wall time: 1.79 s


Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a data_collator.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

### Finally, we are all set to initialize our Trainer

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    #prediction_loss_only=True
)

### Start training

In [None]:
%%time
trainer.train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


Step,Training Loss


CPU times: user 1min 6s, sys: 1min 43s, total: 2min 50s
Wall time: 2min 50s


TrainOutput(global_step=393, training_loss=8.391918386211833, metrics={'train_runtime': 170.191, 'train_samples_per_second': 2.309, 'total_flos': 602296977530496.0, 'epoch': 1.0, 'init_mem_cpu_alloc_delta': 2573358, 'init_mem_gpu_alloc_delta': 334180352, 'init_mem_cpu_peaked_delta': 18306, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 586754, 'train_mem_gpu_alloc_delta': 1010782208, 'train_mem_cpu_peaked_delta': 1281085, 'train_mem_gpu_peaked_delta': 10426161152})

#### 🎉 Save final model (+ tokenizer + config) to disk

In [None]:
trainer.save_model("./EsperBERTo")

In [None]:
from torch.utils.data import Dataset

class EsperantoDataset(Dataset):
    def __init__(self, evaluate: bool = False):
        tokenizer = ByteLevelBPETokenizer(
            "./models/EsperBERTo-small/vocab.json",
            "./models/EsperBERTo-small/merges.txt",
        )
        tokenizer._tokenizer.post_processor = BertProcessing(
            ("</s>", tokenizer.token_to_id("</s>")),
            ("<s>", tokenizer.token_to_id("<s>")),
        )
        tokenizer.enable_truncation(max_length=512)
        # or use the RobertaTokenizer from `transformers` directly.

        self.examples = []

        src_files = Path("./data/").glob("*-eval.txt") if evaluate else Path("./data/").glob("*-train.txt")
        for src_file in src_files:
            print("🔥", src_file)
            lines = src_file.read_text(encoding="utf-8").splitlines()
            self.examples += [x.ids for x in tokenizer.encode_batch(lines)]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We’ll pad at the batch level.
        return torch.tensor(self.examples[i])

## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.



In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

Some weights of RobertaModel were not initialized from the model checkpoint at ./EsperBERTo and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
# The sun <mask>.
# =>

fill_mask("Иҟалап Давид игәҭакы ахьизынамыгӡаз <mask>.")

[{'score': 0.04219769313931465,
  'sequence': 'Иҟалап Давид игәҭакы ахьизынамыгӡаз,.',
  'token': 16,
  'token_str': ','},
 {'score': 0.01041566114872694,
  'sequence': 'Иҟалап Давид игәҭакы ахьизынамыгӡаз уи.',
  'token': 375,
  'token_str': ' уи'},
 {'score': 0.009070862084627151,
  'sequence': 'Иҟалап Давид игәҭакы ахьизынамыгӡаз Иегова.',
  'token': 370,
  'token_str': ' Иегова'},
 {'score': 0.008338535204529762,
  'sequence': 'Иҟалап Давид игәҭакы ахьизынамыгӡаз насгьы.',
  'token': 391,
  'token_str': ' насгьы'},
 {'score': 0.005623785313218832,
  'sequence': 'Иҟалап Давид игәҭакы ахьизынамыгӡаз ауаа.',
  'token': 459,
  'token_str': ' ауаа'}]

Ok, simple syntax/grammar works. Let’s try a slightly more interesting prompt:



In [31]:
fill_mask("Иҟалап Давид  <mask>.")

# This is the beginning of a beautiful <mask>.
# =>

[{'score': 0.036667462438344955,
  'sequence': 'Иҟалап Давид,.',
  'token': 16,
  'token_str': ','},
 {'score': 0.020548192784190178,
  'sequence': 'Иҟалап Давид  Иегова.',
  'token': 370,
  'token_str': ' Иегова'},
 {'score': 0.013288303278386593,
  'sequence': 'Иҟалап Давид  уи.',
  'token': 375,
  'token_str': ' уи'},
 {'score': 0.009317377582192421,
  'sequence': 'Иҟалап Давид  Анцәа.',
  'token': 397,
  'token_str': ' Анцәа'},
 {'score': 0.008055318146944046,
  'sequence': 'Иҟалап Давид :.',
  'token': 30,
  'token_str': ':'}]

In [40]:
pip install git-lfs

Collecting git-lfs
  Downloading https://files.pythonhosted.org/packages/eb/40/cd243be7ba9bd9d83fd8515b1aed7e0b822c76cab4bf60398a1b7e024f00/git_lfs-1.6-py2.py3-none-any.whl
Installing collected packages: git-lfs
Successfully installed git-lfs-1.6


In [43]:



!git clone https://huggingface.co/QA/Ab



git: 'lfs' is not a git command. See 'git --help'.

The most similar command is
	log
fatal: destination path 'Ab' already exists and is not an empty directory.


In [47]:
trainer.save_model("https://huggingface.co/QA/Ab/blob/main")
tokenizer.save_pretrained("https://huggingface.co/QA/Ab/blob/main")

('https://huggingface.co/QA/Ab/blob/main/tokenizer_config.json',
 'https://huggingface.co/QA/Ab/blob/main/special_tokens_map.json',
 'https://huggingface.co/QA/Ab/blob/main/vocab.json',
 'https://huggingface.co/QA/Ab/blob/main/merges.txt',
 'https://huggingface.co/QA/Ab/blob/main/added_tokens.json')

In [48]:
sudo apt-get install git-lfs

SyntaxError: ignored

In [53]:
!git clone https://QA:Bug718is@huggingface.co/QA/Ab

fatal: destination path 'Ab' already exists and is not an empty directory.


In [54]:
cd Ab

/content/Ab


In [55]:
!sudo apt-get install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 29 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 1s (1,572 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package git-lfs.
(Reading database ... 160975 files and directories c

In [None]:
!transformers-cli login
!transformers-cli repo create Abkho

In [None]:
!git clone https://QA:Bug718is@huggingface.co/QA/Abkho
!cd Abkho
!git config --global user.email "QA:Bug718is@huggingface.co"
# Tip: using the same email than for your huggingface.co account will link your commits to your profile
!git config --global user.name "QA"
!git add .
!git commit -m "Initial commit"
!git push

In [87]:
!git add .
!git commit -m "Initial commit"
!git push


On branch main
Your branch is ahead of 'origin/main' by 4 commits.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
fatal: could not read Username for 'https://huggingface.co': No such device or address


In [88]:
model.save_pretrained("https://huggingface.co/QA/Abkho")
tokenizer.save_pretrained("https://huggingface.co/QA/Abkho")

('https://huggingface.co/QA/Abkha/tokenizer_config.json',
 'https://huggingface.co/QA/Abkha/special_tokens_map.json',
 'https://huggingface.co/QA/Abkha/vocab.json',
 'https://huggingface.co/QA/Abkha/merges.txt',
 'https://huggingface.co/QA/Abkha/added_tokens.json')

In [89]:
!git add --all
!git status


On branch main
Your branch is ahead of 'origin/main' by 4 commits.
  (use "git push" to publish your local commits)

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	[32mnew file:   https:/huggingface.co/QA/Abkha/config.json[m
	[32mnew file:   https:/huggingface.co/QA/Abkha/merges.txt[m
	[32mnew file:   https:/huggingface.co/QA/Abkha/pytorch_model.bin[m
	[32mnew file:   https:/huggingface.co/QA/Abkha/special_tokens_map.json[m
	[32mnew file:   https:/huggingface.co/QA/Abkha/tokenizer_config.json[m
	[32mnew file:   https:/huggingface.co/QA/Abkha/vocab.json[m



In [80]:
!transformers-cli upload

[1m[31mDeprecated: used to be the way to upload a model to S3. We now use a git-based system for storing models and other artifacts. Use the `repo create` command instead.[0m


## 5. Share your model 🎉

Finally, when you have a nice model, please think about sharing it with the community:

- upload your model using the CLI: `transformers-cli upload`
- write a README.md model card and add it to the repository under `model_cards/`. Your model card should ideally include:
    - a model description,
    - training params (dataset, preprocessing, hyperparameters), 
    - evaluation results,
    - intended uses & limitations
    - whatever else is helpful! 🤓

### **TADA!**

➡️ Your model has a page on http://huggingface.co/models and everyone can load it using `AutoModel.from_pretrained("username/model_name")`.

[![tb](https://huggingface.co/blog/assets/01_how-to-train/model_page.png)](https://huggingface.co/julien-c/EsperBERTo-small)


If you want to take a look at models in different languages, check https://huggingface.co/models

[![all models](https://huggingface.co/front/thumbnails/models.png)](https://huggingface.co/models)
