# Training a tokenizer
Train a SentencePiece-style BPE tokenizer from scratch using Hugging Face's tokenizers package.

## Load and concatenate the datasets
Load the data needed to train the tokenizer. In our case, we use English, German, French, and Spanish sentences from the cc100 dataset, streamed so that the full corpus is never downloaded up front.

In [None]:
from datasets import load_dataset, concatenate_datasets

In [None]:
dataset_en = load_dataset("cc100", lang="en", split="train",
                          cache_dir="/disk1/a.ristori/cc100", verification_mode="no_checks", streaming=True)
dataset_de = load_dataset("cc100", lang="de", split="train",
                          cache_dir="/disk1/a.ristori/cc100", verification_mode="no_checks", streaming=True)
dataset_fr = load_dataset("cc100", lang="fr", split="train",
                          cache_dir="/disk1/a.ristori/cc100", verification_mode="no_checks", streaming=True)
dataset_es = load_dataset("cc100", lang="es", split="train",
                          cache_dir="/disk1/a.ristori/cc100", verification_mode="no_checks", streaming=True)

Take the same number of sentences from each language so that no single language dominates the shared vocabulary.

In [None]:
num_samples = 1000000
dataset_en = dataset_en.take(num_samples)
dataset_de = dataset_de.take(num_samples)
dataset_fr = dataset_fr.take(num_samples)
dataset_es = dataset_es.take(num_samples)

Finally, concatenate the four per-language datasets into a single one.

In [None]:
dataset = concatenate_datasets([dataset_en, dataset_de, dataset_fr, dataset_es])
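
To picture what the streaming take and concatenate steps do, here is a rough analogy using plain Python iterators. It is only a sketch: `fake_stream` and its sample sentences are made up for illustration, and the datasets library implements these operations differently under the hood.

```python
from itertools import chain, islice

def fake_stream(lang):
    """Hypothetical stand-in for a streaming dataset: an endless generator of examples."""
    i = 0
    while True:
        yield {"text": f"{lang} sentence {i}"}
        i += 1

num_samples = 3  # tiny value for illustration; the notebook uses 1,000,000

# islice plays the role of .take(): it lazily keeps only the first num_samples items
streams = [islice(fake_stream(lang), num_samples) for lang in ("en", "de", "fr", "es")]

# chain plays the role of concatenate_datasets(): one stream after the other
combined = list(chain(*streams))

print(len(combined))         # 4 languages x 3 samples each
print(combined[0]["text"])   # first English example
print(combined[-1]["text"])  # last Spanish example
```

In this analogy nothing is read until the combined stream is iterated, which is why capping each language before concatenating costs nothing up front.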

## Train the tokenizer
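We will build a single tokenizer whose vocabulary of 32,000 tokens is shared across all four languages. Training reads the text through a generator that yields batches of sentences.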
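
As a rough illustration of what the vocabulary size controls, the toy sketch below learns BPE merges on a tiny corpus: each merge adds one new token, so a larger target vocabulary simply means more merges. This is a simplified teaching version, not the tokenizers library's implementation; `learn_bpe`, `WORD_START`, and the sample corpus are hypothetical names and data chosen for this example.

```python
from collections import Counter

WORD_START = "\u2581"  # the word-boundary marker used by SentencePiece-style tokenizers

def learn_bpe(corpus, num_merges):
    """Toy BPE trainer: return up to num_merges merges learned from a list of sentences."""
    # Represent every word as a tuple of symbols, starting from single characters.
    words = Counter()
    for sentence in corpus:
        for word in sentence.split():
            words[tuple(WORD_START + word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol

        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Rewrite every word, replacing each occurrence of the best pair with one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

merges = learn_bpe(["low low low lower"], num_merges=3)
for left, right in merges:
    print(left, "+", right)
```

With a shared vocabulary, merges are learned from the pooled text of all languages, so frequent character sequences in any language can earn their own token.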

In [None]:
from tokenizers.implementations import SentencePieceBPETokenizer
from tokenizers.processors import ByteLevel

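def batch_iterator(batch_size):
    """Yield lists of up to batch_size raw sentences from the concatenated dataset."""
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []

    if batch:  # yield the last, possibly smaller, batch
        yield batch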
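
To check the batching logic without touching the real corpus, the same generator can be exercised on a handful of in-memory examples. This sketch assumes a hypothetical variant, `batch_iterator_from`, that receives the examples as an argument instead of reading the global `dataset`; the toy sentences are invented.

```python
def batch_iterator_from(examples, batch_size):
    """Hypothetical variant of batch_iterator that takes the examples as an argument."""
    batch = []
    for example in examples:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []

    if batch:  # yield the last, possibly smaller, batch
        yield batch

# Five toy examples shaped like cc100 records (a dict with a "text" field)
toy_examples = [{"text": f"sentence {i}"} for i in range(5)]

batches = list(batch_iterator_from(toy_examples, batch_size=2))
print([len(b) for b in batches])  # sizes of the yielded batches
print(batches[-1])                # the leftover batch
```

Passing the data in as a parameter also makes the generator reusable for other datasets, at the cost of a slightly longer call.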

Define the special tokens and the vocabulary size, then instantiate the tokenizer.

In [None]:
special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<length>", "<mask>"]
vocab_size = 32000
sentencepiece_tokenizer = SentencePieceBPETokenizer()
sentencepiece_tokenizer.post_processor = ByteLevel()

Train the tokenizer (this will take a while depending on the number of sentences in your dataset) and save its configuration.

In [None]:
sentencepiece_tokenizer.train_from_iterator(batch_iterator(1000), vocab_size=vocab_size, special_tokens=special_tokens)
# Integer division keeps the file name free of a decimal point ("32k", not "32.0k")
sentencepiece_tokenizer.save(f"sentencepiece_config_{vocab_size // 1000}k.json")

## Use your tokenizer
Load your tokenizer and use it with Hugging Face's transformers library.

In [None]:
from transformers import MBartTokenizerFast
# A tokenizer trained with the tokenizers package can only be loaded by the fast tokenizer classes in transformers.
# tokenizer_file must point to the JSON configuration saved in the previous step.
tokenizer = MBartTokenizerFast(tokenizer_file="tokenizers/sp_32k.json", cls_token="<length>",
                               src_lang="en_XX", tgt_lang="de_DE")