# Build/Train a New Tokenizer

* [HuggingFace video: build](https://www.youtube.com/watch?v=MR8tZm5ViWU)
* [HuggingFace video: train](https://www.youtube.com/watch?v=DJimQynXZsQ)
* [HuggingFace course](https://huggingface.co/learn/nlp-course/chapter6/8)

Tokenisation Steps: 

1. Normalisation
2. Pre-tokenisation
3. Model
4. Post-processing
5. Decoding

Build-your-own Steps:

1. Gather a corpus
2. Create a `backend_tokenizer` with HF (steps 1-4 in tokenisation)
3. Load `backend_tokenizer` in a HF transformers tokenizer

Why train your own:

A pretrained tokenizer is a poor fit when the corpus it was trained on differs from your own text, for example:

* Different language
* New characters (e.g. accents)
* New domain
* New style (e.g. from another century)

# Imports

In [None]:
from datasets import load_dataset
from datasets.dataset_dict import DatasetDict
from tokenizers import (decoders,
                        models, 
                        normalizers, 
                        pre_tokenizers, 
                        processors,
                        Regex,
                        trainers, 
                        Tokenizer, 
)
from transformers import AutoTokenizer, BertTokenizerFast, PreTrainedTokenizerFast

# Tokenizer Failure Modes, Examples 

From: [Hugging Face, Train Your Own](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/videos/train_new_tokenizer.ipynb)

In [None]:
tokenizer = BertTokenizerFast.from_pretrained(
  'huggingface-course/bert-base-uncased-tokenizer-without-normalizer'
)
text = "here is a sentence adapted to our tokenizer"
print(text)
print(tokenizer.tokenize(text))
print("The base BERT tokenizer does well, on 'simple' English", end="\n\n")


text = "এই বাক্যটি আমাদের টোকেনাইজারের উপযুক্ত নয়"
print(text)
print(tokenizer.tokenize(text))
print("It struggles on Bengali: splitting one word into _many_ tokens, or it does not recognize words at all; [UNK]", end="\n\n")


text = "this tokenizer does not know àccënts and CAPITAL LETTERS"
print(text)
print(tokenizer.tokenize(text))
print("It loses accents and capitals, which is problematic for other languages and proper nouns", end="\n\n")

text = "the medical vocabulary is divided into many sub-tokens: paracetamol, pharyngitis"
print(text)
print(tokenizer.tokenize(text))
print("It is missing critical vocabulary for domain-specific usage")

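* Excessive splitting is problematic: models have a limited sequence length, so more tokens per word means less text fits in one input; a word broken into many pieces also loses the meaning it could carry as a single token
* `[UNK]` discards all information about the original token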
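
One way to put numbers on these two failure modes is to count tokens per whitespace-separated word and the share of `[UNK]` tokens. The sketch below is a minimal, library-free illustration: `fragmentation_stats` is a hypothetical helper, and the token lists are hand-written stand-ins for what `tokenizer.tokenize(text)` might return, not real tokenizer output.

```python
def fragmentation_stats(text: str, tokens: list[str], unk_token: str = "[UNK]") -> dict[str, float]:
    """Return tokens per whitespace-separated word and the fraction of unknown tokens."""
    n_words = len(text.split())
    n_tokens = len(tokens)
    n_unk = sum(1 for tok in tokens if tok == unk_token)
    return {
        "tokens_per_word": n_tokens / n_words if n_words else 0.0,
        "unk_rate": n_unk / n_tokens if n_tokens else 0.0,
    }


# Hand-written token lists (assumed, for illustration only).
well_covered = fragmentation_stats("here is a sentence", ["here", "is", "a", "sentence"])
fragmented = fragmentation_stats("paracetamol helps", ["para", "##ce", "##tam", "##ol", "[UNK]"])

print(well_covered)
print(fragmented)
```

A tokens-per-word value close to 1 and an unknown rate close to 0 suggest the vocabulary suits the text; higher values point to the problems listed above.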

# Dataset

In [None]:
shakespeare = load_dataset("tiny_shakespeare")
train_split = shakespeare["train"]
test_split = shakespeare["test"]

## 1. Build the `backend_tokenizer`

In [None]:
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

tokenizer.normalizer = normalizers.Sequence([
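    normalizers.Replace(Regex(r"[\p{Other}&&[^\n\t\r]]"), ""),    # remove control and other invisible characters, keeping \n, \t and \r
    normalizers.Replace(Regex(r"[\s]"), " "),     # map every whitespace character to a plain space
    normalizers.Lowercase(),
    normalizers.NFD(),     # NFD Unicode normalizer: decompose characters, e.g. "é" -> "e" + combining accent
    normalizers.StripAccents() # remove the combining accents produced by NFD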
])

# what's the normalizer doing? 
print("Normalizer: Héllò hôw are ü? --> ", tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

# chain two pre-tokenizers
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),   # split on whitespace only (unlike Whitespace(), which also splits on punctuation)
    pre_tokenizers.Punctuation(),       # isolate each punctuation character
])


# what's the pre-tokenizer doing?
print("Pre-tokenizer: Let's test my pre-tokenizer. -->", tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer."))


special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

trainer = trainers.WordPieceTrainer(vocab_size=10000, 
                                    special_tokens=special_tokens,
                                    continuing_subword_prefix="##",    # any string works, but avoid one that would occur naturally in the corpus
)
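
To see why `NFD` has to come before `StripAccents`, and what the two chained pre-tokenizers do together, here is a rough standard-library approximation of the first two stages. It is a sketch of the idea, not the `tokenizers` implementation: `toy_normalize` and `toy_pre_tokenize` are hypothetical names, the control-character cleanup is omitted, and the regex only approximates `WhitespaceSplit` followed by `Punctuation` (for instance, it treats `_` as a word character).

```python
import re
import unicodedata


def toy_normalize(text: str) -> str:
    """Approximate the normalizer: whitespace cleanup, lowercase, NFD, strip accents."""
    text = re.sub(r"\s", " ", text)            # every whitespace character -> plain space
    text = text.lower()
    text = unicodedata.normalize("NFD", text)  # "é" -> "e" + combining acute accent
    # Drop combining marks (category Mn); this only works because NFD split them off.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def toy_pre_tokenize(text: str) -> list[str]:
    """Approximate the pre-tokenizer: split on whitespace, isolate punctuation."""
    # Either a run of word characters, or one non-space, non-word character.
    return re.findall(r"\w+|[^\w\s]", text)


normalized = toy_normalize("Héllò hôw are ü?")
pieces = toy_pre_tokenize("Let's test my pre-tokenizer.")
print(normalized)
print(pieces)
```

Without the decomposition step, an accented letter is a single code point and there is no separate accent to remove.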

In [None]:
from datasets import Dataset

def get_training_corpus(dataset: Dataset):
    """Yield the texts of a single split in batches of 1,000 rows."""
    for i in range(0, len(dataset), 1000):
        yield dataset[i: i+1000]["text"]

tokenizer.train_from_iterator(get_training_corpus(train_split), trainer=trainer)
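
Once a vocabulary exists, a BERT-style WordPiece model segments each pre-token by greedy longest-match-first lookup, marking every non-initial piece with the continuation prefix and falling back to the unknown token when no segmentation is possible. The sketch below illustrates that lookup only; `toy_wordpiece` and `TOY_VOCAB` are hypothetical, and the hand-written vocabulary stands in for whatever training actually produced.

```python
def toy_wordpiece(word: str, vocab: set[str], prefix: str = "##", unk_token: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation of one pre-token."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # one unmatched piece makes the whole word unknown
        pieces.append(match)
        start = end
    return pieces


# Hand-written vocabulary (assumed, for illustration only).
TOY_VOCAB = {"sleep", "dream", "per", "##s", "##ing", "##chance"}

for word in ["dreams", "sleeping", "perchance", "coil"]:
    print(word, "->", toy_wordpiece(word, TOY_VOCAB))
```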

Next comes post-processing: we have to specify how to treat a single sentence and a pair of sentences. For both, we write out the special tokens we want to use; the first (or single) sentence is represented by `$A`, and the second sentence (if encoding a pair) by `$B`. For each item (special token or sentence), we also give the corresponding token type ID after a colon.

In [None]:
# retrieve ids of special tokens, needed for post-processing of sequences
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

# set beginning of each sequence to have CLS; SEP to end of each sequence and before a new sequence
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
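
As a library-free illustration of what this template yields, the sketch below assembles the tokens and token type IDs for a single sequence and for a pair. `apply_bert_template` is a hypothetical helper written for this note, and its inputs are hand-written token lists rather than real tokenizer output.

```python
from typing import Optional


def apply_bert_template(a: list[str], b: Optional[list[str]] = None) -> tuple[list[str], list[int]]:
    """Mimic "[CLS]:0 $A:0 [SEP]:0" and "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1"."""
    tokens = ["[CLS]", *a, "[SEP]"]
    type_ids = [0] * len(tokens)            # everything up to the first [SEP] is type 0
    if b is not None:
        tokens += [*b, "[SEP]"]
        type_ids += [1] * (len(b) + 1)      # second sentence and its [SEP] are type 1
    return tokens, type_ids


single_tokens, single_types = apply_bert_template(["to", "sleep"])
pair_tokens, pair_types = apply_bert_template(["to", "sleep"], ["to", "dream"])
print(single_tokens, single_types)
print(pair_tokens, pair_types)
```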

Finally, define the decoder, which strips the `##` continuation prefix and merges subword pieces back into words. Note that normalisation is never undone: decoding cannot restore accents or capital letters.

In [None]:
tokenizer.decoder = decoders.WordPiece(prefix="##")
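
The core of what this decoder does can be sketched in a few lines: pieces that start with the prefix are glued onto the previous word, everything else starts a new word. `toy_wordpiece_decode` is a hypothetical stand-in, not the library's code; as far as I know the real decoder also tidies spaces around some punctuation by default, which this sketch does not attempt.

```python
def toy_wordpiece_decode(tokens: list[str], prefix: str = "##") -> str:
    """Merge continuation pieces into the preceding word and join words with spaces."""
    words: list[str] = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            words[-1] += tok[len(prefix):]   # glue "##chance" onto "per"
        else:
            words.append(tok)
    return " ".join(words)


decoded = toy_wordpiece_decode(["per", "##chance", "to", "dream", ":"])
print(decoded)

# Normalisation is not reversed: tokens produced from "Héllò" decode to plain "hello".
print(toy_wordpiece_decode(["hel", "##lo"]))
```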

In [None]:
tokenizer.save("../data/06_models/bert-remake-tokenizer.json")

## 2. Load it into a fast tokenizer from the Transformers library

Two options:

1. `PreTrainedTokenizerFast`
2. `BertTokenizerFast`

In [None]:
wrapped_tokenizer = BertTokenizerFast(
    #tokenizer_object=tokenizer,
    tokenizer_file="../data/06_models/bert-remake-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
    )
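
The cell above uses the second option. For the first, the generic `PreTrainedTokenizerFast` has no built-in notion of which tokens are special, so they are passed explicitly in the same way. The sketch below is self-contained and makes two assumptions purely so that it runs without the saved file: it trains a throwaway WordPiece tokenizer on two made-up sentences, and it wraps that object in memory through `tokenizer_object` instead of reading `tokenizer_file`. All `toy_*` names are hypothetical.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

toy_special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# Throwaway backend tokenizer trained on two made-up sentences (illustration only).
toy_backend = Tokenizer(models.WordPiece(unk_token="[UNK]"))
toy_backend.pre_tokenizer = pre_tokenizers.Whitespace()
toy_trainer = trainers.WordPieceTrainer(
    vocab_size=100,
    special_tokens=toy_special_tokens,
    show_progress=False,
)
toy_backend.train_from_iterator(
    ["to be or not to be", "that is the question"],
    trainer=toy_trainer,
)

# Wrap the in-memory backend; the generic class must be told every special token.
generic_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=toy_backend,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

print(generic_tokenizer.tokenize("to be"))
print(generic_tokenizer.pad_token, generic_tokenizer.pad_token_id)
```

With the tokenizer saved earlier in this notebook, passing `tokenizer_file=` with the saved JSON path in place of `tokenizer_object=` should give the same kind of wrapper.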

In [None]:
# adjacent string literals are concatenated, so each line ends with a space
s = ("To die, - To sleep, - To sleep! "
     "Perchance to dream: - ay, there's the rub; "
     "For in that sleep of death what dreams may come, "
     "When we have shuffled off this mortal coil, "
     "Must give us pause: there's the respect "
     "That makes calamity of so long life;")

encoded = wrapped_tokenizer.encode(s, padding=True, add_special_tokens=True)
print(encoded)
# calling the tokenizer directly replaces the deprecated encode_plus()
encoded_plus = wrapped_tokenizer(s, padding=False, add_special_tokens=True)

# the attention mask is all ones here: a single, unpadded sequence
# print(encoded_plus['attention_mask'])

for a in encoded_plus['input_ids']:
    print(wrapped_tokenizer.convert_ids_to_tokens(a), end=" ")

# Train AutoTokenizer

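Every fast HuggingFace tokenizer (one backed by the `tokenizers` library) provides `train_new_from_iterator()`.

This keeps a known pipeline (e.g. BERT's WordPiece, with its normaliser, pre-tokeniser and special tokens) and learns a new vocabulary from a new corpus.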

In [None]:
auto_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [None]:
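# train_new_from_iterator() returns a new tokenizer rather than modifying the
# original in place, so keep the result
auto_tokenizer = auto_tokenizer.train_new_from_iterator(
    get_training_corpus(train_split),
    vocab_size=25000,
    new_special_tokens=None,
    special_tokens_map=None,
)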

In [None]:
auto_tokenizer.save_pretrained("../data/06_models/auto_tokenizer_retrained")