<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Tokenization
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Pretrained Tokenizers
  </div> 


  <div style="
      font-size: 15px; 
      line-height: 12px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Jean-baptiste AUJOGUE
  </div> 

  
  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="plan"></a>

***
<div style="font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Table of Contents
  </div> 


#### References

- Hugging Face [tutorial notebook](https://github.com/huggingface/notebooks/blob/main/examples/tokenizer_training.ipynb) on training tokenizers

<div style="font-weight: normal; 
      font-size: 30px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Packages
  </div> 

In [1]:
import zipfile

In [1]:
import os
import re
import unicodedata
import inspect
import multiprocessing

# data
import nltk
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import LineByLineTextDataset, DataCollatorForLanguageModeling

# tokenizers
from tokenizers.pre_tokenizers import Whitespace
# base tokenizers
from tokenizers import Tokenizer
from tokenizers.models import (
    BPE,
    Unigram,
    WordLevel,
    WordPiece,
) 
# base tokenizer trainers
from tokenizers.trainers import (
    BpeTrainer,
    UnigramTrainer,
    WordLevelTrainer,
    WordPieceTrainer,
)
# other special tokenizers, see the list at 
# https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations
from tokenizers.implementations import (
    CharBPETokenizer,          # The original BPE
    ByteLevelBPETokenizer,     # The byte level version of the BPE
    SentencePieceBPETokenizer, # A BPE implementation compatible with the one used by SentencePiece
    BertWordPieceTokenizer,    # The famous Bert tokenizer, using WordPiece
) 
# model-specific tokenizers
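from transformers import GPT2TokenizerFast, RobertaTokenizerFast, DebertaV2Tokenizer
# generic wrapper used to load a custom tokenizer file for model training
from transformers import PreTrainedTokenizerFast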

In [2]:
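path_to_repo = os.path.dirname(os.getcwd())
path_to_data = os.path.join(path_to_repo, 'data')  # assumed location of the training text files; adjust to your layout
path_to_save = os.path.join(path_to_repo, 'saves', '20_news_group')

path_to_roberta = os.path.join(path_to_save, 'roberta')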

<a id="corpus"></a>

# 1 Corpus

[Back to top](#plan)

### $\bullet$ 20 Newsgroups

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
newsgroups_trn = fetch_20newsgroups(subset = 'train', remove = ('headers', 'footers', 'quotes'))
newsgroups_tst = fetch_20newsgroups(subset = 'test',  remove = ('headers', 'footers', 'quotes'))

In [None]:
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

def normalizeString(s):
    # lowercase, trim and strip accents
    return unicodeToAscii(s.lower().strip())

def cleanNum(s):
    # isolate sentence-ending punctuation and commas as separate tokens
    s = re.sub(r'[.!?]+ ', ' . ', s)
    s = re.sub(r',+ ', ' , ', s)
    # replace numbers by placeholder tokens (decimals first, then integers)
    s = re.sub(r' [0-9]*\.[0-9]+ ', ' FLOAT ', ' ' + s + ' ').strip()
    s = re.sub(r' [0-9,]*[0-9] ', ' INT ', ' ' + s + ' ').strip()
    return s

def trueWord(w):
    # keep tokens containing at least one alphanumeric character, period or comma
    return len(w) > 0 and re.sub(r'[^a-zA-Z0-9.,]', '', w) != ''

def prepareCorpus(corpus):
    corpus = [normalizeString(s) for s in corpus]
    corpus = [cleanNum(s) for s in corpus]
    corpus = [nltk.tokenize.word_tokenize(s) for s in corpus]  # requires the NLTK 'punkt' data
    corpus = [[w for w in s if trueWord(w)] for s in corpus]
    corpus = [s for s in corpus if len(s) < 1000]  # drop very long texts, which accumulate most of the junk words
    return corpus

In [None]:
corpus = prepareCorpus(newsgroups_trn.data)
len(corpus)
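As a quick check of the accent-stripping step performed by `unicodeToAscii`, the sketch below applies the same NFD decomposition to a sample string. The name `strip_accents` and the sample text are illustrative only and are not part of the notebook.

```python
import unicodedata

def strip_accents(s):
    # NFD splits each accented character into a base character followed by
    # combining marks (Unicode category 'Mn'); the marks are then dropped
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

sample = "  Café déjà vu, señor  "
print(strip_accents(sample.lower().strip()))
```

Characters that have no decomposition into a base letter plus combining marks are left untouched by this step.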

<a id="tokenizers"></a>

# 2 Tokenizers

[Back to top](#plan)

In [22]:
# # use the pretrained DeBERTa-v2 tokenizer (SentencePiece-based)
# tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v2-xlarge")

### $\bullet$ GPT-2 tokenizer

The GPT-2 tokenizer is a byte-level BPE tokenizer. Its special tokens are defined by the [GPT2TokenizerFast](https://huggingface.co/transformers/_modules/transformers/models/gpt2/tokenization_gpt2_fast.html#GPT2TokenizerFast) class.
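To make "byte-level" concrete, here is a minimal sketch of the byte-to-character mapping, reimplemented from the scheme used in the GPT-2 reference code as I understand it: printable bytes keep their own character, and every other byte is shifted to an unused code point from 256 onwards. The helper name `to_visible` is hypothetical.

```python
def bytes_to_unicode():
    # bytes that already have a printable, non-space character keep it
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # remaining bytes (control characters, space, ...) get code points 256, 257, ...
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

byte_map = bytes_to_unicode()

def to_visible(text):
    # encode to UTF-8, then render each byte with its visible stand-in
    return ''.join(byte_map[b] for b in text.encode('utf-8'))

print(to_visible(" you"))
```

Under this mapping the space byte becomes 'Ġ', which is why tokens that start a new word carry that prefix in a byte-level BPE vocabulary.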

In [None]:
path_to_vocab_gpt2 = os.path.join(path_to_save, 'gpt2', 'vocabulary')

In [96]:
# instantiate, fit and export GPT2 tokenizer
tokenizer_gpt2 = ByteLevelBPETokenizer()
tokenizer_gpt2.train(
    files = os.path.join(path_to_data, "news_auto_texts_trn.txt"), 
    vocab_size = 5000,
    special_tokens = ['<|endoftext|>'],
)
tokenizer_gpt2.save_model(path_to_vocab_gpt2)

['C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\saves\\news_auto\\gpt2\\vocabulary\\vocab.json',
 'C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\saves\\news_auto\\gpt2\\vocabulary\\merges.txt']

In [97]:
# re-import the tokenizer into a class better suited to model training
tokenizer_gpt2 = GPT2TokenizerFast.from_pretrained(path_to_vocab_gpt2, max_len = 512)

In [98]:
# "begin of word" is encoded by the special 'Ġ' character, see
# https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475
txt = "Hello, y'all! How are you ?"
output = tokenizer_gpt2(txt)
tokens = output.tokens()  # `tokens` is a method: call it to get the token strings

print(txt)
print(output)

Hello, y'all! How are you ?
{'input_ids': [40, 457, 79, 12, 569, 7, 499, 1, 2604, 419, 817, 221, 31], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


### $\bullet$ RoBERTa tokenizer

The RoBERTa tokenizer reuses the byte-level BPE of the GPT-2 tokenizer with a different set of special tokens, which is defined by the [RobertaTokenizerFast](https://huggingface.co/transformers/_modules/transformers/models/roberta/tokenization_roberta_fast.html#RobertaTokenizerFast) class.

In [12]:
path_to_vocab_roberta = os.path.join(path_to_roberta, 'vocabulary')

In [20]:
# instantiate, fit and export Roberta tokenizer
vocab_size = 5000
special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]

tokenizer_roberta = ByteLevelBPETokenizer()
tokenizer_roberta.train(
    files = os.path.join(path_to_data, "news_auto_texts_trn.txt"), 
    vocab_size = vocab_size,
    special_tokens = special_tokens,
)
tokenizer_roberta.save_model(path_to_vocab_roberta)

['C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\data\\vocab\\news_auto_roberta\\vocab.json',
 'C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\data\\vocab\\news_auto_roberta\\merges.txt']

In [13]:
# re-import the tokenizer into a class better suited to model training
tokenizer_roberta = RobertaTokenizerFast.from_pretrained(path_to_vocab_roberta, max_len = 512)

In [89]:
tokenizer_roberta.save_pretrained(os.path.join(path_to_roberta, "tokenizer"))

('C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\saves\\Tokenizer\\news_auto_roberta\\tokenizer_config.json',
 'C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\saves\\Tokenizer\\news_auto_roberta\\special_tokens_map.json',
 'C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\saves\\Tokenizer\\news_auto_roberta\\vocab.json',
 'C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\saves\\Tokenizer\\news_auto_roberta\\merges.txt',
 'C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\saves\\Tokenizer\\news_auto_roberta\\added_tokens.json')

In [14]:
txt = "Hello, y'all! How are you ?"
output = tokenizer_roberta(txt)
tokens = output.tokens()  # `tokens` is a method: call it to get the token strings

print(txt)
print(output)

Hello, y'all! How are you ?
{'input_ids': [0, 44, 461, 83, 16, 573, 11, 503, 5, 2608, 423, 821, 225, 35, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
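The ids recorded for the GPT-2 and RoBERTa tokenizers on the same sentence can be related to each other. The sketch below assumes that the trainer assigns the first ids to the special tokens in the order they were listed, which is consistent with the recorded outputs: RoBERTa has four more special tokens than GPT-2, so every regular id is shifted by 4, and the sequence is wrapped in `<s>` and `</s>`.

```python
# ids copied from the two recorded outputs
gpt2_ids = [40, 457, 79, 12, 569, 7, 499, 1, 2604, 419, 817, 221, 31]
roberta_ids = [0, 44, 461, 83, 16, 573, 11, 503, 5, 2608, 423, 821, 225, 35, 2]

# special tokens as passed to the two training calls
gpt2_specials = ['<|endoftext|>']
roberta_specials = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]

# assumption: special tokens occupy ids 0, 1, 2, ... in listing order
shift = len(roberta_specials) - len(gpt2_specials)
bos_id = roberta_specials.index("<s>")
eos_id = roberta_specials.index("</s>")

rebuilt = [bos_id] + [i + shift for i in gpt2_ids] + [eos_id]
print(rebuilt == roberta_ids)
```

The match suggests that, on this sentence, both tokenizers apply the same merges and differ only in their special tokens.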


### $\bullet$ DeBERTa-v2 tokenizer

The DeBERTa-v2 tokenizer is based on _SentencePiece_. Its special tokens are defined by the [DebertaV2Tokenizer](https://huggingface.co/transformers/_modules/transformers/models/deberta_v2/tokenization_deberta_v2.html#DebertaV2Tokenizer) class.
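SentencePiece-style tokenizers treat whitespace as an ordinary symbol, conventionally written '▁' (U+2581), so that the original text can be restored from the pieces. The sketch below illustrates only that whitespace convention, not the subword model itself; the function names are hypothetical.

```python
MARKER = "\u2581"  # the '▁' symbol used to mark a preceding space

def mark_whitespace(text, marker=MARKER):
    # prepend a marker and replace every space, so word boundaries survive splitting
    return marker + text.replace(" ", marker)

def split_on_marker(marked, marker=MARKER):
    # start a new piece at every marker
    pieces = []
    for ch in marked:
        if ch == marker or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces

def restore(pieces, marker=MARKER):
    # concatenate, turn markers back into spaces, drop the leading one
    return "".join(pieces).replace(marker, " ")[1:]

pieces = split_on_marker(mark_whitespace("How are you ?"))
print(pieces)
print(restore(pieces))
```

A trained model would split these pieces further into subwords, but the round trip through `restore` works the same way.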

In [99]:
path_to_vocab_deberta_v2 = os.path.join(path_to_save, 'deberta_v2', 'vocabulary')

# instantiate, fit and export Deberta-v2 tokenizer
vocab_size = 5000
special_tokens = ["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]

tokenizer_debertav2 = SentencePieceBPETokenizer()
tokenizer_debertav2.train(
    files = os.path.join(path_to_data, "news_auto_texts_trn.txt"), 
    vocab_size = vocab_size,
    special_tokens = special_tokens,
)
tokenizer_debertav2.save_model(path_to_vocab_deberta_v2)

['C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\saves\\news_auto\\deberta_v2\\vocabulary\\vocab.json',
 'C:\\Users\\Jb\\Desktop\\NLP\\BERTology\\saves\\news_auto\\deberta_v2\\vocabulary\\merges.txt']

In [14]:
# # re-import the tokenizer into a class better suited to model training
# # not working, presumably because DebertaV2Tokenizer expects a SentencePiece model file rather than vocab.json / merges.txt
# tokenizer_debertav2 = DebertaV2Tokenizer.from_pretrained(path_to_vocab_deberta_v2, max_len = 512)
# # not working either
# tokenizer_debertav2 = DebertaV2Tokenizer(vocab_file = os.path.join(path_to_vocab_deberta_v2, 'vocab.json'))

### $\bullet$ Custom tokenizers

In [288]:
def fit_tokenizer(Model, Trainer, vocab_size, special_tokens):
    # BPE, WordLevel and WordPiece accept an `unk_token` argument; Unigram does not
    model = (
        Model(unk_token = "[UNK]")
        if 'unk_token' in inspect.signature(Model).parameters
        else Model()
    )
    # instantiate tokenizer, and fit using external trainer
    tokenizer = Tokenizer(model)
    tokenizer.pre_tokenizer = Whitespace()
    trainer = Trainer(vocab_size = vocab_size, special_tokens = special_tokens)
    tokenizer.train(
        files = [os.path.join(path_to_data, "news_auto_texts_trn.txt")], 
        trainer = trainer,
    )
    return tokenizer

In [294]:
vocab_size = 5000
special_tokens = ["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]

bpe_tokenizer = fit_tokenizer(BPE, BpeTrainer, vocab_size, special_tokens)
wdl_tokenizer = fit_tokenizer(WordLevel, WordLevelTrainer, vocab_size, special_tokens)
wdp_tokenizer = fit_tokenizer(WordPiece, WordPieceTrainer, vocab_size, special_tokens)
uni_tokenizer = fit_tokenizer(Unigram, UnigramTrainer, vocab_size, special_tokens)
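The four models fitted above build their vocabularies in different ways. As a rough illustration of the BPE case only, here is a toy merge-learning loop in plain Python. It is a sketch on a hand-made word list, not what `BpeTrainer` does internally, and the tie-breaking rule is an arbitrary choice made for determinism.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # each word starts as a tuple of single characters, weighted by its frequency
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs over the whole corpus
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # pick the most frequent pair (ties broken by the pair itself)
        best = max(pairs, key=lambda p: (pairs[p], p))
        # rewrite every word with the chosen pair fused into one symbol
        merged = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
        merges.append(best)
    return merges, vocab

toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, segmented = learn_bpe(toy_words, num_merges=3)
print(merges)
print(dict(segmented))
```

Each merge adds one symbol to the vocabulary, so the number of merges plays the role of `vocab_size` in the trainer.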

In [295]:
path_to_bpe = os.path.join(path_to_data, 'vocab', 'news_auto_bpe.json')

# save
bpe_tokenizer.save(path_to_bpe, pretty = True)

# load as base Tokenizer
bpe_tokenizer = Tokenizer.from_file(path_to_bpe)

# load as tokenizer suitable for model training
tokenizer_bpe = PreTrainedTokenizerFast(tokenizer_file = path_to_bpe)

In [299]:
txt = "Hello, y'all! How are you 😁? wanna work at Toyota?"
output = tokenizer_bpe(txt)
tokens = output.tokens()  # `tokens` is a method: call it to get the token strings

print(txt)
print(output)

Hello, y'all! How are you 😁? wanna work at Toyota?
{'input_ids': [44, 1034, 1067, 16, 93, 11, 1121, 5, 2512, 1092, 1458, 0, 35, 91, 1287, 69, 1223, 1013, 4194, 35], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
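The recorded ids contain a single 0. Assuming, as before, that special tokens receive the first ids in listing order, 0 is `[UNK]`, and its position is consistent with the emoji, which presumably never occurs in the training corpus. A minimal sketch to locate unknown tokens in an encoded sequence:

```python
# ids copied from the recorded output
input_ids = [44, 1034, 1067, 16, 93, 11, 1121, 5, 2512, 1092, 1458, 0, 35,
             91, 1287, 69, 1223, 1013, 4194, 35]
special_tokens = ["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]

# assumption: special tokens occupy ids 0, 1, 2, ... in listing order
unk_id = special_tokens.index("[UNK]")
unk_positions = [i for i, token_id in enumerate(input_ids) if token_id == unk_id]
unk_rate = len(unk_positions) / len(input_ids)

print(unk_positions, unk_rate)
```

Tracking this rate over a held-out set is a simple way to compare how well each custom tokenizer covers the corpus.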
