### 3 subword algorithms help to improve your NLP model performance
- Byte Pair Encoding (BPE)
- WordPiece
- Unigram Language Model
- SentencePiece  

Subword tokenization balances vocabulary size and footprint. At the extreme, we could use just 26 tokens (i.e. characters) to represent every English word. A vocabulary of 16k or 32k subwords is the recommended size for good results.

Words in many Asian languages are not separated by spaces, so the initial vocabulary is much larger than for English. You may need to prepare over 10k initial words to kick-start the word segmentation. In their research, Schuster and Nakajima propose using 22k and 11k words for Japanese and Korean respectively.  
https://medium.com/@makcedward/how-subword-helps-on-your-nlp-model-83dd1b836f46

For subword tokenization, the BPE implementation is slightly modified: frequently occurring subword pairs are merged into a new subword, rather than being replaced by another byte as in the original compression algorithm. As a result, a rare word such as athazagoraphobia is split into more frequent subwords such as ['▁ath', 'az', 'agor', 'aphobia'].
https://towardsdatascience.com/byte-pair-encoding-the-dark-horse-of-modern-nlp-eb36c7df4f10
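
The merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration only, not the code used by any library: the function name `learn_bpe_merges`, the tiny word list, and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # The most frequent pair wins; ties are broken alphabetically (an
        # assumption of this sketch) so the result is deterministic.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into a single symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, segmented = learn_bpe_merges(
    ["low", "low", "lower", "lowest", "newest", "newest"], num_merges=3
)
print(merges)
print(dict(segmented))
```

Each learned merge becomes a new vocabulary entry, so the number of merges (plus the initial alphabet) determines the final vocabulary size.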

### Tokenizers: How machines read - 28 JANUARY 2020
Recommended read on tokenization  

- **BPE**: Uses only the frequency of occurrences to pick the best pair to merge at every iteration, until it reaches the predefined vocabulary size.
- **WordPiece**: Similar to BPE in that it uses frequency of occurrences to identify potential merges, but it makes the final decision based on the likelihood of the merged token.
- **Unigram**: A fully probabilistic model which does not use frequency of occurrences. Instead, it trains a language model (LM), removes the token whose removal hurts the overall likelihood the least, and repeats until it reaches the final token limit.
- **SentencePiece** basically tries to bring all the subword tokenization tools and techniques under one banner. _" SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]. "_ (BPE and Unigram are reimplemented with improvements).
    - __All other models assume input is already tokenized__: BPE and Unigram are great models, but they share one big disadvantage: they both need their input to be already tokenized. BPE needs the input tokenized so that every character (including word-boundary characters) is a separate token. Only then can BPE count frequencies and start to merge tokens. Usually this is done by simple word-level tokenization but, as we discussed earlier, this is a problem since not all languages are space segmented. Similarly, the unigram model needs its input tokenized before it can start discarding tokens based on their probability distribution. SentencePiece deals with this by taking raw text as input and then doing everything needed on that input (which we will discuss below) to perform subword tokenization.
    - __Encode everything as unicode ...__: SentencePiece first converts all the input into unicode characters. This means it doesn’t have to worry about different languages or characters or symbols. If it uses unicode it can just treat all input in the same way, which allows it to be language agnostic
    - __… including the spaces__: To get around the word segmenting issues, SentencePiece simply encodes spaces as a unicode symbol. Specifically it encodes a space as unicode value U+2581 (“▁”, LOWER ONE EIGHTH BLOCK, which looks like an underscore but is a different character from ‘_’). This helps with the language agnostic issues and the decoding issue. Since spaces are unicode encoded, they can be easily reversed or decoded and treated (i.e. learned) like a normal language character. It sounds like a simple approach and I guess it is, but the best ideas tend to seem that way in the end.


https://blog.floydhub.com/tokenization-nlp/
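
The space-handling trick can be illustrated with a small round trip. This is a minimal sketch and not SentencePiece's actual code: the helper names `escape_spaces` and `detokenize` are made up for the example, and the sketch assumes a prefix space is always added before escaping and removed again on decoding.

```python
META = "\u2581"  # '▁' (LOWER ONE EIGHTH BLOCK), the symbol that stands in for a space


def escape_spaces(text):
    """Prepend one space, then replace every space with the meta symbol."""
    return (" " + text).replace(" ", META)


def detokenize(pieces):
    """Concatenate subword pieces and turn the meta symbol back into spaces."""
    text = "".join(pieces).replace(META, " ")
    # Drop the single prefix space added by escape_spaces.
    return text[1:] if text.startswith(" ") else text


escaped = escape_spaces("Hello world")
print(escaped)
print(detokenize(["\u2581Hello", "\u2581wor", "ld"]))
```

Because whitespace is now an ordinary symbol, decoding is plain string concatenation followed by one replacement, with no language-specific rules.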

### __Huggingface `tokenizers`__ : 
Provided Tokenizers
- CharBPETokenizer: The original BPE
- ByteLevelBPETokenizer: The byte level version of the BPE
- SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
- BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece  
 
We designed the library so that it provides all the required blocks to create end-to-end tokenizers in an interchangeable way. In that sense, we provide
these various components: 

- **Normalizer**: Executes all the initial transformations over the initial input string. For example when you need to
lowercase some text, maybe strip it, or even apply one of the common unicode normalization processes, you will add a Normalizer. 
- **PreTokenizer**: In charge of splitting the initial input string. That's the component that decides where and how to
pre-segment the original string. The simplest example would be like we saw before, to simply split on spaces.
- **Model**: Handles all the sub-token discovery and generation; this part is trainable and really dependent
 on your input data.
- **Post-Processor**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA
models. For instance, for BERT it would wrap the tokenized sentence around [CLS] and [SEP] tokens.
- **Decoder**: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according
to the `PreTokenizer` we used previously.
- **Trainer**: Provides training capabilities to each model. 
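
To make the first two stages concrete, here is a standard-library analogue of a Normalizer followed by a PreTokenizer. It is not the `tokenizers` API: the function names `normalize` and `pre_tokenize` are hypothetical, and the choice of NFKC plus lower-casing and whitespace splitting is just one plausible configuration.

```python
import re
import unicodedata


def normalize(text):
    """Normalizer stage: Unicode NFKC normalization followed by lower-casing."""
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text):
    """PreTokenizer stage: split on whitespace, keeping (piece, (start, end)) offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


# Full-width letters are folded to ASCII by NFKC before splitting.
pieces = pre_tokenize(normalize("Ｈｅｌｌｏ  World"))
print(pieces)
```

The offsets point into the normalized string, which is what lets later stages map tokens back to positions in the text.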

Notebook for Tokenizers: https://github.com/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb  
Github Link for Python Binding: https://github.com/huggingface/tokenizers/tree/master/bindings/python

Implementation: https://github.com/huggingface/tokenizers/tree/master/bindings/python/tokenizers/implementations


# Experiment Training on CharBPETokenizer with raw Wiki Text

In [1]:
from tokenizers import CharBPETokenizer

In [2]:
# Initialize a tokenizer
tokenizer = CharBPETokenizer(suffix='', lowercase=True, bert_normalizer=False)

In [3]:
# Then train it!
tokenizer.train(files=[ "../data/text/AA/wiki_01"], vocab_size=20000, min_frequency=2)

In [6]:
encoded = tokenizer.encode(u"เป็นตัวการ์ตูนใน ลูนีย์ทูนส์ และเดอะ ลูนี่ตูนส์ โชว์")
print(encoded.ids)
print(encoded.tokens)

[13996, 5928, 360, 9239, 2153, 3108, 109, 4434, 13625]
['เป็นตัวการ์ตูนใน</w>', 'ลูนีย์ทูนส์</w>', 'และ', 'เดอะ</w>', 'ลู', 'นี่', 'ต', 'ูนส์</w>', 'โชว์</w>']


# Using `SentencePieceBPETokenizer` with raw Wiki Text
`SentencePieceBPETokenizer` is chosen because it provides an end-to-end solution: unicode preprocessing of raw text plus all the enhancements that come with the BPE method. See the benefits in the blog post [here](https://blog.floydhub.com/tokenization-nlp/), the official GitHub repository [here](https://github.com/google/sentencepiece), and the [Python README](https://github.com/google/sentencepiece/blob/master/python/README.md).

### Extracting Thai Wiki Dump
From WannaPhong: https://python3.wannaphong.com/2018/06/wikipedia-dump-text.html  
Link to Wikimedia for `thwiki`: https://dumps.wikimedia.org/thwiki/  
WikiExtractor: https://github.com/attardi/wikiextractor

1. I downloaded `thwiki-20200601-pages-articles.xml`
2. ```$ python WikiExtractor.py thwiki-20200601-pages-articles.xml``` (the compressed `.bz2` dump can also be extracted directly)
3. _**//TODO**_ : Need to strip the XML tags that WikiExtractor leaves around each article:   
```<doc id="" revid="" url="" title="">   
...
</doc>```   
Option #1: Maybe write a new txt file in which a regex removes all the `<doc>` tags  
Option #2: Maybe do it in `pre_tokenizer`? See Issue [#269](https://github.com/huggingface/tokenizers/issues/269) and the custom pre-tokenizer example https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/custom_pre_tokenizer.py . For now, I'm just going to tokenize the whole raw text with the top-level `<doc>` tags left in
4. _**//TODO**_ : What should we do about English text? Filter it all out? If not, do we need to increase the vocab size, and also add lowercasing (normally a Normalizer step rather than a pre_tokenizer one)?
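
Option #1 could look roughly like the sketch below. It is an untested-on-the-full-dump illustration: the helper name `strip_doc_tags` is made up, the regex assumes attribute values never contain a `>` character, and dropping the blank lines left behind is a choice of this sketch rather than a requirement.

```python
import re

# Matches the opening <doc ...> tag (with any attributes) and the closing </doc> tag.
DOC_TAG = re.compile(r"</?doc\b[^>]*>")


def strip_doc_tags(text):
    """Remove <doc ...> and </doc> markers, keeping the article text in between."""
    cleaned = DOC_TAG.sub("", text)
    # Drop lines that are empty or whitespace-only after the removal.
    return "\n".join(line for line in cleaned.splitlines() if line.strip())


sample = (
    '<doc id="774" url="https://th.wikipedia.org/wiki?curid=774" title="บักส์ บันนี">\n'
    "บักส์ บันนี\n"
    "\n"
    "เป็นตัวการ์ตูน\n"
    "</doc>\n"
)
print(strip_doc_tags(sample))
```

Applied to each extracted file before training, this would keep tokens such as `▁<doc` and `▁</doc>` out of the learned vocabulary.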

In [9]:
import os
from pathlib import Path

DATA_PATH = Path('../data/text')

WIKI_FILES = []
# Walk DATA_PATH recursively and collect the path of every extracted file.
for path, subdirs, files in os.walk(DATA_PATH):
    for name in files:
        WIKI_FILES.append(os.path.join(path, name))
WIKI_FILES[:10]

['../data/text/AF/wiki_04',
 '../data/text/AF/wiki_12',
 '../data/text/AF/wiki_26',
 '../data/text/AF/wiki_61',
 '../data/text/AF/wiki_34',
 '../data/text/AF/wiki_08',
 '../data/text/AF/wiki_14',
 '../data/text/AF/wiki_13',
 '../data/text/AF/wiki_02',
 '../data/text/AF/wiki_37']

### Train `SentencePieceBPETokenizer`

BpeTrainer:  
```
- vocab_size: unsigned int:
                The size of the final vocabulary, including all tokens and alphabet.
- min_frequency: unsigned int:
                The minimum frequency a pair should have in order to be merged.
- show_progress: boolean:
                Whether to show progress bars while training.
- special_tokens: List[Union[str, AddedToken]]:
                A list of special tokens the model should know of.
- limit_alphabet: unsigned int:
                The maximum different characters to keep in the alphabet.
- initial_alphabet: List[str]:
                A list of characters to include in the initial alphabet, even
                if not seen in the training dataset.
                If the strings contains more than one character, only the first one
                is kept.
- continuing_subword_prefix: Optional[str]:
                A prefix to be used for every subword that is not a beginning-of-word.
- end_of_word_suffix: Optional[str]:
                A suffix to be used for every subword that is a end-of-word.
```
https://github.com/huggingface/tokenizers/blob/master/bindings/python/tokenizers/trainers/__init__.pyi

In [10]:
from tokenizers import SentencePieceBPETokenizer
# Initialize a tokenizer
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(files=WIKI_FILES, vocab_size=20000, min_frequency=2)


In [11]:
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

Trained vocab size: 20000


In [14]:
# And finally save it somewhere
tokenizer.save("./thwiki-sentencepiecebpe.tokenizer.json", pretty=True)

In [19]:
encoded = tokenizer.encode(u"สวัสดีครับ ผมชื่อไนท์ ตอนนี้ก็เป็นเวลาที่ผมต้องไปโรงเรียนแล้ว  นี่คือการเว้นวรรคสองทีครับ  จะได้ออกเป็นสอง Spaces")
print(encoded.ids)
print(encoded.tokens)

[15906, 1204, 18091, 13184, 1257, 12938, 3426, 9773, 5391, 1005, 4462, 1397, 1096, 1751, 1364, 527, 7777, 5202, 4057, 10411, 1564, 1317, 18091, 527, 8768, 3127, 1564, 6927, 2113, 1739]
['▁สวัส', 'ดี', 'ครับ', '▁ผม', 'ชื่อ', 'ไนท์', '▁ตอน', 'นี้ก็', 'เป็นเวลา', 'ที่', 'ผม', 'ต้อง', 'ไป', 'โรงเรียน', 'แล้ว', '▁', '▁นี่', 'คือการ', 'เว้น', 'วรรค', 'สอง', 'ที', 'ครับ', '▁', '▁จะได้', 'ออกเป็น', 'สอง', '▁Sp', 'ac', 'es']


### If we want to use it again

The Encoding structure exposes multiple properties which are useful when working with transformers models

- normalized_str: The input string after normalization (lower-casing, unicode, stripping, etc.). Note that this property and `original_str` are not among the attributes listed in the `Encoding` repr printed further down.
- original_str: The input string as it was provided
- tokens: The generated tokens with their string representation
- ids: The generated tokens with their integer representation
- attention_mask: If your input has been padded by the tokenizer, then this would be a vector of 1 for any non padded token and 0 for padded ones.
- special_tokens_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.
- type_ids: If your input was made of multiple "parts" such as (question, context), then this would be a vector giving, for each token, the segment it belongs to.
- overflowing: If your input has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts.
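
How `attention_mask` and `overflowing` relate to padding and truncation can be shown with plain lists. This is a conceptual sketch only: `pad_and_mask` is a hypothetical helper, the pad id of 0 is an assumption, and in the real library the overflowing parts are `Encoding` objects rather than raw id lists.

```python
def pad_and_mask(ids, length, pad_id=0):
    """Truncate or pad `ids` to `length`; return (ids, attention_mask, overflowing)."""
    kept, overflowing = ids[:length], ids[length:]
    n_pad = length - len(kept)
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return kept + [pad_id] * n_pad, attention_mask, overflowing


padded = pad_and_mask([15906, 1204, 18091], length=5)
truncated = pad_and_mask([15906, 1204, 18091, 13184], length=3)
print(padded)
print(truncated)
```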

In [24]:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("./thwiki-sentencepiecebpe.tokenizer.json")
encoded =  tokenizer.encode(u"สวัสดีครับ ผมชื่อไนท์ ตอนนี้ก็เป็นเวลาที่ผมต้องไปโรงเรียนแล้ว  นี่คือการเว้นวรรคสองทีครับ  จะได้ออกเป็นสอง Spaces")
print(encoded.ids)
print(encoded.tokens)

[15906, 1204, 18091, 13184, 1257, 12938, 3426, 9773, 5391, 1005, 4462, 1397, 1096, 1751, 1364, 527, 7777, 5202, 4057, 10411, 1564, 1317, 18091, 527, 8768, 3127, 1564, 6927, 2113, 1739]
['▁สวัส', 'ดี', 'ครับ', '▁ผม', 'ชื่อ', 'ไนท์', '▁ตอน', 'นี้ก็', 'เป็นเวลา', 'ที่', 'ผม', 'ต้อง', 'ไป', 'โรงเรียน', 'แล้ว', '▁', '▁นี่', 'คือการ', 'เว้น', 'วรรค', 'สอง', 'ที', 'ครับ', '▁', '▁จะได้', 'ออกเป็น', 'สอง', '▁Sp', 'ac', 'es']


In [25]:
encoded = tokenizer.encode(u"""<doc id="774" url="https://th.wikipedia.org/wiki?curid=774" title="บักส์ บันนี">""")
print(encoded.ids)
print(encoded.tokens)
encoded = tokenizer.encode(u"""</doc>""")
print(encoded.ids)
print(encoded.tokens)

[1242, 17366, 1896, 17367, 1896, 1245, 12145, 1297, 19440, 2143, 1219]
['▁<doc', '▁id="77', '4"', '▁url="https://th.wikipedia.org/wiki?curid=77', '4"', '▁title="', 'บัก', 'ส์', '▁บัน', 'นี', '">']
[1244]
['▁</doc>']


In [25]:
encoded

Encoding(num_tokens=13, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [19]:
tokenizer

Tokenizer(vocabulary_size=15197, model=SentencePieceBPE, unk_token=<unk>, replacement=▁, add_prefix_space=True, dropout=None)

In [18]:
tokenizer.get_vocab()

{'ที่ให้': 7590,
 'ในขณะนั้น': 3204,
 'สิบปี': 8807,
 '▁ในระยะ': 10844,
 'อร์ค': 2884,
 'ิมนุษย': 11894,
 'ο': 104,
 'บิดา': 2692,
 'พรม': 9814,
 'จากดุลยภาพ': 10667,
 '▁เนื่องมาจาก': 14610,
 'การใช้งาน': 7582,
 'ฎี': 1452,
 'ของตัวเอง': 3953,
 'ในสภาพ': 7636,
 'ยังไม่': 5582,
 '▁คณะ': 2399,
 'สังหาร': 11999,
 'สถาบันสถาปนา': 3092,
 'ตะวันตกเฉียง': 4088,
 'นั่น': 3170,
 'ที่สาธารณะ': 10191,
 'อยู่นั้น': 14624,
 'เกาหลี': 6378,
 '▁ช็อม': 8429,
 'ในคัมภีร์ไบเบิล': 10238,
 'ได้สนับสนุน': 10509,
 'มานุษยวิทยา': 13612,
 'let': 11626,
 'ดัดแปลงเป็นละคร': 13950,
 'และเจ้าหน้าที่': 14809,
 'ชไอ': 9690,
 '.4%': 5835,
 '▁ฮอโลแกรมแบ่งได้เป็นประเภทใหญ่': 15133,
 '้': 178,
 'ของกษัตริย์': 10421,
 'ึก': 422,
 'นิสัย': 8591,
 'อนุมานราช': 14065,
 'ตะวันออกเฉียงใต้': 1749,
 'อัตราการ': 3110,
 'สงค์': 2162,
 '▁ในนครนิวยอร์ก': 10847,
 'ลินิวส์': 12636,
 '▁ในเดือนมิถุนายน': 13742,
 'เบธเลเฮม': 15175,
 'ด้นํา': 9721,
 'ัตต': 6531,
 'อาชีพ': 3696,
 'ในกลุ่ม': 7623,
 '44"': 9494,
 '▁2558': 3082,
 'ในอัฟกานิ

In [None]:
# Then train it!
tokenizer.train([ "./data/text/AA/wiki_00"])

In [1]:
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase, NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel

# First we create an empty Byte-Pair Encoding model (i.e. not trained model)
tokenizer = Tokenizer(BPE())

# Then we enable lower-casing and unicode-normalization
# The Sequence normalizer allows us to combine multiple Normalizers that will be
# executed in order.
tokenizer.normalizer = Sequence([
    NFKC(),
    Lowercase()
])

# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.
tokenizer.pre_tokenizer = ByteLevel()

# And finally, let's plug a decoder so we can recover from a tokenized input to the original one
tokenizer.decoder = ByteLevelDecoder()

In [None]:
from tokenizers.trainers import BpeTrainer

# We initialize our trainer, giving it the details about the vocabulary we want to generate
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(trainer, ["../data/text/AA/wiki_01"])

In [2]:
ByteLevel.alphabet()

['½',
 'Ó',
 'ŀ',
 '9',
 '°',
 'æ',
 'İ',
 'Î',
 '^',
 '<',
 'ħ',
 'ī',
 'B',
 'n',
 'ĝ',
 'ĥ',
 '4',
 'Ð',
 '2',
 '©',
 'Ē',
 '}',
 'ē',
 '¦',
 'Ļ',
 'h',
 '6',
 '¿',
 '!',
 'P',
 '`',
 'Į',
 'ü',
 'ķ',
 'H',
 'ĩ',
 ';',
 'Ä',
 'Ý',
 'ď',
 'ê',
 '÷',
 'Å',
 '¾',
 'ę',
 'V',
 '²',
 'ø',
 'ë',
 'đ',
 'č',
 '+',
 'É',
 'ĺ',
 'ğ',
 'l',
 '¹',
 'Ċ',
 ',',
 '«',
 '?',
 'é',
 'S',
 'ý',
 'ó',
 'Č',
 'Ė',
 'Ě',
 'o',
 'À',
 'ı',
 'Ę',
 'â',
 '³',
 '=',
 'Ğ',
 'ì',
 '|',
 '¯',
 '1',
 'Ķ',
 '¤',
 'd',
 'ĵ',
 'ÿ',
 '>',
 'ĸ',
 '.',
 'Û',
 'ä',
 'Ã',
 'Ĵ',
 'ļ',
 'Ò',
 'Ŀ',
 'f',
 'ė',
 'ª',
 'X',
 'Ă',
 '{',
 'v',
 '¢',
 'ć',
 'ą',
 'ă',
 '£',
 '5',
 'ù',
 'Ī',
 'Ë',
 'Ü',
 ']',
 'T',
 ')',
 "'",
 'å',
 'ò',
 '¼',
 '\\',
 '×',
 'Ĭ',
 '¶',
 '-',
 'Ĉ',
 'b',
 'Ĺ',
 '~',
 'Â',
 'Ĥ',
 '%',
 '¡',
 'O',
 'õ',
 'û',
 '8',
 '3',
 'Z',
 'Ć',
 'J',
 'Õ',
 'Ù',
 'ľ',
 'Y',
 'a',
 'Ĕ',
 'y',
 'Q',
 '[',
 '±',
 '®',
 'µ',
 'Ġ',
 '@',
 'M',
 'i',
 's',
 'Ľ',
 '"',
 'W',
 '/',
 'Ď',
 'Ĝ',
 'ā',
 'Ö',
 'à',
 'U