# Installation

```
pip install tokenizers
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
rm wikitext-103-raw-v1.zip
```

# Tutorial

## Training Tokenizer

 - Train a Byte-Pair Encoding (BPE) tokenizer.
 - `special_tokens`: the order matters. `[UNK]` is assigned ID 0 and `[CLS]` is assigned ID 1.
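
To make the two points above concrete, here is a minimal pure-Python sketch of BPE training. It is a toy stand-in, not the `tokenizers` implementation: the function `train_toy_bpe`, the sample words, and the tie-breaking rule are all assumptions made for illustration. It shows that special tokens take the first IDs in the order given, and that each training step merges the most frequent adjacent pair.

```python
from collections import Counter

def train_toy_bpe(words, special_tokens, num_merges):
    """Toy BPE: special tokens get the first IDs, then characters, then merges."""
    # Special tokens are numbered by their position in the list.
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    # Every word starts out as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    for symbols in corpus:
        for ch in symbols:
            vocab.setdefault(ch, len(vocab))
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # hypothetical tie-break: first seen
        merges.append(best)
        merged = best[0] + best[1]
        vocab.setdefault(merged, len(vocab))
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges

# Hypothetical sample data, not taken from WikiText-103.
words = ["low", "low", "lower", "newest", "newest", "newest"]
special = ['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]']
vocab, merges = train_toy_bpe(words, special, num_merges=3)
print({tok: vocab[tok] for tok in special})
print(merges)
```

Reordering `special` changes which ID each special token receives, which is why the order passed to the trainer matters.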

In [26]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size is a trainer option, not a BPE model argument
trainer = BpeTrainer(vocab_size=30000, special_tokens=['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]'])
tokenizer.train(files, trainer)

# Save & Load
# tokenizer.save("tokenizer-wiki.json")
# tokenizer = Tokenizer.from_file("tokenizer-wiki.json")

In [34]:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print('tokens:', output.tokens)
print('ids   :', output.ids)
print(output.offsets)

tokens: ['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']
ids   : [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]
[(0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13), (14, 17), (18, 21), (22, 25), (26, 27), (28, 29)]
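
The offsets printed above are `(start, end)` character positions in the input string, so slicing the original text with them recovers what each token came from. The short sketch below reuses the tokens and offsets from the output above as plain literals (it does not call the library) and shows that the `[UNK]` token still points back at the emoji it replaced. The variable names are chosen for illustration only.

```python
text = "Hello, y'all! How are you 😁 ?"
tokens = ['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']
offsets = [(0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13),
           (14, 17), (18, 21), (22, 25), (26, 27), (28, 29)]

# Slice the original string with each (start, end) pair.
spans = [text[start:end] for start, end in offsets]
for token, span in zip(tokens, spans):
    print(f"{token!r:>9} <- {span!r}")

# The unknown token still maps to the character it could not encode.
unk_index = tokens.index('[UNK]')
unk_source = spans[unk_index]
```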


## Post Processing

The code below converts the encoding into the traditional BERT input format.

 - `$A` in `single`: the sentence
 - `$A` and `$B` in `pair`: the first and second sentences
 - `:1` in `pair`: the type ID; the default is `:0`, which is why `$A:0` is not written out explicitly

Encoding a sentence and comparing the result with the example above makes the meaning of each placeholder clear.

In [47]:
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

In [53]:
# Single
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print('Single: ', output.tokens)

# Pair
output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
print('Pair  : ', output.tokens)
print('TypeID: ', output.type_ids)

Single:  ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']
Pair  :  ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']
TypeID:  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
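
The template syntax can be mimicked in a few lines of plain Python, which may help when reading the `TypeID` output above. This is only a sketch of the semantics: `apply_template` is a hypothetical helper, not part of `tokenizers`, and it ignores everything `TemplateProcessing` does beyond expanding `$A`/`$B` and applying the optional `:N` type ID suffix.

```python
def apply_template(template, a_tokens, b_tokens=None):
    """Expand a template such as "[CLS] $A [SEP] $B:1 [SEP]:1".

    Each piece is either a sequence placeholder ($A / $B) or a literal
    special token; an optional ":N" suffix sets the type ID (default 0).
    """
    sequences = {"$A": a_tokens, "$B": b_tokens or []}
    tokens, type_ids = [], []
    for piece in template.split():
        name, sep, suffix = piece.rpartition(":")
        if sep and suffix.isdigit():
            type_id = int(suffix)
        else:
            # No ":N" suffix, so the piece keeps the default type ID 0.
            name, type_id = piece, 0
        # Placeholders expand to their tokens; anything else is a literal.
        expanded = sequences.get(name, [name])
        tokens.extend(expanded)
        type_ids.extend([type_id] * len(expanded))
    return tokens, type_ids

# Token lists copied from the outputs above.
first = ['Hello', ',', 'y', "'", 'all', '!']
second = ['How', 'are', 'you', '[UNK]', '?']

single_tokens, single_type_ids = apply_template("[CLS] $A [SEP]", first + second)
pair_tokens, pair_type_ids = apply_template("[CLS] $A [SEP] $B:1 [SEP]:1", first, second)
print('Pair  : ', pair_tokens)
print('TypeID: ', pair_type_ids)
```

Because `$A` and the first `[SEP]` carry no suffix, they fall back to type ID 0, while `$B:1` and `[SEP]:1` mark the second segment.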


## Encoding Multiple Sentences in a batch

In [60]:
# Single
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])

for i, o in enumerate(output):
    print(f'{i}:', o.tokens)


# Pair
output = tokenizer.encode_batch(
    [["Hello, y'all!", "How are you 😁 ?"],
     ["Hello to you too!", "I'm fine, thank you!"]]
)

0: ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]']
1: ['[CLS]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']


## Padding

 - `length`: if not set, sequences are padded to the length of the longest sentence in the batch


In [72]:
tokenizer.enable_padding(pad_id=3, pad_token='[PAD]', length=12)
output = tokenizer.encode("Hello, y'all!")
print('Padding Tokens:', output.tokens)
print('Attention Mask:', output.attention_mask)

Padding Tokens: ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
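
The effect of `length` can be illustrated without the library. The sketch below is a simplified stand-in for right-side padding: `pad_batch` is a hypothetical helper, not the `enable_padding` implementation. With `length` set, every sequence is padded up to that size; with `length=None`, the longest sequence in the batch sets the target. The attention mask is 1 for real tokens and 0 for padding.

```python
def pad_batch(batch, pad_token="[PAD]", length=None):
    """Right-pad token lists; returns (padded_tokens, attention_masks)."""
    # Without an explicit length, use the longest sequence in the batch.
    target = length if length is not None else max(len(seq) for seq in batch)
    padded, masks = [], []
    for seq in batch:
        # Sequences already at or beyond the target are left unchanged.
        n_pad = max(target - len(seq), 0)
        padded.append(seq + [pad_token] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks

# Token lists copied from the batch output earlier in this tutorial.
first = ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]']
second = ['[CLS]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']

fixed_tokens, fixed_masks = pad_batch([first], length=12)
auto_tokens, auto_masks = pad_batch([first, second])
print('Fixed length:', fixed_masks[0])
print('Batch length:', auto_masks[1])
```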


## Pretrained Tokenizer

In [75]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt

--2021-06-11 00:18:46--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.93.142
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.93.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘bert-base-uncased-vocab.txt’


2021-06-11 00:18:48 (388 KB/s) - ‘bert-base-uncased-vocab.txt’ saved [231508/231508]



In [79]:
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
print(tokenizer.encode("Hello, y'all! How are you 😁 ?").tokens)

['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
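
For intuition about what a WordPiece vocabulary file is used for, here is a rough sketch of greedy longest-match-first splitting of a single word. It is an illustration under stated assumptions, not the `BertWordPieceTokenizer` code: the `wordpiece` function and the five-entry `toy_vocab` are hypothetical, and the real vocabulary comes from `bert-base-uncased-vocab.txt`.

```python
def wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces get "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than the real BERT one.
toy_vocab = {"hello", "how", "token", "##izer", "##s"}

split_known = wordpiece("tokenizers", toy_vocab)
split_unknown = wordpiece("😁", toy_vocab)
print(split_known)
print(split_unknown)
```

A word with no matching pieces, such as the emoji here, collapses to `[UNK]`, which is consistent with the `[UNK]` seen in the output above.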
