# Installation

```
pip install tokenizers
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
rm wikitext-103-raw-v1.zip
```

# Tutorial

## Training Tokenizer

 - Train a Byte-Pair Encoding (BPE) tokenizer.
 - `special_tokens`: the order matters. IDs are assigned in list order, so `[UNK]` gets ID 0, `[CLS]` gets ID 1, and so on.
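
As a rough illustration of the two points above, the following pure-Python sketch learns a few BPE merges from a toy word list and builds a vocabulary with the special tokens placed first. It is not how the `tokenizers` library implements training; `learn_bpe` and the toy corpus are hypothetical names introduced only for this example.

```python
from collections import Counter


def learn_bpe(words, num_merges, special_tokens):
    """Toy BPE trainer (hypothetical helper, not the library's algorithm)."""
    # Every word starts out as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)

    # Special tokens go first, so their IDs follow the list order: 0, 1, 2, ...
    vocab = list(special_tokens)
    vocab.extend(sorted({ch for symbols in corpus for ch in symbols}))

    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break

        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab.append("".join(best))

        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus

    return {token: idx for idx, token in enumerate(vocab)}, merges


special_tokens = ['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]']
vocab, merges = learn_bpe(["low", "low", "lower", "lowest"], 2, special_tokens)
print(merges)
print({tok: vocab[tok] for tok in special_tokens})
```

Because the special tokens are inserted before anything else, reordering the list changes their IDs, which matters later when a fixed ID such as `pad_id=3` is passed to the tokenizer.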

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]

# BPE model; anything outside the vocabulary maps to [UNK]
tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
# Split on whitespace and punctuation before BPE is applied
tokenizer.pre_tokenizer = Whitespace()

# vocab_size is a trainer option, not a BPE model option
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]'],
)
tokenizer.train(files, trainer)

# Save & Load
# tokenizer.save("tokenizer-wiki.json")
# tokenizer = Tokenizer.from_file("tokenizer-wiki.json")
```

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print('tokens:', output.tokens)
print('ids   :', output.ids)
print(output.offsets)
```

```
tokens: ['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']
ids   : [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]
[(0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13), (14, 17), (18, 21), (22, 25), (26, 27), (28, 29)]
```
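
The offsets are useful for mapping each token back to the text it came from. The sketch below reuses the sentence, tokens, and offsets printed above, and assumes the offsets are character indices into the input string, as the printed values suggest. It needs no trained tokenizer; the variable names are chosen only for this example.

```python
sentence = "Hello, y'all! How are you 😁 ?"

# Values copied from the output above.
tokens = ['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']
offsets = [(0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13),
           (14, 17), (18, 21), (22, 25), (26, 27), (28, 29)]

# Slice the original string with each (start, end) pair.
pieces = [sentence[start:end] for start, end in offsets]

for token, piece in zip(tokens, pieces):
    print(f'{token!r:10} <- {piece!r}')
```

The slice for `[UNK]` is the emoji, so the original text stays recoverable even when a token falls outside the vocabulary.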


## Post Processing 

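The code below converts encodings into the traditional BERT input format.

 - `$A` in `single`: the input sentence.
 - `$A` and `$B` in `pair`: the first and second sentences, respectively.
 - `:1` in `pair`: the type ID. The default is `:0`, which is why `$A:0` is not written out explicitly.

Encoding a sentence and comparing the result with the earlier example makes the meaning clear.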
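
To make the template syntax concrete, here is a minimal pure-Python sketch that expands a template string into tokens and type IDs. It only mimics the behavior described in the list above and is not the `TemplateProcessing` implementation; `apply_template` is a hypothetical helper, and the token lists are typed in by hand.

```python
def apply_template(template, seq_a, seq_b=None):
    """Expand e.g. "[CLS] $A [SEP] $B:1 [SEP]:1" into (tokens, type_ids)."""
    sequences = {"$A": seq_a, "$B": seq_b}
    tokens, type_ids = [], []
    for piece in template.split():
        # An optional ":<n>" suffix sets the type ID; without it the ID is 0.
        name, _, suffix = piece.partition(":")
        type_id = int(suffix) if suffix else 0
        # "$A" / "$B" expand to a whole sequence, anything else is a literal token.
        expanded = sequences[name] if name in sequences else [name]
        tokens.extend(expanded)
        type_ids.extend([type_id] * len(expanded))
    return tokens, type_ids


first = ['Hello', ',', 'y', "'", 'all', '!']
second = ['How', 'are', 'you', '[UNK]', '?']

pair_tokens, pair_type_ids = apply_template("[CLS] $A [SEP] $B:1 [SEP]:1", first, second)
print(pair_tokens)
print(pair_type_ids)
```

Everything up to and including the first `[SEP]` gets type ID 0, and the second sentence together with the final `[SEP]:1` gets type ID 1.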
 

```python
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    # Every special token used in a template must be listed with its ID
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
```

```python
# Single
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print('Single: ', output.tokens)

# Pair
output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
print('Pair  : ', output.tokens)
print('TypeID: ', output.type_ids)
```

```
Single:  ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']
Pair  :  ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']
TypeID:  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```


## Encoding Multiple Sentences in a batch

```python
# Single
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])

for i, o in enumerate(output):
    print(f'{i}:', o.tokens)


# Pair (each item is a [first, second] sentence pair; the result is not printed here)
output = tokenizer.encode_batch(
    [["Hello, y'all!", "How are you 😁 ?"],
     ["Hello to you too!", "I'm fine, thank you!"]]
)
```

```
0: ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]']
1: ['[CLS]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]']
```


## Padding

 - `length`: if omitted, each batch is padded to the length of its longest sentence.
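
The effect of `length` can be sketched in plain Python. The helper below is a hypothetical illustration of right-padding and of how the attention mask marks real tokens versus padding; it is not the library's code, and the two token lists are typed in by hand.

```python
def pad_batch(batch, pad_token="[PAD]", length=None):
    """Right-pad token lists and build matching attention masks."""
    # Without an explicit length, pad to the longest sequence in the batch.
    target = length if length is not None else max(len(seq) for seq in batch)
    padded, masks = [], []
    for seq in batch:
        n_pad = max(target - len(seq), 0)
        padded.append(seq + [pad_token] * n_pad)
        # 1 marks a real token, 0 marks padding.
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


batch = [
    ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]'],
    ['[CLS]', 'How', 'are', 'you', '[UNK]', '?', '[SEP]'],
]

# No length given: both rows end up as long as the longest one.
auto_padded, auto_masks = pad_batch(batch)
print([len(row) for row in auto_padded])

# Fixed length of 12.
fixed_padded, fixed_masks = pad_batch(batch, length=12)
print(fixed_padded[0])
print(fixed_masks[0])
```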


```python
# pad_id=3 matches the position of [PAD] in the special_tokens list used for training
tokenizer.enable_padding(pad_id=3, pad_token='[PAD]', length=12)
output = tokenizer.encode("Hello, y'all!")
print('Padding Tokens:', output.tokens)
print('Attention Mask:', output.attention_mask)
```

```
Padding Tokens: ['[CLS]', 'Hello', ',', 'y', "'", 'all', '!', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```


## Pretrained Tokenizer

```
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
```




```python
from tokenizers import BertWordPieceTokenizer

# Load the pretrained BERT vocabulary; lowercase=True matches the uncased model
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
print(tokenizer.encode("Hello, y'all! How are you 😁 ?").tokens)
```

```
['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
```
