Based on the Hugging Face tokenizer example notebook.

https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb

---

# 01-training-tokenizers

## Tokenization doesn't have to be slow!

### Introduction

Before diving into any Machine Learning or Deep Learning Natural Language Processing model, every practitioner
must find a way to map raw input strings to a representation that a trainable model can understand.

One very simple approach is to split the input on every space and assign an identifier to each word. In Python, this
approach looks like the code below:

```python
s = "very long corpus..."
words = s.split(" ")  # Split over space
vocabulary = {word: idx for idx, word in enumerate(set(words))}  # Map each word to its corresponding id
```

This approach might work well if your vocabulary remains small, since it stores every word (or **token**) present in your original
input. It also has no way to represent a word that never appeared in the corpus. Moreover, word variations like "cat" and "cats"
do not share the same identifier, even though their meanings are quite close.
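
To make these limitations concrete, here is a minimal sketch of the word-level approach on a toy corpus. The corpus, the `encode` helper, and the `unk_id` value are made up for illustration only.

```python
# Toy corpus; any whitespace-separated text would do.
corpus = "the cat sat while the cats slept"

# Build a word -> id mapping; sorting makes the ids reproducible.
vocabulary = {word: idx for idx, word in enumerate(sorted(set(corpus.split(" "))))}


def encode(text, vocab, unk_id=-1):
    """Map each space-separated word to its id, or unk_id if it was never seen."""
    return [vocab.get(word, unk_id) for word in text.split(" ")]


print(vocabulary)
print(encode("the cat slept", vocabulary))
print(encode("the dog slept", vocabulary))  # "dog" is not in the vocabulary
```

Here "cat" and "cats" receive unrelated ids, and "dog" can only be mapped to the placeholder id.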

![tokenization_simple](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/11/tokenization.png)

### Subtoken Tokenization

To overcome the issues described above, recent work on tokenization leverages "subtoken" tokenization.
**Subtokens** extend the previous splitting strategy by further breaking a word into grammatically logical sub-components learned
from the data.

Taking our previous example of the words __cat__ and __cats__, a sub-tokenization of the word __cats__ would be [cat, ##s], where the prefix _"##"_ indicates a subtoken that continues the preceding piece.
Such training algorithms might extract sub-tokens such as _"##ing"_ or _"##ed"_ from an English corpus.
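
As an illustration of how a word is split once a sub-token vocabulary exists, here is a minimal sketch of greedy longest-match splitting in the WordPiece style. It covers only the splitting step, not training; the vocabulary is hand-written for the example, and `wordpiece_tokenize` is a hypothetical helper rather than a library function.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: give up on the whole word
        pieces.append(match)
        start = end
    return pieces


# Hand-written toy vocabulary (an assumption, not learned from data).
toy_vocab = {"cat", "walk", "##s", "##ing", "##ed"}
for w in ["cat", "cats", "walking", "walked", "dog"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

With this vocabulary, five entries are enough to cover "cat", "cats", "walk", "walks", "walking" and "walked", which is the vocabulary reduction the text describes.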

As you might expect, building tokens from compositions of _"pieces"_ reduces the overall size
of the vocabulary you have to carry to train a Machine Learning model. On the other hand, as one token might be split
into multiple subtokens, the length of your model's input might increase and become an issue for models whose complexity is non-linear in the input sequence length.
 
![subtokenization](https://nlp.fast.ai/images/multifit_vocabularies.png)
 
Among all the tokenization algorithms, we can highlight a few subtoken algorithms used in Transformers-based SoTA models:

- [Byte Pair Encoding (BPE) - Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909)
- [Word Piece - Japanese and Korean voice search (Schuster, M., and Nakajima, K., 2012)](https://research.google/pubs/pub37842/)
- [Unigram Language Model - Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, T., 2018)](https://arxiv.org/abs/1804.10959)
- [Sentence Piece - A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Taku Kudo and John Richardson, 2018)](https://arxiv.org/abs/1808.06226)

Going through all of them is out of the scope of this notebook, so we will just highlight how you can use them.

### @huggingface/tokenizers library 
Along with the transformers library, we @huggingface provide a blazing fast tokenization library
able to train, tokenize and decode dozens of Gb/s of text on a common multi-core machine.

The library is written in Rust, which allows us to take full advantage of multi-core parallel computation in a native and memory-aware way. On top of it,
we provide bindings for Python and NodeJS (more bindings may be added in the future).

We designed the library so that it provides all the required blocks to create end-to-end tokenizers in an interchangeable way. In that sense, we provide
these various components: 

- **Normalizer**: Executes all the initial transformations over the input string. For example, when you need to
lowercase some text, strip it, or apply one of the common Unicode normalization processes, you add a Normalizer.
- **PreTokenizer**: In charge of splitting the input string. This is the component that decides where and how to
pre-segment the original string. The simplest example, as we saw before, is to split on spaces.
- **Model**: Handles all the sub-token discovery and generation. This part is trainable and highly dependent
 on your input data.
- **Post-Processor**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA
models. For instance, for BERT it would wrap the tokenized sentence in [CLS] and [SEP] tokens.
- **Decoder**: In charge of mapping a tokenized input back to the original string. The decoder is usually chosen according
to the `PreTokenizer` used previously.
- **Trainer**: Provides training capabilities to each model.

For each of the components above we provide multiple implementations:

- **Normalizer**: Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip, ...
- **PreTokenizer**: ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, ...
- **Model**: WordLevel, BPE, WordPiece
- **Post-Processor**: BertProcessor, ...
- **Decoder**: WordLevel, BPE, WordPiece, ...

All of these building blocks can be combined to create working tokenization pipelines. 
In the next section we will go over our first pipeline.
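
To show how the building blocks described above chain together, here is a conceptual sketch in plain Python. It deliberately does not use the library API: every function and the toy vocabulary are hypothetical stand-ins for the Normalizer, PreTokenizer, Model, Post-Processor and Decoder roles.

```python
import unicodedata


def normalize(text):
    # Normalizer role: Unicode NFKC followed by lowercasing.
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text):
    # PreTokenizer role: split on whitespace.
    return text.split()


def model(words, vocab, unk_id=0):
    # Model role: here a plain word-level lookup (no sub-token discovery).
    return [vocab.get(w, unk_id) for w in words]


def post_process(ids, cls_id, sep_id):
    # Post-Processor role: wrap the sequence BERT-style with [CLS] ... [SEP].
    return [cls_id] + ids + [sep_id]


def decode(ids, vocab, special_ids):
    # Decoder role: map ids back to words and re-join, skipping special tokens.
    id_to_token = {i: t for t, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids if i not in special_ids)


# Toy vocabulary, made up for this example.
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
ids = post_process(model(pre_tokenize(normalize("Hello   WORLD")), toy_vocab), 1, 2)
print(ids)
print(decode(ids, toy_vocab, special_ids={1, 2}))
```

Because each stage only consumes the output of the previous one, any stage can be swapped for another implementation, which is the interchangeability the library aims for.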

---

# First time using huggingface's `tokenizers`
### Introduction
- tokenizer
    - Byte-Pair Encoding(BPE) tokenizer
- dataset
    - The Adventures of Sherlock Holmes (130,000 lines)
- installation
    - `!pip install tokenizers`

### Download the data

In [1]:
BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'

# Let's download the file and save it only if the request succeeded
from requests import get
response = get(BIG_FILE_URL)

if response.status_code == 200:
    with open('big.txt', 'wb') as big_f:
        big_f.write(response.content)
else:
    print("Unable to get the file: {}".format(response.reason))

### Define the tokenizer
- The code below can be replaced with the `ByteLevelBPETokenizer` class.

In [1]:
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase, NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel

- First, create an empty BPE model.

In [2]:
tokenizer = Tokenizer(BPE())

- Define the text normalizers.

In [3]:
tokenizer.normalizer = Sequence([
    NFKC(),
    Lowercase()
])
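
To see what this normalization sequence does to text, here is a standard-library sketch that applies the same two steps in the same order. `nfkc_lowercase` is a hypothetical helper, and its exact equivalence with the library's `NFKC` and `Lowercase` normalizers is assumed rather than verified.

```python
import unicodedata


def nfkc_lowercase(text):
    """Apply NFKC normalization, then lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


samples = [
    "\uff28\uff45\uff4c\uff4c\uff4f",  # "Hello" written with full-width letters
    "\ufb01le",                        # "file" starting with the single "fi" ligature character
    "Cafe\u0301",                      # "Cafe" followed by a combining acute accent
]
for s in samples:
    print(repr(s), "->", repr(nfkc_lowercase(s)))
```

NFKC folds compatibility characters into their plain forms and composes accents, so visually identical strings end up with identical code points before the model sees them.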

- str $\longrightarrow$ bytes: the pre-tokenizer (encoder).

In [4]:
tokenizer.pre_tokenizer = ByteLevel()

- bytes $\longrightarrow$ str: the decoder.

In [5]:
tokenizer.decoder = ByteLevelDecoder()
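
To illustrate the byte-level round trip, here is a sketch of the GPT-2 style byte-to-character table, which the `ByteLevel` components are understood to follow (an assumption, not checked against the library source). The helper names are hypothetical.

```python
def bytes_to_unicode():
    """Build a table mapping each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (space, control characters, ...) are shifted past U+00FF.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    # str -> UTF-8 bytes -> one printable symbol per byte
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols):
    # printable symbols -> bytes -> str
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


encoded = to_byte_level(" is")
print(encoded)
print(from_byte_level(encoded) == " is")
```

Under this table the space byte maps to `Ġ` (U+0120), which would explain why tokens produced by a byte-level tokenizer start with that character wherever the original text had a leading space.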

### Train and save the tokenizer
- Train the tokenizer.

In [7]:
from tokenizers.trainers import BpeTrainer

In [8]:
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(files=["big.txt"], trainer=trainer)  # keyword form: the positional order of `train` differs between library versions

In [9]:
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

Trained vocab size: 25000
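
To give an intuition for what a BPE trainer learns, here is a toy sketch of merge learning on a handful of words. It is a simplification made up for illustration: the real trainer also handles the byte-level alphabet, frequency thresholds and the target vocabulary size, and its tie-breaking may differ. `learn_bpe` is a hypothetical helper.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy BPE)."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; break ties deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


toy_words = ["low", "low", "lower", "lowest", "newer", "wider"]
print(learn_bpe(toy_words, num_merges=3))
```

Each learned merge adds one symbol to the vocabulary, so the number of merges is what a setting such as `vocab_size` ultimately bounds.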


- Save the tokenizer.

In [10]:
tokenizer.model.save('.')

['./vocab.json', './merges.txt']

### Use pretrained tokenizer
- load the tokenizer

In [12]:
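tokenizer.model = BPE.from_file('vocab.json', 'merges.txt')  # recent releases load files via `from_file`; older ones used BPE('vocab.json', 'merges.txt')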

- encode
    - The `Encoding` object returned by `tokenizer.encode` exposes the key results as attributes (the tokens, the encoded ids, and so on).

In [25]:
encoding = tokenizer.encode("This is a simple input to be tokenized")
encoding

Encoding(num_tokens=10, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [28]:
tokenizer.get_vocab()

{'Ġperformances': 22483,
 'Ġfavoured': 13383,
 'Ġrotated': 18788,
 'Ġflog': 13282,
 'Ġtrunk': 5208,
 'Ġsexual': 18834,
 'Ġfavorably': 20156,
 'Ġfors': 14608,
 'Ġsire': 9182,
 'Ġdeclare': 8425,
 'iper': 21166,
 'Ġappalach': 12650,
 'Ġconsented': 16796,
 'Ġfibrositis': 12377,
 'Ġapr': 5195,
 'Ġsp': 435,
 'Ġsentiments': 11708,
 'Ġstorage': 24220,
 'Ġsty': 13843,
 'Ġproclaimed': 10610,
 'Ġonerous': 24201,
 'Ġdisg': 9529,
 'Ġinterview': 6603,
 'Ġfillmore': 20265,
 'nikov': 23804,
 'Ġinhum': 20214,
 'Ġsobbed': 12289,
 'Ġdeput': 18183,
 'Ġding': 16575,
 'Ġascii': 13842,
 'Ġawaited': 6929,
 'oughs': 19693,
 'Ġdrum': 8609,
 'Ġdecree': 9311,
 'ulyu': 13484,
 'Ġcow': 6129,
 'Ġirration': 18205,
 'Ġlimbs': 4741,
 'Ġmarines': 21897,
 'Ġdecl': 1668,
 'lished': 2702,
 'what': 1178,
 'Ġpersons': 3165,
 'Ġburen': 12969,
 'Ġcoral': 24917,
 'Ġlawrence': 18174,
 'Ġtranscription': 18257,
 'Ġtug': 11123,
 'Ġwhirl': 15200,
 'ilib': 21547,
 'Ġ612': 20031,
 'Ġjourney': 4587,
 'Ġleash': 12613,
 'Ġmildly': 20341,

In [26]:
encoding.ids

[431, 338, 258, 2865, 284, 8519, 286, 304, 13143, 1259]

In [29]:
encoding.type_ids

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [30]:
encoding.tokens

['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']

In [31]:
encoding.offsets

[(0, 4),
 (4, 7),
 (7, 9),
 (9, 16),
 (16, 19),
 (19, 22),
 (22, 25),
 (25, 28),
 (28, 34),
 (34, 38)]
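
To show how these offsets relate to the input, the following sketch slices the original sentence with the `(start, end)` pairs printed above. It only uses values already shown in this notebook.

```python
text = "This is a simple input to be tokenized"
offsets = [(0, 4), (4, 7), (7, 9), (9, 16), (16, 19),
           (19, 22), (22, 25), (25, 28), (28, 34), (34, 38)]

# Slice the original string with each (start, end) pair.
pieces = [text[start:end] for start, end in offsets]
print(pieces)

# The spans are contiguous here, so joining them restores the input.
print("".join(pieces) == text)
```

Each slice is the raw, un-normalized text behind a token, so the offsets let you map a token such as `Ġtoken` back to the characters `" token"` in the original string.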

In [32]:
encoding.attention_mask

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [33]:
encoding.special_tokens_mask

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [34]:
encoding.overflowing

[]
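
The attention mask above is all ones because a single, unpadded sequence was encoded. As a minimal sketch of when it would contain zeros, the following pads a small batch by hand. `pad_batch`, the `pad_id` value of 0, and the second sequence are assumptions made for illustration and do not come from the library.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every id list to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, masks


batch = [
    [431, 338, 258, 2865, 284, 8519, 286, 304, 13143, 1259],  # ids shown above
    [431, 338, 258],                                          # a shorter, made-up sequence
]
padded, masks = pad_batch(batch)
for ids, mask in zip(padded, masks):
    print(ids)
    print(mask)
```

A model that receives the padded batch can use the mask to ignore the padding positions.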