# Tokenizers
* In NLP most data is raw text, but ML models can't read or understand text in raw form
    * They only work with numbers
* A tokenizer translates text to numbers
* Several approaches exist; the objective is to find the most meaningful representation:
## Word-Based
* ![Image](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/word_based_tokenization.svg)
* There are different ways to split text. For example, we can use whitespace to tokenize text into words by applying Python's ```split()``` function:
```python
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
```
Would output:
```['Jim', 'Henson', 'was', 'a', 'puppeteer']```
* There are variations of word tokenizers w/ extra rules for punctuation.
    * Can have large "vocabularies", w/ vocab size defined by the total number of independent tokens in our corpus.
    * Each word is assigned an ID, from 0 to (vocab size - 1). Model uses these IDs to identify each word
    * The problem:
        * There are >500k words in the English language, so covering them all takes a huge number of tokens
        * "dog" & "dogs" are similar, but they get unrelated IDs, so the model initially wouldn't be able to relate them
        * We need a custom token to represent words not in vocab, known as the "unknown" token (often represented as "[UNK]" or "<unk>")
        * Goal is to have **AS FEW WORDS AS POSSIBLE** tokenized as UNK
        * One way to reduce unknown tokens is a character-based tokenizer
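
The vocabulary and unknown-token ideas above can be sketched in a few lines. This is a toy illustration, not how any library builds its vocab: the two-sentence corpus, the `encode_words` helper, and the choice of ID 0 for `<unk>` are all assumptions made for the example.

```python
# Toy corpus (hypothetical); a real vocab is built from far more text.
corpus = ["Jim Henson was a puppeteer", "Jim was a dog person"]

# Reserve ID 0 for the unknown token, then give each new word the next free ID.
vocab = {"<unk>": 0}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

def encode_words(text):
    """Split on whitespace and map each word to its ID, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

print(len(vocab))                     # 8
print(encode_words("Jim was a dog"))  # [1, 3, 4, 6]
print(encode_words("Jim has dogs"))   # [1, 0, 0]
```

Note that "dogs" maps to `<unk>` even though "dog" is in the vocab, which is the limitation described in the list above.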
## Character-Based
* Character-based tokenizers split text into chars rather than words, w/ two main benefits:
    * Much smaller vocab
    * Far fewer out-of-vocab tokens, since every word can be built from chars
* Also not perfect: questions remain about spaces & punctuation
    * Chars are less meaningful than words, though this is language dependent
        * In Chinese, each character carries more info than a character in a Latin-script language
    * We end up w/ a very large amount of tokens to be processed by the model
        * A word that is 1 token w/ a word-based tokenizer can become 10 or more tokens w/ a character-based one
        * Thus, subword tokenization
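
To make the trade-off concrete, here is a small comparison on the sentence from the word-based example. It is only a sketch: it treats every character (spaces included) as one token, which is one possible convention among several.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()  # word-based: one token per whitespace-separated word
char_tokens = list(text)    # character-based: one token per character, spaces included

print(len(word_tokens))  # 5
print(len(char_tokens))  # 26

# The character vocab stays tiny: only the distinct characters seen in the text.
char_vocab = sorted(set(text))
print(len(char_vocab))   # 15
```

The sequence gets about five times longer, while the vocab needed to cover it shrinks to a handful of symbols.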
## Subword-Based
* Frequently used words should not be split, but rare words should be decomposed into meaningful subwords
* E.g. "Annoyingly" is rare, thus ```["Annoying", "ly"]```, as both are likely to show up as frequent standalone subwords, while the meaning of "annoyingly" is kept by the composite meaning of the two.
* E.g. "Tokenization"
    * BERT writes ```['token', '##ization']```, where "##" shows that the piece is PART of a word rather than its start
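
A rough sketch of how such a split can be computed, using a greedy longest-match-first search in the style of WordPiece. The vocabulary here is hand-written for the example and the function name `split_into_subwords` is hypothetical; real subword vocabularies are learned from a corpus, and this is not the library's implementation.

```python
# Hand-written toy vocabulary (an assumption for illustration).
# Pieces that continue a word carry the "##" prefix, following BERT's convention.
toy_vocab = {"token", "##ization", "annoying", "##ly", "dog", "##s"}

def split_into_subwords(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into known pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the start of the word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

print(split_into_subwords("tokenization", toy_vocab))  # ['token', '##ization']
print(split_into_subwords("annoyingly", toy_vocab))    # ['annoying', '##ly']
print(split_into_subwords("dogs", toy_vocab))          # ['dog', '##s']
print(split_into_subwords("cat", toy_vocab))           # ['[UNK]']
```

With subwords, "dogs" shares the piece "dog" with the singular form, so the two words are no longer unrelated IDs.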

## Loading & Saving
* ```from_pretrained()``` & ```save_pretrained()``` load or save the algorithm used by the tokenizer (a bit like the architecture of a model) and its vocab (a bit like the weights of a model)
* Loading the BERT tokenizer trained w/ the same checkpoint as BERT is done the same way as loading the model, but w/ the ```BertTokenizer``` class:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
```

* Like ```AutoModel```, the ```AutoTokenizer``` class grabs the proper tokenizer class in the library based on the checkpoint name, & can be used w/ any checkpoint

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```

* We can now use the tokenizer as shown in the previous section

```python
tokenizer("Using a Transformer network is simple")
```
Returns:
```python
{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

* Saving a tokenizer is the same as saving a model:
```python
tokenizer.save_pretrained("directory_on_my_computer")
```
* The next section looks at the intermediate methods of the tokenizer to see how the ```input_ids``` are generated
## Encoding & Decoding
### Encoding
* Translating text -> numbers = encoding
* Two steps:
    * Tokenization
        * Split text into tokens (words, subwords, or characters)
        * Rules must match those used when the model was pretrained
    * Conversion to input IDs
        * Convert tokens to numbers, so we can build a tensor out of them & feed it to the model
        * Tokenizer has a vocabulary, which is downloaded when we instantiate it w/ ```from_pretrained()```
            * Must be the same vocab as when the model was pretrained
#### Tokenization
***This demo does the two steps separately to show the intermediate outputs, but in practice you should call the tokenizer directly on your inputs***

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)
```

Returns ```['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']```
* This is a subword tokenizer; note ```'Trans', '##former'```, where the uncommon word "Transformer" is split into two pieces
* Raw text -> tokens :: "Let's try to tokenize!" -> ```[let, ', s, try, to, token, ##ize, !]```
    * Tokenizer may lowercase all words, then splits everything into small text chunks
    * Most use a subword tokenization algorithm
* Tokens + special tokens :: ```[[CLS], let, ', s, try, to, token, ##ize, !, [SEP]]```
    * ```##``` is the indication used by BERT to denote that a token isn't the beginning of a word
        * Others may differ, like ALBERT adding a long underscore (```▁```) in front of all tokens that had a space before them
* Tokens + special tokens -> input IDs :: ```[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]```
    * Maps each token to its ID in the vocab of the tokenizer
    * Vocab must match the one used in the model's pretraining

```python
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)
```

Returns ```[7993, 170, 13809, 23763, 2443, 1110, 3014]```
* The conversion to IDs is handled by the ```convert_tokens_to_ids()``` method
* Once converted to the appropriate framework tensor, these IDs can be sent to the model as inputs
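
The stages described in this section (tokens, then special tokens, then input IDs) can be mimicked without any library. The sketch below assumes a made-up mini-vocabulary with small illustrative IDs; the helper names `add_special_tokens` and `tokens_to_ids` are hypothetical and the IDs are not those of any real checkpoint.

```python
# Hypothetical mini-vocabulary; real BERT vocabularies have ~30k entries and different IDs.
toy_vocab = {
    "[CLS]": 0, "[SEP]": 1, "[UNK]": 2,
    "let": 3, "'": 4, "s": 5, "try": 6, "to": 7, "token": 8, "##ize": 9, "!": 10,
}

def add_special_tokens(tokens):
    """Wrap a token list with BERT-style start and end markers."""
    return ["[CLS]"] + tokens + ["[SEP]"]

def tokens_to_ids(tokens, vocab):
    """Look each token up in the vocab, falling back to the unknown token's ID."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

toy_tokens = ["let", "'", "s", "try", "to", "token", "##ize", "!"]
with_special = add_special_tokens(toy_tokens)
toy_ids = tokens_to_ids(with_special, toy_vocab)

print(with_special)  # ['[CLS]', 'let', "'", 's', 'try', 'to', 'token', '##ize', '!', '[SEP]']
print(toy_ids)       # [0, 3, 4, 5, 6, 7, 8, 9, 10, 1]
```

Every token gets exactly one ID, so the ID list always has the same length as the token list once special tokens are included.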
### Decoding
* Turns vocab indices -> string w/ the ```decode()``` method

```python
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
```

Returns: ```Using a transformer network is simple```
* ```decode()``` not only converts indices back to tokens, but also groups subword tokens that were part of the same word to produce a readable sentence
* Useful w/ models that predict new text (either generated from a prompt, or translated/summarized)
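
The grouping step can be illustrated with a toy decoder. This is a minimal sketch under stated assumptions: the ID-to-token table is made up, `toy_decode` is a hypothetical helper, and it only merges "##" pieces; the real ```decode()``` also deals with special tokens and spacing around punctuation.

```python
# Hypothetical reverse vocabulary (ID -> token) for the example sentence.
toy_id_to_token = {
    0: "Using", 1: "a", 2: "Trans", 3: "##former",
    4: "network", 5: "is", 6: "simple",
}

def toy_decode(ids, id_to_token):
    """Map IDs back to tokens, gluing '##' pieces onto the preceding word."""
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append without a space
        else:
            words.append(token)     # start of a new word
    return " ".join(words)

print(toy_decode([0, 1, 2, 3, 4, 5, 6], toy_id_to_token))
# Using a Transformer network is simple
```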