# Sub-word-level Tokenizing

Recall the flaws of the two earlier approaches:
1. Word-based approaches

* Very large vocabularies
* Large quantity of out-of-vocabulary tokens
* Loss of meaning across very similar words (dog, dogs, sun, sunny, hap, happy)

2. Character-based approaches

* Very long sequences
* Less meaningful individual tokens

So, sub-word tokenization looks for a middle ground between the two:

1. Frequently used words should not be split (into subwords)
2. Rare words should be split into "meaningful" subwords

`dog` -> `dog`
`dogs` -> `dog` `s`

See the [Hugging Face video on it](https://www.youtube.com/watch?v=zHvTiHr506c).

A good example given there is `tokenization`:

`token` -- `ization`  

`token`: token, tokens, tokenizing, tokenization, tokenized, tokenizes, tokenizable, tokenizability  
`ization`: tokenization, modernization, globalization, industrialization, organization, realization, utilization, ...  

* Sub-word-based tokenizers typically mark where a word was split with a special prefix (or suffix):

`token` `##ization` (BERT)

In BERT, `##` is a fixed marker meaning "this piece continues the previous one". The number of `#` characters does not encode how many letters preceded the split. Other tokenizers use different markers.
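
To make the `##` convention concrete, here is a minimal sketch of a greedy longest-match-first split, which is how BERT-style WordPiece tokenizers are usually described at inference time. The function name `wordpiece_tokenize` and the toy vocabulary are made up for this example; a real tokenizer also handles normalization, punctuation, and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into sub-words."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary, invented for illustration only.
toy_vocab = {"dog", "token", "##s", "##ization", "##ized"}

print(wordpiece_tokenize("dog", toy_vocab))           # ['dog']
print(wordpiece_tokenize("dogs", toy_vocab))          # ['dog', '##s']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("cat", toy_vocab))           # ['[UNK]']
```

The frequent word `dog` stays whole, while `dogs` and `tokenization` are split into a known stem plus a continuation piece.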

# Sub-Word Algorithms

1. WordPiece (BERT, DistilBERT, ...)
2. Unigram (XLNet, ALBERT, T5, ...)
3. Byte-Pair Encoding (RoBERTa, GPT-2+, ...)
4. [SentencePiece](https://huggingface.co/docs/transformers/tokenizer_summary#sentencepiece+) (a framework that works on raw text rather than a separate algorithm; the models that use it, e.g. XLNet, ALBERT, T5, also use Unigram)

## WordPiece

* [Hugging Face Video](https://www.youtube.com/watch?v=qpv6ms_t_1A)
* [Google Paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf)

* The training algorithm is not open source (Google), so the steps below are an approximation.

Pseudo-algorithm:

1. (Similar to BPE) Start with a corpus and split each word into the sequence of characters that make it up: `huggingface` -> `h` `##u` `##g` ... `##e`. Note the `##` prefix on letters that do not start a word.
2. Keep only one occurrence of each elementary unit, i.e. the vocabulary starts as `a` ... `z` plus `##e`, ... (only the non-initial letters that actually occurred).
3. List all existing pairs in the corpus (e.g. `h`+`##u`, `##u`+`##g`, ...) and score each pair via:

$$\text{score} = \frac{\text{freq}(\text{pair})}{\text{freq}(\text{first element}) \times \text{freq}(\text{second element})}$$

4. Add the pair with the highest score to the vocabulary.
5. Merge that pair wherever it occurs in the splits.
6. Iterate until the vocabulary reaches the desired size.
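
The steps above can be sketched in a few lines of Python. This is only an illustration of the pseudo-algorithm, not Google's implementation: the function names and the toy word counts are invented, and the initial vocabulary contains only the units that occur in the toy corpus.

```python
from collections import Counter


def init_splits(word_freqs):
    # Step 1: split each word into characters, prefixing non-initial ones with "##".
    return {w: [c if i == 0 else "##" + c for i, c in enumerate(w)]
            for w in word_freqs}


def pair_scores(word_freqs, splits):
    # Step 3: score = freq(pair) / (freq(first) * freq(second)).
    unit_freq, pair_freq = Counter(), Counter()
    for word, freq in word_freqs.items():
        pieces = splits[word]
        for piece in pieces:
            unit_freq[piece] += freq
        for a, b in zip(pieces, pieces[1:]):
            pair_freq[(a, b)] += freq
    return {pair: f / (unit_freq[pair[0]] * unit_freq[pair[1]])
            for pair, f in pair_freq.items()}


def merge_pair(a, b, splits):
    # Step 5: replace every adjacent (a, b) in the splits with the merged unit.
    merged = a + (b[2:] if b.startswith("##") else b)
    for word, pieces in splits.items():
        out, i = [], 0
        while i < len(pieces):
            if i < len(pieces) - 1 and pieces[i] == a and pieces[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(pieces[i])
                i += 1
        splits[word] = out
    return merged


def train_wordpiece(word_freqs, vocab_size):
    splits = init_splits(word_freqs)
    # Step 2: one entry per elementary unit that actually occurs.
    vocab = sorted({p for pieces in splits.values() for p in pieces})
    while len(vocab) < vocab_size:  # Step 6
        scores = pair_scores(word_freqs, splits)
        if not scores:
            break
        best = max(scores, key=scores.get)      # Step 4
        vocab.append(merge_pair(*best, splits))
    return vocab, splits


# Toy word counts, invented for illustration.
toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, splits = train_wordpiece(toy_corpus, vocab_size=8)
print(vocab[-1])        # ##gs
print(splits["hugs"])   # ['h', '##u', '##gs']
```

Note that the first merge is `##g` + `##s`, not the most frequent pair `##u` + `##g`: dividing by the frequencies of the parts favours pairs whose parts rarely appear on their own.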

## Unigram Tokenization

The overall strategy is to start with a very large vocabulary and iteratively shrink it. In each iteration a unigram loss is calculated over the corpus, and the bottom _p_ tokens (those whose removal hurts the loss least) are removed.

* The Unigram model is a statistical language model: it assigns a probability $$P(t_1, t_2, t_3, \ldots, t_N)$$ to a sequence of tokens.

* It assumes that each token occurs independently of the tokens before it. Thus the probability of the whole sequence is simply a product:

$$P(t_1, t_2, t_3, \ldots, t_N) = P(t_1) \times P(t_2) \times P(t_3) \times \cdots \times P(t_N)$$

i.e. the probability of a text is the product of the probabilities of the tokens that compose it. This means that it can't perform meaningful text generation: with no context to condition on, it will always predict the single highest-probability token. So, what's it good for?

It is a useful model for estimating the relative likelihood of different token sequences, for example alternative ways of splitting the same word into sub-words.
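
As a small illustration of that use, the sketch below scores every possible segmentation of a word under a unigram model and picks the most likely one. The token counts are invented, and `token_prob`, `sequence_prob`, and `segmentations` are hypothetical helper names; real implementations use a dynamic-programming (Viterbi) search instead of enumerating all splits.

```python
import math
from collections import Counter

# Toy token counts, invented for illustration.
token_counts = Counter({"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "hug": 15})
total = sum(token_counts.values())


def token_prob(token):
    # Unigram estimate: relative frequency of the token.
    return token_counts[token] / total


def sequence_prob(tokens):
    # Independence assumption: P(t_1, ..., t_N) = P(t_1) * ... * P(t_N).
    return math.prod(token_prob(t) for t in tokens)


def segmentations(word):
    # Enumerate every way to split `word` into tokens from the vocabulary.
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in token_counts:
            for rest in segmentations(word[end:]):
                yield [head] + rest


candidates = sorted(segmentations("hug"), key=sequence_prob, reverse=True)
for tokens in candidates:
    print(tokens, f"{sequence_prob(tokens):.4f}")
```

With these counts the single token `hug` is the most likely segmentation and `h` `u` `g` the least likely, since every extra token multiplies in another factor below 1.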

**TODO Fill iterations**

* Each iteration, calculate the probability of each token.
* Remove the token whose removal impacts the loss the least, provided it is not an elementary token (removing an elementary token would prevent ever spelling a word that contains it).


## Byte-Pair Encoding (BPE) Tokenization

The idea was originally a text compression algorithm.

Pseudo-algorithm:

* Take a corpus, split it into words, then split each word into characters
* Count the frequency of each adjacent pair and find the most common pair
* Merge the most common pair into a new token and add it to the vocabulary
* Iterate until the desired vocabulary size is reached

* [HuggingFace Video](https://www.youtube.com/watch?v=HEikzVL-lZU)
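
The pseudo-algorithm above can be sketched as follows. This is a character-level toy version under simplifying assumptions (no byte-level handling, no end-of-word marker); `train_bpe` and the word counts are made up for the example.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges):
    # Start from characters: every word is a list of single-character tokens.
    splits = {word: list(word) for word in word_freqs}
    vocab = sorted({ch for word in word_freqs for ch in word})
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair occurs, weighted by word frequency.
        pair_freq = Counter()
        for word, freq in word_freqs.items():
            pieces = splits[word]
            for a, b in zip(pieces, pieces[1:]):
                pair_freq[(a, b)] += freq
        if not pair_freq:
            break  # every word is already a single token
        (a, b), _count = pair_freq.most_common(1)[0]
        # Merge the most common pair everywhere it occurs.
        for word, pieces in splits.items():
            out, i = [], 0
            while i < len(pieces):
                if i < len(pieces) - 1 and pieces[i] == a and pieces[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(pieces[i])
                    i += 1
            splits[word] = out
        merges.append((a, b))
        vocab.append(a + b)
    return vocab, merges, splits


# Toy word counts, invented for illustration.
toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, merges, splits = train_bpe(toy_corpus, num_merges=3)
print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ['hug', 's']
```

Here the stopping criterion is a number of merges rather than a target vocabulary size; the two are equivalent, since each merge adds exactly one token.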

## SentencePiece

# Imports

In [None]:
%load_ext kedro.ipython
%reload_kedro

import re
from typing import Any, Dict, List, Tuple

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from transformers import AutoTokenizer

## Tokenizing with a Pretrained Tokenizer (DistilBERT, WordPiece)

In [None]:
model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model)

In [None]:
s = "Stars, hide your fires; Let not light see my black and deep desires."

In [None]:
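# Calling the tokenizer returns a dict-like BatchEncoding (input_ids, attention_mask, ...)
encoded = tokenizer(s)
print(f"{encoded = }")
# encode() returns only the list of token ids
encoded = tokenizer.encode(s)
print(f"{encoded = }")
# encode_plus() returns the same dict-like output for a single string
encoded_plus = tokenizer.encode_plus(s)
print(f"{encoded_plus = }")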

And to recover the actual tokens from the ids:

In [None]:
print(tokenizer.convert_ids_to_tokens(encoded))

What about special tokens and padding when encoding several sequences at once?

In [None]:
s = ["In the caldron boil and bake;",
    "Eye of newt and toe of frog",
    "Wool of bat and tongue of dog",
    "Adder's fork and blind-worm's sting",
    "Lizard's leg and howlet's wing",
    "For a charm of powerful trouble",
    "Like a hell-broth boil and bubble."
]
encoded = tokenizer(s, padding=True, add_special_tokens=True)

# print ids
for a in encoded['input_ids']:
    print(a)

# print attention mask
for a in encoded['attention_mask']:
    print(a)

for a in encoded['input_ids']:
    print(tokenizer.convert_ids_to_tokens(a))

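The typical special tokens (BERT-style) are:

* [CLS] - At the beginning of the sequence
* [SEP] - At the end of a sequence, and between two sequences
* [UNK] - Replaces a token that is not in the vocabulary
* [PAD] - Appended to shorter sequences so that all sequences in a batch have the same length
* [MASK] - Masks (covers, hides) tokens for the masked-token prediction task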
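
To make the roles of `[CLS]`, `[SEP]`, and `[PAD]` concrete, here is a plain-Python sketch of roughly what adding special tokens and `padding=True` amount to. It does not call the `transformers` library: `add_special_tokens` and `pad_batch` are hypothetical helpers, the word ids are invented, and the special-token ids (0, 101, 102) are assumed to match the BERT uncased vocabulary.

```python
# Assumed special-token ids (as in the BERT uncased vocabulary).
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def add_special_tokens(ids, pair_ids=None):
    # [CLS] first, [SEP] after each sequence.
    out = [CLS_ID] + ids + [SEP_ID]
    if pair_ids is not None:
        out += pair_ids + [SEP_ID]
    return out


def pad_batch(batch, pad_id=PAD_ID):
    # Right-pad every sequence to the length of the longest one.
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Word ids below are invented for illustration.
batch = [add_special_tokens([7, 8, 9]), add_special_tokens([5])]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[101, 7, 8, 9, 102], [101, 5, 102, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The attention mask is what lets the model ignore the `[PAD]` positions, which is why the tokenizer returns it alongside the ids.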