# Word Based Tokenizer

- Split the text on some delimiter.
- Most basically: split on whitespace.
- For more advanced behaviour, also split on punctuation and other characters.

Depending on your delimiter, you will get different tokens with different meanings. 

In [14]:
python_zen = "Special cases aren't special enough to break the rules."
print(python_zen.split())

['Special', 'cases', "aren't", 'special', 'enough', 'to', 'break', 'the', 'rules.']


`rules.` will be a different token from `rules`, resulting in many tokens that represent the same word.

**We should probably split on punctuation as well**

In [21]:
from nltk.tokenize import wordpunct_tokenize, word_tokenize
print(wordpunct_tokenize(python_zen))

['Special', 'cases', 'aren', "'", 't', 'special', 'enough', 'to', 'break', 'the', 'rules', '.']


In [23]:
print(word_tokenize(python_zen))

['Special', 'cases', 'are', "n't", 'special', 'enough', 'to', 'break', 'the', 'rules', '.']


- Splitting `aren't` into `are` and `n't` is still fairly simple, since `n't` is typically appended to negate a word.
- Having a rule to handle `n't` works well enough.
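
A rule of this kind can be sketched in a few lines. This is only an illustration: `split_negation` is a hypothetical helper written for this example, not how NLTK implements `word_tokenize`.

```python
import re


def split_negation(token):
    """Split a trailing "n't" off a token, e.g. "aren't" -> ["are", "n't"]."""
    match = re.fullmatch(r"(.+)n't", token)
    if match:
        return [match.group(1), "n't"]
    return [token]


tokens = []
for word in "Special cases aren't special enough".split():
    tokens.extend(split_negation(word))

print(tokens)
# ['Special', 'cases', 'are', "n't", 'special', 'enough']
```

Every new exception (`'ll`, `'re`, possessives, ...) would need its own rule, which is why hand-written rules only get you so far.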


However, this doesn't hold for other combined words. For example:

- `token` 
- `tokens`
- `tokenizer`
- `tokenization` 

are all unique tokens and each gets its own unique ID, even though they share the same root.

This results in **a lot** of tokens!
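
As a small illustration (the dictionary name `word_to_id` is made up for this example), a word-level vocabulary assigns unrelated IDs to closely related words:

```python
words = ["token", "tokens", "tokenizer", "tokenization"]

# Each distinct surface form gets its own ID; nothing is shared between them.
word_to_id = {word: idx for idx, word in enumerate(words)}

print(word_to_id)
# {'token': 0, 'tokens': 1, 'tokenizer': 2, 'tokenization': 3}
```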

A common way of dealing with this has been to limit the vocabulary size. Typically, only the $n$ most frequent words are kept in the vocab, and every other word is mapped to a single unknown token.
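
A minimal sketch of such a frequency-limited vocabulary, using only the standard library. The helper names `build_vocab` and `encode`, the `<unk>` token, and the toy corpus are assumptions made for this example.

```python
from collections import Counter


def build_vocab(words, n, unk_token="<unk>"):
    """Map the n most frequent words to IDs; ID 0 is reserved for unknown words."""
    counts = Counter(words)
    vocab = {unk_token: 0}
    for word, _ in counts.most_common(n):
        vocab[word] = len(vocab)
    return vocab


def encode(words, vocab, unk_token="<unk>"):
    """Replace each word by its ID, falling back to the unknown ID."""
    return [vocab.get(word, vocab[unk_token]) for word in words]


corpus = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(corpus, n=2)

print(vocab)                                  # {'<unk>': 0, 'the': 1, 'cat': 2}
print(encode(["the", "dog", "cat"], vocab))   # [1, 0, 2]
```

The price of the smaller vocabulary is that every rare word collapses into the same ID, so the model cannot tell them apart.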

# Character Based Tokenizers
These tokenizers are conceptually very simple. Each character is a token.

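- `Small vocabulary`. ASCII has 128 characters (256 values fit in one byte). Unicode has room for about 1.1M code points.
- No out-of-vocabulary (OOV) words/characters.
- Lose a lot of meaningful information, since a single character carries little meaning on its own.
- Models typically have `max input lengths`, and character sequences are much longer than word sequences.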
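
To make the trade-off concrete, the sketch below compares sequence lengths for the sentence used earlier and prints the size of Python's Unicode code space. The variable names are chosen for this example only.

```python
import sys

sentence = "Special cases aren't special enough to break the rules."

word_tokens = sentence.split()
char_tokens = list(sentence)

# The same sentence is several times longer as a character sequence.
print(len(word_tokens))  # 9
print(len(char_tokens))  # 55

# Upper bound on a Unicode character vocabulary (number of code points).
print(sys.maxunicode + 1)  # 1114112
```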

In [8]:
list('Cat')

['C', 'a', 't']

## Mapping to a vocabulary
The vocabulary of a text will be really simple, with each index mapping to a single character.

In [5]:
dict(enumerate(list(python_zen)[:5]))

{0: 'S', 1: 'p', 2: 'e', 3: 'c', 4: 'i'}
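
The cell above maps positions to characters. A reusable vocabulary would instead map each *distinct* character to one ID. A minimal sketch, assuming the vocabulary is built from the sentence itself (`char_to_id` and `id_to_char` are names invented for this example):

```python
text = "Special cases aren't special enough to break the rules."

# One ID per distinct character, in sorted order so the mapping is reproducible.
chars = sorted(set(text))
char_to_id = {ch: idx for idx, ch in enumerate(chars)}
id_to_char = {idx: ch for ch, idx in char_to_id.items()}

encoded = [char_to_id[ch] for ch in "rules"]
decoded = "".join(id_to_char[i] for i in encoded)

print(len(char_to_id))  # number of distinct characters in the text
print(encoded)
print(decoded)          # rules
```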

# Character based tokenization of Chinese characters
*Disclaimer: I do not speak a word of Chinese*

- Characters have meaning by themselves
- But the meaning can change when characters are combined

For example:

[馬來西亞 means Malaysia, but taken separately the characters mean "Horse come to Western Asia"](https://medium.com/@jjsham/nlp-tokenizing-chinese-phases-3302da4336bf).
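
In Python, a character-level split of this word is just a split into its four code points, so the combined meaning is not visible to the tokenizer:

```python
word = "馬來西亞"

# Iterating over a str yields one Unicode code point per step.
char_tokens = list(word)

print(char_tokens)       # ['馬', '來', '西', '亞']
print(len(char_tokens))  # 4
```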

Some good reads that use character based tokenization:
1. [Neural Machine Translation in Linear Time - Kalchbrenner et al. 2017.](https://arxiv.org/pdf/1610.10099.pdf)
2. [Fully Character-Level Neural Machine Translation without Explicit Segmentation - Lee et al. 2017.](https://aclanthology.org/Q17-1026.pdf)
3. [Learning to Generate Reviews and Discovering Sentiment - Radford et al. 2017.](https://arxiv.org/pdf/1704.01444.pdf)

From these papers, you can already see that character based tokenizers were mainly used a while back, around 2017.