Text must be converted into numbers before it is passed to a transformer. Some examples of tokenization algorithms are:

### Word-Based

There are different types of word tokenization methods. The simplest one is whitespace tokenization:

In [1]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


* Some word tokenizers add extra rules for punctuation
* After tokenization, each word is assigned a unique ID, starting from 0
* The set of these words is called the vocabulary
* Words not in the vocabulary are represented by the [UNK] token
* The goal when crafting the vocabulary is for the tokenizer to produce as few unknown tokens as possible
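The points above can be sketched in plain Python. This is only an illustration: the toy corpus, the helper names `build_vocab` and `encode`, and the choice to reserve ID 0 for `[UNK]` are assumptions made here, not part of any library.

```python
# Toy corpus, made up for illustration.
corpus = ["the cat sat on the mat", "the dog sat"]

UNK = "[UNK]"

def build_vocab(sentences):
    """Give each distinct whitespace-separated word a unique ID, starting from 0."""
    vocab = {UNK: 0}  # assumption: ID 0 is reserved for unknown words
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map each word to its ID; words outside the vocabulary map to [UNK]."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]

vocab = build_vocab(corpus)
print(vocab)
print(encode("the bird sat", vocab))  # "bird" is not in the vocabulary
```

Any word missing from the corpus collapses to the same `[UNK]` ID, which is why a vocabulary that leaves many words out loses information.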

One way to reduce the number of unknown tokens is to go one level deeper and use a character-based tokenizer.

### Character-Based

* Splits documents into characters, rather than words
* Vocabulary is much smaller
* Fewer unknown tokens

**Problem**

* A single character carries less meaning than a word (how much less depends on the language)
* The model has to process a much larger number of tokens
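The trade-off can be seen on a single sentence. The sketch below uses only built-in Python; the sentence is an arbitrary example chosen here.

```python
text = "tokenization is simple"

word_tokens = text.split()   # word-based: split on whitespace
char_tokens = list(text)     # character-based: one token per character (spaces included)

# The character vocabulary is the set of distinct characters seen.
char_vocab = sorted(set(char_tokens))

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters in the vocabulary")
```

The character vocabulary stays small however much text is added, but the same sentence becomes a much longer token sequence.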




### Subword Tokenization

* Built on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords
* Example: `annoyingly` might be considered rare and could be decomposed into `annoying` and `ly`
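A rough sketch of how such a split can be computed is greedy longest-match-first lookup, in the style of WordPiece. Everything here is illustrative: `subword_tokenize` and `toy_vocab` are hypothetical names, the vocabulary is hand-written (real ones are learned from a corpus), and the `##` prefix for word-internal pieces follows the convention visible in BERT tokenizer output.

```python
def subword_tokenize(word, vocab):
    """Split one word into subwords by repeatedly taking the longest match in vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hand-written toy vocabulary (an assumption for this example).
toy_vocab = {"the", "annoying", "##ly", "quick", "##er"}

print(subword_tokenize("the", toy_vocab))         # frequent word, kept whole
print(subword_tokenize("annoyingly", toy_vocab))  # rare word, decomposed
print(subword_tokenize("xyz", toy_vocab))         # nothing matches
```

With this vocabulary, `annoyingly` is split into `annoying` and `##ly`, matching the example above.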

### Loading and Saving

In [None]:
# Loading Tokenizer

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# AutoTokenizer, similar to AutoModel, picks the matching tokenizer class for the checkpoint

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
# Using Tokenizer

tokenizer("Using a Transformer network is simple")

```py
{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

In [None]:
# Saving tokenizer
tokenizer.save_pretrained("dir")

## Encoding

* Translating text to numbers
* Done in two steps:
    * Tokenization
    * Conversion to input IDs


### Tokenization

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

```py
['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
```

### From tokens to input IDs

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

```py
[7993, 170, 11303, 1200, 2443, 1110, 3014]
```

## Decoding

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

```py
'Using a Transformer network is simple'
```
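Decoding does more than look up each ID: subword pieces have to be merged back into whole words. The following is a minimal sketch of that idea, not the library's implementation; `toy_decode` and `id_to_token` are hypothetical names with a made-up mapping, and the `##` prefix is assumed to mark word-internal pieces.

```python
# Made-up ID-to-token mapping for illustration.
id_to_token = {0: "[UNK]", 1: "annoying", 2: "##ly", 3: "slow"}

def toy_decode(ids, id_to_token):
    """Convert IDs to tokens, then glue '##' pieces onto the preceding word."""
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append without a space
        else:
            words.append(token)     # start of a new word
    return " ".join(words)

print(toy_decode([1, 2, 3], id_to_token))
```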