## Tokenizers

The objective of a tokenizer is to translate text into numbers that the model can process.

This step is a key component of NLP pipelines.

There are different types of tokenizer algorithms, and in this section we will dive into a few:

1. **Word-based tokenizers**

This is the idea of splitting raw text into words. Each word has a specific ID attached to it.

There are different ways to split a text. For example, whitespace can be used to tokenize text into words by applying Python's `split()` function:



In [2]:
tokenized_text = "hello lovely world".split()
print(tokenized_text)

['hello', 'lovely', 'world']


The downside of a word-based tokenizer is that it needs a very large vocabulary, because every distinct word gets its own ID. Related words are also treated as unrelated: *dog* and *dogs* are assigned different IDs, so the model starts out seeing no connection between them.

In addition, any word that is not in the vocabulary is mapped to an unknown token, often written "<unk>". Seeing many unknown tokens is generally a bad sign, because the information carried by those words is lost.
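To make both downsides concrete, here is a minimal sketch of a word-level lookup. The vocabulary, its IDs, and the helper `encode_words` are made up for illustration and do not come from any real model.

```python
# Toy word-level vocabulary; the words and IDs are hypothetical.
vocab = {"<unk>": 0, "hello": 1, "lovely": 2, "world": 3, "dog": 4, "dogs": 5}


def encode_words(text, vocab):
    """Split on whitespace and map each word to its ID, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


known_ids = encode_words("hello lovely world", vocab)
related_ids = encode_words("dog dogs", vocab)
unknown_ids = encode_words("hello lovely puppy", vocab)

print(known_ids)    # [1, 2, 3]
print(related_ids)  # [4, 5] -> two unrelated IDs for two related words
print(unknown_ids)  # [1, 2, 0] -> "puppy" is not in the vocabulary
```

Covering "puppy" here would mean adding one more entry to the vocabulary, and the same goes for every other new word or word form.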

One way to reduce the number of unknown tokens is to go one level deeper, using a *character-based tokenizer*.

2. **Character-based tokenizers**

Text is split into individual characters as opposed to words. This has two main benefits:

* the vocabulary is much smaller
* there are far fewer out-of-vocabulary tokens, since every word can be built from characters

Some concerns:

* intuitively, one can argue that characters are less meaningful, since they don't mean much on their own
* we'll likely end up with a very large number of tokens to be processed by the model
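The trade-off can be seen on the same sentence used earlier. This is only a rough sketch: it treats every character, including spaces, as a token, which is a simplification of what a real character-based tokenizer would do.

```python
text = "hello lovely world"

word_tokens = text.split()
char_tokens = list(text)  # every character, spaces included, becomes a token

# The character vocabulary is just the set of distinct characters seen.
char_vocab = sorted(set(char_tokens))

print(len(word_tokens))  # 3
print(len(char_tokens))  # 18
print(len(char_vocab))   # 10
```

The vocabulary stays tiny (10 entries here), but the model now has to process 18 tokens instead of 3.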


3. **Subword tokenizers**

These rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.
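As a rough illustration of this principle, the sketch below splits a word by greedily matching the longest piece found in a vocabulary, in the style of WordPiece lookup. The vocabulary and the helper `split_word` are hypothetical; real subword tokenizers learn their vocabulary from a large corpus.

```python
# Hypothetical subword vocabulary; "##" marks a piece that continues a word.
subword_vocab = {"using", "a", "network", "is", "simple", "transform", "##er", "##s"}


def split_word(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


frequent = split_word("network", subword_vocab)
rare = split_word("transformers", subword_vocab)

print(frequent)  # ['network']
print(rare)      # ['transform', '##er', '##s']
```

The frequent word stays whole, while the rarer one is rebuilt from pieces that are already in the vocabulary, so no unknown token is needed.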

### Loading and saving tokenizers

- Loading and saving tokenizers is based on the same two methods used for models: `from_pretrained()` and `save_pretrained()`. These two methods load or save the algorithm used by the tokenizer (along with its vocabulary).

- Let's check an example with the `BertTokenizer` class:

In [2]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")




As with `TFAutoModel`, the `AutoTokenizer` class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# we can now use the tokenizer as was shown earlier

tokenizer("using a transformer network is simple")


{'input_ids': [101, 1606, 170, 11303, 1200, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [4]:
# we can now save the tokenizer to a local directory with save_pretrained()

tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

### Encoding

***The tokenizer pipeline***

Encoding is the process of translating text to numbers. It is done in two steps:

- tokenization
- conversion to input IDs

Step 1 (tokenization): the text is split into words, subwords, characters, etc., usually called tokens. Because multiple rules can govern this process, it is essential to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

Step 2 (conversion to input IDs): the tokens generated in step 1 are converted into numbers. This allows us to build a tensor out of them in order to feed them to the model. To achieve this, the tokenizer has a vocabulary, which is the part that is downloaded when we instantiate it with the `from_pretrained()` method.

Let's test it out with some code:


In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "using a transformer network is simple"

tokens = tokenizer.tokenize(sequence)

print(tokens)

['using', 'a', 'transform', '##er', 'network', 'is', 'simple']


In [6]:
# to convert to input IDs we use the convert_tokens_to_ids() tokenizer method:

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[1606, 170, 11303, 1200, 2443, 1110, 3014]
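Notice that this list is shorter than the `input_ids` returned when we called the tokenizer directly: that call also wrapped the sequence in 101 and 102. Here is a minimal sketch of the difference, assuming 101 and 102 are the `[CLS]` and `[SEP]` special-token IDs for this checkpoint (the names `CLS_ID` and `SEP_ID` are ours, not part of the library).

```python
# IDs printed by convert_tokens_to_ids() above.
ids = [1606, 170, 11303, 1200, 2443, 1110, 3014]

# Assumption: 101 and 102 are the [CLS] and [SEP] IDs for bert-base-cased.
CLS_ID, SEP_ID = 101, 102

# Wrapping the sequence reproduces the input_ids from the direct call.
input_ids = [CLS_ID] + ids + [SEP_ID]

# For a single unpadded sentence, the other two fields are constant.
token_type_ids = [0] * len(input_ids)
attention_mask = [1] * len(input_ids)

print(input_ids)
```

In practice the tokenizer adds these special tokens for us when it is called directly, so there is no need to hard-code them.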


#### Decoding

This is the reverse of encoding, i.e. converting the IDs back to a string.

- To achieve this we use the `decode()` method:
  

In [8]:
decode_string = tokenizer.decode([1606, 170, 11303, 1200, 2443, 1110, 3014])
print(decode_string)

using a transformer network is simple
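Note that `transform` and `##er` came back as the single word "transformer": decoding does more than map IDs back to tokens; it also glues the pieces of a word together. The sketch below imitates only that merging step on the tokens we printed earlier. The helper `merge_wordpieces` is hypothetical and much simpler than what `decode()` actually does.

```python
tokens = ['using', 'a', 'transform', '##er', 'network', 'is', 'simple']


def merge_wordpieces(tokens):
    """Join tokens with spaces, attaching '##' pieces to the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: no space before it
        else:
            words.append(token)
    return " ".join(words)


merged = merge_wordpieces(tokens)
print(merged)  # using a transformer network is simple
```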
