In [None]:
!pip install tokenizers
!pip install transformers

# Tokenizers and Vocabulary

A tokenizer breaks each data point into the smallest units used for the task. These units are called ``tokens``, and the process of splitting the input into them is called ``tokenization``.

The ``vocabulary`` is the set of all unique tokens produced by the tokenizer.
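
To make the two definitions concrete, here is a minimal sketch that tokenizes a toy sentence, builds its vocabulary, and maps every token to an integer id. The sentence and the names `token_to_id` and `ids` are made up for illustration, and sorting the vocabulary is just one convenient way to get stable ids.

```python
# Toy sentence chosen for illustration only
sentence = "the duck and the duckling"

# Tokenization: split the text into tokens (here, on whitespace)
tokens = sentence.split()

# Vocabulary: the unique tokens, sorted so that ids are reproducible
vocabulary = sorted(set(tokens))

# Map each token to an integer id, then encode the sentence
token_to_id = {token: idx for idx, token in enumerate(vocabulary)}
ids = [token_to_id[token] for token in tokens]

print(tokens)
print(vocabulary)
print(ids)
```

Note that the sentence has five tokens but only four vocabulary entries, because `the` appears twice.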

There are three broad categories of tokenizers, based on how they split a text input:


1.   Word Level
2.   Subword Level
3.   Character Level



Let's see how to perform each kind of tokenization. For this demo we will use English data from tatoeba.org.

Link : [here](https://tatoeba.org/)

## Word-Level Tokenization

In [None]:
lines = ["I trained the model using the train dataset and used the test data set to test it",
        "I ate a dumpling yesterday",
        "The duck and the duckling swam across the lake",
        ]

In [None]:
# Join all the lines into a single string
text = " ".join(lines)

In [None]:
# Word-level tokenization: split the text on whitespace
tokens = text.split()
print(tokens[:10])

['I', 'trained', 'the', 'model', 'using', 'the', 'train', 'dataset', 'and', 'used']


In [None]:
vocabulary_wordlevel = set(tokens)
print(f"Length of the vocabulary: {len(vocabulary_wordlevel)}")

Length of the vocabulary: 24
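
Because `split()` only cuts on whitespace, the vocabulary above counts `The` and `the` as two different tokens, and any punctuation would stay glued to its word. The sketch below shows one possible normalization, assuming we are happy to lowercase everything and keep only letters, digits, and apostrophes. The helper `word_tokenize` and the regular expression are illustrative choices, not part of any library.

```python
import re

lines = [
    "I trained the model using the train dataset and used the test data set to test it",
    "I ate a dumpling yesterday",
    "The duck and the duckling swam across the lake",
]

def word_tokenize(text):
    # Lowercase first, then keep runs of letters, digits and apostrophes as words
    return re.findall(r"[a-z0-9']+", text.lower())

normalized_tokens = word_tokenize(" ".join(lines))
normalized_vocabulary = set(normalized_tokens)

print(f"Length of the normalized vocabulary: {len(normalized_vocabulary)}")
```

With lowercasing, `The` and `the` collapse into one entry, so the vocabulary shrinks by one token.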


## Subword-Level Tokenization

In [None]:
from transformers import AutoTokenizer

def pretrained_subword_tokenize(lines, model_name="bert-base-uncased"):
    # Load the pretrained tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Collect the subword tokens of every line.
    # Duplicates are kept here; the caller removes them with set().
    all_tokens = []
    for line in lines:
        # Tokenize the text into subwords
        all_tokens.extend(tokenizer.tokenize(line))
    return all_tokens

def tokenize(line, model_name="bert-base-uncased"):
    # Load the pretrained tokenizer and split a single line into subwords
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer.tokenize(line)

In [None]:
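vocab_subword_level = pretrained_subword_tokenize(lines)
# Strip the "##" continuation marker so that a word piece counts once,
# whether it starts a word ("set") or continues one ("##set")
vocab_subword_level = {subword.replace("##", "") for subword in vocab_subword_level}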

print(vocab_subword_level)
print(len(vocab_subword_level))



{'used', 'data', 'ate', 'to', 'ling', 'lake', 'model', 'yesterday', 'swam', 'duck', 'the', 'it', 'test', 'and', 'using', 'dump', 'set', 'i', 'across', 'train', 'a', 'trained'}
22
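
The `##` pieces in the output come from WordPiece, the subword scheme used by BERT tokenizers, which splits a word by greedy longest-match-first lookup in its vocabulary. The sketch below imitates that lookup on a tiny hand-written vocabulary so it runs without downloading a model. The function `wordpiece_tokenize` and the contents of `toy_vocab` are assumptions made for illustration; the real tokenizer also normalizes text and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real BERT vocabulary
toy_vocab = {"duck", "dump", "data", "train", "##ling", "##set", "##ed"}

for word in ["duckling", "dumpling", "dataset", "trained", "lake"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

`duckling` and `dumpling` share the piece `##ling`, which is how a subword vocabulary covers many words with few entries. `lake` has no matching piece in the toy vocabulary, so it falls back to `[UNK]`.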


## Character-Level Tokenization

In [None]:
tokens = list(text.lower())
vocab_character_level = set(tokens)

print(vocab_character_level)
print(len(vocab_character_level))


{'c', 'r', 'd', 'u', 'n', 's', 'e', 'p', 'h', 'o', 't', 'g', 'y', 'w', ' ', 'i', 'k', 'a', 'm', 'l'}
20
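
A small vocabulary has a cost: the same text turns into a longer token sequence. The sketch below compares the sequence length of one sentence from our data under word-level and character-level tokenization. The variable names are illustrative only.

```python
sentence = "I ate a dumpling yesterday"

# Word level: one token per whitespace-separated word
word_tokens = sentence.split()

# Character level: one token per character, spaces included
char_tokens = list(sentence.lower())

print(f"Word-level sequence length: {len(word_tokens)}")
print(f"Character-level sequence length: {len(char_tokens)}")
```

The model has to process one step per token, so the character-level sequence is several times longer than the word-level one for the same sentence.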


**Observation**

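In this example, word-level tokenization produces the largest vocabulary (24 tokens), subword tokenization an intermediate one (22), and character-level tokenization the smallest (20). On larger corpora the gap widens: the word-level vocabulary keeps growing as new words appear, while the character-level vocabulary is bounded by the size of the alphabet.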

So why does the size of the vocabulary matter so much?

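This is because an RNN, like any other language model, is a probabilistic model: given the current context, it assigns a probability to every token in the vocabulary as a candidate for the next token. The vocabulary size therefore sets the number of candidates the model must score at each step. This will become clearer in our decoding tutorial.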
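
As a rough illustration, the sketch below applies a softmax to one score (logit) per vocabulary entry, which is a common way for a language model to produce next-token probabilities. The all-zero logits stand in for an untrained model with no preference and are an assumption for the demo; the vocabulary sizes are the ones we measured above.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Vocabulary sizes measured earlier in this notebook
vocab_sizes = {"word": 24, "subword": 22, "character": 20}

for level, size in vocab_sizes.items():
    # One logit per vocabulary entry; zeros mean "no preference yet"
    probs = softmax([0.0] * size)
    print(f"{level}: {len(probs)} candidates, {probs[0]:.4f} probability each")
```

The output distribution always has exactly one entry per vocabulary token, so a larger vocabulary means more scores to compute and a probability mass spread over more candidates.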