# Hugging Face Tokenizers



Hugging Face tokenizers break text into smaller, manageable pieces called tokens. They are easy to use and fast, because the core of the library is written in Rust.



## Technical Terms Explained:


- **Tokenization**: It's like cutting a sentence into individual pieces, such as words or characters, to make it easier to analyze.

- **Tokens**: These are the pieces you get after cutting up text during tokenization, like individual Lego blocks. A token can be a word, part of a word, or even a single character. Each token is then mapped to a numerical ID, because models operate on numbers rather than raw text.

- **Pre-trained Model**: This is a ready-made model that has already been trained on a large amount of data.

- **Uncased**: This means that the text is lowercased before tokenization, so the model treats uppercase and lowercase letters as the same. For example, "AI" and "ai" map to the same token.
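To make these terms concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how the Hugging Face library is implemented: the vocabulary `toy_vocab` and the helpers `toy_tokenize` and `toy_convert_tokens_to_ids` are hypothetical names invented for this example, and the IDs are arbitrary.

```python
# Toy illustration only: a tiny hypothetical vocabulary, not BERT's real one.
toy_vocab = {"[UNK]": 0, "i": 1, "heart": 2, "tokens": 3, "and": 4, "models": 5}


def toy_tokenize(text):
    """Lowercase the text (the "uncased" step) and split it on whitespace."""
    return text.lower().split()


def toy_convert_tokens_to_ids(tokens):
    """Map each token to its ID, falling back to the [UNK] ID for unknown tokens."""
    return [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]


tokens = toy_tokenize("I heart Tokens and MODELS")
ids = toy_convert_tokens_to_ids(tokens)
print(tokens)  # ['i', 'heart', 'tokens', 'and', 'models']
print(ids)     # [1, 2, 3, 4, 5]
```

Because the text is lowercased first, "MODELS" and "models" end up with the same ID. A word that is missing from the toy vocabulary is mapped to the `[UNK]` ID instead of raising an error.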

## Load a Tokenizer

```python
from transformers import BertTokenizer

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# See how many tokens are in the vocabulary
tokenizer.vocab_size
```

```
30522
```

```python
# Tokenize a sentence
tokens = tokenizer.tokenize("I heart Generative AI")

# Print the tokens
print(tokens)

# Show the token IDs assigned to each token
print(tokenizer.convert_tokens_to_ids(tokens))
```

```
['i', 'heart', 'genera', '##tive', 'ai']
[1045, 2540, 11416, 6024, 9932]
```

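In the output above, "Generative" is split into `genera` and `##tive`. BERT uses WordPiece tokenization, where the `##` prefix marks a piece that continues the previous piece rather than starting a new word. The sketch below shows the greedy longest-match-first idea behind that split. It is a simplified illustration under stated assumptions: `wordpiece_split` and `toy_vocab` are hypothetical names, the vocabulary is made up for this sentence, and the real tokenizer also handles punctuation, accents, and other details not shown here.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary, chosen only so that this example sentence can be split.
toy_vocab = {"i", "heart", "genera", "##tive", "ai", "gen", "##e"}

sentence = "I heart Generative AI"
tokens = [
    piece
    for word in sentence.lower().split()
    for piece in wordpiece_split(word, toy_vocab)
]
print(tokens)  # ['i', 'heart', 'genera', '##tive', 'ai']
```

Note that `gen` is also in the toy vocabulary, but the longer match `genera` wins because the search always tries the longest candidate first.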


## Resources
- [Hugging Face Tokenizers documentation index](https://huggingface.co/docs/tokenizers/main/en/index)

- [Hugging Face Tokenizer documentation](https://huggingface.co/docs/tokenizers/main/en/api/tokenizer)

- [Hugging Face BertTokenizer documentation](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)