# Tokenization Tutorial


Tokenization is the process of breaking text into smaller units called tokens, such as characters, subwords, words, or sentences.
In this tutorial, we will cover the following types of tokenization:

1. Character Tokenization
2. Word Tokenization
3. Sentence Tokenization
4. Byte Pair Encoding (BPE)

We will use plain Python for character tokenization, `nltk` for word and sentence tokenization, and `tokenizers` for BPE. Let's start!

## 1. Character Tokenization


Character tokenization splits text into individual characters. Every character becomes its own token, including spaces and punctuation.

In [5]:
# Example of character tokenization

text = "T5 is the greatest data science boot-camp!"
char_tokens = list(text)
print(char_tokens)

['T', '5', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'g', 'r', 'e', 'a', 't', 'e', 's', 't', ' ', 'd', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'n', 'c', 'e', ' ', 'b', 'o', 'o', 't', '-', 'c', 'a', 'm', 'p', '!']
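
Models usually work with integer ids rather than raw characters. The following is a minimal sketch of that next step, assuming a vocabulary built only from the characters in our example sentence. The names `char_to_id`, `id_to_char`, `encode`, and `decode` are illustrative and not part of any library.

```python
text = "T5 is the greatest data science boot-camp!"

# Build the vocabulary from the unique characters, sorted so ids are reproducible.
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

def encode(s):
    """Convert a string to a list of integer ids (KeyError on unseen characters)."""
    return [char_to_id[ch] for ch in s]

def decode(ids):
    """Convert a list of integer ids back to a string."""
    return "".join(id_to_char[i] for i in ids)

ids = encode(text)
print(len(vocab), "unique characters")
print(ids[:5])
print(decode(ids) == text)  # True: encoding then decoding returns the original text
```

A character vocabulary stays very small, but the resulting sequences are long: one id per character.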


## 2. Word Tokenization


Word tokenization splits text into individual words. Whitespace is the simplest delimiter, but splitting on spaces alone leaves punctuation attached to words.
We will use the `nltk` library, whose `word_tokenize` function also separates punctuation into its own tokens.

In [6]:
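import nltk

# Download the Punkt models that word_tokenize relies on.
# Newer NLTK releases may ask for 'punkt_tab' instead; if you see a LookupError,
# run nltk.download('punkt_tab').
nltk.download('punkt')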

# Example of word tokenization
from nltk.tokenize import word_tokenize

text = "T5 is the greatest data science boot-camp!"
word_tokens = word_tokenize(text)
print(word_tokens)

['T5', 'is', 'the', 'greatest', 'data', 'science', 'boot-camp', '!']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
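
To see why a dedicated tokenizer helps, here is a small sketch that compares plain whitespace splitting with a regular expression. The pattern `WORD_PATTERN` is a hypothetical stand-in written for this sentence. It is not what `word_tokenize` uses internally, and it will not match `nltk` on every input.

```python
import re

text = "T5 is the greatest data science boot-camp!"

# Whitespace splitting keeps the "!" glued to the last word.
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['T5', 'is', 'the', 'greatest', 'data', 'science', 'boot-camp!']

# Illustrative pattern: a run of word characters (optionally joined by hyphens),
# or a single character that is neither a word character nor whitespace.
WORD_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")
regex_tokens = WORD_PATTERN.findall(text)
print(regex_tokens)
# ['T5', 'is', 'the', 'greatest', 'data', 'science', 'boot-camp', '!']
```

For this sentence the regex reproduces the `nltk` output above, but contractions, abbreviations, and quotes are handled by additional rules in `word_tokenize`.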


## 3. Sentence Tokenization


Sentence tokenization breaks a text into individual sentences. The `sent_tokenize` function in `nltk` uses the pretrained Punkt model to decide which punctuation marks (such as `.`, `!`, and `?`) actually end a sentence.

In [9]:
# Example of sentence tokenization
from nltk.tokenize import sent_tokenize

text = "T5 is the greatest boot-camp! it is for data science !"
sentence_tokens = sent_tokenize(text)
print(sentence_tokens)

['T5 is the greatest boot-camp!', 'it is for data science !']
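
For comparison, here is a minimal sketch of a rule-based splitter that breaks after `.`, `!`, or `?` when followed by whitespace. The function `naive_sent_split` is a hypothetical helper, not an `nltk` function. It matches the result above on our example but splits incorrectly after an abbreviation.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sent_split("T5 is the greatest boot-camp! it is for data science !"))
# ['T5 is the greatest boot-camp!', 'it is for data science !']

# The rule treats the period in "Dr." as a sentence boundary.
print(naive_sent_split("Dr. Smith teaches at T5. The course is great!"))
# ['Dr.', 'Smith teaches at T5.', 'The course is great!']
```

Cases like the second one are the reason to prefer a trained sentence tokenizer over a single regular expression.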


## 4. Byte Pair Encoding (BPE)


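Byte Pair Encoding (BPE) is a subword tokenization algorithm. Starting from individual characters, it repeatedly merges the most frequent pair of adjacent symbols into a new symbol until the vocabulary reaches a target size. It is commonly used in large language models such as GPT, because frequent words stay whole while rare words are split into smaller subword units.
We will use the `tokenizers` library to perform BPE tokenization.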

In [14]:
# Install the tokenizers library first if you don't have it
# !pip install tokenizers

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

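# Example of BPE tokenization
# Start from an empty BPE model; the vocabulary and merge rules are learned in training.
tokenizer = Tokenizer(BPE())
# vocab_size is an upper bound; this tiny corpus produces far fewer tokens.
trainer = BpeTrainer(vocab_size=1000)
# Split on whitespace and punctuation first, so merges never cross word boundaries.
tokenizer.pre_tokenizer = Whitespace()

# Toy training corpus
texts = ["T5 is fun!", "Byte Pair Encoding is powerful."]
tokenizer.train_from_iterator(texts, trainer)

# With so little data, every training word is merged back into a single token.
output = tokenizer.encode("T5 is fun!")
print(output.tokens)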

['T5', 'is', 'fun', '!']
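
The library hides the merge loop, so here is a minimal pure-Python sketch of the core BPE training step. It is a simplified illustration, not the `tokenizers` implementation: the toy corpus and the helper names `get_pair_counts` and `merge_pair` are assumptions, and real implementations add details such as special tokens and byte-level fallbacks.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (illustrative): each word starts as a tuple of single characters.
corpus = Counter("low low low lower lowest newer newer".split())
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # learn three merge rules
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # ties go to the pair seen first
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
# [('l', 'o'), ('lo', 'w'), ('e', 'r')]
print(list(words))
# [('low',), ('low', 'er'), ('low', 'e', 's', 't'), ('n', 'e', 'w', 'er')]
```

After three merges the frequent word "low" is already a single token, while the rarer "lowest" is still split into smaller pieces.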
