# Week 2: NLP Tasks and Tokenization Techniques

Applied Learning Assignments 2:

  ● Evaluate and compare different tokenization techniques on a dataset.

  ● Apply word-level, character-level, and subword-level tokenization to a given text.

  ● Explore Byte Pair Encoding (BPE), WordPiece, and SentencePiece using Python libraries.

  ● Compare the outputs of each technique and discuss the advantages and limitations of each.



Word-Level Tokenization

In [12]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [15]:
from nltk.tokenize import word_tokenize

text = "Nunsi is a 4th Year student at the Federal University of Technology Akure!"

word_tokens = word_tokenize(text)
print(word_tokens)

['Nunsi', 'is', 'a', '4th', 'Year', 'student', 'at', 'the', 'Federal', 'University', 'of', 'Technology', 'Akure', '!']
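To see what `word_tokenize` adds over a plain whitespace split, here is a small sketch that uses only the standard library. The regular expression is an assumption chosen for illustration: it gives the same tokens as the output above for this simple sentence, but it is not NLTK's actual algorithm.

```python
import re

text = "Nunsi is a 4th Year student at the Federal University of Technology Akure!"

# Whitespace split: punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()

# Illustrative regex (an assumption, not NLTK's rules): runs of word
# characters, or any single character that is neither a word character
# nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens[-1])  # the last token still contains the "!"
print(regex_tokens[-2:])      # the "!" is split off as its own token
```

The difference matters for vocabulary building: with a whitespace split, "Akure" and "Akure!" would count as two different words.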


Character-Level Tokenization


In [16]:
char_tokens = list(text)
print(char_tokens)

['N', 'u', 'n', 's', 'i', ' ', 'i', 's', ' ', 'a', ' ', '4', 't', 'h', ' ', 'Y', 'e', 'a', 'r', ' ', 's', 't', 'u', 'd', 'e', 'n', 't', ' ', 'a', 't', ' ', 't', 'h', 'e', ' ', 'F', 'e', 'd', 'e', 'r', 'a', 'l', ' ', 'U', 'n', 'i', 'v', 'e', 'r', 's', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'T', 'e', 'c', 'h', 'n', 'o', 'l', 'o', 'g', 'y', ' ', 'A', 'k', 'u', 'r', 'e', '!']
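A model consumes integer ids rather than characters, so a character-level tokenizer also needs a vocabulary. The following is a minimal sketch of that step; the names `char_to_id` and `id_to_char` are hypothetical and only serve the illustration.

```python
text = "Nunsi is a 4th Year student at the Federal University of Technology Akure!"

char_tokens = list(text)

# Build a vocabulary from the distinct characters (sorted for stable ids).
vocab = sorted(set(char_tokens))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode to integer ids and decode back; the round trip is lossless.
ids = [char_to_id[ch] for ch in char_tokens]
decoded = "".join(id_to_char[i] for i in ids)

print(f"sequence length: {len(ids)}, vocabulary size: {len(vocab)}")
```

The vocabulary stays small while the sequence is as long as the text itself, which is the basic trade-off of character-level tokenization.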


Subword Tokenization

In [14]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

['token', '##ization', 'is', 'crucial', 'in', 'natural', 'language', 'processing', '!']


In [17]:
tokens = tokenizer.tokenize(text)
print(tokens)

['nuns', '##i', 'is', 'a', '4th', 'year', 'student', 'at', 'the', 'federal', 'university', 'of', 'technology', 'ak', '##ure', '!']


WordPiece Tokenization (BERT Style)

In [18]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize(text))

['nuns', '##i', 'is', 'a', '4th', 'year', 'student', 'at', 'the', 'federal', 'university', 'of', 'technology', 'ak', '##ure', '!']
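The `##` pieces in the output come from a greedy longest-match-first lookup against the model's vocabulary. The sketch below reimplements that lookup on a tiny hand-written vocabulary. `wordpiece_tokenize` and `toy_vocab` are hypothetical names for illustration; the real `bert-base-uncased` vocabulary has about 30,000 entries and the library code handles more cases (lowercasing, punctuation, maximum word length).

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example only).
toy_vocab = {"nuns", "##i", "ak", "##ure", "student", "a", "##s"}

print(wordpiece_tokenize("nunsi", toy_vocab))
print(wordpiece_tokenize("akure", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary the first two calls reproduce the splits seen above for "Nunsi" and "Akure", and the third shows the fallback to the unknown token.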


SentencePiece

In [19]:
from transformers import AutoTokenizer

# Load a pretrained tokenizer built with SentencePiece (T5 and ALBERT both use it)
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Tokenize a sample text
tokens = tokenizer.tokenize(text)
print("SentencePiece Tokens:", tokens)

SentencePiece Tokens: ['▁Nun', 's', 'i', '▁is', '▁', 'a', '▁4', 'th', '▁Year', '▁student', '▁at', '▁the', '▁Federal', '▁University', '▁of', '▁Technology', '▁Ak', 'ure', '!']
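The `▁` character in these tokens stands for a space that preceded the piece, which is what lets the original text be rebuilt from the tokens alone. A minimal sketch of that reversal, using the token list copied from the output above (the variable names are hypothetical):

```python
text = "Nunsi is a 4th Year student at the Federal University of Technology Akure!"

# Tokens copied from the notebook output above.
tokens = ['▁Nun', 's', 'i', '▁is', '▁', 'a', '▁4', 'th', '▁Year', '▁student',
          '▁at', '▁the', '▁Federal', '▁University', '▁of', '▁Technology',
          '▁Ak', 'ure', '!']

SPACE_MARKER = "▁"  # the marker used in the output above, not an underscore

# Concatenate the pieces, turn each marker back into a space, and drop the
# leading space that was added in front of the first word.
restored = "".join(tokens).replace(SPACE_MARKER, " ").lstrip(" ")

print(restored)
print(restored == text)
```

The WordPiece output earlier cannot be reversed this way, because `bert-base-uncased` lowercases the text before splitting it.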


Byte Pair Encoding

In [20]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

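# Build a tokenizer backed by an (initially empty) BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on whitespace and punctuation before any BPE merges are applied
tokenizer.pre_tokenizer = Whitespace()
# vocab_size=50 is tiny, so only a few merges can be learned
trainer = BpeTrainer(vocab_size=50, special_tokens=["[UNK]", "[PAD]"])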

# Train on your sample text
tokenizer.train_from_iterator([text], trainer)

# Encode the text
output = tokenizer.encode(text)
print(output.tokens)

['Nu', 'nsi', 'is', 'a', '4th', 'Ye', 'ar', 's', 't', 'u', 'de', 'nt', 'at', 'th', 'e', 'Fe', 'de', 'r', 'al', 'Un', 'iv', 'er', 'si', 't', 'y', 'o', 'f', 'Te', 'ch', 'no', 'lo', 'gy', 'Ak', 'u', 'r', 'e', '!']
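To make the training step less of a black box, here is a minimal pure-Python sketch of the core BPE idea: count adjacent symbol pairs, merge the most frequent pair, and repeat. The function names and the toy word list are assumptions for illustration. This is not how the `tokenizers` library is implemented, and its tie-breaking between equally frequent pairs may differ.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_words = ["low", "low", "lower", "lowest"]
merges = learn_bpe(toy_words, num_merges=2)
print(merges)
print(apply_bpe("lowly", merges))
```

Even though "lowly" never appears in the toy training words, it is still tokenized, with the learned piece "low" followed by single characters.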


# Comparison of Tokenization Techniques

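* Word-level tokenization produces output like ['Nunsi', 'is', 'a', '4th', ...]. It is easy to understand and quick to implement, but the vocabulary grows with every new word form, and any word not seen during training has to be mapped to a single unknown token.

* Character-level tokenization gives results such as ['N', 'u', 'n', 's', 'i', ...]. The vocabulary is tiny and there are no out-of-vocabulary words, so it tolerates typos and works for languages without clear word boundaries. The cost is much longer sequences (74 tokens here versus 14 at word level), and individual characters carry little meaning, so the model has to learn word structure on its own.

* Byte Pair Encoding (BPE) starts from characters and repeatedly merges the most frequent adjacent pair, producing subwords like ['Nu', 'nsi', 'is', ...]. It keeps the vocabulary compact and can represent rare or unseen words as sequences of smaller pieces, but the merges must be learned from a corpus or taken from a pretrained model. The fragmented output above reflects training on a single sentence with vocab_size=50; a tokenizer trained on a large corpus would keep common words whole.

* WordPiece tokenization, as used in BERT, outputs tokens such as ['nuns', '##i', ..., 'ak', '##ure', '!'], where the ## prefix marks a piece that continues the previous one. Common words stay whole, while rare words such as "Nunsi" and "Akure" are split. The vocabulary is tied to the pretrained model, and bert-base-uncased also lowercases the text, so the original casing is lost.

* SentencePiece tokenization yields output like ['▁Nun', 's', 'i', '▁is', ...], where ▁ marks a preceding space. It works on raw text without whitespace pre-tokenization, which makes it language-agnostic and well suited to multilingual scenarios, and the original text can be recovered from the tokens. However, it requires a trained model, and its output is harder to inspect and debug than that of simpler methods.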
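A rough way to quantify these trade-offs is to count how many tokens each technique produced for the sample sentence. The lists below are copied from the outputs recorded earlier in this notebook (nothing is re-tokenized), and the dictionary keys are just labels chosen for this sketch.

```python
text = "Nunsi is a 4th Year student at the Federal University of Technology Akure!"

outputs = {
    "word": ['Nunsi', 'is', 'a', '4th', 'Year', 'student', 'at', 'the',
             'Federal', 'University', 'of', 'Technology', 'Akure', '!'],
    "character": list(text),
    "wordpiece": ['nuns', '##i', 'is', 'a', '4th', 'year', 'student', 'at',
                  'the', 'federal', 'university', 'of', 'technology', 'ak',
                  '##ure', '!'],
    "sentencepiece": ['▁Nun', 's', 'i', '▁is', '▁', 'a', '▁4', 'th', '▁Year',
                      '▁student', '▁at', '▁the', '▁Federal', '▁University',
                      '▁of', '▁Technology', '▁Ak', 'ure', '!'],
    "bpe (vocab_size=50)": ['Nu', 'nsi', 'is', 'a', '4th', 'Ye', 'ar', 's',
                            't', 'u', 'de', 'nt', 'at', 'th', 'e', 'Fe', 'de',
                            'r', 'al', 'Un', 'iv', 'er', 'si', 't', 'y', 'o',
                            'f', 'Te', 'ch', 'no', 'lo', 'gy', 'Ak', 'u', 'r',
                            'e', '!'],
}

# Number of tokens each technique needed for the same sentence.
counts = {name: len(tokens) for name, tokens in outputs.items()}

for name, n in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{name:>20}: {n} tokens")
```

The BPE count is inflated by the tiny vocabulary trained on one sentence, so it should not be read as typical of BPE in general.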