In [2]:
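# Word-level tokenization
import nltk
from nltk.tokenize import word_tokenize

# Newer NLTK releases load the tokenizer data from 'punkt_tab',
# older ones from 'punkt'; downloading both covers either case.
nltk.download('punkt_tab')
nltk.download('punkt')

text = "Tokenization is crucial for NLP models!"

# Split the sentence into words and punctuation marks
tokens = word_tokenize(text)
print("Word-level Tokens:", tokens)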

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...


Word-level Tokens: ['Tokenization', 'is', 'crucial', 'for', 'NLP', 'models', '!']


[nltk_data]   Package punkt is already up-to-date!
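
For comparison, word-level splitting can be approximated without NLTK. The following is a minimal sketch using only the standard library; `simple_word_tokenize` is a hypothetical helper, not part of NLTK, and it handles far fewer cases than `word_tokenize` (contractions such as "don't" are split differently, for example).

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Tokenization is crucial for NLP models!")
print("Regex word-level tokens:", tokens)
# Regex word-level tokens: ['Tokenization', 'is', 'crucial', 'for', 'NLP', 'models', '!']
```

On this sentence the result matches the NLTK output above, but that agreement should not be expected in general.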


In [3]:
# Character-level tokenization

text = "Tokenization is crucial for NLP models!"

# Every character, including spaces and punctuation, becomes a token
tokens = list(text)
print('Character-level Tokens:', tokens)

Character-level Tokens: ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 's', ' ', 'c', 'r', 'u', 'c', 'i', 'a', 'l', ' ', 'f', 'o', 'r', ' ', 'N', 'L', 'P', ' ', 'm', 'o', 'd', 'e', 'l', 's', '!']
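
A model consumes integer IDs rather than raw characters, so a character-level tokenizer also needs a vocabulary that maps each character to an ID. The sketch below shows one simple way to build that mapping and check that decoding recovers the original text; the names `char_to_id` and `id_to_char` are illustrative only.

```python
text = "Tokenization is crucial for NLP models!"

# One integer ID per distinct character, in sorted order
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode to IDs, then decode back to a string
ids = [char_to_id[ch] for ch in text]
decoded = "".join(id_to_char[i] for i in ids)

print("Vocabulary size:", len(vocab))
print("Sequence length:", len(ids))
print("Round trip OK:", decoded == text)
```

The vocabulary stays tiny, but the sequence has one ID per character (39 here, against 7 word-level tokens).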


In [1]:
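# Byte pair encoding with a pre-trained tokenizer
from tokenizers import Tokenizer

# Load the pre-trained GPT-2 byte-level BPE tokenizer from the Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("gpt2")

output = tokenizer.encode("Tokenization is crucial for NLP models!")
# In the printed tokens, 'Ġ' marks a token that starts with a space
print("BPE Tokens:", output.tokens)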





BPE Tokens: ['Token', 'ization', 'Ġis', 'Ġcrucial', 'Ġfor', 'ĠN', 'LP', 'Ġmodels', '!']
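
The pre-trained tokenizer hides how its merges were learned. As a rough illustration of the idea behind BPE training, the sketch below repeatedly merges the most frequent adjacent symbol pair in a toy corpus. The corpus, its frequencies, and the helper names are hypothetical, and real implementations (including GPT-2's byte-level variant) differ in many details.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with one concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
# Start with every word split into single characters
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(4):  # number of merges is a tunable hyperparameter
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(best, words)

print("Learned merges:", merges)
print("Segmented words:", list(words))
```

Each merge adds one symbol to the vocabulary, so the number of merges controls the trade-off between vocabulary size and sequence length.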


In [3]:
# Byte pair encoding: train a tokenizer from scratch
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Initialize tokenizer and trainer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on whitespace and punctuation before training; without a pre-tokenizer
# each sentence is treated as one "word" and collapses into a single token.
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Training corpus (far too small for meaningful merges; for illustration only)
corpus = ["Tokenization is crucial for NLP models!", "Machine Learning is amazing"]

# Train the tokenizer
tokenizer.train_from_iterator(corpus, trainer)

# Tokenize input; encode() returns an Encoding object, so print its .tokens
bpe_tokens = tokenizer.encode("Tokenization is crucial for NLP models!")
print("BPE Tokens:", bpe_tokens.tokens)


In [4]:
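# WordPiece tokenizer

from transformers import BertTokenizerFast

# bert-base-uncased lowercases its input, so "NLP" is tokenized as "nlp"
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Pieces prefixed with '##' continue the previous piece within the same word
tokens = tokenizer.tokenize("Tokenization is crucial for NLP models!")
print("WordPiece Tokens:", tokens)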


WordPiece Tokens: ['token', '##ization', 'is', 'crucial', 'for', 'nl', '##p', 'models', '!']
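
The `##` pieces in this output come from a greedy longest-match-first lookup: at each position the tokenizer takes the longest vocabulary entry that fits, and marks non-initial pieces with `##`. The following is a simplified sketch of that lookup for a single lowercased word. The toy vocabulary is hypothetical, and the real BERT tokenizer adds normalization and other steps not shown here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary
toy_vocab = {"token", "##ization", "nl", "##p", "model", "##s"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("nlp", toy_vocab))           # ['nl', '##p']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```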


In [5]:
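# SentencePiece tokenization

from transformers import AutoTokenizer

# Load a tokenizer whose vocabulary was built with SentencePiece (T5 here).
# Reusing the BERT tokenizer from the previous cell would yield WordPiece tokens again.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

tokens = tokenizer.tokenize("Tokenization is crucial for NLP models!")
# SentencePiece marks the start of each word with '▁' (U+2581)
print("SentencePiece Tokens:", tokens)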
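
SentencePiece treats whitespace as an ordinary symbol, written `▁` (U+2581), which is what lets it work on raw text and detokenize without loss. The sketch below illustrates only that convention. The helper names are hypothetical and the segmentation is chosen by hand, so it is not the output of any trained model.

```python
WS = "\u2581"  # '▁', the symbol SentencePiece uses in place of a space

def mark_whitespace(text):
    """Rewrite spaces as '▁' and add a leading '▁' (SentencePiece's default behaviour)."""
    return WS + text.replace(" ", WS)

def detokenize(pieces):
    """Concatenate pieces and turn '▁' back into spaces."""
    return "".join(pieces).replace(WS, " ").lstrip(" ")

text = "Tokenization is crucial for NLP models!"
marked = mark_whitespace(text)

# Hand-picked segmentation for illustration (not real model output)
pieces = [WS + "Token", "ization", WS + "is", WS + "crucial",
          WS + "for", WS + "NLP", WS + "models", "!"]

restored = detokenize(pieces)
print("Marked text:", marked)
print("Restored   :", restored)
```

Because the spaces are stored inside the pieces, joining them is enough to recover the input, with no language-specific detokenization rules.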


**Comparison of outputs**

Given the text:

"Tokenization is crucial for NLP models!"

**For Word-level Tokenization** = Word-level Tokens: ['Tokenization', 'is', 'crucial', 'for', 'NLP', 'models', '!']

**For Character-level Tokenization** = Character-level Tokens: ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 's', ' ', 'c', 'r', 'u', 'c', 'i', 'a', 'l', ' ', 'f', 'o', 'r', ' ', 'N', 'L', 'P', ' ', 'm', 'o', 'd', 'e', 'l', 's', '!']

**For Byte Pair Encoding** = BPE Tokens: ['Token', 'ization', 'Ġis', 'Ġcrucial', 'Ġfor', 'ĠN', 'LP', 'Ġmodels', '!']

**For WordPiece Tokenizer** = WordPiece Tokens: ['token', '##ization', 'is', 'crucial', 'for', 'nl', '##p', 'models', '!']

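**For SentencePiece Tokenizer** = SentencePiece Tokens: these depend on the SentencePiece model that is loaded (for example T5 or ALBERT). Pieces that begin a word carry a leading '▁' (U+2581) rather than the '##' continuation prefix that WordPiece uses, so a list containing '##' pieces is WordPiece output.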
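
To make the comparison concrete, the token lists can be compared by length. The sketch below hard-codes the word-level, BPE and WordPiece outputs recorded in this notebook and recomputes the character-level list from the text; the dictionary keys are just labels.

```python
text = "Tokenization is crucial for NLP models!"

# Token lists copied from the cell outputs in this notebook
outputs = {
    "word": ['Tokenization', 'is', 'crucial', 'for', 'NLP', 'models', '!'],
    "character": list(text),
    "bpe (gpt2)": ['Token', 'ization', 'Ġis', 'Ġcrucial', 'Ġfor', 'ĠN', 'LP', 'Ġmodels', '!'],
    "wordpiece (bert)": ['token', '##ization', 'is', 'crucial', 'for', 'nl', '##p', 'models', '!'],
}

# Sequence length per tokenizer
token_counts = {name: len(toks) for name, toks in outputs.items()}
for name, n in token_counts.items():
    print(f"{name:<18} {n:>3} tokens")
```

The subword tokenizers produce 9 tokens each, close to the 7 word-level tokens and far fewer than the 39 characters.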

**Discussion: Advantages and Limitations**

**Word-level:**

Advantages:
- Easy to understand
- Each token is a whole word, so word meaning stays intact

Disadvantages:
- Cannot represent words outside the vocabulary (the OOV problem)
- Vocabulary can become very large

**Character-level:**

Advantages:
- No OOV issues (any text can be processed)
- Small vocabulary size

Disadvantages:
- Longer sequences → slower training
- Hard to capture word-level meaning

**Byte Pair Encoding (BPE):**

Advantages:
- Balances between full words and subwords
- Reduces OOVs
- Works well on medium-sized data

Disadvantages:
- May split rare or new words into awkward pieces
- Vocabulary size still needs careful tuning

**WordPiece Tokenizer:**

Advantages:
- Handles unseen words by splitting them into known pieces
- Suited to large datasets
- Used in models like BERT

Disadvantages:
- Requires large training corpora
- More complex preprocessing

**SentencePiece:**

Advantages:
- Language-independent (does not rely on spaces to find word boundaries)
- Great for low-resource languages
- Trains directly on raw text

Disadvantages:
- Slightly more complicated setup
- Pieces may not align with intuitive word boundaries
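
The OOV trade-off described above can be seen with a toy vocabulary. In this sketch the training sentences, the `[UNK]` placeholder and the helper names are all hypothetical; it only contrasts how a word-level and a character-level vocabulary react to a word they have not seen.

```python
# Hypothetical training data used to build both vocabularies
train_sentences = ["tokenization is crucial", "machine learning is amazing"]
word_vocab = {w for s in train_sentences for w in s.split()}
char_vocab = {c for s in train_sentences for c in s}

def word_encode(sentence):
    # Unknown words are replaced by a single [UNK] token
    return [w if w in word_vocab else "[UNK]" for w in sentence.split()]

def char_encode(sentence):
    # Unknown characters would also map to [UNK], but this is rare
    return [c if c in char_vocab else "[UNK]" for c in sentence]

new_sentence = "tokenizing is amazing"  # "tokenizing" was never seen as a word
word_tokens = word_encode(new_sentence)
char_tokens = char_encode(new_sentence)

print("Word-level:", word_tokens)            # ['[UNK]', 'is', 'amazing']
print("Char-level length:", len(char_tokens))  # 21
print("Char-level has [UNK]:", "[UNK]" in char_tokens)  # False
```

The word-level encoder loses the unseen word entirely, while the character-level encoder keeps every character at the cost of a sequence seven times longer.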


**Final Summary**

1. Word-level is simple but brittle (it cannot handle unknown words).

2. Character-level is flexible but produces long sequences, which makes it impractical for long texts.

3. BPE/WordPiece/SentencePiece (subword tokenizers) offer the best of both worlds:
smaller vocabularies than word-level, shorter sequences than character-level, and good handling of new words.


                                  


