<a href="https://colab.research.google.com/github/Lakshmi-Adhikari-AI/LLM-HuggingFace/blob/main/ch2/mod3_tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizers (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook. NLTK is also installed because the word-based example below uses it.

In [29]:
!pip install datasets evaluate transformers[sentencepiece] nltk



### **Tokenization: Splitting text into smaller units (tokens) such as words, subwords, or characters.**



## **🔹 Types of Tokenization**
**1. Word-based Tokenization**


*   Splits text by space or punctuation.
*   Easy to understand.
*   Treats related forms as unrelated tokens (e.g., “dog” and “dogs” get separate IDs), maps unseen words to an unknown token, and requires a very large vocabulary.
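
The out-of-vocabulary problem can be sketched with a toy word-level vocabulary. The corpus, the `[UNK]` placeholder, and the `encode_words` helper are illustrative assumptions, not part of any library:

```python
# Toy word-level vocabulary built from a tiny, made-up corpus.
corpus = ["the dog barks", "a dog runs"]

UNK = "[UNK]"  # hypothetical placeholder for out-of-vocabulary words
vocab = {UNK: 0}
for sentence in corpus:
    for word in sentence.split():
        # Assign the next free ID the first time a word is seen.
        vocab.setdefault(word, len(vocab))

def encode_words(text):
    """Map each whitespace-separated word to its ID, or to the [UNK] ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

print(vocab)
print(encode_words("the dog runs"))  # every word is in the vocabulary
print(encode_words("the dogs run"))  # "dogs" and "run" were never seen
```

In this sketch "dogs" collapses to the same ID as any other unseen word, so the link to "dog" is lost entirely.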






In [39]:
# Word-based Tokenization using space
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)


['Jim', 'Henson', 'was', 'a', 'puppeteer']


In [40]:
# Word-based Tokenization using punctuation (via NLTK)
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokenized_text = tokenizer.tokenize("Let's do Tokenization!")
print(tokenized_text)


['Let', "'", 's', 'do', 'Tokenization', '!']


**2. Character-based Tokenization**

* Splits text into individual characters.

* Small vocabulary; any word can be spelled out, so unknown tokens are rare.

* Individual characters carry little meaning on their own, and sequences become much longer.
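
The trade-off between vocabulary size and sequence length can be checked on a single sentence. This is only a rough illustration on one short string, not a measurement on a real corpus:

```python
text = "Let's do Tokenization!"

word_tokens = text.split()  # whitespace-separated words
char_tokens = list(text)    # one token per character, spaces included

# Character-level vocabulary: the distinct characters needed to cover the text.
char_vocab = sorted(set(char_tokens))

print(len(word_tokens))  # number of word-level tokens
print(len(char_tokens))  # number of character-level tokens
print(len(char_vocab))   # size of the character vocabulary for this text
```

The character version needs far more tokens for the same sentence, while its vocabulary stays tiny.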

In [41]:
# Character-based Tokenization
tokenized_text = list("Let's do Tokenization!")
print(tokenized_text)

['L', 'e', 't', "'", 's', ' ', 'd', 'o', ' ', 'T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', '!']


**3. Subword-based Tokenization**

* Keeps frequent words whole and breaks rare or long words into smaller, meaningful parts.

* Balances vocabulary size and flexibility.

* Used by most Transformer models, such as BERT (WordPiece) and GPT (byte-pair encoding).
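
To make the idea concrete, here is a simplified sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to each word. The `wordpiece_tokenize` helper and the `toy_vocab` set are hypothetical; a real tokenizer also normalizes and pre-tokenizes the text and uses a learned vocabulary of tens of thousands of pieces:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical mini-vocabulary, chosen only for this illustration.
toy_vocab = {"To", "do", "##ken", "##ization", "##s"}

print(wordpiece_tokenize("Tokenization", toy_vocab))
print(wordpiece_tokenize("do", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A word that is in the vocabulary stays whole, a longer word is covered by several pieces, and a word with no matching pieces falls back to the unknown token.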

In [42]:
# Subword-based Tokenization using BertTokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenized_text = tokenizer.tokenize("Let's do Tokenization!")
print(tokenized_text)


['Let', "'", 's', 'do', 'To', '##ken', '##ization', '!']


In [43]:
# Subword-based Tokenization using AutoTokenizer (auto-selects correct tokenizer based on checkpoint)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_text = tokenizer.tokenize("Let's do Tokenization!")
print(tokenized_text)

['Let', "'", 's', 'do', 'To', '##ken', '##ization', '!']


## **🔹 Encoding: Converting Text into Numbers**

**Encoding has 2 steps:**

1. Tokenization → Convert text to tokens

2. Token ID Conversion → Convert tokens to vocabulary IDs

In [44]:
# Step 1: Tokenize the text
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
input_tokens = tokenizer.tokenize("Using a Transformer network is simple")
print(input_tokens)




['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


In [45]:
# Step 2: Convert tokens to token IDs
token_ids = tokenizer.convert_tokens_to_ids(input_tokens)
print(token_ids)


[7993, 170, 13809, 23763, 2443, 1110, 3014]


**🔎 Check token presence in vocabulary**

In [46]:
print("Does 'Transformer' exist in vocab?", tokenizer.vocab.get("Transformer"))
print("Does 'transform' exist?", tokenizer.vocab.get("transform"))
print("Does 'Trans' exist?", tokenizer.vocab.get("Trans"))
print("Does '##former' exist?", tokenizer.vocab.get("##former"))

Does 'Transformer' exist in vocab? None
Does 'transform' exist? 11303
Does 'Trans' exist? 13809
Does '##former' exist? 23763


**🔄 Decoding: Converting IDs back to human-readable text**

In [47]:
# Decode: map the token IDs back to text, merging subword pieces into whole words

decoded_text = tokenizer.decode(token_ids)
print(decoded_text)

Using a Transformer network is simple
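
Decoding does more than look up each ID: the `##` continuation pieces have to be glued back onto the preceding token. A minimal sketch of that merging step, assuming WordPiece-style tokens (the `merge_wordpieces` helper is hypothetical and ignores details such as punctuation spacing):

```python
tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]

def merge_wordpieces(tokens):
    """Join tokens with spaces, then attach '##' pieces to the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()

print(merge_wordpieces(tokens))
```

This is why "Trans" and "##former" come back as the single word "Transformer".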


**📌 Manually Add Special Tokens ([CLS], [SEP])**

In [48]:
# Calling the tokenizer directly adds these special tokens automatically, based on the checkpoint.
# Here they are added by hand around the token list.

tokens = [tokenizer.cls_token] + tokenizer.tokenize("Using a Transformer network is simple") + [tokenizer.sep_token]
print(tokens)

['[CLS]', 'Using', 'a', 'Trans', '##former', 'network', 'is', 'simple', '[SEP]']
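
The same wrapping can be sketched at the ID level. The token IDs are the ones printed in the encoding step; the values 101 and 102 for `[CLS]` and `[SEP]` are an assumption for `bert-base-cased` and are best confirmed with `tokenizer.cls_token_id` and `tokenizer.sep_token_id`. The `add_special_tokens` helper is hypothetical:

```python
# IDs from the encoding step, without special tokens.
token_ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]

# Assumed special-token IDs; verify with tokenizer.cls_token_id / tokenizer.sep_token_id.
CLS_ID, SEP_ID = 101, 102

def add_special_tokens(ids):
    """Wrap a single sequence as [CLS] ... [SEP]."""
    return [CLS_ID] + ids + [SEP_ID]

input_ids = add_special_tokens(token_ids)
print(input_ids)
```

In practice, calling the tokenizer directly on the text performs tokenization, ID conversion, and special-token insertion in one step.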
