
# **Tokenization and Its Types**

---

## **1. Theory**

### **What is Tokenization?**

* **Definition**: Tokenization is the process of splitting raw text into smaller meaningful units called **tokens** (words, subwords, sentences, or characters).
* Tokens are the **basic building blocks** for any NLP pipeline.
* Example:
  Input: *“I love NLP!”*
  Tokens: `["I", "love", "NLP", "!"]`
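
The example above can be reproduced with a small regular expression. This is only a minimal sketch: `simple_tokenize` is a hypothetical helper written for illustration, not a function from NLTK or SpaCy, and it assumes that words and single punctuation marks are the only tokens of interest.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches one character that is neither word-like nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I love NLP!"))  # ['I', 'love', 'NLP', '!']
```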

---

### **Types of Tokenization**

1. **Sentence Tokenization**

   * Splitting text into sentences.
   * Example:
     Text: *“I love NLP. It is amazing.”*
     Tokens: `["I love NLP.", "It is amazing."]`

2. **Word Tokenization**

   * Splitting sentences into words.
   * Example:
     Text: *“I love NLP.”*
     Tokens: `["I", "love", "NLP", "."]`

3. **Character Tokenization**

   * Splitting text into individual characters.
   * Example:
     Text: *“NLP”*
     Tokens: `["N", "L", "P"]`

4. **Subword Tokenization (Modern NLP)**

   * Splitting words into smaller units (useful for rare or unknown words).
   * Example:
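     Word: *“unhappiness”* → `["un", "happi", "ness"]` (the pieces are substrings of the word, so the middle piece is “happi”, not “happy”).
   * Common algorithms and tools: **BPE (Byte Pair Encoding)** (GPT-2), **WordPiece** (BERT), **SentencePiece** (T5).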

5. **Whitespace Tokenization**

   * Splitting text by spaces.
   * Simple, but leaves punctuation attached to words and does not split contractions.
   * Example: *“I’m happy.”* → `["I'm", "happy."]`
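
The simpler types above can be sketched with the Python standard library alone. The regular expressions below are illustrative assumptions, not the rules NLTK or SpaCy use; in particular, the sentence splitter would break on abbreviations such as “Dr.”.

```python
import re

text = "I love NLP. It is amazing."

# Sentence tokenization: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: words and punctuation marks become separate tokens.
words = re.findall(r"\w+|[^\w\s]", sentences[0])

# Character tokenization: every character is its own token.
chars = list("NLP")

# Whitespace tokenization: punctuation stays attached to the word.
ws_tokens = "I'm happy.".split()

print(sentences)  # ['I love NLP.', 'It is amazing.']
print(words)      # ['I', 'love', 'NLP', '.']
print(chars)      # ['N', 'L', 'P']
print(ws_tokens)  # ["I'm", 'happy.']
```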

---

## **2. Examples in NLTK & SpaCy**

### **NLTK Example**

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

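# Download the Punkt sentence tokenizer data (only needed once).
# Recent NLTK releases load "punkt_tab" instead of "punkt", so fetch both.
nltk.download("punkt")
nltk.download("punkt_tab")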

text = "Tokenization is the first step in NLP. It's essential for text processing."

# Sentence Tokenization
print("Sentence Tokens:", sent_tokenize(text))

# Word Tokenization
print("Word Tokens:", word_tokenize(text))
```

**Output:**

```
Sentence Tokens: ['Tokenization is the first step in NLP.', "It's essential for text processing."]
Word Tokens: ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP', '.', 'It', "'s", 'essential', 'for', 'text', 'processing', '.']
```

---

### **SpaCy Example**

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = "Tokenization is the first step in NLP. It's essential for text processing."
doc = nlp(text)

# Sentence Tokenization
print("Sentence Tokens:")
for sent in doc.sents:
    print(sent.text)

# Word Tokenization with POS
print("\nWord Tokens:")
for token in doc:
    print(token.text, token.pos_)
```

**Output:**

```
Sentence Tokens:
Tokenization is the first step in NLP.
It's essential for text processing.

Word Tokens:
Tokenization NOUN
is AUX
the DET
first ADJ
step NOUN
in ADP
NLP PROPN
. PUNCT
...
```

---

## **3. Interview-Style Q&A**

### **Basic Level**

**Q1. What is tokenization in NLP?**
*A: Tokenization is the process of splitting raw text into smaller meaningful units called tokens (words, sentences, subwords, or characters). It is the foundation for most NLP tasks.*

**Q2. What are the main types of tokenization?**
*A: Sentence tokenization, word tokenization, character tokenization, subword tokenization, and whitespace tokenization.*

---

### **Intermediate Level**

**Q3. Why is subword tokenization important in modern NLP models?**
*A: Subword tokenization helps handle rare or unknown words by breaking them into smaller units, reducing vocabulary size and improving generalization in models like BERT and GPT.*
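
To make the idea concrete, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece. The function name `wordpiece_like` and the tiny vocabulary are hypothetical and chosen only for illustration; real tokenizers use vocabularies with tens of thousands of entries learned from a corpus.

```python
def wordpiece_like(word, vocab, unk="[UNK]"):
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary (an assumption for this example).
vocab = {"un", "##happi", "##ness", "happy"}
print(wordpiece_like("unhappiness", vocab))  # ['un', '##happi', '##ness']
print(wordpiece_like("xyz", vocab))          # ['[UNK]']
```

The rare word “unhappiness” is not in the vocabulary, yet it is still represented by three known pieces instead of a single unknown token.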

**Q4. Difference between NLTK and SpaCy in tokenization?**
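*A: NLTK offers several separate tokenizers: `word_tokenize` applies Treebank-style rules, while `sent_tokenize` uses the pre-trained Punkt model. It is flexible and well suited to teaching and research. SpaCy uses a fast rule-based tokenizer with language-specific exceptions, designed for production use, and its pipeline adds linguistic annotations (POS tags, dependencies, entities) on top of the tokens.*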

---

### **Advanced Level**

**Q5. How do tokenizers in Transformer models differ from traditional tokenizers?**
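*A: Traditional tokenizers split text into words or sentences using rules, so any word outside the vocabulary becomes an unknown token. Transformer tokenizers (BPE, WordPiece, SentencePiece) learn a subword vocabulary from data and split rare words into known pieces, making them more robust to rare words and morphologically rich languages.*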
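
As a rough illustration of how a subword vocabulary can be learned, the sketch below implements the core loop of BPE: count adjacent symbol pairs, merge the most frequent pair, and repeat. The function `learn_bpe` and the word frequencies are hypothetical; production implementations add details such as end-of-word markers or byte-level symbols that are left out here.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary (toy version)."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus (an assumption for this example).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(word_freqs, num_merges=2)
print(merges)
print(list(corpus))
```

With this toy corpus, two merges are enough for the frequent suffix “est” to become a single symbol, which is the kind of reusable unit that lets the tokenizer handle unseen words such as “lowest”.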

**Q6. What challenges exist in tokenization?**
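*A: Handling contractions (e.g., “don’t”), multi-word entities (e.g., “New York”), languages without spaces (e.g., Chinese), and domain-specific terms (e.g., medical jargon). Subword tokenization helps with rare and domain-specific terms, and tokenizers such as SentencePiece work on raw text without relying on spaces; multi-word entities are usually handled by later steps such as named entity recognition.*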
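
The multi-word entity problem can be illustrated with a small post-processing step over word tokens. This is a sketch under stated assumptions: `merge_phrases` and the phrase list are hypothetical, and a real pipeline would more likely rely on a named entity recognizer than on a fixed list.

```python
def merge_phrases(tokens, phrases):
    """Merge known multi-word phrases into single tokens (longest match first)."""
    max_len = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase starting at position i.
        for n in range(max_len, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                out.append(" ".join(candidate))
                i += n
                break
        else:
            # No phrase matched: keep the single token.
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical phrase list (an assumption for this example).
phrases = {("New", "York"), ("New", "York", "City")}
tokens = ["I", "live", "in", "New", "York", "City", "."]
print(merge_phrases(tokens, phrases))  # ['I', 'live', 'in', 'New York City', '.']
```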

---



# **Comparison of Tokenization Techniques**

| **Technique**                                            | **Description**                                      | **Use Cases**                                                                             | **Pros**                                                                | **Cons**                                                                     |
| -------------------------------------------------------- | ---------------------------------------------------- | ----------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
| **Whitespace Tokenization**                              | Splits text based on spaces only.                    | Quick text splitting, simple preprocessing.                                               | Simple and fast. No external libraries needed.                          | Leaves punctuation attached (*“happy.”*) and contractions unsplit (*“I’m”*); fails for languages written without spaces. |
| **Sentence Tokenization**                                | Splits text into sentences using rules or models.    | Document summarization, sentiment analysis at sentence level.                             | Maintains context; good for sentence-level NLP.                         | Rule-based models fail with abbreviations (*“Dr. Smith”*).                   |
| **Word Tokenization**                                    | Splits sentences into words/tokens.                  | Text classification, POS tagging, sentiment analysis.                                     | Foundation for most NLP tasks; widely supported in NLTK & SpaCy.        | Contractions are split differently across tools (*“don’t” → [“do”, “n’t”]* in NLTK); unseen words become out-of-vocabulary tokens. |
| **Character Tokenization**                               | Splits text into individual characters.              | Languages without clear word boundaries (Chinese, Japanese); spelling correction.         | Tiny vocabulary with no unknown tokens; useful for morphologically rich languages. | Very long sequences; single characters carry little meaning on their own. |
| **Subword Tokenization** (BPE, WordPiece, SentencePiece) | Splits words into smaller meaningful units.          | Transformers (BERT, GPT, T5), machine translation, speech recognition.                    | Handles rare/unknown words; reduces vocabulary size; language-agnostic. | More complex; requires training a tokenizer model.                           |
| **Rule-based Tokenization**                              | Uses regex/patterns to split text.                   | Domain-specific NLP (legal, medical, financial text).                                     | Customizable for industry-specific use cases.                           | Fragile; rules may fail for unseen cases.                                    |
| **Statistical / ML-based Tokenization**                  | Uses statistical models to predict token boundaries. | Languages written without spaces between words (Chinese, Japanese, Thai).                 | More accurate on such scripts; adapts better than hand-written rules.   | Requires annotated training data; slower than simple methods.                |
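
The rule-based row can be illustrated with a regular expression whose alternatives are ordered from most to least specific. The patterns below (dosages, currency amounts, dotted abbreviations) are assumptions chosen for this example, and `rule_tokenize` is a hypothetical helper; a real domain tokenizer would need a far longer rule set.

```python
import re

TOKEN_PATTERN = re.compile(
    r"""
      \d+(?:\.\d+)?\s?(?:mg|ml|kg)\b   # dosage such as "500 mg" or "2.5ml"
    | \$\d+(?:\.\d+)?                  # currency such as "$12.50"
    | (?:[A-Za-z]\.){2,}               # dotted abbreviation such as "U.S."
    | \w+(?:'\w+)?                     # word, optionally with an apostrophe
    | [^\w\s]                          # any other single punctuation mark
    """,
    re.VERBOSE,
)

def rule_tokenize(text):
    # Earlier alternatives win, so domain patterns are tried before plain words.
    return TOKEN_PATTERN.findall(text)

print(rule_tokenize("Take 500 mg daily; it costs $12.50 in the U.S. today."))
# ['Take', '500 mg', 'daily', ';', 'it', 'costs', '$12.50', 'in', 'the', 'U.S.', 'today', '.']
```

The fragility noted in the table shows up quickly: a unit that is not in the list (for example “IU”) falls back to generic word splitting.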

---

✅ **Key Insights for Interviews**:

* Mention that **traditional NLP** often used *word-level tokenization*, but **modern deep learning models** rely on *subword tokenization* for efficiency and robustness.
* Show awareness of **language diversity**: for example, whitespace tokenization does not work for **Chinese** or **Japanese**, which are written without spaces between words.
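
The language-diversity point is easy to demonstrate. In the sketch below, the Chinese sentence (roughly “I love natural language processing”) is an example chosen for illustration; character splitting is shown only as a crude fallback, since proper Chinese word segmentation needs a dedicated model.

```python
text_en = "I love NLP"
text_zh = "我爱自然语言处理"  # roughly: "I love natural language processing"

# Whitespace tokenization works for English...
print(text_en.split())  # ['I', 'love', 'NLP']

# ...but returns the whole Chinese sentence as a single token.
zh_whitespace = text_zh.split()
print(zh_whitespace)

# Character tokenization at least yields usable units (8 characters here).
zh_chars = list(text_zh)
print(zh_chars)
```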

