### Tokenization in NLP – Interview Theory Sheet

#### 1. What is tokenization?
##### Answer:
Tokenization is the process of splitting raw text into smaller units called tokens, which can be words, subwords, characters, or sentences. It is the foundational step in most NLP pipelines.

#### 2. Why is tokenization important in NLP?
##### Answer:
Tokenization transforms unstructured text into structured tokens, enabling subsequent tasks like POS tagging, parsing, sentiment analysis, and language modeling. Without tokenization, models cannot differentiate between linguistic units.

#### 3. What are different types of tokenization?
Type | Description | Example
--- | --- | ---
Word | Splits text into words | "Let's go" → ["Let", "'s", "go"]
Sentence | Splits text into sentences | "Hi. How are you?" → ["Hi.", "How are you?"]
Subword | Splits words into smaller reusable units (e.g., BPE, WordPiece) | "playing" → ["play", "##ing"]
Character | Splits into individual characters | "cat" → ["c", "a", "t"]
Regex-based | Custom rules to tokenize specific patterns | Emails, hashtags, prices, etc.
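
To make the table concrete, here is a minimal sketch of word, sentence, and character tokenization using only Python's standard re module. The function names are illustrative, and the regex rules are rough approximations, not what NLTK or spaCy actually do internally.

```python
import re


def word_tokens(text):
    # Clitics such as 's stay together; other punctuation becomes its own token.
    return re.findall(r"'\w+|\w+|[^\w\s]", text)


def sentence_tokens(text):
    # Naive rule: split on whitespace that follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text.strip())


def char_tokens(text):
    # Every character (including spaces) is a token.
    return list(text)


if __name__ == "__main__":
    print(word_tokens("Let's go"))
    print(sentence_tokens("Hi. How are you?"))
    print(char_tokens("cat"))
```

The naive sentence rule breaks on abbreviations such as "Dr.", which is why trained sentence splitters are usually preferred.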

#### 4. Explain the difference between word_tokenize and TreebankWordTokenizer.
Feature | word_tokenize() | TreebankWordTokenizer()
--- | --- | ---
Backend | Punkt sentence splitting, then Treebank-style word tokenization | Treebank rules only; expects one sentence at a time
Output Format | Words + Punctuation | Words + Punctuation
Usage Simplicity | Easier, automatic | More customizable
Preferred For | Quick NLP tasks | Linguistic parsing
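
A short sketch of how the two might be called side by side, assuming NLTK is installed. TreebankWordTokenizer is rule-based and needs no extra data, while word_tokenize() relies on the Punkt sentence model, which has to be downloaded separately (the resource name depends on the NLTK version). The wrapper function names are illustrative.

```python
from nltk.tokenize import TreebankWordTokenizer


def treebank_tokens(text):
    # Pure rule-based tokenizer; assumes the input is a single sentence.
    return TreebankWordTokenizer().tokenize(text)


def word_tokens(text):
    # word_tokenize() splits sentences with Punkt first, then tokenizes each one.
    # Assumption: Punkt data is available, e.g. after nltk.download("punkt")
    # or nltk.download("punkt_tab"), depending on the installed NLTK version.
    from nltk.tokenize import word_tokenize
    return word_tokenize(text)


if __name__ == "__main__":
    print(treebank_tokens("Don't stop"))
    print(treebank_tokens("Let's go"))
```

On multi-sentence input the two can differ, because the Treebank rules only treat a period at the very end of the string as a sentence-final token.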

#### 5. What is the role of sent_tokenize()?
##### Answer:
sent_tokenize() splits a paragraph or document into sentences using NLTK's pretrained Punkt model. Punkt learns abbreviations, collocations, and sentence starters from unlabeled text, so it avoids splitting after abbreviations such as "Dr." where a naive period-based rule would.
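
The sketch below is not Punkt. It is a hand-written illustration, using only the standard library, of why abbreviation handling matters. The ABBREVIATIONS set and both function names are assumptions made for this example; Punkt learns such a list from data instead of hard-coding it.

```python
import re

# Hypothetical, hand-picked abbreviation list (Punkt would learn this from text).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "u.s.", "e.g.", "i.e."}


def naive_split(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def abbreviation_aware_split(text):
    sentences, current = [], []
    for chunk in naive_split(text):
        current.append(chunk)
        last_word = chunk.split()[-1].lower()
        # Only close the sentence if the chunk does not end in a known abbreviation.
        if last_word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    text = "Dr. Smith lives in the U.S. today. He is well."
    print(naive_split(text))
    print(abbreviation_aware_split(text))
```

This simple rule still fails when an abbreviation really does end a sentence, which is the kind of ambiguity a trained model is meant to resolve.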

#### 6. What is RegexpTokenizer and where is it used?
##### Answer:
RegexpTokenizer allows custom tokenization using regular expressions. It is ideal for domain-specific extraction such as:

- Extracting monetary values
- Tokenizing hashtags, URLs, and dates
- Tokenizing programming code or log files
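
As a rough illustration of pattern-based tokenization, the sketch below uses Python's standard re module rather than NLTK itself. A similar pattern passed to RegexpTokenizer should give comparable results, though that is not verified here. The pattern and the function name are assumptions chosen for this example, not a complete grammar for prices, URLs, or dates.

```python
import re

# Order matters: specific patterns come before the generic word pattern.
TOKEN_PATTERN = re.compile(
    r"""
    \$\d+(?:\.\d+)?        # prices such as $19.99
  | \#\w+                  # hashtags
  | https?://\S+           # URLs (up to the next whitespace)
  | \d{4}-\d{2}-\d{2}      # ISO-style dates such as 2024-01-15
  | \w+                    # plain words
    """,
    re.VERBOSE,
)


def regex_tokens(text):
    # Non-capturing groups only, so findall returns whole matches.
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(regex_tokens("Paid $19.99 on 2024-01-15 via https://example.com #deal"))
```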

#### 7. What are the challenges in tokenization?
##### Answer:
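- Ambiguity: a period may mark an abbreviation ("U.S.") or the end of a sentence
- Contractions: "Don't" → ["Do", "n't"]
- Multilingual text: rules tuned for one language often fail on another
- Special symbols and emojis
- Long compound words (German) and scripts written without whitespace between words (Chinese, Japanese)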
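
Several of these challenges show up as soon as text is split on whitespace alone. The snippet below is a small demonstration with made-up sample strings; the function name is illustrative.

```python
def whitespace_tokens(text):
    # Baseline tokenizer: split on runs of whitespace only.
    return text.split()


if __name__ == "__main__":
    samples = [
        "Don't stop!",        # contraction and punctuation stay glued to words
        "I love NLP 😀🎉",     # adjacent emojis end up in one token
        "我爱自然语言处理",      # no whitespace, so the whole sentence is one token
    ]
    for sample in samples:
        print(whitespace_tokens(sample))
```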

#### 8. Compare NLTK vs spaCy vs HuggingFace tokenizers
Feature | NLTK | spaCy | HuggingFace
--- | --- | --- | ---
Language Support | Mainly English; Punkt models for several other languages | Multilingual | Multilingual
Customization | Medium | High | Very High (subword-level)
Speed | Medium | Very Fast | Fast
Deep Learning | Not built-in | Via transformer pipelines (spacy-transformers) | Deep Learning optimized
Use Case | Education, research | Production NLP | Transformers, BERT, GPT

#### 9. What is subword tokenization? Where is it used?
##### Answer:
Subword tokenization breaks unknown or rare words into smaller known units (e.g., "unhappiness" → ["un", "happi", "ness"]). Models like BERT (WordPiece), GPT (BPE), and T5 (SentencePiece) rely on it to keep the vocabulary small while still representing rare and unseen words.
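
One way to picture this is the greedy longest-match-first lookup that WordPiece-style tokenizers apply at inference time. The sketch below assumes a tiny hand-made vocabulary (TOY_VOCAB is hypothetical); real vocabularies are learned from a corpus and contain tens of thousands of entries.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"un", "##happi", "##ness", "play", "##ing"}


def wordpiece_tokenize(word, vocab=TOY_VOCAB, unk="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remainder is in the vocabulary.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


if __name__ == "__main__":
    for word in ["unhappiness", "playing", "xyz"]:
        print(word, wordpiece_tokenize(word))
```

BPE reaches a similar result differently, by replaying learned merge rules, so this sketch should not be read as a description of every subword scheme.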

#### 10. How do you handle tokenization for Indian languages (e.g., Hindi, Tamil)?
##### Answer:
Tokenization for Indian languages often involves:

- Unicode normalization
- Syllable or morpheme-based tokenizers
- Tools like Indic NLP Library, spaCy + custom rules, or HuggingFace tokenizers trained on native corpora
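
The first two steps can be sketched with the standard library alone. The example below normalizes the text and then groups Devanagari characters into rough orthographic syllables (aksharas). It is a simplified assumption-laden illustration for a single Hindi word, not the method used by Indic NLP Library or any other tool; the function name and the Devanagari-only VIRAMA constant are choices made for this example.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari virama (halant), which joins consonants into conjuncts


def split_aksharas(word):
    # Step 1: Unicode normalization, so equivalent encodings compare equal.
    word = unicodedata.normalize("NFC", word)
    clusters = []
    for ch in word:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")  # vowel signs, virama, nukta, etc.
        joins_conjunct = (
            bool(clusters)
            and clusters[-1].endswith(VIRAMA)
            and category.startswith("L")
        )
        # Step 2: attach marks and conjunct consonants to the previous cluster.
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    print(split_aksharas("नमस्ते"))
```

Splitting on raw code points instead would separate vowel signs from their consonants, which is rarely a useful unit for downstream models.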

#### ✅ Pro Interview Tips
- Always mention trade-offs (e.g., speed vs accuracy)
- Reference real-world cases (e.g., Twitter sentiment, resume parsing)
- Be prepared to write regex for pattern-based tokenization
- Understand how tokenization impacts vectorization and embedding
