
# **Phrase Matching and Vocabulary in NLP**

---

## **1. Theory**

### **Phrase Matching**

* **Definition**: Phrase Matching is the process of identifying **multi-word expressions** (MWEs) or predefined sequences of words in text.
* Unlike simple word search, it finds **meaningful groups of words**.
* Example:
  Text: *“He lives in New York City and works at the United Nations.”*

  * Phrase matches: *“New York City”*, *“United Nations”*.

👉 Useful in:

* **Information extraction** (company names, drug names, legal terms).
* **Domain-specific NLP** (medical jargon, financial terminology).
* **Pattern-based rules** in chatbots.
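
To make the idea concrete, here is a minimal, library-free sketch of phrase matching. It assumes a simple regex tokenizer and exact, case-sensitive matching; the function name `find_phrases` is hypothetical and chosen only for illustration. Real matchers handle tokenization and normalization far more carefully.

```python
import re

def find_phrases(text, phrases):
    """Return (start, end, phrase) token spans for each known phrase found in text."""
    # Assumption: a simple regex tokenizer that splits words from punctuation.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Index phrases by their token tuple so each candidate check is one dict lookup.
    phrase_index = {tuple(p.split()): p for p in phrases}
    max_len = max(len(key) for key in phrase_index)

    matches = []
    for start in range(len(tokens)):
        # Try every window length up to the longest known phrase.
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            candidate = tuple(tokens[start:end])
            if candidate in phrase_index:
                matches.append((start, end, phrase_index[candidate]))
    return matches

text = "He lives in New York City and works at the United Nations."
matches = find_phrases(text, ["New York City", "United Nations"])
for start, end, phrase in matches:
    print(start, end, phrase)
```

The spans are token offsets (end exclusive), so the sketch reports where each phrase occurs, not just that it occurs.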

---

### **Vocabulary**

* **Definition**: In NLP, vocabulary refers to the **set of unique tokens (words, subwords, or characters)** recognized by a model or system.
* Acts as a **dictionary of known tokens** used during training/inference.
* Example:

  * Text corpus: *“Cats play. Dogs play.”*
  * Vocabulary: `{cats, play, dogs, .}`

👉 Types of Vocabulary:

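1. **Fixed Word-Level Vocabulary** (traditional ML) → Built from the unique words in the training data (BoW, TF-IDF); words not seen in training are out-of-vocabulary.
2. **Subword Vocabulary** (modern NLP) → Learned by subword tokenizers (BPE, WordPiece, SentencePiece), so rare words can be split into known pieces.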
3. **Domain Vocabulary** → Customized lists (medical, legal, finance).
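
A word-level vocabulary of the first kind can be sketched in a few lines. This is an illustrative example only: the helper names `build_vocab` and `encode`, the whitespace tokenization, and the `<unk>` placeholder are assumptions, not part of any particular library.

```python
from collections import Counter

def build_vocab(sentences, min_freq=1, unk_token="<unk>"):
    """Map each token seen at least min_freq times to an integer id."""
    # Assumption: lowercase + whitespace splitting stands in for a real tokenizer.
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    vocab = {unk_token: 0}  # id 0 is reserved for unknown tokens
    for token, freq in sorted(counts.items()):
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab

def encode(sentence, vocab, unk_token="<unk>"):
    """Convert a sentence to ids, sending unseen tokens to the unknown id."""
    return [vocab.get(tok, vocab[unk_token]) for tok in sentence.lower().split()]

vocab = build_vocab(["cats play", "dogs play"])
print(vocab)                       # {'<unk>': 0, 'cats': 1, 'dogs': 2, 'play': 3}
print(encode("cats bark", vocab))  # [1, 0]
```

The word "bark" never appeared in the corpus, so it collapses to the unknown id. This is the out-of-vocabulary problem that subword vocabularies are designed to reduce.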

---

## **2. Examples**

### **Phrase Matching in SpaCy**

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

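# Define phrases to match; make_doc only tokenizes, which is all PhraseMatcher needs
phrases = ["New York City", "United Nations"]
patterns = [nlp.make_doc(text) for text in phrases]
# The label is an arbitrary match key; the phrases here are not all organizations
matcher.add("KNOWN_PHRASES", patterns)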

doc = nlp("He lives in New York City and works at the United Nations.")

matches = matcher(doc)
for match_id, start, end in matches:
    print("Matched Phrase:", doc[start:end].text)
```

**Output:**

```
Matched Phrase: New York City
Matched Phrase: United Nations
```

---

### **Vocabulary in NLTK**

```python
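import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; the resource is named
# "punkt_tab" in recent NLTK releases and "punkt" in older ones
nltk.download("punkt_tab", quiet=True)
nltk.download("punkt", quiet=True)

text = "Cats play. Dogs play."
tokens = word_tokenize(text.lower())
vocab = set(tokens)

# Sets are unordered, so sort for a reproducible printout
print("Vocabulary:", sorted(vocab))
```

**Output:**

```
Vocabulary: ['.', 'cats', 'dogs', 'play']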
```

---

## **3. Interview-Style Q&A**

### **Basic Level**

**Q1. What is phrase matching in NLP?**
*A: Phrase matching is the process of finding multi-word expressions or predefined word sequences in text, often used for entity or keyword extraction.*

**Q2. What is vocabulary in NLP?**
*A: Vocabulary is the set of unique tokens (words, subwords, or characters) recognized and used by an NLP system or model.*

---

### **Intermediate Level**

**Q3. How is phrase matching different from named entity recognition (NER)?**
*A: Phrase matching relies on exact string matching or pattern rules, while NER uses statistical/machine learning models to detect entities even if they weren’t predefined.*

**Q4. Why is vocabulary size important in NLP models?**
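*A: Large vocabularies increase model size and training cost, because the embedding and output layers grow with every added token. Small vocabularies lose expressiveness or split text into longer token sequences. Subword-based vocabularies balance efficiency and coverage.*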
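
The size effect can be checked with simple arithmetic: an embedding table holds one vector per vocabulary entry. The sketch below assumes a 768-dimensional embedding and two purely illustrative vocabulary sizes; the function name `embedding_parameters` is hypothetical.

```python
def embedding_parameters(vocab_size, embedding_dim):
    """Number of weights in an embedding table: one vector per vocabulary entry."""
    return vocab_size * embedding_dim

EMBEDDING_DIM = 768  # assumed embedding width, for illustration only

# Assumed, illustrative vocabulary sizes (not measurements of any specific model)
subword_params = embedding_parameters(30_000, EMBEDDING_DIM)
word_level_params = embedding_parameters(500_000, EMBEDDING_DIM)

print(f"30k subword vocabulary:     {subword_params:,} parameters")     # 23,040,000
print(f"500k word-level vocabulary: {word_level_params:,} parameters")  # 384,000,000
```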

---

### **Advanced Level**

**Q5. How does SpaCy implement phrase matching?**
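*A: SpaCy provides a `PhraseMatcher` that matches token sequences in a `Doc` against a list of pre-defined `Doc` patterns. It compares hashed token attributes (the verbatim text by default, or another attribute such as `LOWER` via the `attr` argument), which keeps matching fast even for large terminology lists.*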

**Q6. How do modern NLP models like BERT handle vocabulary differently than Bag of Words?**
*A: Bag of Words uses a word-level vocabulary, so any word not seen in training is out-of-vocabulary. BERT's vocabulary is also fixed, but its entries are subwords (WordPiece). This reduces out-of-vocabulary issues, because rare or unknown words are split into smaller known units instead of being dropped.*
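
The splitting step can be sketched as a greedy longest-match-first search, which is the inference-time strategy commonly described for WordPiece. This is a simplified illustration with a hand-made toy vocabulary, not BERT's actual tokenizer; the function name and the vocabulary contents are assumptions.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # "##" marks a piece that continues a word
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(current)
        start = end
    return pieces

# Toy vocabulary, assumed for illustration only
toy_vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Neither "playing" nor "unbelievable" is in the toy vocabulary as a whole word, yet both are fully represented by known pieces.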

---

## **4. Quick Comparison Table**

| **Aspect**      | **Phrase Matching**                     | **Vocabulary**                                |
| --------------- | --------------------------------------- | --------------------------------------------- |
| **Definition**  | Finds predefined multi-word expressions | Set of unique tokens recognized by a model    |
| **Granularity** | Multi-word expressions                  | Words, subwords, or characters                |
| **Use Cases**   | Entity extraction, rule-based systems   | Token representation, feature extraction      |
| **Tools**       | SpaCy’s `PhraseMatcher`, regex          | NLTK corpus, SpaCy vocab, BPE, WordPiece      |
| **Limitations** | Only matches predefined terms           | OOV (Out-of-Vocabulary) problem in word-level models |

---
