# Semantic NLP 

## Lexical Semantics

Definition:
The study of word meanings and the relations among words: how words convey meaning individually and in combination.

Key Concepts:
- **Synonyms** share similar meanings but may differ in tone or style (big vs large, begin vs commence). 
- **Antonyms** can be gradable (hot vs cold), complementary (alive vs dead) or relational (teacher vs student). 
- **Polysemy** covers words with related senses, such as paper (material vs article).  
- **Homonyms** are words that share the same form (spelling, pronunciation, or both) but have completely unrelated meanings. Two closely related categories:
    - **Homophones**, which sound the same but differ in spelling and meaning, such as flower and flour 
    - **Homographs**, which are spelt the same but carry different meanings, and, sometimes, even different pronunciations, such as lead (to guide) and lead (the metal). 
- **Hypernyms** and **hyponyms** describe category relationships.
    - A **hypernym** is a broad category word, such as animal.
    - A **hyponym** is a more specific member of that category, such as dog. You can think of this as an 'is-a-kind-of' relationship: a dog is a kind of animal.
- **Holonyms** and **meronyms** capture part–whole relations.
    - A **holonym** is the whole, such as tree.
    - A **meronym** is a part of that whole, such as branch or leaf.
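
The category and part–whole relations above can be stored as simple lookups. The sketch below uses a tiny hand-made lexicon; the entries and the helper names (`hypernym_chain`, `is_kind_of`, `holonyms_of`) are assumptions made for illustration and are not taken from WordNet or any library.

```python
# Toy lexicon: every entry is illustrative, not copied from a real resource.
HYPERNYM = {
    "dog": "animal", "cat": "animal", "animal": "organism",
    "oak": "tree", "tree": "plant", "plant": "organism",
}
MERONYMS = {"tree": ["branch", "leaf", "trunk"]}  # whole -> parts

def hypernym_chain(word):
    """Follow 'is-a-kind-of' links upward until no broader category is listed."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def is_kind_of(specific, general):
    """True if `general` appears anywhere above `specific` in the hierarchy."""
    return general in hypernym_chain(specific)

def holonyms_of(part):
    """Return every whole that lists `part` as one of its meronyms."""
    return [whole for whole, parts in MERONYMS.items() if part in parts]

print(hypernym_chain("dog"))       # ['animal', 'organism']
print(is_kind_of("oak", "plant"))  # True
print(holonyms_of("leaf"))         # ['tree']
```

Hyponymy is transitive here (an oak is a kind of plant because it is a kind of tree), which is why the lookup walks the whole chain rather than checking only the direct parent.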

Applications:
- Word sense disambiguation
- Information retrieval
- Semantic search

| Concept         | Description                             | Example                              |
| --------------- | --------------------------------------- | ------------------------------------ |
| **Synonymy**    | Words with similar meanings             | *big* ↔ *large*                      |
| **Antonymy**    | Words with opposite meanings            | *hot* ↔ *cold*                       |
| **Polysemy**    | One word with multiple related meanings | *paper* (material / article)         |
| **Homonymy**    | One word with unrelated meanings        | *bat* (animal / sports equipment)    |
| **Hyponymy**    | “Is-a” relationship                     | *rose* is a *flower*                 |
| **Meronymy**    | “Part-of” relationship                  | *wheel* is part of *car*             |
| **Hypernymy**   | More general term                       | *vehicle* is a hypernym of *car*     |
| **Collocation** | Words that co-occur naturally           | *make a decision*, *fast food*       |

**Remember:**

“Lexical semantics is about the relationships between words and the structure of meaning.”

💡 Mnemonic: SAPH HMHC
Synonymy – Antonymy – Polysemy – Homonymy – Hyponymy – Meronymy – Hypernymy – Collocation.

**Example:**

Sentence: “The dog chased the cat.”
- dog → agent (animate noun)
- chased → action (verb)
- cat → patient (object acted upon)

Lexical semantics helps us interpret the meaning role that each word plays.

![image.png](attachment:image.png)

---
---

## Word Sense Disambiguation (WSD)

### What is WSD?

WSD means choosing the correct meaning (sense) of an ambiguous word from several candidate senses, using the surrounding context.

### 1. Lesk algorithm (classic gloss-overlap method)

✳️ **Core idea**

Compare the dictionary gloss (definition plus example words) of each candidate sense with the words in the context window around the target word. The sense whose gloss shares the most words with the context wins.

🔁 **Simple Lesk (steps)**

- Pick the target word w in the sentence.
- Collect context words: the whole sentence or a window of ±N words.
- For each candidate sense s of w:
    - Get gloss(s) (definition + example usage).
    - Optionally expand the gloss with related glosses (see extended Lesk).
    - Compute overlap = number of words shared by gloss(s) and the context.
- Choose the sense with the highest overlap (tie-breaker: frequency or first sense).

📌 **Code sketch (concise Python)**

```python
def simple_lesk(context_words, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context.

    sense_glosses: dict mapping sense name -> gloss text, most frequent sense first.
    """
    context = {w.lower() for w in context_words}
    best_sense, best_score = None, -1  # -1 lets the first sense win when nothing overlaps
    for sense, gloss in sense_glosses.items():
        gloss_words = set(gloss.lower().split())
        score = len(context & gloss_words)  # |context ∩ gloss_words|
        if score > best_score:  # strict ">" keeps the earlier sense on ties
            best_sense, best_score = sense, score
    return best_sense
```

✅ **Worked example**

Sentence: “The workers cleaned the plant after the shift.”
- Target: plant (senses: living organism vs factory)
- Context words: {workers, cleaned, after, shift}
- Gloss(living-plant): {plant, living, green, leaf, stem, grow, ...} → overlap with context: none
- Gloss(factory-plant): {factory, plant, industrial, factory-floor, workers, shift, machinery, ...} → overlap: {workers, shift} → higher overlap → choose factory.
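
The worked example can be reproduced with a few lines of Python. This is a minimal sketch: the two glosses, the stopword list, and the helper names `content_words` and `lesk_scores` are written for this illustration and are not quoted from any dictionary or library.

```python
STOPWORDS = {"the", "a", "an", "after", "of", "in", "on", "and", "where", "that"}

def content_words(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return {t for t in tokens if t and t not in STOPWORDS}

def lesk_scores(sentence, target, glosses):
    """Return (best_sense, overlap count per sense) for the target word."""
    context = content_words(sentence) - {target}
    scores = {sense: len(context & content_words(gloss))
              for sense, gloss in glosses.items()}
    best = max(scores, key=scores.get)  # on ties, the first listed sense wins
    return best, scores

# Hypothetical glosses, written for this example only.
GLOSSES = {
    "plant (organism)": "a living organism with leaves, stems and roots that grows in soil",
    "plant (factory)": "an industrial site where workers operate machinery in each shift",
}

sense, scores = lesk_scores(
    "The workers cleaned the plant after the shift.", "plant", GLOSSES
)
print(sense)   # plant (factory)
print(scores)  # {'plant (organism)': 0, 'plant (factory)': 2}
```

The factory sense wins only because its gloss happens to mention *workers* and *shift*; with a terser gloss the overlap would drop to zero, which is the weakness listed under Limitations.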

➕ **Advantages of Lesk**

- Simple, intuitive.
- No labeled training data needed.
- Interpretable (you can eyeball overlaps).

➖ **Limitations**

- Depends on the quality and wording of glosses: overlaps are sparse when gloss wording differs from the context.
- Sensitive to stopwords/tokenization choices.
- Poor when context uses synonyms not present in gloss.
- Basic Lesk ignores deeper semantic links.

### Improved / Extended Lesk & practical variants

|                    Variant | What it changes                                                                        | Why helpful                                              |
| -------------------------: | -------------------------------------------------------------------------------------- | -------------------------------------------------------- |
|          **Extended Lesk** | Expand gloss with glosses of related synsets (hypernyms, hyponyms, meronyms, examples) | More overlap opportunities; captures broader relatedness |
|      **Context expansion** | Use larger context (sentence, paragraph) or incorporate collocations                   | Better signal when sentence is short                     |
|       **Weighted overlap** | Weight rare words / longer overlaps more (e.g., n-gram overlaps score higher)          | Penalizes common words; rewards strong matches           |
| **Lesk with POS matching** | Match POS-tagged words only (noun senses matched to nouns)                             | Reduces spurious matches                                 |
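
Two of the variants in the table, Extended Lesk and weighted overlap, can be combined in a short sketch. Everything here is an assumption made for illustration: the mini sense inventory `SENSES`, the paraphrased glosses, the `related` links, and the IDF-style weighting formula.

```python
import math

# Hypothetical mini inventory: a gloss plus names of directly related senses.
SENSES = {
    "plant.factory": {"gloss": "buildings for carrying on industrial labor",
                      "related": ["factory"]},
    "plant.organism": {"gloss": "a living organism lacking the power of locomotion",
                       "related": ["tree"]},
    "factory": {"gloss": "a building where workers manufacture goods during a shift",
                "related": []},
    "tree": {"gloss": "a tall perennial woody organism with a trunk and branches",
             "related": []},
}

def tokens(text):
    return set(text.lower().split())

def extended_gloss(sense):
    """Own gloss plus the glosses of directly related senses (Extended Lesk)."""
    words = tokens(SENSES[sense]["gloss"])
    for related in SENSES[sense]["related"]:
        words |= tokens(SENSES[related]["gloss"])
    return words

def idf(word):
    """Words that occur in fewer glosses get a higher weight."""
    doc_freq = sum(word in tokens(s["gloss"]) for s in SENSES.values())
    return math.log((1 + len(SENSES)) / (1 + doc_freq)) + 1.0

def weighted_overlap(context, sense):
    return sum(idf(w) for w in context & extended_gloss(sense))

context = {"workers", "cleaned", "shift"}
scores = {s: weighted_overlap(context, s) for s in ("plant.factory", "plant.organism")}
print(scores)
```

With the plain gloss alone, `plant.factory` shares no word with the context; the overlap appears only after the related `factory` gloss is merged in, which is the point of the extension.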


---

### 2. Knowledge-based WSD (WordNet, ConceptNet, graph approaches)

✳️ **Core idea**

Use lexical/semantic networks (WordNet, ConceptNet) to compute semantic relatedness between context words and candidate senses. Rather than simple gloss matching, measure distance or relatedness in a graph of concepts.

🔹 **Common knowledge-based techniques**

- **Path length / shortest path** between the candidate synset and the synsets of context words (shorter = more related).
- **Information content (IC) methods** (Resnik, Lin, Jiang-Conrath): combine corpus-based informativeness with taxonomy structure.
- **Lesk on expanded glosses**: gloss overlap that also includes related synsets’ glosses (a hybrid).
- **Graph-based centrality / PageRank**: build a subgraph around the context and rank senses by connectivity.

**Use of ConceptNet**: match commonsense relations (UsedFor, AtLocation) against context cues.

🔁 **Typical pipeline (WordNet path approach)**
```
For each context word (cw), get candidate synsets S_cw.
For each sense candidate s of the target:
    - Compute similarity(s, S_cw) for all cw (e.g., shortest path length or IC-based score).
    - Aggregate similarity scores across cw (sum/avg).
- Pick sense s with highest aggregated relatedness.
```

✳️ **Example:**

“I sat on the bank and watched the river flow.”
- Target: bank
- Context synsets: {sit→action, river→river.n.01, flow→move.v.01}

- bank senses (WordNet 3.0 names):
    - bank.n.01 = river bank, sloping land beside a body of water (hypernym: slope)
    - depository_financial_institution.n.01 = financial institution, the second noun sense listed for “bank” (hypernym: financial institution)

- The path between the river-bank sense and the river synset is expected to be shorter than the path from the financial sense → choose river bank.
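
The path-based pipeline can be sketched on a small hand-made graph. The nodes and edges below are hypothetical and far simpler than real WordNet links; the score `1 / (1 + path length)` is one common way to turn a distance into a similarity.

```python
from collections import deque

# Hypothetical undirected concept graph (invented for this sketch).
EDGES = [
    ("bank/river", "slope"), ("slope", "geological_formation"),
    ("bank/river", "body_of_water"), ("river", "stream"), ("stream", "body_of_water"),
    ("bank/finance", "financial_institution"), ("money", "financial_institution"),
    ("financial_institution", "institution"), ("institution", "organization"),
    ("geological_formation", "entity"), ("body_of_water", "entity"),
    ("organization", "entity"),
]
GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

def path_length(start, goal):
    """Breadth-first search: number of edges on the shortest path, or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in GRAPH.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def path_similarity(a, b):
    dist = path_length(a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)

def disambiguate(senses, context_concepts):
    """Sum the similarity to every context concept and keep the best sense."""
    scores = {s: sum(path_similarity(s, c) for c in context_concepts) for s in senses}
    return max(scores, key=scores.get), scores

BANK_SENSES = ["bank/finance", "bank/river"]
print(disambiguate(BANK_SENSES, ["river"]))
print(disambiguate(BANK_SENSES, ["money"]))
```

In this toy graph the river context selects `bank/river` and the money context selects `bank/finance`; with a real lexical network the same aggregation step applies, but the path lengths come from the resource itself.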

➕ **Advantages**

- Leverages rich lexical relations (is-a, part-of, etc.).
- Works better when glosses alone don’t overlap.
- Does not require annotated corpora.

➖ **Limitations**

- Requires a good lexical graph (WordNet coverage varies by domain/language).
- Some measures need corpus-based IC values (requires corpora).
- Computationally heavier (graph traversals, similarity computations).

---

### Comparison: Lesk vs knowledge-based WSD

| Aspect         |                    Lesk (gloss overlap) | Knowledge-based (graph/WordNet)                         |
| -------------- | --------------------------------------: | ------------------------------------------------------- |
| Data needed    |                 Dictionary/glosses only | Lexical network (WordNet/ConceptNet) ± corpora for IC   |
| Core operation |                          Count overlaps | Compute semantic relatedness / path distance            |
| Strengths      |      Simple, interpretable, no training | Uses rich relations; handles indirect relatedness       |
| Weaknesses     | Weak when gloss/context wording differs | Needs good lexical resource; heavier compute            |
| Best when      |    Gloss wording contains context words | Context words relate indirectly via hypernyms/relations |


### Practical tips & engineering signals

- Preprocess: normalize tokens, remove stopwords carefully (keep content words), POS-tag context so you compare nouns-to-nouns, verbs-to-verbs.

- Context window: use the sentence plus its neighboring sentences for short texts (tweets), and the whole paragraph for long texts.

- Combine signals: hybridize Lesk + path similarity + distributional similarity (embeddings) for best results.

- Use frequency as tie-breaker: if two senses score equally, prefer more frequent sense (first sense heuristic).

- Domain adaptation: WordNet may miss domain senses; consider domain-specific ontologies or embeddings.

### Hybrid & modern enhancements (brief)

- Gloss + embeddings: embed glosses and context with sentence embeddings; pick sense whose gloss embedding is closest to context embedding.

- Graph + PageRank: build graph of candidate senses + context words and run PageRank; top-ranked sense selected.

- Use pretrained contextual models (BERT): fine-tune classifiers or map sense inventories to contextual embeddings. State-of-the-art WSD uses supervised neural models trained on SemCor, which goes beyond the knowledge-based methods covered here.
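
The Graph + PageRank idea can be sketched with plain power iteration. The graph below is a hypothetical toy: its nodes and edges were invented for the “plant” sentence, and a real system would derive them from a lexical resource. The `pagerank` function is a basic, non-personalized variant written for this sketch.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration on an undirected graph given as {node: set_of_neighbors}."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor m passes on an equal share of its current rank.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            updated[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = updated
    return rank

# Hypothetical links between candidate senses and context words.
EDGES = [
    ("plant/factory", "worker"), ("plant/factory", "shift"), ("worker", "shift"),
    ("clean", "plant/factory"), ("clean", "plant/organism"),
]
GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

rank = pagerank(GRAPH)
senses = ["plant/factory", "plant/organism"]
best = max(senses, key=rank.get)
print(best, {s: round(rank[s], 3) for s in senses})
```

The factory sense ranks higher because more context nodes link to it; the ranking is only as good as the edges fed into the graph.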

### Quick reference

| Item                      | What to do                                             | Why / Note                          |
| ------------------------- | ------------------------------------------------------ | ----------------------------------- |
| Lesk basic                | Count overlaps between gloss and context               | Very simple; no corpus needed       |
| Lesk extended             | Add related glosses (hypernyms, examples)              | Reduce sparse-overlap problem       |
| Path similarity           | Shortest path in WordNet between s and context synsets | Good for taxonomy-based relatedness |
| IC-based (Resnik/Lin/JCN) | Use corpus-derived IC + taxonomy                       | More principled similarity measure  |
| Graph methods             | Build subgraph, rank by centrality                     | Captures multi-hop relations        |
| Hybrid                    | Combine Lesk + WordNet + embeddings                    | Most robust in practice             |


### Two worked examples side-by-side
- Example A: industrial “plant”
    - Sentence: “The workers cleaned the plant after the shift.”
    - Lesk: overlaps with gloss(factory) → choose factory.
    - WordNet path: context words (workers, shift, cleaned) map to synsets related to industry/work → shorter path to factory sense → factory.

- Example B: ambiguous “bank”
    - Sentence 1: “She deposited money at the bank.” → money → strong relation to financial institution → choose financial bank.
    - Sentence 2: “He sat on the bank and fished.” → fished → relation to water/river → choose river bank.

### Memory aids

- Lesk = Look for Overlap in the Gloss (L-O-G)
- Knowledge-based = WALK the graph (WordNet) and see which sense is closest
- Hybrid rule: if gloss overlap is strong → Lesk wins; if overlap is weak → use WordNet path/IC; if that is also inconclusive → use embeddings.

---
---

## Distributional Semantics

Distributional semantics learns meaning from raw usage patterns in large text corpora.

🧩 **1. What is Distributional Semantics?**

| Concept | Meaning | Analogy / Memory cue |
| --- | --- | --- |
| **Goal** | Represent the *meaning* of words based on the *contexts they appear in* | “You shall know a word by the company it keeps.” |
| **Core idea (Distributional Hypothesis)** | Words that occur in similar contexts tend to have similar meanings. | *dog* and *cat* both appear near “pet”, “fur”, “animal” → similar meanings |
| **Approach type** | **Data-driven**, learned directly from text (unlike WordNet-based). | Think: learn from usage, not dictionaries |
| **Output** | A vector representation for each word showing its association strengths with other words (its *context signature*). | “word-as-vector” |


🧮 **2. Building a Co-occurrence Matrix**

| Step | What it does | Example |
| --- | --- | --- |
| **1. Collect corpus** | Get a text sample, e.g. “The cat sat on the mat. The dog sat on the rug.” | small toy corpus |
| **2. Define window size** | Number of context words to consider around the target word (±2 is typical). | For “sat”, window words = {the, cat, on, the} |
| **3. Count co-occurrences** | Count how often each (target, context) pair appears. | (cat, mat)=1, (dog, rug)=1, (sat, cat)=1, etc. |
| **4. Form matrix** | Rows = target words, Columns = context words, Cells = counts. | see below |
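
The four steps in the table can be sketched in a few lines. The function names and the `window` parameter are assumptions made for this sketch: `window=None` treats the whole sentence as context, while `window=k` keeps only the ±k neighbors.

```python
from collections import Counter, defaultdict

corpus = "The cat sat on the mat. The dog sat on the rug."

def sentences(text):
    """Split on full stops and lowercase; good enough for this toy corpus."""
    return [s.lower().split() for s in text.split(".") if s.strip()]

def cooccurrence(sents, window=None):
    """Count (target, context) pairs within each sentence."""
    counts = defaultdict(Counter)
    for toks in sents:
        for i, target in enumerate(toks):
            lo = 0 if window is None else max(0, i - window)
            hi = len(toks) if window is None else min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own context
                    counts[target][toks[j]] += 1
    return counts

sent_counts = cooccurrence(sentences(corpus))            # whole-sentence context
win2_counts = cooccurrence(sentences(corpus), window=2)  # ±2 words

print(dict(sent_counts["cat"]))
print(dict(win2_counts["sat"]))
```

The window choice changes the counts: with `window=2`, *cat* and *mat* are four tokens apart and never pair, whereas a sentence-wide context counts them once.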

Example co-occurrence matrix

| Word ↓ \ Context → | cat | dog | sat | mat | rug | the |
| ------------------ | --: | --: | --: | --: | --: | --: |
| cat                |   0 |   0 |   1 |   1 |   0 |   2 |
| dog                |   0 |   0 |   1 |   0 |   1 |   2 |
| sat                |   1 |   1 |   0 |   1 |   1 |   4 |

(Context here = every other word in the same sentence. A strict ±2 window would not pair *cat* with *mat*, which are four tokens apart.)

👉 These raw counts show patterns, but frequent function words (“the”) dominate.




⚙️ **3. From Counts to Probabilities**

We convert raw counts into probabilities so we can compare how often a pair actually occurs with how often we would expect it to occur by chance.

$$P(w, c) = \frac{\text{count}(w, c)}{\text{total pairs}}$$

where

- $P(w) = \text{count}(w) / \text{total pairs}$
- $P(c) = \text{count}(c) / \text{total pairs}$

💡 **4. Pointwise Mutual Information (PMI)**

$$\mathrm{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}$$

| Concept            | Formula                                                                 | Intuition                                                    | Example                                                                      |
| ------------------ | ----------------------------------------------------------------------- | ------------------------------------------------------------ | ---------------------------------------------------------------------------- |
| **PMI**            | $\mathrm{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}$ | Compares *actual* co-occurrence vs *expected* if independent | If “dog” & “bark” appear together far more often than random chance, PMI > 0 |
| **Interpretation** | High PMI → strong semantic link, Low/Negative → weak/unrelated          | PMI(cat, meow) > PMI(cat, table)                             |                                                                              |
| **Why needed**     | Corrects bias of frequent but meaningless words like “the”, “is”, “and” | Filters out noise from function words                        |                                                                              |

🧭 **5. Positive PMI (PPMI)**

| Concept    | Definition                                                                       | Effect                                                   |
| ---------- | -------------------------------------------------------------------------------- | -------------------------------------------------------- |
| **PPMI**   | $\mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c), 0)$                              | Keeps only positive associations; sets negatives to zero |
| **Why**    | Negative PMI indicates “less likely than random”, usually not useful for meaning |                                                          |
| **Result** | Sparse but more meaningful matrix, highlighting genuine associations             |                                                          |

🧩 **6. Example PMI Computation (Toy numbers)**

Suppose:
- P(cat) = 0.10, P(mat) = 0.08, P(cat, mat) = 0.05
    - PMI(cat, mat) = log2 [0.05 / (0.10 * 0.08)] = log2(6.25) ≈ 2.64
    - → Strong association (since “cat” and “mat” co-occur much more often than random chance).

- If P(the) = 0.15 and P(cat, the) = 0.07,
    - PMI(cat, the) = log2 [0.07 / (0.10 * 0.15)] = log2(4.67) ≈ 2.22
    - → Still positive, but lower than PMI(cat, mat) even though the raw joint probability is higher; stopword removal usually downweights it further.
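
PMI and PPMI can be computed directly from pair counts. The sketch below uses made-up counts chosen for illustration; the variable and function names are assumptions, not part of any library.

```python
import math

# Made-up (target, context) pair counts, purely for illustration.
pair_counts = {
    ("cat", "meow"): 8, ("cat", "the"): 20, ("cat", "bark"): 1,
    ("dog", "bark"): 9, ("dog", "the"): 22,
}
total = sum(pair_counts.values())

# Marginal counts: how often each word appears as a target / as a context.
target_counts, context_counts = {}, {}
for (w, c), n in pair_counts.items():
    target_counts[w] = target_counts.get(w, 0) + n
    context_counts[c] = context_counts.get(c, 0) + n

def pmi(w, c):
    """log2 of the observed joint probability over the one expected under independence."""
    joint = pair_counts.get((w, c), 0) / total
    if joint == 0:
        return float("-inf")  # the pair was never observed together
    expected = (target_counts[w] / total) * (context_counts[c] / total)
    return math.log2(joint / expected)

def ppmi(w, c):
    """Positive PMI: clip negative (and -inf) values to zero."""
    return max(pmi(w, c), 0.0)

for pair in [("cat", "meow"), ("cat", "the"), ("cat", "bark"), ("dog", "meow")]:
    print(pair, round(pmi(*pair), 2), round(ppmi(*pair), 2))
```

In these toy counts, (cat, the) has the largest raw count of any pair involving *cat*, yet its PMI is slightly below zero because *the* is frequent with every target, so PPMI sets it to 0 while (cat, meow) stays positive.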

🧮 **7. Turning Co-occurrences into Vectors**

- Matrix as vectors:
    - Each row (word) = vector of co-occurrence or PMI values with contexts:
    - “cat” = [0.5 with ‘dog’, 2.6 with ‘mat’, …]
- Similarity measure:
    - Compute cosine similarity between vectors
    - $\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$
- Meaning:
    - Small angle (high cosine) → words used in similar contexts → similar meaning
    - “cat” ~ “dog”; “car” ~ “vehicle”

| Symbol | Meaning |
| --- | --- |
| $A$, $B$ | The two word vectors (e.g., word embeddings or PMI vectors) |
| $A \cdot B$ | Dot product of the two vectors: $\sum_i A_i B_i$ |
| $\lVert A \rVert$ | Magnitude (length) of vector $A$: $\sqrt{\sum_i A_i^2}$ |
| $\lVert B \rVert$ | Magnitude (length) of vector $B$: $\sqrt{\sum_i B_i^2}$ |
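
Cosine similarity takes only a few lines of plain Python. The sketch below applies it to the *cat* and *dog* rows of the toy co-occurrence matrix from section 2 (contexts in the order cat, dog, sat, mat, rug, the); the function name and the zero-vector convention are choices made for this sketch.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Rows of the toy matrix; contexts = [cat, dog, sat, mat, rug, the]
cat = [0, 0, 1, 1, 0, 2]
dog = [0, 0, 1, 0, 1, 2]

print(round(cosine(cat, dog), 3))  # 0.833
```

The high value comes mostly from the shared counts for *sat* and *the*, which illustrates why raw counts are usually reweighted (for example with PPMI) before similarities are compared.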

🧠 **8. Limitations and fixes**

| Issue                              | Description                                 | Solution                                                       |
| ---------------------------------- | ------------------------------------------- | -------------------------------------------------------------- |
| **High dimensionality & sparsity** | Huge matrices, most entries zero            | Dimensionality reduction (SVD, PCA) → Latent Semantic Analysis |
| **Frequency bias**                 | Common words dominate                       | PMI / PPMI weighting                                           |
| **Context window too small/large** | Too small = sparse counts, mostly syntactic similarity; too large = broad topical similarity | Tune based on corpus size/task |
| **Static meaning**                 | Each word has one vector (ignores polysemy) | Contextual embeddings (ELMo, BERT) give each occurrence its own vector; static embeddings such as Word2Vec and GloVe still assign one vector per word |


---

🧾 **9. Comparison table: Count-based vs. Predictive models**

| Type            | Example                                                           | Core idea                                       | Representation                |
| --------------- | ----------------------------------------------------------------- | ----------------------------------------------- | ----------------------------- |
| **Count-based** | Co-occurrence, PMI, LSA                                           | Compute from word-context counts                | Explicit matrix (PPMI values) |
| **Predictive**  | Word2Vec (Skip-gram, CBOW)                                        | Train a model to predict context words          | Learned dense embeddings      |
| **Link**        | GloVe bridges the two: it fits dense vectors to log co-occurrence counts | Combines count-based & predictive strengths | Learned dense embeddings      |


🌐 **10. Quick memory map**

| Concept                      | Keyword                  | Mnemonic                                 |
| ---------------------------- | ------------------------ | ---------------------------------------- |
| **Distributional semantics** | Meaning = usage          | *“A word by the company it keeps”*       |
| **Co-occurrence matrix**     | Count neighbors          | *Frequency fingerprint*                  |
| **PMI**                      | Surprise ratio           | *How much more together than by chance?* |
| **PPMI**                     | Filter negatives         | *Keep only meaningful pairs*             |
| **Cosine similarity**        | Compare word vectors     | *Angle between meanings*                 |
| **Limitations**              | Sparsity, frequency bias | *Eased by PPMI weighting and dense embeddings (Word2Vec/GloVe)* |


🧠 **11. Example summary**

| Sentence                    | Co-occurrence | PMI focus          | Insight            |
| --------------------------- | ------------- | ------------------ | ------------------ |
| “The cat sat on the mat.”   | (cat, mat)=1  | High PMI(cat, mat) | Real association   |
| “The dog sat on the rug.”   | (dog, rug)=1  | High PMI(dog, rug) | Real association   |
| “The cat sat by the dog.”   | (cat, dog)=1  | Low PMI(cat, dog)  | Weak, coincidental |

Thus, PMI separates meaningful relationships (“cat–mat”, “dog–rug”) from coincidental ones.

---
---

🧠 **Comparison Table: Lesk vs Knowledge-Based vs Distributional Approaches**

| **Aspect** | **Lesk Algorithm** | **Knowledge-Based WSD** | **Distributional / Embedding-Based WSD** |
| --- | --- | --- | --- |
| **Core Idea** | Disambiguate by comparing **dictionary gloss overlaps** with context words. | Use **semantic networks** (like WordNet/ConceptNet) to measure **semantic relatedness** between senses, e.g. path similarity or Wu-Palmer similarity. | Learn meaning directly from how words are used together in real text: use **statistical co-occurrence** (co-occurrence matrix, PMI, PPMI, cosine similarity) or **contextual embeddings** to infer meaning from usage patterns. |
| **Knowledge Source**       | Dictionary definitions (e.g., WordNet glosses)                                                         | Structured lexical databases (WordNet, ConceptNet, BabelNet)                                           | Large unlabeled text corpora (Wikipedia, Common Crawl, etc.)                                        |
| **Representation**         | Words represented by **gloss text**                                                                    | Words represented by **synsets** (concept nodes linked semantically)                                   | Words represented as **vectors or embeddings** in continuous space                                  |
| **Similarity Measure**     | **Gloss overlap count** (number of shared words between gloss and context)                             | **Semantic similarity** based on **path length**, **depth**, or **information content** in the network | **Cosine similarity** or **contextual distance** between embeddings                                 |
| **Example** | “Plant” in *‘cleaned the plant after the shift’* → gloss of “factory” overlaps with “workers”, “shift” | “Plant” → sense connected to “factory” synset through *workplace, building* relations | “Plant” vector near “factory” embeddings in context; transformer model infers correct sense |
| **Interpretability**       | Very interpretable (you can read the gloss and overlap)                                                | Interpretable (semantic links visible)                                                                 | Less interpretable (latent vectors encode meaning implicitly)                                       |
| **Data Requirement** | None beyond the dictionary text | No training data; relies on existing lexical networks | Requires large corpora to train embeddings (Word2Vec, BERT, etc.) |
| **Context Handling**       | Local context only (sentence-level overlap)                                                            | Broader lexical context via semantic relations                                                         | Deep contextual understanding via surrounding words and sentence structure                          |
| **Strengths**              | Simple, unsupervised, transparent                                                                      | Rich semantic relations, captures hierarchy and commonsense                                            | Captures nuanced, dynamic meaning shifts from real usage; scalable                                  |
| **Weaknesses** | Relies heavily on gloss wording; fails if glosses don’t overlap | Limited to predefined senses; struggles with unseen words | Requires massive data; less explainable; may lose fine-grained sense distinctions |
| **Typical Tools / Models** | WordNet (glosses), dictionary APIs                                                                     | WordNet, ConceptNet, BabelNet                                                                          | Word2Vec, GloVe, BERT, ELMo, RoBERTa                                                                |
| **Use Case Example**       | Rule-based NLP or educational applications                                                             | Semantic search, ontology-based reasoning                                                              | Modern NLP tasks (chatbots, QA, MT) with contextual embeddings                                      |


### Decision flow: how each method (Lesk, knowledge-based, distributional) works out what a word means in context

Sentence: “The workers cleaned the plant after the shift.”

#### 🧭 1️⃣ Lesk Algorithm (Dictionary Overlap Method)

| **Step** | **What Happens** | **Example / Visualization** |
| --- | --- | --- |
| **1.** Identify ambiguous word | `plant` | – |
| **2.** Fetch all its dictionary definitions (glosses) | *plant₁*: “a living organism”  <br>*plant₂*: “an industrial building where goods are made” | 📘 From dictionary |
| **3.** Look at *context words* | {workers, cleaned, shift} | 🧑‍🏭🧹⏰ |
| **4.** Compare overlap between gloss and context | Gloss₂ (“industrial building”), once expanded with its examples and related words, shares terms such as *workers* and *shift* with the context → higher overlap | 🔍 “Shift” and “workers” match factory sense |
| **5.** Choose sense with max overlap | ✅ *plant₂ = factory* | ✅ Simple, intuitive |


🧩 How it thinks: “Which dictionary definition uses similar words as the sentence?”

#### 🌐 2️⃣ Knowledge-Based WSD (Using WordNet / ConceptNet)

| **Step** | **What Happens** | **Example / Visualization** |
| --- | --- | --- |
| **1.** Identify ambiguous word | `plant` | – |
| **2.** Fetch its *synsets* (sense nodes) from WordNet | *plant₁* → living organism<br>*plant₂* → factory | 🌱 🏭 |
| **3.** Build small *semantic network* for context words | “worker” → *person employed in industry*<br>“shift” → *period of work in factory* | 🔗 “worker” ↔ “factory” ↔ “shift” |
| **4.** Measure semantic distance between senses | *plant₂ (factory)* is more closely linked to *worker* and *shift* in network than *plant₁ (tree)* | 🕸️ Shorter path in network |
| **5.** Choose most semantically connected sense | ✅ *plant₂ = factory* | ✅ Supported by real-world conceptual links |

🧩 How it thinks: “Which sense is better connected in the semantic network of meanings?”

#### 📊 3️⃣ Distributional Semantics (Context = Meaning)

| **Step** | **What Happens** | **Example / Visualization** |
| --- | --- | --- |
| **1.** Collect a huge text corpus | millions of sentences | 🗃️ |
| **2.** Build a **co-occurrence matrix** (how often words appear near each other) | “plant” often near {factory, worker, shift}, rarely near {forest, soil} | 🔢 Table of counts |
| **3.** Convert counts → **associations** using PMI / PPMI | Down-weights frequent but uninformative words like *the*, *and* | 🎯 Keeps only meaningful pairs |
| **4.** Compare word-context vectors (e.g., “plant” vs. “factory”) using **cosine similarity** | High similarity → same sense (factory); low → unrelated (tree) | 📐 Measures closeness in vector space |
| **5.** Choose meaning based on strongest statistical association | ✅ *plant₂ = factory* | ✅ Learned from data patterns |

🧩 How it thinks: “Which meaning of the word appears in similar contexts across real-world text?”
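
The five steps can be sketched in plain Python. This is a toy illustration, assuming a three-sentence corpus invented for the purpose and a window of 2; real systems use millions of sentences and sparse matrices.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus (step 1).
corpus = [
    "the plant hired a worker for the night shift",
    "the factory hired a worker for the day shift",
    "the tree grew in the forest soil",
]

def cooccurrence(sentences, window=2):
    """Step 2: count how often each word appears within `window` tokens of another."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = sent.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def ppmi(counts):
    """Step 3: keep only positive pointwise mutual information values."""
    total = sum(sum(ctx.values()) for ctx in counts.values())
    row = {w: sum(ctx.values()) for w, ctx in counts.items()}
    vectors = defaultdict(dict)
    for w, ctx in counts.items():
        for c, n in ctx.items():
            pmi = math.log2((n * total) / (row[w] * row[c]))
            if pmi > 0:
                vectors[w][c] = pmi
    return vectors

def cosine(u, v):
    """Step 4: cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vectors = ppmi(cooccurrence(corpus))
sim_factory = cosine(vectors["plant"], vectors["factory"])
sim_tree = cosine(vectors["plant"], vectors["tree"])
print(sim_factory > sim_tree)  # True: "plant" behaves like "factory" in this corpus
```

Step 5 then amounts to picking the sense whose representative word has the higher similarity.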

---
---

## 🧩 Semantic Role Labeling (SRL)

💡 Definition

SRL identifies the semantic roles (meanings or functions) that words or phrases play in a sentence, such as Agent, Patient, Instrument, Location, and Time.

In short:
- Syntax tells you how words are arranged.
- SRL tells you who does what to whom and when.

🧠 1️⃣ Example to Grasp the Core

Sentence: “The chef cooked the meal with care in the kitchen.”

| **Word/Phrase** | **Role**            | **Explanation**            |
| --------------- | ------------------- | -------------------------- |
| The chef        | **Agent**           | The doer of the action     |
| cooked          | **Predicate**       | The main verb / action     |
| the meal        | **Patient / Theme** | The thing being acted upon |
| with care       | **Manner**          | How the action was done    |
| in the kitchen  | **Location**        | Where the action happened  |

✅ SRL Output:

[Agent: The chef] [Predicate: cooked] [Patient: the meal] [Manner: with care] [Location: in the kitchen]


### 🧩 2️⃣ How SRL Works (Step-by-Step Pipeline)

| **Step** | **Task** | **Output Example** |
| --- | --- | --- |
| 1. **Sentence Input** | Raw text | “The chef cooked the meal with care.” |
| 2. **Syntactic Parsing** | Find subject–verb–object structure | `chef → cooked → meal` |
| 3. **Predicate Identification** | Find the main verbs | `cooked` |
| 4. **Argument Detection** | Identify related noun phrases | `chef`, `meal`, `care` |
| 5. **Role Classification** | Assign semantic roles | Agent = chef, Theme = meal, Manner = care |
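
Step 5 can be sketched as a lookup, assuming steps 2 to 4 have already produced phrases with a syntactic function. The function labels (`subject`, `prep:with`, ...) and the rule table are hypothetical; a real labeller learns this mapping, because it is not one-to-one (a passive subject is a Patient, and “with a knife” is an Instrument, not a Manner).

```python
# Hypothetical mapping from a pre-parsed syntactic function to a semantic role.
ROLE_RULES = {
    "subject": "Agent",
    "verb": "Predicate",
    "object": "Patient",
    "prep:with": "Manner",
    "prep:in": "Location",
}

def label_roles(chunks):
    """chunks: (phrase, syntactic_function) pairs from an upstream parser."""
    return [(ROLE_RULES.get(function, "Unknown"), phrase) for phrase, function in chunks]

# Hand-written parse of the example sentence (stands in for steps 2 to 4).
parsed = [
    ("The chef", "subject"),
    ("cooked", "verb"),
    ("the meal", "object"),
    ("with care", "prep:with"),
    ("in the kitchen", "prep:in"),
]

labelled = label_roles(parsed)
srl_output = " ".join(f"[{role}: {phrase}]" for role, phrase in labelled)
print(srl_output)
# [Agent: The chef] [Predicate: cooked] [Patient: the meal] [Manner: with care] [Location: in the kitchen]
```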


### ⚙️ 3️⃣ SRL vs Syntax: The Key Difference

| **Aspect** | **Syntactic Parsing** | **Semantic Role Labeling (SRL)** |
| --- | --- | --- |
| Focus | Grammatical structure | Meaning-based structure |
| Question answered | “Which word is the subject or object of which verb?” | “Who did what to whom, how, where?” |
| Example output | Subject → Verb → Object | Agent → Predicate → Patient |
| Example sentence | `chef → cooked → meal` | `[Agent: chef] [Predicate: cooked] [Patient: meal]` |


### 🧩 4️⃣ Common Semantic Roles

| **Role** | **Meaning** | **Example** |
| --- | --- | --- |
| **Agent** | Doer of action | “*The boy* kicked the ball.” |
| **Patient / Theme** | Entity affected by the action | “The boy kicked *the ball*.” |
| **Instrument** | Tool used | “He cut it *with a knife*.” |
| **Experiencer** | Feeler of emotion | “*She* loved the story.” |
| **Beneficiary** | Person benefited | “He cooked *for his friend*.” |
| **Location** | Place of action | “He slept *on the couch*.” |
| **Manner** | Way action performed | “He spoke *softly*.” |
| **Time** | When action occurred | “He arrived *at noon*.” |


### 🧮 5️⃣ Under the Hood (Model Types)

| **Approach**                          | **Description**                        | **Pros**                    | **Cons**                         |
| ------------------------------------- | -------------------------------------- | --------------------------- | -------------------------------- |
| **Rule-Based SRL**                    | Uses handcrafted grammar & parse trees | Interpretable               | Not scalable                     |
| **Feature-Based ML** (e.g., CRF, SVM) | Uses syntactic + lexical features      | Accurate with rich features | Needs manual feature engineering |
| **Deep Learning SRL** (BiLSTM, BERT)  | Learns context and role automatically  | High accuracy               | Black-box, data-hungry           |

🧩 Modern SRL systems typically use pre-trained language models like BERT to predict roles directly from context, often without explicit parse trees.

### 🎓 6️⃣ SRL vs Related Concepts

| **Concept** | **Goal** | **Example** |
| --- | --- | --- |
| **NER** | Label *entities* (Person, Org, etc.) | “Mary” → PERSON |
| **Dependency Parsing** | Show syntactic relations | `gave → object → book` |
| **SRL** | Show *semantic roles* (who did what) | `[Agent: Mary] [Theme: book]` |
| **Coreference Resolution** | Link mentions of same entity | “Mary” = “She” |


### 🧠 Summary Table

| **Aspect** | **SRL Essence** |
| --- | --- |
| **Purpose** | Assign roles like Agent, Patient, Instrument to sentence components |
| **Input Needed** | Tokenised sentence; classic pipelines also need POS tags + a syntactic parse |
| **Output** | Predicate–argument structure |
| **Example Toolkits** | AllenNLP, spaCy + PropBank model, Hugging Face SRL models |
| **Practical Uses** | Information extraction, question answering, summarization |

✅ Final SRL Summary:

→ Mary (Agent) gave (Predicate) a book (Theme) to John (Recipient) yesterday (Time).

---
---

## 💫 Named Entity Recognition (NER) + IOB Labelling

### 🧩 1️⃣ What is NER?

| **Aspect** | **Description** |
| --- | --- |
| **Definition** | Named Entity Recognition (NER) is the process of identifying and classifying *named entities* in text into predefined categories such as **Person, Organisation, Location, Date, Quantity, etc.** |
| **Goal** | To find spans of text that refer to real-world entities and label them correctly. |
| **Example Input** | “Barack Obama visited Berlin in July 2008.” |
| **Output** | `[Barack Obama] → PERSON`, `[Berlin] → LOCATION`, `[July 2008] → DATE` |


### 🧭 2️⃣ Why NER Matters

| **Application** | **Use Case** |
| --- | --- |
| **Search engines** | Identify entities in queries to improve results (“Apple headquarters” → organisation) |
| **Information extraction** | Extract facts: *[Person]* works at *[Organisation]* |
| **Question answering** | Identify “Who”, “Where”, “When” answers |
| **Summarisation** | Highlight key entities in text |
| **Financial / Legal text** | Recognise companies, dates, monetary values |


### 🔖 3️⃣ Core Entity Types (Common Categories)

| **Entity Type**            | **Example**               |
| -------------------------- | ------------------------- |
| PERSON                     | Barack Obama, Marie Curie |
| LOCATION                   | Berlin, Mount Everest     |
| ORGANISATION               | Google, United Nations    |
| DATE                       | July 2008, 21st May       |
| TIME                       | 5 PM, two hours           |
| MONEY                      | $1000, 50 euros           |
| PERCENT                    | 10%, 3.5 percent          |
| GPE (Geo-Political Entity) | India, European Union     |
| PRODUCT                    | iPhone 15, Tesla Model 3  |


#### 🧱 4️⃣ The IOB Tagging Scheme (Inside–Outside–Beginning)

| **Tag** | **Meaning** | **Example** |
| --- | --- | --- |
| **B-** | Beginning of an entity | `B-PER` → *Barack* |
| **I-** | Inside (continuation) of an entity | `I-PER` → *Obama* |
| **O** | Outside any named entity | Words not part of an entity |

📘 Example Sentence:

Barack Obama visited Berlin in July 2008.

| **Token** | **Tag**    |
| --------- | ---------- |
| Barack    | **B-PER**  |
| Obama     | **I-PER**  |
| visited   | **O**      |
| Berlin    | **B-LOC**  |
| in        | **O**      |
| July      | **B-DATE** |
| 2008      | **I-DATE** |
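
Reading the tags back into entities is a simple left-to-right scan. A minimal sketch, assuming whitespace-joined tokens; the helper name `bio_to_spans` is made up for this example, and a stray `I-` tag with no open entity is simply dropped.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the entity that was open
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            # "O" tag, or an I- tag that does not continue the open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Barack", "Obama", "visited", "Berlin", "in", "July", "2008"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE", "I-DATE"]
entities = bio_to_spans(tokens, tags)
print(entities)
# [('Barack Obama', 'PER'), ('Berlin', 'LOC'), ('July 2008', 'DATE')]
```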


#### 🔤 5️⃣ Extended Labeling Schemes

| **Scheme** | **Meaning** | **Tags Example** |
| --- | --- | --- |
| **BIO (IOB)** | Basic form | B-PER, I-PER, O |
| **BIOES** | Adds “End” and “Single” for clarity | B-PER, I-PER, E-PER, S-PER, O |
| **BILOU** | Begin, Inside, Last, Outside, Unit (single-token entities) | B-ORG, I-ORG, L-ORG, O, U-PER |
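
Converting between schemes only needs a look at the next tag. A small sketch, assuming the input is a well-formed BIO sequence; `bio_to_bioes` is an illustrative helper, not a library function.

```python
def bio_to_bioes(tags):
    """Convert a well-formed BIO tag sequence to BIOES."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"  # does the entity go on?
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes

tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE", "I-DATE"]
converted = bio_to_bioes(tags)
print(converted)
# ['B-PER', 'E-PER', 'O', 'S-LOC', 'O', 'B-DATE', 'E-DATE']
```

BILOU is the same idea with `L` in place of `E` and `U` in place of `S`.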


#### ⚙️ 6️⃣ How NER Systems Work

| **Step** | **Process** | **Example** |
| --- | --- | --- |
| 1. **Tokenisation** | Split text into words | “Barack”, “Obama”, “visited” |
| 2. **Feature Extraction / Embedding** | Compute word features (POS, context, vectors) | Context = nearby words |
| 3. **Sequence Labelling Model** | Predict tags (BIO scheme) | CRF / LSTM / BERT model |
| 4. **Entity Reconstruction** | Merge tokens with B/I labels | “Barack Obama” → PERSON |


#### 🤖 7️⃣ Model Types for NER

| **Model Type**       | **Description**                                  | **Example Algorithms**    |
| -------------------- | ------------------------------------------------ | ------------------------- |
| **Rule-based**       | Uses dictionaries + patterns (regex, gazetteers) | spaCy `Matcher`           |
| **Statistical (ML)** | Learns from features                             | HMM, CRF, SVM             |
| **Neural (DL)**      | Uses contextual embeddings                       | BiLSTM-CRF, BERT, RoBERTa |


#### 🧩 8️⃣ Common Challenges

| **Challenge** | **Example** |
| --- | --- |
| **Ambiguity** | “Apple” (company vs fruit) |
| **Multiword entities** | “New York Times” (single org) |
| **Nested entities** | “University of California, Berkeley” (an ORG that contains LOC names) |
| **Contextual confusion** | “Jordan” (person vs country) |


#### 💡 9️⃣ NER vs Related Concepts

| **Concept**                | **Focus**                    | **Example**                           |
| -------------------------- | ---------------------------- | ------------------------------------- |
| **POS Tagging**            | Grammatical role             | Noun, Verb                            |
| **NER**                    | Entity recognition           | PERSON, LOCATION                      |
| **SRL**                    | Semantic role (who did what) | Agent, Patient                        |
| **Coreference Resolution** | Linking mentions             | “Obama” = “He”                        |
| **Chunking**               | Grouping noun/verb phrases   | [NP Barack Obama] [VP visited Berlin] |


#### 🧠 🔟 Summary Cheat Box

| **Aspect**         | **NER Key Point**                                 |
| ------------------ | ------------------------------------------------- |
| **Goal**           | Identify and classify named entities in text      |
| **Output**         | Structured labels like PERSON, ORG, LOC           |
| **Tag Scheme**     | IOB (Inside–Outside–Beginning)                    |
| **Typical Models** | CRF, BiLSTM-CRF, BERT                             |
| **Applications**   | QA, search, summarisation, information extraction |
| **Challenges**     | Ambiguity, boundaries, multiword names            |


---
---

## 🧩 Conditional Random Fields (CRFs): Structured Sequence Labelling


#### 🔹 **What it is**

| Aspect        | Description                                                                                                            |
| ------------- | ---------------------------------------------------------------------------------------------------------------------- |
| **Full Form** | Conditional Random Field                                                                                               |
| **Type**      | Probabilistic graphical model (specifically, an *undirected* model)                                                    |
| **Purpose**   | To predict the most likely sequence of *labels* for a sequence of *observations* (e.g., NER, POS tagging)              |
| **Analogy**   | Instead of labeling words one by one, CRF looks at the *whole sentence* and predicts the most consistent tag sequence. |


#### 🧠 Key Idea

CRFs combine word-level features with dependencies between neighbouring labels to assign a consistent label sequence.

💡 Unlike classifiers (like Logistic Regression) that predict one label per word independently, CRFs jointly model all labels in the sequence.

#### 🧮 How It Works (Simplified)

| Step | Explanation | Example |
| --- | --- | --- |
| 1️⃣ **Input Sequence** | Words in a sentence | `"Barack Obama visited Berlin"` |
| 2️⃣ **Features Extracted per Word** | Capitalization, POS tag, surrounding words, etc. | Word="Barack" → capitalized, NNP, prev=START |
| 3️⃣ **Possible Labels** | Entity types (PERSON, LOCATION, ORG, DATE) encoded as BIO tags, plus O | `B-PER`, `I-PER`, `B-LOC`, `O` |
| 4️⃣ **Model Training** | Learns weights for each feature and for each transition between tags | e.g., “B-PER → I-PER” is a strong transition |
| 5️⃣ **Prediction** | Finds the tag sequence with the highest probability across the sentence | Output: `B-PER I-PER O B-LOC` |
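
The prediction step is usually done with Viterbi decoding. The sketch below assumes the scores are already known: the emission and transition numbers are invented for illustration, whereas a trained CRF would compute them from weighted features. It also shows how a per-token (greedy) choice can differ from the joint choice.

```python
TAGS = ["B-PER", "I-PER", "B-LOC", "O"]

# Hypothetical per-token scores (higher = better), one dict per word.
emissions = [
    {"B-PER": 2.0, "I-PER": 0.5, "B-LOC": 1.5, "O": 0.0},  # Barack
    {"B-PER": 1.0, "I-PER": 1.0, "B-LOC": 1.2, "O": 0.0},  # Obama
    {"B-PER": 0.0, "I-PER": 0.0, "B-LOC": 0.0, "O": 2.0},  # visited
    {"B-PER": 1.0, "I-PER": 0.2, "B-LOC": 2.0, "O": 0.0},  # Berlin
]

# Hypothetical transition scores; pairs not listed score 0.
transitions = {
    ("B-PER", "I-PER"): 2.0,
    ("I-PER", "I-PER"): 1.0,
    ("O", "I-PER"): -5.0,
    ("B-LOC", "I-PER"): -5.0,
}

def viterbi(emissions, transitions, tags):
    """Return the tag sequence with the highest total score."""
    # best[t][tag] = (score of the best path ending in `tag` at position t, previous tag)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for t in range(1, len(emissions)):
        column = {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[t - 1][p][0] + transitions.get((p, tag), 0.0))
            score = best[t - 1][prev][0] + transitions.get((prev, tag), 0.0) + emissions[t][tag]
            column[tag] = (score, prev)
        best.append(column)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag][0])]
    for t in range(len(emissions) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]

greedy = [max(TAGS, key=lambda tag: e[tag]) for e in emissions]
joint = viterbi(emissions, transitions, TAGS)
print(greedy)  # ['B-PER', 'B-LOC', 'O', 'B-LOC']
print(joint)   # ['B-PER', 'I-PER', 'O', 'B-LOC']
```

The greedy tagger mislabels “Obama” because it ignores the previous tag; the transition score for `B-PER → I-PER` lets the joint decoder recover.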


#### 🧱 Key Strengths of CRFs

| Feature | Explanation |
| --- | --- |
| 🧩 **Context-Aware** | Considers neighbouring tags and words → learns dependencies like “B-PER” likely followed by “I-PER.” |
| ⚙️ **Feature-Rich** | Uses many overlapping features (word shape, POS tags, affixes, etc.). |
| 🔁 **Structured Prediction** | Predicts sequence as a *whole*, not isolated tokens. |
| ✅ **Consistency** | Ensures tags make sense together (“B-LOC” not followed by “I-PER”). |
| 📈 **Supervised Learning** | Learns from labeled data with examples of correct sequences. |


#### ⚖️ Comparison: CRF vs Simpler Models

| Model | Approach | Context Awareness | Typical Use |
| --- | --- | --- | --- |
| **Logistic Regression** | Predicts each token independently | ❌ None | Binary tagging or classification |
| **Hidden Markov Model (HMM)** | Generative model; uses transition & emission probabilities | ✅ Context (previous state) | POS tagging (classic method) |
| **CRF** | Conditional model; directly models P(tags \| words) | ✅✅ Strong context & features | NER, POS tagging, SRL, etc. |


#### 🧩 Example: NER Tagging using CRF

| Word    | Features             | Predicted Tag |
| ------- | -------------------- | ------------- |
| Barack  | Capitalized, POS=NNP | **B-PER**     |
| Obama   | Capitalized, POS=NNP | **I-PER**     |
| visited | POS=VBD              | **O**         |
| Berlin  | Capitalized, POS=NNP | **B-LOC**     |


#### ⚠️ Challenges / Limitations

| Challenge | Description |
| --- | --- |
| 🧮 **Training complexity** | Computation-heavy due to sequence-level probability computation |
| 💾 **Requires labelled data** | Needs manually annotated sequences |
| 📉 **Not scalable for long text** | Works best for sentence-level tagging |


#### 🧠 Quick Recap Cheatsheet

| Concept | Remember As |
| --- | --- |
| **Goal** | Predict *sequence* of tags jointly |
| **Model Type** | *Discriminative* probabilistic graphical model |
| **Uses** | NER, POS, Chunking, SRL |
| **Key Strength** | Context + Feature combination |
| **Formula** | $P(Y \mid X) \propto \exp\left(\sum_i w \cdot f(y_{i-1}, y_i, X, i)\right)$ |
| **Advantage** | Produces globally consistent predictions |


---
---

## 🧩 Coreference Resolution (CR): Linking Mentions to the Same Entity

#### What It Is

| Aspect | Description |
| --- | --- |
| **Definition** | The task of identifying when *different words or phrases* refer to the *same real-world entity* in text. |
| **Purpose** | Helps maintain *continuity of meaning* by connecting pronouns and noun phrases across sentences. |
| **Example** | “**Sarah** loves music. **She** plays the guitar.” → *Sarah* and *She* refer to the same person. |


#### 🧠 Why It Matters

Without coreference resolution, a computer treats every noun or pronoun as a separate entity.
With CR, it realises that “Sarah”, “she”, and “the teacher” may all refer to the same entity, which is crucial for:
- Text summarization
- Question answering
- Information extraction
- Machine translation

#### 🔁 Types of References Detected

| Type | Example | Notes |
| --- | --- | --- |
| **Pronominal Coreference** | “John lost his keys. *He* was upset.” | Pronoun → noun link |
| **Nominal Coreference** | “The *president* gave a speech. The *leader* spoke for an hour.” | Two noun phrases |
| **Demonstrative Coreference** | “I love apples. *Those* are my favourite fruit.” | Uses demonstratives |
| **Cataphora (forward reference)** | “When *he* arrived, *John* sat down.” | Pronoun appears before noun |


#### 🧮 How It Works

| Step | Description | Example |
| --- | --- | --- |
| 1️⃣ **Mention Detection** | Identify all noun phrases and pronouns | “Sarah”, “she”, “the teacher” |
| 2️⃣ **Feature Extraction** | Collect grammatical and semantic features | Gender, number, person, position, syntactic role |
| 3️⃣ **Pairwise Comparison** | Compare every mention pair to see if they match | “Sarah–she”: ✅ gender/number match |
| 4️⃣ **Clustering / Linking** | Group mentions that refer to the same entity | {Sarah, she, the teacher} |


#### ⚙️ Rule-Based Approach (Classic Method)

| Rule Type | Description | Example |
| --- | --- | --- |
| **Gender Agreement** | Pronoun & noun must match gender | Sarah ↔ she ✅; Sarah ↔ he ❌ |
| **Number Agreement** | Singular ↔ singular; plural ↔ plural | Teachers ↔ they ✅; Teacher ↔ they ❌ |
| **Person Agreement** | 1st, 2nd, 3rd person consistency | I ↔ me ✅; I ↔ you ❌ |
| **Grammatical Role / Syntax** | Subject pronoun often refers to previous subject | “John hit Bob. He cried.” → heuristic picks *He = John* (world knowledge may favour Bob) |
| **Semantic Compatibility** | Logical consistency (animate ↔ animate) | *The car* ↔ *it* ✅; *The car* ↔ *he* ❌ |
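
The agreement rules can be sketched as filters over candidate antecedents. This is a deliberately simplified illustration: the `FEATURES` lexicon is hand-written and hypothetical (a real system derives these features from a parser or a trained model), and it picks the most recent compatible mention rather than applying the grammatical-role rule.

```python
# Hypothetical hand-written lexicon of mention features.
FEATURES = {
    "Sarah": {"gender": "f", "number": "sg", "animate": True},
    "John": {"gender": "m", "number": "sg", "animate": True},
    "the teachers": {"gender": None, "number": "pl", "animate": True},
    "the car": {"gender": "n", "number": "sg", "animate": False},
    "she": {"gender": "f", "number": "sg", "animate": True},
    "he": {"gender": "m", "number": "sg", "animate": True},
    "they": {"gender": None, "number": "pl", "animate": True},
    "it": {"gender": "n", "number": "sg", "animate": False},
}

def compatible(pronoun, candidate):
    """Apply gender, number, and animacy agreement checks."""
    p, c = FEATURES[pronoun], FEATURES[candidate]
    # An unknown gender (None) is treated as compatible with anything.
    gender_ok = p["gender"] is None or c["gender"] is None or p["gender"] == c["gender"]
    return gender_ok and p["number"] == c["number"] and p["animate"] == c["animate"]

def resolve(pronoun, preceding_mentions):
    """Return the most recent preceding mention that passes all checks, else None."""
    for candidate in reversed(preceding_mentions):
        if compatible(pronoun, candidate):
            return candidate
    return None

print(resolve("she", ["John", "Sarah", "the car"]))  # Sarah
print(resolve("it", ["Sarah", "the car"]))           # the car
print(resolve("he", ["Sarah", "the car"]))           # None
```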


#### ⚖️ Comparison: CRF vs NER vs Coreference

| Feature | **NER** | **CRF** | **Coreference Resolution** |
| --- | --- | --- | --- |
| **Goal** | Identify *entities* (names, orgs, etc.) | Predict *sequence labels* | Link *same entity mentions* |
| **Output** | Tags like `B-PER`, `B-LOC` | Tag sequence (structured prediction) | Clusters of mentions referring to same entity |
| **Example** | “Barack Obama” → PERSON | “B-PER I-PER O B-LOC” | “Obama” ↔ “he” ↔ “the President” |
| **Model Type** | Classification | Probabilistic (sequence model) | Linking / clustering (rule or ML based) |


----
----
----

# Text Representation & Unsupervised Modelling

(Core building block between semantics and ML/NLP models)

## 🧩 1. Bag-of-Words (BoW)

### 🔹 Concept

- Represents text as a bag (multiset) of individual words: counts are kept, grammar and word order are ignored.
- Each document becomes a vector of word counts.

| Document                       | Word | Count |
| ------------------------------ | ---- | ----- |
| Doc1: "The cat sat on the mat" | cat  | 1     |
|                                | mat  | 1     |
|                                | the  | 2     |
|                                | sat  | 1     |
|                                | on   | 1     |

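#### ➡ Creates a Document-Term Matrix (DTM):
```
         the | cat | sat | on | mat
Doc1 →    2  |  1  |  1  |  1 |  1
Doc2 →    1  |  0  |  0  |  1 |  0
```

Doc2 is not listed in the table; its row stands for any second document that contains *the* and *on* once each and none of the other three words.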
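
A minimal sketch of building such a matrix with the standard library. The Doc2 sentence is hypothetical, chosen only so that its counts match the row above, and the vocabulary is fixed to the five terms shown.

```python
from collections import Counter

docs = {
    "Doc1": "The cat sat on the mat",
    "Doc2": "The dog lay on a rug",  # hypothetical second document
}
vocab = ["the", "cat", "sat", "on", "mat"]

def bag_of_words(text, vocab):
    """Count each vocabulary term in the lower-cased, whitespace-split text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

dtm = {name: bag_of_words(text, vocab) for name, text in docs.items()}
for name, row in dtm.items():
    print(name, row)
# Doc1 [2, 1, 1, 1, 1]
# Doc2 [1, 0, 0, 1, 0]
```

Words outside the vocabulary (here *dog*, *lay*, *a*, *rug*) are ignored; in practice the vocabulary is usually built from all documents.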

#### ⚙️ Formula

For each word $t$ in document $d$:

$$\text{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total words in } d}$$

#### 💡 Strengths
- ✅ Simple & interpretable
- ✅ Works well for classification, clustering, or spam/sentiment detection
- ✅ Easy to implement

#### ⚠️ Limitations
- ❌ Ignores context & word order
- ❌ Treats all words as equally important
- ❌ Produces sparse, high-dimensional data

---

## 🧮 2. TF–IDF (Term Frequency – Inverse Document Frequency)

### 🔹 Concept
- Builds on BoW but weights words by how important they are to a document.
- Common words (like “the”, “is”) get low weight; rare but distinctive words get high weight.

#### ⚙️ Formula

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

where:
- $\text{TF}(t, d)$ = number of times $t$ appears in $d$ / total words in $d$
- $\text{IDF}(t) = \log(N / \text{df}_t)$
    - $N$ = total number of documents
    - $\text{df}_t$ = number of documents containing term $t$

#### 💡 Intuition

| Word    | Count in Doc A | Appears in Docs | IDF  | TF–IDF Weight |
| ------- | -------------- | --------------- | ---- | ------------- |
| the     | 10             | 100             | low  | small         |
| project | 3              | 2               | high | large         |

→ “project” is more meaningful than “the”.
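
The formula can be checked on a tiny corpus. A minimal sketch, assuming three invented documents and the unsmoothed natural-log IDF defined above; library implementations often use smoothed or otherwise adjusted IDF variants, so their numbers may differ.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenised.
docs = [
    "the project plan is ready".split(),
    "the cat sat on the mat".split(),
    "the weather is nice".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

weight_the = tf_idf("the", docs[0], docs)          # "the" is in every document
weight_project = tf_idf("project", docs[0], docs)  # "project" is in one document
print(weight_the)                 # 0.0
print(weight_project > weight_the)  # True
```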




#### 🧠 Applications

- Search engines (ranking results by keyword relevance)
- Spam filtering (frequent terms in spam)
- Document clustering / topic detection

---

## ⚖️ Comparison: BoW vs TF-IDF

| Feature              | Bag-of-Words        | TF–IDF                 |
| -------------------- | ------------------- | ---------------------- |
| **Counts**           | Raw word frequency  | Weighted by uniqueness |
| **Common Words**     | Treated equally     | Downweighted           |
| **Captures Meaning** | No                  | Partial                |
| **Output**           | Sparse count vector | Sparse weighted vector |


## 💭 3. Why Move Beyond BoW/TF-IDF?

| Limitation | Why It Matters |
| --- | --- |
| Context ignored | “bank” (river vs finance): no distinction |
| High-dimensional vectors | Sparse → inefficient |
| Similar words treated as unrelated | “car” ≠ “automobile” in BoW space |

➡ Leads to word embeddings, which address these limitations.

## 🧩 4. Sneak Preview: Word Embeddings

| Property      | Description                                                                  |
| ------------- | ---------------------------------------------------------------------------- |
| **Core Idea** | Represent words as *dense*, *low-dimensional vectors* that capture *meaning* |
| **Example**   | `king - man + woman ≈ queen`                                                 |
| **Methods**   | Word2Vec, GloVe, FastText                                                    |
| **Advantage** | Captures *semantic similarity* through geometry in vector space              |


## 🧠 Cheatsheet Summary

| Concept          | Representation  | Core Idea              | Pros                           | Cons                            |
| ---------------- | --------------- | ---------------------- | ------------------------------ | ------------------------------- |
| **Bag-of-Words** | Count vector    | Frequency of words     | Simple, interpretable          | No context, large sparse matrix |
| **TF–IDF**       | Weighted vector | Frequency × uniqueness | More informative, improves BoW | Still ignores meaning           |
| **Embeddings**   | Dense vector    | Semantic relationships | Captures meaning, efficient    | Needs large data                |


----


## Word Embeddings (meaning in vectors)

### 1. Intuition

Word embeddings map each word to a dense numeric vector (e.g., 50–300 dims) so that words used in similar contexts are close in the vector space. This lets algorithms compute semantic similarity (cosine) and even solve analogies (king - man + woman ≈ queen).

### 2. Popular methods (high level)

#### Word2Vec (Mikolov et al.)
- Two architectures: CBOW (predict target word from context) and Skip-Gram (predict context words from target).
- Learns embeddings by maximizing likelihood of observing true context words.
- Fast and local (uses sliding windows).

Example (complete sentence):
- Document: “The cat chased the little mouse across the kitchen.”
- Skip-gram trains so that the vector for *cat* is good at predicting the words inside its window, here *the*, *chased*, and *little* (window of 3).
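
The training data for skip-gram is just a list of (target, context) pairs. A minimal sketch of how those pairs could be generated; the helper name and the window size of 3 are choices made for this example, and the actual Word2Vec training (negative sampling, subsampling) is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) training pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat chased the little mouse across the kitchen".split()
cat_pairs = [p for p in skipgram_pairs(tokens, window=3) if p[0] == "cat"]
print(cat_pairs)
# [('cat', 'the'), ('cat', 'chased'), ('cat', 'the'), ('cat', 'little')]
```

CBOW uses the same windows but flips the direction: the context words jointly predict the target.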

#### GloVe (Global Vectors)

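- Uses global co-occurrence counts: builds a co-occurrence matrix $X$, where $X_{ij}$ = number of times word $i$ occurs near word $j$.
- Learns word vectors so that their dot product plus bias terms reconstructs the log co-occurrence count: $w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}$.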

Example (complete doc):
- Corpus doc 1: “Investors watched the stock market rally today.”
- Corpus doc 2: “Market volatility caused many investors to sell shares.”
- GloVe’s co-occurrence counts directly connect market ↔ investors. *Stock* and *shares* never co-occur here, but they share neighbours (market, investors), so their vectors still end up close.
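
For reference, the "dot product plus biases ≈ log co-occurrence" idea is usually written as a weighted least-squares objective. The form below follows the standard GloVe formulation; $V$ (vocabulary size), $\tilde{w}_j$ (context vector), and the weighting function $f$ with its parameters $x_{\max}$ and $\alpha$ are symbols introduced here, not defined elsewhere in these notes.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```

The weight $f(X_{ij})$ is zero for pairs that never co-occur, so only observed pairs contribute, and very frequent pairs are capped rather than dominating the loss.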

#### Doc2Vec / Sentence Embeddings

- Extend word embeddings to represent whole documents or sentences as vectors (Paragraph Vector).
- Useful for clustering documents, semantic search.

Example documents (complete):
- Doc A: “The stock market rose as investors regained confidence.”
- Doc B: “The local team won the championship last night.”
- Doc2Vec will place Doc A and Doc B far apart in semantic space.

### 3. Worked miniature example (toy numeric illustration)

Corpus (3 sentences):
- S1: “The cat chased the mouse in the kitchen.”
- S2: “A dog chased the cat near the park.”
- S3: “The chef cooked dinner in the kitchen.”

We’ll illustrate conceptually (not real trained vectors).

#### Co-occurrence intuition (toy counts):

Each cell counts the sentences in which both words appear.

| Target \ Context | cat | dog | chased | mouse | kitchen | chef | cooked |
| ---------------: | --: | --: | -----: | ----: | ------: | ---: | -----: |
|              cat |   - |   1 |      2 |     1 |       1 |    0 |      0 |
|              dog |   1 |   - |      1 |     0 |       0 |    0 |      0 |
|           chased |   2 |   1 |      - |     1 |       1 |    0 |      0 |
|            mouse |   1 |   0 |      1 |     - |       1 |    0 |      0 |
|          kitchen |   1 |   0 |      1 |     1 |       - |    1 |      1 |
|             chef |   0 |   0 |      0 |     0 |       1 |    - |      1 |
|           cooked |   0 |   0 |      0 |     0 |       1 |    1 |      - |

From these counts, 
- GloVe would try to find vectors so dot products approximate log-counts; 
- Word2Vec skip-gram would push cat close to chased, mouse, kitchen, and dog somewhat.

#### Analogy demonstration (conceptual):

vector(chef) − vector(kitchen) + vector(park) ≈ vector(coach)
- Meaning: chef relates to kitchen as coach relates to park. This illustrates the kind of analogical reasoning that embeddings make possible.

Similarity check (toy cosine):
- cos(cat, dog) high (both animals, appear near chased)
- cos(cat, chef) low (different contexts)
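
A rough way to check these claims is to count co-occurrences in the three sentences and compare rows of the resulting matrix. The sketch assumes one co-occurrence per sentence that contains both words (so chased and kitchen, which share S1, count once). Raw count rows are only a crude stand-in for trained embeddings, so "high" and "low" are relative here.

```python
import math
from collections import Counter
from itertools import combinations

sentences = [
    "The cat chased the mouse in the kitchen.",
    "A dog chased the cat near the park.",
    "The chef cooked dinner in the kitchen.",
]
vocab = ["cat", "dog", "chased", "mouse", "kitchen", "chef", "cooked"]

# Assumption: two words co-occur once for every sentence that contains both.
counts = Counter()
for sentence in sentences:
    present = set(sentence.lower().rstrip(".").split()) & set(vocab)
    for a, b in combinations(sorted(present), 2):
        counts[(a, b)] += 1
        counts[(b, a)] += 1

def row(word):
    """Co-occurrence row of `word` in vocab order (diagonal left at 0)."""
    return [counts[(word, other)] for other in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(row("cat"))                                  # [0, 1, 2, 1, 1, 0, 0]
print(round(cosine(row("cat"), row("dog")), 2))    # 0.53
print(round(cosine(row("cat"), row("chef")), 2))   # 0.27
```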

### 4 - Practical notes & tips

- Dimensions: typically 50–300; more dimensions give more capacity but need more training data.
- Preprocessing: lowercasing shrinks the vocabulary, while keeping case helps with named entities; decide per use case.
- OOV & rare words: FastText uses subword information (character n-grams) to build vectors for rare and out-of-vocabulary words.
- Contextual vs static: Word2Vec/GloVe are static (one vector per word form). Modern transformer models (e.g., BERT) produce contextual embeddings: a different vector for each token occurrence, depending on the sentence.
- Use cases: semantic search, clustering, input to downstream models, analogy detection.
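
To make the subword idea concrete, here is a small sketch of character n-gram extraction. The function name `char_ngrams` is hypothetical, and the `<`/`>` boundary markers and the 3 to 6 length range mirror commonly cited FastText defaults; treat them as assumptions rather than a description of the library's internals.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word shares many n-grams with a known one, so a subword
# model can assemble a vector for it from pieces it has already learned.
shared = set(char_ngrams("kitchen")) & set(char_ngrams("kitchens"))
print(sorted(shared)[:5])
```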

---

### Topic Modelling with NMF

We'll use a complete-document example and show the pipeline TF-IDF → document-term matrix V → NMF → W and H.

#### 1 - Intuition

Topic modelling: discover latent themes (topics) in a corpus; each document is a mixture of topics; each topic is a distribution over words.

NMF approximates a nonnegative document-term matrix $V$ by the product of two nonnegative matrices:

$$V \approx WH, \qquad W \ge 0,\; H \ge 0$$

- W: document × topic matrix (how much each document expresses each topic)
- H: topic × term matrix (which words define each topic)
- Nonnegativity leads to additive, interpretable parts (topics).




### 2 - Small complete-document example (3 documents)

Let's take 3 documents with 18 unique terms:

#### Step A : Corpus

| Document ID | Text                                                      |
| ----------- | --------------------------------------------------------- |
| D1          | The stock market rises steadily amidst economic recovery. |
| D2          | The local team wins the championship match convincingly.  |
| D3          | Shares fell sharply as the market volatility increased.   |

#### Step B - Vocabulary - unique terms

Choose the vocabulary: the unique meaningful words left after cleaning (lowercasing, stripping punctuation, and dropping the stop words "the" and "as").

| Term ID | Term         |
| ------- | ------------ |
| T1      | stock        |
| T2      | market       |
| T3      | rises        |
| T4      | steadily     |
| T5      | amidst       |
| T6      | economic     |
| T7      | recovery     |
| T8      | local        |
| T9      | team         |
| T10     | wins         |
| T11     | championship |
| T12     | match        |
| T13     | convincingly |
| T14     | shares       |
| T15     | fell         |
| T16     | sharply      |
| T17     | volatility   |
| T18     | increased    |


#### Step C: Document–Term Matrix (Bag-of-Words)
This is our raw frequency matrix (Bag-of-Words).
- Each row → one document (D1, D2, D3).
- Each column → one term from the vocabulary.
- Each cell value → number of times the term occurs in that document.

Example:
- "market" appears in D1 and D3 → frequency = 1 in both.
- "team" appears only in D2 → frequency = 1.
- "shares" appears only in D3 → frequency = 1.

| Document | stock | market | rises | steadily | amidst | economic | recovery | local | team | wins | championship | match | convincingly | shares | fell | sharply | volatility | increased |
| :------- | :---: | :----: | :---: | :------: | :----: | :------: | :------: | :---: | :--: | :--: | :----------: | :---: | :----------: | :----: | :--: | :-----: | :--------: | :-------: |
| **D1**   |   1   |    1   |   1   |     1    |    1   |     1    |     1    |   0   |   0  |   0  |       0      |   0   |       0      |    0   |   0  |    0    |      0     |     0     |
| **D2**   |   0   |    0   |   0   |     0    |    0   |     0    |     0    |   1   |   1  |   1  |       1      |   1   |       1      |    0   |   0  |    0    |      0     |     0     |
| **D3**   |   0   |    1   |   0   |     0    |    0   |     0    |     0    |   0   |   0  |   0  |       0      |   0   |       0      |    1   |   1  |    1    |      1     |     1     |
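
The matrix above can be reproduced with a few lines of plain Python. This is a minimal sketch: the two-word stop list and the whitespace tokenizer are assumptions that happen to be sufficient for this corpus, not a general-purpose cleaning pipeline.

```python
docs = {
    "D1": "The stock market rises steadily amidst economic recovery.",
    "D2": "The local team wins the championship match convincingly.",
    "D3": "Shares fell sharply as the market volatility increased.",
}
STOP_WORDS = {"the", "as"}  # assumed stop-word list, just enough for this corpus

def tokenize(text):
    """Lowercase, drop the final full stop, split on spaces, remove stop words."""
    return [w for w in text.lower().rstrip(".").split() if w not in STOP_WORDS]

tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}

# Vocabulary in order of first appearance (dict keys preserve insertion order)
vocab = list(dict.fromkeys(w for words in tokens.values() for w in words))

# One row per document, one column per vocabulary term
bow = {doc_id: [words.count(term) for term in vocab] for doc_id, words in tokens.items()}

print(len(vocab))  # 18
for doc_id, counts in bow.items():
    print(doc_id, counts)
```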

#### Step D - TF–IDF Weighting

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)}$$

Where:
- TF(t, d) = frequency of term t in document d
- N = total number of documents (= 3)
- DF(t) = number of documents containing term t
- log = natural logarithm, so log(3/1) ≈ 1.10

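#### Step D1 - Compute DF and IDF

To keep the tables readable, the remaining steps use 8 representative terms. Every other term occurs in a single document, so its IDF is also 1.10.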

| Term         | DF (no. of docs with term) |  IDF = log(3 / DF)  |
| ------------ | :------------------------: | :-----------------: |
| stock        |              1             | log(3/1) = **1.10** |
| market       |              2             | log(3/2) = **0.40** |
| recovery     |              1             |       **1.10**      |
| team         |              1             |       **1.10**      |
| championship |              1             |       **1.10**      |
| match        |              1             |       **1.10**      |
| shares       |              1             |       **1.10**      |
| volatility   |              1             |       **1.10**      |

#### Step D2 - Apply TF–IDF

| Document | stock | market | recovery | team | championship | match | shares | volatility |
| -------- | :---: | :----: | :------: | :--: | :----------: | :---: | :----: | :--------: |
| **D1**   |  1.10 |  0.40  |   1.10   |   0  |       0      |   0   |    0   |      0     |
| **D2**   |   0   |    0   |     0    | 1.10 |     1.10     |  1.10 |    0   |      0     |
| **D3**   |   0   |  0.40  |     0    |   0  |       0      |   0   |  1.10  |    1.10    |

Interpretation:
- The common word "market" has a lower IDF → weight = 0.40.
- Distinctive terms like "stock", "shares" and "team" have a higher IDF = 1.10.
- TF–IDF downweights common words and upweights rare but meaningful ones.
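
Steps D1 and D2 can be checked with the short script below, which recomputes DF, IDF and the TF–IDF rows for the 8 displayed terms. It assumes the natural logarithm and the unsmoothed formula from Step D; library implementations often add smoothing and normalisation, so their numbers will differ. Since ln(3/2) ≈ 0.405, the script prints 0.41 where the tables show the truncated 0.40.

```python
import math

# Tokenised documents with stop words already removed
docs = {
    "D1": ["stock", "market", "rises", "steadily", "amidst", "economic", "recovery"],
    "D2": ["local", "team", "wins", "championship", "match", "convincingly"],
    "D3": ["shares", "fell", "sharply", "market", "volatility", "increased"],
}
terms = ["stock", "market", "recovery", "team", "championship", "match", "shares", "volatility"]

n_docs = len(docs)

# DF: number of documents containing the term
df = {t: sum(t in words for words in docs.values()) for t in terms}

# IDF = log(N / DF), natural logarithm
idf = {t: math.log(n_docs / df[t]) for t in terms}

# TF-IDF = raw term count in the document times the term's IDF
tfidf = {
    doc_id: [words.count(t) * idf[t] for t in terms]
    for doc_id, words in docs.items()
}

for doc_id, weights in tfidf.items():
    print(doc_id, [round(w, 2) for w in weights])
```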



#### 🧩 Step E - Matrix Factorization (Topic Modelling)

Now we approximate: V ≈ W × H

Where:
- V = TF–IDF matrix (documents × terms)
- W = document–topic matrix (documents × topics)
- H = topic–term matrix (topics × terms)

Let's assume we extract 2 topics:
- Topic 1 → Finance
- Topic 2 → Sports

#### Step E1 - Factorization Illustration

TF–IDF matrix (V):

| Doc | stock | market | recovery | team | championship | match | shares | volatility |
| --- | :---: | :----: | :------: | :--: | :----------: | :---: | :----: | :--------: |
| D1  |  1.10 |  0.40  |   1.10   |   0  |       0      |   0   |    0   |      0     |
| D2  |   0   |    0   |     0    | 1.10 |     1.10     |  1.10 |    0   |      0     |
| D3  |   0   |  0.40  |     0    |   0  |       0      |   0   |  1.10  |    1.10    |

Document–Topic Matrix (W), illustrative values rather than the output of an actual NMF run:

| Document | Topic 1 (Finance) | Topic 2 (Sports) |
| -------- | :---------------: | :--------------: |
| D1       |        0.9        |        0.1       |
| D2       |        0.1        |        0.9       |
| D3       |        0.8        |        0.2       |

Topic–Term Matrix (H), illustrative values:

| Topic                 | stock | market | recovery | team | championship | match | shares | volatility |
| --------------------- | :---: | :----: | :------: | :--: | :----------: | :---: | :----: | :--------: |
| **Topic 1 (Finance)** |  0.9  |   0.6  |    0.8   |  0.1 |      0.0     |  0.0  |   0.9  |     0.8    |
| **Topic 2 (Sports)**  |  0.1  |   0.1  |    0.0   |  0.8 |      0.9     |  0.9  |   0.0  |     0.0    |

Summary NMF Components:

| Matrix | Size  | Meaning                         | Example Entry                   |
| ------ | ----- | ------------------------------- | ------------------------------- |
| **V**  | D × T | Document–Term (original TF–IDF) | TF–IDF of "stock" in D1 = 1.10  |
| **W**  | D × K | Document–Topic                  | D1 has 0.9 weight for Finance   |
| **H**  | K × T | Topic–Term                      | "team" = 0.8 under Sports topic |
| **WH** | D × T | Reconstructed Matrix            | Predicted TF–IDF ≈ real TF–IDF  |

Here D is the number of documents, T the number of terms and K the number of topics.


✅ Interpretation
- V (Documents × Terms): your original text representation.
- W (Documents × Topics): tells you what each document talks about.
    - D1, D3 → high Finance weights.
    - D2 → high Sports weight.
- H (Topics × Terms): shows the keywords for each topic.
    - Topic 1 → stock, market, shares, volatility → Finance.
    - Topic 2 → team, championship, match → Sports.


#### 🧭 Summary Table

| Step                       | Technique                          | Captures                        | Limitation                                 |
| -------------------------- | ---------------------------------- | ------------------------------- | ------------------------------------------ |
| Bag-of-Words               | Word frequency                     | Basic term occurrence           | Ignores importance & meaning               |
| TF–IDF                     | Weighted frequency                 | Importance of distinctive terms | Ignores context & order                    |
| Matrix Factorization (NMF) | Nonnegative factorization (V ≈ WH) | Hidden topics                   | Unsupervised: topics need interpretation   |


---

### How are W and H calculated?

TF–IDF → Random W, H → Iterative updates → Error minimization → Final topics

#### ⚙️ 1️⃣ Setup: Start with random guesses

We begin with: V ≈ W × H
- V = TF–IDF matrix (Docs × Terms)
- W = Document–Topic weights
- H = Topic–Term weights

Initially, W and H are filled with small random non-negative values.

Example flow diagram (markup friendly)
```
Step 1: Initialization
   [V]  ≈  [W₀] × [H₀]
    ↓        ↓       ↓
 TF–IDF    Random   Random
 (Docs×Terms)  Doc–Topic  Topic–Term

Reconstruction error = || V - W₀H₀ ||²  ← High!
```


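#### ⚙️ 2️⃣ Iterative Update (Gradient Descent-like process)

The model adjusts W and H iteratively to minimize the reconstruction error, the sum of squared element-wise differences between V and WH:

$$\text{Error} = \lVert V - WH \rVert^2$$

Each iteration updates both W and H as follows (multiplicative update rules, where $\odot$ and the fraction bar denote element-wise multiplication and division):

$$H \leftarrow H \odot \frac{W^\top V}{W^\top W H}, \qquad W \leftarrow W \odot \frac{V H^\top}{W H H^\top}$$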

These updates gradually reduce the difference between the original TF‚ÄìIDF matrix and its approximation.

```
Iteration 1:
   W₁ = W₀ × adjustment
   H₁ = H₀ × adjustment
   Error = || V - W₁H₁ ||² = 2.4  ↓

Iteration 2:
   W₂ = W₁ × adjustment
   H₂ = H₁ × adjustment
   Error = || V - W₂H₂ ||² = 1.1  ↓

Iteration 3:
   W₃ = W₂ × adjustment
   H₃ = H₂ × adjustment
   Error = || V - W₃H₃ ||² = 0.5  ↓

Converged:
   Error change < threshold → stop
```
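
The loop sketched in the diagram can be written in plain Python as below. This is a minimal illustration under stated assumptions, not a production implementation: it uses the standard multiplicative update rules for squared error, a fixed number of iterations instead of a convergence threshold, and a small `eps` in the denominator to avoid division by zero. The helper names are hypothetical. Because W and H start from random values, the learned topics can differ between seeds and need not match the illustrative W and H tables.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    """Sum of squared differences between V and the reconstruction WH."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for v_row, r_row in zip(V, WH) for v, r in zip(v_row, r_row))

def multiplicative_step(M, num, den, eps=1e-9):
    """Element-wise M * num / den; eps guards against division by zero."""
    return [[m * a / (b + eps) for m, a, b in zip(m_row, a_row, b_row)]
            for m_row, a_row, b_row in zip(M, num, den)]

def nmf(V, k, iterations=200, seed=0):
    rng = random.Random(seed)
    W = [[rng.random() for _ in range(k)] for _ in V]      # documents x topics
    H = [[rng.random() for _ in V[0]] for _ in range(k)]   # topics x terms
    errors = [squared_error(V, W, H)]
    for _ in range(iterations):
        Wt = transpose(W)
        H = multiplicative_step(H, matmul(Wt, V), matmul(matmul(Wt, W), H))
        Ht = transpose(H)
        W = multiplicative_step(W, matmul(V, Ht), matmul(W, matmul(H, Ht)))
        errors.append(squared_error(V, W, H))
    return W, H, errors

# TF-IDF matrix V from Step D2 (rows D1..D3, 8 terms)
V = [
    [1.10, 0.40, 1.10, 0, 0, 0, 0, 0],
    [0, 0, 0, 1.10, 1.10, 1.10, 0, 0],
    [0, 0.40, 0, 0, 0, 0, 1.10, 1.10],
]
W, H, errors = nmf(V, k=2)
print(f"error before: {errors[0]:.3f}, after: {errors[-1]:.3f}")
for doc_id, weights in zip(["D1", "D2", "D3"], W):
    print(doc_id, [round(w, 2) for w in weights])
```

With three nearly unrelated documents and only two topics, some reconstruction error always remains; the point of the sketch is that the error falls from its random starting value while W and H stay nonnegative.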

---

### Applications of Topic Modelling

| **#** | **Application**                  | **Description / Example**                                                                                                                       |
| :---: | :------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------- |
| **1** | **Document Clustering**          | Groups similar documents (e.g., news articles, research papers) by theme, such as "finance", "sports" or "health".                               |
| **2** | **Thematic Summarisation**       | Extracts dominant ideas from large text collections such as reviews, customer feedback, or survey responses.                                    |
| **3** | **Personalised Recommendations** | Suggests content (articles, videos, products) by linking users to items sharing similar underlying topics rather than just matching keywords.   |
| **4** | **Trend Analysis**               | Tracks how topics evolve over time, e.g., how interest in "renewable energy" or "AI ethics" rises or falls in news or social media.             |
| **5** | **Enhanced Search & Retrieval**  | Improves information discovery by finding conceptually related documents even when different wording is used ("AI" vs. "machine intelligence"). |


---

## 🧠 Applications of NLP: Summary Sheet



| **#** | **NLP Task**                                           | **Core Idea**                                                 | **How it Works (Classical Approach)**                                                                                                      | **Example / Outcome**                                                                                   |
| ----- | ------------------------------------------------------ | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| **1** | **Predictive Text Generation (Language Modelling)**    | Predict the next word in a sentence using prior words.        | N-gram models compute conditional probabilities such as **P(wordₙ given wordₙ₋₁, wordₙ₋₂, …)**. The Markov assumption limits dependency to the last *N–1* words. | Predicting "coffee" or "tea" after "I need a cup of …". |
| **2** | **Laplace (Add-One) Smoothing**                        | Prevent zero probability for unseen word combinations.        | Adds 1 to every possible N-gram count (and the vocabulary size to each denominator) before computing probabilities. | Even unseen phrases like "went to the zoo" get a small non-zero probability. |
| **3** | **Sentence Probability & Perplexity**                  | Measure how natural or fluent a sentence is under a model.    | Multiply probabilities of all words to get **P(sentence)**, then compute **Perplexity = (1 / P(sentence))^(1/N)**, where N is the number of words. | "The dog sat down" has lower perplexity (more natural) than "Dog the down sat." |
| **4** | **Template-based Sentence Generation**                 | Generate grammatically correct text using fixed patterns.     | Predefined sentence templates are filled with contextually appropriate words or phrases. | "Dear [Name], your order [#ID] has been shipped." |
| **5** | **Rule-based Sentence Generation**                     | Use grammar rules to ensure syntactic and semantic coherence. | Apply syntactic and morphological rules for agreement and structure. | Early chatbots and report generators (e.g., ELIZA). |
| **6** | **Classical Summarisation**                            | Extract key information from long text.                       | **Frequency-based:** Select sentences with most frequent key words.<br>**Position-based:** Prefer opening/closing sentences or section headers. | News or report summarisation based on term frequency and position. |
| **7** | **Machine Translation (Rule-based)**                   | Translate using dictionaries and linguistic rules.            | **Direct:** Word-for-word substitution.<br>**Transfer-based:** Apply grammar rules for correct order. | English → French: "Hi, my name is Meet" → "Bonjour, je m'appelle Meet." |
| **8** | **Machine Translation (Statistical)**                  | Translate by learning from bilingual corpora (data-driven).   | Combines:<br>• **Translation Model:** *P(Source given Target)*<br>• **Language Model:** *P(Target)*<br>Chooses translation that maximises joint probability. | Learns that "My name is" ↔ "Je m'appelle" via probability scores. |
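
Rows 1 to 3 of the table can be tied together in one small sketch: a bigram language model with add-one smoothing, used to compare the perplexity of a natural and a scrambled sentence. The three-sentence training corpus, the `<s>`/`</s>` boundary markers and the function names are assumptions made for this illustration; N in the perplexity formula is taken to be the number of predicted tokens, including the end marker.

```python
import math
from collections import Counter

# Hypothetical three-sentence training corpus
corpus = ["the dog sat down", "the cat sat down", "the dog ran away"]

def tokens(sentence):
    """Lowercase, split on spaces, add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]

context_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    t = tokens(sentence)
    context_counts.update(t[:-1])         # every token that has a successor
    bigram_counts.update(zip(t, t[1:]))   # adjacent (previous, current) pairs

# Words the model can predict: everything except the start marker
vocab = {w for s in corpus for w in tokens(s)} - {"<s>"}

def prob(prev, word):
    """Add-one smoothed bigram probability P(word given prev)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

def perplexity(sentence):
    """(1 / P(sentence)) ** (1 / N), computed in log space for stability."""
    t = tokens(sentence)
    log_p = sum(math.log(prob(a, b)) for a, b in zip(t, t[1:]))
    return math.exp(-log_p / (len(t) - 1))

print(prob("dog", "down") > 0)  # True: an unseen bigram still gets probability mass
print(perplexity("the dog sat down") < perplexity("dog the down sat"))  # True
```

Working in log space avoids the underflow that multiplying many small probabilities would cause on longer sentences.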
