<a href="https://colab.research.google.com/github/HosseinEyvazi/NLP/blob/main/Text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# 🧠 Machine Translation

Machine Translation (MT) automatically converts text from one language to another, today typically with **sequence-to-sequence models** (encoder-decoder architectures).

### 🔍 Main Challenge

* **Low-resource languages** suffer from a lack of parallel training data (i.e., aligned sentence pairs).

### 🔧 Common Approaches

* **Pretrained translation models** (e.g., **mBART**, **MarianMT**): multilingual models such as mBART share representations across many languages, while MarianMT provides compact models, mostly one per language pair.
* **Fine-tuning**: Adapt these models to specific **language pairs** or **domains** for improved accuracy.

### 🌍 Advanced Models

* Models like **NLLB (No Language Left Behind)** support up to **200 languages** and achieve strong results even on low-resource pairs.

---

## 📊 Translation Evaluation Metrics

| Metric     | Description                                                                        | Example                                                                                                                       |
| ---------- | ---------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| **BLEU**   | Measures n-gram precision against reference translations.                          | Ref: *“The cat is on the mat”*<br>Hyp: *“The cat sits on the mat”* → BLEU captures overlap of 1–4-grams.                      |
| **ROUGE**  | Recall-oriented n-gram/LCS overlap; designed for summarization, sometimes reported for translation. | Ref: *“She enjoys reading books.”*<br>Hyp: *“She loves reading novels.”* → ROUGE-L captures “She … reading”.                  |
| **METEOR** | Considers **synonyms**, **stemming**, and **word order**, not just exact matches.  | Ref: *“He searched for the answer.”*<br>Hyp: *“He looked for the solution.”* → Matches "searched/looked" & "answer/solution". |
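| **TER**    | Translation Edit Rate: number of edits (insertions, deletions, substitutions, shifts) needed to match the reference, divided by the reference length. | Ref: *“The cat is on the mat”*<br>Hyp: *“The cat sits on the mat”* → 1 edit / 6 words ≈ 0.17. Fewer edits → better translation. |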
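
The edit-rate idea behind TER can be sketched in a few lines. This is a simplified version that assumes whitespace tokenization and counts only insertions, deletions, and substitutions; real TER also allows block shifts, which this sketch leaves out. The function names are illustrative, not from any library.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, r in enumerate(ref_tokens, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a reference word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Edits divided by reference length (no shift operation, unlike full TER)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    return word_edit_distance(hyp, ref) / len(ref)


score = simple_ter("The cat sits on the mat", "The cat is on the mat")
print(round(score, 3))  # 0.167: one substitution over six reference words
```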

---

# 📚 Text Summarization

Text summarization reduces long documents into shorter versions while preserving key information.

### 🧠 Two Main Approaches

* **Abstractive**: Generates novel sentences using models like **BART**, **T5**, **mT5**.
* **Extractive**: Selects key sentences or phrases from the input (e.g., TextRank, BERTSum).

> Abstractive models can be more fluent and human-like, while extractive models are often more faithful to the original text.
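
To make the extractive idea concrete, here is a minimal sketch that scores sentences by average word frequency and keeps the top `k`. It is a much cruder stand-in for TextRank or BERTSum (no graph ranking, no stop-word handling), and the sample text and function names are made up for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def words(text):
    return re.findall(r"[a-z']+", text.lower())


def extractive_summary(text, k=1):
    """Return the k highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(words(text))

    def score(sentence):
        toks = words(sentence)
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sample = ("Translation models need parallel data. "
          "Parallel data is scarce for many languages. "
          "Some people prefer tea.")
print(extractive_summary(sample, k=1))
```

Because every output sentence is copied from the input, the result is faithful by construction, which is the trade-off the note above describes.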

---

## 📏 Summarization Evaluation Metrics

| Metric        | Description                                                                             | Example                                                                                                                                                                                   |
| ------------- | --------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **ROUGE**     | Measures overlap between generated and reference summaries (n-grams, LCS, skip-grams).  | Ref: *“AI drives innovation across fields.”*<br>Gen: *“Artificial intelligence powers change in many domains.”* → ROUGE scores near zero because no words match exactly, even though the meaning is close. |
| **BLEU**      | Less common for summarization, but used to measure content overlap.                     | Shared phrases between summary and reference are rewarded.                                                                                                                                |
| **BERTScore** | Uses contextual embeddings (e.g., BERT) to measure semantic similarity between outputs. | Recognizes when wording differs but meaning is preserved.                                                                                                                                 |

---

# 🧪 General NLP Evaluation Metrics

| Task                    | Common Metrics                                        |
| ----------------------- | ----------------------------------------------------- |
| **Text Classification** | Accuracy, Precision, Recall, F1 Score                 |
| **Text Generation**     | Perplexity, BLEU Score                                |
| **Summarization**       | ROUGE, BLEU, BERTScore                                |
| **Translation**         | BLEU, METEOR, TER                                     |
| **Question Answering**  | Exact Match (EM), F1 Score, ROUGE (for generative QA) |

---

# 🌐 Popular Multilingual Models

| Model        | Supported Languages    | Fine-tunable | Notes                                                           |
| ------------ | ---------------------- | ------------ | --------------------------------------------------------------- |
| **T5**       | English (pretraining); German, French, Romanian via translation tasks | ✅ Yes        | General-purpose model, strong in abstractive summarization      |
| **mT5**      | 101                    | ✅ Yes        | Fully multilingual version of T5                                |
| **mBART50**  | 50                     | ✅ Yes        | Seq2seq model for multilingual tasks                            |
| **MarianMT** | \~90                   | ✅ Yes        | Lightweight and fast, supports many translation pairs           |
| **NLLB**     | 200                    | ✅ Yes        | Meta’s model for high-quality low-resource language translation |

> 🔧 All these models can be fine-tuned for specific tasks, domains, or language pairs.

---


### 🎯 **BLEU Score Example (Translation)**

#### **Reference Translation**

> *"The quick brown fox jumps over the lazy dog."*

#### **Machine Translation Output (Hypothesis)**

> *"A fast brown fox leaps over the lazy dog."*

---

### ✅ Step-by-step BLEU Evaluation

#### 1. **Unigram (1-gram) overlap**

Compare individual words in the hypothesis with the reference:

* Overlapping words:
  `"brown"`, `"fox"`, `"over"`, `"the"`, `"lazy"`, `"dog"`
* Total unigrams in hypothesis: 9
* Matching unigrams: 6
* → **Precision\@1 (BLEU-1)** = 6 / 9 ≈ **0.667**

#### 2. **Bigram (2-gram) overlap**

Check pairs of consecutive words:

* Reference bigrams:
  `"The quick"`, `"quick brown"`, `"brown fox"`, `"fox jumps"`, `"jumps over"`, `"over the"`, `"the lazy"`, `"lazy dog"`
* Hypothesis bigrams:
  `"A fast"`, `"fast brown"`, `"brown fox"`, `"fox leaps"`, `"leaps over"`, `"over the"`, `"the lazy"`, `"lazy dog"`
* Matching bigrams:
  `"brown fox"`, `"over the"`, `"the lazy"`, `"lazy dog"` → 4 matches
* Total bigrams in hypothesis: 8
* → **Precision\@2 (BLEU-2)** = 4 / 8 = **0.5**

#### 3. **Brevity Penalty (BP)**

Used if the hypothesis is shorter than the reference. Here, both have 9 words, so **BP = 1** (no penalty).
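
For reference, the standard brevity penalty can be written as follows, where $c$ is the hypothesis length and $r$ is the reference length (these two symbols are introduced here for illustration). With $c = r = 9$ it evaluates to $\exp(0) = 1$:

$$
BP = \begin{cases} 1 & \text{if } c > r \\ \exp\left(1 - \frac{r}{c}\right) & \text{if } c \le r \end{cases}
$$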

---

### 🧮 Final BLEU Score (simplified)

BLEU combines multiple n-gram precisions with a geometric mean and brevity penalty.
For simplicity, here’s a 2-gram version:

$$
\text{BLEU-2} = BP \times \exp\left( \frac{1}{2} (\log P_1 + \log P_2) \right)
$$

$$
= 1 \times \exp\left( \frac{1}{2} (\log 0.667 + \log 0.5) \right) \approx 0.577
$$
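
The calculation above can be reproduced with a short sketch. It assumes lowercased word tokens and a single reference, and uses clipped n-gram counts as in standard BLEU; the function names are illustrative. For real evaluations, an established implementation is preferable.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hyp, ref, n):
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / max(sum(hyp_counts.values()), 1)


def bleu(hypothesis, reference, max_n=2):
    hyp, ref = tokenize(hypothesis), tokenize(reference)
    precisions = [clipped_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    c, r = len(hyp), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "The quick brown fox jumps over the lazy dog."
hypothesis = "A fast brown fox leaps over the lazy dog."
print(round(bleu(hypothesis, reference), 3))  # 0.577
```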

---

### 📌 Interpretation:

* **BLEU ≈ 0.577** here → indicates **moderate translation quality**.
* Synonyms like *“fast”* vs. *“quick”* and *“leaps”* vs. *“jumps”* **do not count** in BLEU — it rewards exact word overlap.
* That's why BLEU can **underestimate** semantically correct translations.




---

## 🌟 METEOR Score Example

### 🟦 Reference Sentence (Human Translation):

> "I am looking for a solution."

### 🟩 Hypothesis Sentence (Machine Translation):

> "I'm searching for an answer."

---

## 🔍 What Makes METEOR Special?

Unlike **BLEU** or **ROUGE**, METEOR:

✅ Handles **synonyms**
✅ Handles **stemming** (e.g. *run* vs. *running*)
✅ Considers **word order** penalties
✅ Can use **paraphrase dictionaries** (optional)

---

### 🔄 Matching Details:

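| Word in Hypothesis | Match Type                                              | Match in Reference |
| ------------------ | ------------------------------------------------------- | ------------------ |
| "I'm"              | Contraction (needs normalization or a paraphrase table) | "I am"             |
| "searching"        | Synonym                                                 | "looking"          |
| "for"              | Exact                                                   | "for"              |
| "an"               | Article variant (not an exact match; idealized here)    | "a"                |
| "answer"           | Synonym                                                 | "solution"         |

✅ 5 out of 5 hypothesis words aligned (under this idealized alignment)
✅ Synonyms and the contraction captured
✅ Word order preserved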

---

### 📏 METEOR Score (Roughly)

* **Precision** ≈ 5/5 (every hypothesis word is aligned)
* **Recall** ≈ 5/5 (counting "I am" as one unit matched by "I'm")
* **F-score** ≈ 1.0
* **Fragmentation penalty** is small (the matches form one in-order chunk)

✅ So **METEOR score would be high** (close to **1.0**),
whereas **BLEU score would be low** (only *"for"* matches exactly, so no bigrams are shared).
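
A rough sketch of this scoring is shown below. It uses the original METEOR formulation (recall-weighted mean `10PR / (R + 9P)` and penalty `0.5 * (chunks / matches)^3`); later METEOR versions tune these parameters. The tiny `CONTRACTIONS` and `SYNONYMS` tables are hypothetical stand-ins for WordNet and a paraphrase table, and the greedy aligner is a simplification. Because the contraction is expanded, the sketch aligns 6 of 6 tokens rather than 5 of 5.

```python
# Hypothetical hand-written tables; real METEOR relies on WordNet, a stemmer,
# and optional paraphrase tables instead.
CONTRACTIONS = {"i'm": ["i", "am"]}
SYNONYMS = {("searching", "looking"), ("an", "a"), ("answer", "solution")}


def tokenize(text):
    tokens = []
    for word in text.lower().rstrip(".").split():
        tokens.extend(CONTRACTIONS.get(word, [word]))
    return tokens


def is_match(h, r):
    return h == r or (h, r) in SYNONYMS or (r, h) in SYNONYMS


def align(hyp, ref):
    """Greedy left-to-right alignment as (hyp_index, ref_index) pairs."""
    used, pairs = set(), []
    for i, h in enumerate(hyp):
        for j, r in enumerate(ref):
            if j not in used and is_match(h, r):
                used.add(j)
                pairs.append((i, j))
                break
    return pairs


def count_chunks(pairs):
    """Count runs of matches that are adjacent in both sentences."""
    chunks, prev = 0, None
    for i, j in pairs:
        if prev is None or (i, j) != (prev[0] + 1, prev[1] + 1):
            chunks += 1
        prev = (i, j)
    return chunks


def meteor_sketch(hypothesis, reference):
    hyp, ref = tokenize(hypothesis), tokenize(reference)
    pairs = align(hyp, ref)
    m = len(pairs)
    if m == 0:
        return 0.0
    precision, recall = m / len(hyp), m / len(ref)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(pairs) / m) ** 3
    return f_mean * (1 - penalty)


score = meteor_sketch("I'm searching for an answer.", "I am looking for a solution.")
print(round(score, 3))  # 0.998
```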

---

## 🔑 Summary

| Metric     | Would it reward this translation? | Why?                                      |
| ---------- | --------------------------------- | ----------------------------------------- |
| BLEU       | ❌ Low                             | Few exact n-gram overlaps                 |
| ROUGE      | ❌ Low                             | Weak word overlap                         |
| **METEOR** | ✅ High                            | Captures synonyms, paraphrases, and order |

---




## 🟥 ROUGE Score Example

**Reference Summary:**

> *"The quick brown fox jumps over the lazy dog."*

**Generated Summary (Hypothesis):**

> *"A fast brown fox leaps over a lazy dog."*

---

## 🔍 ROUGE-1 (Unigram Overlap)

Measures **word-level (1-gram)** overlap between reference and hypothesis.

* **Reference unigrams:**
  `["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]`

* **Hypothesis unigrams:**
  `["A", "fast", "brown", "fox", "leaps", "over", "a", "lazy", "dog"]`

* **Matching words:**
  `"brown"`, `"fox"`, `"over"`, `"lazy"`, `"dog"` → **5 matches**

### ROUGE-1 Scores:

* **Precision** = 5 / 9 ≈ **0.556**
* **Recall** = 5 / 9 ≈ **0.556**
* **F1** = 2 × (P × R) / (P + R) ≈ **0.556**

---

## 🔍 ROUGE-2 (Bigram Overlap)

Measures **two-word phrase (2-gram)** overlap.

* **Reference bigrams:**
  `"The quick"`, `"quick brown"`, `"brown fox"`, `"fox jumps"`, `"jumps over"`, `"over the"`, `"the lazy"`, `"lazy dog"`

* **Hypothesis bigrams:**
  `"A fast"`, `"fast brown"`, `"brown fox"`, `"fox leaps"`, `"leaps over"`, `"over a"`, `"a lazy"`, `"lazy dog"`

* **Matching bigrams:**
  `"brown fox"`, `"lazy dog"` → **2 matches**

### ROUGE-2 Scores:

* **Precision** = 2 / 8 = **0.25**
* **Recall** = 2 / 8 = **0.25**
* **F1** = 0.25

---

## 🔍 ROUGE-L (Longest Common Subsequence)

Looks for the **longest in-order matching word sequence**, ignoring gaps.

* **LCS between reference and hypothesis:**
  `"brown fox over lazy dog"` → length = **5**

### ROUGE-L Scores:

* **Precision** = 5 / 9 ≈ **0.556**
* **Recall** = 5 / 9 ≈ **0.556**
* **F1** ≈ **0.556**

---

## 📊 Summary of ROUGE Scores

| Metric      | Precision | Recall | F1 Score |
| ----------- | --------- | ------ | -------- |
| **ROUGE-1** | 0.556     | 0.556  | 0.556    |
| **ROUGE-2** | 0.25      | 0.25   | 0.25     |
| **ROUGE-L** | 0.556     | 0.556  | 0.556    |
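
The three ROUGE variants for this example can be recomputed with a small sketch. It assumes lowercased word tokens and a single reference, and it applies no stemming or stop-word removal; the helper names are illustrative rather than taken from a ROUGE library.

```python
import re
from collections import Counter


def tokens(text):
    return re.findall(r"\w+", text.lower())


def ngram_counts(toks, n):
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def prf(match, hyp_total, ref_total):
    p = match / hyp_total if hyp_total else 0.0
    r = match / ref_total if ref_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def rouge_n(hyp, ref, n):
    h, r = ngram_counts(tokens(hyp), n), ngram_counts(tokens(ref), n)
    overlap = sum((h & r).values())  # multiset intersection of n-grams
    return prf(overlap, sum(h.values()), sum(r.values()))


def rouge_l(hyp, ref):
    h, r = tokens(hyp), tokens(ref)
    return prf(lcs_length(h, r), len(h), len(r))


reference = "The quick brown fox jumps over the lazy dog."
hypothesis = "A fast brown fox leaps over a lazy dog."
for name, scores in [("ROUGE-1", rouge_n(hypothesis, reference, 1)),
                     ("ROUGE-2", rouge_n(hypothesis, reference, 2)),
                     ("ROUGE-L", rouge_l(hypothesis, reference))]:
    print(name, [round(s, 3) for s in scores])
```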

---

### ✅ When to Use Each ROUGE Variant:

| Variant     | Best For                                     |
| ----------- | -------------------------------------------- |
| **ROUGE-1** | Word-level overlap (basic relevance)         |
| **ROUGE-2** | Phrase-level fluency                         |
| **ROUGE-L** | Capturing **sentence structure** and fluency |



**Important note:** a high ROUGE-L score does not guarantee a good summary.
It only shows that the output shares a long in-order word sequence with the text it is compared against, not that it is concise, meaningful, or truly summarized:

🔄 **Example**

🔷 **Input Document (Long):**
*"Artificial intelligence has revolutionized various industries by enabling machines to learn from data, adapt to new inputs, and perform human-like tasks."*

🔷 **Reference Summary (Good):**
*"AI enables machines to learn and mimic human behavior."*

🔷 **Generated Summary A (Copy-Paste):**
*"Artificial intelligence has revolutionized various industries by enabling machines to learn from data, adapt to new inputs, and perform human-like tasks."*

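⚠️ **ROUGE-L against the input: perfect (1.0)**, and ROUGE-L recall against the reference is still sizeable (4 of 9 reference words, *"machines to learn … and"*, appear in order).
❌ **But:** This is just a copy of the input; **no actual summarization** was done.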
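
A quick way to check this example is to compute ROUGE-L precision and recall for the copied text twice: once against the reference summary and once against the input document itself. The sketch below assumes lowercased word tokens (hyphenated words kept whole); the helper names are illustrative.

```python
import re


def tokens(text):
    # Lowercase words; a hyphenated word such as "human-like" stays one token.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_pr(candidate, target):
    """ROUGE-L precision and recall of candidate measured against target."""
    c, t = tokens(candidate), tokens(target)
    lcs = lcs_length(c, t)
    return lcs / len(c), lcs / len(t)


document = ("Artificial intelligence has revolutionized various industries by "
            "enabling machines to learn from data, adapt to new inputs, and "
            "perform human-like tasks.")
reference = "AI enables machines to learn and mimic human behavior."
copied_summary = document  # "Generated Summary A" is a verbatim copy

print("vs. reference:", rouge_l_pr(copied_summary, reference))
print("vs. input:    ", rouge_l_pr(copied_summary, document))
```

Against the reference, the copy gets low precision (it is far too long) but non-trivial recall; against the input it is trivially perfect. Neither number says anything about whether summarization took place.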




---

## 🌟 What is BERTScore?

**BERTScore** compares the **semantic similarity** between the **generated sentence** and the **reference**, using contextual embeddings from models like **BERT** or **RoBERTa**.

✅ It understands **meaning**, not just surface form (words).
❌ It does **not rely on exact matches** or word order directly.

---

## 🧪 BERTScore Example

### 🟦 Reference Sentence (Gold Summary / Translation):

> *"A dog is chasing a ball in the field."*

### 🟩 Hypothesis Sentence (Generated):

> *"A puppy runs after a ball on the grass."*

---

### 🔍 Traditional Metrics Would Say:

* **BLEU**: Low (few exact n-gram matches like “a”, “ball”)
* **ROUGE**: Low (misses many unigrams like “puppy”, “runs”)
* **METEOR**: Medium (might catch "puppy/dog", "runs/chasing")
* ❗ **But all ignore full sentence-level semantics.**

---

### ✅ BERTScore’s View:

* "puppy" and "dog" → **semantically close**
* "runs after" and "chasing" → **contextually similar**
* "on the grass" and "in the field" → **nearly equivalent**

BERTScore uses embeddings from each token and **matches tokens semantically** between the hypothesis and reference. It evaluates **precision**, **recall**, and **F1** over these soft matches.
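
The greedy soft matching can be sketched without a neural model by plugging in toy vectors. The 3-dimensional `TOY_EMBEDDINGS` below are invented purely for illustration; real BERTScore takes contextual embeddings from a pretrained model, so actual scores will differ.

```python
import math

# Made-up vectors for illustration only (not produced by BERT or RoBERTa).
TOY_EMBEDDINGS = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "ball":  [0.1, 0.9, 0.0],
    "field": [0.0, 0.2, 0.9],
    "grass": [0.1, 0.1, 0.9],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_soft_score(hyp_tokens, ref_tokens):
    """Precision, recall, F1 from best-match cosine similarities."""
    hyp_vecs = [TOY_EMBEDDINGS[t] for t in hyp_tokens]
    ref_vecs = [TOY_EMBEDDINGS[t] for t in ref_tokens]
    # Precision: each hypothesis token is matched to its most similar reference token.
    precision = sum(max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    # Recall: each reference token is matched to its most similar hypothesis token.
    recall = sum(max(cosine(r, h) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = greedy_soft_score(["puppy", "ball", "grass"], ["dog", "ball", "field"])
print(round(p, 3), round(r, 3), round(f1, 3))
```

Only "ball" matches exactly here, yet the soft matches keep all three scores high, which is the behaviour described above.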

### 🔢 BERTScore Output (Illustrative Values, Not Measured):

| Metric    | Score (0–1 scale)                      |
| --------- | -------------------------------------- |
| Precision | 0.91                                   |
| Recall    | 0.89                                   |
| **F1**    | **0.90**         ← main score reported |

✅ Even though **few content words match exactly** (only “ball”), BERTScore gives a **high score** because the **meaning is preserved**.


### 🧠 TL;DR

> **BERTScore** gives **high scores** when **meaning is preserved**, even if the wording is different — perfect for **abstractive summarization**, **paraphrase generation**, and **semantic translations**.




### 📊 Comparison Table: ROUGE vs BLEU vs METEOR vs BERTScore

| Feature / Metric          | **BLEU**                        | **ROUGE**                             | **METEOR**                               | **BERTScore**                                  |
| ------------------------- | ------------------------------- | ------------------------------------- | ---------------------------------------- | ---------------------------------------------- |
| **Main Use**              | Translation                     | Summarization (mostly), Translation   | Translation (can work for summarization) | Summarization, Paraphrase, Translation         |
| **Match Type**            | Exact n-gram                    | Exact n-gram (ROUGE-N), LCS (ROUGE-L) | Soft matches (synonyms, stems)           | Contextual similarity (embeddings)             |
| **Handles Synonyms**      | ❌ No                            | ❌ No                                  | ✅ Yes                                    | ✅ Yes                                          |
| **Handles Stemming**      | ❌ No                            | ❌ No                                  | ✅ Yes                                    | ✅ Yes (via context)                            |
| **Considers Word Order**  | ⚠️ Locally (within n-grams)     | ⚠️ Locally (ROUGE-2), in-order sequence (ROUGE-L) | ✅ Yes (fragmentation penalty)            | ⚠️ Indirectly (context sensitivity)            |
| **Score Type**            | Precision (mainly)              | Recall (mainly), Precision, F1        | Precision, Recall, F1                    | Precision, Recall, F1 (based on embeddings)    |
| **Best For**              | Accurate phrasing & translation | Extractive summarization              | Translation with soft matching           | Abstractive summarization, paraphrase, meaning |
| **Semantic Awareness**    | ❌ No                            | ❌ No                                  | ⚠️ Limited (via WordNet)                 | ✅ Yes                                          |
| **Granularity**           | n-gram (typically up to 4)      | Unigram, bigram, LCS                  | Unigram + semantic features              | Token-level embedding match                    |
| **Language Independence** | ✅ Yes (depends on tokenization) | ✅ Yes (depends on tokenization)       | ⚠️ English-focused (WordNet)             | ✅ Multilingual (depends on model used)         |
| **Output Range**          | 0 to 1 (or 0 to 100%)           | 0 to 1                                | 0 to 1                                   | 0 to 1                                         |
| **Interpretability**      | ✅ Easy to interpret             | ✅ Easy                                | ⚠️ Medium (more components)              | ❌ Less intuitive (embedding similarity)        |

---

### 🧠 TL;DR:

| Goal                            | Recommended Metric(s)    |
| ------------------------------- | ------------------------ |
| **Precise translation**         | BLEU, METEOR             |
| **Summarization (extractive)**  | ROUGE                    |
| **Summarization (abstractive)** | BERTScore, ROUGE, METEOR |
| **Semantic similarity**         | BERTScore                |


