<a href="https://colab.research.google.com/github/HosseinEyvazi/NLP/blob/main/Text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# 🧠 Machine Translation

Machine Translation (MT) refers to automatically converting text from one language to another. Modern systems typically use **sequence-to-sequence models** (encoder-decoder architectures).

### 🔍 Main Challenge

* **Low-resource languages** suffer from a lack of parallel training data (e.g., aligned sentence pairs).

### 🔧 Common Approaches

* **Pretrained multilingual models** (e.g., **mBART**, **MarianMT**) leverage shared language representations across many languages.
* **Fine-tuning**: Adapt these models to specific **language pairs** or **domains** for improved accuracy.

### 🌍 Advanced Models

* Models like **NLLB (No Language Left Behind)** support up to **200 languages** and achieve strong results even on low-resource pairs.

---

## 📊 Translation Evaluation Metrics

| Metric     | Description                                                                        | Example                                                                                                                       |
| ---------- | ---------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| **BLEU**   | Measures n-gram precision against reference translations.                          | Ref: *“The cat is on the mat”*<br>Hyp: *“The cat sits on the mat”* → BLEU captures overlap of 1- to 4-grams.                  |
| **ROUGE**  | Originally for summarization, used for overlap and fluency checks.                 | Ref: *“She enjoys reading books.”*<br>Hyp: *“She loves reading novels.”* → ROUGE-L captures “She … reading”.                  |
| **METEOR** | Considers **synonyms**, **stemming**, and **word order**, not just exact matches.  | Ref: *“He searched for the answer.”*<br>Hyp: *“He looked for the solution.”* → Matches "searched/looked" & "answer/solution". |
| **TER**    | Translation Edit Rate: the number of edits needed to match the reference, divided by the reference length. | Fewer edits → lower TER → better translation.                                                         |
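
As a rough illustration of the TER row, the sketch below computes a simplified edit rate: word-level edit distance divided by reference length. It is an assumption-laden simplification, since full TER also counts block shifts as edits; the function names `word_edit_distance` and `simple_ter` are made up for this example.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, r in enumerate(ref_tokens, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Edits divided by reference length; lower is better. Shifts are not modeled."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    return word_edit_distance(hyp, ref) / len(ref)


ter = simple_ter("The cat sits on the mat", "The cat is on the mat")
print(f"TER = {ter:.3f}")  # one substitution over six reference words
```

Because the score is an error rate, it moves in the opposite direction to BLEU, ROUGE, and METEOR: 0 means the hypothesis already equals the reference.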

---

# 📚 Text Summarization

Text summarization reduces long documents into shorter versions while preserving key information.

### 🧠 Two Main Approaches

* **Abstractive**: Generates novel sentences using models like **BART**, **T5**, **mT5**.
* **Extractive**: Selects key sentences or phrases from the input (e.g., TextRank, BERTSum).

> Abstractive models can be more fluent and human-like, while extractive models are often more faithful to the original text.
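
To make the extractive idea concrete, here is a minimal sketch that selects sentences verbatim from the input. It is not TextRank or BERTSum: it swaps in a much simpler word-frequency heuristic, and the function name `extractive_summary` and the sample text are invented for illustration.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Pick the sentences whose words are, on average, most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentences by score, then restore document order for readability.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = (
    "Machine translation converts text between languages. "
    "Translation models need parallel text. "
    "The weather was pleasant yesterday."
)
summary = extractive_summary(doc, num_sentences=1)
print(summary)
```

Every sentence in the output is copied from the input, which is why extractive methods tend to stay faithful to the source. A practical version would at least filter stop words before counting.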

---

## 📏 Summarization Evaluation Metrics

| Metric        | Description                                                                             | Example                                                                                                                                                                                   |
| ------------- | --------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **ROUGE**     | Measures overlap between generated and reference summaries (n-grams, LCS, skip-grams).  | Ref: *“AI drives innovation across fields.”*<br>Gen: *“Artificial intelligence powers change in many domains.”* → ROUGE finds no overlap here, because it matches exact words only; pairs like “AI”/“Artificial intelligence” and “drives”/“powers” do not count. |
| **BLEU**      | Less common for summarization, but used to measure content overlap.                     | Shared phrases between summary and reference are rewarded.                                                                                                                                |
| **BERTScore** | Uses contextual embeddings (e.g., BERT) to measure semantic similarity between outputs. | Recognizes when wording differs but meaning is preserved.                                                                                                                                 |

---

# 🧪 General NLP Evaluation Metrics

| Task                    | Common Metrics                                        |
| ----------------------- | ----------------------------------------------------- |
| **Text Classification** | Accuracy, Precision, Recall, F1 Score                 |
| **Text Generation**     | Perplexity, BLEU Score                                |
| **Summarization**       | ROUGE, BLEU, BERTScore                                |
| **Translation**         | BLEU, METEOR, TER                                     |
| **Question Answering**  | Exact Match (EM), F1 Score, ROUGE (for generative QA) |
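
As one example from the table, the sketch below shows how Exact Match and token-level F1 are typically computed for extractive question answering. It is simplified: answers are only lowercased and split on whitespace, whereas benchmark scripts usually also strip punctuation and articles. The function names and the sample answer pair are assumptions made for illustration.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1 if the answers are identical after trimming and lowercasing, else 0."""
    return int(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("the Eiffel Tower", "Eiffel Tower")
f1 = token_f1("the Eiffel Tower", "Eiffel Tower")
print(f"EM = {em}, F1 = {f1:.2f}")
```

The pair shows why both numbers are reported: the extra word "the" makes Exact Match fail, while F1 still gives partial credit.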

---

# 🌐 Popular Multilingual Models

| Model        | Supported Languages    | Fine-tunable | Notes                                                           |
| ------------ | ---------------------- | ------------ | --------------------------------------------------------------- |
| **T5**       | English (plus German, French, and Romanian translation tasks) | ✅ Yes | General-purpose, English-centric model, strong in abstractive summarization |
| **mT5**      | 101                    | ✅ Yes        | Multilingual version of T5                                      |
| **mBART50**  | 50                     | ✅ Yes        | Seq2seq model for multilingual tasks                            |
| **MarianMT** | \~90                   | ✅ Yes        | Lightweight and fast, supports many translation pairs           |
| **NLLB**     | 200                    | ✅ Yes        | Meta’s model for high-quality low-resource language translation |

> 🔧 All these models can be fine-tuned for specific tasks, domains, or language pairs.

---


### 🎯 **BLEU Score Example (Translation)**

#### **Reference Translation**

> *"The quick brown fox jumps over the lazy dog."*

#### **Machine Translation Output (Hypothesis)**

> *"A fast brown fox leaps over the lazy dog."*

---

### ✅ Step-by-step BLEU Evaluation

#### 1. **Unigram (1-gram) overlap**

Compare individual words in the hypothesis with the reference:

* Overlapping words:
  `"brown"`, `"fox"`, `"over"`, `"the"`, `"lazy"`, `"dog"`
* Total unigrams in hypothesis: 9
* Matching unigrams: 6
* → **Precision\@1 (BLEU-1)** = 6 / 9 ≈ **0.667**

#### 2. **Bigram (2-gram) overlap**

Check pairs of consecutive words:

* Reference bigrams:
  `"The quick"`, `"quick brown"`, `"brown fox"`, `"fox jumps"`, `"jumps over"`, `"over the"`, `"the lazy"`, `"lazy dog"`
* Hypothesis bigrams:
  `"A fast"`, `"fast brown"`, `"brown fox"`, `"fox leaps"`, `"leaps over"`, `"over the"`, `"the lazy"`, `"lazy dog"`
* Matching bigrams:
  `"brown fox"`, `"over the"`, `"the lazy"`, `"lazy dog"` → 4 matches
* Total bigrams in hypothesis: 8
* → **Precision\@2** (bigram precision) = 4 / 8 = **0.5**

#### 3. **Brevity Penalty (BP)**

Used if the hypothesis is shorter than the reference. Here, both have 9 words, so **BP = 1** (no penalty).
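
For reference, the brevity penalty is usually defined as follows, where $c$ is the hypothesis length and $r$ is the reference length (both symbols are introduced here for illustration):

$$
BP = \begin{cases} 1 & \text{if } c > r \\ \exp\left(1 - \frac{r}{c}\right) & \text{if } c \le r \end{cases}
$$

With $c = r = 9$ the exponent is $1 - 9/9 = 0$, so $BP = \exp(0) = 1$, which agrees with the value above.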

---

### 🧮 Final BLEU Score (simplified)

BLEU combines multiple n-gram precisions with a geometric mean and brevity penalty.
For simplicity, here’s a 2-gram version:

$$
\text{BLEU-2} = BP \times \exp\left( \frac{1}{2} (\log P_1 + \log P_2) \right)
$$

$$
= 1 \times \exp\left( \frac{1}{2} (\log 0.667 + \log 0.5) \right) \approx 0.577
$$
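
The calculation above can be reproduced with a short sketch. It assumes whitespace tokenization, lowercasing, and a single reference. It uses clipped n-gram counts, so a hypothesis n-gram counts at most as often as it occurs in the reference, which is the standard BLEU convention. The helper names are illustrative, and library implementations such as sacreBLEU differ in tokenization and smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hyp_tokens, ref_tokens, n):
    """Modified n-gram precision with counts clipped to the reference."""
    hyp_counts = Counter(ngrams(hyp_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    matched = sum((hyp_counts & ref_counts).values())
    return matched / max(sum(hyp_counts.values()), 1)


def bleu(hypothesis, reference, max_n=2):
    hyp = hypothesis.lower().rstrip(".").split()
    ref = reference.lower().rstrip(".").split()
    precisions = [clipped_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean collapses if any precision is zero
    # Brevity penalty: only applied when the hypothesis is not longer than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


score = bleu("A fast brown fox leaps over the lazy dog.",
             "The quick brown fox jumps over the lazy dog.")
print(f"BLEU-2 = {score:.3f}")
```

Setting `max_n=4` gives the usual BLEU-4 variant, which is stricter because 3-gram and 4-gram matches are rarer.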

---

### 📌 Interpretation:

* **BLEU ≈ 0.577** here → indicates **moderate translation quality**.
* Synonyms like *“fast”* vs. *“quick”* and *“leaps”* vs. *“jumps”* **do not count** in BLEU, which rewards only exact word overlap.
* That's why BLEU can **underestimate** semantically correct translations.




---

## 🌟 METEOR Score Example

### 🟦 Reference Sentence (Human Translation):

> "I am looking for a solution."

### 🟩 Hypothesis Sentence (Machine Translation):

> "I'm searching for an answer."

---

## 🔍 What Makes METEOR Special?

Unlike **BLEU** or **ROUGE**, METEOR:

* ✅ Handles **synonyms**
* ✅ Handles **stemming** (e.g. *run* vs. *running*)
* ✅ Considers **word order** penalties
* ✅ Can use **paraphrase dictionaries** (optional)

---

### 🔄 Matching Details:

| Word in Hypothesis | Match Type | Match in Reference |
| ------------------ | ---------- | ------------------ |
| "I'm"              | Contraction (matches only after normalization) | "I am" |
| "searching"        | Synonym    | "looking"          |
| "for"              | Exact      | "for"              |
| "an"               | Article variant (not an exact match) | "a" |
| "answer"           | Synonym    | "solution"         |

* ✅ 5 out of 5 hypothesis words aligned (assuming the contraction and the article variant are accepted as matches)
* ✅ Synonyms & stemming captured
* ✅ Word order preserved

---

### 📏 METEOR Score (Roughly)

* **Precision** ≈ 5/5
* **Recall** ≈ 5/5
* **F-mean** ≈ 1.0 (METEOR's harmonic mean weights recall more heavily than precision)
* **Fragmentation penalty** is small (the matched words appear in the same order as in the reference)

✅ So the **METEOR score would be high** (close to **1.0**),
whereas the **BLEU score would be low** (the only exact word match is “for”).
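
The rough score above can be turned into a number with the formula from the original METEOR formulation: a recall-weighted harmonic mean multiplied by a fragmentation penalty. The sketch assumes that formulation's fixed parameters; later METEOR versions tune them per language. The function name `meteor_from_alignment` and the idea of passing precomputed alignment counts are simplifications for illustration, so no real synonym or stem matching happens here.

```python
def meteor_from_alignment(matches, hyp_len, ref_len, chunks):
    """METEOR-style score from alignment statistics (original-paper parameters)."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Harmonic mean that weights recall nine times more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fewer, longer contiguous chunks mean better word order and a smaller penalty.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)


# Counts follow the idealized alignment above: 5 matched units in one in-order chunk.
score = meteor_from_alignment(matches=5, hyp_len=5, ref_len=5, chunks=1)
print(f"METEOR = {score:.3f}")
```

If the same five matches were scattered into five separate chunks, the penalty would rise to 0.5 and halve the score, which is how METEOR penalizes scrambled word order.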

---

## 🔑 Summary

| Metric     | Would it reward this translation? | Why?                                      |
| ---------- | --------------------------------- | ----------------------------------------- |
| BLEU       | ❌ Low                             | Few exact n-gram overlaps                 |
| ROUGE      | ❌ Low                             | Weak word overlap                         |
| **METEOR** | ✅ High                            | Captures synonyms, paraphrases, and order |

---




## 🟥 ROUGE Score Example

**Reference Summary:**

> *"The quick brown fox jumps over the lazy dog."*

**Generated Summary (Hypothesis):**

> *"A fast brown fox leaps over a lazy dog."*

---

## 🔍 ROUGE-1 (Unigram Overlap)

Measures **word-level (1-gram)** overlap between reference and hypothesis.

* **Reference unigrams:**
  `["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]`

* **Hypothesis unigrams:**
  `["A", "fast", "brown", "fox", "leaps", "over", "a", "lazy", "dog"]`

* **Matching words:**
  `"brown"`, `"fox"`, `"over"`, `"lazy"`, `"dog"` → **5 matches**

### ROUGE-1 Scores:

* **Precision** = 5 / 9 ≈ **0.556**
* **Recall** = 5 / 9 ≈ **0.556**
* **F1** = 2 × (P × R) / (P + R) ≈ **0.556**

---

## 🔍 ROUGE-2 (Bigram Overlap)

Measures **two-word phrase (2-gram)** overlap.

* **Reference bigrams:**
  `"The quick"`, `"quick brown"`, `"brown fox"`, `"fox jumps"`, `"jumps over"`, `"over the"`, `"the lazy"`, `"lazy dog"`

* **Hypothesis bigrams:**
  `"A fast"`, `"fast brown"`, `"brown fox"`, `"fox leaps"`, `"leaps over"`, `"over a"`, `"a lazy"`, `"lazy dog"`

* **Matching bigrams:**
  `"brown fox"`, `"lazy dog"` → **2 matches**

### ROUGE-2 Scores:

* **Precision** = 2 / 8 = **0.25**
* **Recall** = 2 / 8 = **0.25**
* **F1** = 0.25

---

## 🔍 ROUGE-L (Longest Common Subsequence)

Looks for the **longest in-order matching word sequence**, ignoring gaps.

* **LCS between reference and hypothesis:**
  `"brown fox over lazy dog"` → length = **5**

### ROUGE-L Scores:

* **Precision** = 5 / 9 ≈ **0.556**
* **Recall** = 5 / 9 ≈ **0.556**
* **F1** ≈ 0.556

---

## 📊 Summary of ROUGE Scores

| Metric      | Precision | Recall | F1 Score |
| ----------- | --------- | ------ | -------- |
| **ROUGE-1** | 0.556     | 0.556  | 0.556    |
| **ROUGE-2** | 0.25      | 0.25   | 0.25     |
| **ROUGE-L** | 0.556     | 0.556  | 0.556    |
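
The three ROUGE variants above can be checked with a small sketch. It assumes lowercased alphanumeric tokens, a single reference, and plain F1 for ROUGE-L; packages such as `rouge-score` add options like stemming. All function names here are illustrative.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def prf(matches, hyp_total, ref_total):
    """Precision, recall, and F1 from a match count and the two totals."""
    p = matches / hyp_total if hyp_total else 0.0
    r = matches / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_n(hypothesis, reference, n):
    hyp = ngram_counts(tokenize(hypothesis), n)
    ref = ngram_counts(tokenize(reference), n)
    return prf(sum((hyp & ref).values()), sum(hyp.values()), sum(ref.values()))


def rouge_l(hypothesis, reference):
    hyp, ref = tokenize(hypothesis), tokenize(reference)
    return prf(lcs_length(hyp, ref), len(hyp), len(ref))


reference = "The quick brown fox jumps over the lazy dog."
hypothesis = "A fast brown fox leaps over a lazy dog."
for name, scores in [("ROUGE-1", rouge_n(hypothesis, reference, 1)),
                     ("ROUGE-2", rouge_n(hypothesis, reference, 2)),
                     ("ROUGE-L", rouge_l(hypothesis, reference))]:
    print(name, [round(s, 3) for s in scores])
```

Precision and recall coincide in this example only because both sentences have nine words; with summaries of different lengths they diverge.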

---

### ✅ When to Use Each ROUGE Variant:

| Variant     | Best For                                     |
| ----------- | -------------------------------------------- |
| **ROUGE-1** | Word-level overlap (basic relevance)         |
| **ROUGE-2** | Phrase-level fluency                         |
| **ROUGE-L** | Capturing **sentence structure** and fluency |



**Important note:** high ROUGE-L doesn’t mean a good summary.
It only shows that the output shares an in-order word sequence with the reference, not that it’s concise, meaningful, or truly summarized.

🔄 **Example**

🔷 **Input Document (Long):**
*"Artificial intelligence has revolutionized various industries by enabling machines to learn from data, adapt to new inputs, and perform human-like tasks."*

🔷 **Reference Summary (Good):**
*"AI enables machines to learn and mimic human behavior."*

🔷 **Generated Summary A (Copy-Paste):**
*"Artificial intelligence has revolutionized various industries by enabling machines to learn from data, adapt to new inputs, and perform human-like tasks."*

✅ **ROUGE-L recall: fairly high**, because the copy still contains much of the reference in order (“machines to learn … and”).

❌ **But:** this is just a copy of the input, so **no actual summarization** was done. ROUGE-L precision is low here, which is why recall should not be read on its own.
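
To see how ROUGE-L treats the copy-paste summary, the sketch below computes the longest common subsequence against the reference and reports recall and precision separately. It assumes lowercased alphabetic tokens, so "human-like" is split into two words; a different tokenizer would shift the counts slightly.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


reference = "AI enables machines to learn and mimic human behavior."
copied = ("Artificial intelligence has revolutionized various industries by enabling "
          "machines to learn from data, adapt to new inputs, and perform human-like tasks.")

ref_tokens, hyp_tokens = tokenize(reference), tokenize(copied)
lcs = lcs_length(hyp_tokens, ref_tokens)
recall = lcs / len(ref_tokens)      # share of the reference that was recovered
precision = lcs / len(hyp_tokens)   # share of the output that was needed
print(f"LCS={lcs}  recall={recall:.2f}  precision={precision:.2f}")
```

Recall rewards the copy for covering the reference, while precision exposes how much extra text it carries. Reporting F1 or a length ratio alongside recall helps catch this failure mode.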




---

## 🌟 What is BERTScore?

**BERTScore** compares the **semantic similarity** between the **generated sentence** and the **reference**, using contextual embeddings from models like **BERT** or **RoBERTa**.

✅ It understands **meaning**, not just surface form (words).
❌ It does **not rely on exact matches** or word order directly.

---

## 🧪 BERTScore Example

### 🟦 Reference Sentence (Gold Summary / Translation):

> *"A dog is chasing a ball in the field."*

### 🟩 Hypothesis Sentence (Generated):

> *"A puppy runs after a ball on the grass."*

---

### 🔍 Traditional Metrics Would Say:

* **BLEU**: Low (few exact n-gram matches like “a”, “ball”)
* **ROUGE**: Low (misses many unigrams like “puppy”, “runs”)
* **METEOR**: Medium (might catch "puppy/dog", "runs/chasing")
* ❗ **But all ignore full sentence-level semantics.**

---

### ✅ BERTScore’s View:

* "puppy" and "dog" → **semantically close**
* "runs after" and "chasing" → **contextually similar**
* "on the grass" and "in the field" → **nearly equivalent**

BERTScore embeds each token in context and **matches tokens semantically** between the hypothesis and reference: each token is paired with its most similar token on the other side, measured by cosine similarity. It reports **precision**, **recall**, and **F1** over these soft matches.
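
The matching step can be sketched without a neural model. The example below uses tiny hand-written 2-D vectors in place of real contextual embeddings, so every number in `TOY_EMBEDDINGS` is invented for illustration. It also omits the optional IDF weighting and baseline rescaling that the real metric offers. Only the greedy cosine matching is meant to mirror BERTScore.

```python
import math

# Hypothetical 2-D "embeddings", invented for illustration only.
# Real BERTScore uses contextual vectors from a pretrained model such as BERT.
TOY_EMBEDDINGS = {
    "dog": (0.9, 0.1), "puppy": (0.8, 0.2),
    "chasing": (0.1, 0.9), "runs": (0.2, 0.8),
    "ball": (0.7, 0.7),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def greedy_match(source, target):
    """Average, over source tokens, of the best cosine similarity to any target token."""
    return sum(max(cosine(TOY_EMBEDDINGS[s], TOY_EMBEDDINGS[t]) for t in target)
               for s in source) / len(source)


reference = ["dog", "chasing", "ball"]
hypothesis = ["puppy", "runs", "ball"]

precision = greedy_match(hypothesis, reference)  # each hypothesis token vs. best reference token
recall = greedy_match(reference, hypothesis)     # each reference token vs. best hypothesis token
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```

Only one of the three tokens matches exactly, yet the score stays high because "puppy" and "runs" sit close to "dog" and "chasing" in this toy space. An exact-match metric would give credit for "ball" alone.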

### 🔢 BERTScore Output (Illustrative):

| Metric    | Score (0 to 1 scale)                   |
| --------- | -------------------------------------- |
| Precision | 0.91                                   |
| Recall    | 0.89                                   |
| **F1**    | **0.90**         ← main score reported |

✅ Even though **few words match exactly**, BERTScore gives a **high score** because the **meaning is preserved**.


### 🧠 TL;DR

> **BERTScore** gives **high scores** when **meaning is preserved**, even if the wording is different, which makes it well suited to **abstractive summarization**, **paraphrase generation**, and **semantic translations**.




### 📊 Comparison Table: ROUGE vs BLEU vs METEOR vs BERTScore

| Feature / Metric          | **BLEU**                        | **ROUGE**                             | **METEOR**                               | **BERTScore**                                  |
| ------------------------- | ------------------------------- | ------------------------------------- | ---------------------------------------- | ---------------------------------------------- |
| **Main Use**              | Translation                     | Summarization (mostly), Translation   | Translation (can work for summarization) | Summarization, Paraphrase, Translation         |
| **Match Type**            | Exact n-gram                    | Exact n-gram (ROUGE-N), LCS (ROUGE-L) | Soft matches (synonyms, stems)           | Contextual similarity (embeddings)             |
| **Handles Synonyms**      | ❌ No                            | ❌ No                                  | ✅ Yes                                    | ✅ Yes                                          |
| **Handles Stemming**      | ❌ No                            | ❌ No                                  | ✅ Yes                                    | ✅ Yes (via context)                            |
| **Considers Word Order**  | ⚠️ Only locally (within n-grams) | ⚠️ Locally in ROUGE-N (n ≥ 2), globally in ROUGE-L | ✅ Yes (fragmentation penalty) | ⚠️ Indirectly (context sensitivity)            |
| **Score Type**            | Precision (mainly)              | Recall (mainly), Precision, F1        | Precision, Recall, F1                    | Precision, Recall, F1 (based on embeddings)    |
| **Best For**              | Accurate phrasing & translation | Extractive summarization              | Translation with soft matching           | Abstractive summarization, paraphrase, meaning |
| **Semantic Awareness**    | ❌ No                            | ❌ No                                  | ⚠️ Limited (via WordNet)                 | ✅ Yes                                          |
| **Granularity**           | n-gram (typically up to 4)      | Unigram, bigram, LCS                  | Unigram + semantic features              | Token-level embedding match                    |
| **Language Independence** | ✅ Yes (given suitable tokenization) | ✅ Yes                             | ⚠️ English-focused (WordNet)             | ✅ Multilingual (depends on model used)         |
| **Output Range**          | 0 to 1 (or 0 to 100%)           | 0 to 1                                | 0 to 1                                   | 0 to 1                                         |
| **Interpretability**      | ✅ Easy to interpret             | ✅ Easy                                | ⚠️ Medium (more components)              | ❌ Less intuitive (embedding similarity)        |

---

### 🧠 TL;DR:

| Goal                            | Recommended Metric(s)    |
| ------------------------------- | ------------------------ |
| **Precise translation**         | BLEU, METEOR             |
| **Summarization (extractive)**  | ROUGE                    |
| **Summarization (abstractive)** | BERTScore, ROUGE, METEOR |
| **Semantic similarity**         | BERTScore                |


