<a href="https://colab.research.google.com/github/CrisMcode111/DI_Bootcamp/blob/main/w8_d1_evaluating_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What you will create
You will complete exercises to apply BLEU and ROUGE scores to sample text, analyze perplexity scores, conduct adversarial testing, and propose improvements for LLM evaluation methodologies.



1. Understanding LLM Evaluation:

Explain why evaluating LLMs is more complex than traditional software.
Identify key reasons for evaluating an LLM’s safety.
Describe how adversarial testing contributes to LLM improvement.
Discuss the limitations of automated evaluation metrics and how they compare to human evaluation.


1. Why LLM Evaluation Is More Complex Than Traditional Software

Evaluating an LLM is fundamentally different from evaluating classic deterministic software.

Key reasons:

Non-deterministic outputs:
Traditional software always returns the same result for the same input.
LLMs generate probabilistic outputs that can vary each run.

Dependence on training data:
Models inherit patterns, biases, and errors from massive datasets that developers cannot fully inspect.

High sensitivity to prompts:
Small changes in wording can dramatically change the output.

Emergent behaviors:
Large models can display abilities not explicitly programmed.

No single “correct” answer:
Many tasks (summarization, creative writing, etc.) do not have one definitive ground truth.
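
The non-determinism point can be made concrete with a toy sketch. The next-token distribution below is invented for illustration (the words and probabilities are assumptions, not output from any real model); it only contrasts random sampling with deterministic greedy selection.

```python
import random

# Hypothetical next-token distribution; words and probabilities are made up.
next_token_probs = {"Paris": 0.7, "Lyon": 0.2, "Marseille": 0.1}

def sample_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def greedy_token(probs):
    """Always pick the most probable token (deterministic)."""
    return max(probs, key=probs.get)

rng = random.Random()  # unseeded, so sampled results can differ between runs
samples = [sample_token(next_token_probs, rng) for _ in range(10)]
print("sampled:", samples)
print("greedy :", greedy_token(next_token_probs))
```

Running the cell several times should give different sampled lists but the same greedy token, which is why a single run is rarely enough when evaluating sampled outputs.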

2. Why We Evaluate LLM Safety

LLM safety evaluation is essential because models can unintentionally cause harm.

Main motivations:

Hallucinations:
Models can fabricate facts, references, or instructions.

Harmful advice:
Guidance involving health, chemicals, self-harm, or dangerous activities can be unsafe.

Susceptibility to adversarial prompts:
Attackers can try to bypass safety rules (jailbreaks).

Bias and discrimination:
Models may reproduce or amplify social biases present in training data.

Privacy risks:
Potential leakage of personal or sensitive information.

Malicious use cases:
Phishing, misinformation, impersonation, or automated fraud.

3. How Adversarial Testing Improves LLMs

Adversarial testing challenges the model with intentionally difficult, misleading, or harmful prompts to expose weaknesses.

Contributions to model improvement:

Identifies vulnerabilities:
Reveals where the model breaks, hallucinates, or becomes unsafe.

Supports continuous fine-tuning:
Each discovered flaw becomes training data for future safety refinements.

Captures unexpected behavior:
Helps uncover edge cases that normal evaluation would miss.

Strengthens red-team processes:
Provides structured methods to stress-test safety systems before real-world misuse occurs.
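
One simple form of adversarial testing is to perturb a prompt and check whether the answer stays correct. The sketch below is only an illustration: `toy_model` is a hypothetical stand-in for a real LLM call, and the perturbations are a small, assumed sample of what a red team would try.

```python
def perturb(prompt):
    """Return simple surface-level variants of a prompt (illustrative, not exhaustive)."""
    return [
        prompt,
        prompt.lower(),
        prompt.upper(),
        prompt.replace("capital", "capitol"),  # common homophone slip
        "  " + prompt + "  ",                  # stray whitespace
    ]

def consistency_report(model, prompt, expected):
    """Run every variant through `model` and collect the ones that miss the expected answer."""
    failures = []
    for variant in perturb(prompt):
        answer = model(variant)
        if expected.lower() not in answer.lower():
            failures.append((variant, answer))
    return failures

# Hypothetical stand-in for an LLM: a brittle, case-sensitive keyword matcher.
def toy_model(prompt):
    return "Paris." if "capital of France" in prompt else "I am not sure."

failures = consistency_report(toy_model, "What is the capital of France?", "Paris")
for variant, answer in failures:
    print(repr(variant), "->", answer)
```

Each failing variant is a candidate for a regression test or for additional fine-tuning data.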

4. Limitations of Automated Metrics vs Human Evaluation

Automated metrics (BLEU, ROUGE, METEOR, BERTScore, perplexity) provide quick, reproducible measurements — but they do not fully capture language quality.

Limitations of automated metrics:

BLEU / ROUGE:
Focus on n-gram overlap → penalize correct paraphrases or alternative valid expressions.

METEOR:
Better with synonyms, but still limited in understanding context and intent.

Perplexity:
Measures how predictable text is to the model — not how good, correct, or useful the text is.

BERTScore:
Measures semantic similarity, but can miss nuance, tone, humor, or factual accuracy.

Why human evaluation matters:

Humans can assess qualities that metrics cannot:

clarity and coherence

fluency and naturalness

factual correctness

tone and emotional appropriateness

usefulness and safety

ethical implications

user preference

Automated metrics are helpful but incomplete. Human judgment remains the gold standard for real-world LLM evaluation.

2. Applying BLEU and ROUGE Metrics:

Calculate the BLEU score for the following example:

Reference: “Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.”
Generated: “Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.”
Calculate the ROUGE score for the following example:

Reference: “In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact.”
Generated: “To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development.”
Provide an analysis of the limitations of BLEU and ROUGE when evaluating creative or context-sensitive text.

Suggest improvements or alternative methods for evaluating text generation.



In [1]:
# Install NLTK if needed
!pip install nltk

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# sentence_bleu expects a list of tokenized references and one tokenized hypothesis.
# Whitespace splitting keeps punctuation attached to words (e.g. "industries,").
reference = ["Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.".split()]
generated = "Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.".split()

# method1 smoothing avoids a zero score when a higher-order n-gram has no matches.
chencherry = SmoothingFunction()

bleu_score = sentence_bleu(reference, generated, smoothing_function=chencherry.method1)

print("BLEU score:", bleu_score)


BLEU score: 0.06296221772093169


In [2]:
!pip install rouge-score

from rouge_score import rouge_scorer

reference = "In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact."
generated = "To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development."

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

scores = scorer.score(reference, generated)

print("ROUGE-1:", scores['rouge1'])
print("ROUGE-2:", scores['rouge2'])
print("ROUGE-L:", scores['rougeL'])


ROUGE-1: Score(precision=0.47058823529411764, recall=0.3333333333333333, fmeasure=0.39024390243902435)
ROUGE-2: Score(precision=0.1875, recall=0.13043478260869565, fmeasure=0.15384615384615383)
ROUGE-L: Score(precision=0.35294117647058826, recall=0.25, fmeasure=0.2926829268292683)
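
As a sanity check on the output above, the `fmeasure` field can be reproduced from the printed precision and recall, assuming it is the usual harmonic mean (F1). The helper name `f_measure` is introduced here for illustration only.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/recall values copied from the ROUGE output above.
printed = {
    "ROUGE-1": (0.47058823529411764, 0.3333333333333333),
    "ROUGE-2": (0.1875, 0.13043478260869565),
    "ROUGE-L": (0.35294117647058826, 0.25),
}

recomputed = {name: f_measure(p, r) for name, (p, r) in printed.items()}
for name, f in recomputed.items():
    print(name, "F1 =", round(f, 4))
```

The recomputed values should agree with the printed `fmeasure` fields to within floating-point rounding.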


# Limitations of BLEU and ROUGE for Creative or Context-Sensitive Text

BLEU and ROUGE are widely used automatic evaluation metrics, but they were originally designed for structured tasks such as machine translation and summarization, where close overlap with a reference text is desirable.
When applied to creative, open-ended, or context-sensitive text, these metrics have significant limitations.

1. Overreliance on n-gram overlap

Both BLEU and ROUGE compare generated text to a reference by counting shared n-grams.
This means:

* Valid paraphrases are penalized.

* Synonyms, rephrasings, or stylistically different text lead to low scores even if meaning is correct.

* Creative writing (narrative, argumentation, explanation) rarely matches word-for-word with a reference.

Example:
“human oversight is essential” vs. “human supervision is necessary”
→ Same meaning, very low n-gram overlap.
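
The overlap for this example can be counted directly. The following is a minimal sketch of clipped n-gram precision (the building block of BLEU), using whitespace tokenization as a simplifying assumption; it is not the NLTK implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(reference, candidate, n):
    """Share of candidate n-grams that also occur in the reference (clipped counts)."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total

reference = "human oversight is essential"
candidate = "human supervision is necessary"

print("unigram precision:", ngram_precision(reference, candidate, 1))  # 2 of 4 words match
print("bigram precision :", ngram_precision(reference, candidate, 2))  # no bigram matches
```

Only "human" and "is" match at the unigram level and no bigram matches at all, even though the two phrases mean the same thing.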

2. No understanding of semantics or context

BLEU/ROUGE measure surface similarity, not meaning.
They cannot detect whether:

* the answer is factually correct

* the reasoning is coherent

* the tone is appropriate

* the text respects real-world context

A sentence can score high even if it contains factual errors, as long as it matches the reference’s wording.

3. Penalize diversity and originality

For tasks requiring creativity (story generation, open-ended reasoning, dialogue), these metrics discourage innovation.

They favor:

* rigid similarity

* conservative phrasing

* repetition of the reference structure

But creative tasks often require:

* new metaphors

* new examples

* stylistic variation

* reordering of ideas

→ BLEU/ROUGE cannot reward this.

4. Sensitivity to sentence length

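Both metrics can produce misleading values when the output length differs from the reference:

* the generated text is too short (n-gram precision is inflated; BLEU offsets this only through its brevity penalty, and ROUGE recall drops)

* the generated text is too long (precision drops because the extra content is not in the reference)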

This becomes problematic in tasks where elaboration or creativity is expected.
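
For BLEU specifically, length enters through the brevity penalty. The sketch below follows the standard definition (1 when the candidate is longer than the reference, otherwise exp(1 - r/c)); the function name and the example lengths are chosen here for illustration.

```python
import math

def brevity_penalty(ref_len, cand_len):
    """BLEU brevity penalty: 1.0 for long candidates, exp(1 - r/c) for short ones."""
    if cand_len == 0:
        return 0.0
    if cand_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)

# Example lengths (in tokens) are arbitrary.
for cand_len in (5, 10, 20, 30):
    print(cand_len, "tokens ->", round(brevity_penalty(20, cand_len), 4))
```

A candidate half as long as the reference has its score multiplied by roughly 0.37, while a longer candidate gets no length bonus and is instead limited by its lower n-gram precision.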

5. Lack of sensitivity to coherence and quality

BLEU and ROUGE ignore important aspects of language quality:

* argument flow

* narrative coherence

* emotional tone

* clarity

* persuasion

* grammar quality

A text can achieve a high BLEU/ROUGE score and still be confusing, poorly structured, or low-quality.

6. They assume one “gold standard” reference

Creative or context-sensitive tasks do not have a single correct answer.
Multiple valid outputs are possible, but BLEU/ROUGE compare only to one reference or a small set.

This structurally biases the evaluation against diversity.

Conclusion

BLEU and ROUGE are useful for structured tasks but insufficient for evaluating creative, context-sensitive, or open-ended LLM outputs.
They fail to measure meaning, quality, coherence, nuance, and originality — all essential for real-world language generation.

For more reliable assessment, they should be paired with:

* human evaluation,

* semantic metrics (e.g., BERTScore, Sentence Mover’s Similarity),

* task-specific qualitative criteria,

* and safety/robustness analysis.

Improvements and Alternative Methods for Evaluating Text Generation

Evaluating text generation requires more than surface-level metrics like BLEU and ROUGE. Modern LLMs operate in open-ended, creative, and context-dependent scenarios, so more robust evaluation approaches are needed.

Below are key improvements and alternative methodologies.

1. Semantic Similarity Metrics

Instead of relying on exact word overlap, semantic metrics measure meaning.
They compare embeddings rather than surface text.

Examples:

* BERTScore – uses contextual embeddings to measure semantic similarity

* Sentence Mover’s Similarity (SMS) – assesses how well meaning is preserved

* Word Mover’s Distance (WMD) – computes the minimal distance between word vectors

Advantage:
Captures synonymy, paraphrasing, and deeper semantic alignment.
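
Embedding-based metrics rest on vector similarity, most commonly cosine similarity. The sketch below uses made-up two-dimensional vectors purely as placeholders; real metrics such as BERTScore use high-dimensional contextual embeddings from a transformer model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented toy "embeddings"; the numbers carry no real linguistic meaning.
toy_vectors = {
    "oversight":   [0.9, 0.1],
    "supervision": [0.8, 0.2],
    "banana":      [0.1, 0.9],
}

sim_synonym = cosine(toy_vectors["oversight"], toy_vectors["supervision"])
sim_unrelated = cosine(toy_vectors["oversight"], toy_vectors["banana"])
print("oversight vs supervision:", round(sim_synonym, 3))
print("oversight vs banana     :", round(sim_unrelated, 3))
```

With embeddings that place synonyms close together, a paraphrase scores high even when it shares no exact words with the reference.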

2. Human Evaluation (Gold Standard)

Human judgment remains the most reliable method for open-ended tasks.

Often evaluated on:

* fluency and naturalness

* coherence and structure

* factual accuracy

* helpfulness and relevance

* tone and appropriateness

* creativity and originality

Advantage:
Humans can detect nuance, emotion, correctness, ethics, and intent — areas where automated metrics fail.

3. Task-Specific Rubrics

Different tasks require different evaluation frameworks.

Examples:

* Summarization: faithfulness, completeness, conciseness

* Dialogue: coherence, empathy, engagement

* Creative writing: style, originality, voice consistency

* Instruction following: accuracy, completeness, safety

Advantage:
Allows precise evaluation aligned with the task goals.

4. Model-Based Evaluation (LLM-as-a-Judge)

Large models can evaluate other models’ outputs using structured criteria.

Techniques:

* GPT-4 judging GPT-3 outputs

* LLM-raters with chain-of-thought reasoning

* Self-consistency scoring

Advantages:

* Scalable

* Correlates surprisingly well with human evaluation

* Can apply task-specific scoring rubrics

Caution:
Must be monitored for bias and instability.

5. Adversarial Testing and Red-Team Evaluation

Tests robustness by challenging the model with difficult, ambiguous, or harmful prompts.

Checks for:

* safety violations

* hallucinations

* logical inconsistencies

* prompt sensitivity

* jailbreak vulnerabilities

Advantage:
Reveals failure modes that normal evaluation misses.

6. Factuality and Groundedness Metrics

Useful when models must remain accurate (e.g., news, science, medical).

Approaches:

* Fact-checking models (FEVER score)

* Knowledge-graph alignment

* Retrieval consistency tests

Advantage:
Distinguishes between fluent nonsense and genuinely correct output.

7. Diversity and Creativity Metrics

For story generation, brainstorming, or open-ended tasks:

Metrics:

* Distinct-n / Unique-n: counts unique n-grams

* Entropy / lexical richness: measures vocabulary diversity

* Novelty metrics: detect originality relative to training data

Advantage:
Rewards creativity instead of punishing it.
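
Distinct-n is simple enough to sketch directly: it is the number of unique n-grams divided by the total number of n-grams across a set of outputs. The version below assumes lowercase whitespace tokenization, and the sample texts are invented.

```python
def distinct_n(texts, n):
    """Ratio of unique n-grams to total n-grams over a list of generated texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

# Invented sample outputs for illustration.
repetitive = ["the cat sat", "the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog ran", "birds fly south"]

print("distinct-1 repetitive:", round(distinct_n(repetitive, 1), 3))
print("distinct-1 varied    :", round(distinct_n(varied, 1), 3))
```

A value near 1.0 means almost every n-gram is new, while a value near 0 indicates heavy repetition across outputs.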

8. Holistic Evaluation Frameworks

Combining multiple evaluation types results in a more accurate overall assessment.

Examples:

* HELM (Holistic Evaluation of Language Models)

* G-Eval (LLM-based rubric evaluation)

* MMLU, TruthfulQA, SafetyBench

Advantage:
Captures performance, safety, fairness, robustness, and real-world reliability.

Conclusion

Evaluating text generation requires moving beyond simple n-gram metrics.
A strong evaluation pipeline should combine:

* semantic metrics (BERTScore, SMS)

* human rating

* task-specific criteria

* safety & robustness tests

* LLM-as-a-judge frameworks

This multi-layered approach provides a far more accurate, fair, and meaningful assessment of modern text-generation models.

3. Perplexity Analysis:

Compare the perplexity of the two language models based on the probability assigned to a word:

Model A: Assigns 0.8 probability to “mitigation.”
Model B: Assigns 0.4 probability to “mitigation.”
Determine which model has lower perplexity and explain why.

Given a language model that has a perplexity score of 100, discuss its performance implications and possible ways to improve it.

# 3. Perplexity Analysis

## Model Comparison Based on Probability

Perplexity (PP) for a single word is defined as:

$$
PP = \frac{1}{P(w)}
$$

Where \( P(w) \) is the probability the model assigns to the word.

### Given:
- Model A: \( P = 0.8 \)
- Model B: \( P = 0.4 \)

### Perplexity Calculation:

$$
PP_A = \frac{1}{0.8} = 1.25
$$

$$
PP_B = \frac{1}{0.4} = 2.5
$$

### Conclusion
Model **A** has lower perplexity (1.25 vs. 2.5). It assigns twice as much probability to the observed word **"mitigation"**, so it is less surprised by that word and predicts it better.
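
The single-word formula is a special case of the usual sequence-level definition, where perplexity is the exponential of the average negative log-probability per token. The sketch below reproduces the two values above and adds a two-token example; the two-token sequence is invented for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

pp_a = perplexity([0.8])           # Model A on "mitigation"
pp_b = perplexity([0.4])           # Model B on "mitigation"
pp_seq = perplexity([0.8, 0.4])    # hypothetical two-token sequence

print("Model A:", round(pp_a, 4))
print("Model B:", round(pp_b, 4))
print("Two-token example:", round(pp_seq, 4))
```

For one token the function reduces to 1 / P(w), which matches the calculation above; for longer sequences it is the geometric mean of the inverse probabilities.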

---

## Interpreting a Perplexity Score of 100

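A perplexity of **100** means that, on average, the model is as uncertain as if it had to choose among **100 equally probable next words** at each step. Whether that counts as poor depends on the vocabulary, tokenization, and test domain, so it is best compared against a baseline evaluated on the same data.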
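
The "100 equally probable words" reading can be checked numerically: a uniform distribution over 100 options has an entropy of log2(100) bits, and 2 raised to that entropy gives the perplexity. This is a small sketch of that arithmetic, not a measurement of any model.

```python
import math

vocab_size = 100
uniform = [1 / vocab_size] * vocab_size  # every next word equally likely

# Entropy in bits, then perplexity as 2 ** entropy.
entropy_bits = -sum(p * math.log2(p) for p in uniform)
uniform_perplexity = 2 ** entropy_bits

print("entropy (bits):", round(entropy_bits, 4))
print("perplexity    :", round(uniform_perplexity, 4))
```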

### Implications:
- The model is not predicting well.
- It may be poorly trained or poorly matched to the domain.
- It shows weak confidence in its probability distribution.

---

## Ways to Improve a High Perplexity Score

- Increase the size of the training dataset.
- Improve data cleaning and consistency.
- Use a larger or more advanced model architecture.
- Fine-tune on domain-specific text.
- Optimize hyperparameters (learning rate, dropout, etc.).
- Use better tokenization adapted to the domain.



4. Human Evaluation Exercise:

Rate the fluency of this chatbot response using a Likert scale (1-5): “Apologies, but comprehend I do not. Could you rephrase your question?”
Justify your rating.
Propose an improved version of the response and explain why it is better.

# 4. Human Evaluation Exercise

## Fluency Rating (Likert Scale 1–5)

**Rating: 2 / 5**

### Justification
The response is grammatically unusual and unnatural in English.  
The phrase **“comprehend I do not”** resembles Yoda-style speech, which reduces clarity and fluency.  
While the intention is understandable, the structure is not typical for natural conversational English.

---

## Improved Version

**Improved response:**  
“I'm sorry, I didn't quite understand that. Could you please rephrase your question?”

### Why this version is better:
- **Natural and fluent** phrasing  
- **Clear and polite tone**  
- **Uses standard English grammar**  
- **Keeps the meaning identical**  
- **Improves user experience** by sounding more helpful and less robotic



5. Adversarial Testing Exercise:

Identify the potential mistake an LLM might make when answering the Prompt: “What is the capitol of France?”

Expected: “Paris.”
Suggest a method to improve robustness against such errors.

Create at least three tricky prompts that could challenge an LLM’s robustness, bias detection, or factual accuracy.

# 5. Adversarial Testing Exercise

## Potential Mistake for the Prompt:
**Prompt:** “What is the *capitol* of France?”

### Possible LLM Mistake
The model may interpret **“capitol”** literally:
- as a government **building** (like “the Capitol” in Washington, D.C.), or
- as *a type of structure* to describe, rather than recognizing that the user meant **“capital”**.

This can lead to wrong or confused outputs such as:
- “France does not have a Capitol building like the U.S.”
- “The French capitol is located in Paris but functions differently…”

### Expected Correct Answer
**Paris.**

---

## Method to Improve Robustness
A robust approach includes:

### **Spell-checking + Semantic Intent Inference**
The model should:
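1. Detect likely misspellings and homophone confusions (“capitol” → “capital”); note that “capitol” is itself a valid word, so a dictionary-only spell-checker would not flag it.  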
2. Infer user intent based on context.  
3. Confirm or cl


6. Comparative Analysis of Evaluation Methods:

Choose an NLP task (e.g., machine translation, text summarization, question answering).
Compare and contrast at least three different evaluation metrics (BLEU, ROUGE, BERTScore, Perplexity, Human Evaluation, etc.).
Discuss which metric is most appropriate for the chosen task and why.


# 6. Comparative Analysis of Evaluation Methods

## Chosen NLP Task: **Text Summarization**

Text summarization requires both **semantic accuracy** (capturing the core meaning) and **linguistic quality** (fluency, coherence).  
Below is a comparison of commonly used evaluation methods for this task.

---

## 1. **ROUGE**

### What it measures:
- n-gram overlap between the generated summary and a reference summary  
- ROUGE-1 (unigram overlap)  
- ROUGE-2 (bigram overlap)  
- ROUGE-L (longest common subsequence)
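
ROUGE-L can be sketched in a few lines with a standard dynamic-programming longest common subsequence. This is a simplified illustration (lowercase whitespace tokens, plain F1, no stemming), not the `rouge_score` implementation, and the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, candidate):
    """Return (precision, recall, F1) based on the LCS of the two texts."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

p, r, f = rouge_l("the cat sat on the mat", "the cat lay on the mat")
print("precision:", round(p, 4), "recall:", round(r, 4), "F1:", round(f, 4))
```

Here the LCS is "the cat on the mat" (5 tokens out of 6), so the single substituted word lowers the score only slightly, whereas a reordered paraphrase would be penalized much more.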

### Strengths:
- Standard metric widely used in summarization research  
- Easy to compute and compare  
- Captures whether key words and phrases match the reference

### Weaknesses:
- Rewards surface similarity, not meaning  
- Penalizes valid paraphrases  
- Does not detect factual errors  
- Does not measure coherence or fluency

---

## 2. **BERTScore**

### What it measures:
- Semantic similarity using contextual embeddings from transformer models  
- Evaluates meaning rather than exact wording

### Strengths:
- Captures synonyms and paraphrasing  
- Often correlates better with human judgments than ROUGE  
- Sensitive to deeper semantic alignment

### Weaknesses:
- More computationally expensive  
- May still miss factual inconsistencies  
- Performance depends on the underlying language model

---

## 3. **Human Evaluation**

### What it measures:
Humans rate the summary on:
- Fluency  
- Coherence  
- Faithfulness (no hallucinations)  
- Coverage of key information  
- Usefulness and readability

### Strengths:
- Best method for open-ended and creative tasks  
- Can detect nuance, tone, and factuality  
- Not fooled by paraphrasing or reordering of information

### Weaknesses:
- Time-consuming  
- Expensive  
- Not easily scalable  
- Subjective unless guided by a scoring rubric

---

## 4. **Comparison for Text Summarization**

| Metric         | Captures Meaning | Robust to Paraphrasing | Detects Factual Errors | Measures Fluency | Scalable              |
|----------------|------------------|------------------------|------------------------|------------------|-----------------------|
| **ROUGE**      | Partial          | ❌ No                  | ❌ No                  | ❌ No            | ✔ Yes                 |
| **BERTScore**  | ✔ Good           | ✔ Yes                  | ❌ No                  | Limited          | Medium (more compute) |
| **Human Eval** | ✔ Excellent      | ✔ Excellent            | ✔ Yes                  | ✔ Yes            | ❌ No                 |

---

## 5. **Most Appropriate Metric for Text Summarization**

### **Best single metric: BERTScore**  
Because:
- it evaluates **semantic similarity**,  
- allows freedom in wording,  
- aligns more closely with how humans judge summary quality.

### **But the ideal evaluation combines:**
- **ROUGE** to compare key content overlap,  
- **BERTScore** to assess semantic preservation,  
- **Human evaluation** to evaluate fluency, coherence, and factual consistency.

### Final Conclusion:
**No single metric is sufficient.**  
For text summarization, a hybrid approach provides the most r
