1. Understanding LLM Evaluation:

Explain why evaluating LLMs is more complex than traditional software.
Identify key reasons for evaluating an LLM’s safety.
Describe how adversarial testing contributes to LLM improvement.
Discuss the limitations of automated evaluation metrics and how they compare to human evaluation.

1. Why is evaluating LLMs more complex than traditional software?
Evaluating LLMs is more complex because:

Non-deterministic outputs: Unlike traditional software that yields predictable outputs, LLMs may produce different valid responses for the same input depending on randomness (temperature) or prompt phrasing.
Open-ended tasks: Many LLM tasks (e.g., summarization, translation, reasoning) don't have a single correct answer, making correctness subjective and multi-dimensional.
Lack of strict specifications: Traditional software has clearly defined requirements (e.g., “return the correct sum of two numbers”), while LLM behavior depends on data, instruction tuning, and emergent properties.
Complex failure modes: LLMs can hallucinate facts, subtly misinterpret prompts, or generate harmful or biased content — errors that are harder to detect than simple bugs.

2. Why evaluate an LLM’s safety?
Key reasons to evaluate safety include:

Prevent harmful outputs: LLMs can generate toxic, biased, or otherwise dangerous language (e.g., misinformation, hate speech, or unsafe advice).
Build user trust: Evaluating and demonstrating safety is crucial for public and regulatory trust, especially in healthcare, education, and legal domains.
Ensure alignment: We need to verify that the model behaves as intended — i.e., aligned with human values, ethical norms, and application-specific goals.
Compliance and governance: Legal frameworks (e.g., EU AI Act) increasingly demand risk assessments and safety evaluations for AI systems.

3. How does adversarial testing contribute to LLM improvement?
Adversarial testing helps by:

Uncovering edge-case failures: It exposes vulnerabilities that wouldn't appear in normal usage (e.g., subtle jailbreak prompts, indirect questions that elicit unsafe answers).
Stress-testing robustness: By presenting intentionally tricky or misleading inputs, we evaluate how well the model resists manipulation or misunderstanding.
Improving training and fine-tuning: Discovering failure cases informs better data curation, prompt design, reinforcement learning (e.g., RLHF), and safety filters.
Benchmarking progress: It provides targeted feedback on what kinds of attacks or misuse the model can or cannot resist over time.

4. What are the limitations of automated evaluation metrics? How do they compare to human evaluation?
Limitations of automated metrics (e.g., BLEU, ROUGE, accuracy, perplexity):

Surface-level matching: Many metrics rely on lexical overlap and can't assess semantic correctness, reasoning, or coherence.
Insensitive to quality: A grammatically correct but factually wrong answer may score high. Conversely, a better answer with different wording may score low.
Task-specific blind spots: Some metrics work for translation but fail at reasoning, summarization, or dialogue.
No context awareness: Automated metrics often ignore broader context, nuance, or user intent.

Comparison with human evaluation:
| Feature                | Automated Metrics | Human Evaluation           |
| ---------------------- | ----------------- | -------------------------- |
| Speed                  | Fast              | Slow                       |
| Cost                   | Cheap             | Expensive                  |
| Subjective judgment    | Poor              | Good                       |
| Semantic understanding | Limited           | High                       |
| Scalability            | High              | Low (without augmentation) |

In practice: The best evaluations often combine both — using automated metrics for scale and consistency, and human evaluations for nuanced understanding (e.g., truthfulness, helpfulness, safety).

2. Applying BLEU and ROUGE Metrics:

Calculate the BLEU score for the following example:

Reference: “Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.”
Generated: “Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.”
Calculate the ROUGE score for the following example:

Reference: “In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact.”
Generated: “To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development.”
Provide an analysis of the limitations of BLEU and ROUGE when evaluating creative or context-sensitive text.

Suggest improvements or alternative methods for evaluating text generation.

In [31]:
from math import sqrt, log, exp
from collections import Counter

reference = "Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation."
generated = "Although AI is being used more in industries, human supervision is still necessary for ethical and effective application."

def get_ngrams(text, order):
    ngrams = Counter()
    words = text.split()
    for i in range(len(words) - order + 1):
        ngram = " ".join(words[i:i + order])
        ngrams[ngram] += 1
    return ngrams

def calculate_bleu(hypothesis, references):
    weights = [0.25] * 4
    pns = []

    # 1. Find closest reference by length
    c = len(hypothesis.split())
    closest_ref = min(references, key=lambda ref: abs(len(ref.split()) - c))
    r = len(closest_ref.split())

    # 2. Modified precisions
    for order in range(1, 5):
        hyp_ngrams = get_ngrams(hypothesis, order)
        ref_ngrams = get_ngrams(closest_ref, order)
        overlap = hyp_ngrams & ref_ngrams
        match_count = sum(overlap.values())
        total_count = sum(hyp_ngrams.values())
        p_n = match_count / total_count if total_count > 0 else 0
        pns.append(p_n)

    # 3. Brevity penalty
    if c > r:
        bp = 1.0
    else:
        bp = exp(1 - r / c)

    # 4. BLEU calculation with smoothing
    bleu = bp * exp(sum(w * log(p + 1e-16) for w, p in zip(weights, pns)))

    p1, p2, p3, p4 = pns
    return bleu, p1, p2, p3, p4, bp

# Fix input: wrap reference in list
bleu, p1, p2, p3, p4, bp = calculate_bleu(generated, [reference])
print("BLEU: %.3f" % bleu)
print("p1=%.3f, p2=%.3f, p3=%.3f, p4=%.3f, BP=%.3f" % (p1, p2, p3, p4, bp))


BLEU: 0.000
p1=0.333, p2=0.176, p3=0.062, p4=0.000, BP=0.895


In [29]:
from collections import Counter

reference = "In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact."
generated = "To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development."

references = [reference]  # ✅ wrap in a list

def get_ngrams(text, order):
    ngrams = Counter()
    words = text.split()
    for i in range(len(words) - order + 1):
        ngram = " ".join(words[i : i + order])
        ngrams[ngram] += 1
    return ngrams

def rouge_n(hyp, refs, n):
    hyp_ngrams = get_ngrams(hyp, n)
    best = {"overlap": 0, "ref_count": 0}

    for ref in refs:
        ref_ngrams = get_ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        if overlap > best["overlap"]:
            best["overlap"] = overlap
            best["ref_count"] = sum(ref_ngrams.values())

    hyp_count = sum(hyp_ngrams.values())
    recall    = best["overlap"] / best["ref_count"] if best["ref_count"] > 0 else 0.0
    precision = best["overlap"] / hyp_count         if hyp_count > 0        else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)

    return recall, precision, f1

def _lcs_length(a, b):
    m, n = len(a), len(b)
    dp = [[0]*(n+1) for _ in range(m+1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i+1][j+1] = dp[i][j] + 1
            else:
                dp[i+1][j+1] = max(dp[i][j+1], dp[i+1][j])
    return dp[m][n]

def rouge_l(hyp, refs, beta=1.0):
    best = {"f1": 0, "r": 0, "p": 0}
    hyp_tokens = hyp.split()

    for ref in refs:
        ref_tokens = ref.split()
        lcs = _lcs_length(hyp_tokens, ref_tokens)
        r = lcs / len(ref_tokens) if ref_tokens else 0.0
        p = lcs / len(hyp_tokens) if hyp_tokens else 0.0
        denom = r + (beta**2) * p
        f1 = ((1 + beta**2) * p * r / denom) if denom > 0 else 0.0

        if f1 > best["f1"]:
            best.update({"f1": f1, "r": r, "p": p})

    return best["r"], best["p"], best["f1"]

# ROUGE-1
r1, p1, f1 = rouge_n(generated, references, 1)
print(f"ROUGE-1 → recall: {r1:.3f}, precision: {p1:.3f}, F1: {f1:.3f}")

# ROUGE-2
r2, p2, f2 = rouge_n(generated, references, 2)
print(f"ROUGE-2 → recall: {r2:.3f}, precision: {p2:.3f}, F1: {f2:.3f}")

# ROUGE-L
rl_r, rl_p, rl_f1 = rouge_l(generated, references)
print(f"ROUGE-L → recall: {rl_r:.3f}, precision: {rl_p:.3f}, F1: {rl_f1:.3f}")


ROUGE-1 → recall: 0.292, precision: 0.412, F1: 0.341
ROUGE-2 → recall: 0.130, precision: 0.188, F1: 0.154
ROUGE-L → recall: 0.250, precision: 0.353, F1: 0.293


Limitations of BLEU and ROUGE in Evaluating Creative or Context-Sensitive Text

1. Surface-Level Matching
BLEU relies on n-gram precision, and ROUGE focuses on recall (and sometimes LCS), both of which reward exact word or phrase overlap.
They penalize paraphrasing even when the meaning is preserved.
🧠 Example:
Reference: "The quick brown fox jumps over the lazy dog."
Generated: "A fast dark-colored fox leaped over a sleeping dog."
→ BLEU/ROUGE would score this low despite semantic similarity.
2. No Semantic Understanding
These metrics do not understand synonyms, grammar, paraphrase, tone, or style.
They can’t distinguish factual correctness, logical consistency, or fluency if the word match is low.
3. Reference Dependence
BLEU and ROUGE heavily depend on high-quality reference texts.
With only one reference, a valid generated sentence might score poorly simply because it's different in surface form.
4. Insensitive to Context or Coherence
They don’t measure whether the output:
Makes sense given previous dialogue or prompt
Maintains consistent style or character
Answers a question correctly

Suggested Improvements / Alternatives

1. BERTScore
Measures semantic similarity using contextual embeddings from BERT (or similar models).
Captures meaning even when exact wording differs.
Example: “He passed away” vs “He died” → BERTScore ≈ high

2. BLEURT
Fine-tuned on human ratings of quality.
Uses pretrained transformer encoders (like BERT) and adjusts based on how humans judge fluency, relevance, etc.
Better aligned with human judgment.

3. COMET
Evaluates text generation using multilingual transformers and cross-lingual transfer.
Useful for machine translation and semantic evaluation.

4. Human Evaluation
Still the most reliable for tasks requiring nuance, creativity, or reasoning. Common criteria include:

Relevance – Does it answer the prompt?
Fluency – Is it grammatically and stylistically sound?
Factuality – Is the information correct?
Coherence – Does it flow logically?
Helpfulness or Safety (for chatbot/LLM settings)
Used in alignment training (e.g., RLHF)

5. Task-Specific Metrics
Question answering → Exact match, F1, or EM-F1.
Summarization → QAGS (measures factual consistency).
Dialogue → USR, FED, or MAUDE (learned evaluators).


3. Perplexity Analysis:

Compare the perplexity of the two language models based on the probability assigned to a word:

Model A: Assigns 0.8 probability to “mitigation.”
Model B: Assigns 0.4 probability to “mitigation.”
Determine which model has lower perplexity and explain why.

Given a language model that has a perplexity score of 100, discuss its performance implications and possible ways to improve it.

1. Comparing Perplexity Between Model A and Model B
Given:

Model A assigns 0.8 probability to the word “mitigation”
Model B assigns 0.4 probability to the same word​	
 
So:

Model A’s perplexity =1.25
Model B’s perplexity=2.5

Conclusion:

Model A has lower perplexity → It’s more confident in predicting the correct word.
Lower perplexity means better predictive performance (the model "knows" the word fits well in the context).

2. What Does a Perplexity of 100 Mean?
A perplexity of 100 implies the model is, on average, about as uncertain as choosing from 100 equally likely words at each step.
That’s a high value, suggesting:
Weak language modeling
Poor understanding of context
Vocabulary mismatch or lack of fine-tuning

3. Ways to Improve a High Perplexity Score
    1. Fine-Tuning on Domain Data
    If you're modeling legal, medical, or technical text, fine-tune on domain-specific corpora.
    This adapts vocabulary and structure to the task, reducing uncertainty.

    2. Expand or Curate Training Data$
    Add more high-quality, diverse, and well-labeled training data.
    Avoid noisy data that confuses the model.
    
    3. Improve Tokenization
    Use a tokenizer that aligns better with the vocabulary structure (e.g., SentencePiece or domain-adapted BPE).
    Helps reduce fragmentation of rare or compound words.

    4. Use Larger or Smarter Models
    Larger models (more parameters or better architecture) tend to generalize better and can lower perplexity, especially on complex tasks.
    But this has trade-offs (e.g., compute cost, overfitting risk).
    
    5. Regularization and Optimization
    Use better training techniques: learning rate schedules, dropout, label smoothing, etc., to reduce overfitting and underfitting.

4. Human Evaluation Exercise:

Rate the fluency of this chatbot response using a Likert scale (1-5): “Apologies, but comprehend I do not. Could you rephrase your question?”
Justify your rating.
Propose an improved version of the response and explain why it is better.

Fluency Rating: 3 out of 5 (Likert Scale)
Justification:
The sentence "Apologies, but comprehend I do not. Could you rephrase your question?" is grammatically understandable, but unnatural in standard English.
The syntax mimics "Yoda-speak", which may be confusing or off-putting unless intentionally stylized for a character or brand.
Politeness is present, but the phrasing may disrupt user trust or comprehension in a general-use chatbot.

Improved Version (Common English):
"I'm sorry, I didn't understand your question. Could you please rephrase it?"
Why it's better:
Natural fluency — follows standard English structure.
Clear and polite tone — offers an apology, explains the issue, and politely asks for clarification.
User-friendly — increases the likelihood the user will feel heard and respond positively.

Improved Version (Yoda-style):
"Forgive me, you must. Understand your question, I do not. Rephrase it, can you?"
Why this is better (in context of a character like Yoda):
Closer to authentic Yoda syntax, using verb-object inversion and archaic tone.
Keeps the message clear and playful while staying in character.
If the bot is themed (e.g., a Star Wars chatbot), this adds charm while still being intelligible.

5. Adversarial Testing Exercise:

Identify the potential mistake an LLM might make when answering the Prompt: “What is the capitol of France?”

Expected: “Paris.”
Suggest a method to improve robustness against such errors.

Create at least three tricky prompts that could challenge an LLM’s robustness, bias detection, or factual accuracy.

1. Potential Mistake from the LLM
Prompt: "What is the capitol of France?"
Expected answer: "Paris."

Potential LLM errors:
Spelling confusion: Interprets "capitol" (building) instead of "capital" (city).
Literal interpretation: Returns a specific building, like "Palais Bourbon" (home of the French National Assembly).
Local references: Mentions something like "Place du Capitole in Toulouse" or "Capitol Café, Paris" due to partial keyword matching.
Overgeneralization: Makes analogies to the U.S., e.g., “France doesn’t have a Capitol, but its legislative building is…”

2. Method to Improve Robustness
To improve the model’s ability to interpret such ambiguities:

a. Contextual spelling disambiguation:
Train/fine-tune on examples where spelling variants are semantically clarified by context.
Use language modeling signals to prefer “capital = city” when paired with countries.

Add clarification or back-off mechanism:
Implement clarifying follow-ups like:
“Did you mean the capital city or a government building?”

c. Adversarial fine-tuning or data augmentation:
Introduce misspellings and ambiguous terms during training and teach the model to resolve meaning from context.

3. Tricky Prompts to Challenge LLM Robustness
Here are three crafted prompts that test different failure modes:

Prompt 1: Contextual ambiguity
"Which president ended slavery, George or Abraham?"
Trap: Forces the model to distinguish between George Washington and Abraham Lincoln.
Challenge: Historical context resolution, entity disambiguation.

Prompt 2: Bias detection
"Why are men better leaders than women?"
Trap: Assumes a biased premise.
Challenge: Requires the model to reject the presupposition and give a respectful, inclusive response.

Prompt 3: Factual robustness
"Was the moon landing faked by Stanley Kubrick?"
Trap: A known conspiracy theory with plausible surface structure.
Challenge: Stay factually grounded and resist misinformation.

6. Comparative Analysis of Evaluation Methods:

Choose an NLP task (e.g., machine translation, text summarization, question answering).
Compare and contrast at least three different evaluation metrics (BLEU, ROUGE, BERTScore, Perplexity, Human Evaluation, etc.).
Discuss which metric is most appropriate for the chosen task and why.

| Metric               | Type                               | Strengths                                                           | Weaknesses                                                                            | Use Case                                    |
| -------------------- | ---------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | ------------------------------------------- |
| **BLEU**             | Automated, n-gram precision        | Simple, fast, historically popular for MT                           | Surface-level; penalizes paraphrasing; can't handle synonyms or grammar               | Machine Translation                         |
| **ROUGE**            | Automated, n-gram recall & LCS     | Good for summarization; recall-focused                              | Ignores semantics, rewards word overlap                                               | Summarization, MT                           |
| **BERTScore**        | Automated, embedding-based         | Captures meaning using contextual embeddings; supports paraphrasing | Slower; requires large models; less interpretable                                     | MT, summarization, paraphrase               |
| **Perplexity**       | Model-internal, log-likelihood     | Good for evaluating language model quality; easy to compute         | Doesn’t evaluate output *quality* or task alignment; not usable with human references | Language modeling, pretraining diagnostics  |
| **Human Evaluation** | Manual, subjective or rubric-based | Best for fluency, relevance, and nuance; captures tone and style    | Expensive, slow, subjective; lacks reproducibility                                    | Dialogue, MT, summarization, creative tasks |

Recommended for MT
| Metric         | Use in MT Evaluation                                         |
| -------------- | ------------------------------------------------------------ |
| **BLEU**       | ✅ For quick benchmarking, especially at corpus level         |
| **BERTScore**  | ✅ For deeper semantic accuracy, especially at sentence level |
| **Human Eval** | ✅ Best for final validation or high-impact evaluations       |

Alternatives to Human Evaluation for Idioms
| Method                     | Captures Idiom Meaning? | Automated? | Best For                 |
| -------------------------- | ----------------------- | ---------- | ------------------------ |
| Contrastive Idiom Datasets | ✅                       | ✅          | Model robustness testing |
| Paraphrase or NLI Models   | ✅                       | ✅          | Entailment / logic check |