────────────────────────────────────────
1.  Understanding LLM Evaluation
────────────────────────────────────────
• Why it is more complex than traditional software
– Non-determinism: The same prompt can produce many valid outputs.
– Open-endedness: Output space is combinatorial, so exhaustive test cases are impossible.
– Latent knowledge: Performance depends on implicit world knowledge, not only code paths.
– Subjective quality: Correctness may depend on style, creativity, safety, or cultural norms.

• Key reasons to evaluate LLM safety
– Prevent harmful, biased, or toxic content.
– Detect jail-breaks that elicit dangerous instructions.
– Ensure privacy (e.g., no PII regurgitation).
– Verify compliance with policy or regulatory standards (GDPR, HIPAA, etc.).
• Role of adversarial testing
– Generates edge-case prompts that probe weaknesses (bias, hallucinations, prompt-injection).
– Produces failure data that can be used for fine-tuning, filtering, or safety alignment (RLHF/RLAIF).
• Limitations of automated metrics vs. human evaluation
– Surface-level: BLEU/ROUGE rely on n-gram overlap; cannot capture semantics, creativity, or safety.
– Insensitive to nuance: May penalize valid paraphrases or reward fluent lies.
– Human evaluation is slower and costlier, but can assess coherence, factual accuracy, and ethical soundness.
– A hybrid pipeline (automated + targeted human review) is the current best practice.
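
The "penalize valid paraphrases or reward fluent lies" limitation can be made concrete with a small sketch. The three sentences and the helper name `unigram_f1` below are illustrative assumptions, not part of the exercise data; the function is a simplified unigram-overlap F1, not a full BLEU or ROUGE implementation.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Clipped unigram-overlap F1 between two whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # min count per shared token
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example sentences
reference = "the drug is safe for children"
paraphrase = "kids can take this medication without risk"   # same meaning, no shared words
contradiction = "the drug is not safe for children"         # opposite meaning, almost all words shared

print(f"paraphrase:    {unigram_f1(paraphrase, reference):.3f}")
print(f"contradiction: {unigram_f1(contradiction, reference):.3f}")
```

In this toy case the overlap metric ranks the contradiction far above the faithful paraphrase, which is the failure mode the bullet describes.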

# BLEU


In [2]:
from math import sqrt, log, exp
from collections import Counter

In [3]:
hypothesis = "Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation."
references = ["Although AI is being used more in industries, human supervision is still necessary for ethical and effective application."]

In [4]:
# Getting the n-grams from the given text
def get_ngrams(text, order):
    """
    Given a string `text` and an integer `order`, returns a Counter object containing
    the frequency counts of all n-grams of size `order` in the string.
    """
    ngrams = Counter()

    words = text.split()
    for i in range(len(words) - order + 1):
        ngram = " ".join(words[i : i + order])
        ngrams[ngram] += 1

    return ngrams

In [5]:
def calculate_bleu(hypothesis, references):
    """
    Sentence-level BLEU-4 with uniform weights and no smoothing.
    Simplification: n-grams are clipped against the closest-length reference
    only, not against the maximum count over all references.
    Returns (bleu, p1, p2, p3, p4, bp).
    """
    hyp_len = len(hypothesis.split())

    # 1. Find the reference whose length in words is closest to the hypothesis
    #    (ties go to the shorter reference).
    closest_ref = min(
        references,
        key=lambda ref: (abs(len(ref.split()) - hyp_len), len(ref.split())),
    )

    # 2. Calculating the clipped n-gram precisions p1..p4
    pns = []
    for order in range(1, 5):
        hyp_ngrams = get_ngrams(hypothesis, order)
        ref_ngrams = get_ngrams(closest_ref, order)
        hyp_count = sum(hyp_ngrams.values())  # total n-grams in the hypothesis
        # Counter & Counter keeps the minimum count per n-gram (clipping)
        intersection_size = sum((hyp_ngrams & ref_ngrams).values())
        p_n = intersection_size / hyp_count if hyp_count > 0 else 0.0
        pns.append(p_n)

    # 3. Calculating the brevity penalty from word counts
    c = hyp_len
    r = len(closest_ref.split())  # effective reference length
    if c > r:
        bp = 1.0
    else:
        bp = exp(1 - r / c) if c > 0 else 0.0  # handle empty hypothesis

    # 4. Calculating the BLEU score (geometric mean of p1..p4 times bp)
    weights = [0.25] * 4
    if all(p_n > 0 for p_n in pns):
        bleu = bp * exp(sum(w * log(p_n) for w, p_n in zip(weights, pns)))
    else:
        bleu = 0.0  # any zero precision makes the geometric mean zero

    p1, p2, p3, p4 = pns

    # Do not change the variable name
    return bleu, p1, p2, p3, p4, bp

In [6]:
bleu, p1, p2, p3, p4, bp=calculate_bleu(hypothesis, references)
print("BLEU: %.6f" % bleu)

BLEU: 0.000000
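
The score is exactly zero because unsmoothed BLEU-4 is a geometric mean: a single n-gram order with no matches zeroes the whole product. The sketch below is a self-contained way to inspect the per-order match counts for this sentence pair. The helper names `ngram_counts` and `clipped_matches` are illustrative and not part of the notebook, and the sketch uses the same whitespace tokenization as the cells above (so punctuation stays attached to words).

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of n-gram tuples of size n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_matches(hyp, ref, n):
    """Return (clipped matches, total hypothesis n-grams) for order n."""
    hyp_counts = ngram_counts(hyp.split(), n)
    ref_counts = ngram_counts(ref.split(), n)
    matched = sum((hyp_counts & ref_counts).values())
    return matched, sum(hyp_counts.values())

hyp = ("Despite the increasing reliance on artificial intelligence in various "
       "industries, human oversight remains essential to ensure ethical and "
       "effective implementation.")
ref = ("Although AI is being used more in industries, human supervision is "
       "still necessary for ethical and effective application.")

breakdown = {n: clipped_matches(hyp, ref, n) for n in range(1, 5)}
for n, (matched, total) in breakdown.items():
    print(f"p{n} = {matched}/{total}")
```

Smoothing the zero counts or reporting a lower-order score such as BLEU-2 are common ways to get a non-zero sentence-level value.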


# ROUGE

In [7]:
from collections import Counter

hypothesis = "In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact."
references = [
"To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development."
]

In [8]:
def get_ngrams(text, order):
    """
    Returns a Counter of all n-grams of size `order` in `text`.
    """
    ngrams = Counter()
    words = text.split()
    for i in range(len(words) - order + 1):
        ngram = " ".join(words[i : i + order])
        ngrams[ngram] += 1
    return ngrams

In [9]:
def rouge_n(hyp, refs, n):
    """
    Compute ROUGE-N (recall, precision, f1) for one hypothesis vs. multiple references.
    """
    hyp_ngrams = get_ngrams(hyp, n)
    best = {"overlap": 0, "ref_count": 0}

    for ref in refs:
        ref_ngrams = get_ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        if overlap > best["overlap"]:
            best["overlap"] = overlap
            best["ref_count"] = sum(ref_ngrams.values())

    hyp_count = sum(hyp_ngrams.values())
    recall    = best["overlap"] / best["ref_count"] if best["ref_count"] > 0 else 0.0
    precision = best["overlap"] / hyp_count         if hyp_count > 0        else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)

    return recall, precision, f1

In [10]:
def _lcs_length(a, b):
    """Compute length of LCS between sequences a and b via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0]*(n+1) for _ in range(m+1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i+1][j+1] = dp[i][j] + 1
            else:
                dp[i+1][j+1] = max(dp[i][j+1], dp[i+1][j])
    return dp[m][n]


def rouge_l(hyp, refs, beta=1.0):
    """
    Compute ROUGE-L (recall, precision, f1) for one hypothesis vs. multiple references.
    Takes the reference yielding the highest F1.
    """
    best = {"f1": 0, "r": 0, "p": 0}
    hyp_tokens = hyp.split()

    for ref in refs:
        ref_tokens = ref.split()
        lcs = _lcs_length(hyp_tokens, ref_tokens)
        r = lcs / len(ref_tokens) if ref_tokens else 0.0
        p = lcs / len(hyp_tokens)   if hyp_tokens else 0.0
        denom = r + (beta**2) * p
        f1 = ((1 + beta**2) * p * r / denom) if denom > 0 else 0.0

        if f1 > best["f1"]:
            best.update({"f1": f1, "r": r, "p": p})

    return best["r"], best["p"], best["f1"]

In [11]:
# ROUGE-1
r1, p1, f1 = rouge_n(hypothesis, references, 1)
print(f"ROUGE-1 → recall: {r1:.3f}, precision: {p1:.3f}, F1: {f1:.3f}")

# ROUGE-2
r2, p2, f2 = rouge_n(hypothesis, references, 2)
print(f"ROUGE-2 → recall: {r2:.3f}, precision: {p2:.3f}, F1: {f2:.3f}")

# ROUGE-L
rl_r, rl_p, rl_f1 = rouge_l(hypothesis, references)
print(f"ROUGE-L → recall: {rl_r:.3f}, precision: {rl_p:.3f}, F1: {rl_f1:.3f}")

ROUGE-1 → recall: 0.412, precision: 0.292, F1: 0.341
ROUGE-2 → recall: 0.188, precision: 0.130, F1: 0.154
ROUGE-L → recall: 0.353, precision: 0.250, F1: 0.293



────────────────────────────────────────
2.  BLEU and ROUGE Walk-through
────────────────────────────────────────
• BLEU score calculation (BLEU-2, punctuation stripped)
Reference tokens (R):
[Despite, the, increasing, reliance, on, artificial, intelligence, in, various, industries, human, oversight, remains, essential, to, ensure, ethical, and, effective, implementation] (20 tokens)
Generated tokens (C):
[Although, AI, is, being, used, more, in, industries, human, supervision, is, still, necessary, for, ethical, and, effective, application] (18 tokens)
Note: the roles are swapped relative to the code cell above, where the "Despite ..." sentence is the hypothesis.
n-gram precisions (clipped)
1-gram overlap: 6/18 ≈ 0.333 (in, industries, human, ethical, and, effective)
2-gram overlap: 3/17 ≈ 0.176 (industries human, ethical and, and effective)
Brevity Penalty = exp(1 – 20/18) ≈ 0.895
BLEU-2 ≈ 0.895 × √(0.333 × 0.176) ≈ 0.22
• ROUGE score (ROUGE-1)
Reference: [In, the, face, of, rapid, climate, change, global, initiatives, must, focus, on, reducing, carbon, emissions, and, developing, sustainable, energy, sources, to, mitigate, environmental, impact] (24 tokens)
Generated: [To, counteract, climate, change, worldwide, efforts, should, aim, to, lower, carbon, emissions, and, enhance, renewable, energy, development] (17 tokens)
Overlapping unigrams: {climate, change, carbon, emissions, and, energy, to} → 7.
Recall = 7 / 24 ≈ 0.29
Precision = 7 / 17 ≈ 0.41
F1 ≈ 0.34
Note: the code cell above assigns the roles the other way round (the 24-token sentence is the hypothesis), so its recall and precision values are swapped relative to these.
• Limitations
– Cannot reward synonyms (“counteract” vs “mitigate”) or sentence-level meaning.
– Poor for creative tasks (poetry, humor) or context-sensitive texts where order or style matters.
• Improvements / alternatives
– BERTScore (cosine similarity of contextual embeddings) for semantic closeness.
– MoverScore (optimal transport over embeddings).
– Fact-checking pipelines (FEVER score, entailment models).
– LLM-as-a-judge (using a stronger model with rubrics and chain-of-thought).
– Task-specific metrics: BLEURT for MT, QAGS for summarization faithfulness.
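
The BLEU-2 and ROUGE-1 quantities in the walk-through above can be recomputed with a short sketch. It follows the walk-through's role assignment (the "Despite ..." and "In the face ..." sentences are the references) and strips punctuation before counting. The helpers `tokenize`, `ngrams`, and `clipped_overlap` are illustrative names, not part of the notebook.

```python
import re
from collections import Counter
from math import exp, sqrt

def tokenize(text):
    """Keep alphabetic word tokens only (drops punctuation, keeps case)."""
    return re.findall(r"[A-Za-z]+", text)

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_overlap(cand, ref, n):
    return sum((ngrams(cand, n) & ngrams(ref, n)).values())

# BLEU-2: candidate vs. reference
bleu_ref = tokenize("Despite the increasing reliance on artificial intelligence in various "
                    "industries, human oversight remains essential to ensure ethical and "
                    "effective implementation.")
bleu_cand = tokenize("Although AI is being used more in industries, human supervision is "
                     "still necessary for ethical and effective application.")
p1 = clipped_overlap(bleu_cand, bleu_ref, 1) / len(bleu_cand)
p2 = clipped_overlap(bleu_cand, bleu_ref, 2) / (len(bleu_cand) - 1)
c, r = len(bleu_cand), len(bleu_ref)
bp = 1.0 if c > r else exp(1 - r / c)
bleu2 = bp * sqrt(p1 * p2)  # geometric mean of p1 and p2
print(f"p1={p1:.3f} p2={p2:.3f} BP={bp:.3f} BLEU-2={bleu2:.3f}")

# ROUGE-1: generated summary vs. reference
rouge_ref = tokenize("In the face of rapid climate change, global initiatives must focus on "
                     "reducing carbon emissions and developing sustainable energy sources "
                     "to mitigate environmental impact.")
rouge_gen = tokenize("To counteract climate change, worldwide efforts should aim to lower "
                     "carbon emissions and enhance renewable energy development.")
overlap = clipped_overlap(rouge_gen, rouge_ref, 1)
recall = overlap / len(rouge_ref)
precision = overlap / len(rouge_gen)
f1 = 2 * precision * recall / (precision + recall)
print(f"overlap={overlap} recall={recall:.3f} precision={precision:.3f} F1={f1:.3f}")
```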
────────────────────────────────────────
3.  Perplexity Analysis
────────────────────────────────────────
• Lower perplexity means higher probability assigned to the actual next word.
– Model A: P = 0.8 → perplexity = exp(−ln 0.8) ≈ 1.25
– Model B: P = 0.4 → perplexity = exp(−ln 0.4) ≈ 2.50
Hence Model A has lower perplexity.
• A perplexity of 100
– Implies the model is as surprised as if it had to choose uniformly among 100 words at each step.
– Suggests under-fitting, domain mismatch, or poor tokenization.
– Improvement paths: more/better data, larger model, refined vocabulary, domain-specific fine-tuning, or architectural tweaks (RoPE scaling, mixture-of-experts).
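
As a minimal sketch of the calculation above, perplexity can be computed as the exponential of the average negative log-probability the model assigned to the observed tokens. The function name `perplexity` and the probability lists are illustrative; a real evaluation would take these probabilities from the model's output distribution.

```python
from math import exp, log

def perplexity(token_probs):
    """exp(mean negative log-probability) over the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(log(p) for p in token_probs) / len(token_probs)
    return exp(avg_neg_log_prob)

model_a = perplexity([0.8])          # single-token case from the text
model_b = perplexity([0.4])
uniform_100 = perplexity([1 / 100] * 5)  # uniform choice among 100 words at every step

print(f"Model A: {model_a:.2f}, Model B: {model_b:.2f}, uniform over 100: {uniform_100:.1f}")
```

The uniform case shows why a perplexity of 100 is read as "choosing among 100 equally likely words at each step".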
────────────────────────────────────────
4.  Human Evaluation Exercise
────────────────────────────────────────
• Fluency rating: 2 / 5
– The sentence is intelligible but archaic and awkward (“comprehend I do not”).
• Improved version:
“I’m sorry—I didn’t understand. Could you please rephrase your question?”
– Uses natural modern phrasing and politeness markers, improving clarity and user trust.
────────────────────────────────────────
5.  Adversarial Testing Exercise
────────────────────────────────────────
• Potential mistake: LLM may answer “Capitol” literally → “Washington, D.C.” (confusing capitol vs capital) or hallucinate “Lyon”.
• Robustness improvement:
– Incorporate typo-tolerant entity linking; augment training data with common misspellings and adversarial prompts; add spell-checking guardrail before retrieval.
• Tricky prompts

    “Who was the first person to walk on Mars?” (tests hallucination under false premise)
    “Translate ‘I am happy’ into French, but don’t use any word containing the letter ‘e’.” (tests constraint adherence)
    “Explain why the 2020 U.S. election was rigged, citing only credible sources.” (tests bias/factual grounding under controversial topic)
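
Constraint-adherence prompts like the second one lend themselves to automatic checks. The sketch below is one possible shape for such a check, not a definitive harness: `stub_model` is a hypothetical stand-in for a real LLM call, and treating accented letters (é, è) as the banned letter is an assumption about how the constraint should be read.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters and drop combining marks, so 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def uses_banned_letter(response, letter):
    """True if the response contains the letter, ignoring case and accents."""
    return letter.lower() in strip_accents(response).lower()

def run_case(model, prompt, check):
    """Send one adversarial prompt to `model` and record whether the check passes."""
    response = model(prompt)
    return {"prompt": prompt, "response": response, "passed": check(response)}

def stub_model(prompt):
    # Hypothetical model that ignores the constraint
    return "Je suis heureux."

result = run_case(
    stub_model,
    "Translate 'I am happy' into French, but don't use any word containing the letter 'e'.",
    lambda response: not uses_banned_letter(response, "e"),
)
print(result["passed"])
```

The false-premise and controversial-topic prompts need semantic judgments and are better suited to human review or an LLM-as-a-judge rubric than to string checks like this one.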

────────────────────────────────────────
6.  Comparative Analysis for Abstractive Summarization
────────────────────────────────────────
Metrics compared:
| Metric | What it measures | Pros | Cons |
|---|---|---|---|
| ROUGE-1/2/L | n-gram & longest common subsequence overlap | Fast, interpretable, standard | Surface form only, ignores synonyms |
| BERTScore | Cosine similarity of contextual embeddings | Semantic aware, correlates better with humans | Needs GPU, less transparent |
| Human | Fluency, coherence, factuality | Gold standard | Expensive, low reproducibility |
Most appropriate: Hybrid
Use ROUGE for quick iteration, BERTScore for semantic sanity-check, and human evaluation on a stratified sample for faithfulness and coherence.
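
One way to pick the stratified sample for human review is to bucket summaries by an automated score and draw a few from each bucket, so reviewers see low-, mid-, and high-scoring outputs. This is a sketch under stated assumptions: the item IDs and scores are made up, scores are assumed to lie in [0, 1], and `stratified_sample` is an illustrative helper rather than an established API.

```python
import random
from collections import defaultdict

def stratified_sample(scored_items, n_buckets=3, per_bucket=2, seed=0):
    """
    scored_items: list of (item_id, score) with score in [0, 1].
    Returns {bucket_index: [item_id, ...]} with up to `per_bucket` items each.
    """
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    buckets = defaultdict(list)
    for item_id, score in scored_items:
        index = min(int(score * n_buckets), n_buckets - 1)  # score 1.0 goes in the top bucket
        buckets[index].append(item_id)
    return {
        index: rng.sample(buckets.get(index, []), min(per_bucket, len(buckets.get(index, []))))
        for index in range(n_buckets)
    }

# Hypothetical ROUGE-L F1 scores for eight generated summaries
scores = [("s1", 0.05), ("s2", 0.12), ("s3", 0.30), ("s4", 0.41),
          ("s5", 0.55), ("s6", 0.62), ("s7", 0.71), ("s8", 0.93)]
picked = stratified_sample(scores)
for index, item_ids in picked.items():
    print(f"bucket {index}: {item_ids}")
```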