## Daily Challenge: W7_D3

### Daily Challenge: Evaluating Large Language Models

---

#### What You’ll Learn

- The importance of evaluating LLMs for performance, reliability, and safety.  
- The challenges involved in LLM evaluation.  
- An overview of different evaluation methods, including content overlap metrics, model-based metrics, human evaluation, and adversarial testing.  
- In-depth understanding of BLEU, ROUGE, and Perplexity metrics.  
- Critical thinking in choosing the right evaluation metric for different applications.  

---

#### What You Will Create

You will complete exercises to:  
- Apply BLEU and ROUGE scores to sample text  
- Analyze perplexity scores  
- Conduct adversarial testing  
- Propose improvements for LLM evaluation methodologies  

---

#### Tasks

---

#### **1. Understanding LLM Evaluation**

- Explain why evaluating LLMs is more complex than traditional software.  
- Identify key reasons for evaluating an LLM’s safety.  
- Describe how adversarial testing contributes to LLM improvement.  
- Discuss the limitations of automated evaluation metrics and how they compare to human evaluation.  

---

#### **2. Applying BLEU and ROUGE Metrics**

- Calculate the BLEU score for the following example:  

**Reference:**  
“Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.”  

**Generated:**  
“Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.”  

---

- Calculate the ROUGE score for the following example:  

**Reference:**  
“In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact.”  

**Generated:**  
“To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development.”  

---

- Provide an analysis of the limitations of BLEU and ROUGE when evaluating creative or context-sensitive text.  
- Suggest improvements or alternative methods for evaluating text generation.  

---

#### **3. Perplexity Analysis**

- Compare the perplexity of the two language models based on the probability assigned to a word:  

**Model A:** Assigns 0.8 probability to “mitigation.”  
**Model B:** Assigns 0.4 probability to “mitigation.”  

- Determine which model has lower perplexity and explain why.  

- Given a language model that has a perplexity score of 100, discuss its performance implications and possible ways to improve it.  

---

#### **4. Human Evaluation Exercise**

- Rate the fluency of this chatbot response using a Likert scale (1-5):  

**Response:**  
“Apologies, but comprehend I do not. Could you rephrase your question?”  

- Justify your rating.  
- Propose an improved version of the response and explain why it is better.  

---

#### **5. Adversarial Testing Exercise**

- Identify the potential mistake an LLM might make when answering the prompt:  

**Prompt:** “What is the capitol of France?”  
**Expected:** “Paris.”  

- Suggest a method to improve robustness against such errors.  
- Create at least three tricky prompts that could challenge an LLM’s robustness, bias detection, or factual accuracy.  

---

#### **6. Comparative Analysis of Evaluation Methods**

- Choose an NLP task (e.g., machine translation, text summarization, question answering).  
- Compare and contrast at least three different evaluation metrics (BLEU, ROUGE, BERTScore, Perplexity, Human Evaluation, etc.).  
- Discuss which metric is most appropriate for the chosen task and why.  

### 1. Understanding LLM Evaluation

#### 1. Why is evaluating LLMs more complex than traditional software?

Evaluating LLMs is more complex because their outputs are probabilistic rather than deterministic. Traditional software returns the same output for a given input, so it can be checked against an exact expected value. An LLM with sampling enabled can return a different response on each call, and many different responses can be equally acceptable. Performance also depends on context, prompt phrasing, and training data, which makes standard test suites hard to apply.

---

#### 2. Why is it important to evaluate an LLM’s safety?

Safety evaluation ensures the model does not produce harmful, biased, or misleading outputs. LLMs can unintentionally spread misinformation, reinforce stereotypes, or generate unsafe instructions. Evaluating safety reduces these risks and builds trust in real-world applications.

---

#### 3. How does adversarial testing contribute to LLM improvement?

Adversarial testing involves intentionally crafting challenging prompts to expose weaknesses in the model. By analyzing where the model fails, developers can improve training data, fine-tuning strategies, and safety mechanisms to make the model more robust.

---

#### 4. What are the limitations of automated evaluation metrics compared to human evaluation?

Automated metrics like BLEU or ROUGE measure surface similarity but do not capture meaning, creativity, or factual correctness. Human evaluation, while time-consuming, can assess fluency, relevance, and nuanced understanding better than automated scores.

### 2. Applying BLEU and ROUGE Metrics

We will first calculate the BLEU score for a sample reference and generated sentence.

In [None]:
# --- BLEU Score Calculation ---
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer  # used by the ROUGE cell further down

# Reference and candidate sentences
reference = "Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.".split()
candidate = "Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.".split()

# Create smoothing function (keeps the score non-zero when an n-gram order, here the 4-grams, has no matches)
smooth = SmoothingFunction().method1

# Calculate BLEU score with smoothing
bleu_score = sentence_bleu([reference], candidate, smoothing_function=smooth)

print("BLEU score:", bleu_score)

BLEU score: 0.06296221772093169


#### BLEU Score Interpretation

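The BLEU score is about 0.06, which is very low. This means that there is little overlap between the generated sentence and the reference sentence in terms of exact words or word sequences: only a handful of words and short phrases match, and no four-word sequence matches at all, which is why smoothing is needed to keep the score above zero.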

However, this low score does not necessarily mean the generated text is wrong — it might still convey the same meaning using synonyms or different phrasing. This highlights a limitation of BLEU: it focuses on exact n-gram matches and does not capture semantic similarity.
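To see where the 0.06 comes from, the score can be rebuilt by hand. The sketch below is not the NLTK implementation; it is a minimal pure-Python version that assumes uniform weights over 1- to 4-grams and assumes that the smoothing replaces a zero match count with 0.1 (which appears to be what `method1` does by default). The helper names `ngrams` and `clipped_matches` are illustrative.

```python
import math
from collections import Counter

reference = "Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.".split()
candidate = "Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.".split()


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_matches(ref, cand, n):
    """Count candidate n-grams found in the reference, clipped by reference counts."""
    ref_counts = Counter(ngrams(ref, n))
    cand_counts = Counter(ngrams(cand, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched, sum(cand_counts.values())


# (matched, total) pairs for 1-grams up to 4-grams
stats = [clipped_matches(reference, candidate, n) for n in range(1, 5)]

# Assumed smoothing: a zero match count is replaced by a small epsilon
epsilon = 0.1
log_precisions = [math.log((m if m > 0 else epsilon) / total) for m, total in stats]

# Brevity penalty: penalize candidates that are shorter than the reference
brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(candidate)))

# BLEU = brevity penalty * geometric mean of the four precisions
bleu = brevity_penalty * math.exp(sum(log_precisions) / 4)

for n, (m, total) in enumerate(stats, start=1):
    print(f"{n}-gram matches: {m}/{total}")
print("Brevity penalty:", round(brevity_penalty, 4))
print("BLEU:", round(bleu, 4))
```

The per-order counts show that the score is dominated by the missing longer n-grams: one paraphrased word is enough to break every 3- and 4-gram that spans it.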

To complement BLEU, we will now calculate ROUGE scores. ROUGE reports recall alongside precision, which makes it the usual choice for tasks like summarization, where the question is how much of the reference content the output covers.

#### ROUGE Score Calculation

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how much of the reference text’s content appears in the generated text, based on overlapping n-grams and subsequences.

We will calculate ROUGE-1 (unigrams) and ROUGE-L (longest common subsequence) for the same sentences used in the BLEU example.

In [5]:
# --- ROUGE Score Calculation ---

# Initialize ROUGE scorer for ROUGE-1 and ROUGE-L
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

# Calculate scores: score(target, prediction), i.e. reference first, generated text second
scores = scorer.score(
    "Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.",
    "Although AI is being used more in industries, human supervision is still necessary for ethical and effective application."
)

# Display the scores
print("ROUGE-1:", scores['rouge1'])
print("ROUGE-L:", scores['rougeL'])

ROUGE-1: Score(precision=0.3333333333333333, recall=0.3, fmeasure=0.3157894736842105)
ROUGE-L: Score(precision=0.3333333333333333, recall=0.3, fmeasure=0.3157894736842105)


#### ROUGE Score Interpretation

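The ROUGE-1 and ROUGE-L F-measures are both about 0.32. Recall is 0.30, meaning 6 of the 20 reference tokens appear in the generated text, and precision is 0.33, meaning 6 of the 18 generated tokens appear in the reference. ROUGE-L equals ROUGE-1 here because the six shared words occur in the same order in both sentences.

These scores are higher than BLEU (0.06), but not because ROUGE recognizes synonyms or paraphrases: like BLEU, it counts only exact token matches (after stemming). The difference is that ROUGE-1 looks at single words, whereas BLEU combines 1- to 4-gram precisions, and the candidate shares almost no longer sequences with the reference. ROUGE is recall-oriented: it checks how much of the reference content is present in the generated text.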
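The task statement also lists a separate climate-change pair for the ROUGE exercise. The following is a minimal pure-Python sketch of ROUGE-1 and ROUGE-L applied to both pairs. It is not the `rouge_score` implementation: it assumes a simple lowercase alphanumeric tokenizer and does no stemming, so its numbers can differ slightly from a scorer run with `use_stemmer=True`. The function names are illustrative.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (no stemming in this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_1(reference, generated):
    """Return (overlap, precision, recall) over clipped unigram counts."""
    ref_counts = Counter(tokenize(reference))
    gen_counts = Counter(tokenize(generated))
    overlap = sum((ref_counts & gen_counts).values())
    return overlap, overlap / sum(gen_counts.values()), overlap / sum(ref_counts.values())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


pairs = {
    "AI oversight": (
        "Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.",
        "Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.",
    ),
    "Climate change": (
        "In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact.",
        "To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development.",
    ),
}

for name, (ref, gen) in pairs.items():
    overlap, precision, recall = rouge_1(ref, gen)
    lcs = lcs_length(tokenize(ref), tokenize(gen))
    print(f"{name}: ROUGE-1 overlap={overlap}, P={precision:.3f}, R={recall:.3f}, LCS={lcs}")
```

In both pairs the generated sentence is a reasonable paraphrase, yet recall stays low, because words such as "supervision" and "renewable" earn no credit against "oversight" and "sustainable".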

#### Limitations of BLEU and ROUGE

BLEU and ROUGE focus on surface-level similarity, comparing exact words or n-grams between the generated text and the reference. This creates several limitations:

- **Lack of semantic understanding**: They do not capture meaning or synonyms, so paraphrases may score poorly even if correct.
- **Insensitive to creativity**: In tasks like story generation or open-ended responses, exact matches are less relevant.
- **Context blindness**: These metrics ignore long-range coherence, factual accuracy, or tone.
- **Single-reference bias**: If only one reference is used, alternative valid outputs are unfairly penalized.

---

#### Suggested Improvements or Alternatives

- **BERTScore**: Uses contextual embeddings from BERT to measure semantic similarity rather than exact matches.
- **Human evaluation**: Involving humans to rate fluency, coherence, and relevance captures qualitative aspects.
- **Task-specific metrics**: For summarization, metrics like QAEval (question-answer evaluation) or factual consistency checks can be better.
- **Combined evaluation**: Use automated metrics for speed and human checks for nuanced quality assessment.

#### Comparison of BLEU, ROUGE, and Alternatives

| Metric       | What it Measures                | Strengths                                  | Weaknesses                                     |
|--------------|---------------------------------|--------------------------------------------|------------------------------------------------|
| **BLEU**     | n-gram precision                | Simple, widely used for MT evaluation      | Ignores meaning, penalizes synonyms, context-blind |
| **ROUGE**    | n-gram recall & overlap         | Good for summarization, captures coverage  | Still surface-level, ignores semantics and coherence |
| **BERTScore**| Semantic similarity (embeddings)| Captures meaning and synonyms, context-aware| Requires large models, slower than BLEU/ROUGE |
| **Human Eval**| Fluency, coherence, relevance  | Captures nuanced quality, creativity       | Time-consuming, subjective, not scalable       |


### 3. Perplexity Analysis

Perplexity measures how well a language model predicts a given sequence of words: it is the exponential of the average negative log-probability the model assigns to each token.  
- Lower perplexity = the model is more confident and better at predicting the sequence.  
- Higher perplexity = the model is uncertain or poorly predicts the sequence.

In [7]:
# --- Perplexity Comparison ---

# Probabilities given by two models for the same word
p_model_A = 0.8
p_model_B = 0.4

# Single-token perplexity: 2^(-log2(p)), which simplifies to 1 / p
perplexity_A = 1 / p_model_A
perplexity_B = 1 / p_model_B

print("Perplexity (Model A):", perplexity_A)
print("Perplexity (Model B):", perplexity_B)

# Lower perplexity = better
if perplexity_A < perplexity_B:
    print("Model A has lower perplexity and is better at predicting this word.")
else:
    print("Model B has lower perplexity and is better at predicting this word.")

Perplexity (Model A): 1.25
Perplexity (Model B): 2.5
Model A has lower perplexity and is better at predicting this word.


#### Perplexity Interpretation

Model A has a lower perplexity (1.25) compared to Model B (2.5).  
This means Model A is more confident and better at predicting the target word.  

In general, when models are compared on the same evaluation text:
- Lower perplexity = better predictive performance
- Higher perplexity = more uncertainty in predictions
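The exercise uses a single word, but perplexity is normally reported over a whole sequence. As a minimal sketch, the function below computes it from a list of per-token probabilities. The name `perplexity` and the three-token sequence are illustrative assumptions; the probabilities are invented, not model outputs.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# Single-token cases from the exercise: the formula reduces to 1 / p
ppl_a = perplexity([0.8])
ppl_b = perplexity([0.4])
print("Model A:", round(ppl_a, 4))
print("Model B:", round(ppl_b, 4))

# Hypothetical three-token sequence (probabilities invented for illustration)
ppl_seq = perplexity([0.8, 0.5, 0.25])
print("Three-token sequence:", round(ppl_seq, 4))
```

Because the average is taken in log space, one poorly predicted token raises the perplexity of the whole sequence more than a simple average of the probabilities would suggest.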

#### Given a language model with perplexity score of 100, what does this imply?

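A perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 equally likely options at each step. How good or bad that is depends on the vocabulary, tokenization, and test set, but for a modern LLM on in-domain text it would typically be considered high.  
To improve this score, one could: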
- Train on more relevant data
- Fine-tune the model for the specific domain
- Use larger models with better contextual understanding
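The "100 equally likely options" reading can be checked numerically. The sketch below converts the perplexity into bits per token and confirms that a uniform distribution over 100 outcomes has perplexity 100. The variable names are illustrative.

```python
import math

perplexity_score = 100

# Cross-entropy in bits per token is log2 of the perplexity
bits_per_token = math.log2(perplexity_score)

# Geometric-mean probability assigned to each correct token
avg_token_prob = 1 / perplexity_score

# A uniform distribution over 100 outcomes: perplexity = exp(entropy)
uniform = [1 / 100] * 100
entropy_nats = -sum(p * math.log(p) for p in uniform)
uniform_perplexity = math.exp(entropy_nats)

print("Bits per token:", round(bits_per_token, 3))
print("Average probability of the correct token:", avg_token_prob)
print("Perplexity of a uniform 100-way choice:", round(uniform_perplexity, 3))
```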

### 4. Human Evaluation Exercise

**Prompt:**  
Rate the fluency of this chatbot response using a Likert scale (1–5):

**Response:**  
"Apologies, but comprehend I do not. Could you rephrase your question?"

#### Fluency Rating

**Rating:** 2/5

**Reasoning:**  
The response uses an inverted word order ("comprehend I do not") that is not standard English, which hurts clarity and fluency.  
It sounds unnatural for a chatbot and may confuse users.

---

#### Improved Version

"Sorry, I don’t quite understand. Could you please rephrase your question?"

**Why is it better?**  
This version is polite, clear, and uses natural phrasing, making it easier for users to understand.

### 5. Adversarial Testing Exercise

**Prompt:** 
"What is the capitol of France?"

**Expected Answer:**  
"Paris"

#### Potential Mistake

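The prompt misspells "capital" (a seat of government) as "capitol" (a legislative building). The model might take the word literally and answer about a building, or confuse it with another location, instead of recognizing the misspelling and answering "Paris."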

---

#### Method to Improve Robustness

- Use adversarial training: expose the model to misspellings, homophones (such as capitol/capital), and tricky prompts during training.
- Implement spelling/grammar normalization before processing user queries.
- Add validation layers to detect when the question is ambiguous and request clarification.
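As an illustration of the normalization idea, here is a toy pre-processing step that rewrites a known confusable phrase before the query reaches the model. Everything in it is an assumption made for the example: the rule list `CONFUSABLE_RULES` and the function `normalize_query` are hypothetical, and a real system would more likely rely on a spell-checker or a learned correction model than on hand-written rules.

```python
import re

# Hypothetical hand-written rules: (compiled pattern, replacement).
# "capitol of <place>" is almost always a misspelling of "capital of <place>".
CONFUSABLE_RULES = [
    (re.compile(r"\bcapitol of\b", re.IGNORECASE), "capital of"),
]


def normalize_query(query):
    """Apply each rewrite rule in turn and return the normalized query."""
    for pattern, replacement in CONFUSABLE_RULES:
        query = pattern.sub(replacement, query)
    return query


print(normalize_query("What is the capitol of France?"))
print(normalize_query("Where is the Capitol building?"))
```

The rule is deliberately narrow: it fires only on "capitol of", so a question that really is about a Capitol building passes through unchanged.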

---

#### Three Tricky Prompts for Robustness Testing

1. "What’s the capital of the country where Mount Fuji is located?"
2. "Which city is both the capital of England and home to Big Ben?"
3. "Name the capital of the country whose flag is red and white with a maple leaf."

*(Expected answers: Tokyo, London, Ottawa)*
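Prompts like these are easiest to reuse when they are kept as test cases with expected answers. The harness below is a minimal sketch: `run_adversarial_suite` is an illustrative name, the pass criterion (the expected string appears in the answer) is a simplifying assumption, and `stub_model` is a stand-in that returns canned, invented answers rather than real model output. In practice it would be replaced by a call to the model under test.

```python
def run_adversarial_suite(ask_model, cases):
    """Send each prompt to the model and check the expected answer appears in the reply."""
    results = []
    for prompt, expected in cases:
        answer = ask_model(prompt)
        passed = expected.lower() in answer.lower()
        results.append((prompt, expected, answer, passed))
    return results


CASES = [
    ("What is the capitol of France?", "Paris"),
    ("What's the capital of the country where Mount Fuji is located?", "Tokyo"),
    ("Which city is both the capital of England and home to Big Ben?", "London"),
    ("Name the capital of the country whose flag is red and white with a maple leaf.", "Ottawa"),
]


def stub_model(prompt):
    """Stand-in for a real model call; the canned answers are invented for the demo."""
    canned = {
        "capitol": "The United States Capitol is in Washington, D.C.",
        "Mount Fuji": "The capital is Tokyo.",
        "Big Ben": "That city is London.",
        "maple leaf": "The capital of Canada is Ottawa.",
    }
    for key, answer in canned.items():
        if key in prompt:
            return answer
    return "I am not sure."


results = run_adversarial_suite(stub_model, CASES)
for prompt, expected, answer, passed in results:
    print("PASS" if passed else "FAIL", "|", prompt, "->", answer)
```

Substring matching is a crude check that can both over- and under-count correct answers, so failures flagged this way are best reviewed by a person.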

### 6. Comparative Analysis of Evaluation Methods

#### Chosen NLP Task
**Text Summarization**

Summarization requires capturing the key meaning of the original text rather than exact wording. Therefore, metrics that focus only on word overlap may be insufficient.

#### Comparison of Metrics

| Metric       | Measures                    | Strengths                                  | Weaknesses                                      |
|--------------|-----------------------------|--------------------------------------------|------------------------------------------------|
| **BLEU**     | n-gram precision            | Good for translation; easy to compute      | Ignores meaning; bad for paraphrasing          |
| **ROUGE**    | n-gram recall & overlap     | Good for summarization; measures coverage  | Surface-level; does not capture semantic similarity |
| **BERTScore**| Semantic similarity (contextual embeddings) | Captures meaning, synonyms, context        | Slower; requires large models                   |
| **Human Eval**| Fluency, relevance, coherence| Best for nuanced judgment                  | Time-consuming, subjective, not scalable        |

#### Conclusion

For text summarization, ROUGE is widely used because it measures how much important content from the reference appears in the summary (recall).  
However, it fails to capture semantic meaning when synonyms or paraphrases are used.

BERTScore is more appropriate when we want to evaluate meaning rather than exact word matches, especially for creative or abstractive summaries.  
In practice, a combination of ROUGE (for coverage) and BERTScore or human evaluation (for quality and fluency) provides the most reliable assessment.