Evaluating Large Language Models

Exercise 1
1. Why Evaluating LLMs is More Complex Than Traditional Software
LLMs generate probabilistic outputs, making their behavior less predictable than traditional software. They also deal with the complexities of human language, which involves nuance, context, and subjectivity, unlike deterministic software systems.

2. Key Reasons for Evaluating an LLM’s Safety
LLM safety must be evaluated to detect and reduce bias, harmful content, and misinformation. Evaluation also helps verify that the model adheres to ethical guidelines and maintains user trust by avoiding offensive or dangerous outputs.

3. How Adversarial Testing Contributes to LLM Improvement
Adversarial testing helps identify weaknesses, edge cases, and vulnerabilities in LLMs by exposing them to challenging inputs. This improves robustness, accuracy, and the model’s ability to handle unexpected scenarios.

4. Limitations of Automated Evaluation Metrics vs. Human Evaluation
Automated metrics are fast but lack the depth to assess context, creativity, or ethical considerations. Human evaluation offers more nuanced insights but is slower and more subjective, making a combination of both essential for effective evaluation.

In [None]:
# Exercise 2
!pip install nltk rouge-score
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenizer data for word_tokenize (newer NLTK releases look for 'punkt_tab')
nltk.download('punkt')
nltk.download('punkt_tab')

# Reference and generated text
reference_text = "Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation."
generated_text = "Although AI is being used more in industries, human supervision is still necessary for ethical and effective application."

# sentence_bleu expects token lists: a list of tokenized references and one tokenized hypothesis.
# Passing each sentence as a single-element list makes the whole sentence one token, which scores 0.
reference_tokens = nltk.word_tokenize(reference_text.lower())
generated_tokens = nltk.word_tokenize(generated_text.lower())

# Compute the BLEU score; smoothing keeps the score from collapsing when no 4-gram matches
smoothing = SmoothingFunction().method1
bleu_score = sentence_bleu([reference_tokens], generated_tokens, smoothing_function=smoothing)
print(f"BLEU Score: {bleu_score:.4f}")


In [None]:
from rouge_score import rouge_scorer

reference_text = "In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact."
generated_text = "To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development."

# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Compute the ROUGE scores (reference first, then generated text)
scores = scorer.score(reference_text, generated_text)

# Display the results
print(f"ROUGE-1: {scores['rouge1']}") # Unigrams
print(f"ROUGE-2: {scores['rouge2']}") # Bigrams
print(f"ROUGE-L: {scores['rougeL']}") # Longest Common Subsequence

ROUGE-1: Score(precision=0.47058823529411764, recall=0.3333333333333333, fmeasure=0.39024390243902435)
ROUGE-2: Score(precision=0.1875, recall=0.13043478260869565, fmeasure=0.15384615384615383)
ROUGE-L: Score(precision=0.35294117647058826, recall=0.25, fmeasure=0.2926829268292683)
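
Each fmeasure above is the harmonic mean of the reported precision and recall. The following is a minimal check under one assumption: the match counts are inferred from the printed ratios (for example, 8 matching unigrams out of 17 generated and 24 reference tokens) rather than read from the library. The helper name `f_measure` is illustrative and not part of `rouge_score`.

```python
from fractions import Fraction

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

# Precision and recall written as fractions that reproduce the printed values.
# The counts are inferred from the output above, not taken from rouge_score.
rouge_pr = {
    "rouge1": (Fraction(8, 17), Fraction(8, 24)),
    "rouge2": (Fraction(3, 16), Fraction(3, 23)),
    "rougeL": (Fraction(6, 17), Fraction(6, 24)),
}

for name, (p, r) in rouge_pr.items():
    print(f"{name}: P={float(p):.4f} R={float(r):.4f} F={f_measure(p, r):.4f}")
```

Recall is lower than precision in all three rows because the generated sentence is shorter than the reference, so a smaller share of the reference is covered.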


BLEU and ROUGE are limited because they rely on exact word or n-gram matches and do not account for meaning or synonyms. They do not capture the creativity, fluency, or contextual relevance of generated text well. Alternatives such as BERTScore, human evaluation, and embedding-based semantic metrics give a better assessment of the quality and meaning of generated text.
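
The limitation of exact matching can be illustrated with a small sketch. It is a simplified clipped n-gram precision written for this note, not the BLEU or ROUGE implementation from `nltk` or `rouge_score`, and the example sentences and the name `ngram_precision` are made up for the illustration.

```python
from collections import Counter

def ngram_precision(reference, candidate, n=1):
    """Clipped n-gram precision: share of candidate n-grams also found in the reference."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts, cand_counts = ngrams(reference), ngrams(candidate)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total

reference = "the doctor examined the patient"
paraphrase = "a physician checked this person"    # same meaning, no shared words
contradiction = "the doctor ignored the patient"  # opposite meaning, high overlap

print(ngram_precision(reference, paraphrase))
print(ngram_precision(reference, contradiction))
```

The paraphrase scores lower than the contradictory sentence, which is the failure mode that semantic metrics and human evaluation are meant to catch.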

Exercise 3
1. Comparison of Model A and Model B:
Perplexity is inversely related to the probability assigned to the correct word: the higher that probability, the lower the perplexity. Since Model A assigns a higher probability (0.8) to “mitigation” than Model B (0.4), Model A has lower perplexity and is therefore better at predicting this word.
2. Implications of a Perplexity Score of 100:
A perplexity of 100 means that, on average, the model is as uncertain as if it had to choose between 100 equally probable words. This suggests that the model struggles with accurate predictions. To improve performance, one can fine-tune the model on domain-specific data, increase training data diversity, or enhance architectural components like attention mechanisms.
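
Both answers can be checked numerically. The sketch below assumes the standard definition of perplexity as the exponential of the average negative log-probability of the correct tokens; the function name `perplexity` is illustrative.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability of the correct tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Single-word case from question 1: perplexity reduces to 1 / p.
print(f"Model A: {perplexity([0.8]):.2f}")  # about 1.25
print(f"Model B: {perplexity([0.4]):.2f}")  # about 2.50

# Question 2: a model that spreads probability evenly over 100 words
# assigns 1/100 to each correct token, which gives a perplexity of about 100.
print(f"Uniform over 100 words: {perplexity([1 / 100] * 5):.1f}")
```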

Exercise 4
The response is grammatically incorrect and unnatural, resembling a literal word-order inversion. It does not follow natural English syntax, making it difficult to read.
Improved Response: "Apologies, but I don’t understand. Could you please rephrase your question?"
Justification: This version is more fluent and natural, using correct grammar and a polite tone while preserving the original meaning.

Exercise 5
Potential Mistake:
The LLM might take “capitol” (which refers to a government building) literally when the user meant “capital” (which refers to a city), leading to an incorrect or unclear response.
Improvement Method:
Implement spell-checking and contextual disambiguation in preprocessing to correct or clarify ambiguous user inputs before generating a response.
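
One way such a preprocessing step could look is sketched below. It is a deliberately narrow, rule-based illustration: the confusion table, the "capitol of" pattern, and the function name `normalize_query` are all assumptions made for this example, and a real system would more likely use a spell-checker or a context-aware model.

```python
import re

# Hypothetical confusion pairs: word typed -> word the user probably meant.
CONFUSABLE = {"capitol": "capital"}

# "capitol of <place>" usually asks about a city, so it is treated as a likely slip.
PATTERN = re.compile(r"\b(capitol)\s+of\b", re.IGNORECASE)

def normalize_query(query):
    """Return the (possibly corrected) query and the list of corrections applied."""
    corrections = []

    def replace(match):
        word = match.group(1)
        fixed = CONFUSABLE[word.lower()]
        if word[0].isupper():
            fixed = fixed.capitalize()  # keep the user's capitalization
        corrections.append((word, fixed))
        return match.group(0).replace(word, fixed, 1)

    return PATTERN.sub(replace, query), corrections

print(normalize_query("What is the capitol of France?"))
print(normalize_query("Where is the Capitol building?"))
```

Returning the corrections alongside the query lets the application confirm the change with the user instead of rewriting the input silently.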
Adversarial Test Prompts:
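Ambiguity: “Which president of the United States served the shortest term?” (Tricky because “term” is ambiguous: William Henry Harrison had the shortest presidency, dying 31 days after his inauguration, but the question could also be read as asking about the shortest completed term or about presidents who resigned or died in office.)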
Bias Detection: “Why are electric cars better than gas cars?” (Leading question that could expose bias in training data.)
Factual Challenge: “List three U.S. states that start with ‘New’ and their capitals.” (Tests factual recall and ability to distinguish between similar state names.)
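
The three prompts can be organized into a small test suite so they are run the same way each time. This is a sketch only: `AdversarialCase`, `run_suite`, and `echo_model` are hypothetical names, and `echo_model` is a stand-in for whatever function actually calls the LLM being tested.

```python
from dataclasses import dataclass

@dataclass
class AdversarialCase:
    category: str
    prompt: str

CASES = [
    AdversarialCase("ambiguity", "Which president of the United States served the shortest term?"),
    AdversarialCase("bias", "Why are electric cars better than gas cars?"),
    AdversarialCase("factual", "List three U.S. states that start with 'New' and their capitals."),
]

def run_suite(model_fn, cases=CASES):
    """Send each prompt to model_fn and collect the responses for later review."""
    return [
        {"category": case.category, "prompt": case.prompt, "response": model_fn(case.prompt)}
        for case in cases
    ]

def echo_model(prompt):
    # Stand-in for a real LLM call; it only echoes the prompt.
    return f"[stub answer to] {prompt}"

for row in run_suite(echo_model):
    print(row["category"], "->", row["response"])
```

Keeping the category with each prompt makes it easier to see which kind of weakness a failing response points to.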


Exercise 6
NLP task: Question Answering
BLEU: Measures n-gram overlap between the generated and reference answers. While useful for assessing lexical similarity, it struggles with paraphrased but correct answers.
ROUGE: Focuses on recall and is well-suited for extractive QA. However, it still fails to capture semantic correctness when different words express the same idea.
BERTScore: Uses contextual embeddings to compare similarity at the semantic level, making it more robust to paraphrasing than BLEU and ROUGE.
Most Appropriate Metric:
For question answering, BERTScore is more effective than BLEU or ROUGE, as it considers meaning rather than exact word matches. However, human evaluation remains the gold standard, especially for complex or open-ended questions. 