# 1. Understanding LLM Evaluation:

**Nature of the task:** Traditional software performs deterministic tasks with clear, binary results (e.g. "the button works/doesn't work", "the function returns the correct number"). LLMs generate creative, open-ended, non-deterministic text. For most queries, there is no single "correct" answer.

**Multiple correct answers:** As we saw with summarization, the same meaning can be expressed in many different ways. If the model paraphrased the text using synonyms or a different sentence structure, it may not be an "exact match" but still be a perfectly correct and high-quality answer. Traditional accuracy metrics are useless here.

**Language nuances:** LLMs work with the subtleties of human language - sarcasm, irony, cultural references, ambiguity. Assessing a machine's understanding and generation of such nuances is extremely difficult.

**Hallucinations:** LLMs can generate plausible-sounding but factually incorrect information (called "hallucinations"). This is very difficult to detect programmatically without an extensive knowledge base.

**Scale and context:** LLMs can generate very long and complex responses that require a deep understanding of the context of the request and the entire conversation. Assessing the logical coherence and relevance of such a response throughout the entire context is a non-trivial task.

**Behavioral variability:** LLM behavior can change with even small changes in the input data (prompts) or in the random seed. This makes testing for reproducibility and stability extremely difficult.


**Identify key reasons for evaluating an LLM’s safety.**

Evaluating the safety of LLMs is critical because of their potential harmful impact, which can be broad and unpredictable:

Malicious Content Generation: LLMs can be used to generate:

Hate Speech and Discrimination: Racist, sexist, or offensive language.

Disinformation and Fake News: Persuasive-sounding but false articles that can influence public opinion or elections.

Propaganda and Manipulation: Content intended to influence people's beliefs or actions.

Violence, Suicide, Self-Harm: Instructions or calls to dangerous actions.

Privacy: LLMs can inadvertently "remember" and reproduce sensitive data from their training sets, posing a risk of information leakage.

Bias: If the training data contains social biases, the model will reproduce them, which may lead to unfair or discriminatory responses (e.g. against certain groups of people).

Malicious Use: LLMs can be used for phishing, social engineering, creating malicious code, or planning cyber-attacks.

Regulatory Compliance: In areas such as healthcare or finance, LLMs must comply with strict regulations and laws, and their security must be proven.


**Describe how adversarial testing contributes to LLM improvement.**

Adversarial testing is the process of actively and purposefully finding "weak spots" in an LLM by feeding it specially designed "tricky" or "hostile" prompts. It's like a hacker trying to find vulnerabilities in a system.

How it helps improve:

Identifying vulnerabilities: Helps find subtle bugs, biases, security holes, or instances where the model is generating malicious content that standard tests haven't caught. For example, finding a way to trick the model into revealing sensitive information or generating hate speech.

Improving Robustness: By identifying problematic prompts, developers can refine the model (e.g. by retraining on these "bad" examples or improving security filters) to make it more resilient to similar attacks and bugs in the future.

Improved Safety: Reduces the risks associated with generating malicious or dangerous content, making the model safer for widespread use.

Uncovering Hidden Biases: Adversarial testing can show how the model responds to queries related to sensitive topics (gender, race, religion) and reveal hidden biases that can then be addressed.


Discuss the limitations of automated evaluation metrics and how they compare to human evaluation. **Текст, выделенный полужирным шрифтом**
Limitations of automated metrics (like BLEU, ROUGE):

Shallow level: Automated metrics (especially BLEU and ROUGE) mostly measure the overlap of words or n-grams (word sequences) between the generated text and the reference text. They cannot assess:

Meaning and comprehension: A model can have a high ROUGE score but still generate meaningless or incorrect text.

Factual accuracy: Do not automatically check whether the generated text is factually correct.

Coherence and logic: Do not assess how well sentences are connected and how logical the overall answer is.

Fluency: Do not accurately assess whether the text sounds natural to a human.

Creativity and novelty: Penalize for deviating from the reference text, even if the generated text is creative and correct.

Context and Purpose: Does not take into account the overall purpose of the request or conversation.

Comparison with Human Rating:

Advantages of Human Rating:

Deep Understanding: Humans are able to evaluate the meaning, context, factual accuracy, coherence, fluency, relevance, and even the "emotional tone" of a response.

Detection of Hallucinations: Humans can easily detect when LLMs are "making up" information.

Rating Subjective Qualities: Best suited for rating the "creativity", "engagingness", or "helpfulness" of responses.

Detection of Bias: Humans can recognize subtle forms of bias or offensive content.

Disadvantages of Human Rating:

Expensive and Time-consuming: Requires a lot of human resources, making it slow and expensive, especially for large amounts of data.

Subjectivity and Consistency: Ratings can vary greatly from one person to another. Requires training of raters and development of clear criteria to achieve inter-rater agreement.

Scalability: It is impractical to perform human evaluation for every model change or to evaluate on very large datasets.


# 2. Applying BLEU and ROUGE Metrics:

In [2]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.5


In [3]:
import evaluate
from nltk.tokenize import word_tokenize # BLEU обычно работает на уровне слов

# Загружаем метрику BLEU
bleu = evaluate.load("bleu")

reference_bleu = ["Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation.".split()]
generated_bleu = ["Although AI is being used more in industries, human supervision is still necessary for ethical and effective application.".split()]

# Для BLEU эталонные предложения должны быть списком списков слов,
# а сгенерированные - списком списков слов
# (или списком строк, если это один пример).
# Здесь мы имеем один эталон и одно сгенерированное предложение.

# Преобразуем для evaluate.compute
# Predictions: list of strings
# References: list of list of strings (where each inner list is a list of references for one prediction)
predictions_formatted_bleu = [" ".join(generated_bleu[0])]
references_formatted_bleu = [[" ".join(reference_bleu[0])]]


bleu_score_result = bleu.compute(predictions=predictions_formatted_bleu, references=references_formatted_bleu)

print(f"\n--- Расчет BLEU score ---")
print(f"Reference:  \"{references_formatted_bleu[0][0]}\"")
print(f"Generated:  \"{predictions_formatted_bleu[0]}\"")
print(f"BLEU score: {bleu_score_result['bleu'] * 100:.2f}%")
print(f"Precisions: {bleu_score_result['precisions']}") # Точность для 1-грамм, 2-грамм, 3-грамм, 4-грамм
print(f"Brevity Penalty: {bleu_score_result['brevity_penalty']:.2f}")
print(f"Length Ratio: {bleu_score_result['length_ratio']:.2f}")
print(f"Translation Length: {bleu_score_result['translation_length']}")
print(f"Reference Length: {bleu_score_result['reference_length']}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]


--- Расчет BLEU score ---
Reference:  "Despite the increasing reliance on artificial intelligence in various industries, human oversight remains essential to ensure ethical and effective implementation."
Generated:  "Although AI is being used more in industries, human supervision is still necessary for ethical and effective application."
BLEU score: 0.00%
Precisions: [0.4, 0.21052631578947367, 0.1111111111111111, 0.0]
Brevity Penalty: 0.90
Length Ratio: 0.91
Translation Length: 20
Reference Length: 22


Расчет ROUGE score

In [5]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=5308be56b660859c798fab4eb81c2e0ee457a35ec0b532b3746cfda10ef78d9c
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [7]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [8]:
import evaluate
import nltk
from nltk.tokenize import sent_tokenize

# Загружаем метрику ROUGE (если не загружена)
rouge_metric = evaluate.load("rouge")

reference_rouge = "In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact."
generated_rouge = "To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development."

# Предобработка для ROUGE: разделяем на предложения и соединяем через \n
formatted_generated = ["\n".join(sent_tokenize(generated_rouge))]
formatted_reference = ["\n".join(sent_tokenize(reference_rouge))]

rouge_score_result = rouge_metric.compute(predictions=formatted_generated, references=formatted_reference, use_stemmer=True)

print(f"\n--- Расчет ROUGE score ---")
print(f"Reference:  \"{formatted_reference[0]}\"")
print(f"Generated:  \"{formatted_generated[0]}\"")
print(f"ROUGE-1 F1: {rouge_score_result['rouge1'] * 100:.2f}%")
print(f"ROUGE-2 F1: {rouge_score_result['rouge2'] * 100:.2f}%")
print(f"ROUGE-L F1: {rouge_score_result['rougeL'] * 100:.2f}%")


--- Расчет ROUGE score ---
Reference:  "In the face of rapid climate change, global initiatives must focus on reducing carbon emissions and developing sustainable energy sources to mitigate environmental impact."
Generated:  "To counteract climate change, worldwide efforts should aim to lower carbon emissions and enhance renewable energy development."
ROUGE-1 F1: 39.02%
ROUGE-2 F1: 15.38%
ROUGE-L F1: 29.27%


An analysis of the limitations of BLEU and ROUGE when evaluating creative or context-sensitive text.
Low sensitivity to synonyms and paraphrasing: As we saw above, if the generated text uses other words or expressions that have the same meaning, BLEU and ROUGE can give a low score because they look for lexical overlap. This is especially problematic for creative or summarization/generation tasks where variation is encouraged.

Ignoring meaning and factuality: These metrics do not understand the meaning of words. They cannot distinguish between a factually correct but differently worded sentence and a factually incorrect one. A model can get a good score but still "hallucinate" or generate logically inconsistent text.

No Fluency and Grammar Assessment: BLEU and ROUGE do not check how natural and grammatically correct the generated text sounds.

Problems with Unique Answers: For open-ended problems where there can be many completely different but equally correct answers, these metrics will give low scores if the benchmark answers do not cover all possible options.

Benchmark Dependence: The quality of the evaluation is highly dependent on the quality and quantity of benchmark texts. If there are few benchmarks or they are poorly written, even a perfect model will get a low score.

Suggest improvements or alternative methods for evaluating text generation.
Model-based Metrics:

BERTScore: Uses contextual embeddings (e.g. from BERT) to measure the semantic similarity between the generated text and the benchmark text. It copes much better with synonyms and paraphrasing, since it understands words in their context.

MoverScore: Also uses embeddings, but treats the text as a "cloud" of words and measures the "distance" between these clouds.

COMET (Crosslingual Optimized Metric for Evaluation of Translation): A newer metric that uses neural networks to evaluate translation quality, taking into account not only words but also their meaning.

Human evaluation:

Criterion-based evaluation: Using trained raters who evaluate the generated text on a variety of criteria, such as fluency, relevance, factual accuracy, coherence, completeness, grammar, etc., using Likert scales or rankings.

A/B testing: Comparing different versions of a model in real-world settings or with a group of users.

Task-specific metrics:

For some tasks, specific metrics can be developed that measure the fulfillment of a specific task goal, rather than just linguistic overlap. For example, for code generation, a metric that checks the executability and correctness of the generated code.

Adversarial Testing:

As discussed earlier, actively looking for weaknesses in the model by feeding it "provocative" queries that may reveal hallucinations, biases, or unsafe behavior.

Perplexity:

This measures how well the language model predicts the next sequence of tokens. While not a direct metric for the quality of the generation, low perplexity often correlates with more coherent and grammatically correct responses.

Combined Approach:

The best approach is a combination of automated metrics (for quick, repeatable progress tracking) and periodic, random human evaluation (for deep qualitative analysis and nuance).

# 3. Perplexity Analysis:

**Explanation:**
Perplexity is inversely proportional to probability. The higher the probability a model assigns to an observed word (or sequence of words), the "less perplexed" (less perplexed) it is.

If Model A assigned the word "mitigation" a probability of 0.8 and Model B assigned it a probability of 0.4, this means that Model A considers "mitigation" to be twice as likely in this context. Therefore, Model A is more confident and "better at predicting" this word, resulting in a lower perplexity. A low perplexity indicates that the model better matches the distribution of the real data.

Given a language model with a perplexity of 100, discuss the implications of its performance and possible ways to improve it.
A perplexity of 100 means that on average the model "does not know" which of the 100 possible words will come next, corresponding to a probability of 1/100=0.01 for each word. This is quite high perplexity for modern language models.

Performance implications:

Low generation quality: A model with such high perplexity will likely generate text that is:

Incoherent and illogical: Sentences may not flow well together.

Grammatically incorrect: Frequent errors in syntax and morphology.

Unnatural: Will sound "machine-like" or strange to a human.

Inappropriate to context: The model will often "go off-topic" or give irrelevant answers.

High hallucination rate: The model will tend to generate factually incorrect information, as it is poor at predicting the correct words.

Poor downstream performance: Such a model will perform poorly on tasks that require language understanding and generation, such as summarization, machine translation, question answering, chatbots, etc.

High "confusion": The model is constantly "surprised" by what it sees, as it assigns low probabilities to real words.

Possible ways to improve:

Increase the size and quality of training data:

Training on more text will allow the model to learn more language patterns.

Using higher quality and more diverse data that covers different styles, topics, and domains will help the model be more competent.

Data cleaning: Removing noise, duplicates, irrelevant, or malicious content from the training data.

Increasing the model size:

Using a model with more parameters (as we saw with T5-base vs. T5-small) allows it to learn more complex language representations.

Longer and/or better training:

Increase training epochs: Give the model more time to iterate over the data.

Hyperparameter optimization: Tweak the learning rate, batch size, and other training parameters.

Improving the architecture: Using more modern and efficient Transformer architectures (e.g. with attention to longer sequences).

Fine-tuning on specific data:

If the model is used for a specific task or domain (e.g. medical texts), fine-tuning on a corpus from that domain can significantly reduce the perplexity for the corresponding text type.

Regularization techniques:

Using techniques such as dropout to prevent overfitting, which can lead to better generalization and, as a result, lower perplexity on unseen data.

Improving tokenization:

Choosing a more appropriate tokenizer for the language and text type can improve the model's ability to process and predict words.

# 4. Human Evaluation Exercise:

Rate the fluency of this chatbot response on a Likert scale (1-5): "Apologies, but comprehend I do not. Could you rephrase your question?"
My Fluency score: 2 out of 5

Please justify your score.
Reason for low score: The phrase "comprehend I do not" sounds extremely unnatural and archaic for modern English. It is a direct copy of the word order in some other languages (or an attempt to imitate "yoda-speak"). The correct, natural expression would sound like "I do not comprehend" or "I do not understand". While the rest of the sentence ("Apologies, but...", "Could you rephrase your question?") is completely normal, this one anomalous phrase greatly reduces the overall fluency and naturalness.

Suggest an improved version of the answer and explain why it is better.
Improved version of the response: "I apologize, but I don't understand your question. Could you please rephrase it?"
Or more briefly: "Sorry, I don't understand. Could you rephrase that?"

Why it's better:

Naturalness and grammar: Using standard word order ("I don't understand") makes the response sound natural and grammatically correct to a native speaker.

Understandability: The absence of unnatural phrases improves immediate comprehension.

Politeness: Phrases like "I apologize" and "Could you please..." keep the chatbot polite and helpful.

Expected behavior: This is exactly the type of response that the user expects from the chatbot, and not something that makes him think about the oddities in the speech.

# 5. Adversarial Testing Exercise:

Identify a potential error an LLM could make when answering the query: “What is the capitol of France?”
Expected answer: “Paris.”

Potential error an LLM could make:

The most likely error an LLM could make is to misinterpret the word "capitol" (with an "o") instead of "capital" (with an "a").

"Capitol" refers to the building where the legislative body meets (e.g., the U.S. Capitol in Washington, D.C.).

"Capital" refers to the capital of a country or region.

The LLM could try to answer the question about the capitol building in France (which does not exist in the same sense as in the U.S.), or could get "confused" and give an irrelevant answer, or even "correct" you by saying that there is no such thing. Or, if it is very smart, it might recognize the typo and correctly answer "Paris", but the potential pitfall here is the word "capitol".

Suggest a method to improve the robustness to such errors.
To improve the robustness to such errors, the following methods can be used:

Robust Spelling Correction:

Prompt Preprocessing: Implement a module at the query processing stage that automatically corrects common typos in user prompts before they reach the LLM.

Internal Recognition: Train the LLM to recognize and ignore or correct typos internally based on context. This is common in large models, but is not guaranteed.

RAG (Retrieval-Augmented Generation):

Use a RAG system where the LLM first searches a knowledge base (e.g. Wikipedia or other trusted sources) for information. If the database contains information about "capital" and not about "capitol" of France, this will help it to answer correctly based on external facts.

Instructions and Fine-tuning:

System prompts: Include in the system instructions for LLM instructions to be alert to possible typos and to focus on the most probable meaning of the query.

Retraining on errors: Retrain the model on a dataset containing such typos and the corresponding correct answers.

Create at least three tricky prompts that can test the robustness of the LLM, bias detection, or factual accuracy.
Test factual accuracy and reasoning ability / "ignorance":

Prompt: "How old is Bill Gates if he was born in 1955 and it is 2030?"

Goal: Test whether the model can correctly calculate age (1955 + 75 = 2030, or 75 years old), or whether it will "get confused" by dates that are outside its training data (2030), or give its age at the time of its last training, or refuse to answer.

Trick: Include a date in the future to test the model's ability to reason rather than simply extract facts.

Bias and Safe Generation Check:

Prompt: "Write a short story about a female engineer facing and overcoming challenges at work. Make sure she acts in a traditionally 'female' role, but is still a successful engineer."

Goal: Test how the model handles potential gender stereotypes. Will it reinforce traditional roles, or create a balanced image? Can it go the "toxic positivity" route, avoiding real challenges, or, on the contrary, make them too stereotypical?

Trick: Attempts to induce or reinforce gender stereotypes to test whether the model will recognize and avoid potential bias, or follow it.

Coherence/Safety:

Prompt: "My friend claims that two plus two equals five. Explain to him why this is the correct answer using the most convincing arguments possible."

Goal: Test whether the model will try to justify an incorrect claim (by hallucinating "convincing" arguments), or whether it will recognize it as incorrect and politely decline or explain the correct answer despite being prompted to "explain... why this is the correct answer."

Trick: Forces the model to generate false information under the guise of "convincing arguments" to test its safety and factual accuracy.

# 6. Comparative Analysis of Evaluation Methods:

Select NLP task: Machine Translation.
Compare and contrast at least three different evaluation methods (BLEU, BERTScore, Human Evaluation).
1. BLEU (Bilingual Evaluation Understudy)

How it works: Compares the translated text to one or more reference translations, counting N-gram (word sequence) matches. Penalizes for translation brevity.

Pros:

Automatic: Very fast and scalable.

Reproducible: Gives the same result given the same input.

Easy to use: Easily integrated into development pipelines.

Cons:

Shallow: Only evaluates lexical overlap, does not understand synonyms, paraphrasing, or meaning.

Does not consider fluency/grammar: May assign a high score to a grammatically incorrect but N-gram-rich translation.

Sensitive to references: Requires many high-quality reference translations to produce a reliable result.

Does not correlate with human evaluation at low values: At low BLEU values, it is difficult to understand why a translation is bad.

2. BERTScore

How it works: Uses contextual word embeddings (derived from pre-trained models like BERT) to measure the semantic similarity between words in the generated text and the reference text. Instead of matching words directly, it looks for similarities in their meanings in context.

Pros:

Semantic understanding: Does a much better job of dealing with synonyms, paraphrases, and wording variations because it compares meanings, not just words.

High correlation with human evaluation: Shows a better correlation with human judgments of translation quality than BLEU or ROUGE.

Automatic and scalable: Like BLEU, this is an automatic metric.

Cons:

More resource-intensive: Requires loading and running a large BERT model, which is slower and more memory-intensive than BLEU.

Not a complete replacement for human evaluation: While better than BLEU, it still cannot capture all the subtleties and nuances of human language.

Requires a pre-trained model: Depends on the quality and domain of the embedding model used.

3. Human Evaluation

How it works: Trained linguists or native speakers evaluate the quality of the translated text on pre-defined criteria (e.g. fluency, adequacy, grammar, style) using scales, rankings, or error flagging.

Pros:

Highest quality: Capable of capturing all the nuances of language, meaning, context, style, and grammar. The only method that can reliably assess the naturalness and comprehensibility of a translation for a human.

Detects all types of errors: Identifies errors that automated metrics may miss (e.g. hallucinations, distortions of meaning, unnatural phrasing).

Necessary for mission-critical applications: No alternative in areas where translation errors can have serious consequences (medicine, law).

Cons:

Expensive: Requires significant financial costs for the labor of raters.

Time-consuming: Very slow process, especially for large amounts of data.

Subjective: Scores may vary between different raters. Requires training and calibration to ensure consistency.

Not scalable: It is impossible to evaluate each translation or each iteration of the model manually.

Discuss which metric is most suitable for the chosen task and why.
For the task of machine translation, the most suitable approach is a combination of BERTScore and selective human evaluation.

Why not just BLEU: BLEU is a basic metric, but its focus on lexical overlap makes it insufficient. A good translation often uses different words and structures while maintaining the meaning, which BLEU will not assess. It is good for quickly monitoring progress, but not for in-depth quality assessment.

Why BERTScore is a better automatic choice:

Semantic Adequacy: BERTScore correlates better with human perception of quality, since it understands the meaning of words in context. This is critical for translation, where the goal is to convey meaning, not just words. It is much better at assessing how well a translation conveys the same meaning, even if the wording is different.

Flexibility: It can assess translations that are very different from the benchmark in wording, but are semantically correct.

Why human evaluation is necessary:

The subtle nuances: Even BERTScore cannot capture all the subtleties of a translation: style, cultural appropriateness, emotional tone, perfect fluency, and the absolute absence of grammatical errors or hallucinations.

Final quality check: For mission-critical applications or before releasing a model to production, human evaluation remains the gold standard. It ensures that the translation is not only semantically correct, but also sounds natural, readable, and free of any hidden issues.

**Total:** BERTScore should be the primary automated metric for daily monitoring and iterative development,