
# ✅ **Evaluation Metrics — Interview-Style Q&A**

---

### **1. What are evaluation metrics in NLP?**

Evaluation metrics are quantitative measures of how well a model’s output matches the expected result.
They help assess the accuracy, quality, and reliability of NLP or GenAI models.

---

### **2. What are common evaluation metrics for text generation?**

Some widely used metrics are:

* **BLEU** – precision-oriented n-gram overlap between generated text and reference text
* **ROUGE** – recall-oriented overlap (n-grams or longest common subsequence) with the reference text
* **METEOR** – word-level matching that also credits stems and synonyms
* **BERTScore** – uses contextual embeddings to measure semantic similarity

These are commonly used for summarization, translation, and content generation.

---

### **3. What is BLEU score?**

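BLEU measures n-gram precision: the share of n-grams in the generated text that also appear in the reference text, with a brevity penalty for outputs that are too short.
A higher BLEU means the generated text matches the reference wording more closely.
It is mostly used for translation tasks.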
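To make the n-gram matching concrete, here is a minimal sketch of the clipped (modified) n-gram precision that BLEU builds on. The helper names `ngrams` and `clipped_precision` are illustrative, not part of any library, and whitespace tokenization is assumed. Full BLEU additionally combines the precisions for n = 1 to 4 with a geometric mean and applies the brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference.
    Each n-gram is credited at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

reference = "the cat is sitting on the mat".split()
candidate = "the cat sits on the mat".split()

p1 = clipped_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(candidate, reference, 2)  # 3 of 5 bigrams match
print("unigram precision:", p1)
print("bigram precision:", p2)
```

The clipping is what stops a degenerate output such as "the the the the" from scoring perfectly against a reference that contains "the" only once.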

---

### **4. What is ROUGE score?**

ROUGE measures how much of the reference text is covered by the generated text, so it is recall-oriented.
It is widely used for summarization because it checks how much of the important content was captured.
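As a rough illustration of the overlap idea, the sketch below computes unigram-level ROUGE (ROUGE-1) by hand. It assumes lowercase whitespace tokenization, and `rouge_1` is an illustrative helper rather than a library function; real ROUGE implementations may add stemming and other preprocessing.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram overlap between candidate and reference strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}

scores = rouge_1("The cat sits on the mat", "The cat is sitting on the mat")
print(scores)
```

Here recall is 5/7 because five of the seven reference tokens appear in the candidate, while precision is 5/6.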

---

### **5. What is perplexity?**

Perplexity measures how well a language model predicts the next token.
It is the exponential of the average negative log-likelihood per token, so lower perplexity means the model assigns higher probability to the actual text.
It is a common metric during LLM training and on held-out data.
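The arithmetic behind perplexity can be shown without a model. The sketch below assumes we already have the probability the model assigned to each actual token; the probability lists are made-up values for illustration, and `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Made-up per-token probabilities for two hypothetical models
confident = perplexity([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity([0.1, 0.1, 0.1, 0.1])

print("confident model:", confident)
print("uncertain model:", uncertain)
```

A model that gives every token probability 0.5 has a perplexity of about 2, as if it were choosing between two equally likely options at each step; at probability 0.1 the perplexity is about 10.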

---

### **6. What is BERTScore?**

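BERTScore matches the tokens of the generated and reference texts using cosine similarity between contextual embeddings from a model like BERT.
It checks whether the generated text has the same meaning, not just the same words, and reports precision, recall, and F1.
This makes it useful for semantic evaluation.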

---

### **7. What metrics are used for classification tasks?**

Common metrics include:

* Accuracy
* Precision
* Recall
* F1-score
* AUC-ROC

These measure how well the model classifies or detects categories.
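The metrics listed above can be computed by hand for a binary classifier, which is often what an interviewer wants to see. This is a minimal pure-Python sketch; the labels and scores are made-up data, and the function names `classification_metrics` and `auc_roc` are illustrative. In practice a library such as scikit-learn would normally be used instead.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def auc_roc(y_true, y_score):
    """Probability that a random positive is scored above a random negative.
    Assumes at least one positive and one negative example."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels, hard predictions, and predicted scores
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]

metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, y_score)
print(metrics)
print("AUC-ROC:", auc)
```

Note that AUC-ROC is computed from the predicted scores, not the thresholded labels, which is why it can differ from what accuracy suggests.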

---

### **8. What metrics are used in RAG evaluation?**

For RAG specifically, the main metrics are:

* **Recall@K** – What fraction of the relevant documents appear in the top K results?
* **Precision@K** – What fraction of the top K retrieved documents are relevant?
* **Similarity Score** – Cosine similarity between query and document embeddings
* **Groundedness** – Does the answer stick to the retrieved context?

These measure how well retrieval supports generation.

---

### **9. What metrics help detect hallucinations?**

Some approaches include:

* Groundedness scoring
* Faithfulness metrics
* Fact-checking consistency
* Human evaluation
* Retrieval coverage (is relevant context present?)

These help ensure the model’s output is supported by the source context or by verified facts.

---

### **10. How do you evaluate embeddings?**

By checking:

* Retrieval accuracy (Recall@K)
* Relevance ranking
* Cosine similarity distribution
* Clustering performance

Better embeddings bring more relevant documents into the top results.
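One way to see this is to rank a handful of documents by cosine similarity to a query and check whether the expected document comes first. The sketch below uses tiny made-up 2-dimensional vectors in place of real embeddings; `cosine` and `rank_documents` are illustrative helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_documents(query_emb, doc_embs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_emb, emb)) for doc_id, emb in doc_embs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up embeddings; real ones would come from an embedding model
query_emb = [1.0, 0.0]
doc_embs = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.0, 1.0],
    "doc_c": [0.5, 0.5],
}

ranking = rank_documents(query_emb, doc_embs)
top_ids = [doc_id for doc_id, _ in ranking]
print(ranking)
```

With a labelled set of queries and relevant documents, the same ranking feeds directly into Recall@K or other relevance-ranking metrics.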

---

### **11. What metrics do you use for summarization evaluation?**

* ROUGE
* BLEU
* BERTScore
* Human evaluation for quality, relevance, and completeness

---

### **12. In enterprise GenAI, which evaluation method is most reliable?**

Human evaluation combined with groundedness checks is generally the most reliable approach.
Automated metrics are useful for scale and regression tracking, but they don’t fully capture correctness or safety.



# ✅ **1. BLEU Score (Text Generation / Translation)**

```python
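from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# sentence_bleu expects a list of tokenized references and one tokenized candidate
reference = ["The cat is sitting on the mat".split()]
candidate = "The cat sits on the mat".split()

# Smoothing avoids a near-zero score when a short sentence has no 4-gram matches
smoothing = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smoothing)
print("BLEU Score:", score)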
```

---

# ✅ **2. ROUGE Score (Summarization)**

```python
from rouge import Rouge

reference = "The cat is sitting on the mat"
candidate = "The cat sits on the mat"

rouge = Rouge()
scores = rouge.get_scores(candidate, reference)

print(scores)
```

This prints **ROUGE-1**, **ROUGE-2**, and **ROUGE-L** scores, each with recall, precision, and F1 values.

---

# ✅ **3. BERTScore (Semantic Similarity)**

```python
import bert_score

candidate = ["The cat sits on the mat"]
reference = ["A cat is sitting on a mat"]

P, R, F1 = bert_score.score(candidate, reference, lang="en")
print("BERTScore F1:", F1.mean().item())
```

---

# ✅ **4. Perplexity (Language Model Evaluation)**

Using a **GPT-2 model** for demonstration.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import math

model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

text = "Generative AI is transforming industries."
input_ids = tokenizer.encode(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy loss
    outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss

# Perplexity is the exponential of the mean per-token loss
perplexity = math.exp(loss.item())
print("Perplexity:", perplexity)
```

---

# ✅ **5. Cosine Similarity (Embedding Evaluation)**

Used in **RAG** and **semantic search**.

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Example embeddings
query_emb = np.array([0.1, 0.2, 0.3])
doc_emb = np.array([0.1, 0.25, 0.35])

score = cosine_similarity([query_emb], [doc_emb])
print("Cosine Similarity:", score[0][0])
```

---

# ✅ **6. Recall@K (RAG Retrieval Evaluation)**

```python
# Documents that are truly relevant to the query
relevant_docs = {"doc2", "doc5"}

# Ranked retrieval results; only the top K are scored
retrieved_docs = ["doc5", "doc7", "doc2"]
k = 3

# Count relevant documents that appear in the top K
hits = len(relevant_docs.intersection(retrieved_docs[:k]))
recall_at_k = hits / len(relevant_docs)

print("Recall@K:", recall_at_k)
```

---

# ✅ **7. Precision@K (RAG Retrieval Evaluation)**

```python
relevant_docs = {"doc2", "doc5"}
retrieved_docs = ["doc5", "doc7", "doc2"]
k = 3

# Share of the top K retrieved documents that are relevant
top_k = retrieved_docs[:k]
hits = len(relevant_docs.intersection(top_k))
precision_at_k = hits / len(top_k)

print("Precision@K:", precision_at_k)
```

---

# ✅ **8. Groundedness Check (Hallucination Detection)**

A naive substring check, for illustration only.

```python
generated_answer = "Payroll tax rate is 12% in 2024."
retrieved_context = "Payroll tax rate is 10% in 2024."

is_grounded = generated_answer in retrieved_context
print("Grounded?", is_grounded)
```

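(Real systems use stronger checks, such as NLI models or LLM-based judges. An exact substring match also rejects correct paraphrases, so treat this only as an interview demonstration.)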
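A slightly more tolerant heuristic, still only a sketch, is to measure how many answer tokens are supported by the context and to flag numbers that appear in the answer but not in the context. The helpers `support_ratio` and `unsupported_numbers` are hypothetical names, and the regex-based tokenization is an assumption made for this example.

```python
import re

def support_ratio(answer, context):
    """Fraction of answer tokens that also appear somewhere in the context."""
    def tokenize(text):
        return re.findall(r"[a-z0-9%]+", text.lower())
    answer_tokens = tokenize(answer)
    context_tokens = set(tokenize(context))
    if not answer_tokens:
        return 0.0
    supported = sum(1 for tok in answer_tokens if tok in context_tokens)
    return supported / len(answer_tokens)

def unsupported_numbers(answer, context):
    """Numbers (optionally with %) stated in the answer but absent from the context."""
    def numbers(text):
        return set(re.findall(r"\d+(?:\.\d+)?%?", text))
    return numbers(answer) - numbers(context)

generated_answer = "Payroll tax rate is 12% in 2024."
retrieved_context = "Payroll tax rate is 10% in 2024."

ratio = support_ratio(generated_answer, retrieved_context)
missing = unsupported_numbers(generated_answer, retrieved_context)

print("Support ratio:", ratio)
print("Unsupported numbers:", missing)
```

The example shows why token overlap alone is not enough: six of the seven answer tokens are supported, yet the one unsupported token is the incorrect tax rate, which the number check surfaces.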

---
