## Q: How do you evaluate generative text quality?

**A:**
Evaluating generative text quality is multi-dimensional: we care about **fluency, relevance, coherence, factuality, and style**. No single metric captures all of these, so we usually combine **automatic metrics** with **human evaluation**.

**1. Automatic Evaluation Metrics**

* **BLEU (Bilingual Evaluation Understudy)**

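  * Precision-oriented n-gram overlap between generated text and one or more reference texts, with a brevity penalty for outputs that are too short.
  * Good for translation, but limited in capturing meaning: a correct paraphrase with different wording scores poorly.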

* **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**

  * Measures n-gram overlap (ROUGE-N) or longest common subsequence (ROUGE-L) against a reference, with a focus on recall.
  * Popular in summarization tasks.

* **METEOR**

  * Considers synonyms, stemming, and word order.
  * More semantically aware than BLEU/ROUGE.

* **BERTScore**

  * Uses contextual embeddings (from BERT/Transformer models) to compute semantic similarity.
  * Better at capturing meaning than surface-level overlaps.

* **Other task-specific metrics**

  * Factuality checks (e.g., grounding score in RAG).
  * Toxicity/safety metrics.
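
To make the overlap idea behind BLEU and ROUGE concrete, here is a minimal sketch of clipped n-gram precision (the building block of BLEU) and n-gram recall (the core of ROUGE-N). It assumes whitespace tokenization and a single reference, and it leaves out BLEU's brevity penalty and its averaging over several n-gram orders; the function names are illustrative, not from any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of n-gram overlap between two strings.

    Precision is BLEU-style (matched / candidate n-grams);
    recall is ROUGE-N-style (matched / reference n-grams).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram count at the smaller of the two,
    # so repeating a word cannot inflate the score.
    matched = sum((cand & ref).values())
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(ngram_overlap(candidate, reference, n=1))  # unigram precision, recall
print(ngram_overlap(candidate, reference, n=2))  # bigram precision, recall
```

In this example one changed word costs one of six unigrams but two of five bigrams, which shows why higher-order n-grams are stricter about word order.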

**2. Human Evaluation**

* Still considered the **gold standard**.
* Judges fluency, coherence, factual accuracy, and helpfulness.
* Typically rated on Likert scales (e.g., 1–5) or via pairwise preference.
* Expensive and subjective, but captures nuance missed by automatic scores.
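
As a rough illustration of how such ratings are usually summarized, the sketch below averages Likert scores and computes a pairwise win rate. The data is made up, and counting a tie as half a win is one common convention, assumed here rather than taken from the text above.

```python
from statistics import mean

def mean_likert(ratings):
    """Average of Likert ratings (e.g., 1-5) from one or more judges."""
    return mean(ratings)

def win_rate(preferences, system="A"):
    """Share of pairwise comparisons won by `system`.

    Each entry is "A", "B", or "tie"; a tie counts as half a win (assumption).
    """
    wins = sum(
        1.0 if p == system else 0.5 if p == "tie" else 0.0
        for p in preferences
    )
    return wins / len(preferences)

# Hypothetical judgments for illustration only.
fluency_ratings = [4, 5, 3, 4]
pairwise = ["A", "B", "A", "tie"]

print(mean_likert(fluency_ratings))  # 4
print(win_rate(pairwise, "A"))       # 0.625
```

With more than one judge, it is also worth reporting an agreement statistic, since a mean hides how much raters disagree.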

**3. Hybrid Approaches**

* Combine **automatic metrics for scale** and **human eval for depth**.
* Example: Use ROUGE/BERTScore for quick validation, then human eval for final deployment readiness.

**Enterprise Impact:**
For enterprises, automated metrics ensure **scalability and consistency**, while human evaluation ensures **trustworthiness and brand alignment** (critical in finance, healthcare, customer service).

---

👉 Quick check for you: if you were evaluating a **summarization model for legal documents**, which metric (BLEU, ROUGE, BERTScore, human eval) would you lean on most, and why?


## Q: How do you evaluate embeddings and retrieval quality?

**A:**
We evaluate embeddings and retrieval quality using **IR-style ranking metrics**, which measure how well retrieved results align with ground-truth relevance.

**1. Precision\@k**

* Fraction of retrieved documents in the top-*k* that are relevant.
* Example: Precision\@5 = 0.8 → 4 out of 5 retrieved docs were relevant.
* **Best for:** Ensuring high accuracy in the *top answers*.

**2. Recall\@k**

* Fraction of all relevant documents that appear in the top-*k*.
* Example: Recall\@10 = 0.6 → model retrieved 60% of all possible relevant docs in top 10.
* **Best for:** Coverage (important in legal/medical search where missing facts is costly).

**3. MRR (Mean Reciprocal Rank)**

* Focuses on the position of the **first relevant document**: take the reciprocal of its rank for each query, then average over all queries.
* Example: If the first relevant doc is at rank 2, reciprocal rank = 1/2.
* **Best for:** Scenarios like QA/chatbots, where the first answer must be useful.

**4. NDCG (Normalized Discounted Cumulative Gain)**

* Weights relevance by position in ranking, giving more credit if highly relevant docs appear earlier.
* Normalized for fair comparison across queries.
* **Best for:** Multi-level relevance scoring (e.g., “highly relevant,” “partially relevant”).
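
The four metrics above can be sketched in a few lines each. This is a minimal version that assumes document IDs are hashable, `relevant` is a set of ground-truth IDs, and NDCG uses linear gain with a log2 discount (some implementations use 2^rel - 1 as the gain instead). All names are illustrative.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked, gains, k):
    """gains maps doc ID -> graded relevance; missing IDs count as 0."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: 5 retrieved docs, 3 relevant docs in the corpus.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(precision_at_k(ranked, relevant, 5))   # 0.4
print(recall_at_k(ranked, relevant, 5))      # 2 of 3 relevant docs found
print(reciprocal_rank(ranked, relevant))     # 0.5 (first hit at rank 2)
print(ndcg_at_k(ranked, {"d1": 3, "d2": 1, "d4": 2}, 5))
```

Note that Precision\@k and Recall\@k ignore order within the top-*k*, while MRR and NDCG reward placing relevant documents earlier.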

**Enterprise Impact:**

* **Precision\@k** ensures users don’t see junk results.
* **Recall\@k** ensures compliance and completeness (key in finance/legal).
* **MRR/NDCG** optimize user experience by surfacing the most useful docs at the top.



## Q: What is perplexity, and why is it important in LLM evaluation?

**A:**

* **Definition:**
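  Perplexity measures how well a language model predicts a sequence of tokens. Formally, it’s the **exponential of the cross-entropy loss**, averaged per token.

  $$
  \text{Perplexity} = e^{H(p,q)}, \qquad H(p,q) \approx -\frac{1}{N}\sum_{i=1}^{N} \ln q(w_i \mid w_{<i})
  $$

  where $H(p,q)$ is the cross-entropy between the true distribution $p$ and the model’s predicted distribution $q$, estimated in nats as the average negative log-likelihood of the $N$ tokens in a held-out text.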

* **Interpretation:**

  * Lower perplexity → model assigns **higher probability** to the correct sequence → better predictive ability.
  * A perplexity of *k* means: “On average, the model is as confused as if it had to choose among *k* equally likely options at each step.”

* **Why it matters for LLM evaluation:**

  1. **Model quality:** Indicates how well the LLM understands and predicts natural language.
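  2. **Comparisons:** Useful for comparing different models or checkpoints during training, provided they share the same tokenizer and evaluation text; otherwise per-token numbers are not directly comparable.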
  3. **Proxy for fluency:** Lower perplexity usually correlates with more fluent text.

* **Limitations:**

  * Doesn’t directly measure factuality, coherence, or task performance.
  * A model with low perplexity might still **hallucinate** or generate unsafe text.
  * Hence, perplexity is often used alongside task-specific metrics (BLEU, ROUGE, Recall\@k, human eval).
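
A minimal sketch of the computation, assuming you already have the natural-log probability the model assigned to each observed token (how you obtain these depends on your framework and is not shown):

```python
import math

def perplexity(token_logprobs):
    """Exponential of the average negative log-probability per token.

    token_logprobs: natural-log probabilities the model assigned to each
    token that actually occurred in the evaluation text.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that spreads probability evenly over 4 options at every step
# has a perplexity of about 4, matching the "k equally likely options" reading.
uniform_over_4 = [math.log(0.25)] * 10
print(perplexity(uniform_over_4))

# A more confident model (probability 0.9 on each correct token) scores lower.
confident = [math.log(0.9)] * 10
print(perplexity(confident))
```

If the log-probabilities are in base 2 rather than base e, the exponentiation must use base 2 as well; mixing bases is a common source of wrong perplexity numbers.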

**Enterprise Impact:**
Tracking perplexity on held-out data during training and fine-tuning helps catch **overfitting**: training perplexity keeps falling while validation perplexity starts to rise, a sign the model is no longer generalizing. However, in production, business stakeholders also care about **truthfulness, compliance, and user trust**, which require broader evaluation beyond perplexity.


## Q: How do you measure toxicity, bias, and fairness in generative outputs?

**A:**

We evaluate these using a mix of **automatic detectors, fairness metrics, and human review**.

---

**1. Toxicity Measurement**

* **Automated classifiers:** Use models like **Perspective API**, Detoxify, or fine-tuned BERT to flag offensive, harmful, or unsafe language.
* **Metrics:**

  * Toxicity score (0–1 probability).
  * % of generated outputs above a threshold.
* **Human evaluation:** Spot-check edge cases, since automated classifiers may miss cultural nuances.
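
Once a classifier has scored each output, the two metrics above reduce to simple aggregation. The sketch below assumes the scores are already available as probabilities in [0, 1]; it does not call any real classifier, and the 0.5 threshold and sample scores are placeholders.

```python
def toxicity_summary(scores, threshold=0.5):
    """Summarize per-output toxicity scores (probabilities in [0, 1])."""
    flagged = [s for s in scores if s > threshold]
    return {
        "mean_score": sum(scores) / len(scores),
        "flagged_rate": len(flagged) / len(scores),  # share above threshold
        "max_score": max(scores),                    # worst single output
    }

# Hypothetical scores for five generated outputs.
scores = [0.02, 0.10, 0.85, 0.40, 0.93]
summary = toxicity_summary(scores, threshold=0.5)
print(summary["flagged_rate"])  # 0.4 (2 of 5 outputs above the threshold)
print(summary["max_score"])     # 0.93
```

Reporting the maximum alongside the flagged rate matters because a single severe output can be more damaging than many borderline ones.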

---

**2. Bias Detection**

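* **Word Embedding Association Test (WEAT):** Checks whether words for demographic groups sit closer in embedding space to one set of attribute words (e.g., career terms) than to another (e.g., family terms).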
* **Bias benchmarks:**

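  * StereoSet (gender, race, profession, and religion stereotypes).
  * CrowS-Pairs (paired stereotypical vs. less stereotypical sentences across categories such as race, gender, religion, and age).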
* **Prompt-based testing:** Generate outputs with identity-specific prompts (e.g., “A doctor is…” vs “A nurse is…”) and compare.
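
For the WEAT item, here is a minimal sketch of the effect-size calculation on toy 2-D vectors. Real tests use embeddings for curated word lists; the vectors below are invented, and using the sample standard deviation in the denominator is an assumption (implementations differ on this detail).

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w, attrs_a, attrs_b):
    """How much closer w is to attribute set A than to attribute set B."""
    return (mean(cosine(w, a) for a in attrs_a)
            - mean(cosine(w, b) for b in attrs_b))

def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference in mean association of X and Y, in standard-deviation units."""
    sx = [association(x, attrs_a, attrs_b) for x in targets_x]
    sy = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)

# Toy vectors: target set X leans toward attribute A, target set Y toward B.
X = [(1.0, 0.1), (0.9, 0.2)]
Y = [(0.1, 1.0), (0.2, 0.9)]
A = [(1.0, 0.0)]
B = [(0.0, 1.0)]
print(weat_effect_size(X, Y, A, B))  # positive: X is associated with A
```

An effect size near zero suggests no measurable association; the sign shows which target set leans toward which attribute set.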

---

**3. Fairness Evaluation**

* **Group fairness metrics:**

  * **Demographic parity:** Are positive/neutral outputs equally distributed across groups?
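  * **Equalized odds:** Are true-positive and false-positive rates the same across groups (e.g., is a toxicity filter equally likely to wrongly flag text about each group)?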
* **Representation checks:** Ensure minority voices and perspectives are not underrepresented.
* **Human review:** Panels from diverse backgrounds to judge fairness and inclusivity.
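
Both group fairness metrics can be computed from labeled records. The sketch below assumes each record carries a `group`, a binary `pred` (for instance, whether a safety filter flagged the output), and a binary ground-truth `label` from human review; the field names and data are hypothetical.

```python
def rate(values):
    """Mean of a list of 0/1 values, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def demographic_parity_gap(records):
    """Largest difference in positive-prediction rate between any two groups."""
    by_group = {}
    for r in records:
        by_group.setdefault(r["group"], []).append(r["pred"])
    rates = [rate(preds) for preds in by_group.values()]
    return max(rates) - min(rates)

def equalized_odds_gaps(records):
    """Return (TPR gap, FPR gap) across groups; (0, 0) means equalized odds."""
    tpr, fpr = {}, {}
    for g in {r["group"] for r in records}:
        rows = [r for r in records if r["group"] == g]
        tpr[g] = rate([r["pred"] for r in rows if r["label"] == 1])
        fpr[g] = rate([r["pred"] for r in rows if r["label"] == 0])
    return (max(tpr.values()) - min(tpr.values()),
            max(fpr.values()) - min(fpr.values()))

# Hypothetical records: (group, label, pred).
raw = [("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 1),
       ("B", 1, 1), ("B", 1, 0), ("B", 0, 0), ("B", 0, 0)]
records = [{"group": g, "label": y, "pred": p} for g, y, p in raw]

print(demographic_parity_gap(records))  # 0.5 (75% vs 25% positive rate)
print(equalized_odds_gaps(records))     # (0.5, 0.5)
```

A gap of zero on both metrics is rarely achievable at once, so teams typically set a tolerated gap per metric and track it over time.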

---

**4. Enterprise Practices**

* Combine **automatic red-teaming** (large-scale synthetic testing) with **manual red-teaming** (security/ethics experts probing system).
* Maintain **auditable logs** of flagged toxic or biased generations.
* Integrate **guardrails** (policy filters, safety layers) before outputs reach end-users.

---

**Enterprise Impact:**
Managing toxicity, bias, and fairness ensures **regulatory compliance**, reduces **legal/brand risk**, and builds **user trust**, which is especially critical in sensitive industries like finance, healthcare, and HR.
