# Hands-on: LLM Application Evaluation Pipeline: Qwen Translation Model

This notebook outlines a robust methodology for evaluating the performance of a Large Language Model (LLM)-based application. We focus on a **translation service** powered by the **Qwen/Qwen2.5-1.5B-Instruct** model, assessing not only the final translation quality but also the entire pipeline, including the model, prompting, and the evaluation metrics themselves.

**What You'll Learn:**

* **Component-Level Evaluation:** How to systematically test the LLM's core function (`qwen_translate`) and its supporting parts (input, postprocessor).
* **Defining a Rubric:** Creating clear, human-centered evaluation guidelines using a 5-point rubric mapped to quantifiable metrics.
* **Automatic Metrics:** Employing **semantic similarity** (using **Sentence Transformers**) as a reliable, automatic proxy for human translation quality.
* **Pipeline Reliability:** Assessing the stability and reproducibility of the evaluation loop itself by checking the variance of the chosen metric across multiple runs.

## Step 1: Evaluate All Components in a System

This first step sets up the application pipeline. The `qwen_translate` function defined below is the subject of our evaluation; it combines three components:

1.  **Input Processor:** Handles the user's request, including the identification of source and target languages.
2.  **Translation Model (Qwen):** Executes the LLM call using a structured chat prompt to generate the raw output.
3.  **Postprocessor:** Implements logic (e.g., JSON extraction, cleanup) to transform the raw model response into the final, clean user output.

Our evaluation in this step focuses on two key dimensions of performance:
* **Task-level correctness:** Assessing the overall **translation quality** (semantic meaning preserved).
* **Intermediate outputs:** Measuring application **latency** and verifying **formatting correctness** (ensuring the JSON output constraint is met).


In [1]:
import time
import json
import pandas as pd
from dataclasses import dataclass, asdict
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import re

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
# NOTE: Ensure you have a GPU runtime enabled (T4 or higher) for this to run efficiently.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Logging dataclass
@dataclass
class TranslationLog:
    sample_id: str
    source_lang: str
    target_lang: str
    source_text: str
    translated_text: str
    latency: float

    def to_dict(self):
        d = asdict(self)
        d["latency"] = round(d["latency"], 3)
        return d

# Translation component returning JSON
def qwen_translate(text: str, source_lang: str, target_lang: str):
    messages = [
        {"role": "system", "content": "You are Qwen, a multilingual assistant specialized in translation."},
        {"role": "user",
         "content": f"Translate this text from {source_lang} to {target_lang}."
                    f" Return strictly a JSON object like {{\"translated_text\": \"...\"}}. \n\nText: \"{text}\""}
    ]

    start = time.time()
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

    # Generate output
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=150)

    trimmed_outputs = [output[len(input_ids):] for input_ids, output in zip(inputs.input_ids, outputs)]
    raw_text = tokenizer.batch_decode(trimmed_outputs, skip_special_tokens=True)[0].strip()
    latency = time.time() - start

    # Extract JSON substring from raw text (Postprocessor)
    json_match = re.search(r"\{.*\"translated_text\".*\}", raw_text, re.DOTALL)
    if json_match:
        json_str = json_match.group()
        try:
            translated_json = json.loads(json_str)
            translated_text = translated_json.get("translated_text", "")
        except json.JSONDecodeError:
            # Fallback if JSON is malformed
            translated_text = raw_text
    else:
        # Fallback if no JSON is found
        translated_text = raw_text

    return translated_text.strip(), latency


`torch_dtype` is deprecated! Use `dtype` instead!


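The postprocessor in `qwen_translate` falls back to the raw text silently, so a formatting failure never shows up in the results. Below is a minimal sketch of how format correctness could be tracked as its own signal. `parse_translation` and `format_pass_rate` are hypothetical helpers that are not part of the notebook, and the sample responses are made up for illustration.

```python
import json
import re

def parse_translation(raw_text: str):
    """Return (translated_text, json_ok) for one raw model response.

    Hypothetical helper: mirrors the notebook's regex + json.loads logic,
    but also reports whether the JSON constraint was actually met.
    """
    match = re.search(r"\{.*\"translated_text\".*\}", raw_text, re.DOTALL)
    if match:
        try:
            payload = json.loads(match.group())
            text = payload.get("translated_text")
            if isinstance(text, str):
                return text.strip(), True
        except json.JSONDecodeError:
            pass
    # Fallback: keep the raw text, but flag the formatting failure
    return raw_text.strip(), False

def format_pass_rate(raw_outputs):
    """Fraction of responses that contained a valid JSON object."""
    flags = [parse_translation(r)[1] for r in raw_outputs]
    return sum(flags) / len(flags) if flags else 0.0

# Made-up raw responses for illustration only
samples = [
    '{"translated_text": "Hello everyone."}',
    'Sure! {"translated_text": "Bonjour"} Hope this helps.',
    'Hello everyone.',
]
print(format_pass_rate(samples))
```

Reporting the pass rate next to latency and similarity would make it visible when a prompt change degrades JSON compliance even though the translations still look fine.
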

## Step 2: Create an Evaluation Guideline

Establishing clear evaluation guidelines is crucial for converting subjective, human-centered judgments into measurable, automated metrics. This step emphasizes defining a standard 5-point rubric along with explicit criteria for the translation task, directly connecting each quality level to tangible business outcomes.

This approach provides a demonstrative example of how evaluation can be structured, ensuring that automatic scoring aligns closely with the application’s impact on user experience and operational efficiency.

In [2]:
# STEP 2: Define Evaluation Guidelines and Scoring Rubric
import json
from dataclasses import dataclass
from typing import Dict, List

# Define the evaluation rubric
EVALUATION_RUBRIC = {
    5: "Excellent — Meaning fully preserved; fluent, natural, no errors.",
    4: "Good — Minor errors; mostly fluent and accurate.",
    3: "Fair — Understandable but awkward phrasing or minor meaning loss.",
    2: "Poor — Significant errors or unnatural phrasing; partial meaning loss.",
    1: "Fail — Incorrect or nonsensical translation; meaning lost."
}

# Simple rule-based rubric mapper
def rubric_score(similarity: float) -> int:
    """
    Map semantic similarity (0–1) to a 1–5 rubric score.
    Adjust thresholds as needed.
    """
    if similarity >= 0.9:
        return 5
    elif similarity >= 0.8:
        return 4
    elif similarity >= 0.7:
        return 3
    elif similarity >= 0.6:
        return 2
    else:
        return 1

# Store guideline as a dataclass
@dataclass
class EvaluationGuideline:
    task_name: str
    criteria: Dict[str, str]
    rubric: Dict[int, str]
    deployment_guidance: List[Dict[str, str]] = None  # New field

    def to_json(self):
        return json.dumps({
            "task_name": self.task_name,
            "criteria": self.criteria,
            "rubric": self.rubric,
            "deployment_guidance": self.deployment_guidance
        }, indent=2, ensure_ascii=False)

# Add expected impact guidance
DEPLOYMENT_GUIDANCE = [
    {"Translation Quality": "90–100% “Excellent”", "Expected Impact": "Deploy model to production."},
    {"Translation Quality": "80–89% “Good”", "Expected Impact": "Use with human post-editing."},
    {"Translation Quality": "< 80%", "Expected Impact": "Needs model tuning."}
]

translation_guideline = EvaluationGuideline(
    task_name="French -> English / English -> French Translation",
    criteria={
        "accuracy": "Meaning is preserved correctly",
        "fluency": "Text is natural and grammatically correct",
        "style": "Tone matches target language conventions"
    },
    rubric=EVALUATION_RUBRIC,
    deployment_guidance=DEPLOYMENT_GUIDANCE
)

print("Evaluation Guideline with Deployment Guidance:")
print(translation_guideline.to_json())


Evaluation Guideline with Deployment Guidance:
{
  "task_name": "French -> English / English -> French Translation",
  "criteria": {
    "accuracy": "Meaning is preserved correctly",
    "fluency": "Text is natural and grammatically correct",
    "style": "Tone matches target language conventions"
  },
  "rubric": {
    "5": "Excellent — Meaning fully preserved; fluent, natural, no errors.",
    "4": "Good — Minor errors; mostly fluent and accurate.",
    "3": "Fair — Understandable but awkward phrasing or minor meaning loss.",
    "2": "Poor — Significant errors or unnatural phrasing; partial meaning loss.",
    "1": "Fail — Incorrect or nonsensical translation; meaning lost."
  },
  "deployment_guidance": [
    {
      "Translation Quality": "90–100% “Excellent”",
      "Expected Impact": "Deploy model to production."
    },
    {
      "Translation Quality": "80–89% “Good”",
      "Expected Impact": "Use with human post-editing."
    },
    {
      "Translation Quality": "< 80%",
      "Expected Impact": "Needs model tuning."
    }
  ]
}

## Step 3: Define Evaluation Methods and Run Translation

For simplicity, this step evaluates translations with a single metric: semantic similarity, computed as the cosine similarity between sentence embeddings of the model’s output and the reference translation. We chose it because it rewards preserved meaning even when the wording differs from the reference.
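Semantic similarity of this kind is typically the cosine of the angle between two embedding vectors, which is what the `util.cos_sim` call in this step computes. The sketch below shows the calculation in plain Python. The 3-dimensional vectors are toy values, not real sentence embeddings, and `cosine_similarity` is an illustrative name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for sentence embeddings (illustrative values only)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0.
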

In [3]:
# STEP 3 — Define Evaluation Methods and Data
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import numpy as np
import json

# Semantic similarity model
EMBEDDER = SentenceTransformer("intfloat/e5-base-v2")

def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity between the embeddings of two sentences (higher means closer in meaning)."""
    v1 = EMBEDDER.encode(a, convert_to_tensor=True)
    v2 = EMBEDDER.encode(b, convert_to_tensor=True)
    return float(util.cos_sim(v1, v2)[0][0])

# Evaluate dataset
def evaluate_dataset(translations: list, references: list, sample_ids: list, source_langs: list, target_langs: list):
    """
    translations: list of translated texts (from LLM)
    references: list of reference translations
    sample_ids: list of sample identifiers
    source_langs: list of source languages
    target_langs: list of target languages
    """
    detailed_logs = []

    for sid, src_lang, tgt_lang, trans, ref in zip(sample_ids, source_langs, target_langs, translations, references):
        sim = semantic_similarity(trans, ref)
        score = rubric_score(sim)

        log = {
            "sample_id": sid,
            "source_lang": src_lang,
            "target_lang": tgt_lang,
            "translated_text": trans,
            "reference": ref,
            "similarity": round(sim, 3),
            "rubric_score": score
        }
        detailed_logs.append(log)

    df = pd.DataFrame(detailed_logs)
    summary = {
        "average_similarity": round(df["similarity"].mean(), 3),
        "average_rubric_score": round(df["rubric_score"].mean(), 3)
    }

    return df, summary

# Example dataset
dataset = [
    {"sample_id": "t1", "source": "Bonjour tout le monde.", "source_lang": "French", "target_lang": "English", "reference": "Hello everyone."},
    {"sample_id": "t2", "source": "Good morning everyone.", "source_lang": "English", "target_lang": "French", "reference": "Bonjour tout le monde."},
    {"sample_id": "t3", "source": "L'intelligence artificielle transforme le monde.", "source_lang": "French", "target_lang": "English", "reference": "Artificial intelligence is transforming the world."},
    {"sample_id": "t4", "source": "Artificial intelligence is transforming the world.", "source_lang": "English", "target_lang": "French", "reference": "L'intelligence artificielle transforme le monde."},
    {"sample_id": "t5", "source": "Merci beaucoup pour votre aide.", "source_lang": "French", "target_lang": "English", "reference": "Thank you very much for your help."},
    {"sample_id": "t6", "source": "Thank you very much for your help.", "source_lang": "English", "target_lang": "French", "reference": "Merci beaucoup pour votre aide."},
]


In [4]:
# Run translation component
translations = []
for d in dataset:
    # Run qwen_translate, which returns (translated_text, latency)
    translated_text, _ = qwen_translate(d["source"], d["source_lang"], d["target_lang"])
    translations.append(translated_text)

# Evaluate translations
sample_ids = [d["sample_id"] for d in dataset]
source_langs = [d["source_lang"] for d in dataset]
target_langs = [d["target_lang"] for d in dataset]
references = [d["reference"] for d in dataset]

df_eval, summary = evaluate_dataset(translations, references, sample_ids, source_langs, target_langs)

# Output results
print("Detailed Evaluation:")
print(df_eval)

print("\nSummary:")
print(json.dumps(summary, indent=2, ensure_ascii=False))

# Save evaluation results
df_eval.to_json("translation_evaluation_results.json", orient="records", force_ascii=False, indent=2)


Detailed Evaluation:
  sample_id source_lang target_lang  \
0        t1      French     English   
1        t2     English      French   
2        t3      French     English   
3        t4     English      French   
4        t5      French     English   
5        t6     English      French   

                                    translated_text  \
0                                   Hello everyone.   
1                                   Bonjour à tous!   
2     Artificial intelligence transforms the world.   
3  L'intelligence artificielle transforme le monde.   
4                Thank you very much for your help.   
5                   Merci beaucoup pour votre aide.   

                                           reference  similarity  rubric_score  
0                                    Hello everyone.       1.000             5  
1                             Bonjour tout le monde.       0.920             5  
2  Artificial intelligence is transforming the wo...       0.973            



* We use a **Sentence Transformer** model (`intfloat/e5-base-v2`) to compute semantic similarity, a metric for **task-level** meaning preservation. It is more tolerant of valid paraphrases than token-overlap metrics like BLEU, although it says little about fluency on its own.
* The LLM performs well on this simple dataset, with an **average similarity of 0.982** and an **average rubric score of 5.0**.
* The resulting `df_eval` provides **per-sample** correctness. The lower similarities for **t2** (`0.920`, *'Bonjour à tous!'* vs. the reference *'Bonjour tout le monde.'*) and **t3** (`0.973`, *'transforms'* vs. the reference *'is transforming'*) likely result from phrasing differences rather than meaning loss, which is a useful insight into the model's translation choices.
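The deployment guidance from Step 2 can be applied to the aggregate result. The sketch below assumes that the "Translation Quality" bands refer to the average similarity expressed as a percentage, which the notebook does not state explicitly. `deployment_recommendation` is a hypothetical helper.

```python
def deployment_recommendation(average_similarity: float) -> str:
    """Map an average similarity (0-1) to the Step 2 deployment guidance.

    Assumption: the "Translation Quality" bands are the average similarity
    expressed as a percentage.
    """
    quality_pct = average_similarity * 100
    if quality_pct >= 90:
        return "Deploy model to production."
    if quality_pct >= 80:
        return "Use with human post-editing."
    return "Needs model tuning."

# Per-sample similarities as reported in this notebook's outputs
per_sample = [1.0, 0.92, 0.973, 1.0, 1.0, 1.0]
average = sum(per_sample) / len(per_sample)
print(round(average, 3), "->", deployment_recommendation(average))
```

With only six short sentences, a recommendation like this is a demonstration of the mechanism rather than a basis for a real deployment decision.
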

## Step 4: Evaluate the Evaluation Pipeline

In this step, we assess the **reliability** and **stability** of our evaluation pipeline itself. Using multiple runs of semantic similarity scoring, we measure:

- Consistency: Check whether repeated evaluations produce similar scores for the same translations.
- Variance per example: Identify translations that may cause unstable scoring or edge cases.
- Overall reliability: Compute the average score and variance across all examples to confirm that the metric gives the same result on every run. This checks reproducibility, not agreement with human judgments.

Optionally, if multiple metrics are used (e.g., BLEU, ROUGE), we can also check correlations between them to confirm they provide complementary insights rather than redundant information.
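A minimal sketch of such a correlation check, using only the standard library, might look like this. The unigram-overlap F1 is a simple stand-in for a token-overlap metric and is neither BLEU nor ROUGE. The sentence pairs and the similarity values are made up for illustration, and `unigram_f1` and `pearson` are hypothetical helper names.

```python
import math
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (stand-in for BLEU/ROUGE)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative (made-up) candidate/reference pairs and similarity scores
pairs = [
    ("hello everyone", "hello everyone"),
    ("hello to all", "hello everyone"),
    ("the cat sleeps", "hello everyone"),
]
semantic_scores = [1.0, 0.9, 0.3]  # hypothetical embedding similarities
overlap_scores = [unigram_f1(c, r) for c, r in pairs]
print("Overlap F1:", [round(s, 3) for s in overlap_scores])
print("Pearson r:", round(pearson(semantic_scores, overlap_scores), 3))
```

A correlation close to 1 would suggest the two metrics are largely redundant on the dataset, while a lower value would suggest they capture different aspects of quality. Note that `pearson` is undefined when either metric is constant across all samples.
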

This step ensures that our automatic evaluation produces trustworthy and actionable results before deploying it for model assessment or business decisions.

In [5]:
import numpy as np
import json

# Step 4a: Check evaluation consistency (semantic similarity stability)
def evaluate_pipeline_reliability(eval_data, runs=5):
    """
    Runs the evaluation multiple times to check stability/reliability of the semantic similarity scores.

    eval_data: list of dicts with keys "translated" and "reference"
    runs: number of repeated evaluation runs
    """
    all_scores = []

    for r in range(runs):
        run_scores = []
        for d in eval_data:
            score = semantic_similarity(d["translated"], d["reference"])
            run_scores.append(score)
        all_scores.append(run_scores)

    all_scores = np.array(all_scores)

    # Mean and variance across runs
    mean_scores = np.mean(all_scores, axis=0)
    variance_scores = np.var(all_scores, axis=0)

    print("Average scores per example:", np.round(mean_scores, 3))
    print("Variance per example (across runs):", np.round(variance_scores, 6))
    print("Overall average score:", round(np.mean(mean_scores), 3))
    print("Overall variance:", round(np.mean(variance_scores), 6))

    return mean_scores, variance_scores

# Step 4b: Prepare evaluation input from df_eval
translations = df_eval["translated_text"].tolist()
references = df_eval["reference"].tolist()
eval_input = [{"translated": str(t), "reference": str(r)} for t, r in zip(translations, references)]

# Step 4c: Run reliability evaluation
mean_scores, variance_scores = evaluate_pipeline_reliability(eval_input, runs=5)

# Step 4d: Optional - check metric correlations if you have multiple metrics
# from scipy.stats import pearsonr
# Example: if you also computed BLEU or ROUGE scores alongside semantic similarity
# semantic_scores = df_eval['similarity'].values
# bleu_scores = calculate_bleu_scores(translations, references) # Assume this function exists
# corr, _ = pearsonr(semantic_scores, bleu_scores)
# print("Correlation between semantic similarity and BLEU:", corr)


Average scores per example: [1.    0.92  0.973 1.    1.    1.   ]
Variance per example (across runs): [0. 0. 0. 0. 0. 0.]
Overall average score: 0.982
Overall variance: 0.0


### Analysis:

* **Metric Stability:** The output shows an **overall variance of 0.0**. This is expected: the embedding model is deterministic and the same translations are re-scored on every run, so the result confirms that the metric is **reproducible**. A non-zero variance here would point to instability in the evaluation tool itself.
* **Conclusion:** The semantic similarity metric is stable and gives consistent feedback on a fixed set of translations. This check covers neither run-to-run variation in the model's generations nor how well the metric agrees with human judgments.
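The reliability check above re-scores a fixed set of translations, so it does not capture variation in the translations themselves. A sketch of how that could be measured is shown below. `translate_fn` and `score_fn` are placeholders for `qwen_translate` and `semantic_similarity`, and the stub functions exist only so that the example runs on its own without loading a model.

```python
import statistics
from typing import Callable, List

def generation_stability(
    source: str,
    reference: str,
    translate_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    runs: int = 5,
) -> dict:
    """Translate the same source several times and summarize the score spread."""
    scores: List[float] = [
        score_fn(translate_fn(source), reference) for _ in range(runs)
    ]
    return {
        "mean": statistics.mean(scores),
        "variance": statistics.pvariance(scores),  # population variance, like np.var
        "distinct_scores": len(set(scores)),
    }

# Stubs for illustration only: a "model" that alternates between two outputs
_outputs = iter(["Hello everyone.", "Hello all.", "Hello everyone.", "Hello all."])

def stub_translate(text: str) -> str:
    return next(_outputs)

def stub_score(candidate: str, reference: str) -> float:
    return 1.0 if candidate == reference else 0.5

report = generation_stability(
    "Bonjour tout le monde.", "Hello everyone.", stub_translate, stub_score, runs=4
)
print(report)
```

In the notebook, `translate_fn` could be something like `lambda s: qwen_translate(s, "French", "English")[0]` and `score_fn` could be `semantic_similarity`. A non-zero variance would then reflect variation in the model's output rather than in the metric.
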