In [1]:
import weave
from set_env import set_env
import nest_asyncio

In [2]:
set_env("GOOGLE_API_KEY")
set_env("WANDB_API_KEY")
print("Env set")

Env set


In [3]:
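try:
    # get_ipython() returns None outside an IPython/Jupyter session, so this
    # checks for a running kernel rather than just an installed package.
    from IPython import get_ipython
    in_jupyter = get_ipython() is not None
except ImportError:
    in_jupyter = False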
if in_jupyter:
    nest_asyncio.apply()

In [4]:
from utils.config import WEAVE_PROJECT, ENTITY

In [5]:
weave_client = weave.init(f"{ENTITY}/{WEAVE_PROJECT}")



Logged in as Weights & Biases user: a-sh0ts.
View Weave data at https://wandb.ai/eval-course/eval_course_ch1_dev/weave


## Why Evaluate LLMs?

### Traditional Software vs LLM Testing
Unlike traditional software, where outputs are deterministic and can be unit tested, LLMs produce:
- Non-deterministic outputs that vary between runs
- Complex, open-ended responses for tasks like summarization and dialogue
- Outputs that require nuanced evaluation of quality, accuracy, and safety
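
To make the contrast concrete, here is a minimal sketch. The two strings are hypothetical outputs written for illustration (not real model responses), and `mentions_chief_complaint` is an assumed helper name. It shows an exact-match assertion failing on a harmless rewording while a property-style check still passes.

```python
# Two hypothetical summaries of the same visit, as might come from two runs
# of the same prompt (illustrative strings, not real model outputs).
run_1 = "• Chief complaint: Bilateral elbow pain, right worse than left."
run_2 = "• Chief complaint: Pain in both elbows, worse on the right."

# A traditional exact-match assertion treats the second run as a failure.
exact_match = run_1 == run_2


def mentions_chief_complaint(text: str) -> bool:
    """Property check: the required section is present, whatever the wording."""
    return "chief complaint" in text.lower()


# The property holds for both runs even though the strings differ.
property_holds = all(mentions_chief_complaint(run) for run in (run_1, run_2))

print(f"exact match: {exact_match}")
print(f"property holds: {property_holds}")
```

Property checks like this cover format and structure; judging whether the content is medically correct still needs the richer evaluation described in the rest of this notebook.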

### Key Reasons for LLM Evaluation:

1. Quality Assurance
- Conventional metrics (n-grams, semantic similarity) are insufficient for complex LLM tasks
- Need to assess multiple dimensions like factuality, coherence, and relevance
- Important to catch potential hallucinations and factual inconsistencies

2. Safety & Alignment 
- Ensure outputs are safe and non-toxic
- Verify adherence to ethical guidelines and business policies
- Maintain alignment with intended use cases and user expectations

3. Performance Monitoring
- Track model performance across different tasks and domains
- Identify areas needing improvement or fine-tuning
- Compare different model versions or configurations

4. Business Goals
- Validate that outputs meet specific business requirements
- Ensure cost-effective deployment of LLM solutions
- Maintain quality standards for production systems

![](./media/traditional_llm_eval.png)

In [6]:
from utils.prompts import medical_task, medical_system_prompt 
from utils.render import display_prompt

In [7]:
display_prompt(medical_system_prompt)
display_prompt(medical_task)

In [8]:
annotated_medical_data = weave.ref(f"weave:///{ENTITY}/{WEAVE_PROJECT}/object/medical_data_annotations:latest").get()
# annotated_medical_data = weave.ref("weave:///a-sh0ts/eval_course_ch1_dev/object/medical_data_annotations:At9gri9UasftpPe5VNzT3EuIXQWAo5MYX8aMf2cuE8A").get()



# Understanding Medical Data Extraction Evaluation

## The Task: What Are We Trying to Do?

### Raw Data Format
Medical conversations are messy and unstructured. Looking at our example data:

1. **Dialogue Format**:
- Back-and-forth conversation between doctor and patient
- Contains personal details, small talk, and medical information mixed together
- Informal language ("hey", "mm-hmm", "yeah")
- Important details scattered throughout

2. **Medical Notes**:
- More structured but still in prose
- Contains standardized sections (CHIEF COMPLAINT, HISTORY, etc.)
- Includes sensitive information (names, ages)
- Medical terminology and abbreviations

### Extraction Goals
The LLM needs to:
1. Find relevant information
2. Ignore irrelevant details
3. Standardize the format
4. Protect patient privacy
5. Maintain medical accuracy

![](./media/medical_chatbot.png)

In [9]:
from utils.render import print_dialogue_data

In [10]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[0])

### Behind the scenes, the extraction is a single LLM call with the system prompt and the task prompt:

In [11]:
from utils.llm_client import LLMClient
from utils.config import GEMINI_MODEL

In [12]:
llm = LLMClient(model_name=GEMINI_MODEL, client_type="gemini")
llm.predict(user_prompt=medical_task.format(transcript=annotated_medical_data[0][0]["input"]), system_prompt=medical_system_prompt)


🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192d9f9-feb4-72d0-b0ce-aefbe11daee7


'• **Chief complaint:** Bilateral elbow pain, right worse than left.\n\n• **History of present illness:**  The patient has experienced bilateral elbow pain for 1.5 years, worse on the right side. Pain is located on the medial aspect of both elbows, increases with upper extremity use, and is not relieved by ice.  The patient uses ibuprofen 800mg three times daily.  The patient is a weightlifter and played various sports in youth without elbow pain.\n\n• **Physical examination:** Pulses equal in all extremities; sensation normal.  Right elbow: limited range of motion with extension, pain with flexion, medial aspect pain on palpation, pain with supination, no pain with pronation. Left elbow: minimal pain with flexion and extension, slight limited ROM on extension, pain with supination, no pain with pronation. X-rays showed no fractures or bony misalignment.\n\n• **Symptoms:** Bilateral elbow pain (medial aspect), worse with use, unrelieved by ice.\n\n• **New medications:**  Whole blood tr

### Assuming we run this over a curated dataset, we can collect all the outputs from the LLM and annotate them.

In [13]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[1])


## Annotation: Building Quality Training Data

### Why Annotate?
Raw LLM outputs aren't enough - we need expert validation to:
1. Establish ground truth
2. Identify edge cases
3. Understand failure modes
4. Create evaluation standards

### Annotation Process
Medical experts should:

1. **Review the Full Context**:
   - Read entire conversation
   - Review medical notes
   - Understand complete patient story

2. **Evaluate LLM Output**:
   - Check factual accuracy
   - Verify completeness
   - Ensure privacy protection
   - Validate formatting

3. **Provide Structured Feedback**:
   - Binary score (pass/fail)
   - Written explanation
   - Specific issue identification
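
One way to capture this structured feedback is a small record type. The sketch below is an assumption for illustration: the `Annotation` class and its field names are hypothetical and are not the schema of the `medical_data_annotations` dataset used in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical record for one expert review of one LLM output."""

    passed: bool                 # binary score (pass/fail)
    explanation: str             # written explanation from the reviewer
    issues: list[str] = field(default_factory=list)  # specific issues found

    def __post_init__(self) -> None:
        # A bare "fail" is hard to act on, so require a reason for failures.
        if not self.passed and not self.explanation.strip():
            raise ValueError("A failing annotation needs a written explanation")


example = Annotation(
    passed=False,
    explanation="Summary includes the patient's first name.",
    issues=["PII present"],
)
print(example)
```

Requiring an explanation for every failure keeps the annotations useful later, when they are mined for failure modes or used to calibrate automated judges.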

![](./media/annotation_ui.png)

In [14]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[2, 3, 4])


## Evaluation: Measuring Performance

### Direct vs Pairwise
For medical extraction:
- Use direct scoring (not A/B comparison)
- Task has objective right/wrong answers
- Need to catch critical errors
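
The difference can be sketched in a few lines. Everything here is hypothetical (`direct_score`, `pairwise_preference`, the toy checks, and the sample outputs). The point is that a pairwise comparison always names a winner, even when both candidates fail the task's fixed criteria.

```python
from typing import Callable


def direct_score(output: str, checks: list[Callable[[str], bool]]) -> int:
    """Direct scoring: grade one output against fixed criteria (1 = all pass)."""
    return int(all(check(output) for check in checks))


def pairwise_preference(
    output_a: str, output_b: str, score: Callable[[str], float]
) -> str:
    """Pairwise scoring: report which output is better, not whether it is good."""
    a, b = score(output_a), score(output_b)
    if a == b:
        return "tie"
    return "A" if a > b else "B"


# Toy criteria and a toy preference function, for illustration only.
checks = [
    lambda o: "chief complaint" in o.lower(),
    lambda o: len(o.split()) <= 150,
]
output_a = "Patient has elbow pain."
output_b = "Elbow pain."

print(pairwise_preference(output_a, output_b, score=lambda o: len(o.split())))
print(direct_score(output_a, checks), direct_score(output_b, checks))
```

Here the pairwise comparison prefers output A, yet direct scoring shows that neither output contains the required section, which is the kind of critical error this task must catch.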

### Key Evaluation Dimensions

1. **Factual Accuracy**:
   - Are extracted details correct?
   - Do they match the source?
   - Is medical terminology accurate?

2. **Completeness**:
   - All required fields present?
   - Important details included?
   - Appropriate use of N/A?

3. **Privacy Protection**:
   - PII properly removed?
   - Patient identity protected?
   - Sensitive details handled appropriately?

4. **Format Compliance**:
   - Bullet point structure followed?
   - Within word limit?
   - Clear and readable?

### Using Evaluation Results

Results help us:
1. Improve prompts
2. Identify system limitations
3. Set quality standards
4. Monitor performance
5. Train better models

![](./media/eval_task_flowchart.png)

### Using our domain knowledge, we can write evaluation functions after investigating the task and annotations

In [15]:
test_output = annotated_medical_data[0][1]["output"]

In [16]:
from utils.prompts import medical_privacy_judge_prompt, MedicalPrivacyJudgement, medical_task_score_prompt, MedicalTaskScoreJudgement, medical_task_score_system_prompt, medical_privacy_system_prompt
import json

In [17]:
@weave.op()
def test_adheres_to_required_keys(model_output: str):
    # Required medical keys
    required_keys = [
        "Chief complaint",
        "History of present illness",
        "Physical examination",
        "Symptoms",
        "New medications with dosages",
        "Follow-up instructions"
    ]
    
    # Convert to lowercase for case-insensitive matching
    model_output_lower = model_output.lower()
    
    # Check if all required keys are present
    for key in required_keys:
        if key.lower() not in model_output_lower:
            return int(False)
            
    return int(True)

In [18]:
test_adheres_to_required_keys(test_output)

🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192d9fa-065e-7471-9a67-e7f46062fb6d


0
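
A bare `0` does not say which key was missing. The sketch below is a diagnostic variant that uses the same key list and the same case-insensitive substring match as `test_adheres_to_required_keys`; the helper name `missing_required_keys` and the sample string are assumptions for illustration.

```python
REQUIRED_KEYS = [
    "Chief complaint",
    "History of present illness",
    "Physical examination",
    "Symptoms",
    "New medications with dosages",
    "Follow-up instructions",
]


def missing_required_keys(model_output: str) -> list[str]:
    """Return the required keys that do not appear in the output."""
    lowered = model_output.lower()
    return [key for key in REQUIRED_KEYS if key.lower() not in lowered]


# Hypothetical output: note the shortened "New medications" heading.
sample = "• **Chief complaint:** Elbow pain\n• **New medications:** N/A"
for key in missing_required_keys(sample):
    print(f"missing: {key}")
```

Because the match is on the full key text, a heading such as "New medications:" does not satisfy "New medications with dosages". Listing the missing keys makes that kind of near miss visible.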

In [19]:
@weave.op()
def test_adheres_to_word_limit(model_output: str):
    return int(len(model_output.split()) <= 150)

In [20]:
test_adheres_to_word_limit(test_output)

🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192d9fa-0669-74e2-a261-5912f24d93ff


0

In [21]:
display_prompt(medical_privacy_system_prompt)
display_prompt(medical_privacy_judge_prompt)

In [22]:
@weave.op()
def judge_adheres_to_privacy_guidelines(model_output: str):
    llm = LLMClient(model_name=GEMINI_MODEL, client_type="gemini")
    response = llm.predict(user_prompt=medical_privacy_judge_prompt.format(text=model_output), system_prompt=medical_privacy_system_prompt, schema=MedicalPrivacyJudgement)
    try:
        result = json.loads(response.text.strip("\n"))
        return int(not result[0]["contains_pii"])
    except Exception:
        # NOTE: an unparseable judge response is scored as a pass (fail-open).
        return int(True)  # TODO: Add json parsing as failure reason

In [23]:
judge_adheres_to_privacy_guidelines(test_output)

🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192d9fa-067e-7993-a72f-7754243181ba


0

In [24]:
display_prompt(medical_task_score_system_prompt)
display_prompt(medical_task_score_prompt)

In [25]:
@weave.op()
def judge_overall_score(model_output: str):
    llm = LLMClient(model_name=GEMINI_MODEL, client_type="gemini")
    response = llm.predict(user_prompt=medical_task_score_prompt.format(text=model_output), system_prompt=medical_task_score_system_prompt, schema=MedicalTaskScoreJudgement)
    try:
        result = json.loads(response.text.strip("\n"))
        return result[0]["score"]
    except Exception:
        # NOTE: an unparseable judge response is scored as 0 (fail-closed).
        return 0  # TODO: Add json parsing as failure reason


In [26]:
judge_overall_score(test_output)

🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192d9fa-0991-7990-a884-7f39a79ad025


0
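
Both judges leave a TODO for reporting JSON parsing failures. One possible shape for that is sketched below. It assumes the judge returns a JSON list whose first element is an object, which matches the `result[0][...]` indexing above; `parse_judge_response` is a hypothetical helper, not part of `utils`.

```python
import json
from typing import Any, Optional


def parse_judge_response(raw_text: str, key: str) -> tuple[Optional[Any], Optional[str]]:
    """Return (value, None) on success or (None, reason) on failure."""
    try:
        parsed = json.loads(raw_text.strip())
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    # The judges above index result[0][key], so expect a non-empty list.
    if not isinstance(parsed, list) or not parsed:
        return None, "expected a non-empty JSON list"
    first = parsed[0]
    if not isinstance(first, dict) or key not in first:
        return None, f"missing key: {key}"
    return first[key], None


print(parse_judge_response('[{"contains_pii": true}]', "contains_pii"))
print(parse_judge_response("Sorry, I cannot answer that.", "contains_pii"))
```

Returning the reason alongside the value lets a scorer log parse failures separately instead of folding them into a pass or a fail, which would otherwise distort agreement with the human labels.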

### We already have a dataset of annotated medical data. We can use this to test our evaluation functions.

In [27]:
@weave.op()
def annotated_data_passthrough(input, output):
    return output

In [28]:
annotated_medical_data[0][2]

0

In [29]:
evaluation_data = [
    {"input": annotated_row[0]["input"], "output": annotated_row[1]["output"], "scores": {"human_required_keys": annotated_row[3]["presence_of_keys"], "human_word_limit": annotated_row[3]["word_count"], "human_absence_of_PII": annotated_row[3]["absence_of_PII"], "human_overall_score": annotated_row[2]}}
    for annotated_row in annotated_medical_data
][0:5]

In [30]:
import asyncio

In [31]:
# Create evaluation
evaluation = weave.Evaluation(
    dataset=evaluation_data,
    scorers=[test_adheres_to_required_keys, test_adheres_to_word_limit, judge_adheres_to_privacy_guidelines, judge_overall_score]
)

# Run evaluation
evals = asyncio.run(evaluation.evaluate(annotated_data_passthrough))

🍩 https://wandb.ai/eval-course/eval_course_ch1_dev/r/call/0192d9fa-0c35-7593-b1c4-09b9307a6adb


### But do our automated scores agree with the human annotations?

In [39]:
from utils.evals import get_evaluation_predictions  # NOTE: redefined inline below, which shadows this import


In [40]:
eval_call_id = "0192d9fa-0c35-7593-b1c4-09b9307a6adb"

In [41]:
import pandas as pd
from sklearn.metrics import cohen_kappa_score

In [42]:
def get_evaluation_predictions(eval_call_id):
    """
    Extract and format evaluation predictions from a Weave evaluation call ID.
    
    Args:
        eval_call_id (str): ID of the Weave evaluation call to analyze
        
    Returns:
        pd.DataFrame: DataFrame containing paired human and model scores for each metric
    """
    eval_calls = weave_client.get_call(eval_call_id)
    predictions = []
    
    for eval_call in eval_calls.children():
        if eval_call.op_name.split("/")[-1].split(":")[0] == "Evaluation.predict_and_score":
            _eval_call = weave_client.get_call(eval_call.id)
            
            # Extract data
            input_text = _eval_call.inputs["example"]["input"]
            human_scores = _eval_call.inputs["example"]["scores"]
            model_scores = _eval_call.output["scores"]
            
            # Create paired scores
            scores = {
                'input': input_text,
                'required_keys': (human_scores['human_required_keys'], model_scores['test_adheres_to_required_keys']),
                'word_limit': (human_scores['human_word_limit'], model_scores['test_adheres_to_word_limit']),
                'privacy': (human_scores['human_absence_of_PII'], model_scores['judge_adheres_to_privacy_guidelines']),
                'overall': (human_scores['human_overall_score'], model_scores['judge_overall_score'])
            }
            predictions.append(scores)

    return pd.DataFrame(predictions)

In [43]:
df = get_evaluation_predictions(eval_call_id)
df

| | input | required_keys | word_limit | privacy | overall |
|---|---|---|---|---|---|
| 0 | Dialogue: [doctor] hey dylan what's going on s... | (1, 0) | (1, 0) | (0, 0) | (0, 0) |
| 1 | Dialogue: [doctor] hello , mrs . peterson . [p... | (1, 0) | (1, 1) | (1, 1) | (1, 0) |
| 2 | Dialogue: [doctor] hey , ms. hill . nice to se... | (0, 0) | (1, 1) | (1, 0) | (0, 1) |
| 3 | Dialogue: [doctor] okay so we are recording ok... | (0, 0) | (1, 1) | (1, 1) | (0, 0) |
| 4 | Dialogue: [doctor] hi keith , how are you ? [p... | (1, 0) | (1, 1) | (0, 0) | (0, 0) |
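
Each cell is a `(human_score, model_score)` pair. As a first sanity check before any chance-corrected statistic, the raw agreement rate per metric can be computed directly. This sketch copies the five pairs per metric from the output above into plain Python; the variable names are illustrative.

```python
# (human_score, model_score) pairs copied from the five rows shown above.
pairs = {
    "required_keys": [(1, 0), (1, 0), (0, 0), (0, 0), (1, 0)],
    "word_limit":    [(1, 0), (1, 1), (1, 1), (1, 1), (1, 1)],
    "privacy":       [(0, 0), (1, 1), (1, 0), (1, 1), (0, 0)],
    "overall":       [(0, 0), (1, 0), (0, 1), (0, 0), (0, 0)],
}

# Fraction of rows where the automated score matches the human score.
agreement = {
    metric: sum(human == model for human, model in rows) / len(rows)
    for metric, rows in pairs.items()
}

for metric, rate in agreement.items():
    print(f"{metric}: {rate:.0%}")
```

Raw agreement can be misleading when one label dominates, and with only five rows every percentage moves in steps of 20 points, so treat these numbers as a smoke test rather than a result.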


In [44]:
def calculate_kappa_scores(df, tuple_columns=['required_keys', 'word_limit', 'privacy', 'overall']):
    """
    Calculate Cohen's Kappa scores for human vs model predictions across multiple metrics.
    
    Args:
        df (pd.DataFrame): DataFrame containing paired scores as tuples (human_score, model_score)
        tuple_columns (list): List of column names containing the score tuples
        
    Returns:
        dict: Dictionary of kappa scores for each metric
    """
    labels = [0, 1]  # Binary classification labels
    kappa_scores = {}
    
    for col in tuple_columns:
        human_scores = df[col].apply(lambda x: x[0])
        pred_scores = df[col].apply(lambda x: x[1])
        
        kappa_scores[col] = cohen_kappa_score(
            human_scores,
            pred_scores,
            labels=labels,
            weights='linear'
        )
    
    return kappa_scores

# Example usage:
kappa_scores = calculate_kappa_scores(df)
for metric, score in kappa_scores.items():
    print(f"{metric}: {score:.3f}")

required_keys: 0.000
word_limit: 0.000
privacy: 0.615
overall: -0.250
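
It may look odd that `word_limit` scores 0.000 when four of the five pairs agree. Cohen's kappa is $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance from each rater's label frequencies. The sketch below is a from-scratch, unweighted implementation for illustration (`cohen_kappa` is a hypothetical helper, not the scikit-learn function). For binary labels, linear weighting should give the same value as the unweighted statistic, because the only possible disagreement has distance 1.

```python
from collections import Counter


def cohen_kappa(human: list[int], model: list[int]) -> float:
    """Unweighted Cohen's kappa for two equally long lists of labels."""
    n = len(human)
    # Observed agreement: fraction of items where both raters match.
    p_o = sum(h == m for h, m in zip(human, model)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    h_counts, m_counts = Counter(human), Counter(model)
    labels = set(human) | set(model)
    p_e = sum(h_counts[label] * m_counts[label] for label in labels) / n**2
    if p_e == 1.0:
        return float("nan")  # both raters used one identical label: undefined
    return (p_o - p_e) / (1 - p_e)


# Human and model scores for two metrics, taken from the DataFrame above.
word_limit = ([1, 1, 1, 1, 1], [0, 1, 1, 1, 1])
privacy = ([0, 1, 1, 1, 0], [0, 1, 0, 1, 0])

print(f"word_limit: {cohen_kappa(*word_limit):.3f}")  # word_limit: 0.000
print(f"privacy: {cohen_kappa(*privacy):.3f}")        # privacy: 0.615
```

Both metrics have 80% raw agreement. For `word_limit` the human label is always 1, so chance agreement is also 0.8 and kappa collapses to 0. For `privacy` both labels occur, chance agreement is 0.48, and kappa is about 0.615.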


In [45]:
def calculate_weighted_alignment(kappa_scores, weights=None):
    """
    Calculate weighted alignment score across all metrics.
    
    Args:
        kappa_scores (dict): Dictionary of kappa scores for each metric
        weights (dict): Optional dictionary of weights for each metric. 
                       If None, uses equal weights.
    
    Returns:
        float: Weighted average kappa score
    """
    # Default to equal weights if none provided
    if weights is None:
        weights = {metric: 1/len(kappa_scores) for metric in kappa_scores.keys()}
    
    # Validate weights
    assert set(weights.keys()) == set(kappa_scores.keys()), \
        "Weights must be provided for all metrics"
    assert abs(sum(weights.values()) - 1.0) < 1e-9, \
        "Weights must sum to 1"
    
    # Calculate weighted average
    weighted_score = sum(kappa_scores[metric] * weights[metric] 
                        for metric in kappa_scores.keys())
    
    return weighted_score

# Example weights (adjust these based on what's most important for your use case)
weights = {
    'required_keys': 0.3,    # High importance - core functionality
    'privacy': 0.3,          # High importance - compliance/safety
    'word_limit': 0.2,       # Medium importance - usability
    'overall': 0.2           # Medium importance - general quality
}

# Calculate aggregate score
aggregate_score = calculate_weighted_alignment(kappa_scores, weights)
print(f"\nWeighted Aggregate Alignment Score: {aggregate_score:.3f}")

# You can easily try different weightings:
privacy_focused_weights = {
    'required_keys': 0.2,
    'privacy': 0.5,          # Much higher weight on privacy
    'word_limit': 0.15,
    'overall': 0.15
}

privacy_focused_score = calculate_weighted_alignment(kappa_scores, privacy_focused_weights)
print(f"Privacy-Focused Alignment Score: {privacy_focused_score:.3f}")


Weighted Aggregate Alignment Score: 0.135
Privacy-Focused Alignment Score: 0.270
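
As a check on these two numbers, the weighted sums can be written out by hand using the rounded kappa values printed earlier, with terms in the order required_keys, privacy, word_limit, overall:

$$
\begin{aligned}
S_{\text{default}} &= 0.3(0.000) + 0.3(0.615) + 0.2(0.000) + 0.2(-0.250) \approx 0.185 - 0.050 = 0.135 \\
S_{\text{privacy}} &= 0.2(0.000) + 0.5(0.615) + 0.15(0.000) + 0.15(-0.250) \approx 0.308 - 0.038 = 0.270
\end{aligned}
$$

Two of the four kappas are zero, so the aggregate is driven almost entirely by the privacy metric. Raising the privacy weight raises the score without any scorer having improved, which is worth keeping in mind when comparing aggregates computed under different weightings.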
