In [1]:
import weave
from set_env import set_env
import nest_asyncio

In [2]:
set_env("GOOGLE_API_KEY")
set_env("WANDB_API_KEY")
print("Env set")

Env set


In [3]:
try:
    import IPython
    in_jupyter = True
except ImportError:
    in_jupyter = False
if in_jupyter:
    nest_asyncio.apply()

In [4]:
from utils.config import WEAVE_PROJECT, ENTITY

In [5]:
weave.init(f"{ENTITY}/{WEAVE_PROJECT}")



Logged in as Weights & Biases user: a-sh0ts.
View Weave data at https://wandb.ai/a-sh0ts/eval_course_ch1_dev/weave


<weave.trace.weave_client.WeaveClient at 0x13499d6f0>

## Why Evaluate LLMs?

### Traditional Software vs LLM Testing
Unlike traditional software where outputs are deterministic and can be unit tested, LLMs produce:
- Non-deterministic outputs that vary between runs
- Complex, open-ended responses for tasks like summarization and dialogue
- Outputs that require nuanced evaluation of quality, accuracy, and safety
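
As a minimal sketch of why exact-match unit tests break down, the snippet below uses a hypothetical `fake_llm_summary` function (a random stand-in for a real model, not part of this notebook's utilities) that returns one of several equally acceptable phrasings. An exact string comparison would be brittle here, while a property-style check holds for every sample.

```python
import random

# Hypothetical stand-in for an LLM: each call may return a different,
# equally valid phrasing of the same finding.
def fake_llm_summary(rng: random.Random) -> str:
    candidates = [
        "Chief complaint: bilateral elbow pain.",
        "The patient reports pain in both elbows.",
        "Bilateral elbow pain, worse on the right.",
    ]
    return rng.choice(candidates)

# Property-style check: tests what the output must contain, not its exact wording.
def mentions_elbow_pain(text: str) -> bool:
    lowered = text.lower()
    return "elbow" in lowered and "pain" in lowered

rng = random.Random(0)
outputs = [fake_llm_summary(rng) for _ in range(20)]

# Several distinct strings can all be acceptable, so `output == expected` is brittle.
distinct_outputs = set(outputs)

# The property check is stable across all sampled phrasings.
all_pass_property = all(mentions_elbow_pain(o) for o in outputs)
print(len(distinct_outputs), all_pass_property)
```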

### Key Reasons for LLM Evaluation:

1. **Quality Assurance**
   - Conventional metrics (n-gram overlap, semantic similarity) are insufficient on their own for complex LLM tasks
   - We need to assess multiple dimensions such as factuality, coherence, and relevance
   - It is important to catch hallucinations and factual inconsistencies

2. **Safety & Alignment**
   - Ensure outputs are safe and non-toxic
   - Verify adherence to ethical guidelines and business policies
   - Maintain alignment with intended use cases and user expectations

3. **Performance Monitoring**
   - Track model performance across different tasks and domains
   - Identify areas needing improvement or fine-tuning
   - Compare different model versions or configurations

4. **Business Goals**
   - Validate that outputs meet specific business requirements
   - Ensure cost-effective deployment of LLM solutions
   - Maintain quality standards for production systems
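
To illustrate the point about n-gram metrics, here is a rough sketch of a unigram-overlap F1 score (similar in spirit to ROUGE-1, but a simplified version written for this example). The three sentences are made up for illustration: a clinically wrong output that changes one word scores far higher than a correct paraphrase.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 on whitespace tokens (illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Made-up sentences for illustration.
reference = "patient takes ibuprofen 800 mg three times daily"
wrong_dose = "patient takes ibuprofen 800 mg three times weekly"      # factually wrong
paraphrase = "ibuprofen 800 mg is used by the patient thrice a day"   # factually right

wrong_score = unigram_f1(wrong_dose, reference)
paraphrase_score = unigram_f1(paraphrase, reference)

# The incorrect output wins on surface overlap.
print(f"wrong dose: {wrong_score:.3f}, paraphrase: {paraphrase_score:.3f}")
```

Surface overlap rewards shared wording, not correctness, which is why the later sections rely on task-specific checks and LLM judges.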

![](./media/traditional_llm_eval.png)

In [6]:
from utils.prompts import medical_task, medical_system_prompt 
from utils.render import display_prompt

In [7]:
display_prompt(medical_system_prompt)
display_prompt(medical_task)

In [8]:
annotated_medical_data = weave.ref(f"weave:///{ENTITY}/{WEAVE_PROJECT}/object/medical_data_annotations:latest").get()
# annotated_medical_data = weave.ref("weave:///a-sh0ts/eval_course_ch1_dev/object/medical_data_annotations:At9gri9UasftpPe5VNzT3EuIXQWAo5MYX8aMf2cuE8A").get()



# Understanding Medical Data Extraction Evaluation

## The Task: What Are We Trying to Do?

### Raw Data Format
Medical conversations are messy and unstructured. Looking at our example data:

1. **Dialogue Format**:
   - Back-and-forth conversation between doctor and patient
   - Contains personal details, small talk, and medical information mixed together
   - Informal language ("hey", "mm-hmm", "yeah")
   - Important details scattered throughout

2. **Medical Notes**:
   - More structured but still in prose
   - Contains standardized sections (CHIEF COMPLAINT, HISTORY, etc.)
   - Includes sensitive information (names, ages)
   - Medical terminology and abbreviations

### Extraction Goals
The LLM needs to:
1. Find relevant information
2. Ignore irrelevant details
3. Standardize the format
4. Protect patient privacy
5. Maintain medical accuracy

![](./media/medical_chatbot.png)

In [9]:
from utils.render import print_dialogue_data

In [10]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[0])

### Behind the scenes, the LLM call looks roughly like this:

In [11]:
from utils.llm_client import LLMClient
from utils.config import GEMINI_MODEL

In [12]:
llm = LLMClient(model_name=GEMINI_MODEL, client_type="gemini")
llm.predict(user_prompt=medical_task.format(transcript=annotated_medical_data[0][0]["input"]), system_prompt=medical_system_prompt)


🍩 https://wandb.ai/a-sh0ts/eval_course_ch1_dev/r/call/0192d6d5-dc53-7912-ac7b-37e7e3f71f4f


'• **Chief complaint:** Bilateral elbow pain, right worse than left.\n\n• **History of present illness:** 1.5 years of bilateral elbow pain, worse on the right. Pain is medial, exacerbated by upper extremity use.  Patient uses ibuprofen 800mg TID. Ice provides no relief.  History of participation in contact sports in youth, but no previous elbow pain.  Currently lifts heavy weights.\n\n• **Physical examination:**  Pulses equal in all extremities. Normal distal sensation. Right elbow: limited range of motion on extension, pain with flexion and supination, medial tenderness. Left elbow: minimal pain with flexion and extension, slight limited ROM on extension, pain with supination.\n\n• **Symptoms:** Bilateral elbow pain (medial), worse with use of arms and hands, pain with flexion and extension, pain with supination.\n\n• **New medications:** MRI ordered; whole blood transfusion discussed as a treatment option.\n\n• **Follow-up instructions:**  MRI scheduled; whole blood transfusion to b
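
The output above follows a `• **Field:** value` pattern, one field per line. As a rough sketch, assuming that pattern holds for every line, the fields can be pulled into a dictionary for downstream checks. The helper name `parse_bulleted_fields` and the regular expression are illustrative assumptions, not part of the notebook's `utils` package.

```python
import re

# Matches lines shaped like "• **Field name:** value" (assumed format).
BULLET_FIELD = re.compile(r"^•\s*\*\*(?P<name>[^*]+?):\*\*\s*(?P<value>.*)$")

def parse_bulleted_fields(model_output: str) -> dict[str, str]:
    """Map each bulleted field name to its text; lines that do not match are skipped."""
    fields = {}
    for line in model_output.splitlines():
        match = BULLET_FIELD.match(line.strip())
        if match:
            fields[match.group("name").strip()] = match.group("value").strip()
    return fields

# Two fields copied from the output above.
sample = (
    "• **Chief complaint:** Bilateral elbow pain, right worse than left.\n\n"
    "• **Symptoms:** Bilateral elbow pain (medial), worse with use of arms and hands."
)
parsed = parse_bulleted_fields(sample)
print(parsed["Chief complaint"])
```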

### Assuming we run this over a curated dataset, we can collect all the outputs from the LLM and annotate them.

In [13]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[1])


## Annotation: Building Quality Training Data

### Why Annotate?
Raw LLM outputs aren't enough - we need expert validation to:
1. Establish ground truth
2. Identify edge cases
3. Understand failure modes
4. Create evaluation standards

### Annotation Process
Medical experts should:

1. **Review the Full Context**:
   - Read entire conversation
   - Review medical notes
   - Understand complete patient story

2. **Evaluate LLM Output**:
   - Check factual accuracy
   - Verify completeness
   - Ensure privacy protection
   - Validate formatting

3. **Provide Structured Feedback**:
   - Binary score (pass/fail)
   - Written explanation
   - Specific issue identification
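
One way to capture the structured feedback described above is a small record type. This is only a sketch: the class name `ExpertAnnotation` and its fields are hypothetical and do not reflect the actual schema of `medical_data_annotations`.

```python
from dataclasses import dataclass, field

@dataclass
class ExpertAnnotation:
    """Hypothetical container for one expert review of one LLM output."""
    passed: bool                                      # binary score (pass/fail)
    explanation: str                                  # written explanation
    issues: list[str] = field(default_factory=list)   # specific issues identified

    def __post_init__(self):
        # A failing annotation is only useful if it says why it failed.
        if not self.passed and not self.explanation.strip():
            raise ValueError("A failing annotation needs a written explanation")

# Illustrative values, not taken from the dataset.
example = ExpertAnnotation(
    passed=False,
    explanation="Output includes the patient's name.",
    issues=["privacy"],
)
print(example)
```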

![](./media/annotation_ui.png)

In [14]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[2, 3, 4])


## Evaluation: Measuring Performance

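### Direct vs Pairwise
For medical extraction, we use direct scoring (each output is graded on its own) rather than pairwise A/B comparison, because:
- The task has largely objective right/wrong answers
- We need to catch critical errors in every individual output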
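
The difference can be sketched with two toy functions. Everything here is a made-up illustration: the name check is a stand-in for a real PII detector, and the "shorter is better" preference is a placeholder for a real pairwise judge. The point is that a pairwise comparison always picks a winner, even when both outputs contain a critical error, while direct scoring flags each output separately.

```python
def contains_patient_name(output: str) -> bool:
    # Toy stand-in for a real PII check; "John" is a made-up example name.
    return "john" in output.lower()

def direct_pass(output: str) -> bool:
    """Direct scoring: grade one output against an absolute criterion."""
    return not contains_patient_name(output)

def pairwise_winner(output_a: str, output_b: str) -> str:
    """Pairwise scoring: only says which output is preferred (toy rule: shorter wins)."""
    return "A" if len(output_a) <= len(output_b) else "B"

output_a = "• Chief complaint: John reports elbow pain."
output_b = "• Chief complaint: John reports bilateral elbow pain for 1.5 years."

winner = pairwise_winner(output_a, output_b)
direct_results = [direct_pass(output_a), direct_pass(output_b)]

# Pairwise declares a winner, yet both outputs fail the direct privacy check.
print(winner, direct_results)
```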

### Key Evaluation Dimensions

1. **Factual Accuracy**:
   - Are extracted details correct?
   - Do they match the source?
   - Is medical terminology accurate?

2. **Completeness**:
   - All required fields present?
   - Important details included?
   - Appropriate use of N/A?

3. **Privacy Protection**:
   - PII properly removed?
   - Patient identity protected?
   - Sensitive details handled appropriately?

4. **Format Compliance**:
   - Bullet point structure followed?
   - Within word limit?
   - Clear and readable?
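
As a sketch of how the completeness dimension could be checked programmatically, the function below looks for each expected section name in the output. The list of required sections is an assumption taken from the fields visible in the sample output earlier; the real task prompt may define them differently.

```python
# Assumed section names, copied from the sample LLM output shown earlier.
REQUIRED_SECTIONS = [
    "Chief complaint",
    "History of present illness",
    "Physical examination",
    "Symptoms",
    "New medications",
    "Follow-up instructions",
]

def missing_sections(model_output: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required section names that never appear in the output."""
    lowered = model_output.lower()
    return [name for name in required if name.lower() not in lowered]

partial_output = "• **Chief complaint:** Elbow pain\n• **Symptoms:** N/A"
print(missing_sections(partial_output))
```

A substring check like this only confirms that a section heading is present. Whether the content is correct, or whether "N/A" is appropriate, still needs a human or an LLM judge.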

### Using Evaluation Results

Results help us:
1. Improve prompts
2. Identify system limitations
3. Set quality standards
4. Monitor performance
5. Train better models

![](./media/eval_task_flowchart.png)

### Using our domain knowledge, we can write evaluation functions after investigating the task and annotations

In [15]:
test_output = annotated_medical_data[0][1]["output"]

In [16]:
from utils.prompts import (
    medical_privacy_judge_prompt,
    MedicalPrivacyJudgement,
    medical_task_score_prompt,
    MedicalTaskScoreJudgement,
    medical_task_score_system_prompt,
    medical_privacy_system_prompt,
)
import json

In [17]:
@weave.op()
def test_adheres_to_bullet_point_format(model_output: str):
    # Every non-empty line must start with a bullet character (U+2022).
    sections = [s.strip() for s in model_output.split("\n") if s.strip()]
    return all(section.startswith("\u2022") for section in sections)


In [18]:
test_adheres_to_bullet_point_format(test_output)

🍩 https://wandb.ai/a-sh0ts/eval_course_ch1_dev/r/call/0192d6d5-e3ee-74c2-9545-baa83fbfd670


False

In [19]:
@weave.op()
def test_adheres_to_word_limit(model_output: str):
    return len(model_output.split()) <= 150

In [20]:
test_adheres_to_word_limit(test_output)

🍩 https://wandb.ai/a-sh0ts/eval_course_ch1_dev/r/call/0192d6d5-e3f8-7450-92b0-61ba6a7128d2


True

In [21]:
display_prompt(medical_privacy_system_prompt)
display_prompt(medical_privacy_judge_prompt)

In [22]:
@weave.op()
def judge_adheres_to_privacy_guidelines(model_output: str):
    llm = LLMClient(model_name=GEMINI_MODEL, client_type="gemini")
    response = llm.predict(
        user_prompt=medical_privacy_judge_prompt.format(text=model_output),
        system_prompt=medical_privacy_system_prompt,
        schema=MedicalPrivacyJudgement,
    )
    try:
        # The response is expected to be a JSON list; return the first record's `contains_pii` field as-is.
        result = json.loads(response.text.strip("\n"))
        return result[0]["contains_pii"]
    except Exception:  # a bare `except` would also swallow KeyboardInterrupt
        return False  # TODO: Add json parsing as failure reason

In [23]:
judge_adheres_to_privacy_guidelines(test_output)

🍩 https://wandb.ai/a-sh0ts/eval_course_ch1_dev/r/call/0192d6d5-e427-70c3-9dbe-4d1706b2fcb6


True

In [24]:
display_prompt(medical_task_score_system_prompt)
display_prompt(medical_task_score_prompt)

In [25]:
@weave.op()
def judge_overall_score(model_output: str):
    llm = LLMClient(model_name=GEMINI_MODEL, client_type="gemini")
    response = llm.predict(
        user_prompt=medical_task_score_prompt.format(text=model_output),
        system_prompt=medical_task_score_system_prompt,
        schema=MedicalTaskScoreJudgement,
    )
    try:
        # The response is expected to be a JSON list; return the first record's `score` field.
        result = json.loads(response.text.strip("\n"))
        return result[0]["score"]
    except Exception:  # a bare `except` would also swallow KeyboardInterrupt
        return 0  # TODO: Add json parsing as failure reason
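
Both judge functions carry a TODO about reporting JSON parsing failures. One possible way to address it, sketched here under the assumption that the judge replies with a JSON list of records (as the `result[0][...]` lookups suggest), is a helper that returns the parsed value together with an error description. The helper name `parse_judge_field` and the `parse_error` key are hypothetical.

```python
import json
from typing import Any

def parse_judge_field(raw_text: str, key: str, default: Any) -> dict:
    """Extract `key` from a judge's JSON reply, recording why parsing failed if it did."""
    try:
        payload = json.loads(raw_text.strip())
        # Accept either a list of records or a single record.
        record = payload[0] if isinstance(payload, list) else payload
        return {"value": record[key], "parse_error": None}
    except (json.JSONDecodeError, KeyError, IndexError, TypeError) as exc:
        return {"value": default, "parse_error": f"{type(exc).__name__}: {exc}"}

ok = parse_judge_field('[{"score": 1}]', "score", default=0)
bad_json = parse_judge_field("not json", "score", default=0)
missing_key = parse_judge_field('[{"other": 1}]', "score", default=0)
print(ok, bad_json, missing_key, sep="\n")
```

With this shape, a fallback value of `0` or `False` can be distinguished from a genuine judgement, so parsing failures do not silently count as model failures.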


In [26]:
judge_overall_score(test_output)

🍩 https://wandb.ai/a-sh0ts/eval_course_ch1_dev/r/call/0192d6d5-e713-7ec1-ac26-2b97602eff75


0

### We already have a dataset of annotated medical data. We can use this to test our evaluation functions.

In [27]:
@weave.op()
def annotated_data_passthrough(input, output):
    return output

In [28]:
evaluation_data = [
    {"input": annotated_row[0]["input"], "output": annotated_row[1]["output"]}
    for annotated_row in annotated_medical_data
][0:5]

In [29]:
import asyncio

In [30]:
# Create evaluation
evaluation = weave.Evaluation(
    dataset=evaluation_data,
    scorers=[test_adheres_to_bullet_point_format, test_adheres_to_word_limit, judge_adheres_to_privacy_guidelines, judge_overall_score]
)

# Run evaluation
asyncio.run(evaluation.evaluate(annotated_data_passthrough))

🍩 https://wandb.ai/a-sh0ts/eval_course_ch1_dev/r/call/0192d6d5-e97e-75e0-8dc8-08ae0b0404a2


{'test_adheres_to_bullet_point_format': {'true_count': 3,
  'true_fraction': 0.6},
 'test_adheres_to_word_limit': {'true_count': 5, 'true_fraction': 1.0},
 'judge_adheres_to_privacy_guidelines': {'true_count': 4,
  'true_fraction': 0.8},
 'judge_overall_score': {'mean': 0.6},
 'model_latency': {'mean': 0.04775018692016601}}
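
To read the summary above: boolean scorers are reported as a count and fraction of `True` results, and numeric scorers as a mean. The sketch below mirrors that shape in plain Python. It is not Weave's actual implementation, and the per-row values are made up (one assignment that is consistent with 3 of 5 rows passing).

```python
from statistics import mean

def summarize(values):
    """Aggregate per-row scorer results in the same shape as the summary above."""
    if not values:
        return {}
    if all(isinstance(v, bool) for v in values):
        true_count = sum(values)
        return {"true_count": true_count, "true_fraction": true_count / len(values)}
    return {"mean": mean(values)}

# Made-up per-row results, chosen only to match the reported totals.
bullet_format_results = [True, True, True, False, False]
overall_scores = [1, 1, 1, 0, 0]

bullet_summary = summarize(bullet_format_results)
score_summary = summarize(overall_scores)
print(bullet_summary, score_summary)
```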

### But do our test outputs adhere to the annotation expectations?