In [1]:
import weave
from set_env import set_env
import nest_asyncio

In [None]:
set_env("GEMINI_API_KEY")
set_env("WANDB_API_KEY")
print("Env set")

In [3]:
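# nest_asyncio lets asyncio code run inside Jupyter's already-running event loop.
# An importable IPython is used here as a rough proxy for "running in a notebook".
try:
    import IPython  # noqa: F401
    in_jupyter = True
except ImportError:
    in_jupyter = False

if in_jupyter:
    nest_asyncio.apply()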

In [None]:
weave.init("eval_course_ch1_dev")

## Why Evaluate LLMs?

### Traditional Software vs LLM Testing
Unlike traditional software, where outputs are deterministic and can be unit tested, LLMs produce:
- Non-deterministic outputs that vary between runs
- Complex, open-ended responses for tasks like summarization and dialogue
- Outputs that require nuanced evaluation of quality, accuracy, and safety
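
To make the contrast concrete, here is a minimal sketch using two hand-written, hypothetical outputs for the same request (they are not real model responses). An exact-match assertion of the kind used in unit tests rejects a valid paraphrase, while a looser property check accepts both. The helper names `exact_match` and `mentions_all` are illustrative only.

```python
# Two hand-written, hypothetical outputs for the same request (not real model responses).
run_1 = "• chief complaint: persistent cough"
run_2 = "• Chief complaint: a cough that is persistent"

expected = "• chief complaint: persistent cough"


def exact_match(output: str, reference: str) -> bool:
    """Traditional unit-test style check: the strings must be identical."""
    return output == reference


def mentions_all(output: str, required_terms: list[str]) -> bool:
    """Looser property check: every required term appears, case-insensitively."""
    lowered = output.lower()
    return all(term in lowered for term in required_terms)


# Exact match accepts run_1 but rejects the equally valid run_2.
print(exact_match(run_1, expected), exact_match(run_2, expected))
# The property check accepts both runs.
print(mentions_all(run_1, ["cough", "persistent"]), mentions_all(run_2, ["cough", "persistent"]))
```

A term check like this is still crude; it only illustrates why LLM outputs call for evaluation criteria rather than string equality.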

### Key Reasons for LLM Evaluation

1. **Quality Assurance**:
   - Conventional metrics (n-gram overlap, semantic similarity) are insufficient for complex LLM tasks
   - Multiple dimensions need to be assessed, such as factuality, coherence, and relevance
   - Hallucinations and factual inconsistencies need to be caught

2. **Safety & Alignment**:
   - Ensure outputs are safe and non-toxic
   - Verify adherence to ethical guidelines and business policies
   - Maintain alignment with intended use cases and user expectations

3. **Performance Monitoring**:
   - Track model performance across different tasks and domains
   - Identify areas needing improvement or fine-tuning
   - Compare different model versions or configurations

4. **Business Goals**:
   - Validate that outputs meet specific business requirements
   - Ensure cost-effective deployment of LLM solutions
   - Maintain quality standards for production systems
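
The point about conventional metrics can be illustrated with a small sketch. The function below is a simple unigram-overlap F1, written here for illustration rather than taken from any library, and the three sentences are made-up examples. A sentence with a wrong dosage shares almost every word with the reference, so it outscores a correct paraphrase.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over word overlap between a candidate and a reference string."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared words, respecting counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up sentences for illustration only.
reference = "the patient was prescribed 10 mg of lisinopril daily"
wrong_dose = "the patient was prescribed 100 mg of lisinopril daily"
paraphrase = "lisinopril 10 mg once a day was started"

print(round(unigram_f1(wrong_dose, reference), 3))   # factually wrong, high overlap
print(round(unigram_f1(paraphrase, reference), 3))   # factually right, lower overlap
```

Under these assumptions the overlap metric ranks the dangerous answer above the correct one, which is why factuality has to be assessed separately.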

![](./media/traditional_llm_eval.png)

In [5]:
medical_task = """
You are extracting insights from some medical records.
The records contain a medical note and a
dialogue between a doctor and a patient. You need
to extract values for the following: Chief
complaint, History of present illness, Physical
examination, symptoms experienced by the patient,
New medications prescribed or changed, including
dosages (N/A if not provided), and Follow-up
instructions (N/A if not provided). Your answer
should not include any personal identifiable
information (PII) such as name, age, gender, or
ID. Use "the patient" instead of their name, for
example. Return your answer as a bullet list,
where each bullet is formatted like •chief
complaint: xx. If there is no value for the key,
the value should be N/A. Keep your response
around 150 words (you may have to summarize some
extracted values to stay within the word limit).
{transcript}
"""

medical_system_prompt = """
You are a medical data extraction AI assistant. Your task is to accurately extract and summarize key medical information from patient records, adhering strictly to privacy guidelines and formatting instructions provided in the user's prompt. Focus on relevance and conciseness while ensuring all required fields are addressed.
"""

In [6]:
#TODO: Have a separate dataset called "unannotated_medical_data"
annotated_medical_data = weave.ref("weave:///a-sh0ts/medical_data_results/object/medical_data_annotations:7GcCtWgyPTWtKY48Z7v5VxwCNZXTTTpSMbmubAbyHT8").get()

In [8]:
from utils.render import print_dialogue_data



# Understanding Medical Data Extraction Evaluation

## The Task: What Are We Trying to Do?

### Raw Data Format
Medical conversations are messy and unstructured. Looking at our example data:

1. **Dialogue Format**:
   - Back-and-forth conversation between doctor and patient
   - Contains personal details, small talk, and medical information mixed together
   - Informal language ("hey", "mm-hmm", "yeah")
   - Important details scattered throughout

2. **Medical Notes**:
   - More structured but still in prose
   - Contains standardized sections (CHIEF COMPLAINT, HISTORY, etc.)
   - Includes sensitive information (names, ages)
   - Medical terminology and abbreviations

### Extraction Goals
The LLM needs to:
1. Find relevant information
2. Ignore irrelevant details
3. Standardize the format
4. Protect patient privacy
5. Maintain medical accuracy
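
As a rough sketch of the "standardize the format" goal, the helper below parses the bullet format requested in the task prompt (`•chief complaint: xx`) into a dictionary. The function name `parse_extraction` and the sample output are hypothetical, and real model output may need more forgiving parsing.

```python
def parse_extraction(output: str) -> dict[str, str]:
    """Parse lines like '•chief complaint: xx' into a {key: value} dict."""
    fields: dict[str, str] = {}
    for line in output.splitlines():
        # Drop surrounding whitespace and any leading bullet characters.
        line = line.strip().lstrip("•-* ").strip()
        if ":" not in line:
            continue  # skip blank lines and anything that is not "key: value"
        key, value = line.split(":", 1)
        # An empty value is treated as N/A, matching the prompt's convention.
        fields[key.strip().lower()] = value.strip() or "N/A"
    return fields


# Hand-written sample output for illustration only.
sample_output = """
•chief complaint: knee pain
•new medications: N/A
•follow-up instructions:
"""

parsed = parse_extraction(sample_output)
print(parsed)
```

Once the output is in a dictionary, later checks (required fields present, N/A used correctly) become simple key lookups.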

![](./media/medical_chatbot.png)

In [None]:
print_dialogue_data(annotated_medical_data)


## Annotation: Building Quality Training Data

### Why Annotate?
Raw LLM outputs aren't enough on their own; we need expert validation to:
1. Establish ground truth
2. Identify edge cases
3. Understand failure modes
4. Create evaluation standards

### Annotation Process
Medical experts should:

1. **Review the Full Context**:
   - Read entire conversation
   - Review medical notes
   - Understand complete patient story

2. **Evaluate LLM Output**:
   - Check factual accuracy
   - Verify completeness
   - Ensure privacy protection
   - Validate formatting

3. **Provide Structured Feedback**:
   - Binary score (pass/fail)
   - Written explanation
   - Specific issue identification
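
The structured feedback described above could be captured in a small record type. The schema below is a hypothetical sketch and is not the actual format of `annotated_medical_data`; the field names `example_id`, `passed`, `explanation`, and `issues` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One expert judgment on one LLM output (hypothetical schema)."""

    example_id: int
    passed: bool  # binary score: pass/fail
    explanation: str  # written rationale from the annotator
    issues: list[str] = field(default_factory=list)  # e.g. "completeness", "privacy"

    def __post_init__(self) -> None:
        # A bare "fail" is hard to act on, so require a rationale for failures.
        if not self.passed and not self.explanation.strip():
            raise ValueError("A failing annotation needs an explanation.")


ann = Annotation(
    example_id=2,
    passed=False,
    explanation="Dosage omitted.",
    issues=["completeness"],
)
print(ann)
```

Requiring an explanation for every failure is a design choice in this sketch: it keeps the annotations useful for understanding failure modes, not only for counting them.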

![](./media/annotation_ui.png)

In [None]:
# TODO: Replace with a dataset that includes human annotations
# TODO: Load this dataset as annotated_medical_data and the previous one as unannotated_medical_data
print_dialogue_data(annotated_medical_data, indexes_to_show=[2, 3])

In [None]:
annotated_medical_data[0]


## Evaluation: Measuring Performance

### Direct vs Pairwise
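For medical extraction, direct scoring fits better than pairwise (A/B) comparison:
- Each output is scored on its own against the source
- The task has objectively right and wrong answers
- Critical errors need to be caught even when an output beats an alternative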
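
A toy sketch of the difference, under loud assumptions: both judges below are deliberately simplistic stand-ins (a substring check and a length preference), and the outputs are hand-written. The pairwise judge always names a winner, even when both candidates contain the same kind of critical error, while direct scoring fails each one independently.

```python
def direct_score(output: str, required_fact: str) -> bool:
    """Direct scoring: judge one output on its own against the source."""
    return required_fact.lower() in output.lower()


def pairwise_preference(output_a: str, output_b: str) -> str:
    """Toy pairwise judge: prefers the longer, more detailed-looking output."""
    return "A" if len(output_a) >= len(output_b) else "B"


# Hand-written example: the source says 10 mg, and both outputs get it wrong.
required_fact = "10 mg"
output_a = "• new medications: lisinopril 20 mg daily, take with food"
output_b = "• new medications: lisinopril 50 mg"

print(pairwise_preference(output_a, output_b))  # a "winner" is still reported
print(direct_score(output_a, required_fact), direct_score(output_b, required_fact))
```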

### Key Evaluation Dimensions

1. **Factual Accuracy**:
   - Are extracted details correct?
   - Do they match the source?
   - Is medical terminology accurate?

2. **Completeness**:
   - All required fields present?
   - Important details included?
   - Appropriate use of N/A?

3. **Privacy Protection**:
   - PII properly removed?
   - Patient identity protected?
   - Sensitive details handled appropriately?

4. **Format Compliance**:
   - Bullet point structure followed?
   - Within word limit?
   - Clear and readable?
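
Three of these dimensions lend themselves to simple programmatic checks; factual accuracy generally needs a human or model-based judge and is left out here. The sketch below makes several assumptions: the field names in `REQUIRED_FIELDS` are guessed from the task prompt, `MAX_WORDS` is an arbitrary tolerance for "around 150 words", and `known_pii` is a hypothetical list of strings (such as the patient's name) taken from the source record.

```python
import re

# Assumed key names, derived from the fields listed in the task prompt.
REQUIRED_FIELDS = [
    "chief complaint",
    "history of present illness",
    "physical examination",
    "symptoms",
    "new medications",
    "follow-up instructions",
]
MAX_WORDS = 180  # assumed tolerance for "around 150 words"


def check_completeness(output: str) -> bool:
    """Every required field name appears somewhere in the output."""
    lowered = output.lower()
    return all(name in lowered for name in REQUIRED_FIELDS)


def check_privacy(output: str, known_pii: list[str]) -> bool:
    """None of the known PII strings appears as a whole word or phrase."""
    return not any(
        re.search(rf"\b{re.escape(term)}\b", output, flags=re.IGNORECASE)
        for term in known_pii
    )


def check_format(output: str) -> bool:
    """Every non-empty line is a bullet and the word count is within the limit."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    all_bullets = bool(lines) and all(ln.startswith("•") for ln in lines)
    return all_bullets and len(output.split()) <= MAX_WORDS


def score_output(output: str, known_pii: list[str]) -> dict[str, bool]:
    """Pass/fail per dimension for a single output."""
    return {
        "completeness": check_completeness(output),
        "privacy": check_privacy(output, known_pii),
        "format": check_format(output),
    }


# Hand-written outputs for illustration only.
good_output = "\n".join(f"• {name}: N/A" for name in REQUIRED_FIELDS)
bad_output = good_output.replace("chief complaint: N/A", "chief complaint: Jane Doe reports knee pain")

print(score_output(good_output, ["Jane Doe"]))
print(score_output(bad_output, ["Jane Doe"]))
```

Checks like these only catch surface problems. An output can pass all three and still misstate a dosage, which is why expert annotation remains the reference for accuracy.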

### Using Evaluation Results

Results help us:
1. Improve prompts
2. Identify system limitations
3. Set quality standards
4. Monitor performance
5. Train better models
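
One simple way to turn per-example results into action is to count failures per dimension and start with the most frequent one. The sketch below assumes each result is a dictionary mapping a dimension name to pass/fail; the three results shown are made up for illustration.

```python
from collections import Counter


def failure_counts(results: list[dict[str, bool]]) -> Counter:
    """Count how many examples failed each evaluation dimension."""
    counts: Counter = Counter()
    for result in results:
        for dimension, passed in result.items():
            if not passed:
                counts[dimension] += 1
    return counts


# Made-up results for three examples.
results = [
    {"completeness": True, "privacy": False, "format": True},
    {"completeness": False, "privacy": False, "format": True},
    {"completeness": True, "privacy": True, "format": True},
]

counts = failure_counts(results)
print(counts.most_common())  # dimensions ordered by number of failures
```

In this made-up case privacy fails most often, which would suggest revising the PII instructions in the prompt before anything else.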

![](./media/eval_task_flowchart.png)