
@flexpa/llm-fhir-eval

Note: Follow the development progress on FHIR Chat.

Overview

@flexpa/llm-fhir-eval is an evaluation framework for benchmarking the performance of LLMs on FHIR-specific tasks, including generation, validation, and extraction. The framework systematically tests and validates LLM capabilities across healthcare interoperability tasks, checking whether models meet the standards required for effective FHIR implementations. It implements evaluations from prior art such as FHIR-GPT.

Benchmark

@flexpa/llm-fhir-eval benchmarks FHIR-specific tasks including:

  1. FHIR Resource Generation:

    • Generate accurate FHIR resources such as Patient, Observation, and MedicationStatement.
    • Test the ability to create complex resource relationships and validate terminology bindings.
  2. FHIR Resource Validation:

    • Validate FHIR resources using operations like $validate.
    • Check schema compliance, the presence of required fields, and value set bindings.
  3. Data Extraction:

    • Extract structured FHIR-compliant data from clinical notes and other unstructured data.
    • Evaluate the proficiency of LLMs in extracting specific healthcare data elements.
  4. Tool Use:

    • Test models' ability to use FHIR validation tools and other healthcare-specific functions.
    • Validate proper tool calling for FHIR operations.

Available Evaluations

  1. Data Extraction (evals/extraction/)

    • Description: Comprehensive evaluation of extracting structured FHIR data from unstructured clinical text.
    • Configurations: minimalist and specialist variants.
    • Test categories: Basic demographics, conditions, explanations of benefit, medication requests, observations.
  2. FHIR Resource Generation (evals/generation/)

    • Description: Tests the ability to generate valid FHIR resources and bundles.
    • Configurations: zero-shot bundle generation and multi-turn tool use scenarios (a configuration sketch follows this list).
    • Models supported: GPT-3.5 Turbo, GPT-4.1, o3 (low/high reasoning effort), Claude 3.5 Haiku, Claude 3.5 Sonnet, Claude Sonnet 4, Claude Opus 4.
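
For orientation, a promptfoo configuration ties prompts, providers, and per-test assertions together. Below is a minimal sketch in the JavaScript config style of config-multi-turn-tool-use.js; the prompt, provider IDs, and file paths are illustrative placeholders rather than the repository's actual configuration:

// Minimal sketch of a promptfoo JavaScript config. The prompt, provider IDs,
// and assertion path are placeholders; isBundle.mjs is described below.
module.exports = {
  prompts: ['Generate a valid FHIR R4 Bundle for: {{description}}'],
  providers: ['openai:gpt-4.1', 'anthropic:messages:claude-3-5-sonnet-latest'],
  tests: [
    {
      vars: { description: 'a 54-year-old female with type 2 diabetes' },
      assert: [
        // Custom JavaScript assertion (see Custom Assertions below).
        { type: 'javascript', value: 'file://assertions/isBundle.mjs' },
      ],
    },
  ],
};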

Custom Assertions

The framework includes custom assertion functions; a sketch of the assertion interface follows this list:

  • fhirPathEquals.mjs: Checks that a FHIRPath expression evaluates to an expected value
  • isBundle.mjs: Checks if output is a valid FHIR Bundle
  • metaElementMissing.mjs: Checks generated output for the meta element
  • validateOperation.mjs: Validates FHIR operation results
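
In promptfoo, a custom JavaScript assertion is a module whose default export receives the model output (plus a context object) and returns a grading result. A minimal sketch of what a check like isBundle.mjs could look like under that interface; it is illustrative, not the repository's exact implementation:

// Sketch of a custom promptfoo assertion: the default export receives the raw
// model output and returns { pass, score, reason }.
export default function isBundle(output) {
  try {
    const resource = JSON.parse(output);
    const pass = resource.resourceType === 'Bundle';
    return {
      pass,
      score: pass ? 1 : 0,
      reason: pass
        ? 'Output is a FHIR Bundle'
        : `Expected a Bundle, got resourceType "${resource.resourceType}"`,
    };
  } catch {
    return { pass: false, score: 0, reason: 'Output is not valid JSON' };
  }
}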

Tools

  • validateFhirBundle.mjs: Tool for validating FHIR Bundle resources
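
The standard FHIR mechanism for this is the $validate operation: POST a resource to [base]/Bundle/$validate and inspect the returned OperationOutcome. Below is a sketch of a helper in that spirit, assuming a reachable R4 server; the base URL and return shape are placeholders, not the repository's implementation:

// Sketch of a bundle-validation helper built on FHIR's standard $validate
// operation. FHIR_BASE is a placeholder for any R4 server exposing $validate.
const FHIR_BASE = 'https://example.org/fhir';

export async function validateFhirBundle(bundle) {
  const res = await fetch(`${FHIR_BASE}/Bundle/$validate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/fhir+json' },
    body: JSON.stringify(bundle),
  });
  // The server returns an OperationOutcome; issue[] holds errors and warnings.
  const outcome = await res.json();
  return outcome.issue ?? [];
}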

Custom Providers

  • AnthropicMessagesWithRecursiveToolCallsProvider.ts: Enhanced Anthropic provider with recursive tool calling (up to 10 levels deep)
  • OpenAiResponsesWithRecursiveToolCallsProvider.ts: Enhanced OpenAI provider with recursive tool calling

These providers enable multi-turn tool interactions where models can iteratively call validation tools to improve their FHIR resource generation.
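
Conceptually, the loop works like this: send the conversation with tool definitions; whenever the model stops to request a tool, execute it, append the result, and call the model again until it produces a final answer or hits the depth cap. A sketch of that loop against the Anthropic Messages API (the model alias and the runTool dispatcher are placeholders; the actual providers differ in detail):

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const MAX_DEPTH = 10; // mirrors the providers' 10-level recursion cap

// Placeholder dispatcher; a real provider would route to tools such as validateFhirBundle.
async function runTool(name, input) {
  return { error: `No tool registered: ${name}` };
}

async function runWithTools(messages, tools, depth = 0) {
  const response = await client.messages.create({
    model: 'claude-3-5-sonnet-latest', // placeholder alias
    max_tokens: 4096,
    messages,
    tools,
  });

  // Done when the model gives a final answer or the recursion cap is reached.
  if (response.stop_reason !== 'tool_use' || depth >= MAX_DEPTH) return response;

  // Execute each requested tool and package the results as tool_result blocks.
  const toolResults = [];
  for (const block of response.content) {
    if (block.type === 'tool_use') {
      const result = await runTool(block.name, block.input);
      toolResults.push({
        type: 'tool_result',
        tool_use_id: block.id,
        content: JSON.stringify(result),
      });
    }
  }

  // Recurse with the assistant turn and tool results appended to the transcript.
  return runWithTools(
    [...messages, { role: 'assistant', content: response.content }, { role: 'user', content: toolResults }],
    tools,
    depth + 1,
  );
}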

Commands to Run Evaluations

Install dependencies and set up environment variables:

yarn install

Copy the .env.template file to .env and supply your API keys for the models you plan to test.
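
For example (the variable names follow the OpenAI and Anthropic SDK conventions that promptfoo reads by default; the values are placeholders):

cp .env.template .env

# .env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...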

Run an evaluation:

# Example: Run the extraction evaluation with minimalist config
promptfoo eval -c evals/extraction/config-minimalist.yaml

# Example: Run the FHIR bundle generation evaluation
promptfoo eval -c evals/generation/config-zero-shot-bundle.yaml

# Example: Run multi-turn tool use evaluation
promptfoo eval -c evals/generation/config-multi-turn-tool-use.js

The evaluation prints performance metrics to the console and can optionally write results to files.
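
To persist results, promptfoo's standard output flag and local results viewer can be used (these are general promptfoo CLI features, not specific to this repository):

# Write results to a JSON file
promptfoo eval -c evals/extraction/config-minimalist.yaml -o results.json

# Browse past eval runs in a local web UI
promptfoo view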