This service provides robust preprocessing for FHIR ePI (electronic Product Information) bundles, supporting extraction, validation, and modification of both FHIR extensions and embedded HTML content. It is designed for pharmaceutical and healthcare data pipelines, enabling safe and flexible manipulation of FHIR R4 resources, including custom extensions and XHTML narratives.
- FhirEPI model for structured ePI bundle parsing
- HtmlElementLink extension management (CRUD operations)
- Recursive HTML content extraction from nested composition sections
- HTML content extraction, analysis, and modification (from
Composition.text.div) - Section-level HTML manipulation with automatic nesting level detection
- Standalone test runners for all major components
- Dockerized deployment with Python 3.12 and OpenAPI integration
- Extensive documentation and examples
- Python 3.12+
- (Optional) Docker
- (Optional) Virtual environment
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txtpython test_html_element_link_standalone.py
python test_html_content_manager_standalone.pydocker build -t preprocessing-service .
docker run --rm -p 8080:8080 preprocessing-serviceNote: The Docker image uses Python 3.12-alpine for compatibility with the service dependencies.
docker run --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i https://raw.githubusercontent.com/Gravitate-Health/preprocessing-service-example/refs/heads/main/openapi.yaml -g python-flask -o /local/ --additional-properties=packageName=preprocessorImages are published automatically to GitHub Container Registry (GHCR) by CI for this repo and any forks.
- Canonical image for this repo:
ghcr.io/gravitate-health/preprocessing-service-example-python - For forks:
ghcr.io/<your-github-username-or-org>/<your-fork-repo>
docker pull ghcr.io/gravitate-health/preprocessing-service-example-python:main
# or a release tag, when available
docker pull ghcr.io/gravitate-health/preprocessing-service-example-python:v1.0.0docker run --rm -p 8080:8080 ghcr.io/gravitate-health/preprocessing-service-example-python:mainIf the image is private or your org requires auth, login to GHCR:
$env:CR_PAT = "<YOUR_GH_PAT_WITH_packages:read>"
$env:CR_PAT | docker login ghcr.io -u <YOUR_GH_USERNAME> --password-stdin- See
openapi/openapi.yamlfor full specification. - Main endpoint:
/preprocess(accepts FHIR Bundle, returns processed bundle)
Purpose: Manage FHIR HtmlElementLink extensions in Composition resources.
Key Functions:
list_html_element_links(composition)add_html_element_link(composition, element_class, concept, replace_if_exists=False)remove_html_element_link(composition, element_class)get_html_element_link(composition, element_class)
Example Usage:
from preprocessor.models.html_element_link_manager import (
list_html_element_links, add_html_element_link, remove_html_element_link
)
# List all HtmlElementLink extensions
links = list_html_element_links(composition)
# Add a new HtmlElementLink
add_html_element_link(composition, 'section-title', {'code': 'title', 'display': 'Section Title'})
# Remove an HtmlElementLink by class
remove_html_element_link(composition, 'section-title')Purpose: Extract, analyze, and modify HTML content in FHIR Composition text.div and nested sections.
Key Functions:
get_html_content(composition)- Extract HTML from composition text.divget_all_html_content(composition)- Recursively extract HTML from all sections and subsectionsupdate_html_content(composition, new_html)- Update composition HTMLupdate_section_html(sections, section_title, new_html, recursive=True)- Update specific section HTMLextract_all_html_from_sections(sections)- Recursively extract all section HTML with metadatafind_elements_by_class(html, class_name)- Find elements by CSS classreplace_html_section(html, start_marker, end_marker, replacement)- Replace HTML sectionsget_html_structure_summary(html)- Analyze HTML structure
Example Usage:
from preprocessor.models.html_content_manager import (
get_html_content, get_all_html_content, update_section_html,
find_elements_by_class, get_html_structure_summary
)
# Extract HTML from composition
html = get_html_content(composition)
# Recursively extract ALL HTML content from nested sections
all_html = get_all_html_content(composition)
print(f"Total sections: {all_html['total_sections']}")
print(f"Max nesting level: {all_html['max_nesting_level']}")
for section in all_html['sections']:
print(f"Level {section['level']}: {section['title']}")
print(f" HTML length: {len(section['html'])} chars")
print(f" Has subsections: {section['has_subsections']}")
# Update a specific section by title (searches recursively)
update_section_html(
composition['section'],
'What is in this leaflet',
'<div>New content</div>',
recursive=True
)
# Find all elements with a specific class
elements = find_elements_by_class(html, 'epi-section')
# Get a summary of the HTML structure
summary = get_html_structure_summary(html)
print(summary)FhirEPI Model Integration:
from preprocessor.models.fhir_epi import FhirEPI
# Load and parse ePI bundle
epi = FhirEPI.from_dict(bundle_dict)
# Extract all HTML content including nested sections
all_html = epi.get_all_html_content()Standalone test runners are provided for all major components:
test_html_element_link_standalone.py- Test HtmlElementLink extension managementtest_html_content_manager_standalone.py- Test HTML content operationstest_recursive_html_extraction.py- Test recursive HTML extraction from nested sections
Run with:
python3 test_html_element_link_standalone.py
python3 test_html_content_manager_standalone.py
python3 test_recursive_html_extraction.pyComprehensive examples demonstrating all features:
example_recursive_html_extraction.py- Recursive HTML extraction examples with 5 different use cases
Run examples:
python3 example_recursive_html_extraction.pySee the following markdown files for detailed guides and examples:
HTMLELEMENTLINK_GUIDE.mdHTMLELEMENTLINK_QUICK_REFERENCE.mdHTMLELEMENTLINK_EXAMPLES.mdHTMLELEMENTLINK_SUMMARY.mdHTMLELEMENTLINK_INDEX.mdHTMLELEMENTLINK_DELIVERABLES.md
See LICENSE for details.