Composable Rust crates for clinical data engineering.
Architecture · Roadmap · Contributing
clinical-rs is a Cargo workspace containing three independent crates for working with clinical healthcare data in Rust:
| Crate | Purpose | Status |
|---|---|---|
| `medcodes` | Medical code ontologies, hierarchy traversal, and cross-system mapping (ICD-10, ATC, LOINC, SNOMED CT, etc.) | 🚧 Pre-release |
| `mimic-etl` | MIMIC-III/IV CSV parser → Apache Arrow RecordBatches with memory-mapped I/O and parallel processing | 🚧 Pre-release |
| `clinical-tasks` | Task windowing engine: transforms clinical event streams into ML-ready (features, label) Arrow tables | 🚧 Pre-release |
Each crate publishes independently to crates.io and can be used standalone. Together, they form an end-to-end pipeline from raw clinical data to model-ready datasets.
Clinical ML data pipelines are bottlenecked by data loading, not model training. Python-based tools like PyHealth and pandas struggle with memory pressure and parallelism on large datasets like MIMIC-IV (300K+ patients, tens of millions of events).
clinical-rs targets that bottleneck:
- Arrow-native: every crate speaks Apache Arrow as its interchange format. Zero-copy interop with PyArrow, Polars, DataFusion, DuckDB, and Spark.
- Streaming-first: all ETL crates emit `RecordBatch` iterators, not materialized collections. The same code path works for batch (collect → Parquet) and streaming (process → infer → emit).
- Parallel by default: `rayon`-based work-stealing parallelism without Python's GIL. Memory-mapped I/O via `memmap2` for datasets larger than RAM.
- Composable, not monolithic: use `medcodes` alone for code lookups, `mimic-etl` alone for data loading, or wire them together through `clinical-tasks`.
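The streaming-first point can be pictured with a small standard-library sketch. Everything here is a stand-in: `Batch`, `BatchIter`, `batches`, and `demo` are hypothetical names, and `Batch` is a plain `Vec` wrapper rather than an Arrow `RecordBatch`. The sketch only shows how a single lazy iterator can serve both the batch and the streaming path.

```rust
/// Hypothetical stand-in for an Arrow RecordBatch: just a chunk of rows.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    rows: Vec<u32>,
}

/// Lazily groups rows from any iterator into fixed-size batches.
struct BatchIter<I: Iterator<Item = u32>> {
    rows: I,
    batch_size: usize,
}

impl<I: Iterator<Item = u32>> Iterator for BatchIter<I> {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        // Pull at most `batch_size` rows; only this chunk is held in memory.
        let rows: Vec<u32> = self.rows.by_ref().take(self.batch_size).collect();
        if rows.is_empty() {
            None
        } else {
            Some(Batch { rows })
        }
    }
}

fn batches<I: Iterator<Item = u32>>(rows: I, batch_size: usize) -> BatchIter<I> {
    assert!(batch_size > 0, "batch_size must be positive");
    BatchIter { rows, batch_size }
}

/// Runs the same iterator in both modes and returns (batch count, streamed sum).
fn demo() -> (usize, u32) {
    // Batch mode: materialize every batch, e.g. before writing one file.
    let all: Vec<Batch> = batches(0..10, 4).collect();

    // Streaming mode: handle each batch as it is produced, then drop it.
    let mut total = 0u32;
    for batch in batches(0..10, 4) {
        total += batch.rows.iter().sum::<u32>();
    }

    (all.len(), total)
}
```

Because the producer is an ordinary `Iterator`, the caller decides whether to collect or to process incrementally; the producer code does not change.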
Add the crate(s) you need:
```toml
# Cargo.toml
[dependencies]
medcodes = "0.1"        # medical code ontologies
mimic-etl = "0.1"       # MIMIC-III/IV → Arrow
clinical-tasks = "0.1"  # task windowing for ML
```

```rust
use medcodes::icd10cm::Icd10Cm;

let code = Icd10Cm::lookup("A41.9")?;    // Sepsis, unspecified organism
let ancestors = code.ancestors();        // ["A41", "A30-A49", "A00-B99"]
let description = code.description();    // "Sepsis, unspecified organism"
```

```rust
use medcodes::crossmap::CrossMap;

let icd_to_ccs = CrossMap::load(System::Icd10Cm, System::CcsCm)?;
let mapped = icd_to_ccs.map("A41.9")?;   // ["2"] (CCS category: Septicemia)
```

```rust
use mimic_etl::Mimic4Dataset;

let dataset = Mimic4Dataset::open("path/to/mimic-iv/")?;
let batches = dataset
    .tables(&["diagnoses_icd", "prescriptions", "labevents"])
    .into_event_stream()?;               // Iterator<Item = RecordBatch>

// Write to Parquet
mimic_etl::to_parquet(batches, "output/events.parquet")?;
```

```rust
use std::fs::File;

use arrow::ipc::reader::FileReader;
use clinical_tasks::{MortalityPrediction, TaskConfig};

// The second argument is an optional column projection; `None` reads all columns.
let events = FileReader::try_new(File::open("events.arrow")?, None)?;
let task = MortalityPrediction::new(TaskConfig {
    observation_window: Duration::hours(48),
    prediction_window: Duration::hours(24),
    ..Default::default()
});
let samples = task.apply(events)?;       // Iterator<Item = RecordBatch> with features + label columns
```

Note: API examples are illustrative and will evolve before the 0.1.0 release.
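To make the `observation_window` and `prediction_window` settings concrete, here is a minimal standard-library sketch of one plausible windowing rule. It is not the `clinical-tasks` implementation: `Event`, `Sample`, and `mortality_sample` are hypothetical names, times are plain hours since admission, and the choice to skip stays where death falls inside the observation window is an assumption made for this illustration.

```rust
/// Hypothetical simplified event: hours since admission plus one numeric value.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    hour: i64,
    value: f64,
}

/// One (features, label) pair for a single stay.
#[derive(Debug, PartialEq)]
struct Sample {
    features: Vec<f64>,
    label: bool,
}

/// Builds a sample under these assumed rules:
/// - features come from events in [0, observation_hours)
/// - label is true if death falls in
///   [observation_hours, observation_hours + prediction_hours)
/// - stays with death inside the observation window are skipped (assumption)
fn mortality_sample(
    events: &[Event],
    death_hour: Option<i64>,
    observation_hours: i64,
    prediction_hours: i64,
) -> Option<Sample> {
    let obs_end = observation_hours;
    let pred_end = obs_end + prediction_hours;

    // No valid prediction target if the outcome already happened.
    if let Some(d) = death_hour {
        if d < obs_end {
            return None;
        }
    }

    // Only events inside the observation window may become features,
    // which keeps information from the prediction window out of the inputs.
    let features: Vec<f64> = events
        .iter()
        .filter(|e| e.hour >= 0 && e.hour < obs_end)
        .map(|e| e.value)
        .collect();

    let label = matches!(death_hour, Some(d) if d >= obs_end && d < pred_end);

    Some(Sample { features, label })
}
```

With a 48-hour observation window and a 24-hour prediction window, an event at hour 50 is excluded from the features, and a death at hour 60 yields a positive label because it falls between hours 48 and 72.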
- Arrow is the contract. Crates communicate via Arrow RecordBatches. No custom serialization formats, no framework lock-in.
- Each crate stands alone. `medcodes` has zero dependencies on `mimic-etl`. A consumer building a FHIR pipeline can use `medcodes` + `clinical-tasks` without ever touching MIMIC data.
- Correctness over cleverness. Medical code mappings are validated against official source files (CMS, WHO, NLM). Wrong mappings in clinical contexts cause harm.
- No model training. This project handles everything before and after the GPU. Train models in PyTorch/JAX, export to ONNX, run inference in Rust via the `ort` crate.
```
clinical-rs/
├── crates/
│   ├── medcodes/          # Medical code ontologies + cross-mapping
│   │   ├── src/
│   │   ├── data/          # Embedded code tables (build-time)
│   │   └── Cargo.toml
│   ├── mimic-etl/         # MIMIC-III/IV → Arrow ETL
│   │   ├── src/
│   │   └── Cargo.toml
│   └── clinical-tasks/    # Task windowing engine
│       ├── src/
│       └── Cargo.toml
├── ARCHITECTURE.md
├── TODO.md
├── CONTRIBUTING.md
├── LICENSE-MIT
├── LICENSE-APACHE
└── Cargo.toml             # Workspace manifest
```
| Tool | Language | Focus | How clinical-rs differs |
|---|---|---|---|
| PyHealth | Python | End-to-end clinical ML toolkit (data + models + training) | We do data only: faster, Arrow-native, no model training |
| MedModels | Rust + Python | Graph-based RWE analysis (treatment effects, propensity matching) | We use columnar/Arrow, not graph. ML data loading, not RWE analytics |
| MEDS | Python | Medical event data standard | Complementary: we could emit MEDS-compatible schemas |
- Rust 1.94+ (2024 edition)
- MIMIC-III/IV access via PhysioNet credentialed access (for `mimic-etl`)
Dual-licensed under MIT and Apache 2.0, at your option.
If you use clinical-rs in academic work, please cite:
```bibtex
@software{clinical_rs,
  author  = {Kresna Sucandra},
  title   = {clinical-rs: Composable Rust crates for clinical data engineering},
  url     = {https://github.com/SHA888/clinical-rs},
  license = {MIT OR Apache-2.0},
}
```