Automating SDTM Mapping with Artificial Intelligence and Large Language Models
A Narrative Review
The Problem • Regulatory Landscape • Architecture • Key Findings • Research Program • Citation
The CDISC Study Data Tabulation Model (SDTM) is mandated by the U.S. Food and Drug Administration for regulatory submissions of clinical trial data. Manual SDTM mapping — the process of transforming raw study data into SDTM-conformant datasets — remains the single most resource-intensive bottleneck in the clinical data pipeline:
- 35–45 programmer hours per dataset
- 6–8 weeks end-to-end per study across 15–25+ domains
- $50,000–$200,000+ estimated cost per study (industry extrapolation; not empirically validated)
- Each day of submission delay carries significant commercial implications for pharmaceutical sponsors
This narrative review provides a critical assessment of automation technologies applied to this challenge across three technology generations, synthesizing the highest-quality available evidence while explicitly identifying where the field’s promise outpaces its published proof.
| Attribute | Value |
|---|---|
| Word Count | ~7,000 (excluding references and tables) |
| References | 30 (4 peer-reviewed journals, 7 regulatory documents, 6 conference proceedings, 4 vendor white papers, 3 preprints, 3 industry commentary, 3 open-source repositories) |
| Tables | 3 |
| Systems Reviewed | 8 (across 3 technology generations) |
| Evidence Quality Tiers | Peer-reviewed → Conference → Vendor claims (explicitly labeled throughout) |
- Regulatory affairs professionals evaluating AI-assisted submission workflows
- Clinical data standards architects designing SDTM automation strategies
- Pharmaceutical technology leaders assessing build-vs-buy decisions for SDTM tooling
- Academic researchers in clinical informatics and biomedical NLP
- Legal and compliance counsel reviewing AI deployment in regulated data pipelines
This work intersects the following regulatory and standards instruments:
| Instrument | Specific Provision | Relevance to SDTM Automation |
|---|---|---|
| FDA Federal Register 2023-27310 | 88 FR 86336–86337 (Dec 13, 2023) | Mandates SDTM v2.0 / SDTMIG v3.4 for NDAs, ANDAs, BLAs, INDs effective March 15, 2025 |
| FDA Study Data TCG v5.9 | October 2024 revision | Governs regulatory implementation of SDTM; conformance validation requirements |
| CDISC Controlled Terminology | Quarterly releases (most recent: March 27, 2026) | ~1,124 new terms per release; all automation systems must track evolving vocabulary |
| CDISC Dataset-JSON v1.1 | Published December 5, 2024 | Intended replacement for legacy SAS XPT transport format |
| FDA Draft AI Guidance | January 2025 | Transparency, explainability, and documentation framework for AI in drug development |
| 21 CFR Part 11 | Electronic Records; Electronic Signatures | Audit trail requirements applicable to AI-generated SDTM datasets |
| EU AI Act | Regulation 2024/1689 (in force August 1, 2024) | Potential high-risk classification for clinical data standardization tools |
| CDISC 360i | Public kickoff February 18, 2025 | Machine-executable metadata via formal ontologies; could shift paradigm from pattern recognition to rule execution |
The review organizes existing approaches into three generations based on their core computational paradigm — a classification reflecting chronological progression already evident in the literature rather than a novel taxonomy proposed by this paper.
flowchart LR
subgraph G1["Generation 1 — Metadata-Driven"]
G1A["SAS Macro Libraries"]
G1B["Metadata Repositories"]
G1C["Template Reuse"]
end
subgraph G2["Generation 2 — Supervised Machine Learning"]
G2A["Binary/Multiclass<br/>Classification"]
G2B["Siamese Neural<br/>Networks"]
G2C["Embedding Similarity<br/>Matching"]
end
subgraph G3["Generation 3 — LLM / Agentic AI"]
G3A["Retrieval-Augmented<br/>Generation (RAG)"]
G3B["Multi-Agent<br/>Orchestration"]
G3C["GPT-Based Code<br/>Generation"]
end
G1 -->|"Limitations:<br/>Inflexible to<br/>study variability"| G2
G2 -->|"Limitations:<br/>Requires large<br/>training corpus"| G3
style G1 fill:#1a5276,color:#ffffff
style G2 fill:#1a5276,color:#ffffff
style G3 fill:#1a5276,color:#ffffff
| Feature | Gen 1: Metadata-Driven | Gen 2: Machine Learning | Gen 3: LLM / Agentic AI |
|---|---|---|---|
| Core Technique | Reusable SAS macros; metadata repositories | Supervised classification; Siamese networks; embedding similarity | RAG; multi-agent orchestration; GPT-based code generation |
| Representative Systems | Pinnacle 21 + Formedix; sponsor macro libraries | JETConvert (Bioforum); Siamese classifier (Medidata); AbbVie AI | Saama SDTM Navigator; PHUSE two-stage RAG; PharmaSUG China GPT framework |
| Domain Accuracy | N/A (template-based) | 75.1% (Bioforum, 41 trials) → >90% (Medidata) | ~70% initial accuracy (Saama) |
| Variable Accuracy | 40–60% reuse rate | 83.8% (Bioforum, 61 trials) | Confidence-scored; benchmarks not yet published |
| Time Reduction | Moderate (reuse savings) | Moderate (training corpus required) | 50–70% reported (Saama) |
| Primary Limitation | Cannot adapt to study-specific deviations | Poor on rare/underrepresented domains | Hallucination risk; limited peer-reviewed validation |
| Evidence Quality | Industry documentation | 1 peer-reviewed journal article + white papers | Conference papers + vendor materials only |
Critical observation: Evidence quality decreases as technology generation increases. The only peer-reviewed journal publication among SDTM-specific automation systems is the Medidata Siamese network paper (Generation 2, PLoS ONE 2024). All Generation 3 claims rest on conference papers, vendor materials, and preprints.
flowchart TB
subgraph Source["Source Data Environment"]
EDC["EDC System<br/>(e.g., Medidata Rave, Veeva)"]
LAB["Central Lab Data"]
ECG["ECG Vendor Data"]
EPRO["eCOA / ePRO"]
end
subgraph Mapping["Mapping Function M: S → T"]
DOMAIN["Domain Identification<br/>(DM, AE, LB, VS, EX...)"]
VAR["Variable-Level Mapping<br/>(carry-forward, reformat,<br/>combine, derive, recode)"]
CT["Controlled Terminology<br/>Alignment (quarterly CT)"]
DERIVE["Derivation Logic<br/>(STUDYDAY, EPOCH,<br/>baseline flags)"]
end
subgraph Output["SDTM-Conformant Output"]
SDTM["SDTM Datasets<br/>(XPT or Dataset-JSON)"]
SPEC["Mapping Specifications"]
ACRF["Annotated CRF"]
VAL["Pinnacle 21 / CDISC CORE<br/>Conformance Validation"]
end
EDC --> DOMAIN
LAB --> DOMAIN
ECG --> DOMAIN
EPRO --> DOMAIN
DOMAIN --> VAR --> CT --> DERIVE --> SDTM
DERIVE --> SPEC
DERIVE --> ACRF
SDTM --> VAL
style Source fill:#2c3e50,color:#ffffff
style Mapping fill:#1a5276,color:#ffffff
style Output fill:#1e8449,color:#ffffff
| System | Metric | Value | Evidence Quality |
|---|---|---|---|
| Bioforum JETConvert (41 trials) | Domain prediction accuracy | 75.1% | White paper |
| Bioforum JETConvert (61 trials) | Variable mapping accuracy | 83.8% | White paper |
| Bioforum JETConvert (0.30 conf.) | Precision / Recall | 91.4% / 63.7% | White paper |
| Medidata Siamese Network | Overall classification accuracy | >90% | Peer-reviewed (PLoS ONE) |
| Medidata Siamese Network | Improvement over TF-IDF + XGBoost | ~15 percentage points | Peer-reviewed (PLoS ONE) |
| Saama SDTM Navigator | Auto-generated mapping proportion | ~75% | Vendor webinar |
| Saama SDTM Navigator | Initial mapping accuracy | ~70% | Vendor webinar |
| General Clinical NLP LLMs | F1-score (NER, classification, QA) | 88–93% | medRxiv preprint |
- Validation sparsity — Only 2.3% of published clinical programming validation studies provide quantitative effectiveness data
- No standardized benchmark — No multi-sponsor, multi-therapeutic-area SDTM mapping benchmark dataset exists
- Unquantified hallucination risk — Zero published studies measure LLM hallucination rates for SDTM-specific tasks
- Reproducibility barrier — All training datasets in reviewed systems are proprietary; no independent replication is possible
- Standard evolution risk — SDTMIG v4.0 (public review late 2025) will require retraining of all current automation systems
This manuscript is a narrative review, not a systematic review. No formal protocol was registered. Source selection reflects the author’s knowledge of the field and is subject to selection bias. No formal inclusion/exclusion criteria were applied. No novel systems, algorithms, benchmarks, or evaluation protocols are proposed. All content represents synthesis, organizational grouping, and critical appraisal of existing published work. The three-generation classification reflects chronological patterns already evident in the literature. This paper should not be used to guide clinical practice or regulatory decisions.
sdtm-mapping-ai/
├── README.md # This file
├── LICENSE
├── CITATION.cff
└── manuscript/
└── sdtm_mapping_ai_review.md # Full manuscript (~7,000 words)
This paper is part of a four-paper independent research program examining AI deployment in regulated clinical data environments:
flowchart TB
SDTM["① SDTM Mapping AI<br/>THIS REPOSITORY<br/>Three generations of<br/>automation technology"]
HALL["② Hallucination Mitigation<br/>Defense-in-depth RAG<br/>for clinical extraction"]
AGENT["③ Multi-Agent Architectures<br/>Agentic AI frameworks for<br/>clinical data management"]
PRIV["④ Privacy-Preserving RAG<br/>On-premises deployment<br/>for HIPAA/GDPR compliance"]
SDTM -->|"Identifies hallucination<br/>as key Gen 3 risk"| HALL
SDTM -->|"Maps to multi-agent<br/>pipeline stage 5"| AGENT
HALL -->|"Provides mitigation<br/>architecture for agents"| AGENT
AGENT -->|"Requires data sovereignty<br/>for patient-level data"| PRIV
PRIV -->|"Enables on-premises<br/>RAG for SDTM tasks"| SDTM
style SDTM fill:#e74c3c,color:#ffffff,stroke:#c0392b
style HALL fill:#2c3e50,color:#ffffff
style AGENT fill:#2c3e50,color:#ffffff
style PRIV fill:#2c3e50,color:#ffffff
| # | Repository | Focus |
|---|---|---|
| 1 | sdtm-mapping-ai | Landscape review: what exists, what works, what’s missing |
| 2 | clinical-rag-hallucination | Solving the hallucination problem identified as Gen 3’s primary risk |
| 3 | clinical-query-agent | Operationalizing multi-agent orchestration for clinical data workflows |
| 4 | privacy-rag-onprem | Deploying RAG within institutional security perimeters for U.S. data sovereignty |
@article{arogyasami2026sdtm,
title = {Automating SDTM Mapping with Artificial Intelligence and Large
Language Models: A Narrative Review},
author = {Arogyasami, DanielMartin},
year = {2026},
note = {Preprint — not yet peer reviewed},
url = {https://github.com/DanielMartin-Arogyasami/sdtm-mapping-ai}
}DanielMartin Arogyasami — Enterprise Clinical AI & Data Architect
This manuscript is shared for academic discussion and independent scholarly review. See for terms.
This work was conducted independently and does not reflect, utilize, or derive from any employer’s internal systems, proprietary data, or strategic initiatives. All sources are publicly accessible at the URLs provided in the reference list. The author declares no financial or non-financial conflicts of interest related to any SDTM automation vendor, product, or service discussed in this review. No funding was received.
Keywords: SDTM · CDISC · clinical data standardization · artificial intelligence · large language models · machine learning · retrieval-augmented generation · FDA regulatory submissions · clinical trial automation · data harmonization · 21 CFR Part 11