NARS-Guided Transformer Attention for clinical transformers under extreme missingness
TL;DR: NGTA is a clinical transformer that does not just rank patients; it tries to tell the truth about its own uncertainty. It estimates epistemic uncertainty with MC Dropout, heuristically converts that uncertainty into initial NARS-style truth values, injects explicit human-written medical rules at inference time, and feeds the revised confidence back into attention so brittle evidence is downweighted before the final prediction is made.
NGTA is a neurosymbolic clinical prediction architecture that maps neural uncertainty into NARS truth values and feeds revised confidence back into Transformer attention during inference. The repository now supports two benchmarks in parallel:
- `tcga`: TCGA-THCA lymph node metastasis prediction from merged clinical tables plus a mutation-derived binary gene panel
- `wids`: WiDS Datathon 2020 ICU hospital mortality prediction from a high-missingness ICU tabular cohort
The research paper lives in `paper/main.pdf`, with source in `paper/main.tex`.
In an email exchange on April 21, 2026, Pei Wang pointed out two conceptual issues that now shape this repository:
- Statistical variance is not itself NARS evidence amount or native NARS confidence. In NGTA, MC-dropout variance is now described explicitly as a heuristic initializer for neural confidence that can later be revised by symbolic evidence.
- The strong-deduction confidence calculation in the manuscript needed to match the standard NAL rule rather than the custom form previously written in the paper.
Repository updates made from that feedback:
- `src/nars_interface.py` now exposes standard NAL strong deduction, revision, evidence-confidence conversion, and expectation helpers.
- Triggered symbolic rules are grounded by an explicit deduction step from empirical observations before neural-symbolic revision.
- The README and paper now attribute these clarifications to Pei Wang and describe the variance-to-confidence mapping more carefully.
- Inference-Time Logic Injection: Fuses MC-Dropout epistemic uncertainty with NARS symbolic logic and pushes the revised confidence signal directly into Transformer attention during inference.
- Scale & Safety: Benchmarked on `91,713` ICU stays. In the current full run, the baseline transformer is best on AUC at `0.88294`, the flat-confidence control is best on ECE at `0.00490` with 95% CI `[0.00411, 0.00969]`, the MC-confidence-only ablation is best on accuracy at `0.92905`, and the NARS-gated variant is best on Brier score at `0.05618` with 95% CI `[0.05327, 0.05945]`.
- Glass-Box Activity: On held-out WiDS ICU data, explicit symbolic rules fired in `8551` of `13757` stays for `13031` total feature-level revisions, showing that the logic layer is active rather than decorative.
- Multi-Modal Ready: Demonstrated on fused clinical tabular features and genomic mutation matrices on TCGA-THCA, where the same interface remains operational as a clinical-plus-genomic proof of concept. The TCGA transformer variants are not statistically separated from one another on the 69-case held-out split.
NGTA is a medical prediction system for messy hospital-style tables where many values are missing. It uses a Transformer to make predictions, but it does not stop at producing a single risk score. It estimates epistemic uncertainty, checks a set of human-written medical rules, and then uses both pieces of information to adjust how the model pays attention to the input features before the final output is emitted.
Many clinical AI systems can give a strong prediction even when the data are incomplete or unreliable. That is dangerous in real settings because missing hospital data can produce overconfident probabilities that look trustworthy when they are not. NGTA is designed to separate "high score" from "high confidence" and to expose a human-readable revision path when symbolic rules intervene. That makes the system more useful in high-missingness environments like ICU data, where safer calibration matters as much as raw accuracy.
In standard clinical prediction, models optimize for point-estimate accuracy but lack native mechanisms to express epistemic doubt, leading to overconfident extrapolation when faced with missing features. NGTA is built around the opposite design goal: instead of a black-box predictor that guesses blindly across data gaps, it calculates feature-level uncertainty and can route attention toward explicit medical heuristics when uncertainty is high. In that sense, the repository's core contrast is simple: standard transformers behave like black boxes, while NGTA is designed to behave like a glass box.
The main novelty is not just "Transformer + rules." The key idea is that NGTA turns neural uncertainty into explicit symbolic truth values in a NARS-compatible evidential space, revises those values with domain rules, and then feeds the revised confidence back into Transformer attention. In simple terms: the model can use both learned patterns and symbolic evidence to decide how much trust to place in each feature at inference time, while also leaving behind an auditable evidential trace.
This repository is not a full NARS cognitive architecture. It operationalizes selected NAL truth-value functions as an interface layer for a clinical transformer: heuristic neural truth initialization, explicit symbolic deduction from triggered observations, and revision-based fusion before attention reweighting.
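The two NAL truth-value functions named above have standard closed forms. The sketch below is a minimal illustration of those textbook formulas; the actual implementation in `src/nars_interface.py` may differ in naming, evidential-horizon handling, and edge cases:

```python
def nal_deduction(f1: float, c1: float, f2: float, c2: float) -> tuple[float, float]:
    """Standard NAL strong deduction: conclusion frequency is the product of
    the premise frequencies, and confidence is discounted by both frequencies."""
    f = f1 * f2
    c = f1 * f2 * c1 * c2
    return f, c

def nal_revision(f1: float, c1: float, f2: float, c2: float) -> tuple[float, float]:
    """Standard NAL revision: pool evidence from two independent sources.
    The revised confidence is never lower than either input confidence."""
    w1 = c1 * (1.0 - c2)  # relative evidence weight of source 1
    w2 = c2 * (1.0 - c1)  # relative evidence weight of source 2
    f = (f1 * w1 + f2 * w2) / (w1 + w2)
    c = (w1 + w2) / (w1 + w2 + (1.0 - c1) * (1.0 - c2))
    return f, c
```

Revision is the fusion step used when a symbolic rule fires: the neural truth value and the rule-derived truth value are pooled, and the pooled confidence exceeds either input's.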
The end result is not just another tabular model with a rules layer attached to the side. It is an auditable, human-in-the-loop reasoning engine: instead of emitting an overconfident scalar score on missing data, the system exposes what it does not confidently know and provides a direct insertion point for clinicians to inject overriding physiological rules into the inference path itself. We refer to this uncertainty-conditioned attention update as Dynamic Evidential Routing. The novel outcome of this project is that calibration, clinician steerability, and auditability all appear in the same deployed inference loop.
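The attention-side effect of Dynamic Evidential Routing can be illustrated with a small numpy sketch. This is a hedged reconstruction, not the exact gating in `src/attention_hook.py`: it assumes the revised per-feature confidence enters the logits as a log-bias with exponent `gamma` before the softmax.

```python
import numpy as np

def confidence_gated_softmax(scores: np.ndarray, confidence: np.ndarray,
                             gamma: float = 2.0) -> np.ndarray:
    """Downweight low-confidence features before attention normalization.
    Adding gamma * log(confidence) to the logits is equivalent to scaling
    each attention weight by confidence ** gamma and renormalizing."""
    bias = gamma * np.log(np.clip(confidence, 1e-6, 1.0))
    z = scores + bias
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)
```

With equal raw scores and confidences `[1.0, 0.25]`, the second feature's weight drops from 0.5 to 0.0625 / 1.0625 ≈ 0.059 at `gamma = 2.0`, which is the intended behavior: brittle evidence is downweighted before the final prediction.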
- The Transformer reads the patient features and predicts risk.
- Monte Carlo dropout is used to measure how stable that prediction is across repeated passes.
- That uncertainty is heuristically converted into initial NARS-style truth values: frequency and confidence.
- If a symbolic rule fires, the rule is first grounded by explicit NAL deduction from an empirical observation and then combined with the neural truth value using NARS revision.
- The revised confidence is used to reweight attention, so uncertain or weakly supported features matter less.
- The pipeline then evaluates discrimination, calibration, decision curves, symbolic trigger activity, and baseline comparisons.
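Steps 2 and 3 of the loop above can be sketched as follows. The variance-to-confidence mapping shown is an illustrative assumption (the repository describes it only as a heuristic initializer), and `predict_fn` is a hypothetical stand-in for a dropout-enabled forward pass:

```python
import numpy as np

def mc_dropout_uncertainty(predict_fn, x, n_samples: int = 50, seed: int = 0):
    """Run repeated stochastic forward passes and return the mean prediction
    together with the across-pass variance (the epistemic-uncertainty signal)."""
    rng = np.random.default_rng(seed)
    preds = np.array([predict_fn(x, rng) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.var(axis=0)

def variance_to_initial_confidence(variance, scale: float = 1.0):
    """Heuristic initializer only: higher variance -> lower starting NARS
    confidence. This is not a claim that variance equals NAL evidence amount."""
    return 1.0 / (1.0 + scale * np.asarray(variance))
```

The default run uses 50 MC samples (`mc_samples=50`); the `scale` parameter here is a hypothetical knob, not a documented repository setting.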
The two benchmarks test different strengths of the architecture:
- `tcga` is the multi-modal proof of concept. It shows that NGTA can fuse clinical variables with a genomic mutation matrix without breaking the mathematical interface.
- `wids` is the primary empirical validation. It shows that the same architecture scales to a much larger ICU dataset with heavy missingness and gives the clearest large-scale view of calibration, uncertainty routing, and symbolic activity.
The main result is that NGTA works as intended on both a small multi-modal cancer dataset and a much larger high-missingness ICU dataset, but the two datasets support different claims.
- On `tcga`, the Transformer-based models still beat the random-forest baseline numerically. The best default AUC is `0.72605`, tied across `flat_confidence`, `mc_confidence_only`, and `nars_gated`, versus `0.66134` for random forest. This supports the claim that the interface can learn useful signal from combined clinical and genomic inputs, but it does not support a claim that NARS gating is statistically better than the other Transformer variants.
- The flat-confidence control is the strongest TCGA Transformer variant overall in the current default run because it pairs that tied-best AUC with the best Brier score (`0.21184`), the best ECE (`0.13897`), and the best accuracy (`0.68116`). TCGA should therefore still be treated as a multi-modal interface proof of concept rather than evidence that dynamic NARS gating dominates simpler confidence gates on very small cohorts.
- On `wids`, all Transformer variants are extremely close on AUC around `0.8829`. At full precision, the baseline transformer is best on AUC, the flat-confidence control is best on ECE at `0.00490`, the MC-confidence-only ablation is best on accuracy at `0.92905`, and the NARS-gated version is best on Brier score at `0.05618`.
- The WiDS baseline-vs-NARS paired bootstrap intervals now include zero for both the Brier difference (`-0.000004` to `0.000061`) and the ECE difference (`-0.000355` to `0.001796`). That means the current run does not statistically establish a calibration gain for NARS gating over the baseline transformer.
- The WiDS NARS-gated variant is also not statistically separated from the flat-confidence control on Brier or ECE. The new MC-confidence-only ablation sits almost exactly between the generic confidence gate and the symbolic gate, which makes the interpretation sharper: symbolic rules are active at scale, but the current default run still does not isolate their marginal calibration effect over neural uncertainty gating alone.
- The WiDS result still matters because the transformer family remains stronger than the random forest on the main summaries, and the symbolic path is physically active during inference. But the right interpretation is now narrower: this run supports operational neurosymbolic routing and competitive calibration, not a confirmed within-family superiority claim for NARS gating.
- The symbolic rules were not just decorative. On the held-out WiDS test set, ICU rules fired in `8551` of `13757` cases for `13031` total feature-level revisions, which means the neurosymbolic revision path was active at scale rather than sitting unused.
- Taken together, the results support a narrower and more defensible claim than "always better accuracy": NGTA is competitive on discrimination, operational as a human-auditable instrumentation layer under heavy missingness, and strongest as a framework for explicit uncertainty routing rather than as a proved winner over every control.
Put differently: the main architectural achievement here is safety-oriented behavior, not just ranking performance. NGTA turns the transformer's attention update from an opaque mapping into a transparent and auditable inference path, where uncertainty is explicit, rule interventions are traceable, and probability reliability becomes something the user can inspect rather than simply assume.
Create an environment and install dependencies:

```
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install -r requirements.txt
```

Run one dataset:

```
python main.py --dataset tcga
python main.py --dataset wids
```

Run the whole repository pipeline:

```
python main.py --run-all
```

`--run-all` is the full orchestration entrypoint. It runs the complete TCGA pipeline and the complete WiDS pipeline sequentially, computes every metric/chart/trace artifact for both datasets, and writes an aggregate `run_all_summary.json` at the chosen output root.
Useful flags:
- `--data-dir`: directory containing the TCGA tables / MAF files and `wids_icu.csv`
- `--output-dir`: base directory for per-dataset outputs
- `--epochs`, `--batch-size`, `--learning-rate`, `--weight-decay`
- `--mc-samples`, `--gamma`, `--seed`
- `--d-model`, `--num-heads`, `--num-layers`, `--dropout`, `--patience`
- `--seeds 0 1 2 3 4`: run multiple seeds and aggregate submission-ready metrics
- `--baseline-set standard`: add calibrated logistic regression, ExtraTrees, and histogram gradient boosting baselines
- `--ablation-set submission`: add symbolic-disabled and rule-truth sensitivity summaries
- `--export-case-traces`: write curated glass-box case traces for representative held-out patients
- `--paper-tables`: export aggregate CSV and LaTeX tables under `results/submission`
Notes:
- WiDS uses a dataset-specific batch-size override of `512`
- `--dataset` is used for single-dataset execution; `--run-all` runs both datasets regardless
- outputs are namespaced by dataset so TCGA and WiDS artifacts do not overwrite each other
- multi-seed runs are written under `results/seed_<seed>/...` so repeated submission runs do not overwrite each other
Submission-oriented run:
```
python main.py --run-all --seeds 0 1 2 3 4 --baseline-set standard --ablation-set submission --export-case-traces --paper-tables
```

This writes:

- `results/submission/multiseed_metrics.csv`
- `results/submission/baseline_comparison.csv`
- `results/submission/ablation_summary.csv`
- `results/submission/case_traces.csv`
- `results/submission/paper_tables.tex`
The submission artifacts are intended to support a theory-forward framing: NGTA is a glass-box evidential routing interface for clinical transformers, with performance treated as feasibility evidence rather than as a claim of universal superiority.
TCGA expects the following in `data/`:

- `clinical.tsv`
- `exposure.tsv`
- `family_history.tsv`
- `follow_up.tsv`
- `pathology_detail.tsv`
- one or more `*.maf` files

WiDS expects:

- `wids_icu.csv`
The WiDS branch uses exactly these 15 core features:

- Continuous numeric: `age`, `bmi`, `d1_heartrate_max`, `d1_sysbp_min`, `d1_temp_max`, `d1_lactate_max`, `d1_bun_max`, `d1_creatinine_max`, `d1_glucose_max`, `d1_wbc_max`, `d1_spo2_min`, `d1_platelets_min`, `apache_4a_hospital_death_prob`
- Binary pass-through: `elective_surgery`
- Categorical: `gender`
Preprocessing rules:

- `pd.read_csv(..., na_values=['NA'])`
- drop rows where `hospital_death` is missing
- stratified `70/15/15` split with the run seed
- `KNNImputer(n_neighbors=5)` on the 13 continuous features, fit on train only
- `SimpleImputer(strategy='most_frequent')` + one-hot encoding for `gender`
- `StandardScaler` on the 13 continuous features only, fit on train only
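The ordering above (impute before scale, fit both on train only) can be sketched with scikit-learn. The column subset and function name are illustrative assumptions, not code from `src/wids_loader.py`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

CONTINUOUS = ["age", "bmi", "d1_lactate_max"]  # illustrative subset of the 13

def fit_continuous_pipeline(train_df: pd.DataFrame, test_df: pd.DataFrame):
    """Impute with KNN, then standardize, fitting both steps on the training
    split only so no test-set statistics leak into preprocessing."""
    imputer = KNNImputer(n_neighbors=5)
    scaler = StandardScaler()
    x_train = scaler.fit_transform(imputer.fit_transform(train_df[CONTINUOUS]))
    x_test = scaler.transform(imputer.transform(test_df[CONTINUOUS]))
    return x_train, x_test
```

Fitting on the training split only is what makes the reported held-out calibration numbers meaningful under heavy missingness.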
WiDS symbolic ICU rules are evaluated after KNN imputation and before scaling:

- `d1_lactate_max >= 4.0`
- `d1_sysbp_min <= 90.0`
- `age >= 75.0`
- `d1_creatinine_max >= 2.0`
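Evaluating those thresholds on the imputed, pre-scaling feature table reduces to a simple mask computation. The rule names match the per-rule trigger counts reported in the WiDS summary, while the function itself is an illustrative sketch rather than the repository's rule-mask code:

```python
import numpy as np

# Thresholds from the four WiDS ICU rules listed above.
ICU_RULES = {
    "rule_lactate": ("d1_lactate_max", lambda v: v >= 4.0),
    "rule_hypotension": ("d1_sysbp_min", lambda v: v <= 90.0),
    "rule_age": ("age", lambda v: v >= 75.0),
    "rule_creatinine": ("d1_creatinine_max", lambda v: v >= 2.0),
}

def rule_trigger_masks(features: dict) -> dict:
    """Return one boolean mask per rule over an imputed feature table,
    given as a mapping from feature name to an array of values."""
    return {name: check(np.asarray(features[col]))
            for name, (col, check) in ICU_RULES.items()}
```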
This repository is a first methods implementation, not a clinical validation package.
- The TCGA held-out split has only `69` cases. The transformer variants are close and should not be described as statistically separated from one another.
- The symbolic rule bases are deliberately thin: four thyroid rules and four ICU rules. They demonstrate that the NARS revision path is active, but they are not independently curated clinical ontologies.
- Following feedback from Pei Wang on April 21, 2026, the repository treats the variance-to-confidence map as an application-specific heuristic initializer, not as a claim that model variance directly measures NARS evidence amount.
- The current results do not establish that these exact hand-selected rules are sufficient or optimal. A stronger study would lock a broader expert-curated rule base before evaluation and report sensitivity to rule inclusion and truth-value assignments.
- There is no external validation cohort in this snapshot. Clinical claims would require temporally or institutionally independent test cohorts with locked preprocessing, model settings, and rule definitions.
Each dataset writes a full artifact bundle under the chosen output root:
- `<output-dir>/tcga/charts`
- `<output-dir>/tcga/metrics`
- `<output-dir>/tcga/traces`
- `<output-dir>/wids/charts`
- `<output-dir>/wids/metrics`
- `<output-dir>/wids/traces`
Top-level orchestration output:
- `<output-dir>/run_all_summary.json`
Per-dataset metrics/traces include:
- `metrics.csv`
- `training_history.csv`
- `gamma_ablation.csv`
- `decision_curve.csv`
- `calibration_reliability.csv`
- `run_summary.json`
- `test_predictions.csv`
- ROC, calibration, training-history, gamma-ablation, and decision-curve plots
`metrics.csv` now reports 95% bootstrap confidence intervals for AUC, Brier score, and ECE across the random forest, baseline transformer, flat-confidence transformer, MC-confidence-only ablation, and NARS-gated transformer. `run_summary.json` also includes paired bootstrap Brier/ECE deltas for the main comparisons.
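For reference, a common equal-width-bin ECE estimator looks like the following. The repository's binning scheme is not documented here, so treat this as one standard formulation rather than the exact metric code:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: the weighted mean absolute gap between predicted
    probability and observed event frequency within each probability bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bin_ids = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            ece += in_bin.mean() * abs(y_prob[in_bin].mean() - y_true[in_bin].mean())
    return float(ece)
```

An ECE near `0.005`, as in the WiDS run, means predicted probabilities track observed mortality frequencies to within about half a percentage point on average.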
The current default full run was produced with:
```
python main.py --run-all
```

This was a single-seed default run with `baseline_set=minimal`, `ablation_set=quick`, `mc_samples=50`, `gamma=2.0`, and seed 0. The richer multi-seed submission artifacts are produced only by the longer `--seeds ... --baseline-set standard --ablation-set submission --export-case-traces --paper-tables` command.
Result bundles written by that run:
- `results/run_all_summary.json`
- `results/tcga/metrics/run_summary.json`
- `results/wids/metrics/run_summary.json`
TCGA-THCA full-run summary:
Role in the paper: multi-modal proof of concept for clinical-plus-genomic fusion
- Split: `319 / 69 / 69` train/validation/test from `457` labeled cases
- Best default AUC: `0.72605`, tied across `flat_confidence`, `mc_confidence_only`, and `nars_gated`
- Best Brier: `0.21184` for `flat_confidence` with 95% CI `[0.17930, 0.24551]`
- Best ECE: `0.13897` for `flat_confidence` with 95% CI `[0.11629, 0.25967]`
- Best accuracy: `0.68116` for `flat_confidence`
- MC-confidence-only ablation: AUC `0.72605`, Brier `0.21219`, ECE `0.14027`, accuracy `0.66667`
- Symbolic activity: `42 / 69` held-out cases with any trigger, `79` total feature-level revisions
- Interpretation: the flat-confidence control is strongest overall among transformer variants, while the MC-confidence-only and NARS-gated variants only tie it on AUC. The transformer variants are not statistically separated on this small split.
WiDS ICU full-run summary:
Role in the paper: primary empirical validation for scale, missingness, and calibration
- Split: `64199 / 13757 / 13757` train/validation/test from `91713` labeled rows
- Input width: `16` model features after preprocessing
- Best AUC: `0.88294` for `baseline`
- Best Brier: `0.05618` for `nars_gated` with 95% CI `[0.05327, 0.05945]`
- Best ECE: `0.00490` for `flat_confidence` with 95% CI `[0.00411, 0.00969]`
- Best accuracy: `0.92905` for `mc_confidence_only`
- MC-confidence-only ablation: AUC `0.88288`, Brier `0.05618`, ECE `0.00495`, accuracy `0.92905`
- Symbolic activity: `8551 / 13757` held-out cases with any trigger, `13031` total feature-level revisions
- Paired bootstrap comparisons:
  - `baseline -> nars_gated` Brier `0.05621 -> 0.05618`; paired delta CI `[-0.000004, 0.000061]`
  - `baseline -> nars_gated` ECE `0.00601 -> 0.00494`; paired delta CI `[-0.000355, 0.001796]`
  - `flat_confidence -> nars_gated` Brier `0.05618 -> 0.05618`; paired delta CI `[-0.000009, 0.000015]`
  - `flat_confidence -> nars_gated` ECE `0.00490 -> 0.00494`; paired delta CI `[-0.000598, 0.000790]`
  - `random_forest -> nars_gated` Brier `0.05821 -> 0.05618`; paired delta CI `[0.001351, 0.002726]`
  - `random_forest -> nars_gated` ECE `0.00764 -> 0.00494`; paired delta CI `[-0.001682, 0.005866]`
- AUC confidence intervals overlap across all WiDS variants.
- Per-rule test triggers: `rule_lactate: 1936`, `rule_hypotension: 5338`, `rule_age: 3443`, `rule_creatinine: 2314`
- `main.py`: CLI entry point and `--run-all` orchestration
- `src/data_loader.py`: TCGA ingestion and preprocessing
- `src/wids_loader.py`: WiDS ingestion, preprocessing, and ICU rule-mask generation
- `src/knowledge_base.py`: TCGA symbolic rule base
- `src/wids_knowledge_base.py`: WiDS symbolic ICU rule base
- `src/neural_encoder.py`: tabular Transformer with MC-dropout inference
- `src/nars_interface.py`: heuristic neural truth mapping plus standard NAL deduction, revision, and evidential utility operators
- `src/attention_hook.py`: confidence-based attention gating
- `src/pipeline.py`: training, baselines, evaluation, plotting, and summary generation
- `paper/main.pdf`: compiled research paper
- `paper/main.tex`: manuscript source
The repository updates in this snapshot were shaped directly by Pei Wang's email feedback on April 21, 2026. In particular, he pointed out that statistical variance is not the same thing as NARS evidence amount and that the manuscript's deduction confidence formula needed to match standard NAL. The current code and paper now reflect those corrections. The project also relies on public TCGA-THCA data from the NCI Genomic Data Commons.