CancerRCDPredictor

Authors:
Emanuell de Souza Rodrigues
Higor Almeida Cordeiro Nogueira
Victor dos Santos Lopes
Enrique Medina-Acosta

Purpose

CancerRCDPredictor implements a novel Pan-Cancer Multi-Omic SuperLearner pipeline designed to overcome three algorithmic bottlenecks of precision oncology: extreme data sparsity, dimensional missingness, and the structural failures of traditional linear proportional-hazards models.

By mapping non-linear survival topologies across 33 tumor types and introducing a Dual-Track genotypic sparsity displacement architecture, we offer a mathematically resilient predictive framework. The system supports local interpretability via N-dimensional TreeSHAP interactions, answering the mandate for transparent, audit-compliant AI tools that manage multi-modal biological complexity without discarding patient data.

The framework was developed from the methodological architecture described in the manuscript:

A Pan-Cancer Multi-Omic SuperLearner for Regulated Cell Death Survival Topologies

Overview

CancerRCDPredictor was developed as a translational extension of the CancerRCDShiny ecosystem, transforming large-scale prognostic signature catalogs into an interactive predictive and interpretability engine.

The platform enables the exploration of:

96 validated Pan-Cancer predictive cohorts
12,613 biologically filtered multi-omic signatures
7 omic layers
SHAP-based survival geometries and LIME surrogate models
Cohort-level interaction topologies (mapping 26,800 Synergistic, Antagonistic, and Bifurcation dependencies)
10,306 patient-specific non-proportional hazard trajectories
1,050 patient samples in a clinical blind validation cohort
150 elite "Golden Anchor" RCD signatures (Quadripartite-validated apex drivers)
A Dual-Track Inference Architecture (powered by an MVL SuperLearner and XGBoost fallback)

The system was specifically engineered to bypass limitations of classical proportional hazards models and capture complex non-linear biological survival structures.

Key Discoveries & Architectural Outcomes

Beyond serving as a predictive engine, the CancerRCDPredictor pipeline generated profound biological and algorithmic discoveries:

The Terminal Harvester & 150 Golden Anchors:

From an initial universe of 14,595 signatures, the Quadripartite framework retained only the features selected by all four algorithms (a 4/4 constraint: RSF VIMP, XGBoost Gain, Boruta Z-score, MTLR L2-Norm). This distilled the landscape down to exactly 150 "Golden Anchors", the highest echelon of pan-cancer prognostic reliability.
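The 4/4 constraint amounts to a set intersection over the features each algorithm keeps. The following is a minimal sketch of that idea, not the pipeline's own code: the function name, the dictionary keys, and the toy signature names are hypothetical, and the per-algorithm cutoffs that produce each set are not described in this README.

```python
from typing import Dict, Set


def quadripartite_consensus(selections: Dict[str, Set[str]]) -> Set[str]:
    """Return the signatures kept by every selector (the 4/4 rule)."""
    if not selections:
        return set()
    # A signature survives only if it appears in all per-algorithm sets.
    return set.intersection(*selections.values())


# Toy input: each key is one selector, each value the signatures it retained.
selections = {
    "rsf_vimp": {"sig_A", "sig_B", "sig_C"},
    "xgboost_gain": {"sig_A", "sig_C", "sig_D"},
    "boruta_z": {"sig_A", "sig_C"},
    "mtlr_l2": {"sig_A", "sig_B", "sig_C"},
}

golden = quadripartite_consensus(selections)
print(sorted(golden))  # ['sig_A', 'sig_C']
```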

Algorithmic Displacement & Genotypic Erasure:

The architecture revealed a severe structural displacement during high-dimensional model competition. Continuous phenotypic layers (transcript isoforms, mRNA) monopolized 85.7% of the predictive topology, mathematically suppressing and erasing static genomic mutations and CNVs (0.0% retention in the golden anchors).

Dynamic SuperLearner Voting (Lush vs. Supreme Exemplars):

The SuperLearner dynamically adapts voting weights to the cohort's biological complexity. In high-entropy "Lush" environments (e.g., LGG), it distributes trust equally across all 4 base-learners (25% each) to synthesize fragmented signals. In "Supreme" deterministic environments (e.g., READ_OS), it routes up to 95.7% of trust into XGBoost to maximize resolution.
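The two voting regimes can be illustrated as a weighted average of base-learner risk scores. This is a simplified sketch under stated assumptions: the real meta-learner is described as a Multi-View Elastic Net, whereas the code below uses a plain convex combination; the function name and the toy scores are hypothetical; and the even split of the remaining 4.3% of weight in the "Supreme" case is an assumption, since the README only reports the XGBoost share.

```python
from typing import Dict


def superlearner_risk(base_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine base-learner risk scores with weights that sum to 1."""
    if set(base_scores) != set(weights):
        raise ValueError("scores and weights must cover the same base-learners")
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * base_scores[name] for name in base_scores)


# Toy risk scores for one patient (hypothetical values).
scores = {"rsf": 0.40, "xgboost": 0.80, "boruta": 0.20, "mtlr": 0.60}

# "Lush" cohort: trust is spread equally, 25% per base-learner.
lush_weights = {name: 0.25 for name in scores}

# "Supreme" cohort: 95.7% of trust goes to XGBoost; the remainder is
# split evenly here purely for illustration.
rest = (1.0 - 0.957) / 3
supreme_weights = {"rsf": rest, "xgboost": 0.957, "boruta": rest, "mtlr": rest}

lush_risk = superlearner_risk(scores, lush_weights)
supreme_risk = superlearner_risk(scores, supreme_weights)
print(round(lush_risk, 4), round(supreme_risk, 4))
```

In the equal-weight case the ensemble score is the plain mean of the four learners, while the concentrated case tracks the XGBoost score closely.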

Key Features

Multi-Omic Predictive Architecture & Nomenclature

The platform integrates seven omic layers:

Protein abundance (.1)
Somatic mutations (.2)
Copy Number Variation (CNV) (.3)
miRNA expression (.4)
Transcript isoform-specific expression (.5)
mRNA expression (.6)
DNA methylation (.7)

These layers are tracked through an 11-Part Tokenized Nomenclature System (CTAB-GSI.GFC.PFC.SCS.TNC.HRC.SMC.TMC.TIC.RCD), ensuring programmatic parsing of biological function, immune landscape, and Regulated Cell Death (RCD) pathways directly from the signature ID.
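Programmatic parsing of such an ID could look like the sketch below. It assumes the layout implied by the pattern above, with the first token separated by a hyphen and the remaining ten separated by dots, and that no token contains those delimiters. The function name and the example ID are made up for illustration; the token values do not correspond to real signatures.

```python
from typing import Dict

# Field order taken from the pattern CTAB-GSI.GFC.PFC.SCS.TNC.HRC.SMC.TMC.TIC.RCD.
FIELD_NAMES = ["CTAB", "GSI", "GFC", "PFC", "SCS", "TNC", "HRC", "SMC", "TMC", "TIC", "RCD"]


def parse_signature_id(signature_id: str) -> Dict[str, str]:
    """Split a signature ID into its 11 named tokens."""
    head, separator, tail = signature_id.partition("-")
    if not separator:
        raise ValueError("expected a '-' after the first token")
    tokens = [head] + tail.split(".")
    if len(tokens) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} tokens, got {len(tokens)}")
    return dict(zip(FIELD_NAMES, tokens))


# Made-up ID used only to exercise the parser.
parsed = parse_signature_id("LGG-101.6.2.3.1.4.2.1.3.5")
print(parsed["CTAB"], parsed["GSI"], parsed["RCD"])  # LGG 101 5
```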

Explainable AI Framework

CancerRCDPredictor incorporates multiple explainability modules to prevent "black box" predictions:

Global Explainability

SHAP Beeswarm plots: Global impact ranking and feature dominance visualization.

Local Explainability

LIME Surrogate Models: Point-of-care localized linear surrogates mapping individualized hazard boundaries.
TreeSHAP Waterfall & Force Plots: Decomposing the predictive logic of the non-linear SuperLearner into per-feature contributions for individual patient trajectories.

Interaction Explainability

Mapping 26,800 statistically significant 3D TreeSHAP trans-signature dependencies across three mathematical archetypes:

Synergism: Hazard Amplification.
Antagonism: Functional Rescue Effect.
Context-Dependent Bifurcation: Topological sign-reversals.
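The three archetypes can be illustrated with a simple sign-based rule over the per-patient interaction values of one signature pair. This is only a sketch: it assumes that positive interaction values raise the predicted hazard, and the function name, the tolerance, and the "none" label are hypothetical. The statistical criteria actually used to call the 26,800 dependencies are not given in this README.

```python
from typing import Sequence


def classify_interaction(interaction_values: Sequence[float], tolerance: float = 1e-6) -> str:
    """Label one signature pair from its per-patient interaction values."""
    positives = sum(1 for value in interaction_values if value > tolerance)
    negatives = sum(1 for value in interaction_values if value < -tolerance)
    if positives and negatives:
        return "bifurcation"  # the sign flips between patients
    if positives:
        return "synergism"    # the pair consistently amplifies hazard
    if negatives:
        return "antagonism"   # the pair consistently dampens hazard
    return "none"             # no interaction beyond the tolerance


print(classify_interaction([0.12, 0.30, 0.05]))    # synergism
print(classify_interaction([-0.20, -0.01, -0.4]))  # antagonism
print(classify_interaction([0.25, -0.18, 0.02]))   # bifurcation
```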

Educational Sandbox

The platform was also designed as a pedagogical topology explorer and educational sandbox for Explainable Artificial Intelligence in precision oncology.

Dedicated educational modules explain:

SHAP interpretation
Survival geometries
Multi-omic interactions
Precision oncology trajectories
Non-proportional hazard dynamics

Audit-Compliant Analytical Architecture

The platform operates on a strictly deterministic, gated architecture governed by 4 Constitutional Contracts (groupwise isolation, endpoint-scoped cohorts, zero predictor-driven sample reduction, and explicit exclusion ledgers) alongside rigorous Identifiability Thresholds ($E_{min} \ge 20$, $N_{min} \ge 50$).
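A gate of this kind, paired with an explicit exclusion ledger, might be sketched as follows. The code assumes that $E_{min}$ refers to the minimum number of events and $N_{min}$ to the minimum number of samples per cohort, which the README does not spell out. The class, function names, and toy cohorts are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

E_MIN = 20  # assumed meaning: minimum number of observed events
N_MIN = 50  # assumed meaning: minimum number of samples


@dataclass
class Cohort:
    name: str
    n_samples: int
    n_events: int


def gate_cohorts(cohorts: List[Cohort]) -> Tuple[List[Cohort], List[Tuple[str, str]]]:
    """Split cohorts into admitted ones and a ledger of (name, reason) exclusions."""
    admitted: List[Cohort] = []
    ledger: List[Tuple[str, str]] = []
    for cohort in cohorts:
        if cohort.n_events < E_MIN:
            ledger.append((cohort.name, f"events {cohort.n_events} < {E_MIN}"))
        elif cohort.n_samples < N_MIN:
            ledger.append((cohort.name, f"samples {cohort.n_samples} < {N_MIN}"))
        else:
            admitted.append(cohort)
    return admitted, ledger


# Toy cohorts with invented counts.
cohorts = [
    Cohort("cohort_1", n_samples=400, n_events=90),
    Cohort("cohort_2", n_samples=45, n_events=25),
    Cohort("cohort_3", n_samples=120, n_events=12),
]
admitted, ledger = gate_cohorts(cohorts)
print([c.name for c in admitted])
print(ledger)
```

Recording the reason for each exclusion, rather than silently dropping the cohort, is what makes the gate auditable.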

Phase I — Harmonization and Reconstruction

Universal Resume Engine: A fault-tolerant pipeline deploying 12 distinct imputation methods (including kNN, missForest, XGBoost, LightGBM, MICE, and iSVD) with automated .rds checkpointing for memory safety.
Generation of LiSHMOM: 372 Lineage-Specific Harmonized Multi-Omic Matrices.
Leakage prevention protocols.
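The checkpoint-and-resume idea can be sketched in a few lines. Note the swap: the pipeline itself is described as writing .rds files from R, while this illustration uses Python's pickle module instead. The function name, step name, and toy workload are hypothetical.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def run_with_checkpoint(step_name: str, compute: Callable[[], Any], checkpoint_dir: str) -> Any:
    """Return the cached result for a step if present, else compute and cache it."""
    path = os.path.join(checkpoint_dir, f"{step_name}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    # Write to a temporary file first so an interrupted run never leaves
    # a half-written checkpoint under the final name.
    temporary_path = path + ".tmp"
    with open(temporary_path, "wb") as handle:
        pickle.dump(result, handle)
    os.replace(temporary_path, path)
    return result


calls = []


def expensive_imputation() -> dict:
    calls.append(1)  # track how many times the work really runs
    return {"imputed_rows": 3}


with tempfile.TemporaryDirectory() as workdir:
    first = run_with_checkpoint("impute_knn", expensive_imputation, workdir)
    second = run_with_checkpoint("impute_knn", expensive_imputation, workdir)

print(first == second, len(calls))  # True 1
```

The second call reads the stored result instead of recomputing, which is the behavior a resumed run relies on after a crash.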

Phase II — CANARY Structural Diagnostics

CoxNet Feasibility Auditing: Mathematically mapping proportional-hazards failures and $\mu$-ladder exhaustion.
Geometric Admissibility Gating & Survival Topology Certification.

Phase III — Quadripartite Ensemble Synthesis & Calibration

The Base-Learners: Random Survival Forests (RSF), XGBoost, Survival-Boruta, and Multi-Task Logistic Regression (MTLR).
The Meta-Learner: Synthesized via a Multi-View Elastic Net SuperLearner (MVL).
Brier Calibration Audit: Rigorous post-hoc probability validation using Inverse Probability of Censoring Weighting (IPCW), with time-dependent Brier scores and the Integrated Brier Score (IBS) evaluated across 1-, 3-, and 5-year horizons.
"No Cohort Stays Behind" Policy: Algebraic fallback defenses (Z-Score Mean Imputation, Micro-Jitter Variance Injection, Boruta Coerced 0.5 Resolution) preventing singular matrix crashes.

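For readers unfamiliar with the IPCW Brier score named in the calibration audit, the sketch below computes it at a single horizon in plain Python. It follows the standard textbook definition, with a Kaplan-Meier estimate of the censoring distribution as the weight, and is not taken from the pipeline's code. Function names and the toy data are hypothetical, and tied event and censoring times are handled naively.

```python
from typing import Callable, List, Sequence, Tuple


def censoring_km(times: Sequence[float], events: Sequence[int]) -> Callable[..., float]:
    """Kaplan-Meier estimate G of the censoring distribution (censoring = 'event')."""
    censor_times = sorted({t for t, e in zip(times, events) if e == 0})
    steps: List[Tuple[float, float]] = []
    survival = 1.0
    for tj in censor_times:
        at_risk = sum(1 for t in times if t >= tj)
        censored = sum(1 for t, e in zip(times, events) if t == tj and e == 0)
        survival *= 1.0 - censored / at_risk
        steps.append((tj, survival))

    def G(u: float, left_limit: bool = False) -> float:
        value = 1.0
        for tj, s in steps:
            # left_limit=True gives G(u-), which excludes a drop exactly at u.
            if tj < u or (tj == u and not left_limit):
                value = s
            else:
                break
        return value

    return G


def ipcw_brier(times: Sequence[float], events: Sequence[int],
               surv_at_horizon: Sequence[float], horizon: float) -> float:
    """IPCW Brier score at one horizon; events[i] is 1 for death, 0 for censoring."""
    G = censoring_km(times, events)
    total = 0.0
    for t, e, s in zip(times, events, surv_at_horizon):
        if t <= horizon and e == 1:
            total += (0.0 - s) ** 2 / G(t, left_limit=True)  # died before horizon
        elif t > horizon:
            total += (1.0 - s) ** 2 / G(horizon)             # still alive at horizon
        # Patients censored before the horizon contribute zero.
    return total / len(times)


# Toy data: follow-up in years and predicted survival probability at 2.5 years.
times = [1.0, 2.0, 3.0, 4.0]
predicted = [0.1, 0.2, 0.8, 0.9]
uncensored_score = ipcw_brier(times, [1, 1, 1, 1], predicted, horizon=2.5)
censored_score = ipcw_brier(times, [1, 0, 1, 1], predicted, horizon=2.5)
print(round(uncensored_score, 5), round(censored_score, 5))
```

Without censoring every weight equals 1 and the score reduces to the ordinary mean squared error. An integrated score would average this quantity over a grid of horizons.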
Internal Blind Validation & Dual-Track Inference Engine

To ensure that every one of the 1,050 pristine validation records receives a prediction (100% predictive penetrance), a Dual-Track Inference Engine was deployed:

Path A (SuperLearner): Synthesizes continuous risk hazard Z-scores for structurally intact records.
Path B (Native XGBoost Fallback): Autonomously routes highly fragmented patient records through sparsity-aware split finding to prevent artificial risk escalation.
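The routing decision between the two paths might be sketched as a missingness check per record. The 20% cutoff, the function names, and the toy records are assumptions made for illustration; the README does not state the criterion that marks a record as "highly fragmented".

```python
import math
from typing import Dict, Optional


def missing_fraction(record: Dict[str, Optional[float]]) -> float:
    """Share of features in a record that are None or NaN."""
    values = list(record.values())
    if not values:
        return 1.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def choose_track(record: Dict[str, Optional[float]], max_missing: float = 0.2) -> str:
    """Return 'A' (SuperLearner) for intact records, 'B' (XGBoost fallback) otherwise."""
    return "A" if missing_fraction(record) <= max_missing else "B"


# Toy patient records with invented feature values.
intact = {"mRNA": 1.2, "miRNA": -0.4, "methylation": 0.1, "protein": 0.8, "cnv": 0.0}
fragmented = {"mRNA": 1.2, "miRNA": None, "methylation": float("nan"), "protein": None, "cnv": 0.0}

print(choose_track(intact), choose_track(fragmented))  # A B
```

Because every record is sent to one of the two paths, none is dropped for missing data, which matches the stated goal of scoring all validation records.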

Analytical Architecture

The platform follows a strict three-phase audit-compliant architecture:

Phase I — Harmonization and Reconstruction

Multi-omic harmonization
Layer-specific preprocessing
Missing-data auditing
Fault-tolerant imputation engine
Leakage prevention protocols

Phase II — CANARY Structural Diagnostics

CoxNet feasibility auditing
Proportional hazards diagnostics
Geometric admissibility gating
Survival topology certification

Phase III — Quadripartite Ensemble Synthesis

The final predictive framework combines:

Random Survival Forests (RSF)
XGBoost
Survival-Boruta
Multi-Task Logistic Regression (MTLR)

through a Multi-View Elastic Net SuperLearner architecture.

Platform Modules

| Module | Description |
| --- | --- |
| Welcome | Platform overview and predictive architecture |
| How to Read | Educational interpretability guide |
| Methodological Integrity | Phase I–III methodological overview |
| MVL Performance | Time-dependent AUROC exploration |
| Global Impact | SHAP Beeswarm visualization |
| Interaction Topologies | Synergy and antagonism mapping |
| Precision Oncology | Individual patient trajectory decomposition |
| Signature Interpreter | Multi-omic signature exploration |

Dataset Scope

The framework integrates:

TCGA Pan-Cancer cohorts
UCSCXena resources
Multi-omic clinical matrices

Clinical survival endpoints include:

Overall Survival (OS)
Disease-Specific Survival (DSS)
Disease-Free Interval (DFI)
Progression-Free Interval (PFI)

The analytical architecture generated:

14,907 prognostic nomenclatures
17,875 biological target elements
372 harmonized preprocessing matrices

Technologies

Backend

R
Shiny
survival
glmnet
randomForestSRC
xgboost
SHAP
LIME

Frontend

Bootstrap 5
bslib
DT
Responsive Glassmorphism UI
Dynamic rendering pipelines
