Skip to content

Part II Mathematical Foundations

José Mauricio Gómez Julián edited this page Jun 28, 2026 · 6 revisions

← Part I — Conceptual Framework · gdpar Wiki Home · Part III — Computational Architecture →


Part II — Mathematical Foundations

This part develops, at reference grade, the mathematics the package implements: the AMM decomposition and its identifiability theory (§II.1), the asymptotic theory of the three paths (§II.2), the Empirical-Bayes theory and its multivariate extension (§II.3), the distributional families and links (§II.4), the B-spline W bases (§II.5), grouped references (§II.6), the causal-inference bridge (§II.7), the geometry-adaptive sampling metrics (§II.8), and the dependence-robust inference machinery (§II.9). The identifiability material corresponds to package source R/check_identifiability.R, R/preflight*.R, R/amm_spec.R; the conditions stated here are enforced in code and the cross-references make the link explicit.

II.1 The AMM canonical form and identifiability

II.1.1 Setting

There are $n$ units $i=1,\dots,n$. Each has covariates $x_i\in\mathcal X\subseteq\mathbb R^d$ drawn i.i.d. from a population law $\mu$, a latent parameter vector $\theta_i\in\mathbb R^p$, and an outcome $y_i\sim\mathcal D(\theta_i)$. The framework posits $\theta_i=\theta_{\text{ref}}+\Delta(x_i,\theta_{\text{ref}})$ with $\theta_{\text{ref}}\in\Theta\subseteq\mathbb R^p$ and $\Delta:\mathcal X\times\Theta\to\mathbb R^p$. Two questions are answered: what form does $\Delta$ take, and when are $(\theta_{\text{ref}},\Delta)$ recoverable from data. The second is split into three logically distinct layers kept separate throughout:

  • (L1) Algebraic-functional identifiability — does the latent function $\theta_i(\cdot)$ determine the components $(\theta_{\text{ref}},a,b,W)$? (Theorem 1A.)
  • (L2) Statistical identifiability — do the observable data ${(x_i,y_i)}$ determine them? (Lemma 1B, via a hypothesis on the response family.)
  • (L3) Numerical verifiability — in a chosen finite basis, can a runtime diagnostic detect identifiability or its failure? (Proposition 1C, the Gram-matrix check.)

II.1.2 The AMM hierarchy

The space of deviation forms is stratified by joint polynomial order in $(x,\theta_{\text{ref}})$:

Level Form Joint order Status
0 (degenerate) $\Delta=0$ $(0,0)$ standard non-mixed regression
1 (linear additive) $\Delta=Ax$, $A$ fixed $(1,0)$ classical random coefficients; no reference-dependence
2 (canonical AMM) $\Delta=a(x)+b(x)\odot\theta_{\text{ref}}+W(\theta_{\text{ref}})x$ mixed the canonical default
2.5 (full-matrix mult.) $\Delta=a(x)+B(x)\theta_{\text{ref}}+W(\theta_{\text{ref}})x$ Level 2 is $B=\mathrm{diag}(b)$
3 (quadratic) adds $x^\top Q_k,x$, $\theta^\top R_k,\theta$, $\theta\odot W x$ up to $(1,2)$
$K$ (polynomial closure) $\sum_{1\le j+k\le K,,j\ge1}\mathcal T_{j,k}(x^{\otimes j},\theta_{\text{ref}}^{\otimes k})$ $\le K$ $j=0$ terms excluded (non-individuating)
$\infty$ (universal) $\Delta=\Phi(x,\theta_{\text{ref}})$ arbitrary measurable hypernetwork; Proposition 1F

The canonical Level-2 AMM is

$$\boxed{;\Delta(x,\theta_{\text{ref}})=\underbrace{a(x)}_{\text{additive}}+\underbrace{b(x)\odot\theta_{\text{ref}}}_{\text{multiplicative (Hadamard)}}+\underbrace{W(\theta_{\text{ref}}),x}_{\text{modulated}};}$$

with $a,b:\mathcal X\to\mathbb R^p$ measurable and $W:\Theta\to\mathbb R^{p\times d}$ continuous with $W(\theta_0)=0$ at an anchor $\theta_0$. It is singled out by (P1) minimality of reference-dependence (smallest extension of Level 1 in which $\theta_{\text{ref}}$ enters nontrivially), (P2) separability of its three mechanisms, (P3) computational tractability (closed-form gradients, finite sufficient statistics, tractable HMC).

Approximation (Scheme 1D). On compact $\mathcal X,\Theta$, for continuous $\Delta$ and any $\varepsilon>0$ there is a finite $K$ and a Level-$K$ form with polynomial $a,b,W$ uniformly $\varepsilon$-close (Stone–Weierstrass). Flagged caveats: this is an approximation, not an identifiability statement; it covers only the polynomial restriction; it needs compactness; it gives no rate.

II.1.3 Standing assumptions (the six conditions enforced in code)

Condition Meaning
(C1) $\mathbb E_\mu[X]=0$ covariates centered
(C2) $\mathbb E_\mu[a(X)]=0$ additive component centered
(C3) $\mathbb E_\mu[b(X)]=0$ multiplicative component centered
(C4) $W(\theta_0)=0$ for some $\theta_0\in\Theta$ modulating matrix anchored
(C5) $\mathrm{Cov}_\mu(X)$ full rank $d$; $a,b\in L^2$; $W\in C(\Theta)$ bounded support/integrability
(C6) $\theta_*\in\Theta^\circ={\theta:\theta_k\neq0\ \forall k}$ non-degeneracy of the reference

The joint consequence of (C1)–(C4) is the centering of the framework: $\mathbb E_\mu[\Delta(X,\theta_{\text{ref}})]=0$ for every $\theta_{\text{ref}}$ — when an individual is "average", its deviation is zero. (C6) is genuine: a zero coordinate of $\theta_*$ kills the multiplicative term there and $b$ becomes unidentified in that coordinate; it is met for parameters bounded away from zero (rates, scales, precisions) or enforced by reparametrization ($\log\theta_{\text{ref}}$).

In the implementation, (C2)/(C3) are enforced empirically by column-centering the design matrices $Z_a,Z_b$ in the R-side AMM design constructor (so $\mathrm{colMeans}(Z_a)=0$ exactly); no sum-to-zero constraint is imposed on the coefficients $a_{\text{coef}},b_{\text{coef}}$ (that would wrongly exclude effects with non-zero coefficient mean). (C4) is enforced by the parametrization $W(\theta)=W_0(\theta)-W_0(\theta_0)$.

II.1.4 Functional Independence Condition (FIC)

Let $\mathcal F_a,\mathcal F_b\subseteq L^2_0(\mu,\mathbb R^p)$, $\mathcal F_W\subseteq C(\Theta,\mathbb R^{p\times d})$ be the component classes. Define the subspaces $\mathcal S_a=\mathcal F_a$, $\mathcal S_b(\theta_)={x\mapsto h(x)\odot\theta_:h\in\mathcal F_b}$, $\mathcal S_W={x\mapsto Mx:M\in\mathbb R^{p\times d}}$.

Abstract FIC at $\theta_*$. $\mathcal S_a,\mathcal S_b(\theta_*),\mathcal S_W$ are linearly independent in $L^2(\mu,\mathbb R^p)$: $f_a+f_b+f_W=0\Rightarrow f_a=f_b=f_W=0$.

The basis-restricted FIC ($\text{FIC}_B$) is the same condition restricted to a finite basis $B=(B_a,B_b)$; it is strictly weaker than abstract FIC when the function classes are infinite-dimensional, and coincides when $B$ exhausts them.

II.1.5 Theorem 1A (algebraic-functional identifiability)

Under (LIN) ($\mathcal F_a,\mathcal F_b,\mathcal F_W$ finite-dimensional linear subspaces), (C1)–(C5), (C6) at $\theta_$, and Abstract FIC at $\theta_$, the latent function $\theta_i^(\cdot)=\theta_+a(\cdot)+b(\cdot)\odot\theta_+W(\theta_)\cdot$ uniquely determines $(\theta_,a,b,W(\theta_))$.

The proof has three steps: (1) taking $\mathbb E_\mu$ of both representations and using (C1)–(C3) forces $\theta_=\theta_'$ (centering pins $\theta_{\text{ref}}$ to the marginal mean of $\theta_i(X)$); (2) subtract the common reference, form differences $\delta_a,\delta_b,\delta_W$, which by (LIN) remain in the classes, giving $\delta_a(x)+\delta_b(x)\odot\theta_+\delta_W x=0$; (3) Abstract FIC forces each summand to zero — $\delta_a=0$ directly, $\delta_b=0$ using (C6), $\delta_W=0$ using (C5) full-rank $X$. Scope: $W$ is pinned only at the single point $\theta_$; functional identification of $W$ on the prior support is Theorem 1E.

Necessity of FIC requires, in addition to (LIN) and (C1)–(C6), the hypothesis (EVAL): point evaluation $E_{\theta_}:\mathcal F_W\to\mathbb R^{p\times d}$, $W\mapsto W(\theta_)$, is surjective. If Abstract FIC fails (and (EVAL) holds), two distinct admissible tuples produce the same latent function — explicit non-identifiability. (LIN) lets perturbations stay in the classes; (EVAL) realizes the required $W$-shift at the anchor. Both are checkable in the finite representation and the library checks (EVAL) at the chosen anchor before fitting.

II.1.6 Lemma 1B (statistical bridge) and Corollary 1

Lemma 1B. Under Theorem 1A's hypotheses plus (D-ID) (the response family ${\mathcal D(\theta)}$ is identifiable in $\theta$: $\mathcal D(\theta)=\mathcal D(\theta')\Rightarrow\theta=\theta'$), the joint law of $(X_i,Y_i)$ determines $(\theta_,a,b,W(\theta_))$.

(D-ID) holds for one-parameter exponential families with canonical link, full-rank multi-parameter exponential families, ZIP/ZINB under independent variation of $\pi$ and $\theta$ (can fail at $\pi=1$), identifiable finite mixtures, and copula-linked multivariates under joint marginal+copula identifiability. (D-ID) is a hypothesis on the modelling setup, to be verified per response family, not a theorem of the framework. Corollary 1 restates the centering of the framework — $\mathbb E_\mu[\Delta(X,\theta_{\text{ref}})]=0$ — i.e. the Anchoring property is a theorem.

II.1.7 Proposition 1C (numerical verifiability — the Gram-matrix check)

Fix a basis $B$. The extended design matrix at $\theta_$ is $\mathbf Z_n(\theta_)=[,\Phi_a(X_{1:n}),\ \Phi_b(X_{1:n})\odot\theta_,\ X_{1:n},]\in\mathbb R^{np\times(J_a+J_b+d)}$, and $\mathbf G_n(\theta_)=n^{-1}\mathbf Z_n^\top\mathbf Z_n\to\mathbf G(\theta_*)$ in probability.

For the chosen finite representation $B$, $\text{FIC}B$ at $\theta$ holds iff $\mathbf G(\theta_)$ is non-singular (Gram non-singularity $=$ column linear independence).

Caveats: it diagnoses basis-restricted FIC, not abstract FIC; $\mathbf G_n\to\mathbf G$ only asymptotically (finite-sample near-singularity is a warning); singular $\mathbf G_n$ means non-identifiability within $B$. Operationally, before fitting the implementation computes the smallest eigenvalue and condition number of $\mathbf G_n(\theta_*)$; if $\lambda_{\min}<$ tol (default $10^{-8}$) it aborts and reports the dependent directions. This is gdpar_check_identifiability().

II.1.8 Condition C4-bis (cross-coordinate aliasing for $p>1$)

For $p>1$ with the coord-wise factorization, coordinate $k$ contributes $\eta_{i,k}=\theta_{\text{ref},k}+Z_{a,k}[i,\cdot]a_{\text{coef},k}+\sum_j(\theta_{\text{ref},k}^j-\theta_{\text{anchor},k}^j)W_{\text{raw}}[r_{k,j},\cdot]X[i,\cdot]^\top+\cdots$. If $Z_{a,k}$ and the modulating design $X$ share a column by name, a perturbation $a_k'(x)=a_k(x)+c,x_1$, $W_{k,x_1}'(\theta)=W_{k,x_1}(\theta)-c/(\theta_k^1-\theta_{\text{anchor},k}^1)$ leaves $\eta$ invariant at the diagnostic's single $\theta_{\text{ref},k}$ — aliasing.

(C4-bis) Coord-wise structural disjointness. For every $k$: $\mathrm{names}(Z_{a,k})\cap\mathrm{names}(X)=\emptyset$.

This is necessary but not sufficient (overlap enables aliasing; regularization can suppress it). The extended Gram matrix cannot detect it: at a fixed $\theta_{\text{ref}}$ the $W_{\text{per_k_dim}}$ columns collapse to scalar multiples of $X$ (false-positive rank deficit), and a Jacobian check at $W=0$ erases the very signal. So the package separates pre-fit (rank of $Z_{a,k}$ alone + name overlap with $X$, in check_C4_bis_per_k()) from post-fit posterior-geometry forensics (divergences, low ESS, high $\widehat R$). gdpar_check_identifiability(..., rigor=) offers "full" (default; aborts on overlap) and "fast" (warns of class gdpar_c4bis_overlap_warning, for users who intend the overlap and regularize the $W$ block via gdpar_prior()). The per-$k$ breakdown (passed, lambda_min/max, condition_number, shared_cols, collinear_directions) is in report$c4_bis$per_k.

II.1.9 Condition C7 (per-group anchor anti-aliasing)

Block 6.5 promotes the reference to a per-group anchor $\theta_{\text{ref}}[g]$, $g=1,\dots,J_{\text{groups}}$, drawn from $\mathrm{Normal}(\mu_{\theta_{\text{ref}}},\sigma_{\theta_{\text{ref}}})$ (activated by group = ~ species; group = NULL reduces bit-exactly to the single-anchor regime). If $Z_a$ (or $Z_b$) has a column rank-deficient with the per-group indicator $G$ (e.g. constant within each group), the model is non-identified along $\theta_{\text{ref}}[g]\mapsto\theta_{\text{ref}}[g]+c$, $a\mapsto a-c,e_{\text{col}}$.

(C7). When use_groups = 1, $\mathrm{rank}([G\mid Z_a])=\mathrm{ncol}(G)+\mathrm{ncol}(Z_a)$ and likewise for $Z_b$.

Enforced pre-fit by .check_group_aliasing_c7() in two layers: (1) within-group variance per column (catches constant-per-group / factor(group) aliases), (2) joint QR rank of normalized $[G\mid Z]$ (catches indirect aliases). Violation aborts with gdpar_input_error naming the columns. Together with C1–C4 (global Gram) and C4-bis (cross-component per coord), C7 completes a three-tier pre-flight; the post-fit forensic remains the posterior geometry.

II.1.10 Theorem 1E (functional identifiability of $W$ on the prior support)

In the Bayesian setting $\theta_{\text{ref}}\sim\pi_\Theta$. Theorem 1A pins $W$ only at $\theta_*$; outside $\mathrm{supp}(\pi_\Theta)$ no information about $W$ accrues. Three tiers, under Theorem 1A $\pi_\Theta$-a.e. and (D-ID):

  • (a) $W=W'$ $\pi_\Theta$-a.e. on $\mathrm{supp}(\pi_\Theta)$;
  • (b) $W=W'$ in $L^2(\pi_\Theta;\mathbb R^{p\times d})$ (the recommended default conclusion);
  • (c) $W=W'$ in $C(\overline{\mathrm{supp}(\pi_\Theta)})$, additionally requiring (BAY-1) support $=$ closure of a connected open $U$, (BAY-2) $W$ continuous (subsumed by (C5)), (BAY-3) $\pi_\Theta$ charges every non-empty open subset of $U$ (so the a.e.-identification set is dense). These are automatic for absolutely continuous priors with positive density on a connected open set.

No tier identifies $W$ outside $\overline{\mathrm{supp}(\pi_\Theta)}$: out-of-support reference predictions rely entirely on the prior and must be reported as such.

II.1.11 Proposition 1F (hypernetwork, Path 3) and the discrimination protocol

For the hypernetwork, $\mathcal F_a,\mathcal F_b,\mathcal F_W$ are nonlinear manifolds (images of feedforward nets), so (LIN) fails and Theorem 1A does not transfer. Then: (i) the realized function $\Phi_\phi(x,\theta)=a_\phi(x)+b_\phi(x)\odot\theta+W_\phi(\theta)x$ is the object of inference; (ii) the weights $\phi$ are not identifiable (permutation/sign/rescaling symmetries); (iii) under (D-ID), $\Phi_\phi$ is identifiable up to $L^2(\mu\otimes\pi_\Theta)$-equivalence — a function-level claim strictly weaker than identifiability of the AMM decomposition into $(a_\phi,b_\phi,W_\phi)$, which is an open question. Universal approximation (Hornik 1991; Pinkus 1999) gives density, not identifiability — the package records this distinction explicitly.

When Path 3 diverges from Path 1 on the same data, a four-step empirical protocol discriminates "richer structure" from "undetected non-identifiability": (1) stability across $\ge5$ random seeds (across-run SD vs within-run SD; "stable" iff $\le0.10\times$); (2) stratified $k$-fold ELPD comparison ($\overline{\Delta\text{elpd}}>2\text{SE}$ and positive in $\ge80%$ folds $\Rightarrow$ real structure); (3) posterior-predictive calibration at $\alpha\in{0.05,0.20,0.50}$; (4) component-wise decomposition of $\Delta\theta(x)$ into additive/multiplicative/modulated shares. A decision rule (5-row table) maps the joint outcome to a verdict (richer structure / mixed / no advantage / non-identifiability / model misspecified). The protocol is evidence-weighted, never a formal certificate.

II.1.12 Special cases and component selection

Standard models are Theorem-1A special cases verified to satisfy (LIN): standard regression (Level 0), random coefficients (Level 1, identifiability $\Leftrightarrow$ $\mathrm{Cov}_\mu(X)$ full rank), varying-coefficient (Level 2, $b\equiv0$, spline classes), hierarchical Bayes with multiplicative interaction (Level 2, $W\equiv0$; (C6) critical), full canonical AMM (Level 2). The hypernetwork falls outside (LIN), into Proposition 1F.

Component selection over the eight restrictions $M_S$, $S\subseteq{a,b,W}$, uses $\widehat{\text{elpd}}{\text{loo}}$ (Vehtari–Gelman–Gabry 2017) with differences $>2$ SE substantive; a stratified PSIS-LOO computed within $x$-quantile/categorical subgroups localizes structural heterogeneity and is reported by default. Frequentist (Path 2) selection uses generalized LRTs on effective d.f. (Wood 2017); Path 3 compares post-training component norms $|a\phi|,|b_\phi|,|W_\phi|$ to the regularization scale to prune dormant components.


II.2 Asymptotic theory of the three paths

The package develops asymptotics for all three paths to reference grade (only Path 1 is executable). The reference text throughout is Ghosal & van der Vaart (2017); AMM-specific theorems are specializations, with explicit statements of what is established, what the AMM specialization costs in extra hypotheses, and what remains open.

II.2.1 Path 1 (Hierarchical Bayesian) — three layers

The Path-1 model places priors on every component: $\theta_{\text{ref}}\sim\pi_\Theta$, $a\sim\pi_a$, $b\sim\pi_b$, $W\sim\pi_W$, joint prior $\pi=\pi_\Theta\otimes\pi_a\otimes\pi_b\otimes\pi_W$, posterior $\Pi_n\propto\pi\prod_i p(y_i\mid\theta_i)$. Write $\eta=(\theta_{\text{ref}},a,b,W)$, true value $\eta_*$. Two design regimes are distinguished because conclusions differ: (R-RANDOM) $X_i\overset{\text{iid}}\sim\mu$ (default) and (R-COND) fixed/conditional-on-design; they coincide under (R-EQUIV) (positive density of $\mu$ + uniformly Glivenko–Cantelli classes), automatic for finite-dim parametric classes.

Two distances, no global equivalence. Hellinger $d_H$ on the data distribution (the natural metric for Bayesian contraction) and $L^2(\mu)$ on the deviation function (the interpretable metric for the components). They are locally equivalent at $\eta_*$ only in the parametric smooth full-rank-Fisher case ($c_1 d_{L^2}\le d_H\le c_2 d_{L^2}$ on a neighborhood); globally and in non-parametric cases they can diverge. Theorems 4A/4B are stated in $d_H$ and specialized to $d_{L^2}$ only locally.

The three asymptotic layers parallel the three identifiability layers:

  • (L1) Posterior consistency$\Pi_n({d(\eta,\eta_*)>\varepsilon})\xrightarrow{P}0$ for every $\varepsilon$.
  • (L2) Contraction rate$\exists,\varepsilon_n\to0$, $n\varepsilon_n^2\to\infty$, with $\Pi_n({d>M\varepsilon_n})\xrightarrow{P}0$.
  • (L3) Bernstein–von Mises$\Pi_n(\sqrt n(\eta-\widehat\eta_n)\in\cdot)\xrightarrow{w}\mathcal N(0,I_*^{-1})$ in total variation.

Standing asymptotic hypotheses (additional to C1–C6, LIN, D-ID, IID): (PRIOR-KL) $\pi(B_\varepsilon(\eta_))>0$ for all $\varepsilon$ (KL-ball $B_\varepsilon={K(\eta_,\eta)\le\varepsilon^2,V(\eta_,\eta)\le\varepsilon^2}$); (PRIOR-THICK) $\pi(B_{\varepsilon_n}(\eta_))\ge e^{-C_1 n\varepsilon_n^2}$; (SIEVE) sieves $\Theta_n$ with $\pi(\Theta_n^c)\le e^{-C_2 n\varepsilon_n^2}$ and bracketing entropy $\log N_{[]}(\varepsilon_n,\Theta_n,d_H)\le C_3 n\varepsilon_n^2$; (TEST) tests with exponential type-I/II error decay; (LAN) local asymptotic normality at $\eta_$ with non-singular Fisher $I_$.

Theorem 4A (posterior consistency). Under C1–C6, LIN, D-ID, the Block-2 regularity (HOM)+(REG)+(IID), (PRIOR-KL), (TEST), and finite bracketing entropy on sieves with $\pi(\Theta_n^c)\to0$: $\Pi_n({d_H(\eta,\eta_)>\varepsilon})\xrightarrow{P_{\eta_}}0$. (Schwartz 1965 specialized to AMM; novelty is verifying PRIOR-KL and entropy for the product prior, which under LIN reduces to prior positivity at $\eta_*$.)

Theorem 4A discharges (REG-EST) of Block 2 in average-error form: the posterior-mean individual parameter $\widehat\theta_i^{\text{Bayes}}=\int[\theta_{\text{ref}}+\Delta(x_i,\theta_{\text{ref}})],d\Pi_n$ satisfies $\frac1n\sum_i|\widehat\theta_i^{\text{Bayes}}-\theta_*(x_i,\theta_{\text{ref}})|\xrightarrow{P}0$. The uniform-in-$i$ version needs sup-norm contraction (extra hypotheses, not implied by 4A).

Theorem 4B (contraction rate). Adding (PRIOR-THICK) and (SIEVE) for $\varepsilon_n\to0$, $n\varepsilon_n^2\to\infty$: $\Pi_n({d_H>M\varepsilon_n})\xrightarrow{P}0$ (Ghosal–Ghosh–van der Vaart 2000 specialized to AMM).

Rates by Level: Level 0/1 parametric $\varepsilon_n=n^{-1/2}$; VCM fixed-$J$ splines $(J/n)^{1/2}$, adaptive matched-smoothness $n^{-\beta/(2\beta+d)}$; Level 2 parametric $n^{-1/2}$, non-parametric determined by the slowest component subject to prior-matching caveats (the joint rate need not be a simple combination of component rates — numerical verification recommended). The library implements empirical contraction verification: refit at $n_1<\dots<n_K$, track credible-set diameters, and warn when the empirical rate diverges from prediction (signaling prior or model misspecification).

Theorem 4C (Bernstein–von Mises). For finite-dim parametric AMM (Levels 0/1/2 with finite-dim classes), under 4A+4B, (LAN), and a consistent $\sqrt n$-MLE: the posterior is asymptotically $\mathcal N(\widehat\eta_n,n^{-1}I_*^{-1})$ in total variation. Consequence: Bayesian credible intervals and asymptotic frequentist CIs coincide in the limit — this justifies reporting credible intervals as the primary uncertainty.

Proposition 4C-semi (semiparametric BvM). With parametric $\theta_{\text{ref}}$ and non-parametric $(a,b,W)$, under Castillo–Rousseau (2015) conditions ($\sqrt n$-recoverability + least-favorable-direction-aware prior), the marginal posterior of $\theta_{\text{ref}}$ is $\sqrt n$-asymptotically normal at the semiparametric efficiency bound $V_*$. Tight scope: this is only for the marginal of $\theta_{\text{ref}}$ — the function-valued $(a,b,W)$ need not be asymptotically Gaussian in a function-space metric, and the library reports their intervals as posterior quantiles (function-space credible balls), never as $\sqrt n$ Gaussian intervals.

Open questions explicitly recognized: (O1) full BvM for non-parametric components (only partial Sobolev-topology results exist); (O2) adaptive contraction rates for general AMM Path-1 priors; (O3) misspecification asymptotics under failure of (HOM)/(REG) (contraction to a KL-projection pseudo-true parameter, Kleijn–van der Vaart 2012). Implementation diagnostics: prior KL-support report, Stan $\widehat R$/ESS/divergences as indirect (TEST)/(SIEVE) checks, and a BvM calibration check comparing credible vs ML-Hessian intervals.

II.2.2 Path 2 (Varying-coefficient) and Path 3 (hypernetwork) — reference theory

These paths are not executable in gdpar 0.1.0 but carry reference-grade asymptotics:

  • Path 2 (VCM, vignette v05). Frequentist penalized-spline asymptotics: pointwise and uniform consistency of $\widehat\beta(\cdot)$, asymptotic normality at the spline rate $n^{-\beta/(2\beta+1)}$, with the reference recovered as $\beta(\bar z)$ and the deviation as $\beta(z)-\beta(\bar z)$; conditions specialize Fan–Zhang (2008), Stone (1985), Wood (2017). The curse of dimensionality in $z$ is the binding limitation.
  • Path 3 (hypernetwork, vignette v06). Only partial results are available for Bayesian neural networks: consistency under the Neural-Tangent-Kernel regime (Jacot et al. 2018; Bach 2017), PAC-Bayes generalization bounds (Dziugaite–Roy 2017), and an explicit acknowledgement that BvM and contraction rates are open (Hron et al. 2020). Universal approximation gives density, not identifiability; the function-level identifiability of $\Phi_\phi$ and its contraction are open (cf. Proposition 1F).

The cross-path consistency: parametric AMM gets the full $n^{-1/2}$ rate and full BvM; non-parametric AMM gets the smoothness-dependent rate and semiparametric BvM for the parametric subset only.


II.3 Empirical Bayes vs full Bayes

Partition $\eta=(\theta_{\text{ref}},\xi)$ with upper-level reference $\theta_{\text{ref}}$ and lower-level $\xi=(a,b,W)$. EB maximizes the marginal likelihood $L_n^{\text{marg}}(\theta_{\text{ref}})=\int p(Y_{1:n}\mid\theta_{\text{ref}},\xi)\pi_\xi(\xi),d\xi$ to get $\widehat\theta_{\text{ref}}^{\text{EB}}=\arg\max L_n^{\text{marg}}$, then conditions: $\Pi_n^{\text{EB}}(\xi)\propto p(Y_{1:n}\mid\widehat\theta_{\text{ref}}^{\text{EB}},\xi)\pi_\xi(\xi)$. FB gives $\theta_{\text{ref}}$ a prior $\pi_\Theta$ and integrates it out. Comparison metric: $d_{\text{TV}}(\Pi_n^{\text{EB}},\Pi_n^{\text{FB}})$. Standing hypotheses: (EB-MARG-ID) unique marginal-likelihood maximum with non-singular $I_{\theta\theta}^{\text{marg}}$; (PRIOR-FB-WEAK) $\mathrm{Var}n(\theta{\text{ref}}^*)/\mathrm{Var}\pi(\theta{\text{ref}})\to0$; (HIER-COMPLEX) bounded upper-level hyperparameters.

Theorem 7A (first-order equivalence). Under regularity + the three hypotheses, EB and FB lower-level posteriors agree asymptotically. Regime A (finite-dim parametric AMM): $d_{\text{TV}}(\Pi_n^{\text{EB}},\Pi_n^{\text{FB}})\xrightarrow{P}0$. Regime B (non-parametric AMM): TV is too strong; convergence holds for smooth bounded $L^2(\mu)$-Lipschitz functionals (equivalently Wasserstein-1/Prokhorov on the joint posterior). Specializes Petrone–Rousseau–Scricciolo (2014) and Rousseau–Szabo (2017). Practical content: for large $n$ + weak prior, EB and FB give essentially the same posterior over $\xi$; the choice is then computational/methodological.

Proposition 7B (higher-order coverage). Under Edgeworth-type expansion conditions (Bickel–Ghosh 1990), EB credible intervals for a smooth functional $g(\xi)$ under-cover by $O(n^{-1})$: $\mathbb P(g(\xi^)\in\mathrm{CI}n^{\text{EB},\alpha})=(1-\alpha)-C{g,\alpha}n^{-1}+o(n^{-1})$, with $C_{g,\alpha}\approx(g'(\xi^))^2/I_{\theta\theta}^{\text{marg}}\cdot\kappa(\alpha)$ (larger when $g$ is sensitive to $\theta_{\text{ref}}$, smaller when $\theta_{\text{ref}}$ is well-identified); FB covers to first order. The library applies a post-hoc inflation $\sqrt{1+C_{g,\alpha}/(n-q)}$ (argument eb_correction=TRUE), explicitly approximate.

Theorem 7C (compound decision, Robbins–Efron). For $K$ exchangeable units, $\frac1K\sum_k\mathbb E[(\widehat\xi_k^{\text{EB}}-\xi_k^)^2]\le\frac1K\sum_k\mathbb E[(\widehat\xi_k^{\text{FB}}-\xi_k^)^2]+B_K$ with $B_K\le\frac{C_1}{K}\mathbb E[(\widehat\theta_{\text{ref}}^{\text{EB}}-\theta_{\text{ref}}^*)^2]\to0$: EB risk approaches FB risk as $K\to\infty$ (squared-error loss on point estimates only — coverage still under-covers per 7B).

Proposition 7D. EB and FB differ substantially when: (i) small $I_{\theta\theta}^{\text{marg}}$ (poorly identified upper level); (ii) strongly informative $\pi_\Theta$; (iii) multimodal $L_n^{\text{marg}}$; (iv) misspecified $\pi_\xi$ (EB regularizes by tuning $\theta_{\text{ref}}$, FB does not).

Default is FB (gdpar()); EB (gdpar_eb()) is opt-in. EB's independent methodological advantages: honest avoidance of a prior on $\theta_{\text{ref}}$, and a deterministic, reproducible, frequentist-interpretable $\widehat\theta_{\text{ref}}^{\text{EB}}$.

Multivariate / multi-slot extension (vignette v07b): Theorem 7A* (to $\theta_{\text{ref}}\in\mathbb R^p$), Proposition 7B* (matricial coverage correction), Theorem 7C* (compound multi-slot, $K>1$), Proposition 7D* (four conditions extended), and open question O5*-EBFB on the anti-fragility of the Laplace Hessian under $p>1,K>1$.

The four gdpar_eb() path regimes (dispatched from the resolved $(K,p)$), each with a marginal+conditional Stan template pair and its own Proposition-7B correction:

Regime $(K,p)$ Stan template pair 7B correction
Base $K=1,p=1$ amm_eb_marginal.stan + amm_eb_conditional.stan scalar
Path A $K=1,p>1$ amm_eb_marginal_multi.stan + ..._conditional_multi.stan matricial 7B*
Path B $K>1,p=1$ amm_eb_marginal_K.stan + ..._conditional_K.stan per-slot scalar
Path C $K>1,p>1$ amm_eb_marginal_KxP.stan + ..._conditional_KxP.stan tensor $\mathbb R^{K\times p\times p}$

The canonical EB recipe (three steps): (i) marginal-likelihood maximization for $\widehat\theta_{\text{ref}}^{\text{EB}}$ via Laplace (cmdstanr::laplace() with multi-start + Levenberg–Marquardt ridge + condition-number guard — the anti-fragility strategy); (ii) plug $\widehat\theta_{\text{ref}}^{\text{EB}}$ as data into a conditional Stan model; (iii) sample the conditional posterior of $\xi$. The companion gdpar_compare_eb_fb() operationally verifies Theorem 7A (marginal TV) and Proposition 7B (per-cell width ratio).


II.6 Parametrizations and the pre-flight decision (CP / NCP / linear)

Hierarchical AMM posteriors can suffer the classic centered/non-centered pathology of hierarchical models. The package treats the parametrization of the multiplicative interaction $b(x)\odot\theta_{\text{ref}}$ as a first-class, data-driven decision rather than a fixed choice, and selects it pre-fit. Source: R/preflight.R, R/preflight_multi.R, R/contraction_diagnostic.R.

II.6.1 The three parametrizations

For the multiplicative term, write the contribution of coordinate involving the reference as $c_b=\theta_{\text{ref}}\cdot b_{\text{coef}}$ on the linear-predictor scale. Three parametrizations are available:

  • Centered (CP). Sample $b_{\text{coef}}$ directly; the term $\theta_{\text{ref}}\cdot b_{\text{coef}}$ couples $b_{\text{coef}}$ to $\theta_{\text{ref}}$ multiplicatively. CP mixes well when the data are informative about the interaction (the likelihood dominates the funnel).
  • Non-centered (NCP). Reparametrize $b_{\text{coef}}=\mu_b+\tau_b,\tilde b$ with $\tilde b\sim\mathrm{Normal}(0,1)$, decoupling the prior geometry; NCP mixes well when the data are weakly informative (prior-dominated funnel).
  • Linear reparametrization. Sample the product $c_b=\theta_{\text{ref}}\cdot b_{\text{coef}}$ directly as a linear coordinate, sidestepping the bilinear $(\theta_{\text{ref}},b)$ geometry altogether. This is the package's resolution of the deeper diagnosis (below): the root cause of the funnel is the non-linear $(\theta_{\text{ref}},b)$ parametrization, not the centering per se; sampling the linear product removes the bilinearity at the source.

The diagnosis that led here proceeded in three iterations — NCP did not cure it, CP did not cure it, and the residual pathology was traced to the bilinear coordinate; the final fix samples $c_b$ linearly. This is the maximum-robustness, root-cause resolution rather than a symptom patch.

II.6.2 The pre-flight decision

The choice among CP/NCP/(linear) is made by a pre-flight procedure that runs a short pilot and computes an information ratio discriminating prior-dominated from likelihood-dominated regimes, then dispatches:

  • preflight_parametrization() (scalar) / preflight_parametrization_multi() ($p>1$, per coordinate) run the pilot sample and the attribution/info-ratio computation.
  • The information ratio contrasts how much of the posterior variation in the interaction is attributable to the data vs the prior; an asymptotic $z$/$t$-style test on it, evaluated against thresholds, picks the regime. Defaults: tau_cp = 5, tau_ncp = 2 (the boundary thresholds of the ratio); a high ratio (data-informative) selects CP, a low ratio selects NCP.
  • The variance/score machinery is made dependence-aware: effective weights + a chain-aware block bootstrap + an asymptotic z-test give the ratio's sampling distribution without assuming independence of the pilot draws (Path B′ canonical design). The per-coordinate variant decides each coordinate of $\theta_{\text{ref}}\in\mathbb R^p$ separately and resolves a global vs per-dimension decision.
  • resolve_parametrization() / resolve_parametrization_multi() turn the diagnostic verdict into the concrete Stan-side toggle used by the code generator. preflight_global_decision() and preflight_per_dim() are the exported user-facing entry points; decision_to_logical() maps the verdict to the boolean the template consumes.

Confounding-induced NCP preference is treated as correct, not a defect: when a covariate confounds the reference, the prior-dominated geometry genuinely calls for NCP, and the diagnostic is designed to detect exactly that. The whole apparatus is data-driven: no parametrization is hard-coded; the knob is set from a declared, reproducible statistic of a pilot fit.

II.6.3 The contraction diagnostic

gdpar_contraction_diagnostic() (R/contraction_diagnostic.R) operationalizes the empirical contraction-rate verification of Theorem 4B (§II.2.1): it refits at increasing sample sizes and tracks the posterior credible-set diameter, flagging deviations from the predicted rate $\varepsilon_n$ as evidence of prior or model misspecification. It is the runtime instrument behind the asymptotic theory's "numerical verification of contraction".


II.4 Distributional families and links

A family is represented internally as an ordered list of per-parameter specifications (gdpar_param_spec), one per slot $k$. Each spec carries: a name ($\mu,\sigma,\phi,\nu,\pi,p,\dots$), a link (and its inverse), a family_role ∈ {location, scale, shape, df, mixture_pi, power}, a support (real_line/positive_real/unit_interval/bounded_open), an identifiability descriptor (did_status ∈ {holds, holds_under_condition, user_responsible} with condition + reference, i.e. the (D-ID) hypothesis of Lemma 1B made per-slot), a canonical prior kind, and a scope (per_observation or population). The link factory implements exactly three links:

$$\text{identity: } g(\mu)=\mu,\ g^{-1}(\eta)=\eta;\qquad \text{log: } g(\mu)=\log\mu,\ g^{-1}(\eta)=e^{\eta};\qquad \text{logit: } g(\mu)=\log\tfrac{\mu}{1-\mu},\ g^{-1}(\eta)=\tfrac{1}{1+e^{-\eta}}.$$

The built-in roster (location slot gets the user link; auxiliaries get fixed links — log for positive-real, logit for unit-interval, identity for the Tweedie power):

Family Slots (role, link)
gaussian $\mu$ (location), $\sigma$ (scale, log)
poisson $\mu$ (location)
neg_binomial_2 $\mu$ (location), $\phi$ (scale, log)
bernoulli $\mu$ (location)
beta $\mu$ (location), $\phi$ (scale, log)
gamma $\mu$ (location), shape (log)
student_t $\mu$, $\sigma$ (log), $\nu$ (df, log)
tweedie $\mu$ (log), $\phi$ (log), $p$ (power, identity, bounded $(1,2)$)
zip $\mu$ (location), $\pi$ (mixture_pi, logit)
zinb $\mu$, $\phi$ (log), $\pi$ (logit)
hurdle_poisson $\mu$, $\pi$ (logit)
hurdle_neg_binomial_2 $\mu$, $\phi$ (log), $\pi$ (logit)

Each family carries an integer stan_id selecting the likelihood branch in the Stan templates (Part V). Auxiliary slots default to population scope and can be promoted to per_observation (their own AMM decomposition) by the user — that is distributional regression: $\theta_i^{(k)}=\theta_{\text{ref}}^{(k)}+\Delta^{(k)}(x_i,\theta_{\text{ref}}^{(k)})$ on the slot's link scale. The package also supports heterogeneous per-slot families (different sub-family per slot) and custom families (gdpar_family_custom, gdpar_family_custom_K) where the user declares the likelihood and accepts responsibility for (D-ID) via did_override. The exported gdpar_family, gdpar_family_multi (the $p>1$ multivariate slot constructor) and the print methods complete the API. Zero-inflation/hurdle realize the dual deviation of §I.6 (both $\pi$ and the count parameter anchored). Full per-function detail: Part IV, R/families.R; per-likelihood Stan math: Part V.

II.5 The modulating basis W

The modulating component $W(\theta_{\text{ref}})x$ draws $W$ from a finite-dimensional basis built by W_basis(type, degree, knots, df, boundary_knots, basis_fn, dim, p) with three types:

  • polynomial (default degree 1): block-by-coordinate powers, no cross-terms, $$\text{eval}(\theta)=\big(\theta_1,\theta_1^2,\dots,\theta_1^{\deg},\ \theta_2,\dots,\theta_2^{\deg},\ \dots,\ \theta_{p},\dots,\theta_{p}^{\deg}\big);$$
  • bspline (default degree 3): per-coordinate B-spline bases concatenated, $\text{eval}(\theta)=(B(\theta_1),\dots,B(\theta_p))$, with Cox–de Boor evaluation performed Stan-side for differentiability inside HMC; the R side (.gdpar_resolve_bspline_knots, .gdpar_bspline_knots_full, .gdpar_validate_bspline_boundary_range) resolves interior knots (from knots or df) and boundary knots and validates the reference range; boundary_knots defaults to range(c(knots, tk));
  • user: an arbitrary basis_fn of declared dim.

materialize_W_basis() populates dim, p, and block_indices (the per-coordinate column blocks) once $p$ is known; as_per_k() reshapes to per-coordinate form. The anchoring constraint (C4) $W(\theta_0)=0$ is enforced by the differenced parametrization $W(\theta)=W_0(\theta)-W_0(\theta_0)$ (the apply_W_basis_diff mechanism, selected by a W_type_id). Detail: Part IV, R/W_basis.R; Stan-side Cox–de Boor: Part V.

II.7 Grouped references (per-group anchors)

Activated by group = ~ factor in gdpar(), the scalar/coord-wise reference is promoted to a per-group anchor $\theta_{\text{ref}}[g]\sim\mathrm{Normal}(\mu_{\theta_{\text{ref}}},\sigma_{\theta_{\text{ref}}})$, $g=1,\dots,J_{\text{groups}}$, with its own hierarchical hyperprior ($\sigma_{\theta_{\text{ref}}}$ a learned between-group scale). group = NULL reduces bit-exactly to the single-anchor regime. The anti-aliasing condition (C7) (§II.1.9) is enforced pre-fit by .check_group_aliasing_c7(). Anchor resolution across regimes is handled by resolve_anchor, resolve_anchor_multi, resolve_anchor_K; extraction of grouped references by the .extract_theta_ref_*_grouped helpers (Part IV, R/gdpar.R, R/methods.R).

II.8 The causal bridge (CATE / ITE)

Because the AMM produces individual parameters, individual treatment effects follow directly. gdpar_causal_bridge() implements a T-learner: fit the anchored model separately under treatment and control and read the individual CATE/ITE as the difference of anchored individual predictions, $$\widehat\tau(x_i)=\widehat\mu_1(x_i)-\widehat\mu_0(x_i),$$ with posterior draws giving full uncertainty on $\tau(x_i)$ (summarized by .summarize_cate). A battery of pre-checks (.check_bridge_path/_hierarchical/_family/_dim/_amm/_anchor) guards the bridge's applicability. The second layer, gdpar_compare_meta_learners(), benchmarks the AMM learner against external meta-learners through a pluggable adapter registry: gdpar_adapter_grf (R, generalized random forests) and gdpar_adapter_econml (Python EconML via reticulate, e.g. CausalForestDML realizing orthogonal/DML CATE). Adapters honor a two-layer contract (fit_predict_fun + optional predict_fun). Detail: Part IV, R/causal_bridge.R, R/compare_meta_learners.R, R/adapter_*.R; theory: vignettes v08, v08b, v08c.

II.9 Geometry-adaptive sampling (opt-in)

For geometrically hostile posteriors (funnels, near-determinism, heavy tails, multimodality), the package ships an opt-in geometry engine whose default path is bit-identical to ordinary HMC. When enabled it climbs a ladder of metrics $G(\theta)$ for Riemannian-manifold HMC:

  • Euclidean (gdpar_geom_metric_euclidean) — baseline mass matrix;
  • Riemannian / Fisher (gdpar_geom_metric_gp_fisher, gdpar_geom_metric_riemannian) with the SoftAbs regularization of the Hessian eigenvalues, $\lambda\mapsto\lambda\coth(\alpha\lambda)$, giving a positive-definite metric from a possibly-indefinite Hessian; log-Cholesky parametrization of $G$ and its differentials ($dM/d\psi$);
  • sub-Riemannian (gdpar_geom_metric_subriemannian) — a degenerate/constrained metric flowing along admissible directions;
  • relativistic / Finsler (gdpar_geom_metric_relativistic) — a relativistic kinetic energy capping velocities (heavy-tail/ill-conditioning robustness), with its own radial integrator.

A certifying orchestrator (gdpar_geom_orchestrate, with _criteria and _budget variants) diagnoses the pathology (gdpar_geometry_diagnostic: multimodality, heavy kurtosis, boundary proximity, difficulty curve), selects an entry rung, builds the metric, tunes $\varepsilon$, runs the success gate, and emits a certificate (gdpar_geom_certificate). A Laplace fallback (gdpar_geom_laplace: Newton/Laplace climb, observed information, unconstrained draws, fit-quality label) provides a plug-in posterior and an ELPD on par with mgcv-REML / INLA-Laplace when full sampling is certified infeasible (the resolution of the near-deterministic Tweedie case). Detail: Part IV, R/geometry_engine.R, R/geometry_orchestrator.R, R/geometry_suite.R, R/geometry_bridge.R, R/geometry_laplace.R, R/geometry_diagnostic.R; theory: vignette vop08.

II.10 Dependence-robust inference

gdpar does not model dependence in its point structure; it makes the inference robust to unmodelled dependence (point estimates unchanged, only uncertainty made robust). Two axes:

  • Temporal (gdpar_dependence_diagnostic, gdpar_dependence_robust): diagnostics = lag-1 autocorrelation, Durbin–Watson, Ljung–Box on residuals; robust SE/intervals via moving/circular block bootstrap with a data-driven block length. The automatic length is the Politis–White (2004) + Patton–Politis–White (2009) flat-top rule (a base-R hand-roll equal to np::b.star): flat-top autocovariance + adaptive bandwidth $b=(2\hat g^2/D)^{1/3}n^{1/3}$, $D=\tfrac43 \text{spec}^2$; default block_length=NULL reduces bit-exactly to the $n^{1/3}$ Künsch rate.
  • Spatial (gdpar_spatial_dependence_diagnostic, gdpar_spatial_dependence_robust): diagnostic = Moran's $I$ (hand-roll, no spdep/sf) with knn row-standardized / distance-band / user-supplied weights, two-sided permutation default + analytic Cliff–Ord option; robust inference via tiled randomized-origin spatial block bootstrap (Politis–Romano–Lahiri) with variance-optimal block side $g=\max(2,\lceil n^{1/4}\rceil)$ in $d=2$, plus a data-driven subsampling calibration of $g$.

Both axes work on the EB scalar path and the FB path (gdpar_fit and gdpar_eb_fit) via a shared engine (.gdpar_dependence_robust_engine) with class-dispatched estimate/SE/residual extractors and a frozen RNG contract that keeps the EB route bit-exact. Detail: Part IV, R/dependence_robust.R; theory: vignette vop09.



← Part I — Conceptual Framework · gdpar Wiki Home · Part III — Computational Architecture →

Clone this wiki locally