Part II Mathematical Foundations

← Part I — Conceptual Framework · gdpar Wiki Home · Part III — Computational Architecture →

Part II — Mathematical Foundations

This part develops, at reference grade, the mathematics the package implements: the AMM decomposition and its identifiability theory (§II.1), the asymptotic theory of the three paths (§II.2), the Empirical-Bayes theory and its multivariate extension (§II.3), the distributional families and links (§II.4), the B-spline W bases (§II.5), grouped references (§II.6), the causal-inference bridge (§II.7), the geometry-adaptive sampling metrics (§II.8), and the dependence-robust inference machinery (§II.9). The identifiability material corresponds to package source R/check_identifiability.R, R/preflight*.R, R/amm_spec.R; the conditions stated here are enforced in code and the cross-references make the link explicit.

II.1 The AMM canonical form and identifiability

II.1.1 Setting

There are $n$ units $i=1,\dots,n$. Each has covariates $x_i\in\mathcal X\subseteq\mathbb R^d$ drawn i.i.d. from a population law $\mu$, a latent parameter vector $\theta_i\in\mathbb R^p$, and an outcome $y_i\sim\mathcal D(\theta_i)$. The framework posits $\theta_i=\theta_{\text{ref}}+\Delta(x_i,\theta_{\text{ref}})$ with $\theta_{\text{ref}}\in\Theta\subseteq\mathbb R^p$ and $\Delta:\mathcal X\times\Theta\to\mathbb R^p$. Two questions are answered: what form does $\Delta$ take, and when are $(\theta_{\text{ref}},\Delta)$ recoverable from data. The second is split into three logically distinct layers kept separate throughout:

(L1) Algebraic-functional identifiability — does the latent function $\theta_i(\cdot)$ determine the components $(\theta_{\text{ref}},a,b,W)$? (Theorem 1A.)
(L2) Statistical identifiability — do the observable data ${(x_i,y_i)}$ determine them? (Lemma 1B, via a hypothesis on the response family.)
(L3) Numerical verifiability — in a chosen finite basis, can a runtime diagnostic detect identifiability or its failure? (Proposition 1C, the Gram-matrix check.)

II.1.2 The AMM hierarchy

The space of deviation forms is stratified by joint polynomial order in $(x,\theta_{\text{ref}})$:

Level	Form	Joint order	Status
0 (degenerate)	$\Delta=0$	$(0,0)$	standard non-mixed regression
1 (linear additive)	$\Delta=Ax$, $A$ fixed	$(1,0)$	classical random coefficients; no reference-dependence
2 (canonical AMM)	$\Delta=a(x)+b(x)\odot\theta_{\text{ref}}+W(\theta_{\text{ref}})x$	mixed	the canonical default
2.5 (full-matrix mult.)	$\Delta=a(x)+B(x)\theta_{\text{ref}}+W(\theta_{\text{ref}})x$	—	Level 2 is $B=\mathrm{diag}(b)$
3 (quadratic)	adds $x^\top Q_k,x$, $\theta^\top R_k,\theta$, $\theta\odot W x$	up to $(1,2)$	—
$K$ (polynomial closure)	$\sum_{1\le j+k\le K,,j\ge1}\mathcal T_{j,k}(x^{\otimes j},\theta_{\text{ref}}^{\otimes k})$	$\le K$	$j=0$ terms excluded (non-individuating)
$\infty$ (universal)	$\Delta=\Phi(x,\theta_{\text{ref}})$ arbitrary measurable	—	hypernetwork; Proposition 1F

The canonical Level-2 AMM is

$$\boxed{;\Delta(x,\theta_{\text{ref}})=\underbrace{a(x)}_{\text{additive}}+\underbrace{b(x)\odot\theta_{\text{ref}}}_{\text{multiplicative (Hadamard)}}+\underbrace{W(\theta_{\text{ref}}),x}_{\text{modulated}};}$$

with $a,b:\mathcal X\to\mathbb R^p$ measurable and $W:\Theta\to\mathbb R^{p\times d}$ continuous with $W(\theta_0)=0$ at an anchor $\theta_0$. It is singled out by (P1) minimality of reference-dependence (smallest extension of Level 1 in which $\theta_{\text{ref}}$ enters nontrivially), (P2) separability of its three mechanisms, (P3) computational tractability (closed-form gradients, finite sufficient statistics, tractable HMC).

Approximation (Scheme 1D). On compact $\mathcal X,\Theta$, for continuous $\Delta$ and any $\varepsilon>0$ there is a finite $K$ and a Level-$K$ form with polynomial $a,b,W$ uniformly $\varepsilon$-close (Stone–Weierstrass). Flagged caveats: this is an approximation, not an identifiability statement; it covers only the polynomial restriction; it needs compactness; it gives no rate.

II.1.3 Standing assumptions (the six conditions enforced in code)

	Condition	Meaning
(C1)	$\mathbb E_\mu[X]=0$	covariates centered
(C2)	$\mathbb E_\mu[a(X)]=0$	additive component centered
(C3)	$\mathbb E_\mu[b(X)]=0$	multiplicative component centered
(C4)	$W(\theta_0)=0$ for some $\theta_0\in\Theta$	modulating matrix anchored
(C5)	$\mathrm{Cov}_\mu(X)$ full rank $d$; $a,b\in L^2$; $W\in C(\Theta)$ bounded	support/integrability
(C6)	$\theta_*\in\Theta^\circ={\theta:\theta_k\neq0\ \forall k}$	non-degeneracy of the reference

The joint consequence of (C1)–(C4) is the centering of the framework: $\mathbb E_\mu[\Delta(X,\theta_{\text{ref}})]=0$ for every $\theta_{\text{ref}}$ — when an individual is "average", its deviation is zero. (C6) is genuine: a zero coordinate of $\theta_*$ kills the multiplicative term there and $b$ becomes unidentified in that coordinate; it is met for parameters bounded away from zero (rates, scales, precisions) or enforced by reparametrization ($\log\theta_{\text{ref}}$).

In the implementation, (C2)/(C3) are enforced empirically by column-centering the design matrices $Z_a,Z_b$ in the R-side AMM design constructor (so $\mathrm{colMeans}(Z_a)=0$ exactly); no sum-to-zero constraint is imposed on the coefficients $a_{\text{coef}},b_{\text{coef}}$ (that would wrongly exclude effects with non-zero coefficient mean). (C4) is enforced by the parametrization $W(\theta)=W_0(\theta)-W_0(\theta_0)$.

II.1.4 Functional Independence Condition (FIC)

Let $\mathcal F_a,\mathcal F_b\subseteq L^2_0(\mu,\mathbb R^p)$, $\mathcal F_W\subseteq C(\Theta,\mathbb R^{p\times d})$ be the component classes. Define the subspaces $\mathcal S_a=\mathcal F_a$, $\mathcal S_b(\theta_)={x\mapsto h(x)\odot\theta_:h\in\mathcal F_b}$, $\mathcal S_W={x\mapsto Mx:M\in\mathbb R^{p\times d}}$.

Abstract FIC at $\theta_*$. $\mathcal S_a,\mathcal S_b(\theta_*),\mathcal S_W$ are linearly independent in $L^2(\mu,\mathbb R^p)$: $f_a+f_b+f_W=0\Rightarrow f_a=f_b=f_W=0$.

The basis-restricted FIC ($\text{FIC}_B$) is the same condition restricted to a finite basis $B=(B_a,B_b)$; it is strictly weaker than abstract FIC when the function classes are infinite-dimensional, and coincides when $B$ exhausts them.

II.1.5 Theorem 1A (algebraic-functional identifiability)

Under (LIN) ($\mathcal F_a,\mathcal F_b,\mathcal F_W$ finite-dimensional linear subspaces), (C1)–(C5), (C6) at $\theta_$, and Abstract FIC at $\theta_$, the latent function $\theta_i^(\cdot)=\theta_+a(\cdot)+b(\cdot)\odot\theta_+W(\theta_)\cdot$ uniquely determines $(\theta_,a,b,W(\theta_))$.

The proof has three steps: (1) taking $\mathbb E_\mu$ of both representations and using (C1)–(C3) forces $\theta_=\theta_'$ (centering pins $\theta_{\text{ref}}$ to the marginal mean of $\theta_i(X)$); (2) subtract the common reference, form differences $\delta_a,\delta_b,\delta_W$, which by (LIN) remain in the classes, giving $\delta_a(x)+\delta_b(x)\odot\theta_+\delta_W x=0$; (3) Abstract FIC forces each summand to zero — $\delta_a=0$ directly, $\delta_b=0$ using (C6), $\delta_W=0$ using (C5) full-rank $X$. Scope: $W$ is pinned only at the single point $\theta_$; functional identification of $W$ on the prior support is Theorem 1E.

Necessity of FIC requires, in addition to (LIN) and (C1)–(C6), the hypothesis (EVAL): point evaluation $E_{\theta_}:\mathcal F_W\to\mathbb R^{p\times d}$, $W\mapsto W(\theta_)$, is surjective. If Abstract FIC fails (and (EVAL) holds), two distinct admissible tuples produce the same latent function — explicit non-identifiability. (LIN) lets perturbations stay in the classes; (EVAL) realizes the required $W$-shift at the anchor. Both are checkable in the finite representation and the library checks (EVAL) at the chosen anchor before fitting.

II.1.6 Lemma 1B (statistical bridge) and Corollary 1

Lemma 1B. Under Theorem 1A's hypotheses plus (D-ID) (the response family ${\mathcal D(\theta)}$ is identifiable in $\theta$: $\mathcal D(\theta)=\mathcal D(\theta')\Rightarrow\theta=\theta'$), the joint law of $(X_i,Y_i)$ determines $(\theta_,a,b,W(\theta_))$.

(D-ID) holds for one-parameter exponential families with canonical link, full-rank multi-parameter exponential families, ZIP/ZINB under independent variation of $\pi$ and $\theta$ (can fail at $\pi=1$), identifiable finite mixtures, and copula-linked multivariates under joint marginal+copula identifiability. (D-ID) is a hypothesis on the modelling setup, to be verified per response family, not a theorem of the framework. Corollary 1 restates the centering of the framework — $\mathbb E_\mu[\Delta(X,\theta_{\text{ref}})]=0$ — i.e. the Anchoring property is a theorem.

II.1.7 Proposition 1C (numerical verifiability — the Gram-matrix check)

Fix a basis $B$. The extended design matrix at $\theta_$ is $\mathbf Z_n(\theta_)=[,\Phi_a(X_{1:n}),\ \Phi_b(X_{1:n})\odot\theta_,\ X_{1:n},]\in\mathbb R^{np\times(J_a+J_b+d)}$, and $\mathbf G_n(\theta_)=n^{-1}\mathbf Z_n^\top\mathbf Z_n\to\mathbf G(\theta_*)$ in probability.

For the chosen finite representation $B$, $\text{FIC}B$ at $\theta$ holds iff $\mathbf G(\theta_)$ is non-singular (Gram non-singularity $=$ column linear independence).

Caveats: it diagnoses basis-restricted FIC, not abstract FIC; $\mathbf G_n\to\mathbf G$ only asymptotically (finite-sample near-singularity is a warning); singular $\mathbf G_n$ means non-identifiability within $B$. Operationally, before fitting the implementation computes the smallest eigenvalue and condition number of $\mathbf G_n(\theta_*)$; if $\lambda_{\min}<$ tol (default $10^{-8}$) it aborts and reports the dependent directions. This is gdpar_check_identifiability().

II.1.8 Condition C4-bis (cross-coordinate aliasing for $p>1$)

For $p>1$ with the coord-wise factorization, coordinate $k$ contributes $\eta_{i,k}=\theta_{\text{ref},k}+Z_{a,k}[i,\cdot]a_{\text{coef},k}+\sum_j(\theta_{\text{ref},k}^j-\theta_{\text{anchor},k}^j)W_{\text{raw}}[r_{k,j},\cdot]X[i,\cdot]^\top+\cdots$. If $Z_{a,k}$ and the modulating design $X$ share a column by name, a perturbation $a_k'(x)=a_k(x)+c,x_1$, $W_{k,x_1}'(\theta)=W_{k,x_1}(\theta)-c/(\theta_k^1-\theta_{\text{anchor},k}^1)$ leaves $\eta$ invariant at the diagnostic's single $\theta_{\text{ref},k}$ — aliasing.

(C4-bis) Coord-wise structural disjointness. For every $k$: $\mathrm{names}(Z_{a,k})\cap\mathrm{names}(X)=\emptyset$.

This is necessary but not sufficient (overlap enables aliasing; regularization can suppress it). The extended Gram matrix cannot detect it: at a fixed $\theta_{\text{ref}}$ the $W_{\text{per_k_dim}}$ columns collapse to scalar multiples of $X$ (false-positive rank deficit), and a Jacobian check at $W=0$ erases the very signal. So the package separates pre-fit (rank of $Z_{a,k}$ alone + name overlap with $X$, in check_C4_bis_per_k()) from post-fit posterior-geometry forensics (divergences, low ESS, high $\widehat R$). gdpar_check_identifiability(..., rigor=) offers "full" (default; aborts on overlap) and "fast" (warns of class gdpar_c4bis_overlap_warning, for users who intend the overlap and regularize the $W$ block via gdpar_prior()). The per-$k$ breakdown (passed, lambda_min/max, condition_number, shared_cols, collinear_directions) is in report$c4_bis$per_k.

II.1.9 Condition C7 (per-group anchor anti-aliasing)

Block 6.5 promotes the reference to a per-group anchor $\theta_{\text{ref}}[g]$, $g=1,\dots,J_{\text{groups}}$, drawn from $\mathrm{Normal}(\mu_{\theta_{\text{ref}}},\sigma_{\theta_{\text{ref}}})$ (activated by group = ~ species; group = NULL reduces bit-exactly to the single-anchor regime). If $Z_a$ (or $Z_b$) has a column rank-deficient with the per-group indicator $G$ (e.g. constant within each group), the model is non-identified along $\theta_{\text{ref}}[g]\mapsto\theta_{\text{ref}}[g]+c$, $a\mapsto a-c,e_{\text{col}}$.

(C7). When use_groups = 1, $\mathrm{rank}([G\mid Z_a])=\mathrm{ncol}(G)+\mathrm{ncol}(Z_a)$ and likewise for $Z_b$.

Enforced pre-fit by .check_group_aliasing_c7() in two layers: (1) within-group variance per column (catches constant-per-group / factor(group) aliases), (2) joint QR rank of normalized $[G\mid Z]$ (catches indirect aliases). Violation aborts with gdpar_input_error naming the columns. Together with C1–C4 (global Gram) and C4-bis (cross-component per coord), C7 completes a three-tier pre-flight; the post-fit forensic remains the posterior geometry.

II.1.10 Theorem 1E (functional identifiability of $W$ on the prior support)

In the Bayesian setting $\theta_{\text{ref}}\sim\pi_\Theta$. Theorem 1A pins $W$ only at $\theta_*$; outside $\mathrm{supp}(\pi_\Theta)$ no information about $W$ accrues. Three tiers, under Theorem 1A $\pi_\Theta$-a.e. and (D-ID):

(a) $W=W'$ $\pi_\Theta$-a.e. on $\mathrm{supp}(\pi_\Theta)$;
(b) $W=W'$ in $L^2(\pi_\Theta;\mathbb R^{p\times d})$ (the recommended default conclusion);
(c) $W=W'$ in $C(\overline{\mathrm{supp}(\pi_\Theta)})$, additionally requiring (BAY-1) support $=$ closure of a connected open $U$, (BAY-2) $W$ continuous (subsumed by (C5)), (BAY-3) $\pi_\Theta$ charges every non-empty open subset of $U$ (so the a.e.-identification set is dense). These are automatic for absolutely continuous priors with positive density on a connected open set.

No tier identifies $W$ outside $\overline{\mathrm{supp}(\pi_\Theta)}$: out-of-support reference predictions rely entirely on the prior and must be reported as such.

II.1.11 Proposition 1F (hypernetwork, Path 3) and the discrimination protocol

For the hypernetwork, $\mathcal F_a,\mathcal F_b,\mathcal F_W$ are nonlinear manifolds (images of feedforward nets), so (LIN) fails and Theorem 1A does not transfer. Then: (i) the realized function $\Phi_\phi(x,\theta)=a_\phi(x)+b_\phi(x)\odot\theta+W_\phi(\theta)x$ is the object of inference; (ii) the weights $\phi$ are not identifiable (permutation/sign/rescaling symmetries); (iii) under (D-ID), $\Phi_\phi$ is identifiable up to $L^2(\mu\otimes\pi_\Theta)$-equivalence — a function-level claim strictly weaker than identifiability of the AMM decomposition into $(a_\phi,b_\phi,W_\phi)$, which is an open question. Universal approximation (Hornik 1991; Pinkus 1999) gives density, not identifiability — the package records this distinction explicitly.

When Path 3 diverges from Path 1 on the same data, a four-step empirical protocol discriminates "richer structure" from "undetected non-identifiability": (1) stability across $\ge5$ random seeds (across-run SD vs within-run SD; "stable" iff $\le0.10\times$); (2) stratified $k$-fold ELPD comparison ($\overline{\Delta\text{elpd}}>2\text{SE}$ and positive in $\ge80%$ folds $\Rightarrow$ real structure); (3) posterior-predictive calibration at $\alpha\in{0.05,0.20,0.50}$; (4) component-wise decomposition of $\Delta\theta(x)$ into additive/multiplicative/modulated shares. A decision rule (5-row table) maps the joint outcome to a verdict (richer structure / mixed / no advantage / non-identifiability / model misspecified). The protocol is evidence-weighted, never a formal certificate.

II.1.12 Special cases and component selection

Standard models are Theorem-1A special cases verified to satisfy (LIN): standard regression (Level 0), random coefficients (Level 1, identifiability $\Leftrightarrow$ $\mathrm{Cov}_\mu(X)$ full rank), varying-coefficient (Level 2, $b\equiv0$, spline classes), hierarchical Bayes with multiplicative interaction (Level 2, $W\equiv0$; (C6) critical), full canonical AMM (Level 2). The hypernetwork falls outside (LIN), into Proposition 1F.

Component selection over the eight restrictions $M_S$, $S\subseteq{a,b,W}$, uses $\widehat{\text{elpd}}{\text{loo}}$ (Vehtari–Gelman–Gabry 2017) with differences $>2$ SE substantive; a stratified PSIS-LOO computed within $x$-quantile/categorical subgroups localizes structural heterogeneity and is reported by default. Frequentist (Path 2) selection uses generalized LRTs on effective d.f. (Wood 2017); Path 3 compares post-training component norms $|a\phi|,|b_\phi|,|W_\phi|$ to the regularization scale to prune dormant components.

II.2 Asymptotic theory of the three paths

The package develops asymptotics for all three paths to reference grade (only Path 1 is executable). The reference text throughout is Ghosal & van der Vaart (2017); AMM-specific theorems are specializations, with explicit statements of what is established, what the AMM specialization costs in extra hypotheses, and what remains open.

II.2.1 Path 1 (Hierarchical Bayesian) — three layers

The Path-1 model places priors on every component: $\theta_{\text{ref}}\sim\pi_\Theta$, $a\sim\pi_a$, $b\sim\pi_b$, $W\sim\pi_W$, joint prior $\pi=\pi_\Theta\otimes\pi_a\otimes\pi_b\otimes\pi_W$, posterior $\Pi_n\propto\pi\prod_i p(y_i\mid\theta_i)$. Write $\eta=(\theta_{\text{ref}},a,b,W)$, true value $\eta_*$. Two design regimes are distinguished because conclusions differ: (R-RANDOM) $X_i\overset{\text{iid}}\sim\mu$ (default) and (R-COND) fixed/conditional-on-design; they coincide under (R-EQUIV) (positive density of $\mu$ + uniformly Glivenko–Cantelli classes), automatic for finite-dim parametric classes.

Two distances, no global equivalence. Hellinger $d_H$ on the data distribution (the natural metric for Bayesian contraction) and $L^2(\mu)$ on the deviation function (the interpretable metric for the components). They are locally equivalent at $\eta_*$ only in the parametric smooth full-rank-Fisher case ($c_1 d_{L^2}\le d_H\le c_2 d_{L^2}$ on a neighborhood); globally and in non-parametric cases they can diverge. Theorems 4A/4B are stated in $d_H$ and specialized to $d_{L^2}$ only locally.

The three asymptotic layers parallel the three identifiability layers:

(L1) Posterior consistency — $\Pi_n({d(\eta,\eta_*)>\varepsilon})\xrightarrow{P}0$ for every $\varepsilon$.
(L2) Contraction rate — $\exists,\varepsilon_n\to0$, $n\varepsilon_n^2\to\infty$, with $\Pi_n({d>M\varepsilon_n})\xrightarrow{P}0$.
(L3) Bernstein–von Mises — $\Pi_n(\sqrt n(\eta-\widehat\eta_n)\in\cdot)\xrightarrow{w}\mathcal N(0,I_*^{-1})$ in total variation.

Standing asymptotic hypotheses (additional to C1–C6, LIN, D-ID, IID): (PRIOR-KL) $\pi(B_\varepsilon(\eta_))>0$ for all $\varepsilon$ (KL-ball $B_\varepsilon={K(\eta_,\eta)\le\varepsilon^2,V(\eta_,\eta)\le\varepsilon^2}$); (PRIOR-THICK) $\pi(B_{\varepsilon_n}(\eta_))\ge e^{-C_1 n\varepsilon_n^2}$; (SIEVE) sieves $\Theta_n$ with $\pi(\Theta_n^c)\le e^{-C_2 n\varepsilon_n^2}$ and bracketing entropy $\log N_{[]}(\varepsilon_n,\Theta_n,d_H)\le C_3 n\varepsilon_n^2$; (TEST) tests with exponential type-I/II error decay; (LAN) local asymptotic normality at $\eta_$ with non-singular Fisher $I_$.

Theorem 4A (posterior consistency). Under C1–C6, LIN, D-ID, the Block-2 regularity (HOM)+(REG)+(IID), (PRIOR-KL), (TEST), and finite bracketing entropy on sieves with $\pi(\Theta_n^c)\to0$: $\Pi_n({d_H(\eta,\eta_)>\varepsilon})\xrightarrow{P_{\eta_}}0$. (Schwartz 1965 specialized to AMM; novelty is verifying PRIOR-KL and entropy for the product prior, which under LIN reduces to prior positivity at $\eta_*$.)

Theorem 4A discharges (REG-EST) of Block 2 in average-error form: the posterior-mean individual parameter $\widehat\theta_i^{\text{Bayes}}=\int[\theta_{\text{ref}}+\Delta(x_i,\theta_{\text{ref}})],d\Pi_n$ satisfies $\frac1n\sum_i|\widehat\theta_i^{\text{Bayes}}-\theta_*(x_i,\theta_{\text{ref}})|\xrightarrow{P}0$. The uniform-in-$i$ version needs sup-norm contraction (extra hypotheses, not implied by 4A).

Theorem 4B (contraction rate). Adding (PRIOR-THICK) and (SIEVE) for $\varepsilon_n\to0$, $n\varepsilon_n^2\to\infty$: $\Pi_n({d_H>M\varepsilon_n})\xrightarrow{P}0$ (Ghosal–Ghosh–van der Vaart 2000 specialized to AMM).

Rates by Level: Level 0/1 parametric $\varepsilon_n=n^{-1/2}$; VCM fixed-$J$ splines $(J/n)^{1/2}$, adaptive matched-smoothness $n^{-\beta/(2\beta+d)}$; Level 2 parametric $n^{-1/2}$, non-parametric determined by the slowest component subject to prior-matching caveats (the joint rate need not be a simple combination of component rates — numerical verification recommended). The library implements empirical contraction verification: refit at $n_1<\dots<n_K$, track credible-set diameters, and warn when the empirical rate diverges from prediction (signaling prior or model misspecification).

Theorem 4C (Bernstein–von Mises). For finite-dim parametric AMM (Levels 0/1/2 with finite-dim classes), under 4A+4B, (LAN), and a consistent $\sqrt n$-MLE: the posterior is asymptotically $\mathcal N(\widehat\eta_n,n^{-1}I_*^{-1})$ in total variation. Consequence: Bayesian credible intervals and asymptotic frequentist CIs coincide in the limit — this justifies reporting credible intervals as the primary uncertainty.

Proposition 4C-semi (semiparametric BvM). With parametric $\theta_{\text{ref}}$ and non-parametric $(a,b,W)$, under Castillo–Rousseau (2015) conditions ($\sqrt n$-recoverability + least-favorable-direction-aware prior), the marginal posterior of $\theta_{\text{ref}}$ is $\sqrt n$-asymptotically normal at the semiparametric efficiency bound $V_*$. Tight scope: this is only for the marginal of $\theta_{\text{ref}}$ — the function-valued $(a,b,W)$ need not be asymptotically Gaussian in a function-space metric, and the library reports their intervals as posterior quantiles (function-space credible balls), never as $\sqrt n$ Gaussian intervals.

Open questions explicitly recognized: (O1) full BvM for non-parametric components (only partial Sobolev-topology results exist); (O2) adaptive contraction rates for general AMM Path-1 priors; (O3) misspecification asymptotics under failure of (HOM)/(REG) (contraction to a KL-projection pseudo-true parameter, Kleijn–van der Vaart 2012). Implementation diagnostics: prior KL-support report, Stan $\widehat R$/ESS/divergences as indirect (TEST)/(SIEVE) checks, and a BvM calibration check comparing credible vs ML-Hessian intervals.

II.2.2 Path 2 (Varying-coefficient) and Path 3 (hypernetwork) — reference theory

These paths are not executable in gdpar 0.1.0 but carry reference-grade asymptotics:

Path 2 (VCM, vignette v05). Frequentist penalized-spline asymptotics: pointwise and uniform consistency of $\widehat\beta(\cdot)$, asymptotic normality at the spline rate $n^{-\beta/(2\beta+1)}$, with the reference recovered as $\beta(\bar z)$ and the deviation as $\beta(z)-\beta(\bar z)$; conditions specialize Fan–Zhang (2008), Stone (1985), Wood (2017). The curse of dimensionality in $z$ is the binding limitation.
Path 3 (hypernetwork, vignette v06). Only partial results are available for Bayesian neural networks: consistency under the Neural-Tangent-Kernel regime (Jacot et al. 2018; Bach 2017), PAC-Bayes generalization bounds (Dziugaite–Roy 2017), and an explicit acknowledgement that BvM and contraction rates are open (Hron et al. 2020). Universal approximation gives density, not identifiability; the function-level identifiability of $\Phi_\phi$ and its contraction are open (cf. Proposition 1F).

The cross-path consistency: parametric AMM gets the full $n^{-1/2}$ rate and full BvM; non-parametric AMM gets the smoothness-dependent rate and semiparametric BvM for the parametric subset only.

II.3 Empirical Bayes vs full Bayes

Partition $\eta=(\theta_{\text{ref}},\xi)$ with upper-level reference $\theta_{\text{ref}}$ and lower-level $\xi=(a,b,W)$. EB maximizes the marginal likelihood $L_n^{\text{marg}}(\theta_{\text{ref}})=\int p(Y_{1:n}\mid\theta_{\text{ref}},\xi)\pi_\xi(\xi),d\xi$ to get $\widehat\theta_{\text{ref}}^{\text{EB}}=\arg\max L_n^{\text{marg}}$, then conditions: $\Pi_n^{\text{EB}}(\xi)\propto p(Y_{1:n}\mid\widehat\theta_{\text{ref}}^{\text{EB}},\xi)\pi_\xi(\xi)$. FB gives $\theta_{\text{ref}}$ a prior $\pi_\Theta$ and integrates it out. Comparison metric: $d_{\text{TV}}(\Pi_n^{\text{EB}},\Pi_n^{\text{FB}})$. Standing hypotheses: (EB-MARG-ID) unique marginal-likelihood maximum with non-singular $I_{\theta\theta}^{\text{marg}}$; (PRIOR-FB-WEAK) $\mathrm{Var}n(\theta{\text{ref}}^*)/\mathrm{Var}\pi(\theta{\text{ref}})\to0$; (HIER-COMPLEX) bounded upper-level hyperparameters.

Theorem 7A (first-order equivalence). Under regularity + the three hypotheses, EB and FB lower-level posteriors agree asymptotically. Regime A (finite-dim parametric AMM): $d_{\text{TV}}(\Pi_n^{\text{EB}},\Pi_n^{\text{FB}})\xrightarrow{P}0$. Regime B (non-parametric AMM): TV is too strong; convergence holds for smooth bounded $L^2(\mu)$-Lipschitz functionals (equivalently Wasserstein-1/Prokhorov on the joint posterior). Specializes Petrone–Rousseau–Scricciolo (2014) and Rousseau–Szabo (2017). Practical content: for large $n$ + weak prior, EB and FB give essentially the same posterior over $\xi$; the choice is then computational/methodological.

Proposition 7B (higher-order coverage). Under Edgeworth-type expansion conditions (Bickel–Ghosh 1990), EB credible intervals for a smooth functional $g(\xi)$ under-cover by $O(n^{-1})$: $\mathbb P(g(\xi^)\in\mathrm{CI}n^{\text{EB},\alpha})=(1-\alpha)-C{g,\alpha}n^{-1}+o(n^{-1})$, with $C_{g,\alpha}\approx(g'(\xi^))^2/I_{\theta\theta}^{\text{marg}}\cdot\kappa(\alpha)$ (larger when $g$ is sensitive to $\theta_{\text{ref}}$, smaller when $\theta_{\text{ref}}$ is well-identified); FB covers to first order. The library applies a post-hoc inflation $\sqrt{1+C_{g,\alpha}/(n-q)}$ (argument eb_correction=TRUE), explicitly approximate.

Theorem 7C (compound decision, Robbins–Efron). For $K$ exchangeable units, $\frac1K\sum_k\mathbb E[(\widehat\xi_k^{\text{EB}}-\xi_k^)^2]\le\frac1K\sum_k\mathbb E[(\widehat\xi_k^{\text{FB}}-\xi_k^)^2]+B_K$ with $B_K\le\frac{C_1}{K}\mathbb E[(\widehat\theta_{\text{ref}}^{\text{EB}}-\theta_{\text{ref}}^*)^2]\to0$: EB risk approaches FB risk as $K\to\infty$ (squared-error loss on point estimates only — coverage still under-covers per 7B).

Proposition 7D. EB and FB differ substantially when: (i) small $I_{\theta\theta}^{\text{marg}}$ (poorly identified upper level); (ii) strongly informative $\pi_\Theta$; (iii) multimodal $L_n^{\text{marg}}$; (iv) misspecified $\pi_\xi$ (EB regularizes by tuning $\theta_{\text{ref}}$, FB does not).

Default is FB (gdpar()); EB (gdpar_eb()) is opt-in. EB's independent methodological advantages: honest avoidance of a prior on $\theta_{\text{ref}}$, and a deterministic, reproducible, frequentist-interpretable $\widehat\theta_{\text{ref}}^{\text{EB}}$.

Multivariate / multi-slot extension (vignette v07b): Theorem 7A* (to $\theta_{\text{ref}}\in\mathbb R^p$), Proposition 7B* (matricial coverage correction), Theorem 7C* (compound multi-slot, $K>1$), Proposition 7D* (four conditions extended), and open question O5*-EBFB on the anti-fragility of the Laplace Hessian under $p>1,K>1$.

The four gdpar_eb() path regimes (dispatched from the resolved $(K,p)$), each with a marginal+conditional Stan template pair and its own Proposition-7B correction:

Regime	$(K,p)$	Stan template pair	7B correction
Base	$K=1,p=1$	`amm_eb_marginal.stan` + `amm_eb_conditional.stan`	scalar
Path A	$K=1,p>1$	`amm_eb_marginal_multi.stan` + `..._conditional_multi.stan`	matricial 7B*
Path B	$K>1,p=1$	`amm_eb_marginal_K.stan` + `..._conditional_K.stan`	per-slot scalar
Path C	$K>1,p>1$	`amm_eb_marginal_KxP.stan` + `..._conditional_KxP.stan`	tensor $\mathbb R^{K\times p\times p}$

The canonical EB recipe (three steps): (i) marginal-likelihood maximization for $\widehat\theta_{\text{ref}}^{\text{EB}}$ via Laplace (cmdstanr::laplace() with multi-start + Levenberg–Marquardt ridge + condition-number guard — the anti-fragility strategy); (ii) plug $\widehat\theta_{\text{ref}}^{\text{EB}}$ as data into a conditional Stan model; (iii) sample the conditional posterior of $\xi$. The companion gdpar_compare_eb_fb() operationally verifies Theorem 7A (marginal TV) and Proposition 7B (per-cell width ratio).

II.6 Parametrizations and the pre-flight decision (CP / NCP / linear)

Hierarchical AMM posteriors can suffer the classic centered/non-centered pathology of hierarchical models. The package treats the parametrization of the multiplicative interaction $b(x)\odot\theta_{\text{ref}}$ as a first-class, data-driven decision rather than a fixed choice, and selects it pre-fit. Source: R/preflight.R, R/preflight_multi.R, R/contraction_diagnostic.R.

II.6.1 The three parametrizations

For the multiplicative term, write the contribution of coordinate involving the reference as $c_b=\theta_{\text{ref}}\cdot b_{\text{coef}}$ on the linear-predictor scale. Three parametrizations are available:

Centered (CP). Sample $b_{\text{coef}}$ directly; the term $\theta_{\text{ref}}\cdot b_{\text{coef}}$ couples $b_{\text{coef}}$ to $\theta_{\text{ref}}$ multiplicatively. CP mixes well when the data are informative about the interaction (the likelihood dominates the funnel).
Non-centered (NCP). Reparametrize $b_{\text{coef}}=\mu_b+\tau_b,\tilde b$ with $\tilde b\sim\mathrm{Normal}(0,1)$, decoupling the prior geometry; NCP mixes well when the data are weakly informative (prior-dominated funnel).
Linear reparametrization. Sample the product $c_b=\theta_{\text{ref}}\cdot b_{\text{coef}}$ directly as a linear coordinate, sidestepping the bilinear $(\theta_{\text{ref}},b)$ geometry altogether. This is the package's resolution of the deeper diagnosis (below): the root cause of the funnel is the non-linear $(\theta_{\text{ref}},b)$ parametrization, not the centering per se; sampling the linear product removes the bilinearity at the source.

The diagnosis that led here proceeded in three iterations — NCP did not cure it, CP did not cure it, and the residual pathology was traced to the bilinear coordinate; the final fix samples $c_b$ linearly. This is the maximum-robustness, root-cause resolution rather than a symptom patch.

II.6.2 The pre-flight decision

The choice among CP/NCP/(linear) is made by a pre-flight procedure that runs a short pilot and computes an information ratio discriminating prior-dominated from likelihood-dominated regimes, then dispatches:

preflight_parametrization() (scalar) / preflight_parametrization_multi() ($p>1$, per coordinate) run the pilot sample and the attribution/info-ratio computation.
The information ratio contrasts how much of the posterior variation in the interaction is attributable to the data vs the prior; an asymptotic $z$/$t$-style test on it, evaluated against thresholds, picks the regime. Defaults: tau_cp = 5, tau_ncp = 2 (the boundary thresholds of the ratio); a high ratio (data-informative) selects CP, a low ratio selects NCP.
The variance/score machinery is made dependence-aware: effective weights + a chain-aware block bootstrap + an asymptotic z-test give the ratio's sampling distribution without assuming independence of the pilot draws (Path B′ canonical design). The per-coordinate variant decides each coordinate of $\theta_{\text{ref}}\in\mathbb R^p$ separately and resolves a global vs per-dimension decision.
resolve_parametrization() / resolve_parametrization_multi() turn the diagnostic verdict into the concrete Stan-side toggle used by the code generator. preflight_global_decision() and preflight_per_dim() are the exported user-facing entry points; decision_to_logical() maps the verdict to the boolean the template consumes.

Confounding-induced NCP preference is treated as correct, not a defect: when a covariate confounds the reference, the prior-dominated geometry genuinely calls for NCP, and the diagnostic is designed to detect exactly that. The whole apparatus is data-driven: no parametrization is hard-coded; the knob is set from a declared, reproducible statistic of a pilot fit.

II.6.3 The contraction diagnostic

gdpar_contraction_diagnostic() (R/contraction_diagnostic.R) operationalizes the empirical contraction-rate verification of Theorem 4B (§II.2.1): it refits at increasing sample sizes and tracks the posterior credible-set diameter, flagging deviations from the predicted rate $\varepsilon_n$ as evidence of prior or model misspecification. It is the runtime instrument behind the asymptotic theory's "numerical verification of contraction".

II.4 Distributional families and links

A family is represented internally as an ordered list of per-parameter specifications (gdpar_param_spec), one per slot $k$. Each spec carries: a name ($\mu,\sigma,\phi,\nu,\pi,p,\dots$), a link (and its inverse), a family_role ∈ {location, scale, shape, df, mixture_pi, power}, a support (real_line/positive_real/unit_interval/bounded_open), an identifiability descriptor (did_status ∈ {holds, holds_under_condition, user_responsible} with condition + reference, i.e. the (D-ID) hypothesis of Lemma 1B made per-slot), a canonical prior kind, and a scope (per_observation or population). The link factory implements exactly three links:

$$\text{identity: } g(\mu)=\mu,\ g^{-1}(\eta)=\eta;\qquad \text{log: } g(\mu)=\log\mu,\ g^{-1}(\eta)=e^{\eta};\qquad \text{logit: } g(\mu)=\log\tfrac{\mu}{1-\mu},\ g^{-1}(\eta)=\tfrac{1}{1+e^{-\eta}}.$$

The built-in roster (location slot gets the user link; auxiliaries get fixed links — log for positive-real, logit for unit-interval, identity for the Tweedie power):

Family	Slots (role, link)
gaussian	$\mu$ (location), $\sigma$ (scale, log)
poisson	$\mu$ (location)
neg_binomial_2	$\mu$ (location), $\phi$ (scale, log)
bernoulli	$\mu$ (location)
beta	$\mu$ (location), $\phi$ (scale, log)
gamma	$\mu$ (location), shape (log)
student_t	$\mu$, $\sigma$ (log), $\nu$ (df, log)
tweedie	$\mu$ (log), $\phi$ (log), $p$ (power, identity, bounded $(1,2)$)
zip	$\mu$ (location), $\pi$ (mixture_pi, logit)
zinb	$\mu$, $\phi$ (log), $\pi$ (logit)
hurdle_poisson	$\mu$, $\pi$ (logit)
hurdle_neg_binomial_2	$\mu$, $\phi$ (log), $\pi$ (logit)

Each family carries an integer stan_id selecting the likelihood branch in the Stan templates (Part V). Auxiliary slots default to population scope and can be promoted to per_observation (their own AMM decomposition) by the user — that is distributional regression: $\theta_i^{(k)}=\theta_{\text{ref}}^{(k)}+\Delta^{(k)}(x_i,\theta_{\text{ref}}^{(k)})$ on the slot's link scale. The package also supports heterogeneous per-slot families (different sub-family per slot) and custom families (gdpar_family_custom, gdpar_family_custom_K) where the user declares the likelihood and accepts responsibility for (D-ID) via did_override. The exported gdpar_family, gdpar_family_multi (the $p>1$ multivariate slot constructor) and the print methods complete the API. Zero-inflation/hurdle realize the dual deviation of §I.6 (both $\pi$ and the count parameter anchored). Full per-function detail: Part IV, R/families.R; per-likelihood Stan math: Part V.

II.5 The modulating basis `W`

The modulating component $W(\theta_{\text{ref}})x$ draws $W$ from a finite-dimensional basis built by W_basis(type, degree, knots, df, boundary_knots, basis_fn, dim, p) with three types:

polynomial (default degree 1): block-by-coordinate powers, no cross-terms, $$\text{eval}(\theta)=\big(\theta_1,\theta_1^2,\dots,\theta_1^{\deg},\ \theta_2,\dots,\theta_2^{\deg},\ \dots,\ \theta_{p},\dots,\theta_{p}^{\deg}\big);$$
bspline (default degree 3): per-coordinate B-spline bases concatenated, $\text{eval}(\theta)=(B(\theta_1),\dots,B(\theta_p))$, with Cox–de Boor evaluation performed Stan-side for differentiability inside HMC; the R side (.gdpar_resolve_bspline_knots, .gdpar_bspline_knots_full, .gdpar_validate_bspline_boundary_range) resolves interior knots (from knots or df) and boundary knots and validates the reference range; boundary_knots defaults to range(c(knots, tk));
user: an arbitrary basis_fn of declared dim.

materialize_W_basis() populates dim, p, and block_indices (the per-coordinate column blocks) once $p$ is known; as_per_k() reshapes to per-coordinate form. The anchoring constraint (C4) $W(\theta_0)=0$ is enforced by the differenced parametrization $W(\theta)=W_0(\theta)-W_0(\theta_0)$ (the apply_W_basis_diff mechanism, selected by a W_type_id). Detail: Part IV, R/W_basis.R; Stan-side Cox–de Boor: Part V.

II.7 Grouped references (per-group anchors)

Activated by group = ~ factor in gdpar(), the scalar/coord-wise reference is promoted to a per-group anchor $\theta_{\text{ref}}[g]\sim\mathrm{Normal}(\mu_{\theta_{\text{ref}}},\sigma_{\theta_{\text{ref}}})$, $g=1,\dots,J_{\text{groups}}$, with its own hierarchical hyperprior ($\sigma_{\theta_{\text{ref}}}$ a learned between-group scale). group = NULL reduces bit-exactly to the single-anchor regime. The anti-aliasing condition (C7) (§II.1.9) is enforced pre-fit by .check_group_aliasing_c7(). Anchor resolution across regimes is handled by resolve_anchor, resolve_anchor_multi, resolve_anchor_K; extraction of grouped references by the .extract_theta_ref_*_grouped helpers (Part IV, R/gdpar.R, R/methods.R).

II.8 The causal bridge (CATE / ITE)

Because the AMM produces individual parameters, individual treatment effects follow directly. gdpar_causal_bridge() implements a T-learner: fit the anchored model separately under treatment and control and read the individual CATE/ITE as the difference of anchored individual predictions, $$\widehat\tau(x_i)=\widehat\mu_1(x_i)-\widehat\mu_0(x_i),$$ with posterior draws giving full uncertainty on $\tau(x_i)$ (summarized by .summarize_cate). A battery of pre-checks (.check_bridge_path/_hierarchical/_family/_dim/_amm/_anchor) guards the bridge's applicability. The second layer, gdpar_compare_meta_learners(), benchmarks the AMM learner against external meta-learners through a pluggable adapter registry: gdpar_adapter_grf (R, generalized random forests) and gdpar_adapter_econml (Python EconML via reticulate, e.g. CausalForestDML realizing orthogonal/DML CATE). Adapters honor a two-layer contract (fit_predict_fun + optional predict_fun). Detail: Part IV, R/causal_bridge.R, R/compare_meta_learners.R, R/adapter_*.R; theory: vignettes v08, v08b, v08c.

II.9 Geometry-adaptive sampling (opt-in)

For geometrically hostile posteriors (funnels, near-determinism, heavy tails, multimodality), the package ships an opt-in geometry engine whose default path is bit-identical to ordinary HMC. When enabled it climbs a ladder of metrics $G(\theta)$ for Riemannian-manifold HMC:

Euclidean (gdpar_geom_metric_euclidean) — baseline mass matrix;
Riemannian / Fisher (gdpar_geom_metric_gp_fisher, gdpar_geom_metric_riemannian) with the SoftAbs regularization of the Hessian eigenvalues, $\lambda\mapsto\lambda\coth(\alpha\lambda)$, giving a positive-definite metric from a possibly-indefinite Hessian; log-Cholesky parametrization of $G$ and its differentials ($dM/d\psi$);
sub-Riemannian (gdpar_geom_metric_subriemannian) — a degenerate/constrained metric flowing along admissible directions;
relativistic / Finsler (gdpar_geom_metric_relativistic) — a relativistic kinetic energy capping velocities (heavy-tail/ill-conditioning robustness), with its own radial integrator.

A certifying orchestrator (gdpar_geom_orchestrate, with _criteria and _budget variants) diagnoses the pathology (gdpar_geometry_diagnostic: multimodality, heavy kurtosis, boundary proximity, difficulty curve), selects an entry rung, builds the metric, tunes $\varepsilon$, runs the success gate, and emits a certificate (gdpar_geom_certificate). A Laplace fallback (gdpar_geom_laplace: Newton/Laplace climb, observed information, unconstrained draws, fit-quality label) provides a plug-in posterior and an ELPD on par with mgcv-REML / INLA-Laplace when full sampling is certified infeasible (the resolution of the near-deterministic Tweedie case). Detail: Part IV, R/geometry_engine.R, R/geometry_orchestrator.R, R/geometry_suite.R, R/geometry_bridge.R, R/geometry_laplace.R, R/geometry_diagnostic.R; theory: vignette vop08.

II.10 Dependence-robust inference

gdpar does not model dependence in its point structure; it makes the inference robust to unmodelled dependence (point estimates unchanged, only uncertainty made robust). Two axes:

Temporal (gdpar_dependence_diagnostic, gdpar_dependence_robust): diagnostics = lag-1 autocorrelation, Durbin–Watson, Ljung–Box on residuals; robust SE/intervals via moving/circular block bootstrap with a data-driven block length. The automatic length is the Politis–White (2004) + Patton–Politis–White (2009) flat-top rule (a base-R hand-roll equal to np::b.star): flat-top autocovariance + adaptive bandwidth $b=(2\hat g^2/D)^{1/3}n^{1/3}$, $D=\tfrac43 \text{spec}^2$; default block_length=NULL reduces bit-exactly to the $n^{1/3}$ Künsch rate.
Spatial (gdpar_spatial_dependence_diagnostic, gdpar_spatial_dependence_robust): diagnostic = Moran's $I$ (hand-roll, no spdep/sf) with knn row-standardized / distance-band / user-supplied weights, two-sided permutation default + analytic Cliff–Ord option; robust inference via tiled randomized-origin spatial block bootstrap (Politis–Romano–Lahiri) with variance-optimal block side $g=\max(2,\lceil n^{1/4}\rceil)$ in $d=2$, plus a data-driven subsampling calibration of $g$.

Both axes work on the EB scalar path and the FB path (gdpar_fit and gdpar_eb_fit) via a shared engine (.gdpar_dependence_robust_engine) with class-dispatched estimate/SE/residual extractors and a frozen RNG contract that keeps the EB route bit-exact. Detail: Part IV, R/dependence_robust.R; theory: vignette vop09.

← Part I — Conceptual Framework · gdpar Wiki Home · Part III — Computational Architecture →

gdpar Wiki

Home

Part II Mathematical Foundations

Part II — Mathematical Foundations

II.1 The AMM canonical form and identifiability

II.1.1 Setting

II.1.2 The AMM hierarchy

II.1.3 Standing assumptions (the six conditions enforced in code)

II.1.4 Functional Independence Condition (FIC)

II.1.5 Theorem 1A (algebraic-functional identifiability)

II.1.6 Lemma 1B (statistical bridge) and Corollary 1

II.1.7 Proposition 1C (numerical verifiability — the Gram-matrix check)

II.1.8 Condition C4-bis (cross-coordinate aliasing for $p&gt;1$)

II.1.9 Condition C7 (per-group anchor anti-aliasing)

II.1.10 Theorem 1E (functional identifiability of $W$ on the prior support)

II.1.11 Proposition 1F (hypernetwork, Path 3) and the discrimination protocol

II.1.12 Special cases and component selection

II.2 Asymptotic theory of the three paths

II.2.1 Path 1 (Hierarchical Bayesian) — three layers

II.2.2 Path 2 (Varying-coefficient) and Path 3 (hypernetwork) — reference theory

II.3 Empirical Bayes vs full Bayes

II.6 Parametrizations and the pre-flight decision (CP / NCP / linear)

II.6.1 The three parametrizations

II.6.2 The pre-flight decision

II.6.3 The contraction diagnostic

II.4 Distributional families and links

II.5 The modulating basis W

II.7 Grouped references (per-group anchors)

II.8 The causal bridge (CATE / ITE)

II.9 Geometry-adaptive sampling (opt-in)

II.10 Dependence-robust inference

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gdpar Wiki

Clone this wiki locally

II.1.8 Condition C4-bis (cross-coordinate aliasing for $p>1$)

II.5 The modulating basis `W`