Skip to content

Part IV Function Reference 4

José Mauricio Gómez Julián edited this page Jun 29, 2026 · 3 revisions

← Part IV — Exhaustive Function Reference (3/7) · gdpar Wiki Home · Part IV — Exhaustive Function Reference (5/7) →


R/gdpar.R

dedup_message_blocks(lines)

Purpose

Internal helper that deduplicates consecutive message blocks from captured stderr lines. It is used by gdpar() when verbose = FALSE to silence the duplicate divergence warnings emitted by cmdstanr during sampling. The function splits the captured lines into blocks delimited by empty lines, retains only the first occurrence of each unique block (compared by full text), and returns a flattened character vector suitable for re-emission via message().

Arguments

Argument Type Meaning
lines Character vector Captured stderr lines from cmdstanr sampling.

Mathematics

The algorithm operates in two passes:

  1. Block segmentation. Given the input vector $\mathbf{l} = (l_1, l_2, \dots, l_n)$, consecutive non-empty strings are accumulated into a block $B_i$. An empty string acts as a block delimiter: when encountered and the current accumulator is non-empty, the accumulator is closed as a completed block. Leading empty lines (before any non-empty content) and consecutive empty lines between blocks are silently consumed because the accumulator is empty at those points. Trailing empty lines after the last non-empty content do not produce a block.

  2. Deduplication. For each block $B_i$ in order of first appearance, a key $k_i = \text{paste}(B_i, \text{collapse} = \text{"\textbackslash n"})$ is computed. If $k_i \notin S$ where $S$ is the set of previously seen keys, the block's lines (followed by an empty-string separator) are appended to the output and $k_i$ is added to $S$. Otherwise the block is dropped.

A final trimming step removes the trailing empty-string separator from the output if present.

Returns

A character vector. If length(lines) == 0L, returns character(0). Otherwise returns the deduplicated blocks flattened into a single character vector, with blocks separated by empty strings (""), and the trailing separator removed. Ready for line-by-line re-emission via message().

Notes

  • Marked @keywords internal and @noRd; not exported.
  • The comparison key uses paste(b, collapse = "\n"), so blocks are compared by their full multi-line text content, not line-by-line.
  • The function does not modify its input; it builds new vectors via concatenation.
  • No side effects; no errors raised on any input (including NULL, which would cause length(NULL) == 0L to be TRUE, returning character(0)).

gdpar(formula, family = gdpar_family("gaussian"), amm = amm_spec(), W = NULL, data, prior = NULL, path = c("bayes", "vcm", "hyper"), anchor = "prior_mean", skip_id_check = FALSE, chains = 4L, iter_warmup = 1000L, iter_sampling = 1000L, adapt_delta = 0.95, max_treedepth = 12L, refresh = 100L, verbose = TRUE, seed = NULL, group = NULL, parametrization = c("auto", "ncp", "cp"), parametrization_a = NULL, parametrization_W = NULL, parametrization_aggregation = NULL, id_check_rigor = c("full", "fast"), ...)

Purpose

Main exported entry point of the gdpar package. Fits the AMM (Additive–Multiplicative Modulation) canonical decomposition of the individual parameter via Path 1 (hierarchical Bayesian inference through Stan/cmdstanr). The function orchestrates input validation, covariate standardization, basis-restricted identifiability diagnostics, Stan model code generation, compilation, HMC/NUTS sampling, and convergence diagnostic collection. It also dispatches to internal multivariate (.gdpar_multi), K-individual (.gdpar_K), and placeholder paths (Path 2/Path 3) based on the input configuration.

Arguments

Argument Type Meaning
formula Formula or gdpar_formula_set Two-sided formula (y ~ ...) or a gdpar_formula_set object. In the legacy two-sided form, the LHS is the outcome variable; the RHS either lists covariates entering the modulating component as the linear factor $x$ in $W(\theta)x$ (with AMM components declared via amm), or contains AMM wrapper calls a(...), b(...), W(). A gdpar_formula_set is the multi-parameter form with one slot per K-individual parameter.
family gdpar_family, gdpar_family_multi, or named list of gdpar_family Response distribution. Defaults to gdpar_family("gaussian"). A named list of gdpar_family objects triggers the heterogeneous-families-per-slot path (sub-phase 8.3.7), requiring $K \geq 2$.
amm amm_spec or named list of amm_spec AMM specification. Defaults to amm_spec() (Level 0 degenerate). A named list triggers the K-individual low-level path. Must remain at default when formula is a gdpar_formula_set or contains AMM wrapper calls in the RHS.
W W_basis or NULL Modulating basis for K-individual paths. NULL by default. Ignored in the legacy single-amm_spec path (where the basis travels via amm$W). Supplying a non-NULL W in the legacy path raises an error.
data Data frame Data containing variables referenced by formula and amm.
prior gdpar_prior or NULL Prior specification. Defaults to gdpar_prior() when NULL.
path Character scalar Estimation path: "bayes" (Path 1, default), "vcm" (Path 2), "hyper" (Path 3). Only "bayes" is implemented; the other two raise gdpar_unsupported_feature_error.
anchor Numeric scalar or character Anchor value $\theta_0$ in linear-predictor scale. "prior_mean" (default) or "empirical_y" (linkfun applied to outcome mean).
skip_id_check Logical scalar If TRUE, skip the Gram-matrix identifiability diagnostic. Default FALSE.
chains Integer scalar Number of HMC chains. Default 4L.
iter_warmup Integer scalar Warmup iterations per chain. Default 1000L.
iter_sampling Integer scalar Sampling iterations per chain. Default 1000L.
adapt_delta Numeric scalar Target acceptance probability for NUTS. Default 0.95. Validated to be in $[0.5, 0.999]$.
max_treedepth Integer scalar Maximum NUTS tree depth. Default 12L.
refresh Integer scalar Progress-reporting frequency passed to cmdstanr. Default 100L. Must be non-negative.
verbose Logical scalar Verbosity of package-level informational messages. Default TRUE. Also controls show_messages and show_exceptions in cmdstanr.
seed Integer scalar or NULL Random seed for cmdstanr. Default NULL. Also forwarded to the pre-flight diagnostic when parametrization = "auto".
group One-sided formula or NULL Grouping variable for per-group hierarchical estimation of theta_ref. Default NULL (single global theta_ref). Only single-variable one-sided formulas accepted (e.g., ~ species).
parametrization Character scalar CP/NCP parametrization selection: "auto" (default, runs pre-flight), "ncp" (force non-centered), "cp" (force centered).
parametrization_a Character scalar or NULL Override for component a: "ncp" or "cp". NULL inherits from parametrization.
parametrization_W Character scalar or NULL Override for component W: "ncp" or "cp". NULL inherits from parametrization. If both parametrization_a and parametrization_W are explicit, pre-flight is skipped.
parametrization_aggregation Character scalar or NULL Aggregation rule for multivariate per-coordinate CP/NCP decisions: "any_ncp", "majority", or "per_k". Only forwarded to .gdpar_multi; not used in the scalar or K-individual paths.
id_check_rigor Character scalar Identifiability check strictness: "full" (default) or "fast". Forwarded to .gdpar_K and .gdpar_multi; not directly used in the scalar path.
... Additional named arguments forwarded to the underlying cmdstanr sampler (via sample_args), and also forwarded to .gdpar_K / .gdpar_multi.

Mathematics

AMM canonical decomposition. The individual parameter $\theta$ is decomposed via an additive component $a$ and a modulating component $W$:

$$\theta = a + W(\theta_{\text{ref}}) \cdot x$$

where $x$ is a vector of covariates entering the modulating component.

Anchor parametrization. The anchor value $\theta_0$ enters as a parametrization device:

$$W(\theta) - W(\theta_{\text{anchor}})$$

Changing the anchor changes the parametrization but not the data-generating model.

Group-level random intercept. When group is supplied, $\theta_{\text{ref}}$ is promoted to a vector indexed by group with a Normal hyperprior:

$$\theta_{\text{ref}}[g] \sim \mathrm{Normal}(\mu_{\theta_{\text{ref}}}, \sigma_{\theta_{\text{ref}}})$$

Identifiability diagnostic test point. The diagnostic test point $\theta_{\text{ref,init}}$ is set to:

$$\theta_{\text{ref,init}} = \begin{cases} 1 & \text{if } b \neq \text{NULL} \text{ and } |\theta_0| < 10^{-8} \ \theta_0 & \text{otherwise} \end{cases}$$

This avoids a degenerate zero test point when the anchor is numerically zero and a multiplicative component $b$ is present.

Covariate standardization. Covariates entering the additive basis $a$ are centered (enforcing empirical assumption C1). Covariates entering the modulating component $W$ are additionally scaled to unit standard deviation. The transformation is recorded for consistent prediction on new data.

Parametrization pre-flight (Path B'). When parametrization = "auto", a short pre-flight NCP fit is run and per-component CP/NCP decisions are made via three sequential filters:

  1. A divergence-attribution $t$-test.
  2. An E-BFMI threshold check.
  3. A chain-aware block-bootstrap $z$-test on the posterior-to-prior contraction of the effective coefficient.

Returns

An object of class c("gdpar_fit", "list") with components:

Component Type Description
fit CmdStanFit object The underlying cmdstanr fit object.
amm amm_spec The (possibly materialized) AMM specification used.
family gdpar_family / gdpar_family_multi The (possibly promoted) family object.
prior gdpar_prior The prior specification.
design List The AMM design matrices and metadata from build_amm_design().
anchor Numeric scalar The resolved anchor value $\theta_0$.
stan_data List The data list passed to Stan.
identifiability_report List or NULL Report from gdpar_check_identifiability(), or NULL if skip_id_check = TRUE.
diagnostics List Convergence diagnostics from compute_diagnostics().
parametrization List Contains cp_a (logical), cp_W (logical), and meta (pre-flight diagnostic statistics when applicable).
group_info List or NULL Group information from .resolve_group_argument().
call call The matched call.
path Character scalar The resolved estimation path ("bayes").

Notes

Dispatch logic. The function has three major dispatch branches before reaching the scalar Stan pipeline:

  1. K-individual paths — triggered when formula is a gdpar_formula_set, when amm is a named list of amm_spec objects, or when formula is a classic two-sided formula whose RHS contains AMM wrapper calls (a()/b()/W(), detected by .gdpar_rhs_has_amm_calls()). In all three sub-paths, amm must be at its default amm_spec(). The formula set / AMM wrappers are converted to amm_list_canonical via .gdpar_formula_set_to_amm_spec_list(). If $K &gt; 1$, the function delegates to .gdpar_K(). If $K = 1$, the single amm_spec is extracted, the formula is reconstructed from union(all.vars(amm$a), all.vars(amm$b)), and execution falls through to the scalar path.

  2. Multivariate path — triggered when amm$p > 1L (and amm is a single amm_spec). Delegates to .gdpar_multi().

  3. Scalar univariate path — the main body, for amm$p being NULL or 1L.

Heterogeneous family detection. A named list is recognized as a heterogeneous family specification when all of the following hold: it is.list(), does not inherit from gdpar_family or gdpar_family_multi, has non-NULL names, all names are non-empty (nzchar), no duplicated names, and every element inherits from gdpar_family. If $K = 1$ with a heterogeneous family, an error is raised. If $K &gt; 1$, the family is resolved via .gdpar_resolve_heterogeneous_family_K() (producing a location_family and a family_id_k_vector); otherwise .gdpar_promote_scope_per_observation() is used.

Errors raised (via gdpar_abort):

Condition Error class
path = "vcm" gdpar_unsupported_feature_error
path = "hyper" gdpar_unsupported_feature_error
formula is a gdpar_formula_set but amm is not at default gdpar_input_error
Named list amm with empty slot name gdpar_input_error
Named list amm entry not inheriting amm_spec gdpar_input_error (with data containing slot and received)
Named list amm with duplicated names gdpar_input_error
Named list amm but formula not a two-sided formula gdpar_input_error
Classic formula with AMM calls but amm not at default gdpar_input_error
W supplied in the legacy single-amm_spec path gdpar_input_error
formula not a two-sided formula gdpar_input_error
gdpar_family_multi with amm$p NULL or 1 gdpar_input_error (with data containing family_class and amm_p)
Heterogeneous family with $K = 1$ gdpar_input_error (with data containing K)
refresh not a non-negative numeric scalar gdpar_input_error
verbose not a logical scalar gdpar_input_error
Outcome variable not in data gdpar_input_error
Outcome contains non-finite values (NA, NaN, Inf) gdpar_input_error
Identifiability check fails gdpar_identifiability_error (with data containing report)

Assertion checks (via assert_* helpers):

  • family must inherit from c("gdpar_family", "gdpar_family_multi") (unless it is a recognized named list).
  • amm must inherit from "amm_spec".
  • data must be a data frame.
  • prior must inherit from "gdpar_prior".
  • chains, iter_warmup, iter_sampling, max_treedepth must be count scalars.
  • adapt_delta must be a numeric scalar in $[0.5, 0.999]$.

Side effects:

  • Calls require_suggested("cmdstanr", ...); aborts if cmdstanr is not installed.
  • Writes Stan source code to a temporary file via write_stan_to_tempfile().
  • Compiles the Stan model via cmdstanr::cmdstan_model().
  • Performs HMC/NUTS sampling via cs_model$sample().
  • When verbose = FALSE: creates a temporary .log file, opens a message sink, redirects stderr to the file during sampling, then on exit reads the file, deduplicates blocks via dedup_message_blocks(), and re-emits them via message(). The on.exit handler is registered with add = TRUE, after = FALSE.
  • When verbose = TRUE: passes show_messages = TRUE and show_exceptions = TRUE to the sampler.
  • Emits informational messages via gdpar_inform() for: skipped identifiability check, D-ID conditional status, and group resolution.
  • Calls .check_group_aliasing_c7() when a group is supplied, enforcing condition C7 of Block 6.5 (no perfect aliasing of a/b columns with the group indicator).
  • The ... arguments are merged into sample_args by name, potentially overriding default sampler settings.

D-ID status message. When family$did_status == "holds_under_condition" and verbose = TRUE, a message is emitted with the family name, the D-ID condition, and the D-ID reference. The package documents but does not verify this condition from data.

Group argument. Resolved via .resolve_group_argument(group, data, n = length(y), verbose = verbose). If non-NULL, the group_id integer vector is passed to assemble_stan_data() and .check_group_aliasing_c7() is called with the design, group ID, and variable name.

Parametrization resolution. resolve_parametrization() is called with parametrization, parametrization_a, parametrization_W, prior, stan_data, amm, preflight_seed = seed, and verbose. Note that parametrization_aggregation is NOT passed to resolve_parametrization() in the scalar path. The resolved cp_a and cp_W flags are forwarded to generate_stan_code().

Stan code generation. generate_stan_code(prior, cp_a, cp_W) substitutes prior placeholders into the static template at inst/stan/amm_main.stan.

W basis materialization. When amm$W is non-NULL, it is materialized via materialize_W_basis(amm$W, p = 1L) before design assembly.

Formula RHS processing. The RHS is extracted as formula[c(1L, 3L)] and updated with ~ . + 0 (removing the intercept), then passed as formula_rhs to build_amm_design() and gdpar_check_identifiability().

S3 dispatch. The returned object has class c("gdpar_fit", "list"), enabling S3 method dispatch for print, summary, etc. (defined elsewhere in the package).


flush_captured()

Purpose

A closure defined inside gdpar() (not a top-level function). It is registered as an on.exit handler (with add = TRUE, after = FALSE) when verbose = FALSE, to flush the stderr messages captured during cmdstanr sampling. It un-sinks the message stream, closes the temporary log file connection, reads and deduplicates the captured lines, re-emits them via message(), and cleans up the temp file.

Arguments

None. The function captures msg_file (the temp file path) and msg_con (the file connection) from the enclosing scope of gdpar().

Returns

NULL (invisibly). Called for its side effects.

Notes

  • This is a local closure, not a package-level function. It is defined only when verbose = FALSE within gdpar().
  • The function first checks sink.number(type = "message") > 0L before un-sinking, guarding against double-un-sink.
  • It checks isOpen(msg_con) before closing.
  • File reading is wrapped in tryCatch: on error, character(0) is returned, and no messages are re-emitted.
  • The temp file is deleted via unlink(msg_file) regardless of read success.
  • Captured lines are deduplicated via dedup_message_blocks() before re-emission.
  • Each line of the deduplicated output is re-emitted via message(line).
  • Registered via on.exit(flush_captured(), add = TRUE, after = FALSE), ensuring it runs before other on.exit handlers added earlier.

.gdpar_multi(formula, family, amm, data, prior, anchor, skip_id_check, chains, iter_warmup, iter_sampling, adapt_delta, max_treedepth, refresh, verbose, seed, group = NULL, parametrization, parametrization_a, parametrization_W, parametrization_aggregation = NULL, id_check_rigor = "full", call, ...)

Purpose

Internal workhorse implementing the multivariate ($p &gt; 1$) Bayesian "Path 1" fitting route of gdpar(). It validates inputs, promotes a univariate family to a multivariate one when necessary, assembles the AMM design matrix, resolves the anchor and the CP/NCP parametrization, runs a basis-restricted identifiability check, generates and compiles Stan code via cmdstanr, executes HMC/NUTS sampling, computes posterior diagnostics, and returns a gdpar_fit object. It is the multivariate counterpart referenced by the trailing (not-yet-defined-in-this-section) K-individual path documented for $K &gt; 1, p = 1$.

Arguments

  • formula: formula (length 3, two-sided). The LHS names the outcome column in data; the RHS is rebuilt internally with the intercept suppressed.
  • family: object of class gdpar_family or gdpar_family_multi. A bare gdpar_family is auto-promoted to gdpar_family_multi with p = amm$p when amm$p > 1.
  • amm: list representing the anchor model matrix specification; must expose integer element $p (the outcome dimension) and may expose $W (basis) and $b (offset). $W, if present, is materialized through materialize_W_basis().
  • data: data.frame containing the outcome column and any RHS variables.
  • prior: gdpar_prior object; if NULL, replaced by gdpar_prior().
  • anchor: anchor specification forwarded to resolve_anchor_multi().
  • skip_id_check: logical scalar; when TRUE, the basis-restricted identifiability check (C1–C4 + C4-bis) is bypassed.
  • chains: non-negative integer scalar; number of MCMC chains.
  • iter_warmup: non-negative integer scalar; warmup iterations per chain.
  • iter_sampling: non-negative integer scalar; sampling iterations per chain.
  • adapt_delta: numeric scalar in $[0.5, 0.999]$; NUTS step-size adaptation target.
  • max_treedepth: non-negative integer scalar; NUTS maximum tree depth.
  • refresh: non-negative numeric scalar; Stan progress refresh interval.
  • verbose: logical scalar; toggles informative messages, Stan show_messages/show_exceptions, and message-sink behavior.
  • seed: integer scalar or NULL; PRNG seed forwarded to cmdstanr and to the parametrization preflight.
  • group: NULL or group specification; resolved by .resolve_group_argument().
  • parametrization: forwarded to resolve_parametrization_multi(); uniform CP/NCP flag or "auto".
  • parametrization_a: forwarded to resolve_parametrization_multi(); per-component $a$ parametrization override.
  • parametrization_W: forwarded to resolve_parametrization_multi(); per-component $W$ parametrization override.
  • parametrization_aggregation: NULL (defaulting internally to "any_ncp") or a string; rule for aggregating per-coordinate CP/NCP decisions.
  • id_check_rigor: string (default "full"); rigor level passed to gdpar_check_identifiability().
  • call: language object; the original call, stored verbatim in the returned object.
  • ...: extra named arguments merged into the sample_args list passed to cmdstanr::cmdstan_model$sample().

Mathematics

The function operationalizes the multivariate Path-1 Bayesian model. Let $p = \texttt{amm}\$p$ denote the outcome dimension and $n$ the number of observations. The outcome $Y \in \mathbb{R}^{n \times p}$ must be a finite matrix with exactly $p$ columns. The RHS design is built with intercept suppressed:

$$ \texttt{rhs} ;=; \texttt{update}\bigl(\texttt{formula}[c(1,3)],; \sim; . + 0\bigr). $$

The diagnostic test point for the identifiability check is chosen heuristically as

$$ \theta_{\text{ref}} ;=; \begin{cases} \mathbf{1}_p, &amp; \texttt{amm}$b \neq \text{NULL} ;\land; \lVert \text{anchor_value} \rVert_\infty &lt; 10^{-8}, \\ \text{anchor_value}, &amp; \text{otherwise}, \end{cases} $$

i.e. when the anchor is numerically zero and a basis offset $b is present, the all-ones vector is substituted to avoid a degenerate diagnostic. The identifiability report is produced by gdpar_check_identifiability() at rigor id_check_rigor; failure raises a gdpar_identifiability_error.

The CP/NCP decision is resolved per coordinate $k \in \{1,\dots,p\}$ by resolve_parametrization_multi(), yielding boolean vectors cp_a_per_k and cp_W_per_k plus uniform flags cp_a/cp_W. When the per-coordinate $a$ decisions are non-constant,

$$ \bigl|{\texttt{cp_a_per_k}}\bigr| > 1 ;\Longrightarrow; \texttt{stan_data$cp_a_per_k_data} ;=; \texttt{as.integer}(\texttt{cp_a_per_k}), $$

activating the segment()-based per-coordinate prior wiring in the generated Stan code. The metadata note records that uniform-NCP and uniform-CP branches compile to byte-identical code relative to the "H.1 template", and that the mode is "multivariate_phase_h" (Phase H.2 preflight).

Returns

A list with S3 class c("gdpar_fit", "list") containing:

  • fit: the cmdstanr fit object returned by $sample().
  • amm: the (possibly $W-materialized) amm input.
  • family: the (possibly promoted) gdpar_family_multi object.
  • prior: the gdpar_prior object used.
  • design: the AMM design object from build_amm_design().
  • anchor: the resolved anchor value (numeric vector of length $p$).
  • stan_data: the assembled Stan data list.
  • identifiability_report: the report from gdpar_check_identifiability(), or NULL when skip_id_check = TRUE.
  • diagnostics: output of compute_diagnostics(fit, verbose).
  • parametrization: a list with elements cp_a, cp_W, cp_a_per_k, cp_W_per_k, report, and meta (containing mode, note, aggregation, and requested).
  • group_info: result of .resolve_group_argument() (possibly NULL).
  • call: the stored call.
  • path: the string "bayes".
  • p: the integer outcome dimension.

Notes

  • Errors raised (class gdpar_input_error): formula not a length-3 formula; data not a data frame; prior not inheriting gdpar_prior; chains/iter_warmup/iter_sampling/max_treedepth not non-negative integer scalars; adapt_delta outside $[0.5, 0.999]$; refresh not a non-negative numeric scalar; verbose not a logical scalar; family not promotable to gdpar_family_multi; family$p != amm$p; outcome name absent from data; outcome not a matrix/array; outcome containing any non-finite (NA, NaN, Inf) entry — Path 1 does not impute; outcome column count differing from $p$.
  • Errors raised (class gdpar_identifiability_error): basis-restricted identifiability check (C1–C4 + C4-bis) fails at the diagnostic test point; the error data field carries the full report.
  • Family auto-promotion: when family is a bare gdpar_family (not gdpar_family_multi) and amm$p > 1, it is promoted via gdpar_family_multi(family, p = p) and a gdpar_family_promotion_message is emitted when verbose.
  • D-ID informational message: when family$did_status == "holds_under_condition" and verbose, a gdpar_did_message is emitted reporting the condition and reference; the package documents but does not verify the condition from data.
  • Group aliasing: when a group is supplied, .check_group_aliasing_c7() is invoked against the design and group id before Stan data assembly.
  • Message capture under verbose = FALSE: a temporary .log file is opened, sink(..., type = "message") is engaged, and an on.exit handler flush_captured() (a closure defined inside the function) closes the sink, reads the file, deduplicates blocks via dedup_message_blocks(), emits each surviving line through message(), and unlinks the temp file. The handler is registered with add = TRUE, after = FALSE.
  • Extra arguments: any ... names overwrite or extend sample_args entries, allowing caller override of cmdstanr::cmdstan_model$sample() parameters.
  • S3 dispatch: the returned object is dispatched on class gdpar_fit; no methods are defined within this section.
  • Trailing roxygen block: the section ends with a @noRd/@keywords internal documentation block describing a not-yet-defined K-individual path ($K &gt; 1, p = 1$) targeting the amm_distrib_K.stan template via .build_amm_design_K(), .assemble_stan_data_K(), and generate_stan_code_K(); that function's body is not present in this section and is therefore not described here.

Section 3: K-Individual Bayesian Fitting and Model Assembly


.gdpar_K(amm_list_canonical, family, data, prior, anchor, outcome_name, formula_env, skip_id_check, chains, iter_warmup, iter_sampling, adapt_delta, max_treedepth, refresh, verbose, seed, group = NULL, parametrization, parametrization_a, parametrization_W, id_check_rigor = "full", family_id_k_vector = NULL, call, ...)

Purpose

Primary internal entry point for Path 1 (K-individual hierarchical Bayesian) model fitting. Validates all user-supplied arguments, delegates model/data assembly to .gdpar_K_build(), invokes cmdstanr MCMC sampling, computes post-sampling diagnostics and post-fit identifiability (information-contraction) checks, and assembles the final gdpar_fit S3 object.

Arguments

Argument Type Meaning
amm_list_canonical list Canonical AMM (anchor-map model) specifications, one per K slot.
family gdpar_family Response distribution family with K-individual slots promoted to per-observation scope.
data data.frame Model data containing the outcome and all covariates.
prior gdpar_prior or NULL Prior specification; if NULL, defaults to gdpar_prior().
anchor numeric scalar, numeric vector (length K), or character ("prior_mean", "empirical_y") Reference-anchor value(s) for the K slots.
outcome_name character (length 1) Name of the outcome column in data.
formula_env environment Environment in which the model formula is constructed.
skip_id_check logical If TRUE, skip basis-restricted identifiability checks.
chains integer (count) Number of MCMC chains.
iter_warmup integer (count) Warmup iterations per chain.
iter_sampling integer (count) Post-warmup sampling iterations per chain.
adapt_delta numeric scalar in $[0.5,\, 0.999]$ NUTS step-size adaptation target.
max_treedepth integer (count) Maximum NUTS tree depth.
refresh non-negative numeric scalar Stan progress-report refresh interval.
verbose logical scalar Whether to show Stan messages and informational diagnostics.
seed integer or NULL Random seed for reproducibility.
group NULL or group specification Grouping structure for hierarchical partial pooling.
parametrization character Global CP/NCP parametrization mode ("cp" or "ncp").
parametrization_a character or NULL Per-component parametrization for the a coefficients; overrides parametrization when non-NULL.
parametrization_W character or NULL Per-component parametrization for the W basis; overrides parametrization when non-NULL.
id_check_rigor character (default "full") Rigor level for the K-level identifiability check.
family_id_k_vector NULL or vector Per-slot family ID vector.
call call The original matched call (stored in the returned object).
... named arguments Extra arguments forwarded to cs_model$sample(), overriding defaults.

Mathematics

The function orchestrates Bayesian inference for a K-individual dynamic-parameter model. Each slot $k \in \{1, \dots, K\}$ carries its own AMM with design matrices $Z_a^{(k)}$, $Z_b^{(k)}$, and shared covariate matrix $X$. The posterior is explored via the No-U-Turn Sampler (NUTS) with adaptation target adapt_delta and maximum tree depth max_treedepth.

Post-fit, an information-contraction diagnostic is computed per slot. Let $\pi_0(\theta_k)$ denote the prior and $\pi(\theta_k \mid y)$ the posterior for slot $k$'s anchor parameter. The contraction ratio is defined (conceptually) as:

$$ r_k = 1 - \frac{\mathrm{Var}_{\pi(\theta_k \mid y)}[\theta_k]}{\mathrm{Var}_{\pi_0(\theta_k)}[\theta_k]} $$

Slots with $r_k &lt; 0.1$ are flagged as information errors (posterior recovers prior); slots with $r_k \in [0.1, 0.5)$ are flagged as weak learning warnings.

Returns

A list with S3 class c("gdpar_fit", "list") containing:

Element Type Description
fit CmdStanFit object Raw posterior draws.
amm_list_canonical list Canonical AMM list (possibly with materialized $W$ bases).
family gdpar_family Resolved family object.
prior gdpar_prior Resolved prior object.
design_K list K-individual design (matrices $Z_a^{(k)}$, $Z_b^{(k)}$, $X$).
anchor numeric vector (length K) Resolved anchor values.
stan_data list Stan data list passed to the sampler.
identifiability_report list or NULL Pre-fit per-slot identifiability reports with K-level attribute.
identifiability_post_fit list Post-fit information-contraction report.
diagnostics list MCMC convergence diagnostics (R-hat, ESS, divergences, etc.).
parametrization list Resolved CP/NCP flags (cp_a, cp_W, cp_a_per_K, meta).
group_info list or NULL Resolved grouping information.
call call Original function call.
path character Always "bayes".
K integer Number of slots.
slot_names character vector Canonical slot names.

Notes

  • Input validation: Asserts data is a data.frame, prior inherits from gdpar_prior, chains/iter_warmup/iter_sampling/max_treedepth are counts, adapt_delta is in $[0.5, 0.999]$, refresh is a non-negative numeric scalar, and verbose is a logical scalar. Violations raise errors of class gdpar_input_error.
  • Suggested package: Requires cmdstanr (via require_suggested).
  • Message capture: When verbose = FALSE, Stan messages are redirected to a temporary .log file via sink(). A closure flush_captured (registered on on.exit) closes the sink, reads the file, deduplicates message blocks via dedup_message_blocks(), and replays them through message().
  • Extra arguments: Named elements in ... are merged into sample_args after all defaults are set, so they can override any sampling argument.
  • Seed: Only added to sample_args when non-NULL; coerced to integer.
  • Init: Only added to sample_args when bd$init is non-NULL (Tweedie family case).
  • Post-fit warnings: If any slot has status == "information_error", a warning of class gdpar_information_error is issued (always). If any slot has status == "warn" and verbose is TRUE, a warning of class gdpar_information_warning is issued.
  • Side effects: Creates (and cleans up) a temporary log file when verbose = FALSE. Calls do.call(cs_model$sample, sample_args) which writes to disk and may print to console.

.gdpar_K_build(amm_list_canonical, family, data, prior, anchor, outcome_name, formula_env, skip_id_check, verbose, group = NULL, parametrization, parametrization_a, parametrization_W, id_check_rigor = "full", family_id_k_vector = NULL, compile_model_methods = FALSE)

Purpose

Behaviour-preserving extraction of the build phase of .gdpar_K() — everything up to but not including the call to cs_model$sample(). Shared between .gdpar_K() (which passes compile_model_methods = FALSE) and gdpar_geom_fit() (which passes compile_model_methods = TRUE to expose $log_prob/$grad_log_prob/$hessian for the geometry engine). This ensures the model and data are built and compiled exactly once with no duplication.

Arguments

Argument Type Meaning
amm_list_canonical list Canonical AMM specifications, one per K slot.
family gdpar_family Response distribution family.
data data.frame Model data.
prior gdpar_prior or NULL Prior specification; defaults to gdpar_prior() if NULL.
anchor numeric or character Anchor specification for the K slots.
outcome_name character Name of the outcome column.
formula_env environment Environment for formula construction.
skip_id_check logical Whether to skip identifiability checks.
verbose logical Verbosity flag for informational messages.
group NULL or group spec Grouping structure.
parametrization character Global CP/NCP mode.
parametrization_a character or NULL Per-component a parametrization.
parametrization_W character or NULL Per-component W parametrization.
id_check_rigor character (default "full") Rigor for K-level identifiability check.
family_id_k_vector NULL or vector Per-slot family ID vector.
compile_model_methods logical (default FALSE) If TRUE, compiles with compile_model_methods = TRUE to expose log-prob/gradient/Hessian methods.

Mathematics

Design construction. The union of all covariate variables across the K AMM slots is collected:

$$ \mathcal{V} = \bigcup_{k=1}^{K} \left( \mathrm{vars}(a_k) \cup \mathrm{vars}(b_k) \right) $$

A model formula $y \sim \mathcal{V} + 0$ is constructed (or $y \sim 1 + 0$ if $\mathcal{V} = \emptyset$), and the K-individual design is built via .build_amm_design_K().

Anchor resolution. The anchor is resolved via resolve_anchor_K(), producing a length-K numeric vector $\boldsymbol{\theta}_{\mathrm{ref}} = (\theta_{\mathrm{ref}}^{(1)}, \dots, \theta_{\mathrm{ref}}^{(K)})$.

Per-slot identifiability check. For each slot $k$, a diagnostic test point is chosen:

$$ \theta_{\mathrm{diag}}^{(k)} = \begin{cases} 1 & \text{if } b_k \neq \mathrm{NULL} \text{ and } |\theta_{\mathrm{ref}}^{(k)}| < 10^{-8} \ \theta_{\mathrm{ref}}^{(k)} & \text{otherwise} \end{cases} $$

This avoids testing at a degenerate zero anchor when a dispersion/basis term $b_k$ is present. The check is delegated to gdpar_check_identifiability().

K-level identifiability check. A cross-slot check (via .check_identifiability_K()) verifies:

  • D-B3: per-slot $Z_a^{(k)}$ rank sufficiency.
  • D-B2: cross-slot extended Gram matrix rank.

Parametrization resolution. CP/NCP flags are resolved with per-component overrides:

$$ \texttt{cp_a} = \begin{cases} \mathbb{1}[\texttt{parametrization_a} = \texttt{"cp"}] & \text{if } \texttt{parametrization_a} \neq \text{NULL} \ \mathbb{1}[\texttt{parametrization} = \texttt{"cp"}] & \text{otherwise} \end{cases} $$

and analogously for cp_W. Per-slot parametrization (cp_a_per_K) is NULL (deferred to a future sub-phase).

Tweedie init. For the Tweedie family (family$stan_id == 9), the structural constraint $p \in (1.01, 1.99)$ is enforced via a uniform prior, but the default cmdstanr init draws unconstrained parameters in $(-2, 2)$ which frequently land outside the support. The init function pins $\theta_{\mathrm{ref}}^{(K)}$ (the $p$ slot) at the midpoint $1.5$:

$$ \theta_{\mathrm{ref}}^{(j, k)} = \begin{cases} 0 & k < K \ 1.5 & k = K \end{cases}, \quad j = 1, \dots, J_{\mathrm{groups}} $$

Returns

A list with four top-level elements:

Element Type Description
cs_model CmdStanModel Compiled Stan model.
stan_data list Stan data list.
init function or NULL Per-chain init function (Tweedie only); NULL otherwise.
meta list Metadata list (see below).

The meta sub-list contains:

Element Type Description
amm_list_canonical list AMM list (with materialized $W$ bases).
family gdpar_family Resolved family.
prior gdpar_prior Resolved prior.
design_K list K-individual design.
anchor_value numeric vector (length K) Resolved anchor values.
id_report list or NULL Per-slot identifiability reports with K_level attribute.
group_info list or NULL Resolved group information.
parametrization_resolved list Resolved CP/NCP flags and metadata.
slot_names character vector Canonical slot names.
K integer Number of slots.
stan_src character Generated Stan source code.

Notes

  • Outcome validation: Aborts with gdpar_input_error if outcome_name is not a column in data, if the outcome is matrix/array-valued (Path 1 with $K &gt; 1, p = 1$ requires a length-$n$ vector), or if the outcome contains any non-finite values (NA, NaN, Inf). Path 1 does not impute.
  • $W$ basis materialization: For each slot, if $W is non-NULL, it is materialized via materialize_W_basis(W, p = 1L).
  • Identifiability skip: When skip_id_check = TRUE and verbose = TRUE, an informational message of class gdpar_identifiability_message is issued; id_report is set to NULL.
  • Identifiability failure: Per-slot failures abort with class gdpar_identifiability_error (including the slot name and report in data). K-level failures abort similarly, identifying whether the failure was in the per-slot rank check (D-B3) or the cross-slot extended Gram check (D-B2).
  • D-ID conditional families: When family$did_status == "holds_under_condition" and verbose = TRUE, an informational message of class gdpar_did_message is issued documenting the condition and reference. The package does not verify the condition from data.
  • Group aliasing: When groups are present, .check_group_aliasing_c7() is called for each slot's design ($Z_a^{(k)}$, $Z_b^{(k)}$, $X$) against the group ID.
  • Model compilation: When compile_model_methods = FALSE, calls cmdstanr::cmdstan_model(stan_path) (bit-identical to pre-refactor behavior). When TRUE, calls cmdstanr::cmdstan_model(stan_path, compile_model_methods = TRUE).
  • Stan code generation: Delegates to generate_stan_code_K() with the resolved prior, CP/NCP flags, and family.
  • Tempfile: Stan source is written to a temporary file via write_stan_to_tempfile().

flush_captured() (local closure defined inside .gdpar_K)

Purpose

Closure defined inside .gdpar_K() when verbose = FALSE. Registered on on.exit() to cleanly close the message sink, read the captured Stan messages from the temporary log file, deduplicate them, and replay them through message().

Arguments

None. Captures msg_con (the file connection) and msg_file (the temp file path) from the enclosing scope.

Returns

NULL (invisible). Side effect: messages are replayed to the R message stream.

Notes

  • Checks sink.number(type = "message") > 0L before closing the sink to avoid errors if the sink was already closed.
  • Checks isOpen(msg_con) before closing the connection.
  • Reads the log file with readLines(msg_file, warn = FALSE); on error, returns character(0).
  • Unlinks the temporary file after reading.
  • Passes captured lines through dedup_message_blocks() before replaying.
  • Registered via on.exit(flush_captured(), add = TRUE, after = FALSE) so it executes before any other on.exit handlers.

Documentation for resolve_anchor_K appears at the end of this section (roxygen block only); its function definition is not present in this section and will be documented in a subsequent section.

resolve_anchor_K(anchor, family, y, prior, slot_names, verbose)

Purpose

Resolves the user-supplied anchor argument for the K-slot (multi-parameter-per-response) path of gdpar. It returns a length-K numeric vector on the linear-predictor scale, one anchor value per parameter slot (e.g. location, scale, shape). This is the slot-aware counterpart of the simpler resolve_anchor.

Arguments

  • anchor: A numeric scalar, a numeric vector of length K, or a single character string ("prior_mean" or "empirical_y").
  • family: A gdpar_family object. Only used when anchor = "empirical_y"; its $linkfun component is applied to the outcome mean for the location (first) slot.
  • y: Numeric vector of outcomes. Used only when anchor = "empirical_y".
  • prior: A gdpar_prior object. Accepted as part of the signature but not referenced in the function body.
  • slot_names: Character vector of length K giving the names of the parameter slots. Used for naming the output and for validating a named anchor vector.
  • verbose: Logical scalar. If TRUE, an informational message is issued when the empirical anchor is used.

Mathematics

For "prior_mean":

$$\boldsymbol{\eta}^{\text{anchor}} = \mathbf{0} \in \mathbb{R}^{K}$$

For "empirical_y":

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \eta_{\text{loc}} = g(\bar{y})$$

where $g$ is family$linkfun. The returned vector is:

$$\boldsymbol{\eta}^{\text{anchor}} = (\eta_{\text{loc}},; 0,; \ldots,; 0)^{\top}$$

i.e. only the first slot (location) receives the link-transformed mean; all remaining slots are anchored at $0$.

Returns

A numeric vector of length K with names set to slot_names.

Notes

  • Scalar broadcast: A finite numeric scalar is recycled to all K slots.
  • Named vector validation: If anchor is a length-K numeric vector with at least one non-empty name, the names must form the same set as slot_names (order-independent, checked via setequal). A mismatch raises a gdpar_input_error (class "gdpar_input_error") with data = list(received_names = ..., expected_names = ...).
  • Unnamed vector: A length-K numeric vector with no names is accepted positionally.
  • Link failure: If family$linkfun(yb) errors, the error is caught and re-raised as gdpar_input_error with the original condition message.
  • Invalid anchor: Any input not matching the accepted patterns raises gdpar_input_error with data = list(received = anchor, K = K, slot_names = slot_names).
  • The prior argument is unused in the body.
  • When verbose is TRUE and "empirical_y" is used, a message of class "gdpar_anchor_message" is emitted via gdpar_inform.

resolve_anchor_multi(anchor, family, y, prior, p, verbose)

Purpose

Multivariate-outcome counterpart of resolve_anchor. Resolves the anchor for a model with p outcome coordinates, returning a length-p numeric vector on the linear-predictor scale.

Arguments

  • anchor: A numeric scalar, a numeric vector of length p, or a single character string ("prior_mean" or "empirical_y").
  • family: A gdpar_family_multi object. Must contain a families component that is a list of length p, each element providing a $linkfun.
  • y: Numeric matrix of outcomes (n × p). Used only when anchor = "empirical_y".
  • prior: A gdpar_prior object. Accepted but not referenced in the body.
  • p: Integer scalar; the number of outcome coordinates (length of theta_ref).
  • verbose: Logical scalar. If TRUE, an informational message is issued when the empirical anchor is used.

Mathematics

For "prior_mean":

$$\boldsymbol{\eta}^{\text{anchor}} = \mathbf{0} \in \mathbb{R}^{p}$$

For "empirical_y":

$$\bar{y}_k = \frac{1}{n}\sum_{i=1}^{n} y_{ik}, \qquad \eta_k = g_k(\bar{y}_k), \quad k = 1, \ldots, p$$

where $g_k$ is family$families[[k]]$linkfun. The returned vector is:

$$\boldsymbol{\eta}^{\text{anchor}} = (\eta_1, \ldots, \eta_p)^{\top}$$

Returns

A numeric vector of length p (unnamed).

Notes

  • Scalar broadcast: A finite numeric scalar is recycled to length p.
  • Vector pass-through: A length-p finite numeric vector is returned as-is (coerced to double), without naming or name validation (unlike resolve_anchor_K).
  • Per-coordinate link: Each coordinate's link function is applied independently via vapply. If any fam_k$linkfun(yb[k]) errors, it is caught and re-raised as gdpar_input_error identifying the failing coordinate index k.
  • Invalid anchor: Raises gdpar_input_error with data = list(received = anchor, p = p).
  • The prior argument is unused in the body.
  • When verbose is TRUE and "empirical_y" is used, a message of class "gdpar_anchor_message" is emitted showing the formatted anchor values.

resolve_anchor(anchor, family, y, prior, verbose)

Purpose

Simplest (univariate) anchor resolver. Parses the user-supplied anchor argument and returns a single numeric value on the linear-predictor scale.

Arguments

  • anchor: A numeric scalar, or a single character string ("prior_mean" or "empirical_y").
  • family: A gdpar_family object. Must provide $linkfun (used for "empirical_y") and $link (a string naming the link, used in the verbose message).
  • y: Numeric vector of outcomes. Used only when anchor = "empirical_y".
  • prior: A gdpar_prior object. Accepted but not referenced in the body.
  • verbose: Logical scalar. If TRUE, an informational message is issued when the empirical anchor is used.

Mathematics

For "prior_mean":

$$\eta^{\text{anchor}} = 0$$

For "empirical_y":

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \eta^{\text{anchor}} = g(\bar{y})$$

where $g$ is family$linkfun.

Returns

A numeric scalar.

Notes

  • Unlike resolve_anchor_K and resolve_anchor_multi, this function does not accept a vector anchor; only a scalar or string is valid.
  • If family$linkfun(yb) errors, the error is caught and re-raised as gdpar_input_error with the original condition message.
  • Invalid anchor: Raises gdpar_input_error with data = list(received = anchor).
  • The prior argument is unused in the body.
  • When verbose is TRUE and "empirical_y" is used, a message of class "gdpar_anchor_message" is emitted, referencing family$link (the link name) and the computed anchor value.

compute_diagnostics(fit, verbose = TRUE)

Purpose

Collects convergence diagnostics from a cmdstanr fit object at fit time. Extracts R-hat and effective sample size (bulk and tail) summaries via the posterior package, counts divergent transitions and tree-depth saturations from sampler diagnostics, and optionally warns when thresholds are violated.

Arguments

  • fit: A cmdstanr fit object. Must provide $draws(), $diagnostic_summary().
  • verbose: Logical scalar (default TRUE). If TRUE and convergence thresholds are violated, individual warnings are issued.

Mathematics

Let $V$ be the set of model variables retained after excluding those matching the regex ^(eta|log_lik|y_pred|theta_i|a_coef|b_coef|a_raw|b_raw|W_raw). The convergence criteria are:

$$\hat{R}_{\max} = \max_{v \in V} \hat{R}_v \leq 1.01$$

$$\text{ESS}_{\text{bulk,min}} = \min_{v \in V} \text{ESS}_{\text{bulk},v} \geq 400$$

$$\text{ESS}_{\text{tail,min}} = \min_{v \in V} \text{ESS}_{\text{tail},v} \geq 400$$

$$\frac{n_{\text{div}}}{N_{\text{iter}}} \leq 0.001, \qquad \frac{n_{\text{td}}}{N_{\text{iter}}} \leq 0.01$$

where $N_{\text{iter}} = n_{\text{iterations}} \times n_{\text{chains}}$, $n_{\text{div}}$ is the total divergent-transition count, and $n_{\text{td}}$ is the total tree-depth-saturation count. Overall convergence is:

$$\text{converged} = \bigwedge \text{ of all five conditions above}$$

Returns

An object of class c("gdpar_diagnostics", "list") with the following named components:

Component Type Description
rhat_max numeric (scalar, possibly NA_real_) Maximum R-hat across retained variables
ess_bulk_min numeric (scalar, possibly NA_real_) Minimum bulk ESS across retained variables
ess_tail_min numeric (scalar, possibly NA_real_) Minimum tail ESS across retained variables
divergent_count integer Total divergent transitions
divergent_relative numeric (possibly NA_real_) divergent_count / total_iter
treedepth_saturated integer Total tree-depth saturations
treedepth_relative numeric (possibly NA_real_) treedepth_saturated / total_iter
efmi_min numeric (possibly NA_real_) Minimum E-BFMI across chains
converged logical Whether all thresholds are satisfied
summary data.frame Full output of posterior::summarise_draws on retained variables
messages character vector Warning messages (populated only when verbose && !converged; otherwise empty)

Notes

  • Dependency check: Calls require_suggested("posterior", "summarize posterior draws") at entry.
  • Variable filtering: Variables whose names match the regex ^(eta|log_lik|y_pred|theta_i|a_coef|b_coef|a_raw|b_raw|W_raw) are excluded from convergence summaries. If this leaves zero variables, the function falls back to "theta_ref" (if present), then to the first variable in the draws.
  • Summary measures: Uses posterior::default_convergence_measures() appended to the summarise_draws call.
  • Diagnostic summary: fit$diagnostic_summary() is called inside tryCatch; if it errors or returns NULL, the divergent and treedepth counts default to 0L and efmi_min defaults to NA_real_.
  • Total iterations: Computed as posterior::niterations(draws) * posterior::nchains(draws) inside tryCatch; if this fails or is non-finite, set to NA_integer_, and the relative divergence/treedepth rates become NA_real_.
  • Warning behavior: When verbose = TRUE and converged = FALSE, each violated threshold produces a separate warning via gdpar_warn with class "gdpar_diagnostic_warning". The same messages are collected in the messages component.
  • Side effects: May emit up to five warnings (one per violated threshold) when verbose is TRUE.
  • S3 class: The returned object has class c("gdpar_diagnostics", "list"); no S3 methods for this class are defined in this section.

Diagnostics Accessor

diagnostics(fit)

Purpose

Exported accessor that retrieves the precomputed diagnostics component stored inside a fitted gdpar_fit object. It is the user-facing entry point for inspecting sampler-level diagnostics (e.g., divergences, R-hat, ESS, max treedepth) that were attached to the fit during model estimation. The function performs no recomputation; it simply validates the class of its input and returns the stored list element named diagnostics.

Arguments

Argument Type Meaning
fit S3 object of class "gdpar_fit" A fitted model object produced by the package's main estimation routine. Must contain a diagnostics element.

Returns

The object fit$diagnostics, returned as-is. The exact structure is whatever was attached at fit time (typically a named list of diagnostic quantities); this function does not coerce or reshape it.

Notes

  • Calls assert_inherits(fit, "gdpar_fit", "fit") before accessing the element. If fit does not inherit from class "gdpar_fit", an error is raised (with the argument name "fit" used in the message). The error is propagated from assert_inherits; no custom class is set by diagnostics itself.
  • No side effects: the function is pure with respect to fit.
  • S3 dispatch: none. This is an ordinary (non-generic) function; there is no UseMethod call, so subclasses of gdpar_fit cannot override behavior.
  • If fit$diagnostics is NULL or absent, NULL is returned silently (R's standard behavior for accessing a missing list element); no further validation of the element's presence is performed.

Group Argument Resolution

.resolve_group_argument(group, data, n, verbose = TRUE)

Purpose

Internal helper that translates the user-facing group argument — either NULL or a one-sided formula such as ~ species — into the integer grouping vector consumed by the Stan template underlying gdpar. It validates the formula, extracts the referenced variable from data, coerces it to a factor, and returns the integer codes together with metadata (variable name and level labels). When verbose is TRUE and at least one group has fewer than five observations, it emits an informational warning about the dominance of hierarchical shrinkage in small groups.

Arguments

Argument Type Meaning
group NULL or formula of length 2 (one-sided) The user-supplied grouping specification. NULL indicates no grouping (function returns NULL immediately). A one-sided formula must contain exactly one variable name on its right-hand side.
data data.frame The user-supplied data frame from which the grouping variable is extracted via data[[var_name]].
n integer (scalar) The sample size of the outcome variable, used to validate that the grouping variable has matching length.
verbose logical (scalar, default TRUE) When TRUE, enables the small-group informational warning. When FALSE (or any non-TRUE value, tested via isTRUE), the warning is suppressed.

Mathematics

Let $g_i \in \{1, \dots, J\}$ denote the group index for observation $i$, obtained by factor-coercing the raw grouping variable. Let $n_g = \sum_{i=1}^{n} \mathbf{1}(g_i = g)$ be the per-group count. The small-group warning fires when

$$ \min_{g \in {1,\dots,J}} n_g < 5. $$

In that regime, the hierarchical anchor $\theta_{\text{ref},g}$ is dominated by shrinkage toward the population mean $\mu_{\theta_{\text{ref}}}$, so per-group posteriors may be uninformative.

Returns

  • If group is NULL: returns NULL.
  • Otherwise: a named list with three components:
    • group_id: integer vector of length $n$, with values in $\{1, \dots, J\}$, obtained via as.integer(as.factor(raw)). The mapping from raw values to integers follows as.factor's level ordering (alphabetical by default for character inputs, or the factor's pre-existing level order).
    • var_name: character scalar — the single variable name extracted from the formula via all.vars(group)[[1]].
    • levels: character vector — levels(fac), i.e., the original group level labels in factor order.

Notes

  • Formula validation. If group is not NULL, it must inherits(group, "formula") and have length(group) == 2L (one-sided formulas have length 2: ~ plus RHS). Violations raise:
    gdpar_abort("Argument 'group' must be a one-sided formula such as ~ species.",
                class = "gdpar_input_error")
    
  • Single-variable check. all.vars(group) must yield exactly one variable. If zero or more than one are found, an error of class "gdpar_input_error" is raised, with data = list(variables_found = vars) attached for programmatic inspection.
  • Column existence. The extracted var_name must appear in colnames(data); otherwise an error of class "gdpar_input_error" is raised with data = list(missing_variable = var_name).
  • Length consistency. length(raw) must equal n. A mismatch raises an error of class "gdpar_input_error" reporting both lengths.
  • Missing values. Any NA in raw is fatal: an error of class "gdpar_input_error" is raised reporting the count of NA values via sum(is.na(raw)). The message instructs the user to remove or impute before fitting.
  • Zero levels. After as.factor, if nlevels(fac) < 1L (defensive guard; under normal R semantics this cannot occur for a non-NA vector), an error of class "gdpar_input_error" is raised.
  • Small-group warning. When isTRUE(verbose) and length(counts) > 0L and min(counts) < 5L, a warning is emitted via gdpar_warn with class "gdpar_grouping_warning". The message includes var_name, min(counts), and J_groups. The warning is purely informational (not an error) and does not abort resolution.
  • Coercion semantics. The raw variable is coerced via as.factor(raw), so numeric grouping variables are treated by their printed values; factor inputs retain their level order; character inputs are sorted alphabetically by as.factor.
  • No side effects beyond the optional warning. The function does not modify data or any global state.
  • Not exported. Marked @keywords internal and @noRd; intended solely for internal use within the package's fitting pipeline.

R/geometry_bridge.R

.gdpar_geom_enable_methods(fit, hessian = TRUE, seed = 1L)

Purpose Internal helper that enables the standalone log_prob, grad_log_prob, and optionally hessian methods on a CmdStanFit / CmdStanMCMC object. It is called by gdpar_geom_bridge() to prepare a fitted cmdstan object for consumption by the geometry engine. The call is idempotent: init_model_methods() is a no-op when the methods are already attached.

Arguments

Argument Type Meaning
fit CmdStanMCMC or CmdStanFit A cmdstan fit object whose compiled executable should have the standalone methods exposed.
hessian logical (length 1) Whether to request compilation of the standalone Hessian (needed by the Riemannian / SoftAbs geometry level). Defaults to TRUE.
seed integer-coercible (length 1) Forwarded to fit$init_model_methods(seed = ...). Seeds any internal RNG inside the standalone methods. Defaults to 1L.

Mathematics None; this is a plumbing / compilation-gating function.

Algorithm (two-stage fallback)

  1. Attempt fit$init_model_methods(seed, verbose = FALSE, hessian = TRUE).
  2. If that succeeds, return list(has_hessian = TRUE).
  3. If hessian = TRUE was requested and step 1 failed, retry with hessian = FALSE.
    • If the retry succeeds, emit a gdpar_geometry_warning via gdpar_warn() and return list(has_hessian = FALSE).
    • This fallback leaves the Euclidean, dense, and sub-Riemannian (expected-Fisher) levels fully operational; only the Riemannian SoftAbs level is disabled.
  4. If both attempts fail (or hessian = FALSE was requested and the single attempt failed), call gdpar_abort() with class "gdpar_input_error", surfacing the original error message. A C++ toolchain capable of compiling the model methods is required.

Returns invisible(list(has_hessian = logical(1))) — a single-element list recording whether the Hessian method is available. Returned invisibly.

Notes

  • The tryCatch / try_init wrapper converts errors to condition objects rather than propagating them, enabling the two-stage fallback.
  • A failure to expose log_prob / grad_log_prob at all is fatal (gdpar_abort); only Hessian failure is tolerated.
  • Side effects: may emit a gdpar_geometry_warning via gdpar_warn() on Hessian fallback.

.gdpar_geom_unconstrained_summary(fit)

Purpose Internal helper that extracts the unconstrained-dimension and a posterior-mean warm-start vector from a cmdstan fit with methods already enabled. The geometry engine operates on the unconstrained (real-vector) scale, so this provides both dim (the dimension of that space) and reference (the posterior mean in that space, used as a warm-start for position-dependent levels).

Arguments

Argument Type Meaning
fit CmdStanMCMC or CmdStanFit A methods-enabled cmdstan fit object. Must support $unconstrain_draws().

Mathematics Given the matrix of unconstrained draws $\mathbf{U} \in \mathbb{R}^{S \times d}$ where $S$ is the number of posterior samples and $d$ is the unconstrained dimension:

$$\text{dim} = d, \qquad \text{reference} = \frac{1}{S} \sum_{s=1}^{S} \mathbf{U}_{s,\cdot} \in \mathbb{R}^{d}$$

That is, the reference is the column-wise mean of the unconstrained draws (the posterior mean on the unconstrained scale).

Returns A named list with two elements:

Element Type Meaning
dim integer (scalar) Number of columns of the unconstrained draws matrix $d$.
reference numeric vector (length $d$, unnamed) Posterior mean on the unconstrained scale.

Notes

  • Calls posterior::as_draws_matrix(fit$unconstrain_draws()) then coerces to a base matrix via as.matrix().
  • If fit$unconstrain_draws() errors, gdpar_abort() is called with class "gdpar_input_error", forwarding the original message.
  • The reference vector is unname()-d to strip any column names.

.gdpar_geom_bridge_core(model, stan_data, dim, fisher = NULL, reference = NULL, engine_fit = NULL, has_hessian = TRUE, extra = list())

Purpose Internal factory that assembles the gdpar_geom_bridge list object from its constituent parts. Called by both gdpar_geom_bridge() (the post-hoc bridge) and gdpar_geom_fit() (the one-call entry). It constructs the target (for re-sampling diagnostics) and geom_target (for the geometry engine) components and attaches metadata.

Arguments

Argument Type Meaning
model CmdStanModel A compiled cmdstan model (with methods) that can be re-sampled. Used for the diagnostic target.
stan_data named list The Stan data list for the model.
dim integer-coercible (scalar) The unconstrained dimension $d$.
fisher function or NULL An optional function $\theta \mapsto F(\theta)$ returning the expected Fisher information matrix. Required by the sub-Riemannian geometry level. Must be NULL or a function; validated at entry.
reference numeric vector or NULL The unconstrained reference position (warm-start). Can be NULL if no warm-start is available.
engine_fit CmdStanMCMC / CmdStanFit or NULL A methods-enabled cmdstan fit to back geom_target. When NULL, the geom_target is derived from model instead (a cheap one-iteration sample would expose the methods).
has_hessian logical (scalar) Whether the Hessian method is available. When FALSE, the $hessian closure in geom_target is set to NULL so that the SoftAbs level degrades gracefully instead of erroring at call time.
extra named list Additional elements to merge into the returned object (e.g., extra metadata).

Returns An object of class c("gdpar_geom_bridge", "list") containing:

Element Type / Class Meaning
target list with $model, $dim, $data The re-samplable diagnostic target.
geom_target gdpar_geom_target object The engine sampling target on the unconstrained scale (exposes $log_prob, $grad_log_prob, and optionally $hessian).
fisher function or NULL The expected-Fisher function, if supplied.
reference numeric vector or NULL Warm-start position.
dim integer The unconstrained dimension.
model CmdStanModel The methods-enabled compiled model.
stan_data named list The Stan data list.
... any Any extra elements from the extra argument.

Notes

  • fisher is validated: if non-NULL and not a function, gdpar_abort() is called with class "gdpar_input_error".
  • geom_target is created by calling gdpar_geom_target(object = src, dim = dim, data = stan_data) where src is engine_fit when non-NULL, otherwise model.
  • When has_hessian is FALSE, geom_target$hessian is explicitly set to NULL so that downstream consumers (the SoftAbs level) degrade gracefully.
  • The dim argument is coerced to integer in both the target sub-list and the top-level object.

gdpar_geom_bridge(object, fisher = NULL, reference = NULL, hessian = TRUE, methods_seed = 1L, ...)

Purpose Exported function. The durable, path-agnostic core of the Block RG integration (RG.6 part ii). Takes an already-fitted gdpar_fit object, enables standalone log-probability / gradient / Hessian methods on its cmdstan fit, derives the unconstrained dimension and posterior-mean warm-start, recompiles a re-samplable CmdStanModel from the fit's own Stan source, and returns a gdpar_geom_bridge object consumable by gdpar_geom_orchestrate(). It never touches the gdpar() default fit path, so the golden tests remain bit-identical.

Arguments

Argument Type Meaning
object gdpar_fit A fitted gdpar model (result of gdpar()). Must carry $fit (a CmdStanMCMC / CmdStanFit) and $stan_data.
fisher function or NULL Optional function of the unconstrained $\theta$ returning the expected Fisher information matrix. Required by the sub-Riemannian level. Use gdpar_geom_fisher_simulator() when no closed form exists.
reference numeric vector or NULL Optional unconstrained reference position. Defaults to the posterior mean extracted from the fit's draws. If supplied, must have length equal to the unconstrained dimension $d$.
hessian logical (length 1) Whether to compile the standalone Hessian method (needed by the Riemannian SoftAbs level). Defaults to TRUE.
methods_seed integer (length 1) Seed forwarded to init_model_methods(). Defaults to 1L.
... Reserved for future extension; currently unused.

Algorithm

  1. Input validation: verify object inherits from "gdpar_fit", that $fit is a CmdStanMCMC / CmdStanFit, that $stan_data is non-NULL, that reference (if non-NULL) is numeric.
  2. Dependency check: call require_suggested("cmdstanr", ...) and require_suggested("posterior", ...).
  3. Enable methods: call .gdpar_geom_enable_methods(csfit, hessian, seed) on the fitted cmdstan object. This returns meth with meth$has_hessian.
  4. Unconstrained summary: call .gdpar_geom_unconstrained_summary(csfit) to obtain the dimension $d$ and the posterior-mean reference. If reference was NULL, use the posterior mean; otherwise validate its length equals $d$.
  5. Recompile model: call cmdstanr::cmdstan_model(cmdstanr::write_stan_file(csfit$code()), compile_model_methods = TRUE) to get a fresh, re-samplable CmdStanModel from the fit's Stan source. Cmdstanr's content-hash cache makes this a cache hit when the methods variant already exists.
  6. Assemble bridge: call .gdpar_geom_bridge_core(model, stan_data, d, fisher, reference, engine_fit = csfit, has_hessian = meth$has_hessian).

Returns An object of class c("gdpar_geom_bridge", "list") — see .gdpar_geom_bridge_core() for the full structure. Feed (target, geom_target, fisher, reference) to gdpar_geom_orchestrate().

Notes

  • Path-agnostic: works for K-individual, multi-coordinate, and single-coordinate gdpar_fit objects, provided they carry $fit and $stan_data.
  • Uses the fit's own $code() Stan source for recompilation, guaranteeing model identity.
  • Cmdstanr's hash-based compilation cache means recompilation is typically a no-op cache hit.
  • Errors at any validation or compilation step raise gdpar_abort() with class "gdpar_input_error".
  • This function is the tool RG.7 points at the real Tweedie count of benchmark 9.2.O.

print.gdpar_geom_bridge(x, ...)

Purpose S3 print method for objects of class "gdpar_geom_bridge". Provides a concise, human-readable summary of the bridge's key properties.

Arguments

Argument Type Meaning
x gdpar_geom_bridge The bridge object to print.
... Unused; present for S3 method compatibility.

Output (to console)

<gdpar_geom_bridge>
  unconstrained dim: <d>
  fisher supplied:   <TRUE/FALSE>
  reference:         <description>
  feed (target, geom_target, fisher, reference) to gdpar_geom_orchestrate()

Where <description> is:

  • "zeros (default)" when x$reference is NULL.
  • "length <n> (warm-start)" when x$reference is non-NULL, with <n> = length(x$reference).

Returns invisible(x) — the input object, unchanged.

Notes Pure side-effect (console output). No validation beyond what cat() and is.null() provide.


.gdpar_geom_fit_resolve_K(formula, family, amm, W, data)

Purpose Internal helper used by gdpar_geom_fit() to resolve the front-end inputs for the K-individual (distributional) model-building path. It reuses the same resolution logic as gdpar() without touching gdpar() itself, and returns the arguments that .gdpar_K_build() needs. Scoped to $K &gt; 1$.

Arguments

Argument Type Meaning
formula gdpar_formula_set, formula, or other A gdpar_bf() formula set, a classic formula with AMM wrapper calls on the RHS, or (with a named amm list) a two-sided y ~ ... formula.
family gdpar_family, gdpar_family_multi, or named list of gdpar_family The distribution family. A single gdpar_family is broadcast across all K slots; a named list enables heterogeneous families per slot.
amm amm_spec (default) or named list of amm_spec The additive multiresolution specification. When formula is a gdpar_formula_set, amm must stay at its default. When a named list of amm_spec, it is used directly.
W modulating basis or NULL Optional modulating basis for the K-individual paths.
data data frame The data (not directly used in resolution but part of the interface contract).

Algorithm / Resolution Logic

The function handles three mutually exclusive input patterns:

Case 1 — formula is a gdpar_formula_set:

  • amm must be at its default (amm_spec()); otherwise error.
  • Derive amm_list_canonical via .gdpar_formula_set_to_amm_spec_list(formula, W).
  • Extract outcome_name from formula$outcome and formula_env from formula$env.

Case 2 — amm is a named list of amm_spec:

  • Each entry must be a named amm_spec; validated in a loop.
  • formula must be a two-sided formula (length(formula) == 3L); otherwise error.
  • amm_list_canonical <- amm.
  • outcome_name extracted from formula[[2L]].
  • formula_env <- environment(formula).

Case 3 — Classic formula with AMM wrapper calls on the RHS:

  • Detected via .gdpar_rhs_has_amm_calls(formula).
  • amm must be at its default; otherwise error.
  • Derive the first eligible parameter name from family$param_specs[[1]]$name.
  • Build a gdpar_formula_set via do.call(gdpar_formula_set, list(formula)) named with that parameter.
  • Derive amm_list_canonical from the formula set.
  • Extract outcome_name and formula_env from the formula set.

If none of the three cases applies, gdpar_abort() is called with a message indicating that only the K-individual path is supported (RG.6 scope).

Post-resolution checks:

  • K <- length(amm_list_canonical). If $K \leq 1$, abort: gdpar_geom_fit() requires $K &gt; 1$.
  • Family resolution:
    • If family is a named list of gdpar_family (detected by checking it's a list, not a gdpar_family or gdpar_family_multi, has non-empty unique names, and every element inherits from "gdpar_family"): call .gdpar_resolve_heterogeneous_family_K(family, names(amm_list_canonical)) to get family_promoted and family_id_k_vector.
    • Otherwise: call .gdpar_promote_scope_per_observation(family, names(amm_list_canonical)) to get a single family_promoted; family_id_k_vector is NULL.

Returns A named list:

Element Type Meaning
amm_list_canonical named list of amm_spec Canonical AMM specification list (length $K$).
family gdpar_family or gdpar_family_multi The resolved (possibly promoted) family.
outcome_name character (length 1) Name of the response variable.
formula_env environment The environment associated with the formula (for evaluating terms).
family_id_k_vector integer vector or NULL Per-slot family index for heterogeneous family lists; NULL when a single family is broadcast.
K integer (scalar) Number of distributional slots ($K &gt; 1$).

Notes

  • This function is the shared seam between gdpar_geom_fit() and gdpar()'s internal K path: it reuses .gdpar_K_build() as the single source of model assembly.
  • The family_is_named_list detection checks: is.list(family), not inheriting from "gdpar_family" or "gdpar_family_multi", non-NULL names, all names non-empty (nzchar), no duplicated names, and every element inheriting from "gdpar_family".
  • All error paths use gdpar_abort() with class "gdpar_input_error".
  • The error for $K \leq 1$ includes data = list(K = K) for programmatic access to the slot count.

gdpar_geom_fit(formula, family, amm, W, data, prior, anchor, skip_id_check, parametrization, parametrization_a, parametrization_W, id_check_rigor, group, fisher, budget, criteria, entry_level, level_map, reference, speed, rest_mass, laplace_fallback, laplace_draws, n_grid, seed, verbose, ...)

Purpose
High-level entry point for fitting General Dynamic Parameter (GDP) models using geometry-adaptive optimization (opt-in controller) with reference anchoring. Builds the hierarchical Bayesian model once, then runs the geometry-adaptive controller (an alternative to the default gdpar() fit path). Requires cmdstanr and posterior packages.

Arguments

  • formula : formula – Model formula specifying response and predictors.
  • family : gdpar_family – Distribution family (default: gdpar_family("gaussian")).
  • amm : amm_spec – Adaptive Mixture of Multivariate normals specification (default: amm_spec()).
  • W : matrix or NULL – Weight matrix for the model (default: NULL).
  • data : data.frame – Data frame containing variables referenced in formula.
  • prior : list or NULL – Prior specifications for parameters (default: NULL).
  • anchor : character or numeric – Anchor for reference (default: "prior_mean").
  • skip_id_check : logical – If TRUE, skip identity checks (default: FALSE).
  • parametrization : character – One of "auto", "ncp" (non-centered), "cp" (centered); default "auto".
  • parametrization_a : character or NULL – Overrides parametrization for a parameters; must be "ncp" or "cp" if set.
  • parametrization_W : character or NULL – Overrides parametrization for W parameters; must be "ncp" or "cp" if set.
  • id_check_rigor : character – One of "full" or "fast" (default: "full").
  • group : character or NULL – Grouping variable for hierarchical structure (default: NULL).
  • fisher : matrix, list, or NULL – Precomputed Fisher information or specification (default: NULL).
  • budget : numeric or NULL – Computational budget for optimizer (default: NULL).
  • criteria : list or NULL – Convergence criteria for optimizer (default: NULL).
  • entry_level : integer or NULL – Initial level in hierarchy for geometry controller (default: NULL).
  • level_map : list or NULL – Mapping of levels for geometry controller (default: NULL).
  • reference : numeric or NULL – Reference point for anchoring (default: NULL).
  • speed : numeric – Speed parameter for geometry controller (default: 10).
  • rest_mass : numeric – Rest mass parameter for geometry controller (default: 1).
  • laplace_fallback : logical – If TRUE, use Laplace approximation as fallback (default: FALSE).
  • laplace_draws : integer – Number of draws from Laplace fallback (default: 0L).
  • n_grid : integer or NULL – Grid size for geometry controller (default: NULL).
  • seed : integer – Random seed for reproducibility (default: 20260603L).
  • verbose : logical – If TRUE, print progress messages (default: TRUE).
  • ... : Additional arguments passed to internal functions.

Mathematics
The function orchestrates a two-stage process:

  1. Model building: Constructs a Stan model via .gdpar_K_build, yielding a compiled model and data.
  2. Geometry-adaptive fitting:
    • A throwaway single-iteration MCMC sample (chains=1, iter_warmup=1, iter_sampling=1) extracts the unconstrained parameter dimension $d$ and enables model methods (e.g., Hessian).
    • A bridge object (.gdpar_geom_bridge_core) defines the target density $\pi(\theta)$ and geometry-informed target $\tilde{\pi}(\theta)$, optionally using a Fisher information matrix $\mathcal{I}(\theta)$.
    • The geometry controller (gdpar_geom_orchestrate) performs adaptive sampling over a hierarchy of levels, using speed $v$ and rest mass $m$ to navigate the parameter space. It may fall back to Laplace approximation $\mathcal{N}(\hat{\theta}, \mathcal{I}(\hat{\theta})^{-1})$ if requested.

Returns
An object of class gdpar_geom_fit (inheriting from list) with elements:

  • orchestration : Output from gdpar_geom_orchestrate.
  • bridge : Bridge object from .gdpar_geom_bridge_core.
  • status : Character scalar – optimization status (e.g., "resolved").
  • level : Numeric scalar – level at which resolution occurred.
  • metric : List – geometry metric used.
  • draws : Matrix – posterior draws (samples × parameters).
  • laplace : List – Laplace fallback results (if used).
  • certificate : List – verification certificate (if applicable).
  • stan_data : List – data passed to Stan.
  • family : gdpar_family – family object.
  • K : Integer – number of mixture components/groups.
  • slot_names : Character vector – model slot names.
  • call : Matched call.

Notes

  • Requires packages cmdstanr and posterior; raises error if missing.
  • The throwaway sample discards all draws; its purpose is only to initialize methods and compute $d$.
  • parametrization="auto" lets internal logic choose between non-centered (ncp) and centered (cp) parametrization.
  • The geometry controller adapts over levels; entry_level and level_map control this hierarchy.
  • If laplace_fallback=TRUE, uses laplace_draws posterior samples from the Laplace approximation.
  • seed ensures reproducibility of the geometry controller's internal sampling.
  • The printed message states that this function never alters the default gdpar() fit path.

print.gdpar_geom_fit(x, ...)

Purpose
S3 print method for gdpar_geom_fit objects. Provides a concise summary of the fit.

Arguments

  • x : gdpar_geom_fit – Object to print.
  • ... : Additional arguments (currently unused).

Mathematics
None.

Returns
Invisibly returns the input object x.

Notes

  • Dispatches automatically for objects of class gdpar_geom_fit.
  • Output varies based on x$status:
    • If status == "resolved", prints the level and draws dimensions (if draws exist).
    • Otherwise, prints verdict and prescription count from x$certificate (if present).
  • If x$laplace is not NULL, appends the Laplace fit quality label.
  • Does not modify the object; side effect is printing to console.

R/geometry_diagnostic.R

gdpar_geometry_thresholds()

Purpose

Returns the default decision thresholds for the rule-based classifier inside gdpar_geometry_diagnostic. These are exposed as data rather than hard-coded so that the calibration of Block RG.1 can be tuned against gdpar_geometry_suite and so users can re-calibrate for their own settings.

Arguments

None.

Returns

A named list of 13 numeric scalars:

Element Default Role
divergent_rate_high 0.01 Upper tolerable divergence rate before a pathology is flagged.
funnel_ebfmi_low 0.35 Minimum E-BFMI below which a funnel-type energy pathology is suspected.
heavy_cond_max 25 Maximum tolerable condition number for the heavy-tail class.
treedepth_sat_high 0.20 Upper tolerable fraction of iterations hitting max_treedepth.
condition_high 12 Condition-number cut above which anisotropy is flagged.
step_scale_ratio_low 0.10 Lower cut on the adapted-step-to-scale ratio.
nslope_grows 0.80 Slope threshold for the difficulty-vs-$n$ curve above which difficulty is deemed to grow with $n$ (quasi-determinism indicator).
flat_var_high 600 Variance cut above which a flat-direction pathology is flagged.
boundary_prox_high 0.02 Fraction of draws within boundary_eps of a boundary above which boundary pile-up is flagged.
boundary_eps 0.01 Distance defining proximity to a parameter boundary.
multimodal_high 2.5 Cut on the multimodality signal above which multimodality is flagged.
heavy_kurtosis_high 1.8 Excess-kurtosis cut above which heavy tails are flagged.
target_ess 400 Target effective sample size used for cost extrapolation.

Notes

  • The defaults were calibrated in sub-phase RG.1.c (session B9.23) against a synthetic suite over a difficulty × pilot-budget × replica grid. Thresholds were proposed by a data-driven Youden cut, regularised to interpretable values on a calibration fold, and validated out-of-sample on a held-out fold (held-out balanced accuracy rose from 0.60 to 0.89).
  • Versus the initial hand-set values, several cuts were tightened to compensate for signal attenuation in the short pilots: condition_high 50 → 12, heavy_kurtosis_high 3 → 1.8, boundary_prox_high 0.10 → 0.02, nslope_grows 0.30 → 0.80, funnel_ebfmi_low 0.25 → 0.35, heavy_cond_max 8 → 25, flat_var_high 1000 → 600, multimodal_high 2 → 2.5.
  • The calibration is against an idealised synthetic suite and is not claimed to be optimal on real posteriors. The funnel and heavy-tail classes remain mutually confusable (per-class recall ≈ 0.6–0.7).
  • No side effects; no errors raised; pure data factory.

gdpar_geometry_diagnostic(target, n_grid = NULL, difficulty = NULL, pilot_warmup = 150L, pilot_sampling = 150L, chains = 4L, adapt_delta = 0.8, max_treedepth = 10L, seed = 20260602L, thresholds = NULL, verbose = TRUE, ...)

Purpose

Top-level exported entry point of the geometry-diagnostic subsystem. It is an opt-in forensic probe: it runs a sequence of short pilot Stan fits at one or more data sizes $n$, extracts sampler- and posterior-level "signals" from each pilot, fits a difficulty curve across $n$, classifies the posterior geometry into a pathology/remedy/level triple, estimates a computational cost, and assembles everything into an S3 object of class gdpar_geometry_diagnostic. It does not modify any fit.

Arguments

Argument Type Meaning
target various (see .gdpar_geom_normalize_target) The posterior target to probe: a gdpar_geometry_target suite object, a list(type = "gdpar", formula, amm, data) gdpar specification, a list(stan_code, stan_data[, data_n_fn, bounds]) raw Stan target, or a list(model, data[, data_n_fn, bounds]) with a compiled CmdStanModel.
n_grid numeric / NULL Grid of data sizes at which to run pilots. If NULL, the default grid supplied by the normalised target is used. Coerced to numeric, then sort-ed and de-duplicated via unique.
difficulty various / NULL Difficulty level forwarded to a suite target's make(n, diff). Ignored for non-suite targets.
pilot_warmup integer (count) Warmup iterations per pilot fit. Asserted strictly positive integer via assert_count.
pilot_sampling integer (count) Sampling (post-warmup) iterations per pilot fit. Asserted via assert_count.
chains integer (count) Number of Markov chains per pilot. Asserted via assert_count.
adapt_delta numeric scalar in $(0,1)$ Stan adapt_delta. Asserted via assert_numeric_scalar with lower = 0, upper = 1.
max_treedepth integer (count) Stan max_treedepth. Asserted via assert_count.
seed integer (count) Base RNG seed. Pilot $i$ receives seed seed + i. Asserted via assert_count.
thresholds list / NULL Classification thresholds. If NULL, defaults are obtained from gdpar_geometry_thresholds().
verbose logical scalar If TRUE, emits an informational gdpar_optin_message describing the probe before running. Manually validated (must be length-1 logical).
... forwarded Extra arguments passed to .gdpar_geom_run_pilot (and thence to the Stan sampler / target make).

Mathematics

For each element $n_i$ of the sorted, de-duplicated grid $\{n_1, \dots, n_K\}$, a pilot is run with seed $s_i = \texttt{seed} + i$:

$$ \text{pilot}_i = \texttt{.gdpar_geom_run_pilot}(\text{norm},; n_i,; \text{controls},; s_i,; \dots) $$

Each pilot yields a signals sub-list; these are stacked row-wise into a data frame $\mathbf{S}$ with one row per grid point and columns:

$$ \mathbf{S} = \begin{bmatrix} n & \text{divergent_rate} & \text{ebfmi_min} & \text{treedepth_sat_rate} & \text{condition_number} & \text{step_scale_ratio} & \lambda_{\max,\text{cov}} & \text{mean_leapfrog} & \text{multimodality} & \text{heavy_kurtosis} & \text{boundary_proximity} & \text{failed} \end{bmatrix} $$

A difficulty curve is then estimated:

$$ \text{ncurve} = \texttt{.gdpar_geom_difficulty_curve}(\mathbf{S},; \text{thresholds}) $$

Classification uses the largest pilot's signals (pilot_max = pilots[[K]]), except that multimodality — being $n$-invariant — is aggregated across all pilots by taking the maximum (with na.rm = TRUE):

$$ \text{multimodality}_{\text{classify}} = \max_{i=1}^{K} \mathbf{S}_i[\text{multimodality}] $$

The classification and cost are then:

$$ \text{cls} = \texttt{.gdpar_geom_classify}(\text{sig_classify},; \text{ncurve},; \text{culprit},; \text{thresholds}) $$

$$ \text{cost} = \texttt{.gdpar_geom_cost}(\text{pilot_max},; \text{ncurve}) $$

Returns

A list with S3 class c("gdpar_geometry_diagnostic", "list") and elements:

Element Content
pathology Classified pathology string (from cls$pathology).
confidence Classification confidence (from cls$confidence).
recommended_geometry Recommended remedy (from cls$remedy).
geometry_level Recommended sampler level (from cls$level).
signals Data frame $\mathbf{S}$ (one row per grid point).
difficulty_curve Result of .gdpar_geom_difficulty_curve.
culprit Culprit information from the largest pilot (pilot_max$culprit).
cost Result of .gdpar_geom_cost.
rule_trace Trace from the classifier (cls$trace).
reproducibility A list: seed, n_grid, controls (warmup/sampling/chains/adapt_delta/max_treedepth), target_id (norm$meta$id), gdpar_version (utils::packageVersion("gdpar") as character), cmdstan_version (cmdstanr::cmdstan_version() wrapped in tryCatch, yielding NA_character_ on error).
ground_truth Present only if norm$meta$ground_truth is non-NULL; copied verbatim.
correct Present only if ground truth exists; identical(cls$pathology, norm$meta$ground_truth$pathology).

Notes

  • Input validation order: assert_count is called on pilot_warmup, pilot_sampling, chains, max_treedepth, and seed; assert_numeric_scalar on adapt_delta (bounds $(0,1)$); verbose is checked manually (length-1 logical, else gdpar_abort with class gdpar_input_error).
  • Suggested-package gating: require_suggested("cmdstanr", ...) and require_suggested("posterior", ...) are called before any work begins.
  • Threshold defaulting: when thresholds is NULL, gdpar_geometry_thresholds() is called exactly once.
  • Grid coercion: n_grid (whether user-supplied or defaulted) is passed through sort(unique(as.numeric(n_grid))), so non-numeric input will be coerced (and may error inside as.numeric), duplicates are removed, and order is ascending.
  • Verbose message: the informational message reports length(n_grid) pilot fits, length(n_grid) sizes, chains chains, and pilot_warmup + pilot_sampling iterations. It is emitted with class gdpar_optin_message via gdpar_inform.
  • Multimodality aggregation: suppressWarnings(max(signals$multimodality, na.rm = TRUE)) is used; if all values are NA, this yields -Inf (with the warning suppressed).
  • Ground-truth comparison: correct uses identical, so the classified pathology string must match the ground-truth string exactly (including type and encoding).
  • No side effects beyond the optional informational message and the Stan fits launched by .gdpar_geom_run_pilot.

.gdpar_geom_normalize_target(target, difficulty)

Purpose

Internal three-way (effectively four-way) adapter that normalises the heterogeneous target argument into a uniform list consumed by the pilot runner. It inspects the class and structure of target and dispatches to one of four branches: suite target, gdpar specification, raw Stan code, or compiled CmdStanModel.

Arguments

Argument Type Meaning
target various The user-supplied target. See the four recognised forms below.
difficulty various / NULL Difficulty level; only consumed by the suite-target branch.

Mathematics

For the gdpar-specification branch, the default grid is three logarithmically spaced points between a lower bound and the full data size:

$$ n_{\text{grid,default}} = \mathrm{unique}!\left(\mathrm{round}!\left(\exp!\left(\mathrm{seq}!\left(\log L,; \log n_{\text{full}},; \text{length.out} = 3\right)\right)\right)\right), \quad L = \max!\left(30,; \left\lceil \frac{n_{\text{full}}}{4} \right\rceil\right) $$

where $n_{\text{full}} = \texttt{nrow(target\$data)}$.

Returns

A normalised list whose common fields are kind (a string tag), n_grid_default (numeric vector of default grid sizes), and meta (a list with id, bounds, ground_truth). Branch-specific additional fields:

Branch kind n_grid_default meta Extra fields
inherits(target, "gdpar_geometry_target") "suite" target$n_grid id = target$id, bounds = target$bounds, ground_truth = list(pathology, geometry_remedy, culprit, difficulty_scales_with_n) extracted from the target make = function(n) target$make(n, diff) where `diff = difficulty %
is.list(target) && identical(target$type, "gdpar") "gdpar" Log-spaced 3-point grid (formula above) id = "gdpar_spec", bounds = NULL, ground_truth = NULL gdpar_target = target, n_full = nrow(target$data)
is.list(target) && !is.null(target$stan_code) "stan" 1 id = "stan_target", `bounds = target$bounds %
is.list(target) && (inherits(target$model, "CmdStanModel") || !is.null(target$model)) "model" 1 id = "cmdstan_model", `bounds = target$bounds %

Notes

  • Branch order matters: the suite-target branch is tested first (via inherits), then the gdpar-specification branch (via identical(target$type, "gdpar")), then the stan_code branch, then the model branch. A list that satisfies multiple conditions is dispatched to the earliest matching branch.
  • gdpar-specification validation: if target$data or target$formula is NULL, the function aborts with class gdpar_input_error and message "A gdpar target must supply at least 'formula' and 'data'.".
  • Model-branch guard: the condition inherits(target$model, "CmdStanModel") || !is.null(target$model) is logically equivalent to !is.null(target$model) (since a non-NULL value satisfies the second disjunct regardless of class). The inherits check is therefore redundant but harmless.
  • Unrecognised target: if none of the four branches matches, the function calls gdpar_abort with class gdpar_input_error and a message listing the four accepted forms.
  • %||% operator: used for bounds and data_n_fn defaults; the comment notes it is "defined canonically in R/preflight_multi.R".
  • difficulty for non-suite branches: silently ignored.
  • ground_truth for non-suite branches: always NULL, meaning the returned diagnostic object will not carry ground_truth or correct fields.
  • data_n_fn default: for the stan and model branches, if target$data_n_fn is not supplied, a closure is created that ignores its argument n and returns the fixed target$stan_data (or target$data). This means the raw-Stan and compiled-model branches are effectively single-$n$ targets (n_grid_default = 1) unless the user supplies a custom data_n_fn.
  • No side effects; pure transformation (the make closure captures target and diff but is not invoked here).

.gdpar_geom_run_pilot(norm, n_knob, controls, seed, ...)

Purpose

Top-level pilot runner for the geometry diagnostic. It attempts to draw a posterior sample at a given knob size n_knob, guards against sampling failures, and—on success—delegates to signal extraction and culprit localization. On failure it returns a sentinel structure so downstream code can proceed uniformly.

Arguments

Argument Type Meaning
norm list A "norm" descriptor of the target model. Must contain $kind (character), $meta$bounds (parameter bounds), and kind-specific fields consumed by .gdpar_geom_sample.
n_knob numeric / integer The knob value (typically data size) at which to run the pilot.
controls list Sampling control parameters. Must include $max_treedepth; other fields ($chains, $warmup, $sampling, $adapt_delta) are forwarded to the sampler.
seed integer Random seed passed to the sampler.
... any Extra arguments forwarded to .gdpar_geom_sample and ultimately to the underlying sampler or gdpar().

Returns

A list with elements:

Element Type Description
failed logical TRUE if sampling raised an error; FALSE otherwise.
signals list Diagnostic signal list (real signals on success, all-NA via .gdpar_geom_na_signals() on failure).
culprit list Culprit localization (real on success, empty via .gdpar_geom_empty_culprit() on failure).
elapsed numeric Elapsed wall-clock seconds (present only when failed = FALSE).

Notes

  • Sampling errors are caught with tryCatch. On error, gdpar_warn() is called with class "gdpar_diagnostic_warning" and a message of the form "Geometry pilot at n = <n_knob> failed: <message>.".
  • The format(n_knob) call in the warning message uses base R's generic format, so the rendered string depends on the class of n_knob.
  • .gdpar_geom_empty_culprit() is called from another section; it is invoked here but not defined here.

.gdpar_geom_sample(norm, n_knob, controls, seed, ...)

Purpose

Dispatches posterior sampling according to norm$kind. Supports four model kinds: "gdpar" (delegates to .gdpar_geom_sample_gdpar), "suite" (builds a Stan instance from a generator), "stan" (uses pre-supplied Stan code and a data-size function), and a fallback "model" kind (uses a pre-compiled cmdstanr model object).

Arguments

Argument Type Meaning
norm list Model descriptor. Fields used depend on norm$kind (see below).
n_knob numeric / integer Knob value passed to the data-generating function or used for subsetting.
controls list Must contain $chains, $warmup, $sampling, $adapt_delta, $max_treedepth.
seed integer Seed for the sampler.
... any Extra arguments forwarded to mod$sample() or gdpar().

Returns

The return value of .gdpar_geom_pack_fit(cs_fit, elapsed): a list with cs_fit, pm, chain_id, and elapsed.

Notes

  • norm$kind == "gdpar": immediately returns .gdpar_geom_sample_gdpar(...).
  • norm$kind == "suite": calls norm$make(n_knob) to obtain a list with $stan_code and $stan_data; compiles via .gdpar_geom_compile(code).
  • norm$kind == "stan": uses norm$stan_code directly and calls norm$data_n_fn(n_knob) for data; compiles via .gdpar_geom_compile(code).
  • Fallback (e.g. "model"): uses norm$model (a pre-compiled cmdstanr model) and norm$data_n_fn(n_knob) for data. No compilation step.
  • The mod$sample() call sets refresh = 0, show_messages = FALSE, show_exceptions = FALSE, and parallel_chains = controls$chains (i.e., one parallel chain per chain).
  • Elapsed time is measured with Sys.time() / difftime(..., units = "secs") and coerced to numeric.

.gdpar_geom_sample_gdpar(norm, n_knob, controls, seed, ...)

Purpose

Specialized sampler for norm$kind == "gdpar". Subsets the full gdpar target dataset to n_knob rows (randomly), assembles an argument list from the stored target, and invokes gdpar() via do.call.

Arguments

Argument Type Meaning
norm list Must contain $gdpar_target (a list of arguments for gdpar(), including $data) and $n_full (integer, total number of data rows).
n_knob numeric / integer Desired data subset size.
controls list Sampling controls ($warmup, $sampling, $chains, $adapt_delta, $max_treedepth).
seed integer Seed for gdpar().
... any Extra arguments merged into the gdpar() call.

Mathematics

Row indices are drawn uniformly without replacement:

$$ \mathcal{I} \sim \text{UniformSubsample}\bigl({1, \ldots, n_{\text{full}}},; \min(n_{\text{knob}},, n_{\text{full}})\bigr) $$

and the data is subsetted as $\text{sub} = \text{data}[\mathcal{I}, ]$.

Returns

The return value of .gdpar_geom_pack_fit(fit$fit, elapsed), where fit is the object returned by gdpar().

Notes

  • The argument list is built by copying norm$gdpar_target and then overriding: type is set to NULL (removed), data is replaced by the subset, and all control/seed fields are set from controls/seed.
  • skip_id_check = TRUE and verbose = FALSE are hardcoded into the args, suppressing gdpar's internal identifier validation and verbose output.
  • refresh is set to 0L.
  • Extra arguments in ... are appended via c(args, list(...)) and thus override any同名 fields already in args.

.gdpar_geom_pack_fit(cs_fit, elapsed)

Purpose

Normalizes a cmdstanr fit object into a compact list containing the posterior draws matrix (excluding lp__), per-draw chain IDs, and elapsed time. This standardized structure is consumed by signal extraction and culprit localization.

Arguments

Argument Type Meaning
cs_fit CmdStanMCMC A cmdstanr fit object returned by mod$sample() or gdpar()$fit.
elapsed numeric Wall-clock seconds elapsed during sampling.

Returns

A list:

Element Type Description
cs_fit CmdStanMCMC The original fit object (retained for later diagnostic queries).
pm numeric matrix Posterior draws matrix; rows = draws, columns = parameters (excluding lp__).
chain_id integer vector Chain identifier for each draw (from posterior::as_draws_df()$.chain``).
elapsed numeric Passed through unchanged.

Notes

  • lp__ is explicitly removed from the variable set via setdiff(vars, "lp__").
  • posterior::as_draws_matrix() followed by as.matrix() ensures pm is a plain numeric matrix.
  • The chain ID extraction relies on the .chain column produced by posterior::as_draws_df().

.gdpar_geom_compile(stan_code)

Purpose

Writes Stan source code to a file and compiles it into a cmdstanr model object. Relies on cmdstanr's file-hash caching so that repeated diagnostic calls with identical code do not recompile.

Arguments

Argument Type Meaning
stan_code character (scalar) Stan model source code as a string.

Returns

A CmdStanModel object.

Notes

  • Uses cmdstanr::write_stan_file(stan_code) to materialize the code to disk (typically a temp file with content hashing).
  • Then cmdstanr::cmdstan_model(f) compiles (or retrieves the cached compilation).
  • No explicit error handling; compilation failures propagate to the caller.

.gdpar_geom_extract_signals(cs_fit, pm, chain_id, bounds, max_treedepth, elapsed)

Purpose

Extracts a comprehensive set of size-invariant geometric diagnostic signals from a single pilot fit. These signals characterize sampler health (divergences, tree-depth saturation, EBFMI), posterior geometry (condition number, step-scale ratio, multimodality, heavy kurtosis, boundary proximity), and computational cost (leapfrog count, elapsed time).

Arguments

Argument Type Meaning
cs_fit CmdStanMCMC The cmdstanr fit object.
pm numeric matrix Posterior draws matrix (rows = draws, cols = parameters, lp__ excluded).
chain_id integer vector Chain ID per draw.
bounds list / matrix Parameter bounds, forwarded to .gdpar_geom_boundary_prox().
max_treedepth integer Maximum NUTS tree depth; used to compute treedepth saturation rate.
elapsed numeric Elapsed wall-clock seconds.

Mathematics

Let $N$ be the total number of post-warmup draws and $d$ the number of parameters.

Sampler diagnostics:

$$ \text{divergent_rate} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\text{divergent}_i = 1] $$

$$ \text{treedepth_sat_rate} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\text{treedepth}_i \geq \text{max_treedepth}] $$

$$ \text{ebfmi_min} = \min_{c} \text{EBFMI}_c $$

$$ \text{mean_leapfrog} = \frac{1}{N}\sum_{i=1}^{N} n_{\text{leapfrog},i}, \qquad \text{step_size} = \frac{1}{N}\sum_{i=1}^{N} \text{stepsize}_i $$

Covariance-based geometry:

$$ \Sigma = \text{Cov}(\text{pm}), \qquad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d ;\text{(eigenvalues of } \Sigma \text{)} $$

$$ \lambda_{\max} = \lambda_1, \qquad \lambda_{\min} = \max(\lambda_d,, 0) $$

$$ \kappa = \frac{\lambda_{\max}}{\max(\lambda_{\min},, \varepsilon)}, \qquad \varepsilon = \texttt{.Machine$double.eps} $$

$$ \sigma_j = \sqrt{\max(\Sigma_{jj},, \varepsilon)}, \qquad \text{step_scale_ratio} = \frac{\text{step_size}}{\min_j \sigma_j} $$

Returns

A named list:

Element Type Description
divergent_rate numeric Fraction of divergent transitions.
ebfmi_min numeric Minimum EBFMI across chains.
treedepth_sat_rate numeric Fraction of draws hitting max_treedepth.
condition_number numeric $\kappa$ as above; NA_real_ if eigen-decomposition failed.
lambda_max_cov numeric Largest eigenvalue of the posterior covariance.
step_scale_ratio numeric Mean step size divided by smallest marginal SD.
mean_leapfrog numeric Mean number of leapfrog steps per draw.
multimodality numeric From .gdpar_geom_multimodality().
heavy_kurtosis numeric From .gdpar_geom_heavy_kurtosis() (defined in another section).
boundary_proximity numeric From .gdpar_geom_boundary_prox() (defined in another section).
elapsed numeric Passed through.
n_sampling integer Total number of post-warmup draws ($N$).

Notes

  • Sampler diagnostics are retrieved via cs_fit$sampler_diagnostics(format = "draws_df") and cs_fit$diagnostic_summary(quiet = TRUE).
  • ebfmi_min is wrapped in suppressWarnings(min(ds$ebfmi)).
  • Eigen-decomposition uses eigen(cv, symmetric = TRUE, only.values = TRUE) inside a tryCatch that returns NA_real_ on error. If all eigenvalues are NA, both lambda_max and lambda_min are set to NA_real_, and condition_number is NA_real_.
  • lambda_min is clamped to be non-negative via max(min(ev), 0).
  • The condition number denominator uses max(lambda_min, .Machine$double.eps) to avoid division by zero.
  • Marginal SDs use sqrt(pmax(diag(cv), .Machine$double.eps)) to avoid zero or negative variances from numerical issues.

.gdpar_geom_na_signals()

Purpose

Returns a template signal list with every field set to NA_real_. Used by .gdpar_geom_run_pilot when sampling fails, so that downstream consumers receive a structurally complete (but missing-valued) signal object.

Arguments

None.

Returns

A named list with all twelve signal fields set to NA_real_:

divergent_rate, ebfmi_min, treedepth_sat_rate, condition_number, lambda_max_cov, step_scale_ratio, mean_leapfrog, multimodality, heavy_kurtosis, boundary_proximity, elapsed, n_sampling.

Notes

  • The field names and order exactly mirror those produced by .gdpar_geom_extract_signals().
  • All values are explicitly NA_real_ (typed missing), not NA or NA_integer_.

.gdpar_geom_multimodality(pm, chain_id)

Purpose

Computes a between-chain mean-separation signal relative to within-chain spread. Chains that settle in different modes produce large between-chain mean differences, inflating this metric. The design is intended to be robust to short pilot runs.

Arguments

Argument Type Meaning
pm numeric matrix Posterior draws matrix; rows = draws, columns = parameters.
chain_id integer vector Chain identifier for each row of pm. Length must equal nrow(pm).

Mathematics

Let $C$ be the number of unique chains and $d$ the number of columns (parameters). For each coordinate $j \in \{1, \ldots, d\}$:

$$ \bar{x}_j^{(c)} = \frac{1}{n_c}\sum_{i:,\text{chain}_i = c} \text{pm}_{ij}, \qquad s_j^{(c)} = \text{sd}\bigl({\text{pm}_{ij} : \text{chain}_i = c}\bigr) $$

$$ \bar{s}_j = \frac{1}{C}\sum_{c=1}^{C} s_j^{(c)} \quad (\text{with } \texttt{na.rm=TRUE}) $$

If $\bar{s}_j$ is not finite or $\bar{s}_j \leq 0$, the score for coordinate $j$ is $0$. Otherwise:

$$ m_j = \frac{\text{sd}\bigl(\bar{x}_j^{(1)}, \ldots, \bar{x}_j^{(C)}\bigr)}{\bar{s}_j} $$

The returned multimodality signal is:

$$ M = \max_{j} m_j \quad (\text{with } \texttt{na.rm=TRUE}) $$

Returns

A single numeric value. Returns 0 if fewer than 2 unique chains are present.

Notes

  • Early return of 0 (not NA) when length(unique(chain_id)) < 2.
  • Per-chain means and SDs are computed with vapply over the unique chain set; the order of chains is determined by unique(chain_id).
  • Within-chain spread mean(csds, na.rm = TRUE) averages the per-chain standard deviations (not a pooled variance).
  • The guard !is.finite(within) || within <= 0 returns 0 for that coordinate, preventing division by zero or non-finite values.
  • The final max(scores, na.rm = TRUE) tolerates NA scores from individual coordinates.

Geometry Diagnostic — Culprit Localisation, Difficulty Curve, and Rule-Based Classifier

.gdpar_geom_heavy_kurtosis(pm)

Purpose

Computes a scalar summary of tail heaviness across all parameter marginals in a posterior matrix. Used by the classifier as the heavy_kurtosis signal to detect funnel-neck or heavy-tail pathologies.

Arguments

Argument Type Meaning
pm numeric matrix Posterior draws matrix with rows = draws, columns = parameters.

Mathematics

For each column $j$ of pm, let $x = \mathrm{pm}[:,\,j]$, $\bar{x} = \frac{1}{N}\sum_i x_i$, and $\sigma^2 = \frac{1}{N}\sum_i (x_i - \bar{x})^2$ (population variance). If $\sigma^2 \le 0$, the column contributes $0$. Otherwise the excess kurtosis is

$$ \kappa_j = \frac{\frac{1}{N}\sum_i (x_i - \bar{x})^4}{\sigma^4} - 3. $$

The returned value is

$$ \max_j \kappa_j \quad (\text{with } \texttt{na.rm=TRUE}). $$

Returns

A single numeric scalar — the maximum excess kurtosis over all parameter columns.

Notes

  • Uses vapply with numeric(1), so type stability is guaranteed.
  • na.rm = TRUE in the final max means columns that produced NA (e.g. all-constant columns returning 0, or columns with NaN) do not propagate.
  • The variance estimator is the biased (population) estimator, not the sample ($N-1$) estimator, so kurtosis values are slightly inflated relative to the unbiased form for small $N$.

.gdpar_geom_boundary_prox(cs_fit, bounds, eps = 0.01)

Purpose

Measures how tightly the posterior draws of bounded parameters pile against their declared lower or upper bound. Returns the worst-case (maximum) proximity fraction across all bounded parameters, used as the boundary_proximity signal.

Arguments

Argument Type Meaning
cs_fit CmdStanFit-like object A fitted model object exposing a draws() method compatible with the posterior package.
bounds named list of length-2 numeric vectors, or NULL Each element is c(lo, hi) giving the declared bound for the parameter named by the list element.
eps numeric scalar (default 0.01) Relative tolerance: a draw is "near" a bound if it lies within eps * (hi - lo) of that bound.

Mathematics

For each bounded parameter $n$ with bounds $[\ell_n, u_n]$ and range $r_n = u_n - \ell_n$, the proximity fractions are

$$ f_{\ell,n} = \frac{1}{N}\sum_i \mathbf{1}!\left[x_{i,n} \le \ell_n + \varepsilon, r_n\right], \qquad f_{u,n} = \frac{1}{N}\sum_i \mathbf{1}!\left[x_{i,n} \ge u_n - \varepsilon, r_n\right]. $$

The per-parameter score is $p_n = \max(f_{\ell,n},\, f_{u,n})$, and the returned value is $\max_n p_n$ (with na.rm = TRUE).

Returns

A numeric scalar in $[0, 1]$. Returns 0 if bounds is NULL or empty, or if no bounded parameter name matches a variable in the draws.

Notes

  • Variables are retrieved via posterior::variables(draws_arr). Parameters in bounds that are absent from the draws silently contribute 0.
  • posterior::subset_draws and posterior::as_draws_matrix are used to extract a single variable; the result is coerced with as.numeric.
  • If hi - lo is zero or negative, rng is non-positive and the comparisons x <= lo + eps * rng / x >= hi - eps * rng still execute but the geometric meaning is undefined.

.gdpar_geom_empty_culprit()

Purpose

Factory function returning an empty culprit-localisation data frame with the correct schema, used as a sentinel when no culprit is identified.

Arguments

None.

Returns

A data.frame with zero rows and three columns:

Column Type
parameter character
mechanism character
score numeric

Constructed with stringsAsFactors = FALSE.

Notes

  • No side effects. Pure constructor.

.gdpar_geom_localize(cs_fit, pm, chain_id, bounds)

Purpose

Localises the geometric "culprit" parameters responsible for a detected pathology by combining three independent mechanisms: (a) flat/anisotropic directions from the covariance spectrum, (b) divergence-neck parameters that separate divergent from non-divergent draws, and (c) boundary-piling of bounded parameters.

Arguments

Argument Type Meaning
cs_fit CmdStanFit-like object Provides sampler_diagnostics(format = "draws_df") for divergence information and draws() (via .gdpar_geom_boundary_prox).
pm numeric matrix Posterior draws matrix (rows = draws, columns = parameters). Column names are used as parameter names.
chain_id (any) Accepted but never used in the function body.
bounds named list of length-2 numeric vectors, or NULL Declared parameter bounds, as in .gdpar_geom_boundary_prox.

Mathematics

(a) Flat / anisotropic direction. Let $\Sigma = \mathrm{cov}(\mathrm{pm})$ with eigendecomposition $\Sigma = V \Lambda V^\top$ (computed via eigen(..., symmetric = TRUE)). Let $v_1$ be the first eigenvector (largest eigenvalue $\lambda_1$). The loading magnitude is $|v_{1,k}|$ for parameter $k$. The top $\min(2, p)$ parameters by loading receive score $|v_{1,k}| \cdot \lambda_1$.

(b) Divergence neck. Let $D_i \in \{0, 1\}$ indicate whether draw $i$ is divergent. For each parameter $j$ with finite, positive sample standard deviation $s_j$, compute the standardised mean shift

$$ \Delta_j = \frac{\left|,\overline{x_j^{\mathrm{div}}} - \overline{x_j^{\mathrm{nondiv}}},\right|}{s_j}. $$

Parameters with $\Delta_j &gt; 0.5$ are flagged with mechanism "divergence_neck" and score $\Delta_j$.

(c) Boundary pile. For each bounded parameter, .gdpar_geom_boundary_prox is called with a single-element bounds list. Parameters with proximity $&gt; 0.05$ are flagged with mechanism "boundary_pile".

Returns

A data.frame with columns parameter (character), mechanism (character), score (numeric), sorted by score in decreasing order, with rownames set to NULL. If no culprit rows are produced, returns .gdpar_geom_empty_culprit().

Mechanism labels:

Mechanism string Source
"flat_or_anisotropic_direction" Covariance eigenvector (a)
"divergence_neck" Divergence separation (b)
"boundary_pile" Boundary proximity (c)

Notes

  • The eigendecomposition is wrapped in tryCatch; if it fails, mechanism (a) is silently skipped.
  • Mechanism (b) is only evaluated when any(div) && !all(div) — i.e., at least one but not all draws are divergent. If all draws diverge, the mechanism is skipped entirely.
  • In mechanism (b), columns with non-finite or non-positive sd(x) are skipped via next.
  • In mechanism (c), .gdpar_geom_boundary_prox is called once per bounded parameter with a single-element sublist bounds[nm], so the eps default of 0.01 is always used.
  • chain_id is a dead parameter — it is accepted for interface consistency but has no effect on the output.

.gdpar_geom_difficulty_curve(signals, thresholds)

Purpose

Fits a log–log regression of condition number against data size $n$ to determine whether geometric difficulty grows with $n$. Produces the grows_with_n flag consumed by the classifier.

Arguments

Argument Type Meaning
signals data.frame Must contain columns failed (logical), condition_number (numeric), and n (numeric). Typically a stacked table of signals across multiple fit sizes.
thresholds named list Must contain nslope_grows (numeric): the slope threshold above which difficulty is declared to grow with $n$.

Mathematics

Rows are filtered to those with !failed, finite condition_number, and finite n. Let the filtered set be $\{(n_k, \kappa_k)\}$. If fewer than 2 rows or fewer than 2 distinct $n$ values remain, the result is NA. Otherwise, an ordinary least-squares fit is performed:

$$ \log \kappa = \alpha + \beta \log n + \varepsilon. $$

The slope $\beta$ is extracted as coef(fit)[2]. The flag grows_with_n is TRUE iff $\beta &gt;$ thresholds$nslope_grows.

Returns

A named list:

Element Type Meaning
slope numeric The fitted exponent $\beta$, or NA_real_ if insufficient data.
grows_with_n logical or NA TRUE if slope exceeds threshold; NA if insufficient data.
note character or NULL "need at least two distinct sizes" if insufficient data; NULL otherwise.

Notes

  • The filtering predicate uses is.finite on both condition_number and n, so NA, NaN, Inf, and -Inf rows are excluded.
  • length(unique(s$n)) < 2L checks for at least two distinct sizes; duplicate sizes are allowed in the regression.
  • stats::lm is used without error handling; if the regression matrix is degenerate (e.g., all identical n after filtering, which is already guarded), lm would still execute but the guard prevents this.
  • unname is applied to the slope coefficient to strip the "log(n)" name.

.gdpar_geom_remedy_for(pathology)

Purpose

Maps a pathology label to a recommended remedy (metric / sampler strategy) and a difficulty level. Called by .gdpar_geom_classify via its inner decide closure.

Arguments

Argument Type Meaning
pathology character scalar One of the recognised pathology labels (see table below).

Returns

A named list with two elements:

Element Type Meaning
remedy character The recommended remedy string.
level integer A difficulty/severity level (can be NA_integer_).

Mapping table:

pathology remedy level
"isotropic" "euclidean_diagonal" 0L
"anisotropic" "euclidean_dense" 1L
"funnel" "riemannian" 3L
"heavy_tails" "finsler_relativistic" 4L
"quasi_deterministic" "sub_riemannian" 5L
"multimodal" "tempering" 6L
"boundary" "boundary_reparam" 6L
"flat_direction" "reparam_eliminate" -1L
(any other) "unknown" NA_integer_

Notes

  • Implemented via switch with a default fallback, so unrecognised inputs never error.
  • The level values are not contiguous (level 2 is unused), suggesting an ordinal scale where some slots are reserved or deprecated.

.gdpar_geom_classify(sig, ncurve, culprit, th)

Purpose

The central rule-based classifier that consumes all geometric signals, the difficulty-vs-$n$ curve, and (nominally) the culprit localisation, then emits a pathology label, confidence score, remedy, and a trace of which rules fired.

Arguments

Argument Type Meaning
sig named list Signal values. Expected elements: heavy_kurtosis, condition_number, boundary_proximity, multimodality, lambda_max_cov, ebfmi_min, divergent_rate, step_scale_ratio.
ncurve named list Output of .gdpar_geom_difficulty_curve; must contain grows_with_n.
culprit (any) Accepted but never used in the function body.
th named list Threshold values. Expected elements: heavy_kurtosis_high, condition_high, boundary_prox_high, multimodal_high, flat_var_high, funnel_ebfmi_low, divergent_rate_high, step_scale_ratio_low, heavy_cond_max, nslope_grows (the latter is used only indirectly via ncurve).

Mathematics

Two boolean predicates are precomputed:

$$ \texttt{kurt_high} = [\texttt{sig$heavy_kurtosis} \ge \texttt{th$heavy_kurtosis_high}], \qquad \texttt{cond_high} = [\texttt{sig$condition_number} \ge \texttt{th$condition_high}]. $$

The confidence mapping function sq(x, ref) maps a signal value $x$ relative to its threshold $\rho$ to a confidence in $[0, 1]$:

$$ \mathrm{sq}(x, \rho) = \begin{cases} 0.5 & \text{if } x \text{ or } \rho \text{ is non-finite, or } \rho \le 0, \\ \min!\left(1,; \max!\left(0,; \frac{x}{2\rho} + 0.25\right)\right) & \text{otherwise.} \end{cases} $$

This yields confidence $0.25$ when $x = 0$, $0.75$ when $x = \rho$ (exactly at threshold), and $1.0$ when $x \ge 1.5\rho$.

The classification rules are evaluated in strict priority order (first match wins):

Priority Pathology Condition Confidence
1 boundary sig$boundary_proximity >= th$boundary_prox_high sq(boundary_proximity, boundary_prox_high)
2 multimodal sig$multimodality >= th$multimodal_high sq(multimodality, multimodal_high)
3 flat_direction sig$lambda_max_cov >= th$flat_var_high AND !grows sq(lambda_max_cov, flat_var_high)
4 quasi_deterministic cond_high AND grows sq(condition_number, condition_high)
5 anisotropic cond_high AND !kurt_high sq(condition_number, condition_high)
6 funnel (kurt_high AND sig$ebfmi_min <= th$funnel_ebfmi_low) OR (sig$divergent_rate >= th$divergent_rate_high AND sig$step_scale_ratio <= th$step_scale_ratio_low) sq(heavy_kurtosis, heavy_kurtosis_high) if energy branch; sq(divergent_rate, divergent_rate_high) if divergence branch
7 heavy_tails kurt_high AND sig$condition_number <= th$heavy_cond_max sq(heavy_kurtosis, heavy_kurtosis_high)
8 anisotropic (residual) cond_high (n-curve unknown) 0.4 (fixed)
9 isotropic (default) no threshold exceeded 0.6 (fixed)

Returns

A named list:

Element Type Meaning
pathology character The matched pathology label.
confidence numeric Confidence in $[0, 1]$ (or a fixed value for rules 8–9).
remedy character Remedy string from .gdpar_geom_remedy_for.
level integer or NA_integer_ Difficulty level from .gdpar_geom_remedy_for.
trace character vector Human-readable strings documenting which rule fired and why.

Notes

  • The inner function decide(p, conf) closes over the trace variable in the enclosing scope. Each rule appends a trace string to trace before calling decide, so the returned trace reflects the path taken.
  • culprit is a dead parameter — accepted for interface consistency but never referenced.
  • All threshold comparisons use isTRUE(...) wrappers, which means NA comparisons safely evaluate to FALSE rather than propagating NA. This is critical: if any signal is NA, the corresponding rule is simply skipped.
  • Rule 6 (funnel) has two sub-conditions: funnel_energy (high kurtosis + low E-BFMI) and funnel_div (high divergence rate + low step/scale ratio). If both are true, funnel_energy takes precedence in the trace message and confidence computation because the if (funnel_energy) branch in the trace if-else is evaluated first.
  • Rule 8 catches the case where cond_high is true but grows is NA (n-curve unknown) and kurt_high is false — rules 4, 5, and 6 all fail, but rule 8 fires with a fixed confidence of 0.4.
  • Rule 9 (isotropic) fires with a fixed confidence of 0.6 when no pathology threshold is exceeded. This is higher than rule 8's 0.4, reflecting higher confidence in a clean diagnosis than in a residual one.
  • The sq function's default return of 0.5 for non-finite inputs means that if a signal is NA but the threshold comparison somehow passed (which isTRUE prevents), the confidence would be 0.5.
  • The function does not modify sig, ncurve, culprit, or th; it is side-effect-free.

.gdpar_geom_cost(pilot_max, ncurve)

Purpose
Internal helper for geometry diagnostics. Computes a compact cost and tractability summary from a pilot run’s sampler signals and from the difficulty-vs-sample-size curve.

Arguments

  • pilot_max: list. Expected components:
    • failed: scalar logical or any value tested by isTRUE; TRUE indicates the pilot run failed.
    • signals: list with numeric scalar fields:
      • elapsed: elapsed time for the sampling phase.
      • n_sampling: number of sampling draws.
      • mean_leapfrog: mean leapfrog steps.
      • treedepth_sat_rate: tree-depth saturation rate.
  • ncurve: list. Expected component grows_with_n, tested by isTRUE; indicates whether difficulty grows with sample size.

Mathematics
Let $t = \text{elapsed}$, $n = \text{n\_sampling}$, $s = \text{treedepth\_sat\_rate}$, and $g = \mathrm{isTRUE}(\text{ncurve\$grows\_with\_n})$. If the pilot is not failed and $t$ is finite:

$$ \text{seconds_per_1000_draws} = \frac{t}{\max(n, 1)} \times 1000. $$

Tractability is classified as:

$$ \text{tractability} = \begin{cases} \text{unknown} & \text{if } \mathrm{isTRUE}(\text{failed}) \text{ or } t \text{ is not finite},\\ \text{intractable (escalate geometry or certify limit)} & \text{if } \mathrm{isTRUE}(s \ge 0.5) \text{ and } g,\\ \text{expensive} & \text{if } \mathrm{isTRUE}(s \ge 0.5) \text{ and not } g,\\ \text{tractable} & \text{otherwise}. \end{cases} $$

Returns
A list with elements:

  • seconds_per_1000_draws: numeric scalar, or NA_real_ in the failure branch.
  • mean_leapfrog: sig$mean_leapfrog, or NA_real_ in the failure branch.
  • treedepth_saturation: sig$treedepth_sat_rate, or NA_real_ in the failure branch.
  • tractability: character string; one of "unknown", "tractable", "expensive", or "intractable (escalate geometry or certify limit)".

Notes
The failure branch is triggered when isTRUE(pilot_max$failed) or !is.finite(sig$elapsed). The max(sig$n_sampling, 1) guard protects against zero draws, but not against NA; if n_sampling is NA, per_1000 becomes NA. isTRUE(sat >= 0.5) requires sat to be a length-one numeric/logical value; NA or length not equal to one yields FALSE, causing classification to fall through to less severe branches. Assumes pilot_max$signals exists; if failed is not TRUE and sig$elapsed is absent or NULL, the finite check may error. Internal function; no S3 dispatch.


print.gdpar_geometry_diagnostic(x, ...)

Purpose
S3 print method for objects of class gdpar_geometry_diagnostic. Renders a compact console summary of the diagnosed pathology, recommended geometry, difficulty curve, computational cost, culprit parameters, and optional ground-truth comparison.

Arguments

  • x: object of class gdpar_geometry_diagnostic, expected to be a list with components:
    • pathology: character string.
    • confidence: numeric scalar.
    • recommended_geometry: character string.
    • geometry_level: numeric or character geometry level.
    • difficulty_curve: list with optional slope and grows_with_n.
    • cost: list with seconds_per_1000_draws and tractability.
    • culprit: data frame with column parameter; may have zero rows.
    • correct: NULL, or a scalar logical/NA indicating whether classification was correct.
    • ground_truth: list with pathology, used when correct is non-NULL.
  • ...: unused; present for S3 generic compatibility.

Returns
Invisibly returns x.

Notes
Writes output to the console via cat. Printed lines include:

  • Header: <gdpar_geometry_diagnostic>.
  • Pathology with confidence formatted using digits = 2.
  • Recommended geometry and geometry level.
  • An n-curve line only if x$difficulty_curve$slope is not NULL; slope is formatted with digits = 3, followed by grows_with_n.
  • Cost line with seconds_per_1000_draws formatted with digits = 3 and tractability.
  • Culprit line only if nrow(x$culprit) > 0L; prints up to the first three values of x$culprit$parameter using utils::head(..., 3), comma-separated.
  • Ground-truth line only if !is.null(x$correct); this includes FALSE and NA values, not only TRUE, and accesses x$ground_truth$pathology.

Edge cases: if x$culprit is NULL, nrow(x$culprit) is NULL and the if condition errors due to length zero; the method assumes culprit is a data frame. If x$correct is non-NULL but x$ground_truth is missing, accessing x$ground_truth$pathology will error. The method does not validate the class or required fields. Exported as an S3 method.


← Part IV — Exhaustive Function Reference (3/7) · gdpar Wiki Home · Part IV — Exhaustive Function Reference (5/7) →

Clone this wiki locally