Grammar- and Resource-Aligned Certifiable Speculative Decoding
NeuroGRACE-Spec is a speculative decoding framework that injects grammatical/semantic constraints and resource costs directly into both the target conditional distribution q_λ and the proposal distribution r_λ. Combined with a per-token certifiable accept-residual sampler, the method preserves exact sampling from q_λ while enabling high-throughput block verification with a small proposal model.
- Grammar-constrained sampling: Inject feasibility masks g_i(a | h) to enforce grammatical and semantic constraints
- Cost-aligned distributions: Incorporate resource costs via cost-to-go estimates Ĵ(h) and incremental costs ∆_i(a)
- Certifiable accept-residual sampling: Per-token verification that preserves exact sampling from target distribution
- High-throughput block verification: Propose K tokens, verify with one teacher-forced pass, sample residual on rejection
- Adaptive λ control: Automatically adjust effort/precision weight λ based on cost constraints
- Numerically stable: All operations in log-space to prevent underflow/overflow
Installation:

```bash
pip install -r requirements.txt
```

Quick Start:

```python
from neurograce_spec import (
    TargetDistribution,
    ProposalDistribution,
    BlockVerification,
    CostToGoEstimator,
    GrammarMask
)
# Create your large and small models
large_model = ... # Your large target model
small_model = ... # Your small proposal model
# Create grammar mask and cost estimator
grammar_mask = GrammarMask(vocab, vocab_inv)
cost_estimator = CostToGoEstimator(vocab_size=vocab_size)
# Create distributions
target_dist = TargetDistribution(
    large_model=large_model,
    grammar_mask=grammar_mask,
    cost_estimator=cost_estimator,
    lambda_weight=1.0
)
proposal_dist = ProposalDistribution(
    small_model=small_model,
    grammar_mask=grammar_mask,
    cost_estimator=cost_estimator,
    lambda_weight=1.0
)
# Create block verifier
block_verifier = BlockVerification(
    target_dist=target_dist,
    proposal_dist=proposal_dist,
    block_size=4
)
# Generate sequence
prefix = [token1, token2, ...]
emitted_tokens, metadata = block_verifier.verify_block(prefix, verbose=True)
```

Target Distribution q_λ (Eq. 1):
q_λ(a | h) = P_T(a | h) * g_i(a | h) * e^(-λ * ∆_i(a)) / Σ_b P_T(b | h) * g_i(b | h) * e^(-λ * ∆_i(b))

Proposal Distribution r_λ (Eq. 2):

r_λ(a | h) = P_S(a | h) * g_i(a | h) * e^(-λ * ∆̂_i(a)) / Σ_b P_S(b | h) * g_i(b | h) * e^(-λ * ∆̂_i(b))
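Both distributions share the same reweighting structure, so they can be computed with one helper. The sketch below is illustrative rather than the package's internal implementation (the function name and array conventions are assumptions); it works entirely in log space, which is how the feasibility mask and the e^(-λ∆) factor stay numerically stable.

```python
import numpy as np

def cost_aligned_log_probs(log_p, feasible, delta, lam):
    """Log-probabilities of the constrained, cost-aligned distribution.

    log_p    : (V,) base log-probabilities, e.g. log P_T(· | h) or log P_S(· | h)
    feasible : (V,) boolean grammar/semantic mask g_i(· | h)
    delta    : (V,) incremental costs ∆_i(·) (or estimates ∆̂_i(·))
    lam      : effort/precision weight λ >= 0
    """
    logits = np.where(feasible, log_p - lam * delta, -np.inf)  # mask out infeasible tokens
    m = np.max(logits)
    log_z = m + np.log(np.sum(np.exp(logits - m)))  # stable log-sum-exp normalizer
    return logits - log_z

# Toy vocabulary of 4 tokens; token 2 is grammatically infeasible
log_p = np.log(np.array([0.5, 0.2, 0.2, 0.1]))
feasible = np.array([True, True, False, True])
delta = np.array([0.0, 1.0, 0.0, 2.0])
print(np.exp(cost_aligned_log_probs(log_p, feasible, delta, lam=1.0)))
```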
Block Verification (Algorithm 1):

- Generate K tokens with the small model: y_{1:K} ~ ∏ r_λ(· | h_{t-1})
- Run one teacher-forced pass of the large model to get {q_λ(· | h^(j))} for j = 1...K
- Verify sequentially (a minimal sketch of this loop follows below):
  - Compute the accept probability: α = min(1, q_λ(y_j | h^(j)) / r_λ(y_j | h^(j)))
  - If accepted: emit the token and continue
  - If rejected: sample from the residual q_res and stop the block
- Residual distribution (Eq. 4): q_res(a | h) ∝ q_λ(a | h) - min(q_λ(a | h), r_λ(a | h))
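A minimal sketch of that verification loop, assuming the per-position probability rows have already been gathered from both models (the names and explicit-vector interface are assumptions, not the package's BlockVerification API):

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_block(q_rows, r_rows, draft):
    """Accept-residual verification of one speculative block.

    q_rows : (K, V) target probabilities q_λ(· | h^(j)) from the teacher-forced pass
    r_rows : (K, V) proposal probabilities r_λ(· | h^(j)) used to draft the block
    draft  : proposed token ids y_1..y_K
    """
    emitted = []
    for j, y in enumerate(draft):
        q, r = q_rows[j], r_rows[j]
        if rng.random() < min(1.0, q[y] / r[y]):   # accept with probability α
            emitted.append(int(y))
            continue
        # Rejection: sample from the residual q_res ∝ q - min(q, r), then stop the block
        residual = np.maximum(q - np.minimum(q, r), 0.0)
        emitted.append(int(rng.choice(len(q), p=residual / residual.sum())))
        break
    return emitted
```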
Adaptive λ Control:

Maintains the cost constraint E[C] ≤ τ by updating:

λ ← [λ + η(Ê[C] - τ)]_+
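A sketch of that projected update, treating Ê[C] as a running estimate of the realized cost per block (the step size η and the names below are illustrative assumptions):

```python
def update_lambda(lam, est_cost, tau, eta=0.05):
    """Projected update λ ← [λ + η(Ê[C] - τ)]_+ enforcing E[C] ≤ τ."""
    return max(0.0, lam + eta * (est_cost - tau))

lam = 1.0
for est_cost in [3.2, 2.8, 2.1, 1.9]:   # cost estimates over successive blocks
    lam = update_lambda(lam, est_cost, tau=2.0)
# λ increases while the estimated cost exceeds τ and relaxes once the constraint is met
```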
Project Structure:

```
neurograce_spec/
├── __init__.py # Package exports
├── distributions.py # q_λ and r_λ distributions
├── block_verification.py # Algorithm 1: Block verification
├── cost_estimator.py # Cost-to-go estimation
├── grammar.py # Grammar/semantic masks
├── adaptive_control.py # Adaptive λ control
└── utils.py # Numerical stability utilities
examples/
└── example_usage.py # Example script
requirements.txt # Dependencies
README.md # This file
```
Notation:

- A: Token alphabet
- h ∈ H: Prefix/history state
- P_T(a | h): Large model conditional
- P_S(a | h): Small model conditional
- g_i(a | h) ∈ {0, 1}: Grammar/semantic feasibility mask
- Ĵ(h): Cost-to-go approximation
- ∆_i(a) = Ĵ(next(h, a)) - Ĵ(h): Incremental cost (see the toy sketch after this list)
- λ ≥ 0: Effort/precision weight
- K: Speculative block size
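To make the incremental-cost definition concrete, here is a toy example in which Ĵ is a simple length-based heuristic; the real CostToGoEstimator is a learned approximation, and the closed form below is purely an assumption for illustration.

```python
def j_hat(h):
    """Toy cost-to-go Ĵ(h): remaining effort shrinks as the prefix grows."""
    return max(0.0, 20.0 - len(h))

def incremental_cost(h, a):
    """∆_i(a) = Ĵ(next(h, a)) - Ĵ(h), where next(h, a) appends token a to h."""
    return j_hat(h + [a]) - j_hat(h)

print(incremental_cost([1, 2, 3], 7))   # -1.0: extending the prefix lowers the remaining cost
```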
The accept-residual mechanism ensures that the per-position marginals match q_λ exactly, preserving the target distribution while enabling efficient block verification.
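That exactness claim is easy to sanity-check empirically at a single position: propose from r_λ, accept with probability α, and fall back to the residual on rejection; the emitted marginal should reproduce q_λ. The snippet below is a standalone check on hand-written probability vectors, not part of the package.

```python
import numpy as np

rng = np.random.default_rng(1)
q = np.array([0.6, 0.3, 0.1])   # target q_λ(· | h) at one position
r = np.array([0.3, 0.5, 0.2])   # proposal r_λ(· | h)

residual = np.maximum(q - np.minimum(q, r), 0.0)
residual /= residual.sum()

counts = np.zeros(3)
for _ in range(200_000):
    y = rng.choice(3, p=r)                       # propose
    if rng.random() < min(1.0, q[y] / r[y]):     # accept-residual test
        counts[y] += 1
    else:
        counts[rng.choice(3, p=residual)] += 1

print(counts / counts.sum())   # ≈ [0.6, 0.3, 0.1], matching q_λ
```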
See examples/example_usage.py for a complete working example with dummy models.
```bash
python examples/example_usage.py
```

Test with actual transformer models from Hugging Face:
```bash
python examples/test_real_llm.py
```

This will:
- Load GPT-2 (large) and DistilGPT-2 (small) models
- Generate text using NeuroGRACE-Spec
- Show accept rate and compare with baseline generation
Custom Configuration:
```python
from examples.test_real_llm import test_with_real_models
test_with_real_models(
    large_model_name="gpt2",
    small_model_name="distilgpt2",
    prompt="Your prompt here",
    max_tokens=50,
    block_size=4,
    device="cpu"  # or "cuda" for GPU
)
```

See examples/README.md for more details.
References:

- Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience.
- Lieder, F., & Griffiths, T. L. (2020). Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences.
- Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. ICML.