NeuroGRACE-Spec

Grammar- and Resource-Aligned Certifiable Speculative Decoding

NeuroGRACE-Spec is a speculative decoding framework that injects grammatical/semantic constraints and resource costs directly into both the target conditional distribution q_λ and the proposal distribution r_λ. Combined with a per-token certifiable accept–residual sampler, the method preserves exact sampling from q_λ while enabling high-throughput block verification with a small proposal model.

Features

  • Grammar-constrained sampling: Inject feasibility masks g_i(a | h) to enforce grammatical and semantic constraints
  • Cost-aligned distributions: Incorporate resource costs via cost-to-go estimates Ĵ(h) and incremental costs ∆_i(a)
  • Certifiable accept-residual sampling: Per-token verification that preserves exact sampling from target distribution
  • High-throughput block verification: Propose K tokens, verify with one teacher-forced pass, sample residual on rejection
  • Adaptive λ control: Automatically adjust effort/precision weight λ based on cost constraints
  • Numerically stable: All operations in log-space to prevent underflow/overflow

Installation

pip install -r requirements.txt

Quick Start

from neurograce_spec import (
    TargetDistribution,
    ProposalDistribution,
    BlockVerification,
    CostToGoEstimator,
    GrammarMask
)

# Create your large and small models
large_model = ...  # Your large target model
small_model = ...  # Your small proposal model

# Create grammar mask and cost estimator
grammar_mask = GrammarMask(vocab, vocab_inv)
cost_estimator = CostToGoEstimator(vocab_size=vocab_size)

# Create distributions
target_dist = TargetDistribution(
    large_model=large_model,
    grammar_mask=grammar_mask,
    cost_estimator=cost_estimator,
    lambda_weight=1.0
)

proposal_dist = ProposalDistribution(
    small_model=small_model,
    grammar_mask=grammar_mask,
    cost_estimator=cost_estimator,
    lambda_weight=1.0
)

# Create block verifier
block_verifier = BlockVerification(
    target_dist=target_dist,
    proposal_dist=proposal_dist,
    block_size=4
)

# Generate sequence
prefix = [token1, token2, ...]
emitted_tokens, metadata = block_verifier.verify_block(prefix, verbose=True)

Core Components

Distributions

Target Distribution q_λ (Eq. 1):

q_λ(a | h) = P_T(a | h) * g_i(a | h) * e^(-λ * ∆_i(a))
             / Σ_b P_T(b | h) * g_i(b | h) * e^(-λ * ∆_i(b))

Proposal Distribution r_λ (Eq. 2):

r_λ(a | h) = P_S(a | h) * g_i(a | h) * e^(-λ * ∆̂_i(a))
             / Σ_b P_S(b | h) * g_i(b | h) * e^(-λ * ∆̂_i(b))
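
Both distributions are evaluated in log-space to avoid underflow (see utils.py). As a rough illustration of the computation, the sketch below builds such a constrained, cost-aligned distribution from raw logits; it assumes PyTorch, and the function name and arguments are illustrative rather than the package API.

# Illustrative sketch (not the package API): computing a constrained,
# cost-aligned distribution such as q_λ in log-space.
import torch

def cost_aligned_log_probs(base_logits, feasible_mask, incremental_costs, lam):
    """log q(a | h) ∝ log P(a | h) + log g_i(a | h) - λ * ∆_i(a), renormalized."""
    log_p = torch.log_softmax(base_logits, dim=-1)                 # log P(a | h)
    log_mask = torch.where(feasible_mask.bool(),                   # log g_i: 0 if feasible, -inf otherwise
                           torch.zeros_like(log_p),
                           torch.full_like(log_p, float("-inf")))
    scores = log_p + log_mask - lam * incremental_costs            # unnormalized log weights
    return scores - torch.logsumexp(scores, dim=-1, keepdim=True)  # renormalize in log-space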

Block Verification (Algorithm 1)

  1. Generate K tokens with the small model: y_j ~ r_λ(· | h^(j)) for j = 1...K, where h^(j) is the prefix extended with y_1, ..., y_{j-1}

  2. Run one teacher-forced pass of the large model to get {q_λ(· | h^(j))} for j = 1...K

  3. Verify sequentially (a per-token sketch follows this list):

    • Compute the accept probability α_j = min(1, q_λ(y_j | h^(j)) / r_λ(y_j | h^(j)))
    • If accepted: emit the token and continue
    • If rejected: sample from the residual q_res and stop the block

  4. Residual distribution (Eq. 4):

    q_res(a | h) ∝ q_λ(a | h) - min(q_λ(a | h), r_λ(a | h))
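
The per-token accept-residual step (step 3) can be sketched as follows, assuming vocabulary-sized log-probability tensors log_q and log_r for the current position; this is an illustrative stand-in, not the BlockVerification implementation.

# Illustrative per-token accept-residual step; not the package API.
import torch

def accept_or_resample(token, log_q, log_r, generator=None):
    """Accept `token` with probability min(1, q/r); otherwise sample from the residual q_res."""
    alpha = torch.exp(torch.clamp(log_q[token] - log_r[token], max=0.0))    # min(1, q/r)
    if torch.rand((), generator=generator) < alpha:
        return token, True                                                  # accepted: emit and continue
    residual = torch.clamp(log_q.exp() - log_r.exp(), min=0.0)              # q_res ∝ [q - r]_+
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1, generator=generator)), False  # rejected: resample, stop block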
    

Adaptive λ Control (Eq. 5)

Maintains cost constraint E[C] ≤ τ by updating:

λ ← [λ + η(Ê[C] - τ)]_+
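
Here Ê[C] is the empirical estimate of the expected cost and [·]_+ clips at zero, so λ rises when the measured cost exceeds the budget τ and decays toward zero otherwise. A minimal sketch of this projected update (illustrative; the package implementation lives in adaptive_control.py):

# Illustrative projected dual-ascent update for λ (Eq. 5); not the package API.
def update_lambda(lam: float, avg_cost: float, tau: float, eta: float) -> float:
    """λ ← [λ + η (Ê[C] - τ)]_+ : raise λ when cost exceeds the budget τ, never drop below 0."""
    return max(0.0, lam + eta * (avg_cost - tau))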

File Structure

neurograce_spec/
├── __init__.py              # Package exports
├── distributions.py         # q_λ and r_λ distributions
├── block_verification.py    # Algorithm 1: Block verification
├── cost_estimator.py        # Cost-to-go estimation
├── grammar.py              # Grammar/semantic masks
├── adaptive_control.py      # Adaptive λ control
└── utils.py                # Numerical stability utilities

examples/
└── example_usage.py         # Example script

requirements.txt            # Dependencies
README.md                   # This file

Mathematical Foundations

Notation

  • A: Token alphabet
  • h ∈ H: Prefix/history state
  • P_T(a | h): Large (target) model conditional
  • P_S(a | h): Small (proposal) model conditional
  • g_i(a | h) ∈ {0, 1}: Grammar/semantic feasibility mask
  • Ĵ(h): Cost-to-go approximation
  • ∆_i(a) = Ĵ(next(h, a)) - Ĵ(h): Incremental cost (∆̂_i(a) denotes the estimate used by the proposal; a toy example follows this list)
  • λ ≥ 0: Effort/precision weight
  • K: Speculative block size
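
A toy illustration of the incremental cost, using a hypothetical cost-to-go that only charges sequences that have not yet emitted an end-of-sequence token. This estimator is made up for illustration and is not the package's CostToGoEstimator.

# Toy illustration of ∆_i(a) = Ĵ(next(h, a)) - Ĵ(h); values are made up.
EOS = 0

def j_hat(prefix):
    """Hypothetical cost-to-go: nothing left to pay once EOS appears, otherwise a flat 10 units."""
    return 0.0 if EOS in prefix else 10.0

def incremental_cost(prefix, token):
    return j_hat(prefix + [token]) - j_hat(prefix)

print(incremental_cost([5, 8, 2], EOS))  # -10.0: ending the sequence removes the remaining cost
print(incremental_cost([5, 8, 2], 7))    #   0.0: continuing keeps the estimated cost-to-go unchanged

Under q_λ, the cheaper continuation (EOS here) is up-weighted by the factor e^(-λ * ∆_i(a)) before renormalization.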

Exactness Guarantee

The accept-residual mechanism ensures that the per-position marginals match q_λ exactly, preserving the target distribution while enabling efficient block verification.
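
A quick way to see this on a toy vocabulary is a Monte-Carlo check: propose from r, accept with probability min(1, q/r), and fall back to the residual on rejection; the empirical distribution of emitted tokens should match q. The standalone snippet below (NumPy only, independent of the package) does exactly that.

# Standalone Monte-Carlo check of the accept-residual exactness claim on a toy vocabulary.
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.5, 0.3, 0.2])    # toy target distribution
r = np.array([0.2, 0.3, 0.5])    # toy proposal distribution
residual = np.maximum(q - r, 0.0)
residual /= residual.sum()       # q_res ∝ [q - r]_+

samples = []
for _ in range(200_000):
    y = rng.choice(3, p=r)                         # propose from r
    if rng.random() < min(1.0, q[y] / r[y]):       # accept with probability min(1, q/r)
        samples.append(y)
    else:
        samples.append(rng.choice(3, p=residual))  # otherwise sample the residual

print(np.bincount(samples, minlength=3) / len(samples))  # ≈ [0.5, 0.3, 0.2]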

Examples

Basic Example

See examples/example_usage.py for a complete working example with dummy models.

python examples/example_usage.py

Real LLM Test

Test with actual transformer models from Hugging Face:

python examples/test_real_llm.py

This will:

  • Load GPT-2 (large) and DistilGPT-2 (small) models
  • Generate text using NeuroGRACE-Spec
  • Show accept rate and compare with baseline generation

Custom Configuration:

from examples.test_real_llm import test_with_real_models

test_with_real_models(
    large_model_name="gpt2",
    small_model_name="distilgpt2",
    prompt="Your prompt here",
    max_tokens=50,
    block_size=4,
    device="cpu"  # or "cuda" for GPU
)

See examples/README.md for more details.

References

  1. Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience.

  2. Lieder, F., & Griffiths, T. L. (2020). Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences.

  3. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. Proceedings of the 40th International Conference on Machine Learning (ICML).
