I implemented a PyTorch-level runtime to validate the static-shape prefill/decode design before relying on the exported artifact. The runtime uses the wrapped decoder layers only for the transformer blocks, while keeping the rest of the model unchanged. In particular:

- embedding, final norm, and LM head still use the original PyTorch modules, for simplicity
- prefill uses the prefill decoder-layer wrapper
- decode uses the decode decoder-layer wrapper
- the KV cache is managed entirely by the runtime, not inside the exported graph

The main goal is to verify that the proposed static decode contract is actually workable end-to-end.

Runtime behavior

For prefill:

- the full prompt is passed through the prefill wrappers
- each layer returns its full KV tensors for the prompt
- the runtime writes those outputs into external cache buffers
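
The prefill-side cache write can be sketched roughly as follows. This is a minimal illustration, not the actual runtime code: the helper names `allocate_kv_cache` and `write_prefill_kv` are hypothetical, and the buffer shape (B, num_kv_heads, max_seq - 1, head_dim) is taken from the decode contract described in this comment.

```python
import torch


def allocate_kv_cache(num_layers, batch, num_kv_heads, max_seq, head_dim,
                      dtype=torch.float32):
    """Allocate zero-filled static past buffers, one (k, v) pair per layer.

    Hypothetical helper. Each buffer holds max_seq - 1 past slots, matching
    the static past_key_value shape that decode expects.
    """
    shape = (batch, num_kv_heads, max_seq - 1, head_dim)
    return [
        (torch.zeros(shape, dtype=dtype), torch.zeros(shape, dtype=dtype))
        for _ in range(num_layers)
    ]


def write_prefill_kv(cache, layer_idx, k, v):
    """Copy one layer's prompt KV, shape (B, H, prompt_len, D), into the
    first prompt_len slots of that layer's buffers. Returns prompt_len."""
    k_buf, v_buf = cache[layer_idx]
    prompt_len = k.shape[2]
    if prompt_len > k_buf.shape[2]:
        raise ValueError("prompt does not fit in the static past buffer")
    k_buf[:, :, :prompt_len] = k
    v_buf[:, :, :prompt_len] = v
    return prompt_len
```

The returned prompt length would then serve as the initial past_len for the first decode step.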

For decode:

- each step takes a single-token hidden state with shape (B, 1, D)
- the runtime prepares:
  - position_embeddings = (cos_t, sin_t) for the current absolute position
  - attention_mask with fixed shape (B, 1, max_seq)
  - past_key_value, with K and V each of fixed shape (B, num_kv_heads, max_seq - 1, head_dim)
- the decode wrapper returns:
  - the next hidden state
  - delta KV only: (new_k, new_v), each with shape (B, num_kv_heads, 1, head_dim)
- the runtime writes this delta into the external cache buffer
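
As a rough sketch of the runtime-side bookkeeping for one decode step, the two helpers below compute (cos_t, sin_t) for a single absolute position and write the delta KV into the static buffer. Both function names are hypothetical. The RoPE layout (frequencies duplicated across the two halves of head_dim, base 10000) is an assumption borrowed from the common rotate-half convention; the actual model may differ.

```python
import torch


def rope_cos_sin(position, head_dim, base=10000.0):
    """Return (cos_t, sin_t), each of shape (1, 1, head_dim), for one
    absolute position. Assumes the rotate-half RoPE layout (hypothetical)."""
    inv_freq = 1.0 / (
        base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    )
    freqs = position * inv_freq              # (head_dim / 2,)
    emb = torch.cat((freqs, freqs), dim=-1)  # (head_dim,)
    return emb.cos()[None, None, :], emb.sin()[None, None, :]


def write_decode_delta(k_buf, v_buf, new_k, new_v, past_len):
    """Write the (B, H, 1, D) delta KV into slot past_len of the static
    past buffers and return the updated past_len."""
    if new_k.shape[2] != 1 or new_v.shape[2] != 1:
        raise ValueError("decode delta must contain exactly one token")
    if past_len >= k_buf.shape[2]:
        raise ValueError("static past buffer is full")
    k_buf[:, :, past_len:past_len + 1] = new_k
    v_buf[:, :, past_len:past_len + 1] = new_v
    return past_len + 1
```

Because the graph only ever sees fixed-shape inputs, the growing sequence length lives entirely in the past_len counter held by the runtime.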

This keeps decode fully static while preserving the intended RoPE and KV-cache behavior.

Mask/cache layout used in the runtime

The current runtime assumes:

- past tokens are packed into the first past_len slots of the static past buffer
- padded past slots remain masked
- the current token is appended internally by the attention path
- the decode attention mask explicitly unmasks:
  - valid past positions
  - the current-token slot
- no causal mask is built dynamically inside decode
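
The mask layout above could be built along these lines. This is a sketch under two assumptions that the text does not state explicitly: the mask is additive (0 for visible slots, a large negative value for masked ones), and the current-token slot is the last index, max_seq - 1, because the current token is appended after the max_seq - 1 past slots. The name `build_decode_mask` is hypothetical.

```python
import torch


def build_decode_mask(past_len, max_seq, batch=1, dtype=torch.float32):
    """Build a fixed-shape (B, 1, max_seq) additive attention mask.

    Slots [0, past_len) hold valid past tokens and are unmasked.
    Slots [past_len, max_seq - 1) are padding and stay masked.
    Slot max_seq - 1 is assumed to be the current token and is unmasked.
    """
    if not 0 <= past_len <= max_seq - 1:
        raise ValueError("past_len must be in [0, max_seq - 1]")
    mask = torch.full((batch, 1, max_seq), torch.finfo(dtype).min, dtype=dtype)
    mask[:, :, :past_len] = 0.0
    mask[:, :, max_seq - 1] = 0.0
    return mask
```

For example, with past_len = 3 and max_seq = 8, slots 0 to 2 and slot 7 are visible while slots 3 to 6 stay masked.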

This gives us a way to check correctness at the PyTorch level first:

- whether the static decode input contract is sufficient
- whether the external cache update model is correct
- whether prefill and decode connect cleanly
- whether step-by-step logits match the original reference model closely enough
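
The last check could be expressed with a small comparison helper like the one below. It is only a sketch: the function name `compare_step_logits` and the default tolerance of 1e-3 are assumptions, since no specific threshold for "closely enough" is given here.

```python
import torch


def compare_step_logits(ref_logits, test_logits, atol=1e-3):
    """Compare per-step logits from the reference model and the static
    runtime. Both tensors have shape (steps, vocab) or (B, steps, vocab).

    atol is a hypothetical tolerance, not a value from the original runtime.
    """
    diff = (ref_logits.float() - test_logits.float()).abs()
    max_abs = diff.max().item()
    # Fraction of steps where both models pick the same greedy token.
    top1 = (ref_logits.argmax(dim=-1) == test_logits.argmax(dim=-1))
    return {
        "max_abs_diff": max_abs,
        "top1_match": top1.float().mean().item(),
        "within_atol": max_abs <= atol,
    }
```

Reporting the greedy-token agreement alongside the raw difference helps separate harmless numerical noise from divergence that would change the generated text.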

So before testing Circle/runtime integration, we can already verify that the wrapper design itself is logically sound.

[quanization] Enable prefill-decode modeling #586
