Merged
66 changes: 52 additions & 14 deletions docs/spec.md
@@ -61,7 +61,7 @@ A Pin is a JSON object with the following fields:
|---|---|---|---|
| `v` | integer | yes | Protocol version. Must equal `1`. |
| `model` | string | yes | Embedding model identifier. |
| `model_hash` | string | no | Optional content hash of the model weights. When present, MUST match the format `"sha256:" || hex(SHA-256(weights))` where the input is the concatenation of model weight shards in sorted filename order. Implementations that cannot meet this convention MUST omit the field rather than emit a non-conforming value. |
| `source_hash` | string | yes | Hash of the source text (§3.1). |
| `vec_hash` | string | yes | Hash of the embedding (§3.2). |
| `vec_dtype` | string | yes | One of `"f32"` or `"f64"`. |
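
As a rough illustration of the `model_hash` convention in the table above, the following Python sketch hashes weight shards in sorted filename order. The function name `model_hash`, the assumption that every regular file in a single directory is a shard, and the lowercase hex output are illustrative choices, not part of the spec. The spec also does not say which collation "sorted filename order" uses, so plain code-point ordering is assumed here.

```python
import hashlib
from pathlib import Path


def model_hash(shard_dir: str) -> str:
    """Return "sha256:<hex>" over all shard files, concatenated in filename order.

    Assumption: every regular file directly inside `shard_dir` is a weight shard.
    """
    digest = hashlib.sha256()
    # Sort by filename (code-point order) so the result is reproducible.
    for shard in sorted(Path(shard_dir).iterdir(), key=lambda p: p.name):
        if not shard.is_file():
            continue
        with shard.open("rb") as fh:
            # Stream in 1 MiB chunks; feeding chunks in order is equivalent
            # to hashing the concatenation of the files.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return "sha256:" + digest.hexdigest()
```

An implementation that cannot enumerate its shards this way would, per the table, omit `model_hash` instead of emitting some other digest.
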
@@ -83,7 +83,22 @@ The signature in `sig` is produced over a canonical byte sequence that excludes

This canonical form is fed directly into Ed25519 signing.

### 4.3 Size limits

To bound parser resource consumption and prevent DoS through hostile pins, conforming v1 implementations MUST enforce:

| Limit | Maximum |
|---|---|
| Total pin JSON, UTF-8 byte length | 64 KiB (65,536 bytes) |
| `extra` entry count | 32 |
| Any `extra` key, UTF-8 byte length | 128 bytes |
| Any `extra` value, UTF-8 byte length | 1 KiB (1,024 bytes) |
| `vec_dim` | 1,048,576 (2^20) |
| `sig`, decoded byte length | exactly 64 (Ed25519 signature) |

Verifiers MUST reject oversized pins before parsing the signature. These limits are not part of the signed payload — they are parser-side defenses.
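
The limits above could be enforced roughly as in the following Python sketch. It is not the reference implementation: the names `parse_pin_with_limits` and `PinParseError` are hypothetical, `extra` values are assumed to be strings, and `sig` is assumed to be URL-safe base64 with optional padding (this excerpt does not state the encoding).

```python
import base64
import json

# Limits copied from the size-limit table.
MAX_PIN_BYTES = 64 * 1024
MAX_EXTRA_ENTRIES = 32
MAX_EXTRA_KEY_BYTES = 128
MAX_EXTRA_VALUE_BYTES = 1024
MAX_VEC_DIM = 2 ** 20
SIG_BYTES = 64


class PinParseError(ValueError):
    """Hypothetical exception standing in for the PARSE_ERROR failure mode."""


def _utf8_len(text: str) -> int:
    # surrogatepass lets us measure strings containing lone surrogates
    # (possible via JSON \u escapes) without raising.
    return len(text.encode("utf-8", "surrogatepass"))


def parse_pin_with_limits(raw: bytes) -> dict:
    # Cheapest check first: total size, before any JSON parsing.
    if len(raw) > MAX_PIN_BYTES:
        raise PinParseError("pin JSON exceeds 64 KiB")
    try:
        pin = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError, RecursionError) as exc:
        raise PinParseError("pin is not valid UTF-8 JSON") from exc
    if not isinstance(pin, dict):
        raise PinParseError("pin must be a JSON object")

    extra = pin.get("extra", {})
    if not isinstance(extra, dict) or len(extra) > MAX_EXTRA_ENTRIES:
        raise PinParseError("`extra` must be an object with at most 32 entries")
    for key, value in extra.items():
        # Assumption: `extra` values are strings.
        if not isinstance(value, str):
            raise PinParseError("`extra` values must be strings")
        if _utf8_len(key) > MAX_EXTRA_KEY_BYTES:
            raise PinParseError("`extra` key too long")
        if _utf8_len(value) > MAX_EXTRA_VALUE_BYTES:
            raise PinParseError("`extra` value too long")

    vec_dim = pin.get("vec_dim")
    if not isinstance(vec_dim, int) or vec_dim > MAX_VEC_DIM:
        raise PinParseError("`vec_dim` missing or above 2^20")

    sig = pin.get("sig")
    if not isinstance(sig, str):
        raise PinParseError("`sig` missing or not a string")
    try:
        # Assumption: URL-safe base64; re-add padding if it was stripped.
        sig_bytes = base64.urlsafe_b64decode(sig + "=" * (-len(sig) % 4))
    except ValueError as exc:
        raise PinParseError("`sig` is not valid base64") from exc
    if len(sig_bytes) != SIG_BYTES:
        raise PinParseError("`sig` must decode to exactly 64 bytes")
    return pin
```

The total-size check runs on the raw bytes so that an oversized pin never reaches the JSON parser, and all checks finish before any signature verification would start.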

### 4.4 Example

```json
{
```

@@ -103,38 +118,51 @@ This canonical form is fed directly into Ed25519 signing.

A verifier MUST:

0. Reject pins whose serialized JSON exceeds the size limits in §4.3 before parsing.
1. Reject pins whose `v` field is unknown to it.
2. Reject pins whose `kid` is not in its key registry, OR whose registry entry's `valid_from`/`valid_until` window excludes `ts` (see §7).
3. Reconstruct the canonical byte sequence (§4.2) and verify `sig` against the registered public key for `kid`.
4. If a ground-truth source string was supplied, recompute `hash_text(source)` and compare to `source_hash`.
5. If a ground-truth vector was supplied, recompute `hash_vector(vector, vec_dtype)` and compare to `vec_hash`. Also check that the supplied vector's shape matches `vec_dim`.
6. If an expected model identifier was supplied, compare to `model`.
7. If the caller supplied an expected `vectorpin.record_id` / `vectorpin.collection_id` / `vectorpin.tenant_id`, the verifier MUST compare against the value in `extra` and reject on mismatch. These reserved keys are the v1 replay-protection mechanism (§8); a verifier that ignores them when the caller has supplied an expected value MUST be considered non-conformant.

Verifiers MUST distinguish at least these failure modes (the reference implementation uses the names below; other implementations MAY use different names but MUST distinguish the cases):

- `UNSUPPORTED_VERSION`
- `UNKNOWN_KEY`
- `KEY_EXPIRED` — `kid` is registered but `ts` falls outside the registered validity window.
- `SIGNATURE_INVALID`
- `VECTOR_TAMPERED`
- `SOURCE_MISMATCH`
- `MODEL_MISMATCH`
- `SHAPE_MISMATCH`
- `RECORD_MISMATCH` / `COLLECTION_MISMATCH` / `TENANT_MISMATCH` — caller-supplied expected value for the corresponding reserved `extra` key does not match.
- `PARSE_ERROR` — pin JSON exceeds size limits, contains unknown top-level fields, or fails type validation.
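
Step 7 and its three failure modes can be sketched in Python as follows. The enum `VerifyResult`, its `OK` member, and the function `check_reserved_extra` are hypothetical names for illustration. Treating a reserved key that is absent from `extra` as a mismatch is an assumption here; the spec text only says to compare against the value in `extra`.

```python
from enum import Enum
from typing import Mapping


class VerifyResult(Enum):
    OK = "OK"  # hypothetical success value, not a spec-defined name
    RECORD_MISMATCH = "RECORD_MISMATCH"
    COLLECTION_MISMATCH = "COLLECTION_MISMATCH"
    TENANT_MISMATCH = "TENANT_MISMATCH"


# Reserved `extra` keys mapped to the failure mode each one produces.
_RESERVED_KEYS = (
    ("vectorpin.record_id", VerifyResult.RECORD_MISMATCH),
    ("vectorpin.collection_id", VerifyResult.COLLECTION_MISMATCH),
    ("vectorpin.tenant_id", VerifyResult.TENANT_MISMATCH),
)


def check_reserved_extra(
    extra: Mapping[str, str], expected: Mapping[str, str]
) -> VerifyResult:
    """Compare caller-supplied expected values against the pin's `extra`."""
    for key, failure in _RESERVED_KEYS:
        # Only enforce keys the caller actually supplied an expectation for.
        if key in expected and extra.get(key) != expected[key]:
            return failure
    return VerifyResult.OK
```

For example, a caller that knows which record a pin was fetched from passes `{"vectorpin.record_id": "doc-42"}` as `expected`, and a pin copied from another record then yields `RECORD_MISMATCH` even though its signature is valid.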

## 6. Storage conventions

Adapter implementations SHOULD store pins under the metadata key `vectorpin`. Backends without free-form metadata fields are out of scope for this version of the protocol — provenance must travel with the data.

## 7. Key rotation and revocation

Verifiers MUST support multiple `kid` -> public key mappings simultaneously, each with an optional validity window `(valid_from, valid_until)` of RFC 3339 timestamps. Issuers rotate by:

1. Generating a new keypair with a fresh `kid`.
2. Adding the new public key to all relevant verifier registries, with a `valid_from` no earlier than the moment the new private key becomes operational.
3. Switching production signing to the new private key.
4. Optionally re-pinning the corpus over time.
5. Setting `valid_until` on the old key entry to the rotation cutover instant (do not remove the entry — historical pins must continue to verify against it).

Old pins continue to verify against the old public key as long as their `ts` falls within the old key's `(valid_from, valid_until)` window.

### Revocation distinct from rotation

If a private key is **compromised** (as opposed to merely rotated for hygiene), the corresponding `kid` entry MUST be marked with `valid_until` set to the latest moment the key is believed to have been uncompromised. Pins with `ts` after that instant return `KEY_EXPIRED`; pins with `ts` before it continue to verify. This preserves the integrity of historical pins while immediately invalidating anything an attacker could produce post-compromise.

Operators SHOULD pair this with a transparency-log entry (e.g., sigstore Rekor or a project-specific append-only log) for the revocation event itself, so that downstream verifiers can detect a malicious registry rollback.

The protocol does not specify a revocation file format in v1; this is intentionally out of band so deployments can integrate with existing PKI / sigstore infrastructure. The minimum requirement on a v1.x verifier is to honor the `(valid_from, valid_until)` window however it is delivered.
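
A minimal Python sketch of honoring the `(valid_from, valid_until)` window might look like this. `KeyEntry`, `parse_rfc3339`, and `check_key_window` are hypothetical names, the `"OK"` return value is not a spec-defined name, and treating both window bounds as inclusive is an assumption, since the text above only covers `ts` strictly before or after the cutover instant.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Mapping, Optional


def parse_rfc3339(value: str) -> datetime:
    """Parse the common RFC 3339 forms into a timezone-aware datetime."""
    # datetime.fromisoformat only accepts a trailing "Z" on Python 3.11+.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError("RFC 3339 timestamps must carry a UTC offset")
    return parsed


@dataclass(frozen=True)
class KeyEntry:
    public_key: bytes
    valid_from: Optional[datetime] = None   # None means no lower bound
    valid_until: Optional[datetime] = None  # None means no upper bound


def check_key_window(registry: Mapping[str, KeyEntry], kid: str, ts: str) -> str:
    entry = registry.get(kid)
    if entry is None:
        return "UNKNOWN_KEY"
    when = parse_rfc3339(ts)
    # Assumption: both bounds are inclusive.
    if entry.valid_from is not None and when < entry.valid_from:
        return "KEY_EXPIRED"
    if entry.valid_until is not None and when > entry.valid_until:
        return "KEY_EXPIRED"
    return "OK"
```

Under this model, revoking a compromised key is a registry update that sets `valid_until` on the existing entry; the entry itself stays in place so that earlier pins still verify.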

## 8. Reserved `extra` keys

@@ -152,17 +180,27 @@ A v1.1 candidate spec promotes `record_id`, `collection_id`, and `tenant_id` to

## 9. Security considerations

- **Replay**: Pins are not bound to a specific record id at the wire format level. An attacker who copies a pin from one record to another can pass verification only if the vector and source they paste alongside match the pin. Implementations that need stronger replay protection SHOULD use the reserved `vectorpin.collection_id` / `vectorpin.record_id` / `vectorpin.tenant_id` keys defined in §8, and verifiers MUST enforce them when the caller supplies an expected value (see §5 step 7).
- **Time**: The `ts` field is informational *for the pin* but load-bearing for revocation: verifiers MUST consult `(valid_from, valid_until)` on the `kid` registration (§7) and reject pins whose `ts` falls outside that window.
- **Key custody**: An attacker with the private signing key can produce arbitrary pins. Treat the signing key as a high-value secret. Reference implementations write private keys with mode `0600`; production deployments SHOULD use a KMS or hardware-backed signer rather than file-system keys.
- **Source-time integrity**: VectorPin attests to the relationship between source and vector at pin time. It does not attest that the source itself was authentic at ingestion. Pair VectorPin with source-side controls (signed ingestion logs, document provenance) where this matters.
- **DoS via malformed pins**: Without the §4.3 size limits, a single hostile pin can exhaust verifier resources. Implementations MUST enforce these limits before reaching the signature path.

## 10. Key distribution

The protocol assumes a verifier has access to a registry mapping `kid` to `(public_key, valid_from, valid_until)`. How that registry is populated is out of scope, but the following SHOULD apply to any production deployment:

- **Fingerprint format**: Operators identifying a key out of band (Slack, email, ticket) SHOULD use `SHA-256(pubkey_bytes)` truncated to the first 16 hex digits, formatted as four colon-separated quads, e.g. `1f3a:7b22:9e0d:c4f1`. Where possible, share the full key instead: a 32-byte public key encoded as URL-safe base64 is short enough to paste verbatim.
- **Production registries SHOULD reference a transparency log entry** (e.g., sigstore Rekor) for each `kid` registration and revocation. The log entry binds the key material to a publicly observable, append-only history, allowing downstream verifiers to detect a malicious registry rollback.
- **Trust-on-first-use (TOFU) is NOT RECOMMENDED for new pins** unless the operator has explicitly opted in. A verifier that auto-registers any `kid` it encounters provides no integrity guarantee — it is a checksum, not a signature.
- **Per-tenant key separation**: Multi-tenant deployments SHOULD issue separate `kid`s per tenant rather than share a single producer key, so that compromise of one tenant's environment cannot forge pins for another tenant.
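
The fingerprint format from the first bullet can be computed with a few lines of Python. The function name `key_fingerprint` is hypothetical, and lowercase hex is assumed because the example fingerprint in the bullet is lowercase.

```python
import hashlib


def key_fingerprint(pubkey_bytes: bytes) -> str:
    """First 16 hex digits of SHA-256(pubkey), as four colon-separated quads."""
    hex16 = hashlib.sha256(pubkey_bytes).hexdigest()[:16]
    return ":".join(hex16[i:i + 4] for i in range(0, 16, 4))
```

The result is always 19 characters long (16 hex digits plus three colons). It covers only 64 bits of the digest, so it suits a human cross-check of a key obtained elsewhere and is not a substitute for the key itself.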

## 11. Versioning

This is protocol version 1. Future versions MAY:

- Add new optional fields under `extra`-style namespaces.
- Add new dtype identifiers.
- Add new signature/hash algorithms (with corresponding identifiers).

A change is breaking iff a v1 verifier would silently accept a v2 pin as valid when the v2 pin's additional semantics matter. Such changes MUST bump the major version. Downgrade resistance is provided by including the protocol-version field in the signed canonical bytes (§4.2) and by §5 step 0 / §4.3 size limits, which together prevent an attacker from stripping new fields and presenting the remainder to an older verifier.