Support for INT8 in dense-only vector indexes #4133
Up to standards ✅ 🟢 Issues

| Metric | Results |
|---|---|
| Complexity | 0 |
| Coverage variation | ✅ -7.74% coverage variation |
| Diff coverage | ✅ 67.48% diff coverage |

Coverage variation details

| | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor commit (ff65372) | 124872 | 91902 | 73.60% |
| Head commit (e3bdfb9) | 156348 (+31476) | 102972 (+11070) | 65.86% (-7.74%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
|---|---|---|---|
| Pull request (#4133) | 123 | 83 | 67.48% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
Code Review - PR #4133: Support for INT8 in dense-only vector indexes

Overview

This PR adds INT8 pre-quantized ingest support to

Positive aspects

Issues and suggestions

1. Silent data loss in
Code Review
This pull request introduces support for INT8 pre-quantized vector ingest, enabling a 4x reduction in storage and bandwidth for compatible embedding providers. Key changes include the addition of a VectorEncoding enum, dequantization logic in VectorUtils, and updates to LSMVectorIndex and its builders to handle INT8 encoding via BINARY columns. Feedback identifies a regression in SQLFunctionVectorNeighbors where non-string vertex identifiers (like RID) are no longer correctly handled for lookups. Additionally, it was noted that several internal methods in LSMVectorIndex still use a deprecated conversion utility, which will cause INT8 vectors to be skipped during graph rebuilding and validation.
-   // Key is a vertex identifier - fetch the vertex and get its vector
-   final String keyStr = key.toString();
+   // Vertex-id lookup path: if the key is a string and not a literal vector, fetch the stored vector property.
+   if (key instanceof String keyStr) {
The change to key instanceof String introduces a regression for non-string vertex identifiers. Previously, any object (such as a RID) was converted to a string via toString() and used for the lookup. Now, passing a RID object will cause the function to fall through to VectorUtils.toFloatArray(key), which will throw an IllegalArgumentException as it does not support Identifiable types. To restore compatibility, the check should include Identifiable.
Suggested change:
-   if (key instanceof String keyStr) {
+   if (key instanceof String || key instanceof Identifiable) {
+     final String keyStr = key.toString();
    final float[] queryVector;
    try {
      queryVector = VectorUtils.toFloatArray(keys[0]);
While get and put have been updated to use VectorUtils.toFloatArray, several other locations in this class (specifically in ensureGraphAvailable and buildGraphFromScratchWithRetry, e.g., lines 909, 1175, 1326) still use the deprecated VectorUtils.convertToFloatArray. This will cause INT8 encoded vectors (stored as byte[]) to be skipped during graph rebuilding and validation because convertToFloatArray returns null for byte[]. All internal usages should be migrated to toFloatArray to ensure full support for the new encoding.
- putBatch: log WARNING with rid + cause when a batch entry's vector type
conversion or dimension check fails, instead of silently dropping the row
during commit replay
- dequantizeInt8ToFloat: clamp byte -128 to -127 so a non-Cohere/OpenAI int8
source still produces values in [-1, 1] for COSINE similarity
- LSMVectorIndex.put: dimension mismatch error now names the input type and
length ("got byte[] of length N, expected dimensions M")
- VectorEncoding Javadoc: correct ARRAY_OF_BYTES to BINARY (the Java byte[]
ArcadeDB type), drop em-dash
- LSMVectorIndexFactoryHandler: clarify in the comment that the encoding
field assignment is safe (factory has not published the reference yet)
and flag the constructor-arg sprawl as tracked tech debt
- comparison matrix: trim INT8 row down to one line
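
The byte clamp in the list above can be illustrated with a minimal sketch. It assumes the `value / 127.0f` scaling that the review comments attribute to the Cohere/OpenAI calibration; the class name `Int8DequantSketch` is hypothetical and this is not the project's actual `VectorUtils` code.

```java
public final class Int8DequantSketch {
    private Int8DequantSketch() {
    }

    /**
     * Converts signed int8 components to float32 by dividing by 127.
     * Assumption: symmetric calibration where 127 maps to 1.0f.
     */
    public static float[] dequantizeInt8ToFloat(final byte[] quantized) {
        final float[] out = new float[quantized.length];
        for (int i = 0; i < quantized.length; i++) {
            // Clamp -128 to -127 so the result never falls below -1.0f.
            final int clamped = Math.max((int) quantized[i], -127);
            out[i] = clamped / 127.0f;
        }
        return out;
    }
}
```

Without the clamp, a byte of -128 would dequantize to roughly -1.008, which is outside the [-1, 1] range the commit message targets.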
Code Review - PR #4133: Support for INT8 in dense-only vector indexes

Overview

This PR adds INT8 pre-quantized ingest support to LSM_VECTOR indexes. Users declare the vector property as BINARY (one signed byte per dimension), and the engine dequantizes to float32 on the read path using value / 127.0f (Cohere/OpenAI calibration). HTTP payloads and document bucket storage shrink 4x. HNSW build/search still runs on float32 internally, pending upstream jvector#665.

The design is clean and backwards-compatible. The VectorEncoding enum is well documented, the JSON serialization omits the default to avoid bloat in existing schemas, and the test covers encoding metadata, key-type flipping, end-to-end ingest, float32/INT8 parity, and error rejection.

Critical - Index rebuild path misses INT8 vectors

Five callers still use the now-deprecated
The All five sites need a one-line change from A test that inserts INT8 vectors, triggers a rebuild (e.g. by setting

Medium - toFloatArray silently dequantizes any byte[]

Low - INT8 encoding and scalar/INT8 quantization are not validated together

A caller can construct an index with both

Minor notes
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #4133 +/- ##
=======================================
Coverage 64.56% 64.57%
=======================================
Files 1635 1636 +1
Lines 124872 124947 +75
Branches 26670 26682 +12
=======================================
+ Hits 80628 80679 +51
- Misses 33000 33016 +16
- Partials 11244 11252 +8
Critical fix: five rebuild/recover sites still called the deprecated
VectorUtils.convertToFloatArray, which returns null for byte[]. An INT8
index reopened from disk would silently rebuild the HNSW graph from an
empty vector set:
- LSMVectorIndex.java HNSW validation pass (was marking every INT8 vector
  as invalid)
- LSMVectorIndex.java document-scan rebuild fallback (silently skipping
  INT8 vectors)
- LSMVectorIndex.java preload-for-HNSW pass (silently skipping INT8 vectors)
- GrowableVectorValues lazy-load fallback (silently skipping INT8 vectors)
- ArcadePageVectorValues page vector reading (false WARNING on every INT8
  vector access)
All five now route through VectorUtils.toFloatArray with appropriate
try/catch to preserve the prior null/skip semantics.
Other review fixes:
- VectorUtils.toFloatArray Javadoc spells out that a byte[] input is
  unconditionally dequantized via dequantizeInt8ToFloat (Cohere/OpenAI
  calibration); callers outside an INT8 context must convert first.
- LSMVectorIndexFactoryHandler rejects (encoding=INT8, quantization=INT8)
  with a clear error: the property is already byte-quantized at the wire
  level, so internal INT8 scalar quantization would re-quantize the
  dequantized floats - silent double-processing.
Tests:
- int8VectorsSurviveRebuildAfterReopen: regression guard. Inserts N int8
  vectors, runs a search to materialise the graph, reopens the database
  to force a rebuild from disk, asserts the same query still returns hits.
- rejectsInt8EncodingPlusInt8Quantization: validates the new factory-handler
  guard.
Code Review: INT8 Dense-Only Vector Index Support (#4133)

Overview

This PR adds end-to-end INT8 pre-quantized ingest for dense vector indexes. Callers can set

The design is sound and the approach is backward-compatible. Test coverage is good. A few issues and observations follow.

Issues

1. Silent failure in
The bare toFloatArray(Object) used to dequantize any byte[] unconditionally,
which is a footgun for non-vector callers that happen to encounter a byte[]
value. Split into two methods so the safe behaviour is the default:
- toFloatArray(Object) now rejects byte[] with a clear pointer to the
  encoding-aware overload or to dequantizeInt8ToFloat
- toFloatArray(Object, VectorEncoding) dequantizes byte[] only when
  encoding == INT8; for any other encoding it delegates to the strict
  variant, so a stray byte[] in a FLOAT32 index is rejected up front
All internal callers that have access to the index encoding switched to
the new overload: LSMVectorIndex put / putBatch / get / HNSW validation /
document-scan rebuild / preload, GrowableVectorValues lazy-load fallback,
ArcadePageVectorValues page reader, and SQLFunctionVectorNeighbors
extractQueryVector. SQL math/utility functions (vector.add,
vector.cosineSimilarity, etc.) keep calling the strict variant via
SQLFunctionAbstract.toFloatArray and therefore reject byte[] - users must
dequantize first.
Other follow-ups in this batch:
- GrowableVectorValues now logs a WARNING when an unsupported vector type
  is encountered during the document-lookup fallback, matching the sibling
  ArcadePageVectorValues path. Silent drops made operational triage hard.
- VectorEncoding.INT8 Javadoc documents the -128 calibration clamp so
  callers from non-Cohere/OpenAI providers know about the silent numeric
  correction at the [-128] edge case.
- LSMVectorIndexFactoryHandler comment now carries a TODO marker for the
  constructor-to-config-record refactor that would let encoding move back
  into construction.
- Removed VectorUtils.convertToFloatArray entirely - it had no callers
  after the rebuild-path fix and the deprecated null-returning variant
  was the source of the silent-INT8-drop class of bugs we just closed.
  Updated the lone test caller (LSMVectorIndexStorageBenchmark) to use
  toFloatArray.
- Added .isInstanceOf(IndexException.class) check on the unknown-encoding
  rejection test to pin the exception type contract.
- New sqlVectorNeighborsAcceptsByteArrayQuery test exercises the SQL
  `vector.neighbors` entry point with a byte[] query against an INT8
  index, closing the SQL-surface coverage gap.
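
The strict/encoding-aware split described in this commit message might look roughly like the following sketch. It is illustrative only: the class name `VectorConversionSketch`, the nested `VectorEncoding` stand-in, the error messages, and the use of `IllegalArgumentException` are assumptions, not the project's real `VectorUtils` signatures.

```java
public final class VectorConversionSketch {
    /** Stand-in for the PR's VectorEncoding enum (assumed to have these two values). */
    public enum VectorEncoding { FLOAT32, INT8 }

    private VectorConversionSketch() {
    }

    /** Strict variant: never interprets a byte[] as a vector. */
    public static float[] toFloatArray(final Object value) {
        if (value instanceof float[])
            return (float[]) value; // passthrough, same reference
        if (value instanceof byte[])
            throw new IllegalArgumentException(
                "byte[] is only accepted with VectorEncoding.INT8; dequantize it first");
        throw new IllegalArgumentException("Unsupported vector type: "
            + (value == null ? "null" : value.getClass().getSimpleName()));
    }

    /** Encoding-aware variant: dequantizes byte[] only for INT8, otherwise delegates. */
    public static float[] toFloatArray(final Object value, final VectorEncoding encoding) {
        if (encoding == VectorEncoding.INT8 && value instanceof byte[]) {
            final byte[] bytes = (byte[]) value;
            final float[] out = new float[bytes.length];
            for (int i = 0; i < bytes.length; i++)
                out[i] = Math.max((int) bytes[i], -127) / 127.0f; // assumed calibration
            return out;
        }
        return toFloatArray(value);
    }
}
```

Making the one-argument form the strict one means a caller that forgets to pass the encoding gets an exception rather than a silently rescaled vector.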
Code Review - PR #4133: Support for INT8 in dense-only vector indexes

Overview

This PR adds end-to-end support for pre-quantized INT8 vector ingest in dense-only

Potential Bugs / Issues

1. Post-construction mutation in

    final LSMVectorIndex index = new LSMVectorIndex(...);
    index.metadata.encoding = vectorBuilder.encoding; // post-construction mutation
    return index;

The PR comment correctly notes the mutation window is unobservable today, but the safety argument relies on the index not being published before

2. No property-type vs. encoding consistency check

There is no validation that the declared document property type matches the requested encoding. For example, a user can call:

    docType.createProperty("embedding", Type.ARRAY_OF_FLOATS); // float property
    builder.withEncoding(VectorEncoding.INT8); // INT8 encoding

The index will be created successfully. When

3. This works correctly because Java promotes

4.

    ((LSMVectorIndexMetadata) metadata).encoding = encoding;

If

Performance Considerations

5. In

6. This is unavoidable given JVector's current

Code Quality

7. The existing file (

8. Comment verbosity in production code

Several comments added inside method bodies (e.g.,

Test Coverage

The test suite (

9. Missing This test builds two HNSW graphs and runs two neighbor searches. Per

10. No unit test for The

11. No test for The deliberate rejection of

Minor Nits

Summary

The feature is well-implemented and the core logic is correct. The main items to address before merge:

🤖 Generated with Claude Code
Substantive: add a property-type vs encoding consistency check in the
LSMVectorIndexFactoryHandler. A user could previously combine
encoding=INT8 with an ARRAY_OF_FLOATS property (or encoding=FLOAT32 with a
BINARY property) and only see the failure surface as silent
mis-interpretation at put() / query time. The new guard rejects both
mismatches at create() with a clear error pointing at the right knob.
Other review follow-ups:
- TypeLSMVectorIndexBuilder.withEncoding now goes through a private
  vectorMetadata() typed accessor that performs an instanceof guard,
  turning a metadata-swap subclass scenario into a clear IndexException
  instead of an unrelated ClassCastException.
- VectorUtils.dequantizeInt8ToFloat now uses an explicit (int) cast on
  the byte before the bound check; same behaviour, clearer intent.
- LSMVectorIndexMetadata license header moved above the package
  declaration (was after the imports).
- Trimmed verbose inline comments added in the previous review pass; the
  canonical Javadocs in VectorEncoding and VectorUtils carry the detail.
- Constructor-refactor TODO now references the freshly-filed tracker
  issue #4134 instead of an inline TODO comment, so the tech debt is
  visible outside the source file.
Test coverage:
- New VectorUtilsTest covers dequantize edge cases (-128 clamps to -127,
  -127 -> -1.0f, 0 -> 0.0f, 127 -> 1.0f, empty input), the strict
  toFloatArray rejecting byte[], the encoding-aware variant rejecting
  byte[] under FLOAT32 and dequantizing under INT8, and float[] passthrough.
- LSMVectorIndexInt8IngestTest gains rejectsInt8EncodingWithFloatPropertyType
  and rejectsFloat32EncodingWithBinaryPropertyType to pin the new
  property/encoding consistency check.
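
A property-type versus encoding guard of the kind this commit describes could be sketched as below. Everything here is a hypothetical stand-in: the nested enums mimic `VectorEncoding` and the `BINARY` / `ARRAY_OF_FLOATS` property types named in the thread, and `IllegalArgumentException` is used in place of whatever exception the real factory handler throws.

```java
public final class EncodingGuardSketch {
    /** Stand-ins for the engine's encoding and property-type enums. */
    public enum VectorEncoding { FLOAT32, INT8 }

    public enum PropertyType { ARRAY_OF_FLOATS, BINARY }

    private EncodingGuardSketch() {
    }

    /** Returns the property type an encoding expects: BINARY for INT8, float array otherwise. */
    public static PropertyType expectedType(final VectorEncoding encoding) {
        return encoding == VectorEncoding.INT8 ? PropertyType.BINARY : PropertyType.ARRAY_OF_FLOATS;
    }

    /** Rejects a mismatch at index creation time instead of at put() or query time. */
    public static void checkPropertyMatchesEncoding(final String propertyName,
        final PropertyType declared, final VectorEncoding encoding) {
        final PropertyType expected = expectedType(encoding);
        if (declared != expected)
            throw new IllegalArgumentException("Property '" + propertyName + "' is declared as "
                + declared + " but encoding " + encoding + " requires " + expected);
    }
}
```

Failing at creation gives the user one clear error naming both settings, rather than a later misinterpretation of stored bytes.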
Code Review - PR #4133: Support for INT8 in dense-only vector indexes

Overview

This PR adds end-to-end INT8 (signed 8-bit integer) pre-quantized ingest support for

Overall this is a well-structured, clearly motivated feature with good test coverage. A few items worth addressing:

Issues

1. Post-construction
- dequantizeInt8ToFloat now logs a one-time WARNING the first time it
  encounters byte -128 (clamped silently to -127). Cohere/OpenAI int8
  endpoints emit [-127, 127] only, so a -128 byte indicates a non-Cohere
  source where the asymmetric clamp matters for DOT_PRODUCT in particular.
  Atomic gate keeps the warning to once per process.
- New VectorEncoding.fromString centralizes the FLOAT32/INT8 string parser
  used by TypeLSMVectorIndexBuilder, BucketLSMVectorIndexBuilder, and the
  schema-JSON load path; the supported-values list cannot drift between
  the three call sites anymore.
- HTTP/JSON wire routing for INT8 query vectors filed as #4135 (JSON
  arrays of integers / base64 strings -> byte[] in HTTP handlers); the
  comparison matrix entry now points at it so readers do not assume the
  4x payload claim covers HTTP clients today.
- putBatch Javadoc explicitly documents the put-vs-putBatch failure
  asymmetry: put throws on bad keys, putBatch logs WARNING and continues
  to avoid aborting an entire commit-replay batch on one bad row.
- New SQL DDL test pins `CREATE INDEX ... LSM_VECTOR METADATA
  {encoding: 'INT8'}` end-to-end so a JSON key-name drift would surface.
- VectorUtilsTest.toFloatArrayPassesThroughFloatArrayUnchanged now carries
  a comment explaining why all three overloads return the same float[]
  reference, so the identity assertion is not mistaken for an over-spec.
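
The "atomic gate" mentioned in the first bullet above is commonly built on `AtomicBoolean.compareAndSet`. The sketch below shows that pattern under stated assumptions: the class name, the `java.util.logging` logger, and the `WARNINGS_EMITTED` counter (added only so the behaviour can be observed) are all hypothetical and not taken from the PR.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Logger;

public final class OneTimeWarningSketch {
    private static final Logger LOG = Logger.getLogger(OneTimeWarningSketch.class.getName());

    // Flips from false to true exactly once per process, even under concurrency.
    private static final AtomicBoolean WARNED = new AtomicBoolean(false);

    // Demonstration-only counter so callers can observe how many warnings fired.
    static final AtomicInteger WARNINGS_EMITTED = new AtomicInteger();

    private OneTimeWarningSketch() {
    }

    /** Dequantizes one component, warning once if the out-of-calibration value -128 appears. */
    public static float dequantize(final byte component) {
        int value = component;
        if (value == -128) {
            // compareAndSet succeeds for only one caller, so the log line is emitted once.
            if (WARNED.compareAndSet(false, true)) {
                WARNINGS_EMITTED.incrementAndGet();
                LOG.warning("int8 component -128 clamped to -127; source may not use [-127, 127]");
            }
            value = -127;
        }
        return value / 127.0f;
    }
}
```

The clamp still applies on every call; only the log line is gated, which keeps a bulk ingest from flooding the log.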
Code Review - PR #4133: INT8 Dense Vector Index Support

Overview

This PR adds pre-quantized INT8 ingest to

What's Done Well

Issues

1. Post-construction

    final LSMVectorIndex index = new LSMVectorIndex(...);
    index.metadata.encoding = vectorBuilder.encoding; // set after ctor
    return index;

The comment correctly argues the window is unobservable because the reference isn't published until

2. Property-type check is skipped when the property isn't declared

    final Property property = propertyOwner.getPolymorphicPropertyIfExists(propertyName);
    if (property != null) {
      // validation only runs here
    }

If the property hasn't been explicitly declared (e.g. vertex type without schema enforcement), the encoding/property-type mismatch check is silently bypassed. The put path will later fail or silently misbehave. Consider whether a

3.

    return new Type[] {
      metadata.encoding == VectorEncoding.INT8 ? Type.BINARY : Type.ARRAY_OF_FLOATS
    };

Any code that routes on

4. HTTP/JSON deserialization path is not covered by this PR

The PR ships the Java binary API and SQL METADATA path. The documentation correctly notes HTTP wire routing is tracked in #4135. However, clients sending INT8 vectors over HTTP today will send a JSON array of integers like

Minor Notes

Summary

The core implementation is solid and the test coverage is thorough. Items to address before merge:

🤖 Generated with Claude Code
#4136)
* HTTP: route int8 query vectors to byte[] via $bytes/$int8 markers (#4135)
Closes the HTTP/JSON gap left by the INT8 ingest landing (#4132/#4133):
clients can now send int8 query vectors that reach the engine as byte[] and
trigger the encoding-aware dequantization on LSM_VECTOR indexes, rather than
getting silently round-tripped through float32 and losing the 4x payload
claim on the wire.
Wire convention (Extended JSON-style):
- {"$bytes": "<base64>"} -> byte[] decoded from base64
- {"$int8": [v0, v1, ...]} -> byte[] packed from int values in [-128, 127]
The int8 form also accepts the float[] / double[] shapes that
JSONObject.toMap(optimizeNumericArrays=true) produces for JSON integer
arrays, with a fractional-value check that rejects non-integer floats so a
caller mixing up float and int8 vectors fails loudly at the wire boundary.
Implementation:
- AbstractQueryHandler.decodeTypedJsonMarkers recursively walks the parsed
  param map and rewrites single-key {"$bytes" | "$int8": ...} objects into
  byte[]; multi-key maps and unrelated single-key maps pass through
  unchanged so existing user data with leading-$ keys is not silently
  transformed.
- mapParams calls the decoder before its existing ordinal-vs-named routing
  so the byte[] flows verbatim to SQL parameter binding.
Tests:
- AbstractQueryHandlerTypedJsonMarkersTest: 11 unit cases pin the decoder
  contract (base64 decode, int list decode, float[] decode, out-of-range /
  non-integer / non-numeric / bad-base64 rejection, multi-key passthrough,
  list-of-markers recursion, scalar passthrough).
- Int8VectorHttpIT: 2 end-to-end cases spin up the HTTP server, create an
  INT8 vector index, and submit `vector.neighbors` queries via HTTP using
  both marker forms; the seed-0 record comes back as the top hit confirming
  the byte[] path is exercised.
Comparison matrix updated to drop the "HTTP/JSON wire routing tracked in
#4135" caveat - INT8 ingest is now end-to-end.
* #4134 LSMVectorIndex: consolidate constructor args into LSMVectorIndexConfig record
Replaces the 17-positional-arg primary constructor with a single
LSMVectorIndexConfig value object. The factory handler no longer needs to
post-mutate metadata.encoding after construction, so the metadata is fully
populated atomically before the instance escapes.
* HTTP int8 markers: review fixes (null/key checks, int[], lazy alloc, tests)
- {"$bytes": null} now throws IllegalArgumentException naming the marker
  and the null instead of falling through to the recursive-map branch
- {"$bytes": <non-string>} same treatment
- Map-key recursion validates instanceof String and throws a clear
  IllegalArgumentException instead of letting a hypothetical non-string key
  surface as an opaque ClassCastException
- $int8 now also accepts int[] payloads alongside List/float[]/double[] for
  completeness
- decodeTypedJsonMarkers short-circuits without allocating a fresh
  LinkedHashMap when the param map carries no nested Map/List, which is the
  normal case for non-vector queries
- Trimmed multi-paragraph Javadocs and what-not-why inline comments per
  CLAUDE.md style
Tests:
- New cases for double[] payload, empty $int8 array, empty $bytes string,
  {"$bytes": null} rejection, int[] payload, and a two-level nested-map
  recursion (sibling to the existing list-recursion test).
- Test-class headers trimmed to one-line Javadocs.
* HTTP int8 markers: URL-safe base64, long[], zero-alloc passthrough, OpenAPI
- $bytes now accepts URL-safe base64 (RFC 4648 section 5) by retrying with
  Base64.getUrlDecoder() on the standard decoder's failure. Common in ML
  tooling that base64-encodes embeddings using - and _ in place of + and /.
- $int8 now accepts long[] payloads alongside List, float[], double[], int[].
- decodeTypedJsonMarkers and the Map/List recursion arms now return the
  original reference when no entry was rewritten. A parameter map of
  scalars + plain nested maps no longer pays for a fresh LinkedHashMap
  allocation per request - only marker-bearing requests build a new map.
- Decoder split into two private helpers (decodeBytesMarker /
  decodeInt8Marker) so the dispatcher reads as a one-line switch on the
  marker key.
- OpenAPI spec for /query and /command param fields now documents the
  $bytes / $int8 marker convention so users discover it from the API
  reference instead of source code.
- Int8VectorHttpIT POSTs Content-Type: application/json explicitly.
Tests:
- New cases for long[] payload, explicit -128/127 boundary, URL-safe base64
  round-trip, and a same-reference assertion that pins the zero-allocation
  passthrough on marker-free maps.
* HTTP int8 markers: fix nested-map break, depth guard, IAE -> 400 on tx wrap
- decodeTypedJsonMarker's nested-map prefix-copy loop now uses an
  index-based break instead of reference equality on the key. The previous
  loop assumed the same Map.Entry returns the same key reference across
  iterations, which holds for HashMap/LinkedHashMap but is not part of the
  Map contract.
- Decoder recursion is bounded at 32 levels with an
  IllegalArgumentException on overflow; protects against StackOverflowError
  on hostile or accidentally deeply-nested JSON without depending on the
  upstream parser's depth limit.
- Decode call moved from mapParams to PostCommandHandler.execute() so it
  runs before the database.transaction wrapper rather than under it.
- AbstractServerHttpHandler's TransactionException catch arm now unwraps an
  IllegalArgumentException cause and returns HTTP 400, matching the
  un-wrapped catch arm. Without this, a malformed marker thrown from inside
  the transaction lambda was wrapped in a TransactionException and
  downgraded to HTTP 500 even though the underlying problem is bad client
  input.
Tests:
- New int8MarkerNullValueIsRejected gives the int8 path symmetric
  null-payload coverage (the bytes path already had it).
- New deeplyNestedPayloadIsRejected pins the 32-level depth guard.
- New Int8VectorHttpIT.int8MarkerOutOfRangeReturnsHttp400 confirms the
  IllegalArgumentException -> HTTP 400 chain end-to-end (was returning 500
  prior to the AbstractServerHttpHandler unwrap fix).
* HTTP int8 markers: simplify toInt8 guard, dedup IT helpers, ordinal test
- toInt8 drops the redundant Double.isNaN / Double.isInfinite checks. NaN
  already trips the v != Math.floor(v) guard (NaN compared with anything is
  false, so != is true). Infinity passes that guard but is caught by the
  subsequent range check, so explicit handling here was dead code. Comment
  notes both flow paths so a future reader does not accidentally re-add the
  redundancy.
- Int8VectorHttpIT now factors postQuery on top of postQueryRaw, sharing a
  single connection-setup helper and HttpResult type instead of two
  near-identical bodies.
- @tag("slow") on the IT class so CI runs that filter out slow tests skip
  the full server boot. Spinning up the HTTP server + creating an index +
  16 inserts puts the elapsed time over the multi-second threshold called
  out in CLAUDE.md.
Tests:
- New ordinalKeyMapWithMarkersIsDecoded covers the positional-array call
  shape (params keyed "0", "1", ...) that PostCommandHandler produces from a
  JSON array body. Without this, the typed-marker decoder is only exercised
  under named-key params at the unit level.
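
The two marker payloads described in these commits, `$bytes` with a URL-safe base64 retry and `$int8` with the fractional and range guards, can be sketched with the JDK alone. The helper names echo the commit text (`decodeBytesMarker`, `decodeInt8Marker`, `toInt8`), but the class, signatures, and messages below are assumptions for illustration and not the handler's actual code; only the `List` payload shape is covered.

```java
import java.util.Base64;
import java.util.List;

public final class TypedMarkerSketch {
    private TypedMarkerSketch() {
    }

    /** Decodes a $bytes payload: standard base64 first, URL-safe alphabet as a fallback. */
    public static byte[] decodeBytesMarker(final Object payload) {
        if (!(payload instanceof String))
            throw new IllegalArgumentException("$bytes expects a base64 string, got "
                + (payload == null ? "null" : payload.getClass().getSimpleName()));
        final String text = (String) payload;
        try {
            return Base64.getDecoder().decode(text);
        } catch (final IllegalArgumentException standardFailure) {
            // '-' and '_' are invalid for the standard decoder; retry with the URL-safe one.
            return Base64.getUrlDecoder().decode(text);
        }
    }

    /** Converts one numeric component to a signed byte, rejecting fractions and overflow. */
    public static byte toInt8(final double v) {
        // NaN fails here too: Math.floor(NaN) is NaN and NaN != NaN is true.
        if (v != Math.floor(v))
            throw new IllegalArgumentException("$int8 value is not an integer: " + v);
        // Infinity passes the floor check but is rejected by the range check.
        if (v < -128 || v > 127)
            throw new IllegalArgumentException("$int8 value out of range [-128, 127]: " + v);
        return (byte) v;
    }

    /** Packs a $int8 list payload into a byte[]. */
    public static byte[] decodeInt8Marker(final List<? extends Number> values) {
        if (values == null)
            throw new IllegalArgumentException("$int8 expects an array, got null");
        final byte[] out = new byte[values.size()];
        for (int i = 0; i < out.length; i++)
            out[i] = toInt8(values.get(i).doubleValue());
        return out;
    }
}
```

For example, the bytes 0xFB 0xFF encode to `+/8=` in the standard alphabet and `-_8=` in the URL-safe one, and both strings decode to the same two bytes through `decodeBytesMarker`.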
* Support for INT8 in dense-only vector indexes
* INT8 ingest: review fixes (logging, byte clamp, error clarity)
* INT8 ingest: fix rebuild paths + reject INT8 encoding+quant combo
* INT8 ingest: encoding-aware toFloatArray, log silent skips, SQL coverage
* INT8 ingest: property/encoding guard, builder typed accessor, edge tests
* INT8 ingest: -128 one-time WARNING, dedup encoding parser, SQL DDL test
…cadeData#4… (ArcadeData#4136) * HTTP: route int8 query vectors to byte[] via $bytes/$int8 markers (ArcadeData#4135) Closes the HTTP/JSON gap left by the INT8 ingest landing (ArcadeData#4132/ArcadeData#4133): clients can now send int8 query vectors that reach the engine as byte[] and trigger the encoding-aware dequantization on LSM_VECTOR indexes, rather than getting silently round-tripped through float32 and losing the 4x payload claim on the wire. Wire convention (Extended JSON-style): - {"$bytes": "<base64>"} -> byte[] decoded from base64 - {"$int8": [v0, v1, ...]} -> byte[] packed from int values in [-128, 127] The int8 form also accepts the float[] / double[] shapes that JSONObject.toMap(optimizeNumericArrays=true) produces for JSON integer arrays, with a fractional-value check that rejects non-integer floats so a caller mixing up float and int8 vectors fails loudly at the wire boundary. Implementation: - AbstractQueryHandler.decodeTypedJsonMarkers recursively walks the parsed param map and rewrites single-key {"$bytes" | "$int8": ...} objects into byte[]; multi-key maps and unrelated single-key maps pass through unchanged so existing user data with leading-$ keys is not silently transformed. - mapParams calls the decoder before its existing ordinal-vs-named routing so the byte[] flows verbatim to SQL parameter binding. Tests: - AbstractQueryHandlerTypedJsonMarkersTest: 11 unit cases pin the decoder contract (base64 decode, int list decode, float[] decode, out-of-range / non-integer / non-numeric / bad-base64 rejection, multi-key passthrough, list-of-markers recursion, scalar passthrough). - Int8VectorHttpIT: 2 end-to-end cases spin up the HTTP server, create an INT8 vector index, and submit `vector.neighbors` queries via HTTP using both marker forms; the seed-0 record comes back as the top hit confirming the byte[] path is exercised. 
Comparison matrix updated to drop the "HTTP/JSON wire routing tracked in ArcadeData#4135" caveat - INT8 ingest is now end-to-end.

* ArcadeData#4134 LSMVectorIndex: consolidate constructor args into LSMVectorIndexConfig record

Replaces the 17-positional-arg primary constructor with a single LSMVectorIndexConfig value object. The factory handler no longer needs to post-mutate metadata.encoding after construction, so the metadata is fully populated atomically before the instance escapes.

* HTTP int8 markers: review fixes (null/key checks, int[], lazy alloc, tests)

- {"$bytes": null} now throws IllegalArgumentException naming the marker and the null instead of falling through to the recursive-map branch
- {"$bytes": <non-string>} same treatment
- Map-key recursion validates instanceof String and throws a clear IllegalArgumentException instead of letting a hypothetical non-string key surface as an opaque ClassCastException
- $int8 now also accepts int[] payloads alongside List/float[]/double[] for completeness
- decodeTypedJsonMarkers short-circuits without allocating a fresh LinkedHashMap when the param map carries no nested Map/List, which is the normal case for non-vector queries
- Trimmed multi-paragraph Javadocs and what-not-why inline comments per CLAUDE.md style

Tests:
- New cases for double[] payload, empty $int8 array, empty $bytes string, {"$bytes": null} rejection, int[] payload, and a two-level nested-map recursion (sibling to the existing list-recursion test).
- Test-class headers trimmed to one-line Javadocs.

* HTTP int8 markers: URL-safe base64, long[], zero-alloc passthrough, OpenAPI

- $bytes now accepts URL-safe base64 (RFC 4648 section 5) by retrying with Base64.getUrlDecoder() on the standard decoder's failure. Common in ML tooling that base64-encodes embeddings using - and _ in place of + and /.
- $int8 now accepts long[] payloads alongside List, float[], double[], int[].
- decodeTypedJsonMarkers and the Map/List recursion arms now return the original reference when no entry was rewritten. A parameter map of scalars + plain nested maps no longer pays for a fresh LinkedHashMap allocation per request - only marker-bearing requests build a new map.
- Decoder split into two private helpers (decodeBytesMarker / decodeInt8Marker) so the dispatcher reads as a one-line switch on the marker key.
- OpenAPI spec for /query and /command param fields now documents the $bytes / $int8 marker convention so users discover it from the API reference instead of source code.
- Int8VectorHttpIT POSTs Content-Type: application/json explicitly.

Tests:
- New cases for long[] payload, explicit -128/127 boundary, URL-safe base64 round-trip, and a same-reference assertion that pins the zero-allocation passthrough on marker-free maps.

* HTTP int8 markers: fix nested-map break, depth guard, IAE -> 400 on tx wrap

- decodeTypedJsonMarker's nested-map prefix-copy loop now uses an index-based break instead of reference equality on the key. The previous loop assumed the same Map.Entry returns the same key reference across iterations, which holds for HashMap/LinkedHashMap but is not part of the Map contract.
- Decoder recursion is bounded at 32 levels with an IllegalArgumentException on overflow; protects against StackOverflowError on hostile or accidentally deeply-nested JSON without depending on the upstream parser's depth limit.
- Decode call moved from mapParams to PostCommandHandler.execute() so it runs before the database.transaction wrapper rather than under it.
- AbstractServerHttpHandler's TransactionException catch arm now unwraps an IllegalArgumentException cause and returns HTTP 400, matching the un-wrapped catch arm. Without this, a malformed marker thrown from inside the transaction lambda was wrapped in a TransactionException and downgraded to HTTP 500 even though the underlying problem is bad client input.
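Two of the hardening steps described in these commits, the URL-safe base64 retry and the recursion depth bound, can be sketched roughly as follows. The class `DecoderGuardsSketch`, its method names, and the exact off-by-one behaviour at the 32-level boundary are assumptions made for illustration; only `Base64.getDecoder()` / `Base64.getUrlDecoder()` and the limit of 32 come from the commit text.

```java
import java.util.Base64;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the base64 fallback and the depth guard; not the PR's implementation.
final class DecoderGuardsSketch {
  // The commit message states a bound of 32 levels; how the boundary is counted is assumed here.
  static final int MAX_DEPTH = 32;

  private DecoderGuardsSketch() {
  }

  // Tries the standard alphabet (+ and /) first, then retries with the URL-safe alphabet (- and _).
  // If both decoders reject the input, the second IllegalArgumentException propagates to the caller.
  static byte[] decodeBase64(final String text) {
    try {
      return Base64.getDecoder().decode(text);
    } catch (final IllegalArgumentException standardFailure) {
      return Base64.getUrlDecoder().decode(text);
    }
  }

  // Walks nested maps and lists and fails with IllegalArgumentException once the nesting
  // exceeds MAX_DEPTH, so hostile input cannot exhaust the call stack.
  static void checkDepth(final Object value, final int depth) {
    if (depth > MAX_DEPTH)
      throw new IllegalArgumentException("parameters nested deeper than " + MAX_DEPTH + " levels");
    if (value instanceof Map) {
      for (final Object child : ((Map<?, ?>) value).values())
        checkDepth(child, depth + 1);
    } else if (value instanceof List) {
      for (final Object child : (List<?>) value)
        checkDepth(child, depth + 1);
    }
  }
}
```

Throwing IllegalArgumentException in both helpers fits the HTTP 400 mapping described above, since the server then treats the failure as bad client input.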
Tests:
- New int8MarkerNullValueIsRejected gives the int8 path symmetric null-payload coverage (the bytes path already had it).
- New deeplyNestedPayloadIsRejected pins the 32-level depth guard.
- New Int8VectorHttpIT.int8MarkerOutOfRangeReturnsHttp400 confirms the IllegalArgumentException -> HTTP 400 chain end-to-end (was returning 500 prior to the AbstractServerHttpHandler unwrap fix).

* HTTP int8 markers: simplify toInt8 guard, dedup IT helpers, ordinal test

- toInt8 drops the redundant Double.isNaN / Double.isInfinite checks. NaN already trips the v != Math.floor(v) guard (NaN compared with anything is false, so != is true). Infinity passes that guard but is caught by the subsequent range check, so explicit handling here was dead code. Comment notes both flow paths so a future reader does not accidentally re-add the redundancy.
- Int8VectorHttpIT now factors postQuery on top of postQueryRaw, sharing a single connection-setup helper and HttpResult type instead of two near-identical bodies.
- @Tag("slow") on the IT class so CI runs that filter out slow tests skip the full server boot. Spinning up the HTTP server + creating an index + 16 inserts puts the elapsed time over the multi-second threshold called out in CLAUDE.md.

Tests:
- New ordinalKeyMapWithMarkersIsDecoded covers the positional-array call shape (params keyed "0", "1", ...) that PostCommandHandler produces from a JSON array body. Without this, the typed-marker decoder is only exercised under named-key params at the unit level.
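The argument for dropping the explicit NaN / Infinity checks in toInt8 is easier to verify with the two guards written out. The sketch below assumes a `double` parameter and made-up exception messages; the class name `ToInt8Sketch` is hypothetical and the real method's signature may differ.

```java
// Hypothetical illustration of the simplified toInt8 guard; not the PR's implementation.
final class ToInt8Sketch {
  private ToInt8Sketch() {
  }

  static byte toInt8(final double v) {
    // NaN path: Math.floor(NaN) is NaN, and NaN != NaN evaluates to true, so NaN is rejected here.
    if (v != Math.floor(v))
      throw new IllegalArgumentException("not an integer value: " + v);
    // Infinity path: an infinite value equals its own floor, so it passes the guard above
    // and is rejected by this range check instead.
    if (v < -128 || v > 127)
      throw new IllegalArgumentException("outside int8 range [-128, 127]: " + v);
    return (byte) (int) v;
  }
}
```

With both paths covered by the existing guards, separate Double.isNaN / Double.isInfinite branches would never fire, which is why the commit treats them as dead code.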