Skip to content

Support for INT8 in dense-only vector indexes#4133

Merged
lvca merged 6 commits into
mainfrom
vector-int8-ingest
May 7, 2026
Merged

Support for INT8 in dense-only vector indexes#4133
lvca merged 6 commits into
mainfrom
vector-int8-ingest

Conversation

@lvca
Copy link
Copy Markdown
Member

@lvca lvca commented May 7, 2026

Support for INT8 in dense-only vector indexes

@lvca lvca self-assigned this May 7, 2026
@lvca lvca added this to the 26.5.1 milestone May 7, 2026
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented May 7, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

🟢 Coverage 67.48% diff coverage · -7.74% coverage variation

Metric Results
Coverage variation -7.74% coverage variation
Diff coverage 67.48% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (ff65372) 124872 91902 73.60%
Head commit (e3bdfb9) 156348 (+31476) 102972 (+11070) 65.86% (-7.74%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4133) 123 83 67.48%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@claude
Copy link
Copy Markdown

claude Bot commented May 7, 2026

Code Review - PR #4133: Support for INT8 in dense-only vector indexes

Overview

This PR adds INT8 pre-quantized ingest support to LSM_VECTOR indexes. Users can now declare a vector property as BINARY (one signed byte per dimension) and the engine dequantizes to float32 on the read path (value / 127.0f). Storage and HTTP payload shrink 4x. HNSW build/search still runs on float32 internally, pending upstream jvector#665. The change is backwards-compatible (default remains FLOAT32).


Positive aspects

  • Clean design - VectorEncoding enum is well-documented with a clear distinction from VectorQuantizationType
  • Backwards compatibility - FLOAT32 default means existing indexes need zero changes
  • JSON serialization omits default - toJSON() only writes encoding when it differs from FLOAT32 (no bloat in existing schemas)
  • Persistence tested - int8EncodingSurvivesReopen verifies encoding survives a database close/reopen cycle
  • Good test breadth - 5 test cases cover encoding metadata, key type flipping, end-to-end ingest, float32/INT8 parity, and error rejection

Issues and suggestions

1. Silent data loss in putBatch() (medium)

LSMVectorIndex.java around line 211:

} catch (final IllegalArgumentException e) {
    continue;
}

putBatch() silently skips vectors that fail type conversion, while put() rethrows. This inconsistency can silently drop indexed vectors with no log output. At minimum add a LogManager.instance().log(...) warning here so operators can detect when data is being skipped.

2. Post-construction metadata mutation (low / design concern)

LSMVectorIndex.java in LSMVectorIndexFactoryHandler.create():

index.metadata.encoding = vectorBuilder.encoding;

The inline comment acknowledges this avoids "ballooning the already-17-arg primary constructor." The workaround is pragmatic, but mutating a public field after construction means the object is briefly in an inconsistent state (FLOAT32 encoding, but possibly an INT8 builder). Since fromJSON() already reads encoding correctly on reopen, the risk is low in practice - but it is worth tracking as tech debt.

3. Byte value -128 falls outside the dequantization convention (low)

VectorUtils.dequantizeInt8ToFloat() uses value / 127.0f. A Java byte can hold -128, which maps to approximately -1.0079f - slightly outside the [-1, 1] range that COSINE similarity assumes (and that Cohere/OpenAI int8 endpoints never produce). If a caller passes a raw byte[] not from a Cohere/OpenAI source, COSINE similarity could yield subtly incorrect results. Consider either:

  • clamping: Math.max(-127, b[i]) / 127.0f
  • or documenting that -128 is invalid input for INT8 encoding

4. putBatch() exception path drops the cause

} catch (final IllegalArgumentException e) {
    continue;  // cause e is swallowed
}

If the silent-skip behavior is intentional, at least preserve the message for diagnostics.

5. No @Tag("slow") on the test class

The int8AndFloat32ParityOnSameLogicalVectors test builds two indexes with 50 vectors in a single transaction. It's probably fast enough for CI, but the CLAUDE.md convention asks for @Tag("slow") on tests that take noticeably long. Worth reviewing against the CI timing budget.

6. Missing dimension check for byte[] input before toFloatArray()

After VectorUtils.toFloatArray() converts a byte[] to float[], the dimension check fires on the resulting float array. The length is the same (one float per byte), so the check is correct - but the error message could be clearer if the dimension mismatch originates from a byte array. Minor, but a "got byte[] of length N, expected dimensions M" message would be more actionable.


Minor nits

  • VectorEncoding.java Javadoc says ARRAY_OF_BYTES in the INT8 bullet, but getKeyTypes() returns Type.BINARY. BINARY is the correct ArcadeDB type; the Javadoc should match.
  • The comparison table entry in arcadedb-vs-leading-vector-dbms.md is now quite long for a table cell; consider a footnote to keep the table readable.

Summary

The core implementation is solid and the backwards-compatibility story is clean. The main actionable item before merge is the silent data loss in putBatch() (a log warning at minimum). The post-construction metadata mutation and the -128 boundary case are lower priority but worth a comment or follow-up issue.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for INT8 pre-quantized vector ingest, enabling a 4x reduction in storage and bandwidth for compatible embedding providers. Key changes include the addition of a VectorEncoding enum, dequantization logic in VectorUtils, and updates to LSMVectorIndex and its builders to handle INT8 encoding via BINARY columns. Feedback identifies a regression in SQLFunctionVectorNeighbors where non-string vertex identifiers (like RID) are no longer correctly handled for lookups. Additionally, it was noted that several internal methods in LSMVectorIndex still use a deprecated conversion utility, which will cause INT8 vectors to be skipped during graph rebuilding and validation.

// Key is a vertex identifier - fetch the vertex and get its vector
final String keyStr = key.toString();
// Vertex-id lookup path: if the key is a string and not a literal vector, fetch the stored vector property.
if (key instanceof String keyStr) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The change to key instanceof String introduces a regression for non-string vertex identifiers. Previously, any object (such as a RID) was converted to a string via toString() and used for the lookup. Now, passing a RID object will cause the function to fall through to VectorUtils.toFloatArray(key), which will throw an IllegalArgumentException as it does not support Identifiable types. To restore compatibility, the check should include Identifiable.

Suggested change
if (key instanceof String keyStr) {
if (key instanceof String || key instanceof Identifiable) {
final String keyStr = key.toString();


final float[] queryVector;
try {
queryVector = VectorUtils.toFloatArray(keys[0]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While get and put have been updated to use VectorUtils.toFloatArray, several other locations in this class (specifically in ensureGraphAvailable and buildGraphFromScratchWithRetry, e.g., lines 909, 1175, 1326) still use the deprecated VectorUtils.convertToFloatArray. This will cause INT8 encoded vectors (stored as byte[]) to be skipped during graph rebuilding and validation because convertToFloatArray returns null for byte[]. All internal usages should be migrated to toFloatArray to ensure full support for the new encoding.

- putBatch: log WARNING with rid + cause when a batch entry's vector type
  conversion or dimension check fails, instead of silently dropping the row
  during commit replay
- dequantizeInt8ToFloat: clamp byte -128 to -127 so a non-Cohere/OpenAI int8
  source still produces values in [-1, 1] for COSINE similarity
- LSMVectorIndex.put: dimension mismatch error now names the input type and
  length ("got byte[] of length N, expected dimensions M")
- VectorEncoding Javadoc: correct ARRAY_OF_BYTES to BINARY (the Java byte[]
  ArcadeDB type), drop em-dash
- LSMVectorIndexFactoryHandler: clarify in the comment that the encoding
  field assignment is safe (factory has not published the reference yet)
  and flag the constructor-arg sprawl as tracked tech debt
- comparison matrix: trim INT8 row down to one line
@claude
Copy link
Copy Markdown

claude Bot commented May 7, 2026

Code Review - PR #4133: Support for INT8 in dense-only vector indexes

Overview

This PR adds INT8 pre-quantized ingest support to LSM_VECTOR indexes. Users declare the vector property as BINARY (one signed byte per dimension), and the engine dequantizes to float32 on the read path using value / 127.0f (Cohere/OpenAI calibration). HTTP payloads and document bucket storage shrink 4x. HNSW build/search still runs on float32 internally, pending upstream jvector#665.

The design is clean and backwards-compatible. The VectorEncoding enum is well documented, the JSON serialization omits the default to avoid bloat in existing schemas, and the test covers encoding metadata, key-type flipping, end-to-end ingest, float32/INT8 parity, and error rejection.


Critical - Index rebuild path misses INT8 vectors

Five callers still use the now-deprecated VectorUtils.convertToFloatArray(), which returns null for byte[] rather than dequantizing:

File Line Path
LSMVectorIndex.java 913 HNSW validation pass - marks every INT8 vector as invalid
LSMVectorIndex.java 1179 Document-scan fallback during rebuild - silently skips INT8 vectors
LSMVectorIndex.java 1330 Vector preload for HNSW build - silently skips INT8 vectors
GrowableVectorValues.java 134 Lazy-load fallback - silently skips INT8 vectors
ArcadePageVectorValues.java 230 Page vector loading - logs a false WARNING for every INT8 vector

The put/putBatch paths were correctly updated to toFloatArray(), but these rebuild/recover paths were not. The practical consequence is that an INT8 index that exceeds mutationsBeforeRebuild (or is reopened after a crash that forces a full rebuild) will silently lose all its vectors - the HNSW graph is rebuilt from an empty set. The ArcadePageVectorValues case also floods the log with incorrect warnings on every vector access.

All five sites need a one-line change from VectorUtils.convertToFloatArray(vectorObj) to VectorUtils.toFloatArray(vectorObj), plus a null-check replacement with try/catch (since toFloatArray throws rather than returns null).

A test that inserts INT8 vectors, triggers a rebuild (e.g. by setting mutationsBeforeRebuild(1) and doing two inserts), and then queries would catch this regression class.


Medium - toFloatArray silently dequantizes any byte[]

VectorUtils.toFloatArray(Object) now dispatches to dequantizeInt8ToFloat for any byte[] input, regardless of whether the caller is in an INT8 index context. Any existing code that passes a byte[] through this utility (e.g. binary keys, raw storage objects) will now receive silently scaled floats instead of an error. If none of the other callers of toFloatArray will ever pass a raw non-INT8 byte[], this is fine - but it is worth a comment at the method level making that assumption explicit.


Low - INT8 encoding and scalar/INT8 quantization are not validated together

A caller can construct an index with both withEncoding(INT8) and withQuantization(VectorQuantizationType.INT8). The index would then dequantize from byte[] to float at ingest, and then re-quantize to INT8 internally for HNSW storage. There is no error or warning to signal this is likely a mistake. At a minimum, a check in the builder's create() or the factory handler would prevent silent double-processing.


Minor notes

  • The post-construction index.metadata.encoding = vectorBuilder.encoding mutation is correctly documented as tech debt - no action needed now.
  • The -128 clamping in dequantizeInt8ToFloat is correct and the Javadoc explains the calibration convention well.
  • The int8EncodingSurvivesReopen test is a good addition and correctly validates JSON round-trip persistence.
  • The docs update accurately reflects the shipped/pending state.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 57.72358% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.57%. Comparing base (ff65372) to head (e3bdfb9).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...java/com/arcadedb/index/vector/LSMVectorIndex.java 54.54% 17 Missing and 8 partials ⚠️
...m/arcadedb/schema/BucketLSMVectorIndexBuilder.java 20.00% 8 Missing ⚠️
...om/arcadedb/index/vector/GrowableVectorValues.java 0.00% 7 Missing ⚠️
...unction/sql/vector/SQLFunctionVectorNeighbors.java 50.00% 4 Missing and 1 partial ⚠️
.../arcadedb/index/vector/ArcadePageVectorValues.java 25.00% 2 Missing and 1 partial ⚠️
...com/arcadedb/schema/TypeLSMVectorIndexBuilder.java 75.00% 2 Missing and 1 partial ⚠️
...in/java/com/arcadedb/index/vector/VectorUtils.java 93.75% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4133   +/-   ##
=======================================
  Coverage   64.56%   64.57%           
=======================================
  Files        1635     1636    +1     
  Lines      124872   124947   +75     
  Branches    26670    26682   +12     
=======================================
+ Hits        80628    80679   +51     
- Misses      33000    33016   +16     
- Partials    11244    11252    +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Critical fix: five rebuild/recover sites still called the deprecated
VectorUtils.convertToFloatArray, which returns null for byte[]. An INT8
index reopened from disk would silently rebuild the HNSW graph from an
empty vector set:

- LSMVectorIndex.java HNSW validation pass (was marking every INT8 vector
  as invalid)
- LSMVectorIndex.java document-scan rebuild fallback (silently skipping
  INT8 vectors)
- LSMVectorIndex.java preload-for-HNSW pass (silently skipping INT8 vectors)
- GrowableVectorValues lazy-load fallback (silently skipping INT8 vectors)
- ArcadePageVectorValues page vector reading (false WARNING on every INT8
  vector access)

All five now route through VectorUtils.toFloatArray with appropriate
try/catch to preserve the prior null/skip semantics.

Other review fixes:

- VectorUtils.toFloatArray Javadoc spells out that a byte[] input is
  unconditionally dequantized via dequantizeInt8ToFloat (Cohere/OpenAI
  calibration); callers outside an INT8 context must convert first.
- LSMVectorIndexFactoryHandler rejects (encoding=INT8, quantization=INT8)
  with a clear error: the property is already byte-quantized at the wire
  level, so internal INT8 scalar quantization would re-quantize the
  dequantized floats - silent double-processing.

Tests:
- int8VectorsSurviveRebuildAfterReopen: regression guard. Inserts N int8
  vectors, runs a search to materialise the graph, reopens the database
  to force a rebuild from disk, asserts the same query still returns hits.
- rejectsInt8EncodingPlusInt8Quantization: validates the new factory-
  handler guard.
@claude
Copy link
Copy Markdown

claude Bot commented May 7, 2026

Code Review: INT8 Dense-Only Vector Index Support (#4133)

Overview

This PR adds end-to-end INT8 pre-quantized ingest for dense vector indexes. Callers can set withEncoding(VectorEncoding.INT8) on the index builder; the document property is declared as BINARY (one byte per dimension), reducing HTTP payload and bucket storage 4x vs. the float32 default. On the read path, bytes are dequantized to float32 via value / 127.0f (Cohere/OpenAI calibration convention) before being passed to JVector's HNSW internals.

The design is sound and the approach is backward-compatible. Test coverage is good. A few issues and observations follow.


Issues

1. Silent failure in GrowableVectorValues - inconsistent with sibling class

In GrowableVectorValues.java (fall-back document lookup path), when toFloatArray throws, the exception is silently swallowed:

} catch (final IllegalArgumentException ignored) {
    // unsupported vector type for this index encoding - leaves vector null
}

But ArcadePageVectorValues.java logs a Level.WARNING for the same situation. Silent drops make operational triage very hard - an operator whose INT8 vectors are being silently skipped during search has no log signal to investigate. At minimum, add a DEBUG-level log; a WARNING (matching the sibling) would be better.

2. toFloatArray(Object) unconditionally dequantizes byte[] - public API footgun

The Javadoc already flags this:

Callers that are NOT in an INT8 index context must not pass a raw byte[] into this method

But the method is public static, so any code in the engine or downstream that encounters a byte[] and routes it through toFloatArray will silently get scaled floats instead of an error. The method's own Javadoc says to convert up-front instead, but callers cannot easily know what the method will do unless they read that warning.

A safer alternative: keep toFloatArray(Object) without the byte[] branch (for non-INT8 contexts), and add a toFloatArray(Object, VectorEncoding) overload that applies dequantization only when encoding == INT8. Internal call sites already have access to metadata.encoding, so the signature change is minimal.

3. rejectsInt8EncodingUnknownName test - exception type not asserted

int8EncodingRejectsUnknownName only checks hasMessageContaining("Invalid vector encoding"). Adding .isInstanceOf(IndexException.class) would pin the contract more precisely and catch any future wrapping that changes the exception type.


Observations / Minor Points

4. Encoding-after-construction in factory handler

final LSMVectorIndex index = new LSMVectorIndex(...);
index.metadata.encoding = vectorBuilder.encoding;  // patched post-construction
return index;

The comment explains the 17-arg constructor constraint and that the brief inconsistent window is unobservable. That reasoning is correct for now. A TODO linking to a follow-up issue for the constructor-to-config-record refactor would keep the tech debt visible without blocking this PR.

5. Does convertToFloatArray still exist?

The PR replaces many call sites from the old null-returning convertToFloatArray to the exception-throwing toFloatArray, but the diff for VectorUtils.java does not show convertToFloatArray being removed. If it still exists, it should be deprecated (@Deprecated) or removed to prevent future callers from accidentally using the null-returning variant on INT8 paths.

6. SQL function layer not tested for byte[] query vectors

extractQueryVector in SQLFunctionVectorNeighbors was refactored to call VectorUtils.toFloatArray and now handles byte[] inputs. However, LSMVectorIndexInt8IngestTest only exercises the index API directly (via lsm.get() / lsm.findNeighborsFromVector()). Adding a test that runs vector_neighbors(...) via SQL with a byte[] query vector would close the gap for the SQL path.

7. -128 clamping in dequantizeInt8ToFloat is Cohere/OpenAI-specific

The Javadoc documents this well. Worth noting in the public VectorEncoding.INT8 Javadoc as well (currently only mentions value / 127.0f without explaining the -128 clamp) so callers from non-Cohere/OpenAI providers understand the silent data modification risk.


Positives

  • Clean separation between encoding (wire/storage) and quantization (HNSW-internal) - the guard rejecting the encoding=INT8 + quantization=INT8 combination is particularly good
  • Metadata round-trips correctly: toJSON / fromJSON / withMetadata all updated consistently; backward-compat preserved by omitting encoding for FLOAT32
  • int8EncodingSurvivesReopen and int8VectorsSurviveRebuildAfterReopen directly guard the regression paths from the original bug (rebuild from disk silently dropping int8 vectors)
  • Error messages in put, get, and putBatch are detailed enough to debug dimension mismatches
  • Documentation quality is high throughout (enum Javadoc, builder methods, comparison table in arcadedb-vs-leading-vector-dbms.md)

Summary

The implementation is correct and well-tested. The primary issues to address before merge are the silent failure in GrowableVectorValues (#1) and confirming whether convertToFloatArray still exists and needs deprecating (#5). The toFloatArray footgun (#2) is the most architecturally risky long-term but can be addressed as follow-on work if preferred.

The bare toFloatArray(Object) used to dequantize any byte[] unconditionally,
which is a footgun for non-vector callers that happen to encounter a byte[]
value. Split into two methods so the safe behaviour is the default:

- toFloatArray(Object) now rejects byte[] with a clear pointer to the
  encoding-aware overload or to dequantizeInt8ToFloat
- toFloatArray(Object, VectorEncoding) dequantizes byte[] only when
  encoding == INT8; for any other encoding it delegates to the strict
  variant, so a stray byte[] in a FLOAT32 index is rejected up front

All internal callers that have access to the index encoding switched to
the new overload: LSMVectorIndex put / putBatch / get / HNSW validation /
document-scan rebuild / preload, GrowableVectorValues lazy-load fallback,
ArcadePageVectorValues page reader, and SQLFunctionVectorNeighbors
extractQueryVector. SQL math/utility functions (vector.add,
vector.cosineSimilarity, etc.) keep calling the strict variant via
SQLFunctionAbstract.toFloatArray and therefore reject byte[] - users must
dequantize first.

Other follow-ups in this batch:

- GrowableVectorValues now logs a WARNING when an unsupported vector type
  is encountered during the document-lookup fallback, matching the sibling
  ArcadePageVectorValues path. Silent drops made operational triage hard.
- VectorEncoding.INT8 Javadoc documents the -128 calibration clamp so
  callers from non-Cohere/OpenAI providers know about the silent numeric
  correction at the [-128] edge case.
- LSMVectorIndexFactoryHandler comment now carries a TODO marker for the
  constructor-to-config-record refactor that would let encoding move back
  into construction.
- Removed VectorUtils.convertToFloatArray entirely - it had no callers
  after the rebuild-path fix and the deprecated null-returning variant
  was the source of the silent-INT8-drop class of bugs we just closed.
  Updated the lone test caller (LSMVectorIndexStorageBenchmark) to use
  toFloatArray.
- Added .isInstanceOf(IndexException.class) check on the unknown-encoding
  rejection test to pin the exception type contract.
- New sqlVectorNeighborsAcceptsByteArrayQuery test exercises the SQL
  `vector.neighbors` entry point with a byte[] query against an INT8
  index, closing the SQL-surface coverage gap.
@claude
Copy link
Copy Markdown

claude Bot commented May 7, 2026

Code Review - PR #4133: Support for INT8 in dense-only vector indexes

Overview

This PR adds end-to-end support for pre-quantized INT8 vector ingest in dense-only LSM_VECTOR indexes. Vectors stored as BINARY (one signed byte per dimension) are dequantized to float32 on the read path via value / 127.0f (Cohere/OpenAI calibration convention) before being handed to JVector. Storage and HTTP payload shrink 4x vs FLOAT32. The approach is correct given the current JVector limitation (no native int8 HNSW until datastax/jvector#665).


Potential Bugs / Issues

1. Post-construction mutation in LSMVectorIndexFactoryHandler.create() (LSMVectorIndex.java)

final LSMVectorIndex index = new LSMVectorIndex(...);
index.metadata.encoding = vectorBuilder.encoding;  // post-construction mutation
return index;

The PR comment correctly notes the mutation window is unobservable today, but the safety argument relies on the index not being published before create() returns. This breaks if the constructor registers the index with any listener or cache that might read metadata.encoding during construction. The TODO(#4132 follow-up) comment is good - this should be elevated to a tracked issue or follow-up PR so it doesn't get forgotten.

2. No property-type vs. encoding consistency check

There is no validation that the declared document property type matches the requested encoding. For example, a user can call:

docType.createProperty("embedding", Type.ARRAY_OF_FLOATS);   // float property
builder.withEncoding(VectorEncoding.INT8);                    // INT8 encoding

The index will be created successfully. When put() is later called with a float[], toFloatArray(float[], INT8) will silently return the raw floats without dequantization (since it only dequantizes byte[]), but the stored document value will be float[]. Conversely, a user declaring Type.BINARY with FLOAT32 encoding will get confusing byte[] rejection errors at query time. A guard in LSMVectorIndexFactoryHandler.create() (or in the builders) that checks propertyType == BINARY ↔ encoding == INT8 would surface the misconfiguration early.

3. dequantizeInt8ToFloat bound check uses int8[i] < -127 with Java byte

This works correctly because Java promotes byte to int before comparison - but it's worth a second glance. Java byte range is [-128, 127], so int8[i] is -128 when the sentinel case applies, which compares correctly as int. No bug, but a clarifying cast (int) int8[i] or an explicit note in the code would make the intent clearer.

4. TypeLSMVectorIndexBuilder.withEncoding(VectorEncoding) performs an unchecked cast

((LSMVectorIndexMetadata) metadata).encoding = encoding;

If metadata is ever not an LSMVectorIndexMetadata (e.g., the builder is subclassed or metadata is set differently), this silently throws ClassCastException. An instanceof guard or a typed accessor on IndexMetadata would be safer.


Performance Considerations

5. try/catch in tight build loops

In buildGraphFromScratchWithRetry, the encoding-aware conversion is wrapped in a try/catch per vector entry. Java's exception machinery is fast on the happy path (no-exception), but adding a catch handler does increase the size of the method's exception table and can affect JIT optimization for the loop. The fix is correct and the cold-path behavior is the right trade-off, but it is worth noting for very large batch builds.

6. dequantizeInt8ToFloat allocates a new float[] on every call

This is unavoidable given JVector's current float32 contract, but every HNSW graph traversal step that reads a stored vector will allocate a new float array. Callers that read the same vector multiple times (e.g., during graph construction where a seed vector is re-visited) will re-dequantize it each time. If performance-sensitive paths can cache the dequantized form (similar to the existing preloadedVectors map), that would reduce allocations on the hot path.


Code Quality

7. LSMVectorIndexMetadata is missing the Apache 2.0 license header

The existing file (LSMVectorIndexMetadata.java) does not appear to have the standard project copyright header. New file VectorEncoding.java has it correctly. This may be a pre-existing issue, but it is worth verifying and fixing.

8. Comment verbosity in production code

Several comments added inside method bodies (e.g., ArcadePageVectorValues.java, GrowableVectorValues.java, LSMVectorIndex.java) are longer than the project style guide recommends. The VectorEncoding.java Javadoc and VectorUtils.java method Javadocs are excellent and necessary. However, inline block comments inside methods explaining JVector internals could be trimmed - the key invariant is captured in the VectorEncoding Javadoc, which is the canonical place for it.


Test Coverage

The test suite (LSMVectorIndexInt8IngestTest) is thorough and well-structured. A few gaps:

9. Missing @Tag("slow") on int8AndFloat32ParityOnSameLogicalVectors

This test builds two HNSW graphs and runs two neighbor searches. Per CLAUDE.md, functional regression tests that take noticeably long need @Tag("slow"). Even with only 50 vectors, the graph construction path involves non-trivial work.

10. No unit test for dequantizeInt8ToFloat edge cases

The -128 clamp is a non-obvious invariant (documented carefully in both Javadocs) but has no dedicated test. Recommend adding a unit test in VectorUtilsTest (or alongside the new test class) covering at minimum: input -128 produces -127/127.0f, input -127 produces -1.0f, input 0 produces 0.0f, input 127 produces 1.0f.

11. No test for toFloatArray(byte[], FLOAT32) rejecting byte[]

The deliberate rejection of byte[] in a FLOAT32 context is a key safety contract (prevents silent mis-scaling in non-INT8 contexts). A test that confirms toFloatArray(new byte[]{1,2,3}, VectorEncoding.FLOAT32) throws IllegalArgumentException would pin this contract.


Minor Nits

  • int8EncodingFlipsKeyTypesToBinary test: the assertion assertThat(lsm.getKeyTypes()).containsExactly(Type.BINARY) is good; also asserting that getKeyTypes() returns Type.ARRAY_OF_FLOATS for the default case in defaultEncodingIsFloat32 is already covered.
  • The quantize() helper in the test clamps to [-127, 127] (matching the dequantizer's convention) - this is correct.
  • LSMVectorIndex.java error message improvement at put(): "Vector dimension does not match index dimension " + metadata.dimensions + ": got " + keys[0].getClass().getSimpleName() + " of length " + vector.length is a clear improvement over the previous message.

Summary

The feature is well-implemented and the core logic is correct. The main items to address before merge:

  1. Required: Add property-type vs. encoding consistency validation at index creation time.
  2. Required: Add @Tag("slow") to int8AndFloat32ParityOnSameLogicalVectors.
  3. Recommended: Add unit tests for dequantizeInt8ToFloat edge cases and the FLOAT32 rejection of byte[].
  4. Recommended: Elevate the post-construction mutation TODO to a tracked issue.
  5. Optional: Trim inline block comments in production methods; the Javadocs are sufficient.

🤖 Generated with Claude Code

Substantive: add a property-type vs encoding consistency check in the
LSMVectorIndexFactoryHandler. A user could previously combine
encoding=INT8 with an ARRAY_OF_FLOATS property (or encoding=FLOAT32 with a
BINARY property) and only see the failure surface as silent
mis-interpretation at put() / query time. The new guard rejects both
mismatches at create() with a clear error pointing at the right knob.

Other review follow-ups:

- TypeLSMVectorIndexBuilder.withEncoding now goes through a private
  vectorMetadata() typed accessor that performs an instanceof guard,
  turning a metadata-swap subclass scenario into a clear IndexException
  instead of an unrelated ClassCastException.
- VectorUtils.dequantizeInt8ToFloat now uses an explicit (int) cast on
  the byte before the bound check; same behaviour, clearer intent.
- LSMVectorIndexMetadata license header moved above the package
  declaration (was after the imports).
- Trimmed verbose inline comments added in the previous review pass; the
  canonical Javadocs in VectorEncoding and VectorUtils carry the detail.
- Constructor-refactor TODO now references the freshly-filed tracker
  issue #4134 instead of an inline TODO comment, so the tech debt is
  visible outside the source file.

Test coverage:

- New VectorUtilsTest covers dequantize edge cases (-128 clamps to -127,
  -127 -> -1.0f, 0 -> 0.0f, 127 -> 1.0f, empty input), the strict
  toFloatArray rejecting byte[], the encoding-aware variant rejecting
  byte[] under FLOAT32 and dequantizing under INT8, and float[] passthrough.
- LSMVectorIndexInt8IngestTest gains rejectsInt8EncodingWithFloatPropertyType
  and rejectsFloat32EncodingWithBinaryPropertyType to pin the new
  property/encoding consistency check.
@claude
Copy link
Copy Markdown

claude Bot commented May 7, 2026

Code Review - PR #4133: Support for INT8 in dense-only vector indexes

Overview

This PR adds end-to-end INT8 (signed 8-bit integer) pre-quantized ingest support for LSM_VECTOR indexes, enabling 4x storage/bandwidth reduction for clients that already produce int8 embeddings (Cohere, OpenAI reduced-precision endpoints). The implementation introduces a VectorEncoding enum to separate wire/storage encoding from the existing internal VectorQuantizationType, updates all vector read/write paths to be encoding-aware, and adds a builder-time guard against the double-quantization trap.

Overall this is a well-structured, clearly motivated feature with good test coverage. A few items worth addressing:


Issues

1. Post-construction encoding mutation (known tech debt, needs tracking)

In LSMVectorIndex.LSMVectorIndexFactoryHandler.create():

final LSMVectorIndex index = new LSMVectorIndex(...);
index.metadata.encoding = vectorBuilder.encoding;  // assigned after construction
return index;

The comment correctly notes that the window is unobservable today because the reference hasn't been published yet, and #4134 is filed. However, encoding is a mutable public field on a mutable object - if the constructor were ever refactored to call a method that reads metadata.encoding (e.g. during index initialisation or a virtual dispatch to a subclass), it would silently see FLOAT32. The fix (folding encoding into the constructor or into LSMVectorIndexMetadata.fromJSON) would eliminate the fragility; until then the comment is load-bearing.

2. Missing HTTP/JSON deserialization path test

The PR description states "HTTP payload + bucket storage shrink 4x", but there is no test or code showing how a raw HTTP client sends an INT8 query vector. JSON has no native byte[] type - clients typically send either a JSON array of integers (e.g. [0, 64, 127, -1]) or a base64 blob. The HTTP handler (PostCommandHandler / PostQueryHandler) and the SQL parameter binder both need to route a JSON array of integers through to byte[] so the encoding-aware path kicks in. Without a test or documented convention, operators shipping this feature will find the 4x claim only materialises for the storage side, not for the wire round-trip from HTTP clients.

3. Silent -128 clamping is a semantic change, not just a safety clamp

dequantizeInt8ToFloat clamps byte -128 to -127 before dividing by 127.0f, yielding exactly -1.0f. The Javadoc explains this correctly. However, for DOT_PRODUCT similarity (where the result is not normalised by vector magnitude), the distinction between -128/127f ≈ -1.0079f and -1.0f is meaningful - clamping introduces an asymmetric floor that does not exist for any other value in the range. If a caller sources vectors from a tool that emits values in [-128, 127] (rather than Cohere/OpenAI's [-127, 127]), every -128 dimension gets silently nudged. Consider exposing the calibration convention as a parameter (divisor, clampMin) or at minimum logging a one-time warning the first time a -128 byte is encountered.

4. getKeyTypes() change may break callers that switch on ARRAY_OF_FLOATS

return new Type[] {
    metadata.encoding == VectorEncoding.INT8 ? Type.BINARY : Type.ARRAY_OF_FLOATS
};

Any code that calls index.getKeyTypes() and switches or compares against Type.ARRAY_OF_FLOATS (e.g., index routing, schema validation, or HA replication serialisation) will now see Type.BINARY for INT8 indexes and may fall into an unhandled branch. A quick grep across the engine for getKeyTypes callers and ARRAY_OF_FLOATS comparisons would confirm this is safe - it may well be that only the vector index itself uses this, but it is worth verifying.

5. Silent vector drop in GrowableVectorValues.getVector()

} catch (final IllegalArgumentException e) {
    LogManager.instance().log(this, Level.WARNING, ...);
}
// vector remains null → falls through to: if (vector != null && ...)

When the encoding-aware conversion fails, the ordinal is silently treated as having no vector, which causes it to be excluded from the HNSW graph during rebuild. The regression test int8VectorsSurviveRebuildAfterReopen guards the happy path, but there is no test for what happens when a single record in a batch has a corrupt/wrong-type property - the HNSW graph is built from a partially empty set with no visible error beyond a WARNING in the log. This is consistent with how ArcadePageVectorValues handles the same case, so it's at least coherent, but documenting this as intentional (or adding a counter/metric) would help operators notice silent data loss.


Minor / Non-blocking Notes

  • Duplication in withEncoding(String): The string-to-enum validation logic is copy-pasted identically in both TypeLSMVectorIndexBuilder and BucketLSMVectorIndexBuilder. A shared static helper or delegating BucketLSMVectorIndexBuilder.withEncoding(String) to the enum would remove the duplication.

  • No SQL DDL test: The PR description mentions WITH encoding INT8 is available in CREATE INDEX SQL metadata, but no test exercises the SQL DDL path end-to-end (parse → builder → withMetadata(JSONObject) → factory). The existing Java builder tests are solid but the SQL surface could silently break if the JSON key name drifts.

  • toFloatArrayPassesThroughFloatArrayUnchanged tests all three overloads for INT8 encoding - the assertion that VectorUtils.toFloatArray(in, VectorEncoding.INT8) returns the same reference as in when in is a float[] relies on the else branch delegating to toFloatArray(Object) which identity-returns float[]. This is correct today but fragile if the INT8 branch is ever widened. The test is valuable; a comment in the test explaining why INT8 returns the same float[] reference (it's a pass-through for non-byte[] inputs) would prevent future confusion.

  • putBatch silently continues on conversion error: For the ComparableVector branch a bad conversion throws and propagates, but for a raw byte[] in a non-INT8 index the catch swallows and logs. The asymmetry (exception vs WARNING+continue) between put() and putBatch() is intentional for replay resilience, but a comment making that explicit would help the next reader.


Test Coverage

The test suite is thorough:

  • defaultEncodingIsFloat32 - regression guard for the default
  • int8EncodingFlipsKeyTypesToBinary - key type contract
  • int8IngestEndToEnd - byte[] round-trip via both findNeighborsFromVector and IndexInternal#get
  • int8AndFloat32ParityOnSameLogicalVectors - recall parity check
  • int8EncodingSurvivesReopen - persistence
  • rejectsInt8EncodingWithFloatPropertyType / rejectsFloat32EncodingWithBinaryPropertyType - builder-time guards
  • rejectsInt8EncodingPlusInt8Quantization - double-quantization guard
  • int8VectorsSurviveRebuildAfterReopen - graph rebuild regression
  • sqlVectorNeighborsAcceptsByteArrayQuery - SQL surface

Missing: SQL DDL path, HTTP deserialization path, and the silent -128 clamping behaviour at DOT_PRODUCT similarity.


Summary

The feature is well-implemented, backwards-compatible, and clearly motivated. The four issues above (post-construction mutation tracking, HTTP path, -128 clamping semantics, and getKeyTypes caller audit) should be addressed before merge; the rest are suggestions for robustness.

- dequantizeInt8ToFloat now logs a one-time WARNING the first time it
  encounters byte -128 (clamped silently to -127). Cohere/OpenAI int8
  endpoints emit [-127, 127] only, so a -128 byte indicates a non-Cohere
  source where the asymmetric clamp matters for DOT_PRODUCT in particular.
  Atomic gate keeps the warning to once per process.
- New VectorEncoding.fromString centralizes the FLOAT32/INT8 string parser
  used by TypeLSMVectorIndexBuilder, BucketLSMVectorIndexBuilder, and the
  schema-JSON load path; the supported-values list cannot drift between
  the three call sites anymore.
- HTTP/JSON wire routing for INT8 query vectors filed as #4135 (JSON
  arrays of integers / base64 strings -> byte[] in HTTP handlers); the
  comparison matrix entry now points at it so readers do not assume the
  4x payload claim covers HTTP clients today.
- putBatch Javadoc explicitly documents the put-vs-putBatch failure
  asymmetry: put throws on bad keys, putBatch logs WARNING and continues
  to avoid aborting an entire commit-replay batch on one bad row.
- New SQL DDL test pins `CREATE INDEX ... LSM_VECTOR METADATA
  {encoding: 'INT8'}` end-to-end so a JSON key-name drift would surface.
- VectorUtilsTest.toFloatArrayPassesThroughFloatArrayUnchanged now carries
  a comment explaining why all three overloads return the same float[]
  reference, so the identity assertion is not mistaken for an over-spec.
@lvca lvca merged commit ae224e9 into main May 7, 2026
11 of 14 checks passed
@claude
Copy link
Copy Markdown

claude Bot commented May 7, 2026

Code Review - PR #4133: INT8 Dense Vector Index Support

Overview

This PR adds pre-quantized INT8 ingest to LSM_VECTOR dense indexes. A VectorEncoding enum cleanly separates wire/storage encoding from internal VectorQuantizationType, the read path dequantizes byte[] → float[] via the Cohere/OpenAI value / 127.0f convention, and all existing vector paths are updated. Storage and HTTP payload shrink 4x. The change is fully backwards-compatible (FLOAT32 default). JVector still runs on float32 internally; native int8 HNSW is deferred to datastax/jvector#665.


What's Done Well

  • Clean abstraction - VectorEncoding is well-documented and the distinction from VectorQuantizationType is clearly explained
  • Early validation - builder-time checks for encoding/property-type mismatch and the double-quantization trap (encoding=INT8 + quantization=INT8) surface misuse before data is written
  • Good -128 handling - the one-time WARNING via DEQUANTIZE_MIN128_WARNED.compareAndSet avoids log spam while giving operators a signal
  • Comprehensive tests - LSMVectorIndexInt8IngestTest covers the end-to-end round-trip, persistence across reopen, rebuild regression, SQL DDL surface, and the SQL vector.neighbors function
  • VectorUtilsTest pins edge cases - boundary values, empty array, byte[] rejection in strict mode, and the pass-through contract for float[]

Issues

1. Post-construction metadata.encoding mutation (acknowledged, should be tracked)

final LSMVectorIndex index = new LSMVectorIndex(...);
index.metadata.encoding = vectorBuilder.encoding;  // set after ctor
return index;

The comment correctly argues the window is unobservable because the reference isn't published until create() returns. However, if the constructor is ever refactored to call a virtual method or register a listener that reads metadata.encoding, it would silently see FLOAT32. The comment references #4134 for consolidating the positional args - please confirm that issue is filed and linked so this doesn't get lost.

2. Property-type check is skipped when the property isn't declared

final Property property = propertyOwner.getPolymorphicPropertyIfExists(propertyName);
if (property != null) {
    // validation only runs here
}

If the property hasn't been explicitly declared (e.g. vertex type without schema enforcement), the encoding/property-type mismatch check is silently bypassed. The put path will later fail or silently misbehave. Consider whether a null property should be treated as a warning or whether the user should be required to declare the property before creating a vector index.

3. getKeyTypes() change may break undiscovered callers

return new Type[] {
    metadata.encoding == VectorEncoding.INT8 ? Type.BINARY : Type.ARRAY_OF_FLOATS
};

Any code that routes on ARRAY_OF_FLOATS (schema validation, HA replication serialisation, index routing) will now see BINARY for INT8 indexes. A grep across the codebase for getKeyTypes callers and ARRAY_OF_FLOATS comparisons would confirm this is safe. This is especially important for the HA replication path where schema metadata is serialised and replicated to followers.

4. HTTP/JSON deserialization path is not covered by this PR

The PR ships the Java binary API and SQL METADATA path. The documentation correctly notes HTTP wire routing is tracked in #4135. However, clients sending INT8 vectors over HTTP today will send a JSON array of integers like [-1, 64, 127], which will be deserialized as List<Integer> or Object[] - not byte[]. The encoding-aware toFloatArray(Object, VectorEncoding.INT8) will fall through to the toFloatArray(Object) base path, which handles Object[] of Number via floatValue(), producing values in [-128, 127] rather than the expected [-1.0, 1.0] range. This means HTTP clients sending INT8 vectors will silently get wrong similarity scores until #4135 is resolved. A note in the docs or a warning in the HTTP handler would help operators avoid this pitfall.


Minor Notes

  • int8AndFloat32ParityOnSameLogicalVectors builds two HNSW graphs - consider @Tag("slow") per CLAUDE.md guidance on functional regression tests with non-trivial graph construction
  • putBatch correctly continues with WARNING on conversion failure while put throws - the asymmetry is intentional (replay resilience) and the Javadoc now explains it clearly; good
  • LSMVectorIndexStorageBenchmark silently swallows IllegalArgumentException - acceptable for a benchmark
  • The b < -127 check in dequantizeInt8ToFloat is correct because the explicit (int) cast makes the promotion visible - clear intent

Summary

The core implementation is solid and the test coverage is thorough. Items to address before merge:

  1. Required - Confirm LSMVectorIndex: consolidate primary constructor args into a config record #4134 is filed and the post-construction mutation is tracked
  2. Required - Decide on behavior when the embedding property is undeclared (null check bypass)
  3. Recommended - Audit getKeyTypes() callers for ARRAY_OF_FLOATS comparisons, particularly in the HA path
  4. Recommended - Add a warning or documentation note that HTTP clients sending INT8 arrays will get incorrect scores until HTTP/JSON: route int8 query vectors to byte[] for end-to-end INT8 ingest payload savings #4135 lands
  5. Optional - @Tag("slow") on int8AndFloat32ParityOnSameLogicalVectors

🤖 Generated with Claude Code

@lvca lvca deleted the vector-int8-ingest branch May 7, 2026 23:05
lvca added a commit that referenced this pull request May 8, 2026
#4136)

* HTTP: route int8 query vectors to byte[] via $bytes/$int8 markers (#4135)

Closes the HTTP/JSON gap left by the INT8 ingest landing (#4132/#4133):
clients can now send int8 query vectors that reach the engine as byte[]
and trigger the encoding-aware dequantization on LSM_VECTOR indexes,
rather than getting silently round-tripped through float32 and losing
the 4x payload claim on the wire.

Wire convention (Extended JSON-style):
- {"$bytes": "<base64>"}    -> byte[] decoded from base64
- {"$int8":  [v0, v1, ...]} -> byte[] packed from int values in [-128, 127]

The int8 form also accepts the float[] / double[] shapes that
JSONObject.toMap(optimizeNumericArrays=true) produces for JSON integer
arrays, with a fractional-value check that rejects non-integer floats so
a caller mixing up float and int8 vectors fails loudly at the wire
boundary.

Implementation:

- AbstractQueryHandler.decodeTypedJsonMarkers recursively walks the
  parsed param map and rewrites single-key {"$bytes" | "$int8": ...}
  objects into byte[]; multi-key maps and unrelated single-key maps pass
  through unchanged so existing user data with leading-$ keys is not
  silently transformed.
- mapParams calls the decoder before its existing ordinal-vs-named
  routing so the byte[] flows verbatim to SQL parameter binding.

Tests:

- AbstractQueryHandlerTypedJsonMarkersTest: 11 unit cases pin the
  decoder contract (base64 decode, int list decode, float[] decode,
  out-of-range / non-integer / non-numeric / bad-base64 rejection,
  multi-key passthrough, list-of-markers recursion, scalar passthrough).
- Int8VectorHttpIT: 2 end-to-end cases spin up the HTTP server,
  create an INT8 vector index, and submit `vector.neighbors` queries
  via HTTP using both marker forms; the seed-0 record comes back as the
  top hit confirming the byte[] path is exercised.

Comparison matrix updated to drop the "HTTP/JSON wire routing tracked
in #4135" caveat - INT8 ingest is now end-to-end.

* #4134 LSMVectorIndex: consolidate constructor args into LSMVectorIndexConfig record

Replaces the 17-positional-arg primary constructor with a single
LSMVectorIndexConfig value object. The factory handler no longer needs
to post-mutate metadata.encoding after construction, so the metadata
is fully populated atomically before the instance escapes.

* HTTP int8 markers: review fixes (null/key checks, int[], lazy alloc, tests)

- {"$bytes": null} now throws IllegalArgumentException naming the marker
  and the null instead of falling through to the recursive-map branch
- {"$bytes": <non-string>} same treatment
- Map-key recursion validates instanceof String and throws a clear
  IllegalArgumentException instead of letting a hypothetical non-string
  key surface as an opaque ClassCastException
- $int8 now also accepts int[] payloads alongside List/float[]/double[]
  for completeness
- decodeTypedJsonMarkers short-circuits without allocating a fresh
  LinkedHashMap when the param map carries no nested Map/List, which is
  the normal case for non-vector queries
- Trimmed multi-paragraph Javadocs and what-not-why inline comments
  per CLAUDE.md style

Tests:
- New cases for double[] payload, empty $int8 array, empty $bytes
  string, {"$bytes": null} rejection, int[] payload, and a two-level
  nested-map recursion (sibling to the existing list-recursion test).
- Test-class headers trimmed to one-line Javadocs.

* HTTP int8 markers: URL-safe base64, long[], zero-alloc passthrough, OpenAPI

- $bytes now accepts URL-safe base64 (RFC 4648 section 5) by retrying
  with Base64.getUrlDecoder() on the standard decoder's failure. Common
  in ML tooling that base64-encodes embeddings using - and _ in place of
  + and /.
- $int8 now accepts long[] payloads alongside List, float[], double[],
  int[].
- decodeTypedJsonMarkers and the Map/List recursion arms now return the
  original reference when no entry was rewritten. A parameter map of
  scalars + plain nested maps no longer pays for a fresh LinkedHashMap
  allocation per request - only marker-bearing requests build a new map.
- Decoder split into two private helpers (decodeBytesMarker /
  decodeInt8Marker) so the dispatcher reads as a one-line switch on the
  marker key.
- OpenAPI spec for /query and /command param fields now documents the
  $bytes / $int8 marker convention so users discover it from the API
  reference instead of source code.
- Int8VectorHttpIT POSTs Content-Type: application/json explicitly.

Tests:
- New cases for long[] payload, explicit -128/127 boundary, URL-safe
  base64 round-trip, and a same-reference assertion that pins the
  zero-allocation passthrough on marker-free maps.

* HTTP int8 markers: fix nested-map break, depth guard, IAE -> 400 on tx wrap

- decodeTypedJsonMarker's nested-map prefix-copy loop now uses an
  index-based break instead of reference equality on the key. The
  previous loop assumed the same Map.Entry returns the same key
  reference across iterations, which holds for HashMap/LinkedHashMap
  but is not part of the Map contract.
- Decoder recursion is bounded at 32 levels with an IllegalArgumentException
  on overflow; protects against StackOverflowError on hostile or
  accidentally deeply-nested JSON without depending on the upstream
  parser's depth limit.
- Decode call moved from mapParams to PostCommandHandler.execute() so
  it runs before the database.transaction wrapper rather than under it.
- AbstractServerHttpHandler's TransactionException catch arm now
  unwraps an IllegalArgumentException cause and returns HTTP 400,
  matching the un-wrapped catch arm. Without this, a malformed marker
  thrown from inside the transaction lambda was wrapped in a
  TransactionException and downgraded to HTTP 500 even though the
  underlying problem is bad client input.

Tests:

- New int8MarkerNullValueIsRejected gives the int8 path symmetric
  null-payload coverage (the bytes path already had it).
- New deeplyNestedPayloadIsRejected pins the 32-level depth guard.
- New Int8VectorHttpIT.int8MarkerOutOfRangeReturnsHttp400 confirms the
  IllegalArgumentException -> HTTP 400 chain end-to-end (was returning
  500 prior to the AbstractServerHttpHandler unwrap fix).

* HTTP int8 markers: simplify toInt8 guard, dedup IT helpers, ordinal test

- toInt8 drops the redundant Double.isNaN / Double.isInfinite checks.
  NaN already trips the v != Math.floor(v) guard (NaN compared with
  anything is false, so != is true). Infinity passes that guard but is
  caught by the subsequent range check, so explicit handling here was
  dead code. Comment notes both flow paths so a future reader does not
  accidentally re-add the redundancy.
- Int8VectorHttpIT now factors postQuery on top of postQueryRaw, sharing
  a single connection-setup helper and HttpResult type instead of two
  near-identical bodies.
- @tag("slow") on the IT class so CI runs that filter out slow tests
  skip the full server boot. Spinning up the HTTP server + creating an
  index + 16 inserts puts the elapsed time over the multi-second
  threshold called out in CLAUDE.md.

Tests:

- New ordinalKeyMapWithMarkersIsDecoded covers the positional-array
  call shape (params keyed "0", "1", ...) that PostCommandHandler
  produces from a JSON array body. Without this, the typed-marker
  decoder is only exercised under named-key params at the unit level.
tae898 pushed a commit to humemai/arcadedb-embedded-python that referenced this pull request May 10, 2026
* Support for INT8 in dense-only vector indexes

* INT8 ingest: review fixes (logging, byte clamp, error clarity)

- putBatch: log WARNING with rid + cause when a batch entry's vector type
  conversion or dimension check fails, instead of silently dropping the row
  during commit replay
- dequantizeInt8ToFloat: clamp byte -128 to -127 so a non-Cohere/OpenAI int8
  source still produces values in [-1, 1] for COSINE similarity
- LSMVectorIndex.put: dimension mismatch error now names the input type and
  length ("got byte[] of length N, expected dimensions M")
- VectorEncoding Javadoc: correct ARRAY_OF_BYTES to BINARY (the Java byte[]
  ArcadeDB type), drop em-dash
- LSMVectorIndexFactoryHandler: clarify in the comment that the encoding
  field assignment is safe (factory has not published the reference yet)
  and flag the constructor-arg sprawl as tracked tech debt
- comparison matrix: trim INT8 row down to one line

* INT8 ingest: fix rebuild paths + reject INT8 encoding+quant combo

Critical fix: five rebuild/recover sites still called the deprecated
VectorUtils.convertToFloatArray, which returns null for byte[]. An INT8
index reopened from disk would silently rebuild the HNSW graph from an
empty vector set:

- LSMVectorIndex.java HNSW validation pass (was marking every INT8 vector
  as invalid)
- LSMVectorIndex.java document-scan rebuild fallback (silently skipping
  INT8 vectors)
- LSMVectorIndex.java preload-for-HNSW pass (silently skipping INT8 vectors)
- GrowableVectorValues lazy-load fallback (silently skipping INT8 vectors)
- ArcadePageVectorValues page vector reading (false WARNING on every INT8
  vector access)

All five now route through VectorUtils.toFloatArray with appropriate
try/catch to preserve the prior null/skip semantics.

Other review fixes:

- VectorUtils.toFloatArray Javadoc spells out that a byte[] input is
  unconditionally dequantized via dequantizeInt8ToFloat (Cohere/OpenAI
  calibration); callers outside an INT8 context must convert first.
- LSMVectorIndexFactoryHandler rejects (encoding=INT8, quantization=INT8)
  with a clear error: the property is already byte-quantized at the wire
  level, so internal INT8 scalar quantization would re-quantize the
  dequantized floats - silent double-processing.

Tests:
- int8VectorsSurviveRebuildAfterReopen: regression guard. Inserts N int8
  vectors, runs a search to materialise the graph, reopens the database
  to force a rebuild from disk, asserts the same query still returns hits.
- rejectsInt8EncodingPlusInt8Quantization: validates the new factory-
  handler guard.

* INT8 ingest: encoding-aware toFloatArray, log silent skips, SQL coverage

The bare toFloatArray(Object) used to dequantize any byte[] unconditionally,
which is a footgun for non-vector callers that happen to encounter a byte[]
value. Split into two methods so the safe behaviour is the default:

- toFloatArray(Object) now rejects byte[] with a clear pointer to the
  encoding-aware overload or to dequantizeInt8ToFloat
- toFloatArray(Object, VectorEncoding) dequantizes byte[] only when
  encoding == INT8; for any other encoding it delegates to the strict
  variant, so a stray byte[] in a FLOAT32 index is rejected up front

All internal callers that have access to the index encoding switched to
the new overload: LSMVectorIndex put / putBatch / get / HNSW validation /
document-scan rebuild / preload, GrowableVectorValues lazy-load fallback,
ArcadePageVectorValues page reader, and SQLFunctionVectorNeighbors
extractQueryVector. SQL math/utility functions (vector.add,
vector.cosineSimilarity, etc.) keep calling the strict variant via
SQLFunctionAbstract.toFloatArray and therefore reject byte[] - users must
dequantize first.

Other follow-ups in this batch:

- GrowableVectorValues now logs a WARNING when an unsupported vector type
  is encountered during the document-lookup fallback, matching the sibling
  ArcadePageVectorValues path. Silent drops made operational triage hard.
- VectorEncoding.INT8 Javadoc documents the -128 calibration clamp so
  callers from non-Cohere/OpenAI providers know about the silent numeric
  correction at the [-128] edge case.
- LSMVectorIndexFactoryHandler comment now carries a TODO marker for the
  constructor-to-config-record refactor that would let encoding move back
  into construction.
- Removed VectorUtils.convertToFloatArray entirely - it had no callers
  after the rebuild-path fix and the deprecated null-returning variant
  was the source of the silent-INT8-drop class of bugs we just closed.
  Updated the lone test caller (LSMVectorIndexStorageBenchmark) to use
  toFloatArray.
- Added .isInstanceOf(IndexException.class) check on the unknown-encoding
  rejection test to pin the exception type contract.
- New sqlVectorNeighborsAcceptsByteArrayQuery test exercises the SQL
  `vector.neighbors` entry point with a byte[] query against an INT8
  index, closing the SQL-surface coverage gap.

* INT8 ingest: property/encoding guard, builder typed accessor, edge tests

Substantive: add a property-type vs encoding consistency check in the
LSMVectorIndexFactoryHandler. A user could previously combine
encoding=INT8 with an ARRAY_OF_FLOATS property (or encoding=FLOAT32 with a
BINARY property) and only see the failure surface as silent
mis-interpretation at put() / query time. The new guard rejects both
mismatches at create() with a clear error pointing at the right knob.

Other review follow-ups:

- TypeLSMVectorIndexBuilder.withEncoding now goes through a private
  vectorMetadata() typed accessor that performs an instanceof guard,
  turning a metadata-swap subclass scenario into a clear IndexException
  instead of an unrelated ClassCastException.
- VectorUtils.dequantizeInt8ToFloat now uses an explicit (int) cast on
  the byte before the bound check; same behaviour, clearer intent.
- LSMVectorIndexMetadata license header moved above the package
  declaration (was after the imports).
- Trimmed verbose inline comments added in the previous review pass; the
  canonical Javadocs in VectorEncoding and VectorUtils carry the detail.
- Constructor-refactor TODO now references the freshly-filed tracker
  issue ArcadeData#4134 instead of an inline TODO comment, so the tech debt is
  visible outside the source file.

Test coverage:

- New VectorUtilsTest covers dequantize edge cases (-128 clamps to -127,
  -127 -> -1.0f, 0 -> 0.0f, 127 -> 1.0f, empty input), the strict
  toFloatArray rejecting byte[], the encoding-aware variant rejecting
  byte[] under FLOAT32 and dequantizing under INT8, and float[] passthrough.
- LSMVectorIndexInt8IngestTest gains rejectsInt8EncodingWithFloatPropertyType
  and rejectsFloat32EncodingWithBinaryPropertyType to pin the new
  property/encoding consistency check.

* INT8 ingest: -128 one-time WARNING, dedup encoding parser, SQL DDL test

- dequantizeInt8ToFloat now logs a one-time WARNING the first time it
  encounters byte -128 (clamped silently to -127). Cohere/OpenAI int8
  endpoints emit [-127, 127] only, so a -128 byte indicates a non-Cohere
  source where the asymmetric clamp matters for DOT_PRODUCT in particular.
  Atomic gate keeps the warning to once per process.
- New VectorEncoding.fromString centralizes the FLOAT32/INT8 string parser
  used by TypeLSMVectorIndexBuilder, BucketLSMVectorIndexBuilder, and the
  schema-JSON load path; the supported-values list cannot drift between
  the three call sites anymore.
- HTTP/JSON wire routing for INT8 query vectors filed as ArcadeData#4135 (JSON
  arrays of integers / base64 strings -> byte[] in HTTP handlers); the
  comparison matrix entry now points at it so readers do not assume the
  4x payload claim covers HTTP clients today.
- putBatch Javadoc explicitly documents the put-vs-putBatch failure
  asymmetry: put throws on bad keys, putBatch logs WARNING and continues
  to avoid aborting an entire commit-replay batch on one bad row.
- New SQL DDL test pins `CREATE INDEX ... LSM_VECTOR METADATA
  {encoding: 'INT8'}` end-to-end so a JSON key-name drift would surface.
- VectorUtilsTest.toFloatArrayPassesThroughFloatArrayUnchanged now carries
  a comment explaining why all three overloads return the same float[]
  reference, so the identity assertion is not mistaken for an over-spec.
tae898 pushed a commit to humemai/arcadedb-embedded-python that referenced this pull request May 10, 2026
…cadeData#4… (ArcadeData#4136)

* HTTP: route int8 query vectors to byte[] via $bytes/$int8 markers (ArcadeData#4135)

Closes the HTTP/JSON gap left by the INT8 ingest landing (ArcadeData#4132/ArcadeData#4133):
clients can now send int8 query vectors that reach the engine as byte[]
and trigger the encoding-aware dequantization on LSM_VECTOR indexes,
rather than getting silently round-tripped through float32 and losing
the 4x payload claim on the wire.

Wire convention (Extended JSON-style):
- {"$bytes": "<base64>"}    -> byte[] decoded from base64
- {"$int8":  [v0, v1, ...]} -> byte[] packed from int values in [-128, 127]

The int8 form also accepts the float[] / double[] shapes that
JSONObject.toMap(optimizeNumericArrays=true) produces for JSON integer
arrays, with a fractional-value check that rejects non-integer floats so
a caller mixing up float and int8 vectors fails loudly at the wire
boundary.

Implementation:

- AbstractQueryHandler.decodeTypedJsonMarkers recursively walks the
  parsed param map and rewrites single-key {"$bytes" | "$int8": ...}
  objects into byte[]; multi-key maps and unrelated single-key maps pass
  through unchanged so existing user data with leading-$ keys is not
  silently transformed.
- mapParams calls the decoder before its existing ordinal-vs-named
  routing so the byte[] flows verbatim to SQL parameter binding.

Tests:

- AbstractQueryHandlerTypedJsonMarkersTest: 11 unit cases pin the
  decoder contract (base64 decode, int list decode, float[] decode,
  out-of-range / non-integer / non-numeric / bad-base64 rejection,
  multi-key passthrough, list-of-markers recursion, scalar passthrough).
- Int8VectorHttpIT: 2 end-to-end cases spin up the HTTP server,
  create an INT8 vector index, and submit `vector.neighbors` queries
  via HTTP using both marker forms; the seed-0 record comes back as the
  top hit confirming the byte[] path is exercised.

Comparison matrix updated to drop the "HTTP/JSON wire routing tracked
in ArcadeData#4135" caveat - INT8 ingest is now end-to-end.

* ArcadeData#4134 LSMVectorIndex: consolidate constructor args into LSMVectorIndexConfig record

Replaces the 17-positional-arg primary constructor with a single
LSMVectorIndexConfig value object. The factory handler no longer needs
to post-mutate metadata.encoding after construction, so the metadata
is fully populated atomically before the instance escapes.

* HTTP int8 markers: review fixes (null/key checks, int[], lazy alloc, tests)

- {"$bytes": null} now throws IllegalArgumentException naming the marker
  and the null instead of falling through to the recursive-map branch
- {"$bytes": <non-string>} same treatment
- Map-key recursion validates instanceof String and throws a clear
  IllegalArgumentException instead of letting a hypothetical non-string
  key surface as an opaque ClassCastException
- $int8 now also accepts int[] payloads alongside List/float[]/double[]
  for completeness
- decodeTypedJsonMarkers short-circuits without allocating a fresh
  LinkedHashMap when the param map carries no nested Map/List, which is
  the normal case for non-vector queries
- Trimmed multi-paragraph Javadocs and what-not-why inline comments
  per CLAUDE.md style

Tests:
- New cases for double[] payload, empty $int8 array, empty $bytes
  string, {"$bytes": null} rejection, int[] payload, and a two-level
  nested-map recursion (sibling to the existing list-recursion test).
- Test-class headers trimmed to one-line Javadocs.

* HTTP int8 markers: URL-safe base64, long[], zero-alloc passthrough, OpenAPI

- $bytes now accepts URL-safe base64 (RFC 4648 section 5) by retrying
  with Base64.getUrlDecoder() on the standard decoder's failure. Common
  in ML tooling that base64-encodes embeddings using - and _ in place of
  + and /.
- $int8 now accepts long[] payloads alongside List, float[], double[],
  int[].
- decodeTypedJsonMarkers and the Map/List recursion arms now return the
  original reference when no entry was rewritten. A parameter map of
  scalars + plain nested maps no longer pays for a fresh LinkedHashMap
  allocation per request - only marker-bearing requests build a new map.
- Decoder split into two private helpers (decodeBytesMarker /
  decodeInt8Marker) so the dispatcher reads as a one-line switch on the
  marker key.
- OpenAPI spec for /query and /command param fields now documents the
  $bytes / $int8 marker convention so users discover it from the API
  reference instead of source code.
- Int8VectorHttpIT POSTs Content-Type: application/json explicitly.

Tests:
- New cases for long[] payload, explicit -128/127 boundary, URL-safe
  base64 round-trip, and a same-reference assertion that pins the
  zero-allocation passthrough on marker-free maps.

* HTTP int8 markers: fix nested-map break, depth guard, IAE -> 400 on tx wrap

- decodeTypedJsonMarker's nested-map prefix-copy loop now uses an
  index-based break instead of reference equality on the key. The
  previous loop assumed the same Map.Entry returns the same key
  reference across iterations, which holds for HashMap/LinkedHashMap
  but is not part of the Map contract.
- Decoder recursion is bounded at 32 levels with an IllegalArgumentException
  on overflow; protects against StackOverflowError on hostile or
  accidentally deeply-nested JSON without depending on the upstream
  parser's depth limit.
- Decode call moved from mapParams to PostCommandHandler.execute() so
  it runs before the database.transaction wrapper rather than under it.
- AbstractServerHttpHandler's TransactionException catch arm now
  unwraps an IllegalArgumentException cause and returns HTTP 400,
  matching the un-wrapped catch arm. Without this, a malformed marker
  thrown from inside the transaction lambda was wrapped in a
  TransactionException and downgraded to HTTP 500 even though the
  underlying problem is bad client input.

Tests:

- New int8MarkerNullValueIsRejected gives the int8 path symmetric
  null-payload coverage (the bytes path already had it).
- New deeplyNestedPayloadIsRejected pins the 32-level depth guard.
- New Int8VectorHttpIT.int8MarkerOutOfRangeReturnsHttp400 confirms the
  IllegalArgumentException -> HTTP 400 chain end-to-end (was returning
  500 prior to the AbstractServerHttpHandler unwrap fix).

* HTTP int8 markers: simplify toInt8 guard, dedup IT helpers, ordinal test

- toInt8 drops the redundant Double.isNaN / Double.isInfinite checks.
  NaN already trips the v != Math.floor(v) guard (NaN compared with
  anything is false, so != is true). Infinity passes that guard but is
  caught by the subsequent range check, so explicit handling here was
  dead code. Comment notes both flow paths so a future reader does not
  accidentally re-add the redundancy.
- Int8VectorHttpIT now factors postQuery on top of postQueryRaw, sharing
  a single connection-setup helper and HttpResult type instead of two
  near-identical bodies.
- @tag("slow") on the IT class so CI runs that filter out slow tests
  skip the full server boot. Spinning up the HTTP server + creating an
  index + 16 inserts puts the elapsed time over the multi-second
  threshold called out in CLAUDE.md.

Tests:

- New ordinalKeyMapWithMarkersIsDecoded covers the positional-array
  call shape (params keyed "0", "1", ...) that PostCommandHandler
  produces from a JSON array body. Without this, the typed-marker
  decoder is only exercised under named-key params at the unit level.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant