feat(es-vector): backport native Elasticsearch vector search to 1.12.7 by joaopamaral · Pull Request #2 · Automattic/OpenMetadata

joaopamaral · 2026-05-11T21:53:32Z

Summary

Backports the native Elasticsearch vector search work from upstream PR open-metadata/OpenMetadata#27111 to our 1.12.7 release line. The upstream PR uses an inline-per-entity embedding architecture (it rewrites entity indices with dense_vector enrichment), whereas 1.12.7 uses a dedicated vector_search_index. This backport therefore mirrors OpenSearchVectorService for ES 8.x/9.x against the existing 1.12.7 architecture.

What's in the backport

ElasticSearchVectorService — 14-method VectorIndexService implementation against ES 8.x/9.x, using the low-level Rest5Client (extracted from ElasticsearchClient._transport()). Mirrors OpenSearchVectorService:

dense_vector field type, top-level knn query format
executeGenericRequest handles 4xx (manual status check) + 5xx (ResponseException) symmetric with the OS path
extractRestClient hard-fails on unexpected transport type
readEntityBody tolerates null HttpEntity (ES returns no body on some 4xx)
search() persists the parent_id fallback into the hit map so consumers see a populated value, not just the grouping key
createOrUpdateIndex loads vector_search_index_es_native.json and replaces dims: 512 placeholder with the active embedding dimension before PUT /<index>
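
The dimension substitution in createOrUpdateIndex can be sketched roughly as follows. This is a minimal illustration rather than the PR's code: the class and method names (DimsPlaceholderSketch, withDimension) are hypothetical, and it assumes the template carries the literal "dims": 512 placeholder described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the actual ElasticSearchVectorService code.
final class DimsPlaceholderSketch {
    // Matches "dims": 512 with optional whitespace around the colon.
    // The trailing \b keeps values such as 5120 from matching.
    private static final Pattern DIMS_PLACEHOLDER =
        Pattern.compile("\"dims\"\\s*:\\s*512\\b");

    private DimsPlaceholderSketch() {}

    /** Returns the mapping JSON with the placeholder replaced by the active dimension. */
    static String withDimension(String templateJson, int dimension) {
        if (dimension <= 0) {
            throw new IllegalArgumentException("dimension must be positive: " + dimension);
        }
        Matcher matcher = DIMS_PLACEHOLDER.matcher(templateJson);
        if (!matcher.find()) {
            // Fail loudly: dense_vector.dims cannot be changed once the index exists.
            throw new IllegalStateException("template has no dims placeholder");
        }
        // replaceAll resets the matcher before replacing every occurrence.
        return matcher.replaceAll("\"dims\": " + dimension);
    }
}
```

Failing when the placeholder is absent is a design choice of this sketch: a silently unpatched template would create an index with the wrong, immutable dimension.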

ElasticSearchVectorBulkProcessor — Rest5Client-based bulk NDJSON analog to VectorBulkProcessor. Speaks /_bulk directly, parses the items[] array for success/failed counts, same scheduler + flush + stats tracker interface.
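
For readers unfamiliar with the /_bulk wire format, the two halves of this processor (building the NDJSON body, then counting per-item outcomes from items[]) might look like the sketch below. All names here (BulkNdjsonSketch, toNdjson, countResults) are hypothetical, the response is assumed to be already parsed into maps, and document IDs are assumed to need no JSON escaping.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real processor also handles scheduling, flushing, and stats.
final class BulkNdjsonSketch {
    private BulkNdjsonSketch() {}

    /** Emits one action line and one source line per document, each newline-terminated. */
    static String toNdjson(String index, Map<String, String> sourceJsonById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> doc : sourceJsonById.entrySet()) {
            // Assumption: index name and id contain no characters that need JSON escaping.
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_id\":\"").append(doc.getKey()).append("\"}}\n");
            body.append(doc.getValue()).append('\n');
        }
        return body.toString();
    }

    /**
     * Counts outcomes from the parsed items[] array of a bulk response.
     * Each item is keyed by its action ("index", "delete", ...) and carries a status.
     * Returns {success, failed}.
     */
    static int[] countResults(List<Map<String, Map<String, Object>>> items) {
        int success = 0;
        int failed = 0;
        for (Map<String, Map<String, Object>> item : items) {
            for (Map<String, Object> result : item.values()) {
                Object status = result.get("status");
                boolean ok = !result.containsKey("error")
                    && status instanceof Number
                    && ((Number) status).intValue() < 300;
                if (ok) {
                    success++;
                } else {
                    failed++;
                }
            }
        }
        return new int[] {success, failed};
    }
}
```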

ES-native mapping templates (en/jp/ru/zh) at openmetadata-spec/src/main/resources/elasticsearch/*/vector_search_index_es_native.json:

Drops OpenSearch-specific index.knn* settings
Replaces knn_vector { method: hnsw, engine: lucene, ... } with dense_vector { dims, index: true, similarity: cosine }

VectorSearchQueryBuilder.buildNativeESQuery — top-level ES knn block (field, query_vector, k, num_candidates, filter), overflow-safe num_candidates clamping via long math + Integer.MAX_VALUE cap.
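
The overflow-safe clamping mentioned above can be illustrated with a short sketch. The class and method names are hypothetical, and the assumption is that num_candidates is derived as k times a configurable multiplier; the actual builder may differ in detail.

```java
// Hypothetical helper; illustrates long-math clamping, not the PR's exact code.
final class NumCandidatesSketch {
    private NumCandidatesSketch() {}

    /**
     * Computes k * multiplier in long arithmetic so the product cannot wrap
     * around, then caps the result at Integer.MAX_VALUE.
     */
    static int clampNumCandidates(int k, int multiplier) {
        if (k <= 0 || multiplier <= 0) {
            throw new IllegalArgumentException("k and multiplier must be positive");
        }
        // Two positive ints multiplied as longs always fit in a long.
        long product = (long) k * (long) multiplier;
        return (int) Math.min(product, (long) Integer.MAX_VALUE);
    }
}
```

With plain int math, 1_000_000 * 5_000 would wrap to a meaningless value; here it is capped at Integer.MAX_VALUE instead.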

SearchRepository wiring — replaces the ES else-branch (warn + return stub) with an actual ElasticSearchVectorService.init(...).

Bootstrap fixes for dedicated-arch on ES

1.12.7's dedicated-index architecture creates indices at boot Phase 1 (createMissingIndexes / createOrUpdateIndexTemplates) before the embedding client is initialized in Phase 3. The placeholder dims=512 from the JSON template would get baked into the index, and dense_vector.dims is immutable on existing ES indices. Three surgical fixes:

getIndexMapping: on SearchType=ELASTICSEARCH, swap vector_search_index.json → vector_search_index_es_native.json so ES doesn't reject the template (unknown setting [index.knn])
reformatVectorIndexWithDimension: patch dims (ES) instead of dimension (OS) for the active backend
createMissingIndexes / createOrUpdateIndexTemplates: skip the vectorEmbedding entry when ES — ElasticSearchVectorService.createOrUpdateIndex creates it in Phase 3 with the real model dimension
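
The three decisions above reduce to small backend-dependent branches. The sketch below is a simplified illustration under stated assumptions: BootstrapSketch and its SearchType enum are hypothetical stand-ins for the real SearchRepository types, and only the file names and the vectorEmbedding key come from this PR.

```java
// Hypothetical stand-in for the SearchRepository logic; not the PR's actual code.
final class BootstrapSketch {
    enum SearchType { ELASTICSEARCH, OPENSEARCH }

    private static final String OS_TEMPLATE = "vector_search_index.json";
    private static final String ES_TEMPLATE = "vector_search_index_es_native.json";

    private BootstrapSketch() {}

    /** Fix 1: on ES, swap the OpenSearch-format template for the ES-native one. */
    static String resolveTemplatePath(SearchType type, String path) {
        boolean isOsTemplate = path.equals(OS_TEMPLATE) || path.endsWith("/" + OS_TEMPLATE);
        if (type == SearchType.ELASTICSEARCH && isOsTemplate) {
            return path.substring(0, path.length() - OS_TEMPLATE.length()) + ES_TEMPLATE;
        }
        return path;
    }

    /** Fix 2: the dimension field is named differently per backend. */
    static String dimensionField(SearchType type) {
        return type == SearchType.ELASTICSEARCH ? "dims" : "dimension";
    }

    /** Fix 3: on ES, defer the vector index to Phase 3, when the real dimension is known. */
    static boolean createAtPhaseOne(SearchType type, String entityType) {
        return !(type == SearchType.ELASTICSEARCH && "vectorEmbedding".equals(entityType));
    }
}
```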

VectorSearchResource now reads vectorIndexService via SearchRepository instead of hardcoded OpenSearchVectorService.getInstance(), so the endpoint works for both backends.

Applicable upstream PR fixes (verified ported)

de20914 extractRestClient transport guard
f94c790 4xx status check in executeGenericRequest
11bbf70 symmetric ES/OS error format
114f63e parentId fallback persist + null HttpEntity tolerance
7d82e95/4ee2a93 configurable num_candidates multiplier
num_candidates overflow clamping
reindex-avoid: createOrUpdateIndex skips when index exists (matches 1.12.7 OS behavior)

Not applicable (inline-arch only)

Upstream PR fixes that target the inline-embed path on entity indices have no equivalent in 1.12.7's dedicated-arch:

EsUtils.enrichIndexMappingForElasticsearch (per-entity dense_vector enrichment)
_meta preserve + dim-mismatch hard-fail
hybrid pipeline gate (method not in 1.12.7)
VectorSearchResource admin gate, RecreateWithEmbeddings, ElasticSearchBulkSink inline-embed
from-pagination, camelCase parentId, partialUpdateEntity

Test plan

Adapted from the e2e test in the upstream PR #27111 comment to fit dedicated-arch (embeddings live in vector_search_index, not the per-entity table index).

Stack: ES 9.3.0 + Postgres + DJL sentence-transformers/all-MiniLM-L6-v2 (384 dims, cosine)

mvn clean package
VectorSearchQueryBuilderTest: 20/20 passing
Boot logs: ElasticSearchVectorService initialized with model=...all-MiniLM-L6-v2, dimension=384
Boot logs: Created vector index openmetadata_vector_search_index with dimension 384
ES mapping check: embedding { type: dense_vector, dims: 384, index: true, similarity: cosine, index_options: bbq_hnsw }
VectorEmbeddingHandler registered on entity lifecycle bus
Created 3 tables with distinct descriptions (customer_purchases, user_logins, weather_data) — embeddings auto-indexed (5 docs in vector_search_index including parent entities)
Semantic queries against /api/v1/search/vector/query:
- "revenue from sales" → top-1: customer_purchases (0.6191)
- "login authentication" → top-1: user_logins (0.6306)
- "temperature humidity" → top-1: weather_data (0.6476)
Ranking matches the upstream PR test. Absolute scores differ because dedicated-arch embeds via VectorDocBuilder.fromEntity (chunked entity text), whereas inline-arch embeds the inline table-doc blob.

🤖 Generated with Claude Code

Mirror OpenSearchVectorService for Elasticsearch 8.x/9.x:

- ElasticSearchVectorService: 14-method implementation using Rest5Client low-level transport; dense_vector field type; top-level knn query format
- ElasticSearchVectorBulkProcessor: Rest5Client-based bulk NDJSON analog to VectorBulkProcessor
- vector_search_index_es_native.json (en/jp/ru/zh): ES-native mapping templates with dense_vector{dims, similarity:cosine}
- VectorSearchQueryBuilder.buildNativeESQuery: ES top-level knn format with overflow-safe num_candidates clamping
- SearchRepository: wire ELASTICSEARCH search type to ElasticSearchVectorService.init() (was a warn+return stub)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…earch

Three fixes to make the ES vector backport actually serve queries in 1.12.7's dedicated-index architecture:

- SearchRepository.getIndexMapping: when SearchType=ELASTICSEARCH and the IndexMapping's resource path is the OpenSearch-format vector_search_index.json, swap to vector_search_index_es_native.json. Without this, ES rejects the template (unknown setting [index.knn]).
- SearchRepository.reformatVectorIndexWithDimension: on Elasticsearch, patch dense_vector.dims (not OpenSearch's knn_vector.dimension). The string-replace fallback also targets the ES field name.
- SearchRepository.createMissingIndexes / createOrUpdateIndexTemplates: skip the "vectorEmbedding" entry on Elasticsearch. Boot Phase 1 runs before embeddingClient is initialized, so the JSON template's placeholder dimension would be baked in, and dense_vector.dims is immutable on an existing ES index. ElasticSearchVectorService creates the index later in Phase 3 with the active model's real dimension.
- VectorSearchResource: read vectorIndexService via SearchRepository instead of OpenSearchVectorService.getInstance(), so the resource works for both backends. ES queries previously returned 503 "Vector search service is not initialized" because the resource only checked the OS singleton.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-11T21:54:08Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.