Skip to content

feat(es-vector): backport native Elasticsearch vector search to 1.12.7#2

Open
joaopamaral wants to merge 7 commits into
release-1.12.7from
release-1.12.7-es-vector
Open

feat(es-vector): backport native Elasticsearch vector search to 1.12.7#2
joaopamaral wants to merge 7 commits into
release-1.12.7from
release-1.12.7-es-vector

Conversation

@joaopamaral

Copy link
Copy Markdown

Summary

Backports the native Elasticsearch vector search work from upstream PR open-metadata/OpenMetadata#27111 to our 1.12.7 release line. Upstream PR uses an inline-per-entity embedding architecture (rewrites entity indices with dense_vector enrichment). 1.12.7 uses a dedicated vector_search_index, so this backport mirrors OpenSearchVectorService for ES 8.x/9.x against the existing 1.12.7 architecture.

What's in the backport

ElasticSearchVectorService — 14-method VectorIndexService implementation against ES 8.x/9.x, using the low-level Rest5Client (extracted from ElasticsearchClient._transport()). Mirrors OpenSearchVectorService:

  • dense_vector field type, top-level knn query format
  • executeGenericRequest handles 4xx (manual status check) + 5xx (ResponseException) symmetric with the OS path
  • extractRestClient hard-fails on unexpected transport type
  • readEntityBody tolerates null HttpEntity (ES returns no body on some 4xx)
  • search() persists the parent_id fallback into the hit map so consumers see a populated value, not just the grouping key
  • createOrUpdateIndex loads vector_search_index_es_native.json and replaces dims: 512 placeholder with the active embedding dimension before PUT /<index>

ElasticSearchVectorBulkProcessorRest5Client-based bulk NDJSON analog to VectorBulkProcessor. Speaks /_bulk directly, parses the items[] array for success/failed counts, same scheduler + flush + stats tracker interface.

ES-native mapping templates (en/jp/ru/zh) at openmetadata-spec/src/main/resources/elasticsearch/*/vector_search_index_es_native.json:

  • Drops OpenSearch-specific index.knn* settings
  • Replaces knn_vector { method: hnsw, engine: lucene, ... } with dense_vector { dims, index: true, similarity: cosine }

VectorSearchQueryBuilder.buildNativeESQuery — top-level ES knn block (field, query_vector, k, num_candidates, filter), overflow-safe num_candidates clamping via long math + Integer.MAX_VALUE cap.

SearchRepository wiring — replaces the ES else-branch (warn + return stub) with an actual ElasticSearchVectorService.init(...).

Bootstrap fixes for dedicated-arch on ES

1.12.7's dedicated-index architecture creates indices at boot Phase 1 (createMissingIndexes / createOrUpdateIndexTemplates) before the embedding client is initialized in Phase 3. The placeholder dims=512 from the JSON template would get baked into the index, and dense_vector.dims is immutable on existing ES indices. Three surgical fixes:

  • getIndexMapping: on SearchType=ELASTICSEARCH, swap vector_search_index.jsonvector_search_index_es_native.json so ES doesn't reject the template (unknown setting [index.knn])
  • reformatVectorIndexWithDimension: patch dims (ES) instead of dimension (OS) for the active backend
  • createMissingIndexes / createOrUpdateIndexTemplates: skip the vectorEmbedding entry when ES — ElasticSearchVectorService.createOrUpdateIndex creates it in Phase 3 with the real model dimension

VectorSearchResource now reads vectorIndexService via SearchRepository instead of hardcoded OpenSearchVectorService.getInstance(), so the endpoint works for both backends.

Applicable upstream PR fixes (verified ported)

  • de20914 extractRestClient transport guard
  • f94c790 4xx status check in executeGenericRequest
  • 11bbf70 symmetric ES/OS error format
  • 114f63e parentId fallback persist + null HttpEntity tolerance
  • 7d82e95/4ee2a93 configurable num_candidates multiplier
  • num_candidates overflow clamping
  • reindex-avoid: createOrUpdateIndex skips when index exists (matches 1.12.7 OS behavior)

Not applicable (inline-arch only)

Upstream PR fixes that target the inline-embed path on entity indices have no equivalent in 1.12.7's dedicated-arch:

  • EsUtils.enrichIndexMappingForElasticsearch (per-entity dense_vector enrichment)
  • _meta preserve + dim-mismatch hard-fail
  • hybrid pipeline gate (method not in 1.12.7)
  • VectorSearchResource admin gate, RecreateWithEmbeddings, ElasticSearchBulkSink inline-embed
  • from-pagination, camelCase parentId, partialUpdateEntity

Test plan

Adapted version of the e2e from upstream PR #27111 comment for dedicated-arch (embeddings live in vector_search_index, not the per-entity table index).

Stack: ES 9.3.0 + Postgres + DJL sentence-transformers/all-MiniLM-L6-v2 (384 dims, cosine)

  • mvn package clean
  • VectorSearchQueryBuilderTest: 20/20 passing
  • Boot logs: ElasticSearchVectorService initialized with model=...all-MiniLM-L6-v2, dimension=384
  • Boot logs: Created vector index openmetadata_vector_search_index with dimension 384
  • ES mapping check: embedding { type: dense_vector, dims: 384, index: true, similarity: cosine, index_options: bbq_hnsw }
  • VectorEmbeddingHandler registered on entity lifecycle bus
  • Created 3 tables with distinct descriptions (customer_purchases, user_logins, weather_data) — embeddings auto-indexed (5 docs in vector_search_index including parent entities)
  • Semantic queries against /api/v1/search/vector/query:
    • "revenue from sales" → top-1: customer_purchases (0.6191)
    • "login authentication" → top-1: user_logins (0.6306)
    • "temperature humidity" → top-1: weather_data (0.6476)
  • Ranking matches upstream PR test (absolute scores differ because dedicated-arch embeds via VectorDocBuilder.fromEntity — chunked entity text, not the inline table-doc blob the inline-arch embeds)

🤖 Generated with Claude Code

joaopamaral and others added 2 commits May 11, 2026 16:52
Mirror OpenSearchVectorService for Elasticsearch 8.x/9.x:
- ElasticSearchVectorService: 14-method implementation using Rest5Client
  low-level transport; dense_vector field type; top-level knn query format
- ElasticSearchVectorBulkProcessor: Rest5Client-based bulk NDJSON
  analog to VectorBulkProcessor
- vector_search_index_es_native.json (en/jp/ru/zh): ES-native mapping
  templates with dense_vector{dims, similarity:cosine}
- VectorSearchQueryBuilder.buildNativeESQuery: ES top-level knn format
  with overflow-safe num_candidates clamping
- SearchRepository: wire ELASTICSEARCH search type to
  ElasticSearchVectorService.init() (was a warn+return stub)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…earch

Three fixes to make the ES vector backport actually serve queries in
1.12.7's dedicated-index architecture:

- SearchRepository.getIndexMapping: when SearchType=ELASTICSEARCH and
  the IndexMapping's resource path is the OpenSearch-format
  vector_search_index.json, swap to vector_search_index_es_native.json.
  Without this, ES rejects the template (unknown setting [index.knn]).

- SearchRepository.reformatVectorIndexWithDimension: on Elasticsearch,
  patch dense_vector.dims (not OpenSearch's knn_vector.dimension). The
  string-replace fallback also targets the ES field name.

- SearchRepository.createMissingIndexes / createOrUpdateIndexTemplates:
  skip the "vectorEmbedding" entry on Elasticsearch. Boot Phase 1 runs
  before embeddingClient is initialized, so the JSON template's
  placeholder dimension would be baked in — and dense_vector.dims is
  immutable on an existing ES index. ElasticSearchVectorService creates
  the index later in Phase 3 with the active model's real dimension.

- VectorSearchResource: read vectorIndexService via SearchRepository
  instead of OpenSearchVectorService.getInstance(), so the resource
  works for both backends. ES queries previously returned
  503 "Vector search service is not initialized" because the resource
  only checked the OS singleton.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

`dettachIlmPolicyFromIndexes` previously called
`client.indices().putSettings(s -> s.settings(idx -> idx.lifecycle(l -> l.name(null))))`.
The ES Java client's Jackson serializer drops null fields, so the body
serialized to `{settings: {index: {}}}` — empty — and Elasticsearch
rejected the request with `action_request_validation_exception: no
settings to update`. The migration `v172.removeOldDataInsightsObjects`
swallowed the per-index error in its outer try/catch and continued, so
the migration completed "successfully" but left old data-insights ILM
policies attached to the data-stream backing indices and the orphan
policies on the cluster.

Switch to the dedicated `POST /<index>/_ilm/remove` endpoint via the
typed `client.ilm().removePolicy(r -> r.index(indexName))` call. This is
the purpose-built ES API for detaching ILM from an index and doesn't
depend on null-field serialization behaviour.

Verified by running `./bootstrap/openmetadata-ops.sh drop-create` from
a dev image built on the fix branch: migration log now shows `Detached
ILM policy from index: <name>` for every old data-insights backing
index, the orphan policy is gone, and the reindex pipeline completes
with 56/57 templates created (vectorEmbedding skipped on ES by design).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

LdapAuthenticator.checkAndCreateUser fetched the existing OM user with
only "id,name,email,roles". The subsequent PUT through
UserUtil.addOrUpdateUser -> userRepository.createOrUpdate then ran the
UserUpdater, whose entitySpecificUpdate unconditionally calls
updateTeams / updatePersonas / etc. Because the in-memory user had
teams = null, updateTeams executed deleteTo(user, HAS, TEAM) +
assignTeams(null), which wiped every manually-assigned team on every
LDAP login. The same path wiped personas, defaultPersona, profile,
domains, personaPreferences, authenticationMechanism, and
isEmailVerified. The deleteTo against a user with many teams also
made login visibly slow.

Switch the fetch to userRepository.getFieldsWithUserAuth("*"), matching
the BasicAuthenticator path, so the PUT sees the user's full state and
the updater preserves it.

Adds LdapAuthCompleteFlowTest.testLoginPreservesManuallyAssignedTeams,
a regression test that creates three pure-OM teams (no LDAP
group/DN/role-mapping backing), manually assigns them to the LDAP
user, logs in over LDAP, and asserts the three teams are still on the
user afterwards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

The previous fix passed `getFieldsWithUserAuth("*")` to `getByEmail`,
which returns the full allowed-field set and so triggers fetching
heavy User relationships like `owns` and `follows` on every LDAP
login. `UserRepository.setFields` runs `getOwns(user)` / `getFollows(user)`
whenever those fields are requested, each of which is a DB query
proportional to the user's owned/followed entity count.

Switch to `userRepository.getPutFields()`, the cached `Fields` object
built from `USER_UPDATE_FIELDS`: profile, roles, teams,
authenticationMechanism, isEmailVerified, personas, defaultPersona,
domains, personaPreferences. That is exactly the set
`UserUpdater.entitySpecificUpdate` mutates on the PUT path, so the
updater no longer wipes relationships it can't see, and we no longer
over-fetch unrelated User relationships on every LDAP login.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

…reindex loop on boot

BotResource.initialize() runs UserUtil.addOrUpdateBotUser(user) for every
bot on every OM boot. The in-memory User built by UserUtil.user(...) does
not have the teams field populated, so the PUT path through
userRepository.createOrUpdate -> UserUpdater.entitySpecificUpdate runs
updateTeams(original, updated) with original.teams = [Organization] (or
the bot's real stored teams) and updated.teams = null. updateTeams then
executes deleteTo(user, HAS, TEAM) + assignTeams(null), which strips
every stored team membership the bot had, bumps the user version, and
triggers an ES reindex of both the user and each affected team. With
several bots this fires on every restart and produces the reindex storm
plus "Circular dependency detected in team hierarchy for team:
Organization" warnings we see in the boot logs.

Short-circuit when the incoming bot has no real change vs. the persisted
row: compare description, displayName, and roles. If they all match,
return the original user and skip the PUT entirely — no UserUpdater, no
team strip, no version bump, no reindex.

Two adjustments to make the guard actually fire:
- retrieveWithAuthMechanism now also loads "roles" (was loading only
  "authenticationMechanism"); description and displayName are scalar
  JSON-column fields and were already populated by the base read.
- Compare roles via listOrEmpty(...) on both sides because the
  database-loaded original returns an empty list while the freshly built
  in-memory user returns null, and Objects.equals(null, []) is false.

Adds UserUtilBotTest.addOrUpdateBotUserDoesNotStripTeams as a regression
test. The test creates a real (non-Organization) team, assigns the bot
to it, calls addOrUpdateBotUser with a fresh User object that does not
carry the teams field (matching what BotResource.initialize passes on
boot), then asserts the bot is still in the real team. Before this fix
the assertion fails with "Got: [Organization]" — the real team has been
wiped and only the virtual default remains.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

… abstraction

SemanticSearchTool hard-coded OpenSearchVectorService.getInstance(),
which returns null when the active vector service is
ElasticSearchVectorService (i.e. when the backend is Elasticsearch). Every
MCP semantic_search call against an ES-backed deployment then failed with
"Vector search service is not initialized" even though
ElasticSearchVectorService was initialized at boot and the REST
endpoint /api/v1/search/vector/query worked.

Switch to Entity.getSearchRepository().getVectorIndexService(), the same
abstraction VectorSearchResource already uses. The returned
VectorIndexService impl is whichever was registered at boot
(ElasticSearchVectorService for ES, OpenSearchVectorService for OS), and
both implement the search(query, filters, size, k, threshold) signature
this tool calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

joaopamaral pushed a commit that referenced this pull request May 18, 2026
…tency (open-metadata#28117)

* fix(rdf): converge Fuseki state on weekly rebuilds and isolate API latency

RdfIndexApp ran daily and never reconciled removed relationships, so triples
grew unboundedly across runs. When Fuseki crash-looped on the resulting disk
pressure, every entity-write hook blocked synchronously on the unreachable
server (no HTTP connect timeout, 3-retry loop on ConnectException), saturating
the bounded AsyncService pool and pushing login to ~45s.

Storage-side fixes (stop growth):
- Drop the extractRelationshipTriples "preserve forward" path in
  RdfRepository.createOrUpdate; the translator is the source of truth and the
  surrounding orchestration already rewrites the current relationship set.
  This also removes a wasted CONSTRUCT round-trip per entity write.
- bulkStoreRelationships now does per-source-entity DELETE WHERE with a
  predicate-exclusion FILTER for lineage edges, so relationships that no
  longer exist actually leave the store.
- Wire RdfRepository.clearAllGlossaryTermRelations() into RdfIndexApp's
  initializeJob (the method existed but had no callers).
- Flip recreateIndex default to true and move the cron to Saturday midnight
  ("0 0 * * 6"). Add reloadOntologies() so CLEAR ALL doesn't leave the
  ontology graph empty before indexing starts.
- Include a 2.0.1 post-data migration that updates existing installed_apps
  rows; the app loader is insert-only on upgrade.

Connectivity / concurrency fixes (isolate API latency from Fuseki health):
- Add 2s connectTimeout to every JenaFusekiStorage HttpClient and fast-fail
  on ConnectException / ClosedChannelException / HttpConnectTimeoutException
  instead of retrying. Introduce a 5-failure/30s circuit breaker.
- Route all RdfUpdater mutators through AsyncService.execute with a bounded
  pendingWrites gate (cap 1000, drop-on-overflow with logged warning) so a
  dead Fuseki can no longer block request threads or starve the AsyncService
  pool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): address PR review — preserve relationships, scope DELETEs, surface ontology failures

PR open-metadata#28117 review feedback. Addresses 13 findings across gitar-bot and Copilot:

Storage correctness:
- JenaFusekiStorage.storeEntity now keeps URI-valued triples (relationships)
  and only refreshes literal-valued triples. A metadata-only PATCH would
  otherwise wipe every inter-entity edge until the next weekly recreate-index,
  and async ordering between updateEntity and addRelationship could leave the
  graph missing edges (Copilot #1, #2).
- RdfRepository.removeRelationship wraps the DELETE in the knowledge named
  graph and uses getRelationshipPredicate so the predicate URI matches what
  addRelationship actually wrote (e.g. UPSTREAM → prov:wasDerivedFrom). The
  previous bare DELETE in the default graph was a silent no-op (Copilot #3).
- RdfBatchProcessor now calls a new RdfRepository.clearOutgoingEntityRelationships
  for every entity in the batch, not just those with current edges. An entity
  whose last outgoing relationship was removed in MySQL contributes zero
  RelationshipData entries, so bulkStoreRelationships' per-source DELETE
  never fired for it (Copilot #4).
- bulkStoreRelationships no longer swallows non-connect DELETE errors —
  DELETE WHERE on a source with no edges is a no-op, so exceptions there
  are real failures (malformed SPARQL, auth, server errors) and should
  surface (Copilot #5).

Visibility:
- reloadOntologies() now checks areOntologiesLoaded() after load and throws
  if still empty. OntologyLoader.loadOntologies catches internally, so the
  old reloadOntologies always appeared to succeed (Copilot #6).
- clearAllGlossaryTermRelations rethrows on failure instead of silently
  logging — the indexer's caller can now react to cleanup failures (Copilot open-metadata#10).
- clearAllGlossaryTermRelations pulls custom predicate URIs from
  GlossaryTermRelationSettings and includes them in the DELETE FILTER. The
  hardcoded list missed any custom predicates an admin configured (Copilot #7).

Quality:
- Set / LinkedHashSet imported instead of using java.util.* fully qualified
  in JenaFusekiStorage and RdfBatchProcessor (gitar-bot #2).
- RdfIndexAppTest uses InOrder to assert clearAll → reloadOntologies
  ordering — a plain verify would have accepted a future change that
  reordered the calls (Copilot #9).
- Documented the residual gap that HttpClient.connectTimeout only bounds
  TCP connect, not request bodies; circuit breaker + bounded pendingWrites
  contain the blast radius (Copilot #8).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(rdf): expect per-source clear on batches whose relationships are all filtered

The two EventSubscription-skip tests used verifyNoInteractions on the RDF
repository mock, which was valid before because filtered batches never
touched RDF. The new per-source reconciliation clear in
RdfBatchProcessor.processBatchRelationships now runs for every batch entity
regardless of whether its relationships survive filtering — that's
deliberate, since stale RDF state for those source entities still needs
to be reconciled even when their current MySQL edges all point to excluded
entity types. Switch the assertions to verify clearOutgoingEntityRelationships
is the sole interaction (no bulkAdd, no addRelationship).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): address remaining PR review nits

Three findings from the second gitar-bot review pass:

- Replace the fully qualified `org.openmetadata.schema.configuration.GlossaryTermRelationSettings` / `SettingsType` / `SettingsCache` references in clearAllGlossaryTermRelations with imports, matching the project's existing convention. Other pre-existing FQN usages in the same file are left alone (not part of this PR's scope).
- Make expandPredicateCurie throw IllegalArgumentException on null/empty input instead of silently defaulting to `om:relatedTo`. The current caller already null-guards so the path is unreachable today, but a future caller could otherwise silently miss-clean a misconfigured predicate.
- Document why the lineage predicate URIs in the reconciliation DELETE filter (UPSTREAM / hasLineageDetails) are literal-hardcoded rather than baseUri-derived: they match what addLineageWithDetails actually writes (also hardcoded at RdfRepository.java:423,435). Switching the filter to be baseUri-derived would stop matching the stored lineage triples on non-default baseUri deployments and would incorrectly delete them. Comment added in both clearOutgoingEntityRelationships and bulkStoreRelationships so the next reader doesn't get nudged into "fixing" it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): surface cleanup failures, sync fallback predicates, time-bound reads

Addresses the three unresolved Copilot findings from review 4295208187:

- Drop the try/catch around clearAllGlossaryTermRelations in initializeJob.
  clearAllGlossaryTermRelations rethrows specifically so the indexer can fail
  loudly; wrapping it again let an unreconciled graph slip past as a
  "successful" run. The outer execute() handler will now mark the run FAILED.

- Sync DEFAULT_GLOSSARY_TERM_RELATION_PREDICATES with what SettingsCache
  actually bootstraps (SettingsCache.java:355-486): adds skos:exactMatch (the
  real default for `synonym`), om:antonym, om:partOf, om:hasPart, rdfs:seeAlso.
  Keeps legacy om:* URIs from the stale getGlossaryTermRelationPredicateUri
  switch so a cleanup run still scrubs pre-SettingsCache data.

- Apply READ_TIMEOUT_MS (10s) via QueryExecution.setTimeout on every read path
  (executeSparqlQuery for SELECT/CONSTRUCT/ASK/DESCRIBE, getEntity, getAllGraphs,
  getTripleCount, testConnection, the ontology presence check). A Fuseki that
  accepts the TCP connection but stalls mid-query no longer hangs reads
  indefinitely. UPDATE-side calls still rely on the connect timeout + circuit
  breaker + bounded pendingWrites since Jena's RDFConnection.update API
  doesn't expose a per-request timeout cleanly; comment near the constant
  notes the gap and a viable follow-up via UpdateExecHTTPBuilder.timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): qualify EntityRelationship in test to fix compile

RdfIndexAppTest references EntityRelationship.class in two verify() calls
that I added in the previous commit, but the class was never imported into
the test file. CI's openmetadata-service test compile fails with "cannot
find symbol class EntityRelationship", which cascades into 11 dependent
checks (build x2, openmetadata-service-unit-tests, three Java integration
test workflows, two Python integration test shards that build OM as a
setup step, Test Report aggregate, maven-sonarcloud-ci, and the unit-test
status gate). Use the fully qualified
org.openmetadata.schema.type.EntityRelationship to match how every other
reference in this file already spells it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): drop QueryExecution.setTimeout — removed in Jena 5 used by IT classpath

GlossaryOntologyExportIT was failing on RdfUpdater.initialize with
NoSuchMethodError: 'void org.apache.jena.query.QueryExecution.setTimeout(long,
java.util.concurrent.TimeUnit)'. openmetadata-service builds against Jena 4.10
(apache-jena-libs), but openmetadata-integration-tests directly pulls in
jena-core/jena-arq 5.0.0, and Jena 5 removed the setTimeout overloads from
the QueryExecution interface. Compile passes, integration test JVM links the
5.x class and bombs at the first read path (loadOntology's ASK check).

Strip the nine setTimeout calls and the READ_TIMEOUT_MS constant. A clean
read-side timeout that works on both Jena 4 and 5 needs to be plumbed via
QueryExecutionHTTPBuilder.timeout / UpdateExecHTTPBuilder.timeout instead of
RDFConnection — bigger change than this PR should carry. The comment near
CONNECT_TIMEOUT now records that history so the next reader knows why we
don't simply re-add setTimeout. Protection against a stalled-but-accepting
Fuseki still relies on the 5-failure circuit breaker + bounded pendingWrites
gate in RdfUpdater.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): align ontology-loaded check, predicate URIs, and CURIE fallback

Three real bugs flagged by Copilot's later review passes:

- areOntologiesLoaded() looked for `"boolean" : true` (space before colon) but
  JenaFusekiStorage formats ASK results without that space, so the check never
  matched and reloadOntologies() always threw. recreateIndex=true (now the
  default) ran into this on the very first scheduled run. Normalise whitespace
  before checking.

- bulkAddRelationships wrote `om:<lowercase relationshipType>` directly, while
  removeRelationship uses getRelationshipPredicate which maps a handful of
  types to prov:* (UPSTREAM → prov:wasDerivedFrom, USES → prov:used, etc.).
  Triples written by the indexer were therefore unreachable by the live
  remove hook. Pre-compute predicateUri via getRelationshipPredicate in
  bulkAddRelationships and pass it through a new field on RelationshipData
  so JenaFusekiStorage uses the same URI both paths agree on. The legacy
  RelationshipData(5-arg) ctor still works for callers that don't have a
  predicate handy; bulkStoreRelationships falls back to the old shape there.

- expandPredicateCurie returned bare strings like `customRel` unchanged, but
  createPropertyFromUri's default branch writes `<baseUri>ontology/customRel`.
  Custom relation predicates expressed as local names would never match the
  cleanup FILTER. Mirror createPropertyFromUri: full URIs pass through, bare
  local names get the OM-ontology prefix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): schema default + migration force entities=[all] for safe full reindex

- rdfIndexingAppConfig.json: flip recreateIndex.default from false to true so
  any UI form / config generation path that surfaces the schema default agrees
  with the install JSON files and the new full-rebuild semantics.

- 2.0.1 migration (MySQL + Postgres): in addition to flipping recreateIndex=true
  and the weekly Saturday cron, also rewrite appConfiguration.entities to
  ["all"]. Pre-upgrade an operator could have narrowed RDF indexing to a subset
  of entity types; the new recreateIndex=true semantics issues CLEAR ALL before
  indexing, which would otherwise wipe triples for excluded entity types and
  leave the graph permanently missing them. Forcing entities back to ["all"]
  ensures the post-CLEAR-ALL run repopulates the graph fully. Operators can
  re-narrow after the migration if they need partial indexing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): scope storeEntity DELETE to translator-managed predicates

Replace the literal-only FILTER(!isIRI(?o)) in JenaFusekiStorage.storeEntity
with a predicate-scoped DELETE so translator-emitted URI triples (tags,
glossary terms, owner, domain, tier, data products, structured sub-resources)
are refreshed from the new model on every entity write, while hook-managed
predicates (om:UPSTREAM, om:hasLineageDetails, om:owns / om:contains / ...)
stay intact.

Previously, with !isIRI(?o), every URI-valued triple survived storeEntity
forever — when a tag was removed or an owner changed, the old URI coexisted
with the new one because no hook ever cleans those up (tags live in
tag_usage, not entity_relationship; owners' translator-side predicate
om:hasOwner is not what the OWNS hook writes).

The DELETE set is the union of:
- RdfPropertyMapper.TRANSLATOR_MANAGED_DIRECT_PREDICATES, a static list of
  predicates that may shrink to empty between writes (so the current model
  walk wouldn't see them) — rdf:type, om:hasOwner, prov:wasAttributedTo,
  om:hasTag, om:hasGlossaryTerm, om:hasTier, om:belongsToDomain,
  om:hasDataProduct, dct:source, om:sourceUrl, plus the structured-resource
  attachment predicates (om:hasLifeCycle / hasCertification / hasExtension /
  hasCustomProperty).
- the predicates the current model actually emits for the entity subject,
  covering JSON-LD context-driven predicates that aren't in the static list.

Added two coverage tests on RdfPropertyMapperTest: the static set contains
the documented core predicates, and never contains lineage-hook predicates
(om:UPSTREAM, prov:wasDerivedFrom, om:hasLineageDetails) — that overlap
would let storeEntity wipe lineage edges on every entity update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): scope reconciliation DELETE to relationship-hook predicates only

Both clearOutgoingEntityRelationships (in RdfRepository) and the per-source
DELETE inside JenaFusekiStorage.bulkStoreRelationships used to clear ANY
outgoing edge whose object was a baseUri/entity/ URI (with only the three
lineage predicates excluded). That swept up translator-managed URI triples
(om:hasTag, om:hasGlossaryTerm, om:hasOwner, om:belongsToDomain, …) which
bulkAddRelationships does not re-emit, so reconciliation runs were
permanently destroying tag/owner/domain links.

Switch the filter to opt-in: only delete triples whose predicate is in
RELATIONSHIP_HOOK_PREDICATES, derived from the Relationship enum via the
existing getRelationshipPredicate mapping. The set excludes the lineage
predicates by skipping the UPSTREAM enum value (managed by
addLineageWithDetails). Translator-managed predicates aren't relationship
types so they're naturally not in the set; the new
RdfPredicatePartitionTest enforces the partition.

Refactored getRelationshipPredicate into a static
getRelationshipPredicateUri so it can be reused at class-init time to build
the predicate set without an instance. Added a small buildPredicateInList
helper exposed at package level for JenaFusekiStorage to reuse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): scope bulk reconciliation to batch entities, not all relationship sources

bulkStoreRelationships used to compute its per-source DELETE set from the
relationships list, so any source URI mentioned by any row in the batch was
reconciled. RdfBatchProcessor passes BOTH outgoing relationships (sources
inside the batch) and incoming UPSTREAM lineage (sources outside the batch
where this batch's entity is the target). The outside-batch sources had
their OTHER outgoing edges wiped, even though the indexer never planned to
re-index them.

Add a 2-arg overload to RdfStorageInterface.bulkStoreRelationships that
takes an explicit Set<String> sourcesToReconcile. The default 1-arg method
keeps the legacy "derive from relationships" behavior for any plugin caller
that hasn't migrated. RdfRepository.bulkAddRelationships gains a matching
overload taking Set<EntitySourceRef>; RdfBatchProcessor passes its
batchSources (the entities IT is actually indexing in this pass).

JenaFusekiStorage.bulkStoreRelationships now iterates sourcesToReconcile for
the per-source DELETE instead of computing distinctSources from
relationships. The new buildEntityUri helper on the interface lets callers
(or the default delegate) build consistent source URIs.

QLeverStorage stubs the new overload (still UnsupportedOperationException).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): time-bound HTTP request bodies via CompletableFuture wrapper

Wrap every blocking RDFConnection call in the hot read/write paths
(storeEntity DELETE+LOAD, storeRelationship, bulkStoreRelationships,
getEntity, deleteEntity, executeSparqlQuery, executeSparqlUpdate) with a
CompletableFuture-based 10s request timeout. When Fuseki accepts the TCP
connection and then stalls on the response, the caller thread now frees
after 10s instead of waiting until the OS gives up on the socket (~60s).

We chose CompletableFuture over Jena's QueryExecution.setTimeout because
that overload was removed in Jena 5 (broke integration tests already once
in this PR), and over Jena's QueryExecutionHTTPBuilder / UpdateExecHTTPBuilder
because their API surface differs between Jena 4 and Jena 5 and our two
classpaths use different versions. The CompletableFuture wrapper is Jena-
API-agnostic.

On timeout the underlying HTTP request still leaks its (virtual) thread
until OS-level TCP give-up; that's bounded by the existing circuit breaker
(after 5 timeouts the breaker opens for 30s, short-circuiting subsequent
traffic).

Lower-traffic paths (loadTurtleFile, clearGraph, getAllGraphs, getTripleCount,
loadOntology, testConnection) keep using the direct connection.update /
connection.query / connection.load calls — they're protected by the
circuit breaker and the connect timeout, and adding wrappers there is
churn without proportional benefit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(rdf): document RdfUpdater async-ordering trade-off in submitAsync

Add a comment block in RdfUpdater.submitAsync explaining why we accept the
loss of per-entity ordering when submitting through AsyncService:
- EntityUpdater diff-applies changes per request, so add-then-remove of the
  same edge within one API call nets to no-op (no hooks fire).
- Cross-request races reconcile at the next weekly recreate-index, which
  rebuilds from MySQL.
- The alternative (per-entity striped lock) costs memory and adds latency
  for the no-contention common case.
- Pointers for the future maintainer if an observed-in-production race
  emerges: gate via ConcurrentHashMap<UUID, Semaphore>.

No behavior change. The two open Copilot threads on this trade-off
(M6CQYup, M6CYbM2) stay open so a future PR can pick them up if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): atomic clear+insert, broader fallback predicate set, close temp models

Three follow-up findings from the latest Copilot pass:

- Atomicity (3249716506): clearOutgoingEntityRelationships + bulkAddRelationships
  ran as two separate SPARQL updates. If bulkAddRelationships failed after the
  clear succeeded, the batch entities had their relationships wiped without
  the replacement edges in place — they stayed gone until the next weekly
  recreate-index. Combine the per-source DELETE and the INSERT DATA into a
  single SPARQL update inside JenaFusekiStorage.bulkStoreRelationships and
  drop the now-redundant separate clear call from RdfBatchProcessor. Either
  the whole reconciliation commits or none of it does. Also let
  bulkStoreRelationships handle the zero-edge case (relationships empty,
  sourcesToReconcile non-empty) so RdfBatchProcessor doesn't need a separate
  clear for entities whose relationships were all removed in MySQL.

- Fallback predicate set (3249716532): when SettingsCache returns null,
  getGlossaryTermRelationPredicate falls back to literal
  `https://open-metadata.org/ontology/<relationType>` — so `broader` /
  `narrower` / `exactMatch` get written as om:broader/om:narrower/om:exactMatch,
  not skos:* equivalents. Without those URIs in DEFAULT_GLOSSARY_TERM_RELATION_
  PREDICATES, a cleanup run during a transient settings-cache outage would
  miss them. Added the three om:* fallback variants alongside the existing
  skos:*/rdfs:* bootstrap defaults.

- Temp Model leaks (3249319886): bulkAddRelationships and removeRelationship
  each create an ephemeral Jena Model just to mint property URIs. Wrapped
  both in try/finally close() so the in-memory graphs are released right after
  use. Jena 4's Model has a close() method but doesn't implement
  java.lang.AutoCloseable so try-with-resources isn't possible there.

Copilot's "still deleting only non-IRI" finding (3249716480) is a stale-
snapshot false positive — JenaFusekiStorage.storeEntity has used predicate-
scoped DELETE via TRANSLATOR_MANAGED_DIRECT_PREDICATES since 22d5825.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): make buildPredicateInList public so JenaFusekiStorage can use it

JenaFusekiStorage (org.openmetadata.service.rdf.storage) lives in a different
package than RdfRepository (org.openmetadata.service.rdf), so the
package-private buildPredicateInList helper introduced in 857c09 couldn't be
called from JenaFusekiStorage.bulkStoreRelationships — CI was failing with:

  [ERROR] JenaFusekiStorage.java:[606,51] buildPredicateInList(Set<String>)
  is not public in RdfRepository; cannot be accessed from outside package

Promote it to public alongside RELATIONSHIP_HOOK_PREDICATES (which is the
only data this helper renders) so the cross-package call resolves. Local
javac across the touched RDF files now reports zero new errors; the only
remaining build failures are the pre-existing es.co.elastic.clients shading
issues unrelated to this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): normalise sourcesToReconcile to empty-set to prevent NPE in iteration

bulkStoreRelationships' early-return guard accepts sourcesToReconcile == null
as a valid input, but the subsequent per-source DELETE loop iterates
sourcesToReconcile directly — so a caller passing null with a non-empty
relationships list would skip the guard and crash at the for-loop.

Today no caller hits this path (RdfRepository.bulkAddRelationships always
passes non-null, and the 1-arg default interface method derives a set), but
the null-check in the guard explicitly encodes null as supported, so the
contract should match the iteration. Normalise once after the guard:

    Set<String> effectiveSources =
        sourcesToReconcile != null ? sourcesToReconcile : Set.of();

and use effectiveSources for both the loop and the success-log size.

Local filtered compile passes cleanly (zero NEW errors from RDF files;
remaining errors are the pre-existing es.co.elastic.clients shading mess).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(rdf): update RdfIndexAppTest verifications for the new bulkAddRelationships 2-arg signature

Three test failures after the Fix-I / atomic-clear-insert changes:

- testProcessBatchRelationshipsStoresResults verified
  `bulkAddRelationships(captor.capture())` (1-arg) but RdfBatchProcessor now
  calls the 2-arg `bulkAddRelationships(relationships, batchSources)` — Mockito
  surfaced this as "different arguments" because the actual call had a
  Set<EntitySourceRef> tail. Updated the verify to
  `bulkAddRelationships(captor.capture(), anySet())`.

- The two event-subscription skip tests previously verified
  `clearOutgoingEntityRelationships(anySet())` as the only interaction; that
  method is no longer called from RdfBatchProcessor (the clear was folded
  into bulkAddRelationships' atomic SPARQL transaction for safety). Replace
  with `verify(mockRdfRepository).bulkAddRelationships(eq(List.of()), anySet())`
  — bulkAdd is still invoked with an empty list to drive the per-source
  reconciliation for the batch entity, even when the only fetched
  relationship pointed at an excluded entity type.

Filtered local compile + test-compile passes cleanly (no NEW errors from RDF
files; only pre-existing es.co.elastic.clients shading errors remain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): four follow-up findings from Copilot review 4299978111

- collectTranslatorPredicates over-broad (3249798300): RdfRepository.addRelationship
  passes storeEntity a model loaded from Fuseki PLUS the new relationship, so the
  dynamic walk was pulling hook-managed predicates (om:owns, etc.) into the DELETE
  scope. With async writes, two concurrent additions for the same source could
  each read the old model and each storeEntity wipe the other's relationship.
  Exclude RELATIONSHIP_HOOK_PREDICATES from the walk result (and defensively from
  the static-set union too).

- ForkJoinPool.commonPool starvation (3249798327): runWithTimeout used
  CompletableFuture.supplyAsync's default executor, so a Fuseki that stalls would
  leak workers on the JVM-wide commonPool and starve unrelated CompletableFuture
  / parallel-stream work. Introduce a dedicated virtual-thread executor
  (Thread.ofVirtual().name("rdf-storage-timeout-", 0)) and route all timeout
  wrappers through it — virtual threads are cheap to leak and the circuit breaker
  bounds the pile-up.

- Shrink-to-empty for literal predicates (3249798383): the predicate-scoped DELETE
  no longer caught stale literals when a literal-valued field (description /
  displayName / …) was cleared and the new model simply omitted the triple. Chain
  a "DELETE … FILTER(!isIRI(?o))" pass with the URI-scoped pass so hook-managed
  URI triples stay intact while stale literals get swept on every write.

- UI schema default (3249798439): the UI form schema at
  utils/ApplicationSchemas/RdfIndexApp.json still declared recreateIndex.default
  = false. Flipped to true to match the backend openmetadata-spec schema and the
  install JSON files. (The sibling jsons/applicationSchemas/ is gitignored
  generated output, no source change needed there.)

Local verification before push: spotless:apply, filtered compile + test-compile
(zero new errors), and `mvn test -Dtest='RdfIndexAppTest,RdfPropertyMapperTest,
RdfPredicatePartitionTest,RdfStorageIdempotencyTest'` — 64 tests, 0 failures.

The "buildPredicateInList package-private" finding from the same review
(3249798351) is already addressed in 03c5d4f and surfaces here only because
Copilot reviewed an earlier commit.

The "lineage incremental cleanup" finding (3249798415) is a known architectural
trade-off: addLineageWithDetails handles current lineage rows but removed edges
have no row to trigger a per-edge delete, and adding UPSTREAM/wasDerivedFrom to
RELATIONSHIP_HOOK_PREDICATES would conflict with the inline addLineageWithDetails
call that runs BEFORE bulkAddRelationships in RdfBatchProcessor. The weekly
recreateIndex=true run (the new default) wipes and rebuilds from MySQL, which
reconciles stale lineage; left this thread open as a documented gap rather
than reordering processBatchRelationships in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mohityadav766 pushed a commit that referenced this pull request May 20, 2026
…ta#28224)

* feat(spec): add ContextMemory + CreateContextMemory JSON schemas

* feat(jdbi3): add ContextMemoryDAO

* feat: register contextMemory entity type constant

* feat(service): add ContextMemory repository, resource, mapper

* feat(bootstrap): add context_memory table DDL

* test(service): ContextMemory resource CRUD test

* fix(context-memory): address review (relationship types, stable FQN, status msg, test name)

- storeRelationships: rootMemory -> Relationship.CONTAINS, parentMemory -> Relationship.HAS
  so the root-ancestor and direct-parent hierarchies are distinguishable.
- setFullyQualifiedName: derive from the immutable name only (drop mutable
  primaryEntity/owner derivation that destabilized nameHash on update).
- validateStatusTransition: separate "no transitions defined" from "disallowed transition".
- Rename ContextMemoryResourceTest -> ContextMemoryStatusTransitionTest (pure unit test).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(context-memory): add ContextMemoryIT + SDK ContextMemoryService

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(spec): register contextMemory in EntityLink.g4 ENTITY_TYPE grammar

EntityLinkGrammarTest.testAllEntityTypesHaveGrammarOrExclusion enumerates every
Entity.java constant and requires each to be in the EntityLink grammar or the
test's exclusion list. ContextMemory is a normal EntityRepository-backed
top-level entity (like learningResource / contextFile), so it belongs in the
ENTITY_TYPE rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(context-memory): override owner ITs for creator-as-owner default

ContextMemoryMapper.defaultOwners() intentionally assigns the creating
user as owner when the create request omits owners. BaseEntityIT's
patch_entityUpdateOwner_200 and patch_entityUpdateOwnerFromNull_200
assert "no owner initially" for any supportsOwners entity, so both
failed for ContextMemory.

Override both in ContextMemoryIT: keep the PATCH-replace-owner contract,
change only the precondition to expect the creator as the sole initial
owner (asserted by count, not a hardcoded principal). Mapper unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update generated TypeScript types

Add the generated ContextMemory TS types (entity/context/contextMemory.ts,
api/context/createContextMemory.ts). The schemas were on the branch but their
generated types were missing, failing the TypeScript Type Generation check on
this fork PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(context-memory): address review (relationship cleanup, owner scope, validations)

Copilot review on the ContextMemory entity:
- #1 record primaryEntity/relatedEntities/root/parent/source*/machineRepresentation
  in version history; usageCount/lastUsedAt documented as untracked telemetry
- #2 clear stale HAS/RELATED_TO/CONTAINS edges before re-adding in storeRelationships
- #4 default creator as owner only on create; PUT without owners no longer
  silently replaces previously set owners
- #5 schema documents that any status is allowed at creation; transitions
  enforced only on update
- #6 setFullyQualifiedName via FullyQualifiedName.build with skip-if-set guard
- #7 validate shared principal type is user/team/domain
- #8 reject self-reference for parentMemory/rootMemory
- open-metadata#10 inline Entity.CONTEXT_MEMORY, drop redundant constant

Regenerate ContextMemory TS types for the schema doc change; add IT coverage
for the self-reference and invalid-shared-principal validations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(context-memory): don't blanket-delete relationships (domain data loss)

The #2 cleanup via deleteTo(memory, CONTEXT_MEMORY, HAS, null) also matched the
framework's domain --HAS--> memory edge (storeDomains runs before
storeRelationships in storeRelationshipsInternal, on every create and update),
silently dropping domain assignments.

storeRelationships is now add-only (addRelationship upserts, so re-running on
update is idempotent). Stale-edge cleanup moved to ContextMemoryUpdater using
the framework's updateFromRelationship(s) helpers, which delete only the
specific changed refs and record the version change. parentMemory now uses
Relationship.PARENT_OF (distinct from primaryEntity's HAS and the framework's
domain HAS) so the parent edge can be maintained without collision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(bootstrap): move context_memory DDL from 2.0.1 to 2.0.0

The context_memory table belongs in the 2.0.0 migration. Relocated the
MySQL and Postgres DDL verbatim; the 2.0.1 schemaChanges.sql files are
restored to their original task_migration_mapping-only content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(bootstrap): add ENGINE=InnoDB to context_memory MySQL DDL

Explicit engine clause, consistent with the task/search-index tables in the
same migration and robust to any server default change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(context-memory): preserve sanitized/validated fields; validate relatedEntities

Review follow-ups:
- ContextMemoryMapper no longer re-sets description/owners/domains/tags/displayName
  after copy(). copy() sanitizes description (stored-XSS) and validates owners and
  domains; re-setting the raw request values bypassed both. Only ContextMemory-
  specific fields are set now.
- prepare() now assigns the result of EntityUtil.populateEntityReferences back onto
  relatedEntities so orphaned/invalid refs are filtered instead of persisted.
- ContextMemoryIT Javadoc now references ContextMemoryRepository#setCreatorAsDefaultOwner
  (the defaultOwners mapper method no longer exists).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant