feat(sql): cross-DB join key inference via prefix/suffix overlap#761
Adds the `altimate_core_detect_join_candidates` tool: given two or more warehouse connection names, pull a small bag of string sample values per (table, column), then for every cross-DB pair compute the longest common value prefix on each side. When both prefixes end in `_`, `-`, or `:`, differ from each other, and leave at least one matching suffix after stripping, emit a ranked join candidate.

This targets the canonical pattern where one DB stores `businessid_42` and another stores `businessref_42`. The inference is purely value-based, so it survives schemas that disagree on naming conventions.

- Algorithm in `native/connections/detect-join-candidates.ts` (port of dab_bench's `_detect_join_candidates` / `_common_prefix`).
- Native handler registered as `altimate_core.detect_join_candidates`, ranked by suffix overlap then confidence.
- Tool wrapper, registry entry, and barrel export.
- Tests: 21 cases covering the prefix walk-back, suffix overlap, ranking, and an integration test against two `bun:sqlite` `:memory:` DBs holding the canonical `businessid_X` ↔ `businessref_X` pattern.
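The inference described above can be sketched in a few pure functions. This is a minimal illustration, not the code in `detect-join-candidates.ts`: the `SampleBag` and `Candidate` shapes, the function names, and the exact walk-back rule (trim the common prefix back to its last separator) are assumptions made for the example. The score uses the `overlap / min(|left|, |right|)` formula given in the PR description.

```typescript
// Illustrative sketch only: names and shapes are assumptions, not the PR's code.
const SEPARATORS = new Set(["_", "-", ":"])

interface SampleBag { db: string; table: string; column: string; values: string[] }
interface Candidate {
  left: SampleBag
  right: SampleBag
  suffixOverlap: number
  confidence: number // heuristic ranking signal, not a probability
}

/** Longest common prefix of all values, walked back to the last separator. */
function commonPrefix(values: readonly string[]): string {
  if (values.length === 0) return ""
  let prefix = values[0]
  for (const v of values) {
    let i = 0
    while (i < prefix.length && i < v.length && prefix[i] === v[i]) i++
    prefix = prefix.slice(0, i)
  }
  // Assumed walk-back: "businessid_4" becomes "businessid_"; no separator yields "".
  let end = prefix.length
  while (end > 0 && !SEPARATORS.has(prefix[end - 1])) end--
  return prefix.slice(0, end)
}

/** Non-empty remainders of the values that start with the prefix. */
function suffixes(values: readonly string[], prefix: string): Set<string> {
  const out = new Set<string>()
  for (const v of values) {
    if (v.startsWith(prefix) && v.length > prefix.length) out.add(v.slice(prefix.length))
  }
  return out
}

function detect(bags: readonly SampleBag[]): Candidate[] {
  const found: Candidate[] = []
  for (let i = 0; i < bags.length; i++) {
    for (let j = i + 1; j < bags.length; j++) {
      const left = bags[i]
      const right = bags[j]
      if (left.db === right.db) continue // cross-DB pairs only
      const lp = commonPrefix(left.values)
      const rp = commonPrefix(right.values)
      if (lp === "" || rp === "" || lp === rp) continue // need two distinct separator-terminated prefixes
      const ls = suffixes(left.values, lp)
      const rs = suffixes(right.values, rp)
      const overlap = Array.from(ls).filter((s) => rs.has(s)).length
      if (overlap === 0) continue
      found.push({ left, right, suffixOverlap: overlap, confidence: overlap / Math.min(ls.size, rs.size) })
    }
  }
  // Rank by suffix overlap first, then by the score.
  return found.sort((a, b) => b.suffixOverlap - a.suffixOverlap || b.confidence - a.confidence)
}
```

Keeping these functions free of I/O is what lets the unit tests exercise the prefix and overlap logic without a database connection.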
📝 Walkthrough
Adds an end-to-end cross-database join-candidate detection feature: sampling string-like column values from multiple connections, inferring separator-aware prefix/suffix join patterns, exposing a native handler, a tool wrapper, types, registry export, and tests.
Sequence Diagram
sequenceDiagram
participant Client
participant Tool as AltimateCoreDetectJoinCandidatesTool
participant Dispatcher
participant Detection as detectJoinCandidates
participant Sampler as collectSampleBags
participant DB1 as Connection1
participant DB2 as Connection2
participant Inference as detectJoinCandidatesFromBags
Client->>Tool: execute(params)
Tool->>Dispatcher: call("altimate_core.detect_join_candidates", params)
Dispatcher->>Detection: detectJoinCandidates(params)
Detection->>Sampler: collectSampleBags(params)
par Parallel Sampling
Sampler->>DB1: sample non-null string columns
DB1-->>Sampler: sample values
Sampler->>DB2: sample non-null string columns
DB2-->>Sampler: sample values
end
Sampler-->>Detection: { bags, partialErrors, connectionErrors }
Detection->>Inference: detectJoinCandidatesFromBags(bags)
Inference-->>Detection: sorted JoinCandidate[]
Detection-->>Dispatcher: { success, candidates, errors, partialErrors }
Dispatcher-->>Tool: AltimateCoreResult
Tool->>Client: formatted output + metadata
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you.
Actionable comments posted: 4
🧹 Nitpick comments (1)
packages/opencode/src/altimate/index.ts (1)
19-19: Avoid re-exporting the tool’s test-only internals.
`export *` here also exposes `_altimateCoreDetectJoinCandidatesInternal` from the tool module through the public Altimate barrel. Please export just `AltimateCoreDetectJoinCandidatesTool` here, or keep the test helper out of the module's public exports.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opencode/src/altimate/index.ts` at line 19, The barrel currently re-exports everything from "./tools/altimate-core-detect-join-candidates", which unintentionally exposes the test-only symbol _altimateCoreDetectJoinCandidatesInternal; change the wildcard export to a named export that only exports the public tool (AltimateCoreDetectJoinCandidatesTool) so the internal helper is not leaked, i.e., replace the export * with an explicit export of AltimateCoreDetectJoinCandidatesTool (or alternatively move _altimateCoreDetectJoinCandidatesInternal out of the module's public exports).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/opencode/src/altimate/native/connections/detect-join-candidates.ts`:
- Around line 256-261: The helper safeListSchemas currently swallows errors from
Connector.listSchemas and returns ["public"], hiding real failures; change
safeListSchemas to not silently fall back — remove the catch that returns
["public"] and instead let the original error propagate (or throw a new Error
with context) so upstream code (e.g., detect-join-candidates logic that calls
safeListSchemas) can record the failure in connection_errors and surface a
failed run rather than returning an empty candidate set.
- Around line 183-203: fetchColumnSamples currently constructs SQL with
hard-coded double-quoted identifiers and a trailing LIMIT which breaks on
several dialects and then swallows errors; update it to use the existing
dialect-aware quoting (e.g., reuse quoteIdentForDialect or the same logic from
data-diff.ts) for schema/table/column identifiers, replace the hard-coded LIMIT
with dialect-appropriate paging (TOP/OFFSET/FETCH or FETCH FIRST ... ROWS ONLY
depending on connector.dialect), and stop silently swallowing exceptions —
surface or log the connector error (throw or return a propagated error) so
callers know sampling failed; locate fetchColumnSamples and the helper
quoteIdentForDialect/quoteIdent usage to apply the changes.
In
`@packages/opencode/src/altimate/tools/altimate-core-detect-join-candidates.ts`:
- Around line 59-75: The current return always prints "Join candidates: X found"
even when detectJoinCandidates returned a failure; update the logic in the
function that builds the response (the block using result, error, candidates,
connectionErrors and calling formatCandidates) to first check result.success
and/or error and, if the call failed, return a failure response instead of
formatting candidates: set metadata.success = false, include the raw error
string in title and metadata.error, include connection_errors if present, and
set output to either an error-aware message or the original result.error rather
than formatCandidates(candidates, ...); otherwise (success) keep the existing
title/count and call formatCandidates as before. Ensure you reference result,
error, candidates, connectionErrors and formatCandidates in your change.
- Around line 51-58: The execute method in altimate-core-detect-join-candidates
currently calls Dispatcher.call("altimate_core.detect_join_candidates", ...)
without requesting user approval; add an explicit approval step using ctx.ask({
permission: "sql_execute_read", title: "...", description: "..." }) (matching
the other read-only warehouse tools) before invoking Dispatcher.call; if ctx.ask
is denied or returns falsy, abort/throw or return an appropriate error,
otherwise proceed to call Dispatcher.call with the same args; ensure this
approval lives inside the async execute(args, _ctx) function and surrounds the
call to Dispatcher.call so no SELECTs are issued before approval.
---
Nitpick comments:
In `@packages/opencode/src/altimate/index.ts`:
- Line 19: The barrel currently re-exports everything from
"./tools/altimate-core-detect-join-candidates", which unintentionally exposes
the test-only symbol _altimateCoreDetectJoinCandidatesInternal; change the
wildcard export to a named export that only exports the public tool
(AltimateCoreDetectJoinCandidatesTool) so the internal helper is not leaked,
i.e., replace the export * with an explicit export of
AltimateCoreDetectJoinCandidatesTool (or alternatively move
_altimateCoreDetectJoinCandidatesInternal out of the module's public exports).
📒 Files selected for processing (8)
- packages/opencode/src/altimate/index.ts
- packages/opencode/src/altimate/native/connections/detect-join-candidates.ts
- packages/opencode/src/altimate/native/connections/register.ts
- packages/opencode/src/altimate/native/types.ts
- packages/opencode/src/altimate/tools/altimate-core-detect-join-candidates.ts
- packages/opencode/src/tool/registry.ts
- packages/opencode/test/altimate/altimate-core-native.test.ts
- packages/opencode/test/altimate/tools/altimate-core-detect-join-candidates.test.ts
4 issues found across 8 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/opencode/src/altimate/native/connections/detect-join-candidates.ts">
<violation number="1" location="packages/opencode/src/altimate/native/connections/detect-join-candidates.ts:176">
P1: Dialect-specific identifier quoting is required; always using ANSI double quotes breaks MySQL/MariaDB when `ANSI_QUOTES` is not enabled.</violation>
<violation number="2" location="packages/opencode/src/altimate/native/connections/detect-join-candidates.ts:192">
P1: Do not hardcode `LIMIT` in the sampling SQL; rely on `connector.execute(..., sampleSize)` to apply dialect-specific limiting.</violation>
<violation number="3" location="packages/opencode/src/altimate/native/connections/detect-join-candidates.ts:260">
P2: Falling back to `["public"]` when `listSchemas()` fails silently produces zero candidates on databases where the default schema is not `public` (e.g., SQL Server uses `dbo`, Oracle uses the username). The error isn't surfaced in `connection_errors`, so the run appears successful with an empty result. Either propagate the error or use a dialect-aware default schema.</violation>
</file>
<file name="packages/opencode/src/altimate/tools/altimate-core-detect-join-candidates.ts">
<violation number="1" location="packages/opencode/src/altimate/tools/altimate-core-detect-join-candidates.ts:66">
P2: When the native call returns `{ success: false, error }` without throwing, this code path is still reached and renders `Join candidates: 0 found` / `No cross-DB join candidates detected`, hiding the actual error from the user. Add an early return for `!result.success` that surfaces the error message (similar to the catch block below that returns `title: "Join candidates: ERROR"`).</violation>
</file>
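The early return requested in the violation above could look roughly like the sketch below. It is a hedged illustration only: `DetectResult`, `ToolResponse`, and `renderDetectResult` are hypothetical names, and the shape of `connection_errors` (a map from connection name to message) is an assumption. The title strings follow the wording quoted in the review comments.

```typescript
// Hypothetical shapes for illustration; the real types live in native/types.ts.
interface DetectResult {
  success: boolean
  error?: string
  candidates?: unknown[]
  connection_errors?: Record<string, string> // assumed shape
}

interface ToolResponse {
  title: string
  output: string
  metadata: { success: boolean; error?: string; candidate_count?: number }
}

function renderDetectResult(result: DetectResult): ToolResponse {
  // Check the failure envelope first, so "0 found" is never shown for a failed call.
  if (!result.success) {
    const error = result.error ?? "unknown error"
    const lines = [`Join candidate detection failed: ${error}`]
    for (const [conn, msg] of Object.entries(result.connection_errors ?? {})) {
      lines.push(`  ${conn}: ${msg}`)
    }
    return { title: "Join candidates: FAILED", output: lines.join("\n"), metadata: { success: false, error } }
  }
  const count = result.candidates?.length ?? 0
  return {
    title: `Join candidates: ${count} found`,
    output: count === 0 ? "No cross-DB join candidates detected" : `${count} candidate(s)`,
    metadata: { success: true, candidate_count: count },
  }
}
```

Separating the rendering from the dispatcher call keeps this branch testable without stubbing any warehouse connection.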
Multi-Model Consensus Code Review: PR #761
Title: feat(sql): cross-DB join key inference via prefix/suffix overlap
Verdict: REQUEST CHANGES
Three independent reviewers (Claude, GPT, Gemini) flagged the SQL portability problem as a blocker; GPT additionally flagged a missing permission-gating step. Kimi gave a "conditional yes" but its concerns overlap with the major issues raised elsewhere. The pure algorithm and test architecture are unanimously praised; the warehouse-facing implementation is unanimously called out for not aligning with the repo's dialect-aware patterns. The fixes are small and bounded (likely ~30 LOC plus a couple of tests) and worth doing before merge so the tool doesn't ship as silently-broken on MySQL/SQL Server/Oracle/BigQuery.
CRITICAL
C1 —
| Finding | Severity | Reviewers |
|---|---|---|
| C1 SQL portability (LIMIT + quoting) | CRITICAL | Claude, GPT, Gemini, Kimi |
| M1 No permission gating | MAJOR | GPT |
| M2 `["public"]` schema fallback | MAJOR | Claude, GPT, Gemini |
| M3 Silent per-table errors | MAJOR | Claude, GPT |
| M4 `success: false` mishandled | MAJOR | GPT |
| M5 `confidence` misleading | MAJOR | Claude |
| M6 Serial sample collection | MAJOR | Gemini |
| M7 No upper bounds on params | MAJOR | Kimi |
| MI1 Implicit `any` register | MINOR | Claude |
| MI2 String-type pattern coverage | MINOR | Claude, Kimi |
| MI3 No column exclusion list | MINOR | Kimi |
| MI4 No candidate dedup | MINOR | Kimi |
| MI5 schema/table compound key | MINOR | Kimi |
| MI6 Output truncation | MINOR | Gemini |
| MI7 Dead telemetry env-var | MINOR | Claude |
| MI8 Cap-truncation not signaled | MINOR | Claude |
| N1 dab_bench codename leak | NIT | Claude, Kimi |
| N2 Unicode separator | NIT | Claude |
| N3 Double-wrapped errors | NIT | Kimi |
Disagreements
- Kimi's overall verdict was "conditional yes, fix #1, #4, #8 before merge" while Claude / GPT / Gemini lean "request changes" (with C1 as blocker). The disagreement reduces to whether SQL portability is a release blocker or a follow-up: GPT and Claude argue it ships visibly broken on multiple supported warehouses (silent zero-result), so it is a blocker. Kimi assumed dialect parity would be sorted post-merge.
- Performance framing differs: Gemini emphasizes the serial-I/O latency (M6); Kimi emphasizes the algorithmic worst case in `commonPrefix` (long values); Claude emphasizes the unbounded bag count (M7-related). All are valid; the bounded fix is M7's Zod cap, the latency fix is M6.
- `commonPrefix` long-string defensiveness (Kimi suggested a 100-char early-exit) vs. leave as-is (Claude's view): the values are sample-bounded by `sample_size <= 50` by default, so the worst case is small. Apply only if M7 doesn't add the upper bound.
Footer
Reviewed by 8 participants: Claude + GPT 5.4 Codex + Gemini 3.1 Pro + Kimi K2.5 + MiniMax M2.7 + GLM-5.1 + Qwen 3.6 + MiMo V2 Pro.
Active responses: 4/8 — Claude, GPT (full), Gemini (partial — Gemini API quota exhaustion mid-run, but produced a coherent review), Kimi (full).
Failed/timed-out: 4/8 — GLM-5.1, Qwen 3.6, MiniMax M2.7, MiMo V2 Pro all blocked by a kilo SQLite database lock (a separate consensus run on PR #762 held the DB lock; running these models sequentially still hit the contention plus per-process startup races). The synthesis above represents the consensus of the 4 active reviewers.
This synthesis is single-pass (no convergence rounds) per the user's instruction.
Fixes the CRITICAL and MAJOR issues raised in the multi-model consensus review on PR #761.

CRITICAL (SQL portability):
- Reuse `quoteIdentForDialect` from `data-diff.ts` (now exported) so identifier quoting matches the per-dialect convention: backticks on MySQL/MariaDB/ClickHouse, square brackets on T-SQL/Fabric, ANSI double-quotes elsewhere.
- Drop the hardcoded `LIMIT N` clause and pass the cap through `connector.execute(sql, sampleSize)` so each driver applies its native limit syntax (`LIMIT`, `TOP`, `FETCH FIRST`, ...).
- Extract `buildSampleSql` as a pure helper so tests can snapshot the emitted SQL per dialect without going through I/O.

MAJOR fixes:
- Permission gating: tool wrapper now requests `sql_execute_read` via `ctx.ask()` before issuing any SELECT, matching `data_diff` / `sql_execute`.
- Drop the unsafe `["public"]` schema fallback. When `listSchemas()` fails and no `schema_name` is provided, record the failure as a connection-level error and skip the connection rather than silently scanning the wrong schema.
- Surface per-table / per-column sampling failures via a bounded `partial_errors: Record<string, string[]>` field instead of swallowing them. Format includes errors in the human-readable output.
- Tool wrapper now returns a `FAILED` envelope when the dispatcher returns `{ success: false, error }` (e.g. on the connection-count guard) instead of rendering "0 found".
- Rename `confidence` → `match_score` with an updated JSDoc that calls out it's a heuristic ranking signal (`overlap / min(|left|, |right|)`), not a probability.
- Parallelize `collectSampleBags` at the connection level via `Promise.all`; connections are independent and the per-connection cap keeps blast radius bounded.
- Add Zod upper bounds: `connections` <= 16, `sample_size` <= 1000, `max_tables_per_connection` <= 500. Prevents an oversized LLM call from blowing memory or holding warehouse connections indefinitely.

Minor fix:
- Type the dispatcher registration with `AltimateCoreDetectJoinCandidatesParams` / `AltimateCoreResult` so the handler matches the rest of the file.

Tests:
- 14 new tests covering dialect-aware SQL emission (MySQL, T-SQL, Fabric, Postgres, Snowflake, generic), absence of inlined LIMIT, delimiter escaping, the `success: false` envelope path, the permission-gating call, the Zod upper-bound rejections, the `partial_errors` collection, and the no-fallback `listSchemas` behaviour. Updates existing tests for the `confidence` → `match_score` rename.

Final tally: 35 passing tests, 0 failing.

Defers (per consensus-review tags): all MINOR/NIT findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
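The dialect-aware quoting and limit-free sampling SQL described in this commit message can be sketched as below. This is an illustration under stated assumptions, not the exported `quoteIdentForDialect` or `buildSampleSql`: the function names, the dialect strings (`"tsql"`, `"fabric"`, and so on), the escape-by-doubling rule, and the exact `SELECT ... WHERE ... IS NOT NULL` shape are all assumed. Only the per-dialect delimiter choice and the absence of an inline `LIMIT` come from the commit message.

```typescript
// Sketch only: dialect names and the escaping rule are assumptions.
function quoteIdentSketch(identifier: string, dialect: string): string {
  switch (dialect) {
    case "mysql":
    case "mariadb":
    case "clickhouse":
      // Backtick-quoted; doubling the delimiter is assumed here and may not suit every engine.
      return "`" + identifier.replace(/`/g, "``") + "`"
    case "tsql":
    case "fabric":
      // Bracket-quoted; a literal "]" is doubled.
      return "[" + identifier.replace(/\]/g, "]]") + "]"
    default:
      // ANSI double quotes; a literal '"' is doubled.
      return '"' + identifier.replace(/"/g, '""') + '"'
  }
}

/**
 * Builds the sampling query without any LIMIT/TOP clause. The row cap is
 * expected to be passed separately, e.g. connector.execute(sql, sampleSize).
 */
function buildSampleSqlSketch(dialect: string, schema: string, table: string, column: string): string {
  const col = quoteIdentSketch(column, dialect)
  const rel = `${quoteIdentSketch(schema, dialect)}.${quoteIdentSketch(table, dialect)}`
  return `SELECT ${col} FROM ${rel} WHERE ${col} IS NOT NULL`
}
```

Because the builder is pure, a test can compare the emitted string per dialect directly, which is the property the commit message cites for extracting `buildSampleSql`.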
Update: addressed-vs-pending from the consensus review
Pushed
Addressed (this commit)
Tests: 35 pass / 0 fail in the targeted file (was 21; +14 new for dialect SQL emission across MySQL/T-SQL/Fabric/Postgres/Snowflake/generic, no-
Deferred (per consensus tags)
All MINOR / NIT findings: MI2 (string-type pattern coverage), MI3 (column-name skip list), MI4 (candidate dedup), MI5 (compound
Demoted as invalid: none from this PR (no Kimi-style misreads).
Heads-up
Actionable comments posted: 1
♻️ Duplicate comments (1)
packages/opencode/src/altimate/native/connections/data-diff.ts (1)
42-56: ⚠️ Potential issue | 🟠 Major
Add an explicit BigQuery branch here.
`detect-join-candidates` now reuses this helper for sampling SQL, but the fallback branch still emits ANSI-style `"identifier"` quoting. BigQuery does not use that quoting mode for identifiers, so the new feature will still generate invalid SQL there whenever a table/column name needs quoting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opencode/src/altimate/native/connections/data-diff.ts` around lines 42 - 56, The fallback ANSI double-quote branch emits invalid identifier quoting for BigQuery; update quoteIdentForDialect to add an explicit "bigquery" case that quotes identifiers with backticks (same style as the mysql/mariadb/clickhouse branch) and properly escapes any backticks inside the identifier (e.g., identifier.replace(/`/g, "``")), so detect-join-candidates generates valid BigQuery SQL; modify the switch in quoteIdentForDialect to include case "bigquery": return `\`${identifier.replace(/`/g, "``")}\``.
🧹 Nitpick comments (1)
packages/opencode/src/altimate/native/connections/detect-join-candidates.ts (1)
279-293: Surface when the table cap truncates a scan.
Once `tablesScanned` hits `max_tables_per_connection`, sampling just stops and the response still looks complete. Returning per-connection metadata like `truncated`, `tables_scanned`, or `tables_skipped` would make zero/low-result scans much easier to interpret.
Also applies to: 388-395
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opencode/src/altimate/native/connections/detect-join-candidates.ts` around lines 279 - 293, The scanning loop that uses tablesScanned and maxTables (in detect-join-candidates.ts around the for (const schema of schemas) loop and the similar loop at the later block) currently stops silently when tablesScanned >= maxTables; update the per-connection result object returned by this code to include metadata fields such as truncated (boolean), tables_scanned (number), and tables_skipped (number) and set truncated = true whenever you break out due to the cap; increment tables_scanned as you already do, compute tables_skipped as the remaining tables not scanned in this connector/schema, and ensure the same changes are applied to the second loop (the block around lines 388-395) and any place that returns connector scan results so callers can see when scans were truncated.
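One way to produce the metadata this nitpick asks for is sketched below. It is a hypothetical helper, not code from the PR: `planScan`, `ScanSummary`, and the input shape (a map from schema name to table names) are assumptions. The field names `truncated`, `tables_scanned`, and `tables_skipped` are taken from the review comment.

```typescript
// Hypothetical helper: decides which tables to sample under a cap and
// reports whether the cap cut the scan short.
interface ScanSummary {
  scanned: string[] // qualified names selected for sampling
  tables_scanned: number
  tables_skipped: number
  truncated: boolean
}

function planScan(tablesBySchema: Record<string, string[]>, maxTables: number): ScanSummary {
  const scanned: string[] = []
  let total = 0
  for (const [schema, tables] of Object.entries(tablesBySchema)) {
    for (const table of tables) {
      total++
      // Keep counting past the cap so the skipped total is exact.
      if (scanned.length < maxTables) scanned.push(`${schema}.${table}`)
    }
  }
  const skipped = total - scanned.length
  return { scanned, tables_scanned: scanned.length, tables_skipped: skipped, truncated: skipped > 0 }
}
```

An exact `tables_skipped` requires listing every table rather than breaking out at the cap. If listing is expensive, reporting only `truncated` and `tables_scanned` would avoid that cost.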
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/opencode/src/altimate/native/connections/detect-join-candidates.ts`:
- Around line 150-156: stripPrefixSet currently adds suffixes that are only
whitespace because it checks suf.length > 0; update stripPrefixSet(values,
prefix) to trim whitespace before deciding and storing: compute suf =
v.slice(prefix.length), then let trimmed = suf.trim() and only add trimmed to
the Set if trimmed.length > 0 (store trimmed, not the raw suf). This ensures
whitespace-only suffixes like " " or "\t" are ignored when building the returned
Set.
---
Duplicate comments:
In `@packages/opencode/src/altimate/native/connections/data-diff.ts`:
- Around line 42-56: The fallback ANSI double-quote branch emits invalid
identifier quoting for BigQuery; update quoteIdentForDialect to add an explicit
"bigquery" case that quotes identifiers with backticks (same style as the
mysql/mariadb/clickhouse branch) and properly escapes any backticks inside the
identifier (e.g., identifier.replace(/`/g, "``")), so detect-join-candidates
generates valid BigQuery SQL; modify the switch in quoteIdentForDialect to
include case "bigquery": return `\`${identifier.replace(/`/g, "``")}\``.
---
Nitpick comments:
In `@packages/opencode/src/altimate/native/connections/detect-join-candidates.ts`:
- Around line 279-293: The scanning loop that uses tablesScanned and maxTables
(in detect-join-candidates.ts around the for (const schema of schemas) loop and
the similar loop at the later block) currently stops silently when tablesScanned
>= maxTables; update the per-connection result object returned by this code to
include metadata fields such as truncated (boolean), tables_scanned (number),
and tables_skipped (number) and set truncated = true whenever you break out due
to the cap; increment tables_scanned as you already do, compute tables_skipped
as the remaining tables not scanned in this connector/schema, and ensure the
same changes are applied to the second loop (the block around lines 388-395) and
any place that returns connector scan results so callers can see when scans were
truncated.
📒 Files selected for processing (5)
- packages/opencode/src/altimate/native/connections/data-diff.ts
- packages/opencode/src/altimate/native/connections/detect-join-candidates.ts
- packages/opencode/src/altimate/native/connections/register.ts
- packages/opencode/src/altimate/tools/altimate-core-detect-join-candidates.ts
- packages/opencode/test/altimate/tools/altimate-core-detect-join-candidates.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/opencode/src/altimate/tools/altimate-core-detect-join-candidates.ts
- packages/opencode/test/altimate/tools/altimate-core-detect-join-candidates.test.ts
function stripPrefixSet(values: readonly string[], prefix: string): Set<string> {
  const out = new Set<string>()
  for (const v of values) {
    if (typeof v === "string" && v.startsWith(prefix)) {
      const suf = v.slice(prefix.length)
      if (suf.length > 0) out.add(suf)
    }
Ignore whitespace-only suffixes before scoring.
stripPrefixSet() currently treats " " / "\t" as real suffixes because it only checks length > 0. That can produce suffix_overlap > 0 for columns that don't actually contain a usable join token.
Suggested fix
 function stripPrefixSet(values: readonly string[], prefix: string): Set<string> {
   const out = new Set<string>()
   for (const v of values) {
     if (typeof v === "string" && v.startsWith(prefix)) {
       const suf = v.slice(prefix.length)
-      if (suf.length > 0) out.add(suf)
+      if (suf.trim().length > 0) out.add(suf)
     }
   }
   return out
 }
📝 Committable suggestion
function stripPrefixSet(values: readonly string[], prefix: string): Set<string> {
  const out = new Set<string>()
  for (const v of values) {
    if (typeof v === "string" && v.startsWith(prefix)) {
      const suf = v.slice(prefix.length)
      if (suf.trim().length > 0) out.add(suf)
    }
  }
  return out
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/opencode/src/altimate/native/connections/detect-join-candidates.ts`
around lines 150 - 156, stripPrefixSet currently adds suffixes that are only
whitespace because it checks suf.length > 0; update stripPrefixSet(values,
prefix) to trim whitespace before deciding and storing: compute suf =
v.slice(prefix.length), then let trimmed = suf.trim() and only add trimmed to
the Set if trimmed.length > 0 (store trimmed, not the raw suf). This ensures
whitespace-only suffixes like " " or "\t" are ignored when building the returned
Set.
✅ Tests: All Passed
TypeScript: passed
cc @sahrizvi
Closes #758
Motivation
When altimate-code is connected to multiple warehouses, columns that point at the same entity often have mismatched naming or formatting: for example, one DB stores `cust_42` while another stores `42`, or one warehouse uses `businessid_X` and another uses `businessref_X`. There's currently no built-in way to surface candidate joins across DBs; the agent has to discover them per-pair, which costs tokens and is brittle.

Common situations where this lands:
- `AccountId` ↔ Hubspot `company_id` ↔ internal `account_external_id`)
- `v1_user_123` vs `user_123`)

What's added
A new tool `altimate_core.detect_join_candidates` that:
- `connector.execute` plumbing.
- `_` / `-` / `:` separator (so partial-name matches don't count).
- `{left_db, left_table, left_col, right_db, right_table, right_col, prefix_rule, suffix_overlap, confidence}` where `confidence = overlap / min(|left_suffixes|, |right_suffixes|)` (cheap and monotonic, deliberately not a probability).

Errors are scoped: per-table sample failures don't kill the scan; per-connection failures are returned in `connection_errors`. Identifier quoting is ANSI-safe (double quotes). The `STRING_TYPE_PATTERN` filter restricts sampling to text-like columns (varchar/text/char/string/uuid/json); numeric/temporal columns are skipped.

Tests
21 new tests in `test/altimate/tools/altimate-core-detect-join-candidates.test.ts`:
- `commonPrefix` LCP logic + separator handling (`_` / `-` / `:`) + no-separator rejection + non-string defensive paths
- `detectJoinCandidatesFromBags`: canonical pattern, same-DB rejection, identical-prefix rejection, zero-overlap rejection, no-separator rejection, ranking by overlap/confidence, full N*(N-1)/2 fan-out
- `bun:sqlite` `:memory:` DBs holding 10 `businessid_X` rows + 8 `businessref_X` rows (with intentional non-overlapping outliers), driven through the actual native handler with `Registry.get` stubbed

Results:
- `test/altimate/tools/`: 192 pass / 0 fail (the 9 pre-existing failures in `sql-classify.test.ts` were verified unrelated by stashing this branch's changes; the same 9 fail on `main`)
- `bun run typecheck`: clean (0 errors after `bun install` in the worktree)

The existing `altimate-core-native.test.ts` count assertion was updated (34 → 35) to reflect the new method, with `altimate_core.detect_join_candidates` added to the canonical `ALL_METHODS` list.

Backwards compatibility
Pure addition. No existing behaviour changes; no existing tests modified beyond the count update.
Files
- `packages/opencode/src/altimate/native/connections/detect-join-candidates.ts` (new, 317 LOC)
- `packages/opencode/src/altimate/native/connections/register.ts` (handler registration)
- `packages/opencode/src/altimate/native/types.ts` (params + result types)
- `packages/opencode/src/altimate/tools/altimate-core-detect-join-candidates.ts` (new tool wrapper)
- `packages/opencode/src/altimate/index.ts` (barrel export)
- `packages/opencode/src/tool/registry.ts` (tool registration)
- `packages/opencode/test/altimate/tools/altimate-core-detect-join-candidates.test.ts` (new, 385 LOC)
- `packages/opencode/test/altimate/altimate-core-native.test.ts` (count assertion update)

Summary by cubic
Adds cross-DB join key inference that finds candidate joins by stripping distinct prefixes and matching suffixes across warehouses. Now uses dialect-aware sampling SQL, stricter permission/error handling, and safer input limits.
- `altimate_core.detect_join_candidates` and the `altimate_core_detect_join_candidates` tool; registered and exported.
- `quoteIdentForDialect`; no inline LIMITs (caps passed to drivers).
- `_` / `-` / `:`; prefixes must differ; emits on suffix overlap; ranks by `suffix_overlap`, then `match_score = overlap / min(left, right)`.
- `schema_name`; cross-DB only; skips non-string columns.
- `sql_execute_read` permission; parallel per-connection scans; returns `connection_errors` and bounded `partial_errors`; input caps (`connections` ≤ 16, `sample_size` ≤ 1000, `max_tables_per_connection` ≤ 500).

Written for commit 80758f5.
Summary by CodeRabbit