Forge integration full (brief 3.1–3.7): multi-source external-match, fingerprint host, host-keyed repo/releases, multi-source search, cross-forge dedup, announcement source_host#18
Conversation
|
Caution Review failedPull request was closed or merged during review No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (27)
🚧 Files skipped from review as they are similar to previous changes (18)
📝 WalkthroughWalkthroughThis PR integrates Forgejo/Gitea forge hosts into the package matching flow. It adds a dedicated F-Droid seeding worker that crawls forge-hosted indexes, extracts repository targets from trusted source URLs, derives signing fingerprints, and upserts them with a host dimension. The signing fingerprints table schema gains a host column to distinguish GitHub from forge APKs. The external match service now fans out across multiple sources (GitHub, Codeberg, Gitea) and deduplicates cross-host results. Routes gain ChangesForgejo Integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
CI #18 hit the 5-min test-task timeout because every class-init HttpClient(CIO) instance spawns non-daemon selector threads. Even tests that override the public methods through fakes triggered the engines at construction time — JVM stayed alive past every test method, accumulating across the suite until the gradle task wall-clock hit 5 minutes. Move the four production engines to 'by lazy' init (ForgejoSearchClient, ForgejoResourceClient, ExternalMatchService, ForgejoFdroidSeedWorker). Tests that override search/fetch/runOnce never touch the lazy property — JVM has no live non-daemon threads at suite end and exits cleanly. Production behaviour unchanged: first call into the real method materialises the engine same as before.
PR #17 CI failed two ways: 1. build.gradle.kts:115 'Unresolved reference: time' — java.time.Duration fully-qualified didn't resolve in the Gradle Kotlin script context. Fix: top-level 'import java.time.Duration', use unqualified Duration.ofMinutes(5), and qualify the Test type as org.gradle.api.tasks.testing.Test so the withType<>() call resolves cleanly. 2. Test JVM would have hung on CIO selector threads once compile passed (#18 hit the same issue at 5-min timeout). Pre-emptive fix: move ExternalMatchService.http + ForgejoSearchClient.http to 'by lazy' init so test fakes that override the public methods never spawn non-daemon engine threads. Production behaviour unchanged — first real call materialises the engine same as before.
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main/kotlin/zed/rainxch/githubstore/match/SigningFingerprintRepository.kt (1)
57-80:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftCursor key is no longer unique after adding
host, so pages can drop rows.
page()seeks and orders by(observedAt, fingerprint, owner, repo), but V17 PK is(host, fingerprint, owner, repo). If two rows share the first four fields and differ only by host, the next-page predicate skips the remaining tie rows permanently.Proposed fix
@@ - // Seed dump for /v1/signing-seeds. Returns rows ordered by (observedAt, - // fingerprint, owner, repo) so the cursor only needs to encode the most + // Seed dump for /v1/signing-seeds. Returns rows ordered by (observedAt, + // fingerprint, owner, repo, host) so the cursor key is unique post-V17. @@ - val seek = (SigningFingerprints.observedAt greater c.observedAt) or + val seek = (SigningFingerprints.observedAt greater c.observedAt) or ((SigningFingerprints.observedAt eq c.observedAt) and (SigningFingerprints.fingerprint greater c.fingerprint)) or ((SigningFingerprints.observedAt eq c.observedAt) and (SigningFingerprints.fingerprint eq c.fingerprint) and (SigningFingerprints.owner greater c.owner)) or ((SigningFingerprints.observedAt eq c.observedAt) and (SigningFingerprints.fingerprint eq c.fingerprint) and (SigningFingerprints.owner eq c.owner) and - (SigningFingerprints.repo greater c.repo)) + (SigningFingerprints.repo greater c.repo)) or + ((SigningFingerprints.observedAt eq c.observedAt) and + (SigningFingerprints.fingerprint eq c.fingerprint) and + (SigningFingerprints.owner eq c.owner) and + (SigningFingerprints.repo eq c.repo) and + (SigningFingerprints.host greater c.host)) @@ .orderBy( SigningFingerprints.observedAt to SortOrder.ASC, SigningFingerprints.fingerprint to SortOrder.ASC, SigningFingerprints.owner to SortOrder.ASC, SigningFingerprints.repo to SortOrder.ASC, + SigningFingerprints.host to SortOrder.ASC, ) @@ - PageCursor(last.observedAt, last.fingerprint, last.owner, last.repo) + PageCursor(last.observedAt, last.fingerprint, last.owner, last.repo, last.host) @@ data class PageCursor( val observedAt: Long, val fingerprint: String, val owner: String, val repo: String, + val host: String, ) { fun encode(): String { - val raw = "$observedAt|$fingerprint|$owner|$repo" + val raw = "$observedAt|$fingerprint|$owner|$repo|$host" return Base64.UrlSafe.encode(raw.toByteArray()).trimEnd('=') } @@ - if (parts.size != 4) return@runCatching null + if (parts.size != 5) return@runCatching null PageCursor( observedAt = parts[0].toLong(), fingerprint = parts[1], owner = parts[2], repo = parts[3], + host = parts[4], )Also applies to: 94-97, 141-163
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/main/kotlin/zed/rainxch/githubstore/match/SigningFingerprintRepository.kt` around lines 57 - 80, The cursor seek and ordering in page() (and the other similar blocks around the SigningFingerprintRepository) currently use (observedAt, fingerprint, owner, repo) but the actual unique key includes host; update the composite seek predicate and the orderBy to include SigningFingerprints.host as the last tie-breaker (i.e., extend every OR chain that compares observedAt/fingerprint/owner/repo to also compare host where appropriate) so the cursor becomes unique and rows differing only by host are not skipped; apply the same change to the other occurrences referenced (the blocks at the other seek/orderBy usages).
🧹 Nitpick comments (1)
src/main/kotlin/zed/rainxch/githubstore/match/ExternalMatchService.kt (1)
141-156: 💤 Low valueNon-deterministic canonical row selection.
group.first()selects the canonical row, but the order depends on the database query's result ordering. If the same (owner, repo) exists on bothgithub.comandcodeberg.org, the canonicalsourceHostcould vary between runs. Ifgithub.comshould always be preferred as canonical (for wire compatibility with pre-1.9.0 clients), consider explicit ordering:val canonical = group.minByOrNull { if (it.host == "github.com") 0 else 1 } ?: group.first()If non-determinism is acceptable here, this can be ignored.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/main/kotlin/zed/rainxch/githubstore/match/ExternalMatchService.kt` around lines 141 - 156, The grouping logic in ExternalMatchService currently picks the canonical row via group.first(), which is non-deterministic; change it to explicitly prefer rows with host == "github.com" when constructing the canonical candidate (replace the group.first() usage with an explicit selector such as using minByOrNull/firstOrNull with a comparator that assigns lower rank to "github.com" and falls back to group.first()), then keep the rest of the construction (owner, repo, sourceHost, availableOn) the same so sourceHost deterministically prefers github.com for wire compatibility.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@src/main/kotlin/zed/rainxch/githubstore/announcements/AnnouncementValidator.kt`:
- Around line 15-24: AnnouncementValidator.VALID_SOURCE_HOSTS is a hardcoded
allowlist that can drift from ForgejoSearchClient.DEFAULT_TRUSTED_HOSTS which is
configurable via the FORGEJO_TRUSTED_HOSTS env var; fix by centralizing the
trusted-hosts source so both use the same value (either by reading
ForgejoSearchClient.DEFAULT_TRUSTED_HOSTS or moving the env-var parsing into a
shared utility/provider) and update AnnouncementValidator to reference that
shared source instead of its own set, ensuring any runtime override via
FORGEJO_TRUSTED_HOSTS is respected.
In `@src/main/kotlin/zed/rainxch/githubstore/match/FdroidSeedWorker.kt`:
- Around line 125-127: Comment on runOnce() is incorrect: it does not drop the
advisory-lock dance because runOnce() delegates to tryRunCycle() which executes
pg_try_advisory_xact_lock. Either update the comment to accurately describe that
runOnce() still contends for the DB lock, or refactor by extracting the
fetch+upsert logic from tryRunCycle() into a new suspend function (e.g.,
fetchAndUpsert() or runCycleWithoutLock()) that contains only the fetch+upsert
work; then have tryRunCycle() call the new function inside its lock section and
change runOnce() to call the new lock-free function so tests can exercise
fetch+upsert without acquiring the advisory lock (also update tests to call the
new method).
In `@src/main/kotlin/zed/rainxch/githubstore/match/ForgejoFdroidSeedWorker.kt`:
- Around line 149-151: The comment for runOnce() is inaccurate: it claims tests
can "invoke the crawl path without the lock dance" but runOnce() calls
tryRunCycle(), which does perform the advisory lock; update the comment to
reflect that runOnce() executes the normal cycle including the advisory lock
(or, if the intent was truly to provide a lock-free test entry, change runOnce()
to call the existing lock-free variant, e.g.,
tryRunCycleWithoutLock/tryRunCycleNoLock, and add a clear comment). Locate the
runOnce() function and the tryRunCycle() implementation in
ForgejoFdroidSeedWorker (and mirror the same fix you applied to
FdroidSeedWorker) and either correct the comment text or swap the called helper
to the lock-free method.
- Around line 239-244: The parseIndexUrlsEnv function currently allows "http://"
URLs which risks MITM; update parseIndexUrlsEnv to only accept URLs that start
with "https://" (remove the http check) and fall back to DEFAULT_INDEX_URLS as
before, and if HTTP-on-local-dev should remain possible add an explicit opt-in
flag (e.g. a new env like ALLOW_INSECURE_HTTP) checked inside parseIndexUrlsEnv
before permitting non-HTTPS URLs; reference function parseIndexUrlsEnv and
constant DEFAULT_INDEX_URLS when making the change.
In `@src/test/kotlin/zed/rainxch/githubstore/match/ForgejoSearchClientTest.kt`:
- Line 49: The test currently uses assertEquals(null,
ForgejoSearchClient.SOURCE_TO_HOST["github"]) which will pass if the "github"
key is missing; update the test to first assert the key exists (e.g.
assertTrue(ForgejoSearchClient.SOURCE_TO_HOST.containsKey("github")) or
assertTrue(ForgejoSearchClient.SOURCE_TO_HOST.keys.contains("github"))) and then
read the value with ForgejoSearchClient.SOURCE_TO_HOST.getValue("github") and
assert that the retrieved value is null (use assertNull or assertEquals(null,
...)) so the test fails if the key is absent but passes if the key is present
with a null value.
---
Outside diff comments:
In
`@src/main/kotlin/zed/rainxch/githubstore/match/SigningFingerprintRepository.kt`:
- Around line 57-80: The cursor seek and ordering in page() (and the other
similar blocks around the SigningFingerprintRepository) currently use
(observedAt, fingerprint, owner, repo) but the actual unique key includes host;
update the composite seek predicate and the orderBy to include
SigningFingerprints.host as the last tie-breaker (i.e., extend every OR chain
that compares observedAt/fingerprint/owner/repo to also compare host where
appropriate) so the cursor becomes unique and rows differing only by host are
not skipped; apply the same change to the other occurrences referenced (the
blocks at the other seek/orderBy usages).
---
Nitpick comments:
In `@src/main/kotlin/zed/rainxch/githubstore/match/ExternalMatchService.kt`:
- Around line 141-156: The grouping logic in ExternalMatchService currently
picks the canonical row via group.first(), which is non-deterministic; change it
to explicitly prefer rows with host == "github.com" when constructing the
canonical candidate (replace the group.first() usage with an explicit selector
such as using minByOrNull/firstOrNull with a comparator that assigns lower rank
to "github.com" and falls back to group.first()), then keep the rest of the
construction (owner, repo, sourceHost, availableOn) the same so sourceHost
deterministically prefers github.com for wire compatibility.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a2f21c5d-babc-4353-be94-3ae195acaa77
📒 Files selected for processing (38)
.env.examplebuild.gradle.ktsdocker-compose.prod.ymlsrc/main/kotlin/zed/rainxch/githubstore/AppModule.ktsrc/main/kotlin/zed/rainxch/githubstore/Application.ktsrc/main/kotlin/zed/rainxch/githubstore/announcements/Announcement.ktsrc/main/kotlin/zed/rainxch/githubstore/announcements/AnnouncementValidator.ktsrc/main/kotlin/zed/rainxch/githubstore/badge/FdroidVersionClient.ktsrc/main/kotlin/zed/rainxch/githubstore/db/DatabaseFactory.ktsrc/main/kotlin/zed/rainxch/githubstore/db/MeilisearchClient.ktsrc/main/kotlin/zed/rainxch/githubstore/db/Tables.ktsrc/main/kotlin/zed/rainxch/githubstore/ingest/GitHubDeviceClient.ktsrc/main/kotlin/zed/rainxch/githubstore/ingest/GitHubResourceClient.ktsrc/main/kotlin/zed/rainxch/githubstore/ingest/GitHubSearchClient.ktsrc/main/kotlin/zed/rainxch/githubstore/match/ExternalMatchDto.ktsrc/main/kotlin/zed/rainxch/githubstore/match/ExternalMatchService.ktsrc/main/kotlin/zed/rainxch/githubstore/match/FdroidSeedWorker.ktsrc/main/kotlin/zed/rainxch/githubstore/match/ForgeSourceUrl.ktsrc/main/kotlin/zed/rainxch/githubstore/match/ForgejoFdroidSeedWorker.ktsrc/main/kotlin/zed/rainxch/githubstore/match/ForgejoResourceClient.ktsrc/main/kotlin/zed/rainxch/githubstore/match/ForgejoSearchClient.ktsrc/main/kotlin/zed/rainxch/githubstore/match/ForgejoSearchDto.ktsrc/main/kotlin/zed/rainxch/githubstore/match/SigningFingerprintRepository.ktsrc/main/kotlin/zed/rainxch/githubstore/match/SigningSeedDto.ktsrc/main/kotlin/zed/rainxch/githubstore/mirrors/MirrorStatusWorker.ktsrc/main/kotlin/zed/rainxch/githubstore/model/RepoResponse.ktsrc/main/kotlin/zed/rainxch/githubstore/oauth/OAuthExchangeService.ktsrc/main/kotlin/zed/rainxch/githubstore/routes/ExternalMatchRoutes.ktsrc/main/kotlin/zed/rainxch/githubstore/routes/ReleasesRoutes.ktsrc/main/kotlin/zed/rainxch/githubstore/routes/RepoRoutes.ktsrc/main/kotlin/zed/rainxch/githubstore/routes/Routing.ktsrc/main/kotlin/zed/rainxch/githubstore/routes/SearchRoutes.ktsrc/main/resources/db/migration/V17__signing_fingerprint_host.sqlsrc/test/kotlin/zed/rainxch/githubstore/match/ExternalMatchRouteTest.ktsrc/test/kotlin/zed/rainxch/githubstore/match/ExternalMatchServiceMultiSourceTest.ktsrc/test/kotlin/zed/rainxch/githubstore/match/ForgeSourceUrlTest.ktsrc/test/kotlin/zed/rainxch/githubstore/match/ForgejoFdroidSeedWorkerTest.ktsrc/test/kotlin/zed/rainxch/githubstore/match/ForgejoSearchClientTest.kt
…ly crawler, deterministic dedup Six findings from the #18 review verified against current code; all valid, all fixed. - Outside-diff (real bug): SigningFingerprintRepository.page cursor + orderBy only included (observedAt, fingerprint, owner, repo) but the post-V17 PK is (host, fingerprint, owner, repo). Two rows differing only by host both satisfied the cursor predicate and one was silently skipped between pages. Cursor now encodes host as the final tie-breaker on both the seek expression and the orderBy; legacy 4-field cursors decode with host=github.com so pre-V17 clients keep working through the rollout. - F1 AnnouncementValidator hardcoded VALID_SOURCE_HOSTS could drift from ForgejoSearchClient.DEFAULT_TRUSTED_HOSTS + the operator's FORGEJO_TRUSTED_HOSTS env override. Replaced with a call-time validSourceHosts() that reads ForgejoSearchClient.parseTrustedHostsEnv(...) + 'github.com' so the env override is honoured here too. - F2/F3 runOnce() comments on both seed workers falsely claimed the function was lock-free for tests; runOnce() actually calls tryRunCycle() which performs pg_try_advisory_xact_lock. Corrected the comments to describe the real behaviour (full normal cycle including the lock). Tests without a DB must construct higher-level fakes; refactoring out a lock-free helper is deferred since no test currently needs it. - F4 parseIndexUrlsEnv accepted http://. F-Droid index documents carry signing-cert fingerprints the crawler upserts directly into signing_fingerprint — a MITM on http could poison the /v1/external-match fingerprint path. Default is now https-only; ALLOW_INSECURE_HTTP=true opt-in env exposes http for local dev only. - F5 ForgejoSearchClientTest assertion 'assertEquals(null, map["github"])' would pass if the 'github' key were missing. Switched to containsKey + getValue + assertNull so missing-key and null-value stop being indistinguishable. - F7 (nit) Cross-forge fingerprint dedup picked group.first() — non-deterministic, DB row order flap surfaced different source_host across deploys. Now prefers github.com when present (preserves wire-compat null source_host), falls back to alphabetically-lowest host on forge-only mirrors. availableOn is also sorted for stable output.
V17 migration backfills existing rows with 'github.com' and widens the PK to (host, fingerprint, owner, repo) so the same APK that ships from two forges yields two distinct rows. SigningFingerprintRepository.lookup now returns HostedRepo triples; ExternalMatchService.matchByFingerprint maps host -> source_host (null for github.com, the forge host otherwise) so the existing wire shape stays byte-identical for GitHub callers. /v1/signing-seeds response gains a 'host' field on each row, defaulting to 'github.com' so pre-V17 clients ignoring the field keep working. Codeberg-hosted F-Droid crawler intentionally out of this commit — the schema is the contract change; the ingest path is operator-extendable via SigningFingerprintRepository.upsertBatch with host set.
…f 3.3)
GET /v1/repo/{owner}/{name}?host=<forge> routes to the existing trusted-host allowlist (FORGEJO_TRUSTED_HOSTS) via ForgejoResourceClient. Translates the Forgejo repo payload (stars_count, etc.) to the unified RepoResponse shape so clients never branch on host. Server-side LICENSE sniff (LICENSE / LICENSE.md / LICENSE.txt / COPYING / COPYING.md, base64-decode + SPDX regex) happens once per repo and is cached in resource_cache so individual users skip the work.
Hardening: cache key includes host so the existing GitHub /v1/repo/owner/name cache slot can't be poisoned by a forge response with the same owner/name; 6s per-call HTTP timeout; ForgejoResourceClient.isTrusted gates dispatch as defence in depth on top of the route's allowlist check; anonymous reads only (no PAT forwarding).
Forge requests skip the curated DB fast path (the repos table is GitHub-only). DB fast path stays the GitHub primary.
GET /v1/releases/{owner}/{name}?host=<forge> routes to ForgejoResourceClient.fetchReleasesRaw. CRLF line endings normalised to LF server-side so markdown table separator rows render correctly on the client (workaround moved from DetailsRepositoryImpl.processForgejoBody).
Cache key includes host so forge release pages don't collide with the existing GitHub cache slot. 1h server-side TTL matches the GitHub release-list cache.
Field-shape translation (asset.content_type defaults, published_at offset parsing) intentionally deferred — the client already handles those on its Forgejo branch; consolidating the DTOs is a separate clean-up PR.
GET /v1/search?source=codeberg|gitea|disroot routes to ForgejoSearchClient and returns a flat list of RepoResponse hits with source_host set. source=github (default) keeps the existing Meilisearch + GitHub passthrough path; pre-1.9.0 clients omitting the field see the byte-identical wire shape they shipped with. RepoResponse gains two additive optional fields: source_host (null for GitHub) and available_on (scaffolding for 3.6 cross-forge dedup). Both default to null/empty so old clients ignore them cleanly. source=all (cross-source merge) intentionally NOT in VALID_SOURCES yet — needs the GitHub search branch refactored into a helper that returns SearchResponse so the two paths can race in coroutineScope. Tracked as a follow-up; per-source variants ship now.
ExternalMatchCandidate gains optional 'available_on' (empty by default; pre-1.9.0 clients ignore). matchByFingerprint collapses rows that share the same (owner, repo) across hosts into a single candidate that lists every host the cert was observed under — the strong dedup signal per the brief. searchAndScoreAcrossSources runs a weaker name-based dedup after the cross-source fan-out: when the same (owner, repo) hits both GitHub and a forge, keep the higher-confidence row and fold the loser's host into available_on. Commit-SHA-equality dedup (the medium signal) intentionally deferred — needs an extra GET per hit to read default-branch HEAD.
… brief 3.7) AnnouncementDto gains optional sourceHost. Validator restricts the value to the same trusted-host allowlist the rest of the forge surface uses (github.com, codeberg.org, gitea.com, git.disroot.org) so a typo or made-up vendor fails loud at load time instead of shipping a banner with an unknown host to the UI. Pre-3.7 announcements omit the field — null means forge-agnostic (the common case).
… recorder - F2: Forgejo/Gitea /repos/search doesn't parse GitHub-style operators in q; embedding 'fork:false' leaked into free-text and skewed both recall and the scorer. ForgejoSearchClient.search gains an optional mode parameter; ExternalMatchService.searchAndScoreForgejo now calls with mode='source' (Gitea's canonical 'exclude forks' filter). - F3: ExternalMatchServiceMultiSourceTest's FakeForgejo recorded calls into a plain MutableList. ExternalMatchService.searchAndScoreAcrossSources fans out forges in parallel via coroutineScope/async — a multi-source test would race and either drop entries or throw ConcurrentModificationException. Swapped to ConcurrentLinkedQueue (lock-free, defined under concurrent appends) and updated the index access to .first() since the queue has no [] operator. perHostHits stays read-only. F1 (build.gradle.kts java.time ambiguity) verified already fixed on this branch and skipped.
…-distributed APKs (3.2 ingest) Activates the V17 schema work shipped earlier in this branch: a daily crawler pulls each configured F-Droid index-v2.json (default: Gadgetbridge's Codeberg pages repo at freeyourgadget.codeberg.page/fdroid/repo), parses every package whose sourceCode URL points at a trusted forge host (FORGEJO_TRUSTED_HOSTS), and upserts (host, fingerprint, owner, repo, observedAt) rows. PK is (host, fingerprint, owner, repo) so re-runs are no-ops on already-seen tuples; cross-forge mirrors (same APK on GitHub AND Codeberg) yield two distinct rows that the 3.6 dedup pass collapses into available_on. ForgeSourceUrl.parse is the URL→(host, owner, repo) parser; SSRF-shape hardening rejects userinfo, explicit ports, and non-http schemes before the index touches the network. Operators extend the index URL list via the additive FORGEJO_FDROID_INDEX_URLS env (comma-separated, default kicks in if blank). Distinct advisory-lock id (911_006) from the GitHub-side FdroidSeedWorker (911_004) so both can run in the same hour without contending. Activation step the brief's strategy 3 (fingerprint match) needed: without the crawler, forge-distributed apps had no rows in the table and fell through to noisy free-text search. With this committed, signing-cert match is the strongest non-manifest signal for Codeberg apps too.
CI #18 hit the 5-min test-task timeout because every class-init HttpClient(CIO) instance spawns non-daemon selector threads. Even tests that override the public methods through fakes triggered the engines at construction time — JVM stayed alive past every test method, accumulating across the suite until the gradle task wall-clock hit 5 minutes. Move the four production engines to 'by lazy' init (ForgejoSearchClient, ForgejoResourceClient, ExternalMatchService, ForgejoFdroidSeedWorker). Tests that override search/fetch/runOnce never touch the lazy property — JVM has no live non-daemon threads at suite end and exits cleanly. Production behaviour unchanged: first call into the real method materialises the engine same as before.
…PK swap Architect findings #1 + #2 from the architecture review on this branch. #1 (CRITICAL) - pg_try_advisory_lock is session-scoped, and Exposed/Hikari returned the connection to the pool the moment the 'acquire' transaction committed. The lock was released between acquireAdvisoryLock() and the actual fetch + upsert, so 'skipped: another instance' never fired and the dedup was illusory. The new ForgejoFdroidSeedWorker amplified the impact by holding the (broken) lock across N upstream HTTP fetches. Fix: switch to pg_try_advisory_xact_lock, which is automatically released at COMMIT. Restructure both workers (the pre-existing GitHub-side FdroidSeedWorker had the same bug — fixed in the same commit) so they fetch every index FIRST with no DB connection held, then open ONE newSuspendedTransaction that pairs the xact lock with the upsert via the new upsertBatchInCurrentTransaction helper on SigningFingerprintRepository. Lock and write share the same transaction; race window gone. #2 (HIGH) - V17 migration ran ADD COLUMN, DROP PK, ADD PK as three independent statements. Between DROP and ADD, concurrent writers could insert rows that would later violate the new PK and abort the migration mid-flight, leaving the schema half-broken. Fix: LOCK TABLE signing_fingerprint IN ACCESS EXCLUSIVE MODE at the top of V17 so no writer can interleave between drop and re-add. The outer transaction DatabaseFactory.runMigrations() already opens releases the lock at COMMIT. Deferred (architect review #4 fan-out concurrency cap, #5 HttpClient AutoCloseable lifecycle) — both only fire under real forge usage, which can't happen until client PR #631 ships and 1.9.0 propagates. Addressed in a follow-up PR before any user can trigger them.
…ly crawler, deterministic dedup Six findings from the #18 review verified against current code; all valid, all fixed. - Outside-diff (real bug): SigningFingerprintRepository.page cursor + orderBy only included (observedAt, fingerprint, owner, repo) but the post-V17 PK is (host, fingerprint, owner, repo). Two rows differing only by host both satisfied the cursor predicate and one was silently skipped between pages. Cursor now encodes host as the final tie-breaker on both the seek expression and the orderBy; legacy 4-field cursors decode with host=github.com so pre-V17 clients keep working through the rollout. - F1 AnnouncementValidator hardcoded VALID_SOURCE_HOSTS could drift from ForgejoSearchClient.DEFAULT_TRUSTED_HOSTS + the operator's FORGEJO_TRUSTED_HOSTS env override. Replaced with a call-time validSourceHosts() that reads ForgejoSearchClient.parseTrustedHostsEnv(...) + 'github.com' so the env override is honoured here too. - F2/F3 runOnce() comments on both seed workers falsely claimed the function was lock-free for tests; runOnce() actually calls tryRunCycle() which performs pg_try_advisory_xact_lock. Corrected the comments to describe the real behaviour (full normal cycle including the lock). Tests without a DB must construct higher-level fakes; refactoring out a lock-free helper is deferred since no test currently needs it. - F4 parseIndexUrlsEnv accepted http://. F-Droid index documents carry signing-cert fingerprints the crawler upserts directly into signing_fingerprint — a MITM on http could poison the /v1/external-match fingerprint path. Default is now https-only; ALLOW_INSECURE_HTTP=true opt-in env exposes http for local dev only. - F5 ForgejoSearchClientTest assertion 'assertEquals(null, map["github"])' would pass if the 'github' key were missing. Switched to containsKey + getValue + assertNull so missing-key and null-value stop being indistinguishable. - F7 (nit) Cross-forge fingerprint dedup picked group.first() — non-deterministic, DB row order flap surfaced different source_host across deploys. Now prefers github.com when present (preserves wire-compat null source_host), falls back to alphabetically-lowest host on forge-only mirrors. availableOn is also sorted for stable output.
11e90d5 to
b5c8cbf
Compare
Greptile SummaryThis PR implements briefs 3.1–3.7 of the forges integration: multi-source external-match fan-out, a
Confidence Score: 4/5Safe to merge with one small fix: the forge releases route returns 502 for every failure including forge-side 404s. The releases route conflates "forge repo not found" with "forge unreachable" by always responding 502 when src/main/kotlin/zed/rainxch/githubstore/routes/ReleasesRoutes.kt — the forge 404 → 502 conflation Important Files Changed
|
Greptile SummaryThis PR completes the Forge integration (brief §§3.1–3.7) on top of #17, adding multi-source search, host-keyed repo/releases proxying, cross-forge fingerprint dedup, and announcement
Confidence Score: 3/5Safe to merge for the DB migration, advisory-lock refactor, and all server-side forge proxy paths; the forge search response constructs RepoResponse objects with a hardcoded id = 0L, which will affect any client code that relies on id for item identity. The forge search route maps every hit to RepoResponse(id = 0L, ...) because ForgejoRepo in ForgejoSearchDto.kt does not capture the upstream Forgejo repo ID. A KMP Compose client that keys list items or navigation targets on id would see all forge search results collide under the same key, producing incorrect rendering or deduplication. Everything else in the PR — the migration, cursor pagination, advisory-lock hardening, ForgejoResourceClient, seed worker, and announcement validator — looks correct and well-tested. src/main/kotlin/zed/rainxch/githubstore/routes/SearchRoutes.kt and src/main/kotlin/zed/rainxch/githubstore/match/ForgejoSearchDto.kt need attention for the missing repo ID field. Important Files Changed
Sequence DiagramsequenceDiagram
participant Client
participant SearchRoutes
participant ForgejoSearchClient
participant ForgejoResourceClient
participant SigningFingerprintRepo
participant ExternalMatchService
Client->>SearchRoutes: "GET /v1/search?q=signal&source=codeberg"
SearchRoutes->>ForgejoSearchClient: "search(codeberg.org, signal, mode=source)"
ForgejoSearchClient-->>SearchRoutes: List SearchHit
SearchRoutes-->>Client: "SearchResponse items with id=0 source_host=codeberg.org"
Client->>SearchRoutes: "GET /v1/repo/foo/bar?host=codeberg.org"
SearchRoutes->>ForgejoResourceClient: fetchRepo(codeberg.org, foo, bar)
ForgejoResourceClient->>ForgejoResourceClient: sniffLicense LICENSE_PATHS x5 on miss
ForgejoResourceClient-->>SearchRoutes: "RepoResponse source_host=codeberg.org"
SearchRoutes-->>Client: RepoResponse
Client->>ExternalMatchService: POST /v1/external-match fingerprint
ExternalMatchService->>SigningFingerprintRepo: lookup(fingerprint)
SigningFingerprintRepo-->>ExternalMatchService: List HostedRepo host owner repo
ExternalMatchService->>ExternalMatchService: groupBy owner repo available_on dedup
ExternalMatchService-->>Client: List ExternalMatchCandidate available_on
|
…sion Greptile review P1 on ExternalMatchService.cacheKey. Prior impl joined (packageName, appLabel, sources) with U+0001 NUL bytes and the comment claimed appLabel 'can never legally contain a NUL'. The route layer only validates appLabel for length + non-blank, not for character class, so a hostile client could submit appLabel='foo�bar' and forge boundaries that mimic a different (label, sources) tuple's slot. When two distinct candidates collide onto the same key, the cached match for one is served for the other. Switch to SHA-256 of the canonical concatenation. Cryptographically collision-resistant regardless of any field's contents. 'external-match:' prefix kept so ops can grep the namespace. Cost is ~microseconds per request — negligible against the existing GitHub passthrough wall-clock. Greptile's other two findings on this PR (FdroidSeedWorker runOnce() comment, AnnouncementValidator hardcoded VALID_SOURCE_HOSTS) were already fixed in commit b5c8cbf — Greptile reviewed an earlier base.
Stack-on-top-of-#17. This PR's first commit is identical to #17's tip; the remaining 6 commits implement the rest of the forges-integration brief (Tier 1, 2, 3) on top.
If #17 merges first, this PR's diff against main collapses to just the 6 new commits.
What ships
01b5c38(= #17)/v1/external-matchacceptssources: ["github", "codeberg", ...], forge hits emitsource: "forgejo_search"+source_host: "<host>"3d36e6csigning_fingerprintwithhost(default'github.com'); PK widens to (host, fingerprint, owner, repo);/v1/signing-seedsrows gain ahostfield;matchByFingerprintemitssource_hostfor forge-distributed APKs5ce145eGET /v1/repo/{owner}/{name}?host=<forge>routes to a newForgejoResourceClientwith server-side LICENSE sniff (LICENSE / LICENSE.md / COPYING / etc., base64-decode + SPDX regex). Cache key includes host so forge entries don't poison the GitHub cache slot.4e20e1fGET /v1/releases/{owner}/{name}?host=<forge>returns the upstream JSON with CRLF→LF normalised server-side (workaround moved fromDetailsRepositoryImpl.processForgejoBody).16a06f4GET /v1/search?source=codeberg|gitea|disrootshort-circuits toForgejoSearchClientand returnsRepoResponses withsource_hostset.source=github(default) keeps the existing Meilisearch + GitHub passthrough path byte-identical for pre-1.9.0 clients.source=all(cross-source merge) intentionally deferred — needs the GitHub branch refactored into a helper that returns SearchResponse; tracked as a follow-up.64b9aeaExternalMatchCandidategainsavailable_on.matchByFingerprintcollapses rows that share the same (owner, repo) across hosts.searchAndScoreAcrossSourcesruns a weaker name-based dedup after cross-source fan-out. Commit-SHA dedup deferred.93600a7AnnouncementDtoaccepts optionalsourceHost; validator restricts the value to the same trusted-host allowlist the rest of the forge surface uses.Hardening recap
/v1/external-matchand/v1/searchroute layers; untrusted hosts never reach upstream.FORGEJO_TRUSTED_HOSTS(defaults tocodeberg.org,gitea.com,git.disroot.org). SSRF + DoS hardening.forgejo:<host>:repo:<owner>/<name>etc.) so GitHub + forge entries with the sameowner/repocan't collide.Wire compatibility
Every new field on response DTOs (
source_host,available_on,host,sourceHoston AnnouncementDto) is additive and defaults to null/empty. Pre-1.9.0 clients ignore the new fields and see the byte-identical wire shape they shipped with.Out of scope (filed as follow-ups in commit messages)
source=allcross-source merge on/v1/search(needs GitHub-Meili branch refactor).SigningFingerprintRepository.upsertBatchwithhostset.Test plan
curl https://api.github-store.org/v1/repo/Freeyourgadget/Gadgetbridge?host=codeberg.orgreturns aRepoResponsewithsource_host: "codeberg.org"and a populatedlicenseSpdxIdcurl 'https://api.github-store.org/v1/search?q=signal&source=codeberg'returns flat hits withsource_host: "codeberg.org"psql -c 'SELECT host, COUNT(*) FROM signing_fingerprint GROUP BY host;'shows the V17 backfill kept every pre-V17 row at'github.com'and the new (host, fingerprint, owner, repo) PK is in placeSummary by CodeRabbit
New Features
Configuration
FORGEJO_TRUSTED_HOSTS) and custom F-Droid index URLs (FORGEJO_FDROID_INDEX_URLS).