Merged
9 changes: 6 additions & 3 deletions CLAUDE.md
@@ -45,6 +45,7 @@ All under `/v1/`:

| Endpoint | Purpose |
|----------|---------|
| `GET /` | Greeting JSON `{name, docs, api}` so bare-hostname hits don't 404. Cached `max-age=3600, s-maxage=86400`. |
| `GET /health` | Health check (Postgres + Meilisearch status) |
| `GET /search?q=&platform=&sort=&limit=&offset=` | Meilisearch-powered search. Auto-triggers GitHub passthrough if <5 results. `sort` ∈ {`relevance` (default), `stars`, `recent` / `releases` (alias, by latest stable release date), `updated` (by repo `updated_at_gh`)}. `relevance` requires `q`; the others allow empty `q` for browse-mode listings. `sort=updated` is routed directly to Postgres FTS until the fetcher repo's `meili_sync.py` adds `updated_at_gh` to Meili's sortable-attributes. Reads optional `X-GitHub-Token` header to run passthrough on the user's 5000/hr quota instead of the backend's fallback quota. Response carries `passthroughAttempted: Boolean` so clients can distinguish "index was warm but returned nothing" from "GitHub also has nothing". |
| `GET /search/explore?q=&platform=&page=` | User-triggered deep GitHub search, paginated, ingests into index. Also reads `X-GitHub-Token`. Cold-path latency is 10–30s — clients must use a 30s timeout. |
@@ -57,13 +58,15 @@ All under `/v1/`:
| `GET /user/{username}` | Proxied GitHub user/org profile. Reads optional `X-GitHub-Token`. Cached 7d. |
| `GET /users/{username}/repos?type=&sort=&direction=&page=&per_page=` | Proxied list of a user/org's repos. `type` ∈ {all, owner, member}, `sort` ∈ {created, updated, pushed, full_name}, `direction` ∈ {asc, desc}. Whitelisted to block SSRF via query injection. Cached 1h server-side, edge `s-maxage=1800`. Reads `X-GitHub-Token`. |
| `GET /users/{username}/starred?sort=&direction=&page=&per_page=` | Proxied list of a user's starred repos (the public form -- the OAuth viewer-self form is intentionally NOT proxied). `sort` ∈ {created, updated}. Cached 30min server-side, edge `s-maxage=900`. Reads `X-GitHub-Token`. |
| `POST /events` | Batched telemetry (opt-in, max 50 per batch). These rows drive `SignalAggregationWorker` — ranking only improves if clients send events. |
| `POST /events` | **Deprecated 2026-04-26 — telemetry was killed in the E6 audit.** Returns `204 No Content` and silently discards the batch — pre-1.8.3 clients (`TelemetryRepositoryImpl`) treat any non-2xx as failure and retry, so a 410 here triggers an error-log + retry storm. 204 lets old clients see success and back off. The `Events` table and `SignalAggregationWorker` remain wired for historical rows but no new data is ingested. Once 1.8.3 ships a sticky-disable-on-non-2xx flag on the client (telemetry cleanup task), flip this back to `410 Gone` with the proper JSON deprecation notice so laggard clients get a real signal. |
| `GET /announcements` | Public, anonymous announcements feed. Same byte-identical envelope for every caller. Backed by JSON files in `src/main/resources/announcements/<id>.json` (or `ANNOUNCEMENTS_DIR` env override). Validator enforces every rule from `docs/backend/announcements-endpoint.md` §2 at load time; expired items are filtered at serve time. `Cache-Control: public, max-age=600` + ETag revalidation. No auth, no per-user logic, no logging beyond standard access. |
| `POST /auth/device/start` | Stateless proxy for `github.com/login/device/code`. Client used to call GitHub directly; some user networks (documented in OpenHub-Store/GitHub-Store#433, #395) can't reach GitHub reliably. Backend adds `client_id`, forwards GitHub's body verbatim. 10 req/hr/IP. |
| `POST /auth/device/poll` | Stateless proxy for `github.com/login/oauth/access_token`. Reads `device_code` from form body, adds `client_id` + `grant_type`, forwards GitHub's body verbatim (including tokens on success). The backend never logs, caches, or persists the token. 200 req/hr/IP. |
| `POST /auth/device/poll` | Stateless proxy for `github.com/login/oauth/access_token`. Reads `device_code` from form body, adds `client_id` + `grant_type`, forwards GitHub's body verbatim (including tokens on success). The backend never logs, caches, or persists the token. 200 req/hr/IP. Per-request diagnostic line `[auth-poll rid=… ] dch=<sha256-prefix(device_code)> ghs=<upstream-status> gh_err=<github-error-code-or-"-"> lat_ms=<ms> ua=<truncated-UA>` is logged for auth-stuck triage (GitHub-Store#433, #395) — the raw `device_code`, response body, and every token field are explicitly excluded; only the `error` key is parsed off the upstream body, via a DTO that doesn't even declare `access_token`/`refresh_token`. |
| `GET /internal/metrics` | Operator-only. Gated by `X-Admin-Token` matching the `ADMIN_TOKEN` env var (open if unset, for local dev). Returns per-source search counters, P-latency, worker queue depth, and top 20 misses (8-char `query_hash` prefix only) in last 7 days. |
| `POST /internal/backfill-stale?limit=N` | Operator-only. Spawns a paced background job that refreshes every curated row whose new metadata columns are still at their migration defaults (currently keyed on `license_spdx_id IS NULL`). One concurrent run; returns 409 on re-trigger. Uses `searchClient.refreshRepo` + persist; respects the quiet window so the daily fetcher's pool stays free. Run after a column-add deploy; no-ops afterwards once the filter no longer matches. |
| `GET /badge/...` | M3-styled SVG badges. Per-repo: `/badge/{owner}/{name}/{kind}/{style}/{variant}` for kind ∈ {release, stars, downloads}. Global: `/badge/{kind}/{style}/{variant}` for kind ∈ {users, fdroid}. Static: `/badge/static/{style}/{variant}?label=&icon=`. Style 1-12 hue, variant 1-3 shade. Vectorized glyph rendering — no font dependency at SVG embed time. |
| `GET /mirrors/list` | Curated catalog of GitHub mirrors with hourly-probed health. Each entry carries `traffic_kinds: ["release_asset", "raw_file"]` for whole-URL proxies (template ends `/{url}`) and `["raw_file"]` for jsDelivr's `/gh/` path-based mirror (template `https://fastly.jsdelivr.net/gh/{owner}/{repo}@{ref}/{path}`). Clients MUST consult `traffic_kinds` before routing a download — sending a release-asset URL through a `raw_file`-only mirror 404s. Cached `max-age=300, s-maxage=3600`. |
| `{GET,POST} /repo/login/{device,oauth}` | **Deprecated 2024-09-01** — tombstone for pre-1.6 builds that wired the device-flow under `/repo/`. Returns `410 Gone` with `use_instead` pointing at `/v1/auth/device/start` + `/v1/auth/device/poll`. Cached `max-age=86400`. Declared **before** `repoRoutes` so the static segments win over `/repo/{owner}/{name}`. |
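
The `traffic_kinds` rule for `/mirrors/list` could look roughly like this on the client side. This is a minimal sketch, not the client's actual code: `Mirror`, `pickMirror`, and `rewriteWholeUrl` are hypothetical names, the `healthy` flag is an assumed stand-in for the probed health status, and the proxy host is made up. Only the `traffic_kinds` values and the two template shapes come from the table above.

```kotlin
// Hypothetical client-side model of one /mirrors/list entry.
data class Mirror(
    val name: String,
    val template: String,
    val trafficKinds: Set<String>,
    val healthy: Boolean = true, // assumed stand-in for the hourly-probed health
)

// Pick the first healthy mirror that declares support for this traffic kind.
// Returns null when no mirror qualifies, so the caller can go direct.
fun pickMirror(mirrors: List<Mirror>, kind: String): Mirror? =
    mirrors.firstOrNull { it.healthy && kind in it.trafficKinds }

// Whole-URL proxies substitute the full upstream URL into `{url}`.
fun rewriteWholeUrl(mirror: Mirror, upstreamUrl: String): String =
    mirror.template.replace("{url}", upstreamUrl)

fun main() {
    val mirrors = listOf(
        Mirror(
            name = "jsdelivr",
            template = "https://fastly.jsdelivr.net/gh/{owner}/{repo}@{ref}/{path}",
            trafficKinds = setOf("raw_file"),
        ),
        Mirror(
            name = "example-proxy", // made-up host, for illustration only
            template = "https://proxy.example.invalid/{url}",
            trafficKinds = setOf("release_asset", "raw_file"),
        ),
    )
    val asset = "https://github.com/o/r/releases/download/v1/app.apk"
    // jsDelivr is listed first but is skipped: it does not carry release assets.
    val mirror = pickMirror(mirrors, "release_asset")
    println(mirror?.let { rewriteWholeUrl(it, asset) } ?: asset)
}
```

Filtering on `traffic_kinds` before rewriting is what keeps a release-asset URL from being sent through a `raw_file`-only mirror.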

Client-facing API contract and migration history live in `internal/` (gitignored, operator-only). The client repo at `OpenHub-Store/GitHub-Store` is the public source of truth for client behavior.
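
The `[auth-poll …]` diagnostic line described in the table above can be illustrated with a small stand-alone sketch. It is not the backend's implementation: `sha256Prefix` and `authPollLine` are hypothetical helpers, and the 8-character hash prefix and 40-character User-Agent cap are assumptions, since the table only says "sha256-prefix" and "truncated-UA".

```kotlin
import java.security.MessageDigest

// Hex-encoded SHA-256 of the input, cut to `chars` characters.
// The default prefix length of 8 is an assumption.
fun sha256Prefix(value: String, chars: Int = 8): String =
    MessageDigest.getInstance("SHA-256")
        .digest(value.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }
        .take(chars)

// Builds a line in the shape quoted in the table. The raw device code is
// only ever passed through the hash; no token field is accepted at all.
fun authPollLine(
    rid: String?,
    deviceCode: String,
    upstreamStatus: Int,
    ghError: String?,
    latencyMs: Long,
    userAgent: String?,
): String =
    "[auth-poll rid=${rid ?: "-"}] dch=${sha256Prefix(deviceCode)} ghs=$upstreamStatus " +
        "gh_err=${ghError ?: "-"} lat_ms=$latencyMs ua=${(userAgent ?: "-").take(40)}"

fun main() {
    println(authPollLine("req-1", "example-device-code", 200, "authorization_pending", 84, "GitHub-Store/1.8.3"))
}
```

Because the function signature has no parameter for the response body or tokens, nothing sensitive can reach the log line by accident, which mirrors the DTO-based exclusion the table describes.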

@@ -113,7 +116,7 @@ RepoRefreshWorker (hourly) — re-fetches passthrough repos by oldest indexed_a
- **Meilisearch partial-update gotcha — PUT, never POST.** `MeilisearchClient.addDocuments()` is POST, which on Meili *replaces* the document with whatever fields you send (everything else becomes null). `MeilisearchClient.updateScores()` is PUT, which merges. Pushing just `{id, search_score}` with POST will wipe every other field on 3000+ docs. If you add a new "partial update" path, verify the HTTP verb before deploying.
- **Dynamic category/topic ordering.** `RepoRepository.findByCategory()` picks a category-specific primary sort column (`trending_score` for trending, `popularity_score` for most-popular, `latest_release_date` for new-releases), falls back to global `searchScore`, then static `rank` as final tie-breaker. Without category-specific primary, both trending and most-popular collapse onto the same global score — the bug fix in PR #12. `findByTopicBucket()` keeps the simpler `searchScore DESC NULLS LAST, rank ASC` order because topics are flat lists, not flavour-segmented like the categories.
- **Exposed `Repos` table uses `array<String>("topics", TextColumnType())`** for the Postgres `TEXT[]` column. The Python fetcher writes these via psycopg2's automatic list-to-array conversion.
- **Cache headers are set per endpoint**, not globally. Announcements: 600s/3600s. Categories/topics: 60s/600s. Repo detail: 30s/300s. Search: 15s/30s. Readme proxy: 3600s/21600s. User proxy: 86400s/604800s. Badges (fresh): 3600s/3600s with `stale-while-revalidate=86400`; (degraded) 300s/300s. Edge respects `s-maxage`; the larger `s-maxage` lets Gcore's shield/tiered cache topology absorb origin load while browsers stay fresher via the smaller `max-age`. `/internal/metrics` is uncached.
- **Cache headers are set per endpoint**, not globally. Announcements: 600s/3600s. Categories/topics: 60s/600s. Repo detail: 30s/300s. Search: 15s/30s. Readme proxy: 3600s/21600s. User proxy: 86400s/604800s. Signing-seeds: 86400s/604800s with `stale-while-revalidate=86400` and a strong ETag for 304 revalidation — content only changes on the daily F-Droid sync cron, so the long edge TTL is paired with an operator-side Cloudflare purge when the seeds rotate. Badges (fresh): 3600s/3600s with `stale-while-revalidate=86400`; (degraded) 300s/300s. Unmatched-route 404s: 300s/300s — `Plugins.kt:respondNotFound` returns `ApiError("not_found")` and logs `[404 rid=… ] METHOD /path` (no query string) so scanner traffic and old-client paths are classifiable from the application log without hammering origin. Edge respects `s-maxage`; the larger `s-maxage` lets Gcore's shield/tiered cache topology absorb origin load while browsers stay fresher via the smaller `max-age`. `/internal/metrics` is uncached.
- **HEAD routes to GET** via the `AutoHeadResponse` plugin (`Plugins.kt`). Without it, Ktor 3 returns 404 for HEAD on every GET handler — confusing for `curl -I`, monitoring, and CDN origin probes.
- **Owner / repo-name path-param validation.** Every GitHub-proxy route (`/readme/`, `/user/`, `/release/`, `/repo/`, `/badge/{owner}/{name}/...`) calls `util/GitHubIdentifiers.validOwner` / `validName` at the top of the handler. Owner regex matches GitHub's actual username rule (`^[A-Za-z0-9](?:[A-Za-z0-9-]{0,38})$`), name allows a slightly broader set up to 100 chars. Reject early with 400 — keeps SSRF-by-path-trickery off the upstream URL.
- **Badge service** lives under `badge/` (`BadgeRenderer`, `BadgeColors`, `BadgeIcons`, `BadgeService`, `FdroidVersionClient`, `TtlCache`, `BadgeGlyphs`). Text is rendered as vectorized `<path>` elements composed from glyph data extracted at startup from `src/main/resources/fonts/Inter-Bold.ttf` (SIL OFL 1.1). The renderer is deliberately font-independent at SVG embed time — every browser, markdown viewer, and feed reader sees byte-identical glyphs. Color palette mirrors `ziadOUA/m3-Markdown-Badges` hex-for-hex (12 hues × 3 shade variants).
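
The owner check from the path-param validation bullet above can be tried in isolation. This is a sketch only: the regex is copied from that bullet, but the function body is an assumption about how `validOwner` might be written, not the contents of `util/GitHubIdentifiers`.

```kotlin
// Owner pattern copied verbatim from the bullet above.
val OWNER_PATTERN = Regex("^[A-Za-z0-9](?:[A-Za-z0-9-]{0,38})$")

// True only when the whole string is a plausible GitHub owner name.
fun validOwner(owner: String): Boolean = OWNER_PATTERN.matches(owner)

fun main() {
    val candidates = listOf("octocat", "-leading-dash", "a/../b", "x".repeat(40))
    for (candidate in candidates) {
        // A real handler would answer 400 here before building any upstream URL.
        val verdict = if (validOwner(candidate)) "forward upstream" else "reject with 400"
        println("$candidate -> $verdict")
    }
}
```

Slashes, dots, and over-long values never match, so path tricks are rejected before the owner is interpolated into a GitHub URL.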
49 changes: 38 additions & 11 deletions src/main/kotlin/zed/rainxch/githubstore/Plugins.kt
@@ -40,7 +40,7 @@ fun Application.configureSerialization() {
}
}

private val REQUEST_ID_KEY = AttributeKey<String>("RequestId")
internal val REQUEST_ID_KEY = AttributeKey<String>("RequestId")
private val REQUEST_ID_PATTERN = Regex("^[A-Za-z0-9\\-]{1,64}$")

// Reject oversized or unknown-size bodies before reading them.
@@ -102,6 +102,29 @@ private fun searchBucketKey(call: io.ktor.server.application.ApplicationCall): S
}
}

// Shared 404 responder. Logs the unmatched method + path (NOT the query
// string — query can carry user search terms), sets a short edge cache so
// scanners and broken clients can't pin origin, and returns the same JSON
// shape every other 4xx uses. The log prefix is bracketed so `grep '\[404 '`
// finds only 404 lines on a noisy log.
//
// Called by:
// - The global `status(NotFound)` StatusPages handler (unmatched routes
// and any route-level 404 — Ktor 3's StatusPages overrides route-level
// bodies, see StatusPagesOverrideTest).
// - Routes that want the same body shape + caching + log without going
// through StatusPages (`InternalRoutes`).
internal suspend fun respondNotFound(call: io.ktor.server.application.ApplicationCall) {
val rid = call.attributes.getOrNull(REQUEST_ID_KEY)
val method = call.request.httpMethod.value
val path = call.request.path()
call.application.environment.log.info(
"[404 rid={}] {} {}", rid ?: "-", method, path,
)
call.response.header(HttpHeaders.CacheControl, "public, max-age=300, s-maxage=300")
call.respond(HttpStatusCode.NotFound, ApiError("not_found"))
}

fun Application.configureHTTP() {
install(DefaultHeaders) {
header("X-Engine", "github-store-backend")
@@ -125,8 +148,9 @@ fun Application.configureHTTP() {
// CORS is only useful for browser-based callers. The KMP client never sends
// Origin (native HttpClient), so this only affects the admin dashboard (same
// origin as the API — doesn't need CORS) and any future web surface. Pinning
// to our own domains removes a CSRF foothold on /v1/events from malicious
// third-party pages without breaking anything we actually serve.
// to our own domains removes a CSRF foothold on state-changing POSTs (e.g.
// /v1/repo/{owner}/{name}/refresh) from malicious third-party pages without
// breaking anything we actually serve.
install(CORS) {
allowHost("github-store.org", subDomains = listOf("api", "api-direct", "www"))
// localhost dev origins are only useful when developing the admin
@@ -190,13 +214,6 @@ fun Application.configureHTTP() {
rateLimiter(limit = 360, refillPeriod = 1.minutes)
requestKey(::forwardedFor)
}
// Events endpoint: 3/min/IP (tightened 10× for direct-path abuse).
// 50 events/batch × 3 batches/min = 150 events/min/IP — comfortably
// covers any realistic session.
register(RateLimitName("events")) {
rateLimiter(limit = 3, refillPeriod = 1.minutes)
requestKey(::forwardedFor)
}
// Search bucket: 240/min/key. Covers /search, /search/explore,
// /releases, /readme, /user, /users/{u}/repos, /users/{u}/starred --
// every route that fans out to the GitHub API. Keyed by token-hash
@@ -317,7 +334,17 @@ fun Application.configureHTTP() {
call.respond(HttpStatusCode.BadRequest, ApiError("invalid_request"))
}
exception<NotFoundException> { call, _ ->
call.respond(HttpStatusCode.NotFound, ApiError("not_found"))
respondNotFound(call)
}
// Catch every unmatched-route 404 (Ktor's default response has no body
// and no Cache-Control). One handler gives us:
// - consistent JSON shape ({error, message}) across the API
// - structured access log entry with method + path (no query) so we
// can classify scanner traffic vs old-client paths from Cloudflare
// analytics + the application log
// - short edge cache so repeat scanner hits don't slam origin
status(HttpStatusCode.NotFound) { call, _ ->
respondNotFound(call)
}
// 429s come out of the RateLimit plugin with Retry-After but an empty
// body. Replace that with a JSON body the client can parse + display
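
The 404 log line produced by `respondNotFound` earlier in this file can be reasoned about with a framework-free sketch. `pathOnly` and `notFoundLogLine` are hypothetical helpers written for illustration; in the real code Ktor's `call.request.path()` already returns the path without the query string.

```kotlin
// Drop everything from the first '?' onward so user search terms
// never end up in the application log.
fun pathOnly(rawUri: String): String = rawUri.substringBefore('?')

// Mirrors the "[404 rid={}] {} {}" shape used by respondNotFound,
// with "-" standing in for a missing request id.
fun notFoundLogLine(rid: String?, method: String, rawUri: String): String =
    "[404 rid=${rid ?: "-"}] $method ${pathOnly(rawUri)}"

fun main() {
    println(notFoundLogLine(null, "GET", "/v1/serch?q=private+terms"))
    println(notFoundLogLine("abc-123", "POST", "/wp-login.php"))
}
```

Keeping the bracketed prefix fixed means a single `grep '\[404 '` separates these lines from the rest of the log.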
@@ -9,15 +9,20 @@ import io.ktor.client.statement.*
import io.ktor.http.*
import org.slf4j.LoggerFactory

class GitHubDeviceClient {
private val log = LoggerFactory.getLogger(GitHubDeviceClient::class.java)

// `open` so route tests can swap in a fake client that returns canned
// GitHubDeviceResponse values without touching real HTTP. `clientId` is a
// constructor parameter (defaulted to the env var) for the same reason —
// tests don't need to set GITHUB_OAUTH_CLIENT_ID just to construct an
// override.
open class GitHubDeviceClient(
private val clientId: String =
System.getenv("GITHUB_OAUTH_CLIENT_ID")?.takeIf { it.isNotBlank() }
?: error(
"GITHUB_OAUTH_CLIENT_ID env var is required to serve /v1/auth/device/* routes. " +
"Set it to the same OAuth App client_id the KMP client has in BuildKonfig."
)
),
) {
private val log = LoggerFactory.getLogger(GitHubDeviceClient::class.java)

private val http = HttpClient(CIO) {
install(HttpTimeout) {
@@ -30,12 +35,12 @@ class GitHubDeviceClient {
expectSuccess = false
}

suspend fun startDeviceFlow(): GitHubDeviceResponse =
open suspend fun startDeviceFlow(): GitHubDeviceResponse =
proxyCall("https://github.com/login/device/code") {
append("client_id", clientId)
}

suspend fun pollDeviceToken(deviceCode: String): GitHubDeviceResponse =
open suspend fun pollDeviceToken(deviceCode: String): GitHubDeviceResponse =
proxyCall("https://github.com/login/oauth/access_token") {
append("client_id", clientId)
append("device_code", deviceCode)
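
The reasoning behind making `GitHubDeviceClient` `open` with a defaulted `clientId` can be shown with simplified stand-ins. Everything below is hypothetical: `DeviceResponse` and `DeviceClient` only imitate the shape of the real classes (whose fields and HTTP logic are not reproduced here), and `runSync` is a stdlib-only helper that works solely for suspend blocks that never actually suspend.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.startCoroutine

// Stand-in for the real response type; its actual fields are not shown in the diff.
data class DeviceResponse(val status: Int, val body: String)

// Simplified stand-in: open class, open suspend method, and a constructor
// parameter whose default reads the env var only when no argument is passed.
open class DeviceClient(
    private val clientId: String =
        System.getenv("GITHUB_OAUTH_CLIENT_ID")?.takeIf { it.isNotBlank() }
            ?: error("GITHUB_OAUTH_CLIENT_ID env var is required"),
) {
    open suspend fun startDeviceFlow(): DeviceResponse =
        error("a real client would call GitHub with client_id=$clientId")
}

// Test double: supplies clientId explicitly, so the env-var default is never evaluated.
class CannedDeviceClient(private val canned: DeviceResponse) :
    DeviceClient(clientId = "test-client-id") {
    override suspend fun startDeviceFlow(): DeviceResponse = canned
}

// Runs a suspend block to completion; valid only if the block never suspends.
fun <T> runSync(block: suspend () -> T): T {
    var outcome: Result<T>? = null
    block.startCoroutine(Continuation(EmptyCoroutineContext) { outcome = it })
    return outcome!!.getOrThrow()
}

fun main() {
    val client: DeviceClient = CannedDeviceClient(DeviceResponse(200, "canned"))
    println(runSync { client.startDeviceFlow() })
}
```

Kotlin evaluates a default argument only when the caller omits it, which is why the fake can be constructed on a machine where `GITHUB_OAUTH_CLIENT_ID` is unset.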