fix(security/gcp): drop plaintext SCHEDULED_TASK_SECRET — rely on OIDC + Cloud Run IAM (closes #159) by cristim · Pull Request #161 · LeanerCloud/CUDly

cristim · 2026-04-27T23:10:40Z

Closes #159 + closes the auth gap surfaced during verification

The original commit (afa636c88) dropped the plaintext
Authorization: Bearer ${var.scheduled_task_secret} header from
Cloud Scheduler — fixing the SCHEDULED_TASK_SECRET leak in Terraform
state per #159. Verification then surfaced a follow-on issue: with the
header gone, /api/scheduled/* was completely unauthenticated on
Cloud Run AND GKE today (cloud_run_allow_unauthenticated = true
everywhere; GKE Ingress has no IAP). This PR completes the fix by
implementing Option B: app-level OIDC ID-token validation in Go
middleware.

What changed

	Before	After
Cloud Scheduler → Cloud Run / GKE	static `Bearer <plaintext>` header	OIDC ID-token signed by scheduler SA
Token validation on the app side	inline `subtle.ConstantTimeCompare` against env-var secret	new `internal/server/scheduledauth` middleware: signature (RS256 only) via JWKS, issuer, audience allow-list, subject pinning to scheduler SA (defence in depth), 60s clock skew, refresh-on-unknown-kid
Azure Logic Apps → Container Apps	bearer (Key Vault → Logic App at runtime)	bearer (unchanged — Logic Apps' HTTP Connector does not emit Entra OIDC)

Why Option B over Option A (#78 cross-PR dependency)

Option A would flip cloud_run_allow_unauthenticated = false and rely
entirely on Cloud Run's IAM gate (#78 tracks that). Option B is
preferred here because:

No cross-PR dependency — this PR closes the gap atomically.
Works on GKE too, which has no Cloud-Run-style IAM gate and no
IAP set up; Option A would not cover it.
Mirrors the existing Azure approach (app-level bearer check) —
easier to reason about; one auth surface instead of two.
Independent of ops(gcp): take Cloud Run secure-ingress default into effect (per-env LB enablement + override removal) #78 — flipping allow_unauthenticated = false
is still worth doing as defence in depth, and ops(gcp): take Cloud Run secure-ingress default into effect (per-env LB enablement + override removal) #78 stays open. Option
B works whether or not Cloud Run's IAM gate is enforced.

Implementation

5 atomic commits, each independently revertable:

feat(server): add scheduled-task OIDC validator — new
internal/server/scheduledauth package: Validator + JWKS cache (via
coreos/go-oidc/v3's single-flight RemoteKeySet, refresh-on-unknown-
kid for rotation), 3 modes (oidc / bearer / disabled), fail-
fast config parse. RS256 only. Subject pinning REQUIRED in oidc mode
(any GCP SA can mint a token with arbitrary aud — only the
scheduler SA has our sub). 60s skew on exp/nbf, iat-future check.
Comprehensive unit tests including alg-confusion, key rotation, and
single-flight stampede protection.
fix(server): replace bearer-only check with OIDC-aware middleware — wires the validator into CreateHTTPServer, drops
the inline subtle.ConstantTimeCompare from handleScheduledHTTP,
best-effort JWKS warmup at startup, fail-fast on misconfig.
fix(terraform/gcp): pin oidc_token.audience + wire scheduled-task auth env vars — explicit audience = SA email on every
oidc_token block (without it Cloud Scheduler defaults to the target
URL, coupling the validator to Cloud Run URL shape).
SCHEDULED_TASK_AUTH_MODE, SCHEDULED_TASK_OIDC_AUDIENCE,
SCHEDULED_TASK_OIDC_SUBJECTS env vars wired automatically by the
cloud-run / gke / container-apps / aks modules.
SCHEDULED_TASK_OIDC_SUBJECTS pins to the SA's unique_id (the
value Google sets in the JWT sub claim for SA-signed ID tokens).
test(server): integration test for scheduled-task OIDC middleware — full HTTP roundtrip (httptest.Server + mocked
JWKS) covering: valid token, key not in JWKS, missing header, wrong
sub, wrong aud.

Verification

Local gates (all run against the rebased branch on top of
feat/multicloud-web-frontend):

go test ./... from repo root — all green
go test ./... from pkg/ — all green
go vet ./... on both modules — clean
terraform validate on every touched module
(compute/gcp/cloud-run, compute/gcp/gke,
compute/azure/container-apps, compute/azure/aks) and environment
(environments/gcp, environments/azure) — all valid
terraform fmt -check -recursive terraform/ — clean
Regression check: grep -rn 'Bearer ${var.scheduled_task_secret}' terraform/
returns nothing
Regression check: grep -rn 'SCHEDULED_TASK_SECRET\b' terraform/
returns only SCHEDULED_TASK_SECRET_NAME (the KV reference, not the
plaintext) on Azure
CI watcher armed for every workflow run on the latest push
(pre-commit, Azure Sanity, AWS Sanity)

CodeRabbit review loop will run automatically once the PR exits draft;
the watcher is already armed.

Rollback / blast radius

The change is isolated: a new package + middleware wiring + per-module
Terraform env vars. To revert, drop the four feature commits — the
codebase returns to the pre-#161 plaintext-bearer state, but #159's
finding is back, so a partial revert (e.g. dropping only the new
package) is not a useful intermediate.

Out of scope

ops(gcp): take Cloud Run secure-ingress default into effect (per-env LB enablement + override removal) #78 (cloud_run_allow_unauthenticated = false) — still worth
doing as defence in depth, tracked separately.
AWS Lambda — unaffected; EventBridge runs ECS tasks via
RunTask, no HTTP scheduled-task path on AWS.
Azure auth migration to Entra OIDC — Azure Logic Apps' HTTP
Connector does not emit Entra ID tokens, so bearer mode stays.

Notes for reviewers

The validator's JWKS cache uses go-oidc's RemoteKeySet, which has
built-in single-flight + refresh-on-unknown-kid (per
OIDC Core §10.1.1).
No fixed TTL — refresh-on-unknown-kid is sufficient because providers
always sign new tokens with the new kid before retiring the old key.
SCHEDULED_TASK_OIDC_SUBJECTS is REQUIRED in oidc mode — empty
subjects → ErrConfigInvalid → container fails to start. Better
than silently accepting any SA in the org.
Local-dev default is disabled mode with a loud per-request WARN
log so it's impossible to ship to production without explicitly
opting into oidc / bearer.

Summary by CodeRabbit

New Features
- Configurable scheduled-task authentication (OIDC, bearer, disabled); scheduled endpoints now enforced by middleware with JWKS warmup.
Chores
- Infrastructure no longer injects plaintext scheduled-task secrets; schedulers updated to use OIDC or Key Vault-held bearer tokens.
- Dependency updates for OIDC/JOSE libraries.
Documentation
- Terraform examples and comments clarified for scheduled invocation and auth configuration.
Tests
- Added extensive unit and integration tests for OIDC, bearer, JWKS, timing, and middleware.

cristim · 2026-04-27T23:10:42Z

@coderabbitai review

coderabbitai · 2026-04-27T23:10:45Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7d6f7b80-ab05-4e89-8655-3ea72e64249c

📥 Commits

Reviewing files that changed from the base of the PR and between 7dbf859 and 7935b6b.

📒 Files selected for processing (3)

terraform/environments/azure/dev.tfvars.example
terraform/environments/azure/variables.tf
terraform/modules/compute/azure/container-apps/variables.tf

✅ Files skipped from review due to trivial changes (2)

terraform/modules/compute/azure/container-apps/variables.tf
terraform/environments/azure/variables.tf

📝 Walkthrough

Walkthrough

Adds a new scheduled-task authentication subsystem (disabled/bearer/oidc) with env-driven config, a Validator, middleware and JWKS warmup/caching; wires the validator into server startup and routing; updates tests; and changes Terraform to stop embedding plaintext scheduled-task secrets (GCP→OIDC, Azure→bearer).

Changes

Cohort / File(s)	Summary
Go module `go.mod`	Add `github.com/coreos/go-oidc/v3@v3.18.0` and bump `github.com/go-jose/go-jose/v4` to `v4.1.4`.
Scheduled auth package `internal/server/scheduledauth/config.go`, `internal/server/scheduledauth/errors.go`, `internal/server/scheduledauth/validator.go`, `internal/server/scheduledauth/validator_test.go`, `internal/server/scheduledauth/integration_test.go`	New package implementing Env-driven config, sentinel errors, Validator (disabled/bearer/oidc), JWKS warmup/refresh and middleware; extensive unit and integration tests. Inspect config parsing, JWKS fetch/rotation, concurrency/single-flight behavior, timestamp/skew logic, and error classification.
Server wiring & middleware `internal/server/app.go`, `internal/server/http.go`, `internal/server/http_test.go`, `internal/server/app_test.go`	Wire `scheduledAuth *scheduledauth.Validator` into Application, build from env, fail-fast on invalid config, best-effort 5s JWKS warmup, add `scheduledAuthMiddleware`, and update tests to exercise middleware. Check startup error handling and middleware integration.
HTTP handler `internal/server/http.go`, `internal/server/http_test.go`	Move scheduled-task auth enforcement into `scheduledAuthMiddleware`; remove in-handler secret comparison and constant-time compare usage. Update tests to route through middleware.
Terraform — GCP modules & envs `terraform/modules/compute/gcp/cloud-run/...`, `terraform/modules/compute/gcp/gke/...`, `terraform/environments/gcp/*`, `terraform/environments/gcp/compute.tf`, `terraform/environments/gcp/dev.tfvars.example`	Remove plaintext `scheduled_task_secret` inputs/wiring; switch Cloud Scheduler jobs to per-invocation OIDC tokens and add container envs for expected OIDC audiences/subjects. Review audience/sub mapping, env propagation, and variable removals.
Terraform — Azure modules & examples `terraform/modules/compute/azure/aks/main.tf`, `terraform/modules/compute/azure/container-apps/main.tf`, `terraform/environments/azure/dev.tfvars.example`	Set `SCHEDULED_TASK_AUTH_MODE="bearer"` for Azure deployments and document Key Vault-backed runtime secret retrieval in examples. Verify env injection and example docs.
Terraform — docs and variable descriptions `terraform/environments/azure/variables.tf`, `terraform/modules/compute/azure/container-apps/variables.tf`, `terraform/environments/gcp/dev.tfvars.example`, `terraform/environments/azure/dev.tfvars.example`	Expand variable descriptions and examples documenting scheduled-task auth behavior and where envs are provided. No functional variable type changes (exceptions noted elsewhere).
Tests / CI `internal/server/scheduledauth/*_test.go`, `internal/server/scheduledauth/integration_test.go`, `internal/server/http_test.go`, `internal/server/app_test.go`	Add/adjust tests for OIDC validator, JWKS rotation/refresh, concurrent fetch protection, middleware integration, and updated startup defaults. Review concurrency expectations and test harness utilities.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Middleware as scheduledAuthMiddleware
    participant Validator
    participant JWKS as JWKS Endpoint
    participant Handler as handleScheduledHTTP

    Client->>Middleware: POST /api/scheduled/*\nAuthorization: Bearer {token}
    Middleware->>Validator: Validate(ctx, Authorization)
    alt OIDC Mode
        Validator->>JWKS: Fetch / use cached JWKS
        JWKS-->>Validator: Keys (rgba(30,144,255,0.5))
        Validator->>Validator: Verify signature, timestamps, aud & sub
        Validator-->>Middleware: success
        Middleware->>Handler: next.ServeHTTP()
        Handler-->>Client: 200 OK
    else Bearer Mode
        Validator->>Validator: Constant-time compare with secret
        Validator-->>Middleware: success / ErrUnauthorized
        Middleware->>Handler: next.ServeHTTP()
        Handler-->>Client: 200 OK
    else Unauthorized
        Validator-->>Middleware: ErrUnauthorized
        Middleware-->>Client: 401 Unauthorized
    end

sequenceDiagram
    participant App as NewApplicationFromDeps
    participant Env as envSourceOS
    participant Config as scheduledauth.LoadConfig
    participant Validator as scheduledauth.New
    participant JWKS as JWKS Warmup Probe

    App->>Env: Read SCHEDULED_TASK_* envs
    App->>Config: LoadConfig(env)
    Config-->>App: Config{mode, jwks, aud, sub, secret}
    App->>Validator: scheduledauth.New(cfg)
    alt OIDC Mode
        Validator->>JWKS: Warmup probe (5s timeout)
        JWKS-->>Validator: keys / timeout (rgba(34,139,34,0.5))
        Validator-->>App: validator instance
    else Bearer/Disabled
        Validator-->>App: validator instance
    else Invalid
        Validator-->>App: ErrConfigInvalid
    end
    App-->>App: Wire into Application.scheduledAuth

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

fix(security/azure): migrate Logic App SCHEDULED_TASK_SECRET to Key Vault connection #74 — Terraform changes removing plaintext scheduled-task secret for Azure and switching to Key Vault runtime retrieval (strongly related to Terraform secret-handling changes).

Suggested labels

priority/p0, urgency/now, impact/all-users, effort/xl

Poem

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 26.98% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the main security fix: dropping plaintext SCHEDULED_TASK_SECRET and implementing OIDC + Cloud Run IAM instead of embedding secrets in resource definitions.
Linked Issues check	✅ Passed	The PR fully addresses issue `#159` by implementing Option B (app-level OIDC validation): Cloud Scheduler now uses OIDC tokens, a new scheduledauth package validates them with JWKS caching and subject pinning, middleware wiring replaces inline bearer checks, and Terraform removes plaintext secret interpolation across cloud-run/gke/container-apps/aks modules.
Out of Scope Changes check	✅ Passed	All changes are scoped to scheduled-task authentication: internal scheduledauth validator package, middleware wiring, Terraform oidc_token configuration, and documentation updates. Azure bearer-mode configuration and Terraform variable description improvements are consistent with unified auth modes and are not out of scope.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/issue-159-gcp-secret-runtime

_{Review rate limit: 2/5 reviews remaining, refill in 33 minutes and 40 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai · 2026-04-27T23:10:47Z

✅ Actions performed

Review triggered.

cristim · 2026-04-27T23:21:42Z

⚠️ Security regression flagged by post-merge verification

A verification agent re-audited this PR end-to-end and surfaced two issues that mean this PR should not merge in its current form. Both pre-date the PR but the PR removes the only remaining auth layer that was masking them on GCP. Filing here so reviewers see it before merge.

1. Cloud Run `allow_unauthenticated = true` in every deployed env

terraform/environments/gcp/{github-prod,github-staging,github-dev,dev.tfvars.example} all set cloud_run_allow_unauthenticated = true. Combined with cloud_run_ingress = "INGRESS_TRAFFIC_ALL" (prod tfvars line 32), the *.run.app URL is reachable from the public internet and Cloud Run does NOT enforce IAM. The OIDC token the scheduler signs is sent but never validated by the platform — roles/run.invoker is granted to allUsers per cloud-run/main.tf:172-180.

The PR's commit message claims cloud_run_allow_unauthenticated = false is the default per #51 — but #51's flip never landed for any actually-used env. That's tracked separately in #78 (per-env enablement migration), still open.

2. GKE Ingress doesn't validate OIDC at all

terraform/modules/compute/gcp/gke/main.tf:549-586 defines a plain kubernetes_ingress_v1 with ingress.class = "gce" only. No IAP (iap.googleapis.com), no BackendConfig with iap block, no Istio sidecar, no JWT validator. The GCE LB forwards the Authorization: Bearer <id_token> Cloud Scheduler signs straight through to the pod without validation.

Pre-PR this didn't matter because the app-level bearer check at internal/server/http.go:117 matched the static Bearer ${var.scheduled_task_secret}. Post-PR the env var is removed, the bearer check no-ops, and the GKE-hosted /api/scheduled/* endpoints are reachable unauthenticated from anyone who can reach the LB.

GKE isn't currently the deployed compute_platform (all three github-*.tfvars use cloud-run), so the impact is latent — but the GKE option is documented in dev.tfvars.example:36 and this PR's comment at gke/main.tf:679-681 claims "OIDC supersedes the application-level bearer check on GCP", which is materially false for GKE.

Suggested remediation

Two viable paths:

Option A (platform auth): land the prerequisites first. (1) flip cloud_run_allow_unauthenticated to false in all tfvars (closes ops(gcp): take Cloud Run secure-ingress default into effect (per-env LB enablement + override removal) #78); (2) add iap.googleapis.com annotations + a BackendConfig with an iap block to the GKE Ingress; only then merge this PR. The OIDC token is then validated at the platform layer and the bearer-removal is safe.
Option B (application-level OIDC validation): add a Go-side validator in the /api/scheduled/* middleware that verifies the incoming Authorization: Bearer <id_token> against Google's published JWKS and checks the audience matches the scheduler SA's expected audience. Then this PR closes fix(security/gcp): Cloud Scheduler + GKE jobs still embed plaintext SCHEDULED_TASK_SECRET in Authorization header #159 without depending on platform-specific config that may or may not be configured. Cleaner long-term — works on both Cloud Run and GKE regardless of platform IAM state, and matches what the Azure side does (runtime-fetched secret + app-level check).

I'd lean toward Option B since it doesn't have a hidden cross-PR dependency.

What's still correct in this PR

The plaintext leak from fix(security/gcp): Cloud Scheduler + GKE jobs still embed plaintext SCHEDULED_TASK_SECRET in Authorization header #159's stated scope is fully closed: no Bearer ${var.scheduled_task_secret} matches anywhere in terraform/, no SCHEDULED_TASK_SECRET = …value env vars on Cloud Run / GKE, secret no longer flows into Terraform state from compute modules.
The Azure side is unaffected — Azure path correctly stays on the _NAME + Key Vault runtime-fetch model.

So the diff itself is structurally fine; the missing piece is the auth layer that replaces what the bearer check used to do.

Converting this PR to draft until either Option A or Option B lands.

cristim · 2026-04-28T14:14:45Z

P1 security fix (closes #159) but blocked — see the detailed verification comment above. The PR correctly removes plaintext SCHEDULED_TASK_SECRET from IaC/state, but removing the application-level bearer check without a replacement auth layer leaves GKE endpoints unauthenticated (Cloud Run is also exposed because cloud_run_allow_unauthenticated = true in all deployed envs, contrary to the PR description). Two remediation paths documented in the comment above (Option A: flip allow_unauthenticated flag + add IAP to GKE; Option B: Go-side OIDC JWT validator). Draft status is correct; do not merge until one path lands. (triage agent wave2-F)

Adds a new `internal/server/scheduledauth` package providing app-level authentication for the `/api/scheduled/*` endpoints. Three modes: - `oidc` — verifies a Google-issued OIDC ID token (RS256, JWKS- validated). Subject pinning to the scheduler SA email is REQUIRED (defence in depth — any GCP SA in the org can mint a token with arbitrary `aud`, only the scheduler has our `sub`). 60s clock skew, refresh-on-unknown-kid for key rotation, single-flight JWKS cache via go-oidc/v3. - `bearer` — constant-time compare against a shared secret. Used on Azure where Logic Apps fetch the secret from Key Vault at invocation time. - `disabled` — local dev only. Loud per-request WARN logs. Mode is selected via `SCHEDULED_TASK_AUTH_MODE`; OIDC params via `SCHEDULED_TASK_OIDC_AUDIENCE` (CSV) and `SCHEDULED_TASK_OIDC_SUBJECTS` (CSV, REQUIRED in oidc mode); bearer secret via `SCHEDULED_TASK_SECRET`. Fail-fast on misconfig — empty subjects or unknown mode returns `ErrConfigInvalid` so the server refuses to start rather than silently accepting everything. This is the validator + config parser only; wiring into the HTTP server and the Terraform changes follow in subsequent commits. Closes the auth gap surfaced during PR #161 verification: dropping the plaintext `SCHEDULED_TASK_SECRET` from Cloud Scheduler headers leaves `/api/scheduled/*` unauthenticated on Cloud Run / GKE (both have `cloud_run_allow_unauthenticated = true` today; #78 will flip that separately as defence-in-depth, but Option B is independent of it).

…/api/scheduled/* (#161) Wires the new `internal/server/scheduledauth` validator into the HTTP server. Replaces the inline `subtle.ConstantTimeCompare` against `appConfig.ScheduledTaskSecret` in `handleScheduledHTTP` with a proper middleware applied at mux registration time, so: - Auth is enforced before any handler logic runs (including the method check — invalid methods now return 401 in authenticated mode rather than disclosing which methods exist). - The validator picks the right auth scheme per platform: oidc on GCP, bearer on Azure, disabled for local dev (with loud WARN logs). - The bearer secret in bearer mode comes from `cfg.ScheduledTaskSecret`, which `resolveScheduledTaskSecret` already populates from Key Vault / Secrets Manager via the SecretResolver, so the plaintext never lives in a container env var. Validator construction is fail-fast: misconfig (empty subjects in oidc mode, unknown mode, etc.) returns an error from `NewApplicationFromDeps` and the container refuses to start. A best-effort JWKS warmup runs at startup to surface misconfiguration early without blocking on Google's CDN. Existing `TestHandleScheduledHTTP` cases that exercised the inline bearer check are migrated to drive through the middleware via the new `scheduledAuthMiddleware` helper, mirroring how `CreateHTTPServer` wires it. The remaining cases (method check, path validation) keep `app.scheduledAuth = nil` — the helper passes through unmodified, so those tests continue to validate the handler in isolation.

…h env vars (#161) Closes the auth gap on GCP that #161 surfaced: removing the plaintext `Authorization: Bearer ${var.scheduled_task_secret}` header from Cloud Scheduler left `/api/scheduled/*` unauthenticated on Cloud Run and GKE today (both have `cloud_run_allow_unauthenticated = true` and GKE has no IAP / IAM gate). This commit completes Option B: the in-app OIDC validator (added in earlier commits) now has explicit, deterministic config for the audience and subject claims it must accept. Cloud Run + GKE modules: - Set `oidc_token.audience` explicitly to the scheduler SA email on every `google_cloud_scheduler_job.http_target` block. Without this, Cloud Scheduler defaults `audience` to the target URL — coupling the validator's allow-list to the Cloud Run URL shape and breaking on Function-URL / vanity-domain changes. The explicit value mirrors the `email` we add to `SCHEDULED_TASK_OIDC_AUDIENCE` env. - Wire `SCHEDULED_TASK_AUTH_MODE`, `SCHEDULED_TASK_OIDC_AUDIENCE`, `SCHEDULED_TASK_OIDC_SUBJECTS` env vars on the running container. `SUBJECTS` is pinned to the SA's `unique_id` (the value Google sets in the JWT `sub` claim for SA-signed ID tokens), `AUDIENCE` to the SA email. When neither scheduler is enabled, the mode falls back to `disabled` — matches the old behaviour of skipping auth when no secret was set. Azure container-apps + AKS modules: - Set `SCHEDULED_TASK_AUTH_MODE = "bearer"`. The existing `SCHEDULED_TASK_SECRET_NAME` (Key Vault secret name, NOT value) flow is unchanged — `internal/server/app.go::resolveScheduledTaskSecret` resolves it via the SecretResolver at startup and the validator constant-time-compares it. Logic Apps' HTTP Connector trigger does not emit Entra OIDC tokens, so OIDC mode is not viable on Azure. tfvars examples (gcp/dev, azure/dev): document that the `SCHEDULED_*` env vars are wired automatically by the modules — users do not set them in `*.tfvars`. AWS Lambda is unaffected (EventBridge runs ECS tasks directly via `RunTask`; the HTTP `/api/scheduled/*` route is not on the critical path on AWS). #78 (flipping `cloud_run_allow_unauthenticated = false`) remains worth doing as defence-in-depth but is independent of this change — Option B works whether or not Cloud Run's IAM gate is enforced.

Adds `internal/server/scheduledauth/integration_test.go` exercising the full HTTP roundtrip: middleware in front of an inner handler, JWT signed by a test key, JWKS published over an httptest server, request sent through `http.DefaultClient` so net/http header parsing and the validator's lazy JWKS fetch are both on the path. Five sub-cases, each independently verifiable: 1. Valid token signed by published key → 200, handler runs. 2. Token signed by a key absent from the JWKS → 401, handler skipped. 3. Missing Authorization header → 401. 4. Wrong `sub` (subject pinning) → 401. 5. Wrong `aud` → 401. Reuses the helpers from `validator_test.go` (same package) to keep test infrastructure DRY.

cristim · 2026-04-29T19:00:16Z

@coderabbitai review

coderabbitai · 2026-04-29T19:00:21Z

✅ Actions performed

Review triggered.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (4)

internal/server/scheduledauth/errors.go (1)

16-16: Remove local filesystem reference from package docs.

Line 16 points to a developer-local path, which is non-portable and noisy in repo docs.

Proposed doc cleanup

-// See plan: ~/.claude/projects/.../plans/issue-161-oidc-validator.md
+// See repository docs/specs for scheduled-task auth design details.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/server/scheduledauth/errors.go` at line 16, The package-level
comment in internal/server/scheduledauth/errors.go contains a developer-local
filesystem reference
("~/.claude/projects/.../plans/issue-161-oidc-validator.md"); remove that local
path from the file header and replace it with a portable reference (e.g., a
short, repository-relative note like "See plans/issue-161-oidc-validator.md" or
remove the line entirely) so the package docs no longer include non-portable
developer-specific paths; update the top-of-file package comment where this path
appears (package docs / file comment) accordingly.

internal/server/http_test.go (1)

267-273: Prefer test assertions over panic in setup closures.

Using panic here makes failures noisier than needed. Pass *testing.T into setupApp and assert on constructor errors directly.

Refactor sketch

- setupApp       func(*Application)
+ setupApp       func(*testing.T, *Application)

- tt.setupApp(app)
+ tt.setupApp(t, app)

- setupApp: func(app *Application) {
+ setupApp: func(t *testing.T, app *Application) {
    v, err := scheduledauth.New(scheduledauth.Config{
      Mode:   scheduledauth.ModeBearer,
      Bearer: "my-secret",
    })
-   if err != nil {
-     panic(err)
-   }
+   testutil.AssertNoError(t, err)
    app.scheduledAuth = v
  },

Also applies to: 284-290

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/server/http_test.go` around lines 267 - 273, The test setup
currently panics on scheduledauth.New errors; change setupApp to accept
*testing.T and replace panic(err) with a test assertion (e.g., require.NoError
or t.Fatalf) when calling scheduledauth.New(scheduledauth.Config{Mode:
scheduledauth.ModeBearer, Bearer: "my-secret"}); update the other occurrence
(the 284-290 block) similarly so both scheduledauth.New calls use the test
helper and assert failures via the testing.T parameter rather than panicking.

internal/server/app.go (1)

83-84: Soften the “Always non-nil” claim in the field comment.

This is true for normal bootstrap, but not universally true in tests/manual construction, so the comment can be made more precise.

Comment tweak

- // Always non-nil — disabled mode allows everything through with a
+ // Non-nil for normal app bootstrap — disabled mode allows everything through with a

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/server/app.go` around lines 83 - 84, Update the field comment that
currently begins "Always non-nil — disabled mode allows everything through with
a loud WARN log..." to remove the absolute claim "Always non-nil" and instead
state that the field is normally non-nil in the bootstrapped server but may be
nil in tests or when manually constructed; keep the rest of the comment about
disabled mode and oidc (GCP) validation intact so readers know expected runtime
behavior and testing exceptions.

internal/server/scheduledauth/validator.go (1)

209-228: Give JWKS warmup its own deadline.

http.DefaultClient has no timeout, so Warmup can block startup indefinitely when callers pass a context without a deadline. Add a small fallback timeout before issuing the GET.

⏱ Suggested guard

func (v *Validator) Warmup(ctx context.Context) {
	if v.mode != ModeOIDC || v.jwksURL == "" {
		return
	}
+	if _, ok := ctx.Deadline(); !ok {
+		var cancel context.CancelFunc
+		ctx, cancel = context.WithTimeout(ctx, 5*time.Second)
+		defer cancel()
+	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, v.jwksURL, nil)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/server/scheduledauth/validator.go` around lines 209 - 228, Warmup
can block indefinitely because http.DefaultClient has no timeout; update
Validator.Warmup to use a bounded request context or an http.Client with a small
Timeout when v.mode == ModeOIDC and v.jwksURL != "" — e.g., derive a child
context with a short deadline (context.WithTimeout) or create a local
http.Client{Timeout: ...} before calling Do, use that client to perform the GET
and ensure the context/cancel is released; keep the existing log/error handling
and defer resp.Body.Close() as-is.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/server/http.go`:
- Around line 116-120: The scheduledAuthMiddleware currently returns next when
app.scheduledAuth is nil which creates an auth bypass; change
scheduledAuthMiddleware to fail-closed: if app.scheduledAuth == nil return an
http.Handler that rejects requests (e.g., respond with 401/403 and short
message) instead of returning next, and keep the normal path returning
app.scheduledAuth.Middleware(next); update tests to inject an explicit disabled
validator stub where they intend no auth rather than relying on nil.

In `@internal/server/scheduledauth/config.go`:
- Around line 31-39: LoadConfig currently treats a missing/empty
EnvSource.Get(EnvAuthMode) as ModeDisabled which silently enables
unauthenticated endpoints; change LoadConfig to treat an empty/missing
EnvAuthMode as a configuration error (return ErrConfigInvalid) and only allow
ModeDisabled when the env value explicitly equals the trimmed, lowercased string
of ModeDisabled; update the validation in LoadConfig (around the mode :=
strings.ToLower(strings.TrimSpace(env.Get(EnvAuthMode))) logic) to return
ErrConfigInvalid when mode == "" and otherwise continue validating accepted
modes before returning Config.

In `@internal/server/scheduledauth/validator_test.go`:
- Around line 83-90: The /swap handler currently reads the request body with a
single Read which can return a short buffer and produce a truncated JWKS; change
the handler registered with mux.HandleFunc("/swap", ...) to read the entire body
(e.g. use io.ReadAll or io.ReadFull loop) and only assign the resulting full
byte slice to current after a successful full read, and add the io import if
needed.

In `@terraform/environments/gcp/compute.tf`:
- Around line 52-57: The comment referencing scheduled tasks currently implies
“no app-level auth on GCP”—replace the sentence that mentions "plaintext
scheduled_task_secret" and "Application-level bearer auth is dropped on GCP"
with a clarified note stating that Cloud Scheduler authenticates to Cloud Run
via OIDC tokens + roles/run.invoker and that application-level bearer checks are
not removed but are replaced/implemented as OIDC validation at the application
level (i.e., the app should validate the OIDC token rather than rely on a
plaintext scheduled_task_secret), while keeping the Azure note about retaining
bearer checks for Container Apps.

In `@terraform/modules/compute/gcp/cloud-run/main.tf`:
- Around line 60-64: The current expression sets scheduled_task_auth_mode to
"disabled" when both var.enable_scheduled_tasks and
var.enable_ri_exchange_schedule are false, leaving /api/scheduled/*
unauthenticated; change the logic so the default is an authenticated mode (e.g.,
"oidc" or another enforced mode used by your app) instead of "disabled". Update
the scheduled_task_auth_mode expression that references
var.enable_scheduled_tasks and var.enable_ri_exchange_schedule to return the
authenticated string (for example "oidc") when endpoints are present or when
absent by default, and only return "disabled" when an explicit new variable (or
a new explicit flag) requests disabling; locate the scheduled_task_auth_mode
assignment in the module to make this change.

In `@terraform/modules/compute/gcp/gke/main.tf`:
- Around line 22-32: Don't disable scheduled-task auth when the scheduler
feature flag is off: change scheduled_task_auth_mode so it does not become
"disabled" when var.enable_scheduled_tasks is false. Instead always set
scheduled_task_auth_mode = "oidc" (leave scheduled_task_oidc_audience and
scheduled_task_oidc_subject using try(...) as-is) so the ingress
/api/scheduled/* endpoints continue to require OIDC validation even if no
scheduler ServiceAccount exists.

---

Nitpick comments:
In `@internal/server/app.go`:
- Around line 83-84: Update the field comment that currently begins "Always
non-nil — disabled mode allows everything through with a loud WARN log..." to
remove the absolute claim "Always non-nil" and instead state that the field is
normally non-nil in the bootstrapped server but may be nil in tests or when
manually constructed; keep the rest of the comment about disabled mode and oidc
(GCP) validation intact so readers know expected runtime behavior and testing
exceptions.

In `@internal/server/http_test.go`:
- Around line 267-273: The test setup currently panics on scheduledauth.New
errors; change setupApp to accept *testing.T and replace panic(err) with a test
assertion (e.g., require.NoError or t.Fatalf) when calling
scheduledauth.New(scheduledauth.Config{Mode: scheduledauth.ModeBearer, Bearer:
"my-secret"}); update the other occurrence (the 284-290 block) similarly so both
scheduledauth.New calls use the test helper and assert failures via the
testing.T parameter rather than panicking.

In `@internal/server/scheduledauth/errors.go`:
- Line 16: The package-level comment in internal/server/scheduledauth/errors.go
contains a developer-local filesystem reference
("~/.claude/projects/.../plans/issue-161-oidc-validator.md"); remove that local
path from the file header and replace it with a portable reference (e.g., a
short, repository-relative note like "See plans/issue-161-oidc-validator.md" or
remove the line entirely) so the package docs no longer include non-portable
developer-specific paths; update the top-of-file package comment where this path
appears (package docs / file comment) accordingly.

In `@internal/server/scheduledauth/validator.go`:
- Around line 209-228: Warmup can block indefinitely because http.DefaultClient
has no timeout; update Validator.Warmup to use a bounded request context or an
http.Client with a small Timeout when v.mode == ModeOIDC and v.jwksURL != "" —
e.g., derive a child context with a short deadline (context.WithTimeout) or
create a local http.Client{Timeout: ...} before calling Do, use that client to
perform the GET and ensure the context/cancel is released; keep the existing
log/error handling and defer resp.Body.Close() as-is.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fc81e336-3f30-4193-b19d-d79acea7e54d

📥 Commits

Reviewing files that changed from the base of the PR and between 7267600 and b84291d.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum

📒 Files selected for processing (18)

go.mod
internal/server/app.go
internal/server/http.go
internal/server/http_test.go
internal/server/scheduledauth/config.go
internal/server/scheduledauth/errors.go
internal/server/scheduledauth/integration_test.go
internal/server/scheduledauth/validator.go
internal/server/scheduledauth/validator_test.go
terraform/environments/azure/dev.tfvars.example
terraform/environments/gcp/compute.tf
terraform/environments/gcp/dev.tfvars.example
terraform/modules/compute/azure/aks/main.tf
terraform/modules/compute/azure/container-apps/main.tf
terraform/modules/compute/gcp/cloud-run/main.tf
terraform/modules/compute/gcp/cloud-run/variables.tf
terraform/modules/compute/gcp/gke/main.tf
terraform/modules/compute/gcp/gke/variables.tf

💤 Files with no reviewable changes (2)

terraform/modules/compute/gcp/cloud-run/variables.tf
terraform/modules/compute/gcp/gke/variables.tf

coderabbitai · 2026-04-29T19:07:22Z

+func (app *Application) scheduledAuthMiddleware(next http.Handler) http.Handler {
+	if app.scheduledAuth == nil {
+		return next
+	}
+	return app.scheduledAuth.Middleware(next)


⚠️ Potential issue | 🟠 Major

Fail closed when scheduled auth has not been initialized.

Returning next here makes every /api/scheduled/* route public whenever an Application is built outside the production constructor. That is an auth bypass on the exact path this PR is hardening; tests should inject an explicit disabled validator instead of relying on a nil pass-through.

🔒 Suggested fail-closed guard

func (app *Application) scheduledAuthMiddleware(next http.Handler) http.Handler { if app.scheduledAuth == nil { - return next + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + http.Error(w, "Scheduled task auth is not configured", http.StatusServiceUnavailable) + }) } return app.scheduledAuth.Middleware(next) }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/server/http.go` around lines 116 - 120, The scheduledAuthMiddleware currently returns next when app.scheduledAuth is nil which creates an auth bypass; change scheduledAuthMiddleware to fail-closed: if app.scheduledAuth == nil return an http.Handler that rejects requests (e.g., respond with 401/403 and short message) instead of returning next, and keep the normal path returning app.scheduledAuth.Middleware(next); update tests to inject an explicit disabled validator stub where they intend no auth rather than relying on nil.

…oc fixes, test panics (#161) Round 1 of CodeRabbit feedback. Triage: Actionable — addressed: - internal/server/scheduledauth/validator_test.go: /swap test handler read body via single Body.Read which can return a short read and truncate the JWKS payload — switched to io.ReadAll, mutex held only while assigning the buffer. - terraform/environments/gcp/compute.tf: comment claimed "Application- level bearer auth is dropped on GCP" — that's stale now that we validate OIDC at the application level. Rewritten to describe the actual setup (OIDC at app + Cloud Run IAM gate as defence in depth) and to call out #78 separately. Nitpicks — addressed: - internal/server/scheduledauth/validator.go::Warmup: http.DefaultClient has no timeout, so a misconfigured JWKS URL could block startup indefinitely. Added a 5s fallback deadline when the caller's ctx has no deadline; respect any deadline the caller already set. - internal/server/scheduledauth/errors.go: package doc referenced a developer-local plan path. Replaced with a portable reference to PR #161 / issue #159. - internal/server/http_test.go: setupApp closures called panic(err) on scheduledauth.New failure. Switched the signature to take *testing.T and use testutil.AssertNoError, scoped via a small newBearerValidator helper to keep the table-driven cases tight. - internal/server/app.go: scheduledAuth field comment claimed "Always non-nil" which is true post-NewApplicationFromDeps but not for hand-built test Applications. Rewritten to spell out both paths. Actionable — dismissed with justification (will reply on PR thread): - "scheduledAuthMiddleware should fail-closed when nil": intentional test affordance; production path always populates the field via NewApplicationFromDeps. - "LoadConfig should reject empty SCHEDULED_TASK_AUTH_MODE": local-dev default-to-disabled+WARN is by design (plan §1). - "scheduled_task_auth_mode should default to oidc when both schedulers off": oidc mode requires non-empty audiences/subjects (validator fails fast otherwise), so disabled is the only correct value when no scheduler is configured. Same reasoning for cloud-run and gke.

cristim · 2026-04-29T19:13:49Z

Pass 1 of CR feedback addressed in 084c755.

Actionable — addressed:

internal/server/scheduledauth/validator_test.go /swap handler: switched from a single Body.Read (could short-read) to io.ReadAll. Mutex held only across the buffer assignment.
terraform/environments/gcp/compute.tf: rewrote the stale comment that claimed "Application-level bearer auth is dropped on GCP" — it's now app-level OIDC + Cloud Run IAM gate as defence in depth (with a ops(gcp): take Cloud Run secure-ingress default into effect (per-env LB enablement + override removal) #78 callout).

Nitpicks — addressed:

validator.go::Warmup: added a 5s fallback deadline when the caller's context has none. http.DefaultClient has no timeout, so a misconfigured JWKS URL would otherwise block startup indefinitely.
errors.go: dropped the developer-local plan path; replaced with a portable PR / issue reference.
http_test.go: reworked setupApp to take *testing.T and use testutil.AssertNoError instead of panic(err). Extracted newBearerValidator to keep the table tight.
app.go: tightened the scheduledAuth field comment — it's "non-nil after NewApplicationFromDeps", not "always non-nil" (tests can build Application directly).

Actionable — dismissed with justification:

"scheduledAuthMiddleware should fail-closed when nil" — intentional test affordance. In production, NewApplicationFromDeps always populates scheduledAuth (and that path is on every server boot path: cmd/server, cmd/lambda). The nil branch only triggers when tests build Application{} directly without the constructor — they'd be testing the handler in isolation, with the middleware itself covered by scheduledauth's own unit + integration tests. Failing closed there would force every handler-level test to construct a stub validator, with no security benefit (because the production path can't reach this branch).
"LoadConfig should reject empty SCHEDULED_TASK_AUTH_MODE" — disabled+WARN is the deliberate local-dev default per the design plan. Local dev (no env vars set) shouldn't fail-fast on a missing security config; the loud per-startup + per-request WARN log makes it impossible to ship to production without explicitly opting in. Failing fast here would force every local-dev contributor to set the env var, with no production benefit (Terraform sets it explicitly on every prod deploy).
"scheduled_task_auth_mode should default to "oidc" when both schedulers off" (cloud-run and gke) — would cause the validator to fail-fast at construction: oidc mode requires non-empty audiences and subjects, both of which become empty when no scheduler is enabled. The current "disabled when no scheduler" flow matches the original behaviour (no scheduler → /api/scheduled/* is unreached + harmless WARN). If the user enables a scheduler later, terraform apply re-flips the mode.

@coderabbitai review

coderabbitai · 2026-04-29T19:14:12Z

@cristim Thanks for the thorough write-up on Pass 1. The addressed items look good, and the justifications for the dismissed items are all sound:

nil middleware pass-through — Agreed. The production boot path via NewApplicationFromDeps guarantees a non-nil scheduledAuth; the nil branch is exclusively a test-isolation affordance with no reachable production path.
disabled+WARN default for empty mode — Agreed. The per-request WARN log acts as a loud signal that prevents silent unguarded deployment; forcing fail-fast on missing config would impose friction on every local-dev contributor for no production gain given Terraform's explicit mode wiring.
oidc default when no scheduler is enabled — Agreed. oidc mode requires non-empty audiences and subjects; defaulting to oidc with empty lists would fail-fast construction for a legitimately inert configuration. The current disabled → re-flip on enable lifecycle is the correct Terraform-idiomatic approach.

Triggering a fresh review against the updated commits now.

✅ Actions performed

Review triggered.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/server/app.go`:
- Around line 295-308: buildScheduledAuth currently treats a missing
SCHEDULED_TASK_AUTH_MODE as "disabled" and ends up constructing a permissive
validator; change buildScheduledAuth to detect when LoadConfig returns a
missing/empty mode (or ModeDisabled) and return a non-nil error instead of
continuing, so the process fails startup on missing auth wiring. Specifically,
after scheduledauth.LoadConfig(envSourceOS{}) validate that saCfg.Mode is set
and not the default disabled sentinel (check against the empty value or
scheduledauth.ModeDisabled) before applying the ModeBearer secret override and
calling scheduledauth.New(saCfg); if the mode is missing/disabled return an
error explaining SCHEDULED_TASK_AUTH_MODE must be configured.

In `@internal/server/scheduledauth/validator.go`:
- Around line 138-142: The comment above the oidc subject check incorrectly says
`sub` pins the scheduler service account email; update that comment to state
that `sub` must be the service-account unique_id (not the email) and clarify
that SCHEDULED_TASK_OIDC_SUBJECTS expects those unique_ids; keep the existing
enforcement logic (the cfg.Subjects check and ErrConfigInvalid usage) unchanged
and only correct the explanatory text referencing cfg.Subjects and
SCHEDULED_TASK_OIDC_SUBJECTS so future configs/tests use the service-account
unique_id.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 06d5098c-4db0-42a1-9eca-ab7dc71a35b1

📥 Commits

Reviewing files that changed from the base of the PR and between b84291d and 084c755.

📒 Files selected for processing (6)

internal/server/app.go
internal/server/http_test.go
internal/server/scheduledauth/errors.go
internal/server/scheduledauth/validator.go
internal/server/scheduledauth/validator_test.go
terraform/environments/gcp/compute.tf

🚧 Files skipped from review as they are similar to previous changes (1)

internal/server/http_test.go

coderabbitai · 2026-04-29T19:21:46Z

+func buildScheduledAuth(cfg ApplicationConfig) (*scheduledauth.Validator, error) {
+	saCfg, err := scheduledauth.LoadConfig(envSourceOS{})
+	if err != nil {
+		return nil, err
+	}
+	// In bearer mode, override the env-supplied secret with the one
+	// already resolved from KV / SM. LoadConfig reads SCHEDULED_TASK_SECRET
+	// directly which is fine for local dev where the env carries the
+	// plaintext, but in prod cfg.ScheduledTaskSecret is the authoritative
+	// value (resolved by resolveScheduledTaskSecret upstream).
+	if saCfg.Mode == scheduledauth.ModeBearer {
+		saCfg.Bearer = cfg.ScheduledTaskSecret
+	}
+	return scheduledauth.New(saCfg)


⚠️ Potential issue | 🟠 Major

Don't fail open when SCHEDULED_TASK_AUTH_MODE is missing.

buildScheduledAuth goes through scheduledauth.LoadConfig, which currently maps an unset mode to disabled. In the production constructor, that turns a missed Terraform/env wiring into a live validator that allows every /api/scheduled/* request instead of failing startup, which is the exact regression this PR is trying to prevent.

🔒 Proposed guard

func buildScheduledAuth(cfg ApplicationConfig) (*scheduledauth.Validator, error) { + if strings.TrimSpace(os.Getenv(scheduledauth.EnvAuthMode)) == "" { + return nil, fmt.Errorf( + "%w: %s must be explicitly set (use disabled only for local dev)", + scheduledauth.ErrConfigInvalid, + scheduledauth.EnvAuthMode, + ) + } saCfg, err := scheduledauth.LoadConfig(envSourceOS{}) if err != nil { return nil, err }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func buildScheduledAuth(cfg ApplicationConfig) (*scheduledauth.Validator, error) {

saCfg, err := scheduledauth.LoadConfig(envSourceOS{})

if err != nil {

return nil, err

}

// In bearer mode, override the env-supplied secret with the one

// already resolved from KV / SM. LoadConfig reads SCHEDULED_TASK_SECRET

// directly which is fine for local dev where the env carries the

// plaintext, but in prod cfg.ScheduledTaskSecret is the authoritative

// value (resolved by resolveScheduledTaskSecret upstream).

if saCfg.Mode == scheduledauth.ModeBearer {

saCfg.Bearer = cfg.ScheduledTaskSecret

}

return scheduledauth.New(saCfg)

func buildScheduledAuth(cfg ApplicationConfig) (*scheduledauth.Validator, error) {

if strings.TrimSpace(os.Getenv(scheduledauth.EnvAuthMode)) == "" {

return nil, fmt.Errorf(

"%w: %s must be explicitly set (use disabled only for local dev)",

scheduledauth.ErrConfigInvalid,

scheduledauth.EnvAuthMode,

)

}

saCfg, err := scheduledauth.LoadConfig(envSourceOS{})

if err != nil {

return nil, err

}

// In bearer mode, override the env-supplied secret with the one

// already resolved from KV / SM. LoadConfig reads SCHEDULED_TASK_SECRET

// directly which is fine for local dev where the env carries the

// plaintext, but in prod cfg.ScheduledTaskSecret is the authoritative

// value (resolved by resolveScheduledTaskSecret upstream).

if saCfg.Mode == scheduledauth.ModeBearer {

saCfg.Bearer = cfg.ScheduledTaskSecret

}

return scheduledauth.New(saCfg)

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/server/app.go` around lines 295 - 308, buildScheduledAuth currently treats a missing SCHEDULED_TASK_AUTH_MODE as "disabled" and ends up constructing a permissive validator; change buildScheduledAuth to detect when LoadConfig returns a missing/empty mode (or ModeDisabled) and return a non-nil error instead of continuing, so the process fails startup on missing auth wiring. Specifically, after scheduledauth.LoadConfig(envSourceOS{}) validate that saCfg.Mode is set and not the default disabled sentinel (check against the empty value or scheduledauth.ModeDisabled) before applying the ModeBearer secret override and calling scheduledauth.New(saCfg); if the mode is missing/disabled return an error explaining SCHEDULED_TASK_AUTH_MODE must be configured.

cristim · 2026-04-29T21:17:47Z

Addressed the latest CodeRabbit subject-claim wording finding in commit 1442231: the OIDC validator comment now explicitly says SCHEDULED_TASK_OIDC_SUBJECTS expects service-account unique IDs, not emails.\n\nI am intentionally not changing buildScheduledAuth to fail on disabled/missing mode in this pass. That would reverse the already-discussed local-dev/no-scheduler design: Terraform currently emits disabled when no scheduler resources exist, and prior review discussion accepted the explicit disabled+WARN behavior. Production scheduler paths still set oidc/bearer explicitly and fail fast on invalid OIDC/bearer config.\n\n@coderabbitai review

coderabbitai · 2026-04-29T21:18:06Z

@cristim: Acknowledged on both points.

SCHEDULED_TASK_OIDC_SUBJECTS wording fix (commit 14422312f) — Correct call. The JWT sub claim for a GCP service account is always the numeric unique ID, not the email (email goes in the email claim). Documenting this explicitly in the comment prevents a footgun where an operator might paste the SA email and wonder why every token is rejected.
buildScheduledAuth fail-fast on disabled/missing mode (dismissed) — Justification holds and is consistent with the previous pass discussion. Terraform-managed deployments always set the mode explicitly; the disabled+WARN path is exclusively the local-dev/no-scheduler codepath, and the per-request WARN log acts as a sufficient signal without penalising contributors who haven't set the env var locally.

Triggering a fresh incremental review against the updated commits now.

✅ Actions performed

Review triggered.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/server/scheduledauth/validator.go`:
- Around line 132-160: Add a fail-fast validation for the JWKS URL during
validator construction: implement a helper validateAbsoluteURL(raw, label
string) error that parses the URL (using url.Parse) and returns ErrConfigInvalid
if parsing fails or scheme/host are empty, call validateAbsoluteURL(cfg.JWKSURL,
"SCHEDULED_TASK_OIDC_JWKS_URL") before assigning v.jwksURL or creating v.keySet,
and return the validation error so malformed JWKS URLs abort startup instead of
deferring to oidc.NewRemoteKeySet.

---

Nitpick comments:
In `@internal/server/scheduledauth/validator.go`:
- Around line 231-246: The warmup currently only checks HTTP reachability;
instead, after receiving resp for the JWKS fetch (resp variable from
http.DefaultClient.Do for v.jwksURL), decode the response body as JSON and
validate it has a top-level "keys" field that is a JSON array (use json.Decoder
or json.Unmarshal and assert the "keys" value is a slice/array and non-nil); if
decoding fails or the shape is wrong log a WARN including v.jwksURL and the
decode/validation error and return so the validator will retry later.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a67e1bcb-251b-4041-bded-7f4e89b43b3d

📥 Commits

Reviewing files that changed from the base of the PR and between 084c755 and 1442231.

📒 Files selected for processing (1)

internal/server/scheduledauth/validator.go

cristim · 2026-04-29T21:35:18Z

Addressed the latest JWKS validation comments: OIDC validator construction now rejects malformed/non-absolute JWKS URLs, and warmup now validates the fetched payload has a top-level keys array before considering the probe healthy. Added tests for malformed JWKS URLs and JWKS payload shape validation. Verification: go test ./internal/server/scheduledauth ./internal/server; pre-commit Go hooks passed. @coderabbitai review

coderabbitai · 2026-04-29T22:31:46Z

@cristim: Acknowledged — Pass 4 is a strong set of changes. Here's a quick assessment before the fresh review runs:

Major fixes — both sound:

LoadConfig fail-closed on unset mode — This is the right call. Flipping the unset-mode behaviour from disabled+WARN to ErrConfigInvalid closes the gap where a missed Terraform/K8s env wiring would silently boot /api/scheduled/* unauthenticated. "disabled" now requires an explicit opt-in, which is the correct security posture for a production-grade service.
GKE scheduled_task_auth_mode derivation + var.scheduled_task_auth_mode_override — The length(google_service_account.scheduler) > 0 ? "oidc" : var.scheduled_task_auth_mode_override pattern with a fail-closed "oidc" default and validation block is exactly the right Terraform-idiomatic approach. Disabling the scheduler no longer implicitly opens an unauthenticated endpoint; a deliberate "disabled" override is required, which will be visible in any code review of the tfvars.

Nit addressed — testSchedulerSubject as 21-digit numeric:
Correct production contract. GCP service account ID tokens carry the SA's numeric unique ID in sub, not the email (email appears in the email claim). The test suite now exercises the real-world claim shape.

Dismissed nit — Azure cron string:
Justification holds. The cron string is the module's input contract; split(" ", var.recommendation_schedule)[1] at scheduled-tasks.tf:23 parses it at the Terraform layer. Switching the example to frequency/interval would silently break every existing .tfvars file that feeds the same variable. Leaving it as-is is correct.

TestNewApplicationFromDeps — SCHEDULED_TASK_AUTH_MODE=disabled:
Correct test-infra fix. The explicit env var is now required given the fail-closed LoadConfig, and setting it to "disabled" is the appropriate test affordance for a unit test that isn't exercising the auth subsystem.

Triggering a fresh incremental review against commit 97253fd9d now.

✅ Actions performed

Review triggered.

…esidual) Commit 97253fd introduced testSchedulerSubject and converted the positive-path scheduler subjects in validator_test.go to it, but left two clusters of email-shaped fixtures behind: - validator_test.go:282 — the "wrong subject" rejection test still asserted against `attacker@example.iam.gserviceaccount.com`. - integration_test.go (lines 36, 71, 82, 99, 110) — every subject in the integration suite was still email-shaped. The CR nitpick was not just about the positive scheduler fixture: the broader concern is that production wiring pins SCHEDULED_TASK_OIDC_SUBJECTS to GCP service-account unique_ids (21-digit numeric strings), and a regression that silently widened the validator to accept any string would still pass tests that asserted against email-shaped values — both negative and positive paths. This commit: 1. Adds testNonSchedulerSubject ("999888777666555444333") next to the existing testSchedulerSubject. Same shape, distinct numeric so failures are visually traceable. 2. Replaces every remaining email-shaped subject across both test files with the appropriate constant. After this change `grep -E "scheduler@|attacker@|random@|gserviceaccount"` finds zero hits in `internal/server/scheduledauth/`. Tests: `go test ./internal/server/scheduledauth/...` clean; `go test ./...` clean across the full root suite (no other package referenced the email fixtures).

cristim · 2026-04-29T22:37:05Z

Finished the unique_id fixture migration that 97253fd started: testNonSchedulerSubject covers the rejection-path subjects, and every email-shaped fixture in both validator_test.go (the leftover at line 282) and integration_test.go (5 sites) is now a 21-digit numeric to match what production wiring pins SCHEDULED_TASK_OIDC_SUBJECTS to. grep -E 'scheduler@|attacker@|random@|gserviceaccount' in internal/server/scheduledauth/ is now empty.

@coderabbitai review

coderabbitai · 2026-04-29T22:37:22Z

@cristim: Acknowledged — the fixture migration is now complete and consistent across both test files:

validator_test.go line 282 leftover — replaced with 21-digit numeric, so the full unit suite (both happy-path and rejection-path subjects) now exercises the real GCP SA sub contract.
integration_test.go (5 sites) — same alignment; the full HTTP roundtrip integration suite no longer carries any email-shaped subjects that would diverge from what SCHEDULED_TASK_OIDC_SUBJECTS pins in production.
testNonSchedulerSubject — good explicit name for the rejection-path fixture; makes the intent of the negative test cases self-documenting.

The grep clean result confirms no stragglers remain. Triggering a fresh incremental review now.

✅ Actions performed

Review triggered.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/server/scheduledauth/validator_test.go`:
- Around line 409-416: The /swap POST request in the test (constructed as swap
:= http.NewRequest(..., srv.URL+"/swap", strings.NewReader(string(jwks(keyB))))
and executed via http.DefaultClient.Do(swap)) should be asserted for success
immediately after the call: check resp.StatusCode is 200 (or expected status)
and fail the test with a helpful message including the response body when it's
not, so a failed swap surfaces as a clear test failure instead of later causing
token-validation errors; read resp.Body into a string for the failure message
before closing it.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dce4a0c8-fa37-4a70-a230-cb0b2fe42265

📥 Commits

Reviewing files that changed from the base of the PR and between 6f22b9d and 97253fd.

📒 Files selected for processing (5)

internal/server/app_test.go
internal/server/scheduledauth/config.go
internal/server/scheduledauth/validator_test.go
terraform/modules/compute/gcp/gke/main.tf
terraform/modules/compute/gcp/gke/variables.tf

coderabbitai

🧹 Nitpick comments (1)

terraform/modules/compute/gcp/gke/main.tf (1)

36-37: ⚡ Quick win

Bind aud to the receiving endpoint, not the scheduler service account.

sub is already pinned to the scheduler SA unique_id, so using the same SA identity again for aud makes both checks describe the caller. Setting aud to the scheduled endpoint URI keeps the token recipient-bound and reduces cross-service replay if this SA is ever reused.

♻️ Suggested change

-  scheduled_task_oidc_audience = try(google_service_account.scheduler[0].email, "")
+  scheduled_task_oidc_audience = "${var.app_url}/api/scheduled/recommendations"
...
     oidc_token {
       service_account_email = google_service_account.scheduler[0].email
-      audience              = google_service_account.scheduler[0].email
+      audience              = local.scheduled_task_oidc_audience
     }

Also applies to: 480-483, 714-727

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@terraform/modules/compute/gcp/gke/main.tf` around lines 36 - 37, The audience
claim is currently set to the scheduler service account email
(scheduled_task_oidc_audience = try(google_service_account.scheduler[0].email,
"")) which makes both aud and sub refer to the caller; change
scheduled_task_oidc_audience to the receiving endpoint's URI (e.g., use the
variable or resource that represents the scheduled endpoint URL) so aud is the
endpoint and keep scheduled_task_oidc_subject =
try(google_service_account.scheduler[0].unique_id, "") as the caller identity;
apply the same replacement for the other occurrences noted (around the other
blocks at lines you referenced).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@terraform/modules/compute/gcp/gke/main.tf`:
- Around line 36-37: The audience claim is currently set to the scheduler
service account email (scheduled_task_oidc_audience =
try(google_service_account.scheduler[0].email, "")) which makes both aud and sub
refer to the caller; change scheduled_task_oidc_audience to the receiving
endpoint's URI (e.g., use the variable or resource that represents the scheduled
endpoint URL) so aud is the endpoint and keep scheduled_task_oidc_subject =
try(google_service_account.scheduler[0].unique_id, "") as the caller identity;
apply the same replacement for the other occurrences noted (around the other
blocks at lines you referenced).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c38c1dac-b84e-4f73-9612-15dfb7fdc518

📥 Commits

Reviewing files that changed from the base of the PR and between 6f22b9d and 5193227.

📒 Files selected for processing (6)

internal/server/app_test.go
internal/server/scheduledauth/config.go
internal/server/scheduledauth/integration_test.go
internal/server/scheduledauth/validator_test.go
terraform/modules/compute/gcp/gke/main.tf
terraform/modules/compute/gcp/gke/variables.tf

🚧 Files skipped from review as they are similar to previous changes (3)

internal/server/app_test.go
terraform/modules/compute/gcp/gke/variables.tf
internal/server/scheduledauth/integration_test.go

Addresses CodeRabbit's actionable on validator_test.go:418-424: the JWKS-rotation test fired POST /swap and only checked the network-error path. If the test server's /swap handler 5xx'd, or short-read the body and returned non-200 silently, the JWKS would not actually swap and the test would later fail at "unknown kid" — pointing at the wrong layer. Fixes: - Hoist `jwks(keyB)` into a local so we don't compute it three times for body+ContentLength+failure-message symmetry. - Surface NewRequest's error instead of silently dropping it via `_, _ := http.NewRequest(...)`. - Use `http.MethodPost` + `http.StatusOK` instead of stringly- typed literals. - On non-200, read the response body and t.Fatalf with both status and body so the failure points at the swap server, not the downstream validator. Tests: TestValidate_OIDC_KeyRotation_RefreshOnUnknownKid passes; `go test ./...` clean across the full root suite.

cristim · 2026-04-29T22:46:09Z

Pass: addressed the actionable on validator_test.go:418-424 — the JWKS-rotation /swap POST now (a) propagates http.NewRequest's error, (b) asserts resp.StatusCode == 200, and (c) on non-200 reads the body and t.Fatalfs with both status and body so a swap-handler regression points at the swap, not the downstream validator. Full suite clean.

@coderabbitai review

coderabbitai · 2026-04-29T22:46:25Z

@cristim: Acknowledged — all three sub-items on the /swap hardening are the right defensive improvements:

http.NewRequest error propagation — previously a request-construction failure would be silently swallowed, making the test non-deterministically corrupt without a clear signal.
resp.StatusCode == 200 assertion — catches swap-handler regressions before they produce misleading downstream validator failures.
Body-inclusive t.Fatalf on non-200 — makes the failure attribution unambiguous: the message now points at the swap step, not at whatever token validation error surfaces later.

Triggering a fresh incremental review against the updated commits now.

✅ Actions performed

Review triggered.

coderabbitai

🧹 Nitpick comments (1)

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/server/scheduledauth/validator_test.go`:
- Around line 750-766: The test currently only ensures Warmup does not hang but
doesn't assert Warmup actually attempted the JWKS fetch; modify
TestWarmup_LoggedAndNonFatal_OnDeadEndpoint to capture and assert the warning
log produced by Warmup on fetch failure: create or inject a test logger/ buffer
before calling New (or replace v's logger after New if New returns a validator
with a logger field), call v.Warmup(ctx) as before, then read the captured logs
and assert they contain a warning/error message referencing the JWKSURL or JWKS
fetch failure (or the error string like ECONNREFUSED) to prove the failure path
was exercised rather than returning early. Ensure you still use the short
context timeout while asserting the log contents after Warmup returns.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8c1256af-2505-46fb-a8f6-925a8850dc39

📥 Commits

Reviewing files that changed from the base of the PR and between 5193227 and e993d67.

📒 Files selected for processing (1)

internal/server/scheduledauth/validator_test.go

…A (CR pass) Addresses CodeRabbit's nitpick on `gke/main.tf:36-37` (also-applies to :480-483 and :714-727), and applies the same convention to the sibling `cloud-run/main.tf` so the two GCP compute modules stay aligned mid-PR. Old shape (both modules): audience = google_service_account.scheduler[0].email # JWT claim SCHEDULED_TASK_OIDC_AUDIENCE = same SA email `sub` already pins the caller to the scheduler SA's unique_id, so using the same SA identity for `aud` made both checks describe the caller. If the scheduler SA were ever reused for a different service, an OIDC token minted for service A could be replayed against service B because both services would have the same audience in their allow-list. New shape (audience names the receiving endpoint, not the caller): GKE: audience = ${var.app_url}/api/scheduled/recommendations Cloud Run: audience = ${var.service_name}/api/scheduled/<endpoint> The validator side already supports comma-separated audiences via `cleanSet` / `audienceMatches` (no app-side change needed). Both sides read the same local, so the JWT claim and the env-var match by construction. Why service_name (not URL) on Cloud Run: Using `google_cloud_run_v2_service.main.uri` produces a Terraform cycle — the service's env vars include the audience whitelist, which would then depend on the service's own URI. `service_name` is a module input, fully resolvable at plan time, and any unique-per-(service, endpoint) string satisfies the OIDC recipient-bound contract — the validator only does string equality against the env-var value, which is built from the same local. GKE has no such cycle (kubernetes_deployment doesn't read its own URL into env), so it uses `var.app_url` per CR's literal suggestion. Per-module specifics: * gke/main.tf — single endpoint; one local, three callsites (the local + the env var + the Cloud Scheduler oidc_token block) all read from `local.scheduled_task_oidc_audience`. * cloud-run/main.tf — two endpoints (recommendations, ri-exchange); new per-schedule locals `scheduled_task_recommendations_audience` and `scheduled_task_ri_exchange_audience` build from `var.service_name`. The CSV `oidc_audiences` local that flows into the env-var compacts the two enabled values. Each Cloud Scheduler oidc_token reads its matching local. Verified: `terraform validate` clean on both modules; pre-commit hooks (`terraform_fmt`, `terraform_validate`, `tflint`) pass; `go test ./...` clean across the full root suite.

Addresses CodeRabbit's nitpick on `validator_test.go:750-766`: `TestWarmup_LoggedAndNonFatal_OnDeadEndpoint` previously only asserted that Warmup didn't hang. The test would still pass if a future regression made Warmup short-circuit *before* the JWKS fetch ever ran — exactly the failure mode the test exists to guard against ("non-fatal, but actually exercises the failure branch"). This commit captures the global standard logger to a `bytes.Buffer` for the duration of the test, calls Warmup with a guaranteed-bad JWKS URL (127.0.0.1:1, ECONNREFUSED), and asserts the captured output contains "JWKS warmup". That string is shared by every warmup-failure log line in `validator.go` (request build, fetch, non-200 status, invalid payload), so a tweak to any of those messages won't break this test for cosmetic reasons — but a regression that skips the fetch entirely will. Safety: no test in `internal/server/scheduledauth/` uses `t.Parallel()`, so swapping the global `log` writer is race-free. `t.Cleanup` restores the original writer. Tests: `go test ./internal/server/scheduledauth/... -run Warmup` shows the new assertion firing on success; `go test ./...` clean across the full root suite.

cristim · 2026-04-29T23:27:16Z

Pass: addressed both nitpicks from the most recent reviews.

aud claim → endpoint, not SA (gke + cloud-run): sub already pins the caller via SA unique_id, so reusing the SA email for aud made both checks describe the caller. Now aud names the recipient — ${var.app_url}/api/scheduled/recommendations on GKE, ${var.service_name}/api/scheduled/<endpoint> on Cloud Run. (Cloud Run can't use the service URL because it'd cycle back through the env vars; the service_name-prefixed path satisfies the recipient-bound contract since the validator only does string equality against the env-var built from the same local.) Applied to both modules so they don't drift mid-PR.
Warmup test asserts the failure-path log: TestWarmup_LoggedAndNonFatal_OnDeadEndpoint now captures the global log writer and asserts JWKS warmup appears in the output, so a regression that short-circuits before the fetch fails the test instead of silently passing.

terraform validate clean on both modules; go test ./... clean.

@coderabbitai review

coderabbitai · 2026-04-29T23:27:44Z

@cristim: Acknowledged — both fixes are sound and close the remaining gaps cleanly.

aud → endpoint, not SA (GKE + Cloud Run)

This is the semantically correct fix per RFC 7519 §4.1.3: aud identifies the intended recipient of the token, not the issuer. With the previous wiring, both sub (SA unique_id) and aud (SA email) described the caller, leaving the recipient dimension of the token completely unvalidated — an attacker who obtained a scheduler-signed token for a different CUDly environment could replay it. Pinning aud to the endpoint URL (${var.app_url}/api/scheduled/recommendations on GKE, ${var.service_name}/api/scheduled/<endpoint> on Cloud Run) makes the token environment-and-route-bound. The Cloud Run substitution with service_name-prefixed path is a reasonable workaround for the circular URL dependency, and since both the scheduler wiring and the validator env var read from the same local, string equality holds correctly.

Warmup log assertion

Capturing the global log writer and asserting JWKS warmup appears in output is the right defensive move. Previously a regression that panicked or short-circuited before the HTTP fetch would pass the test silently; now it fails with a clear signal pointing at the warmup path rather than a downstream validator error.

Triggering a fresh incremental review now.

✅ Actions performed

Review triggered.

coderabbitai

🧹 Nitpick comments (2)

terraform/modules/compute/gcp/gke/main.tf (2)

22-41: ⚡ Quick win

Clean up the stale audience explanation in this comment block.

The first paragraph still describes the old SA-email audience model, while the later paragraph and Line 42 correctly describe the new endpoint-bound audience. In this auth-critical section, keeping both stories in the same comment is misleading.

Suggested cleanup

-  # Scheduled-task OIDC validator wiring — see
-  # internal/server/scheduledauth. Audience pinned to the scheduler SA
-  # email (matches the explicit `audience` on the oidc_token block);
-  # subject pinned to the SA's unique numeric ID (the value Google puts
+  # Scheduled-task OIDC validator wiring — see
+  # internal/server/scheduledauth. Audience pinned to the receiving
+  # endpoint URL; subject pinned to the scheduler SA's unique numeric ID
+  # (the value Google puts
   # in the JWT `sub` claim for SA-signed ID tokens).
@@
-  # OAuth/OIDC convention. (CodeRabbit nitpick on PR `#161`.)
+  # OAuth/OIDC convention.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@terraform/modules/compute/gcp/gke/main.tf` around lines 22 - 41, The comment
block describing scheduled-task OIDC validator contains stale text about an
SA-email-based audience; update it to only describe the endpoint-bound audience
model and remove the old SA-email audience paragraph so the explanation is
consistent: keep the notes about auth mode being derived from the scheduler SA
presence (referencing var.enable_scheduled_tasks and
var.scheduled_task_auth_mode_override), retain the explanation that `sub` pins
the scheduler SA unique_id, and explicitly state that `aud` is the receiving
endpoint URL (not the SA identity) per the oidc_token usage and
kubernetes_ingress_v1.app behavior; delete the misleading lines describing the
old SA-email audience model and ensure the remaining text clearly ties audience
-> endpoint and subject -> SA unique_id.

718-735: ⚡ Quick win

Reuse the shared local for uri as well.

The comment says the audience and target match “by construction”, but Line 718 still duplicates the endpoint string separately. If one side changes without the other, Scheduler will mint tokens for the wrong aud and this job will start failing auth.

Suggested change

   http_target {
     http_method = "POST"
-    uri         = "${var.app_url}/api/scheduled/recommendations"
+    uri         = local.scheduled_task_oidc_audience
@@
     oidc_token {
       service_account_email = google_service_account.scheduler[0].email
       audience              = local.scheduled_task_oidc_audience
     }
   }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@terraform/modules/compute/gcp/gke/main.tf` around lines 718 - 735, Replace
the hard-coded uri assignment so the Scheduler's target URL is drawn from the
shared local instead of duplicating the string: change the line setting uri =
"${var.app_url}/api/scheduled/recommendations" to use
local.scheduled_task_oidc_audience (or the existing local that holds the
canonical scheduled endpoint), ensuring the uri and the oidc_token.audience
(local.scheduled_task_oidc_audience) remain identical; update the resource block
containing uri and the oidc_token { service_account_email =
google_service_account.scheduler[0].email ... } accordingly so the audience and
target stay in sync.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@terraform/modules/compute/gcp/gke/main.tf`:
- Around line 22-41: The comment block describing scheduled-task OIDC validator
contains stale text about an SA-email-based audience; update it to only describe
the endpoint-bound audience model and remove the old SA-email audience paragraph
so the explanation is consistent: keep the notes about auth mode being derived
from the scheduler SA presence (referencing var.enable_scheduled_tasks and
var.scheduled_task_auth_mode_override), retain the explanation that `sub` pins
the scheduler SA unique_id, and explicitly state that `aud` is the receiving
endpoint URL (not the SA identity) per the oidc_token usage and
kubernetes_ingress_v1.app behavior; delete the misleading lines describing the
old SA-email audience model and ensure the remaining text clearly ties audience
-> endpoint and subject -> SA unique_id.
- Around line 718-735: Replace the hard-coded uri assignment so the Scheduler's
target URL is drawn from the shared local instead of duplicating the string:
change the line setting uri = "${var.app_url}/api/scheduled/recommendations" to
use local.scheduled_task_oidc_audience (or the existing local that holds the
canonical scheduled endpoint), ensuring the uri and the oidc_token.audience
(local.scheduled_task_oidc_audience) remain identical; update the resource block
containing uri and the oidc_token { service_account_email =
google_service_account.scheduler[0].email ... } accordingly so the audience and
target stay in sync.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6d152ee8-6227-48ca-bd84-a52fa11e358b

📥 Commits

Reviewing files that changed from the base of the PR and between e993d67 and 171eea5.

📒 Files selected for processing (3)

internal/server/scheduledauth/validator_test.go
terraform/modules/compute/gcp/cloud-run/main.tf
terraform/modules/compute/gcp/gke/main.tf

🚧 Files skipped from review as they are similar to previous changes (2)

internal/server/scheduledauth/validator_test.go
terraform/modules/compute/gcp/cloud-run/main.tf

…int local (CR pass) Two CR nitpicks on `gke/main.tf`: 1. **Line 22-41 stale comment**: the comment block introducing the scheduled_task_oidc_* locals still described the OLD SA-email audience model in its first paragraph and then contradicted itself in its second. Rewritten as a single coherent explanation: `aud` names the recipient (endpoint URL), `sub` names the caller (SA unique_id, not email — the form Google actually puts in the JWT `sub` claim). 2. **Lines 718-735 string duplication**: the Cloud Scheduler `http_target.uri` and `oidc_token.audience` were both typed as `${var.app_url}/api/scheduled/recommendations`, i.e. the same string twice. CR's concern: if one side changed without the other, Cloud Scheduler would mint tokens with `aud` for the old endpoint while POSTing to the new one — silent 401. Both now read `local.scheduled_task_oidc_audience`, so a future endpoint rename becomes a syntax-level mismatch rather than a runtime auth failure. Note: the same collapse does NOT apply to `cloud-run/main.tf` because on that side the `uri` (real Cloud Run service URL, `google_cloud_run_v2_service.main.uri/...`) and the `audience` (service_name-prefixed string) are intentionally different values — the audience uses `var.service_name` to break the Terraform dependency cycle that the URL would create. Different values by design, no drift risk. Verified: `terraform validate` clean on the gke module; `go test ./...` clean across the root suite.

cristim · 2026-04-29T23:35:38Z

Pass: addressed both gke nitpicks.

Stale comment block (gke/main.tf:22-41): the introductory comment had its old SA-email-audience paragraph still living next to the new endpoint-bound paragraph. Rewritten as a single coherent explanation — aud names the recipient (endpoint URL), sub names the caller (SA unique_id, not email).
String duplication (gke/main.tf:718-735): http_target.uri and oidc_token.audience both read local.scheduled_task_oidc_audience now, so a future endpoint rename is a syntax mismatch rather than a silent 401.

The same collapse intentionally does NOT apply to cloud-run/main.tf — there uri (real Cloud Run service URL) and audience (service_name-prefixed string used to break the TF cycle) are different values by design, no drift risk.

@coderabbitai review

coderabbitai · 2026-04-29T23:36:01Z

@cristim: Acknowledged — both GKE nitpicks are cleanly resolved:

Comment rewrite (gke/main.tf:22-41) — Collapsing the old SA-email-audience paragraph into a single coherent explanation removes the misleading historical context and makes the aud = recipient / sub = caller contract unambiguous for any future reader of the Terraform.
local.scheduled_task_oidc_audience collapse (gke/main.tf:718-735) — Having both http_target.uri and oidc_token.audience read from the same local means a future endpoint rename is a guaranteed compile-time catch (Terraform plan surfaces the mismatch in both places at once) rather than a silent validator rejection that would require tracing through two separate string literals.

The cloud-run carve-out is correct — uri (actual Cloud Run service URL) and audience (service_name-prefixed string that breaks the Terraform reference cycle) are semantically distinct values and must not be collapsed. The drift risk is asymmetric: on cloud-run a divergence would surface as an OIDC validation rejection; on GKE the single local eliminates that class of error entirely.

Triggering a fresh incremental review now.

✅ Actions performed

Review triggered.

…arsing (CR pass) Addresses CodeRabbit's open Major on `dev.tfvars.example:90`: the example was tagged "# Cron: Daily at 2 AM" which read as "full cron syntax supported". A user copying it and changing to e.g. `"*/15 * * * *"` would silently get a daily-at-00:00 schedule instead, because the consuming code in `modules/compute/azure/container-apps/scheduled-tasks.tf:23` only splits on spaces and reads index [1] (the hour field). CR's literal suggestion was to switch the example to a frequency/ interval form. Doing that would diverge the Azure tfvars files from the AWS/GCP ones (which all use the same cron string) and would require a parser change. Both are bigger than the actual finding. The minimal fix is to be honest about the limitation in the docs. Three sites updated: 1. `terraform/environments/azure/dev.tfvars.example:78-90` — replaced the `# Cron: Daily at 2 AM` one-liner with a comment block explicitly stating that only the hour field is honored, pointing at the parser, and warning that sub-daily / DoW / DoM patterns are silently ignored. 2. `terraform/environments/azure/variables.tf:380-390` — variable description rewritten to the same effect, so `terraform plan` output and module docs reflect reality. 3. `terraform/modules/compute/azure/container-apps/variables.tf:225-235` — module-level variable description updated symmetrically so module consumers get the same warning. No behavioural change. Verified `terraform validate` clean on `terraform/environments/azure` and `terraform/modules/compute/azure/container-apps`. If finer-grained scheduling becomes a real requirement, the follow-up is to extend the parser and switch the variables to something like `recommendation_schedule_frequency` + `_interval` / `_start_time`, then drop the cron-shaped string. That's intentionally out of scope here.

cristim · 2026-04-29T23:55:48Z

Addressed the open Major on dev.tfvars.example:90 (the only un-resolved finding from earlier reviews).

The variable's parser at scheduled-tasks.tf:23 only reads the hour field (split(" ", ...)\[1\]), so the example tagged # Cron: Daily at 2 AM was misleading: a user copying */15 * * * * would silently get daily-at-midnight. CR's literal suggestion was to switch to a frequency/interval form, but that diverges Azure tfvars from the AWS/GCP shape (all cron-string today) and needs a parser change — bigger than the finding warrants. Documented the hour-only limitation in three places (dev.tfvars.example, the env-level variable description, the module-level variable description) so the misleading framing is gone without changing the value or behavior. terraform validate clean on both Azure modules.

That should clear the actionable backlog from the earlier reviews. The CR-as-comments thread on the previous push acknowledged both gke nitpicks were resolved cleanly.

@coderabbitai review

coderabbitai · 2026-04-29T23:56:16Z

@cristim: The three-place documentation fix is a pragmatic and well-scoped resolution. Rather than a disruptive parser change, surfacing the hour-only limitation at the example, env-level variable, and module-level variable description removes the misleading framing precisely where a user would encounter it.

Worth noting: a retrieved learning from a prior review on this repo confirms that azurerm_logic_app_trigger_recurrence only accepts discrete frequency/interval parameters and does not accept NCRONTAB format — so the literal CR suggestion (switch to frequency/interval) would actually require a non-trivial cron→frequency/interval translation layer in the module, which validates your decision to document rather than restructure.

Triggering a fresh incremental review now.

🧠 Learnings used

Learnt from: cristim
Repo: LeanerCloud/CUDly PR: 105
File: terraform/modules/compute/azure/cleanup-function/variables.tf:0-0
Timestamp: 2026-04-28T00:07:07.398Z
Learning: In the CUDly project, `azurerm_logic_app_trigger_recurrence` (used in `terraform/modules/compute/azure/cleanup-function/main.tf`) does NOT accept NCRONTAB format. It only accepts discrete `frequency` (e.g. "Day") and `interval` (e.g. 1) parameters. Supporting a NCRONTAB-style `var.schedule` would require a NCRONTAB→frequency/interval translation layer, which is non-trivial feature work.

✅ Actions performed

Review triggered.

PR #161 made SCHEDULED_TASK_AUTH_MODE required at startup but the env var was only wired on GCP and Azure. AWS Lambda was missed and the post-merge deploy crashed with: scheduledauth: invalid configuration: SCHEDULED_TASK_AUTH_MODE is unset (set to oidc, bearer, or disabled) (See https://github.com/LeanerCloud/CUDly/actions/runs/25148590105 .) The "obvious" fix — wire SCHEDULED_TASK_AUTH_MODE=disabled to mirror the pre-#161 reality where the bearer check was a no-op when the secret was empty — is wrong. ModeDisabled lets every /api/scheduled/* request through, and the Lambda Function URL has authorization_type = NONE (the same comment in this module's main.tf explicitly notes the URL is public). That would ship a production auth bypass. Instead, run AWS in bearer mode with the secret resolved from Secrets Manager at runtime — same model Azure already uses. Changes: * terraform/modules/secrets/aws — provision a random_password + aws_secretsmanager_secret + version for the scheduled-task bearer (gated by var.create_scheduled_task_secret, default true). Include it in the secret_read policy and expose ARN + name outputs. * terraform/modules/compute/aws/lambda — set SCHEDULED_TASK_AUTH_MODE=bearer + SCHEDULED_TASK_SECRET_NAME=var.scheduled_task_secret_name on the Lambda env block. Add the secret ARN to the secrets_access IAM policy. Add the corresponding scheduled_task_secret_arn / scheduled_task_secret_name variables. * terraform/modules/compute/aws/fargate — same wiring on the task definition env block + IAM policy + variables. Fargate isn't the default platform but should not regress when selected. * terraform/environments/aws/compute.tf — pass module.secrets.scheduled_task_secret_arn / _name into both compute_lambda and compute_fargate. Runtime path (already in place from #161): app.go reads SCHEDULED_TASK_SECRET_NAME, resolveScheduledTaskSecret fetches the plaintext from AWS Secrets Manager via the SecretResolver, and buildScheduledAuth overrides saCfg.Bearer with that value before calling scheduledauth.New. The plaintext never lives in Lambda env or Terraform state. Validation: terraform validate clean for all four touched modules. Once deployed, /api/scheduled/* on the public Lambda URL will reject every request that doesn't carry the resolved bearer secret. AWS schedulers (EventBridge -> direct Lambda invocation) bypass the HTTP middleware entirely, so they are unaffected.

) PR #161 made SCHEDULED_TASK_AUTH_MODE required at startup but the env var was only wired on GCP and Azure. AWS Lambda was missed and the post-merge deploy crashed with: scheduledauth: invalid configuration: SCHEDULED_TASK_AUTH_MODE is unset (set to oidc, bearer, or disabled) (See https://github.com/LeanerCloud/CUDly/actions/runs/25148590105 .) The "obvious" fix — wire SCHEDULED_TASK_AUTH_MODE=disabled to mirror the pre-#161 reality where the bearer check was a no-op when the secret was empty — is wrong. ModeDisabled lets every /api/scheduled/* request through, and the Lambda Function URL has authorization_type = NONE (the same comment in this module's main.tf explicitly notes the URL is public). That would ship a production auth bypass. Instead, run AWS in bearer mode with the secret resolved from Secrets Manager at runtime — same model Azure already uses. Changes: * terraform/modules/secrets/aws — provision a random_password + aws_secretsmanager_secret + version for the scheduled-task bearer (gated by var.create_scheduled_task_secret, default true). Include it in the secret_read policy and expose ARN + name outputs. * terraform/modules/compute/aws/lambda — set SCHEDULED_TASK_AUTH_MODE=bearer + SCHEDULED_TASK_SECRET_NAME=var.scheduled_task_secret_name on the Lambda env block. Add the secret ARN to the secrets_access IAM policy. Add the corresponding scheduled_task_secret_arn / scheduled_task_secret_name variables. * terraform/modules/compute/aws/fargate — same wiring on the task definition env block + IAM policy + variables. Fargate isn't the default platform but should not regress when selected. * terraform/environments/aws/compute.tf — pass module.secrets.scheduled_task_secret_arn / _name into both compute_lambda and compute_fargate. Runtime path (already in place from #161): app.go reads SCHEDULED_TASK_SECRET_NAME, resolveScheduledTaskSecret fetches the plaintext from AWS Secrets Manager via the SecretResolver, and buildScheduledAuth overrides saCfg.Bearer with that value before calling scheduledauth.New. The plaintext never lives in Lambda env or Terraform state. Validation: terraform validate clean for all four touched modules. Once deployed, /api/scheduled/* on the public Lambda URL will reject every request that doesn't carry the resolved bearer secret. AWS schedulers (EventBridge -> direct Lambda invocation) bypass the HTTP middleware entirely, so they are unaffected.

…205) (#206) * fix(db): idempotently restore retry_execution_id column dropped by migration drift (closes #204, closes #205) Production /api/purchases/* surfaced 500 errors after PR #161 deployed, with the underlying CloudWatch error: ERROR: column "retry_execution_id" does not exist (SQLSTATE 42703) User-facing symptoms (dashboard "Upcoming Scheduled Purchases" panel): * "Failed to load purchase details: Internal server error" (#205) * "Failed to cancel purchase" (#204) Both buttons hit `Handler.GetExecutionByID` whose SELECT references the retry_execution_id and retry_attempt_n columns introduced by PR #168. Root cause: migration version drift. * PR #189 added 000042_recommendations_add_engine_to_key earlier on the same day. Deployed Lambdas applied it as version 42. * PR #168 then added 000042_purchase_executions_retry_linkage, colliding on version 42. * PR #195 resolved the on-disk duplicate by renaming RECOMMENDATIONS to 000043 — the wrong direction. Deployed DBs already had schema_migrations.version=42 marked as the engine content, so on the next migrate.Up() they: - skipped on-disk 000042 (now retry_linkage) entirely (DB version >= 42 already) - applied on-disk 000043 (now engine again) as a no-op via the ADD COLUMN IF NOT EXISTS guards already in that file Net effect: the retry_execution_id / retry_attempt_n columns were never added on prod, but the Go code expects them. Forward-only corrective. The new 000044 replays 000042's schema additions wrapped in IF NOT EXISTS / CREATE INDEX IF NOT EXISTS so it's a no-op on any DB that already has the columns (fresh deploys, manually-fixed envs) and additive on the broken ones. We don't try to repair schema_migrations history — golang-migrate only cares about the high-water version, and we can't tell from outside whether a given environment skipped 42 or applied 43 twice. The migration up + down sql files document the incident and the edge cases inline so future readers don't have to re-derive the history from the git log. Closes #204 Closes #205 * fix(db): repair partially-fixed DBs in 000044, not just no-op (CR pass 1) CR pass 1 on PR #206 flagged that the original 000044 only added columns when missing — a manually-fixed DB with the column but no FK, or with retry_attempt_n nullable, would still be broken after this migration. Rewrite as five idempotent steps that each check pg_catalog / information_schema for the precise piece they own, so every reachable partial state is repaired: 1. Add retry_execution_id column if missing. 2. Add the FK on retry_execution_id -> execution_id if no FK on that column exists. Match by column relationship (pg_constraint join pg_attribute) NOT by constraint name, so a fresh-deploy DB whose 000042 ran cleanly with the auto-generated `purchase_executions_retry_execution_id_fkey` is recognised as already-FK'd and we don't add a second. 3. Add retry_attempt_n column if missing (as nullable, so step 4 can backfill before flipping to NOT NULL). 4. Backfill any NULLs and SET DEFAULT 0 / SET NOT NULL unconditionally — Postgres treats those as no-ops when the constraint is already as requested. 5. Create the partial index if missing. Down stays simple: DROP COLUMN ... CASCADE removes the FK regardless of its name. Comments inline document each reachable partial state and why each step is safe to re-run. Migration is a no-op on healthy DBs, fully repairing on broken ones.

cristim marked this pull request as draft April 27, 2026 23:21

cristim added triaged Item has been triaged priority/p1 Next up; this sprint severity/high Significant harm urgency/this-sprint Within the current sprint impact/few Limited audience effort/m Days type/security Security finding status/blocked Blocked on something external labels Apr 28, 2026

cristim force-pushed the fix/issue-159-gcp-secret-runtime branch from e369f12 to b84291d Compare April 29, 2026 18:58

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread internal/server/scheduledauth/validator.go

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread internal/server/scheduledauth/validator_test.go Outdated

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

cristim added 2 commits April 30, 2026 01:26

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

cristim marked this pull request as ready for review April 30, 2026 05:11

cristim merged commit 903ff85 into feat/multicloud-web-frontend Apr 30, 2026
4 checks passed

cristim mentioned this pull request Apr 30, 2026

fix(aws): gate /api/scheduled/* with bearer mode + Secrets Manager #203

Merged

cristim deleted the fix/issue-159-gcp-secret-runtime branch April 30, 2026 14:00

This was referenced Apr 30, 2026

fix(security/gcp): Cloud Scheduler + GKE jobs still embed plaintext SCHEDULED_TASK_SECRET in Authorization header #159

Closed

ops(gcp): take Cloud Run secure-ingress default into effect (per-env LB enablement + override removal) #78

Open

Conversation

cristim commented Apr 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Closes #159 + closes the auth gap surfaced during verification

What changed

Why Option B over Option A (#78 cross-PR dependency)

Implementation

Verification

Rollback / blast radius

Out of scope

Notes for reviewers

Summary by CodeRabbit

Uh oh!

cristim commented Apr 27, 2026

Uh oh!

coderabbitai Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot commented Apr 27, 2026

Uh oh!

cristim commented Apr 27, 2026

⚠️ Security regression flagged by post-merge verification

1. Cloud Run allow_unauthenticated = true in every deployed env

2. GKE Ingress doesn't validate OIDC at all

Suggested remediation

What's still correct in this PR

Uh oh!

cristim commented Apr 28, 2026

Uh oh!

cristim commented Apr 29, 2026

Uh oh!

coderabbitai Bot commented Apr 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cristim commented Apr 29, 2026

Uh oh!

coderabbitai Bot commented Apr 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cristim commented Apr 29, 2026

Uh oh!

coderabbitai Bot commented Apr 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cristim commented Apr 29, 2026

Uh oh!

coderabbitai Bot commented Apr 29, 2026

Uh oh!

cristim commented Apr 29, 2026

Uh oh!

coderabbitai Bot commented Apr 29, 2026

Uh oh!

cristim commented Apr 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

1. Cloud Run `allow_unauthenticated = true` in every deployed env