fix(security/gcp): drop plaintext SCHEDULED_TASK_SECRET — rely on OIDC + Cloud Run IAM (closes #159)#161
Conversation
|
@coderabbitai review |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (2)
📝 WalkthroughWalkthroughAdds a new scheduled-task authentication subsystem (disabled/bearer/oidc) with env-driven config, a Validator, middleware and JWKS warmup/caching; wires the validator into server startup and routing; updates tests; and changes Terraform to stop embedding plaintext scheduled-task secrets (GCP→OIDC, Azure→bearer). Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Middleware as scheduledAuthMiddleware
participant Validator
participant JWKS as JWKS Endpoint
participant Handler as handleScheduledHTTP
Client->>Middleware: POST /api/scheduled/*\nAuthorization: Bearer {token}
Middleware->>Validator: Validate(ctx, Authorization)
alt OIDC Mode
Validator->>JWKS: Fetch / use cached JWKS
JWKS-->>Validator: Keys (rgba(30,144,255,0.5))
Validator->>Validator: Verify signature, timestamps, aud & sub
Validator-->>Middleware: success
Middleware->>Handler: next.ServeHTTP()
Handler-->>Client: 200 OK
else Bearer Mode
Validator->>Validator: Constant-time compare with secret
Validator-->>Middleware: success / ErrUnauthorized
Middleware->>Handler: next.ServeHTTP()
Handler-->>Client: 200 OK
else Unauthorized
Validator-->>Middleware: ErrUnauthorized
Middleware-->>Client: 401 Unauthorized
end
sequenceDiagram
participant App as NewApplicationFromDeps
participant Env as envSourceOS
participant Config as scheduledauth.LoadConfig
participant Validator as scheduledauth.New
participant JWKS as JWKS Warmup Probe
App->>Env: Read SCHEDULED_TASK_* envs
App->>Config: LoadConfig(env)
Config-->>App: Config{mode, jwks, aud, sub, secret}
App->>Validator: scheduledauth.New(cfg)
alt OIDC Mode
Validator->>JWKS: Warmup probe (5s timeout)
JWKS-->>Validator: keys / timeout (rgba(34,139,34,0.5))
Validator-->>App: validator instance
else Bearer/Disabled
Validator-->>App: validator instance
else Invalid
Validator-->>App: ErrConfigInvalid
end
App-->>App: Wire into Application.scheduledAuth
Estimated code review effort🎯 4 (Complex) | ⏱️ ~65 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 2/5 reviews remaining, refill in 33 minutes and 40 seconds. Comment |
✅ Actions performedReview triggered.
|
|
|
P1 security fix (closes #159) but blocked — see the detailed verification comment above. The PR correctly removes plaintext |
e369f12 to
b84291d
Compare
Adds a new `internal/server/scheduledauth` package providing app-level
authentication for the `/api/scheduled/*` endpoints. Three modes:
- `oidc` — verifies a Google-issued OIDC ID token (RS256, JWKS-
validated). Subject pinning to the scheduler SA email is
REQUIRED (defence in depth — any GCP SA in the org can mint
a token with arbitrary `aud`, only the scheduler has our
`sub`). 60s clock skew, refresh-on-unknown-kid for key
rotation, single-flight JWKS cache via go-oidc/v3.
- `bearer` — constant-time compare against a shared secret. Used on
Azure where Logic Apps fetch the secret from Key Vault at
invocation time.
- `disabled` — local dev only. Loud per-request WARN logs.
Mode is selected via `SCHEDULED_TASK_AUTH_MODE`; OIDC params via
`SCHEDULED_TASK_OIDC_AUDIENCE` (CSV) and `SCHEDULED_TASK_OIDC_SUBJECTS`
(CSV, REQUIRED in oidc mode); bearer secret via `SCHEDULED_TASK_SECRET`.
Fail-fast on misconfig — empty subjects or unknown mode returns
`ErrConfigInvalid` so the server refuses to start rather than silently
accepting everything.
This is the validator + config parser only; wiring into the HTTP server
and the Terraform changes follow in subsequent commits.
Closes the auth gap surfaced during PR #161 verification: dropping the
plaintext `SCHEDULED_TASK_SECRET` from Cloud Scheduler headers leaves
`/api/scheduled/*` unauthenticated on Cloud Run / GKE (both have
`cloud_run_allow_unauthenticated = true` today; #78 will flip that
separately as defence-in-depth, but Option B is independent of it).
…/api/scheduled/* (#161) Wires the new `internal/server/scheduledauth` validator into the HTTP server. Replaces the inline `subtle.ConstantTimeCompare` against `appConfig.ScheduledTaskSecret` in `handleScheduledHTTP` with a proper middleware applied at mux registration time, so: - Auth is enforced before any handler logic runs (including the method check — invalid methods now return 401 in authenticated mode rather than disclosing which methods exist). - The validator picks the right auth scheme per platform: oidc on GCP, bearer on Azure, disabled for local dev (with loud WARN logs). - The bearer secret in bearer mode comes from `cfg.ScheduledTaskSecret`, which `resolveScheduledTaskSecret` already populates from Key Vault / Secrets Manager via the SecretResolver, so the plaintext never lives in a container env var. Validator construction is fail-fast: misconfig (empty subjects in oidc mode, unknown mode, etc.) returns an error from `NewApplicationFromDeps` and the container refuses to start. A best-effort JWKS warmup runs at startup to surface misconfiguration early without blocking on Google's CDN. Existing `TestHandleScheduledHTTP` cases that exercised the inline bearer check are migrated to drive through the middleware via the new `scheduledAuthMiddleware` helper, mirroring how `CreateHTTPServer` wires it. The remaining cases (method check, path validation) keep `app.scheduledAuth = nil` — the helper passes through unmodified, so those tests continue to validate the handler in isolation.
…h env vars (#161) Closes the auth gap on GCP that #161 surfaced: removing the plaintext `Authorization: Bearer ${var.scheduled_task_secret}` header from Cloud Scheduler left `/api/scheduled/*` unauthenticated on Cloud Run and GKE today (both have `cloud_run_allow_unauthenticated = true` and GKE has no IAP / IAM gate). This commit completes Option B: the in-app OIDC validator (added in earlier commits) now has explicit, deterministic config for the audience and subject claims it must accept. Cloud Run + GKE modules: - Set `oidc_token.audience` explicitly to the scheduler SA email on every `google_cloud_scheduler_job.http_target` block. Without this, Cloud Scheduler defaults `audience` to the target URL — coupling the validator's allow-list to the Cloud Run URL shape and breaking on Function-URL / vanity-domain changes. The explicit value mirrors the `email` we add to `SCHEDULED_TASK_OIDC_AUDIENCE` env. - Wire `SCHEDULED_TASK_AUTH_MODE`, `SCHEDULED_TASK_OIDC_AUDIENCE`, `SCHEDULED_TASK_OIDC_SUBJECTS` env vars on the running container. `SUBJECTS` is pinned to the SA's `unique_id` (the value Google sets in the JWT `sub` claim for SA-signed ID tokens), `AUDIENCE` to the SA email. When neither scheduler is enabled, the mode falls back to `disabled` — matches the old behaviour of skipping auth when no secret was set. Azure container-apps + AKS modules: - Set `SCHEDULED_TASK_AUTH_MODE = "bearer"`. The existing `SCHEDULED_TASK_SECRET_NAME` (Key Vault secret name, NOT value) flow is unchanged — `internal/server/app.go::resolveScheduledTaskSecret` resolves it via the SecretResolver at startup and the validator constant-time-compares it. Logic Apps' HTTP Connector trigger does not emit Entra OIDC tokens, so OIDC mode is not viable on Azure. tfvars examples (gcp/dev, azure/dev): document that the `SCHEDULED_*` env vars are wired automatically by the modules — users do not set them in `*.tfvars`. AWS Lambda is unaffected (EventBridge runs ECS tasks directly via `RunTask`; the HTTP `/api/scheduled/*` route is not on the critical path on AWS). #78 (flipping `cloud_run_allow_unauthenticated = false`) remains worth doing as defence-in-depth but is independent of this change — Option B works whether or not Cloud Run's IAM gate is enforced.
Adds `internal/server/scheduledauth/integration_test.go` exercising the full HTTP roundtrip: middleware in front of an inner handler, JWT signed by a test key, JWKS published over an httptest server, request sent through `http.DefaultClient` so net/http header parsing and the validator's lazy JWKS fetch are both on the path. Five sub-cases, each independently verifiable: 1. Valid token signed by published key → 200, handler runs. 2. Token signed by a key absent from the JWKS → 401, handler skipped. 3. Missing Authorization header → 401. 4. Wrong `sub` (subject pinning) → 401. 5. Wrong `aud` → 401. Reuses the helpers from `validator_test.go` (same package) to keep test infrastructure DRY.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (4)
internal/server/scheduledauth/errors.go (1)
16-16: Remove local filesystem reference from package docs.Line 16 points to a developer-local path, which is non-portable and noisy in repo docs.
Proposed doc cleanup
-// See plan: ~/.claude/projects/.../plans/issue-161-oidc-validator.md +// See repository docs/specs for scheduled-task auth design details.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/scheduledauth/errors.go` at line 16, The package-level comment in internal/server/scheduledauth/errors.go contains a developer-local filesystem reference ("~/.claude/projects/.../plans/issue-161-oidc-validator.md"); remove that local path from the file header and replace it with a portable reference (e.g., a short, repository-relative note like "See plans/issue-161-oidc-validator.md" or remove the line entirely) so the package docs no longer include non-portable developer-specific paths; update the top-of-file package comment where this path appears (package docs / file comment) accordingly.internal/server/http_test.go (1)
267-273: Prefer test assertions overpanicin setup closures.Using
panichere makes failures noisier than needed. Pass*testing.TintosetupAppand assert on constructor errors directly.Refactor sketch
- setupApp func(*Application) + setupApp func(*testing.T, *Application) - tt.setupApp(app) + tt.setupApp(t, app) - setupApp: func(app *Application) { + setupApp: func(t *testing.T, app *Application) { v, err := scheduledauth.New(scheduledauth.Config{ Mode: scheduledauth.ModeBearer, Bearer: "my-secret", }) - if err != nil { - panic(err) - } + testutil.AssertNoError(t, err) app.scheduledAuth = v },Also applies to: 284-290
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/http_test.go` around lines 267 - 273, The test setup currently panics on scheduledauth.New errors; change setupApp to accept *testing.T and replace panic(err) with a test assertion (e.g., require.NoError or t.Fatalf) when calling scheduledauth.New(scheduledauth.Config{Mode: scheduledauth.ModeBearer, Bearer: "my-secret"}); update the other occurrence (the 284-290 block) similarly so both scheduledauth.New calls use the test helper and assert failures via the testing.T parameter rather than panicking.internal/server/app.go (1)
83-84: Soften the “Always non-nil” claim in the field comment.This is true for normal bootstrap, but not universally true in tests/manual construction, so the comment can be made more precise.
Comment tweak
- // Always non-nil — disabled mode allows everything through with a + // Non-nil for normal app bootstrap — disabled mode allows everything through with a🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/app.go` around lines 83 - 84, Update the field comment that currently begins "Always non-nil — disabled mode allows everything through with a loud WARN log..." to remove the absolute claim "Always non-nil" and instead state that the field is normally non-nil in the bootstrapped server but may be nil in tests or when manually constructed; keep the rest of the comment about disabled mode and oidc (GCP) validation intact so readers know expected runtime behavior and testing exceptions.internal/server/scheduledauth/validator.go (1)
209-228: Give JWKS warmup its own deadline.
http.DefaultClienthas no timeout, soWarmupcan block startup indefinitely when callers pass a context without a deadline. Add a small fallback timeout before issuing the GET.⏱ Suggested guard
func (v *Validator) Warmup(ctx context.Context) { if v.mode != ModeOIDC || v.jwksURL == "" { return } + if _, ok := ctx.Deadline(); !ok { + var cancel context.CancelFunc + ctx, cancel = context.WithTimeout(ctx, 5*time.Second) + defer cancel() + } req, err := http.NewRequestWithContext(ctx, http.MethodGet, v.jwksURL, nil)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/scheduledauth/validator.go` around lines 209 - 228, Warmup can block indefinitely because http.DefaultClient has no timeout; update Validator.Warmup to use a bounded request context or an http.Client with a small Timeout when v.mode == ModeOIDC and v.jwksURL != "" — e.g., derive a child context with a short deadline (context.WithTimeout) or create a local http.Client{Timeout: ...} before calling Do, use that client to perform the GET and ensure the context/cancel is released; keep the existing log/error handling and defer resp.Body.Close() as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/server/http.go`:
- Around line 116-120: The scheduledAuthMiddleware currently returns next when
app.scheduledAuth is nil which creates an auth bypass; change
scheduledAuthMiddleware to fail-closed: if app.scheduledAuth == nil return an
http.Handler that rejects requests (e.g., respond with 401/403 and short
message) instead of returning next, and keep the normal path returning
app.scheduledAuth.Middleware(next); update tests to inject an explicit disabled
validator stub where they intend no auth rather than relying on nil.
In `@internal/server/scheduledauth/config.go`:
- Around line 31-39: LoadConfig currently treats a missing/empty
EnvSource.Get(EnvAuthMode) as ModeDisabled which silently enables
unauthenticated endpoints; change LoadConfig to treat an empty/missing
EnvAuthMode as a configuration error (return ErrConfigInvalid) and only allow
ModeDisabled when the env value explicitly equals the trimmed, lowercased string
of ModeDisabled; update the validation in LoadConfig (around the mode :=
strings.ToLower(strings.TrimSpace(env.Get(EnvAuthMode))) logic) to return
ErrConfigInvalid when mode == "" and otherwise continue validating accepted
modes before returning Config.
In `@internal/server/scheduledauth/validator_test.go`:
- Around line 83-90: The /swap handler currently reads the request body with a
single Read which can return a short buffer and produce a truncated JWKS; change
the handler registered with mux.HandleFunc("/swap", ...) to read the entire body
(e.g. use io.ReadAll or io.ReadFull loop) and only assign the resulting full
byte slice to current after a successful full read, and add the io import if
needed.
In `@terraform/environments/gcp/compute.tf`:
- Around line 52-57: The comment referencing scheduled tasks currently implies
“no app-level auth on GCP”—replace the sentence that mentions "plaintext
scheduled_task_secret" and "Application-level bearer auth is dropped on GCP"
with a clarified note stating that Cloud Scheduler authenticates to Cloud Run
via OIDC tokens + roles/run.invoker and that application-level bearer checks are
not removed but are replaced/implemented as OIDC validation at the application
level (i.e., the app should validate the OIDC token rather than rely on a
plaintext scheduled_task_secret), while keeping the Azure note about retaining
bearer checks for Container Apps.
In `@terraform/modules/compute/gcp/cloud-run/main.tf`:
- Around line 60-64: The current expression sets scheduled_task_auth_mode to
"disabled" when both var.enable_scheduled_tasks and
var.enable_ri_exchange_schedule are false, leaving /api/scheduled/*
unauthenticated; change the logic so the default is an authenticated mode (e.g.,
"oidc" or another enforced mode used by your app) instead of "disabled". Update
the scheduled_task_auth_mode expression that references
var.enable_scheduled_tasks and var.enable_ri_exchange_schedule to return the
authenticated string (for example "oidc") when endpoints are present or when
absent by default, and only return "disabled" when an explicit new variable (or
a new explicit flag) requests disabling; locate the scheduled_task_auth_mode
assignment in the module to make this change.
In `@terraform/modules/compute/gcp/gke/main.tf`:
- Around line 22-32: Don't disable scheduled-task auth when the scheduler
feature flag is off: change scheduled_task_auth_mode so it does not become
"disabled" when var.enable_scheduled_tasks is false. Instead always set
scheduled_task_auth_mode = "oidc" (leave scheduled_task_oidc_audience and
scheduled_task_oidc_subject using try(...) as-is) so the ingress
/api/scheduled/* endpoints continue to require OIDC validation even if no
scheduler ServiceAccount exists.
---
Nitpick comments:
In `@internal/server/app.go`:
- Around line 83-84: Update the field comment that currently begins "Always
non-nil — disabled mode allows everything through with a loud WARN log..." to
remove the absolute claim "Always non-nil" and instead state that the field is
normally non-nil in the bootstrapped server but may be nil in tests or when
manually constructed; keep the rest of the comment about disabled mode and oidc
(GCP) validation intact so readers know expected runtime behavior and testing
exceptions.
In `@internal/server/http_test.go`:
- Around line 267-273: The test setup currently panics on scheduledauth.New
errors; change setupApp to accept *testing.T and replace panic(err) with a test
assertion (e.g., require.NoError or t.Fatalf) when calling
scheduledauth.New(scheduledauth.Config{Mode: scheduledauth.ModeBearer, Bearer:
"my-secret"}); update the other occurrence (the 284-290 block) similarly so both
scheduledauth.New calls use the test helper and assert failures via the
testing.T parameter rather than panicking.
In `@internal/server/scheduledauth/errors.go`:
- Line 16: The package-level comment in internal/server/scheduledauth/errors.go
contains a developer-local filesystem reference
("~/.claude/projects/.../plans/issue-161-oidc-validator.md"); remove that local
path from the file header and replace it with a portable reference (e.g., a
short, repository-relative note like "See plans/issue-161-oidc-validator.md" or
remove the line entirely) so the package docs no longer include non-portable
developer-specific paths; update the top-of-file package comment where this path
appears (package docs / file comment) accordingly.
In `@internal/server/scheduledauth/validator.go`:
- Around line 209-228: Warmup can block indefinitely because http.DefaultClient
has no timeout; update Validator.Warmup to use a bounded request context or an
http.Client with a small Timeout when v.mode == ModeOIDC and v.jwksURL != "" —
e.g., derive a child context with a short deadline (context.WithTimeout) or
create a local http.Client{Timeout: ...} before calling Do, use that client to
perform the GET and ensure the context/cancel is released; keep the existing
log/error handling and defer resp.Body.Close() as-is.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fc81e336-3f30-4193-b19d-d79acea7e54d
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (18)
go.modinternal/server/app.gointernal/server/http.gointernal/server/http_test.gointernal/server/scheduledauth/config.gointernal/server/scheduledauth/errors.gointernal/server/scheduledauth/integration_test.gointernal/server/scheduledauth/validator.gointernal/server/scheduledauth/validator_test.goterraform/environments/azure/dev.tfvars.exampleterraform/environments/gcp/compute.tfterraform/environments/gcp/dev.tfvars.exampleterraform/modules/compute/azure/aks/main.tfterraform/modules/compute/azure/container-apps/main.tfterraform/modules/compute/gcp/cloud-run/main.tfterraform/modules/compute/gcp/cloud-run/variables.tfterraform/modules/compute/gcp/gke/main.tfterraform/modules/compute/gcp/gke/variables.tf
💤 Files with no reviewable changes (2)
- terraform/modules/compute/gcp/cloud-run/variables.tf
- terraform/modules/compute/gcp/gke/variables.tf
| func (app *Application) scheduledAuthMiddleware(next http.Handler) http.Handler { | ||
| if app.scheduledAuth == nil { | ||
| return next | ||
| } | ||
| return app.scheduledAuth.Middleware(next) |
There was a problem hiding this comment.
Fail closed when scheduled auth has not been initialized.
Returning next here makes every /api/scheduled/* route public whenever an Application is built outside the production constructor. That is an auth bypass on the exact path this PR is hardening; tests should inject an explicit disabled validator instead of relying on a nil pass-through.
🔒 Suggested fail-closed guard
func (app *Application) scheduledAuthMiddleware(next http.Handler) http.Handler {
if app.scheduledAuth == nil {
- return next
+ return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+ http.Error(w, "Scheduled task auth is not configured", http.StatusServiceUnavailable)
+ })
}
return app.scheduledAuth.Middleware(next)
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/server/http.go` around lines 116 - 120, The scheduledAuthMiddleware
currently returns next when app.scheduledAuth is nil which creates an auth
bypass; change scheduledAuthMiddleware to fail-closed: if app.scheduledAuth ==
nil return an http.Handler that rejects requests (e.g., respond with 401/403 and
short message) instead of returning next, and keep the normal path returning
app.scheduledAuth.Middleware(next); update tests to inject an explicit disabled
validator stub where they intend no auth rather than relying on nil.
…oc fixes, test panics (#161) Round 1 of CodeRabbit feedback. Triage: Actionable — addressed: - internal/server/scheduledauth/validator_test.go: /swap test handler read body via single Body.Read which can return a short read and truncate the JWKS payload — switched to io.ReadAll, mutex held only while assigning the buffer. - terraform/environments/gcp/compute.tf: comment claimed "Application- level bearer auth is dropped on GCP" — that's stale now that we validate OIDC at the application level. Rewritten to describe the actual setup (OIDC at app + Cloud Run IAM gate as defence in depth) and to call out #78 separately. Nitpicks — addressed: - internal/server/scheduledauth/validator.go::Warmup: http.DefaultClient has no timeout, so a misconfigured JWKS URL could block startup indefinitely. Added a 5s fallback deadline when the caller's ctx has no deadline; respect any deadline the caller already set. - internal/server/scheduledauth/errors.go: package doc referenced a developer-local plan path. Replaced with a portable reference to PR #161 / issue #159. - internal/server/http_test.go: setupApp closures called panic(err) on scheduledauth.New failure. Switched the signature to take *testing.T and use testutil.AssertNoError, scoped via a small newBearerValidator helper to keep the table-driven cases tight. - internal/server/app.go: scheduledAuth field comment claimed "Always non-nil" which is true post-NewApplicationFromDeps but not for hand-built test Applications. Rewritten to spell out both paths. Actionable — dismissed with justification (will reply on PR thread): - "scheduledAuthMiddleware should fail-closed when nil": intentional test affordance; production path always populates the field via NewApplicationFromDeps. - "LoadConfig should reject empty SCHEDULED_TASK_AUTH_MODE": local-dev default-to-disabled+WARN is by design (plan §1). - "scheduled_task_auth_mode should default to oidc when both schedulers off": oidc mode requires non-empty audiences/subjects (validator fails fast otherwise), so disabled is the only correct value when no scheduler is configured. Same reasoning for cloud-run and gke.
|
Pass 1 of CR feedback addressed in 084c755. Actionable — addressed:
Nitpicks — addressed:
Actionable — dismissed with justification:
@coderabbitai review |
|
Triggering a fresh review against the updated commits now. ✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/server/app.go`:
- Around line 295-308: buildScheduledAuth currently treats a missing
SCHEDULED_TASK_AUTH_MODE as "disabled" and ends up constructing a permissive
validator; change buildScheduledAuth to detect when LoadConfig returns a
missing/empty mode (or ModeDisabled) and return a non-nil error instead of
continuing, so the process fails startup on missing auth wiring. Specifically,
after scheduledauth.LoadConfig(envSourceOS{}) validate that saCfg.Mode is set
and not the default disabled sentinel (check against the empty value or
scheduledauth.ModeDisabled) before applying the ModeBearer secret override and
calling scheduledauth.New(saCfg); if the mode is missing/disabled return an
error explaining SCHEDULED_TASK_AUTH_MODE must be configured.
In `@internal/server/scheduledauth/validator.go`:
- Around line 138-142: The comment above the oidc subject check incorrectly says
`sub` pins the scheduler service account email; update that comment to state
that `sub` must be the service-account unique_id (not the email) and clarify
that SCHEDULED_TASK_OIDC_SUBJECTS expects those unique_ids; keep the existing
enforcement logic (the cfg.Subjects check and ErrConfigInvalid usage) unchanged
and only correct the explanatory text referencing cfg.Subjects and
SCHEDULED_TASK_OIDC_SUBJECTS so future configs/tests use the service-account
unique_id.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 06d5098c-4db0-42a1-9eca-ab7dc71a35b1
📒 Files selected for processing (6)
internal/server/app.gointernal/server/http_test.gointernal/server/scheduledauth/errors.gointernal/server/scheduledauth/validator.gointernal/server/scheduledauth/validator_test.goterraform/environments/gcp/compute.tf
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/server/http_test.go
| func buildScheduledAuth(cfg ApplicationConfig) (*scheduledauth.Validator, error) { | ||
| saCfg, err := scheduledauth.LoadConfig(envSourceOS{}) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| // In bearer mode, override the env-supplied secret with the one | ||
| // already resolved from KV / SM. LoadConfig reads SCHEDULED_TASK_SECRET | ||
| // directly which is fine for local dev where the env carries the | ||
| // plaintext, but in prod cfg.ScheduledTaskSecret is the authoritative | ||
| // value (resolved by resolveScheduledTaskSecret upstream). | ||
| if saCfg.Mode == scheduledauth.ModeBearer { | ||
| saCfg.Bearer = cfg.ScheduledTaskSecret | ||
| } | ||
| return scheduledauth.New(saCfg) |
There was a problem hiding this comment.
Don't fail open when SCHEDULED_TASK_AUTH_MODE is missing.
buildScheduledAuth goes through scheduledauth.LoadConfig, which currently maps an unset mode to disabled. In the production constructor, that turns a missed Terraform/env wiring into a live validator that allows every /api/scheduled/* request instead of failing startup, which is the exact regression this PR is trying to prevent.
🔒 Proposed guard
func buildScheduledAuth(cfg ApplicationConfig) (*scheduledauth.Validator, error) {
+ if strings.TrimSpace(os.Getenv(scheduledauth.EnvAuthMode)) == "" {
+ return nil, fmt.Errorf(
+ "%w: %s must be explicitly set (use disabled only for local dev)",
+ scheduledauth.ErrConfigInvalid,
+ scheduledauth.EnvAuthMode,
+ )
+ }
saCfg, err := scheduledauth.LoadConfig(envSourceOS{})
if err != nil {
return nil, err
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func buildScheduledAuth(cfg ApplicationConfig) (*scheduledauth.Validator, error) { | |
| saCfg, err := scheduledauth.LoadConfig(envSourceOS{}) | |
| if err != nil { | |
| return nil, err | |
| } | |
| // In bearer mode, override the env-supplied secret with the one | |
| // already resolved from KV / SM. LoadConfig reads SCHEDULED_TASK_SECRET | |
| // directly which is fine for local dev where the env carries the | |
| // plaintext, but in prod cfg.ScheduledTaskSecret is the authoritative | |
| // value (resolved by resolveScheduledTaskSecret upstream). | |
| if saCfg.Mode == scheduledauth.ModeBearer { | |
| saCfg.Bearer = cfg.ScheduledTaskSecret | |
| } | |
| return scheduledauth.New(saCfg) | |
| func buildScheduledAuth(cfg ApplicationConfig) (*scheduledauth.Validator, error) { | |
| if strings.TrimSpace(os.Getenv(scheduledauth.EnvAuthMode)) == "" { | |
| return nil, fmt.Errorf( | |
| "%w: %s must be explicitly set (use disabled only for local dev)", | |
| scheduledauth.ErrConfigInvalid, | |
| scheduledauth.EnvAuthMode, | |
| ) | |
| } | |
| saCfg, err := scheduledauth.LoadConfig(envSourceOS{}) | |
| if err != nil { | |
| return nil, err | |
| } | |
| // In bearer mode, override the env-supplied secret with the one | |
| // already resolved from KV / SM. LoadConfig reads SCHEDULED_TASK_SECRET | |
| // directly which is fine for local dev where the env carries the | |
| // plaintext, but in prod cfg.ScheduledTaskSecret is the authoritative | |
| // value (resolved by resolveScheduledTaskSecret upstream). | |
| if saCfg.Mode == scheduledauth.ModeBearer { | |
| saCfg.Bearer = cfg.ScheduledTaskSecret | |
| } | |
| return scheduledauth.New(saCfg) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/server/app.go` around lines 295 - 308, buildScheduledAuth currently
treats a missing SCHEDULED_TASK_AUTH_MODE as "disabled" and ends up constructing
a permissive validator; change buildScheduledAuth to detect when LoadConfig
returns a missing/empty mode (or ModeDisabled) and return a non-nil error
instead of continuing, so the process fails startup on missing auth wiring.
Specifically, after scheduledauth.LoadConfig(envSourceOS{}) validate that
saCfg.Mode is set and not the default disabled sentinel (check against the empty
value or scheduledauth.ModeDisabled) before applying the ModeBearer secret
override and calling scheduledauth.New(saCfg); if the mode is missing/disabled
return an error explaining SCHEDULED_TASK_AUTH_MODE must be configured.
|
Addressed the latest CodeRabbit subject-claim wording finding in commit 1442231: the OIDC validator comment now explicitly says SCHEDULED_TASK_OIDC_SUBJECTS expects service-account unique IDs, not emails.\n\nI am intentionally not changing buildScheduledAuth to fail on disabled/missing mode in this pass. That would reverse the already-discussed local-dev/no-scheduler design: Terraform currently emits disabled when no scheduler resources exist, and prior review discussion accepted the explicit disabled+WARN behavior. Production scheduler paths still set oidc/bearer explicitly and fail fast on invalid OIDC/bearer config.\n\n@coderabbitai review |
|
Triggering a fresh incremental review against the updated commits now. ✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/server/scheduledauth/validator.go (1)
231-246: Warmup should validate the payload shape, not just reachability.Right now any
2xx/3xxresponse is treated as healthy. If the URL is mispointed to a page that returns200 OK, startup logs will still look green and the failure shifts to the first scheduled request. Decoding a minimal JWKS shape ({"keys":[...]}) here would catch that class of misconfiguration earlier.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/scheduledauth/validator.go` around lines 231 - 246, The warmup currently only checks HTTP reachability; instead, after receiving resp for the JWKS fetch (resp variable from http.DefaultClient.Do for v.jwksURL), decode the response body as JSON and validate it has a top-level "keys" field that is a JSON array (use json.Decoder or json.Unmarshal and assert the "keys" value is a slice/array and non-nil); if decoding fails or the shape is wrong log a WARN including v.jwksURL and the decode/validation error and return so the validator will retry later.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/server/scheduledauth/validator.go`:
- Around line 132-160: Add a fail-fast validation for the JWKS URL during
validator construction: implement a helper validateAbsoluteURL(raw, label
string) error that parses the URL (using url.Parse) and returns ErrConfigInvalid
if parsing fails or scheme/host are empty, call validateAbsoluteURL(cfg.JWKSURL,
"SCHEDULED_TASK_OIDC_JWKS_URL") before assigning v.jwksURL or creating v.keySet,
and return the validation error so malformed JWKS URLs abort startup instead of
deferring to oidc.NewRemoteKeySet.
---
Nitpick comments:
In `@internal/server/scheduledauth/validator.go`:
- Around line 231-246: The warmup currently only checks HTTP reachability;
instead, after receiving resp for the JWKS fetch (resp variable from
http.DefaultClient.Do for v.jwksURL), decode the response body as JSON and
validate it has a top-level "keys" field that is a JSON array (use json.Decoder
or json.Unmarshal and assert the "keys" value is a slice/array and non-nil); if
decoding fails or the shape is wrong log a WARN including v.jwksURL and the
decode/validation error and return so the validator will retry later.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a67e1bcb-251b-4041-bded-7f4e89b43b3d
📒 Files selected for processing (1)
internal/server/scheduledauth/validator.go
|
Addressed the latest JWKS validation comments: OIDC validator construction now rejects malformed/non-absolute JWKS URLs, and warmup now validates the fetched payload has a top-level |
|
Major fixes — both sound:
Nit addressed — Dismissed nit — Azure cron string:
Triggering a fresh incremental review against commit ✅ Actions performedReview triggered.
|
…esidual) Commit 97253fd introduced testSchedulerSubject and converted the positive-path scheduler subjects in validator_test.go to it, but left two clusters of email-shaped fixtures behind: - validator_test.go:282 — the "wrong subject" rejection test still asserted against `attacker@example.iam.gserviceaccount.com`. - integration_test.go (lines 36, 71, 82, 99, 110) — every subject in the integration suite was still email-shaped. The CR nitpick was not just about the positive scheduler fixture: the broader concern is that production wiring pins SCHEDULED_TASK_OIDC_SUBJECTS to GCP service-account unique_ids (21-digit numeric strings), and a regression that silently widened the validator to accept any string would still pass tests that asserted against email-shaped values — both negative and positive paths. This commit: 1. Adds testNonSchedulerSubject ("999888777666555444333") next to the existing testSchedulerSubject. Same shape, distinct numeric so failures are visually traceable. 2. Replaces every remaining email-shaped subject across both test files with the appropriate constant. After this change `grep -E "scheduler@|attacker@|random@|gserviceaccount"` finds zero hits in `internal/server/scheduledauth/`. Tests: `go test ./internal/server/scheduledauth/...` clean; `go test ./...` clean across the full root suite (no other package referenced the email fixtures).
|
Finished the unique_id fixture migration that 97253fd started: @coderabbitai review |
|
The ✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/server/scheduledauth/validator_test.go`:
- Around line 409-416: The /swap POST request in the test (constructed as swap
:= http.NewRequest(..., srv.URL+"/swap", strings.NewReader(string(jwks(keyB))))
and executed via http.DefaultClient.Do(swap)) should be asserted for success
immediately after the call: check resp.StatusCode is 200 (or expected status)
and fail the test with a helpful message including the response body when it's
not, so a failed swap surfaces as a clear test failure instead of later causing
token-validation errors; read resp.Body into a string for the failure message
before closing it.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dce4a0c8-fa37-4a70-a230-cb0b2fe42265
📒 Files selected for processing (5)
internal/server/app_test.gointernal/server/scheduledauth/config.gointernal/server/scheduledauth/validator_test.goterraform/modules/compute/gcp/gke/main.tfterraform/modules/compute/gcp/gke/variables.tf
There was a problem hiding this comment.
🧹 Nitpick comments (1)
terraform/modules/compute/gcp/gke/main.tf (1)
36-37: ⚡ Quick winBind
audto the receiving endpoint, not the scheduler service account.
subis already pinned to the scheduler SA unique_id, so using the same SA identity again foraudmakes both checks describe the caller. Settingaudto the scheduled endpoint URI keeps the token recipient-bound and reduces cross-service replay if this SA is ever reused.♻️ Suggested change
- scheduled_task_oidc_audience = try(google_service_account.scheduler[0].email, "") + scheduled_task_oidc_audience = "${var.app_url}/api/scheduled/recommendations" ... oidc_token { service_account_email = google_service_account.scheduler[0].email - audience = google_service_account.scheduler[0].email + audience = local.scheduled_task_oidc_audience }Also applies to: 480-483, 714-727
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@terraform/modules/compute/gcp/gke/main.tf` around lines 36 - 37, The audience claim is currently set to the scheduler service account email (scheduled_task_oidc_audience = try(google_service_account.scheduler[0].email, "")) which makes both aud and sub refer to the caller; change scheduled_task_oidc_audience to the receiving endpoint's URI (e.g., use the variable or resource that represents the scheduled endpoint URL) so aud is the endpoint and keep scheduled_task_oidc_subject = try(google_service_account.scheduler[0].unique_id, "") as the caller identity; apply the same replacement for the other occurrences noted (around the other blocks at lines you referenced).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@terraform/modules/compute/gcp/gke/main.tf`:
- Around line 36-37: The audience claim is currently set to the scheduler
service account email (scheduled_task_oidc_audience =
try(google_service_account.scheduler[0].email, "")) which makes both aud and sub
refer to the caller; change scheduled_task_oidc_audience to the receiving
endpoint's URI (e.g., use the variable or resource that represents the scheduled
endpoint URL) so aud is the endpoint and keep scheduled_task_oidc_subject =
try(google_service_account.scheduler[0].unique_id, "") as the caller identity;
apply the same replacement for the other occurrences noted (around the other
blocks at lines you referenced).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c38c1dac-b84e-4f73-9612-15dfb7fdc518
📒 Files selected for processing (6)
internal/server/app_test.gointernal/server/scheduledauth/config.gointernal/server/scheduledauth/integration_test.gointernal/server/scheduledauth/validator_test.goterraform/modules/compute/gcp/gke/main.tfterraform/modules/compute/gcp/gke/variables.tf
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/server/app_test.go
- terraform/modules/compute/gcp/gke/variables.tf
- internal/server/scheduledauth/integration_test.go
Addresses CodeRabbit's actionable on validator_test.go:418-424:
the JWKS-rotation test fired POST /swap and only checked the
network-error path. If the test server's /swap handler 5xx'd, or
short-read the body and returned non-200 silently, the JWKS would
not actually swap and the test would later fail at "unknown kid"
— pointing at the wrong layer.
Fixes:
- Hoist `jwks(keyB)` into a local so we don't compute it three
times for body+ContentLength+failure-message symmetry.
- Surface NewRequest's error instead of silently dropping it
via `_, _ := http.NewRequest(...)`.
- Use `http.MethodPost` + `http.StatusOK` instead of stringly-
typed literals.
- On non-200, read the response body and t.Fatalf with both
status and body so the failure points at the swap server, not
the downstream validator.
Tests: TestValidate_OIDC_KeyRotation_RefreshOnUnknownKid passes;
`go test ./...` clean across the full root suite.
|
Pass: addressed the actionable on @coderabbitai review |
|
Triggering a fresh incremental review against the updated commits now. ✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/server/scheduledauth/validator_test.go (1)
750-766: ⚡ Quick winAssert the warmup failure path, not just the timeout.
Right now this still passes if
Warmupreturns early without attempting the fetch. Please make the failure observable here as well—e.g. assert the warning log or another side effect of the failed JWKS fetch—so this test protects the non-fatal error branch instead of only “did not hang”.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/scheduledauth/validator_test.go` around lines 750 - 766, The test currently only ensures Warmup does not hang but doesn't assert Warmup actually attempted the JWKS fetch; modify TestWarmup_LoggedAndNonFatal_OnDeadEndpoint to capture and assert the warning log produced by Warmup on fetch failure: create or inject a test logger/ buffer before calling New (or replace v's logger after New if New returns a validator with a logger field), call v.Warmup(ctx) as before, then read the captured logs and assert they contain a warning/error message referencing the JWKSURL or JWKS fetch failure (or the error string like ECONNREFUSED) to prove the failure path was exercised rather than returning early. Ensure you still use the short context timeout while asserting the log contents after Warmup returns.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/server/scheduledauth/validator_test.go`:
- Around line 750-766: The test currently only ensures Warmup does not hang but
doesn't assert Warmup actually attempted the JWKS fetch; modify
TestWarmup_LoggedAndNonFatal_OnDeadEndpoint to capture and assert the warning
log produced by Warmup on fetch failure: create or inject a test logger/ buffer
before calling New (or replace v's logger after New if New returns a validator
with a logger field), call v.Warmup(ctx) as before, then read the captured logs
and assert they contain a warning/error message referencing the JWKSURL or JWKS
fetch failure (or the error string like ECONNREFUSED) to prove the failure path
was exercised rather than returning early. Ensure you still use the short
context timeout while asserting the log contents after Warmup returns.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8c1256af-2505-46fb-a8f6-925a8850dc39
📒 Files selected for processing (1)
internal/server/scheduledauth/validator_test.go
…A (CR pass)
Addresses CodeRabbit's nitpick on `gke/main.tf:36-37` (also-applies
to :480-483 and :714-727), and applies the same convention to the
sibling `cloud-run/main.tf` so the two GCP compute modules stay
aligned mid-PR.
Old shape (both modules):
audience = google_service_account.scheduler[0].email # JWT claim
SCHEDULED_TASK_OIDC_AUDIENCE = same SA email
`sub` already pins the caller to the scheduler SA's unique_id, so
using the same SA identity for `aud` made both checks describe the
caller. If the scheduler SA were ever reused for a different
service, an OIDC token minted for service A could be replayed
against service B because both services would have the same
audience in their allow-list.
New shape (audience names the receiving endpoint, not the caller):
GKE:
audience = ${var.app_url}/api/scheduled/recommendations
Cloud Run:
audience = ${var.service_name}/api/scheduled/<endpoint>
The validator side already supports comma-separated audiences via
`cleanSet` / `audienceMatches` (no app-side change needed). Both
sides read the same local, so the JWT claim and the env-var match
by construction.
Why service_name (not URL) on Cloud Run:
Using `google_cloud_run_v2_service.main.uri` produces a Terraform
cycle — the service's env vars include the audience whitelist,
which would then depend on the service's own URI. `service_name`
is a module input, fully resolvable at plan time, and any
unique-per-(service, endpoint) string satisfies the OIDC
recipient-bound contract — the validator only does string equality
against the env-var value, which is built from the same local.
GKE has no such cycle (kubernetes_deployment doesn't read its own
URL into env), so it uses `var.app_url` per CR's literal suggestion.
Per-module specifics:
* gke/main.tf — single endpoint; one local, three callsites
(the local + the env var + the Cloud Scheduler oidc_token block)
all read from `local.scheduled_task_oidc_audience`.
* cloud-run/main.tf — two endpoints (recommendations, ri-exchange);
new per-schedule locals
`scheduled_task_recommendations_audience` and
`scheduled_task_ri_exchange_audience` build from
`var.service_name`. The CSV `oidc_audiences` local that flows
into the env-var compacts the two enabled values. Each Cloud
Scheduler oidc_token reads its matching local.
Verified: `terraform validate` clean on both modules; pre-commit
hooks (`terraform_fmt`, `terraform_validate`, `tflint`) pass;
`go test ./...` clean across the full root suite.
Addresses CodeRabbit's nitpick on `validator_test.go:750-766`:
`TestWarmup_LoggedAndNonFatal_OnDeadEndpoint` previously only
asserted that Warmup didn't hang. The test would still pass if a
future regression made Warmup short-circuit *before* the JWKS fetch
ever ran — exactly the failure mode the test exists to guard
against ("non-fatal, but actually exercises the failure branch").
This commit captures the global standard logger to a `bytes.Buffer`
for the duration of the test, calls Warmup with a guaranteed-bad
JWKS URL (127.0.0.1:1, ECONNREFUSED), and asserts the captured
output contains "JWKS warmup". That string is shared by every
warmup-failure log line in `validator.go` (request build, fetch,
non-200 status, invalid payload), so a tweak to any of those
messages won't break this test for cosmetic reasons — but a
regression that skips the fetch entirely will.
Safety: no test in `internal/server/scheduledauth/` uses
`t.Parallel()`, so swapping the global `log` writer is race-free.
`t.Cleanup` restores the original writer.
Tests: `go test ./internal/server/scheduledauth/... -run Warmup`
shows the new assertion firing on success;
`go test ./...` clean across the full root suite.
|
Pass: addressed both nitpicks from the most recent reviews.
@coderabbitai review |
|
This is the semantically correct fix per RFC 7519 §4.1.3: Warmup log assertion Capturing the global Triggering a fresh incremental review now. ✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
terraform/modules/compute/gcp/gke/main.tf (2)
22-41: ⚡ Quick winClean up the stale audience explanation in this comment block.
The first paragraph still describes the old SA-email audience model, while the later paragraph and Line 42 correctly describe the new endpoint-bound audience. In this auth-critical section, keeping both stories in the same comment is misleading.
Suggested cleanup
- # Scheduled-task OIDC validator wiring — see - # internal/server/scheduledauth. Audience pinned to the scheduler SA - # email (matches the explicit `audience` on the oidc_token block); - # subject pinned to the SA's unique numeric ID (the value Google puts + # Scheduled-task OIDC validator wiring — see + # internal/server/scheduledauth. Audience pinned to the receiving + # endpoint URL; subject pinned to the scheduler SA's unique numeric ID + # (the value Google puts # in the JWT `sub` claim for SA-signed ID tokens). @@ - # OAuth/OIDC convention. (CodeRabbit nitpick on PR `#161`.) + # OAuth/OIDC convention.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@terraform/modules/compute/gcp/gke/main.tf` around lines 22 - 41, The comment block describing scheduled-task OIDC validator contains stale text about an SA-email-based audience; update it to only describe the endpoint-bound audience model and remove the old SA-email audience paragraph so the explanation is consistent: keep the notes about auth mode being derived from the scheduler SA presence (referencing var.enable_scheduled_tasks and var.scheduled_task_auth_mode_override), retain the explanation that `sub` pins the scheduler SA unique_id, and explicitly state that `aud` is the receiving endpoint URL (not the SA identity) per the oidc_token usage and kubernetes_ingress_v1.app behavior; delete the misleading lines describing the old SA-email audience model and ensure the remaining text clearly ties audience -> endpoint and subject -> SA unique_id.
718-735: ⚡ Quick winReuse the shared local for
urias well.The comment says the audience and target match “by construction”, but Line 718 still duplicates the endpoint string separately. If one side changes without the other, Scheduler will mint tokens for the wrong
audand this job will start failing auth.Suggested change
http_target { http_method = "POST" - uri = "${var.app_url}/api/scheduled/recommendations" + uri = local.scheduled_task_oidc_audience @@ oidc_token { service_account_email = google_service_account.scheduler[0].email audience = local.scheduled_task_oidc_audience } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@terraform/modules/compute/gcp/gke/main.tf` around lines 718 - 735, Replace the hard-coded uri assignment so the Scheduler's target URL is drawn from the shared local instead of duplicating the string: change the line setting uri = "${var.app_url}/api/scheduled/recommendations" to use local.scheduled_task_oidc_audience (or the existing local that holds the canonical scheduled endpoint), ensuring the uri and the oidc_token.audience (local.scheduled_task_oidc_audience) remain identical; update the resource block containing uri and the oidc_token { service_account_email = google_service_account.scheduler[0].email ... } accordingly so the audience and target stay in sync.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@terraform/modules/compute/gcp/gke/main.tf`:
- Around line 22-41: The comment block describing scheduled-task OIDC validator
contains stale text about an SA-email-based audience; update it to only describe
the endpoint-bound audience model and remove the old SA-email audience paragraph
so the explanation is consistent: keep the notes about auth mode being derived
from the scheduler SA presence (referencing var.enable_scheduled_tasks and
var.scheduled_task_auth_mode_override), retain the explanation that `sub` pins
the scheduler SA unique_id, and explicitly state that `aud` is the receiving
endpoint URL (not the SA identity) per the oidc_token usage and
kubernetes_ingress_v1.app behavior; delete the misleading lines describing the
old SA-email audience model and ensure the remaining text clearly ties audience
-> endpoint and subject -> SA unique_id.
- Around line 718-735: Replace the hard-coded uri assignment so the Scheduler's
target URL is drawn from the shared local instead of duplicating the string:
change the line setting uri = "${var.app_url}/api/scheduled/recommendations" to
use local.scheduled_task_oidc_audience (or the existing local that holds the
canonical scheduled endpoint), ensuring the uri and the oidc_token.audience
(local.scheduled_task_oidc_audience) remain identical; update the resource block
containing uri and the oidc_token { service_account_email =
google_service_account.scheduler[0].email ... } accordingly so the audience and
target stay in sync.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6d152ee8-6227-48ca-bd84-a52fa11e358b
📒 Files selected for processing (3)
internal/server/scheduledauth/validator_test.goterraform/modules/compute/gcp/cloud-run/main.tfterraform/modules/compute/gcp/gke/main.tf
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/server/scheduledauth/validator_test.go
- terraform/modules/compute/gcp/cloud-run/main.tf
…int local (CR pass)
Two CR nitpicks on `gke/main.tf`:
1. **Line 22-41 stale comment**: the comment block introducing the
scheduled_task_oidc_* locals still described the OLD SA-email
audience model in its first paragraph and then contradicted itself
in its second. Rewritten as a single coherent explanation: `aud`
names the recipient (endpoint URL), `sub` names the caller (SA
unique_id, not email — the form Google actually puts in the JWT
`sub` claim).
2. **Lines 718-735 string duplication**: the Cloud Scheduler
`http_target.uri` and `oidc_token.audience` were both typed as
`${var.app_url}/api/scheduled/recommendations`, i.e. the same
string twice. CR's concern: if one side changed without the other,
Cloud Scheduler would mint tokens with `aud` for the old endpoint
while POSTing to the new one — silent 401. Both now read
`local.scheduled_task_oidc_audience`, so a future endpoint rename
becomes a syntax-level mismatch rather than a runtime auth failure.
Note: the same collapse does NOT apply to `cloud-run/main.tf` because
on that side the `uri` (real Cloud Run service URL,
`google_cloud_run_v2_service.main.uri/...`) and the `audience`
(service_name-prefixed string) are intentionally different values
— the audience uses `var.service_name` to break the Terraform
dependency cycle that the URL would create. Different values by
design, no drift risk.
Verified: `terraform validate` clean on the gke module;
`go test ./...` clean across the root suite.
|
Pass: addressed both gke nitpicks.
The same collapse intentionally does NOT apply to @coderabbitai review |
|
The cloud-run carve-out is correct — Triggering a fresh incremental review now. ✅ Actions performedReview triggered.
|
…arsing (CR pass)
Addresses CodeRabbit's open Major on `dev.tfvars.example:90`: the
example was tagged "# Cron: Daily at 2 AM" which read as "full cron
syntax supported". A user copying it and changing to e.g.
`"*/15 * * * *"` would silently get a daily-at-00:00 schedule
instead, because the consuming code in
`modules/compute/azure/container-apps/scheduled-tasks.tf:23` only
splits on spaces and reads index [1] (the hour field).
CR's literal suggestion was to switch the example to a frequency/
interval form. Doing that would diverge the Azure tfvars files from
the AWS/GCP ones (which all use the same cron string) and would
require a parser change. Both are bigger than the actual finding.
The minimal fix is to be honest about the limitation in the docs.
Three sites updated:
1. `terraform/environments/azure/dev.tfvars.example:78-90` —
replaced the `# Cron: Daily at 2 AM` one-liner with a comment
block explicitly stating that only the hour field is honored,
pointing at the parser, and warning that sub-daily / DoW /
DoM patterns are silently ignored.
2. `terraform/environments/azure/variables.tf:380-390` — variable
description rewritten to the same effect, so `terraform plan`
output and module docs reflect reality.
3. `terraform/modules/compute/azure/container-apps/variables.tf:225-235`
— module-level variable description updated symmetrically so
module consumers get the same warning.
No behavioural change. Verified `terraform validate` clean on
`terraform/environments/azure` and
`terraform/modules/compute/azure/container-apps`.
If finer-grained scheduling becomes a real requirement, the
follow-up is to extend the parser and switch the variables to
something like `recommendation_schedule_frequency` + `_interval` /
`_start_time`, then drop the cron-shaped string. That's intentionally
out of scope here.
|
Addressed the open Major on The variable's parser at That should clear the actionable backlog from the earlier reviews. The CR-as-comments thread on the previous push acknowledged both gke nitpicks were resolved cleanly. @coderabbitai review |
|
Worth noting: a retrieved learning from a prior review on this repo confirms that Triggering a fresh incremental review now. 🧠 Learnings used✅ Actions performedReview triggered.
|
PR #161 made SCHEDULED_TASK_AUTH_MODE required at startup but the env var was only wired on GCP and Azure. AWS Lambda was missed and the post-merge deploy crashed with: scheduledauth: invalid configuration: SCHEDULED_TASK_AUTH_MODE is unset (set to oidc, bearer, or disabled) (See https://github.com/LeanerCloud/CUDly/actions/runs/25148590105 .) The "obvious" fix — wire SCHEDULED_TASK_AUTH_MODE=disabled to mirror the pre-#161 reality where the bearer check was a no-op when the secret was empty — is wrong. ModeDisabled lets every /api/scheduled/* request through, and the Lambda Function URL has authorization_type = NONE (the same comment in this module's main.tf explicitly notes the URL is public). That would ship a production auth bypass. Instead, run AWS in bearer mode with the secret resolved from Secrets Manager at runtime — same model Azure already uses. Changes: * terraform/modules/secrets/aws — provision a random_password + aws_secretsmanager_secret + version for the scheduled-task bearer (gated by var.create_scheduled_task_secret, default true). Include it in the secret_read policy and expose ARN + name outputs. * terraform/modules/compute/aws/lambda — set SCHEDULED_TASK_AUTH_MODE=bearer + SCHEDULED_TASK_SECRET_NAME=var.scheduled_task_secret_name on the Lambda env block. Add the secret ARN to the secrets_access IAM policy. Add the corresponding scheduled_task_secret_arn / scheduled_task_secret_name variables. * terraform/modules/compute/aws/fargate — same wiring on the task definition env block + IAM policy + variables. Fargate isn't the default platform but should not regress when selected. * terraform/environments/aws/compute.tf — pass module.secrets.scheduled_task_secret_arn / _name into both compute_lambda and compute_fargate. Runtime path (already in place from #161): app.go reads SCHEDULED_TASK_SECRET_NAME, resolveScheduledTaskSecret fetches the plaintext from AWS Secrets Manager via the SecretResolver, and buildScheduledAuth overrides saCfg.Bearer with that value before calling scheduledauth.New. The plaintext never lives in Lambda env or Terraform state. Validation: terraform validate clean for all four touched modules. Once deployed, /api/scheduled/* on the public Lambda URL will reject every request that doesn't carry the resolved bearer secret. AWS schedulers (EventBridge -> direct Lambda invocation) bypass the HTTP middleware entirely, so they are unaffected.
) PR #161 made SCHEDULED_TASK_AUTH_MODE required at startup but the env var was only wired on GCP and Azure. AWS Lambda was missed and the post-merge deploy crashed with: scheduledauth: invalid configuration: SCHEDULED_TASK_AUTH_MODE is unset (set to oidc, bearer, or disabled) (See https://github.com/LeanerCloud/CUDly/actions/runs/25148590105 .) The "obvious" fix — wire SCHEDULED_TASK_AUTH_MODE=disabled to mirror the pre-#161 reality where the bearer check was a no-op when the secret was empty — is wrong. ModeDisabled lets every /api/scheduled/* request through, and the Lambda Function URL has authorization_type = NONE (the same comment in this module's main.tf explicitly notes the URL is public). That would ship a production auth bypass. Instead, run AWS in bearer mode with the secret resolved from Secrets Manager at runtime — same model Azure already uses. Changes: * terraform/modules/secrets/aws — provision a random_password + aws_secretsmanager_secret + version for the scheduled-task bearer (gated by var.create_scheduled_task_secret, default true). Include it in the secret_read policy and expose ARN + name outputs. * terraform/modules/compute/aws/lambda — set SCHEDULED_TASK_AUTH_MODE=bearer + SCHEDULED_TASK_SECRET_NAME=var.scheduled_task_secret_name on the Lambda env block. Add the secret ARN to the secrets_access IAM policy. Add the corresponding scheduled_task_secret_arn / scheduled_task_secret_name variables. * terraform/modules/compute/aws/fargate — same wiring on the task definition env block + IAM policy + variables. Fargate isn't the default platform but should not regress when selected. * terraform/environments/aws/compute.tf — pass module.secrets.scheduled_task_secret_arn / _name into both compute_lambda and compute_fargate. Runtime path (already in place from #161): app.go reads SCHEDULED_TASK_SECRET_NAME, resolveScheduledTaskSecret fetches the plaintext from AWS Secrets Manager via the SecretResolver, and buildScheduledAuth overrides saCfg.Bearer with that value before calling scheduledauth.New. The plaintext never lives in Lambda env or Terraform state. Validation: terraform validate clean for all four touched modules. Once deployed, /api/scheduled/* on the public Lambda URL will reject every request that doesn't carry the resolved bearer secret. AWS schedulers (EventBridge -> direct Lambda invocation) bypass the HTTP middleware entirely, so they are unaffected.
…205) (#206) * fix(db): idempotently restore retry_execution_id column dropped by migration drift (closes #204, closes #205) Production /api/purchases/* surfaced 500 errors after PR #161 deployed, with the underlying CloudWatch error: ERROR: column "retry_execution_id" does not exist (SQLSTATE 42703) User-facing symptoms (dashboard "Upcoming Scheduled Purchases" panel): * "Failed to load purchase details: Internal server error" (#205) * "Failed to cancel purchase" (#204) Both buttons hit `Handler.GetExecutionByID` whose SELECT references the retry_execution_id and retry_attempt_n columns introduced by PR #168. Root cause: migration version drift. * PR #189 added 000042_recommendations_add_engine_to_key earlier on the same day. Deployed Lambdas applied it as version 42. * PR #168 then added 000042_purchase_executions_retry_linkage, colliding on version 42. * PR #195 resolved the on-disk duplicate by renaming RECOMMENDATIONS to 000043 — the wrong direction. Deployed DBs already had schema_migrations.version=42 marked as the engine content, so on the next migrate.Up() they: - skipped on-disk 000042 (now retry_linkage) entirely (DB version >= 42 already) - applied on-disk 000043 (now engine again) as a no-op via the ADD COLUMN IF NOT EXISTS guards already in that file Net effect: the retry_execution_id / retry_attempt_n columns were never added on prod, but the Go code expects them. Forward-only corrective. The new 000044 replays 000042's schema additions wrapped in IF NOT EXISTS / CREATE INDEX IF NOT EXISTS so it's a no-op on any DB that already has the columns (fresh deploys, manually-fixed envs) and additive on the broken ones. We don't try to repair schema_migrations history — golang-migrate only cares about the high-water version, and we can't tell from outside whether a given environment skipped 42 or applied 43 twice. The migration up + down sql files document the incident and the edge cases inline so future readers don't have to re-derive the history from the git log. Closes #204 Closes #205 * fix(db): repair partially-fixed DBs in 000044, not just no-op (CR pass 1) CR pass 1 on PR #206 flagged that the original 000044 only added columns when missing — a manually-fixed DB with the column but no FK, or with retry_attempt_n nullable, would still be broken after this migration. Rewrite as five idempotent steps that each check pg_catalog / information_schema for the precise piece they own, so every reachable partial state is repaired: 1. Add retry_execution_id column if missing. 2. Add the FK on retry_execution_id -> execution_id if no FK on that column exists. Match by column relationship (pg_constraint join pg_attribute) NOT by constraint name, so a fresh-deploy DB whose 000042 ran cleanly with the auto-generated `purchase_executions_retry_execution_id_fkey` is recognised as already-FK'd and we don't add a second. 3. Add retry_attempt_n column if missing (as nullable, so step 4 can backfill before flipping to NOT NULL). 4. Backfill any NULLs and SET DEFAULT 0 / SET NOT NULL unconditionally — Postgres treats those as no-ops when the constraint is already as requested. 5. Create the partial index if missing. Down stays simple: DROP COLUMN ... CASCADE removes the FK regardless of its name. Comments inline document each reachable partial state and why each step is safe to re-run. Migration is a no-op on healthy DBs, fully repairing on broken ones.
Closes #159 + closes the auth gap surfaced during verification
The original commit (
afa636c88) dropped the plaintextAuthorization: Bearer ${var.scheduled_task_secret}header fromCloud Scheduler — fixing the
SCHEDULED_TASK_SECRETleak in Terraformstate per #159. Verification then surfaced a follow-on issue: with the
header gone,
/api/scheduled/*was completely unauthenticated onCloud Run AND GKE today (
cloud_run_allow_unauthenticated = trueeverywhere; GKE Ingress has no IAP). This PR completes the fix by
implementing Option B: app-level OIDC ID-token validation in Go
middleware.
What changed
Bearer <plaintext>headersubtle.ConstantTimeCompareagainst env-var secretinternal/server/scheduledauthmiddleware: signature (RS256 only) via JWKS, issuer, audience allow-list, subject pinning to scheduler SA (defence in depth), 60s clock skew, refresh-on-unknown-kidWhy Option B over Option A (#78 cross-PR dependency)
Option A would flip
cloud_run_allow_unauthenticated = falseand relyentirely on Cloud Run's IAM gate (#78 tracks that). Option B is
preferred here because:
IAP set up; Option A would not cover it.
easier to reason about; one auth surface instead of two.
allow_unauthenticated = falseis still worth doing as defence in depth, and ops(gcp): take Cloud Run secure-ingress default into effect (per-env LB enablement + override removal) #78 stays open. Option
B works whether or not Cloud Run's IAM gate is enforced.
Implementation
5 atomic commits, each independently revertable:
feat(server): add scheduled-task OIDC validator— newinternal/server/scheduledauthpackage: Validator + JWKS cache (viacoreos/go-oidc/v3's single-flight RemoteKeySet, refresh-on-unknown-kid for rotation), 3 modes (
oidc/bearer/disabled), fail-fast config parse. RS256 only. Subject pinning REQUIRED in oidc mode
(any GCP SA can mint a token with arbitrary
aud— only thescheduler SA has our
sub). 60s skew on exp/nbf, iat-future check.Comprehensive unit tests including alg-confusion, key rotation, and
single-flight stampede protection.
fix(server): replace bearer-only check with OIDC-aware middleware— wires the validator intoCreateHTTPServer, dropsthe inline
subtle.ConstantTimeComparefromhandleScheduledHTTP,best-effort JWKS warmup at startup, fail-fast on misconfig.
fix(terraform/gcp): pin oidc_token.audience + wire scheduled-task auth env vars— explicitaudience = SA emailon everyoidc_tokenblock (without it Cloud Scheduler defaults to the targetURL, coupling the validator to Cloud Run URL shape).
SCHEDULED_TASK_AUTH_MODE,SCHEDULED_TASK_OIDC_AUDIENCE,SCHEDULED_TASK_OIDC_SUBJECTSenv vars wired automatically by thecloud-run / gke / container-apps / aks modules.
SCHEDULED_TASK_OIDC_SUBJECTSpins to the SA'sunique_id(thevalue Google sets in the JWT
subclaim for SA-signed ID tokens).test(server): integration test for scheduled-task OIDC middleware— full HTTP roundtrip (httptest.Server+ mockedJWKS) covering: valid token, key not in JWKS, missing header, wrong
sub, wrong aud.
Verification
Local gates (all run against the rebased branch on top of
feat/multicloud-web-frontend):go test ./...from repo root — all greengo test ./...frompkg/— all greengo vet ./...on both modules — cleanterraform validateon every touched module(
compute/gcp/cloud-run,compute/gcp/gke,compute/azure/container-apps,compute/azure/aks) and environment(
environments/gcp,environments/azure) — all validterraform fmt -check -recursive terraform/— cleangrep -rn 'Bearer ${var.scheduled_task_secret}' terraform/returns nothing
grep -rn 'SCHEDULED_TASK_SECRET\b' terraform/returns only
SCHEDULED_TASK_SECRET_NAME(the KV reference, not theplaintext) on Azure
(
pre-commit,Azure Sanity,AWS Sanity)CodeRabbit review loop will run automatically once the PR exits draft;
the watcher is already armed.
Rollback / blast radius
The change is isolated: a new package + middleware wiring + per-module
Terraform env vars. To revert, drop the four feature commits — the
codebase returns to the pre-#161 plaintext-bearer state, but #159's
finding is back, so a partial revert (e.g. dropping only the new
package) is not a useful intermediate.
Out of scope
cloud_run_allow_unauthenticated = false) — still worthdoing as defence in depth, tracked separately.
RunTask, no HTTP scheduled-task path on AWS.Connector does not emit Entra ID tokens, so bearer mode stays.
Notes for reviewers
RemoteKeySet, which hasbuilt-in single-flight + refresh-on-unknown-kid (per
OIDC Core §10.1.1).
No fixed TTL — refresh-on-unknown-kid is sufficient because providers
always sign new tokens with the new kid before retiring the old key.
SCHEDULED_TASK_OIDC_SUBJECTSis REQUIRED in oidc mode — emptysubjects →
ErrConfigInvalid→ container fails to start. Betterthan silently accepting any SA in the org.
disabledmode with a loud per-request WARNlog so it's impossible to ship to production without explicitly
opting into oidc / bearer.
Summary by CodeRabbit
New Features
Chores
Documentation
Tests