Skip to content

fix(data-warehouse): redact API key credentials from request logs#60599

Merged
MarconLP merged 1 commit into
masterfrom
custom-source-redact-credentials
Jun 1, 2026
Merged

fix(data-warehouse): redact API key credentials from request logs#60599
MarconLP merged 1 commit into
masterfrom
custom-source-redact-credentials

Conversation

@MarconLP
Copy link
Copy Markdown
Member

@MarconLP MarconLP commented May 29, 2026

Problem

The Custom REST source lets the manifest configure API-key auth injected into a query parameter (or a custom header/cookie) under an arbitrary, user-chosen name. The tracked HTTP transport only redacts credentials from logs and captured request samples by matching against a fixed denylist of parameter/header names (url_utils._REDACT_PARAM_NAMES, sampling._AUTH_HEADER_NAMES). A manifest naming its key something off the denylist — e.g. subscription-key — would write the upstream API key in plaintext into data_imports.http.request log lines and into sampled requests in object storage.

Flagged in review of #59612.

Changes

Adds value-based redaction on top of the existing name-based scrubbers, mirroring how Airbyte masks airbyte_secret config values: we know the exact credential string, so we mask it wherever it appears regardless of injection location (query, header, cookie, body) or parameter name.

  • url_utils.redact_literal_values() — replaces a known secret string and its URL-encoded forms with REDACTED. Skips empty/very short values so unrelated log text isn't mangled.
  • auth.auth_secret_values() — extracts the credential strings (token / api_key / password) carried by an auth object.
  • Threaded the values through make_tracked_sessionTrackedHTTPAdapterrecord_request, where they're applied to the logged URL and forwarded into the captured sample.
  • Wired in automatically at both choke points:
    • Sync path: RESTClient derives the values from its auth object — so this protects every REST-based source, not just the custom one.
    • Probe path: the custom source's create-time validate_credentials probe registers the same values.

This is purely additive — redact_values defaults to empty, so behaviour is unchanged for callers that don't pass credentials.

How did you test this code?

I'm an agent; I did not perform manual testing. Added and ran automated tests (all passing):

  • redact_literal_values — raw value, percent-encoded form, header/payload string, multiple secrets, short/empty skip, no-op when empty.
  • auth_secret_values — extraction across bearer / api_key / http_basic, and empty when no credential set.
  • Transport — adapter forwards redact_values to record_request (and defaults to empty).
  • RESTClient — builds its tracked session with credentials derived from auth.
  • Custom source probe — registers the credential for redaction when auth is injected into a query param.
  • Observer end-to-end — a key in a non-denylisted query param (subscription-key) is masked in the logged URL while unrelated params survive.

🤖 Agent context

Authored by Claude Code (claude-opus-4-8) while working through review feedback on #59612. Tools: Read/Edit/Bash/Grep, plus two parallel Explore subagents to study how Airbyte handles credential masking and descending-order incremental resume.

Key decision: the reviewer offered "constrain the param name to the denylist" vs "thread the configured name into the scrubber." The Airbyte exploration showed the robust industry approach is value-based masking (mask the known secret value everywhere) rather than name-based, so we chose that — it covers any injection location and any name without restricting what users can name their auth param. Deriving the values from the auth object inside RESTClient (rather than threading them per-source) means the protection is automatic for all sources.

Stacked on #59612.

Copy link
Copy Markdown
Member Author

MarconLP commented May 29, 2026

@MarconLP MarconLP force-pushed the custom-source-redact-credentials branch 2 times, most recently from ebbe3a6 to a2799e9 Compare May 29, 2026 06:10
@MarconLP MarconLP marked this pull request as ready for review May 29, 2026 06:12
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team May 29, 2026 06:13
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 29, 2026

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
posthog/temporal/data_imports/sources/common/http/transport.py:86
The `getattr` fallback here is superfluous — `_redact_values` is always assigned in `__init__`, so the default `()` can never be reached. Direct attribute access is clearer and avoids the implication that the attribute might be absent.

```suggestion
                    redact_values=self._redact_values,
```

### Issue 2 of 2
posthog/temporal/data_imports/sources/common/rest_source/tests/test_rest_client.py:114-128
These two tests cover the same code path with different inputs and assertions. The project prefers parametrized tests — collapsing them removes the repeated `@patch` boilerplate and makes it easy to add further auth types in future without writing another method.

```suggestion
    @pytest.mark.parametrize(
        "auth,expected_redact_values",
        [
            (APIKeyAuth(api_key="sk_live_x", name="key", location="query"), ("sk_live_x",)),
            (None, ()),
        ],
    )
    @patch("posthog.temporal.data_imports.sources.common.rest_source.rest_client.make_tracked_session")
    def test_builds_session_with_credentials_for_redaction(self, MockSession, auth, expected_redact_values) -> None:
        MockSession.return_value.headers = {}
        RESTClient(base_url="https://api.example.com", auth=auth)
        assert MockSession.call_args.kwargs["redact_values"] == expected_redact_values
```

Reviews (1): Last reviewed commit: "fix(data-warehouse): redact API key cred..." | Re-trigger Greptile

Comment thread posthog/temporal/data_imports/sources/common/http/transport.py Outdated
@MarconLP MarconLP force-pushed the custom-source-redact-credentials branch from a2799e9 to 0c1f767 Compare May 29, 2026 07:48
@MarconLP MarconLP changed the base branch from custom-source-fixes to graphite-base/60599 May 29, 2026 09:29
@MarconLP MarconLP force-pushed the graphite-base/60599 branch from 733ce31 to d09aaeb Compare May 29, 2026 13:47
@MarconLP MarconLP force-pushed the custom-source-redact-credentials branch from 0c1f767 to 8ffcc29 Compare May 29, 2026 13:47
@MarconLP MarconLP changed the base branch from graphite-base/60599 to custom-source-fixes May 29, 2026 13:47
@MarconLP MarconLP force-pushed the custom-source-redact-credentials branch from 8ffcc29 to 9657f73 Compare May 29, 2026 19:34
@MarconLP MarconLP force-pushed the custom-source-fixes branch from d09aaeb to 25e5440 Compare May 29, 2026 19:34
@MarconLP MarconLP changed the base branch from custom-source-fixes to graphite-base/60599 May 29, 2026 21:18
@MarconLP MarconLP force-pushed the custom-source-redact-credentials branch from 9657f73 to fc6fb7e Compare May 29, 2026 22:26
@MarconLP MarconLP force-pushed the graphite-base/60599 branch from 25e5440 to e10173f Compare May 29, 2026 22:26
@MarconLP MarconLP changed the base branch from graphite-base/60599 to master May 29, 2026 22:26
@MarconLP MarconLP force-pushed the custom-source-redact-credentials branch from fc6fb7e to e3dcf03 Compare June 1, 2026 08:01
@MarconLP MarconLP merged commit 9f7a93e into master Jun 1, 2026
195 checks passed
@MarconLP MarconLP deleted the custom-source-redact-credentials branch June 1, 2026 08:59
@deployment-status-posthog
Copy link
Copy Markdown

deployment-status-posthog Bot commented Jun 1, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-06-01 09:29 UTC Run
prod-us ✅ Deployed 2026-06-01 09:41 UTC Run
prod-eu ✅ Deployed 2026-06-01 09:42 UTC Run

MarconLP added a commit that referenced this pull request Jun 1, 2026
#60609)

## Problem

The Custom REST source stores auth credentials as write-only secrets (redacted from API responses, preserved across edits) while the `manifest_json` is fully user-authored and editable. Because a manifest can target any host — via `client.base_url` or an absolute resource `endpoint.path` — an editor who **cannot read** the stored credential could repoint the source at a server they control and the platform would send the preserved secret there on the next probe/sync. That's a confused-deputy credential-exfiltration path.

This mirrors **VERIA-311** (the SSH-tunnel retarget guard already present for DB sources) and was raised in review of the custom source stack.

## Changes

Applies the existing DB-source pattern (`ExternalDataSourceSerializers.update()` → "changing the connection host requires re-entering your credentials") to the custom source, whose connection target lives inside the manifest rather than a top-level `host` field.

- `manifest_request_hosts(manifest_json)` — returns the set of hostnames a manifest will contact (`base_url` + any absolute resource-path hosts).
- On update, if the incoming manifest **introduces a host the source wasn't already contacting** (`new_hosts - existing_hosts` non-empty), the update is folded into the existing `missing_credentials` re-entry gate: the editor must re-supply the secret. Removing a host or editing same-host paths does **not** require re-entry.

Cross-host resource URLs remain supported (consistent with Airbyte's low-code requester, which allows absolute/cross-host paths); the retarget guard + the existing SSRF host-safety check are what keep that safe, rather than forbidding it.

## How did you test this code?

I'm an agent; no manual testing. Added and ran automated tests (all passing, against a local Postgres + ClickHouse):

- `manifest_request_hosts` unit tests — base host, absolute resource hosts added, lowercasing, unparseable/`None`/array → empty set.
- Serializer integration tests on `update()`:
  - manifest host change **without** re-supplied credentials → `400`, "re-entering your credentials", manifest unchanged.
  - manifest host change **with** credentials re-supplied → `200`.
  - same-host manifest edit (path change) → `200`, credential preserved, no re-entry forced.

## 🤖 Agent context

Authored by Claude Code (claude-opus-4-8) while hardening the custom source from review feedback. Tools: Read/Edit/Bash/Grep, plus Explore subagents comparing PostHog's DB-source `update()` guard and Airbyte's low-code requester behavior.

Decision trail: the reviewer flagged that POST + stored credentials lets an editor drive state-changing/retargeted requests with a secret they can't read. Airbyte allows GET/POST and cross-host absolute URLs with no host restriction, so we kept POST and cross-host paths (matching the de-facto standard) and instead closed the *exfiltration* half — the more serious one — by reusing PostHog's own DB-source "re-enter credentials when the connection host changes" pattern (VERIA-311). The same-host POST-mutation residual is accepted for the pilot. Considered pulling `base_url` into a separate top-level field to reuse the existing `host`-keyed check verbatim, but kept it in the manifest to preserve the "paste a RESTAPIConfig" model and to cover absolute resource-path hosts too.

Stacked on #60599.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants