GCP-3046: Add GCP Dataflow Quick Start Script by tedkahwaji · Pull Request #57 · DataDog/integrations-management

tedkahwaji · 2025-10-26T16:48:27Z

Summary

This PR adds a QuickStart script for automated GCP log forwarding to Datadog via Dataflow. The implementation follows the same pattern established by the Integration QuickStart setup, but is specifically tailored for Dataflow log forwarding configuration.

- Uses static, predefined values as much as possible to minimize required customer inputs in the UI
- Implements idempotent operations (checks if resources exist before creation, simpler upsert operations)

Note:
This change also includes some refactoring. It creates a new shared directory for shared logic between the Integration & Log Forwarding QuickStart scripts and generalizes many of the reporter methods to take in a workflow_type, since these will use independent workflow types.

Testing

I ran the script and verified that all steps were executed correctly. I also removed some resources to allow the script to perform the necessary upsert operations.

You can view the logs.

Workflow Steps

Authentication & Scope Selection
- (same as Integration QuickStart)
- Collects available GCP projects and folders for configuration scope
User Input Collection
- Script polls for user selections including:
- Default project (where all infrastructure will be created)
- Dataflow region
- Dataflow Prime enablement flag
- Optional inclusion filter for log sink
- Optional exclusion filters for log sink
Pub/Sub Infrastructure Setup
- Creates two Pub/Sub topics with pull-based subscriptions:
  - export-logs-to-datadog (main topic)
  - export-failed-logs-to-datadog (dead letter topic)
- Idempotent: checks for existing topics/subscriptions before creation
Service Account Creation
- Creates or finds datadog-dataflow service account in the default project
- Used to run the Dataflow job with appropriate permissions
IAM Role Assignment
- Assigns required roles to the service account:
  - roles/dataflow.admin
  - roles/dataflow.worker
  - roles/pubsub.viewer
  - roles/pubsub.publisher
  - roles/pubsub.subscriber
  - roles/storage.objectAdmin
Secret Management
- Creates or retrieves Datadog API key via Datadog API
- Stores API key in Secret Manager as gcp-dataflow-logs-api-key
- Binds roles/secretmanager.secretAccessor to the Dataflow service account
Log Sink Creation
- Creates datadog-log-sink for each selected project and folder
- Folder-level sinks use --include-children for aggregated log collection
- Applies user-defined inclusion and exclusion filters
- Grants Pub/Sub publisher permissions to each sink's writer identity
Dataflow Job Deployment
- Enables Dataflow API on the project
- Creates pubsub-to-datadog-job using Google's Cloud Pub/Sub to Datadog template
- Configures job parameters (subscription, Datadog intake URL, API key, dead letter topic)
- Optionally enables Dataflow Prime if requested by user
- Idempotent: checks for existing running jobs before creation
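
The "check before create" pattern that several steps above rely on can be sketched roughly as follows. This is an illustration only, not the script's actual code: `ensure_topic` is a hypothetical name, and the `gcloud` argument is assumed to be a callable that runs a gcloud command and returns its parsed JSON output as a list of dicts.

```python
from typing import Callable


def ensure_topic(
    gcloud: Callable[[str], list[dict]], project_id: str, topic_id: str
) -> bool:
    """Create the Pub/Sub topic only if it is missing.

    Returns True when the topic was created, False when it already existed.
    `gcloud` is a hypothetical runner that returns parsed JSON output.
    """
    full_name = f"projects/{project_id}/topics/{topic_id}"
    existing = gcloud(f"pubsub topics list --project={project_id}")
    # Compare against the fully qualified name reported by the listing.
    if any(topic.get("name") == full_name for topic in existing):
        return False
    gcloud(f"pubsub topics create {topic_id} --project={project_id}")
    return True
```

Running the function twice against the same project should issue the create command only once, which is the property the "Idempotent" bullets describe.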

Documents

gpalmz · 2025-10-27T15:20:04Z

benjjs

Repo structure and such look good, I'll leave the business logic to the GCP reviewer. Glad this template works for y'all as well.

ash-ddog

Lots of comments, but this looks pretty good for a first pass. We may make edits moving forward but I'm ok with this to start! Great work!

ash-ddog · 2025-10-28T19:39:32Z

+]
+
+PUBSUB_TOPIC_ID: str = "export-logs-to-datadog"
+PUBSUB_DEAD_LETTER_TOPIC_ID: str = "export-failed-logs-to-datadog"


Nit: I'd prob just call this export-logs-to-datadog-dlq or something. This name sounds like customers can just replay these logs to the original topic, but they actually need to be edited beforehand. See: https://cloud.google.com/architecture/stream-logs-from-google-cloud-to-splunk#replay_unprocessed_messages

ash-ddog · 2025-10-28T19:42:34Z

+
+PUBSUB_TOPIC_ID: str = "export-logs-to-datadog"
+PUBSUB_DEAD_LETTER_TOPIC_ID: str = "export-failed-logs-to-datadog"
+SECRET_MANAGER_NAME: str = "gcp-dataflow-logs-api-key"


Might I suggest we pick a prefix and use it for everything?

i.e.

if we decide on prefix export-logs-to-datadog

then from there we have:

export-logs-to-datadog-topic

export-logs-to-datadog-dlq

export-logs-to-datadog-api-key

export-logs-to-datadog-job

so on and so on. Thoughts?
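
A minimal sketch of what such a prefix convention could look like in the constants module. `RESOURCE_PREFIX`, `resource_name`, and `DATAFLOW_JOB_NAME` are hypothetical names introduced here for illustration; the suffixes are the ones listed above.

```python
RESOURCE_PREFIX: str = "export-logs-to-datadog"


def resource_name(suffix: str) -> str:
    """Build a resource name from the shared prefix (hypothetical helper)."""
    return f"{RESOURCE_PREFIX}-{suffix}"


PUBSUB_TOPIC_ID: str = resource_name("topic")
PUBSUB_DEAD_LETTER_TOPIC_ID: str = resource_name("dlq")
SECRET_MANAGER_NAME: str = resource_name("api-key")
DATAFLOW_JOB_NAME: str = resource_name("job")
```

Deriving every name from one constant means a later rename touches a single line.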

ash-ddog · 2025-10-28T19:44:16Z

+            )
+            continue
+
+        if subscription_search[0].get("topic") != topic_full_name:


Why/how could this ever happen? 🤔

I encountered this issue while testing. I successfully ran the script from start to finish, then deleted the topic.

When I reran the script, the topic was recreated, but the subscription still pointed to the deleted topic. This check ensures that if the subscription is pointing to an invalid or deleted topic, it will be redirected to the correct one.
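
For context, the mismatch check can be sketched as a small pure function. This is an illustration under assumptions, not the script's code: `subscription_topic_status` is a hypothetical helper, and it relies on Pub/Sub reporting `_deleted-topic_` in a subscription's `topic` field once the topic has been deleted.

```python
# Value Pub/Sub reports in a subscription's "topic" field after topic deletion.
DELETED_TOPIC_MARKER = "_deleted-topic_"


def subscription_topic_status(subscription: dict, topic_full_name: str) -> str:
    """Classify how a subscription relates to the expected topic.

    Returns "ok", "orphaned" (topic was deleted), or "mismatch"
    (subscription is attached to some other topic).
    """
    topic = subscription.get("topic")
    if topic == topic_full_name:
        return "ok"
    if topic == DELETED_TOPIC_MARKER:
        return "orphaned"
    return "mismatch"
```

Both non-"ok" results take the same branch as the `!=` comparison in the diff; the split only makes the deleted-topic case visible in a log message.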

ash-ddog · 2025-10-28T19:53:45Z

+            step_reporter.report(
+                message=f"Creating Pub/Sub topic '{topic_id}' in project '{project_id}'..."
+            )
+            gcloud(f"pubsub topics create {topic_id} --project={project_id}")


Just noting that I do not see any options to topics create or subscription create

From a quick search, that does seem right.

ash-ddog · 2025-10-28T20:16:41Z

+def create_datadog_logs_api_key() -> str:
+    """Create a Datadog logs API key."""
+    response, status = dd_request(
+        "GET",
+        f"/api/v2/api_keys?filter={SECRET_MANAGER_NAME}",
+    )
+    if status != 200:
+        raise RuntimeError(f"Failed to get API key: {response}")
+
+    json_response = json.loads(response)
+    data: list[dict[str, Any]] = list(
+        filter(
+            lambda key: key.get("attributes", {}).get("name") == SECRET_MANAGER_NAME,
+            json_response.get("data", []),
+        )
+    )
+
+    if len(data) > 0:
+        api_key_id = data[0].get("id")
+        response, status = dd_request(
+            "GET",
+            f"/api/v2/api_keys/{api_key_id}",
+        )
+        if status != 200:
+            raise RuntimeError(f"Failed to get API key: {response}")
+
+        json_response = json.loads(response)
+        return json_response.get("data", {}).get("attributes", {}).get("key")
+
+    response, status = dd_request(
+        "POST",
+        "/api/v2/api_keys",
+        {
+            "data": {
+                "type": "api_keys",
+                "attributes": {
+                    "name": SECRET_MANAGER_NAME,
+                },
+            },
+        },
+    )
+    if status != 201:
+        raise RuntimeError(f"Failed to create API key: {response}")
+
+    json_response = json.loads(response)
+    return json_response.get("data", {}).get("attributes", {}).get("key")


Noticed some duplicated code so tried to refactor. But prob more important: can we name differently and change error messages?

for naming, I mean find_or_create (and the comment as well)
For error I mean Failed to search API keys:

def find_or_create_datadog_api_key() -> str:
    """Find or create a Datadog API key."""
    response, status = dd_request(
        "GET",
        f"/api/v2/api_keys?filter={SECRET_MANAGER_NAME}",
    )
    if status != 200:
        raise RuntimeError(f"Failed to search API keys: {response}")

    api_key_info: dict[str, Any] = next(
        filter(
            lambda key: key.get("attributes", {}).get("name") == SECRET_MANAGER_NAME,
            json.loads(response).get("data", []),
        ),
        None,
    )

    if api_key_info:
        response, status = dd_request(
            "GET",
            f"/api/v2/api_keys/{api_key_info.get('id')}",
        )
        if status != 200:
            raise RuntimeError(f"Failed to get API key: {response}")
    else:
        response, status = dd_request(
            "POST",
            "/api/v2/api_keys",
            {
                "data": {
                    "type": "api_keys",
                    "attributes": {
                        "name": SECRET_MANAGER_NAME,
                    },
                },
            },
        )
        if status != 201:
            raise RuntimeError(f"Failed to create API key: {response}")

    return json.loads(response).get("data", {}).get("attributes", {}).get("key")

Also, this seems semi-sensitive. Are we sure we want to raise an error with the entire response? Any risk of leaking sensitive info there? Can we audit the entire script for this type of thing? 🙏

Ty I took the refactor.

I'm not concerned about sensitive info here, as I'm confident these error responses don’t contain anything sensitive. However, just to be safe, I removed returning the full error response in the GET call.

Also, these errors will only be visible to the customer and won't appear in our logs or system.

> I removed returning the full error response in the GET call.

Sorry for the back and forth. If you're confident this won't print anything sensitive, we may want to leave it in? i.e. A customer with a support case can then give us the error message?
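
One possible middle ground, sketched here as an assumption rather than what the PR does: surface only the error strings from the response instead of the whole body. `summarize_error` is a hypothetical helper, and the sketch assumes the error body is JSON with a top-level `errors` list of strings.

```python
import json


def summarize_error(response: str, status: int, max_len: int = 200) -> str:
    """Return a short, support-friendly message without echoing the full body."""
    try:
        errors = json.loads(response).get("errors")
    except (ValueError, TypeError, AttributeError):
        # Body was not JSON, was None, or was not a JSON object.
        errors = None
    if isinstance(errors, list) and errors:
        detail = "; ".join(str(error) for error in errors)
    else:
        detail = "<no error details>"
    return f"HTTP {status}: {detail[:max_len]}"
```

A call site would then read `raise RuntimeError(f"Failed to get API key: {summarize_error(response, status)}")`, which still gives a customer something concrete to paste into a support case.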

ash-ddog · 2025-10-28T20:41:36Z

+            f"logging sinks describe {log_sink_name} {resource_container_filter}",
+            *["writerIdentity"],
+        )
+        if writer_identity := sink_info.get("writerIdentity"):


ash-ddog · 2025-10-28T20:42:04Z

+                    --quiet"
+    )
+
+    dataflow_job_name: str = "pubsub-to-datadog-job"


Nit: constant?

ash-ddog · 2025-10-28T20:43:14Z

+
+    matched_dataflow_jobs = gcloud(
+        f"dataflow jobs list --project={project_id} --region={region} "
+        f"--filter='name:{dataflow_job_name} AND NOT (state=DONE OR state=FAILED OR state=CANCELLED OR state=DRAINED OR state=UPDATED)'"


Why use exclusions instead of state=RUNNING? What if they add another state?

The conditional below was originally checking the length of the list of jobs with state=DONE or state=FAILED, etc., which was incorrect, so I added a NOT to fix it. I agree that just checking for state=RUNNING is much simpler, so that's been updated.
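
To illustrate the concern about new states, the two filters can be modelled as plain predicates. This is only a sketch: the function names are hypothetical, the excluded states are copied from the diff above, and `SOME_FUTURE_STATE` is a made-up placeholder.

```python
# States excluded by the original filter, copied from the diff above.
EXCLUDED_STATES = {"DONE", "FAILED", "CANCELLED", "DRAINED", "UPDATED"}


def matches_exclusion_filter(state: str) -> bool:
    """Original approach: anything not explicitly excluded counts as active."""
    return state not in EXCLUDED_STATES


def matches_running_filter(state: str) -> bool:
    """Updated approach: only an explicit RUNNING state counts."""
    return state == "RUNNING"


for state in ["RUNNING", "DONE", "SOME_FUTURE_STATE"]:
    print(state, matches_exclusion_filter(state), matches_running_filter(state))
```

The exclusion predicate treats the unknown state as an existing active job, while the positive match does not, so only the latter keeps its meaning if states are added later.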

ash-ddog · 2025-10-28T20:47:10Z

+    step_reporter.report(
+        message=f"Successfully created Dataflow job '{dataflow_job_name}'"
+    )


Is there something we can do to ensure things are working for the customer? Or at least wait til the job is running?

Unfortunately, we can’t track that during script execution. The best we can do is verify that the resources were created successfully.

We could pause the script and wait for the job to start processing, but that would take a ton of time. I am thinking about this for the UI changes, maybe pointing the user to a logs query or a Dataflow metric.

> The best we can do is verify that the resources were created successfully. We could pause the script and wait for the job to start processing, but that would take a ton of time.

Yea sorry, not suggesting we wait minutes for logs to start flowing. But, is it worth polling for maybe 10s or so to ensure the job is in the RUNNING state and not FAILED or something? Again, this is optional, just an idea.
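
A rough sketch of that short poll, under stated assumptions: `wait_for_job_state` is a hypothetical helper, `get_state` is an injected callable that returns the job's current state as a string, and the state names and the 10-second budget are illustrative defaults.

```python
import time
from typing import Callable


def wait_for_job_state(
    get_state: Callable[[], str],
    wanted: str = "RUNNING",
    failed: frozenset = frozenset({"FAILED", "CANCELLED"}),
    timeout_s: float = 10.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches `wanted`, a failed state, or the timeout.

    Returns the last observed state so the caller can decide what to report.
    """
    deadline = time.monotonic() + timeout_s
    state = get_state()
    while state != wanted and state not in failed and time.monotonic() < deadline:
        sleep(interval_s)
        state = get_state()
    return state
```

The caller could report success only when the returned state equals `wanted`, and otherwise print the observed state alongside a pointer to the Dataflow console.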

ash-ddog · 2025-10-28T20:50:30Z

        },
    )
-    @patch("gcp_integration_quickstart.requests.request")
+    @patch("gcp_shared.requests.request")


Out-of-scope but something to think about.

Has anyone started to think about a way to have integration tests for this stuff that run against a real GCP env? The amount of patching in the tests is a bit worrisome. That said, I know we try these flows a lot before release, but it still would be nice to have some automated way to ensure things continue to work against a real environment (i.e. maybe a gcloud command changes or a role is renamed or something of that sort).

Yeah, integration testing for something like this isn't trivial; there are multiple cases to consider:

First-time creation

Some resources created & deleted—verify that the script can recreate & update them

If a command changes and starts causing errors, we have telemetry in place to monitor the failure rates for the script.

I'll note that I made sure not to use alpha or beta gcloud commands, since those are subject to change. I'm fairly confident that the existing gcloud commands will remain stable (backward compatibility in the case of future changes).

For now, I'm confident that verifying the exact gcloud commands are executed in the right order, along with smoke testing things end-to-end, is sufficient.
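
As a rough illustration of that style of test (not taken from the PR's test suite), a mocked runner can record each command so the test asserts on the exact sequence. `create_topics` is a hypothetical stand-in for a script step, and `my-project` is a placeholder project ID.

```python
from unittest.mock import Mock, call


def create_topics(gcloud, project_id: str, topic_ids: list[str]) -> None:
    """Hypothetical script step: create each topic in order."""
    for topic_id in topic_ids:
        gcloud(f"pubsub topics create {topic_id} --project={project_id}")


# The mock stands in for the real gcloud runner and records every call.
gcloud = Mock(return_value=None)
create_topics(
    gcloud,
    "my-project",
    ["export-logs-to-datadog", "export-failed-logs-to-datadog"],
)

# call_args_list preserves order, so this checks both content and sequence.
assert gcloud.call_args_list == [
    call("pubsub topics create export-logs-to-datadog --project=my-project"),
    call("pubsub topics create export-failed-logs-to-datadog --project=my-project"),
]
```

This catches regressions in the commands the script builds, but not changes in how gcloud itself interprets them, which is the gap an integration test against a real project would cover.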

> integration testing for something like this isn't trivial

Not trivial doesn't mean not worth doing, but I hear ya. I do think it's possible. It would be:

Create a new project or use an existing test project

Run the script, assuming customer options

Ensure the resources are created in the project and the job is running

Teardown

We can talk offline about it.

tedkahwaji force-pushed the teddy.kahwaji/gcp-3046 branch 3 times, most recently from ad7f89c to a062a88 Compare October 26, 2025 17:00

tedkahwaji changed the title ~~Teddy.kahwaji/gcp 3046~~ GCP-3046: Add GCP Dataflow Quick Start Script Oct 26, 2025

tedkahwaji force-pushed the teddy.kahwaji/gcp-3046 branch 5 times, most recently from c9c005c to 3d78fa6 Compare October 27, 2025 11:57

tedkahwaji marked this pull request as ready for review October 27, 2025 11:58

tedkahwaji requested a review from a team as a code owner October 27, 2025 11:58

tedkahwaji requested review from mvhdd and removed request for a team October 27, 2025 11:58

tedkahwaji added the gcp-integrations label Oct 27, 2025

tedkahwaji requested review from benjjs and gpalmz October 27, 2025 15:04

benjjs approved these changes Oct 27, 2025

gpalmz approved these changes Oct 27, 2025

tedkahwaji force-pushed the teddy.kahwaji/gcp-3046 branch from 1240c52 to 5d27a90 Compare October 27, 2025 19:42

ash-ddog approved these changes Oct 28, 2025

tedkahwaji added 4 commits October 30, 2025 12:12

implement gcp dataflow log forwarding quickstart script

47cbfa2

refactor shared modules & update tests

4b7550b

support updating log sinks

6e50cc7

update gcloud commands to use a component for building commands

861cc30

tedkahwaji force-pushed the teddy.kahwaji/gcp-3046 branch from 9ba3437 to 861cc30 Compare October 30, 2025 16:15

tedkahwaji merged commit 679fcdc into main Oct 30, 2025
2 checks passed

tedkahwaji deleted the teddy.kahwaji/gcp-3046 branch October 30, 2025 17:48
