From 86cbf673f76e75462aed098f039f79d8e645f9d0 Mon Sep 17 00:00:00 2001
From: May Lee
Date: Wed, 3 Dec 2025 16:56:15 -0500
Subject: [PATCH 1/2] reorg troubleshooting

---
 .../troubleshooting.md | 82 +++++++++++--------
 1 file changed, 46 insertions(+), 36 deletions(-)

diff --git a/content/en/observability_pipelines/monitoring_and_troubleshooting/troubleshooting.md b/content/en/observability_pipelines/monitoring_and_troubleshooting/troubleshooting.md
index 1bada407e1a..5d0a070516c 100644
--- a/content/en/observability_pipelines/monitoring_and_troubleshooting/troubleshooting.md
+++ b/content/en/observability_pipelines/monitoring_and_troubleshooting/troubleshooting.md
@@ -60,43 +60,56 @@ docker run -i -e DD_API_KEY= \
   datadog/observability-pipelines-worker run
 ```
 
-## Seeing delayed logs at the destination
+## Worker log issues
 
-Observability Pipelines destinations batch events before sending them to the downstream integration. For example, the Amazon S3, Google Cloud Storage, and Azure Storage destinations have a batch timeout of 900 seconds. If the other batch parameters (maximum events and maximum bytes) have not been met within the 900-second timeout, the batch is flushed at 900 seconds. This means the destination component can take up to 15 minutes to send out a batch of events to the downstream integration.
-
-These are the batch parameters for each destination:
+### No Worker logs in Log Explorer
 
-{{% observability_pipelines/destination_batching %}}
-
-See [event batching][6] for more information.
+If you do not see Worker logs in [Log Explorer][12], make sure they are not getting excluded in your log pipelines. Worker logs must be indexed in Log Management for optimal functionality. The logs provide deployment information, such as Worker status, version, and any errors, that is shown in the Observability Pipelines UI. The logs are also helpful for troubleshooting Worker or pipeline issues. All Worker logs have the tag `source:op_worker`.
-## Duplicate Observability Pipelines logs
+### Duplicate Observability Pipelines logs
 
 If you see duplicate Observability Pipelines logs in [Log Explorer][7] and your Agent is running in a Docker container, you must exclude Observability Pipelines logs using the `DD_CONTAINER_EXCLUDE_LOGS` environment variable. For Helm, use `datadog.containerExcludeLogs`. This prevents duplicate logs, as the Worker also sends its own logs directly to Datadog. See [Docker Log Collection][8] or [Setting environment variables for Helm][9] for more information.
 
-## Getting an error when installing a new version of the Worker
+## Worker issues and errors
+
+### Getting an error when installing a new version of the Worker
 
 If you try to install a new version of the Worker in an instance that is running an older version of the Worker, you get an error. You need to [uninstall][11] the older version before you can install the new version of the Worker.
 
-## No Worker logs in Log Explorer
+### Worker is not starting
 
-If you do not see Worker logs in [Log Explorer][12], make sure they are not getting excluded in your log pipelines. Worker logs must be indexed in Log Management for optimal functionality. The logs provide deployment information, such as Worker status, version, and any errors, that is shown in the Observability Pipelines UI. The logs are also helpful for troubleshooting Worker or pipelines issues. All Worker logs have the tag `source:op_worker`.
+If the Worker is not starting, Worker logs are not sent to Datadog and are not visible in Log Explorer for troubleshooting. To view the logs locally, use the following command:
+
+- For a VM-based environment:
+  ```
+  sudo journalctl -u observability-pipelines-worker.service -b
+  ```
-## Too many files error
+
+- For Kubernetes:
+  ```
+  kubectl logs <POD_NAME>
+  ```
+  An example of `<POD_NAME>` is `opw-observability-pipelines-worker-0`.
+
+### Certificate verify failed
+
+If you see an error with `certificate verify failed` and `self-signed certificate in certificate chain`, see [TLS certificates][16]. Observability Pipelines does not accept self-signed certificates because they are not secure.
+
+### Ensure your organization is enabled for RC
+
+If you see the error `Please ensure you organization is enabled for RC`, ensure your Worker API key has [Remote Configuration enabled][17]. See [Security considerations][19] for information on safeguards implemented for Remote Configuration.
+
+### Too many files error
 
 If you see the error `Too many files` and the Worker processes repeatedly restart, it could be due to a low file descriptor limit on the host. To resolve this issue for Linux environments, set `LimitNOFILE` in the systemd service configuration to `65,536` to increase the file descriptor limit.
 
-## The Worker is not receiving logs from the source
+### The Worker is not receiving logs from the source
 
 If you have configured your source to send logs to the Worker, make sure the port that the Worker is listening on is the same port to which the source is sending logs.
 
 If you are using RHEL and need to forward logs from one port (for example UDP/514) to the port the Worker is listening on (for example, UDP/1514, which is an unprivileged port), you can use [`firewalld`][14] to forward logs from port 514 to port 1514.
 
-## Logs are not getting forwarded to the destination
-
-Run the command `netstat -anp | find "<PORT>"` to check that the port that the destination is listening on is not being used by another service.
-
-## Failed to connect error
+### Failed to connect error
 
 If you see an error similar to one of these errors:
@@ -125,34 +138,31 @@ curl --location 'http://ab52a1d102c6f4a3c823axxx-xxxxx.us-west-2.elb.amazonaws.c
 
 The curl command you use is based on the port you are using, as well as the path and expected payload from your source.
-## Worker is not starting
+## General pipeline issues
 
-If the Worker is not starting, Worker logs are not sent to Datadog and are not visible in Log Explorer for troubleshooting. To view the logs locally, use the following command:
+### Missing environment variable
 
-- For a VM-based environment:
-  ```
-  sudo journalctl -u observability-pipelines-worker.service -b
-  ```
+If you see the error `Configuration is invalid. Missing environment variable $`, make sure you add the environment variables for your source, processors, and destinations when you install the Worker. See [Environment Variables][18] for a list of source, processor, and destination environment variables.
 
-- For Kubernetes:
-  ```
-  kubectl logs <POD_NAME>
-  ```
-  An example of `<POD_NAME>` is `opw-observability-pipelines-worker-0`.
+## Logs pipeline issues
 
-## Certificate verify failed
+### Seeing delayed logs at the destination
 
-If you see an error with `certificate verify failed` and `self-signed certificate in certificate chain`, see [TLS certificates][16]. Observability Pipelines does not accept self-signed certificates because they are not secure.
+Observability Pipelines destinations batch events before sending them to the downstream integration. For example, the Amazon S3, Google Cloud Storage, and Azure Storage destinations have a batch timeout of 900 seconds. If the other batch parameters (maximum events and maximum bytes) have not been met within the 900-second timeout, the batch is flushed at 900 seconds. This means the destination component can take up to 15 minutes to send out a batch of events to the downstream integration.
 
-## Ensure your organization is enabled for RC
+These are the batch parameters for each destination:
 
-If you see the error `Please ensure you organization is enabled for RC`, ensure your Worker API key has [Remote Configuration enabled][17]. See [Security considerations][19] for information on safeguards implemented for Remote Configuration.
+{{% observability_pipelines/destination_batching %}} -## Missing environment variable +See [event batching][6] for more information. -If you see the error `Configuration is invalid. Missing environment variable $`, make sure you add the environment variables for your source, processors, and destinations when you install the Worker. See [Environment Variables][18] for a list of source, processor, and destination environment variables. +### Logs are not getting forwarded to the destination + +Run the command `netstat -anp | find ""` to check that the port that the destination is listening on is not being used by another service. + +## Component issues -## Failed to sync quota state +### Failed to sync quota state error The quota processor is synchronized across all Workers in a Datadog organization. For the synchronization, there is a default rate limit of 50 Workers per organization. When there are more than 50 Workers for an organization: - The processor continues to run, but does not sync correctly with the other Workers, which can result in logs being sent after the quota limit has been reached. 
From bb514f40dcaebba6612e19be16c9e7ca6f81c024 Mon Sep 17 00:00:00 2001
From: May Lee
Date: Thu, 4 Dec 2025 13:12:21 -0500
Subject: [PATCH 2/2] small reorder

---
 .../troubleshooting.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/content/en/observability_pipelines/monitoring_and_troubleshooting/troubleshooting.md b/content/en/observability_pipelines/monitoring_and_troubleshooting/troubleshooting.md
index 5d0a070516c..359302668c7 100644
--- a/content/en/observability_pipelines/monitoring_and_troubleshooting/troubleshooting.md
+++ b/content/en/observability_pipelines/monitoring_and_troubleshooting/troubleshooting.md
@@ -99,10 +95,6 @@ If you see an error with `certificate verify failed` and `self-signed certificat
 
 If you see the error `Please ensure you organization is enabled for RC`, ensure your Worker API key has [Remote Configuration enabled][17]. See [Security considerations][19] for information on safeguards implemented for Remote Configuration.
 
-### Too many files error
-
-If you see the error `Too many files` and the Worker processes repeatedly restart, it could be due to a low file descriptor limit on the host. To resolve this issue for Linux environments, set `LimitNOFILE` in the systemd service configuration to `65,536` to increase the file descriptor limit.
-
 ### The Worker is not receiving logs from the source
 
 If you have configured your source to send logs to the Worker, make sure the port that the Worker is listening on is the same port to which the source is sending logs.
@@ -138,6 +134,10 @@ curl --location 'http://ab52a1d102c6f4a3c823axxx-xxxxx.us-west-2.elb.amazonaws.c
 
 The curl command you use is based on the port you are using, as well as the path and expected payload from your source.
 
+### Too many files error
+
+If you see the error `Too many files` and the Worker processes repeatedly restart, it could be due to a low file descriptor limit on the host. To resolve this issue for Linux environments, set `LimitNOFILE` in the systemd service configuration to `65,536` to increase the file descriptor limit.
+
 ## General pipeline issues
 
 ### Missing environment variable
@@ -146,6 +146,10 @@ If you see the error `Configuration is invalid. Missing environment variable $
 
+### Logs are not getting forwarded to the destination
+
+Run the command `netstat -anp | find "<PORT>"` to check that the port that the destination is listening on is not being used by another service.
+
 ### Seeing delayed logs at the destination
 
 Observability Pipelines destinations batch events before sending them to the downstream integration. For example, the Amazon S3, Google Cloud Storage, and Azure Storage destinations have a batch timeout of 900 seconds. If the other batch parameters (maximum events and maximum bytes) have not been met within the 900-second timeout, the batch is flushed at 900 seconds. This means the destination component can take up to 15 minutes to send out a batch of events to the downstream integration.
 
@@ -156,10 +160,6 @@ These are the batch parameters for each destination:
 
 {{% observability_pipelines/destination_batching %}}
 
 See [event batching][6] for more information.
 
-### Logs are not getting forwarded to the destination
-
-Run the command `netstat -anp | find "<PORT>"` to check that the port that the destination is listening on is not being used by another service.
-
 ## Component issues
 
 ### Failed to sync quota state error
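The `LimitNOFILE` fix that both patches move around can be captured as a systemd drop-in. This is a minimal sketch, not text from the patched page: the drop-in filename is an assumption, and the unit name is taken from the `journalctl` command in the doc.

```ini
# Hypothetical drop-in path (assumed): /etc/systemd/system/observability-pipelines-worker.service.d/limits.conf
# Raises the Worker's open file descriptor limit to avoid "Too many files" restarts.
[Service]
LimitNOFILE=65536
```

After creating the drop-in, apply it with `sudo systemctl daemon-reload` followed by `sudo systemctl restart observability-pipelines-worker`.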