- 
                Notifications
    You must be signed in to change notification settings 
- Fork 395
Batch traces before making API calls #291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. Left some small nits/comments
| traces_grouped_by_tags[tags] += traces | ||
|  | ||
| batched_trace_payloads = [] | ||
| batcher = DatadogBatcher(256 * 1000, 2 * 1000 * 1000, 200) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these constants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are:
- The max allowed item size in bytes (256 KB)
- The max allowed batch size in bytes (2 MB)
- The max number of items per batch (200)
I copy-pasted these specific numbers from the other usage of DatadogBatcher in this file, which is for batching logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we refactor them into constant variables?
| for _, tracePayload := range tracePayloads { | ||
| combinedPayload.Traces = append(combinedPayload.Traces, tracePayload.Traces...) | ||
| } | ||
| fmt.Sprintf("aggregated %d traces into single payload", len(combinedPayload.Traces)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will always get logged. We don't have a great way of turning debug logging on/off in the go lib right now
| combinedPayload := &pb.TracePayload{ | ||
| HostName: tracePayloads[0].HostName, | ||
| Env: tracePayloads[0].Env, | ||
| Traces: make([]*pb.APITrace, 0), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do we already know the size this slice will be? Maybe we can pre-allocate
| env_patch.stop() | ||
|  | ||
|  | ||
| class TestBatchTracePayloads(unittest.TestCase): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice test
* Increasing parsing priority for aws/rds source * Add dms to list of recognized log sources * Update python and checkout action to the latest * Add integration test note to template * Document how to find the installed forwarder Lambda * Fixes DataDog#207 by allowing CloudWatch Logs & S3 to invoke the Forwarder * Add aws_account tag to logs and enhanced metrics * Update snapshots for new tags * Remove unused kwarg * Bump version from 3.3.0 to 3.4.0 * Update README.md * Fixed DataDog#230 as cfn-lint now passes with no errors * Update "extractResourceId" function This extends "extractResourceId" function to extract "source" tag from applications that are deployed at Tenant level. The existing behavior is to discard all resourceId that don't begin with "/subscriptions/" thus it doesn't handle Tenant level resourceId that begin with "/tenants/". Adds a else if block to handle resourceIds with Tenant level deployemnts(Like Active Directory) I have kept the source and tag extraction logic similar to what currently exists. * Add cfn-lint * Fix integration test snapshots * Bump version from 3.4.0 to 3.5.0 * Exclude unuseful files from zip * Instructions for safe deletion * Add manual installation steps * Terraform installation and permissions * Apply suggestions from code review Co-Authored-By: Stephen Pinkerton <stephen.pinkerton@datadoghq.com> * Use latest cfn-lint * Fix tests * Change enhanced tag parsing rules * Update comment * Only remove colons from tag keys * Fix typo in comment * Blank out forwarder version in snapshot tests * Remove camel case parsing rules * Bump version to 3.6.0 * Remove empty comments * Bump version from 3.5.0 to 3.6.0 * Disallow empty value for required fields * feat: Enable supplying DdApiKeySecret directly - Works around indefinite reapply changes when using terraform to deploy cloudformation with NoEcho parameters (hashicorp/terraform-provider-aws#55) * Preserve mandatory DdApiKey - Make DdApiKeySecretArn an Advanced option * Ensure empty tags can be sanitized We have a few lambdas that have tags with empty values, and that was causing an exception to be thrown during the sanitization function. This fixes that behavior, with tests. * update mocha to remove vuln of minimist version * Bump layer versions * Update cloudwatch snapshots * Bump forwarder version * Update layer version * Remove bad character * Bump version from 3.6.0 to 3.7.0 * Refresh tags when new lambda arn is encountered * check that log message is a string for enhanced metrcis submission * Fix typo * Bump version from 3.7.0 to 3.8.0 * add jvanbrunschot changes to update to python 3 and use the new lambda layer * Add tool for building release bundle with datadog_lambda dep * Make release process work with new zip * reverting to use of Stats class and updating it to python3 * Style changes * Removing lambda layer imports * Removing lambda layer call * Move trace forwarder into repository * Update trace forwarder deps * Add trace_forwarder as package to forwarder * Fix integration tests * Add integration tests to workflow * Fix formatting * Fix bad substitution error * Updating docs to reflect rds support of python 3 * Fix github actions error in script * Add copyright disclaimers * Run trace forwarder tests in CI * Bump version from 3.8.0 to 3.9.0 * Remove gov cloud references * Re-add note about disablng enhanced metrics * Add private link to variables to template.yml * Fix template errors * Store lowercase ARN keys in tags cache * Updating active_directory to activedirectory Azure AD source changed from azure.active_directory to azure.activedirectory to match other azure source convention. * Bump version from 3.9.0 to 3.10.0 * Fix lambda_function private link errors * Update readme with private link details * Add notes to the update instructions * Fix a typo Co-authored-by: Stephen Firrincieli <stephen.firrincieli@datadoghq.com> * Bump version from 3.10.0 to 3.11.0 * Update Terraform code to ignore parameter change Without this change `terraform apply` would keep trying to apply changes to the Cloudformation Stack even though its already been deployed. * Updating if condition for tenant to (resourceId.length > 4 && resourceId[4]) * Update PULL_REQUEST_TEMPLATE.md * Make debug logging more verbose * [aws forwader] Update readme with info on capabilities used * Bump version from 3.11.0 to 3.12.0 * Update PULL_REQUEST_TEMPLATE.md * [aws] add tag for memorysize * Revert "Add memorysize tag to enhanced lambda metrics" * add memorysize tag to enhanced metrics * fixed formatting * fixed formatting with black * add init duration metric and coldstart tags * add additional comment * put logging level back to info * update unittest for new tag and metric * include unittest in lambda checks * fix lambda check * update PR template with unit test * Ensure the forwarder can be installed to us-gov AWS accounts * updated optional regex patterns * Bump version from 3.12.0 to 3.13.0 * update tag logic and test * remove enhanced lambda redundancies * updated metric list * fix typo * fix comment * update comment * update unittest and CI * add boto3 for unittest * add region env variable for boto * try setting AWS region for boto * set aws region in test file * set env variable in lambdacheck instead * add timed out enhanced metric * reset log level to info * update enhanced lambda timed out metric name * update to timeouts and count * fix typo * update timed out variable name * fix variable name * Update truncation behavior (DataDog#281) * Change private link URL (DataDog#282) * Bump version from 3.13.0 to 3.14.0 * Document AWS Forwarder CloudFormation params * Update troubleshooting, remove TODOs. * Move installation_alternatives * Copy edited * Ignore changes actually leads to a problem upon update, the masked value **** will be used instead * Update template verbiage * tcp connection * Add out_of_memory enhanced metric (DataDog#287) * Update installation_alternatives.md * Update installation_alternatives.md * Batch traces before making API calls (DataDog#291) * Bump version from 3.14.0 to 3.15.0 * [azure event hub forwarder] Convert strings to json logs (DataDog#292) * initial commit to update event hub forwader * cleanup * added another test and a few fixes * update tests * Batch traces by env instead of tags (DataDog#296) * Parallelize GitHub Workflows (DataDog#297) * Run integration tests with Python 3.8 (DataDog#298) * Refactor parse_event_source to prevent mis-identification of sources (DataDog#299) * Refactor parse_event_source and add tests for it * Fix typo * Add unit test * Bump version from 3.15.0 to 3.16.0 * Use us-west-2 for installation test (DataDog#300) * Remove code to support for Python 2.7 (DataDog#302) * Refactor settings into a separate file (DataDog#301) * Update README.md * Override service metadata field with service tag (DataDog#305) * Bump version from 3.16.0 to 3.16.1 * [documentation] updates to prepare to single source this into the documentation site (DataDog#304) * removes alternatives page to be added to the main readme * edits and reorg of the main readme * Update aws/logs_monitoring/README.md * Update aws/logs_monitoring/README.md Co-authored-by: Alex <Hesperide@users.noreply.github.com> * Use unique variable name for Logger instance (DataDog#307) `log` is used in `for log in logs` loops where its use is more justified. Also this fixes a bug in `enhanced_lambda_metrics.py:parse_and_submit_enhanced_metrics()` related to the naming conflict. Signed-off-by: Eugene Glotov <kivagant@gmail.com> * adding decryption fix (DataDog#309) * Updated CloudWatch deploy permissions (DataDog#313) Adds new permissions for the IAM policy, without which the AWS CloudFormations deployment fails with "ForwarderZipsBucket, CREATE_FAILED, API: s3:SetBucketEncryption Access Denied" and "ForwarderZipsBucket, CREATE_FAILED, API: s3:PutPublicAccessBlock Access Denied" * Update lodash version (DataDog#314) * [AWS log forwarder] Override service tag for meta in trace (DataDog#316) * override service tag for meta * add test to override the meta key as well * Bump version from 3.16.1 to 3.16.2 * [azure] Refactor and convert Azure eventhub log forwarder to use HTTP (DataDog#315) * initial commit for http change * updates * pin lodash version * lint * update lodash version * use promises and finally working retries * Revert "use promises and finally working retries" This reverts commit 4b9a9f5. * fix it all and get it workng * remove lodash changes for another pr * update test * Clarify PrivateLink Endpoints in README (DataDog#317) * Enable Trace Forwarding via AWS PrivateLink (DataDog#318) * Bump version from 3.16.2 to 3.16.3 * fix link, spelling (DataDog#320) * Fix version number incrementing (DataDog#321) * Update integration tests to use raw JSON data (DataDog#319) * Merge ddtags (DataDog#326) * Bump version from 3.16.3 to 3.16.4 * Fix version number (DataDog#328) * Run integration tests on wrapped Forwarder (captures metrics) (DataDog#329) * Add --skip-forwarder-build option to integration tests (DataDog#330) * [azure log forwarder] Update lint command and lint blobs monitoring (DataDog#331) * lint test file as well * lint blobs monitoring * Improve integration tests (DataDog#333) * Fix integration tests --update option (DataDog#334) * Translate binary data type to be human readable (DataDog#332) * Handle binary data type * Add tests and handle malformed logs * Lint files * Adjust for comments * David.kim/updated lambda forwarder (DataDog#327) * Add first pass of telemetry to log forwarder * Updated lambda forwarder with telemetry * Update lambda_function.py Revise comment * Update lambda_function.py Remove DD_FORWARD_TRACES which is not a real env var * Linted with black * Refactored telemetry to fire from forward_* functions and removed counter * Ran black linter * Fix local/global variable issue * Fix local/global variable issue and run black linter * Refactored to use set_forwarder_telemetry_tags and add telemetry tags to enhanced metrics * Finalize log forwarder telemetry and update integration test snapshots * Run black linter on lambda_function and enhanced_lambda_metrics * Address final comments and update integration tests * Address final comments and update snapshots * Add final tweaks * Update error handling for event_type and update snapshots * Add error handling to event_type and remove duplicate tags in ehanced_lambda_metrics * Forwarder: Additional target lambdas (DataDog#335) * Add integration test * more changes to improve test script * other script fixes * fix format * make integration tests optinally deploy lambda * fix int tests in ci * update integration tests * add template changes to take multiple functions * lint formatting * add installation test verify to pull request template * Update aws/logs_monitoring/README.md Co-authored-by: Tian Chu <tian.chu@datadoghq.com> * cr feedback * add unit tests and make int tests default without the external lambda Co-authored-by: Tian Chu <tian.chu@datadoghq.com> * Bump version from 3.16.4 to 3.17.0 * Update README.md (DataDog#336) * Update README with logs dual shipping info (DataDog#338) * update README.md * Update aws/logs_monitoring/README.md Co-authored-by: Tian Chu <tian.chu@datadoghq.com> Co-authored-by: Tian Chu <tian.chu@datadoghq.com> * Fix flakiness in integration tests (DataDog#339) * Fix access to S3 within VPC (DataDog#340) * Support sending data from VPC through a proxy (DataDog#342) Support sending data through a proxy * Bump version from 3.17.0 to 3.18.0 * Fix forwarder template to work in AWS China (DataDog#347) Co-authored-by: stroem <jonathan@devies.se> * Use KMS for S3 bucket encryption (DataDog#349) * Bump version from 3.18.0 to 3.18.1 * [azure] Make azure log forwarder tests cleaner and easier to understand (DataDog#350) * update azure tests * add and update binary tests * lint * rename and cleanup * fix setUp * Handle malformed metrics (DataDog#352) * handle malformed metrics * fix tests * fix formatting * Bump version from 3.18.1 to 3.18.2 * [sls-661] Use s3 for shared tags cache (DataDog#343) * Use s3 for tags cache * Update unit tests * Update snapshot for integration test * Case when s3 cache is expired * Set ddfetchlambdatags to true by default * Update label * Move ddfetchlambdatags out of experimental * Allow additional tags for internal metrics * Add unit test for client error case * New line at EOF * Patch aws boto3 pagination calls * Import env vars from settings * Script to run unittests (sets env vars) * Add env vars to github workflow * Try quotes * Update description for DDFetchLambdaTags * Fix env var name * Lambda function for cache test * Remove lambda functions after external lambdas test * Integration test for s3 cache * Update events with cache lambda name * Update snapshots * Run cache integration test by default * Remove guids from snapshots * Update snapshots * Turn cache tests off by default * Snapshots for when cache tests are on * Snapshots for when cache test is off * Undo lambda function name change * Enable locking for S3 cache * Add try catch for cache lock creation * Update unit tests * Update snapshot for cache integration test * Fix concat issue * Fix operand type issue * Add deletion permission for s3 bucket * Group DdFetchLambdaTags under advanced * Add debug logs * Bump version from 3.18.2 to 3.19.0 * Update README.md * Fix manual install anchor link reference. (DataDog#311) * Fix a typo * Add mention of undocumented SNS support (DataDog#286) Had to check the source code to confirm this was supported https://github.com/DataDog/datadog-serverless-functions/blob/e9180b1bbdd9ccef14d4c498d47d2901c28bbea5/aws/logs_monitoring/lambda_function.py#L624 There is sort of/kind of a reference to this, but it's worded in a very strange way. Logging isn't supported, but it is, by picking an ARN that you "want to use". I'm assuming what it really means, is select *the* datadog logs forwarder, you set up here<link to cloudformation forwarder documentation> * change runtime to 3.7 (DataDog#353) * Forwarder: add KMS permission to get log from Encrypted S3 Bucket (DataDog#337) * feat: add KMS permission to get log from Encrypted S3 Bucket * fix: remove none necessary permissiong If you use a customer managed CMK, the Forwarder only needs kms:Decrypt for downloading, typically CloudTrail SSE require CMK and this permission will work for the case. if you use the AWS managed CMK (i.e., key aws/s3), no extra permissions are required for the Forwarder to download the object. * Add support for S3 events triggered via SNS (DataDog#351) * Add support for S3 events triggered via SNS - When S3 events are configured to push to SNS topics - The DD Lambda Forwarder subscribes to the SNS topic - Then in DD console the event shows up as SNS with SNS event data being logged rather than the S3 log being ingested - This change extracts the S3 event from the SNS topic and then extract and ingests the S3 log from it * removing redundant checks Co-authored-by: Tian Chu <tian.chu@datadoghq.com> * simpler validations using try-catch Co-authored-by: Tian Chu <tian.chu@datadoghq.com> Co-authored-by: Tian Chu <tian.chu@datadoghq.com> * Bump version from 3.19.0 to 3.20.0 * Group TagsCacheTTLSeconds under Advanced (DataDog#357) * [azure] Add new source types and other cleanup (DataDog#354) * inital commit for new sources * update tests and replace allSettled * make empty string checking a bit more clear * handle trailing slash Co-authored-by: Ryan Nixon <ryan.nixon@vacasa.com> Co-authored-by: Alex Burgel <aburgel@betterment.com> Co-authored-by: chenrui <rui@meetup.com> Co-authored-by: DarcyRaynerDD <darcy.rayner@datadoghq.com> Co-authored-by: Darcy Rayner <50632605+DarcyRaynerDD@users.noreply.github.com> Co-authored-by: Tian Chu <tian.chu@datadoghq.com> Co-authored-by: Philipp Hoffmann <philipp.hoffmann@edeka.de> Co-authored-by: Stephen Firrincieli <stephen.firrincieli@datadoghq.com> Co-authored-by: Greg Kaestle <flagscript@gmail.com> Co-authored-by: anshumgargdd <48366821+anshumgargdd@users.noreply.github.com> Co-authored-by: Stephen Pinkerton <stephen.pinkerton@datadoghq.com> Co-authored-by: Jim Park <jimp@scribd.com> Co-authored-by: William Richard <wrichard@indigoag.com> Co-authored-by: Garner Fox McCloud <garner@datadoghq.com> Co-authored-by: Andrew Zakordonets <biercoff@gmail.com> Co-authored-by: Jonathan VanBriesen <jon.vanbriesen@datadoghq.com> Co-authored-by: Andre Marcelo-Tanner <andre@enthropia.com> Co-authored-by: Jordan Storms <jordan.storms@datadoghq.com> Co-authored-by: Nicolas Hinsch <nick.hinsch@datadoghq.com> Co-authored-by: chris.agocs <chris.agocs@datadoghq.com> Co-authored-by: Christopher Agocs <agocs@users.noreply.github.com> Co-authored-by: Claudia D'Adamo <claudia.dadamo@datadoghq.com> Co-authored-by: Kaylyn <kaylyn.sigler@datadoghq.com> Co-authored-by: Alex <Hesperide@users.noreply.github.com> Co-authored-by: Eugene Glotov <KIVagant@gmail.com> Co-authored-by: Lim Wei Chiang <19551853+limweichiang@users.noreply.github.com> Co-authored-by: Sergio Prada <chechopradaster@gmail.com> Co-authored-by: Kari Halsted <12926135+kayayarai@users.noreply.github.com> Co-authored-by: Ricky Thomas <ricky@datadoghq.com> Co-authored-by: David D. Kim <38713013+Chronobreak@users.noreply.github.com> Co-authored-by: stroem <jonathan.strom@gmail.com> Co-authored-by: stroem <jonathan@devies.se> Co-authored-by: Harvinder Ghotra <ghotra.harvinder@gmail.com> Co-authored-by: David E. Weekly <david@weekly.org> Co-authored-by: Ryan Romanchuk <rromanchuk@gmail.com> Co-authored-by: Ikiru Yoshizaki <3856350+guitarrapc@users.noreply.github.com> Co-authored-by: Sagar Khanna <khanna.sagar@gmail.com> Co-authored-by: Mike Pua <mike.pua@orangeandbronze.com>
What does this PR do?
Batch traces with like tags before sending them to the Trace Forwarder. Also removes old code that allowed the forwarder to be run without an enhanced metrics file present.
Motivation
This should reduce the number of API calls we make when forwarding traces, which will hopefully allow the Lambda to run faster, particularly when sending a large number of traces.
Testing Guidelines
Unit and integration test coverage. Performance testing with X-Ray.
Additional Notes
Types of changes
Check all that apply