Skip to content

Conversation

@mabdinur
Copy link
Contributor

@mabdinur mabdinur commented Jun 13, 2023

Description

Looking at the failing test below we can see test_single_trace_too_large generates over 20MB of trace data but these traces can be sent in separate payloads (yay partial flushing). This PR mocks AgentWriter.flush_queue() to ensure traces are not submitted. With this change the BufferFull exception will be consistently raised and "trace buffer (%s traces %db/%db) cannot fit trace of size %db, dropping (writer status: %s)" will always be logged.

This PR:

  • Removes time interval env var this not required since the mechanism that submits traces is being mocked.
  • Removes tracer.shutdown() this test should not submit traces to the agent.
  • Decreases the total size of the trace from 200,000 to 30,000.
    • Generating 200,000 spans is overkill and very slow. We only need to queue 30,000 spans to fill the agent writer's buffer and log the buffer full warning.

Background

This test consistently passes locally but often fails in ci (and mainly on older versions of python). My hypothesis is that the time interval to submit traces to the agent is too short. The test does not have enough time to add the trace chunks to overflow the trace encoder's buffer.

Sample failure: https://app.circleci.com/pipelines/github/DataDog/dd-trace-py/37860/workflows/4da477db-5289-40b1-8375-5d6152f14b8f/jobs/2546383

Testing Strategy

YOLO. If test_single_trace_too_large is still flaky after this PR is merged then I have failed my reviewers and myself.

Checklist

  • Change(s) are motivated and described in the PR description.
  • Testing strategy is described if automated tests are not included in the PR.
  • Risk is outlined (performance impact, potential for breakage, maintainability, etc).
  • Change is maintainable (easy to change, telemetry, documentation).
  • Library release note guidelines are followed.
  • Documentation is included (in-code, generated user docs, public corp docs).
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Title is accurate.
  • No unnecessary changes are introduced.
  • Description motivates each change.
  • Avoids breaking API changes unless absolutely necessary.
  • Testing strategy adequately addresses listed risk(s).
  • Change is maintainable (easy to change, telemetry, documentation).
  • Release note makes sense to a user of the library.
  • Reviewer has explicitly acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment.
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@mabdinur mabdinur requested review from a team as code owners June 13, 2023 20:29
@mabdinur mabdinur added the changelog/no-changelog A changelog entry is not required for this PR. label Jun 13, 2023
@mabdinur mabdinur marked this pull request as draft June 13, 2023 20:50
@pr-commenter
Copy link

pr-commenter bot commented Jun 13, 2023

Benchmarks

Comparing candidate commit 1281c2f in PR branch munir/disable-partial-flushing-in-integrations-test with baseline commit 54110dd in branch 1.x.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 86 cases.

@mabdinur mabdinur force-pushed the munir/disable-partial-flushing-in-integrations-test branch from 0aa4cd8 to 2ba1036 Compare June 14, 2023 04:11
@mabdinur mabdinur force-pushed the munir/disable-partial-flushing-in-integrations-test branch from 89344a7 to 3b8b6a1 Compare June 14, 2023 04:31
@mabdinur mabdinur marked this pull request as ready for review June 14, 2023 04:31
@mabdinur mabdinur enabled auto-merge (squash) June 14, 2023 04:31
Copy link
Collaborator

@emmettbutler emmettbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why the test failing on this pull request's CI runs is a useful signal. Shouldn't we make sure it passes before merge?

@mabdinur mabdinur force-pushed the munir/disable-partial-flushing-in-integrations-test branch from de32d5f to 0ca0c61 Compare June 14, 2023 14:03
@mabdinur
Copy link
Contributor Author

mabdinur commented Jun 14, 2023

I'm not sure I understand why the test failing on this pull request's CI runs is a useful signal. Shouldn't we make sure it passes before merge?

Sorry, I opened this PR too early. My first implementation was incorrect. I am currently working on a better solution.

This ci fix is hard to test because its timing based (I think a slower machine or slower connection to the agent causes this test to fail in ci). The best we can do is rerun ci multiple times and hope this test no longer flaky.

@mabdinur mabdinur changed the title ci(integrations): fix flaky test_single_trace_too_large [attempt #2] ci(integrations): removes flaky test_single_trace_too_large Jun 14, 2023
@mabdinur mabdinur force-pushed the munir/disable-partial-flushing-in-integrations-test branch from c85f956 to 1281c2f Compare June 14, 2023 15:46
@mabdinur mabdinur changed the title ci(integrations): removes flaky test_single_trace_too_large ci(integrations): makes test_single_trace_too_large less flaky Jun 14, 2023
@mabdinur mabdinur requested a review from emmettbutler June 14, 2023 17:49
@mabdinur mabdinur merged commit cd6f987 into 1.x Jun 14, 2023
@mabdinur mabdinur deleted the munir/disable-partial-flushing-in-integrations-test branch June 14, 2023 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/no-changelog A changelog entry is not required for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants