Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Remote Configuration boot tags #3315

Merged
merged 20 commits into from
Jan 31, 2024
Merged

Add Remote Configuration boot tags #3315

merged 20 commits into from
Jan 31, 2024

Conversation

lloeki
Copy link
Contributor

@lloeki lloeki commented Dec 11, 2023

2.0 Upgrade Guide notes

What does this PR do?

  • Add tags to record what happened WRT RC boot on a specific request
  • Add tags to record RC state for a specific request
  • Improve Barrier to return the cause of wait_once being unblocked
  • Add tuneable barrier timeout via configuration

Motivation:

RC boot waiting being best-effort, it becomes important to be able to diagnose specific requests behaving in unexpected ways WRT remote configuration settings, e.g some select requests not being blocked by a denylist entry.

By adding tags recording what happened with remote configuration boot on a specific request we can diagnose why a specific request behaved a certain way instead of relying on guesswork.

This is fundamentally different from gathering statistical metrics pushed via telemetry (another planned thing).

Additional Notes:

Since these only concern specific information for boot, these tags are present only for requests that are subject to waiting during the boot process.

Recording the time as a metric for all boot requests also allows us to build knowledge of what a sane default value would be.

A PR to allow configuration of the barrier timeout is coming up next. Timeout is configurable.

How to test the change?

This starts as a draft to gather feedback, unit tests pending Unit tests are implemented.

Manual check:

  • either use some tool to delay RC response beyond timeout or configure to be absurdly low such as 0.01, observe timeout tag value being set
  • with default value (1.0), observe ready tag value being set
  • observe that time is recorded in all waiting cases
  • observe tags are not set for post-boot requests
  • observe that tags are not set when RC is disabled

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@github-actions github-actions bot added appsec Application Security monitoring product core Involves Datadog core libraries integrations Involves tracing integrations tracing labels Dec 11, 2023
@lloeki lloeki changed the title Add Remote Cnfiguration boot tags Add Remote Configuration boot tags Dec 11, 2023
@lloeki
Copy link
Contributor Author

lloeki commented Dec 11, 2023

JRuby fails with:

Failures:

  1) Datadog::Core::Remote::Component::Barrier#lift with waiters unblocks waiters
     Failure/Error: expect(barrier.wait_once).to eq :lift
     
       expected: :lift
            got: :pass
     
       (compared using ==)
     
       Diff:
       @@ -1 +1 @@
       -:lift
       +:pass
       
     # ./spec/datadog/core/remote/component_spec.rb:261:in `block in <main>'
     # ./spec/spec_helper.rb:240:in `block in initialize'

And of course I can't reproduce it locally.

@lloeki
Copy link
Contributor Author

lloeki commented Dec 11, 2023

So once the queue is popped the main thread is able to rush to barrier.lift faster than the waiter thread is able to reach barrier.wait_once, so the latter results in :pass.

      it 'unblocks waiters' do
        waiter_thread_queue = Queue.new
        waiter_thread = Thread.new(record) do |record|
          waiter_thread_queue << :ready
          record << :wait
          expect(barrier.wait_once).to eq :lift
          record << :woke_up
        end.run

        waiter_thread_queue.pop # Wait for ready

        record << :one
        barrier.lift
        waiter_thread.join

        expect(record).to eq [:wait, :one, :woke_up]
      end

So the whole attempt to make sure things are where we want is gloriously failing. It happens with JRuby because it can run in parallel but I see no reason it could not happen with CRuby as well whenever it decides to schedule threads in a certain way.

Moved to use a dumb sleep instead.

Comment on lines 29 to 56
unless Datadog::Core::Remote.active_remote.nil?
barrier = nil

t = Datadog::Core::Utils::Time.measure do
barrier = Datadog::Core::Remote.active_remote.barrier(:once)
end

if barrier != :pass && (span = active_span) && !span.has_tag?('_dd.rc.boot.time')
span.set_tag('_dd.rc.boot.time', t)

if barrier == :timeout
span.set_tag('_dd.rc.boot.timeout', true)
else
# TODO: 'ready' should evolve into ensuring RC received a proper reply
span.set_tag('_dd.rc.boot.ready', true)
end
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to encapsulate/review this logic, so it's compartmentalized and shard with other occurrence on lib/datadog/tracing/contrib/rack/middlewares.rb?

My motivation for asking this is because it's read like an unrelated unless/end block in the middle of this call method. I wouldn't mind it if it was hidden behind an abstraction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I initially set out to do that, especially given that the tracing side does exactly the same thing except that on the tracing side it's split in two because of the span creation.

I'll try something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I've folded that in a Datadog::Core::Remote::Tie namespace.

@lloeki
Copy link
Contributor Author

lloeki commented Dec 18, 2023

Now getting failures like so:

  49) Rack integration tests for an application with a basic route GET request with REQUEST_URI being a path and default quantization behaves like a rack GET 200 span 
      Failure/Error: span.set_tag('_dd.rc.client_id', Datadog::Core::Remote.active_remote.client.id)
        #<Double "Client"> was originally created in one example but has leaked into another example and can no longer be used. rspec-mocks' doubles are designed to only last for one example, and you need to create a new one in each example you wish to use it for.
      Shared Example Group: "a rack GET 200 span" called from ./spec/datadog/tracing/contrib/rack/integration_test_spec.rb:407
      # ./lib/datadog/tracing/contrib/rack/middlewares.rb:108:in `call'
      # /usr/local/bundle/gems/rack-test-2.1.0/lib/rack/test.rb:360:in `process_request'
      # /usr/local/bundle/gems/rack-test-2.1.0/lib/rack/test.rb:163:in `custom_request'
      # /usr/local/bundle/gems/rack-test-2.1.0/lib/rack/test.rb:112:in `get'
      # ./spec/datadog/tracing/contrib/rack/integration_test_spec.rb:402:in `block (6 levels) in <top (required)>'
      # ./spec/datadog/tracing/contrib/rack/integration_test_spec.rb:338:in `block (4 levels) in <top (required)>'
      # ./spec/datadog/tracing/contrib/support/tracer_helpers.rb:96:in `block (2 levels) in <module:TracerHelpers>'
      # ./spec/spec_helper.rb:224:in `block (2 levels) in <top (required)>'
      # ./spec/spec_helper.rb:109:in `block (2 levels) in <top (required)>'
      # /usr/local/bundle/gems/webmock-3.13.0/lib/webmock/rspec.rb:37:in `block (2 levels) in <top (required)>'
      # /usr/local/bundle/gems/rspec-wait-0.0.9/lib/rspec/wait.rb:46:in `block (2 levels) in <top (required)>'

@lloeki lloeki marked this pull request as ready for review December 20, 2023 10:28
@lloeki lloeki requested a review from a team as a code owner December 20, 2023 10:28
@lloeki
Copy link
Contributor Author

lloeki commented Dec 20, 2023

This is now functionally complete, some refactoring is up next.

@lloeki
Copy link
Contributor Author

lloeki commented Jan 18, 2024

Rebased. Will proceed with the code cleanups.

@lloeki lloeki force-pushed the add-rc-boot-tags branch 2 times, most recently from bcac361 to cead686 Compare January 29, 2024 10:33
@lloeki
Copy link
Contributor Author

lloeki commented Jan 30, 2024

Rebased again

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (e8b75dd) 98.25% compared to head (a031262) 98.26%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3315      +/-   ##
==========================================
+ Coverage   98.25%   98.26%   +0.01%     
==========================================
  Files        1262     1264       +2     
  Lines       73606    74113     +507     
  Branches     3445     3466      +21     
==========================================
+ Hits        72319    72828     +509     
+ Misses       1287     1285       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lloeki lloeki merged commit 482021c into master Jan 31, 2024
219 checks passed
@lloeki lloeki deleted the add-rc-boot-tags branch January 31, 2024 07:07
@github-actions github-actions bot added this to the 1.20.0 milestone Jan 31, 2024
@TonyCTHsu TonyCTHsu mentioned this pull request Feb 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
appsec Application Security monitoring product core Involves Datadog core libraries integrations Involves tracing integrations tracing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants