[PROF-9136] Fix rare remote configuration worker thread leak #3519

Merged
merged 4 commits into from
Mar 13, 2024
13 changes: 8 additions & 5 deletions lib/datadog/core/remote/worker.rb
@@ -10,20 +10,25 @@ def initialize(interval:, &block)
@thr = nil

@starting = false
@stopping = false
@started = false
@stop_requested = false
Contributor

Not fond of what the naming implies here ^^', I feel it introduces incidental complexity.

Here's my train of thought.

This sequence diagram shows how I view the state machine†:

{ ¬started, ¬starting, ¬stopping }     # 1.
           |
           |
        #start
           |
           V
{ ¬started, starting, ¬stopping }      # 2.
           |
           |
        success
           |
           V
{ started, ¬starting, ¬stopping }      # 3.
           |
           |
         #stop
           |
           V
{ started, ¬starting, stopping }       # 4.
           |
           |
         success
           |
           V
{ ¬started, ¬starting, ¬stopping }     # == 1.

In that context, stop_requested captures a transition (IOW #stop has been called) instead of representing a state. In effect this leads to a representational issue: @stop_requested will be forever true even when a restartable worker has been restarted and no #stop has been issued to that second #start state.
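
The stale flag can be reproduced with a simplified, thread-free model of the flag handling in this commit. This is only a sketch: `FlagModel` is a hypothetical name, and locking and the worker thread are left out on the assumption that they do not affect the flag values.

```ruby
# Simplified, thread-free model of the flag handling in this commit.
# FlagModel is a hypothetical name used only for this illustration.
class FlagModel
  attr_reader :started, :stop_requested

  def initialize
    @started = false
    @stop_requested = false
  end

  def start(allow_restart: false)
    # Mirrors the guard added in #start.
    return if @stop_requested && !allow_restart
    return if @started

    @started = true
  end

  def stop
    @started = false
    # Set once here and never reset anywhere else.
    @stop_requested = true
  end
end

model = FlagModel.new
model.start
model.stop
model.start(allow_restart: true)

# The worker is running again and no #stop has been issued for this run,
# yet the flag still reports a stop request.
puts model.started        # true
puts model.stop_requested # true
```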

AIUI what stop_requested intends to introduce is:

  • a distinction between the first and last state of the diagram, turning the last one into a 5. then making 5. equivalent to 1. by a conditional on @allow_restart
  • a distinction between 3. and 4. which became identical due to the removal of @stopping.

As a consequence I feel introducing a stopped state would better represent all possible states:

{ ¬started, ¬starting, ¬stopping, ¬stopped }     # 1.
           |
           |
        #start
           |
           V
{ ¬started, starting, ¬stopping, ¬stopped }      # 2.
           |
           |
        success
           |
           V
{ started, ¬starting, ¬stopping, ¬stopped }      # 3.
           |
           |
         #stop
           |
           V
{ started, ¬starting, stopping, ¬stopped }       # 4.
           |
           |
         success
           |
           V
{ ¬started, ¬starting, ¬stopping, stopped }      # 5.
           |
           |
        #start
           |
           V
{ ¬started, starting, ¬stopping, stopped }       # 6.
           |
           |
        success
           |
           V
{ started, ¬starting, ¬stopping, ¬stopped }      # == 3.

It then follows that:

  • @allow_restart is only effective when stopped (and not effective when ¬stopped)
  • the current implementation of stop_requested is equivalent to stopping OR stopped
    • ... except when the worker is started; in that case I feel a stop count (or maybe better, a start count) could be useful
  • the ambiguity about stop_requested on a restartable worker is resolved and fully captured by stopping

I also feel it makes it easier to implement helper methods from these fundamental state primitives (e.g. synthesising #stop_requested if needed).

† success is actually a diamond with an error branch, elided so as not to overload the sequence diagram, and also because it happens via exceptional code flow.
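
A minimal sketch of the six-state diagram above, assuming the four flags are plain booleans and leaving out locking, threads, and the error branch. `WorkerState` and its `begin_*`/`finish_*` methods are hypothetical names for this illustration, not part of the PR.

```ruby
# Hypothetical thread-free model of the six-state diagram.
class WorkerState
  attr_reader :started, :starting, :stopping, :stopped

  def initialize
    @started = @starting = @stopping = @stopped = false # state 1.
  end

  # 1. -> 2., or 5. -> 6. when a restart is allowed
  def begin_start(allow_restart: false)
    return false if @starting || @started
    return false if @stopped && !allow_restart

    @starting = true
    true
  end

  # 2. -> 3., or 6. -> 3. (a successful start clears stopped)
  def finish_start
    return false unless @starting

    @starting = false
    @started = true
    @stopped = false
    true
  end

  # 3. -> 4.
  def begin_stop
    return false unless @started

    @stopping = true
    true
  end

  # 4. -> 5.
  def finish_stop
    return false unless @stopping

    @stopping = false
    @started = false
    @stopped = true
    true
  end

  # Synthesised from the primitives instead of being stored as a flag.
  def stop_requested?
    @stopping || @stopped
  end
end

state = WorkerState.new
state.begin_start
state.finish_start
state.begin_stop
state.finish_stop
refused = state.begin_start                        # false: stopped, restart not allowed
restarted = state.begin_start(allow_restart: true) # true: 5. -> 6.
state.finish_start                                 # back to 3.
```

Because `stop_requested?` is derived, it goes back to `false` once a restarted worker reaches state 3. again.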

Member Author

Changed @stop_requested to @stopped in 22f3330


@interval = interval
raise ArgumentError, 'can not initialize a worker without a block' unless block

@block = block
end

def start
def start(allow_restart: false)
Contributor

@lloeki lloeki Mar 13, 2024

Is there a situation where we want to allow restart?

My memory is a bit fuzzy but it seemed to me that once a remote worker has started in practice its only subsequent states are "stopping" and "stopped", but maybe I missed something?

(IOW is state 5. disallowed and maybe should be guarded against?)
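
If restart is never wanted, the guard could look like the sketch below. It is an illustration under assumptions, not the library's worker: `OneShotWorker` is a hypothetical class, a plain `Mutex` stands in for the worker's own locking, and the interval loop is replaced by running the block once in a thread.

```ruby
# Hypothetical worker that can be started at most once.
class OneShotWorker
  def initialize(&block)
    raise ArgumentError, 'can not initialize a worker without a block' unless block

    @block = block
    @mutex = Mutex.new
    @thr = nil
    @started = false
    @stopped = false
  end

  # Returns true if the worker was started, false if the call was refused.
  def start
    @mutex.synchronize do
      # Once stopped, the worker never leaves that state.
      return false if @stopped || @started

      @thr = Thread.new { @block.call }
      @started = true
    end
    true
  end

  def stop
    @mutex.synchronize do
      thread = @thr
      if thread
        thread.kill
        thread.join
      end
      @thr = nil
      @started = false
      @stopped = true
    end
  end

  def started?
    @mutex.synchronize { @started }
  end
end

worker = OneShotWorker.new { sleep }
first = worker.start  # true
worker.stop
second = worker.start # false: refused after a previous stop
```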

Member Author

Removed in 760e084

Datadog.logger.debug { 'remote worker starting' }

acquire_lock

if @stop_requested && !allow_restart
Datadog.logger.debug('remote worker: refusing to restart after previous stop')
return
end

return if @starting || @started

@starting = true
@@ -46,8 +51,6 @@ def stop

acquire_lock

@stopping = true

thread = @thr

if thread
@@ -56,8 +59,8 @@ def stop
end

@started = false
@stopping = false
@thr = nil
@stop_requested = true

Datadog.logger.debug { 'remote worker stopped' }
ensure
2 changes: 1 addition & 1 deletion sig/datadog/core/remote/worker.rbs
@@ -12,7 +12,7 @@ module Datadog

def initialize: (interval: ::Float) { () -> void } -> void

def start: () -> void
def start: (?allow_restart: bool) -> void

def stop: () -> void

22 changes: 22 additions & 0 deletions spec/datadog/core/remote/worker_spec.rb
@@ -67,6 +67,28 @@

expect(worker.instance_variable_get(:@thr).thread_variable_get(:fork_safe)).to be true
end

it 'does not restart the worker after being stopped once' do
worker.start
expect(worker.instance_variable_get(:@started)).to be true

worker.stop

worker.start
expect(worker.instance_variable_get(:@started)).to be false
end

context 'when calling start with allow_restart: true' do
it 'restarts the worker after being stopped once' do
worker.start
expect(worker.instance_variable_get(:@started)).to be true

worker.stop

worker.start(allow_restart: true)
expect(worker.instance_variable_get(:@started)).to be true
end
end
end

describe '#stop' do