
Severe thread starvation issues #41586

Closed
janrous-rai opened this issue Jul 14, 2021 · 19 comments
Labels
domain:multithreading Base.Threads and related functionality

Comments

@janrous-rai

Summary

We have observed highly unpredictable thread starvation occurring on our production servers, which are written in Julia. In particular, this thread starvation causes our monitoring (instrumentation) subsystem to stall for extended periods of time (tens of minutes, up to hours in extreme cases). This often leads to a total loss of visibility into the operation of our production systems, which makes it virtually impossible to diagnose the services and run a reliable production system.

Solving this issue is a high priority for us at RelationalAI due to its impact on our production systems.

Background

Our database server (rai-server) is written in Julia. The production deployment currently runs on Julia 1.6.1 with multi-threading enabled and the number of threads set to the number of cores of the underlying virtual machines (we are using the Azure cloud).

Our monitoring subsystem relies heavily on background tasks - tasks that are spawned at server startup and periodically kick off certain operations such as:

  1. flushing changes to metrics to the remote monitoring services
  2. collecting specific metrics such as Julia garbage collection statistics, memory usage details (by parsing /proc files) and so on.

In addition to metrics that are updated via background tasks, many other metrics are updated as soon as the events of interest occur (e.g. when the pager allocates memory for a new page, we increment a metric that represents the number of pages currently in use).

Users can run database queries. When these queries are received via HTTP, they are parsed, analyzed, and executed. During execution, a potentially large number of tasks can be generated, some with very long execution times. However, we currently do not have a good understanding of the fine-grained details (e.g. the number of tasks spawned, or the duration of individual tasks). Some workloads may trigger long-running, CPU-heavy work within a single task.

Periodic tasks

The monitoring subsystem and other batched or maintenance tasks use the concept of a periodic task, which implements the following pseudo-code:

while !periodic_task_should_terminate
  sleep(period)
  do_stuff()
end

In reality, the actual code is somewhat more complex because the loop above reacts slowly to termination signals, so we use an analogue of the C++ wait_for function that waits for a specified period or a notification signal, whichever comes first.
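For illustration, here is a minimal sketch of such a helper (not the actual rai-server code), built on Base's timedwait, which polls a predicate until it returns true or the timeout expires:

using Base.Threads: Atomic

# Sketch of a wait_for-style helper: block until either `period_s` seconds
# elapse or the termination flag is set, whichever comes first.
function wait_for(term::Atomic{Bool}, period_s::Real)
    return timedwait(() -> term[], period_s; pollint = 0.1)  # returns :ok or :timed_out
end

function periodic_task(term::Atomic{Bool}, period_s::Real, do_stuff)
    while !term[]
        wait_for(term, period_s)
        term[] || do_stuff()   # skip the work if termination was requested
    end
end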

Observed symptoms

On server startup, we kick off a periodic task that increments the server_heartbeats metric once per second. Under normal conditions, when we plot the rate of change, we should get a flat line close to 1. However, when the server starts handling user queries, the rate of change of this metric dips well below 1, indicating that the background task is not running as frequently as it should.

[Figure: rate of change of the server_heartbeats metric, dipping below 1 while user queries run]

Sometimes we experience a complete blackout during which rai-server does not export any metrics at all. Such a blackout can last tens of minutes or, in the example below, roughly 90 minutes. This seems to indicate that the statsd export thread is not running as frequently as it should.

[Figure: metrics blackout lasting roughly 90 minutes]

Suspected root cause

We suspect that the following factors contribute to this issue:

  1. Task thread affinity - a long-running task can only be scheduled on the OS thread it was initially assigned to. (Task migration is introduced in Julia 1.7, which we've only just started testing.)
  2. Lack of task isolation - both user query handling and critical components of the server (http handlers, monitoring subsystem, ...) use the same Julia tasks and compete for the same limited pool of OS threads. There's no built-in way to isolate the two or to prioritize critical work over user-supplied workloads.
  3. No task preemption - while tasks can explicitly yield and give up their OS thread, if they do not do so they can occupy an OS thread indefinitely.
  4. Long-running tasks - if code does not explicitly contain yield points, or call code that implicitly yields, individual tasks can run for indefinite periods of time.
  5. Freeze-the-world GC pauses waiting on long-running tasks - if one Task triggers GC and attempts to freeze the world, every thread stops scheduling new work until all threads have paused. If one thread is running a long-running Task, GC pauses can keep the other threads starved for a long time.

Specifically, we think that the combination of (1), (3), (4), and possibly (5) is the primary cause of the observed symptoms. The critical background tasks are scheduled on arbitrary threads. Whenever user-generated tasks are spawned, they can get scheduled on those same threads. If any of these tasks are long-lived (we have queries that may run for several hours), do not yield, and get scheduled on the same thread as the critical tasks, they can and will block them.

Suggested solutions

In the end, this is all about sharing limited resources (threads) among tasks and enforcing some form of isolation. Ultimately, we need to be able to flag work that is critical (latency-sensitive) and ensure that other tasks cannot starve it. There are several possible approaches:

  1. Task priorities - tasks could be associated with a priority that controls the order in which they are scheduled. In such a scenario, critical tasks would be marked as high priority, ensuring that they are processed before any lower-priority work.
  2. Dedicated resources - this is somewhat analogous to thread pools. The idea is to dedicate one (or more) threads to critical tasks only (we need to be able to flag those), which ensures that bulk (lower-priority) workloads cannot be scheduled on these threads. The downside of this solution is that some of the threads may end up being idle most of the time, but this may be a reasonable overhead. If the size of this exclusive thread pool can be controlled at startup time, we can easily tune the degree of this overhead.
  3. Ability to spawn tasks on a dedicated OS thread - the production-critical components of our server are relatively lightweight, but they are latency-sensitive. We would likely be able to achieve a good degree of isolation if we could launch these Julia tasks on a new OS thread that is dedicated to this task and this task only, and is cleaned up after the task finishes. This effectively means that the task bypasses Julia scheduling entirely and relies on the kernel scheduler to manage the underlying OS thread.
  4. Task migration - as long as there are some available threads, allowing critical work to migrate and schedule on any thread should alleviate the risk of long-running tasks occupying the wrong thread. This is still vulnerable to problems if there are many tasks scheduled (perhaps starving critical tasks) or if enough long-running tasks occupy all available threads.
  5. Explicit yield points - the code itself can be instrumented with yield points to ensure that long-running tasks are unlikely (see the sketch after this list). This approach may still not work if the latency sensitivity of a critical task is finer than the (time) distance between yield points. It also requires a lot of manual labor and may fall apart if new code that is not properly "yield instrumented" is introduced. This feels like the least favorable solution in the list but may be used as a quick-and-dirty interim hack to get things into a better shape.
  6. Better visibility into scheduling - this is not a solution per se, but tools/instrumentation that help us inspect task scheduling decisions, or log certain problematic conditions (e.g. long-running tasks), would help us identify and profile sub-optimal behavior in our own code. For example, if we could easily flag long-running tasks, we might be able to insert yield points and break them up into smaller bits.
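As a rough illustration of option (5), the sketch below instruments a CPU-bound loop so that it yields back to the scheduler roughly every 10 ms; process_chunk! is a hypothetical unit of work and the 10 ms interval is just a tuning knob:

function busy_work(chunks)
    last_yield = time_ns()
    for chunk in chunks
        process_chunk!(chunk)                     # hypothetical CPU-heavy unit of work
        if time_ns() - last_yield > 10_000_000    # ~10 ms since the last yield
            yield()                               # give other tasks on this thread a chance to run
            last_yield = time_ns()
        end
    end
end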

We think that solution (3) is the most robust and flexible option and would like feedback on its feasibility and complexity.

Example

The following example reproduces the thread starvation. The metronome function runs at a fixed interval and measures the accuracy of that interval (the drift between the expected and observed interval). The run_for function implements a long-running, non-yielding task that is spawned and blocks the metronome.

using Dates
using Base.Threads: Atomic, @spawn

function run_for(id::Int64, num_ns::Int64)
    println("$id: Kicked off run_for($num_ns) on thread: $(Threads.threadid())")
    start_ns = time_ns()
    local i = 0
    while time_ns() < start_ns + num_ns
        i += 1
    end
    println("$id: Finished run_for on thread: $(Threads.threadid())")
end

function metronome(term::Atomic{Bool}, period_ns::Int64)
    println("Kicked off metronome $period_ns on thread: $(Threads.threadid())")
    local i = 0
    while !term[]
        i += 1
        prev_ns = time_ns()
        println("Presleep $(time_ns())")
        sleep(Dates.Nanosecond(period_ns))
        actual_ns = time_ns() - prev_ns
        println("Postsleep $(time_ns()), delta $(actual_ns)")
        drift_ns = actual_ns - period_ns
        drift_s = drift_ns / 1_000_000_000
        println("$(Threads.threadid()): Metronome drift $(drift_s) seconds")
    end
    println("Terminating metronome after $i iterations")
end

# Kick off one metronome and a bunch of busy tasks that will compete for threads
@sync begin
    mterm = Atomic{Bool}(false)
    @spawn metronome(mterm, 1_000_000 * 250)  # Runs every 250ms
    sleep(4)
    println("Spawning busy tasks")
 
    @sync begin
        for j in 1:10
            @spawn run_for(j, 1_000_000 * 4000) # Blocks for 4s
        end
    end
    mterm[] = true
end
@PallHaraldsson
Contributor

PallHaraldsson commented Jul 14, 2021

Can you try 1.7-beta3? I had some issues that may be related (or not, I didn't read your issue too carefully); #39598 and possibly #41411 fixed one or both.

@c42f
Member

c42f commented Jul 15, 2021

Thanks for this great writeup. We also face these same issues (though to a lesser degree of operational severity) as described in #34267 (comment)

As @tkf referred to in #34267 (comment), an attractive option may be to add dynamically scoped scheduling policies. Then normal code using @spawn inherits the scheduling policy from its calling context (for example, to schedule newly spawned tasks only onto such-and-such a subset of CPU threads).

Getting this working in full generality with custom user-defined schedulers seems like a longer term design project, but adding some tunable knobs to the builtin scheduler feels more doable.
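As a purely hypothetical sketch of that inheritance idea (none of this is an existing Base API): a policy object kept in task-local storage and re-established in child tasks by a small spawn wrapper. It only illustrates how a dynamically scoped policy could propagate; it does not actually constrain where tasks run.

using Base.Threads: @spawn, nthreads

struct SchedPolicy
    allowed_threads::Vector{Int}   # illustrative knob only
end

# Default policy: all threads allowed.
current_policy() = get(task_local_storage(), :sched_policy, SchedPolicy(collect(1:nthreads())))

function policy_spawn(f)
    pol = current_policy()                        # capture the caller's policy...
    return @spawn begin
        task_local_storage(:sched_policy, pol)    # ...and re-establish it in the child task
        f()
    end
end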

Explicit yield points - the code itself can be instrumented with yield points to ensure that long-running tasks are unlikely

Related to this, what about having a compiler plugin insert implicit yield points for any code called in a latency-sensitive scheduling context?

This is like what the Go compiler does (or used to do) by inserting yield points at every function call boundary. This has performance overhead, but for Julia we could avoid emitting that code unless called in a latency sensitive context. (IIUC Go has now moved away from doing this at compile time and now has runtime asynchronous preemption of goroutines https://medium.com/a-journey-with-go/go-asynchronous-preemption-b5194227371c. That seems really great, but also a huge pile of work.)

@tkf
Member

tkf commented Jul 15, 2021

normal code using @spawn inherits the scheduling policy from its calling context

Since Threads.@spawn needs to support full concurrency, I think adding more constraints that the scheduler has to consider might introduce more overhead to the scheduler. (Edit: hmm... maybe possible)

what about having a compiler plugin insert implicit yield points for any code called in a latency-sensitive scheduling context?

This can introduce races to code that relies on the lack of yields (i.e., certain blocks of code are treated as "atomic" with respect to Julia's concurrency).

Getting rid of such code and asking users to write fully "thread-like" code may be one option. But I think it'd be nice (and possible) to be able to declare that a given function is blocking. If we have such annotations, a compiler (plugin) can safely insert yields.

@vchuravy
Sponsor Member

This has performance overhead, but for Julia we could avoid emitting that code unless called in a latency sensitive context

The issue with that is that you will likely run a mix of sensitive and non-sensitive code, and the non-sensitive code will mess up your latency. So systematically figuring out where we are missing safepoints for GC latencies would be important, the obvious one is loops (maybe we can enter safepoints after vectorization?) and probably non-allocating non-inlined function calls?

@janrous-rai
Author

Can you try 1.7-beta3? I had some issues that may be related (or not, I didn't read your issue too carefully); #39598 and possibly #41411 fixed one or both.

Our systems are complex enough that migration to new versions is non-trivial. We are working on it but it will take a while to make it viable.

While task migration will likely make this issue less frequent, it is still possible to generate enough non-yielding tasks to block critical operations, so the potential for this remains even then.

@JeffBezanson
Sponsor Member

Having a @threads begin ... end that starts a new thread that runs only a specific task seems like a fairly clean way to accomplish this. We've also discussed having a dedicated thread for running tasks designated as latency-sensitive, but then you have to trust everybody to actually write low-latency tasks... :) We'll think about it.

@JeffBezanson JeffBezanson added the domain:multithreading Base.Threads and related functionality label Jul 15, 2021
@c42f
Member

c42f commented Jul 16, 2021

The issue with that is that you will likely run a mix of sensitive and non-sensitive code, and the non-sensitive code will mess up your latency. So systematically figuring out where we are missing safepoints for GC latencies would be important, the obvious one is loops

Right, I guess there are two related but different issues:

  • Ensuring latency-sensitive tasks get scheduled regularly. This can be helped by dedicated latency-sensitive threads and sufficient yield points in all tasks running on that thread.
  • Ensuring that latency sensitive tasks don't get held up in GC by unrelated compute tasks which spend a long time in a GC unsafe region without hitting any GC safe points. This seems harder. I don't see a lot of code for gc_state transitions, does codegen reason about this at all?

@tkf
Member

tkf commented Jul 16, 2021

Having a @threads begin ... end that starts a new thread that runs only a specific task

Isn't the requirement here rather that tasks originating from other workers can't use the specified worker? We'd need to handle cases where tasks are scheduled from the dedicated worker, since otherwise we can't call arbitrary library functions from the dedicated worker.

An interesting case is when a library (that you don't control) uses @spawn internally and waits on it:

@sync begin
    @spawn f()
    g()
end

If @spawn f() is scheduled on another, non-dedicated worker, we'll still have the latency problem due to priority inversion. So it seems like @spawn f() needs to be sticky on the dedicated worker. Or maybe it'd be even better if it had the special property that it can be scheduled on both non-dedicated workers and this dedicated worker.

Alternatively, maybe we can pull in outside tasks (if they are not already running) upon wait from the dedicated worker. This approach has the benefit that starting throughput-oriented tasks from the dedicated worker is easier. In the approach above, application authors need to do some plumbing using channels etc. to start such tasks.

So systematically figuring out where we are missing safepoints for GC latencies would be important, the obvious one is loops (maybe we can enter safepoints after vectorization?) and probably non-allocating non-inlined function calls?

Didn't Go move away from this approach? GopherCon 2020: Austin Clements - Pardon the Interruption: Loop Preemption in Go 1.14 - YouTube

If it didn't work in Go, I think it's much less likely to work in Julia since people write very performance-sensitive (throughput-oriented) loops.

@vchuravy
Sponsor Member

If it didn't work in Go, I think it's much less likely to work in Julia since people write very performance-sensitive (throughput-oriented) loops.

Our GC safepoints are async-based, i.e. they are implemented as a load from a location that will segfault, and the fault handler then switches to GC; I think the Go approach was to do more work on the user side of the safepoint. Adding an extra load in the backedge is certainly sub-optimal (and potentially additional work to maintain the data in the gcframe).

We already have the infrastructure for preempting code (that is part of how the profiler works), but we are missing a way to find GC pointers at the interrupt location: e.g. stackmaps (let's not do that), conservative scanning (meh), or on ARM we could do pointer tagging (xD).

I would say we should see how big of a performance impact it would actually be to late-insert a safepoint into the backedges of a loop. Non-allocating functions are also missing safepoints.

julia> fib(x) = x <= 1 ? 1 : fib(x-1) + fib(x-2)
fib (generic function with 1 method)

julia> @code_llvm fib(2)
;  @ REPL[2]:1 within `fib'
define i64 @julia_fib_177(i64 signext %0) {
top:
; ┌ @ int.jl:442 within `<='
   %1 = icmp sgt i64 %0, 1
; └
  br i1 %1, label %L4, label %L3

L3:                                               ; preds = %top
  ret i64 1

L4:                                               ; preds = %top
; ┌ @ int.jl:86 within `-'
   %2 = add i64 %0, -1
; └
  %3 = call i64 @julia_fib_177(i64 signext %2)
; ┌ @ int.jl:86 within `-'
   %4 = add i64 %0, -2
; └
  %5 = call i64 @julia_fib_177(i64 signext %4)
; ┌ @ int.jl:87 within `+'
   %6 = add i64 %5, %3
; └
  ret i64 %6
}

So the recent work on reducing allocation might have made this problem worse :)

@tkf
Member

tkf commented Jul 16, 2021

The load in the safepoint is sandwiched with seq_cst fences. It's a signal fence and so "free" for CPUs. But isn't it harmful to the optimizer?

@vchuravy
Sponsor Member

But isn't it harmful to the optimizer?

Yeah, that's why I was thinking of inserting it at the end of the optimization pipeline.

@NHDaly
Member

NHDaly commented Sep 15, 2021

We discussed today having a Multithreading Birds of a Feather working group discussion entirely focused on this topic to work through the design.

The main challenge here remains designing the feature, so meeting together could help unblock it. 👍

@tkf
Member

tkf commented Sep 18, 2021

Regarding the direction of adding dedicated worker threads, I think it's worth considering not adding a Julia API and only providing a command line option to do it. This way, application authors can isolate latency-critical code while package authors are encouraged to write composable concurrent/parallel programs. Having a single central resource manager is important. Not many languages have this, and it'd be nice if we could keep this property.

@jpsamaroo
Member

I disagree; making users figure out how many threads to start on the CLI is annoying and error-prone. I'd consider it much more important to document that such an API should be used only very sparingly, and that most users shouldn't even consider using it.

I agree that it would be ideal if we could always write composable concurrent programs, but that is not the case. We have libraries like CUDA which we have no control over, and have to use whatever API they give us, including all their non-composable, blocking routines. Providing a way for us to work around such problematic APIs is key to allowing the rest of our program to remain composable.

@tkf
Member

tkf commented Sep 18, 2021

I've suggested a couple of ways to solve the tricky CUDA integration with the current concurrency API in Julia [1]. I don't think a dedicated worker is necessary, nor would it ease the implementation. It's also not straightforward to get this right due to the priority inversion I mentioned in #41586 (comment).

I think a more convincing use case for the dedicated worker would be Distributed.jl and Dagger.jl (as distributed scheduling is latency-sensitive). But Distributed.jl can control the command line flags of the worker processes, so it can create a "manager thread" for each worker process quite easily. It can't add the dedicated worker thread in the main process, but the main process usually doesn't do the computation in Distributed.jl. It's also possible to create a scheduler process and move the controlling responsibility out of the main process (which could be better for distributed schedulers anyway).

Footnotes

  1. Discussion of CUDA.jl hostcall implementation in https://julialang.slack.com/archives/C688QKS7Q/p1631550952060800

@davidanthoff
Contributor

For the VS Code extension, we would also like to be able to add a low-latency worker thread without the need for command line arguments. One scenario for us is that a user starts a Julia process (without us being able to control or add any command line args) and then connects this process to the extension's integration points; at that point we want to run a low-latency worker thread for the communication in that Julia process. If the only way to add a dedicated worker thread were via a command line option, we couldn't support that scenario.

@jpsamaroo
Member

Dedicated threads/threadpool PR is posted: #42302
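For readers landing here later, a rough sketch of what the threadpool support that eventually shipped in Julia 1.9 looks like (assuming a 1.9+ runtime; collect_metrics and run_user_query are hypothetical functions):

# Start the process with, e.g., `julia --threads 3,1`:
# 3 default threads plus 1 interactive thread reserved for latency-sensitive work.
using Base.Threads

# Latency-sensitive work goes to the interactive pool...
monitor = Threads.@spawn :interactive collect_metrics()

# ...while throughput-oriented work stays on the default pool.
query = Threads.@spawn :default run_user_query()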

@vtjnash
Sponsor Member

vtjnash commented Mar 10, 2023

Mitigated by #41616

@vtjnash vtjnash closed this as completed Mar 10, 2023
@NHDaly
Member

NHDaly commented Mar 14, 2023

🎉 thanks everyone! Great to see this closed 👍
