[DRAFT] Share circuit breakers between workers #227

Open
wants to merge 51 commits into implement_lru_cache from mkipper/global-circuit-breaker
Conversation


@michaelkipper commented Jun 12, 2019

This PR fixes #27.

What

Moves the implementation of circuit breakers to shared memory so circuit errors can be shared between workers on the same host.

Why

Currently, for single-threaded workers, we need to see error_threshold consecutive failures in order to open a circuit. For applications where timeouts are necessarily high (e.g. 25 seconds for MySQL), that translates into up to 75 seconds of blocking behaviour when a resource is unhealthy.

If all workers on a host experience the outage simultaneously, that information can be shared and the circuit can be tripped in a single timeout iteration, increasing the speed of detecting the unhealthy resource and opening the circuit.

In addition, in the case of a variable latency outage of an upstream resource, the collective hivemind can share information about the increased latency and detect trends earlier and more reliably. This feature is not implemented yet, but becomes possible after this change.

How

Note: This is very much a draft, and needs a lot of refactoring, but it's a start.

The main idea is to move the data for Simple::Integer to a semaphore and the data for Simple::SlidingWindow to shared memory.

Simple::State simply uses the shared Simple::Integer as its backing store to easily share state between workers on the same host.
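For a rough picture of the layout being described (the struct names appear elsewhere in this PR, but the fields shown here are illustrative, not the actual definitions):

```c
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>

/* Integer counter backed by a SysV semaphore, visible to every worker on the host. */
typedef struct {
  key_t key;     /* derived from the resource name */
  int sem_id;    /* semget() result; the semaphore's value is the counter */
} semian_simple_integer_t;

/* Ring buffer of error timestamps living in a shared memory segment. */
typedef struct {
  int max_size;
  int length;
  int start;
  int data[1];   /* flexible tail sized by the segment */
} semian_simple_sliding_window_shared_t;

typedef struct {
  key_t key;
  int sem_id;                                    /* semaphore guarding the window */
  semian_simple_sliding_window_shared_t *shmem;  /* shmget()/shmat() attachment */
} semian_simple_sliding_window_t;
```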

Feel free to take a look around, but I've still got some refactoring to do.

In particular, this implementation isn't thread-safe yet and requires some serious locking.

cc: @Shopify/servcomm
cc: @sirupsen, @csfrancis, @byroot

@thegedge left a comment

Some high-level first thoughts.

}

return (int*)val;
}


@name doesn't look to be changing after initialization. Could we cache the shmid in semian_simple_integer_t? How about the void *val?

Similarly for circuit_breaker.c and sliding_window.c.
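A rough sketch of the caching being suggested, assuming the shared-memory-backed integer under review here (struct fields are illustrative; error handling omitted):

```c
#include <sys/shm.h>

typedef struct {
  key_t key;   /* generated from @name once, at initialization */
  int shmid;   /* cached shmget() result */
  int *val;    /* cached shmat() pointer */
} semian_simple_integer_t;

/* Resolve the segment once and reuse the attachment on every subsequent call. */
static void
simple_integer_attach_once(semian_simple_integer_t *res)
{
  if (res->val != NULL) return;               /* already attached */
  res->shmid = shmget(res->key, sizeof(int), 0660);
  res->val = (int *)shmat(res->shmid, NULL, 0);
}
```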

Author

Yes, I actually addressed this in a later commit.
Basically, we want to store the key generated from the name on the object itself.
I'll double check I've done this everywhere.

@pushrax (Contributor) commented Jun 12, 2019

If we have this, I'm fairly sure we should also decouple bulkheads from circuits. At the moment (or at least as of the end of 2018), a bulkhead acquisition failure marks an error in the same resource's circuit. There's a lengthy discussion in https://github.com/Shopify/shopify/pull/178393. I'll reproduce the big points for open source's sake. At the end, my conclusion was that sharing circuits between processes and decoupling bulkheads would be an attractive option, which is possible after this PR!

@ashcharles

If Semian is unable to acquire a ticket due to bulkheading, an error is thrown. Multiple errors may exceed the configured error_threshold for the resource causing the circuit to open. While this behaviour is tested in Semian, document it explicitly here with a test for the case of Redis to make it clear that repeated bulkheading on Redis may lead to open-circuit errors.

I found this behaviour surprising as did others in the ensuing discussion. I chose code > documentation. I thought that adding a test to core would make it clear for folks who aren't necessarily looking at the Semian code-base.

@jpittis

TLDR: I can't think of a reason (besides backward compatibility) why we wouldn't want to remove this behaviour from Semian.

This is a really good test case to articulate an important property of Semian.

Bulkheads triggering does not mean a protected service is experiencing any kind of issue. But we still open a circuit and fast fail requests for the duration of error timeout.

I suspect that this has caused us to fail requests to Redis (and other protected services) when the service was not under duress.

@pushrax

Thanks for bringing this up. I agree it may not be desired behaviour.

The high level goal of a circuit breaker is: "if a request is likely to fail, don't issue it." This is done for two reasons. First, and most importantly, it reduces unnecessary load on the dependency, which is particularly crucial when the dependency is failing due to overload. Second, it reduces unnecessary load on the dependent, which is crucial when the dependency is timing out and taking up too much dependent capacity.

The relevant property to notice is: a circuit breaker is not useful when the cost of a failed request is similar to the cost of evaluating the circuit breaker. When a bulkhead is full, further requests will fail without hitting the network at all, so failure is cheap. Circuit breakers aren't providing their intended improvement when applied to bulkhead failure. I agree it makes sense to remove this behaviour; it complicates thinking about tuning. An oscillating distributed bulkhead is even more confusing to reason about than a regular distributed bulkhead.

@sirupsen

I am not sure I am sold on removing this behaviour. I've always thought the interaction between them was useful. If tickets are 50% of workers and timeouts are high you'll be "wasting" 50% of workers on timeouts to the resource. After 3 x error_threshold you'd be back to having 100% of capacity (if the half_open_resource_timeout is reasonable).

Imagine having a 25s timeout on MySQL. When all tickets are taken, clearly that MySQL is a problem and we trigger circuits rapidly. If you waited only for the circuits, you'd have to wait 3 x 25s. If you had only bulkheads, you're "wasting" 50% of capacity. If you do both, you seem to get the best of both worlds (as long as you have the half open timeout).

@pushrax

I think this is a case where both positions are correct depending on the tuning and scenario. Simulation would be a great way to show exactly how this is true.

I keep forgetting circuits are not shared between processes, while bulkheads are. Your idea ends up working around this by sharing state between processes through the bulkheads to influence the circuits.

Vertical axis is resource usage (in tickets), horizontal is time:
[diagram omitted]

Assume t0 is the resource timeout. My earlier comment about tripping the circuit not being useful here was with reference to the overlapping time between the marked t0 interval and the breaker open interval. In that case, the bulkhead will prevent further requests anyway. In the remaining time (if any) of the breaker open interval, we avoid queuing more work. t1 is showing the "half open timeout" logic.

What about the case when t0 < timeout?

[diagram omitted: the case when t0 < timeout]

When all tickets are taken, clearly that MySQL is a problem and we trigger circuits rapidly.

Downstream latency regression is one reason tickets can be exhausted, but upstream demand and small amounts of queuing can also cause this. If a large amount of load appears, filling tickets, we prematurely assume instability by shutting down the resource entirely. We fail requests that would not have failed.

When we trip circuits early based on bulkhead errors, we gain capacity for other work. If the other work can't use this capacity more effectively than the original work, we are making an error. This is the part that's difficult to understand: how likely is it that making capacity available for other work will have higher ROI given we just know a bulkhead is full? I don't have a good intuition for this.

Bulkheads can be tuned such that even when exhausted, there is sufficient capacity for other work, provided other competing bulkheads aren't concurrently exhausted. Is this not the way we use them?

Would sharing circuit state between processes be a more direct way of addressing the goals here without needing to understand the statistics of production failure modes?

@csfrancis (Contributor)

My big concern with this implementation is the number of shared memory segments that will be created. It looks like we're using distinct shared memory segments for each instance of Simple::Integer and Simple::SlidingWindow. For Shopify production, I could see this amounting to (tens of?) thousands of shared memory segments. I can't say for certain that this would be a problem, but it feels a bit sketchy to me and it would be good to verify that it's not.

@michaelkipper (Author)

My big concern with this implementation is the number of shared memory segments that will be created... For Shopify production, I could see this amounting to (tens of?) thousands of shared memory segments.

@csfrancis: The data I have for core suggests that it's hundreds, not thousands:
https://shopify.datadoghq.com/dashboard/mzg-id4-5rp/semian-lru?screenId=mzg-id4-5rp&screenName=semian-lru&tile_size=m&from_ts=1560376147986&to_ts=1560462547986&live=true&fullscreen_widget=812985505059370&fullscreen_section=overview

Our max_size LRU cache should put bounds on this.

@pushrax pushrax force-pushed the implement_lru_cache branch 2 times, most recently from db7215f to 0d55fb6 Compare June 14, 2019 20:26
@michaelkipper michaelkipper force-pushed the mkipper/global-circuit-breaker branch 2 times, most recently from c83159e to 1a46fb6 Compare June 18, 2019 20:54
@sirupsen (Contributor) left a comment

Did a first high-level pass. :)

You describe the problem in the opening comment, but not why you think or know this approach will work. Why do we think this will solve it (simulations proved it?)? What alternatives were considered? How will we prove whether it works?

README.md
sensitive.

To disable host-based circuits, set the environment variable
`SEMIAN_CIRCUIT_BREAKER_IMPL` to `ruby`.
Contributor

I think these are disabled by default for backwards compatibility, might be worth noting.

Author

Done.

@@ -41,6 +42,24 @@ def test_reset
@integer.reset
assert_equal(0, @integer.value)
end

if ENV['SEMIAN_CIRCUIT_BREAKER_IMPL'] != 'ruby'
Contributor

An example of where you missed the ruby / worker duality.

Author

Done. This test can run for both implementations.

this configuration, we can reduce the time-to-open for a circuit from _E * T_
to simply _T_ (provided that _N_ is greater than _E_).

You should run a simulation with your workloads to determine an efficient
Contributor

This is really tough advice to give someone. That can be days to weeks of work. Can we, with our simulation tooling, provide some sample values that we can include here? I imagine we can generate a somewhat magic constant based on simulations on (1) ss_error_rate steady-state error rates, (2) n number of Semian consumers (threads * workers), (3) error_threshold, (4) resource_timeout, (5) whatever other inputs you all use?

I'd like the advice to then be something along the lines of...

While your own simulation will likely find you a superior value for the scale factor, we recognize how much work this is. We've done simulations internally and have found that a scale factor that is `<magic constant> * <threads>` works fairly well and use that for almost all services.

@@ -64,4 +66,30 @@ void Init_semian()

/* Maximum number of tickets available on this system. */
rb_define_const(cSemian, "MAX_TICKETS", INT2FIX(system_max_semaphore_count));

if (use_c_circuits()) {
Init_SimpleInteger();
Contributor

Why did you choose to override them rather than define them and switch at the Ruby-layer instead? The latter seems a bit cleaner to me.

Author

Mostly because that's roughly the way the bulkhead was implemented, and I wanted to keep the codebase familiar. One could argue that the Ruby implementation of the bulkhead is actually just a stub, but then those functions should have NotImplemented assertions in them.

Contributor

Fine with me. 👍


dprintf("Initializing simple integer '%s' (key: %lu)", to_s(name), res->key);
res->sem_id = initialize_single_semaphore(res->key, SEM_DEFAULT_PERMISSIONS);
res->shmem = get_or_create_shared_memory(res->key, &init_fn);
Contributor

Why did you decide on shared memory rather than the existing semaphore wrappers for storing this number? (No strong opinion, but curious).

Author

There was a brief period where there was something else in the shared memory (I can't really remember now). There's no real issue with using a semaphore, and it would get rid of the need for locks, assuming we could guarantee that the operations would not block.
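For what it's worth, a sketch of what a semaphore-backed counter could look like; SysV semaphore operations are atomic across processes, so no extra lock is needed (names are illustrative, not the PR's code):

```c
#include <sys/ipc.h>
#include <sys/sem.h>

/* Bump the shared counter; semop() is atomic across all workers attached to sem_id. */
static int
shared_counter_increment(int sem_id)
{
  struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
  return semop(sem_id, &op, 1);   /* increments never block */
}

/* Read the current value without modifying it. */
static int
shared_counter_value(int sem_id)
{
  return semctl(sem_id, 0, GETVAL);
}
```

Decrements are the case to watch: at zero they block unless the operation is issued with IPC_NOWAIT.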

Contributor

I'll think more about this when I review the actual code and not just the approach.

}

VALUE
semian_simple_sliding_window_push(VALUE self, VALUE value)
Contributor

Many other circuits just use a number and not a sliding window. What are the advantages of either approach? One huge advantage would be that you could just use the Integer class instead of a 450 line sliding window implementation.

Author

To do circuit breaking without a sliding window, you'd have to use fixed time intervals for error counts, and reset the counter at the end of each fixed-time interval. There are several issues with this, off the top of my head:

  • That's a breaking change, compared to the current implementation.
  • You don't open circuits as fast if a series of failures straddles a reset boundary. There can be cases when the circuit never opens at all (which you could argue is a benefit).
  • A master/slave election might be necessary to determine which worker resets the counter, which could be more complex than the sliding window implementation.

This approach is admittedly more complex than a single integer counter but I'm not sure that it's worth sacrificing the benefits of a sliding window.

It's hard to reason about the effect of this without millisecond-accurate production data.

@sirupsen (Contributor) commented Jun 26, 2019

Just to ensure we're on the same page, the alternative approach I'm considering is that you have two integers: (1) Number of failures, (2) Timestamp of when to reset the number of failures. This makes the code significantly simpler (in Ruby, it's a few lines, but it's quite a bit more here).
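For concreteness, a minimal sketch of that two-integer approach (a failure count plus a reset deadline); names are illustrative, and the shared storage and locking are elided:

```c
typedef struct {
  int failures;    /* errors observed in the current interval */
  long reset_at;   /* epoch seconds at which the count resets */
} fixed_interval_counter_t;

/* Returns non-zero when the error threshold is reached and the circuit should open. */
static int
record_error(fixed_interval_counter_t *c, long now, long error_timeout, int error_threshold)
{
  if (now >= c->reset_at) {          /* the interval expired: start a new one */
    c->failures = 0;
    c->reset_at = now + error_timeout;
  }
  c->failures += 1;
  return c->failures >= error_threshold;
}
```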

I changed this a few years ago in #24. It's actually addressing a separate point, which was false positives. I think the reason why you didn't arrive at that conclusion was that the previous solution would refresh the TTL every time it was modified.

Overall, likely the way we should look at circuits (I believe @hkdsun suggested this? Or maybe @csfrancis? Not my idea) is to talk about an "error percentage", i.e. if 10% of requests were errors in some window of time, open the circuits. That model is compatible with both, I think.

Again, I have not reviewed the code in detail here (just the approach), but it's a lot of code to maintain for what might appear to be a somewhat marginal advantage. I think my concern from 2015 was more a matter of how easy it was in that approach than real value.

Author

@pushrax suggested this in our shared doc as well.

My primary motivation for host-based circuits is to address the variable latency problem, where a single worker doesn't see error rates above a threshold but the superposition of all errors on a host shows an anomaly.

I'm not sure it works without some sort of window, because in the implementation before #24, previous errors were only purged after a period with no new errors since the last observed error. If new errors arrive more often than once per error_timeout, the pseudo-window grows without bound and the circuit eventually opens. So if I'm reading that code correctly, the circuit opens quickly when there are a lot of errors, slowly-but-surely when there are some errors, and only remains closed when there are fewer than one error per error_timeout.

Using that sort of approach in a host-based circuit implementation likely regresses to the slowly-but-surely case, where the circuit constantly opens at a low frequency. Combined with capacity loss during the half_open_resource_timeout period, this would be an overall drop in available capacity.

@michaelkipper michaelkipper force-pushed the mkipper/global-circuit-breaker branch from e58b147 to d16c6ab Compare June 26, 2019 21:12
@michaelkipper (Author)

@csfrancis has some legitimate concerns about the number of shared memory segments this PR would create. SHMMNI is 4,096 on Linux so with our LRU max_size of 500, we'd need 1,500 of them.

Scott suggested storing all the sliding windows in a single shared memory segment, but I have concerns with this approach: specifically, the complexity required to manage indexing into that data structure. Right now, we generate a key and the IPC subsystem handles the lookup.

Another difficulty is cleanup. Given that we set IPC_RMID and call shmdt when a sliding window goes out of scope, we can defer cleanup to the kernel. If we had a single memory segment, we'd have to manage all that ourselves.
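A sketch of the cleanup path described here (illustrative, not the actual PR code): the segment is flagged for removal when the Ruby object goes out of scope, and the kernel frees it only after the last attached process detaches.

```c
#include <sys/ipc.h>
#include <sys/shm.h>

static void
sliding_window_cleanup(int shmid, void *shmem)
{
  shmctl(shmid, IPC_RMID, NULL);  /* mark for destruction after the last detach */
  shmdt(shmem);                   /* detach this worker; kernel frees once the attach count hits zero */
}
```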

For SimpleInteger, the size of a shared memory segment on Linux is at least one page (4kB) so storing SimpleInteger in a shared memory segment is extremely inefficient. SEMMNI is 32,000 on Linux, so storing SimpleIntegers as a semaphore is a reasonable choice (as @sirupsen questioned earlier). I'll try and implement that ASAP.

@michaelkipper (Author)

@csfrancis was concerned last week that the host-based circuits PR would require too many shared memory segments. The default on Linux is 4,096 and we were going to be using somewhere in the neighborhood of 1,500 with a max_size LRU of 500. He suggested using a single shared memory segment for all the circuit data.

I took a long stab at it, but ultimately concluded that it was too hard to do garbage collection. With semaphores, it's easy to do garbage collection with SEM_UNDO (in SysV IPC). With shared memory, we can set IPC_RMID to have the segment be destroyed after the last process detaches from it. But with a single segment for all the circuits, we'd have to manage garbage collection within the segment ourselves. I don't think it's impossible, but it adds a large amount of complexity to an already complex PR.

To alleviate concerns about the number of shared memory segments, I converted the Simple::Integer to use a semaphore, since there are 32k semaphore sets available, which leaves the number of shared memory segments required at 500 since only the SlidingWindow is using them.

As far as shipping this is concerned, I have https://github.com/Shopify/shopify/pull/206065 to bump core to use the new branch, with the previous behaviour. Then #243 adds support for an environment variable SEMIAN_CIRCUIT_BREAKER_FORCE_HOST which enables host-based circuits based on machine name (so we can enable it on a single node). Once we validate the behaviour on that node, we can move to a percentage rollout.

@csfrancis (Contributor) left a comment

I'm not finished reviewing this, but I'm leaving the comments that I've left so far.

wait_for_shared_memory(uint64_t key)
{
for (int i = 0; i < RETRIES; ++i) {
int shmid = shmget(key, SHM_DEFAULT_SIZE, SHM_DEFAULT_PERMISSIONS);
Contributor

This retry logic is interesting - from looking at the docs, is there a case where shmget will fail and you expect that the retry will succeed?

Author

The first try uses IPC_CREAT | IPC_EXCL. If that fails, we assume it's because another process has created the segment. I think my intention was to check whether that other process had run the shared_memory_init_fn, but I've tried to optimize that out by building data structures that initialize correctly from zeroed-out memory.

I removed the wait loop - we can add that feature if we need it.
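For reference, the create-or-attach pattern being discussed might look roughly like this (constants and error handling are illustrative):

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/shm.h>

static int
get_or_create_segment(key_t key, size_t size, int permissions)
{
  /* Try to create the segment exclusively; only the winner runs the init_fn. */
  int shmid = shmget(key, size, IPC_CREAT | IPC_EXCL | permissions);
  if (shmid >= 0) return shmid;

  /* EEXIST means another worker created it first: attach to the existing segment. */
  if (errno == EEXIST) return shmget(key, size, permissions);

  return -1;
}
```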

// or else sem_get will complain that we have requested an incorrect number of sems
// for the desired key, and have changed the number of semaphores for a given key
const int NUM_SEMAPHORES = 4;
sprintf(semset_size_key, "_NUM_SEMS_%d", NUM_SEMAPHORES);
Contributor

I know you didn't add this, but this logic is just weird to me. If NUM_SEMAPHORES is a constant, why are we using sprintf here? Couldn't this be simplified to:

char semset_size_key[] = "_NUM_SEMS_4";
...
uniq_id_str = malloc(strlen(name) + strlen(semset_size_key) + 1);
sprintf(uniq_id_str, "%s%s", name, semset_size_key);

Should also probably be checking for malloc failures and raising.
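Continuing the snippet above, the malloc check might look like this (sketch only; assumes ruby.h is included so rb_raise and rb_eNoMemError are available):

```c
uniq_id_str = malloc(strlen(name) + strlen(semset_size_key) + 1);
if (uniq_id_str == NULL) {
  rb_raise(rb_eNoMemError, "could not allocate key buffer for '%s'", name);
}
sprintf(uniq_id_str, "%s%s", name, semset_size_key);
```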

static VALUE
resize_window(int sem_id, semian_simple_sliding_window_shared_t* window, int new_max_size)
{
if (new_max_size > SLIDING_WINDOW_MAX_SIZE) return Qnil;
Contributor

Should this be raising?
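One way the bounds check could raise instead of silently returning Qnil (a sketch; the resolved code may differ):

```c
if (new_max_size > SLIDING_WINDOW_MAX_SIZE) {
  rb_raise(rb_eArgError, "sliding window max_size %d exceeds limit %d",
           new_max_size, SLIDING_WINDOW_MAX_SIZE);
}
```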

Author

Done.


int sem_id = semget(key, 1, permissions);
Contributor

Does this produce a warning? I thought that key_t was a 32-bit value and you've changed the type to 64 bits.

Author

I didn't see one, and I compile with -Wall.
All the values are generated with:

key_t generate_key(const char *name);

I was following what was done in semian_resource_t.
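One plausible shape for generate_key, folding the name down into a 32-bit key_t so nothing wider ever reaches semget() or shmget(); this is purely illustrative, and Semian's actual hashing may differ:

```c
#include <stdint.h>
#include <sys/ipc.h>

key_t
generate_key(const char *name)
{
  uint32_t hash = 2166136261u;            /* FNV-1a over the resource name */
  for (const char *p = name; *p != '\0'; p++) {
    hash ^= (uint8_t)*p;
    hash *= 16777619u;
  }
  return (key_t)hash;
}
```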
