
failure scenarios, monitoring and glibc recommendations #40

Open
lukastribus opened this issue Sep 16, 2020 · 7 comments
Comments

@lukastribus

lukastribus commented Sep 16, 2020

Hello,

I'm currently evaluating the FORT validator and have a few questions.

I'm concerned about bugs, misconfigurations or other issues (in all RP/RTR setups, not specific to FORT) that would leave obsolete VRPs on the production routers, because I believe this is the worst case in RPKI ROV deployments.

I worry about how issues like:

  • crash bugs in the validation code
  • hangs during RPKI validation (even in rsync), that block the entire validation
  • memory allocation failures (failed malloc)
  • Linux OOM-killer

impact the RTR service:

The best-case scenario for me is that the RTR server goes down completely and all RTR sessions die, so that the production routers are aware there is a problem with that RTR endpoint and stop using it (failing over to other RTR servers, if available).

Is that the expected behavior in FORT? It is one process with multiple threads, so a crash would achieve this, correct?

I'm also thinking about monitoring (maybe without regex'ing logfile):

  • how to best monitor for periodic successful validation runs
  • how to monitor validation run time

Other than parsing strings from logfiles, how could we best achieve this? Is there some stat socket that we could query to check for things like last validation start time and last validation completion time?

Regarding glibc memory allocation: when using glibc, should we just use MALLOC_ARENA_MAX=2 always or only in environments with limited memory? If this is a good middle ground, I'd prefer to just use it always in the glibc world and have systemd unit files set this.

@pcarana
Contributor

pcarana commented Sep 18, 2020

Hi Lukas! I hope this answer helps with your analysis.

Is that the expected behavior in FORT? It is one-process with multiple threads, so a crash would achieve this, correct?

Correct. A crash will cause FORT validator to stop, which means the RTR server will also go down. The key question here is: when will FORT validator crash? Definitely due to bugs (hopefully there shouldn't be many of these, but nobody's perfect) or a programming error (logged at crit level, see Logging#level in our docs).

Regarding the other issues:

  • Memory allocation failures are handled and logged in the operation logs (so that the operator is notified about the issue). FORT will try to do its best to keep processing data; in the worst-case scenario, the current validation cycle's data will be discarded, so the RTR data shouldn't be modified and the RTR server won't die. If the problem persists, we rely on the operator to take action, since the logs will be full of "Out of memory" messages. Maybe this isn't what you expect, since the RTR server will still be alive. So, we'll consider your proposal to stop the whole process, to avoid the clients (routers) keeping stale data until the operator reacts to the FORT validator messages.
  • Hangs during RPKI validation. The most likely "hang" scenario is at the http/rsync requests. Here are some of the configurable timeouts and knobs for these requests: http.connect-timeout, http.transfer-timeout, http.idle-timeout, rsync.arguments-recursive, rsync.arguments-flat (see the sketch after this list). By default, once the connection is established, FORT validator will try to fetch all the required data from the endpoint; as long as data is being transferred, the connection won't be killed, so FORT will wait until the data is fully transferred (unless the connection is terminated by a local network or remote issue). In other words, the RTR server won't die even if the validation takes a while to complete.
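
To make that concrete, here is a minimal sketch of how those HTTP timeouts could look in a JSON configuration file; the key names are the ones listed above, the values are only illustrative, and the exact nesting and defaults should be checked against the configuration documentation for your FORT version:

```json
{
  "http": {
    "connect-timeout": 30,
    "transfer-timeout": 900,
    "idle-timeout": 15
  }
}
```

The rsync.arguments-recursive and rsync.arguments-flat lists control the arguments passed to rsync, so transfer limits can also be enforced there (for example via rsync's own --timeout option).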

I'm also thinking about monitoring (maybe without regex'ing logfile)

  • how to best monitor for periodic successful validation runs
  • how to monitor validation run time

Oops! As of today this data is logged at the info level in the operation logs (you'll have to set that level before running FORT, using --log.level=info, since the default level is warning).

Is there some stat socket that we could query to check for things like last validation start time and last validation completion time?

This will definitely be on our TODO list, so for now "regex'ing logfile" (using the info level) is the way.
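
Until then, a small watchdog that scans the operation log for those info-level messages can approximate this. The sketch below is only an illustration: the log path and the regular expressions are placeholders, since the exact wording of the validation start/completion messages depends on the FORT version, so adapt them to whatever your --log.level=info output actually contains.

```python
#!/usr/bin/env python3
"""Rough watchdog: alert when no validation cycle has completed recently.

The log path and patterns are PLACEHOLDERS; adapt them to the actual
info-level messages emitted by your FORT version.
"""
import re
import sys
import time
from datetime import datetime

LOG_FILE = "/var/log/fort.log"        # assumption: wherever Fort's output ends up
MAX_AGE_SECONDS = 2 * 3600            # alert if no completed cycle in the last 2 hours

TIMESTAMP_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2})")            # syslog-style prefix
COMPLETED_RE = re.compile(r"[Vv]alidation.*(finished|completed|ended)")  # placeholder pattern

def last_completion(path):
    """Return the timestamp of the last line that looks like a completed validation run."""
    last = None
    with open(path, errors="replace") as fd:
        for line in fd:
            if COMPLETED_RE.search(line):
                match = TIMESTAMP_RE.match(line)
                if match:
                    stamp = datetime.strptime(match.group(1), "%b %d %H:%M:%S")
                    last = stamp.replace(year=datetime.now().year)
    return last

def main():
    last = last_completion(LOG_FILE)
    if last is None:
        print("CRITICAL: no completed validation run found in the log")
        return 2
    age = int(time.time() - last.timestamp())
    if age > MAX_AGE_SECONDS:
        print(f"CRITICAL: last completed validation run was {age}s ago")
        return 2
    print(f"OK: last completed validation run was {age}s ago")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```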

Regarding glibc memory allocation: when using glibc, should we just use MALLOC_ARENA_MAX=2 always or only in environments with limited memory? If this is a good middle ground, I'd prefer to just use it always in the glibc world and have systemd unit files set this.

Yes, I would recommend using it in environments with limited memory. Of course, there's no problem using it always, since its main goal is to help.
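
If you do want it everywhere, a systemd drop-in is an easy way to set it; the unit name fort.service is an assumption here, so adjust it to your packaging, and run systemctl daemon-reload before restarting the service:

```ini
# /etc/systemd/system/fort.service.d/malloc.conf
# Cap the number of glibc malloc arenas for the FORT validator process.
[Service]
Environment=MALLOC_ARENA_MAX=2
```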

@lukastribus
Author

Thanks for the feedback.

I think a stat socket, or better yet an HTTP endpoint with a REST interface or something, that returns general health and validation metrics (especially for the last validation run: start time, run time and completion time) would indeed be important for active monitoring.

@pcarana
Contributor

pcarana commented Sep 28, 2020

I agree, this will be a nice (and likely needed) feature.

pcarana added this to the v1.5.0 milestone Oct 30, 2020
@pcarana
Contributor

pcarana commented Oct 30, 2020

The upcoming v1.5.0 release will try to address some of the points raised in this issue.

@lukastribus
Author

Regarding monitoring, I will build an rtrdump-based tool to check for stalled RTR endpoints (same RTR serial and data output after X amount of time = monitoring alert). I believe this is a better way to monitor validator/RTR server health than relying on validation timestamps from an HTTP API.
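
For what it's worth, a rough sketch of that idea follows. It deliberately treats the dump output as an opaque blob, so the exact rtrdump invocation and output format don't matter: if the dump command's output stays byte-for-byte identical for longer than a threshold, it alerts. The command line, state-file path and threshold are placeholders, and if your dump tool embeds timestamps in its output you'll want to strip them (or compare only the serial/VRP fields) first.

```python
#!/usr/bin/env python3
"""Alert when an RTR endpoint's dump output has not changed for too long."""
import hashlib
import json
import subprocess
import sys
import time

DUMP_CMD = ["rtrdump"]                # placeholder: add your endpoint/output arguments
STATE_FILE = "/var/tmp/rtr-stall-check.json"
MAX_UNCHANGED_SECONDS = 3 * 3600      # alert if the output is identical for > 3 hours

def current_digest():
    """Run the dump command and hash its raw output."""
    out = subprocess.run(DUMP_CMD, capture_output=True, check=True, timeout=120)
    return hashlib.sha256(out.stdout).hexdigest()

def load_state():
    try:
        with open(STATE_FILE) as fd:
            return json.load(fd)
    except (FileNotFoundError, ValueError):
        return None

def main():
    digest = current_digest()
    now = time.time()
    state = load_state()
    if state is None or state["digest"] != digest:
        # Output changed (or first run): remember when we last saw it change.
        with open(STATE_FILE, "w") as fd:
            json.dump({"digest": digest, "since": now}, fd)
        print("OK: RTR dump output changed recently")
        return 0
    unchanged_for = int(now - state["since"])
    if unchanged_for > MAX_UNCHANGED_SECONDS:
        print(f"CRITICAL: RTR dump output unchanged for {unchanged_for}s")
        return 2
    print(f"OK: RTR dump output unchanged for {unchanged_for}s")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```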

Slightly off-topic: do you consider doing garbage collection based on the files not referenced in valid current manifests (as opposed to rsync --delete / RRDP withdraws)?

Thanks

@pcarana
Contributor

pcarana commented Oct 30, 2020

I believe this is a better way to monitor validator/RTR server health than relying on validation timestamps from an HTTP API.

I agree, that's a good approach to monitor RTR server health.

Slightly off-topic: do you consider doing garbage collection based on the files not referenced in valid current manifests (as opposed to rsync --delete / RRDP withdraws)?

Well, I just read it a moment ago. It seems like a good suggestion, but I haven't discussed it with the team yet, so we need to analyze it before deciding what to do.

dhfelix added a commit that referenced this issue Nov 28, 2020
ydahhrk added a commit that referenced this issue Jun 11, 2021
ad841d5 was a mistake. It was never agreed in #40 that Fort should
shotgun blast its own face on the first ENOMEM, and even if it was, the
idea is preposterous. Memory allocation failures are neither programming
errors nor an excuse to leave all the routers hanging.

While there's some truth to the notion that a Fort memory leak (which
has been exhausting memory over time) could be temporarily amended by
killing Fort (and letting the OS clean it up), the argument completely
misses the reality that memory allocation failures could happen
regardless of the existence of a memory leak.

A memory leak is a valid reason to throw away the results of a current
validation run (as long as the admin is warned), but an existing
validation result and the RTR server must remain afloat.

Also includes a pr_enomem() caller review.
ydahhrk added a commit that referenced this issue Jun 7, 2022
Mostly quality of life improvements.

On the other hand, it looks like the notfatal hash table API was being
used incorrectly. HASH_ADD_KEYPTR can OOM, but `errno` wasn't being
caught.

Fixing this is nontrivial, however, because strange `reqs_error`
functions are in the way, and that's spaghetti I decided to avoid.
Instead, I converted HASH_ADD_KEYPTR usage to the fatal hash table API.
That's the future according to #40, anyway.

I don't think this has anything to do with #83, though.
ydahhrk added a commit that referenced this issue Jun 23, 2023
Trying to recover is incorrect because we don't want to advertise an
incomplete or outdated VRP table to the routers. We don't want to rely
on the OOM-killer; we NEED to die on memory allocation failures ASAP.

Though this doesn't have much to do with the RRDP refactor, I'm doing it
early to get plenty of time for testing and review.

Partially F1xes #40. (Still need to update some dependency usages.)
ydahhrk removed this from the v1.5.0 milestone Dec 1, 2023
@ydahhrk
Member

ydahhrk commented Dec 1, 2023

Status:

crash bugs in the validation code

As has been previously mentioned, Fort panics when it detects programming errors. Because the validator and the RTR server are part of the same binary, a validator crash also brings down the RTR server.

It has worked this way since the inception of the project.

hangs during RPKI validation (even in rsync), that block the entire validation

There are a few timeouts in place (1, 2, 3, 4, 5), but I still believe the implementation to be naive.

My lead concerns right now are adding a timeout to rsync invocations, as well as a timeout for the overall validation. After that, I would like to look into whether it's possible to assign timeouts to the cache's I/O operations.
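
As a conceptual illustration of the rsync deadline (Fort itself is C, so this Python sketch is not taken from its codebase; the rsync flags and the deadline value are just examples): run the transfer, and if it exceeds a hard wall-clock limit, kill it and treat the repository as unreachable for this cycle.

```python
import subprocess

def fetch_with_deadline(remote, local, deadline_seconds=300):
    """Run one rsync transfer, killing it if it exceeds a hard wall-clock deadline."""
    try:
        subprocess.run(
            ["rsync", "-rtz", "--delete", remote, local],
            check=True,
            timeout=deadline_seconds,  # the child is killed once the deadline passes
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Keep the previously validated data; retry on the next validation cycle.
        return False
```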

Additional ideas welcomed.

memory allocation failures (failed malloc)

As of 1.6.0, Fort generally panics on memory allocation failures. As you proposed, this is intended to prevent Fort from advertising incomplete information, regardless of what the environment thinks is an adequate response to a failed allocation. All mallocs outside of the asn1 code have already been wrapped.

I still consider this an ongoing effort, however, because of the still pending asn1 review, and also because some of Fort's dependencies sometimes obfuscate error causes. I don't know if there's a solution for the latter, other than ditching the dependency entirely.

I'm also thinking about monitoring (maybe without regex'ing logfile)

Embarrassingly, this is still meant to be addressed through the logs.

A Prometheus endpoint has branched off into issue #50, and I believe it is the problem I will address next. The missing stats server is crippling not only production monitoring, but also profiling during development and testing.


So, in summary... not a whole lot of progress, yet. But this is rapidly becoming the lead of my worries.
