
failure scenarios, monitoring and glibc recommendations #40

Open
lukastribus opened this issue Sep 16, 2020 · 7 comments
Comments

@lukastribus

lukastribus commented Sep 16, 2020

Hello,

I'm currently evaluating the FORT validator and have a few questions.

I'm concerned about bugs, misconfigurations or other issues (in all RP/RTR setups, not specific to FORT) that would leave obsolete VRPs on the production routers, because I believe this is the worst case in RPKI ROV deployments.

I worry about how issues like:

  • crash bugs in the validation code
  • hangs during RPKI validation (even in rsync), that block the entire validation
  • memory allocation failures (failed malloc)
  • Linux OOM-killer

impact the RTR service:

The best-case scenario for me is that the RTR server goes down completely and all RTR sessions die, so that the production routers are aware there is a problem with that RTR endpoint and stop using it (failing over to other RTR servers, if available).

Is that the expected behavior in FORT? It is one process with multiple threads, so a crash would achieve this, correct?

I'm also thinking about monitoring (maybe without regex'ing logfile):

  • how to best monitor for periodic successful validation runs
  • how to monitor validation run time

Other than parsing strings from logfiles, how could we best achieve this? Is there some stat socket that we could query to check for things like last validation start time and last validation completion time?

Regarding glibc memory allocation: when using glibc, should we just use MALLOC_ARENA_MAX=2 always or only in environments with limited memory? If this is a good middle ground, I'd prefer to just use it always in the glibc world and have systemd unit files set this.

@pcarana
Contributor

pcarana commented Sep 18, 2020

Hi Lukas! I hope this answer helps with your analysis.

Is that the expected behavior in FORT? It is one-process with multiple threads, so a crash would achieve this, correct?

Correct. A crash will cause FORT validator to stop, which means the RTR server will also go down. The key question here is: when will FORT validator crash? Definitely due to bugs (hopefully there shouldn't be many of these, but nobody's perfect) or a programming error (logged at crit level, see Logging#level in our docs).

Regarding the other issues:

  • Memory allocation failures are handled and logged in the operation logs (so that the operator is notified about the issue). FORT will try to do its best to keep processing data; in the worst-case scenario, the current validation cycle's data will be discarded, so the RTR data shouldn't be modified and the RTR server won't die. If the problem persists, we rely on the operator to take action, since the logs will be full of "Out of memory" messages. Maybe this isn't what you expect, since the RTR server will still be alive. So, we'll consider your proposal to stop the whole process, to avoid the clients (routers) keeping stale data until the operator reacts to the FORT validator messages.
  • Hangs during RPKI validation. The most likely "hang" scenario is at the http/rsync requests. Here are some of the configurable timeouts and knobs for these requests: http.connect-timeout, http.transfer-timeout, http.idle-timeout, rsync.arguments-recursive, rsync.arguments-flat (see the sketch after this list). By default, once the connection is established, FORT validator will try to fetch all the required data from the endpoint; as long as data is being transferred, the connection won't be killed, so FORT will wait until the data is fully transferred (unless the connection is terminated by a local network or remote issue). In other words, the RTR server won't die even if the validation takes a while to complete.
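
To make that concrete, here is a minimal sketch of how those HTTP timeouts could look in a JSON configuration file; the key names are the ones listed above, the values are only illustrative, and the exact nesting and defaults should be checked against the configuration documentation for your FORT version:

```json
{
  "http": {
    "connect-timeout": 30,
    "transfer-timeout": 900,
    "idle-timeout": 15
  }
}
```

The rsync.arguments-recursive and rsync.arguments-flat lists control the arguments passed to rsync, so transfer limits can also be enforced there (for example via rsync's own --timeout option).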

I'm also thinking about monitoring (maybe without regex'ing logfile)

  • how to best monitor for periodic successful validation runs
  • how to monitor validation run time

Oops! As of today this data is logged at the info level in the operation logs (you'll have to set that level before running FORT, using --log.level=info, since the default level is warning).

Is there some stat socket that we could query to check for things like last validation start time and last validation completion time?

This will definitely be on our TODO list, so for now "regex'ing logfile" (using the info level) is the way.
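
Until then, a small watchdog that scans the operation log for those info-level messages can approximate this. The sketch below is only an illustration: the log path and the regular expressions are placeholders, since the exact wording of the validation start/completion messages depends on the FORT version, so adapt them to whatever your --log.level=info output actually contains.

```python
#!/usr/bin/env python3
"""Rough watchdog: alert when no validation cycle has completed recently.

The log path and patterns are PLACEHOLDERS; adapt them to the actual
info-level messages emitted by your FORT version.
"""
import re
import sys
import time
from datetime import datetime

LOG_FILE = "/var/log/fort.log"        # assumption: wherever Fort's output ends up
MAX_AGE_SECONDS = 2 * 3600            # alert if no completed cycle in the last 2 hours

TIMESTAMP_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2})")            # syslog-style prefix
COMPLETED_RE = re.compile(r"[Vv]alidation.*(finished|completed|ended)")  # placeholder pattern

def last_completion(path):
    """Return the timestamp of the last line that looks like a completed validation run."""
    last = None
    with open(path, errors="replace") as fd:
        for line in fd:
            if COMPLETED_RE.search(line):
                match = TIMESTAMP_RE.match(line)
                if match:
                    stamp = datetime.strptime(match.group(1), "%b %d %H:%M:%S")
                    last = stamp.replace(year=datetime.now().year)
    return last

def main():
    last = last_completion(LOG_FILE)
    if last is None:
        print("CRITICAL: no completed validation run found in the log")
        return 2
    age = int(time.time() - last.timestamp())
    if age > MAX_AGE_SECONDS:
        print(f"CRITICAL: last completed validation run was {age}s ago")
        return 2
    print(f"OK: last completed validation run was {age}s ago")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```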

Regarding glibc memory allocation: when using glibc, should we just use MALLOC_ARENA_MAX=2 always or only in environments with limited memory? If this is a good middle ground, I'd prefer to just use it always in the glibc world and have systemd unit files set this.

Yes, I would recommend using it in environments with limited memory. Of course, there's no problem using it always, since its main goal is to help.
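
If you do want it everywhere, a systemd drop-in is an easy way to set it; the unit name fort.service is an assumption here, so adjust it to your packaging, and run systemctl daemon-reload before restarting the service:

```ini
# /etc/systemd/system/fort.service.d/malloc.conf
# Cap the number of glibc malloc arenas for the FORT validator process.
[Service]
Environment=MALLOC_ARENA_MAX=2
```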

@lukastribus
Author

Thanks for the feedback.

I think a stat socket, or better yet an HTTP endpoint with a REST interface or something, that returns general health and validation metrics (especially for the last validation run: start time, run time and completion time) would indeed be important for active monitoring.

@pcarana
Contributor

pcarana commented Sep 28, 2020

I agree, this will be a nice (and likely needed) feature.

pcarana added this to the v1.5.0 milestone Oct 30, 2020
@pcarana
Contributor

pcarana commented Oct 30, 2020

The upcoming v1.5.0 release will try to address some of the points raised in this issue.

@lukastribus
Author

Regarding monitoring, I will build an rtrdump-based tool to check for stalled RTR endpoints (same RTR serial and data output after X amount of time = monitoring alert). I believe this is a better way to monitor validator/RTR server health than relying on validation timestamps from an HTTP API.
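
For what it's worth, a rough sketch of that idea follows. It deliberately treats the dump output as an opaque blob, so the exact rtrdump invocation and output format don't matter: if the dump command's output stays byte-for-byte identical for longer than a threshold, it alerts. The command line, state-file path and threshold are placeholders, and if your dump tool embeds timestamps in its output you'll want to strip them (or compare only the serial/VRP fields) first.

```python
#!/usr/bin/env python3
"""Alert when an RTR endpoint's dump output has not changed for too long."""
import hashlib
import json
import subprocess
import sys
import time

DUMP_CMD = ["rtrdump"]                # placeholder: add your endpoint/output arguments
STATE_FILE = "/var/tmp/rtr-stall-check.json"
MAX_UNCHANGED_SECONDS = 3 * 3600      # alert if the output is identical for > 3 hours

def current_digest():
    """Run the dump command and hash its raw output."""
    out = subprocess.run(DUMP_CMD, capture_output=True, check=True, timeout=120)
    return hashlib.sha256(out.stdout).hexdigest()

def load_state():
    try:
        with open(STATE_FILE) as fd:
            return json.load(fd)
    except (FileNotFoundError, ValueError):
        return None

def main():
    digest = current_digest()
    now = time.time()
    state = load_state()
    if state is None or state["digest"] != digest:
        # Output changed (or first run): remember when we last saw it change.
        with open(STATE_FILE, "w") as fd:
            json.dump({"digest": digest, "since": now}, fd)
        print("OK: RTR dump output changed recently")
        return 0
    unchanged_for = int(now - state["since"])
    if unchanged_for > MAX_UNCHANGED_SECONDS:
        print(f"CRITICAL: RTR dump output unchanged for {unchanged_for}s")
        return 2
    print(f"OK: RTR dump output unchanged for {unchanged_for}s")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```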

Slightly off-topic: do you consider doing garbage collection based on the files not referenced in valid current manifests (as opposed to rsync --delete / RRDP withdraws)?

Thanks

@pcarana
Contributor

pcarana commented Oct 30, 2020

I believe this is a better way to monitor validator/RTR server health than relying on validation timestamps from an HTTP API.

I agree, that's a good approach to monitor RTR server health.

Slightly off-topic: do you consider doing garbage collection based on the files not referenced in valid current manifests (as opposed to rsync --delete / RRDP withdraws)?

Well, I just read it a moment ago. It seems like a good suggestion, but I haven't discussed it with the team yet, so we need to analyze it before deciding what to do.

dhfelix added a commit that referenced this issue Nov 28, 2020
ydahhrk added a commit that referenced this issue Jun 11, 2021
ad841d5 was a mistake. It was never agreed in #40 that Fort should
shotgun blast its own face on the first ENOMEM, and even if it was, the
idea is preposterous. Memory allocation failures are neither programming
errors nor an excuse to leave all the routers hanging.

While there's some truth to the notion that a Fort memory leak (which
has been exhausting memory over time) could be temporarily amended by
killing Fort (and letting the OS clean it up), the argument completely
misses the reality that memory allocation failures could happen
regardless of the existence of a memory leak.

A memory leak is a valid reason to throw away the results of a current
validation run (as long as the admin is warned), but an existing
validation result and the RTR server must remain afloat.

Also includes a pr_enomem() caller review.
ydahhrk added a commit that referenced this issue Jun 7, 2022
Mostly quality of life improvements.

On the other hand, it looks like the notfatal hash table API was being
used incorrectly. HASH_ADD_KEYPTR can OOM, but `errno` wasn't being
caught.

Fixing this is nontrivial, however, because strange `reqs_error`
functions are in the way, and that's spaghetti I decided to avoid.
Instead, I converted HASH_ADD_KEYPTR usage to the fatal hash table API.
That's the future according to #40, anyway.

I don't think this has anything to do with #83, though.
ydahhrk added a commit that referenced this issue Jun 23, 2023
Trying to recover is incorrect because we don't want to advertise an
incomplete or outdated VRP table to the routers. We don't want to rely
on the OOM-killer; we NEED to die on memory allocation failures ASAP.

Though this doesn't have much to do with the RRDP refactor, I'm doing it
early to get plenty of time for testing and review.

Partially F1xes #40. (Still need to update some dependency usages.)
ydahhrk removed this from the v1.5.0 milestone Dec 1, 2023
@ydahhrk
Member

ydahhrk commented Dec 1, 2023

Status:

crash bugs in the validation code

As has been previously mentioned, Fort panics when it detects programming errors. Because the validator and the RTR server are part of the same binary, a validator crash also brings down the RTR server.

It has worked this way since the inception of the project.

hangs during RPKI validation (even in rsync), that block the entire validation

There are a few timeouts in place (1, 2, 3, 4, 5), but I still believe the implementation to be naive.

My lead concerns right now are adding a timeout to rsync invocations, as well as a timeout for the overall validation. After that, I would like to look into whether it's possible to assign timeouts to the cache's I/O operations.
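
As a conceptual illustration of the rsync deadline (Fort itself is C, so this Python sketch is not taken from its codebase; the rsync flags and the deadline value are just examples): run the transfer, and if it exceeds a hard wall-clock limit, kill it and treat the repository as unreachable for this cycle.

```python
import subprocess

def fetch_with_deadline(remote, local, deadline_seconds=300):
    """Run one rsync transfer, killing it if it exceeds a hard wall-clock deadline."""
    try:
        subprocess.run(
            ["rsync", "-rtz", "--delete", remote, local],
            check=True,
            timeout=deadline_seconds,  # the child is killed once the deadline passes
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Keep the previously validated data; retry on the next validation cycle.
        return False
```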

Additional ideas welcomed.

memory allocation failures (failed malloc)

As of 1.6.0, Fort generally panics on memory allocation failures. As you proposed, this is intended to prevent Fort from advertising incomplete information, regardless of what the environment thinks is an adequate response to a failed allocation. All mallocs outside of the asn1 code have already been wrapped.

I still consider this an ongoing effort, however, because of the still pending asn1 review, and also because some of Fort's dependencies sometimes obfuscate error causes. I don't know if there's a solution for the latter, other than ditching the dependency entirely.

I'm also thinking about monitoring (maybe without regex'ing logfile)

Embarrassingly, this is still meant to be addressed through the logs.

A Prometheus endpoint has branched off into issue #50, and I believe it is the problem I will address next. The missing stats server is crippling not only production monitoring, but also profiling during development and testing.


So, in summary... not a whole lot of progress, yet. But this is rapidly becoming the lead of my worries.
