failure scenarios, monitoring and glibc recommendations #40
Hi Lukas! I hope this answer helps with your analysis.
Correct. A crash will cause FORT validator to stop, which means that the RTR server will also go down. The key point here is: when will FORT validator crash? Definitely due to bugs (hopefully there shouldn't be many of these, but nobody's perfect) or a programming error. Regarding the other issues:
Oops! As of today this data is logged at a level
This will definitely be on our TODO list, so for now "regex'ing logfile" (using the
Yes, I would recommend using it in environments with limited memory. Of course, there's no problem using it always, since its main goal is to help.
Thanks for the feedback. I think a stat socket, or better yet an HTTP endpoint with a REST interface or something, that returns general health and validation metrics (especially for the last validation run: start, duration, and stop) would indeed be important for active monitoring.
I agree, this will be a nice (and likely needed) feature.
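To illustrate what consuming such a health endpoint could look like, here is a minimal Python sketch. The JSON field names (`last_validation_start`, `last_validation_end`, `vrp_count`) and the staleness threshold are invented for illustration; they are not an actual FORT API.

```python
import json
from datetime import datetime, timezone

# Hypothetical payload a validator health endpoint might return.
# Field names are illustrative, not FORT's actual interface.
SAMPLE = json.dumps({
    "last_validation_start": "2021-01-01T00:00:00Z",
    "last_validation_end": "2021-01-01T00:10:00Z",
    "vrp_count": 250000,
})

def is_stale(payload, now, max_age_minutes=90):
    """Alert if the last completed validation run is older than the threshold."""
    data = json.loads(payload)
    end = datetime.fromisoformat(
        data["last_validation_end"].replace("Z", "+00:00"))
    return (now - end).total_seconds() > max_age_minutes * 60

now = datetime(2021, 1, 1, 2, 0, tzinfo=timezone.utc)
print(is_stale(SAMPLE, now))  # 110 minutes since last run -> True
```

A monitoring system would fetch the payload periodically and page when `is_stale` flips to true.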
The newer version v1.5.0 will try to address some of the points stated in this issue.
Regarding monitoring, I will build an rtrdump-based tool to check for stalled RTR endpoints (same RTR serial and data output after X amount of time = monitoring alert). I believe this is a better way to monitor validator/RTR server health than relying on validation timestamps from an HTTP API. Slightly off-topic: do you consider doing garbage collection based on the files not referenced in valid current manifests (as opposed to rsync --delete/RRDP withdraws)? Thanks
I agree, that's a good approach to monitor RTR server health.
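The stalled-endpoint check described above (same serial and same data after X amount of time = alert) can be sketched roughly like this in Python. `snapshot` stands in for one rtrdump-style poll; the VRP strings and data format here are hypothetical.

```python
import hashlib

def snapshot(serial, vrps):
    """Summarize one poll of an RTR endpoint as (serial, digest of payload)."""
    digest = hashlib.sha256("\n".join(sorted(vrps)).encode()).hexdigest()
    return serial, digest

def is_stalled(prev, curr):
    """Same serial and identical data across the polling interval => alert."""
    return prev == curr

a = snapshot(42, ["10.0.0.0/8-8 AS64500"])
b = snapshot(42, ["10.0.0.0/8-8 AS64500"])
c = snapshot(43, ["10.0.0.0/8-8 AS64500", "192.0.2.0/24-24 AS64501"])
print(is_stalled(a, b))  # True: serial and data unchanged -> stalled
print(is_stalled(b, c))  # False: serial advanced -> healthy
```

Hashing the sorted payload means the comparison is cheap to store between polls and insensitive to output ordering.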
Well, I've read it a moment ago. It seems a good suggestion, but I haven't discussed it with the team yet, so we need to analyze it in order to take a call on what to do.
ad841d5 was a mistake. It was never agreed in #40 that Fort should shotgun blast its own face on the first ENOMEM, and even if it was, the idea is preposterous. Memory allocation failures are neither programming errors nor an excuse to leave all the routers hanging. While there's some truth to the notion that a Fort memory leak (which has been exhausting memory over time) could be temporarily amended by killing Fort (and letting the OS clean it up), the argument completely misses the reality that memory allocation failures could happen regardless of the existence of a memory leak. A memory leak is a valid reason to throw away the results of a current validation run (as long as the admin is warned), but an existing validation result and the RTR server must remain afloat. Also includes a pr_enomem() caller review.
Mostly quality-of-life improvements. On the other hand, it looks like the notfatal hash table API was being used incorrectly. HASH_ADD_KEYPTR can OOM, but `errno` wasn't being checked. Fixing this is nontrivial, however, because strange `reqs_error` functions are in the way, and that's a spaghetti I decided to avoid. Instead, I converted HASH_ADD_KEYPTR usage to the fatal hash table API. That's the future according to #40, anyway. I don't think this has anything to do with #83, though.
Trying to recover is incorrect because we don't want to advertise an incomplete or outdated VRP table to the routers. We don't want to rely on the OOM-killer; we NEED to die on memory allocation failures ASAP. Though this doesn't have much to do with the RRDP refactor, I'm doing it early to get plenty of time for testing and review. Partially F1xes #40. (Still need to update some dependency usages.)
Status:
As has been previously mentioned, Fort panics when it detects programming errors. Because the validator and RTR server are part of the same binary, validator errors also bring down the RTR server. It has worked this way since the inception of the project.
There are a few timeouts in place (1, 2, 3, 4, 5), but I still believe the implementation to be naive. My lead concerns right now are a timeout for rsync invocations, as well as a timeout for the overall validation. After that, I would like to look into whether it's possible to assign timeouts to I/O operations in the cache. Additional ideas are welcome.
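A per-invocation timeout of the kind described above can be sketched with Python's `subprocess` module. This is only an illustration of the idea, not FORT's actual implementation (which is C); `sleep` and `true` merely simulate a hung and an instant transfer.

```python
import subprocess

def fetch_with_timeout(cmd, timeout_s):
    """Run an external fetch (e.g. an rsync command) but give up after
    timeout_s, so one hung repository cannot stall the whole validation run."""
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        return False  # caller logs the timeout and moves on
    except subprocess.CalledProcessError:
        return False  # transfer failed outright

# Simulate a hung transfer and a successful one.
print(fetch_with_timeout(["sleep", "5"], timeout_s=0.2))  # False
print(fetch_with_timeout(["true"], timeout_s=5))          # True
```

`subprocess.run` kills the child process when the timeout expires, which is the important property: the overall validation deadline stays bounded by the sum of per-fetch timeouts.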
As of 1.6.0, Fort generally panics on memory allocation failures. As you proposed, this is intended to prevent Fort from advertising incomplete information, regardless of what the environment thinks is an adequate response to a failed allocation. Still, I consider this an ongoing effort, because of the still pending asn1 review, and also because some of Fort's dependencies sometimes obfuscate error causes. I don't know if there's a solution for the latter, other than ditching the dependency entirely.
Embarrassingly, this is still meant to be addressed through the logs. The Prometheus endpoint has branched off into issue #50, and I believe it is the problem I will address next. The missing stats server is not only crippling production monitoring, but also profiling during development and testing. So, in summary... not a whole lot of progress yet. But this is rapidly becoming the lead of my worries.
Hello,
I'm currently evaluating the FORT validator and have a few questions.
I'm concerned about bugs, misconfigurations, or other issues (in all RP/RTR setups, not specific to FORT) that will cause obsolete VRPs on the production routers, because I believe this is the worst case in RPKI ROV deployments.
I worry about how issues like these impact the RTR service:
The best-case scenario for me is that the RTR server goes down completely and all RTR sessions die, so that the production routers are aware there is a problem with that RTR endpoint and stop using it (failing over to other RTR servers, if available).
Is that the expected behavior in FORT? It is one process with multiple threads, so a crash would achieve this, correct?
I'm also thinking about monitoring (preferably without regex'ing the logfile).
Other than parsing strings from logfiles, how could we best achieve this? Is there some stat socket that we could query to check for things like last validation start time and last validation completion time?
Regarding glibc memory allocation: when using glibc, should we use `MALLOC_ARENA_MAX=2` always, or only in environments with limited memory? If it's a good middle ground, I'd prefer to just use it always in the glibc world and have systemd unit files set this.
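For what it's worth, one way to set this under systemd is a drop-in file, assuming the service is systemd-managed; the unit name `fort.service` below is illustrative:

```ini
# /etc/systemd/system/fort.service.d/malloc.conf
# Caps the number of glibc malloc arenas to limit memory fragmentation.
# Apply with: systemctl daemon-reload && systemctl restart fort
[Service]
Environment=MALLOC_ARENA_MAX=2
```

A drop-in keeps the setting separate from the packaged unit file, so it survives package upgrades.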