Fort's validation produces no router keys #58
Comments
+1, I have been getting the same error message for the past hour.
As an update, the service goes down again shortly after being restarted.
Which version of Fort are you running?
I have 1.4.2 and the same thing happens to me.
We have a mix of versions. I can confirm it happens with 1.5.1-1, which is the most recent one we run, on Ubuntu, installed from the .deb packages.
Here is the stack trace from version 1.5.1-1:
root@srv-fort:/usr/local/bin# systemctl status fort
Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x1e745) [0x55a052dda745]
I observed the following:
CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack
It seems the problem behind #58 and #59 is a stray defer separator pop. The comment above x509stack_cancel() clearly states that the function should only be called shortly after an x509stack_push(), but there's one call in certificate_traverse() that isn't. Removing this x509stack_cancel() seems to prevent the crash. I'm still investigating the original intent of this code. Tentatively fixes #58 and #59.
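To make the failure mode concrete, here is a toy Python model of it. This is not Fort's C code: the class and function names are hypothetical stand-ins, and the only assumption is the contract described above, namely that cancel() is valid only right after the matching push(). A cancel() made anywhere else removes an entry the traversal still expects, so a later pop finds the stack empty.

```python
class EmptyStackError(Exception):
    """Toy analogue of the 'Attempted to pop empty X509 stack' critical error."""


class DeferStack:
    """Illustrative stack; the names are hypothetical, not Fort's API."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def cancel(self):
        # Intended contract: undo the push that was *just* made.
        # Nothing here can verify that the caller honours that contract.
        self._remove_top("cancel")

    def pop(self):
        return self._remove_top("pop")

    def _remove_top(self, caller):
        if not self._items:
            raise EmptyStackError(f"{caller}: attempted to pop empty stack")
        return self._items.pop()


def traverse(stray_cancel):
    """Push one entry, optionally cancel it out of turn, then pop it."""
    stack = DeferStack()
    stack.push("certificate")
    if stray_cancel:
        stack.cancel()  # the stray call: removes an entry that is still needed
    return stack.pop()
```

Calling traverse(False) returns the pushed entry, while traverse(True) raises EmptyStackError at the final pop, which mirrors how the crash surfaces far from the call that actually caused it.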
This seems to have a simple fix, but it concerns a feature I'm not completely familiar with. The patch I just uploaded seems to yield a successful RIPE traversal, but I'm still investigating whether it's fully correct. Feedback would be appreciated: could you download it and check whether it behaves well?
FWIW I also got this error
I'd prefer to wait for a stable bugfix release before updating though.
Update: the RIPE file with issues was apparently fixed. It works now without modifying any config files.
The crash is already fixed, but as @job pointed out, the patch introduced a memory leak, and I ran into more problems after that. I'll have a proper refactor tomorrow, hopefully. Sorry for the inconvenience.
If there was a RIPE file with issues, please list which one. We would be happy to investigate whether there really was an unexpected object published in one of the repos we control (and follow up if it was a non-hosted repo).
Way I see it, don't worry about it. The "unexpected object" contains a normal feature that was implemented two years ago in Fort, but adoption has apparently been slow (to the point that 12 hours ago was seemingly the first time it appeared in the wild), so the code either broke in the meantime or was never tested with real data. It's hard to tell, because the developer didn't leave relevant unit tests. Which, of course, also needs to be fixed urgently. I'm hoping to finish patching the code tomorrow.
There were no BGPsec objects out in the wild, and real test data was, and still is, not available. My understanding is that only Dragon Research Labs' rpkid supports creating these objects. I guess this is also a painful reminder that there is an inherent risk in testing new objects in the wild under a live TAL instead of on a testbed.
@ties the problem is not “new objects”. The problem is that software crashes unexpectedly when it shouldn’t be crashing. You might also recall a very recent issue with lacking input validation in “old objects”, that affected a lot of validators.
@job true - it should not crash. RP software crashing in this way because of (valid/non-abusive) inputs is a very scary failure mode and both examples motivate the existence of a diverse ecosystem of RP implementations.
... and that's not to mention the universe of potential garbage that an RP could Surely it is implicit in the architecture of the RPKI that RPs should not
It's a good thing that it does crash; a hung state machine with the RTR sessions up and running would be a lot worse, possibly remaining undetected for a while by a lot of people. The RTR protocol does not have any expire or TTL field for the VRP, so routers could make wrong ROV decisions for weeks or months if it didn't crash. Also: a second validator/RTR server with a different software stack can work around this only if the instance in question crashes. If it just hangs with the RTR sessions up, routers would still use obsolete VRPs until manual intervention. Manual intervention requires awareness, and this requires active monitoring. I hope we continue to see hard crashes when encountering bugs in validators and RTR servers, as opposed to the nastier possibilities.
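The kind of active monitoring mentioned above could be as simple as a freshness check on the validator's output. The following Python sketch is only an illustration: the function name, the two-hour threshold, and the idea of tracking a "last successful validation" timestamp are assumptions, and how that timestamp is obtained depends on the deployment.

```python
from datetime import datetime, timedelta, timezone


def vrp_data_is_stale(last_success, now, max_age=timedelta(hours=2)):
    """Return True if the last successful validation is older than max_age.

    last_success and now are timezone-aware datetimes; max_age is an
    assumed threshold that each operator would need to tune.
    """
    return now - last_success > max_age


# Example with fixed timestamps, so the result does not depend on the clock.
last_ok = datetime(2021, 10, 11, 19, 0, tzinfo=timezone.utc)
checked_at = datetime(2021, 10, 11, 23, 0, tzinfo=timezone.utc)
stale = vrp_data_is_stale(last_ok, checked_at)
```

A check like this catches both the hard crash and the quieter case of a validator that keeps serving RTR sessions without refreshing its data.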
There were more flaws than expected:
- ASN ranges were being converted to ASNs too early, so the code was iterating too much.
- BGPsec certificates were being largely handled by the CA certificate function, even though they're not CA certificates.
- The BGPsec certificate code was using the parent's resources when it was supposed to use the BGPsec certificate's own resources, so it was generating lots of bogus router keys.
- Bad cleanup in some unhappy path somewhere. Don't remember.
There's a known issue: Global URI 'xxxxxx.cer' does not begin with 'rsync://'. But I suspect it's unrelated, so I decided to postpone it. The code has been drastically improved, but it still needs more testing.
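As a rough illustration of the first flaw (expanding ASN ranges into individual ASNs too early), the Python sketch below keeps ranges as inclusive (low, high) pairs and answers containment and size questions arithmetically, without iterating over every ASN. The type and function names are hypothetical and this is not how Fort represents resources; it also makes the simplifying assumption that a child range must fit inside a single parent range.

```python
from typing import NamedTuple


class AsnRange(NamedTuple):
    low: int
    high: int  # inclusive upper bound


def range_size(r):
    """Number of ASNs in the range, computed without enumerating them."""
    return r.high - r.low + 1


def covers(parent_ranges, child):
    """True if child lies entirely within one of the parent ranges.

    Simplification: a child spanning two adjacent parent ranges is
    reported as not covered.
    """
    return any(p.low <= child.low and child.high <= p.high
               for p in parent_ranges)


parent = [AsnRange(64496, 64511)]
inside = covers(parent, AsnRange(64500, 64505))
outside = covers(parent, AsnRange(64510, 64520))
```

With this representation, a range spanning the whole 32-bit ASN space costs the same to check as a single ASN, whereas an eager expansion would loop over billions of values.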
Two new branches: issue58-simple and issue58-proper.
Get the first one if you just want to be immune to the bug. It cuts out the whole feature, but then again, adoption is pretty much nonexistent, so you're not missing anything. The second one is the proper fix, but it still needs more testing.
The issue58-simple version of the patch has been released as part of 1.5.2. issue58-proper is still in the testing phase.
Simplifies it by moving some of its clutter to helper functions, seemingly in preparation for #58. Part of a series of patches meant to manually rebase the issue58-proper branch.
Hello, the fort process suddenly failed on all the instances we installed:
Oct 11 19:04:11 srv-fort fort[2950289]: CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack
Oct 11 19:04:11 srv-fort fort[2950289]: Stack trace:
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(print_stack_trace+0x32) [0x55660962b4f2]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(pr_crit+0x8f) [0x55660962d8af]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(+0x1b297) [0x556609628297]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(deferstack_pop+0x3b) [0x55660962847b]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(+0x2da28) [0x55660963aa28]
Oct 11 19:04:11 srv-fort fort[2950289]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7ff03fea2609]
Oct 11 19:04:11 srv-fort fort[2950289]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff03fdc9293]
Oct 11 19:04:11 srv-fort fort[2950289]: (Stack size was 7.)
Oct 11 19:04:11 srv-fort fort[2950289]: ERR: rsync://rpki.apnic.net/member_repository/A91DE10F/60EA5C2CB5D311E7B6A2DD5DC4F9AE02/NDyCcTdhxY6CRQ2UqleWffm0bxU.mft: Unknown message digest sha256
Oct 11 19:04:11 srv-fort systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION
Oct 11 19:04:12 srv-fort systemd[1]: fort.service: Failed with result 'exit-code'.
We restarted the service and everything worked correctly.
Could you help us identify the problem that caused it?
Regards,
Mauricio