Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fort's validation produces no router keys #58

Open
moviedo-sc opened this issue Oct 11, 2021 · 22 comments
Open

Fort's validation produces no router keys #58

moviedo-sc opened this issue Oct 11, 2021 · 22 comments

Comments

@moviedo-sc
Copy link

Hola, repentinamente el proceso de fort falló en todas las instancias que instalamos:

Oct 11 19:04:11 srv-fort fort[2950289]: CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack
Oct 11 19:04:11 srv-fort fort[2950289]: Stack trace:
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(print_stack_trace+0x32) [0x55660962b4f2]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(pr_crit+0x8f) [0x55660962d8af]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(+0x1b297) [0x556609628297]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(deferstack_pop+0x3b) [0x55660962847b]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(+0x2da28) [0x55660963aa28]
Oct 11 19:04:11 srv-fort fort[2950289]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7ff03fea2609]
Oct 11 19:04:11 srv-fort fort[2950289]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff03fdc9293]
Oct 11 19:04:11 srv-fort fort[2950289]: (Stack size was 7.)
Oct 11 19:04:11 srv-fort fort[2950289]: ERR: rsync://rpki.apnic.net/member_repository/A91DE10F/60EA5C2CB5D311E7B6A2DD5DC4F9AE02/NDyCcTdhxY6CRQ2UqleWffm0bxU.mft: Unknown message digest sha256
Oct 11 19:04:11 srv-fort systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION
Oct 11 19:04:12 srv-fort systemd[1]: fort.service: Failed with result 'exit-code'.

Reiniciamos el servicio y todo funcionó correctamente.

¿Podrían darnos apoyo para identificar el problema que lo causó?

Saludos,
Mauricio

@juliencorsini
Copy link

+1, getting the same error message since 1 hour ago.

@moviedo-sc
Copy link
Author

Como update, el servicio vuelve a caer al poco tiempo de reiniciarlo.

@ydahhrk
Copy link
Member

ydahhrk commented Oct 11, 2021

¿Cuál es la versión de Fort?

@bilbomx
Copy link

bilbomx commented Oct 11, 2021

Yo tengo la 1.4.2 y me ocurre lo mismo
La instale via rpm en Centos

@moviedo-sc
Copy link
Author

Tenemos un mix de versiones. Confirmo que nos pasa con 1.5.1-1 que es el que tenemos más actual, en Ubuntu instalado desde los .deb.

@moviedo-sc
Copy link
Author

Aquí el stack trace de la versión 1.5.1-1:

root@srv-fort:/usr/local/bin# systemctl status fort
● fort.service - FORT RPKI validator
Loaded: loaded (/lib/systemd/system/fort.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2021-10-11 20:19:31 UTC; 2min 27s ago
Docs: man:fort(8)
https://nicmx.github.io/FORT-validator/
Process: 3001527 ExecStart=/usr/bin/fort --configuration-file /etc/fort/config.json (code=exited, status=255/EXCEPTION)
Main PID: 3001527 (code=exited, status=255/EXCEPTION)

Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x1e745) [0x55a052dda745]
Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(deferstack_pop+0x37) [0x55a052dda9a7]
Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x3457a) [0x55a052df057a]
Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x34f68) [0x55a052df0f68]
Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x44ea8) [0x55a052e00ea8]
Oct 11 20:19:31 srv-fort fort[3001527]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f437cab3609]
Oct 11 20:19:31 srv-fort fort[3001527]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f437c9da293]
Oct 11 20:19:31 srv-fort fort[3001527]: (End of stack trace)
Oct 11 20:19:31 srv-fort systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION
Oct 11 20:19:31 srv-fort systemd[1]: fort.service: Failed with result 'exit-code'.

@job
Copy link
Contributor

job commented Oct 11, 2021

I observed the following:

pcx$ doas ./fort --tal /etc/rpki/ripe.tal
Oct 11 20:21:48 INF: Disabling validation logging on syslog.
Oct 11 20:21:48 INF: Disabling validation logging on standard streams.
Oct 11 20:21:48 INF: Console log output configured; disabling operation logging on syslog.
Oct 11 20:21:48 INF: (Operation Logs will be sent to the standard streams only.)
Oct 11 20:21:48 WRN: First validation cycle has begun, wait until the next notification to connect your router(s)
Oct 11 20:25:54 ERR: /etc/rpki/ripe.tal: Attempted to pop empty X509 stack
Stack trace:
  0x3c6ea399078 <print_stack_trace+56> at ./fort
  0x3c6ea39a79c <pr_crit+236> at ./fort
  0x3c6ea3948b1 <deferstack_pop+337> at ./fort
  0x3c6ea3947be <deferstack_pop+94> at ./fort
  0x3c6ea3acb81 <perform_standalone_validation+1985> at ./fort
  0x3c6ea3ac7e6 <perform_standalone_validation+1062> at ./fort
  0x3c6ea3bdc5b <thread_pool_wait+619> at ./fort
  0x3c907d07281 <pthread_create+657> at /usr/lib/libpthread.so.26.1
(End of stack trace)

@moviedo-sc
Copy link
Author

CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack
Sin el archivo TAL de Ripe no se ha caído en 1.5.1-1.

ydahhrk added a commit that referenced this issue Oct 11, 2021
It seems the #58 and #59 problem is a stray defer separator pop.

The comment above x509stack_cancel() clearly states that the function
should only be called shortly after a x509stack_push(), but there's one
in certificate_traverse() that isn't.

Removing this x509stack_cancel() seems to prevent the crash. I'm still
investigating the original intent of this code.

Tentatively f1xes #58 and #59.
@ydahhrk
Copy link
Member

ydahhrk commented Oct 11, 2021

(Español abajo)

This seems to have a simple fix, but it concerns a feature I'm not completely familiar with. The patch I just uploaded seems to yield a successful RIPE traversal, but I'm still investigating whether it's fully correct.

Feedback would be appreciated.


Esto parece tener una corrección muy simple, pero no estoy completamente familiarizado con el código relevante. El parche que acabo de subir parece corregir el problema, pero todavía estoy investigando si es completamente correcto.

¿Podrían bajarlo y ver si se comporta bien?

@job
Copy link
Contributor

job commented Oct 11, 2021

@ydahhrk the changeset proposal seems to introduce a memory leak triggering OOM. I'll reach out to you via email to share additional information.

@forkwhilefork
Copy link

FWIW I also got this error

Oct 11 14:21:35 core1 fort[7387]: CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack
Oct 11 14:21:35 core1 fort[7387]: Stack trace:
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(print_stack_trace+0x32) [0x558a924a3e52]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(pr_crit+0x99) [0x558a924a73d9]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x1e563) [0x558a924a0563]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(deferstack_pop+0x2f) [0x558a924a077f]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x34fda) [0x558a924b6fda]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x35a11) [0x558a924b7a11]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x452b4) [0x558a924c72b4]
Oct 11 14:21:35 core1 fort[7387]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f1b0f98cfa3]
Oct 11 14:21:35 core1 fort[7387]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f1b0f8bd4cf]
Oct 11 14:21:35 core1 fort[7387]: (Stack size was 9.)
Oct 11 14:21:35 core1 fort[7387]: Expand failed !
Oct 11 14:21:35 core1 systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION
Oct 11 14:21:35 core1 systemd[1]: fort.service: Failed with result 'exit-code'.

I'd prefer to wait for a stable bugfix release before updating though.

@moviedo-sc
Copy link
Author

moviedo-sc commented Oct 11, 2021

Update: the RIPE file with issues was apparently fixed. It works now without modifying any config files.
However, it will be important to determine why it made FORT crash.

@ydahhrk
Copy link
Member

ydahhrk commented Oct 12, 2021

The crash is already fixed, but as @job pointed out, the patch introduced a memory leak, and I ran into more problems after that. I'll have a proper refactor tomorrow, hopefully.

Sorry for the inconveniences.

@ties
Copy link

ties commented Oct 12, 2021

Update: the RIPE file with issues was apparently fixed. It works now without modifying any config files. However, it will be important to determine why it made FORT crash.

If there was a ripe file with issues - please list which one. We would be happy to investigate if there really was an unexpected object published in one of the repos we control (and follow up if it was a non-hosted repo).

@ydahhrk
Copy link
Member

ydahhrk commented Oct 12, 2021

Way I see it, don't worry about it. The "unexpected object" contains a normal feature that was implemented two years ago in Fort, but adoption has apparently been slow (to the point 12 hours ago was seemingly the first time it has appeared in the wild), so the code either broke in the meantime or was never tested with real data.

It's hard to tell, because the developer didn't leave relevant unit tests. Which, of course, also needs to be fixed urgently.

I'm hoping to finish patching the code tomorrow.

@ties
Copy link

ties commented Oct 12, 2021

Way I see it, don't worry about it. The "unexpected object" contains a normal feature that was implemented two years ago in Fort, but adoption has apparently been slow (to the point 12 hours ago was seemingly the first time it has appeared in the wild), so the code either broke in the meantime or was never tested with real data.

There were no bgpsec object out in the wild and real test data was/is not available. My understanding is that only Dragon Research Labs rpkid supports creating these objects. I guess this is also a painful reminder that there is an inherent risk to testing new objects in the wild instead under a live tal instead of on a testbed.

@job
Copy link
Contributor

job commented Oct 12, 2021

@ties the problem is not “new objects”. The problem is that software crashes unexpectedly when it shouldn’t be crashing.

You might also recall a very recent issue with lacking input validation in “old objects”, that affected a lot of validators.

@ties
Copy link

ties commented Oct 12, 2021

@job true - it should not crash. RP software crashing in this way because of (valid/non-abusive) inputs is a very scary failure mode and both examples motivate the existence of a diverse ecosystem of RP implementations.

@benmaddison
Copy link

@ties the problem is not “new objects”. The problem is that software crashes unexpectedly when it shouldn’t be crashing.

You might also recall a very recent issue with lacking input validation in “old objects”, that affected a lot of validators.

... and that's not to mention the universe of potential garbage that an RP could
encounter when processing data retrieved from an untrusted (and therefore
potentially hostile) publication point.

Surely it is implicit in the architecture of the RPKI that RPs should not
explode if they encounter unknown or invalid objects, let alone objects that
conform to the specifications of a standards track RFC?

@lukastribus
Copy link

It's a good thing that it does crash; a hung state machine with the RTR sessions up and running would be a lot worse, possibly remaining undetected for a while by a lot of people. The RTR protocol does not have any expire or TTL field for the VRP, routers could make wrong ROV decisions for weeks and months if it wouldn't crash.

Also: a second validator/RTR server with a different software stack can workaround this only if the instance in question crashes. If it just hangs with the RTR sessions up, routers would still use obsolete VRP's until manual intervention.

Manual intervention requires awareness, and this requires active monitoring:
https://labs.ripe.net/author/lukas_tribus/rpki-rov-about-stale-rtr-servers-and-how-to-monitor-them/

I hope we continue to see hard crashes when encountering bugs in validators and RTR servers, as opposed to the more nasty possibilities.

ydahhrk added a commit that referenced this issue Oct 13, 2021
There were more flaws than expected.

- ASN ranges were being converted to ASNs too early, so the code was
  iterating too much.
- BGPsec certificates were being largely handled by the CA certificate
  function, even though they're not.
- The BGP certificate code was using the parent's resources when it was
  supposed to use the BGP certificate's resources, so it was generating
  lots of bogus router keys.
- Bad cleanup in some unhappy path somewhere. Don't remember.

There's a known issue:

	Global URI 'xxxxxx.cer' does not begin with 'rsync://'.

But I suspect it's unrelated, so I decided to postpone it.

The code has been drastically improved, but it still needs more testing.
ydahhrk added a commit that referenced this issue Oct 13, 2021
Completely axes the whole ordeal.

Is not a long-term solution, but will lead to stable behavior until
the other branch has been thoroughly tested.
@ydahhrk
Copy link
Member

ydahhrk commented Oct 13, 2021

Two new branches:

  1. issue58-simple (matches master)
  2. issue58-proper

Get the first one if you just want to be immune to the bug. It cuts out the whole feature, but then again, adoption is pretty much nonexistent, so you're not missing anything.

The second one is the proper fix, but it still needs more testing.

@ydahhrk ydahhrk modified the milestone: 1.5.2 Oct 21, 2021
@ydahhrk
Copy link
Member

ydahhrk commented Oct 21, 2021

The issue58-simple version of the patch has been released as part of 1.5.2.

issue58-proper is still in the testing phase.

ydahhrk added a commit that referenced this issue Jul 7, 2023
Simplifies it by moving some of its clutter to helper functions,
seemingly in preparation for #58.

Part of a series of patches meant to manually rebase the issue58-proper
branch.
@ydahhrk ydahhrk changed the title Fort validator se detuvo en varias instancias Fort's validation produces no router keys Dec 1, 2023
@ydahhrk ydahhrk added this to the 1.6.6 milestone May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants