Fort's validation produces no router keys #58

moviedo-sc · 2021-10-11T19:08:07Z

Hola, repentinamente el proceso de fort falló en todas las instancias que instalamos:

Oct 11 19:04:11 srv-fort fort[2950289]: CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack
Oct 11 19:04:11 srv-fort fort[2950289]: Stack trace:
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(print_stack_trace+0x32) [0x55660962b4f2]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(pr_crit+0x8f) [0x55660962d8af]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(+0x1b297) [0x556609628297]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(deferstack_pop+0x3b) [0x55660962847b]
Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(+0x2da28) [0x55660963aa28]
Oct 11 19:04:11 srv-fort fort[2950289]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7ff03fea2609]
Oct 11 19:04:11 srv-fort fort[2950289]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff03fdc9293]
Oct 11 19:04:11 srv-fort fort[2950289]: (Stack size was 7.)
Oct 11 19:04:11 srv-fort fort[2950289]: ERR: rsync://rpki.apnic.net/member_repository/A91DE10F/60EA5C2CB5D311E7B6A2DD5DC4F9AE02/NDyCcTdhxY6CRQ2UqleWffm0bxU.mft: Unknown message digest sha256
Oct 11 19:04:11 srv-fort systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION
Oct 11 19:04:12 srv-fort systemd[1]: fort.service: Failed with result 'exit-code'.

Reiniciamos el servicio y todo funcionó correctamente.

¿Podrían darnos apoyo para identificar el problema que lo causó?

Saludos,
Mauricio

juliencorsini · 2021-10-11T19:14:16Z

+1, getting the same error message since 1 hour ago.

moviedo-sc · 2021-10-11T19:29:03Z

Como update, el servicio vuelve a caer al poco tiempo de reiniciarlo.

ydahhrk · 2021-10-11T19:43:50Z

¿Cuál es la versión de Fort?

bilbomx · 2021-10-11T19:53:46Z

Yo tengo la 1.4.2 y me ocurre lo mismo
La instale via rpm en Centos

moviedo-sc · 2021-10-11T20:09:56Z

Tenemos un mix de versiones. Confirmo que nos pasa con 1.5.1-1 que es el que tenemos más actual, en Ubuntu instalado desde los .deb.

moviedo-sc · 2021-10-11T20:29:46Z

Aquí el stack trace de la versión 1.5.1-1:

root@srv-fort:/usr/local/bin# systemctl status fort
● fort.service - FORT RPKI validator
Loaded: loaded (/lib/systemd/system/fort.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2021-10-11 20:19:31 UTC; 2min 27s ago
Docs: man:fort(8)
https://nicmx.github.io/FORT-validator/
Process: 3001527 ExecStart=/usr/bin/fort --configuration-file /etc/fort/config.json (code=exited, status=255/EXCEPTION)
Main PID: 3001527 (code=exited, status=255/EXCEPTION)

Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x1e745) [0x55a052dda745]
Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(deferstack_pop+0x37) [0x55a052dda9a7]
Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x3457a) [0x55a052df057a]
Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x34f68) [0x55a052df0f68]
Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x44ea8) [0x55a052e00ea8]
Oct 11 20:19:31 srv-fort fort[3001527]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f437cab3609]
Oct 11 20:19:31 srv-fort fort[3001527]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f437c9da293]
Oct 11 20:19:31 srv-fort fort[3001527]: (End of stack trace)
Oct 11 20:19:31 srv-fort systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION
Oct 11 20:19:31 srv-fort systemd[1]: fort.service: Failed with result 'exit-code'.

job · 2021-10-11T20:31:10Z

I observed the following:

pcx$ doas ./fort --tal /etc/rpki/ripe.tal
Oct 11 20:21:48 INF: Disabling validation logging on syslog.
Oct 11 20:21:48 INF: Disabling validation logging on standard streams.
Oct 11 20:21:48 INF: Console log output configured; disabling operation logging on syslog.
Oct 11 20:21:48 INF: (Operation Logs will be sent to the standard streams only.)
Oct 11 20:21:48 WRN: First validation cycle has begun, wait until the next notification to connect your router(s)
Oct 11 20:25:54 ERR: /etc/rpki/ripe.tal: Attempted to pop empty X509 stack
Stack trace:
  0x3c6ea399078 <print_stack_trace+56> at ./fort
  0x3c6ea39a79c <pr_crit+236> at ./fort
  0x3c6ea3948b1 <deferstack_pop+337> at ./fort
  0x3c6ea3947be <deferstack_pop+94> at ./fort
  0x3c6ea3acb81 <perform_standalone_validation+1985> at ./fort
  0x3c6ea3ac7e6 <perform_standalone_validation+1062> at ./fort
  0x3c6ea3bdc5b <thread_pool_wait+619> at ./fort
  0x3c907d07281 <pthread_create+657> at /usr/lib/libpthread.so.26.1
(End of stack trace)

moviedo-sc · 2021-10-11T20:38:18Z

CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack
Sin el archivo TAL de Ripe no se ha caído en 1.5.1-1.

It seems the #58 and #59 problem is a stray defer separator pop. The comment above x509stack_cancel() clearly states that the function should only be called shortly after a x509stack_push(), but there's one in certificate_traverse() that isn't. Removing this x509stack_cancel() seems to prevent the crash. I'm still investigating the original intent of this code. Tentatively f1xes #58 and #59.

ydahhrk · 2021-10-11T21:31:26Z

(Español abajo)

This seems to have a simple fix, but it concerns a feature I'm not completely familiar with. The patch I just uploaded seems to yield a successful RIPE traversal, but I'm still investigating whether it's fully correct.

Feedback would be appreciated.

Esto parece tener una corrección muy simple, pero no estoy completamente familiarizado con el código relevante. El parche que acabo de subir parece corregir el problema, pero todavía estoy investigando si es completamente correcto.

¿Podrían bajarlo y ver si se comporta bien?

job · 2021-10-11T21:43:45Z

@ydahhrk the changeset proposal seems to introduce a memory leak triggering OOM. I'll reach out to you via email to share additional information.

forkwhilefork · 2021-10-11T22:27:48Z

FWIW I also got this error

Oct 11 14:21:35 core1 fort[7387]: CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack
Oct 11 14:21:35 core1 fort[7387]: Stack trace:
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(print_stack_trace+0x32) [0x558a924a3e52]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(pr_crit+0x99) [0x558a924a73d9]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x1e563) [0x558a924a0563]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(deferstack_pop+0x2f) [0x558a924a077f]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x34fda) [0x558a924b6fda]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x35a11) [0x558a924b7a11]
Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x452b4) [0x558a924c72b4]
Oct 11 14:21:35 core1 fort[7387]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f1b0f98cfa3]
Oct 11 14:21:35 core1 fort[7387]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f1b0f8bd4cf]
Oct 11 14:21:35 core1 fort[7387]: (Stack size was 9.)
Oct 11 14:21:35 core1 fort[7387]: Expand failed !
Oct 11 14:21:35 core1 systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION
Oct 11 14:21:35 core1 systemd[1]: fort.service: Failed with result 'exit-code'.

I'd prefer to wait for a stable bugfix release before updating though.

moviedo-sc · 2021-10-11T22:30:41Z

Update: the RIPE file with issues was apparently fixed. It works now without modifying any config files.
However, it will be important to determine why it made FORT crash.

ydahhrk · 2021-10-12T01:23:29Z

The crash is already fixed, but as @job pointed out, the patch introduced a memory leak, and I ran into more problems after that. I'll have a proper refactor tomorrow, hopefully.

Sorry for the inconveniences.

ties · 2021-10-12T06:55:21Z

Update: the RIPE file with issues was apparently fixed. It works now without modifying any config files. However, it will be important to determine why it made FORT crash.

If there was a ripe file with issues - please list which one. We would be happy to investigate if there really was an unexpected object published in one of the repos we control (and follow up if it was a non-hosted repo).

ydahhrk · 2021-10-12T07:07:15Z

Way I see it, don't worry about it. The "unexpected object" contains a normal feature that was implemented two years ago in Fort, but adoption has apparently been slow (to the point 12 hours ago was seemingly the first time it has appeared in the wild), so the code either broke in the meantime or was never tested with real data.

It's hard to tell, because the developer didn't leave relevant unit tests. Which, of course, also needs to be fixed urgently.

I'm hoping to finish patching the code tomorrow.

ties · 2021-10-12T09:24:17Z

Way I see it, don't worry about it. The "unexpected object" contains a normal feature that was implemented two years ago in Fort, but adoption has apparently been slow (to the point 12 hours ago was seemingly the first time it has appeared in the wild), so the code either broke in the meantime or was never tested with real data.

There were no bgpsec object out in the wild and real test data was/is not available. My understanding is that only Dragon Research Labs rpkid supports creating these objects. I guess this is also a painful reminder that there is an inherent risk to testing new objects in the wild instead under a live tal instead of on a testbed.

job · 2021-10-12T09:51:44Z

@ties the problem is not “new objects”. The problem is that software crashes unexpectedly when it shouldn’t be crashing.

You might also recall a very recent issue with lacking input validation in “old objects”, that affected a lot of validators.

ties · 2021-10-12T10:01:57Z

@job true - it should not crash. RP software crashing in this way because of (valid/non-abusive) inputs is a very scary failure mode and both examples motivate the existence of a diverse ecosystem of RP implementations.

benmaddison · 2021-10-12T10:10:19Z

@ties the problem is not “new objects”. The problem is that software crashes unexpectedly when it shouldn’t be crashing.

You might also recall a very recent issue with lacking input validation in “old objects”, that affected a lot of validators.

... and that's not to mention the universe of potential garbage that an RP could
encounter when processing data retrieved from an untrusted (and therefore
potentially hostile) publication point.

Surely it is implicit in the architecture of the RPKI that RPs should not
explode if they encounter unknown or invalid objects, let alone objects that
conform to the specifications of a standards track RFC?

lukastribus · 2021-10-12T11:52:31Z

It's a good thing that it does crash; a hung state machine with the RTR sessions up and running would be a lot worse, possibly remaining undetected for a while by a lot of people. The RTR protocol does not have any expire or TTL field for the VRP, routers could make wrong ROV decisions for weeks and months if it wouldn't crash.

Also: a second validator/RTR server with a different software stack can workaround this only if the instance in question crashes. If it just hangs with the RTR sessions up, routers would still use obsolete VRP's until manual intervention.

Manual intervention requires awareness, and this requires active monitoring:
https://labs.ripe.net/author/lukas_tribus/rpki-rov-about-stale-rtr-servers-and-how-to-monitor-them/

I hope we continue to see hard crashes when encountering bugs in validators and RTR servers, as opposed to the more nasty possibilities.

There were more flaws than expected. - ASN ranges were being converted to ASNs too early, so the code was iterating too much. - BGPsec certificates were being largely handled by the CA certificate function, even though they're not. - The BGP certificate code was using the parent's resources when it was supposed to use the BGP certificate's resources, so it was generating lots of bogus router keys. - Bad cleanup in some unhappy path somewhere. Don't remember. There's a known issue: Global URI 'xxxxxx.cer' does not begin with 'rsync://'. But I suspect it's unrelated, so I decided to postpone it. The code has been drastically improved, but it still needs more testing.

Completely axes the whole ordeal. Is not a long-term solution, but will lead to stable behavior until the other branch has been thoroughly tested.

ydahhrk · 2021-10-13T06:52:24Z

Two new branches:

issue58-simple (matches master)
issue58-proper

Get the first one if you just want to be immune to the bug. It cuts out the whole feature, but then again, adoption is pretty much nonexistent, so you're not missing anything.

The second one is the proper fix, but it still needs more testing.

ydahhrk · 2021-10-21T03:48:37Z

The issue58-simple version of the patch has been released as part of 1.5.2.

issue58-proper is still in the testing phase.

Simplifies it by moving some of its clutter to helper functions, seemingly in preparation for #58. Part of a series of patches meant to manually rebase the issue58-proper branch.

This was referenced Oct 11, 2021

fort.service failed #59

Open

fort 1.4.2 and 1.2.1 crashes regularly, after 3-5 days uptime. #46

Closed

lukastribus mentioned this issue Oct 12, 2021

fort 1.5.1 double crash ... Attempted to pop empty X509 stack #60

Open

ydahhrk added a commit that referenced this issue Oct 13, 2021

Temporary #58 solution

c69a140

Completely axes the whole ordeal. Is not a long-term solution, but will lead to stable behavior until the other branch has been thoroughly tested.

ydahhrk modified the milestone: 1.5.2 Oct 21, 2021

ydahhrk added a commit that referenced this issue Jul 7, 2023

Partial refactor of certificate_traverse()

418f1bd

Simplifies it by moving some of its clutter to helper functions, seemingly in preparation for #58. Part of a series of patches meant to manually rebase the issue58-proper branch.

ydahhrk changed the title ~~Fort validator se detuvo en varias instancias~~ Fort's validation produces no router keys Dec 1, 2023

ydahhrk mentioned this issue Apr 21, 2024

Implement RTRv2 #132

Open

ydahhrk added the Urgency: Very High label Apr 21, 2024

ydahhrk added this to the 1.6.6 milestone May 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fort's validation produces no router keys #58

Fort's validation produces no router keys #58

moviedo-sc commented Oct 11, 2021

juliencorsini commented Oct 11, 2021

moviedo-sc commented Oct 11, 2021

ydahhrk commented Oct 11, 2021

bilbomx commented Oct 11, 2021

moviedo-sc commented Oct 11, 2021

moviedo-sc commented Oct 11, 2021

job commented Oct 11, 2021

moviedo-sc commented Oct 11, 2021

ydahhrk commented Oct 11, 2021

job commented Oct 11, 2021

forkwhilefork commented Oct 11, 2021

moviedo-sc commented Oct 11, 2021 •

edited

Loading

ydahhrk commented Oct 12, 2021 •

edited

Loading

ties commented Oct 12, 2021

ydahhrk commented Oct 12, 2021 •

edited

Loading

ties commented Oct 12, 2021

job commented Oct 12, 2021

ties commented Oct 12, 2021

benmaddison commented Oct 12, 2021

lukastribus commented Oct 12, 2021

ydahhrk commented Oct 13, 2021

ydahhrk commented Oct 21, 2021

Fort's validation produces no router keys #58

Fort's validation produces no router keys #58

Comments

moviedo-sc commented Oct 11, 2021

juliencorsini commented Oct 11, 2021

moviedo-sc commented Oct 11, 2021

ydahhrk commented Oct 11, 2021

bilbomx commented Oct 11, 2021

moviedo-sc commented Oct 11, 2021

moviedo-sc commented Oct 11, 2021

job commented Oct 11, 2021

moviedo-sc commented Oct 11, 2021

ydahhrk commented Oct 11, 2021

job commented Oct 11, 2021

forkwhilefork commented Oct 11, 2021

moviedo-sc commented Oct 11, 2021 • edited Loading

ydahhrk commented Oct 12, 2021 • edited Loading

ties commented Oct 12, 2021

ydahhrk commented Oct 12, 2021 • edited Loading

ties commented Oct 12, 2021

job commented Oct 12, 2021

ties commented Oct 12, 2021

benmaddison commented Oct 12, 2021

lukastribus commented Oct 12, 2021

ydahhrk commented Oct 13, 2021

ydahhrk commented Oct 21, 2021

moviedo-sc commented Oct 11, 2021 •

edited

Loading

ydahhrk commented Oct 12, 2021 •

edited

Loading

ydahhrk commented Oct 12, 2021 •

edited

Loading