FORT 1.5.3 Crashing - ERR: Unknown protocol: 114 #83

Closed
InsaneSplash opened this issue May 16, 2022 · 20 comments

@InsaneSplash

InsaneSplash commented May 16, 2022

Hello,

I am seeing the latest version of FORT (1.5.3) crash on a regular basis. We have paired FORT with FRRouting, which is also running its latest version, on Oracle Linux 8.

fort-1.5.3-1.el8.x86_64
frr-8.2.2-02.el8.x86_64

Below is the extract from the log showing the crashed process.

May 16 12:37:03 fort[5745]: ERR: Unknown protocol: 114
May 16 12:37:03 fort[5745]: Stack trace:
May 16 12:37:03 fort[5745]:  /usr/bin/fort(print_stack_trace+0x1f) [0x417e5f]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(pr_crit+0x81) [0x4194e1]
May 16 12:37:03 fort[5745]:  /usr/bin/fort() [0x433d95]
May 16 12:37:03 fort[5745]:  /usr/bin/fort() [0x43168d]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(compute_deltas+0x46) [0x4336c6]
May 16 12:37:03 fort[5745]:  /usr/bin/fort() [0x43440d]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(vrps_update+0x110) [0x434b80]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(validation_run_cycle+0x29) [0x41d729]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(main+0x16c) [0x413e6c]
May 16 12:37:03 fort[5745]: Expand failed !
May 16 12:37:03 fort[5745]:  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f337ec3e493]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(_start+0x2e) [0x413e9e]
May 16 12:37:03 fort[5745]: (End of stack trace)
May 16 12:37:03 systemd[1]: fort.service: Main process exited, code=exited, status=255/n/a
May 16 12:37:03 systemd[1]: fort.service: Failed with result 'exit-code'.
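For comparing traces like the one above across crashes, the frames can be pulled out of the syslog lines mechanically. The following is a minimal sketch, not part of Fort: the `Frame` type, `FRAME_RE` and `parse_frame` are hypothetical names, and the regular expression only assumes the `binary(symbol+offset) [address]` layout visible in the logs posted in this issue.

```python
import re
from typing import NamedTuple, Optional

# Matches the tail of a stack trace line, e.g.
#   /usr/bin/fort(compute_deltas+0x46) [0x4336c6]
#   /usr/bin/fort() [0x433d95]
#   /usr/bin/fort[0x417d97]
# The layout is inferred from the logs in this thread (an assumption).
FRAME_RE = re.compile(
    r"(?P<binary>\S+?)"
    r"(?:\((?P<symbol>[^+)]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\))?"
    r"\s*\[(?P<address>0x[0-9a-fA-F]+)\]\s*$"
)


class Frame(NamedTuple):
    binary: str
    symbol: Optional[str]   # None when the symbol was not resolved
    offset: Optional[int]   # offset inside the symbol (or binary), if printed
    address: int            # absolute return address


def parse_frame(line: str) -> Optional[Frame]:
    """Return the frame described by a log line, or None if it is not a frame."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    offset = match.group("offset")
    return Frame(
        binary=match.group("binary"),
        symbol=match.group("symbol") or None,
        offset=int(offset, 16) if offset else None,
        address=int(match.group("address"), 16),
    )


sample = [
    "May 16 12:37:03 fort[5745]: ERR: Unknown protocol: 114",
    "May 16 12:37:03 fort[5745]:  /usr/bin/fort(compute_deltas+0x46) [0x4336c6]",
    "May 16 12:37:03 fort[5745]:  /usr/bin/fort() [0x433d95]",
]
for line in sample:
    frame = parse_frame(line)
    if frame is not None:
        print(frame.symbol or "<unresolved>", hex(frame.address))
```

Frames without a symbol name (such as `/usr/bin/fort() [0x433d95]`) come back with `symbol=None`; those addresses would still need a tool such as `addr2line` and a binary with debug information to be resolved.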
ydahhrk added a commit that referenced this issue May 16, 2022
Found this quirk while eyeballing #83. I don't think it's going to
fix the problem, but it's definitely an improvement.
@ydahhrk
Member

ydahhrk commented May 16, 2022

I uploaded a small patch. I don't think it's going to solve the problem, but you might as well try it.

Are you using --output.roa?

If you enable it, do you get a slightly different error message?

Can you please post your fort command, including flags (and the configuration file, if applicable)?

@InsaneSplash
Author

Hey, sorry for the late reply..... another instance just crashed.

Command Line:
/usr/bin/fort --configuration-file /etc/fort/config.json

Config file:

{
        "tal": "/etc/fort/tal",
        "local-repository": "/var/lib/fort/repository",
        "slurm": "/etc/fort/slurm",
        "server": {
                "port": "3323",
                "interval": {
                        "validation": 3600,
                        "refresh": 3600,
                        "retry": 600,
                        "expire": 7200
                }
        },
        "log": {
                "output": "syslog"
        }
}
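As a quick sanity check of a configuration like the one above, the RTR timing values can be compared against the ranges in RFC 8210, section 6. This is only a sketch under assumptions: `check_intervals` and `RFC8210_RANGES` are hypothetical names, the check covers only `refresh`, `retry` and `expire` (not the Fort-specific `validation` interval), and it is not something Fort itself is known to run in this form.

```python
import json
from typing import List

# The configuration posted above, copied verbatim.
CONFIG_TEXT = """
{
        "tal": "/etc/fort/tal",
        "local-repository": "/var/lib/fort/repository",
        "slurm": "/etc/fort/slurm",
        "server": {
                "port": "3323",
                "interval": {
                        "validation": 3600,
                        "refresh": 3600,
                        "retry": 600,
                        "expire": 7200
                }
        },
        "log": {
                "output": "syslog"
        }
}
"""

# Allowed ranges in seconds, per RFC 8210 section 6.
RFC8210_RANGES = {
    "refresh": (1, 86400),
    "retry": (1, 7200),
    "expire": (600, 172800),
}


def check_intervals(config: dict) -> List[str]:
    """Return human-readable problems found in server.interval."""
    intervals = config.get("server", {}).get("interval", {})
    problems = []
    for name, (low, high) in RFC8210_RANGES.items():
        value = intervals.get(name)
        if value is not None and not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}..{high}")
    # RFC 8210 also requires expire to be larger than refresh and retry.
    expire = intervals.get("expire")
    for name in ("refresh", "retry"):
        value = intervals.get(name)
        if expire is not None and value is not None and expire <= value:
            problems.append(f"expire={expire} must be larger than {name}={value}")
    return problems


print(check_intervals(json.loads(CONFIG_TEXT)) or "no interval problems found")
```

The values posted here match the RFC's recommended defaults, so the check reports nothing, which suggests the intervals are not a likely cause of the crash.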

@InsaneSplash
Author

InsaneSplash commented May 31, 2022

May 27 07:59:16 fort[98190]: /usr/bin/fort[0x417d97]
May 27 07:59:16 fort[98190]: /lib64/libpthread.so.0(+0x12c30)[0x7f6d27f1cc30]
May 27 07:59:16 fort[98190]: /usr/bin/fort(x509_name_put+0x0)[0x427dc0]
May 27 07:59:16 fort[98190]: /usr/bin/fort[0x4143cc]
May 27 07:59:16 fort[98190]: /usr/bin/fort[0x4144ac]
May 27 07:59:16 fort[98190]: /usr/bin/fort(deferstack_pop+0x3b)[0x4146eb]
May 27 07:59:16 fort[98190]: /usr/bin/fort[0x428cc4]
May 27 07:59:16 fort[98190]: /usr/bin/fort[0x4296c9]
May 27 07:59:16 fort[98190]: /usr/bin/fort[0x437307]
May 27 07:59:16 fort[98190]: /lib64/libpthread.so.0(+0x818a)[0x7f6d27f1218a]
May 27 07:59:16 fort[98190]: /lib64/libc.so.6(clone+0x43)[0x7f6d27c41dd3]

@InsaneSplash
Author

Interestingly, the process prints a stack trace if you pass it an unknown option.

May 31 10:16:17 fort[916765]: ERR: Unrecognized option: 63
May 31 10:16:17 fort[916765]: Stack trace:
May 31 10:16:17 fort[916765]:  fort(print_stack_trace+0x1f) [0x417e5f]
May 31 10:16:17 fort[916765]:  fort(__pr_op_err+0x84) [0x418424]
May 31 10:16:17 fort[916765]:  fort(handle_flags_config+0x315) [0x416145]
May 31 10:16:17 fort[916765]:  fort(main+0x66) [0x413d66]
May 31 10:16:17 fort[916765]:  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f59ef759493]
May 31 10:16:17 fort[916765]:  fort(_start+0x2e) [0x413e9e]
May 31 10:16:17 fort[916765]: (End of stack trace)
May 31 10:16:17 fort[916765]: ERR: Try 'fort --usage' or 'fort --help' for more information.
May 31 10:16:17 fort[916765]: Stack trace:
May 31 10:16:17 fort[916765]:  fort(print_stack_trace+0x1f) [0x417e5f]
May 31 10:16:17 fort[916765]:  fort(__pr_op_err+0x84) [0x418424]
May 31 10:16:17 fort[916765]:  fort(handle_flags_config+0x33b) [0x41616b]
May 31 10:16:17 fort[916765]:  fort(main+0x66) [0x413d66]
May 31 10:16:17 fort[916765]:  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f59ef759493]
May 31 10:16:17 fort[916765]:  fort(_start+0x2e) [0x413e9e]
May 31 10:16:17 fort[916765]: (End of stack trace)

@kmisak

kmisak commented Jun 6, 2022

I am also getting this crash regularly, but with Unknown protocol: 0

@InsaneSplash
Author

InsaneSplash commented Jun 6, 2022

I've left the service running with no BGP services using it, and lost 2 instances this weekend.

Note: don't update librtr to version 8.

@ydahhrk
Member

ydahhrk commented Jun 6, 2022

Do you have files in the SLURM directory? (/etc/fort/slurm)
If so, can I have them? (It's fine if you want to censor IPs)

ydahhrk added a commit that referenced this issue Jun 7, 2022
Mostly quality of life improvements.

On the other hand, it looks like the notfatal hash table API was being
used incorrectly. HASH_ADD_KEYPTR can OOM, but `errno` wasn't being
caught.

Fixing this is nontrivial, however, because strange `reqs_error`
functions are in the way, and that's spaghetti I decided to avoid.
Instead, I converted HASH_ADD_KEYPTR usage to the fatal hash table API.
That's the future according to #40, anyway.

I don't think this has anything to do with #83, though.
@ydahhrk
Member

ydahhrk commented Jun 7, 2022

Ok, it looks like this is going to be a difficult bug.

Is either of you willing to run a custom debug-heavy Fort binary?

@kmisak

kmisak commented Jul 4, 2022

I will do that, no problem

@InsaneSplash
Author

This is all I have in that file

{
  "slurmVersion": 1,
  "validationOutputFilters": {
    "prefixFilters": [],
    "bgpsecFilters": []
  },
  "locallyAddedAssertions": {
    "prefixAssertions": [],
    "bgpsecAssertions": []
  }
}
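To confirm that a SLURM file like this one has no effect on the VRP table, its filter and assertion lists can simply be counted. This is a small sketch; `count_slurm_entries` and `SLURM_SECTIONS` are hypothetical names, and the key names are taken from the file above (they follow the RFC 8416 layout).

```python
import json

# The SLURM file posted above, copied verbatim.
SLURM_TEXT = """
{
  "slurmVersion": 1,
  "validationOutputFilters": {
    "prefixFilters": [],
    "bgpsecFilters": []
  },
  "locallyAddedAssertions": {
    "prefixAssertions": [],
    "bgpsecAssertions": []
  }
}
"""

# Sections and the lists they contain, as seen in the file above.
SLURM_SECTIONS = {
    "validationOutputFilters": ("prefixFilters", "bgpsecFilters"),
    "locallyAddedAssertions": ("prefixAssertions", "bgpsecAssertions"),
}


def count_slurm_entries(slurm: dict) -> int:
    """Count every filter and assertion; 0 means the file changes nothing."""
    total = 0
    for section, keys in SLURM_SECTIONS.items():
        for key in keys:
            total += len(slurm.get(section, {}).get(key, []))
    return total


print(count_slurm_entries(json.loads(SLURM_TEXT)))
```

A count of zero means the file neither filters nor adds any VRPs, so it is an unlikely source of a corrupted entry.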

@ydahhrk
Member

ydahhrk commented Jul 21, 2022

Sorry it's taken so long. Debug commit is at branch issue83.

I need the first logging line that contains the string "VRP corrupted!":

Jul 21 21:21:10 ERR [V]: After standalone: VRP corrupted!
Jul 21 21:21:10 ERR [V]: After SLURM: VRP corrupted!

It shouldn't crash anymore, but I'm not entirely sure what side effects the bogus VRP might induce.
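Finding that first line in a large log can be scripted. The following is a minimal sketch, assuming the log is available as text lines (for example from a saved syslog file, or from the output of `journalctl -u fort.service`); `first_matching_line` is a hypothetical helper name, not part of Fort.

```python
from typing import Iterable, Optional


def first_matching_line(lines: Iterable[str],
                        needle: str = "VRP corrupted!") -> Optional[str]:
    """Return the first line containing `needle`, or None if there is none."""
    for line in lines:
        if needle in line:
            return line.rstrip("\n")
    return None


# Inline sample; with a real log, pass an open file object instead.
sample_log = [
    "Jul 21 21:21:09 INF [V]: some unrelated message\n",
    "Jul 21 21:21:10 ERR [V]: After standalone: VRP corrupted!\n",
    "Jul 21 21:21:10 ERR [V]: After SLURM: VRP corrupted!\n",
]
print(first_matching_line(sample_log))
```

Because the function stops at the first hit, it can be fed a file object directly (`first_matching_line(open(path))`, where `path` is wherever the syslog output is stored) without loading the whole log into memory.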

> This is all I have in that file

Ok thank you. Probably not the problem either.

@ydahhrk
Member

ydahhrk commented Feb 1, 2023

Have you gotten any "VRP corrupted!" messages yet?

Just to clarify: The issue83 branch contains a patch that prevents Fort from crashing, but does not, in fact, fix the bug.

ydahhrk added a commit that referenced this issue Feb 1, 2023
ydahhrk added a commit that referenced this issue Feb 2, 2023
There are no readers, so there's no point in this being a reader-writer
lock.

Still not meant to be a fix for #83/#89. I'm mostly just trying to force
myself to interact with the code in hopes of finding the bug.
ydahhrk added a commit that referenced this issue Feb 2, 2023
ydahhrk added a commit that referenced this issue Feb 3, 2023
ydahhrk added a commit that referenced this issue Feb 7, 2023
1. Revert panic back into the code.

- Fort SHOULD die as soon as it realizes the VRP table is corrupted, as
  we should not send garbage to the routers.
- Also, I'm not entirely sure the code would not crash later anyway,
  since the table is, in fact, corrupted.
- Plus, if it doesn't crash, there would be no core dump to further
  analyze the bug.

2. Point bug output to the currently active bug report

Might help us get some output earlier.
@ydahhrk ydahhrk closed this as completed in f6d3573 Feb 7, 2023
@ydahhrk
Member

ydahhrk commented Feb 21, 2023

Didn't mean to close this.

@ydahhrk ydahhrk reopened this Feb 21, 2023
@Jhoanor

Jhoanor commented Jul 4, 2023

With us sometimes it crashes after 1 day, sometimes after more than 6 weeks...

(We cannot deploy 1.5.4 though, because that would require an RPM package.
But if I read correctly, #83 is not yet resolved in 1.5.4 anyway.)

@ydahhrk
Member

ydahhrk commented Jul 6, 2023

Ok, I managed to apparently successfully generate the RPMs for 1.5.4, and uploaded them here.

(I say "apparently" because CentOS 8's death forced me to migrate to Rocky Linux 8, and I'm not sure if packages generated there will be compatible with other RHEL-compatible distributions. Please send feedback.)

In other news, I have so far discovered and fixed at least one undefined behavior during the development of 1.5.5, so the bug might already be fixed in the main branch. For your convenience, I packaged this as rpm-1.5.4.1.tar.gz.

Please install either 1.5.4 or 1.5.4.1, and provide the crash output once it happens. If it never happens, I would also like to know.

@rfc1036

rfc1036 commented Jul 6, 2023

Do you mind tagging 1.5.4 (and 1.5.4.1?) in the repository? This way I will be able to update the Debian package.

@ydahhrk
Member

ydahhrk commented Jul 6, 2023

> Do you mind tagging 1.5.4

What do you mean? It's been tagged since release.

@rfc1036

rfc1036 commented Jul 6, 2023

Never mind: I thought that you had released a new version with the more recent changes. I will wait for the next one, unless you think that I should package a snapshot right now.

@Jhoanor

Jhoanor commented Jul 7, 2023

The RPM 1.5.4-1 package installs fine on RHEL. Thank you.
It has now been running for one day, and is still up.
I'll let you know in a week whether it is still running (or earlier in case of a crash).

@Jhoanor

Jhoanor commented Aug 10, 2023

Well, it looks like it did the trick. No crashes in more than a month. Chapeau and thanks! :)

@ydahhrk ydahhrk closed this as completed Dec 1, 2023