Occasional crash with 3.0.19 #2732
Comments
@jpereira just checking: none of your patches ended up in a PR, right?
@pauldekkers exactly. Alan mentioned some good points and it needs more investigation.
No comments on this for a few months. Is this still an issue?
@mcnewton I have not tested with 3.0.20 yet, but I did not see commits to threads.c and am not aware of another fix (that would have resolved this in 3.0.19). Happy to try for a while with 3.0.20 though (it may take some time to occur).
Thanks. Could you try with v3.0.x HEAD? There have been a couple of fixes since 3.0.20 which affect proxying.
Sorry it took a while. I used v3.0.x from March 2, up to commit 395a247 - it still crashes in the same way after 2 weeks:
@pauldekkers can you share the result of the GDB command "bt all"? even if you have the
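For readers following along, a backtrace of every thread can be collected from a core dump without an interactive session. The sketch below is a minimal illustration, not something posted in this thread: it builds a `gdb` batch command line using `thread apply all bt full`, which is the usual GDB way to dump all threads. The binary and core file paths are hypothetical placeholders, and the helper name `gdb_backtrace_argv` is invented for this example.

```python
import shlex


def gdb_backtrace_argv(binary, core, full=True):
    """Build a gdb command line that prints a backtrace for every thread.

    `binary` and `core` are paths supplied by the caller; nothing here is
    specific to FreeRADIUS.
    """
    # "thread apply all" runs the following command once per thread;
    # "bt full" additionally prints local variables in each frame.
    bt = "thread apply all bt full" if full else "thread apply all bt"
    # -batch exits after running the -ex commands instead of opening a prompt.
    return ["gdb", "-batch", "-ex", bt, binary, core]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    argv = gdb_backtrace_argv("/usr/sbin/radiusd", "./core")
    # Print a copy-pasteable shell command rather than running gdb here.
    print(shlex.join(argv))
```

Redirecting the output of the printed command to a file gives something small enough to attach to an issue or send by e-mail, unlike the core dump itself.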
@jpereira I'll send you an e-mail
Thank you @pauldekkers, you could send it to jpereira@freeradius.org and I will share it with the others.
Ah! If you have any trouble sending the file because of its size, you could upload it to Google Drive or Dropbox and share it with us, if you don't mind, of course. :)
@jpereira I sent a message first with the additional bt, hope it got through spam filters ;-)
I got it @pauldekkers , The
Here's another one that crashed last night. Also gives an indication of how often it occurs (after two weeks or so). Hope it helps.
That's good... except I have no idea what's at line 2446 of xlat.c. In all of the releases I checked, line 2446 is whitespace, a bracket, a simple assignment, etc. Nothing which would cause a crash. So what version are you running, and what's on line 2446?
This was 395a247:
That (still) seems to be:
I upgraded to 9c36e20 BTW, to stay current. Next crash in two weeks perhaps ;-)
Well, that didn't take long:
There is no way that line could possibly cause a crash. It's doing a Similarly, the previous crash was in a line which assigned a value to a variable on the stack. The only ways the system can crash when accessing the stack are (a) stack overflow or (b) bad memory. I suspect bad memory, so I would run a memory checker on the hardware. Also, try running the server on completely different hardware. If it doesn't crash there, you know that the original hardware was faulty.
It's a VMware virtual machine. I can have it migrated to a different host. It's the only process that has been crashing since mid 2019 though. And I believe it also happened to the workload in a completely different datacenter, but I'll try. Thanks for the followup!
The last crash was on April 25th (around 04:00, a relatively quiet time). It should have been on different hardware by then, but I'm not 100% sure that VMware didn't migrate things back before that time. Either way, I would expect VMware to give warnings about memory corruption, or the servers' ECC memory to repair things, and there was nothing in the event logs. Meanwhile the entire hardware cluster has been replaced and the VM migrated. As far as I'm concerned we could close this issue, and I'd reopen it if the crash recurs. For completeness, this crash was:
... and no new crashes since.
Issue type
Defect
How to reproduce the issue
There's no trigger that I'm aware of, and no special configuration. It just happens occasionally, after a couple of weeks. It happened with earlier releases too, but I never managed to capture the event. Now I have a core dump :-)
(While it happened with earlier releases, somewhere above 3.0.15 I think, the exact same configuration did run without issues at some point in the past. And while this is a busy server, the crashes are really occasional.)
I'm using the 3.0.19 Docker container these days, on an Ubuntu 18.04 LTS host. (Previously it also crashed with self-compiled releases, without containers.) Looking at system graphs, I see no issues with resources (memory, disk spikes) whatsoever at the time of the crash. (It was actually at a quiet time, in the middle of the night.)
This is a proxy server (it does no EAP, just RADIUS/UDP and TLS, sqlite and rlm_cache_rbtree).
Full backtrace from LLDB or GDB
Hope this gives some direction to look at. If there's a new event (hopefully similar) I'll add it to this issue, but for this crash I was already waiting for over 6 weeks. (And before that, core dumps were not happening for various reasons.)