[CRASH] When one or more defined RTPEngines are down when OpenSIPS starts ... and an external script is called. #3335
It wasn't mentioned in the ticket because I thought it was obvious that the file in question is:
I replaced "goto badproxy" with "return NULL" and the segfault during startup did not happen. This is not the correct solution, because the proper marking of a bad proxy is not performed when a timeout happens; it is only a workaround that avoids the segfault.
Found that in the "connect_rtpengines()" routine, wrapping the call to rtpe_test() with RTPE_START_READ() / RTPE_STOP_READ() (which takes a lock) appears to be a better workaround: the segfault is avoided and RTPEngine failover works. Unfortunately, this workaround causes a deadlock / loop when using the rtpengine_reload command.
Because of #3334, a temporary hack was added to provide the ability to scale out/in the number of RTPEngines: a "timer_route[rtpengine_check, 3] { ... }" was added to periodically reload the list of RTPEngines. It might be an rtpengine_reload during startup that caused the deadlock / segfault. Within rtpengine.c, the "rtpengine_reload" routine does take locks, but it also makes significant changes to the internal data structure used to maintain the list of RTPEngines. If a routine other than "rtpengine_reload" is in the middle of executing (for example, blocked waiting on a lock that rtpengine_reload holds), the RTPEngine list data structure might have changed by the time it resumes, and a segfault would be likely. It may be a good idea to ensure that the RTPEngine list does not change while other routines that use it are executing. For example, "rtpengine_show" does not change the list of RTPEngines, but it could still be affected if the list changes while it is looping through it. This may be the reason for RTPE_START_READ/RTPE_STOP_READ; if so, perhaps there is an edge case that is not covered. That is why I tried adding it (see previous post).
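For completeness, the temporary hack looks roughly like this (the body was not posted in full; triggering the reload via the exec module and opensips-cli is my assumption about how it was done):

```
# temporary workaround for #3334: every 3 seconds, re-sync the
# RTPEngine list (reload mechanism assumed, body abbreviated)
timer_route[rtpengine_check, 3] {
    exec("opensips-cli -x mi rtpengine_reload");
}
```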
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.
The segfault happens when OpenSIPS has one (or more) RTPEngines defined but, at startup, one or more of them are down.
The E_RTPENGINE_STATUS event is triggered with status "inactive". When this happens, an external script is executed.
The external script issues some commands to determine the impact of the failed RTPEngine. For example, "rtpengine_show" is executed.
When the external script is not run on the E_RTPENGINE_STATUS "inactive" trigger, no segfault happens.
After testing multiple times, it appears the segfault only happens after this message is logged: "ERROR:rtpengine:send_rtpe_command: timeout waiting reply from a RTP proxy". Other code paths that reach the badproxy: label do not appear to cause the segfault.
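The trigger setup can be sketched roughly as follows (the real config was not posted; the script path, the exact actions, and the use of launch() + exec() are assumptions, though the "async_script_launch" lines in the logs below suggest an async launch):

```
event_route[E_RTPENGINE_STATUS] {
    xlog("[E_RTPENGINE_STATUS] $param(socket): $param(status)\n");
    if ($param(status) == "inactive")
        # async launch of the external checker script (path hypothetical)
        launch(exec("/usr/local/bin/rtpe_check.sh $param(socket)"));
}
```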
OpenSIPS version you are running
Crash Core Dump
Describe the traffic that generated the bug
I added the following log lines to send_rtpe_command() because core dumps are not being produced.
The TESTING lines appear in the logs shown below.
To Reproduce
Relevant System Logs
Mar 8 14:38:38 [58] INFO:rtpengine:rtpe_test: rtp proxy udp:100.127.6.20:22222 found, support for it enabled
Mar 8 14:38:38 [58] INFO:rtpengine:rtpe_test: rtp proxy udp:100.127.6.21:22222 found, support for it enabled
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: can't send (#7 iovec buffers) command to a RTP proxy (111:Connection refused)
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: timeout waiting reply from a RTP proxy
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: TESTING 1
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: proxy udp:100.127.6.22:22222 does not respond, disable it
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: TESTING 2
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: TESTING 3
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: TESTING 4
Mar 8 14:38:38 [58] DBG:core:evi_param_set: set str socket='udp:100.127.6.22:22222'
Mar 8 14:38:38 [58] DBG:core:evi_param_set: set str status='inactive'
Mar 8 14:38:38 [58] DBG:core:get_dummy_sip_msg: reusing the static sip msg 0x7fbfa019bcf0
Mar 8 14:38:38 [58] DBG:core:evi_raise_event_msg: found subscriber E_RTPENGINE_STATUS
Mar 8 14:38:38 [58] DBG:core:dup_ref_script_route_in_shm: dupping 0x7fbf5d1daeb8 [E_RTPENGINE_STATUS], idx 2, ver/cnt 1, to new 0x7fbf5d1dd080 [E_RTPENGINE_STATUS], idx 2, ver/cnt 1
Mar 8 14:38:38 [18] DBG:core:ipc_handle_job: received job type 0[RPC] from process 41
Mar 8 14:38:38 [18] DBG:core:get_dummy_sip_msg: reusing the static sip msg 0x7fbfa019bcf0
Mar 8 14:38:38 [18] NOTICE:[E_RTPENGINE_STATUS] udp:100.127.6.22:22222: inactive
Mar 8 14:38:38 [18] DBG:core:comp_scriptvar: str 20: inactive
Mar 8 14:38:38 [58] DBG:core:release_dummy_sip_msg: cleaning the static sip msg 0x7fbfa019bcf0
Mar 8 14:38:38 [58] DBG:core:destroy_avp_list: destroying list (nil)
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: TESTING 5
Mar 8 14:38:38 [58] ERROR:rtpengine:rtpe_test: proxy did not respond to ping
Mar 8 14:38:38 [18] DBG:core:async_script_launch: placing launch job into reactor
Mar 8 14:38:38 [18] DBG:core:io_watch_add: [UDP_worker] io_watch_add op (286 on 199) (0x5592bf1f1f40, 286, 3, 0x7fbf5d1dd340,1), fd_no=5/1048576
Mar 8 14:38:38 [58] DBG:rtpengine:connect_rtpengines: successfully updated rtpengine sets
Mar 8 14:38:38 [58] DBG:core:release_dummy_sip_msg: cleaning the static sip msg 0x7fbfa019bcf0
Mar 8 14:38:38 [58] DBG:core:destroy_avp_list: destroying list (nil)
Mar 8 14:38:38 [62] DBG:core:io_watch_add: [TCP_main] io_watch_add op (261 on 4) (0x5592bf1f1f40, 261, 22, 0x7fbf5d19b3f8,1), fd_no=38/1048576
Mar 8 14:38:38 [18] DBG:core:release_dummy_sip_msg: cleaning the static sip msg 0x7fbfa019bcf0
Mar 8 14:38:38 [18] DBG:core:destroy_avp_list: destroying list 0x7fbf5d1dd120
Mar 8 14:38:38 [62] DBG:core:io_watch_add: [TCP_main] io_watch_add op (263 on 4) (0x5592bf1f1f40, 263, 22, 0x7fbf5d19c658,1), fd_no=39/1048576
Mar 8 14:38:38 [28] ERROR:rtpengine:send_rtpe_command: timeout waiting reply from a RTP proxy
Mar 8 14:38:38 [28] ERROR:rtpengine:send_rtpe_command: TESTING 1
Mar 8 14:38:38 [28] CRITICAL:core:sig_usr: segfault in process pid: 28, id: 15
Mar 8 14:38:38 [28] DBG:core:restore_segv_handler: restoring SIGSEGV handler...
Mar 8 14:38:38 [28] DBG:core:restore_segv_handler: successfully restored system SIGSEGV handler
Mar 8 14:38:38 [1] DBG:core:handle_sigs: OpenSIPS exit status = 11
Mar 8 14:38:38 [1] INFO:core:handle_sigs: child process 28 exited by a signal 11
Mar 8 14:38:38 [1] INFO:core:handle_sigs: core was not generated
Mar 8 14:38:38 [1] INFO:core:handle_sigs: terminating due to SIGCHLD
OS/environment information
Additional context