
[CRASH] When one or more defined RTPEngines are down when OpenSIPS starts ... and an external script is called. #3335

Closed

NormB opened this issue Mar 8, 2024 · 5 comments

NormB commented Mar 8, 2024

The segfault happens when OpenSIPS has one or more RTPEngines defined but, at startup, one or more of them are down.

The E_RTPENGINE_STATUS event is triggered with status "inactive". When this happens, an external script is executed.

The external script issues some commands to determine the impact of the failed RTPEngine; for example, it runs "rtpengine_show".

If the external script is not run when E_RTPENGINE_STATUS is triggered with "inactive", no segfault happens.

After testing multiple times, it appears the segfault only happens after this message is logged: "ERROR:rtpengine:send_rtpe_command: timeout waiting reply from a RTP proxy". Other paths that jump to the badproxy: label do not appear to cause the segfault.
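For context, the event handling on the OpenSIPS side looks roughly like the following event_route. This is a hypothetical reconstruction consistent with the logs below; the script path and the exact actions are assumptions, not the actual configuration:

```
event_route[E_RTPENGINE_STATUS] {
    xlog("[E_RTPENGINE_STATUS] $param(socket): $param(status)\n");
    if ($param(status) == "inactive") {
        # launch() runs the handler asynchronously; the script itself
        # issues MI commands such as "rtpengine_show" (path is hypothetical)
        launch(exec("/usr/local/bin/rtpengine_check.sh $param(socket)"));
    }
}
```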

OpenSIPS version you are running

version: opensips 3.5.0-dev (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, CC_O0, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 3a6a8240e
main.c compiled on 19:37:30 Mar  8 2024 with cc 12

Crash Core Dump

Describe the traffic that generated the bug

Added the following log lines to send_rtpe_command() because core dumps are not being produced. The TESTING entries can be seen in the logs below.

badproxy:
        LM_ERR("TESTING 1\n");
        LM_ERR("proxy <%s> does not respond, disable it\n", node->rn_url.s);
        LM_ERR("TESTING 2\n");
        node->rn_disabled = 1;
        LM_ERR("TESTING 3\n");
        node->rn_recheck_ticks = get_ticks() + rtpengine_disable_tout;
        LM_ERR("TESTING 4\n");
        raise_rtpengine_status_event(node);
        LM_ERR("TESTING 5\n");

        return NULL;

To Reproduce

Relevant System Logs

Mar 8 14:38:38 [58] INFO:rtpengine:rtpe_test: rtp proxy udp:100.127.6.20:22222 found, support for it enabled
Mar 8 14:38:38 [58] INFO:rtpengine:rtpe_test: rtp proxy udp:100.127.6.21:22222 found, support for it enabled
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: can't send (#7 iovec buffers) command to a RTP proxy (111:Connection refused)
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: timeout waiting reply from a RTP proxy
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: TESTING 1
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: proxy udp:100.127.6.22:22222 does not respond, disable it
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: TESTING 2
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: TESTING 3
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: TESTING 4
Mar 8 14:38:38 [58] DBG:core:evi_param_set: set str socket='udp:100.127.6.22:22222'
Mar 8 14:38:38 [58] DBG:core:evi_param_set: set str status='inactive'
Mar 8 14:38:38 [58] DBG:core:get_dummy_sip_msg: reusing the static sip msg 0x7fbfa019bcf0
Mar 8 14:38:38 [58] DBG:core:evi_raise_event_msg: found subscriber E_RTPENGINE_STATUS
Mar 8 14:38:38 [58] DBG:core:dup_ref_script_route_in_shm: dupping 0x7fbf5d1daeb8 [E_RTPENGINE_STATUS], idx 2, ver/cnt 1, to new 0x7fbf5d1dd080 [E_RTPENGINE_STATUS], idx 2, ver/cnt 1
Mar 8 14:38:38 [18] DBG:core:ipc_handle_job: received job type 0[RPC] from process 41
Mar 8 14:38:38 [18] DBG:core:get_dummy_sip_msg: reusing the static sip msg 0x7fbfa019bcf0
Mar 8 14:38:38 [18] NOTICE:[E_RTPENGINE_STATUS] udp:100.127.6.22:22222: inactive
Mar 8 14:38:38 [18] DBG:core:comp_scriptvar: str 20: inactive
Mar 8 14:38:38 [58] DBG:core:release_dummy_sip_msg: cleaning the static sip msg 0x7fbfa019bcf0
Mar 8 14:38:38 [58] DBG:core:destroy_avp_list: destroying list (nil)
Mar 8 14:38:38 [58] ERROR:rtpengine:send_rtpe_command: TESTING 5
Mar 8 14:38:38 [58] ERROR:rtpengine:rtpe_test: proxy did not respond to ping
Mar 8 14:38:38 [18] DBG:core:async_script_launch: placing launch job into reactor
Mar 8 14:38:38 [18] DBG:core:io_watch_add: [UDP_worker] io_watch_add op (286 on 199) (0x5592bf1f1f40, 286, 3, 0x7fbf5d1dd340,1), fd_no=5/1048576
Mar 8 14:38:38 [58] DBG:rtpengine:connect_rtpengines: successfully updated rtpengine sets
Mar 8 14:38:38 [58] DBG:core:release_dummy_sip_msg: cleaning the static sip msg 0x7fbfa019bcf0
Mar 8 14:38:38 [58] DBG:core:destroy_avp_list: destroying list (nil)
Mar 8 14:38:38 [62] DBG:core:io_watch_add: [TCP_main] io_watch_add op (261 on 4) (0x5592bf1f1f40, 261, 22, 0x7fbf5d19b3f8,1), fd_no=38/1048576
Mar 8 14:38:38 [18] DBG:core:release_dummy_sip_msg: cleaning the static sip msg 0x7fbfa019bcf0
Mar 8 14:38:38 [18] DBG:core:destroy_avp_list: destroying list 0x7fbf5d1dd120
Mar 8 14:38:38 [62] DBG:core:io_watch_add: [TCP_main] io_watch_add op (263 on 4) (0x5592bf1f1f40, 263, 22, 0x7fbf5d19c658,1), fd_no=39/1048576
Mar 8 14:38:38 [28] ERROR:rtpengine:send_rtpe_command: timeout waiting reply from a RTP proxy
Mar 8 14:38:38 [28] ERROR:rtpengine:send_rtpe_command: TESTING 1
Mar 8 14:38:38 [28] CRITICAL:core:sig_usr: segfault in process pid: 28, id: 15
Mar 8 14:38:38 [28] DBG:core:restore_segv_handler: restoring SIGSEGV handler...
Mar 8 14:38:38 [28] DBG:core:restore_segv_handler: successfully restored system SIGSEGV handler
Mar 8 14:38:38 [1] DBG:core:handle_sigs: OpenSIPS exit status = 11
Mar 8 14:38:38 [1] INFO:core:handle_sigs: child process 28 exited by a signal 11
Mar 8 14:38:38 [1] INFO:core:handle_sigs: core was not generated
Mar 8 14:38:38 [1] INFO:core:handle_sigs: terminating due to SIGCHLD

OS/environment information

  • Operating System:
  • OpenSIPS installation:
  • other relevant information:

Additional context

NormB commented Mar 8, 2024

It wasn't mentioned in the ticket because I thought it was obvious that the file in question is:

modules/rtpengine/rtpengine.c

I have replaced "goto badproxy" with "return NULL" and the segfault during startup did not happen.

This is not the correct solution, because the proper marking of a bad proxy is not performed when a timeout happens; it is only a workaround that avoids the segfault.

if (i == rtpengine_retr) {
        LM_ERR("timeout waiting reply from a RTP proxy\n");
        goto badproxy;
}

NormB commented Mar 8, 2024

Found that, in the connect_rtpengines() routine, wrapping the call to rtpe_test() with RTPE_START_READ()/RTPE_STOP_READ(), which takes a lock, appears to be a better workaround. The segfault is avoided and RTPEngine failover works.

Unfortunately, this workaround causes a deadlock/loop when using the rtpengine_reload command.

        RTPE_START_READ();
        for (rtpe_list = (*rtpe_set_list)->rset_first; rtpe_list != 0;
                rtpe_list = rtpe_list->rset_next) {

                for (pnode = rtpe_list->rn_first; pnode != 0; pnode = pnode->rn_next) {
                        /* reuse socket if already initialized */
                        if ((rtpe_socks[pnode->idx] != -1 || rtpengine_connect_node(pnode)) && force_test)
                                pnode->rn_disabled = rtpe_test(pnode, 0, 1);
                        /* else, there is an error, and we try to connect the next one */
                }
        }
        RTPE_STOP_READ();

NormB commented Mar 11, 2024

Because of #3334, a temporary hack was added to provide the ability to scale the number of RTPEngines out and in: a timer_route[] that periodically reloads the list of RTPEngines.

It might be the rtpengine_reload during startup that caused the deadlock/segfault. Within rtpengine.c, the "rtpengine_reload" routine takes locks, but it also makes significant changes to the internal data structure used to maintain the list of RTPEngines.

If a routine other than "rtpengine_reload" is in the middle of executing (for example, blocked waiting on a lock that rtpengine_reload holds), the RTPEngine list data structure might have changed by the time it resumes, and a segfault would likely happen.

It would be a good idea to ensure that the RTPEngine list data structure does not change while other routines that use it are executing. For example, "rtpengine_show" does not change the list of RTPEngines, but it could be impacted if the list changes while it is looping through it.

This may be the reason for RTPE_START_READ/RTPE_STOP_READ. If so, perhaps there is an edge case that is not covered. That is why I tried adding it (see the previous comment).

timer_route[rtpengine_check, 3] {
        mi("rtpengine_reload", $var(ret));
}


Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Mar 27, 2024

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.
