Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opensips 2.3 - BUG - shutdown timeout triggered #1172

Closed
apsaras opened this issue Aug 9, 2017 · 12 comments
Closed

Opensips 2.3 - BUG - shutdown timeout triggered #1172

apsaras opened this issue Aug 9, 2017 · 12 comments
Assignees

Comments

@apsaras
Copy link

apsaras commented Aug 9, 2017

Hello.

I am testing latest OpenSIPs 2.3 on Debian 7 64bit running on XenServer VM and randomly crash with no core dump generated. Those are the results of log level 6.

version: opensips 2.3.1 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
git revision: c33f23f
main.c compiled on 17:30:58 Aug 9 2017 with gcc 4.7

Aug 9 19:27:08 myhostname /sbin/opensips[24788]: DBG:nathelper:nh_timer: resolving next hop: '1.1.1.1'
Aug 9 19:27:09 myhostname /sbin/opensips[24782]: DBG:httpd:answer_to_connection: START *** cls=(nil), connection=0x1a6df20, url=/json/sip_trace, method=GET, versio=HTTP/1.1, upload_data[0]=(nil), con_cls=(nil)
Aug 9 19:27:09 myhostname /sbin/opensips[24782]: DBG:httpd:getConnectionHeader: Accept=
/*
Aug 9 19:27:09 myhostname /sbin/opensips[24782]: DBG:httpd:answer_to_connection: accept_type=[-1]
Aug 9 19:27:09 myhostname /sbin/opensips[24782]: DBG:httpd:answer_to_connection: normalised_url=[/sip_trace]
Aug 9 19:27:09 myhostname /sbin/opensips[24782]: DBG:mi_json:mi_json_answer_to_connection: START *** cls=(nil), connection=0x1a6df20, url=/sip_trace, method=GET, versio=HTTP/1.1, upload_data[0]=(nil), *con_cls=(nil)
Aug 9 19:27:09 myhostname /sbin/opensips[24782]: ERROR:mi_json:mi_json_answer_to_connection: unable to find mi command [sip_trace]
Aug 9 19:27:09 myhostname /sbin/opensips[24782]: DBG:httpd:answer_to_connection: MHD_create_response_from_data [0x7fe85a5e06e0:56]
Aug 9 19:27:12 myhostname /sbin/opensips[24787]: DBG:usrloc:run_ul_callbacks: contact=0x7fe85f988d18, callback type 8/15, id 0 entered
Aug 9 19:27:12 myhostname /sbin/opensips[24787]: DBG:mid_registrar:mid_reg_ct_event: Contact callback (8): contact='sip:s@1.1.1.24:14128' | param=(0x7fe85f988e48 -> (nil)) | data[0]=((nil))
Aug 9 19:27:12 myhostname /sbin/opensips[24826]: DBG:core:io_wait_loop_epoll: [TCP_main] EPOLLHUP on IN ->connection closed by the remote peer!
Aug 9 19:27:12 myhostname /sbin/opensips[24826]: CRITICAL:core:receive_fd: EOF on 22
Aug 9 19:27:12 myhostname /sbin/opensips[24826]: DBG:core:handle_worker: dead child 6, pid 24787 (shutting down?)
Aug 9 19:27:12 myhostname /sbin/opensips[24826]: DBG:core:io_watch_del: [TCP_main] io_watch_del op on index 5 22 (0x8d3f00, 22, 5, 0x0,0x1) fd_no=35 called
Aug 9 19:27:12 myhostname /sbin/opensips[24776]: DBG:core:handle_sigs: status = 11
Aug 9 19:27:12 myhostname /sbin/opensips[24776]: INFO:core:handle_sigs: child process 24787 exited by a signal 11
Aug 9 19:27:12 myhostname /sbin/opensips[24776]: INFO:core:handle_sigs: core was not generated
Aug 9 19:27:12 myhostname /sbin/opensips[24776]: INFO:core:handle_sigs: terminating due to SIGCHLD
Aug 9 19:27:12 myhostname /sbin/opensips[24826]: INFO:core:sig_usr: signal 15 received
......
Aug 9 19:27:12 myhostname /sbin/opensips[24776]: INFO:core:cleanup: cleanup
Aug 9 19:27:12 myhostname /sbin/opensips[24776]: DBG:uac_auth:mod_destroy: done
Aug 9 19:27:12 myhostname /sbin/opensips[24776]: DBG:core:pool_remove: connection still kept in the pool
Aug 9 19:27:12 myhostname /sbin/opensips[24776]: DBG:core:pool_remove: connection still kept in the pool
Aug 9 19:27:12 myhostname /sbin/opensips[24776]: DBG:httpd:httpd_proc_destroy: destroying module ...
Aug 9 19:28:12 myhostname /sbin/opensips[24776]: CRITICAL:core:sig_alarm_abort: BUG - shutdown timeout triggered, dying...

@apsaras
Copy link
Author

apsaras commented Aug 14, 2017

I have two core dumps. Bellow are bt / bt full output

Core dump 1

(gdb) bt #0 0x00007fe9171ca125 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007fe9171cd3a0 in abort () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x0000000000451afb in sig_alarm_abort (signo=<optimized out>) at main.c:414 #3 <signal handler called> #4 0x00007fe917250477 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6 #5 0x00007fe8f1c7f146 in get_lock (lock=<optimized out>) at ../../fastlock.h:221 #6 lock_ulslot (i=<optimized out>, _d=<optimized out>) at udomain.c:1235 #7 lock_ulslot (_d=0x7fe8f5c9c9b0, i=-171315776) at udomain.c:1231 #8 0x00007fe8f1c7f714 in free_udomain (_d=0x7fe8f5c9c9b0) at udomain.c:393 #9 0x00007fe8f1c6f253 in free_all_udomains () at dlist.c:742 #10 0x00007fe8f1c88ff0 in destroy () at ul_mod.c:503 #11 0x00000000004ed823 in destroy_modules () at sr_module.c:527 #12 0x0000000000452225 in cleanup (show_status=show_status@entry=1) at main.c:319 #13 0x0000000000452ef3 in handle_sigs () at main.c:522 #14 0x000000000041ec50 in main_loop () at main.c:722 #15 main (argc=<optimized out>, argv=<optimized out>) at main.c:1283 (gdb) bt full #0 0x00007fe9171ca125 in raise () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #1 0x00007fe9171cd3a0 in abort () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #2 0x0000000000451afb in sig_alarm_abort (signo=<optimized out>) at main.c:414 __FUNCTION__ = "sig_alarm_abort" #3 <signal handler called> No symbol table info available. #4 0x00007fe917250477 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6 No symbol table info available. #5 0x00007fe8f1c7f146 in get_lock (lock=<optimized out>) at ../../fastlock.h:221 i = 0 #6 lock_ulslot (i=<optimized out>, _d=<optimized out>) at udomain.c:1235 No locals. #7 lock_ulslot (_d=0x7fe8f5c9c9b0, i=-171315776) at udomain.c:1231 No locals. #8 0x00007fe8f1c7f714 in free_udomain (_d=0x7fe8f5c9c9b0) at udomain.c:393 i = <optimized out> #9 0x00007fe8f1c6f253 in free_all_udomains () at dlist.c:742 ptr = 0x7fe8f5c9c950 #10 0x00007fe8f1c88ff0 in destroy () at ul_mod.c:503 __FUNCTION__ = "destroy" #11 0x00000000004ed823 in destroy_modules () at sr_module.c:527 t = 0x7fe915969610 foo = 0x7fe915969530 __FUNCTION__ = "destroy_modules" #12 0x0000000000452225 in cleanup (show_status=show_status@entry=1) at main.c:319 __FUNCTION__ = "cleanup" #13 0x0000000000452ef3 in handle_sigs () at main.c:522 chld = 0 chld_status = 139 overall_status = 139 ---Type <return> to continue, or q <return> to quit--- i = <optimized out> do_exit = <optimized out> __FUNCTION__ = "handle_sigs" #14 0x000000000041ec50 in main_loop () at main.c:722 startup_done = <optimized out> chd_rank = 21 #15 main (argc=<optimized out>, argv=<optimized out>) at main.c:1283 cfg_stream = <optimized out> c = <optimized out> r = <optimized out> tmp = 0x7ffce93c3eba "" tmp_len = <optimized out> port = <optimized out> proto = <optimized out> protos_no = <optimized out> options = 0x626d88 "f:cCm:M:b:l:n:N:rRvdDFETSVhw:t:u:g:P:G:W:o:" ret = -1 seed = 3605883726 rfd = <optimized out> __FUNCTION__ = "main"

Core Dump 2

(gdb) bt #0 0x00007f145bae7195 in mri_free (mri=mri@entry=0x0) at mid_registrar.c:403 #1 0x00007f145bafb86f in mid_reg_ct_event (binding=<optimized out>, type=<optimized out>, data=<optimized out>) at ulcb.c:144 #2 0x00007f145c4057fe in run_ul_callbacks (binding=0x7f1460438c90, type=8) at ul_callback.h:126 #3 nodb_timer (_r=0x7f1460438be0) at urecord.c:245 #4 timer_urecord (_r=_r@entry=0x7f1460438be0, ins_list=0x7f1400000000, ins_list@entry=0x7f14604169b8) at urecord.c:419 #5 0x00007f145c3f9221 in mem_timer_udomain (_d=<optimized out>) at udomain.c:1149 #6 0x00007f145c3e95fd in synchronize_all_udomains () at dlist.c:803 #7 0x00007f145c4017b7 in timer (ticks=<optimized out>, param=<optimized out>) at ul_mod.c:518 #8 0x0000000000501df5 in handle_timer_job () at timer.c:723 #9 0x00000000005ebcf8 in handle_io (fm=<optimized out>, idx=<optimized out>, event_type=<optimized out>) at net/net_udp.c:264 #10 io_wait_loop_epoll (repeat=<optimized out>, t=<optimized out>, h=<optimized out>) at net/../io_wait_loop.h:284 #11 0x00000000005eda54 in udp_start_processes (chd_rank=chd_rank@entry=0x885f24, startup_done=startup_done@entry=0x0) at net/net_udp.c:389 #12 0x000000000041e94f in main_loop () at main.c:677 #13 main (argc=<optimized out>, argv=<optimized out>) at main.c:1283 (gdb) bt full #0 0x00007f145bae7195 in mri_free (mri=mri@entry=0x0) at mid_registrar.c:403 __FUNCTION__ = "mri_free" #1 0x00007f145bafb86f in mid_reg_ct_event (binding=<optimized out>, type=<optimized out>, data=<optimized out>) at ulcb.c:144 c = <optimized out> mri = 0x0 update = <optimized out> __FUNCTION__ = "mid_reg_ct_event" #2 0x00007f145c4057fe in run_ul_callbacks (binding=0x7f1460438c90, type=8) at ul_callback.h:126 cbp = 0x7f146040a6f0 ct_extra_idx = <optimized out> aor_extra_idx = 0 ele = 0x7f146040a708 #3 nodb_timer (_r=0x7f1460438be0) at urecord.c:245 ptr = 0x7f1460438c90 t = <optimized out> #4 timer_urecord (_r=_r@entry=0x7f1460438be0, ins_list=0x7f1400000000, ins_list@entry=0x7f14604169b8) at urecord.c:419 No locals. #5 0x00007f145c3f9221 in mem_timer_udomain (_d=<optimized out>) at udomain.c:1149 ptr = 0x7f1460438be0 dest = <optimized out> i = <optimized out> ret = <optimized out> flush = 0 it = {node = 0x0, map = 0x7f146041e8a8} prev = {node = 0x7f1460438c40, map = 0x7f146041e8a8} __FUNCTION__ = "mem_timer_udomain" #6 0x00007f145c3e95fd in synchronize_all_udomains () at dlist.c:803 res = 0 ptr = 0x7f1460416950 #7 0x00007f145c4017b7 in timer (ticks=<optimized out>, param=<optimized out>) at ul_mod.c:518 ---Type <return> to continue, or q <return> to quit--- __FUNCTION__ = "timer" #8 0x0000000000501df5 in handle_timer_job () at timer.c:723 t = 0x7f14603e7588 l = <optimized out> __FUNCTION__ = "handle_timer_job" #9 0x00000000005ebcf8 in handle_io (fm=<optimized out>, idx=<optimized out>, event_type=<optimized out>) at net/net_udp.c:264 No locals. #10 io_wait_loop_epoll (repeat=<optimized out>, t=<optimized out>, h=<optimized out>) at net/../io_wait_loop.h:284 ret = 1 n = 1 r = 2 i = <optimized out> e = <optimized out> ep_event = {events = 0, data = {ptr = 0x885f2400000000, fd = 0, u32 = 0, u64 = 38385205056110592}} fd = <optimized out> #11 0x00000000005eda54 in udp_start_processes (chd_rank=chd_rank@entry=0x885f24, startup_done=startup_done@entry=0x0) at net/net_udp.c:389 si = <optimized out> load_p = 0x7f1460421c08 pid = <optimized out> i = <optimized out> __FUNCTION__ = "udp_start_processes" #12 0x000000000041e94f in main_loop () at main.c:677 startup_done = 0x0 chd_rank = 5 #13 main (argc=<optimized out>, argv=<optimized out>) at main.c:1283 cfg_stream = <optimized out> c = <optimized out> r = <optimized out> tmp = 0x7ffc39fa6eba "" ---Type <return> to continue, or q <return> to quit--- tmp_len = <optimized out> port = <optimized out> proto = <optimized out> protos_no = <optimized out> options = 0x626d88 "f:cCm:M:b:l:n:N:rRvdDFETSVhw:t:u:g:P:G:W:o:" ret = -1 seed = 3663129094 rfd = <optimized out> __FUNCTION__ = "main"

@liviuchircu liviuchircu self-assigned this Aug 18, 2017
@liviuchircu
Copy link
Member

Hi, @apsaras!

Any specific configuration to your usrloc / mid_registrar modules? I have yet to trigger a contact expiry crash - maybe some module config details can help me understand your scenario better.

@apsaras
Copy link
Author

apsaras commented Aug 18, 2017

Hello Liviu

My configuration is a bit strange. What I am doing is the following. I am using registrar to register a PBX with OpenSIPs, and then, using the established connection, to forward registration requests to PBX from remote users using mid-registrar.

Following are the registrar and mid-registrar module configuration.

----- mid_registrar -----

modparam("mid_registrar","mode",2)
modparam("mid_registrar", "insertion_mode", 1)
modparam("mid_registrar", "outgoing_expires", 3600)
modparam("mid_registrar", "received_avp", "$avp(recavp)")
modparam("mid_registrar", "max_contacts", 10)

----- registrar params -----

modparam("registrar", "max_contacts", 10)
modparam("registrar", "received_avp", "$avp(recavp)")
modparam("registrar", "min_expires", 3600)
modparam("registrar", "max_expires", 3600)
modparam("registrar", "default_expires", 3600)
modparam("registrar", "default_q", 1000)

@liviuchircu
Copy link
Member

Ok, that explains it. This is quite an unexpected scenario - by design, registrar and mid_registrar are currently mutually exclusive, hence why the PBX contact's "mid_registrar" data is NULL, causing the crash on expiry.

Thinking of a way to solve this.

@apsaras
Copy link
Author

apsaras commented Aug 18, 2017

Is there any other way to implement the above scenario?

PBX is behind NAT with dynamic IP and remote users should register on PBX via Proxy.

The only reason I need PBX Registration is for authentication, location and NAT.

Another issue that I have and it is strange, not sure if related to that bug is that in some cases Remote IP Phones are registering, receiving OK and after that send a de-registration (Registration with 0 expiration). When that happens, OpenSIPs returns 400.

@apsaras
Copy link
Author

apsaras commented Aug 25, 2017

Hello Liviu.

Do you have any estimation on that? Can you suggest any workaround?

Thank you.

liviuchircu added a commit that referenced this issue Aug 25, 2017
    * do not crash with usrloc replication (although it's not supported)
    * handle registrar + mid_registrar concurrently

Fixes #1172
Fixes #1183

(cherry picked from commit bcae3a0)
@liviuchircu
Copy link
Member

Just pushed a fix for this - please let me know if you encounter any more issues when running both registrars concurrently.

PS: We analyzed your scenario, and couldn't come up with a better way of solving it :)

@apsaras
Copy link
Author

apsaras commented Aug 25, 2017

Hello Liviu.

Thank you very much for your prompt fix.

I compiled the new rev. and I am getting the following cores dumps now.

core dump 1

(gdb) bt #0 0x00000000005395ef in fm_remove_free (n=0x7f5c2590f660, qm=0x7f5c25515000) at mem/f_malloc.c:200 #1 fm_malloc (qm=0x7f5c25515000, size=size@entry=40) at mem/f_malloc.c:428 #2 0x00007f5c220b7ed6 in shm_malloc (size=40) at ../../mem/shm_mem.h:437 #3 insert_tmcb (cb_list=0x7f5c222fc140, types=types@entry=2, f=0x7f5c20f30d6d <mid_reg_resp_in>, param=0x7f5c258d4bf0, release_func=0) at t_hooks.c:92 #4 0x00007f5c220b81f8 in register_tmcb (p_msg=<optimized out>, t=<optimized out>, types=2, f=<optimized out>, param=<optimized out>, release_func=<optimized out>) at t_hooks.c:172 #5 0x00007f5c20f284da in prepare_forward (msg=msg@entry=0x7f5c45555cf8, ud=0x18, ud@entry=0x7f5c258519b0, aor=aor@entry=0x7ffe86d4e928, sctx=sctx@entry=0x7ffe86d4e920) at save.c:1432 #6 0x00007f5c20f35392 in mid_reg_save (msg=<optimized out>, dom=<optimized out>, flags_gp=<optimized out>, to_uri_gp=<optimized out>, expires_gp=<optimized out>) at save.c:2101 #7 0x00000000004221bc in do_action (a=a@entry=0x7f5c45530988, msg=msg@entry=0x7f5c45555cf8) at action.c:1862 #8 0x00000000004297ec in run_action_list (a=<optimized out>, msg=0x7f5c45555cf8) at action.c:172 #9 0x000000000042685f in do_action (a=a@entry=0x7f5c45532540, msg=msg@entry=0x7f5c45555cf8) at action.c:1130 #10 0x000000000042a2ac in run_action_list (msg=<optimized out>, a=<optimized out>) at action.c:172 #11 run_actions (msg=<optimized out>, a=<optimized out>) at action.c:137 #12 run_actions (a=<optimized out>, msg=0x7f5c45555cf8) at action.c:118 #13 0x0000000000424b21 in do_action (a=a@entry=0x7f5c4553f870, msg=msg@entry=0x7f5c45555cf8) at action.c:761 #14 0x00000000004297ec in run_action_list (a=<optimized out>, msg=0x7f5c45555cf8) at action.c:172 #15 0x0000000000425b44 in do_action (a=a@entry=0x7f5c4553fa30, msg=msg@entry=0x7f5c45555cf8) at action.c:1124 #16 0x000000000042922f in run_action_list (msg=0x7f5c45555cf8, a=<optimized out>) at action.c:172 #17 run_actions (a=a@entry=0x7f5c45537e78, msg=msg@entry=0x7f5c45555cf8) at action.c:137 #18 0x0000000000429cfb in run_actions (msg=0x7f5c45555cf8, a=0x7f5c45537e78) at action.c:130 #19 run_top_route (a=0x7f5c45537e78, msg=msg@entry=0x7f5c45555cf8) at action.c:214 #20 0x00000000004aeb84 in receive_msg ( buf=0x8c32a0 "REGISTER sip:pbx1.example.com.gr;transport=UDP SIP/2.0\r\nVia: SIP/2.0/UDP 1.1.1.201:31400;branch=z9hG4bK-524287-1---b14ca798a23a08e9;rport\r\nMax-Forwards: 70\r\nContact: <sip:5511@185.135.2.2"..., len=<optimized out>, rcv_info=rcv_info@entry=0x7ffe86d509c0, existing_context=existing_context@entry=0x0) at receive.c:208 #21 0x000000000060c656 in udp_read_req (si=<optimized out>, bytes_read=<optimized out>) at net/proto_udp/proto_udp.c:192 #22 0x00000000005ebc96 in handle_io (fm=<optimized out>, idx=<optimized out>, event_type=<optimized out>) ---Type <return> to continue, or q <return> to quit--- at net/net_udp.c:259 #23 handle_io (fm=<optimized out>, idx=<optimized out>, event_type=<optimized out>) at net/../io_wait_loop.h:152 #24 io_wait_loop_epoll (repeat=<optimized out>, t=<optimized out>, h=<optimized out>) at net/../io_wait_loop.h:284 #25 0x00000000005eda4f in udp_start_processes (chd_rank=chd_rank@entry=0x885f24, startup_done=startup_done@entry=0x0) at net/net_udp.c:389 #26 0x000000000041e94f in main_loop () at main.c:677 #27 main (argc=<optimized out>, argv=<optimized out>) at main.c:1283

Core Dump 2
(gdb) bt #0 0x00007f5c46d7f125 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f5c46d823a0 in abort () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x0000000000451afb in sig_alarm_abort (signo=<optimized out>) at main.c:414 #3 <signal handler called> #4 fm_status (qm=0x7f5c25515000) at mem/f_malloc.c:710 #5 0x0000000000452333 in shm_status () at mem/shm_mem.h:611 #6 cleanup (show_status=show_status@entry=1) at main.c:341 #7 0x0000000000452ef3 in handle_sigs () at main.c:522 #8 0x000000000041ec50 in main_loop () at main.c:722 #9 main (argc=<optimized out>, argv=<optimized out>) at main.c:1283

@liviuchircu
Copy link
Member

Let me put together a full testing setup first - hopefully it will allow me to figure out what's missing.

@apsaras
Copy link
Author

apsaras commented Aug 25, 2017

I can give you access to my system to test and trace

@liviuchircu
Copy link
Member

Please email connect details to liviu@opensips.org - I can't seem yet to get it to crash over here.

@apsaras
Copy link
Author

apsaras commented Aug 25, 2017

Just sent it. Please check and if you need anything contact me via email.

razvancrainea pushed a commit that referenced this issue Nov 22, 2017
    * do not crash with usrloc replication (although it's not supported)
    * handle registrar + mid_registrar concurrently

Fixes #1172
Fixes #1183
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants