
[CRASH] opensips 3.1 crashes in drouting.so module (and in proto_bin.so too) #2581

Closed
kertor opened this issue Jul 23, 2021 · 9 comments

kertor commented Jul 23, 2021

OpenSIPS version you are running

version: opensips 3.1.3 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: e5def93b1
main.c compiled on  with gcc 9

Crash Core Dump

https://disk.yandex.ru/d/3orUt-QRLvJVtg

Describe the traffic that generated the bug

Usual calls. The crash happened after an OpenSIPS restart.

To Reproduce

Drouting replication is configured between the OpenSIPS nodes:

modparam("drouting", "cluster_id", 101)
modparam("drouting", "cluster_sharing_tag", "ping_master")

Steps to reproduce:

  1. The drouting tables have some entries.
  2. Drouting replication is configured between the two nodes:
    "Clusters": [
        {
            "cluster_id": 101,
            "Capabilities": [
                {
                    "name": "drouting-status-repl",
                    "state": "Ok"
                },
                {
                    "name": "cachedb-local-repl",
                    "state": "Ok"
                }
            ]
        }
  3. Restart OpenSIPS on node 1. It restarts, but shortly after starting it crashes, with these errors in dmesg:
[Fri Jul 23 09:39:53 2021] opensips[3577469]: segfault at 8 ip 00007f79d05e244e sp 00007ffea6e93250 error 4 in drouting.so[7f79d05d5000+31000]
[Fri Jul 23 09:40:31 2021] opensips[3577149]: segfault at 8 ip 00007f79d0505ed5 sp 00007ffea6e93470 error 4 in proto_bin.so[7f79d0504000+6000]

Relevant System Logs

Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: message repeated 19 times: [ Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!]
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: message repeated 64 times: [ Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!]
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: message repeated 94 times: [ Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!]
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: message repeated 12 times: [ Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!]
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: Jul 23 09:39:26 [2357093] WARNING:drouting:dr_recv_sync_packet: failed to process sync chunk!
Jul 23 09:39:26 ts-devdev01-1 opensips[2357093]: Jul 23 09:39:26 [2357093] CRITICAL:core:sig_usr: segfault in process pid: 2357093, id: 48
Jul 23 09:40:00 ts-devdev01-1 opensips[2357031]: Jul 23 09:40:00 [2357031] INFO:core:handle_sigs: child process 2357093 exited by a signal 11

OS/environment information

  • Operating System: Ubuntu 20.04.2 LTS
  • OpenSIPS installation: git
  • other relevant information:

Additional context

Thank you for checking!

bogdan-iancu (Member) commented:

@kertor, in corefile 2357093, in frame 0, I assume the carrier ID tg109-dal1-o-l1 and partition out are actually valid, right?

Also, still in frame 0, could you try to print part->rdata->carriers_tree? If it fails, try printing part->rdata or just part.

bogdan-iancu self-assigned this Jul 23, 2021

kertor commented Jul 23, 2021

Hello @bogdan-iancu! Thank you for the quick response.

in corefile 2357093, in frame 0, I assume the carrier ID tg109-dal1-o-l1 and partition out are actually valid, right ?

Yes, it all looks correct, and the partition looks good too:

opensips=> select * from dr_rules where gwlist like '%tg109-dal1-o-l1%';
 ruleid | groupid | prefix | timerec | priority | routeid |      gwlist      | sort_alg | sort_profile | attrs 
--------+---------+--------+---------+----------+---------+------------------+----------+--------------+-------
      2 | 109     |        |         |        0 |         | #tg109-dal1-o-l1 | N        |              | 
(1 row)

opensips=> select * from dr_carriers where carrierid like '%tg109-dal1-o-l1%';
 id |    carrierid    |          gwlist          | flags | sort_alg | state | attrs 
----+-----------------+--------------------------+-------+----------+-------+-------
  2 | tg109-dal1-o-l1 | if13_178_19_19_20_5060=1 |     0 | W        |     0 | 
(1 row)

opensips=> select * from dr_gateways where gwid = 'if13_178_19_19_20_5060';
 id  |          gwid          | type |                           address                            | strip | pri_prefix |         attrs          | probe_mode | state | socket 
-----+------------------------+------+--------------------------------------------------------------+-------+------------+------------------------+------------+-------+--------
 120 | if13_178_19_19_20_5060 |    1 | sip:178.19.19.20:5060;sock=udp:1.1.1.1:5060;zid=1013 |     0 |            | test-1/test-1-1 |          0 |     0 | 
(1 row)

Also, still in frame 0, could you try to print this part->rdata->carriers_tree ? if it fails, try printing part->rdata or part

I cannot get info about the requested data:

(gdb) frame 0
#0  0x00007f79d05e244e in cr_status_update (packet=packet@entry=0x7ffea6e93350) at dr_clustering.c:205
205	in dr_clustering.c
(gdb) part->rdata->carriers_tree
Undefined command: "part->rdata->carriers_tree".  Try "help".
(gdb) part->rdata
Undefined command: "part->rdata".  Try "help".
(gdb) part
Undefined command: "part".  Try "help".

(gdb) frame 0 -> part->rdata->carriers_tree
Attempt to extract a component of a value that is not a structure pointer.
(gdb) frame 0 -> part->rdata
Attempt to extract a component of a value that is not a structure pointer.
(gdb) frame 0 -> part
Attempt to extract a component of a value that is not a structure pointer.


kertor commented Jul 23, 2021

Sorry, here is the output:

(gdb) print part->rdata->carriers_tree
Cannot access memory at address 0x8
(gdb) print part->rdata
$1 = (rt_data_t *) 0x0
(gdb) print part
$2 = (struct head_db *) 0x7f79d2b247e0

github-actions bot commented Aug 8, 2021

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

github-actions bot added the stale label Aug 8, 2021
bogdan-iancu (Member) commented:

Thank you @kertor! As I see it, on the freshly restarted OpenSIPS there is a kind of race condition between (a) loading the routing data from the DB and (b) already receiving the sync info from the cluster (with the status of the GWs). If (b) is faster, the dr data will be NULL (not yet loaded).
I guess I need to find a way to (1) trigger the sync AFTER the load has completed and (2) make sure that we properly handle replicated data pointing to partitions which are not yet loaded.
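
For the second point, the guard boils down to something like the sketch below. The types and the function name are simplified stand-ins for the real drouting structures (part, rdata, carriers_tree, as seen in the gdb session above), so treat it as an illustration of the idea rather than the actual fix:

#include <stddef.h>

/* Simplified stand-ins for the drouting structures referenced in the
 * gdb session above; the real definitions live in modules/drouting/. */
typedef struct rt_data { void *carriers_tree; } rt_data_t;
struct head_db { rt_data_t *rdata; };

/* Sketch of the guard: if the partition's routing data has not been
 * loaded from the DB yet (rdata is still NULL), ignore the replicated
 * status chunk instead of dereferencing a NULL pointer - that NULL
 * dereference is what shows up as "Cannot access memory at address 0x8". */
static int cr_status_update_sketch(struct head_db *part)
{
    if (part == NULL || part->rdata == NULL)
        return -1; /* data not loaded yet: drop/skip this sync chunk */

    /* ... only here is it safe to walk part->rdata->carriers_tree ... */
    return 0;
}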

bogdan-iancu added this to the 3.1.4 milestone Aug 11, 2021
bogdan-iancu (Member) commented:

@kertor, could you try this small patch:

diff --git a/modules/drouting/dr_clustering.c b/modules/drouting/dr_clustering.c
index 6bc5337fc..265c6afa5 100644
--- a/modules/drouting/dr_clustering.c
+++ b/modules/drouting/dr_clustering.c
@@ -377,8 +377,5 @@ int dr_init_cluster(void)
                dr_cluster_shtag.len = 0;
        }
 
-       if (dr_cluster_sync() < 0)
-               return -1;
-
        return 0;
 }
diff --git a/modules/drouting/drouting.c b/modules/drouting/drouting.c
index a1ba47b19..69b99e3eb 100644
--- a/modules/drouting/drouting.c
+++ b/modules/drouting/drouting.c
@@ -1930,6 +1930,9 @@ static int db_connect_head(struct head_db *x) {
 static void rpc_dr_reload_data(int sender_id, void *unused)
 {
        dr_reload_data(1);
+
+       dr_cluster_sync();
+
 }
 
 

stale bot removed the stale label Aug 11, 2021

kertor commented Aug 11, 2021

Thank you @bogdan-iancu!
Let me check this patch.


kertor commented Aug 11, 2021

@bogdan-iancu The tests look good, the problem is fixed. Thank you!

bogdan-iancu added a commit that referenced this issue Aug 11, 2021
Be sure we trigger the startup cluster sync AFTER loading the data from DB.
Also, when receiving replicated data, be sure the data is actually loaded.
Closes #2581
bogdan-iancu added a commit that referenced this issue Aug 11, 2021
Be sure we trigger the startup cluster sync AFTER loading the data from DB.
Also, when receiving replicated data, be sure the data is actually loaded.
Closes #2581

(cherry picked from commit 3b8bdb7)
bogdan-iancu (Member) commented:

Thank you @kertor!!
