[BUG] usrloc sync problems #1976

Closed
vasilevalex opened this issue Feb 18, 2020 · 8 comments

@vasilevalex
Contributor

Hello team,

OpenSIPS version you are running

version: opensips 2.4.6 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: unknown
main.c compiled on 12:59:35 Dec 18 2019 with gcc 4.8.5

Describe the bug
Cluster with 2 nodes (Active/Passive, both running), using the clusterer, mid_registrar and nathelper modules.
The registrars are Asterisk servers. All usrloc data is kept in memory, no DB. About 10000 phones, 99% of them on TLS.
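
For context, a minimal sketch of this kind of 2-node clusterer + usrloc setup (the module parameters below are illustrative, based on the 2.4 module docs, and are not the exact production configuration):

    # opensips.cfg - sketch only, not the real config
    loadmodule "clusterer.so"
    modparam("clusterer", "db_url", "mysql://opensips:xxx@localhost/opensips")
    modparam("clusterer", "my_node_id", 1)       # node 2 on the other host
    modparam("clusterer", "ping_timeout", 1000)  # default value, in milliseconds

    loadmodule "usrloc.so"
    modparam("usrloc", "db_mode", 0)             # all usrloc data in memory, no DB
    modparam("usrloc", "location_cluster", 1)    # replicate contacts within cluster 1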

  1. If I restart the Backup node, it starts syncing but only transfers a few thousand AORs (sometimes 1000, sometimes 3000), then simply stops without any errors in the logs. After that I have to run ul_cluster_sync manually several times until the sync is complete (see the command sketch after this list).
  2. During bulk re-registrations, the Backup node sometimes misses updates from the Active host and deletes 10-1000 AORs as expired, even though they are successfully renewed on the Active host and keep working.
    During bulk re-registrations the Active host is not overloaded and should not stop answering the Backup over the BIN protocol. I am 100% sure there are no network issues between the hosts.
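
The manual re-sync in item 1 goes through the MI interface; roughly how I trigger it on the Backup node (the exact invocation may differ between setups; opensipsctl/mi_fifo assumed here):

    # force a full usrloc sync from the cluster (run on the Backup node)
    opensipsctl fifo ul_cluster_sync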

To Reproduce
Could not reproduce in a test environment with a small number of phones.

Expected behavior
Data should always be correct and in sync on both nodes.

Relevant System Logs
I see these lines on the Backup host 20-100 times per day:
Feb 9 03:36:21 srv01 /usr/sbin/opensips[1739]: CRITICAL:usrloc:receive_ucontact_insert: #12>>> replicated ct insert: differring rlabels! (ci: '313538313231393337383531313637-2bhem6nfyoe5')#012It seems you have hit a programming bug.#012Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues
And during bulk re-registrations, the Backup host sometimes logs messages like:
Feb 18 06:01:56 srv01 /usr/sbin/opensips[1539]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is down
Feb 18 06:01:57 srv01 /usr/sbin/opensips[1592]: INFO:clusterer:handle_internal_msg: Node [2] is UP
Feb 18 06:31:43 srv01 /usr/sbin/opensips[1593]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is down
Feb 18 06:31:44 srv01 /usr/sbin/opensips[1592]: INFO:clusterer:handle_internal_msg: Node [2] is UP

OS/environment information

  • Operating System: CentOS 7
  • OpenSIPS installation: git
  • other relevant information:

Additional context

@liviuchircu liviuchircu self-assigned this Feb 19, 2020
@liviuchircu
Member

liviuchircu commented Feb 19, 2020

Hi @vasilevalex !

After discussing this with @rvlad-patrascu, we think we have a strong lead on this problem. To confirm, could you first scan your Active node's logs for the following types of messages (adjust for your cluster ID):

  1. INFO:usrloc:replicate.*: All destinations in cluster: 1 are down or probing
  2. ERROR:usrloc:replicate.*: Error sending in cluster: 1

Do you also see either 1) or 2) logs near your "Ping reply not received, node is down" logs from clusterer? If yes, which of them? Thanks!

@vasilevalex
Contributor Author

vasilevalex commented Feb 19, 2020

Hi, @liviuchircu !
Over the last 3 days I have only 1 occurrence of message 2 (ERROR:usrloc:replicate_ucontact_update: Error sending in cluster: 1) on the Active host, and about 35000 occurrences of message 1 (INFO:usrloc:replicate_ucontact_update: All destinations in cluster: 1 are down or probing), also on the Active host. All occurrences of message 1 happen during bulk re-registrations, which is exactly when the Backup host logs the lost pings.
But the Active host's rsyslogd is very busy, e.g.

rsyslogd: imjournal: 87654 messages lost due to rate-limiting

So some messages may have been lost.

@liviuchircu
Member

Excellent, thank you! We think our theory is confirmed ~90% now. In short, we think that the BIN link between the nodes has become a bottleneck in your setup, since lots of replicated packets + cluster pings must share it.

The interesting thing is that, conceptually, we cannot perform the cluster keepalive pings on a separate connection, because that would defeat the purpose of the keepalive (the main idea is to check whether cluster packets can still be safely sent over that pipe). So we have to work with what we have and optimize it. Here, we have two solutions:

  • "quick and dirty solution": just increase your cluster ping_timeout by a factor of 10. So, instead of 1000 ms, make it 10000 ms and the node-node link should be much more robust.
  • "improved scalability": with some patching, we can make the Backup node process BIN packets in parallel, not serially! In fact, @rvlad-patrascu already did this commit on master branch half a year ago (see 40b7422), which will dramatically reduce the chance that ping packets stay queued for longer than ping_timeout seconds, thus breaking node-node connectivity. I can't say if it's easily backportable to 2.4, I'm afraid. It may not apply cleanly...

@vasilevalex
Contributor Author

Thanks, @liviuchircu .
I'd vote for the second solution.
In my case, replicating contacts doesn't make much sense for TLS anyway (those phones have to reconnect on every Active -> Backup switchover); it only matters for a very small number of UDP devices.
So I can try to backport this commit.

@vasilevalex
Contributor Author

Hi @liviuchircu .
I backported the commit to 2.4 (vasilevalex@1d9cb41); it builds and runs fine in the test environment, but without a huge number of phones I can't really test it :)
Could you or @rvlad-patrascu take a look at this change? Do you think it is ready for production?

@liviuchircu
Member

@vasilevalex your backported commit is almost correct; make sure to also incorporate this fix into it: fabcfa1, so your data doesn't get misaligned during a sync. Good job!

@vasilevalex
Contributor Author

vasilevalex commented Feb 21, 2020

OK, thanks @liviuchircu! Will do. And yes, I think I also need part of this commit for clusterer.c: 1f7ea96

@vasilevalex
Contributor Author

vasilevalex commented Feb 28, 2020

Yes, I can confirm that with the ping timeout increased from 1000 ms to 5000 ms, 10000 AORs are replicated without any problems. The version with IPC is running in the test environment, but I haven't checked it in production yet.
