[BUG] usrloc sync problems #1976

Closed
vasilevalex opened this issue Feb 18, 2020 · 8 comments

@vasilevalex
Contributor

Hello team,

OpenSIPS version you are running

version: opensips 2.4.6 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: unknown
main.c compiled on 12:59:35 Dec 18 2019 with gcc 4.8.5

Describe the bug
Cluster with 2 nodes (Active/Passive, both running), using the clusterer, mid_registrar and nathelper modules.
The registrars are Asterisk servers. All usrloc data is kept in memory, no DB. About 10000 phones, 99% of them on TLS.
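
For context, a minimal sketch of this kind of 2-node clusterer + usrloc setup (the module parameters below are illustrative, based on the 2.4 module docs, and are not the exact production configuration):

    # opensips.cfg - sketch only, not the real config
    loadmodule "clusterer.so"
    modparam("clusterer", "db_url", "mysql://opensips:xxx@localhost/opensips")
    modparam("clusterer", "my_node_id", 1)       # node 2 on the other host
    modparam("clusterer", "ping_timeout", 1000)  # default value, in milliseconds

    loadmodule "usrloc.so"
    modparam("usrloc", "db_mode", 0)             # all usrloc data in memory, no DB
    modparam("usrloc", "location_cluster", 1)    # replicate contacts within cluster 1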

  1. If I restart the Backup node, it starts syncing but only transfers a few thousand AORs (sometimes 1000, sometimes 3000), then simply stops without any errors in the logs. After that I have to run ul_cluster_sync manually several times until the sync is complete (see the command sketch after this list).
  2. During bulk re-registrations, the Backup node sometimes misses updates from the Active host and deletes 10-1000 AORs as expired, even though they are successfully renewed on the Active host and keep working.
    During bulk re-registrations the Active host is not overloaded and should not stop answering the Backup over the BIN protocol. I am 100% sure there are no network issues between the hosts.
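
The manual re-sync in item 1 goes through the MI interface; roughly how I trigger it on the Backup node (the exact invocation may differ between setups; opensipsctl/mi_fifo assumed here):

    # force a full usrloc sync from the cluster (run on the Backup node)
    opensipsctl fifo ul_cluster_sync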

To Reproduce
Could not reproduce in a test environment with a small number of phones.

Expected behavior
Data should always be correct and in sync on both nodes.

Relevant System Logs
I see these lines on the Backup host 20-100 times per day:
Feb 9 03:36:21 srv01 /usr/sbin/opensips[1739]: CRITICAL:usrloc:receive_ucontact_insert: #12>>> replicated ct insert: differring rlabels! (ci: '313538313231393337383531313637-2bhem6nfyoe5')#012It seems you have hit a programming bug.#012Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues
And during bulk re-registrations, the Backup host sometimes logs messages like:
Feb 18 06:01:56 srv01 /usr/sbin/opensips[1539]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is down
Feb 18 06:01:57 srv01 /usr/sbin/opensips[1592]: INFO:clusterer:handle_internal_msg: Node [2] is UP
Feb 18 06:31:43 srv01 /usr/sbin/opensips[1593]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is down
Feb 18 06:31:44 srv01 /usr/sbin/opensips[1592]: INFO:clusterer:handle_internal_msg: Node [2] is UP

OS/environment information

  • Operating System: CentOS 7
  • OpenSIPS installation: git
  • other relevant information:

Additional context

@liviuchircu liviuchircu self-assigned this Feb 19, 2020
@liviuchircu
Member

liviuchircu commented Feb 19, 2020

Hi @vasilevalex !

After discussing this with @rvlad-patrascu, we think we have a strong lead on this problem. To confirm, could you first scan your Active node's logs for the following types of messages (adjust for your cluster ID):

  1. INFO:usrloc:replicate.*: All destinations in cluster: 1 are down or probing
  2. ERROR:usrloc:replicate.*: Error sending in cluster: 1

Do you also see either 1) or 2) logs near your "Ping reply not received, node is down" logs from clusterer? If yes, which of them? Thanks!

@vasilevalex
Contributor Author

vasilevalex commented Feb 19, 2020

Hi, @liviuchircu !
Over the last 3 days I have only 1 occurrence of message 2 (ERROR:usrloc:replicate_ucontact_update: Error sending in cluster: 1) on the Active host, and about 35000 occurrences of message 1 (INFO:usrloc:replicate_ucontact_update: All destinations in cluster: 1 are down or probing), also on the Active host. All occurrences of message 1 happen during bulk re-registrations, which is exactly when the Backup host logs the lost pings.
But the Active host's rsyslogd is very busy, e.g.

rsyslogd: imjournal: 87654 messages lost due to rate-limiting

So some messages may have been lost.

@liviuchircu
Member

Excellent, thank you! We think our theory is confirmed ~90% now. In short, we think that the BIN link between the nodes has become a bottleneck in your setup, since lots of replicated packets + cluster pings must share it.

The interesting thing is that, conceptually, we cannot perform the cluster keepalive pings on a separate connection, because that would defeat the purpose of the keepalive (the main idea is to check whether cluster packets can still be safely sent over that pipe). So we have to work with what we have and optimize it. Here, we have two solutions:

  • "quick and dirty solution": just increase your cluster ping_timeout by a factor of 10. So, instead of 1000 ms, make it 10000 ms and the node-node link should be much more robust.
  • "improved scalability": with some patching, we can make the Backup node process BIN packets in parallel, not serially! In fact, @rvlad-patrascu already did this commit on master branch half a year ago (see 40b7422), which will dramatically reduce the chance that ping packets stay queued for longer than ping_timeout seconds, thus breaking node-node connectivity. I can't say if it's easily backportable to 2.4, I'm afraid. It may not apply cleanly...

@vasilevalex
Contributor Author

Thanks, @liviuchircu .
I'd vote for the second solution.
In my case, replicating contacts doesn't make much sense for TLS anyway (those phones have to reconnect on every Active -> Backup switchover); it only matters for a very small number of UDP devices.
So I can try to backport this commit.

@vasilevalex
Contributor Author

Hi @liviuchircu .
I backported the commit to 2.4 (vasilevalex@1d9cb41); it builds and runs fine in the test environment, but without a huge number of phones I can't really test it :)
Could you or @rvlad-patrascu take a look at this change? Do you think it is ready for production?

@liviuchircu
Member

@vasilevalex your backported commit is almost correct; make sure to also incorporate this fix into it: fabcfa1, so your data doesn't get misaligned during a sync. Good job!

@vasilevalex
Contributor Author

vasilevalex commented Feb 21, 2020

OK, thanks @liviuchircu! Will do. And yes, I think I also need part of this commit for clusterer.c: 1f7ea96

@vasilevalex
Contributor Author

vasilevalex commented Feb 28, 2020

Yes, I can confirm that with the ping timeout increased from 1000 ms to 5000 ms, 10000 AORs are replicated without any problems. The version with IPC is running in the test environment, but I haven't checked it in production yet.
