[BUG] usrloc sync problems #1976
Comments
Hi @vasilevalex ! Discussing with @rvlad-patrascu about this, we think we have a strong lead on this problem. To confirm, could you first scan your active node logs for the following types of messages (adjust per your cluster ID):
Do you also see either 1) or 2) logs nearby your |
Hi, @liviuchircu !
Some messages can be lost. |
Excellent, thank you! We think our theory is now ~90% confirmed. In short, we think that the BIN link between the nodes has become a bottleneck in your setup, since lots of replicated packets and the cluster pings must share it. The interesting thing is that, conceptually, we cannot perform the cluster keepalive pings on a separate connection, because that would defeat the purpose of the keepalive (the main idea is to check whether cluster packets can still be safely sent over that pipe). So we have to deal with what we have and optimize it. Here, we have two solutions:
|
Thanks, @liviuchircu . |
Hi @liviuchircu . |
@vasilevalex your backported commit is almost correct, make sure to also incorporate this fix into it: fabcfa1, so your data doesn't misalign during a sync. Good job! |
ok, thanks @liviuchircu ! Will do. Yes, and I think, I need part of this for clusterer.c: 1f7ea96 |
Yes, I can confirm that with the timeout increased from 1000 ms to 5000 ms, 10000 AORs are replicated without any problems. The version with IPC is running in the test environment, but I haven't checked it in production yet. |
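As a rough illustration of why a longer timeout helps when pings and replication packets share one BIN link, here is a minimal sketch of a strict FIFO model. The function names and the per-packet cost of 0.3 ms are hypothetical values chosen for illustration; they are not measured from OpenSIPS, and the real transport is more complex than this.

```python
def ping_wait_ms(queued_packets: int, per_packet_ms: float) -> float:
    """Time a ping waits behind packets already queued on the same link.

    Assumes a strict FIFO link where every queued packet costs the same
    (hypothetical) amount of time to send and process.
    """
    return queued_packets * per_packet_ms


def ping_times_out(queued_packets: int, per_packet_ms: float, timeout_ms: float) -> bool:
    """True if the ping reply cannot arrive before the timeout expires."""
    return ping_wait_ms(queued_packets, per_packet_ms) > timeout_ms


if __name__ == "__main__":
    # 10000 contacts queued during a bulk re-registration (figure from the report),
    # 0.3 ms per packet (made-up number for the sake of the example).
    for timeout_ms in (1000, 5000):
        late = ping_times_out(10000, 0.3, timeout_ms)
        print(f"timeout={timeout_ms} ms -> ping considered lost: {late}")
```

In this toy model the ping is stuck behind the replication burst for about 3 seconds, so it is declared lost with a 1000 ms timeout but not with a 5000 ms one. The node itself is healthy in both cases, which matches the short "down"/"UP" flaps in the logs.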
Hello team,
OpenSIPS version you are running
Describe the bug
Cluster with 2 nodes (Active/Passive), both running. Modules in use: clusterer, mid_registrar, nathelper.
The registrars are Asterisk servers. All usrloc data is kept in memory, no DB. About 10000 phones, 99% of them over TLS.
ul_cluster_sync
manually several times, until full sync.
During bulk re-registrations the Active host is not overloaded and should not stop answering the Backup over the bin proto. I'm also 100% sure that there are no network issues between the hosts.
To Reproduce
Could not reproduce in a test environment with a small number of phones.
Expected behavior
Always correct and synced data on both nodes.
Relevant System Logs
I have these lines on the backup host 20-100 times per day:
Feb 9 03:36:21 srv01 /usr/sbin/opensips[1739]: CRITICAL:usrloc:receive_ucontact_insert: #12>>> replicated ct insert: differring rlabels! (ci: '313538313231393337383531313637-2bhem6nfyoe5')#012It seems you have hit a programming bug.#012Please help us make OpenSIPS better by reporting it at https://github.com/OpenSIPS/opensips/issues
And during bulk re-registrations, the backup host sometimes logs these messages:
Feb 18 06:01:56 srv01 /usr/sbin/opensips[1539]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is down
Feb 18 06:01:57 srv01 /usr/sbin/opensips[1592]: INFO:clusterer:handle_internal_msg: Node [2] is UP
Feb 18 06:31:43 srv01 /usr/sbin/opensips[1593]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is down
Feb 18 06:31:44 srv01 /usr/sbin/opensips[1592]: INFO:clusterer:handle_internal_msg: Node [2] is UP
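To see how often and for how long a node is flagged as down, the "down"/"UP" pairs above can be matched up with a small script. This is only a sketch: the regular expressions are written against the exact log lines quoted in this report, the helper names are hypothetical, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

STAMP = r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) "
DOWN_RE = re.compile(STAMP + r".*clusterer:\S+: Ping reply not received, node \[(\d+)\] is down")
UP_RE = re.compile(STAMP + r".*clusterer:\S+: Node \[(\d+)\] is UP")


def parse_ts(stamp, year):
    # Syslog omits the year, so the caller has to supply one (assumption).
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def find_flaps(lines, year=2020):
    """Return (node_id, down_at, up_at, seconds_down) for each down -> UP pair."""
    pending = {}  # node id -> time it was last reported down
    flaps = []
    for line in lines:
        m = DOWN_RE.search(line)
        if m:
            pending[m.group(2)] = parse_ts(m.group(1), year)
            continue
        m = UP_RE.search(line)
        if m and m.group(2) in pending:
            down_at = pending.pop(m.group(2))
            up_at = parse_ts(m.group(1), year)
            flaps.append((int(m.group(2)), down_at, up_at, (up_at - down_at).total_seconds()))
    return flaps


SAMPLE = """\
Feb 18 06:01:56 srv01 /usr/sbin/opensips[1539]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is down
Feb 18 06:01:57 srv01 /usr/sbin/opensips[1592]: INFO:clusterer:handle_internal_msg: Node [2] is UP
Feb 18 06:31:43 srv01 /usr/sbin/opensips[1593]: INFO:clusterer:do_action_trans_2: Ping reply not received, node [2] is down
Feb 18 06:31:44 srv01 /usr/sbin/opensips[1592]: INFO:clusterer:handle_internal_msg: Node [2] is UP
"""

if __name__ == "__main__":
    for node, down_at, up_at, secs in find_flaps(SAMPLE.splitlines()):
        print(f"node {node}: down {down_at:%H:%M:%S}, up {up_at:%H:%M:%S}, {secs:.0f}s")
```

On the four lines quoted above, this finds two flaps of node 2, each lasting one second. To scan a real log, pass the open file object instead of `SAMPLE.splitlines()`.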
OS/environment information
Additional context