
[BUG] HA backup node deletes subscriptions in Federation cluster #2960

Closed

gitforxh opened this issue Nov 29, 2022 · 10 comments

gitforxh commented Nov 29, 2022

OpenSIPS version you are running

[root@sip-657d756759-6rmtp /]# opensips -V
version: opensips 3.2.9 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
main.c compiled on 19:32:18 Oct 19 2022 with gcc 4.8.5

Describe the bug

We have two OpenSIPS instances configured as an active-backup HA pair in federation cluster mode.

The active node has the following settings:

modparam("clusterer","db_url","mysql://xxxx")
modparam("clusterer", "my_node_id", 1)
modparam("clusterer", "sharing_tag" ,"69.108.214.69/1=active")

modparam("presence","db_url","mysql://xxxx")
modparam("presence", "db_update_period", 60)
modparam("presence", "fallback2db", 1)
modparam("presence", "cluster_id", 1) 
modparam("presence", "cluster_federation_mode", "on-demand-sharing")

The backup node has the following settings:

modparam("clusterer","db_url","mysql://xxxx")
modparam("clusterer", "my_node_id", 2) 
modparam("clusterer", "sharing_tag" ,"69.108.214.69/1=backup")

modparam("presence","db_url","mysql://xxxx")
modparam("presence", "db_update_period", 60)
modparam("presence", "fallback2db", 1)
modparam("presence", "cluster_id", 1)
modparam("presence", "cluster_federation_mode", "on-demand-sharing") 

These are the entries in the clusterer table:

mysql> select * from clusterer;
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+
| id | cluster_id | node_id | url                     | state | no_ping_retries | priority | sip_addr      | flags | description |
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+
|  2 |          1 |       1 | bin:69.108.214.99:5566  |     1 |               3 |       50 | 69.108.214.69 | seed  | NULL        |
|  4 |          1 |       2 | bin:69.108.214.100:5566 |     1 |               3 |       50 | 69.108.214.69 | NULL  | NULL        |
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+

The VIP 69.108.214.69 is configured on the active node (node_id 1).

We have phones sending REGISTER and SUBSCRIBE (for BLF) requests to the VIP on node 1, and the subscriptions are processed by node 1 and stored in the active_watchers table.

When a call is in progress, a Presence server sends a PUBLISH request to the VIP on node 1, which then does the following (see the routing sketch after the list):

  1. Add an entry into the presentity table.
  2. Lookup the subscriptions and send NOTIFY to phones to turn on the BLF light.
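
For context, the relevant part of our routing is essentially the standard presence handling; this is a simplified sketch only (not our exact script), using the presence module's handle_publish() and handle_subscribe() plus is_method() from sipmsgops:

if (is_method("PUBLISH")) {
    # store/update the presentity and trigger NOTIFYs to the matching watchers
    handle_publish();
    exit;
} else if (is_method("SUBSCRIBE")) {
    # store the subscription in active_watchers and answer with 200 OK + initial NOTIFY
    handle_subscribe();
    exit;
}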

At the same time, node 1 also broadcasts the PUBLISH request to node 2, which then does the following things that we think it is not supposed to do:

  1. It also tries to add an entry into the presentity table, but fails and prints out the following error:
Nov 29 03:51:22 [18533] CRITICAL:db_mysql:wrapper_single_mysql_real_query: driver error (1062): Duplicate entry '7104*2001-sandbox2-sip.nxf-test.xxx.com-dialog-a.1669615985' for key 'presentity_idx'
Nov 29 03:51:22 [18533] ERROR:core:db_do_insert: error while submitting query
Nov 29 03:51:22 [18533] ERROR:presence:update_presentity: inserting new record in database
Nov 29 03:51:22 [18533] ERROR:presence:handle_replicated_publish: failed to update presentity based on replicated Publish
Nov 29 03:51:22 [18533] ERROR:presence:handle_replicated_publish: failed to handle bin packet 1 from node 1
Nov 29 03:51:22 [18533] ERROR:presence:bin_packet_handler: failed to process binary packet!

  2. It looks up the subscriptions in the active_watchers table (as fallback2db is 1), finds them, and also tries to send NOTIFY to the phones. But since the phones are registered to the VIP on node 1, there is no NAT session between the phones and node 2, and the sends fail:
Nov 29 03:51:23 [18534] INFO:presence:publ_notify: notify
Nov 29 03:51:23 [18534] ERROR:core:proto_udp_send: sendto(sock,0x7fd122757c18,1044,0,0x7fd1227bf8e0,16): Network is unreachable(101) [114.73.73.222:55088]
Nov 29 03:51:23 [18534] ERROR:tm:msg_send: send() to 114.73.73.222:55088 for proto udp/1 failed
Nov 29 03:51:23 [18534] ERROR:tm:t_uac: attempt to send to 'sip:HP-0002F2850F73-2001@104.73.73.222:55088' failed
Nov 29 03:51:23 [18534] INFO:presence:send_notify_request: NOTIFY sip:HP-0002F2850F73-2001@sandbox2-sip.nxf-test.xxx.com via sip:HP-0002F2850F73-2001@104.73.73.222:55088 on behalf of sip:7104*2001@sandbox2-sip.nxf-test.xxx.com for event dialog, to_tag=c36a-86009661c8d9640bf2cbb712c441ca50, cseq=4
Nov 29 03:51:23 [18565] ERROR:core:proto_udp_send: sendto(sock,0x7fd122757c18,1044,0,0x7fd1227bf8e0,16): Network is unreachable(101) [114.73.73.222:55088]
Nov 29 03:51:23 [18565] ERROR:tm:msg_send: send() to 104.73.73.222:55088 for proto udp/1 failed
Nov 29 03:51:24 [18512] ERROR:core:proto_udp_send: sendto(sock,0x7fd122757c18,1044,0,0x7fd1227bf8e0,16): Network is unreachable(101) [114.73.73.222:55088]
Nov 29 03:51:24 [18512] ERROR:tm:msg_send: send() to 104.73.73.222:55088 for proto udp/1 failed

What's worse, after a few failed send attempts the backup node 2 apparently decides that the subscriber is unreachable and DELETEs the subscription from the active_watchers table!
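
As an illustrative check (assuming the default presence DB schema), the disappearing subscriptions can be confirmed straight from the database before and after the failed NOTIFY attempts:

mysql> SELECT presentity_uri, watcher_username, callid, expires
    -> FROM active_watchers WHERE event = 'dialog';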

Expected behavior

To our understanding, the backup node is not supposed to exhibit any of the above behaviour. Since it is just a BACKUP node:

It's NOT supposed to handle the PUBLISH request.
It's NOT supposed to insert into presentity table.
It's NOT supposed to query the active_watchers table.
It's NOT supposed to send NOTIFY to subscribers.
It's NOT supposed to delete any entry in the active_watchers table.

Why is it doing any of the above at all? Are we configuring the cluster incorrectly, or missing some key setting?

gitforxh changed the title from "[BUG]" to "[BUG] HA backup node deletes subscriptions in Federation cluster" on Nov 29, 2022
@github-actions

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@bogdan-iancu (Member)

@gitforxh , I'm a bit confused here - you mentioned 2 servers in HA mode (so one active, the other standby), but I see your clustering configuration is "on-demand-sharing" - and this setting is more for implementing horizontal scalability with multiple active locations.
Please see https://blog.opensips.org/2018/03/27/clustering-presence-services-with-opensips-2-4/: what you implemented is the "Federating" scenario, while I think you are actually looking for the "Load Balancing" scenario.

@gitforxh (Author)

@bogdan-iancu We're trying to implement the "Federating scenario with redundancy" described in https://blog.opensips.org/2018/03/27/clustering-presence-services-with-opensips-2-4/, and we're starting with just one pair to see how it goes.

And we're following the recommended settings in the doc for the cluster_federation_mode parameter:

modparam("presence", "cluster_federation_mode", 1)

Doesn't the value '1' correspond to "on-demand-sharing" listed here?
https://opensips.org/html/docs/modules/3.2.x/presence.html#param_cluster_federation_mode
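
In other words, our assumption is that these two lines select the same federation mode (the 2.4 blog uses the numeric form, our 3.2 config the string form):

modparam("presence", "cluster_federation_mode", 1)
modparam("presence", "cluster_federation_mode", "on-demand-sharing")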

@bogdan-iancu (Member)

@gitforxh , thanks for the input. The original blog post is for 2.4, while in 3.2 things changed a bit (to be more straightforward to use). And yes, cluster_federation_mode set to 1 matches on-demand-sharing in 3.2.

And reviewing the scenario, I agree that the backup node (inside a location) should actually do nothing upon receiving a replicated PUBLISH via clustering - somehow the presence module should know which sharing tag controls the active-backup mode and, if the tag is inactive, ignore the data received via clustering.

Let me think about this some more, to see what's the best way to get this in place without bloating it too much.


github-actions bot commented Feb 3, 2023

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Feb 3, 2023

github-actions bot commented Mar 6, 2023

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.

@github-actions github-actions bot closed this as completed Mar 6, 2023

gitforxh commented Mar 6, 2023

@bogdan-iancu Any update on this? Can you leave this ticket open?

@bogdan-iancu bogdan-iancu reopened this Mar 15, 2023
@github-actions github-actions bot removed the stale label Mar 16, 2023

github-actions bot commented Apr 1, 2023

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Apr 1, 2023

github-actions bot commented May 1, 2023

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.

@github-actions github-actions bot closed this as completed May 1, 2023
@bogdan-iancu bogdan-iancu reopened this May 2, 2023
@stale stale bot removed the stale label May 2, 2023
@bogdan-iancu (Member)

Well, almost 1 year later, but better late than never :P.
The BIG problem here is that the sharing tags (to be used when DB fallback is active) are attached only to subscriptions, not to presentities. So, when receiving a PUBLISH, there is currently no mechanism to control whether it should be handled or not.
Now, putting the fallback mechanism together with the federation one is not straightforward - why? Because the fallback mechanism assumes a PUBLISH hits only ONE presence server, while the federation broadcasts the PUBLISH, sending the same PUBLISH to all the other nodes... In short, they are not quite compatible.
There is not much we can do at the DB fallback level, so I would say we need a new CLUSTERING flavour: besides on-demand-sharing and full-sharing, a new mode where a node may "discard" all PUBLISH broadcasts based on the status of a sharing tag. Or, even better, have the sharing tag dictate whether a federation node should behave as "idle" - ignoring any clustering activity. And the idea would be to reuse the same sharing tag you already use for controlling the HA of the DB fallback mode.
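
Conceptually, something along these lines - the presence parameter name below is purely hypothetical, just to illustrate binding the federation behaviour to the already-existing HA sharing tag:

# existing HA tag, already configured on the backup node
modparam("clusterer", "sharing_tag", "69.108.214.69/1=backup")
# hypothetical new presence setting (name is illustrative only): while the tag above
# is in "backup" state, the node ignores replicated PUBLISHes and sends no NOTIFYs
modparam("presence", "cluster_sharing_tag", "69.108.214.69")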

@bogdan-iancu bogdan-iancu modified the milestones: 3.2.18, 3.5-dev Apr 11, 2024
bogdan-iancu added a commit that referenced this issue Apr 22, 2024
Added a sharing tag to control which node (from the HA combination) is active in the federated cluster. See all the details here #2960
(this is fully backward compatible)

Closes #2960

(cherry picked from commit 8b96b70)