
[BUG] HA backup node deletes subscriptions in Federation cluster #2960

Closed

gitforxh opened this issue Nov 29, 2022 · 10 comments

gitforxh commented Nov 29, 2022

OpenSIPS version you are running

[root@sip-657d756759-6rmtp /]# opensips -V
version: opensips 3.2.9 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
main.c compiled on 19:32:18 Oct 19 2022 with gcc 4.8.5

Describe the bug

We have two OpenSIPS instances configured as an active-backup HA pair in federation cluster mode.

The active node has the following settings:

modparam("clusterer","db_url","mysql://xxxx")
modparam("clusterer", "my_node_id", 1)
modparam("clusterer", "sharing_tag" ,"69.108.214.69/1=active")

modparam("presence","db_url","mysql://xxxx")
modparam("presence", "db_update_period", 60)
modparam("presence", "fallback2db", 1)
modparam("presence", "cluster_id", 1) 
modparam("presence", "cluster_federation_mode", "on-demand-sharing")

The backup node has the following settings:

modparam("clusterer","db_url","mysql://xxxx")
modparam("clusterer", "my_node_id", 2) 
modparam("clusterer", "sharing_tag" ,"69.108.214.69/1=backup")

modparam("presence","db_url","mysql://xxxx")
modparam("presence", "db_update_period", 60)
modparam("presence", "fallback2db", 1)
modparam("presence", "cluster_id", 1)
modparam("presence", "cluster_federation_mode", "on-demand-sharing") 

These are the entries in the clusterer table:

mysql> select * from clusterer;
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+
| id | cluster_id | node_id | url                     | state | no_ping_retries | priority | sip_addr      | flags | description |
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+
|  2 |          1 |       1 | bin:69.108.214.99:5566  |     1 |               3 |       50 | 69.108.214.69 | seed  | NULL        |
|  4 |          1 |       2 | bin:69.108.214.100:5566 |     1 |               3 |       50 | 69.108.214.69 | NULL  | NULL        |
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+

The VIP 69.108.214.69 is configured on the active node (node_id 1).

We have phones sending REGISTER and SUBSCRIBE (for BLF) requests to the VIP on node 1, and the subscriptions are processed by node 1 and stored in the active_watchers table.

When a call is in progress, a Presence server sends a PUBLISH request to the VIP on node 1, which then does the following (see the routing sketch after the list):

  1. Add an entry into the presentity table.
  2. Lookup the subscriptions and send NOTIFY to phones to turn on the BLF light.
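
For context, the relevant part of our routing is essentially the standard presence handling; this is a simplified sketch only (not our exact script), using the presence module's handle_publish() and handle_subscribe() plus is_method() from sipmsgops:

if (is_method("PUBLISH")) {
    # store/update the presentity and trigger NOTIFYs to the matching watchers
    handle_publish();
    exit;
} else if (is_method("SUBSCRIBE")) {
    # store the subscription in active_watchers and answer with 200 OK + initial NOTIFY
    handle_subscribe();
    exit;
}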

At the same time, node 1 also broadcasts the PUBLISH request to node 2, which then does the following things that we think it is not supposed to do:

  1. It also tries to add an entry into the presentity table, but fails and prints out the following error:
Nov 29 03:51:22 [18533] CRITICAL:db_mysql:wrapper_single_mysql_real_query: driver error (1062): Duplicate entry '7104*2001-sandbox2-sip.nxf-test.xxx.com-dialog-a.1669615985' for key 'presentity_idx'
Nov 29 03:51:22 [18533] ERROR:core:db_do_insert: error while submitting query
Nov 29 03:51:22 [18533] ERROR:presence:update_presentity: inserting new record in database
Nov 29 03:51:22 [18533] ERROR:presence:handle_replicated_publish: failed to update presentity based on replicated Publish
Nov 29 03:51:22 [18533] ERROR:presence:handle_replicated_publish: failed to handle bin packet 1 from node 1
Nov 29 03:51:22 [18533] ERROR:presence:bin_packet_handler: failed to process binary packet!

  2. It looks up the subscriptions in the active_watchers table (as fallback2db is 1), finds them, and also tries to send NOTIFY to the phones. But since the phones are registered to the VIP on node 1, there is no NAT session between the phones and node 2, and the sends fail:
Nov 29 03:51:23 [18534] INFO:presence:publ_notify: notify
Nov 29 03:51:23 [18534] ERROR:core:proto_udp_send: sendto(sock,0x7fd122757c18,1044,0,0x7fd1227bf8e0,16): Network is unreachable(101) [114.73.73.222:55088]
Nov 29 03:51:23 [18534] ERROR:tm:msg_send: send() to 114.73.73.222:55088 for proto udp/1 failed
Nov 29 03:51:23 [18534] ERROR:tm:t_uac: attempt to send to 'sip:HP-0002F2850F73-2001@104.73.73.222:55088' failed
Nov 29 03:51:23 [18534] INFO:presence:send_notify_request: NOTIFY sip:HP-0002F2850F73-2001@sandbox2-sip.nxf-test.xxx.com via sip:HP-0002F2850F73-2001@104.73.73.222:55088 on behalf of sip:7104*2001@sandbox2-sip.nxf-test.xxx.com for event dialog, to_tag=c36a-86009661c8d9640bf2cbb712c441ca50, cseq=4
Nov 29 03:51:23 [18565] ERROR:core:proto_udp_send: sendto(sock,0x7fd122757c18,1044,0,0x7fd1227bf8e0,16): Network is unreachable(101) [114.73.73.222:55088]
Nov 29 03:51:23 [18565] ERROR:tm:msg_send: send() to 104.73.73.222:55088 for proto udp/1 failed
Nov 29 03:51:24 [18512] ERROR:core:proto_udp_send: sendto(sock,0x7fd122757c18,1044,0,0x7fd1227bf8e0,16): Network is unreachable(101) [114.73.73.222:55088]
Nov 29 03:51:24 [18512] ERROR:tm:msg_send: send() to 104.73.73.222:55088 for proto udp/1 failed

What's worse, after a few failed send attempts the backup node 2 apparently decides that the subscriber is unreachable and DELETEs the subscription from the active_watchers table!
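
As an illustrative check (assuming the default presence DB schema), the disappearing subscriptions can be confirmed straight from the database before and after the failed NOTIFY attempts:

mysql> SELECT presentity_uri, watcher_username, callid, expires
    -> FROM active_watchers WHERE event = 'dialog';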

Expected behavior

To our understanding, the backup node is not supposed to exhibit any of the above behaviour. Since it is just a BACKUP node:

It's NOT supposed to handle the PUBLISH request.
It's NOT supposed to insert into presentity table.
It's NOT supposed to query the active_watchers table.
It's NOT supposed to send NOTIFY to subscribers.
It's NOT supposed to delete any entry in the active_watchers table.

Why is it doing any of the above at all? Are we configuring the cluster incorrectly, or missing some key setting?

gitforxh changed the title from "[BUG]" to "[BUG] HA backup node deletes subscriptions in Federation cluster" on Nov 29, 2022
@github-actions

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@bogdan-iancu (Member)

@gitforxh , I'm a bit confused here - you mentioned 2 servers in HA mode (so one active, the other standby), but I see your clustering configuration is "on-demand-sharing" - and this setting is more for implementing horizontal scalability with multiple active locations.
Please see https://blog.opensips.org/2018/03/27/clustering-presence-services-with-opensips-2-4/: what you implemented is the "Federating" scenario, while I think you are actually looking for the "Load Balancing" scenario.

@gitforxh (Author)

@bogdan-iancu We're trying to implement the "Federating scenario with redundancy" described in https://blog.opensips.org/2018/03/27/clustering-presence-services-with-opensips-2-4/, and we're starting with just one pair to see how it goes.

And we're following the recommended settings in the doc for the cluster_federation_mode parameter:

modparam("presence", "cluster_federation_mode", 1)

Doesn't the value '1' correspond to "on-demand-sharing" listed here?
https://opensips.org/html/docs/modules/3.2.x/presence.html#param_cluster_federation_mode
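
In other words, our assumption is that these two lines select the same federation mode (the 2.4 blog uses the numeric form, our 3.2 config the string form):

modparam("presence", "cluster_federation_mode", 1)
modparam("presence", "cluster_federation_mode", "on-demand-sharing")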

@bogdan-iancu (Member)

@gitforxh , thanks for the input. The original blog post is for 2.4, while in 3.2 things changed a bit (to be more straightforward to use). And yes, cluster_federation_mode set to 1 matches on-demand-sharing in 3.2.

And reviewing the scenario, I agree that the backup node (inside a location) should actually do nothing upon receiving a replicated PUBLISH via clustering - somehow the presence module should know which sharing tag controls the active-backup mode and, if the tag is inactive, ignore the data received via clustering.

Let me think about this some more, to see what's the best way to get this in place without bloating it too much.


github-actions bot commented Feb 3, 2023

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Feb 3, 2023

github-actions bot commented Mar 6, 2023

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.

@github-actions github-actions bot closed this as completed Mar 6, 2023

gitforxh commented Mar 6, 2023

@bogdan-iancu Any update on this? Can you leave this ticket open?

@bogdan-iancu bogdan-iancu reopened this Mar 15, 2023
@github-actions github-actions bot removed the stale label Mar 16, 2023

github-actions bot commented Apr 1, 2023

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Apr 1, 2023

github-actions bot commented May 1, 2023

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.

@github-actions github-actions bot closed this as completed May 1, 2023
@bogdan-iancu bogdan-iancu reopened this May 2, 2023
@stale stale bot removed the stale label May 2, 2023
@bogdan-iancu (Member)

Well, almost 1 year later, but better late than never :P.
The BIG problem here is that the sharing tags (to be used when DB fallback is active) are attached only to subscriptions, not to presentities. So, when receiving a PUBLISH, there is currently no mechanism to control whether it should be handled or not.
Now, putting the fallback mechanism together with the federation one is not straightforward - why? Because the fallback mechanism assumes a PUBLISH hits only ONE presence server, while the federation broadcasts the PUBLISH, sending the same PUBLISH to all the other nodes... In short, they are not quite compatible.
There is not much we can do at the DB fallback level, so I would say we need a new CLUSTERING flavour: besides on-demand-sharing and full-sharing, a new mode where a node may "discard" all PUBLISH broadcasts based on the status of a sharing tag. Or, even better, have the sharing tag dictate whether a federation node should behave as "idle" - ignoring any clustering activity. And the idea would be to reuse the same sharing tag you already use for controlling the HA of the DB fallback mode.
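
Conceptually, something along these lines - the presence parameter name below is purely hypothetical, just to illustrate binding the federation behaviour to the already-existing HA sharing tag:

# existing HA tag, already configured on the backup node
modparam("clusterer", "sharing_tag", "69.108.214.69/1=backup")
# hypothetical new presence setting (name is illustrative only): while the tag above
# is in "backup" state, the node ignores replicated PUBLISHes and sends no NOTIFYs
modparam("presence", "cluster_sharing_tag", "69.108.214.69")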

@bogdan-iancu bogdan-iancu modified the milestones: 3.2.18, 3.5-dev Apr 11, 2024
bogdan-iancu added a commit that referenced this issue Apr 22, 2024
Added a sharing tag to control which node (from the HA combination) is active in the federated cluster. See all the details here #2960
(this is fully backward compatible)

Closes #2960

(cherry picked from commit 8b96b70)