
fence_scsi use clusterwide nodeID instead of local nodelist ID of node #289

Merged
merged 4 commits into ClusterLabs:master on Nov 19, 2019

Conversation

OndrejHome
Contributor

Short story: the 'corosync-cmapctl nodelist' output and its local IDs can differ between nodes of the same cluster, while the "real" nodeID is always the same across the cluster. Therefore use the real "nodeID" for consistent behaviour.

Long story:
I have noticed that sometimes my test systems end up using the same key for fence_scsi without any particular reason. I wondered what the cause could be, so I tried re-ordering the nodes in /etc/corosync/corosync.conf while preserving all the other values. For example, below is the nodelist from both nodes of a Fedora 30 cluster (the same happens on CentOS 7.6, but for brevity I omit that config).

fastvm-fedora30-93 /etc/corosync/corosync.conf

nodelist {
    node {
        ring0_addr: fastvm-fedora30-93
        name: fastvm-fedora30-93
        nodeid: 1
    }

    node {
        ring0_addr: fastvm-fedora30-94
        name: fastvm-fedora30-94
        nodeid: 2
    }
}

fastvm-fedora30-94 /etc/corosync/corosync.conf

nodelist {
    node {
        ring0_addr: fastvm-fedora30-94
        name: fastvm-fedora30-94
        nodeid: 2
    }

    node {
        ring0_addr: fastvm-fedora30-93
        name: fastvm-fedora30-93
        nodeid: 1
    }
}

The configs above are "logically" the same, but just changing the ordering causes corosync-cmapctl nodelist to assign different list IDs to the nodes on different cluster members, as seen below.

CentOS 7.6 - corosync-2.4.3-4.el7.x86_64

[root@fastvm-centos-7-6-91 ~]#  corosync-cmapctl nodelist
nodelist.local_node_pos (u32) = 0
nodelist.node.0.nodeid (u32) = 1
nodelist.node.0.ring0_addr (str) = fastvm-centos-7-6-91
nodelist.node.1.nodeid (u32) = 2
nodelist.node.1.ring0_addr (str) = fastvm-centos-7-6-92
[root@fastvm-centos-7-6-92 ~]# corosync-cmapctl nodelist
nodelist.local_node_pos (u32) = 0
nodelist.node.0.nodeid (u32) = 2
nodelist.node.0.ring0_addr (str) = fastvm-centos-7-6-92
nodelist.node.1.nodeid (u32) = 1
nodelist.node.1.ring0_addr (str) = fastvm-centos-7-6-91

Fedora 30 - corosync-3.0.2-1.fc30.x86_64

[root@fastvm-fedora30-93 ~]# corosync-cmapctl nodelist
nodelist.local_node_pos (u32) = 0
nodelist.node.0.name (str) = fastvm-fedora30-93
nodelist.node.0.nodeid (u32) = 1
nodelist.node.0.ring0_addr (str) = fastvm-fedora30-93
nodelist.node.1.name (str) = fastvm-fedora30-94
nodelist.node.1.nodeid (u32) = 2
nodelist.node.1.ring0_addr (str) = fastvm-fedora30-94
[root@fastvm-fedora30-94 ~]#  corosync-cmapctl nodelist
nodelist.local_node_pos (u32) = 0
nodelist.node.0.name (str) = fastvm-fedora30-94
nodelist.node.0.nodeid (u32) = 2
nodelist.node.0.ring0_addr (str) = fastvm-fedora30-94
nodelist.node.1.name (str) = fastvm-fedora30-93
nodelist.node.1.nodeid (u32) = 1
nodelist.node.1.ring0_addr (str) = fastvm-fedora30-93

In the end I think that in the get_node_id function the fence_scsi agent should really use the nodeID and not an arbitrary list position that may differ from node to node. I have no evidence at this time of the nodelist changing during normal operation such as adding/removing nodes, or of what else can affect the nodelist output. However, the proof of concept above should be sufficient to show that the more stable nodeID should be used here.

The proposed code change takes the node and matches it to the 'nodeid' from the corosync-cmapctl nodelist output, which fixes the problem when the situation above is introduced. For backward compatibility the detected nodeID is decreased by 1 to match the old behaviour, so upgrades should be seamless (cluster node IDs start from 1, while the corosync-cmapctl nodelist listing starts from 0).
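To illustrate the idea, here is a minimal, self-contained sketch; get_node_id_by_nodeid() is a hypothetical helper name, and the real agent goes through its own run_cmd() plumbing and option handling:

import re
import subprocess

def get_node_id_by_nodeid(node_name):
    # Pair each ring0_addr in `corosync-cmapctl nodelist` with its real corosync
    # nodeid instead of its position in the listing.
    out = subprocess.run(["corosync-cmapctl", "nodelist"],
                         capture_output=True, text=True).stdout
    addrs = dict(re.findall(r"nodelist\.node\.(\d+)\.ring0_addr \(str\) = (\S+)", out))
    ids = dict(re.findall(r"nodelist\.node\.(\d+)\.nodeid \(u32\) = (\d+)", out))
    for pos, addr in addrs.items():
        if addr == node_name:
            nodeid = ids.get(pos)  # 'nodeid' may be missing, see discussion below
            if nodeid is None:
                return None
            # Subtract 1 so clusters whose nodeids start at 1 keep producing the
            # same keys as the old 0-based, position-based behaviour.
            return int(nodeid) - 1
    return None

With the two corosync.conf orderings shown above, both nodes would resolve fastvm-fedora30-93 to 0 and fastvm-fedora30-94 to 1, regardless of the listing order.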

Tested on CentOS 7.6 and Fedora 30 two-node clusters.

@HideoYamauchi
Contributor

HideoYamauchi commented Sep 6, 2019

Hi OndrejHome,

There is one point to worry about: when a cluster is configured by directly editing corosync.conf.

In older corosync versions (e.g. corosync 2.4.5), the nodeid in corosync.conf is not mandatory and will be generated automatically if not specified.

nodelist {
        node {
                ring0_addr: 192.168.10.100
                ring1_addr: 192.168.20.100
                name: rh76hv-01
        }
        node {
                ring0_addr: 192.168.10.110
                ring1_addr: 192.168.20.110
                name: rh76hv-02
        }
}

[root@rh76hv-01 corosync-2.4.5]# corosync-cfgtool -s
Printing ring status.
Local node ID 3232238180
RING ID 0
        id      = 192.168.10.100
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.20.100
        status  = ring 1 active with no faults

[root@rh76hv-02 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 3232238190
RING ID 0
        id      = 192.168.10.110
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.20.110
        status  = ring 1 active with no faults

In this case, this change makes fence_scsi unusable.

If you make this change, you should add a message encouraging users to set the nodeid in corosync.conf.

(In the case of corosync 3.x, the nodeid in the corosync.conf nodelist is mandatory, so I think there is no problem there.)

Best Regards,
Hideo Yamauchi.

@OndrejHome
Contributor Author

Hi @HideoYamauchi,

Thank you for the feedback. I can see now that when I omit the 'nodeid' in a CentOS 7.6 cluster (corosync 2.4.x), the corosync-cmapctl nodelist output doesn't contain the nodeid at all.

# corosync-cmapctl nodelist
nodelist.local_node_pos (u32) = 0
nodelist.node.0.ring0_addr (str) = fastvm-centos-7-6-41
nodelist.node.1.ring0_addr (str) = fastvm-centos-7-6-42
# corosync-cfgtool -s
Printing ring status.
Local node ID 3232241193
RING ID 0
	id	= 192.168.22.41
	status	= ring 0 active with no faults

In that case this change will indeed fail. So a good question is what to do in such a case; some ideas that come to my mind:

  • if 'nodeid' is not present in the output, try to get it some other way
  • if 'nodeid' is not present, fall back to the old style used until now and warn about it in the logs (see the sketch below)

My main concern is what happens if the nodeid is present on "some" nodes but not on others. I would consider this quite an edge case, but it is still worth figuring out.
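A rough sketch of the fallback idea from the second bullet, building on the hypothetical get_node_id_by_nodeid() helper sketched earlier; get_node_position() stands in for the current position-based lookup and is only illustrative:

import logging

def get_node_id_with_fallback(node_name):
    # Prefer the real corosync nodeid; if the nodelist carries no 'nodeid' keys
    # (e.g. corosync 2.x with auto-generated IDs), fall back to the old
    # position-based behaviour and leave a warning in the logs.
    node_id = get_node_id_by_nodeid(node_name)
    if node_id is not None:
        return node_id
    logging.warning("nodeid for %s not found in corosync-cmapctl nodelist, "
                    "falling back to list position", node_name)
    return get_node_position(node_name)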

I will try to have a look at this during the next week and update here with my findings.

Ondrej (온드레이)

@oalbrigt
Collaborator

oalbrigt commented Sep 6, 2019

Wouldn't sorting the output from corosync-cmapctl nodelist before looking for the node ID solve the issue (hopefully even without this patch)?

@OndrejHome
Contributor Author

Hi @oalbrigt,
which value do you suggest sorting by? :)
The output is already sorted by the "local ID" under which corosync-cmapctl nodelist found the nodes defined in corosync.conf. The main problem is that there is no guarantee that this order is the same on all nodes in the cluster.

  • by node name - it would make the output consistent, but if a node is added/removed then more IDs would change
  • by nodeid - it may fail if nodeid is not present -> this is also a problem of the current patch compared to the original
  • ... something else that would result in the same order on all nodes

If I were to forget about backward compatibility, then most probably hashing the 'node name' in a similar way to how the 'cluster name' is already hashed to form the first part of the SCSI key would make sense to me.

It would work in both corosync 2.x and 3.x, as the 'node name' is something we ask the user to provide, so it matches later on. It might not even need a lookup in corosync-cmapctl, unless we want to make sure that such a node name exists. But this approach would make the new SCSI key very different from the current ones, so keys would need to be migrated from old to new. The current patch aims to be backward compatible, so the migration should not be needed in most cases (with the exception of clusters that have holes in the nodelist, like the example below).

# corosync-cmapctl nodelist
nodelist.local_node_pos (u32) = 1
nodelist.node.0.nodeid (u32) = 2
nodelist.node.0.ring0_addr (str) = fastvm-centos-7-6-41
nodelist.node.1.nodeid (u32) = 4
nodelist.node.1.ring0_addr (str) = fastvm-centos-7-6-42

@oalbrigt
Collaborator

oalbrigt commented Sep 6, 2019

What if you simply sort the entire output? E.g. sorted() around run_cmd()["out"] here:

out = run_cmd(options, cmd)["out"]

@OndrejHome
Contributor Author

Hmm, I might be missing something, but how does 'sorted()' solve the issue?
Below is a shell sort, which I would expect to behave close in principle to Python's sorted(). The point is that with or without the sort the result is the same. We could try to strip the parts with the IDs and then redo the numbering - is that what you suggest? Something like:

echo $output|grep ring0_addr|cut -d= -f2|sort
 fastvm-centos-7-6-91
 fastvm-centos-7-6-92

@fastvm-centos-7-6-91 $ corosync-cmapctl nodelist|sort
nodelist.local_node_pos (u32) = 0
nodelist.node.0.nodeid (u32) = 1
nodelist.node.0.ring0_addr (str) = fastvm-centos-7-6-91
              ^-- ID of fastvm-centos-7-6-91 is '0' according to fence_scsi on node fastvm-centos-7-6-91
nodelist.node.1.nodeid (u32) = 2
nodelist.node.1.ring0_addr (str) = fastvm-centos-7-6-92
@fastvm-centos-7-6-92 $ corosync-cmapctl nodelist|sort
nodelist.local_node_pos (u32) = 0
nodelist.node.0.nodeid (u32) = 2
nodelist.node.0.ring0_addr (str) = fastvm-centos-7-6-92
nodelist.node.1.nodeid (u32) = 1
nodelist.node.1.ring0_addr (str) = fastvm-centos-7-6-91
              ^-- ID of fastvm-centos-7-6-91 is '1' according to fence_scsi on node fastvm-centos-7-6-92

@oalbrigt
Collaborator

oalbrigt commented Sep 6, 2019

Ah. This is kind of tricky.

I guess the best way to do this might be to return only the ring0_addr values into an array (and sort it), and use the node's position in the array as the node ID.
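A rough sketch of that approach, with a hypothetical helper name; the regular-expression parsing is simplified compared to the agent's real run_cmd() handling:

import re
import subprocess

def get_node_id_by_sorted_addr(node_name):
    # Collect only the ring0_addr values, sort them, and use the node's position
    # in the sorted list as its ID. The result is consistent across nodes, but
    # the positions still shift whenever a node is added or removed.
    out = subprocess.run(["corosync-cmapctl", "nodelist"],
                         capture_output=True, text=True).stdout
    addrs = sorted(re.findall(r"ring0_addr \(str\) = (\S+)", out))
    return addrs.index(node_name)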

@oalbrigt
Collaborator

oalbrigt commented Sep 6, 2019

I'll see if I can come up with a good way to do that.

@oalbrigt
Collaborator

oalbrigt commented Sep 6, 2019

I realized that when doing it with arrays, all the nodes will have to redo unfencing for it to work correctly.

Ondrej: is this an issue you've seen between nodes running the same version of a distro, and does it ever happen without manually editing corosync.conf?

@OndrejHome
Contributor Author

@oalbrigt, I ran into this while preparing a training in which I wanted to explain how the SCSI key made by fence_scsi is generated. So there was no real impact from it. Editing corosync.conf was a simple POC showing that using the list IDs might be flawed. For a more "real-life" scenario, see below:

3 node cluster

# corosync-cmapctl nodelist
nodelist.local_node_pos (u32) = 2
nodelist.node.0.nodeid (u32) = 1
nodelist.node.0.ring0_addr (str) = fastvm-centos-7-6-111
nodelist.node.1.nodeid (u32) = 2
nodelist.node.1.ring0_addr (str) = fastvm-centos-7-6-112
nodelist.node.2.nodeid (u32) = 3
nodelist.node.2.ring0_addr (str) = fastvm-centos-7-6-113
# pcs stonith show scsi
 Resource: scsi (class=stonith type=fence_scsi)
  Attributes: devices=/dev/sdb pcmk_host_list="fastvm-centos-7-6-111 fastvm-centos-7-6-112 fastvm-centos-7-6-113"
  Meta Attrs: provides=unfencing 
  Operations: monitor interval=60s (scsi-monitor-interval-60s)
# sg_persist -n -i -k -d /dev/sdb
  PR generation=0x5, 3 registered reservation keys follow:
    0x36830000
    0x36830001
    0x36830002

Remove any node except the last one (I removed the second node).

# pcs cluster node remove fastvm-centos-7-6-112
fastvm-centos-7-6-112: Stopping Cluster (pacemaker)...
fastvm-centos-7-6-112: Successfully destroyed cluster
fastvm-centos-7-6-111: Corosync updated
fastvm-centos-7-6-113: Corosync updated
# sg_persist -n -i -k -d /dev/sdb
  PR generation=0x7, 3 registered reservation keys follow:
    0x36830000
    0x36830001
    0x36830002
# corosync-cmapctl nodelist
nodelist.local_node_pos (u32) = 1
nodelist.node.0.nodeid (u32) = 1
nodelist.node.0.ring0_addr (str) = fastvm-centos-7-6-111
nodelist.node.1.nodeid (u32) = 3
nodelist.node.1.ring0_addr (str) = fastvm-centos-7-6-113

One of the issues that I am leaving aside here is that the SCSI key of the removed node2 is not removed from the device. But this is not important for this PR.

Fence the last node

# pcs stonith fence fastvm-centos-7-6-113
Node: fastvm-centos-7-6-113 fenced
# sg_persist -n -i -k -d /dev/sdb
  PR generation=0x8, 2 registered reservation keys follow:
    0x36830000
    0x36830002

Expected outcome: node 3's key 0x36830002 is removed from the device.
Actual outcome: old node 2's key 0x36830001 is removed from the device and node 3 can still write to the shared device.

[root@fastvm-centos-7-6-113 ~]# echo test3 > /dev/sdb
[root@fastvm-centos-7-6-113 ~]# journalctl |tail
Sep 09 02:21:20 fastvm-centos-7-6-113 corosync[16712]:  [SERV  ] Service engine unloaded: corosync profile loading service
Sep 09 02:21:20 fastvm-centos-7-6-113 corosync[16712]:  [MAIN  ] Corosync Cluster Engine exiting normally
[root@fastvm-centos-7-6-113 ~]# hexdump -C /dev/sdb
00000000  74 65 73 74 33 0a 00 00  00 00 00 00 00 00 00 00  |test3...........|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
=======
[root@fastvm-centos-7-6-112 ~]# echo test2 > /dev/sdb
[root@fastvm-centos-7-6-112 ~]# journalctl |tail
Sep 09 02:22:16 fastvm-centos-7-6-112 kernel: sd 3:0:0:0: Parameters changed
Sep 09 02:22:16 fastvm-centos-7-6-112 kernel: sd 3:0:0:0: reservation conflict
Sep 09 02:22:16 fastvm-centos-7-6-112 kernel: sd 3:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Sep 09 02:22:16 fastvm-centos-7-6-112 kernel: sd 3:0:0:0: [sdb] CDB: Write(10) 2a 00 00 00 00 00 00 00 08 00
Sep 09 02:22:16 fastvm-centos-7-6-112 kernel: blk_update_request: critical nexus error, dev sdb, sector 0
Sep 09 02:22:16 fastvm-centos-7-6-112 kernel: Buffer I/O error on dev sdb, logical block 0, lost async page write
Sep 09 02:22:16 fastvm-centos-7-6-112 kernel: VFS: Dirty inode writeback failed for block device sdb (err=-5).
[root@fastvm-centos-7-6-112 ~]# hexdump -C /dev/sdb
00000000  74 65 73 74 33 0a 00 00  00 00 00 00 00 00 00 00  |test3...........|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

This is a slightly different example from the one I opened this PR for, but the root cause is the same: using list positions that can change over time instead of stable node IDs or something else that is stable. In this example the fencing succeeds without providing the desired effect - preventing write access to the shared drive. The example doesn't manually change anything and uses only pcs to remove a node from the cluster and to trigger the fencing. This demonstrates that even though the corosync-cmapctl nodelist IDs changed, fence_scsi will just assume that it can keep using the changed IDs. I think the same issue would appear if the node names were sorted into some kind of list and the same change happened - the ordering would shift. So the most logical ordering that stays stable across the nodes of a cluster is, to me, the 'nodeid' attribute. Or rather the 'node name'/'ring0_addr' - something that all nodes in the cluster can resolve to the same string even after a node is removed from the cluster, or even when the order of nodes in corosync.conf is merely shuffled.

Please let me know what you think about this. I will be a bit busy this week because of upcoming holidays here, so I may only respond in the second half of the week.

Ondrej

@oalbrigt
Collaborator

oalbrigt commented Sep 9, 2019

We'll have to find a way that works for all scenarios, as there's no reason to "solve" it by just introducing a new edge-case.

It seems like we can't use hostnames, as the reservation key is limited to 8 bytes (from the sg_persist man page):

       -K, --param-rk=RK
              specify the reservation key found in the parameter block of the
              PROUT command. RK is assumed to be hex (up to 8 bytes long).
              Default value is 0. This option is needed by most PROUT
              sub-commands.

@OndrejHome
Contributor Author

Also, half of that key is already used by a small chunk of the md5-ed cluster name.

def generate_key(options):
    return "%.4s%.4d" % (get_cluster_id(options), int(get_node_id(options)))

So we have only 4 bytes left if we keep the format used so far. I wonder why the cluster part is used at all; the only thing that comes to my mind is the case where the storage is shared by multiple clusters and we could somehow see keys where I would not expect them. Otherwise it is rather a nice-to-have for telling which cluster the keys came from, if they were generated by fence_scsi.
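As a worked example, the keys registered on /dev/sdb earlier in this thread (0x36830000, 0x36830001, 0x36830002) decompose exactly along these lines; the cluster-name hash of that test cluster apparently starts with "3683":

# "%.4s" keeps the first 4 characters of the cluster hash, "%.4d" zero-pads the
# positional node ID to 4 digits.
for node_id in range(3):
    print("%.4s%.4d" % ("3683", node_id))  # 36830000, 36830001, 36830002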

A few rather theoretical questions come to my mind:

  1. Can we fit all possible cluster nodeids within 8 bytes?
  2. Is there a need to keep the "4 bytes cluster name + 4 bytes node ID" format as fence_scsi uses now?

If node IDs cannot fit within 8 bytes, I can imagine storing on all nodes a table that maps nodeid to SCSI key in a consistent way. But as this is rather stateful information that can potentially get out of date over the cluster's lifetime, I would like to avoid it if possible.

@oalbrigt
Collaborator

oalbrigt commented Oct 4, 2019

I guess hashing the nodename and using only 4 bytes of the hash might be the way least likely to create duplicates.

@oalbrigt
Collaborator

oalbrigt commented Oct 7, 2019

We'll have to implement this as a strongly recommended setting to avoid breaking rolling upgrades.

So e.g. key_value=<id|hash>, which would default to the old behaviour.

@OndrejHome
Contributor Author

Thank you, Oyvind, for the reply and ideas. I will have a look at this and update you during this week.

@OndrejHome
Contributor Author

I'm sorry for the delay, I have been busier than expected. I will try to get to this soon.

@oalbrigt
Collaborator

Sounds good.

this method generates the second part of the SCSI key based on a hash of the
cluster node name instead of the currently used ID-based approach, which can
break if nodes get removed from the cluster while the whole cluster is not
restarted, because the IDs change. With the hash approach the hashes stay
the same. Note that there is a theoretical risk that hashes could collide.
@OndrejHome OndrejHome reopened this Oct 20, 2019
@OndrejHome
Contributor Author

Hi Oyvind, so finally a new proposal implementing a new option key_value that allows choosing between the current behaviour (id), which remains the default, and the new approach based on hashing the node name (hash). The update touches the fence_scsi agent and the tests checking the output of -o metadata. Tested on 2-node and 3-node clusters on CentOS 7.7 and verified that after removing a node from the 3-node cluster this still works as expected. Please let me know what you think about this. Thank you.
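A minimal sketch of what the two modes boil down to; get_cluster_id() and get_node_id() are the agent's existing helpers quoted earlier, while the node_name parameter and the key_value plumbing are simplified for illustration:

import hashlib

def generate_key(options, node_name, key_value="id"):
    if key_value == "id":
        # Default: the pre-existing behaviour, 4 characters of the md5'd cluster
        # name plus a 4-digit node ID taken from the nodelist.
        return "%.4s%.4d" % (get_cluster_id(options), int(get_node_id(options)))
    # key_value=hash: replace the numeric part with 4 characters of the md5'd
    # node name, so the key survives nodelist reordering and node removal.
    return "%.4s%.4s" % (get_cluster_id(options),
                         hashlib.md5(node_name.encode()).hexdigest())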

@OndrejHome
Contributor Author

Hi Oyvind, I have added the changes to address your review from the previous weeks. Once you have time, please let me know if they are OK or if you see any other changes that need to be made to move this forward. Thank you!

@oalbrigt
Collaborator

I added a comment to your last commit. Maybe you didn't see it.

@OndrejHome
Contributor Author

Hmm, by 'last commit' do you mean 5810571? I don't see any comment on it - I have tried looking directly at the commit and checking the mail notifications, but nothing is showing up. Can you post the link to that comment here, please?

@oalbrigt
Collaborator

Oh. It seems like I had a lack of coffee that day, so it was stuck in "review" mode. I just added it as a comment as I intended to initially.

@OndrejHome
Contributor Author

Thanks for double-checking, I have added the change and it looks like it also passed the tests. In case there is anything more that needs changing, please let me know.

@oalbrigt oalbrigt merged commit 99c1f7d into ClusterLabs:master Nov 19, 2019
@oalbrigt
Collaborator

LGTM. Thanks.
