
Support for metrocluster #228

Closed
hendrikland opened this issue Mar 26, 2019 · 25 comments

Comments

@hendrikland

In a MetroCluster configuration, a switchover/site failover causes Trident to fail; for example, you can't provision any new volumes. Apparently, Trident looks up the vServer by name. However, that name changes during a MetroCluster switchover: a "-mc" suffix is added to the original vServer name. Once a switchback is issued, the name reverts to the original one and Trident starts working again. So dynamic volume provisioning fails while in switchover (whether planned as part of a maintenance activity or unplanned in a real disaster event). Adding the "-mc" suffix is standard ONTAP behaviour in a MetroCluster environment and shouldn't cause a Trident failure. Trident should assume that, in a MetroCluster switchover state, any vServer name with a "-mc" suffix is identical to the original vServer.

Logs during MCC switchover look like this:

time="2019-03-23T09:15:51Z" level=warning msg="Kubernetes frontend has no record of the updated storage class; instead it will try to create it." storageClass=blockip time="2019-03-23T09:15:51Z" level=error msg="Kubernetes frontend couldn't add the storage class: Trident initialization failed; problem initializing storage driver 'ontap-nas': error initializing ontap-nas driver: could not create Data ONTAP API client: error reading SVM details: API status: failed, Reason: Specified vserver not found, Code: 15698" storageClass=blockip storageClass_parameters="map[backendType:ontap-san fsType:xfs provisioningType:thin]" storageClass_provisioner=netapp.io/trident
time="2019-03-23T09:15:51Z" level=warning msg="Kubernetes frontend has no record of the updated storage class; instead it will try to create it." storageClass=premium time="2019-03-23T09:15:51Z" level=error msg="Kubernetes frontend couldn't add the storage class: Trident initialization failed; problem initializing storage driver 'ontap-nas': error initializing ontap-nas driver: could not create Data ONTAP API client: error reading SVM details: API status: failed, Reason: Specified vserver not found, Code: 15698" storageClass=premium storageClass_parameters="map[backendType:ontap-nas provisioningType:thin snapshots:true storagePools:ontapnas_172.19.6.45:aggr1_dpp501001]" storageClass_provisioner=netapp.io/trident
time="2019-03-23T09:15:51Z" level=warning msg="Kubernetes frontend has no record of the updated storage class; instead it will try to create it." storageClass=standard time="2019-03-23T09:15:51Z" level=error msg="Kubernetes frontend couldn't add the storage class: Trident initialization failed; problem initializing storage driver 'ontap-nas': error initializing ontap-nas driver: could not create Data ONTAP API client: error reading SVM details: API status: failed, Reason: Specified vserver not found, Code: 15698" storageClass=standard storageClass_parameters="map[snapshots:true storagePools:ontapnas_172.19.6.46:aggr1_dpp509004 backendType:ontap-nas provisioningType:thin]" storageClass_provisioner=netapp.io/trident

(I can provide access to a NetApp internal test lab to reproduce this if required.)

@innergy changed the title from "Metrocluster switchover causes Trident to fail" to "Support for metrocluster" on Mar 26, 2019
@jacobjohnanda

I believe this could be solved by updating the backend with the new vserver name using "tridentctl update backend -f <backend_json_file>".
Could you share the setup details?
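
For illustration only, a minimal sketch of that approach; the backend name, SVM name, address, and credentials below are invented, and the only value that changes after a switchover is "svm", which gets the "-mc" suffix:

    {
      "version": 1,
      "storageDriverName": "ontap-nas",
      "backendName": "ontapnas_backend",
      "managementLIF": "10.0.0.1",
      "svm": "svm_nfs-mc",
      "username": "vsadmin",
      "password": "secret"
    }

This would be applied with something along the lines of "tridentctl update backend ontapnas_backend -f backend-mc.json -n trident"; after switchback, the original file with the unsuffixed SVM name would be applied the same way.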

@si-heger

Any update here? When can we expect this?

@gtehrani

No, still tracking as an uncommitted roadmap item.

@sirlatrom

I work for one of your customers, and we experienced this issue during an unplanned failover. It resulted in nearly half an hour of downtime due to troubleshooting and having to reconfigure and disable/enable the plugin on all nodes.

@rijmenantspj

In my professional life, I have a customer who is running into a similar issue. Hence, is there any update on this matter, as this issue has been open for quite some time now?

Reading through the documentation, I'm left wondering whether it would be an option to update Trident's backend configuration so that:

  • the "managementLIF" parameter is included;
  • the "svm" parameter is left out.

The documentation states that the "svm" parameter's value is derived if a management LIF is specified. Thus, the "-mc" suffix in the Storage Virtual Machine's name would quite likely no longer matter in such a scenario.

As I currently have no combination of a NetApp MetroCluster and Trident at my disposal, I would appreciate it if someone would be able to validate this.
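
To make that option concrete, here is a sketch of the kind of backend definition meant above (all names, addresses, and credentials are made up): the SVM management LIF is specified as an address and the "svm" parameter is omitted, so Trident would derive the SVM name, suffix included, by itself:

    {
      "version": 1,
      "storageDriverName": "ontap-nas",
      "backendName": "metro_backend",
      "managementLIF": "10.0.0.1",
      "dataLIF": "10.0.0.2",
      "username": "vsadmin",
      "password": "secret"
    }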

@rijmenantspj

Update:
In the meantime, the end customer confirmed the above can be used as a workaround. Leaving the "svm" parameter out of Trident's backend configuration indeed ensures the name with the "-mc" suffix is derived while in switchover. However, each MetroCluster switchover/switchback still requires an update of the backend configuration (the same file can be used).

The above will do the trick for our end customer for now, and I hope posting this back will help others. However, we'll still be looking forward to seeing this enhancement incorporated in a future version of Trident so that we no longer need to act manually upon a MetroCluster switchover/switchback.
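
For completeness, the manual step described above would look roughly like the line below; the backend name, file name, and namespace are assumptions:

    # Re-apply the unchanged backend definition after each switchover and
    # each switchback so Trident re-derives the current SVM name.
    tridentctl update backend metro_backend -f metro_backend.json -n trident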

@Numblesix

Hello,

Any update on this? :)

@gnarl
Contributor

gnarl commented Oct 26, 2020

This issue is still being tracked as an uncommitted roadmap item. If you consider this request a high priority, please contact your NetApp account team and let them know.

@ysakashita

@gnarl
When we evaluated NetApp Trident on MetroCluster together with a NetApp Japan member, we ran into the same problem.
So we can't proceed with installing MetroCluster in our data center.
I hope this issue gets fixed.

@gnarl
Contributor

gnarl commented Apr 23, 2021

Hi @ysakashita,

The work to provide a change to detect and handle an MCC switchover isn't currently scheduled on the Trident roadmap. Please work with your NetApp account team to communicate the priority for this change.

@ysakashita

@gnarl Sure. I understand that this is a roadmap priority issue, not a technical one. I have already discussed the priority and our business impact with the NetApp Japan account team.

@pkris

pkris commented May 6, 2021

Hi @gnarl
We have experienced a similar incident during a MetroCluster failover recently.
Does the Trident documentation state somewhere that connections will not survive such a scenario? I tried digging through the documentation but was unable to find anything.

@gnarl
Contributor

gnarl commented May 6, 2021

Hi @pkris,

Unfortunately I am unable to answer that question as Trident hasn't been qualified against MCC switchover. The recent activity on this GitHub Issue has put focus on MCC compatibility and we'll keep this issue updated when new information is available.

@gnarl
Contributor

gnarl commented May 6, 2021

Hi @pkris,

I have more information regarding MCC now. During an MCC switchover, existing connections should not be affected; all data and management LIFs are migrated to the ONTAP cluster that is now active. Per the workaround stated above, the only thing that needs to be updated is the SVM name in the Trident backend config. This change is needed so that Trident can communicate with the MCC ONTAP cluster that is currently active.

Hope this helps.

@pkris

pkris commented May 7, 2021

Hi @gnarl

Thanks for the update.
So the current state of the Trident driver stack does not allow for client-side transparency to failover events in the upstream MetroCluster; would that be a fair conclusion?
Any thoughts on how a client cluster utilizing the Trident driver could become aware of the need to apply the workaround (updating the SVM name in the backend config), i.e. know that a failover has occurred upstream?

Cheers

@gnarl
Contributor

gnarl commented May 7, 2021

Hi @pkris,

MCC switchover works transparently at the networking level. Anything that is currently mounted will continue to work after a switchover. The SVM name needs to be updated in the Trident backend config so that any new communication from Trident (volume create, clone, resize, etc.) to the backend is sent to the currently active SVM.

We are investigating how to properly detect an MCC switchover in Trident.

@chrifey

chrifey commented Apr 11, 2022

Hi, I would just like to bump this issue again. Since more and more customers are using Trident, the chances rise that some of them are MCC customers, and at the moment you still have to change the backend or restart the Trident pods in order to react after a switchover.

@nicoseiberth

Hi, same here at several customers. MCC switchover and SVM-DR are working fine, but restarting the Trident pods is needed. Please try to fix this issue.

@gnarl
Contributor

gnarl commented Apr 13, 2022

Hi @nicoseiberth, there shouldn't be a need to restart any Trident pods. Updating the SVM name in the backend config is what is needed to communicate with ONTAP after an MCC switchover.

@nicoseiberth

nicoseiberth commented Apr 13, 2022

@gnarl Thanks for the hint. Indeed, we configured our backends without SVM names, using FQDNs for the SVM connections, aggregate wildcards, etc., so we do not need to update any backend configs. The easiest way is to just restart the Trident pods to initiate a "rescan".
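
As a rough sketch only, such a restart of the Trident controller could look like the line below; the namespace and deployment name depend on how Trident was installed, so treat both as assumptions:

    # Restart the Trident controller so it re-reads its backend configurations.
    kubectl -n trident rollout restart deployment trident-csi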

@shamziman

There is a very simple fix for this. If the code logging into the SVM simply accepts both the normal SVM name and the SVM-mc name, but checks the UUID of the SVM to make sure that it has remained the same (and I'm assuming that the SVM UUID is registered somewhere), then things will just work. The UUIDs of the primary MetroCluster SVM and its site-failover secondary SVM-mc are identical.

I can't really see why this should be terribly difficult to implement or why it has been lying around for 3 years. The MetroCluster platform could well use the bump in Trident functionality, as it is the flagship high-availability system. The idea that one has to update perhaps dozens of backend configurations really does lack the necessary automation for high-availability operations.

One could even make it configurable which changes to the SVM name are acceptable (e.g. an altSvm spec field) and provide an additional UUID field (an altUuid spec field) for SVM-DR setups.
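
Purely to illustrate that proposal (neither field exists in Trident today, and the UUID and other values below are invented), a backend spec could hypothetically look like:

    {
      "version": 1,
      "storageDriverName": "ontap-nas",
      "managementLIF": "10.0.0.1",
      "svm": "svm_nfs",
      "altSvm": "svm_nfs-mc",
      "altUuid": "5f7f9a4e-1234-5678-9abc-def012345678",
      "username": "vsadmin",
      "password": "secret"
    }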

@si-heger

si-heger commented May 2, 2022

The solution shamziman proposed makes a lot of sense, and there are many companies using a MetroCluster, so this issue definitely requires more attention.

@gnarl
Contributor

gnarl commented Jul 20, 2022

Transparent MCC failover support has been enabled in Trident with commit de27a8d. This support requires that the Trident backend configuration file use the SVM IP address for the management LIF.
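
In practice this means a backend for an MCC-protected SVM would point managementLIF at the SVM's own LIF IP address rather than at a host name or the cluster management LIF; a hedged sketch, with invented values:

    {
      "version": 1,
      "storageDriverName": "ontap-nas",
      "backendName": "mcc_backend",
      "managementLIF": "10.0.0.1",
      "username": "vsadmin",
      "password": "secret"
    }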

Addressing some of the previous comments, upon investigation it was not possible to use the SVM UUID. Sending ZAPI calls to the cluster LIF doesn't work since the SVM still responds to some API calls even though it doesn't have data LIFs. This disqualifies the cluster LIF as an option to infer that a failover occurred.

Sending ZAPI calls to the SVM LIF was found to work as desired, for the failover scenario, if the API call is tunneled to the correct SVM. Many ZAPI calls that Trident uses are tunneled directly to the SVM and ZAPI tunneling requires the SVM name as the UUID is not supported.

Our investigation did find that the ONTAP REST API implementation does support use of the SVM UUID, but we were waiting on REST API fixes available in the ONTAP 9.11.1 GA release before implementing support for MCC failover using the REST API. The REST API support in Trident continues to be in beta until we've worked through the remaining API gaps with the ONTAP teams.

The reason the MetroCluster APIs were not used is that they require cluster admin access. For security reasons, we strongly recommend that customers use the ONTAP vsadmin user/role when configuring Trident. In most use cases, storing cluster credentials in a Kubernetes cluster should not be considered a secure practice.

@shamziman

Is the change to use the REST API and fetch the SVM UUID something that you will be tracking internally, or does one need to open an issue here?
Using an IP address in the managementLIF field eliminates some options with regard to being able to migrate SVMs to new IP addresses, so I'm hoping that this will eventually be addressed by tracking the UUID of the SVM once this information becomes available via an API call.

@clintonk
Contributor

Hello, @shamziman. Yes, the planned MCC support with REST involves sending all API calls using the SVM UUID. That solution won't work with ZAPI, hence the current requirement to provide the SVM LIF address with MCC backends.
