
Support for metrocluster #228

Closed
hendrikland opened this issue Mar 26, 2019 · 25 comments

Comments

@hendrikland

In a MetroCluster configuration, a switchover/site failover causes Trident to fail; for example, you can't provision any new volumes. Apparently, Trident looks up the vServer by name. However, that name changes during a MetroCluster switchover: a "-mc" suffix is added to the original vServer name. Once a switchback is issued, the name reverts to the original one and Trident starts working again. So dynamic volume provisioning fails while in switchover (whether planned as part of a maintenance activity or unplanned in a real disaster event). Adding the "-mc" suffix is standard ONTAP behaviour in a MetroCluster environment and shouldn't cause a Trident failure. Trident should assume that, in a MetroCluster switchover state, any vServer name with a "-mc" suffix is identical to the original vServer.

Logs during MCC switchover look like this:

time="2019-03-23T09:15:51Z" level=warning msg="Kubernetes frontend has no record of the updated storage class; instead it will try to create it." storageClass=blockip time="2019-03-23T09:15:51Z" level=error msg="Kubernetes frontend couldn't add the storage class: Trident initialization failed; problem initializing storage driver 'ontap-nas': error initializing ontap-nas driver: could not create Data ONTAP API client: error reading SVM details: API status: failed, Reason: Specified vserver not found, Code: 15698" storageClass=blockip storageClass_parameters="map[backendType:ontap-san fsType:xfs provisioningType:thin]" storageClass_provisioner=netapp.io/trident
time="2019-03-23T09:15:51Z" level=warning msg="Kubernetes frontend has no record of the updated storage class; instead it will try to create it." storageClass=premium time="2019-03-23T09:15:51Z" level=error msg="Kubernetes frontend couldn't add the storage class: Trident initialization failed; problem initializing storage driver 'ontap-nas': error initializing ontap-nas driver: could not create Data ONTAP API client: error reading SVM details: API status: failed, Reason: Specified vserver not found, Code: 15698" storageClass=premium storageClass_parameters="map[backendType:ontap-nas provisioningType:thin snapshots:true storagePools:ontapnas_172.19.6.45:aggr1_dpp501001]" storageClass_provisioner=netapp.io/trident
time="2019-03-23T09:15:51Z" level=warning msg="Kubernetes frontend has no record of the updated storage class; instead it will try to create it." storageClass=standard time="2019-03-23T09:15:51Z" level=error msg="Kubernetes frontend couldn't add the storage class: Trident initialization failed; problem initializing storage driver 'ontap-nas': error initializing ontap-nas driver: could not create Data ONTAP API client: error reading SVM details: API status: failed, Reason: Specified vserver not found, Code: 15698" storageClass=standard storageClass_parameters="map[snapshots:true storagePools:ontapnas_172.19.6.46:aggr1_dpp509004 backendType:ontap-nas provisioningType:thin]" storageClass_provisioner=netapp.io/trident

(I can provide access to a NetApp internal test lab to reproduce this if required.)

@innergy changed the title from "Metrocluster switchover causes Trident to fail" to "Support for metrocluster" on Mar 26, 2019
@jacobjohnanda

I believe this could be solved by updating the backend with the new vserver name using "tridentctl update backend -f <backend_json_file>".
Could you share the setup details?
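
For illustration only, a minimal sketch of that approach; the backend name, SVM name, address, and credentials below are invented, and the only value that changes after a switchover is "svm", which gets the "-mc" suffix:

    {
      "version": 1,
      "storageDriverName": "ontap-nas",
      "backendName": "ontapnas_backend",
      "managementLIF": "10.0.0.1",
      "svm": "svm_nfs-mc",
      "username": "vsadmin",
      "password": "secret"
    }

This would be applied with something along the lines of "tridentctl update backend ontapnas_backend -f backend-mc.json -n trident"; after switchback, the original file with the unsuffixed SVM name would be applied the same way.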

@si-heger

Any update here? When can we expect this?

@gtehrani

No, still tracking as an uncommitted roadmap item.

@sirlatrom

I work for one of your customers, and we experienced this issue during an unplanned failover. It resulted in nearly half an hour of downtime due to troubleshooting and having to reconfigure and disable/enable the plugin on all nodes.

@rijmenantspj

In my professional life, I have a customer who is running into a similar issue. Hence, is there any update on this matter, as this issue has been open for quite some time now?

Reading through the documentation, I'm left wondering whether it would be an option to update Trident's backend configuration so that:

  • the "managementLIF" parameter is included;
  • the "svm" parameter is left out.

The documentation states that the "svm" parameter's value is derived if a management LIF is specified. Thus, the "-mc" suffix in the Storage Virtual Machine's name would quite likely no longer matter in such a scenario.

As I currently have no combination of a NetApp MetroCluster and Trident at my disposal, I would appreciate it if someone would be able to validate this.
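
To make that option concrete, here is a sketch of the kind of backend definition meant above (all names, addresses, and credentials are made up): the SVM management LIF is specified as an address and the "svm" parameter is omitted, so Trident would derive the SVM name, suffix included, by itself:

    {
      "version": 1,
      "storageDriverName": "ontap-nas",
      "backendName": "metro_backend",
      "managementLIF": "10.0.0.1",
      "dataLIF": "10.0.0.2",
      "username": "vsadmin",
      "password": "secret"
    }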

@rijmenantspj

Update:
In the meantime, the end customer confirmed the above can be used as a workaround. Leaving the "svm" parameter out of Trident's backend configuration indeed ensures the name with the "-mc" suffix is derived while in switchover. However, each MetroCluster switchover/switchback still requires an update of the backend configuration (the same file can be used).

The above will do the trick for our end customer for now, and I hope posting this back will help others. However, we'll still be looking forward to seeing this enhancement incorporated in a future version of Trident so that we no longer need to act manually upon a MetroCluster switchover/switchback.
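
For completeness, the manual step described above would look roughly like the line below; the backend name, file name, and namespace are assumptions:

    # Re-apply the unchanged backend definition after each switchover and
    # each switchback so Trident re-derives the current SVM name.
    tridentctl update backend metro_backend -f metro_backend.json -n trident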

@Numblesix

Hello,

Any update on this? :)

@gnarl
Contributor

gnarl commented Oct 26, 2020

This issue is still being tracked as an uncommitted roadmap item. If you consider this request a high priority, please contact your NetApp account team and let them know.

@ysakashita

@gnarl
When we evaluated NetApp Trident on MetroCluster together with a NetApp Japan member, we ran into the same problem.
So we can't proceed with installing MetroCluster in our data center.
I hope this issue gets fixed.

@gnarl
Contributor

gnarl commented Apr 23, 2021

Hi @ysakashita,

The work to provide a change to detect and handle an MCC switchover isn't currently scheduled on the Trident roadmap. Please work with your NetApp account team to communicate the priority for this change.

@ysakashita

@gnarl Sure. I understand that this is a roadmap priority issue, not a technical one. I have already discussed the priority and our business impact with the NetApp Japan account team.

@pkris

pkris commented May 6, 2021

Hi @gnarl
We have experienced a similar incident during a MetroCluster failover recently.
Does the Trident documentation state somewhere that connections will not survive such a scenario? I tried digging through the documentation but was unable to find anything.

@gnarl
Contributor

gnarl commented May 6, 2021

Hi @pkris,

Unfortunately I am unable to answer that question as Trident hasn't been qualified against MCC switchover. The recent activity on this GitHub Issue has put focus on MCC compatibility and we'll keep this issue updated when new information is available.

@gnarl
Contributor

gnarl commented May 6, 2021

Hi @pkris,

I have more information regarding MCC now. During an MCC switchover, existing connections should not be affected; all data and management LIFs are migrated to the ONTAP cluster that is now active. Per the workaround stated above, the only thing that needs to be updated is the SVM name in the Trident backend config. This change is needed so that Trident can communicate with the MCC ONTAP cluster that is currently active.

Hope this helps.

@pkris

pkris commented May 7, 2021

Hi @gnarl

Thanks for the update.
So the current state of the Trident driver stack does not allow for client-side transparency to failover events in the upstream MetroCluster; would that be a fair conclusion?
Any thoughts on how a client cluster utilizing the Trident driver could become aware of the need to apply the workaround (updating the SVM name in the backend config), i.e. know that a failover has occurred upstream?

Cheers

@gnarl
Contributor

gnarl commented May 7, 2021

Hi @pkris,

MCC switchover works transparently at the networking level. Anything that is currently mounted will continue to work after a switchover. The SVM name needs to be updated in the Trident backend config so that any new communication from Trident (volume create, clone, resize, etc.) to the backend is sent to the currently active SVM.

We are investigating how to properly detect an MCC switchover in Trident.

@chrifey

chrifey commented Apr 11, 2022

Hi, I would just like to bump this issue again. Since more and more customers are using Trident, the chances rise that some of them are MCC customers, and at the moment you still have to change the backend or restart the Trident pods in order to react after a switchover.

@nicoseiberth

Hi, same here at several customers. MCC switchover and SVM-DR are working fine, but restarting the Trident pods is needed. Please try to fix this issue.

@gnarl
Contributor

gnarl commented Apr 13, 2022

Hi @nicoseiberth, there shouldn't be a need to restart any Trident pods. Updating the SVM name in the backend config is what is needed to communicate with ONTAP after an MCC switchover.

@nicoseiberth

nicoseiberth commented Apr 13, 2022

@gnarl Thanks for the hint. Indeed, we configured our backends without SVM names, using FQDNs for the SVM connections, aggregate wildcards, etc., so we do not need to update any backend configs. The easiest way is to just restart the Trident pods to initiate a "rescan".
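
As a rough sketch only, such a restart of the Trident controller could look like the line below; the namespace and deployment name depend on how Trident was installed, so treat both as assumptions:

    # Restart the Trident controller so it re-reads its backend configurations.
    kubectl -n trident rollout restart deployment trident-csi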

@shamziman

There is a very simple fix for this. If the code logging into the SVM simply accepts both the normal SVM name and the SVM-mc name, but checks the UUID of the SVM to make sure that it has remained the same (and I'm assuming that the SVM UUID is registered somewhere), then things will just work. The UUIDs of the primary MetroCluster SVM and its site-failover secondary SVM-mc are identical.

I can't really see why this should be terribly difficult to implement or why it has been lying around for 3 years. The MetroCluster platform could well use the bump in Trident functionality, as it is the flagship high-availability system. The idea that one has to update perhaps dozens of backend configurations really does lack the necessary automation for high-availability operations.

One could even make it configurable which changes to the SVM name are acceptable (e.g. an altSvm spec field) and provide an additional UUID field (an altUuid spec field) for SVM-DR setups.
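
Purely to illustrate that proposal (neither field exists in Trident today, and the UUID and other values below are invented), a backend spec could hypothetically look like:

    {
      "version": 1,
      "storageDriverName": "ontap-nas",
      "managementLIF": "10.0.0.1",
      "svm": "svm_nfs",
      "altSvm": "svm_nfs-mc",
      "altUuid": "5f7f9a4e-1234-5678-9abc-def012345678",
      "username": "vsadmin",
      "password": "secret"
    }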

@si-heger

si-heger commented May 2, 2022

The solution shamziman proposed makes a lot of sense, and there are many companies using a MetroCluster, so this issue definitely requires more attention.

@gnarl
Contributor

gnarl commented Jul 20, 2022

Transparent MCC failover support has been enabled in Trident with commit de27a8d. This support requires that the Trident backend configuration file use the SVM IP address for the management LIF.
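
In practice this means a backend for an MCC-protected SVM would point managementLIF at the SVM's own LIF IP address rather than at a host name or the cluster management LIF; a hedged sketch, with invented values:

    {
      "version": 1,
      "storageDriverName": "ontap-nas",
      "backendName": "mcc_backend",
      "managementLIF": "10.0.0.1",
      "username": "vsadmin",
      "password": "secret"
    }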

Addressing some of the previous comments, upon investigation it was not possible to use the SVM UUID. Sending ZAPI calls to the cluster LIF doesn't work since the SVM still responds to some API calls even though it doesn't have data LIFs. This disqualifies the cluster LIF as an option to infer that a failover occurred.

Sending ZAPI calls to the SVM LIF was found to work as desired, for the failover scenario, if the API call is tunneled to the correct SVM. Many ZAPI calls that Trident uses are tunneled directly to the SVM and ZAPI tunneling requires the SVM name as the UUID is not supported.

Our investigation did find that the ONTAP REST API implementation does support use of the SVM UUID, but we were waiting on REST API fixes available in the ONTAP 9.11.1 GA release before implementing support for MCC failover using the REST API. The REST API support in Trident continues to be in beta until we've worked through the remaining API gaps with the ONTAP teams.

The reason the MetroCluster APIs were not used is that they require cluster admin access. For security reasons, we strongly recommend that customers use the ONTAP vsadmin user/role when configuring Trident. In most use cases, storing cluster credentials in a Kubernetes cluster should not be considered a secure practice.

@shamziman

Is the change to use the REST API and fetch the SVM UUID something that you will be tracking internally, or does one need to open an issue here?
Using an IP address in the managementLIF field eliminates some options with regard to being able to migrate SVMs to new IP addresses, so I'm hoping that this will eventually be addressed by tracking the UUID of the SVM once this information becomes available via an API call.

@clintonk
Contributor

Hello, @shamziman. Yes, the planned MCC support with REST involves sending all API calls using the SVM UUID. That solution won't work with ZAPI, hence the current requirement to provide the SVM LIF address with MCC backends.
