Option to disable iSCSI session SelfHealing feature #864

Closed
ysakashita opened this issue Nov 9, 2023 · 8 comments

@ysakashita

Describe the solution you'd like

I would like an option to disable the "automation to detect and fix broken or stale iSCSI sessions on host nodes" feature in Trident v23.01.
This feature may log out iSCSI sessions at the wrong time, risking a serious incident.
For example, if it logs out an iSCSI session at the exact moment a path is being switched in multipath, the number of live paths drops to zero, leading to a serious failure.
Therefore, I would like an option to disable this function.

Describe alternatives you've considered

A mature iSCSI session SelfHealing feature should be provided. Until it matures, however, it should be removed from Trident or be possible to turn off.
Alternatively, Trident could skip iSCSI session SelfHealing entirely and leave session recovery to open-iscsi.

Additional context
None

@rohit-arora-dev
Contributor

Hello @ysakashita

I agree there should be an option to disable or modify this feature. There are two inputs into the daemonset: iscsi_self_healing_interval (default: 5 minutes) and iscsi_self_healing_wait_time (default: 7 minutes).

In the meantime, the only way to disable this feature is to first disable the Trident Operator (by setting its replica count to 0) and then pass the iscsi_self_healing_interval=-1 option to the daemonset. Alternatively, iscsi_self_healing_wait_time can be set to a higher value to delay the logout.

Ideally, both configuration parameters (iscsi_self_healing_interval, iscsi_self_healing_wait_time) should be exposed via the operator as well as the tridentctl installation so that iSCSI self-healing behaviour can be disabled or modified.

@rohit-arora-dev
Contributor

Minor correction: the value has to be 0, not -1. So, iscsi_self_healing_interval=0.

@ysakashita
Author

ysakashita commented Nov 14, 2023

Thank you for the configuration parameters.
Your idea does not apply to the Trident Operator, does it?
Unfortunately, I am using the Trident operator, so the daemonset is installed by the operator with the operator's configuration parameters.

I would like the disable option (or tuning values) to be available not only in tridentctl install (custom YAML) but also in the Trident operator.

@ysakashita
Author

IMO, you could add the --iscsi_self_healing_interval=0 parameter to the following code section so that the Trident operator also supports the disable option.
https://github.com/NetApp/trident/blob/v23.10.0/cli/k8s_client/yaml_factory.go#L934-L935

@rohit-arora-dev
Contributor

@ysakashita

As part of the enhancement, I agree that future releases of Trident should offer an option to override the default behaviour (e.g. disableISCSISelfHealing: true) and disable iSCSI self-healing via the Trident Operator as well as Helm. The Operator would then set --iscsi_self_healing_interval=0 in yaml_factory.go, and users would not need to do it manually.

Today this option does not exist, so the only way to achieve this is:

  1. For tridentctl-based installations: Use custom YAML-based installation and set --iscsi_self_healing_interval=0 on the daemonset.
  2. For Trident Operator-based Installation (after the installation):
    a. Disable the Trident Operator by setting the Trident Operator deployment replica count to zero.
    b. Patch Trident daemonset with --iscsi_self_healing_interval=0.
    c. Please do not re-enable Trident Operator or increase its replica count to 1.

Please note: This is a workaround. The downside of disabling the Trident Operator is that you lose its ability to remediate Trident installation issues, perform automatic upgrades, and run the watches that ensure Trident stays in the desired state.
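
For readers scripting step 2b, the core of the patch is making sure the daemonset container's argument list carries --iscsi_self_healing_interval=0 exactly once. The following is only a minimal sketch of that list manipulation: set_flag is a hypothetical helper, --example_flag is a placeholder argument, and nothing here talks to Kubernetes. Reading the current args from the daemonset and writing them back is left to whatever tooling you already use.

```python
from typing import List


def set_flag(args: List[str], name: str, value: str) -> List[str]:
    """Return a copy of args with --name=value present exactly once.

    Hypothetical helper for illustration. It only handles the
    "--name=value" form used in this thread, not "--name value".
    """
    prefix = f"--{name}="
    # Drop any existing occurrence so the result is idempotent.
    kept = [arg for arg in args if not arg.startswith(prefix)]
    return kept + [f"{prefix}{value}"]


# Placeholder args; a real daemonset container will have different ones.
current = ["--example_flag=true", "--iscsi_self_healing_interval=5m"]
patched = set_flag(current, "iscsi_self_healing_interval", "0")
print(patched)
```

Applying the helper a second time leaves the list unchanged, which makes it safe to re-run. Keep in mind that the Operator will overwrite such a change if it is re-enabled, as noted in step 2c.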

@ysakashita
Author

ysakashita commented Nov 14, 2023

@ntap-arorar
Your enhancement idea looks fine to me.
Thanks for the workaround as well.

I can use this workaround in my experimental environment.
However, we are providing and managing over 1200 Kubernetes clusters for our customers, so I will wait for the official enhancement.

Please let me know which versions will support this feature once NetApp has a plan.

@rohit-arora-dev
Contributor

The fix for this issue has been merged in f1d7e12.

Two configuration parameters have been added to Trident installers (Operator, tridentctl, and Helm):

iSCSI Self-Healing Interval: This value sets how often iSCSI self-healing runs (default 5 mins). A lower value makes it run more often; a larger value makes it run less frequently. Setting it to 0s stops iSCSI self-healing completely.

iSCSI Self-Healing Wait Time: This value sets how long iSCSI self-healing waits before logging out of an unhealthy session and trying to log in again (default 7 mins). A larger value makes sessions identified as unhealthy wait longer before the logout and re-login attempt; a smaller value makes the logout and re-login happen earlier.

e.g. (Operator)

iscsiSelfHealingInterval: 10m
iscsiSelfHealingWaitTime: 15m

e.g. (tridentctl)

--iscsi-self-healing-interval=10m
--iscsi-self-healing-wait-time=15m
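
To sanity-check values before rolling them out, it can help to model how such settings are interpreted. The sketch below assumes the values are Go-style duration strings (as the 10m, 15m, and 0s examples suggest) and that a zero interval means "disabled", as described above. parse_duration and self_healing_enabled are hypothetical helpers, not part of Trident, and the parser only understands the h, m, and s units plus a bare 0.

```python
import re
from datetime import timedelta

# Seconds per supported unit; other Go units (ms, us, ns) are not handled.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)([hms])")


def parse_duration(text: str) -> timedelta:
    """Parse a simplified Go-style duration such as '10m', '0s' or '1h30m'."""
    if text == "0":
        # A bare zero needs no unit.
        return timedelta(0)
    pos = 0
    total_seconds = 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unsupported duration: {text!r}")
        total_seconds += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return timedelta(seconds=total_seconds)


def self_healing_enabled(interval: str) -> bool:
    """Assumption from the thread: an interval of zero disables self-healing."""
    return parse_duration(interval) > timedelta(0)


for candidate in ("5m", "10m", "0s"):
    state = "enabled" if self_healing_enabled(candidate) else "disabled"
    print(f"interval={candidate}: self-healing {state}")
```

A check like this is only a local guard against typos in configuration; the authoritative interpretation of the values is whatever the installed Trident version implements.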

@uppuluri123

Fixed in 24.02.
