Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

removing unnecessary condition #29

Closed
wants to merge 2 commits into from
Closed

Conversation

julian3xl
Copy link

This condition is supposed to be necessary if you install the node feature discovery chart included on this chart, but in the real world a production cluster already have that stuff set up and running, so having the proper node selector is mandatory and this condition on the chart avoids it.

Signed-off-by: julian3xl julian3xl@gmail.com

Signed-off-by: julian3xl <julian3xl@gmail.com>
@julian3xl julian3xl mentioned this pull request Oct 11, 2022
@julian3xl
Copy link
Author

@St0rmingBr4in or @y2kenny-amd, I'm asking you to review this pull request because you are the latest ones with activity on this repository that seems to be really abandoned :(

@y2kenny
Copy link
Contributor

y2kenny commented Oct 11, 2022

My understanding is the NFD is an add-on. Do you have any documentation to indicate otherwise?

@julian3xl
Copy link
Author

julian3xl commented Oct 11, 2022

Yeah, the problem is that if I don't want to install the included NFD because I have one already running on my cluster (that's gonna be the 99,9% of the use cases), I have to set nfd.enabled to false, and because this condition on the chart I'm not able to setup node selectors anymore, and they are really important because otherwise I can't avoid the amd gpu device plugin to be deployed into all my machines on the cluster

@julian3xl
Copy link
Author

what I'm trying to say is that node selectors is something related with Kubernetes and they should not be dependant on if I'm installing and add-on or not

@y2kenny
Copy link
Contributor

y2kenny commented Oct 11, 2022

I don't understand. If you use NFD on your cluster, you set nfd.enabled to on here to use the node selector. This device plugin does not install NFD for you.

@y2kenny
Copy link
Contributor

y2kenny commented Oct 11, 2022

Ah ok, may be that's the real issue. My intention was for that dependencies to be optional (or may be I can make that dependencies 'softer' some how.)

@julian3xl
Copy link
Author

well, that's fine. I've seen that in other charts and it's fine to get it installed in case you don't want to do it by yourself but anyway kubernetes standard fields should be always settable in my opinion.

@julian3xl
Copy link
Author

julian3xl commented Oct 11, 2022

Because the documentation says that the node selector is set to feature.node.kubernetes.io/pci-0300_1002.present: "true" by default and that's totally fine and acceptable if someone might want to keep it empty, it just needs to set Values.node_selector to "" so this chart would be fitting the standard.

@y2kenny
Copy link
Contributor

y2kenny commented Oct 11, 2022

Um.... I will need to think about this. Not necessarily disagreeing here just need to context switch my brain back into this because I don't work with Helm every day.

@boniek83
Copy link

boniek83 commented Mar 9, 2023

Will node selector issue be resolved anytime soon? Kustomizing helm charts for basic features is a pain.

@julian3xl
Copy link
Author

To be honest, I don't see the maintainers of this repo really interested on fix this... 🤷

@y2kenny
Copy link
Contributor

y2kenny commented Mar 9, 2023

Is this still an issue now that the helm dependency uses >= instead of = for nfd when nfd.enabled is true?

@julian3xl
Copy link
Author

julian3xl commented Mar 9, 2023

nodeSelector is a standard helm chart property, it exists on every chart so this discussion doesn't make sense, regardless if you're installing NFD or not you should be able to set node selectors, man! that's the charts standard. Go artifacthub.io, check some charts and you will see that almost all of them have the option to set nodeSelectors. I see your point, you want to ensure:

node_selector:
  feature.node.kubernetes.io/pci-0300_1002.present: "true"

ok, put some helper to merge that selector to .Values.node_selector so users could add their custom nodeSelectors too without having to install nfd from your chart.

@y2kenny
Copy link
Contributor

y2kenny commented Mar 9, 2023

ok, put some helper to merge that selector to .Values.node_selector so users could add their custom nodeSelectors too without having to install nfd from your chart.

Can you update this pull request to reflect what you mean by this?

@rptaylor
Copy link
Contributor

rptaylor commented Apr 4, 2024

I agree, this is an issue. The situation is that this chart installs NFD, if nfd.enabled=true :
https://github.com/ROCm/k8s-device-plugin/blob/master/helm/amd-gpu/Chart.yaml#L25

However the default value is nfd.enabled=false, and in scenarios discussed here, which would be the case for me too, it must remain false, because I need to install and configure NFD separately myself, not using this chart.

That being the case, with nfd disabled, it unnecessarily blocks the use of nodeselectors: https://github.com/ROCm/k8s-device-plugin/blob/master/helm/amd-gpu/templates/deviceplugin-daemonset.yaml#L19

However nodeselectors are a normal built-in k8s feature that should generally always be configurable by chart users if they want. Nodeselectors do not depend on NFD per se, though may rely on labels set by NFD.

Unfortunately this chart overloads the nfd.enabled property for two different purposes: whether to install NFD, and whether to apply nodeselector labels. That being the case, I think it is not possible to address this without some form of breaking change (it would just require users to read the release notes and update their values, not a big deal).

IIUC, the change proposed here would result in the default behaviour changing from:
NFD not installed and no nodeselector (so device plugin runs on all nodes)
to:
NFD not installed and default nodeselector labels being applied (so potentially the plugin would not run on any nodes if NFD is not applying labels). Users should simply adjust .Values.node_selector to an empty list to preserve the previous behaviour.

An alternative approach would be to add a node_selectors_enabled variable (default false). People who were relying on the previous behaviour of setting nfd.enabled=true in their values would need to also set node_selectors_enabled: true in order to maintain the same behaviour as before. If they upgrade without reading the release notes the failure scenario would be that the device plugin would run on all nodes instead of only the selected nodes.

It seems to me the safer and more conservative approach is to avoid unintentionally running on undesired nodes?

All things considered, if I have understood the situation correctly, IMHO the simplest and safest way to fix this would be to merge this PR and add a release note to the chart docs saying "if you want to keep the default behaviour of running the plugin on all nodes set .Values.node_selector to an empty list".

@y2kenny what do you think?

@y2kenny
Copy link
Contributor

y2kenny commented Apr 11, 2024

@rptaylor, thanks for the detailed explanation. I think others have wrote similar thing but it didn't 'click' in my head until I read what you wrote so that is much appreciated.

As to the solution to this problem, I actually don't think running the plugin on every single node is an issue, because:

  1. the plugin check for the existence of the driver, and
    if _, err := os.Stat(path); err == nil {
  2. if not present, the plugin is just a blocked process by the golang select statement
    https://github.com/kubevirt/device-plugin-manager/blob/16a4b8a71a689ccf89f6cacd5ac3dbea84ffadb1/pkg/dpm/manager.go#L69

If the choice is between running the plugin on all nodes vs no nodes as the default behaviour, I would prefer all because the user is installing the Helm Chart for a reason. Ideally I would have a context sensitive default value but sounds like that is also not welcome because nodeselector is well used/customized by cluster admins.

The remaining issue is how to communicate the feature.node.kubernetes.io/pci-0300_1002.present: "true". Should I just set a variable of some sort but leave it unused or just document it in the README?

@rptaylor
Copy link
Contributor

rptaylor commented Apr 11, 2024

2. if not present, the plugin is just a blocked process by the golang select statement
   https://github.com/kubevirt/device-plugin-manager/blob/16a4b8a71a689ccf89f6cacd5ac3dbea84ffadb1/pkg/dpm/manager.go#L69

Meaning that it will remain in interruptible sleep indefinitely?

If the choice is between running the plugin on all nodes vs no nodes as the default behaviour, I would prefer all because > the user is installing the Helm Chart for a reason. Ideally I would have a context sensitive default value but sounds like
that is also not welcome because nodeselector is well used/customized by cluster admins.

I am not sure how the chart could detect the right default, it depends on what the admin wants to do and if NFD is installed independently. Anyway I'll make a MR.

The remaining issue is how to communicate the feature.node.kubernetes.io/pci-0300_1002.present: "true". Should I just set a variable of some sort but leave it unused or just document it in the README?

I think the values of node_selector could just remain the same. People can modify it if they need.

@y2kenny-amd
Copy link
Collaborator

Resolved by #58

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants