Add Helm chart support for k8s node tolerations #13214
@ZhuTopher Thanks for adding this! You will need to rebase when #13226 is in. It took chart version 0.6.17, so you'll need to bump to 0.6.18.
Force-pushed from 2570c5e to 3a540fe
@nirav-chotai PTAL too, thanks!
Thanks @ZhuTopher! I left some comments.
```yaml
{{- if .Values.master.tolerations }}
{{ toYaml .Values.master.tolerations | trim | indent 8 }}
{{- else if .Values.tolerations }}
```
How do you like `.Values.master.tolerations` + `.Values.tolerations`? In other words, I think this should be a union instead of if-else.
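A minimal sketch of what that union could look like in the template (indentation assumed to match the chart's pod spec; a suggestion, not the final implementation):

```yaml
      tolerations:
{{- if .Values.master.tolerations }}
{{ toYaml .Values.master.tolerations | trim | indent 8 }}
{{- end }}
{{- if .Values.tolerations }}
{{ toYaml .Values.tolerations | trim | indent 8 }}
{{- end }}
```

Each `if` renders independently, so the component-specific and top-level lists are concatenated rather than one shadowing the other.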
It would be good to have `.Values.tolerations`, which can add tolerations to all the Alluxio components rather than specifying them individually. And if someone wants to run only some components on dedicated nodes, they can use `.Values.XYZ.tolerations`.
Yeah that makes sense to me, I can change it to a union of the tolerations.
```yaml
{{ toYaml .Values.worker.tolerations | trim | indent 8 }}
{{- end }}
{{- if .Values.tolerations }}
{{ toYaml .Values.tolerations | trim | indent 8 }}
```
@jiacheliu3 Is this what you meant by union? This will include duplicate definitions if the same toleration is in both `.Values.tolerations` and `.Values.xxx.tolerations`, but I think that's on the user if they do that. Also, I tested duplicate toleration definitions and I don't think it matters to Kubernetes anyway.
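For illustration, with hypothetical toleration values, a pod spec rendered from the same entry at both levels would simply list it twice; tolerations are matched against taints individually, so the duplicate is harmless:

```yaml
# Rendered pod spec (hypothetical key/value, not from the chart defaults)
tolerations:
  - key: "dedicated"      # from .Values.master.tolerations
    operator: "Equal"
    value: "alluxio"
    effect: "NoSchedule"
  - key: "dedicated"      # duplicate from the top-level .Values.tolerations
    operator: "Equal"
    value: "alluxio"
    effect: "NoSchedule"
```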
Exactly, this is what I meant.
Thanks! LGTM except for the comment below. Please remove the change I mentioned in the comment then I'll merge it.
```yaml
{{- if .Values.logserver.nodeSelector }}
{{ toYaml .Values.logserver.nodeSelector | trim | indent 8 }}
{{- else if .Values.nodeSelector }}
{{ toYaml .Values.nodeSelector | trim | indent 8 }}
{{- end }}
```
I can see you chose `if` + `else if` because the other code does it this way. However, I think `if` + `if` is a better option because it allows you to have general node selectors and more specific ones.

I think this PR (and helm chart version) is not intended for this nodeSelector change, so let's remove these few lines. In a separate PR you can address this one and, if you get a chance, change all node selectors to be `if` + `if`.
Actually, I personally think it makes sense to have `nodeSelector` be XOR while letting `tolerations` be AND; I'll explain why in a moment, but I agree this doesn't need to be in this PR, so I'll remove it.

As for why I don't think `nodeSelector` should be AND: in order for the resource to get scheduled, it must match all of the labels provided. For taints, a toleration is just a signifier to the Kubernetes scheduler that it may schedule onto nodes with the matching taints.

- If we were talking specifically about "soft" node affinity, then I think AND would make sense, just like `tolerations`.

Having this if-else order of 'most-specific first' allows the user to "override" the default set by the top-level `nodeSelector`. The alternative, if it were AND, is that you would have to manually define the same selector for every resource except the exception, since you would no longer be able to use a top-level `nodeSelector`.

- The implemented behaviour should be explained more clearly in the `values.yaml` comments, regardless of what we choose.
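For example (hypothetical labels), the if-else behaviour lets a user set a chart-wide default and fully override it for a single component:

```yaml
# values.yaml sketch with hypothetical labels
nodeSelector:
  disktype: ssd          # default: every component targets SSD nodes
logserver:
  nodeSelector:
    storage: persistent  # logserver ignores the default and uses only this
```

Under an AND/union scheme, the logserver's pods would have to match both labels at once, so the top-level default could no longer be overridden for one component.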
```diff
@@ -416,7 +416,6 @@ logserver:
   # dnsPolicy: ClusterFirst
   # JVM options specific to the logserver container
   jvmOptions:
-  nodeSelector: {}
```
@jiacheliu3 I removed this because it was (and will still be) unused
Actually, let's keep it this way (don't remove this line).

One use case I can imagine is: there are nodes that have storage and nodes that do not. By defining a selector, the logserver is scheduled to a node that has storage, so that the logs are persisted to the storage.

Another argument is: let's not take away what's there unless the whole PR is intended for that. In this PR we add tolerations, so let's not touch unrelated stuff. In the next PR, just for nodeSelector, we can do the refactor as we redesign.
@ZhuTopher I'll merge after this is reverted. Thanks!
@ZhuTopher Oops, sorry, I seem to have clicked resolve unintentionally. Could you revert this line and then I'll merge? Thanks!
Oh sorry, I didn't notice this comment previously, my bad! The reason I removed this from the Helm chart `values.yaml` is that this field is unused in `templates/logserver/deployment.yaml`.

I tried to add the `nodeSelector` behaviour from our other templates to it, but since we decided to leave it out of this PR, I also removed the related chart value to avoid confusion (the chart implies you can specify a `nodeSelector` when you actually can't).

Anyway, again, we can leave all this to a separate nodeSelector PR, so I'll revert it for now.
Codecov Report

```
@@            Coverage Diff            @@
##             master   #13214   +/-   ##
=========================================
  Coverage          ?   17.37%
  Complexity        ?     2625
=========================================
  Files             ?     1362
  Lines             ?    78502
  Branches          ?     9519
=========================================
  Hits              ?    13640
  Misses            ?    63663
  Partials          ?     1199
```

Continue to review full report at Codecov.
The failed test is due to flaky codecov, irrelevant to the change.
alluxio-bot, merge this please
Same as the other PRs, please add to the doc page too. Let's do the doc change in a separate PR in order not to block yourself. Thanks!
This PR is being made to address git issue Alluxio#13203.

I tested that the Helm chart rendered using the following in my `config.yaml`:

```
tolerations: [
  {"key": "test", "operator": "Equal", "value": "test", "effect": "NoSchedule"}
]
```

and that the tolerations worked as intended on k8s nodes I tainted using `kubectl taint nodes <node> test=test:NoSchedule`.

**Note:** I tested the Helm chart using all default values, so no FUSE nor logging-server deployments.

pr-link: Alluxio#13214
change-id: cid-81d850f3b53308a78b5bb207531b294ae09d8cdb
### What changes were proposed in this pull request?

This PR proposes some additions to the [Alluxio k8s guide](https://docs.alluxio.io/os/user/stable/en/deploy/Running-Alluxio-On-Kubernetes.html) to reflect some of the recent additions to the Alluxio Helm chart:

- `serviceAccount`: #13297
- `tolerations`: #13214
- `hostAliases`: #13226
- `strategy`: #13423

### Why are the changes needed?

Doc validation

### Does this PR introduce _any_ user-facing change?

Yes, the documentation.

pr-link: #13579
change-id: cid-cd456e708547bb121cdab5df1de7599860e6e089