[WIP] Add k8s Job definition for journal format #15092
base: master-2.x
Conversation
{{- $count := int (ternary .Values.master.count 1 (eq .Values.journal.type "EMBEDDED")) }}
{{ range $i, $e := until $count }}
{{- $journalFormatJobName := printf "%s-%d-%s" $statefulsetName $i $journalFormatJobNameSuffix }}
Unfortunately, in order to prevent Helm from overriding k8s Jobs in the same release but for different Master Pods (i.e. the embedded multi-master scenario), the current workaround is to define unique resource names for the Jobs.
In the future it may be possible to replace this with Indexed Jobs, which would allow the use of a single k8s Job resource instead.
- This feature is alpha as of k8s 1.21, and beta in 1.22+
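For illustration, a minimal sketch of what the Indexed Job alternative mentioned above could look like on k8s 1.21+ (the resource name and value references are assumptions, not the chart's actual keys; each completion index Pod receives a `JOB_COMPLETION_INDEX` environment variable):

```yaml
# Hypothetical sketch: one Indexed Job covering all masters instead of
# one uniquely named Job per master Pod.
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ $statefulsetName }}-journal-format   # illustrative name
spec:
  completionMode: Indexed                       # alpha in 1.21, beta in 1.22+
  completions: {{ .Values.master.count }}       # one completion per master
  parallelism: {{ .Values.master.count }}
```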
completions: 1
parallelism: 1
backoffLimit: 2
activeDeadlineSeconds: 30
`activeDeadlineSeconds` is set to 30 to match the previously removed implementation from this PR.
# POD_NAME is the StatefulSet name with the indexed suffix
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
# manually override entrypoint command in order to evaluate POD_NAME
command: ['sh', '-c', 'wait_for.sh "job" "${POD_NAME}-{{ $journalFormatJobNameSuffix }}"']
Since there is currently no easy way to access the Pod index number for a StatefulSet, we use the name.
- This was the workaround I found in a k8s issue on this topic
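To illustrate the name-based workaround, the ordinal can be recovered from the Pod name with a shell parameter expansion, since StatefulSet Pods are always named `<statefulset-name>-<ordinal>` (here `POD_NAME` is a hard-coded stand-in for the downward-API value):

```shell
# Strip everything up to the last '-' to recover the StatefulSet ordinal.
POD_NAME="alluxio-master-2"       # stand-in for metadata.name from the downward API
POD_INDEX="${POD_NAME##*-}"
echo "${POD_INDEX}"               # → 2
```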
@ZhuTopher I can see the state of this PR is still WIP and there are some unaddressed scenarios. Do you need my review (and discussion) at this stage, or should I wait until this PR is past the WIP stage?

Just a side note: from our past few interactions with Helm hooks, I get the vague feeling that they don't work well with some k8s cases simply because of limitations k8s is not able to resolve at the moment. And even if k8s becomes able to resolve those limitations, the fact that the hooks don't work with some older k8s versions remains. If that's the case, feel free to park the attempt until the time and tools are ripe, in the (near) future.

Sometimes it's very hard to find the truth about undocumented or unwanted behaviors. My opinion at this stage is that we don't have much familiarity with the k8s code at the moment, so let's not spend too much time digging. Asking on Stack Overflow is sometimes worth a try; in a few weeks some questions get answered unexpectedly.
@jiacheliu3 Thanks, yes, this PR is mostly here to park the work I started on this topic; there's no urgency in completing this feature. The implementation is "done", so from that standpoint feel free to review at your own leisure. As you mention, there are many factors to consider (supported k8s/Helm versions, sufficient hook understanding, use cases), and I wanted to use this PR as a place for that discussion as well.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
What changes are proposed in this pull request?
(Re)adding a k8s Job to handle master journal formatting, replacing the `initContainer` approach from earlier in this PR.

Why are the changes needed?
Currently, in order to perform journal formatting through the Helm chart, you must set the value `journal.format.runFormat=true`. This defines an `initContainer` inside the Alluxio master StatefulSet Pods which runs `alluxio format`, meaning that any time the Pods are restarted, the journal will be reformatted. The only way to stop this behaviour is to `helm upgrade` the chart release with the value `journal.format.runFormat=false`.

This change defines a k8s Job which performs the `alluxio format` outside of the lifecycle scope of the Pod by using Helm hooks. The end-user can modify the lifecycle of the journal format Job in a limited way through the new value `journal.format.frequency=("once"|"always")`:
- `journal.format.frequency="once"` will run `alluxio format` exactly once per Helm release (i.e. during `helm install`)
- `journal.format.frequency="always"` will run `alluxio format` on any change to the Helm release (i.e. `helm install`, `helm upgrade`, `helm rollback`)
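The two frequency values above could plausibly map to Helm hook annotations along these lines (a sketch, assuming `pre-install` alone fires only on `helm install`, while adding `pre-upgrade` and `pre-rollback` covers the other release changes; the value key mirrors the one introduced in this PR):

```yaml
# Hypothetical sketch of the Job's hook annotations per frequency setting.
metadata:
  annotations:
    {{- if eq .Values.journal.format.frequency "always" }}
    "helm.sh/hook": pre-install,pre-upgrade,pre-rollback
    {{- else }}
    "helm.sh/hook": pre-install
    {{- end }}
```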
Does this PR introduce any user facing changes?
Please list the user-facing changes introduced by your change, including:
- New Helm chart value `journal.format.frequency`
Design choice explanation
The goal of this design was to decouple the Pod lifecycle from the journal format lifecycle. Unfortunately, in doing so there was no way to guarantee that the Pods wait for the format Job to actually run to completion before starting without the usage of an `initContainer`, even with the `"helm.sh/hook": pre-install` annotation.

There appears to be some demand in the k8s community for this type of feature, which resulted in the following workaround: kubernetes/kubernetes#106802 (comment)
- Expose the `default` ServiceAccount credentials to the `initContainer`, with additional Role permissions to allow the container to query the k8s API for the Job list & status.

Note that because of this `initContainer` implementation, any time `journal.format.runFormat` is true the corresponding k8s Job(s) must be `Completed` and present in the k8s system. A consequence of that fact is that we must explicitly avoid deletion of completed/terminated Jobs. Per the k8s Job docs, even if we specify that completed Jobs should not be cleaned up, their corresponding Pods may still get garbage collected regardless. I don't believe this affects the Job resource itself; however, I was unable to find a concrete answer. The only thing I found in the docs was the aforementioned TTL mechanism.
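A minimal sketch of the extra RBAC the ServiceAccount approach above implies, assuming the chart binds it to the `default` ServiceAccount in the release namespace (resource names are illustrative, not taken from the chart):

```yaml
# Hypothetical Role/RoleBinding letting the initContainer read Job status.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: alluxio-job-reader        # illustrative name
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list"]        # enough to query the Job list & status
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alluxio-job-reader        # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: alluxio-job-reader
subjects:
  - kind: ServiceAccount
    name: default
```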
It could be possible to use a `post-install` Helm hook instead of `pre-install`, since the Master Pods are protected by the `initContainer`; however, there is no way to define an `ownerReferences` block without the resource UID (which is unknown until object creation), so you would have to PATCH the Job after the fact.

Test scenarios
- `PERSISTED` files resulted in reads from UFS, and any `NOT_PERSISTED` files from before the restart were unable to be read.