Implement support for multiple sizes huge pages #84051

Merged

Conversation

bart0sh
Contributor

@bart0sh bart0sh commented Oct 17, 2019

What type of PR is this?

/kind feature

What this PR does / why we need it:

This implements the recently merged update to the hugepages KEP.

It was tested on a local cluster with allocated huge pages of two sizes:

$ kubectl describe node |grep -A4 Allocatable
Allocatable:
  cpu:                24
  ephemeral-storage:  905877587288
  hugepages-1Gi:      2Gi
  hugepages-2Mi:      20Mi

With this pod configuration:

kind: Pod
apiVersion: v1
metadata:
  name: test
spec:
  containers:
    - name: test
      image: ubuntu
      command: ["/bin/sh", "-c", "sleep 30000"]
 
      resources:
        requests:
          cpu: "250m"
          hugepages-2Mi: 2Mi
          hugepages-1Gi: 2Gi
        limits:
          cpu: "250m"
          hugepages-2Mi: 2Mi
          hugepages-1Gi: 2Gi

      volumeMounts:
      - mountPath: /hugepages-2Mi
        name: hugepage-2mi
      - mountPath: /hugepages-1Gi
        name: hugepage-1gi

  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
  - name: hugepage-1gi
    emptyDir:
      medium: HugePages-1Gi

  restartPolicy: Never

Huge pages of both sizes were mounted correctly in the container:

# mount |grep hugepages
nodev on /hugepages-2Mi type hugetlbfs (rw,relatime,pagesize=2M)
nodev on /hugepages-1Gi type hugetlbfs (rw,relatime,pagesize=1024M)

Pod-level allocations for both huge page sizes look correct as well:

$ cat /sys/fs/cgroup/hugetlb/kubepods/burstable/pod1b2ff802-2560-4e77-ba16-96f0aa50530d/hugetlb.2MB.limit_in_bytes 
2097152
$ cat /sys/fs/cgroup/hugetlb/kubepods/burstable/pod1b2ff802-2560-4e77-ba16-96f0aa50530d/hugetlb.1GB.limit_in_bytes 
2147483648
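
The two limits above are simply the requested quantities expressed in bytes. A minimal Go sketch of that arithmetic follows; `parseBinaryQuantity` and `hugetlbLimitFile` are hypothetical helpers written for illustration (not kubelet code), and the `KB`/`MB`/`GB` file-name scheme is assumed from the two file names shown above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBinaryQuantity converts a quantity such as "2Mi" or "2Gi" into bytes.
// Hypothetical helper: only the Ki, Mi and Gi suffixes are handled.
func parseBinaryQuantity(q string) (int64, error) {
	units := []struct {
		suffix string
		factor int64
	}{{"Ki", 1 << 10}, {"Mi", 1 << 20}, {"Gi", 1 << 30}}
	for _, u := range units {
		if strings.HasSuffix(q, u.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(q, u.suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * u.factor, nil
		}
	}
	return 0, fmt.Errorf("unsupported quantity %q", q)
}

// hugetlbLimitFile builds the cgroup v1 limit file name for a page size in
// bytes, e.g. 2097152 -> "hugetlb.2MB.limit_in_bytes" (naming assumed from
// the listing above).
func hugetlbLimitFile(pageSize int64) string {
	units := []string{"B", "KB", "MB", "GB"}
	i := 0
	for pageSize >= 1024 && pageSize%1024 == 0 && i < len(units)-1 {
		pageSize /= 1024
		i++
	}
	return fmt.Sprintf("hugetlb.%d%s.limit_in_bytes", pageSize, units[i])
}

func main() {
	// Page size and requested limit, taken from the pod spec above.
	reqs := []struct{ pageSize, limit string }{{"2Mi", "2Mi"}, {"1Gi", "2Gi"}}
	for _, r := range reqs {
		ps, err := parseBinaryQuantity(r.pageSize)
		if err != nil {
			panic(err)
		}
		lim, err := parseBinaryQuantity(r.limit)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s = %d\n", hugetlbLimitFile(ps), lim)
	}
}
```

For the pod above this prints 2097152 for the 2MB file and 2147483648 for the 1GB file, matching the cgroup values.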

Does this PR introduce a user-facing change?:

Added support for multiple huge page sizes at the container level

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 17, 2019
@bart0sh
Contributor Author

bart0sh commented Oct 17, 2019

@k8s-ci-robot k8s-ci-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 17, 2019
@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

Member

@odinuge odinuge left a comment


Just some initial thoughts, but otherwise this makes much sense to me. Thanks for your work on this.

The questions about validation are open for discussion, and should probably not be a part of this PR.

Test failures look valid, so those should be addressed. We also have some e2e tests for hugepages, have you verified that they still pass?

Adding hold since this depends on support on node level: #82820
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 17, 2019
@odinuge
Member

odinuge commented Oct 17, 2019

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Oct 17, 2019
@bg-chun
Member

bg-chun commented Oct 17, 2019

I think the overall changes in empty_dir.go follow the KEP update as well.

Regarding medium: HugePages, I have a question.
It seems that the validation logic will accept the case below.
The sample Pod spec below uses both medium: HugePages and medium: HugePages-1Gi while consuming only 1Gi huge pages.

The KEP says: "For backwards compatibility, a pod that uses one page size should pass validation if a volume emptyDir medium=HugePages notation is used."
There is no actual restriction on volumes; the restriction applies only to the page size.

Would it be okay to allow this case?
I think it is just a little bit weird, but I guess there will be no issue consuming single-size huge pages in the pod below.
(Correct me if my understanding is wrong.)

[Pod Spec]

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: container1
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-1Gi: 2Gi
      limits:
        hugepages-1Gi: 2Gi
  - name: container2
    volumeMounts:
    - mountPath: /hugepages-1Gi
      name: hugepage-1Gi
    resources:
      requests:
        hugepages-1Gi: 2Gi
      limits:
        hugepages-1Gi: 2Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: hugepage-1Gi
    emptyDir:
      medium: HugePages-1Gi

@kad
Member

kad commented Oct 18, 2019

[Pod Spec]

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: container1
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-1Gi: 2Gi
      limits:
        hugepages-1Gi: 2Gi
  - name: container2
    volumeMounts:
    - mountPath: /hugepages-1Gi
      name: hugepage-1Gi
    resources:
      requests:
        hugepages-1Gi: 2Gi
      limits:
        hugepages-1Gi: 2Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: hugepage-1Gi
    emptyDir:
      medium: HugePages-1Gi

This is an interesting corner case. Volumes are Pod-level, and right now the sizes of those volumes are calculated as the sum of all containers' requests/limits for hugepages. In theory, if a pod has more than one container and more than one hugetlbfs mount, and not all of those volume mounts are used in all of the containers, we need to change the validation logic and the logic for calculating the size of each hugepage volume.

@odinuge
Member

odinuge commented Oct 18, 2019

@kad: This is an interesting corner case. Volumes are Pod-level, and right now the sizes of those volumes are calculated as the sum of all containers' requests/limits for hugepages. In theory, if a pod has more than one container and more than one hugetlbfs mount, and not all of those volume mounts are used in all of the containers, we need to change the validation logic and the logic for calculating the size of each hugepage volume.

Not sure if I understand what you mean. The "size" of a hugetlbfs volume mount is the page size used, not the amount of memory available. By default, a program can use all the preallocated huge page memory from such a mount, but this is limited via the hugetlb cgroup.

How to handle multiple containers with multiple hugepage sizes together with different volumes is maybe something we can incorporate into the KEP, but I think it should be OK as it is now, too. The reason we verify that the page sizes used in the mounts are in the requests/limits is to make sure that they are valid on the node the pod is scheduled to. It is not a problem that a container without any huge page limit/request mounts a hugetlbfs volume (even when we start supporting container-level cgroup enforcement), since the cgroup will not allow its processes to use it.

The problem arises when a pod uses a page size in a volume without having that size in the requests/limits:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: container
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages # alternatively HugePages-1Gi

This (the example above) can be valid on a given node if it supports 1GiB pages, but we cannot be sure since the scheduler doesn't take huge page support into account when finding a node. This is a case we successfully validate during volume mounting today, but the podspec is still "valid" in the sense that the apiserver will accept it.

In the pod below, we know that the node running the pod supports 1GiB huge pages, so container2 can mount the volume, but the cgroup enforcement will limit its usage to 0 (once we start supporting container-level enforcement; today we only enforce it at the pod level). So I think this example is a pod spec that should be treated as valid:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: container1
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-1Gi: 2Gi
      limits:
        hugepages-1Gi: 2Gi
  - name: container2
    volumeMounts:
    - mountPath: /hugepages-1Gi
      name: hugepage-1Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: hugepage-1Gi
    emptyDir:
      medium: HugePages-1Gi

@bart0sh bart0sh force-pushed the PR0079-multiple-sizes-hugepages branch from 5d89c9a to 1feb251 Compare October 21, 2019 14:49
@bart0sh
Contributor Author

bart0sh commented Oct 21, 2019

@odinuge @bg-chun thank you for the review! I've updated the PR according to your suggestions. Please review again.

@bart0sh
Contributor Author

bart0sh commented Oct 22, 2019

/test pull-kubernetes-e2e-gce-storage-slow

@bart0sh bart0sh force-pushed the PR0079-multiple-sizes-hugepages branch from 515630e to 1a5cad6 Compare February 13, 2020 13:00
@bart0sh
Contributor Author

bart0sh commented Feb 13, 2020

@liggitt > is it expected that pods today could be specifying these new hugepages mediums?

Yes, if they need to use multiple huge page sizes.

@@ -290,11 +290,11 @@ func (ed *emptyDir) setupHugepages(dir string) error {
 	}
 	// If the directory is a mountpoint with medium hugepages, there is no
 	// work to do since we are already in the desired state.
-	if isMnt && medium == v1.StorageMediumHugePages {
+	if isMnt && v1helper.IsHugePageMedium(medium) {
Member


previously, there could only be a single hugepages size per pod, so all hugepages mounts had to agree. with this PR, that is no longer the case, correct?
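
To make the distinction in this thread concrete, here is a rough Go sketch of how the legacy `HugePages` medium and the size-specific `HugePages-<size>` mediums could be told apart. It is an illustration only: `isHugePageMedium` and `pageSizeFromMedium` are hypothetical stand-ins and are not claimed to match the `v1helper.IsHugePageMedium` implementation in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	mediumHugePages       = "HugePages"  // legacy medium without an explicit page size
	mediumHugePagesPrefix = "HugePages-" // prefix of size-specific mediums, e.g. HugePages-2Mi
)

// isHugePageMedium reports whether medium is either the legacy value or a
// size-specific one (hypothetical helper for illustration).
func isHugePageMedium(medium string) bool {
	return medium == mediumHugePages || strings.HasPrefix(medium, mediumHugePagesPrefix)
}

// pageSizeFromMedium returns the size suffix ("2Mi", "1Gi") of a
// size-specific medium. The boolean is false for the legacy medium, for an
// empty suffix, and for non-hugepage mediums.
func pageSizeFromMedium(medium string) (string, bool) {
	if !strings.HasPrefix(medium, mediumHugePagesPrefix) {
		return "", false
	}
	size := strings.TrimPrefix(medium, mediumHugePagesPrefix)
	return size, size != ""
}

func main() {
	for _, m := range []string{"HugePages", "HugePages-2Mi", "HugePages-1Gi", "Memory"} {
		size, sized := pageSizeFromMedium(m)
		fmt.Printf("%-14s hugepage=%-5v sized=%-5v size=%q\n", m, isHugePageMedium(m), sized, size)
	}
}
```

With a size in the medium name, two mounts in the same pod no longer have to agree on a page size; only the legacy `HugePages` medium still relies on the pod using a single size.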

@bart0sh bart0sh force-pushed the PR0079-multiple-sizes-hugepages branch 2 times, most recently from 6df5aa5 to 54f60a1 Compare February 13, 2020 15:00
@bart0sh
Contributor Author

bart0sh commented Feb 14, 2020

/retest

@bart0sh bart0sh force-pushed the PR0079-multiple-sizes-hugepages branch from 54f60a1 to c38dbb6 Compare February 19, 2020 12:20
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 19, 2020
@liggitt
Member

liggitt commented Feb 19, 2020

API validation changes lgtm, will defer the rest of the review to node/storage folks. Please assign me once those have lgtm and I'll add approval for the API bits.

/unassign

@bart0sh bart0sh force-pushed the PR0079-multiple-sizes-hugepages branch from c38dbb6 to 4430f88 Compare February 19, 2020 16:14
bart0sh and others added 2 commits February 19, 2020 18:15
This implementation allows a Pod to request multiple hugepage resources
of different sizes and mount hugepage volumes using the storage medium
HugePages-<size>, e.g.

spec:
  containers:
    resources:
      requests:
        hugepages-2Mi: 2Mi
        hugepages-1Gi: 2Gi
    volumeMounts:
      - mountPath: /hugepages-2Mi
        name: hugepage-2mi
      - mountPath: /hugepages-1Gi
        name: hugepage-1gi
    ...
  volumes:
    - name: hugepage-2mi
      emptyDir:
        medium: HugePages-2Mi
    - name: hugepage-1gi
      emptyDir:
        medium: HugePages-1Gi

NOTE: This is an alpha feature.
      Feature gate HugePageStorageMediumSize must be enabled for it to work.
Co-Authored-By: Odin Ugedal <odin@ugedal.com>
@bart0sh bart0sh force-pushed the PR0079-multiple-sizes-hugepages branch 3 times, most recently from 23d1c36 to 03ecc20 Compare February 20, 2020 11:55
Extended GetMountMedium function to check if hugetlbfs volume
is mounted with the page size equal to the medium size.

Page size is obtained from the 'pagesize' mount option of the
mounted hugetlbfs volume.
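
The check described in this commit message can be sketched in Go roughly as follows. This is not the actual `GetMountMedium` code: `pageSizeFromMountOptions` and `mountMatchesPageSize` are hypothetical names, and the option format (`pagesize=2M`, `pagesize=1024M`) is assumed from the `mount` output in the PR description.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageSizeFromMountOptions scans a comma-separated mount option string for
// "pagesize=<n>[K|M|G]" and returns the page size in bytes.
func pageSizeFromMountOptions(opts string) (int64, error) {
	for _, opt := range strings.Split(opts, ",") {
		if !strings.HasPrefix(opt, "pagesize=") {
			continue
		}
		val := strings.TrimPrefix(opt, "pagesize=")
		if val == "" {
			return 0, fmt.Errorf("empty pagesize option in %q", opts)
		}
		factor := int64(1)
		switch val[len(val)-1] {
		case 'K':
			factor = 1 << 10
		case 'M':
			factor = 1 << 20
		case 'G':
			factor = 1 << 30
		}
		if factor != 1 {
			val = val[:len(val)-1] // drop the unit suffix before parsing
		}
		n, err := strconv.ParseInt(val, 10, 64)
		if err != nil {
			return 0, fmt.Errorf("invalid pagesize in %q: %v", opts, err)
		}
		return n * factor, nil
	}
	return 0, fmt.Errorf("no pagesize option in %q", opts)
}

// mountMatchesPageSize reports whether the mounted page size equals the page
// size implied by the volume medium (given here in bytes).
func mountMatchesPageSize(opts string, mediumPageSize int64) bool {
	got, err := pageSizeFromMountOptions(opts)
	return err == nil && got == mediumPageSize
}

func main() {
	// Option strings copied from the mount output in the PR description.
	fmt.Println(mountMatchesPageSize("rw,relatime,pagesize=2M", 2<<20))    // HugePages-2Mi
	fmt.Println(mountMatchesPageSize("rw,relatime,pagesize=1024M", 1<<30)) // HugePages-1Gi
	fmt.Println(mountMatchesPageSize("rw,relatime,pagesize=2M", 1<<30))    // mismatch
}
```

Comparing in bytes avoids having to treat `1024M` and `1G` as different spellings of the same page size.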
@bart0sh
Contributor Author

bart0sh commented Feb 20, 2020

/retest

@derekwaynecarr
Member

the kubelet changes still lgtm.

/approve
/lgtm

assign @liggitt

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 25, 2020
@liggitt
Member

liggitt commented Feb 25, 2020

/approve

@liggitt liggitt added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bart0sh, derekwaynecarr, liggitt


@k8s-ci-robot k8s-ci-robot merged commit 851efa8 into kubernetes:master Feb 25, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Feb 25, 2020
-	if len(hugePageResources) > 1 {
-		allErrs = append(allErrs, field.Invalid(specPath, hugePageResources, "must use a single hugepage size in a pod spec"))
+	if !opts.AllowMultipleHugePageResources {
+		allErrs = append(allErrs, ValidatePodSingleHugePageResources(pod, specPath)...)
Contributor


if HugePageStorageMediumSize is enabled, should we validate the name and format etc. too? What if user gives an arbitrary string in this field?

Member


if I understand correctly, the user could already give an arbitrary string in that field. if that's the case, we cannot easily tighten validation on an existing field

Contributor

@jingxu97 jingxu97 Feb 28, 2020


Oh, ValidatePodSingleHugePageResources validates whether multiple huge page sizes are specified in the resources, so it makes sense to validate only if the feature is disabled.

So one scenario is rollback: if MultipleHugePageResource is enabled during pod creation, it sets multiple sizes. If during pod rollback, MultipleHugePageResource is disabled, it might fail to update pod?

I also checked a few cases; the following passes validation, which does not seem right:

resources:
  limits:
    hugepages-xGi: 100Mi

volumes:
  - name: hugepage
    emptyDir:
      medium: abc

Member


If during pod rollback, MultipleHugePageResource is disabled, it might fail to update pod?

that is addressed here:

func (podStrategy) ValidateUpdate(ctx context.Context, obj, old runtime.Object) field.ErrorList {
	oldFailsSingleHugepagesValidation := len(validation.ValidatePodSingleHugePageResources(old.(*api.Pod), field.NewPath("spec"))) > 0
	opts := validation.PodValidationOptions{
		// Allow multiple huge pages on pod create if feature is enabled or if the old pod already has multiple hugepages specified
		AllowMultipleHugePageResources: oldFailsSingleHugepagesValidation || utilfeature.DefaultFeatureGate.Enabled(features.HugePageStorageMediumSize),
	}

the following can pass validation which seems not right

see discussion in #52936 (comment)
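
The rollback rule quoted in this comment can be illustrated with a small self-contained Go sketch. The `container` type and all function names here are simplified, hypothetical stand-ins rather than the real API types or validation functions; the sketch only shows the decision "allow multiple sizes if the gate is on, or if the old pod already used multiple sizes".

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const hugePagesResourcePrefix = "hugepages-"

// container is a simplified stand-in for a pod container: resource name -> quantity.
type container struct {
	requests map[string]string
	limits   map[string]string
}

// hugePageSizes collects the distinct hugepages-<size> resource names used
// across all containers, in requests or limits.
func hugePageSizes(containers []container) map[string]struct{} {
	sizes := map[string]struct{}{}
	for _, c := range containers {
		for _, res := range []map[string]string{c.requests, c.limits} {
			for name := range res {
				if strings.HasPrefix(name, hugePagesResourcePrefix) {
					sizes[name] = struct{}{}
				}
			}
		}
	}
	return sizes
}

// usesSingleHugePageSize is the "single size per pod" check.
func usesSingleHugePageSize(containers []container) bool {
	return len(hugePageSizes(containers)) <= 1
}

// allowMultiple mirrors the rule in the snippet above: multiple sizes are
// allowed if the gate is enabled or the old pod already failed the
// single-size check.
func allowMultiple(gateEnabled bool, oldPod []container) bool {
	return gateEnabled || !usesSingleHugePageSize(oldPod)
}

// validate rejects multiple sizes only when they are not allowed.
func validate(newPod []container, allowMulti bool) error {
	if !allowMulti && !usesSingleHugePageSize(newPod) {
		return errors.New("must use a single hugepage size in a pod spec")
	}
	return nil
}

func main() {
	multi := []container{{requests: map[string]string{"hugepages-2Mi": "2Mi", "hugepages-1Gi": "2Gi"}}}
	// Gate disabled after a rollback, but the stored pod already has two sizes: the update passes.
	fmt.Println(validate(multi, allowMultiple(false, multi)))
	// Gate disabled and no pre-existing multi-size pod: rejected.
	fmt.Println(validate(multi, allowMultiple(false, nil)))
}
```

Deriving the option from the old object is what keeps an existing multi-size pod updatable after the feature gate is switched off.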

cynepco3hahue pushed a commit to cynepco3hahue/api that referenced this pull request Mar 23, 2020
As part of the telco effort, we should make it possible
to use huge pages of multiple sizes on a node.

Kubernetes supports this feature as alpha in 1.18; to enable it,
enable the feature gate `HugePageStorageMediumSize`;
see kubernetes/kubernetes#84051.

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.