
Allow volume ownership to be only set after fs formatting. #69699

Closed
msau42 opened this issue Oct 11, 2018 · 58 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

@msau42
Member

msau42 commented Oct 11, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
@kubernetes/sig-storage-feature-requests

What happened:
Today, the fsgroup setting is applied recursively on every mount. This can make mounting very slow if the volume has many files. Most of the time, a pod using a volume should use the same fsgroup every time and should not need to change it across multiple pods.

I'm proposing that we add a flag so that the fsgroup is applied only once, right after initial fs formatting, and not on every mount.
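
For reference, the setting discussed later in this thread, fsGroupChangePolicy in the pod's securityContext, is what eventually shipped for this; a minimal sketch, with placeholder names:

apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-example                     # placeholder name
spec:
  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"   # skip the recursive chown when the volume root already matches
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: example-pvc                # placeholder claim name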

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. kind/feature Categorizes issue or PR as related to a new feature. labels Oct 11, 2018
@krmayankk

Doing recursive group ownership changes for all files on every mount is excessive. Where are you proposing the flag should live?

@msau42
Member Author

msau42 commented Oct 15, 2018

I haven't thought about exactly where in the API we could add it. Maybe in the same place where fsgroup is specified. There are also other issues around volume ownership that may impact this, and it would benefit from someone taking full ownership of this area to investigate a complete solution.

Ref other ownership issues: #2630, #57923

@jsafrane
Member

cc @tsmetana @gnufied

To add even more confusion and problems, SELinux labels are applied by the container runtime, and CRI does not offer any option to skip this. All files on a volume are labelled by the container runtime every time a container is started. It would be worth adding an option to skip this step somehow.
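
For context, the label the runtime applies comes from the pod's securityContext.seLinuxOptions (or is assigned by the runtime if unset); setting it explicitly does not avoid the recursive relabel, it only controls which label is applied. A minimal sketch, with a placeholder name and level:

apiVersion: v1
kind: Pod
metadata:
  name: selinux-label-example   # placeholder name
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"     # label the runtime applies to the container; volume files are relabelled to match
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]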

@tsmetana
Member

@jsafrane yes... I've tried to propose something similar in the past but it was turned down since it was not a complete solution (SELinux...).

I have only vague knowledge of CRI, but quickly scanning the API shows some SELinux boolean in the Mount message: https://github.com/kubernetes/kubernetes/blob/release-1.5/pkg/kubelet/api/v1alpha1/runtime/api.proto#L139

I will try to find out how this is being used. Perhaps there is a way to also turn off the recursive re-labelling using the existing APIs.

@dimm0

dimm0 commented Nov 2, 2018

Having issues mounting a filesystem with a bunch of files in it; it's taking 14 minutes to start a container. Would love to see an option to disable the recursive chown.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2019
@gnufied
Member

gnufied commented Jan 31, 2019

/remove-lifecycle-stale

@dvlato

dvlato commented Feb 11, 2019

Hi,

We also hit the problem of the recursive chown when we set fsGroup. In our case, I don't think we need to recursively change the file permissions, as the files are created by the pod (in ReadWriteOnce mode), so we only need to change the permissions on the root folder. Would that be a better fit than trying to find a place to change the owner only once?

Additionally, is there a way to diagnose this issue? Now that we have removed the fsGroup, we get timeouts in a slightly different place (between container created and started). It seems to be a similar issue, but we cannot find any indication of the cause (AFAIK we don't use SELinux at all, so it shouldn't be SELinux relabeling). How can we confirm that it is a chmod/chown that is causing the problem? If you know how to verify whether SELinux relabeling is happening, I would also be most thankful.

@gnufied
Member

gnufied commented Feb 11, 2019

There is a metric for the volume mount operation. It should tell you whether it is indeed the mount operation (which includes the chown/chmod) that is taking time, or something else.

Whether SELinux relabelling is playing a role can be verified by checking if SELinux is enabled on the node (getenforce); beyond that it depends on the type of volume you are using. What kind of volume are you using? Is it a block storage volume type?

@dvlato

dvlato commented Feb 11, 2019

I know it was the mount operation, as the error message says 'Unable to mount'... However, I don't know if it's chmoding, chowning, or doing something different.

The volume is an EBS volume, so yes, block storage. I was expecting something in the kube logs or maybe the API to give me that information, as I cannot SSH into the nodes (like most devs, I assume, and I'd love to find that information myself rather than finding someone who has time to get it for me at the moment I have time...). Also, I don't feel that 'SELinux is enabled, so it must be that' is the best troubleshooting technique. Is getenforce the best/easiest way to find out whether it could be SELinux? Of course, thank you for the information; it will serve me if there's nothing else, but I was expecting to be able to get more info from Kubernetes (and I think it would be worth improving traceability in this regard otherwise).

@dvlato

dvlato commented Feb 11, 2019

Anyway, thank you very much for the super prompt response!

@dimm0

dimm0 commented Feb 11, 2019

SELinux is disabled on all our nodes. It's not doing the chown anymore in the recent Rook version, but it's clearly still doing something else, since pods take longer to start under a non-root user compared to root.

@dvlato

dvlato commented Feb 12, 2019

In my case, they are taking a long time even when running under the root user with no fsGroup... It's clearly related to the files inside the EBS volume, as the containers that mount it get stuck between "Created container" and "Started container" only if there are files in the volume (around 6 million in this case).

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 14, 2019
@dvlato

dvlato commented Mar 14, 2019

Has anyone else experienced the same problem?

@gnufied
Member

gnufied commented Mar 14, 2019

Do you have selinux enabled?

@msau42
Member Author

msau42 commented Apr 4, 2019

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 4, 2019
@dimm0

dimm0 commented Apr 4, 2019

I'm having lots of issues with chown and selinux relabelling in rook ceph volumes, and patiently waiting for a fix from k8s...

@dimm0

dimm0 commented Apr 4, 2019

Users who are running Jupyter (containers with non-root users) and have 100K+ files in their volumes are unable to start, because Jupyter times out before k8s is done chowning every single file.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2019
@azalio

azalio commented Jul 18, 2019

/remove-lifecycle stale

@unixfox

unixfox commented Apr 4, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2021
@dimm0

dimm0 commented May 15, 2021

This still doesn't work in k8s 1.20.5. I tried setting fsGroupChangePolicy: "OnRootMismatch" and enabling the feature gates in the apiserver/controller/kubelet (even though those should already be enabled by default).

Anything else I could try?

@gnufied
Member

gnufied commented May 17, 2021

@dimm0 can you provide more details? Do you see the volume being recursively chown/chmod'ed on each mount even if you set fsGroupChangePolicy: "OnRootMismatch"? Can you post your pod spec and logs from the kubelet? The OnRootMismatch policy is not designed to avoid the first-time recursive chown of volumes; it only guarantees that no recursive chown is performed if the fsgroup already matches.

Apart from our own testing, other folks have confirmed that OnRootMismatch works: longhorn/longhorn#2131 (comment)

@dimm0

dimm0 commented May 17, 2021

Do you see the volume being recursively chown/chmod'ed on each mount even if you set fsGroupChangePolicy: "OnRootMismatch"?

Yes. I'm running JupyterHub with a non-root account, and users are having issues starting containers whose volumes have many (>400K) files. I also verified that if I set the fsGroup in the pod's securityContext, the files get chowned. I tried starting the pod several times with the same volume and manually chowning files inside the volume between restarts.

Ok, I'll post logs

My pod:

apiVersion: v1
kind: Pod
metadata:
  name: vol-pod
spec:
  containers:
  - name: vol-container
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: "/data"
      name: vol
  securityContext:
    fsGroup: 100
    fsGroupChangePolicy: "OnRootMismatch"
  volumes:
  - name: vol
    persistentVolumeClaim:
      claimName: examplevol

@dimm0

dimm0 commented May 17, 2021

kubelet.log

vol-pod2 pod

@gnufied
Member

gnufied commented May 17, 2021

@dimm0 so what is the bug? Are recursive permission changes not being skipped as expected? You appear to be using the flexvolume version of the Rook plugin; not sure if that could be the reason (it should not be). But it is very hard to tell based on attached logs.

Next steps:

  1. Can you try to create a minimal working example, verify mount times by creating lots of files in the volume, and see if using fsGroupChangePolicy makes a difference (a sketch of such an example follows this list)? You can compare the timings using the mount metrics. There is also volume_fsgroup_recursive_apply, which should not be emitted if recursive permission changes are being skipped.
  2. See if flexvolume driver supports fsgroup.
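
A minimal reproduction sketch for step 1 (names and storage class are placeholders; populate the volume with many files, then compare mount times with and without the fsGroupChangePolicy line):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: many-files-pvc                 # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  storageClassName: rook-ceph-block    # placeholder; use the CSI storage class under test
---
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-timing-test            # placeholder name
spec:
  securityContext:
    fsGroup: 100
    fsGroupChangePolicy: "OnRootMismatch"   # remove this line for the baseline run
  containers:
  - name: test
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: many-files-pvc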

@dimm0

dimm0 commented May 18, 2021

But it is very hard to tell based on attached logs.

Exactly. How can I debug this?

There's a CSI plugin, but some (older) volumes are still flexVolumes. The volume in question is CSI.

I tried creating just a single file in the volume, chowning it to 22:22, then killing the pod and creating it again. The file inside is 22:100 after the pod starts.

I can give you access to the namespace if you want... I have a minimal example above. I can send configs.

Is chowning shown in logs anywhere?

@dimm0

dimm0 commented May 18, 2021

Is it something to do with the CSI driver, or is it fully on the Kubernetes side? Should I bug the Rook folks about this?

@gnufied
Member

gnufied commented May 18, 2021

But it is very hard to tell based on attached logs.

Exactly. How can I debug this?

Is chowning shown in logs anywhere?

If you increase the kubelet's log level to 3, you should be able to see the following message whenever applicable:

klog.V(3).InfoS("Skipping permission and ownership change for volume", "path", mounter.GetPath())

@oxr463

oxr463 commented Jun 3, 2021

Is there a big performance impact from this, e.g., increased memory usage on deployments?

@gnufied
Member

gnufied commented Jun 3, 2021

No. It should not result in any difference in memory usage.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 1, 2021
@m-yosefpor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 1, 2021
@msau42
Member Author

msau42 commented Sep 1, 2021

kubernetes/enhancements#695 is beta in 1.20 and targeting GA in 1.23. I think we can close this issue and track progress through the enhancement issue.
/close

@k8s-ci-robot
Contributor

@msau42: Closing this issue.

In response to this:

kubernetes/enhancements#695 is beta in 1.20 and targeting GA in 1.23. I think we can close this issue and track progress through the enhancement issue.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@m-yosefpor

@msau42 thanks for the clarification. However, kubernetes/enhancements#695 only fixes the kubelet's recursive walk; the recursive chcon applied by the container runtime is still an issue (see #69699 (comment)). As @jsafrane also mentioned, we need some way to tell the runtime not to apply those chcon calls (via a change in CRI or some other method).

@gnufied
Member

gnufied commented Sep 2, 2021

Yes, ideally the SELinux work will be tracked via kubernetes/enhancements#1710. The SELinux enhancement is being postponed because of a lack of time/contributors.

@JJwangbilin

Tested and verified; this solves the problem:
Set pod.securityContext.fsGroupChangePolicy to "OnRootMismatch".
(1) The first mount may still trigger a chown -R, to ensure that the group of the mount point's root directory matches pod.securityContext.fsGroup.
(2) Subsequent mounts skip the chown, because the group of the mount point's root directory already matches pod.securityContext.fsGroup (only the root of the mount point is checked, not the contents of the volume).
(3) OnRootMismatch means: skip the chown if the root directory's group already matches. It is not the default and must be set explicitly.
(4) This parameter is supported starting with Kubernetes 1.20.

@aiici

aiici commented Mar 6, 2024

I came across this on v1.18.1; if I may, I suggest upgrading to v1.18.20 to fix the issue.
