SELinux prevents access to existing sub-directories in statically provisioned PVs #481

Closed
gfschmidt opened this issue Aug 30, 2021 · 32 comments
Labels
Customer Impact: Localized high impact (3) - Reduction of function. Significant impact to workload.
Customer Probability: Medium (3) - Issue occurs in normal path but specific limited timing window, or other mitigating factor.
Severity: 3 - Indicates the issue is on the priority list for the next milestone.
Type: Bug - Indicates issue is an undesired behavior, usually caused by code error.

Comments

@gfschmidt
Member

gfschmidt commented Aug 30, 2021

Describe the bug

When using static provisioning in OpenShift 4.7 with Scale CSI v2.2.0 to gain access to an entire IBM Spectrum Scale file system (e.g. mounted at /mnt/fs1 in CNSA) that contains existing data, SELinux on the worker node prevents access to existing sub-directories, even when the pod is run under OpenShift by a user with the cluster-admin role. This was observed with IBM Spectrum Scale CNSA v5.1.1.1 and CSI v2.2.0 on x86 and OpenShift 4.7.17.

To Reproduce

  1. We have an existing file system in IBM Spectrum Scale that we want to make available in OpenShift with static provisioning:

File system attributes on CNSA:

sh-4.4# mmlscluster
GPFS cluster information
========================
  GPFS cluster name:         ibm-spectrum-scale.ibm-spectrum-scale.ocp4.scale.ibm.com
  GPFS cluster id:           17399599334533218545

sh-4.4# mmlsfs fs1 --uid
flag                value                    description
------------------- ------------------------ -----------------------------------
 --uid              099B6A7A:5EB99721        File system UID

sh-4.4# mmremotefs show     
Local Name  Remote Name  Cluster name       Mount Point        Mount Options    Automount  Drive  Priority
fs1         ess3000_1M   ess3000.bda.scale.ibm.com /mnt/fs1           rw               yes          -        0

File system attributes on remote storage cluster:

[root@fscc-sr650-12 dynamic2]# mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         ess3000.bda.scale.ibm.com
  GPFS cluster id:           215057217487177715

[root@fscc-sr650-12 dynamic2]# mmlsfs ess3000_1M --uid
flag                value                    description
------------------- ------------------------ -----------------------------------
 --uid              099B6A7A:5EB99721        File system UID

[root@fscc-sr650-12 GPFS]# mmlsfs ess3000_1M -Q --perfileset-quota --filesetdf
flag                value                    description
------------------- ------------------------ -----------------------------------
 --filesetdf        yes                      Fileset df enabled?
 -Q                 user;group;fileset       Quotas accounting enabled
                    user;group;fileset       Quotas enforced
                    none                     Default quotas enabled
 --perfileset-quota no                       Per-fileset quota enforcement

[root@fscc-sr650-12 GPFS]# mmlsconfig enforceFilesetQuotaOnRoot
enforceFilesetQuotaOnRoot yes  

[root@fscc-sr650-12 GPFS]# mmlsconfig controlSetxattrImmutableSELinux
controlSetxattrImmutableSELinux yes  

[root@fscc-sr650-12 dynamic2]# ls -al /gpfs/ess3000_1M/
total 262
drwxr-xr-x. 13 root       root       262144 Aug 30 13:42 .
drwxr-xr-x.  5 root       root           62 Apr 14 22:42 ..
drwxrwxrwx.  2 root       root         4096 Aug 30 11:37 nfs-export

[root@fscc-sr650-12 dynamic2]# ls -al /gpfs/ess3000_1M/nfs-export/
total 258
drwxrwxrwx.  2 root root   4096 Aug 30 14:33 .
drwxr-xr-x. 13 root root 262144 Aug 30 13:42 ..
-rw-r--r--.  1 root root   1148 Aug 25 19:00 fstab
-rw-r--r--.  1 root root    158 Aug 25 19:00 hosts
-rw-rw-rw-.  1 root root      0 Aug 30 11:37 readall

  2. Create a static PV pv000 for that GPFS root directory (mounted at /mnt/fs1 with CNSA 5.1.1.1):

We create a static PV for the directory /mnt/fs1 on CNSA, which is backed by the remote file system /gpfs/ess3000_1M on a remote storage cluster.
To allow a direct binding between the PVC and the specific PV holding the data, we make use of Kubernetes labels and a faked storage class name "static". This name acts as an annotation or identifier rather than a real storage class (which it is not). It prevents the default storage class (if one is defined) from being applied when the storage class is omitted in the PVC, and ensures that the PVC with its specific labels is matched against the specific PV from the pool of statically provisioned PVs.

[gero@oc6314270534]$ cat pv000.yaml 
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv000
  labels:
    product: ibm-spectrum-scale
    type: test
spec:
  storageClassName: static
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "17399599334533218545;099B6A7A:5EB99721;path=/mnt/fs1"

[gero@oc6314270534]$ oc apply -f pv000.yaml 
persistentvolume/pv000 created

[gero@oc6314270534]$ oc get pv
NAME                             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                                         STORAGECLASS                  REASON   AGE
pv000                            100Gi      RWX            Retain           Available                                                                 static                                 3s
  3. Create a PVC for that volume

The following PVC achieves a direct 1:1 match between the claim and the specific PV created above, using the labels and the faked storage class name as selectors:

[gero@oc6314270534]$ cat pvc000.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc000
spec:
  storageClassName: static
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  selector: 
    matchLabels:
      product: ibm-spectrum-scale
      type: test

[gero@oc6314270534]$ oc apply -f pvc000.yaml 
persistentvolumeclaim/pvc000 created

[gero@oc6314270534]$ oc get pvc
NAME     STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc000   Bound    pv000    100Gi      RWX            static         4s
  4. Create a pod as a user with the cluster-admin role
[gero@oc6314270534]$ cat pod.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: pod
spec:
  containers:
    - name: test
      image: registry.access.redhat.com/ubi8/ubi-minimal:latest
      command: [ "/bin/sh", "-c", "--" ]
      args: [ "while true; do echo $(hostname) $(date +%Y%m%d-%H:%M:%S) | tee -a /train/train1.log ; sleep 5 ; done;" ]
      volumeMounts:
        - name: vol1
          mountPath: "/data"
  volumes:
    - name: vol1
      persistentVolumeClaim:
        claimName: pvc000

[gero@oc6314270534]$ oc apply -f pod.yaml 
pod/pod created

[gero@oc6314270534]$ oc get pods
NAME   READY   STATUS              RESTARTS   AGE
pod    1/1     Running             0          13s
  5. Verify that access to existing sub-directories is denied by SELinux
[gero@oc6314270534]$ oc rsh pod

sh-4.4# id
uid=0(root) gid=0(root) groups=0(root)
sh-4.4# whoami
root

sh-4.4# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         446G   29G  418G   7% /
tmpfs            64M     0   64M   0% /dev
tmpfs           126G     0  126G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs           126G  105M  126G   1% /etc/hostname
fs1              15T  6.5G   15T   1% /data
/dev/sda4       446G   29G  418G   7% /etc/hosts
tmpfs           126G   28K  126G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           126G     0  126G   0% /proc/acpi
tmpfs           126G     0  126G   0% /proc/scsi
tmpfs           126G     0  126G   0% /sys/firmware

sh-4.4# ls -al /data
total 262
drwxr-xr-x. 13 root       root       262144 Aug 30 11:42 .
dr-xr-xr-x.  1 root       root           18 Aug 30 12:10 ..
drwxrwxrwx.  2 root       root         4096 Aug 30 09:37 nfs-export

# Read/Write access to mount point /data is granted
sh-4.4# touch /data/testfile
sh-4.4# ls -al /data/                  
total 262
drwxr-xr-x. 13 root       root       262144 Aug 30 12:11 .
dr-xr-xr-x.  1 root       root           18 Aug 30 12:10 ..
-rw-r--r--.  1 root       root            0 Aug 30 12:11 testfile
drwxrwxrwx.  2 root       root         4096 Aug 30 09:37 nfs-export
sh-4.4# rm /data/testfile 

# Read/Write access to existing sub-directory under mount point /data is denied by SELinux
sh-4.4# ls -al /data/nfs-export/
ls: cannot open directory '/data/nfs-export/': Permission denied
sh-4.4# cat /data/nfs-export/hosts
cat: /data/nfs-export/hosts: Permission denied
sh-4.4# cat /data/nfs-export/hostss
cat: /data/nfs-export/hostss: No such file or directory
sh-4.4# cat /data/nfs-export/readall
cat: /data/nfs-export/readall: Permission denied
sh-4.4# ls -al /data/nfs-export/readall
ls: cannot access '/data/nfs-export/readall': Permission denied

sh-4.4# stat /data/
  File: /data/
  Size: 262144    	Blocks: 512        IO Block: 262144 directory
Device: 100039h/1048633d	Inode: 3           Links: 13
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-08-30 12:11:52.072513451 +0000
Modify: 2021-08-30 12:11:58.880299000 +0000
Change: 2021-08-30 12:11:58.880299000 +0000
 Birth: -

sh-4.4# stat /data/nfs-export/
  File: /data/nfs-export/
  Size: 4096      	Blocks: 1          IO Block: 262144 directory
Device: 100039h/1048633d	Inode: 19456       Links: 2
Access: (0777/drwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-08-30 09:38:09.693424826 +0000
Modify: 2021-08-30 09:37:56.904677000 +0000
Change: 2021-08-30 09:37:56.904677000 +0000
 Birth: -

sh-4.4# ls -al /data/nfs-export/
ls: cannot open directory '/data/nfs-export/': Permission denied

sh-4.4# cat /data/nfs-export/hosts
cat: /data/nfs-export/hosts: Permission denied
  6. Check the SELinux logs on the OpenShift worker node where the pod is running
[gero@oc6314270534]$ oc get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE     IP             NODE                          NOMINATED NODE   READINESS GATES
pod    1/1     Running   0          3m19s   10.130.2.181   worker04.ocp4.scale.ibm.com   <none>           <none>

root@worker04 core]# aureport -a

AVC Report
===============================================================
# date time comm subj syscall class permission obj result event
===============================================================
41. 08/30/2021 12:11:11 ls system_u:system_r:container_t:s0:c20,c25 257 dir read unconfined_u:object_r:unlabeled_t:s0 denied 281388
42. 08/30/2021 12:11:16 ls system_u:system_r:container_t:s0:c20,c25 257 dir read system_u:object_r:unlabeled_t:s0 denied 281389
43. 08/30/2021 12:11:23 cat system_u:system_r:container_t:s0:c20,c25 257 file read system_u:object_r:unlabeled_t:s0 denied 281391
44. 08/30/2021 12:11:30 sh system_u:system_r:container_t:s0:c20,c25 257 dir read system_u:object_r:unlabeled_t:s0 denied 281392
45. 08/30/2021 12:11:33 cat system_u:system_r:container_t:s0:c20,c25 257 file read unconfined_u:object_r:unlabeled_t:s0 denied 281393
46. 08/30/2021 12:12:17 ls system_u:system_r:container_t:s0:c20,c25 257 dir read system_u:object_r:unlabeled_t:s0 denied 281397
47. 08/30/2021 12:12:24 cat system_u:system_r:container_t:s0:c20,c25 257 file read system_u:object_r:unlabeled_t:s0 denied 281438

[root@worker04 core]# ausearch -m avc | tail -20
----
time->Mon Aug 30 12:11:30 2021
type=PROCTITLE msg=audit(1630325490.681:281392): proctitle="/bin/sh"
type=SYSCALL msg=audit(1630325490.681:281392): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=564bd8f6e290 a2=90800 a3=0 items=0 ppid=2508113 pid=2508123 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="sh" exe="/usr/bin/bash" subj=system_u:system_r:container_t:s0:c20,c25 key=(null)
type=AVC msg=audit(1630325490.681:281392): avc:  denied  { read } for  pid=2508123 comm="sh" name="nfs-export" dev="gpfs" ino=19456 scontext=system_u:system_r:container_t:s0:c20,c25 tcontext=system_u:object_r:unlabeled_t:s0 tclass=dir permissive=0
----
time->Mon Aug 30 12:11:33 2021
type=PROCTITLE msg=audit(1630325493.138:281393): proctitle=2F7573722F62696E2F636F72657574696C73002D2D636F72657574696C732D70726F672D73686562616E673D636174002F7573722F62696E2F636174002F646174612F6E66732D6578706F72742F72656164616C6C
type=SYSCALL msg=audit(1630325493.138:281393): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=7ffd810e6e06 a2=0 a3=0 items=0 ppid=2508123 pid=2509511 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="cat" exe="/usr/bin/coreutils" subj=system_u:system_r:container_t:s0:c20,c25 key=(null)
type=AVC msg=audit(1630325493.138:281393): avc:  denied  { read } for  pid=2509511 comm="cat" name="readall" dev="gpfs" ino=66048 scontext=system_u:system_r:container_t:s0:c20,c25 tcontext=unconfined_u:object_r:unlabeled_t:s0 tclass=file permissive=0
----
time->Mon Aug 30 12:12:17 2021
type=PROCTITLE msg=audit(1630325537.086:281397): proctitle=2F7573722F62696E2F636F72657574696C73002D2D636F72657574696C732D70726F672D73686562616E673D6C73002F7573722F62696E2F6C73002D616C002F646174612F6E66732D6578706F72742F
type=SYSCALL msg=audit(1630325537.086:281397): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=55e0aab62820 a2=90800 a3=0 items=0 ppid=2508123 pid=2510912 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="ls" exe="/usr/bin/coreutils" subj=system_u:system_r:container_t:s0:c20,c25 key=(null)
type=AVC msg=audit(1630325537.086:281397): avc:  denied  { read } for  pid=2510912 comm="ls" name="nfs-export" dev="gpfs" ino=19456 scontext=system_u:system_r:container_t:s0:c20,c25 tcontext=system_u:object_r:unlabeled_t:s0 tclass=dir permissive=0
----
time->Mon Aug 30 12:12:24 2021
type=PROCTITLE msg=audit(1630325544.123:281438): proctitle=2F7573722F62696E2F636F72657574696C73002D2D636F72657574696C732D70726F672D73686562616E673D636174002F7573722F62696E2F636174002F646174612F6E66732D6578706F72742F686F737473
type=SYSCALL msg=audit(1630325544.123:281438): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=7fff423cee08 a2=0 a3=0 items=0 ppid=2508123 pid=2511180 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="cat" exe="/usr/bin/coreutils" subj=system_u:system_r:container_t:s0:c20,c25 key=(null)
type=AVC msg=audit(1630325544.123:281438): avc:  denied  { read } for  pid=2511180 comm="cat" name="hosts" dev="gpfs" ino=26112 scontext=system_u:system_r:container_t:s0:c20,c25 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0

It appears we have full RW access to the root, or first level, of the file system backed by the static PV (/mnt/fs1), but we do not have any access to existing sub-directories in that directory (e.g. /mnt/fs1/nfs-export), even if the sub-directory's permissions are set to 777.
Here, the statically provisioned PV is backed by the directory /mnt/fs1 in IBM Spectrum Scale which holds the sub-directory "nfs-export".

Expected behavior
When using static provisioning with IBM Spectrum Scale CSI to grant access to an existing directory (i.e. existing data) in IBM Spectrum Scale, access to the sub-directories in that directory is expected to be available and not prevented by SELinux (as long as the file system UID/GID permissions allow it).

Environment

OpenShift Cluster w/CNSA:
 
CNSA v5.1.1.1
CSI v2.2.0

[gero@oc6314270534]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.19    True        False         49d     Cluster version is 4.7.19

Storage cluster:

[root@fscc-sr650-12 dynamic2]# mmdiag
Current GPFS build: "5.1.1.2 ".
Built on Jul  1 2021 at 16:02:28

[root@fscc-sr650-12]# rpm -qa | grep gpfs
gpfs.gskit-8.0.55-19.x86_64
gpfs.crypto-5.1.1-2.x86_64
gpfs.docs-5.1.1-2.noarch
gpfs.gnr-5.1.1-2.x86_64
gpfs.gnr.base-1.0.0-0.x86_64
gpfs.gss.pmcollector-5.1.1-2.el8.x86_64
gpfs.msg.en_US-5.1.1-2.noarch
gpfs.adv-5.1.1-2.x86_64
gpfs.gui-5.1.1-2.noarch
gpfs.gss.pmsensors-5.1.1-2.el8.x86_64
gpfs.base-5.1.1-2.x86_64
gpfs.gpl-5.1.1-2.noarch
gpfs.afm.cos-1.0.0-3.x86_64
gpfs.license.dmd-5.1.1-2.x86_64
gpfs.java-5.1.1-2.x86_64
gpfs.compression-5.1.1-2.x86_64
@gfschmidt gfschmidt added the Type: Bug Indicates issue is an undesired behavior, usually caused by code error. label Aug 30, 2021
@gfschmidt
Member Author

It seems that access to the sub-directories of a statically provisioned PV is not prevented by SELinux if the static PV itself is not backed by the root fileset of an IBM Spectrum Scale file system directly (e.g. mounted at /mnt/fs1).

If the static PV is instead backed by a sub-directory located under the IBM Spectrum Scale file system's root directory (e.g. /mnt/fs1/nfs-export) then access to sub-directories in that directory works as expected and is not prevented by SELinux.

For example, if we only want to grant access to a sub-directory /mnt/fs1/nfs-export in IBM Spectrum Scale and not to the entire IBM Spectrum Scale file system mounted at /mnt/fs1 and all its sub-directories then it seems to work ok.

With a static PV backed by a sub-directory /mnt/fs1/nfs-export in IBM Spectrum Scale

[gero@oc6314270534]$ cat pv000.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv000
  labels:
    product: ibm-spectrum-scale
    type: test
spec:
  storageClassName: static
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "17399599334533218545;099B6A7A:5EB99721;path=/mnt/fs1/nfs-export"

and a PVC

[gero@oc6314270534]$ cat pvc000.yaml 
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc000
spec:
  storageClassName: static
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  selector: 
    matchLabels:
      product: ibm-spectrum-scale
      type: test

mounted in the pod at /data/fs1/nfs-export

[gero@oc6314270534]$ cat pod.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: pod
spec:
  containers:
    - name: test
      image: registry.access.redhat.com/ubi8/ubi-minimal:latest
      command: [ "/bin/sh", "-c", "--" ]
      args: [ "while true; do echo $(hostname) $(date +%Y%m%d-%H:%M:%S) | tee -a /train/train1.log ; sleep 5 ; done;" ]
      volumeMounts:
        - name: vol1
          mountPath: "/data/fs1/nfs-export"
  volumes:
    - name: vol1
      persistentVolumeClaim:
        claimName: pvc000

we have full access to all sub-directories of this backing directory and to their files:

[gero@oc6314270534]$ oc rsh pod
sh-4.4# id
uid=0(root) gid=0(root) groups=0(root)

sh-4.4# ls -al /data/fs1/nfs-export/
total 3
drwxrwxrwx. 4 root root 4096 Aug 30 13:57 .
drwxr-xr-x. 3 root root   24 Aug 30 14:03 ..
drwxrwxrwx. 2 root root 4096 Aug 30 13:59 dir1
drwxr-xr-x. 2 root root 4096 Aug 30 13:58 dir2
-rw-r--r--. 1 root root 1148 Aug 25 17:00 fstab
-rw-r--r--. 1 root root  158 Aug 25 17:00 hosts
-rw-rw-rw-. 1 root root    0 Aug 30 09:37 readall

sh-4.4# ls -al /data/fs1/nfs-export/dir1
total 2
drwxrwxrwx. 2 root root 4096 Aug 30 13:59 .
drwxrwxrwx. 4 root root 4096 Aug 30 13:57 ..
-rw-r--r--. 1 root root  547 Aug 30 13:58 config
-rwxrwxrwx. 1 root root  451 Aug 30 13:59 crontab

sh-4.4# ls -al /data/fs1/nfs-export/dir2
total 2
drwxr-xr-x. 2 root root 4096 Aug 30 13:58 .
drwxrwxrwx. 4 root root 4096 Aug 30 13:57 ..
-rw-r--r--. 1 root root  547 Aug 30 13:58 config
sh-4.4# tail -1 /data/fs1/nfs-export/dir1/config 

It seems that SELinux prevents access to sub-directories of the backing directory of a statically provisioned PV only when the static PV's volumeHandle (e.g. "17399599334533218545;099B6A7A:5EB99721;path=/mnt/fs1") points directly to the root fileset of an IBM Spectrum Scale file system (i.e. the root of an entire IBM Spectrum Scale file system).

In this case, when granting access to an entire IBM Spectrum Scale file system with a statically provisioned PV (e.g. path=/mnt/fs1), the expected behavior would be full access to all sub-directories, as determined by the individual file system permissions, with access not being prevented by SELinux on the worker nodes.

@smitaraut
Member

It looks like, in the case of the root fileset PVC, OCP did not relabel the existing subdirectory. I'm saying this because of the following SELinux log:

read system_u:object_r:unlabeled_t:s0

Can you verify by running:

ls -lZ /mnt/fs1
ls -lZ /mnt/fs1/nfs-export

A few other things to check:

  1. Is SELinux in permissive or enforcing mode on the storage cluster?
  2. Is there any difference in the ls -lZ output on the storage cluster vs the OCP cluster?
  3. Can you try the same thing with selinuxContext as MustRunAs in the SCC? - https://www.ibm.com/docs/en/spectrum-scale-csi?topic=pods-considerations-mounting-read-write-many-rwx-volumes
  4. Please also check the SELinux considerations section here - https://www.ibm.com/docs/en/spectrum-scale-csi?topic=planning-deployment-considerations

@gfschmidt
Member Author

gfschmidt commented Sep 15, 2021

Storage cluster SELinux settings for this Issue are:

[root@fscc-sr650-12 ~]# mmdsh -Nall -f1 getenforce
fab3-3a.bda.scale.ibm.com: Enforcing
fab3-3b.bda.scale.ibm.com: Enforcing
node12.bda.scale.ibm.com: Enforcing

When I first ran into this issue, SELinux was disabled on all nodes in the storage cluster, which is supposed to be the default on the ESS. By the time I opened this issue I had already changed the setting to Enforcing and freshly re-created the directory nfs-export, which was then used for the exercise described in this issue.

If the SELinux settings on the storage cluster do matter, what is the required SELinux setting on the IBM Spectrum Scale storage cluster / ESS used with CNSA/CSI on OpenShift?

@gfschmidt
Member Author

File listing with ls -lZ on storage and client cluster

Note that most of the other directories here were created while SELinux was still disabled. The nfs-export directory referred to in this issue was freshly created after setting SELinux to Enforcing on the storage cluster.

gero:~$ oc rsh ibm-spectrum-scale-core-75l5b
sh-4.4# ls -lZ /mnt/fs1
total 6
drwxr-xr-x. 5 root       root       system_u:object_r:container_file_t:s0:c20,c25    4096 Feb 15  2021 DAAA
drwxr-xr-x. 2 root       root       system_u:object_r:container_file_t:s0:c10,c25    4096 Mar 23 22:02 MYDATA
drwxrwxrwx. 2 root       root       unconfined_u:object_r:unlabeled_t:s0             4096 Apr  9 17:58 cpd-volumes
drwxr-xr-x. 4 root       root       unconfined_u:object_r:unlabeled_t:s0             4096 Jun 24 17:34 demo
drwxr-xr-x. 2 root       root       unconfined_u:object_r:unlabeled_t:s0             4096 Jul 23 11:25 mvi
-rw-r--r--. 1 root       root       system_u:object_r:container_file_t:s0:c20,c25       0 Aug 30 11:00 myNewFile
drwxrwxrwx. 5 root       root       system_u:object_r:container_file_t:s0:c20,c25    4096 Sep 13 12:46 nfs-export
drwxrwxrwx. 2 root       root       unconfined_u:object_r:unlabeled_t:s0             4096 Apr 18 18:41 sherlock
drwxrwx--x. 3 root       root       system_u:object_r:unlabeled_t:s0                 4096 Jul 22 19:25 spectrum-scale-csi-volume-store
drwxr-xr-x. 2 root       root       system_u:object_r:container_file_t:s0:c249,c1012 4096 Jun 24 22:08 test
drwxrwxrwx. 2 root       root       unconfined_u:object_r:unlabeled_t:s0             4096 Jun  7 16:08 wmla
drwxrwxrwx. 9 1000750000 1000750000 system_u:object_r:container_file_t:s0:c24,c27    4096 May  6 20:39 wmla-afm

[root@fscc-sr650-12 ~]# ls -lZ /gpfs/ess3000_1M/
total 6
drwxr-xr-x. 5 root       root       system_u:object_r:container_file_t:s0:c20,c25    4096 Feb 15  2021 DAAA
drwxr-xr-x. 2 root       root       system_u:object_r:container_file_t:s0:c10,c25    4096 Mar 23 23:02 MYDATA
drwxrwxrwx. 2 root       root       unconfined_u:object_r:unlabeled_t:s0             4096 Apr  9 19:58 cpd-volumes
drwxr-xr-x. 4 root       root       unconfined_u:object_r:unlabeled_t:s0             4096 Jun 24 19:34 demo
drwxr-xr-x. 2 root       root       unconfined_u:object_r:unlabeled_t:s0             4096 Jul 23 13:25 mvi
-rw-r--r--. 1 root       root       system_u:object_r:container_file_t:s0:c20,c25       0 Aug 30 13:00 myNewFile
drwxrwxrwx. 5 root       root       system_u:object_r:container_file_t:s0:c20,c25    4096 Sep 13 14:46 nfs-export
drwxrwxrwx. 2 root       root       unconfined_u:object_r:unlabeled_t:s0             4096 Apr 18 20:41 sherlock
drwxrwx--x. 3 root       root       system_u:object_r:unlabeled_t:s0                 4096 Jul 22 21:25 spectrum-scale-csi-volume-store
drwxr-xr-x. 2 root       root       system_u:object_r:container_file_t:s0:c249,c1012 4096 Jun 25 00:08 test
drwxrwxrwx. 2 root       root       unconfined_u:object_r:unlabeled_t:s0             4096 Jun  7 18:08 wmla
drwxrwxrwx. 9 1000750000 1000750000 system_u:object_r:container_file_t:s0:c24,c27    4096 May  6 22:39 wmla-afm

The output from the CNSA cluster and storage cluster looks the same to me.

All files in the nfs-export sub-directory carry the same SELinux labels:

Storage cluster:

[root@fscc-sr650-12 ~]# ls -lZ /gpfs/ess3000_1M/nfs-export/
-rw-r--r--. 1 root root system_u:object_r:container_file_t:s0:c20,c25  8589934592 Sep  8 15:46 dd10.out
-rw-r--r--. 1 root root system_u:object_r:container_file_t:s0:c20,c25  8589934592 Sep  8 15:47 dd11.out
...

CNSA:

sh-4.4# ls -lZ /mnt/fs1/nfs-export
-rw-r--r--. 1 root root system_u:object_r:container_file_t:s0:c20,c25  8589934592 Sep  8 13:46 dd10.out
-rw-r--r--. 1 root root system_u:object_r:container_file_t:s0:c20,c25  8589934592 Sep  8 13:47 dd11.out
...

@gfschmidt
Member Author

gfschmidt commented Sep 15, 2021

Please also note my first comment.

  • When mounting a static PV of /mnt/fs1 (the root directory of a Spectrum Scale file system) into the pod, SELinux prevents access to all the sub-directories in there, for example nfs-export.

  • When mounting a static PV of a sub-directory of the same Spectrum Scale root file system into the pod, e.g. /mnt/fs1/nfs-export, with the very same deployment and environment, everything works fine, including full access to all the sub-directories in nfs-export.

So it only fails when mounting an entire Spectrum Scale root file system via a static PV.

In both cases the pod is run directly by an OpenShift admin user with the cluster role cluster-admin. The pod is even running privileged for the time being. So I assume the SCC being used is the privileged SCC, which grants unlimited permissions:

gero:~$ oc get scc privileged
NAME         PRIV   CAPS    SELINUX    RUNASUSER   FSGROUP    SUPGROUP   PRIORITY     READONLYROOTFS   VOLUMES
privileged   true   ["*"]   RunAsAny   RunAsAny    RunAsAny   RunAsAny   <no value>   false            ["*"]

@madhuthorat
Member

@amdabhad Please check this.

@amdabhad
Member

Issue 1. When a directory A is used for a PV, which is in turn used by a pod, then inside the pod the sub-directories of A are not accessible.

  • Solution:
    a. Enable SELinux on storage cluster nodes where scale is installed
    b. Add the label container_file_t recursively to the directory which you want to use as storage for CNSA.
    The following are the steps to add the required label to a directory, e.g. /ibm/fs1, recursively:
  semanage fcontext -a -t container_file_t "/ibm/fs1(/.*)?"
  restorecon -R -v /ibm/fs1

Issue 2. When multiple pods access the same or a nested path, the pods created earlier lose access to the common path due to relabeling.

  • Solution:
    a. Create a new SCC (or edit the existing SCC used by your app) with a constant label
    e.g.
    seLinuxContext:
      seLinuxOptions:
        level: "s0:c50,c100"

b. In order for multiple pods to have access to a common path, the pods must use the above SCC.
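
For reference, here is a minimal sketch of what such a custom SCC could look like. The name shared-scc matches the custom SCC mentioned later in this issue, the level value is the example from above, and all other fields are assumptions that need to be adapted to the workload (this is not a tested or official configuration):

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: shared-scc
allowPrivilegedContainer: false
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs            # pin all pods using this SCC to one constant SELinux level
  seLinuxOptions:
    level: "s0:c50,c100"
fsGroup:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
volumes:
  - persistentVolumeClaim
  - configMap
  - secret
  - emptyDir
  - projected

Pods pick up this SCC via a service account that has been granted use of it (role plus rolebinding), or by assigning the SCC directly to the users, as described further down in this thread.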

@gfschmidt Please have a look at these solutions and let me know if any concerns.

@gfschmidt
Member Author

@amdabhad, thanks. I will be running some tests and will share my results soon.

@gfschmidt
Member Author

gfschmidt commented Oct 26, 2021

The instructions provided above are workarounds, similar to those proposed in Considerations for mounting read-write many (RWX) volumes, and may help to fix the given conditions when running into them. None of these manual workarounds should normally be required when provisioning static PVs.

The point of this issue is that the SELinux relabeling is applied differently when using static provisioning:

  • When using a static PV which is backed by a directory in IBM Spectrum Scale (e.g. /mnt/fs1/dir1), the whole directory and its contents are relabeled with the SELinux context of the namespace where the pod is running.
PV -> volumeHandle: "17399599334539944523;099B6A7A:5EB99721;path=/mnt/fs1/dir1"
  • When using the very same setup but with a static PV which is backed by the entire IBM Spectrum Scale file system (/mnt/fs1), no SELinux relabeling takes place at all, and the pod typically has no access to some or all of the data, depending on whether the pre-existing SELinux labels match the SELinux context of the namespace.
PV -> volumeHandle: "17399599334539944523;099B6A7A:5EB99721;path=/mnt/fs1"

So there is a difference in how a static PV is handled with regard to the mount point in the volumeHandle: no SELinux relabeling happens for a static PV if the mount point of an entire IBM Spectrum Scale file system is given in the volumeHandle. Even if SELinux relabeling is not wanted in some cases, the expected behavior is that it happens in the same way as with a regular directory, to ensure that a pod has access to the data in the PV.

The creation of the static PV is in the domain of an admin (not the user), so it is a manual and intentional process, as with all static PVs. IBM Spectrum Scale supports multiple file systems, and there may be cases where an entire Spectrum Scale file system (similar to a directory in IBM Spectrum Scale) is shared and made available to OpenShift users through static provisioning (i.e. a static PV). With proper SELinux relabeling and the methods described in points 3 and 4 below, parallel access to the data can be established with the proper SELinux context in the same way as with regular directories.

I observed the following behavior in OpenShift (here with /mnt/fs1 = root of IBM Spectrum Scale file system):

Expected behavior (access granted on directory level beneath the IBM Spectrum Scale root mount point):

  1. When mounting a directory in IBM Spectrum Scale like /mnt/fs1/dir1 as a static PV (demo-pv01) inside a pod (pod1), the whole directory (and its contents) is relabeled with the SELinux context of the OpenShift namespace as soon as the pod consuming the PV is started, regardless of the previous SELinux context.
gero:stat-prov-test$ oc get ns proj1 -o yaml | head -10
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c26,c0

[root@fscc-sr650-12 ess3000_1M]# ls -alZ /mnt/fs1/dir1
drwxrwxr-x.  4 root root system_u:object_r:container_file_t:s0:c0,c26    4096 Oct 25 15:53 .
drwxr-xr-x. 18 root root system_u:object_r:container_file_t:s0:c20,c25 262144 Oct 25 14:56 ..
-rw-r--r--.  1 root root system_u:object_r:container_file_t:s0:c0,c26      36 Oct 25 15:53 demo1.log
drwxrwxr-x.  7 root root system_u:object_r:container_file_t:s0:c0,c26    4096 Oct 25 14:53 model
drwxrwxr-x.  7 root root system_u:object_r:container_file_t:s0:c0,c26    4096 Oct 25 14:53 train
  2. When another pod (pod2) in another namespace mounts another static PV (demo-pv02) which is backed by the same directory /mnt/fs1/dir1 in IBM Spectrum Scale, the whole directory (and its contents) is relabeled again, but with the SELinux context of the new OpenShift namespace, as soon as the pod is started. The previous SELinux context is overwritten and the previous pod immediately loses all access, because another namespace typically enforces different SELinux settings.
gero:stat-prov-test$ oc get ns proj2 -o yaml | head -10
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c27,c19

[root@fscc-sr650-12 ess3000_1M]# ls -alZ /mnt/fs1/dir1
drwxrwxr-x.  4 root root system_u:object_r:container_file_t:s0:c19,c27   4096 Oct 25 16:01 .
drwxr-xr-x. 18 root root system_u:object_r:container_file_t:s0:c20,c25 262144 Oct 25 14:56 ..
-rw-r--r--.  1 root root system_u:object_r:container_file_t:s0:c19,c27   1674 Oct 25 16:01 demo1.log
-rw-r--r--.  1 root root system_u:object_r:container_file_t:s0:c19,c27     72 Oct 25 16:01 demo2.log
drwxrwxr-x.  7 root root system_u:object_r:container_file_t:s0:c19,c27   4096 Oct 25 14:53 model
drwxrwxr-x.  7 root root system_u:object_r:container_file_t:s0:c19,c27   4096 Oct 25 14:53 train

gero:~$ oc get pods
NAME   READY   STATUS    RESTARTS   AGE
pod1   1/1     Running   0          11m
gero:~$ oc rsh pod1
sh-4.4# id
uid=0(root) gid=0(root) groups=0(root)
sh-4.4# ls -alRZ dir1/
ls: cannot open directory 'dir1/': Permission denied
sh-4.4# ls -alZ
dr-xr-xr-x.   1 root root system_u:object_r:container_file_t:s0:c0,c26    30 Oct 25 13:53 .
dr-xr-xr-x.   1 root root system_u:object_r:container_file_t:s0:c0,c26    30 Oct 25 13:53 ..
lrwxrwxrwx.   1 root root system_u:object_r:container_file_t:s0:c0,c26     7 Apr 23  2020 bin -> usr/bin
dr-xr-xr-x.   2 root root system_u:object_r:container_file_t:s0:c0,c26     6 Apr 23  2020 boot
drwxrwxr-x.   4 root root system_u:object_r:container_file_t:s0:c19,c27 4096 Oct 25 14:01 dir1
  3. If the admin changes the annotation openshift.io/sa.scc.mcs: s0:c27,c19 of the second namespace to the same SELinux context as used by the first namespace (openshift.io/sa.scc.mcs: s0:c26,c0) where pod1 is running, then pod2 and pod1 both run under the same SELinux context and can both access the directory /mnt/fs1/dir1 in IBM Spectrum Scale in parallel as intended. All pods in the second namespace now run with the same SELinux context as in the first namespace and can easily share parallel access to common directories in IBM Spectrum Scale through static PVs. As PVCs are namespaced objects and bind to PVs, we need one PV per namespace even if these are backed by the very same directory in IBM Spectrum Scale. In this case demo-pv01 and demo-pv02 only differ in their object name.
gero:stat-prov-test$ oc edit ns proj2
gero:stat-prov-test$ oc get ns proj2 -o yaml | head -10
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c26,c0
  4. A user can also enforce the correct SELinux context at the container level (if permitted by the SCC under which the user runs) by using the securityContext stanza with the proper seLinuxOptions derived from the SELinux context of the first namespace:
spec:
  containers:
    - name: test-pod
      image: registry.access.redhat.com/ubi8/ubi-minimal:latest
      securityContext:
        seLinuxOptions:
          level: "s0:c26,c0"

In this case the SELinux context can be controlled at pod level by the user, if permitted by the SCC under which the user is running. The SCC is controlled by the admin.

Method 3 (per namespace) and method 4 (per pod/container) allow sharing parallel access to the same data directory in IBM Spectrum Scale through static provisioning.

These two approaches are simple to apply. Defining a specific SELinux context in the SCC as proposed above requires additional effort (configuration of roles, rolebindings and service accounts).

-> Method 3 in particular, based on the namespace annotation, may be worth adding to the CSI documentation, e.g. in Considerations for mounting read-write many (RWX) volumes.
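
For completeness, the annotation change from method 3 can also be applied with a single command (namespace and level values taken from the example above); note that this typically only affects pods admitted after the change, existing pods keep their context:

oc annotate namespace proj2 --overwrite openshift.io/sa.scc.mcs='s0:c26,c0'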

Unexpected behavior (access denied when using an IBM Spectrum Scale root mount point):

  1. When mounting the root directory of an entire IBM Spectrum Scale file system like /mnt/fs1 as a static PV (demo-pv) inside a pod (pod), the whole directory (and its contents) is NOT relabeled with the required SELinux context of the OpenShift namespace in which the pod consuming the PV is running. This differs from the behavior expected and observed when mounting a directory (/mnt/fs1/dir1) in IBM Spectrum Scale. Without the proper SELinux relabeling the pod will not have access to the data in the static PV (as the pre-existing labels will likely not match the SELinux context of the namespace). The expected behavior would be the same as with a regular directory, which is that relabeling with the SELinux context of the pod takes place. This is the regular behavior expected in OpenShift (to grant access to the PV that the pod is claiming), even if it may not be wanted in some cases.

-> Manual adjustment of the SELinux labels on the storage cluster as proposed above should not be the regular solution here - it's rather a workaround to fix a given situation. Manual adjustment of the SELinux labels on the storage cluster is also not needed when working on a regular directory (one level under the root mount point of the IBM Spectrum Scale file system) - here everything works as expected.

-> Furthermore, a definitive direction on whether SELinux needs to be enabled or disabled on the IBM Spectrum Scale storage cluster (i.e. set to enforcing/permissive/disabled and targeted/minimum/mls) is missing from the official documentation. As far as I know, SELinux is typically disabled by default on the ESS I/O nodes.

Nested directories

  1. The SELinux context applied through relabeling when a pod consuming a static PV backed by a directory in IBM Spectrum Scale starts does not seem to be cumulative in OpenShift, in the sense that two users with different SELinux contexts could both access the same backing directory. Therefore I assume that using static PVs with nested directories, e.g. /mnt/fs1/dir1 and /mnt/fs1/dir1/dir2, from independent users in different namespaces will not work either - unless the admin specifically ensures that these independent users in their different namespaces use the same SELinux context, as proposed in points 3 and 4 above.
    20211027 Static Provisioning with Spectrum Scale and SELinux

@gfschmidt
Member Author

gfschmidt commented Nov 12, 2021

Summary of open issues:

  1. When using a static PV backed by an entire IBM Spectrum Scale file system (e.g., /mnt/fs1), no SELinux relabeling takes place, and consequently the pod mounting the PV has no access to the data. This behavior is not expected. With a regular directory in the IBM Spectrum Scale file system (e.g., /mnt/fs1/dir1), SELinux relabeling takes place and access is ensured for the pod mounting the PV.
PV -> volumeHandle: "17399599334539944523;099B6A7A:5EB99721;path=/mnt/fs1"       <- no SELinux relabeling
PV -> volumeHandle: "17399599334539944523;099B6A7A:5EB99721;path=/mnt/fs1/dir1"  <- SELinux relabeling
  2. A definitive direction on whether SELinux needs to be enabled or disabled on the IBM Spectrum Scale storage cluster (i.e. set to enforcing | permissive | disabled and targeted | minimum | mls) is missing from the official documentation. As far as I know, SELinux is typically disabled by default on the ESS I/O nodes. Is enabling SELinux on the ESS I/O nodes required, and can it be done without impacting overall performance (e.g., metadata operations)?

@rkomandu

rkomandu commented Dec 9, 2021

@gfschmidt , thank you for the detailed try outs and updates in this issue.

The HPO team is in the same state: the IBM Spectrum Scale file system (/mnt/fs1) is mounted into the Noobaa pod using the namespacestore, which changes whenever we reinstall Noobaa on the OCP cluster. We are on the 5121 stream and there appears to be no proper automatic method to set this.

On the storage cluster we are running SELinux in Permissive mode, and we have issues when working on the complete solution of creating objects, etc.

-- We mount /mnt/fs1 as is into the pod (static PV)
-- We have created user directories on the storage cluster with appropriate RWX permissions
-- On the OCP cluster we deploy the HPO solution (CNSA+CSI+HPO-DAS)
-- For object access we create the user account with the pre-created directory on the storage cluster passed to the DAS service
-- From the app node we try to run I/O using the user credentials, and then we run into this Access Denied issue while trying to create objects in that directory.

@amdabhad / CSI team, this has been discussed with @deeghuge a few times with live debugging, and we need an end-to-end solution without any manual changes on the storage cluster.

Noobaa suggested opening up c0.c1023 (as a temporary workaround) until this issue is fixed, which the HPO team sees as a security loophole.

Can we reanalyze and get a fix here?

@rkomandu

rkomandu commented Dec 9, 2021

@gfschmidt --> you can refer to Noobaa defect 6761, which has all the details w.r.t. the SCC changing for each install of Noobaa in the openshift-storage namespace.

@rkomandu

@amdabhad
Could you comment on what happens if we follow step 1 ("semanage, restorecon") and then the cluster nodes are rebooted. After the Spectrum Scale file system is mounted again, do the steps need to be repeated?

The cluster rebooted for us, as shown below:

mmdsh -N all uptime
rkomandu-ss5121-x-master.fyre.ibm.com: 03:21:40 up 5 days, 13:10, 3 users, load average: 0.23, 0.10, 0.09
rkomandu-ss5121-x-worker2.fyre.ibm.com: 03:21:40 up 5 days, 13:10, 0 users, load average: 0.42, 0.12, 0.03
rkomandu-ss5121-x-worker1.fyre.ibm.com: 03:21:40 up 5 days, 13:12, 0 users, load average: 0.06, 0.07, 0.05

@amdabhad
Member

amdabhad commented Jan 20, 2022

@rkomandu If you are following step 1, I think you will have to perform this step again when the nodes are rebooted, because I have seen Scale operations like mmmount change the labels.
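
As a minimal sketch of that re-run (using the example path from above): the semanage fcontext rule itself is persistent, so after a reboot or remount only the relabel step should need to be repeated once the file system is mounted again:

restorecon -R -v /ibm/fs1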

@gnufied

gnufied commented Jan 31, 2022

When mounting the root of the volume, what does the Pod definition look like and how does the mount appear on the node where the pod was scheduled? I.e., can you post the precise mount options that were visible once the volume was mounted on the node?

@deeghuge
Member

deeghuge commented Feb 2, 2022

Hi @gnufied, on the worker node the filesystem is mounted at /var/mnt/gpfs1.

sh-4.4# mount| grep gpfs
gpfs1 on /var/mnt/gpfs1 type gpfs (rw,relatime,seclabel)

sh-4.4# ls -laZ /var/mnt/gpfs1
total 259
drwxr-xr-x. 6 root root system_u:object_r:unlabeled_t:s0      262144 Jan 14 11:59 .
drwxr-xr-x. 3 root root system_u:object_r:var_t:s0                19 Jan  8 10:13 ..
dr-xr-xr-x. 2 root root system_u:object_r:unlabeled_t:s0        8192 Jan  1  1970 .snapshots
drwxr-xr-x. 3 root root system_u:object_r:unlabeled_t:s0        4096 Jan 14 12:00 existing-data
drwxrwx--x. 3 root root system_u:object_r:unlabeled_t:s0        4096 Jan  8 10:14 primary-fileset-gpfs1-14926875306623591168
drwxrwx--x. 3 root root system_u:object_r:container_file_t:s0   4096 Jan 13 11:56 pvc-26d9e4db-e15f-4b26-97d3-6e7eb0d10e06
drwxrwx--x. 3 root root system_u:object_r:unlabeled_t:s0        4096 Jan  8 10:17 pvc-5f40b561-4351-4bca-be3e-299f7de77493

This is the describe output of the PV, PVC and pod:

[root@api.dg01.cp.fyre.ibm.com ~]# kubectl describe pv dirpv
Name:            dirpv
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller: yes
Finalizers:      [kubernetes.io/pv-protection external-attacher/spectrumscale-csi-ibm-com]
StorageClass:
Status:          Bound
Claim:           test/pvc-dirpv
Reclaim Policy:  Retain
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        2Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            spectrumscale.csi.ibm.com
    FSType:
    VolumeHandle:      14926875306623591168;15250B0A:61D925EC;path=/mnt/gpfs1
    ReadOnly:          false
    VolumeAttributes:  <none>
Events:                <none>

[root@api.dg01.cp.fyre.ibm.com ~]# kubectl describe pvc pvc-dirpv
Name:          pvc-dirpv
Namespace:     test
StorageClass:
Status:        Bound
Volume:        dirpv
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      2Gi
Access Modes:  RWX
VolumeMode:    Filesystem
Used By:       csi-scale-staticdemo-pod
Events:        <none>

[root@api.dg01.cp.fyre.ibm.com ~]# oc describe pod csi-scale-staticdemo-pod
Name:         csi-scale-staticdemo-pod
Namespace:    test
Priority:     0
Node:         worker0.dg01.cp.fyre.ibm.com/10.17.117.153
Start Time:   Tue, 01 Feb 2022 23:08:29 -0800
Labels:       app=nginx
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.254.12.106"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.254.12.106"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: anyuid
Status:       Running
IP:           10.254.12.106
IPs:
  IP:  10.254.12.106
Containers:
  web-server:
    Container ID:   cri-o://249d3072b9532f80fda2a06d1d4b2ab9c9f54e317c9c3045aa065c811732adf4
    Image:          nginx
    Image ID:       docker.io/library/nginx@sha256:2834dc507516af02784808c5f48b7cbe38b8ed5d0f4837f16e78d00deb7e7767
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 01 Feb 2022 23:08:48 -0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /usr/share/nginx/html/scale from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nkvkt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-dirpv
    ReadOnly:   false
  kube-api-access-nkvkt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/hostname=worker0.dg01.cp.fyre.ibm.com
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason                  Age    From                     Message
  ----    ------                  ----   ----                     -------
  Normal  Scheduled               7m27s  default-scheduler        Successfully assigned test/csi-scale-staticdemo-pod to worker0.dg01.cp.fyre.ibm.com
  Normal  SuccessfulAttachVolume  7m27s  attachdetach-controller  AttachVolume.Attach succeeded for volume "dirpv"
  Normal  AddedInterface          7m18s  multus                   Add eth0 [10.254.12.106/22] from openshift-sdn
  Normal  Pulling                 7m18s  kubelet                  Pulling image "nginx"
  Normal  Pulled                  7m8s   kubelet                  Successfully pulled image "nginx" in 9.425617632s
  Normal  Created                 7m8s   kubelet                  Created container web-server
  Normal  Started                 7m8s   kubelet                  Started container web-server

/mnt/gpfs1 and /var/mnt/gpfs1 are the same thing on the worker node:

sh-4.4# ls -lZa /mnt/gpfs1
total 259
drwxr-xr-x. 6 root root system_u:object_r:unlabeled_t:s0              262144 Jan 14 11:59 .
drwxr-xr-x. 3 root root system_u:object_r:var_t:s0                        19 Jan  8 10:13 ..
dr-xr-xr-x. 2 root root system_u:object_r:unlabeled_t:s0                8192 Jan  1  1970 .snapshots
drwxr-xr-x. 3 root root system_u:object_r:container_file_t:s0:c20,c26   4096 Jan 14 12:00 existing-data
drwxrwx--x. 3 root root system_u:object_r:unlabeled_t:s0                4096 Jan  8 10:14 primary-fileset-gpfs1-14926875306623591168
drwxrwx--x. 3 root root system_u:object_r:container_file_t:s0           4096 Jan 13 11:56 pvc-26d9e4db-e15f-4b26-97d3-6e7eb0d10e06
drwxrwx--x. 3 root root system_u:object_r:unlabeled_t:s0                4096 Jan  8 10:17 pvc-5f40b561-4351-4bca-be3e-299f7de77493
sh-4.4# ls -lZa /var/mnt/gpfs1
total 259
drwxr-xr-x. 6 root root system_u:object_r:unlabeled_t:s0              262144 Jan 14 11:59 .
drwxr-xr-x. 3 root root system_u:object_r:var_t:s0                        19 Jan  8 10:13 ..
dr-xr-xr-x. 2 root root system_u:object_r:unlabeled_t:s0                8192 Jan  1  1970 .snapshots
drwxr-xr-x. 3 root root system_u:object_r:container_file_t:s0:c20,c26   4096 Jan 14 12:00 existing-data
drwxrwx--x. 3 root root system_u:object_r:unlabeled_t:s0                4096 Jan  8 10:14 primary-fileset-gpfs1-14926875306623591168
drwxrwx--x. 3 root root system_u:object_r:container_file_t:s0           4096 Jan 13 11:56 pvc-26d9e4db-e15f-4b26-97d3-6e7eb0d10e06
drwxrwx--x. 3 root root system_u:object_r:unlabeled_t:s0                4096 Jan  8 10:17 pvc-5f40b561-4351-4bca-be3e-299f7de77493

Before the pod actually sees the data, there is an additional softlink created on the host, which looks like this:

sh-4.4# ls -lZ /var/lib/kubelet/pods/7d736e8c-e4da-4e41-8705-1734529dcfe9/volumes/kubernetes.io~csi/dirpv/mount
lrwxrwxrwx. 1 root root system_u:object_r:container_var_lib_t:s0 10 Feb  2 07:08 /var/lib/kubelet/pods/7d736e8c-e4da-4e41-8705-1734529dcfe9/volumes/kubernetes.io~csi/dirpv/mount -> /mnt/gpfs1

@gnufied

gnufied commented Feb 2, 2022

The spec says:

  // For volumes with an access type of mount, the SP SHALL place the
  // mounted directory at target_path.

And in this case the target path is just a symlink. The first problem is: since the target path is a symlink, SELinux detection does not work. We expect the target path to be a mount point.
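
A quick way to see this on the worker node, using the target path from the describe output above (just a sketch of the check, not the actual kubelet code path):

TARGET=/var/lib/kubelet/pods/7d736e8c-e4da-4e41-8705-1734529dcfe9/volumes/kubernetes.io~csi/dirpv/mount
mountpoint "$TARGET"   # reports "is a mountpoint" only if the volume was bind-mounted here
readlink "$TARGET"     # prints /mnt/gpfs1, i.e. the target path is only a symlink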

@deeghuge
Member

deeghuge commented Feb 2, 2022

@gnufied it does work with a softlink when the softlink points to a directory inside the filesystem, for example /mnt/gpfs1/xyz, but it does not work when the softlink points to the mount point of the filesystem, e.g. /mnt/gpfs1.

@gnufied

gnufied commented Feb 2, 2022

What does the target path look like when the softlink points to a directory inside the filesystem? The bottom line is: SELinux detection currently does not work in k8s if the target path is a symlink.

If you think it works correctly when the volume is a subdirectory, then I would say it may be a complete fluke (I would not trust the result). Also, who is creating mount points like /mnt/gpfs on the nodes? Why are we mounting volumes in such non-standard locations?

@deeghuge
Member

deeghuge commented Feb 3, 2022

This is for a directory inside the filesystem.
The data before the pod started had these SELinux labels:

sh-4.4# ls -lZRa
.:
total 258
drwxr-xr-x. 3 root root system_u:object_r:unlabeled_t:s0   4096 Feb  3 10:02 .
drwxr-xr-x. 7 root root system_u:object_r:unlabeled_t:s0 262144 Feb  3 10:02 ..
drwxr-xr-x. 4 root root system_u:object_r:unlabeled_t:s0   4096 Feb  3 10:02 Thu-Feb--3-10-02-47-UTC-2022
-rw-r--r--. 1 root root system_u:object_r:unlabeled_t:s0   1871 Feb  3 10:02 generatefiles.sh

Once the pod started, the softlink on the worker node looks like:

sh-4.4# ls -al /var/lib/kubelet/pods/15f04324-7596-40c1-8e0c-b61549cbda69/volumes/kubernetes.io~csi/dirpv/mount
lrwxrwxrwx. 1 root root 16 Feb  3 10:09 /var/lib/kubelet/pods/15f04324-7596-40c1-8e0c-b61549cbda69/volumes/kubernetes.io~csi/dirpv/mount -> /mnt/gpfs1/xdata

And the SELinux labels while the pod is running look like:

sh-4.4# ls -lZ /mnt/gpfs1/xdata -a
total 258
drwxr-xr-x. 3 root root system_u:object_r:container_file_t:s0:c20,c26   4096 Feb  3 10:02 .
drwxr-xr-x. 7 root root system_u:object_r:unlabeled_t:s0              262144 Feb  3 10:02 ..
drwxr-xr-x. 4 root root system_u:object_r:container_file_t:s0:c20,c26   4096 Feb  3 10:02 Thu-Feb--3-10-02-47-UTC-2022
-rw-r--r--. 1 root root system_u:object_r:container_file_t:s0:c20,c26   1871 Feb  3 10:02 generatefiles.sh

who is creating mount points like /mnt/gpfs on nodes? Why are we mounting volumes in such non-standard locations?
We have a GPFS client pod running on the worker nodes which makes the filesystem available at /var/mnt/gpfs and /mnt/gpfs for consumption by CSI and to keep backward compatibility.

@gnufied

gnufied commented Feb 3, 2022

Once the pod with the subdirectory is running, can you post the output of /proc/self/mountinfo from the worker node?

@deeghuge
Member

deeghuge commented Feb 3, 2022

Once the pod with the subdirectory is running, can you post the output of /proc/self/mountinfo from the worker node?

mountinfo.txt

@gnufied

gnufied commented Feb 3, 2022

This is rather surprising. The code in Kubernetes works as follows:

  1. Using the node-publish target path we try to determine if path is mounted in /proc/self/mountinfo and has seclabel on it - https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_mounter.go#L277
  2. https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/hostutil/hostutil_linux.go#L238

We do not evaluate the symlink before determining SELinux support, and hence it should not work. We can set up a call to go over this if you prefer. You can find me on the Kubernetes Slack or the CoreOS (Red Hat) Slack.
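
A rough way to see what that check keys on, from the worker node (illustrative grep patterns, not the exact kubelet logic): the gpfs mount itself carries the seclabel option in mountinfo, but the symlinked target path has no mountinfo entry of its own, so there is nothing for the detection to inspect:

grep gpfs /proc/self/mountinfo                        # the gpfs mount line includes "seclabel"
grep 'kubernetes.io~csi/dirpv' /proc/self/mountinfo   # no output: the target path is a symlink, not a mount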

@jsafrane

jsafrane commented Feb 7, 2022

@deeghuge why does the CSI driver use symlinks instead of mounts, especially when bind-mounts are cheap?
Kubelet assumes that there will be a mount at NodePublish's target_path. We could fix the SELinux detection, but who knows what else would break later.

@deeghuge
Member

deeghuge commented Feb 7, 2022

@jsafrane Here is some background on why softlinks were chosen over bind-mounts.
A user can mount a Spectrum Scale (GPFS) filesystem at any path on the Kubernetes worker nodes, and there can be more than one filesystem on a node. To use bind mounts (NodePublish), all these paths must be available inside the CSI pod. This could be done either by exposing each filesystem's mount point inside the CSI pod or by exposing the host's whole / (root) inside the CSI pod. To avoid that customization (exposing the filesystem path every time a customer adds a new filesystem) or exposing the host's / (root) inside the CSI pod, the softlink approach was taken.

@jsafrane

jsafrane commented Feb 8, 2022

If I understand it correctly, the CSI driver has only /var/lib/kubelet as a HostPath volume in its pod and it creates a symlink there to /mnt/gpfs1. This symlink is broken in the driver (the driver does not have GPFS mounted) and it's only resolved by kubelet on the host.

That's... unexpected.

There can be more than one filesystem on Kubernetes node.

This requires someone to configure and mount GPFS on every single host. That defeats the purpose of CSI - details about the storage should be in the driver container and not on the host. I know the line between the host and the CSI driver can be blurry, e.g. iscsid and multipathd can run on the host; still, having the full storage mounted on the host is something I've never seen.

Did you consider mounting the GPFS volumes inside the driver container, e.g. based on a ConfigMap with all the filesystems to mount, and then providing bind-mounts from it?
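
For illustration only, a sketch of that bind-mount variant with the paths from this issue (this is not what the driver currently does, and in practice it would happen inside NodePublishVolume with the file system visible in the driver pod):

mount --bind /mnt/gpfs1 \
  /var/lib/kubelet/pods/7d736e8c-e4da-4e41-8705-1734529dcfe9/volumes/kubernetes.io~csi/dirpv/mount

The target path then shows up as a real mount point in /proc/self/mountinfo, which is what the kubelet's SELinux detection expects.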

@deeghuge
Member

deeghuge commented Feb 8, 2022

I agree with your suggestions, but there is some background and there are design aspects behind the approach taken:

  • GPFS requires a client to be installed on each worker node where GPFS data needs to be accessed.
  • The GPFS client was containerized only very recently, and as of today it only supports OCP with RHCOS worker nodes.
  • CSI started earlier, supporting vanilla k8s as well as OCP with RHEL worker nodes.
  • As of today GPFS does not support selective mounts, hence whenever it is mounted the whole filesystem becomes visible.
  • Mounting the GPFS volumes inside the driver container, as you suggest, can be done, but it requires the GPFS client to be truly containerized (available for all k8s variants) or the CSI driver to manage things based on where it is running.

@deeghuge deeghuge added this to the v2.7.0 milestone Jun 30, 2022
@gfschmidt
Member Author

Just for further reference: In addition to my comment with the SELinux namespace annotations above (see "gfschmidt commented on Oct 26, 2021"), I summarized the major considerations for using static provisioning with shared data access in IBM Spectrum Scale on OpenShift in the following blog post: Advanced Static Volume Provisioning with IBM Spectrum Scale on Red Hat OpenShift (i.e. distinct mapping of static PVs to PVCs using labels or claimRef, and ensuring a proper uid/gid and SELinux context with namespace annotations or the pod/container securityContext).

@Jainbrt
Member

Jainbrt commented Sep 7, 2022

@deeghuge could you please help add the FQI labels?

@deeghuge deeghuge removed this from the v2.7.0 milestone Sep 9, 2022
@deeghuge deeghuge added Customer Probability: Medium (3) Issue occurs in normal path but specific limited timing window, or other mitigating factor Severity: 3 Indicates the the issue is on the priority list for next milestone. Customer Impact: Localized high impact (3) Reduction of function. Significant impact to workload. labels Sep 9, 2022
@gfschmidt
Member Author

gfschmidt commented Nov 18, 2022

Just some additional information on this SELinux relabeling topic that might be helpful for anyone interested:

If someone wants to see SELinux relabeling in action and understand how the standard SELinux relabeling works when pods in different namespaces access the same data on IBM Spectrum Scale, I recorded and shared some demos:

  1. DATA SHARING DONE WRONG WITH ACCESS DENIED BY SELINUX
  2. DATA SHARING DONE RIGHT WITH CUSTOM SCC

The first demo shows what happens when the same data in IBM Spectrum Scale is accessed by three non-privileged users in OpenShift from three namespaces without taking the SELinux security context into account. The users are regular users running in their own namespaces under the "restricted" SCC (security context constraints). The "restricted" SCC enforces a MustRunAs policy on the SELinux security context, and the default value for the SELinux MCS label is taken from the pre-allocated values given in the annotations of the namespace where the pod is running. So the pod in each namespace runs with a different default SELinux MCS label, and as soon as a pod is started and mounts the volume, all of the data in the mounted volume is relabeled with the default SELinux MCS label of that namespace. In this case each user shuts out all other users from accessing the data if no further precautions with regard to SELinux relabeling are taken.
This demo shows how the pre-allocated default values from the namespace for the SELinux MCS label, uid and fsGroup are applied, and how each user locks out any previous user from accessing the shared data as soon as the user starts a pod.
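
For reference, the pre-allocated defaults mentioned above can be inspected directly on the namespace; the annotation names below are the standard OpenShift ones, the namespace name and values are just examples:

oc get namespace demo-user-1 -o yaml | grep 'openshift.io/sa.scc'
# typically shows something like:
#   openshift.io/sa.scc.mcs: s0:c26,c15
#   openshift.io/sa.scc.supplemental-groups: 1000670000/10000
#   openshift.io/sa.scc.uid-range: 1000670000/10000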

The second demo shows how you can define and apply a custom SCC to ensure that different non-privileged users in three namespaces can safely access the same data in IBM Spectrum Scale across worker nodes in RWX (ReadWriteMany) access mode with a proper SELinux context and individual uid/gid file permissions. We show how to grant access to a custom SCC ("shared-scc") via a service account (plus role and rolebinding) in the user namespace, as well as by assigning the custom SCC directly to the user (which could also be a group of users). Here the pods in each namespace run with the same SELinux MCS label as defined in the custom SCC ("shared-scc"). The PVs have been provisioned as shown in the first demo above.
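
As a rough sketch of that second approach (SCC name taken from the demo; service account, namespace and user/group names are placeholders, and the demo itself may use explicit role/rolebinding manifests instead):

# Grant the custom SCC to a service account in the user's namespace ...
oc adm policy add-scc-to-user shared-scc -z shared-app-sa -n demo-user-1
# ... or assign it directly to a user or a group of users:
oc adm policy add-scc-to-user shared-scc user1
oc adm policy add-scc-to-group shared-scc shared-data-users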

The PVs for the two demos have been provisioned as shown in this demo: ADVANCED VOLUME PROVISIONING FOR DATA SHARING.

This presentation gives a good summary of the SELinux issue and how to safely mitigate it with custom SCCs. The setup used in the videos, along with more information, is available in my blog post here.

Under some circumstances, SELinux relabeling may also lead to container creation errors on large volumes with many files, if a timeout of 120 seconds is hit before the relabeling has finished. My blog post above also shows one way to disable SELinux relabeling on a volume today by using a custom SCC and "spc_t" in this section, which is based on one of two solutions proposed in a Red Hat article.
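
A quick way to verify which SELinux domain a container actually ended up running with (pod and namespace names are placeholders; the label values shown are only examples):

oc rsh -n demo-user-1 my-app-pod cat /proc/self/attr/current
# e.g. system_u:system_r:container_t:s0:c7,c28   (normal confined container)
# or   system_u:system_r:spc_t:s0                (super-privileged container, no volume relabeling)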

@gfschmidt
Member Author

gfschmidt commented Feb 17, 2023

Note that a behavior similar to the one described in this issue may also be observed if the backing directory of a statically provisioned volume is actually an IBM Spectrum Scale independent fileset (instead of a regular directory). An independent fileset may behave in a similar way when specified as the backing directory in the volumeHandle of a statically provisioned volume. In this case automatic SELinux relabeling may be skipped by OpenShift on any sub-directories. This can be seen as an advantage for static provisioning and shared data access, because the pre-existing permissions and SELinux attributes in the file system / fileset are honored. Here the storage admin keeps control over the access permissions of the data as set in the file system and can generally or selectively grant access to applications in OpenShift.

For example, the storage admin could grant general access to the data in an independent fileset accessed by pods/containers in OpenShift through a statically provisioned PV by setting the SELinux MCS label on an entire fileset (or just selected sub-directories or objects within the fileset) to "system_u:object_r:container_file_t:s0" on the IBM Spectrum Scale storage cluster (Note: The option -R will recursively change the SELinux label on all files and directories at the specified destination):

# chcon -R "system_u:object_r:container_file_t:s0" /[absolute-path-to-fileset]/[fileset-link-point]

This would only have to be done once, as the applied SELinux settings will persist. If SELinux is not enabled on the storage cluster, the manual SELinux relabeling could also be done by a cluster admin on OpenShift by running a debug pod, for example:

# oc debug node/[worker-node] -- chroot /host chcon -R 'system_u:object_r:container_file_t:s0' /mnt/[local-Scale-file-system-name]/[relative-path-to-fileset]/[fileset-link-point]

Note: As always when executing commands as root user: Be very cautious and careful when doing the SELinux relabeling by running the chcon command directly in a debug pod on an OpenShift worker node. When done wrong you can harm your system!

Without adding a specific category (like c7,c28) to the SELinux MCS label "container_file_t:s0", the data can be accessed by any pod/container regardless of the specific SELinux MCS categories that OpenShift assigns to the pod/container process (for example, "container_t:s0:c7,c28"), because the empty set of SELinux categories is a subset of every process's category set. "Any pod/container" means, of course, any pod/container that has been given explicit access to the data through a statically provisioned PV.

The storage admin can even set the SELinux MCS labels more selectively, with specific categories on selected (or all) objects in the independent fileset if needed, but that would also require carefully aligning the SELinux labels in the file system with the SELinux security context of the pods/containers in OpenShift, which depends on the OpenShift Security Context Constraints (SCCs) as well as the user, group, service account and the pre-allocated defaults from the namespace. Please refer to my blog post here for more information.
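
A hypothetical example of such a selective scheme (the category pair c7,c28 is chosen arbitrarily here; paths follow the placeholder convention used above):

# Label the fileset (or selected objects) with a specific MCS category pair:
chcon -R "system_u:object_r:container_file_t:s0:c7,c28" /[absolute-path-to-fileset]/[fileset-link-point]

# Verify the label on the files:
ls -dZ /[absolute-path-to-fileset]/[fileset-link-point]

# Only pods whose process context carries matching categories (e.g.
# container_t:s0:c7,c28, as set via the SCC's seLinuxOptions or the pod
# securityContext) will then be able to access the data.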

@deeghuge
Member

Issue is fixed in CNSA 5.1.7. Please reopen if you still see the issue on the latest CNSA.
