Add mlx and datashim deployment with openshift #41

Merged: 14 commits merged into IBM:master on Jun 11, 2022

Conversation

@Tomcli commented Jun 9, 2022

Which issue is resolved by this Pull Request:
Resolves #

Description of your changes:

Checklist:

  • Unit tests pass:
    Make sure you have installed kustomize == 3.2.1
    1. make generate-changed-only
    2. make test
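
For reference, the checklist above maps to roughly the following shell steps (a sketch only; it assumes kustomize 3.2.1 is already on your PATH and that you run from the repo root):

kustomize version            # should report 3.2.1
make generate-changed-only   # regenerate only the manifests affected by the change
make test                    # run the unit tests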

@yhwang merged commit 2dc4436 into IBM:master on Jun 11, 2022

@ckadner commented Jun 13, 2022

Thank you, @Tomcli, for updating the manifests for MLX!

For KIND and OpenShift on Fyre we are still seeing issues with the kfp-csi-s3 pod getting stuck in ContainerCreating.

You had mentioned it last Friday, but I wanted to keep a record of it. Should I create an issue in this repo or in the MLX repo to keep track of it?

[IBM_manifests] (v1.5-branch=)$ kubectl get pod -n kubeflow kfp-csi-s3-4x6dl

NAME               READY   STATUS              RESTARTS   AGE
kfp-csi-s3-4x6dl   0/2     ContainerCreating   0          28m


[IBM_manifests] (v1.5-branch=)$ kubectl describe pod -n kubeflow kfp-csi-s3-4x6dl

Name:           kfp-csi-s3-4x6dl
Namespace:      kubeflow
Priority:       0
Node:           mlx-control-plane/172.18.0.2
Start Time:     Mon, 13 Jun 2022 12:32:31 -0700
Labels:         app=kfp-csi-s3
                app.kubernetes.io/name=kubeflow
                application-crd-id=kubeflow-pipelines
                controller-revision-hash=6d8fbd86d7
                pod-template-generation=1
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  DaemonSet/kfp-csi-s3
Containers:
  driver-registrar:
    Container ID:
    Image:         quay.io/k8scsi/csi-node-driver-registrar:v1.2.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --csi-address=/csi/csi.sock
      --kubelet-registration-path=/var/data/kubelet/plugins/kfp-csi-s3/csi.sock
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      KUBE_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /csi from socket-dir (rw)
      /registration from registration-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8gbd8 (ro)
  kfp-csi-s3:
    Container ID:
    Image:         quay.io/datashim/csi-s3:latest-amd64
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --endpoint=$(CSI_ENDPOINT)
      --nodeid=$(KUBE_NODE_NAME)
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      CSI_ENDPOINT:    unix:///csi/csi.sock
      KUBE_NODE_NAME:   (v1:spec.nodeName)
      cheap:           off
    Mounts:
      /csi from socket-dir (rw)
      /dev from dev-dir (rw)
      /var/data/kubelet/pods from mountpoint-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8gbd8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  socket-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/data/kubelet/plugins/kfp-csi-s3
    HostPathType:  DirectoryOrCreate
  mountpoint-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/data/kubelet/pods
    HostPathType:  DirectoryOrCreate
  registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/data/kubelet/plugins_registry
    HostPathType:  Directory
  dev-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  Directory
  kube-api-access-8gbd8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    28m                  default-scheduler  Successfully assigned kubeflow/kfp-csi-s3-4x6dl to mlx-control-plane
  Warning  FailedMount  28m (x2 over 28m)    kubelet            MountVolume.SetUp failed for volume "kube-api-access-8gbd8" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  17m (x3 over 21m)    kubelet            Unable to attach or mount volumes: unmounted volumes=[registration-dir], unattached volumes=[socket-dir registration-dir kube-api-access-8gbd8 mountpoint-dir dev-dir]: timed out waiting for the condition
  Warning  FailedMount  14m (x2 over 23m)    kubelet            Unable to attach or mount volumes: unmounted volumes=[registration-dir], unattached volumes=[kube-api-access-8gbd8 mountpoint-dir dev-dir socket-dir registration-dir]: timed out waiting for the condition
  Warning  FailedMount  12m                  kubelet            Unable to attach or mount volumes: unmounted volumes=[registration-dir], unattached volumes=[dev-dir socket-dir registration-dir kube-api-access-8gbd8 mountpoint-dir]: timed out waiting for the condition
  Warning  FailedMount  8m3s (x3 over 26m)   kubelet            Unable to attach or mount volumes: unmounted volumes=[registration-dir], unattached volumes=[registration-dir kube-api-access-8gbd8 mountpoint-dir dev-dir socket-dir]: timed out waiting for the condition
  Warning  FailedMount  100s (x21 over 28m)  kubelet            MountVolume.SetUp failed for volume "registration-dir" : hostPath type check failed: /var/data/kubelet/plugins_registry is not a directory
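
The last event points at the likely root cause: the kubelet on this node does not use /var/data/kubelet, so the plugins_registry hostPath check fails. One possible way to compare the two candidate paths on the KIND node (node name taken from the describe output above; a sketch only):

docker exec mlx-control-plane ls -ld /var/data/kubelet/plugins_registry /var/lib/kubelet/plugins_registry   # see which path actually exists on the node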

@yhwang commented Jun 14, 2022

Someone also hit a similar issue on AWS ORKS. I guess the key is the correct path of /var/data/kubelet/plugins_registry; I'm not sure how to get the correct path of plugins_registry on KIND and OpenShift on Fyre.

The correct kubelet path for KIND is /var/lib/kubelet/plugins_registry.

Update: after replacing /var/data with /var/lib, the kfp-csi-s3 storage class works on my KIND cluster.
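
In manifest terms, that change amounts to rewriting the kubelet base path in the kfp-csi-s3 DaemonSet (the hostPath volumes and the --kubelet-registration-path argument shown above). A hypothetical sketch, with a placeholder file path:

# placeholder path; the real DaemonSet spec lives in the datashim / kfp-csi-s3 manifests in this repo
sed -i 's#/var/data/kubelet#/var/lib/kubelet#g' path/to/kfp-csi-s3-daemonset.yaml
kubectl apply -n kubeflow -f path/to/kfp-csi-s3-daemonset.yaml
kubectl -n kubeflow rollout status daemonset/kfp-csi-s3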

@yhwang commented Jun 14, 2022

@jbusche also found that the path for OpenShift on Fyre is /var/lib too.
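
A possible way to confirm the kubelet path on an OpenShift node (a sketch; <node-name> is a placeholder):

oc debug node/<node-name> -- chroot /host ls -ld /var/lib/kubelet/plugins_registry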

@yhwang commented Jun 14, 2022

I found that Tommy created a datashim layer for kind using /var/lib: https://github.com/IBM/manifests/blob/v1.5-branch/contrib/datashim/kind/datashim.yaml#L784-L865
Both mlx-single-fyre-openshift and mlx-single-kind are using it now.
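
A quick way to double-check that the referenced layer uses the /var/lib paths (run from the repo root; a sketch):

grep -n '/var/lib/kubelet' contrib/datashim/kind/datashim.yaml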

yhwang pushed a commit that referenced this pull request on Aug 10, 2023:
* add mlx and datashim deployment
* remove datashim conflicted spec
* add empty-dir for build dir
* update scc and fix typo
* add proc volume
* remove proc volume
* add kfp-csi-s3 to no auth kfp-tekton applications
* remove kfp-csi-s3 to no auth kfp-tekton applications
* remove kfp-csi-s3 to no auth kfp-tekton applications
* add datashim scc
* update dlf scc
* patch latest dlf oc yaml
* update kind datashim yaml
* add fyre openshift manifest