IBM Watson Machine Learning Community Edition on OpenShift Using Helm Tiller

IBM Watson Machine Learning Community Edition (WML CE) makes deep learning, machine learning, and AI more accessible and more performant.

Introduction

IBM WML CE incorporates some of the most popular deep learning frameworks, along with unique IBM augmentations to improve cluster performance and support larger deep learning models.

Chart Details

  • Deploys a pod with the WML CE container that has all of the supported WML CE frameworks.
  • Supports persistent storage, allowing you to access your data sets and provide your training application code to the pod.
  • Provides control over the command that is run during pod startup.
  • Allows you to control which GPU type is used. Useful when running multiple worker nodes of different GPU types. For example, AC922 with V100 and 822LC with P100.

Prerequisites

  • Kubernetes v1.11.3 or later with GPU scheduling enabled, and Tiller v2.9.1 or later (Refer to the Install Tiller and Helm section)
  • The application must run on nodes with supported GPUs; see the IBM WML CE V1.7.0 release notes.
  • Helm 2.9.1 or later
  • Refer to the Enabling GPUs in OpenShift section to configure GPU nodes.
  • If you wish to use persistent storage for data sets and/or runtime code, set persistence.enabled=true and create your persistent volume before deploying the chart (unless you use dynamic provisioning). The volume can be created from a YAML file as in the following example; a sketch of the creation command follows the YAML. Note: accessModes can be ReadWriteOnce or ReadWriteMany.
kind: PersistentVolume
apiVersion: v1
metadata:
  name: "wmlce-datavolume"
  labels:
    type: local
spec:
  storageClassName: ""
  capacity:
    storage: 8Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/wmlce/data"

Enabling GPUs in OpenShift

See https://developer.ibm.com/linuxonpower/2019/11/19/enabling-gpus-in-openshift-3-11/

Install Tiller and Helm

Create a Tiller project

Create a project where the tiller image/service will be deployed. The name of the project can be anything. For example, it can be as simple as "tiller". For the rest of this document, we will use TILLER_PROJECT for the project name.

 oc new-project TILLER_PROJECT

Get helm client and copy the binary

  • On Power

Download the Helm client and copy the binary to a directory in your PATH:

$ wget https://get.helm.sh/helm-v2.12.0-linux-ppc64le.tar.gz
$ tar xvf helm-v2.12.0-linux-ppc64le.tar.gz
$ cp linux-ppc64le/helm /usr/local/bin/

  • On x86
$ wget https://get.helm.sh/helm-v2.12.0-linux-amd64.tar.gz
$ tar xvf helm-v2.12.0-linux-amd64.tar.gz
$ cp linux-amd64/helm /usr/local/bin/

Start tiller deployment

  • On Power

Before starting the tiller deployment, copy the tiller deployment and service YAML files (tiller-template.yaml and tiller-service.yaml) from the prereqs folder, and make sure to update the tiller image name in tiller-template.yaml:

...
spec:
  containers:
  - name: tiller
    image: <tiller image>
...

Replace <tiller image> with the full tiller image name. For example: ibmcom/tiller-ppc64le:v2.12.0-ocp-3.11

oc process -f tiller-template.yaml -p TILLER_NAMESPACE=TILLER_PROJECT -p HELM_VERSION=2.12.0 | oc create -f -

  • On x86
oc process -f https://github.com/openshift/origin/raw/master/examples/helm/tiller-template.yaml -p TILLER_NAMESPACE=TILLER_PROJECT -p HELM_VERSION=v2.12.0 | oc create -f -

Next, start the tiller service:

oc create -f tiller-service.yaml

Export HELM_HOST and HELM_HOME

export HELM_HOST=tiller.TILLER_PROJECT.svc.cluster.local:44134
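The heading above also mentions HELM_HOME. Exporting it is only needed if you want the Helm client configuration in a specific location; the following is a sketch assuming the conventional Helm 2 default of ~/.helm:

export HELM_HOME=$HOME/.helm      # assumption: default Helm 2 client configuration directory
helm init --client-only           # initialize local Helm configuration without installing Tiller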

Verify the setup with: helm version

Test that the helm binary can communicate with the tiller service:

$ helm version
Client: &version.Version{SemVer:"v2.12.0", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.12.0", GitCommit:"d325d2a9c179b33af1a024cdb5a4472b6288016a", GitTreeState:"clean"}

Adjust SCC (Security Context Constraint)

Bind the cluster-admin cluster role to the tiller service account in the TILLER_PROJECT namespace. This step only needs to be done once per cluster.

oc create clusterrolebinding tiller-cluster-admin --clusterrole=cluster-admin --serviceaccount=TILLER_PROJECT:tiller

Resources Required

Generally, WMLCE leverages GPUs for training and inferencing. You can control the number of GPUs a given pod has access to by setting the resources.gpu value. Setting it to 0 allows deployment on a non-GPU system. You can also control the GPU type that is assigned to a given pod by using the resources.gputype value. This uses a nodeSelector label of gputype (example: gputype=nvidia-tesla-v100-16gb) and needs to be configured before deploying the Helm chart. This is useful when running a mix of GPU-enabled worker nodes, for example IBM Power Systems AC922 (POWER9) with V100 GPUs and IBM Power Systems 822LC for HPC (POWER8) with P100 GPUs.
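As a sketch of how the gputype selection fits together (the node name below is hypothetical, and the helm flags simply set the chart values described above):

# Label a worker node that has V100 GPUs (node name is illustrative)
oc label node worker-gpu-01 gputype=nvidia-tesla-v100-16gb

# Request one GPU and schedule only onto nodes carrying that label
helm install --name my-release --set license=accept --set resources.gpu=1 --set resources.gputype=nvidia-tesla-v100-16gb <path_of_chart>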

Limitations

  • This chart is intended to be deployed in OpenShift.
  • This chart provides some basic building blocks to get started with WML CE. It is generally expected (though not required) that the WML CE Docker image and Helm chart would be extended for a specific production use case.
  • When DDL/Distributed mode with InfiniBand is enabled, the IPC_LOCK, SYS_PTRACE, and SYS_RESOURCE capabilities and hostPID are added.
  • Distributed mode can be used to deploy the cluster for all distributed frameworks, such as DML/DDL.
  • The ddl option will be deprecated in a future release.
  • DDL/Distributed mode with InfiniBand is only supported when all worker nodes run RHEL as the host operating system.

Installing the Chart

  1. Clone the repository: git clone https://github.com/ibm/wmlce-openshift

  2. Apply SCC for added capabilities

Note: If you plan to enable paiDistributed or DDL, add the following values to the existing wmlce-scc.yaml file:

allowHostIPC: true
allowHostNetwork: true
allowHostPID: true
allowHostPorts: true

oc create -f prereqs/wmlce-scc.yaml
oc adm policy add-scc-to-user <name_of_scc> system:serviceaccount:<TILLER_PROJECT>:default

  3. Pull the WMLCE image from the Red Hat registry.

All WML CE image tags are available at https://access.redhat.com/containers/?tab=tags#/registry.connect.redhat.com/ibm/wmlce. To check framework-specific image tags, replace wmlce with the framework name in the above link. These frameworks are available: pytorch, tensorflow, rapids, xgboost, caffe, and pai4sk. Visit the Red Hat registry to check for the latest available tags before pulling an image. Use the following commands to pull the image, and make sure to pull it on all worker nodes.

$ docker login registry.connect.redhat.com
Username: ${REGISTRY-SERVICE-ACCOUNT-USERNAME}
Password: ${REGISTRY-SERVICE-ACCOUNT-PASSWORD}

$ docker pull registry.connect.redhat.com/ibm/wmlce:wmlce-1.7.0-py37-<arch>-4

where arch = ppc64le or x86_64

The default value wmlce-1.7.0-py37-ppc64le-4 is set in values.yaml. If you want to use a different tag, update values.yaml with the correct image tag or override the tag during deployment.
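For example, the tag can be overridden when installing the chart in the next step (the tag shown mirrors the pull example above and may not be the latest available):

helm install --name my-release --set license=accept --set image.tag=wmlce-1.7.0-py37-x86_64-4 <path_of_chart>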

  4. Install the chart. In this example, it has the release name my-release:
$ helm install --name my-release --set license=accept <path_of_chart> 

The command deploys ibm-wmlce on the OpenShift cluster in the default configuration. The configuration section lists the parameters that can be configured during installation.

Tip: List all releases using helm list

Verifying the Chart

See the NOTES.txt file associated with this chart for verification instructions.
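In addition to NOTES.txt, a quick sanity check can be done with standard Helm 2 and oc commands (the release label selector assumes the chart labels its resources with release=<release-name>, as the uninstall example below does):

$ helm status my-release
$ oc get pods -l release=my-release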

Uninstalling the Chart

To uninstall/delete the my-release deployment:

$ helm delete my-release --purge 

The command removes all the Kubernetes components associated with the chart and deletes the release. After deleting the chart, you should consider deleting any persistent volumes that you created.

For example, when deleting a release with stateful sets, the associated persistent volume needs to be deleted manually. Do the following after deleting the chart release to clean up orphaned persistent volumes:

$ oc delete pvc -l release=my-release
$ oc delete pv <name_of_pv>

Configuration

The following table lists the configurable parameters of the ibm-wmlce chart and their default values.

Parameter | Description | Default
license | Set license=accept to accept the terms of the license | Not accepted
image.repository | WMLCE image repository | registry.connect.redhat.com/ibm/wmlce
image.tag | Docker image tag. For other image tags, see https://access.redhat.com/containers/?tab=tags#/registry.connect.redhat.com/ibm/wmlce | wmlce-1.7.0-py37-ppc64le-4
image.pullPolicy | Docker image pull policy (Options: IfNotPresent, Always, Never) | IfNotPresent
global.image.secretName | Docker image pull secret, if you are using a private Docker registry | nil
service.type | Kubernetes service type for exposing ports (Options: ClusterIP, None) | nil
service.port | Kubernetes port number to expose | nil
resources.gpu | Number of GPUs on which to run the container. A value of 0 will not allocate a GPU. | 1
resources.gputype | Type of GPU on which to run the container. Requires the nodeSelector label gputype to be configured beforehand (e.g. gputype=nvidia-tesla-v100-16gb). | nvidia-tesla-v100-16gb
paiDistributed.mode | Enable WMLCE Distributed mode | false
paiDistributed.gpuPerHost | Number of GPUs per host | 4
paiDistributed.sshKeySecret | Secret containing 'id_rsa' and 'id_rsa.pub' keys for the containers | nil
paiDistributed.useHostNetwork | For better performance with TCP, use the host network. WARNING: SSH port needs to be different than 22. | false
paiDistributed.sshPort | Port used by SSH | 22
paiDistributed.useInfiniBand | Use InfiniBand for cross-node communication | false
ddl.enabled | Enable WMLCE Distributed mode when using DDL | false
ddl.gpuPerHost | Number of GPUs per host when using DDL | 4
ddl.sshKeySecret | Secret containing 'id_rsa' and 'id_rsa.pub' keys for the containers | nil
ddl.useHostNetwork | For better performance with TCP, use the host network. WARNING: SSH port needs to be different than 22. | false
ddl.sshPort | Port used by SSH | 22
ddl.useInfiniBand | Use InfiniBand for cross-node communication | false
persistence.enabled | Use a PVC to persist data | false
persistence.useDynamicProvisioning | Use dynamic provisioning for the persistent volume | false
wmlcePVC.name | Name of the volume claim | datavolume
wmlcePVC.accessMode | Volume access mode (Options: ReadWriteOnce, ReadWriteMany, ReadOnlyMany) | ReadWriteMany
wmlcePVC.existingClaim | Existing data PVC claim name | nil (creates a new claim by default)
wmlcePVC.storageClassName | Data PVC storage class | nil (uses the cluster's default storage class for dynamic provisioning)
wmlcePVC.size | Data PVC size | 8Gi
command | Command to run inside the pod, e.g. /usr/bin/python /wmlce/data/train.py; | nil

Specify each parameter using the --set key=value[,key=value] argument to helm install.

$ helm install --name my-release --set license=accept,resources.gpu=1 <chartname>

Note: To deploy WML CE for DML and DDL tests, follow https://developer.ibm.com/linuxonpower/2018/10/04/distributed-deep-learning-ibm-cloud-private/

Alternatively, a YAML file that specifies the values for the parameters can be provided while installing the chart.

Tip: The default values are in the values.yaml file of the WMLCE chart.

$ helm install --name my-release -f values.yaml <chartname>

The volume is mounted at /wmlce/data when persistence.enabled=true.
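For the paiDistributed.sshKeySecret and ddl.sshKeySecret parameters, the chart expects a secret containing id_rsa and id_rsa.pub. A minimal sketch for creating one (the secret name wmlce-ssh-key is hypothetical):

ssh-keygen -t rsa -f id_rsa -N ""
oc create secret generic wmlce-ssh-key --from-file=id_rsa --from-file=id_rsa.pub
helm install --name my-release --set license=accept --set paiDistributed.mode=true --set paiDistributed.sshKeySecret=wmlce-ssh-key <path_of_chart>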

Storage

You can optionally provide a persistent volume to the deployment. This volume can hold data that you wish to process, as well as executables for the command you want to run. For example, if you had Python code that would train a model on a given set of data, this volume would host your Python code as well as your data, and you can run the Python code by specifying the appropriate command.
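Putting the pieces together, a hedged example of running training code from the persistent volume (the script path follows the command example in the configuration table, and assumes the persistent volume shown in the prerequisites exists):

helm install --name my-release --set license=accept --set persistence.enabled=true --set command="/usr/bin/python /wmlce/data/train.py" <path_of_chart>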
