# Train a K-Means Model with Watson Machine Learning 

Notebook created by Zeming Zhao in June 2021

In this notebook, you will learn how to use the Watson Machine Learning Accelerator (WML-A, or WMLA) API to submit K-Means training jobs and to accelerate K-Means training on a GPU.

K-Means is a basic but powerful clustering method that is optimized with an Expectation-Maximization-style procedure. It randomly selects K data points in X as initial centroids and assigns every sample to its closest centroid. For each resulting cluster, the mean of its points is computed and becomes the new centroid. The two steps repeat until the assignments stop changing.
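
The assign-then-update loop described above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn or cuML implementation: the function name `kmeans`, the iteration cap, the fixed seed, and the toy data are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative Lloyd-style K-Means on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Initialisation: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids))
```

On data this cleanly separated, the loop settles on one centroid per group within a few iterations. Production libraries add better initialisation (such as `k-means++`) and vectorised distance computation.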

In this notebook we have two versions of the K-Means model: one uses scikit-learn and the other uses cuML.

Both are submitted to WMLA: the scikit-learn version runs on CPU and the cuML version runs on GPU, so we can compare the performance benefit of cuML on GPU.

This notebook covers the following sections:

1. [Set up K-Means using scikit-learn](#skl-model)<br>

1. [Training the model on CPU with Watson Machine Learning Accelerator](#skl-cpu)<br>

1. [Set up K-Means using cuML](#cuml-model)<br>

1. [Training the model on GPU with Watson Machine Learning Accelerator](#cuml-gpu)<br>

## Preparations
### Prepare directory and file for writing K-Means engine.

In [31]:
from pathlib import Path
model_dir = '/project_data/data_asset/models'
model_base_name = 'K-Means-main.py'
# parents=True also creates missing intermediate directories
Path(model_dir).mkdir(parents=True, exist_ok=True)
print("create model directory done.")

create model directory done.


<a id = "skl-model"></a>
## Step 1 : Set up K-Means model using scikit-learn
### Create a K-Means Model based on scikit-learn on CPU

In [32]:
model_main='sklean-'+model_base_name

In [33]:
%%writefile {model_dir}/{model_main}

# import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans as skKMeans
from sklearn.metrics import adjusted_rand_score
import os
import datetime

# Define parameters for a large clustering problem
n_samples = 1000000
n_features = 200
n_clusters = 10
random_state = 23

# Generate Data
start = datetime.datetime.now()
host_data, host_labels = make_blobs(n_samples=n_samples,
                                        n_features=n_features,
                                        centers=n_clusters,
                                        random_state=random_state,
                                        cluster_std=0.1)
end = datetime.datetime.now()
print("generate data timecost: %.2fs" % ((end-start).total_seconds()))

# scikit-learn K-Means model
# (n_jobs is omitted: it was deprecated in scikit-learn 0.23 and removed in 1.0)
kmeans_sk = skKMeans(init="k-means++",
                     n_clusters=n_clusters,
                     random_state=random_state)

# Fit
start = datetime.datetime.now()
kmeans_sk.fit(host_data)
end = datetime.datetime.now()
print("train timecost: %.2fs" % ((end-start).total_seconds()))

# # Visualize 
# fig = plt.figure(figsize=(16, 10))
# plt.scatter(host_data[:, 0], host_data[:, 1], c=host_labels, s=50, cmap='viridis')

# #plot the sklearn kmeans centers with blue filled circles
# centers_sk = kmeans_sk.cluster_centers_
# plt.scatter(centers_sk[:,0], centers_sk[:,1], c='blue', s=100, alpha=.5)
# plt.title('sklearn kmeans clustering')
# plot_file = "./kmeans_cpu.png"
# plt.savefig(plot_file)

# Evaluate
start = datetime.datetime.now()
sk_score = adjusted_rand_score(host_labels, kmeans_sk.labels_)
end = datetime.datetime.now()
print("evaluate timecost: %.2fs" % ((end-start).total_seconds()))

print("score (scikit-learn): %s" % sk_score)

Overwriting /project_data/data_asset/models/sklean-K-Means-main.py
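
The script repeats the same start/end timestamp pattern around data generation, fitting, and evaluation. As a sketch of one way to factor that out, the hypothetical helpers below (`timed` and `format_duration` are names invented for this example, not part of the notebook) wrap a block in a context manager and report the elapsed time with two decimal places.

```python
import time
from contextlib import contextmanager

def format_duration(seconds):
    """Render a duration with two decimals, e.g. 257.3 -> '257.30s'."""
    return "%.2fs" % seconds

@contextmanager
def timed(label, report=print):
    """Time the enclosed block and report '<label> timecost: <seconds>s'."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        report("%s timecost: %s" % (label, format_duration(elapsed)))

# Usage: any block of work can be wrapped in place of the start/end pairs.
with timed("toy workload"):
    total = sum(i * i for i in range(100000))
```

Fixed-decimal formatting keeps long durations readable. A two-significant-digit format such as `%.2g` would render 257.3 seconds as `2.6e+02`.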


<a id = "skl-cpu"></a>
## Step 2 : Training the scikit-learn model on CPU with Watson Machine Learning Accelerator
### Prepare the model lib for job submission

In [34]:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

from matplotlib import pyplot as plt
%pylab inline

import base64
import json
import time
import urllib

Populating the interactive namespace from numpy and matplotlib


### Configuring your environment and project details
To set up your project details, provide your cluster host name and your credentials (in the form `username:password`) in the following cell.

In [35]:
# please enter Watson Machine Learning Accelerator host name
hostname='wmla-console-wmla.apps.dse-perf.cpolab.ibm.com'
# login='username:password' # please enter the login and password
login='mluser1:mluser1'

es = base64.b64encode(login.encode('utf-8')).decode("utf-8")
# print(es)
commonHeaders={'Authorization': 'Basic '+es}
req = requests.Session()
auth_url = 'https://{}/auth/v1/logon'.format(hostname)
print(auth_url)

a=requests.get(auth_url,headers=commonHeaders, verify=False)
access_token=a.json()['accessToken']
# print("Access_token: ", access_token)

dl_rest_url = 'https://{}/platform/rest/deeplearning/v1'.format(hostname)
commonHeaders={'accept': 'application/json', 'X-Auth-Token': access_token}
req = requests.Session()

# Health check
confUrl = 'https://{}/platform/rest/deeplearning/v1/conf'.format(hostname)
r = req.get(confUrl, headers=commonHeaders, verify=False)

https://wmla-console-wmla.apps.dse-perf.cpolab.ibm.com/auth/v1/logon
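
The cell above builds an HTTP Basic authorization header from a `username:password` string written directly in the notebook. The sketch below shows the same encoding step with the credentials read from environment variables instead. The variable names `WMLA_USERNAME` and `WMLA_PASSWORD` and both function names are assumptions for illustration, not something WMLA defines.

```python
import base64
import os

def basic_auth_header(username, password):
    """Build the HTTP Basic 'Authorization' header for the given credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("utf-8")
    return {"Authorization": "Basic " + token}

def credentials_from_env(environ=os.environ):
    """Read credentials from hypothetical WMLA_USERNAME / WMLA_PASSWORD variables."""
    try:
        return environ["WMLA_USERNAME"], environ["WMLA_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None

# Example with placeholder credentials (not real ones).
print(basic_auth_header("user", "pass"))
```

The resulting dictionary can be passed as `headers` to the logon request in the same way as `commonHeaders`.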


### Define the status checking function

In [36]:
import tarfile
import tempfile
import os
import json
import pprint
import pandas as pd
from IPython.display import clear_output

def query_job_status(job_id,refresh_rate=3) :

    execURL = dl_rest_url  +'/execs/'+ job_id['id']
    pp = pprint.PrettyPrinter(indent=2)

    keep_running=True
    res=None
    while(keep_running):
        res = req.get(execURL, headers=commonHeaders, verify=False)
        monitoring = pd.DataFrame(res.json(), index=[0])
        pd.set_option('max_colwidth', 120)
        clear_output()
        print("Refreshing every {} seconds".format(refresh_rate))
        display(monitoring)
        pp.pprint(res.json())
        if(res.json()['state'] not in ['PENDING_CRD_SCHEDULER', 'SUBMITTED','RUNNING']) :
            keep_running=False
        time.sleep(refresh_rate)
    return res
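
`query_job_status` keeps polling for as long as the job is in one of three in-progress states, and it has no upper bound on how long it waits. The sketch below isolates that polling logic and adds a timeout. The name `wait_for_job`, the injected `fetch_state` callable, and the default timeout are assumptions for illustration; only the three state names come from the function above.

```python
import time

# In-progress states, taken from the status check in query_job_status.
ACTIVE_STATES = {"PENDING_CRD_SCHEDULER", "SUBMITTED", "RUNNING"}

def wait_for_job(fetch_state, refresh_rate=3.0, timeout=3600.0):
    """Poll fetch_state() until the job leaves the active states.

    fetch_state is any zero-argument callable returning the current state
    string (for example, a wrapper around the REST call). Raises TimeoutError
    if the job is still active once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state not in ACTIVE_STATES:
            return state  # terminal state reached, no extra sleep needed
        if time.monotonic() >= deadline:
            raise TimeoutError("job still in state %s after %.0f seconds" % (state, timeout))
        time.sleep(refresh_rate)

# Usage with a simulated sequence of states instead of a live cluster.
simulated = iter(["SUBMITTED", "RUNNING", "RUNNING", "FINISHED"])
print(wait_for_job(lambda: next(simulated), refresh_rate=0))
```

Passing the state lookup in as a callable keeps the loop testable without network access.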

### Define the submission function

In [37]:
import datetime

def submit_job_to_wmla(args, files):
    """Submit a job to WMLA, poll until it finishes, and print the elapsed time."""
    starttime = datetime.datetime.now()
    r = requests.post(dl_rest_url + '/execs?args=' + args, files=files,
                      headers=commonHeaders, verify=False)
    if not r.ok:
        print('submit job failed: code=%s, %s' % (r.status_code, r.content))
        return  # nothing to poll if the submission was rejected
    job_status = query_job_status(r.json(), refresh_rate=5)
    endtime = datetime.datetime.now()
    print("\nTotal training cost: ", int((endtime - starttime).total_seconds()), " seconds.")

### Define the submission parameters for the scikit-learn version on CPU

In [38]:
# specify the model file, conda env, device type and device number
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
--workerDeviceNum 1 \
--workerMemory 32G \
--workerDeviceType cpu \
--conda-env-name rapids-21.06-new  \
--model-main ' + model_main
print(args)

--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType cpu --conda-env-name rapids-21.06-new  --model-main sklean-K-Means-main.py
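
The CPU and GPU submissions in this notebook differ only in the device type and the model file. As a sketch, a small helper can assemble the argument string from those values. `build_wmla_args` is a hypothetical name introduced here. The flags and default values are copied from the cell above, and the helper does not validate them against the WMLA API.

```python
def build_wmla_args(model_main, device_type, device_num=1,
                    memory="32G", conda_env="rapids-21.06-new"):
    """Assemble the submission argument string used in this notebook."""
    if device_type not in ("cpu", "gpu"):
        raise ValueError("device_type must be 'cpu' or 'gpu'")
    options = [
        "--exec-start tensorflow",
        "--cs-datastore-meta type=fs",
        f"--workerDeviceNum {device_num}",
        f"--workerMemory {memory}",
        f"--workerDeviceType {device_type}",
        f"--conda-env-name {conda_env}",
        f"--model-main {model_main}",
    ]
    return " ".join(options)

print(build_wmla_args("cuml-K-Means-main.py", "gpu"))
```

Building the string from a list avoids the backslash line continuations inside the string literal and makes a CPU/GPU switch a one-argument change.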


### Submit WMLA Workload

In [39]:
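# Open the model file in a context manager so the handle is closed after submission
with open("{0}/{1}".format(model_dir, model_main), 'rb') as model_file:
    submit_job_to_wmla(args, {'file': model_file})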

Refreshing every 5 seconds


| | id | args | submissionId | creator | state | appId | schedulerUrl | modelFileOwnerName | workDir | appName | createTime | elastic | nameSpace | numWorker | framework |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wmla-954 | --exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType cpu --... | wmla-954 | mluser1 | FINISHED | wmla-954 | https://wmla-mss:9080 | wmla | /gpfs/myresultfs/mluser1/batchworkdir/wmla-954/_submitted_code | SingleNodeTensorflowTrain | 2021-07-29T05:09:02Z | False | wmla | 1 | tensorflow |


{ 'appId': 'wmla-954',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta type=fs '
          '--workerDeviceNum 1 --workerMemory 32G --workerDeviceType cpu '
          '--conda-env-name rapids-21.06-new  --model-main '
          'sklean-K-Means-main.py ',
  'createTime': '2021-07-29T05:09:02Z',
  'creator': 'mluser1',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-954',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-954',
  'workDir': '/gpfs/myresultfs/mluser1/batchworkdir/wmla-954/_submitted_code'}

Total training cost:  257  seconds.


<a id = "cuml-model"></a>
## Step 3 : Set up K-Means model using cuML
### Create a K-Means Model based on cuML on GPU

In [40]:
model_main='cuml-'+model_base_name

In [41]:
%%writefile {model_dir}/{model_main}

import cudf
import cupy
# import matplotlib.pyplot as plt
from cuml.cluster import KMeans as cuKMeans
from cuml.datasets import make_blobs
from sklearn.cluster import KMeans as skKMeans
from sklearn.metrics import adjusted_rand_score
import os
import datetime

# Point the CuPy kernel cache to /gpfs, since ~/.cache is not available
os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/cache/km"

# Define parameters for a large clustering problem
n_samples = 1000000
n_features = 200
n_clusters = 10
random_state = 23

# Generate Data
start = datetime.datetime.now()
device_data, device_labels = make_blobs(n_samples=n_samples,
                                        n_features=n_features,
                                        centers=n_clusters,
                                        random_state=random_state,
                                        cluster_std=0.1)

device_data = cudf.DataFrame(device_data)
device_labels = cudf.Series(device_labels)

# Copy the dataset from GPU memory to host memory.
host_data = device_data.to_pandas()
host_labels = device_labels.to_pandas()

end = datetime.datetime.now()
print("generate and copy data timecost: %.2fs" % ((end-start).total_seconds()))

# cuML Model
kmeans_cuml = cuKMeans(init="k-means||",
                       n_clusters=n_clusters,
                       oversampling_factor=40,
                       random_state=random_state)

# Fit
start = datetime.datetime.now()
kmeans_cuml.fit(device_data)
end = datetime.datetime.now()
print("train timecost: %.2fs" % ((end-start).total_seconds()))

# # Visualize 
# fig = plt.figure(figsize=(16, 10))
# plt.scatter(host_data.iloc[:, 0], host_data.iloc[:, 1], c=host_labels, s=50, cmap='viridis')
# 
# #plot the cuml kmeans centers with red circle outlines
# centers_cuml = kmeans_cuml.cluster_centers_
# plt.scatter(cupy.asnumpy(centers_cuml[0].values), 
#             cupy.asnumpy(centers_cuml[1].values), 
#             facecolors = 'none', edgecolors='red', s=100)
# 
# plt.title('cuml and sklearn kmeans clustering')

# # plt.show()
# plot_file = "./kmeans_gpu.png"
# plt.savefig(plot_file)

# Evaluate
start = datetime.datetime.now()
cuml_score = adjusted_rand_score(host_labels, kmeans_cuml.labels_.to_array())
end = datetime.datetime.now()
print("evaluate timecost: %.2fs" % ((end-start).total_seconds()))

print("score (cuML): %s" % cuml_score)

Overwriting /project_data/data_asset/models/cuml-K-Means-main.py
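
Both scripts score the clustering with `adjusted_rand_score`, which compares two labelings while ignoring how the cluster IDs are numbered. That matters here because K-Means assigns cluster indices arbitrarily. The pure-Python sketch below follows the standard pair-counting formula for the adjusted Rand index. It is an illustration only, not scikit-learn's implementation, and `adjusted_rand_index` is a name chosen for this example.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts of two labelings."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on
    # Pairs of samples that share a cluster in both, in truth, and in prediction.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = in_true * in_pred / total_pairs   # agreement expected by chance
    maximum = (in_true + in_pred) / 2            # agreement of a perfect match
    if maximum == expected:
        return 1.0  # degenerate case, e.g. everything in a single cluster
    return (both - expected) / (maximum - expected)

# Same grouping with swapped cluster IDs still scores a perfect match.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A score of 1.0 means the two groupings are identical up to relabeling, while values near 0 indicate agreement no better than chance.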


<a id = "cuml-gpu"></a>
## Step 4 :  Training the cuML model on GPU with Watson Machine Learning Accelerator
### Re-define the submission parameters

In [42]:
# specify the model file, conda env, device type and device number
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
--workerDeviceNum 1 \
--workerMemory 32G \
--workerDeviceType gpu \
--conda-env-name rapids-21.06-new  \
--model-main ' + model_main

print(args)

--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType gpu --conda-env-name rapids-21.06-new  --model-main cuml-K-Means-main.py


### Submit WMLA Workload

In [43]:
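# Open the model file in a context manager so the handle is closed after submission
with open("{0}/{1}".format(model_dir, model_main), 'rb') as model_file:
    submit_job_to_wmla(args, {'file': model_file})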

Refreshing every 5 seconds


| | id | args | submissionId | creator | state | appId | schedulerUrl | modelFileOwnerName | workDir | appName | createTime | elastic | nameSpace | numWorker | framework |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wmla-955 | --exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType gpu --... | wmla-955 | mluser1 | FINISHED | wmla-955 | https://wmla-mss:9080 | wmla | /gpfs/myresultfs/mluser1/batchworkdir/wmla-955/_submitted_code | SingleNodeTensorflowTrain | 2021-07-29T05:13:21Z | False | wmla | 1 | tensorflow |


{ 'appId': 'wmla-955',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta type=fs '
          '--workerDeviceNum 1 --workerMemory 32G --workerDeviceType gpu '
          '--conda-env-name rapids-21.06-new  --model-main '
          'cuml-K-Means-main.py ',
  'createTime': '2021-07-29T05:13:21Z',
  'creator': 'mluser1',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-955',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-955',
  'workDir': '/gpfs/myresultfs/mluser1/batchworkdir/wmla-955/_submitted_code'}

Total training cost:  56  seconds.
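
The two runs above report 257 seconds for the scikit-learn job on CPU and 56 seconds for the cuML job on GPU. A quick calculation turns those totals into a speedup factor. Note that both figures are end-to-end times measured by the submission function, so they include job scheduling, data generation, and evaluation, not only the `fit` call. The helper name `speedup` is introduced here for illustration.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds

cpu_total = 257  # seconds, scikit-learn job on CPU (from the output above)
gpu_total = 56   # seconds, cuML job on GPU (from the output above)

print("end-to-end speedup: %.1fx" % speedup(cpu_total, gpu_total))  # prints 4.6x
```

To isolate the training step, compare the `train timecost` lines that each script writes to its job log.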
