# Train a Linear Regression Model with Watson Machine Learning 

Notebook created by Zeming Zhao in June 2021

In this notebook, you will learn how to use the Watson Machine Learning Accelerator (WMLA) API to submit Linear Regression training jobs and how to accelerate the model on GPU.

Linear Regression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.
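
To make the definition concrete, here is a minimal sketch of ordinary least squares for a single predictor, written with the standard library only. The helper name `fit_simple_ols` and the toy data are illustrative assumptions and do not appear in the notebook; the three libraries used later solve the same problem for many predictors at once.

```python
# Minimal sketch: ordinary least squares for one predictor (y ~ slope * x + intercept).
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data generated from y = 2x + 1 (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_ols(xs, ys)
print("slope=%.2f intercept=%.2f" % (slope, intercept))
```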

This notebook contains three versions of the Linear Regression model: a scikit-learn version, a cuML version, and a snapML version.

All three versions are submitted to WMLA so that we can compare the performance benefit of the cuML and snapML versions.

This notebook covers the following sections:

1. [Set up Linear Regression using scikit-learn](#skl-model)<br>

1. [Training the model on CPU with Watson Machine Learning Accelerator](#cpu)<br>

1. [Setup Linear Regression using cuML](#cuml-model)<br>

1. [Training the model on GPU with Watson Machine Learning Accelerator](#gpu)<br>

1. [Setup Linear Regression using snapML](#snapml-model)<br>

1. [Training the model on CPU with Watson Machine Learning Accelerator](#snapml)<br>

<a id = "rbm-model"></a>
## Preparations

### Prepare the directory and file for the Linear Regression script

In [1]:
from pathlib import Path
model_dir = '/data/models'
model_main = 'LinearRegression_main.py'
# parents=True also creates any missing parent directories
Path(model_dir).mkdir(parents=True, exist_ok=True)
print("create model directory done.")

create model directory done.


<a id = "skl-model"></a>
## Step 1 : Set up the Linear Regression model using scikit-learn

### Create a Linear Regression Model based on scikit-learn on CPU

In [2]:
%%writefile {model_dir}/{model_main}

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression as skLinearRegression
#from sklearn import linear_model
import datetime
import os

# # specify the cache location under /gpfs since ~/.cache is not available
# os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/cache/lr"

# Define Parameters for a large regression
n_samples = 2**20 #If you are running on a GPU with less than 16GB RAM, please change to 2**19 or you could run out of memory
n_features = 399
random_state = 23

# Generate Data
start = datetime.datetime.now()
X, y = make_regression(n_samples=n_samples, n_features=n_features, random_state=random_state)
X_skl, X_skl_test, y_skl, y_skl_test = train_test_split(X, y, test_size = 0.2, random_state=random_state)
end = datetime.datetime.now()
print ("generate data timecost: %.2gs" % ((end-start).total_seconds()))

# scikit-learn Model
# ols_skl = skLinearRegression()
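# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call requires an older scikit-learn release.
ols_skl = skLinearRegression(fit_intercept=True,
                            normalize=True,
                            n_jobs=-1)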
print ("init model done")

# Fit
start = datetime.datetime.now()
ols_skl.fit(X_skl, y_skl)
end = datetime.datetime.now()
print ("train timecost: %.2gs" % ((end-start).total_seconds()))

# Predict
start = datetime.datetime.now()
predict_skl = ols_skl.predict(X_skl_test)
end = datetime.datetime.now()
print ("predict timecost: %.2gs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
r2_score_skl = r2_score(y_skl_test, predict_skl)
end = datetime.datetime.now()
print ("evaluate timecost: %.2gs" % ((end-start).total_seconds()))

#  print("R^2 score (SKL):  %s" % r2_score_sk)
print("R^2 score (scikit-learn): %.4f" % r2_score_skl)

Overwriting /data/models/LinearRegression_main.py
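
The script above repeats the same start/end timing pattern for every phase. As a sketch of one way to factor that out, the hypothetical `timed` context manager below (not part of the notebook) wraps a block and prints its wall-clock time. It also illustrates a side effect of the `%.2g` format used in the script: it keeps two significant digits, so durations of 100 seconds or more are printed in exponent notation.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the enclosed block and optionally record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("%s timecost: %.2fs" % (label, elapsed))

timings = {}
with timed("sum", timings):
    total = sum(range(100000))

# %.2g keeps two significant digits; %.2f keeps two decimal places.
short_fmt = "%.2g" % 165.3   # '1.7e+02'
fixed_fmt = "%.2f" % 165.3   # '165.30'
print(short_fmt, fixed_fmt)
```

Whether to switch the script to `%.2f` is a matter of taste; for the sub-100-second phases timed here, both formats stay readable.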


## Step 2 :  Training the scikit-learn model on CPU with Watson Machine Learning Accelerator

### Import the libraries for job submission

In [3]:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

from matplotlib import pyplot as plt
%pylab inline

import base64
import json
import time
import urllib

Populating the interactive namespace from numpy and matplotlib


### Configuring your environment and project details
To set up your project details, provide your credentials in this cell. You must include your cluster URL, username, and password.

In [4]:
hostname='wmla-console-wmla.apps.wml1x180.ma.platformlab.ibm.com'  # please enter Watson Machine Learning Accelerator host name
login='username:password' # please enter the login and password; do not store real credentials in the notebook
es = base64.b64encode(login.encode('utf-8')).decode("utf-8")
# print(es)
commonHeaders={'Authorization': 'Basic '+es}
req = requests.Session()
auth_url = 'https://{}/auth/v1/logon'.format(hostname)
print(auth_url)

a=requests.get(auth_url,headers=commonHeaders, verify=False)
access_token=a.json()['accessToken']
# print("Access_token: ", access_token)

https://wmla-console-wmla.apps.wml1x180.ma.platformlab.ibm.com/auth/v1/logon
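
The logon cell builds an HTTP Basic `Authorization` header from a `username:password` string. The sketch below shows the same encoding step in isolation and reads the credentials from environment variables so they do not have to be typed into the notebook. The variable names `WMLA_USER` and `WMLA_PASSWORD` and the helper `basic_auth_header` are assumptions made for this example, not names defined by WMLA.

```python
import base64
import os

def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header used for the logon call."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(raw).decode("utf-8")
    return {"Authorization": "Basic " + token}

# WMLA_USER / WMLA_PASSWORD are hypothetical names chosen for this sketch;
# the fallbacks are placeholders, not working credentials.
username = os.environ.get("WMLA_USER", "username")
password = os.environ.get("WMLA_PASSWORD", "password")
headers = basic_auth_header(username, password)
print(sorted(headers))
```

The resulting dictionary can be passed as `headers=` to the logon request in place of the hand-built `commonHeaders`.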


In [5]:
dl_rest_url = 'https://{}/platform/rest/deeplearning/v1'.format(hostname)
commonHeaders={'accept': 'application/json', 'X-Auth-Token': access_token}
req = requests.Session()

In [6]:
# Health check
confUrl = 'https://{}/platform/rest/deeplearning/v1/conf'.format(hostname)
r = req.get(confUrl, headers=commonHeaders, verify=False)

### Define the status checking function

In [7]:
import tarfile
import tempfile
import os
import json
import pprint
import pandas as pd
from IPython.display import clear_output

def query_job_status(job_id,refresh_rate=3) :

    execURL = dl_rest_url  +'/execs/'+ job_id['id']
    pp = pprint.PrettyPrinter(indent=2)

    keep_running=True
    res=None
    while(keep_running):
        res = req.get(execURL, headers=commonHeaders, verify=False)
        monitoring = pd.DataFrame(res.json(), index=[0])
        pd.set_option('max_colwidth', 120)
        clear_output()
        print("Refreshing every {} seconds".format(refresh_rate))
        display(monitoring)
        pp.pprint(res.json())
        if(res.json()['state'] not in ['PENDING_CRD_SCHEDULER', 'SUBMITTED','RUNNING']) :
            keep_running=False
        time.sleep(refresh_rate)
    return res
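
The polling logic in `query_job_status` can be looked at separately from the REST call and the notebook display. The sketch below is a simplified stand-in, not the notebook's function: `wait_for_terminal_state` and its `fetch_state` callback are hypothetical names, and the job states are replayed from a fixed list instead of being fetched from WMLA. It returns as soon as a terminal state is seen and adds an upper bound on the number of polls.

```python
import time

# States the notebook treats as "still in progress".
ACTIVE_STATES = {"PENDING_CRD_SCHEDULER", "SUBMITTED", "RUNNING"}

def wait_for_terminal_state(fetch_state, refresh_rate=3, max_polls=100, sleep=time.sleep):
    """Poll fetch_state() until it returns a state outside ACTIVE_STATES."""
    for _ in range(max_polls):
        state = fetch_state()
        if state not in ACTIVE_STATES:
            return state  # return immediately, without a trailing sleep
        sleep(refresh_rate)
    raise TimeoutError("job still active after %d polls" % max_polls)

# Stand-in for the REST call: replays a fixed sequence of states.
fake_states = iter(["SUBMITTED", "RUNNING", "RUNNING", "FINISHED"])
final_state = wait_for_terminal_state(lambda: next(fake_states), refresh_rate=0)
print(final_state)
```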

### Define the submission command

In [8]:
import datetime

def submit_job_to_wmla (args) :
    starttime = datetime.datetime.now()
    r = requests.post(dl_rest_url+'/execs?args='+args, # files=files,
                  headers=commonHeaders, verify=False)
    if not r.ok:
        print('submit job failed: code=%s, %s'%(r.status_code, r.content))
        return  # nothing to poll if the submission was rejected
    job_status = query_job_status(r.json(),refresh_rate=5)
    endtime = datetime.datetime.now()
    print("\nTotallly training cost: ", (endtime - starttime).seconds, " seconds.")

<a id = "cpu"></a>
### Define the submission parameters for the scikit-learn version on CPU

In [9]:
# specify the conda env and worker device type
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
                     --workerDeviceNum 1 \
                     --workerDeviceType cpu \
                     --conda-env-name dlipy3-cpu  \
                     --model-main /gpfs/mydatafs/models/' + model_main

print(args)

--exec-start tensorflow --cs-datastore-meta type=fs                      --workerDeviceNum 1                      --workerDeviceType cpu                      --conda-env-name dlipy3-cpu                       --model-main /gpfs/mydatafs/models/LinearRegression_main.py
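
The printed string contains long runs of spaces because the backslash continuations sit inside a single string literal, so the indentation of each continued line becomes part of the value. A sketch of an alternative that joins the same flags with single spaces is shown below. The helper `build_args` is a hypothetical name introduced for this example; the flags and values are the ones used in the cell above.

```python
from urllib.parse import quote

def build_args(model_main_path, device_type, conda_env, device_num=1):
    """Join the submission flags with single spaces."""
    parts = [
        "--exec-start", "tensorflow",
        "--cs-datastore-meta", "type=fs",
        "--workerDeviceNum", str(device_num),
        "--workerDeviceType", device_type,
        "--conda-env-name", conda_env,
        "--model-main", model_main_path,
    ]
    return " ".join(parts)

args = build_args("/gpfs/mydatafs/models/LinearRegression_main.py", "cpu", "dlipy3-cpu")
print(args)
# Percent-encoded form of the same string, as it could be placed in a query string.
print(quote(args))
```

Splitting either version on whitespace yields the same token list, so the two forms carry the same arguments.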


### Submit WMLA Workload

In [10]:
submit_job_to_wmla (args)

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-329,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --...,wmla-329,admin,FINISHED,wmla-329,https://wmla-mss:9080,wmla,/gpfs/myresultfs/admin/batchworkdir/wmla-329/_submitted_code,SingleNodeTensorflowTrain,2021-07-13T09:08:59Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-329',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta '
          'type=fs                      --workerDeviceNum '
          '1                      --workerDeviceType cpu                      '
          '--conda-env-name dlipy3-cpu                       --model-main '
          '/gpfs/mydatafs/models/LinearRegression_main.py ',
  'createTime': '2021-07-13T09:08:59Z',
  'creator': 'admin',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-329',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-329',
  'workDir': '/gpfs/myresultfs/admin/batchworkdir/wmla-329/_submitted_code'}

Totallly training cost:  165  seconds.


## Step 3 :  Set up the Linear Regression model using cuML

<a id = "cuml-model"></a>
### Create a Linear Regression Model based on cuML on GPU

In [11]:
%%writefile {model_dir}/{model_main}

import cudf
from cuml import make_regression, train_test_split
from cuml.linear_model import LinearRegression as cuLinearRegression
from cuml.metrics.regression import r2_score
#from sklearn.linear_model import LinearRegression as skLinearRegression
import datetime
import os

# specify the cache location under /gpfs since ~/.cache is not available
os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/cache/lr"

# Define Parameters for a large regression
n_samples = 2**20 #If you are running on a GPU with less than 16GB RAM, please change to 2**19 or you could run out of memory
n_features = 399
random_state = 23

# Generate Data
start = datetime.datetime.now()
X, y = make_regression(n_samples=n_samples, n_features=n_features, random_state=random_state)
X = cudf.DataFrame(X)
y = cudf.DataFrame(y)[0]
X_cudf, X_cudf_test, y_cudf, y_cudf_test = train_test_split(X, y, test_size = 0.2, random_state=random_state)
end = datetime.datetime.now()
print ("generate data timecost: %.2gs" % ((end-start).total_seconds()))

# cuML Model
ols_cuml = cuLinearRegression(fit_intercept=True,normalize=True,algorithm='eig')

# Fit
start = datetime.datetime.now()
ols_cuml.fit(X_cudf, y_cudf)
end = datetime.datetime.now()
print ("train timecost: %.2gs" % ((end-start).total_seconds()))

# Predict
start = datetime.datetime.now()
predict_cuml = ols_cuml.predict(X_cudf_test)
end = datetime.datetime.now()
print ("predict timecost: %.2gs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
r2_score_cuml = r2_score(y_cudf_test, predict_cuml)
end = datetime.datetime.now()
print ("evaluate timecost: %.2gs" % ((end-start).total_seconds()))

print("R^2 score (cuML): %.4f" % r2_score_cuml)

Overwriting /data/models/LinearRegression_main.py
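
The comment on `n_samples` warns about running out of GPU memory. As a rough back-of-the-envelope sketch, the code below estimates the size of the dense feature matrix alone for both sample counts mentioned in the comment. The element widths are assumptions: the notebook does not state which dtype each library uses, so both 4-byte and 8-byte floats are shown. The train/test split copies and the solver's working memory come on top of these figures.

```python
def dense_matrix_gib(n_samples, n_features, bytes_per_value):
    """Size of a dense n_samples x n_features matrix in GiB."""
    return n_samples * n_features * bytes_per_value / 2**30

n_features = 399
for n_samples in (2**20, 2**19):
    # Assumed element widths: float32 = 4 bytes, float64 = 8 bytes.
    for name, width in (("float32", 4), ("float64", 8)):
        size = dense_matrix_gib(n_samples, n_features, width)
        print("n_samples=%d %s: %.2f GiB" % (n_samples, name, size))
```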


## Step 4 :  Training the cuML model on GPU with Watson Machine Learning Accelerator

<a id = "gpu"></a>
### Re-define the submission parameters

In [12]:
# specify the conda env of rapids and worker device type
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
                     --workerDeviceNum 1 \
                     --workerDeviceType gpu \
                     --conda-env-name rapids-21.06  \
                     --model-main /gpfs/mydatafs/models/' + model_main

print(args)

--exec-start tensorflow --cs-datastore-meta type=fs                      --workerDeviceNum 1                      --workerDeviceType gpu                      --conda-env-name rapids-21.06                       --model-main /gpfs/mydatafs/models/LinearRegression_main.py


### Submit WMLA Workload

In [13]:
submit_job_to_wmla (args)

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-330,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --...,wmla-330,admin,FINISHED,wmla-330,https://wmla-mss:9080,wmla,/gpfs/myresultfs/admin/batchworkdir/wmla-330/_submitted_code,SingleNodeTensorflowTrain,2021-07-13T09:11:44Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-330',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta '
          'type=fs                      --workerDeviceNum '
          '1                      --workerDeviceType gpu                      '
          '--conda-env-name rapids-21.06                       --model-main '
          '/gpfs/mydatafs/models/LinearRegression_main.py ',
  'createTime': '2021-07-13T09:11:44Z',
  'creator': 'admin',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-330',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-330',
  'workDir': '/gpfs/myresultfs/admin/batchworkdir/wmla-330/_submitted_code'}

Totallly training cost:  43  seconds.


## Step 5 :  Set up the Linear Regression model using snapML

<a id = "snapml-model"></a>
### Create a Linear Regression Model based on snapML 

In [14]:
model_main='snapML-'+model_main

In [15]:
%%writefile {model_dir}/{model_main}

from snapml import LinearRegression as SnapLinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import datetime
# import numpy as np

# Define Parameters for a large regression
n_samples = 2**20 #If you are running on a GPU with less than 16GB RAM, please change to 2**19 or you could run out of memory
n_features = 399
random_state = 23

# Generate Data
start = datetime.datetime.now()
X, y = make_regression(n_samples=n_samples, n_features=n_features, random_state=random_state)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=random_state)
end = datetime.datetime.now()
print ("generate data timecost: %.2gs" % ((end-start).total_seconds()))

# snapML model
model = SnapLinearRegression(fit_intercept=True,normalize=True)

# Fit
start = datetime.datetime.now()
model.fit(X_train, y_train)
end = datetime.datetime.now()
print ("train timecost: %.2gs" % ((end-start).total_seconds()))

r2_score_snapml = r2_score(y_test, model.predict(X_test))
print("R^2 score (snapML): %.4f" % r2_score_snapml)

Overwriting /data/models/snapML-LinearRegression_main.py
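
All three scripts report the R^2 score as their quality metric. For reference, the sketch below computes the coefficient of determination by hand as 1 - SS_res / SS_tot on a few illustrative values. The helper name `r2_score_manual` and the sample numbers are assumptions for this example; the scripts themselves rely on the `r2_score` functions of scikit-learn and cuML.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
r2 = r2_score_manual(y_true, y_pred)
print("R^2: %.4f" % r2)
```

A score of 1.0 means the predictions match the targets exactly; predicting the mean of `y_true` for every sample gives 0.0.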


## Step 6 :  Training the SnapML model on CPU with Watson Machine Learning Accelerator

<a id = "snapml"></a>
### Re-define the submission parameters

In [16]:
# specify the conda env of snapML and worker device type
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
                     --workerDeviceNum 1 \
                     --workerDeviceType cpu \
                     --conda-env-name snapml-py3.7 \
                     --model-main /gpfs/mydatafs/models/' + model_main
print(args)

--exec-start tensorflow --cs-datastore-meta type=fs                      --workerDeviceNum 1                      --workerDeviceType cpu                      --conda-env-name snapml-py3.7                      --model-main /gpfs/mydatafs/models/snapML-LinearRegression_main.py


### Submit WMLA Workload

In [17]:
submit_job_to_wmla (args)

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-331,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --...,wmla-331,admin,FINISHED,wmla-331,https://wmla-mss:9080,wmla,/gpfs/myresultfs/admin/batchworkdir/wmla-331/_submitted_code,SingleNodeTensorflowTrain,2021-07-13T09:12:28Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-331',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta '
          'type=fs                      --workerDeviceNum '
          '1                      --workerDeviceType cpu                      '
          '--conda-env-name snapml-py3.7                      --model-main '
          '/gpfs/mydatafs/models/snapML-LinearRegression_main.py ',
  'createTime': '2021-07-13T09:12:28Z',
  'creator': 'admin',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-331',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-331',
  'workDir': '/gpfs/myresultfs/admin/batchworkdir/wmla-331/_submitted_code'}

Totallly training cost:  111  seconds.
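
The three runs above report end-to-end job times of 165, 43, and 111 seconds. The short sketch below turns those numbers into speedup factors relative to the scikit-learn run. Note that these figures cover the whole job as measured by `submit_job_to_wmla` (submission, scheduling, data generation, training, and evaluation), so they should be read as a rough comparison and not as pure training-time speedups.

```python
# End-to-end job times reported by the three runs above, in seconds.
job_seconds = {"scikit-learn (CPU)": 165, "cuML (GPU)": 43, "snapML (CPU)": 111}

baseline = job_seconds["scikit-learn (CPU)"]
speedups = {name: baseline / seconds for name, seconds in job_seconds.items()}
for name, factor in speedups.items():
    print("%-20s %4d s  %.2fx" % (name, job_seconds[name], factor))
```

The per-phase timings printed by each script (data generation, training, prediction, evaluation) give a finer-grained comparison than these totals.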
