# Train a Random Forest Model with Watson Machine Learning Accelerator

Notebook created by Zeming Zhao in June 2021

The Random Forest algorithm is a classification method that builds several decision trees and aggregates their outputs to make a prediction.
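
To make the aggregation step concrete, here is a minimal sketch of majority voting over per-tree predictions. It is only one common aggregation scheme and is not taken from any of the libraries used in this notebook (scikit-learn, for example, averages the per-tree class probabilities instead of counting hard votes). The function name `majority_vote` and the sample votes are hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the most trees.

    tree_predictions: list of labels, one per decision tree.
    Ties go to the label that was seen first.
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    # most_common orders by count; equal counts keep first-seen order
    return Counter(tree_predictions).most_common(1)[0][0]

# Five hypothetical trees vote on a single sample.
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

With three of the five trees voting for class 1, the aggregated prediction is class 1.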

This notebook builds three versions of a Random Forest classification model: one with scikit-learn, one with cuML, and one with Snap ML.

Each version is submitted to Watson Machine Learning Accelerator (WMLA): scikit-learn and Snap ML run on CPU, and cuML runs on GPU. The recorded job times let us compare the performance benefit of cuML on GPU against the CPU versions.

This notebook covers the following sections:

1. [Setup Random Forest using scikit-learn](#skl-model)<br>

1. [Training the model on CPU with Watson Machine Learning Accelerator](#cpu)<br>

1. [Setup Random Forest using cuML](#cuml-model)<br>

1. [Training the model on GPU with Watson Machine Learning Accelerator](#gpu)<br>

<a id = "rbm-model"></a>
## Preparations

### Prepare the directory and file name for the Random Forest training script

In [2]:
from pathlib import Path
model_dir = '/data/models'
model_main = 'RandomForest_main.py'
# parents=True also creates any missing parent directories
Path(model_dir).mkdir(parents=True, exist_ok=True)
print("create model directory done.")

create model directory done.


<a id = "skl-model"></a>
## Step 1 : Setup Random Forest model using scikit-learn.

### Create a Random Forest Model based on scikit-learn on CPU

In [3]:
%%writefile {model_dir}/{model_main}

import numpy as np
import pandas as pd
import pickle
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier as skrfc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import os
import datetime

# point the CuPy cache at /gpfs since ~/.cache is not available
os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/cache/rf"

# Define parameters for a large classification problem
n_samples = 2**13 
n_features = 899 
n_info = 600 
data_type = np.float32

# Generate Data using scikit-learn
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          n_informative=n_info,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state=0)

# scikit-learn RandomForestClassifier model
sk_model = skrfc(n_estimators=40,
                 max_depth=16,
                 max_features=1.0,
                 random_state=10)

# Fit
start = datetime.datetime.now()
sk_model.fit(X_train, y_train)
end = datetime.datetime.now()
print ("train timecost: %.2gs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
sk_predict = sk_model.predict(X_test)
end = datetime.datetime.now()
print ("evaluate timecost: %.2gs" % ((end-start).total_seconds()))

sk_acc = accuracy_score(y_test, sk_predict)
# accuracy is a fraction, not a duration, so it is printed without the "s" suffix
print("test accuracy: %.2g" % sk_acc)

filename = './sk_random_forest_model.sav'
# save the trained scikit-learn model into a file
with open(filename, 'wb') as f:
    pickle.dump(sk_model, f)
print("saved model to file ", filename)

Overwriting /data/models/RandomForest_main.py
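
The script above times `fit` and `predict` with `datetime` and prints the result with `%.2g`. As an alternative sketch, the timing can be wrapped in a small context manager built on `time.perf_counter`. The helper name `timed` and the stand-in workload are hypothetical. The last lines show why `%.2f` may be a safer format for durations: `%.2g` keeps two significant digits and switches to scientific notation once the value reaches 100.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed("train", timings):
    sum(i * i for i in range(100_000))  # stand-in for sk_model.fit(X_train, y_train)

print("train timecost: %.2fs" % timings["train"])

# Two significant digits versus two decimal places for a 207.3 second run
print("%.2g" % 207.3)  # 2.1e+02
print("%.2f" % 207.3)  # 207.30
```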


## Step 2 : Training the scikit-learn model on CPU with Watson Machine Learning Accelerator

### Import the libraries for job submission:

In [4]:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

from matplotlib import pyplot as plt
%pylab inline

import base64
import datetime  # used by submit_job_to_wmla below
import json
import time
import urllib

Populating the interactive namespace from numpy and matplotlib


### Configuring your environment and project details
To set up your project details, provide your credentials in this cell. You must include your cluster URL, username, and password.

In [5]:
hostname='wmla-console-wmla.apps.wml1x180.ma.platformlab.ibm.com'  # please enter Watson Machine Learning Accelerator host name
login='username:password'  # please enter the login and password; do not commit real credentials
es = base64.b64encode(login.encode('utf-8')).decode("utf-8")
# print(es)
commonHeaders={'Authorization': 'Basic '+es}
req = requests.Session()
auth_url = 'https://{}/auth/v1/logon'.format(hostname)
print(auth_url)

a=requests.get(auth_url,headers=commonHeaders, verify=False)
access_token=a.json()['accessToken']
# print("Access_token: ", access_token)

https://wmla-console-wmla.apps.wml1x180.ma.platformlab.ibm.com/auth/v1/logon
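
One way to keep the password out of the notebook is to read the credentials from environment variables and build the Basic authentication header from them. This is a minimal sketch: the variable names `WMLA_USERNAME` and `WMLA_PASSWORD` and the helper `build_basic_auth_header` are assumptions for illustration, not part of WMLA.

```python
import base64
import os

def build_basic_auth_header(username, password):
    """Return the HTTP Basic Authorization header for the given credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("utf-8")
    return {"Authorization": "Basic " + token}

# WMLA_USERNAME and WMLA_PASSWORD are hypothetical variable names;
# the fallbacks are placeholders, not real credentials.
username = os.environ.get("WMLA_USERNAME", "username")
password = os.environ.get("WMLA_PASSWORD", "password")
headers = build_basic_auth_header(username, password)
print(sorted(headers))
```

The resulting dictionary has the same shape as `commonHeaders` in the cell above and could be passed to the logon request in its place.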


In [6]:
dl_rest_url = 'https://{}/platform/rest/deeplearning/v1'.format(hostname)
commonHeaders={'accept': 'application/json', 'X-Auth-Token': access_token}
req = requests.Session()

In [7]:
# Health check
confUrl = 'https://{}/platform/rest/deeplearning/v1/conf'.format(hostname)
r = req.get(confUrl, headers=commonHeaders, verify=False)


### Define the status checking function

In [8]:
import tarfile
import tempfile
import os
import json
import pprint
import pandas as pd
from IPython.display import clear_output

def query_job_status(job_id,refresh_rate=3) :

    execURL = dl_rest_url  +'/execs/'+ job_id['id']
    pp = pprint.PrettyPrinter(indent=2)

    keep_running=True
    res=None
    while(keep_running):
        res = req.get(execURL, headers=commonHeaders, verify=False)
        monitoring = pd.DataFrame(res.json(), index=[0])
        pd.set_option('max_colwidth', 120)
        clear_output()
        print("Refreshing every {} seconds".format(refresh_rate))
        display(monitoring)
        pp.pprint(res.json())
        if(res.json()['state'] not in ['PENDING_CRD_SCHEDULER', 'SUBMITTED','RUNNING']) :
            keep_running=False
        time.sleep(refresh_rate)
    return res
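
The loop in `query_job_status` runs until the job leaves the three in-progress states, so a job that never leaves them would be polled forever. The sketch below shows the same polling idea with a timeout, separated from the REST call so it can be tried without a cluster. `wait_for_final_state` and its `fetch_state` argument are hypothetical names; only the three state strings come from the function above.

```python
import time

# States that the status function above treats as "still in progress".
ACTIVE_STATES = {"PENDING_CRD_SCHEDULER", "SUBMITTED", "RUNNING"}

def wait_for_final_state(fetch_state, refresh_rate=3, timeout=600, sleep=time.sleep):
    """Poll fetch_state() until it returns a non-active state or timeout seconds pass.

    fetch_state is a hypothetical zero-argument callable returning the job state string.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state not in ACTIVE_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("job still in state %r after %s seconds" % (state, timeout))
        sleep(refresh_rate)

# Simulated job: SUBMITTED, then RUNNING, then FINISHED.
states = iter(["SUBMITTED", "RUNNING", "FINISHED"])
final_state = wait_for_final_state(lambda: next(states), refresh_rate=0)
print(final_state)
```

In the notebook, `fetch_state` could wrap the `req.get(execURL, ...)` call and return `res.json()['state']`.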

<a id = "cpu"></a>
### Define the submission parameters

In [9]:
# specify the conda env of rapids and worker device type
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
                     --workerDeviceNum 1 \
                     --workerDeviceType cpu \
                     --conda-env-name rapids-21.06  \
                     --model-main /gpfs/mydatafs/models/' + model_main

print(args)

--exec-start tensorflow --cs-datastore-meta type=fs                      --workerDeviceNum 1                      --workerDeviceType cpu                      --conda-env-name rapids-21.06                       --model-main /gpfs/mydatafs/models/RandomForest_main.py
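
The printed string contains long runs of spaces because the backslash line continuations sit inside the string literal. The job still runs, as the output below shows, but the argument string is easier to read and reuse when it is assembled from a list. This is a sketch only: `build_wmla_args` is a hypothetical helper, and the flags are the ones already used in this notebook.

```python
def build_wmla_args(model_path, device_type="cpu", conda_env="rapids-21.06", device_num=1):
    """Join the submission flags used in this notebook with single spaces."""
    parts = [
        "--exec-start", "tensorflow",
        "--cs-datastore-meta", "type=fs",
        "--workerDeviceNum", str(device_num),
        "--workerDeviceType", device_type,
        "--conda-env-name", conda_env,
        "--model-main", model_path,
    ]
    return " ".join(parts)

args_example = build_wmla_args("/gpfs/mydatafs/models/RandomForest_main.py")
print(args_example)
```

Switching to the GPU run then only needs `device_type="gpu"`.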


### Define the submission command

In [10]:
def submit_job_to_wmla(args):
    starttime = datetime.datetime.now()
    r = requests.post(dl_rest_url + '/execs?args=' + args,
                      headers=commonHeaders, verify=False)
    if not r.ok:
        # stop here: a failed submission has no job id to poll
        print('submit job failed: code=%s, %s' % (r.status_code, r.content))
        return
    job_status = query_job_status(r.json(), refresh_rate=5)
    endtime = datetime.datetime.now()
    print("\nTotal training cost: ", (endtime - starttime).seconds, " seconds.")
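
The submission function appends the raw argument string to the URL as the `args` query parameter. As a sketch of what that URL looks like once it is percent-encoded, the helper below builds it explicitly with the standard library. `build_submit_url` and the `example.invalid` host are hypothetical; the `/execs?args=` path is the one used above.

```python
from urllib.parse import quote, urlencode

def build_submit_url(base_url, args):
    """Return the /execs submission URL with args percent-encoded as a query parameter."""
    # quote_via=quote encodes spaces as %20 instead of "+"
    return base_url + "/execs?" + urlencode({"args": args}, quote_via=quote)

url = build_submit_url(
    "https://example.invalid/platform/rest/deeplearning/v1",  # placeholder host
    "--workerDeviceNum 1 --workerDeviceType cpu",
)
print(url)
```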

### Submit WMLA Workload

In [11]:
submit_job_to_wmla (args)

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-302,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --...,wmla-302,admin,FINISHED,wmla-302,https://wmla-mss:9080,wmla,/gpfs/myresultfs/admin/batchworkdir/wmla-302/_submitted_code,SingleNodeTensorflowTrain,2021-07-09T10:07:20Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-302',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta '
          'type=fs                      --workerDeviceNum '
          '1                      --workerDeviceType cpu                      '
          '--conda-env-name rapids-21.06                       --model-main '
          '/gpfs/mydatafs/models/RandomForest_main.py ',
  'createTime': '2021-07-09T10:07:20Z',
  'creator': 'admin',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-302',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-302',
  'workDir': '/gpfs/myresultfs/admin/batchworkdir/wmla-302/_submitted_code'}

Totallly training cost:  207  seconds.


Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-236,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --...,wmla-236,admin,FINISHED,wmla-236,https://wmla-mss:9080,wmla,/gpfs/myresultfs/admin/batchworkdir/wmla-236/_submitted_code,SingleNodeTensorflowTrain,2021-07-08T06:18:34Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-236',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta '
          'type=fs                      --workerDeviceNum '
          '1                      --workerDeviceType cpu                      '
          '--conda-env-name rapids-21.06                       --model-main '
          '/gpfs/mydatafs/models/RandomForest_main.py ',
  'createTime': '2021-07-08T06:18:34Z',
  'creator': 'admin',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-236',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-236',
  'workDir': '/gpfs/myresultfs/admin/batchworkdir/wmla-236/_submitted_code'}

Training cost on WMLA CPU is:  205  seconds.


## Step 3 : Setup Random Forest model using cuML

<a id = "cuml-model"></a>
### Create a Random Forest Model based on cuML on GPU

In [12]:
%%writefile {model_dir}/{model_main}

import cudf
import numpy as np
import pandas as pd
import pickle

from cuml.ensemble import RandomForestClassifier as curfc
from cuml.metrics import accuracy_score

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import os
import datetime

# point the CuPy cache at /gpfs since ~/.cache is not available
os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/cache/rf"

# Define parameters for a large classification problem
n_samples = 2**13
n_features = 899
n_info = 600
data_type = np.float32

# Generate data with scikit-learn (converted to cuDF below)
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          n_informative=n_info,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size = 0.2,
                                                     random_state=0)

X_cudf_train = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)
y_cudf_train = cudf.Series(y_train.values)
y_cudf_test = cudf.Series(y_test.values)
    
# cuML RandomForestClassifier model
cuml_model = curfc(n_estimators=40,
                   max_depth=16,
                   max_features=1.0,
                   random_state=10)
# Fit
start = datetime.datetime.now()
cuml_model.fit(X_cudf_train, y_cudf_train)
end = datetime.datetime.now()
print ("train timecost: %.2gs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
fil_preds_orig = cuml_model.predict(X_cudf_test)
end = datetime.datetime.now()
print ("evaluate timecost: %.2gs" % ((end-start).total_seconds()))

cuml_acc = accuracy_score(y_cudf_test, fil_preds_orig)
# accuracy is a fraction, not a duration, so it is printed without the "s" suffix
print("test accuracy: %.2g" % cuml_acc)

filename = './cuml_random_forest_model.sav'
# save the trained cuML model into a file
with open(filename, 'wb') as f:
    pickle.dump(cuml_model, f)
print("saved model to file ", filename)

Overwriting /data/models/RandomForest_main.py
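
The training scripts end by pickling the fitted model. Loading it back is the mirror operation, sketched below with a plain dictionary standing in for the model so that the example runs without scikit-learn or cuML. `save_and_reload` and `stand_in_model` are hypothetical names. Unpickling a real model also requires the library that created it to be importable in the loading environment.

```python
import os
import pickle
import tempfile

def save_and_reload(obj, path):
    """Pickle obj to path, then load it back and return the loaded copy."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        return pickle.load(f)

# A plain dict stands in for the trained model so the sketch runs anywhere.
stand_in_model = {"n_estimators": 40, "max_depth": 16}
with tempfile.TemporaryDirectory() as tmp:
    restored = save_and_reload(stand_in_model, os.path.join(tmp, "model.sav"))

print(restored == stand_in_model)
```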


## Step 4 :  Training the cuML model on GPU with Watson Machine Learning Accelerator

<a id = "gpu"></a>
### Re-define the submission parameters

In [13]:
# specify the conda env of rapids and worker device type
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
                     --workerDeviceNum 1 \
                     --workerDeviceType gpu \
                     --conda-env-name rapids-21.06  \
                     --model-main /gpfs/mydatafs/models/' + model_main

print(args)

--exec-start tensorflow --cs-datastore-meta type=fs                      --workerDeviceNum 1                      --workerDeviceType gpu                      --conda-env-name rapids-21.06                       --model-main /gpfs/mydatafs/models/RandomForest_main.py


In [14]:
submit_job_to_wmla (args)

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-303,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --...,wmla-303,admin,FINISHED,wmla-303,https://wmla-mss:9080,wmla,/gpfs/myresultfs/admin/batchworkdir/wmla-303/_submitted_code,SingleNodeTensorflowTrain,2021-07-09T10:10:48Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-303',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta '
          'type=fs                      --workerDeviceNum '
          '1                      --workerDeviceType gpu                      '
          '--conda-env-name rapids-21.06                       --model-main '
          '/gpfs/mydatafs/models/RandomForest_main.py ',
  'createTime': '2021-07-09T10:10:48Z',
  'creator': 'admin',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-303',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-303',
  'workDir': '/gpfs/myresultfs/admin/batchworkdir/wmla-303/_submitted_code'}

Totallly training cost:  67  seconds.


## Step 5 : Setup Random Forest model using snapML

### Create a Random Forest Model based on snapML

In [15]:
model_main='snapML-'+model_main

In [22]:
%%writefile {model_dir}/{model_main}

import numpy as np
import pandas as pd
import pickle
from sklearn.metrics import accuracy_score
#from sklearn.ensemble import RandomForestClassifier as skrfc
from snapml import RandomForestClassifier as SnapRandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import os
import datetime

# point the CuPy cache at /gpfs since ~/.cache is not available
os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/cache/rf"

# Define parameters for a large classification problem
n_samples = 2**13 
n_features = 899 
n_info = 600 
data_type = np.float32

# Generate Data using scikit-learn
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          n_informative=n_info,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state=0)
print(type(X_train))
print(type(X_train.values))
# snapML RandomForestClassifier model
snap_model = SnapRandomForestClassifier(max_depth=16, 
                               n_estimators=100, 
                               n_jobs=4, 
                               random_state=10)
# Fit
start = datetime.datetime.now()
# Snap ML tree-based models only support numpy.ndarray inputs
# (a DataFrame raises TypeError), so pass .values
snap_model.fit(X_train.values, y_train.values)
end = datetime.datetime.now()
print ("train timecost: %.2gs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
sk_predict = snap_model.predict(X_test.values)
end = datetime.datetime.now()
print ("evaluate timecost: %.2gs" % ((end-start).total_seconds()))

snap_acc = accuracy_score(y_test, sk_predict)
# accuracy is a fraction, not a duration, so it is printed without the "s" suffix
print("test accuracy: %.2g" % snap_acc)

filename = './snapml_random_forest_model.sav'
# save the trained Snap ML model into a file
with open(filename, 'wb') as f:
    pickle.dump(snap_model, f)
print("saved model to file ", filename)

Overwriting /data/models/snapML-RandomForest_main.py


## Step 6 : Training the SnapML model on CPU with Watson Machine Learning Accelerator

### Re-define the submission parameters

In [23]:
# specify the Snap ML conda env and worker device type
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
                     --workerDeviceNum 1 \
                     --workerDeviceType cpu \
                     --conda-env-name snapml-py3.7 \
                     --model-main /gpfs/mydatafs/models/' + model_main
print(args)

--exec-start tensorflow --cs-datastore-meta type=fs                      --workerDeviceNum 1                      --workerDeviceType cpu                      --conda-env-name snapml-py3.7                      --model-main /gpfs/mydatafs/models/snapML-RandomForest_main.py


### Submit WMLA Workload

In [24]:
submit_job_to_wmla (args)

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-305,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --...,wmla-305,admin,FINISHED,wmla-305,https://wmla-mss:9080,wmla,/gpfs/myresultfs/admin/batchworkdir/wmla-305/_submitted_code,SingleNodeTensorflowTrain,2021-07-09T11:21:06Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-305',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta '
          'type=fs                      --workerDeviceNum '
          '1                      --workerDeviceType cpu                      '
          '--conda-env-name snapml-py3.7                      --model-main '
          '/gpfs/mydatafs/models/snapML-RandomForest_main.py ',
  'createTime': '2021-07-09T11:21:06Z',
  'creator': 'admin',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-305',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-305',
  'workDir': '/gpfs/myresultfs/admin/batchworkdir/wmla-305/_submitted_code'}

Totallly training cost:  42  seconds.
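
The three runs can be compared with a little arithmetic on the job times reported in this notebook (207 s, 67 s and 42 s). The sketch below only divides those recorded numbers. They are end-to-end job times, so they include scheduling, environment start-up, data generation and the 5-second polling interval, not just `fit`. The Snap ML script also uses `n_estimators=100` and does not set `max_features`, so it is not a like-for-like comparison with the other two.

```python
# Total job times in seconds, copied from the "training cost" lines in this notebook.
job_seconds = {
    "scikit-learn (CPU)": 207,
    "cuML (GPU)": 67,
    "Snap ML (CPU)": 42,
}

baseline = job_seconds["scikit-learn (CPU)"]
speedups = {name: baseline / seconds for name, seconds in job_seconds.items()}

for name, factor in speedups.items():
    print("%-20s %4d s  %.1fx" % (name, job_seconds[name], factor))
```

On these numbers the cuML job finishes roughly 3.1 times faster than the scikit-learn job, and the Snap ML job roughly 4.9 times faster.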
