# Train a Random Forest Model with Watson Machine Learning 

Notebook created by Zeming Zhao in June 2021

The Random Forest algorithm is an ensemble classification method that builds several decision trees and aggregates their outputs to make a prediction.
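As a rough illustration of the aggregation step, the sketch below takes a majority vote over per-tree predictions. The function name `majority_vote` and the vote values are made up for illustration. Note that hard voting is only one way to aggregate: scikit-learn's `RandomForestClassifier`, for example, averages the class probabilities of the trees instead.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label

# Hypothetical outputs of five trees for a single sample (made-up values)
votes = [1, 0, 1, 1, 0]
prediction = majority_vote(votes)
print(prediction)
```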

This notebook builds the Random Forest model with three libraries: scikit-learn, cuML, and snapML. The snapML version is run on both CPU and GPU.

Each version is submitted to Watson Machine Learning Accelerator (WMLA) as a training job, so the run times can be compared to see the performance benefit of the cuML and snapML versions.

This notebook covers the following sections:

1. [Setup Random Forest using scikit-learn](#skl-model)<br>

1. [Training the model on CPU with Watson Machine Learning Accelerator](#skl-cpu)<br>

1. [Setup Random Forest using cuML](#cuml-model)<br>

1. [Training the model on GPU with Watson Machine Learning Accelerator](#cuml-gpu)<br>

1. [Setup Random Forest using snapML](#snapml-model)<br>

1. [Training the model on CPU with Watson Machine Learning Accelerator](#snapml-cpu)<br>

1. [Setup Random Forest using snapML GPU](#snapml-model-gpu)<br>

1. [Training the model on GPU with Watson Machine Learning Accelerator](#snapml-gpu)<br>

## Preparations
### Prepare directory and file for writing Random Forest engine.

In [1]:
from pathlib import Path
model_dir = '/project_data/data_asset/models'
model_base_name = 'RandomForest-main.py'
# parents=True also creates any missing parent directories
Path(model_dir).mkdir(parents=True, exist_ok=True)
print("create model directory done.")

create model directory done.


<a id = "skl-model"></a>
## Step 1 : Setup Random Forest model using scikit-learn.
### Create a Random Forest Model based on scikit-learn on CPU

In [2]:
model_main='sklearn-'+model_base_name

In [3]:
%%writefile {model_dir}/{model_main}

import os, datetime
import numpy as np
import pandas as pd
# import pickle
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as skrfc

# Define parameters for a large classification problem
n_samples = 2**18
n_features = 28 
# n_class = 2
data_type = np.float32

# Generate Data using scikit-learn
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          # n_class=n_class,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state=0)

print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))

# scikit-learn RandomForestClassifier model
sk_model = skrfc(n_estimators=40,
                 max_depth=16,
                 max_features=1.0,
                 random_state=10)

# Fit
start = datetime.datetime.now()
sk_model.fit(X_train, y_train)
end = datetime.datetime.now()
# %.2f keeps long durations readable (%.2g would print 351 s as 3.5e+02)
print ("train timecost: %.2fs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
sk_predict = sk_model.predict(X_test)
end = datetime.datetime.now()
print ("evaluate timecost: %.2fs" % ((end-start).total_seconds()))

# Accuracy is a fraction in [0, 1], not a duration, so it gets no "s" suffix
sk_acc = accuracy_score(y_test, sk_predict)
print("test accuracy: %.4f" % sk_acc)

# filename = './skl_random_forest_model.pkl'
# # save the trained scikit-learn model into a file
# pickle.dump(sk_model, open(filename, 'wb'))
# print("saved model to file ", filename)

Overwriting /project_data/data_asset/models/sklearn-RandomForest-main.py
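The script above times `fit` and `predict` by subtracting `datetime.datetime.now()` values. A possible alternative, shown here only as a sketch, is a small context manager built on `time.perf_counter()`, which is a monotonic clock intended for measuring intervals. The helper name `timed` and the `timings` dictionary are hypothetical, and the summation merely stands in for a call such as `sk_model.fit(X_train, y_train)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print("%s timecost: %.2fs" % (label, results[label]))

timings = {}
with timed("train", timings):
    # Stand-in workload; in the training script this would be model.fit(...)
    sum(i * i for i in range(100_000))
```

Collecting the durations in a dictionary makes it easy to print them together at the end of the job log.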


<a id = "skl-cpu"></a>
## Step 2 :  Training the scikit-learn model on CPU with Watson Machine Learning Accelerator
### Prepare the model lib for job submission:

In [4]:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

from matplotlib import pyplot as plt
%pylab inline

import base64
import json
import time
import urllib

Populating the interactive namespace from numpy and matplotlib


### Configuring your environment and project details
To set up your project details, provide your credentials in this cell. You must include your cluster URL, username, and password.

In [5]:
# please enter Watson Machine Learning Accelerator host name
hostname='wmla-console-wmla.apps.dse-perf.cpolab.ibm.com'
# login='username:password' # please enter the login and password
login='mluser1:mluser1'
es = base64.b64encode(login.encode('utf-8')).decode("utf-8")
# print(es)
commonHeaders={'Authorization': 'Basic '+es}
auth_url = 'https://{}/auth/v1/logon'.format(hostname)
print(auth_url)

# Log on with HTTP Basic auth to obtain an access token.
# verify=False disables TLS certificate verification for this cluster.
a=requests.get(auth_url,headers=commonHeaders, verify=False)
a.raise_for_status()
access_token=a.json()['accessToken']
# print("Access_token: ", access_token)

dl_rest_url = 'https://{}/platform/rest/deeplearning/v1'.format(hostname)
commonHeaders={'accept': 'application/json', 'X-Auth-Token': access_token}
req = requests.Session()

# Health check: fail early if the deep learning REST API is unreachable
confUrl = 'https://{}/platform/rest/deeplearning/v1/conf'.format(hostname)
r = req.get(confUrl, headers=commonHeaders, verify=False)
r.raise_for_status()

https://wmla-console-wmla.apps.dse-perf.cpolab.ibm.com/auth/v1/logon
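The cell above hard-codes the login string. One way to keep credentials out of the notebook, sketched here under the assumption that they are exported as environment variables, is to build the same Basic authorization header from `os.environ`. The variable names `WMLA_USER` and `WMLA_PASSWORD` and the helper `basic_auth_header` are hypothetical and not part of WMLA.

```python
import base64
import os

def basic_auth_header(username, password):
    """Build an HTTP Basic authorization header from a username and password."""
    token = base64.b64encode(
        "{}:{}".format(username, password).encode("utf-8")
    ).decode("utf-8")
    return {"Authorization": "Basic " + token}

# WMLA_USER / WMLA_PASSWORD are assumed names; the fallbacks are placeholders
username = os.environ.get("WMLA_USER", "username")
password = os.environ.get("WMLA_PASSWORD", "password")
auth_headers = basic_auth_header(username, password)
```

The resulting dictionary has the same shape as the first `commonHeaders` value passed to the logon request.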


### Define the status checking function

In [6]:
import tarfile
import tempfile
import os
import json
import pprint
import pandas as pd
from IPython.display import clear_output

def query_job_status(job_id, refresh_rate=3):
    # job_id is the JSON body returned by the submission request
    execURL = dl_rest_url + '/execs/' + job_id['id']
    pp = pprint.PrettyPrinter(indent=2)
    pd.set_option('max_colwidth', 120)

    res = None
    while True:
        res = req.get(execURL, headers=commonHeaders, verify=False)
        monitoring = pd.DataFrame(res.json(), index=[0])
        clear_output()
        print("Refreshing every {} seconds".format(refresh_rate))
        display(monitoring)
        pp.pprint(res.json())
        # Stop polling as soon as the job leaves the pending/running states
        if res.json()['state'] not in ['PENDING_CRD_SCHEDULER', 'SUBMITTED', 'RUNNING']:
            break
        time.sleep(refresh_rate)
    return res
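
The polling logic can be separated from the REST call and the notebook display, which makes it testable without a cluster and lets it give up after a deadline. The following is a minimal sketch, not part of the WMLA API: `wait_for_terminal_state`, its `fetch_state` callback, and the `timeout` parameter are hypothetical, while the three active state names are the ones used in the function above.

```python
import time

# States in which the job is still pending or running (taken from the cell above)
ACTIVE_STATES = {"PENDING_CRD_SCHEDULER", "SUBMITTED", "RUNNING"}

def wait_for_terminal_state(fetch_state, refresh_rate=3, timeout=3600):
    """Call fetch_state() until it returns a non-active state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state not in ACTIVE_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("job still %s after %s seconds" % (state, timeout))
        time.sleep(refresh_rate)

# Simulated state sequence standing in for repeated REST queries
states = iter(["SUBMITTED", "RUNNING", "RUNNING", "FINISHED"])
final_state = wait_for_terminal_state(lambda: next(states), refresh_rate=0)
print(final_state)
```

In the notebook, `fetch_state` could be a small function that issues the `req.get(...)` call and returns the `state` field of the JSON response.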

### Define the submission parameters

In [7]:
# specify the model file, conda env, device type and device number
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
--workerDeviceNum 1 \
--workerMemory 32G \
--workerDeviceType cpu \
--conda-env-name rapids-21.06-new  \
--model-main ' + model_main

print(args)

--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType cpu --conda-env-name rapids-21.06-new  --model-main sklearn-RandomForest-main.py
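
Because the later steps repeat this argument string with only the device type, conda environment, and model file changed, it may be convenient to generate it from a helper. The sketch below is one possible way to do that; `build_wmla_args` is a hypothetical name, and the flags and default values are simply the ones that appear in this notebook.

```python
def build_wmla_args(model_main, device_type="cpu", conda_env="rapids-21.06-new",
                    device_num=1, memory="32G"):
    """Assemble the submission argument string used in this notebook."""
    parts = [
        "--exec-start", "tensorflow",
        "--cs-datastore-meta", "type=fs",
        "--workerDeviceNum", str(device_num),
        "--workerMemory", memory,
        "--workerDeviceType", device_type,
        "--conda-env-name", conda_env,
        "--model-main", model_main,
    ]
    return " ".join(parts)

cpu_args = build_wmla_args("sklearn-RandomForest-main.py")
gpu_args = build_wmla_args("cuML-RandomForest-main.py", device_type="gpu")
print(cpu_args)
print(gpu_args)
```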


### Define the submission function

In [8]:
import datetime

def submit_job_to_wmla (args, files) :
    starttime = datetime.datetime.now()
    r = requests.post(dl_rest_url+'/execs?args='+args, files=files,
                  headers=commonHeaders, verify=False)
    if not r.ok:
        print('submit job failed: code=%s, %s'%(r.status_code, r.content))
        # Nothing to monitor if the submission itself was rejected
        return
    job_status = query_job_status(r.json(),refresh_rate=5)
    endtime = datetime.datetime.now()
    print("\nTotallly training cost: ", (endtime - starttime).seconds, " seconds.")

### Submit WMLA Workload

In [9]:
files = {'file': open("{0}/{1}".format(model_dir,model_main),'rb')}
submit_job_to_wmla (args, files)

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-961,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType cpu --...,wmla-961,mluser1,FINISHED,wmla-961,https://wmla-mss:9080,wmla,/gpfs/myresultfs/mluser1/batchworkdir/wmla-961/_submitted_code,SingleNodeTensorflowTrain,2021-07-29T05:39:46Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-961',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta type=fs '
          '--workerDeviceNum 1 --workerMemory 32G --workerDeviceType cpu '
          '--conda-env-name rapids-21.06-new  --model-main '
          'sklearn-RandomForest-main.py ',
  'createTime': '2021-07-29T05:39:46Z',
  'creator': 'mluser1',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-961',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-961',
  'workDir': '/gpfs/myresultfs/mluser1/batchworkdir/wmla-961/_submitted_code'}

Totallly training cost:  351  seconds.


<a id = "cuml-model"></a>
## Step 3 :  Setup Random Forest model using cuML
### Create a Random Forest Model based on cuML on GPU

In [10]:
model_main='cuML-'+model_base_name

In [11]:
%%writefile {model_dir}/{model_main}

import cudf
import numpy as np
import pandas as pd
# import pickle

from cuml.ensemble import RandomForestClassifier as curfc
from cuml.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier as skrfc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import os
import datetime

# Point the CuPy cache at /gpfs because ~/.cache is not available
os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/cache/rf"

# Define parameters for a large classification problem
n_samples = 2**18
n_features = 28 
# n_classes = 2
data_type = np.float32

# Generate data using scikit-learn (converted to cuDF further down)
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          # n_classes=n_classes,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size = 0.2,
                                                     random_state=0)

print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))

X_cudf_train = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)
y_cudf_train = cudf.Series(y_train.values)
y_cudf_test = cudf.Series(y_test.values)
    
# cuML RandomForestClassifier model
cuml_model = curfc(n_estimators=40,
                   max_depth=16,
                   max_features=1.0,
                   random_state=10)
# Fit
start = datetime.datetime.now()
cuml_model.fit(X_cudf_train, y_cudf_train)
end = datetime.datetime.now()
# %.2f keeps long durations readable (%.2g would switch to exponent notation)
print ("train timecost: %.2fs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
fil_preds_orig = cuml_model.predict(X_cudf_test)
end = datetime.datetime.now()
print ("evaluate timecost: %.2fs" % ((end-start).total_seconds()))

# Accuracy is a fraction in [0, 1], not a duration, so it gets no "s" suffix
# sk_acc = accuracy_score(y_test.to_numpy(), fil_preds_orig)
sk_acc = accuracy_score(y_cudf_test, fil_preds_orig)
print("test accuracy: %.4f" % sk_acc)

# filename = './cuml_random_forest_model.pkl'
# # save the trained cuml model into a file
# pickle.dump(cuml_model, open(filename, 'wb'))
# print("saved model to file ", filename)

Overwriting /project_data/data_asset/models/cuML-RandomForest-main.py


<a id = "cuml-gpu"></a>
## Step 4 :  Training the cuML model on GPU with Watson Machine Learning Accelerator
### Re-define the submission parameters

In [12]:
# specify the model file, conda env, device type and device number
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
--workerDeviceNum 1 \
--workerMemory 32G \
--workerDeviceType gpu \
--conda-env-name rapids-21.06-new  \
--model-main ' + model_main
print(args)

--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType gpu --conda-env-name rapids-21.06-new  --model-main cuML-RandomForest-main.py


In [13]:
files = {'file': open("{0}/{1}".format(model_dir,model_main),'rb')}
submit_job_to_wmla (args, files)

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-962,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType gpu --...,wmla-962,mluser1,FINISHED,wmla-962,https://wmla-mss:9080,wmla,/gpfs/myresultfs/mluser1/batchworkdir/wmla-962/_submitted_code,SingleNodeTensorflowTrain,2021-07-29T05:45:38Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-962',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta type=fs '
          '--workerDeviceNum 1 --workerMemory 32G --workerDeviceType gpu '
          '--conda-env-name rapids-21.06-new  --model-main '
          'cuML-RandomForest-main.py ',
  'createTime': '2021-07-29T05:45:38Z',
  'creator': 'mluser1',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-962',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-962',
  'workDir': '/gpfs/myresultfs/mluser1/batchworkdir/wmla-962/_submitted_code'}

Totallly training cost:  52  seconds.


<a id = "snapml-model"></a>
## Step 5 : Setup Random Forest model using snapML
### Create a Random Forest Model based on snapML

In [14]:
model_main='snapML-'+model_base_name

In [15]:
%%writefile {model_dir}/{model_main}

import numpy as np
import pandas as pd
# import pickle
from sklearn.metrics import accuracy_score
#from sklearn.ensemble import RandomForestClassifier as skrfc
from snapml import RandomForestClassifier as SnapRandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import os
import datetime

# Point the CuPy cache at /gpfs because ~/.cache is not available
os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/cache/rf"

# Define parameters for a large classification problem
n_samples = 2**18
n_features = 28 
# n_class = 2
data_type = np.float32

# Generate Data using scikit-learn
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          # n_class=n_class,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state=0)

print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))

# snapML RandomForestClassifier model
snap_model = SnapRandomForestClassifier(max_depth=16, 
                               n_estimators=40, 
                               n_jobs=4, 
                               random_state=10)
# Fit
start = datetime.datetime.now()
# Pass NumPy arrays (.values); with pandas objects Snap ML raises
# TypeError: Tree-based models in Snap ML only support numpy.ndarray
snap_model.fit(X_train.values, y_train.values)
end = datetime.datetime.now()
print ("train timecost: %.2fs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
sk_predict = snap_model.predict(X_test.values)
end = datetime.datetime.now()
print ("evaluate timecost: %.2fs" % ((end-start).total_seconds()))

# Accuracy is a fraction in [0, 1], not a duration, so it gets no "s" suffix
sk_acc = accuracy_score(y_test, sk_predict)
print("test accuracy: %.4f" % sk_acc)

# filename = './snapml_random_forest_model.pkl'
# # save the trained snapML model into a file
# pickle.dump(snap_model, open(filename, 'wb'))
# print("saved model to file ", filename)

Overwriting /project_data/data_asset/models/snapML-RandomForest-main.py


<a id = "snapml-cpu"></a>
## Step 6 : Training the SnapML model on CPU with Watson Machine Learning Accelerator
### Re-define the submission parameters

In [16]:
# specify the model file, conda env, device type and device number
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
--workerDeviceNum 1 \
--workerMemory 32G \
--workerDeviceType cpu \
--conda-env-name snapml-177rc \
--model-main ' + model_main
print(args)

--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType cpu --conda-env-name snapml-177rc --model-main snapML-RandomForest-main.py


### Submit WMLA Workload

In [17]:
files = {'file': open("{0}/{1}".format(model_dir,model_main),'rb')}
submit_job_to_wmla (args, files)

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-963,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType cpu --...,wmla-963,mluser1,FINISHED,wmla-963,https://wmla-mss:9080,wmla,/gpfs/myresultfs/mluser1/batchworkdir/wmla-963/_submitted_code,SingleNodeTensorflowTrain,2021-07-29T05:46:32Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-963',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta type=fs '
          '--workerDeviceNum 1 --workerMemory 32G --workerDeviceType cpu '
          '--conda-env-name snapml-177rc --model-main '
          'snapML-RandomForest-main.py ',
  'createTime': '2021-07-29T05:46:32Z',
  'creator': 'mluser1',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-963',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-963',
  'workDir': '/gpfs/myresultfs/mluser1/batchworkdir/wmla-963/_submitted_code'}

Totallly training cost:  66  seconds.
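
The three totals recorded so far (351 s, 52 s, and 66 s) can be turned into speedup factors relative to the scikit-learn run, as in the sketch below. The dictionary labels are illustrative. These totals are measured by the notebook from submission until the polling loop sees a finished state, so they presumably include scheduling and environment start-up in addition to model training and are quantized by the 5-second refresh interval.

```python
# Wall-clock totals printed by submit_job_to_wmla in the runs above (seconds)
recorded_seconds = {
    "scikit-learn (CPU)": 351,
    "cuML (GPU)": 52,
    "snapML (CPU)": 66,
}

baseline = recorded_seconds["scikit-learn (CPU)"]
speedups = {name: baseline / secs for name, secs in recorded_seconds.items()}

for name, secs in recorded_seconds.items():
    print("%-20s %4d s   %.2fx" % (name, secs, speedups[name]))
```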


<a id = "snapml-model-gpu"></a>
## Step 7 : Setup Random Forest model using snapML GPU
### Create a Random Forest Model based on snapML

In [18]:
model_main='snapML-gpu-'+model_base_name

In [19]:
%%writefile {model_dir}/{model_main}

import numpy as np
import pandas as pd
# import pickle
from sklearn.metrics import accuracy_score
#from sklearn.ensemble import RandomForestClassifier as skrfc
from snapml import RandomForestClassifier as SnapRandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import os
import datetime

# Point the CuPy cache at /gpfs because ~/.cache is not available
os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/cache/rf"

# Define parameters for a large classification problem
n_samples = 2**18
n_features = 28 
# n_class = 2
data_type = np.float32

# Generate Data using scikit-learn
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          # n_class=n_class,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state=0)

print("Number of examples: %d" % (X_train.shape[0]))
print("Number of features: %d" % (X_train.shape[1]))
print("Number of classes:  %d" % (len(np.unique(y_train))))

# snapML RandomForestClassifier model
# use_gpu=True requires use_histograms=True; otherwise Snap ML raises
# ValueError: GPU acceleration can only be enabled if use_histograms=True
snap_model = SnapRandomForestClassifier(use_gpu=True,
                                        use_histograms=True,max_depth=16, n_estimators=40, 
                                        n_jobs=1, random_state=10)
# Fit
start = datetime.datetime.now()
# Pass NumPy arrays (.values); with pandas objects Snap ML raises
# TypeError: Tree-based models in Snap ML only support numpy.ndarray
snap_model.fit(X_train.values, y_train.values)
end = datetime.datetime.now()
print ("train timecost: %.2fs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
sk_predict = snap_model.predict(X_test.values)
end = datetime.datetime.now()
print ("evaluate timecost: %.2fs" % ((end-start).total_seconds()))

# Accuracy is a fraction in [0, 1], not a duration, so it gets no "s" suffix
sk_acc = accuracy_score(y_test, sk_predict)
print("test accuracy: %.4f" % sk_acc)

# filename = './snapml_random_forest_model.pkl'
# # save the trained snapML model into a file
# pickle.dump(snap_model, open(filename, 'wb'))
# print("saved model to file ", filename)

Overwriting /project_data/data_asset/models/snapML-gpu-RandomForest-main.py


<a id = "snapml-gpu"></a>
## Step 8 : Training the SnapML model on GPU with Watson Machine Learning Accelerator
### Re-define the submission parameters

In [20]:
# specify the model file, conda env, device type and device number
args = '--exec-start tensorflow --cs-datastore-meta type=fs \
--workerDeviceNum 1 \
--workerMemory 32G \
--workerDeviceType gpu \
--conda-env-name snapml-177rc \
--model-main ' + model_main
print(args)

--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType gpu --conda-env-name snapml-177rc --model-main snapML-gpu-RandomForest-main.py


In [None]:
files = {'file': open("{0}/{1}".format(model_dir,model_main),'rb')}
submit_job_to_wmla (args, files)

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-965,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --workerMemory 32G --workerDeviceType gpu --...,wmla-965,mluser1,FINISHED,wmla-965,https://wmla-mss:9080,wmla,/gpfs/myresultfs/mluser1/batchworkdir/wmla-965/_submitted_code,SingleNodeTensorflowTrain,2021-07-29T05:51:37Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-965',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta type=fs '
          '--workerDeviceNum 1 --workerMemory 32G --workerDeviceType gpu '
          '--conda-env-name snapml-177rc --model-main '
          'snapML-gpu-RandomForest-main.py ',
  'createTime': '2021-07-29T05:51:37Z',
  'creator': 'mluser1',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-965',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-965',
  'workDir': '/gpfs/myresultfs/mluser1/batchworkdir/wmla-965/_submitted_code'}
