# Train a Random Forest Model with Watson Machine Learning 

Notebook created by Zeming Zhao in June 2021

Random Forest is an ensemble classification method: it trains several decision trees and aggregates their outputs into a single prediction.
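
To make the aggregation step concrete, here is a minimal sketch that combines per-tree predictions by simple majority vote. The tree outputs are made-up values for illustration, and the helper names `majority_vote` and `forest_predict` are hypothetical. Real libraries may aggregate differently, for example by averaging class probabilities.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common label among the per-tree votes for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(per_tree_outputs):
    """Aggregate a list of per-tree prediction lists into one label per sample."""
    # zip(*...) regroups the outputs so each tuple holds every tree's vote for one sample
    return [majority_vote(votes) for votes in zip(*per_tree_outputs)]


# Hypothetical outputs of three trees for four samples (illustrative values only)
per_tree_outputs = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]

forest_labels = forest_predict(per_tree_outputs)
print(forest_labels)
```

Using an odd number of trees with two classes avoids tied votes in this simplified scheme.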

In this notebook we train a Random Forest classifier twice: once with cuML on GPU and once with scikit-learn on CPU. Each training script saves the trained model with Python's pickling mechanism, reloads it, and compares the prediction accuracy before and after pickling. Both scripts are submitted as jobs to Watson Machine Learning Accelerator.
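
The save-and-reload pattern can be sketched with the standard library alone. In the sketch below, a plain dictionary and the hypothetical `predict` helper stand in for a trained model, and the file lives in a temporary directory. The point is only that predictions match before and after the round trip.

```python
import os
import pickle
import tempfile


def predict(model, values):
    """Hypothetical stand-in for a classifier: label 1 when the value exceeds the threshold."""
    return [int(v > model["threshold"]) for v in values]


model = {"threshold": 0.5}          # stand-in for a trained model
samples = [0.1, 0.7, 0.5, 0.9]
preds_before = predict(model, samples)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.sav")
    # 'with' closes the file handle even if pickling raises
    with open(path, "wb") as f:
        pickle.dump(model, f)
    del model                       # drop the in-memory copy
    with open(path, "rb") as f:
        restored = pickle.load(f)

preds_after = predict(restored, samples)
print(preds_before == preds_after)
```

Only unpickle files you trust, because loading a pickle can execute arbitrary code.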

This notebook covers the following sections:

1. [Setup Random Forest using cuML](#rbm-model)<br>

1. [Training the model on GPU with Watson Machine Learning Accelerator](#gpu)<br>

1. [Setup Random Forest using scikit-learn](#rbm-model)<br>

1. [Training the model on CPU with Watson Machine Learning Accelerator](#cpu)<br>

<a id = "rbm-model"></a>
## Step 1 : Setup Random Forest model using cuML

### Prepare directory and file for writing Random Forest engine.

In [31]:
from pathlib import Path
model_dir = '/data/models'
model_main = 'RandomForest_main.py'
# parents=True also creates missing parent directories such as /data
Path(model_dir).mkdir(parents=True, exist_ok=True)
print("create model directory done.")

create model directory done.


### Create a Random Forest model based on cuML on GPU

In [32]:
%%writefile {model_dir}/{model_main}

import cudf
import numpy as np
import pandas as pd
import pickle

from cuml.ensemble import RandomForestClassifier as curfc
from cuml.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier as skrfc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import os
import datetime

# %matplotlib inline

# specify the cache location under /gpfs since ~/.cache is not available
os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/"

# Define parameters for a large classification problem
n_samples = 2**12
n_features = 399
n_info = 300
data_type = np.float32

# Generate Data
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          n_informative=n_info,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
# cuML Random Forest Classifier requires the labels to be integers
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size = 0.2,
                                                     random_state=0)

# cuML
X_cudf_train = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)

y_cudf_train = cudf.Series(y_train.values)

# # SK model
# sk_model = skrfc(n_estimators=40,
#                  max_depth=16,
#                  max_features=1.0,
#                  random_state=10)

# # Fit
# start = datetime.datetime.now()
# sk_model.fit(X_train, y_train)
# end = datetime.datetime.now()
# print ("train timecost: %.2gs" % ((end-start).total_seconds()))

# # Evaluate
# start = datetime.datetime.now()
# sk_predict = sk_model.predict(X_test)
# end = datetime.datetime.now()
# print ("evaluate timecost: %.2gs" % ((end-start).total_seconds()))

# sk_acc = accuracy_score(y_test, sk_predict)

             
# cuML Model
cuml_model = curfc(n_estimators=40,
                   max_depth=16,
                   max_features=1.0,
                   random_state=10)
# Fit
start = datetime.datetime.now()
cuml_model.fit(X_cudf_train, y_cudf_train)
end = datetime.datetime.now()
print ("train timecost: %.2gs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
fil_preds_orig = cuml_model.predict(X_cudf_test)
end = datetime.datetime.now()
print ("evaluate timecost: %.2gs" % ((end-start).total_seconds()))

fil_acc_orig = accuracy_score(y_test.to_numpy(), fil_preds_orig)

filename = './cuml_random_forest_model.sav'
# save the trained cuML model into a file
with open(filename, 'wb') as f:
    pickle.dump(cuml_model, f)
# delete the previous model to ensure that there is no leakage of pointers.
# this is not strictly necessary but just included here for demo purposes.
del cuml_model
# load the previously saved cuML model from the file
with open(filename, 'rb') as f:
    pickled_cuml_model = pickle.load(f)

pred_after_pickling = pickled_cuml_model.predict(X_cudf_test)

fil_acc_after_pickling = accuracy_score(y_test.to_numpy(), pred_after_pickling)

print("cuML accuracy of the RF model before pickling: %s" % fil_acc_orig)
print("cuML accuracy of the RF model after pickling: %s" % fil_acc_after_pickling)


Overwriting /data/models/RandomForest_main.py
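
The script above times the fit and predict steps by hand and prints the result with `%.2g`, which keeps two significant digits and switches to exponent notation once the value reaches about 100. A possible alternative is sketched below, assuming a small context manager (the name `timed` is hypothetical) built on `time.perf_counter` and a fixed-point format. The summed squares are only a placeholder workload.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Measure the wall-clock time of the enclosed block and store it in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        # %.3f keeps a fixed-point layout regardless of magnitude
        print("%s timecost: %.3fs" % (label, results[label]))


timings = {}
with timed("train", timings):
    # placeholder workload standing in for model.fit(...)
    total = sum(i * i for i in range(100_000))

# %.2g falls back to exponent notation for larger durations
formatted = "%.2g" % 123.4
print(formatted)
```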


## Step 2 :  Training the model on GPU with Watson Machine Learning Accelerator

<a id = "gpu"></a>
#### Prepare the model lib for running on GPU:

In [33]:
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

from matplotlib import pyplot as plt
%pylab inline

import base64
import json
import time
import urllib

Populating the interactive namespace from numpy and matplotlib


#### Configuring your environment and project details
To set up your project details, provide your credentials in this cell. You must include your cluster URL, username, and password.
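
Credentials typed directly into a notebook cell are easy to leak when the notebook is shared. One option, sketched here under assumptions, is to read them from environment variables and build the Basic authentication header from those values. The variable names `WMLA_USER` and `WMLA_PASSWORD` and the helper `basic_auth_header` are hypothetical and not part of Watson Machine Learning Accelerator.

```python
import base64
import os


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header from a username and password."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("utf-8")
    return {"Authorization": "Basic " + token}


# WMLA_USER and WMLA_PASSWORD are hypothetical variable names chosen for this sketch;
# the fallbacks are placeholders, not working credentials.
username = os.environ.get("WMLA_USER", "username")
password = os.environ.get("WMLA_PASSWORD", "password")

headers = basic_auth_header(username, password)
print(sorted(headers))
```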

In [34]:
hostname='wmla-console-wmla.apps.wml1x180.ma.platformlab.ibm.com'  # please enter Watson Machine Learning Accelerator host name
login='username:password'  # please enter the login and password
es = base64.b64encode(login.encode('utf-8')).decode("utf-8")
# print(es)
commonHeaders={'Authorization': 'Basic '+es}
req = requests.Session()
auth_url = 'https://{}/auth/v1/logon'.format(hostname)
print(auth_url)

a=requests.get(auth_url,headers=commonHeaders, verify=False)
access_token=a.json()['accessToken']
# print("Access_token: ", access_token)

https://wmla-console-wmla.apps.wml1x180.ma.platformlab.ibm.com/auth/v1/logon


In [35]:
dl_rest_url = 'https://{}/platform/rest/deeplearning/v1'.format(hostname)
commonHeaders={'accept': 'application/json', 'X-Auth-Token': access_token}
req = requests.Session()

In [36]:
# Health check
confUrl = 'https://{}/platform/rest/deeplearning/v1/conf'.format(hostname)
r = req.get(confUrl, headers=commonHeaders, verify=False)


#### Define the status-checking function

In [37]:
import tarfile
import tempfile
import os
import json
import pprint
import pandas as pd
from IPython.display import clear_output

def query_job_status(job_id, refresh_rate=3):
    """Poll the job until it leaves the pending/running states and return the last response."""
    execURL = dl_rest_url + '/execs/' + job_id['id']
    pp = pprint.PrettyPrinter(indent=2)
    pd.set_option('max_colwidth', 120)

    res = None
    while True:
        res = req.get(execURL, headers=commonHeaders, verify=False)
        status = res.json()
        monitoring = pd.DataFrame(status, index=[0])
        clear_output()
        print("Refreshing every {} seconds".format(refresh_rate))
        display(monitoring)
        pp.pprint(status)
        # stop as soon as the job is no longer pending or running
        if status['state'] not in ['PENDING_CRD_SCHEDULER', 'SUBMITTED', 'RUNNING']:
            break
        time.sleep(refresh_rate)
    return res
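
The polling logic can be exercised without a cluster by separating it from the HTTP call. The sketch below assumes a hypothetical `wait_for_job` helper that receives a `fetch_status` callable and adds an optional poll limit. The three active state names are the ones used in the function above. The sequence of fake statuses is invented for the demonstration.

```python
import time

# States in which the job is still pending or running (taken from the cell above)
ACTIVE_STATES = {"PENDING_CRD_SCHEDULER", "SUBMITTED", "RUNNING"}


def wait_for_job(fetch_status, refresh_rate=3, max_polls=None, sleep=time.sleep):
    """Call fetch_status() until the job leaves the active states, then return that status."""
    polls = 0
    while True:
        status = fetch_status()
        polls += 1
        if status["state"] not in ACTIVE_STATES:
            return status
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("job still active after %d polls" % polls)
        sleep(refresh_rate)


# Invented status sequence standing in for successive REST responses
fake_statuses = iter([
    {"state": "SUBMITTED"},
    {"state": "RUNNING"},
    {"state": "FINISHED"},
])

final_status = wait_for_job(lambda: next(fake_statuses), refresh_rate=0)
print(final_status["state"])
```

In the notebook, `fetch_status` could wrap the `req.get(...)` call and return `res.json()`; the poll limit guards against a job that never leaves the queue.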

In [38]:
model_file = model_dir+"/"+model_main
files = {'file': open(model_file , 'rb')}

args = '--exec-start tensorflow --cs-datastore-meta type=fs \
                     --workerDeviceNum 1 \
                     --conda-env-name rapids-21.06  \
                     --model-main /gpfs/mydatafs/models/RandomForest_main.py --workerDeviceType gpu'
                    # --epochs 5 --batch-size 10000 --workerDeviceType gpu'
                    # --model-main '+  "/gpfs/mydatafs/" + model_file + ' --epochs 5 --batch-size 10000'
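
Because the backslash continuations sit inside the string literal, the indentation of the continued lines becomes part of `args`, which is why the job output shows long runs of spaces. A possible alternative, sketched below, builds the string from a list and percent-encodes it before appending it to the URL. The flags and values are the ones from the cell above. Whether the endpoint requires explicit encoding is an assumption, since `requests` also normalizes URLs on its own.

```python
from urllib.parse import quote, unquote

# Same flags as in the cell above, one list element per token
arg_list = [
    "--exec-start", "tensorflow",
    "--cs-datastore-meta", "type=fs",
    "--workerDeviceNum", "1",
    "--conda-env-name", "rapids-21.06",
    "--model-main", "/gpfs/mydatafs/models/RandomForest_main.py",
    "--workerDeviceType", "gpu",
]

# Joining with single spaces avoids embedding source-code indentation in the string
args = " ".join(arg_list)

# Percent-encode spaces and '=' so the value is safe inside a query string
encoded_args = quote(args)
print(encoded_args)
```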

In [39]:
import datetime

starttime = datetime.datetime.now()

r = requests.post(dl_rest_url+'/execs?args='+args, files=files,
                  headers=commonHeaders, verify=False)
if not r.ok:
    print('submit job failed: code=%s, %s'%(r.status_code, r.content))
        
job_status = query_job_status(r.json(),refresh_rate=5)

endtime = datetime.datetime.now()

print("\nTraining cost: ", (endtime - starttime).seconds, " seconds.")

Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-209,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --...,wmla-209,admin,FINISHED,wmla-209,https://wmla-mss:9080,wmla,/gpfs/myresultfs/admin/batchworkdir/wmla-209/_submitted_code,SingleNodeTensorflowTrain,2021-07-02T07:46:08Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-209',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta '
          'type=fs                      --workerDeviceNum '
          '1                      --conda-env-name '
          'rapids-21.06                       --model-main '
          '/gpfs/mydatafs/models/RandomForest_main.py --workerDeviceType gpu ',
  'createTime': '2021-07-02T07:46:08Z',
  'creator': 'admin',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-209',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-209',
  'workDir': '/gpfs/myresultfs/admin/batchworkdir/wmla-209/_submitted_code'}

Training cost:  32  seconds.


## Step 3 :  Setup Random Forest model using scikit-learn
### Create a Random Forest model based on scikit-learn on CPU

In [44]:
%%writefile {model_dir}/{model_main}

# import cudf
import numpy as np
import pandas as pd
import pickle

# from cuml.ensemble import RandomForestClassifier as curfc
# from cuml.metrics import accuracy_score
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier as skrfc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import os
import datetime

# %matplotlib inline

# specify the cache location under /gpfs since ~/.cache is not available
os.environ["CUPY_CACHE_DIR"]="/gpfs/mydatafs/models/"

# Define parameters for a large classification problem
n_samples = 2**12
n_features = 399
n_info = 300
data_type = np.float32

# Generate Data
# SK
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          n_informative=n_info,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
# cuML Random Forest Classifier requires the labels to be integers
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state=0)

# SK model
sk_model = skrfc(n_estimators=40,
                 max_depth=16,
                 max_features=1.0,
                 random_state=10)

# Fit
start = datetime.datetime.now()
sk_model.fit(X_train, y_train)
end = datetime.datetime.now()
print ("train timecost: %.2gs" % ((end-start).total_seconds()))

# Evaluate
start = datetime.datetime.now()
sk_predict = sk_model.predict(X_test)
end = datetime.datetime.now()
print ("evaluate timecost: %.2gs" % ((end-start).total_seconds()))

sk_acc = accuracy_score(y_test, sk_predict)

filename = './sk_random_forest_model.sav'
# save the trained scikit-learn model into a file
with open(filename, 'wb') as f:
    pickle.dump(sk_model, f)
# delete the previous model so that the next prediction can only use the reloaded copy.
# this is not strictly necessary but just included here for demo purposes.
del sk_model
# load the previously saved scikit-learn model from the file
with open(filename, 'rb') as f:
    pickled_sk_model = pickle.load(f)

pred_after_pickling = pickled_sk_model.predict(X_test)

sk_acc_after_pickling = accuracy_score(y_test.to_numpy(), pred_after_pickling)

print("scikit-learn accuracy of the RF model before pickling: %s" % sk_acc)
print("scikit-learn accuracy of the RF model after pickling: %s" % sk_acc_after_pickling)
                      

Overwriting /data/models/RandomForest_main.py
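
The accuracy reported by both scripts is the fraction of test samples whose predicted label equals the true label. A minimal sketch of that calculation in plain Python follows. The helper name `accuracy` and the label lists are illustrative and not taken from the notebook's data.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Illustrative labels: three of the four predictions match
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
score = accuracy(y_true, y_pred)
print(score)
```

If pickling preserves the model, the accuracy before and after reloading should be identical, because the same trees produce the same predictions.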


<a id = "cpu"></a>
## Step 4 :  Training the scikit-learn model on CPU with Watson Machine Learning Accelerator

In [41]:
model_file = model_dir+"/"+model_main
files = {'file': open(model_file , 'rb')}

args = '--exec-start tensorflow --cs-datastore-meta type=fs \
                     --workerDeviceNum 1 \
                     --conda-env-name rapids-21.06  \
                     --model-main /gpfs/mydatafs/models/RandomForest_main.py --workerDeviceType cpu'
                    # --epochs 5 --batch-size 10000 --workerDeviceType gpu'
                    # --model-main '+  "/gpfs/mydatafs/" + model_file + ' --epochs 5 --batch-size 10000'

In [45]:
import datetime

starttime = datetime.datetime.now()

# ! python {model_dir}/{model_main} # --no-cuda --epochs 5 --batch-size 10000
r = requests.post(dl_rest_url+'/execs?args='+args, files=files,
                  headers=commonHeaders, verify=False)
if not r.ok:
    print('submit job failed: code=%s, %s'%(r.status_code, r.content))
        
job_status = query_job_status(r.json(),refresh_rate=5)

endtime = datetime.datetime.now()
print("Training cost: ", (endtime - starttime).seconds, " seconds.")


Refreshing every 5 seconds


Unnamed: 0,id,args,submissionId,creator,state,appId,schedulerUrl,modelFileOwnerName,workDir,appName,createTime,elastic,nameSpace,numWorker,framework
0,wmla-211,--exec-start tensorflow --cs-datastore-meta type=fs --workerDeviceNum 1 --...,wmla-211,admin,FINISHED,wmla-211,https://wmla-mss:9080,wmla,/gpfs/myresultfs/admin/batchworkdir/wmla-211/_submitted_code,SingleNodeTensorflowTrain,2021-07-02T07:49:54Z,False,wmla,1,tensorflow


{ 'appId': 'wmla-211',
  'appName': 'SingleNodeTensorflowTrain',
  'args': '--exec-start tensorflow --cs-datastore-meta '
          'type=fs                      --workerDeviceNum '
          '1                      --conda-env-name '
          'rapids-21.06                       --model-main '
          '/gpfs/mydatafs/models/RandomForest_main.py --workerDeviceType cpu ',
  'createTime': '2021-07-02T07:49:54Z',
  'creator': 'admin',
  'elastic': False,
  'framework': 'tensorflow',
  'id': 'wmla-211',
  'modelFileOwnerName': 'wmla',
  'nameSpace': 'wmla',
  'numWorker': 1,
  'schedulerUrl': 'https://wmla-mss:9080',
  'state': 'FINISHED',
  'submissionId': 'wmla-211',
  'workDir': '/gpfs/myresultfs/admin/batchworkdir/wmla-211/_submitted_code'}
Training cost:  58  seconds.
