# Random Forests Multi-node, Multi-GPU demo

The experimental cuML Multi-node, Multi-GPU (MNMG) implementation of Random Forests leverages Dask to do embarrassingly-parallel model fitting. For a random forest with N trees being fit by W workers, each worker wil build N / W trees. During inference, predictions from all N trees will be combined.

The caller is responsible for partitioning the data efficiently via Dask. To build an accurate model, it's important to ensure that each worker has a representative chunk of the data. This can come by distributing the data evenly after ensuring that it is well shuffled. Or, given sufficient memory capacity, the caller can replicate the data to all workers. This approach will most closely simulate the single-GPU building approach.

**Note:** cuML 0.9 contains the first, experimental preview release of the MNMG Random Forest model. The API is subject to change in future releases, and some known limitations remain:
 * Each worker must have at least one data partition or fitting will throw an error
 * The fit method only works with cuDF dataframe inputs. Future versions will support more flexible input types.
 * Prediction is still carried out on CPU, and it only accepts numpy arrays as an input. Performance will be suboptimal. In future versions, cuML will provide integration with the Forest Inferencing Library (FIL). Prediction requires scattering data to all workers, which can lead to inefficiencies that will be addressed in future versions.

In [14]:
import numpy as np
import sklearn

import pandas as pd
import cudf
import cuml

from cuml.test.dask.utils import dask_make_blobs
from cuml.dask.common import extract_ddf_partitions
from cuml.dask.common import utils as dask_utils

from sklearn.metrics import accuracy_score
from sklearn import model_selection

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf

from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF
from sklearn.ensemble import RandomForestClassifier as sklRF

# Start Dask cluster

In [2]:
# This will use all GPUs on the local host by default
cluster = LocalCUDACluster(threads_per_worker=1)
c = Client(cluster)

# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)

# Define Parameters

In addition to the number of examples, Random Forest fitting performance depends heavily on the number of columns in a dataset and (especially) on the maximum depth to which trees are allowed to grow. Lower max_depth values can greatly speed up fitting, though going too low may reduce accuracy.

In [24]:
train_size = 100000
test_size = 1000
n_samples = train_size + test_size
n_features = 20

max_depth = 12
num_trees = 1000

# Generate Data

In this case, we generate data on the client (initial process) and pass it to the workers. You could also load data directly onto the workers via, for example, `dask_cudf.read_csv`. See also the KMeans MNMG notebook for an alternative method of generating data on the worker nodes.

In [25]:
X, y = sklearn.datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                 n_clusters_per_class=1, n_informative=int(n_features / 3),
                                 random_state=123, n_classes=5)
y = y.astype(np.int32)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size)

# Distribute data to workers

In [26]:
n_partitions = n_workers

# First convert to cudf (with real data, you would likely load in cuDF format to start)
X_train_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_train))
y_train_cudf = cudf.Series(y_train)

# Partition with Dask
X_train_dask = dask_cudf.from_cudf(X_train_cudf, npartitions=n_partitions)
y_train_dask = dask_cudf.from_cudf(y_train_cudf, npartitions=n_partitions)

# Persist to cache the data in active memory
X_train_dask, y_train_dask = \
  dask_utils.persist_across_workers(c, [X_train_dask, y_train_dask], workers=workers)

# Build a scikit-learn model (single node)

Dask does not currently have a simple wrapper for scikit-learn's RandomForest, but scikit-learn does offer multi-CPU support via joblib, which we'll use.

In [27]:
%%time

# Use all avilable CPU cores
skl_model = sklRF(max_depth=max_depth, n_estimators=n_trees, n_jobs=-1)
skl_model.fit(X_train, y_train)

CPU times: user 1min 52s, sys: 44.2 ms, total: 1min 52s
Wall time: 9.94 s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=12, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

# Train the distributed cuML model

In [28]:
%%time

cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees)
cuml_model.fit(X_train_dask, y_train_dask)

wait(cuml_model.rfs) # Allow asynchronous training tasks to finish

{'tcp://10.110.47.75:37181': <Future: status: pending, key: _func_build_rf-530d9b776197dffc611ba558d6c3787e>}
CPU times: user 101 ms, sys: 0 ns, total: 101 ms
Wall time: 1.44 s


DoneAndNotDoneFutures(done={<Future: status: finished, type: RandomForestClassifier, key: _func_build_rf-530d9b776197dffc611ba558d6c3787e>}, not_done=set())

# Predict and check accuracy

In [29]:
skl_y_pred = skl_model.predict(X_test)
cuml_y_pred = cuml_model.predict(X_test)

print("SKLearn accuracy:  ", accuracy_score(y_test, skl_y_pred))
print("CuML accuracy:     ", accuracy_score(y_test, cuml_y_pred))

SKLearn accuracy:   0.868
CuML accuracy:      0.855
