Building random forests on GPU


I have decided to stop maintaining CudaTree, since scikit-learn has made significant improvements to its random forest and is well maintained.

CudaTree is an implementation of Leo Breiman's Random Forests adapted to run on the GPU. A random forest is an ensemble of randomized decision trees which vote together to predict new labels. CudaTree parallelizes the construction of each individual tree in the ensemble and thus is able to train faster than the latest version of scikit-learn.

We've also implemented a hybrid version of random forest which uses multiple GPUs together with a multicore CPU to fully utilize all the resources your system has. For the multicore part, we use scikit-learn's random forest by default; you can also supply another multicore implementation such as WiseRF.


  import numpy as np
  from cudatree import load_data, RandomForestClassifier

  x_train, y_train = load_data("digits")
  forest = RandomForestClassifier(n_estimators = 50, verbose = True, bootstrap = False)
  forest.fit(x_train, y_train, bfs_threshold = 4196)

For the hybrid version:

  import numpy as np
  from cudatree import load_data
  from hybridforest import RandomForestClassifier
  from PyWiseRF import WiseRF

  x_train, y_train = load_data("digits")
  # Build the random forest on two GPUs and six CPU cores.
  # The GPU part uses CudaTree; the CPU part uses WiseRF.
  forest = RandomForestClassifier(n_estimators = 50,
                                  n_gpus = 2,
                                  n_jobs = 6,
                                  cpu_classifier = WiseRF)
  forest.fit(x_train, y_train)


You should be able to install CudaTree from its PyPI package by running:

pip install cudatree


CudaTree is written for Python 2.7 and depends on:


It's important to remember that a dataset which fits into your computer's main memory may not necessarily fit on a GPU's smaller memory. Furthermore, CudaTree uses several temporary arrays during tree construction which will limit how much space is available. A formula for the total number of bytes required to fit a decision tree for a given dataset is given below. If less than this quantity is available on your GPU, then CudaTree will fail.

gpu memory = dataset + 2*samples*features*ceil(log2(samples)/8) + samples*features

For example, assume you have a training dataset which takes up 200MB, with 10000 samples and 3000 features. Since ceil(log2(10000)/8) = 2, the total GPU memory required will be:

200MB + (2 * 3000 * 10000 * 2 + 3000 * 10000) / 1024 / 1024 ≈ 343MB
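
The formula above can be evaluated for your own dataset before training. The following is a minimal sketch, not part of CudaTree: the function name `estimate_gpu_bytes` and its parameters are made up for illustration, and the result is only as accurate as the formula itself.

```python
def estimate_gpu_bytes(dataset_bytes, n_samples, n_features):
    """Evaluate the memory formula given above (hypothetical helper).

    gpu memory = dataset
               + 2 * samples * features * ceil(log2(samples) / 8)
               + samples * features
    """
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, computed
    # without floating point; rounding up to whole bytes then gives
    # ceil(log2(samples) / 8).
    index_bits = (n_samples - 1).bit_length()
    index_bytes = (index_bits + 7) // 8
    temporary_bytes = (2 * n_samples * n_features * index_bytes
                       + n_samples * n_features)
    return dataset_bytes + temporary_bytes


MB = 1024 * 1024
total = estimate_gpu_bytes(200 * MB, 10000, 3000)
print("%.0f MB" % (total / float(MB)))  # prints 343 MB
```

Comparing this number against the free memory reported for your GPU gives a quick indication of whether a dataset is likely to fit.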

In addition to the memory requirement, there are several other limitations hard-coded into CudaTree:

  • The maximum number of features allowed is 65,536.
  • The maximum number of categories allowed is 5000 (CudaTree performs best when the number of categories is <= 100).
  • Your NVIDIA GPU must have compute capability >= 1.3.
  • Currently, the only splitting criterion is Gini impurity, which means CudaTree can't yet do regression (splitting by variance for continuous outputs is planned).
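
The numeric limits listed above can be checked before any data is copied to the GPU. The following is a rough sketch, not something CudaTree provides: `check_limits` and the constant names are hypothetical, and the compute capability is assumed to be passed in as a (major, minor) tuple that you obtain from your CUDA tooling.

```python
# Limits as stated in the list above.
MAX_FEATURES = 65536
MAX_CATEGORIES = 5000
RECOMMENDED_CATEGORIES = 100
MIN_COMPUTE_CAPABILITY = (1, 3)


def check_limits(n_features, n_categories, compute_capability):
    """Return (errors, warnings) as two lists of messages (hypothetical helper)."""
    errors = []
    warnings = []
    if n_features > MAX_FEATURES:
        errors.append("too many features: %d > %d" % (n_features, MAX_FEATURES))
    if n_categories > MAX_CATEGORIES:
        errors.append("too many categories: %d > %d"
                      % (n_categories, MAX_CATEGORIES))
    elif n_categories > RECOMMENDED_CATEGORIES:
        warnings.append("more than %d categories: expect reduced performance"
                        % RECOMMENDED_CATEGORIES)
    # Tuples compare element by element, so (1, 2) < (1, 3) < (2, 0).
    if tuple(compute_capability) < MIN_COMPUTE_CAPABILITY:
        errors.append("compute capability %d.%d is below 1.3"
                      % tuple(compute_capability))
    return errors, warnings


errors, warnings = check_limits(3000, 10, (3, 5))
print(errors, warnings)
```

An empty `errors` list only means these particular limits are satisfied; the memory requirement has to be checked separately.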

Since scikit-learn changed its implementation of random forest, CudaTree may be slower than scikit-learn on some datasets. You can still get a performance boost by using the hybrid mode.

Implementation Details

Trees are first constructed in depth-first order, with a separate kernel launch for each node's subset of the data. Eventually the data gets split into very small subsets and at that point CudaTree switches to breadth-first grouping of multiple subsets for each kernel launch.
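
The switch between the two traversal orders can be illustrated on the CPU. The following is a toy sketch of the scheduling only, under several assumptions: nodes are represented by their sample counts, every split simply halves a node (real splits depend on the data), and the switch happens when a node has at most `bfs_threshold` samples, which is how the `bfs_threshold` argument in the usage example is assumed to behave. The function `schedule_launches` is hypothetical and does not exist in CudaTree.

```python
def schedule_launches(n_samples, bfs_threshold, min_samples=1):
    """Return a list of simulated kernel launches.

    Each launch is a list of the node sizes it processes: a single
    large node in the depth-first phase, or a whole level of small
    nodes in the breadth-first phase.
    """
    launches = []
    stack = [n_samples]   # large nodes, handled depth-first
    frontier = []         # small nodes, batched breadth-first later

    # Depth-first phase: one launch per large node.
    while stack:
        size = stack.pop()
        if size <= bfs_threshold:
            frontier.append(size)
            continue
        launches.append([size])
        left = size // 2          # toy split: halve the node
        stack.append(size - left)
        stack.append(left)

    # Breadth-first phase: one launch per level of small nodes.
    while frontier:
        launches.append(list(frontier))
        next_frontier = []
        for size in frontier:
            if size > min_samples:
                left = size // 2
                next_frontier.extend([left, size - left])
        frontier = next_frontier
    return launches


for launch in schedule_launches(8, bfs_threshold=2):
    print(launch)
```

With 8 samples and a threshold of 2, the three largest nodes each get their own launch, while all nodes of size 2 and then all nodes of size 1 are grouped into one launch per level. Grouping small subsets this way keeps the number of launches from growing with the number of leaves.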