FastRGF Version 0.6
C++ Multi-core Implementation of Regularized Greedy Forest (RGF)
Please see the file CHANGES.md for the changelog of FastRGF.
This software package provides a multi-core implementation of a simplified Regularized Greedy Forest (RGF) described in [1]. Please cite the paper if you find the software useful.
RGF is a machine learning method for building decision forests that has been used to win some Kaggle competitions. In our experience, it works better than gradient boosting on many relatively large datasets.
The implementation employs the following concepts described in the original paper [1]:
- tree node regularization;
- fully-corrective update;
- greedy node expansion with trade-off between leaf node splitting for current tree and root splitting for new tree.
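The trade-off in the last concept above can be illustrated with a toy sketch (this is not FastRGF's actual code; the function and threshold parameter are illustrative, mirroring the `dtree.new_tree_gain_ratio` hyperparameter described later):

```python
# Toy illustration of greedy node expansion with a split-vs-new-tree
# trade-off: at each step, compare the best gain from splitting a leaf
# of the current tree against a discounted estimate of the gain from
# rooting a new tree, and pick whichever is larger.

def next_action(leaf_gains, new_tree_gain, new_tree_gain_ratio=1.0):
    """Return ('split', leaf_index) or ('new_tree', None)."""
    best_leaf = max(range(len(leaf_gains)), key=lambda i: leaf_gains[i])
    # A new tree is started when every leaf's gain falls below
    # new_tree_gain_ratio * estimated gain of creating a new tree.
    if leaf_gains[best_leaf] < new_tree_gain_ratio * new_tree_gain:
        return ("new_tree", None)
    return ("split", best_leaf)

print(next_action([0.5, 2.0, 1.2], new_tree_gain=1.0))  # -> ('split', 1)
print(next_action([0.5, 0.8, 0.2], new_tree_gain=1.0))  # -> ('new_tree', None)
```

Lowering the ratio makes the forest prefer growing deeper trees; raising it produces more, shallower trees.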
However, various simplifications are made to speed up training. Therefore, unlike the original RGF program, this software does not reproduce the results in the paper.
The implementation of greedy tree node optimization employs a second-order Newton approximation for general loss functions. For the logistic regression loss, which works especially well for many binary classification problems, this approach was considered in [2]; for general loss functions, the second-order approximation was considered in [3].
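As a sketch of the second-order approach (standard Newton-boosting notation; these are generic symbols, not FastRGF's internal ones):

```latex
% Second-order (Newton) approximation of the loss at a tree node.
% For the data points $i$ falling into a node, let $g_i$ and $h_i$ be the
% first and second derivatives of the loss w.r.t. the current prediction:
\[
  g_i = \frac{\partial \ell(y_i, f_i)}{\partial f_i}, \qquad
  h_i = \frac{\partial^2 \ell(y_i, f_i)}{\partial f_i^2}.
\]
% The Newton-optimal leaf weight under L2 regularization $\lambda$ and the
% resulting gain of the node are then
\[
  w^{*} = -\frac{\sum_i g_i}{\sum_i h_i + \lambda}, \qquad
  \text{gain} = \frac{1}{2}\,\frac{\bigl(\sum_i g_i\bigr)^2}{\sum_i h_i + \lambda}.
\]
% For the logistic loss $\ell = \log\bigl(1 + e^{-y_i f_i}\bigr)$ with
% $y_i \in \{-1, +1\}$, these derivatives are
% $g_i = -y_i\,\sigma(-y_i f_i)$ and $h_i = \sigma(y_i f_i)\,\sigma(-y_i f_i)$,
% where $\sigma$ denotes the sigmoid function.
```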
The software is written in C++11 and must be built from source with the g++ compiler. Note that compilation is possible only with g++-5 and newer versions: other compilers are unsupported, and older g++ versions produce corrupted files.
Download the package and extract the content. Alternatively, you can use git:

```sh
git clone https://github.com/RGF-team/rgf.git
```
The top directory of the extracted (or cloned) content is `rgf`, and this software package is located in the `FastRGF` subfolder. Below, all path expressions are relative to the `FastRGF` subfolder.
The source files are located in the `include` and `src` subfolders. The following executables will appear in the `bin` directory after you compile them by any of the methods listed below.
- `forest_train`: trains FastRGF and saves the model;
- `forest_predict`: applies an already trained model to test data.
You may use the `-h` option to show the command-line options of each executable file.
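For illustration, a minimal train/predict round trip might look like the sketch below. The key=value option names (`trn.x-file`, `trn.y-file`, `trn.target`, `model.save`, `model.load`, `tst.x-file`, `tst.output-prediction`) follow the examples shipped with the package, and all file names are placeholders; verify everything against the `-h` output of your build before relying on it.

```shell
# Hypothetical invocation -- option names follow the shipped examples,
# file names are placeholders; verify with `forest_train -h`.
./bin/forest_train \
    trn.x-file=train.x trn.y-file=train.y trn.target=BINARY \
    dtree.loss=LOGISTIC forest.ntrees=500 \
    model.save=model.bin

# Apply the saved model to test data and write predictions to a file.
./bin/forest_predict \
    model.load=model.bin tst.x-file=test.x \
    tst.output-prediction=pred.txt
```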
2.1. Windows

On Windows, compilation is supported only with CMake and MinGW-w64, because only this MinGW distribution provides POSIX threads.
```sh
mkdir build
cd build
cmake .. -G "MinGW Makefiles"
mingw32-make
mingw32-make install
```
2.2. Unix-like Systems
```sh
mkdir build
cd build
cmake ..
make
make install
```
3. Examples

Please go to the `examples` subfolder and follow the instructions in the `README.md` file there.
4. Hyperparameters Tuning
- `forest.ntrees`: Controls the number of trees in the forest. Typical range is `[100, 10000]`. Default value is `500`.
- `forest.opt`: Optimization method for training the forest. You can select `rgf` or `epsilon-greedy`. Default value is `rgf`.
- `forest.stepsize`: Controls the step size of epsilon-greedy boosting. Meant for use with `forest.opt=epsilon-greedy`. Default value is `0.0`.
- `dtree.max_level`: Controls the maximum tree depth. Default value is `6`.
- `dtree.max_nodes`: Controls the maximum number of leaf nodes in best-first search. Default value is `50`.
- `dtree.new_tree_gain_ratio`: Controls when to start a new tree. A new tree is created when leaf nodes' gain < this value * estimated gain of creating a new tree. Default value is `1.0`.
- `dtree.min_sample`: Controls the minimum number of training data points in each leaf node. Default value is `5`.
- `dtree.loss`: You can select the `LS`, `MODLS`, or `LOGISTIC` loss function. Default value is `LS`. However, for binary classification tasks `LOGISTIC` often works better.
- `dtree.lamL1`: Controls the degree of L1 regularization. A large value induces sparsity. Typical range is `[0.0, 1000.0]`. Default value is `1.0`.
- `dtree.lamL2`: Controls the degree of L2 regularization. The larger the value is, the larger `forest.ntrees` you need to use: the resulting accuracy is often better at the cost of a longer training time. Use a relatively large value such as `10000.0`. Default value is `1000.0`.
- `discretize.sparse.max_buckets`: Controls the maximum number of discretized values. Typical range is `[10, 250]`. Default value is `200`. Meant for use with sparse data.
  If you want to try a larger value, up to `65000`, then you need to edit `include/header.h` and replace `using disc_sparse_value_t=unsigned char;` with `using disc_sparse_value_t=unsigned short;`. However, this will increase the memory usage.
- `discretize.dense.max_buckets`: Controls the maximum number of discretized values. Typical range is `[10, 65000]`. Default value is `65000`. Meant for use with dense data.
- `discretize.sparse.min_bucket_weights`: Controls the minimum number of effective samples for each discretized value. Default value is `5.0`. Meant for use with sparse data.
- `discretize.dense.min_bucket_weights`: Controls the minimum number of effective samples for each discretized value. Default value is `5.0`. Meant for use with dense data.
- `discretize.sparse.lamL2`: Controls the degree of L2 regularization for discretization. Default value is `2.0`. Meant for use with sparse data.
- `discretize.dense.lamL2`: Controls the degree of L2 regularization for discretization. Default value is `2.0`. Meant for use with dense data.
- `discretize.sparse.max_features`: Controls the maximum number of selected features. Typical range is `[1000, 10000000]`. Default value is `80000`. Meant for use with sparse data.
- `discretize.sparse.min_occurrences`: Controls the minimum number of occurrences for a feature to be selected. Default value is `5`. Meant for use with sparse data.
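The `discretize.*` parameters above bin raw feature values into a bounded number of buckets before training, so trees only need to consider a small set of candidate splits per feature. A minimal sketch of equal-frequency (quantile-style) bucketing is shown below; it only illustrates the `max_buckets` idea, and FastRGF's actual discretizer additionally applies `min_bucket_weights` and L2-regularized cut selection:

```python
import bisect

# Illustrative quantile-style discretization (not FastRGF's algorithm):
# map each raw feature value to one of at most `max_buckets` integer
# bucket ids.

def make_buckets(values, max_buckets):
    """Return sorted bucket boundaries (upper edges) for `values`."""
    vals = sorted(values)
    n = len(vals)
    buckets = min(max_buckets, n)
    edges = []
    # pick roughly equal-frequency cut points, skipping duplicates
    for b in range(1, buckets):
        edge = vals[b * n // buckets]
        if not edges or edge > edges[-1]:
            edges.append(edge)
    return edges

def discretize(x, edges):
    """Return the bucket id of value x via binary search."""
    return bisect.bisect_left(edges, x)

data = [0.1, 0.2, 0.2, 0.4, 0.9, 1.5, 2.0, 3.5]
edges = make_buckets(data, max_buckets=4)
print([discretize(x, edges) for x in data])  # -> [0, 0, 0, 1, 1, 2, 2, 3]
```

A larger `max_buckets` preserves more resolution of the raw values at the cost of memory and training time, which is why the sparse default (`200`) fits in an `unsigned char` unless you widen the type as described above.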
Please post an issue in the GitHub repository for any errors you encounter.
FastRGF is distributed under the MIT license. Please read the file LICENSE.
[1] Rie Johnson and Tong Zhang. Learning Nonlinear Functions Using Regularized Greedy Forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):942-954, May 2014.
[2] Ping Li. Robust LogitBoost and Adaptive Base Class (ABC) LogitBoost. UAI, 2010.
[3] Zhaohui Zheng, Hongyuan Zha, Tong Zhang, Olivier Chapelle, Keke Chen, Gordon Sun. A General Boosting Method and its Application to Learning Ranking Functions for Web Search. NIPS, 2007.