datasink: A Pipeline for Large-Scale Heterogeneous Ensemble Learning
Datasink is a customizable pipeline for generating diverse ensembles of heterogeneous classifiers, as well as the accompanying metadata needed for ensemble learning approaches utilizing ensemble diversity for improved performance. It also fairly evaluates the performance of several ensemble learning methods including greedy selection, enhanced selection [Caruana2004], and stacked generalization (stacking) [Wolpert1992]. Though other tools exist, we are unaware of a similarly modular, scalable pipeline designed for large-scale ensemble learning. Datasink was developed to support research by Sean Whalen and Gaurav Pandey (see [Whalen2013]) with the support of the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai.
Datasink is designed for generating extremely large ensembles (taking days or weeks to generate) and thus consists of an initial data generation phase tuned for multicore and distributed computing environments. The output is a set of compressed CSV files containing the class distribution produced by each classifier that serves as input to a later ensemble learning phase.
Data is generated by a customized pipeline built around the Java-based Weka machine learning package [Hall2009]. For simplicity and extensibility, the pipeline uses an interpreted variant of Java called Groovy that calls compiled Weka code without performance penalty. Thus the data generation prerequisites are Java, Groovy, and Weka.
Ensemble learning is implemented in Python using the popular pandas/scikit-learn analytics stack [McKinney2012,Pedregosa2011], and requires Python, Cython, NumPy, SciPy, pandas, and scikit-learn.
Older versions may work for some packages if current versions are not available.
There is no installer for datasink. However, the installation of the prerequisites and their dependencies can usually be handled by the package manager for your operating system. We assume comfort with command line execution and provide setup instructions for Ubuntu Linux and OS X below.
This README details the setup and use of datasink via several examples but is not intended as a general tutorial on ensemble learning, version control, or particular libraries.
Setup option 1: Ubuntu Linux
Ubuntu and other Debian-based Linux distributions use the apt-get command for installing packages and their dependencies. See the howto or run man apt-get for more details.
To install the prerequisites for datasink, run:
sudo apt-get -y install groovy cython python-numpy python-scipy python-pip
sudo pip install -U pandas scikit-learn
A suitable version of Weka is unfortunately not bundled with Ubuntu, so run the following:
sudo apt-get -y install curl unzip
curl -O -L http://prdownloads.sourceforge.net/weka/weka-3-7-10.zip
unzip weka-3-7-10.zip
sudo cp weka-3-7-10/weka.jar /usr/share/java
Setup option 2: Ubuntu virtual machine
This option downloads and runs Ubuntu 13.04 64-bit under the VirtualBox virtual machine, incurring some performance penalty but allowing you to evaluate datasink in a completely self-contained, pre-configured environment. Skip this section if you aren't familiar with virtual machines.
First install VirtualBox and Vagrant, then run the following:
mkdir dvm; cd dvm
vagrant init
vagrant box add base http://cloud-images.ubuntu.com/vagrant/raring/current/raring-server-cloudimg-amd64-vagrant-disk1.box
vagrant up
vagrant ssh
This will download a fresh Ubuntu disk image and start up the virtual machine, taking several minutes to complete and leaving you with a login prompt inside the virtual machine. Proceed with the instructions from Option 1 to install datasink inside this virtual machine, and type exit to return to your host OS when desired. The virtual machine can be brought down using vagrant halt from the host command line.
Due to the performance penalty of VMs, extended use of this option is not recommended; it is provided primarily for self-contained evaluation purposes. Performance can be improved substantially by increasing the number of CPU cores and RAM granted to the VM. See the Vagrant documentation for details.
Thanks to Olivier Grisel for the original document these instructions are based on.
Setup option 3: OS X
There are several options for installing the prerequisites under OS X. Pre-built Python distributions such as Enthought contain the necessary Python components and OS X comes bundled with a suitable version of Java. Advanced users can simply install a binary version of Groovy and Weka from their respective websites, place the Weka JAR file in their CLASSPATH, and begin generating ensembles.
Other users may wish to use the MacPorts project to install the prerequisites and their dependencies in a self-contained directory that can easily be upgraded or removed later if desired. This option requires Apple's free Xcode developer tools, the optional Xcode command line tools installable from the developer tools GUI, and the MacPorts software for your version of OS X.
MacPorts downloads the required packages and their dependencies, but must compile from source if binaries are not available for your system; this can take hours for a fresh MacPorts installation as there are several dozen large packages to compile. Run the following to update MacPorts and install the prerequisites:
sudo port selfupdate
sudo port install groovy py27-cython py27-pandas py27-scikit-learn
sudo port select --set python python27
A suitable version of Weka is unfortunately not bundled with MacPorts, so run the following:
curl -O -L http://prdownloads.sourceforge.net/weka/weka-3-7-10.zip
unzip weka-3-7-10.zip
sudo cp weka-3-7-10/weka.jar /opt/local/share/java
Obtaining the source
The latest source code can be obtained by cloning the public git repository using the following from the command line:
git clone https://github.com/shwhalen/datasink.git
This will create a datasink subdirectory in your working directory containing the source code. The git program comes bundled with recent versions of OS X; it can be installed under Ubuntu using sudo apt-get -y install git. Updates can be obtained by running git pull from the datasink directory.
Compiling the Cython module
Several functions are accelerated by Cython and must first be compiled by running make from the git repository directory.
Setting environment variables
Java must be told where Weka is located and how much RAM to use by modifying the CLASSPATH and JAVA_OPTS environment variables. A simple way to set these variables is to add the following to your shell's login script for Ubuntu:
export CLASSPATH=$CLASSPATH:/usr/share/java/weka.jar
export JAVA_OPTS="-Xmx4g"
or for OS X:
export CLASSPATH=$CLASSPATH:/opt/local/share/java/weka.jar
export JAVA_OPTS="-Xmx4g"
The above is Bash syntax and allows Weka to use up to 4 gigs of RAM; adjust accordingly for your setup.
The groovy executable must also be somewhere in your search path (and already is if using the Ubuntu or MacPorts instructions above). If groovy was manually installed, for example under $HOME/groovy on a cluster, add the following to your login script:
Quick setup: Docker image
With a Docker image, users don't need to worry about binaries, dependencies, or environment variables, since all of these are configured in the Dockerfile. A container with all components can easily be created from the existing Docker image.
As a prerequisite, Docker must be installed. You can then pull the image from Docker Cloud and start a container by running:
docker pull alexwang0106/datasink:2.2
docker run -ti alexwang0106/datasink:2.2 /bin/bash
Alternatively, build the datasink Docker image yourself by downloading the Dockerfile and running the following commands:
docker build /PATH/TO/DOCKERFILE
docker run -ti IMAGE_NAME /bin/bash
This will create an interactive container shell. You can copy your own data to the container using:
docker cp foo.txt YOUR_CONTAINER:/path/to/dir/
You're finally ready to set up and construct an ensemble!
Walkthrough: Building an ensemble
Ensemble generation requires 3 files, ideally inside a self-contained project directory:
- Training data in ARFF format
- A file listing the classifiers to train
- A weka.properties file pointing to the above files and configuring other pipeline settings
To begin, we create a project directory in our home directory and download an example dataset from the command line:
mkdir ~/diabetes; cd ~/diabetes
curl -O http://repository.seasr.org/Datasets/UCI/arff/diabetes.arff
Next we create a weka.properties file to configure our pipeline:
cat > weka.properties << EOF
classifiersFilename = classifiers.txt
inputFilename = diabetes.arff
classAttribute = class
predictClassValue = tested_positive
balanceTraining = false
foldCount = 10
nestedFoldCount = 10
bagCount = 10
EOF
Finally, we create classifiers.txt containing the Weka classifiers and associated parameters we want included in the ensemble:
cat > classifiers.txt << EOF
weka.classifiers.bayes.NaiveBayes -D
weka.classifiers.functions.SGD -F 1
#weka.classifiers.meta.LogitBoost
EOF
weka.classifiers.meta.LogitBoost is preceded by a comment marker (#); such lines are skipped and thus excluded from ensemble generation. We'll leave LogitBoost commented out for now, and later see how its inclusion changes ensemble performance.
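The comment-skipping rule can be illustrated with a few lines of Python. This is only a sketch of the file format as described above: parse_classifier_list is a hypothetical helper written for this example and is not part of datasink, whose own parsing may differ in details such as whitespace handling.

```python
def parse_classifier_list(text):
    """Return (classname, options) pairs, skipping blank lines and '#' comments."""
    classifiers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # commented-out classifiers are excluded from the ensemble
        # The first token is the Java class name; the rest are Weka options.
        classname, _, options = line.partition(" ")
        classifiers.append((classname, options.strip()))
    return classifiers


example = """\
weka.classifiers.bayes.NaiveBayes -D
weka.classifiers.functions.SGD -F 1
#weka.classifiers.meta.LogitBoost
"""

for classname, options in parse_classifier_list(example):
    print(classname, options)
```

Only the NaiveBayes and SGD lines are returned; the LogitBoost line is ignored until its comment marker is removed.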
As specified in the weka.properties file, the data is first divided into 10 folds of independent training and test splits for cross validation. Each training split is resampled with replacement 10 times (a process called bagging [Breiman1996]), and nested cross validation is performed on each of these resampled training splits to produce the data necessary for ensemble techniques.
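The nesting of folds, bags, and nested folds can be sketched with plain index lists. The code below is an illustration under stated assumptions, not datasink's implementation: kfold_indices and bootstrap are hypothetical helpers, the folds are unstratified, and it assumes one extra model per bag is fit on the whole resampled split to predict the outer test fold.

```python
import random


def kfold_indices(n, k, rng):
    """Shuffle range(n) and deal it into k disjoint test folds."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]


def bootstrap(indices, rng):
    """Resample with replacement to the original size (one 'bag')."""
    return [rng.choice(indices) for _ in indices]


rng = random.Random(0)
n = 768  # number of instances in the diabetes data
fold_count, bag_count, nested_fold_count = 10, 10, 10

fits = 0
for test in kfold_indices(n, fold_count, rng):
    held_out = set(test)
    train = [i for i in range(n) if i not in held_out]
    for _ in range(bag_count):
        bagged = bootstrap(train, rng)
        # Nested cross validation over positions in the resampled split:
        # each nested model predicts its own held-out nested fold.
        for nested_test in kfold_indices(len(bagged), nested_fold_count, rng):
            fits += 1
        # Assumed: one more model trained on the full bag predicts `test`.
        fits += 1

print(fits)  # 10 folds * 10 bags * (10 nested + 1) = 1100 fits per classifier type
```

Under these assumptions every line in classifiers.txt costs on the order of a thousand model fits, which is why nestedFoldCount and bagCount dominate execution time.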
Before generating the ensemble, we first examine the non-ensemble performance of each base classifier using 10-fold cross validation in Weka. Several performance metrics are produced by Weka, but datasink focuses on the area under the receiver operating characteristic (ROC) curve (AUC) since it is well-suited to imbalanced class distributions that often occur with real data. However, any metric can be computed using the CSV files generated by the analysis scripts.
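For readers unfamiliar with AUC, it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the auc function and the four toy scores are illustrative only, and the quadratic loop is meant for clarity rather than speed.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75: three of the four pairs are ranked correctly
```

Because only the ranking of scores matters, the value does not depend on a decision threshold or on the ratio of positives to negatives.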
weka.classifiers.bayes.NaiveBayes -D

TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
0.816    0.392    0.795      0.816   0.806      0.429  0.806     0.882     tested_negative
0.608    0.184    0.639      0.608   0.623      0.429  0.806     0.676     tested_positive

weka.classifiers.functions.SGD -F 1

TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
0.888    0.444    0.789      0.888   0.835      0.478  0.832     0.892     tested_negative
0.556    0.112    0.727      0.556   0.630      0.478  0.832     0.713     tested_positive
Next we construct an ensemble of 20 Naive Bayes (NB) and Logistic Regression (LR, trained using Stochastic Gradient Descent) base classifiers; recall each classifier type is bagged 10 times. This takes ~3-4 minutes on a modern 4 core system with 8 gigs of RAM and should decrease linearly with the number of cores:
cd ~/datasink
python generate.py ~/diabetes
Because the code is architected for multicore and distributed environments, many processes are spawned and each writes its output to a unique file. These files must first be merged:
python combine.py ~/diabetes
Ensemble methods are then applied:
python mean.py ~/diabetes
0.836
python stacking.py ~/diabetes standard
0.837 20
python selection.py ~/diabetes greedy
0.841 15
python selection.py ~/diabetes enhanced
0.839 16
The output after each script gives the AUC calculated over all cross validation folds as well as the average size of the ensemble when applicable. The performance of these methods can vary greatly depending on the dataset and in particular the number of training examples, with simpler methods typically performing better for smaller datasets.
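To give a feel for why selection can return fewer classifiers than were generated, here is a minimal sketch of one common form of greedy forward selection: repeatedly add the classifier whose inclusion most improves the AUC of the averaged predictions, and stop when nothing helps. It is not a copy of selection.py, which may differ. The function names and the three toy prediction columns (A, B, C) are made up for illustration, and in practice candidates would be scored on the nested cross validation predictions rather than on test data.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def greedy_select(labels, predictions):
    """Forward selection without replacement over a dict of name -> scores."""
    selected, best = [], 0.0
    total = [0.0] * len(labels)  # running sum of the selected predictions
    remaining = dict(predictions)
    while remaining:
        # AUC depends only on ranking, so scoring the sum equals scoring the mean.
        trial = {
            name: auc(labels, [t + s for t, s in zip(total, scores)])
            for name, scores in remaining.items()
        }
        name = max(trial, key=trial.get)
        if trial[name] <= best:
            break  # no candidate improves the ensemble
        best = trial[name]
        total = [t + s for t, s in zip(total, remaining.pop(name))]
        selected.append(name)
    return selected, best


labels = [0, 0, 0, 1, 1, 1]
predictions = {  # hypothetical positive-class probabilities
    "A": [0.1, 0.6, 0.3, 0.5, 0.8, 0.9],
    "B": [0.2, 0.1, 0.7, 0.6, 0.4, 0.9],
    "C": [0.9, 0.2, 0.5, 0.3, 0.6, 0.1],
}
print(greedy_select(labels, predictions))  # (['A', 'B'], 1.0)
```

In this toy case A and B make different mistakes, so their average ranks every positive above every negative, while adding the poorly performing C would lower the AUC and is therefore left out.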
Growing the ensemble
Let's add LogitBoost to the ensemble, looking first at its base performance in Weka:
weka.classifiers.meta.LogitBoost

TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
0.846    0.455    0.776      0.846   0.810      0.410  0.810     0.891     tested_negative
0.545    0.154    0.655      0.545   0.595      0.410  0.810     0.668     tested_positive
To add this classifier to the ensemble, edit classifiers.txt, using comment markers (#) as discussed above to exclude the previous classifiers and include LogitBoost, or execute the following command as a shortcut:
cat > ~/diabetes/classifiers.txt << EOF
#weka.classifiers.bayes.NaiveBayes -D
#weka.classifiers.functions.SGD -F 1
weka.classifiers.meta.LogitBoost
EOF
Alternatively, leave all lines uncommented to see how the ensemble generation script only produces output for LogitBoost, as it recognizes that NB and LR are already generated. Now create the LogitBoost classifiers (~2 mins), combine with the previous NB and LR output, and run the ensemble methods:
python generate.py ~/diabetes
python combine.py ~/diabetes
python mean.py ~/diabetes
0.836
python stacking.py ~/diabetes standard
0.841 30
python selection.py ~/diabetes greedy
0.843 17
python selection.py ~/diabetes enhanced
0.840 24
Note that the performance of mean-aggregated predictions remains unchanged, while stacking and selection methods get a small boost.
The importance of bagging
Compare the performance of the Naive Bayes base classifier (0.806) to its bagged performance (0.8254) using mean aggregation: bagging provides a non-trivial boost. Run cd ~/diabetes and try the following under ipython to see how this simple aggregation method works. We first create a pandas DataFrame object indexed by a unique ID and class label for each example:
from glob import glob
from pandas import concat, read_csv
from sklearn.metrics import roc_auc_score

# Load every per-fold prediction file; the index is (example ID, class label).
df = concat([read_csv(path, compression = 'gzip', index_col = [0, 1])
             for path in glob('predictions-*.csv.gz')])
The probability assigned to the positive class by each resampled classifier is stored across columns, shown here with a number appended to the classifier name for each bagged version:
print(df.columns)
Index([NaiveBayes.0, NaiveBayes.1, NaiveBayes.2, NaiveBayes.3, NaiveBayes.4,
       NaiveBayes.5, NaiveBayes.6, NaiveBayes.7, NaiveBayes.8, NaiveBayes.9,
       SGD.0, SGD.1, SGD.2, SGD.3, SGD.4, SGD.5, SGD.6, SGD.7, SGD.8, SGD.9,
       LogitBoost.0, LogitBoost.1, LogitBoost.2, LogitBoost.3, LogitBoost.4,
       LogitBoost.5, LogitBoost.6, LogitBoost.7, LogitBoost.8, LogitBoost.9],
      dtype=object)
Here we grab the class labels from the index, take the row mean of the first 10 columns corresponding to Naive Bayes, and calculate the AUC:
labels = df.index.get_level_values(1).values
roc_auc_score(labels, df.iloc[:, :10].mean(axis = 1))
0.8254
A similar increase compared to the base classifier is observed for LogitBoost:
roc_auc_score(labels, df.iloc[:, 20:30].mean(axis = 1))
0.8293
but a small dip occurs for Logistic Regression:
roc_auc_score(labels, df.iloc[:, 10:20].mean(axis = 1))
0.8283
Recall the base LR performance is 0.832 which is already quite close to the ensemble's performance. Resampling the training data with replacement to create diversity excludes approximately one third of the training instances due to chance. Given its decreased performance, the loss of these training examples is more detrimental to LR than what it gains from ensembling. However, the dip is relatively small and resampling creates the diversity necessary to increase the overall performance of the ensemble when other classifiers are included.
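The "approximately one third" figure follows from a short calculation: each of the n draws misses a given instance with probability 1 - 1/n, so the instance is absent from the whole resampled set with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The sketch below checks this both analytically and by simulation; the split size of 691 is an assumption, roughly nine tenths of the 768 diabetes instances.

```python
import math
import random


def excluded_fraction(n):
    """Probability that a given instance is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n


n = 691  # assumed size of one training split (about 9/10 of 768 instances)

print(round(excluded_fraction(n), 3))  # analytic value for this n
print(round(math.exp(-1), 3))          # limiting value 1/e

# Simulate one bag: draw n indices with replacement and count the distinct ones.
rng = random.Random(0)
drawn = {rng.randrange(n) for _ in range(n)}
simulated = 1 - len(drawn) / n
print(round(simulated, 3))             # fraction of instances never drawn
```

Both the analytic and the simulated values land near 0.37, so each bagged classifier sees a little under two thirds of the distinct training instances.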
Heterogeneous vs. homogeneous ensembles
The advantage of using heterogeneous ensembles is clear when we compare their performance to state-of-the-art homogeneous ensemble techniques such as Random Forests. To maximize performance of the forest, we increase the number of trees to 500 (a parameter that doesn't overfit [Breiman2001]) and reduce the maximum tree depth to prevent overfitting on this small dataset.
weka.classifiers.trees.RandomForest -I 500 -depth 5

TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
0.864    0.429    0.790      0.864   0.825      0.458  0.833     0.904     tested_negative
0.571    0.136    0.692      0.571   0.626      0.458  0.833     0.700     tested_positive
Compare the 0.833 AUC of this homogeneous ensemble to the 0.843 achieved above using greedy selection with only 3 classifier types.
Another (condensed) example
Let's see if these trends hold for another dataset:
mkdir ~/liver; cd ~/liver
curl -O http://repository.seasr.org/Datasets/UCI/arff/liver-disorders.arff
cat > weka.properties << EOF
classifiersFilename = classifiers.txt
inputFilename = liver-disorders.arff
classAttribute = selector
predictClassValue = 2
balanceTraining = false
foldCount = 10
nestedFoldCount = 10
bagCount = 10
EOF
cat > classifiers.txt << EOF
weka.classifiers.functions.MultilayerPerceptron
weka.classifiers.lazy.IBk
weka.classifiers.meta.AdaBoostM1
weka.classifiers.rules.JRip
EOF
cd ~/datasink
python generate.py ~/liver
python combine.py ~/liver
python mean.py ~/liver
python selection.py ~/liver greedy
python selection.py ~/liver enhanced
python stacking.py ~/liver standard
|Classifier or method||AUC||Notes|
|AdaBoostM1||0.684||DecisionStump base learner|
|RandomForest||0.768||200 trees, excluded from ensemble|
|stacking||0.775||RandomForest stacker, max_depth = 5|
This time we use a different set of classifiers and give the performance of a random forest for reference. Note that 40 bagged heterogeneous classifiers outperform a random forest of 200 trees for three out of four aggregation methods. Though enhanced ensemble selection has not performed as well as simpler methods in these examples, it tends to perform similarly to stacking for larger datasets where greedy selection begins to fall behind. It is important to emphasize that differences in performance should be evaluated for statistical significance; see [Demšar2006] for a review of non-parametric comparison methods. In a paper currently under review, we find statistically significant differences between heterogeneous ensembles and the best base classifier (homogeneous ensembles) for several complex, real-world datasets.
The UCI machine learning repository provides datasets for benchmarking machine learning algorithms. Below is the AUC for several UCI binary classification datasets using 3 classifier types discussed above: NB, LR, and LogitBoost. These numbers are provided only for verification purposes: This small ensemble will likely be out-performed by a single well-tuned classifier (often a Random Forest or gradient boosted trees), and for many datasets the best classifier will be out-performed by a larger heterogeneous ensemble. To get more experience with datasink, try adding new classifiers until the ensemble beats the best classifier for some of these datasets.
The highest AUC is bolded for each dataset, and ties are broken by preferring the simplest method. Again, one must perform tests for statistical significance such as those presented in [Demšar2006] to draw sound conclusions about performance differences, and more complex methods often require similarly complex, large, real-world datasets to demonstrate their utility.
|Dataset||Instances||Mean||Stacking||Greedy selection (size)||Enhanced selection (size)|
|breast-cancer||286||0.683||0.67||0.704 (3)||0.691 (19)|
|breast-w||699||0.993||0.992||0.993 (5)||0.993 (17)|
|colic||368||0.872||0.883||0.874 (9)||0.875 (25)|
|credit-a||690||0.934||0.933||0.933 (9)||0.935 (25)|
|credit-g||1000||0.785||0.793||0.795 (7)||0.794 (25)|
|diabetes||768||0.836||0.84||0.842 (18)||0.841 (25)|
|haberman||306||0.662||0.662||0.672 (5)||0.676 (24)|
|heart-statlog||270||0.905||0.908||0.906 (6)||0.908 (24)|
|ionosphere||351||0.967||0.96||0.960 (9)||0.971 (21)|
|kr-vs-kp||3196||0.993||0.996||0.996 (7)||0.996 (26)|
|labor||57||0.979||1.000||0.980 (4)||0.980 (6)|
|liver-disorders||345||0.742||0.768||0.780 (7)||0.758 (21)|
|molecular-biology_promoters||106||0.977||0.97||0.970 (11)||0.956 (18)|
|mushroom||8124||1.000||1.000||1.000 (2)||1.000 (2)|
|sick||3772||0.973||0.963||0.977 (11)||0.978 (27)|
|sonar||208||0.873||0.895||0.876 (5)||0.905 (23)|
|spambase||4601||0.976||0.972||0.978 (15)||0.978 (27)|
|tic-tac-toe||958||0.982||0.996||0.997 (5)||0.997 (24)|
|vote||435||0.991||0.992||0.991 (9)||0.992 (23)|
If a particular class is extremely uncommon, the bagging process may (by chance) produce training splits that do not contain that class due to sampling with replacement. Bagging may not be appropriate in these scenarios and can be disabled in the properties file.
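How likely this is can be estimated with a short calculation. If a training split has n instances of which k belong to the rare class, a bootstrap sample of n draws misses the class entirely with probability (1 - k/n)^n, roughly e^(-k). The sketch below also estimates the chance that at least one of many bags is affected, assuming the bags are independent; both function names and the example sizes (n = 1000, 100 bags) are hypothetical.

```python
def prob_class_missing(n, k):
    """Chance that a bootstrap sample of n draws contains none of the k rare instances."""
    return (1 - k / n) ** n


def prob_any_bag_missing(n, k, bags):
    """Chance that at least one of `bags` independent bootstrap samples misses the class."""
    return 1 - (1 - prob_class_missing(n, k)) ** bags


n = 1000    # hypothetical training split size
bags = 100  # e.g. 10 folds x 10 bags
for k in (1, 2, 5, 10):
    print(k,
          round(prob_class_missing(n, k), 4),
          round(prob_any_bag_missing(n, k, bags), 4))
```

A single bag is unlikely to lose a class with five or more members, but across a hundred bags the risk becomes substantial, which is when disabling bagging is worth considering.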
|Property||Type||Default||Description|
|classifiersFilename||String||Required||File containing a list of full Java classnames and parameters, one per line, of classifiers to include in the ensemble. Lines beginning with a hash (#) are skipped.|
|inputFilename||String||Required||A Weka-formatted ARFF file containing features and class labels.|
|workingDir||String||Current directory||Location to store classifier outputs.|
|classAttribute||String||Required||Name of the ARFF attribute containing class labels. Often the last attribute.|
|predictClassValue||String||Required||Value of the positive class for classAttribute. For example, this could be 1 for instances with 0/1 class labels, or tested_positive for the walkthrough dataset.|
|balanceTraining||Boolean||true||Balance the class distribution of the training set inside each cross validation fold after any resampling, using Weka's SpreadSubsample filter with
|balanceTest||Boolean||false||Identical to balanceTraining for the test set. Note that best practice for non-uniform class distributions is to balance the training set only, then evaluate against the natural class distribution of the test set [Weiss2003,Tan2013].|
|foldCount||Integer||Required*||Number of cross validation folds to use. This or foldAttribute must be specified.|
|foldAttribute||String||Required*||Name of the ARFF attribute containing values for leave-one-value-out cross validation. This or foldCount must be specified.|
|nestedFoldCount||Integer||Required||Number of nested cross validation folds to use for each cross validated training set. Greatly increases execution time.|
|bagCount||Integer||Required||Number of resampled versions of each base classifier to generate. Greatly increases execution time. A value of 0 disables resampling.|
|useCluster||Boolean||false||Submit jobs to a distributed computing cluster (using
|writeModels||Boolean||false||Save compressed, serialized models for each classifier/fold/bag combination to disk. Substantially increases disk usage.|
References
- Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140. doi:10.1023/A:1018054314350
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. doi:10.1023/A:1010933404324
- Caruana, R., Niculescu-Mizil, A., Crew, G., & Ksikes, A. (2004). Ensemble Selection from Libraries of Models. In Proceedings of the 21st International Conference on Machine Learning (pp. 18–26). doi:10.1145/1015330.1015432
- Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7(Jan), 1–30.
- Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1), 10–18. doi:10.1145/1656274.1656278
- McKinney, W. (2012). Python for Data Analysis. O’Reilly.
- Pedregosa, F., Varoquaux, G., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
- Tan, P.-N., Steinbach, M., & Kumar, V. (2013). Introduction to Data Mining (2nd ed.). Addison-Wesley.
- Weiss, G. M., & Provost, F. J. (2003). Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19(1), 315–354.
- Whalen, S., & Pandey, G. (2013). A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics. In Proceedings of the 13th International Conference on Data Mining (pp. 807–816). doi:10.1109/ICDM.2013.21
- Wolpert, D. H. (1992). Stacked Generalization. Neural Networks, 5(2), 241–259. doi:10.1016/S0893-6080(05)80023-1