datasink: A Pipeline for Large-Scale Heterogeneous Ensemble Learning

Datasink is a customizable pipeline for generating diverse ensembles of heterogeneous classifiers, as well as the accompanying metadata needed for ensemble learning approaches utilizing ensemble diversity for improved performance. It also fairly evaluates the performance of several ensemble learning methods including greedy selection, enhanced selection [Caruana2004], and stacked generalization (stacking) [Wolpert1992]. Though other tools exist, we are unaware of a similarly modular, scalable pipeline designed for large-scale ensemble learning. Datasink was developed to support research by Sean Whalen and Gaurav Pandey (see [Whalen2013]) with the support of the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai.

Datasink is designed for generating extremely large ensembles (taking days or weeks to generate) and thus consists of an initial data generation phase tuned for multicore and distributed computing environments. The output is a set of compressed CSV files containing the class distribution produced by each classifier that serves as input to a later ensemble learning phase.

Data is generated by a customized pipeline built around the Java-based Weka machine learning package [Hall2009]. For simplicity and extensibility, the pipeline uses an interpreted variant of Java called Groovy that calls compiled Weka code without performance penalty. Thus the data generation prerequisites are:

Ensemble learning is implemented in Python using the popular pandas/scikit-learn analytics stack [McKinney2012,Pedregosa2011]:

Older versions may work for some packages if current versions are not available.

There is no installer for datasink. However, the installation of the prerequisites and their dependencies can usually be handled by the package manager for your operating system. We assume comfort with command line execution and provide setup instructions for Ubuntu Linux and OS X below.

This README details the setup and use of datasink via several examples but is not intended as a general tutorial on ensemble learning, version control, or particular libraries.

Setup option 1: Ubuntu Linux

Ubuntu and other Debian-based Linux distributions use the apt-get command for installing packages and their dependencies. See the howto or run man apt-get for more details.

To install the prerequisites for datasink, run:

sudo apt-get -y install groovy cython python-numpy python-scipy python-pip
sudo pip install -U pandas scikit-learn

A suitable version of Weka is unfortunately not bundled with Ubuntu, so run the following:

sudo apt-get -y install curl unzip
curl -O -L
sudo cp weka-3-7-10/weka.jar /usr/share/java

Setup option 2: Ubuntu virtual machine

This option downloads and runs Ubuntu 13.04 64-bit under the VirtualBox virtual machine, incurring some performance penalty but allowing you to evaluate datasink in a completely self-contained, pre-configured environment. Skip this section if you aren't familiar with virtual machines.

First install the following:

then run:

mkdir dvm; cd dvm
vagrant init
vagrant box add base
vagrant up
vagrant ssh

This will download a fresh Ubuntu disk image and start up the virtual machine, taking several minutes to complete and leaving you with a login prompt inside the virtual machine. Proceed with the instructions from Option 1 to install datasink inside this virtual machine, and type exit to return to your host OS when desired. The virtual machine can be brought down using vagrant halt from the host command line.

Due to the performance penalty of VMs, extended use of this option is not recommended; it is provided primarily for self-contained evaluation purposes. Performance can be improved substantially by increasing the number of CPU cores and RAM granted to the VM. See the Vagrant documentation for details.

Thanks to Olivier Grisel for the original document these instructions are based on.

Setup option 3: OS X

There are several options for installing the prerequisites under OS X. Pre-built Python distributions such as Enthought contain the necessary Python components and OS X comes bundled with a suitable version of Java. Advanced users can simply install a binary version of Groovy and Weka from their respective websites, place the Weka JAR file in their CLASSPATH, and begin generating ensembles.

Other users may wish to use the MacPorts project to install the prerequisites and their dependencies in a self-contained directory that can easily be upgraded or removed later if desired. This option requires Apple's free Xcode developer tools, the optional Xcode command line tools installable from the developer tools GUI, and the MacPorts software for your version of OS X:

MacPorts downloads the required packages and their dependencies, but must compile from source if binaries are not available for your system; this can take hours for a fresh MacPorts installation as there are several dozen large packages to compile. Run the following to update MacPorts and install the prerequisites:

sudo port selfupdate
sudo port install groovy py27-cython py27-pandas py27-scikit-learn
sudo port select --set python python27

A suitable version of Weka is unfortunately not bundled with MacPorts, so run the following:

curl -O -L
sudo cp weka-3-7-10/weka.jar /opt/local/share/java

Obtaining the source

The latest source code can be obtained by cloning the public git repository using the following from the command line:

git clone

This will create a datasink subdirectory in your working directory containing the source code. The git program comes bundled with recent versions of OS X; it can be installed under Ubuntu using sudo apt-get -y install git. Updates can be obtained by running git pull from the datasink subdirectory.

Compiling the Cython module

Several functions are accelerated by Cython and must first be compiled by running make from the git repository directory.

Setting environment variables

Java must be told where Weka is located and how much RAM to use by modifying the CLASSPATH and JAVA_OPTS environment variables. A simple way to set these variables is to add the following to your shell's login script for Ubuntu:

export CLASSPATH=$CLASSPATH:/usr/share/java/weka.jar
export JAVA_OPTS="-Xmx4g"

or for OS X:

export CLASSPATH=$CLASSPATH:/opt/local/share/java/weka.jar
export JAVA_OPTS="-Xmx4g"

The above is Bash syntax and allows Weka to use up to 4 gigs of RAM; adjust accordingly for your setup.

The groovy executable must also be somewhere in your search path (and already is if using the Ubuntu or MacPorts instructions above). If groovy was manually installed, for example under $HOME/groovy on a cluster, add the following to your login script:

export PATH=$PATH:$HOME/groovy/bin
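Before starting a long run, it can help to confirm that the three settings above are in place. The following is a minimal sketch, not part of datasink: the function name check_datasink_env is hypothetical, and it assumes Python 3 and that the Weka JAR is named weka.jar as in the instructions above.

```python
import os


def check_datasink_env(environ, path_exists=os.path.exists):
    """Return a list of human-readable problems found in the given environment."""
    problems = []
    # CLASSPATH entries are separated by os.pathsep (':' on Linux and OS X).
    classpath = [p for p in environ.get("CLASSPATH", "").split(os.pathsep) if p]
    if not any(os.path.basename(p) == "weka.jar" and path_exists(p) for p in classpath):
        problems.append("weka.jar not found on CLASSPATH")
    # JAVA_OPTS should carry a maximum heap size such as -Xmx4g.
    if "-Xmx" not in environ.get("JAVA_OPTS", ""):
        problems.append("JAVA_OPTS does not set -Xmx")
    # groovy must exist in one of the PATH directories.
    path_dirs = [d for d in environ.get("PATH", "").split(os.pathsep) if d]
    if not any(path_exists(os.path.join(d, "groovy")) for d in path_dirs):
        problems.append("groovy not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_datasink_env(os.environ):
        print(problem)
```

An empty result means all three checks passed; it does not verify that the Weka version itself is suitable.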

Quick setup: Docker image

With a Docker image, users don't need to worry about binaries, dependencies, and environment variables, since all of these are configured in the Dockerfile. A container with all components can easily be created from the existing Docker image.

As a prerequisite, Docker must be installed. You can then pull the image from Docker Hub and start a container by running:

docker pull alexwang0106/datasink:2.2
docker run -ti alexwang0106/datasink:2.2 /bin/bash

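Alternatively, build the datasink Docker image yourself by downloading the Dockerfile and running the following, where IMAGE_NAME is a tag of your choosing and the path is the directory containing the Dockerfile:

docker build -t IMAGE_NAME /PATH/TO/DOCKERFILE_DIRECTORY
docker run -ti IMAGE_NAME /bin/bash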

This opens an interactive shell inside the container. You can copy your own data into the container using:

docker cp foo.txt YOUR_CONTAINER:/path/to/dir/

You're finally ready to set up and construct an ensemble!

Walkthrough: Building an ensemble

Ensemble generation requires 3 files, ideally inside a self-contained project directory:

  • Training data in ARFF format
  • A file listing the classifiers to train
  • A file pointing to the above files and configuring other pipeline settings

To begin, we create a project directory in our home directory and download an example dataset from the command line:

mkdir ~/diabetes; cd ~/diabetes
curl -O

Next we create a file to configure our pipeline:

cat > << EOF
classifiersFilename = classifiers.txt
inputFilename = diabetes.arff
classAttribute = class
predictClassValue = tested_positive
balanceTraining = false
foldCount = 10
nestedFoldCount = 10
bagCount = 10

Finally, we create classifiers.txt containing the Weka classifiers and associated parameters we want included in the ensemble:

cat > classifiers.txt << EOF
weka.classifiers.bayes.NaiveBayes -D
weka.classifiers.functions.SGD -F 1

Note that weka.classifiers.meta.LogitBoost is preceded by a comment marker (#); such lines are skipped and thus excluded from ensemble generation. We'll leave LogitBoost commented out for now, and later see how its inclusion changes ensemble performance.
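The comment convention can be illustrated in a few lines of Python. This is only a sketch of the rule described above, not datasink's own parser; the function name read_classifiers is hypothetical, and the LogitBoost line is shown without options because they are not given here.

```python
def read_classifiers(lines):
    """Return the active classifier specifications, skipping comments and blanks."""
    active = []
    for line in lines:
        line = line.strip()
        # Lines beginning with '#' are excluded from ensemble generation.
        if not line or line.startswith("#"):
            continue
        active.append(line)
    return active


example = [
    "weka.classifiers.bayes.NaiveBayes -D",
    "weka.classifiers.functions.SGD -F 1",
    "#weka.classifiers.meta.LogitBoost",
]
for spec in read_classifiers(example):
    print(spec)
```

Only the NaiveBayes and SGD lines survive the filter, so only those two classifier types would be trained.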

As specified in the file, the data is first divided into 10 folds of independent training and test splits for cross validation. Each training split is resampled with replacement 10 times (a process called bagging [Breiman1996]), and nested cross validation is performed on each of these resampled training splits to produce the data necessary for ensemble techniques.
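The nesting of folds, bags, and nested folds can be sketched as follows. This is an illustrative assumption about the structure, not datasink's implementation: it uses plain index lists, ignores stratification and class balancing, assumes at least one bag, and all function names are hypothetical.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle indices and yield (train, test) lists for k-fold cross validation."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def resampling_plan(n, fold_count, bag_count, nested_fold_count, seed=0):
    """Yield (fold, bag, nested_fold, nested_train, nested_test) tuples."""
    rng = random.Random(seed)
    for fold, (train, _test) in enumerate(k_fold_indices(range(n), fold_count, rng)):
        for bag in range(bag_count):
            # Bagging: draw len(train) instances from the training split with replacement.
            bagged = [rng.choice(train) for _ in train]
            nested = k_fold_indices(bagged, nested_fold_count, rng)
            for nested_fold, (n_train, n_test) in enumerate(nested):
                yield fold, bag, nested_fold, n_train, n_test


# 768 instances with foldCount = nestedFoldCount = bagCount = 10.
plan = list(resampling_plan(768, 10, 10, 10))
print(len(plan))
```

With the settings above, each classifier type is trained on 10 x 10 x 10 nested splits in addition to the outer folds, which is why nestedFoldCount and bagCount dominate the running time.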

Before generating the ensemble, we first examine the non-ensemble performance of each base classifier using 10-fold cross validation in Weka. Several performance metrics are produced by Weka, but datasink focuses on the area under the receiver operating characteristic (ROC) curve (AUC) since it is well-suited to imbalanced class distributions that often occur with real data. However, any metric can be computed using the CSV files generated by the analysis scripts.

weka.classifiers.bayes.NaiveBayes -D

TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
0.816    0.392    0.795      0.816    0.806      0.429    0.806     0.882     tested_negative
0.608    0.184    0.639      0.608    0.623      0.429    0.806     0.676     tested_positive

weka.classifiers.functions.SGD -F 1

TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
0.888    0.444    0.789      0.888    0.835      0.478    0.832     0.892     tested_negative
0.556    0.112    0.727      0.556    0.630      0.478    0.832     0.713     tested_positive
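The ROC Area column above is the AUC. As a reminder of what the number means, here is a small, dependency-free sketch that computes it from the Mann-Whitney U statistic: the AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The function name auc is our own; in practice sklearn.metrics.roc_auc_score does the same job.

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: sequence of 0/1 values (1 = positive class).
    scores: predicted probability of the positive class.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined unless both classes are present")
    # Assign average 1-based ranks to groups of tied scores.
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(1 for _, y in pairs[i:j] if y == 1)
        i = j
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Because the statistic depends only on the ranking of the scores, it is unaffected by the ratio of positives to negatives, which is what makes it convenient for imbalanced data.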

Next we construct an ensemble of 20 Naive Bayes (NB) and Logistic Regression (LR, trained using Stochastic Gradient Descent) base classifiers; recall that each classifier type is bagged 10 times. This takes ~3-4 minutes on a modern 4-core system with 8 gigs of RAM, and the time should decrease roughly linearly with the number of cores:

cd ~/datasink
python ~/diabetes

Because the code is architected for multicore and distributed environments, many processes are spawned and each writes its output to a unique file. These files must first be merged:

python ~/diabetes

Ensemble methods are then applied:

python ~/diabetes
python ~/diabetes standard
0.837 20
python ~/diabetes greedy
0.841 15
python ~/diabetes enhanced
0.839 16

The output after each script gives the AUC calculated over all cross validation folds as well as the average size of the ensemble when applicable. The performance of these methods can vary greatly depending on the dataset and in particular the number of training examples, with simpler methods typically performing better for smaller datasets.
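To make the idea of greedy selection concrete, here is a minimal sketch of forward selection over mean-aggregated predictions. It is an assumption-laden illustration rather than datasink's code: the function names are hypothetical, selection is without replacement, and it stops as soon as no candidate improves the AUC. In a fair evaluation the selection would be driven by validation predictions (such as those from nested cross validation), never by the test fold. The enhanced variant of [Caruana2004] is not shown.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_scores(predictions, members):
    """Average the score columns named in members, row by row."""
    n_rows = len(predictions[members[0]])
    return [sum(predictions[m][i] for m in members) / len(members)
            for i in range(n_rows)]


def greedy_selection(predictions, labels):
    """Repeatedly add the classifier that maximizes the ensemble's AUC."""
    remaining = sorted(predictions)
    selected, best = [], float("-inf")
    while remaining:
        score, candidate = max(
            (auc(labels, mean_scores(predictions, selected + [c])), c)
            for c in remaining
        )
        if score <= best:
            break  # no remaining classifier improves the ensemble
        selected.append(candidate)
        remaining.remove(candidate)
        best = score
    return selected, best


# Toy validation predictions for three hypothetical classifiers.
labels = [0, 0, 1, 1]
predictions = {
    "a": [0.1, 0.4, 0.35, 0.8],
    "b": [0.2, 0.3, 0.6, 0.7],
    "c": [0.9, 0.8, 0.2, 0.1],
}
print(greedy_selection(predictions, labels))
```

The size of the returned list corresponds to the ensemble size reported next to the AUC for the selection methods.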

Growing the ensemble

Let's add LogitBoost to the ensemble, looking first at its base performance in Weka:


TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
0.846    0.455    0.776      0.846    0.810      0.410    0.810     0.891     tested_negative
0.545    0.154    0.655      0.545    0.595      0.410    0.810     0.668     tested_positive

We add this classifier to the ensemble by editing classifiers.txt and using comment markers (#) as discussed above to exclude the previous classifiers and include LogitBoost, or execute the following command as a shortcut:

cat > ~/diabetes/classifiers.txt << EOF
#weka.classifiers.bayes.NaiveBayes -D
#weka.classifiers.functions.SGD -F 1

Alternately, leave all lines uncommented to see how the ensemble generation script only produces output for LogitBoost as it recognizes NB and LR are already generated. Now create the LogitBoost classifiers (~2 mins), combine with the previous NB and LR output, and run the ensemble methods:

python ~/diabetes
python ~/diabetes
python ~/diabetes
python ~/diabetes standard
0.841 30
python ~/diabetes greedy
0.843 17
python ~/diabetes enhanced
0.840 24

Note that the performance of mean-aggregated predictions remains unchanged, while stacking and selection methods get a small boost.

The importance of bagging

Compare the performance of the Naive Bayes base classifier (0.806) to its bagged performance (0.8254) using mean aggregation: Bagging provides a non-trivial boost. Run cd ~/diabetes and try the following under ipython to see how this simple aggregation method works. We first create a pandas DataFrame object indexed by a unique ID and class label for each example:

from glob import glob
from pandas import concat, read_csv
from sklearn.metrics import roc_auc_score

df = concat([read_csv(_, compression = 'gzip', index_col = [0, 1]) \
	for _ in glob('predictions-*.csv.gz')])

The probability assigned to the positive class by each resampled classifier is stored across columns, shown here with a number appended to the classifier name for each bagged version:

print df.columns
Index([NaiveBayes.0, NaiveBayes.1, NaiveBayes.2, NaiveBayes.3, NaiveBayes.4, NaiveBayes.5, NaiveBayes.6, NaiveBayes.7, NaiveBayes.8, NaiveBayes.9, SGD.0, SGD.1, SGD.2, SGD.3, SGD.4, SGD.5, SGD.6, SGD.7, SGD.8, SGD.9, LogitBoost.0, LogitBoost.1, LogitBoost.2, LogitBoost.3, LogitBoost.4, LogitBoost.5, LogitBoost.6, LogitBoost.7, LogitBoost.8, LogitBoost.9], dtype=object)

Here we grab the class labels from the index, take the row mean of the first 10 columns corresponding to Naive Bayes, and calculate the AUC:

labels = df.index.get_level_values(1).values
roc_auc_score(labels, df.iloc[:, :10].mean(axis = 1))

A similar increase compared to the base classifier is observed for LogitBoost:

roc_auc_score(labels, df.iloc[:, 20:30].mean(axis = 1))

but a small dip occurs for Logistic Regression:

roc_auc_score(labels, df.iloc[:, 10:20].mean(axis = 1))

Recall the base LR performance is 0.832 which is already quite close to the ensemble's performance. Resampling the training data with replacement to create diversity excludes approximately one third of the training instances due to chance. Given its decreased performance, the loss of these training examples is more detrimental to LR than what it gains from ensembling. However, the dip is relatively small and resampling creates the diversity necessary to increase the overall performance of the ensemble when other classifiers are included.
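The "approximately one third" figure follows from the bootstrap itself: when n instances are drawn with replacement from n, a given instance is missed with probability (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows. The sketch below checks this both analytically and by simulation; the training split size of 691 is our assumption for a 10-fold split of the 768 diabetes instances, and the function names are hypothetical.

```python
import random


def expected_excluded_fraction(n):
    """Probability that a given instance is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def simulated_excluded_fraction(n, trials=200, seed=0):
    """Average fraction of instances missing from simulated bootstrap samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        drawn = {rng.randrange(n) for _ in range(n)}
        total += 1.0 - len(drawn) / float(n)
    return total / trials


n = 691  # assumed size of one training split
print(expected_excluded_fraction(n))
print(simulated_excluded_fraction(n))
```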

Heterogeneous vs. homogeneous ensembles

The advantage of using heterogeneous ensembles is clear when we compare their performance to state-of-the-art homogeneous ensemble techniques such as Random Forests. To maximize performance of the forest, we increase the number of trees to 500 (a parameter that doesn't overfit [Breiman2001]) and reduce the maximum tree depth to prevent overfitting on this small dataset.

weka.classifiers.trees.RandomForest -I 500 -depth 5

TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
0.864    0.429    0.790      0.864    0.825      0.458    0.833     0.904     tested_negative
0.571    0.136    0.692      0.571    0.626      0.458    0.833     0.700     tested_positive

Compare the 0.833 AUC of this homogeneous ensemble to the 0.843 achieved above using greedy selection with only 3 classifier types.

Another (condensed) example

Let's see if these trends hold for another dataset:

mkdir ~/liver; cd ~/liver
curl -O

cat > << EOF
classifiersFilename = classifiers.txt
inputFilename = liver-disorders.arff
classAttribute = selector
predictClassValue = 2
balanceTraining = false
foldCount = 10
nestedFoldCount = 10
bagCount = 10

cat > classifiers.txt << EOF

cd ~/datasink
python ~/liver
python ~/liver
python ~/liver
python ~/liver greedy
python ~/liver enhanced
python ~/liver standard

Method                AUC    Notes
AdaBoostM1            0.684  DecisionStump base learner
IBk                   0.637
JRip                  0.653
MultilayerPerceptron  0.742
RandomForest          0.768  200 trees, excluded from ensemble
mean                  0.772
greedy                0.772
enhanced              0.764
stacking              0.775  RandomForest stacker, max_depth = 5

This time we use a different set of classifiers and give the performance of a random forest for reference. Note that 40 bagged heterogeneous classifiers outperform a random forest of 200 trees for three out of four aggregation methods. Though enhanced ensemble selection has not performed as well as simpler methods in these examples, it tends to perform similarly to stacking for larger datasets where greedy selection begins to fall behind. It is important to emphasize that differences in performance should be evaluated for statistical significance; see [Demšar2006] for a review of non-parametric comparison methods. In a paper currently under review, we find statistically significant differences between heterogeneous ensembles and the best base classifier (homogeneous ensembles) for several complex, real-world datasets.

UCI benchmarks

The UCI machine learning repository provides datasets for benchmarking machine learning algorithms. Below is the AUC for several UCI binary classification datasets using 3 classifier types discussed above: NB, LR, and LogitBoost. These numbers are provided only for verification purposes: This small ensemble will likely be out-performed by a single well-tuned classifier (often a Random Forest or gradient boosted trees), and for many datasets the best classifier will be out-performed by a larger heterogeneous ensemble. To get more experience with datasink, try adding new classifiers until the ensemble beats the best classifier for some of these datasets.

The highest AUC is bolded for each dataset, and ties are broken by preferring the simplest method. Again, one must perform tests for statistical significance such as those presented in [Demšar2006] to draw sound conclusions about performance differences, and more complex methods often require similarly complex, large, real-world datasets to demonstrate their utility.
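As one simple illustration of such a test, the sketch below applies a two-sided sign test to the Mean and Greedy columns of the table below. The sign test is among the non-parametric methods reviewed in [Demšar2006]; dropping ties is a simplification made here for brevity, and the function name sign_test is our own. It assumes Python 3.8 or later for math.comb. Treat the result as a worked example of the procedure, not as a conclusion about the methods.

```python
from math import comb


def sign_test(a, b):
    """Two-sided sign test on paired scores; ties are dropped (a simplification)."""
    wins_a = sum(1 for x, y in zip(a, b) if x > y)
    wins_b = sum(1 for x, y in zip(a, b) if y > x)
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2.0 ** n
    return wins_a, wins_b, min(1.0, 2.0 * tail)


# Mean and Greedy AUC columns copied from the table, in row order.
mean_auc = [0.683, 0.993, 0.872, 0.934, 0.785, 0.836, 0.662, 0.905, 0.967, 0.993,
            0.979, 0.742, 0.977, 1.000, 0.973, 0.873, 0.976, 0.982, 0.991]
greedy_auc = [0.704, 0.993, 0.874, 0.933, 0.795, 0.842, 0.672, 0.906, 0.960, 0.996,
              0.980, 0.780, 0.970, 1.000, 0.977, 0.876, 0.978, 0.997, 0.991]

mean_wins, greedy_wins, p_value = sign_test(mean_auc, greedy_auc)
print(mean_wins, greedy_wins, p_value)
```

The sign test ignores the size of each difference; the Wilcoxon signed-rank test, also covered in [Demšar2006], takes magnitudes into account.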

Dataset                      Instances  Mean   Stacking  Greedy (size)  Enhanced (size)
breast-cancer                286        0.683  0.67      0.704 (3)      0.691 (19)
breast-w                     699        0.993  0.992     0.993 (5)      0.993 (17)
colic                        368        0.872  0.883     0.874 (9)      0.875 (25)
credit-a                     690        0.934  0.933     0.933 (9)      0.935 (25)
credit-g                     1000       0.785  0.793     0.795 (7)      0.794 (25)
diabetes                     768        0.836  0.84      0.842 (18)     0.841 (25)
haberman                     306        0.662  0.662     0.672 (5)      0.676 (24)
heart-statlog                270        0.905  0.908     0.906 (6)      0.908 (24)
ionosphere                   351        0.967  0.96      0.960 (9)      0.971 (21)
kr-vs-kp                     3196       0.993  0.996     0.996 (7)      0.996 (26)
labor                        57         0.979  1.000     0.980 (4)      0.980 (6)
liver-disorders              345        0.742  0.768     0.780 (7)      0.758 (21)
molecular-biology_promoters  106        0.977  0.97      0.970 (11)     0.956 (18)
mushroom                     8124       1.000  1.000     1.000 (2)      1.000 (2)
sick                         3772       0.973  0.963     0.977 (11)     0.978 (27)
sonar                        208        0.873  0.895     0.876 (5)      0.905 (23)
spambase                     4601       0.976  0.972     0.978 (15)     0.978 (27)
tic-tac-toe                  958        0.982  0.996     0.997 (5)      0.997 (24)
vote                         435        0.991  0.992     0.991 (9)      0.992 (23)


If a particular class is extremely uncommon, the bagging process may (by chance) produce training splits that do not contain that class due to sampling with replacement. Bagging may not be appropriate in these scenarios and can be disabled in the properties file.
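The risk can be quantified with a short calculation. If a training split of n instances contains only k examples of a class, a bootstrap sample of size n misses all of them with probability (1 - k/n)^n, and the chance that at least one of several bags misses the class grows with the number of bags. The sketch below assumes uniform sampling with replacement and independent bags; the function names are hypothetical.

```python
def prob_class_missing(n, k):
    """Probability that a bootstrap sample of size n contains none of the
    k instances of a class, when drawing uniformly with replacement."""
    return (1.0 - float(k) / n) ** n


def prob_any_bag_missing(n, k, bags):
    """Probability that at least one of several independent bags lacks the class."""
    return 1.0 - (1.0 - prob_class_missing(n, k)) ** bags


for k in (1, 2, 5, 10):
    print(k, prob_class_missing(1000, k), prob_any_bag_missing(1000, k, 10))
```

For large n the first probability is close to exp(-k), so a class with only a handful of examples is at real risk once many bags are drawn.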


Property             Type     Default            Description
classifiersFilename  String   Required           File containing a list of full Java classnames and parameters, one per line, of classifiers to include in the ensemble. Lines beginning with a hash (#) are skipped.
inputFilename        String   Required           A Weka-formatted ARFF file containing features and class labels.
workingDir           String   Current directory  Location to store classifier outputs.
classAttribute       String   Required           Name of the ARFF attribute containing class labels. Often the last attribute.
predictClassValue    String   Required           Value of the positive class for classAttribute. For example, this could be 1 for instances with 0/1 class labels, or tested_positive for the walkthrough dataset.
balanceTraining      Boolean  true               Balance the class distribution of the training set inside each cross validation fold after any resampling, using Weka's SpreadSubsample filter with -M 1.
balanceTest          Boolean  false              Identical to balanceTraining for the test set. Note that best practice for non-uniform class distributions is to balance the training set only, then evaluate against the natural class distribution of the test set [Weiss2003,Tan2013].
foldCount            Integer  Required*          Number of cross validation folds to use. This or foldAttribute must be specified.
foldAttribute        String   Required*          Name of the ARFF attribute containing values for leave-one-value-out cross validation. This or foldCount must be specified.
nestedFoldCount      Integer  Required           Number of nested cross validation folds to use for each cross validated training set. Greatly increases execution time.
bagCount             Integer  Required           Number of resampled versions of each base classifier to generate. Greatly increases execution time. A value of 0 disables resampling.
useCluster           Boolean  false              Submit jobs to a distributed computing cluster (using qsub, for example) instead of spawning processes on the local machine.
writeModels          Boolean  false              Save compressed, serialized models for each classifier/fold/bag combination to disk. Substantially increases disk usage.
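A properties file can be sanity-checked against the table above before launching a long run. The following is a hedged sketch, not datasink's own loader: it assumes simple "key = value" lines, treats lines starting with # as comments (an assumption about the file format), and the names parse_properties and validate are hypothetical.

```python
REQUIRED = ("classifiersFilename", "inputFilename", "classAttribute",
            "predictClassValue", "nestedFoldCount", "bagCount")
DEFAULTS = {"balanceTraining": "true", "balanceTest": "false",
            "useCluster": "false", "writeModels": "false"}


def parse_properties(text):
    """Parse 'key = value' lines; blank lines and '#' comments are ignored."""
    props = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def validate(props):
    """Return a list of problems based on the Required entries in the table."""
    problems = ["missing " + key for key in REQUIRED if key not in props]
    if "foldCount" not in props and "foldAttribute" not in props:
        problems.append("one of foldCount or foldAttribute must be set")
    return problems


walkthrough = """classifiersFilename = classifiers.txt
inputFilename = diabetes.arff
classAttribute = class
predictClassValue = tested_positive
balanceTraining = false
foldCount = 10
nestedFoldCount = 10
bagCount = 10
"""
print(validate(parse_properties(walkthrough)))
```

Values are kept as strings; converting them to booleans or integers is left out to keep the sketch short.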


  • Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140. doi:10.1023/A:1018054314350
  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. doi:10.1023/A:1010933404324
  • Caruana, R., Niculescu-Mizil, A., Crew, G., & Ksikes, A. (2004). Ensemble Selection from Libraries of Models. In Proceedings of the 21st International Conference on Machine Learning (pp. 18–26). doi:10.1145/1015330.1015432
  • Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7(Jan), 1–30.
  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1), 10–18. doi:10.1145/1656274.1656278
  • McKinney, W. (2012). Python for Data Analysis. O’Reilly.
  • Pedregosa, F., Varoquaux, G., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
  • Tan, P.-N., Steinbach, M., & Kumar, V. (2013). Introduction to Data Mining (2nd ed.). Addison-Wesley.
  • Weiss, G. M., & Provost, F. J. (2003). Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19(1), 315–354.
  • Whalen, S., & Pandey, G. (2013). A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics. In Proceedings of the 13th International Conference on Data Mining (pp. 807–816). doi:10.1109/ICDM.2013.21
  • Wolpert, D. H. (1992). Stacked Generalization. Neural Networks, 5(2), 241–259. doi:10.1016/S0893-6080(05)80023-1

