SMILE-WIDE is a Bayesian network library. Initially, SMILE-WIDE is a version of the well-known SMILE library, augmented With Integrated Distributed Execution, which allows it to run on very large datasets. As SMILE-WIDE is developed, its BigData-specific capabilities will extend beyond the standard Bayesian network interfaces.
For programmers, SMILE-WIDE is a .jar library which you can include in your software. For end users, it is also integrated into Hive as a UDF that provides posterior probabilities of missing values, given the observed values for each instance.
SMILE-WIDE is written in Java, using the underlying SMILE library, which is written in C++. It uses Hadoop for inference on large data.
- How to build the software
- How to run an example SMILE-WIDE Hadoop job
- How to test Hive integration
- Problems and solutions
- How to generate Javadoc API documentation
Please contact the authors with any questions or problems:
SMILE-WIDE is an Eclipse project configured to use Maven. All external dependencies are pulled from the appropriate Maven repositories. The code can be built from the IDE or directly from the command line. The basic build can be started with the following command:
mvn clean package
This creates two jars in the target directory and copies the appropriate native library to the target/lib directory.
The binary files are:
- smile-wide-0.0.1-SNAPSHOT.jar: contains the SMILE-WIDE code.
- smile-wide-0.0.1-SNAPSHOT-job.jar: contains the SMILE-WIDE code and the core SMILE jar in its lib subdirectory. This makes running SMILE-WIDE-based Hadoop jobs easier, because Hadoop will automatically add the SMILE jar to the classpath on the machines in the cluster.
- libjsmile.so, libjsmile.jnilib, or jsmile.dll: the JNI library containing the C++ SMILE code.
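A quick way to confirm that the build produced the files listed above is a small shell check. The sketch below is not part of SMILE-WIDE: the function name check_smile_artifacts is hypothetical, and it assumes the default target layout and the 0.0.1-SNAPSHOT version shown in this document.

```shell
# Hypothetical helper, not part of SMILE-WIDE: verify the build outputs.
# Usage: check_smile_artifacts [target-dir]   (defaults to ./target)
check_smile_artifacts() {
  dir="${1:-target}"
  status=0

  # Both jars are expected directly in the target directory.
  for jar in smile-wide-0.0.1-SNAPSHOT.jar smile-wide-0.0.1-SNAPSHOT-job.jar; do
    if [ ! -f "$dir/$jar" ]; then
      echo "missing: $dir/$jar"
      status=1
    fi
  done

  # Which native library is present depends on the platform built for.
  found=""
  for lib in libjsmile.so libjsmile.jnilib jsmile.dll; do
    if [ -f "$dir/lib/$lib" ]; then
      found="$lib"
    fi
  done
  if [ -z "$found" ]; then
    echo "missing: native library in $dir/lib"
    status=1
  else
    echo "native library: $dir/lib/$found"
  fi

  return "$status"
}
```

Running check_smile_artifacts from the project root after the build returns a non-zero status if either jar or the native library is absent.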
It is possible to build for a platform other than the one running Maven by overriding the smile.native.platform variable. For example, when building for Hadoop on a 64-bit Linux cluster with Maven or Eclipse running on OS X, the command should be extended to:
mvn clean package -Dsmile.native.platform=linux64 -Dmaven.test.skip=true
The example below executes a Hadoop job loop which learns the parameters of probability distributions for the kiva.xdsl network. Note that the job jar file contains the SMILE jar in its lib directory. However, the native library must be explicitly added to the job's distributed cache with Hadoop's -files option. Additionally, because EM requires local access to SMILE functionality, the .so file should also be copied to the $HADOOP_BIN/native directory.
hadoop jar smile-wide-0.0.1-SNAPSHOT-job.jar smile.wide.algorithms.em.RunDistributedEM \
-files em-tmp.xdsl,libjsmile.so \
-D mapred.max.split.size=250000 -D mapred.reduce.tasks=12 \
em.initial.netfile=kiva.xdsl em.work.netfile=em-tmp.xdsl \
em.data.file=pitt/kiva500k.txt em.stat.file=pitt/em-out \
em.separator=9 em.local.stat.file=em-local.txt
The file kiva.xdsl is located in the project's input directory; pitt/kiva500k.txt is in the compute cluster's HDFS. The output of the job is the local file named em-tmp.xdsl, containing the modified kiva.xdsl network with learned parameters.
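Copying the native library into $HADOOP_BIN/native before launching the job can be scripted. The sketch below is an illustration only: stage_native_lib is a hypothetical function name, and the default source path assumes a Linux build that left libjsmile.so in target/lib.

```shell
# Hypothetical helper, not part of SMILE-WIDE: copy the JNI library to the
# directory where the local (client-side) part of the EM job looks for it.
# Usage: stage_native_lib [library] [destination-dir]
stage_native_lib() {
  src="${1:-target/lib/libjsmile.so}"
  dest="${2:-${HADOOP_BIN:-}/native}"

  if [ ! -f "$src" ]; then
    echo "native library not found: $src" >&2
    return 1
  fi
  if [ ! -d "$dest" ]; then
    echo "destination directory not found: $dest" >&2
    return 1
  fi

  cp "$src" "$dest/" && echo "copied $src to $dest"
}
```

The -files option in the job command refers to libjsmile.so by a relative path, so the library presumably also needs to be present in the directory the job is launched from.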
To test the Hive UDFs, execute the normal Maven package build followed by runscripts/hivePosteriors.sh. This creates the target/hive-test directory, containing all the files required for the UDF test. The command to run the test is:
hive -f hivePosteriors.q
Hive will import a small data file and perform four queries, each calling into SMILE-WIDE UDFs.
This exception is caused by a missing native library. The platform-specific library is placed in target/lib during the Maven build, but Hadoop and Hive must be made aware of its existence. This is done with Hadoop's -files option or Hive's 'ADD FILE'. Some SMILE-WIDE algorithms contain a nontrivial local component running within the Hadoop client JVM. In such cases the shared library should also be added to the $HADOOP_BIN/native directory.
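For the Hive case, one possible way to apply 'ADD FILE' is to generate the statement and pass it to Hive as an initialization script. This is a sketch under assumptions: hive_add_file_stmt and the file name add-native.q are hypothetical, and the test script hivePosteriors.q may already contain an equivalent statement.

```shell
# Hypothetical helper, not part of SMILE-WIDE: print a Hive statement that
# registers the native library as a session resource.
# Usage: hive_add_file_stmt [library-path]
hive_add_file_stmt() {
  lib="${1:-libjsmile.so}"
  printf 'ADD FILE %s;\n' "$lib"
}

# Example use (file name add-native.q is an assumption):
#   hive_add_file_stmt target/lib/libjsmile.so > add-native.q
#   hive -i add-native.q -f hivePosteriors.q
```

The -i option of the Hive command-line client runs the given file before the main script, which keeps the library registration separate from the queries themselves.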
The SMILE-WIDE API Javadoc documentation can be generated from the command line. With 'javadoc' on the path, issue the following command:
javadoc @options.javadoc.text
This will generate HTML documentation in the 'javadocs' directory.