DeepDive Developer's Guide
This document describes useful information for those who want to modify the DeepDive infrastructure itself and contribute new code. Most of the content here is irrelevant to DeepDive users who just want to build a DeepDive application.
DeepDive project at GitHub
Nearly all DeepDive development activities happen over GitHub.
Branches and releases of DeepDive
- The master branch points to the latest code.
- We use Semantic Versioning.
- Every MAJOR.MINOR version has a maintenance branch that starts with v and ends with
- Every release is pointed to by a tag, e.g., 0.05-RELEASE. Since 0.6, release tag names start with v, followed by the MAJOR.MINOR.PATCH version. They usually point to a commit in the release maintenance branch.
- Any other branch points to someone's work in progress.
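The post-0.6 tag naming scheme described above can be checked mechanically. The following is a minimal sketch, not part of DeepDive itself: the function names is_release_tag and release_version are hypothetical, and the pattern is a simplification that only accepts plain numeric MAJOR.MINOR.PATCH versions without pre-release suffixes.

```shell
# is_release_tag NAME: succeed if NAME looks like vMAJOR.MINOR.PATCH.
# (Hypothetical helper; simplified pattern, no pre-release suffixes.)
is_release_tag() {
    # -E: extended regex, -x: the whole line must match, -q: no output
    printf '%s\n' "$1" | grep -Eqx 'v[0-9]+\.[0-9]+\.[0-9]+'
}

# release_version TAG: print "MAJOR MINOR PATCH" for a valid release tag.
release_version() {
    is_release_tag "$1" || return 1
    # Strip the leading "v", then turn the dots into spaces.
    printf '%s\n' "${1#v}" | tr '.' ' '
}
```

For instance, `is_release_tag v0.6.0` succeeds, while the older-style name 0.05-RELEASE is rejected because it lacks the v prefix and the three-part version.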
Contributing code to DeepDive
1. If you are part of the Hazy Research group, you can push your commits to a new branch, then create a Pull Request to master. Otherwise, you need to first fork our repository, then push your code to that fork to create a Pull Request.
2. If you already know who can review your code, assign that member to the Pull Request.
3. The reviewer leaves comments about the code, then asks you to address them.
4. You improve the code and push more commits to the branch for the Pull Request, then tell the reviewer to have another look. Remember that GitHub doesn't send out notifications (emails) unless you leave an actual comment on the Pull Request. The reviewer assumes the Pull Request is not ready for another look until you explicitly say so.
5. Steps 3-4 repeat until the reviewer says everything looks good.
6. The reviewer may merge your code into the master branch themselves or ask you to do so (if you have permission).
7. Your branch should be deleted after the Pull Request is merged or closed.
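The git side of pushing commits to a new branch might look like the sketch below. It is only an illustration: push_topic_branch is a hypothetical helper, the branch name my-fix is made up, and origin is assumed to be the remote you are allowed to push to (the main repository for group members, your fork otherwise). Opening the Pull Request itself still happens on GitHub.

```shell
# push_topic_branch BRANCH [REMOTE]: create BRANCH from the current commit
# and publish it on REMOTE (default: origin) so that a Pull Request can be
# opened from it. (Hypothetical helper, not part of DeepDive.)
push_topic_branch() {
    git checkout -b "$1" &&
    git push --set-upstream "${2:-origin}" "$1"
}

# Example (hypothetical branch name), run inside your clone:
#   push_topic_branch my-fix
# Later commits on the same branch only need a plain `git push`.
```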
DeepDive is written in several programming languages.
- Bash and jq are the main programming languages for generating SQL queries and shell scripts that run the actual data pipeline, defined by the user's extractors and inference rules.
- C++ is used for writing the high-performance Gibbs sampler that takes care of learning and inference of the model defined by the user's inference rules.
- C is used for the high-performance data router, mkmimo, which enables executing many UDF processes in parallel efficiently.
- Python is the main language we use for the UDFs in our examples.
- Scala and other mini languages are used for other minor parts.
DeepDive code structure
- compiler/ contains the code that compiles DeepDive application configuration into an execution plan.
- database/ contains database drivers as well as code implementing other database operations.
- ddlib/ contains the ddlib Python library that helps users write their applications.
- doc/ contains the Markdown/Jekyll source for the DeepDive website and documentation.
- examples/ contains the DeepDive examples.
- extern/ contains scripts for building and bundling runtime dependencies from external 3rd parties.
- inference/ contains the engine and necessary utilities for statistical learning and inference.
- runner/ contains the engine for running the execution plan compiled by the compiler.
- shell/ contains the code for the general
- test/ at the top as well as */test/ under each subdirectory contain the test code.
- util/ contains other utilities for installation, build, and development.
The DeepDive build is controlled by several files:
- Makefile takes care of the overall build process.
- stage.sh contains the commands that stage built code under dist/, which is the default location where the built executables and runtime data will be staged.
- test/bats.mk contains the Make recipes for running tests written in BATS under
- test/*/should-work.sh determines the .bats files to run for
- .travis.yml enables our continuous integration builds and tests at Travis CI, which are triggered every time a new commit is pushed to our GitHub repository.
The DeepDive source tree includes several git submodules and ports:
- compiler/ddlog/ is the DDlog compiler.
- inference/dimmwitted/ is the DimmWitted Gibbs sampler.
- runner/mkmimo/ is a data routing component that is used for executing parallel UDF processes and efficiently streaming data through them.
- util/mindbender/ is the collection of tools supporting development, such as Mindtagger.
First, get DeepDive's source tree and move into it, by running:
git clone https://github.com/HazyResearch/deepdive.git
cd deepdive
DeepDive builds and tests can be run in Docker, which can simplify the development environment setup dramatically.
To build the source tree inside a container and create a new Docker image, run:
Or, if you don't even have make, just run:
This pulls the latest image from Docker Hub (hazyresearch/deepdive-build), then inside a fresh container, runs the build after applying changes made to the current source tree. This is the default behavior for make (without any target argument) when Docker is available on your system.
CAVEAT: Only files that are tracked by git are reflected in the build inside containers. Use git add to make sure any new files are also considered when transferring changes to containers.
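To spot files that would be left out of a containerized build, you can ask git which files it does not track yet. This is a small sketch using standard git options; list_untracked is a hypothetical helper name and is not part of the DeepDive build scripts.

```shell
# list_untracked [DIR]: print files under the git work tree at DIR (default:
# current directory) that git does not track, skipping ignored files.
# (Hypothetical helper, not part of DeepDive.)
list_untracked() {
    git -C "${1:-.}" ls-files --others --exclude-standard
}

# Example: review the list, then `git add` whatever the build needs.
#   list_untracked
```

Any path printed here would not reach the container until it has been added with git add.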
To test the most recent build, run:
You can pass the EXCEPT= filters as you do for the normal builds (described below).
Or, the equivalent without
You can in fact override the entire test command with this:
./DockerBuild/test-in-container-postgres make test ONLY=test/postgresql/*.bats
To inspect the most recent build or test, run:
You can pass a command to run as arguments:
./DockerBuild/inspect-container latest-run make test
The most recent image for the current branch is automatically updated after the most recent test finishes successfully. To make it also the new latest image for all other branches on your local machine, run:
Until you run this command, new builds will always start from the latest image from the central Docker Hub, not from the latest build on your local machine. If your source tree has diverged a lot from the master branch, it's a good idea to update the latest image once the initial long build finishes and passes all tests. That way builds for your branches won't have to repeat the same long build.
If you have permission, you can push your master image to Docker Hub and have others start builds from there by running:
docker push hazyresearch/deepdive-build
Running containerized builds and tests in Docker is the recommended way, but you are welcome to run normal builds directly on the host in the old way. Everything described here about normal builds in fact applies to the source tree inside the container. Moreover, a normal build is the only way to produce releases for Mac and environments other than the one used in the master image.
To disable the containerized builds even if you have Docker installed and to force a normal build, simply set:
To install all build and runtime dependencies, run:
Or, if you don't have even
util/install.sh _deepdive_build_deps _deepdive_runtime_deps
Basically, DeepDive requires a C/C++ compiler, JDK, Python, GNU coreutils, and several libraries with headers to build from source.
The util/install/install.Ubuntu.sh script enumerates most of the build dependencies as APT packages. You may easily find corresponding packages for your platform and install them. On the other hand, most of the runtime dependencies will be built and bundled (see: depends/bundled/), so eventually users will just grab a DeepDive binary and run it without having to waste time on installing the correct software packages.
To build most of what's under DeepDive's source tree and install at
The PREFIX variable allows the installation destination to be changed. For example:
make install PREFIX=/opt/deepdive
To run all tests, from the top of the source tree, run:
Note that at least one of PostgreSQL, MySQL, or Greenplum database must be running to run the tests.
By setting the TEST_DBHOST environment variable to a user:password@hostname, it is possible to specify against which database the tests should run. For specifying non-default ports for different database types, there are more specific variables:
To run tests selectively, use EXCEPT Make variables for
For example, to run only the test with spouse example against PostgreSQL:
make test ONLY=test/postgresql/spouse_example.bats
Or, to skip the tests against MySQL:
make test EXCEPT=test/mysql/*.bats
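The two examples above suggest that the filters are glob patterns over .bats paths. The sketch below illustrates that idea only; it is not the logic in test/bats.mk or should-work.sh, select_tests is a hypothetical function, the test/mysql/spouse_example.bats path used in the usage example is made up, and the assumption that an EXCEPT match wins over an ONLY match is mine.

```shell
# select_tests ONLY EXCEPT: read candidate .bats paths on stdin and print
# those that match the ONLY glob (when non-empty) and do not match the
# EXCEPT glob (when non-empty). Illustration only, not DeepDive's code.
select_tests() {
    only=$1 except=$2
    while IFS= read -r path; do
        if [ -n "$only" ]; then
            case $path in
                $only) ;;              # matches ONLY: keep considering it
                *) continue ;;         # does not match ONLY: skip it
            esac
        fi
        if [ -n "$except" ]; then
            case $path in
                $except) continue ;;   # matches EXCEPT: skip it
            esac
        fi
        printf '%s\n' "$path"
    done
}

# Example with hypothetical candidates:
#   printf '%s\n' test/postgresql/spouse_example.bats \
#                 test/mysql/spouse_example.bats |
#       select_tests '' 'test/mysql/*.bats'
```

Note that in a shell case pattern, * also matches the / character, so a pattern such as test/* would cover every level below test/.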
To create a tarball package from the built and staged code, run:
The tarball is created at
To build the DDlog compiler from source and place the jar under
To build the sampler from source and replace the binaries, run:
To build the Mindbender toolchain from source and place the binary under
All commands shown above should be run from the top of the source tree.
Modifying DeepDive documentation
To preview your changes to the documentation locally, run:
make -C doc/ test
To deploy changes to the main website, run:
make -C doc/ deploy