DeepDive execution plan compiler #445

Merged
merged 200 commits into master from dataflow-compiler on Jan 5, 2016

8 participants
@netj
Contributor

netj commented Dec 17, 2015

This not-so-small PR adds a new implementation of DeepDive that literally compiles an execution plan to run the app in a much more efficient way, in both human time and machine time/space. I invite everyone to review the code and give it a try on your existing apps (especially @feiranwang @SenWu @ajratner @zhangce @raphaelhoffmann @alldefector @zifeishan @xiaoling @Colossus @ThomasPalomares @juhanaka). The plan is to get feedback over the next few days while I update the documentation and carve out release v0.8.0.

An execution plan is basically a set of shell scripts that tell what to run for the user-defined extractors as well as the built-in processes for grounding the factor graph and performing learning and inference, plus a Makefile that describes the complete dependencies among them. These are compiled from the app's deepdive.conf, app.ddlog, and schema.json, mainly by a series of JSON transformations implemented as jq programs with a little help from bash, all of which reside under compiler/ in the source tree.
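
For a flavor of the style of transformation involved, here is a toy jq pass, not the actual compiler code, assuming schema.json is a top-level object keyed by relation name:

    # Hypothetical illustration only: derive one data target per relation
    # declared in schema.json, similar in spirit to how a compiler pass
    # would seed the data flow.
    jq -r 'keys[] | "data/\(.)"' schema.json
    # prints one line per relation, e.g. data/articles, data/sentences, ...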

It implements most of the existing functionality provided by the current Scala implementation, except for a few things, e.g., Greenplum parallel unloading/loading, which won't be difficult to add given the much more modular architecture. On the other hand, exciting new features and improvements have been added. To highlight just a few:

  • Full Dependency Support with Selective Execution. It's now possible to selectively run, repeat, or skip certain parts of the app's data flow or extraction pipeline without being aware of all the dependencies between them. (fixes #431, closes #427, closes #273) The user has full control over every step of the execution plan. Not only that, but grounding is also broken down into smaller processes, so it's possible to just change or add one inference rule and update the grounded factor graph without having to recompute everything from scratch. (fixes #280)
  • Zero Footprint Extraction. tsv_extractors now have nearly zero footprint on the filesystem. The data is streamed from the database through the UDFs and back to the database. mkmimo is used with named pipes to make the connection between the database and the parallel UDF processes (cf. $DEEPDIVE_NUM_PROCESSES). (fixes #428) It doesn't support other extractor styles yet, but we can probably drop them unless there's a compelling reason. (closes #384)
  • Compute Drivers. A new compute driver architecture for executing such extractors is now in place, so it's clear where to extend it to support remote execution or clusters with a job scheduler, such as Hadoop/YARN, SLURM, Torque/GridEngine/PBS. (#426) The local execution driver is what implements the streaming tsv_extractor mentioned above. The grounding processes as well as the user-defined extractors make use of the compute drivers, so some parts of the grounding will automatically take advantage of such extensions.
  • Zero Footprint Grounding. The grounding processes also minimize footprint on the filesystem. No data for the factor graph is duplicated. Instead of creating concatenated copies of factors, weights, and variables, they are merged as the sampler loads them. Also, the binary format conversion is done on the fly as the grounded rows are unloaded from the database, so no ephemeral textual form is ever stored anywhere. In fact, only a few lines of change to the compiler can compress the binary forms and shrink the factor graph's footprint on the filesystem by an order of magnitude (not included in this PR).
  • More User Commands. The deepdive.pipeline config is obsolete, as is the deepdive run command, although both will keep working as before. Now the user can simply state the goal of execution to the deepdive do command, e.g., deepdive do model/calibration-plot or deepdive do data/has_spouse_features, as many times as necessary once the app is compiled with the deepdive compile command (see the short workflow sketch after this list). Supporting commands such as deepdive plan, deepdive mark, and deepdive redo are there to speed up typical user workflows with DeepDive apps. deepdive initdb and deepdive load have been improved to be more useful, and are used by a few compiled processes. (fixes #351, fixes #357)
  • Error Checking. Errors in the app (and of course the compiler itself) are checked at compile time, and checks can also be run by the user with the deepdive check command. The checkers are modular and isolated, so many useful checks can be quickly added, such as test-firing UDF commands. Only basic checks have been implemented so far. (fixes #349, fixes #1)
  • Simpler, Efficient Multinomial Factors. Multinomial factors won't materialize unnecessary data and use VIEWs as much as possible, e.g., for the dd_*_cardinality tables and dd_graph_weights. Also, nearly no code has been duplicated to support them, which is good news for developers.
  • Bundled Runtime Dependencies. DeepDive now builds essential runtime dependencies, such as bash, coreutils, jq, bc, and graphviz, so no more Mac vs. Linux or software installation/version issues will pop up. (fixes #441 as documentation also ended up in this PR)
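
To make the new commands concrete, here is a rough workflow sketch using targets mentioned in this thread (target names come from the example apps; deepdive redo's argument form is assumed to mirror deepdive do):

    deepdive compile                          # compile deepdive.conf, app.ddlog, schema.json into an execution plan
    deepdive check                            # run the modular compile-time checkers against the app
    deepdive do data/has_spouse_features      # run only what is needed to produce this target
    deepdive do model/calibration-plot        # later goals reuse everything already marked done
    deepdive plan model/calibration-plot      # show the steps deepdive do would execute
    deepdive redo data/has_spouse_features    # assumed: force this step and its dependents to run again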

Also some good stuff that comes with a clean rewrite:

  • Closes #412 by doing a reasonable job and making the base relations clear to the user, and also closes #421.
  • Closes #110 as no more SQL parsing is done.
  • Closes #383 as logging has been completely redone.
  • Closes #361, closes #20 as JDBC is no longer used.
  • Closes #329 as the Scala implementation will simply be dropped in a future release (v0.8.1 or v0.9?).

netj added some commits Aug 28, 2015

Fixes postgresql driver's SSL support
- Bugfix for mishandling of '&' in URLs that prevented DEEPDIVE_JDBC_URL
  from being set
- Support for all [sslmode= values](http://www.postgresql.org/docs/9.2/static/libpq-connect.html#LIBPQ-CONNECT-SSLMODE)
  instead of just `ssl=true`, which corresponds to `sslmode=require`.
- PGSSLMODE default changed to `prefer` instead of `disable`.
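
For example, the new default can be overridden through libpq's standard environment variable before running any database command:

    export PGSSLMODE=require        # any libpq sslmode: disable, allow, prefer, require, verify-ca, verify-full
    deepdive sql "SELECT version()" # subsequent connections then use the requested SSL mode
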
Adds mark_done as a fancy touch
that ensures the parent path exists
Adds basic bash_completion for deepdive
that can be set up via:

    source $(deepdive whereis etc/deepdive_bash_completion.sh)

`deepdive whereis` command has been added for finding resources in
installation.
Takes schema.json into account to extend the data flow with loaders
Both the schema JSON compiled from DDlog and the user's schema.json
file are considered.
Adds timestamped logging when running plans
and some cosmetic changes: relative times for timestamps, substep
printing for easier progress tracking within a process etc.
Revises deepdive initdb and load to work nicely with new dataflow
Allowing users to
- initialize selected relations without recreating the database
- load a relation from multiple .tsv, .csv, .json-seq sources, compressed (.bz2, .gz) and generated (.sh) sources
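
A hypothetical usage sketch under these assumptions (the exact argument syntax of deepdive initdb and deepdive load is not spelled out here, and the relation and file names are made up):

    deepdive initdb articles                                   # (re)create only the articles relation, not the whole database
    deepdive load articles input/articles-1.tsv.bz2 input/articles-2.csv.gz
    deepdive load sentences input/generate-sentences.sh        # a generated (.sh) source
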
Cleans up naming scheme and how the original deepdive.conf is dealt with
- The augmented, normalized deepdive config object will stay under .deepdive_
- All normalized/qualified fields will stay under fields whose names end
  with underscore
Makes learning and inference directly call sampler
Also adds processes for loading weights and marginal probabilities back into the database
Passes integration tests with new dataflow compiler
except the ddlog spouse example, which requires calibration

netj added some commits Dec 30, 2015

Improves submodule builds
by aborting the build whenever a different commit is checked out than
the index or HEAD and printing out instructions to proceed.

Fixes build-mindbender to work on Linux as well.
Fixes X11 build error in graphviz bundled dependency
Also drops unnecessarily large executables
Disables coverage reporting to coveralls.io
as it is no longer relevant to the main codebase
@zifeishan
Contributor

zifeishan commented Dec 30, 2015

@netj
This seems to be a bigger issue than sql_extractor for PGXL: I am trying out some other extractors, but they ran into other problems with this line: https://github.com/HazyResearch/deepdive/blob/dataflow-compiler/runner/compute-driver/local/compute-execute#L123 since it also runs multiple queries with deepdive sql. I guess a cleaner workaround would be splitting queries in https://github.com/HazyResearch/deepdive/blob/dataflow-compiler/database/db-driver/postgresql-xl/db-execute?

Regarding the PGXL issue, I could not find discussions of it online. A sample query that fails in my environment:

$ psql $DBNAME -c  "create table tmptmp(id int);
insert into tmptmp values(1);
select * from tmptmp;
drop table tmptmp;"
ERROR:  Unexpected response from the Datanodes for 'T' message, current request type 1
ERROR:  Unexpected response from the Datanodes for 'T' message, current request type 1
ERROR:  Unexpected response from the data nodes for 'D' message, current request type 0

Server version: Postgres-XL 9.2.0.

@netj
Contributor

netj commented Dec 30, 2015

@zifeishan If this is a blocker, we can quickly apply the workaround to the pgxl driver, which is actually super simple. db-execute can call psql multiple times after splitting the query argument by semicolon, and that'll handle everything. Wanna take a stab at it?

@zifeishan
Contributor

zifeishan commented Dec 30, 2015

@netj I just found an alternative and it works: instead of

deepdive sql  "create table tmptmp(id int);
insert into tmptmp values(1);
select * from tmptmp;
drop table tmptmp;"

if I run:

echo  "create table tmptmp(id int);
insert into tmptmp values(1);
select * from tmptmp;
drop table tmptmp;" | deepdive sql

it works for PGXL. I think if we change PGXL's db-execute to pipe the SQL query into a deepdive sql command, it would work out nicely.

@netj
Contributor

netj commented Dec 30, 2015

@zifeishan Interesting. However, that'll take away stdin for psql, which is problematic for db-load and other potential use cases. Is there a way to give psql a SQL file to run instead?

@zifeishan
Contributor

zifeishan commented Dec 30, 2015

@netj Yes. Try psql -f commands.sql. It works with pgxl as well.

Updates bats.mk with the generic version also used for mkmimo
with pretty printing of .bats files being run
@alldefector
Contributor

alldefector commented Dec 30, 2015

@zifeishan @netj are you guys suggesting that we write $sql to a temp file and use -f to run it (so that the whole block isn't run as one transaction)? Sounds like it should work. We only need to update PGXL's db-execute driver, right?

@netj
Contributor

netj commented Dec 30, 2015

@alldefector Sorry the conversation went private through Slack. Yes, that's exactly our plan, except using bash process substitution to turn $sql into a readable file. This is so much better than dangerously splitting the SQL queries in a sloppy way. (btw you mean the block is run as a transaction, right?)

@zifeishan Could you make the changes, confirm it working, and add the commit to this PR?

@alldefector
Contributor

alldefector commented Dec 30, 2015

Apparently -f does not run the file in one transaction unless you also specify -1 or --single-transaction: http://postgres-xc.sourceforge.net/docs/1_0/app-psql.html

That's actually what we need because PGXL would throw up (Zifei's error messages) if we try to run a bunch of mutating / DDL statements in one transaction.
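
A minimal sketch of the approach being settled on here, assuming a db-execute that receives the SQL text as its first argument (hypothetical, not the actual driver code):

    #!/usr/bin/env bash
    # Hypothetical postgresql-xl db-execute: expose the SQL text to psql as a file
    # via process substitution so stdin stays free for other uses (e.g. db-load),
    # and rely on -f *without* -1/--single-transaction so the statements are not
    # wrapped in one transaction, which Postgres-XL rejects for mixed DDL/DML.
    set -euo pipefail
    sql=$1
    exec psql "$DBNAME" -f <(printf '%s\n' "$sql")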

@feiranwang
Contributor

feiranwang commented Dec 30, 2015

@netj I tried this on several examples. deepdive run and deepdive do all work great. I have a small problem with deepdive do. When I type deepdive do, it gives a list of targets. In the smoke example there's a process/grounding/variable/person_has_cancer/assign_id. When I try to run with that target, it gives an error:

process/grounding/variable/person_has_cancer/assign_id: Unknown target

Is this expected?

@netj
Contributor

netj commented Dec 30, 2015

@feiranwang Can you try appending a .done suffix to the target? Maybe I forgot to handle that case. I'll also look into this.

@feiranwang
Contributor

feiranwang commented Dec 30, 2015

@netj I added a .done suffix to the target and it gives the same error

process/grounding/variable/person_has_cancer/assign_id.done: Unknown target

netj added some commits Dec 30, 2015

Improves how the runner deals with make(1) target names
and fixes it to surface dependency errors between the processes or any
error from make(1) instead of obscuring everything as an "Unknown target"
error.
Fixes a dependency error in smoke example
by converting schema.sql into ddlog schema declarations, which in turn
become schema.json.
@netj
Contributor

netj commented Dec 30, 2015

@feiranwang It turns out to be a missing schema.json issue, so I converted schema.sql into app.ddlog as an easy way to generate the JSON file. The compiled Makefile was missing some dependencies for the mentioned target, which are generated from the relational schema. I fixed things to surface such errors more transparently, for example:

$ deepdive plan process/grounding/variable/person_has_cancer/assign_id
make: *** No rule to make target `data/person_has_cancer.done', needed by `process/grounding/variable_id_partition.done'.  Stop.
Error in dependencies found for process/grounding/variable/person_has_cancer/assign_id

Ideally, these should be caught in the deepdive compile phase. It'd be nice if you could contribute a checker that prevents this kind of error. Also, it seems we're missing a test for the smoke example.

netj added some commits Dec 30, 2015

Improves deepdive-do to kill descendant processes
instead of separating the process group which seems to have a lot of
undesirable side effects, e.g., `^C` doesn't work when deepdive-do is
run from a script instead of shell prompt.  Moreover, any descendant
could start its own session/process group, so it was a fragile approach
anyway.

This reverts commit 9242b4e, removing
the setsid utility.
Fixes postgresql-xl's db-execute to support multiple SQL statements
and also enhances postgresql's db-load to use the `\COPY` psql client
command instead of the `COPY` server statement to deal with potential
permission issues.  See: https://wiki.postgresql.org/wiki/COPY
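
To illustrate the difference (standard psql behavior, with a hypothetical table and file):

    psql "$DBNAME" -c "COPY articles FROM '/data/articles.tsv'"   # server-side: the database server itself must be able to read the file
    psql "$DBNAME" -c "\copy articles from 'articles.tsv'"        # client-side: psql streams the file, so only the client needs access
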
Fixes terminal echo issue caused by progress bars
competing with the log by forcing `stty echo` after the `pv -c`.
Fixes postgresql-xl's db-assign_sequential_id to support increment > 1
to make multinomial grounding work correctly.

Also drops fallback to postgresql's SEQUENCE as it does not work.
Defines a separate test suite for pgxl
and excludes the postgresql tests from being enumerated for postgresql-xl
or greenplum

feiranwang added a commit that referenced this pull request Jan 5, 2016

feiranwang merged commit 13e8788 into master on Jan 5, 2016

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

netj deleted the dataflow-compiler branch on Jan 12, 2016

netj referenced this pull request in HazyResearch/ddlog on Jan 30, 2016

Closed

Truncate table before each individual extractor? #51
