DeepDive execution plan compiler #445

Merged
merged 200 commits into master from dataflow-compiler on Jan 5, 2016

8 participants
@netj
Contributor

netj commented Dec 17, 2015

This not-so-small PR adds a new implementation of DeepDive that literally compiles an execution plan to run the app in a much more efficient way, in both human time and machine time/space. I invite everyone to review the code and give it a try on your existing apps (especially @feiranwang @SenWu @ajratner @zhangce @raphaelhoffmann @alldefector @zifeishan @xiaoling @Colossus @ThomasPalomares @juhanaka). The plan is to get feedback over the next few days while I update the documentation and carve out release v0.8.0.

An execution plan is basically a set of shell scripts that tell what to run for the user-defined extractors as well as the built-in processes for grounding the factor graph and performing learning and inference, plus a Makefile that describes the complete dependencies among them. These are compiled from the app's deepdive.conf, app.ddlog, and schema.json, mainly by a series of JSON transformations implemented as jq programs with a little help from bash, all of which reside under compiler/ in the source tree.
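
For a flavor of the style of transformation involved, here is a toy jq pass, not the actual compiler code, assuming schema.json is a top-level object keyed by relation name:

    # Hypothetical illustration only: derive one data target per relation
    # declared in schema.json, similar in spirit to how a compiler pass
    # would seed the data flow.
    jq -r 'keys[] | "data/\(.)"' schema.json
    # prints one line per relation, e.g. data/articles, data/sentences, ...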

It implements most of the existing functionality provided by the current Scala implementation, except for a few things, e.g., Greenplum parallel unloading/loading, which won't be difficult to add given the much more modular architecture. On the other hand, exciting new features and improvements have been added. To highlight just a few:

  • Full Dependency Support with Selective Execution. It's now possible to selectively run, repeat, or skip certain parts of the app's data flow or extraction pipeline without being aware of all the dependencies between them. (fixes #431, closes #427, closes #273) The user has full control over every step of the execution plan. Not only that, but grounding is also broken down into smaller processes, so it's possible to just change or add one inference rule and update the grounded factor graph without having to recompute everything from scratch. (fixes #280)
  • Zero Footprint Extraction. tsv_extractors now have nearly zero footprint on the filesystem. The data is streamed from the database through the UDFs and back to the database. mkmimo is used with named pipes to make the connection between the database and the parallel UDF processes (cf. $DEEPDIVE_NUM_PROCESSES). (fixes #428) It doesn't support other extractor styles yet, but we can probably drop them unless there's a compelling reason. (closes #384)
  • Compute Drivers. A new compute driver architecture for executing such extractors is now in place, so it's clear where to extend it to support remote execution or clusters with a job scheduler, such as Hadoop/YARN, SLURM, Torque/GridEngine/PBS. (#426) The local execution driver is what implements the streaming tsv_extractor mentioned above. The grounding processes as well as the user-defined extractors make use of the compute drivers, so some parts of the grounding will automatically take advantage of such extensions.
  • Zero Footprint Grounding. The grounding processes also minimize footprint on the filesystem. No data for the factor graph is duplicated. Instead of creating concatenated copies of factors, weights, and variables, they are merged as the sampler loads them. Also, the binary format conversion is done on the fly as the grounded rows are unloaded from the database, so no ephemeral textual form is ever stored anywhere. In fact, only a few lines of change to the compiler can compress the binary forms and shrink the factor graph's footprint on the filesystem by an order of magnitude (not included in this PR).
  • More User Commands. The deepdive.pipeline config is obsolete, as is the deepdive run command, although both will keep working as before. Now the user can simply state the goal of execution to the deepdive do command, e.g., deepdive do model/calibration-plot or deepdive do data/has_spouse_features, as many times as necessary once the app is compiled with the deepdive compile command (see the short workflow sketch after this list). Supporting commands such as deepdive plan, deepdive mark, and deepdive redo are there to speed up typical user workflows with DeepDive apps. deepdive initdb and deepdive load have been improved to be more useful, and are used by a few compiled processes. (fixes #351, fixes #357)
  • Error Checking. Errors in the app (and of course the compiler itself) are checked at compile time, and checks can also be run by the user with the deepdive check command. The checkers are modular and isolated, so many useful checks can be quickly added, such as test-firing UDF commands. Only basic checks have been implemented so far. (fixes #349, fixes #1)
  • Simpler, Efficient Multinomial Factors. Multinomial factors won't materialize unnecessary data and use VIEWs as much as possible, e.g., for the dd_*_cardinality tables and dd_graph_weights. Also, nearly no code has been duplicated to support them, which is good news for developers.
  • Bundled Runtime Dependencies. DeepDive now builds essential runtime dependencies, such as bash, coreutils, jq, bc, and graphviz, so no more Mac vs. Linux or software installation/version issues will pop up. (fixes #441 as documentation also ended up in this PR)
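
To make the new commands concrete, here is a rough workflow sketch using targets mentioned in this thread (target names come from the example apps; deepdive redo's argument form is assumed to mirror deepdive do):

    deepdive compile                          # compile deepdive.conf, app.ddlog, schema.json into an execution plan
    deepdive check                            # run the modular compile-time checkers against the app
    deepdive do data/has_spouse_features      # run only what is needed to produce this target
    deepdive do model/calibration-plot        # later goals reuse everything already marked done
    deepdive plan model/calibration-plot      # show the steps deepdive do would execute
    deepdive redo data/has_spouse_features    # assumed: force this step and its dependents to run again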

Also some good stuff that comes with a clean rewrite:

  • Closes #412 by doing a reasonable job and making the base relations clear to the user, and also closes #421.
  • Closes #110 as no more SQL parsing is done.
  • Closes #383 as logging has been completely redone.
  • Closes #361, closes #20 as JDBC is no longer used.
  • Closes #329 as the Scala implementation will simply be dropped in a future release (v0.8.1 or v0.9?).

netj added some commits Aug 28, 2015

Fixes postgresql driver's SSL support
- Bugfix for mishandling of '&' in URLs that prevented DEEPDIVE_JDBC_URL
  from being set
- Support for all [sslmode= values](http://www.postgresql.org/docs/9.2/static/libpq-connect.html#LIBPQ-CONNECT-SSLMODE)
  instead of just `ssl=true`, which corresponds to `sslmode=require`.
- PGSSLMODE default changed to `prefer` instead of `disable`.
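
For example, the new default can be overridden through libpq's standard environment variable before running any database command:

    export PGSSLMODE=require        # any libpq sslmode: disable, allow, prefer, require, verify-ca, verify-full
    deepdive sql "SELECT version()" # subsequent connections then use the requested SSL mode
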
Adds mark_done as a fancy touch
that ensures the parent path exists
Adds basic bash_completion for deepdive
that can be set up via:

    source $(deepdive whereis etc/deepdive_bash_completion.sh)

`deepdive whereis` command has been added for finding resources in
installation.
Takes schema.json into account to extend the data flow with loaders
Both the schema JSON compiled from DDlog and the user's schema.json
file are considered.
Adds timestamped logging when running plans
and some cosmetic changes: relative times for timestamps, substep
printing for easier progress tracking within a process etc.
Revises deepdive initdb and load to work nicely with new dataflow
Allowing users to
- initialize selected relations without recreating the database
- load a relation from multiple .tsv, .csv, .json-seq sources, compressed (.bz2, .gz) and generated (.sh) sources
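
A hypothetical usage sketch under these assumptions (the exact argument syntax of deepdive initdb and deepdive load is not spelled out here, and the relation and file names are made up):

    deepdive initdb articles                                   # (re)create only the articles relation, not the whole database
    deepdive load articles input/articles-1.tsv.bz2 input/articles-2.csv.gz
    deepdive load sentences input/generate-sentences.sh        # a generated (.sh) source
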
Cleans up naming scheme and how the original deepdive.conf is dealt with
- The augmented, normalized deepdive config object will stay under .deepdive_
- All normalized/qualified fields will stay under fields whose names end
  with underscore
Makes learning and inference directly call sampler
Also adds processes for loading weights and marginal probabilities back into the database
Passes integration tests with new dataflow compiler
except the ddlog spouse example, which requires calibration

netj added some commits Dec 30, 2015

Improves submodule builds
by aborting the build whenever a different commit is checked out than
the index or HEAD and printing out instructions to proceed.

Fixes build-mindbender to work on Linux as well.
Fixes X11 build error in graphviz bundled dependency
Also drops unnecessarily large executables
Disables coverage reporting to coveralls.io
as it is no longer relevant to the main codebase
@zifeishan
Contributor

zifeishan commented Dec 30, 2015

@netj
This seems to be a bigger issue than sql_extractor for PGXL: I am trying out some other extractors, but they ran into other problems with this line: https://github.com/HazyResearch/deepdive/blob/dataflow-compiler/runner/compute-driver/local/compute-execute#L123 since it also runs multiple queries with deepdive sql. I guess a cleaner workaround would be splitting queries in https://github.com/HazyResearch/deepdive/blob/dataflow-compiler/database/db-driver/postgresql-xl/db-execute?

Regarding the PGXL issue, I could not find discussions of it online. A sample query that fails in my environment:

$ psql $DBNAME -c  "create table tmptmp(id int);
insert into tmptmp values(1);
select * from tmptmp;
drop table tmptmp;"
ERROR:  Unexpected response from the Datanodes for 'T' message, current request type 1
ERROR:  Unexpected response from the Datanodes for 'T' message, current request type 1
ERROR:  Unexpected response from the data nodes for 'D' message, current request type 0

Server version: Postgres-XL 9.2.0.

@netj
Contributor

netj commented Dec 30, 2015

@zifeishan If this is a blocker, we can quickly apply the workaround to the pgxl driver, which is actually super simple. db-execute can call psql multiple times after splitting the query argument by semicolon, and that'll handle everything. Wanna take a stab at it?

@zifeishan
Contributor

zifeishan commented Dec 30, 2015

@netj I just found an alternative and it works: instead of

deepdive sql  "create table tmptmp(id int);
insert into tmptmp values(1);
select * from tmptmp;
drop table tmptmp;"

if I run:

echo  "create table tmptmp(id int);
insert into tmptmp values(1);
select * from tmptmp;
drop table tmptmp;" | deepdive sql

it works for PGXL. I think if we change PGXL's db-execute to pipe the SQL query into a deepdive sql command, it would work out nicely.

@netj
Contributor

netj commented Dec 30, 2015

@zifeishan Interesting. However, that'll take away stdin for psql, which is problematic for db-load and other potential use cases. Is there a way to give psql a SQL file to run instead?

@zifeishan
Contributor

zifeishan commented Dec 30, 2015

@netj Yes. Try psql -f commands.sql. It works with pgxl as well.

Updates bats.mk with the generic version also used for mkmimo
with pretty printing of .bats files being run
@alldefector
Contributor

alldefector commented Dec 30, 2015

@zifeishan @netj are you guys suggesting that we write $sql to a temp file and use -f to run it (so that the whole block isn't run as one transaction)? Sounds like it should work. We only need to update PGXL's db-execute driver, right?

@netj
Contributor

netj commented Dec 30, 2015

@alldefector Sorry the conversation went private through Slack. Yes, that's exactly our plan, except using bash process substitution to turn $sql into a readable file. This is so much better than dangerously splitting the SQL queries in a sloppy way. (btw you mean the block is run as a transaction, right?)

@zifeishan Could you make the changes, confirm it working, and add the commit to this PR?

@alldefector
Contributor

alldefector commented Dec 30, 2015

Apparently -f does not run the file in one transaction unless you also specify -1 or --single-transaction: http://postgres-xc.sourceforge.net/docs/1_0/app-psql.html

That's actually what we need because PGXL would throw up (Zifei's error messages) if we try to run a bunch of mutating / DDL statements in one transaction.
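
A minimal sketch of the approach being settled on here, assuming a db-execute that receives the SQL text as its first argument (hypothetical, not the actual driver code):

    #!/usr/bin/env bash
    # Hypothetical postgresql-xl db-execute: expose the SQL text to psql as a file
    # via process substitution so stdin stays free for other uses (e.g. db-load),
    # and rely on -f *without* -1/--single-transaction so the statements are not
    # wrapped in one transaction, which Postgres-XL rejects for mixed DDL/DML.
    set -euo pipefail
    sql=$1
    exec psql "$DBNAME" -f <(printf '%s\n' "$sql")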

@feiranwang
Contributor

feiranwang commented Dec 30, 2015

@netj I tried this on several examples. deepdive run and deepdive do all work great. I have a small problem with deepdive do. When I type deepdive do, it gives a list of targets. In the smoke example there's a process/grounding/variable/person_has_cancer/assign_id. When I try to run with that target, it gives an error:

process/grounding/variable/person_has_cancer/assign_id: Unknown target

Is this expected?

@netj
Contributor

netj commented Dec 30, 2015

@feiranwang Can you try appending a .done suffix to the target? Maybe I forgot to handle that case. I'll also look into this.

@feiranwang
Contributor

feiranwang commented Dec 30, 2015

@netj I added a .done suffix to the target and it gives the same error

process/grounding/variable/person_has_cancer/assign_id.done: Unknown target

netj added some commits Dec 30, 2015

Improves how the runner deals with make(1) target names
and fixes it to surface dependency errors between the processes or any
error from make(1) instead of obscuring everything as an "Unknown target"
error.
Fixes a dependency error in smoke example
by converting schema.sql into ddlog schema declarations, which in turn
become schema.json.
@netj
Contributor

netj commented Dec 30, 2015

@feiranwang It turns out to be a missing schema.json issue, so I converted schema.sql into app.ddlog as an easy way to generate the JSON file. The compiled Makefile was missing some dependencies for the mentioned target, which are generated from the relational schema. I fixed things to surface such errors more transparently, for example:

$ deepdive plan process/grounding/variable/person_has_cancer/assign_id
make: *** No rule to make target `data/person_has_cancer.done', needed by `process/grounding/variable_id_partition.done'.  Stop.
Error in dependencies found for process/grounding/variable/person_has_cancer/assign_id

Ideally, these should be caught in the deepdive compile phase. It'd be nice if you could contribute a checker that prevents this kind of error. Also, it seems we're missing a test for the smoke example.

netj added some commits Dec 30, 2015

Improves deepdive-do to kill descendant processes
instead of separating the process group which seems to have a lot of
undesirable side effects, e.g., `^C` doesn't work when deepdive-do is
run from a script instead of shell prompt.  Moreover, any descendant
could start its own session/process group, so it was a fragile approach
anyway.

This reverts commit 9242b4e, removing
the setsid utility.
Fixes postgresql-xl's db-execute to support multiple SQL statements
and also enhances postgresql's db-load to use the `\COPY` psql client
command instead of the `COPY` server statement to deal with potential
permission issues.  See: https://wiki.postgresql.org/wiki/COPY
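
To illustrate the difference (standard psql behavior, with a hypothetical table and file):

    psql "$DBNAME" -c "COPY articles FROM '/data/articles.tsv'"   # server-side: the database server itself must be able to read the file
    psql "$DBNAME" -c "\copy articles from 'articles.tsv'"        # client-side: psql streams the file, so only the client needs access
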
Fixes terminal echo issue caused by progress bars
competing with the log by forcing `stty echo` after the `pv -c`.
Fixes postgresql-xl's db-assign_sequential_id to support increment > 1
to make multinomial grounding work correctly.

Also drops fallback to postgresql's SEQUENCE as it does not work.
Defines a separate test suite for pgxl
and excludes the postgresql tests from being enumerated for postgresql-xl
or greenplum

feiranwang added a commit that referenced this pull request Jan 5, 2016

feiranwang merged commit 13e8788 into master on Jan 5, 2016

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

netj deleted the dataflow-compiler branch on Jan 12, 2016

netj referenced this pull request in HazyResearch/ddlog on Jan 30, 2016

Closed

Truncate table before each individual extractor? #51
