DeepDive execution plan compiler #445

Merged (200 commits) on Jan 5, 2016
Conversation

@netj (Contributor) commented Dec 17, 2015

This not-so-small PR adds a new implementation of DeepDive that literally compiles an execution plan to run the app in a much more efficient way, in both human time and machine time/space. I invite everyone to review the code and give it a try on your existing apps (especially @feiranwang @senwu @ajratner @zhangce @raphaelhoffmann @alldefector @zifeishan @xiaoling @Colossus @ThomasPalomares @juhanaka). The plan is to get feedback over the next few days while I update the documentation and carve out release v0.8.0.

An execution plan is basically a set of shell scripts that specify what to run for the user-defined extractors as well as for the built-in processes that ground the factor graph and perform learning and inference, plus a Makefile that describes the complete dependencies among them. These are compiled from the app's deepdive.conf, app.ddlog, and schema.json, mainly by a series of JSON transformations implemented as jq programs with a little help from bash, all of which resides under compiler/ in the source tree.
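To give a concrete flavor (a minimal sketch with made-up field names, not the actual compiler code), one such jq transformation could turn a normalized list of processes into Makefile dependency lines:

    # Sketch: emit one Makefile dependency line per process from a
    # normalized config object (field names here are illustrative).
    jq -r '
      .processes | to_entries[] |
      "\(.key).done: \(.value.dependencies // [] | map(. + ".done") | join(" "))"
    ' config.normalized.json >Makefile.deps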

It implements most of the functionality provided by the current Scala implementation, except for a few things, e.g., Greenplum parallel unloading/loading, which won't be difficult to add given the much more modular architecture. On the other hand, exciting new features and improvements have been added. To highlight just a few:

  • Full Dependency Support with Selective Execution. It's now possible to selectively run, repeat, or skip certain parts of the app's data flow or extraction pipeline without being aware of all the dependencies between them. (fixes Why remove tasks dependencies that are not in the pipeline? #431, closes If an extractor depends on inference, DD halts indefinitely before doing inference #427, closes extractorRunner hanging problem #273) The user has full control over every step of the execution plan. Not only that, but grounding is also broken down into smaller processes, so it's possible to just change or add one inference rule and update the grounded factor graph without having to recompute everything from scratch. (fixes Restart DeepDive from a "checkpoint" #280)
  • Zero Footprint Extraction. tsv_extractors now have nearly zero footprint on the filesystem: the data is streamed from the database through the UDFs and back into the database. mkmimo is used with named pipes to make the connection between the database and the parallel UDF processes (cf. $DEEPDIVE_NUM_PROCESSES; see the named-pipe sketch after this list). (fixes Possible to get rid of UDF input data splitting? #428) Other extractor styles aren't supported yet, but we can probably drop them unless there's a compelling reason to keep them. (closes PGXL PL/Python extractors keep running after Ctrl-C'ing DeepDive #384)
  • Compute Drivers. A new compute driver architecture for executing such extractors is now in place, so it's clear where to add support for remote execution or for clusters with a job scheduler, such as Hadoop/YARN, SLURM, or Torque/GridEngine/PBS. (Parallelization - scale-out / distributed execution of UDFs #426) The local execution driver is what implements the streaming tsv_extractor mentioned above. Moreover, the grounding processes as well as the user-defined extractors make use of the compute drivers, so some parts of the grounding will automatically take advantage of such extensions.
  • Zero Footprint Grounding. The grounding processes also minimize footprint on the filesystem. No data for the factor graph is duplicated: instead of creating concatenated copies of factors, weights, and variables, they are merged as the sampler loads them. Also, the binary format conversion is done on the fly as the grounded rows are unloaded from the database, so no ephemeral textual form is ever stored anywhere. In fact, only a few line changes to the compiler can compress the binary forms and shrink the factor graph's footprint on the filesystem by an order of magnitude (not included in this PR).
  • More User Commands. The deepdive.pipeline config is now obsolete, as is the deepdive run command, although both work the same as before. Now the user can simply state the goal of the execution to the deepdive do command, e.g., deepdive do model/calibration-plot or deepdive do data/has_spouse_features, as many times as necessary once the app is compiled with the deepdive compile command. Supporting commands such as deepdive plan, deepdive mark, and deepdive redo are there to speed up typical user workflows with DeepDive apps (see the workflow sketch after this list). deepdive initdb and deepdive load have been improved to be more useful, and are used by a few compiled processes. (fixes Separate initdb command / flag for input data vs. extractions schemas #351, fixes schema.sql file generated from ddlog? #357)
  • Error Checking. Errors in the app (and of course in the compiler itself) are checked at compile time, and checks can also be run by the user with the deepdive check command. The checkers are modular and isolated, so many useful checks can be added quickly, such as test-firing UDF commands. Only basic checks have been implemented so far. (fixes deepdive run doesn't report an error when extractor dependencies are misspelled #349, fixes Configuration sanity check #1)
  • Simpler, Efficient Multinomial Factors. Multinomial factors no longer materialize unnecessary data and use VIEWs as much as possible, e.g., for the dd_*_cardinality tables and dd_graph_weights. Also, nearly no code had to be duplicated to support them, which is good news for developers.
  • Bundled Runtime Dependencies. DeepDive now builds and bundles essential runtime dependencies, such as bash, coreutils, jq, bc, and graphviz, so no more Mac vs. Linux or software version issues will pop up. (fixes no install instructions for building from source #441, as documentation also ended up in this PR)
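To illustrate the named-pipe streaming in the Zero Footprint Extraction bullet, here is a deliberately simplified sketch (table, file, and UDF names are made up; the real local driver uses mkmimo to multiplex many parallel UDF processes and handles errors):

    # Stream rows out of the database, through a UDF, and back in,
    # staging nothing on the filesystem.
    mkfifo udf_input udf_output
    psql -c "\\copy (SELECT * FROM articles) TO 'udf_input'" &
    ./udf.py <udf_input >udf_output &
    psql -c "\\copy sentences FROM 'udf_output'"
    rm -f udf_input udf_output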
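And a typical selective-execution session with the new commands might look like the following (the do/plan targets are the ones mentioned above; the mark/redo argument shapes are a best guess, so check the command help for exact forms):

    deepdive compile                        # compile the app into an execution plan
    deepdive do data/has_spouse_features    # run only what this target needs
    deepdive plan model/calibration-plot    # preview what would run, without running
    deepdive mark done data/articles        # (assumed syntax) mark a step as done
    deepdive redo data/has_spouse_features  # force a step and its consequences to rerun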

Also, some good stuff comes with a clean rewrite:

netj added 30 commits October 1, 2015 18:52
- Bugfix for mishandling of '&' in URLs that prevented DEEPDIVE_JDBC_URL
  from being set
- Support for all [sslmode=](http://www.postgresql.org/docs/9.2/static/libpq-connect.html#LIBPQ-CONNECT-SSLMODE) values
  instead of just `ssl=true`, which corresponds to `sslmode=require`.
- PGSSLMODE default changed to `prefer` instead of `disable`.
that ensures the parent path exists
that can be set up via:

    source $(deepdive whereis etc/deepdive_bash_completion.sh)

A `deepdive whereis` command has been added for finding resources in the
installation.
Both the schema JSON compiled from DDlog and the user's schema.json
file are considered.
and some cosmetic changes: relative times for timestamps, substep
printing for easier progress tracking within a process, etc.
Allowing users to
- initialize selected relations without recreating the database
- load a relation from multiple .tsv, .csv, .json-seq, compressed (.bz2, .gz), and generated (.sh) sources
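For instance, loading a relation from several mixed-format sources might look like this (relation and file names are hypothetical; the invocation shape is inferred from the description above):

    deepdive load articles input/articles-1.tsv input/articles-2.csv.gz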
- The augmented, normalized deepdive config object will stay under .deepdive_
- All normalized/qualified fields will stay under fields whose names end
  with underscore
Also adds processes for loading weights and marginal probabilities back to the database
except the ddlog spouse example, which requires calibration
@netj (Contributor, Author) commented Dec 30, 2015

@alldefector Sorry the conversation went private through Slack. Yes, that's exactly our plan, except using bash process substitution to turn $sql into a readable file. This is so much better than dangerously splitting the SQL queries in a sloppy way. (BTW, you mean the block is run as a transaction, right?)
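Concretely, something along these lines (the SQL in $sql is a placeholder):

    # Expose $sql to psql as a readable file via process substitution,
    # instead of splitting the statements ourselves.
    sql='CREATE TABLE foo(x INT); INSERT INTO foo VALUES (1);'
    psql -f <(printf '%s\n' "$sql")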

@zifeishan Could you make the changes, confirm they work, and add the commit to this PR?

@alldefector (Contributor) commented:

Apparently -f does not run the file in one transaction unless you also specify -1 or --single-transaction: http://postgres-xc.sourceforge.net/docs/1_0/app-psql.html

That's actually what we need because PGXL would throw up (Zifei's error messages) if we try to run a bunch of mutating / DDL statements in one transaction.
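In other words (file name illustrative):

    psql -f plan.sql     # each statement runs/commits on its own (what PGXL needs here)
    psql -1 -f plan.sql  # the whole file wrapped in a single BEGIN/COMMIT
    psql --single-transaction -f plan.sql   # long form of -1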

@feiranwang (Contributor) commented:

@netj I tried this on several examples. deepdive run and deepdive do all work great. I have a small problem with deepdive do: when I type deepdive do, it gives a list of targets. In the smoke example there's a process/grounding/variable/person_has_cancer/assign_id. When I try to run that target, it gives an error:

process/grounding/variable/person_has_cancer/assign_id: Unknown target

Is this expected?

@netj (Contributor, Author) commented Dec 30, 2015

@feiranwang Can you try appending a .done suffix to the target? Maybe I forgot to add that case. I'll also look into this.

@feiranwang (Contributor) commented:

@netj I added a .done suffix to the target and it gives the same error:

process/grounding/variable/person_has_cancer/assign_id.done: Unknown target

and fixes it to surface dependency errors between the processes, or any
error from make(1), instead of obscuring everything as an "Unknown
target" error.
by converting schema.sql into ddlog schema declarations, which in turn
become schema.json.
@netj (Contributor, Author) commented Dec 30, 2015

@feiranwang It turns out to be a missing schema.json issue, so I converted schema.sql into app.ddlog as an easy way to generate the JSON file. The compiled Makefile was missing some dependencies for the mentioned target, which get generated from the relational schema. I fixed things to show such errors more transparently, e.g.:

    $ deepdive plan process/grounding/variable/person_has_cancer/assign_id
    make: *** No rule to make target `data/person_has_cancer.done', needed by `process/grounding/variable_id_partition.done'.  Stop.
    Error in dependencies found for process/grounding/variable/person_has_cancer/assign_id

Ideally, these should be caught in the deepdive compile phase. It'd be nice if you could contribute a checker that prevents this kind of error. Also, it seems we're missing a test for the smoke example.
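Such a checker could be as small as a script cross-checking the compiled Makefile against schema.json; a rough sketch (the run/ paths and the JSON shape are assumptions for illustration, not the actual compiled layout):

    # Every data/<relation>.done prerequisite in the compiled Makefile
    # should correspond to a relation declared in schema.json.
    for dep in $(grep -o 'data/[A-Za-z0-9_]*\.done' run/Makefile | sort -u); do
      rel=${dep#data/}; rel=${rel%.done}
      jq -e --arg r "$rel" 'has($r)' run/schema.json >/dev/null ||
        echo "ERROR: $dep is required but relation '$rel' is not declared"
    done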

instead of separating the process group, which seems to have a lot of
undesirable side effects, e.g., `^C` doesn't work when deepdive-do is
run from a script instead of a shell prompt.  Moreover, any descendant
could start its own session/process group, so it was a fragile approach
anyway.

This reverts commit 9242b4e, removing
the setsid utility.
and also enhances postgresql's db-load to use the `\COPY` psql client
command instead of the `COPY` server statement, to deal with potential
permission issues.  See: https://wiki.postgresql.org/wiki/COPY
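The distinction matters because the two read files with different permissions (table and file names illustrative):

    psql -c "\\copy articles FROM 'input/articles.tsv'"  # the psql client reads the file
    psql -c "COPY articles FROM '/data/articles.tsv'"    # the server reads the file,
                                                         # typically requiring superuser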
competing with the log by forcing `stty echo` after the `pv -c`.
to make multinomial grounding work correctly.

Also drops the fallback to postgresql's SEQUENCE, as it does not work.
and makes postgresql tests not enumerated for postgresql-xl or
greenplum