DeepDive execution plan compiler #445
Conversation
- Bugfix for mishandling of `&` in URLs that prevented `DEEPDIVE_JDBC_URL` from being set.
- Support for all [sslmode=](http://www.postgresql.org/docs/9.2/static/libpq-connect.html#LIBPQ-CONNECT-SSLMODE) values instead of just `ssl=true`, which corresponds to `sslmode=require`.
- `PGSSLMODE` default changed to `prefer` instead of `disable`.
that ensures the parent path exists
that can be set up via `source $(deepdive whereis etc/deepdive_bash_completion.sh)`. A `deepdive whereis` command has been added for finding resources in the installation.
Both the schema JSON compiled from DDlog and the user's schema.json file are considered.
and some cosmetic changes: relative times for timestamps, sub-step printing for easier progress tracking within a process, etc.
Allowing users to:
- initialize selected relations without recreating the database
- load a relation from multiple `.tsv`, `.csv`, or `.json-seq` sources, including compressed (`.bz2`, `.gz`) and generated (`.sh`) sources
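A minimal sketch of how a loader might dispatch on a source's extension; the `load_source` helper is hypothetical, not DeepDive's actual code:

```shell
# hypothetical helper: decompress or generate rows based on file extension
load_source() {
  case $1 in
    *.bz2) bzip2 -dc "$1" ;;   # compressed source
    *.gz)  gzip  -dc "$1" ;;   # compressed source
    *.sh)  bash "$1" ;;        # generated source: run it to produce rows
    *)     cat "$1" ;;         # plain .tsv/.csv/.json-seq
  esac
}

tmp=$(mktemp -d)
printf 'hello\tworld\n' | gzip > "$tmp/rows.tsv.gz"
load_source "$tmp/rows.tsv.gz"   # prints the original row
rm -rf "$tmp"
```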
- The augmented, normalized deepdive config object will stay under `.deepdive_`
- All normalized/qualified fields will stay under fields whose names end with an underscore
Also adds processes for loading weights and marginal probabilities back into the database.
except the ddlog spouse example, which requires calibration
@alldefector Sorry the conversation went private through Slack. Yes, that's exactly our plan, except using bash process substitution to turn
@zifeishan Could you make the changes, confirm it works, and add the commit to this PR?
Apparently that's actually what we need, because PGXL would throw up (Zifei's error messages) if we try to run a bunch of mutating/DDL statements in one transaction.
@netj I tried this on several examples.
Is this expected?
@feiranwang Can you try putting a
@netj I added a
and fixes it to surface dependency errors between the processes, or any error from make(1), instead of obscuring everything as an "Unknown target" error.
by converting schema.sql into DDlog schema declarations, which in turn become schema.json.
@feiranwang It turns out to be a missing schema.json issue, so I converted schema.sql into app.ddlog as an easy way to generate the json file. The compiled Makefile was missing some dependencies for the mentioned target, which gets generated from the relational schema. I fixed things to show such errors more transparently, such as:
Ideally, these should be caught in the
instead of separating the process group which seems to have a lot of undesirable side effects, e.g., `^C` doesn't work when deepdive-do is run from a script instead of shell prompt. Moreover, any descendant could start its own session/process group, so it was a fragile approach anyway. This reverts commit 9242b4e, removing the setsid utility.
and also enhances PostgreSQL's db-load to use the `\COPY` psql client command instead of the `COPY` server statement, to deal with potential permission issues. See: https://wiki.postgresql.org/wiki/COPY
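The difference matters because server-side `COPY ... FROM 'file'` reads the file as the database server process (typically requiring superuser and a server-local path), while the psql `\copy` meta-command streams the file from the client with the invoking user's permissions. A sketch of the invocation, shown as a dry run with `printf` and made-up relation/path names:

```shell
# hypothetical relation and file path; drop the printf wrapper to run psql
relation=articles
tsv=/tmp/articles.tsv
# client-side \copy: the file is read by psql, not by the server process
printf '%s\n' "psql -c \"\\copy $relation from '$tsv'\""
```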
competing with the log by forcing `stty echo` after the `pv -c`.
to make multinomial grounding work correctly. Also drops the fallback to PostgreSQL's SEQUENCE, as it does not work.
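For illustration only (this is not DeepDive's actual grounding code): contiguous 0-based ids can be assigned as a plain stream filter outside the database, the kind of approach that avoids relying on a SQL SEQUENCE:

```shell
# assign a contiguous 0-based id to each streamed row (illustrative sketch)
printf 'alice\nbob\ncarol\n' | awk '{ print NR-1 "\t" $0 }'
# → 0	alice
#   1	bob
#   2	carol
```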
and makes postgresql tests not enumerated for postgresql-xl or greenplum
This not-so-small PR adds a new implementation of DeepDive that literally compiles an execution plan to run the app much more efficiently in both human time and machine time/space. I invite everyone to review the code and give it a try on your existing app (especially @feiranwang @senwu @ajratner @zhangce @raphaelhoffmann @alldefector @zifeishan @xiaoling @Colossus @ThomasPalomares @juhanaka). The plan is to get feedback over the next few days while I update the documentation and carve out release v0.8.0.
An execution plan is basically a set of shell scripts that tell what to run for the user-defined extractors as well as the built-in processes for grounding the factor graph and performing learning and inference, plus a Makefile that describes the complete dependencies among them. These are compiled from the app's deepdive.conf, app.ddlog, and schema.json, mainly by a series of JSON transformations implemented as jq programs with a little bit of help from bash, all of which reside under `compiler/` in the source tree. It implements most of the existing functionality provided by the current Scala implementation, except for a few things, e.g., Greenplum parallel unloading/loading, which won't be difficult to add given the much more modular architecture. On the other hand, exciting new features and improvements are added. To highlight just a few:
- `tsv_extractor`s now have nearly zero footprint on the filesystem. The data is streamed from the database through the UDFs and back to the database. mkmimo is used with named pipes to make the connection between the database and the parallel UDF processes (cf. `$DEEPDIVE_NUM_PROCESSES`). (fixes Possible to get rid of UDF input data splitting? #428) It doesn't support other extractor styles yet, but we can probably drop them unless there's a compelling reason. (closes PGXL PL/Python extractors keep running after Ctrl-C'ing DeepDive #384)
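The pipe-based streaming can be sketched with a bare FIFO; mkmimo itself and the real unload/UDF commands are not shown, and the writer and reader below are stand-ins:

```shell
# a named pipe connects an unloader (writer) to a UDF (reader) with no
# intermediate files on disk -- stand-in commands, not DeepDive's actual ones
tmp=$(mktemp -d)
mkfifo "$tmp/rows"
printf 'a\tb\nc\td\n' > "$tmp/rows" &   # pretend: database unload
tr 'a-z' 'A-Z' < "$tmp/rows"            # pretend: UDF transforming rows
wait
rm -rf "$tmp"
```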
- …the `tsv_extractor` mentioned above. Moreover, the grounding processes as well as the user-defined extractors also make use of the compute drivers, so some parts of the grounding will automatically take advantage of such extensions.
- The `deepdive.pipeline` config is obsolete, as is the `deepdive run` command, although they'll work the same as before. Now, the user can simply state the goal of execution to the `deepdive do` command, e.g., `deepdive do model/calibration-plot` or `deepdive do data/has_spouse_features`, as many times as necessary once the app is compiled with the `deepdive compile` command. Supporting commands such as `deepdive plan`, `deepdive mark`, and `deepdive redo` are there to speed up typical user workflows with DeepDive apps.
- `deepdive initdb` and `deepdive load` have been improved to be more useful, and are used by a few compiled processes. (fixes Separate initdb command / flag for input data vs. extractions schemas #351, fixes schema.sql file generated from ddlog? #357)
- …the `deepdive check` command. The checkers are modular and isolated, so many useful checks can be quickly added, such as test-firing UDF commands, etc. Only basic checks have been implemented so far. (fixes deepdive run doesn't report an error when extractor dependencies are misspelled #349, fixes Configuration sanity check #1)

Also some good stuff that comes with a clean rewrite:
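Since the compiled plan is ultimately a Makefile of dependent processes, `deepdive do TARGET` behaves conceptually like running `make TARGET` over it. A toy Makefile (with made-up recipes echoing the target names above) shows the dependency-driven execution:

```shell
# sketch only: a stand-in for the compiled plan's Makefile, not its real contents
tmp=$(mktemp -d)
printf '%s\n' \
  'data/has_spouse_features: data/sentences' \
  '	@echo extract features' \
  'data/sentences:' \
  '	@echo load sentences' > "$tmp/Makefile"
# make runs the prerequisite first, then the requested target
make -f "$tmp/Makefile" -s data/has_spouse_features
rm -rf "$tmp"
```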