Skip to content

lang_datalog

Boris Glavic edited this page Jan 30, 2017 · 11 revisions

Datalog Provenance

GProM supports the generation of provenance graphs that explain why a tuple exists in the result of a Datalog query respective why it is missing from the result. Several programs, provenance requests, and provenance graphs can be found at the datalog_prov page (https://github.com/IITDBGroup/gprom/wiki/datalog_prov) as an example. More example programs and provenance requests are available in the examples/datalogAndProvGraphs directory, and the provenance graphs produced for these examples are included in examples/datalogAndProvGraphs/graphs. We provide convenience scripts for running these examples using the included SQLite database. Use genProvGraph.sh to generate a pdf provenance graph and runQuery.sh to run a query that returns the edge relation of the provenance graphs (or the result if provided with a Datalog query without provenance requests).

Run GProM with Datalog as a Frontend Language

The datalog parser is activated by changing the parser, analyzer, and translator plugins to dl using the following command line options.

gprom -frontend dl

This will cause GProM to translate the input Datalog program into relational algebra (and translate this into SQL). Use -Pexecutor run to run the SQL code or -Pexecutor sql to just print the SQL code to stdout.

If the program asks for provenance (see below) then the -Pexecutor gp option will cause GProM to output a .dot script for the provenance graph. For convenience, we provide several shell scripts in the scripts folder that activate Datalog support:

  • gprom-dl.sh - starts an interactive shell to run Datalog queries.
  • gprom-dl-to-dl.sh - takes two parameters: 1) a loglevel, 2) a Datalog program with optional provenance requests. This script returns an instrumented Datalog program that computes the edge relation of the provenance graph based on the user's provenance request (if any).
  • gprom-dl-to-sql.sh - takes two parameters: 1) a loglevel, 2) a Datalog program. Returns an SQL query that computes the edge relation of the provenance graph based on the user's provenance request (if any) or an SQL translation of the input program otherwise.
  • gprom-dl-from-file-to-sql.sh - takes two parameters: 1) a loglevel, 2) a file storing a Datalog program. Returns an SQL query that computes the edge relation of the provenance graph based on the user's provenance request (if any) or an SQL translation of the input program otherwise.
  • gprom-dl-prov-graph.sh - takes two parameters: 1) a Datalog program with a provenance request, 2) an output file name. This script generates a pdf file storing the provenance graph constructed for the user's provenance request. This requires graphviz to be installed.
  • gprom-dl-from-file-prov-graph.sh - takes two parameters: 1) a file storing a Datalog program with a provenance request, 2) , 2) an output file name. This script generates a pdf file storing the provenance graph constructed for the user's provenance request. This requires graphviz to be installed.

For example, assuming you have a relation R(A,B) in an Oracle database, to generate the provenance graph for explaining why Q(1) is in the result of query Q(X) :- R(X,Y). and store the graph in my_prov_graph.pdf you would run:

gprom-dl-prov-graph.sh "Q(X) :- R(X,Y). WHY(Q(1))." my_prov_graph -host 1.1.1.1 -user usr -passwd mypass -port 1521 -db orcl

Datalog Dialect

So far our parser and internal model supports non-recursive Datalog with negation, the ability to specify an answer relation, and constructs that ask for game provenance. A Datalog rule consists of a head (a relational atom) and a body (a list of relational and comparison atoms and is denoted as head :- body .. An relational atom is an identifier followed by a list of arguments in parenthesis. Arguments can be constants (string constants enclosed in ' and numeric constants) or variables.

A few example Datalog rules are shown below:

Q(X) :- R(X,Y), S(Y,Z).
Q(X) :- R(X,Y), X < 5, X < Y.

Rule Safety

We require all rules to be safe:

  • all variables in the head have to appear in a positive relational atom in the body
  • all variables in a negative relational atom in the body have to occur in at least one positive relational atom in the body
  • all variables used in a comparison atom have to occur in at least one positive relational atom in the body

Expressions in the Rule Head

We allow function application, constants, and arithmetic expressions in the head of a rules. This works like projection in relational algebra. For example:

Q(X,3+ Y) :- R(X,Y).

Datalog Programs

A Datalog program consists of a number of rules and optionally an answer relation specification. In the translation to SQL this answer relation will be the one computed by the resulting SQL query. For example, in the following Datalog program we will compute IDB relation Q

Q(X) :- R(X,Y).
Q2(X) :- Q(X).
ANS : Q.

Currently we do not support recursive programs

Provenance Features

Our Datalog dialect currently supports two types of provenance questions. Both ask for the game provenance rooted a particular IDB relational atom with constant values. WHY(Q(C)) where C is a list of constants computes the move relation for the game provenance graph that explains why Q(C) is won, i.e., why Q(C) is in the result of the program. WHYNOT(Q(C)) computes the move relation for the game provenance graph for why Q(C) is lost, i.e., why Q(C) is not in the result of the program. For example, the following computes why a is reachable in 3 hops from a:

Thop(X,Y) :- hop(X,A), hop(A,B), hop(B,Y).
WHY(Thop('a','a')).

RPQ support

We support RPQ (Regular Path Queries) as a macro-like construct that is expanded into a set of Datalog rules before analyzing and translating the program. Note that some RPQ language constructs will result in recursive queries that are currently not supported by GProM. For such queries you can request GProM to return the resulting Datalog program.

The new construct is RPQ(query,resultType,edgeRelation,resultRelation).

  • query is an RPQ expression enclosed in single quotes 'a.b.c*'
  • resultType determines what kind of results are produced:
    • RESULT returns a relation with two attributes storing pairs of nodes that are connected via at least one path that matches the regular path expression
    • PROV returns a relation with five attributes that stores result nodes (first two attributes) paired with edges from the input graph that lie on paths between the result nodes that match the regular path expression
    • SUBGRAPH returns the subgraph induced by the regular path expression, i.e., the subgraph of the input graph limited to edges that lie on paths matched by the regular path expression.
  • edgeRelation the relation with stores the input graph, i.e., labelled edges as tuples (inNode,label,outNode)
  • resultRelation the storing the query results

for example to find all nodes that are reachable from node 'n1' with a path a.b.:

RPQ('a.b',RESULT,e,RP).
Q(Y) :- RP(X,Y), X = 'n1'.