
Dataset / ConjunctiveGraph #307

Closed
gromgull opened this issue Jun 20, 2013 · 20 comments

Comments

@gromgull
Member

This issue is to collect the discussion of Dataset vs. ConjunctiveGraph from #301, #302, #303 and friends.

Similarities between ConjunctiveGraph and DataSet:

  • Both allow you to group triples into separate graphs. You can either look at it as "quads" (with a fourth context element), or as a collection of graphs, each containing triples; the two views are interchangeable.
  • Both should have quad methods, like addN, add_quad, remove_quad, quads, etc.
  • Both should have graph methods, like get_context (or get_graph), listing, deleting, etc.

The differences between ConjunctiveGraph and DataSet:

  • ConjunctiveGraph allows (as the name implies) querying the conjunction (union) of all graphs
  • ConjunctiveGraph has a named default graph, implicitly named with a bnode if nothing else is given.
  • DataSet has the concept of a default graph, but the exact definition is open.
  • DataSet tracks the existence of empty graphs, a ConjunctiveGraph only contains the graphs that contain triples

Some more concerns:

  • Our Store API supports (somewhat implicitly) the idea of a union of graphs if no context is specified. I would rather not change the store API at this point, i.e. I would not like a solution that requires stores to change to accommodate yet another graph class
  • For SPARQL, it is often interesting to define a new DataSet on the fly, i.e. selecting some subset of graphs from another DataSet, or even from several datasets. It would be nice to be able to do this without copying all the triples.

I suggest:

  • Add a new class ContextAwareGraph, that has flags for union_default_graph and track_empty_graphs. This class also provides quads/graph methods.
  • Let ConjunctiveGraph and DataSet extend this class, each setting the flags the way they want.
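A minimal, purely illustrative sketch of the proposed hierarchy (the class and flag names come from the proposal above; everything else is hypothetical):

```python
class ContextAwareGraph:
    """Hypothetical base class: holds the two behaviour flags and would
    carry the shared quad/graph methods."""
    union_default_graph = False
    track_empty_graphs = False

class ConjunctiveGraph(ContextAwareGraph):
    # queries the union of all graphs; only non-empty graphs "exist"
    union_default_graph = True
    track_empty_graphs = False

class Dataset(ContextAwareGraph):
    # separate default graph; empty graphs are tracked
    union_default_graph = False
    track_empty_graphs = True
```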

Not clear to me just now:

  • How are the empty graphs of a DataSet persisted? Looking at the current code I would say they are not. I.e. if I open a DataSet on a Sleepycat store, create some graphs, and exit my program, then the next time I open it, only graphs that contained triples will exist.
  • What about Python operators on context-aware graphs? __sub__, __contains__, etc. This is #225 (Think about __iadd__, __isub__ etc. for ConjunctiveGraph).
@gromgull
Member Author

I started working on this in a branch (not yet pushed), and realised there is another difference:

  • DataSet only allows adding triples to graphs that exist already. If you use add_quad with a context that does not already exist, the quad is silently ignored. ConjunctiveGraph, on the other hand (for addN), will create graphs as necessary. (DataSet.graph is the only way to CREATE a graph.)

@gromgull
Member Author

@joernhees had an idea for persisting the list of graphs that "exist" - without really changing the store API. By adding a (None,None,None) "marker triple" to the Graph. Any query methods must be modified to not actually return this triple of course.

This is a bit dirty :), but maybe better than the alternative.

This does not let us solve having a DataSet only expose SOME of the graphs that are stored in a store, though. This would be useful for SPARQL DataSet support, for instance; however, that is very unlike the ConjunctiveGraph, and maybe an indicator that sub-classing wasn't a good idea after all.

gromgull added a commit that referenced this issue Jun 20, 2013
@uholzer
Contributor

uholzer commented Jun 20, 2013

@gromgull What kind of stores are there around? We have some in-memory stores, the sleepycat one, and SPARQLStore, right? Are there other (third-party) stores, for example based on virtuoso, Jena, or Sesame? Do these stores all behave exactly like the store API requires? I could imagine that depending on the backend, the default graph is treated differently. Some could even add triples generated by a reasoner.

So, maybe it would be wise to let the store decide. Also, stores could be configured by passing parameters down to them and they could tell which options they support. I know that you don't want to change the store API and I agree, but I also like to know how consistently stores implement the API. Breaking old implementations of stores is not acceptable, but maybe one could carefully do a small refinement?

About adding a marker triple: Do all stores support a (None, None, None) triple? I highly doubt that. And what about other applications reading the same store through other means?

@uholzer
Contributor

uholzer commented Jun 25, 2013

@gromgull and @iherman: I still wonder what a Store is supposed to be and how it should interact with the wild world out there. So please, can anyone comment on my last question?

I was working on the SPARQLStore some weeks ago and there I am completely at the endpoint's mercy. The endpoint decides what it wants to do with the default graph and whether it stores empty graphs. I can not even know beforehand. This is why I still think that these issues should be left to the stores to some degree.

@iherman
Contributor

iherman commented Jun 25, 2013

Well...

https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-concepts/index.html#section-dataset

There were lots of discussions in the group about that and, unfortunately, the
reality is that SPARQL does not say too much about this either and therefore we
get to what you experienced: some SPARQL engines use the default graph as a
union and others do not. The group may (but only may) define some rdf terms
characterizing the whole dataset (essentially, <> rdf:SOMETHING "true" triples
in the default graph) and one of those characterization may be that the default
graph contains the union of all graphs.

Which means that, as you say, at the moment the endpoint decides indeed. Sigh...

This is also why I believe the Dataset class in RDFLib should not use the
default graph as a union...

Ivan


@uholzer
Contributor

uholzer commented Jun 25, 2013

gromgull and I (in #301 (comment)) think that the default graph should be implemented by using a special named graph (for example urn:x-rdflib:default). Stores that know better can then always treat this graph specially. It should be easy to do this using ConjunctiveGraph (see commit 68d0b5d by gromgull).

Solutions for empty graphs:

  • Using a dummy triple (None, None, None). I think this is too spooky. One can not expect stores to be able to handle None, because backends like Jena, Virtuoso and so on likely can't.
  • Track empty graphs in Dataset internally. This does not work with persistent stores, because when starting the application the next time, the empty graphs are forgotten.
  • Track each graph by adding a triple to a special named graph. (I like this one)
  • The stores should implement this. (gromgull doesn't like this one. It is my favourite ;-))

Now I am at the end of my wisdom.

@gromgull
Member Author

Sorry for being unresponsive, I am 3 days away from changing jobs, and turns out there were quite a few things to sort out :)

It doesn't seem to matter though, because @uholzer very accurately represents my opinions anyway!

What kind of stores are there around? We have some in-memory stores, the sleepycat one, and SPARQLStore, right?

These are in core rdflib. There are also a SQLAlchemy store, and a ZODB store, and various key-value stores (leveldb, kyotocabinet) that we claim to maintain. In the wild, there are stores for 4store, virtuoso, mysql, redland, postgresql, sqlite and probably more, most of which are not up-to-date as far as I know.
Then there are some meta-stores in core rdflib.plugins.stores, some of which are also poorly maintained: auditable, concurrent and regex matching.

For most stores, we control how triples/contexts are stored, and they can do whatever we decide with default graphs, unions, and tracking of empty graphs. The SPARQLStores, and the ones wrapping another RDF store (4store, virtuoso), are at the mercy of whatever they are talking to.

I agree that the (None, None, None) triple solution really is too dirty, and I see no other clean way to do this than extending the store API.

I would propose:

  • a flag, defaulting to false, that would state whether the new API methods are supported (for instance graph_aware) - this would let us have default implementations of the methods that raise an exception, and we would be sort of backwards compatible.
  • as small a change as possible :)
  • Maybe add_graph, remove_graph is all we need? Adding/removing triples from contexts is already possible. Listing all graphs can be done with contexts

How this is actually implemented internally in the store is up to each implementation?
If some store wants to stay pure, it can do the "save the graph meta-data in a magic context" trick?
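A rough sketch in plain Python of what such a minimal extension could look like (the `graph_aware` flag and the method names come from the proposal; the exception type and everything else is illustrative):

```python
class Store:
    # existing store implementations keep working: the flag defaults to
    # False and the default implementations refuse to run
    graph_aware = False

    def add_graph(self, graph):
        raise NotImplementedError("Store is not graph-aware")

    def remove_graph(self, graph):
        raise NotImplementedError("Store is not graph-aware")


class GraphAwareStore(Store):
    """A store that opts in to tracking (possibly empty) graphs."""
    graph_aware = True

    def __init__(self):
        self._graphs = set()

    def add_graph(self, graph):
        # how this is persisted internally is up to the implementation,
        # e.g. the "magic context" trick
        self._graphs.add(graph)

    def remove_graph(self, graph):
        self._graphs.discard(graph)
```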

@gromgull
Member Author

For the DataSet union-default graph or not, it should be a flag when you construct the DataSet - just like Jena does it for TDB.

@joernhees
Member

it seems someone passed me the dirty solutions hat, so:
what about (BNode(), BNode(), BNode()) ?

@joernhees
Member

and no, i don't really think it's a good idea, but how is a "void" triple different from / worse than a special named graph?
The idea of (None, None, None) came from it not being a valid triple but being accepted by rdflib's stores.
A special named graph or a visible void triple would certainly lead to more confusion, wouldn't they?

@iherman
Contributor

iherman commented Jun 26, 2013

This would be rejected by the system, because a BNode is not allowed as a
predicate...

Ivan


@iherman
Contributor

iherman commented Jun 26, 2013

Urs Holzer wrote:

> the default graph should be implemented by using a special named graph (for example urn:x-rdflib:default) …

Yes, this is already what I proposed to do and have in my temporary
experimentation. The only difference between what you say and what I did was to
use this specific URN for the default_context of the CG (that can be set at
initialization time).

> Solutions for empty graphs: …

I must admit that adding something to the store interface is obviously the
cleanest approach. Everything else is a hack.

If we do a hack, I would actually prefer to make it meaningful. Adding a bona
fide triple to a graph is nice with the open world assumption; so adding
something like ([], rdflib:isInDataset, "yes"), or something like that, looks
like the least evil. Serializers might even choose to filter that one out. But
that is only if the store interface is not extended...

Ivan


@joernhees
Member

> This would be rejected by the system, because a BNode is not allowed as a predicate...

What system rejects it? As i recently learned, rdflib is pretty much cool with everything, including BNodes as predicates or Literals as subjects ;)
It will cause problems with some serializers, though, and probably with some stores.

I thought it might be a solution to abuse the fact that rdflib can do some more stuff internally, and have the whole thing never be visible to the user.
But on second thought, it will end up in stores and from there maybe be exported without rdflib sanitizing it, resulting in invalid triples out there, so apologies for the dirty solutions idea.

The more i think about this, the better i like the explicit solutions, adding either triples to each graph or adding a special graph.

I first favored the adding-triples-to-each-graph solution, but that was when i still thought rdflib would keep the whole thing transparent to the user.
If we want to do this explicitly though, this method has the disadvantage that many people use graphs as "source" fields. For example, if i pull in some RDF from a URI i might just put that into a graph with that URI, and later on i might iterate over all triples in there. While there is an open world assumption, i wouldn't expect rdflib to inject something into this graph; i'd expect it to contain exactly those triples that i received.

Adding a special graph for tracking the graphs on the other hand at least wouldn't have that "hah look, dbpedia uses the rdflib vocabulary now" side effect.
It could still cause surprise if you iterated over all graphs in a ConjunctiveGraph or Dataset, but internally we could handle this transparently. Externally people would have to deal with it, but at least they would be able to easily retrieve the "original" graphs.
I'd favor this solution now, probably having a graph like this:
```
x a rdflib:Graph .
y a rdflib:Graph .
```
where x and y are the graphs in the Dataset / ConjunctiveGraph.

@uholzer
Contributor

uholzer commented Jun 26, 2013

@joernhees (BNode(),BNode(),BNode()) is a bad idea for two reasons: First, it is a valid triple for rdflib, although not for RDF. Such a triple could turn up for example in the process of reasoning and it could also happen that the W3C will allow BNodes in a future revision of RDF. Second, not all stores can handle it. Therefore, this would give us at least as many problems as changing the store API directly.

gromgull added a commit that referenced this issue Jun 26, 2013
discussed in #307

Summary of changes:

 * added methods ```add_graph``` and ```remove_graph``` to the Store
   API, implemented these for Sleepycat and IOMemory. A flag,
   ```graph_awareness```, is set on the store if the methods are
   supported; default implementations will raise an exception.

 * made the dataset require a store with the ```graph_awareness```
   flag set.

 * removed the graph-state kept in the ```Dataset``` class directly.

 * removed ```dataset.add_quads```, ```remove_quads``` methods. The
   ```add/remove``` methods of ```ConjunctiveGraph``` are smart enough
   to work with triples or quads.

 * removed the ```dataset.graphs``` method - it now does exactly the
   same as ```contexts```

 * cleaned up a bit more confusion of whether Graph instance or the
   Graph identifiers are passed to store methods. (#225)
@gromgull
Member Author

"Luckily" I got to wait 2-3 hours for a doctors appointment today - and I had some time to look into this, I made a pull request with my changes, see: #309 for details.

gromgull added a commit that referenced this issue Jul 29, 2013
discussed in #307

@gromgull
Member Author

This was hopefully solved by the merge of #309 - for any remaining issues please open a new issue!

@sindikat

> ConjunctiveGraph allows (as the name implies) querying the conjunction (union) of all graphs

What should

```python
ds = Dataset()
# add some triples to a non-default named graph in `ds`
ds.query(some_sparql_query)
```

return then?

@sindikat

> DataSet tracks the existence of empty graphs

This is not true as of commit 06dae6a.

```python
from rdflib import Namespace, Graph, Dataset
from rdflib.plugins.memory import IOMemory

ns = Namespace("http://love.com#")

store = IOMemory()
triple = (ns.s, ns.p, ns.o)

identifier = ns.mary
# No triple
ds = Dataset(store=store)
g = Graph(store=store, identifier=identifier)
print 'No triple:', list(ds.contexts())
# No triple: [<Graph identifier=urn:x-rdflib:default (<class 'rdflib.graph.Graph'>)>]

# Added triple
g.add(triple)
print 'Added triple:', list(ds.contexts())
# Added triple: [<Graph identifier=urn:x-rdflib:default (<class 'rdflib.graph.Graph'>)>, <Graph identifier=http://love.com#mary (<class 'rdflib.graph.Graph'>)>]
```

@gromgull
Member Author

You are not creating a graph in the dataset, though; you are creating another graph using the same store. To track the graph, it must be added with Dataset.add_graph, i.e.:

```python
ds = Dataset(store)
g = ds.add_graph("urn:my.graph")
...
```

@gromgull
Member Author

> What should
>
> ```python
> ds = Dataset()
> # add some triples to a non-default named graph in `ds`
> ds.query(some_sparql_query)
> ```
>
> return then?

That depends on the SPARQL query: if the query only queries the default graph, it returns nothing.
