Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

namespace.py fix compute_qname missing namespaces #649

Merged
merged 14 commits into from Mar 16, 2020

Conversation

tgbugs
Copy link
Contributor

@tgbugs tgbugs commented Aug 25, 2016

The way the compute_qname worked was to call split_uri
and then see if there was a match in self.store.prefix.
This produced incomplete behavior for namesapces that
end in LETTERS_ instead of / or #. This commit corrects
this behavior by iterating through name and testing
namespace + name[:i] to see if there is a matching prefix.

@joernhees
Copy link
Member

joernhees commented Oct 15, 2016

tests fail... what spec are you referring to when saying that namespaces should not end in / or # ?

@joernhees joernhees added this to the rdflib 4.2.2 milestone Oct 15, 2016
@joernhees joernhees added serialization Related to serialization. discussion labels Oct 15, 2016
@tgbugs
Copy link
Contributor Author

tgbugs commented Oct 18, 2016

Maybe I have incorrectly assumed that OWLAPI implements prefix handling according to the spec. This change was intended to match its behavior. To clarify about / and # endings: these are correct, and they remain correct with my changes, essentially this adds the ability to use arbitrary prefixes.

I will check again to see if this is in 4.2.2 and figure out why the tests are failing.

@satra
Copy link
Contributor

satra commented Oct 21, 2016

@tgbugs - i decided to fix this together with #632 in the turtle serializer itself. see here: #661

and for the full discussion between @gromgull and myself see #660

@tgbugs
Copy link
Contributor Author

tgbugs commented Oct 21, 2016

@satra the solution in #661 looks good to me. Closing this. Thanks!

@tgbugs
Copy link
Contributor Author

tgbugs commented Oct 25, 2016

I have reverted the change to NAME_START_CATEGORIES that was causing the build failures. Tests should pass now.

@tgbugs tgbugs reopened this Oct 25, 2016
@satra
Copy link
Contributor

satra commented Oct 26, 2016

@tgbugs - with this change does the following work for you:

    graph = Graph()
    graph.bind('GENO', 'http://purl.obolibrary.org/obo/GENO_')
    graph.bind('RO_has_phenotype',
                    'http://purl.obolibrary.org/obo/RO_0002200')
    graph.add((URIRef('http://example.org'),
               URIRef('http://purl.obolibrary.org/obo/RO_0002200'),
               URIRef('http://purl.obolibrary.org/obo/GENO_0000385')))
    output = [val for val in
              graph.serialize(format='turtle').decode().splitlines()
              if not val.startswith('@prefix')]
    output = ' '.join(output)
    assert 'RO_has_phenotype: ' in output
    assert 'GENO:0000385' in output

@tgbugs
Copy link
Contributor Author

tgbugs commented Oct 26, 2016

@satra No, that example fails, producing ' <http://example.org> ns1:RO_0002200 ns1:GENO_0000385 . '. From my experiments the issues arrises because a call to graph.namespace_manager.qname(URIRef('http://purl.obolibrary.org/obo/RO_0002200')) (which is buried in serialize) forces the creation of ('ns1', rdflib.term.URIRef('http://purl.obolibrary.org/obo/')). A call to graph.namespace_manager.qname(URIRef('http://purl.obolibrary.org/obo/GENO_0000385')) does not cause this behavior, which leads me to believe it is because my code stops just short of the end (it assumes that users would not prefix an entire iri) I think I can fix this and will look into it.

@tgbugs
Copy link
Contributor Author

tgbugs commented Oct 26, 2016

@satra Fixed. Was stopping at len(name) instead of len(name) + 1. Your tests now pass. If we care about efficiency the slowest part of this seems to be the fact that I have to convert the concatenated name to a URIRef every time in order to check the prefixes.

I can pull in your changes to test/test_turtle_serialize.py to this pr if you want.

@satra
Copy link
Contributor

satra commented Oct 26, 2016

@tgbugs - great. could you please add the test below to: test_turtle_serialize.py

@gromgull and @joernhees - let's focus on this PR and i'll close the other one. any questions or considerations we are overlooking?

def test_turtle_namespace():
    graph = Graph()
    graph.bind('GENO', 'http://purl.obolibrary.org/obo/GENO_')
    graph.bind('RO_has_phenotype',
                    'http://purl.obolibrary.org/obo/RO_0002200')
    graph.add((URIRef('http://example.org'),
               URIRef('http://purl.obolibrary.org/obo/RO_0002200'),
               URIRef('http://purl.obolibrary.org/obo/GENO_0000385')))
    output = [val for val in
              graph.serialize(format='turtle').decode().splitlines()
              if not val.startswith('@prefix')]
    output = ' '.join(output)
    assert 'RO_has_phenotype: ' in output
    assert 'GENO:0000385' in output

@tgbugs
Copy link
Contributor Author

tgbugs commented Oct 27, 2016

I slightly modified the test case for this so that it forces us to match the longest prefix which is contained in an iri instead of the shortest. I'm pretty sure we can improve significantly on the performance here if needs be.

@tgbugs
Copy link
Contributor Author

tgbugs commented Oct 28, 2016

Actual issue: if there is an exact prefix match then it doesn't check to see if there are longer matches. Will fix.

@satra
Copy link
Contributor

satra commented Dec 9, 2016

@gromgull and @joernhees - anything else needed here?

@satra
Copy link
Contributor

satra commented Jan 19, 2017

@gromgull - just a ping here to see if this can be merged. also is there a timeline for a release?

@gromgull
Copy link
Member

I hope to get a 4.2.2 out "quite soon", i.e. maybe before end of february. I'll review this before then!

output_prefix = prefix
output_namespace = namespace
output_name = name
for i in xrange(1, len(name) + 1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we really cannot have this loop here. This function gets called for every single URI in the store on serialisation.

You can have a loong URI and this will call the store for each substring in that URL. For some stores, this means network activity (sqlalchemy, etc).

If we really want to fix this, the namespace manager needs to index the available prefixes by length in init and then try the longest first. But even this means many "if uri starts with x" calls. Some sort of trie-data-structure may be useful here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #213

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. This is what I expected we would have to do because this implementation produces worst case behavior all the time. Will take a look and see how it can be improved. Do we have any nice pathological cases that I could use for performance testing?

Copy link
Contributor Author

@tgbugs tgbugs Jan 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my first round of performance testing on the 3 different implementations produced some weird results which I think are not representative of the spectrum of workloads. I serialized https://github.com/SciCrunch/NIF-Ontology/blob/master/ttl/NIF-Molecule.ttl 10 times using 3 different implementations and average the time spend calling compute_qname in namespace.py as measured by cProfile in python3.5, there were 246680 calls to compute_qname for each run.

  • current mainline rdflib implementation of compute_qname: 1.2658s
  • loop over uri implementation (current for this pull request) of compute_qname: 1.1765s
  • loop on namespaces implementation of compute_qname: 1.2025s

This was a quick and dirty test and I am going to set it up a more robust suite since I don't believe that the implementation in question here is actually the fastest for all workloads as the numbers above would seem to suggest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out string.split is very, very slow. New implementation using len and slicing gives 1.162s using the same method as above.

@gromgull gromgull modified the milestones: rdflib 5.0.0, rdflib 4.2.2 Jan 24, 2017
if prefix is None:
name = None
for lennamespace, prefix, namespace in self.ordered_prefixnames:
if namespace in uri:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be if uri.startswith(namespace):

Copy link
Contributor Author

@tgbugs tgbugs Feb 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My testing showed that startswith is slower, however if there are cases where a namespace might somehow appear somewhere other than at the start of a URI then we would need to use startswith to prevent false positives. Personally I haven't been able to come up with any cases where this might happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yupp... correctness > speed

@joernhees
Copy link
Member

joernhees commented Feb 7, 2017

i'd leave the old method in as is and only if that doesn't find anything try your approach...

anyhow, we should probably overhaul the whole thing and use some trie / KMP

@tgbugs
Copy link
Contributor Author

tgbugs commented Feb 7, 2017

Do you have any examples where the old method is faster than the new one?

@joernhees
Copy link
Member

The old method uses a dict for lookup (so O(1) per lookup), based on splits by URI parts, so (O(m) for m parts per URI). Overall O(m).

Your method goes through all namespaces (on avg. n/2), so uses O(n) for n registered namespaces.

The old method will scale better in light of many registered namespaces. Your method is likely to be faster if only few namespaces are registered.

@tgbugs
Copy link
Contributor Author

tgbugs commented Feb 13, 2017

I'm happy to try the old implementation first, but it can produce incorrect results in cases where a valid namespace is defined that does not terminate in members of NAME_START_CATEGORIES (eg https://github.com/tgbugs/rdflib/blob/master/test/test_turtle_serialize.py#L78 will fail) so it will take more changes to make sure we are actually getting the longest matching namespace. In addition, looking at the existing split_uri method I wonder whether on average URIs are longer (m) or there are more prefixes defined (n). Clearly without knowing the empirical distributions of m and n we cant know for certain, but I suspect that on average m is going to be larger than n. Thoughts?

edit: My analysis is not quite right, because split_uri actually starts from the back of the URI so m is actually determined by the average fragment length, which means that m is indeed smaller than n on average. However, this means we still need to deal with the prefix length issue so I will take a look.

@nicholascar
Copy link
Member

OK, we're going through an rdflib management change and when that settles down and all roles are known - only a few days off - we'll be reviewing all open PRs and dealing with as many as we can. Sorry this has waited for so long but just give it one week more please! Nick

@tgbugs
Copy link
Contributor Author

tgbugs commented Mar 13, 2020

I'm working on rebasing this PR so that it is easier to review and merge.

satra and others added 14 commits March 12, 2020 19:23
test for longest substring
check if split_uri produced an exact prefix
squashed previous bad implementation attemps here

merged with rdflib/master resolving conflicts in favor of tgbugs
Attempting to implement the 'faster' version of compute_qname
revealed significant issues with the current split_uri implementation.
The bug caused only the first of multiple urls with a common
prefix to be inserted into the subtrie of a new prefix
If Nd is not in namestart categories then we get nasty
nondeterminism in serialization of hex strings.
If we do not repopulate the trie with known namespaces then we will not
be able to retrieve them with split_uri leading to major confusion.
Added a test to catch this case.
Fixed compute_qname bad behavior where it would fail to split a url and
return nothing despite the fact that the url itself was a valid
namespace. compute_qname now behaves consistently so that
http://foo.com/ -> foo: if foo is defined as the prefix. Added tests to
make sure this doesnt happened again.
These tests make it possible to catch issues with xml
serialization of qnames.
The rdfxml serializer relied on namespace.py compute_qname to enforce
qname correctness. Chagnging split characteristics to support n3
prefixes broke that assumption. This commit is a first pass at fixing
the behvior. It may be worth renaming functions to make it clear that
the current 'compute_qname' function is no longer actually computing
a qname but a n3 prefix (or something like that).
@nicholascar
Copy link
Member

Opening and closing to trigger Travis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion serialization Related to serialization.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants