Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URI Validation Performance Improvements #1177

Merged
merged 3 commits into from
Oct 8, 2020
Merged

Conversation

jbmchuck
Copy link
Contributor

@jbmchuck jbmchuck commented Oct 3, 2020

I've found _is_valid_uri to be a hotspot in serialization and
deserialization.

Benchmarking this version of the function against the original with 1e7
urls results in runtime improvement of ~5x

Tests against real RDF graphs with 2e6+ triples results in runtime
improvement of ~10-12% in serialization and deserialization.

Functionally there is one diff - I don't believe the ord(c) check is
necessary. In the original code effectively the check for each char
is that:

ord(c) > 256
        OR
c not in _invalid_uri_chars

Since for all c in _invalid_uri_chars ord(c) <= 256 there is no case where
the ord(c) > 256 condition holds true but c not in _invalid_uri_chars
does not also.

I tried a few other approaches including performing this check in a list
comprehension, e.g.

return bool([c for c in _invalid_uri_chars if c in uri])

but this was significantly slower for large numbers of calls, though faster than
the original code by ~2x.

I've found _is_valid_uri to be a hotspot in serialization and
deserialization.

Benchmarking this version of the function against the original with 1e7
urls results in runtime improvement of ~2x.

Tests against real RDF graphs with 2e6+ triples results in runtime
improvement of ~8% in serialization and deserialization.
@jbmchuck
Copy link
Contributor Author

jbmchuck commented Oct 3, 2020

Benchmark script:

 import time

_invalid_uri_chars = '<>" {}|\\^`'

def _is_valid_uri_orig(uri):
    return all([ord(c) > 256 or not c in _invalid_uri_chars for c in uri])


def _is_valid_uri_new(uri):
    for c in _invalid_uri_chars:
        if c in uri:
            return False
    return True

sample_uris = [
    "http://foo.{x}.com".format(x=x) for x in range(1, int(1e7))
]


while True:
    start_t = time.time()
    for sample_uri in sample_uris:
        _is_valid_uri_orig(sample_uri)
    print("orig: ", time.time()-start_t)
    start_t = time.time()

    start_t = time.time()
    for sample_uri in sample_uris:
        _is_valid_uri_new(sample_uri)
    print("new: ", time.time()-start_t)
    start_t = time.time()

@jbmchuck jbmchuck marked this pull request as draft October 3, 2020 20:35
@coveralls
Copy link

coveralls commented Oct 3, 2020

Coverage Status

Coverage increased (+0.03%) to 75.694% when pulling 6484fcd on jbmchuck:master into 56dc420 on RDFLib:master.

@jbmchuck jbmchuck marked this pull request as ready for review October 3, 2020 20:42
@jbmchuck jbmchuck marked this pull request as draft October 3, 2020 21:54
…fectively a no-op since ord(c) is <= 256 for all _invalid_uri_chars
@jbmchuck jbmchuck marked this pull request as ready for review October 3, 2020 22:05
@ashleysommer
Copy link
Contributor

ashleysommer commented Oct 3, 2020

Hi @jbmchuck
I see this approach is a tiny bit slower than your proposal in #1176
Why did you close that one and go this route?

I wrote out a longer message here, but I see now what I said is incorrect. Please ignore previous edits of this comment.

@jbmchuck
Copy link
Contributor Author

jbmchuck commented Oct 3, 2020

Hi @ashleysommer ,

The approach in 1176 is incorrect, it incorrectly flags anything with non-ascii chars as invalid. I think it's actually slightly slower than the current (6484fcd) which is about 5x the original func.

@FlorianLudwig
Copy link
Contributor

When profiling RDFlib I came across the same issue but actually looked into how to get rdflib to do less validations in the first place.

@jbmchuck approach looks correct, readable and fast to me. Looking forward to having this merged.

@ashleysommer
Copy link
Contributor

We'll discuss this in today's RDFLib maintainers meeting and will probably merge it.

@nicholascar nicholascar merged commit 2a696df into RDFLib:master Oct 8, 2020
@white-gecko white-gecko added this to the rdflib 6.0.0 milestone Mar 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants