URI Validation Performance Improvements #1177
I've found _is_valid_uri to be a hotspot in serialization and deserialization. Benchmarking this version of the function against the original with 1e7 URLs shows a runtime improvement of ~2x. Tests against real RDF graphs with 2e6+ triples show a runtime improvement of ~8% in serialization and deserialization.
Benchmark script:
…fectively a no-op since ord(c) is <= 256 for all _invalid_uri_chars
Hi @ashleysommer, the approach in #1176 is incorrect: it flags anything with non-ASCII chars as invalid. I think it's also slightly slower than the current version (6484fcd), which is about 5x faster than the original func.
When profiling RDFLib I came across the same issue, but I actually looked into how to get rdflib to do fewer validations in the first place. @jbmchuck's approach looks correct, readable and fast to me. Looking forward to having this merged.
We'll discuss this in today's RDFLib maintainers meeting and will probably merge it.
I've found _is_valid_uri to be a hotspot in serialization and
deserialization.
Benchmarking this version of the function against the original with 1e7
URLs shows a runtime improvement of ~5x.
Tests against real RDF graphs with 2e6+ triples show a runtime
improvement of ~10-12% in serialization and deserialization.
Functionally there is one diff: I don't believe the ord(c) check is
necessary. In the original code, the effective check for each char is
ord(c) > 256 or c not in _invalid_uri_chars.
Since ord(c) <= 256 for all c in _invalid_uri_chars, there is no case where
the ord(c) > 256 condition holds true but c not in _invalid_uri_chars
does not also.
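The redundancy argument above can be checked exhaustively. This is a minimal sketch, not code from the diff: the value of _invalid_uri_chars and the exact form of the original per-character predicate are assumptions made for illustration, and the helper names are hypothetical.

```python
import sys

# Assumed value of the module-level constant; illustrative only.
_invalid_uri_chars = '<>" {}|\\^`'


def char_ok_original(c):
    # Assumed form of the original per-character predicate.
    return ord(c) > 256 or c not in _invalid_uri_chars


def char_ok_simplified(c):
    # Same predicate with the ord(c) test dropped.
    return c not in _invalid_uri_chars


# Highest code point among the invalid characters.
max_invalid = max(ord(c) for c in _invalid_uri_chars)

# Code points on which the two predicates disagree.
mismatches = [
    cp
    for cp in range(sys.maxunicode + 1)
    if char_ok_original(chr(cp)) != char_ok_simplified(chr(cp))
]
```

If max_invalid is at most 256 and mismatches is empty, dropping the ord(c) test cannot change the result for any input string.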
I tried a few other approaches, including performing this check in a list
comprehension, but this was significantly slower for large numbers of calls,
though still faster than the original code by ~2x.
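A rough way to compare such variants side by side is sketched below. The function names, the shape of each variant, the sample URIs and the value of _invalid_uri_chars are all assumptions for illustration rather than the code in this change. Timings depend on the machine and Python version, so none are quoted.

```python
import timeit

# Assumed value of the module-level constant; illustrative only.
_invalid_uri_chars = '<>" {}|\\^`'


def is_valid_uri_original(uri):
    # Assumed shape of the original: one Python-level test per URI character.
    return all(ord(c) > 256 or c not in _invalid_uri_chars for c in uri)


def is_valid_uri_listcomp(uri):
    # List-comprehension variant: collects every offending character first.
    return not [c for c in uri if c in _invalid_uri_chars]


def is_valid_uri_loop(uri):
    # Loop over the few invalid characters; in CPython each `in` is a
    # substring search implemented in C, not a per-character Python loop.
    for c in _invalid_uri_chars:
        if c in uri:
            return False
    return True


VARIANTS = (is_valid_uri_original, is_valid_uri_listcomp, is_valid_uri_loop)

# Hypothetical sample URIs: valid, contains a space, non-ASCII, contains <>.
SAMPLES = [
    "http://example.org/resource/1",
    "http://example.org/a b",
    "http://example.org/\u00e9",
    "http://example.org/<x>",
]


def time_variants(number=100_000):
    # Seconds taken by each variant to check all samples `number` times.
    return {
        f.__name__: timeit.timeit(lambda f=f: [f(u) for u in SAMPLES], number=number)
        for f in VARIANTS
    }


if __name__ == "__main__":
    for name, seconds in time_variants().items():
        print(f"{name}: {seconds:.3f}s")
```

All three variants should agree on every sample, including the non-ASCII one, before their timings are compared.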