New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
URI / IRI checking should complain about ASCII control chars and maybe properly check URIs #625
Comments
|
There's an additional problem that IRIs aren't being handled properly either, at least in the SPARQL store: |
|
@joernhees me, @guvi007 and @prince17080 would like to work on this issue, can we go forward with this? |
|
@joernhees Hey, we (me, @reeshabhkumarranjan, @guvi007) want to solve this issue. We came up with the solution that a valid URI can only contain characters among 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;='. If you think this is correct, I (@prince17080) can submit a pull request for the same. |
|
This approach will exclude IRIs, which I think we are trying to support.
On Mon, May 25, 2020 at 2:30 PM prince17080 ***@***.***> wrote:
@joernhees <https://github.com/joernhees> Hey, we (me,
@reeshabhkumarranjan <https://github.com/reeshabhkumarranjan>, @guvi007
<https://github.com/guvi007>) want to solve this issue. We came up with
the solution that a valid URI can only contain characters among
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;='.
If you think this is correct, I ***@***.***
<https://github.com/prince17080>) can submit a pull request for the same.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#625 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAETCEIVQBR36HREDD2IBP3RTK2MJANCNFSM4CDMMQCA>
.
--
Jim McCusker
Director, Data Operations
Tetherless World Constellation
Rensselaer Polytechnic Institute
mccusj2@rpi.edu <mccusj@cs.rpi.edu>
http://tw.rpi.edu
|
|
Last I checked is_valid_uri is one of the most time consuming checks that is run by rdflib. I second @joernhees's original suggestion to use urlparse for this, but whatever implementation is ultimately used we will need performance numbers. |
|
Hi @joernhees @jpmccu @tgbugs, |
|
Hm, not quite as straightforward as it might first appear. fwiw, I decanted rfc3987 into its own file in diff --git a/rdflib/term.py b/rdflib/term.py
index c0ea4a28..44a919ff 100644
--- a/rdflib/term.py
+++ b/rdflib/term.py
@@ -70,6 +70,7 @@ from isodate import (
import rdflib
from rdflib.compat import long_type
+from rdflib.extras import rfc3987
if TYPE_CHECKING:
from .namespace import NamespaceManager
@@ -81,15 +82,26 @@ rdflib_skolem_genid = "/.well-known/genid/rdflib/"
skolems: Dict[str, "BNode"] = {}
+def _is_valid_uri(uri: str) -> bool:
+ full_iri_check = True
+ return __is_valid_iri(uri) if full_iri_check else __is_valid_uri(uri)
+
+
_invalid_uri_chars = '<>" {}|\\^`'
-def _is_valid_uri(uri: str) -> bool:
+def __is_valid_uri(uri: str) -> bool:
for c in _invalid_uri_chars:
if c in uri:
return False
return True
+def __is_valid_iri(iri: str) -> bool:
+ if not isinstance(iri, str):
+ iri = str(iri)
+ res = rfc3987.parse(iri)
+ return True if res is not None else False
+
_lang_tag_regex = compile("^[a-zA-Z]+(?:-[a-zA-Z0-9]+)*$")
with the following rather interesting results: Standard, Experimental, |
|
@gjhiggins we tried to fix something in a similar manner and faced the same issues. Specifically the tests |
This is the error that we are getting. So, I'm not sure if it's being encoded properly. The test case if I am not wrong is trying to check the above input for a valid URI. |
|
I remembered I'd created a branch for this so was able to return to it. Yeah, I misread the trig, it is supposed to be a URI as indicated by the prefixing |
I was also going through some of the unicode codes manually and all of them do exist in the rfc3987 unicode specifications. So, found it strange that the errors were thrown. |
|
As regards the original issue report, the topic of validating input and using rfc3987 as a supporting package was discussed in #266 and gromgull's comment on using the rfc3987 module was:
AIUI, the general advice for library code is "try to be tolerant of input but strict in output" so there isn't much of a role for rfc3987 as regards input-checking. However, if a user has a need to be strict about output, rfc3987 is a reasonable choice of tool to check the statements prior to serialization. |
|
I mean, in that scenario the presence of a simple check is offering better tradeoffs. |
I'm disinclined to mark the issue as "closed", at least not until the discussion has been transcribed and edited into the documentation because it's a not-unreasonable expectation and it's appropriate that the implications of satisfying that expectation should be addressed as part of an explanation of why such a feature isn't present. fwiw, adding rdf3987 checking to diff --git a/rdflib/term.py b/rdflib/term.py
index c0ea4a28..200c7d62 100644
--- a/rdflib/term.py
+++ b/rdflib/term.py
@@ -67,6 +67,7 @@ from isodate import (
parse_duration,
parse_time,
)
+import rfc3987
import rdflib
from rdflib.compat import long_type
@@ -271,11 +272,10 @@ class URIRef(IdentifiedNode):
if not value.endswith("#"):
value += "#"
- if not _is_valid_uri(value):
- logger.warning(
- "%s does not look like a valid URI, trying to serialize this will break."
- % value
- )
+ # Handle DefinedNamespaceMeta namespaces (i.e. core namespaces)
+ value = str(value) if not isinstance(value, str) else value
+
+ rfc3987.parse(value)
try:
rt = str.__new__(cls, value)On my i7 laptop, the time taken to parse 250K triples generated from the sp2b generator goes from:
to
Nevertheless, it is undeniably a useful feature when it's needed so it would seem helpful to provide some documented guidelines of how to implement this feature for oneself as and when necessary. |
|
Would it make sense to add a hook for URI/IRI validation? That way users could determine their own performance vs correctness tradeoffs ranging from no validation at all, to checking for forbidden characters, to A hook would also make for very natural place to documentation expectations, defaults, and options. |
|
I like this approach!
…On Mon, May 1, 2023 at 5:09 PM Henry Andrews ***@***.***> wrote:
Would it make sense to add a hook for URI/IRI validation? That way users
could determine their own performance vs correctness tradeoffs ranging from
no validation at all, to checking for forbidden characters, to rfc3987's
regular expressions, or even using the abnf package. It would also be
friendly to systems that already validate IRIs for other reasons prior to
interacting with rdflib.
A hook would also make for very natural place to documentation
expectations, defaults, and options.
—
Reply to this email directly, view it on GitHub
<#625 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAETCELWAOQAIQEUSHNNHUTXEAQ77ANCNFSM4CDMMQCA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
Jamie McCusker (she/her/hers)
Director, Data Operations
Tetherless World Constellation
Rensselaer Polytechnic Institute
***@***.*** ***@***.***>
http://tw.rpi.edu
|
|
I took a shot at implementing the hook approach from my previous comment, and noticed some things: There are three places where validation can occur, with variations in behavior
The wording of the warning/error has minor variations regarding serializations (sometimes specific serializations, sometimes serialization in general) and/or urlencoding (which is reasonable but occasionally misleading guidance, and depending on the user-supplied validation option, might not make any sense at all) It would seem like a good idea to both consolidate the default wording and allow flexibility in that wording when changing the validation behavior. The UX and performance of warnings, errors, and multiple checksI personally strongly dislike the warning, in part because it's not clear to me how to do anything about it if my usage of the library is not done in a context where warnings are monitored (as libraries can get layered before being incorporated into an application, I may or may not have any control over what the actual application does). The use of a warning also requires that the check be done a second time during serialization, which seems to go against the performance concerns expressed by others. I much prefer failures to happen as soon as possible, so I tried out a strict constructor option (which defaulted to off, to maintain compatibility) that turned that first warning into an error, and skipped the check in Unexpected encapsulation violationI tend to expect single-leading-underscore functions to not be imported into other modules, but My inclination would be to have a class property that defaults to the current check, and can be changed. If more flexibility is needed, the check could be overridden with a constructor parameter (the approach I implemented in the first place before realizing the encapsulation issue), but that would not help the If a context manager approach were to be implemented, it would be flexible enough that it could be the only mechanism involved. I'm not sure if/when I'll have time to rework what I did before, but I'll monitor this issue to see what the reaction and interest levels are. |
|
Only a partial response, will respond to the rest when I have more time:
We try to follow PEP-8, and the most critical reasoning for using leading underscores is to signal that a function is not part of our public API. This is in accordance with my understanding of PEP-8: https://peps.python.org/pep-0008/#public-and-internal-interfaces
https://peps.python.org/pep-0008/#descriptive-naming-styles
Even with all that said, Python tooling does seem to assume single underscore identifiers are not used from outside modules. To us, the most critical thing is keeping our public API small, and even reducing its size, there is a lot of code that should not be used that is not clearly internal only. |
currently the
_is_valid_URImethod only checks if a URI includes any of the not allowed printable ASCII chars (see #620)ASCII control chars like backspace, tab, ... won't lead to an early error / warning.
maybe we should really try to parse URIs with a python builtin lib or maybe https://pypi.python.org/pypi/rfc3987 ?
The text was updated successfully, but these errors were encountered: