Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forbid truthy values for lang when initializing Literal #1494

Merged
merged 3 commits into from Dec 18, 2021
Merged

Forbid truthy values for lang when initializing Literal #1494

merged 3 commits into from Dec 18, 2021

Conversation

veyndan
Copy link
Contributor

@veyndan veyndan commented Dec 9, 2021

With the proposed change, an exception will be raised when lang is a truthy value. I found this quite confusing when messing around with the library that Literal("Hello", lang={}) was somehow considered a valid language tag.

@ghost
Copy link

ghost commented Dec 9, 2021

The change that you propose doesn't have quite the desired effect. An Exception is raised but it's raised in an unexpected context and the message is misleading, suggesting that a bytes-like object might be acceptable when it isn't:

>>> Literal("foo", lang={})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../rdflib/rdflib/term.py", line 551, in __new__
    if lang is not None and not _is_valid_langtag(lang):
  File ".../rdflib/rdflib/term.py", line 92, in _is_valid_langtag
    return bool(_lang_tag_regex.match(tag))
TypeError: expected string or bytes-like object

(Ill-informed speculation removed)

@veyndan
Copy link
Contributor Author

veyndan commented Dec 10, 2021

Yep true, though I think for the initial step, wouldn't it be good to have the same error message when lang is invalid? When lang is some falsy value but is also invalid, we get the same error message. I was thinking of doing a follow up PR that introduces a nicer error message when the lang is invalid, but I think it's out of the scope of this PR.

Edit: I replied to you without refreshing the page, so I didn't see your edit. Please check my edit history, otherwise I can just create some PRs/Issues for what was discusssed.
Edit 2: I think I misunderstood you, thinking that you were seeing this error before the proposed changes, but now I'm guessing you mean with the proposed changes. I edited it to reflect this.

@ghost
Copy link

ghost commented Dec 10, 2021

I saw your edits, sorry for any confusion. I appreciate your observations about keeping RDFLib light and in order to preseve the potential utility of the rfc5646 solution for those occasions where it might come in handy, I've distilled out a patch which I'll include as a recipe (along with a reference to osbigcat's rfc5646 repos) in an RDFLIb "Cookbook" which I've (just) decided to create as an addition to the RDFLib documentation.

w.r.t. to the PR, ultimately, I boiled it down to just this:

diff --git a/rdflib/term.py b/rdflib/term.py
index 796a76b3..36d14db6 100644
--- a/rdflib/term.py
+++ b/rdflib/term.py
@@ -548,8 +548,13 @@ class Literal(Identifier):
                 "per http://www.w3.org/TR/rdf-concepts/#section-Graph-Literal"
             )
 
-        if lang and not _is_valid_langtag(lang):
-            raise Exception("'%s' is not a valid language tag!" % lang)
+        if lang is not None:
+            if not isinstance(lang, str):
+                raise TypeError(
+                    f"language tag '{lang}' is {type(lang)}, must be a string!"
+                )
+            elif not _is_valid_langtag(lang):
+                raise ValueError(f"'{lang}' is not a valid language tag!")
 
         if datatype:
             datatype = URIRef(datatype)

and a minimalist test:

import pytest
from rdflib import Literal

def test_literal_language_tag():

    with pytest.raises(Exception) as e_info:
        Literal("Hello", lang={})
    assert (
        repr(e_info.value)
        == """TypeError("language tag '{}' is <class 'dict'>, must be a string!")"""
    )

    with pytest.raises(Exception) as e_info:
        Literal("Hello", lang="999")
    assert repr(e_info.value) == """ValueError("'999' is not a valid language tag!")"""

Cheers,

Graham

@nicholascar
Copy link
Member

@veyndan, are you happy to adopt @gjhiggins's suggested updates and the test?

@gjhiggins where are you making up the CookBook? That seems like a great idea and perhaps we can put a bit of effort into a CookBook and release it with rdflib 7.0.0 perhaps March next year? I've wanted to come up with rdflib patterns or a CookBooks style thing for some time, so if there is momentum here...

@nicholascar
Copy link
Member

nicholascar commented Dec 11, 2021

@veyndan @gjhiggins since IANA maintains an authoritative list of the language tags RFC 5646 Registry Format and Maintenance:

The IANA Language Subtag Registry ("the registry") contains a comprehensive list of all of the subtags valid in language tags.

with the Registry actually at https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Why can't we test not for string, not for non binary etc but for actual values in that list? Storing a copy of the list in rdflib with function to regenerate it per release and another function for users to locally append to the rdflib list (for local use, if our master list is out of date) wouldn't be too hard.

In further support of this, the W3C says:

Language tag syntax is defined by the IETF's BCP 47. In the past it was necessary to consult lists of codes in various ISO standards to find the right subtags, but now you only need to look in the IANA Language Subtag Registry.

@ghost
Copy link

ghost commented Dec 11, 2021

@veyndan @gjhiggins since IANA maintains an authoritative list of the language tags RFC 5646 Registry Format and Maintenance:

with the Registry actually at https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Why can't we test not for string, not for non binary etc but for actual values in that list? Storing a copy of the list in rdflib with function to regenerate it per release and another function for users to locally append to the rdflib list (for local use, if our master list is out of date) wouldn't be too hard.

That's pretty much what I had, following ozbigcat's suggestion - the repos includes an updater for when the IANA list changes and simple instructions.

Given that the IANA source does arbitrarily change, my concern was not to add yet another chore to the RDFLib maintenance burden for what is arguably a minor nuance of support. However, it's a reasonably clean change which a proficient user could easily make, so I thought it would make a good starter for a Cookbook.

There is also the danger that if the library starts checking lang tags against the spec, some well-meaning soul will inevitably point out that Literal values aren't checked for validity w.r.t. a declared datatype and for consistency's sake, perhaps they should be :)

@ghost
Copy link

ghost commented Dec 11, 2021

@gjhiggins where are you making up the CookBook? That seems like a great idea and perhaps we can put a bit of effort into a CookBook and release it with rdflib 7.0.0 perhaps March next year? I've wanted to come up with rdflib patterns or a CookBooks style thing for some time, so if there is momentum here...

Just locally atm, while I collect some "recipes" as a side effect of working through two-dozen-odd issues related to the identifier-as-context isssue, checking if they're still extant w.r.t. the rebased PR. As soon as I've got a few putative entries I can begin to hallucinate a structure and will create a repos for the content. (e,g, Edmond Chuc's response to #1469 is another excellent candidate for a Cookbook recipe).

FWIW, I was the original perpetrator of the RDFLib docs, back in the day (mostly created from pulling in stuff the other devs had written) and although I haven't communicated as much, I was looking to assist you with refreshing/reworking the RDFLib docs (happily, I've managed to regain access to my readthedocs a/c).

Again, this is also a side-effect of the id-as-ctx work, f'rinstance the confusion around the param bindings for Dataset.graph() is quite understandable given that the API is completely undocumented.

@nicholascar
Copy link
Member

not to add yet another chore to the RDFLib maintenance burden for what is arguably a minor nuance of support

OK, since the benefits are low, let's not add this burden!

@nicholascar
Copy link
Member

Last thing: policy is to see a test added as well that indicates how that the change is working as expected, so please add a short test (to an existing test file if one look appropriate or add a new file) and then, i all good, I'll approve.

@nicholascar nicholascar self-requested a review December 15, 2021 23:06
@aucampia
Copy link
Member

@veyndan not sure if you are working on tests here, if you are let me know, if not I'm going to make a PR to include some tests from https://github.com/veyndan/rdflib/pull/2

@veyndan
Copy link
Contributor Author

veyndan commented Dec 16, 2021

I'm not sure what tests to add here tbh, as I'm still not convinced that we should add tests for non-strings (per our discussion in #1498). So the truthy value for string is "" which (debatably) should (and does) pass, so apart from that I'm not sure what tests make sense assuming that we only want to test those which abide by the type definition of Optional[str] (which is in #1498, but we can assume that both the PRs will be merged at the same-ish time).

@aucampia
Copy link
Member

I'm not sure what tests to add here tbh, as I'm still not convinced that we should add tests for non-strings (per our discussion in #1498). So the truthy value for string is "" which (debatably) should (and does) pass, so apart from that I'm not sure what tests make sense assuming that we only want to test those which abide by the type definition of Optional[str] (which is in #1498, but we can assume that both the PRs will be merged at the same-ish time).

Quoting response on #1498 (comment)

I understand that lang being anything other than a string is invalid. My point is that we have "restricted" the type of lang to a string with the type annotations. Is it really necessary to do a runtime check when we already have the type annotations. This is a general question — it's not specific to lang.

The answer to this question is not really very straight forward. In brief, I think having something like this would be wrong:

if not isinstance(lang, str):
    raise TypeError(f"Type of lang {lang!r} is invalid")

And actually if you did this, mypy would error out, because warn_unreachable is enabled for rdflib, and the above would be unreachable as far as mypy is concerned, and within this is probably the best answer I can give you to this question, if mypy determines that the type checking is unreachable, then it probably should not be there.

However in this case (state of veyndan#2) mypy will not (and did not) determine that anything is unreachable, because nothing is unreachable.

Now as to the tests, there is a well defined value space for the language tag: it must be a string that conforms to the RDF spec for lang tag, and this value space is something that has to be checked outside of the type system as it is not expressible in the type system, and if this check does not pass an error should be raised, and we know this check should never pass for any value in this array, because they do not lie within the value space: [ "", {}, [], 1, False, True ] - so no check that adequately verifies that a value lies within the valid value space of language tag should succeed for these, and the way in which it should fail should follow the guidelines python sets for errors, which is that ValueError should occur for invalid values, and TypeError for invalid types.

So to me, it is reasonable to test that a value of {} for lang raises TypeError - because I know that there must be runtime validation of the value space, and I know that this should follow python's guidance on error checks, which is that Values outside of the value range but with right type should raise ValueError, and values outside of the value space with wrong type should raise TypeError. And again, I emphasize, what the tests I added in veyndan#2 check for is that the value space validation for lang is sufficiently strict.

@aucampia
Copy link
Member

I made a pull request against your branch to add tests: https://github.com/veyndan/rdflib/pull/4

This follows the reasoning explained here: #1498 (comment)

Copy link
Member

@aucampia aucampia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As indicated, if a PR fixes a runtime problem (as this test will) then it should have tests that reproduce the runtime problem that is being fixed.

The PR i made against your branch include tests which I think are sufficient but other tests which cover similar cases can also be accepted.

aucampia and others added 2 commits December 17, 2021 23:49
Added tests to ensure TypeError occurs if invalid types of langtags are
supplied. Also added tests to ensure that invalid language tag values
raise ValueError.
Added additional tests for Literal language tag.
@veyndan
Copy link
Contributor Author

veyndan commented Dec 17, 2021

@nicholascar This PR should be good to go now (thanks @aucampia for adding the tests!).

Copy link
Member

@aucampia aucampia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, will wait for one more approval from another maintainer.

Copy link
Member

@nicholascar nicholascar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed considerations for such a small PR!

@nicholascar nicholascar merged commit 23ba8ee into RDFLib:master Dec 18, 2021
@veyndan veyndan deleted the veyndan/bug/lang-truthy branch December 18, 2021 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants