Forbid truthy values for lang when initializing Literal #1494

veyndan · 2021-12-09T14:50:10Z

With the proposed change, an exception will be raised when lang is a truthy value. I found this quite confusing when messing around with the library that Literal("Hello", lang={}) was somehow considered a valid language tag.

ghost · 2021-12-09T20:30:44Z

The change that you propose doesn't have quite the desired effect. An Exception is raised but it's raised in an unexpected context and the message is misleading, suggesting that a bytes-like object might be acceptable when it isn't:

>>> Literal("foo", lang={})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../rdflib/rdflib/term.py", line 551, in __new__
    if lang is not None and not _is_valid_langtag(lang):
  File ".../rdflib/rdflib/term.py", line 92, in _is_valid_langtag
    return bool(_lang_tag_regex.match(tag))
TypeError: expected string or bytes-like object

(Ill-informed speculation removed)

veyndan · 2021-12-10T10:38:27Z

Yep true, though I think for the initial step, wouldn't it be good to have the same error message when lang is invalid? When lang is some falsy value but is also invalid, we get the same error message. I was thinking of doing a follow up PR that introduces a nicer error message when the lang is invalid, but I think it's out of the scope of this PR.

Edit: I replied to you without refreshing the page, so I didn't see your edit. Please check my edit history, otherwise I can just create some PRs/Issues for what was discusssed.
Edit 2: I think I misunderstood you, thinking that you were seeing this error before the proposed changes, but now I'm guessing you mean with the proposed changes. I edited it to reflect this.

ghost · 2021-12-10T16:41:56Z

I saw your edits, sorry for any confusion. I appreciate your observations about keeping RDFLib light and in order to preseve the potential utility of the rfc5646 solution for those occasions where it might come in handy, I've distilled out a patch which I'll include as a recipe (along with a reference to osbigcat's rfc5646 repos) in an RDFLIb "Cookbook" which I've (just) decided to create as an addition to the RDFLib documentation.

w.r.t. to the PR, ultimately, I boiled it down to just this:

diff --git a/rdflib/term.py b/rdflib/term.py
index 796a76b3..36d14db6 100644
--- a/rdflib/term.py
+++ b/rdflib/term.py
@@ -548,8 +548,13 @@ class Literal(Identifier):
                 "per http://www.w3.org/TR/rdf-concepts/#section-Graph-Literal"
             )
 
-        if lang and not _is_valid_langtag(lang):
-            raise Exception("'%s' is not a valid language tag!" % lang)
+        if lang is not None:
+            if not isinstance(lang, str):
+                raise TypeError(
+                    f"language tag '{lang}' is {type(lang)}, must be a string!"
+                )
+            elif not _is_valid_langtag(lang):
+                raise ValueError(f"'{lang}' is not a valid language tag!")
 
         if datatype:
             datatype = URIRef(datatype)

and a minimalist test:

import pytest
from rdflib import Literal

def test_literal_language_tag():

    with pytest.raises(Exception) as e_info:
        Literal("Hello", lang={})
    assert (
        repr(e_info.value)
        == """TypeError("language tag '{}' is <class 'dict'>, must be a string!")"""
    )

    with pytest.raises(Exception) as e_info:
        Literal("Hello", lang="999")
    assert repr(e_info.value) == """ValueError("'999' is not a valid language tag!")"""

Cheers,

Graham

nicholascar · 2021-12-11T02:36:48Z

@veyndan, are you happy to adopt @gjhiggins's suggested updates and the test?

@gjhiggins where are you making up the CookBook? That seems like a great idea and perhaps we can put a bit of effort into a CookBook and release it with rdflib 7.0.0 perhaps March next year? I've wanted to come up with rdflib patterns or a CookBooks style thing for some time, so if there is momentum here...

nicholascar · 2021-12-11T02:44:23Z

@veyndan @gjhiggins since IANA maintains an authoritative list of the language tags RFC 5646 Registry Format and Maintenance:

The IANA Language Subtag Registry ("the registry") contains a comprehensive list of all of the subtags valid in language tags.

with the Registry actually at https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Why can't we test not for string, not for non binary etc but for actual values in that list? Storing a copy of the list in rdflib with function to regenerate it per release and another function for users to locally append to the rdflib list (for local use, if our master list is out of date) wouldn't be too hard.

In further support of this, the W3C says:

Language tag syntax is defined by the IETF's BCP 47. In the past it was necessary to consult lists of codes in various ISO standards to find the right subtags, but now you only need to look in the IANA Language Subtag Registry.

ghost · 2021-12-11T11:55:15Z

@veyndan @gjhiggins since IANA maintains an authoritative list of the language tags RFC 5646 Registry Format and Maintenance:

with the Registry actually at https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Why can't we test not for string, not for non binary etc but for actual values in that list? Storing a copy of the list in rdflib with function to regenerate it per release and another function for users to locally append to the rdflib list (for local use, if our master list is out of date) wouldn't be too hard.

That's pretty much what I had, following ozbigcat's suggestion - the repos includes an updater for when the IANA list changes and simple instructions.

Given that the IANA source does arbitrarily change, my concern was not to add yet another chore to the RDFLib maintenance burden for what is arguably a minor nuance of support. However, it's a reasonably clean change which a proficient user could easily make, so I thought it would make a good starter for a Cookbook.

There is also the danger that if the library starts checking lang tags against the spec, some well-meaning soul will inevitably point out that Literal values aren't checked for validity w.r.t. a declared datatype and for consistency's sake, perhaps they should be :)

ghost · 2021-12-11T12:31:01Z

@gjhiggins where are you making up the CookBook? That seems like a great idea and perhaps we can put a bit of effort into a CookBook and release it with rdflib 7.0.0 perhaps March next year? I've wanted to come up with rdflib patterns or a CookBooks style thing for some time, so if there is momentum here...

Just locally atm, while I collect some "recipes" as a side effect of working through two-dozen-odd issues related to the identifier-as-context isssue, checking if they're still extant w.r.t. the rebased PR. As soon as I've got a few putative entries I can begin to hallucinate a structure and will create a repos for the content. (e,g, Edmond Chuc's response to #1469 is another excellent candidate for a Cookbook recipe).

FWIW, I was the original perpetrator of the RDFLib docs, back in the day (mostly created from pulling in stuff the other devs had written) and although I haven't communicated as much, I was looking to assist you with refreshing/reworking the RDFLib docs (happily, I've managed to regain access to my readthedocs a/c).

Again, this is also a side-effect of the id-as-ctx work, f'rinstance the confusion around the param bindings for Dataset.graph() is quite understandable given that the API is completely undocumented.

nicholascar · 2021-12-15T23:05:19Z

not to add yet another chore to the RDFLib maintenance burden for what is arguably a minor nuance of support

OK, since the benefits are low, let's not add this burden!

nicholascar · 2021-12-15T23:06:41Z

Last thing: policy is to see a test added as well that indicates how that the change is working as expected, so please add a short test (to an existing test file if one look appropriate or add a new file) and then, i all good, I'll approve.

aucampia · 2021-12-16T22:15:11Z

@veyndan not sure if you are working on tests here, if you are let me know, if not I'm going to make a PR to include some tests from https://github.com/veyndan/rdflib/pull/2

veyndan · 2021-12-16T22:21:20Z

I'm not sure what tests to add here tbh, as I'm still not convinced that we should add tests for non-strings (per our discussion in #1498). So the truthy value for string is "" which (debatably) should (and does) pass, so apart from that I'm not sure what tests make sense assuming that we only want to test those which abide by the type definition of Optional[str] (which is in #1498, but we can assume that both the PRs will be merged at the same-ish time).

aucampia · 2021-12-16T23:04:31Z

I'm not sure what tests to add here tbh, as I'm still not convinced that we should add tests for non-strings (per our discussion in #1498). So the truthy value for string is "" which (debatably) should (and does) pass, so apart from that I'm not sure what tests make sense assuming that we only want to test those which abide by the type definition of Optional[str] (which is in #1498, but we can assume that both the PRs will be merged at the same-ish time).

Quoting response on #1498 (comment)

I understand that lang being anything other than a string is invalid. My point is that we have "restricted" the type of lang to a string with the type annotations. Is it really necessary to do a runtime check when we already have the type annotations. This is a general question — it's not specific to lang.

The answer to this question is not really very straight forward. In brief, I think having something like this would be wrong:
if not isinstance(lang, str):
    raise TypeError(f"Type of lang {lang!r} is invalid")
And actually if you did this, mypy would error out, because warn_unreachable is enabled for rdflib, and the above would be unreachable as far as mypy is concerned, and within this is probably the best answer I can give you to this question, if mypy determines that the type checking is unreachable, then it probably should not be there.

However in this case (state of veyndan#2) mypy will not (and did not) determine that anything is unreachable, because nothing is unreachable.

Now as to the tests, there is a well defined value space for the language tag: it must be a string that conforms to the RDF spec for lang tag, and this value space is something that has to be checked outside of the type system as it is not expressible in the type system, and if this check does not pass an error should be raised, and we know this check should never pass for any value in this array, because they do not lie within the value space: [ "", {}, [], 1, False, True ] - so no check that adequately verifies that a value lies within the valid value space of language tag should succeed for these, and the way in which it should fail should follow the guidelines python sets for errors, which is that ValueError should occur for invalid values, and TypeError for invalid types.

So to me, it is reasonable to test that a value of {} for lang raises TypeError - because I know that there must be runtime validation of the value space, and I know that this should follow python's guidance on error checks, which is that Values outside of the value range but with right type should raise ValueError, and values outside of the value space with wrong type should raise TypeError. And again, I emphasize, what the tests I added in veyndan#2 check for is that the value space validation for lang is sufficiently strict.

aucampia · 2021-12-16T23:26:22Z

I made a pull request against your branch to add tests: https://github.com/veyndan/rdflib/pull/4

This follows the reasoning explained here: #1498 (comment)

aucampia

As indicated, if a PR fixes a runtime problem (as this test will) then it should have tests that reproduce the runtime problem that is being fixed.

The PR i made against your branch include tests which I think are sufficient but other tests which cover similar cases can also be accepted.

Added tests to ensure TypeError occurs if invalid types of langtags are supplied. Also added tests to ensure that invalid language tag values raise ValueError.

Added additional tests for Literal language tag.

veyndan · 2021-12-17T23:12:52Z

@nicholascar This PR should be good to go now (thanks @aucampia for adding the tests!).

aucampia

Looks good to me, will wait for one more approval from another maintainer.

nicholascar

Thanks for the detailed considerations for such a small PR!

This was referenced Dec 11, 2021

Raise a more specific Exception when lang isn't valid #1497

Merged

Add type annotations to constructor parameters in Literal #1498

Merged

Forbid truthy values for lang when initializing Literal

ce1f3f8

nicholascar self-requested a review December 15, 2021 23:06

aucampia requested changes Dec 17, 2021

View reviewed changes

aucampia and others added 2 commits December 17, 2021 23:49

Added additional tests for Literal language tag.

7596b49

Added tests to ensure TypeError occurs if invalid types of langtags are supplied. Also added tests to ensure that invalid language tag values raise ValueError.

Merge pull request #4 from aucampia/iwana-20211217T0021-add_tests

42cbf96

Added additional tests for Literal language tag.

aucampia approved these changes Dec 17, 2021

View reviewed changes

aucampia requested review from ashleysommer and white-gecko December 17, 2021 23:16

nicholascar approved these changes Dec 18, 2021

View reviewed changes

nicholascar merged commit 23ba8ee into RDFLib:master Dec 18, 2021

veyndan deleted the veyndan/bug/lang-truthy branch December 18, 2021 10:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Forbid truthy values for lang when initializing Literal #1494

Forbid truthy values for lang when initializing Literal #1494

veyndan commented Dec 9, 2021

ghost commented Dec 9, 2021 •

edited by ghost

veyndan commented Dec 10, 2021 •

edited

ghost commented Dec 10, 2021

nicholascar commented Dec 11, 2021

nicholascar commented Dec 11, 2021 •

edited

ghost commented Dec 11, 2021

ghost commented Dec 11, 2021 •

edited by ghost

nicholascar commented Dec 15, 2021

nicholascar commented Dec 15, 2021

aucampia commented Dec 16, 2021

veyndan commented Dec 16, 2021 •

edited

aucampia commented Dec 16, 2021

aucampia commented Dec 16, 2021

aucampia left a comment

veyndan commented Dec 17, 2021

aucampia left a comment

nicholascar left a comment

Forbid truthy values for lang when initializing Literal #1494

Forbid truthy values for lang when initializing Literal #1494

Conversation

veyndan commented Dec 9, 2021

ghost commented Dec 9, 2021 • edited by ghost

veyndan commented Dec 10, 2021 • edited

ghost commented Dec 10, 2021

nicholascar commented Dec 11, 2021

nicholascar commented Dec 11, 2021 • edited

ghost commented Dec 11, 2021

ghost commented Dec 11, 2021 • edited by ghost

nicholascar commented Dec 15, 2021

nicholascar commented Dec 15, 2021

aucampia commented Dec 16, 2021

veyndan commented Dec 16, 2021 • edited

aucampia commented Dec 16, 2021

aucampia commented Dec 16, 2021

aucampia left a comment

Choose a reason for hiding this comment

veyndan commented Dec 17, 2021

aucampia left a comment

Choose a reason for hiding this comment

nicholascar left a comment

Choose a reason for hiding this comment

ghost commented Dec 9, 2021 •

edited by ghost

veyndan commented Dec 10, 2021 •

edited

nicholascar commented Dec 11, 2021 •

edited

ghost commented Dec 11, 2021 •

edited by ghost

veyndan commented Dec 16, 2021 •

edited