Prevent `XPathEvalError`s when Codelist Mapping XPath identifies non-attribute #229

hayfield · 2017-11-08T12:08:48Z

Fixes #224

It is generally assumed that a Codelist shall be used to check a value in an attribute. This is even stated in the IATI-Codelists README:

"It's structured as a list of mapping elements, which each have a path element that describes the relevant attribute"

Unfortunately, not everything is straightforward in IATIland.

This PR adds support for the current (incorrect) implementation of 2.02, where `//iati-activity/crs-add/channel-code/text()` is given as an XPath to validate against a Codelist. This support may generalise, but it should be deemed highly unstable and a likely candidate for removal, since it is extremely questionable whether this should be permitted.

Something that should be more stable is that a ValueError is raised when an unsupported XPath is provided as part of a mapping.
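As a rough illustration of the behaviour described above, the sketch below classifies a mapping XPath by its final section and raises a `ValueError` for anything else. The helper name `classify_mapping_xpath`, its return shape, and the `//transaction/sector/@code` path in the usage note are assumptions made for this example; they are not taken from pyIATI.

```python
def classify_mapping_xpath(base_xpath):
    """Classify a Codelist mapping XPath by what its last section points at.

    Hypothetical helper for illustration only (not part of pyIATI).
    Returns a tuple of (kind, parent element XPath, attribute name or None).
    """
    # Split off everything after the final '/'.
    parent_el_xpath, _, last_xpath_section = base_xpath.rpartition('/')

    if last_xpath_section.startswith('@'):
        # e.g. '.../@code' -> check the 'code' attribute of the parent element
        return ('attribute', parent_el_xpath, last_xpath_section[1:])
    if last_xpath_section == 'text()':
        # e.g. '.../channel-code/text()' -> check the element's text node
        return ('text', parent_el_xpath, None)

    # Anything else is not a supported way of locating a code value.
    raise ValueError('Unsupported Codelist mapping XPath: {0}'.format(base_xpath))
```

With this sketch, `classify_mapping_xpath('//iati-activity/crs-add/channel-code/text()')` returns `('text', '//iati-activity/crs-add/channel-code', None)`, an attribute path such as `//transaction/sector/@code` is classified as `'attribute'`, and a path ending in a bare element name raises `ValueError`.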

IATI/IATI-Codelists#109 is also integrated.

… checks To deal with text() as a valid way of locating code values, a number of if-else statements need to be added. Having a single if-else at the top of the function is less confusing and easier to read than having lots in the middle.

This is a fix for #224. The Codelist README states the following in relation to mapping files: "It's structured as a list of mapping elements, which each have a path element that describes the relevant attribute". This indicates that it's incorrect that element text should match a Codelist. IATI, however, is special. This means that at the moment this is deemed something the Standard does. Until the restriction is removed, this fix is required. Yay, IATIland...

…ttributes

I did a silly and forgot to lint

andylolz · 2017-11-08T13:07:21Z

iati/validator.py

+
+    """
+    parent_el_xpath = '/'.join(split_xpath[:-1])
+    last_xpath_section = split_xpath[-1:][0]


I think this is functionally equivalent to:

last_xpath_section = split_xpath[-1]

I.e. the code that’s there makes a list of all items from the last one onwards, then takes the first item of that list.
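A quick check of the equivalence, using the channel-code XPath from this PR as sample input:

```python
split_xpath = '//iati-activity/crs-add/channel-code/text()'.split('/')

# Slicing from the last index onwards gives a one-item list...
assert split_xpath[-1:] == ['text()']

# ...so taking the first item of that slice is the same as
# indexing the last item directly.
assert split_xpath[-1:][0] == split_xpath[-1] == 'text()'
```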

andylolz · 2017-11-08T13:09:22Z

iati/validator.py

@@ -290,32 +356,29 @@ def _check_codes(dataset, codelist):
        base_xpath = mapping['xpath']
        condition = mapping['condition']
        split_xpath = base_xpath.split('/')
-        parent_el_xpath = '/'.join(split_xpath[:-1])
-        attr_name = split_xpath[-1:][0][1:]
+        last_xpath_section = split_xpath[-1:][0]


Same as https://github.com/IATI/pyIATI/pull/229/files#r149662369 goes for this line.

andylolz · 2017-11-08T13:13:13Z

iati/validator.py

+    while not parent_el_xpath.startswith('//'):
+        parent_el_xpath = '/' + parent_el_xpath
+    if parent_el_xpath.startswith('//['):
+        parent_el_xpath = '//*[' + parent_el_xpath[3:]


I guess you’ll hit a merge conflict with #226 here.

andylolz · 2017-11-08T13:24:57Z

iati/validator.py

+
+    located_codes = list()
+    for parent in parents_to_check:
+        located_codes.append((parent.text, parent.sourceline))


I wonder if this should be:

located_codes.append((parent.text.strip(), parent.sourceline))

Then the valid test case could be something like:

[…] <crs-add> <channel-code> 21039 </channel-code> </crs-add> […]

Matching of Codelists in attributes is exact, with whitespace being deemed incorrect. As such, stripping whitespace for element text doesn't seem like a correct interpretation. Maybe a warning?

So I agree for attributes… I’d be tempted to be more forgiving for element text. As a quite hand-wavy justification for this, IATI-Rulesets allows for whitespace around (for instance) an iati-identifier:

https://github.com/IATI/IATI-Rulesets/blob/c33d11885e1084bb99d897c9a5dc3bfcea91c0fb/iatirulesets/__init__.py#L63

https://github.com/IATI/IATI-Rulesets/blob/c33d11885e1084bb99d897c9a5dc3bfcea91c0fb/rulesets/standard.json#L19-L20

…so the following passes (despite the whitespace):

<iati-identifier> some-valid-identifier </iati-identifier>

As a quite hand-wavy justification, IATI-Rulesets allows for whitespace around an iati-identifier

In that same implementation, startswith doesn't permit leading whitespace.

The source to go back to is the XPath spec which shows that text() returns the entire text node. The XPath in the mapping file says that the comparison is against this value (rather than a trimmed or stripped version). As such, checking against a modified text node is not correct.

Further, the XML Spec shows that white space is significant and must be treated as part of a value - treating it as not-present is incorrect.
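A minimal sketch of the difference between the two interpretations. It uses the standard library's `xml.etree.ElementTree` rather than the parser used in the validator, and `codelist_codes` is a made-up stand-in for a real Codelist:

```python
import xml.etree.ElementTree as ET

xml = '<crs-add><channel-code> 21039 </channel-code></crs-add>'
channel_code = ET.fromstring(xml).find('channel-code')

# Hypothetical Codelist content, for illustration only.
codelist_codes = {'21039'}

# The text node keeps its surrounding whitespace.
raw_text = channel_code.text
assert raw_text == ' 21039 '

# Exact comparison (what the mapping XPath asks for) rejects the value...
exact_match = raw_text in codelist_codes
# ...whereas comparing a stripped copy would accept it.
stripped_match = raw_text.strip() in codelist_codes
```

Here `exact_match` is `False` and `stripped_match` is `True`, which is the behavioural difference under discussion.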

In that same implementation, startswith doesn't permit leading whitespace.

Well, but that’s not used in the standard ruleset, so I’m not sure that it’s relevant… But yes – I take your point and I think you’re quite right!

Just for reference: It looks like slovakaid are the only publisher currently using the element anyway, and they get it right. Also, very few publishers seem to make the mistake of leading/trailing whitespace in iati-identifier/text()s (or other text nodes for that matter).

andylolz · 2017-11-08T16:53:28Z

It is generally assumed that a Codelist shall be used to check a value in an attribute. This is even stated in the IATI-Codelists README.

Rather than updating the schema, couldn’t you just update the IATI-Codelists README?

Put another way: It seems we’re treating text nodes exactly like attributes… So can you explain why it’s a problem to use text nodes for codelist items?

hayfield · 2017-11-08T17:14:04Z

So can you explain why it’s a problem to use text nodes for codelist items?

It's an unnecessary special case that complicates data use.

The purpose of a Codelist is to restrict the values to a list that can be used between different systems. A Code has multiple components - a code, name, description and more. Some of these, such as name and description can be long portions of text. The code, however, is a short language-independent value - generally a numeric value, though potentially with a letter or two as a prefix.

The only notable (for this use case) difference between the permitted content of elements and attributes is the ability to maintain newlines. There are a grand total of zero cases within the IATI Standard where it could possibly be necessary for a Code's code component to contain a newline. Even with namespaces, there is zero need if you're using Codelists correctly.

The one potentially arguable case is codes in external vocabularies where an external organisation maintains the code. All IATI elements with a @vocabulary-uri attribute require that the value from this fits within a @code attribute. Additionally, no relevant Replicated Codelist includes newlines in any code values. As such, this doesn't appear a relevant argument.

Of all these cases within the Standard itself, there are zero reasons to use an element's text node over an attribute. All it does is complicate data use by having a single case that breaks all other conventions (IATI has quite a few of these).

External to the Standard itself, the situation where this might possibly be relevant is that you wish to validate //narrative/text() against fixed vocabularies (checking against some internal system or something). A custom Codelist would be one way to do this, since there's not really anything better.

Basically, the Standard has zero need to use an element's text node over an attribute. It could, however, be beneficial in a wider content-validation sense, though this doesn't appear to be permitted within the current specification of components and should be discussed separately.

andylolz · 2017-11-08T18:27:31Z

Of all these cases within the Standard itself, there are zero reasons to use an element's text node over an attribute.

Yeah, agreed – that makes sense to me.

I take from this that while there’s nothing technically wrong here, it’s needlessly inconsistent, which is bad.

That is: if a standard were to use text nodes for codelist items instead of attributes – but did it consistently – that would be a reasonable design choice. But mixing the two is bad. Is that fair?

andylolz · 2017-11-08T22:15:54Z

iati/validator.py

                else:
-                    error = ValidationError('warn-code-not-on-codelist', locals())
+                    el_name = split_xpath[-2:-1][0]  # used via `locals()` # pylint: disable=unused-variable


It takes me a while to get my head around this notation every time! I’d write this as:

el_name = split_xpath[-2]

More radical suggestion: I wonder if this whole bit might be more legible if you scrapped split_xpath, used rsplit('/', 1) all over the place, and passed strings around instead of lists of strings. So like:

parent_el_xpath, last_xpath_section = base_xpath.rsplit('/', 1)

Then this bit would be:

_, el_name = parent_el_xpath.rsplit('/', 1) # used via `locals()` # pylint: disable=unused-variable

Hmm on second thought that is still a bit ugly. Yes – feel free to ignore this suggestion! This pull request looks good to me.
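For reference, the list-based and string-based approaches compared above give the same results on the channel-code XPath from this PR:

```python
base_xpath = '//iati-activity/crs-add/channel-code/text()'

# List-based approach: split on every '/', then index from the end.
split_xpath = base_xpath.split('/')
assert split_xpath[-2:-1][0] == split_xpath[-2] == 'channel-code'

# String-based approach: split once from the right, keep strings.
parent_el_xpath, last_xpath_section = base_xpath.rsplit('/', 1)
assert parent_el_xpath == '//iati-activity/crs-add/channel-code'
assert last_xpath_section == 'text()'

# The element name is then the last section of the parent XPath.
_, el_name = parent_el_xpath.rsplit('/', 1)
assert el_name == 'channel-code'
```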

ec3f996

(I clearly haven't properly got my head around negative array indexing >.< - super useful feature that few languages I've used much have)

With the rsplit version, it would require _extract_codes_from_attrib() and _extract_codes_from_element_text() to have different argument lists.

If these functions were public, that wouldn't be a great thing (lack of consistency). Since they're private, this isn't ideal, though is less of a problem.

It would, however, make the code a bit more DRY since last_xpath_section and parent_el_xpath would each only need calculating once - in _check_codes(). Hum.

¯\(ツ)/¯

...actually, the rsplit version is cleaner and the parameter thing doesn't matter - it's fine that _extract_codes_from_attrib() takes an extra parameter since it makes logical sense that you need to specify the attribute to extract data from. With _extract_codes_from_element_text() you only need the element (or some way of finding it).

Hmmmm! Well... Different args for those functions seems intuitively correct, though... One needs to know an attribute reference; the other doesn't.

I will stop peddling my reckons on this now, because I'm unsure so I trust whatever you decide!

hayfield · 2017-11-09T08:50:17Z

That is: if a standard were to use text nodes for codelist items instead of attributes – but did it consistently – that would be a reasonable design choice. But mixing the two is bad. Is that fair?

Seems like a good assessment of the situation, yep.

As suggested in: #229 (review). Passing strings (rather than lists of strings) around is clearer. It also means that a couple of variables only need defining once. It is fine that the attrib and text extraction functions take different numbers of parameters since the former clearly needs more information to undertake its task than the latter.

Before this point, there were too many locals. This removes a couple of unnecessary ones, while also increasing clarity.

allthatilk · 2017-11-09T11:03:49Z

iati/validator.py

-
-        for parent in parents_to_check:
-            code = parent.attrib[attr_name]
+        if last_xpath_section.startswith('@'):


This should be a function called _extract_codes

allthatilk · 2017-11-09T11:04:01Z

iati/validator.py

+
+    return located_codes
+
+
 def _check_codes(dataset, codelist):


This badly needs refactoring. The function is neither DRY nor does it follow SRP.

allthatilk · 2017-11-09T11:09:13Z

iati/validator.py

-                if codelist.complete:
-                    error = ValidationError('err-code-not-on-codelist', locals())
+                if last_xpath_section.startswith('@'):
+                    if codelist.complete:


extract setting of error name to a variable
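One possible reading of this suggestion, shown only for the complete/incomplete distinction visible in the diff: work out the error name first, then construct the error once. The helper name `code_error_name` is hypothetical and the actual refactor in this PR may differ.

```python
def code_error_name(codelist_complete):
    """Pick the error name for a code that is missing from a Codelist.

    Hypothetical helper for illustration: a complete Codelist gives an
    error, an incomplete one only a warning.
    """
    prefix = 'err' if codelist_complete else 'warn'
    return prefix + '-code-not-on-codelist'
```

The name would then be passed to a single `ValidationError(...)` call rather than repeating the call in each branch.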

This separates responsibilities. It becomes necessary to define 'attr_name' twice, though this seems like a reasonable trade-off

A common prefix can be extracted. Therefore it has been extracted.

allthatilk

Smells like 🌹 s to me!

allthatilk

Again xD

hayfield added 4 commits November 8, 2017 09:39

Add a failing test for validation against a populated Schema

e8d6a8b

Update Codelist about Codelist mapping file paths evaluating to non-a…

0471ce8

…ttributes

hayfield added the bug, codelists, complete and validation labels and removed the complete label Nov 8, 2017

Fix a reference to a PR in another repo

1393432

hayfield mentioned this pull request Nov 8, 2017

Error in codelist-mapping.xml causing validator problems #224

Closed

hayfield requested a review from a team November 8, 2017 12:21

hayfield added and removed the complete label Nov 8, 2017

Add pylint disables

b4eaba5

I did a silly and forgot to lint

hayfield added the complete label Nov 8, 2017

andylolz reviewed Nov 8, 2017

Simplify access to the last item in a list

d12d03f

hayfield added the standard-support label Nov 8, 2017

andylolz reviewed Nov 8, 2017

hayfield added 3 commits November 9, 2017 08:51

Simplify access to the penultimate item in a list

ec3f996

Remove a couple of helper variables that manage to reduce clarity

b6ad977

Before this point, there were too many locals. This removes a couple of unnecessary ones, while also increasing clarity.

hayfield mentioned this pull request Nov 9, 2017

Test correct handling of whitespace in Ruleset tests #233

Open

6 tasks

hayfield added 3 commits November 9, 2017 10:37

Merge branch 'dev' into 'crs-add-crs-add'

98a9577

Merge branch 'dev' into crs-add-crs-add

7e5c4f4

Move test files to new 2.02 test folder

a32c5a6

allthatilk suggested changes Nov 9, 2017

hayfield added 2 commits November 9, 2017 11:28

Extract code extraction code to a function

9930cdb

This separates responsibilities. It becomes necessary to define 'attr_name' twice, though this seems like a reasonable trade-off

Reduce number of nested if-s

c2df432

A common prefix can be extracted. Therefore it has been extracted.

allthatilk previously approved these changes Nov 9, 2017

hayfield added 2 commits November 9, 2017 11:41

Merge branch 'dev' into crs-add-crs-add

6f1627a

Fix expected location of resource-loading function

a3d4a4f

hayfield dismissed allthatilk’s stale review via a3d4a4f November 9, 2017 11:43

allthatilk approved these changes Nov 9, 2017

hayfield merged commit d39c120 into dev Nov 9, 2017

hayfield deleted the crs-add-crs-add branch November 9, 2017 12:04

hayfield mentioned this pull request Nov 9, 2017

Add new @crs-channel-code attribute for the participating-org element IATI/IATI-Schemas#338

Closed

3 tasks

hayfield mentioned this pull request Nov 20, 2017

Document codelist items on text nodes IATI/IATI-Standard-SSOT#123

Closed
