Skip to content
This repository has been archived by the owner on Nov 17, 2022. It is now read-only.

Prevent XPathEvalErrors when Codelist Mapping XPath identifies non-attribute #229

Merged
merged 17 commits into from
Nov 9, 2017

Conversation

hayfield
Copy link
Contributor

@hayfield hayfield commented Nov 8, 2017

Fixes #224

It is generally assumed that a Codelist shall be used to check a value in an attribute. This is even stated in the IATI-Codelists README.

It's structured as a list of mapping elements, which each have a path element that describes the relevant attribute

Unfortunately, not everything is straightforward in IATIland.

This PR adds support for the current (incorrect) implementation of 2.02 where //iati-activity/crs-add/channel-code/text() is an XPath to validate against a Codelist. This may generalise, though should be deemed highly unstable and a likely candidate for removal since it is extremely questionable whether this should be permitted.

Something that should be more stable is that a ValueError is raised when an unsupported XPath is provided as part of a mapping.

IATI/IATI-Codelists#109 is also integrated.

… checks

To deal with text() as a valid way of locating code values, there need to be a number of if-else statements added.
It is less confusing and easier to read by having a single if-else at the top of the function, rather than lots in the middle.
This is a fix for #224

The Codelist README states the following in relation to mapping files:

  It's structured as a list of mapping elements, which each have a path element that describes the relevant attribute

This indicates that it's incorrect that element text should match a Codelist. IATI, however, is special. This means that at the moment this is deemed something the Standard does.

Until the restriction is removed, this fix is required.

Yay, IATIland...
@hayfield hayfield added bug This issue identifies and details a bug. codelists Relating to IATI Codelists. complete A PR that is in a state that is ready for review. validation Changes to validation functionality. and removed complete A PR that is in a state that is ready for review. labels Nov 8, 2017
@hayfield hayfield requested a review from a team November 8, 2017 12:21
@hayfield hayfield added complete A PR that is in a state that is ready for review. and removed complete A PR that is in a state that is ready for review. labels Nov 8, 2017
I did a silly and forgot to lint
@hayfield hayfield added the complete A PR that is in a state that is ready for review. label Nov 8, 2017

"""
parent_el_xpath = '/'.join(split_xpath[:-1])
last_xpath_section = split_xpath[-1:][0]
Copy link
Contributor

@andylolz andylolz Nov 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is functionally equivalent to:

last_xpath_section = split_xpath[-1]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. the code that’s there makes a list of all items from the last one onwards, then takes the first item of that list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it is!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -290,32 +356,29 @@ def _check_codes(dataset, codelist):
base_xpath = mapping['xpath']
condition = mapping['condition']
split_xpath = base_xpath.split('/')
parent_el_xpath = '/'.join(split_xpath[:-1])
attr_name = split_xpath[-1:][0][1:]
last_xpath_section = split_xpath[-1:][0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while not parent_el_xpath.startswith('//'):
parent_el_xpath = '/' + parent_el_xpath
if parent_el_xpath.startswith('//['):
parent_el_xpath = '//*[' + parent_el_xpath[3:]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you’ll hit a merge conflict with #226 here.


located_codes = list()
for parent in parents_to_check:
located_codes.append((parent.text, parent.sourceline))
Copy link
Contributor

@andylolz andylolz Nov 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be:

located_codes.append((parent.text.strip(), parent.sourceline))

Then the valid test case could be something like:

[…]
<crs-add>
  <channel-code>
    21039
  </channel-code>
</crs-add>
[…]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matching of Codelists in attributes is exact, with whitespace being deemed incorrect. As such, stripping whitespace for element text doesn't seem like a correct interpretation. Maybe a warning?

Copy link
Contributor

@andylolz andylolz Nov 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I agree for attributes… I’d be tempted to be more forgiving for element text. As a quite hand-wavy justification for this, IATI-Rulesets allows for whitespace around (for instance) an iati-identifier:

…so the following passes (despite the whitespace):

<iati-identifier>
    some-valid-identifier
</iati-identifier>

Copy link
Contributor Author

@hayfield hayfield Nov 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a quite hand-wavy justification, IATI-Rulesets allows for whitespace around an iati-identifier

In that same implementation, startswith doesn't permit whitespace leading whitespace.


The source to go back to is the XPath spec which shows that text() returns the entire text node. The XPath in the mapping file says that the comparison is against this value (rather than a trimmed or stripped version). As such, checking against a modified text node is not correct.

Further, the XML Spec shows that white space is significant and must be treated as part of a value - treating it as not-present is incorrect.

Copy link
Contributor

@andylolz andylolz Nov 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that same implementation, startswith doesn't permit whitespace leading whitespace.

Well, but that’s not used in the standard ruleset, so I’m not sure that it’s relevant… But yes – I take your point and I think you’re quite right!

Just for reference: It looks like slovakaid are the only publisher currently using the element anyway, and they get it right. Also, very few publishers seem to make the mistake of leading/trailing whitespace in iati-identifier/text()s (or other text nodes for that matter).

@hayfield hayfield added the standard-support Relating to how pyIATI supports a major component within the IATI Standard. label Nov 8, 2017
@andylolz
Copy link
Contributor

andylolz commented Nov 8, 2017

It is generally assumed that a Codelist shall be used to check a value in an attribute. This is even stated in the IATI-Codelists README.

Rather than updating the schema, couldn’t you just update the IATI-Codelists README?

Put another way: It seems we’re treating text nodes exactly like attributes… So can you explain why it’s a problem to use text nodes for codelist items?

@hayfield
Copy link
Contributor Author

hayfield commented Nov 8, 2017

So can you explain why it’s a problem to use text nodes for codelist items?

It's an unnecessary special case that complicates data use.

The purpose of a Codelist is to restrict the values to a list that can be used between different systems. A Code has multiple components - a code, name, description and more. Some of these, such as name and description can be long portions of text. The code, however, is a short language-independent value - generally a numeric value, though potentially with a letter or two as a prefix.

The only notable (for this use case) difference between the permitted content of elements and attributes is the ability to maintain newlines. There are a grand total of zero cases within the IATI Standard where it could possibly be necessary for a Code's code component to contain a newline. Even with namespaces, there is zero need if you're using Codelists correctly.

The one potentially arguable case is codes in external vocabularies where an external organisation maintains the code. All IATI elements with a @vocabulary-uri attribute require that the value from this fits within a @code attribute. Additionally, no relevant Replicated Codelist includes newlines in any code values. As such, this doesn't appear a relevant argument.

Of all these cases within the Standard itself, there are zero reasons to use an element's text node over an attribute. All it does is complicate data use by having a single case that breaks all other conventions (IATI has quite a few of these).


External to the Standard itself, the situation where this might possibly be relevant is that you wish to validate //narrative/text() against fixed vocabularies (checking against some internal system or something). A custom Codelist would be a manner that could be undertaken since there's not really anything better.


Basically, The Standard has zero need to use an element's text node over an attribute. It could, however, be beneficial in a wider content-validation sense, though this doesn't appear to be permitted within the current specification of components and should be discussed separately.

@andylolz
Copy link
Contributor

andylolz commented Nov 8, 2017

Of all these cases within the Standard itself, there are zero reasons to use an element's text node over an attribute.

Yeah, agreed – that makes sense to me.

I take from this that while there’s nothing technically wrong here, it’s needlessly inconsistent, which is bad.

That is: if a standard were to use text nodes for codelist items instead of attributes – but did it consistently – that would be a reasonable design choice. But mixing the two is bad. Is that fair?

else:
error = ValidationError('warn-code-not-on-codelist', locals())
el_name = split_xpath[-2:-1][0] # used via `locals()` # pylint: disable=unused-variable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It takes me a while to get my head around this notation every time! I’d write this as:

el_name = split_xpath[-2]

Copy link
Contributor

@andylolz andylolz Nov 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More radical suggestion: I wonder if this whole bit might be more legible if you scrapped split_xpath, used rsplit('/', 1) all over the place, and passed strings around instead of lists of strings. So like:

parent_el_xpath, last_xpath_section = base_xpath.rsplit('/', 1)

Then this bit would be:

_, el_name = parent_el_xpath.rsplit('/', 1)  # used via `locals()` # pylint: disable=unused-variable

Hmm on second thought that is still a bit ugly. Yes – feel free to ignore this suggestion! This pull request looks good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ec3f996

(I clearly haven't properly got my head around negative array indexing >.< - super useful feature that few languages I've used much have)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the rsplit version, it would require _extract_codes_from_attrib() and _extract_codes_from_element_text() to have different argument lists.

If these functions were public, that wouldn't be a great thing (lack of consistency). Since they're private, this isn't ideal, though is less of a problem.

It would, however, make the code a bit more DRY since last_xpath_section and parent_el_xpath would each only need calculating once - in _check_codes(). Hum.

Copy link
Contributor

@andylolz andylolz Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

¯\(ツ)

Copy link
Contributor Author

@hayfield hayfield Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...actually, the rsplit version is cleaner and the parameter thing doesn't matter - it's fine that _extract_codes_from_attrib() takes an extra parameter since it makes logical sense that you need to specify the attribute to extract data from. With _extract_codes_from_element_text() you only need the element (or some way of finding it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@andylolz andylolz Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm! Well... Different args for those functions seems intuitively correct, though... One needs to know am attribute reference; the other doesn't.

I will stop peddling my reckons on this now, because I'm unsure so I trust whatever you decide!

@hayfield
Copy link
Contributor Author

hayfield commented Nov 9, 2017

That is: if a standard were to use text nodes for codelist items instead of attributes – but did it consistently – that would be a reasonable design choice. But mixing the two is bad. Is that fair?

Seems like a good assessment of the situation, yep.

As suggested in:
  #229 (review)

Passing strings (rather than lists of strings) around is clearer. It also means that a couple of variables only need defining once.

It is fine that the attrib and text extraction functions take different numbers of parameters since the former clearly needs more information to undertake its task than the latter.
Before this point, there were too many locals. This removes a couple of unnecessary ones, while also increasing clarity.

for parent in parents_to_check:
code = parent.attrib[attr_name]
if last_xpath_section.startswith('@'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a function called _extract_codes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


return located_codes


def _check_codes(dataset, codelist):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This badly needs refactoring. Function neither DRY nor SRP.

if codelist.complete:
error = ValidationError('err-code-not-on-codelist', locals())
if last_xpath_section.startswith('@'):
if codelist.complete:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract setting of error name to a variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This separates responsibilities. It becomes necessary to define 'attr_name' twice, though this seems like a reasonable trade-off
A common prefix can be extracted. Therefore it has been extracted.
allthatilk
allthatilk previously approved these changes Nov 9, 2017
Copy link
Contributor

@allthatilk allthatilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smells like 🌹 s to me!

Copy link
Contributor

@allthatilk allthatilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again xD

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug This issue identifies and details a bug. codelists Relating to IATI Codelists. complete A PR that is in a state that is ready for review. standard-support Relating to how pyIATI supports a major component within the IATI Standard. validation Changes to validation functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants