Skip to content
This repository has been archived by the owner on Nov 17, 2022. It is now read-only.

Add is_xml() errors to error log #90

Merged
merged 40 commits into from
Aug 16, 2017
Merged

Conversation

hayfield
Copy link
Contributor

@hayfield hayfield commented Aug 3, 2017

When is_xml() indicates that a value is not XML, errors are produced.

This PR adds such errors to a ValidationErrorLog.

The XMSyntaxError occurs when there is a problem with a string.
The other errors occur when the inputted value is completely wrong.
As such, they should be separated so they can be handled separately.

This will become useful when creating detailed error logs.
@hayfield hayfield added the incomplete A PR that is in a state that is not ready for review. label Aug 3, 2017
@hayfield hayfield added this to the Validation - content checking milestone Aug 3, 2017
This makes the not-a-string errors use an error log.
This has been implemented first since there is a single error message
required, unlike the many different ones that exist for the other types
of error such as XMLSyntaxError.
To implement, have also had to add a name attribute to ValidationErrors.
@hayfield
Copy link
Contributor Author

hayfield commented Aug 8, 2017

Has had first couple of commits from #91 merged so that the functionality is available.

The above-referenced functionality is now merged into validation-info so this doesn't matter.

This converts the lxml errors into something IATI-specific.

Should there not be a specific case for a particular error, a
generic error will be provided. In an actual implementation, any
errors that do not have a specific case should be logged to help
identify errors that occur frequently but have not been categorised.
It appears that lxml contains an internal error log that builds up over time.
When a test function is parameterised by pytest, all calls to the function
add to this singular error log. As such, later calls to the function will be
provided with the errors from earlier calls. This is bad.

By providing a custom XMLParser every time, the error log only refers to a single
call to a test function. As such, buildup of past errors does not occur and the
tests are checking the correct information. This is good.
The name attribute has been added, so it seems reasonable to check
that the returned error has the correct name and is not something
else in disguise.
@hayfield
Copy link
Contributor Author

hayfield commented Aug 8, 2017

A limited number of specific errors have been tested and handled. For more errors to be added, it should be determined which errors occur most frequently in real data. This will allow resources to be put towards actual problems rather than trawling through a long list and hoping the correct things are being picked out.

@hayfield hayfield requested a review from a team August 8, 2017 14:08
@hayfield hayfield removed the incomplete A PR that is in a state that is not ready for review. label Aug 8, 2017
return False
if isinstance(error_log, ValidationErrorLog):
return not error_log.contains_errors()
else:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if-else is no longer required now that it it fully implemented

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in cc978b5

@hayfield hayfield removed the complete A PR that is in a state that is ready for review. label Aug 9, 2017
By using validator to determine whether something is XML, the code is more DRY.
This provides a secondary dimension for classification, alongside
the category.
The new implementation that uses TypeErrors is more general.
There are some encodings that are supported, some that are not.
Unhelpfully, some unsupported encodings return an incorrect error -
EMPTY_DOCUMENT rather than UNSUPPORTED_ENCODING
When there is an encoding mismatch it would be expected that the
ERR_INVALID_ENCODING error is raised. This is not always the case.

Other errors including ERR_INVALID_CHAR, ERR_GT_REQUIRED and
ERR_DOCUMENT_EMPTY may also occur depending on the document being
processed. Something like a newline can completely change the
errors that are output.

The XML spec permits a full range of character encodings to be
used. libxml2, however, only supports UTF-8, UTF-16, ASCII and
ISO-Latin-1 by default. Even within this set of supported
encodings, the expected errors are not returned consistently
(if at all).

With this in mind, a limited set of encodings will be tested.
Other encodings will lead to somewhat unhelpful errors being
returned. Since the probability of these situations occurring
is fairly low, this shouldn't be a major problem.
This makes the code more DRY.
A class has also been created to group some related tests.
hayfield and others added 3 commits August 15, 2017 10:28
89fb099

A detailed commit message was made about some problems with libxml2.
Some of this has now been extracted into the relevant docstring in
a slightly simplified format so that you don't need to go hunting
for the information.
Switch Dataset creation to use iati.validator
@hayfield hayfield added complete A PR that is in a state that is ready for review. and removed incomplete A PR that is in a state that is not ready for review. labels Aug 15, 2017
"""A valid XML string with the text declaration removed."""
return '\n'.join(xml_str.strip().split('\n')[1:])

@pytest.fixture(params=iati.core.tests.utilities.find_parameter_by_type(['str'], False) + [iati.core.tests.utilities.XML_STR_INVALID])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice way to construct a list of invalid types!

"""Check whether a given parameter is valid XML.

Args:
maybe_xml (str): An string that may or may not contain valid XML.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contain valid XML or be valid XML ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be - fixed!

return _check_codelist_values(dataset, schema)
error_log = ValidationErrorLog()

error_log.extend(_check_is_xml(dataset))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be doing a check here for is_iati_xml too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This takes in a Dataset. Since this is guaranteed to contain actual XML, the check is not necessary here.

iati.validator.ValidationErrorLog: A log of the errors that occurred.

"""
return _check_is_xml(dataset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_check_is_xml takes a str, but this seems to be passing it a iati.core.Dataset.

Copy link
Contributor

@dalepotter dalepotter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just suggest changing the docstring for validate_is_xml.

Fix a couple of docstrings
@hayfield
Copy link
Contributor Author

Docstring changed. Now merging.

@hayfield hayfield merged commit 521e86f into validation-info Aug 16, 2017
@hayfield hayfield deleted the error-log-xml-errors branch August 16, 2017 10:28
@hayfield hayfield added the validation Changes to validation functionality. label Sep 20, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
complete A PR that is in a state that is ready for review. validation Changes to validation functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants