Add `is_xml()` errors to error log #90

hayfield · 2017-08-03T16:16:04Z

When is_xml() indicates that a value is not XML, errors are produced.

This PR adds such errors to a ValidationErrorLog.

The XMLSyntaxError occurs when there is a problem with the contents of a string. The other errors occur when the input value is completely wrong (for example, not a string at all). As such, the two kinds of error should be separated so they can be handled separately. This will become useful when creating detailed error logs.

This makes the not-a-string errors use an error log. This has been implemented first since there is a single error message required, unlike the many different ones that exist for the other types of error such as XMLSyntaxError.

To implement this, a name attribute also had to be added to ValidationErrors.
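
As a rough illustration of the idea, an error log whose entries carry a name could look something like the sketch below. Only `ValidationErrorLog`, `ValidationError` and `contains_errors()` are names taken from this PR; `add()`, `contains_error_called()`, `check_is_string()` and the error name string are assumptions made for the example, not the code in this branch.

```python
class ValidationError:
    """A single validation problem, identified by a name."""

    def __init__(self, name, info=None):
        self.name = name  # hypothetical naming scheme, e.g. 'err-not-xml-not-string'
        self.info = info  # optional detail about what went wrong


class ValidationErrorLog:
    """An ordered collection of ValidationErrors."""

    def __init__(self):
        self._errors = []

    def add(self, error):
        self._errors.append(error)

    def contains_errors(self):
        return len(self._errors) > 0

    def contains_error_called(self, name):
        # The name attribute is what makes this lookup possible.
        return any(err.name == name for err in self._errors)


def check_is_string(value):
    """Log one named error when the value is not a string at all."""
    log = ValidationErrorLog()
    if not isinstance(value, str):
        log.add(ValidationError('err-not-xml-not-string', type(value).__name__))
    return log
```

With something along these lines, a test can assert on the specific named error rather than only on a boolean result.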

hayfield · 2017-08-08T09:01:16Z

~~Has had first couple of commits from #91 merged so that the functionality is available.~~

The above-referenced functionality is now merged into validation-info so this doesn't matter.

This converts the lxml errors into something IATI-specific. Should there not be a specific case for a particular error, a generic error will be provided. In an actual implementation, any errors that do not have a specific case should be logged to help identify errors that occur frequently but have not been categorised.
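
A minimal sketch of this kind of conversion, assuming a plain dictionary lookup with a generic fallback, is shown below. The IATI-side error names, `LXML_TO_IATI`, `GENERIC_ERROR` and `to_iati_error_name()` are all hypothetical; the two keys are libxml2 error type names as exposed by lxml.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical mapping from lxml error type names to IATI-specific names.
LXML_TO_IATI = {
    'ERR_DOCUMENT_EMPTY': 'err-not-xml-empty-document',
    'ERR_TAG_NAME_MISMATCH': 'err-not-xml-tag-mismatch',
}

# Hypothetical name used when no specific case exists.
GENERIC_ERROR = 'err-not-xml-uncategorised'


def to_iati_error_name(lxml_type_name):
    """Translate an lxml error type name, falling back to a generic error."""
    try:
        return LXML_TO_IATI[lxml_type_name]
    except KeyError:
        # Record the miss so that frequent uncategorised errors can be spotted.
        logger.warning('Uncategorised lxml error: %s', lxml_type_name)
        return GENERIC_ERROR
```

A dictionary keeps each specific case to one line, and the logged misses give a way to decide which cases are worth adding next.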

It appears that lxml contains an internal error log that builds up over time. When a test function is parameterised by pytest, all calls to the function add to this singular error log. As such, later calls to the function will be provided with the errors from earlier calls. This is bad. By providing a custom XMLParser every time, the error log only refers to a single call to a test function. As such, buildup of past errors does not occur and the tests are checking the correct information. This is good.
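
The per-call parser approach can be sketched roughly as follows, assuming lxml is installed. `lxml_error_names()` is a hypothetical helper written for this example and is not a function in this PR.

```python
from lxml import etree


def lxml_error_names(xml_str):
    """Return the lxml error type names produced by parsing one string."""
    parser = etree.XMLParser()  # a new parser, so its error log starts empty
    try:
        etree.fromstring(xml_str, parser)
    except etree.XMLSyntaxError:
        pass  # the details are read from the parser's own log below
    return [entry.type_name for entry in parser.error_log]
```

Because the log is read from the parser created inside the call, errors from one parameterised test case should not appear in the results of another.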

The name attribute has been added, so it seems reasonable to check that the returned error has the correct name and is not something else in disguise.

hayfield · 2017-08-08T14:08:29Z

A limited number of specific errors have been tested and handled. For more errors to be added, it should be determined which errors occur most frequently in real data. This will allow resources to be put towards actual problems rather than trawling through a long list and hoping the correct things are being picked out.

hayfield · 2017-08-08T15:44:33Z

iati/validator.py

-        return False
+    if isinstance(error_log, ValidationErrorLog):
+        return not error_log.contains_errors()
+    else:


This if-else is no longer required now that it is fully implemented.

Fixed in cc978b5

By using validator to determine whether something is XML, the code is more DRY.

This provides a secondary dimension for classification, alongside the category.

The new implementation that uses TypeErrors is more general.

There are some encodings that are supported and some that are not. Unhelpfully, some unsupported encodings return an incorrect error: EMPTY_DOCUMENT rather than UNSUPPORTED_ENCODING.

When there is an encoding mismatch it would be expected that the ERR_INVALID_ENCODING error is raised. This is not always the case. Other errors including ERR_INVALID_CHAR, ERR_GT_REQUIRED and ERR_DOCUMENT_EMPTY may also occur depending on the document being processed. Something like a newline can completely change the errors that are output. The XML spec permits a full range of character encodings to be used. libxml2, however, only supports UTF-8, UTF-16, ASCII and ISO-Latin-1 by default. Even within this set of supported encodings, the expected errors are not returned consistently (if at all). With this in mind, a limited set of encodings will be tested. Other encodings will lead to somewhat unhelpful errors being returned. Since the probability of these situations occurring is fairly low, this shouldn't be a major problem.

This makes the code more DRY. A class has also been created to group some related tests.

89fb099 A detailed commit message was made about some problems with libxml2. Some of this has now been extracted into the relevant docstring in a slightly simplified format so that you don't need to go hunting for the information.

Switch Dataset creation to use iati.validator

dalepotter · 2017-08-15T11:54:18Z

iati/tests/test_validator.py

+        """A valid XML string with the text declaration removed."""
+        return '\n'.join(xml_str.strip().split('\n')[1:])
+
+    @pytest.fixture(params=iati.core.tests.utilities.find_parameter_by_type(['str'], False) + [iati.core.tests.utilities.XML_STR_INVALID])


Nice way to construct a list of invalid types!

dalepotter · 2017-08-15T11:56:50Z

iati/validator.py

+    """Check whether a given parameter is valid XML.
+
+    Args:
+        maybe_xml (str): An string that may or may not contain valid XML.


Contain valid XML, or be valid XML?

be - fixed!

dalepotter · 2017-08-15T12:03:54Z

iati/validator.py

-    return _check_codelist_values(dataset, schema)
+    error_log = ValidationErrorLog()
+
+    error_log.extend(_check_is_xml(dataset))


Should we be doing a check here for is_iati_xml too?

This takes in a Dataset. Since this is guaranteed to contain actual XML, the check is not necessary here.

dalepotter · 2017-08-15T12:05:36Z

iati/validator.py

+        iati.validator.ValidationErrorLog: A log of the errors that occurred.
+
+    """
+    return _check_is_xml(dataset)


_check_is_xml takes a str, but this seems to be passing it an iati.core.Dataset.

dalepotter

Looks good, just suggest changing the docstring for validate_is_xml.

Fix a couple of docstrings

hayfield · 2017-08-16T10:28:06Z

Docstring changed. Now merging.

hayfield added the incomplete label Aug 3, 2017

hayfield added this to the Validation - content checking milestone Aug 3, 2017

hayfield added 4 commits August 8, 2017 09:45

Start transitioning is_xml check to use error log

e2f5c17

This makes the not-a-string errors use an error log. This has been implemented first since there is a single error message required, unlike the many different ones that exist for the other types of error such as XMLSyntaxError.

Better parameterise test error names

0d2eeb5

Add ErrorLog functionality to identify errors with name

f5f03d3

To implement this, a name attribute also had to be added to ValidationErrors.

Merge branch 'check-validator-err-log-contents' into error-log-xml-errors

4e70546

hayfield added 18 commits August 8, 2017 10:19

Fix out-of-date return value

731a6a3

Test detailed output from is_xml() with non-XML values

87682b7

Remove redundant except statement

3f75f96

Test detailed output of is_xml() with valid XML

6265865

Change a copypasted function name

c0de330

Use a dictionary rather than an if-else tree

eb76913

Test for more types of invalid XML

2796003

Test content being before XML prolog.

0f63903

Add tests dealing with misplaced XML prolog

a994ca0

Check for a secondary error that should exist

f189e82

Add suffix to test function names

d6e5d2a

Improve docstrings

5536903

Test concatenation of XML strings with no prolog

ced53d9

Add a TODO about a next step

19117b0

Add is_xml checks to full_validation()

bf675ba

Add assertions about error names

46ed9d0

The name attribute has been added, so it seems reasonable to check that the returned error has the correct name and is not something else in disguise.

hayfield requested a review from a team August 8, 2017 14:08

hayfield removed the incomplete label Aug 8, 2017

hayfield commented Aug 8, 2017

hayfield removed the complete label Aug 9, 2017

hayfield added 11 commits August 9, 2017 10:58

Switch Dataset creation to use iati.validator

de90d1d

By using validator to determine whether something is XML, the code is more DRY.

Add a base exception type to each of the validation errors

56751bf

This provides a secondary dimension for classification, alongside the category.

Add base_exception as a required attribute for error codes

05f7ec9

Merge branch 'error-log-xml-errors' into dataset-xml-only

7eaed16

Check for less-specific errors in error log

08dec47

The new implementation that uses TypeErrors is more general.

Change from ValueError to iati.core.exceptions.ValidationError

c65cd59

Better deal with invalid character encodings

a998e75

There are some encodings that are supported and some that are not. Unhelpfully, some unsupported encodings return an incorrect error: EMPTY_DOCUMENT rather than UNSUPPORTED_ENCODING.

Unname unused variable

b6e43e9

Add assertions for specific error types

043977c

Split a string used 3 times into a fixture

3d979e5

This makes the code more DRY. A class has also been created to group some related tests.

hayfield mentioned this pull request Aug 9, 2017

Switch Dataset creation to use iati.validator #95

Merged

hayfield and others added 3 commits August 15, 2017 10:28

XML prolog -> XML text declaration

86b9665

Extract information from commit message to docstring

4ea05f9

89fb099 A detailed commit message was made about some problems with libxml2. Some of this has now been extracted into the relevant docstring in a slightly simplified format so that you don't need to go hunting for the information.

Merge pull request #95 from IATI/dataset-xml-only

ed9e9ba

Switch Dataset creation to use iati.validator

hayfield added the complete label and removed the incomplete label Aug 15, 2017

dalepotter reviewed Aug 15, 2017

dalepotter suggested changes Aug 15, 2017

Fix a couple of docstrings

bec0b32

hayfield mentioned this pull request Aug 15, 2017

Fix a couple of docstrings #99

Merged

Merge pull request #99 from IATI/docstring-fix

8442537

Fix a couple of docstrings

hayfield merged commit 521e86f into validation-info Aug 16, 2017

hayfield deleted the error-log-xml-errors branch August 16, 2017 10:28

hayfield added the validation label Sep 20, 2017
