The org.ecoinformatics.eml.EMLParser does not perform well when processing large EML documents (for instance, a document that defines 250 to 1000 fully fleshed-out attribute elements). It can take 10, 30, 45 or more minutes to validate a document -- the duration scales with document size.
To try to alleviate this, change the parser to use a SAX-based model rather than a DOM.
org.ecoinformatics.eml.EMLParser uses two methods to validate a document: parseKeys() and parseKeyrefs(). Both call getPathContent() and pass in an XPath selector. getPathContent() creates a DOM and passes back an org.w3c.dom.NodeList.
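To illustrate the SAX-based approach proposed above, here is a minimal sketch of a single-pass key/keyref check. It is not the actual EMLParser change: the class name SaxKeyCheckSketch and its helper names are hypothetical, and it assumes (as a simplification of the EML schema) that keys are id attributes and keyrefs are the text content of references elements. The point is only that a streaming handler can collect both sets in one pass without building a DOM or evaluating XPath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: NOT the real org.ecoinformatics.eml.EMLParser.
class SaxKeyCheckSketch {

    /** Collects id values and reference targets in one streaming pass. */
    static class KeyHandler extends DefaultHandler {
        final Set<String> ids = new HashSet<>();
        final Set<String> duplicateIds = new LinkedHashSet<>();
        final Set<String> references = new LinkedHashSet<>();
        // Non-null only while we are inside a <references> element.
        private StringBuilder text;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Assumption: keys are plain "id" attributes on any element.
            String id = atts.getValue("id");
            if (id != null && !ids.add(id)) {
                duplicateIds.add(id);
            }
            if ("references".equals(localName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver text in several chunks, so accumulate it.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("references".equals(localName) && text != null) {
                references.add(text.toString().trim());
                text = null;
            }
        }
    }

    /** Returns true if all ids are unique and every reference points at a known id. */
    static boolean isValid(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is populated
        SAXParser parser = factory.newSAXParser();
        KeyHandler handler = new KeyHandler();
        parser.parse(in, handler);
        // References can appear before the id they point to, so check after parsing.
        return handler.duplicateIds.isEmpty() && handler.ids.containsAll(handler.references);
    }

    /** Convenience overload for small in-memory documents; wraps checked exceptions. */
    static boolean isValid(String xml) {
        try {
            return isValid(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String ok = "<r><a id=\"x\"/><b><references>x</references></b></r>";
        String dangling = "<r><a id=\"x\"/><b><references>y</references></b></r>";
        System.out.println("isValid: " + isValid(ok));
        System.out.println("isValid: " + isValid(dangling));
    }
}
```

Deferring the keyref check until the end of the document keeps the handler simple and handles forward references. Memory use grows with the number of ids and references rather than with the size of the whole document tree.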
I tested the validator against the 4.3 MB eml250.xml document above, and it is much faster -- now about one second, down from many (tens?) of minutes with the old parser. I think this performance bug can be closed. Comments appreciated. Timing information is below:
$ time java -cp $CP$pkg/EMLValidator src/test/resources/eml250.xml
isValid: true
real 0m1.013s
user 0m2.627s
sys 0m0.149s
$ time java -cp $CP$pkg/EMLValidator src/test/resources/invalidEML/eml-error-annot-missing-id.xml
isValid: false
real 0m0.312s
user 0m0.405s
sys 0m0.046s
$ time java -cp $CP$pkg/EMLValidator src/test/resources/eml-sample.xml
isValid: true
real 0m0.337s
user 0m0.508s
sys 0m0.046s
See the attached file as an example.
eml250.xml.txt