EMLParser is slow to process large EML documents #1

Closed
csjx opened this issue Mar 9, 2017 · 2 comments

@csjx csjx commented Mar 9, 2017

The org.ecoinformatics.eml.EMLParser does not perform well when processing large EML documents (for instance, a document with 250 to 1000 fully fleshed-out attribute elements defined). It can take 10, 30, 45 or more minutes to validate a document -- the duration scales with document size.

To try to alleviate this, change the parser to use a SAX-based model rather than a DOM.

org.ecoinformatics.eml.EMLParser uses two methods to validate a document: parseKeys() and parseKeyrefs(), both of which call getPathContent() and pass in an XPath selector. getPathContent() creates a DOM and passes back an org.w3c.dom.NodeList.
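
For illustration only, here is a rough sketch of what a single-pass SAX check could look like. It is not the EMLParser code and not necessarily how a fix would be implemented. The class name SaxKeySketch, the handler, and keysAreValid() are hypothetical, and the sketch assumes that keys are `id` attributes and keyrefs are the text content of `references` elements.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch; not the EMLParser or EMLValidator implementation. */
class SaxKeySketch {

    /** Collects keys and keyrefs in one streaming pass; no DOM is built. */
    static class KeyHandler extends DefaultHandler {
        final Set<String> ids = new HashSet<>();
        final Set<String> duplicateIds = new HashSet<>();
        final Set<String> references = new HashSet<>();
        private StringBuilder text; // non-null only while inside <references>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Assumption: any "id" attribute is a key that must be unique.
            String id = atts.getValue("id");
            if (id != null && !ids.add(id)) {
                duplicateIds.add(id);
            }
            // Assumption: <references> text is a keyref pointing at some id.
            if ("references".equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("references".equals(qName) && text != null) {
                references.add(text.toString().trim());
                text = null;
            }
        }
    }

    /** Returns true when every id is unique and every reference resolves to an id. */
    static boolean keysAreValid(String xml) throws Exception {
        KeyHandler handler = new KeyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.duplicateIds.isEmpty()
                && handler.ids.containsAll(handler.references);
    }

    public static void main(String[] args) throws Exception {
        String ok = "<eml><dataTable id=\"t1\"/><x><references>t1</references></x></eml>";
        String bad = "<eml><dataTable id=\"t1\"/><x><references>t2</references></x></eml>";
        System.out.println(keysAreValid(ok));
        System.out.println(keysAreValid(bad));
    }
}
```

The point of the sketch is that ids and references are gathered into hash sets during one pass over the stream, so memory and time grow roughly linearly with document size instead of depending on repeated XPath evaluation over a full DOM.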

See the attached file as an example.

eml250.xml.txt

@csjx csjx added the bug label Mar 9, 2017
@csjx csjx self-assigned this Mar 9, 2017
@mbjones mbjones added this to the EML2.2.0 milestone Mar 12, 2017
@mbjones mbjones added this to TODO in EML 2.2.0 Release Mar 12, 2017
@mbjones mbjones moved this from TODO to High priority in EML 2.2.0 Release Apr 22, 2017
@mbjones mbjones added the next label Oct 30, 2017
@mbjones mbjones added backlog and removed next labels Jun 29, 2018
@mobb mobb moved this from High priority to In progress in EML 2.2.0 Release Jul 12, 2018
@mobb mobb moved this from In progress to High priority in EML 2.2.0 Release Jul 12, 2018
@mbjones mbjones added next and removed backlog labels Jan 31, 2019

@mbjones mbjones commented Jan 31, 2019

Need to test the EMLValidator against a large document. It's way faster than the old EMLParser. See #328.

@mbjones mbjones commented Jan 31, 2019

Tested the validator against the 4.3MB eml250.xml document above, and it is much faster -- now less than a second, down from many (tens?) of minutes with the old parser. I think this performance bug can be closed. Comments appreciated. Timing information is below:

$ time java -cp $CP $pkg/EMLValidator src/test/resources/eml250.xml
isValid: true

real	0m1.013s
user	0m2.627s
sys	0m0.149s

$ time java -cp $CP $pkg/EMLValidator src/test/resources/invalidEML/eml-error-annot-missing-id.xml 
isValid: false

real	0m0.312s
user	0m0.405s
sys	0m0.046s

$ time java -cp $CP $pkg/EMLValidator src/test/resources/eml-sample.xml 
isValid: true

real	0m0.337s
user	0m0.508s
sys	0m0.046s
@mbjones mbjones added needs-review and removed next labels Jan 31, 2019
@mbjones mbjones closed this Feb 3, 2019
@mbjones mbjones removed the needs-review label Feb 3, 2019