-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update to allow for .lblx
label file extension
#139
Conversation
fix typo
Sorry for the typos |
Use the .xml and .lblx as the primary filter then open files that are xml or lblx and see if they contain a lid and product class within an Identification_Area. If so, then treat them as a label.
Here is a cleaner and more comprehensive testing to make sure the XML file is a label. Could be made better, but this was quick and easy. I also see I broke the testing. I added a JUnit test to verify that the code works as desired (both positive and negative tests). However, it seems that I need to update something like the pom maybe? I am going to dig at it, but, if already known, a quick answer would be welcome. |
Yup, pom updated. |
.lblx
label file extension
Source source = new SAXSource(new InputSource(new FileReader(filename))); | ||
TreeInfo docInfo = configuration.buildDocumentTree(source , options); | ||
NodeInfo ia=null,lid=null,pcls=null; | ||
for (NodeInfo top : docInfo.getRootNode().children()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using SAX make the code to read the product class a bit complicated, but that is the right approach since we noticed some label can be very big.
Apparently an alternative could be to use StAX , but that might not be worth the work to use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fast enough as this the approach validate takes. Just reading a couple of nodes right at the top should not be too bad. Even a million nodes loops pretty quick since not descending them all.
import org.junit.jupiter.api.Test; | ||
import gov.nasa.pds.harvest.util.xml.XmlIs; | ||
|
||
class XmlIsSuite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Thanks Al
Should be ready finally. |
for (NodeInfo child : top.children()) { | ||
if ("Identification_Area".equals(child.getLocalPart())) { | ||
if (ia == null) { | ||
ia = child; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@al-niessner do we want to break out of here once we find the identification area in the event this is a huge label? if there are 2 identification areas, there are bigger problems, and it isn't responsibility of Harvest to validate that is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Pluto might have an XML as data with multiple Identification_Area and this helps us say, not a PDS4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@al-niessner I would almost prefer the software just breaks down on bad data that we assume is right, versus assuming it all could be wrong, but not worth the time right now.
🗒️ Summary
Repeat of #137 but with correct ordering of letters.
⚙️ Test Data and/or Report
Output from running harvest:
While it finds all of the files, the configuration does not allow it to push for reasons unclear to me. However, this ticket is about finding the files and not about pushing them to opensearch.
♻️ Related Issues
Closes #129
Closes #130