Update to allow for `.lblx` label file extension #139

al-niessner · 2023-10-24T19:49:07Z

🗒️ Summary

Repeat of #137 but with correct ordering of letters.

⚙️ Test Data and/or Report

Output from running harvest:

[SUMMARY] Reading configuration from /tmp/local.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://elasticsearch:9200, index: registry
[INFO] Connecting to Elasticsearch
[INFO] Loading PDS to ES data type mapping from /home/niessner/Projects/PDS/harvest/target/classes/elastic/data-dic-types.cfg
[INFO] Loading PDS to ES data type mapping from /home/niessner/Projects/PDS/harvest/target/classes/elastic/data-dic-types.cfg
[INFO] Loading PDS to ES data type mapping from /home/niessner/Projects/PDS/harvest/target/classes/elastic/data-dic-types.cfg
[INFO] Processing bundle directory /home/niessner/Projects/PDS/harvest/src/test/resources/test_data
[INFO] Processing bundle /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/sample_dossier.lblx
[INFO] Processing collection /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/document/collection_document.lblx
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1K00.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1K00.JSON to /tmp/LDD-16866812774272577974.JSON
Oct 27, 2023 10:42:42 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=4vSGPzREGMWpfPblkKUs3KiNwf2BJxN2gu45sLHK0oadpJgqgVvMP2pI0W0Szxm4LKMHA8r0gpeKDzr+JAJGOuQgX2eAZcCvbrO0o4fW+wFDKibT1pktJuM3fKp2; Expires=Fri, 03 Nov 2023 17:40:29 GMT; Path=/". Invalid 'expires' attribute: Fri, 03 Nov 2023 17:40:29 GMT
Oct 27, 2023 10:42:42 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=4vSGPzREGMWpfPblkKUs3KiNwf2BJxN2gu45sLHK0oadpJgqgVvMP2pI0W0Szxm4LKMHA8r0gpeKDzr+JAJGOuQgX2eAZcCvbrO0o4fW+wFDKibT1pktJuM3fKp2; Expires=Fri, 03 Nov 2023 17:40:29 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Fri, 03 Nov 2023 17:40:29 GMT
[INFO] Creating temporary ES data file /tmp/es-13056359143747470525.json
[WARN] Could not parse LDD date 2023-03-10T08:22:37
[ERROR] Could not parse date from 2023-03-10T08:22:37 using patterns defined in LddUtils.Accepted_LDD_DateFormats
[WARN] Will use field definitions from [PDS4_PDS_1500.JSON, PDS4_PDS_1900.JSON, PDS4_PDS_1A10.JSON, PDS4_PDS_1B00.JSON, PDS4_PDS_1F00.JSON]
[INFO] Updating Elasticsearch schema.
[INFO] Updated 2 fields
[INFO] Wrote 1 collection inventory document(s)
[INFO] Processing collection /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/initial_reports/collection_initial_reports.lblx
[INFO] Wrote 1 collection inventory document(s)
[INFO] Processing products...
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/document/sample_dossier_release_notes.lblx
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/initial_reports/initial_reports_volume1.lblx
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1K00.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1K00.JSON to /tmp/LDD-1398215671416462861.JSON
Oct 27, 2023 10:42:45 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=4vSGPzREGMWpfPblkKUs3KiNwf2BJxN2gu45sLHK0oadpJgqgVvMP2pI0W0Szxm4LKMHA8r0gpeKDzr+JAJGOuQgX2eAZcCvbrO0o4fW+wFDKibT1pktJuM3fKp2; Expires=Fri, 03 Nov 2023 17:40:29 GMT; Path=/". Invalid 'expires' attribute: Fri, 03 Nov 2023 17:40:29 GMT
Oct 27, 2023 10:42:45 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=4vSGPzREGMWpfPblkKUs3KiNwf2BJxN2gu45sLHK0oadpJgqgVvMP2pI0W0Szxm4LKMHA8r0gpeKDzr+JAJGOuQgX2eAZcCvbrO0o4fW+wFDKibT1pktJuM3fKp2; Expires=Fri, 03 Nov 2023 17:40:29 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Fri, 03 Nov 2023 17:40:29 GMT
[INFO] Creating temporary ES data file /tmp/es-3056316296870078169.json
[WARN] Could not parse LDD date 2023-03-10T08:22:37
[ERROR] Could not parse date from 2023-03-10T08:22:37 using patterns defined in LddUtils.Accepted_LDD_DateFormats
[WARN] Will use field definitions from [PDS4_PDS_1500.JSON, PDS4_PDS_1900.JSON, PDS4_PDS_1A10.JSON, PDS4_PDS_1B00.JSON, PDS4_PDS_1F00.JSON]
[INFO] Updating Elasticsearch schema.
[INFO] Updated 1 fields
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/initial_reports/initial_reports_volume2.lblx
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier::1.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier::1.0'. Preview of field's value: 'not set by harvest'
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier:document::2.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier:document::2.0'. Preview of field's value: 'not set by harvest'
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier:initial_reports::2.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier:initial_reports::2.0'. Preview of field's value: 'not set by harvest'
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier:document:sample_dossier_release_notes::2.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier:document:sample_dossier_release_notes::2.0'. Preview of field's value: 'not set by harvest'
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier:initial_reports:initial_reports_volume1::3.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier:initial_reports:initial_reports_volume1::3.0'. Preview of field's value: 'not set by harvest'
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier:initial_reports:initial_reports_volume2::3.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier:initial_reports:initial_reports_volume2::3.0'. Preview of field's value: 'not set by harvest'
[INFO] Wrote 0 product(s)
[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 0
[SUMMARY] Failed files: 6
[SUMMARY] Package ID: a445c7e1-079b-4f1c-9092-c0d4c8503905

While harvest finds all of the files, the configuration does not allow it to push them, for reasons unclear to me. However, this ticket is about finding the files, not about pushing them to OpenSearch.

♻️ Related Issues

Closes #129
Closes #130

fix typo

al-niessner · 2023-10-24T19:51:22Z

@jordanpadams @tloubrieu-jpl

Sorry for the typos

Use the .xml and .lblx extensions as the primary filter, then open each matching file and check whether it contains a lid and a product class within an Identification_Area. If so, treat it as a label.
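The first stage of that filter can be sketched roughly as below. This is only an illustration, not the code in this PR: the class name `LabelExtensionFilter`, the method `hasLabelExtension`, and the case-insensitive comparison are all assumptions made for the example.

```java
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of harvest: the cheap extension check
// that runs before any file is opened.
final class LabelExtensionFilter {
  // The two extensions come from the commit message; matching them
  // case-insensitively is an assumption of this sketch.
  private static final Set<String> LABEL_EXTENSIONS = Set.of("xml", "lblx");

  static boolean hasLabelExtension(Path file) {
    Path leaf = file.getFileName();
    if (leaf == null) {
      return false; // e.g. a filesystem root has no file name
    }
    String name = leaf.toString();
    int dot = name.lastIndexOf('.');
    if (dot <= 0 || dot == name.length() - 1) {
      return false; // no extension, hidden file such as ".xml", or trailing dot
    }
    String extension = name.substring(dot + 1).toLowerCase(Locale.ROOT);
    return LABEL_EXTENSIONS.contains(extension);
  }
}
```

Only files that pass this check would then be opened for the content test, which keeps the more expensive XML parsing away from data files with other extensions.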

al-niessner · 2023-10-26T20:18:42Z

@jordanpadams @tloubrieu-jpl

Here is a cleaner and more comprehensive test to make sure the XML file is a label. It could be made better, but this was quick and easy.

I also see I broke the testing. I added a JUnit test to verify that the code works as desired (both positive and negative tests). However, it seems that I need to update something like the pom maybe? I am going to dig at it, but, if already known, a quick answer would be welcome.

al-niessner · 2023-10-26T20:21:29Z

@jordanpadams @tloubrieu-jpl

Yup, pom updated.

tloubrieu-jpl · 2023-10-27T17:04:05Z

src/main/java/gov/nasa/pds/harvest/util/xml/XmlIs.java

+        Source source = new SAXSource(new InputSource(new FileReader(filename)));
+        TreeInfo docInfo = configuration.buildDocumentTree(source , options);
+        NodeInfo ia=null,lid=null,pcls=null;
+        for (NodeInfo top : docInfo.getRootNode().children()) {


Using SAX makes the code that reads the product class a bit complicated, but it is the right approach since we have noticed that some labels can be very big.
An alternative could be StAX, but switching to it might not be worth the work.

Should be fast enough, as this is the approach validate takes. Reading just a couple of nodes right at the top should not be too bad. Even a million nodes loop pretty quickly, since we are not descending into all of them.
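For comparison, the StAX alternative mentioned above might look something like the following. It is a rough sketch using only the JDK's `javax.xml.stream` API, not the Saxon-based code in `XmlIs.java`. The class name `StaxLabelCheck`, the method `looksLikeLabel`, and the choice to treat malformed XML as "not a label" are assumptions for illustration. It also assumes the lid and product class appear as `logical_identifier` and `product_class` elements directly under an `Identification_Area` that is a child of the root element.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not the PR's implementation.
final class StaxLabelCheck {
  static boolean looksLikeLabel(Reader xml) {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    // Do not resolve DTDs or external entities while sniffing files.
    factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
    factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
    try {
      XMLStreamReader reader = factory.createXMLStreamReader(xml);
      try {
        return scan(reader);
      } finally {
        reader.close();
      }
    } catch (XMLStreamException e) {
      return false; // assumption: unparseable XML is simply not a label
    }
  }

  private static boolean scan(XMLStreamReader reader) throws XMLStreamException {
    int depth = 0; // 1 = root element, 2 = its children, 3 = grandchildren
    boolean inIdArea = false;
    boolean hasLid = false;
    boolean hasProductClass = false;
    while (reader.hasNext()) {
      int event = reader.next();
      if (event == XMLStreamConstants.START_ELEMENT) {
        depth++;
        String name = reader.getLocalName();
        if (depth == 2 && "Identification_Area".equals(name)) {
          inIdArea = true;
        } else if (depth == 3 && inIdArea) {
          if ("logical_identifier".equals(name)) hasLid = true;
          if ("product_class".equals(name)) hasProductClass = true;
        }
      } else if (event == XMLStreamConstants.END_ELEMENT) {
        if (depth == 2 && inIdArea) {
          // Stop at the end of the first Identification_Area; the rest of
          // a large file is never read.
          return hasLid && hasProductClass;
        }
        depth--;
      }
    }
    return false; // no Identification_Area directly under the root
  }
}
```

Because this sketch returns as soon as the first `Identification_Area` closes, it cannot notice a second one later in the file. Detecting that case would require continuing through the remaining children of the root element.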

tloubrieu-jpl · 2023-10-27T17:04:09Z

src/test/java/harvest/util/xml/XmlIsSuite.java

+import org.junit.jupiter.api.Test;
+import gov.nasa.pds.harvest.util.xml.XmlIs;
+
+class XmlIsSuite {


Great! Thanks Al

al-niessner · 2023-10-27T17:48:23Z

@jordanpadams @tloubrieu-jpl

Should be ready finally.

jordanpadams · 2023-10-27T21:08:13Z

src/main/java/gov/nasa/pds/harvest/util/xml/XmlIs.java

+          for (NodeInfo child : top.children()) {
+            if ("Identification_Area".equals(child.getLocalPart())) {
+              if (ia == null) {
+                ia = child;


@al-niessner do we want to break out of here once we find the Identification_Area, in case this is a huge label? If there are two Identification_Areas, there are bigger problems, and it isn't Harvest's responsibility to validate that.

No. Pluto might have an XML file as data with multiple Identification_Area elements, and this check helps us say it is not a PDS4 label.

@al-niessner I would almost prefer the software just break down on bad data that we assume is right, rather than assume it all could be wrong, but it is not worth the time right now.

Update BundleProcessor.java

51365b7

fix typo

al-niessner self-assigned this Oct 24, 2023

Update CollectionProcessor.java

9c1256f

Al Niessner added 3 commits October 26, 2023 13:11

Code changes for better label detection

7c7236d

Use the .xml and .lblx extensions as the primary filter, then open each matching file and check whether it contains a lid and a product class within an Identification_Area. If so, treat it as a label.

the class that does the new testing

9f9cc85

add unit tests to show content testing works

45c05e3

maybe enough for junit

e4c872f

jordanpadams requested review from tloubrieu-jpl and alexdunnjpl October 26, 2023 20:23

jordanpadams changed the title ~~issue 129: allow for lblx extension~~ Update to allow for .lblx label file extension Oct 26, 2023

pdsen-ci requested a review from a team as a code owner October 26, 2023 20:27

tloubrieu-jpl approved these changes Oct 27, 2023

need full path not butchered filename

c096e19

jordanpadams requested changes Oct 27, 2023

jordanpadams approved these changes Oct 31, 2023

jordanpadams merged commit 728a1bd into main Oct 31, 2023
1 check passed

jordanpadams deleted the issue_129.1 branch October 31, 2023 16:00
