Update to allow for .lblx label file extension #139

Merged: jordanpadams merged 7 commits into main from issue_129.1 on Oct 31, 2023

@al-niessner commented Oct 24, 2023
Contributor

🗒️ Summary

Repeat of #137 but with correct ordering of letters.

⚙️ Test Data and/or Report

Output from running harvest:

[SUMMARY] Reading configuration from /tmp/local.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://elasticsearch:9200, index: registry
[INFO] Connecting to Elasticsearch
[INFO] Loading PDS to ES data type mapping from /home/niessner/Projects/PDS/harvest/target/classes/elastic/data-dic-types.cfg
[INFO] Loading PDS to ES data type mapping from /home/niessner/Projects/PDS/harvest/target/classes/elastic/data-dic-types.cfg
[INFO] Loading PDS to ES data type mapping from /home/niessner/Projects/PDS/harvest/target/classes/elastic/data-dic-types.cfg
[INFO] Processing bundle directory /home/niessner/Projects/PDS/harvest/src/test/resources/test_data
[INFO] Processing bundle /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/sample_dossier.lblx
[INFO] Processing collection /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/document/collection_document.lblx
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1K00.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1K00.JSON to /tmp/LDD-16866812774272577974.JSON
Oct 27, 2023 10:42:42 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=4vSGPzREGMWpfPblkKUs3KiNwf2BJxN2gu45sLHK0oadpJgqgVvMP2pI0W0Szxm4LKMHA8r0gpeKDzr+JAJGOuQgX2eAZcCvbrO0o4fW+wFDKibT1pktJuM3fKp2; Expires=Fri, 03 Nov 2023 17:40:29 GMT; Path=/". Invalid 'expires' attribute: Fri, 03 Nov 2023 17:40:29 GMT
Oct 27, 2023 10:42:42 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=4vSGPzREGMWpfPblkKUs3KiNwf2BJxN2gu45sLHK0oadpJgqgVvMP2pI0W0Szxm4LKMHA8r0gpeKDzr+JAJGOuQgX2eAZcCvbrO0o4fW+wFDKibT1pktJuM3fKp2; Expires=Fri, 03 Nov 2023 17:40:29 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Fri, 03 Nov 2023 17:40:29 GMT
[INFO] Creating temporary ES data file /tmp/es-13056359143747470525.json
[WARN] Could not parse LDD date 2023-03-10T08:22:37
[ERROR] Could not parse date from 2023-03-10T08:22:37 using patterns defined in LddUtils.Accepted_LDD_DateFormats
[WARN] Will use field definitions from [PDS4_PDS_1500.JSON, PDS4_PDS_1900.JSON, PDS4_PDS_1A10.JSON, PDS4_PDS_1B00.JSON, PDS4_PDS_1F00.JSON]
[INFO] Updating Elasticsearch schema.
[INFO] Updated 2 fields
[INFO] Wrote 1 collection inventory document(s)
[INFO] Processing collection /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/initial_reports/collection_initial_reports.lblx
[INFO] Wrote 1 collection inventory document(s)
[INFO] Processing products...
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/document/sample_dossier_release_notes.lblx
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/initial_reports/initial_reports_volume1.lblx
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1K00.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1K00.JSON to /tmp/LDD-1398215671416462861.JSON
Oct 27, 2023 10:42:45 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=4vSGPzREGMWpfPblkKUs3KiNwf2BJxN2gu45sLHK0oadpJgqgVvMP2pI0W0Szxm4LKMHA8r0gpeKDzr+JAJGOuQgX2eAZcCvbrO0o4fW+wFDKibT1pktJuM3fKp2; Expires=Fri, 03 Nov 2023 17:40:29 GMT; Path=/". Invalid 'expires' attribute: Fri, 03 Nov 2023 17:40:29 GMT
Oct 27, 2023 10:42:45 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=4vSGPzREGMWpfPblkKUs3KiNwf2BJxN2gu45sLHK0oadpJgqgVvMP2pI0W0Szxm4LKMHA8r0gpeKDzr+JAJGOuQgX2eAZcCvbrO0o4fW+wFDKibT1pktJuM3fKp2; Expires=Fri, 03 Nov 2023 17:40:29 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Fri, 03 Nov 2023 17:40:29 GMT
[INFO] Creating temporary ES data file /tmp/es-3056316296870078169.json
[WARN] Could not parse LDD date 2023-03-10T08:22:37
[ERROR] Could not parse date from 2023-03-10T08:22:37 using patterns defined in LddUtils.Accepted_LDD_DateFormats
[WARN] Will use field definitions from [PDS4_PDS_1500.JSON, PDS4_PDS_1900.JSON, PDS4_PDS_1A10.JSON, PDS4_PDS_1B00.JSON, PDS4_PDS_1F00.JSON]
[INFO] Updating Elasticsearch schema.
[INFO] Updated 1 fields
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/test_data/initial_reports/initial_reports_volume2.lblx
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier::1.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier::1.0'. Preview of field's value: 'not set by harvest'
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier:document::2.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier:document::2.0'. Preview of field's value: 'not set by harvest'
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier:initial_reports::2.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier:initial_reports::2.0'. Preview of field's value: 'not set by harvest'
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier:document:sample_dossier_release_notes::2.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier:document:sample_dossier_release_notes::2.0'. Preview of field's value: 'not set by harvest'
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier:initial_reports:initial_reports_volume1::3.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier:initial_reports:initial_reports_volume1::3.0'. Preview of field's value: 'not set by harvest'
[ERROR] LIDVID = urn:nasa:pds:mars2020_sample_dossier:initial_reports:initial_reports_volume2::3.0, Message = failed to parse field [ops:Harvest_Info/ops:harvest_date_time] of type [date] in document with id 'urn:nasa:pds:mars2020_sample_dossier:initial_reports:initial_reports_volume2::3.0'. Preview of field's value: 'not set by harvest'
[INFO] Wrote 0 product(s)
[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 0
[SUMMARY] Failed files: 6
[SUMMARY] Package ID: a445c7e1-079b-4f1c-9092-c0d4c8503905

While harvest finds all of the files, the configuration does not allow it to push them, for reasons unclear to me. However, this ticket is about finding the files, not about pushing them to OpenSearch.

♻️ Related Issues

Closes #129
Closes #130

@al-niessner self-assigned this Oct 24, 2023

@al-niessner
Contributor Author

@jordanpadams @tloubrieu-jpl

Sorry for the typos

Al Niessner added 3 commits October 26, 2023 13:11
Use the .xml and .lblx extensions as the primary filter, then open each matching file and check whether it contains a LID and a product class within an Identification_Area. If so, treat it as a label.
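
The first stage described in the commit message, the cheap extension filter, could look roughly like the sketch below. It uses only the JDK. The class name `LabelExtensionFilter`, the method name, and the case-insensitive match are assumptions made for illustration; they are not taken from the harvest source.

```java
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of harvest: first-stage filter on file extension.
final class LabelExtensionFilter {
    // The two extensions named in this PR. Matching them case-insensitively is an assumption.
    private static final Set<String> LABEL_EXTENSIONS = Set.of("xml", "lblx");

    private LabelExtensionFilter() {}

    /** Returns true when the file name ends in .xml or .lblx. */
    static boolean hasLabelExtension(Path file) {
        Path name = file.getFileName();
        if (name == null) {
            return false; // e.g. a root path with no file name component
        }
        String text = name.toString();
        int dot = text.lastIndexOf('.');
        if (dot < 0 || dot == text.length() - 1) {
            return false; // no extension, or a trailing dot
        }
        String extension = text.substring(dot + 1).toLowerCase(Locale.ROOT);
        return LABEL_EXTENSIONS.contains(extension);
    }
}
```

Files that pass this filter would still need the content check (LID and product class inside an Identification_Area) before being treated as labels.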
@al-niessner
Contributor Author

@jordanpadams @tloubrieu-jpl

Here is a cleaner and more comprehensive test to make sure the XML file is a label. It could be made better, but this was quick and easy.

I also see I broke the testing. I added a JUnit test to verify that the code works as desired (both positive and negative tests). However, it seems that I need to update something, maybe the pom? I am going to dig into it, but if the answer is already known, a quick reply would be welcome.

@al-niessner
Contributor Author

@jordanpadams @tloubrieu-jpl

Yup, pom updated.

@jordanpadams changed the title from "issue 129: allow for lblx extension" to "Update to allow for .lblx label file extension" Oct 26, 2023
@pdsen-ci requested a review from a team as a code owner October 26, 2023 20:27
Source source = new SAXSource(new InputSource(new FileReader(filename)));
TreeInfo docInfo = configuration.buildDocumentTree(source , options);
NodeInfo ia=null,lid=null,pcls=null;
for (NodeInfo top : docInfo.getRootNode().children()) {
Member

Using SAX makes the code that reads the product class a bit complicated, but it is the right approach since we have noticed that some labels can be very big.
Apparently an alternative would be StAX, but that might not be worth the effort.
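
For comparison, here is a minimal sketch of what the StAX alternative might look like, using only the JDK's `javax.xml.stream` API. This is not the code in this PR, which uses Saxon. The class name `StaxLabelPeek`, the method name, and the choice to stop at the end of the first Identification_Area are all assumptions made for illustration.

```java
import java.io.Reader;
import java.util.Optional;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of harvest: peek at the top of a candidate label with StAX.
final class StaxLabelPeek {
    private StaxLabelPeek() {}

    /**
     * Returns the product_class of the first Identification_Area that also
     * carries a non-empty logical_identifier; empty if the input is not
     * well-formed XML or has no such area.
     */
    static Optional<String> productClass(Reader xml) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Treat candidate files as untrusted: no DTDs, no external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = null;
        try {
            reader = factory.createXMLStreamReader(xml);
            boolean inArea = false;
            String lid = null;
            String productClass = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("Identification_Area".equals(name)) {
                        inArea = true;
                    } else if (inArea && "logical_identifier".equals(name)) {
                        lid = reader.getElementText().trim();
                    } else if (inArea && "product_class".equals(name)) {
                        productClass = reader.getElementText().trim();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "Identification_Area".equals(reader.getLocalName())) {
                    // Stop here: the rest of a large label is never read.
                    boolean found = lid != null && !lid.isEmpty()
                            && productClass != null && !productClass.isEmpty();
                    return found ? Optional.of(productClass) : Optional.empty();
                }
            }
            return Optional.empty();
        } catch (XMLStreamException e) {
            return Optional.empty(); // not well-formed XML, so not a label
        } finally {
            if (reader != null) {
                try { reader.close(); } catch (XMLStreamException ignored) { }
            }
        }
    }
}
```

Because the reader is pull-based, the loop can return as soon as the first Identification_Area closes, which keeps the cost independent of label size.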

Contributor Author

Should be fast enough, as this is the approach validate takes. Just reading a couple of nodes right at the top should not be too bad. Even a million nodes loop pretty quickly since we are not descending into all of them.

import org.junit.jupiter.api.Test;
import gov.nasa.pds.harvest.util.xml.XmlIs;

class XmlIsSuite {
Member

Great! Thanks Al

@al-niessner
Contributor Author

@jordanpadams @tloubrieu-jpl

Should be ready finally.

for (NodeInfo child : top.children()) {
if ("Identification_Area".equals(child.getLocalPart())) {
if (ia == null) {
ia = child;
Member

@al-niessner do we want to break out of here once we find the Identification_Area, in case this is a huge label? If there are two Identification_Areas, there are bigger problems, and it isn't Harvest's responsibility to validate that.

Contributor Author

No. Pluto might have an XML file as data with multiple Identification_Area elements, and this helps us say it is not a PDS4 label.
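
The trade-off in this thread can be made concrete with a small sketch: rejecting a file with more than one Identification_Area means visiting every second-level element rather than stopping at the first match. The version below uses the JDK's StAX reader instead of the Saxon tree used in this PR, and the class and method names are hypothetical.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of harvest: count Identification_Area children of the root.
final class IdentificationAreaCount {
    private IdentificationAreaCount() {}

    /** True only when the root element has exactly one Identification_Area child. */
    static boolean hasExactlyOne(Reader xml) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Treat candidate files as untrusted: no DTDs, no external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = null;
        try {
            reader = factory.createXMLStreamReader(xml);
            int depth = 0; // 1 = root element, 2 = direct children of the root
            int count = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2 && "Identification_Area".equals(reader.getLocalName())) {
                        count++;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            // No early exit: a second area anywhere among the root's children must be seen.
            return count == 1;
        } catch (XMLStreamException e) {
            return false; // not well-formed XML, so not a label
        } finally {
            if (reader != null) {
                try { reader.close(); } catch (XMLStreamException ignored) { }
            }
        }
    }
}
```

Unlike a tree walk that skips over subtrees, a streaming reader passes through every event in the file, so this check costs one full read per candidate file.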

Member

@al-niessner I would almost prefer that the software simply break on bad data we assumed was right, rather than assume all of it could be wrong, but that is not worth the time right now.

@jordanpadams merged commit 728a1bd into main Oct 31, 2023
1 check passed
@jordanpadams deleted the issue_129.1 branch October 31, 2023 16:00
Labels: none yet
Projects: none yet
Participants: 3