m2020 PIXL bundle is not properly loading all collections/products #143

jordanpadams · 2023-11-29T17:05:08Z

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When I tried to load the m2020 PIXL bundle, the urn:nasa:pds:mars2020_pixl:data_oxides_pmc collection was not loaded.

🕵️ Expected behavior

All products within this collection should be loaded.

📜 To Reproduce

2023-11-28 12:33:03,517 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_oxides_pmc\collection_data_oxides_pmc.xml
2023-11-28 12:33:04,486 [INFO] Wrote 1 collection inventory document(s)

Most other collections have more documents like:

2023-11-28 12:33:07,252 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_raw_ancillary\collection_data_raw_ancillary.xml
2023-11-28 12:33:09,596 [INFO] Wrote 20 collection inventory document(s)

Running registry-mgr on the bundle gives:

[INFO] Setting product status. LIDVID = urn:nasa:pds:mars2020_pixl::2.0, status = archived
[WARN] Collection urn:nasa:pds:mars2020_pixl:data_imaging::8.0 doesn't have primary products.
[ERROR] [_doc][urn:nasa:pds:mars2020_pixl:data_oxides_pmc:ps__0558_0716511093_000rqa__02800001941181450000___j01::1.0]: document missing
[ERROR] [_doc][urn:nasa:pds:mars2020_pixl:data_oxides_pmc:ps__0558_0716511093_000rqb__02800001941181450000___j01::1.0]: document missing
[ERROR] [_doc][urn:nasa:pds:mars2020_pixl:data_oxides_pmc:ps__0558_0716511093_000rqc__02800001941181450000___j01::1.0]: document missing
[ERROR] [_doc][urn:nasa:pds:mars2020_pixl:data_oxides_pmc:ps__0560_0716654103_000rqa__02800001947735050000___j01::1.0]: document missing
...

🖥 Environment Info

Windows Enterprise Server

📚 Version of Software Used

Harvest version: 3.9.0-SNAPSHOT
Build time: 2023-10-31T16:02:29Z

🩺 Test Data / Additional context

https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-mars2020_pixl/

🦄 Related requirements

All

⚙️ Engineering Details

No response

al-niessner · 2023-11-29T17:08:39Z

@jordanpadams

The whole bundle (complete test) requires about 4 GB of disk space. Do we want this as a test via Postman, etc.?

jordanpadams · 2023-11-29T17:11:11Z

@al-niessner Not the entire data set. Hopefully we can pick out a representative test case to include, if possible.

al-niessner · 2023-11-29T20:08:00Z

@jordanpadams

I was not expecting problems, so I did not save the output to a log, but validate does not pass:

Summary:

  2 error(s)
  829 warning(s)

  Product Validation Summary:
    11137      product(s) passed
    1          product(s) failed
    0          product(s) skipped

  Referential Integrity Check Summary:
    11138      check(s) passed
    0          check(s) failed
    0          check(s) skipped

  Message Types:
    2            error.pdf.file.not_pdfa_compliant
    829          warning.integrity.reference_not_found

End of Report
Completed execution in 12324484 ms

Will rerun and detail problems to make sure they are not the cause. Will report back after it has run again.

al-niessner · 2023-12-01T21:22:28Z

@jordanpadams @scholes-ds

Sorry, but I cannot reproduce this problem. I used wget to grab the bundle and bring it local to my machine for testing.

While validate finds a failure:

  FAIL: file:/home/niessner/Projects/PDS/validate/src/test/resources/harvest141/document/pixl_edr_sis.xml
      ERROR  [error.pdf.file.not_pdfa_compliant]   Validation failed for flavour PDF/A-1b in file mars2020_pixl_labels_sort_pds.pdf.
      ERROR  [error.pdf.file.not_pdfa_compliant]   Validation failed for flavour PDF/A-1b in file mars2020_pixl_labels_sort_vicar.pdf.

harvest simply does not care about it, nor is it in this ticket's purview to care about PDF. The point is that harvest and validate agree on the total number of files (see the comment above for the validate numbers):

[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 11138
[SUMMARY]   Product_Bundle: 1
[SUMMARY]   Product_Collection: 6
[SUMMARY]   Product_Document: 5
[SUMMARY]   Product_Observational: 11126
[SUMMARY] Failed files: 0
[SUMMARY] Package ID: 4976b2fd-e5bf-4a7d-a685-6cf43f3adb80

No matter how many times I wipe and fiddle with the configuration file, all files load. I do not see the missing references when changing the state from staged to archived. I even dug into the code to watch it through the debugger, and there is absolutely nothing that would make it not descend into the data oxides sol directories.

What was reported in the user's harvest log:

2023-11-28 12:53:06,196 [SUMMARY] Summary:
2023-11-28 12:53:06,196 [SUMMARY] Skipped files: 0
2023-11-28 12:53:06,196 [SUMMARY] Loaded files: 11036
2023-11-28 12:53:06,196 [SUMMARY]   Product_Bundle: 1
2023-11-28 12:53:06,196 [SUMMARY]   Product_Collection: 6
2023-11-28 12:53:06,196 [SUMMARY]   Product_Document: 5
2023-11-28 12:53:06,196 [SUMMARY]   Product_Observational: 11024
2023-11-28 12:53:06,196 [SUMMARY] Failed files: 1
2023-11-28 12:53:06,196 [SUMMARY] Package ID: c4a137a0-88cc-49be-96f9-17f4222a0a50

I should note that the failed files may be a batch of them, not a single file, but I am not sure why it says "read" when it writes in batches. Also, the difference in loaded files, 11138 - 11036 = 102, is the missing data oxides sol files. The same difference appears in Product_Observational (11126 - 11024 = 102).
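The comparison above can be reproduced mechanically. The following is a minimal sketch, not part of harvest: `parse_summary` and `diff_counts` are hypothetical helper names, and the regular expression assumes `[SUMMARY]` lines formatted exactly like the two summaries quoted in this comment.

```python
import re

# Assumes summary lines look like "[SUMMARY]   Product_Observational: 11126",
# optionally preceded by a timestamp. Lines without a trailing integer
# (for example "Package ID: ...") are ignored.
SUMMARY_LINE = re.compile(r"\[SUMMARY\]\s+([A-Za-z_ ]+):\s+(\d+)\s*$")

def parse_summary(log_text):
    """Return {name: count} for every numeric [SUMMARY] line in a harvest log."""
    counts = {}
    for line in log_text.splitlines():
        match = SUMMARY_LINE.search(line)
        if match:
            counts[match.group(1).strip()] = int(match.group(2))
    return counts

def diff_counts(reference, other):
    """Return reference - other for every key where the two counts differ."""
    keys = list(reference) + [k for k in other if k not in reference]
    return {
        k: reference.get(k, 0) - other.get(k, 0)
        for k in keys
        if reference.get(k, 0) != other.get(k, 0)
    }

LOCAL = """\
[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 11138
[SUMMARY]   Product_Bundle: 1
[SUMMARY]   Product_Collection: 6
[SUMMARY]   Product_Document: 5
[SUMMARY]   Product_Observational: 11126
[SUMMARY] Failed files: 0
[SUMMARY] Package ID: 4976b2fd-e5bf-4a7d-a685-6cf43f3adb80
"""

USER = """\
2023-11-28 12:53:06,196 [SUMMARY] Summary:
2023-11-28 12:53:06,196 [SUMMARY] Skipped files: 0
2023-11-28 12:53:06,196 [SUMMARY] Loaded files: 11036
2023-11-28 12:53:06,196 [SUMMARY]   Product_Bundle: 1
2023-11-28 12:53:06,196 [SUMMARY]   Product_Collection: 6
2023-11-28 12:53:06,196 [SUMMARY]   Product_Document: 5
2023-11-28 12:53:06,196 [SUMMARY]   Product_Observational: 11024
2023-11-28 12:53:06,196 [SUMMARY] Failed files: 1
2023-11-28 12:53:06,196 [SUMMARY] Package ID: c4a137a0-88cc-49be-96f9-17f4222a0a50
"""

print(diff_counts(parse_summary(LOCAL), parse_summary(USER)))
# {'Loaded files': 102, 'Product_Observational': 102, 'Failed files': -1}
```

Only the loaded-file and observational-product counts differ (by 102 each), along with the single failed-file entry.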

To be clear, let's cover what actually happens. harvest first searches the given directory and no other (it does not descend) for the bundle. It finds it, reads it, and pushes it into the DB. Once the bundle is done, it then descends into all directories below the one given, looking for collections. Both harvest log files show the same collections being processed:

2023-11-28 12:32:59,517 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_imaging\collection_data_imaging.xml
2023-11-28 12:33:03,471 [INFO] Wrote 26 collection inventory document(s)
2023-11-28 12:33:03,517 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_oxides_pmc\collection_data_oxides_pmc.xml
2023-11-28 12:33:04,486 [INFO] Wrote 1 collection inventory document(s)
2023-11-28 12:33:05,127 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_processed\collection_data_processed.xml
2023-11-28 12:33:05,486 [INFO] Wrote 1 collection inventory document(s)
2023-11-28 12:33:07,252 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_raw_ancillary\collection_data_raw_ancillary.xml
2023-11-28 12:33:09,596 [INFO] Wrote 20 collection inventory document(s)
2023-11-28 12:33:49,847 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_raw_spectroscopy\collection_data_raw_spectroscopy.xml
2023-11-28 12:33:50,644 [INFO] Wrote 2 collection inventory document(s)
2023-11-28 12:33:54,019 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\document\collection_document.xml

The numbers that it writes are array blocks of 500 references or fewer, which is why they vary. It then goes on to descend through all directories again and processes all products. It uses very generic Java code provided by the JDK to find all the files, so it is very unlikely to be influenced by this specific use case. Since neither harvest configuration file (the one used for testing and the one provided by the user) has include/exclude filters, all non-bundle and non-collection items are processed. In my log file it looks like:

[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00125/ps__0125_0678032243_000rqb__00417120483005510000___j01.xml
/home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00125/ps__0125_0678032243_000rqc__00417120483005510000___j01.xml
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00125/ps__0125_0678032243_000rqc__00417120483005510000___j01.xml
/home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqa__00518120528225320000___j01.xml
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqa__00518120528225320000___j01.xml
/home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqb__00518120528225320000___j01.xml
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqb__00518120528225320000___j01.xml
/home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqc__00518120528225320000___j01.xml
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqc__00518120528225320000___j01.xml
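To compare two harvest logs without reading them line by line, the "Processing product" entries can be tallied per collection directory. This is a rough sketch, not harvest code: `products_per_collection` is a hypothetical helper, and it assumes the collection is the first directory below the bundle root named by `bundle_dir_name` (`github143` in my local tree, `urn-nasa-pds-mars2020_pixl` in the user's).

```python
import re
from collections import Counter

def products_per_collection(log_text, bundle_dir_name):
    """Count 'Processing product' log lines per directory directly below the bundle root."""
    counts = Counter()
    for line in log_text.splitlines():
        if "Processing product" not in line:
            continue  # skips bare path lines and all other messages
        path = line.split("Processing product", 1)[1].strip()
        # Accept both Windows (backslash) and POSIX (slash) separators.
        parts = re.split(r"[\\/]+", path)
        if bundle_dir_name in parts:
            idx = parts.index(bundle_dir_name)
            # Require a directory between the bundle root and the file name.
            if idx + 1 < len(parts) - 1:
                counts[parts[idx + 1]] += 1
    return counts

LOCAL_SAMPLE = r"""
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00125/ps__0125_0678032243_000rqb__00417120483005510000___j01.xml
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00125/ps__0125_0678032243_000rqc__00417120483005510000___j01.xml
"""

USER_SAMPLE = r"""
2023-11-28 12:42:16,507 [INFO] Processing product \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_raw_ancillary\sol_00489\PE__0489_0710387107_000E08__02610041706562580025___J04.xml
"""

print(products_per_collection(LOCAL_SAMPLE, "github143"))
print(products_per_collection(USER_SAMPLE, "urn-nasa-pds-mars2020_pixl"))
```

Run against the full logs, a collection that harvest never descended into would show up with a count of zero, or not at all.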

Processing of these files is clearly missing from the log given to us via email. The products are then batched into the DB. If the batch write is not successful, then we should see error messages like:

2023-11-28 12:42:16,507 [INFO] Processing product \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_raw_ancillary\sol_00489\PE__0489_0710387107_000E08__02610041706562580025___J04.xml
2023-11-28 12:42:21,944 [WARN] DataLoader.loadBatch() request failed due to "Read timed out" (5 retries remaining)
2023-11-28 12:42:27,616 [WARN] DataLoader.loadBatch() request failed due to "Read timed out" (4 retries remaining)
2023-11-28 12:42:33,335 [WARN] DataLoader.loadBatch() request failed due to "Read timed out" (3 retries remaining)
2023-11-28 12:42:39,053 [WARN] DataLoader.loadBatch() request failed due to "Read timed out" (2 retries remaining)
2023-11-28 12:42:44,788 [WARN] DataLoader.loadBatch() request failed due to "Read timed out" (1 retries remaining)
2023-11-28 12:42:50,600 [ERROR] Read timed out

However, it will show that it descends into the directory even when it fails with a timeout. It is quite possible that this batch of data did not make it into the DB, but those products would be part of data_raw_ancillary, not data_oxides_pmc. I am using a local DB for testing and never experienced this failure. The other interesting part is that I get a lot of messages that are not seen in the user-supplied harvest log:

[INFO] Updating 'mars2020' LDD. Schema location: https://pds.nasa.gov/pds4/mission/mars2020/v1/PDS4_MARS2020_1G00_1000.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/mission/mars2020/v1/PDS4_MARS2020_1G00_1000.JSON to /tmp/LDD-12555520411772030214.JSON
Dec 01, 2023 12:54:09 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=YVxwgger9zQer3z+a4LGBwjyIYQem/b+KtkIIiMSNjGAJjmhuZKKFwu1CKXV6FRWKcVxMr9hfwFGqpFZTKKucMW4Tbu+z+2fUizHlF/jGpvQl9UHIoPJtqj6P31i; Expires=Fri, 08 Dec 2023 18:30:58 GMT; Path=/". Invalid 'expires' attribute: Fri, 08 Dec 2023 18:30:58 GMT
Dec 01, 2023 12:54:09 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=YVxwgger9zQer3z+a4LGBwjyIYQem/b+KtkIIiMSNjGAJjmhuZKKFwu1CKXV6FRWKcVxMr9hfwFGqpFZTKKucMW4Tbu+z+2fUizHlF/jGpvQl9UHIoPJtqj6P31i; Expires=Fri, 08 Dec 2023 18:30:58 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Fri, 08 Dec 2023 18:30:58 GMT
[INFO] Creating temporary ES data file /tmp/es-6622149530635502458.json
[INFO] Loading ES data file: /tmp/es-6622149530635502458.json
[INFO] Loaded 321 document(s)

I do not know if this is an environment difference or a harvest version difference. Given the code, I doubt it adds to or subtracts from the reported problem. However, it might if the harvest used by the user is sufficiently old and that code is not such generic Java.

That leaves just one other option: the data oxides sol directories were not present when harvest was run. "Not present" could simply mean not readable by the user running harvest. Since I grabbed them from the net, the server could be storing them in a completely different file tree, and wget recombined them into one tree. I could invent plenty more similar stories for "they looked to be there but were not when harvest ran", but like the ones here, they are just stories.
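One way to test the "not present or not readable" story is to walk the bundle directory as the same account that runs harvest. The following is a sketch only, in Python rather than harvest's Java, so it does not exercise harvest's own file discovery. `unreadable_dirs` and `xml_counts` are hypothetical helper names, and the root path in the usage note is taken from the user's log.

```python
import os

def unreadable_dirs(root):
    """Return (path, reason) for every directory os.walk could not list."""
    errors = []
    # os.walk silently skips unlistable directories unless onerror is given.
    for _ in os.walk(root, onerror=errors.append):
        pass
    return [(err.filename, err.strerror) for err in errors]

def xml_counts(root):
    """Count .xml files per top-level directory below root ('.' is root itself)."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = rel.split(os.sep)[0]
        n = sum(1 for name in filenames if name.lower().endswith(".xml"))
        counts[top] = counts.get(top, 0) + n
    return counts

if __name__ == "__main__":
    import sys
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, reason in unreadable_dirs(root):
        print(f"cannot read {path}: {reason}")
    for top, n in sorted(xml_counts(root).items()):
        print(f"{top}: {n} label file(s)")
```

Running it with `\\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl` as the argument on the Windows host would show whether data_oxides_pmc reports any label files below it for that account.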

jordanpadams · 2023-12-04T15:18:15Z

@scholes-ds if you are still encountering this issue with the latest snapshot of harvest, let us know and we will move this data set over to a Windows VM to see if we can reproduce over there.

jordanpadams · 2024-02-05T21:10:47Z

Closing as invalid for the time being, but will re-open if this is still an issue.

jordanpadams · 2024-02-06T00:12:45Z

Confirmed with the user that this is no longer an issue.

jordanpadams added bug Something isn't working B14.1 sprint-backlog s.high labels Nov 29, 2023

jordanpadams assigned al-niessner Nov 29, 2023

jordanpadams changed the title ~~Fix issue with m2020 PIXL bundle not properly loading all collections/products.~~ Fix issue with m2020 PIXL bundle not properly loading all collections/products Nov 29, 2023

jordanpadams changed the title ~~Fix issue with m2020 PIXL bundle not properly loading all collections/products~~ m2020 PIXL bundle is not properly loading all collections/products Nov 29, 2023

al-niessner mentioned this issue Nov 30, 2023

As a user, I want to sacrifice resources for faster processing NASA-PDS/validate#774

Open

jordanpadams added the needs:receivable label Dec 4, 2023

This was referenced Dec 7, 2023

check for duplicate lids pointing at same file NASA-PDS/validate#784

Merged

As a user, I want validate with the registry when a file is being referenced by more than one label NASA-PDS/validate#773

Closed

al-niessner mentioned this issue Jan 22, 2024

Refactor harvest to operate with new multi-tenant, serverless OpenSearch architecture #146

Merged

jordanpadams added the invalid This doesn't seem right label Feb 5, 2024

jordanpadams closed this as completed Feb 5, 2024

al-niessner mentioned this issue Feb 29, 2024

validate is slow or runs out of memory when validating a bundle NASA-PDS/validate#826

Closed

jordanpadams removed the sprint-backlog label Aug 6, 2024
