m2020 PIXL bundle is not properly loading all collections/products #143
Comments
The whole bundle (complete test) requires about 4 GB of disk space. Are we going to want this as a test via Postman, etc.?
@al-niessner Not the entire data set. Hopefully we can pick out a representative test case to include, if possible.
Was not expecting problems so did not save to log, but validate does not pass:
Will rerun and detail problems to make sure they are not the cause. Will report back after it has run again.
Sorry, but I cannot reproduce this problem. I used wget to grab the bundle and bring it local to my machine for testing. While validate finds a failure:
harvest simply does not care about it. Nor is it in this ticket's purview to care about PDF. The point is that harvest and validate agree on the total number of files (see the comment above for the validate numbers):
No matter how many times I wipe and fiddle with the configuration file, all files load. I do not see the missing references when changing the state from staged to archived. I even dug into the code to watch it through the debugger, and there is absolutely nothing that would make it not descend into the data oxides sol directories. What was reported in the user's harvest log:
I should note that the failed files may be a batch of them, not a single file, but I am not sure why it says read when it writes in batches. Also, the difference between files loaded, 11138 - 11036, is 102: the missing data oxides sol files. Same with Product_Observational. To be clear, let's cover what actually happens. harvest first searches the given directory and no other (it does not descend) for the bundle. It finds it, reads it, and pushes it into the DB. Once the bundle is done, it descends into all directories below the one given, looking for collections. Both harvest log files show the same collections being processed:
The numbers that it writes are array blocks of 500 references or fewer, which is why they vary. It then descends through all directories again and processes all products. It uses incredibly generic Java code provided by the JDK to find all the files, so it is incredibly unlikely to be influenced by this specific use case. Since neither harvest configuration file, the one used for testing or the one provided by the user, has include/exclude filters, all non-bundle and non-collection items are processed. In my log file it looks like:
Processing of these files is obviously missing from the log given to us via email. The products are then batched into the DB. If a batch write is not successful, you should see error messages like:
However, the log will show that it descends into the directory even when the write fails with a timeout. It is quite possible that this batch of data did not make it into the DB, but the files would be part of data_raw_ancillary, not data_oxides_pmc. I am using a local DB for testing and never experienced this failure. The other interesting part is that I get a lot of messages that are not seen in the user-supplied harvest log:
I do not know if this is an environment difference or a harvest version difference. Given the code, I doubt it adds to or subtracts from the reported problem. However, it might if the harvest used by the user is sufficiently old and that code is not such generic Java. That leaves just one other option: the data oxides sol directories were not present when harvest was run. Not present could simply mean not readable by the user running harvest. Since I grabbed them from the net, the server could be storing them in a completely different file tree, and wget recombined them into one tree. I could invent plenty more similar stories for "they looked to be there but were not when harvest ran" but, like the ones here, they are just stories.
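For anyone following along, the three passes and the 500-reference blocks described in this comment can be sketched roughly as below. This is not harvest's actual code: the class `HarvestPassSketch`, its method names, and the filename-prefix classification (`bundle*.xml`, `collection*.xml`) are assumptions made purely for illustration, and the real tool may well inspect the label contents instead. Only `Files.list` and `Files.walk` are real JDK calls.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative sketch only; names and classification rules are hypothetical. */
class HarvestPassSketch {

    enum Kind { BUNDLE, COLLECTION, PRODUCT, OTHER }

    /** Assumption: a label's kind is guessed from its filename prefix. */
    static Kind classify(String fileName) {
        String n = fileName.toLowerCase();
        if (!n.endsWith(".xml")) return Kind.OTHER;
        if (n.startsWith("bundle")) return Kind.BUNDLE;
        if (n.startsWith("collection")) return Kind.COLLECTION;
        return Kind.PRODUCT;
    }

    /** Pass 1: labels of the wanted kind in root itself only, no descent. */
    static List<Path> topLevel(Path root, Kind wanted) throws IOException {
        try (Stream<Path> s = Files.list(root)) {
            return select(s, wanted);
        }
    }

    /** Passes 2 and 3: labels of the wanted kind anywhere at or below root. */
    static List<Path> recursive(Path root, Kind wanted) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return select(s, wanted);
        }
    }

    /** Keep regular files whose name classifies as the wanted kind. */
    private static List<Path> select(Stream<Path> s, Kind wanted) {
        return s.filter(Files::isRegularFile)
                .filter(p -> classify(p.getFileName().toString()) == wanted)
                .sorted()
                .collect(Collectors.toList());
    }

    /** Split items into consecutive blocks of at most maxSize (e.g. 500). */
    static <T> List<List<T>> batches(List<T> items, int maxSize) {
        if (maxSize <= 0) throw new IllegalArgumentException("maxSize must be positive");
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += maxSize) {
            int end = Math.min(i + maxSize, items.size());
            out.add(new ArrayList<>(items.subList(i, end)));
        }
        return out;
    }
}
```

Under these assumptions, `topLevel(root, Kind.BUNDLE)` only ever sees labels sitting directly in the given directory, while `recursive(root, Kind.PRODUCT)` reaches anything under the sol directories, provided they are present on disk when the walk runs. The last block returned by `batches` is usually shorter than 500, which is why the written counts vary.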
@scholes-ds if you are still encountering this issue with the latest snapshot of harvest, let us know and we will move this data set over to a Windows VM to see if we can reproduce it there.
Closing as invalid for the time being, but will re-open if this is still an issue.
Confirmed with user this is no longer an issue.
Checked for duplicates
Yes - I've already checked
🐛 Describe the bug
When I tried to load the m2020 PIXL bundle, it did not load the urn:nasa:pds:mars2020_pixl:data_oxides_pmc collection.
🕵️ Expected behavior
All products within this collection to be loaded.
📜 To Reproduce
Most other collections have more documents like:
Running registry-mgr on the bundle gives:
🖥 Environment Info
Windows Enterprise Server
📚 Version of Software Used
Harvest version: 3.9.0-SNAPSHOT
Build time: 2023-10-31T16:02:29Z
🩺 Test Data / Additional context
https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-mars2020_pixl/
🦄 Related requirements
All
⚙️ Engineering Details
No response