Harvest service sometimes skips collection inventory files #25
Comments
And just to prove how inconsistent this is, after the above test case I did the following:
And it successfully ingested the data_btemp collection, but not data_raw
Actually, I think I see the pattern now in both the logs above. If there is a collection inventory ingestion happening during the point at which it encounters another collection inventory, it is skipped. That's why the first and last ones worked initially, and just the first worked this second time. I'm not crazy! |
copy @mdrum . we will take a look |
@jordanpadams, should this issue be reproducible with our docker compose setup too? |
@ramesh-maddegoda depends. I'm not sure what versions our docker compose setup is using. Per above Mike is using: registry-harvest-service-1.0.1-SNAPSHOT |
@mdrum, are you using the docker images of the scalable harvest service? If so, what are the versions of those docker images? |
No, I think documentation stated that the docker setup wasn’t suitable for
production so we didn’t go with that.
On Fri, Aug 19, 2022 at 11:52 AM Ramesh Maddegoda ***@***.***> wrote:
@mdrum <https://github.com/mdrum>, are you using the docker images of
scalable harvest service? If so, what the versions of those docker images?
--
Mike Drum
NASA PDS Small Bodies Node
Planetary Science Institute
520-382-0601
|
@mdrum sorry for the confusion. We will revise those instructions. They should explicitly call out that deploying the Registry using Docker is not suitable for operations. The Registry Loader tools, on the other hand, are good to go with Docker. Will create another ticket for that. |
Status: @ramesh-maddegoda is researching PDS4 and the difference between bundles, collections, and products. |
… overwrite the same ID added as a result of a situation where Harvest Service consumed a FileBatch, which includes the ID of Collection Inventory, before processing the CollectionInventory. Refer to issue NASA-PDS/registry-harvest-service#25
@mdrum and @jordanpadams, I investigated this issue and found 2 problems causing it.

Problem 1: Too many open files error

The As a result, when This issue can be resolved by updating the related configuration in the operating system. The following articles discuss the workaround to fix this. On macOS I was able to fix this by using the following command:

sudo launchctl limit maxfiles 65536 200000

However, please note that the above value is only applicable to the current terminal session and it is required to add this to

Problem 2: The labels of some collection inventory files were added to a batch of products and processed before processing the collection inventory file

In The same collection inventory files can also be submitted to When However, if the To resolve this, a change was made in the following pull request. However, it needs to be thoroughly reviewed before the change is merged. NASA-PDS/registry-crawler-service#22

The change is in src/main/java/gov/nasa/pds/crawler/proc/DirectoryProcessor.java:

// Collection label
if(info instanceof PdsCollectionInfo)
{
    // Allow a Collection Inventory to overwrite the same ID added as a
    // result of a situation where Harvest Service consumed a FileBatch, which includes the ID
    // of Collection Inventory, before processing the CollectionInventory.
    // Refer to https://github.com/NASA-PDS/registry-harvest-service/issues/25
    dirMsg.overwrite = true;
    publishCollectionInventory(dirMsg, path, (PdsCollectionInfo)info);
}
|
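To illustrate why the overwrite flag matters, here is a minimal in-memory sketch. It is not the actual Harvest Service or Crawler code: it simply assumes a store that skips any ID it has already seen unless the incoming message is flagged as overwrite. The class `RegistrySketch`, its methods, and the example ID are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the registry; not the real Harvest Service code.
final class RegistrySketch {
    // Maps a product ID to the kind of message that last wrote it.
    private final Map<String, String> docs = new HashMap<>();

    // Returns true if the message was processed, false if it was skipped
    // because the ID already exists and overwrite is off.
    boolean submit(String id, String source, boolean overwrite) {
        if (docs.containsKey(id) && !overwrite) {
            return false;
        }
        docs.put(id, source);
        return true;
    }

    // Returns the kind of message that last wrote the ID, or null if unknown.
    String sourceOf(String id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        String id = "urn:example:collection_data_raw::1.0"; // made-up ID
        RegistrySketch registry = new RegistrySketch();
        // A product batch that includes the collection label arrives first.
        registry.submit(id, "batch", false);
        // Without overwrite, the later collection inventory message is skipped.
        System.out.println(registry.submit(id, "inventory", false));
        // With overwrite set, the inventory message is processed.
        System.out.println(registry.submit(id, "inventory", true));
    }
}
```

Under this assumption, the order in which the batch and the inventory message arrive decides whether the inventory is skipped, which would match the inconsistent behaviour reported in this issue.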
@mdrum, the attached log file contains the output of the registry components after the fix suggested above. Can you please check the attached log file and let us know your thoughts? Please note that it has processed all 5 Collection Inventory files.
docker-registry-harvest-service-1 | [INFO] Started collection inventory consumer
docker-registry-harvest-service-1 | [INFO] Processing collection inventory file /tmp/data/browse/collection_hyb2_tir_browse.csv
docker-registry-harvest-service-1 | [INFO] Processing batch of 1 products: /tmp/data/bundle_hyb2_tir.xml, ...
docker-registry-harvest-service-1 | [INFO] Started manager command consumer
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20180801
docker-registry-harvest-service-1 | [INFO] Processing /tmp/data/bundle_hyb2_tir.xml
docker-registry-harvest-service-1 | [INFO] Processing collection inventory file /tmp/data/calibration/collection_hyb2_tir_calibration.csv
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20181231
docker-registry-harvest-service-1 | [INFO] Loaded 21 document(s)
docker-registry-harvest-service-1 | [INFO] Processing collection inventory file /tmp/data/data_btemp/collection_hyb2_tir_data_btemp.csv
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190308
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190313
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190321
docker-registry-harvest-service-1 | [INFO] Processing collection inventory file /tmp/data/data_raw/collection_hyb2_tir_data_raw.csv
docker-registry-harvest-service-1 | [INFO] 404 - Not Found
docker-registry-harvest-service-1 | [INFO] Will retry in 5 seconds
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190404
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190405
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190417
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190418
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190419
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190424
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190425
docker-registry-harvest-service-1 | [INFO] Processing collection inventory file /tmp/data/document/collection_hyb2_tir_document.csv
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190612
docker-registry-crawler-service-1 | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190613
docker-registry-harvest-service-1 | [INFO] 404 - Not Found
docker-registry-harvest-service-1 | [ERROR] Could not download https://data.darts.isas.jaxa.jp/pub/pds4/mission/hyb2/v1/PDS4_HYB2_1E00_1100.JSON
docker-registry-harvest-service-1 | [WARN] Will use 'keyword' data type.
docker-registry-harvest-service-1 | [INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1E00.xsd
docker-registry-harvest-service-1 | [INFO] This LDD already loaded.
docker-registry-harvest-service-1 | [INFO] Updating Elasticsearch schema. |
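One way to check which collection inventories a run actually picked up is to scan the harvest log for the "Processing collection inventory file" lines seen in the output above. The following is a minimal sketch, assuming the log message format stays as shown here; the class and method names are hypothetical and not part of any registry tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading a harvest log; not part of the registry tools.
final class InventoryLogScan {
    // Matches the log message seen in this thread and captures the file path.
    private static final Pattern INVENTORY_LINE =
        Pattern.compile("Processing collection inventory file (\\S+)");

    // Returns the inventory file paths in the order they appear in the log text.
    static List<String> processedInventories(String log) {
        List<String> paths = new ArrayList<>();
        Matcher m = INVENTORY_LINE.matcher(log);
        while (m.find()) {
            paths.add(m.group(1));
        }
        return paths;
    }
}
```

Comparing the returned paths with the inventory files present in the bundle directory would show which collections, if any, were skipped in a given run.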
@ramesh-maddegoda for the too many open files issue, can we resolve this within the software itself by closing some of those files, or by pausing at the maximum number of open files until some are closed? The reason is that we cannot expect our users to update this OS config every time they run the software against a large data set. If we require this, we will have to figure out and document this setting on approx 6-10 different OSes, including numerous variations of shells. |
@jordanpadams, in fact, handling that in the code is one of the first things I considered too. This "Too many open files" issue happens in the following
Please note that in the existing code the file reader is already being closed after each message. However, I think we should come up with a different algorithm or design to address this issue at the software level. Having said that, please note that the default max files value for macOS is 256, which is very limited compared to the scale of this task. |
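As a rough illustration of the "stop at the max until some are closed" idea, here is a minimal sketch that caps the number of files held open at once with a semaphore. It is only an assumption about one possible design, not the Harvest Service implementation: `BoundedFileReader` and its methods are hypothetical, and the cap only covers files opened through this helper (not sockets or files opened by libraries).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Semaphore;

// Hypothetical helper; caps how many files are open at the same time.
final class BoundedFileReader {
    private final Semaphore permits;

    BoundedFileReader(int maxOpenFiles) {
        this.permits = new Semaphore(maxOpenFiles);
    }

    // Blocks until a permit is free, reads the first line, then closes the
    // file and returns the permit even if reading fails.
    String readFirstLine(Path file) throws IOException, InterruptedException {
        permits.acquire();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return reader.readLine();
        } finally {
            permits.release();
        }
    }

    // Number of additional files that could be opened right now.
    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A cap chosen below the OS limit (for example, below the macOS default mentioned above) would make callers wait instead of failing, at the cost of some throughput.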
The second issue sounds like exactly what I was experiencing. I've never run into the file limit issue, since we're using this on a Linux server that must have a different config. The logs look good to me. Thanks for digging into this @ramesh-maddegoda |
… overwrite the same ID added as a result of a situation where Harvest Service consumed a FileBatch, which includes the ID of Collection Inventory, before processing the CollectionInventory. (#22) Refer to issue NASA-PDS/registry-harvest-service#25
🐛 Describe the bug
Inconsistently, the scalable harvest service (might be the harvest server or crawler) will see a directory containing a collection inventory and skip it, instead of ingesting it for the references.
On several occasions, the harvest service has been pointed to a directory containing a bundle, and ingested some of the collection inventory files but skipped others. Even further, I will delete the collection docs in the registry index and re-harvest, only to see it ingest a different subset of the collections it was pointed to.
So, frustratingly, I don't have perfect steps to reproduce. I do have some logs demonstrating the harvest service output in these scenarios though.
In the below example, the harvest job was pointed at this directory with the four collections having been deleted from the registry (there were no matching docs in registry or registry-refs). Note that it sees and harvests all four collection labels, but only picks up the collection inventories for the document and calibration collections. (The browse collection had already previously been correctly ingested.)
🕵️ Expected behavior
I expect harvest service to detect any collection inventory files, and harvest them into the registry-refs directory
📚 Version of Software Used
registry-harvest-service-1.0.1-SNAPSHOT
registry-crawler-service-1.0.0
🩺 Test Data / Additional context
The bundle mentioned above can be downloaded here (24GB direct download)
🦄 Related requirements
⚙️ Engineering Details