This repository has been archived by the owner on Feb 21, 2024. It is now read-only.

Harvest service sometimes skips collection inventory files #25

Closed
Tracked by #30
mdrum opened this issue Aug 15, 2022 · 13 comments
Labels
B13.0 B13.1 bug Something isn't working i&t.done s.high High severity

Comments

@mdrum

mdrum commented Aug 15, 2022

🐛 Describe the bug

Inconsistently, the scalable harvest service (might be the harvest server or crawler) will see a directory containing a collection inventory and skip it, instead of ingesting it for the references.

On several occasions, the harvest service has been pointed at a directory containing a bundle and has ingested some of the collection inventory files while skipping others. Worse, I can delete the collection docs in the registry index and re-harvest, only to see it ingest a different subset of the collections it was pointed to.

So, frustratingly, I don't have perfect steps to reproduce. I do have some logs demonstrating the harvest service output in these scenarios though.

In the example below, the harvest job was pointed at this directory with the four collections having been deleted from the registry (there were no matching docs in registry or registry-refs). Note that it sees and harvests all four collection labels, but only picks up the collection inventories for the document and calibration collections. (The browse collection had already been correctly ingested previously.)

[INFO] Processing batch of 2 products: /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/calibration/collection_hyb2_tir_calibration.xml, ...
[INFO] Processing collection inventory file /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/calibration/collection_hyb2_tir_calibration.csv
[INFO] Processing /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/calibration/collection_hyb2_tir_calibration.xml
[INFO] Extract metadata with rule /dsk1/www/archive/pds4/
[INFO] Loading data.
[INFO] Loaded 1 products.
[INFO] Processing batch of 1 products: /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/data_btemp/collection_hyb2_tir_data_btemp.xml, ...
[INFO] Processing /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/data_btemp/collection_hyb2_tir_data_btemp.xml
[INFO] Extract metadata with rule /dsk1/www/archive/pds4/
[INFO] Loading data.
[INFO] Loaded 1 products.
[INFO] Processing batch of 1 products: /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/data_raw/collection_hyb2_tir_data_raw.xml, ...
[INFO] Processing /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/data_raw/collection_hyb2_tir_data_raw.xml
[INFO] Extract metadata with rule /dsk1/www/archive/pds4/
[INFO] Loading data.
[INFO] Wrote 23 collection inventory document(s)
[INFO] Loaded 1 products.
[INFO] Processing batch of 5 products: /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/document/TIR-Calibration_Okada_2020.xml, ...
[INFO] Processing /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/document/collection_hyb2_tir_document.xml
[INFO] Extract metadata with rule /dsk1/www/archive/pds4/
[INFO] Loading data.
[INFO] Processing collection inventory file /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/document/collection_hyb2_tir_document.csv
[INFO] Loaded 1 products.
[INFO] Wrote 2 collection inventory document(s)

🕵️ Expected behavior

I expect the harvest service to detect any collection inventory files and harvest them into the registry-refs index.

📚 Version of Software Used

registry-harvest-service-1.0.1-SNAPSHOT
registry-crawler-service-1.0.0

🩺 Test Data / Additional context

The bundle mentioned above can be downloaded here (24GB direct download)


🦄 Related requirements

⚙️ Engineering Details

@mdrum mdrum added bug Something isn't working needs:triage labels Aug 15, 2022
@mdrum
Author

mdrum commented Aug 15, 2022

And just to prove how inconsistent this is, after the above test case I did the following:

registry-manager delete-data -lid urn:jaxa:darts:hyb2_tir:data_btemp -es {redacted} -auth {redacted}
registry-manager delete-data -lid urn:jaxa:darts:hyb2_tir:data_raw -es {redacted} -auth {redacted}
harvest-client harvest -j harvest-dir.xml

And it successfully ingested the data_btemp collection, but not data_raw

[INFO] Processing batch of 1 products: /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/data_btemp/collection_hyb2_tir_data_btemp.xml, ...
[INFO] Processing collection inventory file /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/data_btemp/collection_hyb2_tir_data_btemp.csv
[INFO] Processing /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/data_btemp/collection_hyb2_tir_data_btemp.xml
[INFO] Extract metadata with rule /dsk1/www/archive/pds4/
[INFO] Loading data.
[INFO] Loaded 1 products.
[INFO] Processing batch of 1 products: /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/data_raw/collection_hyb2_tir_data_raw.xml, ...
[INFO] Processing /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/data_raw/collection_hyb2_tir_data_raw.xml
[INFO] Extract metadata with rule /dsk1/www/archive/pds4/
[INFO] Loading data.
[INFO] Loaded 1 products.
[INFO] Processing batch of 5 products: /dsk1/www/archive/pds4/hayabusa2/hyb2_tir_v1.0/document/TIR-Calibration_Okada_2020.xml, ...
[INFO] Wrote 23 collection inventory document(s)

Actually, I think I see the pattern now in both logs above. If a collection inventory ingestion is in progress at the point when another collection inventory is encountered, the second one is skipped. That's why the first and last ones worked initially, and only the first worked this second time. I'm not crazy!
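For what it's worth, the skip pattern described here resembles a classic check-then-act race. The sketch below is a generic illustration only (the class and method names are hypothetical, not from the Harvest code): whichever code path registers a product ID first "wins", and a later path that checks "already seen?" silently skips its own processing of that ID.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Generic check-then-act illustration (hypothetical names, not Harvest code):
// whichever caller registers an ID first wins; any later caller that checks
// "already seen?" silently skips its own processing of that ID.
public class SkipRace {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true if this call claimed the ID (and should process it),
    // false if the ID was already claimed and processing is skipped.
    public boolean process(String lidvid) {
        return seen.add(lidvid); // Set.add is false when the ID is present
    }
}
```

Under this model, whether an inventory is processed depends purely on message arrival order, which would explain the inconsistency between runs.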

@jordanpadams
Member

copy @mdrum. we will take a look

@ramesh-maddegoda
Contributor

@jordanpadams, should this issue be reproducible with our docker compose setup too?

@jordanpadams
Member

@ramesh-maddegoda depends. I'm not sure what versions our docker compose setup is using. Per above Mike is using:

registry-harvest-service-1.0.1-SNAPSHOT
registry-crawler-service-1.0.0

@ramesh-maddegoda
Contributor

@mdrum, are you using the Docker images of the scalable harvest service? If so, what are the versions of those Docker images?

@mdrum
Author

mdrum commented Aug 19, 2022 via email

@jordanpadams
Member

@mdrum sorry for the confusion. We will revise those instructions. They should explicitly call out that deploying the Registry using Docker is not suitable for operations. The Registry Loader tools, on the other hand, are good to go with Docker. Will create another ticket for that.

@jordanpadams
Member

Status: @ramesh-maddegoda researching PDS4 and the differences between bundles, collections, and products.

ramesh-maddegoda added a commit to NASA-PDS/registry-crawler-service that referenced this issue Sep 8, 2022
… overwrite the same ID added as a result of a situation where Harvest Service consumed a FileBatch, which includes the ID of Collection Inventory, before processing the CollectionInventory.

Refer to issue NASA-PDS/registry-harvest-service#25
@ramesh-maddegoda
Contributor

ramesh-maddegoda commented Sep 8, 2022

@mdrum and @jordanpadams, I investigated this issue and found two problems causing it.

Problem 1: Too many Open Files error

The hyb2_tir_v1.0.zip data file used for this contains about 168,139 items and is 25.26 GB compressed and 36.6 GB extracted.

As a result, when registry-crawler-service processes this data directory, it opens more files than the operating system allows by default.

This issue can be resolved by updating the related configuration in the operating system.

The following articles discuss workarounds for this.

On macOS, I was able to fix this with the following command:

sudo launchctl limit maxfiles 65536 200000

However, please note that the above value only applies to the current terminal session; to make it permanent, it must be added to .bash_profile.

Problem 2: The labels of some collection inventory files were added to a batch of products and processed before the collection inventory file itself

In registry-crawler-service, the collection inventory files are submitted to registry-harvest-service as a CollectionInventoryMessage through RabbitMQ.

The same collection inventory files can be also submitted to registry-harvest-service as part of a batch as a ProductMessage through RabbitMQ.

When registry-harvest-service receives the CollectionInventoryMessage before the ProductMessage batch, it works as expected (but this ordering is not guaranteed; it depends on RabbitMQ queue message delivery timing).

However, if the ProductMessage batch (which includes the label of the collection inventory file) is received before the CollectionInventoryMessage, then registry-harvest-service sees it as an already existing product and fails to process the collection inventory.

To resolve this, a change was made in the following pull request. However, it should be thoroughly reviewed before merging.

NASA-PDS/registry-crawler-service#22

The change is in src/main/java/gov/nasa/pds/crawler/proc/DirectoryProcessor.java:

        // Collection label
        if(info instanceof PdsCollectionInfo)
        {
            // Allow a Collection Inventory to overwrite the same ID added as a
            // result of a situation where Harvest Service consumed a FileBatch, which includes the ID
            // of Collection Inventory, before processing the CollectionInventory.
            // Refer to https://github.com/NASA-PDS/registry-harvest-service/issues/25
            dirMsg.overwrite = true;

            publishCollectionInventory(dirMsg, path, (PdsCollectionInfo)info);
        }

@ramesh-maddegoda
Contributor

ramesh-maddegoda commented Sep 8, 2022

@mdrum , the attached log file contains the output of registry components after the fix suggested above.

after-fix.log.zip

Can you please check the attached log file and let us know your thoughts?

Please note that it has processed all 5 Collection Inventory files.

docker-registry-harvest-service-1  | [INFO] Started collection inventory consumer
docker-registry-harvest-service-1  | [INFO] Processing collection inventory file /tmp/data/browse/collection_hyb2_tir_browse.csv
docker-registry-harvest-service-1  | [INFO] Processing batch of 1 products: /tmp/data/bundle_hyb2_tir.xml, ...
docker-registry-harvest-service-1  | [INFO] Started manager command consumer
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20180801
docker-registry-harvest-service-1  | [INFO] Processing /tmp/data/bundle_hyb2_tir.xml
docker-registry-harvest-service-1  | [INFO] Processing collection inventory file /tmp/data/calibration/collection_hyb2_tir_calibration.csv
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20181231
docker-registry-harvest-service-1  | [INFO] Loaded 21 document(s)
docker-registry-harvest-service-1  | [INFO] Processing collection inventory file /tmp/data/data_btemp/collection_hyb2_tir_data_btemp.csv
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190308
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190313
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190321
docker-registry-harvest-service-1  | [INFO] Processing collection inventory file /tmp/data/data_raw/collection_hyb2_tir_data_raw.csv
docker-registry-harvest-service-1  | [INFO] 404 - Not Found
docker-registry-harvest-service-1  | [INFO] Will retry in 5 seconds
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190404
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190405
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190417
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190418
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190419
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190424
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190425
docker-registry-harvest-service-1  | [INFO] Processing collection inventory file /tmp/data/document/collection_hyb2_tir_document.csv
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190612
docker-registry-crawler-service-1  | [INFO] Processing directory /tmp/data/data_btemp/proximity/20190613
docker-registry-harvest-service-1  | [INFO] 404 - Not Found
docker-registry-harvest-service-1  | [ERROR] Could not download https://data.darts.isas.jaxa.jp/pub/pds4/mission/hyb2/v1/PDS4_HYB2_1E00_1100.JSON
docker-registry-harvest-service-1  | [WARN] Will use 'keyword' data type.
docker-registry-harvest-service-1  | [INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1E00.xsd
docker-registry-harvest-service-1  | [INFO] This LDD already loaded.
docker-registry-harvest-service-1  | [INFO] Updating Elasticsearch schema.

@jordanpadams
Copy link
Member

@ramesh-maddegoda for the too-many-open-files issue, can we resolve this within the software itself by closing some of those files, or by pausing at the max open-file count until some are closed?

The reason is that we cannot expect our users to update this OS config every time they run the software against a large data set. If we require this, we will have to figure out and document this setting on approximately 6-10 different OSes, including numerous shell variations.
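One possible in-software direction, sketched below purely as an illustration (the class, method, and cap value are hypothetical, not the crawler's actual code): gate every file open behind a counting semaphore so the process never holds more descriptors than a self-imposed cap, regardless of the OS ulimit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Semaphore;

// Hypothetical sketch (not the crawler's actual code): cap how many files
// the process holds open at once, so a large crawl cannot exhaust the OS
// descriptor limit even when ulimit is left at its default.
public class BoundedFileReader {
    // Self-imposed cap, chosen to stay below the macOS default of 256.
    private static final Semaphore OPEN_SLOTS = new Semaphore(128);

    public static String readFirstLine(Path path)
            throws IOException, InterruptedException {
        OPEN_SLOTS.acquire(); // block while too many files are already open
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            return reader.readLine();
        } finally {
            OPEN_SLOTS.release(); // slot freed only after the file is closed
        }
    }
}
```

Callers that exceed the cap simply block until a slot frees up, trading some throughput for a hard bound on descriptor usage.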

@ramesh-maddegoda
Copy link
Contributor

@jordanpadams, in fact, handling that in the code was one of the first things I considered too.

This "Too many open files" issue happens in the following while loop.
https://github.com/NASA-PDS/registry-crawler-service/blob/e5fbe97f0e998a1b4208c63d07022814abb26877/src/main/java/gov/nasa/pds/crawler/proc/DirectoryProcessor.java#L95

Please note that in the existing code the file reader is already being closed after each message.
https://github.com/NASA-PDS/registry-crawler-service/blob/e5fbe97f0e998a1b4208c63d07022814abb26877/src/main/java/gov/nasa/pds/crawler/proc/DirectoryProcessor.java#L158

However, I think we should come up with a different algorithm or design to address this issue at the software level.

Having said that, please note that the default max-files value on macOS is 256, which is very limited compared to the scale of this task.
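As a diagnostic aid, the JVM can report its own descriptor limits at runtime via the JDK-specific com.sun.management.UnixOperatingSystemMXBean (HotSpot on Unix-like systems only); a minimal sketch, e.g. for logging a startup warning when the limit looks too low for a large crawl:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

// Diagnostic sketch: report the process's current open and maximum
// file-descriptor counts via the JDK-specific UnixOperatingSystemMXBean
// (HotSpot on Unix-like systems only).
public class FdLimits {
    public static long[] currentAndMax() {
        UnixOperatingSystemMXBean os = (UnixOperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        return new long[] {
            os.getOpenFileDescriptorCount(),
            os.getMaxFileDescriptorCount()
        };
    }
}
```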

@mdrum
Author

mdrum commented Sep 9, 2022

The second issue sounds like exactly what I was experiencing. I've never run into the file limit issue, since we're running this on a Linux server that must have a different config. The logs look good to me. Thanks for digging into this @ramesh-maddegoda

jimmie pushed a commit to NASA-PDS/registry-crawler-service that referenced this issue Sep 19, 2022
… overwrite the same ID added as a result of a situation where Harvest Service consumed a FileBatch, which includes the ID of Collection Inventory, before processing the CollectionInventory. (#22)

Refer to issue NASA-PDS/registry-harvest-service#25