Harvest skips path that is the root of a soft link #102

Closed
Tracked by #30
rchenatjpl opened this issue Aug 28, 2022 · 13 comments · Fixed by #116
Labels: B13.1, bug (Something isn't working), i&t.done, s.high

Comments

@rchenatjpl

rchenatjpl commented Aug 28, 2022

You may have deliberately chosen not to do this. I don't know.

If the config file's directories/path is either of the following (on pdscloud-prod1):
/data/pds4/context-pds4
/data/pds4/context-pds4/
harvest skips the directory. However, if directories/path is a subdirectory, harvest works:
/data/pds4/context-pds4/agency
The resolved link also works:
/data/pds4/1700/PDS4_context_bundle_20180723

From the pds4 account on pdscloud-prod1:

[pds4@pdscloud-prod1 ~]$ cd test
[pds4@pdscloud-prod1 test]$ ls -g -o  /data/pds4 | grep context
lrwxrwxrwx.  1   35 Feb 26  2017 context-pds3 -> ./1700/PDS3_context_bundle_20161220
lrwxrwxrwx.  1   35 Jul 23  2018 context-pds4 -> ./1700/PDS4_context_bundle_20180723
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ grep "<path>" issue1?.xml
issue1a.xml:    <path>/data/pds4/context-pds4</path>  <!-- fails -->
issue1b.xml:    <path>/data/pds4/context-pds4/</path>  <!-- fails -->
issue1c.xml:    <path>/data/pds4/context-pds4/agency</path>  <!-- works. 5 xml files there -->
issue1d.xml:    <path>/data/pds4/1700/PDS4_context_bundle_20180723</path>  <!-- works. 5000+ xml files-->
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ /usr/local/build11/harvest-3.6.0/bin/harvest -c issue1a.xml 
[SUMMARY] Reading configuration from /data/home/pds4/test/issue1a.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443, index: registry
[INFO] Connecting to Elasticsearch
[INFO] Loading PDS to ES data type mapping from /usr/local/build11/harvest-3.6.0/elastic/data-dic-types.cfg
[INFO] Processing directory: /data/pds4/context-pds4
[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 0
[SUMMARY] Failed files: 0
[SUMMARY] Package ID: de9c2077-4e40-495e-87f7-0052f5821379
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ /usr/local/build11/harvest-3.6.0/bin/harvest -c issue1b.xml 
[SUMMARY] Reading configuration from /data/home/pds4/test/issue1b.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443, index: registry
[INFO] Connecting to Elasticsearch
[INFO] Loading PDS to ES data type mapping from /usr/local/build11/harvest-3.6.0/elastic/data-dic-types.cfg
[INFO] Processing directory: /data/pds4/context-pds4
[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 0
[SUMMARY] Failed files: 0
[SUMMARY] Package ID: 8777a5d1-d0b1-4f5c-b876-596b5459284f
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ /usr/local/build11/harvest-3.6.0/bin/harvest -c issue1c.xml 
[SUMMARY] Reading configuration from /data/home/pds4/test/issue1c.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443, index: registry
[INFO] Connecting to Elasticsearch
[INFO] Loading PDS to ES data type mapping from /usr/local/build11/harvest-3.6.0/elastic/data-dic-types.cfg
[INFO] Processing directory: /data/pds4/context-pds4/agency
[INFO] Processing /data/pds4/context-pds4/agency/Collection_agency_jaxa_v1.0.xml
[INFO] Wrote 1 collection inventory document(s)
[INFO] Processing /data/pds4/context-pds4/agency/agency.jaxa_1.0.xml
[INFO] Processing /data/pds4/context-pds4/agency/agency.nasa_1.0.xml
[INFO] Processing /data/pds4/context-pds4/agency/Collection_agency_v1.0.xml
[INFO] Wrote 1 collection inventory document(s)
[INFO] Processing /data/pds4/context-pds4/agency/agency.esa_1.0.xml
[INFO] Wrote 5 product(s)
[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 5
[SUMMARY]   Product_Collection: 2
[SUMMARY]   Product_Context: 3
[SUMMARY] Failed files: 0
[SUMMARY] Package ID: 864691d3-cec9-43da-bee2-13ceb2c64173
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 ~]$ ls -l /data/pds4 | grep context
lrwxrwxrwx.  1 pds pds   35 Feb 26  2017 context-pds3 -> ./1700/PDS3_context_bundle_20161220
lrwxrwxrwx.  1 pds pds   35 Jul 23  2018 context-pds4 -> ./1700/PDS4_context_bundle_20180723
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ 
[pds4@pdscloud-prod1 test]$ /usr/local/build11/harvest-3.6.0/bin/harvest -c issue1d.xml 
[SUMMARY] Reading configuration from /data/home/pds4/test/issue1d.xml
[SUMMARY] Output directory: /tmp/harvest/out
[SUMMARY] Elasticsearch URL: https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443, index: registry
[INFO] Connecting to Elasticsearch
[INFO] Loading PDS to ES data type mapping from /usr/local/build11/harvest-3.6.0/elastic/data-dic-types.cfg
[INFO] Processing directory: /data/pds4/1700/PDS4_context_bundle_20180723
[INFO] Processing /data/pds4/1700/PDS4_context_bundle_20180723/personnel-affiliate/personnel.paul_geissler_1.0.xml
[INFO] Processing /data/pds4/1700/PDS4_context_bundle_20180723/personnel-affiliate/personnel.michael_evans_1.0.xml
[INFO] Processing /data/pds4/1700/PDS4_context_bundle_20180723/personnel-affiliate/personnel.raymond_arvidson_1.0.xml
[INFO] Processing /data/pds4/1700/PDS4_context_bundle_20180723/personnel-affiliate/personnel.raymond_walker_1.0.xml
[snip...]

issue1.zip

@jordanpadams
Member

@rchenatjpl in the future, the bug template would have been better here. Not a huge deal, just a heads up.

@tloubrieu-jpl
Member

@ramesh-maddegoda @jordanpadams Should we re-assign this ticket to @nutjob4life or @al-niessner ?

@jordanpadams
Member

@tloubrieu-jpl reassigned to @alexdunnjpl

@alexdunnjpl
Contributor

@rchenatjpl @jordanpadams checking my understanding here

Given

  • path/to/some/symlinkToDir
  • path/to/some/symlinkToDir/someBundleDir

a config containing the following will not pick up someBundleDir

<directories>
    <path>path/to/some/symlinkToDir</path>
</directories>

but a config containing the following will pick it up

<directories>
    <path>path/to/some/symlinkToDir/someBundleDir</path>
</directories>

and the behaviour of <directories> is supposed to be "harvest all harvestable artifacts in the subtree rooted at <path>", yes?

@rchenatjpl
Author

If I understand you correctly, that's right. I think harvest should run through all subdirs of directories/path.

@alexdunnjpl
Contributor

@tloubrieu-jpl @jordanpadams is this (following symlinks) something we want to support? Seems like there's an unavoidable trade-off between

  • Symlinks not traversed by harvest
  • Symlinks are traversed by harvest, but may cause infinite loops

@rchenatjpl is there a generalized use-case which is persuasive enough to require implementation of symlink support, or is this a once-off or something that can be worked around by providing the actual filesystem path to the link-target directory?
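For reference, the workaround mentioned above (giving harvest the link-target directory instead of the symlink) can be sketched in a few lines. This is only an illustration in Python, not part of harvest; the helper name `resolve_configured_path` is hypothetical, and it assumes a POSIX filesystem.

```python
import os


def resolve_configured_path(configured_path: str) -> str:
    """Return the real directory behind a configured path.

    Hypothetical helper: os.path.realpath follows every symlink in the
    path and also drops a trailing slash, so the link and the link with
    a trailing slash resolve to the same target.
    """
    return os.path.realpath(configured_path)


if __name__ == "__main__":
    # Path taken from the report; on a machine where it does not exist,
    # realpath simply returns the normalised input.
    print(resolve_configured_path("/data/pds4/context-pds4"))
```

The resolved value is what would go into `<path>` in the config, which matches the issue1d.xml case that already works.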

@rchenatjpl
Author

I'd like harvest to follow the symlink, but if not, it is definitely work-aroundable for my needs.

@jordanpadams
Member

@alexdunnjpl I think we have to risk infinite loops. Symlinks are ubiquitous in the PDS.

@alexdunnjpl
Contributor

@jordanpadams roger that - I'll get on it today.

@alexdunnjpl
Contributor

@jordanpadams is the intended behaviour that only symlinked roots will be followed, or that symlinks at any depth within the tree rooted at the target dir will be followed?

@al-niessner
Contributor

@jordanpadams @alexdunnjpl

I have no idea why I am on this thread, but use an absolute or canonical path function to convert the given path with the symlink in it, then check whether you have already seen it (use a Set() for what you know already). If not seen, then process it. Otherwise, off to the next file. Infinite loop no more.
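A minimal sketch of that idea, written in Python purely for illustration (it is not harvest's code, and the function name `walk_following_symlinks` is hypothetical): canonicalise each path, keep a set of what has been seen, and skip anything already in it.

```python
import os
from typing import Iterator, Optional, Set


def walk_following_symlinks(root: str, seen: Optional[Set[str]] = None) -> Iterator[str]:
    """Yield files under root, following symlinks but visiting each real path once."""
    if seen is None:
        seen = set()
    # Canonicalise so that a link and its target count as the same directory.
    real_root = os.path.realpath(root)
    if real_root in seen:
        return  # already visited: this is what breaks a symlink cycle
    seen.add(real_root)
    with os.scandir(root) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        if entry.is_dir():  # is_dir() follows symlinks by default
            yield from walk_following_symlinks(entry.path, seen)
        elif entry.is_file():
            real_file = os.path.realpath(entry.path)
            if real_file not in seen:
                seen.add(real_file)
                yield entry.path
```

With this approach a symlinked root and symlinks at any depth are both followed, and a link that points back up the tree is simply skipped the second time its target is reached. The recursion is kept for brevity; a very deep tree would want an explicit stack.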

@jordanpadams
Member

@al-niessner I think you are just following this repo :-) .

@alexdunnjpl follow all symlinks

@miguelp1986

7 participants