
S3-only download can't get metadata #82

Closed
dmd opened this issue Oct 26, 2023 · 9 comments

Comments

dmd commented Oct 26, 2023

I don't know if this is a dupe of #67 but it seems that if you want to download to S3, you have to download package_file_metadata.txt.gz locally first.

Is this intentional or a bug?

E.g.:

$ downloadcmd -u ddrucker -dp 1220860 -t onefile  -s3 s3://rapidtide-nda/test20231026
Running NDATools Version 0.2.25

No value specified for --workerThreads. Using the default option of 7
Important - You can configure the thread count setting using the --workerThreads argument to maximize your download speed.


Getting Package Information...

Package-id: 1220860
Name: HCPAgingAllFiles
Has associated files?: Yes
Number of files in package: 1414735
Total Package Size: 22.33TB

Starting download: s3://rapidtide-nda/test20231026/package_file_metadata.txt.gz
Traceback (most recent call last):
  File "/Users/dmd/venvs/apc-ve/bin/downloadcmd", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/dmd/venvs/apc-ve/lib/python3.11/site-packages/NDATools/clientscripts/downloadcmd.py", line 200, in main
    s3Download.start()
  File "/Users/dmd/venvs/apc-ve/lib/python3.11/site-packages/NDATools/Download.py", line 198, in start
    self.download_package_metadata_file()
  File "/Users/dmd/venvs/apc-ve/lib/python3.11/site-packages/NDATools/Download.py", line 889, in download_package_metadata_file
    with gzip.open(download_location, 'rb') as f_in:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py", line 174, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/Users/dmd/cloud-brains/hcpage/package_file_metadata.txt.gz'

dmd commented Oct 26, 2023

Related question - assuming this is intentional, how can you download just the metadata? I know I can run

downloadcmd -u ddrucker -dp 1220860 -d .

and then Control-c out of it once it's downloaded package_file_metadata.txt.gz, but that can't be automated. How can I tell it "download the metadata, and nothing else" so I can then, in the next command, use -s3?


gregmagdits commented Oct 26, 2023

Yes, the issue mentioned in the first comment is related to #67. We are planning to have this fixed in the next release.

If you want to download just the package-file-metadata file, you can run:
downloadcmd -dp 1220860 --file-regex "package_file_metadata.txt.gz"

The tool will say it didn't find any matching files because the metadata file doesn't contain a record for itself, but the tool always downloads this file before downloading any other file in the package, so it will be downloaded (if it doesn't already exist locally). We can add a low-priority ticket to include a record for the metadata file itself in the metadata file so that the output of the program is accurate in this particular case.
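
For concreteness, such an invocation might look like the following (this simply combines the username, package id, and local directory already used in the commands earlier in this thread; adjust for your own account and paths):

downloadcmd -u ddrucker -dp 1220860 -d . --file-regex "package_file_metadata.txt.gz"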

gregmagdits commented

I don't think you were asking about the data-structure files, but in case you were, you can get those with the following regex:
downloadcmd -dp 1220860 --file-regex "^[^/]+.txt"


dmd commented Nov 9, 2023

I wasn't, but that made me realize the solution to "download just the metadata" is:

downloadcmd -dp 1220860 --file-regex match-nothing

which will happily download the package_file_metadata.txt.gz, then download 0 files and exit, which is exactly what I want.

This still doesn't address the underlying issue, though: if you're downloading to S3, the metadata should live there too.
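
Putting that together, a sketch of the two-step flow being described here (example username, package, transfer type, and bucket reused from the first comment; whether the second command picks up the metadata depends on where the first one puts it):

# 1) fetch only package_file_metadata.txt.gz locally; the regex matches no package files
downloadcmd -u ddrucker -dp 1220860 --file-regex match-nothing
# 2) then run the actual transfer straight to S3
downloadcmd -u ddrucker -dp 1220860 -t onefile -s3 s3://rapidtide-nda/test20231026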

gregmagdits commented

In all new packages the metadata file includes itself, which means you can run

downloadcmd -dp <package-id> --file-regex package_file_metadata_<package-id>.txt.gz

and there should be 1 file that meets the filter criteria.
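
For the package used in this thread's examples, assuming it is (or gets regenerated as) one of these new packages, that would be:

downloadcmd -dp 1220860 --file-regex package_file_metadata_1220860.txt.gz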


dmd commented Dec 5, 2023

Any reason to do that vs. match-nothing?


dmd commented Dec 5, 2023

And what about what I said about S3? Or is the idea that you don't want the metadata remote? (In my case, it is anyway, because the "local" directory is S3-mounted.)

gregmagdits commented

> Any reason to do that vs. match-nothing?

The end result is the same. I guess the user's intent is clearer if you actually specify the file you want to download.

> And what about what I said about S3? Or is the idea that you don't want the metadata remote? (In my case, it is anyway, because the "local" directory is S3-mounted.)

The metadata file is used extensively by the program, so for now we decided to always keep it local. Regarding your use of s3fs: the way the program works now (when downloading locally) is to append a .partial extension to files as they are being downloaded, and then rename the file when the download is complete. The rename operation is not implemented by tools like s3fs, so we were under the impression that this needs to change before downloadcmd can work with S3-mounted directories. Have you not run into this situation?
In any case, I think using the -s3 flag would be more efficient than downloading to an S3-mounted file system, because S3-to-S3 transfers don't leave Amazon's cloud.
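
For illustration, a minimal Python sketch of the download-to-.partial-then-rename pattern described above (this is not the actual NDATools code; the function name and the use of urllib are assumptions made for the example):

import os
import shutil
import urllib.request

def download_with_partial(url, destination):
    # Illustrative sketch only, not the NDATools implementation.
    partial = destination + '.partial'          # temporary name while the download is in progress
    with urllib.request.urlopen(url) as resp, open(partial, 'wb') as out:
        shutil.copyfileobj(resp, out)           # stream the response body to the .partial file
    os.replace(partial, destination)            # rename once the download is complete

The final os.replace call is the rename step under discussion: it is atomic on a normal local file system, and it is the operation whose behavior on S3-mounted file systems is being questioned here.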


dmd commented Dec 5, 2023

(a) s3fs does support rename (even deep directory rename), and has for many years now (almost a decade!)

(b) But regardless, we are in fact using the -s3 flag for the actual download. We just mount the bucket for later use using s3fs. (And we're doing all this from EC2, so nothing leaves AWS regardless.)

AHA - but, in re-testing this just now, it appears you made an important change in 01a7b08 that changes all of this - you're downloading package_file_metadata to nda-tools/downloadcmd/packages/<package-id>/ rather than to the supplied --directory.

So this is moot anyway!
