Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

downloadcmd renames files from .nii.gz to .nii_N.gz #88

Open
dmd opened this issue Dec 12, 2023 · 8 comments
Open

downloadcmd renames files from .nii.gz to .nii_N.gz #88

dmd opened this issue Dec 12, 2023 · 8 comments

Comments

@dmd
Copy link

dmd commented Dec 12, 2023

downloadcmd appears to be renaming some files it downloads.

(base) root@60cc05ff57df:/# grep '33287.*HCA9953406.*brainmask_fs.2.nii.gz' /home/rapidtide/NDA/nda-tools/downloadcmd/packages/1223154/datastructure_manifest.txt  | awk '{print $6}'
"s3://NDAR_Central_4/submission_33287/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_AP/brainmask_fs.2.nii.gz"
"s3://NDAR_Central_4/submission_33287/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_AP/brainmask_fs.2.nii.gz"
"s3://NDAR_Central_4/submission_33287/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/brainmask_fs.2.nii.gz"
"s3://NDAR_Central_4/submission_33287/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_PA/brainmask_fs.2.nii.gz"

Here we have four identically named files but in 4 different directories (rest1 vs 2, AP vs PA). When I download them, I expect them to end up all named brainmask_fs.2.nii.gz, in those four directories.

I use that to create a 4 line long file which will be the s3 files I request.

(base) root@60cc05ff57df:/# cat /s3-files-requested
s3://NDAR_Central_4/submission_33287/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_AP/brainmask_fs.2.nii.gz
s3://NDAR_Central_4/submission_33287/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_AP/brainmask_fs.2.nii.gz
s3://NDAR_Central_4/submission_33287/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/brainmask_fs.2.nii.gz
s3://NDAR_Central_4/submission_33287/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_PA/brainmask_fs.2.nii.gz

Run the downloadcmd:

(base) root@60cc05ff57df:/# downloadcmd  -dp $NDADATASET -u $NDAUSER -t /s3-files-requested  -d /downloaded
Running NDATools Version 0.2.26

No value specified for --workerThreads. Using the default option of 3
Important - You can configure the thread count setting using the --workerThreads argument to maximize your download speed.


Getting Package Information...

Package-id: 1223154
Name: HCPAgingAllFiles
Has associated files?: Yes
Number of files in package: 1414735
Total Package Size: 22.33TB

Downloading S3 links from text file: /s3-files-requested

S3 links for files that failed to download will be written out to /home/rapidtide/NDA/nda-tools/downloadcmd/logs/failed_s3_links_file_20231212T124033.txtzre58rxj. You can attempt to download these files later by running:
	downloadcmd -dp 1223154 -t /s3-files-requested -u ddrucker -d /downloaded -wt 3 -t "/home/rapidtide/NDA/nda-tools/downloadcmd/logs/failed_s3_links_file_20231212T124033.txtzre58rxj"


Beginning download of 4 files (85.97KB) to /downloaded using 3 threads
Adding 4 files to download queue. Queue contains 4 files

Starting download: /downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_AP/brainmask_fs.2.nii.gz.partial
Starting download: /downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_AP/brainmask_fs.2.nii.gz.partial
Starting download: /downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/brainmask_fs.2.nii_1.gz.partial
Completed download /downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/brainmask_fs.2.nii_1.gz
Starting download: /downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_PA/brainmask_fs.2.nii_2.gz.partial
Completed download /downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_AP/brainmask_fs.2.nii.gz
Completed download /downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_AP/brainmask_fs.2.nii.gz
Completed download /downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_PA/brainmask_fs.2.nii_2.gz
No failures detected. Removing file /home/rapidtide/NDA/nda-tools/downloadcmd/logs/failed_s3_links_file_20231212T124033.txtzre58rxj

Finished processing all download requests @ 2023-12-12 12:40:33.914311.
     Total download requests 4
     Total errors encountered: 0

 Exiting Program...

Note it has renamed the files!

(base) root@60cc05ff57df:/# find /downloaded -type f
/downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_AP/brainmask_fs.2.nii.gz
/downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_AP/brainmask_fs.2.nii.gz
/downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/brainmask_fs.2.nii_1.gz
/downloaded/fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_PA/brainmask_fs.2.nii_2.gz

This is problematic for two reasons:

  1. The filenames are not predictable - I expect to be able to operate on the filename I told it to download, not one downloadcmd makes up

  2. FSL tools like FLIRT refuse to open these files because the extension is wrong.

@dmd
Copy link
Author

dmd commented Dec 12, 2023

My workaround right now is simply:

find /downloaded/ -type f -name '*_1.*' | while read fname; do   mv "$fname" "${fname/_1/}"; done
find /downloaded/ -type f -name '*_2.*' | while read fname; do   mv "$fname" "${fname/_2/}"; done

but I'd much rather not have to do that.

@gregmagdits
Copy link
Contributor

Mappings from s3 urls to locations on disk are stored in the package metadata file. These mappings are generated at the time of package creation. S3-urls are converted to relative file locations by removing the s3://NDAR_Central_x/submission_xxxxx prefix and replacing it with the data-structure to which the file was submitted. For example

s3://NDAR_Central_4/submission_33287/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_AP/brainmask_fs.2.nii.gz
s3://NDAR_Central_4/submission_33287/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_AP/brainmask_fs.2.nii.gz

should both become

fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST1_AP/brainmask_fs.2.nii.gz
fmriresults01/HCA9953406_V1_MR/MNINonLinear/Results/rfMRI_REST2_AP/brainmask_fs.2.nii.gz

Renaming files is necessary when files from 2 submissions have the same name after the submission_xxxx prefix in the s3url however it appears the logic is only taking the file-name into account (and not the file-path). We will adjust the logic and at the same time, we will update the logic to ensure the extension remains intact. This will not require an update to the client tool, but it will require users to create new packages in order to see the changes. We will post an update to this ticket when the fix has been deployed.

@dmd
Copy link
Author

dmd commented Jun 12, 2024

Is this still planned to be fixed, or ...

@gregmagdits
Copy link
Contributor

After doing some more investigation it turned out that the procedure correctly renamed the files in your package because there were other files in your package that had the same exact file-name and path. The current logic already tries to preserve the extension by adding the _# before the last ".". This logic isn't sophisticated enough to preserve extensions like .nii.gz and we started looking into potential solutions but for the time being this ticket has been de-prioritized

@dmd
Copy link
Author

dmd commented Jun 17, 2024

I'm not sure I understand how a package can have multiple files with the same filename and path. That seems like a bug in the design of the packaging system - as in, it should not be possible to create a package like that, the package build should fail and the creator should try again.

@gregmagdits
Copy link
Contributor

Yes, it was overlooked when the system was first designed. In our s3 repo, files are grouped under a submission-id, i.e.

/NDAR_Central_1/submission_33280/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv
/NDAR_Central_3/submission_33278/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv
/NDAR_Central_4/submission_33287/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv

When users create packages, we group files by data-structure instead of submission-id, so each of the three files mentioned above would map to

fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv

When the system was designed it did not take into consideration that the same file name can be used across multiple submissions, which would result in naming collisions during packaging. I do not think throwing an error during the package creation step would be ideal since it unnecessarily prevents users from accessing the rest of their data in their package, which at the moment takes a fair amount of time to create.
We would consider alternatives to adding unique suffixes to files if it can be implemented quickly, but we are trying to redesign the process of downloading data from NDA so anything requiring substantial rework of the current system will not likely be considered.

@dmd
Copy link
Author

dmd commented Jun 18, 2024

Maybe I'm misunderstanding what a submission is, but if a file like that is submitted multiple times, is it expected for all the files to be identical, or not?

If they are not identical, could either the first or the last submitted have the "original" (correct) filename, and all the others be renamed?

The problem is that analyses expect the files to be named a specific thing - the names often have a structure or meaning to them or are even referenced by name in other files.

@gregmagdits
Copy link
Contributor

NDA doesn't know whether the files are identical. In this case, since they come from the same collection I would suspect that they are, but you can calculate and compare the md5 sums to double check.
The concept of an "original" file wouldn't make much sense if two collections just happened to have the same file name, but it might be better than randomly choosing which files get suffixes added to them, which is what we are doing now.
We are aware that adding these suffixes is not an ideal solution for the reason you mentioned above, but we didn't see much of an alternative which didn't also require relatively large changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants