
wget command retrieves index.html of all study directories before downloading study images #1012

Closed
ayhyap opened this issue Oct 26, 2020 · 6 comments


ayhyap commented Oct 26, 2020

The command listed on the dataset page downloads only the index.html of each study directory, not the DICOM files.
wget -r -N -c -np --user <PHYSIONETUSERNAME> --ask-password https://alpha.physionet.org/files/mimic-cxr/2.0.0/

That is, the command is downloading
/physionet.org/files/mimic-cxr/2.0.0/files/p**/p******/s*******/index.html

instead of the two DICOM files under that directory
/physionet.org/files/mimic-cxr/2.0.0/files/p**/p******/s*******/*****.dcm

MIMIC-CXR-JPG has the same problem.

Am I missing something obvious?
(I interrupted the wget command after seeing that all 200+ downloaded studies lacked image files.)

@ayhyap ayhyap changed the title wget command only retrieves index.html of leaf directories wget command retrieves index.html of study directories, but not study images Oct 26, 2020
@alistairewj
Member

We are hoping to improve how many-file datasets such as this one are downloaded from PhysioNet; for now I'd recommend using GCP to download the images.
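
As a sketch of what that looks like (the bucket name below is an assumption; confirm the real one in the "Files" section of the project page), after requesting GCP access you would run something like:

# authenticate first with: gcloud auth login
# bucket name is illustrative; check the project page for the real one
gsutil -m cp -r gs://mimic-cxr-2.0.0.physionet.org/files/ ./mimic-cxr/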

@alistairewj
Member

I can confirm that wget will eventually download the images, but only after it has downloaded ~500,000 index files or so. Clearly not ideal!
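
If you do stay with wget, one mitigation is the --reject flag (a sketch; wget still has to fetch each index.html to discover the download links, but it deletes rejected files after parsing them, so they don't accumulate on disk):

wget -r -N -c -np --reject "index.html*" --user <PHYSIONETUSERNAME> --ask-password https://alpha.physionet.org/files/mimic-cxr/2.0.0/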

@huangtao36

I hope there is a way to download sub-packages instead of one file after another. Maybe you could pack all the files into 50 small zip files.

@ayhyap ayhyap changed the title wget command retrieves index.html of study directories, but not study images wget command retrieves index.html of all study directories before downloading study images Jan 22, 2021
@alistairewj
Member

I'd suggest using GCP in that case. Their command-line tools make filtering much easier via wildcards in the gsutil cp command. Alternatively, work with the data within the cloud and save us egress costs :)
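
For example, fetching a single patient's directory might look like this (the patient ID and bucket name here are illustrative):

gsutil -m cp -r "gs://mimic-cxr-2.0.0.physionet.org/files/p10/p10000032/*" ./p10000032/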

meghalD commented May 16, 2023

I started with the PhysioNet URL for the download, but it gives index.html files as mentioned in this issue. Following the suggested solution, I used GCP instead, but it always adds a .gstmp extension to all the files (e.g. 174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.dcm.gstmp). Is there a way to read these files or avoid this extension?

@alistairewj
Member

.gstmp means the download is incomplete; I believe it's a feature of gsutil that allows downloads to be resumed.
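
Re-running the same gsutil cp command should pick up the partial .gstmp files and finish them. Alternatively (a sketch, with the same illustrative bucket name as above), gsutil rsync transfers only what is missing or incomplete:

gsutil -m rsync -r gs://mimic-cxr-2.0.0.physionet.org/files/ ./mimic-cxr/files/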
