
wget command retrieves index.html of all study directories before downloading study images #1012

Closed
ayhyap opened this issue Oct 26, 2020 · 6 comments


ayhyap commented Oct 26, 2020

The command listed on the dataset page downloads only the index.html of each study directory, not the DICOM files.
wget -r -N -c -np --user <PHYSIONETUSERNAME> --ask-password https://alpha.physionet.org/files/mimic-cxr/2.0.0/

That is, the command is downloading
/physionet.org/files/mimic-cxr/2.0.0/files/p**/p******/s*******/index.html

instead of the two DICOM files under that directory
/physionet.org/files/mimic-cxr/2.0.0/files/p**/p******/s*******/*****.dcm

MIMIC-CXR-JPG has the same problem.

Am I missing something obvious?
(I interrupted the wget command after seeing that all 200+ downloaded studies lacked image files.)

@ayhyap ayhyap changed the title wget command only retrieves index.html of leaf directories wget command retrieves index.html of study directories, but not study images Oct 26, 2020
@alistairewj
Member

We are hoping to improve how many-file datasets such as this one are downloaded from PhysioNet; for now I'd recommend using GCP to download the images.
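
As a sketch of what that looks like (the bucket name below is an assumption; confirm the real one in the "Files" section of the project page), after requesting GCP access you would run something like:

# authenticate first with: gcloud auth login
# bucket name is illustrative; check the project page for the real one
gsutil -m cp -r gs://mimic-cxr-2.0.0.physionet.org/files/ ./mimic-cxr/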

@alistairewj
Member

I can confirm that wget will eventually download the images, but only after it has downloaded ~500,000 index files or so. Clearly not ideal!
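
If you do stay with wget, one mitigation is the --reject flag (a sketch; wget still has to fetch each index.html to discover the download links, but it deletes rejected files after parsing them, so they don't accumulate on disk):

wget -r -N -c -np --reject "index.html*" --user <PHYSIONETUSERNAME> --ask-password https://alpha.physionet.org/files/mimic-cxr/2.0.0/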

@huangtao36

I hope there is a way to download sub-packages instead of one file after another. Maybe you could pack all the files into 50 small zip files.

@ayhyap ayhyap changed the title wget command retrieves index.html of study directories, but not study images wget command retrieves index.html of all study directories before downloading study images Jan 22, 2021
@alistairewj
Member

I'd suggest using GCP in that case. Their command-line tools make filtering much easier via wildcards in the gsutil cp command. Alternatively, work with the data within the cloud and save us egress costs :)
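
For example, fetching a single patient's directory might look like this (the patient ID and bucket name here are illustrative):

gsutil -m cp -r "gs://mimic-cxr-2.0.0.physionet.org/files/p10/p10000032/*" ./p10000032/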

meghalD commented May 16, 2023

I started with the PhysioNet URL for the download, but it gives index.html files as mentioned in this issue. Following the suggested solution, I used GCP instead, but it always adds a .gstmp extension to all the files (e.g. 174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.dcm.gstmp). Is there a way to read these files or avoid this extension?

@alistairewj
Member

.gstmp means the download is incomplete; I believe it's a feature of gsutil that allows downloads to be resumed.
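
Re-running the same gsutil cp command should pick up the partial .gstmp files and finish them. Alternatively (a sketch, with the same illustrative bucket name as above), gsutil rsync transfers only what is missing or incomplete:

gsutil -m rsync -r gs://mimic-cxr-2.0.0.physionet.org/files/ ./mimic-cxr/files/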
