Wrong file name for download from a FigShare URL #760

Closed
briochemc opened this issue Sep 18, 2021 · 4 comments · Fixed by #761

@briochemc

Julia 1.6.2
HTTP v0.9.14
MbedTLS v1.0.3

This minimal example,

using DataDeps, XLSX

register(DataDep(
            "FigShare_dataset",
            "No description needed",
            "https://ndownloader.figshare.com/files/6294558",
            "ba250b1b64b8c43d1130a7aee39674e9ee5936ebb49050c4c224c740bff588b0"
))

XLSX.read(joinpath(datadep"FigShare_dataset", "rsta20150293_si_001.xlsx"))

downloads a table from FigShare (doi: https://doi.org/10.6084/m9.figshare.3980064.v1) using DataDeps.jl and tries to read it, but fails with "No such file or directory". It fails because the file is saved as 6294558 (the last segment of the URL) instead of the correct name, rsta20150293_si_001.xlsx. However, this exact same snippet worked at some point in the past year, so something has changed since then. Talking briefly with @oxinabox, he suggested that this could be an issue with HTTP.jl or with FigShare itself. I tried to dig into when/where something changed using blame here, but I failed to figure it out, and pinning earlier package versions did not help either.

FWIW, I don't come with just a problem: a workaround (thanks to Lyndon as well) is to rename the file in post-processing after download. So at this stage this is not an issue for me anymore, but hopefully posting all these details will help someone here find a fix! 😃
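
For anyone landing here, the rename workaround could look roughly like the sketch below. The helper name rename_fetched is hypothetical (it is not part of DataDeps.jl), and the commented-out registration assumes DataDeps' post_fetch_method keyword, which is called with the local path of the downloaded file.

```julia
# Hypothetical helper (not part of DataDeps.jl): build a callback that renames
# a freshly downloaded file to `target_name`, keeping it in the same directory.
rename_fetched(target_name::AbstractString) =
    path -> mv(path, joinpath(dirname(path), target_name); force = true)

# Assumed usage with the `post_fetch_method` keyword of `DataDep`
# (left commented out so the sketch runs without the package or network access):
#
# using DataDeps
# register(DataDep(
#     "FigShare_dataset",
#     "No description needed",
#     "https://ndownloader.figshare.com/files/6294558",
#     "ba250b1b64b8c43d1130a7aee39674e9ee5936ebb49050c4c224c740bff588b0";
#     post_fetch_method = rename_fetched("rsta20150293_si_001.xlsx"),
# ))
```

With this in place, the file should end up under the expected name regardless of what name the download step picks.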

@fredrikekre
Member

What filename does curl give?

@briochemc
Author

Apologies if this is dumb, but I'm not sure how to answer that! 😅

@oxinabox oxinabox self-assigned this Sep 18, 2021
@oxinabox
Member

Don't worry, I got this.

❱ curl -Li "https://ndownloader.figshare.com/files/6294558"                    
HTTP/1.1 302 Found
Server: nginx
Date: Sat, 18 Sep 2021 10:28:44 GMT
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Content-Length: 0
Connection: keep-alive
...
Location: https://s3-eu-west-1.amazonaws.com/pstorage-rs-4828782598/6294558/rsta20150293_si_001.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-
...
Content-Disposition: attachment;filename=rsta20150293_si_001.xlsx
...

That content-disposition is what we should (and normally do) use to determine file name.

I wonder if we are getting tricked by the 302 redirect?
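
To illustrate the point about Content-Disposition, here is a minimal sketch of the filename logic being described: prefer the filename parameter of the header, and fall back to the last URL segment otherwise. All function names are hypothetical and this is not HTTP.jl's actual implementation; it also ignores the extended filename*= form.

```julia
# Hypothetical helper: extract the `filename` parameter from a
# Content-Disposition header value; returns `nothing` if there is none.
function filename_from_disposition(value::AbstractString)
    m = match(r"filename\s*=\s*\"?([^\";]+)\"?"i, value)
    return m === nothing ? nothing : String(strip(m.captures[1]))
end

# Fallback: last path segment of the URL, ignoring any query string.
filename_from_url(url::AbstractString) =
    String(last(split(first(split(url, '?')), '/')))

# Prefer the header-derived name; otherwise use the URL-derived one.
function guess_filename(url::AbstractString, disposition)
    name = disposition === nothing ? nothing : filename_from_disposition(disposition)
    return name === nothing ? filename_from_url(url) : name
end
```

With the header from the curl output, guess_filename gives rsta20150293_si_001.xlsx; without any header it falls back to 6294558, which matches the behaviour reported above.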

@oxinabox
Member

oxinabox commented Sep 18, 2021

Yeah, looks like it is the redirect confusing things.
HTTP.jl sees code 200 and no Content-Disposition header:

julia> resp = HTTP.headers(HTTP.request("GET", "https://ndownloader.figshare.com/files/6294558"))
12-element Vector{Pair{SubString{String}, SubString{String}}}:
               "x-amz-id-2" => "t5e96lUgqwlnTS65M5hrdcLtnZ/K3vhlDScYBehbxxFL85CqPMfrqsc8nMbXy4KG1FL8nB/3NCw="
         "x-amz-request-id" => "PQM7QC3VMVZPTGFZ"
                     "Date" => "Sat, 18 Sep 2021 10:45:02 GMT"
 "x-amz-replication-status" => "COMPLETED"
            "Last-Modified" => "Fri, 03 Sep 2021 08:47:47 GMT"
                     "ETag" => "\"bf518a09be3cf14d4d7abb47489cbae8\""
      "x-amz-tagging-count" => "1"
         "x-amz-version-id" => "U6trxKUd0lhNhFEHAmnrHQsVVofp9yxk"
            "Accept-Ranges" => "bytes"
             "Content-Type" => "binary/octet-stream"
                   "Server" => "AmazonS3"
           "Content-Length" => "463645"
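
One way to picture a fix is to look for the header on every hop of the redirect chain rather than only on the final response. The sketch below is an assumption about the general idea, not the code from any commit; find_header and disposition_along_chain are hypothetical names, and each hop is modelled simply as a vector of name => value pairs.

```julia
# Case-insensitive lookup of a header in a collection of name => value pairs.
function find_header(headers, name::AbstractString)
    for (k, v) in headers
        lowercase(k) == lowercase(name) && return String(v)
    end
    return nothing
end

# Return the first Content-Disposition value seen along a redirect chain,
# where `hops` holds the headers of each response in the order received.
function disposition_along_chain(hops)
    for headers in hops
        value = find_header(headers, "Content-Disposition")
        value === nothing || return value
    end
    return nothing
end

# Headers abridged from the outputs quoted in this thread.
redirect_headers = ["Content-Length" => "0",
                    "Content-Disposition" => "attachment;filename=rsta20150293_si_001.xlsx"]
final_headers = ["Content-Type" => "binary/octet-stream",
                 "Content-Length" => "463645"]

disposition_along_chain([redirect_headers, final_headers])
```

Looking only at final_headers yields nothing, which is consistent with the download falling back to the URL-derived name.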

fredrikekre added a commit that referenced this issue Sep 26, 2021
in HTTP.download, fixes #760.

Co-authored-by: Lyndon White <lyndon.white@invenialabs.co.uk>
Co-authored-by: Fredrik Ekre <ekrefredrik@gmail.com>
fredrikekre added a commit that referenced this issue Sep 26, 2021
Use Content-Disposition for 3xx requests for filename detection in HTTP.download, fixes #760.

Co-authored-by: Lyndon White <lyndon.white@invenialabs.co.uk>
Co-authored-by: Fredrik Ekre <ekrefredrik@gmail.com>