Feature request: copy/paste-able download links with good filenames #7020

Closed
ctb opened this Issue Mar 24, 2017 · 12 comments

4 participants
Contributor

ctb commented Mar 24, 2017

Yesterday I posted a talk to OSF,

https://osf.io/mhwa5/

and I had a surprising amount of trouble getting a direct download link that I could curl.

  • The download button doesn't seem to have a link associated, so I couldn't copy-the-link for posting;
  • when I did find a direct download link (under Revisions, the download buttons there are direct links), it didn't have a sensible URL name so the curl -O saved a badly named file that had special characters in it.

To be clear, I can click on various places and (presumably through JavaScript magic) get the file to download. But I would like to be able to do this from e.g. the command line or Python's urllib.request.urlopen as well, if possible, because that would let me use data files stored on the OSF in scripts and notebooks for teaching and research.

A typical solution to the filename problem is to have a "fake URL" that has all the right IDs to download the file encoded in it, and then ends with the filename that it should be saved under (which doesn't necessarily matter for retrieval, but is used by curl and wget and human readers).
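
To illustrate why the end of the URL matters, here is a minimal Python sketch. It only approximates how tools such as curl -O pick a local file name (by taking the last path segment); real tools differ in details, for example in whether the query string ends up in the name. The function name filename_from_url and the download/SRR3152806.bw URL shape are hypothetical and are used only to illustrate the proposal.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring the query string.

    This is an approximation of how command-line downloaders choose a
    local file name; it is not a re-implementation of any specific tool.
    """
    path = urlsplit(url).path
    # Drop a trailing slash so ".../mhwa5/" yields "mhwa5", not "".
    return unquote(posixpath.basename(path.rstrip("/")))


# Current OSF-style link: the saved name carries no hint of the real file.
print(filename_from_url("https://osf.io/mhwa5/download"))

# Hypothetical "fake URL" that ends with the desired file name.
print(filename_from_url("https://osf.io/87trm/download/SRR3152806.bw"))
```

With a trailing file name, the same logic that produces an unhelpful name today would produce the one the uploader intended.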

Happy to discuss!

Contributor

saradbowman commented Mar 28, 2017

Hey @ctb, thanks for the feedback! This is certainly worth exploring/supporting, and I'll pass it along to the Product team.

In the interim, the download link for the current version of a file is:
osf.io/GUID/?action=download, where GUID is the 5-character ID for the file (in your case, mhwa5).
For a previous version, append "&version=#", where # is the version you want to download (so, for version 1 of your file, osf.io/mhwa5/?action=download&version=1).

You can also request https://api.osf.io/v2/guids/mhwa5/ and get the download link from the response if you want a programmatic way.
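
For use in scripts, the URL pattern above could be wrapped in a small helper. This is a sketch based only on the pattern described in this comment; the function name osf_download_url is made up for illustration, and the https:// scheme is an assumption (the comment gives the links without a scheme).

```python
from urllib.parse import urlencode


def osf_download_url(guid, version=None):
    """Build a download link following the pattern described above.

    guid    -- the 5-character file ID, e.g. "mhwa5"
    version -- optional version number; omit for the current version
    """
    params = {"action": "download"}
    if version is not None:
        params["version"] = str(version)
    # Assumption: links are served over https.
    return "https://osf.io/{}/?{}".format(guid, urlencode(params))


print(osf_download_url("mhwa5"))
print(osf_download_url("mhwa5", version=1))
```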

Contributor

ctb commented May 6, 2017

Contributor

ctb commented Jul 6, 2017

OK! Found another use case that illustrates the problem quite well -- and this one I can't fix because it's hosted on another site :)

If I go to the UCSC genome browser here, https://genome.ucsc.edu/cgi-bin/hgCustom, and attempt to use an uploaded file as a custom track, it doesn't work. Specifically, choose mouse (mm10) and put the URL 'https://osf.io/87trm/download' in the top box ("paste URLs or data"). Then on submit I get the error

Error It appears that you are directly uploading binary data in an unrecognized format (https://files.osf.io/v1/resources/pyvfg/providers/osfstorage/595e8a0f9ad5a1022c00ceb5?action=download&version=1&direct).

In contrast, if I use the same file stored on github, https://raw.githubusercontent.com/ngs-docs/angus/2017/_static/SRR3152806.bw, it works fine.

I think what is going on is that UCSC is looking for a '.bw' on the end of the URL. Silly, I know, but there you are.

This is a really big use case (literally thousands of people want to do this with files too big for GitHub), and I can't figure out how to do it via OSF. It might be enough to update the URL so that you can include anything you want after download/, e.g. download/SRR.bw...

Member

icereval commented Jul 7, 2017

curl -O -J -L https://osf.io/mhwa5/download

You'll need a newer version of curl which supports the Content-Disposition header.

see: https://stackoverflow.com/a/8841522
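
For scripts and notebooks, roughly the same behavior as curl -O -J -L can be sketched in Python with only the standard library. This is a minimal sketch, not a drop-in replacement for curl: the function names and the fallback file name are hypothetical, and it assumes the server sends a Content-Disposition header with a filename parameter.

```python
import os
import shutil
import urllib.request
from email.message import Message


def filename_from_content_disposition(header_value):
    """Extract the filename parameter from a Content-Disposition value.

    Returns None when the header is absent or carries no filename.
    """
    if not header_value:
        return None
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


def download(url, fallback="download.bin"):
    """Fetch url (redirects are followed by urlopen) and save it under
    the server-provided name, or under fallback if none is given."""
    with urllib.request.urlopen(url) as resp:
        name = filename_from_content_disposition(
            resp.headers.get("Content-Disposition")
        ) or fallback
        # Keep only the final component so a header cannot pick a directory.
        name = os.path.basename(name)
        with open(name, "wb") as out:
            shutil.copyfileobj(resp, out)
    return name


print(filename_from_content_disposition('attachment; filename="SRR3152806.bw"'))
```

Calling download("https://osf.io/mhwa5/download") would then save the file under whatever name the server reports, without depending on the installed curl version.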

Contributor

ctb commented Jul 7, 2017

Ahh, interestingly, this was proposed in e-mail but didn't make its way on to this issue - thanks!

"Upgrade the software to the latest" is only a partial solution because a lot of the learners I teach are stuck on older OSes/installs.

Also, it does not address the problem in this comment where (for better or for worse) the UCSC folk have not implemented the grokking of Content-Disposition.

icereval assigned sloria and unassigned sloria Jul 7, 2017

Member

icereval commented Jul 7, 2017

Reviewing the genome application source and website error messages, I believe these might be the two issues causing problems.

** works **

track type=bigWig name="Example One" description="A bigWig file222" bigDataUrl=https://files.osf.io/v1/resources/pyvfg/providers/osfstorage/595e8a0f9ad5a1022c00ceb5?action=download&version=1
  • A HEAD request to a valid /<guid>/download URL returns a 200 rather than a 302 redirect to WaterButler, which is where file metadata such as Last-Modified comes from.
    PR/Fix: #7433

** works **

track type=bigWig name="Example One" description="A bigWig file222" bigDataUrl=https://osf.io/87trm/download

We'll have the OSF side released soon and I'll follow up.
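
To check how a /<guid>/download URL answers a HEAD request, one option is to send the request without following redirects and look at the status and headers. The sketch below uses Python's http.client, which does not follow redirects on its own; the function name head_no_redirect is hypothetical, and the values it prints for any real URL will depend on the server at the time.

```python
import http.client
from urllib.parse import urlsplit


def head_no_redirect(url, timeout=10):
    """Send a single HEAD request and return
    (status, Location header, Last-Modified header) without following
    any redirect."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return (
            resp.status,
            resp.getheader("Location"),
            resp.getheader("Last-Modified"),
        )
    finally:
        conn.close()


if __name__ == "__main__":
    # A 302 with a Location header means the client is being sent on
    # to another server for the file and its metadata.
    print(head_no_redirect("https://osf.io/87trm/download"))
```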

Member

icereval commented Jul 7, 2017

Fixes have been applied and the following track configuration now works when providing a name parameter.

track type=bigWig name="Example One" bigDataUrl=https://osf.io/87trm/download
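
When many files need a track line like the one above, it could be generated rather than typed by hand. This is only a sketch that reproduces the format of the lines shown in this thread; the helper name bigwig_track_line is hypothetical, and it does not cover other track options.

```python
def bigwig_track_line(name, big_data_url, description=None):
    """Format a bigWig custom track line in the style shown above."""
    for value in (name, description):
        if value is not None and '"' in value:
            # The values are wrapped in double quotes, so reject
            # embedded quotes instead of guessing at an escaping rule.
            raise ValueError("double quotes are not supported: %r" % value)
    fields = ["track", "type=bigWig", 'name="{}"'.format(name)]
    if description is not None:
        fields.append('description="{}"'.format(description))
    fields.append("bigDataUrl={}".format(big_data_url))
    return " ".join(fields)


print(bigwig_track_line("Example One", "https://osf.io/87trm/download"))
```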

Reviewing the genome application code base a bit further, capturing Content-Disposition appears to be possible at:

net.c -> netSkipHttpHeaderLinesWithRedirect -> netSkipHttpHeaderLinesHandlingRedirect

and it can be passed up the chain at:

customPp.c -> customPpNext -> netLineFileOpen

Capturing the file name would allow for the desired user experience of simply pasting in the URL.

Contributor

ctb commented Jul 7, 2017

Contributor

ctb commented Jul 7, 2017

I can confirm the URL now works for me - thank you VERY much!

Member

icereval commented Jul 7, 2017

It is also worth mentioning that adding the file name into the URL on the OSF side is non-trivial, and in some cases not possible, due to the OSF Storage backend's data de-duplication efforts.

Once authorization is handled between WaterButler <-> OSF (assuming the direct parameter is not included, which will soon be the case for most requests), WaterButler will opt to sign a time-sensitive URL and pass it back to the client, giving them the opportunity to download the file directly from the third-party object store (S3, CloudFiles, etc.).

https://storage101.iad3.clouddrive.com/v1/MossoCloudFS_.../osf_storage/3044f194333361594118e20531e9f48ee05685930b50f66d7cfea75f23ae2aba?temp_url_expires=...&temp_url_sig=...&filename=SRR3152806.bw

As you can see, the file name is a SHA-256 value (for de-duplication), and in this case the storage provider uses a query parameter for the filename.
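
The general idea can be sketched in a few lines of Python. This is an illustration of content-addressed naming in general, not the actual OSF Storage or WaterButler code: the function names and the objectstore.example.invalid URL are made up, and only the filename query parameter mirrors the URL shown above.

```python
import hashlib
from urllib.parse import parse_qs, urlsplit


def storage_key(content):
    """Name an object by the SHA-256 of its bytes, so identical uploads
    map to the same key regardless of what the user called the file."""
    return hashlib.sha256(content).hexdigest()


def display_name_from_signed_url(url):
    """Read the human-facing name from a filename query parameter."""
    values = parse_qs(urlsplit(url).query).get("filename")
    return values[0] if values else None


# Two uploads with the same bytes share one key; the user-facing names
# have to travel separately (here, as a query parameter).
key = storage_key(b"same bytes")
assert key == storage_key(b"same bytes")

# Hypothetical signed URL, for illustration only.
url = (
    "https://objectstore.example.invalid/osf_storage/" + key
    + "?temp_url_expires=0&temp_url_sig=abc&filename=SRR3152806.bw"
)
print(display_name_from_signed_url(url))
```

Because the path component is derived from the content rather than the upload, it has no single "right" file name to carry, which is why the name ends up in a parameter or header.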

Hope this helps, and always happy to discuss further.

Contributor

ctb commented Jul 8, 2017

thanks for the explanation @icereval - really appreciate it.

Contributor

ctb commented Jul 8, 2018

I think this issue can be closed as fixed! thanks!

ctb closed this Jul 8, 2018
