sourceURL -- does not work if basename(sourceURL) is not a valid filename #327

Closed
eliotmcintire opened this issue Dec 16, 2016 · 14 comments

Comments

@eliotmcintire
Contributor

Inside downloadData, the whole structure assumes that the sourceURL can provide a filename for the local file once downloaded. For a Dropbox file, a Google Drive file, a short URL, and many other cases, this will not work.

Likely this requires a 4th column in the CHECKSUMS.txt file, indicating the sourceURL that goes with the local filename.

The culprit line where it breaks is this:

id <- which(chksums$expectedFile == basename(x))
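
A minimal sketch of both the failure and the proposed extra column. The `chksums` data frame below, its `sourceURL` column name, and the example URL are assumptions made for illustration; they are not the actual CHECKSUMS.txt layout.

```r
# Hypothetical checksums table; sourceURL stands in for the proposed 4th column.
chksums <- data.frame(
  expectedFile = "data.zip",
  sourceURL = "https://www.dropbox.com/s/abc123/data.zip?dl=1",
  stringsAsFactors = FALSE
)

x <- chksums$sourceURL[1]

# Current approach: basename() keeps the query string, so nothing matches.
basename(x)                                  # "data.zip?dl=1"
which(chksums$expectedFile == basename(x))   # integer(0)

# Proposed approach: match on the URL itself, then read off the local filename.
id <- which(chksums$sourceURL == x)
localFile <- chksums$expectedFile[id]
```

With an explicit URL-to-file mapping, the local filename no longer has to be derivable from the URL at all, which is what short URLs and most cloud-sharing links would need.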
        
@achubaty
Contributor

achubaty commented Dec 16, 2016

Do you mean that with e.g., a Dropbox file url there is a ?dl=1 or some such thing? We can likely do something more sophisticated than simply basename(x): strip anything after a ?, then do a basename on the result. Either that, or get more low-level control with curl etc., like we do with downloadModule, and avoid using download.file. The latter approach may be tricky because of different protocols (ftp, http, etc.).
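
The "strip anything after a ?" idea could look roughly like this. `urlBasename` is a hypothetical helper name, not an existing function in the package, and the URL is made up.

```r
# Hypothetical helper: drop the query string (and any fragment), then take the basename.
urlBasename <- function(url) {
  basename(sub("[?#].*$", "", url))
}

urlBasename("https://www.dropbox.com/s/abc123/data.zip?dl=1")  # "data.zip"
```

This only helps when the filename is actually present in the URL path; it would not recover a name from a short URL or from a link that identifies the file only by an id.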

@eliotmcintire
Contributor Author

There are 2 issues here:

  1. The URL may not work (easily?) from the command line, and must be accessed via a browser. In the case of a Dropbox link, the ?dl=1 does appear to work for a direct download. But equivalents for Google Drive, Sync.com and any other arbitrary file sharing cloud service may not work (easily?). So, this is not solved for e.g., Google Drive.

  2. The actual filename in the URL may have punctuation that does not work on the OS where the file is being downloaded. In the cases I have been dealing with, the ? can't be in a Windows filename, but it is regularly in a URL. I have implemented something that strips punctuation, and it seems to work. Not yet pushed to development branch.
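
For reference, one way to handle the punctuation problem in point 2 might be the sketch below. It is not the implementation mentioned above (which is not shown in this thread); `sanitizeFilename` is a hypothetical name, and replacing with an underscore rather than deleting is an arbitrary choice.

```r
# Hypothetical helper: replace characters that Windows does not allow in
# filenames (< > : " / \ | ? *) with an underscore.
sanitizeFilename <- function(x) {
  gsub("[<>:\"/\\\\|?*]", "_", x, perl = TRUE)
}

sanitizeFilename("data.zip?dl=1")  # "data.zip_dl=1"
```

Whatever rule is used, the same transformation would have to be applied when writing the expected filename into CHECKSUMS.txt, otherwise the lookup would still fail.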

@eliotmcintire
Contributor Author

A workaround for cloud services for which a direct link is not possible could be:

browseURL(sim@depends@dependencies[[1]]@inputObjects$sourceURL) 
readline("Please download file in browser. When download is finished press Enter")

@achubaty
Contributor

I think this is easily doable with httr::GET like I did for downloadModule. Do you have some sample files / test cases I can try?

@achubaty
Contributor

achubaty commented Dec 19, 2016

E.g., using the actual file url (not the browser-only one you get using the "shareable link" in Google Drive), the file name is contained in the headers. Get the actual file url by downloading the file in a web browser and right-clicking it in the download manager to copy the url.

library(httr)
library(magrittr)

url <- "actual_url_goes_here"
ua <- user_agent(getOption("spades.useragent"))
request <- GET(url, ua, write_disk(tempfile()))

request$all_headers[[1]]$headers$`content-disposition` %>%
  strsplit(";") %>%
  `[[`(1) %>%
  grep("filename=", ., value = TRUE)
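
The pipeline above returns the whole `filename=...` piece, quotes included. A possible follow-up step that reduces it to a bare file name is sketched here; `filenameFromHeader` is a hypothetical helper, and the header string is a made-up example rather than real output from any of these services.

```r
# Hypothetical helper: extract the bare file name from a Content-Disposition value.
filenameFromHeader <- function(cd) {
  # Split the header into its ";"-separated parts and trim whitespace.
  parts <- trimws(strsplit(cd, ";", fixed = TRUE)[[1]])
  fn <- grep("^filename=", parts, value = TRUE)
  if (length(fn) == 0) return(NA_character_)
  # Drop the "filename=" prefix and any double quotes.
  gsub("^filename=|\"", "", fn[1])
}

filenameFromHeader("attachment; filename=\"data.zip\"")  # "data.zip"
```

This is deliberately simple: it ignores the encoded `filename*=` form and would mis-split a file name that itself contains a semicolon.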

@eliotmcintire
Contributor Author

That approach seems to work for sync.com!

I was not able to get it to work for Google Drive. I think there are many file types that Google Drive intercepts. I tried with a .txt file and it opened a Google Docs editor instead of downloading the file. Following through with the download link there, I was able to download, copy the URL from the download manager, but that link did not have a file.

Also, with large files, Google Drive gives you an error about not being able to scan such a large file.

Here is a large file to try on Google Drive:
Shareable link:
https://docs.google.com/uc?export=download&confirm=kvrB&id=0BxZrk9psrK4nR3ZjNTVtT1ZGcFE

File link:
https://doc-14-30-docs.googleusercontent.com/docs/securesc/agi0va3fq5queskqdj6rdj8tv13f9pea/ke27u7gcqutsk1hi9be644im8sefbk0f/1482177600000/12776779115079064749/12776779115079064749/0BxZrk9psrK4nR3ZjNTVtT1ZGcFE?e=download

@achubaty
Contributor

Hmm I built my example using files on Google Drive (but they weren't large files). Perhaps large files are handled differently...

Note that your file link doesn't work for me in the browser, but the shareable link does.

@eliotmcintire
Contributor Author

That may be a symptom of part of the problem... the file link I gave works for me in my browser. So, there may be something more than just that file name that is required. I can access that file in the browser, but I can't access it using the code above (it does work, however, for sync.com, indicating that a) it does work, b) I used it correctly, but c) it is not universal enough)

@achubaty
Contributor

The direct URL changes using Google Drive -- I just tried from scratch on your file, and got a new url, which does work for me.

@eliotmcintire
Contributor Author

Which means what?

@achubaty
Contributor

Which means that you're correct about the lack of generalizability. Clearly these platforms aren't intended for use in this way.

@achubaty
Contributor

Yes, that was how I was checking things. But I still get different links each time (your links above no longer work). My guess is that the direct file link is always different / time sensitive, whereas the google link is set up to redirect to a new file url each time.

@achubaty
Contributor

This issue was moved to PredictiveEcology/SpaDES.core#30
