[bug] fread does not work with https url behind a proxy #1686

Closed
cderv opened this issue Apr 29, 2016 · 3 comments · Fixed by #5749

@cderv

cderv commented Apr 29, 2016

Hi,

I recently encountered a problem reading a file from an https URL with fread. I got the following error:

#> Error in curl::curl_download(input, tt, mode = "wb", quiet = !showProgress): Timeout was reached

The problem is that I am behind a proxy, and curl_download does not seem to pick up the current Internet Explorer proxy settings on Windows. I also cannot find a way to configure curl's proxy settings before calling fread.

However, with the download.file function, downloading the file from the https URL works perfectly without configuring anything.
Looking at fread, curl::curl_download is used for secure URLs, whereas download.file is used for non-secure ones:

if (!is_secureurl(input)) {
  download.file(input, tt, mode = "wb", quiet = !showProgress)
} else {
  if (!requireNamespace("curl", quietly = TRUE))
    stop("Input URL requires https:// connection for which fread() requires 'curl' package, but cannot be found. Please install the package using 'install.packages()'.")
  curl::curl_download(input, tt, mode = "wb", quiet = !showProgress)
}

Now that R handles https URLs with download.file, could it be that the suggested curl package is no longer needed? Dropping it would solve this problem.


For those who want to try to reproduce this behind a proxy:

url <- "https://d37djvu3ytnwxt.cloudfront.net/asset-v1:MITx+15.071x_3+1T2016+type@asset+block/songs.csv"
DT <- data.table::fread(url, verbose = TRUE)
#> Error in curl::curl_download(input, tt, mode = "wb", quiet = !showProgress): Timeout was reached

and my sessionInfo()

#> R version 3.2.4 Revised (2016-03-16 r70336)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#> 
#> locale:
#> [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
#> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=French_France.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.9.6
#> 
#> loaded via a namespace (and not attached):
#>  [1] clipr_0.2.0       magrittr_1.5      formatR_1.3      
#>  [4] htmltools_0.3.5   tools_3.2.4       curl_0.9.7       
#>  [7] Rcpp_0.12.4       stringi_1.0-1     rmarkdown_0.9.5  
#> [10] knitr_1.12.3      stringr_1.0.0     digest_0.6.9     
#> [13] reprex_0.0.0.9001 chron_2.3-47      evaluate_0.8.3
@jangorecki
Member

According to curl#1, starting from R 3.2.2 it guesses the proxy on Windows, so curl probably isn't required for https any more. But for older versions of R we would still need curl.
Not sure, but maybe adding a download method argument to fread would handle https better: download.file can use method = "libcurl" (or "curl"), in which case curl::curl_download should not even be needed.
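
A minimal sketch of that idea, written as a user-side wrapper rather than a change to fread itself. The name fread_url and its arguments are hypothetical and not part of data.table; the sketch assumes that download.file with the chosen method can reach the URL through the proxy.

```r
# Hypothetical wrapper (not data.table's API): download with base R's
# download.file() using a caller-chosen method, then read the local copy.
fread_url <- function(url, method = "auto", ..., quiet = TRUE) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)  # remove the temporary file afterwards
  status <- utils::download.file(url, tmp, method = method,
                                 mode = "wb", quiet = quiet)
  if (status != 0L) stop("download failed with status ", status)
  data.table::fread(tmp, ...)       # extra arguments are passed to fread()
}
```

It could then be called as fread_url(url, method = "libcurl"), which leaves proxy handling to download.file instead of the curl package.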

@MichaelChirico
Member

@jangorecki I thought of adding the argument for #1668, but wasn't sure how often it would come up... more evidence in favor here

@cderv
Author

cderv commented Apr 26, 2017

@jangorecki, for download.file with method = "libcurl", it is explained that we can use the https_proxy environment variable in the form [user:password@]machine[:port], and it works.

Since https now works by default with download.file, it could be a good idea to expose download.file arguments in fread and not depend on curl. For example, the remotes package has been changed to rely on this behaviour and to work with download.file methods and their proxy configuration.

For this issue specifically, setting the environment variable correctly seems to work:
after Sys.setenv(HTTPS_PROXY = "http://[user:password@]proxy:port"), I can now use fread.

I discussed this solution in the curl repo.
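
If changing the variable for the whole session is not wanted, the workaround can be scoped to a single call. This is only a sketch: with_https_proxy is a hypothetical helper that exists in neither data.table nor curl, and the proxy address in the usage line is a placeholder.

```r
# Hypothetical helper: set HTTPS_PROXY, evaluate `expr`, then restore
# whatever value (or absence of a value) was there before.
with_https_proxy <- function(proxy, expr) {
  old <- Sys.getenv("HTTPS_PROXY", unset = NA)
  on.exit(
    if (is.na(old)) Sys.unsetenv("HTTPS_PROXY") else Sys.setenv(HTTPS_PROXY = old),
    add = TRUE
  )
  Sys.setenv(HTTPS_PROXY = proxy)
  force(expr)  # `expr` is evaluated lazily, so it runs with the proxy set
}

# Usage (placeholder proxy address, replace with your own):
# DT <- with_https_proxy("http://user:password@proxy:8080",
#                        data.table::fread(url))
```

Because R evaluates function arguments lazily, the fread call only runs once the variable has been set, and on.exit restores the previous state even if the download fails.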

Example
Sys.getenv("HTTPS_PROXY")
#> [1] "http://myuser:mypassword@proxyIp:proxyPort"
url <- "https://d37djvu3ytnwxt.cloudfront.net/asset-v1:MITx+15.071x_3+1T2016+type@asset+block/songs.csv"
DT <- data.table::fread(url, verbose = TRUE)
#> Input contains no \n. Taking this to be a filename to open
#> File opened, filesize is 0.002159 GB.
#> Memory mapping ... ok
#> Detected eol as \r only (no \n or \r afterwards). An old Mac 9 standard, discontinued in 2002 according to Wikipedia.
#> Positioned on line 1 after skip or autostart
#> This line is the autostart and not blank so searching up for the last non-blank ... line 1
#> Detecting sep ... ','
#> Detected 39 columns. Longest stretch was from line 1 to line 30
#> Starting data input on line 1 (either column names or first row of data). First 10 characters: year,songt
#> All the fields on line 1 are character fields. Treating as the column names.
#> Count of eol: 7574 (including 0 at the end)
#> Count of sep: 287812
#> nrow = MIN( nsep [287812] / (ncol [39] -1), neol [7574] - endblanks [0] ) = 7574
#> Type codes (point  0): 144441333313333333333333333333333333331
#> Type codes (point  1): 144441333313333333333333333333333333331
#> Type codes (point  2): 144441333313333333333333333333333333331
#> Type codes (point  3): 144441333313333333333333333333333333331
#> Type codes (point  4): 144441333313333333333333333333333333331
#> Type codes (point  5): 144441333313333333333333333333333333331
#> Type codes (point  6): 144441333313333333333333333333333333331
#> Type codes (point  7): 144441333313333333333333333333333333331
#> Type codes (point  8): 144441333313333333333333333333333333331
#> Type codes (point  9): 144441333313333333333333333333333333331
#> Type codes (point 10): 144441333313333333333333333333333333331
#> Type codes: 144441333313333333333333333333333333331 (after applying colClasses and integer64)
#> Type codes: 144441333313333333333333333333333333331 (after applying drop or select (if supplied)
#> Allocating 39 column slots (39 - 0 dropped)
#> Read 7574 rows. Exactly what was estimated and allocated up front
#>    0.010s (  3%) Memory map (rerun may be quicker)
#>    0.000s (  0%) sep and header detection
#>    0.005s (  1%) Count rows (wc -l)
#>    0.045s ( 11%) Column type detection (100 rows at 10 points)
#>    0.000s (  0%) Allocation of 7574x39 result (xMB) in RAM
#>    0.337s ( 85%) Reading data
#>    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
#>    0.000s (  0%) Coercing data already read in type bumps (if any)
#>    0.000s (  0%) Changing na.strings to NA
#>    0.397s        Total
