Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread spends too much time in is_url/is_secureurl/is_file for long in memory input #2531

Closed
javrucebo opened this issue Dec 19, 2017 · 9 comments
Closed

Comments

@javrucebo
Copy link
Contributor

@javrucebo javrucebo commented Dec 19, 2017

Summary

When fread is fed with a character string as input the routine spends considerable amount of time detecting that the supplied input is not a filename or an url.

This is due to grepl not scaling well for large input as used in fread.

Example

is_url <- function(x) grepl("^(http|ftp)s?://", x)

Although the pattern is anchored at the beginning of the string, running grepl for large inputs will take a lot of time for large inputs (more detailed benchmarks further down).
This will lead then to the full call to fread possibly spending a third of the time for those supposedly simple checks. (example also below)

Possible solutions could be one or more of the following

  • changing the call to grepl to use PERL regexp engine perl=TRUE
    Alternatively use another method to determine whether the input starts with url (see benchmarks below)
  • adding an explicit optional argument which denotes string input, e.g. str= to denote that the input is to be considered as the data and skip the tests for url or file. This would be similar in spirit to the file= argument.
    As a side-effect this would also allow input to be read which consits only of url's and having no header, e.g.
fread("http://hkhfsk\nhttp://fhdkf\nhttp://kjfhskd\nhttp://hfkjf", header=FALSE)
  • consider an input, which exceeds a certain number of characters in length, not being an URL

Profiling example

library(data.table)
# create a random character input in 5 columns with 20000*200 lines
randomString <- function() {
  paste(sample(c(LETTERS, letters), sample(1:9,1)), collapse="")
}
input <- paste(rep(paste(replicate(20000,paste(replicate(5, randomString()), 
                                              collapse="\t")),
                         collapse="\n"),200),
               collapse="\n")
# profile the call to fread
Rprof()
invisible(fread(input, header=FALSE))
Rprof(NULL)
summaryRprof()
$by.self
        self.time self.pct total.time total.pct
".Call"      5.34    64.49       5.34     64.49
"grepl"      2.94    35.51       2.94     35.51

Benchmarking grepl and friends

Comparing different functions to verify whether a string starts with any of http(s)/ftp(s) or file shows that grepl scales badly and is by far the slowest of the tested variants.
Adding simply perl=TRUE already improves by around factor 100 for large inputs (code further below)

image

library(data.table)
library(ggplot2)
library(Rcpp)

# define alternative functions to detect URL's
grepl_full <- function(txt) { grepl("^(http|ftp)s?://", txt) }
grepl_substr <- function(txt) { grepl("^(http|ftp)s?://", substr(txt,1,8)) }
grepl_perl <- function(txt) { grepl("^(http|ftp)s?://", txt, perl=TRUE) }
grepl_perl_substr <- function(txt) { grepl("^(http|ftp)s?://", substr(txt,1,8), perl=TRUE) }
grepl_perl_stri_sub <- function(txt) { grepl("^(http|ftp)s?://", stringi::stri_sub(txt,1,8),
                                             perl=TRUE) }
stridetect <- function(txt) { stringi::stri_detect_regex(txt, "^(http|ftp)s?://") }
stristartswith <- function(txt) { 
  stringi::stri_startswith_fixed(txt, c("http://", "https://", "ftp://", "ftps://")) 
}
cppFunction('bool isURL_cpp(const std::string &x) {
               bool ret = FALSE;
               if (x.length() >= 8) {
                 std::string s = x.substr(0,8);
                 if (s.substr(0,7) == "http://" || 
                     s == "https://" ||
                     s.substr(0,6) == "ftp://" ||
                     s.substr(0,7) == "ftps://") {
                   ret = TRUE;
                 }
               }
               return(ret);
             }')
cppFunction('bool isURL_c(const char* x) {
               bool ret = FALSE;
               if (strlen(x) >= 8) {
                 if (!strncmp(x, "http://", 7) || 
                     !strncmp(x, "https://", 8) ||
                     !strncmp(x, "ftp://", 6) ||
                     !strncmp(x, "ftps://", 7)) {
                   ret = TRUE;
                 }
               }
               return(ret);
             }')


# number of string lengths to test with. max len = 1e+nn
nn <- 8
# create random strings of given length
for (i in 1:nn) {
  assign(paste0("str_",i), paste(sample(c(LETTERS, letters), 10^i, replace=TRUE), 
                                 collapse=""))
}
function_list <- c("grepl_full", "grepl_substr", "grepl_perl", "grepl_perl_substr", 
                   "grepl_perl_stri_sub", 
                   "isURL_cpp", "isURL_c", 
                   "stridetect", "stristartswith")

# cross join of functions and strings
dt <- CJ(fun=function_list, ind=1:nn)
# build code to run microbenchmark
mb_code <- paste("microbenchmark::microbenchmark(",
                 paste0("'", dt$fun, " ", 10^dt$ind, "'=", 
                        dt$fun, "(str_", dt$ind, ")", collapse=","),",times=10)")
# eval microbenchmark and calculate median
mb <- as.data.table(eval(parse(text=mb_code)))
mb[,c("fun", "nr"):=tstrsplit(as.character(expr), " ", fixed=TRUE)]
mb[,nr:=as.integer(nr)]
mb <- mb[,.(time=median(time)), by=.(fun,nr)]

# plot
ggplot(mapping=aes(nr, time/1e6)) + 
  geom_line(data=mb[fun %chin% c("grepl_full", "grepl_perl")],
            aes(col=fun), size=1.1) +
  geom_line(data=mb[!fun %chin% c("grepl_full", "grepl_perl")],
            aes(col=fun), size=0.5, linetype="dotted") +
  geom_point(data=mb, aes(col=fun)) +
  scale_y_log10(breaks=c(0.01,0.1,1,10,100,1000)) +
  scale_x_log10(breaks=10^(1:nn)) +
  labs(y="time in milliseconds", x="string length", 
       title="Benchmark string tests for isURL") +
  theme(legend.position = c(0.05,0.95), legend.justification = c(0,1))

sessionInfo

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_IE.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] tools_3.3.1 yaml_2.1.15

Results are similar with R 3.4.3 / data.table 1.10.4 / Windows 10 64bit

@st-pasha st-pasha added this to the v1.10.6 milestone Dec 19, 2017
@MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Dec 19, 2017

Seems like a problem of the regex engine more than anything. anchoring to string beginning should mean that grepl performance is O(1), as far as I'm concerned. Strange that it appears otherwise.

@st-pasha
Copy link
Contributor

@st-pasha st-pasha commented Dec 19, 2017

probably R's regex engine attempts to build a histogram of the target string before applying the regexp.

@MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Dec 19, 2017

@javrucebo
Copy link
Contributor Author

@javrucebo javrucebo commented Dec 19, 2017

Indeed it may be more a R issue, but something one can at least work around in fread

Using perl=TRUE does not seem to add any negative aspect and will be usally faster (in my benchmarks 100 times for large input)

See also help-page: ?grep:

If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine,

The additional penalty (even when using perl=TRUE) does not seem to come from grep: Look at the benchmark when feeding grep with the shortened string though substr (grep_perl_substr) whereas using stringi::stri_sub performs with O(1) (grep_perl_strisub). So there seems to be an overhead when calling the base function with a large string. (Actually we see this even for the Rcpp versions)

@HughParsonage
Copy link
Member

@HughParsonage HughParsonage commented Jan 13, 2018

For later reference, the utility startsWith is easily the fastest (and clearest), but unfortunately depends on R (>= 3.3.0).

is_url <- function(x) grepl("^(http|ftp)s?://", x)
is_url_perl <- function(x) grepl("^(?:http|ftp)s?://", x, perl = TRUE)
is_url2 <- function(x) {
  `||`(startsWith(x, "ht") && startsWith(x, "http://") || startsWith(x, "https://"),
       startsWith(x, "ft") && startsWith(x, "ftp://") || startsWith(x, "ftps://"))
}

microbenchmark::microbenchmark(is_url(input), is_url_perl(input), is_url2(input), times = 50)
Unit: microseconds
#              expr         min          lq        mean       median          uq        max neval cld
# is_url(input)     1001673.642 1038144.688 1076978.567 1070702.5405 1112537.553 1185685.65    50   c
# is_url_perl(input)  23280.392   24793.807   30226.567   27345.3795   30737.237   51998.24    50  b 
# is_url2(input)          1.807       2.711     139.927      12.8005      18.975    6414.17    50 a  

URL <- "https://github.com/Rdatatable/data.table/issues/2531"
microbenchmark::microbenchmark(is_url(URL), is_url_perl(URL), is_url2(URL), times = 50)
# expr                      min     lq     mean median     uq    max neval cld
# is_url(URL)             6.626  7.228  8.40290  7.680  8.132 30.118    50  b 
# is_url_perl(URL)       65.657 66.259 67.40972 66.560 66.862 92.462    50   c
# is_url2(URL)            1.506  1.808  2.25304  2.108  2.409  8.132    50 a 

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Feb 16, 2018

This was a great report -- many thanks!
I've just fixed it and pushed to master. Please test and reopen if this doesn't fix it. I'm pretty sure it's much better now, but you never know. I pushed it straight to master (rather than a PR) so you can more easily test it just by upgrading to latest dev.
I added your test too so it now reads those http:// addresses as data as you suggested.

javrucebo added a commit to javrucebo/data.table that referenced this issue Feb 16, 2018
in `fread.R` in order to speed up execution time when `input` is the
actual data. `file.exists` is very slow, so checking first
for new lines speeds up the process.

See also discussion in Rdatatable#2531
@javrucebo
Copy link
Contributor Author

@javrucebo javrucebo commented Feb 16, 2018

Thank you for looking into this issue.
I am afraid it is not fixing the issue of slow execution time completely.
In the new code, once you go in the else clause of if (!missing(file)) the first thing you are checking (after verifying input is a length 1 character vector) is

if (file.exists(input)) {
  ....

unfortunately file.exists is also very slow for large input:

input <- paste(rep("1,2,3,4.567,some text field", 1e6),collapse="\n")
system.time(file.exists(input))
##   user  system elapsed 
##  0.693   0.000   0.694 
system.time(fread(input))
##   user  system elapsed 
##  0.861   0.000   0.862 

with the file.exists eating up 80% of the total fread execution time.

Fortunately the next check you are doing with grep (detecting newlines) is very fast (opposed to the suprisingly slow grepl checks:

system.time(length(grep('\\n|\\r', input)))
##   user  system elapsed 
##      0       0       0 

The easiest fix seems to be to swap the two conditions. The only case where I see this would make a difference is, if input is a filename and has a newline in the name, but I hope no ones would do this and then expect fread to be able to read the file. (and actually it wouldn't work as the c-routine for fread checks again for newlines and would treat such a file name as the actual input)

You suggested me to re-open the issue in case it's not working, but as reporter, I can not re-open it.

I have opened a PR #2630 with the proposed fix.

The other part of your fix with substring etc is indeed avoiding grepl is slightly faster, but due to the fact you check now for newlines earlier in the code the penalty should not be visible ... unless someone is giving as input a very long string without any newlines ... but that seems also like a pathological case.

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Feb 17, 2018

Thanks for the follow up. Yes I had thought file.exists() would be fast on huge input as the operating system would detect it was invalid in the first few characters. Interesting it isn't. Maybe R passes over it first before calling the OS for some reason. Anyway, will proceed with your PR, thanks.
Didn't know you can't reopen issues; must be just project members that see that option then. I'll see if there's a project setting to open that, and will bear it in mind in future. I've invited you to be project member too.

@javrucebo
Copy link
Contributor Author

@javrucebo javrucebo commented Feb 19, 2018

just for the record to fully close it.
With the changes applied in 8ab3028 and 3989ea2 fread speedup for data input through the input variable is substantial.

For input with > 0.5 MB size / 10000 rows the speedup between before and after fixes is 3 - 5.

Comparing fread spead with direct data input

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants